title: next generation technology for epidemic prevention and control: data-driven contact tracking
journal: ieee access

contact tracking is one of the key technologies in the prevention and control of infectious diseases. in the face of a sudden infectious disease outbreak, contact tracking systems can help medical professionals quickly locate and isolate infected persons and high-risk individuals, preventing further spread and a large-scale outbreak of infectious disease. furthermore, the transmission networks of infectious diseases established using contact tracking technology can aid in the visualization of actual virus transmission paths, which enables simulation and prediction of the transmission process, assessment of the outbreak trend, and further development and deployment of more effective prevention and control strategies. exploring effective contact tracking methods is therefore significant, and governments, academia, and industry have all given extensive attention to this goal. in this paper, we review the developments and challenges of current contact tracing technologies regarding individual and group contact from both static and dynamic perspectives, including static individual contact tracing, dynamic individual contact tracing, static group contact tracing, and dynamic group contact tracing. with the purpose of providing useful reference and inspiration for researchers and practitioners in related fields, directions in multi-view contact tracing, multi-scale contact tracing, and ai-based contact tracing are provided for next-generation technologies for epidemic prevention and control.

outbreaks of infectious diseases can cause enormous losses of human life. the spanish influenza pandemic led to tens of millions of deaths [ ]. more than three billion people worldwide are at risk of malaria, and a patient dies of malaria infection every few tens of seconds [ ]. the death rate of tuberculosis has exceeded that of aids, making it the deadliest infectious disease in the world; a large proportion of people in south africa harbor latent tuberculosis, and there were hundreds of thousands of tuberculosis cases there in a single recent year [ ]. along with the serious threat to human lives, infectious diseases also bring huge economic losses. statistically, malaria causes economic losses of billions of us dollars every year in african countries [ ], and seasonal influenza in the us causes an annual economic burden of billions of us dollars [ ]. developments in vaccines and drugs have enabled us to combat infectious diseases and have greatly reduced the harm they bring to human society. however, suddenly emerging infectious diseases, caused by drug resistance and the inherent variability of viruses, remain a serious global problem that often leaves us unprepared and vulnerable. for example, influenza viruses have repeatedly mutated into new subtypes over the years. the h n virus began to spread through contact networks in hong kong in january, and many lives were taken in just one month [ ]; the death toll rose further after three months. thus, in the fight against various kinds of infectious diseases, relying solely on vaccine development is far from enough. a more effective "active prevention and control" approach is desperately needed, so as to rapidly detect and block the transmission paths of new infectious diseases, containing the disease to a minimal spread until its eradication.

figure (caption): the spatial distribution of h n cumulative cases in the early stage of the outbreak in mainland china [ ]. in this case, infected cases are recorded with location and time information, while the contact information dominating transmission remains unknown.

many infectious diseases are transmitted through person-to-person "contact". in computational epidemiology, a contact is simply defined either as a direct physical interaction (e.g., sexual contact) or a proximity interaction (e.g., two people being within a few metres of each other, or being in the same room) [ ]. human contact interactions constitute a "contact network" of virus transmission. in this network, nodes represent individuals, and links represent contact relationships. the structure of the contact network significantly affects the spatiotemporal patterns of virus spread. for example, in the case of respiratory infections that spread through droplets, interactions like face-to-face communication, shaking hands, crowd gathering, and sharing vehicles enable the spread of diseases and increase the probability of transmission from infected to susceptible persons. tracking the contact interactions of individuals can effectively restore the "invisible" virus transmission paths, quickly locate and isolate high-risk individuals who were in contact with infected persons, and aid in quantitative analysis of the transmission paths, processes, and trends of infectious diseases, all leading to the development of corresponding effective epidemic control strategies.

the biggest obstacle in contact tracking is obtaining data that directly describe contact behaviors. because contact interactions between individuals are diverse and often subtle, they are difficult to observe and record directly. in other words, it is hard to obtain first-hand, high-quality data for contact tracking. when a disease is spreading, only the impact of the disease can be observed, not the underlying direct interactions between individuals. for example, during the outbreak of h n bird flu, it was difficult to identify who was infected through contact with particular infected people; as shown in the figure above, only new h n cases and the number of deaths at different times and places can be observed. many epidemiology scholars and computer scientists have conducted research on how to accurately capture individuals' contact behavior data as well as how to indirectly infer the contact network from other data sources. many methods have been proposed, most of which utilize intelligent data analytics technologies, such as intelligent sensing, network modeling and analysis, data visualization, multi-source heterogeneous data mining, data-driven reverse engineering, machine learning, and multi-agent simulation, among others. based on the granularity of contact modeling, the existing methods can be classified into four categories: static individual contact tracking, dynamic individual contact tracking, static group contact tracking, and dynamic group contact tracking. each of these categories is described and discussed separately in the following sections.

individual contact tracking records fine-grained "individual-to-individual" contact information, such as contact time, location, frequency, and duration. the most common ways to gather contact information are non-automatic methods, e.g., offline and online questionnaires [ ], [ ], [ ], and automatic methods, e.g., mobile phones, wireless sensors, rfid, and gps [ ], [ ], [ ], [ ].
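to make the contact-network representation described above concrete, the short sketch below builds a toy network and looks up the contacts of an infected person whose exposure exceeds a duration threshold. it is purely illustrative: the names, places, durations, and the 15-minute threshold are invented, and networkx is used only as a convenient graph library, not because any of the cited studies used it.

```python
# a minimal sketch of a contact network: nodes are individuals, edges are
# recorded contact events annotated with duration (illustrative data only).
import networkx as nx

contacts = [
    ("anna",  "ben",   {"place": "office", "minutes": 45}),
    ("ben",   "chris", {"place": "bus",    "minutes": 10}),
    ("anna",  "dana",  {"place": "home",   "minutes": 300}),
    ("chris", "dana",  {"place": "gym",    "minutes": 60}),
]

g = nx.Graph()
g.add_edges_from(contacts)

def high_risk_contacts(graph, infected, min_minutes=15):
    """return contacts of an infected person whose recorded exposure
    exceeds a (hypothetical) duration threshold."""
    risky = []
    for neighbour in graph.neighbors(infected):
        if graph[infected][neighbour]["minutes"] >= min_minutes:
            risky.append(neighbour)
    return risky

print(high_risk_contacts(g, "ben"))   # ['anna'] with the data above
```

in a real tracing system the same neighbourhood lookup would be applied to sensed or reported contact events rather than hand-written tuples, but the graph operations stay the same.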
offline questionnaire has been used for many years in some counties to trace sexual contacts of sexually transmissible infections (stis), particularly for hiv [ ] . in recent years, hiv has killed more than million people. currently, there are still million newly infected hiv individuals and half of them will be died every year [ ] , [ ] . to reveal the spread patterns of hiv infection and aids in the u.s., fay et al. analyzed the patterns of same-gender sexual contact among men using the data developed from a national sample survey in and , respectively. they found that at least . percent of adult men have had sexual contact with another man in life, and never-married men are more likely to have same-gender sexual contacts [ ] . similarly, merino et al. [ ] sampled homosexual men from colombia as volunteers to answer questionnaires on sexual practices. analysis of these questionnaires suggests two significant risk factors for hiv- infection: ) having sexual contact with foreign visitors; ) having more than ten homosexual partners. they suggest that the spread of hiv- infections should be monitored at the international level and more attention should be paid to these subgroups with high transmission rates. in general, most of us tacitly approve that unsafe sexual behavior would be more prevalent among individuals with optimal viral suppression. in the swiss cohort study on april , , wolf et al. [ ] investigated the unsafe behavior among hiv-infected individuals by selfreported questionnaire. however, after adjustment for covariate, it reported that unsafe sex is associated with other factors, e.g., gender, age, ethnicity, status of partner, having occasional partners, living alone, etc. in recent years, researchers designed questionnaires to measure the validity and reliability of sexual abstinence and avoidance of highrisk situations. for example, najarkolaei et al. [ ] sampled female undergraduate students from tehran university, iran, and assessed the validity and reliability of the designed sex, behavioral abstinence, and avoidance of high-risk situation. zhang et al. [ ] surveyed hiv-positive persons on their socio-demographic characteristics and sexual behavior, and traced hiv infection status of persons who had heterosexual contact with the hiv carriers. among these persons, were hiv-positive, i.e., the secondary attack rate was . %. therefore, they appeal to improve the knowledge about hiv/aids, enhance psychological education, and promote the use of condom, so as to suppress the transmission of hiv. in addition to hiv, offline questionnaire has also been used to trace the contact between individuals to investigate other infectious diseases such as chlamydia trachomatis infection, zika, and flu. to seek the view of patients with chlamydia trachomatis infection on legislation impinging on their sexual behavior, an investigation was performed on patients at std clinics in stockholm, sweden in . during the past months, men ( %) were more likely to have sexual intercourse with occasional partners than women ( %), and the mean number of men and women was . and . , respectively [ ] . the zika virus is primarily transmitted by aedes species mosquitoes. however, by reviewing the travel experience and sexual intercourse of infected individuals in the us, researchers confirmed that there were cases of zika virus infection were transmitted by sexual contact [ ] . 
for instance, a person in texas was getting infected with the zika virus after sexual contact with someone who had acquired the infection while travelling abroad [ ] . in , molinari et al. [ ] investigated the contact interactions of , students in a high school through questionnaires. a local campus contact network was established based on information such as the length of contact time and contact frequency. the outbreak process of flu was then simulated based on this established contact network. they found that the classroom is the location with the most campus contact and that class break and lunch break are the times with the most campus contact. offline questionnaire is an efficient way to trace private contact interactions such as sexual practice between individuals. however, it needs to find target participants one by one within a specific region, which is time consuming and needs more physical labor. moreover, data collected by this method is usually time delayed and incomplete. with the purpose of collecting more timely and low-priced data of various kinds of contact behaviors, online questionnaires such as online survey and web-based survey have been extensively applied. in , a national online survey was constructed in adolescent males, using computer-assisted self-interviewing (audio-casi) technology. comparing with traditional selfadministered questionnaire, the prevalence of male-male sex with intravenous drug users estimated by audio-casi was higher by more than % [ ] . influenza-like illness (ili) outbreaks on a large scale every year in many countries, recording and detecting ili are important public health problems. flutracking, a weekly web-based survey of ili in australia, has been used to record the past and current influenza immunization status of participants in winter influenza seasons for many years [ ] . it only takes the participants less than seconds to complete the survey, including documenting symptoms, influenza vaccination status, and mobility behaviors such as time off work or normal duties. in , the peak week detected by flutracking was august, which was contemporaneous with that in other influenza surveillance systems [ ] . for the first three years being applied, the participants increased from to and , in , , and , respectively, due to its convenience in completing the survey and its accuracy in detecting the peak week of influenza activity. flutracking also provides vaccine effectiveness analysis by investigating the status of vaccinated and unvaccinated participants. in , the ili weekly incidence peaked in mid-july in the unvaccinated group, month earlier than vaccinated group confirmed by national influenza laboratory [ ] . in recent years, by cooperating with the health department, organizational email systems, and social media, flutracing gained over new participants each year by sending invitations from existing participants. as a result, the number of online participants in flutracing has exceeded , in [ ] . contact information collected through an online questionnaire is more timely and low-priced than offline questionnaire. however, it still cannot record real-time contact information, and, moreover, contact information collected online sometimes inaccurate or even false [ ] . because people on the internet are usually anonymous, which is incapable to verify the information of their real name, age, place of residence, etc. 
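as a simple illustration of how a web-based system such as flutracking can estimate the peak week of influenza-like illness activity from weekly symptom reports, the sketch below aggregates synthetic survey responses with pandas. the column names, week numbers, and symptom coding are assumptions made for illustration and do not reflect the actual flutracking schema.

```python
# sketch: estimating the peak week of influenza-like illness (ili) activity
# from weekly online survey responses (synthetic data, assumed column names).
import pandas as pd

responses = pd.DataFrame({
    "week":        [26, 26, 27, 27, 27, 28, 28, 29],
    "fever_cough": [0,   1,  1,  1,  0,  1,  1,  0],   # self-reported ili symptoms
})

weekly = responses.groupby("week").agg(
    participants=("fever_cough", "size"),
    ili_cases=("fever_cough", "sum"),
)
weekly["ili_rate"] = weekly["ili_cases"] / weekly["participants"]

peak_week = weekly["ili_rate"].idxmax()
print(weekly)
print("estimated peak week:", peak_week)
```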
individual contact information obtained through offline or online questionnaires is usually time delayed, incomplete, and inaccurate. with the aim to collect dynamic, complete, and accurate individual contact information, some researchers began to use mobile phone, wireless sensors, rfid, and gps devices to track individual contact behaviors. in recent years, the application of mobile phones has become increasingly universal, providing a convenient way to record real time location information [ ] , [ ] . in , yoneki [ ] developed the first contact pattern discovery software, fluphone, which could automatically detect and record the contact behavior between users by mobile phone. the researchers collected the contact information of users on the cambridge university campus with this software and established the contact network between different users at different times. then, they simulated an influenza outbreak on this network using a seir model. in view that the large power consumption of gps and bluetooth resulted in short standby time of mobile phones, yoneki and crowcroft [ ] further developed a new contact tracking application, epimap, using wearable sensors, which had lower power consumption and longer standby time. epimap thoughtfully transmits and stores data by satellite, as many high-risk areas are in developing countries where there often are not enough wireless communication facilities to support contact tracking. wearable wireless sensors can record individuals' contact events such as time, location, and duration continuously and accurately, and gradually becomes a useful tool for collecting high-precision contact data in small areas [ ] . it has been applied to discover contact patterns in various kinds of social settings such as hospitals and campuses. for example, mit media lab researchers nathan eagle and alex pentland proposed the reality mining method as early as . this method suggests the use of wearable wireless sensors to record people's daily activities [ ] , [ ] . they developed an experimental system to record the activities of several mit students in a teaching building over time, and then established a small social network describing their contact relationships [ ] . salathé et al. [ ] collected the contact interactions of students in a high school in the united states for one day using wireless sensors, and they established an individual-based campus contact network. it was found that the campus contact network had high density connectivity and a small-world structure [ ] . however, it is costly to trace contact interactions using wearable wireless sensors, especially when the number of individuals being monitored is large. moreover, people wearing wearable wireless sensors are very conspicuous, participants are unwilling to wear such devices due to privacy concerns especially for patients. radio frequency identification (rfid) is a non-contact automatic identification technology, by which the contact behavior can be recorded when individuals carrying a small non-contact chip getting closer. in , olguin et al. [ ] collected , contacts among people (including medical staff, children in critical condition, and nursing staff) in a children's hospital in the united states using radiofrequency identification devices (rfid), and established a contact structure for the hospital. similarly, yoneki [ ] collected students' contact data from a french primary school using radiofrequency identification devices. 
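a contact network assembled from such phone or sensor data can be used directly for outbreak simulation, as was done in the fluphone study described above. the fluphone analysis used an seir model; the sketch below uses a simpler stochastic sir process on a random graph standing in for a sensed contact network, with parameter values chosen purely for illustration.

```python
# sketch: a stochastic sir simulation on a contact network, in the spirit of
# the fluphone-style analyses described above (all parameters illustrative).
import random
import networkx as nx

random.seed(1)
g = nx.erdos_renyi_graph(n=200, p=0.03)   # stand-in for a sensed contact network

beta, gamma = 0.08, 0.1                   # per-contact infection and recovery probabilities
state = {node: "S" for node in g}
state[0] = "I"                            # seed one infected individual

for day in range(60):
    new_state = dict(state)
    for node in g:
        if state[node] == "I":
            for nb in g.neighbors(node):
                if state[nb] == "S" and random.random() < beta:
                    new_state[nb] = "I"
            if random.random() < gamma:
                new_state[node] = "R"
    state = new_state

print({s: list(state.values()).count(s) for s in "SIR"})   # final epidemic size
```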
more recently, ibm researchers kurien and shingirirai from the company's africa labs invented a radio tag designed to extend the working distance of trackers, and applied it to tracking tuberculosis (see the figure below) [ ]. each tag contains a tiny sensor, a storage device, and a battery. radio tags can communicate with each other, allowing individuals' contact interactions to be recorded when two tags are in close proximity.

figure (caption): an ibm researcher holds a micro-radio tuberculosis tracker [ ]. in october, ibm researchers from johannesburg, south africa, released a research update: using cheap radio tags to anonymously track the contact transmission paths of tuberculosis. this study is an important step for ibm in helping the who eliminate tuberculosis.

the contact data collected by the radio tags are presented in a three-dimensional visualization system. using the intelligent data analysis methods provided by the system, medical staff can view the spatiotemporal distribution of tuberculosis patients in real time, track the transmission paths of tuberculosis, and find high-risk populations. because of the high cost of tuberculosis vaccines, contact data can also aid in determining high-priority vaccinations. however, traditional radiofrequency trackers have a limited transmission and receiving range and only work within a small area.

gps (global positioning system) has the capability of long-distance positioning and has been widely used for tracing indoor and outdoor mobility behaviors and physical activities [ ], [ ]. with the ageing of the population, tracing mobility behavior is critical for measuring, describing, and comparing mobility patterns in older adults. for example, hirsch et al. [ ] investigated mobility patterns using gps tracing data collected from older adults in vancouver, canada, with the goal of understanding neighborhood influences on older adults' health. they found that younger participants tended to drive more frequently and to range farther from their neighborhoods. gps devices have also been used for tracing the physical activities of adolescents in school and other social settings [ ]-[ ]. for instance, rodriguez et al. [ ] sampled adolescent females in minneapolis and san diego, usa, and traced their physical activity and sedentary behaviors by gps at short, regular sampling intervals in different settings. physical activities were more likely to occur in parks, schools, and places with high population density during weekends, and less likely to occur in places with roads and food outlets. besides, tracing animals in the sea or on land using gps devices can yield detailed spatiotemporal data on movement patterns. for instance, dujon et al. [ ] traced a green turtle travelling more than a thousand kilometres across the indian ocean and obtained thousands of locations. moreover, by tracing the whole-body motion dynamics of a cheetah using gps devices, patel et al. [ ] illuminated the factors that influence performance in legged animals.

in summary, detailed individual contact information can be collected through non-automatic methods, e.g., offline and online questionnaires, and automatic methods, e.g., mobile phones, wearable wireless sensors, rfid, and gps devices. however, these methods are mostly limited to small-scale population experiments due to high cost and short collection range, and they have not been applied to large areas or large-scale contact behavior studies.
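rfid tags report proximity events directly, whereas positioning data such as gps must be post-processed to derive co-location contacts. the sketch below shows one simple way to do this by thresholding pairwise distance within the same time slot; the coordinates and the 2-metre threshold are illustrative assumptions, not values taken from the studies above.

```python
# sketch: deriving contact events from positioning logs by thresholding
# pairwise distance within the same time slot (synthetic coordinates,
# assumed 2-metre proximity threshold).
from itertools import combinations
from math import hypot

logs = [  # (time_slot, person, x_metres, y_metres)
    (1, "p1", 0.0, 0.0), (1, "p2", 1.2, 0.5), (1, "p3", 30.0, 4.0),
    (2, "p1", 5.0, 5.0), (2, "p3", 5.8, 4.6),
]

def contacts_from_logs(logs, max_dist=2.0):
    by_slot = {}
    for t, person, x, y in logs:
        by_slot.setdefault(t, []).append((person, x, y))
    events = []
    for t, people in by_slot.items():
        for (a, xa, ya), (b, xb, yb) in combinations(people, 2):
            if hypot(xa - xb, ya - yb) <= max_dist:
                events.append((t, a, b))
    return events

print(contacts_from_logs(logs))   # [(1, 'p1', 'p2'), (2, 'p1', 'p3')]
```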
group contact tracing captures contact interactions of human beings with similar characteristics (e.g., age, occupation, hobbies) in different social settings from the macroscopic level. static group contact behavior can be traced by large-scale questionnaire and simulated by multi-agent models. dynamic group contact behavior can be inferred by data mining method like tensor deconvolution. in recent years, a composite group model that can characterize population heterogeneity and model epidemic spreading dynamics, overcoming the difficulty of obtaining fine-grained individuals' data has attracted much attention. such models not only simulate the transmission process, but also depict the contact structure of a larger population. the composite group model divides the population into several meta-populations by age or spatial location, so that individuals within a meta-population have similar biological characteristics (such as susceptibility, infectivity, latent period, and recovery period). then, the process of epidemic transmission can be modeled using group contact interactions among meta-populations instead of individuals' contact interactions [ ] . based on this model, the infection and spread of epidemics can be described as a reaction-diffusion process. ''reaction'' characterizes the process of individual infection within a meta-population. ''diffusion'' characterizes the transfer process of epidemic diseases between different meta-populations through the group contact structure (fig. ). in addition, there is a practical significance in establishing contact networks for composite groups because control strategies for epidemic diseases are usually oriented towards composite populations, for example, vaccination groups are usually sectioned by age when planning vaccine allocation strategies. [ ] . the diffusion process is illustrated from a macroscopic perspective, i.e., the transmission among different meta-populations, whereas the reaction process is illustrated from a microscopic perspective, i.e., the individual infection within a meta-population. the composite group contact network was first established using questionnaires. in , mossong et al. [ ] conducted the polymod research project in europe, in which they organized a wide-range survey on contact behaviors, involving volume , , participants from eight european countries. a total of , contact records were collected. they found that contact interactions have significant spatial heterogeneity, with most individual contacts occurring at home ( %), work ( %), school ( %), places of entertainment ( %), and while using transportation ( %). further, contact structures under different scenarios have obvious differences. there are some age-related contact patterns: in many scenarios (such as in schools), individuals are more likely to contact people of similar age; most of the contact between children and their parents occurs at home, while most contacts for adults occur in workplaces. the researchers thus divided the population into several meta-populations, establishing a composite group model based on age. interaction probabilities between different age groups were estimated according to questionnaire data (fig. ) , and a contact network based on composite groups was established. the simulation method based on multi-agent models is also applied to the establishment of contact networks. this generally involves combining the questionnaire survey with population census data to establish the contact structure of composite groups. 
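once a contact matrix has been estimated, whether from surveys such as polymod or from the agent-based simulations described next, it can drive a composite group epidemic model of the reaction-diffusion kind outlined above. the sketch below runs a deterministic sir model for three age groups coupled by a contact matrix; the matrix entries, group sizes, and rates are invented for illustration and are not polymod estimates.

```python
# sketch: a deterministic sir model for three age groups coupled by a
# contact matrix (a "composite group" model); all numbers are illustrative.
import numpy as np

C = np.array([[8.0, 3.0, 1.0],     # children
              [3.0, 6.0, 2.0],     # adults
              [1.0, 2.0, 3.0]])    # elderly; C[i, j] = mean daily contacts of group i with group j

beta, gamma = 0.03, 0.2            # per-contact transmission probability, recovery rate
N = np.array([2e5, 5e5, 1e5])      # group sizes
S, I, R = N.copy(), np.array([10.0, 10.0, 0.0]), np.zeros(3)
S -= I

dt = 0.1
for step in range(int(120 / dt)):           # simulate 120 days
    force = beta * C.dot(I / N)             # force of infection on each group
    new_inf = force * S * dt
    new_rec = gamma * I * dt
    S, I, R = S - new_inf, I + new_inf - new_rec, R + new_rec

print(np.round(R / N, 2))                   # final fraction infected per age group
```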
iozzi et al. [ ] modeled a virtual society with the characteristics of italian society based on questionnaire data from , people. human daily migration behaviors were simulated by a virtual community, and a contact matrix of the composite group was obtained. based on this matrix, the outbreak process of italian b (human parvovirus) was successfully simulated. similarly, eubank et al. [ ] simulated the movement of individuals within a city by large-scale agent system, and then modeled a group contact structure based on their simulation. the data they used included population census data, land usage, population migration, and other daily behavior data. constructing contact matrix for meta-population requires large-scale even nationwide questionnaire survey, which is quite costly and time delayed. multi-agent method simulates human mobility behaviors in the virtual world based on the contact matrix constructed using the data from the real world of the past [ ] . it doesn't consider the changes of existing contact patterns caused by human self-awareness and epidemic-control strategies in the future. most of the above studies focus on static properties of contact behaviors, such as the contact object (who is contacting), scene (where this contact happens), frequency, and duration. in other words, the aforementioned studies assume that the contact patterns of the individual remain stationary. however, contact interactions usually change with time, and show different temporal and spatial patterns. for example, contact interactions can change periodically with the weather and season, vary significantly between workdays, weekends, and holidays, and may be adjusted in response to the threat of an epidemic disease and during the outbreak by reducing travel or wearing face masks [ ] , [ ] . additionally, governmentimposed epidemic-control strategies can significantly change individuals' contact patterns [ ] , [ ] , [ ] , [ ] . for example, during the outbreak of h n flu in hong kong in , interventions, such as flight reductions, school closures, and vaccination efforts, significantly altered individuals' contact interactions [ ] - [ ] . dynamic contacts between individuals are more difficult to be observed and recorded than static contacts because of the limitations of existing contact tracing methods. offline and online questionnaires are incapable of recording real-time contact information, and usually time delayed to receive feedback from participants. automatic contact tracing methods such as mobile phone, wearable wireless sensor, rfid, and gps devices can collect continuous mobility information [ ] , [ ] . however, all these methods are limited to monitoring mobility behaviors for small-scale population, due to the large consumption of power, short range of positioning, high cost of money, etc. besides, most people cannot be expected to agree to have their dynamic contact interactions monitored in real time because of privacy issue. for example, wearing a tracker can also be equated to declaring one's self an infectious disease patient. tuberculosis patients in african countries are branded with social prejudice, making wearing an identifier a sensitive issue [ ] . in light of these obstacles, a new path that does not ''directly'' capture and record individuals' dynamic contact behaviors, but ''indirectly'' infers the dynamic contact model of a large-scale population from other readily available data sources must be found. infectious disease surveillance, like that depicted in fig. 
, expands everyday with the vast applications of information technology in the medical field. surveillance data record spatiotemporal information related to the spread of infectious diseases, which is the result of the spread model acting on the real contact network, as shown in fig. (a) . such surveillance data can be regarded as an external manifestation of the implicit contact network, suggesting that the dynamic contact network could be ''inversely'' inferred from infectious disease surveillance data, as shown in fig. (b) . essentially, this is a complex inverse engineering problem: using the observed dynamics phenomenon to determine the dynamic structure that leads to the phenomenon. in other words, determining time-dependent contact interactions using the timedependent spread trend of infectious diseases. based on the idea of inverse engineering, yang et al. [ ] proposed a novel modeling and inference method for constructing a dynamic contact network based on tensor model. they described the spatiotemporal patterns of composite group contacts as a tensor, modeled the inference of the dynamic contact network as low-rank tensor reconstruction problem, and proposed a tensor deconvolution based inference method by fusing compression perception, sparse tensor decomposition and epidemic propagation models. this method makes it possible to determine the dynamic contact network of the large-scale composite group from population census data and surveillance data of many epidemic diseases. using this method, composite group dynamic contact networks for hong kong and taiwan were established using population census data and surveillance data of a variety of infectious diseases (such as h n , influenza, measles, mumps, etc.) for these two areas. the temporal and spatial evolution patterns of individuals' dynamic contact interactions were analyzed. based on the established dynamic contact network, they further studied the spread law, and prevention and control strategies of h n epidemic disease. they arrived at two important conclusions: ( ) in the h n outbreak in hong kong in , if the beginning of the new semester was delayed two to six weeks, the total number of infections would have been reduced by % to %; ( ) the best strategy for prevention and control of h n spread is vaccination of school-age children in the first few days of the new semester. contact tracking based on intelligent information processing technology represents an active prevention and control strategy for infectious diseases. its main functions are to achieve early detection and timely intervention of infectious diseases. research on contact tracking methods not only expands the options for preventing and controlling infectious diseases, but also further improves people's understanding of their own contact behaviors. contact tracking has become an increasingly mature datadriven technology for disease prevention and control, evolving from individual tracking to group tracking. individual tracking attempts to capture more detailed contact interactions for accurate locating of infected patients and high-risk susceptible populations. traditional offline questionnaire is a practical method for tracing private contact interactions between individuals such as sexual practice. however, it is quite costly and time delayed to find target participants and receive feedback from them. comparatively speaking, online questionnaire serves a low-priced way to collect feedback from participants timely. 
however, it cannot record the time exactly when contact occurs. meanwhile, offline and online questionnaires sometimes provide inaccurate information of human mobility. for example, klous et al. [ ] surveyed participants in a rural in the netherlands using questionnaire and gps logger, respectively. investigations on walking, biking, and motorized transport duration showed that time spent in walking and biking based on questionnaire was strongly overestimated. the use of automatic contact tracing methods enabled researchers to obtain continuous and accurate individual contact information, e.g., time, location, duration, etc. mobile phone and wireless sensors were widely used to trace mobility behaviors of students in campus and patients in hospital. then, small-scale contact network within the tracing regions can be constructed and the diffusion process of infectious volume , disease such as influenza can be simulated and analyzed in detail. however, the use of mobile phone is limited to tracing short-term contact behavior because of large power consumption of gps and bluetooth. wearable wireless sensors can only be applied to small-scale population due to its high cost and privacy concerns. rfid devices are convenient carrying which solves privacy concerns very well, but it only can be used for short range collection. gps device has the advantage of long-distance positioning. however, it is costly to capture indoor mobility behaviors due to the requirement of communication stations [ ] . all these automatic contact tracing methods have not been used for studies of large-scale individual contact so far. group tracking replaces individual contacts with the contact probability of meta-populations, which, to some extent, overcomes the obstacles of individual tracking. using a contact matrix of meta-population, contact patterns regarding people with similar features can be depicted from the macroscopic level. however, the contact matrix is usually constructed using the data collected from a nationwide questionnaire, which is static and can only represent the contact patterns of the past. to explore dynamic contact patterns of meta-population, a data-driven ai (artificial intelligence) method was adopted, i.e., tensor deconvolution [ ] . based on this method a dynamic evolutionary model of the group contact was constructed and dynamic contact patterns were inferred inversely through insights into the time-dependent nature of the infectious disease surveillance data. nevertheless, it should be noted that although it can characterize a wider range of dynamic contact behaviors, it cannot be used to accurately locate unique contact events because of the coarse granularity of the captured contact behaviors. exploring social contact patterns for epidemic prevention and control is an every promising research direction, and some potential future development directions are illustrated as follows. a. multi-view contact tracing data obtained from different views can give expressions to different patterns of mobility behaviors. for example, offline and online questionnaire can accurately record contact events occurred in places that individuals frequently visited [ ] . gps devices can record indoor and outdoor contact events happened occasionally [ ] . heterogeneous contact network constructed by various kinds of information can provide a new way for analyzing and simulating the spread of epidemics. 
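a simple way to picture such a heterogeneous, multi-view network is to merge contact edges derived from different sources into one weighted graph in which each edge remembers which views support it. the sketch below does this for two hypothetical views (questionnaire recall and gps co-location); the edges and per-view weights are invented for illustration.

```python
# sketch: combining contact edges derived from two "views" (questionnaire
# recall and gps co-location) into one weighted, multi-view contact network.
import networkx as nx

questionnaire_edges = [("p1", "p2"), ("p2", "p3"), ("p1", "p4")]
gps_edges           = [("p1", "p2"), ("p3", "p4")]

g = nx.Graph()
for source, edges, weight in [("questionnaire", questionnaire_edges, 1.0),
                              ("gps", gps_edges, 2.0)]:   # assumed per-view weights
    for u, v in edges:
        if g.has_edge(u, v):
            g[u][v]["weight"] += weight
            g[u][v]["views"].add(source)
        else:
            g.add_edge(u, v, weight=weight, views={source})

# edges supported by both views are the most reliable contacts
for u, v, data in g.edges(data=True):
    print(u, v, data["weight"], sorted(data["views"]))
```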
therefore, tracing mobility behaviors and analyzing contact patterns from multi-views to get new insight into what heterogeneous contact patterns like will be a new direction in the future. existing studies focus on either individual level or group level contact tracing, presenting independent contact patterns from microscopic and macroscopic scales, respectively. however, group contact patterns are formed by collaborative behaviors of individual mobility, while individual mobility behaviors can be influenced by others in the same group. revealing hidden interactions between individual contact and group contact will be helpful to identify influential individuals as sentries for disease monitoring. therefore, discovering hidden interactions from multi-scale contact patterns that tunneling individual contact and group contact will be a new opportunity for early epidemic detection. dynamic mobility behaviors lead to complex contact patterns, which are usually hidden and cannot be directly traced by non-automatic or automatic methods. a better way to infer dynamic contact patterns is adopting ai-based methods using heterogeneous real-world data. existing studies such as tensor deconvolution consider the combination of contact probabilities within real-world social settings like school, home, and workplace as linear [ ] . however, hidden dynamic contact patterns within these social settings could be more complicated than linear models can characterize. therefore, exploring advanced ai-based contact tracing methods, e.g., multi-view learning [ ] - [ ] , deep learning [ ] , broad learning [ ] , etc., will be the next generation technology for epidemic prevention and control. in this paper, we introduced current studies on contact tracing and its applications in epidemic prevention and control. this paper covered research directions, i.e., individual contact and group contact, which were introduced from both static and dynamic aspects. non-automatic tracing methods like offline and online questionnaires record static individual contact information, while automatic tracing methods like mobile phone, wearable wireless sensor, rfid, and gps devices collect dynamic contact events. static group contact patterns can be depicted by a coarse granularity contact matrix constructed by large-scale questionnaire data, dynamic contact patterns, however, can only be inversely inferred using data-driven ai technologies. both individual and group contact tracing are promising research directions and filled with challenges, especially for dynamic contact tracing. collecting contact data from multi-views and analyzing contact patterns from multi-scale mobility interactions will be new directions in the future. moreover, exploring advanced ai-based contact tracing methods using heterogeneous and multi-source data will provide new opportunities for epidemic prevention and control. hechang chen received the m.s. degree from the college of computer science and technology, jilin university, in , where he is currently pursuing the ph.d. degree. he was enrolled in the university of illinois at chicago as a joint training ph.d. student from to . his current research interests include heterogenous data mining and complex network modeling with applications to computational epidemiology. bo yang received the b.s., m.s., and ph.d. degrees in computer science from jilin university in , , and , respectively. he is currently a professor with the college of computer science and technology, jilin university. 
he is currently the director of the key laboratory of symbolic computation and knowledge engineer, ministry of education, china. his current research interests include data mining, complex/social network modeling and analysis, and multi-agent systems. he is currently working on the topics of discovering structural and dynamical patterns from large-scale and time-evolving networks with applications to web intelligence, recommender systems, and early detection/control of infectious events. he has published over articles in international journals, including ieee tkde, ieee tpami, acm tweb, dke, jaamas, and kbs, and international conferences, including ijcai, aaai, icdm, wi, pakdd, and asonam. he has served as an associated editor and a peer reviewer for international journals, including {web intelligence} and served as the pc chair and an spc or pc member for international conferences, including ksem, ijcai, and aamas. updating the accounts: global mortality of the - 'spanish' influenza pandemic plasmodium ovale: a case of notso-benign tertian malaria the global burden of respiratory disease impact of the large-scale deployment of artemether/lumefantrine on the malaria disease burden in africa: case studies of south africa, zambia and ethiopia high-resolution measurements of face-to-face contact patterns in a primary school tracking tuberculosis in south africa the annual impact of seasonal influenza in the us: measuring disease burden and costs updated situation of influenza activity in hong kong little italy: an agent-based approach to the estimation of contact patterns-fitting predicted matrices to serological data the tencent news what types of contacts are important for the spread of infections? using contact survey data to explore european mixing patterns estimating within-school contact networks to understand influenza transmission reality mining: sensing complex social systems time-critical social mobilization capturing individual and group behavior with wearable sensor fluphone study: virtual disease spread using haggle epimap: towards quantifying contact networks and modelling the spread of infections in developing countries a high-resolution human contact network for infectious disease transmission collective dynamics of small world networks close encounters in a pediatric ward: measuring face-toface proximity and mixing patterns with wearable sensors computing urban traffic congestions by incorporating sparse gps probe data and social media data modeling human mobility responses to the large-scale spreading of infectious diseases modelling dynamical processes in complex socio-technical systems social contacts and mixing patterns relevant to the spread of infectious diseases characterizing and discovering spatiotemporal social contact patterns for healthcare outbreaks in realistic urban social networks skip the trip: air travelers' behavioral responses to pandemic influenza the effect of risk perception on the h n pandemic influenza dynamics quantifying social distancing arising from pandemic influenza behavioral responses to epidemics in an online experiment: using virtual diseases to study human behavior an evaluation of an express testing service for sexually transmissible infections in low-risk clients without complications hiv/aids: years of progress and future challenges reflections on years of aids prevalence and patterns of same-gender sexual contact among men hiv- , sexual practices, and contact with foreigners in homosexual men in colombia, south america prevalence of 
unsafe sexual behavior among hiv-infected individuals: the swiss hiv cohort study sexual behavioral abstine hiv/aids questionnaire: validation study of an iranian questionnaire study on the risk of hiv transmission by heterosexual contact and the correlation factors a survey of patients with chlamydia trachomatis infection: sexual behaviour and perceptions about contact tracing transmission of zika virus through sexual contact with travelers to areas of ongoing transmission-continental united states zika virus was transmitted by sexual contact in texas, health officials report adolescent sexual behavior, drug use, and violence: increased reporting with computer survey technology insights from flutracking: thirteen tips to growing a web-based participatory surveillance system flutracking: a weekly australian community online survey of influenza-like illness in flutracking weekly online community survey of influenza-like illness annual report mobility assessment of a rural population in the netherlands using gps measurements generating gps activity spaces that shed light upon the mobility habits of older adults: a descriptive analysis what can global positioning systems tell us about the contribution of different types of urban greenspace to children's physical activity? out and about: association of the built environment with physical activity behaviors of adolescent females a study of community design, greenness, and physical activity in children using satellite, gps and accelerometer data the accuracy of fastloc-gps locations and implications for animal tracking tracking the cheetah tail using animal-borne cameras, gps, and an imu examining the spatial congruence between data obtained with a novel activity location questionnaire, continuous gps tracking, and prompted recall surveys using global positioning systems in health research: a practical approach to data collection and processing mobile sensing in environmental health and neighborhood research feasibility and acceptability of global positioning system (gps) methods to study the spatial contexts of substance use and sexual risk behaviors among young men who have sex with men in new york city: a p cohort sub-study strengths and weaknesses of global positioning system (gps) data-loggers and semi-structured interviews for capturing fine-scale human mobility: findings from iquitos using mobile phone data to study dynamics of rural-urban mobility inferencing human spatiotemporal mobility in greater maputo via mobile phone big data mining gps tracking in neighborhood and health studies: a step forward for environmental exposure assessment, a step backward for causal inference? multi-view clustering with graph embedding for connectome analysis a self-organizing tensor architecture for multi-view clustering mmrate: inferring multi-aspect diffusion networks with multi-pattern cascades inferring diffusion networks with sparse cascades by structure transfer partially observable reinforcement learning for sustainable active surveillance broad learning: an emerging area in social network analysis key: cord- -hr smx authors: van kampen, antoine h. c.; moerland, perry d. title: taking bioinformatics to systems medicine date: - - journal: systems medicine doi: . / - - - - _ sha: doc_id: cord_uid: hr smx systems medicine promotes a range of approaches and strategies to study human health and disease at a systems level with the aim of improving the overall well-being of (healthy) individuals, and preventing, diagnosing, or curing disease. 
in this chapter we discuss how bioinformatics critically contributes to systems medicine. first, we explain the role of bioinformatics in the management and analysis of data. in particular we show the importance of publicly available biological and clinical repositories to support systems medicine studies. second, we discuss how the integration and analysis of multiple types of omics data through integrative bioinformatics may facilitate the determination of more predictive and robust disease signatures, lead to a better understanding of (patho)physiological molecular mechanisms, and facilitate personalized medicine. third, we focus on network analysis and discuss how gene networks can be constructed from omics data and how these networks can be decomposed into smaller modules. we discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, and lead to predictive models. throughout, we provide several examples demonstrating how bioinformatics contributes to systems medicine and discuss future challenges in bioinformatics that need to be addressed to enable the advancement of systems medicine. systems medicine fi nds its roots in systems biology, the scientifi c discipline that aims at a systems-level understanding of, for example, biological networks, cells, organs, organisms, and populations. it generally involves a combination of wet-lab experiments and computational (bioinformatics) approaches. systems medicine extends systems biology by focusing on the application of systems-based approaches to clinically relevant applications in order to improve patient health or the overall well-being of (healthy) individuals [ ] . systems medicine is expected to change health care practice in the coming years. it will contribute to new therapeutics through the identifi cation of novel disease genes that provide drug candidates less likely to fail in clinical studies [ , ] . it is also expected to contribute to fundamental insights into networks perturbed by disease, improved prediction of disease progression, stratifi cation of disease subtypes, personalized treatment selection, and prevention of disease. to enable systems medicine it is necessary to characterize the patient at various levels and, consequently, to collect, integrate, and analyze various types of data including not only clinical (phenotype) and molecular data, but also information about cells (e.g., disease-related alterations in organelle morphology), organs (e.g., lung impedance when studying respiratory disorders such as asthma or chronic obstructive pulmonary disease), and even social networks. the full realization of systems medicine therefore requires the integration and analysis of environmental, genetic, physiological, and molecular factors at different temporal and spatial scales, which currently is very challenging. it will require large efforts from various research communities to overcome current experimental, computational, and information management related barriers. in this chapter we show how bioinformatics is an essential part of systems medicine and discuss some of the future challenges that need to be solved. to understand the contribution of bioinformatics to systems medicine, it is helpful to consider the traditional role of bioinformatics in biomedical research, which involves basic and applied (translational) research to augment our understanding of (molecular) processes in health and disease. 
the term "bioinformatics" was fi rst coined by the dutch theoretical biologist paulien hogeweg in to refer to the study of information processes in biotic systems [ ] . soon, the fi eld of bioinformatics expanded and bioinformatics efforts accelerated and matured as the fi rst (whole) genome and protein sequences became available. the signifi cance of bioinformatics further increased with the development of highthroughput experimental technologies that allowed wet-lab researchers to perform large-scale measurements. these include determining whole-genome sequences (and gene variants) and genome-wide gene expression with next-generation sequencing technologies (ngs; see table for abbreviations and web links) [ ] , measuring gene expression with dna microarrays [ ] , identifying and quantifying proteins and metabolites with nmr or (lc/ gc-) ms [ ] , measuring epigenetic changes such as methylation and histone modifi cations [ ] , and so on. these, "omics" technologies, are capable of measuring the many molecular building blocks that determine our (patho)physiology. genome-wide measurements have not only signifi cantly advanced our fundamental understanding of the molecular biology of health and disease but table abbreviations and websites have also contributed to new (commercial) diagnostic and prognostic tests [ , ] and the selection and development of (personalized) treatment [ ] . nowadays, bioinformatics is therefore defi ned as "advancing the scientifi c understanding of living systems through computation" (iscb), or more inclusively as "conceptualizing biology in terms of molecules and applying 'informatics techniques' (derived from disciplines such as applied mathematics, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale" [ ] . it is worth noting that solely measuring many molecular components of a biological system does not necessarily result in a deeper understanding of such a system. understanding biological function does indeed require detailed insight into the precise function of these components but, more importantly, it requires a thorough understanding of their static, temporal, and spatial interactions. these interaction networks underlie all (patho)physiological processes, and elucidation of these networks is a major task for bioinformatics and systems medicine . the developments in experimental technologies have led to challenges that require additional expertise and new skills for biomedical researchers: • information management. modern biomedical research projects typically produce large and complex omics data sets , sometimes in the order of hundreds of gigabytes to terabytes of which a large part has become available through public databases [ , ] sometimes even prior to publication (e.g., gtex, icgc, tcga). this not only contributes to knowledge dissemination but also facilitates reanalysis and metaanalysis of data, evaluation of hypotheses that were not considered by the original research group, and development and evaluation of new bioinformatics methods. the use of existing data can in some cases even make new (expensive) experiments superfl uous. alternatively, one can integrate publicly available data with data generated in-house for more comprehensive analyses, or to validate results [ ] . in addition, the obligation of making raw data available may prevent fraud and selective reporting. 
the management (transfer, storage, annotation, and integration) of data and associated meta-data is one of the main and increasing challenges in bioinformatics that needs attention to safeguard the progression of systems medicine. • data analysis and interpretation . bioinformatics data analysis and interpretation of omics data have become increasingly complex, not only due to the vast volumes and complexity of the data but also as a result of more challenging research ques- tions. bioinformatics covers many types of analyses including nucleotide and protein sequence analysis, elucidation of tertiary protein structures, quality control, pre-processing and statistical analysis of omics data, determination of genotypephenotype relationships, biomarker identifi cation, evolutionary analysis, analysis of gene regulation, reconstruction of biological networks, text mining of literature and electronic patient records, and analysis of imaging data. in addition, bioinformatics has developed approaches to improve experimental design of omics experiments to ensure that the maximum amount of information can be extracted from the data. many of the methods developed in these areas are of direct relevance for systems medicine as exemplifi ed in this chapter. clearly, new experimental technologies have to a large extent turned biomedical research in a data-and compute-intensive endeavor. it has been argued that production of omics data has nowadays become the "easy" part of biomedical research, whereas the real challenges currently comprise information management and bioinformatics analysis. consequently, next to the wet-lab, the computer has become one of the main tools of the biomedical researcher . bioinformatics enables and advances the management and analysis of large omics-based datasets, thereby directly and indirectly contributing to systems medicine in several ways ( fig. . quality control and pre-processing of omics data. preprocessing typically involves data cleaning (e.g., removal of failed assays) and other steps to obtain quantitative measurements that can be used in downstream data analysis. . (statistical) data analysis methods of large and complex omicsbased datasets. this includes methods for the integrative analysis of multiple omics data types (subheading ), and for the elucidation and analysis of biological networks (top-down systems medicine; subheading ). systems medicine comprises top-down and bottom-up approaches. the former represents a specifi c branch of bioinformatics, which distinguishes itself from bottom-up approaches in several ways [ , , ] . top-down approaches use omics data to obtain a holistic view of the components of a biological system and, in general, aim to construct system-wide static functional or physical interaction networks such as gene co-expression networks and protein-protein interaction networks. in contrast, bottom-up approaches aim to develop detailed mechanistic and quantitative mathematical models for sub-systems. these models describe the dynamic and nonlinear behavior of interactions between known components to understand and predict their behavior upon perturbation. however, in contrast to omics-based top-down approaches, these mechanistic models require information about chemical/physical parameters and reaction stoichiometry, which may not be available and require further (experimental) efforts. 
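as a minimal illustration of the top-down route mentioned above, building a gene network from omics data and decomposing it into modules, the sketch below constructs a co-expression network from a synthetic expression matrix by thresholding absolute pairwise correlation and reports connected components as modules. the threshold, gene names, and data are assumptions made for illustration only.

```python
# sketch: a "top-down" gene co-expression network built from a synthetic
# expression matrix and decomposed into modules (connected components).
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(10)]
samples = 30

expr = rng.normal(size=(len(genes), samples))
expr[1] = expr[0] + 0.1 * rng.normal(size=samples)   # make genes 0-2 co-expressed
expr[2] = expr[0] + 0.1 * rng.normal(size=samples)
expr[5] = -expr[4] + 0.1 * rng.normal(size=samples)  # genes 4-5 anti-correlated

corr = np.corrcoef(expr)                             # gene-by-gene correlation matrix
g = nx.Graph()
g.add_nodes_from(genes)
for i in range(len(genes)):
    for j in range(i + 1, len(genes)):
        if abs(corr[i, j]) >= 0.8:                   # assumed co-expression cut-off
            g.add_edge(genes[i], genes[j], r=round(corr[i, j], 2))

modules = [sorted(m) for m in nx.connected_components(g) if len(m) > 1]
print(modules)   # e.g. [['gene0', 'gene1', 'gene2'], ['gene4', 'gene5']]
```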
both the top-down and bottom-up approaches result in testable hypotheses and new wet-lab or in silico experiments that may lead to clinically relevant fi ndings. biomedical research and, consequently, systems medicine are increasingly confronted with the management of continuously growing volumes of molecular and clinical data, results of data analyses and in silico experiments, and mathematical models. due fig. the contribution of bioinformatics ( dark grey boxes ) to systems medicine ( black box ). (omics) experiments, patients, and public repositories provide a wide range of data that is used in bioinformatics and systems medicine studies to policies of scientifi c journals and funding agencies, omics data is often made available to the research community via public databases. in addition, a wide range of databases have been developed, of which more than are currently listed in the molecular biology database collection [ ] providing a rich source of biomedical information. biological repositories do not merely archive data and models but also serve a range of purposes in systems medicine as illustrated below from a few selected examples. the main repositories are hosted and maintained by the major bioinformatics institutes including ebi, ncbi, and sib that make a major part of the raw experimental omics data available through a number of primary databases including genbank [ ] , geo [ ] , pride [ ] , and metabolights [ ] for sequence, gene expression, ms-based proteomics, and ms-based metabolomics data, respectively. in addition, many secondary databases provide information derived from the processing of primary data, for example pathway databases (e.g., reactome [ ] , kegg [ ] ), protein sequence databases (e.g., uniprotkb [ ] ), and many others. pathway databases provide an important resource to construct mathematical models used to study and further refi ne biological systems [ , ] . other efforts focus on establishing repositories integrating information from multiple public databases. the integration of pathway databases [ - ] , and genome browsers that integrate genetic, omics, and other data with whole-genome sequences [ , ] are two examples of this. joint initiatives of the bioinformatics and systems biology communities resulted in repositories such as biomodels, which contains mathematical models of biochemical and cellular systems [ ] , recon that provides a communitydriven, consensus " metabolic reconstruction " of human metabolism suitable for computational modelling [ ] , and seek, which provides a platform designed for the management and exchange of systems biology data and models [ ] . another example of a database that may prove to be of value for systems medicine studies is malacards , an integrated and annotated compendium of about , human diseases [ ] . malacards integrates disease sources into disease cards and establishes gene-disease associations through integration with the well-known genecards databases [ , ] . integration with genecards and cross-references within malacards enables the construction of networks of related diseases revealing previously unknown interconnections among diseases, which may be used to identify drugs for off-label use. another class of repositories are (expert-curated) knowledge bases containing domain knowledge and data, which aim to provide a single point of entry for a specifi c domain. 
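to illustrate how cross-referenced gene-disease associations, such as those integrated in malacards and genecards, can be turned into a network of related diseases, the sketch below links diseases that share associated genes. the associations and the similarity threshold are toy examples, not records from either database.

```python
# sketch: linking diseases that share associated genes (toy associations),
# producing a small disease-disease network weighted by gene overlap.
from itertools import combinations

disease_genes = {
    "disease_A": {"TP53", "BRCA1", "ATM"},
    "disease_B": {"BRCA1", "CHEK2"},
    "disease_C": {"CFTR"},
    "disease_D": {"ATM", "CHEK2", "TP53"},
}

def disease_network(assoc, min_jaccard=0.2):
    edges = []
    for (d1, g1), (d2, g2) in combinations(assoc.items(), 2):
        jaccard = len(g1 & g2) / len(g1 | g2)        # shared-gene similarity
        if jaccard >= min_jaccard:
            edges.append((d1, d2, round(jaccard, 2)))
    return edges

for d1, d2, w in disease_network(disease_genes):
    print(f"{d1} -- {d2} (shared-gene similarity {w})")
```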
contents of these knowledge bases are often based on information extracted (either manually or by text mining) from literature or provided by domain experts [ - ]. finally, databases are used routinely in the analysis, interpretation, and validation of experimental data. for example, the gene ontology (go) provides a controlled vocabulary of terms for describing gene products and is often used in gene set analysis to evaluate expression patterns of groups of genes instead of those of individual genes [ ]; it has, for example, been applied to investigate hiv-related cognitive disorders [ ] and polycystic kidney disease [ ].

several repositories such as mir disease [ ], peroxisomedb [ ], and mouse genome informatics (mgi) [ ] include associations between genes and disorders, but provide only very limited phenotypic information. phenotype databases are of particular interest to systems medicine. one well-known phenotype repository is the omim database, which primarily describes single-gene (mendelian) disorders [ ]. clinvar is another example and provides an archive of reports and evidence of the relationships among medically important human variations found in patient samples and phenotypes [ ]. clinvar complements dbsnp (for single-nucleotide polymorphisms) [ ] and dbvar (for structural variations) [ ], which both provide only minimal phenotypic information. the integration of these phenotype repositories with genetic and other molecular information will be a major aim for bioinformatics in the coming decade, enabling, for example, the identification of comorbidities, determination of associations between gene (mutations) and disease, and improvement of disease classifications [ ]. it will also advance the definition of the "human phenome," i.e., the set of phenotypes resulting from genetic variation in the human genome. to increase the quality and (clinical) utility of the phenotype and variant databases as an essential step towards reducing the burden of human genetic disease, the human variome project coordinates efforts in standardization, system development, and (training) infrastructure for the worldwide collection and sharing of genetic variations that affect human health [ , ].
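to make the gene set analysis mentioned above (e.g., with go terms) concrete, here is a minimal over-representation test using a one-sided fisher's exact test; the gene identifiers, set sizes and the single go term are invented for illustration, and a real analysis would correct for testing many terms.

```python
from scipy.stats import fisher_exact

# hypothetical inputs
background = {f"gene_{i}" for i in range(2000)}           # all measured genes
go_term_genes = {f"gene_{i}" for i in range(0, 120)}      # genes annotated with one go term
hits = {f"gene_{i}" for i in range(0, 200, 2)}            # differentially expressed genes

a = len(hits & go_term_genes)                 # de genes inside the go term
b = len(hits - go_term_genes)                 # de genes outside the go term
c = len((background - hits) & go_term_genes)  # non-de genes inside the go term
d = len((background - hits) - go_term_genes)  # non-de genes outside the go term

odds_ratio, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
print(f"overlap={a}, odds ratio={odds_ratio:.2f}, p={p_value:.3g}")
```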
to implement and advance systems medicine to the benefit of patients' health, it is crucial to integrate and analyze molecular data together with de-identified individual-level clinical data complementing general phenotype descriptions. patient clinical data refers to a wide variety of data, including basic patient information (e.g., age, sex, ethnicity), outcomes of physical examinations, patient history, medical diagnoses, treatments, laboratory tests, pathology reports, medical images, and other clinical outcomes. inclusion of clinical data allows the stratification of patient groups into more homogeneous clinical subgroups. availability of clinical data will increase the power of downstream data analysis and modeling to elucidate molecular mechanisms, and to identify molecular biomarkers that predict disease onset or progression, or which guide treatment selection. in biomedical studies clinical information is generally used as part of patient and sample selection, but some omics studies also use clinical data as part of the bioinformatics analysis (e.g., [ , ]). however, in general, clinical data is unavailable from public resources or only provided at an aggregated level. although good reasons exist for making clinical data available (subheading ), ethical and legal issues comprising patient and commercial confidentiality, as well as technical issues, are the most immediate challenges [ , ]. this potentially hampers the development of systems medicine approaches in a clinical setting, since sharing and integration of clinical and nonclinical data is considered a basic requirement [ ].

biobanks [ ] such as bbmri [ ] provide a potential source of biological material and associated (clinical) data, but these are generally not publicly accessible, although permission to access data may be requested from the biobank provider. clinical trials provide another source of clinical data for systems medicine studies, but these are generally owned by a research group or sponsor and not freely available [ ], although ongoing discussions may change this in the future ([ ] and references therein). although clinical data is not yet available on a large scale, the bioinformatics and medical informatics communities have been very active in establishing repositories that provide clinical data. one example is the database of genotypes and phenotypes (dbgap) [ ] developed by the ncbi. study metadata, summary-level (phenotype) data, and documents related to studies are publicly available. access to de-identified individual-level (clinical) data is only granted after approval by an nih data access committee. another example is the cancer genome atlas (tcga), which also provides individual-level molecular and clinical data through its own portal and the cancer genomics hub (cghub). clinical data from tcga is available without any restrictions, but part of the lower-level sequencing and microarray data can only be obtained through a formal request managed by dbgap.

medical patient records provide an even richer source of phenotypic information and have already been used to stratify patient groups, discover disease relations and comorbidity, and integrate these records with molecular data to obtain a systems-level view of phenotypes (for a review see [ ]). on the one hand, this integration facilitates refinement and analysis of the human phenome to, for example, identify diseases that are clinically uniform but have different underlying molecular mechanisms, or which share a pathogenetic mechanism but have different genetic causes [ ]. on the other hand, using the same data, a phenome-wide association study (phewas) [ ] would allow the identification of unrelated phenotypes associated with specific shared genetic variant(s), an effect referred to as pleiotropy. moreover, it makes use of information from medical records generated in routine clinical practice and, consequently, has the potential to strengthen the link between biomedical research and clinical practice [ ]. the power of phenome analysis was demonstrated in a study involving . million patient records, not including genotype information, comprising disorders. in this study it was shown that disease phenotypes form a highly connected network, suggesting a shared genetic basis [ ]. indeed, later studies that incorporated genetic data reached similar findings and confirmed a shared genetic basis for a number of different phenotypes. for example, a recent study identified potentially pleiotropic associations through the analysis of snps that had previously been implicated by genome-wide association studies (gwas) as mediators of human traits, and phenotypes derived from patient records of , individuals [ ]. this demonstrates that phenotypic information extracted manually or through text mining from patient records can help to more precisely define (relations between) diseases.
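the phewas strategy discussed above boils down to testing one genetic variant against many phenotypes derived from patient records; the sketch below does this with simple logistic regressions on simulated data (hypothetical phenotype codes, no covariate adjustment or multiple-testing correction).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
records = pd.DataFrame({
    "genotype": rng.integers(0, 3, size=n),   # 0/1/2 copies of the risk allele
    "pheno_A": rng.integers(0, 2, size=n),    # hypothetical binary phenotype codes
    "pheno_B": rng.integers(0, 2, size=n),
    "pheno_C": rng.integers(0, 2, size=n),
})

results = {}
for pheno in ["pheno_A", "pheno_B", "pheno_C"]:
    X = sm.add_constant(records[["genotype"]])
    fit = sm.Logit(records[pheno], X).fit(disp=0)  # logistic regression: phenotype ~ genotype
    results[pheno] = fit.pvalues["genotype"]
print(results)  # phenotypes sharing an association with the variant suggest pleiotropy
```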
another example comprises the text mining of psychiatric patient records to discover disease correlations [ ]. here, mapping of disease genes from the omim database to information from medical records resulted in protein networks suspected to be involved in psychiatric diseases.

integrative bioinformatics comprises the integrative (statistical) analysis of multiple omics data types. many studies demonstrated that using a single omics technology to measure a specific molecular level (e.g., dna variation, expression of genes and proteins, metabolite concentrations, epigenetic modifications) already provides a wealth of information that can be used for unraveling molecular mechanisms underlying disease. moreover, single-omics disease signatures which combine multiple (e.g., gene expression) markers have been constructed to differentiate between disease subtypes to support diagnosis and prognosis. however, no single technology can reveal the full complexity and details of molecular networks observed in health and disease, due to the many interactions across these levels. a systems medicine strategy should ideally aim to understand the functioning of the different levels as a whole by integrating different types of omics data. this is expected to lead to biomarkers with higher predictive value, and to novel disease insights that may help to prevent disease and to develop new therapeutic approaches.

integrative bioinformatics can also facilitate the prioritization and characterization of genetic variants associated with complex human diseases and traits identified by gwas, in which hundreds of thousands to over a million snps are assayed in a large number of individuals. although such studies lack the statistical power to identify all disease-associated loci [ ], they have been instrumental in identifying loci for many common diseases. however, it remains difficult to prioritize the identified variants and to elucidate their effects on downstream pathways ultimately leading to disease [ ]. consequently, methods have been developed to prioritize candidate snps based on integration with other (omics) data such as gene expression, dnase hypersensitive sites, histone modifications, and transcription factor-binding sites [ ].

the integration of multiple omics data types is far from trivial, and various approaches have been proposed [ - ]. one approach is to link different types of omics measurements through common database identifiers. although this may seem straightforward, in practice it is complicated as a result of technical and standardization issues as well as a lack of biological consensus [ , - ]. moreover, the integration of data at the level of the central dogma of molecular biology with, for example, metabolite data is even more challenging due to the indirect relationships between genes, transcripts, and proteins on the one hand and metabolites on the other hand, precluding direct links between the database identifiers of these molecules. statistical data integration [ ] is a second commonly applied strategy, and various approaches have been applied for the joint analysis of multiple data types (e.g., [ , ]).
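a minimal illustration of the identifier-based integration described above: two omics tables are joined on a shared gene identifier and their agreement is checked; identifiers and values are fabricated, and real mappings usually require dedicated identifier-mapping services.

```python
import pandas as pd

mrna = pd.DataFrame({"gene_id": ["G1", "G2", "G3", "G4"],
                     "mrna_level": [5.2, 1.1, 3.3, 7.8]})
protein = pd.DataFrame({"gene_id": ["G1", "G3", "G4", "G5"],
                        "protein_level": [4.9, 2.8, 6.5, 0.7]})

# link the two omics levels through a common database identifier
merged = mrna.merge(protein, on="gene_id", how="inner")
print(merged)
print("mrna/protein correlation:", merged["mrna_level"].corr(merged["protein_level"]))
```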
one example of statistical data integration is provided by a tcga study that measured various types of omics data to characterize breast cancer [ ]. in this study, breast cancer samples were subjected to whole-genome and whole-exome sequencing and snp arrays to obtain information about somatic mutations, copy number variations, and chromosomal rearrangements. microarrays and rna-seq were used to determine mrna and microrna expression levels, respectively. reverse-phase protein arrays (rppa) and dna methylation arrays were used to obtain data on protein expression levels and dna methylation, respectively. simultaneous statistical analysis of the different data types via a "cluster-of-clusters" approach, using consensus clustering on a multi-omics data matrix, revealed that four major breast cancer subtypes could be identified. this showed that the intrinsic subtypes (basal, luminal a and b, her ) that had previously been determined using gene expression data only could be largely confirmed in an integrated analysis of a large number of breast tumors.

single-level omics data has been used extensively to identify disease-associated biomarkers such as genes, proteins, and metabolites. in fact, these studies led to more than , papers documenting thousands of claimed biomarkers; however, it is estimated that fewer than of these are currently used in routine clinical practice [ ]. integration of multiple omics data types is expected to result in more robust and predictive disease profiles, since these better reflect disease biology [ ]. further improvement of these profiles may be obtained through the explicit incorporation of interrelationships between various types of measurements, such as microrna-mrna targets, or gene methylation-microrna (based on a common target gene). this was demonstrated for the prediction of short-term and long-term survival from serous cystadenocarcinoma tcga data [ ].

according to the recent casym roadmap: "human disease can be perceived as perturbations of complex, integrated genetic, molecular and cellular networks and such complexity necessitates a new approach." [ ]. in this section we discuss how (approximations of) these networks can be constructed from omics data and how these networks can be decomposed into smaller modules. we then discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, lead to predictive diagnostic and prognostic models, and help to further subclassify diseases [ , ] (fig. ). network-based approaches will provide medical doctors with molecular-level support to make personalized treatment decisions.

in a top-down approach the aim of network reconstruction is to infer the connections between the molecules that constitute a biological network. network models can be created using a variety of mathematical and statistical techniques and data types. early approaches for network inference (also called reverse engineering) used only gene expression data to reconstruct gene networks. here, we discern three types of gene network inference algorithms, based on (1) correlation-based approaches, (2) information-theoretic approaches, and (3) bayesian networks [ ]. co-expression networks are an extension of commonly used clustering techniques, in which genes are connected by edges in a network if the correlation of their gene expression profiles exceeds a certain value. co-expression networks have been shown to connect functionally related genes [ ].
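the correlation-threshold construction of a co-expression (or relevance) network described above can be sketched in a few lines; the simulated expression matrix and the 0.8 cutoff are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_samples = 50, 30
expr = rng.normal(size=(n_genes, n_samples))               # rows: genes, columns: samples
expr[1] = expr[0] + 0.1 * rng.normal(size=n_samples)       # make two genes strongly co-expressed

corr = np.corrcoef(expr)                                   # gene-by-gene correlation matrix
threshold = 0.8                                            # edges kept only above this cutoff
edges = [(i, j, round(corr[i, j], 2))
         for i in range(n_genes) for j in range(i + 1, n_genes)
         if abs(corr[i, j]) >= threshold]
print("edges above threshold:", edges[:5])
```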
note that connections in a co-expression network correspond to either direct (e.g., transcription factor-gene and protein-protein) or indirect (e.g., proteins participating in the same pathway) interactions. in one of the earliest examples of this approach, pair-wise correlations were calculated between gene expression profiles and the level of growth inhibition caused by thousands of tested anticancer agents for cancer cell lines [ ]. removal of associations weaker than a certain threshold value resulted in networks consisting of highly correlated genes and agents, called relevance networks, which led to targeted hypotheses for potential single-gene determinants of chemotherapeutic susceptibility.

information-theoretic approaches have been proposed in order to capture nonlinear dependencies assumed to be present in most biological systems, which cannot be captured by correlation-based distance measures. these approaches often use the concept of mutual information, a generalization of the correlation coefficient which quantifies the degree of statistical (in)dependence. an example of a network inference method based on mutual information is aracne, which has been used to reconstruct the human b-cell gene network from a large compendium of human b-cell gene expression profiles [ ]. in order to discover regulatory interactions, aracne removes the majority of putative indirect interactions from the initial mutual-information-based gene network using a theorem from information theory, the data processing inequality. this led to the identification of myc as a major hub in the b-cell gene network and a number of novel myc target genes, which were experimentally validated. whether information-theoretic approaches are in general more powerful than correlation-based approaches is still a subject of debate [ ].

bayesian networks allow the description of statistical dependencies between variables in a generic way [ , ]. bayesian networks are directed acyclic networks in which the edges represent conditional dependencies; that is, nodes that are not connected represent variables that are conditionally independent of each other. a major bottleneck in the reconstruction of bayesian networks is their computational complexity. moreover, bayesian networks are acyclic and cannot capture the feedback loops that characterize many biological networks. when time-series rather than steady-state data is available, dynamic bayesian networks provide a richer framework in which cyclic networks can be reconstructed [ ].

gene (co-)expression data offers only a partial view of the full complexity of cellular networks. consequently, networks have also been constructed from other types of high-throughput data. for example, physical protein-protein interactions have been measured on a large scale in different organisms, including human, using affinity capture-mass spectrometry or yeast two-hybrid screens, and have been made available in public databases such as biogrid [ ]. regulatory interactions have been probed using chromatin immunoprecipitation sequencing (chip-seq) experiments, for example by the encode consortium [ ].
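as a rough illustration of the information-theoretic approach described above, the sketch below estimates mutual information between two expression profiles from a 2-d histogram and notes, in a comment, how aracne-style pruning with the data processing inequality would proceed; a faithful aracne implementation involves considerably more care (kernel estimators, bootstrap thresholds).

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """histogram-based mutual information estimate (in nats)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = x ** 2 + 0.3 * rng.normal(size=200)   # nonlinear dependence missed by pearson correlation
print("mi(x, y) =", round(mutual_information(x, y), 3))

# aracne-style pruning (data processing inequality), in outline:
# for every gene triplet (a, b, c), drop the edge a-c if
# mi(a, c) < min(mi(a, b), mi(b, c)), suggesting the a-c link is indirect.
```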
using probabilistic techniques, heterogeneous types of experimental evidence and prior knowledge have been integrated to construct functional association networks for human [ ], mouse [ ], and, most comprehensively, more than organisms in the string database [ ]. functional association networks can help predict novel pathway components, generate hypotheses about the biological function of a protein of interest, or identify disease-related genes [ ]. prior knowledge required for these approaches is, for example, available in curated biological pathway databases, and via protein associations predicted using text mining based on their co-occurrence in abstracts or even full-text articles. many more integrative network inference methods have been proposed; for a review see [ ]. the integration of gene expression data with chip data [ ] or transcription factor-binding motif data [ ] has proven particularly fruitful for inferring transcriptional regulatory networks. recently, li et al. [ ] described the results of a regression-based model that predicts gene expression using encode (chip-seq) and tcga data (mrna expression data complemented with copy number variation, dna methylation, and microrna expression data). this model infers the regulatory activities of expression regulators and their target genes in acute myeloid leukemia samples. eighteen key regulators were identified, whose activities clustered consistently with cytogenetic risk groups.

bayesian networks have also been used to integrate multi-omics data. the combination of genotypic and gene expression data is particularly powerful, since dna variations represent naturally occurring perturbations that affect gene expression, detected as expression quantitative trait loci (eqtl). cis-acting eqtls can then be used as constraints in the construction of directed bayesian networks to infer causal relationships between nodes in the network [ ].

large multi-omics datasets consisting of hundreds or sometimes even thousands of samples are available for many commonly occurring human diseases, such as most tumor types (tcga), alzheimer's disease [ ], and obesity [ ]. however, a major bottleneck for the construction of accurate gene networks is that the number of gene networks compatible with the experimental data is several orders of magnitude larger still. in other words, top-down network inference is an underdetermined problem with many possible solutions that explain the data equally well, and individual gene-gene interactions are characterized by a high false-positive rate [ ]. most network inference methods therefore try to constrain the number of possible solutions by making certain assumptions about the structure of the network. perhaps the most commonly used strategy to harness the complexity of the gene network inference problem is to analyze experimental data in terms of biological modules, that is, sets of genes that have strong interactions and a common function [ ]. there is considerable evidence that many biological networks are modular [ ]. module-based approaches effectively constrain the number of parameters to estimate and are in general also more robust to the noise that characterizes high-throughput omics measurements. a detailed review of module-based techniques is outside the scope of this chapter (see, for example, [ ]), but we would like to mention a few examples of successful and commonly used modular approaches. weighted gene co-expression network analysis (wgcna) decomposes a co-expression network into modules using clustering techniques [ ]. modules can be summarized by their module eigengene, a weighted average expression profile of all gene members of a given module.
eigengenes can then be correlated with external sample traits to identify modules that are related to these traits. parikshak et al. [ ] used wgcna to extract modules from a co-expression network constructed using fetal and early postnatal brain development expression data. next, they established that several of these modules were enriched for genes and rare de novo variants implicated in autism spectrum disorder (asd). moreover, the asd-associated modules are also linked at the transcriptional level, and transcription factors were found to act as putative co-regulators of asd-associated gene modules during neocortical development. wgcna can also be used when multiple omics data types are available. one example of such an approach involved the integration of transcriptomic and proteomic data from a study investigating the response to sars-cov infection in mice [ ]. in this study, wgcna-based gene and protein co-expression modules were constructed and integrated to obtain module-based disease signatures. interestingly, the authors found several cases of identifier-matched transcripts and proteins that correlated well with the phenotype, but which showed poor correlation or anticorrelation across the two data types. moreover, the highest correlating transcripts and peptides were not the most central ones in the co-expression modules. vice versa, the transcripts and proteins that defined the modules were not those with the highest correlation to the phenotype. at the very least this shows that integration of omics data affects the nature of the disease signatures.

identification of active modules is another important integrative modular technique. here, experimental data in the form of molecular profiles is projected onto a biological network, for example a protein-protein interaction network. active modules are those subnetworks that show the largest change in expression for a subset of conditions and are likely to contain key drivers or regulators of the processes perturbed in the experiment. active modules have, for example, been used to find a subnetwork that is overexpressed in a particularly aggressive lymphoma subtype [ ] and to detect significantly mutated pathways [ ]. some active module approaches integrate various types of omics data. one example of such an approach is paradigm [ ], which translates pathways into factor graphs, a class of models that belongs to the same family as bayesian networks, and determines sample-specific pathway activity from multiple functional genomic datasets. paradigm has been used in several tcga projects, for example in the integrated analysis of urothelial bladder carcinomas [ ]. paradigm-based analysis of copy number variations and rna-seq gene expression, in combination with a propagation-based network analysis algorithm, revealed novel associations between mutations and gene expression levels, which subsequently resulted in the identification of pathways altered in bladder cancer. the identification of activating or inhibiting gene mutations in these pathways suggested new targets for treatment. moreover, this effort clearly showed the benefits of screening patients for the presence of specific mutations to enable personalized treatment strategies.
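the module eigengene construction described above is essentially the first principal component of a module's expression submatrix; the sketch below computes it with an svd on simulated data and correlates it with a sample trait (all values are synthetic).

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_genes = 40, 25
trait = rng.normal(size=n_samples)                        # e.g., a clinical measurement per sample
module = np.outer(trait, rng.uniform(0.5, 1.0, n_genes))  # genes in the module loosely track the trait
module += 0.5 * rng.normal(size=(n_samples, n_genes))

# standardize genes, then take the first left singular vector as the module eigengene
z = (module - module.mean(axis=0)) / module.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
eigengene = u[:, 0] * s[0]

r = np.corrcoef(eigengene, trait)[0, 1]
print("module eigengene vs trait correlation:", round(abs(r), 2))  # sign of the pc is arbitrary
```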
often, published disease signatures cannot be replicated [ ] or provide hardly any additional biological insight. here too, (modular) network-based approaches have been proposed to alleviate these problems. a common characteristic of most methods is that the molecular activity of a set of genes is summarized on a per-sample basis. summarized gene set scores are then used as features in prognostic and predictive models. relevant gene sets can be based on prior knowledge and correspond to canonical pathways, gene ontology categories, or sets of genes sharing common motifs in their promoter regions [ ]. gene set scores can also be determined by projecting molecular data onto a biological network and summarizing scores at the level of subnetworks for each individual sample [ ]. while promising in principle, it is still a subject of debate whether gene set-based models outperform gene-based ones [ ].

the comparative analysis of networks across different species is another commonly used approach to constrain the solution space. patterns conserved across species have been shown to be more likely to be true functional interactions [ ] and to harbor useful candidates for human disease genes [ ]. many network alignment methods have been developed in the past decade to identify commonalities between networks. these methods generally combine sequence-based and topological constraints to determine the optimal alignment of two (or more) biological networks. network alignment has, for example, been applied to detect conserved patterns of protein interaction in multiple species [ , ] and to analyze the evolution of co-expression networks between humans and mice [ , ]. network alignment can also be applied to detect diverged patterns [ ] and may thus lead to a better understanding of similarities and differences between animal models and human in health and disease. information from model organisms has also been fruitfully used to identify more robust disease signatures [ - ]. sweet-cordero and co-workers [ ] used a gene signature identified in a mouse model of lung adenocarcinoma to uncover an orthologous signature in human lung adenocarcinoma that was not otherwise apparent. bild et al. [ ] defined gene expression signatures characterizing several oncogenic pathways of human mammary epithelial cells. they showed that these signatures predicted pathway activity in mouse and human tumors. predictions of pathway activity correlated well with sensitivity to drugs targeting those pathways and could thus serve as a guide to targeted therapies. a generic approach, pathprint, for the integration of gene expression data across different platforms and species at the level of pathways, networks, and transcriptionally regulated targets was recently described [ ]. the authors used their method to identify four stem cell-related pathways conserved between human and mouse in acute myeloid leukemia, with good prognostic value in four independent clinical studies.

we have reviewed a wide array of different approaches showing how networks can be used to elucidate integrated genetic, molecular, and cellular networks. however, in general no single approach will be sufficient, and combining different approaches in more complex analysis pipelines will be required.
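the per-sample gene set scores discussed at the start of this passage can be as simple as the mean z-score of the member genes, after which the scores serve as features in a predictive model; the sets, labels and scoring scheme below are synthetic stand-ins for more elaborate methods (e.g., subnetwork- or single-sample enrichment scoring).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_samples, n_genes = 60, 200
expr = rng.normal(size=(n_samples, n_genes))
labels = rng.integers(0, 2, size=n_samples)               # e.g., good vs poor prognosis

gene_sets = {"set_a": list(range(0, 20)), "set_b": list(range(20, 55))}  # hypothetical sets

z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
set_scores = np.column_stack([z[:, idx].mean(axis=1) for idx in gene_sets.values()])

model = LogisticRegression().fit(set_scores, labels)      # gene set scores used as features
print("training accuracy:", model.score(set_scores, labels))
```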
this need to combine approaches is fittingly illustrated by the diggit (driver-gene inference by genetical-genomics and information theory) algorithm [ ]. in brief, diggit identifies candidate master regulators from an aracne gene co-expression network integrated with copy number variations that affect gene expression. this method combines several previously developed computational approaches and was used to identify causal genetic drivers of human disease in general, and of glioblastoma, breast cancer, and alzheimer's disease in particular. this enabled the identification of klhl deletions as upstream activators of two previously established master regulators in a specific subtype of glioblastoma.

systems medicine is one of the steps necessary to make improvements in the prevention and treatment of disease through systems approaches that will (a) elucidate (patho)physiologic mechanisms in much greater detail than currently possible, (b) produce more robust and predictive disease signatures, and (c) enable personalized treatment. in this context, we have shown that bioinformatics has a major role to play. bioinformatics will continue its role in the development, curation, integration, and maintenance of (public) biological and clinical databases to support biomedical research and systems medicine. the bioinformatics community will strengthen its activities in various standardization and curation efforts that have already resulted in minimum reporting guidelines [ ], data capture approaches [ ], data exchange formats [ ], and terminology standards for annotation [ ]. one challenge for the future is to remove errors and inconsistencies in data and annotation from databases and to prevent new ones from being introduced [ , , - ]. an equally important challenge is to establish, improve, and integrate resources containing phenotype and clinical information. to achieve this objective it seems reasonable that bioinformatics and health informatics professionals team up [ - ]. traditionally, health informatics professionals have focused on hospital information systems (e.g., patient records, pathology reports, medical images), data exchange standards (e.g., hl ), medical terminology standards (e.g., the international classification of disease (icd), snomed), medical image analysis, analysis of clinical data, clinical decision support systems, and so on. bioinformatics, on the other hand, has mainly focused on molecular data, but it shares many approaches and methods with health informatics. integration of these disciplines is therefore expected to benefit systems medicine in various ways [ ].

integrative bioinformatics approaches clearly have added value for systems medicine, as they provide a better understanding of biological systems, result in more robust disease markers, and prevent (biological) bias that could arise from using single-omics measurements. however, such studies, and the scientific community in general, would benefit from improved strategies to disseminate and share data, which typically will be produced at multiple research centers (e.g., https://www.synapse.org ; [ ]). integrative studies are expected to increasingly facilitate personalized medicine approaches, as demonstrated by chen and coworkers [ ]. in their study they presented a -month "integrative personal omics profile" (ipop) for a single individual comprising genomic, transcriptomic, proteomic, metabolomic, and autoantibody data. from the whole-genome sequence data an elevated risk for type diabetes (t d) was detected, and subsequent monitoring of hba c and glucose levels revealed the onset of t d, despite the fact that the individual lacked many of the known non-genetic risk factors. subsequent treatment resulted in a gradual return to the normal phenotype.
this shows that the genome sequence can be used to determine disease risk in a healthy individual and allows the selection and monitoring of specific markers that provide information about the actual disease status. network-based approaches will increasingly be used to determine the genetic causes of human diseases. since the effect of a genetic variation is often tissue- or cell-type-specific, a large effort is needed to construct cell-type-specific networks both in health and disease. this can be done using data already available, an approach taken by guan et al. [ ]. the authors proposed tissue-specific networks in mouse via their generic approach for constructing functional association networks, using low-throughput, highly reliable tissue-specific gene expression information as a constraint. one could also generate new datasets to facilitate the construction of tissue-specific networks. examples of such approaches are tcga and the genotype-tissue expression (gtex) project. the aim of gtex is to create a data resource for the systematic study of genetic variation and its effect on gene expression in more than human tissues [ ].

regardless of how networks are constructed, it will become more and more important to offer a centralized repository where networks from different cell types and diseases can be stored and accessed. nowadays, these networks are difficult to retrieve and are scattered across supplementary files of the original papers and links to accompanying web pages, or are not available at all. a resource similar to what the systems biology community has created with the biomodels database would be a great leap forward. there have been some initial attempts at building databases of network models, for example the cellcircuits database [ ] ( http://www.cellcircuits.org ) and the causal biological networks (cbn) database of networks related to lung disease [ ] ( http://causalbionet.com ). however, these are only small-scale initiatives, and a much larger and coordinated effort is required.

another main bottleneck for the successful application of network inference methods is their validation. most network inference methods to date have been applied to one or a few isolated datasets and were validated using limited follow-up experiments, for example via gene knockdowns, using prior knowledge from databases and literature as a gold standard, or by generating simulated data from a mathematical model of the underlying network [ , ]. however, the strengths and weaknesses of network inference methods across cell types, diseases, and species have hardly been assessed. notable exceptions are collaborative competitions such as the dialogue on reverse engineering assessment and methods (dream) [ ] and the industrial methodology for process verification (improver) [ ]. these centralized initiatives propose challenges in which individual research groups can participate and to which they can submit their predictions, which can then be independently validated by the challenge organizers. several dream challenges in the area of network inference have been organized, leading to better insight into the strengths and weaknesses of individual methods [ ]. another important contribution of dream is that a crowd-based approach integrating predictions from multiple network inference methods was shown to give good and robust performance across diverse data sets [ ].
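the crowd-based integration highlighted by the dream challenges amounts to aggregating edge rankings from several inference methods; a minimal rank-averaging sketch with made-up method names and scores:

```python
import numpy as np

# hypothetical confidence scores for the same candidate edges from three inference methods
edges = ["g1-g2", "g1-g3", "g2-g4", "g3-g4"]
scores = {
    "correlation": [0.9, 0.2, 0.6, 0.4],
    "mutual_info": [0.7, 0.3, 0.8, 0.1],
    "bayesian":    [0.8, 0.5, 0.4, 0.2],
}

# convert each method's scores to ranks (1 = most confident) and average them across methods
ranks = np.mean(
    [len(edges) - np.argsort(np.argsort(s)) for s in scores.values()], axis=0)
consensus = sorted(zip(edges, ranks), key=lambda t: t[1])
print(consensus)  # edges with the lowest average rank form the "community" network
```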
also in the area of systems medicine, challenge-based competitions may offer a framework for independent verification of model predictions.

systems medicine promises a more personalized medicine that effectively exploits the growing amount of molecular and clinical data available for individual patients. solid bioinformatics approaches are of crucial importance for the success of systems medicine. however, really delivering on the promises of systems medicine will require an overall change of research approach that transcends the current reductionist approach and results in a tighter integration of clinical, wet-lab, and computational groups adopting a systems-based approach. past, current, and future successes of systems medicine will accelerate this change.
key: cord- -nml kqiu authors: lhommet, claire; garot, denis; grammatico-guillon, leslie; jourdannaud, cassandra; asfar, pierre; faisy, christophe; muller, grégoire; barker, kimberly a.; mercier, emmanuelle; robert, sylvie; lanotte, philippe; goudeau, alain; blasco, helene; guillon, antoine title: predicting the microbial cause of community-acquired pneumonia: can physicians or a data-driven method differentiate viral from bacterial pneumonia at patient presentation? date: - - journal: bmc pulm med doi: . /s - - -y sha: doc_id: cord_uid: nml kqiu

background: community-acquired pneumonia (cap) requires urgent and specific antimicrobial therapy. however, the causal pathogen is typically unknown at the point when anti-infective therapeutics must be initiated. physicians synthesize information from diverse data streams to make appropriate decisions. artificial intelligence (ai) excels at finding complex relationships in large volumes of data. we aimed to evaluate the abilities of experienced physicians and ai to answer this question at patient admission: is it a viral or a bacterial pneumonia? methods: we included patients hospitalized for cap and recorded all data available in the first -h period of care (clinical, biological and radiological information). for this proof-of-concept investigation, we decided to study only cap caused by a singular and identified pathogen. we built a machine learning prediction model using all collected data. finally, an independent validation set of samples was used to test the pathogen prediction performance of: (i) a panel of three experts and (ii) the ai algorithm. both were blinded regarding the final microbial diagnosis. positive likelihood ratio (lr) values > and negative lr values < . were considered clinically relevant. results: we included patients with cap ( . % men; [ – ] years old; mean sapsii, [ – ]), % had viral pneumonia, % had bacterial pneumonia, % had a co-infection and % had no identified respiratory pathogen. we performed the analysis on patients, as co-pathogen and no-pathogen cases were excluded.
the discriminant abilities of the ai approach were low to moderate (lr+ = . for viral and . for bacterial pneumonia), and the discriminant abilities of the experts were very low to low (lr+ = . for viral and . for bacterial pneumonia). conclusion: neither experts nor an ai algorithm can predict the microbial etiology of cap within the first hours of hospitalization, when there is an urgent need to define the anti-infective therapeutic strategy.

the world health organization (who) estimates that, due to antimicrobial resistance, bacterial infections will outcompete any cause of death by [ ], meaning that there is an urgent need for new strategies to improve antibiotic treatments. the agency for healthcare research and quality (ahrq) safety program for improving antibiotic use recently proposed a structured approach to improve antibiotic decision making by clinicians, which emphasizes the critical time points in antibiotic prescribing [ , ]. the first time point of this organized approach requires the physician to ask: "does this patient have an infection that requires antibiotics?". this question aims to remind the clinician to synthesize all relevant patient information to determine the likelihood of an infection that requires antibiotic therapy. the questionable ability of physicians to answer this first question properly in the context of pneumonia was the impetus for this study.

community-acquired pneumonia (cap) is a major global healthcare burden associated with significant morbidity, mortality and costs [ ] [ ] [ ] [ ] [ ] [ ]. identifying the etiology of cap is an utmost priority for its management and treatment decisions [ ]. although the range of pathogens that may be involved in these cases is broad, physicians must at least determine whether a bacterial or a viral pathogen (or both) is causing the pneumonia to determine if antibiotic treatment is appropriate. whether the etiology of cap is viral or bacterial should be determined based on the patient interview, clinical symptoms and signs, biological findings and radiological data from the very first hours of the patient's presentation (a time when microbiological findings are typically not yet available). physicians must use the knowledge obtained from their routine practice and medical education to make sense of these diverse data input streams, triage the resulting complex dataset, and make appropriate decisions. a growing body of research has recently suggested that difficulties in accessing, organizing, and using substantial amounts of data could be significantly ameliorated by use of emerging artificial intelligence (ai)-derived methods, which are nowadays applied in diverse fields including biology, computer science and sociology [ ]. ai excels at finding complex relationships in large volumes of data and can rapidly analyze many variables to predict outcomes of interest. in the context of cap in intensive care units (icus), where information is particularly diverse, we wondered whether an ai data-driven approach to reducing the medical complexity of a patient could allow us to make a better hypothesis regarding the microbial etiology at the patient's presentation. the aim of our study was to evaluate and compare the abilities of experienced physicians and a data-driven approach to answer this simple question within the first hours of a patient's admission to the icu for cap: is it a viral or a bacterial pneumonia?

this study was conducted in two steps.
first, we performed prospective data collection (step ); second, we retrospectively assessed the microbial etiology prediction performances of experienced physicians (more than years' experience) and of a computational data-driven approach on this dataset (step ).

step : patient data collection
prospective data collection was conducted in a single center over an -month period. the study complied with french law for observational studies, was approved by the ethics committee of the french intensive care society (ce srlf ), and was approved by the commission nationale de l'informatique et des libertés (cnil) for the treatment of personal health data. we gave written and oral information to patients or next-of-kin. patients or next-of-kin gave verbal informed consent, as approved by the ethics committee. eligible patients were adults hospitalized in the icu for cap. pneumonia was defined as the presence of an infiltrate on a chest radiograph and one or more of the following symptoms: fever (temperature ≥ . °c) or hypothermia (temperature < . °c), cough with or without sputum production, or dyspnea or altered breath sounds on auscultation. community-acquired infection was defined as infection occurring within h of admission. cases of pneumonia due to inhalation or infection with pneumocystis, pregnant women and patients under guardianship were not included. cases with pao ≥ mmhg in ambient air, or with a need for oxygen therapy ≤ l/min, or without mechanical ventilation (invasive or non-invasive) were not included. baseline patient information was collected at case presentation through in-person semi-structured interviews with patients or surrogates (see supplementary table ). observations from the physical examination at presentation, including vital signs and auscultation of the lungs, were recorded. findings of biological tests done at presentation (within the first three-hour period) were also recorded (hematology and chemistry tests), as were findings from chest radiography. two physicians interpreted the chest x-rays; a third physician reviewed the images in cases of disagreement in interpretation. microbiological investigations included blood cultures, pneumococcal and legionella urinary antigen tests, bacterial cultures, and multiplex pcr respifinder smart ® (pathofinder b.v., oxfordlaan, netherlands) analyses on respiratory fluids (sputum and/or nasal wash and/or endotracheal aspirate and/or bronchoalveolar lavage [bal]).

step : clinician and data-driven predictions of microbial etiology
clinicians and a mathematical algorithm were tasked with predicting the microbial etiology of pneumonia cases based on all clinical ( items) and biological or radiological ( items) information available in the first -h period after admission, except for any microbiological findings (supplementary table ). for this proof-of-concept investigation, we decided to study only cap caused by a singular and identified pathogen; cases of cap with mixed etiology or without microbiological documentation were excluded. from the initial dataset of patients, we randomly generated two groups (prior to any analysis): (i) a work dataset ( % of the initial dataset) dedicated to construction of the mathematical model and training of the experts; (ii) an external validation dataset ( % of the initial dataset) dedicated to testing the prediction performances. the methodology used is summarized in fig. a.
an external three-member expert panel reviewed the work dataset to familiarize themselves with the dataset containing the patient characteristics. then, the experts were asked to predict the microbial etiologies in the external validation dataset (fig. a). the clinicians had to answer the question: is it a viral or a bacterial pneumonia? they were also asked to give a confidence index regarding the accuracy of their answer: (very low), (low), (moderate), (high). agreement of at least two of the three experts was required for the final predicted etiology.

the data were analyzed using an ai method (fig. b) involving a logistic regression analysis with forward stepwise inclusion. this method was employed to optimize the ability of the algorithm to distinguish viral and bacterial pneumonia based on the combination of parameters available in the work dataset. all available data were thus included in the model, regardless of the data type. qualitative data were processed as binary information (i.e. influenza immunization: present " ", absent " "). raw data were provided for quantitative values (no cut-offs defined). we built the predictive mathematical model from the work dataset using the random forest method and leave-one-out cross-validation. we started by determining the most relevant items to use through a variable selection procedure using the random forest method and the mean decrease in gini criterion (value . ). then, the population in the work dataset was randomly separated into two independent datasets: % of cases were assigned to the training set and % were assigned to the test set. n models with bootstrap resampling (with n = ) were built on the training set and validated on the test set. the model providing the best prediction criteria was selected, and the final model was built from the entire work dataset. finally, an independent validation set of samples was used to test the pathogen prediction performance of the ai algorithm. to decipher the relative importance of clinical versus biological/radiological variables in the predictions, we generated three algorithms built from different subsets of the work dataset: (i) clinical variables only, (ii) biological and radiological variables only, and (iii) all variables.

for each parameter tested, the area under the roc curve (auc) was calculated, and the best cutoff value that yielded the highest accuracy was determined, along with the sensitivity and specificity. we compared the concordance between the predictions and the final microbial etiologies for the experts and for the algorithm, and calculated sensitivity, specificity, positive predictive value (ppv), negative predictive value (npv) and likelihood ratios (lrs) for the predictions [ ]. given the importance of this diagnostic prediction in the patient's therapeutic management, we determined that the discriminant properties should be "high" (lr+ > and/or lr- < . ) for the prediction to be considered useful for clinical practice [ , ]. table summarizes the lr cutoff values defining the discriminant properties of the predictions [ ]. quantitative data are reported as the median value and interquartile range (iqr). statistical analyses were done with jmp software (sas, version . ).
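as a rough analogue of the model-building pipeline described above (gini-based variable selection, train/test split, bootstrap model selection and auc evaluation), here is a scikit-learn sketch on synthetic data; it is not the authors' exact workflow, and the feature count, importance cutoff and labels are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
n_patients, n_features = 150, 40
X = rng.normal(size=(n_patients, n_features))   # stand-ins for clinical/biological/radiological items
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n_patients) > 0).astype(int)  # 1 = bacterial (toy label)

# variable selection: keep features whose mean decrease in gini exceeds a cutoff (cutoff is arbitrary)
selector = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
keep = selector.feature_importances_ > 0.02
X_sel = X[:, keep]

# simple train/test split, then evaluation of the discrimination by auc on the held-out set
X_train, X_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
print("held-out auc:", round(roc_auc_score(y_test, probs), 2))
```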
finally, we performed the analysis on patients as co-pathogen and no-pathogen cases were excluded. the patient selection flow chart is presented in fig. . the characteristics of the patients according to microbial diagnosis are detailed in table .

experts had "high" confidence in their predicted etiology only . % of the time. confidence levels were typically "moderate" ( . %) or "low" ( . %), but never "very low". all three experts agreed in . % of the cases. correct predictions were made . % of the time. the clinician predictions had a sensitivity of . , specificity of . , ppv of . and npv of . for the diagnosis of bacterial pneumonia ( table ). the lr+ for diagnosing a viral pneumonia was . , and the corresponding lr- was . . the lr+ for diagnosing a bacterial pneumonia was . , and the corresponding lr- was . . therefore, the discriminant abilities of experienced physicians to distinguish viral and bacterial etiologies for pneumonia were categorized as very low to low (according to defined cutoff values for the interpretation of likelihood ratios, see table ).

fig. (legend): schematic representation of the study methodology. a: we built an initial dataset from all sources of information available in the first h of the patient's presentation in the icu for cap. we matched these presenting cases with their final identified causal respiratory pathogen. the initial dataset was randomly split into a work dataset, used for the machine learning and training the icu experts on how the data were presented, and an external validation dataset used to assess the prediction performances of the artificial intelligence (ai) algorithm and the panel of experts. b: data flow to engineer the data-driven algorithm.

predictions by the data-driven algorithms generated from clinical data alone resulted in an roc curve with a corresponding auc of . . predictions by the data-driven algorithms generated from biological and radiological data alone resulted in an roc curve with an auc of . . finally, predictions generated from the dataset that included all data sources outperformed the other algorithms and resulted in an roc curve with an auc of . (table , fig. ). this model based on the more inclusive dataset was considered the final model for comparison with the expert panel. the final algorithm made predictions with a sensitivity of . , specificity of . , ppv of . and npv of . for the diagnosis of bacterial pneumonia. the lr+ for diagnosing a viral pneumonia was . , and the corresponding lr- was . . the lr+ for diagnosing a bacterial pneumonia was . , and the corresponding lr- was . . consequently, the discriminant abilities of the data-driven algorithm to distinguish viral and bacterial etiologies for pneumonia were categorized as low to moderate (according to defined cutoff values for the interpretation of likelihood ratios, see table ).

addressing antimicrobial resistance requires investment in several critical areas, the most pressing of which is the ability to make rapid diagnoses to promote appropriate anti-infective therapeutics and limit unnecessary antibiotic use. here, we set up a pilot study and demonstrated that neither experts nor a mathematical algorithm could accurately predict the microbial etiology of severe cap within the first hours of hospitalization, when there is an urgent need to define the appropriate anti-infective therapeutic strategy.
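the performance measures reported above all derive from a 2×2 table of predictions against the final microbial etiology; a minimal sketch of the standard formulas, with invented counts that are not study data:

```python
# Sketch of the diagnostic metrics used above (sensitivity, specificity, PPV,
# NPV, LR+ and LR-) for a "bacterial pneumonia" prediction. The 2x2 counts
# below are invented placeholders, not values from the study.
tp, fp, fn, tn = 30, 10, 8, 40   # predicted-bacterial vs. final etiology (placeholders)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
lr_pos = sensitivity / (1 - specificity)   # LR+ : how much a positive call raises the odds
lr_neg = (1 - sensitivity) / specificity   # LR- : how much a negative call lowers the odds

print(f"sens={sensitivity:.2f} spec={specificity:.2f} "
      f"ppv={ppv:.2f} npv={npv:.2f} LR+={lr_pos:.2f} LR-={lr_neg:.2f}")
```

under the criterion adopted in the study, a prediction would be considered clinically useful only when the lr+ is high enough, and/or the lr- low enough, to move the pre-test probability decisively.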
we encoded all information available in the first hours after admission for a large cohort comparable with other published cohorts in terms of the distribution of causal microbial pathogens, patient characteristics and severity of disease [ ] [ ] [ ] . we demonstrated that experienced clinicians synthesizing all this information failed to adequately answer the question: "is it a viral or a bacterial pneumonia?", as the discriminant ability between the two diagnoses was considered low. we interpreted our results mainly based on the calculation of likelihood ratios, as recommended for reports of a diagnostic test for an infectious disease [ ] . likelihood ratios incorporate both sensitivity and specificity and, unlike predictive values, do not vary with prevalence, making them good statistical tools to facilitate translation of knowledge from research to clinical practice [ ] . in parallel, we designed a data-driven approach. different ai methods were available; we selected the random forest method because it is one of the most efficient strategies for providing a predictive algorithm in this context [ ] [ ] [ ] [ ] . importantly, the final algorithm was tasked with providing predictions for a novel population independent of the dataset used for the algorithm construction. the discriminant abilities of the ai approach restricted to the binary choice "viral" or "bacterial" were superior to those of experts but still considered low or moderate and were ultimately insufficient to provide an indisputable therapeutic decision. it is important to emphasize that we chose a high cutoff value for determining the discriminant ability of the ai approach (lr+ > , lr-< . ); this choice was made for two reasons. first, in this proof-ofconcept study, we did not analyze co-infections and restricted the possible choices to a binary prediction. because we reduced the complexity of the cases, we expected high predictive performances. second, the goal of this study was not a prediction of outcomes (e.g., icu length of stay, mortality), which are informative but do not determine patient management; it was to provide a clear and immediate medical decision: whether or not to prescribe antibiotics. the immediate clinical consequences in this situation demand a high predictive performance. still, it is important to highlight that the machine learning method we developed achieved an auc of . , which is superior or at least equal to auc values usually observed for predictive mathematical models developed for the icu environment. for instance, the systemic inflammatory response syndrome (sirs) criteria, the simplified acute physiology score ii (saps ii) and the sequential organ failure assessment (sofa) have auc values of . , . and . , respectively, for identifying sepsis [ ] . an ai sepsis expert algorithm for early prediction of sepsis has been engineered and achieved auc values ranging from . - . according to the time of the prediction. an ai method for predicting prolonged mechanical ventilation achieved an auc of . [ ] . how can it be that ai or machine-learning predictive algorithms that can already automatically drive cars or successfully understand human speech failed to predict the microbial cause of pneumonia accurately? first, having data of excellent quality is critical for the success of ai predictions. the icu environment is data-rich, providing fertile soil for the development of accurate predictive models [ ] , but it is also a challenging environment with heterogeneous and complex data. 
in our study, the data that fueled the ai method were from a real-world data source. it is probably more difficult to create a consistent data format when merging data from interviews, patient examinations, biological and radiological information than when using datasets from the insurance or finance industries. additionally, data arising from patient examinations and interviews are still strictly dependent on the physician's skill and experience. finally, although we hypothesized that the ai capabilities would exceed human skills and make accurate predictions when physicians cannot, we must also consider the null hypothesis: viral and bacterial pneumonias share the same characteristics and cannot be distinguished based on initial clinical, biological or radiological parameters. the dividing lines between the signs and symptoms of a viral versus a bacterial infection could be too blurry to permit the two diagnoses to be discerned without microbial analyses. our results emphasize the need to use a rapid turnaround time system for the accurate identification of respiratory pathogens from patient specimens. utilizing rapid molecular respiratory panel assays may increase the likelihood of optimal treatment of acute respiratory infections [ ] [ ] [ ] [ ] [ ] . however, antibiotic consumption was not reduced by the use of a molecular point-of-care strategy in adults presenting with acute respiratory illness in a large randomized controlled trial [ ] . it seems that we are experiencing a switch in perspectives regarding microbial diagnoses of respiratory infections: physicians are used to dealing with an absence of information, but they will likely be overloaded with information in the near future [ ] . the positive detection of respiratory viruses may or may not be useful for the immediate management of a patient [ ] . thus, the development of molecular point-of-care analysis techniques will not lessen the usefulness of our ai strategy. on the contrary, we believe that ai could be a great help in dealing with information overload, which could soon be a common problem. ai methods should not be viewed as ways to replace human expertise but rather as catalysts that accelerate human expertise-based analyses of data. ai methods can assist-rather than replace-in clinical decision-making by transforming complex data into more actionable information. further studies are needed to assess if ai system integrated with point-of-care rapid molecular respiratory panel assays could be a useful addition for the clinician. ultimately, randomized controlled trial should determine the effect of this strategy on the decision making regarding antibiotic use. our study should be interpreted in the context of several limitations. first, this was a proof-of-concept study, and we excluded cases of cap with mixed etiology or without microbiological documentation. consequently, the results were obtained from artificially dichotomized situations (viral or bacterial pneumonia, patients in total) and cannot be directly extrapolated to real-life practice. moreover, we did not include cases of acute pneumonia with non-infectious origins. second, the experts were asked to make their predictions based on case reports exhaustively described in excel files. they did not have the opportunity to interview or directly examine the patients themselves. furthermore, the experts' predictions were not performed in "real-life" situation. this could have affected the experts' predictive performance. 
third, we cannot rule out the possibility that some bacterial or viral pneumonia cases were misdiagnosed. we relied on stateof-the-art methods for microbial discovery, but it is possible that our current technology is sometimes suboptimal for detecting respiratory microbial pathogens. neither a panel of experts nor a data-driven approach could accurately distinguish viral from bacterial pneumonia within the first hours of patient admission in icu for cap. the heterogeneous and complex data generated in the icu environment are likely difficult to use to generate an ai algorithm with a high predictive quality. the results of our pilot study at least highlight that we should not treat machine learning and data science as crystal balls for making predictions and automating decision-making; we should rather use these techniques to more critically examine all available information and enhance existing human expertise. supplementary information accompanies this paper at https://doi.org/ . /s - - -y. additional file : supplementary table s . prospective data collection of elements available in the first -hour period after admission. clinicians and a mathematical algorithm were tasked with predicting the microbial etiology of pneumonia cases based on this information. qualitative data were processed as binary information (i.e. influenza immunization: present " ", absent " "). raw data were provided for quantitative values (no cut-offs defined). rethinking how antibiotics are prescribed: incorporating the moments of antibiotic decision making into clinical practice long-term prognosis in communityacquired pneumonia respiratory infection and the impact of pulmonary immunity on lung health and disease infectious disease mortality trends in the united states trends in infectious disease mortality in the united states during the th century years lived with disability (ylds) for sequelae of diseases and injuries - : a systematic analysis for the global burden of disease study ten-year trends in intensive care admissions for respiratory infections in the elderly use of tracheal aspirate culture in newly intubated patients with community-onset pneumonia data-driving methods: more than merely trendy buzzwords? 
ann intensive care clinical prediction rules in staphylococcus aureus bacteremia demonstrate the usefulness of reporting likelihood ratios in infectious diseases indices de performance diagnostique rapid detection of bacterial meningitis using a point-of-care glucometer aetiology of lower respiratory tract infection in adults in primary care: a prospective study in european countries viral infection in community-acquired pneumonia: a systematic review and meta-analysis systematic review of respiratory viral pathogens identified in adults with community-acquired pneumonia in europe integration of scheimpflugbased corneal tomography and biomechanical assessments for enhancing ectasia detection congestive heart failure detection via short-time electrocardiographic monitoring for fast reference advice in urgent medical conditions applying artificial intelligence to identify physiomarkers predicting severe sepsis in the picu multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach using artificial intelligence to predict prolonged mechanical ventilation and tracheostomy placement artificial intelligence in the intensive care unit clinical impact of rt-pcr for pediatric acute respiratory infections: a controlled clinical trial impact of a rapid respiratory panel test on patient outcomes implementation of filmarray respiratory viral panel in a core laboratory improves testing turnaround time and patient care routine molecular point-ofcare testing for respiratory viruses in adults presenting to hospital with acute respiratory illness (respoc): a pragmatic, open-label, randomised controlled trial impact of multiplex molecular assay turn-around-time on antibiotic utilization and clinical management of hospitalized children with acute respiratory tract infections impact on the medical decisionmaking process of multiplex pcr assay for respiratory pathogens acute respiratory distress syndrome secondary to human metapneumovirus infection in a young healthy adult publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations the authors thank veronique siméon, christine mabilat, aurélie aubrey, delphine chartier, and all the physicians of the tours intensive care department for collecting the data. f- tours cedex , france. authors' contributions agu, cl, hb and dg conceived and designed the study, and wrote the manuscript. cl, dg, em, agu performed the prospective inclusion of the patients. sr, pl, ago performed the microbial analysis. cj, hb performed the ai algorithm. pa, cf, gm, kab and lgg made substantial contribution to analysis, the conception of the study and to the draft of the manuscript. all authors read and approved the version to be published. we have no sources of support to declare. the datasets used and/or analysed during the current study are available from the corresponding author on reasonable request. the study complied with french law for observational studies, was approved by the ethics committee of the french intensive care society (ce srlf [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] , was approved by the commission nationale de l'informatique et des libertés (cnil) for the treatment of personal health data. we gave written and oral information to patients or next-of-kin. 
patients or next-of-kin gave verbal informed consent, as approved by the ethic committee. patients or next-of-kin gave verbal informed consent for publication, as approved by the ethic committee. the authors declare that they have no competing interests.author details chru tours, service de médecine intensive réanimation, bd tonnellé, key: cord- -t aufs authors: aurrecoechea, cristina; barreto, ana; basenko, evelina y.; brestelli, john; brunk, brian p.; cade, shon; crouch, kathryn; doherty, ryan; falke, dave; fischer, steve; gajria, bindu; harb, omar s.; heiges, mark; hertz-fowler, christiane; hu, sufen; iodice, john; kissinger, jessica c.; lawrence, cris; li, wei; pinney, deborah f.; pulman, jane a.; roos, david s.; shanmugasundram, achchuthan; silva-franco, fatima; steinbiss, sascha; stoeckert, christian j.; spruill, drew; wang, haiming; warrenfeltz, susanne; zheng, jie title: eupathdb: the eukaryotic pathogen genomics database resource date: - - journal: nucleic acids res doi: . /nar/gkw sha: doc_id: cord_uid: t aufs the eukaryotic pathogen genomics database resource (eupathdb, http://eupathdb.org) is a collection of databases covering + eukaryotic pathogens (protists & fungi), along with relevant free-living and non-pathogenic species, and select pathogen hosts. to facilitate the discovery of meaningful biological relationships, the databases couple preconfigured searches with visualization and analysis tools for comprehensive data mining via intuitive graphical interfaces and apis. all data are analyzed with the same workflows, including creation of gene orthology profiles, so data are easily compared across data sets, data types and organisms. eupathdb is updated with numerous new analysis tools, features, data sets and data types. new tools include go, metabolic pathway and word enrichment analyses plus an online workspace for analysis of personal, non-public, large-scale data. expanded data content is mostly genomic and functional genomic data while new data types include protein microarray, metabolic pathways, compounds, quantitative proteomics, copy number variation, and polysomal transcriptomics. new features include consistent categorization of searches, data sets and genome browser tracks; redesigned gene pages; effective integration of alternative transcripts; and a eupathdb galaxy instance for private analyses of a user's data. forthcoming upgrades include user workspaces for private integration of data with existing eupathdb data and improved integration and presentation of host–pathogen interactions. a unique infrastructure and search strategy system distinguish the eukaryotic pathogen database resource (eupathdb, http://eupathdb.org) from other organism databases. the power of eupathdb lies in the ability to query across hundreds of data sets while refining a set of genes, proteins, pathways or organisms of interest. the interface is designed for easy mastery by biological researchers, enabling in silico experiments that interrogate diverse and complex data sets. despite the sophisticated strategy system, browsing gene pages and genomic spans or regions remains a simple and informative task in this innovative and valuable resource. eupathdb facilitates the discovery of meaningful biological relationships between genomic features such as genes or snps by integrating pre-analyzed data with sophisticated data mining, visualization and analysis tools that are designed to be used by wet-bench researchers. 
organized into free, online databases eupathdb supports over eukaryotic pathogens with genomic sequence and annotation, functional genomics data, host-response data, isolate and population data and comparative genomics. table provides a web address and a link to a list of organisms supported for each database. all databases are built with the same infrastructure and use the strategies web development kit ( ) , which provides a graphical interface for building complex search strategies and exploring relationships across data sets and data types ( figure ; strategy http://plasmodb.org/plasmo/im.do?s= b dd c ). as one of four national institute of allergy and infectious disease (niaid/nih) funded bioinformatics resource centers ( - ) eupathdb provides data, tools and services to scientific communities researching pathogens in the niaid list of emerging and re-emerging infectious diseases which includes niaid category a-c priority pathogens and many fungi. additional eupathdb support for the kinetoplastid and fungal research communities is funded by the wellcome trust in collaboration with genedb ( ), including support for focused curated annotation. this manuscript describes expanded content, features and tools added since that increase the data mining and discovery power of eupathdb. over the past years, eupathdb has routinely updated existing databases and added two new databases. we added new data, expanded the range of supported data types, enhanced infrastructure and added new analysis tools. eupathdb resources have been expanded to include fungidb (http://fungidb.org) ( ) , which supports fungi and oomycetes, and hostdb (http://hostdb.org), for interrogation of host responses to infection. hostdb supports host data obtained during infections by organisms supported by eupathdb's parasite lineage-specific databases. minot et al. ( ) , for example, infected murine macrophages with toxoplasma gondii strains and collected mixed parasitehost samples for rna sequencing. reads that align to the t. gondii genome are integrated into toxodb whereas hostdb houses those sequencing reads that align to the m. musculus genome. because all eupathdb databases employ the same data analysis pipelines, search strategy system, visualization and analysis tools, the t. gondii and m. musculus data can be compared. for example, one can easily identify parasite genes that are differentially expressed between two t. gondii strains from toxodb as well as host genes that are differentially expressed during infection with the same two strains from hostdb. enrichment analyses and comparison of these lists offers insights into host-pathogen interactions and responses. eupathdb tools are conceived and designed to reduce analysis barriers, enhance data mining and improve communication within and between the scientific communities we serve. the near-seamless integration of strategy results with tools for functional enrichment analyses and transcript interpretation as well as our new galaxy workspace and the availability of publicly shared strategies augment the data mining experience in eupathdb. galaxy workspace. eupathdb sites now include a galaxy-based ( ) workspace for large-scale data analyses, e.g. rna-seq read mapping to a reference genome. developed in partnership with globus genomics ( ), workspaces offer a private analysis platform with published workflows and pre-loaded annotated genomes for the organisms we support. 
the workspace is accessed through the 'analyze my experiment' (figure a ) tab on the home page of any eupathdb resource and can be used to upload your own data e.g. rna-seq reads, compose and run preconfigured or custom workflows ( figure b and c), retrieve your results, visualize them in eupathdb ( figure d ), and share workflows and data analysis results with colleagues. explore transcript subsets. transcript subsets occur when a multi-transcript gene has at least one transcript that does not meet the search criteria. for example, signal peptides are short sequences at the n-terminus of secretory proteins and eupathdb predicts signal peptides for all annotated genomes using signalp ( ) . the predicted signal peptide search returns genes and transcripts with predicted signal peptides. if one transcript of a multi-transcript gene excludes the exon containing the signal peptide, the search returns the gene but not the signal peptide-deficient transcript. searches and strategies that query transcript-specific data ( figure a ; strategy http://plasmodb.org/plasmo/im. do?s= df f e) are equipped with an explore tool for interrogating or filtering transcript subsets. the explore tool appears in the gene results tab above the table of ids ( figure c ) and offers filters for transcripts based on their inclusion in the result set. filters are applied to the strategy result and update the gene result list. for two-step strategies where both steps query transcript specific data, the explore tool offers further filters for viewing transcripts that were returned by both searches, either search or neither search. enrichment analyses. gene ontology, metabolic pathway and word enrichment analyses are available for gene strategy results to aid with their interpretation ( figure f ). these functional analyses apply the fisher's exact test to determine over-represented pathways, ontology terms and product description terms. clicking the analyze results tab of any gene strategy result ( figure e ) and selecting an enrichment analysis will open an analysis tab where users are prompted for parameter values. the results of an enrichment analysis are presented in tabular form and include a list of enriched go terms, pathways or product description words and associated data. public strategies. strategies marked as public when saved to a user's profile will also be shared with the community in the 'public strategies' tab of the 'my strategies' interface. users control the availability of the strategy and can remove it at any time. the panel also includes example strategies provided by eupathdb. data sets search tool. each data set integrated into eu-pathdb is documented with a data set record which contains information about the data including a description, contact information for the investigator that generated the data, literature references, and when available, example graphs and links to searches and genome browser tracks. links to data set records appear on gene pages and on search pages beneath the parameters. a searchable table of all data sets is available from the data summary tab in the gray drop-down menu bar. eupathdb's philosophy is to provide a data mining platform that allows users to ask their own questions in support of hypothesis driven research. the extensive range of data types (genomic, transcriptomic, proteomic, metabolomic, etc.) maintained by eupathdb broadens the user's ability to mine extensively by providing multiple forms of experimental evidence to interrogate. 
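the go, metabolic pathway and word enrichment analyses mentioned above rest on fisher's exact test applied to a 2×2 contingency table (result genes versus background genes, annotated versus not annotated with a term). a minimal sketch of that test with invented counts; this illustrates the statistic only, not eupathdb's internal implementation:

```python
# Sketch of an over-representation test as described above: Fisher's exact
# test on a 2x2 table. Counts are invented placeholders, e.g. 40 of 200
# result genes carry a GO term that annotates 300 of 5000 background genes.
from scipy.stats import fisher_exact

result_with_term = 40
result_without_term = 200 - 40
background_with_term = 300 - 40                    # term-annotated genes outside the result
background_without_term = (5000 - 300) - (200 - 40)

table = [[result_with_term, result_without_term],
         [background_with_term, background_without_term]]
odds_ratio, p_value = fisher_exact(table, alternative="greater")  # test for enrichment
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3g}")
```

in practice such a test is run for every term, so a multiple-testing correction (e.g. benjamini–hochberg) would normally be applied across the full list of terms.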
as the omics world expands, eupathdb endeavors to support meaningful data types and has expanded its coverage over the past few years. the addition of fungidb, for example, brought many genomes from this large and diverse research community. updates to eupathdb's reflow workflow system ( ) make it possible to quickly and reliably analyze and load data. thus, over the past years, numerous functional data sets have been loaded. data sets of interest can be located with the data set search tool described above.

protein microarray. this new data type offers a measure of host response to infection by revealing pathogen-specific antibodies in host serum or plasma samples. a typical data set includes data from serum samples collected from patients during an infection (or from healthy controls) that were hybridized to arrays spotted with possible pathogen antigens (peptides representing gene products) ( ) ( ) ( ) ( ) . searches that query this data type are classified under immunology, and graphs of a pathogen gene's antigenicity for each sample appear on gene pages. the searches employ the filter parameter for selecting samples based on clinical characteristics of patients when configuring the search ( ) .

metabolic pathways. pathways are integrated from metacyc, kegg, trypanocyc and leishcyc ( ) ( ) ( ) ( ) ( ) as networks of enzymatic reactions and substrate/product compounds. genes are mapped to pathways based on ec numbers. pathway record pages feature a cytoscape image which can be 'painted' with experimental data, e.g. gene expression values or ortholog profiles. for easy transition to functional analysis, gene search results can be converted to pathways using the transform to pathways function in the add step popup, or users can run a pathways enrichment analysis of their gene result to identify pathways that are statistically enriched.

compounds. compound records are integrated from the chemical entities of biological interest (chebi) database ( ) and associated with genes through metabolic pathway mappings. lists of compounds are returned based on molecular weight or formula, compound id, enzyme ec number and text. lists of genes and metabolic pathways can be transformed into their associated compounds using the transform function.

a genome-wide loss of function screen using crispr technology is available in toxodb and provides a measure of a gene's contribution to parasite fitness ( ) .

quantitative proteomics. this new data type provides evidence for differential protein expression from experimental methods such as silac ( , ) . the searches appear under proteomics, quantitative mass-spec evidence, and return genes based on the fold change in protein expression between samples. gene pages include graphs of these data when available.

copy number variation. whole genome resequencing data are used to estimate chromosome and gene copy number in re-sequenced strains ( ) . the median read depth is set to the organism's ploidy and each chromosome's median read depth is normalized to this value. contigs that are not assigned to chromosomes are excluded from this analysis. gene copy number is similarly calculated using a normalized read depth for each gene. to compare the number of genes in the re-sequenced genome to the reference genome, genes are grouped into clusters that are inferred to have originated by duplication.
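a hedged sketch of the read-depth normalization just described: the genome-wide median depth is anchored to the organism's ploidy, and per-chromosome or per-gene depths are scaled against it. the depths and ploidy below are invented placeholders, and eupathdb's actual pipeline may differ in detail.

```python
# Sketch of copy-number estimation from read depth, as described above.
# Depth values and ploidy are invented placeholders.
ploidy = 2
genome_median_depth = 50.0                       # median read depth over the whole genome
gene_median_depth = {"geneA": 48.0, "geneB": 103.0, "geneC": 26.0}

depth_per_copy = genome_median_depth / ploidy    # depth expected for a single copy

for gene, depth in gene_median_depth.items():
    copy_number = depth / depth_per_copy
    print(f"{gene}: estimated copy number ≈ {copy_number:.1f} "
          f"(rounded: {round(copy_number)})")
```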
searches are categorized under genetic variation and either return genes with a certain copy number, or genes with different copy numbers between strains. polysomal transcriptomics. rna-sequencing of polysome or ribosome associated transcripts reveals potential translation events. data sets of this data type are available in plasmodb ( , ) and trytripdb ( ) . categorized under transcriptomics, rna seq evidence, the searches against this new data type return genes with differential translation potential (fold change search) or genes within a certain percentile rank within a sample. expression graphs and rna sequencing coverage plots are available statically in gene pages and dynamically in gbrowse. these coverage plots provide evidence for the cds and translational start site usage. metadata. biological sample characteristics such as host clinical parameters for pathogen isolates or blood samples offer valuable information for stratifying samples while configuring searches. eupathdb integrates metadata when available and presents it in the filter parameter interface to take advantage of the rich data type when selecting samples for data mining (see below). the most recent eupathdb release represents significant updates to the underlying data and infrastructure. in addition to refreshing all data to the latest versions, we added workspaces, redesigned our gene pages, incorporated alternative transcripts into gene pages and searches, updated search categories and contemporized the rna sequence analysis workflow. categories. searches, the experimental data sets they query, and genome browser tracks for visualization are now displayed with a common logic across the websites. the categories are based on the embrace data & methods ontology (edam) ( ) , which relates biological concepts with bioinformatic analyses. the result is a logical, consistent menu structure from home page to gene page to genome browser. for example, the category names and order in the home page 'search for genes' (figure b) is the same as the 'contents' section of the gene page ( figure c ). eupathdb's extensive record system documents integrated data and analysis results for entities such as genes, genomic sequences, snps, isolates, compounds and metabolic pathways. record pages have a new streamlined look, contain improved navigation tools, and are reorganized to reflect edam-based categories ( figure ) . to view the gene page for pf d , autophagy-related protein , putative that is highlighted in figure , go to http://plasmodb.org/plasmo/app/record/ gene/pf d . for example, in gene record pages, gene ids and product descriptions are prominently displayed in the upper left corner of the page with other pertinent gene information and links directly below ( figure a ). also at the top of the page are 'shortcuts' ( figure b ) which serve two functions--clicking on the shortcut's magnifying glass icon offers a larger view of the data, while clicking on the image (or its title) navigates to the data within the gene page. 'view in genome browser' links (e.g. above and below the gene models image in figure d ) accompany data that are also available for dynamic viewing in the genome browser. these links open the genome browser (gbrowse) ( ) with the pertinent data track added to the user's current browser session. the collapsible and interactive 'contents' section reflects the new edam-based categories and features a search function for quickly locating a category ( figure c ). 
the contents section remains stationary and visible while scrolling the gene page data ( figure d) . a section indicator (small blue circle) appears to the left of the category name of the data currently in view. clicking a category name directs the page to that data section. the check boxes to the right of the category names can be used to customize the data display. data from categories with empty check boxes will be hidden from view. data tables ( e, f and within figure d ) are collapsible, interactive, contain sortable columns and present transcript-specific information when data can be unambiguously assigned to a transcript. tables with two or more rows include a search function. the transcriptomics (figure e) , protein properties and features ( figure f ), mass spec -based expression evidence and sequences tables contain expandable rows for retrieving detailed information. each row of the transcriptomics table represents a data set and expanding a row reveals graphs, data tables, and a data set description, as well as coverage plots for rna sequencing data. expansion of the rows in the protein properties and features table reveals the domains, blastp hits and other analysis results pertinent to the transcript's protein product. the mass spec-based expression evidence graphic table shows proteomic evidence associated with each transcript. the sequences table offers genomic, coding, predicted mrna and predicted protein sequences for each transcript. human and mouse genes (hostdb) have extensive alternative transcripts and there is increasing evidence that many eukaryotic pathogen genes have more than one transcript. eupathdb infrastructure was updated to better represent transcript information. transcripts are graphically represented on gene pages and listed in gene page tables when data can be unambiguously assigned to a transcript (figure d ). all gene search results now include a transcript id column ( figure c ). the results of searches that query transcript-specific data (e.g. predicted signal peptide) contain an explore tool (see tools section of this manuscript) for investigating transcript subsets ( figure b ). filtering samples based on metadata. sequences from pathogen isolates and data from host clinical blood samples are often accompanied by rich metadata-sample characteristics including host, age, geographic location, disease status and parasitemia. eupathdb's new filter parameter ( figure ) increases the user's power to mine data via display of sample characteristics (metadata) on the interface for selection of samples while configuring a search or multiple sequence alignment. for example, the filter parameter makes it possible to compare the antigenicity of parasite genes between infected children and uninfected children within the same dataset. the filter parameter is available for searches and sequence alignments that access snp, chip-seq and hostresponse data. rna-sequence analysis workflow updated. our pipeline for analyzing and loading rna-sequence data was updated to use standard tools and to accommodate data sets with biological replicates. the new workflow aligns reads with gsnap and calculates fpkm/rpkm with ht-seq ( , ) . deseq is used to determine differential expression for experiments that have appropriate biological replicates ( ) . future development efforts at eupathdb will concentrate on expanding private analysis workspaces and better integration and support for host response to pathogen infection. 
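the fpkm/rpkm values produced by the updated rna-sequence workflow described above follow the standard per-kilobase, per-million-mapped-reads normalization; a minimal sketch of the generic formula with invented counts (not eupathdb's exact implementation):

```python
# Sketch of RPKM/FPKM normalization: reads (or fragments) per kilobase of
# transcript per million mapped reads. Counts and lengths are placeholders.
counts = {"gene1": 1500, "gene2": 300, "gene3": 12000}     # reads mapped per gene
lengths_bp = {"gene1": 2000, "gene2": 800, "gene3": 5400}  # gene (or transcript) length in bp

total_mapped = sum(counts.values())

rpkm = {
    gene: counts[gene] * 1e9 / (total_mapped * lengths_bp[gene])
    for gene in counts
}
for gene, value in rpkm.items():
    print(f"{gene}: RPKM = {value:.1f}")
```

note that deseq works on raw counts with its own size-factor normalization, so values like these are typically used for display rather than for the differential-expression test itself.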
the galaxy toolshed contains many tools for data analysis. we expect to enhance our existing galaxy workspace with new workflows such as alignment of resequencing reads and snp calls or production of multiple sequence alignments and phylogenetic analyses. critical to our expanded workspace will be the ability for users to fully integrate the results of their analyses into eupathdb so that they can query, view, and share their results in the context of the publicly available data in eupathdb. a high priority for eupathdb in the coming year is to better represent host responses to pathogen infection and enable users to mine these data to identify genes (or other entities) and relationships of interest. currently, only a few omics data sets are available for host response, but we expect this situation to change rapidly. we will be expanding not only the amount of host data that we load, but also the types of host response data so that we can include highthroughput metabolic and immune profiling and rich descriptions of all study, experiment and sample metadata. we will be loading these rich multi-dimensional studies and we will be implementing a variety of tools and analyses to mine these data at a systems level. the strategies wdk: a graphical search interface and web development kit for functional genomics databases eupathdb: the eukaryotic pathogen database ) patric, the bacterial bioinformatics database and analysis resource vectorbase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases virus pathogen database and analysis resource (vipr): a comprehensive bioinformatics database and analysis resource for the coronavirus research community influenza research database: an integrated bioinformatics resource for influenza research and surveillance. influenza other respir viruses genedb-an annotation database for pathogens fungidb: an integrated functional genomics database for fungi admixture and recombination among toxoplasma gondii lineages explain global genome diversity the galaxy platform for accessible, reproducible and collaborative biomedical analyses: update cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses signalp . 
: discriminating signal peptides from transmembrane regions submicroscopic and asymptomatic plasmodium falciparum and plasmodium vivax infections are common in western thailand--molecular and serological evidence a prospective analysis of the ab response to plasmodium falciparum before and after a malaria season by protein microarray plasmodium falciparum protein microarray antibody profiles correlate with protection from symptomatic malaria in kenya malaria transmission, infection, and disease at three sites with varied transmission intensity in uganda: implications for malaria control a framework for global collaborative data management for malaria research kegg as a reference resource for gene and protein annotation leishcyc: a biochemical pathways database for leishmania major leishcyc: a guide to building a metabolic pathway database and visualization of metabolomic data the metacyc database of metabolic pathways and enzymes and the biocyc collection of pathway/genome databases trypanocyc: a community-led biochemical pathways database for trypanosoma brucei the chebi reference database and ontology for biologically relevant chemistry: enhancements for a genome-wide crispr screen in toxoplasma identifies essential apicomplexan genes quantitative proteomics using silac: principles, applications, and developments proteome remodelling during development from blood to insect-form trypanosoma brucei quantified by silac and mass spectrometry chromosome and gene copy number variation allow major structural change between species and strains of leishmania genome-wide regulatory dynamics of translation in the plasmodium falciparum asexual blood stages polysome profiling reveals translational control of gene expression in the human malaria parasite plasmodium falciparum extensive stage-regulation of translation revealed by ribosome profiling of trypanosoma brucei edam: an ontology of bioinformatics operations, types of data and identifiers, topics and formats the generic genome browser: a building block for a model organism system database htseq-a python framework to work with high-throughput sequencing data gmap and gsnap for genomic sequence alignment: enhancements to speed, accuracy, and functionality differential expression analysis for sequence count data the authors wish to thank members of the eupathdb research communities for their willingness to share genomicscale data sets, often prior to publication and for numerous comments and suggestions from our scientific advisors and the scientific community at large, which have helped to improve the functionality of eupathdb resources. we also thank past and present staff associated with the eupathdb brc project, and our research laboratory colleagues whose contributions have facilitated the creation and maintenance of this database resource. key: cord- -hn o authors: pivette, mathilde; mueller, judith e; crépey, pascal; bar-hen, avner title: drug sales data analysis for outbreak detection of infectious diseases: a systematic literature review date: - - journal: bmc infect dis doi: . /s - - - sha: doc_id: cord_uid: hn o background: this systematic literature review aimed to summarize evidence for the added value of drug sales data analysis for the surveillance of infectious diseases. methods: a search for relevant publications was conducted in pubmed, embase, scopus, cochrane library, african index medicus and lilacs databases. 
retrieved studies were evaluated in terms of objectives, diseases studied, data sources, methodologies and performance for real-time surveillance. most studies compared drug sales data to reference surveillance data using correlation measurements or indicators of outbreak detection performance (sensitivity, specificity, timeliness of the detection). results: we screened articles and included in the review. most studies focused on acute respiratory and gastroenteritis infections. nineteen studies retrospectively compared drug sales data to reference clinical data, and significant correlations were observed in of them. four studies found that over-the-counter drug sales preceded clinical data in terms of incidence increase. five studies developed and evaluated statistical algorithms for selecting drug groups to monitor specific diseases. another three studies developed models to predict incidence increase from drug sales. conclusions: drug sales data analyses appear to be a useful tool for surveillance of gastrointestinal and respiratory disease, and otc drugs have the potential for early outbreak detection. their utility remains to be investigated for other diseases, in particular those poorly surveyed. electronic supplementary material: the online version of this article (doi: . /s - - - ) contains supplementary material, which is available to authorized users. since the mid- s and the raise of concerns about bioterrorism and emerging diseases, non-diagnosis-based data have increasingly been used for routine disease surveillance and outbreak detection [ ] . the cdc defined "syndromic surveillance" as an investigational approach where health department staff, assisted by automated data acquisition and generation of statistical alerts, monitor disease indicators in real-time or near real-time to detect outbreaks of disease earlier than would otherwise be possible with traditional public health methods [ ] . in such efforts, different registries have served as data sources for public health surveillance [ , ] , including data on absenteeism at work or school [ ] , calls to health helplines [ , ] , emergency department consultations [ , ] , ambulance dispatching [ ] , or drug sales. although unspecific, such data sources can have the advantage over diagnosis-based surveillance of providing information within short delays since the event and in readily available electronic form for relatively low-cost, while capturing large parts of the population. drug sales data analysis may overcome the limitation of poor specificity when groups of drugs are exclusively used for the disease or disease syndrome of interest. furthermore, drug sales data may earlier capture changing population health status, as over-the-counter (otc) sales and a dense network of pharmacies in most developed countries make drugs easily accessible to patients at the earliest appearance of their symptoms. despite this potential interest, no state of the art of drug sales based surveillance is available to date. the present systematic literature review therefore summarized the evidence for an added value of drug sales data for infectious disease surveillance. we limited the scope of the review to infectious diseases, as they represent a public health problem for which early and valid signal detection is of particular concern, in light of potentially rapid emergence and opportunity for control interventions. 
we conducted a literature search from up to june to identify relevant peer-reviewed articles regarding surveillance of infectious diseases based on drug sales data. prisma guidelines were followed in the reporting of the review [ ] . published articles were searched for on electronic databases (pubmed, embase, scopus, lilacs, african index medicus, cochrane library), using combinations of the following key words: ("surveillance" or outbreak detection or warning system) and (overthe-counter or "prescription drugs" or pharmacy or (pharmaceutical or drug or medication) sales). the search was limited to articles in english or french. there were no limitations on study settings. to be included in the review, articles had to describe, test, or review an infectious disease surveillance based on drug sales data; and be original research that presented new data and results. we excluded studies that monitored chronic diseases, as well as prevalence studies whose purpose was not epidemic detection. one reviewer screened and evaluated the titles and abstracts. articles were widely included in a first stage. the full-text review and the final selection of the articles were made by two reviewers. we reviewed and described the articles in terms of objectives, diseases studied, data sources, methodologies, and performance for real time surveillance. to describe methods and results, we separated the articles into three groups based on their main objective: descriptive retrospective studies, drug selection studies, and prediction studies. outcomes selected to compare drug sales data to reference surveillance data of the corresponding disease were correlation measurements (strength and timeliness of the correlation) and indicators of outbreak detection performance (sensitivity, i.e. ability to identify true outbreaks; specificity, i.e. ability to identify true negative and timeliness of the detection). we screened a total of articles, of which were included in the final review. the search and selection process is presented in figure . articles excluded based on fulltext review (no drug sales data, no infectious disease, no outbreak detection) n= figure flow chart of study selection process in a systematic review of drug sales data analysis for syndromic surveillance of infectious diseases. three types of studies were defined: retrospective descriptive studies, drug selection studies and prediction studies. nineteen of the studies were descriptive retrospective studies assessing the strength of the correlation between drug sales and reference surveillance data of the corresponding disease or evaluating outbreak-detection performance [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . five studies used statistical algorithms to select groups of drugs that were closely associated with clinical surveillance data of a given disease and that would be most appropriate for future drugsales-based surveillance [ , [ ] [ ] [ ] [ ] . in a third group of three studies, the authors developed and evaluated statistical models to predict clinical surveillance data based on drug sales data [ ] [ ] [ ] . table summarizes the studies in terms of their general characteristics. most of the studies focused on respiratory illnesses ( studies) [ , , , , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] or gastrointestinal illnesses ( studies) [ , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] , ] . only two other studies evaluated surveillance of pertussis [ ] and syphilis [ ] . 
most of the studies were set in the united states (n = studies, %), followed by canada (n = ), france (n = ), japan (n = ), the netherlands (n = ) and england (n = ). only one study was conducted in more than one country [ ] . in most retrospective studies, data were collected specifically for the purpose of the study from a sample of pharmacies [ , [ ] [ ] [ ] or from retailers [ , , , ] . for example, in a canadian study [ ] , electronic data were provided by one major retailer for all of their pharmacies in the study area. automatically compiled data sources were used in all the drug selection and prediction studies and in some retrospective studies. drug sales data were routinely collected in samples of a city's or country's pharmacies. such routine data collection systems were mainly implemented by research or public health groups, such as the johns hopkins applied physics laboratory [ , , , ] , the new york city department of health [ ] , the national institute of infectious diseases in japan [ ] , or the real-time outbreak and disease surveillance laboratory at the university of pittsburg [ ] . data are available the day after the day's sales in those systems. in eight other studies, private marketing companies had automatically aggregated and made available drug sales data from a sample ( - %) of pharmacies in a given city or country [ , , , [ ] [ ] [ ] [ ] ] . nineteen studies retrospectively compared drug sales data to gold standard reference data of the disease. details are given in table . reference data of the disease included medical case reports [ ] [ ] [ ] , diagnostic registries of microbiological laboratories [ , , , ] , hospital admission or discharge data [ ] [ ] [ ] [ ] [ ] ] , or clinical emergency department data [ ] [ ] [ ] . the selection of indicator drugs in these studies was based on the literature or expert opinion. for example, edge et al. [ ] selected all anti-nauseant and antidiarrheal otc drugs for gastrointestinal surveillance. in stirling et al. [ ] , pharmacists determined which common antidiarrheal drugs they would report. two methods were commonly used to compare drugsale and diagnostic data time series: correlation analysis and signal detection comparison ( table ). ten studies used cross-correlation function to measure the similarity of two curves and to determine the time lag at which the correlation between the datasets is maximized. cross-correlation is a standard method to determine the time delay between two signals. in three studies, only correlation between the time series was examined without analyzing time-lagged relationship. six studies used aberration detection methods to evaluate whether and by how long the date of signal detection by drug sales precedes the signal based on diagnostic data. the signal definition for aberration detection was based on either a simple threshold to define alerts [ ] or more complex algorithms such as the serfling method [ ] , arima models [ ] , the simple moving average method (ma), the cumulative sum method (cusum) [ , ] , or the exponentially weighted moving average (ewma) [ ] . these studies assessed the performance in terms of sensitivity, specificity and timeliness of disease outbreak detection. five other studies [ , [ ] [ ] [ ] ] only evaluated whether drug sales showed a significant increase during a known epidemic period. twelve of studies evaluating otc sales retrospectively found significant correlations or a significant increase in drug sales [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] ] . 
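the two comparison approaches described above, cross-correlation to find the lag of maximal correlation and aberration detection such as cusum, can be illustrated with a short sketch on simulated weekly series; all data, lags and thresholds below are invented placeholders, not values from the reviewed studies.

```python
# Sketch of (1) lagged Pearson correlation between drug sales and clinical
# counts, and (2) a simple one-sided CUSUM on drug sales. Simulated data only.
import numpy as np

rng = np.random.default_rng(0)
weeks = 104
clinical = (20 + 15 * np.exp(-0.5 * ((np.arange(weeks) % 52 - 10) / 3.0) ** 2)
            + rng.normal(0, 2, weeks))                    # winter epidemic peaks
sales = 3 * np.roll(clinical, -2) + rng.normal(0, 5, weeks)  # sales lead clinical by ~2 weeks

def lagged_corr(x, y, lag):
    """Pearson correlation of x[t] with y[t + lag] (positive lag: y follows x)."""
    if lag > 0:
        return np.corrcoef(x[:-lag], y[lag:])[0, 1]
    if lag < 0:
        return np.corrcoef(x[-lag:], y[:lag])[0, 1]
    return np.corrcoef(x, y)[0, 1]

best_lag = max(range(-6, 7), key=lambda k: lagged_corr(sales, clinical, k))

# One-sided CUSUM on sales: accumulate positive deviations from a baseline.
baseline, slack, threshold = np.median(sales), 2.0, 25.0   # placeholder settings
cusum, alarms = 0.0, []
for week, value in enumerate(sales):
    cusum = max(0.0, cusum + (value - baseline - slack))
    if cusum > threshold:
        alarms.append(week)
        cusum = 0.0                                        # reset after an alarm

print(f"lag of maximal correlation: {best_lag} weeks; first alarm weeks: {alarms[:3]}")
```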
only two studies did not find any consistent correlation. for example, das et al. [ ] found a poor correlation between otc antidiarrheal drug sales and emergency department visits for diarrhea in new york city, with an r of . . however, they found an increase in sales during a known outbreak of norovirus. otc drug sales preceded clinical data in three of eight studies that analyzed timeline correlations [ , , ] . for example, in hogan et al. [ ] , the correlation coefficient between electrolyte sales and hospital diagnoses of respiratory and diarrheal illness was . ( % ci, . - . ) when drug sales were assumed to precede clinical diagnosis data by . weeks. outbreaks were detected with % sensitivity and specificity in of studies that analyzed signal detection [ , , ] . drug sales data provided an earlier outbreak signal in two of them [ , ] . in davies et al. [ ] , the rate of cough/cold sales exceeded a threshold of units per week two weeks before the peak in emergency department admissions during three consecutive winters. in hogan et al. [ ] , detection from electrolyte sales occurred an average of . weeks earlier than detection from hospital diagnoses of respiratory and diarrheal diseases. six of the seven studies that focused on prescribed drugs found strong correlations (r = . - . ) with clinical reference data or a significant increase in drug sales, but without a lead time. the other study [ ] showed that the cusum signal generated for prescriptions for macrolide antibiotics was linked to a pertussis outbreak in a county of new york state. no association was observed between the type of reference data and the time lags observed.

an important challenge for drug-sales-based surveillance is identifying relevant indicator drug groups to monitor diseases. five retrieved articles addressed this question. characteristics of the studies are described in table . two studies [ , ] developed methods to find homogeneous groups of otc products. the authors used unsupervised clustering algorithms for aggregating otc products into groups sharing similar sales histories. for example, magruder et al. [ ] first assigned otc products for respiratory diseases to subgroups qualitatively based on indication, dose form, and age group. a stepwise hierarchical clustering algorithm was then used to form categories sharing a similar sales history, leading to a set of product categories. in two studies [ , ] , the authors developed procedures to identify the drugs correlating with disease incidence. clusters were formed specifically for a particular disease. in pelat et al. [ ] , a hierarchical clustering procedure was applied to the time series of all therapeutic classes and the acute diarrhea incidence rate reported by a network of general practitioners. four therapeutic classes were found to cluster with diarrhea incidence, and an algorithm based on the selected drugs allowed the detection of epidemics with a sensitivity of %, a specificity of % and a timeliness of . weeks before official alerts.

in three studies [ ] [ ] [ ] , the authors developed models to predict clinical data based on drug sales data. vergu et al. [ ] used a poisson regression model on selected otc sales to forecast influenza-like illness (ili) incidence as recorded by a sentinel network of general practitioners. the forecast at the national level - weeks ahead showed a strong correlation with observed ili incidence (r = . - . ). najmi et al.
[ ] used least mean square filtering methods to estimate the incidence of emergency room consultations for respiratory conditions from past and present sales of groups of cold-remedy sales. in a later article [ ] , they succeeded in extending the estimation algorithm for predicting increases in clinical data several days in advance. the evidence gathered in this systematic literature review suggests that drug sales data analysis can be a useful tool for surveillance of acute respiratory and gastrointestinal infections. as could be expected, prescribed drug sales data were strongly correlated with clinical case reporting. no lead time was observed, which is consistent with the fact that patients purchase drugs after seeing a healthcare professional. analysis of prescribed drug sales data may nevertheless have an additional utility for epidemic detection, as these data might be available with a shorter delay than clinical surveillance data [ ] . a high correlation between otc drug sales data and reference surveillance data were found in almost all the retrospective studies. several studies also showed that otc drug sales can serve as an early indicator of disease epidemics. patients may buy nonprescription drugs during the early phase of illness when they become symptomatic, before consulting a health practitioner [ ] . a surveillance system based on drug data should ideally detect all the outbreaks, rapidly, with a low false alert rate. however, few studies in the review determined the sensitivity and specificity of the outbreak detection and those aspects should be analyzed in more details in future studies. surveillance based on otc drug sales could be particularly relevant for diseases whose prodromal phase persists for several days before the onset of more severe symptoms. for example, the early stages of dengue fever symptoms are nonspecific (fever, headache, myalgia, arthralgia, etc.) [ ] . the occurrence of grouped cases could trigger an excess of nonspecific drug sales over baseline levels, which in turn could provide an early warning of outbreak in an endemic area. results from drug selection studies showed that it is possible to identify groups of products strongly associated with incidence data, which can then be used to predict future trends in clinical data and help public health authorities to prepare response planning. such product selection procedures, however, depend on the existence of large clinical surveillance databases of the diseases concerned. similarly, the validity of drug sales data analysis has been evaluated mainly for two disease groups, respiratory and gastrointestinal illness, for which clinical reference data, used as the gold standard, are readily available. pertussis and syphilis have been evaluated in only one study each, and still require further confirmation. the concept of drug-based surveillance therefore needs to be validated for other infectious diseases. all the studies were conducted in developed countries or area. surveillance based on drug sales data requires electronic information systems for routine data analysis. besides, its implementation requires that the population has access to the health care system and mainly buy drugs in pharmacies. this could limit the use of drug based surveillance systems in developing countries. 
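returning to the prediction studies summarized before this discussion, the general idea of regressing clinical incidence on recent drug sales can be sketched as a poisson regression; the simulated data, the single predictor and the two-week lag below are placeholders, not the published models.

```python
# Sketch of a Poisson regression forecast of clinical incidence from lagged
# drug sales, in the spirit of the prediction studies summarized above.
# All data, the 2-week lag and the single predictor are invented placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
weeks = 104
sales = 100 + 60 * np.sin(np.arange(weeks) * 2 * np.pi / 52) + rng.normal(0, 10, weeks)
incidence = rng.poisson(np.clip(0.2 * np.roll(sales, 2), 1, None))  # clinical counts lag sales

df = pd.DataFrame({"incidence": incidence,
                   "sales_lag2": np.roll(sales, 2)}).iloc[4:]        # drop wrap-around rows

train, test = df.iloc[:80], df.iloc[80:]
model = sm.GLM(train["incidence"],
               sm.add_constant(train[["sales_lag2"]]),
               family=sm.families.Poisson()).fit()

forecast = model.predict(sm.add_constant(test[["sales_lag2"]]))      # ~2 weeks ahead
print(model.params)
print(forecast[:5])
```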
by improving the timeliness of epidemic detection compared to clinical data and giving information from a larger part of the population, drug sales data can be an additional source of information for already monitored diseases. besides, drug sales data analysis could have its greatest value in the surveillance of diseases for which clinical surveillance is cumbersome and costly, or where substantial under-reporting is suspected. to confirm the selected drug group as a valid proxy of disease, clinical surveillance may be conducted for a defined period in a representative population. examples of diseases for which this would be useful are typically varicella, urinary infections, allergies/asthma, and parasitic diseases. ideally, the drugs to be monitored should be specific to the disease and widely used to treat it in order to maximize the sensitivity of the signal. for example, benzylpenicillin benzathine . mui is the quasi exclusive treatment for syphilis infection [ ] and is a good candidate. in contrast, the treatment of measles is mostly symptomatic without a specific drug, which makes this disease unattractive for this approach. another limitation applies to diseases that are usually treated in hospitals or specialized centers, such as tuberculosis. surveillance based on drug sales, may not be appropriate to accurately estimate incidence of diseases, as the source population size is not precisely known. moreover, it may be difficult to link the number of drug packages sold to the number of patients with disease. however, the method is very efficient to determine temporal dynamics of a situation and to detect abnormal phenomena. surveillance based on drug sales is therefore well adapted to diseases with seasonal variations such as norovirus gastroenteritis, influenza and other infectious respiratory agents, or community outbreaks (foodborne illnesses, waterborne illnesses, hepatitis a, etc.). drug sales can be influenced by store promotions, sales period (holidays, weekends), and the media. also, we do not know whether people buy medications to treat a disease they currently have or a disease they fear they may have in the near future. for example, during the media coverage of avian influenza a (h n ) in the us, an increase in antiviral medications sales was observed [ ] , which corresponded to stockpiling behavior of the population. health-seeking behaviors also vary by demographic, social, cultural, and economic factors. a survey [ ] in canada analyzed the healthcare-seeking behaviors of patients with acute gastroenteritis. they found significant differences (patient age and sex) between the patients who used otc drugs and those who did not. consequently, factors that prompt self-medication should also be taken into account. the usefulness of drug sales based surveillance is also dependent on the available resources and the organization of the health care system. otc drug sales surveillance is for example less relevant in countries where reimbursement rate are high and patients mainly get prescribed drugs. population mobility, particularly in tourist areas, may lead to an increase in remedy sales, which could wrongly be interpreted as a disease outbreak. inversely, patients with high geographical mobility may not be included in the region of study and lead to an underestimation of the magnitude of an epidemic. despite some limitations, routine collection and analysis of drug sales data are likely to be developed in the coming years. 
many automated surveillance systems that collect drug data the day after the sales have been implemented in the last decade [ , , , ] . they allow a rapid assessment of the public health situation. early detection of outbreaks allows public health authorities to set up epidemic investigations and control measures sooner. most studies included in this review were published after the year , with their number increasing recently. they illustrate the need for improved surveillance systems, evidenced by recent public health crises (e.g., anthrax in , the sars outbreak in , the a/h n influenza pandemic in , etc.). drug sales data present indeed many advantages in terms of public health surveillance. data can be obtained in a real-time manner and usually cover a large portion of the population. data collection may be exhaustive, without selection of specific sales, and allows the simultaneous monitoring of a large number of diseases, especially new or emerging diseases. although non-specific, drug sales data are directly linked to patients' health conditions. drug sales data are therefore more specific than other syndromic surveillance data, such as tracking search patterns on the web and are likely to reflect more accurately disease activity in the population. moreover, it should be noted that alternative sources of data for disease surveillance are currently under development. healthcare management databases that can provide exhaustive information on drug consumption and diagnosis, as the dossier médical personnel [ ] in france, are promising tools for disease surveillance. our review may be affected by a publication bias since studies unable to show correlations between drug sales and reference data may have been less published. in addition, selections bias may have occurred in the studies. indeed, some studies in the review were based on a limited number of pharmacies and/or a limited study period (e.g. less than one year). language bias may exist as we were not able to identify studies published in languages other than english and french. the review focused on the temporal dynamics of infectious disease; consequently, further analyses are required to determine the capacity of these systems to efficiently monitor other aspects of infectious diseases such as spatial spreading. this review suggests that the analysis of drug sales data is a promising method for surveillance and outbreak detection of infectious diseases. it has the potential to trigger an outbreak alert earlier than most surveillance systems. however, the main challenges consist in the appropriate selection of indicator drug groups and the validation of this approach for diseases for which no or poor quality clinical surveillance data exists. the usefulness of the approach also depends on the available resources and the organization of the health care system. drug sales databases with real-time or near real-time data transmission are available in several countries; future studies should be encouraged to expand their use on other infectious diseases. what is syndromic surveillance? mmwr morb mortal wkly rep cdc: framework for evaluating public health surveillance systems for early detection of outbreaks: recommendations from the cec working group review of syndromic surveillance: implications for waterborne disease detection absenteeism in schools during the influenza a(h n ) pandemic: a useful tool for early detection of influenza activity in the community? 
using ontario's "telehealth" health telephone helpline as an early-warning system: a study protocol using nurse hot line calls for disease surveillance disease outbreak detection system using syndromic data in the greater washington dc area assessment of a syndromic surveillance system based on morbidity data: results from the oscour network during a heat wave use of ambulance dispatch data as an early warning system for communitywide influenzalike illness preferred reporting items for systematic reviews and meta-analyses: the prisma statement use of medicaid prescription data for syndromic surveillance-new york a practical method for surveillance of novel h n influenza using automated hospital data syphilis surveillance in france monitoring over-the-counter medication sales for early detection of disease outbreaks sales of over-the-counter remedies as an early warning system for winter bed crises syndromic surveillance of gastrointestinal illness using pharmacy over-the-counter sales. a retrospective study of waterborne outbreaks in saskatchewan and ontario syndromic surveillance of norovirus using over-the-counter sales of medications related to gastrointestinal illness detection of pediatric respiratory and diarrheal outbreaks from sales of over-the-counter electrolyte products prediction of gastrointestinal disease with over-the-counter diarrheal remedy sales records in the san francisco bay area evaluation of over-the-counter pharmaceutical sales as a possible early warning indicator of human disease progress in understanding and using over-the-counter pharmaceuticals for syndromic surveillance experimental surveillance using data on sales of over-the-counter medications-japan using oral vancomycin prescriptions as a proxy measure for clostridium difficile infections: a spatial and time series analysis surveillance data for waterborne illness detection: an assessment following a massive waterborne outbreak of cryptosporidium infection pharmaceutical sales; a method for disease surveillance? 
waterborne cryptosporidiosis outbreak real-time prescription surveillance and its application to monitoring seasonal influenza activity in japan validation of syndromic surveillance for respiratory pathogen activity sales of nonprescription cold remedies: a unique method of influenza surveillance seasonal influenza surveillance using prescription data for anti-influenza medications mining aggregates of over-the-counter products for syndromic surveillance a multivariate procedure for identifying correlations between diagnoses and over-the-counter products from historical datasets a method for selecting and monitoring medication sales for surveillance of gastroenteritis unsupervised clustering of over-the-counter healthcare products into product categories estimation of hospital emergency room data using otc pharmaceutical sales and least mean square filters an adaptive prediction and detection algorithm for multistream syndromic surveillance medication sales and syndromic surveillance implementing syndromic surveillance : a practical guide informed by the early experience value of syndromic surveillance within the armed forces for early warning during a dengue fever outbreak in french guiana in increased antiviral medication sales before the - influenza season factors associated with the use of over-the-counter medications in cases of acute gastroenteritis in hamilton drug sales data analysis for outbreak detection of infectious diseases: a systematic literature review this research was funded by celtipharm (vannes, france) a company specialized in the real time collection and statistical processing of healthcare data (www.celtipharm.orgwww.openhealth.fr), through a doctoral thesis contract for mathilde pivette. mathilde pivette prepares a doctoral thesis under the french framework "cifre" (industrial contract for training through research; www.anrt.asso.fr), in partnership with the company celtipharm (www.celtipharm.org). the other authors declare they have no competing interests. all authors contributed to the study's design. mp and jm carried out the literature search and reviewed articles. mp drafted the manuscript. all authors interpreted the results, revised and approved the final manuscript. key: cord- - eylgtbc authors: singh, david e.; marinescu, maria-cristina; carretero, jesus; delgado-sanz, concepcion; gomez-barroso, diana; larrauri, amparo title: evaluating the impact of the weather conditions on the influenza propagation date: - - journal: bmc infect dis doi: . /s - - -w sha: doc_id: cord_uid: eylgtbc background: predicting the details of how an epidemic evolves is highly valuable as health institutions need to better plan towards limiting the infection propagation effects and optimizing their prediction and response capabilities. simulation is a cost- and time-effective way of predicting the evolution of the infection as the joint influence of many different factors: interaction patterns, personal characteristics, travel patterns, meteorological conditions, previous vaccination, etc. the work presented in this paper extends epigraph, our influenza epidemic simulator, by introducing a meteorological model as a modular component that interacts with the rest of epigraph’s modules to refine our previous simulation results. our goal is to estimate the effects of changes in temperature and relative humidity on the patterns of epidemic influenza based on data provided by the spanish influenza sentinel surveillance system (sisss) and the spanish meteorological agency (aemet). 
methods: our meteorological model is based on the regression model developed by ab and js, and it is tuned with influenza surveillance data obtained from sisss. after pre-processing this data to clean it and reconstruct missing samples, we obtain new values for the reproduction number of each urban region in spain, every minutes during . we simulate the propagation of the influenza by setting the date of the epidemic onset and the initial influenza-illness rates for each urban region. results: we show that the simulation results have the same propagation shape as the weekly influenza rates as recorded by sisss. we perform experiments for a realistic scenario based on actual meteorological data from - , and for synthetic values assumed under simplified predicted climate change conditions. results show that a diminishing relative humidity of % produces an increment of about . % in the final infection rate. the effect of temperature changes on the infection spread is also noticeable, with a decrease of . % per extra degree.conclusions: using a tool like ours could help predict the shape of developing epidemics and its peaks, and would permit to quickly run scenarios to determine the evolution of the epidemic under different conditions. we make epigraph source code and epidemic data publicly available. seasonal influenza may not make headlines, but together with pneumonia, it is one of the top ten causes of death worldwide. influenza epidemics results in to million cases of severe illness a year, which puts a high burden on health providers and results in loss of productivity and absenteeism, such as mentioned by the world health organization in [ ] . it's been long known that in temperate climates these seasonal epidemics occur mostly in winter, and typical hypotheses assigned the blame to people being in closer proximity for longer periods of time, or lowered immune systems. in general, meteorological conditions affect virus transmission due to multiple effects: virus survival rates, host contact rates and immunity, and the transmission environment (except the case of direct or short-range contact). while these factors may have an influence, the solid evidence sustains the hypothesis that the virus's best surviving conditions are low temperatures and low absolute humidity. one of the goals of the current research in this field is to understand this relationship to be able to develop a more accurate seasonal influenza model for both temperate and tropical regions. as a motivation of this work, jt et al. [ ] conclude that environment factors may become more important for a future predictive model of the effects of climate change. in a previous paper [ ] , some of the authors of this paper studied the interaction of the spatio-temporal distribution of influenza in spain and the meteorological conditions during five consecutive influenza seasons. the work uses real influenza and meteorological data in combination with statistical models to show that there is a relationship between the transmission of influenza and meteorological variables like absolute humidity and amount of rainfall. in this work we use the same data sources (sisss and aemet agencies) following a different approach: we study some of these relationships from a simulation perspective, considering not only the existing influenza distributions but also the ones related to the climate change. 
in this work we extend epigraph [ ] , an influenza simulator, with a meteorological model (mm) starting from the model developed by ab and js [ ] . in their paper ab and js analyze monthly weather and influenza mortality data collected between and throughout all of the us urban counties. using a regression model, they conclude that there exist correlations of both absolute humidity and temperature with mortality. they report a quantitative assessment of the relation between mean daily humidity and temperature levels and mortality rates in different ranges. this is an extensive study and, as a result, we start from the assumption that their results are solid and appropriate to incorporate into epigraph in order to produce meteorology-dependent simulations based on real data. regarding other influenza simulators that consider weather conditions, ps et al. present an agent-based simulation model [ ] that evaluates the seasonal effects on influenza propagation. although the reproductive rates are generated synthetically without considering actual meteorological data, this paper shows, in a similar way to our work, the impact of changing reproductive rates on the course of an influenza pandemic. in the article [ ] , js et al. simulate influenza transmission via a sirs model modulated by climate data to obtain the basic reproduction number r . both js et al. [ ] and acl et al. [ , ] study the effects of humidity on influenza transmission from the point of view of virus survival and conclude that aerosol transmission is most efficient in low humidity conditions. acl et al. [ , ] and bx et al. [ ] also conclude that aerosol transmission is more efficient at low temperatures. js et al. [ ] and jm et al. [ ] also deduce that virus survival increases with decreasing humidity. epigraph simulations use real data for modelling the population, the spatio-temporal distribution of influenza, and the meteorological conditions. the simulator consists of the different components and data sources shown in fig. (overview of the data sources, processed data, and epigraph components); the previously existing and novel components are represented in blue and orange, respectively. the simulator uses input data obtained from different sources, including: ( ) the influenza data, which contains information about the initially infected individuals; ( ) the population data, which describes the interactions of individuals with others; ( ) the transport data, which contains information about the movement of individuals between different locations; and ( ) the climate data, which contains the meteorological conditions during the simulated time span. this data feeds the different models implemented in the simulator.
we briefly describe the three models that have been previously developed and presented in [ , ] . the epidemic model considers the propagation of influenza, extending the sir model (as explained in [ ] by fb et al.) to include states for latent, asymptomatic, dead and hospitalized individuals. the infective period has different phases which may affect the dissemination characteristics of the influenza virus, as ame et al. describe in [ ] . each individual has a slightly different length for each infection state. we adopt most of the concrete values for the model parameters from the existing literature on flu epidemics (see [ ] [ ] [ ] [ ] ); you can find them all in [ ] . the transport component models the daily commute of individuals to neighboring cities (inter-city movement) and long-distance travel over several days, which represents the commute of workers who need to reside at different locations or people who travel any distance for vacation purposes. the people mobility model is based on the gravity model proposed by cv et al. [ ] , which uses geographical information extracted from google using the google distance matrix api service. the social model is an agent model that captures individual characteristics and specifies the interaction patterns based on existing interactions extracted from social networks. these patterns determine the close contacts of each individual during the simulation, which is a crucial element in modeling the spread of the infection. we extract interaction patterns from virtual interactions via email or social networks (enron and facebook) and scale them to approximate the physical connections of the whole network within an urban area. these connections are time-dependent to realistically capture the temporal nature of interactions, in our case modeled depending on the day of the week and time of day. the distribution of the population is in terms of four group types: school-age children and students, workers, stay-home parents, and retirees. in this paper, as the main contribution, we introduce a new component of the simulator (the meteorological model), which evaluates the impact of climate parameters on influenza propagation. this component is tuned with influenza surveillance data obtained from sisss to provide realistic simulations. as far as we know, this is the first simulator that integrates real meteorological data to predict the spatio-temporal distribution of influenza. we think that this contribution will help to better understand influenza propagation in real environments. in the literature we can find different influenza simulators, although none of the following consider meteorological factors in the simulation. one example is the work of kk et al. [ ] , which presents an sir-based epidemic simulator that allows parametrizing both the population characteristics and the epidemic process. the goal of this work is to identify the turning point (peak of the infected population) of the infection. compared with this initial approach to modelling the infection spreading across the contact network, our work considers a broader set of parameters and network configurations. he et al. [ ] analyze, by means of simulation, the relationship between social interaction patterns at workplaces and the virus transmission patterns during influenza pandemics.
the main effort is geared towards the flexible specification of the different aspects involved in a simulation, such as intervention policies, social modelling, social organization of work, etc. sim-flu [ ] is different from most epidemic simulators in that it focuses on the discovery of the most probable future influenza variants starting from virus sequences published by the national center for biotechnology information (ncbi). this work is complementary to the goal of most simulators, including ours, which is to understand and predict the spreading infection patterns of a known flu strain across a population. their methodology is based on observing directional changes in subtypes of influenza over time. js and ak present a framework [ ] to adjust an epidemic simulation based on real-time forecasts of infections from google flu trends. the paper focuses on prediction of the timing of peak infection, but other metrics could be predicted as well. the authors of [ ] simulate the spreading of influenza in an urban environment consisting of several close-by towns connected by trains; their goal is to be able to model and simulate intervention policies. epiwork [ ] was a european project in fp whose focus was to develop a tool framework for epidemic forecasting. within this project's framework, wb et al. describe gleamviz [ ] , their tool for epidemic exploration, which includes a simulator of transmission based on accurate demographics of the world's population, over which they superpose a (stochastic) mobility model. db et al. [ ] use human mobility extracted from airline flights and local commute (based on the gravity model) to predict the activity of the influenza virus based on monte carlo analysis. sm and sm [ ] study the role of population heterogeneity and human mobility in the spread of pandemic influenza. in [ ] , the authors reconstruct contact and time-in-contact matrices from surveys and other socio-demographic data in italy and use these matrices for simulation. epigraph uses meteorological data provided by the spanish meteorological agency (aemet) to generate environment-dependent influenza simulations. the preprocessing stage is performed to obtain clean inputs for the meteorological model. first, the weather station nearest to each simulated urban region is identified. our simulations consider different urban regions with more than , inhabitants. in some cases, the station is within the city limits, while in others it is located in a nearby area (for instance at the region's airport). the data from each weather station is analyzed to reconstruct potentially missing samples. some station data samples are missing because the station was not operational during a given time period. these represent just a small fraction of the overall samples, but they have to be properly addressed. figure shows an example of how the original missing data (shown in the upper figure) is reconstructed, producing a complete sampling (reconstructed values are shown in the lower figure in red). in order to add the missing samples, we have used the reconstruct data algorithm (missdata) included in matlab's system identification toolbox. this toolbox permits the construction of mathematical models of dynamic systems starting from measured input-output data. the resulting data is then processed to filter non-realistic values: some weather stations produce abnormal samples corresponding to non-realistic values that are too large or too small.
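a comparable cleaning step can be sketched in python (the paper itself uses matlab's system identification toolbox plus custom peak correction); the 10-minute sampling frequency and the one-day despiking window used below are assumptions for illustration only.

```python
# illustrative gap filling and despiking for a single weather-station variable.
import numpy as np
import pandas as pd

def clean_station_series(s, freq="10min", spike_window=144, n_mad=6.0):
    """s: pandas series indexed by timestamp; freq assumes 10-minute sampling,
    spike_window=144 is roughly one day at that rate."""
    s = s.asfreq(freq)                                    # expose missing samples as nan
    s = s.interpolate(method="time", limit_direction="both")
    med = s.rolling(spike_window, center=True, min_periods=1).median()
    dev = (s - med).abs()
    mad = dev.rolling(spike_window, center=True, min_periods=1).median()
    s = s.mask(dev > n_mad * (mad + 1e-9))                # flag abnormal peaks as nan
    return s.interpolate(method="time", limit_direction="both")
```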
figure shows an example of this kind of value around sample , . we have corrected these cases with a matlab algorithm we implemented to detect these peaks and correct them using an interpolation of the values from the previous days. these two steps are only performed once for each new meteorological input dataset, and the results can be reused for the rest of the process. this section describes how the r values are obtained from the meteorological conditions. in addition to the notations introduced in the introduction, for the rest of the paper we will use sh for the specific humidity and p*_h2o for the equilibrium water vapor pressure. a value related to sh is the absolute humidity (ah), which is the mass concentration describing the amount of water vapour per volume of air. previous studies [ , ] suggest that ah (and by extension sh) is one of the main factors affecting influenza virus transmission. in epigraph we adopt the results of the regression model used by ab and js [ ] . in their paper, they analyze monthly weather and influenza mortality data collected between and throughout all the us urban counties. using regression, they conclude that there exists a strong correlation between absolute humidity and mortality, even when controlling for temperature, when the humidity drops below daily means of g/kg. temperature correlations also exist, mainly in the daily ranges between - . c and . c. in an earlier paper ( [ ] ) js et al. study the same dataset and simulate influenza transmission via a sirs model modulated by the data to obtain the basic reproduction number r . they also find best-fit parameter range combinations of r max between . and , and r min between . and . . we adopt the best-fit pair (r max , r min ) they report: r max = . , r min = . . from the definition of the specific humidity (sh) and relative humidity (rh)-see rhp and dwg [ ]-we know that: ( ) we also know from buck's equation that the equilibrium water vapor pressure can be calculated using the formula: where the temperature t is measured in degrees celsius. this formula works best for values of t in the range of - c to c. from known values of rh, p, and t, and using eqs. ( ) and ( ), we can calculate the specific humidity. from laboratory experiments by js et al. [ ] we have: ( ) where a = − , b = log(r max − r min ) and q is the m above-ground specific humidity, which we approximate by sh at the given temperature. in this way, we obtain a value for p*_h2o in every sample (taken every minutes) using eq. ( ) . from this value, in combination with the values of rh and p, we obtain the value of sh using eq. ( ). finally, eq. ( ) computes the new r values for each urban region. the r values are, therefore, time-dependent values that determine, in a stochastic process, how many susceptible individuals among an infected person's connections could potentially become infected. this is the dynamic component of the infectivity of an individual with respect to the others. the other dynamic component is the stochastic transition between infective states [ ] , computed with variable probabilities. our model does not differ across the different types/subtypes of influenza. the values of the model parameters (basic reproduction numbers for each stage of the disease) were chosen to fall within the ranges published by ab and js, which are based on actual data for all types of influenza over years. we choose fixed r values within these ranges, although this is a parameter that can be configured to vary.
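the following sketch strings these steps together (buck's equation for the equilibrium vapour pressure, specific humidity from rh and pressure, and the exponential humidity forcing of r ). the buck constants are the standard published ones; the coefficient `a` and the (r max , r min ) pair are left as parameters because their exact values are not reproduced in the text here, so treat the defaults as assumptions.

```python
# hedged sketch of the humidity-driven reproduction number described above.
import numpy as np

def r0_from_weather(t_c, rh_pct, p_hpa, r0_max, r0_min, a=-180.0):
    """t_c: temperature (celsius); rh_pct: relative humidity (%); p_hpa: pressure (hpa).
    r0_max, r0_min: best-fit bounds adopted from the cited study (values not shown here).
    a: assumed coefficient; values around -180 (q in kg/kg) appear in the
    humidity-forced transmission literature."""
    p_sat = 6.1121 * np.exp(17.502 * t_c / (240.97 + t_c))   # buck's equation, hpa
    e = (rh_pct / 100.0) * p_sat                             # actual vapour pressure, hpa
    q = 0.622 * e / (p_hpa - 0.378 * e)                      # specific humidity, kg/kg
    return np.exp(a * q + np.log(r0_max - r0_min)) + r0_min  # humidity-modulated r0
```

evaluated on every weather sample of a station, this yields the time-dependent r series used for the corresponding urban region.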
on the other hand, the evaluation was performed over data from the - influenza season over the whole territory of spain, for all types of influenzas that were diagnosed. we consider that both the choice of r (based on exhaustive data) and the evaluation against real reported cases across spain are comprehensive enough to validate our results. the spanish influenza sentinel surveillance system (sisss) comprises networks of sentinel physicians (general practitioners and pediatricians) in of the spanish regions, as well as the network-affiliated laboratories, including the national influenza reference laboratory (national centre for microbiology, world health organization national influenza centre in madrid). more than sentinel physicians participated each season covering a population under surveillance of around one million-see [ , ] . sentinel physicians reported influenza-like illness (ili) cases-integrating virological data collected in the same population-detected in their reference populations on a weekly basis, following a definition based on the eu-ili, as described in [ ] . for influenza surveillance, they systematically swab (nasal or nasopharyngeal) the first two ili patients each week and sent the swabs to the network-affiliated laboratories for influenza virus detection. the information collected by the sisss includes data on demographics, clinical and virological characteristics, seasonal vaccination status, chronic conditions, and pregnancy. data is entered weekly by each regional sentinel network in a web-based application [ ] and analyzed by the national centre of epidemiology to provide timely information on the evolving influenza activity in spanish regions and at the national level. for example, during the - season, sentinel physicians and pediatricians participated to sisss and surveyed a total population of , , , which represents . % of the total population of spain. we obtained the sisss data from the national center of epidemiology, institute of health carlos iii of madrid (isciii). in order to produce realistic simulations, epi-graph has to be properly configured. this configuration process consists of setting up two parameters: the date of the epidemic onset and the initial influenza-illness rates for each urban region. the first parameter is the time of onset of the epidemics, which occurs during week of . at this time the national average incidence values for influenza are greater than cases per , inhabitants, which is the threshold determined by siss, based on data from the - seasonal epidemic, to be the start of the influenza season. in our simulation the exact date is the th of december of . the second parameter values were obtained from influenza surveillance data obtained from the sisss corresponding to the influenza season - . from this data we obtained the reported weekly ili rate at national and regional level in spain. the data for the murcia and galicia communities are not available and we approximated them based on the data from the nearest community. these rates allow us to approximate the initial number of (clinically) influenza-like-infected individuals using the following formula, based on the study published online (in march ) in the lancet respiratory medicine by ach et al. [ ] . where n report are the cases that demanded medical attention, as reported by the sisss, f pos is the fraction of positive cases, symp is the percentage of symptomatic individuals, and attend is the percentage of those with symptoms that see a doctor. 
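the equation itself is not reproduced in the extracted text; one reading consistent with the variable definitions (reported ili cases restricted to influenza-positive cases, then inflated by the symptomatic and consultation fractions) is sketched below and should be treated as an assumption rather than the authors' exact formula.

```python
# assumed scaling from reported ili rates to the total infected rate.
def initial_infected_rate(n_report, f_pos, symp, attend):
    """n_report: reported ili rate (e.g. per 100,000 inhabitants);
    f_pos, symp, attend: fractions in [0, 1]."""
    return n_report * f_pos / (symp * attend)
```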
for instance, for the reported n report = cases per , inhabitants in week , and with values f pos = % (empirical value for - in spain), and symp = %, attend = % (values taken from the cited study), we calculate that the total number of infected individuals is of approximately cases per , inhabitants -or . % of the total population. we use this value to set up the initial conditions of the simulation (described in this section), but also to validate its results. each community has a different n report , which leads to different numbers of initially infected individuals. epigraph allows modeling at the level of each individual, and thus can simulate the effect of vaccination policies. to produce realistic results, we use different influenza vaccination coverages by age group; for those older than we have used the vaccination ratios (per community) provided by the ministry of health, social services, and equality of spain. these values correspond to vaccination coverages collected by the national health system [ ] . for the rest of the population (individuals younger than ) we have used the data provided by the spanish statistical office, which is based on surveys done in each community. given that the data are available at community level, we assume that all the urban areas located in the same community have the same vaccination coverages. table shows these percentages per community and age. as input of the mobility model, we use % workers and % students for short distance travel, and % workers, % students, % retired individuals, and % unemployed for long distance travel. while epigraph accounts for many of the components that influence the spreading of the virus, the behavior of these parts and the values of the parameters (such as the initial infectious individuals or the vaccination rate) are unavoidably approximate. on their website, the world health organization reports that in annual influenza epidemics, - % of the population are affected with upper respiratory tract infections [ ] . we have therefore introduced a scaling factor which adjusts the infection propagation rate of each individual to produce, for each urban region, a final infection rate between % and % of the total population. these values are obtained in a pre-calibration phase of epi-graph for the real climate conditions-performed only once-and are then used for all subsequent simulation experiments. note that this is the only data -which is also based on real data-that we use for the calibration process. we do not calibrate the model to an existing epidemic curve. once calibration is done, we use data from sisss, which records influenza-like-illness cases that are not confirmed by laboratory tests, for setting the initial simulation conditions of each urban area. this fact doesn't affect the validity of our results because the purpose is to compare yearly/monthly numbers under different climate conditions rather than know the accurate number of infected individuals. we have performed different tests to validate our approach and simulator. we first validated the simulator against influenza surveillance data, then we evaluated two different environmental scenarios. we believe that our simulator can be useful to predict the short-and mediumterm spread of an infection, as well as to assess the effects that changes in climate can have over influenza epidemics worldwide. the first scenario involves real climate values from aemet and allows studying the short-and medium-term propagation for influenza strands. 
for the second set of scenarios we generate fictitious values of rh and t by scaling the real values. our idea is to study the effects of changing climate conditions on influenza propagation. simulations occur across the largest cities in spain, which account for a population of , , inhabitants. the time span is months, starting from the day identified as the onset date in our data -the th of december of . in our experiments we have used data from weather stations of the national network, distributed across the country. each weather station collects the values of temperature, atmospheric pressure, and relative humidity every minutes during the entire . these consist of about , data samples per station and . million data values in total. based on these values, we generate the basic reproduction numbers to obtain an r value per urban area every minutes. with the previously determined initial influenza-like rates per region and the (year-specific) date of onset, and after calibration, the data for each urban region -vaccination rates, individuals' characteristics, initial infective individuals, and r values -are loaded from files. we validate our simulator in terms of its capacity to predict propagation results qualitatively similar to those approximated from the influenza surveillance data recorded by sisss. the simulated values for each of the spanish regions are the aggregated values of all the urban regions belonging to it. figure shows the simulated and actual estimated data. the simulated values are scaled so that the largest simulated value equals the maximum real value. this allows a comparison of the evolution of the influenza propagation for each community over time. we can observe that, although not perfect, the prediction shows an evolution similar to that of the real scenarios. note that the simulator considers an approximation of the real conditions during the simulated period; producing a better (though unlikely perfect) fit between the two domains would require considering all the factors of the real world that affect flu propagation at the national level. some of these are possibly unknown, others are not currently measured, and yet others are not possible to measure. the reason for scaling the data is that the simulated and actual estimated data reflect the population rather differently. on one hand, the simulated values correspond to the overall number of individuals infected with influenza across the considered urban areas. these take into account all the individuals within the simulated areas, but only include the largest urban regions (above , inhabitants); small cities, towns, and villages are not considered. on the other hand, the influenza surveillance data are only related to a small fraction of the existing clinical cases: sisss covers a representative but small percentage of the population, in addition to the fact that there are more cases than those reported due to people not seeking medical attention. in contrast, the reported cases are collected from the complete community (including both large and small populations). it is thus not possible to compare the absolute values of the two data sources, although they should be linearly related. figure shows an example of the value of these parameters for the urban region of terrasa (barcelona) over one year. we can observe strong variations of r that are related to the changing temperature, relative humidity, and pressure conditions.
to evaluate the effect of both real and hypothetical meteorological climate changes on the spreading of influenza we evaluate temperature variations of t degrees and percentage variations of the relative humidity prh. t = and prh = . correspond to the initial scenario with the original climate conditions. studies show that climate change is producing increments in the average temperature (amplified by pollution) and, in southern europe, longer periods of drought. the idea is to evaluate the impact of these changes on the influenza propagation. in this section, we consider long-term meteorological climate changes, that is, changes in the climate conditions that extend to the entire simulated period of weeks. in this context, we evaluate two different scenarios, probably not as complex as future real climate changes. the first one corresponds to drought conditions, when the relative humidity values (rh) are smaller than current ones. we have considered a reduction of the relative humidity from % to % in increments of % (rh values half than the original ones). according to the infection model, influenza propagates easier for smaller rh values; we thus expect to observe a larger effect. figure shows the overall percentage of infected individuals per community predicted by epigraph. the diminishing rh has indeed a strong impact on the number of infected individuals. on average, . % of the population was infected in the base case (reduction factor equal to ), while the average infection rate for . factor is . %. we can observe that a percental reduction of rh of % produces an approximate increment of . % in the final infection rate. the second scenario evaluates the impact of an increase of temperature on the propagation. figure shows the final infection rate for an increment of the temperature between degrees (current case) and degrees celsius. we can observe that now there is a reduction in the infection rate when the temperature increases. now, an increment of degrees reduces the average infection rate from . % to . %-a decrement of . % per degree. both scenarios assume that the values of the parameters (rh, t) change one at a time. this is a simplification, and the idea behind this approach is to evaluate the impact of a single parameter variation on the overall influenza outcome. however, epigraph supports specifying any changing combination of climate conditions. in a more realistic scenario both parameters would change, and the climate specialists are those who should define what the concrete values are. figure shows the combined effect of temperature and relative humidity change on the average nation-wide infection rate. we have plotted two planes: the first one (colored) represents the average infection rates for different increments in the temperature values and percentile reductions in the relative humidity; the second one (green) displays the infection rate of the original scenario (without climate variation) for all the coordinates and represents the baseline case. the two planes intersect in the lowerleft point, where the temperature and rh have the original values. although both parameters influence the final infection rate, relative humidity has a larger effect than temperature. figure shows the effect of rh and temperature variations on the infection distribution for andalucia community. we can observe that the variation of both parameters changes the shape of the distribution, especially in terms of the peak values but also -more subtly -in terms of the propagation interval. 
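a rough sketch of how such a climate-scenario sweep can be organized around the r0_from_weather sketch given earlier is shown below; `run_epigraph` is a hypothetical stand-in for the actual simulator call, and the default scale/offset grids are illustrative, not the exact values used in the paper.

```python
# hypothetical scenario sweep: scale relative humidity, offset temperature,
# recompute the r0 series per urban area, and rerun the simulation.
def sweep_scenarios(weather, run_epigraph, r0_max, r0_min,
                    rh_scales=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5),
                    t_offsets=(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)):
    """weather: {area: {"t": array, "rh": array, "p": array}} per-sample series.
    run_epigraph: placeholder callable returning, e.g., the final infection rate."""
    results = {}
    for scale in rh_scales:
        for dt in t_offsets:
            r0_series = {area: r0_from_weather(w["t"] + dt, w["rh"] * scale, w["p"],
                                               r0_max, r0_min)
                         for area, w in weather.items()}
            results[(scale, dt)] = run_epigraph(r0_series)
    return results
```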
the maximum and minimum % confidence intervals for the baseline scenario (no rh reduction or temperature increment) range between . and . for urban areas in castilla la mancha and aragón, respectively. these results are produced by repeating the simulations times. note that there is already uncertainty in the input data, both with respect to the number of initially infected individuals and from the point of view of the epidemic model. to evaluate the effect of short-term changes in climate conditions, we modify rh and the temperature exactly as described in the previous section, but only for the first week of the simulation. the rest of the simulation uses the original climate parameters. figures and show the final infection rate for different variations of rh and temperature. we can observe that the impact on the overall percentage of infected individuals is still important, particularly for a decrease in rh of . or more and -less evidently -for an increase in temperature of degrees or more. for smaller changes in temperature the effect is less evident, but we believe that this is because the short-term temperature increase is simulated for only one week.
(figure caption: effect of long-term changes in the relative humidity (percentile reduction) and temperature (value increment in celsius) on the influenza propagation for the average nation-wide infection rates.)
we achieve herd immunity in two ways: as a result of vaccination campaigns, and naturally when an infected individual moves to the recovered (or dead) state, in which case they become immune and start acting as a propagation stopper. as a result, after a certain threshold of infected versus susceptible individuals, the infection rate naturally goes down. this occurs at the inflection point in the propagation graph, specifically at about weeks (in our data). the vaccination success rate during the - season was approximately % [ ] . given the parameters shown in fig. of [ ] for herd immunity against influenza, we consider that the r values used by our model already take this type of immunity into account. we do not model the level at which herd immunity starts acting as a parameter, although this phenomenon occurs naturally in the simulations. the simulator is flexible enough to support different daily contact patterns for each individual. the probability of an individual getting infected during an interaction also differs (it is a stochastic process), and thus the infection can be transmitted to individuals belonging to different groups. recently, the work in [ ] suggests that rh should also be considered (together with the temperature) as a modulating factor in the influenza propagation. another study that analyses this relationship can be found in [ ] ; it provides a transmission risk contour map based on temperature and rh values. note that our work addresses the problem of evaluating the influenza propagation from a different perspective. instead of analyzing the propagation mechanisms of the virus and how they are related to the environmental conditions, we focus on an empirical relationship between the virus's basic reproduction number and the outdoor specific humidity. the r values used in this work are the combination of both outdoor and indoor virus propagation, and provide an approximation of a real scenario. note also that the main goal of this work is to evaluate the impact of the weather conditions on the propagation.
(figure caption: effect of long-term parameter variation on the infection distribution shape for andalucía; in a, different rh scales are evaluated ( % in red, % in green, % in blue and % in black); in b, different temperature offsets are evaluated ( degrees in red, + degrees in green, + degrees in blue and + degrees in black).)
(figure caption: effect of short-term changes in the relative humidity on the influenza propagation for the different communities considered in the simulation: in color, the average infection rates for different increments in the temperature values and percentile reductions in the relative humidity; in green, the infection rate of the scenario without climate variation.)
a possible limitation is that we only model the largest urban regions in spain; we could add more information related to smaller cities and towns, including rural regions. nevertheless, we do not think this data would make a significant difference in the results, as the infection needs a large number of hosts and/or travel patterns between the infected areas in order to explode. small town and village areas are arguably much less likely than cities to fulfill these roles. a second limitation is related to the meteorological factors affecting the infection propagation: the number and set of climate factors that the meteorological model takes into account, and the choice of the model itself. additional parameters that specialists mention as possible influencers of virus transmission are factors such as wind, precipitation, or pollution.
(figure caption: effect of short-term changes in the temperature on the influenza propagation for the different communities considered in the simulation.)
one important thing to underline is that the data on which the study [ ] (whose model we adopt) is based consists of real cases and spans years. interactions between meteorological trends and human behavior are therefore intrinsically reflected in the data, although the rules of behavior change are not explicitly specified for the agents (i.e. individuals) involved in the simulation. the case can be made that meteorological changes were not as extreme before , and that a regression model based on new data may also change over time. while this is a definite possibility, we believe that its nature will not change in a fundamental way, such that we can still predict trends, if not absolute values. a third limitation is that we do not calibrate the model on an epidemic curve, which results in different timings of the flu peaks in some regions, such as navarra and madrid. finally, successfully simulating flu epidemics requires leveraging many different types of data, most of them in large amounts, as input and calibration measurements for our tool (epigraph). for instance, we use social network data from enron and facebook to set up the population interaction patterns, census data to extract the characteristics of the different types of individuals, google maps to initialize the transportation module, data from aemet to run simulations that are realistic from a meteorological viewpoint, and weekly ili rates obtained from the sisss to initialize and evaluate the simulator. this makes the implementation of epigraph more realistic, a strength that can lead to more accurate simulations. we have extended our simulator epigraph with a meteorological model that interacts with the rest of the system to better reflect the behavior of the influenza propagation through the entire population of spain.
to produce realistic results we also take into account vaccination, with different ratios based on the individuals' ages. the simulator results are compared to real data on infection rates and across the whole country. the results for the prediction of the evolution of the influenza propagation for each community over time are similar in shape to the real data. after validating the simulator, we evaluate different scenarios that reflect changes in climate conditions, and show the predictions for variations in the relative humidity and temperature. lastly, we make epigraph's source code publicly available at [ ], to be used by the scientific community. as future work, an interesting, although independent, possibility is to investigate the potential of epigraph to simulate the evolution of the virus spread for different subtypes of influenza, once the propagation model parameters (e.g. incubating period, infectious period, basic reproduction numbers, etc.) are known, or to narrow down the possible subtypes in early phases of an infection. one could also investigate the impact of new meteorological factors on the evolution of the infection. world healh organization influenza (seasonal) global influenza seasonality: reconciling patterns across temperate and tropical regions climatic factors and influenza transmission, spain leveraging social networks for understanding the evolution of epidemics absolute humidity, temperature, and influenza mortality: years of county-level evidence from the united states modelling seasonality and viral mutation to predict the course of an influenza pandemic absolute humidity and the seasonal onset of influenza in the continental united states absolute humidity modules influenza survival, transmission, and seasonality influenza virus transmission is dependent on relative humidity and temperature high temperatures ( degrees c) blocks aerosol but not contact transmission of influenza virus climatological and geographical impacts on global pandemic of influenza a(h n ) role of absolute humidity in the inactivation of influenza viruseson stainless steel surfaces at elevanted temperatures towards efficient large scale epidemiological simulations in epigraph emergence of drug resistance: implications for antiviral control of pandemic influenza containing pandemic influenza with antiviral agents an influenza simulation model for immunization studies synchrony, waves, and spatial hierarchies in the spread of influenza coupling effects on turning points of infectious diseases epidemics in scale-free networks relevance of workplace social mixing during influenza pandemics: an experimental modelling study of workplace cultures simflu: a simulation tool for predicting the variation pattern of influenza a virus forecasting seasonal outbreaks of influenza parallel agent-based simulator for influenza pandemic the gleamviz computational tool, a publicly available software to explore realistic epidemic spreading scenarios at the global scale seasonal transmission potential and activity peaks of the new influenza a(h n ): a monte carlo likelihood analysis based on human mobility the role of population heterogeneity and human mobility in the spread of pandemic influenza little italy: an agent-based approach to the estimation of contact patterns-fitting predicted matrices to serological data absolute humidity modulates influenza survival, transmission, and seasonality global environmental drivers of influenza perry's chemical engineers' handbook sentinel surveillance system. 
characterisation of swabbing for virological analysis in the spanish influenza sentinel surveillance system during four influenza seasons in the period epidemiology of the influenza pandemic in spain. the spanish influenza surveillance system amending decision / /ec laying down case definitions for reporting communicable diseases to the community network under decision no / /ec of the european parliament and of the council comparative community burden and severity of seasonal and pandemic infl uenza: results of the flu watch cohort study coberturas de vacunación en mayores de años effectiveness of the - seasonal trivalent influenza vaccine in spain: cyceva study the vaccination coverage required to establish herd immunity against influenza viruses mechanistic insights into the effect of humidity on airborne influenza virus survival, transmission and incidence aerosol influenza transmission risk contours: a study of humid tropics versus winter temperate zone we would like to acknowledge all the sentinel general practitioners and pediatricians, epidemiologists, and virologists participating in the spanish influenza sentinel surveillance system. part of the input data used in this work have been obtained from the spanish influenza sentinel surveillance system and the meteorological information provided by spanish national meteorological agency, aemet, ministerio de agricultura, alimentación y medio ambiente. authors' contributions des., mcm. and jc. designed and implemented epigraph simulator. all authors conceived and designed the experiments. des. processed the input data and run the experiments. des. and mcm. wrote the paper. cd., dgb., and al. provided insights on the validity of our assumptions, recommended additional related work, and contrasted the results with their own findings. all authors review the different manuscript drafts and approved the final version for submission. the author(s) read and approved the final manuscript. this work has been partially supported by the spanish "ministerio de economía y competitividad" under the project grant tin - -p "towards unification of hpc and big data paradigms". the work of maria-cristina marinescu has been partially supported by the h european project growsmarter under project grant ref. . the role of both funders was limited to financial support and did not imply participation of any kind in the study and collection, analysis, and interpretation of data, nor in the writing of the manuscript. epigraph's user manual and source code are publicly available at [ ] and can be used by the scientific community. the dataset supporting the conclusions of this article is available in the [ ] repository. not applicable. not applicable. the authors declare that they have no competing interests. key: cord- - ih jdpe authors: shibuya, kazuhiko title: identity health date: - - journal: digital transformation of identity in the age of artificial intelligence doi: . / - - - - _ sha: doc_id: cord_uid: ih jdpe identity health has especially specific meanings for social relationships in contemporary digital age. first, computerized digital communication makes many citizens in severe maladaptation. the who often warns mental addictions of internet usages and online gaming among the youth. the advent of social media and online networking has endangered them in ambiguous situations which are not stabilizing in those basic grounds for human relationships. 
further, because social networking sites and social gaming frequently enforce each member to interconnect with the others, many of participating members often hold harder mental debts to respond and maintain their interconnections. in this situation, in other words, it can say that all of users simultaneously might share common conditions under mental illness. who (the world health organization) has already published their warning reports on gaming disorder and its mental addiction (who ) . and their site also says: "gaming disorder is defined in the th revision of the international classification of diseases (icd- ) as a pattern of gaming behavior ("digital-gaming" or "videogaming") characterized by impaired control over gaming, increasing priority given to gaming over other activities to the extent that gaming takes precedence over other interests and daily activities, and continuation or escalation of gaming despite the occurrence of negative consequences". those who can be identified as icd- case are not globally well known, but suspicious cases are roughly estimated as adults (approximately . million) and youths (approximately a million) in japan (data at ). since the beginning of the internet revolution, internet communication and online activities among the youth had been frequently accused by serious concerns in mental health. the first is violent behavior induced by playing violence game. until recently, video game has been argued by media psychologists (xu ) . they often verified violent behavior of adolescents related to experiences and the extent of game playing. the causal relationships across those patterns have not been clarified enough yet. some reasons endorsed by experimental designs are still difficult to interpret clearly (bruner and bruner ) . secondly, there are still controversial issues on online addiction cases (griffiths ) . colleagues ( , ) have been publishing serial reports on online addictions. the number of people diagnosed with the condition of addiction for internet usage has been increasing in the era of social media. especially, adolescents and young adults are eager to immerse in online cyber-world activities. they consume their time through browsing web, watching online video, participating in online games, and internet communication. in japan, mic (ministry of internal affairs and communications) at , reported the latest statistics that the total average of internet usage time of young citizens (ages vary from to ) was h and min every day. their motivations for internet usages diversified categories such as watching online movie ( %), playing games ( %), and communicating by sns and emails ( %) (multiple answers). the most important is how we think about such online additions and youth's sound development. these are a kind of mental illnesses and conditions as a maladaptation of gaming and social withdrawals from actual society, or they are overadaptation in somewhat online communities rather than physical environment. the former is to step further grounding in social living, and the latter may suggest that they extraordinarily prefer to online human relationships. it should be clinically observed in each case. thus, online gaming and social networking sites are indeed based on somewhat human relations and such online communities organized by providers often offer to share some comforts, cooperative achievements, and entertainments among active participants. 
and simultaneously those services require much engagement among participants, and then such mental obligations enforce each participant to keep playing online game and committing with the other online partners during longer times (oberst et al. ) . here, it experientially indicates that group commitments reduce anxiety of members and enhance comfort and mental bonding among members (leary and baumeister ; baumeister et al. ) . their group life intends to maintain such conditions in in-group memberships, and commitment belonged in group whether offline or not has crucially important meaning for them. to date, advanced data analysis on our phr (personal health record) and personality dispositions has been conducting in medics (pol and thomas ) . further, the advancement of big data and the ai driven medical services is to rush into the daily life contexts (marin et al. ; king et al. ) . namely these services can offer to assess and promote both mental and physical health of each individual. first, as mentioned valuation by the ai at chap. , those services already contain some questionnaires on psychological assessments related to personality characteristics, health attitudes, and social adaptation in daily life. those assessed data might intend to statistically reveal our strength of mental health and degree of adaptation in social relations, and then automatic prediction for those who answered personality tests enables to trustfully measure financial limitations for loans and transactions in actual contexts. therefore, our mental conditions and its social adaptations, to date, have been unveiled by such ways, and those applied services could be built for somewhat vigilance system on mutual trust among citizens. mental disorder and maladaptation of each individual have possibilities to further pervade unsound influences among the others, and vice versa. financial bad-debt by personality dysfunctions of individual will also engender chain bankruptcy among stakeholders, but those services would intend to predict such consequences in advance through checking personality maladaptation in daily life. hence, our digital life has been already founded in those mechanisms, and the ai and big-data operations indeed interlude into our mentality. secondly, wearable devices and sensing tools for human behavior can help monitoring and analyzing latent patterns of physical and mental conditions in daily life (morahan-martin and schumacher ; clifton ; zhu et al. ) . and telephone communication patterns using smart phone can be interacted by identifiable chronotype of each user in daily activities (aledavood et al. ) . medical cares in each country has the demands to organize national health information systems, and it includes big data in the relations to national assurances for health cares, medical quality, demographic statistics, financial investment, quality of life (qol), quantifications for modeling (shibuya ) , and other social welfares (brady et al. ) . those information systems may be governing own centered database, but it will be replaced by distributed blockchain database in the future. during the previous era, there were some troubles yet to bridge between clinical psychology and sociological studies as well as computational models and experimental cases. sympathy interpretations for client's latent mental process and their needs should be taken carefully. 
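The preceding discussion notes that wearable devices and smartphone sensing can expose latent patterns in daily physical and mental condition. A minimal sketch of that idea follows; the heart-rate stream, window size, and threshold are hypothetical and chosen only for illustration, not for clinical use.

```python
from collections import deque

def flag_anomalies(readings, window=7, z_thresh=2.5):
    """Flag readings that deviate strongly from a rolling baseline.

    `readings` is a list of (day, resting_heart_rate) pairs; the window
    size and z-score threshold are illustrative assumptions only.
    """
    history = deque(maxlen=window)
    flags = []
    for day, value in readings:
        if len(history) == window:
            mean = sum(history) / window
            var = sum((x - mean) ** 2 for x in history) / window
            std = var ** 0.5 or 1.0          # avoid division by zero
            if abs(value - mean) / std > z_thresh:
                flags.append((day, value))
        history.append(value)
    return flags

if __name__ == "__main__":
    stream = [(d, 62 + (d % 3)) for d in range(1, 15)] + [(15, 95)]
    print(flag_anomalies(stream))            # the spike on day 15 is reported
```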
but case studies often mean that there are no effective ways to explore future patterns and expectations based on past clinical cases in the daily interactions (leary ; kircher and leube ) . traditionally, the clinical psychological way is usually beneficial to manage mental dynamics that are impossible to be generalized and formalized. clinical psychology and its fundamental assumptions usually hesitate to do generalizations from clinical case studies. that is because each personal condition and mental distress may be too individualized, and rather researchers in this field recommend qualitative and intensive caring ways for understanding each personal experiences embedded in actual conditions. historically, emerging diseases have been suffering us ever since our civilization (roeser et al. ) . human history could be said as somehow survival process from lethal diseases. for example, there were smallpox, pest, dysentery, tuberculosis, and other diseases. otherwise recent outbreaks of emerging diseases such as hiv, ebola, sars (severe acute respiratory syndrome : shibuya ) , zika, and others are still ongoing matters, and those medical ways such as drugs and examination tools have not completed enough yet. in contrary, smallpox can be exemplified as a success case. the humankind had finally achieved the extinction of this disease threat in natural conditions using vaccine as a land-breaking medical way. parts of above diseases could be cured by specific medicines, but known well, there is another problem on resistant bacteria against those medicines. to date, in those fields, immunological and medical investigations have been accelerated by biotechnological and gene-technological advancements. those who obtained the nobel prize in physiology or medicine, and nobel prize in chemistry contributed toward enhancing medical progress for wellbeing of the humankind. in fact, there were contributions by japanese scientists in the medical field (e.g., tonegawa, s, yamanaka, s, ohmura, s, ohsumi, y, etc.). but, the humankind cannot completely repel both disease and death. according to the who report "top causes of death globally ," infectious diseases such as lower respiratory infections (over million deaths), diarrheal diseases (nearly million deaths), and tuberculosis (nearly million deaths) still remain within list of top . otherwise, the worst three cases were ischemic heart disease (nearly million deaths), stroke (nearly million deaths), and chronic obstructive pulmonary disease (approximately million deaths). additionally, total deaths caused by alzheimer disease and other dementias can be estimated to approximately million per year globally. in those areas, as mentioned before, computational searches by the ai driven system and big-data analysis will boost enhancing diagnosis of subtle symptoms, image processing on medical data, pharmacologic utility discovery, and statistical precisions for future risks of each patient. precision of diagnosis by the ai's pattern recognition systems on medical images has outperformed more accurately than human doctors (zhang et al. ) . as larger growing size of knowledgebase on medical science increasingly requires much experience for them, and it will be impossible to operate any clinical cases unless the ai's supports can be provided. 
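The passage above also points to statistical prediction of future risks for individual patients. The following toy sketch, using entirely synthetic tabular features and a scikit-learn logistic regression, shows the general shape of such a risk-scoring model; it does not correspond to any of the systems cited.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# hypothetical features: age, systolic blood pressure, smoker flag
X = np.column_stack([
    rng.integers(20, 80, n),
    rng.normal(125, 15, n),
    rng.integers(0, 2, n),
])
# synthetic outcome loosely driven by the features (illustration only)
logit = 0.04 * (X[:, 0] - 50) + 0.03 * (X[:, 1] - 125) + 0.8 * X[:, 2] - 1.0
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", round(model.score(X_te, y_te), 3))
print("risk score for a 70-year-old smoker with BP 150:",
      round(model.predict_proba([[70, 150, 1]])[0, 1], 3))
```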
and other computational contributions to epidemics can enumerate such as computer simulations of mathematical models on spreading infectious diseases (e.g., sir (susceptible-infected-recovered), small-world networking model (moore and newman ; newman ) ), gene analysis of virus in bioinformatics (ksiazek et al. ) , and risk management on health data of patients and medical policies for future controls of emerging diseases (shibuya ) . in terms of medical cares, epidemiological actions should lay weights on governmental policy, because there are great needs to control against secondary contagions and predict precisely dynamic trends on diseases. traditionally, those fields must be accumulated from onsite clinical data on diagnosis of patients and analyze statistical trends which localized in each region. regarding these concerns, to date, online query results mostly reflect citizens' intentions and latent needs for specific actual events in society. recently, ginsberg and his colleagues ( ) had unveiled such facts by their big data, and their findings by statistical analyzations on logistic positive correlations between actual trends of data from cdc (centers for disease control and prevention, usa) and web-query data among citizens had become a pioneer for big-data age. namely this study clearly suggested that citizens were usually apt to seek more accurate and necessary information in uncertainty conditions such as disaster (see chap. ), unwelcome infectious disease, terrorism, and other fascinated events. their study seemed to be a first breakthrough for researches using web data analysis. as an implication of this finding, many researchers realized significant meanings on synchronizing and corresponding evidence between web trends and offline events. namely "big data" can be analyzed by computational engineering methodologies such as artificial intelligence, statistical machine learning techniques, and natural languages processing, and it can open the gate to investigate novel findings automatically. let me exemplify an actual case. according to data from the who, , from the spring of to , global pandemic caused by new type of influenza (h n ) had suffered global citizens. the total amount of the death was estimated as , in globally (the present data at ). at the peak of this pandemic, the author investigated those trends using google insights for search services. japanese patients were roughly estimated as totally . million (it finally includes at least death cases), nevertheless many citizens have traditional customs encouraged to treat and keep their hygiene in daily living. below fig. . indicates a trend of google query result inputted into keyword "influenza" in japanese. at , it was certainly that there were mostly three peaks during this year. and fig. . , in contrary, shows only seasonal trends of ordinary influenza (except for data of pandemic patients), and both peaks (earlier weeks of this year and the late of year) can be identified during year. seasonal trends on influenza can be also recognized in each year, and only bizarre peak around the middle of this year can be specified. namely, because the pandemic caused by new type of influenza virus occurred around the beginning of may , it can understand that the middle peak in fig. . was underlying in above pandemic influences. 
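The correspondence described above between query trends and surveillance counts can be checked with a simple correlation. The sketch below assumes two hypothetical weekly series (normalized query volume and reported cases); the cited query-trend studies use far richer models and real surveillance data.

```python
import numpy as np

# hypothetical weekly series: reported cases with one seasonal peak,
# and a noisy query-volume series that tracks them
weeks = np.arange(1, 53)
cases = 200 + 150 * np.exp(-((weeks - 20) ** 2) / 30)
queries = 0.8 * cases + np.random.default_rng(1).normal(0, 20, 52)

# Pearson correlation between the two series
r = np.corrcoef(queries, cases)[0, 1]
print(f"Pearson r between query volume and reported cases: {r:.2f}")

# a one-week lag check: do queries lead the reported cases?
r_lag = np.corrcoef(queries[:-1], cases[1:])[0, 1]
print(f"correlation with queries shifted one week earlier: {r_lag:.2f}")
```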
and then, in this japan case, google trends could entirely indicate correlational patterns between information needs among citizens and actual influenza trends, and each peak corresponded with seasonal or pandemic ones. as an alternative of google query, using twitter as one of microblogging tools, signorini et al. ( ) revealed synchronizing phenomena on online tweets about influenza corresponded with actual trends of influenza given from cdc data, and https://www.who.int/csr/disease/swineflu/en/ https://www.who.int/wer/ /wer .pdf?ua= implementation of the international health regulations ( ) http://apps.who.int/gb/ebwha/ pdf_files/wha /a _ -en.pdf?ua= fig. . trends of google query result which inputted into a keyword "influenza" in japanese. y axis means frequencies of a query on a specific keyword and x axis shows each year in this case they could find efficient results. those metrics have stronger merits being qualified for online and real-time analyzation than trend data obtained from google query. similarity, broniatowski et al. ( ) reported significant correlation between normalized prevalence data on influenza filtered from twitter's real-time tweets data and cdc actual trends of influenza. they further attempted to forecast influenza trends using data from twitter (paul et al. ) , and then social media consequently enables to do real-time sensing among citizens. as succeeding previous chap. , in the digitized society, disaster, environment, and climate data have also become a target for big-data analyzation. larger natural disasters and human-made hazards have globally potentials to corner to the crisis of humanity. because our global society has been endlessly threatened by various disasters (unisdr ), more than million citizens have been globally harming by natural disasters every year. and this data contains more than , deaths per year. natural disasters are almost interrelated to numerous factors such as climate, demography, environment, and anthropogenic events. further, it seemed obvious that complicated factors related to climate changes in global level have . y axis denotes reported new patients, and x axis periodically shows serial weeks. each line (from to ) means each site given data from medical hospital. this figure was cited from idsc (infectious disease surveillance center, japan) (http://idsc.nih.go.jp/idwr/kanja/weeklygraph/ flu.html) been influencing those meteorological disasters (e.g., hurricanes, drought, flood, etc.). of course, the anthropogenic factors (e.g., industrial damages to the environment, carbon gas emissions) should be occupied in system models in order to examine detail mechanisms (meadows et al. ) . especially, at , after serial tragedies of the tohoku quakes and fukushima nuclear disasters, the unisdr as a part of the united nations (at , unisdr was renamed to undrr : the un office for disaster risk reduction) held the global conference on disaster management at both tokyo and sendai city of japan. as consequences of much discussion, the committee finally proposed following the four priorities for global actions which entitled "sendai framework for disaster reduction ." . priority : understanding disaster risk . priority : strengthening disaster risk governance to manage disaster risk . priority : investing in disaster risk reduction for resilience . 
priority : enhancing disaster preparedness for effective response and to "build back better" in recovery, rehabilitation, and reconstruction disaster management means just our preparedness against disasters. certainly, the digitized global world can offer artificial space satellites, wireless internet, mobile computing, social networking services, and other mechanical relief robots. and the ai driven systems and big-data analyzations on the disaster can be useful for us. recently, noaa (the national oceanic and atmospheric administration, usa) has released their online services, which is named as the coastal inundation dashboard. it visually enables people to know and prepare for floods. in this way, nevertheless digitized systems for global monitoring the earth, collaboration with each nation and analyzation of vast necessary data had been pervasively equipped, but our acquired technologies for forecasting and preventing disasters have not been achieved enough yet. as above four priorities said, the undrr as a part of the united nations lays heavier weights on rather mitigations and resilience from disasters. because of such inevitable reasons why our survival efforts from the tremendous disasters are never diminishing, there are still greater needs to consider the human existential issues in disaster management. here, as introduced a bit at chap. , the author had a chance to conduct own researches on the tohoku quake and nuclear disasters in fukushima. and this time, a part of findings and evidences can be exhibited in health topic (shibuya (shibuya , (shibuya , (shibuya , (shibuya , ). after , the fukushima case of nuclear power plant accident had incubated another problem (oecd/nea ines (international nuclear event scale) ranked level (the worst)). this fukushima case can be called as a nuclear power plants' crisis (nrc ; roeser et al. ; oecd/nea ; shibuya ) . at the time, many of fukushima citizens lost not only hometown, but background of their identities. in this point, they forced to be laid in identity crisis. according to theoretical sociologists berger et al. ( ) and giddens ( ) , they commonly argued that western post-modernizations could reconstruct mindsets on reality and social identification ways among citizens during achieving industrial progresses, if above severe incidents of nuclear power plants and those systems failures could be regarded as malfunctions as a symbol of modernity, above consequences of nuclear crisis on the fukushima case (and other human-made disasters) might be contextualized to reexamine social adaptation and consciousness among fukushima citizens by sociological verifications. how did daily belief systems among fukushima naïve citizens against the safety surrounding in nuclear power plants deal with? their attitudes had been rather steadily stabilizing among many of them before the crisis. however, risk cognition, social constructive senses of reality, and meaningful understanding against nuclear disasters among citizens would be collapsed in those conditions. they were betrayed by advanced technologies, government, and optimistic beliefs shared among them. it is probably that their theoretical discussions have few suggestions for any policies on fukushima case more than awkward theoretical bases, and rather there are quite needs to tackle grounding in actuality, resilience of identification, purifying environment, and rebuilding community for the fukushima citizens (science council of japan (scj) ; shibuya (in press)). 
actually, this fukushima case was a reluctantly controversial issue on an accident of nuclear power plant by academic researchers in japan. since the tohoku quake, many natural scientists in japan were very criticized by ordinary citizens. especially these were academic scholars such as nuclear physicists, governmentside natural science researchers and engineers at nuclear power plant, and of course politicians must be also confronted with such serious criticisms (funabashi and kitazawa ) . with deep reflections, the science council of japan (scj) ( ) published globally their investigated documents by belonging scientists and researchers in various academic fields. this report laid stress on the fact description of the fukushima nuclear power plant accidents and the statement of actual conditions for researchers in foreign countries. here, the author conducted to investigate this scj's statement ( ) in depth using text mining. table . shows a result by text mining analyzation on the whole contents of scj's statement, and it depicted a part of frequent and important words (it extracted around top words among the most frequent words) and its total counts within above scj's statement text. at a glance, it appeared that their motivations were implicated by some words such as nuclear, radioactive, cooling, accident, safety, emergency, evacuation, and so on. tepco means an abbreviated name of administrative company for electric power plants in fukushima. figure . shows an example of network structure of words' co-occurrence. in this case, it configured mathematically to color each separated subgraph structure that limited to important co-occurrence words. the author found some clusters of words on radioactive materials, fukushima nuclear plant, accident inside-out and others. these patterns were weighted and frequently articulated by document writers. otherwise, fig. . depicts a result of three-dimensional visualization which analyzed by mds (multi-dimensional scaling: this time was configured by kruskal and jaccard models). this method located statistically each word in cubic dimensions, and it appeared some clusters such as power plants (e.g., power, plant, nuclear, and fukushima), quake (e.g., tsunami, earthquake, situation, and operation) and others. as result of these malfunctions, confidence for ruling party critically had been fallen down. what text mining analysis made clear was that this report concluded the negative consensus against the nuclear hazard among scientific community in japan, and their statements explicitly described that mythical beliefs among stakeholders were no avail in the case of fukushima nuclear power plant accident. obviously, they intended to publish the truths for globally foreign academicians and citizens in terms of mainly nuclear physics and energy engineering after the fukushima crisis. as mentioned at chap. , without doubts, many of global citizens were eager to know more accurate and immediate information in detail at that time of moment. those results by text mining could exhibit japanese governance of risk management against nuclear power plants and energy policies before the fukushima case. there were no rational reasons for excuses that risk communication and consensual discussion had not been openly organized among stakeholders in fukushima, as it was differently the canada's case (johnson ) . 
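The text-mining steps described above (word frequencies, a co-occurrence network, and an MDS layout) can be outlined in code as follows. This is a minimal illustration over a few stand-in sentences, not the author's pipeline or the SCJ corpus, and the library choices (scikit-learn, NetworkX) are assumptions.

```python
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

docs = [
    "nuclear power plant accident released radioactive material",
    "evacuation after the accident protected citizens from radiation",
    "cooling failure at the plant caused the nuclear emergency",
]

# term-document counts, then word co-occurrence via X^T X
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
terms = vec.get_feature_names_out()
cooc = (X.T @ X).toarray()

# build a co-occurrence graph, keeping only off-diagonal links
G = nx.Graph()
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        if cooc[i, j] > 0:
            G.add_edge(terms[i], terms[j], weight=int(cooc[i, j]))
print("sample co-occurrence edges:", list(G.edges())[:5])

# 3-D MDS embedding of terms from their co-occurrence profiles
dist = pairwise_distances(cooc, metric="cosine")
coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print("first term and its 3-D position:", terms[0], coords[0].round(2))
```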
rather, tepco and governmental ministry had oppressed to scientifically contemplate and examine nuclear risks and their published data by ordinary citizens and external professionals. with these backgrounds, the focal point of disputes on the compensations has been accused by plaintiff (i.e., citizens, evacuees, and victims) under trials in courts (oecd/ nea ). for the digitized society, this fukushima case indicated further unneglectable facts. iot and xaas will be deployed anywhere (geng ) . many of those systems such as computational controls and sensing networking related to power plants and sensitive artifacts will be impossible to keep under controls unless electric fig. . it depicts a part of network structure of co-occurrence words by text mining. this configuration was to separately color each subgraph structure which limited to important co-occurrence words power can be continually provided. nuclear power plants and its control systems per se always also require independent electric power for controlling those mechanisms. serial incidents of the fukushima nuclear power plants can be determined by the serious factors on both the loss of external electric power supply and vent malfunctions for refrigerating systems caused by tremendous tsunami attacks. besides, until now, the ai-driven robots sensing inner-damaged power plants cannot be activated over the physical limitations because inner-damaged power plants still remain with higher radiation dose (i.e., the human dies immediately and computational mechanisms will be disabled by radiations sooner). then, the lessons for the digitized future must be intensively deduced and learned by convincible investigations. in the usa, after this disaster, a taskforce team was immediately assembled for investigations to report the cause of the nuclear power plants, reconstruction of the nuclear power plants for safety, and future policy on energy management in the after that, it should turn eyes to persevering purification from the radiation damage caused by the nuclear accidents, its environmental restoration, health monitoring, and socioeconomic reconstruction in community (nrc ; iaea ) . consequently, the fukushima case definitely needs to solve nuclear accidents and future design for long-term reconstruction on their devastated communities and hometown. at least, the following points should be tackled. . decommissioning work for the wrecked nuclear power plants in fukushima . removal of nuclear fuel and substances . chemical management on nuclear substances . purifications in polluted places . temporary storage and final disposal of contaminated soil, water, and garbage . medical health survey and care for victims . risk assessment against environment and people exposed by nuclear substances . city and socioeconomic reconstructions as consequences of serial incidents of the nuclear power plants in fukushima, nuclear pollution provoked the severe disputes on human rights, health, radiation contaminations to foods and environmental restorations. both environments (soil, waters and air) as well as ordinary peoples who lived in fukushima were polluted and exposed by both nuclear substance and radiation (gibney ; merz et al. ; oecd/nea ) . sampling data accumulated by agricultural scientists endorsed that many of crops absorbed radioactive substances such as sr, cs, cs, and others (takahashi ) . and excessive intake and exposure of radioactive pollutants will endanger citizens' and workers' health (hiraoka et al. ) . 
consequentially, enacted provisions often reflect social actualities. for conquests against those hardships of citizens and evacuees, the "basic act on reconstruction in response to the great east japan earthquake ( th, june )" enacted the basic policies for reconstructions. for example, a part of provisions clearly stipulated as follows. • article . the reconstruction in response to great east japan earthquake will be implemented based on the following -the unprecedented disaster resulted in enormous damage, where countless lives were lost, numerous people were deprived of their basic living infrastructures and have been forced to evacuate in and out of the disaster-affected regions. also, the disaster's influence extends over the entire nation; the economic stagnation in the disaster-afflicted areas is affecting business activities and peoples' lives nationwide…(hereafter omitted) at the fy, the total budget of fukushima prefecture for reconstruction after the disaster was approximately . billion yen (including both quakes and nuclear disaster countermeasure portion of . billion yen). and it includes population declining and aging countermeasures as well as restoration of birth number ( . billion yen). in addition, other items were . billion yen for environmental restorations, . billion yen for living reconstruction assistance, billion yen for expenses to protect medical health of the citizens, . billion yen for expenses for children and youths who will be responsible for the future, and other expenses. further, in addition to above costs, the fukushima case requires unprovoked compensations and its litigation disputes are ongoing matters in courts (shibuya ) . namely, there are still requirements to solve future designing for reconstruction from devastation in their hometown. next, the total amount of casualties in japan was more than , at the time of . moreover, as aftermath of the fukushima disaster, one of the hardest matters was collective immigration of evacuees from their hometowns to other places in japan (akabayashi and hayashi ; library of congress ) . at the peak (may ), gross migrants from fukushima (e.g., total population of evacuees) were estimated over , . and including this, gross migrants (e.g., total evacuees in japan) were estimated over , at the time of . this estimation was not too low. please recall similar past cases, for example, the case of chernobyl in reported that total population of evacuees was approximately , around km (ines level ). and, in , the case of the three mile island accident was estimated over , around km (ines level ). table . shows a part of outflow data on migrants across major cities (it includes mobility data within same city). it queried into big data of mic (e.g., demographic data of migration and population) and the geospatial information authority of japan (e.g., geospatial data and distance information). the numbers of citizens lived in fukushima prefecture has been notably decreasing from . million ( ) to . million ( ) . and this area is statistically , km (the third widest area in japan). and it compares fukushima with people lived in major metropolis such as tokyo area (total population is approximately million people and within million people in special districts), nagoya area (approximately . million people within central city), and osaka area (approximately . million people within central city) in japan. 
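Once migration records of the kind summarized in the table are in tabular form, the aggregation itself is straightforward. The sketch below uses a handful of hypothetical origin-destination rows rather than the MIC or census figures.

```python
import pandas as pd

# hypothetical origin-destination records standing in for the MIC
# demographic data and geospatial distance information
flows = pd.DataFrame({
    "origin":      ["fukushima", "fukushima", "fukushima", "tokyo"],
    "destination": ["tokyo", "niigata", "fukushima", "osaka"],
    "migrants":    [1200, 300, 450, 80],
    "distance_km": [230, 140, 0, 400],
})

# total outflow per origin, excluding moves within the same prefecture
outflow = (flows[flows["origin"] != flows["destination"]]
           .groupby("origin")["migrants"].sum())
print(outflow)

# share of recorded moves under an illustrative distance cut-off
threshold_km = 250
near_share = (flows.loc[flows["distance_km"] <= threshold_km, "migrants"].sum()
              / flows["migrants"].sum())
print(f"share of migrants moving at most {threshold_km} km: {near_share:.0%}")
```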
it namely denotes moving flows within fukushima cities, moving toward one of metropolises from fukushima and moving patterns between metropolises. actually, there is still another problem in residential data. in year , a national census every years was carried out in japan, and it has achieved to unveil many data discrepancies and inconsistencies of population in each local area of fukushima. comparing with statistics on resident data holding municipal government office and actual population from census, the latter cases were almost too lower than estimated populations in many cases of fukushima. to date, these missing populations have not been traced properly and many evacuees did not intentionally apply immigration cards to municipal government office. it means that many of them still have strong intentions to go back to fukushima in the near future. however, the stumbling blocks still remain against their returns, even though the government purifies radioactive substances and pollutions around their towns. according to the general surveys by tokyo capital government for evacuees and interviews for evacuees by the author (shibuya ), they found that evacuees' motivations which choose the destination and refuge were relying on following critical factors: ( ) to tie with any kindred relationships (a factor of human relationship), ( ) rich opportunities for jobs in the new address (a factor of new job opportunity), ( ) conveniences to manage their own real estates, farms, livestock, and factories in their hometown (a factor of holding estates). first factor implies mutual cooperation and helping among local acquaintances, and second answer clearly reflects their needs for jobs in new dwelling. and third factor relates to geospatial location, and parts of them have been living in two places of both fukushima and refuges. namely, their conditions were back and forth between hometown and refuge. thus they could not leave to far refuges, and geographical area around km within fukushima and refuges satisfied their above motivations using transportations such as the bullet train, highways, and other land transportations (it drives toward the destination within h). for those who required living needs, tokyo and nearby area of tokyo could properly offer new job opportunities and dwelling availabilities for them. our identifications are often determined by not only own cognitive factors but human relational and spatiotemporal factors, and their daily conditions of mental health would be interlinked with those internal and external surroundings. the fukushima case similarly indicated those mental malfunctions of evacuees caused by the human-made disaster. when such individuals lose all (or a part) of the linkages with both human relationships and living place, their mental foundation for identification will be seriously damaged in severe situations. such accidents and events caused by both natural and human-made disasters will be easy to engender secondary damages against social adaptation and subjective well-being of each individual. it is too important to care for each, but there are resilient needs to wholly repair and revive the community among them and their human relationships such as social capital (putnum ; kawachi and berkman ; kawachi et al. ; oecd/nea ) and family bondages. even though digitized communication styles renewed our daily commitment for online community, physical-contact based commitments in onsite community have still special meanings for their well-being. 
frey and osborne ( ) reported how will contemporary industries and jobs be changed and replaced by computerization and the ai-driven robotics, and they simulated socioeconomic trend patterns of jobs fitted by gaussian stochastic model. their estimations have indeed shown that the ai society will engender emerging job market and require other skills for citizens. but it will be clearly understood whether correct or not in the future. but, will a meaning on working life be steeply altered by such innovation? some theorists said that both living and working are indispensable relations each other. as one of the renown episodes, psychoanalyst freud answered that it requires "lieben und arbeiten" for becoming sound and independent adult. the former "lieben" means the accepting and loving for the others as a partner. and the latter "arbeiten" devotes to achieving the goals for maintaining daily living by own works and pursuing enhancement of own intellectual abilities and skills. both are still certainly the fundamental necessary factors for people. and a life-course approach in developmental psychology has much suggestion to wholly understand our mental development and health promotion during the life process (erikson (erikson , (erikson , . in each development stage of identity, each individual closely faces the problems and what should be conquered by each of them. such tasks can be furnished for own rich experiences of each, and each can be reorganized to adapt in own life process (e.g., self-actualization). working, learning, and other daily activities will be achieved by undertaking self-development and adaptation in social surroundings. it motivates to enhance each quality of life (qol) through own working experiment. it is certainly that the ai and big-data-based society enables to change our qol and working life. such innovation progress in our working styles has already become the cascading to collapse larger barriers by the big wave of digital transformation. working environment has been crucially invested in the contexts of either employees or employer. but quality in working environment cannot be determined by factors of material and physical surroundings, and it should be cared about human factors such as enhancement of human relationships and well-being (oecd ; strack et al. ; buunk and gibbons ) . for example, oecd proposed total framework ("measuring well-being and progress: well-being research") , and they enumerated necessary factors related to measuring both economic and well-being value in working and daily life. the digital transforming society will enlarge our working from actual physical space to virtual online space through tele-existence and online collaboration tools. and our working skills and abilities will be required conquering the harder roads of uphill progress of the ai and data sciences (boyd and holton ) . further, traditional stressful working environments can be attempted to quantify and coordinate with each parameter of employees such as personality characteristics, demanded skill levels, abilities, chemistry among members, and other necessary factors. now, such hr (human resource) technology (i.e., a case using business microscope ) enhances our working styles and improves productivities in various situations (khartri and samuel ) . analytics teams have vividly rushed to dive into the ocean of big data, but matching between the needs and their analyzed solutions can be improved by further efforts. 
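Frey and Osborne's study mentioned above fit a Gaussian process classifier to expert-labelled occupations and then scored the remainder. A toy version of that idea, with invented feature vectors and labels, might look like the following; it only shows the mechanics and does not reproduce their feature set or results.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

# hypothetical occupation features: routine-task share, social-skill share
X_labelled = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]])
y_labelled = np.array([1, 1, 0, 0])   # 1 = judged automatable in this toy setup

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(0.5), random_state=0)
gpc.fit(X_labelled, y_labelled)

# score an unlabelled, made-up occupation
new_job = np.array([[0.6, 0.4]])
print("estimated automation probability:",
      round(gpc.predict_proba(new_job)[0, 1], 2))
```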
active workers are usually facing issues at the marriage and family in their life courses. in some developed countries, the reasons, due to which unmarried rate during the lifespan of the youth generation has been increasing, may be understandable in work-life balance context. in japan, statistical data of ipss (national institute of population and social security research, japan) show such facts: men's case of unmarried rate excessed %, and women's case was . % at . many adolescents frequently hesitate to lose their free time and conformity, and it simultaneously means that they hate any interruptions by others and physical contacts. as necessary, they can choose tentative friends online (su and hu ) . using smartphone, many matching service applications for marriage among future partners have been launched in japan. and then, the matching needs can be bridged with the youths for their marriages. those matching services might be applied by stable matching problem in economics of mechanism design (roth ) . such algorithm can be formalized for matching pairs of stable marriage. otherwise, daily living enriches its big data (ganchev et al. ) . especially, there are strong requirements for children and elder people in their local community and living environment. first, smart sensing and ubiquitous technologies aim to enhancing our daily life (shibuya ) , and smart cities and smart house have cutting-edges for improving health services for us (lee ) . for example, the digital human research center in japan proposed an autonomous caring system and simulators for toddlers and little infants. due to their sudden and unpredictable manner of behaviors, serious accidents such as injuries and death at home often happen. this system intends to monitor and analyze daily patterns for improving their safety. of course, those systems which are equipped in smart houses are also applicable for elder people to watch their daily cares and health monitoring. on the other hands, secondly, airbnb and similar sharing house services have been launched in many nations, and big data on those paring patterns between house-owners and visitors will be arranged to analyze such trip purposes, sharing durations, other preferences by the ai-driven services (koh et al. ) . in these regards, digitization has already reshaped our quality of living in those contexts. using smartphone, mental health monitoring can be possible recently (ben-zeev et al. ; bakker et al. ) . especially, using social media data, there were innumerable examinations to analyze the relations with mental health and diagnosis of discourses on twitter and sns services. for example, there were enumerable cases on depression trends in community corresponding with data of geospatial location (yang and mu ) , depression detection on twitter (guntuku et al. ) , adhd (attention deficit hyperactivity disorder) diagnosis using discourses on twitter (coppersmith et al. ) , and other borderline cases in clinical psychology (e.g., hopeless, loneliness, social withdrawal). as social networking services clearly indicate a part of human relationships online (lazakidou ) , it can consider that their relations itself still have sharing illness personalities and depressed mental health. namely, there is a possibility that latent patients were apt to be participating in such social media, and some of them flocked online each other. inclusive cares within communal and relationship level can be also very effective for each person. 
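The stable-matching formulation referred to above is usually illustrated with the Gale-Shapley deferred-acceptance algorithm. A compact sketch with hypothetical preference lists follows; real matching services add many constraints on top of this basic mechanism.

```python
def gale_shapley(proposer_prefs, reviewer_prefs):
    """Deferred acceptance: proposers propose, reviewers hold the best offer."""
    free = list(proposer_prefs)                  # proposers not yet matched
    next_choice = {p: 0 for p in proposer_prefs}
    engaged = {}                                 # reviewer -> proposer
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in reviewer_prefs.items()}

    while free:
        p = free.pop(0)
        r = proposer_prefs[p][next_choice[p]]    # best reviewer not yet tried
        next_choice[p] += 1
        if r not in engaged:
            engaged[r] = p
        elif rank[r][p] < rank[r][engaged[r]]:   # r prefers the new proposer
            free.append(engaged[r])
            engaged[r] = p
        else:
            free.append(p)
    return {p: r for r, p in engaged.items()}

# hypothetical preference lists
proposers = {"a": ["x", "y"], "b": ["y", "x"]}
reviewers = {"x": ["b", "a"], "y": ["a", "b"]}
print(gale_shapley(proposers, reviewers))        # a stable pairing: {'a': 'x', 'b': 'y'}
```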
traditionally, in the studies of communication and communal health, the concept on illness identity in interactive communication has potentials to recognize identification process for caring illness and health (hecht et al. ) . illness identity could be regarded as interactive processes from personal to communal layers in terms of the communication theory of identity. these mechanisms should be inclusively cared by various viewpoints from personal to social communal level. that is because, for example, social anxiety, depression, and other mental (and physical) illness could not be easily emancipated from not only individuals but also more diverse interactions and social groups. according to wegner et al., they discussed mental control and relationships with the others. and their findings can be understood only by including perspectives of social contexts. that is, it should pay attention to not only individual experiences of depression and mental distress but social relationship and interaction process with standpoints from the others (aneshensel et al. ). as implied earlier, rapport interaction between clinical psychologists and clients has been better focused in empathy oriented understanding as client-centered therapy (rogers ) . clinical psychological cases such as autism, psychopathy and other diagnoses usually display specific patterns of behaviors and assertions. especially, those who have specific disabilities against soundly interactions with the others are managed by tom (theory of mind), and this study intends to reveal our mental manners to recognize and coordinate with the others in dyad models (semeijn ; freitas et al. ) . such client cannot understand any intentions asserted by the others, and they often confuse meanings pretending and deceptions by the others. namely, understanding for the others flexibly requires more imaginable coordination in social context. and the loss of such basic intellectual skills becomes difficult for them to behave appropriately against troublesomeness with the others. in other words, some evolutionary psychologists and neuroscientists told that the humankind could be evolved both to lie against the others and detect . therapy for human by ai deceptive intentions. they further said that acquisitions for those neural mechanisms of social intelligence took advantages of beating against other wild animals during surviving history of ancient peoples. according to attachment theory by developmental psychologist bowlby, it authorized that physical attachment between mother and children offers strong and comfort foundation during child development (bretherton ) . suggestively, serial experiments also shown that an alternative of "mother" could be sufficiently replaced for children. for example, a fluffy doll as alternative of mothership could fulfill having comfort emotions for child (in their experiments, they tried to use a child of monkey). in this concern, personal intelligent robots have potentials to assist our daily life (coeckelbergh ) . as an example case, paro already achieved improving many patients of mental illness and alzheimer diseases. it calls robot therapy assisted by ai and adorable doll-like robots. such new services can be adapted in the social welfare institutes, daily cares in home and hospitals (wada et al. ; yu et al. ) . their attachments with physically autonomous entities may offer them some reliefs and comforts. 
in such ways, as mentioned earlier, there are certainty to automatically diagnose clients by the ai using telecommunications and smartphone. sensing data accumulated by wearable devices in daily life has strong potential to detect mental illness and bad mental-physical conditions in earlier stages. mandatory evacuation of residents during the fukushima nuclear disaster: an ethical analysis social network differences of chronotypes identified from mobile phone data handbook of the sociology of mental health mental health smartphone apps: review and evidence-based recommendations for future developments social exclusion impairs self-regulation next-generation psychiatric assessment: using smartphone sensors to monitor behavior and mental health the homeless mind: modernization and consciousness technology, innovation, employment and power: does robotics and artificial intelligence really mean social transformation testing the nation's healthcare information infrastructure: nist perspective the origins of attachment theory: john bowlby and mary ainsworth national and local influenza surveillance through twitter: an analysis of the - influenza epidemic play station nation: protect your child from video game addiction health, coping, and well-being: perspectives from social comparison theory machine learning for healthcare technologies health care, capabilities, and ai assistive technologies from adhd to sad: analyzing the language of mental health on twitter through self-reported diagnoses childhood and society identity and the life cycle common knowledge, coordination, and strategic mentalizing in human social life the future of employment: how susceptible are jobs to computerisation? fukushima in review: a complex disaster, a disastrous response enhanced living environments: algorithms, architectures, platforms, and systems internet of things and data analytics handbook fukushima data show rise and fall in food radioactivity: giant database captures fluctuating radioactivity levels in vegetables, fruit, meat and tea modernity and self-identity: self and society in the late modern age detecting influenza epidemics using search engine query data social networking addiction: emerging themes and issues detecting depression and mental illness on social media: an integrative review the communication theory of identity review of health issues of workers engaged in operations related to the accident at the fukushima daiichi nuclear power plant additional report of japanese government to iaea -accident at tepco's fukushima nuclear power stations deliberative democracy for the future: the case of nuclear waste management in canada neighborhoods and health social capital and health analytics for managerial work equipment health monitoring in complex systems self-consciousness, self-agency, and schizophrenia offline biases in online platforms: a study of diversity and homophily in airbnb a novel coronavirus associated with severe acute respiratory syndrome virtual communities, social networks and collaboration understanding social anxiety the nature and function of self-esteem: sociometer theory advances in computational environment science: selected papers from international conference on environment how information technology can change our lives in a globalized world the limits to growth analysis of japanese radionuclide monitoring data of food before and after the fukushima nuclear accident epidemics and percolation in small loneliness and social uses of the internet the near-term task force review 
of insights from the fukushima dai-ichi accident negative consequences from heavy social networking in adolescents: the mediating role of fear of missing out measuring social well-being: a progress report on the development of social indicators five years after the fukushima daiichi accident: nuclear safety improvements and lessons learnt twitter improves influenza forecasting the demography of health and healthcare bowling alone: the collapse and revival of american community handbook of risk theory: epistemology, decision theory, ethics, and social implications of risk client-centered therapy: its current practice, implications and theory the economics of matching: stability and incentives report to the foreign academies from science council of japan on the fukushima daiichi nuclear power plant accident interacting with fictions: the role of pretend play in theory of mind acquisition a framework of multi-agent based modeling, simulation and computational assistance in an ubiquitous environment actualities of social representation: simulation on diffusion processes of sars representation a study on participatory support networking by voluntary citizens -the lessons from the tohoku earthquake disaster a simulation on networked market disruptions and resilience from an exploring study on networked market disruption and resilience the society for risk analysis: asia conference a risk management on demographic mobility of evacuees in disaster the use of twitter to track levels of disease activity and public concern in the u.s. during the influenza a h n pandemic subjective well-being: an interdisciplinary perspective gender-specific preference in online dating radiological issues for fukushima's revitalized future problematic use of social networking sites: antecedents and consequence from a dual system theory perspective the benefits and dangers of enjoyment with social networking websites disaster displacement: how to reduce risk, address impacts and strengthen resilience robot therapy for prevention of dementia at home-results of preliminary experiment public health implications of excessive use of the internet, computers, smartphones and similar electronic devices exploiting psychology and social behavior for game stickiness gis analysis of depression among twitter users use of a therapeutic, socially assistive pet robot (paro) in improving mood and stimulating social interaction and communication for people with dementia: study protocol for a randomized controlled trial pathologist-level interpretable whole-slide cancer diagnosis with deep learning unsupervised bayesian inference to fuse biosignal sensory estimates for personalizing care key: cord- -ax x gk authors: wu, jia; chen, zhigang title: data decision and transmission based on mobile data health records on sensor devices in wireless networks date: - - journal: wirel pers commun doi: . /s - - -y sha: doc_id: cord_uid: ax x gk the contradiction between a large population and limited and unevenly distributed medical resources is a serious problem in many developing countries. this problem not only affects human health but also leads to the occurrence of serious infection if treatment is delayed. with the development of wireless communication network technology, patients can acquire real-time medical information through wireless network equipment. patients can have the opportunity to obtain timely medical treatment, which may alleviate the shortage of medical resources in developing countries. 
this study establishes a new method that can decide and transmit effective data based on sensor device mobile health in wireless networks. history data, collection data, and doctor-analyzed data could be computed and transmitted to patients using sensor devices. according to probability analysis, patients and doctors may confirm the possibility of certain diseases. in developing countries, the health of people can not be effectively protected by medicine because of the underdeveloped medical technology and large population of these countries. therefore, a patient with a minor illness may develop a very serious disease or even cause a disastrous infection. thus, developing countries should spend a substantial amount of manpower and funding to solve the problem. as a representative of developing countries, china once suffered greatly because of the problems posed by its huge population and the [ ] . thousands of people were affected and many of them died because the first few patients delayed treatment. the ebola virus [ ] , which rages in africa, also broke out because the first few patients did not obtain timely treatment. the phenomenon of medical resource shortage is serious in china, a country with a population of more than . billion. according to statistical data from china's ministry of health in , an average of people share only one doctor, and a doctor has to treat patients per day at most [ ] . further data show that a hospital in a large city treats an average of million people every year, and an advanced hospital treats at least . million patients annually. facing such a large population and shortage of medical resources, china needs to find an effective solution to rationally maximize the utilization of the existing medical resources. as a type of network, the sensor device mobile health wireless network [ ] is mainly characterized by ''carry-store-transfer'' transmission among its nodes. this fact implies that if one node is not in the communication area, the node stores the information and moves to transmit it to the next-hop node. a complete link in wireless networks is not necessary. communication is accomplished through the movement of the nodes, and information transmission in sensor wireless network is diffused. this feature of the sensor device wireless networks can be adopted in mobile health services. people can store health information, transmit it to any mobile device, and share information anywhere even without a mobile signal. this study establishes a new method that can decide and transmit effective data based on sensor device mobile health in wireless networks. history data, collection data, and doctor analysis data can be computed and transmitted to patients using sensor devices. patients and doctors may confirm the possibility of diseases basing on probability analysis. this paper is debated some problems with data decision and transmission in mobile health. then we found some methods to solve problem in wireless network. section is related work. section is model design. section is experiment and data text. and in sect. is conclusion. mobile health is a new health communication model that involves sensor technology and mobile computing. as of , more than million people are using mobile health devices [ ] . the united states is the first country to apply communication technology in the medical field. over % of mobile health equipment are applied in the us. 
at present, many other countries, including several in asia and europe, are also using this technology [ , ] . mobile health equipment was applied by nasa for astronauts to conduct remote monitoring and physiological index recording. shaikh et al. [ ] designed a new wireless remote monitoring system in linux embedded platform to measure pulse blood oxygen. page et al. [ ] applied low-energy consumption sensor devices in mobile health and invented a portable mobile health system. this system can monitor the physiological indices of patients when double sensors are working. according to recorded physiological indices in recent months, doctors can assess several illnesses of the patients. dias et al. [ ] built a new mobile health system with improved testing and nursing ability. data can be analyzed and stored using hospital servers. data from clients can be collated and monitored. through the use of tcp/ip and udp, physiological indices can be transmitted to data packets and sent by databases in servers [ ] . seto et al. [ ] proposed a heart-failure remote monitoring system in clinical research. this system could analyze and monitor many heart data indices. today, this system is used in first-aid environments. according to an established mobile health system, patients can obtain timely treatment from doctors or hospitals by using wireless sensor devices. at present, the research on wireless networks focuses on routing algorithms. existing routing algorithms can be transplanted into different areas by improvement. several methods are also adopted in wireless networks. grossglauser et al. [ ] proposed a store-and-forward mechanism called epidemic algorithm, which simulated the transmission mechanism of infectious diseases. this algorithm has two nodes that exchange messages instead of store messages on each other when they meet. this method is similar to exclusive transmission and allows the two nodes to obtain more information. when the node reaches the target node and transmits a message, the path could be guaranteed to be the shortest one with ample network bandwidth and buffer memory space. however, the increase in the included nodes could cause congestion in the network transmission of the message given the limited related resource in the real network. in actual situation and application, this method cannot have a good effect because of the resource limitation. wang et al. [ ] proposed the spray and wait algorithm based on the epidemic algorithm. the spray and wait algorithm consists of two phases. in the spray phase, the source node first counts the available nodes around for message transmission and transmits its message to the surrounding nodes by spraying. in the wait phase, the message is transmitted to the target node via direct delivery to fulfill the transmission process if no available node could be found in the spray phase. this method is a modified algorithm that improves the flood transmission of the original epidemic algorithm. however, the spray phase may cause a waste of source nodes if a large amount of neighbor nodes consume significant source node space. thus, this algorithm could cause node death by overspraying the source nodes in networks with great randomness. spyropoulos et al. [ ] proposed the prophet algorithm, which improves network utilization. this algorithm first counts available message transmission nodes around and then calculates the appropriate number of transmission nodes to form message groups. leguay et al. 
leguay et al. [ ] proposed the mv algorithm based on the probability algorithm; it calculates transmission probability from statistics recorded during node meetings and area visits. burgess et al. [ ] proposed the maxprop algorithm based on array-set priority: when two nodes meet, the transmission sequence is determined by the settled array priority, which reduces resource consumption and improves efficiency by setting a reasonable message transmission sequence. leguay et al. [ ] suggested the mobyspace algorithm, in which node groups or node pairs with higher relevance form a self-organizing transmission area to achieve optimal communication among nodes. burns et al. [ ] proposed the context-aware routing algorithm to calculate the transmission probability of source nodes reaching target nodes: the algorithm first acquires the middle node by calculating the cyclic exchange transmission probability, then collects and groups the messages so as to guide the middle node to transmit them directly to the node with the higher transmission probability. kavitha et al. [ ] proposed the message ferry algorithm, which groups and transmits messages: it first classifies and groups the messages, then collects the source nodes to be transmitted, and finally counts the existing transmission traces of each ferry node in the network. from these traces the movement rule of the ferry nodes can be derived, and the source node automatically moves to the ferry node for message transmission; predicting node movement traces in this way improves transmission performance. based on this introduction to mobile health and wireless technology, the next step is designing a model of the mobile health wireless sensor network. in the context of mobile health, data collection in hospitals and among patients is an important problem. patients use mobile devices and sensor devices to receive and send data messages, but these devices have limited storage space for sending and receiving messages. the electronic medical record of each patient contains more than - gigabytes of stored data, so a large amount of energy and overhead would be consumed if all messages were received or sent by the device; moreover, transmitting such large data may delay illness assessment by doctors. several problems must therefore be considered in establishing effective data decision and transmission in the mobile health environment: (1) how are data collected from patients in the mobile health environment? (2) which data are important for the system and doctors to evaluate diseases? (3) how is the probability of each disease analyzed using mobile devices and sensor devices? (4) how are complex diseases analyzed by sensor devices? these four problems guide the rest of this study. in mobile health, sensor devices and mobile devices are the cheapest and most convenient means of data collection and transmission among doctors, patients, and hospitals. given that mobile devices are ubiquitous, patients can deliver their physiological indices to doctors through mobile devices to ensure a rapid evaluation of their diseases. this method can reduce the pressure on hospitals in developing countries when many patients need diagnosis (fig. ) . moreover, patients can wear wearable devices so they can review their physiological indices.
these devices include wireless sensors that can collect the physiological indices of patients. patients may place the devices on a finger, the wrist, or the chest to scan and collect data from those parts of the body; in this way, physiological data such as heartbeat, pulse, and blood pressure can be collected and stored by the wireless sensor devices and finally transmitted to the mobile devices (fig. ) . in mobile health, mobile devices can analyze the physiological data they receive and provide a diagnosis using embedded wireless sensor software, so patients can assess their diseases on the devices. at the same time, the data are transmitted to the hospital server, where data text and program text can be analyzed very swiftly, and the shortage of medical resources can be relieved through remote diagnosis. patients with serious conditions in an ambulance or outside hospital areas can collect data using wireless sensor devices; the data can be organized by history, collection, and diagnosis, and doctors can compare the data on a mobile device and then provide scientific conclusions. data transmission is crucial for a doctor to conduct a remote diagnosis and save a patient's life, and rapid data decision on the devices becomes important once data are collected by wireless sensor devices. the following work therefore focuses on data decision in mobile health services. a sensor device can transmit a large number of patient data items to a mobile device, which then spends time analyzing the important data; data assembly is therefore necessary. figure shows a patient's collected data. the model includes a time assembly, an item assembly, a location assembly, and a diagnosis assembly. in these assemblies, int is an integer type and char is a character type, and each assembly contains many elements, for example: diagnose {category (char) {diastolic pressure (int), sphygmus (int), heartbeat rate (int), liver function (int)}}. by establishing sub-assemblies, a decision tree containing all the information for a patient in mobile health can be formed. the root of the decision tree is the patient (fig. ) , the first branches are the patient's basic information assemblies, and the second branches are sub-assemblies at the next level. a diagnosis is chosen from the second branches for a sample analysis. a patient's physiological indices can be collected, analyzed, and computed while the patient wears a wireless sensor device paired with a mobile device; all results become conclusions and are sent to the patient and the doctor, and using these conclusions doctors may diagnose patients in time. for different diagnosing conditions, the process is as follows: (1) according to history records, the sensor device acquires past diagnostic records; (2) the history records are analyzed and the sensor device performs calculations; (3) using the collection time and comparing the patient's data with normal data, the sensor device calculates the probability of certain diseases in real time; (4) the probability of certain diseases combines the history diseases and the collected data based on the current data collection, and the result is transmitted to patients and doctors; (5) if a disease or diagnosis is highly complicated, the doctor decides and determines the disease probability after the sensor device has analyzed it. through this diagnosis decision process, patients can acquire accurate data on disease probability, which helps ensure accurate treatment.
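the assembly model and the patient-rooted decision tree described above can be sketched as a nested data structure. the field names below follow the four assemblies named in the text, but the exact schema is not fully specified in this copy, so the keys, types and sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class DiagnosisAssembly:
    category: str                                           # char-typed category label
    indices: Dict[str, int] = field(default_factory=dict)   # int-typed physiological indices

@dataclass
class PatientRecord:
    """root of the decision tree; the four assemblies hang off the patient."""
    patient_id: str
    time_assembly: Dict[str, str] = field(default_factory=dict)
    item_assembly: Dict[str, int] = field(default_factory=dict)
    location_assembly: Dict[str, str] = field(default_factory=dict)
    diagnosis_assembly: Optional[DiagnosisAssembly] = None

record = PatientRecord(
    patient_id="patient_1",
    time_assembly={"collected_at": "2016-05-01T08:30"},
    location_assembly={"ward": "outpatient"},
    diagnosis_assembly=DiagnosisAssembly(
        category="cardiovascular",
        indices={"diastolic_pressure": 85, "sphygmus": 72,
                 "heartbeat_rate": 74, "liver_function": 40},
    ),
)
print(record.diagnosis_assembly.indices)
```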
doctors need a large amount of data to accurately calculate disease probability and correctly diagnose the disease, so the disease probability analysis produced by the sensor device and the mobile device is important. to calculate disease probability accurately, several data types need to be defined. definition (diagnosis data in history): diagnosis data in history are the diagnosis information recorded in the past, containing physiological indices, diagnosis conclusions, and disease analyses. as many diseases are related to diagnosis data in history, diagnosing a disease becomes more convenient when such historical data are considered alongside the data collected by the sensor device. for a patient, the diagnosis data in history contain many disease records, where dise(h_i) denotes one disease in the history and h_i denotes a disease. the relationship between a disease and a physiological index is expressed through k(t), the physiological index recorded by the sensor device at time t, with normal range [a_t, b_t]. from the recorded diagnosis data, the probability of one disease in history, p_sin, is the proportion of abnormal readings k(t_i) among all readings in the history. a patient may also experience many types of emergency diseases at once; their complex probability is p_com. each disease has its own indices and its probability is independent, so p_com is the product of the single-disease probabilities, p_com = ∏_i p_sin_i. doctors and patients can thus more easily diagnose different diseases using the analysis of diagnosis data in history. the collection diagnosis data, by contrast, are created by the sensor devices while patients wear them: physiological indices are stored and transmitted to the mobile device in time, the wireless sensor devices analyze the data and provide a conclusion, and finally patients and doctors can diagnose diseases. the process is as follows. step : the wireless sensor devices collect data and transmit them to the mobile device to establish a data assembly; for example, a patient's collected diagnosis data are assembled as patient f(d) = {d = systolic pressure; d = diastolic pressure; d = blood glucose; d = lpcr; d = pct}. the assembly contains five physiological indices, which are then analyzed by the wireless sensor devices. step : comparing the normal and history data, the wireless sensor devices provide the abnormal data indices and disease types. the process is shown in fig. , where the record in the mobile device is: history data: diagnose (d = high, d = low); conclusion (p_sin = stroke); collected data: diagnose (d = high, d = low, d = low); conclusion (p_sin = stroke, p_sin = diabetes). history data and normal data can be compared, and the sensor devices can highlight past diseases with a high probability of recurrence; if a disease has never occurred before, the sensor devices list it as a new disease and assemble its diagnosis data. the probability from the collection diagnosis data, p_col, is decided on the basis of the abnormal data: with s_wol the whole set of collected data indices and s_col \ s_nor the abnormal indices, p_col for one disease is the ratio of the abnormal indices to the whole set. step : diagnosis data from doctors. history data and collection data are delivered to doctors, who may then determine the disease probability; according to fig. , p_doc(dise_j) is the probability of a disease according to the doctor.
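a minimal sketch of the three probabilities defined above (p_sin from history readings, p_com as the product of independent single-disease probabilities, and p_col from the abnormal share of collected indices) is given below. the readings, normal range and index names are toy values, not data from the paper's (elided) tables.

```python
def single_disease_history_prob(readings, normal_range):
    """p_sin: share of historical readings k(t_i) falling outside the normal range [a, b]."""
    a, b = normal_range
    abnormal = sum(1 for k in readings if not a <= k <= b)
    return abnormal / len(readings)

def complex_history_prob(single_probs):
    """p_com: product of independent single-disease probabilities, as assumed above."""
    p = 1.0
    for p_sin in single_probs:
        p *= p_sin
    return p

def collection_prob(collected_indices, normal_indices):
    # p_col: |s_col \ s_nor| / |s_wol|, i.e. abnormal collected indices over all collected indices
    abnormal = set(collected_indices) - set(normal_indices)
    return len(abnormal) / len(collected_indices)

# toy numbers for illustration only
history_systolic = [150, 145, 118]
print(single_disease_history_prob(history_systolic, (90, 140)))   # 2/3 of readings abnormal
print(complex_history_prob([2 / 3, 0.5]))                         # product = 1/3
print(collection_prob({"d1", "d2", "d3"}, {"d3"}))                # 2 of 3 indices abnormal
```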
(d , d ) denotes the abnormal index pair for stroke, (d , d ) the abnormal index pair for diabetes, and (d , d ) the abnormal index pair for thrombus. if a disease must be superposed from several diagnosis indices, the probability of the disease from the doctor's diagnosis data is combined accordingly, and doctors may diagnose the probability of certain diseases using the sensor device calculations. definition (synthesize probability): this item combines history data, collection data, and doctor data, and the sensor devices provide the synthesized probability according to a set of weights, p(dise) = a · p_his + b · p_col + c · p_doc, where p(dise) is the probability of a disease, p_his is the probability from history data and a its weight, p_col is the probability from collection data and b its weight, and p_doc is the probability from doctor data and c its weight. using the synthesized probability, patients can rapidly assess the disease probability and obtain good service from mobile health. in this paper, patients' data were obtained from the mobile health information of the ministry of education-china mobile joint laboratory. the experiment is described as follows: experiment and data text were collected and stored by a sensor device and transmitted by a mobile device; the data contain five physiological indices (systolic pressure, diastolic pressure, blood glucose, lpcr, and pct); two patients were chosen as research subjects, each with three history records and three collection records; the experiment data are given in the following tables. as shown in tables , and , the experiment analyzes a single disease and a complex disease. figure shows that patient has three history records, record = , record = , and record = ; records and are higher than the normal range, which spans from to , so the disease is dise = high blood pressure. according to formula ( ), the probability of high blood pressure in patient follows. in the collection data, record = and record = are also higher than the normal data, so according to the collection data and formula ( ) the probability of high blood pressure in patient is obtained. after checking the history probability and the collection probability, the doctor can determine the probability of high blood pressure in patient , assumed here as the doctor probability p_doc = %. from the three probabilities, the sensor devices set the weights and calculate the probability of high blood pressure in patient ; formula ( ) assumes that a = . , b = . , c = . . the sensor devices may calculate the probability and transmit the diagnosis data to the mobile app to be evaluated by patients and doctors. in figs. and , patient may have low systolic pressure and low diastolic pressure; the disease probabilities for patient are p_his(dise = low systolic pressure) = %, p_col(dise = low systolic pressure) = . %, p_his(dise = low diastolic pressure) = . %, and p_col(dise = low diastolic pressure) = . %. a complex disease is constituted by single diseases; for example, low blood pressure involves low systolic pressure and low diastolic pressure, so a doctor can make a decision once the probability of each single disease is calculated, giving the doctor-determined probability p_doc(low blood pressure). for patient , the pct index in both the history and the collection is lower than the normal data (fig. ) , so p_his(low pct) = p_col(low pct) = %. septicemia consists of low blood pressure and low pct.
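assuming the synthesized probability is the weighted combination of the three components described above, the calculation can be sketched as follows. the weights and input probabilities are placeholders, since the actual values are elided in this copy of the paper.

```python
def synthesize_probability(p_his, p_col, p_doc, a=0.3, b=0.4, c=0.3):
    """weighted combination of history, collection and doctor probabilities.

    a, b, c are illustrative weights summing to 1; the paper sets specific values,
    but they are elided here, so any a + b + c = 1 split stands in.
    """
    assert abs(a + b + c - 1.0) < 1e-9
    return a * p_his + b * p_col + c * p_doc

# e.g. history suggests 2/3, the fresh collection suggests 2/3, the doctor says 0.7
print(synthesize_probability(2 / 3, 2 / 3, 0.7))
```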
according to formula ( ), a doctor can diagnose the probability of septicemia. following this analysis of complex probability, patients can choose among different optimal therapeutic schedules using sensor devices. this paper discusses a method in mobile health that uses a sensor device to decide on and transmit medical data. with the sensor devices, history data, collection data, and doctor analysis data can be computed and transmitted to patients, and using the probability analysis, patients and doctors may confirm the likelihood of several diseases. in the future, mobile health is expected to focus on complex diseases and to be combined with big data and artificial intelligence studies to improve data analysis. this would help doctors and patients by easing problems in medical resource distribution, and especially in developing countries the technology may reduce disease diffusion; it is an effective direction for mobile health research. references: isolation and characterization of viruses related to the sars coronavirus from animals in southern china; genomic surveillance elucidates ebola virus origin and transmission during the outbreak; business ecosystem strategies of mobile network operators in the g era: the case of china mobile; an incentive game based evolutionary model for crowd sensing networks, peer-to-peer networking and applications; securing legacy mobile medical devices, wireless mobile communication and healthcare; applying a sensor energy supply communication scheme to big data opportunistic networks; the role of service oriented architecture in telemedicine healthcare system; physiological approach to monitor patients in congestive heart failure: application of a new implantable device-based system to monitor daily life activity and ventilation; mobile telemedicine system for home care and patient monitoring; perceptions and experiences of heart failure patients and clinicians on the use of mobile phone-based telemonitoring; mobility increases the capacity of ad hoc wireless networks; dynamic spray and wait routing algorithm with quality of node in delay tolerant network; spray and wait: an efficient routing scheme for intermittently connected mobile networks; dtn routing in a mobility pattern space; maxprop: routing for vehicle-based disruption-tolerant networks; evaluating mobyspace-based routing strategies in delay-tolerant networks; mora routing and capacity building in disruption-tolerant networks; analysis and design of message ferry routes in sensor networks using polling models; improvement of the quality of mobile target detection through portion of node with fully duty cycle in wsns; energy-balanced cooperative transmission based on relay selection and power control in energy harvesting wireless sensor network.
key: cord- - fuy tlp authors: patson, noel; mukaka, mavuto; otwombe, kennedy n.; kazembe, lawrence; mathanga, don p.; mwapasa, victor; kabaghe, alinune n.; eijkemans, marinus j. c.; laufer, miriam k.; chirwa, tobias title: systematic review of statistical methods for safety data in malaria chemoprevention in pregnancy trials date: - - journal: malar j doi: . /s - - -z sha: doc_id: cord_uid: fuy tlp background: drug safety assessments in clinical trials present unique analytical challenges, including adjusting for individual follow-up time, repeated measurements of multiple outcomes and missing data, among others. furthermore, pre-specifying an appropriate analysis becomes difficult as some safety endpoints are unexpected. although existing guidelines such as consort encourage thorough reporting of adverse events (aes) in clinical trials, they provide limited details for safety data analysis, and this limited guidance may encourage suboptimal analysis by failing to account for the analysis challenges above. a typical example where such challenges exist is trials of anti-malarial drugs for malaria prevention during pregnancy. lack of proper standardized evaluation of the safety of antimalarial drugs has limited the ability to draw conclusions about safety. therefore, a systematic review was conducted to establish the current practice in statistical analysis for preventive antimalarial drug safety in pregnancy. methods: the search included five databases (pubmed, embase, scopus, malaria in pregnancy library and cochrane central register of controlled trials) to identify original english articles reporting phase iii randomized controlled trials (rcts) on anti-malarial drugs for malaria prevention in pregnancy published from january to july . results: eighteen trials that collected multiple longitudinal safety outcomes, including aes, were included in this review. statistical analysis and reporting of the safety outcomes in all the trials used descriptive statistics: proportions/counts (n = , %) and mean/median (n = , . %). results presentation included tabular (n = , . %) and text description (n = , . %). univariate inferential methods were reported in most trials (n = , . %), including chi square/fisher's exact test (n = , . %), t test (n = , . %) and mann–whitney/wilcoxon test (n = , . %). multivariable methods, including poisson and negative binomial, were reported in few trials (n = , . %). assessment of a potential link between missing efficacy data and safety outcomes was not reported in any of the trials that reported missing efficacy data (n = , . %). conclusion: the review demonstrated that statistical analysis of safety data in rcts of anti-malarial drugs for malaria chemoprevention in pregnancy is inadequate. the analyses insufficiently account for potential dependence among multiple safety outcomes, follow-up time and informative missing data, which can compromise the development of anti-malarial drug safety evidence based on the available data. guidance on clinical trial reporting of safety outcomes is available through adherence to the consolidated standards of reporting trials (consort) guidelines [ , ] . however, there is scant literature on standardized ways to statistically analyse the safety outcomes in clinical trials.
although there exist some general regulatory guidelines on safety data analysis, such as international conference on harmonization which recommend descriptive statistical methods supplemented by confidence intervals [ , ] , the proposed statistical methods rarely account for the complexity of the collected safety data, e.g., recurrent adverse events (aes). effective solutions to statistical analysis of safety data in clinical trials may need to be tailored to specific indications (set of diseases with similar characteristics) since safety data collected are also influenced by the medical condition under study. absence of standardized guidelines for safety data analysis in specific settings may limit the ability to draw rich conclusions about the safety of the investigational product, based on collected data. standardized guidelines can simplify integration of safety information from multiple outcomes across rcts [ ] and would ensure optimal use of data in developing the safety profile of the investigational product. statistical analysis of safety data in clinical trials is characterized by a challenge of multiple and related endpoints measured over time. the safety endpoints may include clinical and laboratory defined aes. laboratorybased aes are defined based on standard cut-off points for measures such as vital signs (e.g., body temperature), hepato-toxicity measures (e.g., bilirubin level), cardiotoxicity measures (e.g., electrocardiograms), and other tests relevant to the medical indication being studied [ ] . the safety endpoints may be correlated within patients and over time such that failure to account for this in an analysis may yield biased estimates and false inference. furthermore, time to occurrence of the safety endpoint may be very informative in profiling the drug safety. such data present statistical analysis and interpretation challenges due to the complexity in structure [ ] . for instance, in the case of multiple, repeatedly measured, safety outcomes, false positives may arise from multiple statistical testing if appropriate longitudinal or time to event methods and/or multiplicity adjustments are not considered. in clinical trials, aes may impact compliance and study participation which may further affect treatment efficacy estimates [ , ] . occurrence of (even mild) aes due to a drug would lead to non-adherence, leading to informative censoring. the dropping of the patients from the study generates missing data that may lead to biased results if poorly accounted for. therefore, safety data analysis accounting for missing data is useful to facilitate identification and characterization of the safety profile of the drug as early as possible. other analysis challenges include lack of adequate ascertainment and classification of aes, and limited generalizability of results [ ] since some aes cannot be pre-specified at study design stage. there are many populations where drug safety assessment is complex. one of the special settings in safety data assessment is the use of drugs to prevent adverse outcomes in pregnancy, currently referred to as intermittent preventive treatment of malaria in pregnancy (iptp). for example, the world health organization recommends that pregnant women receive routine treatment with anti-malarial drugs to clear any malaria infection that is present and also to prevent infection in the weeks after administration [ ] . 
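the multiplicity problem raised above (many adverse events compared across arms without adjustment) can be illustrated with a short sketch: per-ae fisher's exact tests followed by a holm adjustment of the resulting p-values. the ae names and counts below are hypothetical, and holm is only one of several reasonable family-wise corrections.

```python
from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests

# hypothetical (participants with the AE, arm size) per AE, for two trial arms
aes = {
    "vomiting":  ((12, 300), (5, 298)),
    "dizziness": ((9, 300),  (8, 298)),
    "rash":      ((4, 300),  (2, 298)),
}

pvals = []
for name, ((e1, n1), (e2, n2)) in aes.items():
    table = [[e1, n1 - e1], [e2, n2 - e2]]      # 2x2 table: event yes/no by arm
    pvals.append(fisher_exact(table)[1])

# holm step-down adjustment controls the family-wise error rate across the AEs
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for name, p, pa, r in zip(aes, pvals, p_adj, reject):
    print(f"{name}: raw p={p:.3f}, adjusted p={pa:.3f}, flagged={r}")
```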
recent review indicates that methodological issues in studying antimalarial drugs in pregnancy have prevented firm conclusions on the safety of new anti-malarial drugs in pregnancy [ ] . previous efforts have attempted to standardize safety assessment methodology for antimalarial drug trials in pregnancy, including study designs and data collection [ , ] . however, literature remains limited in describing the standard practice in the statistical analysis of safety data that are collected on anti-malarial drugs during pregnancy trials. the current review focusses on safety assessment in anti-malarial drugs for chemoprevention in pregnancy trials. since anti-malarial drug for malaria chemoprevention is given repeatedly to healthy pregnant women, it is critical to improve safety assessment in this vulnerable population. specifically, appropriate statistical analysis of safety outcomes can improve development of anti-malarial drug safety profile. this can be achieved through sufficient use of the data generated during the rct which provides a comprehensive drug safety insight. this review, therefore, aims at identifying applied statistical methods and their appropriateness in the analysis of safety data in anti-malarial drugs for malaria prevention during pregnancy clinical trials. the systematic review was conducted according to preferred reporting items for systematic review and meta-analyses (prisma) statement [ ] which outlines minimum standards for reporting systematic reviews and meta-analysis (additional file : table s ). the protocol for this review was registered and published with prospero (crd ). the study population is pregnant women on any anti-malarial drug for malaria chemoprevention. primary original articles published in english from phase iii rcts were considered for inclusion. the articles were from rcts assessing the efficacy and safety of malaria chemoprevention in pregnancy. this review focused on phase iii rcts, because they have the largest sample size among pre-marketing trials and accommodates multidisciplinary support in safety evaluation. further, the data are systematically collected and have the benefit of being randomized, which aids a fair comparison of treatment groups. observational studies, case reports, letters to the editor, narrative reviews, systematic reviews and trials in phase i or phase ii or phase iv were excluded from this review. this review did not include clinical trials on malaria prevention in pregnancy using intermittent screening and treatment (istp) as an alternative to iptp. istp refers to intermittent rapid diagnostic testing (rdt) for malaria in pregnant women followed by treatment of rdt-positive cases with an effective artemisinin-based combination therapy, and iptp is given to pregnant women regardless of their malaria status. hence, istp and iptp consider different populations which may confound the practice in safety assessment methods (i.e., istp considers symptomatic population and iptp considers both symptomatic and asymptomatic population). non-english publications were excluded. studies published between january and july, were searched from five databases (pubmed, embase, scopus, malaria in pregnancy libray (mipl) and cochrane central register of controlled trials (cen-tral). the mipl is an excellent scholarly source of articles on malaria in pregnancy that enabled the review to capture both indexed and non-indexed articles beyond the searched databases. 
additional searches included reference lists of the identified trials and relevant reviews to identify trials potentially missed in the database search. the year was selected since it is when consort guideline updated and emphasized on appropriate statistical analysis and reporting of clinical trials [ ] . conference proceedings were not included because they usually contain abstracts that do not give detailed analysis of the presented results and they are not rigorously peer-reviewed. the review focussed on published studies only so no experts or abstract publication authors were contacted for unpublished data. the key search items included: malaria, anti-malarial drug, pregnancy, efficacy safety or tolerability. the detailed search strategy is presented in additional file : table s . the search was customized per database. based on prisma procedure, after removing duplicates, two reviewers (np and ank) independently screened titles and abstracts initially before arriving at a final list of eligible articles. based on the eligible studies list, full text articles were retrieved. the references were managed using endnote x . (thomson reuters). if there were disagreements, the two reviewers discussed the paper to reach a consensus and reasons for exclusion were provided for ineligible publications/studies the data extraction file created in microsoft excel was used to record all key variables from the selected articles. some of the collected variables such as mode of safety data collection, participant withdrawal due to ae and handling of continuous measures were based on consort guideline. the following key variables were extracted from the papers: main author, publication date, study design, study location, main efficacy outcome, sample size, list of safety parameters collected (including laboratory data), nature of safety data collection (i.e., passive or active), list of statistical methods used for respective safety outcomes, how the results were presented, retention rate at the end of the follow up and how missing safety or efficacy data were handled. the primary hypothesis type (as superiority, non-inferiority or equivalence) was defined based on what was reported in the actual manuscript or inferred by the lead author (np), based on how the study framed the primary hypothesis. superiority hypotheses aim to show whether treatment is better than control, non-inferiority hypotheses intend to show that one treatment is not worse than the other and equivalence hypotheses intend to show that a given treatment is similar to another for defined characteristics [ ] . the statistical methods were classified as descriptive or inferential and univariate or multivariate depending on the purpose and nature of the statistical methods based on previous similar reviews [ , ] , reviewing statistical methods. the extracted quantitative data were reported as percentages in tables. the commonly reported safety parameters, suitability of the used statistical methods and other findings were also summarized narratively. the search identified articles. after removing duplicates, unique articles were identified and considered for possible inclusion in the review. the duplicates (i.e. repeated citations) were the same articles but identified in multiple search databases. figure presents details of the selection process. during screening, a total of articles were excluded based on relevance of their titles and abstracts. 
the remaining full text articles were assessed for possible inclusion; articles satisfied the inclusion criteria and were included in this review, as shown in table . reasons for exclusion are shown in fig. . the trials included in this review were conducted in the oceania ( trials, . %) and sub-saharan africa ( trials, . %) regions. the rcts reviewed recruited , pregnant women, with a median sample size of (interquartile range (iqr): , ) women per treatment arm in a trial. thirteen trials ( . %) recruited more than patients per arm. as expected, all trials ( trials, %) computed sample size based on the efficacy outcome(s). the majority of the trials ( trials, . %) had two treatment arms and the rest had three treatment arms. all trials had an active comparator, and iptp-sp was studied as the standard malaria chemoprevention in the majority of trials ( trials, . %). although the review focussed on published trials from to , the trials were conducted between and . based on the primary hypothesis tested, superiority design rcts were the most common ( trials, . %) and the other trials had a non-inferiority hypothesis. over half of the trials ( trials, . %) reported that they collected safety data using a combination of scheduled and non-scheduled visits (table ) , while a third of the trials ( trials, . %) did not specify the safety data collection approach used. the median retention rate (based on the defined efficacy outcome reported for the respective trials) was . % (iqr: . %, . %) and trials ( . %) had a retention rate below %. all the reviewed trials indicated that they had collected multiple longitudinal safety endpoints. as expected, almost all the trials ( trials, . %) reported obstetric safety outcomes such as foetal loss. tables s and s in additional file provide a detailed list of the safety outcomes and the respective statistical methods reported in each reviewed trial. despite the reported occurrence of multiple aes, none of the trials seemingly reported recurrence of aes during pregnancy. in total, trials ( . %) reported adverse events with different severity levels, e.g., mild, moderate and severe. all trials reported the occurrence of aes by treatment arm. almost all trials ( trials, . %) reported laboratory data in their safety assessment of the drug, and trials of these ( . %) dichotomized at least a single continuous safety outcome (e.g., haemoglobin) based on standard cut-off points to define an ae. the safety analysis approach (based on treatment allocation and adherence) was specified and reported in trials ( . %). per protocol (pp) and intention to treat (itt) analysis approaches were used in trials ( . %) and trials ( . %), respectively. two trials indicated that they used both pp and itt to analyse the safety data. although all the reviewed trials had at least one patient lost to follow-up, only trials ( . %) reported missing efficacy data, and of the trials indicated that the missingness was ignorable after exploring data missingness patterns (table ) . none of the reviewed trials conducted an advanced sensitivity analysis on the relationship between missing data and drug safety; for example, none of the studies assessed the safety outcomes (e.g., aes) in relation to missing efficacy outcomes. this review found that most trials ( trials, . %) had at least one participant who experienced an ae leading to discontinuation from the trial, although the studies did not formally investigate or quantify the relationship between the aes and trial completion.
all the trials included in this review used descriptive statistics as one of the methods to summarize aes (table ) . proportions or counts were the descriptive statistics used in all of the studies to report safety data; the definition of safety data depended on the respective trials, as shown in additional file : tables s and s . incidence rates were reported in trials ( . %). most trials ( trials, . %) reported univariate inferential statistical methods; these included the chi square or fisher's exact test ( trials, . %) and the t-test (n = , . %). only trials reported multivariate statistical methods; the multivariable methods were poisson regression (n = , . %) and negative binomial regression (n = , . %). usage of at least two inferential statistical methods to compare safety outcomes was reported in trials ( . %). although all studies reported multiple safety outcomes, none reported adjustment for multiplicity during analysis. the review showed that at least one optimal statistical method was reported in the trials ( . %) that considered multivariable modelling. even though univariable analyses comparing arms in an rct were appropriately used, the further inferential statistical methods reported in the rest of the trials were suboptimal for the type of data being collected. for further details, additional file : tables s provide a detailed list of the reported statistical methods with their respective safety outcome(s). in terms of presentation of results, none of the trials presented aes in a graph. only trials ( . %) narratively presented the safety results; the other trials ( . %) presented the results in tabular format. a total of trials reported p-values after comparing treatments, and there were only trials ( . %) that reported point estimates with their respective confidence intervals. this review sought to provide a detailed overview of the actual practice of the statistical analysis of safety data in the unique setting of drug trials for the prevention of malaria in pregnancy, as reflected in the published literature. the results demonstrate that there is limited reporting of statistical analyses of safety data, at the end of the trial, in these published reports. the findings are useful to advance the development of standardized guidelines for safety data statistical analysis in anti-malarial drugs in pregnancy trials and related fields. such guidelines will not replace but rather complement the consort guidelines, which are general (i.e., they do not provide specific statistical methods for analysing harms in rcts). based on the authors' knowledge of the available literature, this is the first paper to review statistical methods for safety data in anti-malarial drugs in pregnancy. descriptive methods were commonly used to summarize safety data: this review found that each clinical trial used at least one descriptive method to summarize safety data, and univariate statistical methods such as chi square or fisher's exact tests were used in two-thirds of the articles reviewed. such descriptive statistics and univariate statistical inference ignore useful information such as variability in follow-up time, missing data and correlation (for those trials in which multiple safety outcomes were repeatedly measured). hence data were used inefficiently during analysis, which may lead to a loss of information that could support improved and more informative conclusions.
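one way to address the follow-up-time issue noted above is a poisson model with a log person-time offset, which compares incidence rates rather than raw counts. the sketch below uses simulated data and assumed arm labels; it is not the analysis of any reviewed trial.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# simulated per-woman data: AE counts and weeks of follow-up, by trial arm
rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame({
    "arm": rng.integers(0, 2, n),            # 0 = standard, 1 = comparator (labels assumed)
    "weeks": rng.uniform(8, 20, n),          # individual follow-up time
})
rate = np.where(df["arm"] == 1, 0.05, 0.08)  # AEs per woman-week
df["ae_count"] = rng.poisson(rate * df["weeks"])

# the log follow-up offset turns counts into incidence rates, so arms with
# different follow-up are compared fairly
model = smf.glm("ae_count ~ arm", data=df,
                family=sm.families.Poisson(),
                offset=np.log(df["weeks"])).fit()
print("incidence-rate ratio:", np.exp(model.params["arm"]))
print("95% CI:", model.conf_int().apply(np.exp).loc["arm"].values)
```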
although a third of the reviewed trials attempted to use crude incidence, the analyses failed to adequately account for individual patient follow-up-time and potential confounders. all trials dichotomized at least a single continuous clinical laboratory safety outcome (i.e., where ae was defined based on standard cut-off points for adult toxicity). although this aids in providing time-specific drug safety status and easy interpretation, the dichotomized outcome may miss some information on the magnitude of the temporal changes, overtime during the trial. the information loss may lead to reduction in statistical power to detect safety signal if it exists. valid longitudinal methods (used without restriction on cut-off points) can address the information loss by exploiting potential within-subject correlations for the repeated clinical laboratory measurements [ ] [ ] [ ] . furthermore, the longitudinal methods can provide the basis for developing improved cut-off points tailored to pregnant women in malaria-specific settings. to ensure improved uptake of such methods, future work needs to strive towards making the results from the longitudinal methods feasibly interpretable to the medical practitioners. only three studies appropriately used multivariable statistical methods. adjusting for known prognostic covariates is useful to control for confounding that can be introduced due to imbalance when assessing if treatment is independently associated with safety outcome(s). of secondary interest, covariate adjustment also preserves type i error [ ] . such adjustment for potential confounders (e.g., age) in safety data analysis are suitable in clinical trials with at least moderate sample size, unlike small sample sizes that lead to unstable estimates. of specific interest in this review, the poisson model was more suitable in the context of rare aes which usually have low event rates [ , ] . since poisson regression assumes a constant rate of occurrence of a rare event, it is not ideal for other multiple transient aes that were common or recurred and would vary in occurrence overtime [ ] . alternatively, mixed effects models could be considered to characterize the safety events over time since they capture patient-specific effects [ , ] . whenever time to ae occurrence information is available, survival analysis models may also be preferred to characterize the time to ae occurrence. for recurrent safety events, that may induce dependence, methods that extend the cox regression model may be preferred; such models include survival mixed effects models (e.g., frailty models) [ ] . almost half of the reviewed trials did not explicitly define the population on which the safety analysis was based. if per protocol analysis is used to address non-adherence there is potential selection bias since it destroys the balance due to randomization. although consort recommends itt, as an alternative for analysis of safety endpoints, non-adherence cannot be explicitly addressed with itt approach since it ignores dropouts, withdrawal or loss-to-follow up for various reasons including safety concerns; itt-based inference ignores causal effect of the actual treatment received [ ] . patient withdrawal or dropout due to aes can induce informative censoring useful in quantifying anti-malarial drug safety. for example, if a patient withdraws due to vomiting after taking an anti-malarial drug, their obstetric efficacy outcomes such as birth weight may appear as missing data. 
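as an illustration of the longitudinal alternative to dichotomizing laboratory outcomes discussed above, the sketch below fits a random-intercept mixed model to simulated repeated haemoglobin measurements; the variable names, visit structure and effect sizes are all invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# simulated repeated haemoglobin (g/dL) at four antenatal visits, two arms
rng = np.random.default_rng(11)
rows = []
for i in range(150):
    arm = i % 2
    baseline = rng.normal(11.0, 1.0)                 # woman-specific level
    for visit in range(4):
        hb = baseline + 0.1 * visit + 0.05 * arm * visit + rng.normal(0, 0.4)
        rows.append({"id": i, "arm": arm, "visit": visit, "hb": hb})
df = pd.DataFrame(rows)

# a random intercept per woman captures within-woman correlation across visits,
# instead of dichotomizing haemoglobin into an anaemia yes/no AE at each visit
model = smf.mixedlm("hb ~ visit * arm", df, groups=df["id"]).fit()
print(model.summary())
```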
in the context of anti-malarial drug for malaria prevention, even mild ae can lead to drug nonadherence. since the patient has no disease symptoms, they would judge it less costly for them to discontinue the drug than continue experiencing aes. hence, inclusion of information on treatment/trial completion status in relation to anti-malarial safety would enrich development of the safety profile of anti-malarial drug in pregnancy. although study completion status, anti-malarial drug safety and missing data may be interlinked, missing data received limited attention such that the few trials that considered efficacy missing data did not explicitly explore the potential link. studying such complex associations requires statistical methods that can appropriately estimate the pathway from the anti-malarial chemoprevention to study completion. advantageously, methods based on causal inference framework, such as mediation analysis [ ] [ ] [ ] [ ] could be adapted/extended to assess the influence of the aes on non-adherence in rcts. despite about three-quarters of the trials reporting p-values after comparing safety outcomes by treatment arms, only about half of the reviewed trials adhered to international harmonisation conference guideline e in reporting of confidence intervals in quantifying the safety effect size [ , ] . use of confidence interval aids in interpretation of results by providing a measure of precision. furthermore, graphical displaying of safety data to aid in summarizing of safety data was inadequate. graphs on safety data have a greater ability to convey insight about patterns, trends, or anomalies that may signal potential safety issues compared to presentation of such data in tabular form only [ ] . for example, the graphs could help to visualize frequency and changes in aes over time by treatment arm. the graphs could further help in assessing assumptions for some statistical methods. over three-quarters of the reviewed trials were designed as superiority trials based on efficacy outcomes. although the statistical approach for safety assessment was mainly on superiority hypotheses (for both the superiority and non-superiority trials), clinical and statistical justification of assessing safety based on superiority hypotheses may be invalid. superiority hypotheses concentrate on the absence of difference in drug safety effect/risk between or across the treatment arms which may be challenging [ ] . for example, when comparing high ae incidences, non-significant difference (when using a superiority hypothesis) would not necessarily translate to a conclusion that a drug is safe and welltolerated since sometimes all compared treatment arms may have high ae incidence. perhaps, drug safety evaluation should strive to prove that there is no risk beyond a protocol-defined/hypothesized priori clinical safety margin (i.e., no excessive safety risk). based on the findings in this work, researchers are encouraged to consider defining safety margins in safety assessment of anti-malarial drugs. since safety is mostly a secondary outcome, it is not straightforward on how to define a non-inferiority margin and the appropriate analysis population. currently, it is still unclear and debatable how to implement this, such that further research is needed [ ] . interestingly, over half of the trials were openlabel which may influence physician clinical safety assessment on a patient and patient reporting of aes based on their expectations since they know the treatment assigned. 
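the kind of effect-size-plus-confidence-interval reporting discussed above can be as simple as a risk difference with a wald interval; the sketch below uses hypothetical counts and the usual normal approximation, which is only adequate when events are not too rare.

```python
import math

def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    """wald 95% CI for the difference in AE risks between two arms (minimal sketch)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - z * se, diff + z * se)

# hypothetical: 18/300 women with severe vomiting on the new drug vs 9/298 on the comparator
diff, (lo, hi) = risk_difference_ci(18, 300, 9, 298)
print(f"risk difference = {diff:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```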
appropriate reporting of the aes would be guided by data safety and monitoring boards (dsmb) from the early stages of the trial. however, the availability of dsmbs in over three-quarters of the trials did not translate into improved reporting and analysis; therefore, dsmb members should advocate for improved analysis approaches for aes. table summarizes recommendations to consider on best practices for safety analyses. this provides a general framework for the statistical analysis of safety in malaria chemoprevention in pregnancy trials. as highlighted above, the recommendations assume a context where the sample size is moderate or large. for rare events, bayesian approaches can be considered since they do not depend on asymptotic properties when handling rare events and can incorporate prior/external information [ ] . future research work can further consider adapting/extending recently developed statistical methods for rare disease or small population clinical trials towards the analysis of rare safety outcomes in iptp trials [ ] [ ] [ ] . this review agrees with other similar publications focusing on drug safety assessment in clinical trials that have noted the need for further improvement in the statistical analysis of the safety data [ , ] . this review also concurs with a recent review that noted that inappropriate handling of multiple tests is prevalent, although that review focussed on four high-impact journals, aes in general and a short review period [ ] . issues raised in this review include the time-dependence of aes, informative censoring due to discontinuation of treatment because of aes, safety graphs, the repeated occurrence of aes and the multivariate longitudinal structure of laboratory data that yields complex correlation. this is ongoing work in which further analysis will be explored to address the statistical issues identified above. the application of the systematic review protocol in describing the current practice is highly reliable and objective since it exhaustively identified the published anti-malarial drug clinical trials in pregnancy for the studied period. however, this review covered only the last decade of publications and may have missed studies published in other languages or that did not appear during the literature search. because the trials reported in the publications spanned a decade, it was difficult to assess temporal trends. this review represents the most comprehensive review of safety data analysis practice for this important indication. although useful safety data are collected in malaria chemoprevention in pregnancy clinical trials, the analysis remains sub-optimal and this hinders definitive conclusions about drug safety in this setting. descriptive statistical methods and dichotomization of continuous outcomes are predominant, which may lead to loss of useful information. the definition of the analysis population and informative presentation of results are not standardized. overall, the results suggest that the choice of statistical method(s) should be jointly driven by the scientific question of interest, the epidemiological/clinical plausibility of the method and the structure of the raw data. further work in addressing the highlighted gaps can enhance drug safety decisions and conclusions. supplementary information accompanies this paper at https://doi.org/ . /s - - -z.
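for rare adverse events, a simple bayesian sketch is a beta posterior on each arm's event probability, from which a credible interval and the posterior probability of excess risk can be read off. the jeffreys prior, counts and arm labels below are assumptions chosen for illustration, not a recommendation from the review.

```python
import numpy as np
from scipy import stats

# hypothetical rare AE: 2 events among 300 women (new drug) vs 1 among 298 (comparator)
events = {"new_drug": (2, 300), "comparator": (1, 298)}

# beta(0.5, 0.5) jeffreys prior; with binomial data the posterior is again a beta
posteriors = {arm: stats.beta(0.5 + e, 0.5 + n - e) for arm, (e, n) in events.items()}

for arm, post in posteriors.items():
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{arm}: posterior median {post.median():.4f}, 95% CrI ({lo:.4f}, {hi:.4f})")

# posterior probability that the new drug's AE rate exceeds the comparator's,
# approximated by monte carlo draws from the two independent posteriors
draws_new = posteriors["new_drug"].rvs(size=50_000, random_state=3)
draws_cmp = posteriors["comparator"].rvs(size=50_000, random_state=4)
print("P(rate_new > rate_comparator) ~", float(np.mean(draws_new > draws_cmp)))
```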
a systematic review of safety data reporting in clinical trials of vaccines against malaria, tuberculosis, and human immunodeficiency virus explanation and elaboration: updated guidelines for reporting parallel group randomised trials statistical principles for clinical trials (ich e ): an introductory note on an international guideline ich harmonised tripartite guideline. statistical principles for clinical trials. international conference on harmonisation e expert working group sources of safety data and statistical strategies for design and analysis: clinical trials a question-based approach to the analysis of safety data adherence to therapy and adverse drug reactions: is there a link? reporting of lost to follow-up and treatment discontinuation in pharmacotherapy and device trials in chronic heart failure drug safety assessment in clinical trials: methodological challenges and opportunities updated who policy recommendation: intermittent preventive treatment of malaria in pregnancy using sulfadoxine-pyrimethamine (iptp-sp). geneva, world health organization treatment of uncomplicated and severe malaria during pregnancy methodology of assessment and reporting of safety in anti-malarial treatment efficacy studies of uncomplicated falciparum malaria in pregnancy: a systematic literature review evaluating harm associated with anti-malarial drugs: a survey of methods used by clinical researchers to elicit, assess and record participant-reported adverse events and related data the prisma statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration statement: updated guidelines for reporting parallel group randomised trials superiority, equivalence, and non-inferiority trials the statistical content of published medical research: some implications for biomedical education use of statistical analysis in the the analysis of multivariate longitudinal data: a review modeling laboratory data from clinical trials statistical methods for evaluating safety in medical product development the risks and rewards of covariate adjustment in randomized trials: an assessment of outcomes from studies a model for the distribution of daily number of births in obstetric clinics based on a descriptive retrospective study some simple robust methods for the analysis of recurrent events an approach to integrated safety analyses from clinical studies analysis of adverse events in drug safety: a multivariate approach using stratified quasi-least squares a framework for the design, conduct and interpretation of randomised controlled trials in the presence of treatment changes correcting for non-compliance in randomized trials using structural nested mean models addressing complications of intention-to-treat analysis in the combined presence of all-or-none treatment-noncompliance and subsequent missing outcomes intention-to-treat meets missing data: implications of alternate strategies for analyzing clinical trials data estimating treatment effects in randomised controlled trials with non-compliance: a simulation study graphical approaches to the analysis of safety data from clinical trials bayesian methods for design and analysis of safety trials recent advances in methodology for clinical trials in small populations: the inspire project lessons learned from ideal- recommendations from the ideal-net about design and analysis of small population clinical trials applicability and added value of novel methods to improve drug development in 
rare diseases recommendations to improve adverse event reporting in clinical trial publications: a joint pharmaceutical industry/journal editor perspective analysis and reporting of adverse events in randomised controlled trials: a review effect of repeated treatment of pregnant women with sulfadoxine-pyrimethamine and azithromycin on preterm delivery in malawi: a randomized controlled trial intermittent preventive treatment of malaria with sulphadoxine-pyrimethamine during pregnancy in burkina faso: effect of adding a third dose to the standard two-dose regimen on low birth weight, anaemia and pregnancy outcomes superiority of over doses of intermittent preventive treatment with sulfadoxine-pyrimethamine for the prevention of malaria during pregnancy in mali: a randomized controlled trial efficacy of malaria prevention during pregnancy in an area of low and unstable transmission: an individually-randomised placebo-controlled trial using intermittent preventive treatment and insecticide-treated nets in the kabale highlands, southwestern uganda intermittent preventive treatment with sulfadoxine-pyrimethamine versus weekly chloroquine prophylaxis for malaria in pregnancy in honiara, solomon islands: a randomised trial cotrimoxazole prophylaxis versus mefloquine intermittent preventive treatment to prevent malaria in hiv-infected pregnant women: two randomized controlled trials intermittent preventive treatment of malaria in pregnancy with mefloquine in hiv-negative women: a multicentre randomized controlled trial intermittent preventive treatment of malaria in pregnancy with mefloquine in hiv-infected women receiving cotrimoxazole prophylaxis: a multicenter randomized placebo-controlled trial effectiveness of co-trimoxazole to prevent plasmodium falciparum malaria in hiv-positive pregnant women in sub-saharan africa: an open-label, randomized controlled trial safety of daily co-trimoxazole in pregnancy in an area of changing malaria epidemiology: a phase b randomized controlled clinical trial intermittent screening and treatment or intermittent preventive treatment with dihydroartemisinin-piperaquine versus intermittent preventive treatment with sulfadoxine-pyrimethamine for the control of malaria during pregnancy in western kenya: an open-label, three-group, randomised controlled superiority trial sulphadoxine-pyrimethamine plus azithromycin for the prevention of low birthweight in papua new guinea: a randomised controlled trial dihydroartemisinin-piperaquine for the prevention of malaria in pregnancy efficacy and safety of azithromycin-chloroquine versus sulfadoxine-pyrimethamine for intermittent preventive treatment of plasmodium falciparum malaria infection in pregnant women in africa: an open-label, randomized trial intermittent preventive treatment with dihydroartemisinin-piperaquine for the prevention of malaria among hiv-infected pregnant women chloroquine as weekly chemoprophylaxis or intermittent treatment to prevent malaria in pregnancy in malawi: a randomised controlled trial comparative study of mefloquine and sulphadoxine-pyrimethamine for malaria prevention among pregnant women with hiv in southwest nigeria monthly sulfadoxine-pyrimethamine versus
dihydroartemisinin-piperaquine for intermittent preventive treatment of malaria in pregnancy: a double-blind, randomised, controlled, superiority trial publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations utrecht university, utrecht, the netherlands. key: cord- -umrlsbh authors: jiang, shufan; angarita, rafael; chiky, raja; cormier, stéphane; rousseaux, francis title: towards the integration of agricultural data from heterogeneous sources: perspectives for the french agricultural context using semantic technologies date: - - journal: advanced information systems engineering workshops doi: . / - - - - _ sha: doc_id: cord_uid: umrlsbh sustainable agriculture is crucial to society since it aims at supporting the world’s current food needs without compromising future generations. recent developments in smart agriculture and internet of things have made possible the collection of unprecedented amounts of agricultural data with the goal of making agricultural processes better and more efficient, and thus supporting sustainable agriculture. these data coming from different types of iot devices can also be combined with relevant information published in online social networks and on the web in the form of textual documents. our objective is to integrate such heterogeneous data into knowledge bases that can support farmers in their activities, and to present global, real-time and comprehensive information to researchers. semantic technologies and linked data provide a possibility for data integration and for automatic information extraction. this paper aims to give a brief review on the current semantic web technology applications for agricultural corpus, then to discuss the limits and potentials in construction and maintenance of existing ontologies in agricultural domain. recent advances in information and communication technology (ict) aim at tackling some of the most important challenges in agriculture we face today [ ] . supporting the world's current food needs without compromising future generations through sustainable agriculture is of great challenge. indeed, among all the topics around sustainable agriculture, how to reduce the usage, and the impact of pesticide without losing the quantity or quality in the yield to fulfill the requirement of the growing population has an increasingly important place [ ] . researchers have applied a wide range of technologies to tackle some specific goals. among these goals: climate prediction in agriculture using simulation models [ ] , making the production of certain types of grains more efficient and effective with computer vision and artificial intelligence [ ] , soil assessment with drones [ ] , and the iot paradigm when connected devices such as sensors capture real-time data at the field level and that, combined with cloud computing, can be used to monitor agricultural components such as soil, plants, animals and weather and other environmental conditions [ ] . the usage of such icts to improve farming processes is known as smart farming [ ] . in the context of smart farming, iot devices themselves are both data producers and data consumers and they produce highly-structured data; however these devices and the technologies we presented above are far from being the only data sources. 
indeed, important information related to agriculture can also come from different sources such as official periodic reports and journals like the french plants health bulletins (bsv, for its name in french bulletin de santé du végétal ) , social media such as twitter and farmers experiences. the goal of the bsv is to: i), present a report of crop health, including their stages of development, observations of pests and diseases, and the presence of symptoms related to them; and ii), provide an evaluation of the phytosanitary risk, according to the periods of crop sensitivity and the pest and disease thresholds. the bsv and other formal reports are semi-structured data. in the agricultural context, twitter -or any other social media-can be used as a platform for knowledge exchange about sustainable soil management [ ] and it can also help the public to understand agricultural issues and support risk and crisis communication in agriculture [ ] . farmer experiences (aka old farming practices or ancestral knowledge) may be collected through interviews and participatory processes. social media posts and farmer experiences are nonstructured data. figure illustrates how this heterogeneous data coming from different sources may look like for farmers: information is not always explicit or timely. our objective is to integrate such heterogeneous data into knowledge bases that can support farmers in their activities, and to present global, real-time and comprehensive information to researchers and interested parties. we present related work in sect. , our initial approach in sect. and conclusions and perspectives in sect. . we classify existing works into two categories: information access and management in plant health domain, and data integration in agriculture. in the information access and management in plant health domain category, the semantic annotation in bsv focuses on extracting information for the traditional bsv. indeed, for more than years, printed plant health bulletins have been diffused by regions and by crops in france, giving information about the arrival and the evolution of pests, pathogens, and weeds, and advises for preventive actions. these bulletins serve not only as agricultural alerts for farmers but also documentation for those who want to study the historical data. the french national institute for agricultural research (inra) has been working towards the publishing of the bulletins as linked open data [ ] , where bsv from different regions are centralized, tagged with crop type, region, date and published on the internet. to organize the bulletins by crop usage in france, an ontology with concepts was manually constructed. with the volume of concepts and relations augmenting, manual construction of ontologies will become too expensive [ ] . thus, ontology learning methods to automatically extract concepts and relationships should be studied. inra has also introduced a method to modulate an ontology for crop observation [ ] . the process is the following: ) collect competency questions from researchers in agronomy; ) construct the ontology corresponding to requirements in competency questions; ) ask semantic experts who have not participated in the conception of the ontology to translate the competency questions into sparql queries to validate the ontology design. in this exercise, a model to describe the appearance of pests was given but not instantiated, nevertheless it could be a reference to our future crop-pest ontology conception. 
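to make the competency-question workflow described above concrete, the following is a minimal sketch (not taken from the inra work) of how a question such as ''which pests were observed on a given crop in a given region and period?'' might be expressed in sparql and executed from python with SPARQLWrapper. the endpoint url and the vocabulary terms (ex:PestObservation, ex:pest, ex:crop, ex:region, ex:date) are hypothetical placeholders for whatever crop-pest ontology is actually adopted.

```python
# Hypothetical sketch: querying a crop-pest knowledge base with a SPARQL
# competency question. Endpoint and vocabulary are placeholders, not the
# actual INRA/BSV resources.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/agri/sparql"  # assumed endpoint

QUERY = """
PREFIX ex:  <http://example.org/agri/ontology#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?pest ?date
WHERE {
  ?obs a ex:PestObservation ;
       ex:pest   ?pest ;
       ex:crop   ex:Wheat ;        # assumed crop individual
       ex:region ex:GrandEst ;     # assumed region individual
       ex:date   ?date .
  FILTER (?date >= "2019-03-01"^^xsd:date && ?date <= "2019-06-30"^^xsd:date)
}
ORDER BY ?date
"""

def run_competency_question():
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(QUERY)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    # Each binding holds one pest observation matching the question.
    for row in results["results"]["bindings"]:
        print(row["pest"]["value"], row["date"]["value"])

if __name__ == "__main__":
    run_competency_question()
```

the same pattern can also support ontology validation: if a competency question cannot be written as a query over the modelled classes and properties, the ontology is missing a concept or a relation.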
finally, pest observer (http://www.pestobserver.eu/) is a web portal [ ] which enables users to explore bsv with a combination of the following filters: crop, disease and pest; however, crop-pest relationships are not included. it relies on text-mining techniques to index bsv documents. regarding data integration in agriculture, agris , the international system for agricultural science technology states that many initiatives are developed to return more meaningful data to users [ ] . some of these initiatives are: extracting keywords by crawling the web to build the agrovoc vocabulary, which covers all areas of interest of the food and agriculture organization of the united nations; and semagrow [ ] , which is an open-source infrastructure for linked open data (lod) integration that federates sparql endpoints from different providers. to extract pest and insecticide related relations, semagrow uses computer-aided ontology development architecture (coda) for rdf triplification of unstructured information management architecture (uima) results from analysis of unstructured content. though inra kick-started categorizing the french crop bulletins using linked open data, and that project semagrow shed light upon heterogeneous data integration using ontologies, both projects focused on processing formal and technical documents. moreover, in coda application case, ispestof rule was defined but not instantiated. therefore, a global knowledge base, that covers the crops, the natural hazards including pests, diseases, and climate variations, and the relations between them, is still missing. there is also an increasing necessity to a comprehensive and an automatic approach to integrate knowledge from an ampler variety of heterogeneous sources. -linguistic preprocessing: unstructured and semi-structured textual data are passed through a linguistic prepossessing pipeline (sentence segmentation, tokenization, part-of-speech (pos) tagging, lemmatization) with existing natural language processing (nlp) tools such as stanford nlp (https:// nlp.stanford.edu/), gate (https://gate.ac.uk/) and uima (https://uima. apache.org/). -terms/concept detection: at the best of our knowledge and from the state of the art study, there is no ontology in french that modulates the natural hazards and their relations with crops. existing french thesaurus like french crop usage and agrovoc can be applied to filter collected data and served as gazetteer. linguistic rules represented by regular expressions can be used to extract temporal data. recurrent neural network (rnn), conditional random field (crf) model and bidirectional long-short term memory (bilstm) were applied for health-related name entity recognition from twitter messages and gave a remarkable result [ ] . once the ontology is populated, it could provide knowledge and constraints to the extraction of terms [ ] . -relation detection: similar to term/concept detection, initially there's no ontology. a basic strategy could be using self-supervised methods like modified open information extraction (moie): i) use wordnet-based semantic similarity and frequency distribution to identify related terms among detected terms from previous step ii) slicing the textual patterns between related terms [ ] . once the ontology is populated, it could contribute to calculate semantic similarities between detected terms in phase i). 
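as an illustration of the preprocessing and term-detection steps listed above, the short sketch below combines spacy's french pipeline, a gazetteer built from thesaurus terms, and a simple regular expression for dates. it is only an assumed, minimal arrangement: the gazetteer entries, the example sentence, and the date pattern are invented, and it does not reproduce the bilstm/crf entity recognizers or the open information extraction step mentioned above.

```python
# Minimal sketch of linguistic preprocessing + gazetteer-based term detection.
# Requires: python -m spacy download fr_core_news_sm
# Gazetteer entries and the example sentence are illustrative only.
import re
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("fr_core_news_sm")  # French tokenizer, POS tagger, lemmatizer

# Terms that would normally come from French Crop Usage / AGROVOC.
crop_terms = ["blé tendre", "colza", "maïs"]
pest_terms = ["puceron", "pyrale", "mildiou"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("CROP", [nlp.make_doc(t) for t in crop_terms])
matcher.add("PEST", [nlp.make_doc(t) for t in pest_terms])

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")  # crude temporal extraction

def annotate(text: str):
    doc = nlp(text)  # sentence segmentation, tokenization, POS, lemmas
    lemmas = [(tok.text, tok.lemma_, tok.pos_) for tok in doc]
    concepts = [(nlp.vocab.strings[mid], doc[s:e].text) for mid, s, e in matcher(doc)]
    dates = DATE_RE.findall(text)
    return lemmas, concepts, dates

if __name__ == "__main__":
    _, concepts, dates = annotate("Puceron observé sur blé tendre le 12/04/2020.")
    print(concepts)  # e.g. [('PEST', 'Puceron'), ('CROP', 'blé tendre')]
    print(dates)     # e.g. ['12/04/2020']
```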
new digital technologies allow farmers to predict the yield of their fields, to optimize their resources and to avoid or protect their fields from natural hazards, whether they are due to the weather, pests or diseases. this is a recent area where research is constantly evolving. we have introduced in this paper work relevant to our problem, namely the integration of several data sources to extract information related to natural hazards in agriculture. we then proposed an architecture based on ontology learning and ontology-based information extraction. in a first phase, we plan to build an ontology from twitter data that contains vocabulary from the existing thesauri. to evaluate the constructed ontology, we will extract crops and pests from the learnt ontology and compare them with tags in pest observer. in the following iterations, we will work on ontology alignment strategies to update the ontology with data from other sources. to go further, multilingual ontology management that preserves spatio-temporal contexts should be investigated. a little birdie told me about agriculture: best practices and future uses of twitter in agricultural communications ontology-based healthcare named entity recognition from twitter messages using a recurrent neural network approach a crop-pest ontology for extension publications discovering, indexing and interlinking information resources. f research information technology: the global key to precision agriculture and sustainability appel à projets -durabilité des systèmes de productions agricoles alternatifs advances in application of climate prediction in agriculture automatic relationship extraction from agricultural text for ontology construction designing innovative linked open data and semantic technologies in agro-environmental modelling the use of twitter for knowledge exchange on sustainable soil management computer vision and artificial intelligence in precision agriculture for grain crops: a systematic review a methodology for the publication of agricultural alert bulletins as lod annotation sémantique pour une interrogation experte des bulletins de santé du végétal towards smart farming and sustainable agriculture with drones open data platform for knowledge access in plant health domain: vespa mining internet of things (iot) and cloud computing for agriculture: an overview ontology-based information extraction: an introduction and a survey of current approaches big data in smart farming-a review key: cord- -afgvztwo authors: nan title: engineering a global response to infectious diseases: this paper presents a more robust, adaptable, and scalable engineering infrastructure to improve the capability to respond to infectious diseases. contributed paper date: - - journal: proc ieee inst electr electron eng doi: . /jproc. . sha: doc_id: cord_uid: afgvztwo infectious diseases are a major cause of death and economic impact worldwide. a more robust, adaptable, and scalable infrastructure would improve the capability to respond to epidemics. because engineers contribute to the design and implementation of infrastructure, there are opportunities for innovative solutions to infectious disease response within existing systems that have utility, and therefore resources, before a public health emergency.
examples of innovative leveraging of infrastructure, technologies to enhance existing disease management strategies, engineering approaches to accelerate the rate of discovery and application of scientific, clinical, and public health information, and ethical issues that need to be addressed for implementation are presented. powerful antibiotics and vaccines helped mitigate the threat from infectious diseases for several generations. in , most human deaths were associated with infectious diseases like tuberculosis and influenza. as recently as , the worldwide mortality associated with infectious diseases was . percent of deaths from all causes. in , this had continued to decline to . percent. unfortunately, this means there were still over million deaths associated with infectious diseases [ ] . a recent review examined antimicrobial resistance and predicted that by , the impact would include a reduction of the world's potential gross domestic product by % to . % and cause an additional million premature deaths a year [ ] . although beyond the scope of this paper, it is worth noting that microorganisms have also been implicated as contributing to or causing many chronic diseases, including some forms of cancer, arthritis, and neurological disease [ ] . it is tempting to approach the infectious disease challenge as doing battle with a pathogen enemy where brigades of combatant bacteria or viruses are held back or even defeated by increasingly sophisticated pharmaceutical weapons. there is certainly a place for improved pharmaceuticals; however, a sustainable approach will need to be much more sophisticated. microbes and their hosts form a complex and dynamic ecosystem, and a long-term strategy for infectious disease control must take into account the fact that diseases can result from changes in the microbe, the host, or the environment. it is time to move beyond the simple war metaphor [ ] . to compound the challenge, microbes are notoriously fast in adapting to new environments. this can include bacteria developing antibiotic resistance or acquiring metabolic traits that allow them to thrive in a new environmental niche, and viruses evolving to reduce the effectiveness of antivirals and vaccines. in addition to the evolution of existing pathogens, like the seasonal influenza virus, there are emerging pathogens that are often the result of changing or encroachment upon new ecosystems and the ''leap'' from a conventional host to a new host species. an example from recent headlines is the middle east respiratory syndrome coronavirus. mers-cov appears to have reached humans by direct contact and potentially airborne transmission through animal hosts including camels [ ] . the mers-cov is related to the coronavirus that caused severe acute respiratory syndrome (sars). the sars outbreak began in and likely spread to humans via bats [ ] . although the viruses are similar, this does not guarantee that utilization of the same medical and public health intervention techniques will be effective. there are many examples of emerging infectious disease outbreaks, including the on-going hiv/aids pandemic that has already caused million deaths [ ] . each pathogen involved in an infectious disease outbreak provides an opportunity to identify what scientific data are needed to support effective interventions. quoted in the global health security agenda [ ] , u.s. president obama said in ''. . . 
we must come together to prevent, and detect, and fight every kind of biological danger, whether it's a pandemic like h n , or a terrorist threat, or a treatable disease.'' the agenda complements and supports existing international health regulations of the world health organization [ ], u.s. public health [ ] , [ ] , and biodefense objectives [ ] - [ ] . the framework for data requirements and response priorities used in this paper integrates across these initiatives and regulations. specifically, in order to manage infectious diseases, capabilities are required for: preparedness, detection, characterization, response, and support for the return to normal (see fig. ). this framework is analogous to homeostasis in living organisms. these capabilities are relevant for addressing health interests from the global level down to the individual organism, e.g., human, animal, plant, or bacterium. the global perspective is studied and implemented by public health, ecological, industrial, and other communities with the principal foci of public benefit, humanitarian needs, scaling, and statistical measures. for an individual human, the perspectives are from medical, economic, relationship, and other personal priorities with the foci of individual health, quality of life, and gaining access to effective care. integrating frameworks are needed to support optimization of the technical, economic, medical, and ethical components of this complex system. examples of innovative leveraging of engineered infrastructure are provided throughout the paper. the next section of this paper discusses opportunities for technology to improve on current approaches to infectious disease management and the following section discusses engineering challenges to accelerate the application of science to infectious disease planning and response at the global scale. i conclude with a brief discussion of the importance and opportunity for engineers to address the ethical issues needed to leverage traditional infrastructure for infectious disease response and help nurture a global culture of responsibility in both healthcare and technical applications. leveraging their significant investment in planning and response experience, i adopted infectious disease management goals from the u.s. pandemic influenza plan [ , p. g- ] to provide: public health policy-makers with data to guide response, and clinicians with scientific data to justify recommended treatments, vaccines, or other interventions. i have integrated priorities from the influenza plan with a more pathogen-centric approach from the food industry [ ] in table to provide descriptions of priority data to support infectious disease outbreak response. because traditional diagnostics and treatments have long development, regulatory approval, and manufacturing lead times, it is challenging to provide timely and effective interventions at a public health scale for an outbreak caused by an emerging or novel pathogen.
approaches to achieving robust, economically viable scaling include improved leveraging of existing infrastructure, establishment of an integrating framework like the digital immune system for optimization, and spiral development processes similar to homeostasis. these data can be provided with currently available technologies. however, there are several recurring issues that inhibit global utilization. the issues that can be addressed, even if only partially, by technology are discussed. the key recurring issue is availability. limited availability is driven by many factors including cost, appropriate sharing of data and materials, and timely manufacture and distribution. on a more basic level, one of the key factors in sustainable preparedness is infrastructure, and availability is a challenge here as well. malnutrition due to starvation, unsafe water, and insufficient sanitation all impact infectious disease mortality. in , over half of the deaths of children under five years old were associated with infectious diseases and the significant contributing factor for many of these deaths was under nutrition [ ] . building the infrastructure to eliminate these hazards has historically been the domain of civil and agricultural engineers. electrical and computer engineers are now providing valuable low-cost information linkages across systems so that weather satellite data can be utilized to help increase local crop yields and prepare water treatment and sanitation plants for adverse weather. the computational algorithms and information networks can be applied worldwide for irrigation and weather prediction for storm and drought management [ ] . remote sensing has been utilized to indirectly detect vibrio cholera [ ] and predict a rift valley fever (rvf) outbreak [ ] . in both of these examples, the disease and environmental biology were shown to correlate with changes that could be measured by air and space borne sensors. for v. cholerae, sea-surface temperature and sea height were linked to the inland incursion of water with commensal plankton. satellite measurements of seasurface temperature, rainfall, and vegetation changes were used to predict the areas where outbreaks of rvf in humans and animals occurred in africa. the techniques and data used for rvf may be more broadly applicable to other vector-borne diseases. in regions with limited infrastructure, it is often difficult or impossible to provide the refrigeration required to maintain the ''cold chain'' for life-saving vaccines and other medicines. an inspirational consortium of industry, churches, and nonprofits in zimbabwe, africa, leveraged the reliable power requirements of cellphone towers to help address the refrigeration storage needs for many vaccines [ ] . the initiative has included innovative contributions by wireless providers, refrigerator manufacturers, and others in order to help provide immunizations against polio, measles, and diphtheria. another area in which technology is poised to impact infectious disease management is in gaining timely situational awareness of outbreaks. 
global and regional travel often make this a difficult task, and data collection and sharing for epidemiologists, care providers, patients, and the public is also limited by other factors such as privacy concerns for individual patients' medical data, governmental goals to protect tourism and other local-to-national interests, the lack of recognized standards for sharing protected data, and an accepted international norm for transfer of public and commercial material during and in response to an infectious disease. improved network, encryption and access information technologies are also necessary to support managed care organizations and telemedicine applications. a global system addressing these issues and capable of operating on time scales relevant to controlling an epidemic is needed. a comparison of five outbreak detection algorithms was conducted using a surveillance case study of the seasonal ross river virus disease [ ] . challenges were identified for making quantitative comparisons of the algorithms as well as in evaluating the performance of each algorithm. a network model has been proposed that has the potential to address algorithm shortcomings for outbreak localization and performance under changing baselines [ ] . this is accomplished through modeling the relationships among different data streams rather than only the time series of one data stream compared to its historical baseline. using measured and simulated data, this approach showed promise for addressing shifts in health data that occur due to special events, worried well, and other population shifts that happen during significant events like pandemics. another study compared animal and public health surveillance systems, finding challenges due to a limited number of common attributes, unclear surveillance objectives of the design, no common [ ] . privacy issues pose challenges that are difficult to address with technology, but they have been addressed in some applications through voluntary enrollment. for example, the geographic location capability in many cell phones allows applications to push public health and animal disease outbreak information to users based on location. the u.s. centers for disease control and prevention (cdc) has the fluview application that provides geographic information on influenza-like illness activity [ ] . obviously, the pervasiveness of cell phones improves timely reporting from the field for both the public and the public health profession. moving forward, addressing privacy issues will be critical so that geographic tracking of a phone's location could be used to help inform an individual of potential contact with infected persons or animals and support automated, anonymous, electronic integration of those data to accelerate the epidemiological detective work of identifying and surveying those same individuals for public health benefit. electronic health information systems have made significant progress. however, even as recently as , it was noted: ''despite progress in establishing standards and services to support health information exchange and interoperability, practice patterns have not changed to the point that health care providers share patient health information electronically across organizational, vendor, and geographic boundaries. electronic health information is not yet sufficiently standardized to allow seamless interoperability'' [ ] . the u.s. has taken a risk-based approach to health information technology (it) regulation [ ] . 
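returning to the outbreak-detection algorithms compared above: the specific algorithms are not reproduced here, but the general family of baseline-threshold detectors they belong to can be illustrated with a few lines of python, flagging weeks whose case counts exceed the mean of a trailing window by a chosen number of standard deviations. the window length, threshold, and synthetic counts below are assumptions for illustration, and, as the network model cited above points out, such single-stream detectors struggle when baselines shift.

```python
# Toy threshold-based detector over a weekly case-count series (illustrative
# only; not one of the specific algorithms evaluated in the cited study).
import numpy as np
import pandas as pd

def detect_outbreaks(counts: pd.Series, baseline_weeks: int = 8, z_threshold: float = 3.0):
    """Flag weeks whose count exceeds baseline mean + z_threshold * baseline std."""
    mean = counts.shift(1).rolling(baseline_weeks).mean()
    std = counts.shift(1).rolling(baseline_weeks).std().replace(0, np.nan)
    z = (counts - mean) / std
    return pd.DataFrame({"count": counts, "z": z, "alarm": z > z_threshold})

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weeks = pd.date_range("2015-01-04", periods=52, freq="W")
    cases = rng.poisson(lam=20, size=52).astype(float)
    cases[30:34] += 40                 # inject a simulated outbreak
    report = detect_outbreaks(pd.Series(cases, index=weeks))
    print(report[report["alarm"]])     # weeks flagged as anomalous
```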
safety in health it has been recognized as part of a larger system that needs to consider not just specific software, but how it interacts with the it system and how it will be used by clinicians [ ] . continued development of quality management, human factors, and other standards to support usability and regulatory review are needed. as the large-scale systems that require computer algorithms to scan and integrate data into summary reports continue to progress, significant benefit is being derived from systems with fairly simple technology. in , promed was started as the first e-mail reporting system with curation. it has grown to over , subscribers in over countries and has roles in outbreak detection including sars in [ ] . there are many health alert networks (han) that utilize websites and electronic communications [ ] , [ ] . a study in new york city showed that most physicians ( %) received health department communications, but less than half of those ( %) received the information through the han [ ] . an important trend is that % prefer e-mail distribution of communications and this preference trends with younger respondents. looking to the future, achieving the benefits of distributed diagnostics and electronic reporting will depend on both technical and clinical integration. this would improve individual care as well as reduce or eliminate separate data entry and reporting for public health surveillance. in order to realize the benefits of personalized medicine with treatment customized to the individual, the costs, scalability, and compatibility across all data sources must be addressed. personalized infectious disease medicine might include customization of antimicrobials to an individual patient to improve care and help reduce overuse and misuse of antibiotics. similarly at the global health scale, timely identification of appropriate virus strains coupled with rapid manufacturing for seasonal influenza vaccination would reduce the disease burden of thousands. in personalized and global applications, the approaches will need to move from diagnostics, characterizations, and interventions aimed at a single specific disease causing pathogen to robust methods that can be adapted for safe and effective use in a timely manner for broad classes of disease. fully realizing these goals will require improved scientific understanding and new engineering and computational approaches. there are significant challenges in utilizing traditional engineering approaches in the life sciences. living organisms are complex by most machine standards. individual organisms are also typically influenced by peer organisms forming a community, by other living organisms that may be beneficial, neutral or detrimental, and by the environment. there are opportunities for engineers to develop improved measurement, analysis, and model systems to better characterize, predict and manage infectious diseases. given the complexity of these interacting systems, there are significant challenges to the reductionist approaches familiar to design-based engineering. fortunately, as the history of vaccination demonstrates, a comprehensive knowledge of the biology is not always required in order to provide healthcare benefits. with the advent of deoxyribonucleic acid (dna) sequencing and the efficient detection and laboratory replication of dna through polymerase chain reaction (pcr) amplification, there are opportunities to organize scientific and medical data using dna-based indices. 
the field that has grown up around these technologies, genomics, is an excellent example of how engineering can enable profound advances in biological research. a significant contributor to the organizing principles and demonstrator of this dna-based approach, carl woese, summarized the impact as providing ''a new and powerful perspective, an image that unifies all life through its shared histories and common origin, at the same time emphasizing life's incredible diversity and the overwhelming importance of the microbial world,'' [ ] . here, i build on this insight as well as our previous assessment of the engineering contributions and opportunities related to dna sequence data that describe an organism's genome, transcriptome and proteome [ ] . as depicted in fig. , nucleic acid sequence provides a common framework to organize data related to infectious diseases. today's dna sequencing instruments can produce terabases of data in a few days with per-base costs a million-fold lower than a decade ago. recent studies have demonstrated the power of personalized approaches to major human diseases like cancer. it appears likely that the next few years will bring the personalized human genome and the miniaturized sequencing instrument, each for less than one thousand dollars. now is the time to innovate and apply the engineering approaches needed to utilize these and other data. dna sequencing and the associated bioinformatics tools for analysis provide a powerful methods-based approach to monitoring living systems. dna and ribonucleic acid (rna) provide data that help characterize an individual's health status as well as the status of the surrounding environment. because sequencing converts biological information to digital data, computer networks, data management, integrity, scaling, analysis, privacy, and affordability are keys to expanding access. the ''digital immune system'' is a powerful concept that generalizes the method to population dynamics and public health [ ] . basing the system on dna allows correlation techniques to identify patterns in the sequence data: recurring and deviant patterns. these patterns can be indicative of healthy or pathologic host status as well as the absence or presence of pathogens in clinical and environmental samples. the growth of sequence databases with appropriate clinical and environmental metadata will improve the potential quality of the analyses and the impact on individual and public health. growing databases will also need to address the scaling and privacy issues. one of the most powerful features of the digital immune system approach is the potential to detect a novel pathogen, i.e., one that is not already in the database. the flexibility inherent in this method-based approach is a huge strength and distinguishes it from many of the traditional methods which are not able to detect or characterize a new or emerging pathogen. seasonal influenza provides an example of a system of method-based opportunities. the influenza virus changes genetically as it uses an error-prone enzyme to replicate its genome, and mutates further as it migrates from host-to-host, across species, e.g., waterfowl to humans, and across ecosystems of the environment. this mutation process gives rise to a different population of viruses in circulation each year, and poses a challenge to vaccine manufacturers, as one year's vaccine will typically have little value when confronted with the next year's viral strains.
each year's vaccine is constructed specifically to protect against the strains that are projected to be dominant in the upcoming flu season. the current process uses egg-based techniques for manufacturing influenza vaccine and has been successfully utilized for decades as have the techniques for isolating and identifying emerging strains of the virus. unfortunately, the combined pipeline is relatively slow and does not typically allow for vaccine manufacturing to be based on strains that have been detected at the start of the flu seasonvvaccine production must be started earlier, and so relies heavily on imperfect predictions as to which strains of influenza will dominate. rna sequence data provide a method-based framework for managing influenza response at the global scale. in addition to the detection and identification of the virus, sequence data can be utilized to compare and predict the performance of egg and other manufacturing approaches. as the sequence, clinical, animal, and environmental data are accumulated, there is also the potential to support computational safety and efficacy screening. shortening the current timeline from detection through vaccination would have significant positive health benefits. there are method-based approaches to vaccine manufacturing with significant potential to improve upon the egg-based approach. even though the influenza virus has only eight genes and an ominous history of multimillion death pandemics, the scientific understanding is not yet sufficient to avoid the thousands of deaths annually from seasonal influenza nor to mitigate the potential from a pandemic strain. complex samples to be converted to nucleic acid sequence data that are easily represented in a digital computer. these data can be used as a framework or index for health and disease related metadata supporting correlative studies across species and providing insight into infectious diseases. because genetic material is traded among organisms and is often part of complex nonlinear networks within an organism and beyond, increased collection and interpretation of the associations across dna sequence and metadata are needed. a digital immune system for individuals and populations is envisioned that identifies causation and intervention options to support patient-specific and public health interventions. biocontainment laboratories are needed to safely conduct the research needed to understand pathogens as well as analyze clinical samples. characterization of existing pathogens increases understanding of current diseases and also helps to prepare for emerging diseases. laboratories that work with infectious agents are categorized by biosafety level (bsl) ranging from a basic biomedical laboratory (bsl- ) to bsl- laboratories (fig. ) that can safely handle untreatable disease agents [ ] . the engineering systems for automatically controlling airflow and other facility safety components are critical for supporting clinical and research laboratories. infectious disease research is benefiting from ''omic'' methods for characterizing proteins (proteomics), metabolites (metabolomics), messenger rna (transcriptomics), etc. in order to safely produce genomic data more quickly, dna sequencing instruments have been moved into our biocontainment laboratories. these methods produce large volumes of data, requiring research and clinical labs to have access to traditional engineering disciplines in data management and analysis. data management is not the only challenge. 
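the same digital representation underlies both the strain comparisons discussed above and the novelty detection that distinguishes the digital immune system from signature-only methods. to make the pattern-matching idea concrete, here is a deliberately simplified sketch: a set of k-mers extracted from reference sequences stands in for the sequence database, and reads whose k-mers are mostly absent from that set are flagged as potentially novel. real systems rely on far more sophisticated indexing, error models, and statistics; the k size, threshold, and sequences below are invented for illustration.

```python
# Toy illustration of flagging potentially novel sequences against a k-mer
# reference set. Parameters and sequences are invented for the example.
from typing import Iterable, Set

def kmers(seq: str, k: int = 21) -> Set[str]:
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_reference(reference_seqs: Iterable[str], k: int = 21) -> Set[str]:
    ref: Set[str] = set()
    for s in reference_seqs:
        ref |= kmers(s, k)
    return ref

def novelty_score(read: str, reference: Set[str], k: int = 21) -> float:
    """Fraction of the read's k-mers that are absent from the reference set."""
    ks = kmers(read, k)
    if not ks:
        return 0.0
    return sum(1 for km in ks if km not in reference) / len(ks)

if __name__ == "__main__":
    reference = build_reference(["ACGT" * 50], k=21)       # stand-in database
    known_read = ("ACGT" * 20)[:60]                        # resembles the reference
    odd_read = "TTTTTCCCCCGGGGGAAAAA" * 3                  # shares no k-mers with it
    for name, read in [("known", known_read), ("candidate-novel", odd_read)]:
        score = novelty_score(read, reference)
        print(name, round(score, 2), "flag" if score > 0.8 else "ok")
```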
so far, our descriptions have implied a single host interacting with a single invading pathogen: the war metaphor. it is not that the metaphor does not work in many cases. for instance, the eradication of smallpox was accomplished through an aggressive worldwide campaign, as was the animal and livestock disease rinderpest [ ] . it is that the war metaphor is an oversimplification with an often unrealizable aspiration for victory. consider that most hosts, including bacteria, have associated pathogens and symbiotic microbes. for example, a tropical grass, fungus, and virus have been found to have a three-way symbiosis that confers heat tolerance [ ] . when the virus is removed from the fungus, thermal tolerance is lost by the grass. if the virus is reintroduced to the fungus, thermal tolerance is conferred to the grass. viruses can also integrate their genes into animal genomes as part of the animal's nuclear dna, affecting inherited traits [ ] . virus infection is one of several naturally occurring changes in the nucleic acid of a genome. even simple nucleic acid transfers among different biological entities often have difficult to predict results. the complexity compounds as the number of entities increases and community behaviors emerge. for each of the approximately one trillion human cells in our bodies, there are about ten microbes. the microbes are distributed in different, highly specialized communities in the gut, mouth, skin, etc., collectively referred to as the microbiota [ ] , [ ] . given a global population over billion, there are approximately cells/microbes associated with people on the planet. for comparison, it has been estimated that there are cells of prochlorococcus bacteria in the ocean [ ] . equally impressive, there are viruses in the ocean causing roughly viral infections every second [ ] . as represented in fig. , whether grass, animals, or humans, the microbes, the communities, and their host community are sharing chemicals, energy, and nucleic acids. it is no surprise that clinical, animal, environmental, and other samples often have significant genetic complexity. information visualization tools like krona (fig. ) support analysis of complex metagenomic data [ ] .
fig. . engineering, process, and other controls allow important infectious disease experiments to be conducted safely. representative biosafety level (bsl) labs are shown. bsl- is for agents not known to cause disease in normal, healthy humans. bsl- is for moderate-risk agents that may cause disease of varying severity through ingestion or percutaneous or mucous membrane exposure. bsl- is for agents with a known potential for aerosol transmission and that may cause serious and potentially lethal infections. bsl- is for agents that pose a high individual risk of life-threatening disease resulting from exposure to infectious aerosols, or for agents where the risk of aerosol transmission is unknown. agents appropriate for this level have no vaccine available, and infection resulting from exposure has no treatment other than supportive care.
fig. . the global ecosystem is a highly interconnected network of hosts and environments, each with its own associated microbiome. nucleic acids, chemicals, and energy are shared within each environment as well as across the global network. new approaches are needed to visualize and understand the relationships among these environments to improve infectious disease management.
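as an assumed illustration of the krona-style visualization step mentioned above, taxonomic read counts from a metagenomic sample can be aggregated with pandas and written in the simple tab-separated form (a count followed by lineage levels) accepted by kronatools' ktImportText; the taxa and counts below are invented.

```python
# Write an invented taxonomic count table in a Krona-compatible text format
# (count<TAB>level1<TAB>level2<TAB>...), suitable for `ktImportText`.
import pandas as pd

counts = pd.DataFrame(
    [
        ("Bacteria", "Proteobacteria", "Escherichia", 1200),
        ("Bacteria", "Firmicutes", "Bacillus", 300),
        ("Viruses", "Coronaviridae", "Betacoronavirus", 45),
    ],
    columns=["level1", "level2", "level3", "reads"],
)

# Relative abundances, often a useful sanity check before plotting.
counts["fraction"] = counts["reads"] / counts["reads"].sum()
print(counts)

with open("sample_krona.txt", "w") as fh:
    for _, row in counts.iterrows():
        fh.write(f"{row['reads']}\t{row['level1']}\t{row['level2']}\t{row['level3']}\n")
# Then, with KronaTools installed: ktImportText sample_krona.txt -o sample.html
```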
these types of data can be used to support many applications including investigating food borne disease outbreaks [ ] . genomics helped explain the initiation of the ebola outbreak in sierra leone in [ ] . genomes from virus samples from patients provided evidence that the index case and associated thirteen initial cases in sierra leone had attended the funeral of a traditional healer that had been seeing ebola patients from nearby guinea. the sequence data also showed mutations in parts of the genome that might impact diagnostics, vaccines and therapies. even though the west africa ebola virus disease (evd) epidemic is by far the largest to date, fortunately ''the clinical course of infection and the transmissibility of the virus are similar to those in previous evd outbreaks'' [ ] . the goal is to be able to make these clinically relevant assessments from the virus genome prospectivelyvi.e., before the epidemic provides the observations. computational algorithms and tools are available to find signatures of existing pathogens as well as recognize patterns that help identify previously unidentified pathogens. moving forward, computational models for the interactions among the different genes are needed that provide a context for the correlations in data as well as improve predictive capabilities. for instance, going beyond detecting disease outbreaks to being able to assess and predict disease risks is desired. these much more complicated models offer the potential to accelerate disease understanding and the development and safe utilization of medical interventions at both the individual and the global scales. there are multiple initiatives underway that will lay the foundation and strengthen the underlying science. the foundation will enable new engineering approaches that will achieve the greater goals for infectious disease management. using an approach that parallels the evolution of electrical circuit design and fabrication, the biobricks approach is having increasing impact [ ] . perhaps most famous for its student genetic engineering competition , the biobricks approach of utilizing a library of well-characterized, interconnectable parts is powerful. just as early circuit designers benefited from standards for circuit fabrication, design, interfaces and modules, aspects of biology are amenable to these approaches. not only does the approach provide validation of reductionist concepts of each ''brick,'' but the innovative integration of parts also contributes to the characterization and future sharing of more complicated parts. just as circuit designers can incorporate existing designs or entire functional units like memory or analog to digital conversion, biologists are increasingly able to import others' designs and achieve significantly more functionality. for instance, a rewritable digital memory system has been demonstrated that writes and rewrites nucleic acid bases in a chromosome [ ] . just as dna sequencing brought a methods-based structure to pathogen detection, approaches like biobricks bring structure to biological circuits. the electronic circuit analogy continues with the opportunity for software to facilitate design and accelerate testing and evaluation. just as in our early example for influenza vaccine development, the components of the biobricks approach are amenable to spiral engineering for both performance and cost. 
building computer architectures and software that can implement ab initio test and evaluation calculations for infectious disease trials would require significant advances in algorithms, capabilities, and fundamental biochemical data. while exciting progress is being made in these areas, an alternative approach is to consider the network of interactions among fundamental building blocks and identify correlations with disease and health. this is the domain of systems biology and the building blocks can include dna, rna, and peptides. progress in systems biology is often paced by access to appropriately curated and calibrated datavnot just the underlying nucleic acid content but the associated metadata that describe the host and the relevant associated microbial communities. one of the multiple approaches to address data needed is the hundred person wellness project of the institute for systems biology. this project is measuring multiple indicators of health over a nine month period and will include lifestyle coaching as part of the study [ ] . if successful, the project already has plans to scale to a thousand and then thousand participants. there are similar concepts underway at organizations as diverse as google x and human longevity. even though these studies do not focus on infectious diseases, they are of significant value in helping to define the baselines for ''normal'' at different ages, genders, and many other variations that may affect health. infectious disease is a global issue that remains a significant cause of death and economic impact. engineers and engineering have many opportunities to help mitigate these diseases ranging from using cell tower power to help deliver vaccines to providing new network analysis software to identify novel viruses in an outbreak. the joint evolution of engineering and life sciences brings expanded availability and opportunity to understand and design living systems. these are attributes that will be needed to address the medical and infrastructure needs for effective global infectious disease management. however, the zeal to innovate needs to be mediated by a culture of responsibility [ ] where the benefits and the risks are considered. the ieee code of conduct [ ] is consistent with this ethos and addresses quality of life and privacy topics that have been discussed in this paper. engineers have an opportunity to provide innovative application of existing infrastructure to infectious disease management and to help nurture a global culture of responsibility in both healthcare and technical applications. h global health estimates summary tables: deaths by cause, age, and sex antimicrobial resistance: tackling a crisis for the health and wealth of nations enhancing the research, mitigating the effects ending the war metaphor: the changing agenda for unraveling the host-microbe relationship detection of the middle east respiratory syndrome coronavirus genome in air sample originating from a camel barn owned by an infected patient. 
mbio learning from sars: preparing for the next disease outbreak emerging infectious diseases: threats to human health and global stability available: http:// www.globalhealth.gov/global-health-topics/ global-health-security/ghsagenda.html fitch: engineering a global response to infectious diseases department of health and human services implementation plan for the national strategy for pandemic influenza hspd- : biodefense for the st century defense of united states agriculture and food food safety management tools global health risks: mortality and burden of disease attributable to selected major risks global agricultural report: sustainable pathways to sufficient nutritious and affordable food climate and infectious disease: use of remote sensing for detection of vibrio cholerae by indirect measurement prediction of a rift valley fever outbreak power from cellphone towers keeps vaccines cool outbreak detection algorithms for seasonal disease data: a case study using ross river virus disease. bmc medical informatics and decision making an epidemiological network model for disease outbreak detection evaluation of animal and public health surveillance systems: a systematic review. epidemiology infection report to congress: update on the adoption of health information technology and related efforts to facilitate the electronic use and exchange of health information the office of the national coordinator for health information technology, food and drug administration and federal communications commission building safer systems for better care global infectious disease surveillance and health intelligence: the development of effective, interconnected systems of infectious disease surveillance is essential to our survival nyc health alert network putting public health into practice: a model for assessing the relationship between local health departments and practicing physicians ( s ), pp. s -s interpreting the universal phylogenetic tree genomic engineering: moving beyond dna sequence to function the rise of a digital immune system. gigascience department of health and human services oie world organization for animal health a virus in a fungus in a plant: three-way symbiosis required for thermal tolerance endogenous viral elements in animal genomes human microbiota human microbiome project present and future global distributions of the marine cyanobacteria prochlorococcus and synechococcus marine virusesvmajor players in the global ecosystem interactive metagenomic visualization in a web browser whole-genome sequencing expected to revolutionize outbreak investigations. food safety news genomic surveillance elucidates ebola virus origin and transmission during the outbreak the who ebola response teamonline. ebola virus disease in west africavthe first months of the epidemic and forward projections rewrittable digital data storage in live cells via engineered control of recombination directionality medicine gets up close and personal guidance for enhancing personnel reliability and strengthening the culture of responsibility the author thanks dr. k. bernard for his suggestion to consider the broader utility of pandemic influenza plans to other diseases, dr. n. bergman for his valuable suggestions and comments, and ms. c. conrad for her expert assistance preparing the manuscript. 
key: cord- - hf q r authors: salierno, giulio; morvillo, sabatino; leonardi, letizia; cabri, giacomo title: an architecture for predictive maintenance of railway points based on big data analytics date: - - journal: advanced information systems engineering workshops doi: . / - - - - _ sha: doc_id: cord_uid: hf q r massive amounts of data produced by railway systems are a valuable resource to enable big data analytics. despite its richness, several challenges arise when dealing with the deployment of a big data architecture into a railway system. in this paper, we propose a four-layers big data architecture with the goal of establishing a data management policy to manage massive amounts of data produced by railway switch points and perform analytical tasks efficiently. an implementation of the architecture is given along with the realization of a long short-term memory prediction model for detecting failures on the italian railway line of milano - monza - chiasso. in recent years, big data analytics has gained relevant interest from both industries and academia thanks to its possibilities to open up new shapes of data analysis as well as its essential role in the decision-making processes of enterprises. different studies [ ] , highlight the importance of big data, among other sectors, for the railway industries. the insight offered by big data analytics covers different areas of the railway industry, including and not limited to maintenance, safety, operation, and customer satisfaction. in fact, according to the growing demand for railway transportation, the analysis of the huge amount of data produced by the railway world has a positive impact not only in the services offered to the customers but also for the railway providers. knowledge extracted from raw data enables railway operators to optimize the maintenance costs and enforce the safety and reliability of the railway infrastructure by the adoption of new analytical tools based on descriptive and predictive analysis. maintenance of railway lines encompasses different elements placed along the railway track, including but not limited to signals and points. our interest is mainly on predictive maintenance tasks that aim to build a variety of models with the scope of monitoring the health status of the points composing the line. typical predicting metrics of remaining useful life (rul) and time to failure (ttf) enable predictive maintenance by estimating healthy status of objects and replacing them before failures occur. despite unquestionable value of big data for the railway companies, according to [ ] big data analytics is not fully adopted by them, yet due to different aspects mainly related with the lack of understanding on how big data can be deployed into railway transportation systems and the lack of efficient collection and analysis of massive amount of data. the goal of our work is to design a big data architecture for enabling analytical tasks typical required by the railway industry as well as enabling an effective data management policy to allows end-users to manage huge amounts of data coming from railway lines efficiently. as already mentioned, we considered predictive maintenance as the main task of our architecture; hence to show the effectiveness of the proposed architecture, we use real data collected from points placed over the italian railway line (milano -monza -chiasso). the complexity of the considered system poses different challenges for enabling efficient management of the huge amount of data. 
the first challenge concerns the collection of the data given the heterogeneity of the data sources. multiple railway points produce distinct log files, which must be collected and processed efficiently. the second challenge is to deal with the data itself. data collected from the system must be stored as raw data without any modification to preserve the original data in case of necessity (e.g., in case of failures, further analyses require to analyze data at a higher level of granularity). at the same time, collected data must be processed and transformed to be useful for analysis thus, data must be pre-processed and aggregated before fitting models for analytics. finally, the data analysis performed by the end-users requires analytical models to perform predictive or descriptive analysis; thus, the architecture should enable model creation as well as graphical visualization of results. the paper is organized as follows. section describes similar works that discuss the design of big data architectures for railways systems. section describes the kinds of data produced by a railway system. sections and describe, respectively, the architectural design and its implementation. the sect. presents a real case scenario in which failure prediction is performed on real data. section draws some conclusions. to the best of our knowledge, few solutions take into consideration challenges arisen when deploying a big data architecture for railway systems. most works focus on theoretical frameworks where simulations produce results without experimenting with real data. moreover, researchers mainly focus on machine learning algorithms as well as analytical models, giving less importance to the fundamental tasks related to data management, ingestion processing, and storage. close to our work in [ ] , authors propose a cloud-based big data architecture for real-time analysis of data produced by on-board equipment of high-speed trains. however, the proposed architecture presents scalability issues since it is not possible to deploy large-scale computing clusters in high-speed trains; neither is it possible to deploy a fully cloud-based architecture due to bandwidth limitation of trains which make infeasible transferring huge amount of data to the cloud to perform real-time analysis. on the contrary, the scope of our work is to define a scalable big data architecture for enabling analytical tasks using railway data. for this reason and due to space limit, we have reported only work related to the railway scenario. in our work, we take into consideration the data log files produced by a railway interlocking system. a railway interlocking is a complex safety-critical system that ensures the establishment of routes for trains through a railway yard [ ] . the computer-based interlocking system guarantees that no critical-safety condition (i.e., a train circulate in a track occupied by another train) will arise during the train circulation. among other actions issued by the interlocking, before the route is composed, it checks the state of each point along the line. the interlocking system produces log files that store information about the command issued to the point as well as data about its behavior. commands are issued by the interlocking through smart boards, which in turn control the physical point on the line and collect data about their status. once data are collected, they are written into the data storage of the interlocking system as log files. 
these log files can be both structured and semi-structured data and contain diverse information about the behavior of the points upon the requests sent by the interlocking system. requests may vary according to the logic that must be executed to set up a route (e.g., a switch point is moved from the normal to the reverse position or vice versa) and this implies that the information contained in log files may vary. a complete railway line is controlled by multiple interlocking systems, which in turn produce different log files according to the points they control. the analytical task which motivates the design of our architecture is the prediction of failures. a failure may occur when a mechanical part of points has a break. this kind of failure propagates negatively on the entire railway traffic; therefore, its prediction is desirable. moreover, instead of doing maintenance when a failure occurs, it is also useful in particular, to estimate the rul of point in order to enable predictive maintenance by estimating if a points will fail or not in a certain time-frame. predicting failures of railway points requires to take into consideration the log files produced by the interlocking whenever a command is issued to a point. these log files are heterogeneous in type and contain different information resumed as: data and are considered to train and evaluate the proposed model to estimate the health status of the points, thus to estimate its rul (see sect. ). as an example, we report a sample collected from a railway switch point (fig. ). these samples contain three types of information: . referencecurve: is a sample curve representing the behavior of the point upon the command issued by the interlocking. this curve is used to derive the following ones; . prealarmsample: is a pre-threshold curve, computed by adding to the referencecurve an intermediate threshold value; . alarmsample: is the alarm curve computed as the previous one by adding an alarm threshold value. the big data architecture presented in this section covers all the fundamental stages of a big data pipeline [ ] of: . data acquisition by implementing ingestion tasks for collecting data from external sources. . information extraction and retrieval by processing ingested data and storing them in a raw format. . data integration, aggregation, and representation by data table view as well as data aggregation functionalities to produce new data for analytics. . modeling and analysis by providing a set of functionalities to build models to perform predictive analytics. . interpretation of results by graphical visualization of data. the architecture presented in fig. includes the following layers: storage layer is the layer responsible for implementing the data storage. it contains the storage platform to provide a distributed and fault-tolerant filesystem. in particular, this layer, should store data in their raw form. therefore datasets for analytic models will be originated from the upper layers. processing layer provides all tasks for data manipulation/transformation useful for the analytical layer. in particular, this layer presents a structured view of the data of the storage layer, allowing the creation of datasets by transforming raw data coming from the storage layer. the structured view of data is implemented through table views of the raw data. the transformation of original data is performed through aggregation functions provided by this layer. 
service layer: contains all components that provide analytics as a service to the end-users. this layer interacts with the processing layer in order to access data stored on the platform and to manipulate and transform data to fit analytical models. in addition, it provides: 1) data visualization functionalities for graphically displaying data and 2) model creation to perform analytical tasks. ingestion layer: this layer implements all the tasks for the ingestion of data from external sources. it is based on ingestion tools, which enable the definition of dataflows. a dataflow consists of a variable number of processes that transform data by creating flow files, which are moved from one processor to another through process relations. a relation is a connection between two processors that defines data flow policies among dataflow processes. our architecture is an implementation of the concept of a data lake [ ]. to avoid situations in which data stored in the platform become unusable due to challenges related to their complexity, size, variety, and lack of metadata, we adopt a mechanism of uri abstraction to simplify data access, thus establishing a data governance policy. as an example, at the storage layer the uri of a resource is simply its absolute path. to avoid defining multiple uris for each resource (since a resource can be used by multiple components at different architectural layers), we define a uri abstraction mechanism that simplifies access to resources stored in a distributed manner (where keeping track of the physical location of a resource can be tricky). the realuri refers to a resource as stored on the distributed filesystem, i.e., its physical location. a realuri is bound to a single virtualuri, which is in charge of abstracting the path details adopted by a particular implementation of the distributed filesystem. a presentationuri is an optional uri created whenever a component of the processing layer or service layer uses a resource stored on the filesystem. as an example, the uri abstractions defined for a single resource are reported in table . each resource is identified by 1) a smartboard id, 2) a channel number that controls a specific point, and 3) a point number that identifies the object on the line. these metadata are extracted from the data described in sect. through the tasks provided by the ingestion layer described in the next section. in addition, the virtualuri refers to the resource at the platform level, while the presentationuri represents a hive table view of the data created by the processing layer (see sect. ). we stress that while a resource can be assigned an unbounded number of presentationuris, depending on the type of components that consume the data, the virtualuri is mandatory and refers to a single realuri. the architecture has been implemented mainly using components of the hadoop stack. hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple commodity hardware. hadoop provides different components to implement a complete big data architecture. in particular, for this work we considered the following components (the ingestion process used to store data and the related metadata is shown in fig. ). storage layer: implemented using the hadoop filesystem, hdfs. hdfs is a fault-tolerant distributed filesystem that runs on a cluster, providing fault tolerance and high availability of the data.
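a minimal sketch of the uri abstraction just described, assuming a simple in-memory catalog; the identifiers and path layout are invented for illustration and do not reproduce the platform's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    # The identifiers below mirror the metadata extracted at ingestion (hypothetical values).
    smartboard_id: str
    channel: str
    point_id: str
    real_uri: str                                                # physical location on the filesystem
    presentation_uris: List[str] = field(default_factory=list)   # e.g., Hive table views (optional, unbounded)

    @property
    def virtual_uri(self) -> str:
        # One mandatory virtual URI per resource, independent of the physical path.
        return f"virtual://points/{self.smartboard_id}/{self.channel}/{self.point_id}"

class Catalog:
    """Minimal data-governance catalog: resolves a virtual URI to its single real URI."""
    def __init__(self):
        self._by_virtual = {}

    def register(self, res: Resource) -> None:
        self._by_virtual[res.virtual_uri] = res

    def resolve(self, virtual_uri: str) -> str:
        return self._by_virtual[virtual_uri].real_uri

# usage with made-up identifiers
cat = Catalog()
res = Resource("sb01", "ch3", "p117", "hdfs://cluster/raw/sb01/ch3/p117/")
cat.register(res)
print(cat.resolve(res.virtual_uri))
```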
hdfs stores raw data as ingested by the ingestion layer, as represented in fig. . in particular, the ingestion layer extracts data and metadata and aggregates the data into a specific folder on hdfs representing the data for a particular point. data representing point behavior (see sect. ) are stored in their original format as xml files; these data must therefore be processed and transformed to create new datasets, a task performed by the processing layer. processing layer: before data can be employed in an analysis, they must be transformed to fulfill the requirements of the analytical models. the processing layer implements all the tasks required to build datasets from raw data. this layer has been implemented through the specification of two components. the first component has the scope of processing raw xml files by extracting relevant features and aggregating them into csv files, thus producing new datasets. results are written back to hdfs in the folder of the original data provenance. this mechanism allows enriching the available data, producing aggregations of raw data as well as providing feature extraction functionalities to extract and aggregate features to be used by analytical models. to fulfill this task, a dataset builder processes raw data and extracts relevant features (the classes are reported in fig. ). this module extracts the features provided as input and aggregates them into csv files using an aggregation function (min, max, avg). results are written back to hdfs using the hdfs context to obtain the original data path. in addition, to enable analysis of the aggregated data, these files are imported into hive tables. hive is a data warehousing tool provided by the hadoop stack, which includes a sql-like language (hiveql) to query data. to import data into hive tables, we define a general schema matching the structure of the data points; the table schema is read from the aggregated csv created by the dataset builder. the designed hive tables store aggregated data containing the extracted features obtained from raw data. the aggregation extracts four features representing: 1) the timestamp at which the sample was collected, 2) the estimated time to complete the operation, 3) the average current, expressed in ma, drawn by the point, and 4) the average voltage in v. the latter three features are obtained by aggregating the single measurements contained in the original data. service layer: acts as a presentation layer. it implements all the tasks needed to build models for analytics as well as to graphically visualize data. these tasks are fulfilled by jupyter notebooks. notebooks are designed to support the workflow of scientific computing, from interactive exploration to publishing a detailed record of a computation. the code in a notebook is organized into cells, chunks which can be individually modified and run; the output from each cell appears directly below it and is stored as part of the document [ ]. in addition, the variety of languages supported by notebooks allows integrating different open-source tools for data analysis such as numpy, pandas, and matplotlib [ ]. these tools allow parsing data in a structured format and performing data manipulation and visualization through built-in libraries.
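the aggregation performed by the dataset builder and the hive import described above could look roughly like the following; the column names, sample values, file name, and hiveql schema are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical parsed measurements: one row per raw sample of a point command.
raw = pd.DataFrame({
    "operation_id": [1, 1, 1, 2, 2, 2],
    "timestamp":    pd.to_datetime(["2020-01-01 10:00:00"] * 3 + ["2020-01-01 11:00:00"] * 3),
    "current_ma":   [210.0, 230.0, 220.0, 250.0, 260.0, 240.0],
    "voltage_v":    [23.8, 24.1, 24.0, 23.5, 23.7, 23.9],
    "elapsed_s":    [1.0, 2.0, 3.2, 1.1, 2.1, 3.5],
})

# Aggregate single measurements into one record per operation: the four features
# described in the text (timestamp, time to complete, average current, average voltage).
dataset = raw.groupby("operation_id").agg(
    ts=("timestamp", "min"),
    time_to_complete_s=("elapsed_s", "max"),
    avg_current_ma=("current_ma", "mean"),
    avg_voltage_v=("voltage_v", "mean"),
).reset_index()

# Written back as CSV and later imported into a Hive table with a matching schema
# (illustrative HiveQL, not the project's actual DDL).
dataset.to_csv("point_p117_aggregated.csv", index=False)
hiveql = """
CREATE TABLE IF NOT EXISTS point_p117 (
  operation_id INT, ts TIMESTAMP,
  time_to_complete_s DOUBLE, avg_current_ma DOUBLE, avg_voltage_v DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
"""
```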
in addition, the data structure adopted by pandas, named dataframe, is widely used as the input format for a variety of analytical models offered by machine learning libraries such as scikit-learn and scipy. the ingestion layer has been realized through apache nifi, a dataflow system based on the concepts of flow-based programming. dataflows specify a path describing how data are extracted from external sources and stored on the platform. an example of a dataflow, which combines data coming from an external filesystem, is provided in fig. . the flow files created by the dataflow are then written to hdfs. in the reported example, the files read from a local filesystem are unpacked and then written into a specific folder on hdfs. this folder is created by extracting the meaningful information necessary to identify the smartboard the data come from as well as the point that produced the data. the proposed architecture has been employed for the collection and processing of data from the railway line milano-monza-chiasso. this line is composed of points forming the railway track, which are managed by multiple smart boards that collect data. in particular, data are collected from the points, producing roughly gb/month. the data produced by the system contain information about point status and the type of commands issued by the interlocking to move the points. the definition of a data management policy allows collecting, governing, and controlling raw data as well as enabling data analysis for end-users. the proposed platform has been deployed in a test environment using containerization technology (fig. ). we adopted two separate containers, which respectively implement the data storage and processing layers, and the ingestion layer in a separate environment. these containers communicate over a virtual network, which allows exchanging data in an isolated environment while exposing web services to access the platform and perform tasks. this deployment enables scalability of the architecture by moving containers onto a cluster. a cloudera pre-built image has been adopted as the container implementing the hadoop stack, while a separate container based on apache nifi performs the ingestion tasks. a docker image containing the proposed platform is made available for further testing. as an example to show the effectiveness of the proposed architecture, we report the creation of a long short-term memory (lstm) model for failure detection of a specific railway point along the railway line milano-monza-chiasso. lstm models are a special kind of recurrent neural network (rnn) widely employed by both academia and industry for failure prediction [ ]. a key aspect of rnns is their ability to store information, or cell state, for later use when making new predictions; this makes them particularly suitable for the analysis of temporal data such as sensor readings for detecting anomalies. for the considered scenario, we use sensor readings collected from a specific switch point positioned along the line. the data originating from the point include measurements of the power supplied to the object, the voltage, and the time of movement (to move from the normal to the reverse position or vice versa), of the types described in sect. and reported in fig. . once the data are ingested, we use the components of the processing layer to produce datasets useful for anomaly detection, using the techniques described in the previous sections.
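as an illustration of how the service-layer notebooks consume such datasets, the following notebook-style cell loads the hypothetical aggregated csv produced in the earlier sketch and plots one of the extracted features; file and column names are carried over from that sketch and are not the platform's actual names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Notebook-style exploration of an aggregated dataset produced by the processing layer.
df = pd.read_csv("point_p117_aggregated.csv", parse_dates=["ts"])

print(df.describe())          # quick sanity check of the extracted features

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(df["ts"], df["avg_current_ma"], marker="o")
ax.set_xlabel("time")
ax.set_ylabel("average current (mA)")
ax.set_title("switch point behaviour over time")
fig.tight_layout()
plt.show()
```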
the results of the aggregation process produce a dataset consisting of samples, which are used as input for training the model. the evaluation is performed using reference data, which represent the threshold values above which failures occur ( samples). examples of the features used for training the model are reported in the figures. the lstm autoencoder model used to make predictions learns a compressed representation of the input data and then learns to reconstruct it. the idea is to train the model on data that do not contain anomalies; the model will therefore likely be able to reconstruct healthy samples. we expect that, as long as the model processes healthy samples, its reconstruction error (representing the distance between the input and the reconstructed sample) remains low. whenever the model processes data outside the norm, such as those represented by the reference data consisting of threshold values, we expect an increase in the reconstruction error, as the model was not trained to reconstruct these kinds of data. therefore, we use the reconstruction error as an indicator for anomaly detection. the figure reports the anomalies detected by trying to predict the reference values. in particular, we see an increase of the reconstruction error on those values that exceed a threshold of . ; this threshold was obtained by computing the error loss on the training set. we identified this value as suitable for the point considered in the case study, but it varies according to the particular behavior of the object. for example, two objects with the same characteristics (e.g., switch points) placed in different railway network topologies may behave differently; therefore, they must be analyzed using different prediction models. this paper proposes a novel architecture for big data management and analysis of railway data. although big data attracts the railway industry, many challenges have to be faced to enable effective big data analysis of railway data. this work proposes a four-layer architecture for enabling data analytics on railway points. each layer is loosely coupled with the others, which enables the integration of diverse data processing components both intra-layer and inter-layer. to show the effectiveness of the proposed architecture, we reported the analysis of a railway switch point using predictive models for detecting failures. nevertheless, instead of being task-oriented, the proposed architecture integrates different data processing tools to perform diverse analytical tasks such as real-time data analysis. a data governance policy has been defined to deal with the variety and complexity of railway data, making them easily manageable at different granularity levels. a containerized deployment has been proposed to scale the architecture on a cluster, increasing its scalability and thus enabling parallel data processing. our architecture can also be extended according to the nature of the task to perform; it allows practitioners to extend architectural components to fulfil tasks not limited to failure prediction. moreover, in this work we did not consider a real-time scenario in which data must be analyzed using streaming techniques, but the flexibility of the architecture also allows such cases to be handled. as future work, the analytical layer will be extended with a comparison of different classes of predictive algorithms to measure their accuracy in diverse predictive maintenance tasks of a railway system.
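the reconstruction-error scheme described above can be sketched with a small lstm autoencoder; this is one common way to realize the idea (here with tf.keras), not the authors' exact model, and the window length, layer sizes, and placeholder training data are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

TIMESTEPS, FEATURES = 30, 3   # window length and feature count are illustrative

def build_autoencoder():
    # LSTM autoencoder: encode the window, repeat the code, decode it back.
    inp = layers.Input(shape=(TIMESTEPS, FEATURES))
    enc = layers.LSTM(32)(inp)
    dec = layers.RepeatVector(TIMESTEPS)(enc)
    dec = layers.LSTM(32, return_sequences=True)(dec)
    out = layers.TimeDistributed(layers.Dense(FEATURES))(dec)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="mae")
    return model

# x_train: healthy samples only, shape (n_windows, TIMESTEPS, FEATURES), scaled to [0, 1].
x_train = np.random.rand(200, TIMESTEPS, FEATURES)      # placeholder data
model = build_autoencoder()
model.fit(x_train, x_train, epochs=10, batch_size=32, verbose=0)

# Reconstruction error on the training set fixes the detection threshold.
train_err = np.mean(np.abs(model.predict(x_train) - x_train), axis=(1, 2))
threshold = train_err.mean() + 3 * train_err.std()

def is_anomalous(window):
    """Flag a window whose reconstruction error exceeds the learned threshold."""
    err = np.mean(np.abs(model.predict(window[np.newaxis]) - window))
    return err > threshold
```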
moreover, we aim to extend the scope of the architecture by monitoring other kinds of infrastructure, including but not limited to power grids and intelligent highway systems.
references (as cited in the text above):
geographical versus functional modelling by statecharts of interlocking systems, proceedings of the ninth international workshop on formal methods for industrial critical systems
managing data lakes in big data era: what's a data lake and why has it become popular in the data management ecosystem
recent applications of big data analytics in railway transportation systems: a survey
big data and its technical challenges
jupyter notebooks, a publishing format for reproducible computational workflows
a recurrent neural network architecture for failure prediction in deep drawing sensory time series data, procedia manufacturing
python data analytics: with pandas, numpy, and matplotlib
railway assets: a potential domain for big data analytics
a platform for fault diagnosis of high-speed train based on big data (project supported by the national natural science foundation, china, and the national high-tech program)
ifac symposium on advanced control of chemical processes (adchem)
key: cord- -zhmnfd w authors: straif-bourgeois, susanne; ratard, raoult title: infectious disease epidemiology date: journal: handbook of epidemiology doi: . / - - - - _ sha: doc_id: cord_uid: zhmnfd w
the following chapter intends to give the reader an overview of the current field of applied infectious disease epidemiology. prevention of disease by breaking the chain of transmission has traditionally been the main purpose of infectious disease epidemiology. while this goal remains the same, the picture of infectious diseases is changing. new pathogens are identified and already known disease agents are changing their behavior. the world population is aging; more people develop underlying disease conditions and are therefore more susceptible to certain infectious diseases or have long term sequelae after being infected. infectious diseases are no longer restricted to certain geographic areas because of the increasing numbers of world travelers and worldwide food distribution. the fear of a bioterrorist attack adds a new dimension to infectious disease epidemiology, and health departments enhance their surveillance systems for early detection of suspicious disease clusters and of agents used as weapons of mass destruction. improvements in laboratory techniques and mapping tools help to expand the knowledge of transmission of disease agents, and enhanced surveillance techniques are feasible as a result of software progress and reporting of diseases via secure internet sites. surveillance and outbreak investigations remain the major responsibilities of public health departments.
epidemiologic methods and principles are still the basis for these tasks but surveillance techniques and outbreak investigation are changing and adapting to improvements and the expanded knowledge. conducting surveys is a useful way to gather information on diseases where surveillance data or other data sources are not available, especially when dealing with emerging or re-emerging pathogens. program evaluation is an important tool to systematically evaluate the effectiveness of intervention or prevention programs for infectious diseases. infectious diseases are a major cause of human suffering in terms of both morbidity and mortality. in , out of an estimated total of million deaths, million were due to infectious diseases (who a,b) . the most common cause of infectious disease deaths were pneumonia ( million), diarrhea ( million) followed by tuberculosis, malaria, aids and hepatitis b. not surprisingly, there is a large imbalance in diseases between developing and industrialized countries (see table . ). morbidity due to infectious diseases is very common in spite of the progress accomplished in recent decades. even in industrialized countries, the prevalence of infection is very high for some infectious agents. serologic surveys found that by young adulthood the prevalence of antibodies was % against herpes simplex virus type , - % against type , % against human herpes virus, % against hepatitis a, % against hepatitis c, - % against hepatitis b, and % against chlamydia pneumoniae (american academy of pediatrics ; mandell et al. ) . annually, approximately , , episodes of diarrhea leading to , hospitalizations and deaths occur among adults in the united states (mounts et al. ). the center for disease control and prevention (cdc) estimates that each year million people in the us get sick, more than , are hospitalized and die as a result of foodborne illnesses (cdc ) . every year influenza circulates widely, infecting from % to % of the world population. the importance of infectious disease epidemiology for prevention it is often said that epidemiology is the basic science of preventive medicine. to prevent diseases it is important to understand the causative agents, risk factors and circumstances that lead to a specific disease. this is even more important for infectious disease prevention, since simple interventions may break the chain of transmission. preventing cardiovascular diseases or cancer is much more difficult because it usually requires multiple long term interventions requiring lifestyle changes and behavior modification, which are difficult to achieve. in , the american commission of yellow fever, headed by walter reed, was sent to cuba. the commission showed that the infective agent was transmitted by the mosquito aedes aegypti. this information was used by the then surgeon general of the us army william gorgas, to clean up the year old focus of yellow fever in havana by using mosquito proofing or oiling of the larval habitat, dusting houses with pyrethrum powder and isolating suspects under a mosquito net. this rapidly reduced the number of cases in havana from in to in (goodwin ) . a complete understanding of the causative agent and transmission is always useful but not absolutely necessary. the most famous example is that of john snow who was able to link cholera transmission to water contamination during the london cholera epidemic of by comparing the deaths from those households served by the southwark & vauxhall company versus those served by another water company. 
john snow further confirmed his hypothesis by the experiment of removing the broad street pump handle (wills a ). over the past three decades, more than new pathogens have been identified, some of them with global importance: bartonella henselae, borrelia burgdorferi, campylobacter, cryptosporidium, cyclospora, ebola virus, escherichia coli :h , ehrlichia, hantaan virus, helicobacter, hendra virus, hepatitis c and e, hiv, human herpesvirus and , human metapneumovirus, legionella, new variant creutzfeldt-jakob disease agent, nipah virus, parvovirus b , rotavirus, severe acute respiratory syndrome (sars) etc.. while there are specific causative agents for infectious diseases, these agents may undergo some changes over time. the last major outbreak of pneumonic plague in the world occurred in manchuria in . this scourge, which had decimated humans for centuries, is no longer a major threat. the plague bacillus cannot survive long outside its animal host (humans, rodents, fleas) because it lost the ability to complete the krebs cycle on its own. while it can only survive in its hosts, the plague bacillus also destroys its hosts rapidly. as long as susceptible hosts were abundant, plague did prosper. when environmental conditions became less favorable (lesser opportunities to sustain the host to host cycles), less virulent strains had a selective advantage (wills b) . the influenza virus is the best example of an agent able to undergo changes leading to renewed ability to infect populations that had been already infected and immune. the influenza virus is a single stranded rna virus with a lipophilic envelope. two important glycoproteins from the envelope are the hemagglutinin (ha) and neuraminidase (na). the ha protein is able to agglutinate red blood cells (hence its name). this protein is important as it is a major antigen for eliciting neutralizing antibodies. antigenic drift is a minor change in surface antigens that result from point mutations in a gene segment. antigenic drift may result in epidemics, since incomplete protection remains from past exposures to similar viruses. antigenic shift is a major change in one or both surface antigens (h and|or n) that occurs at varying intervals. antigenic shifts are probably due to genetic recombination (an exchange of a gene segment) between influenza a viruses, usually those that affect humans and birds. an antigenic shift may result in a worldwide pandemic if the virus can be efficiently transmitted from person to person. in the past three decades throughout the world, there has been a shift towards an increase in the population of individuals at high risk for infectious diseases. in industrialized nations, the increase in longevity leads to higher proportion of the elderly population who are more prone to acquiring infectious diseases and developing life threatening complications. for example, a west nile virus (wnv) infection is usually asymptomatic or causes a mild illness (west nile fever); rarely does it cause a severe neuro-invasive disease. in the epidemic of west nile in louisiana, the incidence of neuro-invasive disease increased progressively from . per , in the to age group to per , in the to year old age group and jumped to per , in the age group and older. mortality rates showed the same pattern, a gradual increase to . per , in the to age group with a sudden jump to per , for the oldest age group of and older. 
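age-specific incidence and mortality rates of the kind cited above for west nile neuro-invasive disease are straightforward to compute once case counts and population denominators are available; the numbers below are placeholders for illustration, not the louisiana surveillance figures.

```python
import pandas as pd

# Illustrative case counts, deaths, and population denominators by age group.
data = pd.DataFrame({
    "age_group":  ["0-24", "25-44", "45-64", "65+"],
    "cases":      [5, 12, 40, 75],
    "deaths":     [0, 1, 4, 15],
    "population": [1_500_000, 1_200_000, 1_000_000, 500_000],
})

# Rates per 100,000 population, the usual way such age patterns are reported.
data["incidence_per_100k"] = data["cases"]  / data["population"] * 100_000
data["mortality_per_100k"] = data["deaths"] / data["population"] * 100_000
print(data[["age_group", "incidence_per_100k", "mortality_per_100k"]])
```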
improvement in health care in industrialized nations has caused an increase in the number of immune-deficient individuals, be it cancer survivors, transplant patients or people on immuno-suppressive drugs for long term auto-immune diseases. some of the conditions that may increase susceptibility to infectious diseases are: cancers, particularly patients on chemo or radiotherapy, leukemia, lymphoma, hodgkin's disease, immune suppression (hiv infection), long term steroid use, liver disease, hemochromatosis, diabetes, alcoholism, chronic kidney disease and dialysis patients. for example persons with liver disease are times more likely to develop vibrio vulnificus infections than are persons without liver disease. some of these infections may be severe, leading to death. in developing countries a major shift in population susceptibility is associated with the high prevalence of immune deficiencies due to hiv infections and aids. in botswana which has a high prevalence of hiv (sentinel surveillance revealed hiv seroprevalence rates of % among women presenting for routine antenatal care), tuberculosis rates increased from per , in to per , in (lockman et al. ) while before the hiv|aids epidemics, rates above were very rare. changes in lifestyles have increased opportunities for the transmission of infectious disease agents in populations previously at low risk. intravascular drug injections have increased the transmission of agents present in blood and body fluids (e.g. hiv, hepatitis b and c). consumption of raw fish, shell fish and ethnic food expanded the area of distribution of some parasitic diseases. air travel allows people to be infected in a country and be half-way around the globe before becoming contagious. by the same token, insects and other vectors have become opportunistic global travelers. aedes albopictus, the asian tiger mosquito, was thus imported in to houston, texas inside japanese tires. subsequently, it has invaded us states. with the advent of nucleic acid tests, it has become possible to detect the presence of infectious disease agents in the air and environmental surfaces. for example, the use of air samplers and polymerase chain reaction analysis has shown that bordetella pertussis dna can be found in the air surrounding patients with b. pertussis infection, providing further evidence of airborne spread (aintablian et al. ) and thus leading to re-evaluate the precautions to be taken. however the presence of nucleic acids in an environmental medium does not automatically mean that transmission will occur. further studies are necessary to determine the significance of such findings. infectious disease agents, when used in bioterrorism events, have often been reengineered to have different physical properties and are used in quantities not usually experienced in natural events. there is little experience and knowledge about the human body's response to large doses of an infectious agent inhaled in aerosol particles that are able to be inhaled deep into lung alveolae. during the anthrax letter event, there was considerable discussion about incubation period, recommended duration of prophylaxis, and minimum infectious dose. this lack of knowledge base has led to confusion in recommendations being made. 
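the "times more likely" comparison mentioned above for vibrio vulnificus infection and liver disease corresponds to a risk ratio, which can be computed as below; the counts are invented for illustration and do not come from the studies cited in the text.

```python
# Risk ratio sketch with placeholder cohort counts:
# "exposed" = persons with liver disease, "unexposed" = persons without.
cases_exposed, n_exposed = 8, 10_000
cases_unexposed, n_unexposed = 10, 1_000_000

risk_exposed = cases_exposed / n_exposed
risk_unexposed = cases_unexposed / n_unexposed
risk_ratio = risk_exposed / risk_unexposed      # "x times more likely"
print(f"risk ratio: {risk_ratio:.0f}")
```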
although the basics of infectious disease epidemiology have not changed and the discipline remains strongly anchored on some basic principles, technological developments such as improved laboratory methods and enhanced use of informatics (such as advanced mapping tools, web based reporting systems and statistical analytical software) have greatly expanded the field of infectious disease epidemiology. molecular techniques are being used more and more as a means to analyze epidemiological relationships between microorganisms. hence the term molecular epidemiology refers to epidemiologic research studies made at the molecular level. the main microbial techniques used, target plasmids and chromosomes. more specifically, plasmid fingerprinting and plasmid restriction endonuclease (rea) digestion, chromosomal analysis including pulse field gel electrophoresis (pfge), restriction fragment length polymorphism (rflp), multi-locus sequence type (mlst) and spa typing to name a few of these techniques. polymerase chain reaction (pcr) is used to amplify the quantity of genomic material present in the specimen. real-time pcr detection of infectious agents is now possible in a few hours. these techniques are becoming more widely used, even in public health laboratories for routine investigations. it is beyond the scope of this text to describe these methods in more detail. applications of molecular epidemiology methods have completely changed the knowledge about infectious disease transmission for many microorganisms. the main application is within outbreak investigations. being able to characterize the nucleic acid of the microorganisms permits an understanding of how the different cases relate to each other. molecular epidemiology methods have clarified the controversy about the origin of tuberculosis cases: is it an endogenous (reactivation) or exogenous (reinfection) origin? endogenous origin postulates that mycobacterium tuberculosis can remain alive in the human host for a lifetime and can start multiplying and producing lesions. on the other hand exogenous origin theory postulates that reinfection plays a role in the development of tuberculosis. the immunity provided by the initial infection is not strong enough to prevent another exposure to mycobacterium tuberculosis and a new infection leads to disease. in countries with low tuberculosis transmission, for example the netherlands, most strains have unique rflp fingerprints. each infection is unique and there are hardly any clusters of infections resulting from a common source. most cases are the result of reactivation. this is in contrast with areas of high endemicity where long chains of transmission can be identified with few rflp fingerprinting patterns (alland et al. ) . in some areas, up to % of tuberculosis cases are the result of reinfection. numerous new immunoassays have been developed. they depend on an antigenantibody reaction, either using a test antibody to detect an antigen in the patient's specimen or using a test antigen to detect an antibody in the patient's specimen. an indicator system is used to show that the reaction has taken place and to quantify the amount of patient antigen or antibody. 
the indicator can be a radioactive molecule (radioimmunoassay [ria]), a fluorescent molecule (fluorescent immunoassay [fia]), a molecule with an attached enzyme that catalyzes a color reaction (enzyme-linked immunoassay [elisa or eia]), or a particle coated with antigen or antibody that produces an agglutination (latex particle agglutination [la] ). the reaction can be a simple antigen|antibody reaction or a "sandwich" immunoassay where the antigen is "captured" and a second "read out" antibody attaches to the captured antigen. the antibody used may be polyclonal (i.e. a mixture of immunoglobulin molecules secreted against a specific antigen, each recognizing a different epitope) or monoclonal (i.e. immunoglobulin molecules of single-epitope specificity that are secreted by a clone of b cells). it may be directed against an antigen on an epitope (i.e. a particular site within a macromolecule to which a specific antibody binds). plotting diseases on a map is one of the very basic methods epidemiologists do routinely. as early as john snow, suspecting water as a cause of cholera, plotted the cases of cholera in the districts of golden square, st. james and berwick, in london. the cases seemed to be centered around the broad street pump and less dense around other pumps. the map supplemented by other observations led to the experiment of removing the handle on the broad street pump and subsequent confirmation of his hypothesis (snow ) . geographical information systems (gis) have been a very useful tool in infectious disease research. gis are software programs allowing for integration of a data bank with spatial information. the mapping component includes physical layout of the land, towns, buildings, roads, administrative boundaries, zip codes etc. data may be linked to specific locations in the physical maps or to specific aggregates. a gis system includes tools for spatial analysis. climate, vegetation and other data may be obtained through remote sensing and combined with epidemiologic data to predict vector occurrence. however, these tools should be used with caution. they can be useful to generate hypotheses and identify possible associations between risk of disease and environmental exposures. because of potential bias, mapping should never be considered as more than an initial step in the investigation of an association. "the bright color palettes tend to silence a statistical conscience about fortuitous differences in the raw data" (boelaert et al. ) . for statistical methods in geographical epidemiology see chap. ii. of this handbook. web based reporting, use of computer programs and developments of sophisticated reporting and analytical software have revolutionized epidemiologic data collection and analysis. these tools have provided the ability to collect large amounts of data and handle large databases. however this has not been without risks. it remains crucial to understand the intricacies of data collected to avoid misinterpretation. for example, one should be aware that diseases and syndromes are initially coded by a person who may not be very software proficient, using shortcuts and otherwise could enter data of poor quality. what are the questions to be answered? too often one sees epidemiologists and statisticians preparing questionnaires, carrying out surveys, gathering surveillance information, processing data and producing reports, tables, charts and graphs in a routine fashion. epidemiology describes the distribution of health outcomes and determinants for a purpose. 
it is important to question the goals and objectives of all epidemiologic activities and tailor these activities to meet these objectives. the description of disease patterns includes analysis of demographic, geographical, social, seasonal and other risk factors. age groups to be used differ depending on the disease e.g. diseases affecting young children should have numerous age groups among children; sexually transmitted diseases require detailed age groups in late adolescence and early adulthood. younger age groups may be lumped together for diseases affecting mainly the elderly. gender categorization, while important for sexually transmitted diseases and other diseases with a large gender gap (such as tuberculosis), may not be important for numerous other diseases. geographical distribution is important to describe diseases linked to environmental conditions but may not be so useful for other diseases. surveillance, both active and passive, is the systematic collection of data pertaining to the occurrence of specific diseases, the analysis and interpretation of these data, and the dissemination of consolidated and processed information to contributors to the program and other interested persons (cdc b). in a passive surveillance system the surveillance agency has devised and put a system in place. after the placement, the recipient waits for the provider of care to report. passive case detection has been used for mortality and morbidity data for decades. it is almost universal. most countries have an epidemiology section in the health department that is charged with centralizing the data in a national disease surveillance system collecting mortality and morbidity data. in theory, a passive surveillance system provides a thorough coverage through space and time and gives a thorough representation of the situation. practically, compliance with reporting is often irregular and incomplete. in fact, the main flaws in passive case detection are incomplete reporting and inconsistencies in case definitions. the main advantages are the low cost of such a program and the sustained collection of data over decades. the purpose is to produce routine descriptive data on communicable diseases, generate hypotheses and prompt more elaborate epidemiologic studies designed to evaluate prevention activities. some conditions must be met to maximize compliance with reporting: . make reporting easy: provide easy to consult lists of reportable diseases, provide pre-stamped cards for reporting, provide telephone or fax reporting facilities. . do not require extensive information: name, age, sex, residence, diagnosis. some diseases may include data on exposure, symptoms, method of diagnosis etc. . maintain confidentiality and assure reporters that confidentiality will be respected. . convince reporters that reporting is essential: provide feedback; show how the data are used for better prevention. confidentiality of data is essential, particularly for those reporting health care providers who are subject to very strict confidentiality laws. any suspicion of failure of maintaining secure data would rapidly ruin a passive surveillance program. in an active surveillance system, the recipient will actually take some action to identify the cases. in an active surveillance program, the public health agency organizes a system by searching for cases or maintaining a periodic contact with providers. regular contacting boosts the compliance of the providers. 
providers are health agencies but also as in passive case detection, there may be day care centers, schools, long term care facilities, summer camps, resorts, and even public involvement. the agency takes the step to contact the health providers (all of them or a carefully selected sample) and requests reports from them at regular intervals. thus no reports are missing. active surveillance has several advantages: it allows the collection of more information. a provider sees that the recipient agency is more committed to surveillance and is therefore more willing to invest more time her|himself. it allows direct communication and opportunities to clarify definitions or any other problems that may have arisen. active surveillance provides much better, more uniform data than passive case detection but active case detection is much more expensive (see tables . and . ). active surveillance systems are usually designed when a passive system is deemed insufficient to accomplish the goals of disease monitoring. this type of surveillance is reserved for special programs, usually when it is important to identify every single case of a disease. active surveillance is implemented in the final phases of an eradication program: smallpox eradication, poliomyelitis eradication, guinea worm eradication and malaria eradication in some countries. active surveillance is also the best approach in epidemic or outbreak investigations to elicit all cases. in the smallpox eradication program, survey agents visited providers, asking about suspected cases and actually investigating each suspected case. in polio eradication programs, all cases of acute flaccid paralysis are investigated. in malaria eradication programs and some malaria control programs, malaria control agents go from house to house asking who has fever or had fever recently (in the past week or month for example). a blood smear is collected from those with fever. a case register is a complete list of all the cases of a particular disease in a definite area over a certain time period. registers are used to collect data on infections over long periods of time. registers should be population based, detailed and complete. a register will show an unduplicated count of cases. they are especially useful for long term diseases, diseases that may relapse or recur and diseases for which the same cases will consult several providers and therefore would be reported on more than one occasion. case registers contain identifiers, locating information, disease, treatment, outcome and follow-up information as well as contact management information. they are an excellent source of information for epidemiologic studies. in disease control, case registers are indispensable tools for follow up of chronic infections disease such as tuberculosis and leprosy. the contents and quality of a case register determine its usefulness. 
it should contain patient identifiers with names (all names), age, sex, place and date of birth, complete address with directions on how to reach the patient, name and address of a "stable" relative that knows the patient's whereabouts, diagnosis information with disease classification, brief clinical description (short categories are better than detailed descriptions), degree of infectiousness (bacteriological, serological results), circumstances of detection, initial treatment and response with specific dose, notes on compliance, side effects, clinical response, follow-up information with clinical response, treatment regimen, compliance, side effects, locating information; for some diseases contact information is also useful. updating a register is a difficult task. it requires cooperation from numerous persons. care must be taken to maintain the quality of data. it is important to only request pertinent information for program evaluation or information that would remind users to collect data or to perform an exam. for example, if compliance is often a neglected issue, include a question on compliance. further details concerning the use of registries in general are given in chap. i. of this handbook. sentinel disease surveillance for sentinel disease surveillance, only a sample of health providers is used. the sample is selected according to the objectives of the surveillance program. providers most likely to serve the population affected by the infection are selected, for example child health clinics and pediatricians should be selected for surveillance of childhood diseases. a sentinel system allows cost reduction and is combined with active surveillance. a typical surveillance program for influenza infections includes a selected numbers of general practitioners who are called every week to obtain the number of cases presented to them. this program may include the collection of samples for viral cultures or other diagnostic techniques. such a level of surveillance would be impossible to maintain on the national level. surveillance systems are evaluated on the following considerations (cdc b): usefulness: some surveillance systems are routine programs that collect data and publish results; however it appears that they have no useful purpose -no conclusions are reached, no recommendations are made. a successful surveillance system would provide information used for preventive purposes. sensitivity or the ability to identify every single case of disease is particularly important for outbreak investigations and eradication programs. predictive value positive (pvp) is the proportion of reported cases that actually have the health-related event under surveillance. low pvp values mean that non-cases might be investigated, outbreaks may be exaggerated or pseudo outbreaks may even be investigated. misclassification of cases may corrupt the etiologic investigations and lead to erroneous conclusions. unnecessary interventions and undue concern in the population under surveillance may result. representativeness ensures that the occurrence and distribution of cases accurately represent the real situation in the population. simplicity is essential to gain acceptance, particularly when relying on outside sources for reporting. flexibility is necessary to adapt to changes in epidemiologic patterns, laboratory methodology, operating conditions, funding or reporting sources. data quality is evaluated by the data completeness (blank or unknown variable values) and validity of data recorded (cf. chap. i. 
of this handbook). acceptability is shown in the participation of providers in the system. timeliness is more important in surveillance of epidemics. stability refers to the reliability (i.e., the ability to collect, manage and provide data properly without failure) and availability (the ability to be operational when it is needed) of the public health surveillance system. the major elements of a surveillance system as summarized by who are: mortality registration, morbidity reporting, epidemic reporting, laboratory investigations, individual case investigations, epidemic field investigations, surveys, animal reservoir and vector distribution studies, biologics and drug utilization, knowledge of the population and the environment. traditional surveillance methods rely on counting deaths and cases of diseases. however, these data represent only a small part of the global picture of infectious disease problems. mortality registration was one of the first elements of surveillance implemented. the earliest quantitative data available on infectious disease is about mortality. the evolution of tuberculosis in the us for example, can only be traced through its mortality. mortality data are influenced by the occurrence of disease but also by the availability and efficacy of treatment. thus mortality cannot always be used to evaluate the trend of disease occurrence. reporting of infectious diseases is one of the most common requirements around the world. a list of notifiable diseases is established on a national or regional level. the numbers of conditions vary; it ranges usually from to conditions. in general, a law requires that health facility staff, particularly physicians and laboratories, report these conditions with guaranteed confidentiality. it is also useful to have other non-health related entities report suspected communicable diseases such as day care centers, schools, restaurants, long term care facilities, summer camps and resorts. regulations on mandatory reporting are often difficult to enforce. voluntary compliance by the institution's personnel is necessary. reporting may be done in writing, by phone or electronically in the most advanced system. since most infectious diseases are confirmed by a laboratory test, reporting by the laboratory may be more reliable. the advantage of laboratory reporting is the ability to computerize the reporting system. computer programs may be set up to automatically report a defined set of tests and results. for some infectious diseases, only clinical diagnoses are made. these syndromes may be the consequences of a large number of different microorganisms for which laboratory confirmation is impractical. when public or physician attention is directed at a specific disease, reporting may be biased. when there is an epidemic or when the press focuses on a particular disease, patients are more prone to look for medical care and physicians are more likely to report. reporting rates were evaluated in several studies. in the us, studies show report rates of % for viral hepatitis, hemophilus influenzae %, meningococcal meningitis % and shigellosis %. it is important to have a standardized set of definitions available to providers. without standardized definitions, a surveillance system may be counting different entities from one provider to another. the variability may be such that the epidemiologic information obtained is meaningless. 
most case definitions in infectious disease epidemiology are based on laboratory tests, however some clinical syndromes such as toxic shock syndrome do not have confirmatory laboratory tests. most case definitions include a brief clinical description useful to differentiate active disease from colonization or asymptomatic infection. some diseases are diagnosed based on epidemiologic data. as a result many case definitions for childhood vaccine preventable diseases and foodborne diseases include epidemiologic criteria (e.g., exposure to probable or confirmed cases of disease or to a point source of infection). in some instances, the anatomic site of infection may be important; for example, respiratory diphtheria is notifiable, whereas cutaneous diphtheria is not (cdc ) . cases are classified as a confirmed case, a probable or a suspected case. an epidemiologically linked case is a case in which ) the patient has had contact with one or more persons who either have|had the disease or have been exposed to a point source of infection (including confirmed cases) and ) transmission of the agent by the usual modes is plausible. a case may be considered epidemiologically linked to a laboratory-confirmed case if at least one case in the chain of transmission is laboratory confirmed. probable cases have specified laboratory results that are consistent with the diagnosis yet do not meet the criteria for laboratory confirmation. suspected cases are usually cases missing some important information in order to be classified as a probable or confirmed case. case definitions are not diagnoses. the usefulness of public health surveillance data depends on its uniformity, simplicity and timeliness. case definitions establish uniform criteria for disease reporting and should not be used as the sole criteria for establishing clinical diagnoses, determining the standard of care necessary for a particular patient, setting guidelines for quality assurance, or providing standards for reimbursement. use of additional clinical, epidemiological and laboratory data may enable a physician to diagnose a disease even though the formal surveillance case definition may not be met. surveillance programs collect data on the overt cases diagnosed by the health care system. however these cases may not be the most important links in the chain of transmission. cases reported are only the tip of the iceberg. they may not at all be representative of the true endemicity of an infectious disease. there is a continuous process leading to an infectious disease: exposed, colonized, incubating, sick, clinical form, convalescing, cured. even among those who have overt disease there are several disease stages that may not be included in a surveillance system: some have symptoms but do not seek medical attention some do get medical attention but do not get diagnosed or get misdiagnosed some get diagnosed but do not get reported cases reported cases diagnosed but not reported cases who seek medical attention but were not diagnosed cases who were symptomatic but did not seek medical attention cases who were not symptomatic infectious disease cases play different roles in the epidemiology of an infectious disease; some individuals are the indicators (most symptomatic), some are the reservoir of microorganisms (usually asymptomatic, not very sick), some are amplifiers (responsible for most of the transmission), some are the victims (those who develop severe long term complications). 
depending on the specific disease and the purpose of the surveillance program, different disease stages should be reported. for example in a program to prevent rabies in humans exposure to a suspect rabid animal (usually a bite) needs to be reported. at the stage where the case is a suspect, prevention will no longer be effective. for bioterrorism events, reporting of suspects is of paramount importance to minimize consequences. waiting for confirmation causes too long of a delay. in the time necessary to confirm cases, opportunities to prevent co-infections may be lost and secondary cases may already be incubating, depending on the transmissibility of the disease. surveillance for west nile viral infections best rests on the reporting of neuroinvasive disease. case reports of neuro invasive diseases are a better indicator than west nile infection or west nile fever cases that are often benign, go undiagnosed and are reported haphazardly. for gonorrhea, young males are the indicators because of the intensity of symptoms. young females are the main reservoir because of the high proportion of asymptomatic infections. females of reproductive age are the victims because of pelvic invasive disease (pid) and sterility. a surveillance program for hepatitis b that only would include symptomatic cases of hepatitis b could be misleading. a country with high transmission of hepatitis b from mother to children would have a large proportion of infected newborn becoming asymptomatic carriers and a major source of infection during their lifetime. typically in countries with poor reporting of symptomatic hepatitis, the reporting of acute cases of hepatitis b would be extremely low in spite of high endemicity which would result in high rates of chronic hepatitis and hepatic carcinoma. most morbidity reporting collects data about individual cases. reporting of individual cases includes demographic and risk factor data which are analyzed for descriptive epidemiology and for implementation of preventive actions. for example, any investigation leading to contact identification and prophylaxis requires a start from individual cases. however, identification of individuals may be unnecessary and aggregate data sufficient for some specific epidemiologic purposes. monitoring an influenza epidemic for example, can be done with aggregate data. obtaining individual case information would be impractical since it would be too time consuming to collect detailed demographics on such a large number of cases. aggregate data from sentinel sites consists of a number of influenza-like illnesses by age group and the total number of consultants or the total number of 'participants' to be used as denominators. such data is useful to identify trends and determine the extent of the epidemic and geographic distribution. collection of aggregate data of the proportion of school children by age group and sex is a useful predictive tool to identify urinary schistosomiasis endemic areas (lengeler et al. ) without having to collect data on individual school children. epidemics of severe diseases are almost always reported. this is not the case for epidemics of milder diseases such as rashes or diarrheal diseases. many countries do not want to report an outbreak of disease that would cast a negative light on the countries. for example, many countries that are tourism dependent do not report cholera or plague cases. some countries did not report aids cases for a long time. 
case investigations are usually not undertaken for individual cases unless the disease is of major importance such as hemorrhagic fever, polio, rabies, yellow fever, any disease that has been eradicated and any disease that is usually not endemic in the area. outbreaks or changes in the distribution pattern of infectious diseases should be investigated and these investigations should be compiled in a comprehensive system to detect trends. while the total number of infectious diseases may remain the same, changes may occur in the distribution of cases from sporadic to focal outbreaks. for example the distribution of wnv cases in louisiana shifted from mostly focal outbreaks the first year the west nile virus arrived in the state in , to mostly sporadic cases the following year in (see fig. . ) . surveys are a very commonly used tool in public health, particularly in developing countries where routine surveillance is often inadequate (cf. chap. iv. of this handbook). survey data needs to be part of a comprehensive surveillance database. one will acquire a better picture from one or a series of well constructed surveys than from poorly collected surveillance data. surveys are used in control programs designed to control major endemic diseases: spleen and parasite surveys for malaria, parasite in urine and stools for schistosomiasis, clinical surveys for leprosy or guinea-worm disease and skin test surveys for tuberculosis. surveillance of microbial strains is designed to monitor, through active laboratory based surveillance, the bacterial and viral strains isolated. examples of these systems are: in the us, the pulsenet program is a network of public health laboratories that performs dna fingerprinting of bacteria causing foodborne illnesses (swaminathan et al. ). molecular sub-typing methods must be standardized to allow comparisons of strains and the building of a meaningful data bank. the method used in pulsenet is pulse field gel electrophoresis (pfge). the use of standardized subtyping methods has allowed isolates to be compared from different parts of the country, enabling recognition of nationwide outbreaks attributable to a common source of infection, particularly those in which cases are geographically separated. the us national antimicrobial resistance monitoring system (narms) for enteric bacteria is a collaboration between cdc, participating state and local health departments and the us food and drug administration (fda) to monitor antimicrobial resistance among foodborne enteric bacteria isolated from humans. narms data are also used to provide platforms for additional studies including field investigations and molecular characterization of resistance determinants and to guide efforts to mitigate antimicrobial resistance (cdc ) . monitoring of antimicrobial resistance is routinely done by requiring laboratories to either submit all, or a sample of their bacterial isolates. surveillance for zoonotic diseases should start at the animal level, thus providing early warning for impending increases of diseases in the animal population. rabies surveillance aims at identifying the main species of animals infected in an area, the incidence of disease in the wild animals and the prevalence of infection in the asymptomatic reservoir (bats). this information will guide preventive decisions made when human exposures do occur. 
malaria control entomologic activities must be guided by surveillance of the anopheles population, biting activities, and plasmodium infection rates in the anopheles population. surveillance for dead birds, infection rates in wild birds, infection in sentinel chickens and horse encephalitis are all part of west nile encephalitis surveillance. these methods provide an early warning system for human infections. the worldwide surveillance for influenza is the best example of the usefulness of monitoring animals prior to spread of infection in the human population. influenza surveillance programs aim to rapidly obtain new circulating strains to make timely recommendations about the composition of the next vaccine. the worldwide surveillance priority is given to the establishment of regular surveillance and investigation of outbreaks of influenza in the most densely populated cities in key locations, particularly in tropical or other regions where urban markets provide opportunities for contacts between humans and live animals (snacken et al. ). the rationale for selecting infectious diseases and an appropriate surveillance method is based on the goal of the preventive program. outbreaks of acute infectious diseases are common and investigations of these outbreaks are an important task for public health professionals, especially epidemiologists. in , a total of foodborne outbreaks with , cases involved were reported in the us (cdc ) with norovirus being the most common confirmed etiologic agent associated with these outbreaks (see table . ). outbreaks or epidemics are defined as the number of disease cases above what is normally expected in the area for a given time period. depending on the disease, it is not always known if the case numbers are really higher than expected and some outbreak investigations can reveal that the reported case numbers did not actually increase. the nature of a disease outbreak depends on a variety of circumstances, most importantly the suspected etiologic agent involved, the disease severity or case fatality rate, the population groups affected, media pressure, political interference and investigative progress. there are certain common steps for outbreak investigations as shown in table . . however, the chronology and priorities assigned to each phase of the investigation have to be decided individually, based on the circumstances of the suspected outbreak and the information available at the time. another way to detect an increase of cases is if the surveillance system of reportable infectious diseases reveals an unusually high number of people with the same diagnosis over a certain time period at different health care facilities. outbreaks of benign diseases like self-limited diarrhea are often not detected because people are not seeking medical attention and therefore medical services are not aware of them. furthermore, early stages of a disease outbreak are often undetected because single cases are diagnosed sporadically. it is not until a certain threshold is passed that it becomes clear that these cases are related to each other through a common exposure or secondary transmission. depending on the infectious disease agent, there can be a sharp or a gradual increase in the number of cases. it is sometimes difficult to differentiate between sporadic cases and the early phase of an outbreak. in the st.
louis encephalitis (sle) outbreak in louisiana, the number of sle cases increased from to between week one and two and then the numbers gradually decreased over the next weeks to a total of cases (jones et al. ) . . after the initial report is received, it is important to collect and document basic information: contact information of persons affected, a good and thorough event description, names and diagnosis of hospitalized persons (and depending on the presumptive diagnosis their underlying conditions and travel history), laboratory test results and other useful information to get a complete picture and to confirm the initial story of the suspected outbreak. it also might be necessary to collect additional specimens such as food items and stool samples for further laboratory testing. . based on the collected information the decision to investigate must be made. it may not be worthwhile to start an investigation if there are only a few people who fully recovered after a couple of episodes of a self-limited, benign diarrhea. other reasons not to investigate might be that this type of outbreak occurs regularly every summer or that it is only an increase in the number of reported cases which are not related to each other. on the other hand, there should be no delay in starting an investigation if there is an opportunity to prevent more cases or the potential to identify a system failure which can be caused, for example, by poor food preparation in a restaurant or poor infection control practices in a hospital, or to prevent future outbreaks by acquiring more knowledge of the epidemiology of the agent involved. additional reasons to investigate include the interest of the media, politicians and the public in the disease cluster and the pressure to provide media updates on a regular basis. another fact to consider is that outbreak investigations are good training opportunities for newly hired epidemiologists. sometimes a lack of data and a lack of sufficient background information make it difficult to decide early on if there is an outbreak or not. the best approach then is to assume that it is an outbreak until proven otherwise. . prevention of more cases is the most important goal in outbreak investigations and therefore a rapid evaluation of the situation is necessary. if there are precautionary measures to be recommended to minimize the impact of the outbreak and the spread to more persons, they should be implemented before a thorough investigation is completed. the control measures most likely to be implemented by public health professionals in foodborne outbreaks are: recall or destruction of contaminated food items, restriction of infected food handlers from food preparation, and correction of any deficiency in food preparation or conservation. . after taking immediate control measures, the next step is to know more about the epidemiology of the suspected agent. the most popular books for public health professionals include the "red book" (american academy of pediatrics ), the "control of communicable diseases manual" from the american public health association (apha ) or other infectious disease epidemiology books as well as the cdc website (www.cdc.gov). if the disease of interest is a reportable disease or a disease where surveillance data are available, baseline incidence rates can be calculated. then a comparison is made to determine if the reported numbers constitute a real increase or not.
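the baseline comparison described above can be sketched as follows; this assumes one simple aberration rule (historical mean plus two standard deviations) with hypothetical counts, and is only an illustration rather than a method prescribed by the chapter.

```python
# illustrative sketch (an assumption, not the chapter's prescribed method):
# comparing the current weekly case count against a historical baseline using a
# simple mean + 2 standard deviation threshold. counts below are hypothetical.
from statistics import mean, stdev

historical_weekly_counts = [4, 6, 5, 3, 7, 5, 4, 6, 5, 4]   # same weeks, prior years
current_week_count = 14

baseline = mean(historical_weekly_counts)
threshold = baseline + 2 * stdev(historical_weekly_counts)

if current_week_count > threshold:
    print(f"{current_week_count} cases exceeds threshold {threshold:.1f}: possible outbreak, investigate")
else:
    print(f"{current_week_count} cases is within the expected range (threshold {threshold:.1f})")
```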
furthermore, the seasonal and geographical distribution of the disease is important as well as the knowledge of risk factors. many infectious diseases show a seasonal pattern, such as rotavirus or neisseria meningitidis. for example, in suspected outbreaks where cases are associated with raw oyster consumption, the investigator should know that in the us gulf states vibrio cases increase in the summer months because the water conditions are optimal for the growth of the bacteria in water and in seafood. this kind of information will help to determine if the case numbers show a true increase and if it seems likely to be a real outbreak. . for certain diseases, numbers are not important. depending on the severity of the disease, its transmissibility and its natural occurrence, certain diseases should raise a red flag for every health care professional and even a single case should warrant a thorough public health investigation. for example, a single confirmed case of a rabid dog in a city (potential dog to dog transmission within a highly populated area), a case of dengue hemorrhagic fever or a presumptive case of smallpox would immediately trigger an outbreak investigation. . sometimes an increase of case numbers is artificial and not due to a real outbreak. in order to differentiate between an artificial and a natural increase in numbers, the following changes have to be taken into consideration: alterations in the surveillance system, a new physician who is interested in the disease and therefore more likely to diagnose or report the disease, a new health officer strengthening the importance of reporting, new procedures in reporting (from paper to web based reporting), enhanced awareness or publicity of a certain disease that might lead to increased laboratory testing, new diagnostic tests, a new laboratory, an increase in the susceptible population such as a new summer camp. . it is important to be sure that reported cases of a disease actually have the correct diagnosis and are not misdiagnosed. is there assurance that all the cases have the same diagnosis? is the diagnosis verified and were other differential diagnoses excluded? in order to be correct, epidemiologists have to know the basis for the diagnosis. are laboratory samples sufficient? if not, what kind of specimens should be collected to ascertain the diagnosis? what are the clinical signs and symptoms of the patient? in an outbreak of restaurant associated botulism in canada only the th case was correctly diagnosed. the slow progression of symptoms and misdiagnosis of the dispersed cases made it very difficult to link these cases and identify the source of the outbreak (cdc , ). . the purpose of a case definition is to standardize the identification and counting of the number of cases. the case definition is a standard set of criteria and is not a clinical diagnosis. in most outbreaks the case definition has components of person, place and time, such as the following: persons with symptoms of x and y after eating at the restaurant z between date and date . the case definition should be broad enough to capture most of the true cases, but not so narrow that true cases are misclassified as controls. a good method is to analyze the data, identify the frequency of symptoms and include symptoms that are more reliable than others. for example, diarrhea and vomiting are more specific than nausea and headache in the case definition of a food-related illness. . what kind of information needs to be collected?
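to make the case definition described above concrete, the following sketch applies a person-place-time definition to a small line list; the field names, symptoms and dates are hypothetical and serve only as an illustration.

```python
# illustrative sketch (hypothetical field names and dates): applying a case
# definition with person, place and time components to a line list, so that
# cases are identified and counted in a standardized way.
from datetime import date

line_list = [
    {"id": 1, "symptoms": {"diarrhea", "vomiting"}, "ate_at": "restaurant z", "onset": date(2004, 6, 12)},
    {"id": 2, "symptoms": {"headache"},             "ate_at": "restaurant z", "onset": date(2004, 6, 13)},
    {"id": 3, "symptoms": {"diarrhea"},             "ate_at": "other",        "onset": date(2004, 6, 12)},
]

def meets_case_definition(record,
                          required_symptoms=frozenset({"diarrhea"}),          # person: clinical criteria
                          place="restaurant z",                               # place
                          window=(date(2004, 6, 10), date(2004, 6, 20))):     # time
    """return True if the record satisfies all components of the case definition."""
    return (required_symptoms <= record["symptoms"]
            and record["ate_at"] == place
            and window[0] <= record["onset"] <= window[1])

cases = [r for r in line_list if meets_case_definition(r)]
print(f"{len(cases)} record(s) meet the case definition:", [r["id"] for r in cases])
```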
it is sufficient to have a simple database with basic demographic information such as name, age, sex and information for contacting the patient. more often, date of reporting and date of onset of symptoms are also important. depending on the outbreak and the potential exposure or transmission of the agent involved further variables such as school, grade of student or occupation in adults might be interesting and valuable. . during an outbreak investigation it is important to identify additional cases that may not have been known or were not reported. there are several approaches: interview known cases and ask them if they know of any other friends or family members with the same signs or symptoms, obtain a mailing list of frequent customers in an event where a restaurant is involved, set up an active surveillance with physicians or emergency departments, call laboratories and ask for reports of suspected and confirmed cases. another possibility is to review surveillance databases or to establish enhanced surveillance for prospective cases. occasionally it might be worthwhile to include the media for finding additional cases through press releases. however the utility of that technique depends on the outbreak and the etiologic agent; the investigator should always do a benefit risk analysis before involving the media. . after finding additional cases, entering them in the database and organizing them, the investigator should try to get a better understanding of the situation by performing some basic descriptive epidemiology techniques such as sorting the data by time, place and person. for a better visualization of the data, an epidemic or "epi" curve should be graphed. the curve shows the number of cases by date or time of onset of symptoms. this helps to understand the nature and dynamic of the outbreak as well as to get a better understanding of the incubation period if the time of exposure is known. it also helps to determine whether the outbreak had a single exposure and no secondary transmission (single peak) or if there is a continuous source and ongoing transmission. figures . and . show "epi" curves of two different outbreaks: a foodborne outbreak in a school in louisiana, and the number of wnv human cases in louisiana in the outbreak, respectively. sometimes it is useful to plot the cases on a map to get a better idea of the nature and the source of an outbreak. mapping may be useful to track the spread by water (see john snow's cholera map) or by air or even a person to person transmission. if a contaminated food item was the culprit, food distribution routes with new cases identified may be helpful. maps, however, should be taken with caution and carefully interpreted. for example, wnv cases are normally mapped by residency but do not take into account that people might have been exposed or bitten by an infective mosquito far away from where they live. for outbreak investigations, spot maps are usually more useful than rate maps or maps of aggregate data. depending on the outbreak it might be useful to characterize the outbreak by persons' demographics such as age, sex, address and occupation or health status. are the cases at increased susceptibility or at high risk of infection? these kinds of variables might give the investigator a good idea if the exposure is not yet known. for typical foodborne outbreaks however, demographic information is not very useful because the attack rates will be independent of age and sex. 
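as a minimal illustration of the epi curve discussed above, the following sketch tallies cases by date of symptom onset and renders the counts as a text bar chart; the onset dates are hypothetical.

```python
# illustrative sketch (hypothetical onset dates): building an epidemic ("epi")
# curve by counting cases by date of symptom onset, shown as a text bar chart.
from collections import Counter
from datetime import date

onset_dates = [date(2004, 6, 12)] * 3 + [date(2004, 6, 13)] * 7 + \
              [date(2004, 6, 14)] * 4 + [date(2004, 6, 16)] * 1

curve = Counter(onset_dates)
for day in sorted(curve):
    print(f"{day.isoformat()} | {'#' * curve[day]} ({curve[day]})")
```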
more details on methods used in descriptive epidemiology are given in chap. i. of this handbook. . based on the results of basic descriptive epidemiology and the preliminary investigation, some hypotheses should be formulated in order to identify the cause of the outbreak. a hypothesis will most likely be formulated such as "those who attended the luncheon and ate the chicken salad are at greater risk than those who attended and did not eat the chicken salad". it is always easier to find something after knowing what to look for and therefore a hypothesis should be used as a tool. however, the epidemiologist should be flexible enough to change the hypothesis if the data do not support it. if data clues are leading in another direction, the hypothesis should be reformulated such as "those who attended the luncheon and ate the baked chicken are at greater risk than those who attended and did not eat the baked chicken". to verify or deny hypotheses, measures of risk association such as the relative risk (rr) or the odds ratio (or) have to be calculated (as described in chaps. i. , i. , and i. of this handbook). the cdc has developed the software program 'epiinfo' which is easy to use in outbreak investigations, and, even more importantly, free of charge. it can be downloaded from the cdc website (http://www.cdc.gov/epiinfo/). measures of association, however, should be carefully interpreted; even a highly significant measure of association cannot give enough evidence of the real culprit or the contaminated food item. the measure of association is only as good and valid as the data. most people have recall problems when asked what they ate, when they ate and when their symptoms started. even more biases or misclassifications of cases and controls can hide an association. a more confident answer usually comes from laboratory testing of both human samples and food items served at the time of exposure. agents isolated from both food and human samples that are identified as the same subtype, in addition to data results supporting the laboratory findings, are the best evidence beyond reasonable doubt. . as the last step in an outbreak investigation, the epidemiologist writes a final report on the outbreak and communicates the results and recommendations to the public health agency and facilities involved. in the us, public health departments also report foodborne outbreaks electronically to cdc via a secure web based reporting system, the electronic foodborne outbreak reporting system (efors). the "traditional" foodborne outbreak the "traditional" foodborne outbreak is usually a small local event such as a family picnic, a wedding reception, or another social event, and often occurs in a local restaurant or school cafeteria. this type of outbreak is highly local with a high attack rate in the group exposed to the source. because it is immediately apparent to those in the local group such as the group of friends who ate at the restaurant or the students' parents, public health authorities are normally notified early in the outbreak while most of the cases are still symptomatic. epidemiologists can start early on with their investigation and therefore have a much better chance to collect the food eaten and stool samples from cases with gastroenteritis for testing, and also to detect the etiologic agent in both of them.
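as a worked illustration of the measures of association mentioned above, the following sketch computes the relative risk and odds ratio from a 2x2 table for one suspect food item; the counts are hypothetical and are not taken from any outbreak described in this chapter.

```python
# illustrative sketch (hypothetical counts): computing the relative risk (rr)
# and odds ratio (or) from a 2x2 table for one suspect food item, as would be
# done for each item served at the implicated meal.
def risk_measures(ill_exposed, well_exposed, ill_unexposed, well_unexposed):
    """return (relative risk, odds ratio) from a standard 2x2 table."""
    attack_exposed = ill_exposed / (ill_exposed + well_exposed)
    attack_unexposed = ill_unexposed / (ill_unexposed + well_unexposed)
    rr = attack_exposed / attack_unexposed
    odds_ratio = (ill_exposed * well_unexposed) / (well_exposed * ill_unexposed)
    return rr, odds_ratio

# ate the chicken salad: 40 ill, 10 well; did not eat it: 8 ill, 42 well
rr, or_ = risk_measures(40, 10, 8, 42)
print(f"relative risk = {rr:.1f}, odds ratio = {or_:.1f}")
```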
in a school outbreak in louisiana, eighty-seven persons (sixty-seven students and twenty faculty members) experienced abdominal cramps after eating at the school's annual "turkey day" the day before. stool specimens and the turkey with the gravy were both positive for clostridium perfringens with the same pulse field gel electrophoresis (pfge) pattern (merlos ). the inspection of the school cafeteria revealed several food handling violations such as storing, cooling and reheating of the food items served. other than illnesses among food handlers, these types of improper food handling or storage are the most common causes of foodborne outbreaks. a different type of outbreak is emerging as the world is getting smaller. in other words, persons and food can travel more easily and faster from continent to continent, and infectious diseases travel with them. foodborne outbreaks related to imported contaminated food items are normally widespread, involving many states and countries, and therefore are frequently identified. in , a large outbreak of cyclospora cayetanensis occurred in us states and ontario, canada and was linked to contaminated raspberries imported from south america. several hundred laboratory confirmed cases were reported, most of them in immunocompetent persons (cdc ). a very useful molecular tool to identify identical isolates from different geographic areas is subtyping enteric bacteria with pfge. in the us, the pulsenet database allows state health departments to compare their isolates with those of other states, and therefore increases the recognition of nationwide outbreaks linked to the same food item (swaminathan et al. ). in a different scenario, a widely distributed food item with low-level contamination might result in an increase of cases within a large geographic area and therefore might not be detected at the local level. this kind of outbreak might only be detected by chance if the number of cases increases in one location and the local health department alerts other states to be on the lookout for a certain isolate. another type of outbreak is the introduction of a new pathogen into a new geographic area, as happened in when vibrio cholerae was inadvertently introduced in the waters off the gulf coast of the united states. in the us, however, most cases are usually traced back to people who traveled to areas with a high cholera risk or to people who ate food imported from cholera-risk countries, and only sporadic vibrio cholerae cases are associated with the consumption of raw or undercooked shellfish from the gulf of mexico (cdc b). food can be contaminated not only at the end of the food handling process, i.e. by infected food handlers, but also by any event earlier in the chain of food production. in , an ice cream outbreak of salmonella enteritidis in a national brand of ice cream resulted in , illnesses. the outbreak was detected by routine surveillance because of a dramatic increase of salmonella enteritidis in southern minnesota. the cause of the outbreak was a basic failure on an industrial scale to separate raw products from cooked products. the ice cream premix was pasteurized and then transported to the ice cream factory in tanker trucks which had been used to haul raw eggs. this resulted in the contamination of the ice cream and subsequent salmonella cases (hennessey et al. ). surveys are useful to provide information for which there is no data source or no reliable data source.
surveys are time-consuming and are often seen as a last choice to obtain information. however, too often unreliable information is used because it is easily available. for example, any assessment of the legionella problem using passive case detection will be unreliable due to under-diagnosis and under-reporting. most cases of legionellosis are treated empirically as community acquired pneumonias and are never formally diagnosed. in developing countries, surveys are often necessary to evaluate health problems since data collected routinely (disease surveillance, hospital records, case registers) are often incomplete and of poor quality. in industrialized nations, although many sources of data are available, there are some circumstances where surveys may be necessary. prior to carrying out surveys involving human subjects, special procedures need to be followed. in industrialized countries, a human subjects review board has to evaluate the project's value and ethics. in developing countries, however, such boards may not be formalized but it is important to obtain permission from medical, national and local political authorities before proceeding. surveys of human subjects are carried out by mail, telephone, personal interviews, and behavioral observations. in infectious diseases, the collection of biological specimens in humans (i.e. blood for serologic surveys) or the collection of environmental samples (food, water, environmental surfaces) is very common. personal interviews and specimen collection require face to face interaction with the individual surveyed. these are carried out in offices or by house to house surveys. non-respondents are an important problem for infectious disease surveys. those with an infection may be absent from school, may not answer the door or may be unwilling to donate blood for a serologic survey, thus introducing a systematic bias into the survey results. since surveys are expensive, they cannot be easily repeated. all field procedures, questionnaires, biological sample collection methods and laboratory tests should be tested prior to launching the survey itself. feasibility, acceptability and reliability can be tested in a small-scale pilot study. more details on survey methods are to be found in chap. i. of this handbook. sampling . . since surveys are labor intensive, they are rarely carried out on an entire population but rather on a sample. to do a correct sampling, it is necessary to have a sampling base (data elements for the entire population) from which to draw the sample. examples of sampling bases are a population census, a telephone directory (for the phone subscriber population), a school roster or a school list. in developing countries such lists are not often available and may have to be prepared before sampling can start. more information on sampling designs can be found in chap. iv. of this handbook. community surveys (house to house surveys) . . most community surveys are carried out in developing countries because reliable data sources are rare. the sampling base often ends up being the physical layout of the population. a trip and a geographical reconnaissance of the area are necessary. the most common types of surveys undertaken in developing countries are done at the village level; they are based on maps and a census of the village. in small communities, it is important to obtain the participation of the population. villagers are often wary of government officials counting people and going from door to door.
to avoid misinterpretations and rumors, influential people in the community should be told about the survey. their agreement is indispensable and their help is needed to explain the objectives of the survey and particularly its potential benefits. increasing the knowledge about disease, disease prevention and advancing science are abstract notions that are usually poorly understood or valued by villagers who are, in general, very practical people. if a more immediate benefit can be built into the survey, there will be an increase in cooperation of the population. incentives such as offering to diagnose and treat an infection or drugs for the treatment of common ailments such as headaches or malaria enhance the acceptance of the survey. in practically all societies the household is a primary economic and social unit. it can be defined as the smallest social unit of people who have the same residency and maintain a collective organization. the usual method for collecting data is to visit each household and collect samples or administer a questionnaire. medical staff may feel left out or even threatened whenever a medical intervention (such as a survey) is done in their area. a common concern is that people will go to their medical care provider and ask questions about the survey or about specimen collection and results. it is therefore important to involve and inform local medical providers as much as practical. a rare example of a house to house survey in an industrialized nation was carried out in slidell, louisiana for the primary purpose of determining the prevalence of west nile infection in a southern us focus. since the goal was to obtain a random sample of serum from humans living in the focus, the only method was a survey of this type. a cluster sampling design was used to obtain a representative number of households. the area was not stratified because of its homogeneity. census blocks were grouped so that each cluster contained a minimum of households. the probability of including an individual cluster was determined by the proportion of houses selected in that cluster and the number of persons participating given the number of adults in the household. a quota sampling technique was used, with a goal of enlisting participating households in each cluster. inclusion criteria included age (at least years of age) and length of residence (at least years). the household would be included only if an adult household resident was present. a standardized questionnaire was used to interview each participant. information was collected on demographics, any recent febrile illness, knowledge, attitudes, and behaviors to prevent wnv infection and potential exposures to mosquitoes. a serum sample for wnv antibody testing was drawn. in addition, a second questionnaire regarding selected household characteristics and peridomestic mosquito reduction measures was completed. informed consent was obtained from each participant, and all participants were advised that they could receive notification of their blood test results if they wished. institutional review board approvals were obtained. logistics for specimen collection, preservation and transportation to the laboratory were arranged. interpretation of serologic tests and necessary follow up were determined prior to the survey and incorporated in the methods submitted to the ethics committee. 
sampling weights, consisting of components for block selection, household-within-block selection, and individual-within-household participation, were used to estimate population parameters and % confidence intervals (ci). statistical tests were performed incorporating these weights and the stratified cluster sampling design. in this survey, households were surveyed (a % response rate), including participants. there were igm seropositive persons, for a weighted seroprevalence of . % (with a % confidence interval of . %- . %) (vicari et al. ). program evaluation is a systematic way to determine if prevention or intervention programs for the infectious disease of interest are effective and to see how they can be improved. it is beyond the scope of this chapter to explain program evaluation in detail; however, there is abundant information available, e.g. the cdc's framework for program evaluation in public health (cdc a) as well as textbooks on program evaluation (fink ). most importantly, evaluators have to understand the program, such as the epidemiology of the disease of interest, the program's target population and their risk factors, program activities and resources. they have to identify the main objectives of the control actions and determine the most important steps. indicators define the program attributes and translate general concepts into measurable variables. data are then collected and analyzed so that conclusions and recommendations for the program are evidence based. evaluating an infectious disease control program requires a clear understanding of the microorganism, its mode of transmission, the susceptible population and the risk factors. the following example of evaluation of tuberculosis control shows the need to clearly understand the priorities. most tuberculosis transmission comes from active pulmonary tuberculosis cases who have a positive sputum smear (confirmed as tuberculosis mycobacteria on culture). to a lesser extent, smear-negative, culture-positive pulmonary cases also transmit the infection. therefore priority must be given to finding sputum-positive pulmonary cases. the incidence of smear-positive tuberculosis cases is the most important incidence indicator. incidences of active pulmonary cases and of all active cases (pulmonary and extra-pulmonary) are also calculated but are of lesser interest. the following proportions are used to detect anomalies in case finding or case ascertainment: all tuberculosis cases that are pulmonary versus extra-pulmonary; smear-positive, culture-positive pulmonary cases versus smear-negative, culture-positive pulmonary cases; and culture-positive pulmonary cases versus culture-negative pulmonary cases. poor laboratory techniques or low interest in obtaining sputa for smears or cultures may result in underestimating bacteriologically confirmed cases. excessive diagnosis of tuberculosis with reliance on chest x-rays, on the other hand, may overestimate unconfirmed tuberculosis cases. once identified, tuberculosis cases are placed under treatment. treatment of infectious cases is an important preventive measure. treatment efficacy is evaluated by sputum conversion (both on smear and culture) of the active pulmonary cases. after months of an effective regimen, % of active pulmonary cases should have converted their sputum from positive to negative. therefore the rate of sputum conversion at months becomes an important indicator of program effectiveness.
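as a simplified illustration of the design-weighted estimate described at the beginning of this passage, the following sketch computes a weighted seroprevalence from hypothetical weights and serostatus values; it is not the published vicari et al. analysis and omits the confidence interval calculation.

```python
# illustrative sketch (hypothetical data; a simplification, not the published
# analysis): estimating a weighted seroprevalence when each participant carries
# a sampling weight composed of block, household-within-block and
# individual-within-household components.
participants = [
    # (block weight, household weight, individual weight, igm seropositive?)
    (2.0, 1.5, 2.0, True),
    (2.0, 1.5, 1.0, False),
    (3.0, 1.2, 2.0, False),
    (3.0, 1.2, 1.0, False),
    (1.5, 2.0, 3.0, True),
]

weights = [b * h * i for b, h, i, _ in participants]
positive_weight = sum(w for w, (_, _, _, pos) in zip(weights, participants) if pos)

weighted_seroprevalence = positive_weight / sum(weights)
print(f"weighted seroprevalence: {100 * weighted_seroprevalence:.1f}%")
```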
this indicator must be calculated for those who are smear positive and, with lesser importance, for the other active pulmonary cases. to ensure adequate treatment and prevent the development of acquired resistance, tuberculosis cases are placed under directly observed therapy (dot). this measure is quite labor intensive. priority must therefore be given to those at highest risk of relapse. these are the smear-positive, culture-proven active pulmonary cases. dot on extra-pulmonary cases is much less important from a public health standpoint. the same considerations apply to contact investigation and preventive treatment in countries that can afford a tuberculosis contact program. a recently infected contact is at the highest risk of developing tuberculosis in the first year after infection; hence the best preventive return is to identify contacts of infectious cases. those contacts are likely to have been recently infected. systematic screening of large population groups would also identify infected individuals but most would be 'old' infections at lower risk of developing disease. individuals infected with tuberculosis and hiv are at extremely high risk of developing active tuberculosis. therefore the tuberculosis control program should focus on the population at high risk of hiv infection. often, program evaluation is performed by epidemiologists who have not taken the time to understand the dynamics of a disease in the community. rates or proportions are calculated, no priorities are established and precious resources are wasted on activities with little preventive value. for example, attempting to treat all tuberculosis cases with dot, whether pulmonary or not, and investigating all contacts regardless of the bacteriologic status of the index case would be wasteful. today the world is smaller than ever before; international travel and a worldwide food market make us all potentially vulnerable to infectious diseases no matter where we live. new pathogens are emerging, such as sars, or spreading through new territories, such as wnv. wnv, introduced in the us in , became endemic in the us over the next years. hospital-associated and community-associated methicillin-resistant staphylococcus aureus (mrsa) and resistant tuberculosis cases and outbreaks are on the rise. public health professionals are concerned that a novel recombinant strain of influenza will cause a new pandemic. not only are the world and the etiologic agents changing, the world population is changing as well. in industrialized countries, life expectancy is increasing and the elderly are more likely to acquire a chronic disease, cancer or diabetes in their lifetime. because of underlying conditions or the treatment of these diseases, older populations also have an increased susceptibility to infectious diseases and are more likely to develop life-threatening complications. knowledge in the field of infectious disease epidemiology is expanding. while basic epidemiological methods and principles still apply today, improved laboratory diagnoses and techniques help to confirm cases faster and to see how cases are related to each other, and therefore can support the prevention of the spread of the specific disease. better computers can improve data analysis and the internet allows access to in-depth, disease-specific information.
computer connectivity improves disease reporting for surveillance purposes, and the epidemiologist can implement preventive measures faster if necessary and is also able to identify disease clusters and outbreaks on a timelier basis. the global threat of bioterrorism adds a new dimension. the intentional release of anthrax spores, and the infection and death of persons who contracted the disease, created a scare of contaminated letters in the us population. with all these changes, there is a renewed emphasis on infectious disease epidemiology, which makes it a challenging field to work in.
detection of bordetella pertussis and respiratory syncytial virus in air samples from hospital rooms
transmission of tuberculosis in new york city. an analysis by dna fingerprinting and conventional epidemiologic methods
report of the committee on infectious diseases
in: chin j (ed) control of communicable diseases manual, th edn
geographical information system (gis), gimmick or tool for health district management
update: international outbreak of restaurant-associated botulism -vancouver
epidemiologic notes and reports restaurant associated botulism from mushrooms bottled in-house -vancouver
outbreaks of cyclospora cayetanensis infection -united states
case definitions for infectious conditions under public health surveillance
cdc ( a) framework for program evaluation in public health
summary of infections reported to vibrio surveillance system (http://www.cdc.gov/ncidod/dbmd/diseaseinfo/files/vibcste web.pdf) accessed
"norwalk-like viruses": public health consequences and outbreak management
updated guidelines for evaluating public health surveillance systems: recommendations from the guidelines working group
outbreaks of gastroenteritis associated with noroviruses on cruise ships -united states
diagnosis and management of foodborne illnesses: a primer for physicians and other health care professionals
epiinfo (http://www.cdc.gov/epiinfo/) accessed
evaluation fundamentals: guiding health programs, research and policy
yellow fever. in: cox cr (ed) the wellcome trust illustrated history of tropical diseases
a national outbreak of salmonella enteritidis infections from ice cream
encephalitis outbreak in louisiana in
simple school questionnaires can map both schistosoma mansoni and schistosoma haematobium in the democratic republic of congo
molecular and conventional epidemiology of mycobacterium tuberculosis in botswana: a population-based prospective study of pulmonary tuberculosis patients
dolin r (eds) ( ) mandell, douglas, and bennett's principles and practice of infectious diseases
epidemiology, principles and methods. little, brown and company
an uninvited guest at "turkey day"
trends in hospitalizations associated with gastroenteritis among adults in the united states
the next influenza pandemic: lessons from hong kong
on the mode of communication of cholera
pulsenet: the molecular subtyping network for foodborne bacterial disease surveillance, united states
late-breaker report presented at the cdc "annual epidemic intelligence service conference"
(http://www.who.int/whr / / archives/index.htm) accessed
statistical annex. (http://www.who.int/ whr / /archives/ /en/pdf/statisticalannex.pdf) accessed
cholera, the black one. in: yellow fever black goddess, the coevolution of people and plagues
four tales from the new decameron. in: yellow fever black goddess, the coevolution of people and plagues
key: cord- - qdtw authors: zouinina, sarah; bennani, younès; rogovschi, nicoleta; lyhyaoui, abdelouahid title: a two-levels data anonymization approach date: - - journal: artificial intelligence applications and innovations doi: . / - - - - _ sha: doc_id: cord_uid: qdtw the number of devices gathering and using personal data without the person's approval is growing exponentially. the european general data protection regulation (gdpr) came in response to the requests of individuals who felt at risk of personal privacy breaches. consequently, privacy-preserving machine learning algorithms were designed based on cryptography, statistics, database modeling and data mining. in this paper, we present two-level data anonymization methods. the first level consists of anonymizing data using an unsupervised learning protocol, and the second level is anonymization by incorporating the discriminative information to test the effect of labels on the quality of the anonymized data. the results show that the proposed approaches give good results in terms of utility, which preserves the trade-off between data privacy and its usefulness. due to the saturation of cities with smartphones and sensors, the amount of information gathered about each individual is frightening. humans are becoming walking data factories and third parties are tempted to use personal data for malicious purposes. to protect individuals from the misuse of their precious information and to enable researchers to learn from data effectively, data anonymization is introduced with the purpose of finding a balance between the level of anonymity and the amount of information loss. data anonymization is therefore defined as the process of protecting individuals' sensitive information while preserving its type and format [ , ]. (data anonymization is supported by the anr pro-text project, n° anr- -ce - - .) hiding one or multiple values or even adding noise to data as an attempt to anonymize data is considered ineffective because the reconstruction of the initial information is very probable [ ]. machine learning for data anonymization is still an underexplored area [ ], although it provides some good assets to the field of data security. inspired by the k-anonymization technique proposed by sweeney [ ], we aim to create micro-clusters of similar objects that we code using the micro-cluster's representative. in this way, the distortion of the data is minimal and the usefulness of the data is maximal. this can be achieved using supervised or unsupervised methods. for the unsupervised methods [ ], the most used approach is clustering, which opens a new research direction in the field of anonymization, i.e. creating clusters of k elements and replacing the data by the prototypes of the clusters (centroids) in order to obtain a good trade-off between the information loss and the potential data identification risk. however, these approaches are usually based on the k-means algorithm, which is prone to local optima and may give biased results. in this paper we answer the question of how the introduction of discriminative information can affect the quality of the anonymized datasets. for this purpose, we revisited all the previously proposed approaches, and we added a second level of anonymization by incorporating the discriminative information and using adaptive weighting of features to improve the quality of the anonymized data.
this aims to improve the anonymized data quality without compromising its level of privacy. the paper is organised into four sections: the first addresses the different approaches to privacy preservation using machine learning, the second sums up the previously proposed approaches, the third discusses the introduction of the discriminative information and the fourth validates the method experimentally on six different datasets. anonymization methods for microdata rely on many mechanisms, and data perturbation is the common technique binding them all. those mechanisms modify the original data to improve data privacy, but inevitably at the cost of some loss in data utility. strong privacy protection requires masking the original data and thus reducing its utility. microaggregation is a technique for disclosure limitation aimed at protecting the privacy of data subjects in microdata releases. it has been used as an alternative to generalization and suppression to generate k-anonymous data sets, where the identity of each subject is hidden within a group of k subjects. unlike generalization, microaggregation perturbs the data and this additional masking freedom allows improving data utility in several ways, such as increasing data granularity, reducing the impact of outliers and avoiding discretization of numerical data [ ]. in microaggregation, rather than publishing an original variable v_i for a given record, the average of the values of the group to which the record belongs is published. in order to minimize information loss, the groups should be as homogeneous as possible. the impact of microaggregation on the utility of anonymized data is quantified as the resulting accuracy of a machine learning model trained on a portion of microaggregated data and tested on the original data [ ]. microaggregation is measured in terms of syntactic distortion. achieving microaggregation might be done using machine learning models, like clustering and/or classification. lefevre et al. [ ] propose several algorithms for generating an anonymous data set that can be used effectively over predefined workloads. workload characteristics taken into account by those algorithms include selection, projection, classification and regression. additionally, lefevre et al. consider cases in which the anonymized data recipient wants to build models over multiple different attributes. nearest neighbor classification with generalization has been investigated by [ ]. the main purpose of generalizing exemplars (by merging them into hyper-rectangles) is to improve speed and accuracy as well as inducing classification rules, but not to handle anonymized data. martin proposes building non-overlapping, non-nested generalized exemplars in order to induce high accuracy. zhang et al. discuss methods for building naive bayes and decision tree classifiers over partially specified data [ , ]. partially specified records are defined as those that exhibit nonleaf values in the taxonomy trees of one or more attributes. therefore generalized records of anonymous data can be modeled as partially specified data. in their approach classifiers are built on a mixture of partially and fully specified data. inan et al. [ ] address the problem of classification over anonymized data. they proposed an approach that models generalized attributes of anonymized data as uncertain information, where each generalized value of an anonymized record is accompanied by statistics collected from records in the same equivalence class.
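a minimal sketch of the microaggregation idea described above is given below; it groups records into clusters of at least k similar values and publishes the group mean. the sort-based grouping is an illustrative simplification chosen to keep the example short, not the mdav algorithm or the authors' method.

```python
# minimal illustrative sketch of microaggregation (not the authors' method):
# records are grouped into clusters of at least k similar records, and each
# value is replaced by its group average. grouping is done by a simple sort
# along one attribute, chosen only for brevity.
import numpy as np

def microaggregate(values, k=3):
    """replace each value by the mean of its group of >= k nearest (sorted) values."""
    order = np.argsort(values)
    anonymized = np.empty_like(values, dtype=float)
    for start in range(0, len(values), k):
        group = order[start:start + k]
        if len(group) < k:                 # fold a too-small tail into the previous group
            group = order[start - k:]
        anonymized[group] = values[group].mean()
    return anonymized

ages = np.array([23, 25, 24, 41, 39, 40, 67, 70, 65, 22])
print(microaggregate(ages, k=3))
```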
they do not assume any probability distribution over the data. instead, they propose collecting all necessary statistics during anonymization and releasing these together with the anonymized data. they show that releasing such statistics does not violate anonymity. in previous articles we introduced an approach to k-anonymity using collaborative multi-view clustering [ ] and an approach to k-anonymity through constrained clustering [ ]. the two models propose an algorithm that relies on the classical self-organizing maps (soms) [ ] and collaborative multi-view clustering in order to provide useful anonymous datasets [ ]. they achieve anonymization in two levels: the pre-anonymization step and the anonymization step. the pre-anonymization step is similar for both algorithms; it consists of horizontally splitting the data so that each observation is described in different spaces, and then using the collaborative paradigm to exchange topological information between collaborators. the davies-bouldin index (db) [ ] is used in this case as a clustering validity measure and as a stopping criterion of the collaboration. when db decreases, the collaboration is said to be positive, but if it increases, the collaboration is clearly negative, since it is degrading the clustering quality and therefore the utility of the provided anonymous data. the topological collaborative multi-view clustering outputs homogeneous clusters. after the clustering, the individuals contained in each view are coded using the best matching units (bmus) in the case of k-tca and using the linear mixture of models in the case of constrained tca. the pre-anonymized views are then gathered to be reconstructed in the same manner as the original dataset. the anonymization step is totally different between the two algorithms. in k-tca, the pre-anonymized dataset is fine-tuned using a som model with a map size determined automatically using the kohonen heuristic. each individual of the dataset is then coded using the bmu of its cluster and the level of k-anonymity is evaluated. in this model we have the advantage of determining the k-anonymity level automatically. in the second algorithm, constrained tca (c-tca), the k level of anonymity is fixed ahead of time, before starting the experiments. a som is created using the pre-anonymized dataset as an input. each node is examined to determine if it respects the constraint of k elements in each cluster. the elements captured by the neurons that do not respect the predefined constraint are redistributed to the closest units. by using this technique, we design clusters of at least k elements and we code the objects using the bmus in order to obtain a k-anonymized dataset. we then evaluate the best k level that gives a good trade-off between anonymity and utility. another method that we proposed to anonymize a dataset was attribute-oriented kernel density estimation [ ]. the choice of one-dimensional kde was motivated by the ability of the model to determine where data is grouped together and where it is sparse, relying on its density. kde is a non-parametric model that uses probability density to investigate the inner properties of a given dataset. the algorithm that we propose clusters the data by determining the points where the density is highest (local maxima) and the points with the smallest density (local minima): the local minima correspond to the clusters' borders and the local maxima are the clusters' prototypes.
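a minimal sketch of this attribute-oriented kde idea is given below, assuming a gaussian kde from scipy and a grid search for the local extrema; it is a simplified illustration under those assumptions, not the authors' implementation.

```python
# illustrative sketch (a simplification, not the paper's implementation):
# attribute-oriented recoding with a one-dimensional kernel density estimate.
# local minima of the density act as group borders and local maxima as group
# prototypes; each value is recoded with the prototype of its interval.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

def kde_recode_attribute(values, grid_size=512):
    grid = np.linspace(values.min(), values.max(), grid_size)
    density = gaussian_kde(values)(grid)
    minima = grid[argrelextrema(density, np.less)[0]]     # borders between groups
    maxima = grid[argrelextrema(density, np.greater)[0]]  # group prototypes
    borders = np.concatenate(([-np.inf], minima, [np.inf]))
    recoded = np.empty_like(values, dtype=float)
    for lo, hi, proto in zip(borders[:-1], borders[1:], maxima):
        mask = (values >= lo) & (values < hi)
        recoded[mask] = proto
    return recoded

rng = np.random.default_rng(0)
attribute = np.concatenate([rng.normal(20, 1, 50), rng.normal(45, 2, 50)])
print(np.unique(kde_recode_attribute(attribute)))
```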
kde is a non-parametric approach to approximate the distribution of a dataset and to overcome the inability of histograms to achieve this estimation because of the discontinuity of the bins. each object that falls between two minima is recoded using the corresponding local maximum. doing this at a one-dimensional level helps preserve the characteristics of each feature in the dataset and thus does not compromise its utility. after evaluating the different results of data anonymization using the methods in the previous works, we asked the questions: what if the data were labelled? and how can supervision influence the obtained utility results? to answer those questions we used the learning vector quantization (lvq) approach. we applied it to enhance the clustering results of each of our proposed methods. lvq is a pattern recognition model that takes advantage of the labels to improve the accuracy of the classification. the algorithm learns from a subset of patterns that best represent the training set. the choice of the learning vector quantization (lvq) method was motivated by the simplicity and rapidity of convergence of the technique, since it is based on hebbian learning. this is a prototype-based method that prepares a set of codebook vectors in the domain of the observed input data samples and uses them to classify unseen examples. kohonen presented the self-organizing map as an unsupervised learning paradigm that he improved using a supervised learning technique called learning vector quantization. it is a method used for optimizing the performance of a trained map in a reward-punishment scheme. learning vector quantization was designed for classification problems that have existing data sets that can be used to supervise the learning by the system. lvq is non-parametric, meaning that it does not rely on assumptions about the structure of the function that it is approximating. euclidean distance is commonly used to measure the distance between real-valued vectors, although other distance measures may be used (such as the dot product), and data-specific distance measures may be required for non-scalar attributes. there should be sufficient training iterations to expose all the training data to the model multiple times. the learning rate is typically linearly decayed over the training period from an initial value until it is close to zero. multiple passes of the lvq training algorithm are suggested for more robust usage, where the first pass has a large learning rate to prepare the codebook vectors and the second pass has a low learning rate and runs for a long time (perhaps -times more iterations). in the learning vector quantization model, each class contains a set of fixed prototypes with the same dimension as the data to be classified. lvq adaptively modifies the prototypes. in the learning algorithm, the data are first clustered using a clustering method and the clusters' prototypes are moved using lvq to perform classification. we chose to supervise the results of the clustering by moving the clusters' centers using the wlvq proposed in algorithm for each of the approaches. we use the wlvq [ ] since this upgraded version of lvq respects the characteristics of each feature and adapts the weighting of each feature according to its contribution to the discrimination. the system learns using two layers: the first layer calculates the weights of the features, and the weighted data is then presented to the lvq algorithm.
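the following sketch illustrates the general idea of label-driven prototype adaptation with a plain lvq1 update rule; it is a deliberate simplification (no feature-weighting matrix and no symmetric window) and not the authors' wlvq.

```python
# minimal illustrative sketch of label-driven prototype adaptation (a plain
# lvq1 rule, not the authors' full wlvq: no feature-weighting matrix and no
# symmetric window). prototypes are pulled toward examples of their own class
# and pushed away from examples of other classes.
import numpy as np

def lvq1(data, labels, prototypes, proto_labels, lr=0.1, epochs=20):
    protos = prototypes.copy()
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)          # linearly decayed learning rate
        for x, y in zip(data, labels):
            i = np.argmin(np.linalg.norm(protos - x, axis=1))   # nearest prototype
            direction = 1.0 if proto_labels[i] == y else -1.0
            protos[i] += direction * rate * (x - protos[i])
    return protos

rng = np.random.default_rng(1)
class0 = rng.normal([0, 0], 0.5, (30, 2))
class1 = rng.normal([3, 3], 0.5, (30, 2))
data = np.vstack([class0, class1])
labels = np.array([0] * 30 + [1] * 30)
prototypes = np.array([[1.0, 1.0], [2.0, 2.0]])
print(lvq1(data, labels, prototypes, proto_labels=[0, 1]))
```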
the cost function of this algorithm is expressed in terms of the nearest and second nearest codewords, where x ∈ c_k, w is the weighting coefficient matrix, m_i is the nearest codeword vector to wx and m_j is the second nearest codeword vector to wx.
initialization: initialize the matrix of weights w according to the initialization rule; the codewords m are chosen for each class using the k-means algorithm.
learning phase:
. present a learning example x.
. let m_i ∈ c_i be the nearest codeword vector to x. if x ∈ c_i, then go to step ; else:
• let m_j ∈ c_j be the second nearest codeword vector.
• if x ∈ c_j, then a symmetrical window win is set around the mid-point of m_i and m_j; if x falls within win, the codewords are adapted: m_i is moved away from x and m_j is moved toward x according to the corresponding update formulas, where α(t) and β(t) are the learning rates.
the wlvq with the collaborative paradigm enhances the utility of the data anonymized by the k-tca and the constrained tca (c-tca) models; the wlvq is applied after the collaboration between cluster centers to improve the results of the collaboration at the pre-anonymization and the anonymization steps. the experimental protocol of using wlvq with attribute-oriented data anonymization and kernel density estimation takes into account the labels of the dataset, improves the found prototypes and then represents the micro-clusters using them. six datasets from the uci machine learning repository are used in the experiment. the table below presents the main characteristics of these databases. cluster validity consists of techniques for finding a set of clusters that best fits natural partitions without any a priori class information. the outcome of the clustering process is validated by a cluster validity index. internal validation measures often reflect the compactness, the connectivity and the separation of the cluster partitions. we choose to validate the results of the proposed methods using the silhouette index and the davies-bouldin index. the results are given in the tables and . as illustrated, the attribute-oriented microaggregation using wlvq (++ denotes the discriminative version of each approach: kde++, k-tca++, c-tca++) outperforms by far the attribute-oriented microaggregation in both the silhouette and davies-bouldin indices. separability utility. to measure the utility of the anonymized datasets we propose a test on the original and the anonymized data. the test consists of comparing the accuracy of a decision tree model with -fold cross-validation before and after microaggregation to evaluate the practicality of the proposed anonymization. we call it separability utility since it measures the separability of the clusters. we give the results of this measure in table ; we also provide a comparison between the separability utility measures of the original and the anonymized datasets. the separability measure was improved after lvq for % of the tests done on the datasets; this can be explained by the tendency of microaggregation to remove non-decisive attributes from the dataset in order to gather together elements that are similar. the ++ in the name of the methods refers to the discriminant version. structural utility using the earth mover's distance. we believe that measuring the distance between two distributions is the way to evaluate the difference between the datasets. the amount of utility lost in the process of anonymization can be seen as the distance between the anonymized dataset and the original one.
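as a small illustration of the cluster validity indices mentioned above, the following sketch computes the silhouette and davies-bouldin indices with scikit-learn on a toy clustering; the synthetic data stand in for the paper's uci datasets and do not reproduce its experiments.

```python
# illustrative sketch (synthetic data, not the paper's uci experiments):
# validating a clustering with the silhouette and davies-bouldin indices,
# as used to judge whether anonymized micro-clusters remain well separated.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
print("silhouette index:", round(silhouette_score(data, labels), 3))          # higher is better
print("davies-bouldin index:", round(davies_bouldin_score(data, labels), 3))  # lower is better
```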
the earth mover's distance (emd), also known as the wasserstein distance [ ], extends the notion of distance between two single elements to that of a distance between sets or distributions of elements. it compares the probability distributions p and q on a measurable space (Ω, Ψ) and is defined as follows (we are using the distance of order ): w(p, q) = inf_γ ∫_{Ω×Ω} d(x, y) dγ(x, y), where the infimum is taken over the joint measures γ with marginals p and q, and Ω × Ω is the product probability space. notice that we may extend the definition so that p is a measure on a space (Ω, Ψ) and q is a measure on a space (Ω′, Ψ′). let us examine how the above is applied in the case of discrete sample spaces. for generality, we assume that p is a measure on (Ω, Ψ) where Ω = {x_i}, i = , ..., m, and q is a measure on (Ω′, Ψ′) where Ω′ = {y_j}, j = , ..., n - the two spaces are not required to have the same cardinality. then, the distance between p and q becomes the minimum-cost flow w(p, q) = min_{f_ij ≥ 0} Σ_i Σ_j f_ij d(x_i, y_j), subject to Σ_j f_ij = p(x_i) and Σ_i f_ij = q(y_j). emd is the minimum amount of work needed to transform one distribution into another. in our case we measure the emd between the anonymized and the original datasets, attribute by attribute, to get an idea about the distortion of the anonymized datasets. we then normalize all distances between 0 and 1, and define the utility as 1 − w(p, q). the smaller the distance w is, the more the data utility is preserved. preserving combined utility. to choose the anonymization method which best addresses the separability-structural utility trade-off, we propose to combine the two types of utility, structural and separability, in a combined form with α = : comb_utility = α · separability + (1 − α) · structural. table summarizes the clustering results of the proposed approaches in terms of combined utility (comb_utility). as can be seen, our attribute-oriented approach generally performs best on all the datasets. to further evaluate the performance, we compute a measurement score by following [ ], where comb_utility(a_i, d_j) refers to the combined utility value of method a_i on dataset d_j. this score gives an overall evaluation on all the datasets, which shows that our attribute-oriented approach outperforms the other methods substantially in most cases. as shown in the table , the introduction of the discriminant information improves the utility of the anonymized datasets for all of the methods proposed. in this paper we studied the impact of incorporating discriminative information to improve the data anonymization level and to preserve data usefulness. the anonymization is achieved in a two-level process. the first level uses one of three methods, k-tca, constrained tca (c-tca) or attribute-oriented kde, that we introduced for data anonymization through a microaggregation approach; the second level uses the labels and learns the vector weights adaptively using the weighted lvq. the experimental investigation shown above proves the efficiency of the methods and illustrates their importance. the main contribution of the article is the addition of a supervised learning layer to improve the utility of the model without compromising its anonymity. the separability utility reflects the usefulness of the data and the structural utility shows its level of anonymity. the combined utility is a weighted measure that combines both measures; we can change the weight of the utility trade-off depending on which side we want to emphasize.
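a sketch of the per-attribute structural utility and the combined utility is shown below, using scipy's one-dimensional wasserstein distance; the min-max normalization and the value of alpha are illustrative assumptions, not the paper's exact protocol.

```python
# illustrative sketch (assumptions: min-max normalization of each attribute so
# distances lie in [0, 1], and an arbitrary alpha; not the paper's exact protocol):
# structural utility as 1 - emd per attribute, combined with a separability score.
import numpy as np
from scipy.stats import wasserstein_distance

def structural_utility(original, anonymized):
    """mean over attributes of 1 - emd between original and anonymized columns."""
    utilities = []
    for col in range(original.shape[1]):
        lo, hi = original[:, col].min(), original[:, col].max()
        scale = (hi - lo) or 1.0                      # normalize to [0, 1]
        d = wasserstein_distance((original[:, col] - lo) / scale,
                                 (anonymized[:, col] - lo) / scale)
        utilities.append(1.0 - d)
    return float(np.mean(utilities))

def combined_utility(separability, structural, alpha=0.5):
    return alpha * separability + (1 - alpha) * structural

rng = np.random.default_rng(2)
original = rng.normal(size=(200, 3))
anonymized = original + rng.normal(scale=0.1, size=original.shape)  # stand-in for microaggregated data
s = structural_utility(original, anonymized)
print(f"structural utility: {s:.3f}, combined: {combined_utility(0.9, s):.3f}")
```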
Adaptive weighting of pattern features during learning
A cluster separation measure
Steered microaggregation as a unified primitive to anonymize data sets and data streams
Statistical disclosure control
Using anonymized data for classification
Learning accurate and concise naive Bayes classifiers from attribute value taxonomies and data
Workload-aware anonymization
Clustering based privacy preserving of big data using fuzzification and anonymization operation
An anonymization protocol for continuous and dynamic privacy-preserving data collection
Self-organizing maps
Instance-based learning: nearest neighbour with generalisation
The complete book of data anonymization: from planning to implementation
Does k-anonymous microaggregation affect machine-learned macrotrends?
The earth mover's distance as a metric for image retrieval
A review of big data challenges and preserving privacy in big data
k-anonymity: a model for protecting privacy
Data privacy: principles and practice (Chapman & Hall/CRC)
Learning from attribute value taxonomies and partially specified instances
Dual-regularized multi-view outlier detection
Preserving utility during attribute-oriented data anonymization process
Efficient k-anonymization through constrained collaborative clustering
A topological k-anonymity model based on collaborative multi-view clustering
key: cord- -uwt kk authors: hu, paul jen-hwa; zeng, daniel; chen, hsinchun; larson, catherine; chang, wei; tseng, chunju; ma, james title: system for infectious disease information sharing and analysis: design and evaluation date: - - journal: ieee trans inf technol biomed doi: . /titb. . sha: doc_id: cord_uid: uwt kk
Motivated by the importance of infectious disease informatics (IDI) and the challenges to IDI system development and data sharing, we design and implement BioPortal, a web-based IDI system that integrates cross-jurisdictional data to support information sharing, analysis, and visualization in public health. In this paper, we discuss general challenges in IDI, describe BioPortal's architecture and functionalities, and highlight encouraging evaluation results obtained from a controlled experiment that focused on analysis accuracy, task performance efficiency, user information satisfaction, system usability, usefulness, and ease of use.
Increasing globalization, combined with accelerating population mobility and more frequent travel, has made the prevention and management of infectious disease outbreaks a growing concern in public health. Emerging infectious disease and epidemic outbreaks are particularly important and represent critical challenges facing public health researchers and practitioners [ ] , [ ] . In addition, potential threats of bioterrorism appear on the horizon [ ] . Managing infectious disease outbreaks is intrinsically information intensive and requires substantial support for data gathering, integration, analysis, sharing, and visualization [ ] . Such support requirements are becoming even more challenging because of the diverse, heterogeneous, and complex information available in enormous volumes and from different sources that span jurisdictional constituencies both horizontally and vertically.
public health professionals such as epidemiologists can be better supported by advanced information systems (is), as vividly manifested by emerging infectious disease informatics (idi)-an interdisciplinary research area that focuses on the design, implementation, and evaluation of advanced systems, techniques, and methods for managing infectious disease and epidemic outbreaks, ranging from prevention to surveillance and detection [ ] , [ ] . the design and implementation of an effective idi system can be complex and challenging. at the data level, an expanding array of data that pertain to particular diseases, population characteristics, and related health considerations must be collected, organized, and archived, typically by different clinical institutions and health agencies. these data are heterogeneous in their semantics, modeling, granularity, aggregation, availability frequency, and coding/representation. data sharing is critical to the pertinent institutions and agencies, which have to coordinate by explicitly specifying data ownership and access rights, as well as delineating the responsibilities associated with legal and privacy considerations. at the system level, these institutions and agencies often vary in their in-house systems, which adopt proprietary architecture designs and operate on different platforms. as kay et al. [ ] point out, most existing systems in public health have been developed in isolation. the challenge and complexity of designing an idi system extends beyond data and system heterogeneity. from the user's perspective, all relevant data must be seamlessly integrated to support his or her surveillance and analysis tasks that are critical to the prevention of and alert about particular disease events or devastating outbreaks. to be effective, an idi system must encompass sophisticated algorithms for the automatic detection of emerging disease patterns and the identification of probable threats or events. an effective idi system also must have advanced computational models that overlay health data for spatial-temporal analysis to support public health professionals' analysis tasks [ ] . several additional issues are crucial for system design and implementation, including the integration of multiple heterogeneous source data or systems, data accessibility and security, interfaces with geographic information systems (gis), text document management support, and data or text mining. in particular, idi design requirements include spatial-temporal data analysis and related visualization support. typically, public health professionals approach the surveillance or detection of a probable outbreak as an event for which all related data are dotted and analyzed in spatial and temporal dimensions. furthermore, the value of an idi system is generally determined by the extent to which the system can present data and analysis results through intuitively comprehensible and cognitively efficient visualization. ultimately, an idi system must facilitate and enhance task performance by enabling public health professionals to use heuristics and preferred analysis methods to generate more accurate analysis results within a shorter time window. 
to support the surveillance and detection of infectious disease outbreaks by public health professionals, we design and implement the bioportal system, a web-based idi system that provides convenient access to distributed, cross-jurisdictional health data pertaining to several major infectious diseases including west nile virus (wnv), foot-and-mouth disease (fmd), and botulism. our system development team is interdisciplinary, consisting of researchers in both is and public health, practitioners, and officials from several state health departments. bioportal supports sophisticated spatial-temporal data analysis methods, and has effective data/information visualization capabilities. the rest of this paper is structured as follows. in section ii, we describe the architecture design of bioportal and highlight its main technical components and functionalities. next, in section iii, we discuss the value of bioportal for infectious disease surveillance and management, derive hypotheses regarding its advantages and effects, and empirically test these hypotheses using a controlled experiment with subjects. to assess bioportal as a whole, we focus our evaluation on users rather than specific algorithms implemented as part of bioportal and examine its effects on their task performances as well as subjective assess-ments of the system. in section iv, we summarize our findings and note that our data support most of the hypotheses tested. our results suggest that bio portal can better support public health professionals' analysis tasks, and is generally considered more usable, useful, and easier to use than the benchmark technology. section v concludes with a summary, discussions of the paper's contributions and limitations, and some future research directions. bioportal is an integrated, cross-jurisdictional idi infrastructure that has been running for testing and research purposes since early (see www.bioportal.org). although it has not yet been adopted for operational use, it contains more than a dozen realworld data sets contributed by public health partners and other agencies. in this section, we summarize bioportal's architectural design, its main components, data set availability, system functionality, and related outbreak detection research. the information that we present herein establishes the background for the evaluation study reported in the subsequent sections of this paper. fig. illustrates the architecture of the bioportal system, including the data flows between/among the main components of bioportal as well as between bioportal and external data sources. from a system's perspective, bioportal is loosely coupled with state public health information systems in california and new york. it does not change the way these state systems operate. as needed, these systems transmit wnv/botulism information through secure links to the bioportal using mutually agreed protocols. such information is then stored in an internal data store maintained by the bioportal. the system also automatically retrieves data items from sources, such as those from the usgs, and stores them in this internal data store. all the system functions provided by bioportal, including infectious disease data search and query, spatial-temporal visualization, outbreak detection analysis and related modeling, and automatic alert generation based on the results of outbreak detection analysis, are solely based on the data stored in the bioportal internal data store, without further interactions with the contributing data sources. 
technically speaking, we adopt a data warehousing approach, rather than alternative approaches such as query translation, information linkage, or terminology mapping [ ] to address the distributed data integration challenges in idi. this choice of approach is primarily based on the following characteristics of infectious disease data sources and associated analysis needs. first, unlike many other biomedical applications for which it has become increasingly easy to query data sources automatically from remote locations, most infectious disease data sets have been developed primarily for internal use. although accessing the underlying databases through remote queries is technologically feasible, in practice, most idi data providers are unwilling to "open up" their databases. instead, they prefer pushing preprocessed data to (or preprocessing data waiting to be pulled by) a data warehousing system such as bioportal while retaining full control over data fields to be shared (directly at the data level as opposed to at the data access control level). second, the types of queries performed on idi data sources typically are confined to data aggregation requests over particular geographical regions and time periods. therefore, there is no need to strategize complex distributed queries. however, processing speed of the data aggregation is important because such operations must be carried out in large numbers for some outbreak detection analysis approaches (see section ii-c). third, the amount of idi data is relatively small in terms of storage volume because epidemiological information tends to contain a few short data fields, which makes a data warehousing approach feasible. furthermore, overlaps between epidemiological data coverage are rare; therefore, the data warehousing effort becomes relatively manageable. internally, bioportal consists of three main components: a web portal, a data store, and a communication backbone. in section ii-b, we provide the details of each component in more detail; here, we summarize bioportal's implementation environment and the assumptions made on the user's end. bioportal follows a standard three-tier web architecture. the data store component, developed using sql server, provides a data warehouse with information pulled from or pushed by contributing data sources. the communication backbone uses standardcompliant xml formats, and is built as multiple standalone java applications that interface with various data sources using different messaging protocols. most system functions are developed in java using jsp pages to interact with the user. as such, all except one major component of bioportal can be accessed by users through a standard web browser. the exception is the visualization module, which is developed as a standalone java application for improved performance, enhanced interactivity, and greater user interface control and is deployable through the web start technology (assuming that the sun jre environment is installed on the client machine). because the web portal component of bioportal implements the user interface and provides access to all main user functionalities-including: ) searching and querying available infectious disease-related data sets; ) visualizing the data sets using spatial-temporal visualization; ) accessing analysis and outbreak detection functions; and ) accessing the alerting mechanism-we do not discuss this component as one unit. 
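To make the kind of aggregation request described above concrete (case counts over a geographic region and time period against warehoused records), here is a minimal sketch using SQLite; the table layout and column names are illustrative and not BioPortal's actual schema.

```python
# Aggregate case counts by county and month for a given disease and period,
# the type of query the portal data store is optimized to answer quickly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cases (
        disease   TEXT,
        county    TEXT,
        case_date TEXT,   -- ISO yyyy-mm-dd
        count     INTEGER
    )
""")
conn.execute("INSERT INTO cases VALUES ('WNV', 'Suffolk', '2002-08-14', 3)")

rows = conn.execute("""
    SELECT county, substr(case_date, 1, 7) AS month, SUM(count) AS n_cases
    FROM cases
    WHERE disease = ? AND case_date BETWEEN ? AND ?
    GROUP BY county, month
    ORDER BY county, month
""", ("WNV", "2002-01-01", "2002-12-31")).fetchall()
print(rows)
```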
instead, we briefly summarize our work on ) and ) in this section, and then present the bioportal visualization environment. because data analysis and outbreak detection involve innovative spatial-temporal data mining research beyond system implementation, we defer their discussion to section ii-c. ) portal data store: a main objective of bioportal is to enable users from partnering states and organizations to share data. typically, data from different organizations have different designs and are stored in different formats. to enable data interoperability, we use hl standards internally as the main storage format. some data providers (e.g., new york state's hin) have already adopted hl and can, thus, send hl -compliant data to bioportal directly. additional steps are needed to ensure data interoperability for those data providers that do not yet have hl -compliant data. first, we reach an agreement with them regarding the format (typically a simple home-grown xml format) for their data. second, the data providers modify their data export module to implement this mutually agreed format. third, when data from these providers reach bioportal, a data normalization module maps the customized xml format on to hl using predetermined mapping rules implemented by the bioportal team. in effect, the data from the hl -compliant data providers also are processed by this module, because it removes from them unneeded data attributes, duplications, and common misspellings (based on a customized home-grown dictionary). this normalization module is not intended to resolve structural or semantic incompatibilities in an automated fashion; rather, it converts data to a predetermined format and performs shallow syntactic checking. after being processed by the data normalization module, data are stored directly in bioportal's main data store. this hl xml-based design provides a key advantage over an alternative design based on a consolidated database for which the portal data store must consolidate and maintain the data fields for all data sets. when an underlying data set changes its data structure, a portal data store based on the consolidated database must be redesigned and reloaded to reflect the changes, which severely limits system scalability and extensibility. to alleviate potential computational performance problems with our xml-based design, we identify a core set of data fields based on the queries that are likely to be performed frequently. these fields are extracted from all xml messages and stored in a separate database table to enable fast retrieval. ) communication backbone: the communication backbone component enables data exchanges between bioportal and the underlying data sources. several federal programs have been recently created to promote data sharing and system interoperability in the healthcare domain; the cdc's national electronic disease surveillance system (nedss) initiative is particularly relevant for our research. it builds on a set of recognized national standards such as hl for its data format and messaging protocols, and provides basic modeling and ontological support for data models and vocabularies. the nedss and hl standards have had major impacts on the development of idi systems. although these standards have not yet been tested in cross-state sharing scenarios, they provide an appropriate foundation for data exchange standards in national and international contexts. bioportal relies heavily on nedss/hl standards. 
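A sketch of the normalization and core-field extraction idea described above, in which a provider-specific XML message is mapped onto a small set of core fields using predetermined rules; the element names, mapping table, and misspelling dictionary are illustrative, not BioPortal's actual rules.

```python
# Normalize a provider-specific XML message into a flat record of core
# fields (location, time, event type) using predetermined mapping rules.
import xml.etree.ElementTree as ET

FIELD_MAP = {            # provider tag -> core field (illustrative rules)
    "Town": "location",
    "ReportDate": "date",
    "Diagnosis": "event_type",
}
SPELLING_FIXES = {"west nile vius": "west nile virus"}  # illustrative entry

def normalize(xml_text):
    root = ET.fromstring(xml_text)
    record = {}
    for tag, field in FIELD_MAP.items():
        el = root.find(tag)
        if el is not None and el.text:
            value = el.text.strip().lower()
            record[field] = SPELLING_FIXES.get(value, value)
    return record   # ready to be stored in the fast-retrieval core table

msg = ("<Case><Town>Albany</Town><ReportDate>2003-07-02</ReportDate>"
       "<Diagnosis>West Nile Vius</Diagnosis></Case>")
print(normalize(msg))
```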
the communication backbone component uses a collection of source-specific "connectors" to communicate with contributing sources. we use the connector linking new york's hin system and bioportal to illustrate a typical design. the data from hin to the portal system are transmitted in a "push" manner, i.e., hin send through secure public health information network messaging system (phin ms) messages to the portal at prespecified time intervals. the connector on the portal side runs a data receiver daemon to listen for incoming messages. after a message is received, the connector checks for data integrity syntactically and normalizes the data. the connector then stores the verified message in the portal's internal data store through its data ingest control module. other data sources (e.g., usgs) may have "pull"-type connectors that periodically download information from source web sites, and examine and store those data in the portal's internal data store. in general, the communication backbone component provides data receiving and sending functionalities, source-specific data normalization, and data-encryption capabilities. ) data confidentiality and access control: data confidentiality, security, and access control are among the key research and development issues for the bioportal project. with regard to system development, programming and rules already developed for new york's hin system constitute the main sources of our design and implementation decisions. because there was no precedent for extending access to a data system across state lines, we needed to develop new access rules for bioportal. we have created new security and user agreement forms for organizations with proposed access as well as individual users within those organizations. in addition, the agencies involved in developing bioportal formally signed a memorandum of understanding prior to sharing any real data. the responsibilities of participating organizations and individuals with access include: ) the establishment and definition of roles within the agency for access, and the determination of those individuals who fill those roles, including systems for termination of access; ) security of data physically located on, or transported over the organization's network; ) protection for the confidentiality of all data accessed, with prohibitions against disclosure of personal or health information to any other agency, person, or public media outlet; and ) recognition of ownership rights of parties that have provided data. the types of data that must be addressed separately with regard to access are data from humans or owned animals that require the highest levels of confidentiality, data from freeranging wildlife, and data from other systems such as vectors (e.g., mosquitoes for wnv), land use, and so forth. the need for maximum access to track diseases must be balanced against the confidentiality concerns and risks of jeopardizing data reporting to the system. we summarize bioportal's data coverage in table i. ) data search and alerting: bioportal provides limited data search functions to regular idi users. instead of developing a generic data search interface with a full range of search criteria, after extensive discussions with potential end users (i.e., state and county epidemiologists and public health researchers), we decided to concentrate on search criteria based primarily on location and time. 
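Returning to the connectors described above, the following is a minimal sketch of a "pull"-type connector that periodically downloads source data, applies a shallow check, and hands records to an ingest routine; the URL, polling interval, and validation logic are placeholders.

```python
# A pull-type connector: periodically fetch data from a public source,
# run a shallow integrity check, then pass records to the ingest module.
import time
import urllib.request

SOURCE_URL = "https://example.org/dead-bird-reports.csv"   # placeholder
PULL_INTERVAL_SECONDS = 24 * 60 * 60                       # placeholder

def ingest(record):
    """Stand-in for the portal's data ingest control module."""
    print("stored:", record)

def pull_once():
    with urllib.request.urlopen(SOURCE_URL, timeout=30) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            fields = line.split(",")
            if len(fields) >= 3:          # shallow syntactic check only
                ingest(fields)

def run_connector():
    while True:
        try:
            pull_once()
        except OSError as err:            # network failure: retry next cycle
            print("pull failed:", err)
        time.sleep(PULL_INTERVAL_SECONDS)
```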
a specialized tabular interface allows users to quickly identify infectious disease cases that occurred at certain locations within a specified period of time. through this interface, the user can also get summary case counts across locations and times with different levels of granularity. an advanced search module is also available to power users. using this module, a power user can build a personalized search interface that includes additional data-set-specific search criteria. because bioportal can perform idi data analysis automatically without user intervention, if potentially interesting events are detected, the concerned individuals (e.g., epidemiologists in charge) should be alerted. we are currently implementing three types of alerting mechanisms. the first one is by e-mail. the second is through the bioportal web interface, so when a user authorizes himself or herself on the bioportal site and an alert message exists, a flashing icon will notify the user of the pending message. the third mechanism is cellular phone notification through an automated web-based short text message interface for urgent alerts. an important role of visualization in the context of large and complex data set exploration is to organize and characterize the data visually to assist users in overcoming information overload problems [ ] . bioportal makes available an advanced visualization module, called the spatial temporal visualizer (stv), to facilitate the exploration of infectious disease case data and summarize query results. developed as a generic visualization environment, stv can be used to visualize various spatial-temporal data sets simultaneously. the stv has three integrated and synchronized views: periodic, timeline, and gis. the periodic view provides the user with an intuitive display to identify periodic temporal patterns. the timeline view provides a two-dimensional timeline, along with a hierarchical display of the data elements organized as a tree. the gis view displays cases and sightings on a map. fig. illustrates how these three views can be used to explore infectious disease data sets; the top left panel shows the gis view. the user can select multiple data sets to be shown on the map in a layered manner using the checkboxes (e.g., disease cases, natural land features, land-use elements). the top-right panel corresponds to the timeline view and displays the occurrences of various cases using a gantt chart-like display. the user can also access case details easily using the tree display located to the left of the timeline display. below the timeline view is the periodic view with which the user can identify periodic temporal patterns (e.g., months with an unusually high number of cases). the bottom portion of the interface allows the user to specify subsets of data to be displayed and analyzed. as discussed in section ii-a, to achieve fine-graded interface control and high interactivity, stv has been developed as a standalone java application, which can be deployed transparently across the web. the essential data elements (location, time, and event type) displayed by stv are all captured in the relational tables in the bioportal internal data store. the auxiliary data elements (e.g., case details, needed only when a user wants to learn more about a particular data point) may be re-trieved from the hl xml messages stored in the bioportal internal data store. 
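As an illustration of the location- and time-based summaries behind the search interface and the periodic view described above, the sketch below tabulates case records by location and calendar month; the column names and records are illustrative.

```python
# Summarize cases by location and calendar month, the kind of tabulation
# behind the search interface and the periodic (temporal-pattern) view.
import pandas as pd

cases = pd.DataFrame({
    "location": ["Suffolk", "Suffolk", "Albany", "Suffolk"],
    "date": pd.to_datetime(["2002-08-03", "2002-08-19",
                            "2002-09-02", "2003-08-11"]),
    "event_type": ["WNV"] * 4,
})

# Case counts per location and month (search/summary table).
monthly = (cases.groupby(["location", cases["date"].dt.to_period("M")])
                .size().rename("n_cases").reset_index())

# Periodic view: aggregate across years by calendar month to spot
# months with unusually high counts.
by_month = cases.groupby(cases["date"].dt.month).size()
print(monthly)
print(by_month)
```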
because stv is executed on the client machine, real-time data transmissions between the client machine and bioportal server are necessary. for better performance and shorter response time, stv caches much of the data needed on the client side. in addition to data access, query, and visualization, bioportal provides data analysis capabilities, particularly in the area of spatial-temporal data analysis. in idi applications, measurements of interest such as disease cases are often made at various locations in both space and time. in recent years, interest has increased in answering several central questions, which have great practical importance in outbreak detection and arise from spatial-temporal data analysis and related predictive modeling: how can areas with exceptionally high or low measures be identified? how can observers determine whether unusual measures can be attributed to known random variations or are statistically significant? in the latter case, how should the explanatory factors be assessed? how can statistically significant changes be identified in a timely manner in specific geographic areas? for instance, unusual clustering of dead birds has been proven to be highly indicative of wnv outbreaks. from a modeling and computational perspective, two distinct types of spatial-temporal clustering or hotspot analysis techniques have been developed. the first type is based on various kinds of scan statistics, and has been used with increasing frequency in public health and infectious disease studies [ ] . the second type is based on data clustering and its variations, and has found successful application in crime analysis [ ] . bio-portal makes both types of methods available through its web interface. in addition, it allows the user to interactively invoke these methods and visually inspect their results through stv. one major computational problem faced by existing methods is that the shapes of potential hotspots are limited to simple, fixed symmetrical shapes for analytical and search efficiency reasons. as a result, when the real underlying clusters do not conform to such shapes, the identified regions are often poorly localized. to overcome this major computational limitation, as part of the bioportal technical research effort, we have developed an alternative and complementary modeling approach called riskadjusted support vector clustering (rsvc). hotspot analysis differs from standard clustering in that clustering must be performed relative to baseline data points (representing a "normal" situation). in rsvc, we apply the "risk adjustment" concept from a crime hotspot analysis approach [ ] to incorporate baseline information in the clustering process. the basic intuition behind rsvc is as follows: a robust, svm-based clustering mechanism called support vector clustering allows detection of clusters with arbitrary shapes based on the distances defined over pairs of data points. by adjusting distance measures proportionally to the estimated density of the baseline factor, areas with high baseline density make it more difficult to group data points together as clusters, because the distances between these data points have been adjusted upward. we have also extended our rsvc approach to perform prospective hotspot analysis aimed at monitoring data sources on a continuous basis. for technical details of this bioportal spatial-temporal data analysis work, interested readers are referred to [ ] and [ ] . we conducted a controlled experiment to evaluate bioportal holistically. 
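Before turning to the evaluation, the following sketch illustrates the "risk adjustment" idea described above: pairwise distances between case points are inflated in proportion to an estimated baseline density before any distance-based clustering is applied. It is a simplified illustration of the concept, not the RSVC implementation itself, and the adjustment formula is an assumption.

```python
# Risk adjustment for hotspot analysis: inflate pairwise distances between
# case points in proportion to the local density of baseline points, so
# clusters are harder to form where the baseline is already dense.
import numpy as np
from scipy.stats import gaussian_kde
from scipy.spatial.distance import pdist, squareform

def risk_adjusted_distances(cases_xy, baseline_xy):
    """cases_xy, baseline_xy: arrays of shape (n, 2) with point coordinates."""
    kde = gaussian_kde(baseline_xy.T)            # baseline density estimate
    density = kde(cases_xy.T)                    # density at each case point
    scale = 1.0 + density / density.mean()       # illustrative adjustment
    d = squareform(pdist(cases_xy))              # plain Euclidean distances
    # Scale each pairwise distance by the mean risk factor of its endpoints.
    return d * (scale[:, None] + scale[None, :]) / 2.0

# The adjusted matrix can then be fed to any distance-based clustering step
# (the paper uses a support-vector clustering mechanism at this stage).
```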
our foci were objective user task performance and subjective self-reported system assessments. our evaluation did not involve algorithmic assessments or examine individual components of bioportal (e.g., hotspot analysis and outbreak detection), which have been studied previously [ ] . in this section, we discuss the hypotheses tested and detail our evaluation design. we followed the system success evaluation framework by delone and mclean [ ] , and focused on evaluating the essential system characteristics of bioportal and its impacts on user task performance measured objectively by analysis accuracy and task completion efficiency. we also examined users' self-reported assessments of bioportal in terms of satisfaction, usability, usefulness, and ease of use, all of which are critical in system evaluations [ ] . for benchmark purposes, we included a computer-based spreadsheet program commonly used by public health professionals in their analysis tasks. ) analysis accuracy: by integrating interrelated data extracted from different sources and presenting them in a visually intuitive and comprehensible way, bioportal can be expected to better support various analysis tasks by public health professionals. therefore, we tested the following hypotheses. h a : the analysis accuracy that results from the use of bioportal is higher than that associated with the benchmark spreadsheet program. h b : the accuracy improvement that results from the use of bioportal, as compared with the benchmark spreadsheet program, increases with task complexity. ) task completion efficiency: by providing convenient access to integrated data extracted from difference sources, together with easy-to-use analytical algorithms and effective visualization, bioportal can be expected to make public health professionals increasingly efficient in their task performance. we, therefore, tested the following hypothesis. h : the task completion efficiency associated with bioportal is higher than that observed with the benchmark spreadsheet program. ) user satisfaction: user satisfaction is a fundamental aspect of system evaluation and embraces user information satisfaction that emphasizes information requirements [ ] . because of the critical importance of information support in an idi system, we explicitly focused on user information satisfaction and tested the following hypothesis. h : the user information satisfaction that results from the use of bioportal is significantly higher than that observed with the benchmark spreadsheet program. system usability has been shown to affect user adoption, system usage, and satisfaction [ ] . several usability instruments have been developed and validated [ ] , [ ] . of particular importance is the user interaction satisfaction (quis) scale [ ] capable of assessing a system in five fundamental usability dimensions-overall reactions to the system, screen layout and sequence, terminology and system information, system learnability, and system capabilities. we tested the following hypothesis. h : bioportal is more usable than the benchmark spreadsheet program and shows favorable usability scores in overall reaction to the system, screen layout and sequence, terminology and system information, system learnability, and system capabilities. system usefulness is critical to voluntary use of a new system [ ] , [ ] , and generally refers to the extent to which an individual considers a system useful in his or her work role. 
bioportal offers effective data integration support, and has sophisticated built-in functionalities and intuitive visualization designs; as a result, it can be expected to better support the demanding information processing often required in an analysis task. hence, we tested the following hypothesis. h : the usefulness of bioportal, as perceived by an individual, is significantly greater than that of the benchmark spreadsheet program. ) perceived ease of use: perceived ease of use refers to the degree to which an individual considers his or her use of a system to be free of effort [ ] . ease of use represents an essential motivation for individuals' voluntary use of a system [ ] , and can affect their adoption decisions significantly [ ] . hence, we tested the following hypothesis. h : the ease of use of bioportal, as perceived by an individual, is significantly greater than that of the benchmark spreadsheet program. we adopted a randomized, between-groups design. our subjects were graduate students attending the management school or the public health school of a major university located in the southwestern united states. all subjects were knowledgeable about computer-based spreadsheets but varied substantially in general public health knowledge. each subject was randomly assigned to use one particular system (bioportal or the spreadsheet program), though we remained mindful of maintaining a balance in the subject-technology assignment. with the assistance of several experienced public health researchers and professionals, we created six analysis scenarios common in public health and then developed a total of experiment tasks accordingly. the assisting experts classified the experiment tasks on the basis of complexity: low, medium, or high. a complete listing of the scenarios and analysis tasks used in the experiment is available upon request. we provide two examples as follows. scenario : examine data related to wnv. task : in , which county in new york had the highest dead bird count? (complexity = low) task : of the three listed bird species, bluejay, crow, and house sparrow, which had the highest number of positive cases of wnv? (complexity = low) scenario : determine correlations between the incidence of wnv and dead bird occurrences and mosquito pool counts. task : using the bioportal system or the spreadsheets, as assigned, to investigate wnv disease, can you determine whether, during , there is a correlation between the dead bird occurrences and mosquito pool counts? (complexity = high) task : (continued with task ) if so, what correlation do you observe? (complexity = high) to assess an individual subject's accuracy in each task, we consolidated the analyses by the assisting experts to establish a "gold-standard" solution for that task. we measured analysis accuracy using a ten-point scale, with one being completely incorrect and ten being completely correct. we measured task completion efficiency by using the amount of time that a subject needed to complete a task. we evaluated user information satisfaction [ ] using a sevenpoint likert scale, with one indicating extreme disagreement and seven indicating extreme agreement. we adapted question items from previous research [ ] to measure system usefulness and ease of use, using a seven-point likert scale with one indicating extreme disagreement and seven indicating extreme agreement. we adopted the quis instrument [ ] with a nine-point likert scale to evaluate system usability. 
before the experiment, we used a script to inform the subjects explicitly of our objective and data analysis plan while ensuring them of the necessary information privacy. subjects were asked to provide some demographic information, and self-assessments of their general computer self-efficacy and knowledge about computer-based spreadsheets and public health. we provided each subject with an overview of his or her assigned system and a training session based on sample tasks to illustrate how to use that system. in the experiment, each subject was asked to complete all analysis tasks grouped by analysis scenario and sequenced in increasing complexity, i.e., tasks progressing from low to high complexity. after completing all the tasks, each subject had to complete a questionnaire survey to provide his or her assessment of the system's usability, usefulness, and ease of use, as well as his or her satisfaction with the information support by the system. we imposed a -min time limit in the experiment, which was appropriate according to the results of a pilot study [ ] . a total of subjects voluntarily participated in the experiment. among them, subjects used bioportal, and the remainder used the spreadsheet program. of those using bioportal, subjects had high domain knowledge and the others were low in domain knowledge. a similar distribution was observed in the spreadsheet group. according to our analysis, the subjects in the bioportal and spreadsheet groups are comparable demographically, and reported similar self-assessments in general computer efficacy and computer-based spreadsheets. we reexamined the reliability of our instrument by assessing its internal consistency [ ] . as summarized in table ii , the subjects' evaluative responses showed that almost all constructs exhibited a cronbach's alpha value exceeding the commonly suggested threshold of . [ ] , thus, suggesting adequate reliability of our instrument. we tested the main effect of system (bioportal versus the spreadsheet program) and domain knowledge (low versus high general public health knowledge) as well as their combined effects by performing an analysis of variance (anova) with each dependent variable on the basis of subjects' responses. we also performed a paired t-test to assess the difference in each dependent variable obtained from the subjects using bioportal versus the spreadsheet program. we used the gold-standard result to evaluate the accuracy of each task performed by subjects. for each subject, we aggregated his or her analysis accuracy across all the tasks performed in the experiment and used the overall accuracy to test the hypothesized main effect of system. according to our analysis, the system had a significant effect on analysis accuracy (p-value < . ). we further investigated the effect of system on the basis of task complexity, and found that the system's effect on analysis accuracy was insignificant for low-complexity tasks but significant for tasks of medium and high complexity. bioportal's accuracy was greater (mean = . , sd = . ) than that of the spreadsheet program (mean = . , sd = . ), and the difference was significant at the . level. thus, our data supported h a and h b. based on our analysis, the system showed a significant main effect on task completion efficiency (p-value < . ). we compared the amount of time required to complete a task using the respective systems and found that, on average, subjects using bioportal could complete an analysis task considerably faster (mean = . min, sd = . 
min) than their counterparts supported by the spreadsheet program (mean = . min, sd = . min); the difference was significant at the . level. thus, our data supported h . according to our analysis, the main effect of the system on user information satisfaction was significant statistically (pvalue < . ). overall, subjects using bioportal exhibited higher satisfaction with the information support (mean = . , sd = . ) than their counterparts supported by the spreadsheet program (mean = . , sd = . ); the difference was significant at the . level. thus, our data supported h . according to our analysis, the system had a significant main effect on both overall reactions to the system (p-value < . ) and system capabilities (p-value < . ) but not on screen layout and sequence or terminology and system information. the effect on system learnability was somewhat significant statistically. overall, our subjects considered bioportal generally usable and recognized its utilities for supporting their analysis tasks. our evaluation results indicated that the design of bioportal may need to improve in screen layout and sequence, as well as in language (e.g., clarity and user friendliness). our subjects considered their learning to use bioportal not particularly difficult, but its learnability could be enhanced further. according to our comparative analysis of subjects' self-reported assessments of the respective systems, bioportal arguably was more usable than the spreadsheet program in most, but not all, fundamental usability dimensions, though the between-groups differences are not statistically significant. thus, our data partially supported h . our analysis shows that the system had a significant effect on perceived usefulness (p-value < . ). overall, our subjects considered bioportal more useful for supporting their analysis tasks (mean = . , sd = . ) than the spreadsheet program (mean = . , sd = . ). the observed between-groups difference was statistically significant at the . level. thus, our data supported h . according to our analysis, the effect of system on perceived ease of use was significant statistically (p-value < . ). our subjects considered bioportal easier to use (mean = . , sd = . ) than the spreadsheet program (mean = . , sd = . ). the between-groups difference in perceived ease of use was significant at the . level. therefore, our data supported h . the development of advanced idi systems and their routine use by public health professionals are becoming increasingly critical. we report here a significant idi effort, i.e., bioportal that supports cross-jurisdictional data integration with advanced data query, analysis, and visualization capabilities. we conducted a controlled experiment to evaluate bioportal along some fundamental system evaluation dimensions and investigated its effects on user task performance, with particular focus on analysis accuracy, task completion efficiency, and user information satisfaction, system, usability, usefulness, and ease of use. our study generated encouraging findings that suggest desirable effectiveness, usefulness, and ease of use of bioportal. we make several contributions to idi research and practice. first, we designed and implemented an advanced idi system by addressing essential system development challenges pertinent to data/system integration, analysis support, and visualization. second, we conducted a controlled experiment to evaluate bio-portal and its impacts on user task performance. 
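The style of analysis reported above can be sketched as follows, with synthetic data standing in for the experimental measurements; the reliability function, the two-factor ANOVA, and the group comparison are generic illustrations rather than the authors' analysis code.

```python
# Illustrative reliability and group-comparison analysis with synthetic data.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "system": np.repeat(["BioPortal", "Spreadsheet"], 30),
    "knowledge": np.tile(np.repeat(["low", "high"], 15), 2),
    "accuracy": rng.normal(7, 1.5, 60),
})

# Main and interaction effects of system and domain knowledge on accuracy.
model = smf.ols("accuracy ~ C(system) * C(knowledge)", data=df).fit()
print(anova_lm(model, typ=2))

# Between-group comparison of the two systems on a dependent variable.
a = df.loc[df.system == "BioPortal", "accuracy"]
b = df.loc[df.system == "Spreadsheet", "accuracy"]
print(stats.ttest_ind(a, b))
```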
our evaluation had methodological rigor and involved analysis scenarios and tasks common to public health professionals. third, we are contributing to general practices in public health by providing practitioners with a conveniently accessible, easy-to-use system that enables them to generate better analysis results in less time. our future research includes further system enhancements and expanded system evaluations. both system functionalities and usability need further improvement, including hotspot analysis and such usability dimensions as screen layout and sequence and system information. on the evaluation front, the reported evaluation only considers wnv, botulism, and foot-and-mouth disease and emphasizes frequency-and pattern-related analysis tasks. to better mimic real-world challenges in public health, additional and preferably more diverse analysis scenarios and tasks must be considered in future evaluation studies. while our subject choice is appropriate for the intended evaluation purpose and hypothesis testing, future investigations should also involve public health researchers and practitioners, preferably from different institutions and regions. paul jen-hwa hu received the ph.d. degree in management information systems from the university of arizona, tucson. he is an associate professor and david eccles faculty fellow with the david eccles school of business, university of utah, salt lake city. his current research interests include healthcare information systems and management, technology implementation management, electronic commerce, digital government, human-computer interaction, and knowledge management. he has authored more than papers in information systems and ieee journals. he is currently an associate professor and the director of the intelligent systems and decisions laboratory in the department of management information systems, university of arizona, tucson. he is an affiliated professor with the institute of automation, chinese academy of sciences, beijing, china. his research interests include software agents and multiagent systems, complex systems analysis, recommender systems, digital economic institutions, automated negotiation and auction, spatio-temporal data analysis, security informatics, and infectious disease informatics. dr. zeng is the chair of the informs college on artificial intelligence and the vice president for technical activities of the ieee intelligent transportation systems society. hsinchun chen received the bachelor's degree in management science from the national chiao-tung university, taiwan, r.o.c., the m.b.a. degree from the state university of new york, buffalo, and the ph.d. degree in information systems from new york university, new york, ny. he is the mcclelland professor with the department of management information systems, university of arizona, tucson, where he is the founding director of the hoffman e-commerce laboratory as well as the founder and director of the artificial intelligence laboratory. he is the author/coauthor of more than papers published in several journals and conference proceedings. his research interests include intelligence analysis, biomedical informatics, data, text, and web mining, digital library, knowledge management, and web computing. catherine larson received the bachelor's degree in liberal arts (spanish and anthropology) and the master's degree in library and information science from the university of illinois at urbana-champaign, champaign. 
she is currently the associate director of the artificial intelligence laboratory, department of management information systems, university of arizona, tucson. wei chang received the bachelor's degree from tsinghua university, beijing, china, and the master's degree from the university of arizona, tucson, both in management information systems. he is currently working toward the ph.d. degree in operation research and decision science at the university of pittsburgh, pittsburgh, pa. chunju tseng received the b.s. degree from national taiwan university, taiwan, r.o.c., and the m.s. degree in management information systems from the university of arizona, tucson. he is an assistant research scientist with the department of management information systems, university of arizona. his research interests include web mining, infectious disease surveillance, and humancomputer interaction. he has authored several papers published in conference proceedings. james ma received the b.sc. degree from peking university, beijing, china, and the master's degree in computer science from the university of texas, dallas. he is currently working toward the ph.d. degree in management information systems at the university of arizona, tucson. factors in the mergence of infectious diseases predicting super spreading events during the severe acute respiratory syndrome epidemics in hong kong and singapore the public health infrastructure and our nation's health disease surveillance and the academic, clinical, and public health communities a novel spatio-temporal data analysis approach based on prospective support vector clustering a national agenda for public health informatics innovative information sharing strategies surveillance and control of infectious diseases at local, national and international levels heterogeneous database integration in biomedicine creating a large-scale content-based airphoto image digital library prospective time periodic geographical disease surveillance using a scan statistic crimestat: a spatial statistics program for the analysis of crime incident locations a comparative study of spatio-temporal data analysis techniques in security informatics spatial-temporal cross-correlation analysis: a new measure and a case study in infectious disease informatics the delone and mclean model of information systems success: a ten-year update evaluating public sector information systems: satisfaction versus impact the measurement of user information satisfaction the usability of everyday technology: emerging and fading opportunities the influence of climate on the epidemiology of bartonellosis in ancash designing the user interface: strategies for effective human-computer interaction perceived usefulness, perceived ease of use, and user acceptance of information technology the technology acceptance model: past, present, and future determinants of perceived ease of use: integrating control, intrinsic motivation, and emotion into the technology acceptance model evaluating an infectious disease information sharing and analysis system validating instruments in mis research applied multiple regression/correlation analysis for the behavioral sciences key: cord- -u apzw authors: michael, edwin; sharma, swarnali; smith, morgan e.; touloupou, panayiota; giardina, federica; prada, joaquin m.; stolk, wilma a.; hollingsworth, deirdre; de vlas, sake j. title: quantifying the value of surveillance data for improving model predictions of lymphatic filariasis elimination date: - - journal: plos negl trop dis doi: . 
/journal.pntd. sha: doc_id: cord_uid: u apzw background: Mathematical models are increasingly being used to evaluate strategies aiming to achieve the control or elimination of parasitic diseases. Recently, owing to the growing realization that process-oriented models are useful for ecological forecasts only if the biological processes are well defined, attention has focused on data assimilation as a means to improve the predictive performance of these models. methodology and principal findings: We report on the development of an analytical framework to quantify the relative values of various longitudinal infection surveillance data collected in field sites undergoing mass drug administrations (MDAs) for calibrating three lymphatic filariasis (LF) models (EPIFIL, LYMFASIM, and TRANSFIL), and for improving their predictions of the required durations of drug interventions to achieve parasite elimination in endemic populations. The relative information contribution of site-specific data collected at the time points proposed by the WHO monitoring framework was evaluated using model-data updating procedures, and via calculations of the Shannon information index and weighted variances from the probability distributions of the estimated timelines to parasite extinction made by each model. Results show that data-informed models provided more precise forecasts of elimination timelines in each site compared to model-only simulations. Data streams that included year post-MDA microfilariae (mf) survey data, however, reduced each model's uncertainty most compared to data streams containing only baseline and/or post-MDA or longer-term mf survey data irrespective of MDA coverage, suggesting that data up to this monitoring point may be optimal for informing the present LF models. We show that the improvements observed in the predictive performance of the best data-informed models may be a function of temporal changes in inter-parameter interactions. Such best data-informed models may also produce more accurate predictions of the durations of drug interventions required to achieve parasite elimination. significance: Knowledge of the relative information contributions of model-only versus data-informed models is valuable for improving the usefulness of LF model predictions in management decision making, learning system dynamics, and for supporting the design of parasite monitoring programmes. The present results further pinpoint the crucial need for longitudinal infection surveillance data for enhancing the precision and accuracy of model predictions of the intervention durations required to achieve parasite elimination in an endemic location.
Mathematical models of parasite transmission, via their capacity for producing dynamical forecasts or predictions of the likely future states of an infection system, offer an important tool for guiding the development and evaluation of strategies aiming to control or eliminate infectious diseases [ ] [ ] [ ] [ ] [ ] [ ] [ ] . The power of these numerical simulation tools is based uniquely on their ability to appropriately incorporate the underlying nonlinear and multivariate processes of pathogen transmission in order to facilitate plausible predictions outside the range of conditions at which these processes are either directly observed or quantified [ ] [ ] [ ] [ ] . The value of these tools for guiding policy and management decisions by providing comparative predictions of the outcomes of various strategies for achieving the control or elimination of the major neglected tropical diseases (NTDs) has been highlighted in a series of recent publications [ , , ] , demonstrating the crucial role these quantitative tools are beginning to play in advancing policy options for these diseases. While these developments underscore the utility of transmission models for supporting policy development in parasite control, a growing realization is that these models can be useful for this purpose only if the biological processes are well defined and demographic and environmental stochasticity are either well characterized or unimportant for meeting the goal of the policy modelling exercise [ ] [ ] [ ] [ ] [ ] [ ] [ ] . This is because the realized predictability of any model for a system depends on the initial conditions, parameterizations, and process equations that are utilized in its simulation, such that model outcomes are strongly sensitive to the choice of values used for these variables [ ] . Any misspecification of these system attributes will lead to failure in accurately forecasting the future behaviour of a system, with predictions of actual future states becoming highly uncertain even when the exact representation of the underlying deterministic process is well established but precise specification of initial conditions or forcing and/or parameter values is difficult to achieve [ , ] . This problem becomes even more intractable when theoretical models depend on parameter estimates taken from other studies [ , , ] . Both these challenges, viz. sensitivity to forcing conditions and the use of parameter estimates from settings that differ from the dynamical environment in which a model will be used for simulation, imply that strong limits will be imposed on the realized predictability of any given model for an application [ , , ] .
as we have shown recently, if such uncertainties are ignored, the ability of parasite transmission models to form the scientific basis for management decisions can be severely undermined, especially when predictions are required over long time frames and across heterogeneous geographic locations [ , , ] . these inherent difficulties with using an idealized model for producing predictions to guide management have led to consideration of data-driven modelling procedures that allow the use of information contained within observations to improve specification and hence the predictive performance of process-based models [ , , , [ ] [ ] [ ] . such approaches, termed model-data fusion or data assimilation methods, act by combining models with various data streams (including observations made at different spatial or temporal scales) in a statistically rigorous way to inform initial conditions, constrain model parameters and system states, and quantify model errors. the result is the discovery of models that can more adequately capture the prevailing system dynamics in a site, an outcome which in turn has been shown to result in the making of significantly improved predictions for management decision making [ , , , ] . initially used in geophysics and weather forecasting, these methods are also beginning to be applied in ecological modelling, including more recently in the case of infectious disease modelling [ , ] . in the latter case, the approach has shown that it can reliably constrain a disease transmission model during simulation to yield results that approximate epidemiological reality as closely as possible, and as a consequence improve the accuracy of forecasts of the response of a pathogen system exposed to various control efforts [ - , , - ] . more recently, attention has also focused on the notion that a model essentially represents a conditional proposition, i.e. that running a model in a predictive mode presupposes that the driving forces of the system will remain within the bounds of the model conceptualization or specification [ ] . if these driving forces were to change, then it follows that even a model well-calibrated to a given historical dataset will fail. new developments in longitudinal data assimilation can mitigate this problem of potential time variation of parameters via the recursive adjustment of the model by assimilation of data obtained through time [ , , ] . apart from allowing assessment of whether stasis bias may occur in model predictions, such sequential model calibration with time-varying data can also be useful for quantifying the utility of the next measurement in maximizing the information gained from all measurements together [ ] . carrying out such longitudinal model-data analysis has thus the potential for providing information to improve the efficiency and cost-effectiveness of data monitoring campaigns [ , [ ] [ ] [ ] , along with facilitating more reliable model forecasts. a key question, however, is evaluating which longitudinal data streams provide the most information to improve model performance [ ] . indeed, it is possible that from a modelling perspective using more data may not always lead to a better-constrained model [ ] . this suggests that addressing this question is not only relevant to model developers, who need observational data to improve, constrain, and test models, but also for disease managers working on the design of disease surveillance plans. 
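A conceptual sketch of how the information added by each successive survey can be tracked: after each data stream is assimilated, the spread of the predicted elimination timelines across retained parameter sets is summarized. The variance and Shannon-entropy summaries below are generic illustrations, and the synthetic numbers are placeholders rather than model output.

```python
# Quantify how much each successive data stream narrows the distribution of
# predicted years-to-elimination across retained parameter sets.
import numpy as np

def shannon_entropy(samples, bins=20):
    """Entropy (in nats) of a histogram of predicted elimination timelines."""
    counts, _ = np.histogram(samples, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def information_summary(timelines_by_scenario):
    """timelines_by_scenario: dict mapping scenario name -> array of predictions."""
    for name, t in timelines_by_scenario.items():
        print(f"{name}: variance={np.var(t):.2f}, entropy={shannon_entropy(t):.2f}")

# Example with synthetic predictions: uncertainty shrinks as more surveys
# are assimilated (scenario labels mirror the data streams described above).
rng = np.random.default_rng(1)
information_summary({
    "model only":          rng.normal(10, 4.0, 1000),
    "baseline only":       rng.normal(10, 3.0, 1000),
    "baseline + post-MDA": rng.normal(10, 2.0, 1000),
})
```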
at a more philosophical level, we contend that these questions have implications for how current longitudinal monitoring data from parasite control programmes can best be exploited both scientifically and in management [ ] . specifically, we suggest that these surveillance data need to be analysed using models in a manner that allows the extraction of maximal information about the monitored dynamical systems so that this can be used to better guide both the collection of such data as well as the provision of more precise estimates of the system state for use in making state-dependent decisions [ , [ ] [ ] [ ] . currently, parasite control programmes use infection monitoring data largely from sentinel sites primarily to determine if an often arbitrarily set target is met [ ] . little consideration is given to whether these data could also be used to learn about the underlying transmission dynamics of the parasitic system, or how such learning can be effectively used by management to make better decisions regarding the interventions required in a setting to meet stated goals [ , ] . here, we develop an analytical framework to investigate the value of using longitudinal lf infection data for improving predictions of the durations of drug interventions required for achieving lf elimination by coupling data collected during mass drug interventions (mdas) carried out in three example field sites to three existing state-of-the-art lymphatic filariasis (lf) models [ , , , [ ] [ ] [ ] [ ] [ ] [ ] . to be managerially relevant to current who-specified lf intervention surveillance efforts, we evaluated the usefulness of infection data collected in these sites at the time points proposed by the who monitoring framework in carrying out the present assessment [ ] . this was specifically performed by ranking these different infection surveillance data streams according to the incremental information gain that each stream provided for reducing the prediction uncertainty of each model. longitudinal pre-and post-infection and mda data from representative sites located in each of the three major regions endemic for lf (africa, india, and papua new guinea (png)) were assembled from the published literature for use in constraining the lf models employed in this study. the three sites (kirare, tanzania, alagramam, india, and peneng, png) were selected on the basis that each represents the average endemic transmission conditions (average level of infection, transmitting mosquito genus) of each of these three major extant lf regions, while providing details on the required model inputs and data for conducting this study. these data inputs encompassed information on the annual biting rate (abr) and dominant mosquito genus, as well as mda intervention details, including the relevant drug regimen, time and population coverage of mda, and times and results of the conducted microfilaria (mf) prevalence surveys (table ) . note each site also provided these infection and mda data at the time points pertinent to the existing who guidelines for conducting lf monitoring surveys during a mda programme [ ] , which additionally, as pointed out above, allowed the assessment of the value of such infection data both for supporting effective model calibration and for producing more reliable intervention forecasts. the three existing lf models employed for this study included epifil, a deterministic monte carlo population-based model, and lymfasim and transfil, which are both stochastic, individual-based models. 
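the site-level inputs listed above (abr, dominant vector genus, mda regimen, timing and coverage, and the mf prevalence surveys) are the quantities each model must be supplied with before calibration. the sketch below is only an illustration of how such inputs might be organised in python; the field names are ours and are not taken from the original model code or datasets.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SiteInputs:
    """Per-site inputs assumed to be needed by the LF models (illustrative only)."""
    name: str
    annual_biting_rate: float          # observed ABR at baseline
    vector_genus: str                  # dominant transmitting mosquito genus
    # one tuple per MDA round: (year, drug regimen, population coverage)
    mda_rounds: List[Tuple[float, str, float]] = field(default_factory=list)
    # one tuple per survey: (year, observed mf prevalence, sample size)
    mf_surveys: List[Tuple[float, float, int]] = field(default_factory=list)
```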
all three models simulate lf transmission in a population by accounting for key biological and intervention processes such as impacts of vector density, the life cycle of the parasite, age-dependent exposure, density-dependent transmission processes, infection aggregation, and the effects of drug treatments as well as vector control [ , , - , , , ] . although the three models structurally follow a basic coupled immigration-death model formulation, they differ in implementation (e.g. from individual to population-based), the total number of parameters included, and the way biological and intervention processes are mathematically incorporated and parameterized. the three models have been compared in recent work [ , ] , with full details of the implementation and simulation procedures for each individual model also described [ , , , , , , , , ] . individual model parameters and fitting procedures specific to this work are given in detail in s supplementary information. we used longitudinal data assimilation methods to sequentially calibrate the three lf models with the investigated surveillance data such that parameter estimates and model predictions reflect not only the information contained in the baseline but also follow-up data points. the available mf prevalence data from each site were arranged into four different temporal data streams to imitate the current who guidelines regarding the time points for conducting monitoring surveys during an mda programme. this protocol proposes that infection data be collected in sentinel sites before the first round of mda to establish baseline conditions, no sooner than months following the third round of mda, and no sooner than months following the fifth mda to assess whether transmission has been interrupted (defined as reduction of mf prevalence to below % in a population) [ , ] . thus, the four data streams considered for investigating the value of information gained from each survey were respectively: scenario -baseline mf prevalence data only, scenario -baseline and post-mda mf prevalence data, scenario -baseline, post-mda , and post-mda mf prevalence data, and scenario -baseline and post-mda mf prevalence data. in addition to these four data streams, a fifth model-only scenario (scenario ) was also considered where no site-specific data was introduced. in this case, simulations of interventions were performed using only model-specific parameter and abr priors estimated for each region. the first step for all models during the data assimilation exercises reported here was to initially simulate the baseline infection conditions in each site using a large number of samples ( , for epifil and transfil, and , - , for lymfasim) randomly selected from the parameter priors deployed by each model. the number of parameters which were left free to be fitted to these data by each model range from (lymfasim and transfil) to (epifil). the abr, a key transmission parameter in all three models, was also left as a free parameter whose distribution was influenced by the observed abr (table ) and/or by fits to previous region-specific datasets (see s supplementary information for model-specific implementations). the subsequent steps used to incorporate longitudinal infection data into the model calibration procedure varied among the models, but in all cases the goodness-of-fit of the model outputs for the site-specific mf prevalence data was assessed using the chi-square metric (α = . ) [ ] . 
epifil used a sequential model updating procedure to iteratively modify the parameters with the introduction of each subsequent follow-up data point through time [ ] . this process uses parameter estimates from model fits to previous data as priors for the simulation of the next data point; these are successively updated with the introduction of each new observation, thus providing a flexible framework by which to constrain a model using newly available data. fig summarizes the iterative algorithm used for conducting this sequential model-data assimilation exercise [ ] . lymfasim and transfil, by contrast, included all the data in each investigated stream together for selecting the best-fitting models for each time series, i.e. model selection for each data series was based on using all relevant observations simultaneously in the fitting process [ , , ] . a limitation of this batch estimation approach is that the posterior probability of each model is fixed for the whole simulation period, unlike the case in sequential data assimilation, where a restricted set of parameters is exposed to each observation (as a result of parameter constraining by the data used in the previous time step), which thereby yields models that give better predictions for different portions of the underlying temporal process; here we use both methods in order to include and assess the impact that this implementation difference may have on the results presented below. for all models, the final updated parameter estimates from each data stream were used to simulate the impact of observed mda rounds and for predicting the impact of continued mda to estimate how many years were required to achieve % mf prevalence. interventions were modelled by using the updated parameter vectors or models selected from each scenario for simulating the impact of the reported as well as hypothetical future mda rounds on the number of years required to reduce the observed baseline lf prevalence in each site to below the who transmission threshold of % mf prevalence [ ] . when simulating these interventions, the observed mda times, regimens, and coverages followed in each site were used (table ) , while mda was assumed to target all residents aged years and above. for making mf prevalence forecasts beyond the observations made in each site, mda simulations were extended for a total of annual rounds in each site at an assumed coverage of %. while the drug-induced mf kill rate and the duration of adult worm sterilization were fixed among the models (table ) , the worm kill rate was left as a free parameter to be estimated from post-intervention data to account for the uncertainty in this drug efficacy parameter [ , , ] . the number of years of mda required to achieve the threshold of % mf prevalence was calculated from model forecasts of changes in mf prevalence due to mda for each model-data fusion scenario. the predictions from each model regarding timelines to achieve % mf for each fitting scenario were used to determine the information gained from each data stream compared to the information attributable to the model itself [ , , ] .

fig (caption): in all scenarios, the initial epifil models were initialized with parameter priors and a chi-square fitting criterion was applied to select those models which represent the baseline mf prevalence data sufficiently well (α = . ). the accepted models were then used to simulate the impact of interventions on mf prevalence. the chi-square fitting criterion was sequentially applied to refine the selection of models according to the post-mda mf prevalence data included in each fitting scenario. the fitted parameters from the selection of acceptable models at each data point were used to predict timelines to achieve % mf prevalence. the scenarios noted in the blue boxes indicate the final relevant updating step before using the fitted parameters to predict timelines to achieve % mf in that data fitting scenario.
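the sequential filtering summarized in the figure caption can be illustrated with a simple rejection-style sketch: sample parameter vectors from the priors, keep those whose simulated mf prevalence passes the chi-square criterion at baseline, and re-apply the criterion as each follow-up survey is assimilated. this is a minimal illustration only, assuming a generic simulate_mf_prevalence(theta, t) function standing in for any of the three lf models (with the site's mda history built into it), and a significance level of 0.05 that we have assumed because the value is elided in the extracted text; it is not the authors' actual implementation. the batch variant used by lymfasim and transfil would instead apply the criteria for all surveys in a stream to the original prior samples at once.

```python
import numpy as np
from scipy.stats import chi2

def chi_square_accept(obs_prev, sim_prev, n_sampled, alpha=0.05):
    """Accept a parameter vector if the simulated mf prevalence is consistent
    with the observed survey prevalence under a Pearson chi-square criterion."""
    expected_pos = np.clip(sim_prev, 1e-6, 1 - 1e-6) * n_sampled
    observed_pos = obs_prev * n_sampled
    expected_neg = n_sampled - expected_pos
    observed_neg = n_sampled - observed_pos
    stat = ((observed_pos - expected_pos) ** 2 / expected_pos
            + (observed_neg - expected_neg) ** 2 / expected_neg)
    return stat <= chi2.ppf(1 - alpha, df=1)

def sequential_calibration(prior_samples, surveys, simulate_mf_prevalence):
    """surveys: list of (time, observed mf prevalence, sample size) in time order.
    Parameter vectors surviving one survey act as the priors for the next."""
    accepted = list(prior_samples)
    for t, obs_prev, n in surveys:
        accepted = [theta for theta in accepted
                    if chi_square_accept(obs_prev,
                                         simulate_mf_prevalence(theta, t), n)]
    return accepted
```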
the relative information gained from a particular data stream was calculated as i_d = h_m − h_md, where h measures the entropy or uncertainty associated with a random variable, h_m denotes predictions from the model-only scenario (scenario ), which essentially represents the impact of prior knowledge of the system, and h_md signifies predictions from each of the four model-data scenarios (i.e. scenarios - ). the values of i_d for each data scenario or stream were compared in a site to infer which survey data are most useful for reducing model uncertainty. the shannon information index was used to measure entropy as h = −∑_{i=1..m} p_i log(p_i), where p_i is the discrete probability density function (pdf) of the number of years of mda predicted by each fitted model to reach % mf, estimated from a histogram of the respective model predictions for m bins (of equal width in the range between the minimum and maximum values of the predictions) [ , ] . to statistically compare two entropy values, a permutation test using the differential shannon entropy (dse) was performed [ ] . dse is defined as |h_1 − h_2|, where h_1 was calculated from the distribution of timelines to achieve % mf for a given scenario, y_1, and h_2 was calculated from the distribution of timelines to achieve % mf for a different scenario, y_2. the elements of y_1 and y_2 were combined into a single list of size |y_1| + |y_2| and the list was permuted , times. the dse was then recalculated each time, by computing a new h_1 from the first |y_1| elements and a new h_2 from the last |y_2| elements of each permutation, from which p-values were quantified as the proportion of all recalculated dses that were greater than the original dse. model predictions of the mean and variance in timelines to lf elimination were weighted according to the frequencies by which predictions occurred in a group of simulations. in general, if d_1, d_2, ..., d_n are data points (model predictions in the present case) that occur in an ensemble of simulations with different weights or frequencies w_1, w_2, ..., w_n, then the weighted mean is m_w = ∑_i w_i d_i / ∑_i w_i and the weighted variance is s² = ∑_i w_i (d_i − m_w)² / (((n′ − 1)/n′) ∑_i w_i), where n is the number of data points and n′ is the number of non-zero weights. in this study, the weighted variance of the distributions of predicted timelines to achieve % mf prevalence was calculated to provide a measure of the precision of model predictions in addition to the entropy measure, h. a similar weighting scheme was also used to pool the timeline predictions of all three models. here, predictions made by each of the three models for each data scenario were weighted as above, and a composite weighted % percentile interval for the pooled predictions was calculated for each data stream. this was done by first computing the weighted percentiles for the combined model simulations, from which the pooled . th and . th percentile values were quantified. the matlab function, wprctile, was used to carry out this calculation.
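a short python sketch of the entropy and dse calculations defined above is given below for illustration. the number of histogram bins and the number of permutations used as defaults are our own assumptions (both values are elided in the extracted text), and the functions simply operate on arrays of predicted years-to-elimination produced by any set of fitted models.

```python
import numpy as np

def shannon_entropy(timelines, weights=None, m=20):
    """h = -sum_i p_i log(p_i) for predicted years-to-elimination, with p_i
    estimated from an m-bin histogram spanning the min..max predictions."""
    counts, _ = np.histogram(timelines, bins=m, weights=weights)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def information_gain(model_only, data_informed, m=20):
    """i_d = h_m - h_md: reduction in prediction entropy attributable to data."""
    return shannon_entropy(model_only, m=m) - shannon_entropy(data_informed, m=m)

def dse_permutation_test(y1, y2, n_perm=10000, m=20, seed=None):
    """Permutation test on the differential Shannon entropy |h_1 - h_2|."""
    rng = np.random.default_rng(seed)
    observed = abs(shannon_entropy(y1, m=m) - shannon_entropy(y2, m=m))
    pooled = np.concatenate([y1, y2])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = abs(shannon_entropy(pooled[:len(y1)], m=m)
                - shannon_entropy(pooled[len(y1):], m=m))
        exceed += d > observed
    # p-value: share of permuted DSE values greater than the observed DSE
    return observed, exceed / n_perm
```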
the extent by which parameter constraints are achieved through the coupling of models with data was evaluated to determine if improvements in such constraints by the use of additional data may lead to reduced model prediction uncertainty [ ] . parameter constraint was calculated as the ratio of the mean standard deviation of all fitted parameter distributions to the mean standard deviation of all prior parameter distributions. a ratio of less than one indicates the fitted parameter space is more constrained than the prior parameter space [ ] . this assessment was carried out using the epifil model only. in addition, pairwise parameter correlations were also evaluated to assess whether the sign, magnitude, and significance of these correlations changed by scenario to determine if using additional data might alter these interactions to better constrain a model. for this assessment, spearman's correlation coefficients and p-values testing the hypothesis of no correlation against the alternative of correlation were calculated, and the exercise was run using the estimated parameters from the epifil model. epifil was used to conduct a sensitivity analysis investigating whether the trend in relative information gained by coupling the model with longitudinal data was dependent on the interventions simulated. the same series of simulations (for three lf endemic sites and five fitting scenarios) were completed with the extended mda coverage beyond the observations given in table set here at % instead of % to represent an optimal control strategy. as before, the timelines to reach % mf prevalence in each fitting scenario were calculated and used to determine which data stream provided the model with the greatest gain of information. the results were compared to the original series of simulations to assess whether the trends are robust to changes in the intervention coverages simulated. epifil was also used to perform another sensitivity analysis expanding the number of data streams to investigate if the who monitoring scheme is adequate for informing the making of reliable model-based predictions of timelines for achieving lf elimination. to perform this sensitivity analysis, pre-and post-mda data from villupuram district, india that provide extended data points (viz. scenario - as previously defined, plus scenario -baseline, post-mda , post-mda , and post-mda mf prevalence data, and scenario -baseline, post-mda , post-mda , post-mda , and post-mda mf prevalence) were assembled from the published literature [ , ] . the timelines to reach % mf prevalence and the entropy for each of these additional scenarios were calculated to determine whether additional data streams over those recommended by who are required for achieving more reliable model constraints, which among these data might be considered as compulsory, and which might be optional for supporting predictions of elimination. differences in predicted medians, weighted variances and entropy values between data scenarios, models and sites were statistically evaluated using kruskall-wallis tests for equal medians, f-tests for equality of variance, and dse permutation tests, respectively. p-values for assessing significance for all pairwise tests were obtained using the benjamini-hochberg procedure for controlling the false discovery rate, i.e. for protecting against the likelihood of obtaining false positive results when carrying out multiple testing [ ] . here, our goal was twofold. 
first, to determine if data are required to improve the predictability of intervention forecasts by the present lf models in comparison with the use of theoretical models only, and second, to evaluate the benefit of using different longitudinal streams of mf survey data for calibrating the three models in order to determine which data stream was most informative for reducing the uncertainty in model predictions in a site. table summarises the key results from our investigation of these questions: these are the number of accepted best-fitting models for each data stream or scenario in the three study sites (table ) , the predicted median and range ( . th - . th percentiles) in years to achieve the mf threshold of % mf prevalence, the weighted variance and entropy values based on these predictions, and the relative information gained (in terms of reduced prediction uncertainty) by the use of longitudinal data for constraining the projections of each of the three lf models investigated. even though the number of selected best-fit models based on the chi-square criterion (see methods) differed for each site and model, these results indicate unequivocally that models constrained by data provided significantly more precise intervention predictions compared to model-only predictions ( table ). note that this was also irrespective of the two types of longitudinal data assimilation procedures (sequential vs. simultaneous) used by the different models in this study. thus, for all models and sites, model-only predictions made in the absence of data (scenario ) showed the highest prediction uncertainty, highlighting the need for data to improve the predictive performance of the present models. the relative information gained by using each data stream in comparison to model-only predictions further support this finding, with the best gains in reducing model prediction uncertainty provided by those data constraining scenarios that gave the lowest weighted variance and entropy values; as much as % to % reductions in prediction variance were achieved by these scenarios in comparison to modelonly predictions between the three models ( table ). the results also show, however, that data streams including post-mda mf survey data (scenarios and ) reduced model uncertainty (based on both the variance and entropy measures) most compared to data streams containing only baseline and/or post-mda mf survey data (scenarios and ) ( table ) . although there were differences between the three models (due to implementation differences either in how the models are run (monte carlo deterministic vs. individual-based) or in relation to how the present data were assimilated (see above)), overall, scenario , which includes baseline, post-mda , and post-mda data, was shown to most often reduce model uncertainty the greatest. additionally, there was no statistical difference between the performances of scenarios and in those cases where scenario resulted in the greatest gain of information (table ) . it is also noticeable that the best constraining data stream for each combination of site and model also produced as expected the lowest range in predictions of the numbers of years of annual mda required to achieve the % mf prevalence in each site, with the widest ranges estimated for model-only predictions (scenario ) and the shorter data streams (scenario ). 
in general, this constriction in predictions also led to lower estimates of the median times to achieve lf elimination, although this varied between models and sites (table ) . the change in the distributions of predicted timelines to lf elimination without and with model constraining by the different longitudinal data streams is illustrated in fig for the kirare site (see s supplementary information for results obtained for the other two study villages investigated here). the results illustrate that both the location and the length of the tail of the prediction distributions can change as models are constrained with increasing lengths of longitudinal data, with inclusion of post-mda mf survey data consistently producing a narrower or sharper range of predictions compared to when this survey point is excluded. fig compares the uncertainty in predictions of timelines to achieve elimination made by each of the three models without data (scenario ) and via their constraining by the data streams providing the lowest prediction entropy for each of the models per site. note that variations in scenario predictions among the three models directly reflect the different model structures, parameterizations, and the presence (or absence) of stochastic elements. the boxplots in the figure, however, show that for all three sites and models, calibration of each model by data greatly reduces the uncertainty in predictions of the years of annual mda required to eliminate lf compared to model-only predictions, with the data streams producing the lowest entropy for simulations in each site significantly improving the precision of these predictions (table ; in the table, the lowest entropy scenario for each site is bolded and shaded grey, and additional scenarios shaded grey are not significantly different from the lowest entropy scenario). this gain in precision, and thus the information gained using these data streams, is, as expected, greater for the stochastic lymfasim and transfil models compared to the deterministic epifil model. note also that even though the ranges in predictions of the annual mda years required to eliminate lf by the data streams providing the lowest prediction entropy differed statistically between the three models, the values overlapped markedly (e.g. for kirare the ranges are - , - , and - for epifil, lymfasim and transfil, respectively), suggesting the occurrence of a similar constraining of predictive behaviour among the three models. to investigate this potential for a differential model effect, we further pooled the predictions from all three models for all the data scenarios and evaluated the value of each investigated data stream for improving their combined predictive ability. the weighted % percentile intervals from the pooled predictions were used for carrying out this assessment. the results are depicted in fig and indicate that, as for the individual model predictions, uncertainty in the collective predictions by the three lf models for the required number of years to eliminate lf using annual mda in each site may be reduced by model calibration to data, with the longitudinal mf prevalence data collected during the later monitoring periods (scenarios and ) contributing most to improving the multi-model predictions for each site.
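the frequency-weighted summaries and the pooled percentile interval used above can be sketched as follows. the weighted percentile routine is only a rough python stand-in for the matlab wprctile function mentioned in the methods, and the 2.5th/97.5th percentiles used as defaults are our assumption of a conventional 95% interval (the exact percentile values are elided in the extracted text).

```python
import numpy as np

def weighted_mean_var(d, w):
    """Frequency-weighted mean and variance of predicted timelines d with weights w."""
    d, w = np.asarray(d, float), np.asarray(w, float)
    mean = np.sum(w * d) / np.sum(w)
    n_nonzero = np.count_nonzero(w)
    var = np.sum(w * (d - mean) ** 2) / (((n_nonzero - 1) / n_nonzero) * np.sum(w))
    return mean, var

def weighted_percentile(d, w, q):
    """Weighted percentile (q in percent) read off the weighted empirical CDF."""
    d, w = np.asarray(d, float), np.asarray(w, float)
    order = np.argsort(d)
    d, w = d[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)
    return np.interp(q / 100.0, cdf, d)

def pooled_interval(model_predictions, model_weights, lo=2.5, hi=97.5):
    """Pool timelines and weights from several models, then take the weighted
    lower/upper percentiles of the combined ensemble (assumed 95% interval)."""
    d = np.concatenate(model_predictions)
    w = np.concatenate(model_weights)
    return weighted_percentile(d, w, lo), weighted_percentile(d, w, hi)
```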
the boxplots show that by calibrating the models to data streams, more precise predictions are able to be made regarding timelines to achieve % mf prevalence across all models and sites. the results of pairwise f-tests for variance, performed to compare the weighted variance in timelines to achieve % mf prevalence between model-only simulations (scenario ) and the lowest entropy simulations (best scenario) (see table ), show that the predictions for the best scenarios are significantly different from the predictions for the model-only simulations. significance was determined using the benjamini-hochberg procedure for controlling the false discovery rate (q = . ). for epifil, lymfasim and transfil, the best scenarios are scenarios , , and for kirare, scenarios , , and for alagramam, and scenarios , , and for peneng, respectively. we attempted to investigate if model uncertainty in predictions by the use of longitudinal data was a direct function of parameter constraining by the addition of data. given the similarity in outcomes of each model, we remark on the results from the fits of the technically easier to run epifil model to evaluate this possibility here. the assessment of the parameter space constraint achieved through the inclusion of data was made by determining if the fitted parameter distributions for the model became reduced in comparison with priors as data streams were added to the system [ ] . the exercise showed that the size of the estimated parameter distributions reduced with addition of data, with even scenario data producing reductions for kirare and peneng (fig ) . in the case of alagramam, however, there was very little, if any, constraint in the fitted parameter space compared to the prior parameter space. this result, together with the fact that even using all the data in kirare and peneng produced up to only between . to % reductions in fitted parameter distributions when compared to the priors, indicate that the observed model prediction uncertainty in this study may be due to other complex factors connected with model parameterization. table provides the results of an analysis of pairwise parameter correlations of the selected best-fitting models for data scenario compared to those selected by the data stream that gave the best reduction in epifil prediction uncertainty for alagramam (scenario ). these results show that while the parameter space was not being constrained with the addition of more data, the pattern of parameter correlations changed in a complex manner between the two constraining data sets. for example, although the number of significantly correlated parameters did not differ, the magnitude and direction of parameter correlations were shown to change between the two data scenarios ( table ) . the corresponding results for kirare and peneng are shown in s supplementary information , and indicate that a broadly similar pattern of changes in parameter associations also occurred as a result of model calibration to the sequential data measured from those sites. this suggests that this outcome may constitute a general phenomenon at least with regards to the sequential constraining of epifil using longitudinal mf prevalence data. an intriguing finding (from all three data settings) is that the most sensitive parameters in this regard, i.e. 
with respect to altered strengths in pairwise parameter correlations, may be those representing the relationship of various components of host immunity with different transmission processes, including with adult worm mortality, rates of production and survival of mf, larval development rates in the mosquito vector, and infection aggregation (table ) . this suggests that, as more constraining data are added, changes in the multidimensional parameter relationships related to host immunity could contribute to the sequential reductions in lf model predictive uncertainty observed in this study.

table (caption): spearman parameter correlations for scenario (lower left triangle) and scenario (upper right triangle) for alagramam, india.

the lf elimination timeline predictions used above were based on modelling the impacts of annual mda given the reported coverages in each site followed by an assumed standard coverage for making longer term predictions (see methods). this raises the question as to whether the differences detected in the case of the best constraining data stream between the present study sites and between models (table ) could be a function of the simulated mda coverages in each site. to investigate this possibility, we used epifil to model the outcome of changing the assumed mda coverage in each site on the corresponding entropy and information gain trends in elimination predictions made from the models calibrated to each of the site-specific data scenarios/streams investigated here. the results of increasing the assumed coverage of mda to % for each site are shown in fig and indicate that the choice of mda coverage in this study is unlikely to have significantly influenced the conclusion made above that the best performing data streams for reducing model uncertainty for predicting lf elimination pertain to data scenarios and . however, while for kirare and peneng the data stream which most reduced uncertainty did not change when the observed mda coverage was followed by % rather than % coverage (table , fig ) , this was not the case for alagramam, where data from scenario with a % coverage resulted in the greatest reduction in entropy, compared to the original results using % coverage, which indicated that scenario data performed best (table , fig ) . notably, though, the entropy values of predictions using the data scenario and constraints were not statistically different for this site (p-value < . ) (fig ) . epifil was also used to expand the number of calibration scenarios using a dataset with longer term post-mda data from villupuram district, india. this dataset contained two additional data streams: scenario , which included baseline, post-mda , post-mda , and post-mda mf data, and scenario , which included baseline, post-mda , post-mda , post-mda , and post-mda mf data. scenario thus contained the most post-mda data and was demonstrated to be the most effective for reducing model uncertainty, but this effect was not statistically significantly different from the reductions produced by assimilating the data contained in scenarios and (table ) . the inclusion of more data than are considered in scenario therefore did not result in any significant additional reduction in model uncertainty.
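the parameter-space constraint ratio and the pairwise spearman correlation analysis discussed in the preceding paragraphs, together with the benjamini-hochberg adjustment used throughout for multiple testing, can be sketched in a few lines of python. the array shapes and parameter names are assumptions for illustration; these are generic routines, not the authors' code.

```python
import numpy as np
from scipy.stats import spearmanr

def constraint_ratio(prior_draws, fitted_draws):
    """Mean s.d. of the fitted parameter distributions divided by the mean s.d.
    of the priors; a value < 1 indicates a more constrained parameter space.
    Both arrays have shape (n_samples, n_parameters)."""
    return fitted_draws.std(axis=0).mean() / prior_draws.std(axis=0).mean()

def pairwise_spearman(fitted_draws, names):
    """Spearman rank correlation and p-value for every pair of fitted parameters."""
    rho, p = spearmanr(fitted_draws)   # returns matrices for a 2-D input
    return {(names[i], names[j]): (rho[i, j], p[i, j])
            for i in range(len(names)) for j in range(i + 1, len(names))}

def benjamini_hochberg(pvalues, q=0.05):
    """Mark p-values significant while controlling the false discovery rate at q."""
    p = np.asarray(pvalues, float)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below))   # largest rank passing its threshold
        significant[order[:k + 1]] = True
    return significant
```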
fig (caption): for all sites, either scenario or scenario had the lowest entropies, and scenario was not significantly different from scenario for kirare and alagramam. these results were not statistically different from the results given % coverage (see table ), suggesting that the data stream associated with the lowest entropy is robust to changes in the interventions simulated. scenarios where the weighted variance or entropy were not significantly different from the lowest entropy scenario are noted with the abbreviation ns. significance was determined using the benjamini-hochberg procedure for controlling the false discovery rate (q = . ).

table (caption): predictions of timelines to achieve % mf in villupuram district, india, considering extended post-mda data. scenarios which are statistically significantly different from each other are reported by numbers ( - ) as superscripts; for example, superscripts on the weighted variance for scenario indicate that it is significantly different from the weighted variances for scenarios - . significance was determined using the benjamini-hochberg procedure for controlling the false discovery rate (q = . ) in all pairwise statistical tests. information gained by each data stream (scenarios - ) is presented in comparison to the information contained in the model-only simulation (scenario ).

epifil was used to evaluate the accuracy of the data-driven predictions of the timelines required to meet the goal of lf elimination, based on breaching the who-set target of % mf prevalence. this analysis was performed by using the longitudinal pre- and post-mda infection and intervention data reported for the nigerian site, dokan tofa, where elimination was achieved according to who-recommended criteria after seven rounds of mda (table ). the data from this site comprised information on the abr and dominant mosquito genus, as well as details of the mda intervention carried out, including the relevant drug regimen applied, time and population coverage of mda, and outcomes from the mf prevalence surveys conducted at baseline and at multiple time points during mda [ ] . the results of model predictions of the timelines to reach below % mf prevalence, as a result of sequential fitting to the mf prevalence data from this site pertaining to scenarios - (as defined above), are shown in table . note that in the post-mda , and surveys no lf-positive individuals were detected among the sample populations; we therefore used a one-sided % clopper-pearson interval to determine the expected upper one-sided % confidence limits for these sequentially observed zero infection values, using the "rule of three" (after k empty samples) approximation formula [ ] . the results show that model constraining by scenario , which includes baseline and post-mda data, and scenario , which includes baseline, post-mda , and post-mda data, resulted in both the lowest entropy values and the shortest predicted times, i.e., from as low as to as high as years, required for achieving lf elimination in this site (table ). the data in table show that the first instance in which the calculated one-sided upper % confidence limit in this setting fell below % mf prevalence also occurred post mda (i.e. after years of mda).
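for the surveys with zero mf-positive results mentioned above, the exact one-sided upper confidence limit has a simple closed form, which the "rule of three" approximates. a minimal sketch follows; the sample size used in the example is hypothetical.

```python
from scipy.stats import beta

def upper_limit_zero_positives(n, conf=0.95):
    """Exact one-sided upper Clopper-Pearson limit for prevalence when 0 of n
    sampled individuals are mf-positive: the conf-quantile of Beta(1, n),
    equivalently 1 - (1 - conf) ** (1 / n)."""
    return beta.ppf(conf, 1, n)

def rule_of_three(n):
    """Approximate 95% upper bound for zero observed events: 3 / n."""
    return 3.0 / n

# hypothetical sample size of 300: exact bound ~0.0099 (~1.0%) vs 3/300 = 0.01
print(upper_limit_zero_positives(300), rule_of_three(300))
```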
this is a significant result, and indicates that apart from being able to reduce prediction uncertainty, the best data-constrained models are also able to more accurately predict the maximal time ( years) by which lf elimination occurred in this site. our major goal in this study was to compare the reliability of forecasts of timelines required for achieving parasite elimination made by generic lf models versus models constrained by sequential mf prevalence surveillance data obtained from field sites undergoing mda. a secondary aim was to evaluate the relative value of data obtained at each of the sampling time points proposed by the who for monitoring the effects of lf interventions in informing these model predictions. this assessment allowed us to investigate the role of these data for learning system dynamics and measure their value for guiding the design of surveillance programmes in order to support better predictions of the outcomes of applied interventions. fundamentally, however, this work addresses the question of how best to use predictive parasite transmission models for guiding management decision making, i.e. whether this should be based on the use of ideal models which incorporate generalized parameter values or on models with parameters informed by local data [ ] . if we find that data-informed models can reduce prediction uncertainty significantly compared to the use of theoretical models unconstrained by data, then it is clear that to be useful for management decision making we require the application of model-data assimilation frameworks that can effectively incorporate information from appropriate data into models for producing reliable intervention projections. antithetically, such a finding implies that using unconstrained ideal models in these circumstances will provide only approximate predictions characterized by a degree of uncertainty that might be too large to be useful for reliable decision making [ , , ] . here, we have used three state-of-the-art lf models calibrated to longitudinal human mf prevalence data obtained from three representative lf study sites to carry out a systematic analysis of these questions in parasite intervention modelling (see also walker et al [ ] for a recent study highlighting the importance of using longitudinal sentinel site data for improving the prediction performances of the closely-related onchocerciasis models). further, by iteratively testing the reduction in the uncertainty of the projections of timelines required to achieve lf elimination in a site made by the models matching each observed data point, we have also quantified the relative values of temporal data streams, including assessing optimal record lengths, for informing the current lf models. our results provide important insights as to how best to use process models for understanding and generating predictions of parasite dynamics. they also highlight how site-specific longitudinal surveillance data coupled with models can be useful for providing information about system dynamics and hence for improving predictions of relevance to management decision-making. the first result of major importance from our work is that models informed by data can significantly reduce predictive uncertainty and hence improve performance of the present lf models for guiding policy and management decision-making. 
our results show that these improvements in predictive precision were consistent between the three models and across all three of our study sites, and can be very substantive with up to as much as % to % reductions in prediction variance obtained by the best data-constrained models in a site compared to the use of model-only predictions ( table ). the practical policy implications of this finding can also be gleaned from appraising the actual numerical ranges in the predictions made by each individual model for each of the modelling scenarios investigated here. in the case of epi-fil, the best data-informed model (scenario in peneng) gave an elimination prediction range of - years, while the corresponding model-only predictions for this site indicated a need for between - years of annual mda (table ). these gains in information from using data to inform model parameters and hence predictions were even larger for the two stochastic models investigated here, viz. lymfasim and transfil, where ranges as wide as - years predicted by model-only scenarios were reduced to - years for the best data-informed models in the case of lymfasim for kirare village, and from as broad as - years to - years respectively in the case of transfil for peneng (table ). these results unequivocally indicate that if parasite transmission models are used unconstrained by data, i.e. based on general parameter values uninformed by local data, it would lead to the making of predictions that would be marked by uncertainties that are likely to be far too large to be meaningful for practical policy making. if managers are risk averse, this outcome will also mean their need to plan interventions for substantially much longer than necessary, with major implications for the ultimate cost of the programme. note also that although statistically significant changes in the median years of mda required to achieve lf elimination were observed for the best datainformed models for all the three lf model types in each site, these were relatively small compared to the large reductions seen in each model's predictive uncertainly (table , fig ) . this result highlights that the major gains from constraining the present models by data lies in improving their predictive certainty rather than in advancing their average behaviour. however, our preliminary analysis of model predictive accuracy suggests that the best data-constrained models may also be able to generate more accurate predictions of the impact of control ( table ), indicating that, apart from simply reducing predictive uncertainty, such models could additionally have improved capability for producing more reliable predictions of the outcomes of interventions carried out in a setting. the iterative testing of the reduction in forecast uncertainty using mf surveillance data measured at time points proposed by the who (to support assessment of whether the threshold of % mf prevalence has been reached before implementation units can move to post-treatment surveillance [ ]) has provided further insights into the relative value of these data for improving the predictive performance of each of the present lf models. our critical finding here is that parameter uncertainty in all three lf models was similarly reduced by the assimilation of a few additional longitudinal data records (table ). 
in particular, we show that data streams comprising baseline + post-mda + post-mda (scenario ) and those comprising baseline + post-mda data (scenario ) best reduced parameter-based uncertainty in model projections of the impact of mdas carried out in each study site irrespective of the models used. although preliminary, a potential key finding is that the use of longer-term data additional to the data measured at the who proposed monitoring time points did not lead to a significant further reduction in parameter uncertainty (table ) . also, the finding that the who data scenarios and were adequate for constraining the present lf models appears not to be an artefact of variations in the mda coverages observed between the three study sites (fig ) . these results suggest that up to years of post-mda mf prevalence data are sufficient to constrain model predictions of the impact of lf interventions at a time scale that can go up to as high as to years depending on the site and model, and that precision may not improve any further if more new data are added ( table , table ). given that the who post-mda lf infection monitoring protocol was developed for the purpose solely focussed on supporting the meeting of set targets (e.g. the % mf prevalence threshold) and not on a priori hypotheses regarding how surveillance data could be used also to understand the evolution and hence prediction of the dynamical parasitic system in response to management action, our results are entirely fortuitous with respect to the value of the current lf monitoring data for learning about the lf system and its extinction dynamics in different settings [ ] . they do, nonetheless, hint at the value that coupling models to data may offer to inform general theory for guiding the collection and use of monitoring data in parasite surveillance programmes in a manner that could help extract maximal information about the underlying parasite system of interest. our assessment of whether the incremental increase in model predictive performance observed as a result of assimilating longitudinal data may be due to parameter constraining by the addition of data has shed intriguing new light on the impact that qualitative changes in dynamical system behaviour may have on parameter estimates and structure, and hence on the nature of the future projections of system change we can make from models. our major finding in this regard is that even though the parameter space itself may not be overly constrained by the best data stream (scenario in this case for alagramam village), the magnitude and direction of parameter correlations, particularly those representing the relationship of different components of host immunity with various transmission processes, changed markedly between the shorter (scenario ) and seemingly optimal data streams (scenario ). this qualitative change in system behaviour induced by alteration in parameter interactions in response to perturbations has been shown to represent a characteristic feature of complex adaptive ecological systems, particularly when these systems approach a critical boundary [ ] [ ] [ ] . this underscores yet another important reason to incorporate parameter information from data for generating sound system forecasts [ ] . 
the finding that additional data beyond years post-mda did not appear to significantly improve model predictive performance in this regard suggests that pronounced change in lf parameter interactions in response to mda interventions may occur generally around this time point for this parasitic disease, and that once in this parameter regime further change appears to be unlikely. this is an interesting finding, which not only indicates that coupling models to at least years post-mda will allow detection of the boundaries delimiting the primary lf parameter regions with different qualitative behaviour, but also that the current who monitoring protocol might be sufficient to allow this discovery of system change. although our principal focus in this study was in investigating the value of longitudinal data for informing the predictive performance of the current lf models, the results presented here have also underscored the existence of significant spatial heterogeneity in the dynamics of parasite extinction between the present sites ( table , fig ) . in line with our previous findings, this observed conditional dependency of systems dynamics on local transmission conditions means that timelines or durations of interventions required to break lf transmission (as depicted in table ) will also vary from site to site even under similar control conditions [ ] [ ] [ ] ] . as we indicated before, this outcome implies that we vitally require the application of models to detailed spatio-temporal infection surveillance data, such as that exemplified by the data collected by countries in sentinel sites as part of their who-directed monitoring and evaluation activities, if we are to use the present models to make more reliable intervention predictions to drive policy and management decisions (particularly with respect to the durations of interventions required, need for switching to more intensified or new mda regimens, and need for enhanced supplementary vector control) in a given endemic setting [ ] . as we have previously pointed out, the development of such spatially adaptive intervention plans will require the development and use of spatially-explicit data assimilation modelling platforms that can couple geostatistical interpolation of model inputs (eg. abr and/or sentinel site mf/ antigen prevalence data) with discovery of localized models from such data in order to produce the required regional or national intervention forecasts [ ] . the estimated parameter and prediction uncertainties presented here are clearly dependent on the model-data fusion methodology and its implementation, and the cost function used to discover the appropriate models for a data stream [ ] . while we have attempted to evaluate differences in individual model structures, their computer implementation, and the data assimilation procedures followed (e.g. sequential vs. simultaneous data assimilation), via comparing the collective predictions of the three models versus the predictions provided by each model singly, and show that these factors are unlikely to play a major role in influencing the current results, we indicate that future work must address these issues adequately to improve the initial methods we have employed in this work. currently, we are examining the development of sequential bayesian-based multi-model ensemble approaches that will allow better integration of each model's behaviour as well as better calculation of each model's transient parameter space at each time a new observation becomes available [ ] . 
this work also involves the development of a method to fuse information from several indicators of infection (e.g. mf, antigenemia, antibody responses [ ] ) together to achieve a more robust constraining of the present models. as different types of data can act as mutual constraints on a model, we also expect that such multi-indicator model-data fusion methods will additionally address the problem of equifinality, which is known to complicate the parameterization of complex dynamical models [ , ] . of course, the ultimate test of the results reported here, viz. that lf models constrained by coupling to year post-mda data can provide the best predictions of timelines for meeting the % mf prevalence threshold in a site, is the direct validation of our results against independent observations (as demonstrated by the preliminary validation study carried out here using the dokan tofa data (tables and )). we expect that data useful for performing these studies at scale may be available at the sentinel site level in the countries carrying out the current who-led monitoring programme. the present results indicate that access to such data, and to the post-treatment surveillance data which are beginning to be assembled by many countries, is now a major need if the present lf models are to provide maximal information about parasite system responses to management and thus generate better predictions of system states for use in policy making and in judging management effectiveness in different spatiotemporal settings [ , ] . given that previous modelling work has indicated that the globally fixed who-proposed % mf prevalence threshold may be insufficient to break lf transmission in every setting (and may thus conversely lead to significant infection recrudescence [ ] ), the modelling of such spatio-temporal surveillance data will additionally allow testing of whether meeting this recommended threshold will indeed result in successfully achieving the interruption of lf transmission everywhere.

references
the epidemiology of filariasis control. the filaria
epidemiological modelling for monitoring and evaluation of lymphatic filariasis control
mathematical modelling and the control of lymphatic filariasis. the lancet infectious diseases
heterogeneous dynamics, robustness/fragility trade-offs, and the eradication of the macroparasitic disease, lymphatic filariasis
continental-scale, data-driven predictive assessment of eliminating the vector-borne disease, lymphatic filariasis, in sub-saharan africa by
sequential modelling of the effects of mass drug treatments on anopheline-mediated lymphatic filariasis infection in papua new guinea
assessing endgame strategies for the elimination of lymphatic filariasis: a model-based evaluation of the impact of dec-medicated salt
predicting lymphatic filariasis transmission and elimination dynamics using a multi-model ensemble framework
data-model fusion to better understand emerging pathogens and improve infectious disease forecasting
ecological forecasting and data assimilation in a data-rich era
the role of data assimilation in predictive ecology
effectiveness of a triple-drug regimen for global elimination of lymphatic filariasis: a modelling study. the lancet infectious diseases
epidemic modelling: aspects where stochasticity matters. mathematical biosciences
relative information contributions of model vs. data to short- and long-term forecasts of forest carbon dynamics
inference in disease transmission experiments by using stochastic epidemic models
plug-and-play inference for disease dynamics: measles in large and small populations as a case study
the limits to prediction in ecological systems
big data need big theory too
hierarchical modelling for the environmental sciences: statistical methods and applications
parameter and prediction uncertainty in an optimized terrestrial carbon cycle model: effects of constraining variables and data record length
bayesian calibration of simulation models for supporting management of the elimination of the macroparasitic disease, lymphatic filariasis
an improved state-parameter analysis of ecosystem models using data assimilation. ecological modelling
bayesian calibration of mechanistic aquatic biogeochemical models and benefits for environmental management
the model-data fusion pitfall: assuming certainty in an uncertain world
transmission dynamics and control of severe acute respiratory syndrome
curtailing transmission of severe acute respiratory syndrome within a community and its hospital
transmission dynamics of the etiological agent of sars in hong kong: impact of public health interventions
philosophical issues in model assessment. model validation: perspectives in hydrological science
a sequential monte carlo approach for marine ecological prediction
a sequential bayesian approach for hydrologic model selection and prediction
inferences about coupling from ecological surveillance monitoring: approaches based on nonlinear dynamics and information theory. towards an information theory of complex networks
using model-data fusion to interpret past trends, and quantify uncertainties in future projections, of terrestrial ecosystem carbon cycling
rate my data: quantifying the value of ecological data for the development of models of the terrestrial carbon cycle
estimating parameters of a forest ecosystem c model with measurements of stocks and fluxes as joint constraints. oecologia
global mapping of lymphatic filariasis
adaptive management and the value of information: learning via intervention in epidemiology
general rules for managing and surveying networks of pests, diseases, and endangered species
geographic and ecologic heterogeneity in elimination thresholds for the major vector-borne helminthic disease, lymphatic filariasis
modelling strategies to break transmission of lymphatic filariasis-aggregation, adherence and vector competence greatly alter elimination
mathematical modelling of lymphatic filariasis elimination programmes in india: required duration of mass drug administration and post-treatment level of infection indicators
mathematical models for lymphatic filariasis transmission and control: challenges and prospects. parasite vector
the lymfasim simulation program for modeling lymphatic filariasis and its control
the dynamics of wuchereria bancrofti infection: a model-based analysis of longitudinal data from pondicherry
parameter sampling capabilities of sequential and simultaneous data assimilation: ii.
statistical analysis of numerical results an evaluation of models for partitioning eddy covariance-measured net ecosystem exchange into photosynthesis and respiration normalized measures of entropy entropyexplorer: an r package for computing and comparing differential shannon entropy, differential coefficient of variation and differential expression impact of years of diethylcarbamazine and ivermectin mass administration on infection and transmission of lymphatic filariasis controlling the false discovery rate-a practical and powerful approach to multiple testing epidemiological and entomological evaluations after six years or more of mass drug administration for lymphatic filariasis elimination in nigeria the study of plant disease epidemics bayesian data assimilation provides rapid decision support for vector-borne diseases modelling the elimination of river blindness using long-term epidemiological and programmatic data from mali and senegal. epidemics socio-ecological dynamics and challenges to the governance of neglected tropical disease control. infectious diseases of poverty early-warning signals for critical transitions early warning signals of extinction in deteriorating environments practical limits for reverse engineering of dynamical systems: a statistical analysis of sensitivity and parameter inferability in systems biology models multi-sensor model-data fusion for estimation of hydrologic and energy flux parameters. remote sensing of environment key: cord- - g c r authors: kim, yeunbae; cha, jaehyuk title: artificial intelligence technology and social problem solving date: - - journal: evolutionary computing and artificial intelligence doi: . / - - - - _ sha: doc_id: cord_uid: g c r modern societal issues occur in a broad spectrum with very high levels of complexity and challenges, many of which are becoming increasingly difficult to address without the aid of cutting-edge technology. to alleviate these social problems, the korean government recently announced the implementation of mega-projects to solve low employment, population aging, low birth rate and social safety net problems by utilizing ai and icbm (iot, cloud computing, big data, mobile) technologies. in this letter, we will present the views on how ai and ict technologies can be applied to ease or solve social problems by sharing examples of research results from studies of social anxiety, environmental noise, mobility of the disabled, and problems in social safety. we will also describe how all these technologies, big data, methodologies and knowledge can be combined onto an open social informatics platform. a string of breakthroughs in artificial intelligence has placed ai in increasingly visible positions in society, heralding its emergence as a viable, practical, and revolutionary technology. in recent years, we have witnessed ibm's watson win first place in the american quiz show jeopardy! and google's alphago beat the go world champion, and in the very near future, self-driving cars are expected to become a common sight on every street. such promising developments spur optimism for an exciting future produced by the integration of ai technology and human creativity. ai technology has grown remarkably over the past decade. countries around the world have invested heavily in ai technology research and development. major corporations are also applying ai technology to social problem solving; notably, ibm is actively working on their science for social good initiative. 
the initiative will build on the success of the company's noted ai program, watson, which has helped address healthcare, education, and environmental challenges since its development. one particularly successful project used machine learning models to better understand the spread of the zika virus. using complex data, the team developed a predictive model that identified which primate species should be targeted for zika virus surveillance and management. the results of the project are now leading new testing in the field to help prevent the spread of the disease [ ] . on the other hand, investments in technology are generally mostly used for industrial and service growth, while investments for positive social impact appear to be relatively small and passive. this passive attitude seems to reflect the influence of a given nation's politics and policies rather than the absence of technology. for example, in , only . % of the total budget of the korean government's r&d of ict (information and communication technology) was used for social problem solving, but this investment will be increased to % within the next five years as the improvement of korean people's livelihoods and social problems are selected as important issues by the present government [ ] . in addition, new categories within ict, including ai, are required as a key means of improving quality of life and achieving population growth in this country. in this letter, i introduce research on the informatics platform for social problem solving, specifically based on spatio-temporal data, conducted by hanyang university and cooperating institutions. this research ultimately intends to develop informatics and convergent scientific methodologies that can explain, predict and deal with diverse social problems through a transdisciplinary convergence of social sciences, data science and ai. the research focuses on social problems that involve spatio-temporal information, and applies social scientific approaches and data-analytic methods on a pilot basis to explore basic research issues and the validity of the approaches. furthermore, ( ) open-source informatics using convergent-scientific methodology and models, and ( ) the spatio-temporal data sets that are to be acquired in the midst of exploring social problems for potential resolution are developed. in order to examine the applicability of the models and informatics platform in addressing a variety of social problems in the public as well as in private sectors, the following social problems are identified and chosen: . analysis of individual characteristics with suicidal impulse . study on the mobility of the disabled using gps data . visualization of the distribution of anxiety using social network services . big data-based analysis of noise environment and exploration of technical and institutional solutions for its improvement . analysis of the response governance regarding the middle eastern respiratory syndrome (mers) the research issues in the above social problems are explored, and the validity of the convergent-scientific methodologies are tested. the feasibility for the potential resolution of the problems are also examined. the relevant data and information are stored in a knowledge base (kb), and at the same time research methods that are used in data extraction, collection, analysis and visualization are also developed. 
furthermore, the kb and the method database are merged into an open informatics platform in order to be used in various research projects, business activities, and policy debates. while suicide rates in oecd countries are declining, only south korea has increasing suicide rates; moreover, korea currently has the highest suicide rate among oecd countries as shown in fig. . its high suicide rate is one of korea's biggest social problems, entailing the establishment of effective suicide prevention measures by understanding the causes of suicide. the goals of the research are to: ( ) understand suicidal impulse by analyzing the characteristics of members of society according to suicidal impulse experience; ( ) predict the likelihood of attempting suicide and analyzing the spatio-temporal quality of life; and ( ) to establish a policy to help prevent suicide. the korean social survey and survey of youth health status data are used for the analysis of suicide risk groups through data mining techniques, using a predictive model based on cell propagation to overcome the limitations of existing statistics methods such as characterization or classification. in the case of the characterization technique, results indicate that there are too many features related to suicide, and that there are variables including many categorical values, making it difficult to identify the variables that affect suicide. on the other hand, the classification technique had difficulties identifying the variables that affect suicide because the number of members attempting suicide was too small. correlations between suicide impulses and individual attributes of members of society and the trends of the correlations by year are obtained. the concepts of support, confidence and density are introduced to identify risk groups of suicide attempts, and computational performance problems caused by excessive numbers of risk groups are solved by applying a convex growing method. the y social survey including personal and household information of members of the society are used for analysis. the attributes include gender, age, education, marital status, level of satisfaction, disability status, occupation status, housing, and household income. the high-risk suicide cluster was identified using a small number of convexes. a convex is a set of cells, with one cell being the smallest unit of the cluster for the analysis, and a density is the ratio of the number of non-empty cells to the total number of cells in convex c [ ] . figure shows that the highest suicidal risk group c is composed of members with low income and education level. it was identified that level of satisfaction with life has the highest impact on suicidal impulse, followed in order of impact by disability, marital status, housing, household income, occupation status, gender, age and level of education. the results showed that women and young people tend to have more suicidal impulse. new prediction models with other machine learning methods and the establishment of mitigation policies are still in development. subjective analyses of change of wellbeing, social exclusion, and characteristics of spatio-temporal analysis will also be explored in the future. mobility rights are closely related to quality of life as a part of social rights. therefore social efforts are needed to guarantee mobility rights to both the physically and mentally disabled. the goal of the study is to suggest a policy for the extension of mobility rights of the disabled. 
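before turning to the mobility study, the cell, convex and density measures used in the suicide-risk analysis above can be made concrete with a small sketch. the records, attribute names and the candidate convex below are invented for illustration and are not the authors' data or code; the sketch only shows how support (share of respondents falling into the convex), confidence (share of those respondents reporting suicidal impulse) and density (ratio of non-empty cells to all cells in the convex) could be computed once survey answers have been discretised into cells.

```python
# illustrative sketch only (not the authors' implementation); toy records and
# attribute values are invented. a "cell" is one combination of discretised
# attribute values; a "convex" is modelled here simply as a set of such cells.

records = [
    # (gender, income band, education band, reported suicidal impulse)
    ("male", "low_income", "low_edu", True),
    ("female", "low_income", "low_edu", True),
    ("female", "low_income", "mid_edu", False),
    ("male", "high_income", "high_edu", False),
    ("female", "low_income", "low_edu", True),
    ("male", "mid_income", "mid_edu", False),
]

def cell_of(record):
    """drop the label; the remaining attribute values identify the cell."""
    return record[:-1]

candidate_convex = {
    ("male", "low_income", "low_edu"),
    ("female", "low_income", "low_edu"),
    ("female", "low_income", "mid_edu"),
    ("male", "low_income", "mid_edu"),   # this cell happens to be empty
}

members = [r for r in records if cell_of(r) in candidate_convex]
positives = [r for r in members if r[-1]]
non_empty_cells = {cell_of(r) for r in members}

support = len(members) / len(records)              # how much of the survey the convex covers
confidence = len(positives) / len(members)         # share of members reporting suicidal impulse
density = len(non_empty_cells) / len(candidate_convex)  # non-empty cells / all cells in the convex

print(f"support={support:.2f}, confidence={confidence:.2f}, density={density:.2f}")
```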
in order to achieve this, travel patterns and sociodemographic characteristics of the physically impaired with low levels of mobility are studied. the study focused on individuals with physical impairments as the initial test group as a means to eventually gain insight into the mobility of the wider disabled population. conventional studies on mobility measurement obtained data from travel diaries, interviews, and questionnaire surveys. a few studies used geo-location tracking gps data. gps data is collected via mobile device and used to analyze the mobility patterns (distance, speed, frequency of outings) by using regression analysis, and to search for methods to extend mobility. a new metrics for mobility with a new indicator (travel range) was developed, and the way mobility impacts the quality of life of the disabled has been verified [ ] . about people with physical disabilities participated and collected more than , geo-location data over a month using an open mobile application called traccar. their trajectories are visualized based on the gps data as shown in fig. . the use of location data explained mobility status better than the conventional questionnaire survey method. the questionnaire surveyed mainly the frequency of outings over a certain period and number of complaints about these outings. gps data enabled researchers to conduct empirical observations on distance and range of travel. it was found that the disabled preferred bus routes that visit diverse locations over the shortest route. age and monthly income are negatively associated with a disabled individual's mobility. based on the research results, the following has been suggested: ( ) development of new bus routes for the disabled and ( ) recommendation of a new location for the current welfare center that would enable a greater range of travel. further study on travel patterns by using indoor positioning technology and cctv image data will be deployed. many social issues including political polarization, competition in private education, increases in suicide rate, youth unemployment, low birth rate, and hate crime have anxiety as their background. the increase of social anxiety can intensify competition and conflict, which can interfere with social solidarity and cause a decrease in social trust. existing social science research mainly focused on grasping public opinion through questionnaires, and ignored the role of emotions. the internet and social media were used to access emotional traits since they provide a platform not only for the active exchange of information, but also for the sharing and diffusion of emotional responses. if such emotional responses on the internet and geo-locations can be captured in realtime through machine learning, their spatio-temporal distribution could be visualized in order to observe their current status and changes by geographical region. a visualization system was built to map the regional and temporal distribution of anxiety psychology by combining spatio-temporal information using sns (twitter) with sentiment analysis. a twitter message collecting crawler was also developed to build a dictionary and tweet corpus. based on these, an automatic classification system of anxiety-related messages was developed for the first time in korea by applying machine learning to visualize the nationwide distribution of anxiety (see fig. ) [ ] . an average of , tweets with place_id are collected using open api twitter j. to date, about , units of data have been collected. 
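a minimal sketch of the kind of classifier applied to such a corpus is shown below: a bag-of-words naive bayes model trained on a few invented english example tweets. this is not the authors' code, and the deployed system works on korean text with a far larger labelled corpus; the sketch only shows the general shape of the classification step described next.

```python
# illustrative sketch only: tiny bag-of-words naive bayes anxiety classifier.
# example tweets and labels are invented; the real corpus is korean and much larger.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train_texts = [
    "i can't sleep, so worried about tomorrow",
    "everything feels uncertain and it scares me",
    "what if i lose my job next month",
    "great dinner with friends tonight",
    "watching the game and relaxing",
    "lovely walk along the river today",
]
train_labels = [1, 1, 1, 0, 0, 0]        # 1 = anxiety-related, 0 = other

test_texts = ["so anxious about the results", "sunny afternoon at the park"]
test_labels = [1, 0]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)
pred = clf.predict(vectorizer.transform(test_texts))
print("accuracy on the toy test set:", accuracy_score(test_labels, pred))
# a real pipeline would keep each tweet's place_id alongside its prediction so that
# the share of anxiety-related tweets can be aggregated and mapped per district over time.
```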
a naïve bayes classifier was used for anxiety identification. an accuracy of . % was obtained by using , and , anxiety and non-anxiety tweets as training data respectively, and and , anxiety and non-anxiety tweets as testing data, respectively. the system indicated the existence of regional disparities in anxiety emotions. it was found that twitter users who reside in politicized regions have a lower degree of disclosure about their residing areas. this can be interpreted as the act of avoiding situations where the individual and the political position of the region coincide. as anxiety is not a permanent characteristic of an individual, it can change depending on the time and situation, making it difficult to measure by questionnaire survey at any given time. the twitter-based system can compensate for the limitations of such a survey method because it can continuously classify accumulated tweet text data and provide a temporal visualization of anxiety distribution at a given time within a desired visual scale (by ward, city, province and nationwide) as shown in fig. . environmental issues are a major social concern in our age, and interest has been increasing not only in the consequences of pollution but also in the effects of general environmental aesthetics on quality of life. there is much active effort to improve the visual environment, but not nearly as much interest has been given to improve the auditory environment. until now, policies on the auditory environment have remained passive countermeasures to simply quantified acoustic qualities (e.g., volume in db) in specific places such as construction sites, railroads, highways, and residential areas. they lack a comprehensive study of contextual correlations, such as the physical properties of sound, the environmental factors in time and space, and the human emotional response of noise perception. the goal of this study is to provide a cognitive-based, human-friendly solution to improve noise problems. in order to achieve this, the study aimed to ( ) develop a tool for collecting sound data and converting into a sound database, and ( ) build spatiotemporal features and a management platform for indoor and outdoor noise sources. first, pilot experiments were conducted to predict the indicators that measure emotional reactions by developing a handheld device application for data collection. three separate free-walking experiments and in-depth interviews were conducted with subjects at international airport lobbies and outdoor environments. through the experiment, the behavior patterns of the subjects in various acoustic environments were analyzed, and indicators of emotional reactions were identified. it was determined that the psychological state and the personal environment of the subject are important indicators of the perception of the auditory environment. in order to take into account both the psychological state of the subject and the physical properties of the external sound stimulus, an omnidirectional microphone is used to record the entire acoustic environment. subjects with smartphones with the built-in application walked for an hour in downtown seoul for data collection. on the app, after entering the prerequisite information, subjects pressed 'good' or 'bad' whenever they heard a sound that caught their attention. 
pressing the button would record the sound for s, and subjects were additionally asked to answer a series of questions about the physical characteristics of the specific location and the characteristics of the auditory environment. during the one-hour experiment, about sound environment reports were accumulated, with one subject reporting the sound characteristics from an average of different places. unlike previous studies, the subjects' paths were not pre-determined, and the position, sound and emotional response of the subject are collected simultaneously. the paths can be displayed to analyze the relations of the soundscapes to the paths (fig. ) . the study helped to build a positive auditory environment for specific places, to provide policy data for noise regulation and positive auditory environments, to identify the contexts and areas that are alienated from the auditory environment, and to extend the social meaning of "noise" within the study of sound. the development and spread of new infectious diseases are increasing due to the expansion of international exchange. as can be seen from the mers outbreak in korea in , epidemics have profound social and economic impacts. it is imperative to establish an effective shelter and rapid response system (rrs) for infectious diseases control. the goal of the study is to compare the official response system with the actual response system in order to understand the institutional mechanism of the epidemic response system, and to find effective policy alternatives through the collaboration of policy scholars and data scientists. web-based newspaper articles were analyzed to compare the official crisis response system designed to operate in outbreaks to the actual crisis response. an automatic news article crawling tool was developed, and , mers-related articles were collected, clustered and stored in the database (fig. ) . in order to manage and search for news articles related to mers from the article database, a curation tool was developed. this tool is able to extract information into triplet graphs (subjects/verbs/objects) from the articles by applying natural language processing techniques. a basic dictionary for the analysis of the infectious disease response system was created based on the extracted triplet information. the information extracted by the curation tool is massive and complex, which limits the ability to correctly understand and interpret information. a tool for visualizing information at a specific time with a network graph was developed and utilized to facilitate analysis and visualization of the networks (fig. ). all tools are integrated into a single platform to maximize the efficiency of the process. as for the official crisis response manual in case of an infectious disease, social network analysis indicated that while the national security bureau (nsb) and public health centers play as large a role as the center for disease control (cdc) in crisis management, the analysis of the news articles showed that the nsb was in fact rarely mentioned. it was found that the cdc and central disaster response headquarters, the official government organizations that deal with infectious diseases, as well as the central mers management countermeasures & support headquarters, a temporarily established organization, were not playing an important role in response to the mers outbreak.
on the other hand, the ministry of health and welfare, medical institutions, and local governments all have played a central role in responding to mers. this means that the structure and characteristics of the command & control and communication in the official response system seems to have a decisive influence on the cooperative response in a real crisis response. these results provided concrete information on the role of each respondent and the communication system that previous studies based on interviews and surveys have not found. much research based on machine learning has been criticized for giving more importance on method itself from the start rather than focusing on data reliability. this study is based on a kb in which policy researchers manually analyze news articles and prepare basic data by tagging them. this way, it provides a basis for improving the reliability of results when executing text mining work through machine learning. by using text mining techniques and social network analysis, it is possible to get a comprehensive view of social problems such as the occurrence of infectious diseases by examining the structure and characteristics of the response system from a holistic perspective of the entire system. with the results of this study, new policies for infectious disease control are suggested in the following directions: ( ) strengthen cooperation networks in early response systems of infectious diseases; ( ) develop new, effective and efficient management plans of cooperative networks; and ( ) create new research to cover other diseases such as avian influenza and sars [ ] . an ever-present obstacle in the traditional social sciences when addressing social issues are the difficulties of obtaining evidences from massive data for hypothesis and theory verification. data science and ai can ease such difficulties and support social science by discovering hidden contexts and new patterns of social phenomena via low-cost analyses of large data. on the other hand, knowledge and patterns derived by machine learning from a large data set with noise often lack validity. although data-driven inductive methods are effective for finding patterns and correlations, there is a clear limitation to discovering causal relationships. social science can help data science and ai by interpreting social phenomena through humanistic literacy and social-scientific thought to verify theoretical validity, and identifying causal relationships through deductive and qualitative approaches. this is why we need convergent-scientific approaches for social problem solving. convergent approaches offer the new possibility of building an informatics platform that can interpret, predict and solve various social problems through the combination of social science and data science. in all pilot studies, the convergent-scientific approaches are found valid and sound. most of the research agendas involved the real-time collection and development of spatio-temporal databases in a real-time manner, and analytic visualization of the results. such visualization promises new possibilities in data interpretation. the data sets and tools for data collection, analysis and visualization are integrated onto an informatics platform so that they can be used in future research projects and policy debates. the research was the first transdisciplinary attempt to converge social sciences and data sciences in korea. this approach will offer a breakthrough in predicting, preventing and addressing future social problems. 
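as an illustration of the triplet-and-network step used in the mers study above, the sketch below builds an actor network from a handful of invented (subject, verb, object) triplets and ranks actors by degree centrality. the real curation tool extracts such triplets from korean news articles with natural language processing, which is not reproduced here; the actor names and relations are stand-ins.

```python
# illustrative sketch only: build an actor network from (subject, verb, object)
# triplets and rank actors by degree centrality. the triplets are invented examples,
# not output of the authors' curation tool.
import networkx as nx

triplets = [
    ("ministry of health and welfare", "coordinates", "local governments"),
    ("ministry of health and welfare", "instructs", "medical institutions"),
    ("medical institutions", "report to", "ministry of health and welfare"),
    ("local governments", "operate", "quarantine facilities"),
    ("cdc", "publishes", "case counts"),
]

g = nx.DiGraph()
for subj, verb, obj in triplets:
    g.add_edge(subj, obj, relation=verb)      # the edge attribute keeps the verb

centrality = nx.degree_centrality(g)           # who is most connected in the coverage
for actor, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{score:.2f}  {actor}")
```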
the research methodology, as a trailblazer, will offer new ground for a research field of a transdisciplinary nature converging data sciences, ai and social sciences. the data, information, knowledge, methodologies, and techniques will all be combined onto an open informatics platform. the platform will be maintained on an open-source basis so that it can be used as a hub for various academic research projects, business activities, and policy debates (see fig. ). the open informatics platform is planned to be expanded to incorporate citizen sensing, in which people's observations and views are shared via mobile devices and internet services in the future [ ] . in the area of social problem solving, fundamental problems have complex political, social and economic aspects that have their roots in human nature. both technical and social approaches are essential for tackling social problem solving. in fact, it is the fig. . structure of informatics platform integrated, orchestrated marriage between the two that would bring us closer to effective social problem management. we need to first study and carefully define the indicators specific to a given social problem or domain. there are many qualitative indicators that cannot be directly and explicitly measured such as social emotions, basic human needs and rights, and life fulfillment [ ] . if the results of machine learning are difficult to measure or include combinations of results that are difficult to define, that particular social problem may not be suitable for machine learning. therefore, there is a need for new social methods and algorithms that can accurately collect and identify the measurable indicators from opinions of social demanders. recently, mit has developed a device to quantitatively measure social signals. the small, lightweight wearable device contains sensors that record the people's behaviors (physical activity, gestures, and the amount of variation in speech prosody, etc.) [ ] . machine learning technologies working on already existing data sets are relatively inexpensive compared to conventional million-dollar social programs since machine learning tools can be easily extended. however, they can introduce bias and errors depending on the data content used to train machine learning models or can also be misinterpreted. human experts are always needed to recognize and correct erroneous outputs and interpretations in order to prevent prejudices [ ] . in the development of ai applications, a great amount of time and resources are required to sort, identify and refine data to provide massive data for training. for instance, machine learning models need to learn millions of photos to recognize specific animals or faces, but human intelligence is able to recognize visual cues by looking at only a few photos. perhaps it is time to develop a new ai framework which can infer and recognize objects based on small amounts of data, such as transfer learning [ ] , generate lacking data (gan), or integrate traditional ai technologies, such as symbolic ai and statistical machine learning into new frameworks. machine learning is excellent in predicting, but many social problem solutions do not depend on predictions. the organic ways solutions to specific problems actually unfold according to new policies and programs can be more practical and worth studying than building a cure-all machine learning algorithm. while the evolution of ai is progressing at a stunning rate, there are still challenges to solving social problems. 
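as a small illustration of the transfer-learning idea mentioned above, reusing a model pre-trained on a large dataset when only a little task-specific data is available, the sketch below freezes a pre-trained image backbone and retrains only its final layer. the model choice, the two-class output and the random batch are placeholders rather than anything from the studies described here.

```python
# illustrative sketch only: fine-tune the last layer of a pre-trained network on a
# tiny labelled dataset (a random batch stands in for real data). requires
# torch/torchvision; the pretrained weights are downloaded on first use.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained backbone
for param in model.parameters():
    param.requires_grad = False                  # freeze everything...
model.fc = nn.Linear(model.fc.in_features, 2)    # ...except a new 2-class output layer

optimiser = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)             # placeholder for a small labelled set
labels = torch.tensor([0, 1, 0, 1])

model.train()
optimiser.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimiser.step()
print("one fine-tuning step done, loss =", float(loss))
```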
further research on the integration of social science and ai is required. a world in which artificial intelligence actually makes policy decisions is still hard to imagine. considering the current limitations and capabilities of ai, ai should primarily be used as a decision aid.
references:
ibm: science for social good - applying ai, cloud and deep science toward new societal challenges
analyzing suicide-ideation survey to identify high-risk groups: a data mining approach
mobility among people with physical impairment: a study using geo-location tracking data
sns data visualization for analyzing spatial-temporal distribution of social anxiety
application of network analysis into emergency response: focusing on the outbreak of the middle-eastern respiratory syndrome in korea
a platform for citizen sensing in sentient cities
chapter : the sociology of social indicators
to signal is human
a guide to solving social problems with machine learning
how transferable are features in deep neural networks? in: advances in neural information processing systems (nips ). nips foundation
acknowledgements. this work was supported by the national research foundation of korea (nrf) grant funded by the korean government (msit ) (no. r a a ).
key: cord- -gvezk vp authors: ahonen, pasi; alahuhta, petteri; daskala, barbara; delaitre, sabine; hert, paul de; lindner, ralf; maghiros, ioannis; moscibroda, anna; schreurs, wim; verlinden, michiel title: safeguards date: journal: safeguards in a world of ambient intelligence doi: . / - - - - _ sha: doc_id: cord_uid: gvezk vp the multiplicity of threats and vulnerabilities associated with ami will require a multiplicity of safeguards to respond to the risks and problems posed by the emerging technological systems and their applications. in some instances, a single safeguard might be sufficient to address a specified threat or vulnerability. more typically, however, a combination of safeguards will be necessary to address each threat and vulnerability. in still other instances, one safeguard might apply to numerous threats and vulnerabilities. one could depict these combinations in a matrix or on a spreadsheet, but the spreadsheet would quickly become rather large and, perhaps, would be slightly misleading. just as the ami world will be dynamic, constantly changing, the applicability of safeguards should also be regarded as dynamic, i.e., different and new safeguards may need to be introduced in order to cope with changes in the threats and vulnerabilities.
for the purpose of this chapter, we have grouped safeguards into three main categories: the main privacy-protecting principles in network applications are: • anonymity (which is the possibility to use a resource or service without disclosure of user identity) • pseudonymity (the possibility to use a resource or service without disclosure of user identity, but to be still accountable for that use) • unlinkability (the possibility to use multiple resources or services without others being able to discover that these resources are being used by the same user) prime project, which studied the state of the art in privacy protection in network applications in , pointed out many performance problems and security weaknesses. the challenges in privacy-enhancing technologies for networked applications include developing methods for users to express their wishes regarding the processing of their data in machine-readable form ("privacy policies") and developing methods to ensure that the data are indeed processed according to users' wishes and legal regulations ("licensing languages" and "privacy audits": the former check the correctness of data processing during processing, while the latter check afterwards and should allow checking even after the data are deleted). privacy protection research is still new, and research on privacy protection in such emerging domains as personal devices, smart environments and smart cars is especially still in its infancy. privacy protection for personal mobile devices is particularly challenging due to the devices' limited capabilities and battery life. for these domains, only generic guidelines have been developed (see lahlou et al. ). langheinrich et al. show how difficult it might be to apply fair information practices (as contained in current data protection laws) to ami applications. most of the research on privacy protection is concerned with dangers of information disclosure. other privacy aspects have not received much attention from researchers. for example, the privacy aspect known as "the right to be let alone" is rarely discussed by technology researchers, despite its importance. research is needed with regard to overcoming the digital divide in the context of ami. the european commission has already been sponsoring some research projects which form a foundation for needed future initiatives. projects dealing with accessibility for all and e-inclusion (such as cost : "accessibility for all to services and terminals for next generation mobile networks", ask-it: "ambient intelligence system of agents for knowledge-based and integrated services for camenisch, j. (ed.), first annual research report, prime deliverable d . , . http:// www.prime-project.eu.org/public/prime_products/deliverables/rsch/pub_del_d . .a_ec_ wp . _v _final.pdf such research is, however, going on. an example is the ec-supported connect project, which aims to implement a privacy management platform within pervasive mobile services, coupling research on semantic technologies and intelligent agents with wireless communications (including umts, wifi and wimax) and context-sensitive paradigms and multimodal (voice/graphics) interfaces to provide a strong and secure framework to ensure that privacy is a feasible and desirable component of future ambient intelligence applications. the two-year project started in june . http://cordis.europa.eu/search/index.cfm?fuseaction = proj.simpledocument&pj_rcn = lahlou, s., and f. 
jegou, "european disappearing computer privacy design guidelines v ", ambient agora deliverable d . , electricité de france, clamart, . mobility impaired users") are concerned with standardisation, intuitive user interfaces, personalisation, interfaces to all everyday tools (e.g., domotics, home health care, computer accessibility for people with disabilities and elderly people), adaptation of contents to the channel capacity and the user terminal and so on. standardisation in the field of information technology (including, e.g., biometrics) is important in order to achieve interoperability between different products. however, interoperability even in fairly old technologies (such as fingerprint-based identification) has not yet been achieved. minimising personal data should be factored into all stages of collection, transmission and storage. the goal of the minimal data transmission principle is that data should reveal little about the user even in the event of successful eavesdropping and decryption of transmitted data. similarly, the principle of minimal data storage requires that thieves do not benefit from stolen databases and decryption of their data. implementation of anonymity, pseudonymity and unobservability methods helps to minimise system knowledge about users at the stages of data transmission and storage in remote databases, but not in cases involving data collection by and storage in personal devices (which collect and store mainly the device owner's data) or storage of videos. the main goals of privacy protection during data collection are, first, to prevent linkability between diverse types of data collected about the same user and, second, to prevent surveillance by means of spyware or plugging in additional pieces of hardware transmitting raw data (as occurs in wiretapping). these goals can be achieved by: • careful selection of hardware (so that data are collected and transmitted only in the minimally required quality and quantity to satisfy an application's goals, and there are no easy ways to spy on raw and processed data) • an increase of software capabilities and intelligence (so that data can be processed in real time) • deleting data as soon as the application allows. in practice, it is difficult to determine what "minimally needed application data" means. moreover, those data can be acquired by different means. thus, we suggest that data collection technologies less capable of violating personal privacy expectations be chosen over those more privacy-threatening technologies even if the accuracy of collected data decreases. software capabilities need to be maximised in order to minimise storage of raw data and avoid storage of data with absolute time and location stamps. we suggest this safeguard in order to prevent accidental logging of sensitive data, because correlation of different kinds of data by time stamps is fairly straightforward. these safeguards are presented below in more detail: • in our opinion, the most privacy-threatening technologies are physiological sensors and video cameras. physiological sensors are privacy-threatening because they reveal what's going on in the human body and, accordingly, reveal health data and even feelings. video cameras, especially those storing raw video data, are privacy-threatening because they violate people's expectations that "nobody can see me if i am hidden behind the wall" and because playback of video data can reveal more details than most people pay attention to in normal life. 
we suggest that usage of these two groups of devices should be restricted to safety applications until proper artificial intelligence safeguards (see below) are implemented. • instead of logging raw data, only data features (i.e., a limited set of pre-selected characteristics of data, e.g., frequency and amplitude of oscillations) should be logged. this can be achieved by using either hardware filters or real-time preprocessing of data or a combination of both. • time and location stamping of logged data should be limited by making it relative to other application-related information or by averaging and generalising the logged data. • data should be deleted after an application-dependent time, e.g., when a user buys clothes, all information about the textile, price, designer, etc., should be deleted from the clothes' rfid tag. for applications that require active rfid tags (such as for finding lost objects ), the rfid identifier tag should be changed, so that links between the shop database and the clothes are severed. • applications that do not require constant monitoring should switch off automatically after a certain period of user inactivity (e.g., video cameras should automatically switch off at the end of a game). • anonymous identities, partial identities and pseudonyms should be used wherever possible. using different identities with the absolute minimum of personal data for each application helps to prevent discovery of links between user identity and personal data and between different actions by the same user. orr, r.j., r. raymond, j. berman and f. seay, "a system for finding frequently lost objects in the home", technical report - , graphics, visualization, and usability center, georgia tech, . data and software protection from malicious actions should be implemented by intrusion prevention and by recovery from its consequences. intrusion prevention can be active (such as antivirus software, which removes viruses) or passive (such as encryption, which makes it more difficult to understand the contents of stolen data). at all stages of data collection, storage and transmission, malicious actions should be hindered by countermeasures such as the following: privacy protection in networking includes providing anonymity, pseudonymity and unobservability whenever possible. when data are transferred over long distances, anonymity, pseudonymity and unobservability can be provided by the following methods: first, methods to prove user authorisation locally and to transmit over the network only a confirmation of authorisation; second, methods of hiding relations between user identity and actions by, for example, distributing this knowledge over many network nodes. for providing anonymity, it is also necessary to use special communication protocols which do not use device ids or which hide them. it is also necessary to implement authorisation for accessing the device id: currently, most rfid tags and bluetooth devices provide their ids upon any request, no matter who actually asked for the id. another problem to solve is that devices can be distinguished by their analogue radio signals, and this can hinder achieving anonymity. additionally, by analysing radio signals and communication protocols of a personal object, one can estimate the capabilities of embedded hardware and guess whether this is a new and expensive thing or old and inexpensive, which is an undesirable feature. 
unobservability can be implemented to some extent in smart spaces and personal area networks (pans) by limiting the communication range so that signals do not penetrate the walls of a smart space or a car, unlike the current situation when two owners of bluetooth-enabled phones are aware of each other's presence in neighbouring apartments. methods of privacy protection in network applications (mainly long-distance applications) include the following: • anonymous credentials (methods to hide user identity while proving the user's authorisation). • a trusted third party: to preserve the relationships between the user's true identity and his or her pseudonym. • zero-knowledge techniques that allow one to prove the knowledge of something without actually providing the secret. • secret-sharing schemes: that allow any subset of participants to reconstruct the message provided that the subset size is larger than a predefined threshold. • special communication protocols and networks such as: -onion routing: messages are sent from one node to another so that each node removes one encryption layer, gets the address of the next node and sends the message there. the next node does the same, and so on until some node decrypts the real user address. -mix networks and crowds that hide the relationship between senders and receivers by having many intermediate nodes between them. • communication protocols that do not use permanent ids of a personal device or object; instead, ids are assigned only for the current communication session. communication protocols that provide anonymity at the network layer, as stated in the prime deliverable, are not suitable for large-scale applications: there is no evaluation on the desired security level, and performance is a hard problem. strong access control methods are needed in ami applications. physical access control is required in applications such as border control, airport check-ins and office access. reliable user authentication is required for logging on to computers and personal devices as well as network applications such as mobile commerce, mobile voting and so on. reliable authentication should have low error rates and strong anti-spoofing protection. work on anti-spoofing protection of iris and fingerprint recognition is going on, but spoofing is still possible. we suggest that really reliable authentication should be unobtrusive, continuous (i.e., several times during an application-dependent time period) and multimodal. so far, there has been limited research on continuous multimodal access control systems. authentication methods include the following: camenish, . some experts don't believe that biometrics should be the focus of the security approach in an ami world, since the identification and authentication of individuals by biometrics will always be approximate, is like publishing passwords, can be spoofed and cannot be revoked after an incident. tokens are portable physical devices given to users who keep them in their possession. implants are small physical devices, embedded into a human body (nowadays they are inserted with a syringe under the skin). implants are used for identification by unique id number, and some research aims to add a gps positioning module in order to detect the user's location at any time. with multimodal fusion, identification or authentication is performed by information from several sources, which usually helps to improve recognition rates and anti-spoofing capabilities. 
multimodal identification and/or authentication can also be performed by combining biometric and non-biometric data. methods for reliable, unobtrusive authentication (especially for privacy-safe, unobtrusive authentication) should be developed. unobtrusive authentication should enable greater security because it is more user-friendly. people are not willing to use explicit authentication frequently, which reduces the overall security level, while unobtrusive authentication can be used frequently. methods for context-dependent user authentication, which would allow user control over the strength and method of authentication, should be developed, unlike the current annoying situation when users have to go through the same authentication procedure for viewing weather forecasts and for viewing personal calendar data. recently, the meaning of the term "access control" has broadened to include checking which software is accessing personal data and how the personal data are processed. access control to software (data processing methods) is needed for enforcing legal privacy requirements and personal privacy preferences. user-friendly interfaces are needed for providing awareness and configuring privacy policies. maintaining privacy is not at the user's focus, so privacy information should not be a burden for a user. however, the user should easily be able to know and configure the following important settings: • purpose of the application (e.g., recording a meeting and storing the record for several years) • how much autonomy the application has • information flow from the user • information flow to the user (e.g., when and how the application initiates interactions with the user). additionally, user-friendly methods are needed for fast and easy control over the environment, which would allow a person (e.g., a home owner but not a thief) to override previous settings, and especially those settings learned by ami technologies. standard concise methods of initial warnings should be used to indicate whether privacy-violating technologies (such as those that record video and audio data, log personal identity data and physiological and health data) are used by ambient applications. licensing languages or ways to express legal requirements and user-defined privacy policies should be attached to personal data for the lifetime of their transmission, storage and processing. these would describe what can be done with the data in different contexts (e.g., in cases involving the merging of databases), and ensure that the data are really treated according to the attached licence. these methods should also facilitate privacy audits (checking that data processing has been carried out correctly and according to prescribed policies), including instances when the data are already deleted. these methods should be tamper-resistant, similar to watermarking. high-level application design to provide an appropriate level of safeguards for the estimated level of threats can be achieved by data protection methods such as encryption and by avoiding usage of inexpensive rfid tags that do not have access control to their id and by minimising the need for active data protection on the part of the user. high-level application design should also consider what level of technology control is acceptable and should provide easy ways to override automatic actions. 
when communication capabilities move closer to the human body (e.g., embedded in clothes, jewellery or watches), and battery life is longer, it will be much more difficult to avoid being captured by ubiquitous sensors. it is an open question how society will adapt to such increasing transparency, but it would be beneficial if the individual were able to make a graceful exit from ami technologies at his or her discretion. to summarise, the main points to consider in system design are: • data filtering on personal devices is preferred to data filtering in an untrustworthy environment. services (e.g., location-based services) should be designed so that personal devices do not have to send queries; instead, services could simply broadcast all available information to devices within a certain range. such an implementation can require more bandwidth and computing resources, but is safer because it is unknown how many devices are present in a given location. thus, it is more difficult for terrorists to plan an attack in a location where people have gathered. • authorisation should be required for accessing not only personal data stored in the device, but also for accessing device id and other characteristics. • good design should enable detection of problems with hardware (e.g., checking whether the replacement of certain components was made by an authorised person or not). currently, mobile devices and smart dust nodes do not check anything if the battery is removed, and do not check whether hardware changes were made by an authorised person, which makes copying data from external memory and replacement of external memory or sensors relatively easy, which is certainly inappropriate in some applications, such as those involved in health monitoring. • personal data should be stored not only encrypted, but also split according to application requirements in such a way that different data parts are not accessible at the same time. • an increase in the capabilities of personal devices is needed to allow some redundancy (consequently, higher reliability) in implementation and to allow powerful multitasking: simultaneous encryption of new data and detection of unusual patterns of device behaviour (e.g., delays due to virus activity). an increase in processing power should also allow more real-time processing of data and reduce the need to store data in raw form. • software should be tested by trusted third parties. currently, there are many kinds of platforms for mobile devices, and business requires rapid software development, which inhibits thorough testing of security and the privacy-protecting capabilities of personal devices. moreover, privacy protection requires extra resources and costs. • good design should provide the user with easy ways to override any automatic action, and to return to a stable initial state. for example, if a personalisation application has learned (by coincidence) that the user buys beer every week, and includes beer on every shopping list, it should be easy to return to a previous state in which system did not know that the user likes beer. another way to solve this problem might be to wait until the system learns that the user does not like beer. however, this would take longer and be more annoying. • good design should avoid implementations with high control levels in applications such as recording audio and images as well as physiological data unless it is strictly necessary for security reasons. 
• means of disconnecting should be provided in such a way that it is not taken as a desire by the user to hide. to some extent, all software algorithms are examples of artificial intelligence (ai) methods. machine-learning and data-mining are traditionally considered to belong to this field. however, safeguarding against aml threats requires al methods with very advanced reasoning capabilities. currently, ai safeguards are not mature, but the results of current research may change that assessment. many privacy threats arise because the reasoning capabilities and intelligence of software have not been growing as fast as hardware capabilities (storage and transmission capabilities). consequently, the development of ai safeguards should be supported as much as possible, especially because they are expected to help protect people from accidental, unintentional privacy violation, such as disturbing a person when he does not want to be, or from recording some private activity. for example, a memory aid application could automatically record some background scene revealing personal secrets or a health monitor could accidentally send data to "data hunters" if there are no advanced reasoning and anti-spyware algorithms running on the user's device. advanced ai safeguards could also serve as access control and antivirus protection by catching unusual patterns of data copying or delays in program execution. we recommend that ami applications, especially if they have a high control level, should be intelligent enough to: • detect sensitive data in order to avoid recording or publishing such data provide an automatic privacy audit, checking traces of data processing, data-or code-altering, etc. these requirements are not easy to fulfil in full scale in the near future; however, we suggest that it is important to fulfil these requirements as far as possible and as soon as possible. data losses and identity theft will continue into the future. however, losses of personal data will be more noticeable in the future because of the growing dependence on ami applications. thus, methods must be developed to inform all concerned people and organisations about data losses and to advise and/or help them to replace compromised data quickly (e.g., if somebody's fingerprint data are compromised, a switch should be made to another authentication method in all places where the compromised fingerprint was used). another problem, which should be solved by technology means, is recovery from loss of or damage to a personal device. if a device is lost, personal data contained in it can be protected from strangers by diverse security measures, such as data encryption and strict access control. however, it is important that the user does not need to spend time customising and training a new device (so that denial of service does not happen). instead, the new device should itself load user preferences, contacts, favourite music, etc, from some back-up service, like a home server. we suggest that ways be developed to synchronise data in personal devices with a back-up server in a way that is secure and requires minimal effort by the user. 
we suggest that the most important, but not yet mature technological safeguards are the following: • communication protocols that either do not require a unique device identifier at all or that require authorisation for accessing the device identifier • network configurations that can hide the links between senders and receivers of data • improving access control methods by multimodal fusion, context-aware authentication and unobtrusive biometric modalities (especially behavioural biometrics, because they pose a smaller risk of identity theft) and by aliveness detection in biometric sensors • enforcing legal requirements and personal privacy policies by representing them in machine-readable form and attaching these special expressions to personal data, so that they specify how data processing should be performed, allow a privacy audit and prevent any other way of processing • developing fast and intuitive means of detecting privacy threats, informing the user and configuring privacy policies • increasing hardware and software capabilities for real-time data processing in order to minimise the lifetime and amount of raw data in a system • developing user-friendly means to override any automatic settings in a fast and intuitive way • providing ways of disconnecting in such a way that nobody can be sure why a user is not connected • increasing security by making software updates easier (automatically or semiautomatically, and at a convenient time for the user), detection of unusual patterns, improved encryption • increasing software intelligence by developing methods to detect and to hide sensitive data; to understand the ethics and etiquette of different cultures; to speak different languages and to understand and translate human speech in many languages, including a capability to communicate with the blind and deaf • developing user-friendly means for recovery when security or privacy has been compromised. the technological safeguards require actions by industry. we recommend that industry undertake such technological safeguards. industry may resist doing so because it will increase development costs, but safer, more secure technology should be seen as a good investment in future market growth and protection against possible liabilities. it is obvious that consumers will be more inclined to use technology if they believe it is secure and will shield, not erode their privacy. we recommend that industry undertake such safeguards voluntarily. it is better to do so than to be forced by bad publicity that might arise in the media or from action by policy-makers and regulators. security guru bruce schneier got it right when he said that "the problem is … bad design, poorly implemented features, inadequate testing and security vulnerabilities from software bugs. … the only way to fix this problem is for vendors to fix their software, and they won't do it until it's in their financial best interests to do so. … liability law is a way to make it in those organizations' best interests." if development costs go up, industry will, of course, pass on those costs to consumers, but since consumers already pay, in one way or another, the only difference is who they pay. 
admittedly, this is not a simple problem because hardware manufacturers, software vendors and network operators all face competition and raising the cost of development and lengthening the duration of the design phase could have competitive implications, but if all industry players face the same exacting liability standards, then the competitive implications may not be so severe as some might fear. co-operation between producers and users of ami technology in all phases from r&d to deployment is essential to address some of the threats and vulnerabilities posed by ami. the integration of or at least striking a fair balance between the interests of the public and private sectors will ensure more equity, interoperability and efficiency. governments, industry associations, civil rights groups and other civil society organisations can play an important role in balancing these interests for the benefit of all affected groups. standards form an important safeguard in many domains, not least of which are those relating to privacy and information security. organisations should be expected to comply with standards, and standards-setting initiatives are generally worthy of support. while there have been many definitions and analyses of the dimensions of privacy, few of them have become officially accepted at the international level, especially by the international organization for standardization. the iso has at least achieved consensus on four components of privacy, i.e., anonymity, pseudonymity, unlinkability and unobservability. (see section . , p. , above for the definitions.) among the iso standards relevant to privacy and, in particular, information security are iso/iec on evaluation criteria for it security and iso , the code of practice for information security management. the iso has also established a privacy technology study group (ptsg) under joint technical committee (jtc ) to examine the need for developing a privacy technology standard. this is an important initiative and merits support. its work and progress should be tracked closely by the ec, member states, industry and so on. the iso published its standard iso in , which was updated in july . since then, an increasing number of organisations worldwide formulate their security management systems according to this standard. it provides a set of recommendations for information security management, focusing on the protection of information as an asset. it adopts a broad perspective that covers most aspects of information systems security. among its recommendations for organisational security, iso states that "the use of personal or privately owned information processing facilities … for processing business information, may introduce new vulnerabilities and necessary controls should be identified and implemented." by implementing such controls, iso/iec , information technology -security techniques -evaluation criteria for it security, first edition, international organization for standardization, geneva, . the standard is also known as the common criteria. similar standards and guidelines have also been published by other eu member states: the british standard bs was the basis for the iso standard. another prominent example is the german it security handbook (bsi, organisations can, at the same time, achieve a measure of both organisational security and personal data protection. 
iso acknowledges the importance of legislative requirements, such as legislation on data protection and privacy of personal information and on intellectual property rights, for providing a "good starting point for implementing information security". iso is an important standard, but it could be described more as a framework than a standard addressing the specificities of appropriate technologies or how those technologies should function or be used. also, iso was constructed against the backdrop of today's technologies, rather than with ami in mind. hence, the adequacy of this standard in an ami world needs to be considered. nevertheless, organisations should state to what extent they are compliant with iso and/or how they have implemented the standard.

audit logs may not protect privacy since they are aimed at determining whether a security breach has occurred and, if so, who might have been responsible or, at least, what went wrong. audit logs could have a deterrent value in protecting privacy, and certainly they could be useful in prosecuting those who break into systems without authorisation. in the highly networked environment of our ami future, maintaining audit logs will be a much bigger task than it is now, when discrete systems can be audited. nevertheless, those designing ami networks should ensure that the networks have features that enable effective audits.

the oecd has been working on privacy and security issues for many years. it produced its first guidelines more than years ago. its guidelines on the protection of privacy and transborder flows of personal data were (and are) intended to harmonise national privacy legislation. the guidelines were produced in the form of a recommendation by the council of the oecd and became applicable in september . the guidelines are still relevant today and may be relevant in an ami world too, although it has been argued that they may no longer be feasible in an ami world. the oecd's more recent guidelines for the security of information systems and networks are also an important reference in the context of developing privacy and security safeguards. these guidelines were adopted as a recommendation of the oecd council (in july ). in december , the oecd published a report on "the promotion of a culture of security for information systems and networks", which it describes as a major information resource on governments' effective efforts to date to foster a shift in culture as called for in the aforementioned guidelines for the security of information systems and networks. in november , the oecd published a -page volume entitled privacy online: oecd guidance on policy and practice, which contains specific policy and practical guidance to assist governments, businesses and individuals in promoting privacy protection online at national and international levels. in addition to these, the oecd has produced reports on other privacy-related issues including rfids, biometrics, spam and authentication.

sensible advice can also be found in a report published by the us national academies press in , which said that, to best protect privacy, identifiable information should be collected only when critical to the relationship or transaction that is being authenticated. the individual should consent to the collection, and the minimum amount of identifiable information should be collected and retained. the relevance, accuracy and timeliness of the identifier should be maintained and, when necessary, updated.
restrictions on secondary uses of the identifier are important in order to safeguard the privacy of the individual and to preserve the security of the authentication system. the individual should have clear rights to access information about how data are protected and used by the authentication system, and the individual should have the right to challenge, correct and amend any information related to the identifier or its uses.

among privacy projects, prime has identified a set of privacy principles in the design of identity management architecture:
principle 1: design must start from maximum privacy.
principle 2: explicit privacy rules govern system usage.
principle 3: privacy rules must be enforced, not just stated.
principle 4: privacy enforcement must be trustworthy.
principle 5: users need easy and intuitive abstractions of privacy.
principle 6: privacy needs an integrated approach.
principle 7: privacy must be integrated with applications.

trust marks and trust seals can also be useful safeguards because the creation of public credibility is a good way for organisations to alert consumers and other individuals to an organisation's practices and procedures through participation in a programme that has an easy-to-recognise symbol or seal. trust marks and seals are a form of guarantee provided by an independent organisation that maintains a list of trustworthy companies that have been audited and certified for compliance with some industry-wide accepted or standardised best practice in collecting personal or sensitive data. once these conditions are met, companies are allowed to display a trust seal logo or label that customers can easily recognise. a trust mark must be supported by mechanisms necessary to maintain objectivity and build legitimacy with consumers. trust seals and trust marks are, however, voluntary efforts that are not legally binding, and effective enforcement needs carefully designed procedures and the backing of an independent and powerful organisation that has the confidence of all affected parties. trust seals and trust marks are often promoted by industry, as opposed to consumer-interest groups. as a result, concerns exist that consumers' desires for stringent privacy protections may be compromised in the interest of industry's desire for the new currency of information. moreover, empirical evidence indicates that, even some eight years after the introduction of the first trust marks and trust seals in internet commerce, citizens know little about them and none of the existing seals has reached a high degree of familiarity among customers. this does not necessarily mean that trust marks are not an adequate safeguard for improving security and privacy in the ambient intelligence world; rather, it suggests that voluntary activities like self-regulation have - apart from being well designed - to be complemented by other legally enforceable measures.

in addition to the general influence of cultural factors and socialisation, trust results from context-specific interaction experiences. as is well documented, computer-mediated interactions are different from conventional face-to-face exchanges due to anonymity, lack of social and cultural clues, "thin" information, and the uncertainty about the credibility and reliability of the provided information that commonly characterise mediated relationships. in an attempt to reduce some of the uncertainties associated with online commerce, many websites acting as intermediaries between transaction partners are operating so-called reputation systems.
these institutionalised feedback mechanisms are usually based on the disclosure of past transactions rated by the respective partners involved. giving participants the opportunity to rank their counterparts creates an incentive for rule-abiding behaviour. thus, reputation systems seek to imitate some of the real-life trust-building and social constraint mechanisms in the context of mediated interactions. so far, reputation systems have not been developed for ami services, and it seems clear that institutionalised feedback mechanisms will only be applicable to a subset of future ami services and systems. implementing reputation systems only makes sense in those cases in which users have real choices between different suppliers (for instance, with regard to ami-assisted commercial transactions or information brokers). ami infrastructures that normally cannot be avoided if one wants to take advantage of a service need to be safeguarded by other means, such as trust seals, iso guidelines and regulatory action. despite quite encouraging experiences in numerous online arenas, reputation systems are far from perfect. many reputation systems tend to shift the burden of quality control and assessment from professionals to the - not necessarily entirely informed - individual user. in consequence, particularly sensitive services should not exclusively be controlled by voluntary and market-style feedback from customers. furthermore, reputation systems are vulnerable to manipulation. pseudonyms can be changed, effectively erasing previous feedback. and the feedback itself need not necessarily be sincere, whether due to co-ordinated accumulation of positive feedback, negotiations between parties prior to the actual feedback process, blackmailing or the fear of retaliation. last but not least, reputation systems can become the target of malicious attacks, just like any net-based system.

an alternative to peer-rating systems is credibility-rating systems based on the assessment of trusted and independent institutions, such as library associations, consumer groups or other professional associations with widely acknowledged expertise within their respective domains. ratings would be based on systematic assessments along clearly defined quality standards. in effect, these variants of reputation- and credibility-enhancing systems are quite similar to trust marks and trust seals. the main difference is that professional rating systems enjoy a greater degree of independence from vested interests. and, unlike peer-rating systems, which operate literally for free, the independent professional organisations need to be equipped with adequate resources. on balance, reputation systems can contribute to trust-building between strangers in mediated short-term relations or between users and suppliers, but they should not be viewed as a universal remedy for the ubiquitous problem of uncertainty and the lack of trust.

a possible safeguard is a contract between the service provider and the user that has provisions about privacy rights, the protection of personal data and notification of the user of any processing or transfer of such data to third parties. while this is a possible safeguard, there must be some serious doubt about the negotiating position of the user. it is quite possible the service provider would simply say: here are the terms under which i am willing to provide the service, take it or leave it.
also, from the service provider's point of view, it is unlikely that he would want to conclude separate contracts with every single user. in a world of ambient intelligence, such a prospect becomes even more unlikely in view of the fact that the "user", the consumer-citizen, will be moving through different spaces where there is likely to be a multiplicity of different service providers. (see resnick, p., r. zeckhauser, e. friedman and k. kuwabara, "reputation systems: facilitating trust in internet interactions", communications of the acm, ( ), , pp. - . http://www.si.umich.edu/~presnick/papers/cacm /reputations.pdf.) it may be that the consumer-citizen would have a digital assistant that would inform him of the terms, including the privacy implications, of using a particular service in a particular environment. if the consumer-citizen did not like the terms, he would not have to use the service.

consumer associations and other civil society organisations (csos) could, however, play a useful role as a mediator between service providers and individual consumers and, more particularly, in forcing the development of service contracts (whether real or implicit) between the service provider and the individual consumer. consumer organisations could leverage their negotiating position through the use of the media or other means of communication with their members. csos could position themselves closer to the industry vanguard represented in platforms such as artemis by becoming members of such platforms themselves. within these platforms, csos could encourage industry to develop "best practices" in terms of provision of services to consumers.

government support for new technologies should be linked more closely to an assessment of technological consequences. given the far-reaching social effects that ambient intelligence is supposed to have and the high dynamics of its development, there is a clear deficit in this area. research and development (at least publicly supported r&d) must highlight future opportunities and possible risks to society and introduce them into public discourse. every research project should commit itself to exploring possible risks in terms of privacy, security and trust, developing a strategy to cover problematic issues and involving users in this process as early as possible. a template for "design guidelines" that specifically address issues of privacy has been developed by the "ambient agora" project, which has taken into account the fundamental rules of the oecd, notably its guidelines on the protection of privacy and transborder flows of personal data, adopted on september , and the more recent guidelines for the security of information systems and networks.

if the state acts as a buyer of strategically important innovative products and services, it contributes to the creation of the critical demand that enables suppliers to reduce their business risk and realise spillover effects. thus, public procurement programmes can be used to support the demand for and use of improved products and services in terms of security and privacy or identity protection. in the procurement of ict products, emphasis should therefore be given to critical issues such as security and trustworthiness. as in other advanced fields, it will be a major challenge to develop a sustainable procurement policy that can cope with ever-shorter innovation cycles.
the focus should not be on the characteristics of an individual product or component, but on the systems into which components are integrated. moreover, it is important to pay attention to the secondary and tertiary impacts resulting from deployment of large technical systems such as ambient intelligence. an evaluation of the indirect impacts is especially recommended for larger (infrastructure) investments and public services. while public procurement of products and services that are compliant with the eu legal framework and other important guidelines for security, privacy and identity protection is no safeguard on its own, it can be an effective means for the establishment and deployment of standards and improved technological solutions.

accessibility is a key concept in helping to promote the social inclusion of all citizens in the information society embedded with ami technologies. accessibility is needed to ensure user control, acceptance, enforceability of policy in a user-friendly manner and the provision of citizens with equal rights and opportunities in a world of ambient intelligence. all citizens should have equal rights to benefit from the new opportunities that ami technologies will offer. this principle promotes the removal of direct and indirect discrimination, fosters access to services and encourages targeted actions in favour of under-represented groups. it also promotes system design according to a user-centric approach (i.e., the concept of "design for all"). the design-for-all concept enables all to use applications (speech technology for the blind, pictures for the deaf). it means designing in a way that makes sure applications are user-friendly and can be used intuitively. in short, industry has to make an effort to simplify the usage of icts, rather than forcing prospective users to learn how to use otherwise complex icts. better usability will then support easy learning (i.e., learning by observation), user control and efficiency, thus increasing satisfaction and, consequently, user acceptance. this principle aims to overcome user dependency and, more particularly, user isolation and stress due to the complexity of new technology, which leads to loss of control.

education programmes on how to use new technologies will increase user awareness about the different possibilities and choices offered by ami technologies and devices. training and education help to overcome user dependency and social disruptions. user awareness is important to reduce the voluntary exclusion caused by a misunderstanding of how the technology works. this safeguard is essential in order to prevent almost all facets of dependency, system dependency as well as user dependency. consumers need to be educated about the privacy ramifications arising from virtually any transaction in which they are engaged. an education campaign should be targeted at different segments of the population, and school-age children should be included in any such campaign. any networked device, particularly those used by consumer-citizens, should come with a privacy warning much like the warnings on tobacco products.

when the uk department of trade and industry (dti) released its information security review, the uk e-commerce minister emphasised that everyone has a role to play in protecting information: "risks are not well managed. we need to dispel the illusion that information security issues are somebody else's problem. it's time to roll up our sleeves." the oecd shares this point of view.
it has said that "all participants in the new information society … need … a greater awareness and understanding of security issues and the need to develop a 'culture of security'." the oecd uses the word "participants", which equates to "stakeholders", and virtually everyone is a participant or stakeholder - governments, businesses, other organisations and individual users. oecd guidelines are aimed at promoting a culture of security, raising awareness and fostering greater confidence (i.e., trust) among all participants.

there are various ways of raising awareness, and one of those ways would be to have some contest or competition for the best security or privacy-enhancing product or service of the year. the us government's department of homeland security is sponsoring such competitions, and europe could usefully draw on that experience to hold similar competitions in europe. in the same way as the principle that "not everything that you read in the newspapers is true" has long been part of general education, in the ict age, awareness should generally be raised by organisations that are trustworthy and as close to the citizen as possible (i.e., on the local or regional level). questions of privacy, identity and security are, or should be, an integral part of the professional education of computer scientists. we agree with and support the commission's "invitation" to member states to "stimulate the development of network and information security programmes as part of higher education curricula".

perhaps one of the best safeguards is public opinion, stoked by stories in the press and the consequent bad publicity given to perceived invasions of privacy by industry and government. new technologies often raise policy issues, and this is certainly true of ambient intelligence. ami offers great benefits, but the risk of not adequately addressing public concerns could mean delays in implementing the technologies, a lack of public support for taxpayer-funded research and vociferous protests by privacy protection advocates.

cultural artefacts, such as films and novels, may serve as safeguards against the threats and vulnerabilities posed by advanced technologies, including ambient intelligence. science fiction in particular often presents a dystopian view of the future where technology is used to manipulate or control people; in so doing, such artefacts raise our awareness and serve as warnings against the abuse of technology. a new york times film critic put it this way: "it has long been axiomatic that speculative science-fiction visions of the future must reflect the anxieties of the present: fears of technology gone awry, of repressive political authority and of the erosion of individuality and human freedom." an example of such a cultural artefact is steven spielberg's film minority report, which depicts a future embedded with ambient intelligence and serves to convey messages or warnings from the director to his audience. minority report is by no means unique as a cultural artefact warning about how future technologies are like a double-edged knife that cuts both ways.

to implement socio-economic safeguards will require action by many different players. unfortunately, the very pervasiveness of ami means that no single action by itself will be sufficient as a safeguard. a wide variety of socio-economic safeguards, probably even wider than those we have highlighted in the preceding sections, will be necessary.
as implementation of ami has already begun (with rfids, surveillance systems, biometrics, etc.), it is clearly not too soon to begin implementation of safeguards. we recommend, therefore, that all stakeholders, including the public, contribute to this effort.

the fast emergence of information and communication technologies and the growth of online communication, e-commerce and electronic services that go beyond the territorial borders of the member states have led the european union to adopt numerous legal instruments such as directives, regulations and conventions on e-commerce, consumer protection, electronic signatures, cyber crime, liability, data protection, privacy and electronic communication … and many others. even the european charter of fundamental rights will play an important role in relation to the networked information society. our analysis of the dark scenarios shows that we may encounter serious legal problems when applying the existing legal framework to address the intricacies of an ami environment. our proposed legal safeguards should be considered as general policy options, aimed at stimulating discussion between stakeholders and, especially, policymakers.

law is only one of the available sets of tools for regulating behaviour, next to social norms, market rules and "code" - the architecture of the technology (e.g., of cyberspace, wireless and wired networks, security design, encryption levels, rights management systems, mobile telephony systems, user interfaces, biometric features, handheld devices and accessibility criteria) - and many other tools. the regulator of ambient intelligence can, for instance, achieve certain aims directly by imposing laws, but also indirectly by, for example, influencing the rules of the market. regulatory effect can also be achieved by influencing the architecture of a certain environment. the architecture of ami might well make certain legal rules difficult to enforce (for example, the enforcement of data protection obligations on the internet or the enforcement of copyright in peer-to-peer networks), and might cause new problems particularly related to the new environment (spam, dataveillance). on the other hand, the "code" has the potential to regulate by enabling or disabling certain behaviour, while law regulates via the threat of sanction. in other words, the software and hardware constituting the "code" and architecture of the digital world, while causing particular problems, can at the same time be the instrument for solving them.

regulating through code may have some specific advantages: law traditionally regulates ex post, by imposing a sanction on those who did not comply with its rules (e.g., in the form of civil damages or criminal prosecution). architecture regulates ex ante, by putting conditions on one's behaviour, allowing or disallowing something and leaving no possibility to disobey. ambient intelligence is particularly built on software code. this code influences how ambient intelligence works, e.g., how the data are processed, but this code itself can be influenced and accompanied by regulation. thus, the architecture can be a tool of law. this finding is more than elementary. it shows that there is a choice: should the law change because of the "code"? or should the law change the "code" and thus ensure that certain values are protected? the development of technology represents an enormous challenge for privacy, enabling increasing surveillance and invisible collection of data.
a technology that threatens privacy may be balanced by the use of a privacy-enhancing technology: the "code", as lessig claims, can be the privacy saviour. other technologies aim to limit the amount of data actually collected to the necessary minimum. however, most current technologies simply ignore the privacy implications and collect personal data when there is no such need. a shift of paradigm to privacy-by-design is necessary to effectively protect privacy. indeed, technology can facilitate privacy-friendly verification of individuals via, for example, anonymous and pseudonymous credentials. leenes and koops recognise the potential of these privacy-enhancing technologies (pets) to enforce data protection law and privacy rules. but they also point to problems regarding the use of such technologies, which are often troublesome in installation and use for most consumers. moreover, industry is not really interested in implementing privacy-enhancing technology; it sees no (economic) reason to do so. the analysis of leenes and koops shows that neither useful technology nor law is sufficient in itself. equally important are raising stakeholder awareness, social norms and market rules. all available regulatory means have to be used to respond to the problems of the new environment and to tackle them effectively. for the full effectiveness of any regulation, one should always look for the optimal mixture of all accessible means.

as the impact and effects of the large-scale introduction of ami in societies spawn many uncertainties, the careful demarche implied by the precautionary principle, with its information, consultation and participation constraints, might be appropriate. the application of this principle might inspire us in devising legal policy options when, as regards ami, fundamental choices between opacity tools and transparency tools must be made. opacity tools proscribe the interference of powerful actors with the individual's autonomy, while transparency tools accept such interfering practices, though under certain conditions which guarantee the control, transparency and accountability of the interfering activity and actors.

legal scholars do not discuss law in general terms. their way of thinking always involves an application of the law in concrete or exemplified situations. the legislator will compare concrete examples and situations with the law and will not try to formulate general positions or policies. thus, the proposed legal framework will not deal with the ami problems in a general way, but will focus on concrete issues and apply opacity and transparency solutions accordingly. another particularity of legal regulation in cyberspace is the absence of a central legislator. though our legal analysis is based mostly on european law, we emphasise that not everything is regulated at a european level. regulation of (electronic) identity cards, for instance, concerns a crucial element in the construction of an ami environment, but is within the powers of the individual member states. both at european and national level, some decision-making competences have been delegated to independent advisory organs (children's rights commissioners, data protection authorities).
hence, there exist many of what we might call "little legislators" that adjust in some way the often executive origin of legislation: the article data protection working party, national children's rights commissioners and international standardisation bodies can and do, for example, draft codes of conduct that often (but not always) constitute the basis for new legislation. we do not suggest the centralisation of the law-making process. on the contrary, we recommend respect for the diversity and plurality of lawmakers. the solutions produced by the different actors should be taken into consideration, and these actors should be actively involved in policy discussions. development of case law should also be closely observed. consulting concerned citizens and those who represent citizens (including legislators) at the stage of development would increase the legitimacy of new technologies.

privacy aims to ensure no interference in private and individual matters. it offers an instrument to safeguard the opacity of the individual and puts limits on the interference by powerful actors with the individual's autonomy. normative in nature, regulatory opacity tools should be distinguished from regulatory transparency tools, whose goal is to control the exercise of power rather than to restrict power. we observe today that the reasonable expectation of privacy is eroding due to emerging new technologies and possibilities for surveillance: it is developing into an expectation of being monitored. should this, however, lead to diminishing the right to privacy? ambient intelligence may seriously threaten this value, but the need for privacy (e.g., the right to be let alone) will probably remain, be it in another form adapted to new infrastructures (e.g., the right to be left offline). the right to privacy in a networked environment could be enforced by any means of protecting the individual against any form of dataveillance. such means are in line with the data minimisation principle of data protection law, which is a complementary tool to privacy. however, in ambient intelligence, where collecting and processing personal data is almost a prerequisite, new tools of opacity such as the right to be left offline (in time, e.g., during certain minutes at work, or in space, e.g., in public bathrooms) could be recognised. several instruments of opacity can be identified. we list several examples, and there may be others. additional opacity recommendations are made in subsequent sections, for example, with regard to biometrics. we observe that there is not necessarily an internal coherence between the examples listed below; the list should be understood as a wish list or a list of suggestions to be consulted freely. opacity designates a zone of non-interference, which should not be confused with a zone of invisibility: privacy, for instance, does not imply secrecy; it implies the possibility of being oneself openly without interference. another word might have been "impermeability", which is too strong and does not contrast so nicely with "transparency" as "opacity" does (see hildebrandt, m., and s. gutwirth).

the concept of a digital territory represents a vision that introduces the notions of space and borders into future digitised everyday life. it could be visualised as a bubble, the boundaries and transparency of which depend on the will of its owner. the notion of a digital territory aims for a "better clarification of all kinds of interactions in the future information society.
without digital boundaries, the fundamental notion of privacy or the feeling of being at home will not take place in the future information society." the concept of digital territories encompasses the notion of a virtual residence, which can be seen as a virtual representation of the smart home. the concept of digital territories could provide the individual with a possibility to access - and stay in - a private digital territory of his own at (any) chosen time and place. this private, digital space could be considered as an extension of the private home. already today, people store their personal pictures on distant servers, read their private correspondence online, provide content providers with their viewing and consuming behaviour for the purpose of digital rights management, and communicate with friends and relatives through instant messengers and internet telephony services. the "prognosis is that the physical home will evolve to 'node' in the network society, implying that it will become intimately interconnected to the virtual world."

the law guarantees neither the establishment nor the protection of an online private space in the same way as the private space in the physical world is protected. currently, adequate protection is lacking. for example, the new data retention law requires that telecommunication service providers keep communication data at the disposal of law enforcement agencies. the retention of communication data relates to mobile and fixed phone data, internet access, e-mail and e-telephony. data to be retained include the place, time, duration and destination of communications. what are the conditions for accessing such data? is the individual informed when such data are accessed? does he have the right to be present when such data are examined? does the inviolability of the home extend to the data that are stored on a distant server? another example of inadequate protection concerns the increasing access to home activities from a distance, for example, as a result of the communication data generated by domestic applications that are connected to the internet. in both examples, there is no physical entry into the private place.

to ensure that these virtual private territories become a private domain for the individual, a regulatory framework could be established to prevent unwanted and unnoticed interventions, similar to that which currently applies to the inviolability of the home. a set of rules needs to be envisaged to guarantee such protection, amongst them procedural safeguards similar to those currently applicable to the protection of our homes against state intervention (e.g., requiring a search warrant). technical solutions aimed at defending private digital territories against intrusion should be encouraged and, if possible, legally enforced. the individual should be empowered with the means to freely decide what kinds of information he or she is willing to disclose, and that aspect should be included in the digital territory concept. similarly, vulnerable home networks should be granted privacy protection. such protection could be extended to the digital movement of the person: just as the privacy protection afforded the home has been or can be extended to the individual's car, so the protection could be extended to home networks, which might contact external networks.

privacy at the workplace has already been extensively discussed. most of the legal challenges that may arise can, we believe, be answered with legal transparency rules.
more drastic, prohibitive measures may be necessary in certain situations involving too far-reaching or unnecessary surveillance, which a society considers as infringing upon the dignity of the employee. in addition, transparency rules are needed to regulate other, less intrusive problems. we recall here the specific role of law-making institutions in the area of labour law. companies must discuss their surveillance system and its usage in collective negotiations with labour organisations and organisations representing employees before its implementation in a company or a sector, taking into account the specific needs and risks involved (e.g., workers in a bank vs. workers in public administration). all employees should always be clearly and a priori informed about the employee surveillance policy of the employer (when and where surveillance is taking place, what is the finality, what information is collected, how long it will be stored, what are the (procedural) rights of the employees when personal data are to be used as evidence, etc.).

specific cyber territories for children have to be devised along the same lines. the united nations convention on the rights of the child ( ) contains a specific privacy right for children, and sets up monitoring instruments such as national children's rights commissioners. opinions of such advisory bodies should be carefully taken into account in policy discussion. national children's rights commissioners could take up problems relating to the permanent digital monitoring of children.

as concluded in the legal analysis of the dark scenarios above, courts are willing to protect one's privacy but, at the same time, they tend to admit evidence obtained through a violation of privacy or data protection. there is a lack of clarity and uniformity regarding the consequences of privacy violations. in the khan case, for example, the evidence was secured by the police in a manner incompatible with the requirements of article of the european convention on human rights (echr). the court accepted that the admission of evidence obtained in breach of the privacy right is not necessarily a breach of the required fairness under article of echr (the right to a fair trial), since the process taken as a whole was fair in the sense of article . the evidence against the accused was admitted and led to his conviction. the khan doctrine (followed in cases such as doerga v. the netherlands and p.g. and j.h. v. the united kingdom) has also been followed by at least some national courts. the fact that there is no general acceptance of an exclusionary rule creates legal uncertainty. its general acceptance is, however, necessary to protect the opacity of the individual in a more effective way. a departure by the courts from their current position (namely, towards the non-admission of evidence obtained through privacy and/or data protection law infringements) could be considered, and a legislative prohibition of the admissibility of such evidence (or general acceptance of the exclusionary rule) envisaged.

in ambient intelligence, the use of implants can no longer be considered as a kind of futuristic or extraordinary exception. whereas it is clear that people may not be forced to use such implants, people may easily become willing to equip themselves with such implants on a (quasi) voluntary basis, be it, for example, to enhance their bodily functions or to obtain a feeling of security through always-on connections to anticipate possible emergencies.
such a trend requires a careful assessment of the opacity and transparency principles at a national, european and international level. currently, in europe, the issue of medical implants has already been addressed. in ami, however, implants might be used for non-medical purposes. one of our dark scenarios suggests that organisations could force people to have an implant so they could be located anywhere at any time. the european group on ethics in science and new technologies has found that implants should be used only when the aim cannot be achieved by less body-intrusive means, and that informed consent is necessary to legitimise the use of implants. we agree with those findings. the european group on ethics in science and new technologies goes further, stating that non-medical (profit-related) applications of implants constitute a potential threat to human dignity. applications of implantable surveillance technologies are only permitted when there is an urgent and justified necessity in a democratic society, and must be specified in legislation. we agree that such applications should be diligently scrutinised. we propose that the appropriate authorities (e.g., the data protection officer) control and authorise applications of implants after the assessment of the particular circumstances in each case. when an implant enables tracking of people, people should have the possibility to disconnect the implant at any given moment and to be informed when a (distant) communication (e.g., through rfid) is taking place. we agree with the european group on ethics in science and new technologies that irreversible ict implants should not be used, except for medical purposes. further research on the long-term impact of ict implants is also recommended.

another safeguard to guarantee the opacity of the individual is the possibility to act under anonymity (or at least under pseudonymity or "revocable anonymity"). the article working party has considered anonymity as an important safeguard for the right to privacy. we repeat here its recommendations:
(a) the ability to choose to remain anonymous is essential if individuals are to preserve the same protection for their privacy online as they currently enjoy offline.
(b) anonymity is not appropriate in all circumstances.
(c) legal restrictions which may be imposed by governments on the right to remain anonymous, or on the technical means of doing so (e.g., availability of encryption products), should always be proportionate and limited to what is necessary to protect a specific public interest in a democratic society.
(e) some controls over individuals contributing content to online public fora are needed, but a requirement for individuals to identify themselves is in many cases disproportionate and impractical. other solutions are to be preferred.
(f) anonymous means to access the internet (e.g., public internet kiosks, prepaid access cards) and anonymous means of payment are two essential elements for true online anonymity.
according to the common criteria for information technology security evaluation document (iso ), anonymity is only one of the requirements for the protection of privacy, next to pseudonymity, unlinkability, unobservability, user control/information management and security protection. all these criteria should be considered as safeguards for privacy. the e-signature directive promotes the use of pseudonyms and, at the same time, aims to provide security for transactions.
currently, the directive on electronic signatures states that only advanced electronic signatures (those based on a qualified certificate and created by a secure signature-creation device) satisfy the legal requirements of a signature in relation to data in electronic form in the same manner as a handwritten signature satisfies those requirements in relation to paper-based data and are admissible as evidence in legal proceedings. member states must ensure that an electronic signature (advanced or not) is not denied legal effectiveness and admissibility as evidence in legal proceedings solely on the grounds that it is: (a) in electronic form, (b) not based upon a qualified certificate, (c) not based upon a qualified certificate issued by an accredited certification service-provider, or (d) not created by a secure signature-creation device.

in ambient intelligence, the concept of unlinkability can become as important as the concept of anonymity or pseudonymity. unlinkability "ensures that a user may make multiple uses of resources or services without others being able to link these uses together. … unlinkability requires that users and/or subjects are unable to determine whether the same user caused certain specific operations in the system." when people act pseudonymously or anonymously, their behaviour in different times and places in the ambient intelligence network could still be linked and consequently be subject to control, profiling and automated decision-making: linking data relating to the same non-identifiable person may result in similar privacy threats as linking data that relate to an identified or identifiable person. thus, hand in hand with the right to remain anonymous goes the use of anonymous and pseudonymous credentials, accompanied by unlinkability in certain situations (e.g., e-commerce), thereby reconciling the privacy requirements with the accountability requirements of, e.g., e-commerce. in fact, such mechanisms should always be foreseen when disclosing someone's identity or linking the information is not necessary. such necessity should not be easily assumed, and in every circumstance more privacy-friendly technological solutions should be sought. however, the use of anonymity should be well balanced. to avoid its misuse, digital anonymity could be further legally regulated, especially by stating when it is not appropriate.

at the international level, cybercrime is addressed by the cybercrime convention, which provides a definition of several criminal offences related to cybercrime and general principles concerning international co-operation. the cybercrime convention, however, allows for different standards of protection. the convention obliges its signatories to criminalise certain offences under national law, but member states are free to narrow the scope of the definitions. the most important weakness of this convention is the slow progress in its ratification by signatory states. council framework decision / /jha also provides for criminal sanctions against cybercrimes. the framework decision is limited, however, both in scope and territory, since it only defines a limited number of crimes and is only applicable to the member states of the european union. international co-operation in preventing, combating and prosecuting criminals is needed and may be facilitated by a wide range of technological means, but these new technological possibilities should not erode the privacy of innocent citizens, who are deemed to be not guilty until proven otherwise.
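returning to unlinkability and pseudonymous credentials discussed above: the following is a minimal sketch of per-service pseudonyms that cannot be linked across services without a secret held by the user. the derivation scheme and service names are illustrative assumptions only; real anonymous-credential systems rely on considerably more sophisticated cryptography.

```python
# minimal sketch of unlinkable, per-service pseudonyms derived from a secret
# held only by the user; service names and the derivation scheme are
# illustrative, not a description of any actual credential system.
import hmac
import hashlib

def pseudonym(user_secret: bytes, service_id: str) -> str:
    """derive a stable pseudonym for one service; without the secret, pseudonyms
    used at different services cannot be linked to each other or to the user."""
    return hmac.new(user_secret, service_id.encode(), hashlib.sha256).hexdigest()[:16]

secret = b"kept only on the user's own device"
print(pseudonym(secret, "bookshop.example"))   # same value on every visit to this shop
print(pseudonym(secret, "pharmacy.example"))   # different, unlinkable value elsewhere
```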
cybercrime prosecution, and more importantly crime prevention, might be facilitated by a wide range of technological means, among them those that provide for the security of computer systems and data against attacks.

almost all human activity in ami can be reduced to personal data processing: opening doors, sleeping, walking, eating, putting lights on, shopping, walking in a street, driving a car, purchasing, watching television and even breathing. in short, all physical actions become digital information that relates to an identified or identifiable individual. often, the ambient intelligence environment will need to adapt to individuals and will therefore use profiles applicable to particular individuals or to individuals within a group profile. ami will change not only the amount, but also the quality of data collected so that we can be increasingly supported in our daily life (a goal of ambient intelligence). ami will collect data not only about what we are doing, when we do it and where we are, but also data on how we have experienced things. one can assume that the accuracy of the profiles, on which the personalisation of services depends, will improve as the amount of data collected grows. but as others hold more of our data, so grow the privacy risks. thus arises the fundamental question: do we want to minimise personal data collection? instead of focusing on reducing the amount of data collected alone, should we admit that such data are indispensable for the operation of ami, and focus rather on empowering the user with a means to control the processing of personal data?

data protection is a tool for empowering the individual in relation to the collection and processing of his or her personal data. the european data protection directive imposes obligations on the data controller and supports the rights of the data subject with regard to the transparency and control over the collection and processing of data. it does not provide for prohibitive rules on data processing (except for the processing of sensitive data and the transfer of personal data to third countries that do not ensure an adequate level of protection). instead, eu data protection law focuses on a regulatory approach and on channelling, controlling and organising the processing of personal data. as the title of directive / /ec indicates, the directive concerns both the protection of the individual with regard to the processing of personal data and the free movement of such data. the combination of these two goals in directive / /ec reflects the difficulties we encounter in the relations between ambient intelligence and data protection law. there is no doubt that some checks and balances in using data should be put in place in the overall architecture of the ami environment. civil movements and organisations dealing with human rights, privacy or consumer rights, observing and reacting to the acts of states and undertakings, might provide such guarantees. it is also important to provide incentives for all actors to adhere to legal rules. education, media attention, development of good practices and codes of conduct are of crucial importance. liability rules and rules aimed at enforcement of data protection obligations will become increasingly important. data protection law provides for the right to information on data processing and access to or rectification of data, which constitute important guarantees of individual rights.
however, its practical application in an ami era could easily lead to an administrative nightmare, as information overload would make it unworkable. we should try to remedy such a situation in a way that does not diminish this right. the individual's right to information is a prerequisite to protect his interests. such a right corresponds to a decentralised system of identity (data) management, but it seems useful to tackle it separately to emphasise the importance of the individual's having access to information about the processing of his data. because of the large amounts of data to be processed in an ami world, the help of or support by intelligent agents to manage such information streams seems indispensable. information about what knowledge has been derived from the data could help the individual in proving causality in case of damage. further research on how to reconcile access to the knowledge in profiles (which might be construed as a trade secret in some circumstances) with intellectual property rights would be desirable.

the right to be informed could be facilitated by providing information in a machine-readable language, enabling the data subject to manage the information flow through or with the help of (semi-) autonomous intelligent agents. of course, this will be more difficult in situations of passive authentication, where no active involvement of the user takes place (e.g., through biometrics and rfids). thus, information on the identity of the data controller and the purposes of processing could exist both in human-readable and machine-readable language. the way such information is presented to the user is of crucial importance - i.e., it must be presented in an easily comprehensible, user-friendly way. in that respect, the article working party has provided useful guidelines and proposed multilayer eu information notices essentially consisting of three layers:
layer 1 - the short notice contains core information required under article of the data protection directive (identity of the controller, purpose of processing, or any additional information which, in the view of the particular circumstances of the case, must be provided to ensure fair processing). a clear indication must be given as to how the individual can access additional information.
layer 2 - the condensed notice contains all relevant information required under the data protection directive. this includes the name of the company, the purpose of the data processing, the recipients or categories of recipients of the data, whether replies to the questions are obligatory or voluntary, as well as the possible consequences of failure to reply, the possibility of transfer to third parties, the right to access, to rectify and to oppose, and the choices available to the individual. in addition, a point of contact must be given for questions and information on redress mechanisms, either within the company itself or details of the nearest data protection agency.
layer 3 - the full notice includes all national legal requirements and specificities. it could contain a full privacy statement with possible additional links to national contact information.
we recommend that industry and law enforcement agencies consider an approach for ami environments similar to that recommended by the article working party. electronic versions of such notices should be sufficient in most circumstances.
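to make the idea of machine-readable notices concrete, the following is a minimal sketch of how a multilayer notice of the kind described above might be represented so that a (semi-)autonomous agent can read it. the field names and the example controller are hypothetical and are not taken from the working party documents.

```python
# minimal sketch of a machine-readable multilayer information notice; the field
# names and the example controller are hypothetical, not taken from any standard.
notice = {
    "short": {                                # layer 1: core information
        "controller": "Example Retail Ltd",
        "purpose": "loyalty-programme administration",
        "more_info": "https://example.org/privacy",
    },
    "condensed": {                            # layer 2: all relevant information
        "recipients": ["Example Retail Ltd", "payment processor"],
        "replies_obligatory": False,
        "third_party_transfer": False,
        "rights": ["access", "rectification", "opposition"],
        "contact": "privacy@example.org",
    },
    "full": {                                 # layer 3: national specificities
        "statement_url": "https://example.org/privacy/full",
    },
}

def agent_accepts(notice: dict, allowed_purposes: set) -> bool:
    """a user's intelligent agent could refuse services whose declared purpose
    falls outside what its owner has authorised."""
    return notice["short"]["purpose"] in allowed_purposes

print(agent_accepts(notice, {"loyalty-programme administration"}))  # True
print(agent_accepts(notice, {"billing"}))                           # False
```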
our dark scenarios indicate a new kind of practice that has emerged in recent years in the sector of personal data trading: while some companies collect personal data in an illegal way (not informing the data subjects, transferring data to third parties without prior consent, using data for different purposes, installing spyware, etc.), these personal data are shared, sold and otherwise transferred throughout a chain of existing and disappearing companies to the extent that the origin of the data and the original data collector cannot be traced back. this practice has been described as "data laundering", by analogy with money laundering: it refers to a set of activities aiming to cover the illegitimate origin of data. in our ami future, we should assume that the value of personal data, and therefore the (illegal) trading in these data, will only grow. a means to prevent data laundering could be to oblige those who buy or otherwise acquire databases, profiles and vast amounts of personal data to check diligently the legal origin of the data. a buyer who does not check the origin and/or legality of the databases and profiles could be considered the equivalent of a receiver of stolen goods and thus be held liable for illegal data processing. buyers could also be obliged to notify the national data protection officers when personal data(bases) are acquired. those involved or assisting in data laundering could be subject to criminal sanctions (a sketch of such a diligence check follows below).

ami requires efficient, faultless exchanges of relevant data and information throughout the ami network. the need for efficiency requires interoperable data formats and interoperable hardware and software for data processing. dark scenario (about the bus accident) has shown the need for interoperability in ambient intelligence, but it must be recognised that, at the same time, interoperable data and data processing technologies in all sectors and all applications could threaten trust, privacy, anonymity and security. full interoperability and free flow of personal data are not always desirable, and should not be considered as unquestionable. interoperability can entail an unlimited availability of personal data for any purpose. interoperability may infringe upon the finality and purpose specification principles and erode the rights and guarantees offered by privacy and data protection law. moreover, the purposes for which the data are available are often too broadly described (what is "state security", "terrorism", "a serious crime"?). data can become available afterwards for any purpose. interoperability of data and data processing mechanisms facilitates possible function creep (use of data for purposes other than originally envisaged). interoperability could contribute to the criminal use of ambient intelligence, for example, by sending viruses to objects in the network (interoperability opens the door for fast transmission and reproduction of a virus) or by abusing data (interoperable data formats make data usable for any purpose). interoperability is thus not only a technological issue. awareness - already today - of the possible negative sides of interoperability should bring about a serious assessment of both law and technology before the market comes up with tools for interoperability.
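before continuing with interoperability, here is the sketch of the data-laundering diligence check referred to above: a buyer of a personal database refuses acquisition when the origin of the data cannot be traced back. all field names and the notification step are hypothetical illustrations, not requirements taken from any existing law.

```python
# minimal sketch of a provenance check before acquiring a personal database;
# all field names, and the idea of a notification hook, are illustrative only.
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class ProvenanceRecord:
    original_collector: Optional[str]   # who first collected the data
    legal_basis: Optional[str]          # e.g. "consent", "contract"
    transfer_chain: List[str]           # every party the data passed through

def diligence_check(record: ProvenanceRecord) -> bool:
    """refuse acquisition when the origin or legal basis cannot be traced back;
    an acquirer skipping this check would, under the proposal above, risk being
    treated like a receiver of stolen goods."""
    traceable = record.original_collector is not None and record.legal_basis is not None
    complete_chain = all(party for party in record.transfer_chain)
    return traceable and complete_chain

def acquire_database(record: ProvenanceRecord) -> str:
    if not diligence_check(record):
        return "refused: origin of the data cannot be established"
    # hypothetical notification of the national data protection officer
    return "acquired and notified to the data protection authority"

print(acquire_database(ProvenanceRecord("survey-co", "consent", ["survey-co", "broker-a"])))
print(acquire_database(ProvenanceRecord(None, None, ["unknown", ""])))
```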
legal initiatives in france (e.g., requiring interoperability of the itunes music platform) and sanctions imposed by the european commission (imposing interoperability of the microsoft work group server operating system) indicate clearly that interoperability is desired at a political and societal level. in the communication from the commission to the council and the european parliament on improved effectiveness, enhanced interoperability and synergies among european databases in the area of justice and home affairs of , interoperability is defined as the "ability of it systems and of the business processes they support to exchange data and to enable the sharing of information and knowledge". this is, however, a more technological definition: it "explicitly disconnects the technical and the legal/political dimensions from interoperability, assuming that the former are neutral and the latter can come into play later or elsewhere. … indeed, technological developments are not inevitable or neutral, which is mutatis mutandis also the case for technical interoperability. the sociology of sciences has shown that any technological artefact has gone through many small and major decisions that have moulded it and given it its actual form. hence, the development of information technology is the result of micro politics in action. technologies are thus interwoven with organisation, cultural values, institutions, legal regulation, social imagination, decisions and controversies, and, of course, also the other way round. any denial of this hybrid nature of technology and society blocks the road toward a serious political, democratic, collective and legal assessment of technology. this means that technologies cannot be considered as faits accomplis or extrapolitical matters of fact." this way of proceeding has also been criticised by the european data protection supervisor, according to whom it leads to justifying the ends by the means.

taking into account the need for interoperability, restrictions on the use and implementation of interoperability are required, based on the purpose specification and proportionality principles. to this extent, a distinction between the processing of data for public (enforcement) purposes and for private purposes may be needed. to achieve certain purposes for which access to data has been granted, access to the medium carrying the information (e.g., a chip) may be sufficient, for example, when verifying one's identity. there should always be clarity as to what authorities are being granted access. in the case of deployment of centralised databases, a list of authorities that have access to the data should be promulgated in an adequate, official, freely and easily accessible publication. such clarity and transparency would contribute to security and trust, and protect against abuses in the use of databases. the proportionality and purpose limitation principles are already binding under existing data protection laws. the collection and exchange of data (including interoperability) should be proportional to the goals for which the data have been collected. it will not be easy to elaborate the principles of proportionality and purpose limitation in ambient intelligence; previously collected data may serve for later developed applications or discovered purposes. creation and utilisation of databases may offer additional benefits (which are thus additional purposes), e.g., in the case of profiling.
those other (derived) purposes should, as has been indicated in the opinion of the european data protection supervisor, be treated as independent purposes for which all legal requirements must be fulfilled. technical aspects of system operation can have a great impact on the way a system works, and how the proportionality principles and purpose limitation principles are implemented since they can determine, for example, if access to the central database is necessary, or whether access to the chip or part of the data is possible and sufficient. biometric technology can be a useful tool for authentication and verification, and may even be a privacy-enhancing technology. however, it can also constitute a threat to fundamental rights and freedoms. thus, specific safeguards should be put in place. biometric safeguards have already been subject of reflection by european data protection authorities: the article working party has stated that biometric data are in most cases personal data, so that data protection principles apply to processing of such data. on the principle of proportionality, the article working party points out that it is not necessary (for the sake of authentication or verification) to store biometric data in central databases, but in the medium (e.g., a card) remaining in the control of the user. the creation and use of centralised databases should always be carefully assessed before their deployment, including prior checking by data protection authorities. in any case, all appropriate security measures should be put in place. framing biometrics is more than just deciding between central or local storage. even storage of biometric data on a smart card should be accompanied by other regulatory measures that take the form of rights for the card-holders (to know what data and functions are on the card; to exclude certain data or information from being written onto the card; to reveal at discretion all or some data from the card; to remove specific data or information from the card). biometric data should not be used as unique identifiers, mainly because biometric data still do not have sufficient accuracy. of course, this might be remedied in the progress of science and technological development. there remains, however, a second objection: using biometrics as the primary key will offer the possibility of merging different databases, which can open the doors for abuses (function creep). european advisory bodies have considered biometric data as a unique identifier. generally speaking, since the raw data might contain more information than actually needed for certain finalities (including information not known at the moment of the collection, but revealed afterwards due to progress in science, e.g., health information related to biometric data), it should not be stored. other examples of opacity rules applied to biometrics might be prohibitions on possible use of "strong" multimodal biometrics (unless for high security applications) for everyday activities. codes of conduct can be appropriate tools to further regulate the use of technology in particular sectors. ami will depend on profiling as well as authentication and identification technologies. to enable ubiquitous communication between a person and his or her environment, both things and people will have to be traced and tracked. rfid seems to offer the technological means to implement such tracking. like biometrics, rfid is an enabling technology for real-time monitoring and decision making. 
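returning briefly to the biometric safeguards discussed above, the following minimal sketch illustrates, purely as an assumption-laden example, what verification against a template held on a user-controlled medium (rather than in a central database) could look like: the template never leaves the card and only a yes/no result is returned. the 256-bit template format, the hamming-distance comparison and the threshold are hypothetical choices for illustration, not any real card standard or the working party's prescription.

```python
# illustrative sketch only: match-on-card style verification, in which the
# enrolled biometric template stays on the user-held medium and no central
# database is involved. template format, threshold and names are hypothetical.

from dataclasses import dataclass

@dataclass
class CardTemplate:
    bits: int          # enrolled template, represented as a 256-bit integer
    length: int = 256

def hamming_distance(a: int, b: int, length: int) -> int:
    # number of differing bits between two fixed-length binary codes
    return bin((a ^ b) & ((1 << length) - 1)).count("1")

def verify_on_card(card: CardTemplate, fresh_sample_bits: int,
                   threshold: float = 0.25) -> bool:
    # accept if the fraction of differing bits is below the threshold;
    # only the yes/no decision leaves the card, never the template itself
    distance = hamming_distance(card.bits, fresh_sample_bits, card.length)
    return (distance / card.length) < threshold

enrolled = CardTemplate(bits=0x0F0F_F0F0_1234_ABCD << 192)
probe = enrolled.bits ^ 0b1011          # a few bits of simulated sensor noise
print(verify_on_card(enrolled, probe))  # True: local match, no database lookup
```

a design of this kind supports the proportionality argument made above: authentication can be achieved without creating a central collection of biometric templates.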
like biometrics, rfids can advance the development of ami and provide many advantages for users, companies and consumers. no legislative action seems needed to support this developing technology. market mechanisms are handling this. there is, however, a risk to the privacy interests of the individual and for a violation of the data protection principles, as caspian and other privacy groups have stated. rfid use should be in accordance with privacy and data protection regulations. the article working party has already given some guidelines on the application of the principles of eu data protection legislation to rfids. it stresses that the data protection principles (purpose limitation principle, data quality principle, conservation principle, etc.) must always be complied with when the rfid technology leads to processing of personal data in the sense of the data protection directive. as the article working party points out, the consumer should always be informed about the presence of both rfid tags and readers, as well as of the responsible controller, the purpose of the processing, whether data are stored and the means to access and rectify data. here, techniques of (visual) indication of activation would be necessary. the data subject would have to give his consent for using and gathering information for any specific purpose. the data subject should also be informed about what type of data is gathered and whether the data will be used by the third parties. in ami, such rights may create a great burden, both on the data subject, on the responsible data controller and on all data processors. though adequate, simplified notices about the data processors' policy would be welcome (e.g., using adequate pictograms or similar means). in our opinion, such information should always be provided to consumers when rfid technology is used, even if the tag does not contain personal data in itself. the data subject should also be informed how to discard, disable or remove the tag. the right to disable the tag can relate to the consent principle of data protection, since the individual should always have the possibility to withdraw his consent. disabling the tag should at least be possible when the consent of the data subject is the sole legal basis for processing the data. disabling the tag should not lead to any discrimination of the consumer (e.g., in terms of the guarantee conditions). technological and organisational measures (e.g., the design of rfid systems) are of crucial importance in ensuring that the data protection obligations are respected (privacy by design, e.g., by technologically blocking unauthorised access to the data). thus, availability and compliance with privacy standards are of particular importance. the concept of "personal data" in the context of rfid technology is contested. wp states: in assessing whether the collection of personal data through a specific application of rfid is covered by the data protection directive, we must determine: (a) the extent to which the data processed relates to an individual, and (b) whether such data concerns an individual who is identifiable or identified. data relates to an individual if it refers to the identity, characteristics or behaviour of an individual or if such information is used to determine or influence the way in which that person is treated or evaluated. 
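before turning to the question of identifiability below, a minimal sketch may help to make the notice, consent and tag-disabling requirements described above concrete. everything in it is a hypothetical toy model: the classes, fields and purposes are assumptions for illustration, and real rfid middleware exposes no such api.

```python
# illustrative sketch only: an rfid read gated by the safeguards described
# above (processing limited to consented purposes, and a consumer-controlled
# disable flag that must be respected). purely hypothetical data structures.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Tag:
    tag_id: str
    payload: dict
    disabled: bool = False          # the consumer may disable/"kill" the tag

@dataclass
class ConsentRecord:
    purposes: set = field(default_factory=set)   # purposes consented to

def read_tag(tag: Tag, purpose: str, consent: ConsentRecord) -> Optional[dict]:
    if tag.disabled:
        return None                 # disabling the tag must be respected
    if purpose not in consent.purposes:
        return None                 # no processing beyond consented purposes
    # in a real deployment, the presence of readers and the purpose of the
    # processing would also have to be signalled to the data subject here
    return tag.payload

consent = ConsentRecord(purposes={"checkout"})
tag = Tag("TAG-001", {"product": "jacket", "size": "M"})
print(read_tag(tag, "checkout", consent))   # payload returned
print(read_tag(tag, "marketing", consent))  # None: purpose not consented to
```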
in assessing whether information concerns an identifiable person, one must apply recital of the data protection directive, which establishes that "account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person." and further: "finally, the use of rfid technology to track individual movements which, given the massive data aggregation and computer memory and processing capacity, are if not identified, identifiable, also triggers the application of the data protection directive." (see the article data protection working party, working document on data protection issues related to rfid.) further research on rfid technology and its privacy implications is recommended. this research should also aim at determining whether any legislative action is needed to address the specific privacy concerns of rfid technology. further development of codes of conduct and good practices is also recommended.

profiling is as old as life, because it is a kind of knowledge that unconsciously or consciously supports the behaviour of living beings, humans not excluded. it might well be that the insight that humans often "intuitively know" something before they "understand" it can be explained by the role profiling spontaneously plays in our minds. thus, there is no reason to prohibit automated profiling and data mining concerning individuals with opacity rules. profiling activities should in principle be ruled by transparency tools. in other words, the processing of personal data - collection, registration and processing in the strict sense - is not prohibited but submitted to a number of conditions guaranteeing the visibility, controllability and accountability of the data controller and the participation of the data subjects. data protection rules apply to profiling techniques (at least in principle). the collection and processing of traces surrounding the individual must be considered as processing of personal data in the sense of existing data protection legislation. both individual and group profiling are dependent on such collection and on the processing of data generated by the activities of individuals. and that is precisely why, in legal terms, no profiling is thinkable outside data protection. there is an ongoing debate in contemporary legal literature about the applicability of data protection to processing practices involving data that are considered anonymous, i.e., data that do not allow the identification of a specific individual. some contend that data protection rules do not allow processing practices that bring together data on certain individuals without trying to identify the said individuals (in terms of physical location or name). others contend that data protection rules do not apply to profiling practices that process data relating to non-identifiable persons (in the sense of the data protection directive). we hold that it is possible to interpret the european data protection rules in a broad manner covering all profiling practices, but the courts have not spoken on this yet. data protection should apply to all profiling practices; where there is confusion in the application and interpretation of the legal instruments, they should be adapted so that they do apply to all profiling practices. profiling practices and the consequent personalisation of the ambient intelligence environment lead to an accumulation of power in the hands of those who control the profiles and should therefore be made transparent.
the principles of data protection are an appropriate starting point to cope with profiling in a democratic constitutional state as they do impose good practices. nevertheless, while the default position of data protection is transparency ("yes, you can process, but …"), it does not exclude opacity rules ("no, you cannot process, unless …"). in relation to profiling, two examples of such rules are relevant. on the one hand, of course, there is the explicit prohibition against taking decisions affecting individuals solely on the basis of the automated application of a profile without human intervention (see article of the data protection directive, on automated individual decisions, which states: " . member states shall grant the right to every person not to be subject to a decision which produces legal effects concerning him or significantly affects him and which is based solely on automated processing of data intended to evaluate certain personal aspects relating to him, such as his performance at work, creditworthiness, reliability, conduct, etc. . subject to the other articles of this directive, member states shall provide that a person may be subjected to a decision of the kind referred to in paragraph if that decision: (a) is taken in the course of the entering into or performance of a contract, provided the request for the entering into or the performance of the contract, lodged by the data subject, has been satisfied or that there are suitable measures to safeguard his legitimate interests, such as arrangements allowing him to put his point of view; or (b) is authorized by a law which also lays down measures to safeguard the data subject's legitimate interests."). this prohibition seems obvious because, in such a situation, probabilistic knowledge is applied to a real person (we recall that personal data in the eu data protection directive refers to "any information relating to an identified or identifiable natural person" (article )). on the other hand, there is the (quintessential) purpose specification principle, which provides that the processing of personal data must meet specified, explicit and legitimate purposes. as a result, the competence to process is limited to well-defined goals, which implies that the processing of the same data for other, incompatible aims is prohibited. this, of course, substantially restricts the possibility to link different processing operations and databases for profiling or data mining objectives. the purpose specification principle is definitely at odds with the logic of interoperability and availability of personal data: the latter would imply that all databases can be used jointly for profiling purposes. in other words, the fact that the legal regime applicable to profiling and data mining is data protection does not give a carte blanche to mine and compare personal data that were not meant to be connected. the european data protection supervisor indicated in his annual report a number of processing operations that are likely to encompass specific risks to the rights and freedoms of data subjects, even if the processing does not occur upon sensitive data.
this list relates to processing operations (a) of data relating to health and to suspected offences, offences, criminal convictions or security measures, (b) intended to evaluate personal aspects relating to the data subject, including his or her ability, efficiency and conduct, (c) allowing linkages, not provided for pursuant to national or community legislation, between data processed for different purposes, and (d) for the purpose of excluding individuals from a right, benefit or contract. software can be the tool for regulating one's behaviour by simply allowing or not allowing certain acts. thus, technology constituting the "software code" can affect the architecture of the internet (and thus potentially of ami) and can provide effective means for enforcing the privacy of the individual. for example, cryptology might give many benefits: it could be used for pseudonymisation (e.g., encrypting ip addresses) and ensuring confidentiality of communication or commerce. privacy-enhancing technologies can have an important role to play, but they need an adequate legal framework. the directive on the legal protection of software obliges member states to provide appropriate remedies against a person committing any act of putting into circulation, or the possession for commercial purposes of, any means the sole intended purpose of which is to facilitate the unauthorised removal or circumvention of any technical devices which may have been applied to protect a computer program. this mechanism aims to protect programmes enforcing the intellectual property rights against circumvention. similar legal protection against circumvention of privacy-enhancing technologies could be legally foreseen. technology might go beyond what the law permits (e.g., drm prevents intellectual property infringements but at the same time might limit the rights of the lawful user). negative side effects of such technologies should be eliminated. more generally, when introducing new technology on the market, manufacturers together with relevant stakeholders should undertake a privacy impact assessment. development of a participatory impact assessment procedure would allow stakeholders to quickly identify and react to any negative features of technology (see also section . . ). the european data protection directive imposes obligations on the data controller and gives rights to the data subject. it aims to give the individual control over the collection and processing of his data. many provisions in the data protection directive have several weaknesses in an ami environment. principles of proportionality and fairness are relative and may lead to different assessments in similar situations; obtaining consent might not be feasible in the constant need for the collection and exchange of data; obtaining consent can be simply imposed by the stronger party. individuals might not be able to exercise the right to consent, right to information, access or rectification of data due to the overflow of information. thus, those rules might simply become unworkable in an ami environment. and even if workable (e.g., thanks to the help of the digital assistants), are they enough? should we not try to look for an approach granting the individual even more control? several european projects are involved in research on identity management. they focus on a decentralised approach, where a user controls how much and what kind of information he or she wants to disclose. 
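as an illustration of the pseudonymisation techniques mentioned above (for instance, replacing ip addresses with values derived under a secret key), on which decentralised identity management systems could also build, the following sketch uses a keyed hash from python's standard library. the key handling, the truncation length and the field names are simplifying assumptions, not a recommended scheme.

```python
# illustrative sketch only: pseudonymising ip addresses with a keyed hash
# (hmac-sha256) before storage, so that records about the same address remain
# linkable for the service while the address itself is not revealed. key
# management is deliberately simplified here.

import hmac
import hashlib
import secrets

PSEUDONYM_KEY = secrets.token_bytes(32)   # per-deployment secret key

def pseudonymise_ip(ip_address: str, key: bytes = PSEUDONYM_KEY) -> str:
    digest = hmac.new(key, ip_address.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]        # truncated pseudonym for storage

log_entry = {
    "client": pseudonymise_ip("192.0.2.17"),   # pseudonym, not the address
    "action": "page_view",
}
print(log_entry)
```

a keyed hash is chosen here rather than a plain hash so that a party without the key cannot simply hash candidate addresses to reverse the pseudonyms; rotating or destroying the key also limits later re-identification.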
identity management systems, while operating on a need-to-know basis, offer the user the possibility of acting under pseudonyms, under unlinkability or anonymously, if possible and desirable. among the other examples of such systems, there are projects that base their logic on the assumption that the individual has property over his data, and then could use licensing schemes when a transfer of data occurs. granting him property over the data is seen as giving him control over the information and its usage in a "distribution chain". however, it is doubtful if granting him property over the data will really empower the individual and give him a higher level of protection and control over his data. the property model also assumes that the data are disseminated under a contract. thus, the question might arise whether the data protection directive should serve as a minimum standard and thus limit the freedom of contracts. but as our dark scenarios show, there exist many cases in which the individual will not be able to freely enter into a contract. another question arises since our data are not always collected and used for commercial purposes. in most situations, the processing of personal data is a necessary condition for entering into a contractual relation (whereas the data protection directive states in article that data processing without the individual's consent to use of his personal data is legitimate when such processing is necessary for the performance of a contract). the most obvious example is the collection of data by police, social insurance and other public institutions. the individual will not always be free to give or not give his data away. the property model will not address these issues. it will also not stop the availability of the data via public means. a weakness of the property model is that it might lead to treating data only as economic assets, subject to the rules of the market. but the model's aim is different: the aim is to protect personal data, without making their processing and transfer impossible. regarding data as property also does not address the issue of the profile knowledge derived from personal data. this knowledge is still the property of the owner or the licenser of the profile. the data-as-property option also ignores the new and increasingly invisible means of data collection, such as rfids, cameras or online data collection methods. discussing the issue of whether personal data should become the individual's property does not solve the core problem. on the one hand, treating data as property may lead to a too high level of protection of personal information, which would conflict with the extensive processing needs of ami. on the other hand, it would, by default, turn personal data into a freely negotiable asset, no longer ruled by data protection, but left to market mechanisms and consent of the data subjects (more often than not to the detriment of the latter). finally, the data-as-property option loses its relevance in the light of a focus upon anonymisation and pseudonymisation of data processed in ami applications. the prime consortium proposes identity management systems controlled by data subjects. it aims to enable individuals to negotiate with service providers the disclosure of personal data according to the conditions defined. such agreement would constitute a contract. an intelligent agent could undertake the management on the user side. this solution is based on the data minimisation principle and on the current state of legislation. 
it proposes the enforcement of (some) current data protection and privacy laws. it seems to be designed more for the needs of the world today than for a future ami world. the user could still be forced to disclose more information than he or she wishes, because he or she is the weaker party in the negotiation; he or she needs the service. the fidis consortium has also proposed a decentralised identity management, the vision of which seems to go a bit further than the prime proposal. it foresees that the user profiles are stored on the user's device, and preferences relevant for a particular service are (temporarily) communicated to the service provider for the purpose of a single service. the communication of the profile does not have to imply disclosure of one's identity. if there is information extracted from the behaviour of the user, it is transferred by the ambient intelligent device back to the user, thus updating his profile. thus, some level of exchange of knowledge is foreseen in this model, which can be very important for the data subject's right to information. a legal framework for such sharing of knowledge from an ami-generated profile needs to be developed, as well as legal protection of the technical solution enabling such information management. such schemes rely on automated protocols for the policy negotiations. the automated schemes imply that the consent of the data subject is also organised by automatic means. we need a legal framework to deal with the situation wherein the explicit consent of the data subject for each collection of data is replaced by a "consent" given by an intelligent agent. in such automated models, one could envisage privacy policies following the data. such "sticky" policies, attached to personal data, would provide for clear information and indicate to data processors and controllers which privacy policy applies to the data concerned. sticky policies could facilitate the auditing and self-auditing of the lawfulness of the data processing by data controllers. in any event, research in this direction is desirable. since ami is also a mobile environment, there is a need to develop identity management systems addressing the special requirements of mobile networks. the fidis consortium has prepared a technical survey of mobile identity management. it has identified some special challenges and threats to privacy in the case of mobile networks and made certain recommendations: • location information and device characteristics both should be protected. • ease of use of the mobile identity management tools and simplified languages and interfaces for non-experts should be enhanced. • a verifiable link between the user and his digital identity has to be ensured. accordingly, privacy should also be protected in peer-to-peer relationships. the importance of consumer protection will grow in ambient intelligence, because of the likelihood that consumers will become more dependent on online products and services, and because product and service providers will strengthen their bargaining position through an increasing information asymmetry. without the constraints of law, ambient intelligence service providers could easily dictate the conditions of participation in new environments. consumer protection should find the proper balance in ami. consumer protection law defines the obligations of the producers and the rights of consumer and consists of a set of rules limiting the freedom to contract, for the benefit of the consumer. 
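the "sticky policy" idea mentioned above - a machine-readable policy that travels with the personal data and is checked before any processing - could be modelled along the following lines. this is a minimal sketch under stated assumptions: the field names, purposes and enforcement logic are invented for illustration, and real proposals depend on standardised policy languages and trusted enforcement points.

```python
# illustrative sketch only: a "sticky" privacy policy attached to a piece of
# personal data and checked by the recipient before processing. all fields,
# purposes and the enforcement logic are hypothetical assumptions.

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class StickyPolicy:
    allowed_purposes: frozenset       # e.g. {"booking", "billing"}
    expires: datetime                 # retention limit travelling with the data
    controller: str                   # who is accountable for the data

@dataclass(frozen=True)
class PersonalData:
    value: str
    policy: StickyPolicy              # the policy is bound to the data itself

def may_process(item: PersonalData, purpose: str, now: datetime) -> bool:
    policy = item.policy
    return purpose in policy.allowed_purposes and now < policy.expires

policy = StickyPolicy(frozenset({"booking"}),
                      datetime.now() + timedelta(days=30),
                      controller="travel-service.example")
record = PersonalData("alice@example.org", policy)
print(may_process(record, "booking", datetime.now()))    # True
print(may_process(record, "marketing", datetime.now()))  # False
```

such a structure could also support the auditing and self-auditing of the lawfulness of processing mentioned above, since each data item carries its own conditions.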
consumer protection law plays a role of its own, but can support the protection of privacy and data protection rights. the basis for the european framework for consumer protection rules can be found in article of the ec treaty: "in order to promote the interests of consumers and to ensure a high level of consumer protection, the community shall contribute to protecting the health, safety and economic interests of consumers, as well as to promoting their right to information, education and to organise themselves in order to safeguard their interests." consumer protection at european level is provided by (amongst others) directive / on unfair terms in consumer contracts, directive / on consumer protection in respect of distance contracts and the directive on liability for defective products (discussed below). directive / and directive / were both already discussed (in chapter , sections . . . and . . . ) . in many respects, their rules are not fitted to ami and they need to be re-adapted. this especially relates to extending the scope of protection of those directives, thereby making sure that all services and electronic means of communications and trading are covered (including those services on the world wide web not currently covered by the distance contract directive). due to the increasing complexity of online services, and due to the possibility of information overflow, it seems necessary to find legal ways to assess and recognise contracts made through the intervention of intelligent agents. is the legal system flexible enough to endorse this? moreover, the same should apply to the privacy policies and to the consent of individuals for the collection of data (because, in identity management systems, intelligent agents will decide what data are to be disclosed to whom). here is a challenge: how to technologically implement negotiability of contracts and the framework of binding law in electronic, machine-readable form? suppliers should not be allowed to set up privacy conditions which are manifestly not in compliance with the generally applicable privacy rules and which disadvantage the customer. data protection legislation and consumer protection law could constitute the minimum (or default) privacy protection level. similar rules as those currently applicable under the consumer protection of directive / on unfair terms in consumer contracts could apply. mandatory rules of consumer protection require, inter alia, that contracts be drafted in plain, intelligible language, that the consumer be given an opportunity to examine all terms, that -in cases of doubt -the interpretation most favourable to the consumer prevail. suppliers should not be allowed to unfairly limit their liability for security problems in the service they provide to the consumer. in this respect, more attention could be given to a judgment of the court of first instance of nanterre (france) in in which the online subscriber contract of aol france was declared illegal in that it contained not less than abusive clauses in its standard contractual terms (many of which infringed consumer protection law). the directive on unfair terms in consumer contracts and the directive on consumer protection in respect of distance contracts provide a broad right to information for the consumer. it should be sufficient to dispense such information in electronic form, in view of the large amount of information directed towards consumers that would have to be managed by intelligent agents. 
an increasing number of service providers will be involved in ami services and it may not be feasible to provide the required information about all of them. the solution may be to provide such information only about the service provider whom the consumer directly pays and who is responsible towards the consumer. joint liability would apply (for liability issues, see below). the right of withdrawal, foreseen by the directive / on consumer protection with respect to distance contracts, may not apply (unless otherwise agreed) to contracts in which (a) the provision of services has begun with the consumer's agreement before the end of the seven-working-day period and (b) the goods have been made to the consumer's specifications or clearly personalised, or, by their nature, cannot be returned or are liable to deteriorate or expire rapidly. in an ami world, services will be provided instantly and will be increasingly personalised. this implies that the right of withdrawal will become inapplicable in many cases. new solutions should be developed to address this problem. currently, insofar as it is not received on a permanent medium, consumers must also receive written notice, in good time, of the information necessary for proper performance of the contract. in ami, payments will often occur automatically, at the moment of ordering or even of offering the service. temporary accounts, administered by trusted third parties, could temporarily store money paid by a consumer to a product or service provider. this can support consumer protection and enforcement, in particular with respect to fraud and for effectively exercising the right of withdrawal. this would be welcome for services that are offered to consumers in the european union by service providers located in third countries, as enforcement of consumer protection rights is likely to be less effective in such situations. the possibility of group consumer litigation can increase the level of law enforcement and, especially, enforcement of consumer protection law. often an individual claim does not represent an important economic value; thus, individuals are discouraged from making efforts to enforce their rights. bodies or organisations with a legitimate interest in ensuring that the collective interests of consumers are protected can institute proceedings before courts or competent administrative authorities and seek termination of any behaviour adversely affecting consumer protection and defined by law as illegal. however, as far as actions for damages are concerned, issues such as the form and availability of group litigation are regulated by the national laws of the member states as part of procedural law. the possibility to bring such a claim is restricted to a small number of states. group litigation is a broad term which captures collective claims (single claims brought on behalf of a group of identified or identifiable individuals), representative actions (single claims brought on behalf of a group of identified individuals by, e.g., a consumer interest association) and class actions (one party or group of parties may sue as representatives of a larger class of unidentified individuals), among others. these definitions, as well as the procedural shape of such claims, vary in different member states.

the directive on electronic commerce aims to provide a common framework for information society services in the eu member states. an important feature of the directive is that it also applies to legal persons.
similar to the consumer protection legislation, the directive contains an obligation to provide certain information to customers. in view of the increasing number of service providers, it may not be feasible to provide information about all of them. providing information about the service provider whom the customer pays directly and who is responsible towards him could be a solution to the problem of the proliferating number of service providers (joint liability may also apply here). the directive should also be updated to include the possibility of concluding contracts by electronic means (including reference to intelligent agents) and to facilitate the usage of pseudonyms, trusted third parties and credentials in electronic commerce. unsolicited commercial communication is an undesirable phenomenon in cyberspace. it constitutes a large portion of traffic on the internet, using its resources (bandwidth and storage capacity) and forcing internet providers and users to adopt organisational measures to fight it (by filtering and blocking spam). spam can also constitute a security threat. the dark scenarios show that spam may become an even more serious problem than it is today. an increase in the volume of spam can be expected because of the emergence of new means of electronic communication. zero-cost models for e-mail services encourage these practices, and similar problems may be expected when mobile services pick up a zero-cost or flat-fee model. as we become increasingly dependent on electronic communication -ambient intelligence presupposes that we are almost constantly online -we become more vulnerable to spam. in the example from the first dark scenario, spamming may cause irritation and make the individual reluctant to use ambient intelligence. fighting spam may well demand even more resources than it does today as new methods of spamming -such as highly personalised and location-based advertising -emerge. currently, many legal acts throughout the world penalise unsolicited communication, but without much success. the privacy & electronic communication directive provides for an opt-in regime, applicable in the instance of commercial communication, thus inherently prohibiting unsolicited marketing. electronic communications are, however, defined as "any information exchanged or conveyed between a finite number of parties by means of a publicly available electronic communications service. this does not include any information conveyed as part of a broadcasting service to the public over an electronic communications network except to the extent that the information can be related to the identifiable subscriber or user receiving the information." the communications need to have a commercial content in order to fall under the opt-in regulation of the privacy & electronic communication directive. consequently, this directive may not cover unsolicited, location-based advertisements with a commercial content that are broadcast to a group of people ("the public"). the impact of this exception cannot be addressed yet since location-based services are still in their infancy. a broad interpretation of electronic communications is necessary (the directive is technology-neutral). 
considering any unsolicited electronic communication as spam, regardless of the content and regardless of the technological means, would offer protection that is adequate in ambient intelligence environments in which digital communications between people (and service providers) will exceed physical conversations and communications. civil damages address a harm already done, and compensate for damages sustained. effective civil liability rules might actually form one of the biggest incentives for all actors involved to adhere to the obligations envisaged by law. one could establish liability for breach of contract, or on the basis of general tort rules. to succeed in court, one has to prove the damage, the causal link and the fault. liability can be established for any damages sustained, as far as the conditions of liability are proven and so long as liability is not excluded (as in the case of some situations in which intermediary service providers are involved ). however, in ami, to establish such proof can be extremely difficult. as we have seen in the dark scenarios, each action is very complex, with a multiplicity of actors involved, and intelligent agents acting for service providers often undertake the action or decision causing the damage. who is then to blame? how easy will it be to establish causation in a case where the system itself generates the information and undertakes the actions? how will the individual deal with such problems? the individual who is able to obtain damages addressing his harm in an efficient and quick way will have the incentive to take an action against the infringer, thus raising the level of overall enforcement of the law. such an effect would be desirable, especially since no state or any enforcement agency is actually capable of providing a sufficient level of control and/or enforcement of the legal rules. the liability provisions of the directive on electronic commerce can become problematic. the scope of the liability exceptions under the directive is not clear. the directive requires isps to take down the content if they obtain knowledge on the infringing character of the content (notice-and-take-down procedure). however, the lack of a "put-back" procedure (allowing content providers whose content has been wrongfully alleged as illegal, to re-publish it on the internet) or the verification of take-down notices by third parties is said to possibly infringe freedom of speech. it is recommended that the liability rules be strengthened and that consideration be given to means that can facilitate their effectiveness. the directive provides for exceptions to the liability of intermediary service providers (isps) under certain conditions. in the case of hosting, for example, a service provider is not liable for the information stored at the request of a recipient of the service, on condition that (a) the provider does not have actual knowledge of illegal activity or information and, as regards claims for damages, is not aware of facts or circumstances from which the illegal activity or information is apparent or (b) the provider, upon obtaining such knowledge or awareness, acts expeditiously to remove or to disable access to the information. see also section . . . in addition to the general considerations regarding liability presented in this section, we also draw attention to the specific problems of liability for infringement of privacy, including security infringements. 
currently, the right to remedy in such circumstances is based on the general liability (tort) rules. the data protection directive refers explicitly to liability issues, stating that an immediate compensation mechanism shall be developed in case of liability for an automated decision based on inadequate profiles and refusal of access. however, it is not clear whether this could be understood as a departure from general rules and a strengthening of the liability regime. determining the scope of liability for privacy breaches and security infringements might also be problematic. in any case, proving the elements of a claim and meeting the general tort law preconditions (damage, causality and fault) can be very difficult. opacity instruments, as discussed above, aiming to prohibit interference with one's privacy, can help to provide some clarity as to the scope of the liability. in addition, guidelines and interpretations on liability would be generally welcome, as would standards for safety measures, to provide for greater clarity and thus greater legal certainty for both users and undertakings. as already mentioned, it can be difficult for a user to identify the party actually responsible for damages, especially if he or she does not know which parties were actually involved in the service and/or software creation and delivery. the user should be able to request compensation from the service provider with whom he or she had direct contact in the process of the service. joint and several liability (with the right to redress) should be the default rule in the case of providers of ami services, software, hardware or other products. the complexity of the actions and the multiplicity of actors justify such a position. moreover, this recommendation should be supplemented by the consumer protection recommendation requiring the provision of consumer information by the service or product provider having the closest connection with the consumer, as well as the provision of information about individual privacy rights (see above) in a way that would enable the individual to detect a privacy infringement and have a better chance to prove it in court.

there is also a need to consider the liability regime in relation to other provisions of law, in particular whether software falls under the directive on liability for defective products. in fact, the answer depends on national laws and how the directive has been implemented. the directive applies to products defined as movables, which might suggest that it refers to tangible goods. software not incorporated into a tangible medium (available online) will not satisfy such a definition. there are a growing number of devices (products) with embedded software (e.g., washing machines, microwaves, possibly rfids), which fall under the directive's regime. this trend will continue; software will be increasingly crucial for the proper functioning of products themselves, of services and of whole environments (smart car, smart home). should the distinction between the two regimes remain? strict liability is limited to death or personal injury, or damage to property intended for private use. damage relating to the product itself, to a product used in the course of business and pure economic loss will not be remedied under the directive. currently, defective software is most likely to cause financial loss only, so the injured party would not be able to rely on provisions of the directive in seeking redress. however, even now, in some life-saving applications, personal injury dangers can emerge.
such will also be the case in the ami world (see, e.g., the first and second dark scenarios in which software failures cause accidents, property damage and personal injury) so the importance and applicability of the directive on liability for defective products will grow. the increasing dependence on software applications in everyday life, the increasing danger of sustaining personal injury due to a software failure and, thus, the growing concerns of consumers justify strengthening the software liability regime. however, the directive allows for a state-of-the-art defence. under this defence, a producer is not liable if the state of scientific and technical knowledge at the time the product was put into circulation was not such that the existence of the defect would be discovered. it has been argued that the availability of such a defence (member states have the discretion whether to retain it in national laws ) will always be possible since, due to the complexity of "code", software will never be defect-free. these policy and legal arguments indicate the difficulty in broadening the scope of the directive on liability for defective products to include software. reversal of the burden of proof might be a more adequate alternative solution, one that policymakers should investigate. it is often difficult to distinguish software from hardware because both are necessary and interdependent to provide a certain functionality. similarly, it may be difficult to draw the line between software and services. transfer of information via electronic signals (e.g., downloaded software) could be regarded as a service. some courts might also be willing to distinguish between mass-market software and software produced as an individual product (on demand). ami is a highly personalised environment where software-based services will surround the individual, thus the tendency to regard software as a service could increase. strict liability currently does not apply to services. service liability is regulated by national laws. extending such provision to services may have far-reaching consequences, not only in the ict field. the ami environment will need the innovation and creativity of service providers; therefore, one should refrain from creating a framework discouraging them from taking risks. however, some procedural rules could help consumers without upsetting an equitable balance. the consumer, usually the weaker party in a conflict with the provider, often has difficulty proving damages. reversing the burden of proof might facilitate such proof. most national laws seem to provide a similar solution. since national law regulates the issue of service liability, differences between national regulations might lead to differences in the level of protection. the lack of a coherent legal framework for service liability in europe is regrettable. learning from the differences and similarities between the different national legal regimes, as indicated in the analysis of national liability systems for remedying damage caused by defective consumer services, is the first step in remedying such a situation. reversing the burden of proof is less invasive than the strict liability rules where the issue of fault is simply not taken into consideration. such a solution has been adopted in the field of the non-discrimination and intellectual property laws, as well as in national tort regimes. an exception to the general liability regime is also provided in directive / /ec on the community framework for electronic signatures. 
in that directive, the certification service provider is liable for damage caused by non-compliance with obligations imposed by the directive unless he proves he did not act negligently. technology could potentially remedy the information asymmetry between users and ami service suppliers or data processors. the latter could have an obligation to inform consumers what data are processed, how and when and what is the aim of such activities (thus actually fulfilling their obligations under the data protection directive). this information could be stored and managed by an intelligent agent on behalf of the user, who is not able to deal with such information flow. however, the user would have the possibility to use such information to enforce his rights (e.g., to prove causation). other technological solutions (e.g., watermarking) could also help the user prove his case in court. in many cases, the damage sustained by the individual will be difficult to assess in terms of the economic value or too small to actually provide an incentive to bring an action to court. however, acts causing such damage can have overall negative effects. spam is a good example. fixed damages, similar to the ones used in the united states, or punitive damages could remedy such problems (some us state laws provide for fixed damages such as us$ for each unsolicited communication without the victim needing to prove such damage). they would also provide clarity as to the sanctions or damages expected and could possibly have a deterrent effect. the national laws of each member state currently regulate availability of punitive damages; a few countries provide for punitive and exemplary damages in their tort systems. the universal service directive provides for a minimum of telecommunication services for all at an affordable price as determined by each member state. prices for universal services may depart from those resulting from market conditions. such provisions aim at overcoming a digital divide and allowing all to enjoy a certain minimum of electronic services. the directive is definitely a good start in shaping the information society and the ami environment. the development of new technologies and services generates costs, both on individuals and society. many high-added-value ami services will be designed for people who will be able to pay for them. thus, ami could reinforce the inequalities between the poor and rich. everyone should be able to enjoy the benefits of ami, at least at a minimum level. the commission should consider whether new emerging ami services should be provided to all. some services (e.g., emergency services) could even be regarded as public and provided free of charge or as part of social security schemes. as shown in scenario , ami might cause major problems for current intellectual property protection, because ami requires interoperability of devices, software, data and information, for example, for crucial information systems such as health monitoring systems used by travelling seniors. there is also a growing need to create means of intellectual property protection that respect privacy and allow for anonymous content viewing. intellectual property rights give exclusive rights over databases consisting of personal data and profiles, while the data subjects do not have a property right over their own information collected. we discuss these issues below. 
the directive on the legal protection of databases provides for copyright protection of databases if they constitute the author's own intellectual creation by virtue of his selection or arrangement of their content. the directive also foresees a sui generis protection if there has been a qualitatively and/or quantitatively substantial investment in either the acquisition, verification or presentation of the content. sui generis protection "prevents the extraction and/or the re-utilization of the whole or of a substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database". this implies that the database maker can obtain sui generis protection of a database even when its content consists of personal data. although the user does not have a property right over his personal data, the maker of a database can obtain an exclusive right over this type of data. hence, a profile built on the personal data of a data subject might constitute somebody else's intellectual property. the right to information about what knowledge has been derived from one's data could, to some extent, provide a safeguard against profiling. we recommend that further research be undertaken on how to reconcile this with intellectual property rights.

the copyright directive provides for the protection of digital rights management systems (drms) used to manage the licence rights of works that are accessed after identification or authentication of a user. but drms can violate privacy, because they can be used for processing of personal data and constructing (group) profiles, which might conflict with data protection law. less invasive ways of reconciling intellectual property rights with privacy should be considered. this relates not only to technologies but also to an estimation of the factual economic position of the customer. for example, the general terms and conditions for subscribing to an interactive television service - often a service offered by just a few players - should not impose on customers a condition that personal data relating to their viewing behaviour can be processed and used for direct marketing or for transfer to "affiliated" third parties. as the article working party advises, greater attention should be devoted to the use of pets within drm systems. in particular, it advises that tools be used to preserve the anonymity of users, and it recommends the limited use of unique identifiers (see also section . . , specific recommendation regarding security, and the article data protection working party, working document on data protection issues related to intellectual property rights (wp/ ), adopted on january , http://ec.europa.eu/justice_home/fsj/privacy). the use of unique identifiers allows profiling and the tagging of a document linked to an individual, enabling tracking for copyright abuses. such tagging should not be used unless necessary for performance of the service or with the informed consent of the individual. all relevant information required under data protection legislation should be provided to users, including the categories of collected information, the purpose of collection and information about the rights of the data subject.
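to illustrate the recommendation above on the limited use of unique identifiers, the sketch below derives a different pseudonymous identifier per content service instead of embedding one global identifier in every document, so that usage records cannot be trivially linked across services. the keyed-hash construction and all names are assumptions for illustration only, not the working party's prescribed technique.

```python
# illustrative sketch only: per-service pseudonymous identifiers instead of a
# single global unique identifier, so that content-usage records are not
# trivially linkable across services. construction and names are assumptions.

import hmac
import hashlib

def service_identifier(user_secret: bytes, service_name: str) -> str:
    # same user + same service -> stable id; different services -> unlinkable
    mac = hmac.new(user_secret, service_name.encode("utf-8"), hashlib.sha256)
    return mac.hexdigest()[:20]

secret = b"example-user-secret-material"
print(service_identifier(secret, "music-store.example"))
print(service_identifier(secret, "video-store.example"))  # a different id
```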
as noted above, the directive on the legal protection of software obliges member states to provide appropriate remedies against putting into circulation, or possessing for commercial purposes, any means the sole intended purpose of which is to facilitate the unauthorised removal or circumvention of any technical device which may have been applied to protect a computer program. the software directive, however, only protects against the putting into circulation of such devices and not against the act of circumventing as such. it would be advisable to have a uniform solution in that respect. drms can also violate consumer rights, by preventing the lawful enjoyment of the purchased product. the anti-circumvention provisions should then be coupled with better enforcement of consumer protection provisions regarding information disclosure to the consumer. the consumer should always be aware of any technological measures used to protect the content he wishes to purchase, and of restrictions on the use of such content as a consequence of technological protection - restrictions which might, inter alia, prevent the user from making backups or private copies, downloading music to portable devices or playing music on certain devices, or impose geographical restrictions such as the regional coding of dvds. he should also be informed about the technological consequences of drms for his devices, if any, e.g., about installing the software on his computer. product warnings and consumer notifications should always be in place and should aim to raise general consumer awareness about drms. as interoperability is a precondition for ami, this would have to lead to limitations on exclusive intellectual property rights; one could argue that software packages should be developed so that they are interoperable with each other. a broader scope of the decompilation right under software protection would be desirable. the ec's battle with microsoft was in part an attempt to strengthen the decompilation right with the support of competition law.

currently, there is no international or european framework determining jurisdiction in criminal matters, thus national rules are applicable. the main characteristics of the legal provisions in this matter have already been discussed in chapter , section . . . ; however, it seems useful to refer here to some of our earlier conclusions. the analysis of the connecting factors for forum selection (where a case is to be heard) shows that it is almost always possible for a judge to declare himself competent to hear a case. certain guidelines have already been developed, both in the context of the cybercrime convention and the eu framework decision on attacks against information systems, on how to resolve the issue of concurrent competences. according to the cybercrime convention, "the parties involved shall, where appropriate, consult with a view to determining the most appropriate jurisdiction for prosecution." the eu framework decision on attacks against information systems states, "where an offence falls within the jurisdiction of more than one member state and when any of the states concerned can validly prosecute on the basis of the same facts, the member states concerned shall co-operate in order to decide which of them will prosecute the offenders with the aim, if possible, of centralizing proceedings in a single member state."
legal experts and academics should follow any future developments in application of those rules that might indicate whether more straightforward rules are needed. the discussion on the green paper on double jeopardy should also be closely followed. scenario ("a crash in ami space") turns on an accident involving german tourists in italy, while travelling with a tourist company established in a third country. it raises questions about how ami might fit into a legal framework based on territorial concepts. clear rules determining the law applicable between the parties are an important guarantee of legal certainty. private international law issues are dealt at the european level by the rome convention on the law applicable to contractual obligations as well as the rome ii regulation on the law applicable to non-contractual obligations, the brussels regulation on jurisdiction and enforcement of judgments. the regulation on jurisdiction and enforcement of judgments in civil and commercial matters covers both contractual and non-contractual matters. it also contains specific provisions for jurisdiction over consumer contracts, which aim to protect the consumer in case of court disputes. these provisions should be satisfactory and workable in an ami environment. however, provisions of this regulation will not determine the forum if the defendant is domiciled outside the european union. also, the provisions on the jurisdiction for consumer contracts apply only when both parties are domiciled in eu member states. although the regulation provides for a forum if the dispute arises from an operation of a branch, agency or other establishment of the defendant in a member state, a substantial number of businesses offering services to eu consumers will still be outside the reach of this regulation. this emphasises again the need for a more global approach beyond the territory of the member states. clarification and simplification of forum selection for non-consumers would also be desirable. the complexity of the business environment, service and product creation and delivery would justify such approach. it would be of special importance for smes. currently, the applicable law for contractual obligations is determined by the rome convention. efforts have been undertaken to modernise the rome convention and replace it with a community instrument. recently, the commission has presented a proposal for a regulation of the european parliament and the council on the law applicable to contractual obligations. the provisions of the rome convention refer to contractual issues only. recently, the so-called rome ii regulation has been adopted, which provides for rules applicable to non-contractual obligations. the rome convention on law applicable to contractual obligations relies heavily on the territorial criterion. it refers to the habitual residence, the central administration or place of business as the key factors determining the national law most relevant to the case. but it services can be supplied at a distance by electronic means. the ami service supplier could have his habitual residence or central administration anywhere in the world and he could choose his place of residence (central administration) according to how beneficial is the national law of a given country. the habitual residence factor has been kept and strengthened in the commission's proposal for a new regulation replacing the rome convention (rome i proposal, article ). 
the new proposal for the rome i regulation amends the consumer protection provisions. it still relies on the habitual residence of the consumer, but it brings the consumer choice of contract law in line with the equivalent provisions of the brussels regulation, and broadens the scope of application of its provisions. as a recital of the proposal states, these amendments aim to take into account developments in distance selling, thus including ict developments. the commission proposal for the regulation on the law applicable to contractual obligations is a good step forward. the rome ii regulation on the law applicable to non-contractual obligations applies to torts and delicts, including claims arising out of strict liability. the basic rule under the regulation is that the applicable law should be determined on the basis of where the direct damage occurred (lex loci damni). however, some "escape clauses" are foreseen, providing for a more adequate solution where that is more appropriate in the case at hand. this allows for flexibility in choosing the best solution. special rules are also foreseen in the case of some specific torts or delicts. uniform rules on applicable law at the european level are an important factor for improving the predictability of litigation, and thus legal certainty. in that respect, the new regulation should be welcomed. the regulation will apply from january . some other legislative acts also contain rules on applicable law. most important are the provisions in the data protection directive. this directive also chooses the territorial criterion to determine the national law applicable to the processing of data, which is the law of the place where the processing is carried out in the context of an establishment of the data controller. such a criterion, however, might be problematic: more than one national law might be applicable. moreover, in times of globalisation of economic activity, it is easy for an undertaking to choose a place of establishment that offers the most liberal regime, beyond the reach of european data protection law. in situations when a non-eu state is involved, the directive points to a different relevant factor, the location of the equipment used, thus enabling broader application of the eu data protection directive. article ( ) of the directive stipulates: each member state shall apply the national provisions it adopts pursuant to this directive to the processing of personal data where: (a) the processing is carried out in the context of the activities of an establishment of the controller on the territory of the member state; when the same controller is established on the territory of several member states, he must take the necessary measures to ensure that each of these establishments complies with the obligations laid down by the national law applicable. the directive stipulates in article ( ) that the national law of a given member state will apply when the controller is not established on community territory and, for purposes of processing personal data, makes use of equipment, automated or otherwise, situated on the territory of the said member state, unless such equipment is used only for purposes of transit through the territory of the community. the article data protection working party interprets the term "equipment" as referring to all kinds of tools or devices, including personal computers, which can be used for many kinds of processing operations.
the definition could be extended to all devices with a capacity to collect data, including sensors, implants and maybe rfids. (active rfid chips can also collect information. they are expensive compared to passive rfid chips but they are already part of the real world.) see article data protection working party, working document on determining the international application of eu data protection law to personal data processing on the internet by non-eu based websites ( / /en/final wp ), may , http://ec.europa.eu/justice_home/fsj/privacy/docs/wpdocs/ /wp _en.pdf. as we see, in all these cases, the territorial criterion (establishment) prevails. we should consider moving towards a more personal criterion, especially since personal data are linked with an identity and a state of the data subject (issues which are regulated by the national law of the person). such a criterion could be more easily reconciled with the ami world of high mobility and without physical borders. the data subject would also be able to remain under the protection of his/her national law, and the data controller/service provider would not have the possibility of selecting a place of establishment granting him the most liberal treatment of law. the current equipment-based solution does, however, have the advantage of covering, with the protection of eu legislation, third country residents whose data are processed via equipment in the eu; a broad interpretation of the term "equipment" would help guarantee the relatively broad application of such a rule (see above). as a result, in most cases, application of the domicile/nationality rule or of the place of the equipment used as the relevant factor would have the same result. however, we can envisage the processing of data not using such equipment, for example, when the data are already posted online; then the eu law could not be applicable. see chapter , section . . . . data transfer is another issue highlighting the need for international co-operation in the creation of a common playing field for ami at the global level. what is the sense of protecting data in one country if they are transferred to a country not affording comparable (or any) safeguards? also, the globalisation of economic and other activities brings with it the necessity of exchanging personal data between countries. the data protection directive provides a set of rules on data transfer to third countries. data can be transferred only to countries offering an adequate level of protection. the commission can conclude agreements (e.g., the safe harbour agreement) with third countries which could help ensure an adequate level of protection. the commission can also issue a decision in that respect. however, the major problem is enforcement of such rules, especially in view of the fact that some "safeguards" rely on self-regulatory systems whereby companies merely promise not to violate their declared privacy policies (as is the case with the safe harbour agreement). attention by the media and consumer organisations can help in the enforcement of agreed rules. the problem of weak enforcement also emphasises the need to strengthen international co-operation with the aim of developing new enforcement mechanisms. providing assistance in good practices to countries with less experience than the european union would also be useful.
sources, working documents and notes referred to in this discussion include the following:
biometrics: legal issues and implications, background paper for the institute of prospective technological studies
report on the protection of personal data with regard to the use of smart cards
biometrics at the frontiers: assessing the impact on society, study commissioned by the libe committee of the european parliament
comments on the communication of the commission on interoperability of european databases
article data protection working party, working document on biometrics
descriptive analysis and inventory of profiling practices, fidis (future of identity in the information society), deliverable d
rfid and the perception of control: the consumer's view
article data protection working party, working document on data protection issues related to rfid technology ( / /en -wp ); though these are good examples of the involvement of stakeholders in the discussion, the results are not fully satisfactory: as a compromise between the different actors, the guidelines do not go far enough in protecting the interests of consumers
huge practical difficulties of effectively enforcing and implementing data protection, more particularly in the field of profiling
structured overview on prototypes and concepts of identity management systems
'code': privacy's death or saviour?
the weaker party in the contract is now protected by the general principles of law
privacy and identity management for europe
report on actual and possible profiling techniques in the field of ambient intelligence, fidis (future of identity in the information society), deliverable d .
ami - the european perspective on data protection legislation and privacy policies
privacy and human rights, produced by the electronic privacy information center
article (d) of directive / /ec: safeguards should be provided for subscribers against intrusion of their privacy by unsolicited communications for direct marketing purposes, in particular by means of automated calling machines, telefaxes, and e-mails, including sms messages
cogitas, ergo sum. the role of data protection law and non-discrimination law in group profiling in the private sector
spam en electronische reclame [spam and electronic advertising]
a strict product liability regime based on the directive is the basis of the claims under the general tort regime; see giensen, i.
the precautionary principle in the information society, effects of pervasive computing on health and environment
in such a case, the intelligent software agent's failure and the pet's failure might be covered by the strict liability regime
the applicability of the eu product liability directive to software
computer software and information licensing in emerging markets, the need for a viable legal framework
the contract intended to treat it as such (as opposed to an information service); see singsangob a., computer software and information licensing in emerging markets, the need for a viable legal framework
article of the directive on liability
article of the directive on liability for defective products
liability for defective products and services: the netherlands
the oecd has treated software downloads as a service for vat and customs duties purposes
the distance selling directive: points for further revision
comparative analysis of national liability systems for remedying damage caused by defective consumer services: a study commissioned by the european commission
the scenario script and section . . , on the legal analysis of private international law aspects of the scenario
regulation (ec) no / of the european parliament and of the council of on jurisdiction and the recognition and enforcement of judgments in civil and commercial matters provides that if the defendant is not domiciled in a member state, the jurisdiction of the courts of each member state shall, subject to articles and , be determined by the law of that member state, and that any person domiciled in a member state may, whatever his nationality, avail himself of the rules of jurisdiction in force in that state, and in particular those specified in annex i, in the same way as the nationals of that state. the commission has presented the proposal for a regulation of the european parliament and the council on the law applicable to contractual obligations (rome i), com ( ) final. under the rome convention, it shall be presumed that the contract is most closely connected with the country where the party who is to effect the performance which is characteristic of the contract has, at the time of conclusion of the contract, his habitual residence or, in the case of a body corporate or unincorporated, its central administration; however, if the contract is entered into in the course of that party's trade or profession, the presumption points instead to that party's principal place of business. the new proposal does not use the presumption that the country of habitual residence is the most closely connected with the case, as it is under the rome convention; in the proposal, the habitual residence of the party who is to effect the characteristic performance is the relevant factor, applied as a fixed rule rather than as a presumption. the directive on liability for defective products provides for a liability without fault (strict liability). as a recital to the directive states, strict liability shall be seen as "the sole means of adequately solving the problem, peculiar to our age of increasing technicality, of a fair apportionment of the risks inherent in modern technological production." we should keep this reasoning in mind since it seems even more adequate when thinking about the liability issues in ami. most of the "products" offered in the ami environment will consist of software-based, highly personalised services. we should then think about adjusting the liability rules to such an environment. if it is difficult to distinguish between hardware and software from a technological perspective, why should we draw such a distinction from a legal perspective? an explicit provision providing for strict liability for software can be considered. nevertheless, such a proposal is controversial, as it is said to threaten industry. since software is never defect-free, strict liability would expose software producers unfairly to claims for damages. thus, the degree of required safety of the programmes is a policy decision. strict liability could also impede innovation, especially the innovation of experimental and lifesaving applications. others argue that strict liability might increase software quality by making producers more diligent, especially in properly testing their products. despite these policy considerations, there are some legal questions about the applicability of strict liability to software. the first question is whether software can be regarded as "goods" or "products" and whether it falls under the strict liability regime. actions allowing consolidation of the small claims of individuals (i.e., group consumer actions) could also be examined.
non-discrimination law can regulate and forbid the unlawful usage of processed data, for example, in making decisions or undertaking other actions on the basis of certain characteristics of the data subjects. this makes non-discrimination law of increasing importance for ami. the creation of profiles does not fall under non-discrimination law (potential use), but decisions based on profiling (including group profiling based on anonymous data) that affect the individual might provide the grounds for application of the non-discrimination rules. they apply in the case of identifiable individuals as well as to anonymous members of the group. profiles or decisions based on certain criteria (health data, nationality, income, etc.) may lead to discrimination against individuals. it is difficult to determine when it is objectively justified to use such data and criteria, and when they are discriminatory (e.g., the processing of health-related data by insurance companies leading to decisions to raise premiums). further legislative clarity would be desirable. however, certain negative dimensions of profiling still escape the regime of non-discrimination law (e.g., manipulation of individuals' behaviour by targeted advertising); here no remedies have been identified. the non-discrimination rules should be read in conjunction with the fairness principle of data protection law. the application of the two may have similar aims and effects; they might also be complementary: can practices that fall outside the reach of non-discrimination law still be caught by the fairness principle if they are regarded as not fair, as in the example of insurance companies raising premiums after processing health data? together they can address a range of actions undertaken in ami, such as dynamic pricing or refusal to provide services (e.g., a refusal of service on the grounds that no information (profile) is available could be regarded as discriminatory). non-discrimination rules should be taken into consideration at the design stage of technology and service development. however, such issues might also be addressed by data protection legislation. in the opinion of gutwirth and de hert, principles of data protection are appropriate to cope with profiling.

key: cord- - otxft authors: altman, russ b.; mooney, sean d. title: bioinformatics date: journal: biomedical informatics doi: . / - - - _ sha: doc_id: cord_uid: otxft

why is sequence, structure, and biological pathway information relevant to medicine? where on the internet should you look for a dna sequence, a protein sequence, or a protein structure? what are two problems encountered in analyzing biological sequence, structure, and function? how has the age of genomics changed the landscape of bioinformatics? what two changes should we anticipate in the medical record as a result of these new information sources? what are two computational challenges in bioinformatics for the future? the data and knowledge emerging from the basic sciences of molecular biology and genomics have increased dramatically in the past decade. history has shown that scientific developments within the basic sciences tend to lag about a decade before their influence on clinical medicine is fully appreciated. the types of information being gathered by biologists today will drastically alter the types of information and technologies available to the health care workers of tomorrow. there are three sources of information that are revolutionizing our understanding of human biology and that are creating significant challenges for computational processing.
the most dominant new type of information is the sequence information produced by the human genome project, an international undertaking intended to determine the complete sequence of human dna as it is encoded in each of the chromosomes. the first draft of the sequence was published in (lander et al., ) and a final version was announced in coincident with the th anniversary of the solving of the watson and crick structure of the dna double helix. now efforts are under way to finish the sequence and to determine the variations that occur between the genomes of different individuals. essentially, the entire set of events from conception through embryonic development, childhood, adulthood, and aging are encoded by the dna blueprints within most human cells. given a complete knowledge of these dna sequences, we are in a position to understand these processes at a fundamental level and to consider the possible use of dna sequences for diagnosing and treating disease. while we are studying the human genome, a second set of concurrent projects is studying the genomes of numerous other biological organisms, including important experimental animal systems (such as mouse, rat, and yeast) as well as important human pathogens (such as mycobacterium tuberculosis or haemophilus influenzae). many of these genomes have recently been completely determined by sequencing experiments. these allow two important types of analysis: the analysis of mechanisms of pathogenicity and the analysis of animal models for human disease. in both cases, the functions encoded by genomes can be studied, classified, and categorized, allowing us to decipher how genomes affect human health and disease. these ambitious scientific projects not only are proceeding at a furious pace, but also are accompanied in many cases by a new approach to biology, which produces a third new source of biomedical information: proteomics. in addition to small, relatively focused experimental studies aimed at particular molecules thought to be important for disease, large-scale experimental methodologies are used to collect data on thousands or millions of molecules simultaneously. scientists apply these methodologies longitudinally over time and across a wide variety of organisms or (within an organism) organs to watch the evolution of various physiological phenomena. new technologies give us the abilities to follow the production and degradation of molecules on dna arrays (lashkari et al., ) , to study the expression of large numbers of proteins with one another (bai and elledge, ) , and to create multiple variations on a genetic theme to explore the implications of various mutations on biological function (spee et al., ) . all these technologies, along with the genome-sequencing projects, are conspiring to produce a volume of biological information that at once contains secrets to age-old questions about health and disease and threatens to overwhelm our current capabilities of data analysis. thus, bioinformatics is becoming critical for medicine in the twentyfirst century. the effects of this new biological information on clinical medicine and clinical informatics are difficult to predict precisely. it is already clear, however, that some major changes to medicine will have to be accommodated. with the first set of human genomes now available, it will soon become cost-effective to consider sequencing or genotyping at least sections of many other genomes. 
the sequence of a gene involved in disease may provide the critical information that we need to select appropriate treatments. for example, the set of genes that produces essential hypertension may be understood at a level sufficient to allow us to target antihypertensive medications based on the precise configuration of these genes. it is possible that clinical trials may use information about genetic sequence to define precisely the population of patients who would benefit from a new therapeutic agent. finally, clinicians may learn the sequences of infectious agents (such as of the escherichia coli strain that causes recurrent urinary tract infections) and store them in a patient's record to record the precise pathogenicity and drug susceptibility observed during an episode of illness. in any case, it is likely that genetic information will need to be included in the medical record and will introduce special problems. raw sequence information, whether from the patient or the pathogen, is meaningless without context and thus is not well suited to a printed medical record. like images, it can come in high information density and must be presented to the clinician in novel ways. as there are for laboratory tests, there may be a set of nondisease (or normal) values to use as comparisons, and there may be difficulties in interpreting abnormal values. fortunately, most of the human genome is shared and identical among individuals; less than percent of the genome seems to be unique to individuals. nonetheless, the effects of sequence information on clinical databases will be significant. . new diagnostic and prognostic information sources. one of the main contributions of the genome-sequencing projects (and of the associated biological innovations) is that we are likely to have unprecedented access to new diagnostic and prognostic tools. single nucleotide polymorphisms (snps) and other genetic markers are used to identify how a patient's genome differs from the draft genome. diagnostically, the genetic markers from a patient with an autoimmune disease, or of an infectious pathogen within a patient, will be highly specific and sensitive indicators of the subtype of disease and of that subtype's probable responsiveness to different therapeutic agents. for example, the severe acute respiratory syndrome (sars) virus was determined to be a corona virus using a gene expression array containing the genetic information from several common pathogenic viruses. in general, diagnostic tools based on the gene sequences within a patient are likely to increase greatly the number and variety of tests available to the physician. physicians will not be able to manage these tests without significant computational assistance. moreover, genetic information will be available to provide more accurate prognostic information to patients. what is the standard course for this disease? how does it respond to these medications? over time, we will be able to answer these questions with increasing precision, and will develop computational systems to manage this information. several genotype-based databases have been developed to identify markers that are associated with specific phenotypes and identify how genotype affects a patient's response to therapeutics. the human gene mutations database (hgmd) annotates mutations with disease phenotype. this resource has become invaluable for genetic counselors, basic researchers, and clinicians. 
additionally, the pharmacogenomics knowledge base (pharmgkb) collects genetic information that is known to affect a patient's response to a drug. as these data sets, and others like them, continue to improve, the first clinical benefits from the genome projects will be realized. . ethical considerations. one of the critical questions facing the genome-sequencing projects is "can genetic information be misused?" the answer is certainly yes. with knowledge of a complete genome for an individual, it may be possible in the future to predict the types of disease for which that individual is at risk years before the disease actually develops. if this information fell into the hands of unscrupulous employers or insurance companies, the individual might be denied employment or coverage due to the likelihood of future disease, however distant. there is even debate about whether such information should be released to a patient even if it could be kept confidential. should a patient be informed that he or she is likely to get a disease for which there is no treatment? this is a matter of intense debate, and such questions have significant implications for what information is collected and for how and to whom that information is disclosed (durfy, ; see chapter ). a brief review of the biological basis of medicine will bring into focus the magnitude of the revolution in molecular biology and the tasks that are created for the discipline of bioinformatics. the genetic material that we inherit from our parents, that we use for the structures and processes of life, and that we pass to our children is contained in a sequence of chemicals known as deoxyribonucleic acid (dna). the total collection of dna for a single person or organism is referred to as the genome. dna is a long polymer chemical made of four basic subunits. the sequence in which these subunits occur in the polymer distinguishes one dna molecule from another, and the sequence of dna subunits in turn directs a cell's production of proteins and all other basic cellular processes. genes are discrete units encoded in dna and they are transcribed into ribonucleic acid (rna), which has a composition very similar to dna. genes are transcribed into messenger rna (mrna) and a majority of mrna sequences are translated by ribosomes into protein. not all rnas are messengers for the translation of proteins. ribosomal rna, for example, is used in the construction of the ribosome, the huge molecular engine that translates mrna sequences into protein sequences. understanding the basic building blocks of life requires understanding the function of genomic sequences, genes, and proteins. when are genes turned on? once genes are transcribed and translated into proteins, into what cellular compartment are the proteins directed? how do the proteins function once there? equally important, how are the proteins turned off? experimentation and bioinformatics have divided the research into several areas, and the largest are: ( ) genome and protein sequence analysis, ( ) macromolecular structure-function analysis, ( ) gene expression analysis, and ( ) proteomics. practitioners of bioinformatics have come from many backgrounds, including medicine, molecular biology, chemistry, physics, mathematics, engineering, and computer science. it is difficult to define precisely the ways in which this discipline emerged. there are, however, two main developments that have created opportunities for the use of information technologies in biology.
the first is the progress in our understanding of how biological molecules are constructed and how they perform their functions. this dates back as far as the s with the invention of electrophoresis, and then in the s with the elucidation of the structure of dna and the subsequent sequence of discoveries in the relationships among dna, rna, and protein structure. the second development has been the parallel increase in the availability of computing power. starting with mainframe computer applications in the s and moving to modern workstations, there have been hosts of biological problems addressed with computational methods. the human genome project was completed and a nearly finished sequence was published in . the benefit of the human genome sequence to medicine is both in the short and in the long term. the short-term benefits lie principally in diagnosis: the availability of sequences of normal and variant human genes will allow for the rapid identification of these genes in any patient (e.g., babior and matzner, ) . the long-term benefits will include a greater understanding of the proteins produced from the genome: how the proteins interact with drugs; how they malfunction in disease states; and how they participate in the control of development, aging, and responses to disease. the effects of genomics on biology and medicine cannot be understated. we now have the ability to measure the activity and function of genes within living cells. genomics data and experiments have changed the way biologists think about questions fundamental to life. where in the past, reductionist experiments probed the detailed workings of specific genes, we can now assemble those data together to build an accurate understanding of how cells work. this has led to a change in thinking about the role of computers in biology. before, they were optional tools that could help provide insight to experienced and dedicated enthusiasts. today, they are required by most investigators, and experimental approaches rely on them as critical elements. twenty years ago, the use of computers was proving to be useful to the laboratory researcher. today, computers are an essential component of modern research. this is because advances in research methods such as microarray chips, drug screening robots, x-ray crystallography, nuclear magnetic resonance spectroscopy, and dna sequencing experiments have resulted in massive amounts of data. these data need to be properly stored, analyzed, and disseminated. the volume of data being produced by genomics projects is staggering. there are now more than . million sequences in genbank comprising more than billion digits. but these data do not stop with sequence data: pubmed contains over million literature citations, the pdb contains three-dimensional structural data for over , protein sequences, and the stanford microarray database (smd) contains over , experiments ( million data points). these data are of incredible importance to biology, and in the following sections we introduce and summarize the importance of sequences, structures, gene expression experiments, systems biology, and their computational components to medicine. sequence information (including dna sequences, rna sequences, and protein sequences) is critical in biology: dna, rna, and protein can be represented as a set of sequences of basic building blocks (bases for dna and rna, amino acids for proteins). 
computer systems within bioinformatics thus must be able to handle biological sequence information effectively and efficiently. one major difficulty within bioinformatics is that standard database models, such as relational database systems, are not well suited to sequence information. the basic problem is that sequences are important both as a set of elements grouped together and treated in a uniform manner and as individual elements, with their relative locations and functions. any given position in a sequence can be important because of its own identity, because it is part of a larger subsequence, or perhaps because it is part of a large set of overlapping subsequences, all of which have different significance. it is necessary to support queries such as, "what sequence motifs are present in this sequence?" it is often difficult to represent these multiple, nested relationships within standard relational database schemas. in addition, the neighbors of a sequence element are also critical, and it is important to be able to perform queries such as, "what sequence elements are seen elements to the left of this element?" for these reasons, researchers in bioinformatics are developing object-oriented databases (see chapter ) in which a sequence can be queried in different ways, depending on the needs of the user (altman, ). the sequence information mentioned in section . . is rapidly becoming inexpensive to obtain and easy to store. on the other hand, the three-dimensional structure information about the proteins that are produced from the dna sequences is much more difficult and expensive to obtain, and presents a separate set of analysis challenges. currently, only about , three-dimensional structures of biological macromolecules are known. these models are incredibly valuable resources, however, because an understanding of structure often yields detailed insights about biological function. as an example, the structure of the ribosome has been determined for several species and contains more atoms than any other structure solved to date. this structure, because of its size, took two decades to solve, and presents a formidable challenge for functional annotation (cech, ). yet, the functional information for a single structure is vastly outsized by the potential for comparative genomics analysis between the structures from several organisms and from varied forms of the functional complex, since the ribosome is ubiquitously required for all forms of life. thus a wealth of information comes from relatively few structures. to address the problem of limited structure information, the publicly funded structural genomics initiative aims to identify all of the common structural scaffolds found in nature and grow the number of known structures considerably. in the end, it is the physical forces between molecules that determine what happens within a cell; thus the more complete the picture, the better the functional understanding. in particular, understanding the physical properties of therapeutic agents is the key to understanding how agents interact with their targets within the cell (or within an invading organism). these are the key questions for structural biology within bioinformatics: . how can we analyze the structures of molecules to learn their associated function? approaches range from detailed molecular simulations (levitt, ) to statistical analyses of the structural features that may be important for function (wei and altman, ) (for more information, see http://www.rcsb.org/pdb/).
. how can we extend the limited structural data by using information in the sequence databases about closely related proteins from different organisms (or within the same organism, but performing a slightly different function)? there are significant unanswered questions about how to extract maximal value from a relatively small set of examples. . how should structures be grouped for the purposes of classification? the choices range from purely functional criteria ("these proteins all digest proteins") to purely structural criteria ("these proteins all have a toroidal shape"), with mixed criteria in between. one interesting resource available today is the structural classification of proteins (scop), which classifies proteins based on shape and function. the development of dna microarrays has led to a wealth of data and unprecedented insight into the fundamental biological machine. the premise is relatively simple: up to , gene sequences derived from genomic data are fixed onto a glass slide or filter. an experiment is performed in which two groups of cells are grown in different conditions: one group is a control group and the other is the experimental group. the control group is grown normally, while the experimental group is grown under experimental conditions. for example, a researcher may be trying to understand how a cell compensates for a lack of sugar. the experimental cells will be grown with limited amounts of sugar. as the sugar depletes, some of the cells are removed at specific intervals of time. when the cells are removed, all of the mrna from the cells is separated and converted back to dna, using special enzymes. this leaves a pool of dna molecules that are only from the genes that were turned on (expressed) in that group of cells. using a chemical reaction, the experimental dna sample is attached to a red fluorescent molecule and the control group is attached to a green fluorescent molecule. these two samples are mixed and then washed over the glass slide. the two samples contain only genes that were turned on in the cells, and they are labeled either red or green depending on whether they came from the experimental group or the control group. the labeled dna in the pool sticks or hybridizes to the same gene on the glass slide. this leaves the glass slide with up to , spots, and the genes that were turned on in the cells are now bound with a label to the appropriate spot on the slide. using a scanning confocal microscope and a laser to fluoresce the linkers, the amount of red and green fluorescence in each spot can be measured. the ratio of red to green determines whether that gene is being turned off (downregulated) in the experimental group or whether the gene is being turned on (upregulated). the experiment has now measured the activity of genes in an entire cell due to some experimental change. figure . illustrates a typical gene expression experiment from smd. computers are critical for analyzing these data, because it is impossible for a researcher to comprehend the significance of those red and green spots. currently scientists are using gene expression experiments to study how cells from different organisms compensate for environmental changes, how pathogens fight antibiotics, and how cells grow uncontrollably (as is found in cancer). a new challenge for biological computing is to develop methods to analyze these data, tools to store these data, and computer systems to collect the data automatically.
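at its core, the red-to-green comparison described above is a simple per-spot calculation. the short python sketch below illustrates the idea on invented intensity values; the gene names, intensities, and two-fold threshold are illustrative assumptions rather than part of any particular microarray pipeline.

```python
# a minimal sketch (not the smd pipeline): calling genes up- or down-regulated
# from made-up red (experimental) and green (control) spot intensities.
import math

# hypothetical spot intensities for a handful of genes
spots = {
    "geneA": (5200.0, 1300.0),   # (red, green)
    "geneB": (800.0, 3100.0),
    "geneC": (2100.0, 2000.0),
}

def log2_ratio(red, green, floor=1.0):
    """log2(red/green); the floor guards against zero intensities."""
    return math.log2(max(red, floor) / max(green, floor))

for gene, (red, green) in spots.items():
    ratio = log2_ratio(red, green)
    if ratio >= 1.0:          # at least two-fold brighter in the experimental sample
        call = "upregulated"
    elif ratio <= -1.0:       # at least two-fold brighter in the control sample
        call = "downregulated"
    else:
        call = "unchanged"
    print(f"{gene}: log2(red/green) = {ratio:+.2f} -> {call}")
```

in practice, intensities are typically normalized and quality-filtered before such calls are made.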
with the completion of the human genome and the abundance of sequence, structural, and gene expression data, a new field of systems biology that tries to understand how proteins and genes interact at a cellular level is emerging. the basic algorithms for analyzing sequence and structure are now leading to opportunities for more integrated analysis of the pathways in which these molecules participate and ways in which molecules can be manipulated for the purpose of combating disease. a detailed understanding of the role of a particular molecule in the cell requires knowledge of the context-of the other molecules with which it interacts-and of the sequence of chemical transformations that take place in the cell. thus, major research areas in bioinformatics are elucidating the key pathways by which chemicals are transformed, defining the molecules that catalyze these transformations, identifying the input compounds and the output compounds, and linking these pathways into bioinformatics networks that we can then represent computationally and analyze to understand the significance of a particular molecule. the alliance for cell signaling is generating large amounts of data related to how signal molecules interact and affect the concentration of small molecules within the cell. there are a number of common computations that are performed in many contexts within bioinformatics. in general, these computations can be classified as sequence alignment, structure alignment, pattern analysis of sequence/structure, gene expression analysis, and pattern analysis of biochemical function. as it became clear that the information from dna and protein sequences would be voluminous and difficult to analyze manually, algorithms began to appear for automating the analysis of sequence information. the first requirement was to have a reliable way to align sequences so that their detailed similarities and distances could be examined directly. needleman and wunsch ( ) published an elegant method for using dynamic programming techniques to align sequences in time related to the cube of the number of elements in the sequences. smith and waterman ( ) published refinements of these algorithms that allowed for searching both the best global alignment of two sequences (aligning all the elements of the two sequences) and the best local alignment (searching for areas in which there are segments of high similarity surrounded by regions of low similarity). a key input for these algorithms is a matrix that encodes the similarity or substitutability of sequence elements: when there is an inexact match between two elements in an alignment of sequences, it specifies how much "partial credit" we should give the overall alignment based on the similarity of the elements, even though they may not be identical. looking at a set of evolutionarily related proteins, dayhoff et al. ( ) published one of the first matrices derived from a detailed analysis of which amino acids (elements) tend to substitute for others. within structural biology, the vast computational requirements of the experimental methods (such as x-ray crystallography and nuclear magnetic resonance) for determining the structure of biological molecules drove the development of powerful structural analysis tools. in addition to software for analyzing experimental data, graphical display algorithms allowed biologists to visualize these molecules in great detail and facilitated the manual analysis of structural principles (langridge, ; richardson, ) . 
at the same time, methods were developed for simulating the forces within these molecules as they rotate and vibrate (gibson and scheraga, ; karplus and weaver, ; levitt, ) . the most important development to support the emergence of bioinformatics, however, has been the creation of databases with biological information. in the s, structural biologists, using the techniques of x-ray crystallography, set up the protein data bank (pdb) of the cartesian coordinates of the structures that they elucidated (as well as associated experimental details) and made pdb publicly available. the first release, in , contained structures. the growth of the database is chronicled on the web: the pdb now has over , detailed atomic structures and is the primary source of information about the relationship between protein sequence and protein structure. similarly, as the ability to obtain the sequence of dna molecules became widespread, the need for a database of these sequences arose. in the mid- s, the genbank database was formed as a repository of sequence information. starting with sequences and , bases in , the genbank has grown by much more than million sequences and billion bases. the genbank database of dna sequence information supports the experimental reconstruction of genomes and acts as a focal point for experimental groups. numerous other databases store the sequences of protein molecules and information about human genetic diseases. included among the databases that have accelerated the development of bioinformatics is the medline database of the biomedical literature and its paper-based companion index medicus (see chapter ). including articles as far back as and brought online free on the web in , medline provides the glue that relates many high-level biomedical concepts to the low-level molecule, disease, and experimental methods. in fact, this "glue" role was the basis for creating the entrez and pubmed systems for integrating access to literature references and the associated databases. perhaps the most basic activity in computational biology is comparing two biological sequences to determine ( ) whether they are similar and ( ) how to align them. the problem of alignment is not trivial but is based on a simple idea. sequences that perform a similar function should, in general, be descendants of a common ancestral sequence, with mutations over time. these mutations can be replacements of one amino acid with another, deletions of amino acids, or insertions of amino acids. the goal of sequence alignment is to align two sequences so that the evolutionary relationship between the sequences becomes clear. if two sequences are descended from the same ancestor and have not mutated too much, then it is often possible to find corresponding locations in each sequence that play the same role in the evolved proteins. the problem of solving correct biological alignments is difficult because it requires knowledge about the evolution of the molecules that we typically do not have. there are now, however, well-established algorithms for finding the mathematically optimal alignment of two sequences. 
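to make the idea of a mathematically optimal alignment concrete, the following python sketch implements a minimal needleman-wunsch style global alignment. it is an illustration only: the scoring scheme (+2 match, -1 mismatch, -2 per gap position) and the two short example sequences are invented, standing in for the substitution matrices and gap penalties discussed next.

```python
# a minimal needleman-wunsch style global alignment sketch (toy scoring,
# not a production tool).
def global_align(a, b, match=2, mismatch=-1, gap=-2):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # traceback from the bottom-right corner to recover one optimal alignment
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

print(global_align("HEAGAWGHEE", "PAWHEAE"))
```

filling the score matrix takes time and space proportional to the product of the sequence lengths, which is what motivates the faster heuristic tools, such as blast, described below.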
these algorithms require the two sequences and a scoring system based on ( ) exact matches between amino acids that have not mutated in the two sequences and can be aligned perfectly; ( ) partial matches between amino acids that have mutated in ways that have preserved their overall biophysical properties; and ( ) gaps in the alignment signifying places where one sequence or the other has undergone a deletion or insertion of amino acids. the algorithms for determining optimal sequence alignments are based on a technique in computer science known as dynamic programming and are at the heart of many computational biology applications (gusfield, ) . figure . shows an example of a smith-waterman matrix. unfortunately, the dynamic programming algorithms are computationally expensive to apply, so a number of faster, more heuristic methods have been developed. the most popular algorithm is the basic local alignment search tool (blast) (altschul et al., ) . blast is based on the observations that sections of proteins are often conserved without gaps (so the gaps can be ignored-a critical simplification for speed) and that there are statistical analyses of the occurrence of small subsequences within larger sequences that can be used to prune the search for matching sequences in a large database. another tool that has found wide use in mining genome sequences is blat (kent, ) . blat is often used to search long genomic sequences with significant performance increases over blast. it achieves its -fold increase in speed over other tools by storing and indexing long sequences as nonoverlapping k-mers, allowing efficient storage, searching, and alignment on modest hardware. one of the primary challenges in bioinformatics is taking a newly determined dna sequence (as well as its translation into a protein sequence) and predicting the structure of the associated molecules, as well as their function. both problems are difficult, being fraught with all the dangers associated with making predictions without hard experimental data. nonetheless, the available sequence data are starting to be sufficient to allow good predictions in a few cases. for example, there is a web site devoted to the assessment of biological macromolecular structure prediction methods. recent results suggest that when two protein molecules have a high degree (more than percent) of sequence similarity and one of the structures is known, a reliable model of the other can be built by analogy. in the case that sequence similarity is less than percent, however, performance of these methods is much less reliable. when scientists investigate biological structure, they commonly perform a task analogous to sequence alignment, called structural alignment. given two sets of threedimensional coordinates for a set of atoms, what is the best way to superimpose them so that the similarities and differences between the two structures are clear? such computations are useful for determining whether two structures share a common ancestry and for understanding how the structures' functions have subsequently been refined during evolution. there are numerous published algorithms for finding good structural alignments. we can apply these algorithms in an automated fashion whenever a new structure is determined, thereby classifying the new structure into one of the protein families (such as those that scop maintains). one of these algorithms is minrms (jewett et al., ) . 
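before the detailed description of minrms that follows, it may help to see the root-mean-squared distance (rmsd) itself, the quantity such structural comparisons score. the sketch below is a simplified illustration with invented c-alpha coordinates; it assumes the two structures have already been superimposed and that the residue correspondence is known, which is precisely the hard part that algorithms like minrms address.

```python
# a simplified rmsd sketch: assumes the two structures are already superimposed
# and that matched c-alpha atoms are listed in the same order (coordinates invented).
import numpy as np

# hypothetical c-alpha coordinates (angstroms) for five matched residues
structure_a = np.array([[ 0.0, 0.0, 0.0],
                        [ 3.8, 0.1, 0.0],
                        [ 7.5, 0.3, 0.2],
                        [11.2, 0.2, 0.5],
                        [14.9, 0.0, 0.9]])
structure_b = np.array([[ 0.2, 0.1, 0.1],
                        [ 3.9, 0.4, 0.2],
                        [ 7.7, 0.8, 0.1],
                        [11.0, 0.9, 0.8],
                        [15.3, 0.5, 1.4]])

def rmsd(xyz_a, xyz_b):
    """root-mean-squared distance over matched atom pairs."""
    diff = xyz_a - xyz_b
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

print(f"rmsd over {len(structure_a)} matched c-alpha atoms: "
      f"{rmsd(structure_a, structure_b):.2f} angstroms")
```

a full treatment would also search for the superposition and residue matching that make this value small, which is the problem described next.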
minrms works by finding the minimal root-mean-squared-distance (rmsd) alignments of two protein structures as a function of matching residue pairs. minrms generates a family of alignments, each with a different number of residue position matches. this is useful for identifying local regions of similarity in a protein with multiple domains. minrms solves two problems. first, it determines which structural superpositions, or alignments, to evaluate. then, given this superposition, it determines which residues should be considered "aligned" or matched. computationally, this is a very difficult problem. minrms reduces the search space by limiting superpositions to be the best superposition between four atoms. it then exhaustively determines all potential four-atom-matched superpositions and evaluates the alignment. given this superposition, the number of aligned residues is determined as the number of residue pairs whose c-alpha carbons (the central atom in all amino acids) are less than a certain threshold apart. the minimum average rmsd for all matched atoms is the overall score for the alignment. in figure . , an example of such a comparison is shown. a related problem is that of using the structure of a large biomolecule and the structure of a small organic molecule (such as a drug or cofactor) to try to predict the ways in which the molecules will interact. an understanding of the structural interaction between a drug and its target molecule often provides critical insight into the drug's mechanism of action. the most reliable way to assess this interaction is to use experimental methods to solve the structure of a drug-target complex. once again, these experimental approaches are expensive, so computational methods play an important role. typically, we can assess the physical and chemical features of the drug molecule and can use them to find complementary regions of the target. for example, a highly electronegative drug molecule will be most likely to bind in a pocket of the target that has electropositive features. prediction of function often relies on use of sequence or structural similarity metrics and subsequent assignment of function based on similarities to molecules of known function. these methods can guess at general function for roughly to percent of all genes, but leave considerable uncertainty about the precise functional details even for those genes for which there are predictions, and have little to say about the remaining genes. analysis of gene expression data often begins by clustering the expression data. a typical experiment is represented as a large table, where the rows are the genes on each chip and the columns represent the different experiments, whether they be time points or different experimental conditions. within each cell is the red to green ratio of that gene's experimental results. each row is then a vector of values that represent the results of the experiment with respect to a specific gene. clustering can then be performed to determine which genes are being expressed similarly. genes that are associated with similar expression profiles are often functionally associated. for example, when a cell is subjected to starvation (fasting), ribosomal genes are often downregulated in anticipation of lower protein production by the cell.
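because each expression profile is just a numeric vector, deciding whether two genes behave similarly reduces to computing a distance between the corresponding vectors. the sketch below compares two invented profiles using two common choices, euclidean distance and a correlation-based distance; the choice of metric is discussed further below.

```python
# a minimal sketch of comparing two gene expression profiles (invented log-ratio
# values across six conditions); not tied to any particular database or tool.
import numpy as np

gene_x = np.array([ 1.2,  0.8,  0.1, -0.5, -1.1, -1.4])
gene_y = np.array([ 1.0,  0.9,  0.2, -0.4, -0.9, -1.6])

def euclidean_distance(p, q):
    return float(np.linalg.norm(p - q))

def correlation_distance(p, q):
    """1 - pearson correlation: 0 for perfectly co-varying profiles, 2 for opposite ones."""
    r = float(np.corrcoef(p, q)[0, 1])
    return 1.0 - r

print(f"euclidean distance:   {euclidean_distance(gene_x, gene_y):.3f}")
print(f"correlation distance: {correlation_distance(gene_x, gene_y):.3f}")
```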
it has similarly been shown that genes associated with neoplastic progression could be identified relatively easily with this method, making gene expression experiments a powerful assay in cancer research (see guo, , for review). in order to cluster expression data, a distance metric must be determined to compare a gene's profile with another gene's profile. if the vector data are a list of values, euclidian distance or correlation distances can be used. if the data are more complicated, more sophisticated distance metrics may be employed. clustering methods fall into two categories: supervised and unsupervised. supervised learning methods require some preconceived knowledge of the data at hand. usually, the method begins by selecting profiles that represent the different groups of data, and then the clustering method associates each of the genes with the representative profile to which they are most similar. unsupervised methods are more commonly applied because these methods require no knowledge of the data, and can be performed automatically. two such unsupervised learning methods are the hierarchical and k-means clustering methods. hierarchical methods build a dendrogram, or a tree, of the genes based on their expression profiles. these methods are agglomerative and work by iteratively joining close neighbors into a cluster. the first step often involves connecting the closest profiles, building an average profile of the joined profiles, and repeating until the entire tree is built. k-means clustering builds k clusters or groups automatically. the algorithm begins by picking k representative profiles randomly. then each gene is associated with the representative to which it is closest, as defined by the distance metric being employed. then the center of mass of each cluster is determined using all of the member genes' profiles. depending on the implementation, either the center of mass or the nearest member to it becomes the new representative for that cluster. the algorithm then iterates until the new center of mass and the previous center of mass are within some threshold. the result is k groups of genes that are regulated similarly. one drawback of k-means is that one must choose the value for k. if k is too large, logical "true" clusters may be split into pieces, and if k is too small, there will be clusters that are merged. one way to determine whether the chosen k is correct is to estimate the average distance from any member profile to the center of mass. by varying k, it is best to choose the lowest k where this average is minimized for each cluster. another drawback of k-means is that different initial conditions can give different results; therefore, it is often prudent to test the robustness of the results by running multiple runs with different starting configurations (figure . ). the future clinical usefulness of these algorithms cannot be understated. in , van't veer et al. ( ) found that a gene expression profile could predict the clinical outcome of breast cancer. the global analysis of gene expression showed that some cancers were associated with different prognoses, not detectable using traditional means. another exciting advancement in this field is the potential use of microarray expression data to profile the molecular effects of known and potential therapeutic agents. this molecular understanding of a disease and its treatment will soon help clinicians make more informed and accurate treatment choices.
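the k-means procedure described above can be sketched in a few lines of python. this is a simplified illustration on invented profiles: it uses euclidean distance, random initial representatives, and a fixed number of iterations rather than a convergence threshold, and it is not a substitute for an established clustering implementation.

```python
# a simplified k-means sketch for expression profiles (invented data, euclidean
# distance, fixed iteration count); real analyses would use an established library.
import numpy as np

rng = np.random.default_rng(0)
# 12 hypothetical genes x 5 conditions: two loose groups of profiles
profiles = np.vstack([rng.normal(+1.0, 0.3, size=(6, 5)),
                      rng.normal(-1.0, 0.3, size=(6, 5))])

def kmeans(data, k, iterations=20):
    # pick k profiles at random as the initial cluster representatives
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # assign each gene to the nearest representative (euclidean distance)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each representative as the center of mass of its members
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels, centers

labels, centers = kmeans(profiles, k=2)
print("cluster assignment per gene:", labels.tolist())
```

re-running the sketch with a different random seed illustrates the sensitivity to initial conditions noted above.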
biologists have embraced the web in a remarkable way and have made internet access to data a normal and expected mode for doing business. hundreds of databases curated by individual biologists create a valuable resource for the developers of computational methods who can use these data to test and refine their analysis algorithms. with standard internet search engines, most biological databases can be found and accessed within moments. the large number of databases has led to the development of meta-databases that combine information from individual databases to shield the user from the complex array that exists. there are various approaches to this task. the entrez system from the national center for biotechnology information (ncbi) gives integrated access to the biomedical literature, protein, and nucleic acid sequences, macromolecular and small molecular structures, and genome project links (including both the human genome project and sequencing projects that are attempting to determine the genome sequences for organisms that are either human pathogens or important experimental model organisms) in a manner that takes advantage of either explicit or computed links between these data resources. the sequence retrieval system (srs) from the european molecular biology laboratory allows queries from one database to another to be linked and sequenced, thus allowing relatively complicated queries to be evaluated. newer technologies are being developed that will allow multiple heterogeneous databases to be accessed by search engines that can combine information automatically, thereby processing even more intricate queries requiring knowledge from numerous data sources. the main types of sequence information that must be stored are dna and protein. one of the largest dna sequence databases is genbank, which is managed by ncbi. genbank is growing rapidly as genome-sequencing projects feed their data (often in an automated procedure) directly into the database. figure . shows the logarithmic growth of data in genbank since . entrez gene curates some of the many genes within genbank and presents the data in a way that is easy for the researcher to use (figure . ). figure . . the exponential growth of genbank: this plot shows that since the number of bases in genbank has grown by five full orders of magnitude and continues to grow by a factor of every years. in addition to genbank, there are numerous special-purpose dna databases for which the curators have taken special care to clean, validate, and annotate the data. the work required of such curators indicates the degree to which raw sequence data must be interpreted cautiously. genbank can be searched efficiently with a number of algorithms and is usually the first stop for a scientist with a new sequence who wonders "has a sequence like this ever been observed before? if one has, what is known about it?" there are increasing numbers of stories about scientists using genbank to discover unanticipated relationships between dna sequences, allowing their research programs to leap ahead while taking advantage of information collected on similar sequences. a database that has become very useful recently is the university of california santa cruz genome assembly browser (figure . ). this data set allows users to search for specific sequences in the ucsc version of the human genome.
powered by the similarity search tool blat, users can quickly find annotations on the human genome that contain their sequence of interest. these annotations include known variations (mutations and snps), genes, comparative maps with other organisms, and many other important data. although sequence information is obtained relatively easily, structural information remains expensive on a per-entry basis. the experimental protocols used to determine precise molecular structural coordinates are expensive in time, materials, and human power. therefore, we have only a small number of structures for all the molecules characterized in the sequence databases. the two main sources of structural information are the cambridge structural database for small molecules (usually less than atoms) and the pdb for macromolecules (see section . . ), including proteins and nucleic acids, and combinations of these macromolecules with small molecules (such as drugs, cofactors, and vitamins). the pdb has approximately , high-resolution structures, but this number is misleading because many of them are small variants on the same structural architecture (figure . ). if an algorithm is applied to the database to filter out redundant structures, less than , structures remain. there are approximately , proteins in humans; therefore many structures remain unsolved (e.g., burley and bonanno, ; gerstein et al., ). (figure . . a stylized diagram of the structure of chymotrypsin, here shown with two identical subunits interacting. the red portion of the protein backbone shows α-helical regions, the blue portion shows β-strands, and the white denotes connecting coils, while the molecular surface is overlaid in gray. the detailed rendering of all the atoms in chymotrypsin would make this view difficult to visualize because of the complexity of the spatial relationships between thousands of atoms.) in the pdb, each structure is reported with its biological source, reference information, manual annotations of interesting features, and the cartesian coordinates of each atom within the molecule. given knowledge of the three-dimensional structure of molecules, the function sometimes becomes clear. for example, the ways in which the medication methotrexate interacts with its biological target have been studied in detail for two decades. methotrexate is used to treat cancer and rheumatologic diseases, and it is an inhibitor of the protein dihydrofolate reductase, an important molecule for cellular reproduction. the three-dimensional structure of dihydrofolate reductase has been known for many years and has thus allowed detailed studies of the ways in which small molecules, such as methotrexate, interact at an atomic level. as the pdb increases in size, it becomes important to have organizing principles for thinking about biological structure. scop provides a classification based on the overall structural features of proteins. it is a useful method for accessing the entries of the pdb. the ecocyc project is an example of a computational resource that has comprehensive information about biochemical pathways. ecocyc is a knowledge base of the metabolic capabilities of e. coli; it has a representation of all the enzymes in the e. coli genome and of the compounds on which they work. it also links these enzymes to their position on the genome to provide a useful interface into this information. the network of pathways within ecocyc provides an excellent substrate on which useful applications can be built.
for example, they could provide: (1) the ability to guess the function of a new protein by assessing its similarity to e. coli genes with a similar sequence, (2) the ability to ask what the effect on an organism would be if a critical component of a pathway were removed (would other pathways be used to create the desired function, or would the organism lose a vital function and die?), and (3) the ability to provide a rich user interface to the literature on e. coli metabolism. similarly, the kyoto encyclopedia of genes and genomes (kegg) provides pathway datasets for organism genomes. a postgenomic database bridges the gap between molecular biological databases and those of clinical importance. one excellent example of a postgenomic database is the online mendelian inheritance in man (omim) database, which is a compilation of known human genes and genetic diseases, along with manual annotations describing the state of our understanding of individual genetic disorders. each entry contains links to special-purpose databases and thus provides links between clinical syndromes and basic molecular mechanisms (figure . ). the smd is another example of a postgenomic database that has proven extremely useful, but has also addressed some formidable challenges. as discussed previously in several sections, expression data are often represented as vectors of data values. in addition to the ratio values, the smd stores images of individual chips, complete with annotated gene spots (see figure . ). further, the smd must store experimental conditions, the type and protocol of the experiment, and other data associated with the experiment. arbitrary analysis can be performed on different experiments stored in this unique resource. a critical technical challenge within bioinformatics is the interconnection of databases. as biological databases have proliferated, researchers have been increasingly interested in linking them to support more complicated requests for information. some of these links are natural because of the close connection of dna sequence to protein structure (a straightforward translation). other links are much more difficult because the semantics of the data items within the databases are fuzzy or because good methods for linking certain types of data simply do not exist. for example, in an ideal world, a protein sequence would be linked to a database containing information about that sequence's function. unfortunately, although there are databases about protein function, it is not always easy to assign a function to a protein based on sequence information alone, and so the databases are limited by gaps in our understanding of biology. some excellent recent work in the integration of diverse biological databases has been done in connection with the ncbi entrez/pubmed systems, the srs resource, discoverylink, and the biokleisli project. the human genome sequencing projects will be complete within a decade, and if the only raison d'etre for bioinformatics is to support these projects, then the discipline is not well founded. if, on the other hand, we can identify a set of challenges for the next generations of investigators, then we can more comfortably claim disciplinary status for the field. fortunately, there is a series of challenges for which the completion of the first human genome sequence is only the beginning. with the first human genome in hand, the possibilities for studying the role of genetics in human disease multiply.
a new challenge immediately emerges, however: collecting individual sequence data from patients who have disease. researchers estimate that more than percent of the dna sequences within humans are identical, but the remaining sequences are different and account for our variability in susceptibility to and development of disease states. it is not unreasonable to expect that for particular disease syndromes, the detailed genetic information for individual patients will provide valuable information that will allow us to tailor treatment protocols and perhaps let us make more accurate prognoses. there are significant problems associated with obtaining, organizing, analyzing, and using this information. there is currently a gap in our understanding of disease processes. although we have a good understanding of the principles by which small groups of molecules interact, we are not able to fully explain how thousands of molecules interact within a cell to create both normal and abnormal physiological states. as the databases continue to accumulate information ranging from patient-specific data to fundamental genetic information, a major challenge is creating the conceptual links between these databases to create an audit trail from molecular-level information to macroscopic phenomena, as manifested in disease. the availability of these links will facilitate the identification of important targets for future research and will provide a scaffold for biomedical knowledge, ensuring that important literature is not lost within the increasing volume of published data. an important opportunity within bioinformatics is the linkage of biological experimental data with the published papers that report them. electronic publication of the biological literature provides exciting opportunities for making data easily available to scientists. already, certain types of simple data that are produced in large volumes are expected to be included in manuscripts submitted for publication, including new sequences that are required to be deposited in genbank and new structure coordinates that are deposited in the pdb. however, there are many other experimental data sources that are currently difficult to provide in a standardized way, because the data either are more intricate than those stored in genbank or pdb or are not produced in a volume sufficient to fill a database devoted entirely to the relevant area. knowledge base technology can be used, however, to represent multiple types of highly interrelated data. knowledge bases can be defined in many ways (see chapter ); for our purposes, we can think of them as databases in which ( ) the ratio of the number of tables to the number of entries per table is high compared with usual databases, ( ) the individual entries (or records) have unique names, and ( ) the values of many fields for one record in the database are the names of other records, thus creating a highly interlinked network of concepts. the structure of knowledge bases often leads to unique strategies for storage and retrieval of their content. to build a knowledge base for storing information from biological experiments, there are some requirements. first, the set of experiments to be modeled must be defined. second, the key attributes of each experiment that should be recorded in the knowledge base must be specified. 
third, the set of legal values for each attribute must be specified, usually by creating a controlled terminology for basic data or by specifying the types of knowledge-based entries that can serve as values within the knowledge base. the development of such schemes necessitates the creation of terminology standards, just as in clinical informatics. the riboweb project is undertaking this task in the domain of rna biology (chen et al., ) . riboweb is a collaborative tool for ribosomal modeling that has at its center a knowledge base of the ribosomal structural literature. riboweb links standard bibliographic references to knowledge-based entries that summarize the key experimental findings reported in each paper. for each type of experiment that can be performed, the key attributes must be specified. thus, for example, a cross-linking experiment is one in which a small molecule with two highly reactive chemical groups is added to an ensemble of other molecules. the reactive groups attach themselves to two vulnerable parts of the ensemble. because the molecule is small, the two vulnerable areas cannot be any further from each other than the maximum stretched-out length of the small molecule. thus, an analysis of the resulting reaction gives information that one part of the ensemble is "close" to another part. this experiment can be summarized formally with a few features-for example, target of experiment, cross-linked parts, and cross-linking agent. the task of creating connections between published literature and basic data is a difficult one because of the need to create formal structures and then to create the necessary content for each published article. the most likely scenario is that biologists will write and submit their papers along with the entries that they propose to add to the knowledge base. thus, the knowledge base will become an ever-growing communal store of scientific knowledge. reviewers of the work will examine the knowledge-based elements, perhaps will run a set of automated consistency checks, and will allow the knowledge base to be modified if they deem the paper to be of sufficient scientific merit. riboweb in prototype form can be accessed on the web. one of the most exciting goals for computational biology and bioinformatics is the creation of a unified computational model of physiology. imagine a computer program that provides a comprehensive simulation of a human body. the simulation would be a complex mathematical model in which all the molecular details of each organ system would be represented in sufficient detail to allow complex "what if ?" questions to be asked. for example, a new therapeutic agent could be introduced into the system, and its effects on each of the organ subsystems and on their cellular apparatus could be assessed. the side-effect profile, possible toxicities, and perhaps even the efficacy of the agent could be assessed computationally before trials are begun on laboratory animals or human subjects. the model could be linked to visualizations to allow the teaching of medicine at all grade levels to benefit from our detailed understanding of physiological processes-visualizations would be both anatomic (where things are) and functional (what things do). finally, the model would provide an interface to human genetic and biological knowledge. 
what more natural user interface could there be for exploring physiology, anatomy, genetics, and biochemistry than the universally recognizable structure of a human that could be browsed at both macroscopic and microscopic levels of detail? as components of interest were found, they could be selected, and the available literature could be made available to the user. the complete computational model of a human is not close to completion. first, all the participants in the system (the molecules and the ways in which they associate to form higher-level aggregates) must be identified. second, the quantitative equations and symbolic relationships that summarize how the systems interact have not been elucidated fully. third, the computational representations and computer power to run such a simulation are not in place. researchers are, however, working in each of these areas. the genome projects will soon define all the molecules that constitute each organism. research in simulation and the new experimental technologies being developed will give us an understanding of how these molecules associate and perform their functions. finally, research in both clinical informatics and bioinformatics will provide the computational infrastructure required to deliver such technologies. bioinformatics is closely allied to clinical informatics. it differs in its emphasis on a reductionist view of biological systems, starting with sequence information and moving to structural and functional information. the emergence of the genome sequencing projects and the new technologies for measuring metabolic processes within cells is beginning to allow bioinformaticians to construct a more synthetic view of biological processes, which will complement the whole-organism, top-down approach of clinical informatics. more importantly, there are technologies that can be shared between bioinformatics and clinical informatics because they both focus on representing, storing, and analyzing biological data. these technologies include the creation and management of standard terminologies and data representations, the integration of heterogeneous databases, the organization and searching of the biomedical literature, the use of machine learning techniques to extract new knowledge, the simulation of biological processes, and the creation of knowledge-based systems to support advanced practitioners in the two fields. the proceedings of one of the principal meetings in bioinformatics, this is an excellent source for up-to-date research reports. 
other important meetings include those sponsored by the
this introduction to the field of bioinformatics focuses on the use of statistical and artificial intelligence techniques in machine learning
introduces the different microarray technologies and how they are analyzed
dna and protein sequence analysis - a practical approach: this book provides an introduction to sequence analysis for the interested biologist with limited computing experience
this edited volume provides an excellent introduction to the use of probabilistic representations of sequences for the purposes of alignment and multiple alignment
this primer provides a good introduction to the basic algorithms used in sequence analysis, including dynamic programming for sequence alignment
algorithms on strings, trees and sequences: computer science and computational biology: gusfield's text provides an excellent introduction to the algorithmics of sequence and string analysis, with special attention paid to biological sequence analysis problems
artificial intelligence and molecular biology: this volume shows a variety of ways in which artificial intelligence techniques have been used to solve problems in biology
genotype to phenotype: this volume offers a useful collection of recent work in bioinformatics
another introduction to bioinformatics, this text was written for computer scientists
the textbook by stryer is well written, and is illustrated and updated on a regular basis. it provides an excellent introduction to basic molecular biology and biochemistry
in what ways will bioinformatics and medical informatics interact in the future? will the research agendas of the two fields merge?
will the introduction of dna and protein sequence information change the way that medical records are managed in the future? which types of systems will be most affected (laboratory, radiology, admission and discharge, financial)?
it has been postulated that clinical informatics and bioinformatics are working on the same problems, but in some areas one field has made more progress than the other. why should an awareness of bioinformatics be expected of clinical informatics professionals? should a chapter on bioinformatics appear in a clinical informatics textbook? explain your answers.
one major problem with introducing computers into clinical medicine is the extreme time and resource pressure placed on physicians and other health care workers. will the same problems arise in basic biomedical research?
why have biologists and bioinformaticians embraced the web as a vehicle for disseminating data so quickly, whereas clinicians and clinical informaticians have been more hesitant to put their primary data online?
key: cord- -j lflryj authors: waller, anna e.; scholer, matthew; ising, amy i.; travers, debbie a. title: using emergency department data for biosurveillance: the north carolina experience date: - - journal: infectious disease informatics and biosurveillance doi: . / - - - - _ sha: doc_id: cord_uid: j lflryj
biosurveillance is an emerging field that provides early detection of disease outbreaks by collecting and interpreting data on a variety of public health threats. the public health system and medical care community in the united states have wrestled with developing new and more accurate methods for earlier detection of threats to the health of the public.
the benefits and challenges of using emergency department data for surveillance are described in this chapter through examples from one biosurveillance system, the north carolina disease event tracking and epidemiologic collection tool (nc detect). ed data are a proven tool for biosurveillance, and the ed data in nc detect have proved to be effective for a variety of public health uses, including surveillance, monitoring and investigation. a distinctive feature of ed data for surveillance is their timeliness. with electronic health information systems, these data are available in near real-time, making them particularly useful for surveillance and situational awareness in rapidly developing public health outbreaks or disasters. challenges to using ed data for biosurveillance include the reliance on free text data (often in chief complaints). problems with textual data are addressed in a variety of ways, including preprocessing data to clean the text entries and address negation. the use of ed data for public health surveillance can significantly increase the speed of detecting, monitoring and investigating public health events. biosurveillance systems that are incorporated into hospital and public health practitioner daily work flows are more effective and easily used during a public health emergency. the flexibility of a system such as nc detect helps it meet this level of functionality. biosurveillance is an emerging field that provides early detection of disease outbreaks by collecting and interpreting data on a variety of public health threats, including emerging infectious diseases (e.g., avian influenza), vaccine preventable diseases (e.g., pertussis) and bioterrorism (e.g., anthrax). with the centers for disease control and prevention's (cdc) initial focus on bioterrorism preparedness at the state and local level in and the subsequent anthrax outbreak of , the public health system and medical care community in the united states have wrestled with developing new and more accurate methods for earlier detection of threats to the health of the public. earlier detection, both intuitively and as illustrated through predictive mathematical models, is believed to save lives, prevent morbidity and preserve resources (kaufman et al., ) . biosurveillance systems use healthrelated data that generally precede diagnoses and that signal a sufficient probability of a case or an outbreak to warrant further public health response . rapid detection of disease outbreaks rests on a foundation of accurate classification of patient symptoms early in the course of their illness. electronic emergency department (ed) records are a major source of data for biosurveillance systems because these data are timely, population-based and widely available in electronic form (lober et al., ; teich et al., ) . there are more than million ed visits annually in the united states, and eds represent the only universally accessible source of outpatient healthcare that is available h a day, days a week (nawar et al., ) . eds see patients from all age groups and socioeconomic classes. patients may present with early, nonspecific symptoms or with advanced disease. the accessibility of eds provides a likely healthcare setting for many of the patients involved in a disease outbreak of public health significance. in recent years, eds have steadily adopted electronic medical records technology , which has facilitated the replacement of drop-in manual surveillance using ed data with ongoing, real-time surveillance. 
ed data have been shown to detect outbreaks - weeks earlier than traditional public health reporting channels lober et al., ; tsui et al., ; wagner et al., ) . the ed data elements that are used for biosurveillance include the chief complaint (a brief description of the patient's primary symptom(s)), the triage nurse's note (an expansion of the chief complaint that includes the history of present illness), other clinical notes (e.g., physician and nurses' progress and summary notes), initial measured temperature, and diagnosis codes. the most widely used ed data element is the chief complaint because it is recorded electronically by most eds and may precede entry of a diagnosis or transcription of physician notes by days or weeks . the triage note increases the amount of data available, which makes it more likely that biosurveillance algorithms will detect disease outbreaks. triage notes are becoming more available in electronic form, and one study found that adding triage notes increased the sensitivity of outbreak detection . several challenges to using ed data for biosurveillance have been identified varney & hirshon, ) , including costs to eds and public health, the lack of standardization of ed data, and security and confidentiality. many eds still document patient symptoms manually; even when the data are electronic, they are often entered in free text form instead of using standardized terms. timeliness is also a concern; while some ed data elements are entered into electronic systems at the start of the ed visit, other elements are added hours, days or even weeks later. even though there is no formal standard or best practices dictating how soon data should be available after an ed visit or other health system encounter for early detection, most surveillance systems aim for near real-time data, available within hours. the benefits and challenges of using ed data for surveillance will be described in more detail through examples from one biosurveillance system, the north carolina disease event tracking and epidemiologic collection tool (nc detect). nc detect evolved from a pilot project in to demonstrate the collection of timely, standardized ed data for public health surveillance and research. nc detect has since grown to incorporate ed visit data from % of / acute care hospital eds in the state of north carolina and has developed and implemented many innovative surveillance tools, including the emergency medicine text processor (emt-p) for ed chief complaint data and research-based syndrome definitions. nc detect now provides twice-daily ed data feeds to cdc's biosense and has over registered users at the state, regional and local levels across north carolina. this chapter will review the use of ed data for biosurveillance, including appropriate case studies from nc detect. ed data have been collected for decades for a variety of public health surveillance needs and have been incorporated into electronic systems designed to analyze data related to trauma, injury and substance abuse, among others. public health officials have used event-based or drop-in biosurveillance systems that include ed data during major events, including the olympic games, political conventions, heat waves, after major hurricanes, and after the identification of known widespread outbreaks (weiss et al., ; davis et al., ; lee et al., ; rydman et al., ) . 
many of these systems have required users to do manual abstractions from medical charts or to enter data into stand-alone systems for specific symptoms of interest. for example, the emergency id net program, established in the late s, created a network of select eds to manually collect data to study syndromes related to emerging infections of interest to the cdc, using paper forms and standardized computer screens (talan et al., ). secondary data, data that are generated as part of normal patient treatment and billing, are generally extracted from hospital information system(s) through automated programs either in real-time (at the time of record generation) or near real-time (hourly, every h, daily). while there are several different approaches to automated extraction programs, most rely either on delimited text batch files or hl7 messages. surveillance systems that use secondary data are intended to be less burdensome to ed staff and less costly than systems requiring manual abstraction (rodewald et al., ). this methodology of ed data collection has become standard practice for biosurveillance systems using ed data, including nc detect, rods, essence and ears, among others (hutwagner et al., ; ising et al., ; lombardo, ; wagner et al., ). according to the international society for disease surveillance (isds) state syndromic surveillance use survey, % of responding states (n = ) performing syndromic surveillance use ed data (http://isds.wikispaces.com/ registry_project, accessed june , ) (figure - ). while most states and regions rely on ed chief complaint data, there is interest in increasing the number of ed data elements collected, as evidenced by the american health information community's biosurveillance minimum data set (http://www.hhs.gov/healthit/ahic/materials/meeting / . bio/bdsg_minimum_dataset.doc, accessed june , ). the recommendations from the american health information community include additional emergency department data elements, such as triage notes, vital signs and icd- -cm-based diagnoses. the biosurveillance minimum data set is currently under formal evaluation for its utility at cdc-funded sites in indiana, new york and washington/idaho. while infectious disease surveillance has traditionally relied on laboratory results and the reporting of mandated reportable conditions by medical practitioners, ed visit data and timely symptom-based analysis provide additional means for early identification of infectious disease outbreaks. areas of particular interest include the cdc's list of potential bioterrorism agents, as well as post-disaster (e.g., hurricane, earthquake, chemical spill) surveillance (cdc, ). the ability to create effective syndromes to use with ed visit data is of paramount importance to their timely use for public health surveillance. the structure of syndrome definitions used in biosurveillance is dependent on the design of the system and the nature of the data under surveillance. individual systems use different methods to identify specific disease symptoms in the chief complaint and triage note data. this includes deterministic methods, such as keyword searching, and probabilistic methods, such as naïve bayesian and probabilistic machine learning (espino et al., ). syndrome definitions then classify records into syndromic categories based on which symptoms are identified. to date, no best practices exist to guide syndrome definition development and evaluation (sosin & dethomasis, ).
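as a toy illustration of the probabilistic (naïve bayesian) approach mentioned above, the sketch below trains a simple classifier on a handful of invented chief complaints. the categories, training examples and library choice (scikit-learn) are illustrative assumptions and are not drawn from any of the systems cited here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy training data: free-text chief complaints labeled with syndromic categories
complaints = ["cough and fever", "shortness of breath", "vomiting and diarrhea",
              "nausea and vomiting since yesterday", "productive cough with fever",
              "abd pain with diarrhea"]
labels = ["respiratory", "respiratory", "gastrointestinal",
          "gastrointestinal", "respiratory", "gastrointestinal"]

# bag-of-words features feeding a multinomial naive Bayes classifier
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(complaints, labels)

print(model.predict(["fever with cough x 2 days", "diarrhea and vomiting"]))
# on this toy data the predictions lean toward ['respiratory', 'gastrointestinal']
```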
which syndromes are monitored and which symptoms are associated with each syndrome varies according to the system under consideration. furthermore, syndrome structure may vary depending upon which data elements, in addition to chief complaint, are available and their timeliness. syndrome structure refers to how many symptoms are required, within which data fields they must be found, and which boolean operators are employed to determine whether a certain record matches a particular syndrome. in september , the isds sponsored a consultative meeting on chief complaint classifiers and standardized syndromic definitions (chapman & dowling, ). at this meeting, representatives from eleven syndromic surveillance systems throughout the country, including nc detect, met to discuss which syndromes they monitor and which chief complaint-based clinical conditions they include in each syndrome. clinical conditions are medical concepts which may be represented by multiple possible data inputs to the system. for example, the concept of "dyspnea" may be represented by signs and symptoms of dyspnea as recorded in an ed chief complaint or triage note field (e.g., "shortness of breath" (sob)), a clinical observation such as an abnormal vital sign (e.g., low oxygen saturations or increased respiratory rate), clinical findings (e.g., abnormal breath sounds on pulmonary exam), abnormal lab findings (e.g., abnormal abg or positive culture results), imaging studies (e.g., infiltrate on chest x-ray), or certain icd- -cm diagnosis codes (e.g., , pneumonia). the meeting participants reached consensus on best practices for which clinical conditions to associate with each of six different syndromes (sensitive and specific versions of respiratory syndrome and gastrointestinal syndrome, constitutional syndrome and influenza-like illness syndrome). through online collaboration and periodic conference calls, the group continues the process of defining specific chief complaint search terms/keywords which best represent these clinical conditions. while the process of identifying specific chief complaint search terms/keywords to group into syndromes presents several technical challenges, the timeliness of chief complaints outweighs the benefits of any standardized data that are not available within hours of the ed visit. textual data such as chief complaint and triage note present several problems, including misspellings and use of ed-specific and locally-developed acronyms, abbreviations and truncations. there are two main approaches to dealing with the variability in textual surveillance data: (1) incorporating keywords in the actual search query statements; or (2) preprocessing the data. in systems that build various keyword searches (e.g., lexical variants, synonyms, misspellings, acronyms, and abbreviations) into the actual surveillance tools, elaborate search statements are constructed, employing statistical software such as sas (cary, nc), or structured query language (sql, microsoft, redmond, wa) (forbach et al., ; heffernan et al., ). in systems with preprocessors, the data are cleaned prior to application of a syndromic classification algorithm (mikosz et al., ; shapiro, ).
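the two approaches above (keyword-based search statements and preprocessing) can be illustrated together in a single small sketch. the term map, negation triggers and syndrome definitions below are toy stand-ins for the emt-p, negex and syndrome-definition resources discussed in this section, not the actual tools; the boolean structure shown (at least one syndrome-specific keyword and one constitutional keyword in the chief complaint or triage note) mirrors the kind of structure described here and used later in the chapter for nc detect.

```python
import re

# toy term map and negation triggers, standing in for EMT-P- and NegEx-style
# resources (the real tools are far more extensive than this)
SYNONYMS = {"sorethroat": "sore throat", "nasaue": "nausea",
            "sob": "shortness of breath", "n/v": "nausea vomiting"}
NEGATION = [r"\bdenies\b", r"\bno\b", r"\bwithout\b", r"\(-\)"]

# toy syndrome definitions: a record matches when it contains at least one
# syndrome-specific keyword AND one constitutional keyword
SYNDROMES = {
    "respiratory": (["cough", "shortness of breath"], ["fever", "chills", "malaise"]),
    "gastrointestinal": (["vomiting", "diarrhea", "nausea"], ["fever", "chills", "malaise"]),
}

def preprocess(text):
    text = re.sub(r"\s+", " ", text.lower().strip())
    for local_term, standard in SYNONYMS.items():      # replace local terms and misspellings
        text = text.replace(local_term, standard)
    pattern = "(" + "|".join(NEGATION) + r")[^,.;]*"   # crude negation scope: up to next punctuation
    return re.sub(pattern, "", text)

def has_any(text, keywords):
    return any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords)

def classify(chief_complaint, triage_note=""):
    text = preprocess(chief_complaint + ", " + triage_note)
    return [name for name, (specific, constitutional) in SYNDROMES.items()
            if has_any(text, specific) and has_any(text, constitutional)]

print(classify("cough x 3 days, denies n/v", "pt febrile, fever 102, (-) chills"))
# -> ['respiratory']  (the negated nausea/vomiting and chills are ignored)
```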
the preprocessors clean text entries, replacing synonyms and local terms (e.g., throat pain, throat discomfort, ear/nose/throat problem), as well as misspellings, abbreviations, and truncated words (e.g., sorethroat, sore throaf, soar throat, st, s/t, sore thrt, sofe throat, ent prob), with standard terms (e.g., sore throat) or standard identifiers (e.g., umls ® concept unique identifier c ) (nlm, ) . preprocessors often include normalization tools to eliminate minor differences in case, inflection and word order and to remove stop words (nlm, ) . while there is no consensus about which approach is best, many biosurveillance programs are implementing preprocessors to improve operations (dara et al., ; hripscak et al., ; komatsu et al., ) . use of preprocessors can streamline maintenance of existing and development of new surveillance queries. query processing time is also faster, resulting in better overall biosurveillance system performance. one such preprocessor is the emergency medical text processor (emt-p), which was developed to process free text chief complaint data (e.g., chst pn, ches pai, chert pain, cp, chest/abd pain) in order to extract standard terms (e.g., chest pain) from emergency departments . emt-p has been evaluated by biosurveillance researchers in pennsylvania and found to improve syndromic classification (dara et al., ) . the developers continue to improve emt-p and have made it publicly available (travers, ). while clinical text such as triage notes can improve the accuracy of keyword-based syndrome queries, the data require processing to address negated terms hripcsak et al., ) . one study evaluated negex, a negation tool developed at the university of pittsburgh (chapman et al., ) . negex is a simple regular expression algorithm that filters out negated phrases from clinical text. the negex system was modified (to include the negation term (-)) and then combined with selected modules from emt-p that replaced synonyms (e.g., dec loc with consciousness decreased) and misspellings (nasaue with nausea) for use in nc detect. the pilot results show that this combination of emt-p and negex leads to more accurate negation processing . another ed data element available for biosurveillance is the final diagnosis, which is widely available in electronic form and is standardized using the international classifications of diseases, ninth revision, clinical modification (icd- -cm) (usdhhs, ) . all eds generate electronic icd- -cm diagnoses as they are required for billing purposes (ncipc, ; . there is, however, some evidence that diagnosis data are not always available in a timely manner. in contrast to chief complaint data, which are generally entered into ed information systems by clinicians in real-time, icd- -cm diagnoses are often entered into the system by coders well after the ed visit. sources of icd- -cm data may vary, which may influence the quality of the data. traditionally, diagnoses have been assigned to ed visits by trained coders who are employed by the hospital and/or physician professional group. the primary purpose of the coding is billing, as opposed to secondary uses such as surveillance. recently, emergency department information systems (edis) have come on the market that allow for diagnosis entry by clinicians. 
these systems typically include drop-down boxes with text that corresponds to icd- -cm codes; clinicians can then select a "clinical impression" at the end of the ed visit and the corresponding icd- -cm code becomes part of the edis data available for surveillance. in a study of regional surveillance systems in north carolina and washington, biosurveillance developers found that over half of the eds did not have electronic diagnosis data until week or more after the ed visit. in a follow-up study, researchers prospectively measured the time of availability of electronic icd- -cm codes in nc detect for all ed visits on / / . the study confirmed that fewer than half of the eds sent diagnoses within week of the visit, and that it took weeks to get at least one diagnosis for two-thirds of the visits. seven ( %) of the hospitals had diagnoses for less than two-thirds of their ed visits at the week mark. diagnosis data are universally available from nc eds, and studies have shown that icd- -cm data alone or in combination with chief complaint data are more valid than chief complaint data alone for syndromic surveillance (beitel et al., ). this study corroborated the earlier study, however, that indicated the majority of north carolina hospitals cannot send diagnosis data soon enough for timely, population-based biosurveillance. in addition to ed data, nc detect receives data hourly from the statewide carolinas poison center, and daily data feeds from the statewide emergency medical system (ems) data collection center, a regional wildlife center, selected urgent care centers, and three laboratories of the nc state college of veterinary medicine (microbiology, immunology and vector-borne diseases laboratories) (waller et al., ). nc detect assists local, regional and state public health professionals and hospital users in identifying, monitoring, and responding to potential terrorism events, man-made and natural disasters, human and animal disease outbreaks and other events of public health significance. this system makes it possible for public health officials to conduct daily surveillance for clinical syndromes that may be caused by infectious, chemical or environmental agents, and serves north carolina by developing best practices for collecting and standardizing quality data.
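the timeliness measurements described above amount to tabulating the lag between an ed visit and the arrival of its first coded diagnosis. a minimal sketch with pandas follows; the column names, dates and the seven-day cut-off are hypothetical and do not reflect the actual nc detect schema or study protocol.

```python
import pandas as pd

# hypothetical extract: one row per ED visit, with the visit date and the date
# the first coded diagnosis arrived (NaT when none has been received yet)
visits = pd.DataFrame({
    "hospital": ["A", "A", "B", "B", "C"],
    "visit_date": pd.to_datetime(["2008-01-01", "2008-01-02", "2008-01-01",
                                  "2008-01-03", "2008-01-02"]),
    "first_dx_date": pd.to_datetime(["2008-01-05", None, "2008-01-20",
                                     "2008-01-04", None]),
})

lag_days = (visits["first_dx_date"] - visits["visit_date"]).dt.days
visits["dx_within_7d"] = lag_days.le(7)

# share of visits per hospital with at least one diagnosis available within a week
print(visits.groupby("hospital")["dx_within_7d"].mean())
```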
this finding is an indication of a well designed and robust system (aylin et al., ) . the syndromes monitored in nc detect are derived from the cdc's text-based clinical case definitions for bioterrorism syndromes (cdc, ) . these syndromes were selected because they encompass both potential bioterrorism-related and community acquired disease processes. they include botulism-like illness (botulism), fever-rash (anthrax, bubonic plague, smallpox, tularemia, varicella), gastrointestinal (gastrointestinal anthrax, food/water-borne gastrointestinal illness, viral gastroenteritis), influenza-like-illness (epidemic influenza, pandemic influenza), meningoencephalitis (meningitis, encephalitis) and respiratory (respiratory anthrax, pneumonic plague, tularemia, influenza, sars). clinical case definitions are converted to syndrome definitions by expressing them in sql, in most cases requiring both a syndrome specific and a constitutional keyword in either the chief complaint or triage note field. for example, a record containing the syndrome specific term "cough" and the constitutional term "fever" would match the respiratory syndrome. documentation of a fever by vital sign measurement in the ed is also accepted in lieu of a constitutional keyword. the sql code is written to identify common synonyms, acronyms, abbreviation, truncations, misspellings and negation in the free text data. the nc detect syndrome definitions have been modified over several years in an iterative fashion according to the knowledge and experience of the nc detect syndrome definition workgroup. this workgroup meets monthly and includes public health epidemiologists who are regular users of nc detect for biosurveillance at the state and local levels, as well as clinicians and technical staff at nc detect. the continued improvement of the syndrome definitions for the purposes of syndromic surveillance requires more than this local effort, however. it requires collaboration with other system developers to determine the best practices nationally, as well as evidence-based research to support existing practices and/or develop new methodologies. the effectiveness of systems such as nc detect depends on the quality of the data provided by the data sources and on the system's capacity to collect, aggregate and report information. perfect data, however, rarely exist and there are no perfect data systems. thus, assessing and improving data quality must be ongoing tasks. in nc detect, both automated and manual data quality checks are conducted daily and weekly. a data quality workgroup meets monthly to review ongoing data quality concerns and strategize ways to address them. major data quality issues range from failure to submit data at all to incorrect mapping of county of residence to extended delays in submitting diagnosis and procedure code data. issues of particular concern include incomplete daily visit data, missing chief complaint data, failure to submit data in a timely fashion, and submission of invalid codes. successfully addressing ed data quality issues requires constant monitoring of the data and ongoing communication with the hospitals submitting the data to nc detect. nc detect has been used for a variety of public health surveillance needs including, but not limited to, early event detection, public health situational awareness, case finding, contact tracing, injury surveillance and environmental exposures (waller et al., ) . 
those disease outbreaks that are first identified by traditional means are still aided by ed-based surveillance systems for identifying additional suspected cases and documenting the epidemiology of the affected individuals. several major hurricanes have made landfall or passed through north carolina in the past years, including floyd in , isabel in , and ophelia in . in addition, hundreds of katrina evacuees entered north carolina in august and september . while ed data were used in all instances to monitor the hurricanes' effects, the methodologies used show the evolution of ed data collection for public health surveillance in north carolina. in the fall of , hurricane floyd, preceded by hurricane dennis and followed by hurricane irene, caused massive rainfall that flooded eastern regions of north carolina along the neuse, tar, roanoke, lumber and cape fear rivers. as ncedd was still in early development in , a disaster response team and ed staff in hospitals worked together to define and apply standardized illness and injury classifications in order to conduct surveillance for the period of september to october , and to compare results to similar periods in . these classifications were applied manually based on diagnosis or chief symptoms for each patient visit abstracted from daily ed logs. based on these analyses, hurricane floyd resulted in increases in insect stings, dermatitis, diarrhea and mental health issues as well as hypothermia, dog bites and asthma. the leading cause of death related to the storm was drowning (cdc, ). surveillance for this time period required the dedicated efforts of eis officers, medical students and other field staff, as well as ed staff and public health epidemiologists over an extended time period. nc dph conducted similar surveillance after hurricane isabel in , manually surveying hospitals to document hurricane-related morbidity and mortality (davis et al., ). officials updated the survey instrument to collect more information on injuries and asked hospitals to complete and fax the information to nc dph. while less labor intensive overall than the surveillance that took place after hurricane floyd, the reliance on ed staff to provide information resulted in a relatively slow and extended collection of data from eds. federal officials evacuated two large groups to north carolina from katrina-hit areas of the gulf coast in august and september . for this event, nc dph relied on nc detect and hospital-based public health epidemiologists in wake and mecklenburg counties for ed data collection. while the epidemiologists at two hospitals were able to identify more katrina-related visits (n = ) than the automated nc detect reports (n = ), the nc detect reports required no manual tabulations and took only h to develop and implement. in addition, the epidemiologist count included patients not included in the nc detect database, such as patients who were directly admitted to the hospital, without receiving treatment in the ed (barnett et al., ).
furthermore, during the time the katrina evacuee visits were being monitored, ophelia approached the nc coast, where it stalled and resulted in the evacuation of coastal communities for several days. nc detect was used to monitor ophelia-related ed visits simultaneously with the katrina evacuee monitoring effort. while manual tabulations may result in greater specificity, near real-time automated ed data collection for post-disaster surveillance provides a very low cost approach for monitoring the public's health if a system is already in place and operational. queries can be continually refined to capture specific keywords in the chief complaint and triage note fields without added burden to hospital and/or public health staff. ed data collection provides an excellent complement to rapid needs assessments and other on-the-ground survey tools. automated ed data collection assumes that eds remain operational and that computerized health information systems continue to be used in times of mass disaster, an assumption that has not yet been put to the test in north carolina. the nc detect influenza-like illness (ili) definition, based on ed data, is used to monitor the influenza season in nc each year. the ed ili definition follows the same trend as north carolina's traditional, manually tabulated sentinel provider network but is available in near real-time, as shown in the figure (influenza-like illness surveillance - nc). while north carolina continues to maintain its sentinel provider network, monitoring influenza with ed data provides several superior surveillance capabilities. in addition to timeliness, collecting ed data for influenza surveillance allows jurisdictions to assess impact on populations rather than samples, test case definition revisions on historical data, stratify ed visits by disposition type (admitted vs. discharged) and incorporate age stratification into analyses. the use of age groups in influenza surveillance has been shown to provide timely and representative information about the age-specific epidemiology of circulating influenza viruses (olsen et al., ). several states and large metropolitan areas, along with north carolina, transmit aggregate ed-based ili counts by age group to an isds-sponsored proof-of-concept project called the distributed surveillance taskforce for real-time influenza burden tracking (distribute). although the ili case definitions are locally defined, the visualizations that distribute provides show striking similarities in ili trends across the country (http://www.syndromic.org/projects/distribute.htm). while syndromic surveillance systems have clearly shown benefit for public health situational awareness and influenza surveillance, early event detection has been more of a challenge. symptom-based detection systems are often overly sensitive, resulting in many false positives that can drain limited resources (baer et al., ; heffernan et al., ). hospital and public health users who incorporate syndromic surveillance into their daily work flows, however, are able to accommodate these false positives more efficiently and still derive benefit from monitoring ed data for potential events. investigating aberrations based on ed data that do not result in detecting an outbreak can still be important to confirm that nothing out of the ordinary is occurring. a recent investigation of gastrointestinal signals in pitt county, north carolina, for example, resulted in more active surveillance by the health department (checking non-ed data sources for similar trends) and the hospital (increased stool testing), as well as a health department press release promoting advice for preventing foodborne illnesses.
although a true outbreak or signal causative agent was not detected, this work results in improved coordination and communication among the hospital, healthcare providers and health department, which will make collaboration more efficient in any future large scale response efforts. to complement the more sensitive symptom-based syndromes, system developers may also include reports looking for specific mention of category a bioterrorism agents, such as anthrax, botulism, etc. in nc detect, for example, the bioterrorism (bt) agent case report searches for keywords and icd- -cm diagnoses related to different bioterrorism agent groups. a statewide search on all agents on average returns only ten cases (averaging one case a day over days). in comparison to the specificity of this report, a statewide search on botulism-like illness for days in nc detect produces approximately cases while a search on a broad definition of gastrointestinal illness produces approximately , cases statewide over a -day period. while the bt agent case report does include false positive cases, it provides an effective, unobtrusive monitoring mechanism that complements the astute clinician. it is also an important backup when notifiable diseases go unreported to the public health department, which actually occurred in march with a single case of tularemia. similar to the periods during and after natural disasters, monitoring ed data during a known infectious disease outbreak can assist with case finding and contact tracing. during known outbreaks, nc detect is used to identify potential cases that may require follow up. to assist in this effort, the nc detect custom event report allows users to request new reports in just h, with specific keyword and/or icd- -cm diagnostic criteria . this report has assisted north carolina's public health monitoring in several events, including, but not limited to, nationwide recalls of peanut butter (february ), select canned foods (july ), nutritional supplements (january ), as well as localized hepatitis a (january ) and listeriosis (december ) outbreaks. allowing users to access reports with very specific keywords (e.g., "peanut," "canned chili," "selenium") provides them with an efficient, targeted mechanism for timely surveillance of emerging events, all with the intention of reducing morbidity and mortality. when syndromic surveillance systems collect icd- -cm diagnosis codes in addition to chief complaints, users can conduct retrospective analysis effectively. for example, users can search on the icd- -cm code v . (need for prophylactic vaccination and inoculation against certain viral diseases: rabies) to review how many ed patients received rabies prophylaxis in a given time period. using the ed chief complaint, users can go a step further and view how many ed patients with chief complaints related to animal bites/animal exposures were not documented as having received a v . code. investigation of the results may reveal hospital coding errors or hospital practices that are not in line with public health requirements that can then be corrected. the injury and violence prevention branch of nc dph has added ed data from nc detect to its data sources for injury surveillance efforts. in addition to ed visit data, they also use hospital discharge, death certificate, and medical examiner data. injury surveillance efforts involving ed data have included falls, traumatic brain injury, fire-related injury, self-inflicted injury, heat-related visits, and unintentional drug overdoses. 
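a custom event report of the kind described above is, at its core, a targeted keyword filter over the chief complaint and triage note fields. the sketch below shows one possible shape for such a filter; the column names, keywords and records are hypothetical and are not the nc detect custom event report itself.

```python
import pandas as pd

# hypothetical line-level ED extract; column names are illustrative only
visits = pd.DataFrame({
    "visit_id": [101, 102, 103],
    "chief_complaint": ["n/v after peanut butter crackers", "ankle injury", "weakness"],
    "triage_note": ["ate recalled peanut butter", "twisted ankle playing soccer",
                    "took selenium supplement, now with hair loss"],
})

def custom_event_report(df, keywords):
    """Return visits whose chief complaint or triage note mentions any keyword."""
    text = (df["chief_complaint"].fillna("") + " " + df["triage_note"].fillna("")).str.lower()
    return df[text.str.contains("|".join(keywords), regex=True)]

print(custom_event_report(visits, ["peanut", "selenium"])[["visit_id", "chief_complaint"]])
```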
furthermore, they have used ed data when working with trauma regional advisory committees to evaluate injury patterns and are exploring the possibility of incorporating ed data into nc's violent death reporting system. while ed data have long been used for injury surveillance, the availability of near realtime data provides opportunities for more timely documentation of intervention outcomes. ed data are a proven tool for biosurveillance, and the ed data in nc detect have proved to be effective for a variety of public health uses, including surveillance, monitoring and investigation. biosurveillance systems that are incorporated into hospital and public health practitioner daily work flows are more effective and easily used during a public health emergency. the flexibility of a system such as nc detect helps it meet this level of functionality. any surveillance system should undergo rigorous evaluation to make sure it is meeting user needs effectively and efficiently. the ed data stream of nc detect has undergone two such evaluations. in , it was evaluated by the north carolina center for public health preparedness at the charge of the nc dph. the evaluation was designed to determine the usefulness of the ed data and the ease with which it is used for both realtime and non-real-time public health surveillance activities. interviews were conducted with stakeholders to learn about the specifics of the ed data, data flow, and the aberration detection algorithms. in addition, local, regional and state public health authorities, as well as hospital-based public health epidemiologists (phes), were asked to complete a web-based survey about their experience using the ed data via nc detect. key findings included: • ed data permit public health authorities to identify human health events as a result of bioterrorism, natural or accidental disaster, or infectious disease outbreak, but the rapidity of detection is contingent on the extent of the event and affected individuals, the ability of chief complaint data to be successfully transmitted to nc detect in a timely manner, and the frequency and timing of aberration detection and investigation by public health authorities; • the nc statute mandating provision of ed visit data for public health surveillance and the availability of unc dem staff to provide technical and analytical expertise have been instrumental in assuring that timely, quality data are available for public health surveillance; • ed data are useful to public health authorities; • the system showed a low positive predictive value (ppv), indicating that users must examine a large number of false positives in order to identify a potentially true threat to public health. based on these findings, this evaluation recommended additional efforts to encourage public health authorities to routinely use the ed data, increased communication among hospitals, business organizations and public health authorities, examination and evaluation of different aberration detection algorithms, and a cost-benefit study of using ed data for public health surveillance. a second evaluation of the emergency department data stream of nc detect was conducted in by the research triangle institute to assess the impact of this biosurveillance system on public health preparedness, early detection, situational awareness, and response to public health threats. 
this study used key informant interviews and found the following: • biosurveillance has been used in north carolina for situational awareness and early detection of disease outbreaks; • public health epidemiologists in hospitals and regional state-based response teams have integrated use of nc detect with traditional surveillance activities; • biosurveillance has added timeliness and flexibility to traditional surveillance, increased reportable disease reporting and case finding, and increased public health communication. this evaluation recommended the addition of urgent care center data to complement the ed visit data for biosurveillance and exploring the use of diagnosis data, when available in a timely manner, to minimize false positive alerts.
conclusion: because ed data are routinely captured in electronic health information systems, these data are available in near real time, making them particularly useful for surveillance and situational awareness in rapidly developing public health outbreaks or disasters. the use of ed data for public health surveillance can significantly increase the speed of detecting, monitoring and investigating public health events. combined with other timely data sources such as data from poison centers, ems, ambulatory care data, and animal health data, ed data analyses are an important source of information for mitigating the effects of infectious disease. a distinctive feature of ed data for surveillance is their timeliness.
discussion questions:
• are timely ed data systems for public health surveillance cost effective? how would you measure this?
• how can biosurveillance systems and electronic lab reporting for reportable conditions best complement each other?
• what other data sources could and should be used with ed data for an exemplar biosurveillance system?
• can an automated biosurveillance system ever really replace the astute clinician at detecting and responding to an infectious disease outbreak of public health significance?
• what statistical approaches are available for aberration detection and what are the pros and cons of each? how does a biosurveillance system determine which aberration detection method(s) to use?
• what are the major data quality issues related to conducting public health surveillance with ed data? how can these be identified and addressed?
• discuss the challenges and benefits of using secondary ed visit data for public health surveillance.
• what are some of the security and privacy issues surrounding the use of ed visit data for public health surveillance?
references:
use of administrative data or clinical databases as predictors of risk of death in hospital: comparison of models
what is the value of a positive syndromic surveillance signal?
advances in disease surveillance
post-katrina situational awareness in north carolina
use of emergency department chief complaint and diagnostic codes for identifying respiratory illness in a pediatric population
framework for evaluating public health surveillance systems for early detection of outbreaks
centers for disease control and prevention. morbidity and mortality associated with hurricane floyd - north carolina
syndrome definitions for diseases associated with critical bioterrorism-associated agents
a simple algorithm for identifying negated findings and diseases in discharge summaries
consultative meeting on chief complaint classifiers and chief complaint preprocessing evaluated on statistical and non-statistical classifiers
evaluation of public health response to hurricanes finds north carolina better prepared for public health emergencies
reporting efficiency during a measles outbreak
syco: a probabilistic machine learning method for classifying chief complaints into symptom and syndrome categories
the validity of chief complaint and discharge diagnosis in emergency-based syndromic surveillance
improving system ability to identify symptom complexes in free-text data
new york city syndromic surveillance systems
syndromic surveillance in public health practice
for the saem public health task force preventive care project. the rational for developing public health surveillance systems based on emergency department data
fever detection in clinic visit notes using a general purpose processor
the bioterrorism preparedness and response early aberration reporting system
situational awareness using web-based annotation and custom reporting
improving negation processing in triage notes
triage note in emergency department-based syndromic surveillance
economic impact of a bioterrorist attack: are prevention and postattack intervention programs justifiable?
ontology-based automatic chief complaints classification for syndromic surveillance
active morbidity surveillance after hurricane andrew - florida
roundtable on bioterrorism detection: information system-based surveillance
a systems overview of the electronic surveillance system for early notification of community-based epidemics (essence ii)
comparison of two major emergency department-based free-text chief-complaint coding systems
data elements for emergency department systems, release . . atlanta, ga: centers for disease control
advance data from vital and health statistics: no. . hyattsville, md: national center for health statistics. of emergency department data
monitoring the impact of influenza by age: emergency department fever and respiratory complaint surveillance in new york city
syndromic surveillance: the effects of syndrome grouping on model accuracy and outbreak detection
a method for developing and maintaining a powerful but inexpensive computer data base of clinical information about emergency department patients
the rate and risk of heat-related illness in hospital emergency departments during the chicago heat disaster
taming variability in free text: application to health surveillance
evaluation challenges for syndromic surveillance - making incremental progress
emergency id net: an emergency department-based emerging infections sentinel network. the emergency id net study group
the informatics response in disaster, terrorism and war
timeliness of emergency department diagnoses for syndromic surveillance
using nurses' natural language entries to build a concept-oriented terminology for patients' chief complaints in the emergency department
electronic availability, timeliness, sources and standards. proceedings of the amia symposium
value of icd- -coded chief complaints for informatics association
department of health and human services (usdhhs). international classification of
north carolina emergency department visit data available for public health surveillance
north carolina biosurveillance system. wiley handbook of science and technology for homeland security
update on public health surveillance in emergency departments
syndrome and outbreak detection using chief-complaint data - experience of the real-time outbreak and disease surveillance project
public health at the summer olympics: the los angeles county experience
for the saem public health task force preventive care project. the rational for developing public health surveillance systems based on emergency department data
update on public health surveillance in emergency departments
framework for evaluating public health surveillance systems for early detection of outbreaks
the validity of chief complaint and discharge diagnosis in emergency-based syndromic surveillance
national hospital ambulatory medical care survey: emergency department summary. advance data from vital and health statistics: no.

key: cord- - og pivv authors: lepenioti, katerina; pertselakis, minas; bousdekis, alexandros; louca, andreas; lampathaki, fenareti; apostolou, dimitris; mentzas, gregoris; anastasiou, stathis title: machine learning for predictive and prescriptive analytics of operational data in smart manufacturing date: - - journal: advanced information systems engineering workshops doi: . / - - - - _ sha: doc_id: cord_uid: og pivv perceiving information and extracting insights from data is one of the major challenges in smart manufacturing. real-time data analytics face several challenges in real-life scenarios, while there is a huge treasure of legacy, enterprise and operational data remaining untouched. the current paper exploits the recent advancements of (deep) machine learning for performing predictive and prescriptive analytics on the basis of enterprise and operational data aiming at supporting the operator on the shopfloor. to do this, it implements algorithms, such as recurrent neural networks for predictive analytics, and multi-objective reinforcement learning for prescriptive analytics. the proposed approach is demonstrated in a predictive maintenance scenario in the steel industry. perceiving information and extracting business insights and knowledge from data is one of the major challenges in smart manufacturing [ ] . in this sense, advanced data analytics is a crucial enabler of industry . [ ] . more specifically, among the major challenges for smart manufacturing are: (deep) machine learning, prescriptive analytics in industrial plants, and analytics-based decision support in manufacturing operations [ ] . the wide adoption of iot devices, sensors and actuators in manufacturing environments has fostered an increasing research interest on real-time data analytics. however, these approaches face several challenges in real-life scenarios: (i) they require a large amount of sensor data that already have experienced events (e.g. failures of -ideally-all possible causes); (ii) they require an enormous computational capacity that cannot be supported by existing computational infrastructure of factories; (iii) in most cases, the sensor data involve only a few components of a production line, or a small number of parameters related to each component (e.g.
temperature, pressure, vibration), making it impossible to capture the whole picture of the factory shop floor and the possible correlations among all the machines; (iv) the cold-start problem is rarely investigated. on the other hand, there is a huge treasure of legacy, enterprise and operational systems data remaining untouched. manufacturers are sitting on a goldmine of unexplored historical, legacy and operational data from their manufacturing execution systems (mes), enterprise resource planning systems (erp), etc., and they cannot afford to miss out on its potential. however, only - % of the value from such available data-at-rest is currently accrued [ ] . legacy data contain information regarding the whole factory cycle and store events from all machines, whether they have sensors installed or not (e.g. products per day, interruption times of the production line, maintenance logs, causalities, etc.) [ ] . therefore, legacy data analytics have the credentials to move beyond the kpi calculations of business reports (e.g. oee, uptime, etc.), towards providing an all-around view of manufacturing operations on the shopfloor in a proactive manner. in this direction, the recent advancements of machine learning can contribute substantially to performing predictive and prescriptive analytics on the basis of enterprise and operational data, supporting the operator on the shopfloor and extracting meaningful insights. combining predictive and prescriptive analytics is essential for smarter decisions in manufacturing [ ] . in addition, mobile computing (with the use of mobile devices, such as smartphones and tablets) can greatly facilitate timely, comfortable, non-intrusive and reliable interaction with the operator on the shopfloor [ ] , e.g. for generating alerts, guiding their work, etc., through dedicated mobile apps. the current paper proposes an approach for predictive and prescriptive analytics on the basis of enterprise and operational data for smart manufacturing. to do this, it develops algorithms based on recurrent neural networks (rnn) for predictive analytics, and multi-objective reinforcement learning (morl) for prescriptive analytics. the rest of the paper is organized as follows: sect. presents the background, the challenges and prominent methods for predictive and prescriptive analytics of enterprise and operational data for smart manufacturing. section describes the proposed approach, while sect. shows a walkthrough scenario of the proposed approach in the steel industry. section presents the experimental results, while sect. concludes the paper and outlines the plans for future research. background. intelligent and automated data analysis, which aims to discover useful insights from data, has become a best practice for modern factories. it is supported today by many software tools and data warehouses, and it is known by the name "descriptive analytics". a step further, however, is to use the same data to feed models that can make predictions with similar or better accuracy than a human expert. in the framework of smart manufacturing, prognostics related to machines' health status is a critical research domain that often leverages machine learning methods and data mining tools. in most of the cases, this is related to the analysis of streaming sensor data, mainly for health monitoring [ ] [ ] [ ] , but also for failure prediction [ ] [ ] [ ] as part of a predictive maintenance strategy.
however, in all of these approaches, the prediction is produced only minutes or even seconds before the actual failure, which, is not often a realistic and practical solution for a real industrial case. the factory managers need to have this information hours or days before the event, so that there is enough time for them to act proactively and prevent it. one way to achieve this is to perform data mining on maintenance and operational data that capture the daily life-cycle of the shop floor in order to make more high-level predictions [ ] [ ] [ ] . existing challenges. the most notable challenges related to predictive analytics for smart manufacturing include: (a) predictions always involve a degree of uncertainty, especially when the data available are not sufficient quantity-wise or quality-wise; (b) inconsistent, incomplete or missing data with low dimensionality often result into overfitting or underfitting that can lead to misleading conclusions; (c) properly preparing and manipulating the data in order to conclude to the most appropriate set of features to be used as input to the model is the most time-consuming, yet critical to the accuracy of the algorithms, activity; (d) lack of a common "language" between data scientists and domain experts hinders the extraction of appropriate hypothesis from the beginning and the correct interpretation and explainability of results. novel methods. time series forecasting involves prediction models that analyze time series data and usually infer future data trends. a time series is a sequence of data points indexed in time order. unlike regression predictive modeling, time series also adds the complexity of a sequence dependence among the input variables. recurrent neural networks (rnn) are considered to be powerful neural networks designed to handle sequence dependence. long short-term memory network (lstm) is a type of rnn that is typically used in deep learning for its ability to learn long-term dependencies and handle multiple input and output variables. background. prescriptive analytics aims at answering the questions "what should i do?" and "why should i do it?". it is able to bring business value through adaptive, time-dependent and optimal decisions on the basis of predictions about future events [ ] . during the last years, there is an increasing interest on prescriptive analytics for smart manufacturing [ ] , and is considered to be the next evolutionary step towards increasing data analytics maturity for optimized decision making, ahead of time. existing challenges. the most important challenges of prescriptive analytics include [ , , ] : (i) addressing the uncertainty introduced by the predictions, the incomplete and noisy data and the subjectivity in human judgement; (ii) combining the "learned knowledge" of machine learning and data mining methods with the "engineered knowledge" elicited from domain experts; (iii) developing generic prescriptive analytics methods and algorithms utilizing artificial intelligence and machine learning instead of problem-specific optimization models; (iv) incorporating adaptation mechanisms capable of processing data and human feedback to continuously improve decision making process over time and to generate non-intrusive prescriptions; (v) recommending optimal plans out of a list of alternative (sets of) actions. novel methods. reinforcement learning (rl) is considered to be a third machine learning paradigm, alongside supervised learning and unsupervised learning [ ] . 
rl shows an increasing trend in research literature as a tool for optimal policies in manufacturing problems (e.g. [ , ] ). in rl, the problem is represented by an environment consisting of states and actions and learning agents with a defined goal state. the agents aim to reach the goal state while maximizing the rewards by selecting actions and moving to different states. in interactive rl, there is the additional capability of incorporating evaluative feedback by a human observer so that the rl agent learns from both human feedback and environmental reward [ ] . another extension is multi-objective rl (morl), which is a sequential decision making problem with multiple objectives. morl requires a learning agent to obtain action policies that can optimize multiple objectives at the same time [ ] . the proposed approach consists of a predictive analytics component (sect. . ) and a prescriptive analytics component (sect. . ) that process enterprise and operational data from manufacturing legacy systems, as depicted in fig. . the communication is conducted through an event broker for the event predictions and the actions prescriptions, while other parameters (i.e. objective values and alternative actions) become available through restful apis. the results are communicated to business users and shopfloor operators through intuitive interfaces addressed to both computers and mobile devices. the proposed predictive analytics approach aims to: (i) exploit hidden correlations inside the data that derive from the day-to-day shop floor operations, (ii) create and adjust a predictive model able to identify future machinery failures, and (iii) make estimations regarding the timing of the failure, i.e. when a failure of the machinery may occur, given the history of operations on the factory. this type of data usually contains daily characteristics that derive from the production line operations and are typically collected as part of a world-wide best practice for monitoring, evaluation and improvement of the effectiveness of the production process. the basic measurement of this process is an industry standard known as overall equipment effectiveness (oee) and is computed as: oee(%) = availability(%) × performance(%) × quality(%). availability is the ratio of actual operational time versus the planned operational time, performance is the ratio of actual throughput of products versus the maximum potential throughput, and the quality is the ratio of the not-rejected items produced vs the total production. the oee factor can be computed for the whole production line as an indication of the factory's effectiveness or per machine or a group of machines.
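as a small worked example of the oee formula above (the figures are purely illustrative and not taken from the case study), a machine that is available for 90% of the planned time, runs at 95% of its maximum throughput and produces 98% acceptable items has an oee of roughly 84%:

```python
# illustrative values only; in practice these ratios come from the daily shop-floor records
availability = 0.90
performance = 0.95
quality = 0.98

oee = availability * performance * quality
print(f"oee = {oee:.1%}")  # prints: oee = 83.8%
```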
the proposed methodology takes advantage of these commonly extracted indicators and processes them in two steps: predictive model building (learning) and predictive model deployment. predictive model building. the predictive analytics model incorporates lstm and exploits its unique ability to "remember" a sequence of patterns and its relative insensitivity to possible time gaps in the time series. as in most neural network algorithms, lstm networks are able to seamlessly model non-linear problems with multiple input variables through the iterative training of their parameters (weights). since the predictive analytics model deals with time series, the lstm model is trained using supervised learning on a set of training sequences assigned to a known output value. therefore, an analyst feeds the model with a set of daily features for a given machine (e.g. the factors that produce the oee) and uses as the outcome the number of days until the next failure. this number is known since historical data hold this information. nevertheless, when the model is finally built and put in operation, it will use new input data and will have to estimate the new outcome. predictive model deployment. when the lstm model is fed with new data it can produce an estimation of when the next failure will occur (i.e. number of days or hours) and of the expected interruption duration in the following days. although this estimation may not be % accurate, it could help factory managers to program maintenance actions proactively in a flexible and dynamic manner, compared to an often rigid and outdated schedule that is currently the common practice. this estimation feeds into prescriptive analytics, aiming at automating the whole decision-making process and providing optimal plans. the proposed prescriptive analytics approach is able to: (i) recommend (prescribe) both perfect and imperfect actions (e.g. maintenance actions with various degrees of restoration); (ii) model the decision making process under uncertainty instead of the physical manufacturing process, thus making it applicable to various industries and production processes; and (iii) incorporate the preference of the domain expert into the decision making process (e.g. according to their skills, experience, etc.), in the form of feedback over the generated prescriptions. to do these, it incorporates multi-objective reinforcement learning (morl). unlike most of the multi-objective optimization approaches, which result in the pareto front set of optimal solutions [ ] , the proposed approach provides a single optimal solution (prescription), thus generating more concrete insights for the user. the proposed prescriptive analytics algorithm consists of three steps: prescriptive model building, prescriptive model solving, and prescriptive model adapting, which are described in detail below. prescriptive model building. the prescriptive analytics model representing the decision making process is defined by a tuple (S, A, T, R), where S is the state space, A is the action space, T : S × A × S → ℝ is the transition function and R : S × A × S → ℝ^n is the vector reward function, where the n dimensions are associated with the objectives to be optimized o_n. the proposed prescriptive analytics model has a single starting state s_N, from which the agent starts the episode, and a state s_B that the agent tries to avoid. each episode of the training process of the rl agent will end when the agent returns to the normal state s_N or when it reaches s_B. figure depicts an example including alternative (perfect and/or imperfect maintenance) actions (or sets of actions) s_{a_i}, each one of which is assigned to a reward vector. the prescriptive analytics model is built dynamically. in this sense, the latest updates on the number of the action states s_{a_i} and the estimations of the objectives' values for each state s_k are retrieved through apis from the predictive analytics. each action may be implemented either before the breakdown (in order to eliminate or mitigate its impact) or after the breakdown (if this occurs before the implementation of mitigating actions). after the implementation of each action, the equipment returns to its normal state s_N. solid lines represent the transitions a_i that have non-zero reward with respect to the optimization objectives and move the agent from one state to another.
prescriptive model deployment. on the basis of event triggers for predicted abnormal situations (e.g. about the time of the next breakdown) received through a message broker, the model moves from the normal state s_N to the dangerous state s_D. for each objective, the reward functions are defined according to whether the objective is to be maximized or minimized. on this basis, the optimal policy π_{o_i}(s, a) for each objective o_i is calculated with the use of the actor-critic algorithm, which is a policy gradient algorithm aiming at searching directly in (some subset of) the policy space, starting with a mapping from a finite-dimensional (parameter) space to the space of policies [ ] . assuming independent objectives, the multi-objective optimal policy is derived from: π_opt(s, a) = ∏_i π_{o_i}(s, a). the time constraints of the optimal policy (prescription) are defined by the prediction event trigger. the prescription is exposed to the operator on the shop-floor (e.g. through a mobile device), providing them with the capability to accept or reject it. if accepted, the prescribed action is added to the actions plan. prescriptive model adaptation. the prescriptive analytics model is able to adapt according to feedback by the expert over the generated prescriptions. this approach learns from the operator whether the prescribed actions converge with their experience or skills and incorporates their preference into the prescriptive analytics model. in this way, it provides non-disruptive decision augmentation and thus achieves an optimized human-machine interaction, while, at the same time, optimizing manufacturing kpis. to do this, it implements the policy shaping algorithm [ ] , a bayesian approach that attempts to maximize the information gained from human feedback by utilizing it as direct labels on the policy. for each prescription, optional human feedback is received as a signal of approval or rejection, numerically mapped to the reward signals and interpreted into a step function. the feedback is converted into a policy π_feedback(s, a), the distribution of which relies on the consistency, expressing the user's knowledge regarding the optimality of the actions, and the likelihood of receiving feedback. assuming that the feedback policy is independent from the optimal multi-objective policy, the synthetic optimal policy for the optimization objectives and the human feedback is calculated as: π_synthetic(s, a) = π_opt(s, a) · π_feedback(s, a). the case examined is the cold rolling production line of m. j. maillis s.a. cold rolling is a process of reduction of the cross-sectional area through the deformation caused by a pair of metal rolls rotating in opposite directions, in order to produce rolling products with the closest possible thickness tolerances and an excellent surface finish. in the milling station, there is one pair of back-up rolls and one pair of work rolls. the deformation takes place through the force of the rolls, supported by adjustable strip tension in both coilers and de-coilers. over the life of the roll some wear will occur due to normal processing, and some wear will occur due to extraneous conditions. during replacement, the rolls are removed for grinding, during which some roll diameter is lost, and then are stored in the warehouse for future use. after several regrindings, the diameter of the roll becomes so small that it is no longer operational.
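a minimal numpy sketch of the two policy-combination steps described above (the product of the per-objective policies, and the policy-shaping product with the feedback policy) might look as follows. the action distributions are placeholders standing in for the output of the actor-critic training and of the feedback interpretation, and the renormalisation step is an added assumption so that each combined result is again a probability distribution.

```python
import numpy as np

def combine(policies):
    """element-wise product of action distributions for one state, renormalised."""
    combined = np.prod(np.vstack(policies), axis=0)
    total = combined.sum()
    return combined / total if total > 0 else np.full_like(combined, 1.0 / combined.size)

# placeholder per-objective policies over four candidate actions in the dangerous state s_D,
# e.g. one distribution driven by cost and one driven by remaining useful life
pi_cost = np.array([0.5, 0.2, 0.2, 0.1])
pi_rul = np.array([0.1, 0.4, 0.3, 0.2])
pi_opt = combine([pi_cost, pi_rul])            # multi-objective optimal policy

# placeholder feedback policy derived from operator approvals/rejections (policy shaping)
pi_feedback = np.array([0.15, 0.15, 0.6, 0.1])
pi_synthetic = combine([pi_opt, pi_feedback])  # synthetic policy combining objectives and feedback

print(pi_opt.round(3))
print(pi_synthetic.round(3))
```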
the lstm model of predictive analytics was created using the keras library with tensorflow as backend and the morl using the brown-umbc reinforcement learning and planning (burlap) library, while the event communication between them is performed with a kafka broker. in the m. j. maillis s.a. case, the system predicts the time of the next breakdown and the rul of the available rolls. for the latter, the operator can select one of the repaired rollers, having been subject to grinding, or a new one. therefore, the alternative actions are created dynamically according to the available repaired rollers existing in the warehouse. each one has a different rul, according to its previous operation, and a different cost (retrieved from enterprise systems) due to its depreciation. each roller has an id and is assigned to its characteristics/objectives of morl (i.e. cost to be minimized and rul to be maximized) in order to facilitate its traceability. the available rolls, along with the aforementioned objective values, are retrieved on the basis of a predicted breakdown event trigger. the alternative actions for the current scenario, along with their costs and ruls, are shown in table . the action "replace with new roller" represents a perfect maintenance action, while the rest represent imperfect maintenance actions. figure depicts an example of the process in which the prescription "replace with repaired roller id " is generated on the basis of a breakdown prediction and previously received feedback and instantly communicated to the operators through a dedicated mobile app. the operators are also expected to provide feedback so that their knowledge and preferences are incorporated in the system and the models are adapted accordingly. the second analysis aimed to predict the expected interruption duration for the following day ('which is the expected interruption duration for the following day?'). the input features used in this lstm model were: availability, performance, minutes of breakdown, real gross production, number of breakdowns, and month (date). again, several lstm parameters and layers were tested and the final model turned out to be a sequential model with a first lstm layer of neurons and an activation function 'relu', a second layer of neurons with a 'relu' activation function, a dropout layer of . rate, and finally a dense layer. the model was trained using data from and ; using a batch size of , epochs, a timestep of and an rmsprop optimizer. predictions were performed in data and results are depicted in fig. . the blue line represents the actual value whereas the orange line represents the predicted value. the overall rmse is . , meaning that there is an average of . min uncertainty in each prediction. for this experiment, the actor-critic algorithm, which calculates the associated optimal policy sequentially within episodes, consists of a boltzmann actor and a td-lambda critic with learning rate = . , lambda = . and gamma = . . the generated policies are then integrated into a single policy taking into account the consistency (c = . ) and likelihood (l = . ) values. table presents five "snapshots" of positive and negative feedback along with the resulting shaped prescriptions and their respective policies.
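a minimal keras sketch consistent with the sequential architecture just described (an lstm layer with a relu activation, a second relu layer, dropout, a dense output and an rmsprop optimizer) is shown below. the layer widths, dropout rate, timestep length and training settings are placeholder assumptions, since the exact values are not reproduced here, and the second layer is taken to be a dense layer, which is only one possible reading of the description.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

TIMESTEPS, N_FEATURES = 7, 6   # placeholders: e.g. one week of the six daily input features

model = Sequential([
    LSTM(64, activation="relu", input_shape=(TIMESTEPS, N_FEATURES)),
    Dense(32, activation="relu"),
    Dropout(0.2),
    Dense(1),                  # regression output: expected interruption duration (minutes)
])
model.compile(optimizer="rmsprop", loss="mse")

# x has shape (samples, timesteps, features), built from the daily operational records;
# y is the value to forecast for the following day; random data stands in here
x = np.random.rand(200, TIMESTEPS, N_FEATURES)
y = np.random.rand(200)
model.fit(x, y, batch_size=32, epochs=10, verbose=0)
print(model.predict(x[:1], verbose=0))
```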
in this paper, we proposed an approach for predictive and prescriptive analytics aiming at exploiting the huge treasure of legacy enterprise and operational data and at overcoming some challenges of real-time data analytics. the potential of the proposed approach is high, especially in traditional industries that have not benefited from the advancements of industry . and that have just started investigating the potential of data analytics and machine learning for the optimization of their production processes. the traditional manufacturing sectors (e.g. textile, furniture, packaging, steel processing) usually have older factories with limited capacity for investing in modern production technologies. since neural networks are inherently adaptive, the proposed approach could be applied to similar production lines (e.g. at a newly established factory of the same type), overcoming the cold-start problem due to which other techniques usually fail. it also exploits both the "voice of data" and the "voice of experts". regarding future work, we plan to evaluate our proposed approach in additional use cases with different requirements, as well as to investigate approaches and algorithms for fusion of the outcomes derived from real-time data analytics and operational data analytics, which represent different levels of information.
references:
building an industry . analytics platform
predictive, prescriptive and detective analytics for smart manufacturing in the information age
big data challenges in smart manufacturing: a discussion paper for bdva and effra research & innovation roadmap alignment. bdva
the age of analytics: competing in a data-driven world
predictive maintenance in a digital factory shopfloor: data mining on historical and operational data coming from manufacturers' information systems
the internet of things for smart manufacturing: a review
recent advances and trends in predictive manufacturing systems in big data environment
a full history proportional hazards model for preventive maintenance scheduling
a neural network application for reliability modelling and condition-based predictive maintenance
data mining in manufacturing: a review based on the kind of knowledge
data mining in manufacturing: a review
a practical approach to combine data mining and prognostics for improved predictive maintenance
application of data mining in a maintenance system for failure prediction
analyzing maintenance data using data mining methods
machine learning for predictive maintenance: a multiple classifier approach
prescriptive analytics
smart manufacturing with prescriptive analytics
prescriptive analytics: literature review and research challenges
reinforcement learning: an introduction
model-free adaptive optimal control of episodic fixed-horizon manufacturing processes using reinforcement learning
a reinforcement learning framework for optimal operation and maintenance of power grids
human-centered reinforcement learning: a survey
multiobjective reinforcement learning: a comprehensive overview
many-objective stochastic path finding using reinforcement learning
policy shaping: integrating human feedback with reinforcement learning
acknowledgments. this work is funded by the european commission project h uptime "unified predictive maintenance system" ( ).
key: cord- -yoav b authors: kyriazis, dimosthenis; biran, ofer; bouras, thanassis; brisch, klaus; duzha, armend; del hoyo, rafael; kiourtis, athanasios; kranas, pavlos; maglogiannis, ilias; manias, george; meerkamp, marc; moutselos, konstantinos; mavrogiorgou, argyro; michael, panayiotis; munné, ricard; la rocca, giuseppe; nasias, kostas; pariente lobo, tomas; rodrigálvarez, vega; sgouros, nikitas m.; theodosiou, konstantinos; tsanakas, panayiotis title: policycloud: analytics as a service facilitating efficient data-driven public policy management date: - - journal: artificial intelligence applications and innovations doi: . / - - - - _ sha: doc_id: cord_uid: yoav b while several application domains are exploiting the added-value of analytics over various datasets to obtain actionable insights and drive decision making, the public policy management domain has not yet taken advantage of the full potential of the aforementioned analytics and data models. diverse and heterogeneous datasets are being generated from various sources, which could be utilized across the complete policies lifecycle (i.e. modelling, creation, evaluation and optimization) to realize efficient policy management. to this end, in this paper we present an overall architecture of a cloud-based environment that facilitates data retrieval and analytics, as well as policy modelling, creation and optimization. the environment enables data collection from heterogeneous sources, linking and aggregation, complemented with data cleaning and interoperability techniques in order to make the data ready for use. an innovative approach for analytics as a service is introduced and linked with a policy development toolkit, which is an integrated web-based environment to fulfil the requirements of the public policy ecosystem stakeholders. the ict advances as well as the increasing use of devices and networks, and the digitalisation of several processes is leading to the generation of vast quantities of data. these technological advances have made it possible to store, transmit and process large amounts of data more effectively than before in several domains of human activity of public interest [ ] . it is undeniable that we are inundated with more data than we can possibly analyse [ ] . this rich data environment affects decision and policy making: cloud environments, big data and other innovative data-driven approaches for policy making create opportunities for evidence-based policies, modernization of public sectors and assistance of local governance towards enhanced levels of trust [ ] . during the traditional policy cycle, which is divided into different stages (agenda setting, policy formulation, decision making, policy implementation and policy evaluation), data is a valuable tool for allowing policy choices to become more evidence-based and analytical [ ] . the discussion of data-driven approaches to support policy making can be distinguished between two main types of data. the first is the use of open data (administrative -open -data and statistics about populations, economic indicators, education, etc.) that typically contain descriptive statistics, which are used more intensively and in a linked way, shared through cloud environments [ ] . the second main type of data is from any source, including data related to social dynamics and behaviour that affect the engagement of citizens (e.g. online platforms, social media, crowd-sourcing, etc.). 
these data are analysed with novel methods such as sentiment analysis, location mapping or advanced social network mining. furthermore, one key challenge goes beyond using and analysing big data, towards the utilization of infrastructures for shared data in the scope of ethical constraints both for the citizens and for the policy makers. these ethical constraints include "data ownership" ones, which determine the data sharing rules, as well as data localisation constraints, which may unjustifiably interfere with the "free flow of data". as for policy makers, the dilemma, to what extent big data policy making is in accordance to values elected governments promote, is created. this is a problem deriving from the fact that it is difficult to point to the scope of the consent citizens may give to big data policy analysis [ ] . furthermore, such policies provide a broad framework for how decisions should be made regarding data, meaning that they are high-level statements and need more detail before they can be operationalized [ ] . big data enabled policy making should answer modern democratic challenges, considering facts about inequality and transparency both on national and local level, involving multi-disciplinary and multi-sectoral teams. in this context, this paper presents the main research challenges and proposes an architecture of an overall integrated cloud-based environment for data-driven policy management. the proposed environment (namely policycloud) provides decision support to public authorities for policy modelling, implementation and simulation through identified populations, as well as for policy enforcement and adaptation. additionally, a number of technologies are introduced that aim at optimizing policies across public sectors by utilizing the analysed inter-linked datasets and assessing the impact of policies, while considering different properties (i.e. area, regional, local, national) and population segmentations. one of the key aspects of the environment is its ability to trigger the execution of various analytics and machine learning models as a service. thus, the implemented and integrated service can be executed over different datasets, in order to obtain the results and compile the corresponding policies. in terms of managing diverse data sources, the evolution of varieties of data stores (i.e. sql and nosql), where each variety has different strengths and usage models, is linked with the notion of "polyglot persistence" [ ] . the latter emphasizes that each application and workload may need a different type of data store, tailored for its needs (e.g. graph, time series). moreover, the field of data warehousing addresses creating snapshots of online transactional processing (oltp) databases for the purposes of online analytical processing (olap). this often requires copying the data and preparing it for analytics by transforming its structure (i.e. extract transform load process) and building the relevant indexes for the analytical queries. this costly process is performed in order to achieve fast query times for analytical queries of interest, and in order to support data mining. however, there is an increasing trend to adopt a "just in time data warehouse" model, where data are federated on the fly according to runtime parameters and constraints [ ] . as a result, data analytics frameworks increasingly strive to cater for data regardless of the underlying data store. 
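the "just in time" federation idea described above can be illustrated with an engine such as apache spark (discussed next): in the hypothetical snippet below, policy indicators held in a relational store and citizen feedback stored as json documents are joined on the fly, at query time, instead of being copied into a warehouse first. the connection details, table names, paths and column names are invented for illustration only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("just-in-time-federation").getOrCreate()

# relational source exposed over jdbc (hypothetical endpoint and table)
indicators = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/policy")
    .option("dbtable", "public.regional_indicators")
    .option("user", "reader")
    .option("password", "secret")
    .load()
)

# semi-structured source: citizen feedback exported as json documents (hypothetical path)
feedback = spark.read.json("s3a://policy-data/citizen_feedback/*.json")

# federated, on-the-fly join across the two stores instead of a pre-built warehouse snapshot
joined = indicators.join(feedback, on="region_id", how="inner")
joined.groupBy("region_id").count().show()
```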
apache spark [ ] is an open source framework for analytics, which is designed to run in a distributed setting within a data centre. spark provides a framework for cluster computing and memory management, and invents the notion of a resilient distributed dataset (rdd) [ ] that can be stored persistently using any storage framework that implements the hadoop file system interface, including the hadoop distributed filesystem (hdfs), amazon s and openstack swift. the spark sql component additionally defines a dataframe as an rdd having a schema, and provides a sql interface over dataframes [ ] . built in support is provided for data sources with a jdbc interface, as well as for hive, and the avro, parquet, orc and json formats. moreover, there is an external data sources api, where new data sources can be added by implementing a driver for the data source that implements the api. many such drivers have been implemented, for example for cassandra, mongodb and cloudant. these data sources can be queried, joining data across them and thus provide the ability to run batch queries across multiple data sources and formats. spark also integrates batch processing with real time processing in the form of spark streaming [ ] that allows real-time processing using the same underlying framework and programming paradigm as for batch computations. in spark . , streaming spark sql computations are also planned [ ] . with the advent of iot and the increasing capabilities available at the edge, applications may store and process data locally [ ] . in the policycloud architecture, data are managed whether in flight or at rest and are federated across multiple frameworks, data sources, locations and formats. another key aspect is data interoperability given the diversity of the data sources. among the main value propositions of the policycloud environment and tools for policy development and management will be its ability to integrate, link and unify the datasets from diverse sources, while at the same time enabling analytics over the unified datasets. as a key prerequisite to providing this added-value, the interoperability of diverse datasets should be ensured. a wide array of data representation standards in various domains have emerged as a means of enabling data interoperability and data exchange between different systems. prominent examples of such standards in different policy areas include: (i) the inspire data specifications [ ] for the interoperability of spatial data sets and services, which specify common data models, code lists, map layers and additional metadata on the interoperability to be used when exchanging spatial datasets, (ii) the common european research information format (cerif) [ ] for representing research information and supporting research policies, (iii) the internet of things ontologies and schemas, such as the w c semantic sensor networks (ssn) ontology [ ] and data schemas developed by the open geospatial consortium (e.g., sensorml) [ ], (iv) the common reporting standard (crs) that specifies guidelines for obtaining information from financial institutions and automatically exchanging that information in an interoperable way, and (v) standards-based ontologies appropriate for describing social relationships between individuals or groups, such as the "the friend of a friend" (foaf) ontology [ ] and the socially interconnected online communities (sioc) ontology [ ] . 
these standards provide the means for common representation of domain specific datasets, towards data interoperability (including in several cases semantic interoperability) across diverse databases and datasets. nevertheless, these standards are insufficient for delivering what policycloud promises for a number of reasons. initially, there is a lack of semantic interoperability in the given domain. for example, compliance to ontologies about iot and sensor data fails to ensure a unified modelling of the physics and mathematics, which are at the core of any sensing task. hence, in several cases there is a need for extending existing models with capabilities for linking/relating various quantifiable and measurable (real-world) features to define, in a user understandable and machine-readable manner the processes behind single or combined tasks in the given domain. furthermore, there is a lack of semantic interoperability across datasets from different sectors. there is not easy way to link related information elements stemming from datasets in different sectors, which typically comprise different schemas. in this context, environmental datasets and transport datasets for instance contain many related elements, which cannot however be automatically identified and processed by a system due to the lack of common semantics. finally, one needs to consider the lack of process interoperability. policycloud deals with data-driven policy development and management, which entails the simulation and validation of entire processes. especially in the case of multi-sectoral considerations (e.g., interaction and trade-offs between different policies) process interoperability is required in order to assess the impact of one policy on another. policycloud proposes a multi-layer framework for interoperability across diverse policy related datasets, which will facilitate semantic interoperability across related datasets both within a single sector and across different policy sectors. within a specific sector of each use case, semantic interoperability will enable adhering to existing standards-based representations for the sector data and other auxiliary data (e.g. sensor data, social media data). across different use cases, policycloud proposes a linkeddata approach [ ] to enable linking of interrelated data across different ontologies. the challenge is to provide a scalable, flexible and dependable methodology and environment for facilitating the needs of data-driven policy modelling, making and evaluation. the methodology should aim at applying the properties of policy modelling, co-creation and implementation across the complete data path, including data modelling, representation and interoperability, metadata management, heterogeneous datasets linking and aggregation, analytics for knowledge extraction, and contextualization. moreover, the methodology should exploit the collective knowledge out of policy "collections"/clusters combined with the immense amounts of data from several sources (e.g. sensor readings, online platforms, etc.). these collections of policies should be analysed based on specific key performance indicators (kpis) in order to enable the correlation of these kpis with different potential determinants of policies impact within and across different sectors (e.g. environment, radicalisation, migration, goods and services, etc.). another challenge refers to holistic policy modelling, making and implementation in different sectors (e.g. 
environment, migration, goods and services, etc.), through the analysis and linking of kpis of different policies that may be inter-dependant and intercorrelated (e.g. environment). the goal is to identify (unexpected) patterns and relationships between policies (through their kpis) to improve policy making. moreover, the approach should enable evaluation and adaptation of policies by dynamically extracting information from various data sources, community knowledge out of the collections of policies, and outcomes of simulations and evidence-based approaches. policies should be evaluated to identify both their effective kpis to be re-used in new/other policies, and the non-effective ones (including the causes for not being effective) towards their improvement. thus, developed policies should consider the outcomes of strategies in other cases, such as policies addressing specific city conditions. data-driven policy making highlights the need for a set of mechanisms that address the data lifecycle, including data modelling, cleaning, interoperability, aggregation, incremental data analytics, opinion mining, sentiment analysis, social dynamics and behavioural data analytics. in order to address data heterogeneity from different sources, modelling and representation technologies should provide a "meta-interpretation" layer, enabling the semantic and syntactic capturing of data properties and their representation. another key aspect refers to techniques for data cleaning in order to ensure data quality, coherence and consistency including the adaptive selection of information sources based on evolving volatility levels (i.e. changing availability or engagement level of information sources). mechanisms to assess the precision and correctness of the data, correct errors and remove ambiguity beyond limitations for multidimensional processing should be incorporated in the overall environment, while taking into consideration legal, security and ethical aspects. the specific challenge refers to the design and implementation of a semantic layer that will address data heterogeneity. to this end, the challenge is to research on techniques and semantic models for the interoperable use of data in different scenarios (and thus policies), locations and contexts. techniques for interoperability (such as oslc -open services for lifecycle collaboration) with different ontologies (as placeholders for the corresponding information) should be combined with semantic annotations. semantic models for physical entities/devices (i.e. sensors related to different policy sectors), virtual entities (e.g. groupings of such physical entities according to intrinsic or extrinsic, permanent or temporary properties) and online platforms (e.g. social media, humans acting as providers) should be integrated in data-driven policy making environments. these models should be based on a set of transversal and domain-specific ontologies and could provide a foundation for high-level semantic interoperability and rich semantic annotations across policy sectors, online systems and platforms. these will be turned into rich metadata structures providing a paradigm shift towards content-based storage and retrieval of data instead of data-based, given that stakeholders and applications target and require such content based on different high-level concepts. content-based networks of data objects need to be developed, allowing retrieval of semantically similar contents. 
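a small sketch of the kind of semantic annotation and cross-ontology linking discussed above, using rdflib, is given below; the policycloud namespace, the property names and the way an observation is tied to a contributing citizen are illustrative assumptions rather than an actual project vocabulary, with the w3c ssn/sosa and foaf terms standing in for the standards-based ontologies mentioned earlier.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import FOAF, XSD

# hypothetical project namespace used only for this illustration
PC = Namespace("http://example.org/policycloud#")
SOSA = Namespace("http://www.w3.org/ns/sosa/")  # core observation terms of the w3c ssn family

g = Graph()
g.bind("pc", PC)
g.bind("foaf", FOAF)
g.bind("sosa", SOSA)

obs = URIRef("http://example.org/data/observation-42")
citizen = URIRef("http://example.org/data/citizen-7")

# annotate an environmental reading with ssn/sosa terms and link it to the citizen
# (described with foaf) who contributed it, so that datasets from different sectors
# can be traversed through shared, machine-readable semantics
g.add((obs, RDF.type, SOSA.Observation))
g.add((obs, SOSA.hasSimpleResult, Literal("42.1", datatype=XSD.decimal)))
g.add((citizen, RDF.type, FOAF.Person))
g.add((obs, PC.contributedBy, citizen))

print(g.serialize(format="turtle"))
```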
one of the main barriers to public bodies experimenting with big data to improve evidence-based policy making is citizens' participation since they are lacking awareness of the extend they may influence policy design and the ways these will be feasible. the challenge is to raise awareness about policy consultations and enable citizens to take direct action to participate, thus ensuring higher levels of acceptance and of trust. a potential solution could be to follow a living lab approach and implement an engagement strategy based on different incentives mechanisms. furthermore, a data-driven policy management environment should allow social dynamics and behaviour to be included in the policy lifecycle (creation, adaptation, enforcement, etc.) through the respective models and analytical tools. these will allow policy makers to obtain the relevant crowd-sourcing data and the knowledge created by the closed groups (i.e. communities evaluating proposed policies) and the engaged citizens to analyse and propose social requirements that will be turned into policy requirements. on top of this, incentives management techniques will identify, declare and manage incentives for citizens' engagement, supporting different types of incentives (e.g. social, cultural, political, etc.), with respect to information exchange, contributions and collaboration aspirations. the environment should also provide strategies and techniques for the alignment participation incentives, as well as protocols enabling citizens to establish their participation. machine and deep learning techniques, such as classification, regression, clustering and frequent pattern mining algorithms, should be realized in order to infer new data and knowledge. sentiment analysis and opinion mining techniques should determine whether the provided contributor's input is positive or negative about a policy, thus developing a "contributor graph" for the contributors of the opinions that happen to be themselves contributors to ongoing policy making projects. in the same context, social dynamics and behavioural data analytics should provide insights regarding which data needs to be collected and aggregated in a given case (e.g. time window addressed by the policy, location of populations, expected impact, etc.), taking into consideration the requirements of the engaged citizens to model the required policies. moreover, a main challenge refers to technologies that allow analytics tasks to be decoupled from specific datasets and thus be triggered as services and applied to various cases and datasets. a key challenge refers to an overall system, acting as an endpoint that will allow stakeholders (such as policy makers and public authorities) to trigger the execution of different models and analytical tools on their data (e.g. to identify trends, to mine opinion artefacts, to explore situational and context awareness information, to identify incentives, etc.) and obtain the results. based on these results, the modelled policies (through their kpis) will be realized/implemented and monitored against these kpis. moreover, the endpoint should allow stakeholders and public administration entities to express in a declarative way their analytical tasks and thus perform/ingest any kind of data processing. another need is for an adaptive visualization environment, enabling policy monitoring to be visualized in different ways while the visualization can be modified on the fly. 
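the endpoint for triggering analytics tasks "as a service" described earlier in this passage can be pictured with a very small http service; the route, payload shape and the stubbed sentiment model below are illustrative assumptions only, not the policycloud api.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# hypothetical registry mapping service names to analytics callables;
# a real deployment would dispatch to trained models instead of a stub
ANALYTICS_SERVICES = {
    "sentiment": lambda records: {"positive": 0.61, "negative": 0.39},
}

@app.route("/analytics/<service>", methods=["POST"])
def run_analytics(service):
    """trigger a registered analytics model on the dataset posted by the caller."""
    if service not in ANALYTICS_SERVICES:
        return jsonify({"error": "unknown service"}), 404
    records = request.get_json(force=True).get("records", [])
    return jsonify({"service": service, "result": ANALYTICS_SERVICES[service](records)})

if __name__ == "__main__":
    app.run(port=8080)
```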
the environment should also enable the specification of the assets to be visualized: which data sources and which meta-processed information. the goal is to enable the selection of sources based on the stakeholders' needs. incremental visualization of analytics outcomes should also be feasible enabling visualization of results as they are generated. as a complete environment, the proposed architectural approach includes a set of main building blocks to realize the corresponding functionality as depicted in the following figure (fig. ) . the overall flow is initiated from various data sources, as depicted in the figure through the respective data acquisition block. data sources can be data stores from public authorities or external data sources (e.g. mobile devices, iot sensors, etc.) that contribute data following the provision of incentives, facilitated through the incentives management mechanism. a set of apis incorporated in a gateway component, enables data collection by applying techniques to identify the reliable sources exploiting the sources reliability component and for these sources obtain the data and perform the required data quality assessment and cleaning. semantic and syntactic interoperability techniques are utilized over the cleaned data providing the respective interoperable datasets to the policy cloud datastore following the required data linking and aggregation processes. the datastore is accessible from a set of machine learning models represented through the data analytics building block. machine learning models incorporate opinion mining, sentiment and social dynamic analysis, behavioural analysis and situational/context knowledge acquisition. the data store and the analytics models are hosted and executed in a cloud-based environment that provides the respective services obtained from a catalogue of cloud infrastructure resources. furthermore, all the analytics models are realized as services, thus enabling their invocation through a proposed policy development toolkit -realized in the scope of the policies building block of the proposed architecture as a single point of entry into the policycloud platform. the toolkit allows the compilation of policies as data models, i.e. structural representations that include key performance indicators (kpis) as a means to set specific parameters (and their target values) and monitor the implementation of policies against these kpis along with the list of analytical tools to be used for their computation. according to these analytics outcomes, the values of the kpis are specified resulting to policies implementation/creation. it should be noted that policycloud also introduces the concept of policies clusters in order to interlink different policies, and identify the kpis and parameters that can be optimized in such policy collections. across the complete environment, an implemented data governance and compliance model is enforced, ranging from the provision of cloud resources regarding the storage and analysis of data to the management of policies across their lifecycle. the vast amounts of data that are being generated by different sources highlight an opportunity for public authorities and stakeholders to create, analyse, evaluate and optimize policies based on the "fresh" data, the information that can be continuously collected by citizens and other sensors. 
to this end, what is required refers to techniques and an overall integrated environment that will facilitate not only data collection but also assessment in terms of reliability of the data sources, homogenization of the datasets in order to make them interoperable (following the heterogeneity in terms of content and formats of the data sources), cleaning of the datasets and analytics. while several analytical models and mechanisms are being developed a key challenge relates to approaches that will enable analytics to be triggered as services and thus applied and utilized in different datasets and contexts. in this paper, we have presented the aforementioned challenges and the necessary steps to address them. we also introduced a conceptual architecture that depicts a holistic cloud-based environment integrating a set of techniques across the complete data and policy management lifecycles in order to enable data-driven policy management. it is within our next steps to implement the respective mechanisms and integrate them based on the presented architecture, thus realizing the presented environment. how can social media data be used to improve services? business opportunities, and the it imperatives cases in public policy-making setting data policies, standards, and processes data for policy: a study of big data and other innovative data-driven approaches for evidenceinformed policymaking big data: basics and dilemmas of big data use in policy-making time data warehouse platform with databricks resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing spark sql: relational data processing in spark discretized streams: fault-tolerant streaming computation at scale structuring spark: dataframes, datasets and streaming fog computing and its role in the internet of things data specifications linked data: evolving the web into a global data space acknowledgment. the research leading to the results presented in this paper has received funding from the european union's funded project policycloud under grant agreement no . key: cord- -nr goxs authors: gizelis, christos-antonios; mavroeidakos, theodoros; marinakis, achilleas; litke, antonis; moulos, vrettos title: towards a smart port: the role of the telecom industry date: - - journal: artificial intelligence applications and innovations doi: . / - - - - _ sha: doc_id: cord_uid: nr goxs transformation is not only today’s trend but also a reality. ports could not be excluded from that change. a transformation process has been initiated in order to change their operational structure, and the services they offer. artificial intelligent and data oriented services push the services’s landscape beyond the traditional ones that are currently used. the scope of this paper is to analyze and scrutinize the opportunities that are risen for telecommunications/information and communication technology (ict) providers at ports. these opportunities are the stepping stone towards the transformation of ports for the future. this work in progress is under the dataports project that is funded by the european union’s horizon research and innovation programme under grant agreement no. . 
"dataports project aims to boost the transition of european seaports from connected and digital to smart and cognitive, by providing a secure environment for the aggregation and integration of data coming from different sources existing in the digital ports and owned by diverse stakeholders, so that the whole port community could benefit from this data in order to improve their processes, offer new services and devise new ai based and data driven business models" [ ] . for this purpose, the technological innovation is destined to transform the business as usual, therefore more and more companies hop on that huge wave in order to avoid to be forgotten in the near future as it happened in well branded companies in the past that did not see the need for change [ , ] . this massive transformation of the businesses has become more data-driven. inevitable data owners or those that can produce data have become the new major actors. the port industry is no exception and data-driven services is what they should offer, not in the future but today to their end-user, customers, stakeholder (many names exist to define them). according to deloitte port services [ ] , smart ports is the fourth technological priority in improving and evolving the shipping/maritime industry. by making a port smart, then ai-based services will offer cognitive services that will offer opportunities to port owners, visitors, customers, etc. smart and cognitive ports is the newest trend and like the smart cities is a creation of a new emerging data market. it is a term that expands the traditional stakeholders' ecosystem with limitless opportunities for new entries. a multiactor and very diverse ecosystem is created with many opportunities for market expansion, revenue increase, new services, especially data-driven ones such as internet of things (iot) [ ] and ai-based services. this rapid growing ecosystem with many actors and many roles creates the need for new and dynamic business models to fulfill the also rising needs. this opportunity was early identified by european union in [ ] and a special chapter was included within the smart cities one. "ports are considered a special case of a smart community, then they have to meet the same requirements that are asked for a smart city, adapted to the port situation". ports operate on a certain basis by following a number of models. they can be public, where the administration is operated by a central authority at a national, regional or city level, or private, a model that is run by a port company, or even a port (local) society and hybrid where in this case, public-private partnerships govern the port administration and the provided services. therefore, port authorities whether are private, public, or hybrid, are "forced" to transform and create the ports of the future, not only by creating an interconnection grid between them but offering new innovative services to their existing "customers" like companies in shipping, supply-chain and logistics, tourism, but also many more from a wide diversity areas, that they can take advantage from the new services. therefore, port authorities create synergies with research community, data owners and providers, software-houses, startups and smes in order to, together create new innovative services and expand the list of the potential beneficiaries. 
the need of port authorities has become an opportunity for port-oriented companies to experiment their wide range of services, algorithms, and data that they own and monetize their offerings in this new emerging market. it is an opportunity, which could be beneficial for numerous stakeholders. this opportunity increases the dynamics of the smart ports or ports of the future transformation. the basis for port transformation lays within the co-operation that takes place between strategic partnerships at national and international level. to that we should add partnerships with port-oriented service rd parties. the new ecosystem that is created around the ports is highly competitive in order to increase their market share, and ports are trying to get an advantage by implementing smart technologies and services in order to optimize their operations in many areas such as energy, security, transport, logistics and more. local port authorities gather around, numerous entities. organizations, associations, companies that may be stakeholders and potentially beneficiaries of a data-oriented platform provided by dataports. the list can be endless if attempt to put everyone in the frame, but trying to pin-point the main ones we can identify synergies within academia and research community, shipping companies, smes and startups, public authorities and policy makers, data providers, and local community associations (commercial, tourism, culture, etc.) on the port transformation journey, the telecommunication industry (also involved in ict services) can have a significant role and perhaps become a key player towards the transformation success. telecom (wired and wireless) hold the backbone of the data-driven environment [ , , ] through their networks. therefore, it is on their hand to become leaders in this rapidly increased emerging market. a telecom/ict provider in order to enter this emerging ecosystem and potentially benefit from its growth should firstly address real-life data market use cases in ports that are related to its areas of operations. the data that a telecom provider owns might not be directly usable by ports' community and interoperability issues should be prior investigated. data may be provided/shared through advanced ai-based services that are scalable, resilient and using a semantic approach. data should be designed and provided in a matter that can be governed and maintained in a business manner. for these reasons, new business models and a standardized methodology like [ ] should be defined and adopted respectively. policies should be also taken under consideration so that a reliable data sharing and trading platform to be established. these design patterns should be obtained under a secure and trusted environment between stakeholder and beneficiaries, internal or external to the port. telecommunication industry owns, due to its field of operations, owns, handles and process large volumes of data, real and non-real time. data that collected and stored from many different services. a telecom provider has the largest availability of customers' personal data, network data, mobility tracking, billing location, and so on. the correlation of such datasets, as well as the combination of them with external publicly available data, may potentially create a great value for solving problems and serving future needs. 
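as a purely hypothetical illustration of the kind of dataset correlation mentioned above, the sketch below joins aggregated, anonymised device counts around a port area (the sort of indicator a telecom operator could derive from its network data) with publicly available vessel arrivals; all column names and values are invented and carry no relation to any real operator or port:

```python
# illustrative only: correlating aggregated telecom mobility counts with open
# port data to derive a simple footfall-vs-traffic indicator. columns, files and
# values are assumptions made up for the example.
import pandas as pd

# hourly count of devices seen by antennas around the port area (aggregated, no ids)
mobility = pd.DataFrame({
    "hour": pd.date_range("2021-06-01", periods=6, freq="h"),
    "devices_near_port": [120, 180, 260, 400, 350, 300],
})
# publicly available vessel arrivals for the same port and hours
arrivals = pd.DataFrame({
    "hour": pd.date_range("2021-06-01", periods=6, freq="h"),
    "vessel_arrivals": [0, 1, 1, 3, 2, 2],
})

merged = mobility.merge(arrivals, on="hour")
# how strongly footfall around the port tracks vessel traffic in this toy sample
print(merged[["devices_near_port", "vessel_arrivals"]].corr().iloc[0, 1])
```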
since it is a new field of operations and due to the fact that until now such data sharing wasn't available for many different reasons, the value or what needs data can solve may be uncertain yet. according to the eu [ ] , it is estimated that the open data market is expected to increase by % by , reaching e . billion, with the cumulative figure of e bn. open data in greece can generate an additional e . bn. gdp and approximately e billion in cumulative benefits within years. that led researchers, and other related entities to be highly involved, in order benefit from these numbers. every port transformation was focused on different elements. from the freight loading and uploading in the s, the industrial port and the logistics in the s. today even with a tighter regulatory environment from the data sharing point of view, the element of transformation is the interoperability and how the ports of the future will become smart and cognitive. the research community and the port related actors are relative active during the last few years. european authorities turned their attention into three points of action that could potentially affect and transform a port; the supply, the platform and the demand. various port management platforms/services/applications/systems, where created in order to cover these elements. supply in order to monitor and handle the operations with the logistics companies, the shipping companies and the freight carriers. port operations are monitored both for inbound and outbound activities, both equally demanding with different actors in each value chain. several initiatives in european level, have taken place in the past and some of them are very active even today in order to create an ecosystem around the ports. actions of european organizations and associations such as espo [ ] , iaph [ ] and aivp [ ] aim to interconnect and represent port authorities and their relationships with eu and other states. they played a vital role in the global trade and practically can be considered as the pioneers of the smart ports. in addition to that, enisa [ ] developed -in collaboration with several eu ports -a report that intends to provide a useful information regarding the cybersecurity strategy of port authorities and terminal operators. moreover, eu -especially through horizon programs -has funded numerous projects, that aim to create management platforms [ ] for maritime and port environments, in order to create interoperability and eventually the ports to become cognitive and smart. projects like smartcities, among others, became the marketplace of the european innovation partnership on smart cities and communities [ ] . the projects e-mar, flagship, and inmare [ ] handle maritime transport related issues, mass [ ] deals with ways to improve human behavior on board ships with special attention to emergency situations, marine-abc [ ] demonstrates the opportunities of the mobile ship-to-shore communication, bigdatastack [ ] attempts to optimize cluster management for data operations, smartship [ ] develops data analytics decision support systems and a circular economy based optimization platform. all the aforementioned and other projects prove that research community, port authorities, shipping and supply companies are aligned with a common objective which is the creation of a new ecosystem with advanced data-driven services for the benefit of the ports and the local communities. 
on top of that, european maritime sector through new calls is planning to offer efficient quality services integrated to the overall european transport system. dataports since january is planning to implement a data management platform to be operated by port authorities in order to provide advanced services (fig. ) and create a value-chain between stakeholders, internal and external ones (fig. ). data in the maritime domain is growing at an unprecedented rate, e.g., terabytes of oceanographic data are collected every month as well as petabytes of data are already publicly available. big data from different sources such as sensors, buoys, vessels, and satellites could potentially feed a large number of interesting applications regarding environmental protection, security, shipping routes optimization or cargo handling [ ] . although many projects [ , ] are trying to develop data management platforms in various application domains, not many of them have addressed integration in port environments with the possibility of including cognitive services and extending their platform to whole transportation routes around europe. -data access: the platform will ensure access to the heterogeneous data sources of the port -including relational and nosql databases, object storage, publish/subscribe streams -in a common manner. to achieve this goal, the project will rely on widely adopted data formats and interface specifications, such as json [ ] and openapi [ ] respectively. -semantic interoperability: the project will develop a framework for semantic interoperability of diverse data platforms, autonomous systems and applications. acting as a unifier, the framework will provide business level interoperability, in addition to data interoperability at the level of common spatial and temporal semantics. -data abstraction and virtualization: the platform will provide an abstraction layer that takes care of retrieving, processing and delivering the data with the proper quality level, while in parallel putting special emphasis on data security, performance, privacy and protection. this layer, acting as a middleware, will let the data consumers simply define their requirements on the needed data -expressed as data utility -and will take the responsibility for providing these data timely, securely and accurately by hiding the complexity of the underlying infrastructure. the latter could consist of different platforms, storage systems, and network capabilities. the orchestration of those (micro)services could be implemented via flow-based programming tools like node-red [ ] . -data analytics and ai services: the project will develop a set of data analytics services for supporting the development of descriptive, predictive, prescriptive models using the different datasets available to the platform. since the project will have to deal with big data technologies, the defined services should be scalable enough to process huge data volumes. appropriate state-of-the-art machine learning algorithms will be identified for supporting the development of cognitive applications in the context of the smart ports domain. -blockchain: the platform will provide all the tools for data sharing and trading in a secure and trusted manner. the specific building block should take into consideration the rules defined from data providers to data consumers. on top of that, it has to offer a clear value proposition to data owners. data access mechanisms, based on purpose control, will also be established. 
as a result, the solution has to keep provenance of the data entering the platform and implement the functionalities of data governance. the project will exploit permissioned blockchain technology such as hyperledger [ ] , in order to address all these requirements. as mentioned earlier in this paper, the emerging data market within ports is very appealing for a telecommunications provider due to the wide diversity and the large volumes of data that it owns or handles. although a telecom company can benefit in many ways from being involved in such ecosystem and eventually create a new revenue stream [ , ] , considerations should be taken into account, regarding the risks that come in front. since it is a new marketplace, the legal and regulatory environment cannot be considered as stable. entering a new area always demands careful business approaches. since a main involvement contains data sharing, trust should also be taken into account, not only technological but also ethical which might cause otherwise losing a competitive advantage. towards that direction, dataports project aims at designing a trusted marketplace in order to lower privacy barriers associated with the development of innovative data-intensive applications that consume personal data. the scope is to develop mechanisms that will encourage more and more people (travelers, employees, workers etc.), who so far seem to be reluctant, to share their personal utility and behavioral data. this is very likely to be achieved by keeping data subjects fully informed, including them as actual stakeholders and co-owners of the data archive. the goal is to find the balance between the level of risk the people are willing to take and the benefit they expect as users of a personal data platform; in other words, data privacy versus data utility. the fundamental concern of privacy protection is to prevent confidential data from being leaked. however, in the area of iot and big data, some information about the dataset is desired to be revealed by design. consequently, the quest is to quantify and control the leakage of sensitive information, so that it remains within a tolerated threshold, while allowing certain types of analytics to be performed. in the following paragraphs indicative techniques to fulfill the privacy requirements are presented: in fact, legislation and privacy norms are becoming increasingly strict, but it solutions for addressing these issues are lacking. part of the research in dataports project will be to provide solutions for logging and auditing access to sensitive data, modeling, managing and enforcing privacy and consent policies, as well as providing the ability to anonymize sensitive data. with such a solution in place, trust becomes a differentiator while auditing and compliance overhead is decreased for both the data processor and controller. as a result, the business challenge addressed is twofold; (i) the need to prove compliance to privacy and security laws, directives and norms in a more automated and systematic manner with less overhead and (ii) the desire to gain end user trust, encouraging the sharing of personal data and improving the quality of the data shared. the approach aims at providing tools and libraries for privacy and compliance by design and also offering such kind of solutions for existing applications without requiring changes to them. the goal is to make privacy and compliance part of the it infrastructure and to ensure close coupling of all data with relevant consent and policies. 
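a minimal sketch of purpose-based access control with audit logging, in the spirit of the privacy and consent mechanisms described above, is given below; the consent registry, purposes and record fields are invented for the example, and a real deployment would enforce such rules at the platform gateway and anchor the audit trail in the blockchain layer rather than in a local log:

```python
# toy example of purpose-based access control: a record is released as json only
# if the declared purpose is covered by the data subject's consent, and every
# attempt is written to an audit log. subjects, purposes and fields are invented.
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("data-access-audit")

# consent registry: data subject -> purposes they agreed to
CONSENT = {"subject-042": {"port_traffic_optimization"},
           "subject-077": {"port_traffic_optimization", "research"}}

RECORDS = {"subject-042": {"home_area": "zone-3", "trips_to_port_last_week": 4},
           "subject-077": {"home_area": "zone-1", "trips_to_port_last_week": 0}}

def get_record(subject_id: str, requester: str, purpose: str) -> str:
    """return the record as json only if the declared purpose is covered by consent."""
    allowed = purpose in CONSENT.get(subject_id, set())
    audit.info("access attempt: requester=%s subject=%s purpose=%s allowed=%s time=%s",
               requester, subject_id, purpose, allowed,
               datetime.now(timezone.utc).isoformat())
    if not allowed:
        raise PermissionError("no consent recorded for this purpose")
    return json.dumps(RECORDS[subject_id])

print(get_record("subject-077", requester="analytics-service", purpose="research"))
```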
in many occasions, people want to share not their actual data but (slightly) different one, trying to balance privacy and accuracy [ ] . for example, residents do not wish to expose their exact location, for a variety of purposes, including privacy considerations and risk data leakage that could aid e.g. criminals to understand the occupancy pattern of their house. in the case of utility and behavioral data, the fuzzification could be multidimensional, in terms of space, time and aggregation of the data that are produced from all the smart devices. the initial idea of the blockchain framework was a permissionless distributed database based on the bitcoin peer- -peer open-source protocol [ ] , maintaining a growing list of data records hardened against tampering and revision, even by its own operators. this network model protects users from most prevalent frauds like chargebacks or unwanted charges. dataports project aims at moving one step forward in the context of a data marketplace, strengthening the integrity of the framework in terms of privacy. in a decentralized blockchain architecture as such, a full copy of the blockchain contains -at any time -all records of every transaction ever completed in the network and in addition every block contains a hash of the previous block. this enables the blocks to be traced back even to the first one, known as "the genesis block", which makes computationally prohibitively difficult and impractical to modify a block once it is created, especially as the chain of subsequent blocks gets generated [ ] . this will protect the privacy-restrictions defined by data subjects, against any possible alternation attempts. the data that is owned along with their process mechanisms, require intellectual property protection, especially within a regulatory complexity. moreover, the development of data agreements and privacy concerns, as well as, a different than it is used so far, pricing model should be taken under consideration. risks and potential harms of sharing corporate data, as well as, collecting inaccurate, old or "dirty" data might affect data quality. since to most of the actions concerning data, the gdpr should be applied, collecting unauthorized data or intrusive collection from individuals and organizations could be a complex process. therefore, improper or unauthorized access to shared data could cause conflicting legal jurisdictions and different security levels and in some cases loss of regulatory licenses, standards and certifications, reputational and industrial damages, as well as, drop in share price and/or increase in cost of capital. moreover, incomplete or non-representative sampling, or insufficient, outdated or incompatible data sets can be disastrous, especially in the case where there are many recipients of this new data-sharing/trading platform. although someone might say that from all the above described parameters, the risk for a telecom provider is forbidding enough to enter this new data sharing/trading market, there is equally amount of opportunities to enter these open data marketplaces [ , ] . as it is described in figs. and , it can be an opportunity for new business and increase in customer/subscribers base. in addition, can create an increased availability of vast and heterogeneous data ecosystems for ai and innovative data-driven business models, as well as, a way to tap into 'safe' personal data. opportunities can also exist for the telecom subscribers as well, by obtaining control over personal data. 
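the tamper-evidence property described above (each block storing the hash of its predecessor, all the way back to the genesis block) can be illustrated with a few lines of code; this is a toy example, independent of hyperledger or of any specific blockchain framework, and the record contents are invented:

```python
# minimal hash chain: altering any past block breaks the stored predecessor hashes,
# which is what makes retrospective modification computationally impractical.
import hashlib
import json

def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, payload: dict) -> None:
    previous = block_hash(chain[-1]) if chain else "0" * 64   # genesis has no predecessor
    chain.append({"index": len(chain), "previous_hash": previous, "payload": payload})

def verify(chain: list) -> bool:
    return all(chain[i]["previous_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain: list = []
append_block(chain, {"event": "dataset shared", "provider": "telecom", "consent": True})
append_block(chain, {"event": "dataset accessed", "consumer": "port authority"})
print(verify(chain))                       # True
chain[0]["payload"]["consent"] = False     # tamper with an old record
print(verify(chain))                       # False: the stored predecessor hash no longer matches
```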
it is considered that the well-being and the quality of life benefits from personal data sharing in key sectors. moreover, opportunities exist by accessing personalized and cross-sectoral b c services and increasing the potential of personal data monetization. the entry of a telecom data owner in this emerging data-driven market, creates opportunities for third parties as well. for academia, by increasing the socio-economic impact of research data across domains and borders, it creates an open innovation access through data marketplaces. last but not least opportunities can be created in general for government and public bodies thus improving the local economy. these bodies could include, among others, the municipalities and the regions that host the ports. common use of data through platforms can lead to improved governmental services, especially ai-enhanced digital services. these local and regional opportunities can also lead to an integrated real-time european analytics system exposing annual statistics of the ports ecosystem. a telecom provider can benefit from entering such rapidly evolving data market. as in every case, there are considerations and risks that should be taken into account. these parameters will be investigated during the pilots' execution within dataports h project, where various data sets will be provided and used for cognitive applications. during the piloting phase, the value of the data and their governance methods will be thoroughly investigated. exploiting rich telecom data for increased monetization of telecom application stores big data & advanced analytics in telecom: a multi-billion-dollar revenue opportunity. heavy reading why business schools need to teach about the blockchain data monetisation: opportunities beyond ott: finance, retail, telecom and connected objects end-to-end security and privacy: design and open specification (final) port cybersecurity -good practices for cybersecurity in the maritime sector advances in shipping data analysis and modeling. routledge hyperledger fabric the transition of croatian seaports into smart ports an implementation of integrated interfaces for telecom systems and tms in vessels a big data architecture for managing oceans of data and maritime applications analytics: the real-world use of big data. retrieved from ibm institute for business value a framework for building a smart port and smart port index a robust information life cycle management framework for securing and governing critical infrastructure systems openjs foundation the json data interchange syntax the marketplace of the european innovation partnership on smart cities and communities the worldwide network of port cities report on open data internet of things for smart ports: technologies and challenges key: cord- - mdie v authors: valle, denis; albuquerque, pedro; zhao, qing; barberan, albert; fletcher, robert j. title: extending the latent dirichlet allocation model to presence/absence data: a case study on north american breeding birds and biogeographical shifts expected from climate change date: - - journal: glob chang biol doi: . /gcb. sha: doc_id: cord_uid: mdie v understanding how species composition varies across space and time is fundamental to ecology. while multiple methods having been created to characterize this variation through the identification of groups of species that tend to co‐occur, most of these methods unfortunately are not able to represent gradual variation in species composition. 
the latent dirichlet allocation (lda) model is a mixed‐membership method that can represent gradual changes in community structure by delineating overlapping groups of species, but its use has been limited because it requires abundance data and requires users to a priori set the number of groups. we substantially extend lda to accommodate widely available presence/absence data and to simultaneously determine the optimal number of groups. using simulated data, we show that this model is able to accurately determine the true number of groups, estimate the underlying parameters, and fit with the data. we illustrate this method with data from the north american breeding bird survey (bbs). overall, our model identified main bird groups, revealing striking spatial patterns for each group, many of which were closely associated with temperature and precipitation gradients. furthermore, by comparing the estimated proportion of each group for two time periods ( – and – ), our results indicate that nine (of ) breeding bird groups exhibited an expansion northward and contraction southward of their ranges, revealing subtle but important community‐level biodiversity changes at a continental scale that are consistent with those expected under climate change. our proposed method is likely to find multiple uses in ecology, being a valuable addition to the toolkit of ecologists. occur in space and time. for example, in a spatial context, these approaches have attempted to identify geographic areas with similar taxa, areas that have been variously called "biogeographical regions" (gonzales-orozco, thornhill, knerr, laffan, & miller, ) , "bioregions" (bloomfield et al., ) , or "biogeographical modules" (carstensen et al., ) . such bioregions have been argued to be important for understanding the role of history on community assemblages (carstensen et al., (carstensen et al., , , interpreting ecological dynamics (economo et al., ) , and developing broad-scale conservation strategies (vilhena & antonelli, ) . the latent dirichlet allocation (lda; not to be confused with linear discriminant analysis) model is a powerful model-based method to decompose species assemblage data into groups of species that tend to co-occur in space and/or time. the benefits of using this model include the ability to adequately represent uncertainty, accommodate missing data, and, perhaps most importantly, to describe sampling units as comprised of multiple groups (i.e., mixedmembership [mm] units) (valle, baiser, woodall, & chazdon, ) . conceptually, the ability to describe sampling units as comprised of multiple groups has rarely been considered in previous methods (i.e., prior approaches are typically based on "hard" partitions) but may better honor community dynamics and may better characterize impacts of environmental change. for instance, biome transition zones, ecotones, and habitat edges are locations that are often comprised of a mix of species groups, providing sources for potentially novel species interactions (gosz, ; ries, fletcher, battin, & sisk, ) . similarly, climate change is predicted to cause geographic shifts in species and communities, leading to the hypothesis of novel assemblages arising across space as climate and habitat changes (urban et al., ; williams & jackson, ) . in addition, most partitioning methods that delineate biogeographical regions or modules based on hard boundaries can lead to high uncertainty in boundary delineation-an issue that can be rectified by allowing groups to overlap. 
it is important to note that lda allows for overlapping groups, but it does not require overlap to be present (i.e., if data do not support overlap, no overlap is estimated). it is unfortunate that the lda model, as currently developed, has been restricted to abundance data, which are often not available because accurate quantification of abundance can be very challenging and costly. in the absence of abundance data, researchers often have to rely on presence/absence data to understand species distributions and biodiversity patterns (jones, ; joseph, field, wilcox, & possingham, ). another limitation of the lda model is that the number of groups has to be prespecified, requiring researchers to run lda multiple times and then use some criterion (e.g., aic) to choose the optimal number of groups (e.g., valle et al., ), an approach that often can be computationally costly. in this article, we substantially develop the lda model to be able to fit the much more commonly available presence/absence data and to automatically determine the optimal number of groups. we start by describing our statistical model. then, using simulated data, we show how our method automatically detects the optimal number of groups, reliably estimates the underlying parameters, and better fits the data, outperforming other approaches. at last, we illustrate the novel insights gained using our method by analyzing a long-term dataset collected on breeding birds in the united states and canada (breeding bird survey [bbs]; pardieck, ziolkowski, lutmerding, campbell, & hudson, ) to determine how environmental variables influence bird assemblages across the continent and how these assemblages are changing through time. the overall goal of our method is to identify the major patterns of species co-occurrence in the data, each of which we define to be a distinct species group. we adopt the term species group (instead of "bioregion" or other related terms) because these major co-occurrence patterns do not have to have a strong spatial pattern (although they often do), these groups can overlap in space, and the proportion of groups can change through time. more specifically, our method characterizes each sampling unit l in terms of the proportion of the different groups (parameter vector θ l ) and characterizes each group k in terms of the probability of the different species (parameter vector ϕ k ). for example, θ l = [ . , . , . , 0] indicates that the second group dominates unit l and that the fourth group is absent. this example also illustrates that a given sampling unit can be comprised of multiple groups, which explains why these types of models are called mixed-membership models. in the same way, ϕ k = [0, . , . ] indicates that species and (but not species ) are important species of group k. note that a given species can have a high probability in more than one group. a more formal description of the statistical model is given below. the data consist of a matrix filled with binary variables x isl (i.e., equal to one if species s was present in observation i and unit l and equal to zero otherwise). notice that we assume that multiple observations might have been made for each species s and unit l, possibly due to temporally repeated measures or because multiple subsamples were measured within each unit l (e.g., a forest plot comprised of four subplots). each of these binary variables has an associated latent group membership status z isl .
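as a small numeric illustration (with invented values) of how these two sets of parameters combine, the probability that species s occurs in unit l is obtained by weighting the probability of the species within each group by the group proportions of that unit:

```python
# toy illustration: marginal occurrence probability of species s at unit l is
# sum over groups k of theta[l, k] * phi[s, k]. all values below are invented.
import numpy as np

theta = np.array([[0.2, 0.7, 0.1, 0.0],    # unit 1: mostly group 2 (mixed membership)
                  [0.0, 0.0, 0.0, 1.0]])   # unit 2: pure group 4 (a "hard" unit)
phi = np.array([[0.9, 0.1, 0.0, 0.0],      # species 1: characteristic of group 1
                [0.0, 0.8, 0.7, 0.0],      # species 2: shared by groups 2 and 3
                [0.0, 0.0, 0.0, 0.6]])     # species 3: only present in group 4

occurrence_prob = theta @ phi.T            # shape (units, species)
print(np.round(occurrence_prob, 2))
# unit 1 is dominated by group 2, so species 2 is very likely there and species 3 absent;
# unit 2 belongs entirely to group 4, so only species 3 has appreciable probability.
```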
the latent variable z isl indicates to which group species s in sampling unit l during observation i belongs. given that species s in unit l during observation i comes from group k (i.e., z isl = k), we assume that each observation x isl comes from the following distribution:

x isl | z isl = k ∼ bernoulli(ϕ sk ),

where ϕ sk is the probability of observing species s if this species came from group k. notice that z isl influences the distribution for x isl by determining the k subscript of the parameter ϕ. next, we assume that the latent variable z isl comes from a multinomial distribution:

z isl ∼ multinomial( , θ l ),

where θ l is a vector of probabilities that sum to one, and each element θ lk consists of the probability of a species in unit l to have come from group k. in relation to the priors for our parameters, we adopt a conjugate beta prior for ϕ sk :

ϕ sk ∼ beta(a, b).

throughout this article, we assume vague priors by setting a and b to . building on the work of dunson ( ) and valle et al. ( ), we adopt a truncated stick-breaking prior for θ l . this prior assumes that

v lk ∼ beta(1, γ)

for k = 1, …, k − 1 and γ > 0. we set the parameter for the last group to one (i.e., its v lk is fixed at 1). with these parameters, we calculate θ lk using the stick-breaking construction

θ lk = v lk ∏ j<k (1 − v lj ).

under this prior, θ lk is a priori stochastically exponentially decreasing for small enough γ, and smaller γ tends to enforce greater sparseness (i.e., the existence of fewer groups). for most of the examples in this article, γ was set to . , which we have found to work well for multiple datasets. more details regarding this prior can be found in supporting information appendix s . the benefit of this prior is that, if the data support fewer groups than specified by the user, it will tend to force these superfluous additional groups to be empty or to have very few latent variables z isl assigned to them, as illustrated in the simulation section below. this prior also helps to avoid label switching, a common problem in mixed-membership and mixture models. bayesian markov chain monte carlo (mcmc) algorithms applied to these types of models sometimes mix poorly and can lead to nonsensical results if posterior distributions of parameters are summarized by their averages (stephens, ). the label switching problem refers to the fact that the labels of the different groups can change (e.g., groups and can become groups and , respectively) without changing the likelihood (i.e., the group labels are unidentifiable). our truncated stick-breaking prior helps to avoid the label switching problem by enforcing an ordering of the groups according to their overall proportions. we fit the lda using a gibbs sampler. a more complete description of this model and the derivation of the full conditional distributions used within this gibbs sampler are provided in supporting information appendix s . supporting information appendix s contains a short tutorial describing how to fit the model using the code that we make publicly available, reproducing some of the simulated data results. there are three important points regarding lda that need to be emphasized. first, the proposed model can accommodate negative and positive correlations between species. to illustrate this, assume that there are just two species groups and two species, s and s′. negative correlation between these species is captured by our model if, for example, ϕ s is close to one for group and zero for group , while ϕ s′ is zero for group and close to one for group . these parameter estimates indicate that, whenever a site has a high proportion of group , species s will have a high probability of occurring, whereas species s′ will tend to be absent.
in the same way, whenever a site has a high proportion of group , species s' will have a high probability of occurring but species s will tend to be absent, resulting in negative correlation. positive correlation between these species is captured by our model if, for example,φ s ¼ : : ! and ϕ s ¼ : : ! . these parameter estimates imply that, whenever a site has a high proportion of group , both species s and s' will have a high probability of occurring. in the same way, whenever a site has a high proportion of group , both species s and s' will have a high probability of being absent, inducing positive correlation. second, hard clustering methods that group locations with similar species composition (e.g., kreft & jetz, ) correspond in our model to vectors θ l comprised of zeroes except for a single element which is equal to . in the same way, hard clustering methods that group species that tend to co-occur (e.g., azeria et al., ) from a species composition perspective. this is due to the fact that the probability of observing species s for two locations p and q is (see supporting information appendix s for details). in this scenario, the algorithm might determine that a single species group dominates all locations instead of distinguishing the different species groups. we simulate data to evaluate the performance of the proposed model and to compare its results to those from other clustering methods. to avoid the identifiability problems described above, we generate parameters for all simulations such that each group completely dominates at least one location and that each group has at least one species that is never present in the other groups (ensuring distinct species composition of these groups). we illustrate with simulated data how the truncated stick-breaking prior can identify the optimal number of groups and how our algorithm can retrieve the true parameter values under a wide range of conditions. more specifically, the true number of groups k* was set to and ; the number of sampling units (i.e., locations) was set to and ; the number of species was set to and ; and the number of observations per location was set to . parameters were drawn randomly (i.e., ϕ sk ∼ beta : ; : ð Þ and θ l ∼ dirichlet : ð Þ), and the identifiability assumptions described above were then imposed. we adopted a beta : ; : ð Þdistribution for ϕ sk because this distribution is likely to generate species groups that are more dissimilar in terms of species composition given that it is a u-shaped symmetric distribution. we generated datasets for each combination of these settings, totaling datasets. to fit these data, we assume a maximum of groups (k = ) and estimate the true number of groups k* by determining the number of groups that are not superfluous. superfluous groups are defined to be those groups that are very uncommon across the entire region (i.e., θ lk < : for % of the locations, where θ lk is the mean of the posterior distribution). at last, we test the sensitivity of the modeling results to the prior by fitting these data with γ set to . and . we also compare lda to other methods using simulated data. in these simulations, we assume data are available on species over , locations, with five repeated observations per location. furthermore, , , and groups were used to generate these data. 
because the goal is to compare inference from different methods, we set the parameters θ lk in such a way that it allows for a straightforward visual appraisal of the advantages and limitations of the different methods. on the other hand, the parameters ϕ ks were randomly drawn from beta : ; : ð Þ ; and subsequently, the assumption regarding groups with distinct species composition was imposed. when fitting lda, we set the maximum number of groups to and rely on the truncated stick-breaking prior with γ ¼ : to uncover the correct number of groups. we compare and contrast inference from our model to that from competing approaches, including traditional hard clustering methods (i.e., hierarchical and kmean clustering) and mixture models (i.e., region of common profile (rcp) model; foster et al., ; lyons, foster, & keith, to determine whether these breeding bird groups have been shifting their spatial distribution over time, we divided our study period into two -year periods: - and - . each route × period combination resulted in a distinct "sampling unit" (i.e., distinct row in our data matrix), and data from individual years within each time period were treated as repeated observations. to relate the spatial distribution of the identified groups to potential environmental drivers, we relied on freely available precipitation and temperature data from worldclim version (available at http://worldclim.org/version ) (fick & hijmans, ) . these data consist of the -years average climate information (from to ) for the month of june, covering the entire world. in an era of global change, an important feature of our method is that it is able to detect relatively subtle temporal changes in species composition. more specifically, we assessed whether group ranges had expanded north and contracted south. these are the patterns we a priori expected given warming temperatures and the strong influence of temperature on the spatial distribution of a range of taxonomic groups, including birds (chen, hill, ohlemuller, roy, & thomas, ; hitch & leberg, ; moritz et al., ; parmesan & yohe, ) . to detect these patterns, we fit the model once to data from both time periods (instead of fitting the model separately for each time period we set the maximum number of groups to for our case study. to interpolate the estimates of the proportion of different groups to unsampled areas, we relied on the inverse distance weighted (idw) algorithm implemented in the package "gstat" (graler, pebesma, & heuvelink, ; pebesma, ) . interpolations were restricted to locations within one degree of a bbs route. finally, our algorithm was programmed using a combination of c++ (through the rcpp package; eddelbuettel & francois, ) and r code (r core team ). we provide a tutorial in the supporting information appendix s for fitting this model. despite assuming the potential existence of a much higher number of groups (k = ), our results reveal that the proposed model was generally able to estimate well the true number of groups (boxplots in figure ) , except for datasets with few species and locations but many groups (i.e., locations, species, and groups; figure f) . we also find a good correspondence between the true and estimated parameter values for most of the scenarios explored (scatterplots in figure ) , with a slightly worse performance for data with few species but many groups (i.e., species and groups; figure g,h) . 
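the snippet below is a compact, illustrative re-implementation in python of the data-generating process and of one possible gibbs sweep, written directly from the model description given earlier; it is not the authors' released r/c++ code, and the array sizes, hyperparameter values, random seed and the threshold used to flag superfluous groups are arbitrary choices made for the example (the conditional updates assume beta-bernoulli conjugacy for ϕ and the truncated stick-breaking prior as reconstructed above):

```python
# illustrative sketch: simulate presence/absence data from the generative model and
# run one gibbs sweep (z, phi, v). in practice many sweeps are run and burn-in discarded.
import numpy as np

rng = np.random.default_rng(0)
L, S, I, K_true, K_max = 50, 30, 5, 3, 10      # units, species, repeats, groups (arbitrary)
a = b = 1.0                                     # vague beta prior for phi (assumed values)
gamma = 0.1                                     # stick-breaking hyperparameter (assumed)

def stick_break(v):
    """turn stick-breaking fractions v (last column fixed at 1) into proportions theta."""
    one_minus = np.cumprod(1.0 - v, axis=1)
    return v * np.concatenate([np.ones((v.shape[0], 1)), one_minus[:, :-1]], axis=1)

# ----- simulate presence/absence data from the generative model -----
phi_true = rng.beta(0.1, 0.1, size=(S, K_true))
theta_true = rng.dirichlet(np.ones(K_true), size=L)
z = np.array([[rng.choice(K_true, size=I, p=theta_true[l]) for _ in range(S)]
              for l in range(L)])                               # shape (L, S, I)
x = rng.binomial(1, phi_true[np.arange(S)[None, :, None], z])   # shape (L, S, I)

# ----- one gibbs sweep -----
phi = rng.uniform(size=(S, K_max))
v = np.concatenate([rng.beta(1.0, gamma, size=(L, K_max - 1)), np.ones((L, 1))], axis=1)
theta = stick_break(v)

# 1) sample each z_isl from its full conditional, proportional to theta_lk * bernoulli(x | phi_sk)
lik = np.where(x[..., None] == 1, phi[None, :, None, :], 1.0 - phi[None, :, None, :])
weights = theta[:, None, None, :] * lik                 # shape (L, S, I, K_max)
weights /= weights.sum(axis=-1, keepdims=True)
cum = np.cumsum(weights, axis=-1)
z_new = (rng.uniform(size=x.shape)[..., None] < cum).argmax(axis=-1)

# 2) update phi_sk | rest ~ beta(a + presences, b + absences) within each group
for k in range(K_max):
    in_k = (z_new == k)
    ones = (x * in_k).sum(axis=(0, 2))                  # presences per species in group k
    zeros = ((1 - x) * in_k).sum(axis=(0, 2))           # absences per species in group k
    phi[:, k] = rng.beta(a + ones, b + zeros)

# 3) update the stick-breaking fractions v_lk | rest ~ beta(1 + n_lk, gamma + n_{l,>k})
counts = np.stack([(z_new == k).sum(axis=(1, 2)) for k in range(K_max)], axis=1)  # (L, K)
greater = counts[:, ::-1].cumsum(axis=1)[:, ::-1] - counts
v[:, :-1] = rng.beta(1.0 + counts[:, :-1], gamma + greater[:, :-1])
theta = stick_break(v)

# groups whose proportions stay near zero everywhere are flagged as superfluous
# (0.05 is an illustrative threshold; after a single sweep this is only indicative)
print("non-superfluous groups:", (theta.max(axis=0) > 0.05).sum())
```

the same counting logic (presences and absences per species within each group, and group membership counts per unit) underlies the full conditional distributions referred to in the text.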
taken together, these results suggest that, when the ratio of the number of species to the number of groups is small, there is likely to be less distinction between groups from a species composition perspective, making it a harder task to untangle these groups. finally, in relation to the prior, we find that our results are broadly similar for γ ¼ : and γ ¼ . the main difference is that parameter estimates tended to be slightly worse when the true number of groups is and γ ¼ and when the true number of groups is and γ ¼ : (results not shown), agreeing with our expectations. because smaller γ values induce greater sparseness, parameters are better estimated with γ ¼ : when simulations are based on sparse assumptions (i.e., simulations with three groups) versus when this is not true (i.e., simulations with groups). our results reveal that the algorithm accurately estimates the proportion of the different groups in each location, regardless if mm units are present or not (left most and right most panels, respectively, in figure ). these results corroborate the observation that lda encompasses hard clustering of sites and/or species as special cases. on the other hand, figure clearly reveals that hard clustering methods cannot represent these mm locations (k-means and hierarchical clustering [hc] panels in figure ). mixture model approaches such as rcp are sometimes perceived to be able to represent these gradual changes in the proportion of groups. however, f i g u r e the latent dirichlet allocation (lda) estimates well the true number of groups (boxplots) and the θ lk parameter values (scatterplots). results from all datasets in each simulation setting are displayed simultaneously, based on lda with γ set to . . top and bottom panels display results for three and groups, respectively. boxplots in panels (a) and (f) show the estimated number of groups (i.e., the number of groups deemed not to be superfluous), revealing that lda can estimate well the true number of groups (k*) except for datasets with few locations (l), few species (s) but many groups (i.e., l s k*). scatterplots (panels b-e and g-j) reveal that the θ lk parameters can also be well estimated but there is considerable noise for datasets with few species but many groups (panels g and h). a : line and a linear regression line were added for reference (blue and red lines, respectively) [colour figure can be viewed at wileyonlinelibrary.com] our results reveal that, when applied to our simulated data, rcp tended to give transition regions that were too narrow (rcp panels in figure ). these model comparison results are particularly striking given that lda was fitted assuming potential groups, whereas results for the other methods were based on the assumption that the true number of groups was known. notice that these figures illustrate how lda can capture gradual changes in species composition associated with global change phenomena depending on what is being represented in the x-axis. for instance, the x-axis can represent a spatial gradient of anthropogenic forest disturbance (e.g., timber logging intensity or distance to forest edge) or can represent time (i.e., the same location sampled repeatedly through time, perhaps revealing the impact of climate change on species composition). recall that the simulated data were generated with , , and groups, but that the maximum number of groups when fitting lda was set to . 
our results suggest that the truncated stick-breaking prior was able to correctly estimate the underlying true number of groups (boxplots in figure ) given that the estimated θ lk 's were shrunk toward zero for the superfluous groups (red boxes in figure ) . we also find that all the other alternative methods required a much greater number of groups to fit the data as well as lda when mm locations are present (line graphs in figure ). these results reveal that lda achieves a much sparser representation of the data (based on the number of groups) without losing the ability to represent the inherent variability in the data. although these results are expected, given the larger number of parameters in lda, the ability to fit the data well with fewer groups is highly desirable from the user's perspective as the primary role of these methods is to reduce the dimensionality of biodiversity data. it is important to note that even in the absence of mm sampling units, lda can still estimate well the true number of groups and has similar fit to the data as the other clustering approaches (results not shown). finally, although overall, we identified main breeding bird groups (of a maximum of ) after eliminating groups that were very uncommon throughout the region (defined here as those for which θ lk was smaller than . for % of the locations, where θ lk denotes the posterior mean). an important test for any unsupervised method is if it is able to retrieve patterns that are widely acknowledged to exist by experts. using the estimated group proportion for each location for the - period, we find striking spatial patterns (maps in figure ) . importantly, these spatial patterns generally agree well with other maps of bird communities (e.g., bird conservation (figure a ). we find that the species that best f i g u r e the extended latent dirichlet allocation (lda) method identifies the true number of groups (left panels) and fits the data better than other clustering approaches for data with mm locations (right panels). results are shown separately for simulated data with , , and groups (top to bottom). boxplots depict the estimated proportion θ lk of each group k for all locations l = ,…,l. these boxplots emphasize how θ lk for the irrelevant extra groups (red boxes) are shrunk to zero for all locations. line graphs show the log likelihood, a measure of model fit for which larger values indicate better fit. these graphs reveal how other clustering approaches require a much greater number of groups to fit the data as well as lda with fewer groups. model fit results for lda correspond to the posterior mean of the log likelihood. lda results are shown with a single symbol because, differently from the other methods that were fitted multiple times with different number of groups, lda was fitted just once using a maximum of groups and the true number of groups was estimated (see corresponding boxplots). details regarding how the log likelihood was calculated for the different methods are provided in supporting find that group identifies species associated with desert environments (e.g., cactus wren and ash-throated flycatcher), while group identifies a mixture of short-grass prairie birds (e.g., dickcissel) and species associated with open country environments with scattered trees and shrubs (e.g., eastern phoebe). besides these biogeographical patterns, we also highlight the ability of our algorithm in depicting how environmental gradients are linked to the proportion of each group. 
for instance, we display how the main east coast groups (groups , , , , , and ) are distributed along the east coast and along the associated climate gradients (figure ).

figure . (a) displays the spatial pattern of groups , , , , , and along the east coast, and (c) displays the spatial pattern of groups and . in both (a) and (c), a higher proportion of individual groups is depicted using more opaque (i.e., less transparent) colors, and different groups are depicted with different colors. (b, d) reveal that average june temperature and precipitation gradients seem to strongly constrain the spatial distribution of these breeding bird groups, respectively. circles represent the estimated proportion for each location and group, while lines depict suitability envelopes. these envelopes were created by first defining equally spaced intervals on the x-axis and then calculating the median x value and the % percentile of y within each interval and connecting these results. notice that the same color scheme is used for right and left panels.

the latent dirichlet allocation (lda) model is a useful model for ecologists because it can more faithfully represent community dynamics and the impact of environmental change through the estimation of mixed-membership sites (valle et al., ). the standard lda requires abundance data but, for many taxa, reliably estimating abundance is often very hard and costly (ashelford, chuzhanova, fry, jones, & weightman, ; joseph et al., ; kembel, wu, eisen, & green, ; royle, ; schloss, gevers, & westcott, ). for these reasons, presence/absence data are typically much more ubiquitous than abundance data, often enabling analysis at much broader spatial and temporal scales.

figure . species groups with a statistically significant association between latitude and change in group proportion.

using the breeding bird survey (bbs) dataset as a case study, we have shown how our method is able to uncover striking spatial and temporal patterns in bird groups. for example, we illustrate how these groups gradually change along a temperature gradient in the east coast and a precipitation gradient in texas. it has long been known that many bird species have strong relationships with abiotic gradients (bowen, ), but how these gradients can explain entire groups of species has remained elusive. furthermore, we find subtle but pervasive changes in bird group proportions, changes which follow the expected patterns based on climate change (e.g., parmesan & yohe, ). half of the species groups (nine of ) have expanded their northern range and shrunken their southern range. this pattern is consistent with species-specific models of changes in bird distribution with climate change in the united states (e.g., hitch & leberg, ; la sorte & thompson, ). our results expand on these findings by illustrating how entire groups are shifting their spatial distribution. nevertheless, a more formal test that accounts for the multiple factors that influence the spatial distribution of birds will be required to ultimately confirm whether climate change is driving the spatial distribution shifts that we have detected. an important limitation of the method that we have presented is that the identified groups do not change over time, even though their spatial distribution may vary. in other words, θ lk may change with time but ϕ ks does not. this is particularly relevant in the context of climate change, where it is possible that the species composition of the groups themselves might be changing (lurgi, lopez, & montoya, ; stralberg et al., ; urban et al., ).
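one simple way to screen for the association between latitude and the change in a group's proportion between the two periods (an illustration only, not necessarily the exact test used in the study) is to regress the change on route latitude; under a northward shift, the change is positive at high latitudes and negative at low ones. the data below are simulated for the example:

```python
# illustrative check for a northward shift of one species group: regress the change
# in its estimated proportion between the two periods on route latitude. routes,
# latitudes and proportions are simulated here; real input would be the lda output.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_routes = 200
latitude = rng.uniform(30, 50, n_routes)                 # hypothetical bbs route latitudes
theta_period1 = np.clip(0.5 - 0.01 * (latitude - 40) + rng.normal(0, 0.05, n_routes), 0, 1)
theta_period2 = np.clip(theta_period1 + 0.004 * (latitude - 40) + rng.normal(0, 0.05, n_routes), 0, 1)

delta = theta_period2 - theta_period1                    # change in group proportion
slope, intercept, r, p_value, stderr = stats.linregress(latitude, delta)
print(f"slope per degree latitude = {slope:.4f}, p = {p_value:.3g}")
# a northward shift shows up as gains in the north and losses in the south
print("mean change north of 45n:", delta[latitude > 45].mean())
print("mean change south of 35n:", delta[latitude < 35].mean())
```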
another important limitation in this study is that the proposed model does not take into account imperfect detection, a pervasive issue for wildlife sampling (mackenzie et al., ; royle, ) . this shortcoming can be partially attributed to inherent limitations in the bbs dataset, given that the estimation of detection probabilities requires very specific data types (e.g., repeated visits in occupancy models). it is also critical to highlight the importance of repeated observations per location given the relatively low information content in binary presence/absence data. determining all the parameters in the proposed model, including the optimal number of groups, can be challenging in the absence of these repeated observations. finally, although important broad-scale patterns can be identified and novel insights gained from post hoc analysis of lda model parameters, as illustrated with our case study, these results rely on a two-stage analysis that does not take into uncertainty in the estimated parameters. our ongoing work is focused on extending lda to accommodate covariates through regression models built-in to lda so that uncertainty can be coherently propagated when performing more formal statistical tests and when making spatial and temporal predictions. community ecologists have traditionally relied on fitting clustering models with different numbers of clusters and choosing the optimal number of clusters using metrics such as aic and bic (fraley & raftery, ; xu & wunsch, ) . using simulated data, we have shown how the truncated stick-breaking prior can aid the determination of the true number of groups. we acknowledge, however, that the modeler still has to specify the hyperparameter γ and the maximum number of groups k. using simulated data, we have found that setting γ to . often works well and that our model often identifies k groups if the true number of groups is equal or larger than k. while this may be seen as an indication that k has to be increased when using real data, an extremely large number of groups defeats the purpose of dimension reduction, making it increasingly harder to visualize and interpret model outputs. ultimately, we believe that the decision regarding the maximum number of groups k is a balance between what the data suggest and pragmatic considerations regarding how the results will be displayed and interpreted. our empirical example focused on large-scale biogeographical patterns. nevertheless, this method could also be applied in a landscape-scale context, identifying spatial variation in community structure within general habitat types and across patches, or to analyze long-term temporal changes in time-series data of species composition (e.g., christensen, harris, & ernest, ) . given the ubiquity of presence/absence data in community ecology, we believe that the extension of the latent dirichlet allocation model developed here will see a much wider use, becoming an important addition to the toolkit of community ecologists. we thank the numerous comments provided by ben baiser, daijiang http://orcid.org/ - - - robert j. fletcher jr. http://orcid.org/ - - - new screening software shows that most recent large s rrna gene clone libraries contain chimeras. applied and environmental microbiology using null model analysis of species co-occurrences to deconstruct biodiversity patterns and select indicator species. 
a comparison of network and clustering methods to detect biogeographical regions
african bird distribution in relation to temperature and rainfall
care and feeding of topic models: problems, diagnostics, and improvements
biogeographical modules and island roles: a comparison of wallacea and the west indies
the functional biogeography of species: biogeographical species roles in wallacea and the west indies
rapid range shifts of species associated with high levels of climate warming
long-term community change through multiple rapid transitions in a desert rodent community
nonparametric bayes applications to biostatistics
breaking out of biogeographical modules: range expansion and taxon cycles in the hyperdiverse ant genus pheidole
rcpp: seamless r and c++ integration
worldclim : new -km spatial resolution climate surfaces for global land areas
ecological grouping of survey sites when sampling artefacts are present
model-based methods of classification: using the mclust software in chemometrics
biogeographical regions and phytogeography of the eucalypts
ecotone hierarchies
spatio-temporal interpolation using gstat
breeding distributions of north american bird species moving north as a result of climate change
monitoring species abundance and distribution at the landscape scale
presence-absence versus abundance data for monitoring threatened species
incorporating s gene copy number information improves estimates of microbial diversity and abundance
a framework for delineating biogeographical regions based on species distributions
poleward shifts in winter ranges of north american birds
numerical ecology
novel communities from climate change
simultaneous vegetation classification and mapping at large spatial scales
estimating site occupancy rates when detection probabilities are less than one
impact of a century of climate change on small-mammal communities in yosemite national park
north american breeding bird survey dataset
a globally coherent fingerprint of climate change impacts across natural systems
multivariable geostatistics in s: the gstat package
r: a language and environment for statistical computing
ecological responses to habitat edges: mechanisms, models, and variability explained
n-mixture models for estimating population size from spatially replicated counts
reducing the effects of pcr amplification and sequencing artifacts on s rrna-based studies
dealing with label switching in mixture models
re-shuffling of species with climate disruption: a no-analog future for california birds
improving the forecast for biodiversity under climate change
decomposing biodiversity data using the latent dirichlet allocation model, a probabilistic multivariate statistical method
individual movement strategies revealed through novel clustering of emergent movement patterns
a network approach for identifying and delimiting biogeographical regions
novel climates, no-analog communities, and ecological surprises
survey of clustering algorithms
extending the latent dirichlet allocation model to presence/absence data: a case study on north american breeding birds and biogeographical shifts expected from climate change
key: cord- -zm nae h authors: vito, domenico; ottaviano, manuel; bellazzi, riccardo; larizza, cristiana; casella, vittorio; pala, daniele; franzini, marica title: the pulse project: a case of use of big data uses toward a cohomprensive health vision of city well being date: - - journal: the impact of digital technologies on public health in developed and
developing countries doi: . / - - - - _ sha: doc_id: cord_uid: zm nae h despite its effects being sometimes hidden to the wider public, air pollution is becoming one of the most impactful threats to global health. cities are the places where deaths due to air pollution are most concentrated. in order to correctly target intervention and prevention, it is therefore essential to assess the risk and the impacts of air pollution spatially and temporally inside urban spaces. pulse aims to design and build a large-scale data management system enabling real-time analytics of health, behaviour and environmental data on air quality. the objective is to reduce the environmental and behavioral risk of chronic disease incidence and to allow timely and evidence-driven management of epidemiological episodes linked in particular to two pathologies, asthma and type diabetes, in adult populations, developing policy-making across the domains of health, environment, transport and planning in the pulse test-bed cities. air pollution has silently become one of the most impactful menaces to global health. the european environment agency [ ] estimates that premature deaths attributable to exposure to fine particulate matter are about in over eu countries. the exposure to no and o concentrations in the same countries in has been around and , respectively. the health threat of air pollution remains located mostly in cities, but the effects are not limited to wellbeing; they are also economic. the most vulnerable to these risks are lower-income socio-economic groups, which nowadays are also the most exposed to environmental hazards. air pollution indeed does not represent only a sanitary issue: its burden is also reflected in increasing medical costs. air pollution is thus a problem that can only be addressed with a strategic vision and long-term targeted policies, mainly in urban environments. in the year , itu and the united nations economic commission for europe (unece) defined a smart and sustainable city as "an innovative city that uses information and communication technologies (icts) and other means to improve quality of life, efficiency of urban operation and services, and competitiveness, while ensuring that it meets the needs of present and future generations with respect to economic, social, environmental as well as cultural aspects". this definition also led, in , to the united for smart sustainable cities initiative (u ssc). this open global platform responded to united nations sustainable development goal : "make cities and human settlements inclusive, safe, resilient and sustainable", offering an enabling environment to spread knowledge and innovation globally [ ]. the health sector has also been influenced by this vision: the rise of social networking, cloud-based platforms, and smartphone apps that support data collection has enhanced opportunities to collect data outside of the traditional clinical environment. this explosion of information has allowed patients to collect and share data among each other, their families and clinicians. patient-generated health data (pghd) is defined as health-related data generated and recorded by or from patients outside of clinical settings.
this data could be an important resource for patients, clinicians and decision makers to address a current or emerging health issue, and much of it is available globally, especially when integrated with information coming from diffuse sensor/iot devices and manually input voluntary data reported by patients, caregivers, or generic citizen participation, bringing about shared decision-making. the definitions above help to understand the context of the pulse project. pulse aims to design and build a large-scale data management system enabling real-time analytics of flows of personal data. the objective is to reduce the environmental and behavioral risk of chronic disease incidence and to allow timely and evidence-driven management of epidemiological episodes linked in particular to two pathologies, asthma and type diabetes, in adult populations, developing policy-making across the domains of health, environment, transport and planning in the pulse test-bed cities. the project is currently active in eight pilot cities, barcelona, birmingham, new york, paris, singapore, pavia, keelung and taiwan, following a participatory approach where citizens provide data through personal devices and the pulsair app, which are integrated with information from heterogeneous sources: open city data, health systems, urban sensors and satellites. pulse fosters the long-term sustainability goal of establishing an integrated data ecosystem based on continuous large-scale collection of all the stated heterogeneous data available within the smart city environment. the pulse project aims to build a set of extensible models and technologies to predict, mitigate and manage health problems in cities and promote population health. currently pulse is working in eight global cities. it harvests a multivariate data platform fed by open city data, data from health systems, urban and remote sensors and personal devices to minimize the environmental and behavioral risk of chronic disease incidence and prevalence and to enable evidence-driven and timely management of public health events and processes. the clinical focus is on asthma and type diabetes in adult populations: the project has been a pioneer in the development of dynamic spatiotemporal health impact assessments through exposure-risk simulation models with the support of webgis for geolocated population-based data. finally, pulse offers a wider vision of wellbeing, in which wellbeing is understood also in relation to environmental conditions. acquisition, systematization and correlation of large volumes of heterogeneous health, social, personal and environmental data are among the core and primary activities in the pulse project. the overall goal of the deployments involves deriving additional value from the acquired data by developing more comprehensive benchmarking and understanding of the impact of social and environmental factors on health and wellbeing in urban communities, thereby broadening the scope of public health. to this end, pulse has developed tools for end-users (primarily citizens and patients, public health institutions and city services) that leverage open, crowd-sourced and remote sensing data, through integration, enrichment and improved accuracy/reliability of risk models, to guide actions and deliver interventions aiming to mitigate asthma and t d risk and improve healthy habits and quality of life. figure shows the conceptual schema of the relationships among dataflows.
the pulse project focuses on the link between air pollution and the respiratory disease of asthma, and between physical inactivity and the metabolic disease of type diabetes. the risk assessment for these two pathologies comprises, respectively, the evaluation of: for type diabetes: the associated behavioural risks (i.e. reduced exercise/physical activity at home or in public places), which are associated with a higher risk of t d onset in a dose-response relationship. the assessment uses unobtrusive sensing/data collection and volunteered data to collect baseline measures of health and wellbeing, and to track and model mobility at home and across the city (including time, frequency and route of mode of transit and/or movement). for asthma: environmental/exposure risks (i.e. exposure to air pollution, especially with regard to near-roadway air pollution); poor air quality is associated with a higher risk of asthma onset and exacerbation. risks of disease onset are evaluated through risk assessment models, which in pulse are biometric simulation models that predict the risk of the onset of asthma and diabetes in relation to air quality. the models were chosen from a literature review of prediction models of type diabetes (t d) onset and adult-onset asthma. some of them were selected to be implemented and recalibrated on the datasets available in the pulse repository, adding new variables [ ]. the pulse architecture is composed of main structures [ ]: pulsair, app server, air quality distributed sensor system, gisdb, webgis and personal db. -pulse app: the personal app provided to the participants, in charge of collecting sensor data and interacting with the users to propose interventions and gamification. pulsair is available both for ios and android and can be connected to fitbit, garmin and asus health tracker devices. -air quality distributed sensor system: the pulse air quality sensor system is composed of multiple types of sensors and sensor datasets: it combines mobile sensors and a mobile network of sensors in order to monitor the variable trends in emissions within urban areas with a high resolution and to appropriately address the temporal and spatial scales at which pollutants usually spread. two types of sensors have been used across the pilots, the aq x of dunavnet ( +, deployed in all pilots) and the purpleair pa-ii sensor. the who definition of health includes reference to wellbeing: health is "a state of complete physical, mental and social wellbeing and not merely the absence of disease or infirmity" [ ]. wellbeing is a dynamic construct comprised of several dimensions. in a comprehensive view of wellness, main domains of wellness can be defined: psychological health, physical health and subjective wellbeing. subjective wellbeing (swb) is often measured via validated psychometric scales, and individual and community surveys. subjective wellbeing is linked to health-related quality of life (hrqol) but is not synonymous with it. the factors identified as the most important for subjective wellbeing vary across space, time and cultural context (fig. ). wellness also entails the simultaneous fulfillment of the three types of needs.
personal needs (e.g., health, self-determination, meaning, spirituality, and opportunities for growth) are intimately tied to the satisfaction of collective needs such as adequate health care, environmental protection, welfare policies, and a measure of economic equality, since citizens require public resources to pursue private aspirations and maintain their health. wellness also concerns relational needs. two sets of needs are primordial in pursuing healthy relationships among individuals and groups: respect for diversity, and collaboration and democratic participation. most approaches to community wellbeing (or its associated terms) follow a components approach: the majority of them have, at their core, an emphasis on individual wellbeing. pulse has focused on defining and developing a new concept of urban wellbeing tied to the broader concept of urban health resilience. this recognizes the connections between the physical characteristics of the urban environment (including assets and deficits) and human health (including both physical and psychological health). the pulse concept of urban wellbeing refers to the interaction between the positive and negative experiences within cities (whether objective or subjective), and the individual and community practices of mobility and placemaking. this novel interpretation of wellbeing focuses on the dynamic interplay between individual psychological characteristics and strengths, the neighborhoods in which people live and work, and the capacity of individuals to respond to environmental and interpersonal stressors [ ]. within our population urban health model, the physical and social environments are understood as key drivers of wellbeing. this prioritizes an integrated, or relational, approach to urban places and health equity, including population differences in wellbeing. central to this relational approach is the idea that place matters: our health and wellbeing are shaped by the characteristics of the settings where we live and work, and these environments are in turn shaped by our health-related actions and behaviours. several recent studies have highlighted this important dynamic. using data from the english longitudinal study of aging, hamer and shankar [ ] found that individuals who hold more negative perceptions of their neighbourhood report less positive wellbeing, and experience a greater decline in wellbeing over time. of course, place itself can have a profound impact on our wellbeing. in pulse, we contextualize wellbeing within a model of urban resilience: urban resilience refers to the ability of an urban system, and all its constituent socio-ecological and socio-technical networks across temporal and spatial scales, to maintain or rapidly return to desired functions in the face of a disturbance, to adapt to change, and to quickly transform systems that limit current or future adaptive capacity. in this definition, urban resilience is dynamic and offers multiple pathways to resilience (e.g., persistence, transition, and transformation). it recognizes the importance of temporal scale, and advocates general adaptability rather than specific adaptedness. the urban system is conceptualized as complex and adaptive, and it is composed of socio-ecological and socio-technical networks that extend across multiple spatial scales. resilience is framed as an explicitly desirable state and, therefore, should be negotiated among those who enact it empirically.
resilient urban neighborhoods can be broadly defined as those that have lower than expected premature mortality (measured via the urban health indicators). in pulse, we define urban wellbeing as an integral component of urban resilience. urban wellbeing, in this context, refers to the individual traits and capacities to prepare for, respond to, and recover from the personal and interpersonal challenges encountered in cities. these challenges could include experiences of bias and exclusion, on the one hand, and exposure to under-resourced or polluted environments, on the other. each of these challenges is associated with physiological and psychological stress at the individual and community level. stress is, of course, antithetical to wellbeing. to translate these concepts into data constructs, two main instruments are available in the pulse architecture: the risk assessment models, previously described, and the urban maps. the physical environment, socio-economic and cultural conditions, urban planning, and available public or private services and leisure facilities are some of the factors that can have an effect on a person's health. hence, interest in the study of geographical patterns of health-related phenomena has increased in recent years. within this context, maps have been demonstrated to be a useful tool for showing the spatial distribution of many types of data used in public health in a visual and concise manner [ , ]. for example, they permit the study of general geographical patterns in health data and the identification of specific high-risk locations. an example of these maps in pulse are the personal exposure maps. personal exposure is a concept from epidemiological science used to quantify the amount of pollution that each individual is exposed to as a consequence of the living environment, habits, etc. personal exposure has been obtained by matching the data from the dense network of low-cost sensors with the information on habits coming from the pulsair app; the data has been calculated following the sampling rate of the sensors. figure shows a map of the personal exposure to pm with an hourly frequency. furthermore, using the gps tracks from the pulsair app, fitbit data and the personal exposure, an estimate of the inhaled pollutant has been obtained in association with three classes of movement defined by the speed of body translation (standing, walking and running), considering the breaths per minute and the air volume per breath [ ]. the personal exposure results have also been traced into exposure paths as in fig. : a time-lapse of min corresponds to a dot of the movement line. the multivariate data-driven approach of pulse gives an example of a new conception of health and wellness, focused not only on individual health status but also on the relationship between individual and environment. such a vision can also be related to the definition of "planetary health" provided by "the lancet countdown" [ ]. the data-driven approach pursued in pulse has surely given a great opportunity to implement such a vision, which would not have been so immediate without the possibility to integrate different sources of data.
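to make the personal exposure computation described above more concrete, the sketch below shows one way such an inhaled-dose estimate could be assembled from a pollutant concentration matched to each gps point and a movement class derived from speed. the speed thresholds, breathing rates and tidal volumes used here are illustrative placeholder values, not the ones used in pulse.

```python
from dataclasses import dataclass

# illustrative ventilation parameters per movement class
# (placeholder values, not those used in the pulse project)
VENTILATION = {
    "standing": {"breaths_per_min": 14, "litres_per_breath": 0.5},
    "walking":  {"breaths_per_min": 20, "litres_per_breath": 0.8},
    "running":  {"breaths_per_min": 30, "litres_per_breath": 1.5},
}

@dataclass
class TrackPoint:
    minutes: float   # duration represented by this gps sample
    speed_kmh: float # speed of body translation at this sample
    pm_ug_m3: float  # pollutant concentration matched from the nearest low-cost sensor

def movement_class(speed_kmh: float) -> str:
    # assumed thresholds for standing / walking / running
    if speed_kmh < 1.0:
        return "standing"
    if speed_kmh < 7.0:
        return "walking"
    return "running"

def inhaled_dose_ug(track: list[TrackPoint]) -> float:
    """estimate of inhaled pollutant mass (micrograms) along an exposure path."""
    total = 0.0
    for p in track:
        v = VENTILATION[movement_class(p.speed_kmh)]
        litres = v["breaths_per_min"] * v["litres_per_breath"] * p.minutes
        total += p.pm_ug_m3 * (litres / 1000.0)  # concentration (ug/m3) * inhaled volume (m3)
    return total

# toy usage: three five-minute segments with different activity levels
path = [TrackPoint(5, 0.5, 22.0), TrackPoint(5, 4.5, 30.0), TrackPoint(5, 9.0, 18.0)]
print(round(inhaled_dose_ug(path), 2))
```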
references:
air quality in europe
comparative analysis of standardized indicators for smart sustainable cities: what indicators and standards to use and when?
empowering citizens through perceptual sensing of urban environmental and health data following a participative citizen science approach
world health organization - un habitat: global report on urban health
who: closing the gap in a generation: health equity through action on the social determinants of health
why we need urban health equity indicators: integrating science, policy, and community
associations between neighborhood perceptions and mental well-being among older adults
overview of the health and retirement study and introduction to the special issue
identification of persons at high risk for type diabetes mellitus: do we need the oral glucose tolerance test?
two risk-scoring systems for predicting incident diabetes mellitus in u.s. adults aged to years
hapt d: high accuracy of prediction of t d with a model combining basic and advanced data depending on availability
applied spatial statistics for public health data
atlas de mortalidad en áreas pequeñas de la capv
dynamic spatio-temporal health impact assessments using geolocated population-based data: the pulse project
the lancet countdown on health and climate change: from years of inaction to a global transformation for public health
acknowledgments. this research was funded by the european union's research and innovation program h and is documented in grant no . in particular, pulse was funded under the call h -eu- . , in the topic sci-pm- - , big data supporting public health policies. more information at: www.project-pulse.eu.
key: cord- -s t a z authors: christantonis, konstantinos; tjortjis, christos; manos, anastassios; filippidou, despina elizabeth; mougiakou, Εleni; christelis, evangelos title: using classification for traffic prediction in smart cities date: - - journal: artificial intelligence applications and innovations doi: . / - - - - _ sha: doc_id: cord_uid: smart cities emerge as highly sophisticated bionetworks, providing smart services and ground-breaking solutions. this paper relates classification to smart city projects, particularly focusing on traffic prediction. a systematic literature review identifies the main topics and methods used, emphasizing various smart city components, such as data harvesting and data mining. it addresses the research question of whether we can forecast traffic load based on past data as well as meteorological conditions. results have shown that various models can be developed based on weather data with varying levels of success. the deployment of modern and smart cities increasingly gains attention, as large urban centers over time present numerous challenges for citizens. traffic is a stressful and time-consuming factor affecting citizens. lately, many local authorities attempt to design and create smart infrastructures and tools in order to collect data and utilize models for better decision making and citizen support.
such data are often derived from sensors collecting information about points of interest (pois) in real time. data mining techniques and algorithms can then help to obtain useful insights into the problem, whilst forming appropriate strategies to counter it. this work focuses on analyzing different approaches regarding data manipulation in order to predict day-ahead traffic loads at random places around cities, based on weather conditions. prediction efforts regard classification tasks aiming at highlighting factors that affect traffic prediction. this study utilizes weather data collected from sensor devices located in athens and thessaloniki, greece. three different day zones are introduced and compared, while subsets of trimesters are also tested. the remainder of the paper is structured as follows: sect. provides background information. section presents our approach for traffic prediction, including data selection as well as experimental results. section discusses and evaluates our findings, while sect. concludes the paper with suggestions for future work. traffic prediction is not a new subject; there are numerous scientific efforts that perform both classification and regression tasks in this domain [ ]. however, few efforts attempted to predict traffic volumes based on low-level data, such as weather data. nejad et al. examined the power of decision trees for classifying traffic loads on three levels, based only on time and temperature [ ]. results were positive and motivated our work. xu et al. compared cart with k-nn and a direct kalman filter [ ]. moreover, wang et al. proposed the use of volume and occupancy data [ ]. such data, as well as speed, can be obtained by loop detectors. loop detectors are sensors buried underneath highways which estimate traffic by observing information related to vehicles passing above them. loop detectors along with rain predictions were also used in order to predict crashes [ ]. tree-based algorithms are widely used and justified; however, even more sophisticated algorithms were used, such as support vector machines (svms) [ ] and neural networks (nns) [ ]. theodorou et al. [ ] proposed a novel hybrid method for short-term traffic prediction under both typical and atypical traffic conditions: an svm-based model identifies atypical conditions, and arima or k-nn regression is then used for typical or atypical conditions, respectively. in addition to [ ], liu et al. introduced three binary variables, other than weather conditions, for holiday, special conditions and quality of roads [ ]. however, in this work we chose not to include such features for several reasons. firstly, holidays do not have the same effect in each season. for example, sunny holidays might result in lower levels of traffic in large urban centers, whilst for winter holidays this is not the case. special conditions, such as major social events or protests, are not easy to add to models since most of the time they occur unexpectedly. abnormal conditions have a negative impact on such models, for instance when attempting to estimate the traffic flow based on expressway operating vehicle data [ ]. another significant decision on such predictions relates to the actual day of the week. as highlighted in [ , ], weekdays tend to have different characteristics from weekends and should be carefully tested. finally, most scenarios regard short-term prediction, meaning minutes to h ahead. dunne et al.
utilized neurowavelet models along with the rain expectation in the next hour [ ] . weather data can be exploited in different smart cities problems and scenarios. collecting accurate weather data can be beneficial for numerous daily problems which associate weather with human decisions. traffic is affected by numerous factors, only some of which are predictable. as mentioned in sect. , traffic is directly affected by weather conditions, however the scale differs across cities and cultures. this section tests the above intuition on traffic prediction. traffic is a multi-dimensional problem; researchers focus on either predicting traffic loads on a certain time interval or selecting the optimal route based on real-time adaptations in order to minimize travel time. accidents and social events can shortly disrupt normal traffic in specific areas, while seasonality and weather conditions can affect traffic in a larger scale. based on that, we used weather data collected from sensors installed around carefully chosen specific city spots for predicting the day-ahead traffic volume. to select the most appropriate locations to install sensors that either measure traffic loads or collect weather data, it is crucial to define their objective in advance. our efforts focus on the question 'how can one exploit sensor data that are not personalized and create meaningful conclusions for the general public?' deployment of smart city infrastructure requires a deep understanding of the traffic problem. roads that are busier than others do not always provide more information in comparison to more isolated ones. the number of alternative roads or the location of busy buildings can affect the necessity of measuring traffic for a specific road. our approach, besides examining traffic predictability based on weather data, also aims to clarifying differences among locations. moreover, we implement a series of tests regarding different monthly periods under the objective to understand which months contribute positively on the deployment of such models. the approach followed is explained in sect. . . for this task, our data source was the newly deployed ppcity.io, which is a set of platforms providing useful information derived from sensors to citizens and visitors of athens and thessaloniki, greece. these two large urban centers host almost half the greek population. the selected platform collects and analyses environmental, traffic and geospatial data. those data are collected through several sensors located at central points around the two cities. in general, there are numerous city spots on both cities where the environmental and traffic conditions are measured. initially, we chose locations about which we had information regarding both traffic and weather conditions. we focused on the ten most reliable spots that produce data, i.e. sensors with the most compact flow of data and wider range of recording. weather related attributes used for the research analysis involved the following: humidity, pressure, temperature, wind direction, wind speed and ultraviolet radiation (uv). the platform provides additional weather data, but these were not included in the modeling process. indicatively, such variables include measurements for ozone concentration, nitrogen dioxide etc. the target variable (traffic load), named jam factor, represents the quality of travel. it ranges from to , where describes stopped traffic flow and a completely empty road. 
data have been collected for almost six months, a rather short period of time, but at least including august and december, two months when people tend to take holidays. august is widely considered in greece as the month with the lowest traffic volume, since it is a period when most people go on holidays, away from large urban centers. similarly, in december both cities face abnormalities in traffic loads, since many citizens move from and to the two large urban centers. the dataset comprises data collected between / / and / / . eight sensors are located in athens and two in thessaloniki. in addition, two sensors are located near school areas. more information about the sensors is available in sect. . the way data are processed is conceptually simple. regarding the data structure, we defined three distinct time intervals within a day and examined their predictability. the first interval, named morning, includes signals from sensors between : and : . the second interval, named afternoon, includes signals between : and : , and the last one, named evening, includes signals between : and : . therefore, the average value for the given signals was computed. for example, if a sensor captures information every h (e.g. at : , : , : etc.), we computed and assigned the average value for each weather metric and the traffic load for that specific day period. the above strategy resulted in creating three different datasets consisting of instances on average. fortunately, the quality of the dataset was high, thus missing values were minimal. it is worth noting that our averaging strategy does not replace missing values: if there is a missing value in a weather metric, averaging is performed using the remaining two available values, instead of averaging and labeling three values. an important decision was to transform the problem into a classification task by categorizing the target variable into two and later into three classes. initially, the split point was the median value for each dataset, aiming at a fully balanced task which would allow for safer conclusions. therefore, the two classes, named high and low, indicate whether the traffic load on that specific day was higher than usual. in the second stage, the target variable was split into three categories of equal size (high, medium and low). the metric used for classification evaluation was accuracy, since classes are balanced and of equal interest. finally, we split the dataset into three subsets consisting of data regarding different trimesters. the first one contains data for the period between -july and -october, the second for the period between -september and -december, and the third one between -october and -february. we aimed at distinguishing whether any periods (seasonality) affect the models negatively. different cities demonstrated different traffic patterns; however, in almost every case we observed a few common patterns. first, on weekends citizens tended to use vehicles much less than on weekdays, while on holidays people tended to leave the urban centers. to further understand the case of greece, fig. visualizes the mean value for all available sensors in order to explain phenomena beyond seasonality. the colors on all figures share the same palette and indicate the same subsets. more precisely, for fig. , the blue line indicates the afternoon values, while red and green indicate morning and evening, respectively.
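a minimal sketch of the aggregation and labeling steps described above, assuming the raw signals are available as a table with a timestamp, the weather metrics and the jam factor; the column names and the hour boundaries of the three day zones are illustrative, not the actual ppcity.io schema.

```python
import pandas as pd

# hypothetical raw sensor readings for one location
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-08-01 07:30", "2019-08-01 09:00", "2019-08-01 15:00",
        "2019-08-01 18:30", "2019-08-01 21:00",
    ]),
    "temperature": [27.0, 29.5, 33.0, 31.0, 28.5],
    "humidity": [55, 50, 35, 40, 48],
    "jam_factor": [3.2, 4.0, 5.5, 6.1, 4.4],
})

def day_period(hour: int) -> str:
    # assumed interval boundaries for the three day zones
    if 7 <= hour < 11:
        return "morning"
    if 13 <= hour < 19:
        return "afternoon"
    return "evening"

raw["date"] = raw["timestamp"].dt.date
raw["period"] = raw["timestamp"].dt.hour.map(day_period)

# one instance per (date, period): mean of each weather metric and of the jam factor
daily = raw.groupby(["date", "period"]).mean(numeric_only=True).reset_index()

# binary target: is the jam factor above the dataset median?
daily["label"] = (daily["jam_factor"] > daily["jam_factor"].median()).map({True: "high", False: "low"})
print(daily)
```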
for figs. , , , , and , blue indicates the subset of july- february, red september- december, green july- october and purple october- february. as expected, a rapid introductory look at the data highlights that the volume of traffic was consistently higher during the afternoon period, while both morning and evening periods seem to present similar amounts of traffic. regarding the predictability of each of these combinations of sensors and day periods, for all the experiments on this task we used the random forest classifier. it is an ensemble learning method that operates by constructing a multitude of decision trees. it is considered a powerful method and stands as a top-notch solution for various problems [ ]. on top of that, it handles both scaled and unscaled data. in order to prevent overfitting, we used -fold cross-validation, and for optimal hyper-parameter tuning we used an extended grid search. it is important to clarify that all experiments were also conducted using the logistic regression algorithm; however, those results are not included since they were almost identical. this is to justify that our algorithm does not over-fit, since our datasets are small. further, figs. , , show classification results for each of the described time intervals. the y-axis indicates accuracy, with a start-point at . to highlight the margin over a baseline model on a perfectly balanced classification task. since the purpose of this work was to analyze and highlight differences between time and day periods on traffic prediction, we did not focus on exact values. in most cases, feeding machine learning algorithms with more data is considered a crucial step to achieve generalization and expand the margin for higher accuracy. however, in all cases above, results with the initial dataset clearly did not outperform the ones with subsets. figures , and show results obtained by dividing the target variables into three classes of equal instances. figures , and illustrate the same intuition as in the case of two classes (i.e. higher accuracy for the -jul/ -oct period), but also highlight significant changes in the individual performance of each sensor. in sect. we evaluate and discuss results, whilst explaining abnormalities and unexpected behavior. the evaluation of this work examines all the essential steps of a data mining project. firstly, data acquisition, which is a critical component of every project, revealed the importance of sufficient data. the data collection process should be constant and clearly defined in advance. we dealt with a well-structured database that was recording sensor signals in nearly perfect synchronization between traffic and weather data. regarding the pre-processing step, we did not use all the available features; instead, the models contained only weather metrics tested and introduced by the literature. for weather data exploitation, it is crucial to fully understand the correlation between variables, because the same level of information might be repeatedly captured. the rest of this section further discusses our findings per case. the general problem in terms of real-time adjustment of routes in order to achieve traffic "decongestion" is still most important for many cities. however, the day-ahead prediction of the volume was based on the major assumption that there would not be any accident or abnormal conditions in general.
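a minimal sketch of the modeling step described above, assuming the per-period instances built earlier are available as a feature matrix of weather metrics and a binary high/low label; the hyper-parameter grid is illustrative, not the one used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(42)

# hypothetical stand-in for one sensor/day-period dataset:
# columns = humidity, pressure, temperature, wind direction, wind speed, uv
X = rng.normal(size=(180, 6))
y = rng.integers(0, 2, size=180)  # 0 = low traffic, 1 = high traffic (median split)

# extended grid search with 10-fold cross-validation
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))

# accuracy of the tuned model, again estimated with 10-fold cross-validation
scores = cross_val_score(search.best_estimator_, X, y, cv=10, scoring="accuracy")
print("mean accuracy:", round(scores.mean(), 3))
```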
starting from this point, the factor of environmental conditions was essential since many employees, students and tourists decide in advance the way they travel around the city on the upcoming day. the results of the approach presented in sect. . are encouraging and justify what was stated above. all three-day periods on the binary classification achieved quite satisfying accuracy levels. more precisely, the initial model, resulted in accuracy higher than . for most sensors on morning and afternoon. on the other hand, for the -classes experiments, results do not allow to reach safe conclusions. however, the initial model demonstrated robust performance achieving on average accuracy higher than . . for the morning period we observed predictability similar with the binary case. for afternoon and evening periods there were not significant gaps in accuracy even though the -jul/ -oct (light green line) period shows better performance than the rest. in addition, we observed that the -class experiments do not show similar patterns with the previous results. surprisingly, the middle period -sep/ -dec (red line) was expected to achieve much better results especially for the sensors and which are installed close to school areas. indeed, the range of this period was selected to cover and adjust for the periods that schools operate, undisrupted. results were contradicting, while sensor outperforms the rest on the morning period for the binary case, sensor resulted in the lowest accuracy. figure highlights the special case of august in greece and the fact that the traffic load is highly reduced. in addition, as expected, on weekends traffic was heavily reduced while we observed that peaks on afternoon and evening periods for weekdays, happen mostly on thursdays and fridays. finally, it is worth noting that sensors and are in thessaloniki. for both, results were lower than the average for athens. based on these results, we observed that the range of traffic values for those two is amongst the highest. experiments for both stages produced some clear and informative results. the main conclusion is that the systematic recording and use of weather data can support decision making. the list below summarizes the conclusions. • standalone weather metrics can assist in building reliable prediction models regarding traffic volumes. • roads are busier in the afternoon and for most of the sensors even evenings have higher traffic volumes in comparison to mornings. • mornings do not return steady results for all sensors. as discussed, the results are stable only for sensors located near schools. • for morning periods, the peaks of traffic happen in the beginning of the week. no earlier than the middle of september the load gets similar to that of evenings and gets clearly lower again by the start of christmas holidays. • the first data subset regarding the period -jul/ -oct outperforms both other subsets and the initial data set consisting of all available data. that indicates that winter months introduce uncertainty and volatility, thus related models underperform significantly. • transforming the target variable into three classes resulted in admittedly good results, firmly better than the baseline model. the above conclusions emerge from the detailed analysis of the models, weather metrics and traffic volumes. however, threats to their validity exist. we briefly summarize them in sect. . . 
the biggest threat to approaches such as this is the fact that these day-ahead models rely on weather data which are themselves predicted. thus, it is crucial to have accurate predictions of weather conditions. moreover, including weekends, which admittedly have different traffic loads, in the same models as weekdays may affect the validity in a negative way. another threat could be insufficient deseasonalizing; the factor of time could possibly be analyzed into more explanatory variables. not having data available for at least one year of recordings may lead to questionable conclusions about seasonal effects; however, this is not a rule. finally, the sensors regard roads with different volumes of traffic, and even though traffic usually fluctuates uniformly, this may lead to misleading results.
references:
short-term traffic prediction under both typical and atypical traffic conditions using a pattern transition model
applying data mining in prediction and classification of urban traffic
short-term traffic volume prediction using classification and regression trees
dynamic traffic prediction based on traffic flow mining
calibrating a real-time traffic crash-prediction model using archived weather and its traffic data
short-term traffic condition prediction of urban road network based on improved svm
short-term traffic flow prediction considering spatio-temporal correlation: a hybrid model combining type- fuzzy c-means and artificial neural network
prediction of road traffic congestion based on random forest
traffic flow prediction based on expressway operating vehicle data
traffic prediction using multivariate nonparametric regression
data mining for smart cities: predicting electricity consumption by classification
weather adaptive traffic prediction using neurowavelet models
t c: improving a decision tree classification algorithm's interval splits on continuous attributes
acknowledgements. the work is implemented within the co-funded project public participation city (ppcity -t e k- and mis ) by action aid "research-create-innovate" implemented by the general secretariat of research and technology, ministry of development and investments. the project (www.ppcity.eu) provides a set of tools and platforms to collect city environmental data and use these in an intelligent way in order to support informed urban planning. a key element in this process is the opinions and views of citizens, which are collected in a crowd-sourcing manner. the research depicted in this paper is based on platform (run from https://panel.ppcity.eu/platform /) which provides an open data city portal for environmental data in athens and thessaloniki, greece.
key: cord- -lhjn f authors: zehnder, philipp; wiener, patrick; straub, tim; riemer, dominik title: streampipes connect: semantics-based edge adapters for the iiot date: - - journal: the semantic web doi: . / - - - - _ sha: doc_id: cord_uid: lhjn f accessing continuous time series data from various machines and sensors is a crucial task to enable data-driven decision making in the industrial internet of things (iiot). however, connecting data from industrial machines to real-time analytics software is still technically complex and time-consuming due to the heterogeneity of protocols, formats and sensor types. to mitigate these challenges, we present streampipes connect, targeted at domain experts to ingest, harmonize, and share time series data as part of our industry-proven open source iiot analytics toolbox streampipes.
our main contributions are (i) a semantic adapter model including automated transformation rules for pre-processing, and (ii) a distributed architecture design to instantiate adapters at edge nodes where the data originates. the evaluation of a conducted user study shows that domain experts are capable of connecting new sources in less than a minute by using our system. the presented solution is publicly available as part of the open source software apache streampipes. in order to exploit the full potential of data-driven decision making in the industrial internet of things (iiot), a massive amount of high quality data is needed. this data must be integrated, harmonized, and properly described, which requires technical as well a domain knowledge. since these abilities are often spread over several people, we try to enable domain experts with little technical understanding to access data sources themselves. to achieve this, some challenges have to be overcome, such as the early pre-processing (e.g. reducing) of the potentially high frequency iiot data close to the sensor at the edge, or to cope with high technological complexity of heterogeneous data sources. the goal of this paper is to simplify the process of connecting new sources, harmonize data, as well as to utilize semantic meta-information about its meaning, by providing a system with a graphical user interface (gui). our solution, streampipes connect, is made publicly available as part of the open-source, self-service data analytics platform apache streampipes (incubating) . streampipes [ ] provides a complete toolbox to easily analyze and exploit a variety of iot-related data without programming skills. therefore, it leverages different technologies especially from the fields of big data, distributed computing and semantic web (e.g. rdf, json-ld). streampipes is widely adopted in the industry and is an incubating project at the apache software foundation. figure shows a motivating scenario of a production process in a company with several plants. it further illustrates the potentially geo-distributed heterogeneous data sources that are available in such a company. however, the challenge is how to enable domain experts to connect and harmonize these distributed heterogeneous industrial streaming data sources. first we show how our approach differs from existing related work in sect. . to cope with the distributed setting we leverage a master worker paradigm with a distributed architecture (sect. ). adapters are deployed on edge devices located within a close proximity to sources, to early filter and transform events. we use a semantics based model to describe adapters and to employ transformation rules on events (sect. ). the model covers standard formats and protocols as well as the possibility to connect proprietary data sources. in sect. , the implementation of our approach and the gui is explained in detail. we present results of a conducted user study to evaluate the usability of our system, in addition to the performance tests carried out in sect. . lastly sect. concludes our work and presents an outline of planned future work. data flow tools with a gui are commonly used to process and harmonize data from various sources. applications like talend , or streamsets can be used for extract, transform, load (etl) tasks, wich is a well elaborated field where the goal is to gather data from many heterogeneous sources and store it in a database. 
using such tools still requires a lot of technical knowledge, especially because they are not leveraging semantic technologies to describe the meaning of data. another tool in this field is node-red , which describes itself as a low-code solution for event-driven applications. node-red is designed to run on a single host. however, our approach targets distributed iiot data sources like machines or sensors. therefore, data can be processed directly on edge devices, potentially reducing network traffic. there are also approaches leveraging semantic technologies for the task of data integration and harmonization. winte.r [ ] supports standard data formats, like csv, xml, or rdf, further it supports several strategies to merge different data sets with a schema detection and unit harmonization. in contrast to our approach, it focuses on data sets instead of iiot data sources. the goal of spitfire [ ] is to provide a semantic web of things. it focuses on rest-like sensor interfaces, not on the challenge of integrating sensors using industrial protocols and high-frequency streaming data, that require local preprocessing. the big iot api [ ] enables interoperability between iot platforms. thus the paper has a different focus, we focus on domain experts to connect data, especially from iiot data sources. distributed architectures like presented in [ ] are required to cope with the distributed nature of iiot data sources. in the paper, a lightweight solution to ease the adoption of distributed analytics applications is presented. all raw events are stored in a distributed storage and are later used for analytics. the authors try to adopt a very lightweight approach and do not describe the semantics of events or transform them. in our approach, data is transformed and harmonized directly in the adapter at the edge. this eases the analytics tasks downstream usually performed by a (distributed) stream processing engine, such as kafka streams. such engines provide solutions to connect to data sources with kafka connect . it is possible to create connectors that publish data directly to kafka. they provide a toolbox of already integrated technologies, such as several databases or message brokers. still, a lot of configuration and programming work is required to use them. other industry solutions to cope with the problem of accessing machine data are to build custom adapters, e.g. with apache plc x . this requires a lot of development effort and often is targeted at a specific use case. we leverage such tools to enable an easy to configure and re-usable approach. another way to access machine data is to use a unified description, like the asset administration shell (aas) [ ] . it is introduced by the platform industry . and provides a unified wrapper around assets describing its representation and technical functionality. there are also some realizations of the concept, as described in [ ] . in our approach we try to automatically create an adapter by extracting sample data and meta-data from the data source. thus, this allows us to work with data sources that do not have a specific description like the aas. the main design decisions for our architecture are based on the goal of creating a system for both small, centralized as well as large, distributed environments. therefore, we decided to implement a master/worker paradigm, where the master is responsible for the management and controlling of the system and the workers actually access and process data. 
to achieve this, we need a lightweight approach to run and distribute services. container technologies offer a well suited solution and are particularly suitable for edge and fog processing scenarios [ ] . figure provides an overview of our architecture showing the data sources and the compute units located close to the sources, running the services of the system. the streampipes backend communicates with the master, which manages all the worker containers, as well as the adapter instances running in the workers. for the communication between the individual components we use json-ld. the master persists the information about the workers and running adatpers in a triple store. once a new worker is started, it is registered at the master, providing information which adapter types (e.g. plc, mqtt) are supported. when an adapter instance is instantiated to connect a new machine, the data is directly forwarded to a distributed message broker, as shown in fig. . new worker instances can be added during runtime to extend the system and the master schedules new adapters accordingly. the master coordinates and manages the system. for the transmission of the harmonized data we rely on already existing broker technologies, e.g. apache kafka. the adapter model is the core of our approach and provides a way to describe time series data sources. based on this model, adapters are instantiated, to connect and harmonize data according to pre-processing rules applied to each incoming event. such adapter descriptions are provided in rdf serialized as json-ld. figure shows our semantic adapter model. the adapter concept is at the core of the model. each adapter has a streamgrounding describing the protocol and format used to publish the harmonized data. additionally to sending unified data to a message broker, adapters are capable of applying transformation rules. datasets and datastreams are both supported by the model. for a better overview of the figure, we present a compact version of the model with the notation {stream, set}, meaning there is one class for streams and one for sets. from a modeling and conceptual point of view, there is no difference in our approach between the two types. we treat data sets as bounded data streams, which is why we generally refer to data streams from here onwards. further, there are two types of data stream adapters, genericdatastrea-madapters and specificdatastreamadapters. a genericdatastreamadapter consists of a combination of a datastreamprotocol (e.g. mqtt), responsible for connecting to data sources and formats (e.g. json) that are required to convert the connected data into the internal representation of events. since not all data sources comply with those standards (e.g. plc's, ros, opc-ua), we added the concept of a specificdatastreamadapter. this can also be used to provide custom solutions and implementations of proprietary data sources. user configurations for an adapter can be provided via staticproperties. they are available for formats, protocols, and adapters. there are several types of static properties, that allow to automatically validate user inputs (e.g. strings, urls, numeric values). configurations of adapters (e.g., protocol information or required api keys) can be stored in adapter templates, encapsulating the required information. listing . shows an instance of a genericdatastreamadapter, with mqtt as the protocol and json as a format. oftentimes, it is not sufficient to only connect data, it must also be transformed, reduced, or anonymized. 
therefore, we introduce transformation rules, visualized in fig. , to change either the value of properties, the schema, or the stream itself. our approach uses transformation rules to describe the actual transformation of events. based on these rules, pre-processing pipelines are automatically configured in the background, which run within an adapter instance. the following table presents an overview of an extensible set of transformation rules, which are already integrated. events of connected data sources are transformed directly on the edge according to the introduced transformation rules, by applying transformation functions event by event. a function takes an event e and configurations c as input and returns a transformed event e'. the model is extensible and new features can be added by a developer. an instance of an adapter contains a set of functions which are concatenated into a pre-processing pipeline. equation ( ) shows how an event is transformed by multiple functions: e' = f_n(... f_2(f_1(e, c_1), c_2) ..., c_n). each function represents a transformation rule from our model. to ensure that the transformations are performed correctly, the rules must be applied in a fixed order: first the schema, then the value, and last the stream transformations. the unit transformation function, for example, takes the property name, the original unit and the new unit as configuration input; within the function, the value of the property is transformed according to the conversion factors in the qudt ontology. figure shows a complete pre-processing pipeline of our running example. on the left, the raw input event e is handed to the first function f_1, which changes the schema. the result of each function is handed to the next function, in addition to the configurations. in the end, the final event e' is sent to the defined streamgrounding of the adapter.
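a minimal sketch of such a pre-processing pipeline, assuming events are plain dictionaries; the concrete rule set, function names and the unit conversion factor are illustrative and not taken from the streampipes code base.

```python
from typing import Callable

Event = dict
# a transformation function takes an event and a configuration and returns a new event
Transformation = Callable[[Event, dict], Event]

def rename_property(event: Event, cfg: dict) -> Event:
    # schema transformation: rename a property (e.g. "temp" -> "temperature")
    out = dict(event)
    out[cfg["new_name"]] = out.pop(cfg["old_name"])
    return out

def convert_unit(event: Event, cfg: dict) -> Event:
    # value transformation: multiply by a conversion factor (e.g. bar -> pascal)
    out = dict(event)
    out[cfg["property"]] = out[cfg["property"]] * cfg["factor"]
    return out

def build_pipeline(steps: list[tuple[Transformation, dict]]) -> Callable[[Event], Event]:
    """concatenate transformation functions: e' = f_n(...f_1(e, c_1)..., c_n)."""
    def pipeline(event: Event) -> Event:
        for func, cfg in steps:
            event = func(event, cfg)
        return event
    return pipeline

# schema rules first, then value rules, mirroring the fixed rule order described above
adapter_pipeline = build_pipeline([
    (rename_property, {"old_name": "temp", "new_name": "temperature"}),
    (convert_unit, {"property": "pressure", "factor": 100_000}),  # assumed bar -> pascal
])

raw_event = {"temp": 48.3, "pressure": 1.2}
print(adapter_pipeline(raw_event))  # {'pressure': 120000.0, 'temperature': 48.3}
```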
in our example a user must select json. . refine event schema: so far the technical configurations to connect data sources were described, now the content of the events must be specified. figure in shows the event schema. users can add, or delete properties, as well as change the schema via a drag-and-drop user interface. further shown in additional information can be added to individual properties, like a description, the domain property, or the unit. based on the user interaction the transformation rules are derived in the background. . start adapter: in the last step a name for the adapter must be provided. additionally a description or an icon can be added. in this step it is also possible to define a maximum frequency for the resulting data stream, or to filter duplicate events. again, rules are derived from the user interaction. users just interact with the gui and the system creates the rules, resulting in an intuitive way of interacting with the system without worrying about the explicit modeling of the rules. we try to use the semantic model and meta-data as much as possible to help and guide the user through the system. all user inputs are validated in the gui according to the information provided in the adapter model (e.g. ensure correct data types or uris). additionally, the system uses information of the data sources, when available, during the configuration steps ./ ., described in the previous section. unfortunately, the usefulness of those interfaces highly depends on the selected adapter/protocol, since not all provide the same amount of high quality information. for example, some message brokers provide a list of available topics. other technologies, like plcs often have no interface like that and the users have to enter such information manually. but still, this user input is checked and when an error occurs during the connection to the source a notification with the problem is provided to the user. furthermore, the schema of the events is guessed by reading sample data from the source. once the endpoint of the source is connected, we establish a connection to gather some sample data. based on this data a guess of the schema is provided and suggested to the user in the gui. the realization for the individual implementations of this schema guess is again very different. for csv files for example it depends if they have a header line or not. for message brokers sending json a connection has to be established to gather several events to get the structure of the json objects. other adapters like the one for opc-ua can leverage the rich model stored in the opc server to already extract as much meta-information as possible. all of this information is harmonized into our semantic adapter model, where we also integrate external vocabularies, and presented in the gui to the user. users are able to refine or change the model. also on the property level we try to leverage the semantics of our model to easily integrate different representations of timestamps, by providing a simple way to harmonize them to the internal representation of unix timestamps. another example are unit transformations. based on the qudt ontology only reasonable transformations are suggested to the user. in our evaluation we show that domain experts with little technical knowledge are able to connect new sources. additionally, we present performance results of adapters and where the system is already deployed in production. 
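to make the pre-processing pipelines described in this section more concrete, the following minimal r sketch mimics the idea of an ordered list of transformation functions applied event by event (schema rules first, then value rules, then stream-level rules); the function and field names are invented for illustration, and the unit conversion uses a hard-coded factor, whereas the actual system resolves conversion factors through the qudt ontology.

    # each transformation takes an event (a named list) plus a configuration
    # and returns a transformed event; a pipeline is an ordered list of such steps.
    rename.property <- function(event, cfg) {
      names(event)[names(event) == cfg$from] <- cfg$to   # schema transformation
      event
    }
    convert.unit <- function(event, cfg) {
      # illustrative fixed factor (degC -> degF); the real system looks this up in qudt
      event[[cfg$property]] <- event[[cfg$property]] * 9/5 + 32
      event
    }
    apply.pipeline <- function(event, pipeline) {
      for (step in pipeline) event <- step$fun(event, step$cfg)
      event
    }

    pipeline <- list(
      list(fun = rename.property, cfg = list(from = "temp", to = "temperature")),  # schema rule
      list(fun = convert.unit,    cfg = list(property = "temperature"))            # value rule
    )

    raw.event <- list(timestamp = 1590000000, temp = 30.1)
    apply.pipeline(raw.event, pipeline)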
setup: for our user study, we recruited students from a voluntary student pool of the karlsruhe institute of technology (kit) using hroot [ ] . the user study took place at the karlsruhe decision & design lab (kd lab) at the kit. the overall task was to connect two data sources with measurements of environment sensors as a basis, to create a live air quality index, similar to the one in [ ] . since most of the participants did not have a technical background and never worked with sensor data before, we gave a min introduction about the domain, the data sources, what it contains (e.g. particulate matter pm . /pm , nitrogen dioxide no , . . . ), and how an air quality index might look like. after that, the participants went into an isolated cabin to solve the tasks on their own, without any further assistance by the instructors. as a first task, they had to connect data from the opensensemap api [ ] , an online service for environmental data. the goal of the second task was to connect environmental data from official institutions, therefore data provided by the 'baden-wuerttemberg state institute for the environment, survey and nature conservation' was used. this data is produced by officially standardized air measuring stations distributed all over the state. after finishing the two tasks, the participants were forwarded to an online questionnaire, where they had to answer several questions to assess how usable the system was in their experience. for the questions, we used three standardized questionnaires as well as additional questions. to ensure that the participants answer the questions carefully, we added control questions to the questionnaire. three participants answered those control questions wrong, resulting in a total of valid answers. results: first, we present the results of the system usability scale (sus) [ ] , which measures how usable a software system is by comparing results to the average scores of websites . a score above . is considered as good result. we use the same colors to indicate how well the score is compared to the other systems. on the left of fig. , the overall result of . can be seen. since we have a high variance of technical expertise within our participants we grouped the results according to the technical experience. first we grouped them into two groups, whether they stated to be able to connect sensors with any programming language of their choice or not. participants not able to develop a connector for sensors with a programming language find the system more useful (good system, mean: . ) than participants who are able to connect a sensor with a programming language of their own choice (acceptable system, mean: . ). second, we grouped them according to their technological affinity from high to low. for that, we adopted the items of the technology readiness index (tri) [ ] in order to frame the questions on the expertise in using programming ide's and data tools. we can use this as a control to measure how affine participants are in using technologies (e.g. ide's). participants with a high technology affinity (quantile > . ) find the system not as useful as less technology affine participants, but still acceptable (mean: . ). participants with an average technology affinity find the system the most useful (good system: mean: , ). participants with a low technology affinity (quantile < . ) find the system good as well, however a bit less useful as the average class (mean: , ) . 
this is in line with the assumption, that such a tool is especially useful for non-technical users. the sus gives the tool a rating of a good system. the participants used the system for the first time and only for a duration of to min. in this respect, this is already a very good score and it is likely to assume that the score would be higher when more experienced users would have participated. for the second questionnaire, the user experience questionnaire (ueq) was chosen [ ] . it consists of six categories: attractiveness, perspicuity, efficiency, dependability, stimulation, and novelty. for each of these categories, a likert scale is provided to indicate how good the system is compared to other systems evaluated with the ueq. figure shows the results of the ueq on the right. all the results of the individual categories are above average. the results of the categories attractiveness, perspicuity, efficiency, and dependability are considered as good. the result of the novelty of the system is even rated as excellent. the figure also reveals that the results of all categories are equally good meaning we do not have to focus on a single aspect. it also suggests that there is still room for further improvement, but for a first user study the results are already very promising. together with the results from the sus, this means that the system is not only usable (i.e. fulfils its purpose) but also gives a good experience when using it (i.e. fun experience). additionally, we added own questions to the questionnaire to get some information which is especially relevant for our work. to see how technical the students were, we asked them whether they are able to connect new sensors in a programming language of their choice or not. just of the participants answered with yes, while gave a negative answer. this indicates we had a good mix of technical experience of the participants, as our system focuses on less technical users with little to no programming experience. we asked the participants, if they think, once they are more familiar with the system, they are able to connect new data sources in under one minute. answered with yes and with no. this shows that our approach is simple to use and efficient, as even the less technical participants state they can connect new data sources in under one minute, which is usually a technical and time-consuming task. regarding the question whether they think they are capable of connecting new real-time sensor data with our system, all of the participants answered with yes. this means all participants are capable of creating new adapters with the system. we also monitored the interaction of the users with the system to find out how long they approximately needed to complete the individual tasks. the result was that users took between to min for each task. overall, the results of the user study show that streampipes connect is already rated as a good system, which can be used by domain experts to quickly connect new data sources. setup: for the evaluation we connected the events of the joint states of a robot arm via ros. the frequency of the data stream is hz and the event size is bytes. this data was connected and processed with the ros adapter without any delays. to discover the limits of our system we created an adapter with a configurable data generator. therefore, we used the temperature event and transformed it with the same rules as in our example in fig. . 
for the test setup we used a server running the streampipes backend and two different commonly used edge devices for the worker instance. we used a raspberry pi and an intel nuc. to test the maximum performance of an adapter within a worker we produced events as fast as the worker could process them. for each device we ran different set-ups, all with a different lengths of the pipeline shown in fig. . figure shows the results of the performance test. each test ran times and the mean of sent events per second is plotted in the chart. for the nuc we produced . . events per test and for the raspberry pi . . events. the results of the figure show that if no pre-processing pipeline is used the events are transmitted the fastest and the longer the pre-processing pipeline is, the less events are processed. the only exception is the delete function, which removes a property of the event and thus increases the performance. the nuc performs significantly better then the raspberry pi, but for many real-world use cases a pi is still sufficient, since it also processes . events per second (with no pre-processing function). the add timestamp and transform unit functions have an higher impact on the performance than the other tested functions. apache streampipes (incubating) was developed as an open source project over the last couple of years by the authors of this paper at the fzi research center for information technology. since november , we transitioned the tool to the apache software foundation as a new incubating project. we successfully deployed streampipes in multiple projects in the manufacturing domain. one example is condition monitoring in a large industrial automation company. we connected several robots (universal robots) and plcs to monitor a production process and calculate business-critical kpis, improving the transparency on the current health status of a production line. in this paper, we presented streampipes connect, a self-service system for ingestion and harmonization of iiot time series data, developed as part of the open source iot toolbox apache streampipes (incubating). we presented a distributed, event-based data ingestion architecture where services can be directly deployed on edge devices in form of worker nodes. workers send real-time data from a variety of supported industrial communication protocols (e.g., plcs, mqtt, opc-ua) to a centralized message broker for further analysis. our approach makes use of an underlying semantics-based adapter model, which serves to describe data sources and to instantiate adapters. generated adapters connect to the configured data sources and pre-process data directly at the edge by applying pipelines consisting of user-defined transformation rules. in addition, we further presented a graphical user interface which leverages semantic information to better guide domain experts in connecting new sources, thus reducing development effort. to achieve the goal of providing a generic adapter model that covers the great heterogeneity of data sources and data types, the flexibility of semantic technologies was particularly helpful. especially the reuse of vocabularies (e.g. qudt) facilitates the implementation significantly. the user study has shown us that modeling must be easy and intuitive for the end user. for the future, we plan to further support users during the modeling process by recommending additional configuration parameters based on sample data of the source (e.g. to automatically suggest message formats). 
references:
structure of the administration shell
mqtt version . . , oasis standard
hroot: hamburg registration and organization online tool
sus - a quick and dirty usability scale
the big iot api - semantically enabling iot interoperability
evaluation of docker as edge computing platform
an architecture for efficient integration and harmonization of heterogeneous, distributed data sources enabling big data analytics
winte.r - a web data integration framework
technology readiness index (tri): a multiple-item scale to measure readiness to embrace new technologies
opensensemap - a citizen science platform for publishing and exploring sensor data as open data
spitfire: toward a semantic web of things
ros: an open-source robot operating system
streampipes: solving the challenge with semantic stream processing pipelines
applying the user experience questionnaire (ueq) in different evaluation scenarios
integrated data model and structure for the asset administration shell in industrie .

key: cord- - oqfn rg
authors: kotouza, maria th.; tsarouchis, sotirios-filippos; kyprianidis, alexandros-charalampos; chrysopoulos, antonios c.; mitkas, pericles a.
title: towards fashion recommendation: an ai system for clothing data retrieval and analysis
date: - - journal: artificial intelligence applications and innovations doi: . / - - - - _ sha: doc_id: cord_uid: oqfn rg
nowadays, the fashion industry is moving towards fast fashion, offering a large selection of garment products in a quicker and cheaper manner. to this end, fashion designers are required to come up with a wide and diverse range of fashion products in a short time frame. at the same time, fashion retailers are oriented towards using technology in order to design and provide products tailored to their consumers' needs, in sync with the newest fashion trends. in this paper, we propose an artificial intelligence system which operates as a personal assistant to a fashion product designer. the system's architecture and all its components are presented, with emphasis on the data collection and data clustering subsystems. in our use case scenario, datasets of garment products are retrieved from two different sources and are transformed into a specific format by making use of natural language processing. the two datasets are clustered separately using different mixed-type clustering algorithms and comparative results are provided, highlighting the usefulness of the clustering procedure in the clothing product recommendation problem. the fashion clothing industry is moving towards fast fashion, forcing the retail markets to design products at a quicker pace, while following the fashion trends and their consumers' needs. thus, artificial intelligence (ai) techniques are introduced to a company's entire supply chain, in order to help the development of innovative methods, solve the problem of balancing supply and demand, increase the customer service quality, aid the designers, and improve overall efficiency [ ] . recently, an increasing number of projects in the fashion industry make use of ai techniques, including projects run by google and amazon. the use of ai techniques was not possible before the adoption of e-commerce sites and information and communications technology (ict) systems by the traditional fashion industry, due to data deficiency. nowadays, the overflowing amount of data deriving from the daily use of e-commerce sites and the data collected by fashion companies enables solutions related to the fashion design process using ai techniques.
popular fashion houses have provided remarkable ai-driven solutions, such as the hugo boss ai capsule collection , in which a new collection is developed entirely by an ai system, as well as the reimagine retail from the collaboration of tommy hilfiger, ibm and fashion institute of technology, which aims to identify future industry trends and to improve the design process. this work focuses on the creative part of the fashion industry, the fashion designing process. to this end, an intelligent and semi-autonomous decision support system for fashion designers is proposed. this system can act as a personal assistant, by retrieving, organizing and combining data from many sources, and, finally, suggesting clothing products taking into account the designer's preferences. the system combines natural language processing (nlp) techniques to analyze the information accompanying the clothing images, computer vision algorithms to extract characteristics from the images and enrich their meta-data, and machine learning techniques to analyze the raw data and to train models that can facilitate the decision-making process. several research works have been presented in the field of clothing data analysis, most of them involving clothing classification and feature extraction based on images, dataset creation, as well as product recommendation. in the work of [ ] , the deepfashion dataset was created consisting of , images characterized by many features and labels. in the work of [ ] , a sequence of steps is outlined in order to learn the features of a clothing image, which includes the following: a) image description retrieval, b) feature learning for the top and bottom part of the human body, c) feature extraction using deep learning, d) usage of pose estimation techniques, and e) hierarchical feature representation learning using deep learning. other related efforts [ ] [ ] [ ] present how to train models using image processing and machine learning techniques for feature extraction. however, little work has been done in analyzing the meta-data accompanying clothing images. in this work, apart from proposing an ai system which involves many subsystems as part of the clothing design process that can be combined together in order to help the designers with the decision-making process, we emphasize on the data collection, meta-data analysis and clustering techniques that can be applied to improve recommendations. in this section, we present the proposed decision support system for the designers' creative processes. the system is developed in such a way to be able to model the designer's preferences automatically and be user-friendly at the same, in order to be easily handled by individuals without knowledge of the action planning research field. the system is composed of two interconnected components: . offline component: this component performs (a) data collection from internal and external sources, (b) data storage and management to databases, and (c) data analysis processes that produce the artificial models which provide personalized recommendations to the end-users. . online component: this component comprises mainly the user interface (ui). the users, who are usually fashion designers with limited technical experience, are able to easily set their parameters via the graphical ui, visualize their results and provide feedback on the system results. the overall system architecture is depicted in fig. , whereas the major subsystems/ processes are further analyzed in the following subsections. 
there are two different sources used for training, as well as for the recommendation process: the internal and external data sources. internal data. each company has its own production line, rules and designing styles that are influenced by the fashion trends. the creativity team usually use an inspiration or starting point based on clothes coming from the company's previous collections and adapt them to the new fashion trends. the internal data are usually organized in relational databases and can be reached by the data collection subsystem. external data. the most common designers' source for new ideas is browsing on the collections of other popular online stores. to this end, the system includes a web crawler, the e-shops crawler, which is able to retrieve clothing information, i.e. clothing images accompanied by their meta-data. the online shops that are supported so far are asos, shtterstock, zalando and s.oliver. another important inspiration source for the designers are social media platforms, especially pinterest and instagram. to this end, a second web crawler, the social-media crawler, was implemented, which is able to utilize existing apis and retrieve information from the aforementioned platforms, including clothing images, titles of the post, descriptions and associated tags. both crawlers' infrastructure is extendable, so that they can be easily used for other online shops or social media platforms in the future. this subsystem is responsible for extracting the clothing attributes from the meta-data accompanying every clothing image. some of the attributes that are extracted from the available meta-data, accompanied by some valid examples, are presented below: for each attribute there is a dictionary, created by experienced fashion designers, that contains all the possible accepted values, including synonyms and abbreviations. nlp techniques are used for word-based preprocessing of all meta-data text. the attributes are extracted using a mapping process between the meta-data and the original attributes. the mapping is achieved by finding the occurrences of the words contained in the dictionaries to the meta-data. in the case of successful matching, the corresponding word is marked as a label to the respective attribute. the data annotation process complements the data collection and data preprocessing modules. it is used to enrich the extracted data with common clothing features that can be derived from images using computer vision techniques. examples of clothing attributes that can be extracted from images include color, fabric and neck design. it is widely known that color has the biggest impact on clothing, as it is related to location, occasion, season, and many other factors. taking into consideration its importance, an intelligent computer vision component was implemented. this component has the capability to distinguish and extract the five most dominant colors of each clothing image. more specifically, the color of a clothing image is represented by the values of the rgb channels and its percentage, the color ranking specified by the percentage value and the most relevant general color label to the respective rgb value. the rest of the clothing attributes are extracted using deep learning techniques. each attribute is represented by a single value from a set of predefined labels. after the data collection and annotation processes, all the data are available in a common format (row data) that can be analyzed using well-known state-of-the-art techniques. 
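to make the dictionary-based attribute extraction described above more concrete, a small hedged sketch in r is given below; the dictionaries, attribute names and the matching strategy (simple substring matching on the lower-cased meta-data) are invented for illustration and are much simpler than the nlp pipeline used in the system.

    # toy dictionaries: in the real system these are curated by fashion designers
    dictionaries <- list(
      product.category = c("dress", "blouse", "leggings", "bermuda", "set"),
      sleeve           = c("short sleeve", "long sleeve", "sleeveless"),
      fit              = c("regular", "slim", "loose")
    )

    extract.attributes <- function(description, dictionaries) {
      text <- tolower(description)
      sapply(dictionaries, function(labels) {
        hit <- labels[sapply(labels, function(l) grepl(l, text, fixed = TRUE))]
        if (length(hit) > 0) hit[1] else NA   # first match wins; NA if nothing is found
      })
    }

    extract.attributes("Floral summer dress with short sleeve and regular fit", dictionaries)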
a common technique to organize data into groups of similar products is clustering. clustering can speed up the recommendation process, by making the look-up subprocess quicker when it comes to significant amount of data. a practical example is a case where a user makes a search at the online phase: the system can limit the data used for product recommendation to those that are included in the clusters characterized by labels related to the user's search. several clustering algorithms can be used depending on the type of the data. clothing data can be characterized by both numerical (i.e. product price) and categorical features (i.e. product category) in general. a detailed review of the algorithms used for mixedtype data clustering can be found in [ ] . the algorithms can be divided in three major categories: a) partition-based algorithms, which build clusters and update centers based on partition, b) hierarchical clustering algorithms, which create a hierarchical structure that combines (agglomerative algorithms) or divides (division algorithms) the data elements into clusters, based on the elements' similarities, and c) model-based algorithms, which can either use neural network methods or statistical learning methods, choose a detailed model for every cluster and discover the most appropriate model. the algorithms that we use in this paper are as follows: . kmodes : a partition-based algorithm, which aims to partition the objects into k groups such that the distance from objects to the assigned cluster modes is minimized. the distance, i.e. the dissimilarity between two objects, is determined by counting the number of mismatches in all attributes. the number of clusters is set by the user. . pam : a partition-based clustering algorithm, which creates partitions of the data into k clusters around medoids. the similarities between the objects are obtained using the gower's dissimilarity coefficient [ ] . the goal is to find k medoids, i.e. representative objects, which minimize the sum of the dissimilarities of the objects to their closest representative object. the number of clusters is set by the user. . hac : a hierarchical agglomerative clustering algorithm, which is based on the pairwise object similarity matrix calculated using the gower's dissimilarity coefficient. at the beginning of the process, each individual object forms its own cluster. then, the clusters are merged iteratively until all the elements belong to one cluster. the clustering results are visualized as a dendrogram. the number of clusters is set by the user. . fbhc : a frequency-based hierarchical clustering algorithm [ ] , which utilizes the frequency of each label that occurs in each product feature to form the clusters. instead of performing pairwise comparisons between all the elements of the dataset to determine objects' similarities, this algorithm builds a low dimensionality frequency matrix for the root cluster, which is split recursively as one goes down the hierarchy, overcoming limitations regarding memory usage and computational time. the number of clusters can be set by the user or by a branch breaking algorithm. this algorithm would iteratively compare the parent clusters with their children nodes, using evaluation metrics and user-selected thresholds. . varsel : a model-based algorithm, which performs the variable selection and the maximum likelihood estimation of the latent class model. the variable selection is performed using the bayesian information criterion. 
the number of clusters is determined by the model. the clothing recommender is the most important component of our system, since it combines all the aforementioned analysis results to create models that make personalized predictions and product recommendations. the internal and external data, the user's preferences, and the company's rules are all taken into consideration. moving on to the online component, the ui enables the designer to search for products using keywords. the extracted results can then be evaluated by the designer and the preferred products can be saved on their dashboard over time and for each product search. if the user is not satisfied by the recommendations, they have the ability either to renew their preferences or ask for new recommendations. the offline and the online components are interconnected by a subsystem that is responsible for implementing the models feedback process. the user can approve or disapprove the proposed products based on their preferences, and this information is transmitted as input to a state-of-the-art deep reinforcement learning algorithm, which assesses the end user's choices and re-trains the personalized user model. this is an additional learning mechanism evolving the original models over time, making the new search results more relevant and personalized. a real-life scenario is provided as a use case, in order to highlight the usefulness of the clustering procedure in the clothing product recommendation. our team is collaborating with a fashion designer working for the energiers greek retail company, who is interested in designing the company's collection for the new season. she uses the garments designed and produced by the company in the previous season as a source of inspiration, combined with the assos e-shop current collections. in this direction, the company dataset was created by extracting the fashion products from the previous season from the company database, and the relevant e-shop dataset was retrieved using a web crawler. a total of images were collected by the eshop crawler for the season winter , by making queries involving different labels of the attributes product category, length, sleeve, collar and fit. the meta-data of the retrieved images and a pointer to the image location were stored in a relational database. the meta-data were tokenized and split into columns, by assigning values in the desired attributes, after preprocessing plain text using nlp techniques. in this section, the experimental results on the company and e-shop datasets using the kmodes, pam, hac, fbhc and varsel algorithms are presented. the results are evaluated using four internal evaluation metrics: a. entropy, which quantifies the expected value of the information contained in the clusters. b. silhouette, which validates the consistency within the clusters. c. within sum of square error (wss), which is the total distance of data points from their respective cluster centroids and validates the consistency between the objects of each cluster. d. identity [ ] , which is expressed as the percentage of data contained in the cluster with an exact alignment regarding the feature's labels. lower values of entropy and wss, and higher values of silhouette and identity indicate better clustering results. the clustering results differ according to the applied clustering algorithms. table shows the normalized mutual information [ ] of the algorithms that were tested, in a pairwise fashion. the values show some variance, with most of them being around %. 
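as a hedged illustration of how such mixed-type clusterings can be produced in r, the sketch below builds a gower dissimilarity matrix with the cluster package and runs pam, hierarchical agglomerative clustering and k-modes (klaR package) on a toy attribute table; the data, the number of clusters and the package choices are illustrative and do not reproduce the paper's exact setup.

    library(cluster)   # daisy(), pam(), silhouette()
    library(klaR)      # kmodes()

    # toy mixed-type product table (categorical attributes plus a numeric price)
    products <- data.frame(
      category = factor(c("dress", "dress", "blouse", "leggings", "set", "blouse")),
      length   = factor(c("knee", "medium", "short", "long", "short", "short")),
      fit      = factor(c("regular", "slim", "regular", "regular", "loose", "slim")),
      price    = c(25, 40, 18, 15, 30, 22)
    )

    k <- 2                                          # illustrative choice
    gower.d <- daisy(products, metric = "gower")    # pairwise gower dissimilarities

    pam.res <- pam(gower.d, k = k, diss = TRUE)     # partitioning around medoids
    hac.res <- cutree(hclust(gower.d), k = k)       # hierarchical agglomerative clustering
    km.res  <- kmodes(products[, 1:3], modes = k)   # k-modes on the categorical columns

    summary(silhouette(pam.res$clustering, gower.d))   # internal validation

the comparison of the resulting partitions is discussed next.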
it is worth mentioning that the pam and fbhc algorithms share information that reaches . %, which is something that can enhance their reliability. on the other hand, the least amount of information is shared between the clusters formed by varsel and kmodes, fbhc. the main reason seems to be that varsel algorithm has automatically identified only clusters, whereas the rest of them have formed clusters. the number of clusters (k) for kmodes, pam, hac and fbhc was given as input parameter to the algorithms, after experimenting with varying values of k ( to clusters) and calculating the wss and silhouette metrics. a graphical representation of the information shared across the clusters created by the different algorithms can be seen on the sankey diagram depicted in fig. . the figure makes clear that the pam algorithm uniformly distributes the data objects across the clusters, whereas kmodes clustering results follow a normal distribution. the distributions of those two portioning algorithms seem to be close. the varsel algorithm normally distributes the objects in a similar fashion, but in this case only clusters are created. on the other hand, the hierarchical algorithms create two large size clusters, where the majority of the objects are assigned to, and four significantly smaller clusters. table reports the comparison results of the clustering algorithms based on the values that they achieved at the evaluation metrics. the average values of the evaluation metrics are presented. the best results achieved by an algorithm are highlighted as boldface, whereas the second highest results are presented in italics. the table makes clear that there is not a unique best algorithm that achieves the best results in all the evaluation metrics, so the algorithm's selection depends on the application needs. the hierarchical algorithms achieved better results at the entropy and identity metrics, which means that the number of labels characterizing each feature in a cluster is small, whereas the partition-based algorithms outperform at the metrics that concern the distances between the objects of each cluster. once again it is proved that the pam algorithm uniformly distributes the data across clusters, and this is the reason why we select this algorithm for the rest of the analysis in this paper. a -dimensional representation of the distribution of the data into the six groups obtained by pam can be seen in fig. . the centroids of the company dataset extracted by the pam algorithm are depicted in table and table accordingly. the centroids are determined as the most frequent attribute values of the row data for each cluster. a more detailed representation of the groups' consistency for the attributes product category and gender can be obtained using a heatmap (fig. ) . by analyzing the consistency of each group and the distribution of the labels across the groups in the two datasets, one can observe that the company dataset is characterized by six major categories, i.e. set, bermuda, blouse for men and women, dress, and leggings. on the other hand, the e-shop dataset is characterized by dress, shirt, trousers, set, romper, and cardigan. as for the rest of attributes, most of the products are characterized by short length in the company dataset, whereas in the e-shop dataset the medium and knee length are more frequent. 
the tables make clear that the collar attribute has many missing values, so a good practice will be to recognize this attribute at the data annotation subsystem, using computer vision techniques. as for the fit, the regular fit value is the most common in both datasets. therefore, when the fashion designer is interested in designing a red dress, she can set the parameters for the product category and the color through the ui of the system and press the search button. the system will then refer to the company database and filter only the products that are included in the group created by the offline clustering procedure. the same procedure will be followed to filter only the products that belong to group in the e-shop's database. the two groups are then combined and the system can select only those products with the label "red" at the color attribute. next, this subset can be filtered even more according to the designer's additional preferences and the fashion trends to extract personalized recommendations. finally, the designer can interact with the system to evaluate (grade) each recommended product, create her dashboard or even ask for new recommendations results if she is not satisfied at all. in this work, an intelligent system that automates the typical procedures followed by a fashion designer is described. the system can retrieve data from online sources and the designer's company database, transform plain text accompanying images into clothing features using dictionary mapping and nlp techniques, extract new features from the images using computer vision, and store all the information into a common format in a relational database. the processed data can then be handled by state-of-the-art machine learning techniques including clustering, prediction models, and recommender systems. the paper focuses on presenting the system's architecture, emphasizing on the data collection and transformation processes, as well as the clustering procedures that can be used to organize the row data into groups. a real-life use case scenario was also presented, showing the usefulness of the clustering procedure in the product recommendation problem. future work involves the augmentation of the data annotation process, enabling the extraction of new relevant attributes from non-annotated images. additionally, the extended use of the products prices and the products' sales history can enrich the model creation process significantly, leading to more reasonable and personalized suggestions for the designers. additional steps can be taken in the direction of the improvement of the userfriendliness and the capabilities of the ui, which will be utilized by the designers to enter their preferences, search products, save the system's products recommendation and create dashboards. finally, an extended set of experiments using new datasets and methods are needed, whereas testing and evaluation of the recommended products are going to be done by fashion designers in more real-life use case scenarios. 
references:
applications of artificial intelligence in the apparel industry: a review
deepfashion: powering robust clothes recognition and retrieval with rich annotations
retrieving real world clothing images via multi-weight deep convolutional neural networks
learning and recognition of clothing genres from full-body images
getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos
runway to realway: visual analysis of fashion
clustering algorithms for mixed datasets: a review
a general coefficient of similarity and some of its properties
a dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures
cluster ensembles - a knowledge reuse framework for combining multiple partitions

acknowledgements. this research has been co-financed by the european regional development fund of the european union and greek national funds through the operational program competitiveness, entrepreneurship and innovation, under the call research - create - innovate (project code: t edk- ).

key: cord- -rvg ayp
authors: ponce, marcelo; sandhel, amit
title: covid .analytics: an r package to obtain, analyze and visualize data from the corona virus disease pandemic
date: - - journal: nan doi: nan sha: doc_id: cord_uid: rvg ayp
with the emergence of a new pandemic worldwide, a novel strategy to approach it has emerged. several initiatives under the umbrella of "open science" are contributing to tackle this unprecedented situation. in particular, the "r language and environment for statistical computing" offers an excellent tool and ecosystem for approaches focusing on open science and reproducible results. hence it is not surprising that with the onset of the pandemic, a large number of r packages and resources were made available for researchers working on the pandemic. in this paper, we present an r package that allows users to access and analyze worldwide data from resources publicly available. we will introduce the covid .analytics package, focusing on its capabilities and presenting a particular case study where we describe how to deploy the "covid .analytics dashboard explorer". in , a novel type of corona virus was first reported, originally in the province of hubei, china. in a time frame of months this new virus was capable of producing a global pandemic of the corona virus disease (covid ), which can end up in a severe acute respiratory syndrome (sars-cov- ). the origin of the virus is still unclear [ , , ] , although some studies based on genetic evidence suggest that it is quite unlikely that this virus was made in a laboratory and instead point towards cross-species transmission [ , ] . although this is not the first time in human history that humanity faces a pandemic, this pandemic has unique characteristics. to start with, the virus is "peculiar", as not all infected individuals experience the same symptoms. some individuals display symptoms similar to those of a common cold or flu, while other individuals experience serious symptoms that can cause death or hospitalization with different levels of severity, including staying in intensive-care units (icu) for several weeks or even months. a recent medical survey shows that the disease can transcend pulmonary manifestations, affecting several other organs [ ] .
studies also suggest that the level of severity of the disease can be linked to previous conditions [ ] , gender [ ] , or even blood type [ ] but the fundamental and underlying reasons still remain unclear. some infected individuals are completely asymptomatic, which makes them ideal vectors for disseminating the virus. this also makes very difficult to precisely determine the transmission rate of the disease, and it is argued that in part due to the peculiar characteristics of the virus, that some initial estimates were underdetermining the actual value [ ] . elderly are the most vulnerable to the disease and reported mortality rates vary from to % depending on the geographical location. in addition to this, the high connectivity of our modern societies, make possible for a virus like this to widely spread around the world in a relatively short period of time. what is also unprecedented is the pace at which the scientific community has engaged in fighting this pandemic in different fronts [ ] . technology and scientific knowledge are and will continue playing a fundamental role in how humanity is facing this pandemic and helping to reduce the risk of individuals to be exposed or suffer serious illness. techniques such as dna/rna sequencing, computer simulations, models generations and predictions, are nowadays widely accessible and can help in a great manner to evaluate and design the best course of action in a situation like this [ ] . public health organizations are relying on mathematical and data-driven models (e.g. [ ] ), to draw policies and protocols in order to try to mitigate the impact on societies by not suffocating their health institutions and resources [ ] . specifically, mathematical models of the evolution of the virus spread, have been used to establish strategies, like social distancing, quarantines, self-isolation and staying at home, to reduce the chances of transmission among individuals. usually, vaccination is also another approach that emerges as a possible contention strategy, however this is still not a viable possibility in the case of covid , as there is not vaccine developed yet [ , ] . simulations of the spread of virus have also shown that among the most efficient ways to reduce the spread of the virus are [ ] : increasing social distancing, which refers to staying apart from individuals so that the virus can not so easily disperse among individuals; improving hygiene routines, such as proper hand washing, use of hand sanitizer, etc. which would eventually reduce the chances of the virus to remain effective; quarantine or self-isolation, again to reduce unnecessary exposure to other potentially infected individuals. of course these recommendations based on simulations and models can be as accurate and useful as the simulations are, which ultimately depend on the value of the parameters used to set up the initial conditions of the models. moreover these parameters strongly depend on the actual data which can be also sensitive to many other factors, such as data collection or reporting protocols among others [ ] . hence counting with accurate, reliable and up-to-date data is critical when trying to understand the conditions for spreading the virus but also for predicting possible outcomes of the epidemic, as well as, designing proper containment measurements. similarly, being able to access and process the huge amount of genetic information associated with the virus has proben to shred light into the disease's path [ , ] . 
encompassing these unprecedented times, another interesting phenomenon has also occurred, in part related to a contemporaneous trend in how science can be done by emphasizing transparency, reproducibility and robustness: an open approach to the methods and the data; usually refer as open science. in particular, this approach has been part for quite sometime of the software developer community in the so-called open source projects or codes. this way of developing software, offers a lot of advantages in comparison to the more traditional and closed, proprietary approaches. for starting, it allows that any interested party can look at the actual implementation of the code, criticize, complement or even contribute to the project. it improves transparency, and at the same time, guarantees higher standards due to the public scrutiny; which at the end results in benefiting every one: the developers by increasing their reputation, reach and consolidating a widely validated product and the users by allowing direct access to the sources and details of the implementation. it also helps with reproducibility of results and bugs reports and fixes. several approaches and initiatives are taking the openness concepts and implementing in their platforms. specific examples of this have drown the internet, e.g. the surge of open source powered dashboards [ ] , open data repositories, etc. another example of this is for instance the number of scientific papers related to covid published since the beginning of the pandemic [ ] , the amount of data and tools developed to track the evolution of pandemic, etc. [ ] . as a matter of fact, scientists are now drowning in publications related to the covid [ , ] , and some collaborative and community initiatives are trying to use machine learning techniques to facilitate identify and digest the most relevant sources for a given topic [ , , ] . the "r language and environment for statistical computing" [ , ] is not exception here. moreover, promoting and based on the open source and open community principles, r has empowered scientists and researchers since its inception. not surprisingly then, the r community has contributed to the official cran [ ] repository already with more than a dozen of packages related to the covid pandemic since the beginning of the crisis. in particular, in this paper we will introduce and discuss the covid .analytics r package [ ] , which is mainly designed and focus in an open and modular approach to provide researchers quick access to the latest reported worldwide data of the covid cases, as well as, analytical and visualization tools to process this data. this paper is organized as follow: in sec. we describe the covid .analytics , in sec. we present some examples of data analysis and visualization, in sec. we describe in detail how to deploy a web dashboard employing the capabilities of the covid .analytics package providing full details on the implementation so that this procedure can be repeated and followed by interested users in developing their own dashboards. finally we summarize some conclusions in sec. . the covid .analytics r package [ ] allows users to obtain live worldwide data from the novel covid . it does this by accessing and retrieving the data publicly available and published by two main sources: the "covid- data repository by the center for systems science and engineering (csse) at johns hopkins university" [ ] for the worldwide and us data, and the city of toronto for the toronto data [ ] . 
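a minimal getting-started sketch, assuming the package and function names as published on cran (covid19.analytics, covid19.data); the accepted argument values are detailed in the following section.

    # install from cran (only needed once) and load the package
    install.packages("covid19.analytics")
    library(covid19.analytics)

    # retrieve the latest worldwide time series of confirmed cases
    ts.confirmed <- covid19.data("ts-confirmed")
    head(ts.confirmed[, 1:6])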
the package also provides basic analysis and visualization tools and functions to investigate these datasets and other ones structured in a similar fashion. the covid .analytics package is an open source tool, which its main implementation and api is the r package [ ] . in addition to this, the package has a few more adds-on: • a central github repository, https://github.com/mponce /covid .analytics where the latest development version and source code of the package are available. users can also submit tickets for bugs, suggestions or comments using the "issues" tab. • a rendered version with live examples and documentation also hosted at github pages, https: //mponce .github.io/covid .analytics/; • a dashboard for interactive usage of the package with extended capabilities for users without any coding expertise, https://covid analytics.scinet.utoronto.ca. we will discuss the details of the implementation in sec. . • a "backup" data repository hosted at github, https://github.com/mponce /covid analytics. datasets -where replicas of the live datasets are stored for redundancy and robust accesibility sake (see fig. ). one of the main objectives of the covid .analytics package is to make the latest data from the reported cases of the current covid pandemic promptly available to researchers and the scientific community in what follows we describe the main functionalities from the package regarding data accessibility. the covid .data function allows users to obtain realtime data about the covid reported cases from the jhu's ccse repository, in the following modalities: • aggregated data for the latest day, with a great 'granularity' of geographical regions (ie. cities, provinces, states, countries) • time series data for larger accumulated geographical regions (provinces/countries) • deprecated : we also include the original data style in which these datasets were reported initially. the datasets also include information about the different categories (status) "confirmed"/"deaths"/"recovered" of the cases reported daily per country/region/city. this data-acquisition function, will first attempt to retrieve the data directly from the jhu repository with the latest updates. if for what ever reason this fails (eg. problems with the connection) the package will load a preserved "image" of the data which is not the latest one but it will still allow the user to explore this older dataset. in this way, the package offers a more robust and resilient approach to the quite dynamical situation with respect to data availability and integrity. in addition to the data of the reported cases of covid , the covid .analytics package also provides access to genomics data of the virus. the data is obtained from the national center for biotechnology information (ncbi) databases [ , ] . table shows the functions available in the covid .analytics package for accessing the reported cases of the covid pandemic. the functions can be divided in different categories, depending on what data they provide access to. for instance, they are distinguished between agreggated and time series data sets. they are also grouped by specific geographical locations, i.e. worldwide, united states of america (us) and the city of toronto (ontario, canada) data. the time series data is structured in an specific manner with a given set of fields or columns, which resembles the following format: "province.state" | "country.region" | "lat" | "long" | ... sequence of dates ... 
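to make this layout concrete, the following hedged sketch builds a small synthetic data.frame following the time series structure above; the locations, coordinates, counts and the exact date-column naming are made up for illustration.

    # synthetic example of an external dataset structured like the package's time
    # series data: location columns, coordinates, then one column per date
    my.ts <- data.frame(
      Province.State = c("", ""),
      Country.Region = c("CountryA", "CountryB"),   # fictitious locations
      Lat  = c(10.0, -5.0),
      Long = c(20.0, 30.0)
    )
    dates  <- format(seq(as.Date("2020-03-01"), by = "day", length.out = 5), "%m/%d/%y")
    counts <- matrix(c(0, 1, 3, 7, 12,
                       2, 2, 5, 9, 15),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(NULL, dates))
    my.ts <- cbind(my.ts, counts)

    str(my.ts)   # a data.frame like this can be passed to the time series functions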
one of the modular features this package offers is that if an user has data structured in a data.frame organized as described above, then most of the functions provided by the covid .analytics package for analyzing time series data will just work with the user's defined data. in this way it is possible to add new data sets to the ones that can be loaded using the repositories predefined in this package and extend the analysis capabilities to these new datasets. sec. . presents an example of how external or synthetic data has to be structured so that can use the function from the covid .analytics package. it is also recommended to check the compatibility of these datasets using the data integrity and consistency checks functions described in the following section. due to the ongoing and rapid changing situation with the covid- pandemic, sometimes the reported data has been detected to change its internal format or even show some anomalies or inconsistencies . for instance, in some cumulative quantities reported in time series datasets, it has been observed that these quantities instead of continuously increase sometimes they decrease their values which is something that should not happen . we refer to this as an inconsistency of "type ii". some negative values have been reported as well in the data, which also is not possible or valid; we call this inconsistency of "type i". when this occurs, it happens at the level of the origin of the dataset, in our case, the one obtained from the jhu/ccesgis repository [ ] . in order to make the user aware of this, we implemented two consistency and integrity checking functions: • consistency.check: this function attempts to determine whether there are consistency issues within the data, such as, negative reported value (inconsistency of "type i") or anomalies in the cumulative quantities of the data (inconsistency of "type ii") • integrity.check: this determines whether there are integrity issues within the datasets or changes to the structure of the data alternatively we provide a data.checks function that will execute the previous described functions on an specified dataset. data integrity. it is highly unlikely that the user would face a situation where the internal structure of the data or its actual integrity may be compromised. however if there are any suspicious about this, it is possible to use the integrity.check function in order to verify this. if anything like this is detected we urge users to contact us about it, e.g. https://github.com/mponce /covid .analytics/issues. data consistency. data consistency issues and/or anomalies in the data have been reported several times these are claimed, in most of the cases, to be missreported data and usually are just an insignificant number of the total cases. having said that, we believe that the user should be aware of these situations and we recommend using the consistency.check function to verify the dataset you will be working with. nullifying spurious data. in order to deal with the different scenarios arising from incomplete, inconsistent or missreported data, we provide the nullify.data function, which will remove any potential entry in the data that can be suspected of these incongruencies. in addition ot that, the function accepts an optional argument stringent=true, which will also prune any incomplete cases (e.g. with nas present). similarly to the rapid developments and updates in the reported cases of the disease, the sequencing of the virus is moving almost at equal pace. 
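before turning to the genomic data, a short hedged sketch of the checking functions just described (function names as given above; optional arguments follow the package documentation):

    library(covid19.analytics)            # assuming the cran package name
    ts.confirmed <- covid19.data("ts-confirmed")

    # run the integrity and consistency checks described above
    data.checks(ts.confirmed)
    consistency.check(ts.confirmed)

    # remove suspicious entries; stringent=TRUE also prunes incomplete cases (NAs)
    clean.ts <- nullify.data(ts.confirmed, stringent = TRUE)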
that's why the covid .analytics package provides access to good number of the genomics data currently available. the covid .genomic.data function allows users to obtain the covid 's genomics data from ncbi's databases [ ] . the type of genomics data accessible from the package is described in table . although the package attempts to provide the latest available genomic data, there are a few important details and differences with respect to the reported cases data. for starting, the amount of genomic information available is way larger than the data reporting the number of cases which adds some additional constraints when retrieving this data. in addition to that, the hosting servers for the genomic databases impose certain limits on the rate and amounts of downloads. in order to mitigate these factors, the covid .analytics package employs a couple of different strategies as summarized below: • most of the data will be attempted to be retrieved live from ncbi databases -same as using src='livedata'. • if that is not possible, the package keeps a local version of some of the largest datasets (i.e. genomes, nucleotides and proteins) which might not be up-to-date -same as using src='repo'. • the package will attempt to obtain the data from a mirror server with the datasets updated on a regular basis but not necessarily with the latest updates -same as using src='local'. these sequence of steps are implemented in the package using trycath() exceptions in combination with recursivity, i.e. the retrieving data function calling itself with different variations indicating which data source to use. as the covid .analytics package will try present the user with the latest data sets possible, different strategies (as described above) may be in place to achieve this. one way to improve the realiability of the access to and avialability of the data is to use a series of replicas of the datasets which are hosted in different locations. fig. summarizes the different data sources and point of access that the package employs in order to retrieve the data and keeps the latest datasets available. genomic data as mentioned before is accessed from ncbi databases. this is implemented in the covid .genomic.data function employing the ape [ ] and rentrez [ ] packages. in particular the proteins datasets, with more than k entries, is quite challenging to obtain "live". as a matter of fact, the covid .genomic.data function accepts an argument to specify whether this should be the case or not. if the src argument is set to 'livedata' then the function will attempt to download the proteins list directly from ncbi databases. if this fail, we recommend using the argument src='local' which will provide an stagered copy of this dataset at the moment in which the package was submitted to the cran repository, meaning that is quite likely this dataset won't be complete and most likely outdated. additionaly, we offer a second replica of the datasets, located at https://github.com/mponce /covid analytics.datasets where all datasets are updated periodically, this can be accessed using the argument src='repo'. in addition to the access and retrieval of the data, the covid .analytics package includes several functions to perform basic analysis and visualizations. table shows the list of the main functions in the package. 
In addition to the access and retrieval of the data, the covid19.analytics package includes several functions to perform basic analyses and visualizations. Table shows the list of the main functions in the package; its data-acquisition and genomics entries are:
• covid19.data -- obtain live* worldwide data for the COVID-19 virus from the JHU CCSE repository [ ]; returns dataframes/a list with the collected data.
• covid19.toronto.data -- obtain live* data for COVID-19 cases in the City of Toronto, ON, Canada, from the City of Toronto reports [ ]; returns a dataframe/list with the collected data.
• covid19.us.data -- obtain live* US-specific data for the COVID-19 virus from the JHU CCSE repository [ ]; returns a dataframe with the collected data.
• Genomics: covid19.genomic.data, c19.refgenome.data, c19.fasta.data, c19.ptree.data, c19.nps.data, c19.np_fasta.data -- obtain genomic data from the NCBI databases (see Table ).
Geographical locations in the reported data are mostly given by the province/city and/or country/region. In order to facilitate the processing of locations that are geo-politically close, the covid19.analytics package provides a way to identify regions by indicating the name of the continent where they are located: "south america", "north america", "central america", "america", "europe", "asia" and "oceania" can be used to process all the countries within each of these regions. The geographicalregions function is the one in charge of determining which countries are part of which continent, and it will display them when geographicalregions() is executed. In this way it is possible to specify a particular continent and have all the countries in this continent processed, without needing to specify each of them explicitly, as sketched below.
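As a minimal sketch of the continent-based selection (assuming, as the description above suggests, that the continent labels can be passed wherever a geographical location is accepted, e.g. through the geo.loc argument):

library(covid19.analytics)
# list which countries are assigned to each continent
geographicalregions()
# process every country in a continent at once, e.g. in a summary report
report.summary(geo.loc = "south america", cases.to.process = "ts")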
a sort of "global percentage" indicating the percentage of infected cases with respect to the rest of the world covid .analytics "internal" rsync/git -when a new release is push to cran "internal" scripts src="livedata" src="repo" src="local" https://github.com/mponce /covid analytics.datasets figure : schematic of the data acquision flows between the covid .analytics package and the different sources of data. dark and solid/dashed lines represent api functions provided by the package accesible to the users. dotted lines are "internal" mechanisms employed by the package to synchronize and update replicas of the data. data acquisition from ncbi servers is mostly done utilizing the ape [ ] and rentrez [ ] packages. for "confirmed" cases, when geographical locations are specified, a "relative percentage" is given as the ratio of the confirmed cases over the total of the selected locations for the other categories, "deaths"/"recovered"/"active", the percentage of a given category is computed as the ratio between the number of cases in the corresponding category divided by the "confirmed" number of cases, i.e. a relative percentage with respect to the number of confirmed infected cases in the given region • for "time series" data: it will show the delta (change or variation) in the last day, daily changes day before that (t − ), three days ago (t − ), a week ago (t − ), two weeks ago (t − ) and a month ago (t − ) when possible, it will also display the percentage of "recovered" and "deaths" with respect to the "confirmed" number of cases the column "globalperc" is computed as the ratio between the number of cases for a given country over the total of cases reported -the "global perc. average (sd: standard deviation)" is computed as the average (standard deviation) of the number of cases among all the records in the data -the "global perc. average (sd: standard deviation) in top x" is computed as the average (standard deviation) of the number of cases among the top x records a typical output of the summary.report for the "time series" data, is shown in the example in sec. . in addition to this, the function also generates some graphical outputs, including pie and bar charts representing the top regions in each category; see fig. . totals per location & growth rate. it is possible to dive deeper into a particular location by using the tots.per.location and growth.rate functions. these functions are capable of processing different types of data, as far as these are "time series" data. it can either focus in one category (eg. "ts-confirmed", "ts-recovered", "ts-deaths",) or all ("ts-all"). when these functions detect different types of categories, each category will be processed separately. similarly the functions can take multiple locations, ie. just one, several ones or even "all" the locations within the data. the locations can either be countries, regions, provinces or cities. if an specified location includes multiple entries, eg. a country that has several cities reported, the functions will group them and process all these regions as the location requested. totals per location. the tots.per.location function will plot the number of cases as a function of time for the given locations and type of categories, in two plots: a log-scale scatter one a linear scale bar plot one. when the function is run with multiple locations or all the locations, the figures will be adjusted to display multiple plots in one figure in a mosaic type layout. 
Additionally, the function will attempt to generate different fits to match the data:
• an exponential model using a linear regression method;
• a Poisson model using a generalized linear regression method;
• a gamma model using a generalized linear regression method.
The function will plot the data, add the values of the coefficients for the models to the plots, and display a summary of the results in the console. It is also possible to instruct the function to draw a "confidence band" based on a moving average, so that the trend is displayed together with a region of higher confidence based on the mean value and standard deviation computed over equally spaced intervals dividing the total time range. The function will return a list combining the results of the totals for the different locations as a function of time.
Growth rate. The growth.rate function allows users to compute the daily changes and the growth rate, defined as the ratio of the daily changes between two consecutive dates. The growth.rate function shares all the features of the tots.per.location function described above, i.e. it can process the different types of cases and multiple locations. The graphical output will display two plots per location:
• a scatter plot with the number of changes between consecutive dates as a function of time, in linear scale (left vertical axis) and log scale (right vertical axis) combined;
• a bar plot displaying the growth rate for the particular region as a function of time.
When the function is run with multiple locations or all the locations, the figures will be adjusted to display multiple plots in one figure in a mosaic-type layout. In addition, when there is more than one location the function will also generate two different styles of heatmaps comparing the changes per day and the growth rate among the different locations (vertical axis) and time (horizontal axis). Furthermore, if the interactivefig=TRUE argument is used, interactive heatmaps and 3D-surface representations will be generated too. Some of the arguments in this function, as well as in many of the other functions that generate both static and interactive visualizations, can be used to indicate the type of output to be generated; Table lists some of these arguments. In particular, the arguments controlling the interactive figures, interactivefig and interactive.display, can be used in combination to compose an interactive figure to be captured and used in another application: when interactive.display is turned off but interactivefig=TRUE, the function will return the interactive figure so that it can be captured and used for later purposes. This is the technique employed when capturing the resulting plots in the covid19.analytics dashboard explorer, as presented in Sec. . Finally, when not returning an interactive figure, i.e. when interactivefig is not specified or is set to FALSE (its default value), or when interactive.display=TRUE, the growth.rate function will return a list combining the results for the "changes per day" and the "growth rate" as a function of time.
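As a toy illustration of the quantities just described (this is not the package's internal code), the daily changes are the first differences of the cumulative counts and the growth rate is the ratio of consecutive daily changes:

# assumed toy series of cumulative counts
cumulative.cases <- c(100, 120, 150, 200, 280, 380)
daily.changes <- diff(cumulative.cases)                           # 20 30 50 80 100
growth.rate.vals <- daily.changes[-1] / head(daily.changes, -1)   # 1.50 1.67 1.60 1.25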
Trends in daily changes. The covid19.analytics package provides three different functions to visualize the trends in the daily changes of reported cases from time series data:
• single.trend allows inspecting one single location; this can be used with the worldwide data sliced by the corresponding location, the Toronto data, or the user's own data formatted as "time series" data;
• mtrends is very similar to the single.trend function, but accepts multiple or single locations, generating one plot per requested location; it can also process multiple cases for a given location;
• itrends generates an interactive plot of the trend in daily changes, representing the changes in the number of cases vs the total number of cases in log scale, using spline techniques to smooth the abrupt variations in the data.
The first two functions will generate "static" plots composed of different insets:
• the main plot represents the daily changes as a function of time;
• the inset figures at the top, from left to right: the total number of cases (in linear and semi-log scales), the changes in the number of cases vs the total number of cases, and the changes in the number of cases vs the total number of cases in log scale;
• the second row of insets represents the "growth rate" (as defined above) and the normalized growth rate, defined as the growth rate divided by the maximum growth rate reported for this location.
Plotting totals. The function totals.plt will generate plots of the total number of cases as a function of time. It can be used for the total data or for one or several specific locations. The function can generate static and/or interactive plots, as well as linear and/or semi-log plots.
Plotting cases in the world. The function live.map will display the cases at each corresponding location around the world on an interactive world map. It can be used with time series data or aggregated data; aggregated data offers much more detailed information about the geographical distribution.
The covid19.analytics package allows users to model the dispersion of the disease by implementing a simple susceptible-infected-recovered (SIR) model [ , ]. The model is implemented by a system of ordinary differential equations (ODE),

dS/dt = -β S I / N,
dI/dt = β S I / N - γ I,
dR/dt = γ I,

where S represents the number of individuals susceptible to be infected, I the number of infected individuals and R the number of recovered ones at a given moment in time. The coefficients β and γ are the parameters controlling the transition rates from S to I and from I to R respectively; N is the total number of individuals, i.e. N = S(t) + I(t) + R(t), which should remain constant. The system can also be written in terms of the normalized quantities S/N, I/N and R/N. Although the ODE SIR model is non-linear, analytical solutions have been found [ ]; however, the approach we follow in the package implementation is to solve the ODE system numerically. The function generate.sir.model implements the SIR model using the actual data from the reported cases. The function will try to identify the data points where the onset of the epidemic began and consider the subsequent data points to generate proper guesses for the two parameters describing the SIR ODE system, i.e. β and γ. It does this by minimizing the residual sum of squares (RSS), assuming one single explanatory variable, i.e. the sum of the squared differences between the number of infected cases I(t) and the quantity predicted by the model Ĩ(t),

RSS(β, γ) = Σ_t [ I(t) - Ĩ(t) ]².

The ODE system is solved numerically using the ode function from the deSolve package, and the minimization is tackled using the optim function from base R.
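As an illustrative, self-contained sketch of the numerical approach just described (this is not the package's internal code, and the parameter values are assumed purely for demonstration), the SIR system above can be integrated with deSolve as follows:

library(deSolve)
# right-hand side of the SIR system defined above
sir.rhs <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}
parms <- c(beta = 0.5, gamma = 0.1, N = 1e6)   # assumed parameter values
state <- c(S = 1e6 - 1, I = 1, R = 0)          # a single initial infection
times <- seq(0, 120, by = 1)                   # days
sol <- ode(y = state, times = times, func = sir.rhs, parms = parms)
head(sol)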
After the solution of the ODE system is found, the function will provide details about the solution, as well as plot the quantities S(t), I(t) and R(t) in a static and an interactive plot. The generate.sir.model function also estimates the value of the basic reproduction number (or basic reproduction ratio), R0 = β/γ, which can be considered a measure of the average expected number of new infections produced by a single infection in a population where all subjects are susceptible. The function also computes and plots, on demand, the force of infection, defined as F_infection = β I(t), which measures the transition rate from the compartment of susceptible individuals to the compartment of infectious ones. For exploring the parameter space of the SIR model, it is possible to produce a series of models by varying the conditions, i.e. the range of dates considered for optimizing the parameters of the SIR equations, which will effectively "sweep" a range for the parameters β, γ and R0. This is implemented in the function sweep.sir.models, which takes a range of dates to be used as starting points for the number of cases fed into generate.sir.model, producing as many models as date ranges indicated. One could even use this in combination with other resampling or Monte Carlo techniques to estimate the statistical variability of the parameters of the model.
In this section we will present some basic examples of how to use the main functions from the covid19.analytics package. We will begin by installing the covid19.analytics package. This can be achieved in two alternative ways:
1. installing the latest stable version of the package directly from the CRAN repository; this can be done within an R session using the install.packages function, i.e.
> install.packages("covid19.analytics")
2. installing the development version from the package's GitHub repository, https://github.com/mponce0/covid19.analytics, using the devtools package [ ] and its install_github function, i.e.
# begin by installing devtools if not installed in your system
> install.packages("devtools")
# install the covid19.analytics package from the GitHub repo
> devtools::install_github("mponce0/covid19.analytics")
After having installed the covid19.analytics package, its functions become available by loading the package using R's library function, i.e. library(covid19.analytics). The covid19.analytics package uses a few additional packages, which are installed automatically if they are not present in the system. In particular, readxl is used to access the data from the City of Toronto [ ], ape is used for pulling the genomics data from NCBI, plotly and htmlwidgets are used to render the interactive plots and save them in HTML documents, deSolve is used to solve the differential equations modelling the spread of the virus, and gplots and pheatmap are used to generate heatmaps. Lst. shows how to use the covid19.data function to obtain data in different cases.
# obtain all the records combined for "confirmed", "deaths" and "recovered" cases
# for the global (worldwide) *aggregated* data
covid.data.allcases <- covid19.data()
# obtain time series data for global "confirmed" cases
covid.confirmed.cases <- covid19.data("ts-confirmed")
# read all possible datasets, returning a list
covid.all.datasets <- covid19.data("all")
# read the latest aggregated data of the global cases
covid.all.agg.cases <- covid19.data("aggregated")
data ( " ts -deaths " ) # read " time series " data for the city of toronto toronto . ts . data <-covid . data ( " ts -toronto " ) # this can be also done using the covid . toronto . data () fn tor . ts . data <-covid . toronto . data () # or get the original data as reported by the city of toronto tor . df . data <-covid . toronto . data ( data . fmr = " orig " ) # retrieve us time series data of confirmed cases us . confirmed . cases <-covid . data ( " ts -confirmed -us " ) # retrieve us time series data of death cases us . deaths . cases <-covid . data ( " ts -deaths -us " ) # or both cases combined us . cases <-covid . us . data () listing : reading data from reported cases of covid using the covid .analytics package. in general, the reading functions will return data frames. exceptions to this, are when the functions need to return a more complex output, e.g. when combining "all" type of data or when requested to obtain the original data from the city of toronto (see details in table ). in these cases, the returning object will be a list containing in each element dataframes corresponding to the particular type of data. in either case, the structure and overall content can be quickly assessed by using r's str or summary functions. one useful information to look at after loading the datasets, would be to identify which locations/regions have reported cases. there are at least two main fields that can be used for that, the columns containing the keywords: 'country' or 'region' and 'province' or 'state'. lst. show examples of how to achieve this using partial matches for column names, e.g. "country" and "province". # read a data set data <-covid . data ( " ts -confirmed " ) # look at the structure and column names str ( data ) names ( data ) # find ' country ' column country . col <-pmatch ( " country " , names ( data ) ) # slice the countries countries <-data [ , country . col ] # list of countries print ( unique ( countries ) ) # sorted table of countries , may include multiple entries print ( sort ( table ( countries ) ) ) # find ' province ' column prov . col <-pmatch ( " province " , names ( data ) ) # slice the provinces provinces <-data [ , prov . col ] # list of provinces print ( unique ( provinces ) ) # sorted table of provinces , may include multiple entries print ( sort ( table ( provinces ) ) ) listing : identifying geographical locations in the data sets. an overall view of the current situation at a global or local level can be obtained using the report.summary function. lst. shows a few examples of how this function can be used. # a quick function to overview top cases per region for time series and aggregated records report . summary () # save the tables into a text file named ' covid -summaryreport _ currentdate . txt ' # where * currrentdate * is the actual date report . summary ( savereport = true ) # summary report for an specific location with default number of entries report . summary ( geo . loc = " canada " ) # summary report for an specific location with top report . summary ( nentries = , geo . loc = " canada " ) # it can combine several locations report . summary ( nentries = , geo . loc = c ( " canada " ," us " ," italy " ," uruguay " ," argentina " ) ) a typical output of the report generation tool is presented in lst. . typical output of the report.summary function. 
This particular example was generated using report.summary(nentries= , graphical.output=TRUE, savereport=TRUE), which indicates to consider just the top entries, to generate a graphical output as shown in Fig. , and to save a text file including the report, which is the one shown here.

Listing : typical output of the report.summary function. The report contains one table per dataset -- the "ts-confirmed", "ts-deaths" and "ts-recovered" time series, and the aggregated data ordered by confirmed, deaths, recovered and active cases -- each stamped with the date when the report was generated and the date of the recorded data, followed by an overall summary of the statistical estimators computed over the independent reported entries per case type.

A daily generated report is also available from the covid19.analytics documentation site, https://mponce0.github.io/covid19.analytics/. The covid19.analytics package allows users to investigate total cumulative quantities per geographical location with the tots.per.location function; examples of this are shown in Lst. .
# totals for confirmed cases for "Ontario"
tots.per.location(covid.confirmed.cases, geo.loc = "Ontario")
# totals for confirmed cases for "Canada"
tots.per.location(covid.confirmed.cases, geo.loc = "Canada")
# total nbr of confirmed cases in Hubei, including a confidence band based on a moving average
tots.per.location(covid.confirmed.cases, geo.loc = "Hubei", confbnd = TRUE)
# total nbr of deaths for "mainland China"
tots.per.location(covid.ts.deaths, geo.loc = "China")
###
# read the time series data for all the cases
all.data <- covid19.data('ts-all')
# run on all the cases
tots.per.location(all.data, "Japan")
###
# totals for death cases for "all" the regions
tots.per.location(covid.ts.deaths)
# or just
tots.per.location(covid19.data("ts-confirmed"))
Listing : calculation of totals per country/region/province.
In addition to the graphical output shown in Fig. , the function will provide details of the models fitted to the data. Similarly, utilizing the growth.rate function it is possible to compute the actual growth rate and daily changes for specific locations, as defined in Sec. ; Lst. includes examples of this.
# read time series data for confirmed cases
ts.data <- covid19.data("ts-confirmed")
# compute changes and growth rates per location for all the countries
growth.rate(ts.data)
# compute changes and growth rates per location for 'Italy'
growth.rate(ts.data, geo.loc = "Italy")
loc = " italy " ) # compute changes and growth rates per location for ' italy ' and ' germany ' growth . rate ( ts . data , geo . loc = c ( " italy " ," germany " ) ) # #### # combining multiple geographical locations : # obtain time series data tsconfirmed <-covid . data ( " ts -confirmed " ) # explore different combinations of regions / cities / countries # when combining different locations , heatmaps will also be generated comparing the trends among these locations growth . rate ( tsconfirmed , geo . loc = c ( " italy " ," canada " ," ontario " ," quebec " ," uruguay " ) ) growth . rate ( tsconfirmed , geo . loc = c ( " hubei " ," italy " ," spain " ," united ␣ states " ," canada " ," ontario " ," quebec " ," uruguay " ) ) growth . rate ( tsconfirmed , geo . loc = c ( " hubei " ," italy " ," spain " ," us " ," canada " ," ontario " , " quebec " ," uruguay " ) ) # turn off static plots and activate interactive figures growth . rate ( tsconfirmed , geo . loc = c ( " brazil " ," canada " ," ontario " ," us " ) , staticplt = # static and interactive figures growth . rate ( tsconfirmed , geo . loc = c ( " brazil " ," italy " ," india " ," us " ) , staticplt = true , interactivefig = true ) listing : calculation of growth rates and daily changes per country/region/province. in addition to the cumulative indicators described above, it is possible to estimate the global trends per location employing the functions single.trend, mtrends and itrends. the first two functions generate static plots of different quantities that can be used as indicators, while the third function generates an interactive representation of a normalized a-dimensional trend. the lst. shows examples of the use of these functions. fig. displays the graphical output produced by these functions. # single location trend , in this case using data from the city of toronto tor . data <-covid . toronto . data () single . trend ( tor . data [ tor . data $ status == " active ␣ cases " ,]) # or data from the province of ontario ts . data <-covid . data ( " ts -confirmed " ) ont . data <-ts . data [ ts . data $ province . state == " ontario " ,] single . trend ( ont . data ) # or from italy single . trend ( ts . data [ ts . data $ country . region == " italy " ,]) # multiple locations ts . data <-covid . data ( " ts -confirmed " ) mtrends ( ts . data , geo . loc = c ( " canada " ," ontario " ," uruguay " ," italy " ) ) # multiple cases single . trend ( tor . data ) # interactive plot of trends # for all locations and all type of cases itrends ( covid . data ( " ts -all " ) , geo . loc = " all " ) # or just for confirmed cases and some specific locations , saving the result in an html file named " itrends _ ex . html " itrends ( covid . data ( " ts -confirmed " ) , geo . loc = c ( " uruguay " ," argentina " ," ontario " ," us " ," italy " ," hubei " ) , filename = " itrends _ ex " ) listing : calculation of trends for different cases, utilizing the single.trend, mtrends and itrends functions. the typical representations can be seen in fig. . most of the analysis functions in the covid .analytics package have already plotting and visualization capabilities. in addition to the previously described ones, the package has also specialized visualization functions as shown in lst. . many of them will generate static and interactive figures, see table for details of the type of output. in particular the live.map function is an utility function which allows to plot the location of the recorded cases around the world. 
This function allows for several customizable features, such as the type of projection used in the map, the selection of different projection operators in a pull-down menu, displaying or hiding the legend of the regions, and rescaling factors for the sizes representing the number of cases, among others. The function will generate a live representation of the cases utilizing the plotly package and will ultimately open the map in a browser, where the user can explore the map, drag the representation, zoom in/out, turn legends on/off, etc.
# retrieve time series data
ts.data <- covid19.data("ts-all")
# static and interactive plot
totals.plt(ts.data)
# totals for Ontario and Canada, without displaying totals and one plot per page
totals.plt(ts.data, c("Canada", "Ontario"), with.totals = FALSE, one.plt.per.page = TRUE)
# totals for Ontario, Canada, Italy and Uruguay, including global totals, with the linear and semi-log plots arranged next to each other
totals.plt(ts.data, c("Canada", "Ontario", "Italy", "Uruguay"), with.totals = TRUE, one.plt.per.page = FALSE)
# totals for all the locations reported in the dataset; the interactive plot will be saved as "totals-all.html"
totals.plt(ts.data, "all", filename = "totals-all")
# retrieve aggregated data
data <- covid19.data("aggregated")
# interactive map of aggregated cases -- with more spatial resolution
live.map(data)
# or
live.map()
# interactive map of the time series data of the confirmed cases with less spatial resolution, i.e. aggregated by country
live.map(covid19.data("ts-confirmed"))
Listing : examples of some of the interactive and visualization capabilities of the plotting functions; the typical representations can be seen in Fig. .
Last but not least, one of the novel features added by the covid19.analytics package is the ability to model the spread of the virus by incorporating real data. As described in Sec. , the generate.sir.model function implements a simple SIR model employing the data reported from a specified dataset and a particular location; examples of this are shown in Lst. . The generate.sir.model function is complemented by the plt.sir.model function, which can be used to generate static or interactive figures, as shown in Fig. . The generate.sir.model function, as described in Sec. , will attempt to obtain proper values for the parameters β and γ by inferring the onset of the epidemic from the actual data. This is also listed in the output of the function (see Lst. ), and it can be controlled by setting the parameters t0 and t1, or deltat, which are used to specify the range of dates to be considered when determining the values of β and γ. The fatality rate (assumed constant) can also be indicated via the fatality.rate argument, as well as the total population of the region with tot.population.
# read time series data for confirmed cases
data <- covid19.data("ts-confirmed")
# run a SIR model for a given geographical location
generate.sir.model(data, "Hubei", t0 = , t1 = )
generate.sir.model(data, "Germany", tot.population = )
generate.sir.model(data, "Uruguay", tot.population = )
generate.sir.model(data, "Ontario", tot.population = , add.extras = TRUE)
# the function will aggregate data for a geographical location, like a country with multiple entries
generate.sir.model(data, "Canada", tot.population = , add.extras = TRUE)
Fig. also raises an interesting point regarding the accuracy of the SIR model. We should recall that this is the simplest approach one could take in order to model the spread of diseases, and usually more refined and complex models are used to incorporate several factors, such as vaccination, quarantines, effects of social clusters, etc. However, in some cases, especially when the spread of the disease appears to have entered the so-called exponential growth rate, this simple SIR model can capture the main trend of the dispersion (e.g. left plot of Fig. ), while in other cases, when the rate of spread is slower than the free exponential dispersion, the model clearly fails to track the actual evolution of cases (e.g. right plot of Fig. ).
Finally, Lst. shows an example of the generation of a sequence of values for R0, and in fact for any of the parameters (β, γ) describing the SIR model. In this case the function takes a range of values for the initial date t0 and generates different date intervals; this allows the function to generate multiple SIR models and return the corresponding parameters for each model. The results are then bundled in a "matrix"/"array" object which can be accessed by column for each model or by row for each parameter set.
# read time series data
ts.data <- covid19.data("ts-confirmed")
# select a location of interest, e.g. France
# France has many entries, just pick "France"
fr.data <- ts.data[(ts.data$Country.Region == "France") & (ts.data$Province.State == ""), ]
# sweep values of R0 based on the range of dates to consider for the model
ranges <- :
deltat <-
params_sweep <- sweep.sir.models(data = fr.data, geo.loc = "France", t_range = ranges, deltat = deltat)
# the parameters -- beta, gamma, R0 -- are returned in a "matrix"/"array" object
print(params_sweep)
As mentioned before, the functions from the covid19.analytics package also allow users to work with their own data, when the data is formatted in the time series structure discussed in Sec. . This opens a large range of possibilities for users to import their own data into R and use the functions already defined in the covid19.analytics package. A concrete example of how the data has to be formatted is shown in Lst. . The example shows how to structure, in a ts format, "synthetic" data generated by randomly sampling different distributions. However, this could equally be actual data from other places or locations not accessible from the datasets provided by the package, or researchers may have access to their own private data sets. The example also shows two cases: whether the data includes the "status" column or not, and whether there is more than one location. As a matter of fact, we left the "long" and "lat" fields empty, but if one includes the actual coordinates, the mapping function live.map can also be used with such structured data.
# ts data structure:
# "Province.State" "Country.Region" "Lat" "Long" dates...
# first let's create a 'fake' location
fake.locn <- c(NA, NA, NA, NA)
# names for these columns
names(fake.locn) <- c("Province.State", "Country.Region", "Lat", "Long")
# let's set the dates
dates.vec <- seq(as.Date(" / / "), as.Date(" / / "), "days")
# data.vecX would be the actual values/cases
data.vec1 <- rpois(length(dates.vec), lambda = )
# we can also add more cases
data.vec2 <- abs(rnorm(length(dates.vec), mean = , sd = ))
data.vec3 <- abs(rnorm(length(dates.vec), mean = , sd = ))
# this will name the columns after your dates
names(data.vec1) <- dates.vec
names(data.vec2) <- dates.vec
names(data.vec3) <- dates.vec
# merge them into a data frame with multiple entries
synthetic.data <- as.data.frame(rbind(rbind(c(fake.locn, data.vec1)), rbind(c(fake.locn, data.vec2)), rbind(c(fake.locn, data.vec3))))
# finally set your locn to something unique, so you can use it in the generate.sir.model fn
synthetic.data$Country.Region <- "mylocn"
# one could even add "status"
synthetic.data$status <- c("confirmed", "death", "recovered")
# or just one case per locn
synthetic.data <- synthetic.data[, -ncol(synthetic.data)]
synthetic.data$Country.Region <- c("mylocn", "mylocn", "mylocn")
# now we can use this 'synthetic' dataset with any of the ts functions
# data checks
integrity.check(synthetic.data)
consistency.check(synthetic.data)
data.checks(synthetic.data)
# quantitative indicators
tots.per.location(synthetic.data)
growth.rate(synthetic.data)
single.trend(synthetic.data[ , ])
mtrends(synthetic.data)
# SIR models
synthSIR <- generate.sir.model(synthetic.data, geo.loc = "mylocn")
plt.sir.model(synthSIR, interactivefig = TRUE)
sweep.sir.models(synthetic.data, geo.loc = "mylocn")
Listing : example of structuring data in a ts format, so that it can be used with any of the ts functions from the covid19.analytics package.
The covid19.analytics package provides access to genomics data available at the NCBI databases [ , ]. The covid19.genomic.data function is the master function for accessing the different variations of the genomics information available, i.e.
gtypes <- c("genome", "fasta", "tree", "nucleotide", "protein", "nucleotide-fasta", "protein-fasta", "genomic")
Each of these data types returns a different object; Lst. shows an abridged example of the structures of some of these objects. The most involved object is obtained from covid19.genomic.data when combining different types of datasets:
str(results)   # abridged
 $ refgenome : list with $livedata, $repo and $local copies of the reference genome, stored as character vectors of bases ("a", "t", "t", "a", ...) with a "species" attribute
 $ ptns      : character vectors of protein accessions ("yp_...") for the different sources
 $ proteins  : data.frame with variables such as accession, sra_accession, release_date, species, host, isolation_source, collection_date, biosample and genbank_title
 $ sra       : list with $sra_info (a description of the SRA runs detected with NCBI's kmer analysis (STAT) tool) and $sra_runs (a data.frame with acc, sample_acc, biosample, sra_study and bioproject)
 $ references: the data source actually used, e.g. the local copy shipped with the package
Listing : composition of the objects returned for the example presented in Lst. .
One aspect that should be mentioned with respect to the genomics data is that, in general, these are large datasets which are continuously being updated, hence increasing their sizes even more. These sizes ultimately present practical challenges, such as long processing times or even starvation of memory resources. We will not dive into more involved examples, like DNA sequencing analysis or building phylogenetic trees; packages such as ape, adegenet, phylocanvas, and others can be used for these and other analyses. One simple example we can present is the creation of dynamical categorization trees based on different elements of the sequencing data. We will consider, for instance, the data for the nucleotides as reported from NCBI.
The example in Lst. shows how to retrieve either nucleotides or proteins data and generate categorization trees based on different elements, such as the hosting organism, geographical location, sequence length, etc. In these examples we employed the collapsibleTree package, which generates interactive, browsable trees through web browsers.
# retrieve the nucleotides data
nucx <- covid19.genomic.data(type = 'nucleotide', src = 'repo')
# identify specific fields to look at
len.fld <- "Length"
acc.fld <- "Accession"
geoloc.fld <- "Geo_Location"
seq.fld <- "Sequence_Type"
host.fld <- "Host"
seq.limit <-
seq.limit <-
seq.limit <-
# selection criteria: nucleotides with sequence length between the two limits
selec.ctr <- nucx$Length < seq.limit & nucx$Length > seq.limit
# remove nucleotides without a specified "host"
Listing : example of how to generate a dynamic, browsable tree using some of the information included in the nucleotides dataset; some of the resulting tree representations are shown in Fig. .
In this section we will present and discuss how the covid19.analytics dashboard explorer is implemented. The main goal is to provide enough details about how the dashboard is implemented and how it works, so that users can modify it as they see fit or even develop their own. To do so, we will focus on three main points:
• the front-end implementation, also known as the user interface, mainly developed using the shiny package;
• the back-end implementation, mostly using the covid19.analytics package;
• the installation and configuration of the web server where the dashboard is hosted.
The covid19.analytics dashboard explorer is built using the shiny package [ ] in combination with the covid19.analytics package. Shiny allows users to build interactive dashboards that work through a web interface. The dashboard mimics the covid19.analytics package commands and features but enhances them, as it allows users to use dropdowns and other control widgets to easily input the data, rather than using a command terminal. In addition, the dashboard offers some unique features, such as a personal protective equipment (PPE) model estimation, based on realistic projections developed by the US Centers for Disease Control and Prevention (CDC). The dashboard interface offers several features:
1. The dashboard can be run on the cloud/web, allowing multiple users to simultaneously analyze the data with no special software or hardware requirements. The shiny package makes the dashboard mobile and tablet compatible as well.
2. It aids researchers in sharing and discussing analytical findings.
3. The dashboard can be run locally or through the web server.
4. No programming or software expertise is required, which reduces the technical barriers to analyzing the data. Users can interact with and analyze the data without any software expertise, and can therefore focus on the modeling and analysis. In these times the dashboard can be a monumental tool, as it removes barriers and allows a wider and more diverse set of users to have quick access to the data.
5. Interactivity. One feature of shiny and other graphing packages, such as plotly, is interactivity, i.e. the ability to interact with the data. This allows one to display complex data in a concise manner and to focus on specific points of interest. Interactive options such as zooming, panning and mouse hover all help in making the user interaction enjoyable and informative.
6. Fast and easy comparisons.
One advantage of a dashboard is that users can easily analyze and compare the data quickly and as many times as needed. For example, users can change a slider or dropdown to select multiple countries and see the total daily counts effortlessly. This allows the data display to change as the users' analysis requirements change.
The dashboard can be launched locally on a machine with R installed, either through an interactive R session or in batch mode using Rscript or R CMD BATCH, or it can be accessed through the web server at https://covid19analytics.scinet.utoronto.ca. For running the dashboard locally, the covid19.analytics package also has to be installed. For running the dashboard within an R session, the package has to be loaded and the explorer invoked using the following sequence of commands:
> library(covid19.analytics)
> covid19Explorer()
The batch mode can be executed using an R script containing the commands listed above. When the dashboard is run locally, the browser will open a port on the local machine (a localhost:port connection, i.e. http://127.0.0.1). It should be noted that if the dashboard is launched interactively within an R session one particular port is used, while if it is launched through an R script in batch mode the port will be different.
To implement the dashboard and enhance some of the basic functionalities offered, the following libraries were specifically used in its implementation:
• shiny [ ]: the main package that builds the dashboard.
• shinydashboard [ ]: a package that assists in building the dashboard with respect to themes, layouts and structure.
• shinycssloaders [ ]: a package that adds loader animations to shiny outputs, such as plots and tables, when they are loading or (re)calculating. In general, these are wrappers around base CSS-style loaders.
• plotly [ ]: a charting library used to generate interactive charts and plots. Although it is already used extensively in the core functions of covid19.analytics, we reiterate it here as it is a great tool for developing interactive plots.
• DT [ ]: a DataTables library used to generate interactive table output.
• dplyr [ ]: a library that helps to apply functions and operations to data frames. This is important for calculations, specifically for the PPE calculations.
The R shiny package makes developing dashboards easy and seamless and removes many challenges. For example, setting the layout of a dashboard is typically challenging, as it requires knowledge of front-end technologies such as HTML, CSS and Bootstrap in order to position elements and change their aesthetic properties. Shiny simplifies this problem with a built-in box controller widget, which allows developers to easily group elements, tables, charts and widgets together; many of the CSS properties, such as widths or colors, are input parameters to the functions of interest. The sidebar feature is simple to implement, and the shiny package makes it easy to remain compatible across multiple devices, such as tablets or cellphones. The shiny package also has built-in layout properties, such as fluidRow or columns, making it easy to position elements on a page. The library does have some challenges as well; one challenge faced is theme design. Shinydashboard does not make it easy to change the whole color theme of the dashboard beyond the white and blue themes provided by default. This issue is resolved by having the developer write custom CSS and change each of the various properties manually, as illustrated below.
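As a purely illustrative example of that custom-CSS approach (the selector shown targets the header bar of shinydashboard's default "skin-blue" theme; this is a generic sketch and not the dashboard's actual style sheet):

library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Example"),
  dashboardSidebar(),
  dashboardBody(
    # override one of the default theme colors with custom CSS
    tags$head(tags$style(HTML(
      ".skin-blue .main-header .navbar { background-color: #2c3e50; }"
    )))
  )
)
server <- function(input, output, session) {}
# shinyApp(ui, server)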
The dashboard contains two main components: a sidebar and a main body. The sidebar contains the list of all the menu options; options which are similar in nature are grouped in a nested format. For example, the dashboard menu section called "Datasets and Reports", when selected, displays a nested list of further options the user can choose from, such as the world data or the Toronto data. Grouping similar menu options together is important for helping the user understand the data. The main body displays the content of a page; the content displayed depends on the menu option selected in the sidebar. There are three main generic elements needed to develop a dashboard: layouts, control widgets and output widgets. The layout options are the components needed to lay out the features or components on a page. In this dashboard the layout widgets used are the following:
• box: boxes are the main building blocks of a dashboard; they allow us to group content together.
• tabPanels: tabPanels allow us to create tabs to divide one page into several sections. This allows multiple charts, or multiple types of data, to be displayed on a single page. For example, the indicators page has four tabs which display four different charts, with the mosaic tab displaying the charts in different configurations.
• header and title: these are used to display text and page titles in the appropriate sizes and fonts.
An example describing these elements and their implementation is shown in Lst. .
    ...('World Data table'),
    h ('World data of all COVID-19 cases across the globe'),
    column( , selectInput(ns("category_list"), label = h ("Category"), choices = category_list)),
    column( , downloadButton(ns('downloadData'), "Download")),
    withSpinner(DT::dataTableOutput(ns("table_contents")))
  )
}
Listing : snippet of the code describing the various features used in generating the dashboard. ns(id) is a namespaced id for the inputs/outputs; withSpinner is the shinycssloaders wrapper, which generates a loading animation while a chart is being loaded or (re)calculated.
Shiny modules are used when a shiny application grows larger and more complicated; they can also be used to fix the namespacing problem. Shiny modules also allow for code reusability and code modularity, as the code can be broken into several pieces called modules. Each module can then be called in different applications, or even several times in the same application. In this dashboard we break the code into two main groups: user interface (UI) modules and server modules. Each menu option has its own dedicated pair of UI and associated server modules, which makes the code easy to build and expand; for each new menu option, a new pair of UI and server module functions is built. Lst. is an example of a UI module: it specifies the design and look of the element and connects it with the active parts of the application. Lst. shows an example of a server function called reportServer. This type of module can update and display charts, tables and valueBoxes based on the user's selections; the same scenario occurs for all menu options within the UI/server paradigm. Another way to think about the UI/server separation is that the UI modules are in charge of laying down the look of a particular element in the dashboard, while the server is in charge of dynamically 'filling' the dynamical elements and populating them with data.
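For readers unfamiliar with this pattern, the following minimal, self-contained sketch (not taken from the dashboard's code base) illustrates the UI/server module pairing described above:

library(shiny)

# UI module: lays out a dropdown and a plot placeholder, namespaced by 'id'
counterUI <- function(id) {
  ns <- NS(id)
  tagList(
    selectInput(ns("dataset"), "Dataset", choices = c("confirmed", "deaths")),
    plotOutput(ns("plot"))
  )
}

# Server module: fills the placeholder based on the user's selection
counterServer <- function(id) {
  moduleServer(id, function(input, output, session) {
    output$plot <- renderPlot({
      plot(seq_len(10), main = input$dataset)   # placeholder plot
    })
  })
}

# a small app wiring the two together; the same module could be reused elsewhere
ui <- fluidPage(counterUI("world"))
server <- function(input, output, session) { counterServer("world") }
# shinyApp(ui, server)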
Control widgets, also called input widgets, are widgets which users use to input data, information or settings that update charts, tables and other output widgets. The following control widgets are used in this dashboard:
• numericInput: a textbox that only allows numerical input, used to select a single numerical value.
• selectInput: a dropdown, possibly multi-select, allowing users to select one or multiple options, as in the case of the country dropdown.
• slider: the slider in our dashboard is purely numerical and is used to select a single numerical value from a given min-max range.
• downloadButton: a button which allows users to download and save data in various formats, such as CSV.
• radioButtons: used to select only one option from a limited number of choices.
• checkbox: similar in purpose to radioButtons, also allowing users to select one option from a limited number of choices.

Figure : screenshot from the covid19.analytics dashboard explorer, "mosaic" tab of the 'indicators' category. Four interactive figures are shown in this case: the trends (generated using the itrends function), the totals (generated using the totals.plt function) and two world-wide representations of the reported COVID-19 cases (generated using the live.map function). The two upper plots are adjusted and re-rendered according to the selection of the country and data category in the input boxes.

Output control widgets are widgets used to display content and information back to the user. There are three main output widgets used in this dashboard:
• plotlyOutput: this widget outputs and creates plotly charts; plotly is a graphing library used to generate interactive charts.
• renderTable: an output that generates an interactive table with search, filter and sort capabilities provided out of the box.
• valueBox: a fancy textbox with border colors and descriptive font text, used to present descriptive values to users, such as the total number of deaths.
The dashboard contains the menus and elements shown in Fig. and described below:
• Indicators: this menu section displays different COVID-19 indicators used to analyze the pandemic. There are four notable indicators -- itrend, total plot, growth rate and live map -- displayed in the various tabs. itrend displays the "trend" in a log-log plot; total plot shows a line graph of the total numbers; growth rate displays the daily number of changes and the growth rate (as defined in Sec. ); live map shows a world map of infections in aggregated or time series format. These indicators are shown together in the "mosaic" tab.
• Models: this menu option contains a sub-menu presenting models related to the pandemic. The first model is the SIR (susceptible-infected-recovered) model implemented in the covid19.analytics package; SIR is a compartmental model describing how a disease infects a population. The other two models are used to estimate the amount of PPE needed due to infectious diseases, such as Ebola and COVID-19.
• Datasets and Reports: this section provides the reporting capability, which outputs reports as CSV and text files for the data. The world data subsection displays all the world data as a table which can be filtered, sorted and searched; the data can also be saved as a CSV file. The Toronto data subsection displays the Toronto data in tabular format, while also displaying the current pandemic numbers.
The data integrity subsection checks the integrity and consistency of the data set, e.g. when the raw data contains negative numbers or when the cumulative quantities decrease. The report subsection is used to generate a report as a text file.
• References: the reference section displays information on the GitHub repository and documentation, along with an external dashboards section which contains hyperlinks to other dashboards of interest. Dashboards of interest are the vaccine tracker, which tracks the progress of the vaccines being tested for COVID-19, the Johns Hopkins University dashboard, and the Canada dashboard built by the Dalla Lana School of Public Health at the University of Toronto.
• About us: contact information and information about the developers.
In addition to implementing some of the functionalities provided by the covid19.analytics package, the dashboard also includes a PPE calculator. The hospital PPE calculation is a qualitative model designed to estimate the amount of PPE equipment needed for a single COVID-19 patient over the duration of a hospital stay. The PPE calculation implemented in the covid19.analytics dashboard explorer is derived from the CDC's studies for infectious diseases such as Ebola and COVID-19; the rationale is that Ebola and COVID-19 are both contagious infections for which PPE is used to protect staff and patients and to prevent transmission. The hospital PPE calculation estimates and models the amount of PPE a hospital will need during the COVID-19 pandemic. There are two analysis methods a user can choose from to determine the hospital PPE requirements. The first method is to determine the amount of PPE needed for a single hospitalized COVID-19 patient. This first model requires two major components: the size of the healthcare team needed to take care of a single COVID-19 patient, and the amount of PPE used by the hospital staff per shift over the hospitalization duration. The model is based on the CDC Ebola crisis calculation [ ]. Although Ebola is a different disease from COVID-19, there is one major similarity: both COVID-19 and Ebola are diseases which require PPE for the protection of the healthcare staff and the infected patient. To compensate for the differences, the user can change the amount of PPE a healthcare staff member uses per shift; this information can be adjusted by changing the slider values in the advanced settings tab. The calculation itself is straightforward: it takes the amount of PPE used per shift and multiplies it by the number of healthcare staff and then by the hospitalization duration (i.e. total PPE = PPE per shift × number of staff × hospitalization duration). The first model has two tabs. The first tab displays a stacked bar chart of the amount of PPE equipment used by each type of hospital staff over the total hospital stay of a single patient, with each PPE item shown as a separate stack. The second tab, called advanced settings, has a series of sliders, one for each type of hospital staff (e.g. nurses), which users can move to change the amount of PPE that staff member will use per shift.
The second model is a more recent calculation developed by the CDC [ ]. This model calculates the burn rate of the PPE equipment in a hospital over a one-week time period, and is designed specifically for COVID-19. The CDC has created an Excel file for hospital staff to input their information, as well as an Android app which can be utilized. This model, also implemented in our dashboard, is simplified to calculate the PPE for a one-week setting.
the second model is a more recent calculation developed by the cdc [ ]. it calculates the burn rate of ppe equipment for hospitals over a one-week time period and is designed specifically for covid-19. the cdc has created an excel file for hospital staff to input their information, and also an android app which can be utilized. this model, also implemented in our dashboard, is simplified to calculate the ppe for a one-week setting. the one-week limit was implemented for two reasons: first, to limit the amount of input data a user has to enter into the system, as too much data can overwhelm and confuse a user; second, because the covid-19 pandemic is a highly fluid situation, and forecasts of ppe and resource needs beyond a one-week period may not be accurate for hospital staff. note that this model is not accurate if the facility receives a resupply of ppe; for resupplied ppe, a new calculation should be started. there are four tab panels in the burn rate calculation which display charts and settings. the first tab, daily usage, displays a multi-line chart of the amount of ppe used daily, ∆ppe_daily. the calculation is a simple subtraction between two consecutive days, i.e. the second day (j + 1) from the first day (j), as noted in eq. ( ). the tab panel called remaining supply shows a multi-line chart of the number of days the remaining ppe equipment will last in the facility; how long the ppe equipment can last in a given facility depends inversely on the number of covid-19 patients admitted to the hospital. to calculate the remaining ppe, one computes the average amount of ppe used over the one-week duration and then divides the amount of ppe at the beginning of the day by the average ppe usage, as shown in eq. ( ), where ⟨·⟩_T denotes the time average over a period of time T. the third panel, called ppe per patient, displays a multi-line chart of the burn rate, i.e. the amount of ppe used per patient per day; eq. ( ) represents this calculation as the remaining ppe supply divided by the number of covid-19 patients in the hospital during that exact day. the fourth tab, called advanced settings, is a series of show-and-hide "accordions" where users can input the amount of ppe equipment they have at the start of each day. there are six collapsed boxes for the ppe equipment types and for the covid-19 patient count; expanding a box displays seven numericInput textboxes which allow users to input the number of ppe items or the patient count for each day. the equations describing the ppe needs, eqs. ( , , ), are implemented in the shiny dashboard using the dplyr library, which allows users to work with dataframe-like objects in a quick and efficient manner. the three equations are implemented using a single dataframe: the advanced settings inputs of the burn rate analysis tab are saved into a dataframe, on which the ppe equations, eqs. ( - ), are then evaluated.
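the three burn-rate quantities described above can be sketched with dplyr roughly as follows, assuming a dataframe ppe_df with one row per day that holds the ppe stock at the start of the day (ppe_start) and the number of covid-19 patients (patients); the column names are assumptions, and the snippet only mirrors the equations as they are described in the text.

    library(dplyr)

    burn_rate <- ppe_df %>%
      arrange(day) %>%
      mutate(
        # daily usage: stock on day j minus stock on day j+1
        ppe_daily_use   = ppe_start - lead(ppe_start),
        # average usage over the one-week window
        ppe_avg_use     = mean(ppe_daily_use, na.rm = TRUE),
        # remaining supply in days: stock at the beginning of the day / average usage
        days_remaining  = ppe_start / ppe_avg_use,
        # burn rate: remaining supply divided by the number of patients that day
        ppe_per_patient = ppe_start / patients
      )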
the back-end implementation of the dashboard is achieved using the functions presented in sec. within the server module of the dashboard. the main strategy is to take a particular function, connect it with the input controls to feed the needed arguments into the function, and then capture the output of the function and render it accordingly. let us consider the example of the globe map representation shown in the dashboard, which is generated using the live.map function. lst. shows how this function connects with the other elements in the dashboard: the input elements are accessed using input$..., which in this case are used to control the particular options for displaying the legends or projections based on checkboxes. the output returned from this function is captured through the renderPlotly({...}) function, which is aimed at taking plotly-type plots and integrating them into the dashboard.

    # livemap plot charts on the three possible combinations
    output$ts_livemap <- renderPlotly({
        legend      <- input$sel_legend
        projections <- input$sel_projection
        live.map(covid19.data("ts-confirmed"),
                 interactive.display = FALSE,
                 no.legend = legend,
                 select.projctn = projections)
    })

listing : example of how the live.map function is used to render the interactive figures displayed on the dashboard.

another example is the report generation capability using the report.summary function, which is shown in lst. . as mentioned before, the input arguments of the function are obtained from the input controls. the output in this case is rendered using the renderText({...}) function, as the output of the original function is plain text. notice also that there are two implementations of the report.summary call: one is for the rendering on the screen, and the second one is for making the report available for download, which is handled by the downloadHandler function.

    reportServer <- function(input, output, session, result) {
        output$report_output_default <- renderText({
            # extract the variables of the inputs
            nentries <- input$txtbox_nentries
            geo_loc  <- input$geo_loc_select
            ts       <- input$ddl_ts
            capture.output(report.summary(graphical.output = FALSE,
                                          Nentries = nentries,
                                          geo.loc = geo_loc,
                                          cases.to.process = ts))
        }, sep = '\n')

        report <- reactive({
            nentries <- input$txtbox_nentries
            geo_loc  <- input$geo_loc_select
            ts       <- input$ddl_ts
            report <- capture.output(report.summary(graphical.output = FALSE,
                                                    Nentries = nentries,
                                                    geo.loc = geo_loc,
                                                    cases.to.process = ts))
            return(report)
        })

        output$downloadReport <- downloadHandler(
            filename = function() {
                paste("report-", Sys.Date(), ".txt", sep = "")
            },
            content = function(file) {
                writeLines(paste(report()), file)
            }
        )
    }

listing : report capabilities implemented in the dashboard using the report.summary function.

the final element in the deployment of the dashboard is the actual set-up and configuration of the web server where the application runs. the actual implementation of our web dashboard, accessible through https://covid19analytics.scinet.utoronto.ca, relies on a virtual machine (vm) hosted on a physical server located at scinet headquarters. we should also note that there are other ways to "publish" a dashboard; for shiny-based dashboards in particular, the most common and perhaps most straightforward one is to deploy the dashboard on https://www.shinyapps.io. alternatively, one could also implement the dashboard in a cloud-based solution, e.g. https://aws.amazon.com/blogs/big-data/running-r-on-aws/. each approach has its own advantages and disadvantages: for instance, depending on a third-party solution (like the ones previously mentioned) implies some cost paid to, or dependency on, the provider, but it certainly eliminates some of the complexity and special attention required when running one's own server. on the other hand, a self-deployed server allows for full control, in principle a cost-effective or cost-controlled expense, and full integration with the end application. in our case, we opted for a self-controlled and configured server as mentioned above. moreover, it is quite common practice to deploy (multiple) web services via vms or "containers". the vm for our web server runs on centos and has r installed
from source and compiled on the vm. after that we proceeded to install the shiny server from source, i.e. following https://github.com/rstudio/shiny-server/wiki/building-shiny-server-from-source. after the installation of the shiny-server is completed, we proceed by creating a new user on the vm from which the server is going to be run. for security reasons, we recommend avoiding running the server as root; in general, the shiny server can use a user named "shiny". hence a local account is created for this user and, logged in as this user, one can proceed with the installation of the required r packages in a local library for this user. all the packages needed for running the dashboard, as well as the covid19.analytics package, need to be installed. lst. shows the commands used for creating the shiny user and finalizing the configuration and the details of the log files.

    # place a shortcut to the shiny-server executable in /usr/bin
    sudo ln -s /usr/local/shiny-server/bin/shiny-server /usr/bin/shiny-server
    # create shiny user
    sudo useradd -r -m shiny
    # create log, config, and application directories
    sudo mkdir -p /var/log/shiny-server
    sudo mkdir -p /srv/shiny-server
    sudo mkdir -p /var/lib/shiny-server
    sudo chown shiny /var/log/shiny-server
    sudo mkdir -p /etc/shiny-server

listing : list of commands used on the vm to finalize the setup of the shiny user and server. source: https://github.com/rstudio/shiny-server.

for dealing with the apache configuration on port 80, we added the file /etc/httpd/conf.d/rewrite.conf as shown in lst. :

    RewriteCond %{REQUEST_SCHEME} =http
    RewriteRule ^ https://%{SERVER_NAME}%{REQUEST_URI} [QSA,R=permanent]

listing : modifications to the apache configuration, specified in the file rewrite.conf. these directives rewrite any incoming http request to https.

for handling the apache configuration on port 443, we added the file /etc/httpd/conf.d/shiny.conf, as shown in lst. . this virtualhost receives the https requests from the internet on port 443, establishes the secure connection, and redirects all input to the shiny port using plain http. all requests to "/" are redirected to "http://<localhost>:<shiny port>/app/", where app in this case is the subdirectory where a particular shiny app is located. there is an additional configuration file, /etc/httpd/conf.d/ssl.conf, which contains the configuration for establishing secure connections, such as protocols, certificate paths, ciphers, etc.

the communication of updates between the different elements used in the development and maintenance of the covid19.analytics package and dashboard web interface is orchestrated via git repositories. in this way, we have version control systems in place, but also a decentralized setup with multiple replicas. fig. shows a schematic of how our network of repositories and services is connected. the central hub for our package is located at the github repo https://github.com/mponce0/covid19.analytics; we then have (and users can too) our own local clones of this repo, which we usually use for development and testing. when a stable and substantial contribution to the package is reached, we submit it to the cran repository. similarly, when an update is done on the dashboard, we can synchronize the vm via git pulls and deploy the updates on the server side.

in this paper we have presented and discussed the r covid19.analytics package, which is an open source tool to obtain, analyze and visualize data of the covid-19 pandemic.
the package also incorporates a dashboard to facilitate access to its functionalities for less experienced users. as of today, there are a few dozen other packages in the cran repository that allow users to gain access to different datasets of the covid-19 pandemic. in some cases, these packages just provide access to data from specific geographical locations, or the approach to the data structure in which the data is presented differs from the one presented here. nevertheless, having a variety of packages from which users can choose, and probably combine, is an important and crucial element in data analysis. moreover, different cases of data misuse/misinterpretation due to issues such as erroneous metadata or data formats have been reported [ ], in some cases ending in articles' retractions [ ]. therefore, providing additional functionalities to check the integrity and consistency of the data, as the covid19.analytics package does, is paramount. this is especially true in a situation where the unfolding of events and data availability is flowing so fast that it is sometimes even hard to keep track of all the changes.

figure: schematic of the different repositories and systems employed by the covid19.analytics package and dashboard interface: the github repo as central repository (https://github.com/mponce0/covid19.analytics), the cran repo (https://cran.r-project.org/package=covid19.analytics), the shiny server running on the vm (https://covid19analytics.scinet.utoronto.ca), the github.io web rendering (https://mponce0.github.io/covid19.analytics/), plus local copies and private instances.

moreover, the covid19.analytics package offers a modular and versatile approach to the data, by allowing users to input their own data, to which most of the package functions can be applied as long as the data is structured using the time series format described in this manuscript. the covid19.analytics package is also capable of retrieving genomics data, and it does so by incorporating a novel, more reliable and robust way of accessing and designing different pathways to the data sources. another unique feature of this package is the ability to incorporate models to estimate the disease spread by using the actual data. although a simple model, it has shown some interesting results in agreement with the data for certain cases. of course there are more sophisticated approaches to shed light on this pandemic; in particular, novel "community" approaches have been catalyzed by it too [ ]. however, all of these approaches face new challenges as well [ ], and in that regard counting on a variety of tools, in particular open source tools and direct access to the data, might help on this front.
• r: a language and environment for statistical computing, r foundation for statistical computing
• r: a language for data analysis and graphics
• covid19.analytics: load and analyze live data from the covid-19 pandemic
• the biggest mystery: what it will take to trace the coronavirus source
• animal source of the coronavirus continues to elude scientists
• a pneumonia outbreak associated with a new coronavirus of probable bat origin
• the proximal origin of sars-cov-2
• bat-borne virus diversity, spillover and emergence
• extrapulmonary manifestations of covid-19
• opensafely: factors associated with covid-19 death in 17 million patients
• considering how biological sex impacts immune responses and covid-19 outcomes
• coronavirus blood-clot mystery intensifies
• using influenza surveillance networks to estimate state-specific prevalence of sars-cov-2 in the united states
• consolidation in a crisis: patterns of international collaboration in early covid-19 research
• critiqued coronavirus simulation gets thumbs up from code-checking efforts
• timing social distancing to avert unmanageable covid-19 hospital surges
• special report: the simulations driving the world's response to covid-19
• covid-19 vaccine design: the janus face of immune enhancement
• covidep: a web-based platform for real-time reporting of vaccine target recommendations for sars-cov-2
• social network-based distancing strategies to flatten the covid-19 curve in a post-lockdown world
• asymptotic estimates of sars-cov-2 infection counts and their sensitivity to stochastic perturbation
• evolutionary origins of the sars-cov-2 sarbecovirus lineage responsible for the covid-19 pandemic
• an interactive web-based dashboard to track covid-19 in real time
• pandemic publishing poses a new covid-19 challenge
• will the pandemic permanently alter scientific publishing?
• how swamped preprint servers are blocking bad coronavirus research
• project, trainees, faculty, advancing scientific knowledge in times of pandemics
• covid-19 risk factors: literature database & meta-analysis
• coronawhy: building a distributed, credible and scalable research and data infrastructure for open science (scinlp: natural language processing and data mining for scientific text)
• the comprehensive r archive network
• covid-19 data repository by the center for systems science and engineering
• covid-19: status of cases in toronto
• database resources of the national center for biotechnology information
• ape 5.0: an environment for modern phylogenetics and evolutionary analyses in r
• rentrez: an r package for the ncbi eutils api
• a contribution to the mathematical theory of epidemics
• the sir model for spread of disease: the differential equation model (loci, originally convergence)
• exact analytical solutions of the susceptible-infected-recovered (sir) epidemic model and of the sir model with equal death and birth rates
• devtools: tools to make developing r packages easier
• shiny: web application framework for r
• shinydashboard: create dashboards with 'shiny', r package version
• shinycssloaders: add css loading animations to 'shiny' outputs
• interactive web-based data visualization with r, plotly, and shiny, chapman and hall/crc
• dt: a wrapper of the javascript library 'datatables', r package version
• dplyr: a grammar of data manipulation
• estimated personal protective equipment (ppe) needed for healthcare facilities
• personal protective equipment (ppe) burn rate calculator
• high-profile coronavirus retractions raise concerns about data oversight
• covid-19 pandemic reveals the peril of ignoring metadata standards
• artificial intelligence cooperation to support the global response to covid-19
• the challenges of deploying artificial intelligence models in a rapidly evolving pandemic

the r script containing the shiny app to be run should be placed in /etc/shiny-server, and configuration details about the shiny interface are adjusted in the /etc/shiny-server/shiny-server.conf file. permissions for the application file have to match the identity of the user launching the server, in this case the shiny user. at this point, if the installation was successful and all the pieces were placed properly, a shiny-hosted app will be accessible from localhost when the shiny-server command is executed. since the shiny server listens on its port in plain http, it is necessary to set up an apache web server to act as a reverse proxy, receiving the connection requests from the internet on ports 80 and 443, the regular http and https ports, and redirecting them to the shiny port on the same host (localhost).

mp wants to thank all his colleagues at scinet, especially daniel gruner for his continuous and unconditional support, and marco saldarriaga, who helped us set up the vm for installing the shiny server.

key: cord- -vzizkekp authors: jarke, matthias title: data sovereignty and the internet of production date: - - journal: advanced information systems engineering doi: . / - - - - _ sha: doc_id: cord_uid: vzizkekp
while the privacy of personal data has captured great attention in the public debate, resulting, e.g., in the european gdpr guideline, the sovereignty of knowledge-intensive small and medium enterprises concerning the usage of their own data in the presence of dominant data-hungry players in the internet needs more investigation. in europe, even the legal concept of data ownership is unclear. we reflect on requirements analyses, reference architectures and solution concepts pursued by the international data spaces initiative to address these issues. the second part will more deeply explore our current interdisciplinary research in a visionary "internet of production" with research groups from production and materials engineering, computer science, business and social sciences. in this setting, massive amounts of heterogeneous data must be exchanged and analyzed across organizational and disciplinary boundaries, throughout the lifecycle from (re-)engineering, to production, usage and recycling, under hard resource and time constraints.
a shared metaphor, borrowed from plato’s famous cave allegory, serves as the core modeling and data management approach from conceptual, logical, physical, and business perspectives. the term "data sovereignty" is hotly debated in political, industrial, and privacy communities. politicians understand sovereignty as national sovereignty over data in their territory, when it comes to the jurisdiction over the use of big data by the big international players. one might think that data industries dislike the idea becausein whatever definitionit limits their opportunities to exploit "data as the new oil". however, some of them employ the vision of data sovereignty of citizens as a weapon to abolish mandatory data privacy rules as limiting customer sovereignty by viewing them as people in need of protection in an uneven struggle for data ownership. for exactly this reason, privacy proponents criticize data sovereignty as a tricky buzzword by the data industry, aiming to undermine the principles of real self-determination and data thriftiness (capturing only the minimal data necessary for a specified need) found in many privacy laws. the european gdpr regulation follows this argumentation to some degree by clearly specifying that you are the owner of all personal data about yourself. surprising to most participants, the well-known göttingen-based law professor gerald spindler, one of the gdpr authors, pointed out at a recent dagstuhl seminar on data sovereignty (cappiello et al. ) that this personal data ownership is the only formal concept of data ownership that legally exists in europe. in particular, the huge group of knowledge-intensive small and medium enterprises (smes) or even larger user industries in europe are lacking coherent legal, technical, and organizational concepts how to protect their data-and model-based knowledge in the globalized industrial ecosystems. in late , we introduced the idea to extend the concept of personal data spaces (halevy et al. ) to the inter-organizational setting by introducing the idea of industrial data spaces as the kernel of platforms in which specific industrial ecosystems could organize their cooperation in a data-sovereign manner (jarke ; jarke and quix ) . the idea was quickly taken up by european industry and political leaders. since , a number of large-scale german and eu projects have defined requirements (otto and jarke ) . via numerous use case experiments, the international data space (ids) association with currently roughly corporate members worldwide has evolved, and agreed on a reference architecture now already in version . section gives a brief overview of this reference architecture, its philosophy of "alliance-driven data ecosystems", and a few of the technical contributions required to make it operational. as recently pointed out by loucopoulos et al. ( ) , the production sector offers particularly complex challenges to such a setting due to the heterogeneity of its data and mathematical models, the structural and material complexity of many products, the globalized supply chains, and the international competition. funded by the german "excellence competition ", an interdisciplinary group of researchers at rwth aachen university therefore started a -year excellence cluster "internet of production" aiming to address these challenges in a coherent manner. section presents an overview of key concepts and points to ongoing work on specific research challenges. 
alliance-driven ecosystems and the international data space several of the most valuable firms worldwide create value no longer by producing their own output but by orchestrating the output of others. following modern versions of early medieval port cities and more recently phone companies, they do this by creating network effects by creating platforms which serve as two-sided or multi-sided markets (gawer ). in the best-known cases within the software industries, such as apple, amazon, or facebook, but also domain-specific ones like uber or flixbus, there is a keystone player defining and running the platform. the typical strategy here is a very high early marketing investment to gain as many early adopters as possible, thus achieving quickly a dominant market position and being able to exploit extremely rich data sets as a basis for analytics, advertising, or economies of logistics. design requirements engineering for this kind of platforms was already discussed in (jarke et al. ) . more recently, however, driven by the goals of exploiting the opportunities of platforms but also preserving organizational data sovereignty, we are seeing the appearance of platform-based ecosystems organized and governed by alliances of cooperating players. examples in europe include a smart farming data ecosystem initiated by the german-based farm equipment producer claas together with farming and seed-producing partners, which was recently joined by claas' fiercest competitor john deere. another example is an ongoing effort by vdv, the german organization of regional public transport organization, to set up an alliance-driven data ecosystem for intermodal traffic advice, ticketing, and exception handling in competition to efforts by big keystone players such as flixbus, deutsche bahn, or even google maps based on aachen's mobility broker metadata harmonization approach (beutel et al. ). beyond the core importance of such a domain-specific meta model, the creation of platform business models is a key success factor. yoo et al. ( ) already pointed out that, in contrast to traditional business model approaches, the "currency" of multi-sided markets can be an exchange of services rather than a purely financial one. in pfeiffer et al. ( ) , we therefore developed a business model development approach based on their service-dominant business logic and validated it in this intermodal travel ecosystem (cf. fig. ). in otto and jarke ( ), we employed a literature analysis and elaborate focus groups with industrial partners from the ids association in order to identify the main commonalities and differences (table ) , extending an earlier discussion of platform evolution by tiwana et al. ( ) . in this subsection, we summarize some of the fundamental design decisions of the international data space approach to alliance-driven data ecosystems design. the full description with, e.g., a complete information model of the approach can be found in ) from which the figures in this section have been excerpted. with its focus on sovereign and secure data exchange, the ids architecture takes up aspects of several other well-known architectural patterns for data exchange, as shown in fig. . it is different from the data lake architectures employed by most keystone-driven data platforms which emphasize rapid loading and pay-as-you-go data integration and knowledge extractions, but does embed such functionalities as service offerings whose usage, however, can be limited with enforced and monitored usage policies. 
on the other side, blockchains can be considered one extreme of such enforcements in a decentralized setting aiming at full transparency and provenance tracking, whereas the ids architecture emphasizes the sovereign definition of usage policies by the individual players, plus agreed policies for a data space. membership of an actor (organizational data management entity, cloud, or service provider) is established by two core technical elements, as illustrated in fig. : firstly, the data exchange (only) under agreed usage policies (shown as print-like ids boxes on the links in the figure) , and secondly by more or less "trusted" ids connectors. these combine aspects of traditional wrapper-mediators for data integration with trust guarantees provided by a hierarchy of simple to very elaborate security mechanisms. within the ids projects, at least four different usage policy enforcement strategies have been developed (eitel et al. ) , all accessible via the conceptual model of the odrl (open digital rights language) accepted by w c in . the usage control specifications support model-based code generation and execution based on earlier work by pretschner et al. ( ) . figure shows how security elements for policy definition (pdp) and policy management (pmp) in the linkage between connectors interact with policy execution points (pep) in the ids connectors from which they can be propagated even to certain specific data items within the protected internal sphere of a company. in fig. , we referred to the service-dominant business logic underlying most alliance-driven data ecosystems including the ids. obviously, as in other trustintensive inter-organizational settings (gans et al. ) , the individual actors and linkages should be carefully defined at the strategic level, for which conceptual modeling techniques such as i* strategic dependencies (yu ) are the obvious candidates. the analysis summarized in (otto and jarke ) has therefore led to the inclusion of an i*-inspired actor dependency network of important roles and their (task) dependencies (fig. ) . in the reference architecture model version . report , this is further elaborated with business process model patterns for the individual tasks, and governance mechanisms for the organizational infrastructure underlying data ecosystem set-up and operation. typical ids use cases so far have been relatively limited in their organizational and technical complexity. a number of more ambitious and socially complex variants, such as medical data spaces for cross-clinic medical and biomedical research have started and are accelerated by the demand caused by the covid- crisis. however, probably the most complex application domain tackled so far is production engineering, the subject of our dfg-funded excellence cluster "internet of production". in this -year effort, research groups from production and materials engineering, computer science, business and social sciences cooperate to study not just the sovereign data exchange addressed by the ids architecture in a fully globalized setting, but also the question of how to communicate between model-and data-driven approaches of vastly different disciplines and scales. in this setting, massive amounts of heterogeneous data must be exchanged and analyzed across organizational and disciplinary boundaries, throughout the lifecycle from (re-)engineering, to production, usage and recycling, under hard resource and time constraints. 
figure illustrates this complexity with a few of the different kinds of data, but most importantly with three different lifecycles that have traditionally hardly communicated in the nowadays necessary speed of feedback cycles among them. as one use case, we are considering the introduction of low-cost electric vehicles to the market. the engineering of such completely new cars requires numerous different specialties and supplier companies to collaborate and exchange data worldwide. to be financially viable, their production will take place in many small factories which can already be profitable with, say, . cars a year, rather than the traditional . . but this raises the question how the many small factories all over the world can exchange best practices, and provide feedback to the development engineers for perhaps culture-specific design improvements. last not least, only the usage experience with buying and operating the novel vehicles will really show what works, what is attractive for customers, what are their ideas for improving the design but perhaps also the production process of the cars. and all of this must happen in an iterative improvement cycle which runs at least - times the speed of traditional automotive innovation, in order to have a successful vehicle before the venture capital runs out. in addition to the challenges mentioned in sect. , the computer science perspective on fig. yields extreme syntactic and semantic heterogeneity combined with very high data volume, but often also highly challenging real-time velocity requirements, and a wide variety of model-driven and data-driven approaches. thus, all the v's of big data are present in this setting in addition to sovereign data exchange. building a complete digital twin for such a setting appears hopeless. we need a new kind of data abstraction with the following properties: • support for model-driven as well as data-driven approaches, and combinations thereof, across a wide range of different scientific disciplines. • relatively small size, in order to be easily movable anyplace between cloud and edge computing. • executable according to almost any real-time demand, with reasonable results on at least the most important perspective to the problem at hand. • suitable as valorized objects in an ids-like controlled data exchange ecosystem. in other words, we must find a common understanding addressing the open question what are actually the objects moving around in an industrial data space. the intuition for our solution to this question came from an unexpected corner, greek philosophy. in its famous allegory of the cave illustrated in fig. , plato (ca. b.c.) showed the limits of human knowledge by sketching a scenery in which humans are fixed in a cave such that they can only see the shadows of things happening in the outside world cast by natural light or by fires lit behind the phenomena (it is funny to note that he even invented the concept of fake news using human-made artefacts instead of real-world objects between the fire and the shadow). anyway, the shadows are obviously highly simplified data-driven real-time enabled abstractions which are, however, created under specific illuminations (=models) such as sunlight or fire. we therefore named our core abstraction digital shadows, a suitably compact result of combining context-specific simplified models and data analytics. 
formally, digital shadows can be seen as a generalization of the view concept in databases, where the base data as well as the (partially materialized) views are highly dynamic and complex objects. once we had invented this abstraction, we found that there already exist quite a number of useful digital shadow examples, among them well-known ones like the combination of shortest-path algorithms and real-time phone location data shown in navigation systems like tomtom, the flexible combination of petri net process modeling with event mining algorithms in process mining (van der aalst ), or the abstractions used in real-time vision systems for autonomous driving or sports training. within the excellence cluster "internet of production", we have been demonstrating the usefulness of digital shadows on complex examples from machine tooling control with extreme real-time constraints, optimization of the energy-intensive hot rolling step ubiquitous in steel-based production chains, plastics engineering, and laser-based additive manufacturing. rather than going into details here, we point to some companion papers elaborating specific issues, including the digital shadow concept itself and some early validation experiments (liebenberg and jarke ), the design of an initial physical infrastructure emphasizing the dynamic positioning and secure movement of digital shadows as well as base data within a novel infrastructure concept (pennekamp et al. ), logical foundations of correct and rapid data integration in connectors or data lakes (hai et al. ), model-driven development (dalibor et al. ), and the creation of an rdf-based metadata infrastructure called factdag (gleim et al. ) aimed at interlinking the multiple digital shadows in a holistic knowledge structure.
• product oriented integration of heterogeneous mobility services
• data ecosystems: sovereign data exchange among organizations
• model-driven development of a digital twin for injection molding
• usage control in the industrial data space
• continuous requirements management for organization networks: a (dis-)trustful approach
• bridging different perspectives on technological platforms: towards an integrative framework
• factdag: formalizing data interoperability in an internet of production
• relaxed functional dependency discovery in heterogeneous data lakes
• principles of data space systems
• data spaces: combining goal-driven and data-driven approaches in community decision and negotiation support
• the brave new world of design requirements
• on warehouses, lakes, and spaces: the changing role of conceptual modeling for data integration
• information systems engineering with digital shadows: concept and case studies
• requirements engineering for cyber physical production systems
• designing a multi-sided data platform: findings from the international data spaces case
• reference architecture model version 3.0, international data spaces association
• service-oriented business model framework: a service-dominant logic based approach for business modeling in the digital era
• towards an infrastructure enabling the internet of production
• distributed usage control
• platform evolution: co-evolution of platform architecture, governance, and environmental dynamics
• the new organizing logic of digital innovation: an agenda for information systems research
• modeling strategic relationships for process reengineering
acknowledgments.
this work was supported in part by the fraunhofer ccit research cluster, and in part by deutsche forschungsgemeinschaft (dfg, german research foundation) under germany's excellence strategy -exc- internet of production - . i would like to thank numerous collaborators in these projects, especially christoph quix and istvàn koren (deputy area coordinators for the iop infrastructure), the overall ids project manager boris otto, and my iop co-speakers christian brecher, günter schuh as well as iop manager matthias brockmann. key: cord- - twmcitu authors: mukhina, ksenia; visheratin, alexander; nasonov, denis title: spatiotemporal filtering pipeline for efficient social networks data processing algorithms date: - - journal: computational science - iccs doi: . / - - - - _ sha: doc_id: cord_uid: twmcitu one of the areas that gathers momentum is the investigation of location-based social networks (lbsns) because the understanding of citizens’ behavior on various scales can help to improve quality of living, enhance urban management, and advance the development of smart cities. but it is widely known that the performance of algorithms for data mining and analysis heavily relies on the quality of input data. the main aim of this paper is helping lbsn researchers to perform a preliminary step of data preprocessing and thus increase the efficiency of their algorithms. to do that we propose a spatiotemporal data processing pipeline that is general enough to fit most of the problems related to working with lbsns. the proposed pipeline includes four main stages: an identification of suspicious profiles, a background extraction, a spatial context extraction, and a fake transitions detection. efficiency of the pipeline is demonstrated on three practical applications using different lbsn: touristic itinerary generation using facebook locations, sentiment analysis of an area with the help of twitter and vk.com, and multiscale events detection from instagram posts. in today's world, the idea of studying cities and society through location-based social networks (lbsns) became a standard for everyone who wants to get insights about people's behavior in a particular area in social, cultural, or political context [ ] . nevertheless, there are several issues concerning data from lbsns in research. firstly, social networks can use both explicit (i.e., coordinates) or implicit (i.e., place names or toponyms) geographic references [ ] ; it is a common practice to allow manual location selection and changing user's position. the twitter application relies on gps tracking, but user can correct the position using the list of nearby locations, which causes potential errors from both gps and user sides [ ] . another popular source of geo-tagged data -foursquare -also relies on a combination of the gps and manual locations selection and has the same problems as twitter. instagram provides a list of closely located points-of-interest [ ] , however, it is assumed that a person will type the title of the site manually and the system will advise the list of locations with a similar name. although this functionality gives flexibility to users, there is a high chance that a person mistypes a title of the place or selects the wrong one. in facebook, pages for places are created by the users [ ] , so all data including title of the place, address and coordinates may be inaccurate. in addition to that, a user can put false data on purpose. 
the problem of detecting fake and compromised accounts became a big issue in the last five years [ , ] . spammers misrepresent the real level of interest to a specific subject or degree of activity in some place to promote their services. meanwhile, fake users spread unreliable or false information to influence people's opinion [ ] . if we look into any popular lbsn, like instagram or twitter, location data contains a lot of errors [ ] . thus, all studies based on social networks as a data source face two significant issues: wrong location information stored in the service (wrong coordinates, incorrect titles, duplicates, etc.) and false information provided by users (to hide an actual position or to promote their content). thus, in this paper, we propose a set of methods for data processing designed to obtain a clean dataset representing the data from real users. we performed experimental evaluations to demonstrate how the filtering pipeline can improve the results generated by data processing algorithms. with more and more data available every minute and with a rise of methods and models based on extensive data processing [ , ] , it was shown that the users' activity strongly correlates with human activities in the real world [ ] . for solving problems related to lbsn analysis, it is becoming vital to reduce the noise in input data and preserve relevant features at the same time [ ] . thus, there is no doubt that such problem gathers more and more attention in the big data era. on the one side, data provided by social media is more abundant that standard georeferenced data since it contains several attributes (i.e., rating, comments, hashtags, popularity ranking, etc.) related to specific coordinates [ ] . on the other side, the information provided by users of social networks can be false and even users may be fakes or bots. in , goodchild in [ ] raised questions concerning the quality of geospatial data: despite that a hierarchical manual verification is the most reliable data verification method, it was stated that automatic methods could efficiently identify not only false but questionable data. in paper [ ] , the method for pre-processing was presented, and only % of initial dataset was kept after filtering and cleaning process. one of the reasons for the emergence of fake geotags is a location spoofing. in [ ] , authors used the spatiotemporal cone to detect location spoofing on twitter. it was shown that in the new york city, the majority of fake geotags are located in the downtown manhattan, i.e., users tend to use popular places or locations in the city center as spoofing locations. the framework for the location spoofing detection was presented in [ ] . latent dirichlet allocation was used for the topic extraction. it was shown that message similarity for different users decreases with a distance increase. next, the history of user check-ins is used for the probability of visit calculation using bayes model. the problem of fake users and bots identification become highly important in the last years since some bots are designed to distort the reality and even to manipulate society [ ] . thus, for scientific studies, it is essential to exclude such profiles from the datasets. in [ ] , authors observed tweets with specific hashtags to identify patterns of spammers' posts. it was shown that in terms of the age of an account, retweets, replies, or follower-to-friend ratio there is no significant difference between legitimate and spammer accounts. 
however, the combination of different features of the user profile and the content allowed to achieve a performance of . auc [ ] . it was also shown that the part of bots among active accounts varies between % and %. this work was later improved by including new features such as time zones and device metadata [ ] . in contrast, other social networks do not actively share this information through a public api. in [ ] , available data from social network sites were studied, and results showed that social networks usually provide information about likes, reposts, and contacts, and keep the data about deleted friends, dislikes, etc., private. thus, advanced models with a high-level features are applicable only for twitter and cannot be used for social networks in general. more general methods for compromised accounts identification on facebook and twitter were presented in [ ] . the friends ratio, url ratio, message similarity, friend number, and other factors were used to identify spam accounts. some of these features were successfully used in later works. for example, in [ ] , seven features were selected to identify a regular user from a suspicious twitter account: mandatory -time, message source, language, and proximityand optional -topics, links in the text, and user interactions. the model achieved a high value of precision with approximately % of false positives. in [ ] , random forest classifier was used for spammers identification on twitter, which results in the accuracy of . %. this study was focused on five types of spam accounts: sole spammers, pornographic spammers, promotional spammers, fake profiles, and compromised accounts. nevertheless, these methods are usercentered, which means it is required to obtain full profile information for further analysis. however, there is a common situation where a full user profile is not available for researches, for example, in spatial analysis tasks. for instance, in [ ] , authors studied the differences between public streaming api of twitter and proprietary service twitter firehose. even though public api was limited to % sample of data, it provided % of geotagged data, but only % of all sample contains spatial information. in contrast, instagram users are on average times more likely post data with geotag comparing to twitter users [ ] . thus, lbsn data processing requires separate and more sophisticated methods that would be capable of identifying fake accounts considering incomplete data. in addition to that, modern methods do not consider cases when a regular user tags a false location for some reason, but it should be taken into account as well. as it was discussed above, it is critical to use as clean data as possible for research. however, different tasks require different aspects of data to be taken into consideration. in this work, we focus on the main features of the lbsn data: space, time, and messages content. first of all, any lbsn contains data with geotags and timestamps, so the proposed data processing methods are applicable for any lbsn. secondly, the logic and level of complexity of data cleaning depend on the study goals. for example, if some research is dedicated to studying daily activity patterns in a city, it is essential to exclude all data with wrong coordinates or timestamps. in contrast, if someone is interested in exploring the emotional representation of a specific place in social media, the exact timestamp might be irrelevant. in fig. 
, elements of a pipeline are presented along with the output data from each stage. as stated in the scheme, we start from general methods for a large scale analysis, which require fewer computations and can be applied on the city scale or higher. step by step, we eliminate accounts, places, and tags, which may mislead scientists and distort results. suspicious profiles identification. first, we identify suspicious accounts. the possibility of direct contact with potential customers attracts not only global brands or local business but spammers, which try to behave like real persons and advertise their products at the same time. since their goal differs from real people, their geotags often differ from the actual location, and they use tags or specific words for advertising of some service or product. thus, it is important to exclude such accounts from further analysis. the main idea behind this method is to group users with the same spatial activity patterns. for the business profiles such as a store, gym, etc. one location will be prevalent among the others. meanwhile, for real people, there will be some distribution in space. however, it is a common situation when people tag the city only but not a particular place, and depending on the city, coordinates of the post might be placed far from user's real location, and data will be lost among the others. thus, on the first stage, we exclude profiles, who do not use geotags correctly, from the dataset. we select users with more than ten posts with location to ensure that a person actively uses geotag functionality and commutes across the city. users with less than ten posts do not provide enough data to correctly group profiles. in addition, they do not contribute sufficiently to the data [ ] . then, we calculate all distances between two consecutive locations for each user and group them by m, i.e., we count all distances that are less than km, all distances between and km and so on. distances larger than km are united into one group. after that, we cluster users according to their spatial distribution. the cluster with a deficient level of spatial variations and with the vast majority of posts being in a single location represents business profiles and posts from these profiles can be excluded from the dataset. at the next step, we use a random forest (rf) classifier to identify bots, business profiles, and compromised accounts -profiles, which do not represent real people and behave differently from them. it has been proven by many studies that a rf approach is efficient for bots and spam detection [ , ] . since we want to keep our methods as general as possible and to keep our pipeline applicable to any social media, we consider only text message, timestamp, and location as feature sources for our model. we use all data that a particular user has posted in the studied area and extract the following spatial and temporal features: number of unique locations marked by a user, number of unique dates when a user has posted something, time difference in seconds between consecutive posts. for time difference and number of posts per date, we calculated the maximum, minimum, mean, and standard deviation. from text caption we have decided to include maximum, minimum, average, mean, standard deviation of following metrics: number of emojis per post, number of hashtags per post, number of words per post, number of digits used in post, number of urls per post, number of mail addresses per post, number of user mentions per post. 
in addition to that, we extracted money references, addresses, and phone numbers and included their maximum, minimum, average, mean, and standard deviation into the model. in addition, we added fraction of favourite tag in all user posts. thus, we got features in our model. as a result of this step, we obtain a list of accounts, which do not represent normal users. city background extraction. the next stage is dedicated to the extraction of basic city information such as a list of typical tags for the whole city area and a set of general locations. general locations are places that represent large geographic areas and not specific places. for example, in the web version of twitter user can only share the name of the city instead of particular coordinates. some social media like instagram or foursquare are based on a list of locations instead of exact coordinates, and some titles in this list represent generic places such as streets or cities. data from these places is useful in case of studying the whole area, but if someone is interested in studying actual temporal dynamics or spatial features, such data will distort the result. also, it should be noted that even though throughout this paper we use the word 'city' to reference the particular geographic area, all stages are applicable on the different scales starting from city districts and metropolitan regions to states, countries, or continents. firstly, we extract names of administrative areas from open street maps (osm). after that, we calculate the difference between titles in social media data and data from osm with the help of damerau-levenshtein distance. we consider a place to be general if the distance between its title and some item from the list of administrative objects is less than . these locations are excluded from the further analysis. for smaller scales such as streets or parks, there are no general locations. then, we analyze the distribution of tags mentions in the whole area. the term 'tag' denotes the important word in the text, which characterizes the whole message. usually, in lbsn, tags are represented as hashtags. however, they can also be named entities, topics, or terms. in this work, we use hashtags as an example of tags, but this concept can be further extrapolated on tags of different types. the most popular hashtags are usually related to general location (e.g., #nyc, #moscow) or a popular type of content (#photo, #picsoftheday, #selfie) or action (#travel, #shopping, etc.). however, these tags cannot be used to study separate places and they are not relevant either to places or to events since they are actively used in the whole area. nevertheless, scientists interested in studying human behavior in general can use this set of popular tags because it represents the most common patterns in the content. in this work, we consider tag as general if it was used in more than % of locations. however, it is possible to exclude tags related to public holidays. we want to avoid such situations and keep tags, which have a large spatial distribution but narrow peak in terms of temporal distribution. thus, we group all posts that mentioned a specific tag for the calendar year and compute their daily statistics. we then use the gini index g to identify tags, which do not demonstrate constant behavior throughout the year. if g ≥ . we consider tag as an event marker because it means that posts distribution have some peaks throughout the year. 
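a compact sketch of the tag-level checks described above (the share of locations in which a tag appears, and a gini index over its daily post counts) could look as follows; the posts dataframe layout (columns tag, location, date) and the two thresholds are assumptions standing in for the values used in the paper.

    library(dplyr)

    # gini index of a vector of daily post counts
    gini <- function(x) {
      x <- sort(x); n <- length(x)
      if (sum(x) == 0) return(0)
      sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
    }

    all_days    <- seq(min(posts$date), max(posts$date), by = "day")
    n_locations <- n_distinct(posts$location)

    tag_stats <- posts %>%
      group_by(tag) %>%
      summarise(
        loc_share = n_distinct(location) / n_locations,
        g = gini(as.numeric(table(factor(as.character(date),
                                         levels = as.character(all_days)))))
      )

    # placeholder thresholds for the cut-offs used in the paper
    share_threshold <- 0.05
    gini_threshold  <- 0.5

    event_tags   <- tag_stats %>% filter(g >= gini_threshold) %>% pull(tag)
    general_tags <- tag_stats %>% filter(loc_share > share_threshold,
                                         g < gini_threshold) %>% pull(tag)

tags flagged as event markers in this way are exactly those whose daily counts are strongly peaked rather than spread evenly over the year.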
this pattern is common for national holidays or seasonal events such as sports games, etc. thus, after the second stage, we obtain the dataset for further processing along with a list of common tags and general locations for the studied area. spatial context extraction. using hashtags for event identification is a powerful strategy; however, there are situations where it might fail. the main problem is that people often use hashtags to indicate their location, type of activity, objects in photos, etc. thus, it is important to exclude hashtags which are not related to a possible event. to do that, we grouped all hashtags by locations; in this way we learn which tags are widely used throughout the city and which are place-related. if some tag is highly popular in one place, it is highly likely that the tag describes this place. excluding common place-related tags like #sea or #mall for each location, we keep only relevant tags for the following analysis. in other words, we get the list of tags which describe the normal state of particular places and their specific features; such tags, however, cannot be indicators of events. fake transitions detection. the last stage of the pipeline is dedicated to suspicious posts identification. sometimes, people cannot share their thoughts or photos immediately, which leads to situations where even normal users have a bunch of posts that are not accurate in terms of location and timestamp. at this stage, we exclude posts that cannot represent the right combination of coordinates and timestamp. this process is similar to the ideas for location spoofing detection: we search for transitions which someone could not make in time. the standard approach for detection of fake transitions is to use space-time cones [ ], but in this work we suggest an improvement of this method: we use isochrones for fake transition identification. in urban studies, an isochrone is an area that can be reached from a specified point in equal time. isochrone calculation is based on real data about roads, which is why this method is more accurate than space-time cones. for isochrone calculation, we split the area into several zones depending on their distance from the observed point: pedestrian walking area (all locations in km radius), car/public transport area (up to km), train area ( - km) and flight area (further than km). this distinction was made to define a maximum speed for every traveling distance. the time required for a specific transition is calculated as t = Σ_i s_i / v, where s_i is the length of the i-th road segment along the route and v is the maximum possible velocity depending on the inferred type of transport. the road data was extracted from osm. it is important to note that at each stage of the pipeline we obtain output data which will be excluded, such as suspicious profiles, baseline tags, etc.; however, this data can also be used, for example, for training novel models for fake account detection. the first experiment was designed to highlight the importance of general location extraction. to do that, we used the points-of-interest dataset for moscow, russia. the raw data was extracted from facebook using the places api and contained , places. the final dataset for moscow contained , places, and general sites were identified. it should be noted that among the general locations there were 'russia' ( , , visitors), 'moscow, russia' ( , , visitors), and 'moscow oblast' ( , visitors).
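returning to the fake-transition check described just before the experiments, a minimal sketch of the time-budget test (t = Σ_i s_i / v compared with the observed time between two posts) is shown below; the zone boundaries and speeds are illustrative assumptions, and the distance is collapsed to a single great-circle segment instead of the osm-based isochrones used in the pipeline.

    # assumed maximum speed (km/h) for the distance-based zones described above
    zone_speed <- function(dist_km) {
      if (dist_km <= 5)        5     # pedestrian walking area
      else if (dist_km <= 100) 80    # car / public transport area
      else if (dist_km <= 500) 160   # train area
      else                     800   # flight area
    }

    # great-circle distance in km between two lon/lat points
    haversine_km <- function(lon1, lat1, lon2, lat2) {
      to_rad <- pi / 180; r <- 6371
      dlat <- (lat2 - lat1) * to_rad; dlon <- (lon2 - lon1) * to_rad
      a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
      2 * r * asin(sqrt(a))
    }

    # a transition is flagged as fake if the minimal required time exceeds the observed gap
    is_fake_transition <- function(lon1, lat1, t1, lon2, lat2, t2) {
      d <- haversine_km(lon1, lat1, lon2, lat2)
      t_required <- d / zone_speed(d) * 3600   # seconds
      t_required > as.numeric(difftime(t2, t1, units = "secs"))
    }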
by comparison, the most popular non-general locations in moscow are sheremetyevo airport and red square, with only , and , check-ins, respectively. the itinerary construction is based on solving the orienteering problem with functional profits (opfp) with the help of the open-source framework fops [ ]. in this approach, locations are scored by their popularity and by farness distance. we used the following parameters for the ant colony optimization algorithm: ant per location and iterations of the algorithm, as stated in the original article. the time budget was set to h, red square was selected as the starting point, and vorobyovy gory was used as the finish point, since these two are highly popular tourist places in the city center. the resulting routes are presented in fig. . both routes contain extra places, including major parks in the city: gorky park and zaryadye park. however, there are several distinctions between these routes. the route based on the raw data contains four general places (fig. , left): 'moscow', 'moscow, russia', 'russia', and 'khamovniki district', which do not correspond to actual venues. thus, % of the locations in this route cannot be visited in real life. in contrast, in the case of the clean data (fig. , right), instead of general places the algorithm was able to add real locations, such as the bolshoi theatre and the central children's store on lubyanka, with the largest clock mechanism in the world and an observation deck with a view of the kremlin. thus, the framework was able to construct a much better itinerary without any additional improvements in algorithms or methods. to demonstrate the value of the background analysis and typical hashtag extraction stages, we investigated the scenario of analyzing users' opinions in a geographical area via sentiment analysis. we used a combined dataset of twitter and vk.com posts taken in sochi, russia, during . sochi is one of the largest and most popular russian resorts; it was also the host of the winter olympics in . since twitter and vk.com provide geospatial data with exact coordinates, we created a square grid with a cell size of m. we then kept only cells containing data (fig. , right), cells in total. each cell was treated as a separate location for context extraction. the most popular tags in the area are presented in fig. (left). the tag '#sochi' was mentioned in / of the cells ( and cells for the russian and english versions of the tag, respectively). the follow-up tags '#sochifornia' (used in cells) and '#sea' (mentioned in cells) were half as popular. after that, we extracted typical tags for each cell. we considered a post to be relevant to a place if it contained at least one typical tag; thus, we can be confident that the posts represent the sentiment in that area. the sentiment analysis was executed in two stages. first, we prepared the text for polarity detection: we removed punctuation, split the text into words, and normalized the text with the help of [ ]. in the second step, we used the russian sentiment lexicon [ ] to get the polarity of each word (a positive score for a positive word and a negative score for a negative word). the sentiment of a text is defined as positive if the sum of the polarities of all its words is greater than zero and negative if the sum is less than zero. the sentiment of a cell is defined as the average sentiment of all its posts. the results of the sentiment analysis are presented in fig. ; cells with an average sentiment of less than . were marked as neutral.
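a minimal sketch of the lexicon-based polarity scoring of cells described above; the tiny lexicon, the tokenisation, and the neutral cut-off are hypothetical placeholders for the russian sentiment lexicon and grid actually used in the paper.

```python
import re
from statistics import mean

# hypothetical word polarities standing in for the Russian sentiment lexicon
LEXICON = {"красиво": 1, "отлично": 1, "люблю": 1, "плохо": -1, "ужасно": -1}

def text_sentiment(text):
    """+1 if the sum of word polarities is positive, -1 if negative, 0 otherwise."""
    words = re.findall(r"\w+", text.lower())
    score = sum(LEXICON.get(w, 0) for w in words)
    return 1 if score > 0 else (-1 if score < 0 else 0)

def cell_sentiment(posts, neutral_cutoff=0.2):
    """average post sentiment for a grid cell; small absolute values are marked neutral."""
    if not posts:
        return "neutral"
    s = mean(text_sentiment(p) for p in posts)
    return "neutral" if abs(s) < neutral_cutoff else ("positive" if s > 0 else "negative")

print(cell_sentiment(["отлично, люблю это место", "красиво!", "немного плохо"]))  # positive
```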
it can be seen from the maps that after the filtering process more cells have a higher level of sentiment. for the sochi city center, the number of posts with sentiment |s| ≥ . increased by . %. it is also important that the number of uncertain cells with sentiment . ≤ |s| ≤ . decreased by . %, from to cells. thus, by applying the context extraction stage of the proposed pipeline we highlighted the strongly positive and negative areas and decreased the number of uncertain areas. in the next experiment, we applied the full pipeline to instagram data. new york city was used as the target city in the event detection approach [ ]. we collected data from over , locations for a period of up to years; the total number of posts extracted from the new york city area is , , . in the first step, we try to exclude from the dataset all users who provide incorrect data, i. e., who use only a few locations instead of the whole variety. we group users with the help of the k-means clustering method; the appropriate number of clusters was obtained by calculating the distortion parameter. the deviant cluster contained , users out of , , . the shape of the deviant clusters can be seen in fig. : suspicious profiles mostly post in the same location, while regular users show variety in terms of places. after that, we trained our random forest (rf) model using manually labelled data from both datasets. the training dataset contains profiles with ordinary users and fake users; the test data consists of profiles, including normal profiles and suspicious accounts. the model distinguishes regular users from suspicious ones successfully: normal users were detected correctly, users were marked as suspicious, and suspicious users out of were correctly identified, yielding % precision and % recall. since the goal of this work is to obtain clean data, we are mainly interested in a high value of recall, and precision is less critical. as a result, we obtained a list of , , profiles related to real people. in the next step, we used only data from these users to extract background information about the city. the titles of general locations were derived for new york, and these places were excluded from further analysis. after that, we extracted general hashtags; an example of popular tags in a location before and after background tag extraction is presented in fig. . general tags mostly contain terms related to toponyms and universal themes such as beauty or life. then, we performed context extraction for the locations: for each location, typical hashtags were identified as the % most frequent tags among users. we treat all posts from one user in the same location as a single post, to avoid situations where someone tries to force their hashtag. the extracted lists are used to exclude typical tags from posts. after that, we calculated isochrones for each normal user to exclude suspicious posts from the data. in addition, locations with a high rate of suspicious posts ( % or more of the posts in the location detected as suspicious) were excluded as well; such locations were found in new york city. the final dataset for new york consists of , locations. for event detection we performed the same experiment described in [ ]. in the original approach, a spike in activity in a particular cell of the grid is considered an event. to find these spikes, historical grids are created using retrospective data for a calendar year. since we decreased the amount of data significantly, we set the threshold value to .
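a minimal sketch of the grid-based spike detection just mentioned, under the assumption that an event corresponds to a daily count far above the cell's historical level; the data layout and threshold factor are illustrative, not the original implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_historical_grid(posts):
    """posts: iterable of (cell_id, day). returns per-cell (mean, std) of daily counts."""
    daily = defaultdict(lambda: defaultdict(int))
    for cell, day in posts:
        daily[cell][day] += 1
    grid = {}
    for cell, counts in daily.items():
        values = list(counts.values())
        grid[cell] = (mean(values), pstdev(values) if len(values) > 1 else 0.0)
    return grid

def detect_events(grid, today_counts, threshold=3.0):
    """flag cells whose count today exceeds mean + threshold * std of the historical grid."""
    events = []
    for cell, count in today_counts.items():
        mu, sigma = grid.get(cell, (0.0, 0.0))
        if count > mu + threshold * max(sigma, 1.0):
            events.append(cell)
    return events

# usage with hypothetical counts: ~2 posts/day historically in cell_42, 40 posts today
history = [("cell_42", d) for d in range(1, 301)] * 2
grid = build_historical_grid(history)
print(detect_events(grid, {"cell_42": 40, "cell_7": 3}))  # ['cell_42']
```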
for the evaluation, we used data for to create the grids and then took two weeks from : a week with many events, - march, and an ordinary week with fewer large events, - february. the results of the recall evaluation are presented in table . as can be seen from the table, for the active week the recall increased by . % and for the non-active week by . %. it is also important to note that some events without specific coordinates, such as the snowfall in march or the saint patrick's day celebration, were detected in fewer places. this leads to fewer detected events in total and a larger contribution to the false positive rate. nevertheless, the largest and most important events, such as the nationwide protest '#enough! national school walkout' and the north american international toy fair, are still detected from the very beginning. in addition, due to the altered structure of the historical grids, we were able to discover new events, such as a concert of the canadian r&b duo 'dvsn' and the global engagement summit at un headquarters. these events were covered by a small number of posts and went unnoticed in the original experiment; the use of clean data helped to highlight small events, which are essential for understanding the current situation in the city. in this work, we presented a spatiotemporal filtering pipeline for data preprocessing. the main goal of this process is to exclude data that is unreliable in terms of space and time. the pipeline consists of four stages. in the first stage, suspicious user profiles are extracted from the data with the help of k-means clustering and a random forest classifier. in the next stage, we exclude buzz words from the data and filter out locations related to large areas such as islands or city districts. then, we identify the context of a particular place expressed by its unique tags. in the last step, we find suspicious posts using the isochrone method. the stages of the pipeline can be used separately and for different tasks. for instance, in the case of tourist walking itinerary construction, we used only general location extraction, and the walking itinerary was improved by replacing % of the places. in the experiment dedicated to sentiment analysis, we used the context extraction method to keep posts related to the area where they were taken, and as a result . % of the uncertain areas were identified either as neutral or as strongly positive or negative. in addition, for event detection we performed all stages of the pipeline, and the recall of the event detection method increased by . %. nevertheless, there are ways to further improve this pipeline. on instagram, some famous places such as times square have several corresponding locations, including versions in other languages. this issue can be addressed with the same method used in the general location identification stage: we can use the same distance measure to find places with similar names. currently, we do not address repeated places in the data, since they can belong to a retail chain, and some retail chains include over a hundred places across the city. in some cases it can be useful to interpret a chain store system as a single place; however, if we want to preserve distinct places, more complex methods are required. despite this, the applicability of the spatiotemporal pipeline was shown using data from facebook, twitter, instagram, and vk.com.
thus, the pipeline can be successfully used in various tasks relying on location-based social network data.

references (titles as cited):
"deep" learning for missing value imputation in tables with non-numerical data
"right time, right place" health communication on twitter: value and accuracy of location information
social media geographic information: why social is special when it goes spatial
building sentiment lexicons for all major languages
positional accuracy of twitter and instagram images in urban environments
a location spoofing detection method for social networks
compa: detecting compromised accounts on social networks
the rise of social bots
the quality of big (geo)data
urban computing leveraging location-based social network data: a survey
zooming into an instagram city: reading the local through social media
an agnotological analysis of apis: or, disconnectivity and the ideological limits of our knowledge of social media
advances in social media research: past, present and future
morphological analyzer and generator for russian and ukrainian languages
efficient pre-processing and feature selection for clustering of cancer tweets
analyzing user activities, demographics, social network structure and user-generated content on instagram
is the sample good enough? comparing data from twitter's streaming api with twitter's firehose
orienteering problem with functional profits for multi-source dynamic path construction
fake news detection on social media: a data mining perspective
who is who on twitter: spammer, fake or compromised account? a tool to reveal true identity in real-time
twitter as an indicator for whereabouts of people? correlating twitter with uk census data
detecting spammers on social networks
online human-bot interactions: detection, estimation, and characterization
multiscale event detection using convolutional quadtrees and adaptive geogrids
places nearby: facebook as a location-based social media platform
arming the public with artificial intelligence to counter social bots
detecting spam in a twitter network
true lies in geospatial big data: detecting location spoofing in social media

acknowledgement. this research is financially supported by the russian science foundation, agreement # - - .

key: cord- -a datsoo authors: ambrogi, federico; coradini, danila; bassani, niccolò; boracchi, patrizia; biganzoli, elia m. title: bioinformatics and nanotechnologies: nanomedicine date: journal: springer handbook of bio-/neuroinformatics doi: . / - - - - _ sha: doc_id: cord_uid: a datsoo in this chapter we focus on the bioinformatics strategies for translating genome-wide expression analyses into clinically useful cancer markers, with a specific focus on breast cancer, with a perspective on new diagnostic device tools coming from the field of nanobiotechnology and the challenges related to high-throughput data integration, analysis, and assessment from multiple sources. great progress in the development of molecular biology techniques has been seen since the discovery of the structure of deoxyribonucleic acid (dna) and the implementation of a polymerase chain reaction (pcr) method.
this started a new era of research on the structure of nucleic acid molecules, the development of new analytical tools, and dna-based analyses that allowed the sequencing of the human genome, the completion of which has led to intensified efforts toward comprehensive analysis of mammalian cell structure and metabolism in order to better understand the mechanisms that regulate normal cell behavior and identify the gene alterations responsible for a broad spectrum of human diseases, such as cancer, diabetes, cardiovascular diseases, neurodegenerative disorders, and others. technical advances such as the development of molecular cloning, sanger sequencing, pcr, oligonucleotide microarrays and, more recently, a variety of so-called next-generation sequencing (ngs) platforms have actually revolutionized translational research and in particular cancer research. now scientists can obtain a genome-wide perspective of cancer gene expression useful for discovering novel cancer biomarkers for more accurate diagnosis and prognosis and for monitoring treatment effectiveness. thus, for instance, microrna expression signatures have been shown to provide a more accurate method of classifying cancer subtypes than transcriptome profiling and allow classification of different stages in tumor progression, actually opening the field of personalized medicine (in which disease detection, diagnosis, and therapy are tailored to each individual's molecular profile) and predictive medicine (in which genetic and molecular information is used to predict disease development, progression, and clinical outcome). however, since these novel tools generate a tremendous amount of data and since the number of laboratories generating microarray data is rapidly growing, new bioinformatics strategies that promote the maximum utilization of such data, as well as methods for integrating gene ontology annotations with microarray data to improve candidate biomarker selection, are necessary. in particular, the management and analysis of ngs data require the development of informatics tools able to assemble, map, and interpret huge quantities of relatively or extremely short nucleotide sequence data. as a paradigmatic example, a major pathology such as breast cancer can be considered.
breast cancer is the most common malignancy in women, with a cumulative lifetime risk of developing the disease as high as one in every eight women [ . ]. several factors are associated with this cancer, such as genetics, lifestyle, menstrual and reproductive history, and long-term treatment with hormones. until now, breast cancer has been hypothesized to develop, following a progression model similar to that described for colon cancer [ . , ], through a linear histological progression from adenosis, to ductal/lobular hyperplasia, to atypical ductal/lobular hyperplasia, to in situ carcinoma, and finally to invasive cancer, corresponding to increasingly worse patient outcome. molecularly, it has been suggested that this process is accompanied by increasing alterations of the genes that encode for tumor suppressor proteins, nuclear transcription factors, cell cycle regulatory proteins, growth factors, and corresponding receptors, which provide a selective advantage for the outgrowth of mammary epithelial cell clones containing such mutations [ . ]. recent advances in genomic technology have improved our understanding of the genetic events that parallel breast cancer development. in particular, dna microarray-based technology, with the simultaneous evaluation of thousands of genes, has provided researchers with an opportunity to perform comprehensive molecular and genetic profiling of breast cancer, able to classify it into clinically relevant subtypes and attempting to predict the prognosis or the response to treatment [ . - ]. unfortunately, the initial enthusiasm for the application of such an approach was tempered by the publication of several studies reporting contradictory results on the analysis of the same samples analyzed on different microarray platforms, which raised skepticism regarding the reliability and reproducibility of this technique [ . , ]. in fact, despite the great theoretical potential for improving breast cancer management, the actual performance of predictors built using gene expression is not as good as initially published, and the lists of genes obtained from different studies are highly unstable, resulting in disparate signatures with little overlap in their constituent genes. in addition, the biological role of individual genes in a signature, the equivalence of several signatures, and their relation to conventional prognostic factors are still unclear [ . ]. even more incomplete and confusing is the information obtained when molecular genetics was applied to premalignant lesions; indeed, genome analysis revealed an unexpected morphological complexity of breast cancer, very far from the hypothesized multi-step linear process, suggesting instead a series of stochastic genetic events leading to distinct and divergent pathways towards invasive breast cancer [ . ], the complexity of which limits the application of really effective strategies for prevention and early intervention.
therefore, despite the great body of information about breast cancer biology, improving our knowledge about the puzzling bio-molecular features of neoplastic progression is of paramount importance to better identify the series of events that, in addition to genetic changes, are involved in breast tumor initiation and progression and that enable premalignant cells to reach the six biological endpoints that characterize malignant growth (self-sufficiency in growth signals, insensitivity to growth-inhibitory signals, evasion of programmed cell death, limitless replicative potential, sustained angiogenesis, and tissue invasion and metastasis). to do that, instead of studying single aspects of tumor biology, such as gene mutation or gene expression profiling, we must apply an investigational approach aimed at integrating the different aspects (molecular, cellular, and supracellular) of breast tumorigenesis. at the molecular level, an increasing body of evidence suggests that gene expression alone is not sufficient to explain protein diversity and that epigenetic changes (i. e., heritable changes in gene expression that occur without changes in nucleotide sequences), such as alterations in dna methylation, chromatin structure changes, and dysregulation of microrna expression, may affect normal cells and predispose them to subsequent genetic changes with important repercussions on gene expression, protein synthesis, and ultimately cellular function [ ]. at the cellular level, evidence indicates that to really understand cell behavior we must also consider the microenvironment in which cells grow; an environment that recent findings indicate to have a relevant role in promoting and sustaining abnormal cell growth and tumorigenesis [ . ]. this picture is further complicated by the concept that among the heterogeneous cell population that makes up the tumor there exists approximately % of cells, also known as tumor-initiating cells, that are most likely derived from normal epithelial precursors (stem/precursor cells) and share with them a number of key properties, including the capacity for self-renewal and the ability to proliferate and differentiate [ . , ]. when altered in their response to abnormal inputs from the local microenvironment, these stem/precursor cells can give rise to preneoplastic lesions [ . ]. in fact, similarly to bone marrow-derived stem cells, tissue-specific stem cells show remarkable plasticity within the microenvironment: they can enter a state of quiescence for decades (cell dormancy), but can become highly dynamic once activated by specific microenvironmental stimuli from the surrounding stroma and are ultimately transformed into tumor-initiating cells [ . ]. the stroma, in which the mammary gland is embedded, is composed of adipocytes, fibroblasts, blood vessels, and an extracellular matrix in which several cytokines and growth factors are present. while none of these cells are themselves malignant, they may acquire an abnormal phenotype and altered function due to their direct or indirect interaction with epithelial stem/precursor cells. acting as an oncogenic agent, the stroma can provoke tumorigenicity in adjacent epithelial cells, leading to the acquisition of genomic changes, to which epigenetic alterations also contribute, that can accumulate over time and provoke the silencing of more than pivotal genes encoding proteins involved in tumor suppression, apoptosis, cell cycle regulation, dna repair, and signal transduction [ . ].
under these conditions, epithelial cells and the stroma co-evolve towards a transformed phenotype following a process that has not yet been worked out [ . , ]. many of the soluble factors present in the stroma, essential for normal mammary gland development, have been found to be associated with cancer initiation. this is the case for steroid hormones (estradiol and progesterone), which are physiological regulators of breast development and whose dysregulation may result in preneoplastic and neoplastic lesions [ . - ]. in fact, through their respective receptors, estrogens and progesterone may induce in epithelial cells the synthesis of local factors that, on the one hand, trigger the activation of the stem/precursor cells and, on the other hand, exert a paracrine effect on endothelial cells, which, in response to vascular endothelial growth factor, trigger the activation of neoangiogenesis [ . ]. in addition, estrogens have been found to be implicated in local modifications of tissue homeostasis associated with a chronic inflammation that may promote epithelial transformation, due to the continued production of pro-inflammatory factors that favors the generation of a pro-growth environment and fosters cancer development [ . ]; alternatively, transformed epithelial cells would enhance the activation of fibroblasts through a vicious circle that supports the hypothesis according to which cancer should be considered a never-healing wound. last but not least, very recent findings in animal models have clearly indicated that an early event in the activation of estrogen-induced mammary carcinogenesis is the altered expression of some oncogenic micrornas (oncomirs), suggesting a functional link between hormone exposure and epigenomic control [ . ]. concerning the forecasted role of new nanobiotechnology applications, disclosing the bio-molecular events contributing to tumor initiation is therefore of paramount importance, and to achieve this goal a convergence of advanced biocomputing tools for cancer biomarker discovery and multiplexed nanoprobes for cancer biomarker profiling is crucial. this is one of the major tasks currently ongoing in medical research, namely the interaction of nanodevices with cells and tissues in vivo and their delivery to disease sites. biomarkers refer to genes, rna, proteins, and mirna expression that can be correlated with a biological condition or may be important for prognostic or predictive purposes with regard to the clinical outcome. the discovery of biomarkers has a long history in translational research. in more recent years, microarrays have generated a great deal of work, promising the discovery of prognostic and predictive biomarkers able to change medicine as it was known until then. since the beginning, the importance of statistical methods in such a context has been evident, starting from the seminal paper of golub, which showed the ability of gene expression to classify tumors [ . ]. although bioinformatics is the leading engine referenced in the biomolecular literature, providing informatics tools to handle massive omic data, the computational core is actually represented by biostatistics methodology aiming at extracting useful summary information. biostatistics cornerstones are represented by large-sample and likelihood theories, hypothesis testing, experimental design, and exploratory multivariate techniques, summarized in the genomic era according to class comparison, class prediction, and class discovery.
actually, massive omic data and the idea of personalized medicine require statistical theory to be developed according to new requirements. even in the case of multivariate techniques, the problems usually faced with statistical methods involved orders of magnitude less data than those encountered with high-throughput technologies, a situation that ngs techniques will easily exacerbate. in class comparison studies there is a predefined grouping of samples, and the interest is in evaluating whether the groups express the transcripts of interest differently. such studies are generally performed using a transcript-by-transcript analysis, performing thousands of statistical tests and then correcting p-values to account for the desired percentages of false positives and negatives. in fact, the multiple comparison problem is the first concern, as traditional methods for family-wise error control are generally too restrictive when accounting for thousands of tests. the false discovery rate (fdr) was a major breakthrough in this context; the general concepts underlying fdr are outlined later (fig. ). another topic discussed regards the parametric assumptions underlying most of the statistical tests used. permutation tests were developed largely to face this issue and are now one of the standard tools available to researchers. jeffery and colleagues [ . ] performed a systematic comparison of different methods for identifying genes differentially expressed across experimental groups, finding that different methods gave rise to very different lists of genes and that sample size and noise level strongly affected the predictive performance of the methods chosen for evaluation. an evaluation of the accuracy of fold change compared to ordinary and moderated t statistics was performed by witten and tibshirani [ . ], who discuss the issues of reproducibility and accuracy of gene lists returned by different methods, claiming that a researcher's decision to use fold change or a modified t-statistic should be based on biological, rather than statistical, considerations. in this sense, the classical limma-like approach [ . ] has become a de facto standard in the analysis of high-throughput data: gene expression and mirna microarrays, proteomics, and serial analysis of gene expression (sage) generate an incredible amount of data which is routinely analyzed element-wise, without considering the multivariate nature of the problem. akin to this, non-parametric multivariate analysis of variance (manova) techniques have also been suggested to identify differentially expressed genes in the context of microarrays and qpcr-rt [ . , ], with the advantage of not making any distributional assumption on expression data and of being able to circumvent the dimensionality issue related to omic data (the number of subjects being much smaller than the number of genes). a well-known example of a class comparison study is that of van't veer and colleagues [ . ], in which a panel of genes, a signature, was claimed to be predictive of poor outcome at years for breast cancer patients. in this case, a group of patients relapsing at years was compared in terms of gene expression to a group of patients not relapsing within years. in class discovery studies no predefined groups are available, and the interest is in finding new groupings, usually called bioprofiles, using the available expression measures. the standard statistical method for class discovery is cluster analysis, which received great expansion due to gene expression studies.
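a minimal sketch of class discovery by cluster analysis on an expression matrix, assuming a samples-by-genes layout and simulated data; the correlation distance and average linkage are illustrative choices, not those of the studies discussed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# hypothetical expression matrix: 40 samples x 200 genes with two latent sample groups
group = np.repeat([0, 1], 20)
signal = np.r_[np.ones(40), np.zeros(160)] * 3.0     # 40 genes carry the group signal
expr = rng.normal(size=(40, 200)) + np.outer(group, signal)

# correlation-based distance between samples, average-linkage hierarchical clustering
dist = pdist(expr, metric="correlation")
tree = linkage(dist, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

print(labels)  # two discovered sample clusters, to be interpreted as candidate bioprofiles
```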
it is worth saying that cluster analysis is a powerful yet tricky method that should be applied taking care of outliers, the stability of results, the number of suspected profiles, and so on. these aspects are very hard to handle with thousands of transcripts to be analyzed. even more subtle are the problems of interpreting the obtained clusters in terms of disease profiles and of defining a rule to assign the discovered profiles. alternatively, classical multivariate methods, such as principal components analysis (pca), are gaining relevance for the visualization of high-dimensional data (fig. ). the work of perou and colleagues [ . ] is an important example of class discovery by cluster analysis in a major pathology such as breast cancer. in their work, the authors found that genes distinguished between estrogen-positive cancers with luminal characteristics and estrogen-negative cancers. among these two subgroups, one had a basal characterization and the other showed patterns of up-regulation for genes linked to the oncogene erb-b . repeated application of cluster analysis in different case series resulted in very similar groupings. notwithstanding the above-mentioned issues connected to cluster analysis, one of the major breakthroughs of genomic studies was actually believed to be the definition of genomic signatures/profiles by the repeated application of cluster analysis to different case series without the definition of a formal rule for class assignment of new subjects. profiles may then be correlated with clinical outcome, as was done for breast cancer by van't veer and colleagues [ . ]. now, more than years after this study, it is not yet clear what the real contribution of microarray-based gene expression profiling to breast cancer prognosis is. of all the so-called first-generation signatures, only oncotype dx [ . ], a qrt-pcr based analysis of genes, has reached level ii of evidence to support tumor prognosis and has been included in the national comprehensive cancer network guidelines, whereas the remaining signatures have only obtained level iii of evidence so far [ . ]. reasons for this are, among others, a lack of stability in terms of the genes that the lists are composed of and a strong time-dependence, i. e., reduced prognostic value after to years of follow-up. another, and more important, issue for prognostic/prediction studies is connected to the design of the study itself: a prognostic study should be planned by defining a cohort that will be followed over time, while a case-control study may be only suggestive of signatures to be considered. class comparison in genome-wide studies is one of the most common and challenging applications since the advent of microarray technology. the first study on predictive signatures in breast cancer [ . ] was mainly a class comparison study. from the statistical viewpoint, one of the first problems evidenced was the large number of statistical tests performed in such an analysis. in particular, the classical control for false positives, emphasizing the specificity of the screening, appeared from the beginning to be too restrictive, with the cost in terms of false negatives too high. to understand this issue, let us suppose we have to compare gene expression in a group of tumor tissues with that in a group of normal tissues. for each gene, a statistical test is performed controlling the probability of declaring the gene different when in fact it is not (a false positive, fp).
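a minimal sketch of this transcript-by-transcript comparison: one ordinary two-sample t-test per gene on simulated data, returning one p-value per transcript (in practice a moderated t-statistic, as in limma, would typically replace the plain test).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_per_group = 1000, 15
tumor = rng.normal(size=(n_genes, n_per_group))
normal = rng.normal(size=(n_genes, n_per_group))
tumor[:50] += 1.5  # the first 50 genes are truly differentially expressed

# one t-test per gene (row): thousands of tests, hence the multiplicity problem
t_stat, p_values = stats.ttest_ind(tumor, normal, axis=1)

print((p_values < 0.05).sum())  # a raw 5% threshold keeps true signals plus many false positives
```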
such an error is called the type one error; its level is generally called α and is commonly fixed at the % level. the problem is that a test at level α is performed for each gene. therefore, if the probability of making a mistake (an fp) on a single test is α, the probability of not making a mistake is 1 − α (this is the probability of declaring the gene not differentially expressed when it is not, a true negative); when performing, say, thousands of tests, the probability of not making any mistake at all becomes practically zero, and accordingly the probability of at least one fp is practically one. how can the specificity of the experiment be controlled? a large number of procedures is available; the simplest and best known is the bonferroni correction. let us see how it works. if n tests are performed at level α, the probability of not having any false positive is (1 − α)^n, and therefore the probability of making at least one false positive is 1 − (1 − α)^n, which can be approximated by nα (for small α). the bonferroni correction originates from this: if the tests are performed at level α_B = α/n, then we can expect to have no false positives among the genes declared differentially expressed, with the family-wise error controlled at level α. this is, however, at the cost of a large number of false negatives. in genomic experiments, when thousands of tests are performed, the bonferroni significance level is so low that very few genes can pass the first screening, probably paying too high a cost in terms of genes declared not significantly differentially expressed when they actually are. the balance between specificity and sensitivity is a fairly old problem in screening, which is exacerbated in high-throughput data analysis. one of the most common approaches applied in this context is the proposal of benjamini and hochberg [ . ], called the false discovery rate (fdr), which tries to control the number of false positives among the genes declared significant. to better understand the novelty of fdr, let us suppose there are m genes considered in the high-throughput experiment, some of which are truly differentially expressed while the others are not. performing the appropriate statistical test, ns of the m genes are declared not different between the groups under comparison while s are declared significantly different (fig. ). the type one error rate α (the false positive rate) controls the number of fp relative to the number of genes that are truly not differentially expressed, while using the bonferroni correction the probability that the number of fp is greater than zero is controlled. the fdr changes perspective and considers the columns of the table instead of the rows: fdr controls the number of fp relative to s, the number of genes declared significant. if, for example, a list of genes is declared differentially expressed at a given fdr level, that proportion of the list is expected to consist of false positives. this may allow greater flexibility in managing the screening phase of the analysis (see fig. for a graphical representation of the results of a class comparison microarray study with an application of fdr concepts). the problem first solved by benjamini and hochberg was basically how to estimate and control the fdr, and different proposals have appeared since then, for example the q-value of storey [ . ]. in general, omic and multiplexed diagnostic technologies, with their ability to produce vast amounts of biomolecular data, have vastly outstripped our ability to sensibly deal with this data deluge and extract useful and meaningful information for decision making.
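a minimal sketch of the benjamini-hochberg step-up procedure for controlling the fdr at level q, compared on simulated p-values with a bonferroni cut-off; the implementation is generic and not taken from the chapter.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.10):
    """return a boolean mask of discoveries controlling the FDR at level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m        # BH line: q * k / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k_max = np.nonzero(below)[0].max()          # largest k with p_(k) <= q*k/m
        rejected[order[: k_max + 1]] = True         # reject all hypotheses up to that rank
    return rejected

# usage: compare with a Bonferroni cut-off on the same simulated p-values
rng = np.random.default_rng(2)
p = np.concatenate([rng.uniform(0, 0.001, 20), rng.uniform(0, 1, 980)])
print(benjamini_hochberg(p, q=0.10).sum(), (p < 0.05 / p.size).sum())
```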
the producers of novel biomarkers assume that an integrated bioinformatics and biostatistics infrastructure exists to support the development and evaluation of multiple assays and their translation to the clinic. actually, best scientific practice for the use of high-throughput data is still to be developed. in this perspective, the existence of advanced computational technologies for bioinformatics is irrelevant along the translational research process unless supporting biostatistical evaluation infrastructures exist to take advantage of developments in any technology. (fig. : volcano plot of the differential gene expression pattern between experimental groups. the x-axis reports the least squares (ls) means, i. e., the difference of mean expression on the log scale between experimental groups, and the y-axis the -log transformed p-values corrected for multiplicity using the fdr method of benjamini et al. [ . ]. the horizontal red line corresponds to a cut-off for the significance level α at . ; points above this threshold represent genes that are differentially expressed between the experimental groups and are to be further investigated.) in this sense, a key problem is the fragmentation of quantitative research efforts. the analysis of high-dimensional data is mainly conducted by researchers with limited biostatistical experience, using standard software without knowledge of the underlying statistical principles of the methodology, thus exposing the results to wide uncertainty, not only due to sample size limitations. moreover, so far a large number of biostatistical methods and software tools supporting bioinformatics analysis of genomic/proteomic data has been provided, but reference standardized analysis procedures coping with suitable preprocessing and quality control approaches on raw data coming from omic and multiplex assays are still awaiting development. formal initiatives for the integration of biostatistical research groups with functional genomics and proteomics labs are one of the major challenges in this context. in fact, besides the development of innovative biostatistics and bioinformatics tools, a major key to success lies in the ability to integrate different competencies. such integration cannot simply be delegated to the development of software, as in the arraytrack initiative, but needs the development of integrated skills assisted by a software platform able to outline the analysis plan. in this context, different strategies can be adopted, from open software, such as r and bioconductor, to commercial packages such as sas/jmp genomics. in a functional, dynamic perspective, the characterization of the bio-profiles of cancer patients is further complicated by the prolonged follow-up of patients, with the need to register the event history of possible adverse events (local recurrence and/or metastasis) before death, which may offer useful insight into disease dynamics to identify subsets of patients with worse prognosis and better response to therapy. this makes it necessary to develop strategies for the integration of clinical and follow-up information with that deriving from genetic and molecular characterizations. the evaluation and benchmarking of new analytical processes for the discovery, development, and clinical validation of new diagnostic/prognostic biomarkers is an extremely important problem, especially in a fast-growing area such as translational research based on functional genomics/proteomics.
in fact, the presentation of overoptimistic results based on the unsuitable application of biostatistical procedures can mask the true performance of a new biomarker/bioprofile and create false expectations about its effectiveness. guidelines for omic and cross-omic studies should be defined through the integration of different competencies coming from clinical-translational, bioinformatics, and biostatistics research. this integrated contribution from multidisciplinary research teams will have a major impact on the development of standard procedures that will standardize the results and make research more consistent and accurate according to relevant bioanalytical and clinical targets. microarray studies have provided insight on global gene expression in cells and tissues, with the expectation of improving prognostic assessments. the identification of genes whose expression levels are associated with recurrence might also help to better discriminate those subjects who are likely to respond to the various tailored systemic treatments. however, microarray experiments raised several questions for the statistical community about the design of the experiments, data acquisition and normalization, and supervised and unsupervised analysis. all these issues are burdened by the fact that typically the number of genes being investigated far exceeds the number of patients. it is well recognized that too large a number of predictor variables affects the performance of classification models: bellman coined the term curse of dimensionality [ . ], referring to the fact that, in the absence of simplifying assumptions, the sample size needed to estimate a function of several variables to a given degree of accuracy (i. e., to get a reasonably low-variance estimate) grows exponentially with the number of variables. to avoid this problem, feature selection and extraction play a crucial role in microarray analysis. this has led several researchers to find it judicious to filter out genes that do not change their expression level significantly, reducing the complexity of the data and improving the signal-to-noise ratio. however, the adopted measure of significance in filtering (the implicitly controlled error measure) is not always easy to interpret in terms of the simultaneous testing of thousands of genes. moreover, gene expressions are usually filtered on a per-gene basis, seldom taking into account the correlation between different gene expressions. this filtering approach is commonly used in most current high-throughput experiments whose main objective is to detect differentially expressed genes (active genes) and, therefore, to generate hypotheses rather than to confirm them. all these methods, based on a measure of significance, select genes from a supervised perspective, i. e., accounting for the outcome of interest (the subject status). however, an unsupervised approach might be useful in order to reveal the pattern of associations among different genes, making it possible to single out redundant information. (the figure shows the data analysis pipeline developed in many papers dealing with expression data from high-throughput experiments.) integration and standardization of approaches for the assessment of diagnostic and prognostic performance is a key issue. many clinical and translational research groups have chosen different approaches for biodata modeling, tailored to specific types of medical data.
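a minimal sketch contrasting the supervised per-gene filter discussed above with a simple unsupervised variance filter that ignores the outcome; the simulated data and thresholds are arbitrary illustrations, not a recommended protocol.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
expr = rng.normal(size=(2000, 60))          # 2000 genes x 60 patients
expr[:100, :30] += 1.0                      # signal in the first 100 genes, first 30 patients
status = np.r_[np.ones(30), np.zeros(30)]   # outcome (e.g., relapse yes/no)

# supervised filter: keep genes with a small per-gene p-value against the outcome
_, p = stats.ttest_ind(expr[:, status == 1], expr[:, status == 0], axis=1)
supervised = set(np.nonzero(p < 0.01)[0])

# unsupervised filter: keep the most variable genes, ignoring the outcome entirely
variances = expr.var(axis=1)
unsupervised = set(np.argsort(variances)[::-1][:100])

print(len(supervised), len(unsupervised), len(supervised & unsupervised))
```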
however, very few proper benchmarking studies of algorithm classes have been performed worldwide, and fewer examples of best practice guidelines have been produced. similarly, few studies have closely examined the criteria under which medical decisions are made. the integrating aspects of this theme relate to methods and approaches for inference, diagnosis, prognosis, and general decision making in the presence of heterogeneous and uncertain data. a further priority is to ensure that research in biomarker analysis is designed and informed from the outset to integrate well with clinical practice (to facilitate widespread clinical acceptance) and that it exploits cross-over between methods and knowledge from different areas (to avoid duplication of effort and to facilitate rapid adoption of good practice in the development of this healthcare technology). reference problems are related to the assessment of improved diagnostic and prognostic tools in the clinical setting, resorting to observational and experimental clinical studies from phase i to phase iv and to integration with studies on therapy efficacy, which would involve bioprofile and biopattern analysis. in this perspective, the integration of different omic data is a well-known issue that is receiving increasing attention in biomedical research [ . , ] and which questions the capability of researchers to make sense of a huge amount of data with very different features. since this integration cannot be seen only as an it problem, proper biostatistical approaches need to be adopted that consider the multivariate nature of the problem in the light of exploiting the maximum prior information about the biological patterns underlying the clinical problem. a critical review of microarray studies was performed earlier in a paper by dupuy and simon [ . ], in which a thorough analysis of the major limitations and pitfalls of microarray studies published in concerning cancer outcome was carried out (see fig. for a general pipeline for high-throughput experiments). integrated into this review was an attempt to write guidelines for the statistical analysis and reporting of gene expression microarray studies. starting from this work, it will be possible to extend the outlined criticisms to a wider range of omic studies, in order to produce updated guidelines useful for biomolecular researchers. in the perspective of integrating omic data coming from different technologies, a comparison of microarray data with ngs platforms will be a relevant point [ . - ]. due to the lack of sufficiently standardized procedures for processing and analyzing ngs data, much attention will be given to the process of data generation and quality control evaluation. such an integration is crucial because, though the capabilities of ngs platforms mostly outperform those of microarrays, protocols for data management and analysis are typically very time-consuming, making them impractical for in-depth analysis of large samples. of note, one of the ultimate goals of biomedical research is to connect diseases to genes that specify their clinical features and to drugs capable of treating them. dna microarrays have been used for investigating genome-wide expression in common diseases, producing a multitude of gene signatures predicting survival, whose accuracy, reproducibility, and clinical relevance have, however, been debated [ . , , , ].
moreover, the regulatory relationships between the signature genes have rarely been investigated, largely limiting their biological understanding. genes, indeed, never act independently of each other; rather, they form functional connections that coordinate their activity. hence, it is fundamental that in each cell, at every life stage, regulatory events take place in order to keep the healthy steady state. any perturbation of a gene network, in fact, has a dramatic effect on our life, leading to disease and even death. the prefix nano is from the greek word meaning dwarf. nanotechnology refers to the science of materials whose functional organization is on the nanometer scale, that is, one billionth of a meter. starting from ideas originating in physics in the s and boosted by the electronics industry's need for miniaturization (i. e., speed), the field has grown rapidly. today, nanotechnology is gaining an important place in the medicine of the future. in particular, by exploiting the patho-physiological conditions of diseased and inflamed tissues it is possible to target nanoparticles, and with them drugs, genes, and diagnostic tools. moreover, the spatial and/or temporal contiguity of data from ngs and nanobiotech diagnostic approaches imposes the adoption of methods related to signal analysis which are still to be introduced in standard software, being related to statistical functional data analysis methods. therefore, the extension of the multivariate statistical methodologies adopted so far is required in a functional data context, a problem that has already been met in the analysis of mass spectrometry data from proteomic analyses. nanotechnology-based platforms for the high-throughput, multiplexed detection of proteins and nucleic acids actually promise to bring substantial advances in molecular diagnostics. forecasted applications of nano-diagnostic devices are related to the assessment of the dynamics of cell processes for a deeper knowledge of the ongoing etio-pathological process at the organ, tissue, and even single-cell level. ngs is a growing revolution in genomic nanobiotechnologies that has parallelized the assay process, integrating reactions at the micro or nano scale on chip surfaces and producing thousands or millions of sequences at once. these technologies are intended to lower the costs of nucleic acid sequencing far beyond what was possible with earlier methods. concerning cancer, a key issue is the improvement of early detection and prevention through the understanding of the cellular and molecular pathways of carcinogenesis. in this way it would be possible to identify the conditions that are precursors of cancer before the start of the pathological process, unraveling its molecular origins. this should represent the next frontier of bioprofiling, allowing the strict monitoring and possible reversal of the neoplastic transformation through personalized preventive strategies.
therefore, the success will be strongly related to the capability of integrating data from multiple sources in a robust and sustainable research perspective, which could enhance the transfer of high-throughput molecular results to novel diagnostic and therapy application. the new framework of nanobiotechnology approaches in biomedical decision support according to improved clinical investigation and diagnostic tools is emerging. there is a general need for guidelines for biostatistics and bioinformatics practice in the clinical translation and evaluation of new biomarkers from cross-omic studies based on hybridization, ngs, and high-throughput multiplexed nanobiotechnology assays. specifically, the major topics concern: bioprofile discovery, outcome analysis in the presence of complex follow-up data, assessment of diagnostic, and prognostic values of new biomarkers/bioprofiles. current molecular diagnostic technologies are not conceived to manage biological heterogeneity in tissue samples, in part because they require homogeneous preparation, leading to a loss of valuable spatial information regarding the cellular environment and tissue morphology. the development of nanotechnology has provided new opportunities for integrating morphological and molecular information and for the study of the association between observed molecular and cellular changes with clinical-epidemiological data. concerning specific approaches, bioconjugated quantum dots (qds) [ . - ] have been used to quantify multiple biomarkers in intact cancer cells and tissue specimens, allowing the integration of traditional histopathology versus molecular profiles for the same tissue [ . [ ] [ ] [ ] [ ] [ ] [ ] . current interest is focused on the development of nanoparticles with one or multiple functionalities. for example, binary nanoparticles with two functionalities have been developed for molecular imaging and targeted therapy. bioconjugated qds, which have both targeting and imaging functions, can be used for targeted tumor imaging and for molecular profiling applications. nanoparticle material properties can be exploited to elicit clinical advantage for many applications, such as for medical imaging and diagnostic procedures. iron oxide constructs and colloidal gold nanoparticles can provide enhanced contrast for magnetic resonance imaging (mri) and computed tomography (ct) imaging, respectively [ . , ]. qds provide a plausible solution to the problems of optical in vivo imaging due to the tunable emission spectra in the near-infrared region, where light can easily penetrate through the body without harm and their inherent ability to resist bleaching [ . ]. for ultrasound imaging, contrast relies on impedance mismatch presented by materials that are more rigid or flexible than the surrounding tissue, such as metals, ceramics, or microbubbles [ . ]. continued advancements of these nano-based contrast agents will allow clinicians to image the tumor environment with enhanced resolution for a deeper understanding of disease progression and tumor location. additional nanotechnologically-based detection and therapeutic devices have been made possible using photolithography and nucleic acid chemistry [ . highly sensitive biosensors that recognize genetic alterations or detect molecular biomarkers at extremely low concentration levels are crucial for the early detection of diseases and for early stage prognosis and therapy response. nanowires have been used to detect several biomolecular targets such as dna and proteins [ . 
the identification of dna alterations is crucial to better understand the mechanisms of a disease such as cancer and to detect potential genomic markers for diagnosis and prognosis. other studies have reported the development of a three-dimensional gold nanowire platform for the detection of mrna from cellular and clinical samples with enhanced sensitivity. highly sensitive electrochemical sensing systems use peptide nucleic acid probes to directly detect specific mrna molecules without pcr amplification steps [ ]. cantilever nanosensors have also been used to detect minute amounts of protein biomarkers. label-free resonant microcantilever systems have been developed to detect ng/ml levels of alpha-fetoprotein, a potential marker of hepatocarcinoma, providing an opportunity for early disease diagnosis and prognosis [ . ]. nanofabricated and functionalized devices such as nanowires and nanocantilevers are fast, multiplexed, and label-free methods that offer extraordinary potential for the future of personalized medicine. the combination of data from multiple imaging techniques offers many advantages over data collected from a single modality. potential advantages include improved sensitivity and specificity of disease detection and monitoring, smarter therapy selection based on larger data sets, and faster assessment of treatment efficacy. the successful combination of imaging modalities, however, will be difficult to achieve with multiple contrast agents. multimodal contrast agents stand to fill this niche by providing spatial, temporal, and/or functional information that corresponds with anatomic features of interest. there is also great interest in the design of multifunctional nanoparticles, such as those that combine contrast and therapeutic agents. the integration of diagnostics and therapeutics, known as theranostics, is attractive because it allows the imaging of therapeutic delivery, as well as follow-up studies to assess treatment efficacy. finally, a key direction of research is the optimization of biomarker panels via principled biostatistics approaches for the quantitative analysis of molecular profiles for clinical outcome and treatment response prediction. the key issues that will need to be addressed are: (i) a panel of tumor markers will allow more accurate statistical modeling of disease behavior than reliance on single tumor markers; and (ii) the combination of tumor gene expression data and molecular information on the cancer microenvironment is necessary to define aggressive phenotypes of cancer, as well as to determine the response of early-stage disease to treatment (chemotherapy, radiation, or surgery). currently, the major tasks in biomedical nanotechnology are (i) to understand how nanoparticles interact with blood, cells, and organs under in vivo physiological conditions and (ii) to overcome one of their inherent limitations, that is, their delivery to diseased sites or organs [ . - ]. another major challenge is to generate critical studies that can clearly link biomarkers with disease behaviors, such as the rate of tumor progression and differential responses to surgery, radiation, or drug therapy [ . ]. the current challenge is, therefore, related to the advancement of biostatistics and biocomputing techniques for the analysis of novel high-throughput biomarkers coming from nanotechnology applications.
current applications involve high-throughput analysis of gene expression data and for multiplexed molecular profiling of intact cells and tissue specimens. the advent of fast and low cost high-throughput diagnostic devices based on ngs approaches appears to be of critical relevance for improving the technology transfer to disease prevention and clinical strategies. the development of nanomaterials and nanodevices offers new opportunities to improve molecular diagnosis, increasing our ability to discover and identify minute alterations in dna, rna, proteins, or other biomolecules. higher sensitivity and selectivity of nanotechnology-based detection methods will permit the recognition of trace amounts of biomarkers which will open extraordinary opportunities for systems biology analysis and integration to elicit effective early detection of diseases and improved therapeutic outcomes; hence paving the way to achieving individualized medicine. effective personalized medicine depends on the integration of biotechnology, nanotechnology, and informatics. bioinformatics and nanobioinformatics are cohesive forces that will bind these technologies together. nanobioinformatics represents the application of information science and technology for the purpose of research, design, modeling, simulation, communication, collaboration, and development of nano-enabled products for the benefit of mankind. within this framework a critical role is played by evaluation and benchmarking approaches according to a robust health technology assessment approach; moreover the development of enhanced data analysis approaches for the integration of multimodal molecular and clinical data should be based on up to date and validated biostatisical approaches. therefore, in the developing nanobiotechnology era, the role of biostatistical support to bioinformatics is definitely essential to prevent loss of money and suboptimal developments of biomarkers and diagnostic disease signature approaches of the past, which followed a limited assessment according to a strict business perspective rather than to social sustainability. concerning the relevance and impact for national health systems, it is forecasted that current omic approaches based on nanobiotechnology will contribute to the identification of next generation diagnostic tests which could be focused on primary to advanced disease prevention by early diagnosis of genetic risk patterns, or the start or natural history of the pathological process of multifactor chronic disease by the multiplexed assessment of both direct and indirect, inner genetic, or environment causal factors. a benefit of such a development would be finally related to the reduction of costs in the diagnostic process since nanobiotechological approaches seem best suited in the perspective of points-of-care poc diagnostic facilities which could be disseminated in large territories with a reduced number of excellence clinical facilities with reference diagnostic protocols. nanomaterials are providing the small, disposable lab-on-chip tests that are leading this new approach to healthcare. a variety of factors are provoking calls for changes in how diagnosis is managed. the lack of infrastructure in the developing world can be added to the inefficiency and cost of many diagnostic procedures done in central labs, rather than by a local doctor. for the developed world, an increasingly elderly population is going to exacerbate demand on healthcare and any time-saving solutions will help deal with this new trend. 
poc devices are looking to reduce the dependence on lab tests and make diagnosis easier, cheaper, and more accessible for countries lacking healthcare infrastructure. a key role in the overall framework will be played by data analysis under principled biostatistical approaches to develop suitable guidelines for data quality analysis, the following extraction of relevant information and communication of the results in an ethical and sustainable perspective for the individual and society. the proper, safe and secure management of personalized data in a robust and shared bioethical reference framework is, indeed, expected to reduce the social costs related to unsuited medicalization through renewed preventive strategies. a strong biostatistical based health technology assessment phase will be essential to avoid the forecasted drawbacks of the introduction of such a revolution in prevention and medicine. to be relevant for national health services, research on biostatistics and bioinformatics applied to nano-biotechnology should exploit its transversal role across multiple applied translational research projects on biomarker discovery, development, and clinical validation until their release for routine application for diagnostic/prognostic aims. objectives that would enable an accelerated framework for translational research since the involvement of quantitative support are listed here: • technological platforms for the developments in the fields of new diagnostic prevention and therapeutic tools. in the context of preventing and treating diseases, the objectives are to foster academic and industrial collaboration through technological platforms where multidisciplinary approaches using cutting edge technologies arising from genomic research may contribute to better healthcare and cost reduction through more precise diagnosis, individualized treatment, and more efficient development pathways for new drugs and therapies (such as the selection of new drug candidates), and other novel products of the new technologies. • patentable products: customized array and multiplex design with internal and external controls for optimized normalization. validation by double checked expression results for genes or protein in the customized array and multiplex assays. patenting of validated tailor-made cdna/proteomic arrays that encapsulate gene/protein signatures related to the response to the therapy with optimized cost/effectiveness properties. a robust, multidisciplinary quantitative assessment framework in translational research is a global need, which should characterize any specific laboratory and clinical translation project. however, the quantitative assessment phase is rarely based on an efficient cooperation between biologists, biotechnologists, and clinicians with biostatisticians, with relevant skills in this field. this represents a major limitation to the rapid transferability of basic research results to healthcare. such a condition is solved in the context of pharmacology in the research and development of new drugs to their assessment in clinical trials, whereas, for diagnostic/prognostic biomarkers, this framework is still to be fully defined. such a gap is wasting resources and is malpractice in the use of biomarkers and related bioprofiles for clinical decision making in critical phases of chronic and acute major diseases like cancer and cardiovascular pathologies. 
cancer statistics
premalignant and in situ breast disease: biology and clinical implications
genetic alteration during colorectal tumor development
the hallmarks of cancer
molecular portraits of human breast tumours
bernards: a gene-expression signature as a predictor of survival in breast cancer
foekens: gene expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer
maqc consortium: the microarray quality control (maqc) project shows inter- and intraplatform reproducibility of gene expression measurements
critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting
high-throughput genomic technology in research and clinical management of breast cancer. exploiting the potential of gene expression profiling: is it ready for the clinic?
ductal epithelial proliferations of the breast: a biologic continuum? comparative genomic hybridization and high-molecular-weight cytokeratin expression patterns
baylin: gene silencing in cancer in association with promoter hypermethylation
dna methylation and histone modification regulate silencing of epithelial cell adhesion molecule for tumor invasion and progression
histone modifications in transcriptional regulation
croce: microrna gene expression deregulation in human breast cancer
putting tumours in context
prospective identification of tumorigenic breast cancer cells
daidone: isolation and in vitro propagation of tumorigenic breast cancer cells with stem/progenitor cell properties
stem cells, cancer, and cancer stem cells
hare: tumour-stromal interactions in breast cancer: the role of stroma in tumourigenesis
know thy neighbor: stromal cells can contribute oncogenic signals
tarin: tumor-stromal interactions reciprocally modulate gene expression patterns during carcinogenesis and metastasis
fusenig: friends or foes - bipolar effects of the tumour stroma in cancer
estrogen carcinogenesis in breast cancer
effects of oestrogen on gene expression in epithelium and stroma of normal human breast tissue
the role of estrogen in the initiation of breast cancer
inflammation and cancer
estrogen-induced rat breast carcinogenesis is characterized by alterations in dna methylation, histone modifications and aberrant microrna expression
molecular classification of cancer: class discovery and class prediction by gene expression monitoring
multiple comparisons: bonferroni corrections and false discovery rates
culhane: comparison and evaluation of methods for generating differentially expressed gene lists from microarray data
a comparison of fold-change and the t-statistic for microarray data analysis
linear models and empirical bayes methods for assessing differential expression in microarray experiments
robustified manova with applications in detecting differentially expressed genes from oligonucleotide arrays
non-parametric manova methods for detecting differentially expressed genes in real-time rt-pcr experiments
gene expression profiling predicts clinical outcome of breast cancer
using biplots to interpret gene expression in plants
botstein: singular value decomposition for genome-wide expression data processing and modeling
epithelial-to-mesenchymal transition, cell polarity and stemness-associated features in malignant pleural mesothelioma
use of biplots and partial least squares regression in microarray data analysis for assessing association between genes involved in different biological pathways
statistical issues in the analysis of chip-seq and rna-seq data
biganzoli: data mining in cancer research
a gene expression database for the molecular pharmacology of cancer
systematic variation in gene expression patterns in human cancer cell lines
wolmark: a multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer
reis-filho: microarrays in the s: the contribution of microarray-based gene expression profiling to breast cancer classification, prognostication and prediction
boracchi: prediction of cancer outcome with microarrays
hill: prediction of cancer outcome with microarrays: a multiple random validation strategy
hochberg: controlling the false discovery rate: a practical and powerful approach to multiple testing
a direct approach to false discovery rates
adaptive control processes: a guided tour
the challenges of integrating multi-omic data sets
searls: data integration: challenges for drug discovery
comparing microarrays and next-generation sequencing technologies for microbial ecology research
rna-seq: an assessment of technical reproducibility and comparison with gene expression arrays
microrna expression profiling reveals mirna families regulating specific biological pathways in mouse frontal cortex and hippocampus
ioannidis: predictive ability of dna microarrays for cancer outcomes and correlates: an empirical assessment
gene expression profiling: does it add predictive accuracy to clinical characteristics in cancer prognosis?
quantum dot bioconjugates for ultrasensitive nonisotopic detection
the use of nanocrystals in biological detection
quantum dots for live cells, in vivo imaging, and diagnostics
in-vivo molecular and cellular imaging with quantum dots
molecular profiling of single cells and tissue specimens with quantum dots
molecular profiling of single cancer cells and clinical tissue specimens with semiconductor quantum dots
bioconjugated quantum dots for multiplexed and quantitative immunohistochemistry
in situ molecular profiling of breast cancer biomarkers with multicolor quantum dots
high throughput quantification of protein expression of cancer antigens in tissue microarray using quantum dot nanocrystals
emerging use of nanoparticles in diagnosis and treatment of breast cancer
superparamagnetic iron oxide contrast agents: physicochemical characteristics and applications in mr imaging
colloidal gold nanoparticles as a blood-pool contrast agent for x-ray computed tomography in mice
quantum-dot nanocrystals for ultrasensitive biological labeling and multicolor optical encoding
contrast ultrasound molecular imaging: harnessing the power of bubbles
light-directed spatially addressable parallel chemical synthesis
bio-barcode-based dna detection with pcr-like sensitivity
microchips as controlled drug delivery devices
ingber: soft lithography in biology and biochemistry
small-scale systems for in vivo drug delivery
real-time monitoring of enzyme activity in a mesoporous silicon double layer
nanotechnologies for biomolecular detection and medical diagnostics
nanosystems biology
nanowire nanosensors for highly sensitive and selective detection of biological and chemical species
thundat: cantilever-based optical deflection assay for discrimination of dna single-nucleotide mismatches
bioassay of prostate-specific antigen (psa) using microcantilevers
micro- and nanocantilever devices and systems for biomolecule detection
viral-induced self-assembly of magnetic nanoparticles allows the detection of viral particles in biological media
high-content analysis of cancer genome dna alterations
label-free biosensing of a gene mutation using a silicon nanowire field-effect transistor
solit: therapeutic strategies for targeting braf in human cancer
electrical detection of vegfs for cancer diagnoses using anti-vascular endothelial growth factor aptamer-modified si nanowire fets
single conducting polymer nanowire chemiresistive label-free immunosensor for cancer biomarker
label-free, electrical detection of the sars virus n-protein with nanowire biosensors utilizing antibody mimics as capture probes
nanogram per milliliter-level immunologic detection of alpha-fetoprotein with integrated rotating-resonance microcantilevers for early-stage diagnosis of hepatocellular carcinoma
transport of molecules, particles, and cells in solid tumors
delivery of molecular and cellular medicine to solid tumors
the next frontier in molecular medicine: delivery of therapeutics
biomedical nanotechnology with bioinformatics - the promise and current progress
key: cord- - c xeqg authors: sokolov, michael title: decision making and risk management in biopharmaceutical engineering—opportunities in the age of covid- and digitalization date: - - journal: ind eng chem res doi: . /acs.iecr. c sha: doc_id: cord_uid: c xeqg in , the covid- pandemic resulted in a worldwide challenge without an evident solution. many persons and authorities involved came to appreciate the value of available data and established expertise to make decisions under time pressure. this omnipresent example is used to illustrate the decision-making procedure in biopharmaceutical manufacturing. this commentary addresses important challenges and opportunities to support risk management in biomanufacturing through a process-centered digitalization approach combining two vital worlds—formalized engineering fundamentals and data empowerment through customized machine learning. with many enabling technologies already available and first success stories reported, it will depend on the interaction of different groups of stakeholders how and when the huge potential of the discussed technologies will be broadly and systematically realized. in , the covid- pandemic resulted in a worldwide challenge without an evident solution. in many countries a large number of restrictive measures have been introduced in order to reduce the rate and extent of the outbreak. the pandemic has spread in a characteristic sequence from asia to europe and then to the rest of the world. the introduced country-specific interventions strongly vary in their severity and timeline across different, even neighboring countries. during such a "lockdown" phase, which in this case usually lasted more than two months, a major goal is to monitor the situation and to collect sufficient data. in the era of digitalization and globalization, several organizations have been able to measure trends based on a daily updated display of available worldwide information. this has been essential to plan next steps, while reducing risks for the health system such as operative or capacity bottlenecks. despite different governmental strategies to address the problem, all of them strongly rely on the concept of social distancing, the adherence to which is difficult to monitor and control across society. such limited control is often present when trying to solve a problem, which is relatively new as well as subject to many potentially influential factors, with some of them difficult to quantify or to predict.
during the covid- crisis, it has become broadly evident how essential the availability of data is for decision-making for complex problems, and how unstable such decision processes can become when the data are biased by uncertainty and lack of prior expertise. although with a different complexity and effect on society, the biopharmaceutical industry faces an analogous uncertaintydriven environment on a daily basis in their workstream. the biopharmaceutical sector is a dominantly growing branch of the pharma industry with prominent blockbuster therapeutic protein products such as humira (adalimumab) and rituxan (rituximab). this industry uses as one of the principle unit operations a biotechnological process based on a living organism to produce highly specific drugs targeting, for example, cancer, autoimmune, and orphan diseases. these bioprocesses are complicated to control and require many cycles of usually quite long experimental investigations. hence, this industry is driven by two opposing objectives: ensuring high drug quality and safety to patients, while competitively reducing time to market and process development and manufacturing costs. hundreds of potentially influential factors in the production process can be taken into account and many tens of them are being broadly monitored and controlled. the main engineering challenges , − are to ( ) robustly control the behavior of the living organism involved in the process, ( ) efficiently align the often heterogeneous data generated across different process units and scales, ( ) include all available prior know-how and experience into the decision process, ( ) reduce human errors and introduced inconsistency, and ( ) enable an automated and adaptive procedure to assess the critical process characteristics. this commentary takes the covid- pandemic as an illustrative example of decision making under uncertainty based on a daily increasing number of available data and know-how. this example will be used throughout the commentary to portray the decision making challenges in the biopharmaceutical industry with the key goals to reflect on the potential of different digital data-and knowledge-driven solutions to support mastering the path toward the standards of industry . . complex problems can be solved efficiently through the support of relevant data and/or through sufficient experience in dealing with similar problems. in both cases, it is essential to evaluate how close prior data and knowledge are to the problem to be solved and how trustworthy these are. in the covid- pandemic strong biases are introduced on the data due to the long incubation time of the virus resulting in a delayed symptomatic response and appearance in the database, inconsistencies in the fatality definitions, and incomplete testing across the population, among others. , this uncertainty is coupled to a lack of governmental and social experience of dealing with pandemics of such broad magnitude. also, the contribution from the scientific community is yet very heterogeneous in focus and suggestions. therefore, data is used as relative trending method, while experience is gradually building up within and across countries. in biopharma, the situation is usually much better as there is less uncertainty on the acquired data and available prior experience on developing or operating similar biopharmaceutical processes. 
nonetheless, the level of uncertainty can be expected considerably larger compared to the closely related small-molecule pharma and general chemistry sectors. figure presents the landscape of different data (in red) and expertise (in yellow) available in bioprocessing together with the duration to generate these and a tendency of their utilization importance for decision making. in the very heterogeneous field of data sources in bioprocessing one important group is the information defined or available before the start of the process (or a certain unit operation) such as the designed set points (experimental design in development or process design space in manufacturing), from here on referred to as z variables. these variables are essential to define an optimal and robust operation strategy for the process, whereby meta data (e.g., information on operator, devices, site specifications) and raw material information are often taken into account significantly less compared to the other variables. the variables labeled with a preceding x are dynamic process measurements through different sensors (online) or offline analytics. these are essential to monitor the process and provide a basis for control. unlike variable-specific sensors such as ph, information from spectroscopes (e.g., raman or nir) has to be linked to the variable of interest through a regression method, which requires additional work to be calibrated. in particular, the profiles of selected characteristic variables for each unit operation are always considered in the decision process, while the inclusion of all other sensor data depends on its importance for process control and direct availability from the historian, that is, the possibility to directly align all available dynamic data sources, which often is not the case. finally, the variables quantifying the product quality are labeled with a y symbol. these are essential to characterize the process outcome and interconnect different unit operations, for example, the impurities produced upstream in the bioreactor to the purification procedure in the following downstream operations. these variables often require a complex analytical procedure over many hours to days. eventually, after many days to few weeks, the first data sets can be obtained for a unit operation and, within months, also the information on several development cycles as well as several unit operations can be available. after many months to several years, a complete development activity toward the manufacturing scale can be obtained. in development and manufacturing, many professionals are involved, who had to go through a long learning procedure of hands-on experience in the lab, working on multiple scales and for the production of different molecules, possibly utilizing different biological organisms, operation strategies and devices. such expertise for the complex underlying processes is built throughout several years. similarly, also the modeling experts capable of formalizing certain mechanistic process behavior and/or statistically deriving process interrelationships (chemometrics) require training of many months to a few years, with the generation of a new predictive model potentially lasting for several months. all such expertise is usually linked to individuals and is not available in a generalized format to other team members. 
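as a minimal sketch of how the three groups of variables described above (design and meta information z, dynamic process measurements x, and product quality attributes y) could be organized for a single bioreactor run, the following python snippet is given; all variable names and values are hypothetical and illustrate only one possible data structure, not an established standard.

# hypothetical container for one bioreactor run; names and values are illustrative only.
from dataclasses import dataclass
import pandas as pd

@dataclass
class BioprocessRun:
    z: dict              # design set points and meta information, fixed per run
    x: pd.DataFrame      # time-resolved online sensor and offline analytics data
    y: dict              # end-of-run product quality attributes

run = BioprocessRun(
    z={"ph_setpoint": 7.0, "temperature_c": 36.5, "feed_strategy": "fed-batch",
       "operator": "a", "raw_material_lot": "lot-001"},
    x=pd.DataFrame({"time_h": [0, 24, 48],
                    "viable_cell_density": [0.5, 2.1, 5.4],
                    "glucose_g_l": [6.0, 4.2, 2.0]}),
    y={"titer_g_l": 3.1, "aggregate_pct": 1.8},
)
print(run.z["feed_strategy"], run.x.shape, run.y["titer_g_l"])

aligning many such runs into a consistent collection of z, x, and y information is typically the first step before any of the modeling approaches discussed below can be applied.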
because of significant time pressure in development and risk mitigation pressure in manufacturing, decisions are often made on an ad hoc basis involving expert meetings where all readily available data, analysis results, and experience sources are taken into account without ensuring consideration of all possible available information hidden in the databases or inside the potential of (not automatically retrained or connected) predictive models. hybrid modeling pursues the goal of synergistically combining available data and know-how as highlighted in the graphical abstract figure. thereby, the know-how is provided as the fundamental backbone based on formalizing central process characteristics (e.g., mass balances) and interrelationships (e.g., characteristic ratios or dimensionless numbers) in broadly valid model equations. as explained in detail by von stosch et al. and narayanan et al., the available data is then used to fine-tune the model parameters to the considered use case and to flexibly adapt to different scenarios. figure compares the concept of hybrid modeling to the two standard modeling approaches, namely purely data-driven (statistical) and purely knowledge-driven (mechanistic) approaches. from a process expert's perspective in assessing the process behavior, one could either rely on statistical methods, which require a lot of data to support decision making, or on deterministic methods, which can only be formalized if a large part of the behavior is well understood. the first approach is strictly limited by the amount of available data due to the large labor cost of each data point produced combined with the complexity of the data. the latter is limited by the generally available understanding of the complex unit operations as well as the availability of an expert for each of them. therefore, a solution based on combining the formalization of the central know-how with the flexible learning of the unknown remainder from the available data removes the need for either large data sets or the continuous involvement of a process expert in order to reliably derive important decision support. from the perspective of the decision making stakeholders such an algorithmic solution provides a trustworthy decision base with less effort, that is, less time and labor to conduct experiments to create the central know-how and less labor time of an expert to correctly structure that information for the decision making perspective. figure supports the explanation of the central role of hybrid modeling in improving trustworthiness and decision support in process design and failure detection. standard design of experiments methods try to explain the product information (y) based on the design process factors (z) with a so-called "black-box" approach, that is, without specifically integrating mechanistic process information or the dynamic process information (x). taking, simplistically, the covid- example, this could be interpreted as trying to understand how certain imposed restrictions affect the final lethality of the period under these restrictions without taking into account all the trackable societal behavior in the meantime. the x information therefore bears a central possibility to better understand characteristic dynamic scenarios and patterns, which might result in a different final outcome y. hybrid modeling enables a simulation, based on the integrated know-how, of how for given initial conditions z a process could evolve.
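to make the hybrid idea more concrete, the following python sketch combines a simple mechanistic backbone (batch mass balances for biomass and substrate) with a data-driven specific growth rate fitted to a few observed values. the kinetic form, the parameter values, and the use of scipy and scikit-learn are assumptions of this illustration and do not reproduce the models of the cited works.

# illustrative hybrid structure: mechanistic balances plus a data-driven rate term.
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.linear_model import Ridge

# data-driven part: specific growth rate as a function of substrate concentration,
# fitted here with ridge regression on a handful of (hypothetical) observed rates.
s_obs = np.array([0.2, 0.5, 1.0, 2.0, 4.0, 6.0])
mu_obs = np.array([0.010, 0.020, 0.030, 0.040, 0.045, 0.047])
features = np.column_stack([s_obs, s_obs / (1.0 + s_obs)])
mu_model = Ridge(alpha=1e-3).fit(features, mu_obs)

def mu(s):
    # predicted specific growth rate, clipped at zero once substrate is depleted
    f = np.array([[s, s / (1.0 + s)]])
    return max(mu_model.predict(f)[0], 0.0)

# mechanistic backbone: batch mass balances for biomass x and substrate s
def balances(t, state, yield_coeff=0.5):
    x, s = state
    growth = mu(s) * x
    return [growth, -growth / yield_coeff]

z = {"x0": 0.3, "s0": 6.0}   # initial conditions taken from the process design (z)
sol = solve_ivp(balances, (0.0, 120.0), [z["x0"], z["s0"]],
                t_eval=np.linspace(0.0, 120.0, 25))
print("simulated final biomass:", round(sol.y[0, -1], 2))

with such a simulator in place, different candidate designs z can be screened by running the simulation and passing the simulated trajectory to a second, historically trained model for the final quality y, in line with the two-step procedure described next.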
in the second step this predicted evolution is linked, based on a historical model, to the final product quality y. hence, while retaining the general goal of finding the optimal conditions z to reach the target outcome y optimal , this procedure adds substantial certainty about the final outcome compared to the black-box approach, based on a knowledge-supported, projected architecture of possible bridges between process start and end. in the absence of long-standing experience in the problem field, such dynamic progression can also be simulated based on simple, yet effective dynamic interrelationships or based on stochastic approaches, which are both also being utilized in modeling the covid- pandemic. in the past few years, hybrid modeling has become increasingly popular in the bioprocessing domain, leading it to be considered a main new direction. hybrid modeling, in the context of therapeutic protein manufacturing models, demonstrated its enabling potential in applications such as monitoring and forecasting, control, optimization, and also in downstream processing. even though real-world problems seldom exist in isolation, heterogeneity is often a governing factor in finding a solution. this means that despite the availability of some prior data and/or know-how, their alignment is complex due to structural or phenomenological differences. the characteristic sequence of the covid- pandemic spread provides countries affected at later stages the possibility to learn from the data of previously affected ones. such learning is obviously limited as countries strongly vary in organization and capacity of their health system, population size and density, and geographical location, etc. nonetheless, certain effects such as characteristic symptoms, contagiousness, risk groups, lethality, etc. could be identified even without or with limited data available. human beings possess a powerful cognitive ability for such knowledge transfer, while traditional optimization solvers usually lack such ability in their search strategy. in machine learning such a concept is described as transfer learning, where one "generic" part of the model, usually the first layers of a convolutional neural net, is learnt from generally available data and then the data of a specific system is used to fine-tune the model to that specific use case. figure . enabling possibility of hybrid modeling to learn process dynamics and support forecast of final product quality. such a two-step procedure enables complete simulation of process and product quality based on different process designs and optimizing the design space to reach optimal product quality. in many engineering sectors including biopharmaceutical processes, process development or operations, teams are exposed to new entities such as new cell lines to be used in the bioprocess, new column material for purification or abnormal effects such as unusual levels or profiles of characteristic process variables in the manufacturing plant.
it is an infrequent practice to rigorously include any similar data and know-how from previous activities directly into the decision process due to often severe levels of heterogeneity, which can be a result of (partially) different utilized devices, scales, and materials as well as differently structured or quantified data. a smart digital solution enabling automatic leveraging of available prior information from heterogeneous sources by reliably deducing the transferrable know-how could enable a tremendous breakthrough for supporting complex decision making in biopharma manufacturing. the general structure of hybrid models is quite attractive for applying such a transfer learning concept in small data environments, where the mechanistic backbone accounts for major generic effects while the machine learning part enables fine-tuning based on the limited available data, for instance to a specific molecule. in process development such an approach capitalizes strongly on all available information from previous development activities. this could not only support pharma companies, but particularly also contract manufacturing organizations (cmos), the business model of which scales even more with delivering on time, and which are exposed to a large level of diversity. of course, a beneficial implementation of broadly applicable transfer learning must go through a rigorous digitalization and integration of all data archives, which requires a tremendous preparation and investment. other industries such as finance have demonstrated the impact and potential of such digital transformation. although hybrid modeling has not been reported in direct connection with transfer learning in biomanufacturing, several transfer use cases based on data-driven techniques have been already conducted to further explore and adapt model-supported transfer learning in biopharma. examples include extrapolation from low to high performing conditions with hybrid models, cross-scale prediction, and cross-molecule prediction with multivariate and with adaptive machine learning techniques. it goes without saying that, in a situation such as the covid- pandemic, a trustworthy forecast of the near future would be priceless. thereby, one has to comment that it is not only the knowledge of the future evolution that counts but also the underlying understanding of its relation to the introduced regulations. this understanding is vital to make a solid decision among many potential alternatives. hence, taking the definitions in figure , at each point of time one would like to understand how changes in z affect the process outcome y, and which combination of z is optimal to reach the desired outcome y optimal . while this theoretically represents a classical optimization problem, in the process development lab and on the manufacturing floor, the different teams require a practically relevant representation of such a solution which is connected as much as possible to their daily workstream and mindset as well as associated decision making process. the possibility to simulate different future scenarios and compare the results must therefore be presented in a visually comprehensible, tangible, manageable, and transferrable form. figure presents the added value generated by different levels of technological complexity and integration of digital solutions in biopharmaceutical manufacturing.
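returning to the transfer learning concept discussed above, the following minimal sketch pretrains a small neural network on synthetic historical runs of other molecules and then continues training it on a few runs of a new molecule via warm-starting; warm-starting the whole network is a simplification of the idea of reusing generic layers, and all data, shifts, and hyperparameters are illustrative assumptions rather than values from the cited studies.

# illustrative transfer learning sketch with synthetic process data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

def make_runs(n, shift):
    # synthetic runs: 4 process descriptors -> final titer, with a molecule-specific shift
    X = rng.normal(size=(n, 4))
    y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + shift + rng.normal(scale=0.1, size=n)
    return X, y

X_hist, y_hist = make_runs(500, shift=0.0)    # large archive of historical molecules
X_new, y_new = make_runs(8, shift=0.7)        # only a few runs of the new molecule

# pretrain on the archive, then continue training (warm start) on the small new data set
model = MLPRegressor(hidden_layer_sizes=(16,), warm_start=True,
                     max_iter=2000, random_state=0)
model.fit(X_hist, y_hist)
model.max_iter = 200
model.fit(X_new, y_new)                       # fine-tuning step

X_test, y_test = make_runs(50, shift=0.7)
print("r2 on the new molecule after fine-tuning:", round(model.score(X_test, y_test), 2))

the design choice illustrated here is that the archive supplies the generic input-output structure, while the few new-molecule runs only shift the model toward the new entity, which is the practical appeal of transfer approaches in small-data bioprocess settings.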
as highlighted in figure , even further value can be created not only if the potential of hybrid modeling and transfer learning is assessable through a practically designed digital twin but also if such a digital twin is directly connected to the process and becomes an active stakeholder of the decision making process. being set up in real-time connection to all data generating devices, all accessible and consistently learning models, and the process control layer, such a digital twin can not only provide predictive-model-based real-time alerts, but also automatically take actions based on optimization across different scenarios. while at the process development level such digital twin-based controls could be used for efficient process design, at the manufacturing level, predictive quality and predictive manufacturing are likely to be the central applications. it is important to highlight that such digital twins should be realized across all interconnected unit operations, to enable communication, scheduling, and optimization across the entire plant. another very important application in bioprocessing is smart operation of parallel high-throughput experimental systems. here such digital twins can efficiently learn across all of the ongoing operations and reduce (in real-time) the redundant information, while consistently redesigning the experiments to provide further knowledge. in such cases, experimental systems and digital twins must actively collaborate on simultaneously improving process understanding as well as the process itself. in regulated industries, if human health or even survival is affected by the decision process, such decisions must be accurately documented, validated, and surveilled. in biopharmaceutical manufacturing, health authorities impose stringent regulations on the process design to ensure consistent product quality. thereby, regulations such as the quality guidelines by the international council for harmonisation of technical requirements for pharmaceuticals for human use also actively incentivize the utilization of model-based solutions to support understanding and operation of the complex processes. the smart digital solution-enabled stabilization of decisions through robust learning from previous know-how and data should be positively embraced by drug producers as well as health authorities. however, in manufacturing operations which are based on decisions either actively introduced or supported by such models, a detailed assessment of these smart digital solutions is required. this will inevitably result in a critical confrontation of smart manufacturing procedures and smart humans. it can be expected that a growing number of companies will increasingly utilize advanced predictive solutions besides the commonly utilized, static multivariate techniques, which given their linear nature are much simpler to validate for good manufacturing practice (gmp) utility. this experience will very likely provide more clarity on the limits of complexity which can be introduced into such smart digital solutions to ensure transparency and trackability for health authorities, but also on the general filing procedure of a highly interconnected, digital-twin-supervised manufacturing facility.
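the alerting behavior mentioned above can be outlined in a few lines; the following sketch assumes a placeholder predictive model, hypothetical variable names, and fixed acceptance limits, and only illustrates the control-chart-like logic rather than any validated gmp implementation.

# illustrative monitoring loop; model, limits, and readings are placeholders.
from typing import Callable, Dict, List

def monitor(readings_stream: List[Dict[str, float]],
            predict_final_quality: Callable[[Dict[str, float]], float],
            lower: float, upper: float) -> List[str]:
    # raise an alert whenever the forecast final quality leaves [lower, upper]
    alerts = []
    for t, readings in enumerate(readings_stream):
        forecast = predict_final_quality(readings)
        if not (lower <= forecast <= upper):
            alerts.append(f"t={t}: forecast quality {forecast:.2f} outside "
                          f"[{lower}, {upper}] -> trigger corrective action")
    return alerts

# stand-in model: final quality assumed to degrade with accumulated lactate
toy_model = lambda r: 5.0 - 0.8 * r["lactate_g_l"]
stream = [{"lactate_g_l": 1.0}, {"lactate_g_l": 2.5}, {"lactate_g_l": 4.0}]
print(monitor(stream, toy_model, lower=2.0, upper=6.0))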
the company specializes in process digitalization, data analytics, and modeling with a particular focus on the biopharmaceutical domain. he also holds a lecturer position for statistics for chemical engineers at eth and continues collaborating on academic projects in his field of expertise. this invited contribution is part of the i&ec research special issue for the class of influential researchers. the author expresses deep gratitude to his colleagues and collaborators to jointly develop the vision on the potential of digitalization in bioengineering expressed in this commentary. the cost of staying open: voluntary social distancing and lockdowns in the us. ssrn electron how will country-based mitigation measures influence the course of the covid- epidemic? ( ) who. coronavirus disease a fiasco in the making? as the coronavirus pandemic takes hold, we are making decisions without reliable data. gv wire fed-batch and perfusion culture processes: economic, environmental, and operational feasibility under uncertainty the market of biopharmaceutical medicines: a snapshot of a diverse industrial landscape biopharmaceutical benchmarks evolving trends in mab production processes production of protein therapeutics in the quality by design (qbd) paradigm. top engineering challenges in therapeutic protein product and process design big data in biopharmaceutical process development: vice or virtue? workflow for criticality assessment applied in biopharmaceutical process validation stage covid- -navigating the uncharted covid- in italy: momentous decisions and many uncertainties. the lancet global health the scientific literature on coronaviruses, covid- and its associated safety-related research dimensions: a scientometric analysis and scoping review hybrid semi-parametric modeling in process systems engineering: past, present and future a new generation of predictive models: the added value of hybrid models for manufacturing processes of therapeutic proteins model-based methods in the biopharmaceutical process lifecycle identification of manipulated variables for a glycosylation control strategy application of quality by design to the characterization of the cell culture process of an fc-fusion protein industrial & engineering chemistry research pubs.acs.org/iecr commentary enhanced process understanding and multivariate prediction of the relationship between cell culture process and monoclonal antibody quality insights into the dynamics and control of covid- infection rates early dynamics of transmission and control of covid- : a mathematical modelling study hybrid modeling for quality by design and pat-benefits and challenges of applications in biopharmaceutical industry hybrid modeling as a qbd/pat tool in process development: an industrial e. 
hybrid-ekf: hybrid model coupled with extended kalman filter for real-time monitoring and control of mammalian cell culture
quality by control: towards model predictive control of mammalian cell culture bioprocesses
a general hybrid semi-parametric process control framework
hybrid metabolic flux analysis/data-driven modelling of bioprocesses
systematic interpolation method predicts protein chromatographic elution from batch isotherm data without a detailed mechanistic isotherm model
insights on transfer optimization: because experience is the best teacher
deep convolutional neural networks for computer-aided detection: cnn architectures, dataset characteristics and transfer learning
role of knowledge management in development and lifecycle management of biopharmaceuticals
machine learning: overview of the recent progresses and implications for the process systems engineering field
provable data integrity in the pharmaceutical industry based on version control systems and the blockchain
organizational transformation for sustainable development: a case study, management of permanent change
digital finance and fintech: current research and future research directions
cross-scale predictive modeling of cho cell culture growth and metabolites using raman spectroscopy and multivariate analysis
on-line and real-time prediction of recombinant antibody titer by in situ raman spectroscopy
sequential multivariate cell culture modeling at multiple scales supports systematic shaping of a monoclonal antibody toward a quality target
a machine-learning approach to calibrate generic raman models for real-time monitoring of cell culture processes
accelerating biologics manufacturing by modeling or: is approval under the qbd and pat approaches demanded by authorities acceptable without a digital-twin? processes
process-wide control and automation of an integrated continuous manufacturing platform for antibodies
integrated optimization of upstream and downstream processing in biopharmaceutical manufacturing under uncertainty: a chance constrained programming approach
online optimal experimental re-design in robotic parallel fed-batch cultivation facilities
monitoring parallel robotic cultivations with online multivariate analysis
commentary: the smart human in smart manufacturing
key: cord- - w cam authors: fang, zhichao; costas, rodrigo; tian, wencan; wang, xianwen; wouters, paul title: an extensive analysis of the presence of altmetric data for web of science publications across subject fields and research topics date: - - journal: scientometrics doi: . /s - - - sha: doc_id: cord_uid: w cam sufficient data presence is one of the key preconditions for applying metrics in practice. based on both altmetric.com data and mendeley data collected up to , this paper presents a state-of-the-art analysis of the presence of kinds of altmetric events for nearly . million web of science publications published between and . results show that even though an upward trend of data presence can be observed over time, except for mendeley readers and twitter mentions, the overall presence of most altmetric data is still low. the majority of altmetric events go to publications in the fields of biomedical and health sciences, social sciences and humanities, and life and earth sciences.
as to research topics, the level of attention received by research topics varies across altmetric data, and specific altmetric data show different preferences for research topics, on the basis of which a framework for identifying hot research topics is proposed and applied to detect research topics with higher levels of attention garnered on certain altmetric data source. twitter mentions and policy document citations were selected as two examples to identify hot research topics of interest of twitter users and policy-makers, respectively, shedding light on the potential of altmetric data in monitoring research trends of specific social attention. ever since the term "altmetrics" was coined in jason priem's tweet in , a range of theoretical and practical investigations have been taking place in this emerging area (sugimoto et al. ) . given that many types of altmetric data outperform traditional citation counts with regard to the accumulation speed after publication , initially, altmetrics were expected to serve as faster and more fine-grained alternatives to measure scholarly impact of research outputs (priem et al. (priem et al. , . nevertheless, except for mendeley readership which was found to be moderately correlated with citations zahedi and haustein ) , a series of studies have confirmed the negligible or weak correlations between citations and most altmetric indicators at the publication level (bornmann b; costas et al. ; de winter ; zahedi et al. ) , indicating that altmetrics might capture diverse forms of impact of scholarship which are different from citation impact (wouters and costas ) . the diversity of impact beyond science reflected by altmetrics, which is summarized as "broadness" by bornmann ( ) as one of the important characteristics of altmetrics, relies on diverse kinds of altmetric data sources. altmetrics do not only include events on social and mainstream media platforms related to scholarly content or scholars, but also incorporate data sources outside the social and mainstream media ecosystem such as policy documents and peer review platforms . the expansive landscape of altmetrics and their fundamental differences highlight the importance of keeping them as separate entities without mixing, and selecting datasets carefully when making generalizable claims about altmetrics (alperin ; wouters et al. ) . in this sense, data presence, as one of the significant preconditions for applying metrics in research evaluation, also needs to be analyzed separately for various altmetric data sources. bornmann ( ) regarded altmetrics as one of the hot topics in the field of scientometrics for several reasons, being one of them that there are large altmetric data sets available to be empirically analyzed for studying the impact of publications. however, according to existing studies, there are important differences of data coverage across diverse altmetric data. in one of the first, thelwall et al. ( ) conducted a comparison of the correlations between citations and categories of altmetric indicators finding that, except for twitter mentions, the coverage of all selected altmetric data of pubmed articles was substantially low. this observation was reinforced by other following studies, which provided more evidence about the exact coverage for web of science (wos) publications. based on altmetric data retrieved from impactstory (is), zahedi et al. ( ) reported the coverage of four types of altmetric data for a sample of wos publications: mendeley readers ( . 
%), twitter mentions ( . %), wikipedia citations ( . %), and delicious bookmarks ( . %). in a follow-up study using altmetric data from altmetric.com, costas et al. ( ) studied the coverage of five altmetric data for wos publications: twitter mentions ( . %), facebook mentions ( . %), blogs citations ( . %), google+ mentions ( . %), and news mentions ( . %). they also found that research outputs in the fields of biomedical and health sciences and social sciences and humanities showed the highest altmetric data coverage in terms of these five altmetric data. similarly, it was reported by haustein et al. ( ) that the coverage of five social and mainstream media data for wos papers varied as follows: twitter mentions ( . %), facebook mentions ( . %), blogs citations ( . %), google + mentions ( . %), and news mentions ( . %). in addition to aforementioned large-scale research on wos publications, there have been also studies focusing on the coverage of altmetric data for research outputs from a certain subject field or publisher. for example, on the basis of selected journal articles in the field of humanities, hammarfelt ( ) investigated the coverage of five kinds of altmetric data, including mendeley readers ( . %), twitter mentions ( . %), citeulike readers ( . %), facebook mentions ( . %), and blogs citations ( . %). waltman and costas ( ) found that just about % of the publications in the biomedical literature received at least one f prime recommendation. for papers published in the public library of science (plos) journals, bornmann ( a) reported the coverage of a group of altmetric data sources tracked by plos's article-level metrics (alm). since the data coverage is a value usually computed for most altmetric studies, similar coverage levels are found scattered across many other studies as well (alperin ; fenner ; robinson-garcía et al. ) . by summing up the total number of publications and those covered by altmetric data in related studies, erdt et al. ( ) calculated the aggregated percentage of coverage for altmetric data. their aggregated results showed that mendeley readers covers the highest share of publications ( . %), followed by twitter mentions ( . %) and cit-eulike readers ( . %), while other altmetric data show relatively low coverage in general (below %). the distributions of publications and article-level metrics across research topics are often uneven, which has been observed through the lens of text-based (gan and wang ) , citation-based (shibata et al. ) , usage-based (wang et al. ) , and altmetric-based (noyons ) approaches, making it possible to identify research topics of interest in different contexts, namely, the identification of hot research topics. by combining the concept made by tseng et al. ( ) , hot research topics are defined as topics that are of particular interest to certain communities such as researchers, twitter users, wikipedia editors, policy-makers, etc. thus, hot is defined as the description of a relatively high level of attention that research topics have received on different altmetric data sources. attention here is understood as the amount of interactions that different communities have generated around research topics, therefore those topics with high levels of attention can be identified and characterized as hot research topics from an altmetric point of view. 
traditionally, several text-based and citation-based methodologies have been widely developed and employed in detecting research topics of particular interest to researchers, like co-word analysis (ding and chen ; lee ) , direct citation and co-citation analysis (chen ; small ; small et al. ) , and the "core documents" based on bibliographic coupling (glänzel and czerwon ; glänzel and thijs ) , etc. besides, usage metrics, which are generated by broader sets of users through various behaviors such as viewing, downloading, or clicking, have been also used to track and identify hot research topics. for example, based on the usage count data provided by web of science, detected hot research topics in the field of computational neuroscience, which are listed as the keywords of the most frequently used publications. by monitoring the downloads of publications in scientometrics, wang et al. ( ) identified hot research topics in the field of scientometrics, operationalized as the most downloaded publications in the field. from the point of view that altmetrics can capture the attention around scholarly objects from broader publics (crotty ; sugimoto ) , some altmetric data were also used to characterize research topics based on the interest exhibited by different altmetric and social media users. for example, robinson-garcia et al. ( ) studied the field of microbiology to map research topics which are highly mentioned within news media outlets, policy briefs, and tweets over time. zahedi and van eck ( ) presented an overview of specific topics of interest of different types of mendeley users, like professors, students, and librarians, and found that they show different preferences in reading publications from different topics. identified research topics of publications that are faster to be mentioned by twitter users or cited by wikipedia page editors, respectively. by comparing the term network based on author keywords of climate change research papers, the term network of author keywords of those tweeted papers, and the network of "hashtags" attached to related tweets, haunschild et al. ( ) concluded that twitter users are more interested in topics about the consequences of climate change to humans, especially those papers forecasting effects of a changing climate on the environment. although there are multiple previous studies discussing the coverage of different altmetric data, after nearly years of altmetric research, we find that a renewed large-scale empirical analysis of the up-to-date presence of altmetric data for wos publications is highly relevant. particularly, since amongst previous studies, there still exist several types of altmetric data sources that have not been quantitatively analyzed. moreover, although the correlations between citations and altmetric indicators have been widely analyzed at the publication level in the past, the correlations of their presence at the research topic level are still unknown. to fill these research gaps, this paper presents a renovated analysis of the presence of various altmetric data for scientific publications, together with a more focused discussion about the presence of altmetric data across broad subject fields and smaller research topics. 
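building on these observations, one simple (and by no means unique) way to operationalize "hot" research topics on a given altmetric source is to rank micro-topics by the share of their publications that received at least one event and by the average number of events per publication; the snippet below sketches this with synthetic topic labels and counts, and the use of pandas is an assumption of the illustration rather than a method stated in the text.

# illustrative ranking of (synthetic) micro-topics by twitter attention.
import pandas as pd

pubs = pd.DataFrame({
    "micro_topic": ["topic_a", "topic_a", "topic_b", "topic_b", "topic_b", "topic_c"],
    "twitter_mentions": [3, 0, 1, 5, 2, 0],
})

per_topic = pubs.groupby("micro_topic")["twitter_mentions"].agg(
    n_pubs="size",
    coverage=lambda s: (s > 0).mean(),     # share of publications with >= 1 mention
    density="mean",                        # mentions per publication
)
hot_topics = per_topic.sort_values(["coverage", "density"], ascending=False)
print(hot_topics)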
the main objective of this study is two-fold: ( ) to reveal the development and current situation of the presence of altmetric data across publications and subject fields, and ( ) to explore the potential application of altmetric data in identifying and tracking research trends that are of interest to certain communities such as twitter users and policy-makers. the following specific research questions are put forward: rq . compared to previous studies, how the presence of different altmetric data for wos publications has developed until now? what is the difference of altmetric data presence across wos publications published in different years? rq . how is the presence of different altmetric data across subject fields of science? for each type of altmetric data, which subject fields show higher levels of data prevalence? rq . how are the relationships among various altmetric and citation data in covering different research topics? based on specific altmetric data, in each subject field which research topics received higher levels of altmetric attention? a total of , , wos papers published between and were retrieved from the cwts in-house database. since identifiers are necessary for matching papers with their altmetric data, only publications with a digital object identifier (doi) or a pubmed identifier (pubmed id) recorded in wos were considered. using the two identifiers, wos papers were matched with types of altmetric data from altmetric.com and mendeley readership as listed in table . the data from altmetric.com were extracted from a research snapshot file with data collected up to october . mendeley readership data were separately collected through the mendeley api in july . altmetric.com provides two counting methods of altmetric performance for publications, including the number of each altmetric event that mentioned the publication and the number of unique users who mentioned the publication. to keep a parallelism with mendeley readership, which is counted at the user level, the number of unique users was selected as the indicator for counting altmetric events in this study. for selected publications, the total number of events they accumulated on each altmetric data source are provided in table as well. besides, we collected the wos citation counts in october for the selected publications. citations serves as a benchmark for a better discussion and understanding of the presence and distribution of altmetric data. to keep the consistency with altmetric data, a variable citation time window from the year of publication to was utilized and selfcitations were considered for our dataset of publications. to study subject fields and research topics, we employed the cwts classification system (also knowns as the leiden ranking classification). waltman and van eck ( ) developed this publication-level classification system mainly for citable wos publications (article, review, letter) based on their citation relations. in its version, publications are clustered into micro-level fields of science with similar research topics (here and after known as micro-topics) as shown in fig. with vosviewer. for each micro-topic, the top five most characteristic terms are extracted from the titles of the publications in order to label the different micro-topics. 
furthermore, these micro-topics are algorithmically assigned to five main subject fields of science: social sciences and humanities (ssh), biomedical and health sciences (bhs), physical sciences and engineering (pse), life and earth sciences (les), and mathematics and computer science (mcs). the cwts classification system has been applied not only in the leiden ranking (https://www.leidenranking.com/) but also in many previous studies related to subject field analyses (e.g., didegah and thelwall ; zahedi and van eck ) . a total of , , of the initially selected publications (accounting for . %) have cwts classification information. this set of publications was used as a subset for the comparison of altmetric data presence across subject fields and research topics. table presents the number of selected publications in each main subject field. in order to measure the presence of different kinds of altmetric data or citation data across different sets of publications, we employed the three indicators proposed by haustein et al. : coverage, density, and intensity. coverage (c) indicates the percentage of publications with at least one altmetric event (or one citation) recorded in the set of publications; its value therefore ranges from 0 to 100%. the higher the coverage, the higher the share of publications with altmetric event data (or citation counts). density (d) is the average number of altmetric events (or citations) of the set of publications. both publications with altmetric events (or citations) and those without any are considered in the calculation of density, so it is heavily influenced by the coverage and by zero values. the higher the value of density, the more altmetric events (or citations) received by the set of publications on average. intensity (i) is defined as the average number of altmetric events (or citations) of publications with at least one altmetric event (or citation) recorded. different from d, the calculation of i only takes publications with non-zero values into consideration, so its value must be greater than or equal to one; only for groups of publications without any altmetric events (or citations) is the intensity set to zero by default. the higher the value of intensity, the more altmetric events (or citations) that have occurred around the publications with altmetric/citation data on average. in order to reveal the relationships among these three indicators at the research topic level, as well as the relationships of preferences for research topics among different data, spearman correlation analyses were performed with ibm spss statistics . this section consists of four parts: the first presents the overall presence of altmetric data for the whole set of wos publications (in contrast with previous studies) and the evolution of altmetric data presence over publication years. the second part compares the altmetric data presence of publications across the five main subject fields of science. the third part focuses on the differences in preferences of altmetric data for research topics. in the fourth part, twitter mentions and policy document citations are selected as two examples for identifying hot research topics with higher levels of altmetric attention received. coverage, density, and intensity of the sources of altmetric data and citations were calculated for the nearly . million sample wos publications to reveal their overall presence.
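to make the three indicators concrete, here is a minimal sketch that computes coverage, density, and intensity from a list of per-publication event counts; the counts are toy values, and in practice they would come from the matched altmetric records described above.

```python
# minimal sketch of the three presence indicators; the counts below are toy values
def coverage(counts):
    """share of publications with at least one event (or citation), in percent."""
    return 100.0 * sum(1 for c in counts if c > 0) / len(counts)

def density(counts):
    """average number of events over all publications, zeros included."""
    return sum(counts) / len(counts)

def intensity(counts):
    """average number of events over publications with at least one event;
    zero by default if no publication in the set has any event."""
    nonzero = [c for c in counts if c > 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0

# e.g. twitter mention counts for five publications in some set
twitter_counts = [0, 0, 3, 1, 8]
print(coverage(twitter_counts))   # 60.0 -> 60% of publications were tweeted at least once
print(density(twitter_counts))    # 2.4  -> mentions per publication, zeros included
print(intensity(twitter_counts))  # 4.0  -> mentions per tweeted publication
```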
table presents not only the results based on our dataset but also, for comparability purposes, the findings on data coverage (c_ref) reported by previous altmetric empirical studies that also used altmetric.com (and the mendeley api for mendeley readership) as the altmetric data source and wos as the database for scientific publications, without applying restrictions to a certain discipline, country, or publisher. as these previous studies analyzed datasets with sizes, publication years (py), and data collection years (dy) different from ours, we present them as references for discussing the retrospective historical development of altmetric data prevalence. according to the results, the presence of different altmetric data varies greatly. mendeley readership provides the largest values of coverage ( . %), density ( . ), and intensity ( . ), even higher than citations. the presence of the other altmetric data is much lower than that of mendeley readers and citations. twitter mentions hold the second largest values among the other altmetric data, with . % of publications mentioned by twitter users, and those mentioned publications accrued about . twitter mentions on average. it is followed by several social and mainstream media data, like facebook mentions, news mentions, and blog citations. about . % of publications have been mentioned on facebook, . % have been mentioned by news outlets, and . % have been cited by blog posts. among these three data sources, publications mentioned by news outlets accumulated more intensive attention, given the higher value of intensity ( . ), which means that the mentioned publications got more news mentions on average. in contrast, even though more publications are mentioned on facebook, they received fewer mentions at the individual publication level (with an intensity value of . ). for the remaining altmetric data, the coverage values are extremely low (see table , the overall presence of types of altmetric data and citation data). wikipedia citations and policy document citations only covered . % and . % of the sample publications, respectively, while the coverage of reddit mentions, f prime recommendations, video comments, peer review comments, and q&a mentions is lower than %. in terms of these data, the altmetric data of publications are seriously zero-inflated. compared to the coverage reported by previous studies, an increasing trend of altmetric data presence can be observed over time. mendeley, twitter, facebook, news, and blogs are the most studied altmetric data sources. on the whole, the more recent the studies, the higher the values of coverage they report. our results show one of the highest levels of data presence for most altmetric data. although the coverage of twitter mentions, news mentions, and reddit mentions reported by meschede and siebenlist ( ) is slightly higher than ours, it should be noted that they used a random sample consisting of wos papers published in , and, as shown in fig. , there exist biases toward publication years when investigating data presence for altmetrics. after calculating the three indicators for the research outputs in each publication year, fig. shows the trends in the presence of altmetric data. overall, there are two types of tendencies for the altmetric data, which correspond to the accumulation velocity patterns identified in previous research.
thus, for altmetric data that accumulate quickly, such as twitter mentions, facebook mentions, news mentions, blog citations, and reddit mentions, newly published publications have higher coverage levels. in contrast, altmetric data that take a longer time to accumulate (i.e., the "slow" sources) tend to accumulate more prominently for older publications. wikipedia citations, policy document citations, f prime recommendations, video comments, peer review comments, and q&a mentions fall into this slower category; as a matter of fact, their temporal distribution patterns more closely resemble that of citation counts. regarding mendeley readers, although coverage remains quite high in every publication year, it shows a downward trend for recent years, as citations do, indicating a kind of readership delay by which newly published papers take time to accumulate mendeley readers (haustein et al. ; thelwall ; zahedi et al. ). in general, publications in the fields of natural sciences and medical and health sciences receive more citations (marx and bornmann ) , but for altmetric data the distribution across subject fields shows another picture. as shown in fig. , on the basis of our dataset, it is confirmed that publications in the subject fields of bhs, pse, and les hold the highest presence of citation data, and publications in the fields of ssh and mcs accumulated clearly fewer citation counts. however, as observed by costas et al. ( ) for twitter mentions, facebook mentions, news mentions, blog citations, and google+ mentions, most altmetric data in fig. are more likely to concentrate on publications from the fields of bhs, ssh, and les, while pse publications lose the advantage in attracting attention that they show in terms of citations, thereby performing as weakly in altmetric data presence as mcs publications do (fig. : the presence of altmetric data and citations of scientific publications across five subject fields). peer review activity tracked by altmetric.com is an aggregation of two platforms: publons and pubpeer. in our dataset, there are , distinct publications with altmetric peer review data for the analysis of data presence across subject fields, of them (accounting for . %) having peer review comments from publons and , of them (accounting for . %) having peer review comments from pubpeer ( publications have been commented on by both). if we only consider the publications with publons data, bhs publications and les publications contribute the most (accounting for . % and . %, respectively), which is in line with ortega ( )'s results about publons on the whole. nevertheless, pubpeer data, which covers more of the publications recorded by altmetric.com, is biased towards ssh publications. ssh publications make up as much as . % of all publications with pubpeer data, followed by bhs publications (accounting for . %); given the relatively small number of wos publications in the field of ssh, this leads to the overall high coverage of peer review comments for ssh publications. moreover, given that the distributions of altmetric data are highly skewed, with the majority of publications receiving only very few altmetric events (see appendix ), the density and intensity of altmetric data with relatively small data volume are very close across subject fields. but in terms of intensity, there exist some remarkable subject field differences for some altmetric data.
for example, on reddit, ssh publications received more intensive attention than those from other subject fields, as reflected by their higher value of intensity. by comparison, les and pse publications cited by wikipedia pages accumulated more intensive attention; even though the coverage of wikipedia citations for pse publications is rather low, the covered publications are cited more repeatedly. due to the influence of the highly skewed distribution of altmetric data (see appendix ) on the calculation of coverage and density, these two indicators are strongly correlated at the micro-topic level for all kinds of altmetric data (see appendix ). in comparison, the correlation between coverage and intensity is considerably weaker. moreover, in an explicit way, coverage tells how many publications around a micro-topic have been mentioned or cited at least once, and intensity describes how frequently those publications with altmetric data or citation data have been mentioned or cited. consequently, for a specific micro-topic, these two indicators reflect the degree of broadness (coverage) and the degree of deepness (intensity) of its received attention. therefore, we employed coverage and intensity to investigate the presence of altmetric data at the micro-topic level and to identify research topics with higher levels of attention received on different data sources. coverage and intensity values were calculated and appended to micro-topics based on the different types of altmetric and citation data, and spearman correlation analyses were then performed at the micro-topic level between each pair of data. figure illustrates the spearman correlations of coverage amongst citations and the types of altmetric data at the micro-topic level, as well as those of intensity. the higher the correlation coefficient, the more similar the presence patterns across micro-topics between two types of data. discrepancies in the correlations can be understood as differences in the relevance of each pair of data for micro-topics; some pairs of data with stronger correlations may have a more similar preference for the same micro-topics, while those with relatively weaker correlations focus on more dissimilar micro-topics. through the lens of data coverage, mendeley readers is the only altmetric indicator that is moderately correlated with citations at the micro-topic level, in line with previous conclusions about the moderate correlation between mendeley readership counts and citations at the publication level. in contrast, because of the different distribution patterns between citations and most altmetric data across subject fields found in fig. , it is not surprising that the correlations of coverage between citations and the other altmetric data are relatively weak, suggesting that most altmetric data cover research topics different from those covered by citations. among altmetric data, twitter mentions, facebook mentions, news mentions, and blog citations are strongly correlated with each other, indicating that these social media data cover similar research topics. most of the remaining altmetric data also present moderate correlations with the above social media data; however, q&a mentions, the only altmetric data source whose coverage is highest for publications in the field of mcs, is weakly correlated with the other altmetric data at the micro-topic level.
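a minimal sketch of this kind of rank correlation at the micro-topic level is shown below, using scipy's spearmanr on toy per-topic coverage values for two hypothetical data sources; it also illustrates the complementary analysis described shortly afterwards, in which micro-topics with zero values in both sources are excluded.

```python
# minimal sketch, assuming toy per-micro-topic coverage values for two data sources
from scipy.stats import spearmanr

# coverage (%) of the same set of micro-topics in two hypothetical sources
coverage_twitter = [12.0, 0.0, 35.5, 4.2, 0.0, 20.1]
coverage_policy  = [ 1.5, 0.0,  6.0, 2.2, 0.0,  0.3]

rho, pval = spearmanr(coverage_twitter, coverage_policy)
print(f"all micro-topics: rho={rho:.2f} (p={pval:.3f})")

# complementary analysis: drop micro-topics where both sources are zero,
# since mutual zeros can inflate the rank correlation
pairs = [(t, p) for t, p in zip(coverage_twitter, coverage_policy)
         if not (t == 0 and p == 0)]
rho_nz, pval_nz = spearmanr([t for t, _ in pairs], [p for _, p in pairs])
print(f"mutual zeros excluded: rho={rho_nz:.2f} (p={pval_nz:.3f})")
```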
nevertheless, from the perspective of intensity, most altmetric data show different attention levels towards research topics, because the intensity values of different data are generally only weakly or moderately correlated. twitter mentions and facebook mentions, and news mentions and blog citations, are the two pairs of altmetric data showing the strongest correlations from both the coverage and the intensity perspective, supporting the idea that these two pairs of altmetric data not only cover very similar research topics but also focus on similar research topics. there exists a certain share of micro-topics whose publications have not been mentioned at all by some specific altmetric data. in order to test the effect of such mutual zero-value micro-topics between each pair of data, the correlations were also computed excluding them (see appendix ). it is observed that, particularly for pairs of altmetric data with low overall data presence across publications (e.g., q&a mentions and peer review comments, q&a mentions and policy document citations), the correlation coefficients are even lower when mutual zero-value micro-topics are excluded, although the overall correlation patterns across different data types at the micro-topic level are consistent with what we observed in fig. . on the basis of coverage and intensity, it is possible to compare the altmetric data presence across research topics and to further identify topics that received higher levels of attention. as shown in fig. , groups of publications with similar research topics (micro-topics) can be classified into four categories according to the levels of coverage and intensity of the attention received. in this framework, hot research topics are those with a high coverage level of their publications which have at the same time accumulated relatively intensive average attention (i.e., their publications exhibit high coverage and high intensity values). in contrast, research topics in which only a few publications have received relatively intensive attention can be regarded as star-papers topics (i.e., low coverage and high intensity values), since the attention they attracted has not expanded to a large number of publications within the same research topic. thus, in star-papers topics the attention is mostly concentrated around a relatively small set of publications, namely those star papers with lots of attention accrued, while most of the other publications in the same research topic do not receive attention. following this line of reasoning, there are also research topics with a relatively large share of publications covered by a specific altmetric data source whose covered publications do not show a high average intensity of attention (i.e., high coverage and low intensity values); these are defined as popular research topics, with mile-wide and inch-deep attention accrued. finally, unpopular research topics are those with few publications covered by a specific altmetric data source and a relatively small average amount of data accumulated by the covered publications (i.e., low coverage and low intensity values); these research topics have not attracted much attention, thereby arguably remaining in an altmetrically unpopular status. it should be noted that, as time goes on and new altmetric activity is generated, the status of a research topic might switch across the above four categories.
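the following sketch classifies micro-topics into these four categories from toy coverage and intensity values, using the first-decile (top 10%) rank criterion that the study applies below for twitter mentions; the topic names and numbers are invented for illustration, and ties in intensity could additionally be broken by total event counts, as is done later for policy document citations.

```python
# minimal sketch of the four-category framework, assuming a top-decile rank criterion
# and invented coverage/intensity values per micro-topic
def top_share(values, share):
    """return the cutoff such that roughly the top `share` of values lie at or above it."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * share))
    return ranked[k - 1]

topics = {  # micro-topic -> (coverage %, intensity); invented example values
    "gravitational wave": (60.0, 3.5),
    "quantum walk":       (55.0, 2.8),
    "sunspot":            (20.0, 3.9),
    "soil moisture":      (48.0, 1.2),
    "beam dynamics":      ( 5.0, 1.1),
}

# the study uses the first decile (share=0.10); 0.4 is used here only because the toy list is tiny
cov_cut = top_share([c for c, _ in topics.values()], share=0.4)
int_cut = top_share([i for _, i in topics.values()], share=0.4)

def classify(cov, inten):
    if cov >= cov_cut and inten >= int_cut:
        return "hot topic"           # broad and intensive attention
    if cov < cov_cut and inten >= int_cut:
        return "star-papers topic"   # attention concentrated on a few publications
    if cov >= cov_cut and inten < int_cut:
        return "popular topic"       # mile-wide, inch-deep attention
    return "unpopular topic"         # little attention overall

for name, (cov, inten) in topics.items():
    print(f"{name}: {classify(cov, inten)}")
```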
following the framework proposed in fig. , we took twitter mention data as an example to empirically identify hot research topics in different subject fields. a total of micro-topics with at least one twitter mention in fig. were plotted into a two-dimensional system according to the levels of coverage and intensity they achieved (fig. a) . micro-topics were first ranked by their coverage and by their intensity, respectively; the higher the ranking a micro-topic achieves, the higher the level of its coverage or intensity. the size of the micro-topics is determined by their total number of publications. in order to identify representative hot research topics on twitter, we selected the top % as the criterion for both the coverage and the intensity level (the two dashed lines in fig. a) to partition micro-topics into four parts, in correspondence with fig. . as a result, micro-topics with higher levels of coverage and intensity are classified as hot research topics that received broader and more intensive attention from twitter users (located in the upper right corner of fig. a ). because publications in the fields of ssh, bhs, and les have much higher coverage and intensity of twitter data, micro-topics from these three subject fields are more likely to be distributed in the upper right part. in contrast, micro-topics in pse and mcs concentrate in the lower left part. in consideration of the biased presence of twitter data across the five main subject fields, we plotted the micro-topics of each subject field separately by the same method as in fig. a , and then zoomed in and present only the hot research topics of each subject field in fig. b-f . for clear visualization, one of the terms extracted by the cwts classification system was used as the label for each micro-topic. in the field of ssh, there are micro-topics considered, and ( %) of them rank in the top % from both the coverage and the intensity perspective (fig. b) . in this subject field, hot research topics tend to be about social issues, including topics related to gender and sex (e.g., "sexual orientation", "gender role conflict", "sexual harassment", etc.), education (e.g., "teacher quality", "education", "undergraduate research experience", etc.), climate ("global warming"), as well as psychological problems (e.g., "stereotype threat", "internet addiction", "stress reduction", etc.). bhs is the biggest field, with both the most research outputs and the most twitter mentions, so there are micro-topics considered, and ( %) of them were detected as hot research topics in fig. c . research topics about daily health keeping (e.g., "injury prevention", "low carbohydrate diet", "longevity", etc.), worldwide infectious diseases (e.g., "zika virus infection", "ebola virus", "influenza", etc.), lifestyle diseases (e.g., "obesity", "chronic neck pain", etc.), and emerging biomedical technologies (e.g., "genome editing", "telemedicine", "mobile health", etc.) received more attention on twitter. moreover, problems and revolutions in the medical system caused by social activities such as "brexit" and "public involvement" are also brought into focus. in the field of pse, ( %) out of micro-topics were identified as hot research topics in fig. d . as a field with fewer twitter mentions accumulated, although most research topics are left out by twitter users, those about the universe and astronomy (e.g., "gravitational wave", "exoplanet", "sunspot", etc.) and quantum physics (e.g., "quantum walk", "quantum game", "quantum gravity", etc.) received relatively higher levels of attention.
in addition, there are also some hot research topics standing out from the complexity sciences, such as "scale free network", "complex system", and "fluctuation theorem". in the field of les, there are micro-topics in total, and fig. e shows the ( %) hot research topics in this field. these hot research topics are mainly about animals (e.g., "dinosauria", "shark", "dolphin", etc.) and natural environment problems (e.g., "extinction risk", "wildlife trade", "marine debris", etc.). finally, as the smallest subject field, mcs has ( %) out of micro-topics identified as hot research topics (fig. f) , which are mainly about emerging information technologies (e.g., "big data", "virtual reality", "carsharing") and robotics (e.g., "biped robot", "uncanny valley", etc.). to reflect the differences in hot research topics through the lens of different altmetric data sources, policy document citation data was selected as another example. figure shows the overall distribution of micro-topics with at least one policy document citation and the identified hot research topics in the five main subject fields. the methodology is the same as that of fig. based on twitter data. however, due to the smaller data volume of policy document citations, there are micro-topics sharing the same intensity of . in this case, the total number of policy document citations of each micro-topic was introduced as a benchmark to make distinctions. for micro-topics with the same intensity, the higher the total number of policy document citations accrued, the higher the level of attention in the dimension of intensity. after this, if micro-topics still share the same ranking, they are tied for the same place, with the next equivalent rankings skipped. in general, these parallel rankings of micro-topics with relatively low levels of attention do not affect the identification of hot research topics. through the lens of policy document citations, the identified hot research topics differ to some extent from those in the eyes of twitter users. in the field of ssh, ( %) out of micro-topics were classified as hot research topics (fig. b) . these research topics mainly focus on industry and finance (e.g., "microfinance", "tax compliance", "intra industry trade", etc.), as well as children and education (e.g., "child care", "child labor", "teacher quality", etc.). besides, "gender wage gap" is also a remarkable research topic appearing in policy documents. in the field of bhs, there are micro-topics that have been cited by policy documents at least once, and ( %) of them were classified as hot research topics (fig. c) . worldwide infectious diseases are a typical concern of policy-makers; consequently, there is no doubt that they were identified as hot research topics, such as "sars", "ebola virus", "zika virus infection", and "hepatitis c virus genotype". in addition, healthcare (e.g., "health insurance", "nursing home resident", "newborn care", etc.), social issues (e.g., "suicide", "teenage pregnancy", "food insecurity", "adolescent smoking", etc.), and potentially health-threatening environmental problems (e.g., "ambient air pollution", "environmental tobacco smoke", "climate change", etc.) drew high levels of attention from policy-makers too.
different from the focus of twitter users on astronomy, the ( %) hot research topics identified out of micro-topics in the field of pse (fig. d) that concern policy-makers are mainly around energy and resources, like "energy saving", "wind energy", "hydrogen production", "shale gas reservoir", "mineral oil", and "recycled aggregate". in the field of les, fig. e shows the ( %) hot research topics identified out of micro-topics. from the perspective of policy documents, environmental protection (e.g., "marine debris", "forest management", "sanitation", etc.) and sustainable development (e.g., "selective logging", "human activity", "agrobiodiversity", etc.) are hot research topics. finally, in the field of mcs (fig. f) , publications are hardly cited by policy documents, so only ( %) topics out of micro-topics were identified as hot research topics. in this field, policy-makers paid more attention to information security ("differential privacy", "sensitive question") and traffic economy ("road pricing", "carsharing"). data presence is essential for the application of altmetrics in research evaluation and other potential areas. the heterogeneity of altmetrics makes it difficult to establish a common conceptual framework and to draw a unified conclusion (haustein ) , so in most cases it is necessary to treat altmetrics separately and look into their individual performance. this paper investigated each of the types of altmetric data based on a large-scale and up-to-date dataset; the results show that the various altmetric data vary greatly in their presence for wos publications. the data presence of several altmetric data sources has been widely discussed and explored in previous studies, and there are also reviews summarizing previous observations of the coverage of altmetric data (erdt et al. ; ortega ) . generally speaking, our results confirm the overall picture of data presence in those studies. for instance, mendeley readership keeps showing a very high data coverage across scientific publications and provides the most metrics among all altmetric data, followed by twitter mentions and facebook mentions. however, there exist huge gaps among these altmetric data. regarding data coverage, . % of the sample publications have attracted at least one mendeley reader, while for twitter mentions and facebook mentions the value is only . % and . %, respectively. moreover, for those altmetric data that have hardly been surveyed with the same dataset of wos publications before, like reddit mentions, f prime recommendations, video comments, peer review comments, and q&a mentions, the data coverage is substantially lower than %, showing an extremely weak data presence across research outputs. compared with the observations of altmetric data coverage reported in earlier altmetric studies, it can be concluded that the presence of altmetric data is clearly increasing, and our results are generally higher than those of previous studies using the same types of datasets. there are two possible reasons for the increasing presence of altmetric data across publications. one is the progress made by altmetric data aggregators (particularly altmetric.com) in improving their publication detection techniques and enlarging the set of tracked data sources. for example, altmetric.com redeveloped their news tracking system in december , which partially explains the rise of news coverage in (see fig. ). the second reason for the increasing presence of some altmetric data is the rising uptake of social media by the public, researchers, and scholarly journals (nugroho et al. ; van noorden ; zheng et al. ).
against this background, scientific publications are more likely to be disseminated on social media, thereby stimulating the accumulation of altmetric data. the fact that more publications accrue corresponding altmetric data, and that these data are detected, helps consolidate the data foundation, thus promoting the development and possible applications of altmetrics. in the meantime, we emphasize the biases of altmetric data towards different publication years. costas et al. ( ) highlighted the "recent bias" they found in overall altmetric scores, which refers to the dominance of the most recently published papers in garnering altmetric data. nevertheless, we found that the "recent bias" is not exhibited by all types of altmetric data. for altmetric data with a relatively high speed of data accumulation after publication, like twitter mentions, facebook mentions, news mentions, blog citations, and reddit mentions, it is demonstrated that their temporal distribution conforms to a "recent bias". however, a "past bias" is found for altmetric data that take a relatively longer time to accumulate, such as wikipedia citations, policy document citations, f prime recommendations, video comments, peer review comments, and q&a mentions. due to the slower pace of these altmetric events, they are more concentrated on relatively old publications. even for mendeley readers, the data presence across recent publications is clearly lower. overall, although an upward tendency of data presence has been observed over time, most altmetric data still keep an extremely low data presence, with the only exceptions being mendeley readers and twitter mentions. as suggested by thelwall et al. ( ) , until now these altmetric data may only be applicable to identifying the occasional exceptional or above-average articles rather than serving as universal sources of impact evidence. in addition, the distinguishing presence of altmetric data reinforces the necessity of keeping altmetrics separate in future analyses or research assessments. with the information on subject fields and micro-topics assigned by the cwts publication-level classification system, we further compared the presence of the types of altmetric data across subject fields of science and their inclinations towards different research topics. most altmetric data have a stronger focus on publications in the fields of ssh, bhs, and les. in contrast, altmetric data presence in the fields of pse and mcs is generally lower. this kind of data distribution differs from what has been observed based on citations, for which ssh is underrepresented while pse stands out as a subject field with higher levels of citations. this finding supports the idea that altmetrics might have more added value for the social sciences and humanities when citations are absent. in this study, it is demonstrated that even within the same subject field, altmetric data show different levels of data presence across research topics. amongst altmetric data, the correlations at the research topic level are similar to the correlations at the publication level (zahedi et al. ) , with mendeley readers the only altmetric data moderately correlated with citations, and twitter mentions and facebook mentions, and news mentions and blog citations, the two pairs showing the strongest correlations.
there might exist some underlying connections within these two pairs of strongly correlated altmetric data, such as possible synchronous updating by users who utilize multiple platforms to share science information, which can be further investigated in future research. for the remaining altmetric data, although many of them achieve moderate to strong correlations with each other in terms of coverage, because they have similar patterns of data coverage across subject fields, the correlations of data intensity are weaker, implying that research topics garner different levels of attention across altmetric data (robinson-garcia et al. ) . in view of the uneven distribution of specific altmetric data across research topics, it is possible to identify hot research topics which received higher levels of attention from certain communities such as twitter users and policy-makers. based on the two indicators for measuring data presence, coverage and intensity, we developed a framework to identify hot research topics, operationalized as micro-topics that fall in the first decile of the ranking distribution of both coverage and intensity. this means that hot research topics are those in which large shares of the publications receive intensive average attention. we have demonstrated the application of this approach in detecting hot research topics mentioned on twitter and cited in policy documents. since the subject field differences are so pronounced that they might hamper generalization (mund and neuhäusler ) , the identification of hot research topics was conducted for each subject field separately. hot research topics on twitter reflect the interest shown by twitter users, while those in policy documents mirror policy-makers' focus on science, and these two groups of identified hot research topics are diverse and hardly overlap. this result indicates that different communities are keeping an eye on different scholarly topics, driven by dissimilar motivations. the methodology for identifying hot research topics sheds light on an innovative application of altmetric data in tracking research trends with particular levels of social attention. by taking advantage of the clustered publication sets (i.e., micro-topics) algorithmically generated by the cwts classification system, the proposed methodology measures how wide and how intensive the altmetric attention to the research outputs of specific research topics is. this approach provides a new option for monitoring the focus of attention on science, thus representing an important difference from prior studies on the application of altmetric data in identifying topics of interest, which were mostly based on co-occurrence networks of topics with specific altmetric data accrued (haunschild et al. ; robinson-garcia et al. ) . the methodology employs a two-dimensional framework to classify research topics into four main categories according to the levels of the specific altmetric attention they received. as such, the framework represents a simplified approach to studying and characterizing the different types of attention received by individual research topics.
in our proposal for the identification of hot research topics, the influence of individual publications with extremely intensive attention is to some extent diminished, as the assessment of a whole topic relies on the overall attention of the publications around the topic, although topics characterized by single publications with high levels of attention are of course still captured as "star-papers topics". it should be acknowledged that the results of this approach give an overview of the attention around generalized research topics; to get a more detailed picture of specific micro-level research fields, complementary methods based on the detailed text information of the publications should be employed to go deeper into micro-topics. moreover, in this study the identification of hot research topics is based on the whole dataset; in future studies, by introducing the publication time of research outputs and the release time of altmetric events, it would be possible to monitor hot research topics in real time in order to reflect the dynamics of social attention on science. there are some limitations in this study. first, the dataset of publications is restricted to publications with dois or pubmed ids. the strong reliance on these identifiers is also seen as one of the challenges of altmetrics (haustein ) . second, although all types of documents are included in the overall analysis of data presence, only articles, reviews, and letters are assigned main subject fields of science and micro-topics by the cwts publication-level classification system, so only these three document types are considered in the subsequent analysis of data presence across subject fields and research topics. however, these three types account for . % of the sample publications (see appendix ), so they can be used to reveal relatively common phenomena. lastly, the cwts classification system is a coarse-grained system of disciplines, in the sense that some different fields are clustered into an integral whole, like social sciences and humanities, making it difficult to present more fine-grained results. but the advantages of this system are that it solves the problem caused by multi-disciplinary journals and that individual publications with similar research topics are clustered into micro-level fields, namely micro-topics, providing us with the possibility of comparing the distribution of altmetric data at the research topic level and identifying hot research topics based on data presence. this study investigated the state-of-the-art presence of types of altmetric data for nearly . million web of science publications across subject fields and research topics. except for mendeley readers and twitter mentions, the presence of most altmetric data is still very low, even though it is increasing over time. altmetric data with a high speed of data accumulation are biased towards newly published papers, while those with a lower speed are biased towards relatively old publications. the majority of altmetric data concentrate on publications from the fields of biomedical and health sciences, social sciences and humanities, and life and earth sciences. these findings underline the importance of applying different altmetric data with suitable time windows and fields of science considered.
within a specific subject field, altmetric data show different preferences for research topics; thus, research topics attract different levels of attention across altmetric data sources, making it possible to identify hot research topics with higher levels of attention received in different altmetric contexts. based on the data presence at the research topic level, a framework for identifying hot research topics with specific altmetric data was developed and applied, shedding light on the potential of altmetric data in tracking research trends with a particular social attention focus.
references
geographic variation in social media metrics: an analysis of latin american journal articles
do altmetrics point to the broader impact of research? an overview of benefits and disadvantages of altmetrics
usefulness of altmetrics for measuring the broader impact of research: a case study using data from plos and f prime
alternative metrics in scientometrics: a meta-analysis of research into three altmetrics
what do altmetrics counts mean? a plea for content analyses
measuring field-normalized impact of papers on specific societal groups: an altmetrics study based on mendeley data
citespace ii: detecting and visualizing emerging trends and transient patterns in scientific literature
do "altmetrics" correlate with citations? extensive comparison of altmetric indicators with citations from a multidisciplinary perspective
altmetrics: finding meaningful needles in the data haystack
testing for universality of mendeley readership distributions
the relationship between tweets, citations, and article views for plos one articles
co-saved, co-tweeted, and co-cited networks
dynamic topic detection and tracking: a comparison of hdp, c-word, and cocitation methods
altmetrics: an analysis of the state-of-the-art in measuring research impact on social media
studying the accumulation velocity of altmetric data tracked by altmetric.com
the stability of twitter metrics: a study on unavailable twitter mentions of scientific publications
what can article-level metrics do for you?
research characteristics and status on social media in china: a bibliometric and co-word analysis
a new methodological approach to bibliographic coupling and its application to the national, regional and institutional level
using 'core documents' for detecting and labelling new emerging topics
using altmetrics for assessing research impact in the humanities
how many scientific papers are mentioned in policy-related documents? an empirical investigation using web of science and altmetric data
does the public discuss other topics on climate change than researchers? a comparison of explorative networks based on author keywords and hashtags
grand challenges in altmetrics: heterogeneity, data quality and dependencies
interpreting 'altmetrics': viewing acts on social media through the lens of citation and social theories
characterizing social media metrics of scholarly papers: the effect of document properties and collaboration patterns
tweets vs. mendeley readers: how do these two social media metrics differ? it - information technology
how to identify emerging research fields using scientometrics: an example in the field of information security
on the causes of subject-specific citation rates in web of science
cross-metric compatibility and inconsistencies of altmetrics
who reads research articles? an altmetrics analysis of mendeley user categories
towards an early-stage identification of emerging topics in science - the usability of bibliometric characteristics
measuring societal impact is as complex as abc
a survey of recent methods on deriving topics from twitter: algorithm to evaluation. knowledge and information systems
exploratory analysis of publons metrics and their relationship with bibliometric and altmetric impact
altmetrics data providers: a meta-analysis review of the coverage of metrics and publication
the altmetrics collection
altmetrics: a manifesto
mapping social media attention in microbiology: identifying main topics and actors
new data, new possibilities: exploring the insides of altmetric.com
the skewness of science
detecting emerging research fronts based on topological measures in citation networks of scientific publications
tracking and predicting growth areas in science
identifying emerging topics in science and technology
"attention is not impact" and other challenges for altmetrics
scholarly use of social media and altmetrics: a review of the literature
are mendeley reader counts high enough for research evaluations when articles are published?
do altmetrics work? twitter and ten other social web services
a comparison of methods for detecting hot topics
online collaboration: scientists and the social network
f recommendations as a potential new data source for research evaluation: a comparison with citations
a new methodology for constructing a publication-level classification system of science
detecting and tracking the real-time hot topics: a study on computational neuroscience
usage patterns of scholarly articles on web of science: a study on web of science usage count
tracing scientist's research trends realtimely
users, narcissism and control - tracking the impact of scholarly publications in the st century
social media metrics for new research evaluation
how well developed are altmetrics? a cross-disciplinary analysis of the presence of 'alternative metrics' in scientific publications
mendeley readership as a filtering tool to identify highly cited publications
on the relationships between bibliographic characteristics of scientific documents and citation and mendeley readership counts: a large-scale analysis of web of science publications
exploring topics of interest of mendeley users
social media presence of scholarly journals
acknowledgements: zhichao fang is financially supported by the china scholarship council ( ). rodrigo costas is partially funded by the south african dst-nrf centre of excellence in scientometrics and science, technology and innovation policy (scistip). xianwen wang is supported by the national natural science foundation of china ( and ). the authors thank altmetric.com for providing the altmetric data of publications, and also thank the two anonymous reviewers for their valuable comments.
the , , sample wos publications were matched with their document types through the cwts in-house database. table presents the number of publications and the coverage of altmetric data for each type. the document types article, review, and letter, which are included in the cwts classification system, account for about . % in total. the altmetric data coverage varies across document types, as observed by zahedi et al. ( ) . for most altmetric data, review shows the highest altmetric data coverage, followed by article, editorial material, and letter.
it is reported that the distributions of citation counts (seglen ) , usage counts, and twitter mentions are highly skewed. the results in fig. show that the same situation applies to the other altmetric data as well. spearman correlation analyses among the coverage, density, and intensity of micro-topics were conducted for each altmetric data source and for citations, and the results are shown in fig. . because of the highly skewed distribution of all kinds of altmetric data, the calculations of coverage and density are prone to give similar results, especially for altmetric data with smaller data volume; therefore, the correlation between coverage and density is quite strong for every altmetric data source. for most altmetric data, density and intensity are moderately or strongly correlated, and their correlations are always slightly stronger than those between coverage and intensity. in consideration of the influence of the zero values of some micro-topics on inflating the spearman correlation coefficients, we did a complementary analysis by calculating the spearman correlations for each pair of data after excluding the mutual micro-topics with zero values (fig. ). compared to the results shown in fig. , the values in fig. are clearly lower, especially for those pairs of altmetric data with relatively low data presence. however, the overall patterns are still consistent with what we observed in fig. .
key: cord- - imgztwe authors: frishman, d.; albrecht, m.; blankenburg, h.; bork, p.; harrington, e. d.; hermjakob, h.; juhl jensen, l.; juan, d. a.; lengauer, t.; pagel, p.; schachter, v.; valencia, a. title: protein-protein interactions: analysis and prediction journal: modern genome annotation cord_uid: imgztwe
proteins represent the tools and appliances of the cell: they assemble into larger structural elements, catalyze the biochemical reactions of metabolism, transmit signals, move cargo across membrane boundaries and carry out many other tasks. for most of these functions proteins cannot act in isolation but require close cooperation with other proteins to accomplish their task. often, this collaborative action implies physical interaction of the proteins involved. accordingly, experimental detection, in silico prediction and computational analysis of protein-protein interactions (ppi) have attracted great attention in the quest for discovering functional links among proteins and deciphering the complex networks of the cell. proteins do not simply clump together; binding between proteins is a highly specific event involving well defined binding sites. several criteria can be used to further classify interactions (nooren and thornton ) .
protein interactions are not mediated by covalent bonds and, from a chemical perspective, they are always reversible. nevertheless, some ppi are so persistent as to be considered irreversible (obligatory) for all practical purposes. other interactions are subject to tight regulation and only occur under characteristic conditions. depending on their functional role, some protein interactions remain stable for a long time (e.g. between proteins of the cytoskeleton) while others last only fractions of a second (e.g. binding of kinases to their targets). protein complexes formed by physical binding are not restricted to so-called binary interactions, which involve exactly two proteins (dimer), but are often found to contain three (trimer), four (tetramer), or more peptide chains. another distinction can be made based on the number of distinct proteins in a complex: homo-oligomers contain multiple copies of the same protein while hetero-oligomers consist of different protein species. sophisticated "molecular machines" like the bacterial flagellum consist of a large number of different proteins linked by protein interactions. the focus of this chapter is on computational methods for analyzing and predicting protein-protein interactions. nevertheless, some basic knowledge about the experimental techniques for detecting these interactions is highly useful for interpreting results, estimating potential biases, and judging the quality of the data we use in our work. many different types of methods have been developed, but the vast majority of interactions in the literature and public databases come from only two classes of approaches: co-purification and two-hybrid methods. co-purification methods (rigaut et al. ) are carried out in vitro and involve three basic steps. first, the protein of interest is "captured" from a cell lysate, e.g. by attaching it to an immobile matrix. this may be done with specific antibodies, affinity tags, epitope tags along with a matching antibody, or by other means. second, all other proteins in the solution are removed in a washing step in order to purify the captured protein. under suitable conditions, protein-protein interactions are preserved. in the third step, any proteins still attached to the purified protein are detected by suitable methods (e.g. western blot or mass spectrometry). hence, the interaction partners are co-purified, as the name of the method implies. the two-hybrid technique (fields and song ) uses a very different approach: it exploits the fact that transcription factors such as gal consist of two distinct functional domains. the dna-binding domain (bd) recognizes the transcription factor (tf) binding site in the dna and attaches the protein to it, while the activation domain (ad) triggers transcription of the gene under the control of the factor. when expressed as separate protein chains, both domains remain fully functional: the bd still binds the dna but lacks a way of triggering transcription, and the ad could trigger transcription but has no means of binding to the dna. for a two-hybrid test, two proteins x and y are fused to these domains, resulting in two hybrids: x-bd and y-ad. if x binds to y, the resulting protein complex turns out to be a fully functional transcription factor. accordingly, an interaction is revealed by detecting transcription of a reporter gene under the control of the tf. in contrast to co-purifications, the interaction is tested in vivo in the two-hybrid system (usually in yeast, but other systems exist).
the above description refers to small-scale experiments testing one pair of proteins at a time, but both approaches have successfully been extended to large-scale experiments testing thousands of pairs in a very short time. while such high-throughput data are very valuable, especially for computational biology, which often requires comprehensive input data, a word of caution is necessary. even with the greatest care and a maximum of thoughtful controls, high-throughput data usually suffer from a certain degree of false-positive results as well as false negatives compared to carefully performed and highly optimized individual experiments. the ultimate source of information about protein interactions is provided by high-resolution three-dimensional structures of interaction complexes, such as the one shown in fig. . spatial architectures obtained by x-ray crystallography or nmr spectroscopy provide atomic-level detail of interaction interfaces and allow for a mechanistic understanding of interaction processes and their functional implications. additional kinetic, dynamic and structural aspects of protein interactions can be elucidated by electron and atomic force microscopy as well as by fluorescence resonance energy transfer.
fig.: structural complex between rhoa, a small gtp-binding protein belonging to the ras superfamily, and the catalytic gtpase activating domain of rhogap (graham et al. ).
protein interaction databases
a huge number of protein-protein interactions has been experimentally determined and described in numerous scientific publications. public protein interaction databases that provide interaction data in the form of structured, machine-readable datasets organized according to well documented standards have become invaluable resources for bioinformatics, systems biology and researchers in experimental laboratories. the data in these databases generally originate from two major sources: large-scale datasets and manually curated information extracted from the scientific literature. as pointed out above, the latter is considered substantially more reliable, and large bodies of manually curated ppi data are often used as the gold standard against which predictions and large-scale experiments are benchmarked. of course, these reference data are far from complete and strongly biased. many factors, including experimental bias, preferences of the scientific community, and perceived biomedical relevance, influence the chance of an interaction being studied, discovered and published. in the manual annotation process it is not enough to simply record the interaction as such. additional information such as the type of experimental evidence, citations of the source, experimental conditions, and more needs to be stored in order to convey a faithful picture of the data. annotation is a highly labor-intensive task carried out by specially trained database curators. ppi databases can be roughly divided into two classes: specialized databases focusing on a single organism or a small set of species, and general repositories which aim for a comprehensive representation of current knowledge. while the former are often well integrated with other information resources for the same organism, the latter strive to collect all available interaction data, including datasets from specialized resources. the size of these databases is growing constantly as more and more protein interactions are identified.
as of writing (november ) , global repositories are approaching , pieces of evidence for protein interactions in various species. all of these databases offer convenient web interfaces that allow for interactively searching the database. in addition, the full datasets are usually provided for download in order to enable researchers to use the data in their own computational analyses. table gives an overview of some important ppi databases. until relatively recently, molecular interaction databases like the ones listed in table acted largely independently from each other. while they provided an extremely valuable service to the community in collecting and curating available molecular interaction data from the literature, they did so largely in an uncoordinated manner. each database had its own curation policy, feature set, and data formats. in , the proteomics standards initiative (psi), a working group of the human proteome organization (hupo), set out to improve this situation, with contributions from a broad range of academic and commercial organizations, among them bind, cellzome, dip, glaxosmithkline, hybrigenics sa, intact, mint, mips, serono, and the universities of bielefeld, bordeaux, and cambridge. in a first step, a community standard for the representation of protein-protein interactions was developed, the psi mi format . (hermjakob et al. ) . recently, version . of the psi mi format has been published, extending the scope of the format from protein-protein interactions to molecular interactions in general and allowing, for example, protein-rna complexes to be modeled. the psi mi format is a flexible xml format representing the interaction data at a high level of detail. n-ary interactions (complexes) can be represented, as well as experimental conditions and technologies, quantitative parameters, and interacting domains. the xml format is accompanied by detailed controlled vocabularies in obo format (harris et al. ). these vocabularies are essential for standardizing not only the syntax but also the semantics of the molecular interaction representation. as an example, the "yeast two-hybrid technology" described above is referred to in the literature using many different synonyms, for example y h, h, "yeast-two-hybrid", etc. while all of these terms refer to the same technology, filtering interaction data from multiple different databases based on this set of terms is not trivial. thus, the psi mi standard provides a set of now more than well-defined terms relevant to molecular interactions. figure shows the intact advanced search tool with a branch of the hierarchical psi mi controlled vocabulary. figure provides a partial graphical representation of the annotated xml schema, combined with an example dataset in psi mi xml format, reprinted from kerrien et al. ( b) . for user-friendly distribution of simplified psi data to end users, the psi mi . standard also defines a simple tabular representation (mitab), derived from the biogrid format (breitkreutz et al. ) . while this format necessarily excludes details of interaction data like interacting domains, it provides a means to efficiently access large numbers of basic binary interaction records. the psi mi format is now widely implemented, with data available from biogrid, dip, hprd, intact, mint, and mips, among others. visualization tools like cytoscape (shannon et al. ) can directly read and visualize psi mi formatted data.
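as a rough illustration of how such simplified tabular interaction data can be consumed, the sketch below parses a mitab-style file into binary interaction records; it assumes the common 15-column mitab 2.5 layout (interactor a and b identifiers in the first two columns, the interaction detection method in column 7), so the column positions should be checked against the psi-mi documentation for the actual file at hand.

```python
# minimal sketch of reading mitab-style lines, assuming the 15-column mitab 2.5 layout;
# verify the column order against the psi-mi documentation before relying on it
import csv

def read_mitab(path):
    """yield (interactor_a, interactor_b, detection_method) tuples from a mitab file."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):    # skip header/comment lines
                continue
            id_a, id_b = row[0], row[1]              # e.g. "uniprotkb:P12345"
            method = row[6] if len(row) > 6 else ""  # detection method column
            yield id_a, id_b, method

# example: keep only interactions detected by two-hybrid-like methods
# (matching on the term name is a simplification; a controlled-vocabulary lookup is more robust)
def two_hybrid_pairs(path):
    return {(a, b) for a, b, m in read_mitab(path) if "two hybrid" in m.lower()}
```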
comparative and integrative analysis of interaction data from multiple sources has become easier, as has the development of analysis tools which do not need to provide a plethora of input parsers any more. the annotated psi mi xml schema, a list of tools and databases implementing it, as well as further information, are available from http:// www.psidev.info/. however, the development and implementation of a common data format is only one step towards the provision of consistent molecular interaction data to the scientific community. another key step is the coordination of the data curation process itself between different molecular interaction databases. without such synchronization, independent databases will often work on the same publications and insert the data into their systems, according to different curation rules, thus doing redundant work on some publications, while neglecting others. recognizing this issue, the dip, intact, and mint molecular interaction databases are currently synchronizing their curation efforts in the context of the imex consortium (http://imex.sf.net). these databases are now applying the same curation rules to provide a consistent high level of curation quality, and are synchronizing their fields of activity, each focusing on literature curation from a non-overlapping set of scientific journals. for these journals, the databases aim to insert all published interactions into the database shortly after publication. regular data exchange of all newly curated data between imes databases is currently in the implementation phase. to support the systematic representation and capture of relevant molecular interaction data supporting scientific publications, the hupo proteomics standards initiative has recently published "the minimum information required for reporting a molecular interaction experiment (mimix)" , detailing data items considered essential for the authors to provide, as well as a practical guide to efficient deposition of molecular interaction data in imex databases . the imex databases are also collaborating with scientific journals and funding agencies, to increasingly recommend data producers to deposit their data in an imex partner database prior to publication. database deposition prior to publication not only ensures public availability of the data at the time of publication, but also provides important quality control, as database curators often assess the data in much more detail than reviewers. the psi journal collaboration efforts are starting to show first results. nature biotechnology, nature genetics, and proteomics are now recommending that authors deposit molecular interaction data in a relevant public domain database prior to publication, a key step to a better capture of published molecular interaction data in public databases, and to overcome the current fragmentation of molecular interaction data. as an example of a molecular interaction database implementing the psi mi . standard, we will provide a more detailed description of the intact molecular interaction database ), accessible at http://www.ebi.ac.uk/intact. intact is a curated molecular interaction database active since . 
intact follows a full text curation policy, publications are read in full by the curation team, and all molecular interactions contained in the publication are inserted into the database, containing basic facts like the database accession numbers of the proteins participating in an interaction, but also details like experimental protein modifications, which can have an impact on assessments of confidence in the presence or absence of interactions. each database record is cross-checked by a senior curator for quality control. on release of the record, the corresponding author of the publication is automatically notified (where an email address is available), and requested to check the data provided. any corrections are usually inserted into the next weekly release. while such a detailed, high quality approach is slow and limits coverage, the provision of high quality reference datasets is an essential service both for biological analysis, and for the training and validation of automatic methods for computational prediction of molecular interactions. as it is impossible for any single database, or even the collaborating imex databases, to fully cover all published interactions, curation priorities have to be set. any direct data depositions supporting manuscripts approaching peer review have highest priority. next, for some journals (currently cell, cancer cell, and proteomics) intact curates all molecular interactions published in the journal. finally, several special curation topics are determined in collaboration with external communities or collaborators, where intact provides specialized literature curation and collaborates in the analysis of experimental datasets, for example around a specific protein of interest (camargo et al. ) . as of november , intact contains . binary interactions supported by ca. , publications. the intact interface implements a standard "simple search" box, ideal for search by uniprot protein accession numbers, gene names, species, or pubmed identifiers. the advanced search tool (fig. ) provides field-specific searches as well a specialized search taking into account the hierarchical structure of controlled vocabularies. a default search for the interaction detection method " hybrid" returns , interactions, while a search for " hybrid" with the tickbox "include children" activated returns more than twice that number, , interactions. the hierarchical search automatically includes similarly named methods like "two hybrid pooling approach", but also "gal vp complement". search results are initially shown in a tabular form based on the mitab format, which can also be directly downloaded. each pairwise interaction is only listed once, with all experimental evidence listed in the appropriate columns. the final column provides access to a detailed description of each interaction as well as a graphical representation of the interaction in is interaction neighborhood graph. for interactive, detailed analysis, interaction data can be loaded into tools like cytoscape (see below) via the psi . xml format. all intact data is freely available via the web interface, for download in psi mi tabular or xml format, and computationally accessible via web services. intact software is open source, implemented in java, with hibernate (www.hibernate.org/) for the object-relational mapping to oracle tm or postgres, and freely available under the apache license, version from http://www.ebi.ac.uk/intact. 
on a global scale, protein-protein interactions participate in the formation of complex biological networks which, to a large extent, represent the paths of communication and metabolism of an organism. these networks can be modeled as graphs, making them amenable to a large number of well-established techniques of graph theory and social network analysis. even though interaction networks neither directly encode cellular processes nor provide information on dynamics, they do represent a first step towards a description of cellular processes, which is ultimately dynamic in nature. for instance, protein-interaction networks may provide useful information on the dynamics of complex assembly or signaling. in general, investigating the topology of protein interaction, metabolic, signaling, and transcriptional networks allows researchers to reveal the fundamental principles of molecular organization of the cell and to interpret genome data in the context of large-scale experiments. such analyses have become an integral part of the genome annotation process: annotating genomes today increasingly means annotating networks. a protein-protein interaction network summarizes the existence of both stable and transient associations between proteins as an (undirected) graph: each protein is represented as a node (or vertex), and an edge between two proteins denotes the existence of an interaction. interactions known to occur in the actual cell (fig. a) can thus be represented as an abstract graph of interaction capabilities (fig. b). as such a graph is limited by definition to binary interactions, its construction from a database of molecular interactions may involve arbitrary choices. for instance, an n-ary interaction measured by co-purification can be represented using either the clique model (all binary interactions between the n proteins are retained) or the spoke model (only edges connecting the "captured" protein to co-purified proteins are retained). once a network has been reconstructed from protein interaction data, a variety of statistics on network topology can be computed, such as the distribution of vertex degrees, the distribution of the clustering coefficient and other notions of density, the distribution of shortest path lengths between vertex pairs, or the distribution of network motif occurrences (see for a review). these measures can be used to describe networks in a concise manner, to compare, group or contrast different networks, and to identify properties characteristic of a network or a class of networks under study. some topological properties may be interpreted as traces of underlying biological mechanisms, shedding light on their dynamics, their evolution, or both, and helping to connect structure to function (see the "network modules" section below). for instance, most interaction networks seem to exhibit scale-free topology (jeong et al. ; yook et al. ), i.e. their degree distribution (the probability that a node has exactly k links) approximates a power law p(k) ~ k^(-γ), meaning that most proteins have few interaction partners but some, the so-called "hubs", have many. as an example of derived evolutionary insight, it is easy to show that networks evolving by growth (addition of new nodes) and preferential attachment (new nodes are more likely to be connected to nodes with more connections) will exhibit scale-free topology (degree distribution approximates a power law) and hubs (highly connected nodes).
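to make the graph abstraction concrete, the sketch below (python with the networkx library; complex membership and bait names are invented) expands one pull-down complex under both the spoke and the matrix (clique) models and then computes a few of the topology statistics mentioned above.

from itertools import combinations
import networkx as nx

def spoke_edges(bait, preys):
    # spoke model: only bait-prey edges are retained
    return [(bait, prey) for prey in preys]

def matrix_edges(members):
    # matrix (clique) model: all pairwise edges between complex members
    return list(combinations(members, 2))

graph = nx.Graph()
graph.add_edges_from(spoke_edges("p1", ["p2", "p3", "p4"]))   # invented pull-down
graph.add_edges_from(matrix_edges(["p3", "p5", "p6", "p7"]))  # invented complex

print("degree histogram:", nx.degree_histogram(graph))
print("clustering coefficients:", nx.clustering(graph))
print("shortest path lengths from p1:", nx.single_source_shortest_path_length(graph, "p1"))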
a simple model of interaction network evolution by gene duplication, where a duplicate initially keeps the same interaction partners as the original, generates preferential attachment, thus providing a candidate explanation for the scale-free nature and the existence of hubs in these networks.

fig. : interacting proteins are denoted as p , p , etc. (b) a graph representation of the protein interactions shown in (a); each node represents a protein, and each edge connects proteins that interact. (c) information on protein interactions obtained by different methods. (d) protein interaction network derived from the experimental evidence shown in (c); as in (a), each node is a protein, and edges connect interactors. edges are colored according to the source of evidence: red - d, green - apms, brown - y h, magenta - prof, yellow - lit, blue - loc.

a corresponding functional interpretation of hubs and scale-free topology has been proposed in terms of robustness. scale-free networks are robust to component failure, as random failures are likely to affect low-degree nodes, and only failures affecting hub nodes will significantly change the number of connected components and the length of shortest paths between node pairs. deletion analyses have, perhaps unsurprisingly, confirmed that highly connected proteins are more likely to be essential (winzeler et al. ; giaever et al. ; gerdes et al. ). most biological interpretations that have been proposed for purely topological properties of interaction networks have been the subject of heated controversies, some of which remain unsolved to this day (e.g. (he and zhang ; yu et al. ) on hubs). one often-cited objection to any strong interpretation is the fact that networks reconstructed from high-throughput interaction data constitute very rough approximations of the "real" network of interactions taking place within the cell. as illustrated in fig. c, interaction data used in a reconstruction typically result from several experimental methods, often complemented with prediction schemes. each specific method can miss real interactions (false negatives) and incorrectly identify other interactions (false positives), resulting in biases that are clearly technology-dependent (gavin et al. ; legrain and selig ). assessing false-negative and false-positive rates is difficult since there is no gold standard for positive interactions (protein pairs that are known to interact) or, more importantly, for negative interactions (protein pairs that are known not to interact). using less-than-ideal benchmark interaction sets, estimates of - % false positives and - % false negatives have been proposed for yeast two-hybrid and copurification-based techniques (aloy and russell ). in particular, a comparison of several high-throughput interaction datasets on yeast, showing low overlap, has confirmed that each study covers only a small percentage of the underlying interaction network (von mering et al. ) (see also "estimates of the number of protein interactions" below). integration of interaction data from heterogeneous sources towards interaction network reconstruction can help compensate for these limitations. the basic principle is fairly simple and rests implicitly on a multigraph representation: several interaction networks to be integrated, each resulting from a specific experimental or predictive method, are defined over the same set of proteins.
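the multigraph bookkeeping behind this kind of integration can be written directly with a graph library. the sketch below (protein names and evidence labels are invented) tags every edge with the method that produced it and then keeps only pairs supported by at least two independent evidence types.

import networkx as nx

# hypothetical single-method networks defined over the same protein set
evidence = {
    "y2h":  [("p1", "p2"), ("p2", "p3"), ("p4", "p5")],
    "apms": [("p1", "p2"), ("p2", "p3"), ("p3", "p6")],
    "prof": [("p2", "p3"), ("p4", "p5")],
}

multigraph = nx.MultiGraph()
for method, pairs in evidence.items():
    multigraph.add_edges_from(pairs, source=method)  # one "color" per method

# collapse the multigraph, keeping edges confirmed by at least two methods
reliable = nx.Graph()
for a, b in {tuple(sorted(edge)) for edge in multigraph.edges()}:
    sources = {data["source"] for data in multigraph.get_edge_data(a, b).values()}
    if len(sources) >= 2:
        reliable.add_edge(a, b, sources=sorted(sources))

print(reliable.edges(data=True))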
integration is achieved by merging them into a single network with several types of linksor edge colors-each drawn from one of the component networks. some edges in the multigraph may be incorrect, while some existing interactions may be missing from the multigraph, but interactions confirmed independently by several methods can be considered reliable. figure d shows the multigraph that corresponds to the evidence from fig. c and can be used to reconstruct the actual graph in fig. b . in practice, integration is not always straightforward: networks are usually defined over subsets of the entire gene or protein complement of a species, and meaningful integration requires that the overlap of these subsets be sufficiently large. in addition, if differences of reliability between network types are to be taken into account, an integrated reliability scoring scheme needs to be designed (jansen et al. ; von mering et al. ) with the corresponding pitfalls and level of arbitrariness involved in comparing apples and oranges. existing methods can significantly reduce false positive rates on a subset of the network, yielding a subnetwork of highreliability interactions. the tremendous amounts of available molecular interaction data raise the important issue of how to visualize them in a biologically meaningful way. a variety of tools have been developed to address this problem; two prominent examples are visant (hu et al. ) and cytoscape (shannon et al. ) . a recent review of further network visualization tools is provided by suderman and hallett ( ) . in this section, we focus on cytoscape (http://www.cytoscape.org) and demonstrate its use for the investigation of protein-protein interaction networks. for a more extensive protocol on the usage of cytoscape, see (cline et al. ) . cytoscape is a stand-alone java application that is available for all major computer platforms. this software provides functionalities for (i) generating biological networks, either manually or by importing interaction data from various sources, (ii) filtering interactions, (iii) displaying networks using graph layout algorithms, (iv) integrating and displaying additional information like gene expression data, and (v) performing analyses on networks, for instance, by calculating topological network properties or by identifying functional modules. one advantage of cytoscape over alternative visualization software applications is that cytoscape is released under the open-source lesser general public license (lgpl). this license basically permits all forms of software usage and thus helps to build a large user and developer community. third-party java developers can easily enhance the functionality of cytoscape by implementing own plug-ins, which are additional software modules that can be readily integrated into the cytoscape platform. currently, there are more than forty plug-ins publicly available, with functionalities ranging from interaction retrieval and integration across topological network analysis, detection of network motifs, protein complexes, and domain interactions, to visualization of subcellular protein localization and bipartite networks. a selection of popular cytoscape plug-ins is listed in table . in the following, we will describe the functionalities of cytoscape in greater detail. the initial step of generating a network can be accomplished in different ways. first, the user can import interaction data that are stored in various flat file or xml formats such as biopax, sbml, or psi-mi, as described above. 
second, the user can directly retrieve interactions from several public repositories from within cytoscape. a number table ). third, the user can utilize a text-mining plug-in that builds networks based on associations found in publication abstracts (agilent literature search; table ). while these associations are not as reliable as experimentally derived interactions, they can be helpful when the user is investigating species that are not well covered yet in the current data repositories. fourth, the user can directly create or manipulate a network by manually adding or removing nodes (genes, proteins, domains, etc.) and edges (interactions or relationships). in this way, expert knowledge that is not captured in the available data sets can be incorporated into the loaded network. generated networks can be further refined by applying selections and filters in cytoscape. the user can select nodes or edges by simply clicking on them or framing a selection area. in addition, starting with at least one selected node, the user can incrementally enlarge the selection to include all direct neighbor nodes. cytoscape also provides even sophisticated search and filter functionality for selecting particular nodes and edges in a network based on different properties; in particular, the enhanced search plug-in (table ) improves the built-in search functionality of cytoscape. filters select all network parts that match certain criteria, for instance, all human proteins or all interactions that have been detected using the yeast two-hybrid system. once a selection has been made, all selected parts can be removed from the network or added to another network. the main purpose of visualization tools like cytoscape is the presentation of biological networks in an appropriate manner. this can usually be accomplished by applying graph layout algorithms. sophisticated layouts can assist the user in revealing specific network characteristics such as hub proteins or functionally related protein clusters. cytoscape offers various layout algorithms, which can be categorized as circular, hierarchical, spring-embedded (or force-directed), and attribute-based layouts (fig. ). further layouts can be included using the cytoscape plug-in architecture, for example, to arrange protein nodes according to their subcellular localization or to their pathways assignments (bubblerouter, cerebral; table ). some layouts may be more effective than others for representing molecular networks of a certain type. the spring-embedded layout, for instance, has the effect of exposing the inherent network structure, thus identifying hub proteins and clusters of tightly connected nodes. it is noteworthy that current network visualization techniques have limitations, for example, when displaying extremely large or dense networks. in such cases, a simple graphical network representation with one node for each interaction partner, as it is initially created by cytoscape, can obfuscate the actual network organization due to the sheer number of nodes and edges. one potential solution to this problem is the introduction of meta-nodes (metanode plug-in; table ). a meta-node combines and replaces a group of other nodes. meta-nodes can be collapsed to increase clarity of the visualization and expanded to increase the level of detail (fig. ). an overview of established and novel visualization techniques for biological networks on different scales is presented in (hu et al. ). 
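besides the xml formats mentioned above, cytoscape also reads the very simple sif text format, in which each line names two nodes and the relationship between them. the sketch below writes such a file from a list of interactions; the protein names, relationship labels, and file name are invented.

# hypothetical interactions annotated with the method that detected them
interactions = [
    ("p1", "p2", "y2h"),
    ("p2", "p3", "apms"),
    ("p3", "p4", "y2h"),
]

# one whitespace-delimited line per edge: source, relationship, target
with open("example_network.sif", "w") as handle:
    for source, target, method in interactions:
        handle.write(f"{source}\t{method}\t{target}\n")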
all layouts generated by cytoscape are zoomable, enabling the user to increase or decrease the magnification, and they can be further customized by aligning, scaling, or rotating selected network parts. additionally, the user can define the graphical network representation through visual styles. these styles define the colors, sizes, and shapes of all network parts. a powerful feature of cytoscape is its ability to visually map additional attribute values onto network representations. both nodes and edges can have arbitrary attributes, for example, protein function names, the number of interactions (node degree), expression values, the strength and type of an interaction, or confidence values for interaction reliability. these attributes can be used to adapt the network illustration by dynamically changing the visual styles of individual network parts (fig. ). for example, this feature enables highlighting trustworthy interactions by assigning different line styles or sizes to different experiment types (discrete mapping of an edge attribute), spotting network hubs by changing the size of a node according to its degree (discrete or continuous mapping of a node attribute), or identifying functional network patterns by coloring protein nodes with a color gradient according to their expression level (continuous mapping of a node attribute). hence, it is possible to simultaneously visualize different data types by overlaying them with a network model.

fig. : all protein nodes with subcellular localizations other than the plasma membrane are combined into meta-nodes (table ); these meta-nodes can be collapsed or expanded to increase clarity or detail, respectively.

in order to generate new biological hypotheses and to gain insights into molecular mechanisms, it is important to identify relevant network characteristics and patterns. for this purpose, the straightforward approach is the visual exploration of the network. table lists a selection of cytoscape plug-ins that assist the user in this analysis task, for instance, by identifying putative complexes (mcode), by grouping proteins that show a similar expression profile (jactivemodules), or by identifying overrepresented go terms (bingo, golorize). however, the inclusion of complex data such as time-series results or diverse gene ontology (go) terms into the network visualization might not be feasible without further software support. particularly in the case of huge, highly connected, or dynamic networks, more advanced visualization techniques will be required in the future.

fig. : visual representation of a subset of the gal network in yeast. the protein nodes are colored with a red-to-green gradient according to their expression value; green represents the lowest value, red the highest value, and blue a missing value. the node size indicates the number of interactions (node degree); the larger a node, the higher its degree. the colors and styles of the edges represent different interaction types; solid black lines represent protein-protein interactions, dashed red lines protein-dna interactions.

in addition to the visual presentation of interaction networks, cytoscape can also be used to perform statistical analyses. for instance, the networkanalyzer plug-in (assenov et al. ) computes a large variety of topology parameters for all types of networks. the computed simple and complex topology parameters are represented as single values and distributions, respectively.
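most of the simple and complex parameters listed in the next paragraph can also be computed outside the cytoscape gui. the sketch below uses networkx on a small invented graph; networkanalyzer itself may define or normalize some of these quantities differently.

import networkx as nx

graph = nx.Graph([("p1", "p2"), ("p2", "p3"), ("p3", "p4"),
                  ("p2", "p4"), ("p4", "p5"), ("p5", "p6")])

# simple parameters: single values describing the whole network
print("nodes/edges:", graph.number_of_nodes(), graph.number_of_edges())
print("diameter:", nx.diameter(graph))
print("radius:", nx.radius(graph))
print("average clustering coefficient:", nx.average_clustering(graph))
print("characteristic path length:", nx.average_shortest_path_length(graph))

# complex parameters: distributions, e.g. node degrees and per-node clustering
print("degree distribution:", nx.degree_histogram(graph))
print("clustering coefficient per node:", nx.clustering(graph))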
examples of simple parameters are the number of nodes and edges, the average number of neighbors, the network diameter and radius, the clustering coefficient, and the characteristic path length. complex parameters are distributions of node degrees, neighborhood connectivities, average clustering coefficients, and shortest path lengths. these computed statistical results can be exported in textual or graphical form and are additionally stored as node attributes. the user can then apply the calculated attributes to select certain network parts or to map them onto the visual representation of the analyzed network as described above (fig. ). it is also possible to fit a power law to the node degree distribution, which can frequently indicate a so-called scale-free network with few highly connected nodes (hubs) and many other nodes with a small number of interactions. scale-free networks are especially robust against failures of randomly selected nodes, but quite vulnerable to defects of hubs (albert ). how many ppis exist in a living cell? the yeast genome encodes approximately gene products, which means that the maximal possible number of interacting protein pairs in this organism is close to million, but what fraction of these potential interactions is actually realized in nature? for a given experimental method, such as the two-hybrid assay, the estimate of the total number of interactions in the cell is given by n_total = n_measured × (1 − r_fp) / (1 − r_fn), where n_measured is the number of interactions identified in the experiment, and r_fp and r_fn are the false-positive and false-negative rates of the method. r_fn can be roughly estimated based on the fraction of interactions known with confidence (e.g., those confirmed by three-dimensional structures) that are being recovered by the method. assessing r_fp is much more difficult because no experimental information on proteins that do not interact is currently available. since it is known that proteins belonging to the same functional class often interact, one very indirect way of calculating r_fn is as the fraction of functionally related proteins not found to be interacting. an even more monumental problem is the estimation of the total number of unique, structurally equivalent interaction types existing in nature. an interaction type is defined as a particular mutual orientation of two specific interacting domains. in some cases homologous proteins interact in a significantly different fashion, while in other cases proteins lacking sequence similarity engage in interactions of the same type. in general, however, interacting protein pairs sharing a high degree of sequence similarity ( - % or higher) between their respective components almost always form structurally similar complexes (aloy et al. ). this observation allows utilization of available atomic-resolution structures of complexes for building useful models of closely related binary complexes. the total number of interaction types can then be estimated as n_types = n_total × c × e_all-species, where the interaction similarity multiplier c reflects the clustering of all interactions of the same type, and e_all-species extrapolates from one biological species to all organisms. aloy and russell ( ) derived an estimate for c by grouping interactions between proteins that share high sequence similarity, as discussed above. c depends on the number of paralogous sequences encoded in a given genome. for small prokaryotic organisms it is close to while for larger and more redundant genomes it adopts smaller values, typically in the range of . - . .
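reduced to code, the two estimates above are a few lines of arithmetic. the sketch below implements the formulas as reconstructed here; all numeric inputs are placeholders, not the published values.

def total_interactions(n_measured, r_fp, r_fn):
    # estimate of all interactions in the cell from one screen:
    # n_total = n_measured * (1 - r_fp) / (1 - r_fn)
    return n_measured * (1.0 - r_fp) / (1.0 - r_fn)

def interaction_types_all_species(n_total, c, e_all_species):
    # extrapolate to structurally distinct interaction types across organisms:
    # n_types = n_total * c * e_all_species
    return n_total * c * e_all_species

# placeholder numbers for a hypothetical two-hybrid screen
n_total = total_interactions(n_measured=950, r_fp=0.5, r_fn=0.9)
print("estimated interactions in this organism:", round(n_total))
print("estimated interaction types over all species:",
      round(interaction_types_all_species(n_total, c=0.2, e_all_species=25)))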
the multiplier for all species e allspecies can be derived by assessing what fraction of known protein families is encoded in a given genome. based on the currently available data this factor is close to for bacteria, which means that a medium size prokaryotic organism contains around one tenth of all protein families. for eukaryotic organisms e all-species lies between and . for the comprehensive two-hybrid screen of yeast by (uetz ) in which interactions between proteins were identified, aloy and russell ( ) estimated c, r fp, and r fn , and e all-species to be . , . , . , and . respectively, leading to an estimated different interaction types in yeast alone, and over all species. based on the two-hybrid interaction map of the fly (giot ) the number of all interaction types in nature is estimated to be . it is thus reasonable to expect the total number of interaction types to be around , , and only are currently known. beyond binary interactions, proteins often form large molecular complexes involving multiple subunits (fig. ) . these complexes are much more than a random snapshot of a group of interacting proteinsthey represent large functional entities which remain stable for long periods of time. many such protein complexes have been elucidated step by step over time and recent advances in high-throughput technology have led to largescale studies revealing numerous new protein complexes. the preferred technology for this kind of experiment is initial co-purification of the complexes followed by the identification of the member proteins by mass spectrometry. as the bakers yeast s. cerevisiae is one of the most versatile model organisms used in molecular biology, it is not surprising that the first large-scale complex datasets were obtained in this species (gavin et al. ; ho et al. ; gavin et al. ; krogan et al. ). the yeast protein interaction database mpact (guldener et al. ) provides access to protein complexes based on careful literature annotation composed of different proteins plus over complexes from large-scale experiments which contain more than distinct proteins. these numbers contain some redundancy with respect to complexes, due to slightly different complex composition found by different groups or experiments. nevertheless, the dataset covers about % of the s.cerevisiae proteome. while many complexes comprise only a small number of different proteins, the largest of them features an impressive different protein species. a novel manually annotated database, corum (ruepp et al. ) contains literature-derived information about mammalian multi-protein complexes. over % of all complexes contain between three and six subunits, while the largest molecular structure, the spliceosome, consists of components (fig. ). modularity has emerged as one of the major organizational principles of cellular processes. functional modules are defined as molecular ensembles with an autonomous function (hartwell et al. ) . proteins or genes can be partitioned into modules based on shared patterns of regulation or expression, involvement in a common metabolic or regulatory pathway, or membership in the same protein complex or subcellular structure. modular representation and analysis of cellular processes allows for inter- pretation of genome data beyond single gene behavior. in particular, analysis of modules provides a convenient framework for studying the evolution of living systems (snel and huynen ) . 
multiprotein complexes represent one particular type of functional module in which individual components engage in physical interactions to execute a specific cellular function. algorithmically, modular architectures can be defined as densely interconnected groups of nodes in biological networks (for an excellent review of available methods see sharan et al. ). statistically significant functional subnetworks are characterized by a high degree of local clustering. the density of a cluster can be represented as a function q(m,n) = 2m/(n(n − 1)), where m is the number of interactions between the n nodes of the cluster (spirin and mirny ). q thus takes values between 0 for a set of unconnected nodes and 1 for a fully connected cluster (clique). the statistical significance of q strongly depends on the size of the graph. it is obvious that random clusters with q = 1 involving just three proteins are very likely, while large clusters with q = 1 or even with values below . are extremely unlikely. in order to compute the statistical significance of a cluster with n nodes and m connections, spirin and mirny calculate the expected number of such clusters in a comparable random graph and then estimate the likelihood of having m or more interactions within a given set of n proteins, given the number of interactions that each of these proteins has. significant dense clusters identified by this procedure on a graph of protein interactions were found to correspond to functional modules, most of which are involved in transcription regulation, cell-cycle/cell-fate control, rna processing, and protein transport. however, not all of them constitute physical protein complexes and, in general, it is not possible to predict whether a given module corresponds to a multiprotein complex or just to a group of functionally coupled proteins involved in the same cellular process. the search for significant subgraphs can be further enhanced by considering evolutionary conservation of protein interactions. with this approach protein complexes are predicted from binary interaction data by network alignment, which involves comparing interaction graphs between several species (sharan et al. ). first, proteins are grouped by sequence similarity such that each group contains one protein from each species, and each protein is similar to at least one other protein in the group. then a composite interaction network is created by joining with edges those pairs of groups that are linked by at least one conserved interaction. again, dense clusters on such a network alignment graph are often indicative of multiprotein complexes. an alternative computational method for deriving complexes from noisy large-scale interaction data relies on a "socio-affinity" index which essentially reflects the frequency with which proteins form partnerships detected by co-purification (gavin et al. ). this index was shown to correlate well with available three-dimensional structure data, dissociation constants of protein-protein interactions, and binary interactions identified by two-hybrid techniques. by applying a clustering procedure to a matrix containing the values of the socio-affinity index for all yeast protein pairs found to associate by affinity purification, complexes were predicted, with over half of them being novel and previously unknown.
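the density measure q and a crude significance check can be implemented directly from the definition given above. the sketch below compares the density of a candidate node set against random node sets of the same size in random graphs with the same number of nodes and edges; it illustrates the idea but is not the exact procedure of spirin and mirny.

import random
import networkx as nx

def cluster_density(graph, nodes):
    # q = 2m / (n (n - 1)) for the subgraph induced by the given nodes
    n = len(nodes)
    m = graph.subgraph(nodes).number_of_edges()
    return 2.0 * m / (n * (n - 1))

def empirical_p_value(graph, nodes, trials=1000, seed=0):
    # fraction of trials in which a same-sized node set drawn from a random
    # graph with identical node and edge counts is at least as dense
    rng = random.Random(seed)
    observed = cluster_density(graph, nodes)
    n, m, k = graph.number_of_nodes(), graph.number_of_edges(), len(nodes)
    hits = 0
    for _ in range(trials):
        random_graph = nx.gnm_random_graph(n, m, seed=rng.randint(0, 10**6))
        sample = rng.sample(list(random_graph.nodes()), k)
        if cluster_density(random_graph, sample) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# illustrative usage on an invented graph containing one dense triangle
g = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")])
print(cluster_density(g, ["a", "b", "c"]), empirical_p_value(g, ["a", "b", "c"]))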
however, depending on the analysis parameters, distinct complex variants (isoforms) are found that differ from each other in their subunit composition. those proteins present in most of the isoforms of a given complex constitute its core, while variable components present only in a small number of isoforms can be considered "attachments" (fig. ). furthermore, some stable, typically smaller protein groups can be found in multiple attachments, in which case they are called "modules". stable functional modules can thus be flexibly used in the cell in a variety of functional contexts. proteins frequently associated with each other in complex cores and modules are likely to be co-expressed and co-localized.

fig. : definitions of complex cores, attachments, and modules. redrawn and modified with permission from (gavin et al. ).

in this section, we offer a computational perspective on utilizing protein network data for molecular medical research. the identification of novel therapeutic targets for diseases and the development of drugs have always been a difficult, time-consuming and expensive venture (ruffner et al. ). recent work has charted the current pharmacological space using different networks of drugs and their protein targets (paolini et al. ; keiser et al. ; kuhn et al. ; yildirim et al. ) based on biochemical relationships like ligand binding energy and molecular similarity or on shared disease association. above all, since many diseases are due to the malfunctioning of proteins, the systematic determination and exploration of the human interactome and homologous protein networks of model organisms can provide considerable new insight into pathophysiological processes (giallourakis et al. ). knowledge of protein interactions can frequently improve the understanding of relevant molecular pathways and the interplay of various proteins in complex diseases (fishman and porter ). this approach may result in the discovery of a considerable number of novel drug targets for the biopharmaceutical industry, possibly affording the development of multi-target combination therapeutics. observed perturbations of protein networks may also offer a refined molecular description of the etiology and progression of disease, in contrast to phenotypic categorization of patients (loscalzo et al. ). molecular network data may help to improve the ability to catalog disease unequivocally and to further individualize diagnosis, prognosis, prevention, and therapy. this will require a network-based approach that includes not only protein interactions to differentiate pathophenotypes, but also other types of molecular interactions as found in signaling cascades and metabolic pathways. furthermore, environmental factors like pathogens interacting with the human host or the effects of nutrition need to be taken into account. after large-scale screens identified enormous numbers of protein interactions in organisms like yeast, fly, and worm (goll and uetz ), which also serve as model systems for studying many human disease mechanisms (giallourakis et al. ), experimental techniques and computational prediction methods have recently been applied to generate sizable networks of human proteins (cusick et al. ; stelzl and wanker ; assenov et al. ; ramírez et al. ). in addition, comprehensive maps of protein interactions inside pathogens and between pathogens and the human host have been compiled for bacteria like e. coli, h. pylori, c. jejuni, and other species (noirot and noirot-gros ), for many viruses such as herpes viruses, the epstein-barr virus, the sars coronavirus, hiv- , the hepatitis c virus, and others (uetz et al. ), and for the malaria parasite p. falciparum (table ). those extensive network maps can now be explored to identify potential drug targets and to block or manipulate important protein-protein interactions. furthermore, different experimental methods are also used to expand the known interaction networks around pathway-centric proteins like epidermal growth factor receptors (egfrs) (tewari et al. ; oda et al. ; jones et al. ), smad and transforming growth factor-b (tgfb) (colland and daviet ; tewari et al. ; barrios-rodiles et al. ), and tumor necrosis factor-a (tnfa) and the transcription factor nf-kb (bouwmeester et al. ). all of these proteins are involved in sophisticated signal transduction cascades implicated in various important disease indications ranging from cancer to inflammation. the immune system and toll-like receptor (tlr) pathways were the subject of other detailed studies (oda and kitano ). apart from that, protein networks for longevity were assembled to research ageing-related effects (xue et al. ). high-throughput screens are also conducted for specific disease proteins causative of closely related clinical and pathological phenotypes to unveil molecular interconnections between the diseases. for example, similar neurodegenerative disease phenotypes are caused by polyglutamine proteins like huntingtin and over twenty ataxins. although they are not evolutionarily related and their expression is not restricted to the brain, they are responsible for inherited neurotoxicity and age-dependent dementia only in specific neuron populations (ralser et al. ). yeast two-hybrid screens revealed an unexpectedly dense interaction network of those disease proteins forming interconnected subnetworks (fig. ), which suggests common pathways affected in disease (goehler et al. ; lim et al. ). some of the protein-protein interactions may be involved in mediating neurodegeneration and thus may be tractable for drug inhibition, and several interaction partners of ataxins could additionally be shown to be potential disease modifiers in a fly model (kaltenbach et al. ). a number of methodological approaches concentrate on deriving correlations between common topological properties and biological function from subnetworks around proteins that are associated with a particular disease phenotype like cancer. recent studies report that human disease-associated proteins with similar clinical and pathological features tend to be more highly connected with each other than with other proteins and to have more similar transcription profiles (xu and li ; goh et al. ). this observation points to the existence of disease-associated functional modules. interestingly, in contrast to disease genes, essential genes whose defect may be lethal early in life are frequently found to be hubs central to the network. further work focused on specific disease-relevant networks. for instance, to analyze experimental asthma, differentially expressed genes were mapped onto a protein interaction network. here, highly connected nodes tended to have smaller expression changes than peripheral nodes. this agrees with the general notion that disease-causing genes are typically not central in the network.
similarly, a comprehensive protein network analysis of systemic inflammation in human subjects investigated blood leukocyte gene expression patterns when receiving an inflammatory stimulus, a bacterial endotoxin, to identify functional modules perturbed in response to this stimulus (calvano et al. ) . topological criteria and gene expression data were also used to search protein networks for functional modules that are relevant to type diabetes mellitus or to different types of cancer (jonsson and bates ; cui et al. ; lin et al. ; pujana et al. ). moreover, it was recently demonstrated that the integration of gene expression profiles with subnetworks of interacting proteins can lead to improved prognostic markers for breast cancer outcome that are more reproducible between patient cohorts than sets of individual genes selected without network information (chuang et al. ). in drug discovery, protein networks can help to design selective inhibitors of protein-protein interactions which target specific interactions of a protein, but do not affect others (wells and mcclendon ) . for example, a highly connected protein (hub) may be a suitable target for an antibiotic whereas a more peripheral protein with few interaction partners may be more appropriate for a highly specific drug that needs to avoid side effects. thus, topological network criteria are not only useful for characterizing disease proteins, but also for finding drug targets. the diversity of interactions of a targeted protein could also help in predicting potential side effects of a drug. apart from that, it is remarkable that some potential drugs have been found to be less effective than expected due to the intrinsic robustness of living systems against perturbations of molecular interactions (kitano ) . furthermore, mutations in proteins cause genetic diseases, but it is not always easy to distinguish protein interactions impaired by mutated binding sites from other disease causes like structural instability induced by amino acid mutations. nowadays many genome-wide association and linkage studies for human diseases suggest genomic loci and linkage intervals that contain candidate genes encoding snps and mutations of potential disease proteins (kann ) . since the resultant list of candidates frequently contain dozens or even hundreds of genes, computational approaches have been developed to prioritize them for further analyses and experiments. in the following, we will demonstrate the variety of available prioritization approaches by explicating three recent methods that utilize protein interaction data in addition to the inclusion of other sequence and function information. all methods capitalize on the above described observation that closely interacting gene products often underlie polygenic diseases and similar pathophenotypes (oti and brunner ) . using protein-protein interaction data annotated with reliability values, lage et al. ( ) first predict human protein complexes for each candidate protein. they then score the pairwise phenotypic similarity of the candidate disease with all proteins within each complex that are associated with any disease. the scoring function basically measures the overlap of the respective disease phenotypes as recorded in text entries of omim (online mendelian inheritance in man) (hamosh et al. ) based on the vocabulary of umls (unified medical language system) (bodenreider ) . 
lastly, all candidates are prioritized by the probability returned by a bayesian predictor trained on the interaction data and phenotypic similarity. therefore, this method depends on the premise that the phenotypic effects caused by any disease-affected member in a predicted protein complex are very similar to each other. another prioritization approach by franke et al. ( ) does not make use of overlapping disease phenotypes and primarily aims at connecting physically disjoint genomic loci associated with the same disease using molecular networks. at the beginning, their method prioritizer performs a bayesian integration of three different network types of gene/protein relationships. the latter are derived from functional similarity using gene ontology annotation, microarray coexpression, and proteinprotein interaction. this results in a probabilistic human network of general functional links between genes. prioritizer then assesses which candidate genes contained in different disease loci are closely connected in this gene-gene network. to this end, the score of each candidate is initially set to zero, but it is increased iteratively during network exploration by a scoring function that depends on the network distance of the respective candidate gene to candidates inside another genomic loci. this procedure finally yields separate prioritization lists of ranked candidate genes for each genomic loci. in contrast to the integrated gene-gene network used by prioritizer, the endeavour system (aerts et al. ) directly compares candidate genes with known disease genes and creates different ranking lists of all candidates using various sources of evidence for annotated relationships between genes or proteins. the evidence can be derived from literature mining, functional associations based on gene ontology annotations, co-occurrence of transcriptional motifs, correlation of expression data, sequence similarity, common protein domains, shared metabolic pathway membership, and protein-protein interactions. at the end, endeavour merges the resultant ranking lists using order statistics and computes an overall prioritization list of all candidate genes. finally, it is important to keep in mind that current datasets of human protein interactions may still contain a significant number of false interactions and thus biological and medical conclusions derived from them should always be taken with a note of caution, in particular, if no good confidence measures are available. a comprehensive atlas of protein interactions is fundamental for a better understanding of the overall dynamic functioning of the living organisms. these insights arise from the integration of functional information, dynamic data and protein interaction networks. in order to fulfill the goal of enlarging our view of the protein interaction network, several approaches must be combined and a crosstalk must be established among experimental and computational methods. this has become clear from comparative evaluations which show similar performances for both types of methodologies. in fact, over the recent years this field has grown into one of the most appealing fields in bioinformatics. evolutionary signals result from restrictions imposed by the need to optimize the features that affect a given interaction and the nature of these features can differ from interaction to interaction. consequently, a number of different methods have been developed based a range of different evolutionary signals. 
this section is devoted to a brief review of some of these methods. these techniques are based on the similarity of absence/presence profiles of interacting proteins. in its original formulation (gaasterland and ragan ; huynen and bork ; pellegrini et al. ; marcotte et al. a), phylogenetic profiles were codified as 0/1 vectors for each reference protein according to the absence/presence of proteins of the studied family in a set of fully sequenced organisms (see fig. a). the vectors for different reference sequences are compared using the hamming distance (pellegrini et al. ), which counts the number of differences between two binary vectors. the rationale for this method is that both interacting proteins must be present in an organism and that reductive evolution will remove unpaired proteins in the rest of the organisms. proposed improvements include the inclusion of quantitative measures of sequence divergence (marcotte et al. b; date and marcotte ) and the ability to deal with biases in the taxonomic distribution of the organisms used (date and marcotte ; barker and pagel ). these biases are due to the intuitive fact that evolutionarily similar organisms will share a higher number of protein and genomic features (in this case presence/absence of an orthologue). to reduce this problem, date et al. used mutual information between profiles based on sequence divergence to measure the amount of information shared by two vectors. mutual information is calculated as mi(p1, p2) = h(p1) + h(p2) − h(p1, p2), where h(p1) = −Σ p(p1) ln p(p1) is the marginal entropy of the probability distribution of protein p1 sequence distances, and h(p1, p2) = −Σ p(p1, p2) ln p(p1, p2) is the joint entropy of the probability distributions of both protein p1 and p2 sequence distances. the corresponding probabilities are calculated from the whole distribution of orthologue distances for the organisms. in this way, the most likely evolutionary distances between orthologues from a pair of organisms will produce smaller entropies and consequently smaller values of mutual information. this formulation should implicitly reduce the effect of taxonomic biases. in interesting recent work, barker et al. ( ) showed that detection of correlated gene-gain/gene-loss events improves the predictions by reducing the number of false positives due to taxonomic biases. the phylogenetic profiling approach has been shown to be quite powerful, because its simple formulation has allowed the exploration of a number of alternative interdependencies between proteins. this is the case for enzyme "displacement" in metabolic pathways detected as anti-correlated profiles (morett et al. ), and for complex dependence relations among triplets of proteins (bowers et al. ). phylogenetic profiles have also been correlated with bacterial traits to predict the genes related to particular phenotypes (korbel et al. ). the main drawbacks of these methods are the difficulty of dealing with essential proteins (where there is no absence information) and the requirement for the genomes under study to be complete (to establish the absence of a family member).
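both profile comparisons just described, the hamming distance between 0/1 profiles and the mutual information between discretized distance profiles, are compact to implement. the sketch below assumes profiles of equal length are already available; the example values and the discretization into a few categories are arbitrary choices.

from collections import Counter
from math import log

def hamming(profile_a, profile_b):
    # number of organisms in which exactly one of the two proteins is present
    return sum(a != b for a, b in zip(profile_a, profile_b))

def mutual_information(profile_a, profile_b):
    # mi(p1, p2) = h(p1) + h(p2) - h(p1, p2) for discrete profiles
    def entropy(symbols):
        counts = Counter(symbols)
        total = len(symbols)
        return -sum((c / total) * log(c / total) for c in counts.values())
    joint = list(zip(profile_a, profile_b))
    return entropy(profile_a) + entropy(profile_b) - entropy(joint)

# presence/absence profiles over six genomes (illustrative values)
p1 = [1, 0, 1, 1, 0, 1]
p2 = [1, 0, 1, 0, 0, 1]
print("hamming distance:", hamming(p1, p2))
print("mutual information:", round(mutual_information(p1, p2), 3))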
fig. : prediction of protein interactions based on genomic and sequence features. information coming from the sets of close homologs, in other organisms, of the proteins p and p from the organism under study can be used to predict an interaction between these proteins. (a) phylogenetic profiling: presence/absence of a homolog of both proteins in different organisms is coded as two 0/1 profiles (simplest approach), and an interaction is predicted for very similar profiles. (b) similarity of phylogenetic trees: multiple sequence alignments are built for both sets of proteins, and phylogenetic trees are derived from the proteins with a possible partner present in their organism; proteins with highly similar trees are predicted to interact. (c) gene neighbourhood conservation: genomic closeness is checked for those genes coding for both sets of homologous proteins; an interaction is predicted if gene pairs are recurrently close to each other in a number of organisms. (d) gene fusion: finding proteins that contain different sequence regions homologous to each of the two proteins is used to predict an interaction between them.

similarity in the topology of the phylogenetic trees of interacting proteins has been qualitatively observed in a number of cases (fryxell ; pages et al. ; goh et al. ). the extension of this observation to a quantitative method for the prediction of protein interactions requires measuring the correlation between the similarity matrices of the explored pairs of protein families (goh et al. ). this formulation allows systematic evaluation of the validity of using the original observation as a signal of protein interaction (pazos and valencia ). the general protocol for these methods is illustrated in fig. b. it includes building a multiple sequence alignment for the set of orthologues (one per organism) related to every query sequence, calculating all pairwise evolutionary distances between the proteins (derived from the corresponding phylogenetic trees), and finally comparing the evolutionary distance matrices of pairs of query proteins using pearson's correlation coefficient. protein pairs with highly correlated distance matrices are predicted to be more likely to interact. although this signal has been shown to be significant, the underlying process responsible for this similarity is still controversial (chen and dokholyan ). there are two main hypotheses for explaining this phenomenon. the first hypothesis suggests that this evolutionary similarity comes from the mutual adaptation (co-evolution) of interacting proteins and the need to retain interaction features while sequences diverge. the second hypothesis implicates external factors. in this scenario, the restrictions imposed by evolution on the functional process implicating both proteins would be responsible for the parallelism of their phylogenetic trees. although the relative importance of both factors is still not clear, the predictive power of similarities in phylogenetic trees is not affected. indeed, a number of developments have improved the original formulation (pazos et al. ; sato et al. ). the first advance involved managing the intrinsic similarity of the trees that is due to the common underlying taxonomic distribution (caused by the speciation processes). this effect is analogous to the taxonomic biases discussed above. in these cases, the approach followed was to correct both trees by removing this common trend. for example, pazos et al. subtracted the distances of the 16s rrna phylogenetic tree from the corresponding distances for each protein tree. the correlations of the resulting distance matrices were used to predict protein interactions. additionally, some analyses have focused on the selection of the sequence regions used for tree building (jothi et al. ; kann et al. ).
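the core of the tree-similarity protocol is a pearson correlation between the upper triangles of two inter-orthologue distance matrices. the sketch below shows only this basic step, over invented four-organism matrices; it does not include the rrna-based correction or the co-evolutionary-profile refinements discussed here.

from math import sqrt

def upper_triangle(matrix):
    # flatten the strictly upper triangle of a square distance matrix
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def tree_similarity(dist_a, dist_b):
    # correlation between two protein families' evolutionary distance matrices
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))

# illustrative 4-organism distance matrices for two protein families
fam1 = [[0.0, 0.2, 0.5, 0.7], [0.2, 0.0, 0.4, 0.6],
        [0.5, 0.4, 0.0, 0.3], [0.7, 0.6, 0.3, 0.0]]
fam2 = [[0.0, 0.1, 0.6, 0.8], [0.1, 0.0, 0.5, 0.7],
        [0.6, 0.5, 0.0, 0.2], [0.8, 0.7, 0.2, 0.0]]
print("tree similarity:", round(tree_similarity(fam1, fam2), 3))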
for example, it has been shown that interacting regions, both defined as interacting residues (using structural data) and as the sequence domain involved in the interaction, show more clear tree similarities than the whole proteins (mintseris and weng ; jothi et al. ) . other interesting work showed that prediction performance can be improved by removing poorly conserved sequence regions ). finally, in a very recent work (juan et al. ) the authors have suggested a new method for removing noise in the detection of tree similarity signals and detecting different levels of evolutionary parallelism specificity. this method introduces the new strategy of using the global network of protein evolutionary similarity for a better calibration of the evolutionary parallelism between two proteins. for this purpose, they define a protein co-evolutionary profile as the vector containing the evolutionary correlations between a given protein tree and all the rest of the protein trees derived from sequences in the same organism. this co-evolutionary profile is a more robust and comparable representation of the evolution of a given protein (it involves hundreds of distances) and can be used to deploy a new level of evolutionary comparison. the authors compare these co-evolutionary profiles by calculating pearsons correlation coefficient for each pair. in this way, the method detects pairs of proteins for which high evolutionary similarities are supported by their similarities with the rest of proteins of the organism. this approach significantly improves the predictive performance of the tree similaritybased methods so that different degrees of co-evolutionary specificity are obtained according to the number of proteins that might be influencing the co-evolution of the studied pair. this is done by extending the approach of sato et al. ( ) , that uses partial correlations and a reduced set of proteins for determining specific evolutionary similarities. juan et al. calculated the partial correlation for each significant evolutionary similarity with respect to the remaining proteins in the organism and defined levels of co-evolutionary specificity according to the number of proteins that are considered to be co-evolving with each studied protein pair. with this strategy, its possible to detect a range of evolutionary parallelisms from the protein pairs (for very specific similarities) up to subsets of proteins (for more relaxed specificities) that are highly evolution dependent. interestingly, if specificity requirements are relaxed, protein relationships among components of macro-molecular complexes and proteins involved in the same metabolic process can be recovered. this can be considered as a first step in the application of higher orders of evolutionary parallelisms to decode the evolutionary impositions over the protein interaction network. this method exploits the well-known tendency of bacterial organisms to organize proteins involved in the same biochemical process by clustering them in the genome. this observation is obviously related to the operon concept and the mechanisms for the coordination of transcription regulation of the genes present in these modules. these mechanisms are widespread among bacterial genomes. therefore the significance of a given gene proximity can be established by its conservation in evolutionary distant species (dandekar et al. ; overbeek et al. ) . the availability of fully sequenced organisms makes computing the intergenic distances between each pair of genes easy. 
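with gene coordinates in hand, the neighbourhood check is simple bookkeeping. the sketch below flags a pair of genes as conserved neighbours if they lie on the same strand within a distance threshold in at least a minimum number of genomes; the coordinate records, gene names, and thresholds are all illustrative.

# per-genome gene coordinates: gene -> (start, end, strand); values are invented
genomes = {
    "org1": {"geneA": (1000, 1900, "+"), "geneB": (2100, 3000, "+")},
    "org2": {"geneA": (500, 1400, "-"), "geneB": (1450, 2300, "-")},
    "org3": {"geneA": (10000, 10900, "+"), "geneB": (52000, 52900, "-")},
}

def are_neighbours(coords_a, coords_b, max_gap=300):
    # same strand and intergenic distance below the (arbitrary) threshold
    (start_a, end_a, strand_a), (start_b, end_b, strand_b) = coords_a, coords_b
    if strand_a != strand_b:
        return False
    gap = max(start_a, start_b) - min(end_a, end_b)
    return gap <= max_gap

def conserved_neighbourhood(gene_a, gene_b, genomes, min_genomes=2):
    support = sum(
        1 for genes in genomes.values()
        if gene_a in genes and gene_b in genes
        and are_neighbours(genes[gene_a], genes[gene_b])
    )
    return support >= min_genomes

print(conserved_neighbourhood("geneA", "geneB", genomes))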
genes with the same direction of transcrip-tion and closer than bases are typically considered to be in the same genomic context (see fig. c ). the conservation of this closeness must be found in more than two highly divergent organisms to be considered significant because of the taxonomic biases. while this signal is strong in bacterial genomes, its relevance is unclear in eukaryotic genomes. this is the main drawback of these methodologies. in fact, this signal only can be exploited for eukaryotic organisms by extrapolating genomic closeness of bacterial genes to their homologues in eukaryotes. obviously, this extrapolation leads to a considerable reduction in the confidence and number of obtained predictions for this evolutionary lineage. however, conserved gene pairs that are transcribed from a shared bidirectional promoter can be detected by similar methods and can found in eukaryotes as well as prokaryotes (korbel et al. ) a further use of evolutionary signals in protein function and physical interaction prediction has been the tendency of interacting proteins to be involved in gene fusion events. sequences that appear as independently expressed orfs in one organism become fused as part of the same polypeptide sequence in another organism. these fusions are strong indicators of functional and structural interaction that have been suggested to increase the effective concentration of interacting functional domains (enright et al. ; marcotte et al. b ). this hypothesis proposes that gene fusion could remove the effect of diffusion and relative correct orientation of the proteins forming the original complex. these fusion events are typically detected when sequence searches for two nonhomologous proteins obtain a significant hit in the same sequence. cases matching to the same region of the hit sequence are removed (these cases are schematically represented in fig. d ). in spite of the strength of this signal, gene fusion seems to not be a habitual event in bacterial organisms. the difficulty of distinguishing protein interactions belonging to large evolutionary families is the main drawback of the automatic application of these methodologies. integration of experimentally determined and predicted interactions as described above, there are many both experimental techniques and computational methods for determining and predicting interactions. to obtain the most comprehensive interaction networks possible, as many as possible of these sources of interactions should be integrated. the integration of these resources is complicated by the fact that the different sources are not all equally reliable, and it is thus important to quantify the accuracy of the different evidence supporting an interaction. in addition to the quality issues, comparison of different interaction sets is further complicated by the different nature of the datasets: yeast two-hybrid experiments are inherently binary, whereas pull-down experiments tend to report larger complexes. to allow for comparisons, complexes are typically represented by binary interaction networks; however, it is important to realize that there is not a single, clear definition of a "binary interaction". for complex pull-down experiments, two different representations have been proposed: the matrix representation, in which each complex is represented by the set of binary interactions corresponding to all pairs of proteins from the complex, and the spoke representation, in which only bait-prey interactions are included (von mering et al. ) . 
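the two representations can be illustrated with a small sketch (hypothetical data layout): a purified complex given as a bait and its preys is expanded either into all pairwise interactions (matrix model) or into bait-prey pairs only (spoke model).

    from itertools import combinations

    def matrix_representation(bait, preys):
        """All pairwise interactions among the members of the purified complex."""
        members = [bait] + list(preys)
        return {frozenset(pair) for pair in combinations(members, 2)}

    def spoke_representation(bait, preys):
        """Only interactions between the bait and each prey."""
        return {frozenset((bait, prey)) for prey in preys}

    # example: a pull-down with bait "A" and preys "B" and "C"
    # matrix model: {A-B, A-C, B-C}; spoke model: {A-B, A-C}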
the binary interactions obtained using either of these representations are somewhat artificial as some interacting proteins might in reality never touch each other and others might have too low an affinity to interact except in the context of the entire complex bringing them together. even in the case of yeast two-hybrid assays, which inherently report binary interactions, not all interactions correspond to direct physical interactions. the database string ("search tool for the retrieval of interacting genes/proteins") (von mering et al. ) represents an effort to provide many of the different types of evidence for functional interactions under one common framework with an integrated scoring scheme. such an integrated approach offers several unique advantages: ) various types of evidence are mapped onto a single, stable set of proteins, thereby facilitating comparative analysis; ) known and predicted interactions often partially complement each other, leading to increased coverage; and ) an integrated scoring scheme can provide higher confidence when independent evidence types agree. in addition to the many associations imported from the protein interaction databases mentioned above (bader et al. ; salwinski et al. ; guldener et al. ; mishra et al. ; stark et al. ; chatr-aryamontri et al. ), string also includes interactions from curated pathway databases (vastrik et al. ; kanehisa et al. ) and a large body of predicted associations that are produced de novo using many of the methods described in this chapter (dandekar et al. ; gaasterland and ragan ; pellegrini et al. ; marcotte et al. c). these different types of evidence are obviously not directly comparable, and even for the individual types of evidence the reliability may vary. to address these two issues, string uses a two-stage approach. first, a separate scoring scheme is used for each evidence type to rank the interactions according to their reliability; these raw quality scores cannot be compared between different evidence types. second, the ranked interaction lists are benchmarked against a common reference to obtain probabilistic scores, which can subsequently be combined across evidence types. to exemplify how raw quality scores work, we will here explain the scoring scheme used for physical protein interactions from high-throughput screens. the two fundamentally different types of experimental interaction data sets, complex pull-downs and binary interactions, are evaluated using separate scoring schemes. for the binary interaction experiments, e.g. yeast two-hybrid, the reliability of an interaction correlates well with the number of non-shared interaction partners for each interactor. string summarizes this in the following raw quality score: log((n + ) · (n + )), where n and n are the numbers of non-shared interaction partners. this score is similar to the ig measure suggested by saito et al. ( ). in the case of complex pull-down experiments, the reliability of the inferred binary interactions correlates better with the number of times the interactors were co-purified compared to what would be expected at random: where n is the number of purifications containing both proteins, n and n are the numbers of purifications containing either protein or , and n is the total number of purifications. for this purpose, the bait protein was counted twice to account for bait-prey interactions being more reliable than prey-prey interactions. these raw quality scores are calculated for each individual high-throughput screen.
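as a minimal sketch (not string's actual code), the degree-based raw score just described can be written as follows; the additive constant inside the logarithm is assumed to be one here, since the exact value is elided in the text above.

    import math

    def raw_score_binary(partners_a, partners_b):
        """Degree-based raw quality score for a binary interaction screen.
        partners_a / partners_b: sets of interaction partners reported for the two
        proteins; the score is only used for ranking within one dataset, and the
        later benchmarking step turns ranks into probabilities."""
        non_shared_a = len(partners_a - partners_b)  # partners seen only for protein a
        non_shared_b = len(partners_b - partners_a)  # partners seen only for protein b
        return math.log((non_shared_a + 1) * (non_shared_b + 1))  # the +1 is an assumption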
scores vary within one dataset, because they include additional, intrinsic information from the data itself, such as the frequency with which an interaction is detected. for medium-sized data sets that are not large enough to apply the topology-based scoring schemes, the same raw score is assigned to all interactions within a dataset. finally, very small data sets are pooled and considered jointly as a single interaction set. we similarly have different scoring schemes for predicted interactions based on co-expression in microarray expression studies, conserved gene neighborhood, gene fusion events and phylogenetic profiles. based on these raw quality scores, a confidence score is assigned to each predicted association by benchmarking the performance of the predictions against a common reference set of trusted, true associations. string uses as reference the functional grouping of proteins maintained at kegg (kyoto encyclopedia of genes and genomes) (kanehisa et al. ). any predicted association for which both proteins are assigned to the same "kegg pathway" is counted as a true positive. kegg pathways are particularly suitable as a reference because they are based on manual curation, are available for a number of organisms, and cover several functional areas. other benchmark sets could also be used, for example "biological process" terms from gene ontology (ashburner et al. ) or reactome pathways (vastrik et al. ). the benchmarked confidence scores in string generally correspond to the probability of finding the linked proteins within the same pathway or biological process. the assignment of probabilistic scores for all evidence types solves many of the issues of data integration. first, incomparable evidence types are made comparable by assigning a score that represents how well the evidence type can predict a certain type of interactions (the type being specified by the reference set used). second, the separate benchmarking of interactions from, for example, different high-throughput protein interaction screens accounts for any differences in reliability between different studies. third, use of raw quality scores allows us to separate more reliable interactions from less reliable interactions even within a single dataset. the probabilistic nature of the scores also makes it easy to calculate the combined reliability of an interaction given multiple lines of evidence. it is computed under the assumption of independence for the various sources, in a naïve bayesian fashion. in addition to having a good scoring scheme, it is crucial to make the evidence for an interaction transparent to the end users. to achieve this, the string interaction network is made available via a user-friendly web interface (http://string.embl.de). when performing a query, the user will first be presented with a network view, which provides a first, simplified overview (fig. ). from here the user has full control over parameters such as the number of proteins shown in the network (nodes) and the minimal reliability required for an interaction (edge) to be displayed. from the network, the user also has the ability to drill down on the evidence that underlies any given interaction using the dedicated viewer for each evidence type. for example, it is possible to inspect the publications that support a given interaction, the set of proteins that were co-purified in a particular experiment, and the phylogenetic profiles or genomic context based on which an interaction was predicted.
fig. protein interaction network of the core cell-cycle regulation in human. the network was constructed by querying the string database (von mering et al. ) for very high confidence interactions (conf. score > . ) between four cyclin-dependent kinases, their associated cyclins, the wee kinase and the cdc phosphatases. the network correctly recapitulates that cdc interacts with cyclin-a/b, cdk with cyclin-a/e, and cdk / with cyclin-d. it also shows that the wee and cdc phosphatases regulate cdc and cdk but not cdk and cdk . moreover, the network suggests that the cdc a phosphatase regulates cdc and cdk , whereas cdc b and cdc c specifically regulate cdc .
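the combination of benchmarked, probabilistic scores under the independence assumption mentioned above can be sketched as follows (a simplified illustration; the scheme actually used may additionally correct for a prior).

    def combine_scores(scores):
        """Combine per-channel probabilities for the same interaction,
        assuming the evidence channels are independent."""
        p_none = 1.0
        for s in scores:          # probability that no channel is correct
            p_none *= (1.0 - s)
        return 1.0 - p_none

    # example: three independent channels supporting the same interaction
    # combine_scores([0.6, 0.4, 0.3]) -> 0.832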
protein binding is commonly characterized by specific interactions of evolutionarily conserved domains (pawson and nash ). domains are fundamental units of protein structure and function, which are incorporated into different proteins by genetic duplications and rearrangements (vogel et al. ). globular domains are defined as structural units of fifty and more amino acids that usually fold independently of the remaining polypeptide chain to form stable, compact structures (orengo and thornton ). they often carry important functional sites and determine the specificity of protein interactions (fig. ).
fig. exemplary interaction between the two human proteins hhr b and ataxin- . each protein domain commonly adopts a particular d structure and may fulfill a specific molecular function. generally, the domains responsible for an observed protein-protein interaction need to be determined before further functional characterizations are possible. in the depicted protein-protein interaction, it is known from experiments that the ubiquitin-like domain ubl of hhr b (yellow) forms a complex with the de-ubiquitinating josephin domain of ataxin- (blue) (nicastro et al. ).
essential information on the cellular function of specific protein interactions and complexes can often be gained from the known functions of the interacting protein domains. domains may contain binding sites for proteins and ligands such as metabolites, dna/rna, and drug-like molecules (xia et al. ). widely spread domains that mediate molecular interactions can be found alone or combined in conjunction with other domains and intrinsically disordered, mainly unstructured, protein regions connecting globular domains (dunker et al. ). according to apic et al. ( ), multi-domain proteins constitute two thirds of unicellular and % of metazoan proteomes. one and the same domain can occur in different proteins, and many domains of different types are frequently found in the same amino acid chain. much effort is being invested in discovering, annotating, and classifying protein domains both from the functional (pfam (finn et al. ), smart (letunic et al. ), cdd (marchler-bauer et al. ), interpro (mulder et al. )) and structural (scop (andreeva et al. ), cath (greene et al. )) perspective. notably, it may be confusing that the term domain is commonly used in two slightly different meanings. in the context of domain databases such as pfam and smart, a domain is basically defined by a set of homologous sequence regions, which constitute a domain family. in contrast, a specific protein may contain one or more domains, which are concrete sequence regions within its amino acid sequence corresponding to autonomously folding units. domain families are commonly represented by hidden markov models (hmms), and highly sensitive search tools like hmmer (eddy ) are used to identify domains in protein sequences.
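as an illustration of such a domain scan (a hedged sketch, not part of the chapter): the hmmer tool hmmscan can be called from python to annotate a protein sequence with pfam domain families; this assumes that hmmer is installed and that the pfam profile library has been downloaded and indexed with hmmpress, and the file names are placeholders.

    import subprocess

    def scan_domains(fasta_path, pfam_hmm="Pfam-A.hmm", out_table="domains.domtbl"):
        """Run hmmscan on a protein FASTA file and return the per-domain hit table."""
        subprocess.run(
            ["hmmscan", "--domtblout", out_table, "--cut_ga", pfam_hmm, fasta_path],
            check=True,  # raise if hmmscan exits with an error
        )
        return out_table  # tabular output, one line per domain match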
different sources of information about interacting domains with experimental evidence are available. experimentally determined interactions of single-domain proteins indicate domain-domain interactions. similarly, experiments using protein fragments help identifying interaction domains, but this knowledge is frequently hidden in the text of publications and not contained in any database. however, domain databases like pfam, smart, and interpro may contain some annotation obtained by manual literature curation. in the near future, high-throughput screening techniques will result in even larger amounts of protein fragment interaction data to delineate domain borders and interacting protein regions (colland and daviet ) . above all, three-dimensional structures of protein domain complexes are experimentally solved by x-ray crystallography or nmr and are deposited in the pdb database (berman et al. ) . structural contacts between two interacting proteins can be derived by mapping sequence positions of domains onto pdb structures. extensive investigations of domain combinations in proteins of known structures (apic et al. ) as well as of structurally resolved homo-or heterotypic domain interactions (park et al. ) revealed that the overlap between intra-and intermolecular domain interactions is rather limited. two databases, ipfam (finn et al. ) and did (stein et al. ) , provide pre-computed structural information about protein interactions at the level of pfam domains. analysis of structural complexes suggests that interactions between a given pair of proteins may be mediated by different domain pairs in different situations and in different organisms. nevertheless, many domain interactions, especially those involved in basic cellular processes such as dna metabolism and nucleotide binding, tend to be evolutionarily conserved within a wide range of species from prokaryotes to eukaryotes (itzhaki et al. ) . in yeast, pfam domain pairs are associated with over % of experimentally known protein interactions, but only . % of them are covered by ipfam (schuster-bockler and bateman ) . domain interactions can be inferred from experimental data on protein interactions by identifying those domain pairs that are significantly overrepresented in interacting proteins compared to random protein pairs (deng et al. ; ng et al. a; riley et al. ; sprinzak and margalit ) (fig. ) . however, the predictive power of such an approach is strongly dependent on the quality of the data used as the source of information for protein interactions, and the coverage of protein sequences in terms of domain assignments. basically, the likelihood of two domains, d i and d j , to interact can be estimated as the fraction of protein pairs known to interact among all proteins in the dataset containing this domain pair. this basic idea has been improved upon by using a maximum-likelihood (ml) approach based on the expectation-maximization (em) algorithm. this method finds the maximum likelihood estimator of the observed protein-protein interactions by an iterative cycle of computing the expected likelihood (e-step) and maximizing the unobserved parameters (domain interaction propensities) in the m-step. when the algorithm converges (i.e. the total likelihood cannot be further improved by the algorithm), the ml estimate for the likelihood of the unobserved domain interactions is found (deng et al. ; riley et al. ). 
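the basic counting estimate described above (before the maximum-likelihood refinement) can be sketched as follows; the data layout is illustrative and the em-based estimation itself is not shown.

    from itertools import combinations

    def domain_pair_likelihoods(proteins, interactions):
        """proteins: dict protein_id -> set of domain ids;
        interactions: set of frozensets {p1, p2} known to interact.
        Returns, for every domain pair, the fraction of protein pairs containing
        that domain pair which are known to interact."""
        counts = {}  # (d_i, d_j) -> [interacting protein pairs, all protein pairs]
        for p1, p2 in combinations(sorted(proteins), 2):
            interacting = frozenset((p1, p2)) in interactions
            for d1 in proteins[p1]:
                for d2 in proteins[p2]:
                    key = tuple(sorted((d1, d2)))
                    entry = counts.setdefault(key, [0, 0])
                    entry[1] += 1
                    if interacting:
                        entry[0] += 1
        return {pair: k / n for pair, (k, n) in counts.items()}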
riley and colleagues further improved this method by excluding each potentially interacting domain pair from the dataset and recomputing the ml-estimate to obtain an additional confidence value for the respective domain-domain interaction. this domain pair exclusion (dpea) method measures the contribution of each domain pair to the overall likelihood of the protein interaction network based on domain-domain interactions. in particular, this approach enables the prediction of specific domain-domain interactions between selected proteins which would have been missed by the basic ml method. another ml-based algorithm is insite, which takes differences in the reliability of the protein-protein interaction data into account (wang et al. a). it also integrates external evidence such as functional annotation or domain fusion events. an alternative method for deriving domain interactions is through co-evolutionary analysis that exploits the notion that mutations of residue pairs at the interaction interfaces are correlated to preserve favorable physico-chemical properties of the binding surface (jothi et al. ). the pair of domains mediating interactions between two proteins p and p may therefore be expected to display a higher similarity of their phylogenetic trees than other, non-interacting domains (fig. ). the degree of agreement between the evolutionary history of two domains, d_i and d_j, can be computed by pearson's correlation coefficient r_ij between the similarity matrices of the domain sequences in different organisms: r_ij = sum_{p<q} (m^i_pq - mean(m^i)) (m^j_pq - mean(m^j)) / sqrt( sum_{p<q} (m^i_pq - mean(m^i))^2 · sum_{p<q} (m^j_pq - mean(m^j))^2 ), where n is the number of species, m^i_pq and m^j_pq are the evolutionary distances between species p and q, and mean(m^i) and mean(m^j) are the mean values of the matrices, respectively. in the figure, the evolutionary tree of the domain d is most similar to those of d and d , corroborating the actual binding region. a well-known limitation of the correlated mutation analysis is that it is very difficult to decide whether residue co-variation happens as a result of functional co-evolution directed at preserving interaction sites, or because of sequence divergence due to speciation. to address this problem, it has been suggested to distinguish the relative contribution of conserved and more variable regions in aligned sequences to the co-evolution signal, based on the hypothesis that functional co-evolution is more prominent in conserved regions. finally, interacting domains can be identified by phylogenetic profiling, as described above for full-chain proteins. as in the case of complete protein chains, the similarity of evolutionary patterns shared by two domains may indicate that they interact with each other directly or at least share a common functional role (pagel et al. ). as illustrated in the figure, clustering protein domains with similar phylogenetic profiles allows researchers to build domain interaction networks which provide clues for describing molecular complexes. similarly, the domainteam method (pasek et al. ) considers chromosomal neighborhoods at the level of conserved domain groups. a number of resources provide and combine experimentally derived and predicted domain interaction data. interdom (http://interdom.i r.a-star.edu.sg/) integrates domain-interaction predictions based on known protein interactions and complexes with domain fusion events (ng et al. b). dima (http://mips.gsf.de/genre/proj/dima) is another database of domain interactions, which integrates experimentally demonstrated domain interactions from ipfam and did with predictions based on the dpea algorithm and phylogenetic domain profiling.
fig. co-evolutionary analysis of domain interactions. two orthologous proteins from different organisms known to interact with each other are shown. the first protein consists of two domains, d and d , while the second protein includes the domains d , d , d , and d . evolutionary trees for each domain are shown; their similarity serves as an indication of interaction likelihood that is encoded in the interaction matrix.
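the profiling idea mentioned above can be sketched with a toy example (an illustration only; real analyses use statistical profile comparison and proper clustering, and the similarity threshold here is an arbitrary assumption).

    def jaccard(orgs_a, orgs_b):
        """Similarity of two presence/absence profiles given as sets of organisms."""
        a, b = set(orgs_a), set(orgs_b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def similar_profile_pairs(profiles, threshold=0.8):
        """profiles: dict domain_id -> set of organisms containing the domain.
        Returns domain pairs whose phylogenetic profiles are highly similar."""
        names = sorted(profiles)
        pairs = []
        for i, d1 in enumerate(names):
            for d2 in names[i + 1:]:
                if jaccard(profiles[d1], profiles[d2]) >= threshold:
                    pairs.append((d1, d2))
        return pairs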
recently, two new comprehensive resources, domine (http://domine.utdallas.edu) (raghavachari et al. ) and dasmi (http://www.dasmi.de) (blankenburg et al., submitted), were introduced and are available online. these resources contain ipfam and did data and predicted domain interactions taken from several other publications. predictions are based on several methods for deriving domain interactions from protein interaction data, phylogenetic domain profiling data and domain co-evolution. with the availability of an increasing number of predictions, the task of method weighting and quality assessment becomes crucial. a thorough analysis of the quality of domain interaction data can be found in schlicker et al. ( ). beyond domain-domain contacts, an alternative mechanism of mediating molecular recognition is through binding of protein domains to short sequence regions (santonico et al. ), typically from three to eight residues in length (zarrinpar et al. ; neduva et al. ). such linear recognition motifs can be discovered from protein interaction data by identifying amino acid sequence patterns overrepresented in proteins that do not possess significant sequence similarity, but share the same interacting partner (yaffe ). web services like elm (http://elm.eu.org) (puntervoll et al. ) support the identification of linear motifs in protein sequences. as described above, specific adapter domains can mediate protein-protein interactions. while some of these interaction domains recognize small target peptides, others are involved in domain-domain interactions. as short binding motifs have a rather high probability of being found by chance and the exact mechanisms of binding specificity for this mode of interaction are not understood completely, predictions of protein-protein interactions based on binding domains are currently limited to domain-domain interactions for which reliable data is available. predicting ppis from domain interactions may simply be achieved by reversing the ideas discussed above, that is, by using the domain composition of proteins to evaluate the interaction likelihood of proteins (bock and gough ; sprinzak and margalit ; wojcik and schachter ). in a naive approach, domain interactions are treated as independent, and all protein pairs with a matching pair of interacting domains are predicted to engage in an interaction. given that protein interactions may also be mediated by several domain interactions simultaneously, more advanced statistical methods take into account dependencies between domains and exploit domain combinations (han et al. ) and multiple interacting domain pairs (chen and liu ). exercising and validating these prediction approaches revealed that the most influential factor for ppi prediction is the quality of the underlying data.
this suggests that, as for most biological predictions in other fields, the future of prediction methods for protein and domain interactions may lie in the integration of different sources of evidence and weighting the individual contributions based on calibration to goldstandard data. further methodological improvements may include the explicit consideration of cooperative domains, that is, domain pairs that jointly interact with other domains (wang et al. b ). basic interactions between two or up to a few biomolecules are the basic elements of the complex molecular interaction networks that enable the processes of life and, when thrown out of their intended equilibrium, manifest the molecular basis of diseases. such interactions are at the basis of the formation of metabolic, regulatory or signal transduction pathways. furthermore the search for drugs boils down to analyzing the interactions between the drug molecule and the molecular target to which it binds, which is often a protein. for the analysis of a single molecular interaction, we do not need complex biological screening data. thus it is not surprising that the analysis of the interactions between two molecules, one of them being a protein, has the longest tradition in computational biology of all problems involving molecular interactions, dating back over three decades. the basis for such analysis is the knowledge of the three-dimensional structure of the involved molecules. to date, such knowledge is based almost exclusively on experimental measurements, such as x-ray diffraction data or nmr spectra. there are also a few reported cases in which the analysis of molecular interactions based on structural models of protein has led to successes. the analysis of the interaction of two molecules based on their three-dimensional structure is called molecular docking. the input is composed of the three-dimensional structures of the participating molecules. (if the involved molecule is very flexible one admissible structure is provided.) the output consists of the three-dimensional structure of the molecular complex formed by the two molecules binding to each other. furthermore, usually an estimate of the differential free energy of binding is given, that is, the energy difference dg between the bound and the unbound conformation. for the binding event to be favorable that difference has to be negative. this slight misnomer describes the binding between a protein molecule and a small molecule. the small molecule can be a natural substrate such as a metabolite or a molecule to be designed to bind tightly to the protein such as a drug molecule. proteinligand docking is the most relevant version of the docking problem because it is a useful help in searching for new drugs. also, the problem lends itself especially well to computational analysis, because in pharmaceutical applications one is looking for small molecules that are binding very tightly to the target protein, and that do so in a conformation that is also a low-energy conformation in the unbound state. thus, subtle energy differences between competing ligands or binding modes are not of prime interest. for these reasons there is a developed commercial market for protein-ligand docking software. usually the small molecule has a molecular weight of up to several hundred daltons and can be quite flexible. typically, the small molecule is given by its d structure formula, e.g., in the form of a smiles string (weininger ) . 
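as a hedged illustration (one of several possible toolchains, not necessarily the ones used by the programs cited below), a starting three-dimensional conformation can be generated from a smiles string with the open-source rdkit library:

    from rdkit import Chem
    from rdkit.Chem import AllChem

    def smiles_to_3d(smiles):
        """Parse a SMILES string and generate one reasonable 3D conformer."""
        mol = Chem.MolFromSmiles(smiles)            # 2D structure formula
        if mol is None:
            raise ValueError("could not parse SMILES")
        mol = Chem.AddHs(mol)                       # explicit hydrogens before embedding
        AllChem.EmbedMolecule(mol, randomSeed=42)   # distance-geometry 3D embedding
        AllChem.MMFFOptimizeMolecule(mol)           # quick force-field relaxation
        return mol

    # example: a starting conformation for aspirin
    # mol = smiles_to_3d("CC(=O)Oc1ccccc1C(=O)O")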
if a starting d conformation is needed there is special software for generating such a conformation (see, e.g. (pearlman ; sadowski et al. ) ). challenges of the protein ligand problem are (i) finding the correct conformation of the usually highly flexible ligand in the binding site of the protein, (ii) determining the subtle conformational changes in the binding site of the protein upon binding of the ligand, which are termed induced fit, (iii) producing an accurate estimate of the differential energy of binding or at least ranking different conformations of the same ligand and conformations of different ligands correctly by their differential energy of binding. methods tackling problem (ii) can also be used to rectify smaller errors in structural models of proteins whose structure has not been resolved experimentally. the solution of problem (iii) provides the essential selection criterion for preferred ligands and binding modes, namely those with lowest differential energy of binding. challenge (i) has basically been conquered in the last decade as a number of docking programs have been developed that can efficiently sample the conformational space of the ligand and produce correct binding modes of the ligand within the protein, assuming that the protein is given in the correct structure for binding the ligand. several methods are applied here. the most brute-force method is to just try different (rigid) conformations of the ligand one after the other. if the program is fast enough one can run through a sizeable number of conformations per ligand (mcgann et al. ) . a more algorithmic and quite successful method is to build up the ligand from its molecular fragments inside the binding pocket of the protein (rarey et al. ). yet another class of methods sample ligand conformations inside the protein binding pocket by methods such as local search heuristics, monte carlo sampling or genetic algorithms (abagyan et al. ; jones et al. ; morris et al. ). there are also programs exercising combinations of different methods (friesner et al. ). the reported methods usually can compute the binding mode of a ligand inside a protein within fractions of a minute to several minutes. the resulting programs can be applied to screening through large databases of ligands involving hundreds of thousands to millions of compounds and are routinely used in pharmaceutical industry in the early stages of drug design and selection. they are also repeatedly compared on benchmark datasets (kellenberger et al. ; englebienne et al. ). more complex methods from computational biophysics, such as molecular dynamics (md) simulations that compute a trajectory of the molecular movement based on the forces exerted on the molecules take hours on a single problem instance and can only be used for final refinement of the complex. challenges (ii) and (iii) have not been solved yet. concerning problem (ii), structural changes in the protein can involve redirections of side chains in or close to the binding pocket and more substantial changes involving backbone movement. while recently methods have been developed to optimize side-chain placement upon ligand binding (claußen et al. ; sherman et al. ) , the problem of finding the correct structural change upon binding involving backbone and side-chain movement is open (carlson ) . 
concerning problem (iii), there are no scoring functions to date that are able to sufficiently accurately estimate the differential energy of binding on a diverse set of protein-ligand complexes (huang and zou ). this is especially unfortunate as an inaccurate estimate of the binding energy causes the docking program to disregard correct complex structures, even though they have been sampled by the docking program, because they are labeled with incorrect energies. this is the major problem in docking which limits the accuracy of the predictions. recent reviews on protein-ligand docking have been published in sousa et al. ( ) and rarey et al. ( ). one restriction with protein-ligand docking as it applies to drug design and selection is that the three-dimensional structure of the target protein needs to be known. many pharmaceutical targets are membrane-standing proteins for which we do not have the three-dimensional structure. for such proteins there is a version of drug screening that can be viewed as the negative imprint of docking: instead of docking the drug candidate into the binding site of the protein, which is not available, we superpose the drug candidate (which is here called the test molecule) onto another small molecule which is known to bind to the binding site of the protein. such a molecule can be the natural substrate for the target protein or another drug targeting that protein. let us call this small molecule the reference molecule. the suitability of the new drug candidate is then assessed on the basis of its structural and chemical similarity with the reference molecule. one problem is that now both the test molecule and the reference molecule can be highly flexible. but in many cases largely rigid reference molecules can be found, and in other cases it suffices to superpose the test molecule onto any low-energy conformation of the reference molecule. there are several classes of drug screening programs based on this molecular comparison, ranging from (i) programs that perform a detailed analysis of the three-dimensional structures of the molecules to be compared (e.g. (lemmen et al. ; krämer et al. )) across (ii) programs that perform a topological analysis of the two molecules (rarey and dixon ; gillet et al. ) to (iii) programs that represent both molecules by binary or numerical property vectors which are compared with string methods (mcgregor and muskal ; xue et al. ). the first class of programs requires fractions of seconds to fractions of a minute for a single comparison, the second can perform hundreds of comparisons per second, the third up to several tens of thousands of comparisons per second. reviews of methods for drug screening based on ligand comparison are given in (lengauer et al. ; kämper et al. ). in protein-protein docking, both binding partners are proteins. since drugs tend to be small molecules this version of the docking problem is not of prime interest in drug design. also, the energy balance of protein-protein binding is much more involved than for protein-ligand binding. optimal binding modes tend not to form troughs in the energy landscape that are as pronounced as for protein-ligand docking. the binding mode is determined by subtle side-chain rearrangements of both binding partners that implement the induced fit along typically quite large binding interfaces. the energy balance is dominated by difficult-to-analyze entropic terms involving the desolvation of water within the binding interface.
for these reasons, the software landscape for protein-protein docking is not as well developed as for protein-ligand docking and there is no commercial market for protein-protein docking software. protein-protein docking approaches are based either on conformational sampling and mdwhich can naturally incorporated molecular flexibility but suffers from very high computing demandsor on combinatorial sampling with both proteins considered rigid in which case handling of protein flexibility has to be incorporated with methodical extensions. for space reasons we do not detail methods for protein-protein docking. a recent review on the subject can be found in hildebrandt et al. ( ) . a variant of protein-protein docking is protein-dna docking. this problem shares with protein-protein docking the character that both binding partners are macromolecules. however, entropic aspects of the energy balance are even more dominant in protein-dna docking than in protein-protein docking. furthermore dna can assume nonstandard shapes when binding to proteins which deviate much more from the known double helix than we are used to when considering induced fit phenomena. icm-a method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation gene prioritization through genomic data fusion scale-free networks in cell biology the relationship between sequence and interaction divergence in proteins ten thousand interactions for the molecular biologist structural systems biology: modelling protein interactions scop database in : refinements integrate structure and sequence family data domain combinations in archaeal, eubacterial and eukaryotic proteomes gene ontology: tool for the unification of biology. the gene ontology consortium computing topological parameters of biological networks bind: the biomolecular interaction network database network biology: understanding the cells functional organization constrained models of evolution lead to improved prediction of functional linkage from correlated gain and loss of genes predicting functional gene links from phylogenetic-statistical analyses of whole genomes high-throughput mapping of a dynamic signaling network in mammalian cells the worldwide protein data bank (wwpdb): ensuring a single, uniform archive of pdb data predicting protein-protein interactions from primary structure the unified medical language system (umls): integrating biomedical terminology superti-furga g ( ) a physical and functional map of the human tnf-alpha/ nf-kappa b signal transduction pathway use of logic relationships to decipher protein network organization the grid: the general repository for interaction datasets interaction network containing conserved and essential protein complexes in escherichia coli epstein-barr virus and virus human protein interaction maps a network-based analysis of systemic inflammation in humans disrupted in schizophrenia interactome: evidence for the close connectivity of risk genes and a potential synaptic basis for schizophrenia protein flexibility is an important component of structure-based drug discovery mint: the molecular interaction database on evaluating molecular-docking methods for pose prediction and enrichment factors prediction of protein-protein interactions using random decision forest framework the coordinated evolution of yeast proteins is constrained by functional modularity network-based classification of breast cancer metastasis flexe: efficient molecular docking considering 
protein structure variations integration of biological networks and gene expression data using cytoscape integrating a functional proteomic approach into the target discovery process identification of the helicobacter pylori anti-sigma factor a map of human cancer signaling interactome: gateway into systems biology conservation of gene order: a fingerprint of proteins that physically interact discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages inferring domain-domain interactions from protein-protein interactions flexible nets. the roles of intrinsic disorder in protein interaction networks profile hidden markov models evaluation of docking programs for predicting binding of golgi alpha-mannosidase ii inhibitors: a comparison with crystallography protein interaction maps for complete genomes based on gene fusion events a novel genetic system to detect protein-protein interactions ipfam: visualization of protein-protein interactions in pdb at domain and amino acid resolutions pfam: clans, web tools and services pharmaceuticals: a new grammar for drug discovery a genomic approach of the hepatitis c virus generates a protein interaction map reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes glide: a new approach for rapid, accurate docking and scoring. . method and assessment of docking accuracy the coevolution of gene family trees microbial genescapes: phyletic and functional patterns of orf distribution among prokaryotes analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets proteome survey reveals modularity of the yeast cell machinery functional organization of the yeast proteome by systematic analysis of protein complexes experimental determination and system level analysis of essential genes in escherichia coli mg disease gene discovery through integrative genomics similarity searching using reduced graphs a protein interaction map of drosophila melanogaster a protein interaction network links git , an enhancer of huntingtin aggregation, to huntingtons disease co-evolution of proteins with their interaction partners analyzing protein interaction networks the human disease network mgf( )(-) as a transition state analog of phosphoryl transfer the cath domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution mpact: the mips protein interaction resource on yeast online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders prespi: a domain combination based prediction system for protein-protein interaction the gene ontology (go) database and informatics resource from molecular to modular cell biology why do hubs tend to be essential in protein networks? the hupo psis molecular interaction format -a community standard for the representation of protein interaction data modeling protein-protein and protein-dna docking systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry towards zoomable multidimensional maps of the cell visant: data-integrating visual framework for biological networks and modules an iterative knowledge-based scoring function to predict protein-ligand interactions: ii. 
validation of the scoring function measuring genome evolution evolutionary conservation of domain-domain interactions a bayesian networks approach for predicting protein-protein interactions from genomic data lethality and centrality in protein networks development and validation of a genetic algorithm for flexible docking a quantitative protein interaction network for the erbb receptors using protein microarrays global topological features of cancer proteins in the human interactome co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions high-confidence prediction of global interactomes based on genome-wide coevolutionary networks huntingtin interacting proteins are genetic modifiers of neurodegeneration lead identification by virtual screaning kegg for linking genomes to life and the environment protein interactions and disease: computational approaches to uncover the etiology of diseases predicting protein domain interactions from coevolution of conserved regions relating protein pharmacology by ligand chemistry comparative evaluation of eight docking tools for docking and virtual screening accuracy intact-open source resource for molecular interaction data broadening the horizon -level . of the hupo-psi format for molecular interactions a robustness-based approach to systems-oriented drug design systematic association of genes to phenotypes by genome and literature mining analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs fast d molecular superposition and similarity search in databases of flexible molecules stitch: interaction networks of chemicals and proteins a protein interaction network of the malaria parasite plasmodium falciparum a human phenome-interactome network of protein complexes implicated in genetic disorders genome-wide protein interaction maps using two-hybrid systems flexs: a method for fast flexible ligand superposition novel technologies for virtual screening smart : domains in the context of genomes and networks a protein-protein interaction network for human inherited ataxias and disorders of purkinje cell degeneration a multidimensional analysis of genes mutated in breast and colorectal cancers network-based analysis of affected biological processes in type diabetes models human disease classification in the postgenomic era: a complex systems approach to human pathobiology hubs in biological interaction networks exhibit low changes in expression in experimental asthma cdd: a conserved domain database for interactive domain family analysis detecting protein function and protein-protein interactions from genome sequences a combined algorithm for genome-wide prediction of protein function a combined algorithm for genome-wide prediction of protein function gaussian docking functions pharmacophore fingerprinting. . 
application to qsar and focused library design structure, function, and evolution of transient and obligate protein-protein interactions human protein reference database- update systematic discovery of analogous enzymes in thiamin biosynthesis automated docking using a lamarckian genetic algorithm and an empirical binding free energy function new developments in the interpro database systematic discovery of new recognition peptides mediating protein interaction networks integrative approach for computationally inferring protein domain interactions interdom: a database of putative interacting protein domains for validating predicted protein interactions and complexes the solution structure of the josephin domain of ataxin- : structural determinants for molecular recognition protein interaction networks in bacteria diversity of protein-protein interactions a comprehensive map of the toll-like receptor signaling network a comprehensive pathway map of epidermal growth factor receptor signaling submit your interaction data the imex way: a step by step guide to trouble-free deposition the minimum information required for reporting a molecular interaction experiment (mimix) protein families and their evolution-a structural perspective the modular nature of genetic diseases use of contiguity on the chromosome to predict functional coupling a database and tool, im browser, for exploring and integrating emerging gene and protein interaction data for drosophila the mips mammalian protein-protein interaction database dima . predicted and known domain interactions a domain interaction map based on phylogenetic profiling species-specificity of the cohesin-dockerin interaction between clostridium thermocellum and clostridium cellulolyticum: prediction of specificity determinants of the dockerin domain global mapping of pharmacological space mapping protein family interactions: intramolecular and intermolecular protein family interaction repertoires in the pdb and yeast a proteome-wide protein interaction map for campylobacter jejuni identification of genomic features using microsyntenies of domains: domain teams assembly of cell regulatory systems through protein interaction domains assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome similarity of phylogenetic trees as indicator of protein-protein interaction rapid generation of high quality approximate -dimension molecular structures assigning protein functions by comparative genome analysis: protein phylogenetic profiles network modeling links breast cancer susceptibility and centrosome dysfunction elm server: a new resource for investigating short functional sites in modular eukaryotic proteins domine: a database of protein domain interactions an integrative approach to gain insights into the cellular function of human ataxin- computational analysis of human protein interaction networks docking and scoring for structure-based drug design feature trees: a new molecular similarity measure based on tree matching a fast flexible docking method using an incremental construction algorithm a generic protein purification method for protein complex characterization and proteome exploration inferring protein domain interactions from databases of interacting proteins corum: the comprehensive resource of mammalian protein complexes human protein-protein interaction networks and the value for drug discovery comparison of automatic three-dimensional models builders using x-ray structures interaction 
generality, a measurement to assess the reliability of a protein-protein interaction the database of interacting proteins: update methods to reveal domain networks partial correlation coefficient between distance matrices as a new indicator of protein-protein interactions the inference of protein-protein interactions by co-evolutionary analysis is improved by excluding the information about the phylogenetic relationships functional evaluation of domain-domain interactions and human protein interaction networks reuse of structural domain-domain interactions in protein networks cytoscape: a software environment for integrated models of biomolecular interaction networks conserved patterns of protein interaction in multiple species network-based prediction of protein function novel procedure for modeling ligand/ receptor induced fit effects quantifying modularity in the evolution of biomolecular systems protein-ligand docking: current status and future challenges protein complexes and functional modules in molecular networks correlated sequence-signatures as markers of protein-protein interaction biogrid: a general repository for interaction datasets did: interacting protein domains of known three-dimensional structure the value of high quality protein-protein interaction networks for systems biology protein-protein interactions: analysis and prediction tools for visually exploring biological networks systematic interactome mapping and genetic perturbation analysis of a c. elegans tgf-b signaling network a comprehensive analysis of protein-protein interactions in saccharomyces cerevisiae herpesviral protein networks and their interaction with the human proteome from orfeomes to protein interaction maps in viruses reactome: a knowledge base of biologic pathways and processes structure, function and evolution of multidomain proteins analysis of intraviral protein-protein interactions of the sars coronavirus orfeome string -recent developments in the integration and prediction of protein interactions comparative assessment of large-scale data sets of protein-protein interactions insite: a computational method for identifying protein-protein interaction binding sites on a proteome-wide scale comparative evaluation of scoring functions for molecular docking analysis on multi-domain cooperation for predicting protein-protein interactions smiles, a chemical language and information system. . 
introduction and encoding rules reaching for high-hanging fruit in drug discovery at protein-protein interfaces protein-protein interaction map inference using interacting domain profile pairs analyzing cellular biochemistry in terms of molecular networks discovering disease-genes by topological features in human protein-protein interaction network a modular network model of aging evaluation of descriptors and mini-fingerprints for the identification of molecules with similar activity bits" and pieces drug-target network functional and topological characterization of protein interaction networks the importance of bottlenecks in protein networks: correlation with gene essentiality and expression dynamics the structure and function of proline recognition domains protein-protein interactions: analysis and prediction key: cord- -z ta pp authors: shahi, gautam kishore title: amused: an annotation framework of multi-modal social media data date: - - journal: nan doi: nan sha: doc_id: cord_uid: z ta pp in this paper, we present a semi-automated framework called amused for gathering multi-modal annotated data from the multiple social media platforms. the framework is designed to mitigate the issues of collecting and annotating social media data by cohesively combining machine and human in the data collection process. from a given list of the articles from professional news media or blog, amused detects links to the social media posts from news articles and then downloads contents of the same post from the respective social media platform to gather details about that specific post. the framework is capable of fetching the annotated data from multiple platforms like twitter, youtube, reddit. the framework aims to reduce the workload and problems behind the data annotation from the social media platforms. amused can be applied in multiple application domains, as a use case, we have implemented the framework for collecting covid- misinformation data from different social media platforms. with the growth of the number of users on different social media platforms, social media have become part of our lives. they play an essential role in making communication easier and accessible. people and organisations use social media for sharing and browsing the information, especially during the time of the pandemic, social media platforms get massive attention from users talwar et al. ( ) . braun and gillespie conducted a study to analyse the public discourse on social media platforms and news organisation. the design of social media platforms allows getting more attention from the users for sharing news or user-generated content. several statistical or computational study has been conducted using social media data braun and gillespie ( ) . but data gathering and its annotation are time-consuming and financially costly. in this study, we resolve the complications of data annotation from social media platforms for studying the problems of misinformation and hate speech. usually, researchers encounter several problems while conducting research using social media data, like data collection, data sampling, data annotation, quality of the data, copyright © , association for the advancement of artificial intelligence (www.aaai.org). all rights reserved. and the bias in data grant-muller et al. ( ) . data annotation is the process of labelling the data available in various formats like text, video or images. 
researchers annotate social media data for researches based on hate speech, misinformation, online mental health etc. for supervised machine learning, labelled data sets are required so that machine can quickly and clearly understand the input patterns. to build a supervised or semi-supervised model on social media data, researchers face two challenges-timely data collection and data annotation shu et al. ( ) . one time data collection is essential because some platforms either restrict data collection or often the post itself is deleted by social media platforms or by the user. for instance, twitter allows data crawling of only the past seven days (from the date of data crawling) by using the standard apis stieglitz et al. ( ) . moreso, it is not possible to collect the deleted posts from social media platforms. another problem stands with data annotation; it is conducted either in an in-house fashion (within lab or organisation) or by using a crowd-based tool(like amazon mechanical turk(amt)) aroyo and welty ( ) . both approaches of data annotations require an equitable amount of effort to write the annotation guidelines along with expert annotators. in the end, we are not able to get quality annotated data which makes it challenging to a reliable statistical or artificial intelligence based analysis. there is also always a chance of wrongly labelled data leading to bias in data cook and stevenson ( ) . currently, professional news media or blogs also cover the posts from social media posts in their articles. the inclusion of social media posts in the news and blog articles creates an opportunity to gather labelled social media data. journalists cover humongous topics of social issues such as misinformation, propaganda, rumours during elections, disasters, pandemics, and mob lynching, and other similar events. journalists link social media posts in the content of the news articles or blogs to explain incidents carlson ( ) . to solve the problems of on-time data collection and data annotation, we propose a semi-automatic framework for data annotation from social media platforms. the proposed framework is capable of getting annotated data on social issues like misinformation, hate speech or other critical social scenarios. the key contributions of the paper are listed below-• we present a semi-automatic approach for gathering an-arxiv: . v [cs.si] oct notated data from social media platforms. amused gathers labelled data from different social media platform in multiple formats(text, image, video). • amused reduces the workload, time and cost involved in traditional data annotation technique. • amused resolves the issues of bias in the data (wrong label assigned by annotator) because the data gathered will be labelled by professional news editors or journalists. • the amused can be applied in many domains like fake news or propaganda in the election, mob lynching etc. for which it is hard to gather the data. to present a use case, we apply the proposed framework to gather data on covid- misinformation on multiple social media platforms. in the following sections, we discuss the related work, different types of data circulated and its restrictions on social media platforms, current annotation techniques, proposed methodology and possible application domain; then we discuss the implementation and result. we also highlight some of the findings in the discussion, and finally, we discuss the conclusion and ideas for future works. 
much research has been published using social media data, but they are limited to a few social media platforms or language in a single work. also, the result is published with a limited amount of data. there are multiple reasons for the limited work; one of the key reason is the availability of the annotated data for the research thorson et al. ( ) ; ahmed, pasquier, and qadah ( ) . chapman et al. highlights the problem of getting labelled data for nlp related problem chapman et al. ( ) . researchers are dependent on in-house or crowd-based data annotation. recently, alam et al. uses a crowd-based annotation technique and asks people to volunteer for data annotation, but there is no significant success in getting a large number of labelled data alam et al. ( ) . the current annotation technique is dependent on the background expertise of the annotators. on the other hand, finding the past data on an incident like mob lynching, disaster is challenging because of data restrictions by social media platforms. it requires looking at massive posts, news articles with an intensive amount of manual work. billions of social media posts are sampled to a few thousand posts for data annotation either by random sample or keyword sampling, which brings a sampling bias in the data. with the in-house data annotation, forbush et al. mentions that it's challenging to hire annotator with background expertise in a domain. another issue is the development of a codebook with a proper explanation forbush et al. ( ) . the entire process is financially costly and time taking duchenne et al. ( ) . the problem with the crowdbased annotation tools like amt is that the low cost may result in wrong labelling of data. many annotators who cheat, not performing the job, but using robots or answering randomly fort, adda, and cohen ( ); sabou et al. ( ) . with the emergence of social media as a news resources caumont ( ), many people or group of people use it for different purpose like news sharing, personal opinion, social crime in the form of hate speech, cyber bullying. nowadays, the journalists cover some of the common issues like misinformation, mob lynching, hate speech, and they also link the social media post in the news articles cui and liu ( ) . in the proposed framework, we used the news articles from profession news website for developing the proposed framework. we only collect the news articles/blog from the credible source which does not compromise with the news quality meyer ( ) . in the next section, we discuss the proposed methodology for the amused framework. social media platform allows users to create and view posts in multiple formats. every day billions of posts containing images, text, videos are shared on social media sites such as facebook, twitter, youtube or instagram aggarwal ( ) . people use a combination of image, text and video for more creative and expressive forms of communication. data are available in different formats and each social media platform apply restriction on data crawling. for instance, facebook allows crawling data only related to only public posts and groups. giglietto, rossi, and bennato discuss the requirement of multi-modal data for the study of social phenomenon giglietto, rossi, and bennato ( ) . in the following paragraph, we highlighted the data format and restriction on several social media platforms. text almost every social media platform allows user to create or respond to the social media post in text. 
but each social media platform has a different restriction on the size of the text. twitter has a limit of characters, while on youtube, users are allowed to comment up to a limit of characters. reddit allows , characters; facebook has a limit of characters, wikipedia has no limit and so on. the content and the writing style changes with the character limit of different social media platform. image like text, image is also a standard format of data sharing across different social media platforms. these platforms also have some restriction on the size of the image. like twitter has a limit of megabytes, facebook and instagram have a limit of megabytes, reddit has a limit of megabytes. images are commonly used across different platform. it is common in social media platforms like instagram, pinterest. video some platforms are primarily focused on video like youtube. while other platforms are multi-modal which allows video, text and image. for video also there are restrictions in terms of duration like youtube has a capacity of hours, twitter allows seconds, instagram has a limit of seconds, and facebook allows videos up to minutes. the restriction of video's duration on different platforms catches different users. for instance, on twitter and instagram users post video with shorter duration. in contrast, youtube has users from media organisation, vlog writer, educational institution etc where the duration of video is longer. in the current annotation scenario, researchers collect the data from social media platforms for a particular issue with different search criteria. there are several problems with the current annotation approaches; some of them are highlighted below. • first, social media platforms restrict users to fetch old data; for example, twitter allows us to gather data only from the past seven days using the standard apis. we need to start on-time crawling; otherwise, we lose a reasonable amount of data which also contains valuable content. • second, if the volume of data is high, it requires filtering based on several criteria like keyword, date, location etc. these filtering further degrades the data quality by excluding the major portion of data. for example, for hate speech, if we sample the data using hateful keyword, then we might lose many tweets which are hate speech but do not contain any hateful word. • third, getting a good annotator is a difficult task. annotation quality depends on the background expertise of the person. even we hire annotator in our organisation; we have to train them for using the test data. for crowdsourcing, maintaining annotation quality is more complicated. moreover, maintaining a good agreement between multiple annotators is also a tedious job. • fourth problem is the development of annotation guidelines. we have to build descriptive guidelines for data annotation, which handle a different kind of contradiction. writing a good codebook requires domain knowledge and consultant from experts. • fifth, overall, data annotation is a financially costly process and time-consuming. sorokin and forsyth highlighted the issue of cost while using a crowd-based annotation technique sorokin and forsyth ( ) . • sixth, social media is available in multiple languages, but much research is limited to english. data annotation in other languages, especially under-resourced languages is difficult due to the lack of experienced annotators. the difficulty adversely affects the data quality and brings some bias in the data. 
in this work, we propose a framework to solve the above problems by crawling the embedded social media posts from the news articles and a detailed description is given in the proposed method section. in this section, we discuss the proposed methodology of the annotation framework. our method consists of nine steps, they are discussed below- step : domain identification the first step is the identification of the domain in which we want to gather the data. a domain could focus on a particular public discourse. for example, a domain could be fake news in the us election, hate speech in trending hashtags on twitter like #blacklivesmatter, #riotsinsweden etc. domain selection helps to focus on the relevant data sources. step : data source after domain identification, the next step is the identification of data sources. data sources may consist of either the professional news websites or the blogs that talk about the particular topic, or both. for example, many professional websites have a separate section which discusses the election or other ongoing issues. in the step, we collect the news website or blog which discuss the chosen domain. step : web scraping in the next step, we crawl all news articles from a professional news website or blogs which discuss the domain from each data source. for instances, a data source could be snopes snopes ( ) or poynter institute ( ) . we fetch all the necessary details like the published date, author, location, news content. step : language identification after getting the details from the news articles, we check the language of the news articles. we use iso - codes wikipedia ( ) for naming the language. based on the language, we can further filter the group of news articles based on spoken language from a country and apply a language-specific model for finding meaning insights. step : social media link from the crawled data, we fetch the anchor tag( a ) mentioned in the news content, then we filter the hyperlinks to identify social media platforms like twitter and youtube. from the filtered link, we fetch unique identifiers to the social media posts, for instance, for a hyperlink consisting of tweet id, we fetch the tweet id from the hyperlink. similarly, we fetch the unique id to social media for each platform. we also remove the links which are not fulfilling the filtering criteria. step : social media data crawling in this step, we fetch the data from the respective social media platform. we build a crawler for each social media platform and crawl the details using unique identifiers or uniform resource locator (url) obtained from the previous step. due to the data restriction, we use crowdtangle team ( ) to fetch them from facebook and instagram posts. example-for twitter, we use twitter crawler using tweet id (unique identifier), we crawl details about the tweets. step : data labelling in this step, we assign labels to the social media data based on the label assigned to the news articles by journalists. often news articles describe the social media post to be hate speech, fake news, or propaganda. we assign the class of the social media post mentioned in the news article as a class described by the journalist. for example, if a news article a containing social media post s has been published by a journalist j and journalist j has described the social media post s to be a fake news, we label the social media post s as fake news. 
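to make the social media link, data crawling and data labelling steps concrete, the sketch below shows how embedded post links could be pulled out of a fact-checked article and how the article's verdict could be propagated to each post. this is a minimal python illustration: the function name, the url patterns and the record layout are assumptions made for the sketch, not the exact amused implementation.

    # minimal sketch: extract embedded social media links from a fact-checked
    # article and propagate the article's label to each post.
    # helper names and url patterns are illustrative assumptions only.
    import re
    from bs4 import BeautifulSoup

    SOCIAL_PATTERNS = {
        # platform -> regex capturing a unique post identifier
        "twitter": re.compile(r"twitter\.com/[^/]+/status/(\d+)"),
        "youtube": re.compile(r"youtube\.com/watch\?v=([\w-]{5,})"),
        "reddit": re.compile(r"reddit\.com/r/(\w+/comments/\w+)"),
    }

    def extract_labelled_posts(article_html, article_label, article_id):
        """return one record per embedded social media post, labelled with the
        verdict the journalist assigned to the whole article."""
        soup = BeautifulSoup(article_html, "html.parser")
        records = []
        for a in soup.find_all("a", href=True):        # anchor tags in the article
            for platform, pattern in SOCIAL_PATTERNS.items():
                match = pattern.search(a["href"])
                if match:
                    records.append({
                        "news_id": article_id,
                        "platform": platform,
                        "post_id": match.group(1),     # unique identifier of the post
                        "label": article_label,        # e.g. "false"
                    })
        return records

the data crawling step would then pass each extracted post identifier to the platform-specific crawler, and the human verification step would spot-check that the propagated label matches the verdict stated in the article.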
usually, the news article is published by a domain expert, and it assures that social media post embedded or linked in the news article is correctly labelled. step : human verification in the next step, to check the correctness, a human verifies the assigned label to the social media post and with label mentioned in the news articles. if the label is wrongly assigned, then data is removed from the corpus. this step assures that the collected social media post contains the relevant post and correctly given label. a human can verify the label of the randomly selected news articles. step : data enrichment in this, we merge the social media data with the details from the news articles. it helps to accumulate extra information which might allow for further analysis. data merging provides analysis from news authors and also explains label assigned to the social media post. in this section, we consider the possible application domains of the proposed framework. nevertheless, the framework is a general one, and it can be tailored to suit varied unmentioned domains as well where the professional news website or blogs covers the incident like election, fake news etc. "fake news is an information that is intentionally, and verifiable false and could mislead readers" allcott and gentzkow ( ) . misinformation is part of fake news which is created deliberately intended to deceive. there is an increasing amount of misinformation in the media, social media, and other web sources. in recent years, much research has been done for fake news detection and debunking of fake news zhou and zafarani ( ) . in the last two decades, there is a significant increase in the spread of misinformation. nowadays more than fact-checking websites are working to tackle the problem misinformation cherubini and graves ( ) . fact-checking websites can help to investigate claims and assist citizens in determining whether the information used in an article is true or not. in a real-world scenario, people spread a vast amount of misinformation during the time of a pandemic, an election or a disaster. gupta et al. ( ) . there is a v problem of fake news -volume -a large number of fake news, velocity -during the peak the speed of propagation also intensifies, variety -different formats of data like images, text, videos are used in fake news. still, fake news detection requires a considerable effort to verify the claims. one of the most effective strategies for tackling this problem is to use computational methods to detect false news. misinformation has attracted significant attention in recent years as evidenced in recent publications li et al. ( ) ; li, meng, and yu ( ); li et al. ( ) ; popat et al. ( ) . additionally, misinformation is adopted across language borders and consequently often spread around the globe. for example-one fake news "russia released lions to implement the lockdown during covid- " was publicised across multiple countries in different languages like italian and tamil poynter ( ). mob lynching is a violent human behaviour where a group of people execute the legal practice without a trial which ends with a significant injury or death of a person apel ( ) . it is a worldwide problem, the first case executed in the th century in ireland, then it was trending in the usa during the - th century. often, mob lynching is initiated by rumours or fake news which gets triggered by the social media by a group of peoplearun ( ). 
the preventive measures taken by the government to overcome all obstacles and prevent further deaths were not successful in its entirety. getting the data for analysis of mob lynching is difficult because of the unexpected events occurring throughout the year, mainly in remote areas. there is no common search term or keyword that helps to crawl social media. so, if we fetch the specific social media post from the news articles which is covering analysis about the mob lynching arun ( ), we can use it for several studies. it will also help to analyse the cause and pattern from the previous incident griffin ( ) . online abuse is any kind of harassment, racism, personal attacks, and other types of abuse on online social media platforms. the psychological effects of online abuse on individuals can be extreme and lasting mishra, yannakoudakis, and shutova ( ) . online abuse in the form of hate speech, cyberbullying, personal attacks are common issue mishra, yannakoudakis, and shutova ( ) . many research has been done in english and other widely spoken languages, but under-resourced languages like hindi, tamil (and many more) are not well explored. gathering data in these languages is still a big challenge, so our annotation framework can easily be applied to collect the data on online abuse in multiple languages. in this, we discuss the implementation of our proposed framework. as a case study, we apply the amused for data annotation for covid- misinformation in the following way: step : domain identification out of several possible application domains, we consider the spread of misinformation in the context of covid- . we choose this the topic since because, december , the first official report of covid- , misinformation spreading over the web shahi and nandini ( ). the increase of misinformation is one of the big problems during the covid- problems. the director of the world health organization(who), considers that with covid, we are fighting with both pandemic and infodemic the guardian ( ). infodemic is a word coined by world health organization (who) to describe the misinformation of virus, and it makes hard for users to find trustworthy sources for any claim made on the covid- pandemic, either on the news or social media world health organization and others ( ); zarocostas ( ). one of the fundamental problems is the lack of sufficient corpus related to pandemic shahi, dirkson, and majchrzak ( ) . content of the misinformation depends on the domain; for example, during the election, we have a different set of misinformation compared to a pandemic like covid- , so domain identification helps to focus on specif topic. step : data sources for data source, we looked for fact-checking websites(like politifact, boomlive) and decided to use the poynter and snopes. we choose poynter figure : amused: an annotation framework for multi-modal social media data because poynter has a central data hub which collects data from more fact-checking websites while snopes is not integrated with poynter but having more than fact-checked articles on covid- . we describe the two data sources as follow-snopes-snopes snopes ( ) is an independent news house owned by snopes media group. snopes verifies the correctness of misinformation spread across several topics like election, covid- . as for the fact-checking process, they manually verify the authenticity of the news article and performs a contextual analysis. 
in response to the covid- infodemic, snopes provides a collection of a fact-checked news article in different categories based on the domain of the news article. poynter-poynter is a non-profit making institute of journalists institute ( ). in covid- crisis, poynter came forward to inform and educate to avoid the circulation of the fake news. poynter maintains an international fact-checking network(ifcn), the institute also started a hashtag #coronavirusfacts and #datoscoronavirus to gather the misinformation about covid- . poynter maintains a database which collects fact-checked news from factchecking organisation in languages. step : web scraping in this step, we developed a pythonbased crawler using beautiful soup richardson ( ) to fetch all the news articles from the poynter and snopes. our crawler collects important information like the title of the news articles, name of the fact-checking websites, date of publication, the text of the news articles, and a class of news articles. we have assigned a unique identifier to each of them and its denoted by fcid. a short description of each element given in table . step : language detection we collected data in multiple languages like english, german, hindi etc. to identify the language of the news article, we have used langdetect shuyo ( ) , a python-based library to detect the language of the news articles. we used the textual content of new articles to check the language of the news articles. our dataset is categorise into different languages. step : social media link in the next step, while doing the html crawling, we filter the url from the parsed tree of the dom (document object model). we analysed the url pattern from different social media platforms and applied keyword-based filtering from all hyperlinks in the dom. we store that urls in a separate column as the social media link. an entire process of finding social media is shown in figure . some of the url patterns are discussed below-twitter-for each tweet, twitter follows a pattern twitter.com/user name/status/tweetid. so, in the collection hyperlink, we searched for the keyword, "twitter.com" and "status", it assures that we have collected the hyperlink which referring to the tweet. youtube-for each youtube video, youtube follows a pattern hwww.youtube.com/watch?v=vidoeid. so, in the collection hyperlink, we searched for the keyword, "youtube.com" and "watch", these keyword assures that we have collected the hyperlink which referring to the particular youtube video. reddit-for each subreddit, reddit follows a pattern www.reddit.com/r/subreddit topic/. so, in the collection hy- example news id we provide a unique identifying id to each news articles. we use acronym for news source and the number to identify a news articles. newssource url it is a unique identifier pointing to the news articles. https://factcheck.afp.com/vi deo-actually-shows-anti-gove rnment-protest-belarus news title in this field, we store the title of the news articles. a video shows a rally against coronavirus restrictions in the british capital of london. published date each news articles published the fact check article with a class like false, true, misleading. we store it in the class column. news class we provide a unique identifying id to each news articles. false published-by in this field, we store the name of the professional news websites or blog, for example, afp, quint etc. country each news articles published the fact check article with a class like false, true, misleading. 
we store it in the class column. we provide a unique identifying id to each news articles. english table : name, definition and an example of elements collected from new articles. perlink, we searched for the keyword, "reddit.com" and a regex code to detect "reddit.com/r/", which confirms that we have collected the hyperlink which referring to the particular subreddit. similarly, we followed the approach for other social media platforms like facebook, instagram, wikipedia, pinterest, gab. in the next step, we used the regex code to filter the unique id for each social media post like tweet id for twitter, video id for youtube. step : social media data crawling after web scraping, we have the unique identifier of each social media post like tweet id for twitter, video id for videos etc. we build a python-based program for crawling the data from the respective social media platform. we describe some of the crawling tool and the details about the collected data. twitter-we used python crawler using tweepy roesslein ( ), which crawls all details about a tweet. we collect text, time, likes, retweet, user details such as name, location, follower count. youtube-for youtube, we built a python-based crawler which collects the textual details about the video, like title, channel name, date of upload, likes, dislikes. we also crawled the comments of the respective. similarly, we build our crawler for other platforms, but for instagram and facebook, we used the crowdtangle for data crawling, data is limited to posts from public pages and group team ( ). step : data labelling for data labelling, we used the label assigned in the news articles then we map the social media post with their respective news article and assign the label to the social media post. for example, a tweet extracted from news article is mapped to the class of the news article. an entire process of data annotating shown in figure . step : human verification in the next step, we manually overlook each social media post to check the correctness of the process. we provided the annotator with all necessary information about the class mapping and asked them to verify it. for example-in figure , human open the news article using the newssource url and verified the label assigned to the tweet. for covid- misinformation, a human checked randomly sampled % social media post from each social media platforms and verified the label assign to the social media post and label mentioned in the news articles. with the random checks, we found that all the assigned labels are correct. this helps make sure the assigned label is correct and reduces the bias or wrongly assigned label. we further normalise the data label into false, partially false, true and others using the definitions mentioned in shahi, dirkson, and majchrzak ( ) . the number of social media post found in four different category is shown in table . step : data enrichment in this step, we enrich the data by providing extra information about the social media post. the first step is merging the social media post with the respective news article, and it includes additional information like textual content, news source, author. the detailed analysis of the collected data is discussed in the result section. based on the results, we also discuss some of the findings in the discussion section. a snapshot of the labelled data from twitter is shown in figure . we will release the data as open-source for further study. 
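the implementation above names langdetect for the language detection step and a tweepy-based crawler for twitter in the data crawling step. the sketch below shows how those two pieces could look; the credential arguments and the selected fields are placeholders for illustration, not the exact crawler used for amused.

    # sketch of language detection (langdetect) and twitter crawling (tweepy,
    # v3-style api). credentials and field choices are placeholders.
    import tweepy
    from langdetect import detect

    def article_language(article_text):
        # returns an iso 639-1 code of the article text, e.g. "en", "de", "hi"
        return detect(article_text)

    def crawl_tweet(tweet_id, consumer_key, consumer_secret, access_token, access_secret):
        # fetch the tweet referenced by the unique identifier extracted earlier
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_secret)
        api = tweepy.API(auth, wait_on_rate_limit=True)
        status = api.get_status(tweet_id, tweet_mode="extended")
        return {
            "tweet_id": tweet_id,
            "text": status.full_text,
            "created_at": status.created_at,
            "retweets": status.retweet_count,
            "likes": status.favorite_count,
            "user_name": status.user.screen_name,
            "followers": status.user.followers_count,
            "user_location": status.user.location,
        }

similar small crawlers can be written for youtube and reddit, while facebook and instagram posts are fetched through crowdtangle as described above.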
for the use case of misinformation on covid- , we identified ifcn as the data source, and we collected data from different social media platforms. we found that around % of news articles contain linked their content to social media websites. overall, we have collected fact-checked news articles from countries in languages. a detailed description of social media data extracted using the amused framework is presented in table . we have cleaned the hyperlinks collected using the amused framework. we filtered the social media posts by removing the duplicates using a unique identifier of social media post. we have presented a timeline plot of data collected from different social media platforms in figure . we plotted the data from those social media platform which has ( ) figure : a timeline distribution of data collected from a number of different social media platform from january to august , we have presented the platform having data count more than . facebook instagram pinterest reddit tiktok twitter wikipedia youtube table : summary of covid- misinformation data collected from different social media platforms, deleted and duplicate posts are excluded in the count. the total number of post more than unique posts in table because it depreciates the plot distribution. we dropped the plot from pinterest ( ), whatsapp( ), tiktok ( ), reddit( ) . the plot shows that most of the social media posts are from facebook and twitter, then followed by youtube, then wikipedia and instagram. we have also presented the class distribution of these social media post in table . the figure shows that the number of post overall social media post was maximum during the mid-march to mid-may, . misinformation also follows the trend of the covid- situation in many countries because the number of social media post also decreased after june . the possible reason could be either the spread of misinformation is reduced, or fact-checking websites are not focusing on this issue as during the early stage. from our study, we highlighted some of the useful points. usually, the fact-checking website links the social media post from multiple social media platforms. we tried to gather data from various social media platforms, but we found the maximum number of links from facebook, twitter, and youtube. there are few unique posts from reddit ( ), tik-tok( ) but they were less than what we were expecting brennen et al. ( ) . surprisingly there are only three unique posts from pinterest, and there are no data available from gab, sharechat, and snapchat. however, gab is well known for harmful content, and people in their regional languages use sharechat. there are only three unique posts from pinterest. many people use wikipedia as a reliable source of information, but there are links from wikipedia. hence, overall fact-checking website is limited to some trending social media platforms like twitter or facebook while social media platforms like gab, tiktok is famously famous for malformation, misinformation brennen et al. ( ) . what-sapp is an instant messaging app, used among friends or group of people. so, we only found some hyperlink which links to the public whatsapp group. to increase the visibility of fact-checked articles, a journalist can also use schema.org vocabulary along with the microdata, rdfa, or json-ld formats to add details about misinformation to the news articles shahi, nandini, and kumari ( ) . another aspect is the diversity of social media post on the different social media platforms. 
more often, news articles mention facebook, twiter, youtube but less number of post from instagram, pinterest, no post from gab, tiktok. there might be these platforms actively ask or involve the factchecking website for monitoring the content on their platform, or the journalists are more focused on these platforms only. but it would be interesting to study the proposition of fake news on different platforms like tiktok, gab. we have also analysed the multi-modality of the data on the social media platform. in the case of misinformation on covid- , the amount of misinformation on text is more compare to video or image. but, in table we show that apart from text, the fake news is also shared as image, video or mixed-format like image+text. it will also be beneficial to detect fake news on different platforms. it also raises the open question of cross-platform study on a particular topic like misinformation on covid- . someone can also build a classification model shahi et al. ( ) ; nandini et al. ( ) to detect a class of fake news into true, false, partially false or other categories of news articles. while applying amused framework on the misinformation on covid- , we found that misinformation across multiple source platform, but it mainly circulated across facebook, twitter, youtube. our finding raises the concern of mitigating the misinformation on these platforms. in this paper, we presented a semi-automatic framework for social media data annotation. the framework can be applied to several domains like misinformation, mob lynching, and online abuse. as a part of the framework, we also used a python based crawler for different social media websites. after data labelling, the labels are cross-checked by a human which ensures a two-step verification of data annotation for the social media posts. we also enrich the social media post by mapping it to the news article to gather more analysis about it. the data enrichment will be able to provide additional information for the social media post. we have implemented the proposed framework for collecting the misinformation post related to the covid- as future work, the framework can be extended for getting the annotated data on other topics like hate speech, mob lynching etc. amused will decrease the labour cost and time for the data annotation process. amused will also increase the quality of the data annotation because we crawl the data from news articles which are published by an expert journalist. an introduction to social network data analytics key issues in conducting sentiment analysis on arabic social media text fighting the covid- infodemic in social media: a holistic perspective and a call to arms social media and fake news in the election imagery of lynching: black men, white women, and the mob truth is a lie: crowd truth and the seven myths of human annotation on whatsapp, rumours, and lynchings hosting the public discourse, hosting the public: when online news and social media converge types, sources, and claims of covid- misinformation embedded links, embedded meanings: social media commentary and news sharing as mundane media criticism trends shaping digital news overcoming barriers to nlp for clinical text: the role of shared tasks and the need for additional creative solutions the rise of fact-checking sites in europe automatically identifying changes in the semantic orientation of words how does online news curate linked sources? 
a content analysis of three online news media automatic annotation of human actions in video was coronavirus predicted in a dean koontz novel? what a catch! traits that define good annotators amazon mechanical turk: gold mine or coal mine? the open laboratory: limits and possibilities of using facebook, twitter, and youtube as a research data source enhancing transport data collection through social media sources: methods, challenges and opportunities for textual data narrative, event-structure analysis, and causal interpretation in historical sociology faking sandy: characterizing and identifying fake images on twitter during hurricane sandy a dean koontz novel the international fact-checking network truth finding on the deep web: is the problem solved? a survey on truth discovery t-verifier: verifying truthfulness of fact statements defining and measuring credibility of newspapers: developing an index tackling online abuse: a survey of automated abuse detection methods modelling and analysis of temporal gene expression data using spiking neural networks credibility assessment of textual claims on the web russia released lions beautiful soup documentation corpus annotation through crowdsourcing: towards best practice guidelines fakecovid -a multilingual cross-domain fact check news dataset for covid- analysis, classification and marker discovery of gene expression data with evolving spiking neural networks an exploratory study of covid- misinformation on twitter inducing schema. org markup from natural language context fake news detection on social media: a data mining perspective language-detection library snopes. . collections archive utility data annotation with amazon mechanical turk social media analytics-challenges in topic discovery, data collection, and data preparation why do people share fake news? associations between the dark side of social media use and fake news sharing behavior crowdtangle. facebook, menlo park, california the guardian. . the who v coronavirus: why it can't handle the pandemic youtube, twitter and the occupy movement: connecting content and circulation practices list of iso - codes world health organization and others world report how to fight an infodemic fake news: a survey of research, detection methods, and opportunities key: cord- -onj zpi authors: abuelkhail, abdulrahman; baroudi, uthman; raad, muhammad; sheltami, tarek title: internet of things for healthcare monitoring applications based on rfid clustering scheme date: - - journal: wireless netw doi: . /s - - - sha: doc_id: cord_uid: onj zpi covid- surprised the whole world by its quick and sudden spread. coronavirus pushes all community sectors: government, industry, academia, and nonprofit organizations to take forward steps to stop and control this pandemic. it is evident that it-based solutions are urgent. this study is a small step in this direction, where health information is monitored and collected continuously. in this work, we build a network of smart nodes where each node comprises a radio-frequency identification (rfid) tag, reduced function rfid reader (rfrr), and sensors. the smart nodes are grouped in clusters, which are constructed periodically. the rfrr reader of the clusterhead collects data from its members, and once it is close to the primary reader, it conveys its data and so on. this approach reduces the primary rfid reader’s burden by receiving data from the clusterheads only instead of reading every tag when they pass by its vicinity. 
besides, this mechanism reduces the channel access congestion; thus, it reduces the interference significantly. furthermore, to protect the exchanged data from potential attacks, two levels of security algorithms, including an aes bit with hashing, have been implemented. the proposed scheme has been validated via mathematical modeling using integer programming, simulation, and prototype experimentation. the proposed technique shows low data delivery losses and a significant drop in transmission delay compared to contemporary approaches. coronavirus will have a long-term impact overall world. the most significant impact will manifest itself in the penetration of it surveillance and tracking. wireless sensor networks (wsns) become very efficient and viable to a wide variety of applications in many aspects of human life, such as tracking systems, medical treatment, environmental monitoring, intelligent transportation system (its), public health, smart grid, and many other areas [ ] . radio frequency identification (rfid) is a wireless technology with a unique identifier that utilizes the radio frequency for data transmission; it is transferred from the device to the reader via radio frequency waves. the data is stored in tags; these tags can be passive, active, or battery-assisted-passive (bap). the active and bap tags contain batteries that allow them to communicate on a broader range that can go up to km for enterprise users and over km in military applications. unlike battery-powered tags, passive tags use the reader's rf signal to generate power and transmit/receive data [ ] . using wsns and rfid is a promising solution, and it becomes prevalent in recent years. its low cost and low power consumption, rfid is easy to install, deploy, and combine with sensors [ ] . these features make rfid combined with sensors a viable and enabling technology for iot. with a wide variety of increasingly cheap sensors and rfid technologies, it becomes possible to build a real-time healthcare monitoring system at low price with very high quality. the rfid system is considered the strategic enabling component for the healthcare system due to the energy autonomy of battery-less tags and their low cost. in addition, rfid can be attached to the monitored items to be recognized, and hence enhancing the efficiency of monitoring and managing the objects [ ] [ ] [ ] [ ] . having real-time data collection and management is very important, especially in health-related systems. for instance, the united nations international children emergency fund (unicef) and the world health organization (who) reported in that more than thousand women die every year from causes related to pregnancy and childbirth [ ] ; this is due to the unavailability of timely medical treatments. moreover, the report stated that the main reasons of cancer-related deaths are due to the late detection of the abnormal cellular growth at the last stage. many lives can be saved by utilizing real-time iot smart nodes that can continuously monitor the patient's health condition. hence, it empowers the physicians to detect serious illnesses such as cancer in the primary stage. the motivations for the proposed framework are threefold: low cost, high performance, and real-time collection of data. an rfid reader cannot rapidly get data from tags because of its static nature and short transmission range. therefore, high power and costly rfid reader is required to extend the range for quick information gathering. 
however, this would result in an increase in the price of the framework considering the high cost of rfid reader with a high transmission range (not less than $ ) and the increased expenditure of initiating the connection between backend servers rfid reader. the question can we limit rfid readers' quantity, while still accomplishing sufficient information accumulation? moreover, in customary rfid observing applications, such as tracking luggage in airlines, an rfid reader is necessary to rapidly handle many tags at various distances. an rfid reader can just read tags within its range. many limitations could negatively affect the data collection's performance, such as multi bath fading and limited bandwidth; these issues can be maintained by transmitting information in short separations through multi-hop information transmission mode in wsns. besides, in every data collection system, the most critical challenge is to consider the real-time requirements. combining rfid tags with rfid readers and wsns helps significantly in solving this challenge [ ] [ ] [ ] . in this paper, we develop a framework that integrates rfid with wireless sensor systems based on a clustering scheme to gather information efficiently. essentially, our framework utilizes a smart node proposed by shen et al. [ ] . the smart node contains an rfid tag, reduced function rfid reader (rfrr), and a wireless sensor. the cluster's construction depends on multi-criteria in choosing the clusterhead among smart nodes in the same range. for instance, each node can read the tag id and battery level of all smart nodes in its range; the node with the highest battery level will be chosen as the clusterhead. the cluster consists of a clusterhead and cluster members; each member in the cluster transmits their tag information to the clusterhead. then, the rfid readers send the collected data to the backend server for data management and processing. also, to protect exchanged data from potential attacks, we have applied two levels of security algorithms. the proposed technique can lend itself to a wide range of applications, for example, collecting data in smart cities, aiming to monitor people's healthcare in large events such as festivals, malls, airports, train stations, etc. the specific contributions of this paper are listed below: • we exploit the smart nodes to develop an efficient healthcare monitoring scheme based on a collaborative adaptive clustering approach. • the proposed clustering scheme reduces the reader's burden to read every node and allows them to read only the node within its range. this approach minimizes the channel access congestion and helps in reducing any other interference. it also reduces the transmission delay, thus collecting the information between nodes efficiently for a large-scale system. • we formulate the clustering problem as a mathematical programming model to minimize the energy consumption and the interference in a large-scale mobile network. • to protect the collected data by the proposed approach from security threats that might occur during data communication among smart nodes and primary readers, we secure the exchanged data by two security levels. • we develop a small-scale prototype where we explore the performance of the proposed approach. the prototype is composed of a set of wearable smart nodes that each consists of rfid tag, reduced function rfid reader, and body sensor. also, all exchanged data among the smart nodes have been encrypted. the rest of the paper is organized as follows. 
section presents the related work on health care monitoring applications. in sect. , the proposed system is discussed, starting with explaining the problem statement followed by the proposed clustering approach. in sect. , the cluster formation is modeled as an integer program. in sect. , we present and discuss the three used methods to evaluate our proposed approach. first, the optimal solution using integer programming is discussed. given the long-running time required for integer programming, the proposed system is simulated using matlab, where the local information is employed to construct the clusters. thirdly, a small-scale prototype is built to test the approach. finally, we conclude this paper with our findings and suggestions for future directions. this section summarizes some of the previous work related to health care monitoring applications. many researchers have focused on solving this problem by using either rfid or wsn as the short-range radio interfaces. however, very few of these solutions are suitable for the problem (health care monitoring applications for a large-scale system) that addresses a crowded area with high mobility. sun microsystems, in collaboration with the university of fribourg [ ] proposed a web-based application called (rfid-locator) to improve the quality of hospital services. rfid-locator tracks the patients and goods in the hospital to build a smart hospital. all patients in the hospital are given an rfid based on wristband resembling a watch with a passive rfid tag in it. all patient's history and treatment records are stored in a centralized secure database. doctors have rfid-enabled personal data assistant (pda) devices to read the patient's data determined on the patients' rfid bangles. the results are promising, but too much work is needed in the security and encryption of the collected data. dsouza et al. [ ] proposed a wireless localization network to follow the location of patients in indoor environments as well as to monitor their status (e.g., walking, running). the authors deploy static nodes at different locations of the hospital that interact with a patient mobile unit to determine the patient's position in the building. each patient carries a small mobile node that is composed of a small-size fleck nano wireless sensor and a three-axis accelerometer sensor to monitor his/her physical status. however, using everybody's smartphone gps and wi-fi is not an energy-efficient solution because it requires enormous power. chandra-saharan et al. [ ] proposed a location-aware wsn to track people in a disaster site using a ranging algorithm. the ranging algorithm is based on received signal strength indicator (rssi) environment and mobility adaptive (rema). like [ , ] , the authors in [ ] focused on the healthcare area and provided a survey that shows the current study on rfid sensing from the viewpoint of iot for individual healthcare also proves that rfid technology is now established to be part of the iot. on the other hand, the paper reveals many challenging issues, such as the reliability of the sensors and the actual dependence of the reader's node. there are even more advanced solutions in [ ] ; the authors proposed ihome approach, which consists of three key blocks: imedbox, imedpack, and the bio-patch. rfid tags are used to enable communication capabilities to the imedpack block also flexible, and wearable biomedical sensor devices are used to collect data (bio-patch). 
the results are promising, but the study didn't consider monitoring purposes. another smart healthcare system is proposed in [ ] to monitor and track patients, personnel, and biomedical devices automatically using deferent technologies rfid, wsn, and smart mobile. to allow these different technologies to interoperate a complex network communication relying on a coap, low-pan, and rest paradigms, two use cases have been implemented. the result proved a good performance not only to operate within hospitals but to provide power effective remote patient monitoring. the results are promising, but their approach needs more infrastructures of the wired and wireless sensor network. gope and hwang [ ] proposed a secure iot healthcare application using a body sensor network (bsn) to monitor patient's health using a collection of tiny-powered and lightweight wireless sensor nodes. also, the system can efficiently protect a patient's privacy by utilizing a lightweight anonymous authentication protocol, and the authenticated encryption scheme offset codebook (ocb). the lightweight anonymous authentication protocol can achieve mutual authentication, preserve anonymity, and reduce computation overhead between nodes. the ocb block cipher encryption scheme is well-suited for secure and expeditious data communication as well as efficient energy consumption. the results are promising, but their approach needs infrastructure. furthermore, an intelligent framework for healthcare data security (ifhds) has been proposed to secure and process large-scale data using a column-based approach with less impact on data processing [ ] . the following table comapres the proposed approach with the existing literature. it shows that there is no similar work to the proposed approach. techniques f- f- f- f- f- f- f- f- f- [ ] a hybrid rfid energy-efficient routing scheme for dynamic clustering networks a smart real-time healthcare monitoring and tracking system based on mobile clustering scheme a data collection method based on mobile edge computing for wsn energy-efficient large-scale tracking systems based on mobile clustering scheme energy-efficient large-scale tracking systems based on two-level hierarchal clustering : the smart node is a wearable smart node that includes a reduced function rfid reader (rfrr), a body sensor (bs), a rfid tag and a microcontroller, where in the rfid reader has a greater transmission range than the rfrr, where in the rfrr reads other smart nodes' tags and stores this data into its own rfid tag feature (f- ): aplurality of smart nodes, which integrate radio-frequency identification (rfid) and wireless sensor network (wsn) feature (f- ): the clustering scheme in which each node reads the tag id of all nodes in its range and a cluster head is a node which has the highest cost function (e.g. battery level); the cluster consists of a clusterhead and cluster members feature (f- ): the data collection scheme in which an rfid reader receives all packets of node data from the ch, and the rfid reader sends the collected information to a back-end server for data processing and management feature (f- ): formulating a novel mathematical programming model which optimizes the clustering structures to guarantee the best performance of the network. 
the mathematical model optimizes the following objective functions: ( ) minimizing the total distance between chs and cms to improve positioning accuracy; and ( ) minimizing the number of clusters which reduces the signal transmission traffic feature (f- ): two level security is obtained by when a node writes data to its rfid tag, the data is signed with a signature, which is a hash value, the obtained hash is encrypted with a aes bits shared key in this section, the proposed system is discussed, starting with explaining the problem statement followed by the proposed solution. during healthcare monitoring of people, the main challenge is to ensure safety, efficient data collection, and privacy. people stay in a bounded area, embedded with various random movements in their vicinity. different technologies have been suggested to collect data from crowds and can be categorized as passive and active sensing. passive sensing, such as computer vision, does not need any connection with the user. they can aid in movement detection, counting people, and density approximation [ , ] . however, these approaches fail to deliver accurate identification of individuals in addition to the need for ready infrastructure, which is very costly. there are also some active systems such as rfid tags that can be attached to the individual and obtain user's data. nevertheless, these systems require an expensive infrastructure for organizing rfid readers at points of data collection [ ] . therefore, to deliver accurate identification of individuals in addition to reduce the cost of the infrastructure and to attain efficient large-scale data collection for healthcare monitoring applications, we suggest employing a system of mobile smart nodes that is composed of rfid and wsn. the mobile smart nodes are clustered to minimize data traffic and ensure redundancy and delivery to the command center. however, clustering rfid nodes into groups comes with many technical challenges, such as achieving accurate positioning, collecting information in each cluster, and reporting this information from clusters head to the server for processing it. in addition, there are also many challenges related to clustering, which is crucial to managing the transmission to avoid interference. furthermore, the rfid tag is susceptible to malicious attacks; therefore, we implemented two levels of security algorithms to protect the stored data from potential attacks. this section discusses the proposed data collection technique that can efficiently collect the health information (e.g., temperature, heartbeat) and make them available to the backendback-end server in real-time. the main components in our system architecture include smart nodes, rfid readers, and a backend server, as shown in fig. . the smart node integrates the functionalities of rfid and wireless sensor node. it consists of body sensor (bs), rfid tag, and reduced-function rfid reader (rfrr). unlike standard sensors, bs does not have a transmission function. bs is responsible for collecting the body-sensed data, such as heartbeat, muscle, temperature. the rfrr is an rfid reader with a small range compared to the traditional rfid reader. the protocol is composed of two phases: cluster construction and data exchange. in the beginning, each node reads the tag particulars (e.g., id, battery level) of all nodes in its range. then, the node, for example, with the highest battery level, is autonomously nominated as a clusterhead for this group of nodes. 
all smart nodes initiate a table of the nominees to be the clusterhead of the newly constructed cluster. the clusterhead sends a message to all nodes within its range to inform them that i am a clusterhead to join its group. secondly, the node accepting the offer from this clusterhead node sends an acknowledgment message; this is important to avoid duplicate association with multiple nodes. this step ends the cluster construction. once the cluster is formed, it reads other smart nodes and stores their data into its local tag. the clusterhead tag works as a data storage. finally, when an rfrr comes across rfid, the stored data are transferred to rfid and the backendback-end server for further processing. this feature helps reach remote nodes and hence enhance the system reliability and reduce the infrastructure cost. this process is repeated periodically; new clusters are formed, and new clusterheads are selected along with their children. this technique guarantees fair load distribution among multiple devices to attain the network's maximum lifetime and avoid draining the battery of any individual smart node. the pseudo-code for our algorithm is shown below. choose the ch with highest bl : if it is ch and meet its cm then : read data from the cluster member : end if : if it is ch and meet an rfid reader then : send its data to the rfid reader : end if the ultimate goal of this research is to design an optimum healthcare monitoring application based on the rfid clustering scheme. to meet the practical requirements for applying the system in large-scale environments, the proposed system's energy consumption should be minimum, and communication quality must be high. therefore, the integer programming model presented below aims to optimize the following objectives: • minimizing the total distance between clusterheads (chs) and cluster members (cms). • minimizing the number of clusters. the first objective, which is to minimize the total distance between all chs and their respective cms, is meant to enhance tag detectability. also, shorter distances improve the signal quality and reduce the time delay of transmissions within each cluster. for example, in traditional rfid monitoring applications, such as supply chain management and baggage checking in delta airlines, an rfid reader is required to process several tags at different distances in a short time frame. an rfid reader can only read tags in its range. limited communication bandwidth, background noise, multi-path fading, and channel accessing contention between tags would severely deteriorate the performance of the data collection [ ] . the second objective is pursued because minimizing the number of clusters reduces signal transmission traffic, lowering the interference between signals. this results in reducing the use of energy and maximizing the lifetime of the network. for instance, rfid tag data usually is collected using direct transmission mode, in which an rfid reader communicates with a tag only when the tag moves into its transmission range. if many tags move towards a reader at the same time, they contend to access the channels for information transmission. when a node enters the reading range of an rfid reader, the rfid reader reads the node's tag information. suppose several nodes enter the range of rfid reader at the same time. in that case, the rfid reader gives the first meeting tag the highest priority to access the channel, reducing channel contention and long-distance transmission interference [ ] . 
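as a complement to the pseudo-code above, the following python sketch walks through one clustering round: local clusterhead election by battery level, replication of member tag data to the clusterhead, and upload when the clusterhead meets a primary rfid reader. the class name, the thresholds and the distance test are illustrative assumptions, not the exact protocol parameters.

    # sketch of one clustering round; names and thresholds are assumptions.
    import math

    class SmartNode:
        def __init__(self, node_id, battery, x, y, sensed):
            self.node_id = node_id
            self.battery = battery      # bl, used as the cost function
            self.x, self.y = x, y
            self.sensed = sensed        # body-sensor reading stored in the tag
            self.cluster_head = None
            self.tag_storage = []       # a clusterhead accumulates members' tag data

    def in_rfrr_range(a, b, rfrr_range):
        return math.hypot(a.x - b.x, a.y - b.y) <= rfrr_range

    def form_clusters(nodes, rfrr_range, battery_threshold):
        """each node nominates the highest-battery neighbour (itself included)
        as clusterhead, provided its battery level is above the threshold."""
        for node in nodes:
            neighbours = [n for n in nodes if in_rfrr_range(node, n, rfrr_range)]
            candidates = [n for n in neighbours if n.battery >= battery_threshold]
            node.cluster_head = max(candidates or [node], key=lambda n: n.battery)

    def replicate_to_heads(nodes):
        # cluster members copy their tag data into the clusterhead's tag storage
        for node in nodes:
            head = node.cluster_head
            if head is not node:
                head.tag_storage.append((node.node_id, node.sensed))

    def upload_if_reader_met(head, reader_position, reader_range):
        # when a clusterhead passes a primary rfid reader, the reader collects the
        # whole cluster's data in one exchange instead of reading every tag
        rx, ry = reader_position
        if math.hypot(head.x - rx, head.y - ry) <= reader_range:
            payload, head.tag_storage = head.tag_storage, []
            return payload
        return None

because form_clusters is run again at the start of every round, the clusterhead role rotates towards the nodes with the most remaining battery, which is the load-balancing behaviour described above; the hash-and-aes protection described earlier would be applied to the stored tag data before it is exchanged.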
in the clusterhead based algorithm, cluster members replicate their tag data to the clusterhead. when a clusterhead of a particular cluster reaches an rfid reader, the rfid reader receives all nodes' information in this cluster. this enhanced method significantly reduces channel access congestion and reduces the information exchanges between nodes. the method is suitable for a wide range of applications where monitored objects (e.g., zebras, birds, and people) tend to move in clusters. let i = to n denote the cm number, j = to n denote the ch number, dij denotes the distance between cmi and chj, and f denotes the fixed cost per ch. the user's battery level (bl) is defined as in ( ), / which is a predefined node energy threshold. expressions ( ) and ( ) define the decision variables, xij and yj, which are integer binary variables. fig. the architecture of the healthcare monitoring system fig. a timeline of the transactions carried out between smart nodes, b timeline of the transactions carried out between smart nodes and the main rfid reader wireless networks the complete integer-programming model of the clustering problem is given by ( ) . the first expression in ( ) is the objective function z, which consists of two terms. the first term is the total distance between chs and cms, and the second term is the total number of clusters in the network. the objective function z is minimized subject to four sets of constraints. constraint (i) ensures that every cm has a ch, so we avoid any isolated smart nodes. constraint ii controls the maximum cluster size (cs). constraint iii ensures that all cluster members are within the ch's rfid range, i.e., not more than d max away (e.g., two feet). finally, constraint iv ensures that a ch node's battery level must be at least / (e.g.. %). the fixed cost of each ch is denoted by f, which is analyzed later. in this section, the performance of the proposed approach is evaluated using three methods: the integer programming, simulation, and a small-scale prototype. the general algebraic modeling system (gams) is designed for modeling and solving linear programming (lp), nonlinear programming (nlp), and mixed-integer programming (mip) optimization problems [ ] . since the above model described in eq. ( ) is a binary integer program, it is solved by the mip feature of gams. we use gams version . . . we consider two different scenarios. the first scenario tackles the problem by considering the two terms in the objective function that aims at minimizing the number of clusters and the total distance between chs and cms to find the optimal cluster size (cs) in constraint ii. the second scenario applies sensitivity analysis by fixing the total number of nodes to n = , , , , and ; this is done while changing the fixed cost of each ch, f, and calculating the optimal value of the number of clusters and the total distance as well. both scenarios are analyzed under the condition that the service region's size is set as * ft . to achieve a % confidence level, we have repeated each experiment times using different random input for nodes' locations and the battery level for each node. it can be observed from fig. that the total distance between the chs and the cms is reduced on average when cs is equal (i.e., one clusterhead and five cluster members) for nodes and nodes. the total distance between the chs and the cms is also reduced on average when cs is equal , , for nodes, nodes, and nodes, respectively. 
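before continuing with the numerical results, the verbal description of the objective function z and constraints (i)-(iv) above can be summarised compactly. the latex sketch below is a hedged reconstruction: the index sets, the cluster-size bound cs, the range bound d_max and the battery threshold theta are kept symbolic because the numeric values were lost in extraction, and the exact form of the paper's eq. ( ) may differ.

    % hedged reconstruction of the clustering model described verbally above
    \begin{align*}
    \min\; z \;=\; & \sum_{i=1}^{N}\sum_{j=1}^{N} d_{ij}\,x_{ij} \;+\; f\sum_{j=1}^{N} y_j \\
    \text{s.t.}\;\;
    & \textstyle\sum_{j} x_{ij} = 1 \quad \forall i
        && \text{(i) each CM joins exactly one CH}\\
    & \textstyle\sum_{i} x_{ij} \le \mathrm{CS}\cdot y_j \quad \forall j
        && \text{(ii) maximum cluster size}\\
    & d_{ij}\,x_{ij} \le d_{\max} \quad \forall i,j
        && \text{(iii) members within the CH's RFRR range}\\
    & \mathrm{BL}_j\, y_j \ge \theta\, y_j \quad \forall j
        && \text{(iv) CH battery level at least } \theta\\
    & x_{ij},\, y_j \in \{0,1\}
    \end{align*}

the gams results discussed next were obtained by solving this binary integer program for different cluster-size bounds and values of f.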
for example, with nodes, the minimum accumulated distance between all clusters and their members is about ft when the cluster size is equal to , whereas with a cluster size of it is about ft. similar to the -node scenario, the minimum distance is about ft when the cluster size is equal to , whereas it is about ft with a cluster size of and when the cluster size is . therefore, the clustering approach is effective in reducing the total distances when cs is equal to for nodes and nodes, and , , for nodes, nodes, and nodes, respectively. figure displays the number of clusters while the cluster size is changing. it can be observed that the number of clusters drops when the cluster size increases. however, we are interested not only in minimizing the number of clusters but also in minimizing the total distance between clusterheads and cluster members, so as to achieve accurate positioning and maximize the lifetime of the network. for instance, with nodes, the optimum minimum distance is about ft when the cluster size is equal to , and with nodes, the optimum minimum distance is about ft when the cluster size is equal to . therefore, the optimum value of the cluster size is equal to for nodes and nodes, and , , for nodes, nodes, and nodes, respectively. figure demonstrates the model's total distance when the fixed cost per master f is equal to e , where e = , , …, . for nodes, the optimal (minimum) total distance is ft, which is obtained when f is equal to (e = ). for the case of nodes, the optimal total distance is ft, which is also obtained when f is equal to . these numbers indicate that the clustering approach is well-suited for large-scale monitoring applications. figure illustrates the optimal number of clusters when the value of the fixed cost per master f is equal to e, where e = , , …, . for nodes, the optimal (minimum) number of clusters is clusters, which is obtained when e = , or f = . for the case of nodes, the optimal number of clusters is clusters, which is also obtained when f = . therefore, the best value of f for both terms of the optimization function in eq. ( ) to work effectively is . in this section, we formulate the energy consumption of the proposed clustering approach and the traditional approach analytically. we begin by defining the following parameters: r: rfid maximum data rate (bps); p a : rfid active power (w); p i : rfid idle power (w); t a tag = l/r: rfid tag active time in seconds, where l is the data length in bits. we define the total energy consumption for the traditional approach as follows. for the traditional approach, t a tag = t round . given the current advancement in rfid technology, we can confidently assume that the collision rate is very low; hence, b i ≈ . then, for the proposed approach, we define the following additional parameters: e ch : total energy consumption per clusterhead; e total : total energy consumption. again assuming a minimal collision rate, b i ≈ and t i tag (i) ≈ t round − t a tag for each i. besides, in order not to miss any data, the clusterhead is kept on for the whole round period, hence t a rfrr = t round and t i rfrr = . equation ( ) can then be rewritten accordingly; a cleaned-up sketch of these expressions is given at the end of this paragraph. (fig. : the average total distance when changing cs, for to nodes. fig. : number of clusters when changing cs, for to nodes. fig. : the total distance when changing f, for to nodes.) we have implemented the proposed system for monitoring the health parameters using cisco packet tracer , since it supports iot, rfid, and many other functions.
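the energy expressions referenced above arrive garbled in this copy; a sketch consistent with the stated definitions (and only a sketch—the paper's exact equations are elided) is:

    E^{trad}_{i} \;\approx\; p_a\, t^{tag}_{a} + p_i\, t^{tag}_{i}
    \quad\text{with}\quad t^{tag}_{a} = t_{round}
    \;\;\Rightarrow\;\; E^{trad}_{total} \;\approx\; \sum_{i=1}^{n} p_a\, t_{round}

    E_{ch} \;\approx\; p_a\, t_{round}
    \qquad
    E_{cm} \;\approx\; p_a\, t^{tag}_{a} + p_i\,\big(t_{round} - t^{tag}_{a}\big)

    E^{clu}_{total} \;\approx\; \sum_{ch} E_{ch} \;+\; \sum_{cm} E_{cm}

the intuition is the one stated above: in the traditional approach every tag stays active for the whole round, whereas in the clustering approach only clusterheads do, so the total drops as members spend most of the round idle.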
figure shows the smart node components as built using cisco packet tracer. the smart node consists of the rfrr, the bs, the rfid tag, and the microcontroller. the rfrr is a standard rfid reader with a limited range. we program the rfrr to perform two tasks: the first task is reading data from the attached body sensors and storing the data into its tag; the second task is reading the data from other smart nodes within its transmission range and storing it into its tag. the body sensor is responsible for collecting body-sensed data such as temperature and heartbeat. the rfid tag works as data storage. the microcontroller (mcu), on the other hand, is used to monitor, verify, and process the smart node readings. the data transmitted between smart nodes and rfid readers has three fields: a unique smart node id assigned to each user ( byte), the sensed data ( byte), and the timestamp, which records the time at which the data is collected ( bytes). furthermore, to protect the collected data from potential attacks, we apply the rivest-shamir-adleman (rsa) algorithm [ ] . figure shows the components of the rfid reader and its connectivity with the backend server. the rfid readers are responsible for collecting the data from smart nodes and delivering them to the backend server. the transmission range of the rfid reader is much greater than that of the rfrr. upon reading the smart node tag data, the reader sends that data directly to the backend server wirelessly, carried by udp packets. the rsa algorithm is also applied to the data transmitted from smart nodes to primary readers. using the above setup, we start by studying the packet delay and the number of delivered packets for the traditional approach and the clustering approach. in the traditional approach, every node sends its packets directly to an rfid reader. in the clustering approach, every node sends its packets to its clusterhead, and the clusterhead forwards them to an rfid reader. each node sends ten packets every minute, and the simulation has been run for min to achieve a % confidence interval. the average delay per packet is calculated using eq. ( ) as (1/n) Σ (r t − s t), where n is the number of delivered packets, r t is the receiving time, and s t is the sending time. table shows a sample of the collected data at the backend server before and after implementing the rsa algorithm. the smart node appends the timestamp to the sensed data and stores the information in its tag through the rfrr. as stated before, the data transmitted between smart nodes and rfid readers has three fields, namely, the unique smart node id, the sensed data, and the timestamp when the data was collected. figure illustrates the average transmission delay per packet for different numbers of nodes. we can notice that the traditional approach's delay per packet is almost fixed regardless of the number of available smart nodes. this behavior can be attributed to the fact that each node would meet the rfid readers for forwarding its packets with equal probability. on the other hand, when the clustering approach is employed, the delay drops significantly; for example, when n = , the packet delay drops by %. the higher the number of smart nodes, the lower the packet delay; this happens because when the number of smart nodes increases in the same area, the density increases, as well as the number of clusterheads. (fig. : number of clusters when changing f, for to nodes. fig. : the smart node components as built in packet tracer.)
therefore, the probability that a regular node meets a clusterhead increases, which reduces the delay in delivering the collected data to the primary reader and then to the back-end server. figure displays the number of delivered packets for different numbers of nodes. in the clustering approach, the system delivers exactly packets, which is the total number of packets generated by all smart nodes. on the other hand, in the traditional approach, the system suffers packet loss (e.g., % loss for n = ) due to the increase in channel access congestion as the number of nodes increases. next, we study the energy consumption of the traditional approach, the optimal approach, and the proposed clustering approach. in the traditional approach, every node sends its packets directly to an rfid reader. in the clustering approach, as explained in sect. . , every node reads the tag particulars (battery level) of all nodes in its range. the node with the highest battery level is then chosen as the clusterhead for this group of nodes. the clusterhead then broadcasts a message to all nodes within its range to inform them that it is a clusterhead and to invite them to join its group. next, a node accepting this clusterhead's offer sends an acknowledgment message; this is important to avoid duplicate association with multiple clusterheads. once the cluster is formed, the clusterhead remains active, and the cluster members remain in sleep mode. the clusterhead reads the other smart nodes and stores their data into its local tag. each cluster member switches to active mode every s to store its data into its own local tag. finally, the clusterhead sends the data to an rfid reader, and then to the backend server for further processing and management. this process is repeated every min; new clusters are formed, and new clusterheads are selected along with their children. this technique guarantees fair load distribution among multiple devices so as to attain the maximum lifetime of the network and avoid draining the battery of any individual smart node. the relative performance of the three methods has been evaluated using matlab. it is assumed that each node can send data traffic at a rate of kbps, and that it can send frames with sizes up to bytes (one byte for the id tag number, one byte for the data (heartbeat), and two bytes for the timestamp and sequence number). table shows the rfid hardware energy consumption parameters, as specified by sparkfun [ ] . in order to achieve a % confidence interval, each simulation experiment was repeated times using different random topologies. for each simulation run, the total energy consumption per round was calculated for different values of the number of nodes (n = , ,…, ). figure and table show the average total energy consumption for the traditional approach, the clustering algorithm, and the optimal gams solution of the integer programming model. figure shows that the clustering solution's total energy consumption is close to the minimum total consumption obtained by the optimal gams solution. the clustering algorithm's total energy becomes closer to the optimal value as the number of nodes increases. this result is clear from table , which shows a difference of % between the clustering algorithm's performance and the optimal gams solution when the number of nodes is equal to , but only a difference of . % when the number of nodes is equal to .
this feature shows that the proposed clustering algorithm can produce high-quality, near-optimum solutions for large-scale problems. as shown in table , the traditional approach's energy consumption is . % higher than the optimal consumption specified by gams when the number of nodes is equal to , and . % higher when the number of nodes is equal to . the traditional approach (without clustering) is not a practical solution method for large-scale systems. in this section, we evaluate the performance of the proposed approach using a small-scale prototype. we begin by describing the experimental setup and then discuss the experimental results. figure shows the smart node components in our prototype testbed. the smart node consists of rfrr, bs, rfid tag, and the microcontroller. the rfrr is a standard rfid reader with a limited range, which can read up to two feet as in spark fun specification with onboard antenna [ ] . we program the rfrr to perform two tasks. the first task is reading the heartbeat, and the muscle sensed data from the bs (via pulse sensor, and muscle sensor), respectively, and storing this data into its tag. the second task is reading the data from other smart nodes within its transmission range and storing it into its tag. bs is responsible for collecting the body-sensed data such as heartbeat and muscle data. the rfid tag works as a packet memory buffer for data storage. arduino's read board is a microcontroller that is used to monitor, verify, and process smart nodes readings. the transmitted data between smart nodes and rfid readers has three fields, smart node id, the sensed data, and the sequence number of the data to know when the data was recorded. for each node, three packets of data are needed to be published so that other nodes can get their information. therefore, we need only four bytes of data entries: node id ( byte), heart rate information ( byte), and the sequence number ( bytes). the sequence number helps in discovering how recent the carried information is, and helps other nodes in deciding whether to record newly read data or discard it. each rfid tag has a -byte capacity; the first bytes are divided into chunks of bytes where each is used to store information of one node, this sums to a total of data slots. the remaining bytes are used for authentication. the first data slot is reserved for one's tag. other data slots are initially marked as available; that is, they do not contain data about other nodes and are ready to be utilized for that purpose. figure shows the flowchart that presents the process of handling new data. when a new data arrives and is to be stored, the controller tries to find whether a slot that contains data for the same id exists. if so, the slot is updated if the sequence number is less than the new sequence number; otherwise, the new data is discarded. if the controller does not find a previous record for that id, it stores its data in a new available slot, which means some data to be lost. we implement two levels of security algorithms to ensure the integrity of the arrived data, as well as to authenticate the source of data in our scheme. when a node writes the bytes data into its tag, the data is signed with bytes signature, which is used for authentication. to obtain the signature, the controller calculates the md bits hash value of the data bytes. then, the obtained hash is encrypted with the aes bits shared key. the result is the signature and is stored on the tag. 
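the slot handling and signing just described (and the verification step discussed next) can be sketched as follows. this is an illustration rather than the paper's firmware: the slot capacity and key handling are assumptions, and aes in ecb mode over the 16-byte md5 digest is one plausible reading of "the obtained hash is encrypted with the shared key" (the pycryptodome package is assumed for aes).

    # sketch of tag-slot updates and the md5+aes signature described above.
    # key, cipher mode and capacity are assumptions; the paper does not specify them.
    import hashlib
    from Crypto.Cipher import AES              # pycryptodome

    SHARED_KEY = b"0123456789abcdef"           # 16-byte shared key (illustrative)

    def sign(data: bytes) -> bytes:
        digest = hashlib.md5(data).digest()                         # hash of the data bytes
        return AES.new(SHARED_KEY, AES.MODE_ECB).encrypt(digest)    # encrypted hash = signature

    def verify(data: bytes, signature: bytes) -> bool:
        return sign(data) == signature                              # recompute and compare

    def store_reading(slots: dict, node_id: int, seq: int, payload: bytes, capacity: int):
        # slots maps node_id -> (sequence number, payload)
        if node_id in slots:
            if slots[node_id][0] < seq:
                slots[node_id] = (seq, payload)                     # newer record: update the slot
            return                                                  # older or duplicate: discard
        if len(slots) < capacity:
            slots[node_id] = (seq, payload)                         # take an available slot
        # if the tag is full and the id is unknown, the reading is lost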
to verify a newly read tag, the controller computes the hash of the new data (but not the signature), encrypts it with the shared key, and compares the result with the signature. the new data is valid if the result and its signature match each other. otherwise, it is considered an invalid node, and its data is discarded. the experimental prototype consists of three smart nodes ( , , ) and one primary rfid reader, as shown in fig. . each smart node consists of rfid tag, microcontroller, pulse sensor, and rfrr, a regular rfid reader with a limited range, which can read up to two feet with an onboard antenna. the primary rfid reader is an rfid reader attached to an external antenna to increase its transmission range. in this prototype, node , which has the highest battery level, plays the role of the clusterhead, and node and node play the role of the cluster members. node reads tag information of node and node . then, the primary rfid reader receives all packets of node , node , and node from node when it moves into the primary rfid reader range. then, the rfid reader sends the collected information to the backend server for data processing. figure shows a sample of the collected data of the pulse sensor that includes the beat per minute (bpm), live heartbeat or interbeat interval (ibi), and the analog signal (as) on the serial monitor. each row in fig. includes bpm, ibi, and as. for instance, the first row has as bpm, as ibi, and as as. the typical readings of the beat per minute of the pulse sensor should be between and . otherwise, it is considered an emergency case. it can be observed from figs. and that a valid foreign tag # is read and updated, and a valid foreign tag# is read and then updated on the serial monitor, respectively. to verify a newly read tag, the controller computes the hash of the new data (but not the signature), encrypts it with the shared key, and compares the result with the signature. the new data is valid if the result and its signature match each other. otherwise, it is considered an invalid node, and its data is discarded. figures and shows that tag# and tag# are valid. figure shows the captured data packets in an invalid foreign tag. in this example, the reader using the authentication process, which the controller executed, reported that tag number four is invalid. the controller computes the hash of the new data, encrypts it with the shared key, and compares the result with the signature, so tag four is considered as an invalid node. its data is discarded because the results and signature do not match. in this paper, we presented a novel technique for iot healthcare monitoring applications based on the rfid clustering scheme. the proposed scheme integrates rfid with wireless sensor systems to gather information efficiently, aiming at monitoring the health of people in large events such as festivals, malls, airports, train stations. the developed system is composed of clusters of wearable smart nodes. the smart node is composed of rfid tag, reduced function of an rfid reader, and body sensors. the clusters are reconstructed periodically based on specific criteria, such as the battery level. these clusters collect data from their members and when they come across rfid readers, they deliver the collected data to these readers. on the other hand, using the traditional approaches, only the nodes in the range of the rfid readers can send their tag data to the rfid readers. 
hence, this will cause several performance problems such as long delay, dropped packets, missing data, and channel access congestion. the proposed clustering approach overcome all these problems. it demonstrated outstanding performance in reducing the packet transmission delay, inter-node interference, and better energy utilization. the experimental results have supported the above performance. the proposed approach can lend itself easily to monitor and collect the health information of the society population continuously, especially in the current pandemic. as future research directions, we are planning to integrate the smart nodes with other sensors to ensure full health care application and test the new application in large-scale scenarios. there is also a need to improve the clustering algorithm to guarantee a high level of service quality of the deployed health applications. conflict of interest the authors declare that they have no conflict of interest. ethical approval the study only includes humans in roaming a large hall to test the connectivity of the established networks. a survey on the internet of things security handbook: fundamentals and applications in contactless smart cards, radio frequency identification and near field communication efficient data collection for large-scale mobile monitoring applications influence of thermal boundary conditions on the double-diffusive process in a binary mixture engineering design process an object-oriented finite element implementation of large deformation frictional contact problems and applications x-analysis integration (xai) technology. virginia technical report preventing deaths due to hemorrhage taxonomy and challenges of the integration of rfid and wireless sensor networks neuralwisp: a wirelessly powered neural interface with -m range a capacitive touch interface for passive rfid tags building a smart hospital using rfid technologies wireless localization network for patient tracking empirical analysis and ranging using environment and mobility adaptive rssi filter for patient localization during disaster management the research of network architecture in warehouse management system based on rfid and wsn integration bringing iot and cloud computing towards pervasive healthcare rfid technology for iot-based personal healthcare in smart spaces a health-iot platform based on the integration of intelligent packaging, unobtrusive bio-sensor, and intelligent medicine box an iot-aware architecture for smart healthcare systems bsn-care: a secure iot-based modern healthcare system using body sensor network ifhds: intelligent framework for securing healthcare bigdata effective data collection in multi-application sharing wireless sensor networks a hybrid approach of rfid andwsn system for efficient data collection an analysis on optimal clusterratio in cluster-based wireless sensor networks wireless regulation and monitoringsystem for emergency ad-hoc networks using nodes concurrent data collectiontrees for iot applications acooperation-based routing algorithm in mobile opportunistic networks a data prediction model based on extended cosine distance for maximizing network lifetime of wsn crpd: anovel clustering routing protocol for dynamic wireless sensor networks secure data transmission in hybrid radio frequency identification with wireless sensor networks real-time healthcare monitoring system using smartphones energy-efficient data collection scheme based on mobile edge computing in wsns iterative clustering for energy-efficient 
large-scale tracking systems an asynchronous clustering and mobile data gathering schema based on timer mechanis min wireless sensor networks optimum bilevel hierarchi-cal clustering for wireless mobile tracking systems modeling and representation to support design-analysis integration crowd analysis: a survey. machine vision and applications data-driven crowd analysis in videos gams specifications the sparkfun specification key: cord- -zm yipu authors: tzouros, giannis; kalogeraki, vana title: fed-dic: diagonally interleaved coding in a federated cloud environment date: - - journal: distributed applications and interoperable systems doi: . / - - - - _ sha: doc_id: cord_uid: zm yipu coping with failures in modern distributed storage systems that handle massive volumes of heterogeneous and potentially rapidly changing data, has become a very important challenge. a common practice is to utilize fault tolerance methods like replication and erasure coding for maximizing data availability. however, while erasure codes provide better fault tolerance compared to replication with a more affordable storage overhead, they frequently suffer from high reconstruction cost as they require to access all available nodes when a data block needs to be repaired, and also can repair up to a limited number of unavailable data blocks, depending on the number of the code’s parity block capabilities. furthermore, storing and placing the encoded data in the federated storage system also remains a challenge. in this paper we present fed-dic, a framework which combines diagonally interleaved coding on client devices at the edge of the network with organized storage of encoded data in a federated cloud system comprised of multiple independent storage clusters. the erasure coding operations are performed on client devices at the edge while they interact with the federated cloud to store the encoded data. we describe how our solution integrates the functionality of federated clouds alongside erasure coding implemented on edge devices for maximizing data availability and we evaluate the working and benefits of our approach in terms of read access cost, data availability, storage overhead, load balancing and network bandwidth rate compared to popular replication and erasure coding schemes. in recent years, the management and preservation of big data has become a vital challenge in distributed storage systems. failures, unreliable nodes and components are inevitable and such failures can lead to permanent data loss and overall system slowdowns. to guarantee availability, distributed storage systems typically rely on two fault tolerance methods: ( ) replication, where multiple copies of the data are made, and ( ) erasure coding, where data is stored in the form of smaller data blocks which are distributed across a set of different storage nodes. replication based algorithms as those utilized in amazon dynamo [ ] , google file system (gfs) [ , ] , hadoop distributed file system (hdfs) [ , ] are widely utilized. these can help tolerate a high permanent failure rate as they provide the simplest form of redundancy by creating replicas from which systems can retrieve the lost data blocks, but cannot easily cope with bursts of failures. furthermore, replication introduces a massive storage overhead as the size of the created replicas is equal to the size of their original data e.g. -way replication occupies times the volume of the original data block in order to provide fault tolerance. 
on the other hand, erasure coding [ ] can provide higher redundancy while also offering a significant improvement in storage overhead compared to replication. for example, a -way replication creates replicas of a data block and causes a x storage overhead for providing fault tolerance, while an erasure code can provide the same services for half the storage overhead or even lower by creating smaller parity blocks that can retrieve lost data more efficiently than full-sized replicas. thus, erasure codes are more storage affordable than replication but their reliability is limited to the number of parity blocks for repairing erasures. for example, an erasure code that creates parity chunks cannot fix a data block with or more unavailable or lost chunks. yet the most critical challenge with erasure coding is that it suffers from high reconstruction cost as it needs to access multiple blocks stored across different sets of storage nodes or racks (groups of nodes inside a distributed system) in order to retrieve lost data [ ] , leading to high read access and network bandwidth latency. the majority of the distributed file systems deploy random block placement [ ] and one block per rack policies [ , ] to achieve optimized reliability and load balancing for stored encoded data. however, storing data across multiple nodes and/or racks can lead to higher read and network access costs among nodes and racks during the repairing processes. for example, in the worst case, repairing a corrupted or unavailable block in a node may require traversing all nodes across different racks, causing a heavy amount of data traffic among nodes and racks. also, in a typical cross-rack storage, the user does not have any control over the placement of the data blocks across different racks, limiting the ability of the system to tolerate a higher average failure rate. to reduce the cost of accessing multiple nodes or racks, file systems can keep metadata records regarding the topology of the encoded data codewords (groups that contain original data blocks alongside their parity blocks) in private nodes. however, the placement of the metadata files among the system's nodes is also challenging. for example, storing a codeword in a small group of nodes while keeping metadata about the data blocks scattered throughout the public clouds instead of a specific storage node [ ] , will also require to traverse all nodes at worst in order to recover any failed data inside the codeword. this problem leads to high cross-node read and network access costs, despite the use of metadata. in this paper we propose fed-dic (federated cloud diagonally interleaved coding), a novel compression framework deployed on an edge-cloud infrastructure where client devices perform the coding operation and they interact with the federated cloud to store the encoded data. fed-dic's compression approach is based on diagonal interleaved erasure coding that offers improved data availability while reducing read access costs in a federated cloud environment. it employs a variation of diagonally interleaved codes on streaming data organized as a grid of input records. specifically, the grid content is interleaved into groups that diagonally span across the grid, and then the interleaved groups of data are encoded using a simple reed-solomon (rs) erasure code. 
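to make the storage comparison above concrete with generic parameters (the concrete figures are elided in this copy): n-way replication stores every block n times, while an rs(b, k) code stores b + k chunks that are each 1/b of the block, so

    \text{overhead}_{rep} \;=\; n\times
    \qquad
    \text{overhead}_{rs(b,k)} \;=\; \frac{b + k}{b}\times

for instance, an rs code with b = 2k yields a 1.5x footprint, half that of triple replication, while still tolerating k lost chunks per stripe.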
next, our framework organizes the encoded data into batches based on the number of clusters in the federated cloud and places each batch to a different cluster in the cloud, while keeping a metadata index of the locations of each stored data stream. the benefit is that fed-dic will only access the cluster with the requested data records and retrieve the correspondent diagonals, enabling the system to efficiently extract the corresponding records. fed-dic has multiple benefits: it maximizes the availability of the encoded data by ordering input data into smaller groups, based on diagonally interleaved coding, and encoding each group using the erasure coding technique. furthermore, it supports efficient archival and balances the load by storing each version of the streaming data array in a rotational basis among the storage nodes, e.g. if we have an infrastructure with file clusters, for the first version of the data array, the first batch of diagonals is stored on the first node cluster, the second batch on the second node cluster and the third batch on the third cluster. for the second version of the array, the first batch of diagonals is stored on the second cluster, the second batch on the third cluster and the third batch on the first cluster and so on. we present an approach how multiple storage usage can optimize read access costs while keeping data availability and low bandwidth cost for retrieving data by utilizing multiple storage clusters in the same cloud environment instead of storing data in a single cluster. we illustrate the effectiveness of our approach with an extended experimental evaluation in terms of read access cost, data availability, storage overhead, load balancing and network bandwidth rate compared to popular replication and erasure coding schemes. in this section we provide some background material regarding the technologies that we utilize at fed-dic: the federated cloud environment, erasure coding and diagonally interleaved coding. many large-scale distributed computing organizations that need to store and maintain continuous amounts of data deploy distributed storage systems, such as hdfs [ , ] , gfs [ , ] (which were mentioned above), ceph [ ] , microsoft azure [ , ] , amazon s [ ] , alluxio [ ] etc., which comprise multiple nodes, often organized into groups called racks. currently, most of these systems write and store large data as blocks of fixed size, which are distributed almost evenly among the system's nodes using random block placement or load balancing policies. in each system, one of the nodes operates as the master node e.g. the namenode in hdfs, that keeps a record of the file directories and redirects client requests toward the storage api for opening, copying or deleting a file. however, these policies are limited as they depend on the size of the data stored in the systems as well as the policies followed by the specific storage nodes (e.g., load balancing policies). our framework assumes the deployment of multiple hdfs clusters within the federated cloud environment, each comprising a different master node and storage layer. the client edge device can communicate with each of the master nodes with a different interface in order to store different groups of data into separate hdfs clusters. distributed systems deploy erasure codes as a storage-efficient alternative to replication so as to guarantee fault tolerance and data availability for their stored data. 
erasure codes are a form of forward error correction (fec) codes that can achieve fault tolerance in the communication between a sender and a receiver by adding redundant information in a message; this enables the detection and correction of errors without the need for re-transmission. for instance, a sender node encodes a file with erasure coding and generates a data codeword or a stripe containing original and redundant parity data. next, the sender node sends sequentially the blocks of the encoded stripe to a receiver node. in its turn, the receiver node detects whether there is a sufficient number of available blocks in order to decode them into their original content. if no original blocks are received, the parity blocks can repair them up to a finite range. the most commonly used erasure code algorithm is the reed-solomon (rs), a maximum distance separable code (mds) which is expressed as a pair of parameters (b, k) (rs(b, k)) where b is the number of input chunks on a data block and k is the number of parity chunks created by the erasure code. the parity chunks are generated by utilizing cauchy or vandermonde matrices over a gf ( m ) galois field, where m is the number of elements in the field and m is the word size of encoding. the code constructs a matrix of size k × d which contains values from the gf ( m ) field that correspond to the dimensions of the matrix and represent the positions of the input chunks. next, the rs code derives an inverse k ×k submatrix from the previous matrix. the original matrix is multiplied by the inverse submatrix in order to convert the top square of the former into a k × k identity matrix which will keep the content of the original data chunks unaltered during the encoding and decoding processes. the result is a stripe of length n = b + k, that contains the b chunks of the original data and the k parity chunks generated by the code. rs is k-fault tolerant due to the fact that the original data can be recovered for up to k lost chunks. in other words, while replication needs to copy and store the original data n + times, erasure codes only require to store the data n−k n times, which costs considerably less compared to replication. reed-solomon codes are also characterized by linearity [ ] . in other words, they perform linear coding operations based on the galois field arithmetic. more formally, given an (b, k) code, let b , b , ..., b b be the b original data chunks and p , p , ..., p k be the k parity chunks. each parity chunk p j ( < j < k) can be expressed as this technique is limited as the redundancy provided by simple rs codes can repair up to k unavailable nodes. if there are more than k chunk erasures, the code will not be able to fully repair their original data. our framework tries to deal with limited redundancy by deploying a more advanced erasure coding technique based on reed-solomon and diagonally interleaved coding, the latter of which we describe in the next section. leong et al. have studied a burst erasure model in [ ] , where all erasure patterns with limited-length burst erasures are admissible so that they can construct an asymptotically optimal convolutional code that achieves maximum message size for all available patterns. this code involves stripes derived from one or more data messages interleaved in a diagonal order. 
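stepping back briefly to the reed-solomon description above: the parity expression that appears to have been dropped from this copy ("each parity chunk p j ... can be expressed as ...") has, in the usual linear form over gf(2^m),

    p_j \;=\; \sum_{i=1}^{b} \alpha_{ji}\, b_i, \qquad 1 \le j \le k,

where the coefficients α ji come from the code's cauchy or vandermonde generator matrix; this is the standard identity for linear mds codes rather than a formula specific to this paper.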
for a set of parameters (c, d, k), where c is the interval between input messages, d is the total number of symbols in the encoded message (original data and parity symbols) and k is the number of generated parity symbols, an input message is equally split into a vector of c columns and d − k rows. next, tables of blank or null symbols are placed around the message table that represent non-existent messages before and after the input message. the symbols of the entire table are interleaved in diagonal pattern, forming well-defined diagonals containing at least one symbol from the input message. finally, a systematic block code is used to create k parity symbols for every diagonal, thus constructing a convolutional code with d − diagonal stripes that can repair up to k lost symbols in each diagonal and span across d consecutive time steps. as a result, diagonally interleaved codes are able to handle an extended number of erasure bursts in one message and allow smaller erasures to be fixed without accessing massive amounts of data. in fig. we illustrate with an example how diagonally interleaved coding is applied for a single data block. the process of splitting an input message into a vector can be applied only if the input data is organized in single data stripes. to optimize data availability, our framework uses a derived version of diagonally interleaved coding that takes as input data organized in a grid and interleaves all content into diagonals before encoding them with a reed-solomon code. in this section we present the challenges of existing schemes and how we propose to address them in our fed-dic framework. retrieval. one major challenge in typical cloud environments is the lack of user-oriented control in data distribution and storage. most cloud systems store data blocks in randomly chosen nodes and nodes within racks in their clusters without balancing the load. for example, a system that uses an rs(b, k) to encode its streamed data, will distribute the d = b + k chunks of the generated codeword to d different nodes in a random order. however, in cases of node failures, the system needs to retrieve data from other nodes within the rack or even across racks to retrieve parity data, leading to high read access costs and network overhead, which can considerably slow down the repair process. fed-dic deals with this problem by uploading and distributing the encoded data to a federated cloud with multiple autonomous hadoop clusters in the same network, each with a unique namenode. to retrieve a particular data record, the framework keeps a metadata file containing the locations of the stored encoded data. the metadata file is created and can be accessed by the edge device in order to locate the requested data record and retrieve it faster with a significantly reduced read access latency, limited to the cluster where the specific data record is stored, without the need to traverse all nodes or maintain scattered metadata among nodes or clusters. fed-dic's topology in terms of the stored data among the clusters of the federated cloud, combined with the reduced storage size of the data chunks generated from its encoding process, provide significantly smaller read access costs and transfer bandwidth overhead for nodes in the cloud. limited data availability. distributed systems deploy erasure coding methods to achieve higher redundancy than replication with more affordable storage cost. 
however, the availability provided by simple erasure codes such as reed-solomon codes for the encoded data is restricted to the number of parity chunks generated by the code. more specifically, a reed-solomon code that creates k parity data chunks from b original data chunks (rs(b, k)) can repair up to k failures between the original or parity data. if there are more than k unavailable or failed chunks in the stripe, the rs code will not be able to restore the data back to their original state. to deal with this challenge, several advanced erasure codes have been presented, including alpha entanglement codes [ ] and diagonally interleaved codes [ ] . fed-dic uses a variation of diagonally interleaved coding on a group of streaming data containing input records from multiple sensor groups (columns) across multiple days (rows). the array data are interleaved diagonally and encoded with multiple parity chunks for each arranged diagonal pattern, achieving higher data availability and greater repairing range than conventional erasure coding methods. most large-scale distributed systems deploy load balancing policies for node distribution or utilizing one-node-per-rack [ ] [ ] [ ] to balance the storage load across the cluster. however, most load balancing policies require the use of sophisticated techniques which may lead to load imbalances among nodes, especially when the number of the data chunks in a stripe exceeds the number of nodes that comprise a cluster. fed-dic groups the encoded data diagonals into batches before they are stored to multiple node clusters in a non-random order. if the user decides to upload a data array and store it over the old one, the framework rotates the directions of the clusters in which the new batches will be stored in order to achieve good load balancing. to deal with the above problems of conventional erasure coding on federated clouds, we designed and developed fed-dic (federated cloud diagonally interleaved coding), a framework that utilizes diagonal interleaving and erasure coding on streaming data records in a federated edge cloud environment. the goal of our framework is to reduce the read access cost and network overhead caused by accessing multiple nodes in a federated cloud while maximizing data availability for the data stored in the federated cloud environment. fed-dic also supports load balancing by storing multiple versions of the data records among clusters in a rotational order, while keeping storage availability, using the techniques we have developed and its api for distributing the data and balancing them across the clusters. in cases of high load in a cluster due to data congestion or unavailable nodes, fed-dic can reconfigure the number of batches and the content size of each batch in order to achieve load balancing by storing data to a smaller number of clusters with more nodes and larger storage space. as illustrated in fig. fed-dic comprises three main components: the client side (edge devices), a federated cloud comprising multiple independent clusters where each cluster consists of multiple independent nodes, and a network hub that connects the two other components through the network. 
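before detailing the client-side services, here is a minimal sketch of the two mechanisms above—diagonal grouping of a day-by-sensor-group grid and rotation of the cluster destinations between versions. the grid layout and function names are assumptions, and parity generation is only indicated.

    # sketch of fed-dic's grid interleaving and destination rotation as described
    # above. grid[r][c] is the record of day r for sensor group c.
    def interleave_diagonals(grid):
        rows, cols = len(grid), len(grid[0])
        diagonals = [[] for _ in range(rows + cols - 1)]
        for r in range(rows):
            for c in range(cols):
                # the first diagonal holds (day 1, last group); the last holds
                # (last day, first group), matching the bottom-right-to-top-left order
                diagonals[r + (cols - 1 - c)].append(grid[r][c])
        return diagonals              # each diagonal is then reed-solomon encoded

    def rotate_destinations(clusters):
        # for the next version the last destination becomes the first, so
        # successive uploads spread the batches evenly over all clusters
        return clusters[-1:] + clusters[:-1]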
the client devices are operated by the user and provide six services: ( ) the interleaver module which re-orders the input data set into a grid and interleaves them into diagonal groups, ( ) the coder module which encodes all diagonal groups prior to the uploading process and decodes received diagonal stripes containing userrequested data, ( ) the destination module which splits the encoded stripes into batches and configures the order of destination clusters where the batches will be stored, ( ) the hadoop service which communicates with the namenodes of each cluster in order to upload the diagonal stripe batches, ( ) the metadata service which creates a metadata index file during the upload process and provides a query interface for the user during the retrieval process, and ( ) the extractor module which searches through a received diagonal stripe in order to extract the data record requested by the user and store it to a new file. our framework works as follows: a client takes as input a set of streaming data records and organises them into a grid of d columns and g rows. the data records in the grid are re-ordered into c = d + g − diagonal groups, which are then encoded with reed-solomon, generating up to k parity chunks per diagonal using an -bit galois field. next, fed-dic groups the diagonal stripes into h batches and stores each batch into a different cluster in the federated cloud. simultaneously, the client creates a metadata file that contains information for each stored data record: the day the record was created, the group of sensors that generated the record and the diagonal stripe in which the record was interleaved. to retrieve selected data records, the client receives user-created request queries about data records and communicates with their correspondent clusters to download the stripes that include the records so as to extract their contents in output files. to upload a new version of the already stored data while archiving the older versions, the cluster destinations are rotated in a stack order by setting the first cluster destination at the position of the last cluster destination in a circular pattern. in that way, fed-dic achieves not only the maintenance of multiple versions, but also load balancing throughout all clusters within the federated cloud. if fed-dic kept uploading newer versions into the same clusters each time, there could have been inconsistencies between the clusters. especially, the first and last clusters in the cloud would have smaller data load than the other clusters. the read access cost for a data query q from a group of q queries, is given by the sum of the access time a client needs to traverse l lines in the metadata file to find the requested data, the latency needed to access any h clusters that contain the data (h ≤ h) and the search delay caused by any missing d data chunks in a cluster. the probability p i shows if a chunk is available for transferring. if p i = , the chunk is missing. this is computed by the following formula: where r md is the time a client needs to read a line from the metadata file, r h is the time to access a cluster in the federated cloud and t m is the search delay caused by missing data chunks in the cloud. 
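the displayed formula referred to above does not survive in this copy; a form consistent with the stated components (l metadata lines read, h' ≤ h clusters accessed, d chunks per stripe with availability indicator p i, and the constants r md, r h, t m) would be

    T_q \;=\; l \cdot R_{md} \;+\; h' \cdot R_h \;+\; \sum_{i=1}^{d} (1 - p_i)\, t_m .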
the read access latency l q for downloading and extracting a requested data query q is given by the access cost t q which was computed previously, plus the time required to download all available d chunks in the diagonal stripe that includes the data using an internet connection of b bandwidth and the computation time t dec q a client needs to decode the diagonal stripe so as to extract the result. the formula for the overall query storage latency is given below: where t p is the elapsed time for an available data chunk to be transferred from the federated cloud to a client device. similarly, the total read access latency l q is the sum of the read access latency for all q queries: the read access latency for erasure coding is computed in a similar way to l q , with the only difference that the metadata access time is not taken into account. when stored chunks are missing or unavailable in the federated cloud due to failures or nodes being disabled in the cloud's clusters, erasure codes try to utilize any available parity chunks in order to reconstruct the damaged encoded file. however, if a decent amount of chunks are not available in a cluster, there may be permanent loss of the original data, due to the number of available data chunks being insufficient for use with erasure codes. the data loss percentage d c of a fault tolerance method is measured by the fraction of the probability p i of a data chunk c i being available with the total number of data chunks in the entire cloud, subtracted from , as follows: fed-dic provides an api with the following four operations: encode(). this operation interleaves the input data set into d diagonal data groups of varied length. then, it merges data in each group into new data blocks so as to be encoded with a unique reed-solomon erasure code. cates with all the namenodes of the federated cloud in order to upload and store each batch in a different cluster, while keeping track of the data locations and information in a metadata file stored in the client devices. the metadata file can be shared and backed up in all clients in order to avoid any corruptions. if, for any reason, the cloud changes the location of its clusters, the clients need to update the metadata accordingly. however, a small non-significant access overhead may occur in the case that the client device that performs the store() process becomes unavailable and the metadata have to be accessed from another client. due to the integrity of our private client nodes, the probability of this situation is extremely rare, so it is not considered when measuring the read access latency. this method provides an interface to the user for entering multiple queries regarding a data record the user aims to retrieve. once the user issues his queries, the method searches for each requested data record the diagonal stripe in which it is included and downloads it accessing immediately the corresponding storage cluster. once the clients receive the diagonal stripes with the data requested by the user, this operation decodes any available chunks in a stripe into its original merged data block and extracts the requested result from the block before deleting it. we describe the two main algorithms implemented by our framework: storing data to the federated cloud: a client takes as input the data records to be uploaded, these correspond to g sensor data groups over a time period r days, stored in .csv files. 
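for reference, the latency and data-loss expressions described earlier in this section, whose displayed forms are likewise missing here, can be sketched consistently with the definitions as

    L_q \;=\; T_q \;+\; \sum_{i=1}^{d} p_i\, t_p \;+\; T^{dec}_q,
    \qquad
    L_Q \;=\; \sum_{q=1}^{Q} L_q,
    \qquad
    D_c \;=\; 1 - \frac{\sum_{i} p_i}{N},

where t p ≈ (chunk size)/b for a connection of bandwidth b, and n is the total number of data chunks in the cloud; the data loss percentage follows by multiplying d c by 100.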
the client invokes the encode() operation to organize the content into a grid with dimensions g × r, where its elements are interleaved into c dynamic diagonal arrays of varied length (as shown in fig. ). records are inserted into the grid according to the day and sensor group indicated on the record. starting with the record of the last sensor group during the first day, the client forms a diagonal line from bottom right to top left and inserts any existing grid elements in the diagonal line, into a dynamic diagonal array. the diagonal arrays span through the entire grid with the last one containing only the record of the first sensor group during the last day. next, in each diagonal array, the data in the elements are merged into a single data object and encoded using a (b, k) reed-solomon code. the encoding process splits each merged data object into equally sized b chunks and generates k parity chunks, creating a stripe of length d = b + k. next, the client uses the operation store() to group the diagonal stripes into h batches containing an equal number of (c/h) stripes in each batch and to upload the batches into the different clusters of the federated cloud by communicating with every namenode within the cloud. once the namenode of a cluster receives the data, it distributes the chunks in random order to its nodes. during the storage process, the clients write and store metadata records about the stored data, their version, the date and sensor group as well as the number of the diagonal stripes they belong to. the metadata file helps the edge devices to access the stored data faster and more easily by reducing the access costs among the hdfs clusters. the distribution of the batches is performed in a sequential way. for example, in a federated cloud of f clusters, the first data batch is stored into the first cluster and so on until the last batch is stored in the f -th cluster. when the user wants to upload a new version of the data over the already stored versions, the clients swap the order of the cluster destinations by placing the first cluster destination right after the last cluster destination of the older version in a last in, first out (lifo) order. in our example, for the second version of our data, the first batch will be uploaded into the f -th cluster, the second one to the first cluster and so on with the last cluster being uploaded to the (f - )-th cluster. the way data records are stored in fed-dic enables us to traverse - clusters at most to recover any data segment. whereas, conventional (b, k) reed-solomon would merge r with every other record in the input into a single data block, split it into b original chunks and encode it using a galois field matrix to generate k parity chunks which are distributed to the cloud via hadoop. thus, even if a small part of data must be recovered, the data encoded with rs need to be restored in their entirety, which may require traversing all clusters in the cloud, incurring a heavy read access cost. the clients provide an interface to the user awaiting response queries. when the user issues a query, the clients gather all entered queries into a list array and use retrieve() to search through the metadata file generated from the uploading process for the diagonal stripes where the query data are stored. for every entry in the query list, the client connects to the correspondent cluster to download the diagonal stripe with the requested data. 
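a compact sketch of the two procedures—store() as just described, and the retrieval path that continues below—is given here, reusing the interleave_diagonals helper sketched earlier. rs_encode/rs_decode, merge/extract, enough_chunks and the cluster client objects are placeholders for the paper's reed-solomon and hadoop machinery, and the metadata layout is assumed.

    # compact sketch of the store and retrieve paths described in this section.
    # all helper names are placeholders, not real apis.
    def store(grid, clusters, h, rs_encode, metadata):
        diagonals = interleave_diagonals(grid)                     # c = d + g - 1 diagonal arrays
        stripes = [rs_encode(merge(diag)) for diag in diagonals]   # b + k chunks per stripe
        per_batch = max(1, len(stripes) // h)
        for b_idx in range(h):                                     # one batch per federated cluster
            start = b_idx * per_batch
            end = len(stripes) if b_idx == h - 1 else start + per_batch
            clusters[b_idx].upload(stripes[start:end])
        for s_idx, diag in enumerate(diagonals):
            for rec in diag:                                       # day, sensor group, stripe, cluster
                metadata.append((rec.day, rec.group, s_idx, min(s_idx // per_batch, h - 1)))

    def retrieve(query, clusters, rs_decode, metadata):
        for day, group, s_idx, c_idx in metadata:
            if (day, group) == query:
                chunks = clusters[c_idx].download(s_idx)           # only the cluster holding the stripe
                if not enough_chunks(chunks):                      # fewer than b of the b + k chunks left
                    return None                                    # the stripe cannot be recovered
                return extract(rs_decode(chunks), query)           # decode, then pull out the record
        return None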
if the edge device fails to download sufficient amount of chunks for restoring the stripe into its original data, it informs the user that the queried data from that diagonal stripe cannot be recovered. if it receives enough chunks from the stripe, it deploys decode() to restore the diagonal stripe using rs(b, k) back to its original content. finally, the clients search through the recovered data objects for the requested record entries and extract them as a result. when there are multiple concurrent requests from users, the clients schedule the requests to the hub in multiple rows according to the source cluster of the requested data and return the result for the oldest request each time. in this section we evaluate fed-dic in terms of data loss, maximum transfer network rate and storage overhead, compared to the replication and conventional reed-solomon erasure coding techniques. the client machines we used were desktop computers with an intel i - -core cpu at . ghz per core, with gb ram and a western digital wd ezex- wn a hard disk drive of tb. the machines run microsoft windows and are connected to the network using a cisco rv dual gigabit wan vpn router with a data throughput of mbps and support of , concurrent connections. the router operates as our network hub and due to its specifications, the probability of a failure or bottleneck is extremely small. although there are several ways to deal with such failures, this is outside the scope of our paper. for the experiments, we deploy via oracle virtualbox clusters each comprising nodes, virtual machines (vms) in total running apache hadoop . . in linux lubuntu . for evaluating fed-dic against replication and reed-solomon. for memory and disk allocation reasons, the vms are running across real desktop machines: our client device and a second machine with the same hardware specifications as the first, which is connected to the same network. vms are running on each machine, connected to the same network as the client machines using a bridged adapter. our setup is restricted to the equipment and network availability in our local computing and communication environment, however the algorithms we have developed can adapt well to accommodate larger clusters with thousands of nodes by modifying the number of batches in which the encoded data will be grouped as well as the content size in each batch. also, we can set the batches to be stored in clusters with higher reliability within a large cloud. our data set for the experiments is a collection of transport values obtained from scats sensors that are deployed in the dublin smart city [ ] . this data set contains a huge amount of records with information regarding the specific sensor that captured the snapshot and its capture date; the data needs to be stored and maintained in the cloud to be further analyzed by the human operators (i.e., to identify congested streets and entire geographical areas over time). fed-dic is responsible to store and recover this data to and from the cloud. our first experiment involves the total read latency of recovering data with fed-dic ( , ) compared to reed-solomon ( , ) . for rs ( , ) we merge the input files of the data grid used by fed-dic to a single .csv file. when the file is encoded to a stripe of chunks ( original and parity), we distribute chunks to each of the first clusters, with the last being stored in the last cluster. note, that reed-solomon could retrieve the encoded file traversing only clusters instead of going through all clusters. 
in fact, fed-dic could also be easily configured (by appropriately setting the number of batches where diagonal stripes are grouped) to store and retrieve the data successfully utilizing only clusters. however, in order to take advantage of the entire experimental environment ( -cluster cloud system with a total of nodes) we utilize all clusters for both techniques, to avoid load imbalances (data distributed in clusters, while the th cluster is unused) and minimize the impact on the data loss percentage (in cases of failures). due to the data chunks spanning across all clusters, a simple decoding process with rs takes almost s to complete, as seen in fig. . this happens due to the clients having to access all clusters in order to download all the chunks needed for recovering the stripe's original data. even if we request a small portion of the encoded data, reed-solomon has no built-in features that allow us to retrieve a specific part of data, so it will still have to retrieve and decode the entire file content in order to give us an output. our fed-dic technique on the other hand, reduces the total access latency by returning only requested parts of the stored data instead of the entire data content by accessing to clusters at most. for to queries for data inside the same cluster, fed-dic achieves at least % lower read access latency compared to rs. even in the case that we request data queries that are stored in two different clusters, fed-dic still reads the data in a shorter time compared to rs. our second experiment evaluates the reliability between -way replication, rs ( , ) and fed-dic ( , ) in the data loss scenario. we performed runs of experiments. as fig. indicates, due to its organized multi-cluster storage policies, fed-dic manages to achieve lower data loss rates than rs. even when only up to % of the nodes are available in the federated cloud, fed-dic may be able to maintain a sufficient number of chunks in some diagonal stripes, which allows it to restore a portion the original data. the next experiment we evaluate the storage overhead and the maximum network transfer rate between these fault tolerance methods. as fig. a shows, replication stores the entire data content inside the cluster without splitting it, causing a large storage overhead even for single blocks, compared to a chunk produced by simple erasure coding and fed-dic. in fig. b we present the total storage overhead for all three methods. -way replication occupies a massive portion of the storage with all replicas combined, while all chunks generated by erasure coding and fed-dic produce lower overheads, with the latter occupying slightly less storage than erasure coding due to the varied sizes of the chunks. we also measured the rates during data transferring using performance monitoring programs included with lubuntu os. as seen in fig. due to the size of the replicas, replication severely burdens the network with a high transfer rate of . mbps, followed by erasure coding with a transfer rate of kbps. fed-dic operates with smaller data transfers and thus provides smaller and less burdening network data rates when transferring one or multiple queried data records. finally, fig. shows the load balancing achieved in the three fault tolerance methods between -node clusters, while uploading different data streams with similar sizes. due to the random distribution of replicas and chunks in the hdfs cloud, replication and client-side reed-solomon erasure codes are very inconsistent in terms of load balancing. 
specifically, with replication and client-side reed-solomon a majority of the data may be stored in one cluster while the other clusters store much less, even though erasure coding appears more consistent than replication. it is worth noting that we do not consider hdfs server-side erasure coding, since it requires a code with higher parameters, which generates a number of chunks equal to the number of nodes in a single cluster. meanwhile, fed-dic, using the rotational stack policy for cluster destinations described previously, can store new streams in the federated cloud's clusters in a different order for every stream. since our framework stores data of a different size in each cluster during every uploading process, it can maintain an almost perfect load balance between h clusters for every h uploaded streams (a minimal sketch of this rotation appears after the related work discussion below). for example, as fig. shows, for every streams uploaded to the cloud, fed-dic achieves storage consistency and good load balancing between the clusters.

several approaches over the last decade have been proposed for improving the read access costs and the reliability of erasure coding in cloud storage environments. in particular, a method that drastically improves read access costs and data reconstruction in erasure-coded storage systems is deterministic data distribution, or d for short [ ] . d maximizes the reliability of a storage system and reduces cross-rack repair traffic by utilizing deterministic distribution of data blocks across the storage system. d uses orthogonal arrays to define the data layout in which the data will be distributed across multiple racks, ignoring the one-block-per-rack placement, while balancing the load among nodes across the system's racks. this implementation works on single hdfs clusters with multiple racks, but it does not appear to support federated clouds or other systems with independent clusters, unlike our approach with fed-dic. even if we modified d to support multiple clusters, the clusters would need to contain a certain number of nodes in order to apply server-side erasure coding, whereas in fed-dic, erasure coding is performed by the client devices.

simple erasure codes provide efficient fault tolerance, but their reliability is restricted to the parameters set by the user. advanced erasure coding techniques like the alpha entanglement codes by estrada et al. [ ] increase the reliability and the integrity of a system compared to normal reed-solomon codes by entangling old and new data blocks and creating robust, flexible meshes of interdependent data with multiple redundancy paths. also, in the ring framework for key-value stores (kvs) [ ] , taranov et al. introduce stretched reed-solomon (srs) codes which support a single key-to-node mapping for multiple resilience levels. these lead to higher and broader reliability compared to conventional reed-solomon codes. however, that work is restricted to key-value stores and is not applicable to conventional databases. also, unlike our work, the reliability ranges of srs are limited to the parameters of specific key-to-node mappings. hybris [ ] by dobre et al. is a hybrid cloud storage system that scatters data across multiple unreliable or inconsistent public clouds, while it stores and replicates metadata information within trusted private nodes. the metadata are related to the data scattered across the public clouds, providing easier access and strong consistency for the data, as well as improved system performance and storage costs compared to existing multi-cloud storage systems.
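as promised above, here is a minimal sketch of the rotational placement idea. the helper names and the modular rotation are assumptions made for illustration, not code from the fed-dic implementation; the point is only that rotating the batch-to-cluster mapping by one position per uploaded stream evens out the volume stored per cluster over every h uploads.

```python
# rotational placement sketch: every new stream shifts the mapping between
# batches of diagonal stripes and destination clusters by one position.

from typing import Dict, List

def rotated_placement(batches: List[bytes], clusters: List[str],
                      stream_index: int) -> Dict[str, List[bytes]]:
    h = len(clusters)
    placement: Dict[str, List[bytes]] = {}
    for batch_pos, batch in enumerate(batches):
        # rotate the destination cluster by the index of the current stream
        cluster = clusters[(batch_pos + stream_index) % h]
        placement.setdefault(cluster, []).append(batch)
    return placement

# example: with 4 clusters, stream 0 sends batch 0 to the first cluster,
# stream 1 sends batch 0 to the second cluster, and so on, so that over
# any 4 consecutive uploads each cluster receives every batch position once.
```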
in our case, fed-dic similarly uses metadata containing information about the topology of the data stored in a federated cloud, so that the client can connect immediately to the cluster that contains a requested portion of the data, thus drastically reducing the read access cost in these systems compared to simple erasure codes.

in this paper, we presented fed-dic, our framework that integrates diagonal interleaved coding with organized storage of the encoded data in a federated cloud environment. our framework takes as input data organized in a grid and interleaves them into diagonal stripes that are encoded using a reed-solomon erasure code. the encoded diagonal stripes are grouped into batches which are stored in different clusters in the cloud. the user issues queries to retrieve portions of the data without the need for the clients to access every cluster in the cloud, thus reducing the access cost compared to other methods like replication and simple erasure codes. our experimental evaluations illustrate the benefits of our framework compared to other fault tolerance methods in terms of total read access latency, data loss percentage, maximum network transfer rate, storage overhead and load balancing. for future work, one direction we are following is to deploy fed-dic in a federated environment with different hardware equipment, where we plan to evaluate the operation and benefits as well as the corresponding costs of our approach when different types of equipment are utilized.

references:
- dynamo: amazon's highly available key-value store
- the google file system
- formalizing google file system
- the hadoop distributed file system. in: msst
- a review on hadoop-hdfs infrastructure extensions
- erasure coding vs. replication: a quantitative comparison
- d : deterministic data distribution for efficient data reconstruction in erasure-coded distributed storage systems
- hadoop block placement policy for different file formats
- xoring elephants: novel erasure codes for big data
- f : facebook's warm blob storage system
- hybris: robust hybrid cloud storage
- ceph: a scalable, high-performance distributed file system
- introducing windows azure
- windows azure storage: a highly available cloud storage service with strong consistency
- amazon web services
- the performance analysis of cache architecture based on alluxio over virtualized infrastructure
- the theory of error-correcting codes
- on coding for real-time streaming under packet erasures
- alpha entanglement codes: practical erasure codes to archive data in unreliable environments
- scats
- fast and strongly-consistent per-item resilience in key-value stores

communications laboratory at aueb. the authors would like to thank dr. davide frey for shepherding the paper. this research has been supported by european union's horizon grant agreement no .

key: cord- - u kn ge authors: huberty, mark title: awaiting the second big data revolution: from digital noise to value creation date: - - journal: nan doi: . /s - - - sha: doc_id: cord_uid: u kn ge

"big data", the collection of vast quantities of data about individual behavior via online, mobile, and other data-driven services, has been heralded as the agent of a third industrial revolution, one with raw materials measured in bits rather than tons of steel or barrels of oil. yet the industrial revolution transformed not just how firms made things, but the fundamental approach to value creation in industrial economies. to date, big data has not achieved this distinction.
instead, today's successful big data business models largely use data to scale old modes of value creation rather than invent new ones altogether. moreover, today's big data cannot deliver the promised revolution. in this way, today's big data landscape resembles the early phases of the first industrial revolution rather than the culmination of the second a century later. realizing the second big data revolution will require fundamentally different kinds of data, different innovations, and different business models than those seen to date. that fact has profound consequences for the kinds of investments and innovations firms must seek, and for the economic, political, and social consequences that those innovations portend.

yet this "big data" revolution has so far fallen short of its promise. precious few firms transmute data into novel products. instead, most rely on data to operate, at unprecedented scale, business models with a long pedigree in the media and retail sectors. big data, despite protests to the contrary, is thus an incremental change, and its revolution one of degree, not kind. the reasons for these shortcomings point to the challenges we face in realizing the promise of the big data revolution. today's advances in search, e-commerce, and social media relied on the creative application of marginal improvements in computational processing power and data storage. in contrast, tomorrow's hopes for transforming real-world outcomes in areas like health care, education, energy, and other complex phenomena pose scientific and engineering challenges of an entirely different scale.

our present enthusiasm for big data stems from the confusion of data and knowledge. firms today can gather more data, at lower cost, about a wider variety of subjects than ever before. big data's advocates claim that this data will become the raw material of a new industrial revolution. as with its th century predecessor, this revolution will alter how we govern, work, play, and live. but unlike the th century, we are told, the raw materials driving this revolution are so cheap and abundant that the horizon is bounded only by the supply of smart people capable of molding these materials into the next generation of innovations (manyika et al ) . this utopia of data is badly flawed. those who promote it rely on a series of dubious assumptions about the origins and uses of data, none of which hold up to serious scrutiny. in aggregate, these assumptions fail to address whether the data we have actually provide the raw materials needed for the data-driven industrial revolution we need. taken together, these failures point out the limits of a revolution built on raw materials that today seem so abundant.

four of these assumptions merit special attention: first, n = all, or the claim that our data allow a clear and unbiased study of humanity; second, that today = tomorrow, or the claim that understanding online behavior today implies that we will still understand it tomorrow; third, offline = online, the claim that understanding online behavior offers a window into economic and social phenomena in the physical world; and fourth, that complex patterns of social behavior, once understood, will remain stable enough to become the basis of new data-driven, predictive products and services in sectors well beyond social and media markets. each of these has its issues.
taken together, those issues limit the future of a revolution that relies, as today's does, on the bdigital exhaust^of social networks, e-commerce, and other online services. the true revolution must lie elsewhere. gathering data via traditional methods has always been difficult. small samples were unreliable; large samples were expensive; samples might not be representative, despite researchers' best efforts; tracking the same sample over many years required organizations and budgets that few organizations outside governments could justify. none of this, moreover, was very scalable: researchers needed a new sample for every question, or had to divine in advance a battery of questions and hope that this proved adequate. no wonder social research proceeded so slowly. mayer-schönberger and cukier ( ) argue that big data will eliminate these problems. instead of having to rely on samples, online data, they claim, allows us to measure the universe of online behavior, where n (the number of people in the sample) is basically all (the entire population of people we care about). hence we no longer need worry, they claim, about the problems that have plagued researchers in the past. when n = all, large samples are cheap and representative, new data on individuals arrives constantly, monitoring data over time poses no added difficulty, and cheap storage permits us to ask new questions of the same data again and again. with this new database of what people are saying or buying, where they go and when, how their social networks change and evolve, and myriad other factors, the prior restrictions borne of the cost and complexity of sampling will melt away. but n ≠ all. most of the data that dazzles those infatuated by bbig data^-mayer-schönberger and cukier included-comes from what mckinsey & company termed bdigital exhaust^ (manyika et al. ) : the web server logs, e-commerce purchasing histories, social media relations, and other data thrown off by systems in the course of serving web pages, online shopping, or person-to-person communication. the n covered by that data concerns only those who use these services-not society at large. in practice, this distinction turns out to matter quite a lot. the demographics of any given online service usually differ dramatically from the population at large, whether we measure by age, gender, race, education, and myriad other factors. hence the uses of that data are limited. it's very relevant for understanding web search behavior, purchasing, or how people behave on social media. but the n here is skewed in ways both known and unknown-perhaps younger than average, or more tech-savvy, or wealthier than the general population. the fact that we have enormous quantities of data about these people may not prove very useful to understanding society writ large. but let's say that we truly believe this assumption-that everyone is (or soon will be) online. surely the proliferation of smart phones and other devices is bringing that world closer, at least in the developed world. this brings up the second assumption-that we know where to go find all these people. several years ago, myspace was the leading social media website, a treasure trove of new data on social relations. today, it's the punchline to a joke. the rate of change in online commerce, social media, search, and other services undermines any claim that we can actually know that our n = all sample that works today will work tomorrow. 
instead, we only know about new developments-and the data and populations they cover-well after they have already become big. hence our n = all sample is persistently biased in favor of the old. moreover, we have no way of systematically checking how biased the sample is, without resorting to traditional survey methods and polling-the very methods that big data is supposed to render obsolete. but let's again assume that problem away. let's assume that we have all the data, about all the people, for all the online behavior, gathered from the digital exhaust of all the relevant products and services out there. perhaps, in this context, we can make progress understanding human behavior online. but that is not the revolution that big data has promised. most of the bbig data^hype has ambitions beyond improving web search, online shopping, socializing, or other online activity. instead, big data should help cure disease, detect epidemics, monitor physical infrastructure, and aid first responders in emergencies. to satisfy these goals, we need a new assumption: that what people do online mirrors what they do offline. otherwise, all the digital exhaust in the world won't describe the actual problems we care about. there's little reason to think that offline life faithfully mirrors online behavior. research has consistently shown that individuals' online identities vary widely from their offline selves. in some cases, that means people are more cautious about revealing their true selves. danah boyd's work (boyd and marwick ) has shown that teenagers cultivate online identities very different from their offline selves-whether for creative, privacy, or other reasons. in others, it may mean that people are more vitriolic, or take more extreme positions. online political discussions-another favorite subject of big data enthusiasts-suffer from levels of vitriol and partisanship far beyond anything seen offline (conover et al. ) . of course, online and offline identity aren't entirely separate. that would invite suggestions of schizophrenia among internet users. but the problem remains-we don't know what part of a person is faithfully represented online, and what part is not. furthermore, even where online behavior may echo offline preferences or beliefs, that echo is often very weak. in statistical terms, our ability to distinguish bsignificant^from binsignificant^results improves with the sample size-but statistical significance is not actual significance. knowing, say, that a history of purchasing some basket of products is associated with an increased risk of being a criminal may be helpful. but if that association is weak-say a one-hundredth of a percent increase-it's practical import is effectively zero. big data may permit us to find these associations, but it does not promise that they will be useful. ok, but you say, surely we can determine how these distortions work, and incorporate them into our models? after all, doesn't statistics have a long history of trying to gain insight from messy, biased, or otherwise incomplete data? perhaps we could build such a map, one that allows us to connect the observed behaviors of a skewed and selective online population to offline developments writ large. this suffices only if we care primarily about describing the past. 
but much of the promise of big data comes from predicting the future-where and when people will get sick in an epidemic, which bridges might need the most attention next month, whether today's disgruntled high school student will become tomorrow's mass shooter. satisfying these predictive goals requires yet another assumption. it is not enough to have all the data, about all the people, and a map that connects that data to real-world behaviors and outcomes. we also have to assume that the map we have today will still describe the world we want to predict tomorrow. two obvious and unknowable sources of change stand in our way. first, people change. online behavior is a culmination of culture, language, social norms and other factors that shape both people and how they express their identity. these factors are in constant flux. the controversies and issues of yesterday are not those of tomorrow; the language we used to discuss anger, love, hatred, or envy change. the pathologies that afflict humanity may endure, but the ways we express them do not. second, technological systems change. the data we observe in the bdigital exhaust^of the internet is created by individuals acting in the context of systems with rules of their own. those rules are set, intentionally or not, by the designers and programmers that decide what we can and cannot do with them. and those rules are in constant flux. what we can and cannot buy, who we can and cannot contact on facebook, what photos we can or cannot see on flickr vary, often unpredictably. facebook alone is rumored to run up to a thousand different variants on its site at one time. hence even if culture never changed, our map from online to offline behavior would still decay as the rules of online systems continued to evolve. an anonymous reviewer pointed out, correctly, that social researchers have always faced this problem. this is certainly true but many of the features of social systems-political and cultural institutions, demography, and other factors-change on a much longer timeframe than today's data-driven internet services. for instance, us congressional elections operate very differently now compared with a century ago; but change little between any two elections. contrast that with the pace of change for major social media services, for which years may be a lifetime. a recent controversy illustrates this problem to a t. facebook recently published a study (kramer et al. ) in which they selectively manipulated the news feeds of a randomized sample of users, to determine whether they could manipulate users' emotional states. the revelation of this study prompted fury on the part of users, who found this sort of manipulation unpalatable. whether they should, of course, given that facebook routinely runs experiments on its site to determine how best to satisfy (i.e., make happier) its users, is an interesting question. but the broader point remains-someone watching the emotional state of facebook users might have concluded that overall happiness was on the rise, perhaps consequence of the improving american economy. but in fact this increase was entirely spurious, driven by facebook's successful experiment at manipulating its users. compounding this problem, we cannot know, in advance, which of the social and technological changes we do know about will matter to our map. that only becomes apparent in the aftermath, as real-world outcomes diverge from predictions cast using the exhaust of online systems. 
lest this come off as statistical nihilism, consider the differences in two papers that both purport to use big data to project the outcome of us elections. digrazia et al. ( ) claim that merely counting the tweets that reference a congressional candidate-with no adjustments for demography, or spam, or even name confusion-can forecast whether that candidate will win his or her election. this is a purely bdigital exhaust^approach. they speculate-but cannot know-whether this approach works because (to paraphrase their words) bone tweet equals one vote^, or ball attention on twitter is better^. moreover, it turns out that the predictive performance of this simple model provides no utility. as huberty ( ) shows, their estimates perform no better than an approach that simply guesses that the incumbent party would win-a simple and powerful predictor of success in american elections. big data provided little value. contrast this with wang et al. ( ) . they use the xbox gaming platform as a polling instrument, which they hope might help compensate for the rising non-response rates that have plagued traditional telephone polls. as with twitter, n ≠ all: the xbox user community is younger, more male, less politically involved. but the paper nevertheless succeeds in generating accurate estimates of general electoral sentiment. the key difference lies in their use of demographic data to re-weight respondents' electoral sentiments to look like the electorate at large. the xbox data were no less skewed than twitter data; but the process of data collection provided the means to compensate. the black box of twitter's digital exhaust, lacking this data, did not. the difference? digrazia et al. ( ) sought to reuse data created for one purpose in order to do something entirely different; wang et al. ( ) set out to gather data explicitly tailored to their purpose alone. . the implausibility of big data . taken together, the assumptions that we have to make to fulfill the promise of today's big data hype appear wildly implausible. to recap, we must assume that: . everyone we care about is online; . we know where to find them today, and tomorrow; . they represent themselves online consistent with how they behave offline, and; . they will continue to represent themselves online-in behavior, language, and other factors-in the same way, for long periods of time. nothing in the history of the internet suggests that even one of these statements holds true. everyone was not online in the past; and likely will not be online in the future. the constant, often wrenching changes in the speed, diversity, and capacity of online services means those who are online move around constantly. they do not, as we've seen, behave in ways necessarily consistent with their offline selves. and the choices they make about how to behave online evolve in unpredictable ways, shaped by a complex and usually opaque amalgam of social norms and algorithmic influences. but if each of these statements fall down, then how have companies like amazon, facebook, or google built such successful business models? the answer lies in two parts. first, most of what these companies do is self-referential: they use data about how people search, shop, or socialize online to improve and expand services targeted at searching, shopping, or socializing. google, by definition, has an n = all sample of google users' online search behavior. amazon knows the shopping behaviors of amazon users. 
of course, these populations are subject to change their behaviors, their self-representation, or their expectations at any point. but at least google or amazon can plausibly claim to have a valid sample of the primary populations they care about. second, the consequences of failure are, on the margins, very low. google relies heavily on predictive models of user behavior to sell the advertising that accounts for most of its revenue. but the consequences of errors in that model are low-google suffers little from serving the wrong ad on the margins. of course, persistent and critical errors of understanding will undermine products and lead to lost customers. but there's usually plenty of time to correct course before that happens. so long as google does better than its competitors at targeting advertising, it will continue to win the competitive fight for advertising dollars. but if we move even a little beyond these low-risk, self-referential systems, the usefulness of the data that underpin them quickly erodes. google flu provides a valuable lesson in this regard. in , google announced a new collaboration with the centers for disease control (cdc) to track and report rates of influenza infection. historically, the cdc had monitored us flu infection patterns through a network of doctors that tracked and reported binfluenza-like illness^in their clinics and hospitals. but doctors' reports took up to weeks to reach the cdc-a long time in a world confronting sars or avian flu. developing countries with weaker public health capabilities faced even greater challenges. google hypothesized that, when individuals or their family members got the flu, they went looking on the internet-via google, of course-for medical advice. in a highly cited paper, ginsberg et al. ( ) showed that they could predict region-specific influenza infection rates in the united states using google search frequency data. here was the true promise of big data-that we capitalize on virtual data to better understand, and react to, the physical world around us. the subsequent history of google flu illustrates the shortcomings of the first big data revolution. while google flu has performed well in many seasons, it has failed twice, both times in the kind of abnormal flu season during which accurate data are most valuable. the patterns of and reasons for failure speak to the limits of prediction. in , google flu underpredicted flu rates during the h n pandemic. post-hoc analysis suggested that the different viral characteristics of h n compared with garden-variety strains of influenza likely meant that individuals didn't know they had a flu strain, and thus didn't go looking for flu-related information (cook et al. ) . conversely, in , google flu over-predicted influenza infections. google has yet to discuss why, but speculation has centered on the intensive media coverage of an early-onset flu season, which may have sparked interest in the flu among healthy individuals (butler ). the problems experienced by google flu provide a particularly acute warning of the risks inherent in trying to predict what will happen in the real world based on the exhaust of the digital one. google flu relied on a map-a mathematical relationship between online behavior and real-world infection. google built that map on historic patterns of flu infection and search behavior. it assumed that such patterns would continue to hold in the future. but there was nothing fundamental about those patterns. 
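to make the flu-tracking idea above concrete, the sketch below shows the general shape of a search-frequency regression: fit logit-transformed influenza-like-illness rates against logit-transformed flu-query fractions, then invert the transform to predict. this is a simplified illustration of the approach, not the published google flu trends model; the function names and the single-predictor least-squares fit are my own assumptions.

```python
# sketch of a search-frequency model for influenza-like illness (ili) rates

import math
from typing import List, Tuple

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_ols(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    # ordinary least squares with a single predictor: y = alpha + beta * x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

def train(history: List[Tuple[float, float]]) -> Tuple[float, float]:
    # history holds (flu-query fraction, observed ili rate) pairs per week
    xs = [logit(q) for q, _ in history]
    ys = [logit(r) for _, r in history]
    return fit_ols(xs, ys)

def predict_ili(query_fraction: float, alpha: float, beta: float) -> float:
    return inv_logit(alpha + beta * logit(query_fraction))
```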
either a change in the physical world (a new virus) or the virtual one (media coverage) were enough to render the map inaccurate. the cdc's old reporting networks out-performed big data when it mattered most. despite ostensibly free raw materials, mass-manufacturing insight from digital exhaust has thus proven far more difficult than big data's advocates would let on. it's thus unsurprising that this revolution has had similarly underwhelming effects on business models. amazon, facebook, and google are enormously successful businesses, underpinned by technologies operating at unprecedented scale. but they still rely on centuries-old business models for most of their revenue. google and amazon differ in degree, but not kind, from a newspaper or a large department store when it comes to making money. this is a weak showing from a revolution that was supposed to change the st century in the way that steam, steel, or rail changed the th. big data has so far made it easier to sell things, target ads, or stalk long-lost friends or lovers. but it hasn't yet fundamentally reworked patterns of economic life, generated entirely new occupations, or radically altered relationships with the physical world. instead, it remains oddly self-referential: we generate massive amounts of data in the process of online buying, viewing, or socializing; but find that data truly useful only for improving online sales and search. understanding how we might get from here to there requires a better understanding of how and why data-big or small-might create value in a world of better algorithms and cheap compute capacity. close examination shows that firms have largely used big data to improve on existing business models, rather than adopt new ones; and that those improvements have relied on data to describe and predict activity in worlds largely of their own making. where firms have ventured beyond these self-constructed virtual worlds, the data have proven far less useful, and products built atop data far more prone to failure. the google flu example suggests the limits to big data as a source of mass-manufactured insight about the real world. but google itself, and its fellow big-data success stories, also illustrate the shortcomings of big data as a source of fundamentally new forms of value creation. most headline big data business models have used their enhanced capacity to describe, predict, or infer in order to implement-albeit at impressive scale and complexity-centuries-old business models. those models create value not from the direct exchange between consumer and producer, but via a web of transactions several orders removed from the creation of the data itself. categorizing today's big data business models based on just how far they separate data generation from value creation quickly illustrates how isolated the monetary value of firms' data is from their primary customers. having promised a first-order world, big data has delivered a third-order reality. realizing the promise of the big data revolution will require a different approach. the same problems that greeted flu prediction have plagued other attempts to build big data applications that forecast the real world. engineering solutions to these problems that draw on the potential of cheap computation and powerful algorithms will require not different methods, but different raw materials. the data those materials require must originate from a first-order approach to studying and understanding the worlds we want to improve. 
such approaches will require very different models of firm organization than those exploited by google and its competitors in the first big data revolution. most headline big data business models do not make much money directly from their customers. instead, they rely on third parties-mostly advertisers-to generate profits from data. the actual creation and processing of data is only useful insofar as it's of use to those third parties. in doing so, these models have merely implemented, at impressive scale and complexity, the very old business model used by the newspapers they have largely replaced. if we reach back into the dim past when newspapers were viable businesses (rather than hobbies of the civic-minded wealthy), we will remember that their business model had three major components: . gather, filter, and analyze news; . attract readers by providing that news at far below cost, and; . profit by selling access to those readers to advertisers. the market for access matured along with the newspapers that provided it. both newspapers and advertisers realized that people who read the business pages differed from those who read the front page, or the style section. front-page ads were more visible to readers than those buried on page a . newspapers soon started pricing access to their readers accordingly. bankers paid one price to advertise in the business section, clothing designers another for the style pages. this segmentation of the ad market evolved as the ad buyers and sellers learned more about whose eyeballs were worth how much, when, and where. newspapers were thus third-order models. the news services they provided were valuable in their own right. but readers didn't pay for them. instead, news was a means of generating attention and data, which was only valuable when sold to third parties in the form of ad space. data didn't directly contribute to improving the headline product-news-except insofar as it generated revenue that could be plowed back into news gathering. the existence of a tabloid press of dubious quality but healthy revenues proved the weakness of the link between good journalism and profit. from a value creation perspective, google, yahoo, and other ad-driven big data businesses are nothing more than newspapers at scale. they too provide useful services (then news, now email or search) to users at rates far below cost. they too profit by selling access to those users to third-party advertisers. they too accumulate and use data to carve up the ad market. the scale of data they have available, of course, dwarfs that of their newsprint ancestors. this data, combined with cheap computation and powerful statistics, has enabled operational efficiency, scale, and effectiveness far beyond what newspapers could ever have managed. but the business model itself-the actual means by which these firms earn revenues-is identical. finally, that value model does not emerge, fully-formed, from the data itself. the data alone are no more valuable than the unrefined iron ore or crude oil of past industrial revolutions. rather, the data were mere inputs to a production process that depended on human insightthat what people looked for on the internet might be a good proxy for their consumer interests. big-box retail ranks as the other substantial success for big data. 
large retailers like amazon, wal-mart, or target have harvested very fine-grained data about customer preferences to make increasingly accurate predictions of what individual customers wish to buy, in what quantities and combinations, at what times of the year, at what price. these predictions are occasionally shocking in their accuracy-as with target's implicit identification of a pregnant teenager well before her father knew it himself, based solely on subtle changes in her purchasing habits. from this data, these retailers can, and have, built a detailed understanding of retail markets: what products are complements or substitutes for each other; exactly how much more people are willing to pay for brand names versus generics; how size, packaging, and placement in stores and on shelves matters to sales volumes. insights built on such data have prompted two significant changes in retail markets. first, they have made large retailers highly effective at optimizing supply chains, identifying retail trends in their infancy, and managing logistical difficulties to minimize the impact on sales and lost competitiveness. this has multiplied their effectiveness versus smaller retailers, who lack such capabilities and are correspondingly less able to compete on price. but it has also changed, fundamentally, the relationship of these retailers to their suppliers. big box retailers have increasingly become monopsony buyers of some goods-books for amazon, music for itunes. but they are also now monopoly sellers of information back to their suppliers. amazon, target and wal-mart have a much better understanding of their suppliers' customers than the customers themselves. they also understand these suppliers' competitors far better. hence their aggregation of information has given them substantial power over suppliers. this has had profound consequences for the suppliers. wal-mart famously squeezes suppliers on cost-either across the board, or by pitting suppliers against one another based on detailed information of their comparative cost efficiencies and customer demand. hence big data has shifted the power structure of the retail sector and its manufacturing supply chains. the scope and scale of the data owned by amazon or wal-mart about who purchases what, when, and in what combinations often means that they understand the market for a product far better than the manufacturer. big data, in this case, comes from big business-a firm that markets to the world also owns data about the world's wants, needs, and peculiarities. even as they are monopsony buyers of many goods (think e-books for amazon), they are correspondingly monopoly sellers of data. and that has made them into huge market powers on two dimensions, enabling them to squeeze suppliers to the absolute minimum price, packaging, size, and other product features that are most advantageous to them-and perhaps to their customers. but big data has not changed the fundamental means of value creation in the retail sector. whatever its distributional consequences, the basic retail transaction-of individuals buying goods from retail intermediaries, remains unchanged from earlier eras. the same economies of scale and opportunities for cross-marketing that made montgomery ward a retail powerhouse in the th century act on amazon and wal-mart in the st. big data may have exacerbated trends already present in the retail sector; but the basics of how that sector creates value for customers and generates profits for investors are by no means new. 
retailers have yet to build truly new products or services that rely on data itself-instead, that data is an input into a longstanding process of optimization of supply chain relations, marketing, and product placement in service of a very old value model: the final close of sale between a customer and the retailer. second-and third-order models find value in data several steps removed from the actual transaction that generates the data. however, as the google flu example illustrated, that data may have far less value when separated from its virtual context. thus while these businesses enjoy effectively free raw materials, the potential uses of those materials are in fact quite limited. digital exhaust from web browsing, shopping, or socializing has proven enormously useful in the self-referential task of improving future web browsing, shopping, and socializing. but that success has not translated success at tasks far removed from the virtual world that generated this exhaust. digital exhaust may be plentiful and convenient to collect, but it offers limited support for understanding or responding to real-world problems. first-order models, in contrast, escape the flu trap by building atop purpose-specific data, conceived and collected with the intent of solving specific problems. in doing so, they capitalize on the cheap storage, powerful algorithms, and inexpensive computing power that made the first wave of big data firms possible. but they do so in pursuit of a rather different class of problems. first order products remain in their infancy. but some nascent examples suggest what might be possible. ibm's watson famously used its natural language and pattern recognition abilities to win the jeopardy! game show. doing so constituted a major technical feat: the ability to understand unstructured, potentially obfuscated jeopardy! game show answers, and respond with properly-structured questions based on information gleaned from vast databases of unstructured information on history, popular culture, art, science, or almost any other domain. the question now is whether ibm can adapt this technology to other problems. its first attempts at improving medical diagnosis appear promising. by learning from disease and health data gathered from millions of patients, initial tests suggest that watson can improve the quality, accuracy, and efficacy of medical diagnosis and service to future patients (steadman ) . watson closes the data value loop: patient data is made valuable because it improves patient services, not because it helps with insurance underwriting or product manufacturing or logistics or some other third-party activity. premise corporation provides another example. premise has built a mobile-phone based data gathering network to measure macroeconomic aggregates like inflation and food scarcity. this network allows them to monitor economic change at a very detailed level, in regions of the world where official statistics are unavailable or unreliable. this sensor network is the foundation of the products and services that premise sells to financial services firms, development agencies, and other clients. as compared with the attenuated link between data and value in second-or third-order businesses, premise's business model links the design of the data generation process directly to the value of its final products. optimum energy (oe) provides a final example. 
oe monitors and aggregates data on building energy use-principally data centers-across building types, environments, and locations. that data enables it to build models for building energy use and efficiency optimization. those models, by learning building behaviors across many different kinds of inputs and buildings, can perform better than single-building models with limited scope. most importantly, oe creates value for clients by using this data to optimize energy efficiency and reduce energy costs. these first-order business models all rely on data specifically obtained for their products. this reliance on purpose-specific data contrasts with third-order models that rely on the bdigital exhaust^of conventional big data wisdom. to use the newspaper example, thirdorder models assume-but can't specifically verify-that those who read the style section are interested in purchasing new fashions. google's success stemmed from closing this information gap a bit-showing that people who viewed web pages on fashion were likely to click on fashion ads. but again, the data that supports this is data generated by processes unrelated to actual purchasing-activities like web surfing and search or email exchange. and so the gap remains. google appears to realize this, and has launched consumer surveys as an attempt to bridge that gap. in brief, it offers people the chance to skip ads in favor of providing brand feedback. we should remember the root of the claim about big data. that claim was perhaps best summarized by halevy et al. ( ) in what they termed bthe unreasonable effectiveness of data^-that, when seeking to improve the performance of predictive systems, more data appeared to yield better returns on effort than better algorithms. most appear to have taken that to mean that data-and particularly more data-are unreasonably effective everywhereand that, by extension, even noisy or skewed data could suffice to answer hard questions if we could simply get enough of it. but that misstates the authors' claims. they did not claim that more data was always better. rather, they argued that, for specific kinds of applications, history suggested that gathering more data paid better dividends than inventing better algorithms. where data are sparse or the phenomenon under measurement noisy, more data allow a more complete picture of what we are interested in. machine translation provides a very pertinent example: human speech and writing varies enormously within one language, let alone two. faced with the choice between better algorithms for understanding human language, and more data to quantify the variance in language, more data appears to work better. but for other applications, the bbigness^of data may not matter at all. if i want to know who will win an election, polling a thousand people might be enough. relying on the aggregated voices of a nation's twitter users, in contrast, will probably fail (gayo-avello et al. ; gayo-avello ; huberty ) . not only are we not, as section discussed, in the n = all world that infatuated mayer-schönberger and cukier ( ); but for most problems we likely don't care to be. having the right data-and consequently identifying the right question to ask beforehand-is far more important than having a lot of data of limited relevance to the answers we seek. big data therefore falls short of the proclamation that it represents the biggest change in technological and economic possibility since the industrial revolution. 
that revolution, in the span of a century or so, fundamentally transformed almost every facet of human life. someone born in , who lived to be years old, grew up in a world of horses for travel, candles for light, salting and canning for food preservation, and telegraphs for communication. the world of their passing had cars and airplanes, electric light and refrigerators, and telephones, radio, and motion pictures. having ranked big data with the industrial revolution, we find ourselves wondering why our present progress seems so paltry in comparison. but much of what we associate with the industrial revolution-the advances in automobile transport, chemistry, communication, and medicine-came much later. the businesses that produced them were fundamentally different from the small collections of tinkerers and craftsmen that built the first power looms. instead, these firms invested in huge industrial research and development operations to discover and then commercialize new scientific discoveries. these changes were expensive, complicated, and slow-so slow that john stuart mill despaired, as late as , of human progress. but in time, they produced a world inconceivable to even the industrial enthusiasts of the s. in today's revolution, we have our looms, but we envision the possibility of a model t. today, we can see glimmers of that possibility in ibm's watson, google's self-driving car, or nest's thermostats that learn the climate preferences of a home's occupants. these and other technologies are deeply embedded in, and reliant on, data generated from and around realworld phenomena. none rely on bdigital exhaust^. they do not create value by parsing customer data or optimizing ad click-through rates (though presumably they could). they are not the product of a relatively few, straightforward (if ultimately quite useful) insights. instead, ibm, google, and nest have dedicated substantial resources to studying natural language processing, large-scale machine learning, knowledge extraction, and other problems. the resulting products represent an industrial synthesis of a series of complex innovations, linking machine intelligence, real-time sensing, and industrial design. these products are thus much closer to what big data's proponents have promised-but their methods are a world away from the easy hype about mass-manufactured insights from the free raw material of digital exhaust. we're stuck in the first industrial revolution. we have the power looms and the water mills, but wonder, given all the hype, at the absence of the model ts and telephones of our dreams. the answer is a hard one. the big gains from big data will require a transformation of organizational, technological, and economic operations on par with that of the second industrial revolution. then, as now, firms had to invest heavily in industrial research and development to build the foundations of entirely new forms of value creation. those foundations permitted entirely new business models, in contrast to the marginal changes of the first industrial revolution. and the raw materials of the first revolution proved only tangentially useful to the innovations of the second. these differences portend a revolution of greater consequence and complexity. firms will likely be larger. innovation will rely less on small entrepreneurs, who lack the funds and scale for systems-level innovation. where entrepreneurs do remain, they will play far more niche roles. 
as rao ( ) has argued, startups will increasingly become outsourced r&d, whose innovations are acquired to become features of existing products rather than standalone products themselves. the success of systems-level innovation will threaten a range of current jobs, white collar and service sector as well as blue collar and manufacturing, as expanding algorithmic capacity widens the scope of digitizable tasks. but unlike past revolutions, that expanding capacity also begs the question of where this revolution will find new forms of employment insulated from these technological forces; and if it does not, how we manage the social instability that will surely follow. with luck, we will resist the temptation to use those same algorithmic tools for social control. but human history on that point is not encouraging. regardless, we should resist the temptation to assume that a world of ubiquitous data means a world of cheap, abundant, and relevant raw materials for a new epoch of economic prosperity. the most abundant of those materials today turn out to have limited uses outside the narrow products and services that generate them. overcoming that hurdle requires more than just smarter statisticians, better algorithms, or faster computation. instead, it will require new business models capable of nurturing both new sources of data and new technologies into truly new products and services.

references:
- social privacy in networked publics: teens' attitudes, practices, and strategies
- when google got flu wrong
- assessing google flu trends performance in the united states during the influenza virus a (h n ) pandemic
- more tweets, more votes: social media as a quantitative indicator of political behavior
- d ( ) i wanted to predict elections with twitter and all i got was this lousy paper: a balanced survey on election prediction using twitter data
- limits of electoral predictions using twitter
- detecting influenza epidemics using search engine query data
- the unreasonable effectiveness of data
- multi-cycle forecasting of congressional elections with social media
- experimental evidence of massive-scale emotional contagion through social networks
- big data: the next frontier for innovation, competition, and productivity. mckinsey global institute report
- mayer-schönberger v, cukier k ( ) big data: a revolution that will transform how we live, work, and think
- entrepreneurs are the new labor
- ibm's watson is better at diagnosing cancer than human doctors
- forecasting elections with non-representative polls

acknowledgments: this research is a part of the ongoing collaboration of brie, the berkeley roundtable on the international economy at the university of california at berkeley, and etla, the research institute of the finnish economy. this paper has benefited from extended discussions with cathryn carson, drew conway, chris diehl, stu feldman, david gutelius, jonathan murray, joseph reisinger, sean taylor, georg zachmann, and john zysman. all errors committed, and opinions expressed, remain solely my own.

open access: this article is distributed under the terms of the creative commons attribution license which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
key: cord- -jtyoojte authors: buzzell, andrew title: public goods from private data -- an efficacy and justification paradox for digital contact tracing date: - - journal: nan doi: nan sha: doc_id: cord_uid: jtyoojte

debate about the adoption of digital contact tracing (dct) apps to control the spread of covid- has focussed on risks to individual privacy (sharma&bashir , tang ). this emphasis reveals significant challenges to the ethical deployment of dct, but it also generates constraints which undermine the justification to implement dct. it would be a mistake to view this result solely as the successful operation of ethical foresight analysis (floridi&strait ), preventing deployment of potentially harmful technology. privacy-centric analysis treats data as private property, frames the relationship between individuals and governments as adversarial, entrenches technology platforms as gatekeepers, and supports a conception of emergency public health authority as limited by individual consent and considerable corporate influence, a conception that is in some tension with the more communitarian values that typically inform public health ethics. to overcome the barriers to ethical and effective dct, and to develop infrastructure and policy that support the realization of the potential public benefits of digital technology, a public resource conception of aggregate data should be developed.

contact tracers analyze the movements and behaviour of an individual diagnosed with an infectious disease to identify possible incidences of transmission. the virology of covid- creates two kinds of scaling challenges that make manual contact tracing unfeasible. the mode of transmission is respiratory droplet spread, with some evidence of transmission via indirect surface contact, and the potential for aerosolized transmission in limited circumstances (van doremalen et al ) . with a reproductive rate sufficient for exponential case growth, this creates a horizontal problem of resource scale: in the us alone it is estimated that over , full-time contact tracers would be needed (watson et al ) . the long period of infectivity, and in particular the period of asymptomatic transmission, creates a vertical scaling problem, where the amount of data required to conduct tracing for each individual is quite large, encompassing contacts with people and surfaces over a day period. dct apps could mitigate the vertical problem by assisting recall, recording high fidelity data for each individual that can be retroactively queried to identify potential transmission, and the horizontal problem by automating much of the contact tracing process (ferretti et al ) . even without a vaccine, effective dct could allow public authorities to relax some of the severe restrictions that have been imposed, an important counterfactual when considering the justifiability of dct programs (mello & wang ) .

most dct proposals use bluetooth low energy (ble) radio networking technology present in smartphones, recording received signal strength indicator (rssi) measurements to determine when devices are close together, and for how long. a database of device pairings and rssi information is maintained on the device or on a centralized server, and when one device is flagged as belonging to an infected individual, an algorithm can select from the database the identifiers recorded while the individual may have been infectious, filter them by duration and signal strength, and produce a list of device ids that might be targeted for intervention of some type, such as testing or self-isolation.
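to make the selection-and-filtering step just described concrete, a minimal sketch follows. the class and function names, the infectious-window parameter, and the duration and rssi thresholds are illustrative assumptions rather than part of any deployed dct protocol.

```python
# sketch of exposure matching: given the encounter database associated with
# a device flagged as belonging to an infected individual, select the
# identifiers recorded during the infectious window and filter them by
# duration and signal strength.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Set

@dataclass
class Encounter:
    other_device_id: str
    start: datetime
    duration: timedelta
    peak_rssi_dbm: float   # strongest (least negative) signal seen, a rough proximity proxy

def at_risk_devices(encounters: List[Encounter],
                    infectious_from: datetime,
                    min_duration: timedelta = timedelta(minutes=15),
                    rssi_threshold_dbm: float = -65.0) -> Set[str]:
    flagged = set()
    for e in encounters:
        if e.start < infectious_from:
            continue              # recorded before the infectious window
        if e.duration < min_duration:
            continue              # too brief to count as a contact
        if e.peak_rssi_dbm < rssi_threshold_dbm:
            continue              # signal too weak, devices likely too far apart
        flagged.add(e.other_device_id)
    return flagged
```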
as a sociotechnical system, dct re-taxonomizes rssi data as predictions of disease transmission risk and mandates actions, backed by public health authority. justification for the ensuing actions depends in part on the reliability of the prediction. dct faces serious efficacy challenges with both prediction and coverage, summarized in the supplementary material. when the non-causal proxies for transmission are too weakly correlated with actual transmission risks, or the individual or population coverage is insufficient or uneven, dct cannot perform the function of identifying infection risks effectively. while predictive problems might be mitigated by improving technology and aggregating additional data, coverage problems threaten the viability of dct directly, and are least amenable to post-hoc correction. they require that populations be persuaded to use the dct app, and that hardware and software vendors cooperate with public health authorities to resolve barriers to adoption and usage, such as the need for software modifications to enable passive rssi measurement.

the exercise of coercive authority in the interests of public health is typically justified by the harm principle (upshur ) , the requirement that the action be necessary to prevent harm to others. it is further limited by the principle of least infringement (childress et al ) , that interventions which undermine privacy or autonomy must be the least burdensome alternative necessary to support the public health objective independently justified by the harm principle. efficacy is therefore a necessary condition on justification, and any modulation of measures taken in response to other ethical concerns must maintain a level of efficacy consistent with claims that the intervention is a viable alternative (allen & selgelid ) . for example, evidence that the pervasive use of face coverings in public significantly reduces inter-human transmission of covid- (zhang et al ) might justify the exercise of state power to make them compulsory, a significant limitation on autonomy, but one that is relatively low in costs and restrictions compared to alternatives such as mass shelter-at-home orders. the efficacy of the less restrictive alternative is high enough that the marginally better results from dramatically more severe restrictions are offset. if responsiveness to ethical or legal requirements constrains the implementation of dct in ways that weaken its expected efficacy, this in turn undermines justification for coercive measures to encourage adoption. this might indicate a fundamental problem with the proposed intervention. because the predictions dct makes often trigger actions that further impact individual autonomy, such as quarantine, efficacy is particularly critical. moreover, because dct has the potential to generate knowledge of risks that could save lives, decisions that dilute its epistemic capacity are themselves ethically salient (dennett ) .

at a time of heightened public awareness of the privacy and security challenges presented by our digital data exhaust, dct has been subject to intense scrutiny on privacy grounds. there is a growing awareness that our data can be used in contexts that we would not consent to, and which could harm our interests.
we might agree to let an app track our music listening habits to recommend playlists, but be dismayed to learn it can be used to make inferences about our personality. even where we might grant consent to use our data in one context of analysis, interpretation and action, such as infectious disease control, we might not be able to foresee the functions the data might be used for within it. similar problems with informed consent arise in the context of genetic research (lunshof et al ), where uncertainty about usage problematizes consent, a problem magnified under the socio-technical conditions in which digital data is collected and retained, which generate very little friction against such re-contextualization and re-taxonomization. in light of these concerns it is not surprising that many dct models have focused on privacy-by-design, with strict minimization of data collected and transmitted, strong anonymization, a prohibition on the use of additional data sources (such as gps), and policies demanding regular deletion of data and restrictions on uploading data to central servers. privacy-preserving dct models have been extremely influential, as evidenced by the extent to which implementations have coalesced around privacy-preserving standards (chan et al , li and guo , tang ) such as mit's private kit (mit ), pepp-pt (pepp team ) and dp- t (troncoso et al ), and the extent to which technology platform providers and health institutions (world health organization ) have embraced this approach. because the design of mobile operating systems prevents the passive collection of bluetooth data, the cooperation of vendors is necessary to build effective dct apps. the dominant mobile operating system vendors, apple and google, jointly and rapidly developed the "exposure notification api" (apple & google ) to support limited dct capabilities. access to the exposure api is tightly controlled, and only one app can be deployed per country. the vendors can disable and remove the app at any time. the app cannot use any data source except bluetooth rssi data obtained via the exposure api, and it cannot transmit this data to a central server. the exposure api provides a methodology for the calculation of disease transmission risk which public health authorities configure only by setting some pre-defined values. the structure of the exposure api expresses and enforces a policy perspective on the relationship between public health authorities and citizens who use the products manufactured by apple and google. it treats data as private property, frames the relationship between individuals and governments as adversarial, entrenches technology platforms as gatekeepers, and offers a conception of emergency public health authority as limited by individual consent and considerable corporate influence. this is an unconventional view - historically, privacy has not been a significant constraint on manual contact tracing, and even strong legislation such as hipaa recognizes the legitimate need of public health authorities to access protected health information (hipaa cfr . ). technology companies require a great deal of public trust to operate, as do governments and public health authorities. because of the need for cooperation with governments to build dct, vendors are exposed to highly publicized risks in the deployment of dct, in terms of maintaining trust and also in avoiding additional regulation.
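the risk-calculation methodology mentioned above is configurable by public health authorities only through a set of pre-defined values. the sketch below is a generic stand-in for that configuration pattern, not the vendors' actual api: the weight names, attenuation buckets and notification threshold are hypothetical.

```python
def exposure_risk_score(duration_min, attenuation_db, days_since_onset, weights):
    """Combine exposure attributes into one score using authority-supplied weights."""
    # lower bluetooth attenuation is treated as closer contact (bucket cut-offs assumed)
    if attenuation_db < 55:
        proximity = weights["near"]
    elif attenuation_db < 70:
        proximity = weights["medium"]
    else:
        proximity = weights["far"]
    # infectiousness assumed to peak around symptom onset
    infectiousness = weights["peak"] if abs(days_since_onset) <= 2 else weights["tail"]
    return duration_min * proximity * infectiousness

# illustrative health-authority configuration and notification policy
config = {"near": 1.0, "medium": 0.5, "far": 0.0, "peak": 1.0, "tail": 0.4}
score = exposure_risk_score(duration_min=20, attenuation_db=60,
                            days_since_onset=1, weights=config)
should_notify = score >= 10.0
```

the point of the sketch is that everything a health authority can adjust lives in the configuration dictionary; the structure of the calculation itself is fixed by the vendor.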
the privacy preserving model serves vendor interests, allowing them to cooperate with public health authorities, thus avoiding regulatory or coercive measures, by limiting the possibility that the use of dct apps breaks tacit or contractual agreements with their users that could damage already wavering public trust. (newton ) . critically, the exposure notification api prevents several actions that might be undertaken to improve the efficacy of dct. coverage problems that relate to contexts where smartphone ownership or physical possession is uneven could be partially remediated by aggregating other data, as could the predictive weaknesses of rssi. the aggregation of data, including gps, on central servers where it can be subject to further analysis and enrichment might also improve the epidemiological value of dct (mello & wang ) . some countries have political, demographic and cultural characteristics that might favour the use of multiple apps, and data preservation may have future epidemiological value. if privacy-maximizing constraints on dct undermine efficacy, this in turn can weaken justification to deploy dct at all. one might conclude that this is the correct outcome of ethical analysis of dct, that it cannot be used ethically, because requirements needed to generate the efficacy required for public health objectives are unjustifiably invasive or coercive. alternately, one might wonder if this suggests that privacy-maximizing analysis is problematic. it is somewhat dismaying that a public health intervention that we have the technical means to deploy, which would be a much less restrictive alternative to measures currently in effect, becomes unjustifiable because of the restrictions necessary to ensure minimization of privacy risks. concerns about security and mission creep are only accidentally supportive of privacy maximization. while there are legitimate reasons to think that the socio-technical infrastructure dct apps depend on are too insecure to trust, these are generally not inherent but are instead the results of implementation decisions, and in practice we are able to mitigate these to support many sensitive applications. there will be many examples of poorly implemented dct, such as qatar's which leaked personal data in qr-codes (amnesty international uk ), but this does not mean that secure dct is not possible. one might also worry that governments will misuse the data down the road, but emergency public health legislation enacted in most jurisdiction has strict limitations that we should trust to function as intended. even if we have upstream worries about the rule of law in some jurisdiction, this a distal problem, and not one that weighs in favour of the privacy-maximizing view generally. the problem of coverage efficacy is one of trust and influence as much as it is technicaladequate coverage and compliance depends in part on the public's willingness to cooperate. justify if a majority of the population did not support it. one of the reasons why anti-vaccination propaganda, which is often produced and amplified by hostile entities, is particularly dangerous is that it can erode democratic mandate for the very actions that would mitigate the damage. the paradox which arises for dct is that increasing privacy protection in order to overcome constraints on justification undermines predictive efficacy to an extent that weakens justification to deploy dct at all. 
but, to relax these protections to improve predictive efficacy conflicts with public sentiment (milsom et al ) , creating resistance to adoption that would exacerbate coverage efficacy problems, again weakening justification on efficacy grounds, but also increasing the justificatory burden because implementation against public sentiment raises the stakes in terms of autonomy impingement. the remainder of this article explores a route to resolve this paradox by examining the conditions that make the privacy objections so difficult to overcome. public sentiment against impingements on privacy necessary for dct is grounded in legitimate fears of pervasive security problems with the socio-technical infrastructure. the litany of security and privacy problems with dct applications that have already been deployed (privacy international ) reinforce this. however, this sentiment is also shaped by an increasingly prominent public discussion of technology ethics that is framed in a way that sits uneasily alongside the values that inform public health ethics. a dominant strain of technology ethics, as exemplified by legal expressions such as the eu's gdpr and california's ccpa and many ai ethics charters and codes of conduct (jobin & vayena ) resembles a format that, in bioethics, came to be known as "principlism" (beauchamp & childress , clouser & gert . this is the view that minimal set of principals, usually autonomy, non-maleficence, beneficence, and justice, supply the analytical machinery needed to approach ethical problems. it is criticized on the grounds that it does not specify an ordering, which instead is often inherited from the context of application, which tends to privilege the liberal individualist emphasis on autonomy, and which is unable to fully articulate principles such as beneficence beyond self-interest. (callahan ) . applied technology ethics has a tendency to generate trade-off dilemmas, such as that between innovation and precaution, or privacy and public goods, because, as with principlism in bioethics, it does not supply a decision procedure for conflict resolution. this is particularly challenging when institutions that produce technological artefacts and systems struggle with "...onboarding external ethical perspectives..." (metcalf & moss ) that conflict with tacit and explicit internal norms. our underlying moral interest in applied ethics demands more than compromise and consilience, rather, as callahan puts it "[s]erious ethics, the kind that causes trouble to comfortable lives, wants to know what counts as a good choice and what counts as a bad choice" (callahan ) . the "communitarian turn" in bioethics arose in part because capabilities emerging in genetic research created opportunities for public goods that could only be ethically realized once focus on individual interests yielded to more communitarian principles such as solidarity and public benefit. (chadwick ) . predictive genetic analysis that might benefit an individual, their family, and their community, now and in the future, exposes information that might be prejudicial to the individual's interests, for example, by interfering with their ability to acquire health insurance (fulda & lykens , launis the extended value of genetic data over long timelines and across unforeseeable applications problematizes the coherence and applicability of autonomy protections such as informed consent. 
an ethical framework that could motivate policy and regulation to enables the pursuit of these opportunities for public good required the integration of communitarian values. likewise, public health ethics introduces principles such as solidarity, proportionality, and reciprocity alongside the four core principles of biomedical ethics (coughlin , lee , schröder-bäck et al , communitarian values that reflect the fundamentally shared object of concern, and further expose the limits of analysis that privileges autonomy. communitarian and distributive considerations can help resolve some of the ordering problems principlist technology ethics inherits from the liberal individualist context it operates within, helping to resolve tradeoffs by giving greater weight to shared values and common goods. if dct cannot be deployed in a way that is ethical and effective, this is an unfortunate loss of a significant public health opportunity. the barriers to remediation run deeper than privacypreserving technical measures, and stem from the need to develop a conception of aggregate personal data as a public resource. the exposure notification api encodes and enforces a privacy and autonomy maximizing model of dct, essentially privatizing a public health policy concern. one justification for this is that corporations are enabling their users to protect their personal property, or adhering to a contractual obligation (taddeo & floridi, ) . traditional contact tracing treats our personal data as a potential public resource, with synchronous consent and access procedures triggered by the identification of transmission risk, whereas dct treats it as a de-facto public resource with aways-on consent and access. dct provides public benefits based on data collected from many individuals who might never have an elevated risk. its value is at the population level, and we would accept impingement on our privacy for the good of the community. although privacy is usually regarded as a paradigmatically individual concern, communitarian approaches to privacy (o'hara , floridi argue that groups can have privacy rights, and that privacy is fundamentally a common good, where its value and limits are in reciprocal tension with other community values. technology companies profit from the value they extract from aggregate data, which depends on pervasive access to individual data in ways that frequently compromise privacy. aggregate data is exponentially more economically and informationally valuable than that of the data of any one individual, and confers signifiant soft power to influence public sentiment, and hard power to control access to data and generate economic opportunities. but it is not clear that the equivocation between personal data as the private property of an individual, and aggregate data as the private property of the collector, is justified. napoli ( ) argues that "...whatever the exact nature of one's individual property rights in one's user data may be, when these data are aggregated across millions of users, their fundamental character changes in such a way that they are best conceptualized as a public resource" (napoli ) . if aggregate data is substantially and uniquely distinctive, this supports the application of public trust doctrine, which is based on the idea that "...because of their unique characteristics, certain natural resources and systems are held in trust by the sovereign on behalf of the citizens" (calabrese, ) , such a the public broadcast spectrum. 
the exploration of a communitarian approach to applied technology ethics and the articulation and assertion of a public resource rationale applicable to the data we generate by engaging with digital technology and services could enable policy and regulation that would directly address the barriers to effective and ethical dct. this could expand regulatory and policy measures to ensure the safe handling of sensitive data, foster the enabling conditions for the realization of opportunities to use aggregate data for public good, and help reverse the centralization of decisive power over public policy in the hands of multinational technology corporations. where policy and legislation such as the gdpr, especially through the dpia process, identifies and protects risks to individual interests, methodologies to identify and protect opportunities in the public interest lag behind, as the barriers to dct implementation illustrate. inherent efficacy challenges: the virology of covid- , so far as it is understood, makes this re-taxonomization problematic, because the mode of transmission and infectivity is such that there is only a weak likelihood that any particular contact detected by dct results in transmission, whereas for disease such as tuberculosis or hiv/aids, it is easier to identify exposure events with high transmission probability. the contact/transmission link is also problematic due to the potential for transmission via indirect surface contact. there are socioeconomic confounds related to smartphone ownership and use that will skew representation and the ability to install and update dct apps. life patterns in some populations generate periods of contact with others when smartphone are not present, and some forms of employment generate a large number of contacts with others, which may or may not actually correspond to increased risks of transmission. evidence for nonnosocomial transmission in japan shows primary cases in several contexts where smartphones are frequently not on our persons or turned off, such as music events and gyms (furuse et al ) . bluetooth rssi as a proxy for exposure: there are efficacy problems with the core technology. rssi measurements map only weakly to transmission risk, because ble radio signals travel through walls and barriers used in public spaces to specifically to prevent droplet spread. rssi is stronger when we walk side-by-side than following one another, is weakened when phones are in pockets while sitting around a table, and is sensitive to many idiosyncratic features of indoor environment (leith & farrell ) . there are also considerable differences in rssi measurement for different devices and different mobile operating systems (bluetrace ), which introduces socio-economic confounds. security: efficacy can be further undermined by deliberate exploitation of security vulnerabilities (vaudenay ) and even simple circumvention such as the display of screen captures instead of running apps, as has been observed in india with mandatory aarogya setu app (clarence ) undermines the public health value of dct. at the population level, dct apps would have to be in use by % of a population (servick ) to be effective, a challenge that lead singapore to consider making their app mandatory, a proposal later abandoned due to implementation challenges (mahmud ) . various jurisdictions have considering opt-in, out-out, and incentivization schemes to encourage uptake. 
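the weakness of rssi as a distance proxy noted above can be made concrete with the standard log-distance path-loss model. the reference power and path-loss exponents below are assumptions; in real indoor settings they vary with device model, body shielding, orientation and barriers, which is exactly the efficacy problem described.

```python
import math

def expected_rssi(distance_m, tx_ref_dbm=-59.0, path_loss_exp=2.0):
    """Log-distance path-loss model: received power falls off with log10(distance)."""
    return tx_ref_dbm - 10.0 * path_loss_exp * math.log10(distance_m)

def estimated_distance(rssi_dbm, tx_ref_dbm=-59.0, path_loss_exp=2.0):
    """Invert the model to estimate distance from a measured rssi value."""
    return 10 ** ((tx_ref_dbm - rssi_dbm) / (10.0 * path_loss_exp))

# the same -75 dBm reading maps to very different distances depending on an
# unknown, environment-specific exponent (free space vs. cluttered indoor space)
for n in (1.8, 2.0, 2.5, 3.0):
    print(f"exponent {n}: ~{estimated_distance(-75.0, path_loss_exp=n):.1f} m")
```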
at the individual level, coverage involves the extent of the individual's activities and behaviours that are accurately captured by the dct app. aside from issues related to smartphone ownership and presence described above, mobile phone operating systems place limits on the ways apps can access bluetooth radios, often requiring apps be open and in use -even an individual who has installed the app and has their phone at all times would produce little useful dct data in this case. this problem is in fact a critical barrier to effective dct, and requires the cooperation of operating system vendors to remediate, and requiring users to keep their phones open and the apps on-screen is not viable. neo-muzak and the business of mood necessity and least infringement conditions in public health ethics qatar: 'huge' security weakness in covid- contact-tracing app exposure notifications: using technology to help public health authorities fight covid- contact tracing of tuberculosis: a systematic review of transmission modelling studies principles of biomedical ethics principlism and communitarianism pact: privacy sensitive protocols and mechanisms for mobile contact tracing the communitarian turn: myth or reality public health ethics: mapping the terrain aarogya setu: why india's covid- contact tracing app is controversial a critique of principlism how many principles for public health ethics information, technology, and the virtues of ignorance quantifying sars-cov- transmission suggests epidemic control with digital contact tracing group privacy: a defence and an interpretation ethical foresight analysis: what it is and why it is needed ethical issues in predictive genetic testing: a public health perspective clusters of coronavirus disease in communities the song is you: preferences for musical attribute dimensions reflect personality the global landscape of ai ethics guidelines the online competition between pro-and anti-vaccination views solidarity, genetic discrimination, and insurance: a defense of weak genetic exceptionalism coronavirus contact tracing: evaluating the potential of using bluetooth received signal strength for proximity detection covid- contact-tracing apps: a survey on the global deployment and challenges public health ethics theory: review and path to convergence from genetic privacy to open consent data mining for health: staking out the ethical territory of digital phenotyping covid- : govt developing wearable contact tracing device information wars: tackling the threat from disinformation on vaccines ethics and governance for digital disease surveillance owning ethics: corporate logics, silicon valley, and the institutionalization of ethics survey of acceptability of app-based contact tracing in the uk ethical guidelines for covid- tracing apps this is how much americans trust facebook, google, apple, and other big tech companies. the verge user data as public resource: implications for social media regulation intimacy . : privacy rights and privacy responsibilities on the world wide web logistics of community smallpox control through contact tracing and ring vaccination: a stochastic network model privacy international ( ) apps and covid- teaching seven principles for public health ethics: towards a curriculum for a short course on ethics in public health programmes covid- contact tracing apps are coming to a phone near you. 
how will we know whether they work use of apps in the covid- response and the loss of privacy protection hipaa isn't enough: all our data is health data the ethics of nudging the debate on the moral responsibilities of online service providers privacy-preserving contact tracing: current solutions and open questions decentralized privacy-preserving proximity tracing principles for the justification of public health intervention aerosol and surface stability of sars-cov- as compared with sars-cov- analysis of dp t systematic literature review on the spread of health-related misinformation on social media a national plan to enable comprehensive covid- case finding and contact tracing in the us ethical considerations to guide the use of digital proximity tracking technologies for covid- contact tracing: interim guidance identifying airborne transmission as the dominant route for the spread of covid- key: cord- -i ecxgus authors: nan title: abstracts of publications related to qasr date: - - journal: nan doi: . /qsar. sha: doc_id: cord_uid: i ecxgus nan tive mechanisms p. - . edited by magee, p.s., henry, d.r., block, j.h., american chemical society, washington, . results: an overview is given on the approaches for the discovery and design concepts of bioactive molecules: a) natural products derived from plant extracts and their chemically modified derivatives (cardiac glycosides, atropine, cocaine, penicillins, cephalosporins, tetracyclines and actinomycins, pyrethrins and cyclosporin; b) biochemically active molecules and their synthetic derivatives: acetylcholine, histamine, cortisonelhydrocortisone, indole- -acetic acid (phenoxyacetic acid herbicides); c) principles of selective toxicity is discussed exemplified by trimethoprimlmethotrexate, tetracyclines, acylovir, azidothymidine, antifungal agents; d) metabolism of xenobiotics; e) exploitation of secondary effects (serendipity); f) receptor mapping; g) quantitative structure-activity relationship studies; h) empirical screening (shotgun approach). results: past and present of qsar is overviewed: a) historical roots; b) the role of qsar models in rational drug design, together with a simplified diagram of the steps involved in drug development, including the place of qsar investigations; c) classification of qsar models: structure-cryptic (property-activity) models, structure-implicit (quantum chemical) models, structure-explicit (structure-activity) and structure-graphics (computer graphics) models; d) a non-empirical qsar model based on quantities introduced for identification of chemical structures, using szymansk's and randic's identification (id) numbers, including applications for alkyltriazines. bioessays, , ( ) , - . results: a review is given on recent observations about receptor structure and the dynamic nature of drug receptors and the significance of receptor dynamics for drug design: a) receptors are classified according to structure and function (i) ion channels (nicotinic acetylcholine, gaba, glycine); (ii) g protein linked [adrenergic ( c x ,~) , muscarinic acetylcholine, angiotensin, substance k, rhodopsin]; (iii) tyrosine kinase (insulin, igf, egf, pdgf); (iv) guanylate cyclase (atrial natriuretic peptide, speractin); b) protein conformational changes can be best studied on allosteric proteins whose crystal structure is available (e.g. 
hemoglobin, aspartate transcarbamylase, tryptophan repressor) (no high resolution of a receptor structure is known); c) receptor conformational changes can be studied by several indirect approaches (i) spectral properties of covalent or reversibly bound fluorescent reporter groups; (ii) the sensitivity of the receptor to various enzymes; (iii) the sedimentation of chromatographic properties of the receptor; the affinity of binding of radioligands; (iv) the functional state of the receptor; d) there are many unanswered questions: e.g. (i) are there relatively few conformational states for receptors with fluctuations around them or many stable conformational states; (ii) how can static structural information be used in drug design when multiple receptor conformations exist. title: designing molecules and crystals by computer. (review) author: koide, a. ibm japan limited, tokyo scientific center, tokyo research laboratory - sanban-cho, chiyoda-ku, tokyo , japan. source: ibm systems journal , ( ), - . results: an overview is given on three computer aided design (cad) systems developed by ibm tokyo scientific center: a) molecular design support system providing a strategic combination of simulation programs for industrial research and development optimizing computational time involved and the depth of the resulting information; b) molworld on ibm personal systems intended to create an intelligent visual environment for rapidly building energetically stable d molecular geometries for further simulation study; c) molecular orbital graphics system designed to run on ibm mainframe computers offering highly interactive visualization environment for molecular electronic structures; d) the systems allow interactive data communication among the simulation programs for their strategically combined use; e) the structure and functions of molworld is illustrated on modeling the alanine molecule: (i) data model of molecular structures; (ii) chemical formula input; (iii) generation of d molecular structure; (iv) formulation of bonding model; (v) interactive molecular orbital graphics; (vi) methods of visualizing electronic structures; (vii) use of molecular orbital graphics for chemical reactions. title: interfacing statistics, quantum chemistry, and molecular modeling. (review) author: magee, p.s. biosar research project vallejo ca , usa. source: acs symposium series , no. . in: probing bioactive mechanisms p. - . edited by magee, p.s., henry, d.r., block, j.h., american chemical society, washington, . results: a review is given on the application and overlap of quantum chemical, classical modeling and statistical approaches for the quant. struct.-act. relat. , - ( ) abstr. - understanding of binding events at the molecular level. 
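the "chemical formula input, 3d structure generation, bonding model" pipeline described for molworld has modern open-source analogues. the sketch below uses rdkit (an assumption of this example, not the ibm systems reviewed) to build and energy-minimize a 3d model of alanine, the same molecule used to illustrate molworld.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(N)C(=O)O")     # alanine entered as a line notation
mol = Chem.AddHs(mol)                       # explicit hydrogens for 3d embedding
AllChem.EmbedMolecule(mol, randomSeed=42)   # generate initial 3d coordinates
AllChem.MMFFOptimizeMolecule(mol)           # force-field energy minimization

conf = mol.GetConformer()
for atom in mol.GetAtoms():
    p = conf.GetAtomPosition(atom.GetIdx())
    print(f"{atom.GetSymbol():2s} {p.x:8.3f} {p.y:8.3f} {p.z:8.3f}")
```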
a new com-a) qsar of cns drugs has been systematically discussed according plementary method called statistical docking experiment is also to the following classes: (i) general (nonspecific) cns depressants: presented: general anesthetics; hypnotics and sedatives; (ii) general insights obtained using energy-minimized structures; activation in the bound state, types and energies of interactions at the receptor site and in crystal; four successful examples (significant regression equations) are given for the modeling of binding events using physico-chemical descriptors and correlation analysis: (i) binding of a diverse set of pyridines to silica gel during thin-layer chromatography; (ii) binding of meta-substituted n-methyl-arylcarbamates to bovine erythrocyte ache; (iii) binding of meta-substituted n-methyl-arylcarbamates to ache obtained from susceptible and resistant green rice leafhoppers; (iv) activity of phenols inhibiting oxidative phosphorylation of adp to atp in yeast; a new statistical method for mapping of binding sites has been developed based on the hypermolecule approach, identifying key positions of binding and nature of the energy exchange between the hypermolecule atoms and the receptor site; two examples are given on the successful application of statistical modeling (statistical docking experiment) based on the hypermolecule approach: (i) inhibition of housefly head ache by metasubstituted n-methyl-arylcarbamates (n = , r = . , s = . , f = . ); (ii) inhibition of housefly head ache by orthosubstituted n-methyl-arylcarbamates (n = , r = . , s = . , f = . ). (nonspecific) cns stimulants; (iii) selective modifiers of cns functions: anticonvulsants, antiparkinsonism drugs, analgetics and psychopharmacological agents; (iv) miscellaneous: drugs interacting with central a-adrenoreceptors, drugs interacting with histamine receptors, cholinergic and anticholinergic drugs; b) the review indicates that the fundamental property of the molecules which mostly influence the activity of cns drugs is hydrophobicity (they have to pass the cell membrane and the bloodbrain barrier); c) electronic parameters, indicative of dipole-dipole or charge-dipole interactions, charge-transfer phenomena, hydrogen-bond formation, are another important factor governing the activity of most cns agents; d) topographical, lipophylic and electronic structures of cns pharmacophores are reviewed; e) qsar equations, tables and figures from references are shown and discussed. the relevant template for each atom in the molecule is mapped into a bit array and the appropriate atomic position is marked; volume comparisons (e.g. common volume or excluded volume) are made by bit-wise boolean operations; the algorithm for the visualization of the molecular surface comprising the calculated van der walls volume is given; comparisons of cpu times required for the calculation of the van der waals molecular volumes of various compounds using the methods of stouch and jurs, pearlman, gavazotti and the new method showed that similar or better results can be achieved using the new algorithm with vax-class computers on molecules containing up to several hundred atoms. abtstr. - quant. struct.-act. relat. , - ( ) one of the important goal of protein engineering is the design of isosteric analogues of proteins; major software packages are available for molecular modeling are among others developed by (i) biodesign, inc., pasadena, california; (ii) biosym technologies, san diego, california; (iii) tripos, st. 
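the bit-array idea behind the van der waals volume comparisons above can be sketched with a coarse cubic grid: every grid point inside any atomic sphere is marked, and common or excluded volumes reduce to boolean set operations. the grid spacing, radii and toy coordinates are illustrative; this is a sketch of the general technique, not the published algorithm.

```python
from itertools import product

def occupancy(atoms, lo=-4.0, hi=4.0, step=0.4):
    """Return the set of grid-point indices lying inside any atom's van der
    waals sphere; atoms are (x, y, z, radius) tuples in angstroms."""
    pts = [lo + i * step for i in range(int(round((hi - lo) / step)) + 1)]
    marked = set()
    for idx, (x, y, z) in enumerate(product(pts, pts, pts)):
        if any((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 <= r * r
               for ax, ay, az, r in atoms):
            marked.add(idx)
    return marked

mol_a = occupancy([(0.0, 0.0, 0.0, 1.7), (1.5, 0.0, 0.0, 1.7)])   # toy molecule A
mol_b = occupancy([(0.5, 0.0, 0.0, 1.7)])                          # toy molecule B

voxel = 0.4 ** 3
common_volume = len(mol_a & mol_b) * voxel      # bit-wise AND
excluded_volume = len(mol_a - mol_b) * voxel    # A AND NOT B
print(round(common_volume, 1), round(excluded_volume, 1))
```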
louis, missouri; (iv) polygen, waltham, massachusetts; (v) chemical design ltd. oxford; the molecular modelling packages use three basic parameters: (i) descriptive energy field; (ii)algorithm for performing molecular mechanics calculations; (iii) algorithm for performing molecular dynamics calculations; modelling study of the binding events occurring between the envelop protein (gp ) of the aids (hiv) virus and its cellular receptor (cd ) protein supported the hypothesis that this domain was directly involved in binding the gp envelop protein leading to the design of conformationally restricted synthetic peptides binding to cd . title: finding washington, . results: a new technique called "homology graphing" has been developed for the analysis of sequence-function relationships in proteins which can be used for sequence based drug design and the search for lead structures: a) as target protein is inhibited by the ligands of other proteins having sequence similarity, computer programs have been developed for the search of the similarity of proteins; b) proteins are organized into hierarchical groups of families and superfamilies based on their global sequence similarities; c) global sequence similarities were used to find inhibitors of acetolactate synthase (als) and resulted in a quinone derivative as a lead structure of new als inhibitors; d) local sequence similarities of bacterial and mammal glutathione synthase (gsh) were used to find inhibitors of gsh; e) it was shown that the sequence segment of gsh was similar to dihydrofolate reductase (dhfr) is part of the atp-binding site; f) biological bases of local similarity between sequences of different proteins were indicated: molecular evolution of proteins and functionally important local regions; g) homology graph, as a measure of sequence similarity was defined; h) sequence-chemical structure relationship based on homology graph and the procedure to find lead structures was illustrated by an example resulting in a list of potential inhibitors selected by the procedure based on the sequence segment from residue to of the sequence of tobacco als. source: acs symposium series , no. . in: probing bioactive mechanisms p. - . edited by magee, p.s.. henry, d.r., block, j.h., american chemical society, washington. . results: a review is given on the molecular design of the following major types of antifungal compound in relation to biochemistry, molecular modeling and target site fit: a) squalene epoxidase inhibitors (allilamines and thiocarbanilates) blocking conversion of squalene , -oxidosqualene; b) inhibitors of sterol c- demethylation by cytochrome p- (piperazines pyridines, pyrimidines, imidazoles and triazoles); c) inhibitors of sterol a' -t a' isornerization andlor sterol reductase inhibitors (morpholines); d) benzimidazoles specifically interfering with the formation of microtubules and the activity phenylcarbamates on benzimidazole resistant strains; e) carboxamides specifically blocking the membrane bound succinate ubiquinone oxidoreductase activity in the mitochondria electron transport chain in basidiomycetes; f) melanin biosynthesis inhibitors selectively interfering with the polyketide pathway to melanin in pyricularia oryzae by blocking nadph dependent reductase reactions of the pathway (fthalide, pcba, chlobentiazone, tricyclazole, pyroquilon, pp ). title: quantitative modeling of soil sorption for xenobiotic chemicals. (review) author: sabljic, a. 
theoretical chemistry group, department of physical chemistry, institute rudjer boskovic hpob , yu- zagreb, croatia, yugoslavia. source: environ. health perspect. , ( ), - . results: the environmental fate of organic pollutants depends strongly on their distribution between different environmental compartments. a review is given on modeling the soil sorption behavior of xenobiotic chemicals: a) distribution of xenobiotic chemicals in the environment and principles of its statistical modeling; b) quantitative structure-activity relationship (qsar) models relating chemical, biological or environmental activity of the pollutants to their structural descriptors or physico-chemical properties such as logp values and water solubilities; c) analysis of the qsar existing models showed (i) low precision of water solubility and logp data; (ii) violations of some basic statistical laws; d) molecular connectivity model has proved to be the most successful structural parameter modeling soil sorption; e) highly significant linear regression equations are cited between k , values and the first order molecular connectivity index ( ' x ) of a wide range of organic pollutants such as polycyclic aromatic hydrocarbons (pahs) and pesticides (organic phosphates, triazines, acetanilides, uracils, carbamates, etc.) with r values ranging from . to . and s values ranging from . to . ; f) the molecular connectivity model was extended by the addition of a single semiempirical variable (polarity correction factor) resulting in a highly significant linear regression equations between the calculated and measured ko, values of the total set of compounds (n = , r = . , s = . , f = ); g) molecular surface areas and the polarity of the compounds were found to be responsible for the majority of the variance in the soil sorption data of a set of structurally diverse compounds. title: strategies for the use of computational sar methods in assessing genotoxicity. (review) results: a review is given on the overall strategy and computational sar methods for the evaluation of the potential health effects of chemicals. the main features of this strategy are discussed as follows: a) generalized sar model outlining the strategy of developing information for the structure-activity assessment of the potential biological effects of a chemical or a class of chemicals; b) models for predicting health effects taking into account a multitude of possible mechanisms: c) theoretical models for the mechanism of the key steps of differential activity at the molecular level; d) sar strategies using linear-free energy methods such as the hansch approach; e) correlative sar methods using multivariate techniques for descriptor generation and an empirical analysis of data sets with large number of variables (simca, adapt, topkat, case, etc.); f) data base considerations describing three major peer-reviewed genetic toxicology data bases (i) national toxicology program (ntp) containing short term in vitro and in vivo genetic tests; (ii) data base developed by the epa gene-tox program containing different short term bioassays for more than compounds, used in conjunction with adapt, case and topkat; (iii) genetic activity profile (gap) in form of bar graphs displaying information on various tests using a given chemical. title: quantitative structure-activity relationships. principles, and authors: benigni,, r.; andreoli, c.; giuliani, a. applications to mutagenicity and carcinogenicity. 
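the first-order molecular connectivity index that drives the soil-sorption models above is computed directly from the hydrogen-suppressed molecular graph. a minimal sketch follows; the n-butane example is only illustrative, and fitting log koc against the index would then be an ordinary linear regression.

```python
from math import sqrt

def first_order_chi(bonds):
    """1-chi = sum over bonds of 1/sqrt(delta_i * delta_j), where delta is the
    vertex degree in the hydrogen-suppressed molecular graph (randic / kier-hall)."""
    degree = {}
    for i, j in bonds:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    return sum(1.0 / sqrt(degree[i] * degree[j]) for i, j in bonds)

# n-butane as a four-carbon chain: C1-C2-C3-C4
print(first_order_chi([(1, 2), (2, 3), (3, 4)]))   # ~1.914
```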
(review) laboratory of toxicology and ecotoxicology, istituto superiore di sanita rome, italy. source: mutat. res. , ( ), - . results: methods developed for the investigation for the relationships between structure and toxic effects of compounds are summarized: a) the extra-thermodynamic approach: the hansch paradigm, physical chemical properties that influence biological activity and their parametrization, originality of the hansch approach, receptors and pharmacophores: the natural content of the hansch approach, predictive value of qsars, a statistifa tool: multiple linear regression analysis, the problem of correlations among molecular descriptors, other mathematical utilizations of extrathermodynamic parameters; b) the substructural approach: when topological (substructural) descriptors are needed, how to use topological decriptors; c) qsar in mutagenicity and carcinogenicity: general problems, specific versions of the substructural approach used for mutagenicity and carcinogenicity, applications to mutagenicity and carcinogenicity. title: linking structure and data. (review) author: bawden, d. source: chem. britain , (nov) , i - . address not given. results: the integration of information from different sources, particularly linking structural with non-structural information is an important consideration in chemical information technology. a review is given on integrated systems: a) socrates chemicallbiological data system for chemical structure and substructure searching combined with the retrieval of biological and physicochemical data, compound availability, testing history, etc.; b) psidom suite of pc based structure handling routines combining chemical structure with the retrieval of text and data; c) cambridge crystal structure databank on x-ray data of organic compounds integrating information on chemical structure, crystal conformation, numerical information on structure determination, bibliographic reference and keywording; d) computer aided organic synthesis for structure and substructure search, reaction retrieval, synthetic analysis and planning, stereochemical analysis, product prediction and thermal hazard analysis. title: determination of three-dimensional structures of proteins and nucleic acids in solution by nuclear magnetic resonance spectroscopy. source: critical rev. biochem. mol. biol. , ( ) , - . 
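the hansch approach summarized above fits biological activity to physicochemical descriptors by multiple linear regression. the sketch below shows the mechanics on invented data; the descriptor values and the resulting coefficients are not taken from any of the studies reviewed.

```python
import numpy as np

# invented training data: log(1/C) modelled from logP and a hammett sigma
logP      = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
sigma     = np.array([0.00, 0.23, -0.17, 0.37, 0.06, 0.23])
log_inv_c = np.array([2.1, 2.6, 2.9, 3.5, 3.6, 4.0])

# design matrix for  log(1/C) = a*logP + b*sigma + c
X = np.column_stack([logP, sigma, np.ones_like(logP)])
(a, b, c), *_ = np.linalg.lstsq(X, log_inv_c, rcond=None)

pred = X @ np.array([a, b, c])
r = np.corrcoef(pred, log_inv_c)[0, 1]
print(f"log(1/C) = {a:.2f} logP + {b:.2f} sigma + {c:.2f}   (n = 6, r = {r:.3f})")
```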
results: a comprehensive review is given on the use of nmr spectroscopy for the determination of d structures of proteins and nucleic acids in solution discussing the following subjects: a) theoretical basis of two-dimensional ( d) nmr and the nuclear overhauser effect (noe) measurements for the determination of d structures is given; b) sequential resonance assignment for identifying spin systems of protein nmr spectra and nucleic acid spectra, selective isotope labeling for extension to larger systems and the use of site specific mutagenesis; c) measurement and calculation of structural restraints of the molecules (i) interproton distances; (ii) torsion angle restrains; (iii) backbone torsion angle restraints; (iv) side chain torsion angle restraints; (v) stereospecific assignments; (vi) dihedral angle restraints in nucleic acids; d) determination of secondary structure in proteins; e) determination of tertiary structure in proteins using (i) metric matrix distance geometry (ii) minimization in torsion angle space; (iii) restrained molecular dynamics; (iv) dynamical simulated annealing; (v) folding an extended strand by dynamical simulated annealing; (vi) hybrid metric matrix distance geometry-dynamical simulated annealing method; (vii) dynamical simulated annealing starting from a random array of atoms; f) evaluation of the quality of structures generated from nmr data illustrated by studies for the structure determination of proteins and oligonucleotides using various algorithms and computer programs; g) comparisons of solution and x-ray structures of (i) globular proteins; (ii) related proteins; (iii) nonglobular proteins and polypeptides; h) evaluation of attainable precision of the determination of solution structures of proteins for which no x-ray structures exist (i) bds-i (small -residue protein from the sea anemone sulcata; (ii) hirudin (small -residue protein from leech which is a potent natural inhibitor of coagulation); i) structure determination by nmr is the starting point for the investigation of the dynamics of conformational changes upon ligand abtstr. - quant. struct.-act. relat. , - ( ) binding, unfolding kinetics, conformational equilibria between different conformational states, fast and slow internal dynamics and other phenomena. title: aladdin. an integrated tool for computer-assisted molecular design and pharmacophore recognition from geometric, steric, and substructure searching of three-dimensional molecular structures. ( aladdin has the ability to (i) objectively describe receptor map hypothesis; (ii) scan a database to retrieve untested compounds which is predicted to be active by a receptor map hypothesis; (iii) quantitatively compare receptor map hypotheses for the same biological activity; (iv) design compounds that probe the bioactive conformation of a flexible ligand; (v) design new compounds that a receptor map hypothesis predicts to be active; (vi) design compounds based on structures from protein x-ray crystallography; a search made by aladdin in a database for molecules that should have d dopaminergic activity recognized unexpected d dopamine agonist activity of existing molecules; a comparison of two superposition rules for d agonists was performed by aladdin resulted in a clear discrimination between active and inactive compounds; a compound set was designed that match each of the three lowenergy conformations of dopamine resulting in novel active analogues of known compounds; mimics of some sidc ~ . . 
p.piide beta turns were designed, in order to demonstrate that aladdin can find small molecules that match a portion of a peptide chain and/or backbone; results: lately a number of chemical information systems based on three-dimensional ( -d) molecular structures have been developed and used in many laboratories: a) concord uses empirical rules and simplified energy minimization to rapidly generate approximate but usually highly accurate -d molecular structures from chemical notation or molecular connection table input; b) chemical abstracts service (cas) has added -d coordinates for some million organic substances to the cas registry file; c) cambridge structural database system contains x-ray and neutron diffraction crystal structures for tens of thousands of compounds; d) maccs d developed by molecular design ltd., contains the standard maccs-i structures to which additional -d data, such as cartesian coordinates, partial atomic charges and molecular mechanics energy are added; maccs d allows exact match, geometric, submodel and substructure searching of -d models with geometric constrains specified to certain degree of tolerance; two -d databases are also available from molecular design that can be searched using maccs d [drug data report ( , models) and fine chemicals directory ( , models)]; e) aladdin (daylight chemical information systems) is also searches databases of -d structures to find compounds that meet biological, substructural and geometric criteria such as ranges of distances, angles defined by three points (dihedral angles) and plane angles that the geometric object must match. aladdin is one of a number of menus working within the framework provided by daylight's chemical information system. title: improved access to supercomputers boosts chemical applica-author: borman, s. c&en sixteenth st., n.w., washington dc , usa. source: c&en , ( ) , - . results: supercomputers have been much more accessible by scientists and engineers in the past few years in part as a result of the establishment of national science foundation (nsf) supercomputer centers. the most powerful class of supercomputers have program execution rates of million to billion floating-point operations per second, memory storage capacities of some ten million to miltion computer words and a standard digital word size of bits, the equivalent of about decimal digits. the following examples are given for the use of supercomputer resources for chemical calculations and modeling: a) modeling of key chromophores in the photosynthetic reaction center of rhodopseudomonas viridis showing the heme group, the iron atom and the chlorophyll which absorbs light and causes rapid transfer of electron to pheophtin and then to the quinone; modeling includes a significant part of the protein having about atoms out of a total of some , ; quant. struct.-act. relat. , - ( ) abstr. 
- b) modeling of transition state of reaction between chloride and methyl chloride including electron clouds and water molecules surrounding the reaction site; c) analysis of nucleic acid and protein sequences to evaluate the secondary structure of these biopolymers; d) construction of a graphical image of hexafluoropropylene oxide dimer, a model for dupont krytox high performance lubricant; e) calculation of the heats of formation of diaminobenzene isomers indicated that the target para isomer was kcal/mol less stable then the meta isomer byproduct therefore the development for its large scale catalytic synthesis was not undertaken (saving was estimated to be $ to $ million). , b) comparison of the newly defined eo, parameter with the taft-kutter-hansch e, (tkh e,) parameter showed characteristic steric effects of ortho-alkoxy and n-bonded planar type substituents (e.g. no,, ph); c) in various correlation analyses using retrospective data eo, satisfactorily represented the steric effects of ortho-substituents on reactivity and biological activity of various organic compounds; d) semi-empirical am calculations using a hydrocarbon model to study the steric effects of a number of ortho-substituents resulted in the calculation of the es value (difference in the heat of formation between ortho-substituted toluene and t-butylbenzene) which linearly correlated with the eo, and the tkh e, parameters; e) effects of di-ortho substitution on lipophilicity could be mostly expressed by the summed effect of the -and -position substituents; t) highly significant regression equations were calculated for the pk, values of di-ortho-substituted benzoic acids using various substituent parameters; g) quantitative analysis of the effect of ortho-substitution is difficult because it is a result of overlapping steric and electronic effects. title: calculation of partition coefficient of n-bridgehead com- ( i i ) is more lipophilic than propanolol- -sulphate (iv)]. fig. shows the relationship between lipophilicity and ph for the compounds (circle represents (i), triangle (ii), rhomboid (m) and square gv - ( ) abstr. - f (rekker's constant, characterizing hydrophobicity). results: a good agreement was found between the observed and calculated logp values of i ( . and . , respectively) and for iii. the hydrophobicity of i was found to be significantly lower than that of i ( . and . , respectively) . the large deviation was attributed to the surface reduction as a result of condensed ring formation in i. since interesting pharmacological activities have been reported for several derivatives of this type of compounds, the hydrophobicity of the unsubstituted lh-indolo[ , -c]quinoline has been calculated to be . : ( ) [interaction energy between a molecule and the binding site model was assumed to be the sum of its atomic contributions according to the expres-e , . , , ,~~(~) was the interaction energy parameter between the site region rand the atom-type of atom a and ag(b) was the total interaction energy for the binding mode b (binding mode was regarded as feasible when the molecule was in its energetically most favorable conformation)]. sion ag(b) = erelion reatomi a in r er,typc(a). where results: for development of the binding site model, first a simple geometry was proposed and agm- agm,calc i agm+) was calculated for the whole set of compounds. if the calculated binding energy of any of the compounds was outside of the above boundary, the proposed site geometry was rejected and a more complex one was considered. 
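the fragmental (rekker-type) logp estimate referred to by the descriptor f above is a sum of fragment constants over the fragments present in the molecule, plus correction terms. the fragment values below are rounded, illustrative numbers rather than an authoritative rekker table.

```python
# illustrative fragment constants (f values); published tables differ in detail
FRAGMENT_F = {
    "phenyl": 1.90,
    "CH2": 0.53,
    "OH_aliphatic": -1.44,
    "NH_secondary": -1.86,
}

def fragmental_logp(counts, correction=0.0):
    """logP estimate = sum(count * f) over fragments + optional correction terms
    (proximity effects, condensed-ring formation, etc.)."""
    return sum(FRAGMENT_F[name] * n for name, n in counts.items()) + correction

# e.g. a phenyl-CH2-CH2-OH skeleton
print(round(fragmental_logp({"phenyl": 1, "CH2": 2, "OH_aliphatic": 1}), 2))
```

in this scheme, the deviation attributed to surface reduction from condensed-ring formation corresponds to a negative correction term added to the fragment sum.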
this procedure had been repeated until all molecules in the set could be fitted within the experimental data range. as a result a d, five-region voronoi binding site model has been developed for the pahs containing a trigonal pyramid (rl) in the center and portions r rs having infinite volumes and indicated by boundary planes. region r, represented access to the solvent and regions r rs were blocked for binding ( fig. ) : pyrene is shown in its optimal binding mode with its atom barely touching the boundary surfaces and edges: calculations showed that benzene and other monoaromatic ring compounds should be very weak competitors for the b[a]p site. the model correctly predicted the binding energy of nine competitors outside of the training set. '% (wiener index calculated as the sum of all unique shortest distances between atoms in the hydrogen suppressed graph of the compound); (wiener index calculated as the sum of all geometric distances between atoms in the hydrogen suppressed molecule of the compound). results: the traditional d wiener number is defined as the sum of the lengths of all possible routes in the molecular graph. here the length is proposed to be calculated as the real three-dimensional length between atoms: this is the d wiener number. this number has many of the advantageous features of the related and very much studied d wiener number. additionally, it is highly discriminative and its use in quantitative structure-property relation studies (qspr) appears to be encouraging, according to the preliminary calculations. of these the most convincing is the set of statistical parameters for the linear correlation between the experimental and calculated enthalpy functions of dw the lower alkanes not shown here. three different models have been tried and in all cases the d wiener number seemed to be superior to the d one as it is reflected in (eqs. - ). a) gaba receptors in human mouse, rat and bovine brain tissues, membrane preparations and cellular uptake systems; b) gaba receptors in cat and rat spinal cord preparations; c) cultured astrocytes. as in the equations nearly all indicator variables had negative regression coefficients it was concluded that instead of searching for better analogs, the research should be directed toward degradable pro-gaba or pro-muscimol derivatives that are efficiently taken up into the central nervous system (cns). . (+ . ) irng + . ( ) title: synthesis and qsar of -aryl- -(~- -quinolyi/l-isoqui-noly ethyl)piperazines and some related compounds as hypotensive agents. authors ( ) based on eq. , an optimal logp is predicted (logpo = . ). the highest activity was produced by the -( -methylphenyl)- -(~- -qui-data determined: chemical descriptors: abtstr. - quant. struct.-act. relat. , - ( ) nolylethyl) piperazine, its logp value being near to the optimal value ( . ). l.og(bph) values calculated by eq. agree well with the observed ones. source: toxicology , ( ), - . compounds: , -dimethoxyphenol, -chlorophenol, . -dichlorophenol, -methyl- -nitropheno , , dichlorophenol, , , -trichlorophenol, , , , -tetrachlorophenol, , , -triiodophenol, pentachlorophenol. biological material: chinese hamster ovary (cho) cells. data taken from the literature: ezoc; ecsoc; eczoa; ec~oa [concentration (mmol l) of the compound leading to a or % inhibition of the cell growth or adenosine uptake, respectively]. data determined: ego; ecso [concentration (mmol l) of the compound leading to a or % inhibition of the na+/k+-atpase activity, respectively]. 
chemical descriptors: logp (logarithm of the partition coefficient in i-octanollwater); u (hammett's constant, characterizing the electron-withdrawing power of the substituent); e, (taft's constant, characterizing steric effects of the substituent); x (molecular connectivity index, calculated by koch's method). results: highly significant linear relationships were calculated between log (eczo) and logp (r = - . ). the relationship between log(ec,,) and u being less good (r = - . ). combining the two parameters the relationship has improved (eq. i): ( ) (logarithm of the partition coefficient in i-octanollwater); (hansch-fujita's substituent constant characterizing hydrophobicity); (hammett's constant, characterizing the electron-withdrawing power of the substituent); (sterimol steric parameter, characterizing the steric effect of the meta substituents); (rplc derived hydrophobic substituent constant, defined by chen and horv th, and extrapolated to x methanol); (indicator variable for the present for the absence of hydrogen bonding substituents). results: logk' values were determined for the benzenesulfonamides and correlated with chemical descriptors. a highly significant linear relationship between logk' and logp was calculated (eq. ): ( pk, (negative logarithm of the acidic dissociation constant); logp (logarithm of the partition coefficient in i-octanol/water). results: relationships between ki values and the chemical descriptors were investigated for cpz and its listed metabolites. relationship between log(l/ki) and logp was calculated (eq. ) no numerical intercept (c) is given: ( in spite of the complexity of the full mechanism of inhibition involving at least six transition states and five distinct intermediates, a significant linear regression equation was calculated for ki (eq. ): since the crystal structure of the acyl-enzyme complex, the acylation and deacylation rate were available, it was concluded that the inhibition begins with the histidine catalyzed attack of serine , at the benzoxazinone c , while the carbonyl oxygen occupies the oxyanion hole formed by glycine and serine . title: antifolate and antibacterial activities of -substituted authors: harris, n.v.; smith, c.; bowden, k. rhone results: it was shown earlier that binding of diaminoquinazolines to dhfr correlated with the torsional angle of the -amino group of the quinazoline nucleus. it was postulated that the interaction between the adjacent -substituent and the -amino group was very important in determining dhfr binding of the compounds possibly, because of the influence on the hydrogen-bond formed between the -amino group and a residue at the active site. the existence of such interaction in -substituted , -diaminoquinazolines were shown by measuring a , , and ~~ values. the ui and uor electronic parameters correlated well with chemical shifts of the -nh, groups (eq. ) but showed poor correlation for the -nh, group (eq. ), respectively: ( ) the equations suggest that the through-ring resonance interactions between the -substituent and the adjacent -amino group are disrupted by some other effects which might have significance for binding. a) an extensive set of compounds based on the nalidixic acid structure of type i. where r', r , r and r are various substituents; x and x* = c, n (for nalidixic acid: x = c, x = n, r' = et, r' = cooh. r = h, r = me); b) subset of (i) (set a) containing fifty two , -disubstituted -alky l- , -dihydro- -oxoquinoline- -carboxylic acids; compounds: abtstr. - quant. struct.-act. 
relat. , - ( ) c) subset of (i) (set b) containing one hundred and sixty two xylic acids; d) subset of (i) (set c) containing eighty five , -dihydr - -oxo- , -naphthyridine- -carboxylic acids with substituted azetidinyl, pyrrolidinyl and piperidinyl rings at position , fluorine at position and ethyl, vinyl or -fluoroethyl substituent at position . biological material: ps. aeruginosa v- , e. coli nihj jc- , s. the study showed that the most active compounds have fluorine in position , r can be a wide variety of nitrogen containing substituent and the best predictor for r is its lipophilicity. compounds: phytoalexins: pisatin, , a-dihydroxy- , -(methylenedioxy)pterocarpan, a, la-dehydropisatin, -hydroxy- , -(methylenedioxy)- a, a-dehydropterocarpan, (*)- -hydroxy- -methoxypterocarpan, (+)- -hydroxy- -zmethoxypterocarpan, (-)- -zhydroxy- -methoxypterocarpan, vestitol, sativan, formonenetin, coumestrol, '-o-methylcoumestro , phaseoilin, phaseollinisoflavan, '-methoxyphaseollin-isoflavan, glyceollin, a- a-dehydroglyceollin, tuberosin, a, ladehydrotuberosin. (capacity factor determined rp-hplc). calculated for logp of six reference compounds using their k' values (eq. ): ( ) n = r = . s not given f not given the lipophilicity of the phytoalexins were within the range of log p = . - . . it was found that the antifungal activity of similar compounds positively correlated with antifungal activity but no equation could be calculated for the whole set of compounds. it was suggested, however, that compounds with logp values higher than . were retained in the membranes, therefore phytoalexins with slightly lower lipophilicity, as well as greater fungitoxicity and systemic activity should be searched. certain structural features seemed to correlate with antifungal activity such as the presence of phenolic oh and benzylic hydrogen. it was suggested that the ability of the ortho oh group to form fairly stable intramolecular hydrogen bond may contribute to the greater stability of the shiff base hnctional group and the higher biological activity of the substances (various subsets required different equations). results showed that compounds with increasing lipophilicity and electron donating substituents at the -and -positions have high inhibitory activity. i-[( '-allyl- '-hydroxybenzilidene)amino]- -hydroxyguanidine was found to be the most active compound. the use of parameter focusing of the substituent hydrophobic constant and electronic constants was suggested for the selection of further substituents to design effective compounds. biological material: a) rabbits; b) rats; c) guinea pig. data taken from the literature: analogue results: prp, ecjoh, ecsob, ecsot values were measured and presented for the c,, paf analogue and compared with that of other analogues. c,, paf analogue was less potent than the c or cis paf analogues and equivalent to c,, paf analogue, showing that the activity decreased with lipophilicity. a highly significant parabolic relationship was calculated between log(rps) and cf (eq. ): the maximum activity was calculated cf = . , this corresponds to the cl paf. (energy minimizatipn of the compounds were calculated using the free valence geometry energy minimization method); (molecular shape analysis according to hopfinger was used to quantitatively compare the shape similarity of analogs in their minimum energy conformer states (within kcal/mol of their global minimum energy ( fig. 
shows the superposition of the reference conformations of the phenylalanine and tryptophane analogues). quant. struct.-act. relat. , - ( ) chemical descriptors: logp (logarithm of the partition coefficient in -octanol/water); (hansch-fujita's substituent constant characterizing hydrophobicity of a substituent on the aromatic ring and the hydrophobicity of the aromatic ring itself, respectively) ; [common overlap steric volumes (a ) between pairs of superimposed molecules in a common low energy conformation]; [dipole moment (debeyes) of the whole molecule and of the aromatic ring, respectively, calculated using the cndoi method] ; quantum chemical indices (partial atomic charges calculated by the cndoi method); - [torsion angles (deg) (fig. ) rotated during the conformational analysis of the compounds]. results: significant parabolic regression equations were calculated for the antigelling activity of the phenylalanine and tryptophan analogues (eq. and eq. , respectively): the different qsar for the phenylalanine and tryptophan analogues indicated that they interact with hemoglobin in different ways or at different sites. for the phenylalanine analogues the hydrophobicity of the side chain, the aromatic dipole moment and the steric overlap volume explained about %, % and % of the variance in antigelling activity, respectively. for the tryptophan analogues the square of the dipole moment or the steric overlap volume explained % or % of the variance in ra, respectively, being the two descriptors highly correlated. the results show that the tryptophan analogs have a relatively tight fit with the receptor site. title: s-aryl (tetramethyl) isothiouronium salts as possible antimicrobial agents, iv. in both eq. and eq. , log(l/c) depended primarily on electronic factors (eu' ) and only secondarily on hydrophobicity (ctobsd). a threshold logp value for the active isothiuronium salts was indicated, as the compounds with logp values between - . and - . were found to be totally inactive with the exception of the nitro-derivatives. title: comparative qsar study of the chitin synthesis inhibitory activity of benzoyl-ureas versus benzoyl-biurets. source: tagungsbericht , no. \ r* ponents explaining . %, . % and . % of the variance. fig. shows the minimum energy conformation of a highly active representative of the urea analogs (dimilin) with . a distance between the and carbon atoms. fig. shows the low energy conformation of the corresponding biuret analog with the two benzene rings in appro:imately the same plane and with the same c -c distance ( . a) allowing to fit a hypothetical benzoylurea phamacophore. the similarity of the regression equations and the modelling study supported the hypothesis that the benzoylbiurets act by the same mechanism as the benzoylureas. biological material: insect species: aedes aegypti, musca domestica, chilo suppressalis, hylemya platura, oncopeltus suppressalis, oncopeltus fasciatus, pieris brassicae, leptinotarsa decemlineata. [concentration of the benzoylurea derivative (various dimensions) required to kill % of insect larvae (a. aegypti, m. domestica, c. suppressalis, h. platura, . suppressalis, . fasciatus, p. brassicae or l. decemlineata]. data determined: lcso [concentration of the biuret analogue (ppm) required to kill % of insect larvae (a. aegypti or m. domestica]; molecular modeling (models of the compounds were built using molidea); conformational analysis (minimum energy conformations of the compounds were calculated using molecular mechanics method). 
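several of the abstracts above report parabolic (quadratic) hansch-type relationships between biological activity and hydrophobicity. the short python sketch below fits such a curve, log(1/c) = a·pi^2 + b·pi + c, by least squares and locates the optimum hydrophobicity; all numerical values are hypothetical and only illustrate the form of the analysis, not any of the cited results.

```python
# a minimal sketch of a parabolic Hansch-type fit on hypothetical
# hydrophobicity (pi) and activity log(1/C) values; numpy only.
import numpy as np

pi = np.array([-0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])        # substituent hydrophobicity (assumed)
log_inv_c = np.array([2.1, 2.8, 3.4, 3.9, 4.1, 4.0, 3.6, 3.0])  # activity log(1/C) (assumed)

# fit log(1/C) = a*pi^2 + b*pi + c
a, b, c = np.polyfit(pi, log_inv_c, deg=2)

fitted = np.polyval([a, b, c], pi)
ss_res = np.sum((log_inv_c - fitted) ** 2)
ss_tot = np.sum((log_inv_c - log_inv_c.mean()) ** 2)
r = np.sqrt(1.0 - ss_res / ss_tot)           # correlation coefficient of the fit
pi_optimum = -b / (2.0 * a)                  # hydrophobicity giving maximal predicted activity

print(f"log(1/C) = {a:.3f}*pi^2 + {b:.3f}*pi + {c:.3f}   (r = {r:.3f})")
print(f"optimum pi = {pi_optimum:.2f}")
```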
chemical descriptors: the thesis is devoted to the quantitative analysis of the uncoupling activity of substituted phenols using chemical descriptors in order to obtain further information on the mode of action of phenol uncouplers: the study of the partition coefficient of substituted phenols in liposomelwater system [p(l/w)] showed that (i) p(l/w) depended primarily on the logp value; (ii) influence of steric and electronic parameters depended on the type of the lipid involved; qsar analysis of uncoupling phenols in rat-liver mitochondria identified the relevant physicochemical parameters required for phenols being protonophore in inner mitochondrial membrane and quantitatively separated the potency as the protonophore in the inner mitochondrial membrane and the incorporation factor (iogp); protonophoric potency of substituted phenols was linearly related to uncoupling activity when certain critical physicochemical parameters of the experiment were taken into account; linear relationship was calculated between uncoupling activities of substituted phenols and related uncouplers in the mitochondria from the flight muscles of house flies and in spinach chloroplasts; the results indicated a shuttle type mechanism for the uncoupling action of substituted phenols. title: uncoupling properties of a chlorophenol series on acer cell - ( ) compounds: chlorinated phenols substituted with -c , -c , , , -cl, , , -ci, pentachlorophenol, -ci- -me. -c - -me, -c - , -me, -c - , -me, -ci- -ally , -c - -pr- -me, z , , -c , , , , -c , , -cl, , -c , , -ci, , , -c , -cl- -no , , -c - -no,, -ci- , -no,. biological material: acer pseudoplatanus l. cell suspensions. data determined: dso [concentration of the compound (pmolll) required for % uncoupling effect registered by measuring the oxygen consumption rate by polarography]; [minimal concentration of the compound (pnol/l) required for giving a full uncoupling effect]. chemical descriptors: logp (logarithm of the partition coefficient in -octanol/water); mr (molar refractivity); ed (steric parameter representing the perimeter of coplanary molecules projected onto the aromatic plane); a (angular parameter expressing the hindrance in the neighborhood of the hydroxyl group in positions and , respectively) ; ui, (hammett's constants, characterizing the electron-withdrawing power of the para-substituent and the ortho-or -nitro substituents, respectively). results: highly significant linear regression equations were calculated for the uncoupling effects of chlorophenols in acer cell suspensions: the equations for the uncoupling effects in the whole cells and those calculated previously for isolated mitochondria or chloroplasts possess similar structures. . (* . ) a, - . ( ) title: effects of ' substituents on diphenyl ether compounds. results: sar suggested that the space for the n' and nz substituents in the psi binding site is relatively large. the variation of the number of the carbon atoms of r on the photosynthetic inhibitory activity is shown in fig. (hansch-fujita's substituent constant characterizing hydrophobicity); chemical descriptors: . (* . ) ior + . (& . ) hb + . the biological activity of three out of the (dpe- , and ) substituted diphenyl esters were measured and listed. igr values were measured for the three compounds and compared with that of a- and methoprene. it was found that the position of acetamido group in the phenol moiety when it is in the ortho position abtstr. - - ( ) increases the lipophilicity of the compound with a logp value of . . 
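many of the equations summarized in these abstracts combine a hydrophobic term (logp or pi) with an electronic term (hammett's sigma) in a single multiple linear regression. the sketch below fits such a two-descriptor hansch equation; the descriptor values, activities and resulting coefficients are invented for illustration and are not taken from any of the cited studies.

```python
# a minimal sketch of a two-descriptor Hansch equation,
# log(1/C) = k1*logP + k2*sigma + k3, fitted to hypothetical values.
import numpy as np

logp  = np.array([0.5, 1.0, 1.4, 1.9, 2.3, 2.8, 3.1])            # assumed logP values
sigma = np.array([-0.17, 0.00, 0.06, 0.23, 0.37, 0.45, 0.54])    # assumed Hammett constants
y     = np.array([2.4, 2.9, 3.1, 3.6, 3.9, 4.3, 4.5])            # assumed activities, log(1/C)

X = np.column_stack([logp, sigma, np.ones_like(logp)])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
k1, k2, k3 = coeffs

pred = X @ coeffs
r = np.corrcoef(pred, y)[0, 1]
s = np.sqrt(np.sum((y - pred) ** 2) / (len(y) - 3))   # standard error (3 fitted parameters)

print(f"log(1/C) = {k1:.3f} logP + {k2:.3f} sigma + {k3:.3f}   (n={len(y)}, r={r:.3f}, s={s:.3f})")
```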
if the same acetamido group is in the para position, the logp values are . and . , respectively, and the compounds are comparatively ineffective. when both ortho positions are substituted with tertiary butyl groups (dpe- ) the logp value is relatively higher ( . ), which increases the lipophilicity of the compound and explains the pronounced igr activity at relatively low concentrations. results: a highly significant linear regression equation was calculated for the descriptors of r' (r' = i-pro was eliminated as an outlier) (eq. ): the compound with r' = eto, r = me and z = was found to be an effective, broad-spectrum insecticide. the replacement of the quaternary carbon with a silicon atom can simplify the synthesis of test compounds and thus can be advantageously utilized for the preparation of large compound sets for qsar studies. the data suggest that the initial electron loss from the given compounds is the preeminent factor affecting the reaction rate. a single mechanism is suggested over the entire range of reactivities, where a transition state with a considerable positive charge is involved. title: connection models of structure and activity: ii. estimation of electron-acceptor and electron-donor functions of active centers in the molecules of physiologically active materials. research institute of physiologically active materials, chernogolovka, moscow district, ussr (engl. summary). authors chemical descriptors: logp (logarithm of hydrophobicity). results: calculations of electron-acceptor and electron-donor enthalpic and free-energy factors on the basis of functional groups were made according to the principle of independence of active centers. data determined: linear correlation was found between the calculated and measured characteristics; the accuracy of the fitting was the same as the measurement error of the measured enthalpy and free energy values. the entropy might be calculated from enthalpy, gibbs energy and temperature. the good linear correlations between the measured and calculated data show that the functional-group approach might be used for these compound types. the substituent effects for the σ-acceptor/π-donor substituents (f, cl, br, i) were found to be very much larger for the c fsr compounds relative to the nitrobenzenes. these results indicate that the extra electron enters a σ*-orbital, which is localized on the c-r atoms. for the structure-solubility relationship of aliphatic alcohols, the study indicated that the solubility of aliphatic alcohols depends primarily on molecular connectivity ('x), the number of carbon atoms in the alkyl chain (n'), the number of hydrogens on the α-carbon atom (normal, iso, secondary, tertiary) and the degree of branching (vg): ( ) n not given r not given s not given f not given eq. was found to be a highly significant predictor of s (eq. ): the results support kier's, and kier and hall's, earlier models on the structural dependence of the water solubility of alcohols. -log(s) = 'x + ( )* sg - . title: linear free energy relationships for peroxy radical-phenol reactions. influence of the para-substituent, the ortho di-tert-butyl groups and the peroxy radical. k [reaction rate constant (m⁻¹s⁻¹) of the reaction between cumyl-, -phenylethyl- and t-butyl-peroxy radicals and ortho- or para-substituted phenol inhibitors]. data taken from the literature: chemical descriptors: u+ r.
ui, ur (charton's electronic substituent constant and its decomposition to inductive and resonance components, respectively for the characterization of the para substituent); (indicator variable for the presence or absence of the t-bu groups in , -position of the phenols). results: highly significant linear regression equations were calculated by stepwise regression analysis for logk in spite of the diverse data set originating from different laboratories using different peroxy radicals (eq. , eq. ): quant. struct.-act. relat. , - ( ) ( ) n = r = . s = . f = . i c~" was not selected by stepwise regression indicating that the orthodi-t-bu substitution had no significant effect on the rate of hydrogen abstraction from phenols by the radicals. the form of the equations for different subsets of the phenols and radicals indicated that the reaction mechanism was the same for the different peroxy radicals. title: a fractal study of aliphatic compounds. a quantitative structure-property correlation through topological indices and bulk parameters. the following descriptors are considered as 'bulk parameters': vw (van der waals volume, calculated from the van der waals radii of the atoms); mw (molecular weight); sd (steric density of the functional group). results: highly significant equations are presented for calculating vw, sd and mw r values ranging from . to . , other statistics and the number of investigations are not given. q and values calculated by these equations were introduced to the equation given above and the physicochemical properties were calculated. the observed and calculated iogv,, d and p values are presented and compared for the alkanes, alcohols, acids and nitriles. the observed and calculated physicochemical parameters agreed well. fractal nature of the alkyl chain length was discussed and a relationship was presented between the fractal-dimensioned alkyl chain length and a generalized topological index. title: application of micellar liquid chromatography to modeling of organic compounds by quantitative structure-activity relationships. chemical descriptors: logp (logarithm of the partition coefficient in -octanol/water). results: in a series of experiment with the listed compounds micellar liquid chromatography has been applied to model hydrophobicity of organic compounds in a biological system. the measured logk' values of the substituted benzenes were found to be superior predictors of logp. fig. shows the plot of logp versus logk' of the substituted benzenes. highly significant correlation was calculated for the logk' values of phenols (open squares) (n = , r = . ), for the rest of the compounds (full squares) (n = , r = . ) and for the entire set (n = , r = . ). further experiments using various surfactant types in the mobil phase suggested that logk' values generated on a lamellar phase may be better predictors of hydrophilicity than logp obtained from binary solvent systems. title: isoxazolinyldioxepins. . the partitioning characteristics and the complexing ability of some oxazolinyldioxepin diastereoisomers. authors quant. struct.-act. relat. , - ( ) source: j. chem. soc. perkin trans. i . no. , - compounds: oxazolinyldioxepin derivatives of type i and , where x = h, f, ci, cf or ch . data determined: logk' [logarithm of the capacity factor, measured by reversed-phase liquid chromatography (rplc)]; mep (molecular electrostatic computed by geesner-prettre and pullman's vsspot procedure). 
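the micellar liquid chromatography abstract above correlates measured logk' values with logp separately for the phenols, for the remaining compounds and for the pooled set. the sketch below shows that kind of subset correlation; the logk' and logp values are invented for illustration only.

```python
# a minimal sketch of correlating chromatographic log k' with log P for two
# subsets (e.g. phenols vs. the remaining solutes) and the pooled set; data assumed.
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

logk_phenols = np.array([-0.42, -0.20, 0.05, 0.31, 0.58])
logp_phenols = np.array([ 1.46,  1.95, 2.40, 2.95, 3.50])

logk_others  = np.array([-0.10, 0.22, 0.47, 0.75, 1.02, 1.30])
logp_others  = np.array([ 2.13, 2.69, 3.15, 3.68, 4.20, 4.72])

print("phenols :", pearson_r(logk_phenols, logp_phenols))
print("others  :", pearson_r(logk_others, logp_others))
print("pooled  :", pearson_r(np.concatenate([logk_phenols, logk_others]),
                             np.concatenate([logp_phenols, logp_others])))
```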
chemical descriptor: logp (logarithm of the partition coefficient in -octanollwater). results: the logk' and logp values were measured for the two type of diastereomers and a highly significant linear relationship between logk' and logp was presented (r = . ): the meps of i and 's f-derivatives were determined and presented, "a" for type i, "b" for type ii: the complex forming ability of the diastereoisomers with mono-cations was investigated and explained in terms of the structures and electronic properties of the compounds. results: linear relationships are presented plotting y versus n for the hydrophobic sorbents (fig. ) and the slopes of these straight lines are suggested for experimental determination of . q, values. ~ values determined by the suggested method are listed. while no linear relationships were found between kd and n, y depend linearly on n for the test compounds [alkanols (i). alkane diols ( ) results: three linear models were fittedwith independent variabies of log(p), mr and o x . the best fitting parameters (independent of composition) were obtained from the following models (no statistical characteristics is presented): ( ) ( ) the two types of correlations (with structural and with moving phase parameters) together might be used for the optimization of chromatographic separation of complex mixtures of sulphur-containing substances. (zero order molecular bonding type connectivity index); ig(k) = a + a p' + a logp + a p' logp ig(k) = a + a tg(cm) + a, logp + a tg(cm) logp - - the kd values derived by the suggested method were compared by kd values calculated by martin's rule and a good agreement was found. title: mathematical description of the chromatographic behaviour of isosorbide esters separated by thin layer chromatography. compounds: isosorbide esters: isosorbide (l), - -monoacetate, - -monoacetate, - -mononitrate, - -mononitrate, l-diacetate, - -nitro- -acetate, - -nitro- -acetate, l-dinitrate. rn, r~i [retention factors obtained by thin-layer chromatography in benzene/ethylacetate/isopropanol/ ( : : . ) and in dichloromethane/diisopropylether/isopropanol ( : : : ) eluent systems, respectively]. data determined: chemical descriptors: (information index, based on the distribution of the elements in the topological distance matrix); (the geometrical analogue); (randic connectivity index); (maximum geometric distance in the molecule); compounds: highly diverse chemicals. grouped according to the following properties: contains (ester or amide or anhydride) or (heterocyclic n) or ( bound to c) or (unbranched alkyl group with greater than carbons). data determined: aerud chemical descriptors: (aerobic ultimate degradation in receiving waters). v x x, nci m, (molecular weight). results: the paper has aimed at developing a model for predicting aerud. the data sets were collected from biodegradation experts. the experts estimated the biodegradationtime that might be required for aerud on the time scales of days, weeks, months and longer. (valence second order molecular connectivity index); (fourth order path/cluster connectivity index); (number of covalently bound chlorine atoms); highly diverse chemicals but typical in wastewater treatment systems were examined. zero to six order molecular and cluster connectivity indexes were calculated using computer programs wrinen in for-tran for ibm pc/xt. the best fitted linear regression model is: [first order rate constant: transport or transformation parameter (mol/pa. h)]. 
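several of the abstracts above rely on kier-hall molecular connectivity indices computed from the hydrogen-suppressed molecular graph. the sketch below computes the simple (non-valence) zero- and first-order indices for an assumed example molecule (2-methylbutane); it is an illustration of the index definition, not the fortran programs mentioned above.

```python
# a minimal sketch of zero- and first-order molecular connectivity indices
# (simple, non-valence Kier-Hall indices) from a hydrogen-suppressed graph.
# the molecule below (2-methylbutane) is an illustrative assumption.
from math import sqrt

# hydrogen-suppressed carbon skeleton as an adjacency list: atom -> bonded atoms
mol = {
    "C1": ["C2"],
    "C2": ["C1", "C3", "C5"],
    "C3": ["C2", "C4"],
    "C4": ["C3"],
    "C5": ["C2"],
}

delta = {atom: len(neigh) for atom, neigh in mol.items()}     # vertex degrees

chi0 = sum(1.0 / sqrt(d) for d in delta.values())             # zero-order index

# collect each bond once, then sum (delta_i * delta_j)^(-1/2) over bonds
bonds = {tuple(sorted((a, b))) for a, neigh in mol.items() for b in neigh}
chi1 = sum(1.0 / sqrt(delta[a] * delta[b]) for a, b in bonds)  # first-order index

print(f"0-chi = {chi0:.3f}, 1-chi = {chi1:.3f}")   # ~4.284 and ~2.270 for 2-methylbutane
```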
results: the qwasi fugacity model describes the fate of a (contaminating) chemical, such as organo-chlorine compounds, pesticides or metals. the lake model consists of water, bottom and suspended sediments, and air. the model includes the following processes: advective flow, volatilization, sediment deposition, resuspension and burial, sediment-water diffusion, wet and dry atmospheric deposition, and degrading reactions. the steady state solution of the model is illustrated by application to pcbs in lake ontario using the equilibrium criterion of fugacity as the variable controlling environmental fate of the chemical. the applications are based upon inaccurate data. use of fugacity is inappropriate for involatile chemicals, such as metals, or ionic species, because fugacities are calculated from a basis of vapor phase concentrations. for these materials the use of the equilibrium concentration activity is more appropriate since activities are calculated from a water phase base. thus, a new equilibrium criterion, termed the "aquivalent" concentration (or equivalent aqueous concentration) is suggested as being preferable. this concentration has the advantage of being applicable in all phases, such as water, air and sediments. the formalism developed in the qwasi approach can also be applied, making possible a ready comparison of the relative rates (and thus, the importance) of diverse environmental fate processes. all these are illustrated by applying the model on a steady state basis to quant. struct.-act. relat. , - ( ) abstr. - the pcb example and to the fate of lead in lake ontario. the estimated and observed concentrations of pcbs and lead in lake ontario agree well: the largest difference in the case of pcbs in rain amounts to a factor of three. in other phases, and especially in the case of lead, the difference is usually less than per cent. although in order to judge the biological effects of a contaminant it is of fundamental importance to know its transport and transformations, and the present model has been proven to useful to describe this; direct biological implications are not deduced at the present stage. the similar slopes of the equations show that these compounds exert their cytotoxicity primarily by alkylation. while the majority of the tested compounds showed no hypoxia-selective cytotoxicity (ratio awa .o), the -n and -no substituted compounds were more toxic to uv cells under hypoxic conditions (ratio = . for the compound with r = -n ), indicating cellular reduction of the nitro-group. the measured hypoxic selectivity of the -no and -n , substituted compounds was a fraction of the calculated ratio (measured fold and calculated fold by eq. between the -n and -nh, substituted compounds). the main reason for the difference between the calculated and measured hypoxic selectivity is suggested to be the low reduction potential of the -n and -no, groups (e = - mv and e = - mv, respectively). title: quantitative structure-activity relationships for the cytotoxici- (hammett's constant, characterizing the electron-withdrawing power of the substituent); (hammett's polar electronic constant characterizing the electron withdrawing power of the substituent for anilines). results: significant linear regression equations were calculated for the halflife (t / ), growth inhibition ( ) and clonogenicity data (ctlo) using hammett constants (eq. , eq. , eq. ): ( ) n = r = . s = . f not given - ( ) type, test animals, the mean level of toxicity and the form of the equation. e.g. 
analysis of the toxicity of phenols showed a transition between simple dependence from logp to exclusive dependence to reactivity factors indicating two separate classes of phenol toxicity (eq. for mouse i.p. toxicity, and eq. for rat oral toxicity): results: an additivity model, plc = cni a ti -to, where ni is the number of ith substituents in a benzene derivative, at; is the toxicity contribution of the ilh substituent and to is the toxicity of the parent compound (benzene), was used for predicting toxicity of lo similar correlation was found between mutagenicity and u (fig. ) indicating that both biochemical and chemical processes involve a ph dependent nucleophilic ring opening (including the protonation of the aziridin nitrogen as rate controlling step) and in.,uenced by electronic and steric factors (equation not given). resonance effect). (electron density on n in the homo calculated by the mndo). results: highly significant linear relationships between log( ic) and logp, &homo (eq. ); iogp, qhomo (eq. ) are presented indicating that the more hydrophobic and more electron-rich triazines are more active according to the ames test: substructures [a total of fragments were generated from the compounds using the program case (computer-automated structure evaluation) system]. results: a comparative classification of the compounds were performed using case for identifying molecular fragments associated with cancerogenic activity (biophores) as well as deactivating fragments (biophobes). case identified biophores and biophobes from the fragments of the compounds with a less than . % probability of being associated with carcinogenicity as a chance. the sensitivity and specificity of the analysis was unexpectedly high: . and . , respectively. the predictive power of case biological material: chemical descriptors: was tested using the identified biophores and biophobes on a group of chemicals not present in the data base. the ability of case to correctly predict carcinogens and presumed non-carcinogens was found to be very good. it was suggested that non-genotoxic carcinogens may act by a broader mechanism rather than being chemical specific. compounds: compounds of type i where r = h, ch , c,h , czh , c h , czh , c h , c h , c h , sc h ; compounds of t y p i where r = h, c~heoh, sc h h. data determined: p t pi" (a priori probability of appearance of the i-th active compound); (a priori probability of appearance of the i-th nonactive compound). (the first order molecular connectivity index); (the second order molecular connectivity index); (information-theoretic index on graph distances calculated by the wiener index according to gutmann and platt); chemical descriptors: (rank of smell, where the rank is defined to equal with one for the most active compound). results: the authors' previously proposed structure-activity relationship approach was applied for structure-odor relationship. different compounds of groups i and i were examined using the topological indices w, r, i, x as independent variables and v as the dependent variable. the best correlation was obtained between r and v ( fig. i) results: logp and iogp, values were determined for the nitroimidazole derivatives. significant linear equations were calculated, the best one related for logp and logp,r (eq. ): ( ) logp = . logp,i + . n = r = . 
s not given f not given chemical descriptors: logp descriptors (logarithm of the partition coefficient in i-octanoll water); ( indicator variables taking the value of for the presence of cr/p-hydroxy , a-fluoro, a-methy , afluoro, -hydroxy, i a-fluoro. , -acetonide, -deoxy, -acetate, -propionate, i-butyrate or -isobutyrate, respectively). results: a data set of steroids were compiled after removing those ones containing unique substituents. the set was divided into two categories of approximately equal membership by defining a threshold logp value of . . a descriptor set was created and the non-significant ones were eliminated using the weight-sign change feature selection technique. linear leaning machine was applied to calculate the weight vectors and complete convergence was achieved in the training procedure. the predictive ability of the linear pattern classifier thus obtained was tested using the leave one out procedure. the predictive ability was found to be i . %. the predictive ability of the approach was found to be good and improvement was expected with larger data set. steroids, however, containing new substituents would have to be subjected to a repeated pattern-recognition calculation. lengthlbreadth descriptors ( descriptors)]. results: for modeling the shape of the compounds, simca was used: the approach was to generate disjoint principal models of clustered points in a multidimensional space. the number of clusters for each structure was determined by using hierarchical cluster analysis. fig. shows the orthogonal views of a schematic representation of the sim-ca models for the atom clusters in senecionine: each compound in turn was used as a reference structure. every other structure was superimposed on the reference using the ends of the corresponding binding moment vector plus the ring nitrogen atom. canonical correlation analysis was used for calculating the correlation between the five biological activity data and shape descriptors of structures. the best correlation was observed for jurs' shadow descriptors. the msa and simca descriptors were comparable. the model was able to express both the amount and direction of shape the differences, and also for encoding relevant information for correlation with the biological activity. compounds: n-substituted -methyl- -nitropyrazole- -carboxamides ( ), n-substituted -amino- -methylpyrazole- -carboxamides (iii), n-substituted -methyl- -diazopyrazole- -carboxamides and n-piperidiny -n-( , -dimethyl- -nitrosopyrazol- -yl)-urea (vii) . title: structure-activity correlations for psychotomimetics. . phenylalkylamines: electronic, volume, and hydrophobicity parameters. abtstr. quant. struct.-act. relat. , - ( ) data determined: edso conformational analysis g [dose of the compound (mg/kg) which causes % of the rats which were trained on rngfkg reference compound to respond as they would to the training drug]; (geometries of the compounds were calculated using mmf from starting geometries determined by the program euclid). discriminant analysis resulted in a function containing six variables which misclassified only one compound in the training set. when the data was repeatedly split randomty into a training and a test sb, the misclassification rate was % ( out of classifications). fig. shows the plot of the two canonical varieties from discriminant analysis visualizing the separation of hallucinogenic and nonhallucinogenic derivatives (meaning of symbols are the same as in fig. ). 
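the steroid abstract above trains a "linear learning machine" (an error-correcting linear classifier) on binary-coded descriptors and estimates its predictive ability with the leave-one-out procedure. the sketch below reproduces that workflow on invented descriptor vectors and class labels; it illustrates the method only and is not the published calculation.

```python
# a minimal sketch of a linear learning machine (perceptron with an added
# bias term) evaluated by leave-one-out; descriptors and labels are assumed.
import numpy as np

X = np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1],
              [1, 1, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 1]], float)
y = np.array([1, 1, 1, -1, 1, -1, -1, -1])   # +1: logP above threshold, -1: below (assumed)

def train_llm(X, y, passes=200):
    """Error-correction training of a linear discriminant w.x + b."""
    Xa = np.hstack([X, np.ones((len(X), 1))])     # augment with a bias component
    w = np.zeros(Xa.shape[1])
    for _ in range(passes):
        updated = False
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:                # misclassified -> correct the weight vector
                w += yi * xi
                updated = True
        if not updated:                           # complete convergence reached
            break
    return w

def predict(w, x):
    return 1 if w @ np.append(x, 1.0) > 0 else -1

# leave-one-out estimate of predictive ability
hits = sum(predict(train_llm(np.delete(X, i, 0), np.delete(y, i)), X[i]) == y[i]
           for i in range(len(X)))
print(f"leave-one-out predictive ability: {100.0 * hits / len(X):.1f} %")
```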
multiple regression analysis (mra) was found to be the most useful for identifying relevant and discarding redundant variables. highly significant parabolic regression equations were calculated for the human activity data (a) (n = , r ranging from . to . , f not given) and for animal data (ed50) (n = , r = . and r = . , f not given). eight descriptors were found to be highly significant. among these, the importance of directional hydrophobicity and volume effects indicated that steric and hydrophobic interactions participate in the interaction with the receptor. mra indicated a strong interaction between the meta- and para-substituents and the formation of a charge-transfer complex by accepting charge. data did not support the hypothesis that the human activity data and animal ( ) data taken from the literature: sweet(n) (sweet taste of the compound, where n represents the number of times a sample has to be diluted to match the taste of % sucrose solution); [class-fit distances of a compound to the sweet and nonsweet classes (dimension not given) calculated by principal component analysis]. chemical descriptors: mr (molar refractivity); b1, l (sterimol steric parameters, characterizing the steric effect of the substituent); r (hansch-fujita's substituent constant characterizing hydrophobicity); um, up (hammett's constants, characterizing the electron-withdrawing power of the substituent in meta- and para-position, respectively). results: no statistically significant regression equation was obtained by the hansch-fujita approach using the chemical descriptors listed. d', d. principal component analysis of the data set extracted principal components, explaining % of the variance of the sweet compounds. the sweet compounds clustered in a relatively confined region of the d space whereas the tasteless and bitter compounds were scattered around the sweet compounds. a coomans plot, however, indicated, when plotting d' versus d, that sweet and nonsweet compounds could be well separated along the d' axis (fig. ). title: conformation of cyclopeptides. factor analysis. a convenient tool for simplifying conformational studies of condensed poly-ring systems. prolyl-type cyclopeptides. authors conformations of the six-membered dop-ring family may be reproduced by means of a superposition of the canonical twist (t), boat (b) and chair (c) forms. physically, the coefficients have the meaning of relative contributions (amplitudes) of the t, b and c forms to the total conformation of the ring. here factor analysis (fa) and principal component analysis were used in conformational studies of various x-ray conformers of dop/pyr. a correspondence was found between the factors identified and rpt, when the rings are considered separately. this fact allows a physical interpretation of the fa results: two or three puckering variables were found for the dop and pyr rings, expressing the absolute amplitudes of the basic pucker modes. subsequent fa treatment of the condensed system revealed five conformational variables necessary and sufficient to describe the two-ring puckering completely. each of the basic pucker modes defines a unique pattern of conformational variation of the whole two-ring system. the results demonstrate that fa is a powerful technique for analysing condensed poly-ring systems not amenable to the rpt treatment. title: preprocessing, variable selection, and classification rules in the application of simca pattern recognition to mass spectral data.
authors: dunn m*, w.j.; emery, s.l.; glen, g.w; scott, d.r. college of pharmacy, the university of illinois at chicago south wood, chicago il , usa. source: environ. sci. technol. , ( ) , - . compounds: a diverse set of compounds observed in ambient air classified as ( ) nonhalogenated benzenes; ( ) chlorine containing compounds; ( ) bromo-and bromochloro compounds; ( ) aliphatic hydrocarbons; ( ) miscellaneous oxygen-containing hyhocarbon-like compounds (aliphatic alcohols, aldehydes and ketones). pattern recognition was applied to autocorrelation-transformed mass spectra of the compounds using providing chemical class assignment for an unknown]; m/z;, mlz,, m/zl (first three principal components scores of simca). results: simca pattern recognition method was applied on a training set of toxic compounds targeted for routine monitoring in ambient air. the analysis resulted in very good classification and identification of the compounds ( % and %, respectively). however, the training procedure proved to be inadequate as a number hydrocarbons from field samples (gcims analysis) were incorrectly classified as chlorocarbons. a new approaches for the preprocessing (scaling the ms data by taking the square root of the intensitiesfollowed by autocorrelation transform), variable selection (only the most intense ions in the ms spectrum were taken), and for the classification rules of simca has been introduced to improve results on real data. fig, and as a result of the revised rules the classification performance has been greatly improved for field data ( - %). title: a qsar model for the estimation of carcinogenicity. - ( ) it was suggested that the mechanism of mutagenicity of the cimeb[a]ps measured in the ames test is probably more complex than the simple reactivity of carbocation intermediates. [dipole interaction potential (dimension not given)]; [molecular electrostatic potential (kcallmol) in a plane]; [molecular electrostatic field map (kcall mol), mapping the e(r)j values of a molecule surface in a plane, predicting the directions and energies of the interactions with small polar molecules at distances greater than the van der waals sphere]; (construction of surfaces corresponding to a given value of potential); d mep and mef maps ( d maps weregenerated by superimposing the equipotential curves corresponding to a value of kcallmol in the case of mep. and . kcallmol in the case of mef, computed in several planes perpendicular to the mean plane of the analogues in low energy conformations, stacking over each other in a distance). results: the three vasopressin analogues differ significantly in their biological activities. both mep and mef maps of the of the biologically active (mpa')-avp and (cpp')-avp are similar, but they are different from that of the inactive (ths')-avp. fig. , fig. abtstr. - quant. struct.-act. relat. , - ( ) a new method for calculating the points of the equipotential curves was also presented. crystal structure (crystal coordinates of the molecules were determined by x-ray diffraction methods). data taken from the literature: [electrostatic molecular potential (ev) were calculated using am- type semiempirical mo calculations]; conformational analysis [minimum energy conformations were calculated using x-ray structures as input geometries followed by am -method (fletcher-powell algorithm)]. chemical descriptors: ui, u [rotational angles of the n-ally group (deg)]. results: four similar energy minima were located by am-i calculations for both namh+ and nlph+. 
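the simca mass-spectral abstract above preprocesses spectra by square-root scaling of the intensities, retention of only the most intense ions, and an autocorrelation transform. the sketch below implements one plausible form of that preprocessing; the toy spectrum, ion cut-off, lag range and final scaling are assumptions, since the exact settings of the original work are not reproduced here.

```python
# a rough sketch of the described preprocessing: square-root scaling,
# retention of the most intense ions, and an autocorrelation transform.
import numpy as np

def preprocess(mz, intensity, n_keep=6, max_lag=20):
    mz = np.asarray(mz)
    inten = np.sqrt(np.asarray(intensity, float))        # square-root scaling
    if len(inten) > n_keep:                               # keep only the most intense ions
        keep = np.argsort(inten)[-n_keep:]
        mask = np.zeros(len(inten), bool)
        mask[keep] = True
        inten = np.where(mask, inten, 0.0)
    # place intensities on a unit m/z grid and autocorrelate over a range of lags
    grid = np.zeros(int(mz.max()) + 1)
    grid[mz.astype(int)] = inten
    auto = np.array([np.dot(grid[:len(grid) - lag], grid[lag:])
                     for lag in range(1, max_lag + 1)])
    return auto / auto.max() if auto.max() > 0 else auto  # simple scaling to [0, 1]

# hypothetical toy spectrum (m/z, intensity)
mz = [27, 29, 39, 41, 43, 55, 57, 71, 85, 99]
intensity = [12, 30, 8, 55, 100, 20, 64, 33, 15, 5]
print(np.round(preprocess(mz, intensity), 3))
```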
the energy minima for the protonated nam + and nlph + were the most populated ones with conformational enantiomers relative to the involved n-allyl-piperidine moiety ( % and %, respectively). it was shown that the isopotential curve localization of emp contour maps were very similar for the corresponding conformations of both nlph * and namh + indicating that both molecules should interact a the same anionic sites of the opioid receptor, ( p morphine receptor). fig. and fig. shows the emp contour maps of namh' and nlph + , respectively, in their preferred conformations: compounds: esfenvalerate (ss and sr isomers) of type i, -phenoxybenzyl -( -ethoxyphenyl)- , , -trifluoropropyl ether (r quant. struct.-act. relat. , - ( ) abstr. isomer) ( ). a-cyano- -phenoxybenzyl -( -chlorophenyl)- -methylpropionate (s isomer) (iii) and deltamethrin (iv). " cn data determined: conformational analysis (minimum energy conformations of the compounds in vacuum were calculated using am molecular orbital method and broyden-fletcher-goldfarb-shanno method integrated into mopac); (root mean square, indicating the goodness of fit between two conformers in d); (logarithm of the partition coefficient in i-octanol/water estimated using clogp program); [heat of formation of the most stable conformer (kcallmol)]. rms logp e chemical descriptors: - results: it was assumed that the d positions of the benzene rings of the pyrethroids are decisive for good insecticidal activity. the lower energy conformers of (i) (ss and sr isomers), ( ) (r isomer), ( ) (s isomer) and deltamethrin (iv) were compared by superimposition. inspite of their opposite configuration, esfenvalerate (i) (ss isomer) and the new type pyrethroid i (r isomer) were reasonably superimposed, indicating that the positions of the benzene rings in space are important and the bonds between them are not directly determinant (fig. ) crystal structure (x-ray crystal coordinates of penicillopepsin was obtained from the protein data bank). data determined: electrostatic potential [electrostatic potential of the protein atoms (kcallmol) is calculated using the partial charges in the amber united atom force field); docking (the dock program was used to find molecules that have a good geometric fit to the receptor). results: a second generation computer-assisted drug design method has been developed utilizing a rapid and automatic algorithm of locating sterically reasonable orientations of small molecules in a receptor site of known d structure. it includes also a scoring scheme ranking the orientations by how well the compounds fit the receptor site. in the first step a large database (cambridge crystallographic database) is searched for small molecules with shapes complementary to the receptor structure. the second step is a docking procedure investigating the electrostatic and hydrogen bonding properties of the receptor displayed by the midas graphics package. the steps of the design procedure is given. the algorithm includes a simple scoring function approximating a soft van der waals potential summing up the interaction between the receptor and ligand atoms. directional hydrogen bonding is localized using electrostatic potential of the receptor at contact points with the substrate. the shape search of (i) was described in detail. a new method has been developed for the construction of a hypothetical active site (hasl), and the estimation of the binding of potential inhibitors to this site. 
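the docking abstract above scores ligand orientations with a simple function approximating a soft van der waals potential summed over receptor-ligand atom pairs. the sketch below shows a pairwise score of that general kind; the coordinates, equilibrium distance, well depth and softened functional form are all assumptions and do not correspond to the actual dock parameters.

```python
# a minimal sketch of a softened van-der-Waals-like pairwise score summed over
# receptor-ligand atom pairs; more negative totals indicate a better steric fit.
# all numerical parameters below are assumptions.
import numpy as np

def soft_vdw_score(receptor_xyz, ligand_xyz, r_eq=3.4, eps=0.1):
    """Sum of softened 6-3 Lennard-Jones-type terms over all atom pairs."""
    score = 0.0
    for a in ligand_xyz:
        d = np.linalg.norm(receptor_xyz - a, axis=1)
        d = np.clip(d, 1.0, None)              # soften the steep short-range wall
        ratio = r_eq / d
        score += np.sum(eps * (ratio ** 6 - 2.0 * ratio ** 3))   # minimum -eps at r_eq
    return score

receptor = np.array([[0.0, 0.0, 0.0], [3.5, 0.0, 0.0], [0.0, 3.5, 0.0]])   # assumed site atoms
ligand   = np.array([[1.8, 1.8, 1.0], [2.5, 2.5, 1.5]])                    # assumed orientation
print(f"orientation score: {soft_vdw_score(receptor, ligand):.3f}")
```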
the molecules were quantitatively compared to one another through the use of their hasl representations. after repeated fitting one molecule lattice to another, they were merged to form a composite lattice reflecting spatial and atomic requirements of all the molecules simultaneously. the total pki value of an inhibitor was divided to additive values among its lattice points presumed to account for the binding of every part of the molecule. using an iterative method, a self consistent mathematical model was produced distributing the partial pki values of the training set in a predicting manner in the lattice. the hasl model could be used quantitatively and predictively model enzyme-inhibitor interaction. a lattice resolution of - a was found to be optimal. a learning set of e. coli dhfr inhibitors were chosen to test the predictive power of the hasl model at various resolutions. binding predictions (pki values) were calculated for the entire inhibitor set at each resolution and plotted separately for the learning and test set members at . a resolution (fig. i) : . ala-). data determined: molecular models ( d structure of the molecules have been constructed and displayed using the program geom communicating with cambridge x-ray data bank, brookhaven protein data bank, sandoz x-ray data bank sybyl and disman; quant. struct.-act. relat. , - ( ) abstr. - distance geometry [nuclear overhauser enhancements (noe) and spin-spin coupling constants were measured by d nmr methods, semiempirically calibrated as proton-proton distance (a) and dihedral angle (deg) constrains and used in distance geometry calculations (disman) andlor in restrained molecular dynamics calculations to determine d structure of molecules in solution]; crystal structure (atomic coordinates of the compounds were determined by x-ray crystallography); rms [root mean square deviation (a) of the corresponding atoms of two superimposed molecular structures]. chemical descriptors: results: distance geometry calculations were carried out using geom and disman, to identify all conformations of the compounds in solution which were consistent with experimental data obtained by noe measurements. the application of geom was demonstrated by modelling cycbsporin a with and without a limited set of h-bond constrains and with a full nmr data set. in case of cyclosporin a, randomly generated linear analogues of the cyclic structure were formed from the monomers. geometric cyclization was achieved using disman, resulting in many different but stereochemically correct conformations of cyclosporin a. superposition of the backbones of the best cyclic conformers showed rms deviations between . a and . a. fig. shows the superposition of a disman generated ring conformation (thick line) with its x-ray structure of cyclosporin a (thin line) with h-bond constraints (rms = . a): fig. distance and dihedrl-angle constraints have been extracted from noe and vicinal coupling data and used to generate the conformation and the cyclization conditions of the hexapeptide (fig. ) (position of residual distance violations and their direction is shown by arrows): although the described method is not exhaustive, it explores a much greater variety of initial structures than had been previously possible. title: a new model parameter set for @-lactams. authors: durkin, k.a.; sherrod, m.j.; liotta*, d. department of chemistry, emory university atlanta gl , usa. source: j. org. chem. , ( ) , - . compounds: @lactam antibiotics of diverse structure. 
data taken from the literature: crystal structures (crystal coordinates of the β-lactams were determined using the x-ray diffraction method). results: superposition of the x-ray structures and the calculated geometries of β-lactams using the original parameter set in the mm force field in model gave satisfactory rms values. a lack of planarity of the β-lactam ring and significant differences in the calculated bond lengths and angles around the β-lactam nitrogen were detected, however [ = , s, so, so ]. in order to improve the fit, a new atom type with new parameters has been developed for the β-lactam nitrogen (wild atom type coded with symbol in model). the new parameters were evaluated by comparison of the calculated and x-ray geometries of the β-lactams. using the new parameter set, the x-ray data were satisfactorily reproduced except for the sulfone β-lactams. it was indicated that the ampac data were not suitable for the sulfones, as hypervalent sulfur compounds are not well described by the am hamiltonian. an additional parameter set was, however, derived, giving good structural data unrelated to the ampac information. it is not known which of the new parameter sets is the best for the sulfone β-lactams. title: a molecular modelling study of the interaction of compounds with noradrenaline. biological material: a) cdna of the hamster lung β-adrenergic receptor and β-adrenergic receptor kinase; b) bacterio-, ovine- and bovine-rhodopsin and rhodopsin kinase. protein primary sequence (the amino acid sequence of the hamster lung β-adrenergic receptor has been deduced by cloning the gene and the cdna of the hamster lung β-adrenergic receptor); (the cosmic molecular modeling program was used for modeling α-helices in a hydrophobic environment using phi and psi torsion angles of - ° and - °, respectively, according to blundell et al.); (the two highest-lying occupied and the two lowest-lying unoccupied orbitals, respectively, calculated using the indo molecular orbital method); crystal structure (the crystal coordinates of noradrenaline have been determined by x-ray diffractometry); conformation analysis (the minimum energy conformation of the β-adrenergic receptor model has been calculated using the molecular mechanics method). results: strong experimental evidence suggested that rhodopsin and the β-adrenergic receptor have similar secondary structures. thus, it was assumed that, similarly to bacterio-, ovine- and bovine-rhodopsins, the β-adrenergic receptor possesses a structure consisting of seven α-helices traversing the cell membrane. fig. shows the postulated arrangements of the α-helices of rhodopsin and the β-receptor. using the experimental data, a model of the β-adrenergic receptor has been generated for the study of its interaction with noradrenaline. a possible binding site was created. successful docking indicated that the homo and lumo orbitals contribute to the binding in a charge-transfer interaction between trp- and noradrenaline. a hydrogen bond was detected between the threonine residue of the model receptor and the noradrenaline side chain hydroxyl, explaining why chirality was found to be important for the activity of adrenergic substances. title: three-dimensional steric molecular modeling of the [binding affinity (nm) of the compounds to the -ht receptor].
data determined: molecular modeling ( d molecular models of each compound were made using camseqlm molecular modeling system); [distance (a) from the center of the aromatic ring to the ring-embedded nitrogen, when the nitrogen is placed in the same plane as the aromatic ring]. results: in order to derive rules for the -ht pharmacophore, a molecular graphics-based analysis was made using six core structures. the structures were aligned so as to overlay the aromatic rings and to place the ring embedded nitrogen atom in the same plane as the aromatic ring. nine steric rules were derived from the analysis common to all potent -ht agents. fig. shows the d representation of the six overlaid -ht core structures using camseqim: the -ht inactivity of atropine could be explained because its steric properties differed from those the active ics - only by a single atom and failed to meet two of the nine hypothetical criteria. uv-visible spectra [spectrophotometric studies of mixtures of the dyes and nicotine in % (vlv) aqueous ethanol mixture at "ci. results: cyanine dyes demonstrate a multitude of biological activities which may be due to the interference of the adsorbed dye molecule on active sites of the living cell. it was shown by uv-and visible spectrophotometry that the hydroxy styryl cyanine dyes and nicotine formed : charge-transfer complexes. the absorption band of the complex formed between nicotine and dye was detected at wavelengths longer than those of the individual pure substances having identical concentrations to those in mixture. fig. shows that the two partially positive centres of the dye ( -( -hydroxystyryl)-pyridinium-i-ethyliodide) were located at a similar distance than the two nitrogen atoms of pyridine or pyrrolidinyl moieties of nicotine allowing the suggested : i parallel stucking interaction between the two molecule: molecular modeling ( conformations were calculated using a distance geometry algorithm and energy minimized by a modified mm force field in moledit). results: all conformers within kcallmol of the lowest energy conformer were superposed on the x-ray structure of mk- . crystal structure (crystal coordinates of the proteins were determined using x-ray diffraction method). data taken from the literature: was less than a); (probability that a given tetrapeptide sequence is superimposable on the ribonuclease a structure); [probability that the ith residue (amino acid) will occur in the jth conformational state of the tetrapeptide which is superimposable to ribonuclease a]. results: it was suggested that the five tetrapeptides were essential components of larger peptides and might be responsible for their biological activity (binding to the cd receptor). earlier it was hypothesized that the critical tetrapeptide located in a segment of ribonuclease a, would assume low energy conformations (residues - , a @-bend, having a segment homologous to the sequence of peptide t). low energy conformers of the tetrapeptides could be superimposed to the native structure of segment - of ribonuclease a. fig. shows the superimposition of peptide t (full square): many low energy conformers could be calculated for the tetrapeptides but for the polio sequence. the p, value for most tetrapeptides were - times higher that the value of the less active polio sequence. the results supported the hypothesis that the active peptide t adopts the native ribonuclease @-bend. title: potential cardiotonics. . 
synthesis, cardiovascular activity, molecule-and crystal structure of -phenyl-and -(pyrid- -yl)- data determined: [dose of the compound (mollkg) required for % increase of the heart beat frequency of guinea pig or dog heart]; [dose of the compound (mollkg) required for % decrease of systolic or diastolic blood pressure of dog]; crystal structure (atomic coordinates of the compounds were determined by x-ray diffraction); molecular modeling (molecule models were built using molpac); mep [molecular electrostatic potential (mep) (dimension not given) was calculated using cndoiz]. results: milrinon and its oxygen containing bioisoster possess highly similar crystal structure and mep isopotential maps ( fig. and fig. ) both compounds show strong positive inotropic and vasodilatoric activity. it was suggested that the negative potential region around the thiocarbonyl group such as the carbonyl group in milrinon imitates the negative potential field around the phosphate group of camp. title: molecular mechanics calculations of cyclosporin a analogues. effect of chirality and degree of substitution on the side chain conformations of ( s, r, r, e)- -hydroxy- -methyl- -(meth~lamino)- octenoic acid and related derivatives. [solution conformation of csa in cdch has been elucidated via molecular dynamics simulation incorporating distance constrains obtained from ir spectroscopy and nuclear overhauser effect (noe) data]; (conformational analysis was performed using the search subroutine within sybyl); energy minimization (low energy conformers were calculated using molecular mechanics withinmacromodel ver. . applying an all-atom version of the amber force field). results a total of conformations of csa have been identified within kcalfmol of the minimum energy conformer. population analysis showed that one conformer dominates in solution. fig. shows the superposition of the peptide backbone of the crystal and solution structures of csa (crystal structure is drawn with thick line and the solution structure with thin line). it was shown that the boltzmann distribution between active and inactive conformers correlated with the order of the immunosuppressive activity. a common bioactive conformer serving as a standard for further design has been proposed for csa and its analogs. abtstr. quant. struct.-act. relat. , - ( ) data determined: molecular modeling (models of (i), ( ) and ( ) were built using sybyl based on x-ray coordinates of the compounds); conformational analysis (minimum energy conformations of the compounds were calculated using the search option of sybyl and mndo method; [interaction energy of the molecules (kcall mol) with a hypothetical receptor probe (negatively charged oxygen atom) calculated by grid]. results: the specific receptor area of the sodium channel was modeled with a negatively charged oxygen probe (carboxyl group), interacting with the positively charged (protonated) ligand. fig. shows areas for energetically favorable interaction (areas i, oh ho c *. . quant. struct.-act. relat. , - ( ) abstr. - biological material: a) aspergillus ochraceus; b) carbopeptidase a. data determined kobr [first order rate coefficient (io- /sec) of the hydrochloric acid hydrolysis of ochratoxin a and b]; x-ray crystallography (coordinates of the crystal structure of ochratoxin a and b was obtained using x-ray diffraction); (models of ochratoxin a and b was built using alchemy); ["c nmr chemical shifts (ppm) of the amide and ester carbonyls of the ochratoxins]. 
chemical descriptors: pka (negative logarithm of the acidic dissociation constant). results: a reversal of the hydrolysis rate between ochratoxin a and b was observed comparing the hydrolysis rates obtained in vitro (carbopeptidase a) and in vivo (hydrochloric acid). the difference in hydrolysis rates cannot be due to conformation since the two toxins have the same conformation in both crystal and in solution. fig. shows the fit of ochratoxin a and b based on superimposing the phenolic carbon atoms. it is suggested that the relative large steric bulk of the chloro atom hinders the fit between ochratoxin a and the receptor site of carbopeptidase a. thus, probably the slower metabolism is the reason, why ochratoxin a is more toxic than ochratoxin b. title: inhibitors of cholesterol biosynthesis. . trans- -( -pyrrol- charge distribution studies showed that compactin had two distinct regions of relatively large partial charges corresponding to the pyrrol ring and the isobutyric acid side chain. experiments for more closely mimicking the polar regions associated with the high activity of compactin indicated that potency of the new compounds was relatively insensitive to the polarity of the r' group. it was also suggested that an electron deficient pyrrole ring was required for high potency. title: synthesis and biological activity of new hmg-coa reductase inhibitors. . lactones of pyridine-and pyrimidine-substituted . dihydroxy- -heptenoicf-heptanoic) acids. chemical descriptors: results: an attempt was made to correlate electrophysiological activity with the effect of the position of the aryl group on the conformation of the side chain using molecular modeling. the study suggested that the compounds with class activity prefer a gauche (a in fig. ) and compounds in which class i activity prefer trans relationship of the nitrogens (b in fig. ) : the study indicated that the point of attachment of the aryl moiety had an effect on the side chain conformation which appeared to be a controlling factor of the electrophysiological profile of these compounds. title: a molecular mechanics analysis of molecular recognition by cyclodextrin mimics of a-chymotrypsin. authors ( ) quant. struct.-act. relat. . - ( ) biological material: chymotrypsin. data taken from the literature: crystal structure (crystal coordinates of the macrocycles determined using x-ray diffraction analysis). data determined: molecular modeling structure superposition (models of b-cd and in chains by nmethylformamide and n-dimethyl-formamide substituted (capped) b-cd were built using the amber program and the coordinates for building the n-methylformamide substituent were calculated using mndo in the mopac program); (energy minimization of the molecules were calculated in vacuo using molecular mechanics program with the amber force field); (energy minimized structures of b-cd and capped b-cd were separately fit to the xray structure of the b-cd complex); [molecular electrostatic potential (kcallmol) of b-cd and capped b-cd were approximated by the coulombic interaction between a positive point charge and the static charge distribution of the molecule, modeled by the potential derived atomic point charges at the nuclei and visualized as d mep map]. results: b-cd and capped b-cd were analyzed as biomimetic models of the active site of chymotrypsin. capped b-cd was shown to be the more effective biomimetic catalyst. capping also altered certain structural features of molecular recognition. 
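the cyclodextrin abstract above approximates the molecular electrostatic potential as the coulomb interaction between a positive point charge and potential-derived atomic point charges placed at the nuclei. the sketch below evaluates such a point-charge mep on a plane; the geometry and charges are invented, and only the standard conversion factor of roughly 332 kcal·å/(mol·e²) is taken as given.

```python
# a minimal sketch of a point-charge molecular electrostatic potential (MEP):
# coulomb interaction of a +1 probe charge with atomic partial charges at the
# nuclei, evaluated on a plane. geometry and charges below are assumptions.
import numpy as np

COULOMB = 332.06   # kcal*Angstrom/(mol*e^2)

atoms = np.array([[0.00, 0.00, 0.00],     # assumed nuclear positions (Angstrom)
                  [1.21, 0.00, 0.00],
                  [-0.60, 1.05, 0.00]])
charges = np.array([-0.40, 0.25, 0.15])   # assumed potential-derived point charges (e)

def mep(point):
    """Electrostatic potential (kcal/mol) felt by a +1 e probe at `point`."""
    r = np.linalg.norm(atoms - point, axis=1)
    return COULOMB * np.sum(charges / r)

# evaluate on a small grid in the z = 2 Angstrom plane (a crude 2-D MEP map)
xs = ys = np.linspace(-3.0, 3.0, 7)
grid = np.array([[mep(np.array([x, y, 2.0])) for x in xs] for y in ys])
print(np.round(grid, 1))
```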
the orientation of the secondary hydroxyls were altereddue to twisting of some of the glucose units. secondary hydroxyl oxygen mimics the ser- of chymotrypsin in initiating the acyl transfer event through nucleophilic attack on the substrate. fig. shows the energy minimized structures of b-cd (a) and capped b-cd (b) (fragment number is given in parenthesis). the mep maps of b-cd and capped b-cd showed that the qualitative features of the electrostatic recognition were practically the same in the two mimics. biologicai material: four monocotyledonous (johnson grass, yellow foxtail, barnyard grass, yellow millet) and four dicotyledonous weed species (velvetleaf, morning glory, prickly sida, sicklepod). data determined: [pre-emergence and postemergence herbicidal activities of the compounds were measured and rated using a scale ranging from (no activity) to (complete kill]; [measure of the compound's ability (dimension not given) to translocate upwards in plants through xylem vessels); [soil sorption coefficient calculated by the formula k, = c,/c,, where c, is the concentration of the compound (pg compoundlg soil) and c. is the concentration of the compound (pg compoundlml) in water solution in equilibrium with the soil]; (models of the compounds were built using maccs and prxbld programs); tscf kd molecular modeling quant. struct.-act. relat. , - ( ) abstr. - conformational analysis (minimum energy conformations of the compounds were calculated using mm molecular mechanics method); (molecules were visualized using program mogli on an evans and sutherland picture system ); [total energies, orbital eigenvalues, atomic charges and dipole moments of simple model analogs of type i were calculated using prddo (partial retention of diatomic overlap) level of approximation]. electronic structure chemical descriptors: logp (logarithm of the partition coefficient in -octanollwater). results: conformational analyses and high level quantum mechanical calculations of the conformational preferences showed that the compounds with r = -c and -ci substituents adopt a coplanar structure stabilized by intramolecular hydrogen bond, whereas the -c analogue does not (fig. ): higher logp values ( . - . logarithmic unit difference), higher kd and tscf values of the -ci and -ci substituted compounds relative to the -ci analog were interpreted as the result of the intramolecular hydrogen bond and were consistent with the observation that the -ci and -ci analogs were active as post-emergence but not pre-emergence herbicides while the -ci derivative was active in both modes. title: application of molecular modeling techniques to pheromones of the marine brown algae cutleria multifida and ectocarpus siliculosus (phaeophyceae). metalloproteins as chemoreceptors? (geometrical models of the compounds were constructed using information from the cambridge structural data base (csd) and calculated using molecular mechanics methods in sybyl); (minimum energy conformations of the compounds were calculated using molecular mechanics method within sybyl). chemical descriptors: kfcq [partition coefficient in fc /water (fc = fluorocarbon results: as both ectocarpene (i) and multifidene ( ) trigger mutual cross reactions between male gametes of ectocarpus siliculosus and cutleria multifida males it was supposed that a common mode of binding should exist for the two structurally different pheromones. 
the active analogue approach was applied to model the pheromone receptor by superposing the minimum energy conformations of active structural analogues (hi, iv, v, vi) on ectocarpene and multifidene. the common active conformation of (i) and ( ) was extracted by systematic superimposition of the analogues. to explain the function of the double bonds in the pheromones, the presence of a receptor bound metal cation was assumed. simultaneous optimization,of both structures without and with a receptor bound metal cation resulted in virtually the same conformations. fig. shows the mapping of multifidene onto ectocarpene in their biologically relevant conformations. solvent)]. title: critical differences in the binding of aryl phosphate and carbamate inhibitors of acetylcholinesterases. conformational analysis [minimum energy conformations of (asn-ala-asn-pro) was calculated using charmm (chemistry at harvard macromolecular mechanics), amber (assisted model building with energy refinement) and ecepp (empirical conformational energy program for peptides) potential energy functions]; [root mean square deviation (a) of the position of the corresponding atoms of two superimposed molecular sructures], results: low energy conformations of (asn-ala-asn-pro) has been determined using charmm, ecepp and amber in order to determine their final conformations and relative energies. the final conformations were compared calculating the rms values of their c" atoms and matching the parameters of the energy minimized (asn-ala-asn-pro), peptide to that of the ideal helix or coiled coil. the similarity of the final conformations obtained by using any two different potentials starting from the same conformation varied from the satisfactory to highly unacceptable. the extent of difference between any pairs of the final conformations generated by two different potential energy functions were not significantly different. the lowestenergy conformation calculated by each of the energy potentials for any starting conformation was a left handed helix and pair-wise superposition of the c" atoms in the final conformations showed small rms values ( .o - . a) . it was suggested that the native conformation of (asn-ala-asn-pro), in the cs protein may be a left-handed helix, since all three potential energy functions generated such conformation. - ( ) source: proteins proteins , ( ), - , . biological material: crambin. data determined phi-psi probability plot (probabilities of the occurrences of phi-psi dihedral angle pairs for each amino acid were determined and plotted using the data of approximately proteins from the brookhaven protein data bank); (optimization technique for the reproduction the folding process converging to the native minimum energy structure by dynamically sampling many different conformations of the simplified protein backbone). chemical descriptors: phi-psi values [dihedral angles (deg) defined by the bonds on either side of the a-carbon atom of the amino acid residue in a protein]. results: a simplified model has been developed for the representation of protein structures. protein folding was simulated assuming a freely rotating rigid chain where the effect of each side chain approximated by a single atom. phi-psi probabilities were used to determine the potentials representing the attraction or repulsion between the different amino acid residues. 
many characteristics of native proteins have been successfully reproduced by the model: (i) the optimization was started from protein models with random conformations and led to protein models with secondary structural features (a-helices and strands) similar by nature and site to that of the native protein; (ii) the formation of secondary structure was found to be sequence specific influenced by long-range interactions; (iii) the association of certain pairs of cysteine residues were preferred compared to other cysteine pairs depending on folding; (iv) the empirical potentials obtained from phi-ps probabilities led to the formation of a hydrophobic core of the model peptide. x [dihedral angle (deg) ]. results: four kinds of monte carlo simulations of about , steps of the conformations of crambin were carried out by using the second derivative matrix of energy functions (starting from native and unfolded conformations both in two kinds of systems, in vacuo and in solution). fig. shows the native (a) and theunfolded (b) conformation of crambin. starting from native conformation, the differences between the mean properties of the simulated crambin conformations obtained from in vacuo and solution calculations were not very large. the fluctuations around the mean conformation during simulation were smaller in solution than in vacuo, however. simulation starting from the unfolded conformation resulted in a more intensive fluctuation of the structure in solution than in vacuo indicating the importance of the hydration energy term in the model. the conformations generated in the simulations starting from the native conformation deviate slightly from the xray conformation (rms = . a and i . a for in vacuo and solution simulations, respectively). the results indicate that the simulations of the protein with hydration energyare more realistic that the simulations without hydration energy. fig. results: earlier studies overestimated the catalytic rate decrease of the hypothetical d a point mutant of thrombin ( orders of magnitude decrease calculated instead of the order of magnitude measured). the source of error was due to an overestimation of v and neglecting the effects of the surrounding water molecules and induced dipoles. to compensate for these errors, a scale factor of . was introduced into the calculations. as aresult of the rescaling, one magnitude increase of tat for the d mutant and two magnitudes decrease of k,, of the k mutant of ribonuclease a was predicted. it was shown that the effect of the mutations on the catalytic rate depended almost entirely on steric factors. it was suggested that in mutants of serine proteases where the buried asp is replaced by ala or asp, the kcat value will decrease between - orders of magnitude. title: high-resolution structure of an hiv zinc fingerlike domain via a new nmr-based distance geometry approach. authors: summers*, m.f.; south, t.l.; kim [root mean square deviation (a) of the corresponding atoms of two superimposed molecular sructures]. results: the atomic resolution structure of an hiv zinc fingerlike domain has been generated by a new nmr-based dg method using d noesy backcalculation. the quality of the structures thus obtained were evaluated on the basis of the consistence with the experimental data (comparison of measured and back-calculated nmr spectra) rather than comparing it tostructural informations from other sources (e.g. x-ray data). 
the method provided a quantitative measure of consistence between experimental and calculated data which allowed for the use of tighter interproton distance constraints. the folding of the c( l)-f( )-n( )-c( )-g(s)-k( ) residues were found to be virtually identical with the folding of the related residues in the x-ray structure of the iron domain of rubredoxin (rms values . and . a). the backbone folding of the peptide was found to be. significantly different from that of the "classical" dna-binding zn-finger. fig. shows the wire frame model of all the back%ne atoms and certain side chain atoms of the peptide (dg struciure) (dashed lines indicate hydrogen atoms): active site of the protease dimer in an extended conformation with extensive van der waals and hydrogen bonding and was more than % excluded from contact with the surrounding water (fig. i , where the inhibitor is shown in thicker lines and the hydrogen bonds in dashed lines): data determined: ago,, ago, [standard free energy (callmol) of transfer of a molecule from an apolar phase to an aqueous phase, observed or calculated by eq. : ago, = c aui ai, where aoi is the atomic solvation parameter of atomic group i, ai is the accessible surface area of atom i, respectively]. results: atomic solvation parameters (asps) characterizing the free energy change per unit area for transfer of a chemical group from the protein interior to aqueous surroundings were determined. ago, and ago, were determined and compared, and a highly significant linear relationship is presented (fig. ) . one letter symbols indicate amino acid side chains: fig . the binding of the inhibitor induced substantial movement in the en-' zyme around the residues to in both subunits at places exceeding i - the structure of glutamine synthetase is discussed. it was established that hydrophobic interactions are important for the intersubunit interactions, and the hydrophobic interactions between the two rings of subunits are stronger than between the subunits within a ring. the cterminal helix contribute strongly to the inter-ring hydrophobic interaction. asps are suggested to estimate the contribution of the hydrophobic energy to protein folding and subunit assembly and the binding of small molecules to proteins. title: determination of the complete three-dimensional structure of the trypsin inhibitor from squash seeds in aqueous solution by nuclear magnetic resonance and a combination of distance geometry and dynamical simulated annealing. authors: holak*, t.a.; gondol, d.; otlewski, j.; wilusz, t. max-planck-hstitut fiir biocbemie d- martinsried bei miinchen, federal republic of germany. title: interpretation of protein folding and binding with atomic crystal structure (atomic coordinates of cmti-i was determined by solvation parameters. x-ray diffraction method). results: in order to obtain information of the d structure of the free cmti-i in solution, a total of inhibitor structures were calculated by a combination of distance-geometry and dynamical simulated annealing methods, resulting in well defined d positions for the backbone and side-chain atoms. fig. shows the superposition of the backbone (n, c", c, ) atoms of the structures best fitted to residues to (binding loop): the average rms difference between the individual structures and the minimized mean stfucture was . (* . ) a for the backbone atoms and . (+ . ) a for all heavy atoms. title: electron transport in sulfate reducing bacteria. 
molecular modeling and nmr studies of the rubredoxin-tetraheme-cytochrome-c complex. biological material: a) sulfate reducing bacterium (desulfovibrio vulgaris); b) rubredoxin (iron-sulfur protein); c) tetraheme cytochrome c from d. vulgaris; e) flavodoxin. detected in the segments from the residues to and - . fig. shows the best superposition (residues to ) of the nmr and crystal structure of cmti-i indicating the backbone c, c", n, , as well as the disulfide c and s atoms: fig. it was demonstrated that uncertainty in nmr structure determination can be eliminated by including stereospecific assignments and precise distance constraints in the definition of the structure. crystal structure (coordinates of the crystal structure of the compounds were determined by x-ray crystallography). results: the speed of the homolysis of the organometallic bond is " times higher in the apoenzyme bound coenzyme biz than in a homogenous solution. structural changes occurring during the co-c bond homolysis of the coenzyme biz leading from cobalt(ii ) corrin to cobalt(i ) corrin were investigated. fig. shows the superposition of structures of the cobalt corrin part of the biz (dotted line) and of cob(ii)alamin (solid line): biological material: apoenzyme, binding the coenzyme biz and the data determined: . shows that the crystal structure of biz and cob(i )alamin are strikingly similar and offers no explanation for the mechanism of the protein-induced activation of homolysis. it was suggested that the co-c bond may be labiiized by the apoenzyme itself and in addition to a substrate-induced separation of the homolysis fragments (which mights be supported by a strong binding of the separated fragments to the protein). 'h nmr (complete stereospecific assignments were carried out and proton-proton distance constrains were determined by the analyses of dqf-cosy, hohaha and noesy spectra); nh, ah, oh [chemical shifts (ppm) of proton resonances of human eti ; d structure ( d structure of et was calculated using the distance geometry program dadas based upon the noesy proton-proton distance constrains determined by nmr spectroscopy); [root mean square distance (a) between et conformers calculated by distance geometry (dadas)]. results: the solution conformation of et has been determined by the combined use of d 'h nmr spectroscopy and distance geometry calculations. five structures of et have been calculated from different initial conformations. the superposition of the backbone atoms the calculated structures is shown in fig. . the average rms value in the core region for the main-cahin atoms was . a. quant. struct.-act. relat. , - ( ) the lack of specific interactions between the core and tail portions of et and a characteristic helix-like conformation in the region from lys' to cys" was shown. literature data indicated that neither the eti - nor the etl - truncated derivatives of et showed constricting or receptor binding activity suggesting that the et receptor recognizes an active conformation consisting of both the tail and core portion. the present study, however, suggested that the receptor bound conformation of et is probably different from that in solution because the lack of interaction between tail and core. the hydrophobic nature of the tail suggested the importance of a hydrophobic interaction with the receptor. compounds: triphenyl-methyl-phosphit cation (tpmp+). 
biological material: nicotinic acetylcholine receptor (achr), a prototype of the type i of membrane receptor protein from the electric tissue of torpedo and electrophorus. results: a computer model of the achr ion channel has been proposed. fig. shows the side view of the ion channel model with five pore-forming mz-helices and the channel blocking photoaffinity label (tpmp +) represented by the dotted sphere: fig. it was supported by electronmicroscopy, electrophysiological-and biochemical experiments that the mz-helices were formed by homologous amino acid sequences containing negatively charged amino acid side chains which act as the selectivity filter. the amino acid side chains may undergo conformational changes during the permeation of the cation. the predicted transmembrane folding of four transmembrane a-helices of type i receptors is shown in fig. : fig. energy profile calculations indicate that other transmembrane sequences of the receptor protein besides m may affect the ion channel. source: cabios , ( ), - . results: an interactive computer program tefoojj has been developed for drug design on ibm/pc and compatible computers. the program contains the following modules and performs the following calculations: a) series design for selecting an optimal starting set of compounds using a modified version of austel's method; b) regression analysis calculating the hansch's equation; c) hansch searching method using the equation calculated by the regression analysis routine or the use of an input equation for the identification of the most active compounds; d) geometrical searching methods for finding the optimum substituents in the parameter space using the sphere, ellipse, quadratic or polyhedric cross algorithms with or without directionality factors; e) space contraction for reducing the dimension of the parameter space by eliminating non-significant parameters; f) an example is given for the lead optimization of an aliphatic lead compound correctly predicting the n-pentane to be the optimum substituent. results: a new expert system sparc is being developed at epa and at the university of georgia to develop quantitative structure-activity relationships for broad compound classes: a) classical qsar approaches predict therapeutic response, environmental fate or toxicity from structure/property descriptors quantifying hydrophobicity, topological descriptors, electronic descriptors and steric effects; b) sparc (sparc performs automated reasoning in chemistry), an expert system written in prolog, models chemistry at the level of physical organic chemistry in terms of mechanism of interaction that contribute to the phenomena of interest; c) sparc uses algorithms based on fundamental chemical structure theory to estimate parameters such as acid dissociation constants (pk,s), hydrolysis rate constants, uv, visible and ir absorption spectra, and other properties; d) the information required to predict input data for broad classes of compounds is dispersed throughout the entire ir spectrum and can be extracted using fourier transforms; e ) the accuracy of sparc algorithm was demonstrated on the close match of calculated and experimental pk, values of carboxylic acid derivatives near to the noise level of measurement. abtstr. - quant. struct.-act. relat. , - ( ) results: a new stand-alone molecular simulation program, nmrgraf integrating molecular modeling and nmr techniques has been introduced by biodesign inc. 
a) nmrgraf is a molecular modeling program utilizing force fields which incorporate empirical properties such as bond lengths and angles, dihedral, inversion and nonbonded interactions, electrostatic charges and van der waals interactions; b) the molecular structural properties are combined with nuclear overhouser effect (noe) and j-coupling nmr data (experimental interproton distance constrains and bond angle data); c) the nmr proton-proton distance data are accurate only at relatively short distances ( to a) which restricts the use of nmr noe approaches only for the analysis of molecules with known x-ray structure; d) the combination of nmr and molecular modeling approaches, however, makes it possible to model virtually any molecule even if its structure does not exist in the databases. title: electronic structure calculations on workstation computers. results: the main features of the program system turbomole for large-scale calculation of scf molecular electronic structure on workstation computers is described: a) the program system allows for scf level treatments of energy, first-and second-order derivatives with respect to nuclear coordinates, and an evaluation of the me? correlation energy approximation; b) the most important modules of turbomole are (i) dscf performing closed and open shell rhf calculations; (ii) egrad used for analytical scf gradient evaluations; (iii) kora calculating direct two-electron integral transformation (iv) force for the computation and processing of integral derivatives and the solution of cphf equations; c) comparison and evaluation of timings of representative applications of turbomole on various workstations showed that apollo ds o.ooo and iris d were the fastest and comparable to the convex c in scalar mode. results: a new algorithm has been developed for the calculating and visualizing space filling models of molecules.the algorithm is about times faster than a conventional one and has an interesting transparency effect when using a stereo viewer. a) the algorithm is briefly described and and the result is visualized on modeling a ribonucleotide unit; b) in the order of increasing atomnumbers, the (x,y) sections of the hemispherical disks of the atoms are projected on the screen with decreasing value of the azimuthal angle (p) of the van der wads radius and as the value of p decreases the projection is increasingly whitened to obtain shading effect on the surfaces of the spheres; c) the transparency of the van der waals' surfaces of atoms of a molecule makes possible to perceive almost the whole space filling structure and not only the surface, hiding the underlying atoms. title: supercomputers and biological sequence comparison algo-authors: core*, n.g. ; edmiston, e.w. ; saltz, j.h. ; smith, r.m. rithms. yale university school of medicine new haven ct - , usa. source: computers biomed. res. , ( ) , - . compounds: dna and protein fragments. chemical descriptors: sequences of monomers. results: a dynamic programming algorithm to determine best matches betweenpairs of sequences or pairs of subsequences has been used on the intel ipsc/l hypercube and ,the connection machine (cm-i). parallel processing of the comparison on cm-i results in run times which are to times as fast as the vax , with this factor increasing as the problem size increases. the cm-i and the intel ipsc hypercube are comparable for smaller sequences, but the cm-i is several times quicker for larger sequences. 
a fast algorithm by karlin and his coworkers designed to determine all exact repeats greater than a given length among a set of strings has been tried out on the encore multimax/ o. the dynamic programming algorithms are normally used to compare two sequences, but are very expensive for multiple sequences. the karlin algorithm is well suited to comparing multiple sequences. calculating a multiple comparison of dna sequences each - nucleotides long results in a speedup roughly equal to the number of the processors used. source: cabios , ( ), . results: a program has been developed for the prediction and display of the secondary structure of proteins using the primary amino acid sequence as database. a) the program calculates and graphically demonstrates four predictive profiles of the proteins allowing interpretation and comparison with the results of other programs; b) as a demonstration the sliding averages of n sequential amino acids were calculated and plotted for four properties of human interleukin : (i) plot of the probabilities of a-helix, p-structure and p-turns according to chou and fasman; (ii) @-turn index of chou and fasman; (iii) plot of the hydrophobicity index of hopp and woods; (iv) flexibility index of karplus and schulz; c) the regions of primary structure having properties which usually go together agreed reasonably well with each other, i.e. loops and turns with bend probability and hydrophilicity with flexibility. title: dsearch. a system for three-dimensional substructure searching. quant. struct.-act. relat. , - ( ) sci. , ( ) , - . results: the search for threedimensional substructures is becoming widely used in d modeling and for the construction of pharmacophores for a variety of biological activities. a system ( dsearch) for the definition and search of three-dimensional substructures is described: a) representation of atom types consists of five fields (i) element (he-u); (ii) number of non hydrogen neighbors (bonded atoms); (iii) number of 'k electrons; (iv) expected number of attached hydrogens; (v) formal charge; (vi) four type of dummy atoms are also used to define geometric points in space (e.g. centroid of a b) definition of queries (i) definition of spatial relationship between atoms; (ii) matches in atom type (iii) preparation of keys (constituent descriptors); (iv) execution of key search; (v) geometric search including the handling of angleldihedral constrains and takes into account "excluded volume"; c) time tests showed that a search of d structures with to atoms in large databases with more than , entries took only a few minutes ( - s). results: a database containing about , compounds in connection tables and , experimentally determined structures from the cambridge structural database has been transformed into a database of low energy d molecular structures using the program concord. 
the strategy for building the d database consisted of the following four steps: a) generation of approximate d coordinates from connection tables (hydrogens were omitted); b) assignment of atom types from connection table information characterized by five descriptors: (i) element type (he-u); (ii) number of attached non-hydrogen neighbors ( - ); (iii) number of 'k electrons ( - ); (iv) calculated number of attached hydrogens ( - ); (v) formal charge ( - , , ); c) addition of three types of chemically meaningful dummy atoms for purposes of d substructure searching: (i) centroids of planar and -membered rings; (ii) dummy atoms representing the lone electron pairs; (iii) ring perpendiculars positioned orthogonal to and . a above and below each planar ring; d) efficient storage of the resultant coordinate database indexing the compounds with identification number; e) the database can be used among others to deduce pharmacophores essential for biological activity and to search for compounds containing a given pharmacophore. title source: c&en , ( ) , - . results: a complex carbohydrate structure database (ccsd) has been developed by the complex carbohydrate research center at the university of georgia, having more than structures and related text files, with about more records to be added over the next two years. the following are the most important features of ccsd: a) in ccsd, database records include full primary structures for each complex carbohydrate, citations to papers in which sequences were published, and suplementary information such as spectroscopic analysis, biological activity, information about binding studies, etc; b) structural display format visualize branching, points of attachment between glycosyl residues and substituents, anomeric configuration of glycosyl linkages, absolute configuration of glycosyl residues, ring size, identity of proteins or lipids to which carbohydrates are attached and other data; c) it is planned that ccsd will provide threedimensional coordinates, to visualize and rotate the structures in stereo and study their interaction with proteins or other biopolimers. probing bioactive mechanisms p commercial carbamate insecticides of type ii, where r' = h, s-bu biological material: acetylcholinesterase. data determined: d [distance (a) between the serine oxygen of acetylcholinesterase and the methyl substituents of the carbamate and phosphate inhibitors molecular modeling (models and minimum energy conformations of acetylcholine, aryl carbamate and phosphate ester inhibitors were created using the draw mode of a maccs database and the prxbld modeling program) results: transition state modeling of the reaction of the serine hydroxyl ion of acetylcholinesterase with the methylcarbamoyl and dimethyl phosphoryl derivatives of , -dimethyl-phenol showed that the active site binding for these two classes of acetylcholinesterase inhibitors should be different. the model shows that the distances between the serine oxygen and the ring substituents (meta-and para-me-thy groups) are different in both spacing and direction. fig. shows the transition state models of serine hydroxyl d values for the meta-and para-methyl substituents of n-methylcarbamate and dimethylphosphate were meta = . , para = . and meta = . , para = . a, respectively title: a comparison of the charmm, amber and ecepp potentials for peptides. . 
conformational predictions for the tandemly repeated peptide (asn-ala-asn-pro) biological material: tandemly repeated peptide (asn-ala-asn-pro) which is a major immunogenic epitope in the circumsporozoite (cs) protein of plasmodium falciparum conformational analysis dream or reality? a authors: nbray-szab *, g.; nagy, j.; bcrces, a. priori predictions for thrombin and ribonuclease mutants molecular modeling (geometric model of subtilisin and trypsin was built using protein data bank coordinates and model of thrombin was built using the theoretical coordinate set of graphic representations of the triad of the tetrahedral intermediate for the enzymes formed on the active side chain and residues (ser- , his- and asp in subtilisin; ser- , his- and asp-i in trypsin and thrombin; his- , lys- and his-i in ribonuclease a) were created using pcgeom electrostatic properties (electrostatic surfaces and fields of the molecules were calculated and displayed using amber and associated programs); [chemical shift (ppm) measured by nmr results: a hypothetical model between rubredoxin and cytochrome c was built as a model for the study of electron transfer between different redox centers, as observed in other systems. fig. i shows the main chains atoms of the proposed complex where the hemes of the cytochromes are shown along with the center of rubredoxin and stabilized by hydrogen bonds and charge-pair interactions (the nonheme iron of the rubredoxin is in close proximity to heme of cytochrome c ): spectroscopy].the model was consistent with the requirements of steric factors, complementary electrostatic interactions and nmr data of the complex. comparison of the new model and the nmr data of the previously proposed flavodoxin-cytochrome c complex showed that both proteins interacted with the same heme-group of cytochrome c . title: nuclear magnetic resonance solution and x-ray structures of squash trypsin inhibitor exhibit the same conformation of the proteinase binding loop.authors: holak*, t.a.; bode, w.; huber, r.; otlewski, j.; wilusz, t. max-planck-institut fiir biochemie d- martinsried bei munchen, federal republic of germany.source: j. mol. biol. , no. , - . biological material: a) trypsin inhibitor from the seeds of the squash cucurbita maxima; b) p-trypsin and trypsin inhibitor complex.title: retention prediction of analytes in reversed-phase high-performance liquid chromatography based on molecular structure. . quant. struct.-act. relat. , - ( ) results: an expert system (cripes) has been developed for the prediction of rp-hplc retention indices from molecular structure by combining a set of rules with retention coefficients stored in a database. the method underlying the system is based on the "alkyl aryl retention index scale" and aims to improve the reproducibility of prediction and compatibility between various instruments and column materials. the vp-expert system shell from microsoft was used for the development. the performance of cripes was demonstrated on several subtypes of substituted benzenes (phenacyl halides, substituted arylamines, arylamides and other types). in general the calculated and measured retention indices agreed well but relatively large deviations were observed between the ie and ic values for phenacyl bromides and chlorides, o-and p-bromo anilines, n-methylbenzamide and n,n-dimethylbenzamide and phthalate esters. the extension of the database with further interaction values was regarded as necessary for a consistently high accuracy at prediction. 
[out-of-plane bending energy (kcal/mol) given by the formula e, = kd', where d is the distance from the atom to the plane defined by its three attached atoms and k is a force constant]; eb e, [torsional energy (kcal/mol .deg ) associated with four consecutive bonded atoms i,j,k,l given by the formula e, = ki,j.k,l(l fs/lslcos(lslbi,j,k,~)), where b is the torsion angle between atoms i j , k and , s and k are constants]; e, [potential energy (kcallmol) (nonbonded energy term) associated with any pair of atoms which are neither directly bonded to a common atom or belong to substructures more than a specified cutoff distance away given by the formula e, = kij(l. la" - . /a ), where a is the distance between the two atoms divided by the sum of their radii, and k is the geometric mean of the k constants associated with each atom]. results: model geometries produced by the tripos . force field have been assessed by minimizing the crystall structures of three cyclic hexapeptides, crambin and diverse complex organic compounds. comparative force field studies of the tripos . , amber and amberlopls force fields were carried out by energy minimization of three cyclic hexapeptides starting from the crystal structures showed the tripos . force field superior to the others with the exception of amber, ecep and levb force fields as published by other workers. a direct comparison between the performance of tripos . and . amber using isolated crambin showed that the bond and torsion angles of tripos . averaged closer to the crystal structure than the angles calculated by amber (rms = . a, . deg and . deg for bond lengths, angles, and torsions, respectively, and rms = . a for heavy atoms). fig. shows the superimposed structures of crambin before and after energy minimization:fi . tripos . was assessed for general purpose applications by minimizing organic compounds starting from their crystal structures. the test showed that tripos . had a systematic error in overestimating the bond lengths of atoms in small rings. statistical analysis of the results showed that t r i p s . had an acceptable overall performance with both peptides and various organic molecules, however, its performance was not equal to the best specialized force fields. title: new software weds molecular modeling, nmr. author: krieger. j. c&en sixteenth st., n.w., washington dc , usa. source: c&en , ( ) , . key: cord- -ohorip authors: kapoor, mudit; malani, anup; ravi, shamika; agrawal, arnav title: authoritarian governments appear to manipulate covid data date: - - journal: nan doi: nan sha: doc_id: cord_uid: ohorip because sars-cov- (covid- ) statistics affect economic policies and political outcomes, governments have an incentive to control them. manipulation may be less likely in democracies, which have checks to ensure transparency. we show that data on disease burden bear indicia of data modification by authoritarian governments relative to democratic governments. first, data on covid- cases and deaths from authoritarian governments show significantly less variation from a day moving average. because governments have no reason to add noise to data, lower deviation is evidence that data may be massaged. second, data on covid- deaths from authoritarian governments do not follow benford's law, which describes the distribution of leading digits of numbers. deviations from this law are used to test for accounting fraud. 
smoothing and adjustments to covid- data may indicate other alterations to these data and a need to account for such alterations when tracking the disease. there are several possible explanations for the high burden in democracies. first, democracies are on an average richer (higher per capita income and health expenditure as a percentage of gross domestic product) than other regimes. they can afford more tests, resulting in higher case and death counts. second, democracies are more open to travel and trade. this facilitates the spread of covid- across borders. third, democracies may, idiosyncratically, have a larger elderly population, which is more vulnerable to covid- . fourth, most democratic countries are north of ° latitude. fifth, perhaps authoritarian regimes have greater control over their population. they may be better able to enforce social distancing and limit mobility, both of which reduce spread of the disease. these explanations presume that the data on covid- burden are reliable. however, the press has raised questions about the credibility of covid- data reported by countries. stories regarding data manipulation have emerged for china , iran , indonesia , and the us . therefore, it is important statistically to investigate the reliability of covid- data that is being reported across regimes. in democracies, with freedom of the press, separation of power, and an active opposition, there may exist checks and balances that prevent governments from manipulating the data. authoritarian regimes have greater latitude to manipulate data. such governments have been criticized, however, for manipulating other types of data [ ] [ ] [ ] [ ] . these governments have an incentive to use information as a form of social control [ ] [ ] [ ] [ ] [ ] . here we show evidence of manipulation of covid- data by authoritarian regimes relative to democratic regimes. first, data from authoritarian governments show significantly less variation from a day moving average. because governments have no reason to add noise to data, lower deviation is evidence that data may be massaged. second, data from authoritarian governments do not follow benford's law, which describes the distribution of leading digits of non-manipulated numbers. these discrepancies do not provide direct evidence that the lower burden on authoritarian governments is due to data manipulation. however, they do provide indirect evidence: these modifications likely have a purpose and a plausible reason is suppressing bad news. ensuring the credibility of data isn't a coronavirus specific concern. data manipulation has been a perennial concern in public health and economics. there are notable instances of data fabrication in research , disease surveillance , , and measurement of economic conditions [ ] [ ] [ ] [ ] . there are many statistical methods for detecting fraud , , . here we focus on two types of tests. one compares moments of the distribution of data across sites , [ ] [ ] [ ] , specifically variance , , , , [ ] [ ] [ ] . the other looks at digit preference that deviates from benford's law , , , . there is a strong positive association between fluctuations in the covid- data reported by different countries and their "democratic-ness". figure plots the natural logarithm of the mean of the squared deviation of daily cases and deaths per million people, respectively, from the day moving average against the eiu's overall democracy index score. 
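a minimal sketch (not the authors' code) of how the variability measure plotted in that figure could be computed from the daily time series is given below, in python with pandas; the column names ("country", "date", "new_cases", "population") and the 7-day window are illustrative assumptions, since the exact window length is not stated in this extract.

```python
import numpy as np
import pandas as pd

def variability_score(df: pd.DataFrame, value_col: str = "new_cases",
                      window: int = 7) -> pd.Series:
    """per-country natural log of the mean scaled squared deviation of a daily
    series from its centered moving average (window length assumed)."""
    df = df.sort_values(["country", "date"]).copy()
    # centered moving average of the daily series, computed within each country
    ma = (df.groupby("country")[value_col]
            .transform(lambda s: s.rolling(window, center=True, min_periods=1).mean()))
    # squared deviation from the moving average, plus one, scaled by population in millions
    df["scaled_sq_dev"] = ((df[value_col] - ma) ** 2 + 1) / (df["population"] / 1e6)
    # country-level summary: natural log of the mean daily scaled deviation
    return np.log(df.groupby("country")["scaled_sq_dev"].mean())
```

the per-country scores returned this way could then be plotted against the eiu democracy score, or the day-level log deviations could serve as the dependent variable in a regression with country-clustered standard errors, as the methods described below make explicit.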
not only do authoritarian regimes report fewer cases and deaths, but there also appears to be more random variation in the data of more democratic nations. aggregated data on covid- across all countries in each of the regime categories provide further visual evidence that there is less variation in case data in authoritarian regimes. figures s a and s b plot daily cases and deaths per million people around a -day centered moving average for those indicators, respectively, for each regime type. in addition to a lower rate of reported cases and deaths, there is almost no fluctuation in the data from authoritarian or hybrid regimes. variation in the data appears to increase as one moves to a higher category of democratic-ness. regression analysis (table ) confirms this pattern. although it is unlikely that features that affect the level of covid- burden also affect the variation in that burden, we estimate a version of the regression in table with controls for gdp per capita, health and trade as a percent of gdp, share of population over , and an indicator for countries above degrees latitude. while greater democratic-ness is no longer associated with additional variability in cases, it continues to be associated with significantly greater variability in daily deaths per million people (table s ). figure presents the results of our analysis for cumulative case and death data when the screening criterion is that growth in the day centered moving average is greater than . %. (results for other screening criteria are presented in tables s and are roughly consistent.) for cumulative cases, one cannot reject that benford's law describes the distribution of first digits for all regime types at p-values less than %. for cumulative deaths, however, one can reject that benford's law describes the distribution of first digits at p-values less than % for authoritarian regimes, hybrid regimes, and flawed democracies, while it cannot be rejected for full democracies. validation with ecdc data. all of the analyses reported above were also conducted with data from the ecdc, and the results are very similar. analysis of compliance with benford's law suggests that data on cases from authoritarian regimes, hybrid regimes, and flawed democracies comply but data on deaths do not, while for full democracies the data comply with benford's law for both cases and deaths. death counts may be more politically salient and, therefore, more subject to manipulation. first, because the infection fatality rate of covid- is close to %, cases are less consequential than covid- deaths. second, deaths better reflect state capacity than cases. total cases are largely determined by the transmissibility and infectiousness of the disease and by the total number of tests. total deaths are influenced, in addition to these factors, by the health infrastructure, including the availability of medical personnel and beds. governments may be able to credibly blame low levels of testing on global shortages rather than government policy. personnel and beds, however, require long-term investments in medical education and construction. therefore, a high death rate may imply the government has performed poorly for some time. this study has several limitations. one is that, while we establish an association between data smoothening and government regimes, there may be potential confounders not included here that could alter the conclusions of the study. second, no causal link has been established between government regimes and data smoothening.
third, the study does not present methods to obtain less biased estimates of cases. comparison of multiple sources of information or indirect methods of measuring covid- , such as sari cases or orders of caskets, are worth exploring. a fourth limitation is that the paper presents two major methods of detecting manipulation. there are others, and these may reveal a greater degree of manipulation. the results here raise significant questions about the reliability of the data being reported by different countries and highlights the need for a degree of caution when making projections using such data. it may be appropriate to put in place systems for ongoing monitoring for fraud as are used for clinical trials , [ ] [ ] [ ] [ ] . data. data on the type of regime in different countries come primarily from the and authoritarian regimes (scores ≤ ). given the arbitrary and discontinuous nature of the boundaries between these categories, we also directly use the numerical scores in our empirical analyses. we also employ data from other measures of democracy, such as freedom house's democracy, the varieties of democracy index, and the polity of the polity project; these are described in the supplement. for validating results from jhu data, we use data on cases and deaths from the european centre for disease prevention and control (ecdc). the ecdc data are similar to jhu, except that they do not contain presumptive positive cases, defined as cases that have been confirmed by state or local labs, though not by national labs such as the cdc . we do not employ world health organization (who) data on covid cases and deaths because of a change in the reporting time for who numbers on march , that makes it difficult to compare who number before and after that date. aggregate who data, on the one hand, and jhu and ecdc data, on the other, are very similar, with the exception of the period from february - , . we choose to use ecdc data rather than who data to validate results using jhu data because of errors found in the who data . country-level demographic and economic information (country-level per capita income, health and trade expenditure as a proportion of gdp, and the share of population over age ) for the year / are drawn from the world bank open database . missing values were substituted with regional averages. we used data from the independent states and two territories for which the eiu produced scores. this covered more than % of the world's population. covid- data was only available for countries, hong kong was classified as part of china, and there was no data for comoros, lesotho, north korea, tajikistan, and turkmenistan. these countries accounted for more than % of the total cases and deaths across the world. data availability. all the data used for this study are publically available and will be posted, along with code for all statistical analyses, will be posted in a github repository by the corresponding author. manipulation is to look for abnormal statistics (such as with the moments of the distribution) of the variable , [ ] [ ] [ ] . it is difficult to identify abnormal means because one may not observe actual cases separately from the numbers reported by countries. a challenge for identifying abnormal variation in data is that there is no obvious baseline for normal variation. however, because the virus may not care about regime type, a comparison of variation across regime types may highlight abnormalities. 
in general, differences in variation across regime types cannot a priori distinguish whether one type suppressed variation or another type added variation. however, it is unlikely that higher variation is associated with manipulation, because countries gain little from adding variation to their data [ ] [ ] [ ] . by contrast, manipulating data can lead to reduced variation if care is not taken to reintroduce "normal" levels of variation. therefore, we investigate whether authoritarian governments manipulate data by testing whether their covid- data are "smoothened" relative to those of democratic governments. to determine whether the difference in data variation between authoritarian and democratic regimes is statistically significant, we employ regression analysis. our dependent variable is a measure of variation in burden. we compose this variable in three steps. first, we calculate a day centered moving average of daily cases (deaths) for each day in each country. second, we calculate the square of the deviation of the observed daily cases (deaths) from that moving average for each country. third, we add one to the squared deviation, divide by population (millions), and take the natural logarithm. our treatment variable is either the country's score on the eiu's democracy index, freedom house's democracy score, the varieties of democracy index, or the polity score of the polity project. our regressions also include a constant. because our observations are at the country-day level, we cluster standard errors at the country level to account for autocorrelation in covid- burden. a second way to detect manipulation is to see whether data follow patterns that are common in non-manipulated data. one such pattern is that the leading significant digits of a number (its mantissa) follow a distribution such that pr(mantissa < t/10) = log10(t) for t in [1, 10). also known as benford's law, this distribution is obeyed by a wide assortment of data [ ] [ ] [ ] [ ] . data have been checked against this distribution to test for fraud in accounting data, campaign contributions, and scientific data. we investigate whether governments manipulate data by testing whether the covid- data on cumulative cases and deaths across the different regimes (authoritarian, hybrid, flawed democracy, and full democracy) conform to benford's law. before we can test covid- case and death data against benford's law, we must decide whether these data are appropriate to test against the law. a concern is that early during an epidemic and after infections plateau, the data will contain many repeated values. these repeats may be the result of true case counts but still violate the law. therefore, we look at portions of the time series of covid- data during which cases and deaths are rising. specifically, we test data ("screened data") during which the growth rate of the day moving average of cases and deaths is greater than some cutoff k, where k is %, . %, and %. to implement the test, we look only at the first digit of the screened case and death data. according to benford's law, pr(first significant digit = d) = log10(1 + 1/d), for d = 1, 2, ..., 9. we group countries into the four regime types (authoritarian, hybrid, flawed democracy, and full democracy) defined by the eiu's democracy index. within each category, we compare the observed frequency of each digit d in the case data against the frequency predicted by benford's law using a pearson's chi-squared test.
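an illustrative sketch (not the authors' implementation) of the screening step and the first-digit test is shown below; the 7-day window and the 5% growth cutoff are assumed example values standing in for the parameters not given in this extract.

```python
import numpy as np
import pandas as pd
from scipy.stats import chisquare

BENFORD_P = np.log10(1 + 1 / np.arange(1, 10))  # pr(first digit = d) for d = 1..9

def screen_rising(cumulative: pd.Series, window: int = 7, cutoff: float = 0.05) -> pd.Series:
    """keep cumulative counts on days when the centered moving average grows faster than `cutoff`."""
    ma = cumulative.rolling(window, center=True, min_periods=1).mean()
    return cumulative[ma.pct_change() > cutoff]

def benford_test(counts):
    """pearson chi-squared test of observed leading digits against benford's law."""
    digits = [int(str(int(c))[0]) for c in counts if c >= 1]  # leading significant digits
    observed = np.bincount(digits, minlength=10)[1:10]        # observed frequency of digits 1..9
    expected = BENFORD_P * observed.sum()                     # frequency predicted by benford's law
    return chisquare(observed, f_exp=expected)                # (chi-squared statistic, p-value)

# usage sketch: pool the screened cumulative-death series of all countries in one regime
# group (e.g. authoritarian) and test the pooled digits:
# stat, p_value = benford_test(pd.concat(screened_series_for_group))
```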
natural logarithm of the mean of squared deviations of observed daily cases and deaths per million people from a -day centered moving average, by eiu democracy index score. notes: case and death data are from johns hopkins university. the democracy index score is from the eiu's democracy index. we compute the day centered moving average of daily cases and deaths. we compute the square of the daily deviations of the observed cases (deaths) from the day centered moving average and add one to it. then for each country we divide this daily deviation by population per million, compute the mean for each country, and take the natural logarithm.

note: *** p< . , ** p< . , * p< . . % confidence intervals are in parentheses. the errors are clustered at the country level. our unit of analysis is the "country-date". the dependent variable is the natural logarithm of the squared deviation of the observed value from the day centered moving average plus one, per million people, for each country on a daily basis, from the date when the first case was noted till june , . the freedom house democracy score ranges from to ; to make it comparable to the eiu democracy score, it is divided by . the vdem score ranges from to ; to make it comparable to the eiu democracy score, it is multiplied by . similarly, the modified polity score ranges from - (strongly autocratic) to + (strongly democratic); to make it comparable, we add to the score and divide it by .

actual frequency of first significant digit in covid- total cases and deaths during periods when the day centered moving average grows faster than . % daily, frequency predicted by benford's law, and test of the difference, by regime type.

total covid cases, deaths per million people, and case fatality ratio, by government regime, over time.

new daily cases and deaths per million people and -day moving average of the same, by government regime.

ordinary least squares regression of deviations from a moving average on measures of democracy.

references:
the atlantic
the guardian
china's statistical system in transition: challenges, data problems, and institutional innovations
measuring economic growth from outer space
reconsidering regime type and growth: lies, dictatorships, and statistics
how much should we trust the dictator's gdp growth estimates?
the census and the limits of stalinist rule
why resource-poor dictators allow freer media: a theory and evidence from panel data
government control of the media
china's strategic censorship
informational autocrats
central statistical monitoring: detecting fraud in clinical trials
analysing the quality of routine malaria data in mozambique
incentives for reporting disease outbreaks
the role of biostatistics in the prevention, detection and treatment of fraud in clinical trials
statistical techniques to detect fraud and other data irregularities in clinical questionnaire data
fraud and misconduct in medical science
are these data real? statistical methods for the detection of data fabrication in clinical trials
a key risk indicator approach to central statistical monitoring in multicentre clinical trials: method development in the context of an ongoing large-scale randomized trial
detecting fabrication of data in a multicenter collaborative animal study
statistical techniques for the investigation of fraud in clinical research
statistical aspects of the detection of fraud
fraud and misconduct in biomedical research
the law of anomalous numbers
a statistical derivation of the significant-digit law
guidelines for quality assurance in multicenter trials: a position paper
ensuring trial validity by data quality assurance and diversification of monitoring methods
a statistical approach to central monitoring of data quality in clinical trials
data fraud in clinical trials
economist intelligence unit democracy index in relation to health services accessibility: a regression analysis
religion and volunteering in context: disentangling the contextual effects of religion on voluntary behavior
economic and political determinants of the effects of fdi on growth in transition and developing countries
political regime characteristics and transitions
covid- deaths and cases: how do sources compare?
world bank open data
forensic economics
assessing the integrity of tabulated demographic data
a taxpayer compliance application of benford's law
on the peculiar distribution of the us stock indexes' digits
the effective use of benford's law to assist in detecting fraud in accounting data
breaking the (benford) law
not the first digit! using benford's law to detect fraudulent scientific data

key: cord- - o svbms authors: urošević, vladimir; andrić, marina; pagán, josé a. title: baseline modelling and composite representation of unobtrusively (iot) sensed behaviour changes related to urban physical well-being date: - - journal: the impact of digital technologies on public health in developed and developing countries doi: . / - - - - _ sha: doc_id: cord_uid: o svbms
we present the grounding approach, deployment and preliminary validation of the elementary devised model of physical well-being in urban environments, summarizing the heterogeneous personal big data (on physical activity/exercise, walking, cardio-respiratory fitness, quality of sleep and related lifestyle and health habits and status, continuously collected for over a year mainly through wearable iot devices and survey instruments in global testbed cities) into composite domain indicators/indexes convenient for interpretation and use in predictive public health and preventive interventions. the approach is based on systematized comprehensive domain knowledge implemented through range/threshold-based rules from institutional and study recommendations, combined with statistical methods, and will serve as a representative and performance benchmark for evolution and evaluation of more complex and advanced well-being models for the aimed predictive analytics (incorporating machine learning methods) in subsequent development underway.
urban public health, well-being monitoring and prevention are currently being transformed from reactive to predictive and eventually long-term risk-mitigating systems, through a number of research initiatives and projects, such as the ongoing pulse project (participative urban living in sustainable environments, funded from the eu horizon programme), focusing on the chronic metabolic and respiratory diseases (such as type diabetes and asthma) affected or exacerbated by preventable or modifiable environmental and lifestyle factors, and on well-being/resilience.
a major challenge in the project is the modelling and assessment/prediction of citizen well-being from the collected and processed big data of unprecedented variety and from highly heterogeneous sources (health and vital activity personal data obtained through wearable devices and other sensing technologies, geo-located online surveys, open/public smart city datasets…), on individual and collective (population/cluster) levels. overall well-being and its main domains (vitality, supportive relationships, stress levels…) are all significant factors affecting the onset and exacerbation of the stated chronic diseases which are becoming more and more widespread and progressing in urban environments, and overall resilience of citizens and urban communities is increasingly important against other pertaining global and sustainability challenges, like climate change. the proposed and deployed elementary statistical model presented in this paper is to be the basis for interpretation and contextualization of changes to well-being, and a performance benchmark for evaluation and comparison of more complex and advanced well-being models of the aimed predictive analytics and final intelligent system (incorporating machine learning methods) in subsequent development, supporting the pulse phos (public health observatories established for the relevant policy making and execution in smart cities). the activity and vital/health parameters data measured mostly unobtrusively by wearable devices (wristbands, smartwatches) have particular significance for behaviour analysis and change recognition in pulse, as these are the input data streams with highest volume, acquisition "velocity", and temporal resolution/granularity of all the various data collected in the project, and therefore practically the most (and only) suitable data comprising the sufficiently continuous and non-sparse time series over months, to properly derive or construct the behavioural patterns and analyze behaviour changes. recent studies performed by the stated major wearable device manufacturers over billions of records of temporal measurements data [ , ], as well as the experiences from projects like the just concluded city age (www.city ageproject.eu) [ , ] , show the significance and general predictive ability of the measured main vital/health and activity parameters (walking, climbing stairs, physical activity/exercise, heart rate data, consumed calories…) for overall health and physiological/physical well-being assessment. the additional complementary socio-demographic, health, lifestyle/habits and environmental data in less frequent temporal resolution, ingested from the open/public datasets or manual "obtrusive" inputs, are combined to cross-check, adjust and improve integrity of the recognized behaviour changes derived from the main timeseries data acquired through the wearable devices. we adopt a combined knowledge-and data-driven approach in detection and characterization of relevant behaviours that denote significant variations in well-being, with multi-level hierarchical model topology and range/threshold based computational rules as basic primary formal knowledge structures, and statistical analytics as baseline (and performant) data-driven detection methods. 
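to make the knowledge-driven side concrete, a minimal python sketch of one range/threshold rule of the kind referred to above, and of its aggregation into a composite indicator, is given below; the 150-minute weekly target and the double weighting of vigorous minutes are assumed, who-style illustrative values and do not reproduce the project's actual rule set.

```python
def physical_activity_adherence(weekly_moderate_min: float, weekly_vigorous_min: float) -> float:
    """adherence in [0, 1] to an assumed weekly moderate-equivalent activity target."""
    target = 150.0  # assumed illustrative threshold (moderate-equivalent minutes per week)
    achieved = weekly_moderate_min + 2.0 * weekly_vigorous_min  # vigorous minutes counted twice
    return min(achieved / target, 1.0)

def composite_indicator(domain_scores: dict, weights: dict = None) -> float:
    """weighted average of per-domain adherence scores into a single composite index."""
    if weights is None:
        weights = {name: 1.0 for name in domain_scores}
    total_weight = sum(weights[name] for name in domain_scores)
    return sum(weights[name] * domain_scores[name] for name in domain_scores) / total_weight

# e.g. composite_indicator({"physical_activity": 0.8, "sleep": 0.6, "motility": 0.9})
```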
the complexity of human behaviours is commonly represented through multi-level hierarchical structured models, decomposed to more granular "units" like activities and action events [ , ] , with multiple variables from behavioural, physiological and environmental domains of well-being known to additionally increase complexity and dimensionality [ ] . there are other contending approaches, like the monitoring and analysis of individual well-being or behavioural domain indicators or determinants independently in parallel, without hierarchical structuring and substantial synthesis into fewer higher-level composite factors or score(s) [ ] . most, including the adopted and followed approach works (like [ ] ), are comprehensively covered or referenced in the encyclopedia of quality of life and well-being research (springer ) that summarizes recent research works related to well-being and quality of life in spanned various research and policy-making/implementation fields. main advantages of a few composite synthetic indicators/factors over a battery of multiple separate indicators, namely: • ability to summarize complex and multi-dimensional real-life phenomena or domains (like well-being), • easier for interpretation and comparison among (socio-demographic, geospatial/regional…) groups or population clusters, • more effective for comprehending overall trends, particularly when a number of the underlying indicators denote opposing-trend changes, are of crucial significance in usage and context settings of the pulse project, with over indicators formalized in the initial knowledge-based well-being model topology from the systematization of collected data, and with • visible set of indicators to various stakeholders (policy makers, researchers, general public) needing to be minimal without omitting important underlying information, • and collaboration, communication and comparison of complex dimensions by various stakeholders needing to be most straightforward, facilitated and effective. we therefore propose two complementary approaches for synthesis of the composite well-being indicators composed from underlying streamed iot-sourced timeseries data in the context of pulse. the indicators summarize multi-dimensional aspects of citizen well-being and enable the assessment of individual and synthesized collective urban well-being over time. the notion is illustrated through analysis within the scope of four representative and characteristical key summary indicators of citizen health and fitness, derived from activity and vital/health parameters measured, as stated, using wearable sensing devices: motility, physical activity, sleep quality and cardio-respiratory health/fitness (fig. ) . in the first approach, daily and intra-daily underlying measurements (table ) are used to estimate levels of adherence to rule-and range-based recommendations matured from institutional knowledge of relevant authorities and population-significant studies in the field, accumulated for over decades in the stated four example domains of motility, physical activity, sleep quality and cardio-respiratory fitness [ , , ] . the complementary data-driven statistical approach is predicated on standard scores that denote the number of standard deviations that a given measurement deviates from the sample mean. 
this approach allows comparison of individual scores to the corresponding norm groups stratified by common socio-demographic parameters (age, gender…), when considered conditionally independent nodes in the complete model topology. it also allows to place a score for any individual and variable with respect to alternative descriptive statistic or measure of central tendency (variable median, geometric mean, standard deviation or error), so that more accurate or optimal comparison for specific variable distribution can be made. the data are collected by several types of health and fitness wearable tracker devices manufactured by fitbit, garmin and asus, monitoring physical and walking activity, sleep and heart/cardio parameters, for over recruited citizens participating in the study in global testbed cities (barcelona, birmingham, new york, paris, pavia(italy), singapore, and keelung/taipei), supplied with wearable tracker devices by the project. physical activity level as a single measure is mainly expressed in terms of time spent and calories burned while performing light/soft, moderate, and intense/vigorous physical activity. walking activity is captured with walked steps, distance, speed, and climbed stairs/floors measurements. heart rate measures capture time spent in different target heart rate zones (like peak, cardio, fat burn), resting and maximal heart rate (hr max ), and some still experimental measures like systolic and diastolic blood pressure, measured by the newest recently released devices such as asus vivowatch bp, but not yet acquired in significant volume sufficient for analysis. the peak heart rate zone by default definition ranges from to % of person's maximum heart rate (hr max ), the cardio zone ranges from to % of hr max , and the fat burn zone ranges from to % of hr max . sleep quality/hygiene measures mainly capture time spent in defined phases of sleep. all the processed measures are listed in table , collected or aggregated with a default daily periodicity, except the ones in gray-shaded rows which are acquired in higher intra-daily temporal resolution, mostly once in every min or up to once in a minute, depending on the variable. in addition to stated behavioral time-series data, an extensive set of personal sociodemographic, profile (age, gender, ethnicity, educational and marital status, employment status and occupational environment…), as well as health state, risk factors and habits, lifestyle, neighbourhood and quality of life assessment, and other relevant behavioural data are manually input/submitted on each citizen participating in the study via online forms, composed from adapted relevant survey/assessment instruments for each specific field, like framingham, euroqol- , ipaq-sf. these data are geolocalized to the residence location of each responding citizen for the purposes of analytics of collective/community well-being, and are collected from a greater number of recruited respondents, but just in rare cases in more than one iteration over time due to the high number and scope of covered variables, and therefore suitable for a broad but mainly static "snapshot" assessment of current well-being state rather than for behaviour change model and analytics. incorporating both these static and iot-sensed temporal data into a fully comprehensive predictive well-being model is an ongoing task in progress throughout the end of the project, with results to be presented in other upcoming publications. physical activity in our first approach stated above in sect. 
can be discretized using several common baseline categorizations related or derived from the above mentioned relevant institutional/governmental and professional expert guidelines for the urban population groups. the example approach taken in the recent health survey of england from [ ] compared well-being and mental health of adults in different sociodemographically stratified population groups by physical activity, among others. the activity level categories used in the analysis were the following: • meets aerobic guidelines: at least min moderately intensive physical activity or min vigorous activity per week or an equivalent combination of these • asserted activity: to min moderate activity or - min vigorous activity per week or an equivalent combination of these • low activity: to min moderate activity or to min vigorous activity per week or an equivalent combination of these • inactive: less than min moderate activity or less than min vigorous activity per week or an equivalent combination of these, and the corresponding linear scaled scoring function denotes "meets aerobic guidelines" with a score of , "certain activity" - , "low activity" - , and "inactive" with . this baseline scoring scale, besides sufficient granularity and robustness exhibited in referenced comprehensive studies, is also convenient for • mapping to the defined activity level categories used as input parameters for the consensus models for prediction of risk of type diabetes (t d) and asthma onset and exacerbation, developed for the pulse project [ , ] • quantification of longer-term and/or periodic activity level behaviours directly from categorized daily or incidental activity level values as measured and acquired from the wearable tracker devices through relevant apis (light/soft, moderate, and vigorous/intense activity). similarly, the authors in [ ] and [ ] demonstrate the following referent threshold ranges of the number of daily walked steps to be used for classification of walking activity in healthy adults, and the corresponding scoring function linearly assigning the following - integer scores to the classification categories: highly active ( , or more steps/day) - , active ( , - , steps/day) - , somewhat active ( , - , steps/day) - , lowly active ( , - , steps/day) - , and sedentary (under , steps/day) - . vo max is the metric denoting the maximum amount of oxygen that an individual can use during intense exercise. it is widely and commonly used as an indicator of cardiorespiratory fitness. a simple generic estimate of vo max of an individual can be obtained using their maximum and resting heart rates in the following formula, publised in [ ] : vo max % hr max hr rest à : ml=ðkg à minÞ ð Þ where hr max can be crudely estimated as age of the person. a relatively standard convenient and meaningful categorization of vo max for western european and usa populations can be on a scale from -very low, through -low, -fair, -moderate, -good, and -very good, to maximal -elite, depending of the individual's gender and age, with common categorization for males and females aged to published by shvartz & reibold in [ ] . total average sleep duration in h is a straightforward direct metric for assessing the quality of sleep in terms of longer-term stable behaviour across complete populations. 
the us national sleep foundation recently provided the following referent expert sleep duration recommendations (in terms of recommended (or not) threshold values for both oversleep and undersleep), categorized by precise granular age ranges [ ] : these recommendations categorize possible output sleep duration times as either recommended, may be appropriate, or not recommended, and the optimal recommended duration is - h for majority of the populations. additionally, relevant recent findings like the extensive meta-analysis performed by the american diabetes association to assess the dose-response relationship between sleep duration and risk of type diabetes [ ] , have concluded that the lowest type diabetes risk is for the average overnight sleep duration from to under h per day, and that both shorter and longer sleep durations than this optimum range denote up to . times increased risk (and up to times increased cardiac conditions risk shown in the related studies [ ] , also relevant in the project). we therefore slightly alter the may be appropriate category from the otherwise adopted recommendations from table above to mildly risky, reflecting the importance of stated health risks in pulse, and the effect of common or periodically repeated behaviour patterns over months or years to the exasperation of the risks. this categorization will also consequently be communicated on the data visualizations and public health/prevention interventions and campaigns deployed and administered through relevant pulse system applications and modules (pho dashboards, pulsair gamified mobile app.) towards the citizens and urban communities, and the resulting function scores assigned to the categories are therefore: for not recommended, . for mildly risky, and for recommended, inversely proportional to the pesimistically estimated risks increase brought by shorter durations. complex eventual relations of detailed specific measured sleep parameters to well-being will be explored by more advanced methods in other subsequent work. from several existing elementary statistical approaches for aggregating the underlying dimensional indices and constructing the summary composite indicator value, we consider the weighted geometric mean of the four constituting dimensional indices as most adequate and appropriate for this specific well-being problematics: where summary dimensional indices denoted by the scoring function values are: i wwalking activity index, i pphysical activity/exercise index, i ssleep duration index, i ccardio-respiratory fitness index (through vo max), and wt w , wt p , wt s , and wt c are respective weight factors, derived from expert assessments and rank data from relevant previous studies and experience, and assigned to adjust the relative importance and contribution of each of the indices to the resulting composite indicator value, per compositing methods outlined in [ ] and [ ] , or for derivation of composite un human development index (hdi). as all constituting indices are directly proportional to the resulting composite indicator (i.e. the higher the activity levels or cardiorespiratory fitness scores, the higher the well-being), and low value of either of the four is significant for decreased overall composite (although there is some correlation between the indices -e.g. 
decrease in cardio-respiratory fitness in most cases causes decreased activity levels as well), the geometric mean is adequate for its sensitivity to low values of each individual constituting index, and ability to combine values on completely different scales without normalization required. initially assigned values of weight factors are . for i w , for i p , and . for i s and i c , taking into account the importance of specific indices for respiratory disease and t d risk, volatility of the collected data by now, and known overestimation of some measured variable values (like number of walked steps, vo max estimate, or recognized sessions of cycling and some other exercise types) by the predominantly used wearable devices -fitbit charge [ ] . the weight factors are set as configuration parameters in the model, so they can be changed to fine-tune the composition according to the data insights acquired over time or the results of the validation described in sect. below. time series of the values of the composite indicator are formed from weekly and monthly aggregations of underlying daily and intra-daily measurements into the constituting index values. method for computing those values from the measured values of variables listed in table above is as developed and introduced in [ ] for synthesis of indicators and geriatric factors from the same source iot data, based in this case on univariate normalization of relative changes (quantified in standard scores, as stated above in the "approach" sect. ) of acquired big temporal data during the complete study period, and then multivariate weighted linear aggregation of obtained normalized indicators and descriptive statistics into higher-level composite factors, to capture weekly and monthly behavioural patterns and trends, less susceptible to influence of outliers and ocassional notably deviating values. validation metric is the correlation with specific corresponding summary measure(s) of current well-being, self-reported by the respondent citizens through web and mobile app. questionnaires as mentioned above. they can be summarized from two relevant subset questionnaires: ) european social survey (ess), and ) euroqol- d (eq ) survey instrument, both standardized (with minor adaptations) and common for measuring well-being in multiple continuous and/or repeated relevant europe-wide and national-level studies, and robust to some degree against extreme fully subjective bias. statements of the ess questionnaire broadly cover social and most of the other aspects of personal and community well-being that the respondent rates on a -degree likert scale (strongly agree- , agree- , neither agree nor disagree- , disagree- , strongly disagree- ), the total possible questionnaire score thus ranging from , denoting the lowest/worst well-being, to representing the optimum. eq instrument is focused on physical and mental health status and daily life activities measured in dimensions (mobility, self-care, usual activities, pain/discomfort and anxiety/depression), also self-rated on a -degree scale from perceived worst to best like in ess. last question asks for assessment of the respondent's overall health state (hsa) on the current day, on the scale from (worst) to (best imaginable), also mapped to a number ranged from to by the formula + * hsa/ for the purpose of this evaluation. total cumulative eq score thus ranges from to . 
figure below shows the correlation scatter plot of ess scores and composite well-being indices for respondents of which filled ees questionnaire twice, two filled it three times, and the rest only once during the observed period of months. figure shows the relationship between the composite well-being indices and the obtained eq scores of respondents, of which filled the questionnaire three times, filled it twice, and the rest once during the observed data collection period. the analysis reveals a medium positive correlation of . (with the p-value of approx. .  − ) between our constructed composite index and cumulative eq scores. the composite indicator and its constituting domain indices can therefore be considered promising for their intended purpose of basic represention of the urban physical well-being aspects modelled from a variety of heterogeneous underlying activity and health/vital parameters measured by iot wearable devices, summarized in main dimensional and one overall derived score convenient for comparisons, interpretation and presentation to the end-users, particularly in the required shortest most concise manner and form, such as through a mobile app. ui or intervention messages. as almost half of the questions in eq are very remotely or not at all related to the physical well-being aspects summarized by the composite indicator, the correlation is expected to increase when the ongoing work in incorporating social and other wellbeung aspects fully in the model is completed. found small positive correlation of . (p-value . ) between the composite indicator and cumulative scores of ess questionnaire (in which most of the questions are not related to physical well-being) additionally points to the significance of this composite indicator to the overall well-being. the work also continues on the cleaning and pre-processing the data collected on the remaining monitored citizens, incorporation of machine learning methods in the model and exploration and modelling of the influence of detailed sleep and cardiac parameters, as well as of the sensed ambiental data, on the main well-being domains. the images or other third party material in this chapter are included in the chapter's creative commons license, unless indicated otherwise in a credit line to the material. if material is not included in the chapter's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. ageing-friendly cities for assessing older adults' decline: iot-based system for continuous monitoring of frailty risks using smart city infrastructure a critical analysis of an iot-aware aal system for elderly monitoring semantic and fuzzy modelling for human behaviour recognition in smart spaces a scalable hybrid activity recognition approach for intelligent environments comparative impact study of the european social survey (ess) european research infrastructure consortium (eric) systematic scoping review of indicators of community wellbeing in the uk composite index construction national sleep foundation's sleep time duration recommendations: methodology and results summary aerobic fitness norms for males and females aged to years: a review health survey for england : well-being and mental health. 
nhs digital health and social care information centre and the uk office for national statistics hapt d: high accuracy of prediction of t d with a model combining basic and advanced data depending on availability how many steps/day are enough? for adults estimation of vo max from the ratio between hrmax and hrrest -the heart rate ratio method a bayesian network analysis of the probabilistic relations between risk factors in the predisposition to type diabetes sleep duration and risk of type diabetes: a meta-analysis of prospective studies indicators and methods for constructing a us human well-being index (hwbi) for ecosystem services research assessing the ability of the fitbit charge to accurately predict vo max sleep duration and risk of stroke events and stroke mortality: a systematic review and meta-analysis of prospective cohort studies data driven mci and frailty prevention: geriatric modelling in the city age project empowering citizens through perceptual sensing of urban environmental and health data following a participative citizen science approach acknowledgment. this work has received funding from the european union's horizon research and innovation programme under the grant agreement no. (pulse). the performed research studies have all been granted ethical approval from the relevant irb authority in each pulse pilot testbed city (ethics committee of the parc de salut mar hospital in barcelona, through nhs health research authority iras (integrated research application system) in birmingham, new york academy of medicine irb in new york city, etc.), resulting from comprehensive multimonthly evaluation processes. the inclusion and exclusion methods and criteria for recruiting citizens have been specified in relevant previous publications of the project, such as in section . . . participation criteria in [ ] . key: cord- -tpqsjjet authors: nan title: section ii: poster sessions date: - - journal: j urban health doi: . /jurban/jti sha: doc_id: cord_uid: tpqsjjet nan food and nutrition programs in large urban areas have not traditionally followed a systems approach towards mitigating food related health issues, and instead have relied upon specific issue interventions char deal with downstream indicators of illness and disease. in june of , the san francisco food alliance, a group of city agencies, community based organizations and residents, initiated a collahorarive indicator project called rhe san francisco food and agriculture assessment. in order to attend to root causes of food related illnesses and diseases, the purpose of the assessment is to provide a holistic, systemic view of san francisco\'s food system with a focus on three main areas that have a profound affect on urban public health: food assistance, urban agriculture, and food retailing. using participatory, consensus methods, the san francisco food alliance jointly developed a sec of indicators to assess the state of the local food system and co set benchmarks for future analysis. members collected data from various city and stare departments as well as community based organizations. through the use of geographic information systems software, a series of maps were created to illustrate the assets and limitations within the food system in different neighborhoods and throughout the city as a whole. 
this participatory assessment process illustrates how to more effectively attend to structural food systems issues in large urban areas by ( t) focusing on prevention rather than crisis management, ( ) emphasizing collaboration to ensure institutional and structural changes, and ( ) aptly translating data into meaningful community driven prevention activities. to ~xplore the strategies to overcome barriers to population sample, we examined the data from three rapid surveys conducted at los angeles county (lac). the surveys were community-based partic· patory surveys utilizing a modified two-stage cluster survey method. the field modifications of the method resulted in better design effect than conventional cluster sample survey (design effect dose to that if the survey was done as simple random sample survey of the same size). the surveys were con· ducte~ among parents of hispanic and african american children in lac. geographic area was selected and d .v ded int.o small c~usters. in the first stage, clusters were selected with probability proportionate to estimated size of children from the census data. these clusters were enumerated to identify and develop a list of households with eligible children from where a random sample was withdrawn. data collectmn for consented respondents involved - minutes in-home interview and abstraction of infor· ma~ion from vaccine record card. the survey staff had implemented community outreach activities designed to fost~r an~ maintain community trust and cooperation. the successful strategies included: developing re.lat on .w. th local community organizations; recruitment of community personnel and pro· vide them with training to conduct the enumeration and interview; teaming the trained community introduction: though much research has been done on the health and social benefits of pet ownership for many groups, there have been no explorations of what pet ownership can mean to adults who are marginalized, living on fixed incomes or on the street in canada. we are a community group of researchers from downtown toronto. made up of front line staff and community members, we believe that community research is important so that our concerns, visions, views and values are presented by us. we also believe that research can and should lead to social change. method: using qualitative and exploratory methods, we have investigated how pet ownership enriched and challenged the lives of homeless and transitionally housed people. our research team photographed and conducted one-on-one interviews with pet owners who have experienced home· lessncss and live on fixed incomes. we had community participation in the research through a partnership with the fred vicror centre camera club. many of the fred victor centre camera club members have experienced homelessness and being marginalized because of poverty. the members of the dub took the photos and assisted in developing the photos. they also participated in the presenta· tion of our project. results: we found that pet ownership brings important health and social benefits to our partici· pants. in one of the most poignant statements, one participant said that pet ownership " ... stops you from being invisible." another commented that "well, he taught me to slow down, cut down the heavy drugs .. " we also found that pet ownership brings challenges that can at times be difficult when one is liv· ing on a fixed income. 
we found that the most difficult thing for most of the pet owners was finding affordable vet care for their animals. conclusion: as a group, we decided that research should only be done if we try to make some cha.nges about what we have learned. we continue the project through exploring means of affecting social change--for example, ~eti.tions and informing others about the result of our project. we would like to present our ~mdmgs and experience with community-based participatory action resea.rch m an oral. presentarton at yo~r conference in october. our presentation will include com· mumty representation ~f. both front-hne staff and people with lived experience of marginalization and homdessness. if this is not accepted as an oral presentation, we are willing to present the project m poster format. introduction the concept of a healthy city was adopted by the world health organization some time ago and it includes strong support for local involvement in problem solving and implementation of solutions. while aimed at improving social, economic or environmental conditions in a given community, more significantly the process is considered to be a building block for poliq reform and larger scale 'hange, i.e. "acting locally while thinking globally." neighbourhood planning can he the entry point for citizens to hegin engaging with neighbours on issues of the greater common good. methods: this presentation will outline how two community driven projects have unfolded to address air pollution. the first was an uphill push to create bike lanes where car lanes previously existed and the second is an ongoing, multi-sectoral round table focused on pollution and planning. both dt•monstrate the importance of having support with the process and a health focus. borrowing from traditions of "technical aid"• and community development the health promoter /planner has incorporated a range of "determinants of health" into neighbourhood planning discussions. as in most urban conditions the physical environment is linked to a range of health stressors such as social isolation, crowding, noise, lack of open space /recreation, mobility and safety. however typical planning processes do not hring in a health perspective. health as a focus for neighbourhood planning is a powerful starti_ng point when discussing transportation planning or changing land-uses. by raising awareness on determmants of health, citizens can begin to better understand how to engage in a process and affect change. often local level politics are involved and citizens witness policy change in action. the environmental liaison committee and the dundas east hike lanes project resulted from local level initiatives aimed at finding solutions to air pollution -a priority identified hy the community. srchc supported the process with facilitation and technical aid. _the processs had tangible results that ultimately improve living conditions and health. •tn the united kmgdom plannm in the 's established "technical aid" offices much like our present day legal aid system to provide professional support and advocacy for communities undergoing change. p - (c) integrating community based research: the experience of street health, a community service agency i.aura cowan and jacqueline wood street health began offering services to homeless men and women in east downtown ~oronto in . 
nursing stations at drop-in centres and shelters were fo~lowed by hiv/aids prevent ~, harm reduction and mental health outreach, hepatitis c support, sleeping bag exchange, and personal tdennfication replacement and storage programs. as street health's progi;ams expanded, so to~ did the agency:s recognition that more nee~ed t~ be done to. address the underl~ing causes of, th~ soct~l and economic exclusion experienced by its clients. knowing t.h~t. a~voca~y ts. helped by . evtd~nce , street he.alt~ embarked on a community-based research (cbr) initiative to dent fy commumty-dnven research priorities within the homeless and underhoused population. methods: five focus groups were conducted with homeless people, asking participants to identify positive and negative forces in their lives, and which topics were important to take action on and learn more about. findings were validated through a validation meeting with participants. results: participants identified several important positive and negative forces in their lives. key positive forces included caring and respectful service delivery, hopefulness and peer networks. key negative forces included lack of access to adequate housing and income security, poor service delivery and negative perceptions of homeless people. five topics for future research emerged from the process, focusing on funding to address homelessness and housing; use of community services for homeless people; the daily survival needs of homeless people and barriers to transitioning out of homelessness; new approaches to service delivery that foster empowerment; and policy makers' understanding of poverty and homelessness. conclusions: although participants expressed numerous issues and provided much valuable insight, definitive research ideas and action areas were not clearly identified by participants. however, engagement in a cbr process led to some important lessons and benefits for street health. we learned that the community involvement of homeless people and front-line staff is critical to ensuring relevance and validity for a research project; that existing strong relationships with community parmers are essential to the successful implementation of a project involving marginalized groups; and that an action approach focusing on positive change can make research relevant to directly affected people and community agency staff. street health benefited from using a cbr approach, as the research process facilitated capacity building among staff and within the organization as a whole. p - (c) a collaborative process to achieve access to primary health care for black women and women of colour: a model of community based participartory research notisha massaquoi, charmaine williams, amoaba gooden, and tulika agerwal in the current healthcare environment, a significant number of black women and women of color face barriers to accessing effective, high quality services. research has identified several issues that contribute to decreased access to primary health care for this population however racism has emerged as an overarching determinant of health and healthcare access. this is further amplified by simultaneous membership in multiple groups that experience discrimination and barriers to healthcare for example those affected by sexism, homelessness, poverty, homophobia and heterosexism, disability and hiv infection. 
the collaborative process to achieve access to primary health care for black women and women of colour project was developed with the university of toronto faculty of social work and five community partners using a collaborative methodology to address a pressing need within the community ro increase access to primary health care for black women and women of colour. women's health in women's hands community health centre, sistering, parkdale community health centre, rexdale community health centre and planed parenthood of toronto developed this community-based participatory-action research project to collaboratively barriers affecting these women, and to develop a model of care that will increase their access to health services. this framework was developed using a process which ensured that community members from the target population and service providers working in multiculrural clinical settings, were a part of the research process. they were given the opportunity to shape the course of action, from the design of the project to the evaluation and dissemination phase. empowerment is a goal of the participatory action process, therefore, the research process has deliberately prioritized _ro enabling women to increase control over their health and well-being. in this session, the presenters will explore community-based participatory research and how such a model can be useful for understanding and contextualizing the experiences of black women and women of colour. they will address. the development and use of community parmerships, design and implementation of the research prorect, challenges encountered, lessons learnt and action outcomes. they will examine how the results from a collaborative community-based research project can be used as an action strategy to poster sessions v address che social determinants of women health. finally the session will provide tools for service providers and researchers to explore ways to increase partnerships and to integrate strategies to meet the needs of che target population who face multiple barriers to accessing services. lynn scruby and rachel rapaport beck the purpose of this project was ro bring traditionally disenfranchised winnipeg and surrounding area women into decision-making roles. the researchers have built upon the relationships and information gachered from a pilot project and enhanced the role of input from participants on their policy prioriries. the project is guided by an advisory committee consisting of program providers and community representatives, as well as the researchers. participants included program users at four family resource cencres, two in winnipeg and two located rurally, where they participated in focus groups. the participants answered a series of questions relating to their contact with government services and then provided inpuc as to their perceptions for needed changes within government policy. following data analysis, the researchers will return to the four centres to share the information and continue che discussion on methods for advocating for change. recommendations for program planning and policy development and implementation will be discussed and have relevance to all participants in the research program. women's health vera lefranc, louise hara, denise darrell, sonya boyce, and colleen reid women's experiences with paid and unpaid work, and with the formal and informal economies, have shifted over the last years. 
in british columbia, women's employability is affected by government legislation, federal and provincial policy changes, and local practices. two years ago we formed the coalition ior women's economic advancement to explore ways of dealing with women's worsening economic situations. since the formation of the coalition we have discussed the need for research into women's employabilicy and how women were coping and surviving. we also identified how the need to document the nature of women's employability and reliance on the informal economy bore significanc mechodological and ethical challenges. inherent in our approach is a social model of women's health that recognizes health as containing social, economic, and environmental determinants. we aim to examine the social contexc of women's healch by exploring and legitimizing women's own experiences, challenging medical dominance in understandings of health, and explaining women's health in terms of their subordination and marginalizacion. through using a feminist action research (far) methodology we will explore the relationship between women's employability and health in communities that represent bricish columbia's social, economic, cultural/ethnic, and geographic diversities: skidegate, fort st . .john, lumhy, and surrey. over the course of our year project, in each community we will establish and work with advisory committees, hire and train local researchers, conduct far (including a range of qualitative methods), and support action and advocacy. since the selected communities are diverse, the ways that the research unfolds will ·ary between communities. expected outcomes, such as the provision of a written report and resources, the establishment of a website for networking among the communities, and a video do.:umentary, are aimed at supporting the research participants, coalition members, and advisory conuniuces in their action efforts. p t (c) health & housing: assessing the impact of transitional housing for people living with hiv i aids currently, there is a dearth of available literature which examines supporrive housing for phas in the canadian context. using qualitative, one-on-one interviews we investigace the impact of transitional housing for phaswho have lived in the up to nine month long hastings program. our post<'r pr<·senta-t on will highlight research findings, as well as an examination of transitional housing and th<· imp;kt it has on the everyday lives of phas in canada. this research is one of two ground breaking undertakings within the province of ontario in which fife house is involved. p - (c) eating our way to justice: widening grassroots approaches to food security, the stop community food centre as a working model charles l.evkoe food hanks in north america have come co play a central role as the widespread response to growing rates of hunger. originally thought to be a short term-solution, over the last years, they have v poster sessions be · · · · d wi'thi'n society by filling the gaps in the social safety net while relieving govemcome mst tut ona ze . . . t f the ir responsibilities. dependent on corporate donations and sngmauzmg to users, food banks men so th' . · i i . are incapable of addressing the structural cause~ of ~u~ger. s pres~ntation w e~~ ore a ternanve approaches to addressing urban food security while bmldmg more sustamabl.e c~mmumt es. 
i:nrough the f t h st p community food centre, a toronto-based grassroots orgamzanon, a model is presented case e h'l k' b 'id · b that both responds to the emergency food needs of communities w e wor mg to. u ~ sustama le and just food system. termed, the community food centre model. (cfc), ~he s~op is worki?g to widen its approach to issues of food insecurity by combining respectful ~ rect service wit~ com~~mty ~evelop ment, social justice and environmental sustainability. through this approach, various critical discourses around hunger converge with different strategic and varied implications for a~ion. as a plac~-based organization, the stop is rooted within a geographical space and connected directly to a neighbourhood. through working to increase access to healthy food, it is active in maintaining people's dignity, building a strong and democratic community and educating for social change. connected to coalitions and alliances, the stop is also active in organizing across scales in connection with the global food justice movement. inner city shelter vicky stergiopoulos, carolyn dewa, katherine rouleau, shawn yoder, and lorne tugg introduction: in the city of toronto there are more than , hostel users each year, many with mental health and addiction issues. although shelters have responded in various ways to the health needs of their clients, evidence on the effectiveness of programs delivering mental health services to the home· less in canada has been scant. the objective of this community based research was to provide a forma· tive evaluation of a multi-agency collaborative care team providing comprehensive care for high needs clients at toronto's largest shelter for homeless men. methods: a logic model provided the framework for analysis. a chart review of clients referred over a nine month period was completed. demographic data were collected, and process and outcome indicators were identified for which data was obtained and analyzed. the two main outcome measures were mental status and housing status months after referral to the program. improvement or lack of improvement in mental status was established by chart review and team consensus. housing outcomes were determined by chart review and the hostel databases. results: of the clients referred % were single and % were unemployed. forty four percent had a psychiatric hospitalization within the previous two years. the prevalence of severe and persistent men· tal illness, alcohol and substance use disorders were %, % and % respectively. six months after referral to the program % of clients had improved mental status and % were housed. logistic regression controlling for the number of general practitioner and psychiatrist visits, presence of person· ality or substance use disorder and treatment non adherence identified two variables significantly associ· ated mental status improvement: the number of psychiatric visits (or, . ; % ci, . - . ) and treatment non adherence (or, . ; % ci, . - . ). the same two variables were associated with housing outcomes. history of forensic involvement, the presence of a personality or substance use disorder and the number of visits with a family physician were not significantly associated with either outcome. conclusions: despite the limitations in sample sire and study design, this study can yield useful informa· tion to program planners. our results suggest that strategies to improve treatment adherence and access to mental health specialists can improve outcomes for this population. 
although within primary care teams the appropriate collaborative care model for this population remains to be established, access to psychiatric follow up, in addition to psychiatric assessment services, may be an important component of a successful program. mount sinai hospital (msh) has become one of the pre-eminent hospitals in the world by contributing to the development of innovative approaches to effective health care and disease prevention. recently, the hospital has dedicated resources towards the development of a strategy aimed at enhancing the hospital's integration with its community partners. this approach will better serve the hospital in the current health care environment where local health integration networks have been struck to enhance and support local capacity to plan, coordinate and integrate service delivery. msh has had early success with developing partnerships. these alliances have been linked to programs serving key target populations with _estabhshe~. points of access to msh. recognizing the need to build upon these achievements to remain compe~mve, the hospital has developed a community integration strategy. at the forefront of this strategy is c.a.r.e (community advisory reference engine): the hospital's compendium of poster sessions v community partners. as a single point of access to community partner information, c.a.r.e. is more than a database. c.a.r.e. serves as the foundation for community-focused forecasting and a vehicle for inter and intra-organizational knowledge transfer. information gleaned from the catalog of community parmers can be used to prepare strategic, long-term partnership plans aimed at ensuring that a comprehensive array of services can be provided to the hospital community. c.a.r.e. also houses a permanent record of the hospital's alliances. this prevents administrative duplication and facilitates the formation of new alliances that best serve both the patient and the hospital. c.a.r.e. is not a stand-alone tool and is most powerful when combined with other aspects of the hospital's community integration strategy. it iscxpected that data from the hospital's community advisory committees and performance measurement department will also be stored alongside stakeholder details. this information can then be used to drive discussions at senior management and the board, ensuring congruence between stakeholder, patient and hospital objectives. the patient stands to benefit from this strategy. the unique, distinct point of reference to a wide array of community services provides case managers and discharge planners with the information they need to connect patients with appropriate community services. creating these linkages enhances the patient's capacity to convalesce in their homes or places of residence and fosters long-term connections to neighborhood supports. these connections can be used to assist with identifying patients' ongoing health care needs and potentially prevent readmission to hospital. introduction: recruiting high-risk drug users and sex workers for hiv-prevention research has often been hampered by limited access to hard to reach, socially stigmatized individuals. our recruitment effom have deployed ethnographic methodology to identify and target risk pockets. 
in particular, ethnographers have modeled their research on a street-outreach model, walking around with hiv-prevention materials and engaging in informal and structured conversations with local residents, and service providers, as well as self-identified drug users and sex workers. while such a methodology identifies people who feel comfortable engaging with outreach workers, it risks missing key connections with those who occupy the margins of even this marginal culture. methods: ethnographers formed a women's laundry group at a laundromat that had a central role as community switchboard and had previously functioned as a party location for the target population. the new manager helped the ethnographers invite women at high risk for hiv back into the space, this time as customers. during weekly laundry sessions, women initiated discussions about hiv-prevention, sexual health, and eventually, the vaccine research for which the center would be recruiting women. ra.its: the benefits of the group included reintroducing women to a familiar locale, this time as customers rather than unwelcome intruders; creating a span of time (wash and dry) to discuss issues important to me women and to gather data for future recruitment efforts; creating a location to meet women encountered during more traditional outreach research; establishing the site as a place for potential retention efforts; and supporting a local business. women who participated in the group completed a necessary household task while learning information that they could then bring back to the community, empowering them to be experts on hiv-prevention and vaccine research. some of these women now assist recruitment efforts. the challenges included keeping the group women-only, especially after lunch was provided, keeping the membership of the group focused on women at risk for hiv, and keeping the women in the group while they did their laundry. conclusion: public health educators and researchers can benefit from identifying alternate congregation sites within risk pockets to provide a comfortable space to discuss hiv prevention issues with high-risk community members. in our presentation, we will describe the context necessary for similar research, document the method's pitfalls and successes, and argue that the laundry group constitutes an ethical, respectful, community-based method for recruitment in an hiv-prevention vaccine trial. p - t (c) upgrading inner city infrastructure and services for improved environmental hygiene and health: a case of mirzapur in u.p. india madhusree mazumdar in urgency for agricultural and industrial progress to promote economic d.evelopment follo_wing independence, the government of india had neglected health promotion and given less emphasis on infrastructure to promote public health for enhanced human pro uct v_ity. ong wit r~p m astrucrure development, which has become essential if citie~ are to. act ~s harbmger.s of econ~nuc ~owth, especially after the adoption of the economic liberalization policy, importance _is a_lso ~emg g ve.n to foster environmental hygiene for preventive healthcare. the world health orga~ sat ~ is also trj:' ! g to help the government to build a lobby at the local level for the purpose by off~rmg to mrroduce_ its heal.thy city concept to improve public health conditions, so as to reduce th_e disease burden. 
this paper is a report of the efforts being made towards such a goal: the paper describes a case study of a small city of india called mirzapur, located on the banks of the river ganga, a major lifeline of india, in the eastern part of the state of uttar pradesh, where action for improvement began by building better sanitation and environmental infrastructure as per the ganga action plan, but continued with an effort to promote preventive healthcare for overall social development through community participation in and around the city.

asthma physician visits in toronto, canada
tara burra, rahim moineddin, mohammad agha, and richard glazier
introduction: air pollution and socio-economic status are both known to be associated with asthma in concentrated urban settings, but little is known about the relationship between these factors. this study investigates socio-economic variation in ambulatory physician consultations for asthma and assesses possible effect modification of socio-economic status on the association between physician visits and ambient air pollution levels for children aged to and adults aged to in toronto, canada between and . methods: generalized additive models were used to estimate the adjusted relative risk of asthma physician visits associated with an interquartile range increase in sulphur dioxide, nitrogen dioxide, pm . , and ozone, respectively. results: a consistent socio-economic gradient in the number of physician visits was observed among children and adults and both sexes. positive associations between ambient concentrations of sulphur dioxide, nitrogen dioxide and pm . and physician visits were observed across age and sex strata, whereas the associations with ozone were negative. the relative risk estimates for the low socio-economic group were not significantly greater than those for the high socio-economic group. conclusions: these findings suggest that increased ambulatory physician visits represent another component of the public health impact of exposure to urban air pollution. further, these results did not identify an age, sex, or socio-economic subgroup in which the association between physician visits and air pollution was significantly stronger than in any other population subgroup.

eco-life-center (ela) in albania supports a holistic approach to justice, recognizing that environmental justice, social justice and economic justice depend upon and support each other. low income citizens and minorities suffer disproportionately from environmental hazards in the workplace, at home, and in their communities. inadequate laws, lax enforcement of existing environmental regulations, and weak penalties for infractions undermine environmental protection. in the last decade, the environmental justice movement in tirana metropolis has provided a framework for identifying and exposing the links between irrational development practices, disproportionate siting of toxic facilities, economic depression, and a diminished quality of life in low-income communities and communities of color. the environmental justice agenda has always been rooted in economic, racial, and social justice. tirana and the issues surrounding brownfields redevelopment are crucial points of advocacy and activism for creating substantial social change in low-income communities and communities of color. we are engaging intensively in preventing communities, especially low income or minority communities, from being coerced by governmental agencies or companies into siting hazardous materials, or accepting environmentally hazardous practices in order to create jobs. although environmental regulations do now exist to address the environmental, health, and social impacts of undesirable land uses, these regulations are difficult to enforce because many of these sites have been toxic-ridden for many years and investigation and cleanup of these sites can be expensive. removing health risks must be the main priority of all brownfields action plans. environmental health hazards are disproportionately concentrated in low-income communities of color. policy requirements and enforcement mechanisms to safeguard environmental health should be strengthened for all brownfields projects located in these communities. if sites are potentially endangering the health of the community, all efforts should be made for site remediation to be carried out to the highest cleanup standards possible towards the removal of this risk. the assurance of the health of the community should take precedence over any other benefits, economic or otherwise, expected to result from brownfields redevelopment. it is important to require companies to observe a "good neighbor" policy that includes on-site visitations by a community watchdog committee, and the appointment of a neighborhood environmentalist to their board of directors in accordance with the environmental principles.

vancouver - michael buzzelli, jason su, and nhu le
this is the second paper of a research programme concerned with the geographical patterning of environmental and population health at the urban neighbourhood scale. based on the vancouver metropolitan region, the aim is to better understand the role of neighbourhoods as epidemiological spaces where environmental and social characteristics combine as health processes and outcomes at the community and individual levels. this paper builds a cohort of commensurate neighbourhoods across all six census periods from to , assembles neighbourhood air pollution data (several criterion/health effects pollutants), and provides an analysis to demonstrate how air pollution systematically and consistently maps onto neighbourhood socioeconomic markers, in this case low education and lone-parent families. we conclude with a discussion of how the neighbourhood cohort can be further developed to address emergent priorities in the population and environmental health literatures, namely the need for temporally matched data, a lifecourse approach, and analyses that control for spatial scale effects.

solid waste management and environment in mumbai (india) by uttam jakoji sonkamble and bairam paswan
abstract: mumbai is the financial capital of india. the population of greater mumbai is , , over a sq. km. area, with a density of , per sq. km. the day-to-day administration and rendering of public services within greater mumbai is provided by the brihan mumbai mahanagar palika (municipal corporation of greater mumbai), a body of elected councilors on a -year term. the municipal corporation provides various conservancy services such as street sweeping, collection of solid waste, removal and transportation, disposal of solid waste, disposal of dead bodies of animals, construction, maintenance and cleaning of urinals and public sanitary conveniences. the solid waste problem is becoming complicated due to the increase in unplanned urbanization and industrialization, and the environment has deteriorated significantly due to inter-, intra- and international migration streams to mumbai. the volume of inter-state migration to mumbai is considerably high, i.e. . lakh, and . lakh international migrants have migrated to mumbai. the present paper gives a view on solid waste management and its implications for environment and health. pollution from a wide variety of emissions, such as from automobiles and industrial activities, has reached critical levels in mumbai, causing respiratory, ocular, water-borne and other health problems. sources of generation of waste are: household waste, commercial waste, institutional waste, street sweeping, and silt removed from drain/nallah cleanings. disposal of solid waste in greater mumbai is done through incineration, processing to produce organic manure, vermi-composting, and landfill. the study shows that the quantity of waste disposed of through processing and conversion to organic manure is about - m.t. per day. the processing is done by a private agency, m/s excel industries ltd., which has set up a plant at the chincholi dumping ground in western mumbai for this purpose. the corporation also disposes of part of its waste, mainly market waste, through the environment-friendly natural process known as vermi-composting; about m.t. of market waste is disposed of in this manner at various sites. there are four landfill sites available, and percent of the waste matter generated in mumbai is disposed of through landfill. the continuous flow of migrants and the increase in the slum population are complex barriers to solid waste management; only when the community participates strongly can an eco-friendly environment be achieved in mumbai.

persons exposed to residential traffic have elevated rates of respiratory morbidity and mortality. since poverty is an important determinant of ill-health, some have argued that these associations may relate to the lower socioeconomic status of those living along major roads. our objective was to evaluate the association between traffic intensity at home and hospital admissions for respiratory diagnoses among montreal residents older than years. morning peak traffic estimates from the emme montreal traffic model (motrem ) were used as an indicator of exposure to road traffic outside the homes of those hospitalised. the influence of socioeconomic status on the relationship between traffic intensity and hospital admissions for respiratory diagnoses was explored through assessment of confounding by lodging value, expressed as the dollar average over road segments. this indicator of socioeconomic status, as calculated from the montreal property assessment database, is available at a finer geographic scale than socioeconomic information accessible from the canadian census. there was an inverse relationship between traffic intensity estimates and lodging values for those hospitalised (rho - . , p ). hospital admissions were associated with higher traffic intensity ( vehicles during the -hour morning peak), even after adjustment for lodging value (crude or . , % ci . - . ; adjusted or . , % ci . - . ). in montreal, elderly persons living along major roads are at higher risk of being hospitalised for respiratory illnesses, which appears not simply to reflect the fact that those living along major roads are at relative economic disadvantage.

the paper argues that human beings ought to be at the centre of the concern for sustainable development. while acknowledging the importance of protecting natural resources and the ecosystem in order to secure long term global sustainability, the paper maintains that the proper starting point in the quest for urban sustainability in africa is the 'brown agenda' to improve the living and working environment of the people, especially the urban poor who face a more immediate environmental threat to their health and well-being. as un-habitat has rightly observed, it is absolutely essential "to ensure that all people have a sufficient stake in the present to motivate them to take part in the struggle to secure the future for humanity." the human development approach calls for rethinking and broadening the narrow technical focus of conventional town planning and urban management in order to incorporate the emerging new ideas and principles of urban health and sustainability. i will examine how cities in sub-saharan africa have developed over the last fifty years; the extent to which government policies and programmes have facilitated or constrained urban growth, and the strategies needed to achieve better functioning, safer and more inclusive cities. in this regard i will explore insights from the united nations conferences of the s, especially local agenda of the rio summit, and the istanbul declaration/habitat agenda, paying particular attention to the principles of enablement, decentralization and partnership canvassed by these movements. also, i will consider the contributions of the various global initiatives, especially the cities alliance for cities without slums sponsored by the world bank and other partners; the sustainable cities programme; the global campaigns for good governance and for secure tenure canvassed by un-habitat; the healthy cities programme promoted by who; and so on. the concluding section will reflect on the future of the african city: what form it will take, and how to bring about the changes needed to make the cities healthier, more productive and equitable, and better able to meet people's needs.

heather jones-otazo, john clarke, donald cole, and miriam diamond
urban areas, as centers of population and resource consumption, have elevated emissions and concentrations of a wide range of chemical contaminants. we have developed a modeling framework in which we first estimate the emissions and transport of contaminants in a city and second, use these estimates along with measured contaminant concentrations in food, to estimate the potential health risk posed by these chemicals. the latter is accomplished using risk assessment. we applied our modeling framework to consider two groups of chemical contaminants, polycyclic aromatic hydrocarbons (pah) and the flame retardants polybrominated diphenyl ethers (pbde). pah originate from vehicles and stationary combustion sources. several pah are potent carcinogens and some compounds also cause noncancer effects. pbdes are additive flame retardants used in polyurethane foams (e.g., car seats, furniture) and electronic equipment (e.g., computers, televisions). two out of three pbde formulations are being voluntarily phased out by industry due to rising levels in human tissues and their world-wide distribution. pbdes have been related to adverse neurological, developmental and reproductive effects in laboratory animals. we applied our modeling framework to the city of toronto, where we considered the south-central area of by km that has a population of . million. for pah, local vehicle traffic and area sources contribute at least half of total pah in toronto. local contributions to pbdes range from - %, depending on the assumptions made. air concentrations of both compounds are about times higher downtown than km north of toronto. although measured pah concentrations in food date to the s, we estimate that the greatest exposure and contribution to lifetime cancer risk comes from ingestion of infant formula, which is consistent with toxicological evidence. the next greatest exposure and cancer risk are attributable to eating animal products (e.g. milk, eggs, fish). breathing downtown air contributes an additional percent to one's lifetime cancer risk. eating vegetables from a home garden located downtown contributes negligibly to exposure and risk. for pbdes, the greatest lifetime exposure comes through breast milk (we did not have data for infant formula), followed by ingestion of dust by the toddler and infant. these results suggest strategies to mitigate exposure and health risk.
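the abstract above describes its exposure and risk estimation only qualitatively. as a point of reference, the sketch below shows a generic us-epa-style ingestion-pathway calculation (lifetime average daily dose multiplied by an oral cancer slope factor), which is one common way such food-ingestion cancer risks are computed; it is not the authors' model, and the function name, pathway list and all numeric inputs are hypothetical placeholders.

```python
# illustrative only: generic ingestion-pathway lifetime cancer risk,
# not the authors' actual model or parameter values.

def lifetime_cancer_risk(conc_mg_per_kg, intake_kg_per_day, exposure_years,
                         body_weight_kg, slope_factor_per_mg_kg_day,
                         averaging_years=70.0):
    """incremental lifetime cancer risk for one food item (epa-style formula)."""
    # lifetime average daily dose, mg per kg body weight per day
    ladd = (conc_mg_per_kg * intake_kg_per_day * exposure_years * 365.0) / \
           (body_weight_kg * averaging_years * 365.0)
    return ladd * slope_factor_per_mg_kg_day

# hypothetical inputs for two exposure pathways (all values are placeholders)
pathways = {
    "infant formula": lifetime_cancer_risk(2e-4, 0.8, 1, 7.0, 7.3),
    "animal products": lifetime_cancer_risk(5e-5, 0.3, 70, 70.0, 7.3),
}
total = sum(pathways.values())
for name, risk in pathways.items():
    print(f"{name}: risk={risk:.2e} ({100 * risk / total:.0f}% of total)")
```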
p - (a) immigration and socioeconomic inequalities in cervical cancer screening in toronto, canada
aisha lofters, rahim moineddin, maria creatore, mohammad agha, and richard glazier
introduction: pap smears are recommended for cervical cancer screening from the onset of sexual activity to age . socioeconomic and ethnoracial gradients in self-reported cervical cancer screening have been documented in north america, but there have been few direct measures of pap smear use among immigrants or other socially disadvantaged groups. our purpose was to investigate whether immigration and socioeconomic factors are related to cervical cancer screening in toronto, canada. methods: pap smears were identified using fee codes and laboratory codes in ontario physician service claims (ohip) for three years starting in for women age - and - . all women with any health system contact during the three years were used as the denominator. social and economic factors were derived from the canadian census for census tracts and divided into quintiles of roughly equal population. recent registrants, over % of whom are expected to be recent immigrants to canada, were identified as women who first registered for health coverage in ontario after january , . results: among , women age - and , women age - , . % and . %, respectively, had pap smears within three years. low income, low education, recent immigration, visible minority and non-english language were all associated with lower rates (least advantaged quintile:most advantaged quintile rate ratios were . , . , . , . , . , respectively, p < . for all). similar gradients were found in both age groups. recent registrants comprised . % of women and had much lower pap smear rates than non-recent registrants ( . % versus . % for women age - and . % versus . % for women age - ). conclusions: pap smear rates in toronto fall well below those dictated by evidence-based practice. at the area level, immigration, visible minority, language and socioeconomic characteristics are associated with pap smear rates. recent registrants, representing a largely immigrant group, have particularly low rates. efforts to improve coverage of cervical cancer screening need to be directed to all women, their providers and the health system, but with special emphasis on women who recently arrived in ontario and those with social and economic disadvantage.
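to make the reported quintile comparison concrete, the following minimal python/pandas sketch computes screening rates by socio-economic quintile and a least:most advantaged rate ratio; the data frame, column names and values are hypothetical, not the ohip/census extract used in the study.

```python
# illustrative sketch of the quintile rate-ratio calculation described above;
# the data are made up and are not from the study.
import pandas as pd

women = pd.DataFrame({
    "income_quintile": [1, 1, 2, 3, 4, 5, 5, 5],   # 1 = least advantaged
    "screened":        [0, 1, 0, 1, 1, 1, 1, 0],   # pap smear within 3 years
})

rates = women.groupby("income_quintile")["screened"].mean()
rate_ratio = rates.loc[1] / rates.loc[5]            # least : most advantaged
print(rates)
print(f"least:most advantaged rate ratio = {rate_ratio:.2f}")
```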
challenges faced: a) most of the resources are now being spent on preventing the spread of hiv/aids and maintaining the lives of those already affected. b) skilled medical personnel are dying, undermining the capacity to provide the required health care services. c) the complications of hiv/aids have complicated the treatment of other diseases, e.g. tb. d) the epidemic has led to an increased number of orphans requiring care and support. this has further stretched the resources available for health care. methods used in our research: . a simple community survey conducted by our organization volunteers in three urban centres; members of the community, workers and health care providers were interviewed. . meetings/discussions were organized in hospitals, community centres and with government officials. . written questionnaires to health workers, doctors and policy makers in the health sectors. lessons learnt: • the biggest share of the health budget goes towards hiv/aids prevention. • aids is spreading faster in those families which are poor and without education. • women are the most affected. • all health facilities are usually overcrowded with hiv/aids patients. actions needed: • community education on how to prevent the spread of hiv/aids. • hiv/aids testing needs to be encouraged to detect early infections for proper medical cover. • people to eat healthy. • people should avoid drugs. implications of our research: community members and civic society: introduction of home based care programs to take care of the sick who cannot get a space in the overcrowded public hospitals. private sector: the private sector has established programs to support and care for the staff already affected. government: provision of support to care-givers, in terms of resources and finances, and training more health workers.

introduction: australian prisons contain in excess of , prisoners. as in most other western countries, reliance on 'deprivation of liberty' is increasing. prisoner numbers are increasing at % per annum; incarceration of women has doubled in the last ten years. the impacts on the community are great: % of children have a parent in custody before their th birthday. for aboriginal communities, the harm is greater: aboriginal and/or torres strait islander people are incarcerated at a rate ten times higher than other australians, and % of their children have a parent in custody before their th birthday. australian prisons operate under state and territory jurisdictions, there being no federal prison system. eight independent health systems, supporting the eight custodial systems, have evolved. this variability provides a unique opportunity to assess the capacity of these health providers in addressing the very high service needs of prisoners. results: five models of health service provision are identified, four of which operate in one form or another in australia: • provided by the custodial authority (queensland and western australia) • provided by the health ministry through a secondary agent (south australia, the australian capital territory and tasmania) • provided through tendered contract by a private organization (victoria and northern territory) • provided by an independent health authority (new south wales) • (provided by medics as an integral component of the custodial enterprise). since , the model of the independent health authority has developed in new south wales. the health needs of the prisoner population have been quantified, and attempts are being made to quantify specific health risks/benefits of incarceration. specific enquiry has been conducted into prisoner attitudes to their health care, including issues such as client information confidentiality and access to health services. specific reference will be made to: • two inmate health surveys • two inmate access surveys, and • two service demand studies. conclusions: the model of care provision with legislative, ethical, funding and operational independence would seem to provide the best opportunity to define and then respond to the health needs of prisoners. this model is being adopted in the united kingdom. better health outcomes in this high-risk group could translate into healthier families and their communities.

p - (a) integrated ethnic-specific health care systems: their development and role in increasing access to and quality of care for marginalized ethnic minorities
joshua yang
introduction: changing demographics in urban areas globally have resulted in urban health systems that are racially and ethnically homogenous relative to the patient populations they aim to serve. the resultant disparities in access to and quality of health care experienced by ethnic minority groups have been addressed by short-term, institutional level strategies. noticeably absent, however, have been structural approaches to reducing culturally-rooted disparities in health care. the development of ethnic-specific health care systems is a structural, long-term approach to reducing barriers to quality health care for ethnic minority populations. methods: this work is based on a qualitative study of the health care experiences of san francisco chinatown in the united states, an ethnic community with a model ethnic-specific health care infrastructure. using snowball sampling, interviews were conducted with key stakeholders, and archival research was conducted to trace and model the developmental process that led to the current ethnic-specific health care system available to the chinese in san francisco. grounded theory was the methodology used for analysis of the qualitative data. the result of the study is a four-stage developmental model of ethnic-specific health infrastructure development that emerged from the data. the first stage of development is the creation of the human capital resources needed for an ethnic-specific health infrastructure, with emphasis on a bilingual and bicultural health care workforce. the second stage is the effective organization of health care resources for maximal access by constituents. the third is the strengthening and stability of those institutional forms through increased organizational capacity. integration of the ethnic-specific health care system into the mainstream health care infrastructure is the final stage of development for an ethnic-specific infrastructure. conclusion: integrated ethnic-specific health care systems are an effective, long-term strategy to address the linguistic and cultural barriers that are being faced by the spectrum of ethnic populations in urban areas, acting as culturally appropriate points of access to the mainstream health care system. the model presented is a roadmap to empower ethnic communities to act on the constraints of their health and political environments to improve their health care experiences. at a policy level, ethnic-specific health care organizations are an effective long-term strategy to increase access to care and improve quality of care for marginalized ethnic groups. each stage of the model serves as a target area for policy interventions to address the access and care issues faced by culturally and linguistically diverse populations.

users in baltimore md: -
noya galai, gregory lucas, peter o'driscoll, david celentano, david vlahov, gregory kirk, and shruti mehta
introduction: frequent use of emergency rooms (er) and hospitalizations among injection drug users (idus) has been reported and has often been attributed to lack of access to primary health care. however, there is little longitudinal data which examine health care utilization over individual drug use careers. we examined factors associated with hospitalizations, er and outpatient (op) visits among idus over years of follow-up. methods: idus were recruited through community outreach into the aids link to intravenous experience (alive) study and followed semi-annually. , who had at least follow-up visits were included in this analysis. outcomes were self-reported episodes of hospitalizations and er/op visits in the prior six months. poisson regression was used, accounting for intra-person correlation with generalized estimating equations. results: at enrollment, % were male, % were african-american, % were hiv positive, median age was years, and median duration of drug use was years. over a total of , visits, mean individual rates of utilization were per person-years (py) for hospitalizations and per py for er/op visits. adjusting for age and duration of drug use, factors significantly associated with higher rates of hospitalization included hiv infection (relative incidence [ri] . ), female gender (ri . ), homelessness (ri . ), as well as not being employed, injecting at least daily, snorting heroin, having a regular source of health care, having health insurance and being in methadone maintenance treatment (mmt). similar associations were observed for er/op visits, except for mmt, which was not associated with er/op visits. additional factors associated with lower er/op visits were use of alcohol, crack, injecting at least daily and trading sex for drugs. % of the cohort accounted for % of total er/op visits, while % of the cohort never reported an op visit during follow-up.
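the alive analysis above names its model (poisson regression with generalized estimating equations to account for intra-person correlation across repeated visits); the following is a minimal statsmodels sketch of that kind of specification, with a hypothetical data frame and variable names rather than the study's actual outcomes or covariates.

```python
# minimal sketch of a poisson gee model of the kind described in the alive
# abstract above; the data and variable names are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# hypothetical semi-annual follow-up records (two visits per participant)
visits = pd.DataFrame({
    "id":       [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "hosp":     [0, 2, 1, 0, 0, 1, 3, 1, 0, 0],   # hospitalizations, prior 6 months
    "hiv":      [1, 1, 0, 0, 1, 1, 1, 1, 0, 0],
    "homeless": [0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    "age":      [34, 35, 41, 41, 29, 30, 38, 38, 45, 46],
})

model = smf.gee(
    "hosp ~ hiv + homeless + age",
    groups="id",                                  # accounts for intra-person correlation
    data=visits,
    family=sm.families.Poisson(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())   # exp(coef) approximates a relative incidence
```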
... lgbt) populations. we hypothesized that providing appointments for patients within hours would ensure timely care, increase patient satisfaction, and improve practice efficiency. further, we anticipated that the greatest change would occur amongst our homeless patients. methods: we tested an experimental introduction of advanced access scheduling (using a -hour rule) in the primary care medical clinic. we tracked variables including waiting time for next available appointment, number of patients seen, and no-show rates, for an eight week period prior to and post introduction of the new scheduling system. both patient and provider satisfaction were assessed using a brief survey ( questions rated on a -pt scale). results and conclusion: preliminary analyses demonstrated shorter waiting times for appointments across the clinic, decreased no-show rates, and increased clinic capacity. introduction of the advanced access scheduling also increased both patient and provider satisfaction. the new scheduling was initiated in july . quantitative analyses to measure initial and sustained changes, and to look at differential responses across populations within our clinic, are currently underway.

introduction: there are three recognized approaches to linking socio-economic factors and health: use of census data, gis-based measures of accessibility/availability, and resident self-reported opinion on neighborhood conditions. this research project is primarily concerned with residents' views about their neighborhoods, identifying problems, and proposing policy changes to address them. the other two techniques will be used in future research to build a more comprehensive image of neighborhood deprivation and health. methods: a telephone survey of london, ontario residents is currently being conducted to assess: a) community resource availability, quality, access and use, b) participation in neighborhood activities, c) perceived quality of neighborhood, d) neighborhood problems, and e) neighborhood cohesion. the survey instrument is composed of indices and scales previously validated and adapted to reflect london specifically. thirty city planning districts are used to define neighborhoods. the sample size for each neighborhood reflects the size of the planning district. responses will be compared within and across neighborhoods. data will be linked with census information to study variation across socio-economic and demographic groups. linear and gis-based methods will be used for analysis. preliminary results: the survey follows a qualitative study providing a first look at how experts involved in community resource planning and administration and city residents perceive the availability, accessibility, and quality of community resources linked to neighborhood health and wellbeing, and what are the most immediate needs that should be addressed. key-informant interviews and focus groups were used. the survey was pre-tested to ensure that the language and content reflect real experiences of city residents. the qualitative research confirmed our hypothesis that planning districts are an acceptable surrogate for neighborhood, and that the language and content of the survey are appropriate for implementation in london. scales and indices showed good to excellent reliability and validity during the pre-test (cronbach's alpha from . - . ). preliminary results of the survey will be detailed at the conference. conclusions: this study will help assess where community resources are lacking or need improvement, thus contributing to a more effective allocation of public funds. it is also hypothesized that engaged neighborhoods with a well-developed sense of community are more likely to respond to health programs and interventions. it is hoped that this study will allow london residents to better understand the needs and problems of their neighborhoods and provide a research foundation to support local understanding of community improvement with the goal of promoting healthy neighborhoods.

p - (a) hiv positive in new york city and no outpatient care: who and why?
hannah wolfe and victoria sharp
introduction: there are approximately million hiv positive individuals living in the united states. about % of these know their hiv status and are enrolled in outpatient care. of the remaining %, approximately half do not know their status; the other group frequently know their status but are not enrolled in any system of outpatient care. this group primarily accesses care through emergency departments. when indicated, they are admitted to hospitals, receive acute care services and then, upon discharge, disappear from the health care system until a new crisis occurs, when they return to the emergency department. as a large urban hiv center, caring for over individuals with hiv, we have an active inpatient service with approximately discharges annually. we decided to survey our inpatients to better characterize those individuals who were not enrolled in any system of outpatient care. results: % of inpatients were not enrolled in regular outpatient care: % at roosevelt hospital and % at st. luke's hospital. substance abuse and homelessness were highly prevalent in the cohort of patients not enrolled in regular outpatient care. % of patients not in care (vs. % of those in care) were deemed in need of substance use treatment by the inpatient social worker. % of those not in care were homeless (vs. % of those in care). patients not in care did not differ significantly from those in care in terms of age, race, or gender. patients not in care were asked "why not?": the two most frequent responses were "i haven't really been sick before" and "i'd rather not think about my health." conclusions: this study suggests that there is an opportunity to engage these patients during their stay on the inpatient units and attempt to enroll them in outpatient care. simple referral to an hiv clinic is insufficient, particularly given the burden of homelessness and substance use in this population. efforts are currently underway to design an intervention to focus efforts on this group of patients.

p - (a) healthcare availability and accessibility in an urban area: the case of ibadan city, nigeria
in order to cater for the healthcare needs of the populace, for many years after nigeria's political independence, emphasis was laid on the construction of teaching, general, and specialist hospitals, all of which were located in the urban centres. the realisation of the inadequacies of this approach in adequately meeting the healthcare needs of the people made the country change and adopt the primary health care (phc) system in . the primary health care system, which is in line with the alma ata declaration of , was aimed at making health care available to as many people as possible on the basis of equity and social justice. thus, for close to two decades, nigeria has operated the primary health care system as a strategy for providing health care for rural and urban dwellers. this study, focusing on an urban area, examines the availability and accessibility of health care in one of nigeria's urban centres, ibadan city to be specific. this is done within the context of the country's national health policy, of which primary health care is the main thrust. the study also offers necessary suggestions for policy consideration.

in spite of the accessibility of services provided by educated and trained midwives in many parts of fars province (iran), there are still some deliveries conducted by untrained traditional birth attendants in rural parts of the province. as a result, a considerable proportion of deliveries are conducted under a higher risk due to unauthorised and uneducated attendants. this study was conducted to reveal the proportion of deliveries with unauthorized attendants and some spatial and social factors affecting the selection of delivery attendants. method: this study, using a case control design, compared some potentially effective parameters including spatial, social and educational factors of mothers with deliveries attended by traditional midwives (n= ) with those assisted by educated and trained midwives (n= ). the mothers interviewed in our study were selected from rural areas using a cluster sampling method, considering each village as a cluster. results: more than % of deliveries in the rural area were assisted by traditional midwives. there are significant direct relationships between asking a traditional birth attendant for delivery and mother's age, the number of previous deliveries and distance to a health facility provided for delivery. significant inverse relationships were found between mother's education and ability to use a vehicle to get to the facilities. conclusion: despite the accessibility of mothers to educated birth attendants and health facilities (according to the government health standards), some mothers still tend to ask traditional birth attendants for help. this is partly because of an unrealistic definition of accessibility. the other considerable point is the preference for traditional attendants among older and less educated mothers, showing the necessity of changing their knowledge and attitude to understand the risks of deliveries attended by traditional and uneducated midwives.
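case-control comparisons of the kind described in the fars province study above are typically summarized as odds ratios with confidence intervals; the sketch below computes a crude odds ratio with a wald 95% ci from a 2x2 table. the exposure definition and all counts are hypothetical and are not taken from the study.

```python
# illustrative sketch of a crude case-control odds ratio with a 95% ci;
# the counts below are made up.
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """2x2 table: a/b = exposed/unexposed cases, c/d = exposed/unexposed controls."""
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se_log_or)
    hi = math.exp(math.log(or_) + z * se_log_or)
    return or_, lo, hi

# hypothetical counts: exposure = living far from a delivery facility,
# cases = deliveries attended by a traditional birth attendant
print("or=%.2f (95%% ci %.2f-%.2f)" % odds_ratio_ci(48, 32, 25, 70))
```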
p - (a) identification and optimization of service patterns provided by assertive community treatment teams in a major urban setting: preliminary findings from toronto, canada
jonathan weyman, peter gozdyra, margaret gehrs, daniela sota, and richard glazier
objective: assertive community treatment (act) teams are financed by the ontario ministry of health and long-term care (mohltc) and are mandated to provide treatment, rehabilitation and support services in the community to people with severe and persistent mental illness. there are such teams located in various regions across the city of toronto, conducting home visits - times per week to each of their approximately respective clients. each team consists of multidisciplinary health professionals who assist clients to identify their needs, establish goals and work toward them. due to complex referral patterns, the need for service continuity and the locations of supportive housing, clients of any one team are often found scattered across the city, which increases home visit travel times and decreases efficiency of service provision. this project examines the locations of clients in relation to the home bases of all act teams and identifies options for overcoming the geographical challenges which arise in a large urban setting. methods: using geographic information systems (gis) we geocoded all client and act agency addresses and depicted them on location maps. at a later stage, using spatial methods of network analysis, we plan to calculate average travel times for each act team, propose optimization of catchment areas and assess potential travel time savings. results: initial results show a substantial scattering of clients from several act teams and substantial overlap of visit travel routes for most teams. conclusions: reallocation of catchment areas and optimization of act teams' travel patterns should lead to substantial savings in travel times, increased service efficiency and better utilization of resources.

... (or = . , % ci = [ . - . ]), and/or unemployed (or = . , % ci = [ . - . ]) people. in multivariate analysis, after a full adjustment on gender, age, health status, health insurance, income, occupation and education level, we observed significant associations between having no rfd and: marital and parenthood status (e.g. or single no kids/in couple+kids = . , % ci = [ . - . ]), quality of relationships with neighbours (or bad/good = . , % ci = [ . - . ]), and length of residence in the neighbourhood (with a dose/effect statistical relationship). conclusion: gender, age, employment status, marital and parenthood status as well as neighbourhood anchorage seem to be major predictors of having a rfd, even when universal health insurance has reduced most financial barriers. in urban contexts, where residential migrations and single life (or family ruptures) are frequent, specific information may be conducted to encourage people to keep a rfd.

the study tries to assess the health effects and costs and also analyse the availability of and accessibility to health care for the poor. methods: data for this study were collected by a survey of households of the local community living near the factories and households where radiation hazard was not present. apart from morbidity status and health expenditure, data were collected on access, availability and efficiency of health care. a discriminant analysis was done to identify the variables that discriminate between the study and control group households in terms of health care pattern. a contingent valuation survey was also undertaken among the study group to find out the factors affecting their willingness to pay for health insurance and was analysed using a logit model. results: the health costs and indebtedness in families of the study group were high as compared to control group households, and this was mainly due to high health expenditure. the discriminant analysis showed that expenditure incurred on private hospital inpatient and outpatient care were significant variables which discriminated between the two types of households. the logit analysis showed that variables like indebtedness of households, better health care and presence of radiation induced illnesses were significant factors influencing willingness to pay for health insurance. the study showed that study group households were dependent on the private sector to get better health care and there were problems with access and availability in the public sector. conclusion: the study found that the quality of life of the local community is poor due to health effects of radiation and the burden of radiation induced illnesses is high for them. there is an urgent need for government intervention in this matter. there is also a need for the public sector to be efficient in catering to the needs of the poor. health insurance or other forms of support to these households would improve the quality of health care services, provide better and faster access to health care facilities and reduce the financial burden of the local fishing community.

the prevalence of substance abuse is an increasing problem among low-income urban women in puerto rico. latina access to treatment may play an important role in remission from substance abuse. little is known, however, about latinas' access to drug treatment. further, the role of social capital in substance abuse treatment utilization is unknown. this study examines the relative roles of social capital and other factors in obtaining substance abuse treatment, in a three-wave longitudinal study of women ages - living in high-risk urban areas of puerto rico, the inner city latina drug using study (icldus). social capital is measured at the individual level and includes variables from the social support and networks, familism, physical environment, and religion instruments of the icldus. the study also elucidates the role of treatment received during the study in bringing about changes in social capital. the theoretical framework used in exploring the utilization of substance abuse treatment is the social support approach to social capital. the research addresses three main questions: ( ) does social capital predict participating in treatment programs? ( ) does participation in drug treatment programs increase social capital? and ( ) is there a significant difference among treatment modalities in affecting change in social capital? the findings revealed no significant association between levels of social capital and getting treatment. also, women who received drug treatment did not increase their levels of social capital. the findings, however, revealed a number of significant predictors of social capital and of receiving drug abuse treatment. predictors of social capital at wave iii include employment status, total monthly income, and baseline social capital. predictors of receiving drug abuse treatment include perception of physical health and total amount of money spent on drugs. other variables were associated with treatment receipt prior to the icldus study. no significant difference in changes of social capital was found among users of different treatment modalities. this research represents an initial attempt to elucidate the two-way relationship between social capital and substance abuse treatment. more work is necessary to understand the role of political forces that promote social inequalities in creating drug abuse problems and availability of treatment; the relationship between the benefits provided by current treatment settings and treatment-seeking behaviors; the paths of recovery; and the efficacy and effectiveness of the treatment.

... and alejandro jadad
health professionals in urban centres must meet the challenge of providing equitable care to a population with diverse needs and abilities to access and use available services. within the canadian health care system, providers are time-pressured and ill-equipped to deal with patients who face barriers of poverty, literacy, language, culture and social isolation. directing patients to needed supportive care services is even more difficult than providing them with appropriate technical care. a large proportion of the population do not have equitable access to services and face major problems navigating complex systems. new approaches are needed to bridge across diverse populations and reach out to underserved patients most in need. the objective of this project was to develop an innovative program to help underserved cancer patients access, understand and use needed health and social services. it implemented and evaluated a pilot intervention employing trained 'personal health coaches' to assist underserved patients from a variety of ethno-linguistic, socio-economic and educational backgrounds to meet their supportive cancer care needs. the intervention was tested with a group of underserved cancer patients at the princess margaret hospital, toronto. personal coaches helped patients identify needs, access information, and use supportive care services. triangulation was used to compare and contrast multiple sources of quantitative and qualitative evaluation data provided by patients, personal health coaches, and health care providers to assess needs, barriers and the effectiveness of the coach program. many patients faced multiple barriers and had complex unmet needs. barriers of poverty and language were the easiest to detect. a formal, systematic method to identify and meet supportive care needs was not in place at the hospital. however, when patients were referred to the program, an overwhelming majority of participants were highly satisfied with the intervention. the service also appeared to have important implications for improved technical health care by ensuring attendance at appointments, arranging transportation and translation services, encouraging adherence to therapy and mitigating financial hardship using existing community services. this intervention identified a new approach that was effective in helping very needy patients navigate health and social services systems. such programs hold potential to improve both emotional and physical health outcomes. since assistance from a coach at the right time can prevent crises, it can create efficiencies in the health system. the successful use of individuals who were not licensed health professionals for this purpose has implications for health manpower planning.

needle exchange programs (neps) have been distributing harm reduction materials in toronto since . the counterfit harm reduction program is a small project operated out of a community health centre in south-east toronto. the project is operated by a single full-time coordinator, one part-time mobile outreach worker and two peers who work a few hours each week. all of counterfit's staff, peers, and volunteers identify themselves as active illicit drug users. yet the program distributes more needles and safer crack using kits and serves more illicit drug users than the combined number of all neps in toronto. this presentation will discuss the reasons behind this success, specifically the extended hours of operation, delivery models, and the inclusion of an extremely marginalized community in all aspects of program design, implementation and evaluation. counterfit was recently evaluated by drs. peggy milson and carol strike, two leading epidemiologists and researchers in the hiv and nep fields in toronto, and below are some of their findings: "the program has experienced considerable success in delivering a high quality, accessible and well-used program.... the program has allowed (service users) to become active participants in providing services to others and has resulted in true community development in the best sense." "... counterfit has been very successful attracting and retaining clients, developing an effective peer-based model and assisting clients with a vast range of issues.... the program has become a model for harm reduction programs within the province of ontario and beyond." in june , the association of ontario community health centres recognized counterfit's achievements with the excellence in community health initiatives award.

in kenya, health outcomes and the performance of government health services have deteriorated since the late s, trends which coincide with a period of severe resource constraints necessitated by macro-economic stabilization measures after the extreme neo-liberalism of the s. when the government withdrew from direct service provision as reform trends and donor advocacy suggested, how does it perform its new indirect role of managing relations with new direct health services providers in terms of regulating, enabling, and managing relations with these providers? in this paper, therefore, we seek to investigate how healthcare access and availability in the slums of nairobi have been impacted by the government's withdrawal from direct health care provision. the methodology involved collecting primary data by conducting field visits to health institutions located in the slum areas of kibera and korogocho in nairobi. purposive random sampling was utilized in this study because this sampling technique allowed the researcher(s) to select those health care seekers and providers who had the required information with respect to the objectives of the study. in-depth interviews using a semi-structured questionnaire were administered to key informants in health care institutions. this sought to explore ways in which the government and the private sector had responded and addressed, in practical terms, the new demands of health care provision following the structural adjustment programmes of the s. this was complemented by a secondary literature review of publications and records of key governmental, bilateral and multilateral development partners in nairobi. the study notes a number of weaknesses, especially of kenya's ministry of health, in performing its expected roles such as managing user fee revenue and the financial sustainability of health insurance systems. this changing face of health services provision in kenya therefore creates a complex situation, which demands greater understanding of the roles of competition and choice, regulatory structures and models of financing in shaping the evolution of health services. we recommend that the introduction of user fees, decentralization of service provision and contracting-out of non-clinical services to private and voluntary agencies require a new management culture, and new and clear institutional relationships. experience with private sector involvement in health projects underlines the need not only for innovative financial structures to deal with a multitude of contractual, political, market and other risks, but also for building credible structures to ensure that health services projects are environmentally responsive, socially sensitive, economically viable, and politically feasible.

purpose: the purpose of this study is to examine the status of mammography screening utilization and its predictors among muslim women living in southern california. methods: we conducted a cross-sectional study that included women aged ≥ years. we collected data using a questionnaire in the primary language of the subjects. the questionnaire included questions on demography; practices of breast self-examination (bse) and clinical breast examination (cbe); utilization of mammography; and family history of breast cancer. bivariate and multiple logistic regression analyses were performed to estimate the odds ratios of mammography use as a function of demographic and other predictor variables. results: among the women, % were married, % were - years old, and % had a family history of breast cancer. thirty-two percent of the participating women never practiced bse and % had not undergone cbe during the past two years. the data indicated that % of the women did not have mammography in the last two years. logistic regression analysis showed that age (or = . , % confidence interval (ci) = . - . ), having clinical breast examination (or = . , % ci = . - . ), and practice of breast self-examination (or = . , % ci = . - . ) were strong predictors of mammography use. conclusions: the data point to the need for interventions targeting muslim women to inform and motivate them about practices for early detection of breast cancer and screening. further studies are needed to investigate the factors associated with low utilization of mammography among the muslim women population in california.
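the mammography abstract above reports adjusted odds ratios from multiple logistic regression; a minimal sketch of that type of model is given below, using simulated data with made-up effect sizes and hypothetical variable names rather than the study's questionnaire items.

```python
# minimal sketch of a multiple logistic regression reported as odds ratios
# with 95% cis; the data are simulated, not from the study.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
women = pd.DataFrame({
    "age":      rng.integers(40, 75, n),
    "had_cbe":  rng.integers(0, 2, n),    # clinical breast exam in past 2 years
    "does_bse": rng.integers(0, 2, n),    # practices breast self-examination
})
# simulate mammography use with assumed (made-up) effect sizes
p = 1 / (1 + np.exp(-(-4 + 0.05 * women.age + 0.8 * women.had_cbe + 0.6 * women.does_bse)))
women["mammogram"] = rng.binomial(1, p)

fit = smf.logit("mammogram ~ age + had_cbe + does_bse", data=women).fit(disp=0)
odds_ratios = pd.DataFrame({
    "or":      np.exp(fit.params),
    "ci_low":  np.exp(fit.conf_int()[0]),
    "ci_high": np.exp(fit.conf_int()[1]),
})
print(odds_ratios.round(2))
```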
we conducted a review of the scientific literature and government documents to describe the national health care program "barrio adentro" (inside the neighborhood). we also conducted qualitative interviews with members of the local health committees in urban settings to describe the community participation component of the program. results: until recently, the venezuelan public health system was characterized by a lack of or limited access to health care ( % of the population) and long waiting lists that amounted to denial of service. more than half of the mds worked in the five wealthiest metropolitan areas of the country. in the spring of , a pilot program hired cuban mds to live in the slums of caracas to provide health care to people who had previously been marginalized from social programs. the program underwent a massive expansion, and in only two years , cuban and , venezuelan health care providers were working across the country. they provide a daily average of - medical consultations and home visits, carry out neighborhood rounds, and deliver health prevention initiatives, including immunization programs. they also provide generic medicines at no cost to patients, which treat % of presenting illnesses. barrio adentro aims to build , clinics (primary care), , diagnostic and rehabilitation centres (secondary care), and upgrade the current hospital infrastructure (tertiary care). local health committees survey the community to identify needs and organize a variety of lobby groups to improve the material conditions of the community. last year, barrio adentro conducted . times the medical visits conducted by the ministry of health. the philosophy of care follows an integrated approach where health is related to housing, education, employment, sports, environment, and food security. conclusions: barrio adentro is a unique collaboration between low-middle income countries to provide health care to people who have been traditionally excluded from social programs. this program shows that it is possible to develop an effective international collaboration based on participatory democracy.

low-income americans are at the greatest risk of being uninsured and often face multiple health concerns. this evaluation of the neighborhood health initiative (nhi), an organization which uses multiple programmatic approaches to meet the multiple health needs of clients, reflected the program's many activities and the clients' many service needs. nhi serves low-income, underserved, and hard-to-reach residents in the des moines enterprise community. multiple approaches (fourth-generation evaluation, grounded theory, strengths- and needs-based) and methods (staff and client interviews, concept mapping, observations, qualitative and quantitative analysis) were used to achieve that reflection. results indicate good targeting of residents in the zip code and positive findings in the way of health insurance coverage and reported unmet health needs of clients. program activities were found to match client needs, validating the organization's assessment of clients. important components of nhi were the staff composition and the fact that the organization had become part of both the formal and informal networks. nhi is positioned as a link between the target population and local health and social service agencies, working to connect residents with services and information as well as aid local agencies in reaching this underserved population.

p - (c) welfare: definition by new york city
maribeth gregory
for an individual who resides in new york city, to obtain health insurance under the medicaid policy one must fall under certain criteria (new york city's welfare programs). if the individual is on ssi or earns equal to or less than $ per month, he is entitled to receive no more than $ , in resources. a family of two would need to earn less than $ per month to qualify for no greater than $ , worth of medicaid benefits. a family of three would qualify for $ , if they earned less than $ per month, and so on.

introduction: the vancouver gay community has a significant number of people of asian descent. because of their double minority status of being gay and asian, many asian men who have sex with men (msm) are struggling with unique issues. dealing with racism in both mainstream society and the gay community, cultural differences, traditional family relations, and language challenges can be some of their everyday struggles. however, culturally, sexually, and linguistically specific services for asian msm are very limited. a lack of availability and accessibility of culturally appropriate sexual health services isolates asian msm from mainstream society, the gay community, and their own cultural communities, deprives them of self-esteem, and endangers their sexual well-being. this research focuses on the qualitative narrative voices of asian msm who express their issues related to their sexuality and the challenges of asking for help. by listening to their voices, practitioners can get ideas of what we are missing and how we need to intervene in order to reach asian msm and ensure their sexual health. methods: since many asian msm are very discreet, it is crucial to build trust relationships between the researcher and asian msm in order to collect qualitative data. for this reason, a community based participatory research model was adopted by forming a six week discussion group for asian msm. in each group session, the researcher tape recorded the discussion, observed interactions among the participants, and analyzed the data by focusing on participants' personal thoughts, experiences, and emotions for given discussion topics. results: many asian msm share challenges such as coping with a language barrier, cultural differences in interpreting issues and problems, and westerncentrism when they approach existing sexual health services. moreover, because of their fear of being disclosed in their small ethnic communities, a lot of asian msm feel insecure about seeking sexual health services when their issues are related to their sexual orientation. conclusion: sexual health services should contain multilingual and multicultural capacities to meet minority clients' needs. for asian msm, outreach may be a more effective way to provide them with accessible sexual health services, since many asian msm are closeted and are therefore reluctant to approach the services. building a community for asian msm is also a significant step toward including them in healthcare services. a community-based participatory approach can help to build a community for asian msm since it creates a trust relationship between a worker and clients.

p - (c) identifying key techniques to sustain interpretation services for assisting newcomers isolated by linguistic and cultural barriers from accessing health services
s. gopi krishna
introduction: the greater toronto area (gta) is home to many newcomer immigrants and other vulnerable groups who can't access health resources due to linguistic, cultural and systemic barriers. linguistic and cultural issues are of special concern to suburbs like scarborough, which is home to thousands of newcomer immigrants and refugees lacking fluency in english. multilingual community interpreter services (mcis) is a non-profit social service organization mandated to provide high quality interpretation services. to help newcomers access health services, mcis partnered with the scarborough network of immigrant serving organizations (sniso) to recruit and train volunteer interpreters to accompany clients lacking fluency in english and interpret for them to access health services at various locations, including community health centres/social service agencies and hospitals. the model envisioned agencies recruiting, and mcis training, volunteers and creating an online database of pooled interpreter resources. this database, accessible to all participating organizations, is to be maintained through administrative/membership fees to be paid by each participating organization. this paper analyzes the results of the project, and defines and identifies successes before providing a detailed analysis of the reasons for the success. methods: this paper uses quantitative (i.e. client numbers) and qualitative analysis (i.e. results of key informant interviews with service users and interpreters) to analyze the project development, training and implementation phases of the project. it then identifies the successes and failures through the aforementioned analysis. results: the results of the analysis can be summarized as: • the program saw modest success both in terms of numbers of clients served and sustainability at various locations, except in the hospital setting. • the success of the program rests strongly on the commitment of not just the volunteer interpreter, but on service users' acknowledgments through providing a transportation allowance, small honoraria, letters of reference, etc. • the hospital sustained the program better due to the volume and nature of the need, as well as its innate capacity for managing and acknowledging volunteers. conclusion: it is possible to facilitate and sustain vulnerable newcomer immigrants' access to health services through the training and commitment of an interpreter volunteer core. acknowledging volunteer commitment is key to the sustenance of the project. this finding is important to immigration and health policy given the significant numbers of newcomer immigrants arriving in canada's urban communities.

...nity program was established in to provide support to people dying at home, especially those who were waiting for admission to the resi...

... and age > (males) or > (females) (n = , ). results: based on self-report, an estimated , ( %) of nyc adults have or more cvd risk factors. this population is % male, % white, % black, and % with ≤ years of education. most report good access to health care, indicated by having health insurance ( %), a regular doctor ( %), their blood pressure checked within the last months ( %), and their cholesterol checked within the past year ( %). only % reported getting at least minutes of exercise at least times per week, and only % reported eating servings of fruits and vegetables the previous day. among current smokers, % attempted to quit in the past months, but only % used medication or counseling. implications: these data suggest that most nyc adults known to be at high risk for cvd have access to regular health care, but most do not engage in healthy lifestyles or, if they smoke, attempt effective quit strategies. more clinic-based and population-level interventions are needed to support lifestyle change among those at high risk of cvd.

introduction: recently, much interest has been directed at "obesogenic" (obesity-promoting) built environments (swinburn, egger & raza, ), and at geographic information systems (gis) as a tool for their exploration. a major geographical concept is accessibility, or the ease of moving from an origin to a destination point, which has recently been explored in several health promotion-related studies. there are several methods of calculating accessibility to an urban feature, each with its own strengths, drawbacks and level of precision, that can be applied to various health promotion research issues. the purpose of this paper is to describe, compare and contrast four common methods of calculating accessibility to urban amenities in terms of their utility to obesity-related health promotion research. practical and conceptual issues surrounding these methods are introduced and discussed with the intent of providing health promotion researchers with information useful for selecting the most appropriate accessibility method for their research goals. method: this paper describes methodological insights from two studies, both of which assessed the neighbourhood-level accessibility of fast-food establishments in edmonton, canada: one which used a relatively simple coverage method and one which used a more complex minimum cost method. results: both methods of calculating accessibility revealed similar patterns of high and low access to fast-food outlets. however, a major drawback of both methods is that they assume the characteristics of the amenities and of the populations using them are all the same, and are static. the gravity potential method is introduced as an alternative, since it is capable of factoring in measures of quality and choice. a number of conceptual and practical issues, illustrated by the example of situational influences on food choice, make the use of the gravity potential model unwieldy for health promotion research into socially-determined conditions such as obesity. conclusions: it is recommended that geographical approaches be used in partnership with, or as a foundation for, traditional exploratory methodologies such as group interviews or other forms of community consultation that are more inclusive and representative of the populations of interest.
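the gravity potential method mentioned in the accessibility abstract above is commonly written as a sum of amenity attractiveness weighted by a distance-decay term (a_i = sum_j s_j / d_ij^beta); the following sketch assumes that standard form, with hypothetical coordinates, attractiveness scores and decay exponent, since the abstract itself does not give a formula.

```python
# minimal sketch of a gravity-potential accessibility score; coordinates,
# attractiveness scores and the distance-decay exponent are hypothetical.
import math

def gravity_accessibility(origin, amenities, beta=2.0):
    """a_i = sum_j (s_j / d_ij**beta), with straight-line distance d_ij."""
    score = 0.0
    for (x, y, attractiveness) in amenities:
        d = math.dist(origin, (x, y)) or 0.001   # avoid division by zero
        score += attractiveness / d ** beta
    return score

# hypothetical neighbourhood centroid and fast-food outlets (x, y, attractiveness)
neighbourhood = (0.0, 0.0)
outlets = [(1.2, 0.4, 3.0), (2.5, 1.0, 5.0), (0.3, 0.9, 1.0)]
print(f"gravity accessibility = {gravity_accessibility(neighbourhood, outlets):.2f}")
```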
in los angeles county. ia shaheen, richard casey, fernando cardenas, holman arthurs, and richard baker. the retinomax autorefractor has been used for vision screening of preschool-age children. it has been suggested that it be used to test school-age children, but it has not been validated in this age group. objective: to compare the results of the retinomax autorefractor with findings from a comprehensive eye examination using wet retinoscopy for refractive error. methods: children - years old recruited from elementary schools in los angeles county were screened with snellen's chart and the retinomax autorefractor and had a comprehensive eye examination with dilation. the proportion of children with an abnormal eye examination, as well as the sensitivity and specificity of the screening tools using the retinomax autorefractor alone and in combination with snellen's chart, were determined. results: of the children enrolled in the study (average age . ± . years; age range - years), % had an abnormal eye examination using retinoscopy with dilation. for the retinomax, the sensitivity was % ( % confidence interval [ci] %- %), and the specificity was % ( % ci %- %). simultaneous testing using snellen's chart and the retinomax resulted in a gain in sensitivity ( %, % ci - ) and a loss in specificity ( %, % ci %- %). the study showed that screening school-age children with the retinomax autorefractor could identify most cases with abnormal vision but would be associated with many false-positive results. simultaneous testing using snellen's chart and the retinomax maximizes case finding but with very low specificity.
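as a reminder of how the screening indices reported above are defined, here is a minimal sketch of sensitivity, specificity and the effect of parallel ("simultaneous") testing; the counts are hypothetical and the combination formula assumes the two tests err independently, which the retinomax/snellen data need not satisfy.

```python
# Sensitivity/specificity from a 2x2 screening table, plus the effect of
# "simultaneous" (parallel) testing where a child is referred if EITHER
# test is positive. All counts are hypothetical, not study data.

def sens_spec(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)   # true positives among children with abnormal exams
    specificity = tn / (tn + fp)   # true negatives among children with normal exams
    return sensitivity, specificity

# Hypothetical Retinomax-only results against dilated retinoscopy
se_r, sp_r = sens_spec(tp=45, fn=5, tn=380, fp=70)

# Hypothetical Snellen-only results
se_s, sp_s = sens_spec(tp=35, fn=15, tn=400, fp=50)

# Parallel combination (positive if either test is positive), assuming independence:
se_parallel = 1 - (1 - se_r) * (1 - se_s)   # sensitivity rises
sp_parallel = sp_r * sp_s                   # specificity falls

print(f"Retinomax alone: sens={se_r:.2f}, spec={sp_r:.2f}")
print(f"Parallel testing: sens={se_parallel:.2f}, spec={sp_parallel:.2f}")
```

this illustrates why the abstract reports that combined testing maximizes case finding at the cost of very low specificity: the "refer if either is positive" rule can only raise sensitivity and can only lower specificity.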
methods: a language-stratified, random sample of members of the college of family physicians of canada received a confidential survey. the questionnaire collected data on socio-demographic characteristics, medical training, practice type, setting and hcv-related care practices. the self-administered questionnaire was also made available to participants for completion on the internet. results: the response proportion was %. the median age was years ( % female) and the proportion of french questionnaires was %. approximately % had completed family medicine residency training in canada; the median year of training completion was . sixty-seven percent, % and % work in private offices/clinics, community hospitals and emergency departments, respectively. regarding screening practices, % had ever requested an hcv test and % of physicians had screened for hcv infection in the past months; the median number of tests was . while % reported having no hcv-infected patients in their practice, % had - hcv-infected patients. regarding the level of hcv care provided, . % provide ongoing advanced hcv care including treatment and dose monitoring. conclusions: in this sample of canadian family physicians, most had provided hcv screening to at least one patient in the past year. less than half had - hcv-infected patients and % provide hcv-related care. the roles of socio-demographic factors, medical training and hcv care perceptions in the provision of appropriate hcv screening will be examined and described at the time of the conference. p - (c) healthcare services: the context of nepal. meen poudyal chhetri. introduction: healthcare service is related to the human rights and fundamental rights of the citizens of a country. however, the growing demand for health care services, quality healthcare services, accessibility for the mass population and the paucity of funds are different but interrelated issues to be addressed in nepal. in view of this context, the public health sector in nepal is among the sectors struggling for scarce resources. in nepal, the problems in the field of healthcare services are not limited to the paucity of funds and resources only; there are other problems such as rural-urban imbalance, regional imbalance, poor management of the limited resources, poor healthcare services, and inequity and inaccessibility of healthcare services for the poor people of rural, remote and hilly areas, and so on. in fact, the best resource allocation is the one that maximizes the sum of individuals' use of health services; it might be the redistribution of income or the redistribution of services. hence, equity, efficiency and efficient management are correlated. moreover, maximization of available resources, quality healthcare services and efficient management of them are very important and necessary tools and techniques to meet the growing demand for quality healthcare services in nepal. p - (a) an in-depth analysis of medical detox clients to assist in evidence-based decision making. xin li, huiying sun, ajay puri, david marsh, and aslam anis. introduction: problematic substance use represents an ever-increasing public health challenge. in the vancouver coastal health (vch) region, there are more than , individuals having some probability of drug or alcohol dependence. to accommodate this potential demand for addiction-related services, vch provides various services and treatment, including four levels of withdrawal management services (wms). clients seeking wms are screened and referred to appropriate services through a central telephone intake service (access ). the present study seeks to rigorously evaluate one of these services, vancouver detox, a medically monitored -bed residential detox facility, and its clients. doing so will allow decision makers to utilize evidence-based decision making in order to improve the accessibility and efficiency of wms, and therefore the health of these clients. methods: we extracted one year of data (october , to september , ) from an efficient and comprehensive database. the occupancy rate of the detox centre, along with the clients' wait time for service and length of stay (los), were calculated. in addition, the effect of seasonality on these variables and the impact of the once-per-month welfare check issuance on the occupancy rate were also evaluated. results: among the clients (median age , % male) who were referred by access to vancouver detox over the one-year period, were admitted. the majority ( %) of those who were not admitted were either lost to follow-up (i.e., clients not having a fixed address or telephone) or declined service at the time of callback. the median wait time was day [q -q : - ], the median los was days [q -q : - ], and the average bed occupancy rate was %. however, during the three-day welfare check issue period the occupancy rate was lower compared to the other days of the year ( % vs. %, p ). conclusion: our analysis indicates that there was a relatively short wait time at vancouver detox; however, % of the potential clients were not served. in addition, the occupancy rate declined during the welfare check issuance period and during the summer. this suggests that accessibility and efficiency at vancouver detox could be improved by specifically addressing these factors.
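the utilisation measures in the detox abstract (median wait time, median length of stay, bed occupancy rate) are simple to derive from admission records; the sketch below shows one way to compute them on made-up records, with an assumed bed count and illustrative dates rather than anything from the vch database.

```python
# Hypothetical admission records: (referral_date, admission_date, discharge_date)
from datetime import date
from statistics import median

BEDS = 24                      # assumed bed count, not the actual facility size
records = [
    (date(2004, 10, 3), date(2004, 10, 4), date(2004, 10, 9)),
    (date(2004, 10, 5), date(2004, 10, 7), date(2004, 10, 12)),
    (date(2004, 11, 1), date(2004, 11, 1), date(2004, 11, 6)),
]

wait_days = [(adm - ref).days for ref, adm, dis in records]   # referral -> admission
los_days = [(dis - adm).days for ref, adm, dis in records]    # admission -> discharge

# Bed-days used divided by bed-days available over the observation window
start = min(adm for _, adm, _ in records)
end = max(dis for _, _, dis in records)
bed_days_available = BEDS * (end - start).days
occupancy = sum(los_days) / bed_days_available

print(f"median wait = {median(wait_days)} days, "
      f"median LOS = {median(los_days)} days, "
      f"occupancy = {occupancy:.1%}")
```

grouping the same calculation by calendar month, or by welfare-check versus other days, is how seasonality and the cheque-issuance effect described above would be examined.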
background: intimate partner violence (ipv) is associated with acute and chronic physical and mental health outcomes for women, resulting in greater use of health services. yet a vast literature attests to cultural variations in perceptions of health and help-seeking behaviour. fewer studies have examined differences in perceptions of ipv among women from ethnocultural communities. the recognition, definition, and understanding of ipv, as well as the language used to describe these experiences, may be different in these communities. as such, a woman's response, including whether or not to disclose or seek help, may vary according to her understanding of the problem. methods: this pilot study explores the influence of cultural factors on perceptions of and responses to ipv among canadian-born and immigrant young women. in-depth focus group interviews were conducted with women aged to years living in toronto. open-ended and semi-structured interview questions were designed to elicit information regarding how young women socially construct ipv and where they would go to receive help. interviews were transcribed, then read and independently coded by the research team. codes were compared and disagreements resolved. the qualitative software qsr n was used to assist with data management. results: responses about what constitutes ipv were similar across the study groups. when considering specific abusive situations and types of relationships, participants held fairly relativistic views about ipv, especially with regard to help-seeking behaviour. cultural differences in beliefs about normative male/female relations, familial roles, and customs governing acceptable behaviours influenced participants' perceptions about what might be helpful to abused women. interview data highlight the social and structural impact these factors have on young women and provide details regarding the dynamics of ethnocultural influences on help-seeking behaviour; the roles of factors such as gender inequality within relationships and the perceived degree of social isolation and support networks are highlighted. conclusion: these findings underscore the importance of understanding cultural variations in perceptions of ipv in relation to help-seeking behaviour. this information is critical for health professionals so they may continue developing culturally sensitive practices, including screening guidelines and protocols. in addition, this study demonstrates that focus group interviews are valuable for engaging young women in discussions about ipv, helping them to 'name' their experiences, and consider sources of help when warranted. p - (a) health problems and health care use of young drug users in amsterdam. wieke krol, evelien van geffen, angela buchholz, esther welp, erik van ameijden, and maria prins. introduction: recent advances in health care and drug treatment have improved the health of populations with special social and health care needs, such as drug users. however, a substantial number still do not have access to the type of services required to improve their health status. in the netherlands, especially young adult drug users (yad) whose primary drug is cocaine might have limited access to drug treatment services. in this study we examined the history and current use of (drug-associated) treatment services, the determinants of loss of contact, and the current health care needs in the young drug users amsterdam study (yodam). methods: yodam started in and is embedded in the amsterdam cohort study among drug users. data were derived from yad aged < years who had used cocaine, heroin, amphetamines and/or methadone at least days a week during the months prior to enrolment. results: of the yad, the median age was years (range - years), % were male and % had dutch nationality at enrolment.
nearly all participants ( %) reported a history of contact with drug treatment services (methadone maintenance, rehabilitation clinics and judicial treatment), mental health care (ambulant mental care and psychiatric hospital) or general treatment services (day-care, night-care, help with living arrangements, work and finance). however, only % reported contact in the past six months. this figure was similar at the first and second follow-up visits. among yad who reported no current contact with the health care system, % would like to have contact with general treatment services. among participants who had never had contact with drug treatment services, % used primarily cocaine, compared with % and % among those who reported past or current contact, respectively. based on the addiction severity index, % reported at least one mental health problem in the past days, but only % had current contact with mental health services. conclusion: results from this study among young adult drug users show that despite a high contact rate with health care providers, the health care system seems to lose contact with yad. since % indicated a need for general treatment services, especially for arranging housing and living conditions, health care services that effectively integrate general health care with drug treatment services and mental health care might be more successful in keeping contact with young cocaine users. methods: respondents included adults aged and over who met dsm-iv diagnostic criteria for an anxiety or depressive disorder in the past months. we performed two sets of logistic regressions. the dichotomous dependent variables for each of the regressions indicated whether the respondent visited a psychiatrist, psychologist, family physician or social worker in the past months. results: there was no relationship for income, and there was no significant interaction between education and income. the odds for respondents with at least a high school education to seek help from any of the four service providers were almost twice those for respondents who had not completed high school. the second set of analyses found the association between education and use of md-provided care to be significant only in the low-income group; for non-md care, the association between education and use of social workers was significant in both income groups, but significant only for use of psychologists in the high-income group. conclusion: we found differences in health service use by education level. individuals who have not completed high school appeared to use fewer mental health services provided by psychiatrists, psychologists, family physicians and social workers. we found limited evidence suggesting that the influence of education on service use varies according to income and type of service provider. results suggest there may be a need to develop and evaluate programs designed to deliver targeted services to consumers who have not completed high school. further qualitative studies about the experience of individuals with low education are needed to clarify whether education's relationship with service use is provider- or consumer-driven, and to disentangle the interrelated influences of income and education.
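the modelling strategy described above, separate logistic regressions for each provider type with an education-by-income interaction, can be sketched as follows; the data are simulated, the variable names and coding are invented for illustration, and statsmodels is used only as one convenient way to fit such a model.

```python
# Sketch of the analysis strategy: one logistic regression per provider type,
# with an education x income interaction term. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    # 1 = completed high school, 0 = did not (hypothetical coding)
    "highschool": rng.integers(0, 2, n),
    # 1 = higher-income group, 0 = low-income group (hypothetical coding)
    "high_income": rng.integers(0, 2, n),
})
# Simulated outcome: visited a psychiatrist in the past 12 months
logit = -1.5 + 0.6 * df["highschool"] + 0.3 * df["high_income"]
df["saw_psychiatrist"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = smf.logit("saw_psychiatrist ~ highschool * high_income", data=df).fit(disp=False)
print(np.exp(model.params))                       # odds ratios
print(model.pvalues["highschool:high_income"])    # test of the interaction term
```

repeating the same fit with each of the four dichotomous outcomes (psychiatrist, psychologist, family physician, social worker), and then within income strata, mirrors the two sets of analyses the abstract describes.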
system for homeless, hiv-infected patients in nyc? nancy sahler, chinazo cunningham, and kathryn anastos. introduction: racial/ethnic disparities in access to health care have been consistently documented. one potential reason for disparities is that the cultural distance between minority patients and their providers discourages these patients from seeking and continuing care. many institutions have incorporated cultural competency training and culturally sensitive models of health care delivery, hoping to encourage better relationships between patients and providers, more positive views about the health care system, and, ultimately, improved health outcomes for minority patients. the current study tests whether cultural distance between physicians and patients, measured by racial discordance, predicts poorer patient attitudes about their providers and the health care system in a severely disadvantaged hiv-infected population in new york city that typically reports inconsistent patterns of health care. methods: we collected data from unstably housed black and latino/a people with hiv who reported having a regular health care provider. we asked them to report on their attitudes about their provider and the health care system using validated instruments. subjects were categorized as being racially "concordant" or "discordant" with their providers, and the attitudes of these two groups were compared. results: the sample consisted of ( %) black and ( %) latino/a people, who reported having ( %) black physicians, ( %) latino/a physicians, ( %) white physicians, and ( %) physicians of another/unknown race/ethnicity. overall, ( %) subjects had physicians of a different race/ethnicity than their own. racial discordance did not predict negative attitudes about the relationship with providers: the mean rating on a -item trust-in-provider scale ( =high and =low) was . for both the concordant and discordant groups, and the mean score on a -item relationship-with-provider scale ( =high and =low) was . for both groups. however, discordance was significantly associated with distrust in the health care system: the mean score on a -item scale ( =high distrust and =low distrust) was . for the discordant group and . for the concordant group (t= . , p= . ). we further explored these patterns separately in the black and latino/a subgroups, and using different strategies to conceptualize racial/ethnic discordance. conclusions: in this sample of unstably housed black and latino/a people who receive hiv care in new york city, having a physician from the same racial/ethnic background may be less important for developing a positive doctor-patient relationship than for helping patients to dispel fear and distrust about the health care system as a whole. we discuss the policy implications of these findings. ilene hyman and samuel noh. objective: this study examines patterns of mental healthcare utilization among ethiopian immigrants living in toronto. methods: a probability sample of ethiopian adults ( years and older) completed structured face-to-face interviews. variables ... define, especially who are non-healthcare providers. plan of analysis. results: approximately % of respondents received mental health services from mainstream healthcare providers and % consulted non-healthcare professionals. of those who sought mental health services from mainstream healthcare providers, . % saw family physicians, . % visited a psychiatrist, and . % consulted other healthcare providers. compared with males, a significantly higher proportion of females consulted non-healthcare professionals for emotional or mental health problems (p< . ).
tlbile ethiopian's overall use of mamstream healthcare services for emotional problems ( %) did not prlydiffer from the rate ( %) of the general population of ontario, only a small proportion ( . %) rjerhiopians with mental health needs used services from mainstream healthcare providers. of these, !oj% received family physicians' services, . % visited a psychiatrist, and . % consulted other healthll/c providers. our data also suggested that ethiopian immigrants were more likely to consult tradioooal healers than health professionals for emotional or mental health problems ( . % vs. . % ). our bivariate analyses found the number of somatic symptoms and stressful life events to be associated with an increased use of medical services and the presence of a mental disorder to be associated with a dfcreased use of medical services for emotional problems. however, using multivariate methods, only die number of somatic symptoms remained significantly associated with use of medical services for emooonal problems. diu#ssion: study findings suggest that there is a need for ethnic-specific and culturally-appropriate mrcrvention programs to help ethiopian immigrants and refugees with mental health needs. since there ~a strong association between somatic symptoms and the use family physicians' services, there appears robe a critical role for community-based family physicians to detect potential mental health problems among their ethiopian patients, and to provide appropriate treatment and/or referral. the authors acknowledge the centre of excellence for research in immigration and settlement (ceris) in toronto and canadian heritage who provided funding for the study. we also acknowledge linn clark whose editorial work has improved significantly the quality of this manuscript. we want to thank all the participants of the study, and the ethiopian community leaders without whose honest contributions the present study would have not been possible. this paper addresses the impact of the rationalization of health-care services on the clinical decision-making of emergency physicians in two urban hospital emergency departments in atlantic canada. using the combined strategies of observational analysis and in-depth interviewing, this study provides a qualitative understanding of how physicians and, by extension, patients are impacred by the increasing ancmpts to make health-care both more efficient and cost-effective. such attempts have resulted in significantly compromised access to primary care within the community. as a consequence, patients are, out of necessity, inappropriately relying upon emergency departments for primary care services as well as access to specialty services. within the hospital, rationalization has resulted in bed closures and severely rmricted access to in-patient services. emergency physicians and their patients are in a tenuous position having many needs but few resources. furthermore, in response to demands for greater accountability, physicians have also adopted rationality in the form of evidence-based medicine. ultimately, ho~ever, rationality whether imposed upon, or adopted by, the profession significantly undermines physu.: ans' ability to make decisions in the best interests of their patients. johnjasek, gretchen van wye, and bonnie kerker introduction: hispanics comprise an increasing proportion of th.e new york city (nyc) populanon !currently about %). like males in the general population, h spamc males (hm) have a lower prrval,nce of healthcare utilization than females. 
however, they face additional access barriers such as language differences and high rates of uninsurance. they also bear a heavy burden of health problems such as obesity and hiv/aids. this paper examines patterns of healthcare access and utilization by hm compared to other nyc adults and identifies key areas for intervention. ... and older are significantly lower than in the nhm population ( . vs. . , p<. ), though hiv screening and immunizations are comparable between the two groups. conclusion: findings suggest that hm have less access to healthcare than hf or nhm. however, hm are able to obtain certain discrete medical services as easily as other groups, perhaps due to free or subsidized programs. for other services, utilization among hm is lower. improving access to care in this group will help ensure routine, quality care, which can lead to greater use of preventive services and thus better health outcomes. introduction: the cancer registry is considered one of the most important issues in cancer epidemiology and prevention. bias or under-reporting of cancer cases can affect the accuracy of the results of epidemiological studies and control programs. the aim of this study was to assess the reliability of the regional cancer report in a relatively small province (yasuj) with almost all facilities needed for cancer diagnosis and treatment. methods: to find the total number of cancer cases, we reviewed the records of all patients diagnosed with cancer (icd - ) and registered in any hospital or pathology centre from until in yasuj and all ( ) surrounding provinces. results: of the patients who were originally residents of yasuj province, . % were accounted for by yasuj province. the proportion varies according to the type of cancer; for example, cancers of the digestive system, skin and breast were more frequently reported by yasuj's health facilities, whereas cancers of the blood, brain and bone were mostly reported by neighbouring provinces. the remaining cases ( . %) were diagnosed, treated and recorded by neighbouring provinces as their incident cases. this is partly because patients seek medical services from other provinces, believing that the facilities there are offered by more experienced and higher quality staff, and because their relatives' or temporary accommodation addresses were reported as their place of residence. conclusion: measuring the spatial incidence of cancer according to the location of report or the current address affected the spatial statistics of cancer. to correct this problem, recording the permanent address of diagnosed cases is important. p - (c) providing primary healthcare to a disadvantaged population at a university-run community healthcare facility. tracey rickards. the community health clinic (chc) is a university-sponsored, nurse-managed primary healthcare (phc) clinic. the clinic is an innovative model of healthcare delivery in canada that has integrated the principles of phc services within a community development framework. it serves to provide access to phc services for members of the community, particularly the poor and those who use drugs, while being a service-learning facility designed to meet client needs. clinic nursing and social work staff and students participate in various phc activities and outreach services in the local shelters and on the streets to the homeless population of fredericton. as well, the chc promotes and supports a harm reduction model
through an ongoing partnership with aids new brunswick and their needle exchange program, with local methadone maintenance therapy clients, and with the clients themselves, offering condoms and sexual health education, a place to shower, and a small clothing and food bank. the benefits of receiving health care from a nurse practitioner are evidenced in research that involves clients, staff, and students. to date the chc has undertaken research that involved a needs assessment/environmental scan, a cost-benefit analysis, and an on-going evaluation. the clinic has also examined the model of care delivery, focusing on nursing roles within the facility and compassionate learning among students. finally, the clinic strives to share the results of its research with the community in which it provides service by distributing a bi-monthly newsletter and participating in in-services and educational sessions in a variety of situations. the plan for the future is continued research and the use of evidence-based practice in order to guide the staff in choosing how primary healthcare services to marginalized populations will be provided. n- (c) turning up the volume: marginalized women's health concerns. tekla hendrickson and betty jane richmond. introduction: the marginalization of urban women due to socio-economic status and other determinants negatively affects their health and that of their families. this undermines the overall vitality of urban communities. for example, regarding access to primary health care, women of lower economic status and education levels are less likely to be screened for breast and cervical cancer. what is not as widely reported is how marginalized urban women in ontario understand and articulate their lack of access to health care, how they attribute this, and the solutions that they offer. this paper reports on the results of the ontario women's health network (owhn) focus group project highlighting urban women's concerns and suggestions regarding access to health care. it also raises larger issues about urban health, dual-purpose focus group design, community-based research and health planning processes. methods: focus group methodology was used to facilitate a total of discussions with urban and rural women across ontario from to . the women were invited to participate by local women's and health agencies and represented a range of ages, incomes, and access issues. discussions focussed on women's current health concerns, access to health care, and information needs. results were analyzed using grounded theory. the focus groups departed from traditional focus group research goals and had two purposes: ) data collection and dissemination (representation of women's voices), and ) fostering closer social ties between women, local agencies, and owhn. the paper provides a discussion and rationale for a dual approach. results: the results confirm current research on women's health access in women's own voices: urban women report difficulty finding responsive doctors, accessing helpful information such as visual aids in doctors' offices, and prohibitive prescription costs, in contrast with rural women's key concern of finding a family doctor. the research suggests that women's health focus groups can address access issues by helping women to network and initiate collective solutions.
the study shows that marginalized urban women are articulate about their health conctrns and those of their families, often understanding them in larger socio-economic frameworks; howtver, women need greater access to primary care and women-friendly information in more languages and in places that they go for other purposes. it is crucial that urban health planning processes consult directly with women as key health care managers, and turn up the volume on marginalized women's voices. women: an evaluation of awareness, attitudes and beliefs introduction: nigeria has one of the highest rates of human immunodeficiency virus ihivi seroprrvalence in the world. as in most developing countries vertical transmission from mother to child account for most hiv infection in nigerian children. the purpose of this study was ro. determine the awareness, attitudes and beliefs of pregnant nigerian women towards voluntary counseling and testing ivct! for hiv. mnbod: a pre-tested questionnaire was used to survey a cross section '.>f. pregnant women ~t (lrlleral antenatal clinics in awka, nigeria. data was reviewed based on willingness to ~c~ept or re ect vct and the reasons for disapproval. knowledge of hiv infection, routes of hiv transm ssmn and ant rnroviral therapy iart) was evaluated. hsults: % of the women had good knowledge of hiv, i % had fair knowledge while . % had poor knowledge of hiv infection. % of the women were not aware of the association of hreast milk feeding and transmission of hiv to their babies. majority of the women % approved v~t while % disapproved vct, % of those who approved said it was because vct could ~educe risk of rransmission of hiv to their babies. all respondents, % who accepted vc.i ~ere willing to be tnted if results are kept confidential only % accepted to be tested if vc.t results w. be s~ared w .th pinner and relatives % attributed their refusal to the effect it may have on their marriage whale '-gave the social 'and cultural stigmatization associated with hiv infection for their r~fusal.s % wall accept vct if they will be tested at the same time with their partners. ~ of ~omen wall pref~r to breast feed even if they tested positive to hiv. women with a higher education diploma were times v more likely to accept vct. knowledge of art for hiv infected pregnant women as a means of pre. vention of maternal to child transmission [pmtct) was generally poor, % of respondents wm aware of art in pregnancy. conclusion: the acceptance of vct by pregnant women seems to depend on their understanding that vct has proven benefits for their unborn child. socio-cultur al factors such as stigmatizationof hiv positive individuals appears to be the maj_or impedi~ent towards widespread acceptanee of ycr in nigeria. involvemen t of male partners may mpro~e attitudes t~wa~ds vct:the developmentofm novative health education strategies is essential to provide women with mformanon as regards the benefits of vct and other means of pmtct. p - (c) ethnic health care advisors in information centers on health care and welfare in four districts of amsterdam arlette hesselink, karien stronks, and arnoud verhoeff introduction : in amsterdam, migrants report a "worse actual health and a lower use of health care services than the native dutch population. this difference might be partly caused by problems migrants have with the dutch language and health care and welfare system. 
to support migrants finding their way through this system, in four districts in amsterdam information centers on health care and welfare were developed in which ethnic health care advisors were employed. their main task is to provide infor· mation to individuals or groups in order to bridge the gap between migrants and health care providers. methods: the implementat ion of the centers is evaluated using a process evaluation in order to give inside in the factors hampering and promoting the implementat ion. information is gathered using reports, attending meetings of local steering groups, and by semi-structu red interviews with persons (in)directly involved in the implementat ion of the centers. in addition, all individual and groupcontaets of the health care advisors are registered extensively. results: since four information centers, employing ethnic health care advisors, are implemented. the ethnicity of the health care advisors corresponds to the main migrant groups in the different districts (e.g. moroccan, turkeys, surinamese and african). depending on the local steering groups, the focus of the activities of the health care advisors in the centers varies. in total, around individual and group educational sessions have been registered since the start. most participants were positive about the individual and group sessions. the number of clients and type of questions asked depend highly on the location of the centers (e.g. as part of a welfare centre or as part of a housing corporation). in all districts implementa tion was hampered by lack of ongoing commitment of parties involved (e.g. health care providers, migrant organization s) and lack of integration with existing health care and welfare facilities. discussion: the migrant health advisors seem to have an important role in providing information on health and welfare to migrant clients, and therefore contribute in bridging the gap between migrants and professionals in health care and welfare. however, the lack of integration of the centers with the existing health care and welfare facilities in the different districts hampers further implementation . therefore, in most districts the information centres will be closed down as independent facilicities in the near future, and efforts are made to better connect the position of migrant health advisor in existing facilities. the who report ranks the philippines as ninth among countries with a high tb prevalence. about a fourth of the country's population is infected, with majority of cases coming from the lower socioeconomic segments of the community. metro manila is not only the economic and political capital of the philippines but also the site of major universities and educational institutions. initial interviews with the school's clinicians have established the need to come up with treatment guidelines and protocols for students and personnel when tb is diagnosed. these cases are often identified during annual physical examinations as part of the school's requirements. in many instances, students and personnel diagnosed with tb are referred to private physicians where they are often lost to follow-up and may have failure of treatment due to un monitored self-administered therapy. this practice ignores the school clinic's great potential as a tb treatment partner. 
through its single practice network (spn) initiative, the philippine tuberculosis initiatives for the private sector (philippine tips), has established a model wherein school clinics serve as satellite treatment partners of larger clinics in the delivery of the directly observed treatment, short course (dots) protocol. this "treatment at the source" allows school-based patients to get their free government-suppl ied tb medicines from the clinic each day. it also cancels out the difficulty in accessing medicines through the old model where the patient has to go to the larger clinic outside his/her school to get treatment. the model also enables the clinic to monitor the treatment progress of the student and assumes more responsibility over their health. this experience illustrates how social justice in health could be achieved from means other than fund generation. the harnessing of existing health service providers in urban communities through standardized models of treatment delivery increases the probability of treatment success, not only for tb but for other conditions as well. p - (c) voices for vulnerable populations: communalities across cbpr using qualitative methods martha ann carey, aja lesh, jo-ellen asbury, and mickey smith introduction: providing an opportunity to include, in all stages of health studies, the perspectives and experiences of vulnerable and marginalized populations is increasingly being recognized as a necessary component in uncovering new solutions to issues in health care. qualitative methods, especially focus groups, have been used to understand the perspectives and needs of community members and clinical staff in the development of program theory, process evaluation and refinement of interventions, and for understanding and interpreting results. however, little guidance is available for the optimal use of such information. methods: this presentation will draw on diverse experiences with children and their families in an asthma program in california, a preschool latino population in southern california, a small city afterschool prevention program for children in ohio, hiv/aids military personnel across all branches of the service in the united states, and methadone clinic clients in the south bronx in new york city. focus groups were used to elicit information from community members who would not usually have input into problem definitions and solutions. using a fairly common approach, thematic analysis as adapted from grounded theory, was used to identify concerns in each study. next we looked across these studies, in a meta-synthesis approach, to examine communalities in what was learned and in how information was used in program development and refinement. results: while the purposes and populations were diverse, and the type of concerns and the reporting of results varied, the conceptual framework that guided the planning and implementation of each study was similar, which led to a similar data analysis approach. we will briefly present the results of each study, and in more depth we will describe the communalities and how they were generated. conclusions: while some useful guidance for planning future studies of community based research was gained by looking across these diverse studies, it would be useful to pursue a broader examination of the range of populations and purposes to more fully develop guidance. 
background: the majority of studies examining the relationship between residential environments and cardiovascular disease have used census-derived measures of neighborhood ses. there is a need to identify specific features of neighborhoods relevant to cardiovascular disease risk. we aim to develop such measures. methods: data on neighborhood conditions were collected from a telephone survey of , residents in baltimore, md; forsyth county, nc; and new york, ny. a sample of of the initial respondents was re-interviewed - weeks after the initial interview to measure the test-retest reliability of the neighborhood scales. information was collected across seven neighborhood conditions (aesthetic quality, walking environment, availability of healthy foods, safety, violence, social cohesion, and activities with neighbors). neighborhoods were defined as census tracts or homogeneous census tract clusters. psychometric properties of the neighborhood scales were assessed by calculating cronbach's alphas (internal consistency) and intraclass correlation coefficients (test-retest reliability). pearson's correlations were calculated to test for associations between indicators of neighborhood ses (including dimensions of race/ethnic composition, family structure, housing, area crowding, residential stability, education, employment, occupation, and income/wealth) and our seven neighborhood scales. results: cronbach's alphas ranged from . (walking environment) to . (violence). intraclass correlations ranged from . (walking environment) to . (safety) and were higher than . for four out of the seven neighborhood dimensions. our neighborhood scales (excluding activities with neighbors) were consistently correlated with commonly used census-derived indicators of neighborhood ses. the results suggest that neighborhood attributes can be reliably measured. further development of such scales will improve our understanding of neighborhood conditions and their importance to health.
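for readers unfamiliar with the psychometric indices used in the neighborhood-scales abstract above, the sketch below computes cronbach's alpha for a multi-item scale and a one-way intraclass correlation for test-retest totals; the item responses are simulated rather than taken from the study, and the one-way icc is only one of several icc variants the authors may have used.

```python
# Cronbach's alpha (internal consistency) and a one-way random-effects ICC
# (test-retest reliability) computed on simulated scale data.
import numpy as np

def cronbach_alpha(items):
    """items: (n_respondents, n_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def icc_oneway(scores):
    """scores: (n_subjects, n_ratings) array, e.g. test and retest scale totals."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    ms_between = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)
    ms_within = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(1)
true_score = rng.normal(size=(200, 1))
items = true_score + rng.normal(scale=0.8, size=(200, 5))   # a simulated 5-item scale
total = items.sum(axis=1, keepdims=True)
retest = np.hstack([total, total + rng.normal(scale=1.0, size=(200, 1))])

print(f"alpha = {cronbach_alpha(items):.2f}, ICC = {icc_oneway(retest):.2f}")
```

alpha summarizes how consistently the items of one scale hang together within a single interview, while the icc summarizes how stable the scale score is across the repeat interview - weeks later, which is why the abstract reports both.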
childhood to young adulthood in a national u.s. sample. jen jen chang. introduction: prior studies indicate a higher risk of substance use in children of depressed mothers, but no prior studies have followed the offspring from childhood into adulthood to obtain more precise estimates of risk. this study aimed to examine the association between early exposure to maternal depressive symptoms (mds) and offspring substance use across time in childhood, adolescence, and young adulthood. methods: data were obtained from the national longitudinal survey of youth. the study sample includes , mother-child/young adult dyads interviewed biennially between and , with children aged to years old at baseline. data were gathered using a computer-assisted personal interview method. mds were measured in using the center for epidemiologic studies depression scale. offspring substance use was assessed biennially between and . logistic and poisson regression models with a generalized estimating equation approach were used for parameter estimation to account for possible correlations among repeated measures in a longitudinal study. results: most mothers in the study sample were white ( %) and urban residents ( %), had a mean age of years, and had at least a high school degree ( %). the mean child age at baseline was years. offspring cigarette and alcohol use increased monotonically across childhood, adolescence, and young adulthood. differential risk of substance use by gender was observed. early exposure to mds was associated with an increased risk of cigarette use (adjusted odds ratio (aor) = . , % confidence interval (ci): . , . ) and marijuana use (aor = . , % ci: . , . ), but not with alcohol use, across childhood, adolescence, and young adulthood, controlling for the child's characteristics, socioeconomic status, religiosity, maternal drug use, and father's involvement. among the covariates, higher levels of father's involvement ... conclusion: results from this study confirm previous suggestions that maternal depressive symptoms are associated with adverse child development. findings from the present study on early life experience have the potential to inform valuable prevention programs for problem substance use before disturbances become severe and therefore, typically, much more difficult to ameliorate effectively. the impact (inner-city mental health study predicting hiv/aids, club and other drug transitions) study is a multi-level study aimed at determining the association between features of the urban environment, mental health, drug use, and risky sexual behaviors. the study is randomly sampling neighborhood residents and assessing the relations between characteristics of ethnographically defined urban neighborhoods and the health outcomes of interest. a limitation of existing systematic methods for evaluating the physical and social environments of urban neighborhoods is that they are expensive and time-consuming, therefore limiting the number of times such assessments can be conducted. this is particularly problematic for multi-year studies, where neighborhoods may change as a result of seasonality, gentrification, municipal projects, immigration and the like. therefore, we developed a simpler neighborhood assessment scale that systematically assesses the physical and social environment of urban neighborhoods. the impact neighborhood evaluation scale was developed based on existing and validated instruments, including the new york city housing and vacancy survey, which is performed by the u.s. census bureau, and the nyc mayor's office of operations scorecard cleanliness program, and was modified through pilot testing and cognitive testing with neighborhood residents. aspects of the physical environment assessed in the scale include physical decay, vacancy and construction, municipal investment and green space. aspects of the social environment measured include social disorder, social trust, affluence and the formal and informal street economy. the scale assesses features of the neighborhood environment that are determined by personal (e.g., presence of dog feces), community (e.g., presence of a community garden), and municipal (e.g., street cleanliness) factors. the scale is administered systematically block-by-block in a neighborhood. trained research staff start at the northeast corner of an intersection and walk around the blocks in a clockwise direction. staff complete the scale for each street of the block, evaluating only the right side of the street. thus for each block, three or more assessments are completed. we are in the process of assessing the psychometric properties of the instrument, including inter-rater reliability and internal consistency, and determining the minimum number of blocks or street segments that need to be assessed in order to provide an accurate estimate of the neighborhood environment. these data will be presented at the conference. objective: to describe and analyze the perceptions of long-term injection drug users (idus) about their initiation into injecting in toronto.
purposive sampling was used to seek out an ethnoculturally diverse sample of idus of both genders and from all areas of the city, through recruitment from harm reduction services and from referral by other study participants. interviews asked about drug use history including first use and first injecting, as well as questions about health issues, service utilization and needs. thematic analysis was used to examine initiation of drug use and of injection. results: two conditions appeared necessary for initiation of injection. one was a developed conception of drugs and their (desirable) effects, as suggested by the work of becker for marijuana. thus virtually all panicipants had used drugs by other routes prior to injecting, and had developed expectations about effects they considered pleasureable or beneficial. the second condition was a group and social context in which such use arose. no participants perceived their initiation to injecting as involving peer pressure. rather they suggested that they sought out peers with a similar social situation and interest in using drugs. observing injection by others often served as a means to initiate injection. injection served symbolic purposes for some participants, enhancing their status in their group and marking a transition to a different social world. concl ion: better understanding of social and contextual factors motivating drug users who initiate injection can assist in prevention efforts. ma!onty of them had higher educational level ( %-highschool or higher).about . yo adffiltted to have history of alcohol & another . % had history of smoking. only . % people were on hrt & . % were receiving steroid. majority of them ( . ) did not have history of osteoporosis. . % have difficulty in ambulating. only . % had family history of osteoporosis. bmd measurements as me~sured by dual xray absorptiometry (dexa) were used for the analysis. bmd results were compare~ w ~ rbc folate & serum vitamin b levels. no statistical significance found between bmd & serum v taffiln b level but high levels of folate level is associated with normal bmd in bivariate and multivariate analysis. conclusion: in the studied elderly population, there was no relationship between bmd and vitamin b ; but there was a significant association between folate levels & bmd. introduction: adolescence is a critical period for identity formation. western studies have investigated the relationship of identity to adolescent well-being. special emphasis has been placed on the influence of ethnic identity on health, especially among forced migrants in different foreign countries. methodology: this study asses by the means of an open ended question identity categorization among youth in three economically disadvantaged urban communities in beirut, the capital of lebanon. these three communities have different histories of displacement and different socio-demographic makeup. however, they share a history of displacement due to war. results and conclusion: the results indicated that nationality was the major category of identification in all three communities followed by origin and religion. however, the percentages that self-identify by particular identity categories were significantly different among youth in the three communities, perhaps reflecting different context in which they have grown up. mechanical heart valve replacement amanda hu, chi-ming chow, diem dao, lee errett, and mary keith introduction: patients with mechanical heart valves must follow lifelong warfarin therapy. 
war· farin, however, is a difficult drug to take because it has a narrow therapeutic window with potential seri· ous side effects. successful anticoagulation therapy is dependent upon the patient's knowledge of this drug; however, little is known regarding the determinants of such knowledge. the purpose of this study was to determine the influence of socioeconomic status on patients' knowledge of warfarin therapy. methods: a telephone survey was conducted among patients to months following mechan· ical heart valve replacement. a previously validated -item questionnaire was used to measure the patient's knowledge of warfarin, its side effects, and vitamin k food sources. demographic information, socioeconomic status data, and medical education information were also collected. results: sixty-one percent of participants had scores indicative of insufficient knowledge of warfarin therapy (score :s; %). age was negatively related to warfarin knowledge scores (r= . , p = . ). in univariate analysis, patients with family incomes greater than $ , , who had greater. than a grade education and who were employed or self employed had significantly higher warfarm knowledge scores (p= . , p= . and p= . respectively). gender, ethnicity, and warfar~n therapy prior to surgery were not related to warfarin knowledge scores. furthermore, none of t~e. m-hospital tea~hing practices significantly influenced warfarin knowledge scores. however, panic ~ants who _rece v~d post discharge co~unity counseling had significantly higher knowledge scores tn comp~r son with those who did not (p= . ). multivariate regression analysis revealed that und~r~tandmg the ~oncept of ?ternational normalized ratio (inr), knowing the acronym, age and receiving ~ommum !' counseling after discharge were the strongest predictors of warfarin kn~wledge. s~ oeconom c status was not an important predictor of knowledge scores on the multivanate analysis. poster sessions v ~the majority of patients at our institution have insufficient knowledge of warfarin therapy.post-discharge counseling, not socioeconomic status, was found to be an important predictor of warfarin knowledge. since improved knowledge has been associated with improved compliance and control, our findings support the need to develop a comprehensive post-discharge education program or, at least, to ensure that patients have access to a community counselor to compliment the in-hospital educatiop program. brenda stade, tony barozzino, lorna bartholomew, and michael sgro lnttotl#ction: due to the paucity of prospective studies conducted and the inconsistency of results, the effects of prenatal cocaine exposure on functional abilities during childhood remain unclear. unlike the diagnosis of fetal alcohol spectrum disorder, a presentation of prenatal cocaine exposure and developmental and cognitive disabilities does not meet the criteria for specialized services. implications for public policy and services are substantial. objective: to describe the characteristics of children exposed to cocaine during gestation who present to an inner city specialty clinic. mnbods: prospective cohort research design. sample and setting: children ages to years old, referred to an inner city prenatal substance exposure clinic since november, . data collection: data on consecutive children seen in the clinic were collected over an month period. 
instrument: a thirteen ( ) page intake and diagnostic form, and a detailed physical examination were used to collect data on prenatal substance history, school history, behavioral problems, neuro-psychological profile, growth and physical health of each of the participants. data analysis: content analysis of the data obtained was conducted. results: twenty children aged to years (mean= . years) participated in the study. all participants had a significant history of cocaine exposure and none had maternal history or laboratory (urine, meconium or hair) exposure to alcohol or other substances. none met the criteria of fetal alcohol spectrum disorder. all were greater than the tenth percentile on height, weight, and head circumference, and were physically healthy. twelve of the children had iqs at the th percentile or less. for all of the children, keeping up with age appropriate peers was an ongoing challenge because of problems in attention, motivation, motor control, sensory integration and expressive language. seventy-four percent of participants had significant behavioral and/or psychological problems including aggressiveness, hyperactivity, lying, poor peer relationships, extreme anxiety, phobias, and poor self-esteem. conclusion: pilot study results demonstrated that children prenatally exposed to cocaine have significant learning, behavioural, and social problems. further research focusing on the characteristics of children prenatally exposed to cocaine has the potential for changing policy and improving services for this population. methods: trained interviewers conducted anonymous quantitative surveys with a random sample (n= ) of female detainees upon providing informed consent. the survey focused on: sociodemographic background; health status; housing and neighborhood stability and social resource availability upon release. results: participants were % african-american, % white, % mixed race and % native american. participants' median age was , the reported median income was nto area. there is mounting evidence that the increasing immigrant population has a_ sigmfic~nt health disadvantage over canadian-born residents. this health disadvantage manifests particularly m the ma "ority of "mm "gr t h h d be · · h . . . . an s w o a en m canada for longer than ten years. this group as ~n associ~te~ with higher risk of chronic disease such as cardiovascular diseases. this disparity twccb n ma onty of the immigrant population and the canadian-born population is of great importance to ur an health providers d" · i i · b as isproporttonate y arge immigrant population has settled in the ma or ur an centers. generally the health stat f · · · · · · h h been . us most mm grants s dynamic. recent mm grants w o av_e ant •;ffca~ada _for less ~han ~en years are known to have a health advantage known as 'healthy imm • ~ants r::r · ~:s eff~ ~ defined by the observed superior health of both male and female recent immi- immigrant participation in canadian society particularly the labour market. a new explanation of the loss of 'healthy immigrant effect' is given with the help of additional factors. lt appears that the effects of social exclusion from the labour market leading to social inequalities first experienced by recent immigrant has been responsible for the loss of healthy immigrant effect. this loss results in the subsequent health disadvantage observed in the older immigrant population. 
a study on patients perspectives regarding tuberculosis treatment by s.j.chander, community health cell, bangalore, india. introduction: the national tuberculosis control programme was in place over three decades; still tuberculosis control remains a challenge unmet. every day about people die of tuberculosis in india. tuberculosis affects the poor more and the poor seek help from more than one place due to various reasons. this adversely affects the treatment outcome and the patient's pocket. many tuberculosis patients become non-adherence to treatment due to many reasons. the goal of the study was to understand the patient's perspective regarding tuberculois treatment provided by the bangalore city corporation. (bmc) under the rntcp (revised national tuberculosis control programme) using dots (directly observed treatment, short course) approach. bmc were identified. the information was collected using an in-depth interview technique. they were both male and female aged between - years suffering from pulmonary and extra pulmonary tuberculosis. all patients were from the poor socio economic background. results: most patients who first sought help from private practitioners were not diagnosed and treated correctly. they sought help form them as they were easily accessible and available but they. most patients sought help later than four weeks as they lacked awareness. a few of patients sought help from traditional healers and magicians, as it did not help they turned to allopathic practitioners. the patients interviewed were inadequately informed about various aspect of the disease due to fear of stigma. the patient's family members were generally supportive during the treatment period there was no report of negative attitude of neighbours who were aware of tuberculosis patients instead sympathetic attitude was reported. there exists many myth and misconception associated with marriage and sexual relationship while one suffers from tuberculosis. patients who visited referral hospitals reported that money was demanded for providing services. most patients had to borrow money for treatment. patients want health centres to be clean and be opened on time. they don't like the staff shouting at them to cover their mouth while coughing. conclusion: community education would lead to seek help early and to take preventive measures. adequate patient education would remove all myth and conception and help the patients adhere to treatment. since tb thrives among the poor, poverty eradiation measures need to be given more emphasis. mere treatment approach would not help control tuberculosis. lntrod#ction: the main causative factor in cervical cancer is the presence of oncogenic human papillomavitus (hpv). several factors have been identified in the acquisition of hpv infection and cervical cancer and include early coitarche, large number of lifetime sexual partners, tobacco smoking, poor diet, and concomitant sexually transmitted diseases. it is known that street youth are at much higher risk for these factors and are therefore at higher risk of acquiring hpv infection and cervical cancer. thus, we endeavoured to determine the prevalence of oncogenic hpv infection, and pap test abnormalities, in street youth. ~tbods: this quantitative study uses data collected from a non governmental, not for profit dropin centre for street youth in canada. over one hundred females between the ages of sixteen and twentyfour were enrolled in the study. 
of these females, all underwent pap testing; those with a previous history of an abnormal pap test, or an abnormal-appearing cervix on clinical examination, also underwent hpv deoxyribonucleic acid (dna) testing with the digene hybrid capture ii. results: data analysis is underway. the following results will be presented: 1) the number of positive hpv-dna results, 2) pap test results in this group, and 3) recommended follow-up. the results of this study will provide information about the prevalence of oncogenic hpv-dna infection and pap test abnormalities in a population of street youth. the practice implications related to our research include the potential for improved gynecologic care of street youth. in addition, our recommendations on the usefulness of hpv testing in this population will be addressed.

methods: a health promotion and disease prevention tool was developed over a period of several years to meet the health needs of recent immigrants and refugees seen at access alliance multicultural community health centre (aamchc), an inner-city community health centre in downtown toronto. this instrument was derived from the anecdotal experience of health care providers, a review of the medical literature, and consultations with experts in migration health. herein we present the individual components of this instrument, aimed at promoting health and preventing disease in new immigrants and refugees to toronto. results: the health promotion and disease prevention tool for immigrants focuses on three primary health-related areas: 1) globally important infectious diseases including tuberculosis (tb), hiv/aids, syphilis, viral hepatitis, intestinal parasites, and vaccine-preventable diseases (vpd); 2) cancers caused by infectious diseases or endemic to developing regions of the world; and 3) mental illnesses, including those developing among survivors of torture. the health needs of new immigrants and refugees are complex, heterogeneous, and often reflect conditions found in the immigrant's country of origin. ideally, the management of all new immigrants should be adapted to their experiences prior to migration; however, the scale and complexity of this strategy prohibit its general use by healthcare providers in industrialized countries. an immigrant-specific disease prevention instrument could help quickly identify and potentially prevent the spread of dangerous infectious diseases, detect cancers at earlier stages of development, and inform health care providers and decision makers about the most effective and efficient strategies to prevent serious illness in new immigrants and refugees.

introduction: as poverty continues to grip pakistan, the number of urban street children grows and has now reached alarming proportions, demanding far greater action than is presently offered. urbanization, natural catastrophe, drought, disease, war and internal conflict, and economic breakdown causing unemployment and homelessness have forced families and children to search for a "better life," often putting children at risk of abuse and exploitation. objectives: to reduce drug use on the streets, in particular injectable drug use, and to prevent the transmission of stds/hiv/aids among vulnerable youth. methodology: a baseline study and situation assessment of health problems, particularly hiv and stds, among street children of quetta, pakistan.
the program launched a peer education program, including: awareness of self and body protection focusing on child sexual abuse, stds/hiv/aids, life skills, gender and sexual rights awareness, preventive health measures, and care at work. it also opened care and counselling centres for these working and street children and handed these centres over to local communities. relationships among aids-related knowledge and beliefs and the sexual behavior of young adults were determined. reasons for unsafe sex included: misconceptions about disease etiology, conflicting cultural values, risk denial, partner pressure, trust and partner significance, accusations of promiscuity, lack of community endorsement of protective measures, and barriers to condom access. in addition, socio-economic pressure, physiological issues, poor community participation and attitudes, and low education levels limited the effectiveness of existing aids prevention education. according to the baseline study, male children are exposed to knowledge of safe sex through peers, hakims, and blue films; working children found sexual information through older children and their teachers (ustad). recommendations: it was found that working children are highly vulnerable to stds/hiv/aids, as they lack protective measures against sexual abuse and are unaware of safe sexual practices.

conclusion: non-fatal overdose was a common occurrence for idu in vancouver, and was associated with several of the factors considered, including crystal methamphetamine use. these findings indicate a need for structural interventions that seek to modify the social and contextual risks for overdose, increased access to treatment programs, and trials of novel interventions such as take-home naloxone programs.

background: injection drug users (idus) are at elevated risk for involvement in the criminal justice system due to possession of illicit drugs and participation in drug sales or markets. the criminalization of drug use may produce significant social, economic and health consequences for urban poor drug users. injection-related risks have also been associated with criminal justice involvement or the risk of such involvement. previous research has identified racial differences in drug-related arrests and incarceration in the general population. we assess whether criminal justice system involvement differs by race/ethnicity among a community sample of idus. we analyzed data collected from idus (n = , ) who were recruited in san francisco, and interviewed and tested for hiv. criminal justice system involvement was measured by arrest, incarceration, drug felony conviction, and loss/denial of social services associated with a drug felony conviction. multivariate analyses compared measures of criminal justice involvement and race/ethnicity after adjusting for socio-demographic and drug-use behaviors including drug preference, years of injection drug use, injection frequency, age, housing status, and gender. the six-month prevalence of arrest was highest for whites ( %), compared to african americans ( %) and latinos ( %), as was the mean number of weeks spent in jail in the past months ( . vs. . and . weeks). these differences did not remain statistically significant in multivariate analyses. latinos reported the highest prevalence of a lifetime drug felony conviction ( %) and mean years of lifetime incarceration in prison ( . years), compared to african americans ( %, . years) and whites ( %, . years).
being african american was independently associated with having a felony conviction and with years of incarceration in prison as compared to whites. the history of involvement in the criminal justice system is widespread in this sample. when looking at racial/ethnic differences over a lifetime, including total years of incarceration and drug felony conviction, the involvement of african americans in the criminal justice system is higher as compared to whites. more rigorous examination of these data and others on how criminal justice involvement varies by race, as well as the implications for the health and well-being of idus, is warranted.

homelessness is a major social concern that has great impact on those living in urban communities. metro manila, the capital of the philippines, is a highly urbanized area with the highest concentration of urban poor population: an estimated , families or , , individuals. this exploratory study is the first definitive study done in manila that explores the needs and concerns of homeless street dwellers. it aims to establish the demographic profile, lifestyle patterns and needs of the street dwellers in the six districts of the city of manila, to establish a database for planning health and other related interventions. based on protocol-guided field interviews of street dwellers, the data are useful as a template for reference in analyzing urban homelessness in asian developing-country contexts. results of the study show that, generally, the state of homelessness reflects a feeling of discontent, disenfranchisement and powerlessness that contributes to the difficulty of getting out of the streets. the perceived problems and/or dangers of living on the streets are generally associated with exposure to extreme weather conditions and with the status of being vagrants, which makes them prone to harassment by the police. the health needs of the street dweller respondents established in this study indicate that the existing health-related services for the homeless poor are ineffective. the street dweller respondents have little or no access to social and health services, if any. some respondents claimed that although they were able to get service from health centres or government hospitals, the medicines required for treatment are not usually free and are beyond their means. this group of homeless people needs well-planned interventions to help them improve their current situation and support their daily living. the expressed social needs of the street dweller respondents were significantly concentrated on the economic aspect, that is, having a permanent source of income to afford food, shelter, clothing and education. these reflect the street dwellers' need for personal upliftment and safety. in short, most of their expressed need is for a combination of socioeconomic resources that would provide long-term options that are better than the choice of living on the streets. the suggested interventions based on the findings will be discussed.

methods: idus aged and older who injected drugs within the prior month were recruited using rds, which relies on referral networks to generate unbiased prevalence estimates. a diverse and motivated group of idu 'seeds' were given three uniquely coded coupons and encouraged to refer up to three other eligible idus, for which they received $ usd per recruit. all subjects provided informed consent, an anonymous interview and a venous blood sample for serologic testing of hiv, hcv and syphilis antibodies.
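the degree-weighted estimation that rds relies on can be illustrated with a short sketch. this is a minimal illustration of the widely used inverse-degree (rds-ii) prevalence estimator under the assumption that each recruit reports a personal network size and a test result; it is not the study's actual analysis, and the function name, field names and example values are hypothetical.

# sketch: degree-weighted (rds-ii) prevalence estimate from coupon-based recruitment data.
# each recruit reports a personal network size ("degree") and a test result; the sample
# is weighted by the inverse of the reported degree to offset over-sampling of well-connected people.

def rds_ii_prevalence(records):
    """records: list of dicts with 'degree' (self-reported network size, > 0)
    and 'positive' (bool test result). returns the inverse-degree-weighted prevalence."""
    num = sum(1.0 / r["degree"] for r in records if r["positive"])
    den = sum(1.0 / r["degree"] for r in records)
    return num / den

# hypothetical example: three recruits with different network sizes
sample = [
    {"degree": 10, "positive": True},
    {"degree": 50, "positive": False},
    {"degree": 5,  "positive": True},
]
print(round(rds_ii_prevalence(sample), 3))  # low-degree (harder-to-reach) recruits carry more weight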
results: a total of idus were recruited in tijuana and in juarez, of whom the majority were male ( . % and . %) and the median age was .

methods: using data from a multi-site survey on the health and well-being of a random sample of older chinese in seven canadian cities, this paper examined the effect of the size of the chinese community on the health status of aging chinese. the sample (n = , ) consisted of aging chinese aged years and older. the physical and mental status of the participants was measured by a chinese-version medical outcomes study short form (sf- ). one-way analysis of variance and post-hoc scheffe tests were used to test the differences in health status between participants residing in cities representing three different sizes of chinese community. regression analysis was also used to examine the contribution of the size of the chinese community to physical and mental health status. results: in general, aging chinese who resided in cities with a smaller chinese population were healthier than those who resided in cities with a larger chinese population. the size of the chinese community was significant in predicting both the physical and the mental health status of the participants. the findings also indicated potential underlying effects of variations in country of origin, access barriers, and socio-economic status of the aging chinese in communities with different chinese population sizes. the study concluded that the size of an ethnic community affects the health status of the aging population from that ethnic community. the intra-group diversity within the aging chinese identified in this study helped to demonstrate the different socio-cultural and structural challenges facing the aging population in different urban settings.

this study draws on the urban health and demographic surveillance system implemented by the african population & health research center (aphrc) in two slum settlements of nairobi city. it focuses on common child illnesses including diarrhoea, fever, cough, common cold and malaria, as well as on curative health care service utilization. measures of ses were created using information collected at the household level; other variables of interest include maternal demographic and cultural factors, and child characteristics. statistical methods appropriate for clustered data were used to identify correlates of child morbidity. preliminary results: morbidity was reported for , ( . %) out of , children, accounting for a total of , illness episodes. cough, diarrhoea, runny nose/common cold, abdominal pains, malaria and fever made up the top six forms of morbidity. the only factors that had a significant association with morbidity were the child's age, ethnicity and type of toilet facility. however, all measures of socioeconomic status (mother's education, household socioeconomic status, and mother's work status) had a significant effect on seeking outside care. age of the child, severity of illness, type of illness and survival of father and mother were also significantly associated with seeking health care outside the home. the results of this study highlight the need to address environmental conditions, basic amenities, and livelihood circumstances to improve child health in poor communities. the fact that socioeconomic indicators did not have a significant effect on the prevalence of morbidity but were significant for health-seeking behaviour indicates that, while economic resources may have a limited effect in preventing child illnesses when children are living in poor environmental conditions, being enlightened and having greater economic resources would mitigate the impact of those conditions and reduce child mortality through better treatment of sick children.
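the abstract does not specify which clustered-data methods were used; one standard option is a logistic model fitted with generalized estimating equations, with children clustered within households. the sketch below is an assumed illustration of that approach, not the study's code; the file and column names (child_morbidity.csv, sick, age_months, mother_edu, toilet_type, household_id) are hypothetical.

# sketch: logistic gee for child morbidity with clustering by household.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("child_morbidity.csv")    # hypothetical file: one row per child

model = smf.gee(
    "sick ~ age_months + C(mother_edu) + C(toilet_type)",
    groups="household_id",                 # children in the same household are correlated
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())                    # robust standard errors account for the clustering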
inequality in human life chances is about the most visible character of third world urban space. economic variability and social efficiency have often been invoked to justify such inequalities. within this separation, households exist that share similar characteristics and are found to inhabit a given spatial unit of the city. the residential geography of cities in the third world is thus characterized by native areas whose core is made up of deteriorated slum property, poor living conditions and a decayed environment; features which personify deprivation in its unimaginable magnitude. there is evidence that these conditions are manifested in disturbingly high levels of morbidity and mortality. poor urban governance on the one hand, and a host of other factors (corruption, insensitive leadership, poor security, etc.) that suggest cracks in the levels of, and adherence to, the principles of social justice on the other, combine to reinforce the impacts of deprivation and to perpetuate them. by identifying health problems that are caused or driven by either material or social deprivation or both, the paper concludes that deprivation need not be accepted as a way of life and that a deliberate effort must be made to stem the tide of the ongoing levels of abject poverty in the third world. to the extent that income-related poverty is about the most important of all ramifications of poverty, efforts must include fiscal empowerment of the poor in deprived areas like the inner city. this will improve the willingness of such people to use facilities of care, because they will be able to effectively demand them, since in a real sense there is no such thing as free medical services.

there were men with hiv infection included in the present study (mean age and education of . (sd = . ) and . (sd = . ), respectively). a series of multiple regressions was used to examine the unique contributions of symptom burden (depression, cognitive symptoms, pain, fatigue), neuropsychological impairment (psychomotor efficiency), demographics (age and education) and hiv disease (cdc- staging) to the iirs total score and the iirs subscores: (1) activities of daily living (work, recreation, diet, health, finances); (2) psychosocial functioning (e.g., self-expression, community involvement); and (3) intimacy (sex life and relationship with partner). results: total iirs score (r = . ) was associated with aids diagnosis (β = . , p < . ) and symptoms of pain (β = - . , p < . ), fatigue (β = - . , p < . ) and cognitive difficulties (β = . , p < . ). for the three dimensions of the iirs, multiple regression results revealed: (1) activities of daily living (r = . ) were associated with aids diagnosis (β = . , p < . ) and symptoms of pain

mg/dl) on dipstick analysis. results: there were , ( . %) males. the racial distribution was chinese ( . %), malay ( . %), indian ( . %) and others ( . %). among participants who were apparently "healthy" (asymptomatic and without a history of dm, ht, or kd), the gender- and race-wise percentage prevalence of elevated bp (> / ), rbg (> mg/dl) and a positive urine dipstick for protein was as follows: male ( . ; . ; . ); female ( . ; . ; . ); chinese ( . ; . ; ); malay ( . ; . ; . ); indian ( . ; . ; . ); others ( . ; . ; . ); total ( . ; . ; . ). the percentages of participants with more than one abnormality were as follows: of those with bp > / mmhg, % also had rbg > mg/dl and . % had proteinuria > 1; of those with rbg > mg/dl, % also had proteinuria and % had bp > / mmhg; of those with proteinuria, % also had rbg > mg/dl and % had bp > / mmhg. conclusion: we conclude that subclinical abnormalities in urinalysis, bp and rbg readings are prevalent across all genders and racial groups in the adult population. the overlap of abnormalities points towards high risk for esrd as well as cardiovascular disease. this indicates the urgent need for population-based programs aimed at creating awareness, and for initiatives to control and retard the progression of disease.

introduction: various theories have been proposed that link differential psychological vulnerability to health outcomes, including developmental theories about attachment, separation, and the formation of psychopathology. research in the area of psychosomatic medicine suggests an association between attachment style and physical illness, with stress as a mediator. two main hypotheses were explored in the present study: (1) that individuals living with hiv who were "psychologically vulnerable" at study entry would be more likely to experience symptoms of depression, anxiety and physical illness over the course of the -month study period; and (2) that life stressors and social support would mediate the relationship between psychological vulnerability and the psychological and physical outcomes. participants completed measures including the rsles, the state-trait anxiety inventory (stai), the beck depression inventory (bdi), and a -item physical symptoms inventory. we characterized participants as having "psychological vulnerability and low resilience" if they scored above on the raas (insecure attachment) or above on the das (negative expectations about oneself). results: at baseline, % of participants were classified as having "low resilience". focusing on anxiety, the average cumulative stai score of the low-resilience group was significantly higher than that of the high-resilience group ( . , sd = . , versus . , sd = . ; f(1, ) = . , p < . ). similar results were obtained for bdi and physical symptoms (f(1, ) = . , p < . and f(1, ) = . , p < . , respectively). after controlling for resilience, variance in life stressors averaged over time was a significant predictor of depressive and physical symptoms, but not of anxiety; however, these associations became non-significant when four participants with high values were removed. similarly, after controlling for resilience, the effect of variance in social support averaged over time became non-significant. conclusion: not only did "low resilience" predict poor psychological and physical outcomes, it was also predictive of life events and social support; that is, individuals who were low in resilience were more likely to experience more life events and poorer social support than individuals who were resilient. for individuals with this vulnerability to physical, psychological, and social outcomes, there is a need to develop and test interventions to improve health outcomes in this group.
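the "controlling for resilience" analysis described above is essentially a hierarchical regression: the resilience indicator is entered first, then the averaged stress and support scores, and their incremental contribution is examined. the minimal sketch below assumes hypothetical column names (bdi for depressive symptoms, low_resilience for the vulnerability indicator, mean_stress and mean_support for the averaged scores) rather than the study's actual variables.

# sketch: do averaged life stress and support predict depressive symptoms after controlling for resilience?
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hiv_cohort.csv")   # hypothetical file: one row per participant

base = smf.ols("bdi ~ low_resilience", data=df).fit()
full = smf.ols("bdi ~ low_resilience + mean_stress + mean_support", data=df).fit()

# incremental contribution of stress and support beyond the resilience grouping
print(base.rsquared, full.rsquared)
print(full.params)
print(full.pvalues)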
rajat kapoor, ruby gupta, and jugal kishore. introduction: young people in india represent almost one-fourth of the total population. they face significant risks related to sexual and reproductive health, and many lack the information and skills necessary to make informed sexual and reproductive health choices. objective: to study the level of awareness about contraceptives among youth residing in urban and rural areas of delhi. method: a sample of youths was selected from barwala (rural; n= ) and balmiki basti (urban slum; n= ), the field practice areas of the department of community medicine, maulana azad medical college, delhi. a pre-tested questionnaire was used to collect the information. statistical tests such as the fisher exact test and the t-test were applied as appropriate. result: nearly out of ( . %) youth had heard of at least one type of contraceptive, and the majority ( . %) had heard about condoms. however, awareness regarding the usage of contraceptives was as low as . % for terminal methods and . % for condoms. the condom was considered the best technique before and after marriage and also after childbirth. the difference between the rural and urban groups was statistically significant (p = . ). youth knew that contraceptives were easily available ( %), mainly at the dispensary ( . %) and chemist shops ( . %). only . % knew about emergency contraception. the only advantage of contraceptives cited was population control ( . %); however, . % believed that they could also control hiv transmission. awareness of side effects was poor in both groups, but the differences were statistically significant for pills (p = . ). media was the main source of information ( %). the majority of youth were willing to discuss contraceptives with their spouse ( . %), but not with others. . % of youth believed that people in their age group use contraceptives, and % of youth accepted that they had used contraceptives at least once. % felt that children in a family is appropriate, but only . % believed in -year spacing. conclusion: awareness about contraceptives is vital for youth to protect their sexual and reproductive health. knowledge about terminal methods, emergency contraception, and the side effects of various contraceptives needs to be strengthened in mass media and contraceptive awareness campaigns.

methods: elderly aged + were interviewed in poor communities in beirut, the capital of lebanon, one of which is a palestinian refugee camp. depression was assessed using the -item geriatric depression scale (gds- ). specific questions relating to aspects of religiosity were asked, as well as questions pertaining to demographic, psychosocial and health-related variables. results: depression was prevalent in % of the interviewed elderly, with the highest proportion in the palestinian refugee camp ( %). mosque attendance significantly reduced the odds of being depressed only for the palestinian respondents. depression was further associated, in particular communities, with low satisfaction with income, functional disability, and illness during the last year. conclusion: religious practice, which was related to depression only among the refugee population, is discussed as more an indicator of social cohesion and solidarity than an aspect of religiosity.
furthermore, it has been suggested that minority groups rely on religious stratagems to cope with their pain more than other groups. implications of the findings are discussed with particular relevance to the populations studied.

nearly thirty percent of india's population lives in urban areas, and urbanization has resulted in the rapid growth of urban slums. in the mega-city of chennai, the slum population ( . percent) faces greater health hazards due to overcrowding, poor sanitation, lack of access to safe drinking water and environmental pollution. among the slum population, the health of women and children is most neglected, resulting in a burden of both communicable and non-communicable diseases. the focus of the paper is to present the epidemiological profile of children (below years) in the slums of chennai, their health status, hygiene and nutritional factors, the social response to health, the trends in child health and urbanization over a decade, health accessibility factors, the role of gender in health care, and an assessment of the impact of health education on children. the available data show that child health in slums is worse than in rural areas. though the slum population is decreasing, there is a need to explore programme interventions and to carry out surveys collecting data on specific health implications for slum children.

objective: during the summer of there was a heat wave in central europe, producing an excess number of deaths in many countries including spain. the city of barcelona was one of the places in spain where temperatures often surpassed the excess-heat threshold associated with an increase in mortality. the objective of the study was to determine whether the excess mortality which occurred in barcelona was dependent on age, gender or educational level, important but often neglected dimensions of heat wave-related studies. methods: barcelona, the second largest city in spain ( , , inhabitants in ), is located on the north-eastern coast. we included all deaths of residents of barcelona older than years that occurred in the city during the months of june, july and august of that year, and also during the same months of the preceding years. all analyses were performed for each sex separately. the daily number of deaths in the heat-wave year was compared with the mean daily number of deaths for the period - for each educational level. poisson regression models were fitted to obtain the rr of death in the heat-wave year with respect to the period - for each educational level and age group. results: the excess mortality during that summer was more important for women than for men and among older ages. although the increase was observed in all educational groups, in some age groups the increase was larger for people with less than primary education. for example, for women in the group aged - , the rr of dying in the heat-wave year compared to the reference period was . ( % ci: . - . ) for women with no education and . ( % ci: . - . ) for women with primary education or higher. when we consider the number of excess deaths, for total mortality (>= years) the excess numbers were higher for those with no education ( . for women and . for men) and those with less than primary education ( . for women and for men) than for those with more than primary education ( . for women and - . for men). conclusion: age, gender and educational level were important in the barcelona heat wave. it is necessary to implement response plans to reduce heat-related morbidity and mortality. policies should be addressed to the whole population, but should also focus particularly on the oldest population of low educational level.
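a poisson model of the kind described above can be sketched as follows. this is a hedged illustration for a single education-level stratum, not the study's code; the file and column names (daily_deaths.csv, deaths, is_heatwave_year, age_group) are hypothetical.

# sketch: poisson regression of daily death counts; rr for the heat-wave summer vs. reference summers.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

daily = pd.read_csv("daily_deaths.csv")   # one row per day, already restricted to one education level and sex

model = smf.glm(
    "deaths ~ is_heatwave_year + C(age_group)",   # is_heatwave_year = 1 for heat-wave summer days, 0 otherwise
    data=daily,
    family=sm.families.Poisson(),
).fit()

rr = np.exp(model.params["is_heatwave_year"])            # relative risk of death in the heat-wave year
ci = np.exp(model.conf_int().loc["is_heatwave_year"])    # 95% confidence interval for the rr
print(rr, ci.values)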
introduction: recently there has been much public discourse on homelessness and its impact on health. measures have intensified to get people off the street and into permanent housing. for maximum success, it is important to first determine the needs of those to be housed. their views on housing and support requirements have to be considered, as they are the ones affected. as few research studies include the perspectives of homeless people themselves, little is known about how they experience the impacts on their health and what kinds of supports they believe they need to obtain housing and stay housed. the purpose of this study was to add the perspectives of homeless people to the discourse, based on the assumption that they are the experts on their own situations and needs. housing is seen as a major determinant of health. the research questions were: what are the effects of homelessness on health? what kinds of supports are needed for homeless people to get off the street? both questions sought the views of homeless individuals on these issues. methods: this study is qualitative, descriptive and exploratory. semi-structured interviews were conducted with homeless persons on street corners, in parks and in drop-ins. subsequently, a thematic analysis was carried out on the data. results: the findings show that individuals' experiences of homelessness deeply affect their health. apart from physical impacts, all talked about how their emotional health and self-esteem are affected. the system itself, rather than being useful, was often perceived as disabling and dehumanizing, resulting in hopelessness and resignation to life on the street. neither welfare nor minimum-wage jobs are sufficient to live on and pay rent. educational upgrading and job training, rather than enforced idleness, are desired by most initially. in general, the longer persons were homeless, the more they fell into patterned cycles of shelter/street life, temporary employment/unemployment, sometimes addictions, and often unsuccessful housing episodes. conclusions: participants believe that resources should be put into training and education for the acquisition of job skills and confidence, to avoid homelessness or minimize its duration. to afford housing, low-income people and welfare recipients need subsidies. early interventions, 'housing first', more humane and efficient processes for negotiating the welfare system, respectful treatment by service providers and some extra financial support in an initial crisis were suggested as helpful for avoiding homelessness altogether or for helping most homeless people to leave the street.

this study is a national homelessness initiative-funded analysis examining the experiences and perceptions of street youth vis-a-vis their health/wellness status. through in-depth interviews with street youth in halifax, montreal, toronto, calgary, ottawa and vancouver, this paper explores healthy and not-so-healthy practices of young people living on the street. qualitative interviews with health and social service providers complement the analysis. more specifically, the investigation uncovers how street youth understand health and wellness, how they define good and bad health, and their experiences in accessing diverse health services.
findings suggest that living on the street impacts physical, emotional and spiritual well-being, leading to cycles of despair, anger and helplessness. the majority of street youth services act as "brokers" for young people who desire health care services yet refuse to approach formal health care organizational structures. as such, this study also provides case examples of promising youth services across canada which are emerging as critical spaces for street youth to heal from the ravages of street culture. as young people increasingly make up a substantial proportion of the homeless population in canada, it becomes urgent to explore the multiple ways in which we can support them to regain a sense of well-being and "citizenship."

p - (c) health and livelihood implications of marginalization of slum dwellers in provision of water and sanitation services in nairobi city. elizabeth kimani, eliya zulu, and chi-chi undie. introduction: un-habitat estimates that % of urban residents in kenya live in slums; yet, due to their illegal status, they are not provided with basic services such as water, sanitation and health care. consequently, these services are provided by vendors who typically provide poor services at exorbitant prices. this paper investigates how the inequality in provision of basic services affects the health and livelihood circumstances of the poor residents of nairobi slums. methods: this study uses qualitative and quantitative data collected through the ongoing longitudinal health and demographic study conducted by the african population and health research center in slum communities in nairobi. we use descriptive, analytical and qualitative techniques to assess how concerns relating to water supply and environmental sanitation services rank among the community's general and health needs/concerns, and how this context affects their health and livelihood circumstances. results: water ( %) and sanitation ( %) were the most commonly reported health needs and were also key among general needs (after unemployment) among slum dwellers. water and sanitation services are mainly provided by exploitative vendors who operate without any regulatory mechanism and charge exorbitantly for their poor services. for instance, slum residents pay about times more for water than non-slum households. water supply is irregular and residents often go for a week without water; prices are hiked and hygiene is compromised during such shortages. most houses do not have toilets, and residents have to use commercial toilets or adopt unorthodox means such as disposing of their excreta in nearby bushes or in plastic bags that they throw in the open. as a direct result of the poor environmental conditions and inaccessible health services, slum residents are not only sicker, they are also less likely to utilise health services and, consequently, more likely to die than non-slum residents. for instance, the prevalence of diarrhoea among children in the slums was % compared to % in nairobi as a whole and % in rural areas, while under-five mortality rates were / , / and / , respectively. the results demonstrate the need for change in government policies that deprive the rapidly expanding urban poor population of basic services and of regulatory mechanisms that would protect them from exploitation.
the poor environmental sanitation and lack of basic services compound slum residents' poverty, since they pay much more for the relatively poor services than their non-slum counterparts, and also increase their vulnerability to infectious diseases and mortality.

since , iepas has been working in harm reduction, becoming the pioneer in latin america in bringing this methodology to brazil. nowadays the main goal is to expand this strategy in the region and to strive to change drug policy in brazil. in this way, the harm reduction: health and citizenship program works in two areas to promote the citizenship of idu and of people living with hiv/aids, offering legal assistance for this population and outreach work for needle exchange to reduce harm and the dissemination of hiv/aids/hepatitis. the methodology used in the outreach work is peer education; needle exchange, distribution of condoms and folders to reduce harm and the dissemination of diseases like hiv/aids/hepatitis, and counselling on seeking basic health and rights services are activities of this program. legal attendance for the target population is offered at iepas headquarters every week; this legal assistance ranges from supplying people with correct legal information to filing a lawsuit. presentations on harm reduction and drug policy were given to police chiefs and government officials. in the last year, idu were attended and non-injecting drug users reached, and needles and syringes were exchanged. in legal assistance, people were attended ( people living with aids, drug users, injecting drug users, and not in the profile), lawsuits were filed, and lawsuits are in current activity. broadcasting of the harm reduction strategies by the press helps to move public opinion, gather supporters and diminish controversies regarding such actions. a majority of police officers do not know of the existence of this policy, and it is still polemical to discuss this subject with this part of the population.

women remain one of the most under-serviced segments of the nigerian population, and a focus on their health and other needs is of special importance. the singular focus of the nigerian family welfare program is mostly on demographic targets, seeking to increase contraceptive prevalence. this has meant the neglect of many areas of women's reproductive health. reproductive health is affected by a variety of socio-cultural and biological factors on the one hand, and by the quality of the service delivery system and its responsiveness on the other. a woman-based approach is one which responds to the needs of adult women and adolescent girls in a culturally sensitive manner. women's unequal access to resources, including health care, is well known in nigeria, where stark gender disparities are a reality. maternal health activities are unbalanced, focusing on immunisation and the provision of iron and folic acid rather than on sustained care of women or on the detection and referral of high-risk cases. a cross-sectional study of a municipal government-owned hospital from each of the geo-political regions in nigeria was carried out (a total of centers). as part of the research, hospital records were used as background, in addition to a -week intensive investigation in the obstetric and gynecology departments. little is known, for example, of the extent of gynecological morbidity among women; the little that is known suggests that the majority suffer from one or more reproductive tract infections. although abortion is widespread, it continues to be performed under illegal and unsafe conditions.
with the growing hiv pandemic, while high-risk groups such as commercial sex workers and their clients have been studied, little has been accomplished in the larger population, and particularly among women, regarding std and hiv education. conclusions: programs of various governmental or non-governmental agencies need to involve strategies to broaden the narrow focus of services and, more importantly, to put women's reproductive health services and information needs in the forefront. there is a need to reorient communication and education activities to incorporate a wider interpretation of reproductive health, to focus on the varying information needs of women, men, and youth, and to use the media most suitable to convey information on reproductive health to these diverse groups.

introduction: it is estimated that there are - youths living on the streets in ottawa, on their own with the assistance of social services or in poverty with a parent. this population is under-serviced in many areas, including health care. many of these adolescents are uncomfortable with or unable to access the health care system through conventional methods and have been treated in walk-in clinics and emergency rooms without ongoing follow-up. in march , the ontario government provided the ct lamont institute with a grant to open an interdisciplinary teaching medical/dental clinic for street youth in a drop-in centre in downtown ottawa. bringing community organizations together to provide primary medical care and dental hygiene to the street youth of ottawa aged - , it is staffed by a family physician, family medicine residents, a nurse practitioner, public health nurses, a dental hygienist, dental hygiene students and a chiropodist, who link to social services already provided at the centre, including housing, life-skills programs and counselling. project objectives: (1) to improve the health of high-risk youth by providing accessible, coordinated, comprehensive health and dental care to vulnerable adolescents; (2) to model and teach interdisciplinary adolescent care to undergraduate medical students, family medicine residents and dental hygiene students. methods: a non-randomized, mixed-method design involving a process and impact evaluation. data collection, qualitative: (a) semi-structured interviews, (b) focus groups with youth; quantitative: (a) electronic medical records for months, (b) records (budget, photos, project information). results: in progress; results from the first months will be available in august . early results suggest that locating the clinic in a safe and familiar environment is a key factor in attracting the over youths the clinic has seen to date. other findings include the prevalence of preventive interventions, including vaccinations, std testing and prenatal care. the poster presentation will present these and other impacts that the clinic has had on the health of the youth in the first year of the study. conclusions: (1) the clinic has improved the health of ottawa street youth and will continue beyond the initial pilot project phase; (2) this project demonstrates that, with strong community partnerships, it is possible to make healthcare more accessible for urban youth.

right to health care campaign, by s.j. chander, community health cell, bangalore, india.
introduction: the people's health movement in india launched a campaign known as 'right to health care' during the silver jubilee year of the alma ata declaration of 'health for all' by ad, in collaboration with the national human rights commission (nhrc). the aim of the campaign was to establish the 'right to health care' as a basic human right and to address structural deficiencies in the public health care system and the unregulated private sector. methods: as part of the campaign, a public hearing was organized in a slum in bangalore. the former chairman of the nhrc chaired the hearing panel, which consisted of a senior health official and other eminent people in the city. detailed documentation of individual case studies on 'denial of access to health care' in different parts of the city was carried out using a specific format. the focus was on cases where denial of health services had led to loss of life, physical damage or severe financial losses to the patient. results: fourteen people, all except one of whom had accessed public health care services in government health centres (the exception had accessed a private clinic), presented their testimonies. all of the people, except one person who spontaneously shared her testimony, were identified by organizations working with the slum dwellers. corruption and ill-treatment were the main issues of concern to the people. five of the fourteen testimonies presented involved death due to negligence. the public health centres not only demand money for supposedly free services but also ill-treat patients with verbal abuse. five of these fourteen case studies were presented before the national human rights commission. the nhrc has asked the government health officials to look into the cases that were presented and to rectify the anomalies in the system. as a result of the public hearing held in the slum, the nhrc identified urban health as one of the key areas of focus during the national public hearing. conclusion: a campaign is necessary to check the corrupt public health care system and a covetous private health care system. it helps people to understand the structure and functioning of the public health care system and to assert their right to access health care. the public hearings, or people's tribunals, held during the campaign are an instrument for making the public health system accountable.

ps- (a) violence among women who inject drugs. nadia fairbairn, jo-anne stoltz, evan wood, kathy li, julio montaner, and thomas kerr. background/objectives: violence is a major cause of morbidity and mortality among women living in urban settings. though it is widely recognized that violence is endemic to inner-city illicit drug markets, little is known about violence experienced by women injection drug users (idu). therefore, the present analyses were conducted to evaluate the prevalence of, and characteristics associated with, experiencing violence among a cohort of female idu in vancouver. methods: we evaluated factors associated with violence among female participants enrolled in the vancouver injection drug user study (vidus) using univariate analyses. we also examined self-reported relationships with the perpetrator of the attack and the nature of the violent attack. results: of the active idu followed between december and may , ( . %) had experienced violence during the last six months. variables positively associated with experiencing violence included: homelessness (or = . , % ci: . - . , p < . ), public injecting (or = . , % ci: . - . , p < . ), frequent crack use (or = . , % ci: . - . , p < . ), recent incarceration (or = . , % ci: . - . , p < . ), receiving help injecting (or = . , % ci: . - . , p < . ), shooting gallery attendance (or = . , % ci: . - . , p < . ), sex trade work (or = . , % ci: . - . , p < . ), frequent heroin injection (or = . , % ci: . - . , p < . ), and residence in the downtown eastside (or = . , % ci: . - . , p < . ). variables negatively associated with experiencing violence included: being married or common-law (or = . , % ci: . - . , p < . ) and being in methadone treatment (or = . , % ci: . - . , p < . ). the most common perpetrators of the attack were acquaintances ( . %), strangers ( . %), police ( . %), or dealers ( . %). attacks were most frequently in the form of beatings ( . %), robberies ( . %), and assault with a weapon ( . %). conclusion: violence was a common experience among women idu in this cohort. being the victim of violence was associated with various factors, including homelessness and public injecting. these findings indicate the need for targeted prevention and support services, such as supportive housing programs and safer injection facilities, for women idu.
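the unadjusted odds ratios and % confidence intervals reported above are typically derived from 2x2 tables of exposure by outcome. the following sketch shows the standard log (woolf) method with made-up counts; it is an illustration of the calculation, not the study's data or code.

# sketch: unadjusted odds ratio and 95% ci (log / woolf method) from a 2x2 table.
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """a: exposed with outcome, b: exposed without, c: unexposed with outcome, d: unexposed without."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(or)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# hypothetical counts, e.g. homelessness vs. experiencing violence in the last six months
print(odds_ratio_ci(a=40, b=60, c=30, d=120))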
introduction: although research on determinants of tobacco use among arab youth has been carried out at several ecological levels, little such research has included conceptual models or compared the two types of tobacco most commonly used among lebanese youth, namely cigarettes and argileh. this study uses the ecological model to investigate differences between the genders in the determinants of both cigarette and argileh use among youth. methodology: quantitative data were collected from youth in economically disadvantaged urban communities in beirut, the capital of lebanon. results: the results indicated that there are differences by gender at a variety of ecological levels of influence on smoking behavior. for cigarettes, gender differences were found in knowledge, peer, family, and community influences. for argileh, gender differences were found at the peer, family, and community levels. the differential prevalence of cigarette and argileh smoking between boys and girls is therefore understandable and partially explained by the variation in the interpersonal and community environment which surrounds them. interventions therefore need to be tailored to the specific needs of boys and girls.

introduction: the objective of this study was to assess the relationship between parents' employment status and children's health among professional immigrant families in vancouver. our target communities included immigrants from five ethnic groups (south korean, indian, chinese, russian, and iranian) with professional degrees (i.e., mds, lawyers, engineers, managers, and university professors) who had no job relevant to their professions and who had been living in the studied area for at least months. methodology: the participants were recruited through collaboration with three local community agencies and were interviewed individually during the fall of . results: in total, complete interviews were analyzed: from south-east asia, from south asia, and from russia and other parts of eastern europe. overall, . % were employed, . % were underemployed, and % indicated they were unemployed. overall, . % were not satisfied with their current job.
russians and other eastern europeans were most likely to be satisfied with their current job, while south-east asians were most satisfied with their life in canada. about % indicated that their spouses were not satisfied with their life in canada, while % believed that their children were very satisfied with their life in canada. in addition, around % said they were not satisfied with their family relationships in canada. while most of the respondents ranked their own and their spouses' health status as either poor or very poor, just % indicated that their first child's health was very poor; in most cases they ranked their children's health as excellent or very good. the results of this pilot study show that there is a need to create culturally specific child health and behavioral scales when conducting research in immigrant communities. for instance, in many asian cultures it is customary for a parent either to praise their children profusely or to condemn them. this cultural practice, called "saving face," can affect research results, as it might have affected the present study. necessary steps are therefore needed to revise the current standard health and behavioral scales for further studies by developing a new scale that is more relevant and culturally sensitive to the targeted immigrant families.

methods: database: national health survey (ministry of health, www.msc.es). two thousand interviews were performed among the madrid population ( . % of the whole), of which corresponded to older adults ( . % of the . million aged years and over). the study sample constitutes . % ( out of ) of those older adults who live in urban areas. the demographic structure (by age and gender) of this population in relation to health services use (medical consultations, dentist visits, emergency services, hospitalisation) was studied using the general linear model univariate procedure. results: age was not associated with medical consultations or dentist visits (p > . ), while age was associated with emergency services use ( % of the population: %, % and % of each age group) and hospitalisation ( % of the population: %, % and % of each age group). with respect to gender, no association (p > . ) was found for dentist visits ( % vs %), medical consultations ( % vs %), or emergency services use ( % vs %), while an association (p = . ) was found for hospitalisation ( % vs %). an age and gender interaction effect on health services use was not found (p > . ), but a trend towards an interaction for hospitalisation (p = . ) could be considered. conclusions: the demographic structure of urban older adults is associated with two of the four health services uses studied. a relationship was found between age and hospital services use (emergency units and hospitalisation), but not out-of-hospital services (medical and dentist consultations). in addition to age, gender also contributes to explaining hospitalisation.
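the "general linear model univariate procedure" mentioned above corresponds to a factorial analysis of variance with an age-by-gender interaction. a minimal sketch, assuming hypothetical column names (visits, age_group, gender) rather than the survey's actual variables, is given below.

# sketch: factorial glm (two-way anova) of a health-service-use measure by age group and gender,
# including their interaction.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("older_adults_services.csv")    # hypothetical file: one row per respondent

model = smf.ols("visits ~ C(age_group) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))           # main effects of age and gender, plus their interaction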
we examined the prevalence of unwanted sexual experiences in relation to ethnic origin and other sociodemographic variables, as well as the relation between unwanted sexual experiences, depression and aggression. we did so for boys and girls separately. methods: data on unwanted sexual experiences, depressive symptoms (ces-d), aggression (bdhi-d) and sociodemographic factors were collected by self-report questionnaires administered to students in the nd grade (aged - ) of secondary schools in amsterdam, the netherlands. data on the nature of the unwanted sexual experiences were collected during personal interviews by trained school nurses. results: overall prevalences of unwanted sexual experiences for boys and girls were . % and . %, respectively. unwanted sexual experiences were more often reported by turkish ( . %), moroccan ( . %) and surinamese/antillean boys ( . %) than by dutch boys ( . %). moroccan and turkish girls, however, reported fewer unwanted sexual experiences ( . % and . %, respectively) than dutch girls did ( . %). depressive symptoms (or = . , ci = . - . ), covert aggression (or = . , ci = . - . ) and overt aggression (or = . , ci = . - . ) were more common in girls with an unwanted sexual experience. boys with an unwanted sexual experience reported more depressive symptoms (or = . , ci = . - . ) and overt aggression (or = . , ci = . - . ). of the reported unwanted sexual experiences, . % and . % respectively were confirmed by male and female adolescents during a personal interview. conclusion: we can conclude that the prevalence of unwanted sexual experiences among turkish and moroccan boys is disturbing. it is possible that unwanted sexual experiences are more often reported by boys who belong to a religion or culture where the virginity of girls is a matter of family honour and talking about sexuality is taboo. more boys than girls did not confirm their initial disclosure of an unwanted sexual experience. the low rates of disclosure among boys suggest a need to educate health care providers and others who work with migrant boys in the recognition and reporting of unwanted sexual experiences.

vitamin a supplementation is at . %, still far from the targeted %. feeding practices, particularly for newborns, demand a lot of education; exclusive breastfeeding for the desired period of months was observed in only . % of children, though colostrum was given to % of newly born children. the proportion of children below - waz (malnourished) is as high as . % compared with the data. mother's health: of all women in the reproductive age group, % were married, and among married women only % were using some contraceptive method; % were married before the age of years and % had their first pregnancy before the age of years. the services are either not satisfactory or are adequate but not utilised optimally. of those mothers who had delivered in the last one year, % had availed an antenatal examination at least once, % had more than four examinations, and the majority had received their tetanus toxoid injections. untrained persons conducted . % of deliveries, and % were home deliveries. conclusions: the services, unutilised or utilised, are less than desirable. the services provided are inadequate and in decline, representing a loosening grip on what had been good coverage of services. changing background priorities cannot be ruled out as one of the contributory factors.

ps- (a) depression and anxiety in migrants in amsterdam. matty de wit, wilco tuinebreijer, jack dekker, aart-jan beekman, wim gorissen, and arnoud verhoeff. a dutch community-based study showed -month prevalences of . % for anxiety disorders and . % for depression in amsterdam, significantly higher than elsewhere. these differences in prevalences are probably related to the large migrant population of amsterdam.

addressing broader determinants of health depends upon the particular health paradigm adhered to within each jurisdiction, and whether a paradigm is adopted depends upon the ideological and political context of each nation. nations such as sweden, which have a long tradition of public policies promoting social justice and equity, are naturally receptive to evolving population health concepts. the usa represents a policy environment where such issues are clearly subordinate.
our findings indicate that there is a strong political component that influences public health approaches and practices within the jurisdictions examined. the implications are that those seeking to raise the broader determinants of the public's health should work in coalition to raise these issues with non-health organizations and agencies.

background: in developed countries, social inequalities in health have endured or even worsened comparatively across different social groups since the s. in france, a country where access to medical and surgical care is theoretically affordable for everyone, health inequalities are among the highest in western europe. in developing countries, health and access to care have remained critical issues; in madagascar, poverty has even increased in recent years, as the country went through political crisis and structural adjustment policies. objectives: we aimed to estimate and compare the impact of socio-economic status, but also of psychosocial characteristics (social integration, health beliefs, expectations and representations, and psychological characteristics), on the risk of having forgone healthcare in these different contexts. methods: population surveys were conducted among random samples of households in some under-served paris neighbourhoods (n= ) and in the whole of antananarivo city (n= ) in , using a common individual questionnaire in french and malagasy. results: as expected, the impact of socioeconomic status is stronger in antananarivo than in paris. but, after adjusting for numerous individual socio-economic and health characteristics, we observed in both cities a higher (and statistically significant) occurrence of reported forgone healthcare among people who had experienced childhood and/or adulthood difficulties (with relative risks up to and . respectively in paris and antananarivo) and who complained about unhealthy living conditions. in paris, it was also correlated with a lack of trust in health services. conclusions: aside from purely financial hurdles, other individual factors play a role in the non-use of healthcare services. health insurance or free healthcare seems to be necessary but not sufficient to achieve equitable access to care. therefore, health policies must not only focus on the reduction of financial barriers to healthcare, but must also be supplemented by programmes (e.g. outreach care services, health education, health promotion programmes) and discretionary local policies tailored to the needs of those with poor health concern. acknowledgments: this project was supported by the madio project and the national institute of statistics (instat) in madagascar, and by the development research institute (ird) and the avenir programme of the national institute of health and medical research (inserm) in france.

for the cities of developing countries, poverty is often described in terms of the living standards of slum populations, and there is good reason to believe that the health risks facing these populations are even greater, in some instances, than those facing rural villagers. yet much remains to be learned about the connections between urban poverty and health. it is not known what percentage of all urban poor live in slums, that is, in communities of concentrated poverty; neither is it known what proportion of slum residents are, in fact, poor.
furthermore, no quantitative accounting is yet available that would separate the health risks of slum life into those due to a household's own poverty and those stemming from poverty in the surrounding neighborhood. if urban health interventions are to be effectively targeted in developing countries, substantial progress must be made in addressing these central issues. this paper examines poverty and children's health and survival using two large surveys, one a demographic and health survey fielded in urban egypt (with an oversampling of slums) and the other a survey of the slums of allahabad, india. using multivariate statistical methods, we find, in both settings: ( ) substantial evidence of living standards heterogeneity within the slums; ( ) strong evidence indicating that household-level poverty is an important influence on health; and ( ) statistically significant (though less strong) evidence that, with household living standards held constant, neighborhood levels of poverty adversely affect health. the paper closes with a discussion of the implications of these findings for the targeting of health and poverty program interventions.

p - (a) urban environment and the changing epidemiological surface: the cardiovascular case from ilorin, nigeria. the emergence of cardiovascular diseases has been explained through the concomitants of the demographic transition, wherein the prevalent causes of morbidity and mortality change from predominantly infectious diseases to diseases of lifestyle or chronic diseases (see deck, ). a major frustration in the case of cvd is its multifactorial nature. it is acknowledged that the environment, however defined, is the medium of interaction between agents and hosts, such that chronic disease pathogenesis also requires a spatio-temporal coincidence of these two parties. what is not clear is which among several potential factors in the urban space exacerbate cvd risk more, and to what extent the epidemiological transition hypothesis is relevant in explaining the urban disease outlook, even in developing cities like nigeria's. this paper explores these questions within a traditional city in nigeria. the data for the study were obtained from two tertiary level hospitals in the metropolis for years ( ). the data contain reported cases of cvd in the two facilities for the period. adopting a series of parametric and non-parametric statistics, we draw inferences between the observed cases of cvds and various demographic and locational variables of the patients. findings: about % of the cases occurred in years ( ) coinciding with the last years of military rule, a period of great instability. . % occurred among males, and . % occurred among people aged - years; these are groups who are also likely to engage in the most stressful life patterns. the study also shows that % of all cases occurred in the frontier wards, with minor city areas also having their 'fair' share. our results conform with many empirical observations on the elusive nature of causation of cvd. this multifactorial nature has precluded the production of a map of hypertension that would be consistent with ideas of spatial prediction. cvd: cardiovascular diseases.

mumbai is the commercial capital of india. as the hub of a rapidly transiting economy, mumbai provides an interesting case study into the health of urban populations in a developing country.
with high-rise multimillion-dollar construction projects and crowded slums next to each other, mumbai presents a contrast in development. there are a host of hi-tech hospitals which provide high quality care to the many who can afford it (including many westerners eager to jump the queue in their own healthcare systems, i.e. 'medical tourism'); at the same time there is an overcrowded and strained public healthcare system for those who cannot afford to pay. voluntary organizations are engaged in service provision as well as advocacy. the paper will outline the role of the voluntary sector in the context of the development of the healthcare system in mumbai. mumbai has distinct upper, middle and lower economic classes, and the health needs and problems of all three have similarities and differences. these will be showcased, and the response of the healthcare system to these will be documented. a rising hiv prevalence rate, among the highest in india, is a challenge to the mumbai public healthcare system. the role of the voluntary sector in service provision, advocacy, and empowerment of local populations with regard to urban health has been paramount. the emergence of the voluntary sector as a major player in the puzzle of urban mumbai health, and its being visualized as the voice of civil society or community representatives, has advantages as well as pitfalls. this paper will be a unique attempt at examining urban health in india as a complex web of players. the influence of the everyday socio-political-cultural and economic reality of the urban mumbai population will be a cross-cutting theme in the analysis. the paper will thus help in filling a critical void in this context. it will map out issues of social justice, gender, equity, and the effect of environment, through the lens of the role of the voluntary sector, to construct a quilt of the reality of healthcare in mumbai. the successes and failures of a long tradition of active advocacy and participation by the voluntary sector in trying to achieve social justice in the urban mumbai community will be analyzed. this will help in a better understanding of global urban health, and of how the voluntary sector/ngos fit into the larger picture.

background: over half of nairobi's . million inhabitants live in illegal informal settlements that compose % of the city's residential land area. the majority of slum residents lack access to proper sanitation and a clean and adequate water supply. this research was designed to gain a clearer understanding of what appropriate sanitation means for the urban poor, to determine the linkages between gender, livelihoods, and access to water and sanitation, and to assess the ability of community sanitation blocks to meet water and sanitation needs in urban areas. methods: a household survey, gender-specific focus groups and key informant interviews were conducted in maili saba, a peri-urban informal settlement. qualitative and quantitative research tools were used to assess the impact and effectiveness of community sanitation blocks in two informal settlements. results: appropriate sanitation includes not only safe and clean latrines, but also provision of adequate drainage and access to water supply for cleaning of clothes and homes. safety and cleanliness were priorities for women in latrines. levels of poverty within the informal settlements were identified, and access to water and sanitation services improved with increased income.
environmental health problems related to inadequate water and sanitation remain a problem for all residents. community sanitation blocks have improved the overall local environment and usage is far greater than envisioned in the design phase. women and children use the blocks less than men. this is a result of financial, social, and safety constraints. the results highlight the importance a need to expand participatory approaches for the design of water and sanitation interventions for the urban poor. plans need to recognize "appropriate sanitation" goes beyond provision of latrines and gender and socioeconomic differences must be taken into account. lessons and resources from pilot projects must be learned from, shared and leveraged so that solutions can be scaled up. underlying all the challenges facing improving water and sanitation for the urban poor are issues of land tenure. p - (c) integrating tqm (total quality management), good governance and social mobilization principles in health promotion leadership training programmes for new urban settings in countries/ areas: the prolead experience susan mercado, faren abdelaziz, and dorjursen bayarsaikhan introduction: globalization and urbanization have resulted in "new urban settings" characterized by a radical process of change with positive and negative effects, increased inequities, greater environmental impacts, expanding metropolitan areas and fast-growing slums and vulnerable populations. the key role of municipal health governance in mitigating and modulating these processes cannot be overemphasized. new and more effective ways of working with a wide variety of stakeholders is an underpinning theme for good governance in new urban settings. in relation to this, organizing and sustaining infrastructure and financing to promote health in cities through better governance is of paramount importance. there is a wealth of information on how health promotion can be enhanced in cities. despite this, appropriate capacity building programmes to enable municipal players to effectively respond to the challenges and impacts on health of globalization, urbanization and increasing inequity in new urban settings are deficient. the who kobe centre, (funded by the kobe group( and in collaboration with regional offices (emro, searo, wpro) with initial support from the japan voluntary contribution, developed a health promotion leadership training programme called "prolead" that focuses on new and autonomous structures and sustainable financing for health promotion in the context of new urban settings. methodology: country and/or city-level teams from areas, (china, fiji, india, japan, lebanon, malaysia, mongolia, oman, philippines, republic of korea, tonga and viet nam) worked on projects to advance health promotion infrastructure and financing in their areas over a month period. tools were provided to integrate principles of total quality management, good governance and social mobili .ation. results: six countries/areas have commenced projects on earmarking of tobacco and alcohol taxes for health, moblization of sports and arts organizations, integration of health promotion and social health insurance, organizational reforms, training in advocacy and lobbying, private sector and corporate mobilization and community mobilization. results from the other six areas will be reported in ..;obcr. 
conclusions: total quality management, good governance and social mobilization principles and skills are useful and relevant for helping municipal teams focus on strategic interventions to address complex and overwhelming determinants of health at the municipal level. the prolead training programme hopes to inform other processes for building health promotion leadership capacity for new urban settings.

the impact of city living and urbanization on the health of citizens in developing countries has received increasing attention in recent years. urban areas contribute largely to national economies. however, rapid and unplanned urban growth is often associated with poverty, environmental degradation and population demands that outstrip service capacity, conditions which place human health at risk. local and national governments as well as multinational organizations are all grappling with the challenges of urbanization. with limited data and information available, urban health characteristics, including their types, quantities, locations and sources in kampala, are largely unknown. moreover, there is no basis for assessing the impact of the resultant initiatives to improve health conditions among communities settled in unplanned areas. since urban areas are more than the aggregation of people with individual risk factors and health care needs, this paper argues that factors beyond the individual, including the social and physical environment and systems of health and social services, are determinants of the health of urban populations. as part of an ongoing study, this paper addresses the basic concerns of urban health in kampala city. applying the "urban living conditions" and "urban health penalty" frameworks, this paper uses aggregated urban health data to explore the role of place and institutions in shaping the health and well-being of the population in kampala, by understanding how characteristics of the urban environment and specific features of the city are causally related to the health of the invisible and forgotten urban poor population. results indicate that the range of urban health hazards in the city of kampala includes substandard housing, crowding, indoor air pollution, insufficient and contaminated water, inadequate sanitation and solid waste management services, vector-borne diseases, industrial waste, and increased motor vehicle traffic, among others. the impacts of these on the environment and community health are mutually reinforcing. arising out of the withdrawal of city planning systems and service delivery systems, or simply planning failure, thousands of people, particularly low-income groups, have been pushed to the most undesirable sections of the city where they are faced with a variety of environmental insults. the number of initiatives to improve urban health is, however, growing, involving the interaction of many sectors (health, environment, housing, energy, transportation and urban planning) and stakeholders (local government, non-governmental organizations, aid donors and local community groups). key words: urban health governance, health risks, kampala.

introduction: the viability of urban communities is dependent upon reliable and affordable mass transit. in particular, subway systems play an especially important role in the mass transit network, since they provide service to vast numbers of riders; seven of the subway systems worldwide report over one billion passenger rides each year.
surprisingly, given the large number of people potentially affected, very little is known about the health and safety hazards that could affect both passengers and transit workers; these include physical (e.g., noise, vibration, accidents, electrified sources, temperature extremes), biological (e.g., transmission of infectious diseases, either through person-to-person spread or vector-borne, for example, through rodents), chemical (e.g., exposure to toxic and irritant chemicals and metals, gas emissions, fumes), electro-magnetic radiation, and psychosocial (e.g., violence, work stress) hazards. more recently, we need to consider the threat of terrorism, which could take the form of a mass casualty event (e.g., resulting from conventional incendiary devices), radiological attack (e.g., a "dirty bomb"), chemical terrorist attack (e.g., sarin gas), or bioterrorist attack (e.g., weapons-grade anthrax). given the large number of riders and workers potentially at risk, the public health implications are considerable. methods: to assess the hazards associated with subways, a structured review of the (english) literature was conducted. results: based on our review, non-violent crime, followed by accidents, and violent crimes are most prevalent. compared to all other forms of mass transit, subways present greater health and safety risks. however, the rate of subway-associated fatalities is much lower than the fatality rate associated with automobile travel ( . vs. . per million passenger miles), and cities with high subway ridership rates have a % lower per capita rate of transportation-related fatalities than low ridership cities ( . versus . annual deaths per , residents); a brief sketch of this kind of rate normalization appears after this abstract block. available data also suggest that subway noise levels and levels of air pollutants may exceed recommended levels. conclusions: there is a paucity of published research examining the health and safety hazards associated with subways. most of the available data came from government agencies, who rely on passively reported data. research is warranted on this topic for a number of reasons, not only to address important knowledge gaps, but also because the population at potential risk is large. importantly, from an urban perspective, the benefits of mass transit are optimized by high ridership rates, and these could be adversely affected by unsafe conditions and health and safety concerns.

veena joshi, jeremy lim, and benjamin chua. background: urban health issues have moved beyond infectious diseases and now centre largely on chronic diseases. diabetes is one of the most prevalent non-communicable diseases globally. % of adult [...] benefit in providing splash pads in more parks. given the high temperature and humidity of london summers, this is an important aspect and asset of parks. interviewed parents claimed to visit city parks anywhere between and days per week. conclusion: given that the vast majority of canadian children are insufficiently active to gain health benefits, identifying effective qualities of local parks that may support and foster physical activity is essential. strategies to promote activity within children's environments are an important health initiative. the results from this study have implications for city planners and policy makers; parents' opinions of, and use of, city parks provide feedback as to the state of current local parks, and modifications that should be made for new ones being developed. this study may also provide important feedback for health promoters trying to advocate for physical activity among children.
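the subway review above compares fatality burdens across transport modes by normalizing counts to travel exposure (per million passenger miles) and to population (per , residents). the following is a minimal illustrative sketch of that normalization only; the function names and every input number are hypothetical placeholders, not figures from the review.

```python
# Hedged sketch: normalizing transport fatality counts two ways, as in the
# subway review above: per million passenger miles and per 100,000 residents.
# All inputs below are hypothetical placeholders, not the study's data.

def rate_per_million_passenger_miles(fatalities: float, passenger_miles: float) -> float:
    """Fatalities per one million passenger miles of travel."""
    return fatalities / (passenger_miles / 1_000_000)

def rate_per_100k_residents(fatalities: float, population: float) -> float:
    """Annual fatalities per 100,000 residents."""
    return fatalities / (population / 100_000)

# Hypothetical example: a city with 2.5 billion subway passenger miles,
# 12 subway-related deaths, 310 road deaths, and 8.4 million residents.
subway_rate = rate_per_million_passenger_miles(12, 2_500_000_000)
road_per_capita = rate_per_100k_residents(310, 8_400_000)

print(f"subway deaths per million passenger miles: {subway_rate:.4f}")
print(f"road deaths per 100,000 residents: {road_per_capita:.2f}")
```

normalizing to exposure rather than raw counts is what allows the comparison between a low-exposure mode (subway travel) and a high-exposure mode (automobile travel) to be meaningful.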
introduction: a rapidly increasing proportion of urban dwellers in africa live below the poverty line in overcrowded slums characterized by uncollected garbage, unsafe water, and deficient sanitation and overflowing sewers. this growth of urban poverty challenges the commonly held assumption that urban populations enjoy better health than their rural counterparts. the objectives of this study are (i) to compare the vaccination status, and morbidity and mortality outcomes, among children in the slums of nairobi with rural kenya, and (ii) to examine the factors associated with poor child health in the slums. we use data from a demographic and health survey representative of all slum settlements in nairobi city, carried out in by the african population & health research center. a total of , women aged - from , households were interviewed. our sample consists of , children aged - months. the comparison data are from the kenya demographic and health survey. the outcomes of interest include child vaccination status, morbidity (diarrhea, fever and cough) and mortality, all dichotomized. socioeconomic, environmental, demographic, and behavioral factors, as well as child and mother characteristics, are included in the multivariate analyses. multilevel logistic regression models are used (a simplified illustrative sketch of this kind of model appears below). preliminary results: about % of children in the slums had diarrhea in the two weeks prior to the survey, compared to % of rural children. these disparities between the urban poor and the rural residents are also observed for fever ( % against %), cough ( % versus %), infant mortality ( / against / ), and complete vaccination ( % against %). preliminary multivariate results indicate that health service utilization and maternal education have the strongest predictive power on child morbidity and mortality in the slums, and that household wealth has only minor, statistically insignificant effects. conclusion: the superiority of the health of urban children, compared with their rural counterparts, masks significant disparities within urban areas. compared to rural residents, children of slum dwellers in nairobi are sicker, are less likely to utilize health services when sick, and stand a greater risk of dying. our results suggest that policies and programs contributing to the attainment of the millennium development goal on child health should pay particular attention to the urban poor. the insignificance of socioeconomic status suggests that poor health outcomes in these communities are compounded by poor environmental sanitation and behavioral factors that could partly be improved through female education and behavior change communication.

introduction: the historic trade city of surat, with its industrial and political peace, has remained a center of attraction for people from all corners of india, resulting in population explosion and stressed social and service infrastructure. the topography, climate and demographic profile of the city threaten its healthy environment. the aim of this analysis is to review the impact of management reform on health indicators. method: this paper is an analysis of the changing profile of population, sanitary infrastructure, local self-government management and public health service reform, secondary health statistics data, health indicators and process monitoring over years. [...] the health of the entire city and a challenge to the management system. the plague outbreak ( ) was the turning point in the history of civic service management, including public health service management.
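the nairobi slum child-health abstract above describes multilevel logistic regression on dichotomized outcomes with household- and cluster-level covariates. as a hedged, simplified stand-in (not the authors' actual model, variables, or data), the sketch below fits a population-averaged logistic model with exchangeable within-cluster correlation using gee; all variable names and values are hypothetical.

```python
# Hedged sketch: a cluster-aware logistic model in the spirit of the
# multilevel analysis described above. GEE with an exchangeable correlation
# structure is used as a simplified stand-in for a full multilevel model.
# Variable names (diarrhea, wealth_index, mother_edu, cluster_id) are
# hypothetical, not the survey's actual codebook.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical child-level records: a binary outcome, household/mother
# covariates, and a sampling-cluster identifier.
df = pd.DataFrame({
    "diarrhea":     [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "wealth_index": [1, 3, 4, 1, 5, 2, 3, 4, 1, 2, 1, 5],
    "mother_edu":   [0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1],
    "cluster_id":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
})

# Population-averaged logit accounting for within-cluster correlation.
model = smf.gee(
    "diarrhea ~ wealth_index + mother_edu",
    groups="cluster_id",
    data=df,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())
```

where cluster-specific (rather than population-averaged) effects are of interest, a random-intercept multilevel logit could be substituted for the gee approach sketched here.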
the local self-government management system was revitalized by regular field visits of all cadres, decentralization of power and responsibility, equity, regular vigilant monitoring, communication facilities, a team approach and people's participation. reform in public health service management was achieved through standardized intervention protocols, innovative interventions, public-private partnership, community participation, academic and service institute collaboration, and research. sanitation service coverage has reached nearly universal levels. the area covered by safe water supply reached % ( ) from % ( ) and underground drainage % ( ) from % ( ). the overhauling of the system is reflected in health indicators of vector- and water-borne disease. the malaria spr declined to . ( ) from . % ( ) and diarrhea case reports declined to ( ) from ( ). except for dengue fever in , no major disease outbreaks have been reported after . the city is a recipient of international/national awards and rankings for these achievements. the health department has developed an evidence- and experience-based intervention and monitoring system and protocol for routine as well as disaster situations. the health service and management structure of surat city has emerged as an urban health model for the country.

introduction: the center for healthy communities (chc) in the department of family and community medicine at the medical college of wisconsin developed a pilot project to: ) assess the knowledge, attitudes, and behaviors of female milwaukee public housing residents related to breast cancer; ) develop culturally and literacy appropriate education and screening modules; ) implement the developed modules; ) evaluate the modules; and ) provide follow-up services. using a community-based participatory research model, the chc worked collaboratively with on-site nurse case management to meet these objectives. methods: a "breast health kick-off event" was held at four separate milwaukee public housing sites for elderly and disabled adults. female residents were invited to complete a -item breast health survey, designed to accommodate various literacy levels. responses were anonymous and voluntary. the survey asked women about their previous physical exams for breast health, and then presented a series of statements about breast cancer to determine any existing myths. the final part gathered information about personal risk for breast cancer, the highest level of education completed, and whether the respondents had ever used hormone replacement therapy and/or consumed alcohol. responses were collected for descriptive analysis. results: a total of surveys (representing % of the total female population in the four sites) were completed and analyzed. % reported that they had a physical exam in the previous two years. % of respondents indicated they had never been diagnosed with breast cancer. % reported having had a mammogram and % having had a clinical breast exam. those that never had a mammogram reported a fear of what the provider would discover, or that there were no current breast problems to warrant an exam. % agreed that finding breast cancer early could lower the chance of dying of cancer. over % reported that mammograms were helpful in finding cancer. however, % believed that having a mammogram actually prevents breast cancer. % indicated that mammograms actually cause cancer and % reported that a woman should get a mammogram only if there is breast cancer in her family.
conclusion: this survey indicates that current information about the importance of mammograms and clinical breast exams is reaching traditionally underserved women. yet there are still critical oppor· tunities to provide valuable education on breast health. this pilot study can serve as a tool for shaping future studies of health education messages for underserved populations. located in a yourh serv· ~ng agency m downtow~ ottawa, the clinic brings together community partners to provide primary medical care. and dent~i hygiene t? the street youths of ottawa aged - . the primary goal of the project is to provide accessible, coordinated, comprehensive health and dental care to vulnerable adolescents. these efforts respond to the pre-existing body of evidence suggesting that the principle barrier in accessing such care for these youths are feelings of intimidation and vulnerability in the face of a complex healthcare system. the bruyere fhn satellite clinic is located in the basement of a downtown drop-in and brings together a family medicine physician and her residents, a dental hygienist and her nd year students, a nurse practitioner, a chiropodist and public health nurses to provide primary care. the clinic has been extremely busy and well received by the youth. this workshop will demonstrate how five community organizations have come together to meet the needs of high risk youths in ottawa. this presentation will showcase the development of the clinic from its inception through its first year including reaction of the youths, partnerships and lessons learned. it will also focus on its sustainability without continued funding. we hope to have developed a model of service delivery that could be reproduced and sustained in other large cities with faculties of medicine. methods: non-randomized, mixed method design involving a process and impact evaluation. data collection-qualitative-a) semi structured interviews with providers & partners b)focus groups with youth quantitative a)electronic medical records for months records (budget, photos, project information). results: ) successfully built and opened a medicaudental clinic which will celebrate its year anniversary in august. ) over youths have been seen, and we have had over visits. conclusion: ) the clinic will continue to operate beyond the month project funding. ) the health of high risk youth in ottawa will continue to improve due to increased access to medical services. p - (a) health services -for the citizens of bangalore -past, present and future savita sathyagala, girish rao, thandavamurthy shetty, and subhash chandra bangalore city, the capital of karnataka with . million is the th most populous city in india; supporting % of the urban population of karnataka, it is considered as one of the fastest growing cities in india. known as the 'silicon valley of india', bangalore is nearly years old. bangalore city corporation (bmp), is a local self government and has the statutory commitment to provide to the citizens of bangalore: good roads, sanitation, street lighting, safe drinking water apart from other social obligations, cultural development and poverty alleviation activities. providing preventive and promotive heahh services is also a specific component. 
the objective of this study was to review the planning process with respect to health care services in the period since india independence; the specific research questions being what has been the strategies adopted by the city planners to address to the growing needs of the population amidst the background of the different strategies adopted by the country as a whole. three broad rime ranges have been considered for analysis: the s, s and the s. the salient results are: major area of focus has been on the maternal and child care with activities ranging from day-care to in-patient-care; though the number of institutions have grown from to the current day , their distribution has been far from satisfactory; obtaining support from the india population projects and major upgradarions have been undertaken in terms of infrastructure; over the years, in addition to the dispensaries of modern system of medicine, local traditional systems have also been initiated; the city has partnered with the healthy cities campaign with mixed success; disease surveillance, addressing the problems related to the emerging non-communicable diseases including mental health and road traffic injuries are still in its infancy. isolated attempts have been made to address the risks groups of elderly care and adolescent care. what stands out remarkably amongst the cities achievements is its ability to elicit participation from ngos, cbos and neighbourhood groups. however, the harnessing of this ability into the health sector cannot be said totally successful. the moot question in all the above observed development are: has the city rationally addressed it planning needs? the progress made so far can be considered as stuttered. the analysis and its presentation would identify the key posirive elements in the growth of banglore city and spell a framework for the new public health. introduction: anaemia associated with pregnancy is a major public health problem all over the world. different studies in different parts of india shown prevalence of anaemia between - %. anaemia remains a serious health problem in pregnancy despite of strong action taken by the government of india through national programmes. in the present study we identified th~ social beha~iors, responsible for low compliance of if a tablets consumption in pregnancy at community level and intervention was given with new modified behaviors on trial bases. . in vadodara urban. anganwadies out of were selected from the list by random sampling for tips (trials of improved practices) study. . . participants: pregnant women ( , intervention group+ , control. group) registered m the above anganwadies. study was conducted in to three phases: phase: . formative research and baseline survey (frbs). data was collected from all pregnant women to identify behaviors that are responsible for low compliance of ifa tablets. both qualitative and quantitative data were collected. haemoglobin was estimated of all pregnant women by haemo-cue. phase: . phase of tips. behaviors were identified both social & clinical for low compliance of ifa tablets consumption in pregnancy from frbs and against those, modified behaviors were proposed to pregnant women in the intervention group on trial bases by health education. trial period of weeks was given for trial of new behaviors to pregnant women in the interven· tion group. phase: . in this phase, feedbacks on behaviors tried or not tried were taken from pregnant women in intervention group. 
haemoglobin estimation was carried out again in all pregnant women. at the end of the study, messages were formulated on the bases of feedbacks from the pregnant women. results: all pregnant women in the intervention group had given positive feedback on new modified behaviors after intervention. mean haemoglobin concentration was higher in intervention group ( . ± . gm%) than control group ( . ± . gm%). ifa tablets compliance was improved in intervention group ( . %) than control group ( . %). conclusion: all pregnant women got benefits after trial of new modified behaviors in the intervention group. messages were formulated from the new modified behaviors, which can be used for longterm strategies for anaemia control in the community. introduction: in order to develop a comprehensive mch handbook for pregnant women and to assess its effect among them, a pilot study was carried out at the maternal and child health training institute (mchti), in dhaka, bangladesh. methods: from mchti a sample of pregnant women was selected and all subjects were women who were attending the first visit of their current pregnancy by using a random sampling method. of the subjects, women were given the mch handbook as case and women were not given the handbook as control. data on pre and post intervention of the handbook from the cases and controls were taken from data recording forms between the st of november and st of october, and data was analysed by using a multilevel analysis approach. this was a hospital-based action (case-control) research, and was applied in order to measure the outcome of pre and post intervention following the introduction of the handbook. data was used to assess the effects of utilisation of the handbook on women's knowledge, practice and utilisation of mch services. results: this study showed that the change of knowledge about antenatal care visits was . % among case mothers. knowledge of danger signs improved . %, breast feeding results . %, vaccination . % and family planning results improved . % among case. results showed some positive changes in women's attitudes among case mothers and study showed the change of practice in antenatal care visits was .u. % in the case. other notable changes were: change of practice in case mother's tetanus toxoid (ti), . %; and family planning . %. in addition, handbook assessment study indicated that most women brought the handbook on subsequent visits ( . %), the handbook was highly utilised (i.e. it was read by . %, filled-in by . %, and was used as a health education tool by . %). most women kept the handbook ( . %) and found it highly useful ( . %) with a high client satisfaction rate of . %. conclusion: pregnant women in the case group had higher knowledge, better practices, and higher utilisation of mch services than mothers in the control groups who used alternative health cards. if the handbook is developed with a focus on utilising a problem-oriented approach and involving the recomendations .of end~users, it is anticipated that the mch handbook will contribute significantly to ensuring the quahry of hfe of women and their children in bangladesh. after several meetmgs to identify the needs of the community, a faso clinic was opened at ncfs. health care professionals from smh joined with developmental and social service workers from ncfs to implement the faso diagnostic process and to provide culturally appropriate after-care. the clinic is unique in that its focus is the high risk urban aboriginal population of toronto. 
it accepts referrals of not only children and youth, but also of adults. lessons learned: response to the faso clinic at native child and family services has been overwhelming. aboriginal children with f asd are receiving timely diagnosis and interventions. aboriginal youth and adults who have been struggling with poveny, substance abuse, and homelessness are more willing to enter the ncfs centre for diagnosis and treatment. aboriginal infants prenatally exposed to alcohol born at st. michael's hospital or referred by other centres have access to the developmental programs located in both of the partnering agencies. the presentation will describe the clinic's development, and will detail the outcomes described, including interventions unique to the aboriginal culture. p - (c) seeds, soil, and stories: an exploration of community gardening in southeast toronto carolin taran, sarah wakefield, jennifer reynolds, and fiona yeudall introduction: community gardens are increasingly seen as a mechanism for improving nutrition and increasing food security in urban neighbourhoods, but the evidence available to support these claims is limited. in order to begin to address this gap in a way that is respectful of community knowledge and needs, the urban gardening research opportunities workgroup (ugrow) project explored the benefits and potential risks of community gardening in southeast toronto. the project used a community-based research (cbr) model to assess community gardens as a means of improving local health. the research process included interviews, focus groups, and participant observation (documented in field notes). we also directly engaged the community in the research process, through co-learning activities and community events which allowed participants to express their views and comment on emerging results. most of the research was conducted by a community-based research associate, herself a community gardener. key results were derived from these various sources through line-by-line coding of interview transcripts and field note review, an interactive and iterative process which involved both academic and community partners. results: these various data sources all suggest that enhanced health and access to fresh produce are important components of the gardening experience. they also highlight the central importance of empowering and community-building aspects of gardening to gardeners. community gardens were thought to play a role in developing friendships and social support, sharing food and other resources, appreciating cultural diversity, learning together, enhancing local place attachment and stewardship, and mobilizing to solve local problems (both inside and outside the garden). potential challenges to community gardens as a mechanism for communiry development include bureaucratic resistance to gardens, insecure land tenure and access, concerns about soil contamination, and a lack of awareness and under· standing by community members and decision-makers of all kinds. conclusion: the results highlight many health and broader social benefits experienced by commu· nity gardeners. they also point to the need for greater support for community gardening programs, par· ticularly ongoing the ongoing provision of resources and education programs to support gardens in their many roles. this research project is supported by the wellesley central health corporation and the centre for urban health initiatives, a cihr funded centre for research development hased at the univer· sity of toronto. 
p - (c) developing resiliency in children living in disadvantaged neighbourhoods. sarah farrell, lorna weigand, and wayne hammond. the traditional idea of targeting risk reduction by focusing on the development of effective coping strategies and educational programs has merit in light of the research reporting that multiple forms of problem behaviour consistently appear to be predicted by increasing exposure to identifiable risk factors. as a result, many of the disadvantaged child and youth studies have focused on trying to better understand the multiple risk factors that increase the likelihood of the development of at-risk behaviour in children/youth and the potential implications for prevention. this in turn has led to the conclusion that community and health programs need to focus on risk reduction by helping individuals develop more effective coping strategies and a better understanding of the limitations of certain pathologies, problematic coping behaviours and risk factors potentially inherent in high-needs communities. however, another area of research has proposed that preventative interventions should consider protective factors along with reducing risk factors. as opposed to just emphasizing problems, vulnerabilities, and deficits, a resiliency-based perspective holds the belief that children, youth and their families have strengths, resources and the ability to cope with significant adversity in ways that are not only effective, but tend to result in an increased ability to constructively respond to future adversity. with this in mind, a participatory research project sponsored by the united way of greater toronto was initiated to evaluate and determine the resiliency profiles of children aged - years (n= ) of recent immigrant families living in significantly disadvantaged communities in the toronto area. the presentation will provide an overview of the identified protective factors (both intrinsic and extrinsic) and resiliency profiles in an aggregated format as well as a summary of how the children and their parents interpreted and explained these strength-based results. as part of the focus groups, current community programs and services were examined by the participants as to what might be best practices for supporting the development and maintenance of resiliency in children, families and communities. it was proposed that the community model of assessing resiliency and protective factors, as well as the proposed strength-based best practices, could serve as a guide for all in the community sector who provide services and programs to those in disadvantaged neighbourhoods.

p - (c) naloxone by prescription in san francisco, ca and new york, ny. emalie huriaux. the harm reduction coalition's overdose project works to reduce the number of fatal overdoses to zero. located in new york, ny and san francisco, ca, the overdose project provides overdose education for social service providers, single-room occupancy hotel (sro) residents, and syringe exchange participants. the project also conducts an innovative naloxone prescription program, providing naloxone, an opiate antagonist traditionally administered by paramedics to temporarily reverse the effects of opiate overdose, to injection drug users (idus). we will describe how naloxone distribution became a reality in new york and san francisco, how the project works, and our results.
the naloxone prescription program utilizes multiple models to reach idus, including sro-and street-based trainings, and office-based trainings at syringe exchange sites. trainings include information on overdose prevention, recognition, and response. a clinician conducts a medical intake with participants and provides them with pre-filled units of naloxone. in new york, funding was initially provided by tides foundation. new york city council provides current funding. new york department of mental health and hygiene provides program oversight. while the new york project was initiated in june , over half the trainings have been since march . in san francisco, california endowment, tides foundation, and san francisco department of public health (sfdph) provide funding. in addition, sfdph purchases naloxone and provides clinicians who conduct medical intakes with participants. trainings have been conducted since november . to date, nearly individuals have been trained and provided with naloxone. approximately of them have returned for refills and reported that they used naloxone to reverse an opiate-related overdose. limited episodes of adverse effects have been reported, including vomiting, seizure, and "loss of friendship." in new york, individuals have been trained and provided with naloxone. over overdose reversals have been reported. over half of the participants in new york have been trained in the south bronx, the area of new york with the highest rate of overdose fatalities. in san francisco, individuals have been trained and provided with naloxone. over overdose reversals have been reported. the majority of the participants in san francisco have been trained in the tenderloin, th street corridor, and mission, areas with the highest rates of overdose fatalities. the experience of the overdose project in both cities indicates that providing idus low-threshold access to naloxone and overdose information is a cost-effective, efficient, and safe intervention to prevent accidental death in this population. p - (c) successful strategies to regulate nuisance liquor stores using community mobilization, law enforcement, city council, merchants and researchers tahra goraya presenta~ion _will discuss ~uccessful environmental and public policy strategies employed in one southen: cahf?rmna commumty to remedy problems associated with nuisance liquor stores. participants ~ be given tools to understand the importance of utilizing various substance abuse prevention str~tegi~ to change local policies and the importance of involving various sectors in the community to a~_ st with and advocate for community-wide policy changes. recent policy successes from the commultles of pa~ad~na and altad~na will highlight the collaborative process by which the community mobilized resulnng m several ordmances, how local law enforcement was given more authority to monitor poster sessions v nonconforming liquor stores, how collaborative efforts with liquor store owners helped to remove high alcohol content alcohol products from their establishments and how a community-based organiz,uion worked with local legislators to introduce statewide legislation regarding the regulation of nuisance liquor outlets. p - (c) "dialogue on sex and life": a reliable health promotion tool among street-involved youth beth hayhoe and tracey methven introduction: street involved youth are a marginalized population that participate in extremely risky behaviours and have multiple health issues. 
unfortunately, because of previous abuses and negative experiences, they also have an extreme distrust of the adults who could help them. in , toronto public health granted funding to a non governmental, nor for profit drop-in centre for street youth aged - , to educate them about how to decrease rhe risk of acquiring hiv. since then the funding has been renewed yearly and the program has evolved as needed in order to target the maximum number of youth and provide them with vital information in a candid and enjoyable atmosphere. methods: using a retrospective analysis of the six years of data gathered from the "dialogue on sex and life" program, the researchers examined the number of youth involved, the kinds of things discussed, and the number of youth trained as peer leaders. also reviewed, was written feedback from the weekly logs, and anecdotal outcomes noted by the facilitators and other staff in the organization. results: over the five year period of this program, many of youth have participated in one hour sessions of candid discussion regarding a wide range of topics including sexual health, drug use, harm reduction, relationship issues, parenting, street culture, safety and life skills. many were new youth who had not participated in the program before and were often new to the street. some of the youth were given specific training regarding facilitation skills, sexual anatomy and physiology, birth control, sexually transmitted infections, hiv, substance use/abuse, harm reduction, relationships and discussion of their next steps/future plans following completion of the training. feedback has been overwhelmingly positive and stories of life changing decisions have been reported. conclusion: clearly, this program is a successful tool to reach street involved youth who may otherwise be wary of adults and their beliefs. based on data from the evaluation, recommendations have been made to public health to expand the funding and the training for peer leaders in order ro target between - new youth per year, increase the total numbers of youth reached and to increase the level of knowledge among the peer leaders. p - (c) access to identification and services jane kali replacing identification has become increasingly more complex as rhe government identification issuing offices introduce new requirements rhar create significant barriers for homeless people to replace their id. new forms of identification have also been introduced that art' not accessible to homekss peoplt-(e.g. the permanent resident card). ar rhe same time, many service providers continue to require identifi· cation ro access supports such as income, housing, food, health care, employment and employmt·nt training programs. street health, as well as a number of other agencies and community health centres, h, , been assisting with identification replacement for homeless peoplt· for a number of years. the rnrrt·nr challenges inherent within new replacement requirements, as well as the introduction of new forn ' of identification, have resulted in further barriers homeless people encounter when rrring to access t:ssential services. street health has been highlighting these issues to government identification issuing offices, as well as policy makers, in an effort to ensure rhar people who are homeless and marginalized have ac'ess to needed essential services. bandar is a somali word for •·a safe place." the bandar research project is the product of the regent park community health centre. 
the research looks ar the increasing number of somali and afri· can men in the homeless and precariously house population in the inner city core of down~own toronto. in the first phase of the pilot project, a needs assessment was conducted to dennfy barners and issues faced by rhe somali and other african men who are homeless and have add cr ns issues. th_e second phase of rhe research project was to identify long rerm resources and service delivery mechamsms that v poster sessions would enhance the abiity of this population to better access detox, treatment, and post treatment ser· vices. the final phase of the project was to facilitate the development of a conceptual model of seamless continual services and supports from the streets to detox to treatment to long term rehabilitation to housing. "between the pestle and mortar" -safe place. p - (c) successful methods for studying transient populations while improving public health beth hayhoe, ruth ewert, eileen mcmahon, and dan jang introduction: street youth are a group that do not regularly access healthcare because of their mis· trust of adults. when they do access health care, it is usually for issues severe enough for hospitalization or for episodic care in community clinics. health promotion and illness prevention is rarely a part of their thinking. thus, standard public health measures implemented in a more stable population do not work in this group. for example, pap tests, which have dearly been shown to decrease prevalence of cer· vical cancer, are rarely done and when they are, rarely followed up. methods to meet the health care needs and increase the health of this population are frequently being sought. methods: a drop-in centre for street youth in canada has participated in several studies investigating sexual health in both men and women. we required the sponsoring agencies to pay the youth for their rime, even though the testing they were undergoing was necessary according to public health stan· dards. we surmised that this would increase both initial participation and return. results: many results requiring intervention have been detected. given the transient nature of this population, return rates have been encouraging so far. conclusion: it seems evident that even a small incentive for this population increases participation in needed health examinations and studies. it is possible that matching the initial and follow-up incentives would increase the return rate even further. the fact that the youth were recruited on site, and not from any external advertising, indicates that studies done where youth trust the staff, are more likely to be successful. the presentation will share the results of the "empowering stroke prevention project" which incor· porated self-help mutual aids strategies as a health promotion methodology. the presentation will include project's theoretical basis, methodology, outcomes and evaluation results. self-help methodology has proven successful in consumer involvement and behaviour modification in "at risk," "marginalized" settings. self-help is a process of learning with and from each other which provides participants oppor· tunities for support in dealing with a problem, issue, condition or need. self-help groups are mechanisms for the participants to investigate existing solutions and discover alternatives, empowering themselves in this process. 
learning dynamic in self-help groups is similar to that of cooperative learning and peertraining, has proven successful, effective and efficient (haller et al, ) . the mutual support provided by participation in these groups is documented as contributory factor in the improved health of those involved. cognizant of the above theoretical basis, in the self-help resource centre initiated the "empowering stroke prevention project." the project was implemented after the input from health organizations, a scan of more than resources and an in-depth analysis of risk-factor-specific stroke prevention materials indicated the need for such a program. the project objectives were:• to develop a holistic and empowering health promotion model for stroke prevention that incorporates selfhelp and peer support strategies. • to develop educational materials that place modifiable risk factors and lifestyle information in a relevant context that validates project participants' life experiences and perspectives.• to educate members of at-risk communities about the modifiable risk factors associated with stroke, and promote healthy living. to achieve the above, a diverse group of community members were engaged as "co-editors" in the development of stroke prevention education materials which reflected and validated their life experiences. these community members received training to become lay health promoters (trained volunteer peer facilitators). in collaboration with local health organizations, these trained lay health promoters were then supported in organizing their own community-based stroke prevention activities. in addition, an educational booklet written in plain language, entitled healthy ways to prevent stroke: a guide for you, and a companion guide called healthy ways to pre· vent stroke: a facilitator's guide were produced. the presentation will include the results of a tw<>tiered evaluation of the program methodology, educational materials and the use of the materials beyond the life of the project. this poster presentation will focus on the development and structure of an innovative street outreach service that assists individuals who struggle mental illness/addictions and are experiencing homelessness. the mental health/outreach team at public health and community services (phcs) of hamilton, ontario assists individuals in reconnecting with health and social services. each worker brings to the ream his or her own skills-set, rendering it extremely effective at addressing the multidimensional and complex needs of clients. using a capacity building framework, each ream member is employed under a service contract between public health and community services and a local grassroots agency. there are public health nurses (phn), two of whom run a street health centre and one of canada's oldest and most successful needle exchange programs, mental health workers, housing specialists, a harm reduction worker, youth workers, and a united church minister, to name a few. a community advisory board, composed of consumers and professionals, advises the program quarterly. the program is featured on raising the roors 'shared learnings on homelessness' website at www.sharedlearnings.ca. through our poster presentation participants will learn how to create effective partnerships between government and grassroots agencies using a capacity building model that builds on existing programs. 
this study aims to assess the effects of broadcasting a series of documentary and drama videos, intended to provide information about the bc healthguide program in farsi, on the awareness about and the patterns of the service usage among farsi-speaking communities in the greater vancouver area. the major goals of the present study were twofold; ( ) to compare two methods of communications (direct vs. indirect messages) on the attitudes and perceptions of the viewers regarding the credibility of messengers and the relevance of the information provided in the videos, and ( ) to compare and contrast the impact of providing health information (i.e., the produced videos) via local tvs with the same materials when presented in group sessions (using vcr) on participants' attitudes and perceptions cowards the bc healrhguide services. results: through a telephone survey, farsi-speaking adults were interviewed in november and december . the preliminary findings show that % of the participants had seen the aired videos, from which, % watched at least one of the 'drama' clips, % watched only 'documentary' clip, and % watched both types of video. in addition, % of the respondents claimed that they were aware about the program before watching the aired videos, while % said they leaned about the services only after watching the videos. from this group, % said they called the bchg for their own or their "hildren's health problems in the past month. % also indicated that they would use the services in the future whenever it would be needed. % considered the videos as "very good" and thought they rnuld deliver relevant messages and % expressed their wish to increase the variety of subjects (produ\:e more videos) and increase the frequency of video dips. conclusion: the results of this study will assist public health specialists in bc who want to choose the best medium for disseminating information and apply communication interventions in multi\:ultural communities. introduction: many theorists and practitioners in community-based research (cbr) and knowledge transfer (kt) strongly advocate for involvement of potential users of research in the development of research projects, yet few examples of such involvement exist for urban workplace health interventions. we describe the process of developing a collaborative research program. methods: four different sets of stakeholders were identified as potential contributors to and users of the research: workplace health policy makers, employers, trade unions, and health and safety associations. representatives of these stakeholders formed an advisory committee which met quarterly. over the month research development period, an additional meetings were held between resc:ar~h~rs and stakeholders. in keeping with participant observation approaches, field notes of group and md v ~ ual meetings were kept by the two co-authors. emails and telephone calls were also documented. qu~h tative approaches to textual analysis were used, with particular attention paid to collaborattve v poster sessions relationships established (as per cbr), indicators of stakeholders' knowledge utilization (as per kt), and transformations of the proposed research (as per cbr). results: despite initial strong differences of opinion both among stakeho~ders .an~ between stakeholders and researchers, goodwill was noted among all involved. 
acts of reciprocity included mutual sharing of assessment tools, guidance on data utilization to stakeholder organizations, and suggestions on workplace recruitment to researchers. stakeholders demonstrated increases in conceptual understanding of workplace health, e.g., they more commonly discussed more complex, psychosocial indicators of organizational health. stakeholders made instrumental use of shared materials based on research, e.g., adapting their consulting model to more sophisticated data analysis. stakeholders recognized the strategic use of their alliance with researchers, e.g., transformational leadership training as an inducement to improve health and safety among small service franchises. stakeholders helped re-define the research questions, dramatically changed the method of recruitment from researcher cold calls to stakeholder-based recruitment, and strongly influenced pilot research designs. owing a great deal to the elaborate joint development process, the four collaboratively developed pilot project submissions were all successfully funded. conclusion: the intensive process of collaborative development of a research program among stakeholders and researchers was not a smooth process and was time consuming. nevertheless, the result of the collaborative process was a set of projects that were more responsive to stakeholder needs, more feasible for implementation, and more broadly applicable to relevant workplace health problems. introduction: environmental groups, municipal public health authorities and, increasingly, the general public are advocating for reductions in pesticide use in urban areas, primarily because of concern around potential adverse health impacts in vulnerable populations. however, limited evidence of the relative merits of different intervention strategies in different contexts exists. in a pilot research project, we sought to explore the options for evaluating pesticide reduction interventions across ontario municipalities. methods: the project team and a multi-stakeholder project advisory committee (pac) generated a list of potential key informants (ki) and an open-ended interview guide. thirteen ki from municipal government, industry, health care, and environmental organizations completed face-to-face or telephone interviews lasting - minutes. in a parallel process, a workshop involving similar representatives and health researchers was held to discuss the role of pesticide exposure monitoring. minutes from pac meetings, field notes taken during ki interviews, and workshop proceedings were synthesized to generate potential evaluation methods and indicators. results: current evaluation activities were limited but all kis supported greater evaluation efforts, beginning with fuller indicator monitoring. indicators of education and outreach services were important for industry representatives changing applicator practices as well as for most public health units and environmental organizations. indicators based on bylaw enforcement were only applicable in the two cities with bylaws, though changing attitudes toward legal approaches were being assessed in many communities. the public health rapid risk factor surveillance system could use historical baseline data to assess changes in community behaviour through reported pesticide uses and practices, though it had limited penetration in immigrant communities not comfortable in english. pesticide sales (economic) data were only available in regional aggregates not useful for city-specific change documentation.
testing for watercourse or environmental contamination might be helpful, but it is sporadic and expensive. human exposure monitoring was fraught with ethical issues, floor effects from low levels of exposure, and prohibitive costs. clinical episodes of pesticide exposure reported to the regional poison centre (all ages) or the motherisk program (pregnant or breastfeeding women) are likely substantial underestimates that would need to be supplemented with sentinel practice surveillance. a focus on special clinical populations, e.g., multiple chemical sensitivity, would require additional data collection efforts. conclusions: broad support for evaluation and multiple indicators were proposed, though constraints associated with access, coverage, sensitivity and feasibility were all raised, demonstrating the difficulty of evaluating such urban primary prevention initiatives. an important aim of the youth monitor is to learn more about the health development of children and adolescents and the factors that can influence this development. special attention is paid to emotional and behavioural problems. the youth monitor identifies high-risk groups and factors that are associated with health problems. at various stages, the youth monitor charts the course of life of a child. the sources of information and methods of research are different for each age group. the results are used to generate various kinds of reports: for children and young persons, parents, schools, neighbourhoods, boroughs and the municipality of rotterdam and its environs. any problems can be spotted early, at borough and neighbourhood level, based on the type of school or among the young persons and children themselves. together with schools, parents, youngsters and various organisations in the area, the municipal health service aims to really address these problems. on request, an overview is offered of potentially suitable interventions. the authors will present the philosophy, working method, preliminary effects and future developments of this instrument, which serves as the backbone for the rotterdam local youth policy. social workers to be leaders in response to aging urban populations: the practicum partnership program. sarah sisco, alissa yarkony, and patricia volland. introduction: across the us, % of those over  live in urban areas. these aging urban populations, including the baby boomers, have already begun to encounter a range of health and mental health conditions. to compound these effects, health and social service delivery fluctuates in cities, which are increasingly diverse both in their recipients and their systems. in common with other disciplines (medicine, nursing, psychology, etc.), the social work profession faces a shortage of workers who are well-equipped to navigate the many systems, services, and requisite care that this vast population requires. in the next two decades, it is projected that nearly , social workers will be required to provide support to our older urban populations. social workers must be prepared to be aging-savvy leaders in their field, whether they specialize in gerontology or work across the life span. methods: in , a study conducted at the new york academy of medicine documented the need for improved synchronicity in two aspects of social work education, classroom instruction and the field experience. with support from the john a.
hartford foundation, our team created a pilot project entitled the practicum partnership program (ppp) in  master's level schools of social work, to improve aging exposure in field and classroom content through use of the following: ) community-university partnerships, ) increased, diverse student field rotations, ) infusion of competency-driven coursework, ) enhancement of field instructors' roles, and ) concentrated student recruitment. we conducted a pre- and post-test survey of students' knowledge, skills, and satisfaction. results: surveys of over  graduates and field instructors reflected increased numbers of agency-university partnerships, as well as increases in students placed in aging agencies for field placements. there was a marked increase in student commitments to an aging specialization. one year post-graduation, % of those surveyed were gainfully employed, with % employed in the field of aging. by combining curricular enhancement with real-world experiences, the ppp instilled a broad exposure for students who worked with aging populations in multiple urban settings. conclusion: increased exposure to a range of levels of practice, including clinical, policy/advocacy, and community-based, can potentially improve service delivery for older adults who live in cities, and potentially improve national policy. the hartford foundation has now elected to support expansion of the ppp to schools nationwide (urban and rural) to complement other domestic initiatives to enhance holistic services for older adults across the aging spectrum. background: we are a team of researchers and community partners working together to develop an in-depth understanding of the mental health needs of homeless youth (ages  to ), using qualitative and quantitative methods and participatory research methods. it is readily apparent that homeless youth experience a range of mental health problems. for youth living on the street, mental illness may be either a major risk factor for homelessness or may frequently emerge in response to coping with the multitudinous stressors associated with homelessness, including exposure to violence and pressure to participate in survival sex and/or drug use. the most frequent psychiatric diagnoses amongst the homeless generally include depression, anxiety and psychosis. the ultimate objective of the program of research is to develop a plan for intervention to meet the mental health needs of street youth. prior to planning interventions, it is necessary to undertake a comprehensive assessment of mental health needs in this vulnerable population. thus, the immediate objective of this research study is to undertake a comprehensive assessment of mental health needs. methodology: a mixed methodology triangulating qualitative, participatory action and quantitative methods will capture the data related to mental health needs of homeless youth. a purposive sample of approximately - subjects, ages  to , is currently being recruited to participate from the community agencies covenant house, evergreen centre for street youth, turning point and street outreach services. youth living on the street or in short-term residential programs for a minimum of  month prior to their participation, ages  to  and able to give informed consent, will be invited to participate in the study. outcomes: the expected outcome of this initial survey will be an increased understanding of the mental health needs of street youth that will be used to develop effective interventions.
it is anticipated that results from this study will contribute to the development of mental health policy, as well as future programs that are relevant to the mental health needs of street youth. note: it is anticipated that preliminary quantitative data ( subjects) and qualitative data will be available for the conference. the authors intend to present the identification of the research focus, the formation of our community-based team, relevance for policy, as well as preliminary results. p - (a) the need for developing a firm health policy for urban informal workers: the case of . despite their critical role in producing food for urban dwellers in kenya, urban farmers have largely been ignored by government planners and policymakers. their activity is at best dismissed as peripheral or even an inappropriate retention of peasant culture in cities, and at worst illegal and oftentimes criminalized. urban agriculture is also condemned for its presumed negative health impact. a myth that continues despite proof to the contrary is that malarial mosquitoes breed in maize grown in east african towns. however, potential health risks are insignificant compared with the benefits of urban food production. recent studies rightly point to the commercial value of food produced in the urban area while underscoring the importance of urban farming as a survival strategy among the urban poor, especially women-headed households. since the millennium declaration, health has emerged as one of the most serious casualties consequent on the poverty, social exclusion, marginalisation and lack of sustainable development in africa. the hiv/aids epidemic poses an unprecedented challenge, while malaria, tuberculosis and communicable diseases of childhood all add to the untenable burden. malnutrition underpins much ill-health and is linked to more than per cent of all childhood deaths. kenya's urban poor people face a huge burden of preventable and treatable health problems, measured by any social and biomedical indicator, which not only cause unnecessary death and suffering, but also undermine economic development and damage the country's social fabric. the burden persists in spite of the availability of suitable tools and technology for prevention and treatment, and is largely rooted in poverty and in weak health systems. this paper therefore challenges development planners who perceive a dichotomy instead of a continuum between informal and formal urban wage earners in so far as access to health services is concerned. it is this gap that calls for a need to develop and build sustainable health systems among the urban informal dwellers. we recommend a focus on an urban health policy that can build and strengthen the capacity of urban dwellers to access health services in a way that is cost-effective and sustainable. such a health policy must strive for equity for the urban poor, displaced or marginalized, and mobilise and effectively use sufficient sustainable resources in order to build secure health systems and services. special attention should be afforded hiv/aids in view of the unprecedented challenge that this epidemic poses to africa's economic and social development and to health services on the continent. methods: a review of the literature led us to construct three simple models and a composite model of exposure to traffic. the data were collected with the help of a daily diary of travel activities using a sample of  cyclists who went to or came back from work or study.
to calculate the distance, the length of journey, and the number of intersections crossed by a cyclist, different geographic information systems (gis) were operated. statistical analysis was used to determine the significance of the relationship between a measure of exposure on the one hand, and the sociodemographic characteristics of the participants or their geographic location on the other hand. results: our results indicate that cyclists were significantly exposed to road accidents, no matter where they live or what their sociodemographic characteristics are. we also stress the point that having been involved in a road accident was significantly related to helmet use, but did not reduce the propensity of cyclists to expose themselves to road hazards. conclusion: the efforts of the various authorities as regards road safety should not be directed towards reducing the exposure of vulnerable users, but rather towards reducing the dangers they may face. keywords: cyclist, daily diary of activities, measures of exposure to traffic, island of montreal. p - (a) intra-urban disparities and environmental health: some salient features of nigerian residential neighbourhoods. olumuyiwa akinbamijo. abstract: urbanization, particularly in nigerian cities, portends unprecedented crises of grave dimensions. from physical and demographic viewpoints, city growth rates are staggering, coupled with gross inabilities to cope with the consequences. environmental and social ills associated with unguarded rapid urbanization characterize nigerian cities and threaten urban existence. this paper reports the findings of a recent study of the relationship between environment and health across intra-urban residential communities of akure, south west nigeria. it discusses the typical urbanization process of nigerian cities and its dynamic spatial-temporal characteristics. physical and socio-demographic attributes as well as the levels and effectiveness of urban infrastructural services are examined across the core residential districts and the elite residential layouts in the town. the incidence rate of certain environmentally induced tropical diseases across residential neighborhoods and communes is examined. salient environmental variables that are germane to health procurement in the residential districts, incidence of diseases and disease parasitology, and disease prevention and control were studied. field data were subjected to analysis ranging from univariate to bivariate analysis. inferential statistics using the chi-square test were done to establish the truthfulness of the guiding hypothesis. given the above, the study affirms that there is strong independence in the studied communities between the environment and incidence of diseases, hence the health of residents of the town. this assertion, tested statistically at the district level, revealed that residents of the core districts have very strong independence between the environment and incidence of diseases. the strength of this relationship however thins out towards the city's peripheral districts. the study therefore concludes that since most of the city dwellers live in urban deprivation, urban health sensitive policies must be evolved. this is to cater for the urban dwellers who occupy fringe peripheral sites where the extension of facilities is often done illegally.
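the akure abstract above reports inferential statistics using the chi-square test to examine whether district environment and disease incidence are independent. the following is a minimal sketch, in python, of how such a test of independence can be run; the contingency counts, district categories, and variable names are invented purely for illustration and are not taken from the study.

import numpy as np
from scipy.stats import chi2_contingency

# hypothetical counts of households reporting at least one episode of an
# environmentally induced disease versus none, by residential district type
observed = np.array([
    [120,  80],   # core districts: cases, no cases
    [ 90, 110],   # intermediate districts
    [ 40, 160],   # elite / peripheral layouts
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")

# a small p-value would lead us to reject the hypothesis that district
# environment and disease incidence are independent; a large p-value would
# be consistent with the independence the abstract reports.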
urban infrastructural facilities and services need to be provided as a matter of public good for which there is no exclusive consumption or access, even for the poorest of the urban poor. many suffer from low self-esteem, shame and guilt about their drug use. in addition, they often lack support or encounter opposition from their partners, family and friends in seeking treatment. these personal barriers are compounded by fragmented addiction, prenatal and social care services, inflexible intake systems and poor communication among sectors. the experience of accessing adequate care between services can be overwhelming and too demanding. the toronto centre for substance use in pregnancy (t-cup) is a unique program developed to minimize barriers by providing "one-stop" comprehensive healthcare. t-cup is a primary care based program located in the department of family medicine at st. joseph's health centre, a community teaching hospital in toronto. the interdisciplinary staff provides prenatal and addiction services, case management, as well as care of newborns affected by substance use. regular care plan meetings are held between t-cup, labour and delivery nurses and social workers in the maternity and child care program. t-cup also connects women with inpatient treatment programs and community agencies such as breaking the cycle, an on-site counseling group for pregnant substance users. method: a retrospective chart review, a qualitative patient satisfaction study, and health care provider surveys are used to determine outcomes. primary outcomes include changes in maternal substance use, psychosocial status and obstetrical complications (e.g. premature rupture of membranes, pre-eclampsia, placental abruption and hemorrhage). neonatal measures consisted of birth parameters, length of hospital stay and complications (e.g. fetal distress, meconium staining, resuscitation, jaundice, hypoglycemia, severity of withdrawal and treatment length). the chart review consisted of all t-cup patients who met clinical criteria for alcohol or drug dependence and received prenatal and intra-partum care at st. joseph's from october  to june . participants in the qualitative study included former and current t-cup patients. provider surveys were distributed on-site and to a local community hospital. results: preliminary evaluation has demonstrated positive results. treatment retention and satisfaction rates were high, maternal substance use was markedly reduced, and neonatal outcomes were above those reported in the literature. conclusion: this comprehensive, primary care model has been shown to be optimal in the management of substance use in pregnancy and for improving neonatal outcomes. future research will focus on how this inexpensive program can be replicated in other health care settings. t-cup may prove to be the optimal model for providing care to pregnant substance users in canada. introduction: cigarette smoking is one of the most serious health problems in taiwan. the prevalence of smoking in  was .% in males and .% in females aged  years and older. although the government of taiwan passed a tobacco hazards control act in , it has not been strongly enforced in many places. therefore, community residents have often reported exposure to second hand smoke. the purpose of the study was to establish a device to build up more smoke-free environments in the city of tainan.
methods: unlike traditional intervention studies, this study used a healthy city approach to help build up smoke-free environments. the major concept of the approach is to build up a healthy city platform, including organizing a steering committee, setting up policies and indicators, creating intersectoral collaboration, and increasing community participation. first, more than  enthusiastic researchers, experts, governmental officers, city councilors and community leaders in tainan were invited to join the healthy city committee. second, smoke-free policies, indicators for smoke-free environments, and mechanisms for inter-departmental inspections were set up. third, community volunteers were recruited and trained to persuade related stakeholders. lastly, both penalties and rewards were used to help build up the environments. results: after the two-year ( - ) execution of the project, the results qualitatively showed that smoke-free environments in tainan were widely accepted and established, including smoke-free schools, smoke-free workplaces, smoke-free households, smoke-free internet shops, and smoke-free restaurants. smokers were effectively educated not to smoke in public places. community residents, including adults and children in the smoke-free communities, clearly understand the adverse effects of environmental tobacco smoke and actively participated in anti-smoking activities. conclusions: a healthy city platform is effective in conquering the barrier of limited anti-smoking resources. not only can it enlarge community actions for anti-smoking campaigns, but it can also provide partnerships for collaboration. by establishing related policies and indicators, the effects of smoke-free environments can be sustained and the progression can be monitored in a community. these issues are used to set its goals: weuha identifies issues that put people's health at risk. presently, integration action teams (iats), which design integrative solutions, range from six to fifteen members. methods: in order to establish  projects for weuha, the following approach was undertaken: . a project-polling template was created and sent to all members of the alliance for their input. each member was asked to identify their top two population groups, and to suggest a project on which to focus over a - month period for each identified population. . there was a % response to the poll and the top three population groups were identified. data from the toronto community health profile database were utilized to contextualize the information supplied for these populations. a presentation was made to the steering committee and three population-based projects were selected, leaders identified and iats formed. three population-based projects: the population-based projects and health care issues identified are: newcomer prenatal uninsured women: this project will address the challenges faced by providers due to a growing number of non-insured prenatal women seeking care. a service model where the barrier of "catchments" is removed to allow enhanced access and improved and co-ordinated service delivery will be pilot-tested. children/obesity/diabetes: using a health promotion model, this team will focus on screening, intervention, and promoting healthy lifestyles (physical activity and nutrition) for families as well as for overweight and obese children.
seniors health promotion and circle of discharge: this team will develop an early intervention model to assist seniors, family units and caregivers in accessing information and receiving treatment/care in the community. the circle of discharge initiative will address ways of utilizing community supports to keep seniors in the community and minimize readmissions to acute care facilities. results/expected outcomes: coordinated and enhanced service delivery to identified populations, leading to improved access, improved quality of life, and better health care for these targeted populations. introduction: basic human rights are often denied to high-risk populations and people living with hiv/aids. their rights to work and social security, health, privacy, non-discrimination, liberty and freedom of movement, marriage and having a family have been compromised due to their sero-positive status or risk of being positive. the spread of hiv/aids has been accelerating due to the lack of general human rights among vulnerable groups. formulating and implementing effective responses needs dialogue, and to prevent the epidemic from going underground, barriers like stigma need to be overcome. objective: how to reduce the stigma, discrimination and human rights violations experienced by people living with hiv/aids and those who are vulnerable to hiv/aids. methodology and findings: consultation meetings were structured around presentations, field visits, community meetings and group work to formulate recommendations on how government and ngos/cbos should move forward based on the objective. pakistan being a low prevalence country, the whole sense of complacency that individuals are not subject to situations of vulnerability to hiv is the major threat to an explosion in the epidemic; therefore urgent measures are needed to integrate human rights issues from the very start of the response. the protection and promotion of human rights is an integral component of all responses to the hiv/aids epidemic. it has been recognized that the response to hiv/aids must be multi-sectoral and multi-faceted, with each group contributing its particular expertise. for this to occur, along with other knowledge, more information is required on human rights abuses related to hiv/aids in a particular scenario. the consultation meetings on hiv/aids and human rights were an exemplary effort to achieve this objective. recommendations: the need for a comprehensive, integrated and multi-sectoral approach in addressing the issue of hiv/aids was highlighted. the need for social, cultural and religious aspects to be prominently addressed was identified. it was thought imperative to take measures even in low prevalence countries. education has a key role to play; there is a need for a code of ethics for media people and health care providers, and violations should be closely monitored and follow-up action taken. p - (c) how can community-based funding programs contribute to building community capacity and how can we measure this elusive goal? mary frances maclellan-wright, brenda cantin, mary jane buchanan, and tammy simpson. community capacity building is recognized by the public health agency of canada (phac) as an important strategy for improving the overall health of communities by enabling communities to address priority issues such as social and economic determinants of health. in / , phac alberta/nwt region's population health fund (phf) supported  community-based projects to build community capacity on or across the determinants of health.
specifically, this included creating accessible and supportive social and physical environments as well as creating tools and processes necessary for healthy policy development and implementation. the objective of this presentation is to highlight how the community capacity building tool, developed by phac alberta/nwt region, can demonstrate gains in community capacity over the course of a project and be used as a reflective tool for project planning and evaluation. as part of their reporting requirements, project sites completed the community capacity building tool at the beginning and end of their one-year project. the tool collects valid and reliable data in the context of community-based health projects. developed through a vigorous and collaborative research process, the tool uses plain language to explore nine key features of community capacity with  items, each with a section for contextual information, of which  also include a four-point rating scale. results show an increase in community capacity over the course of the funded projects. pre and post aggregate data from the one-year projects measured statistically significant changes for  of the scaled items. projects identified key areas of community capacity building that needed strengthening, such as increasing participation, particularly among people with low incomes; engaging community members in identifying root causes; and linking with community groups. in completing the tool, projects examined root causes of the social and economic determinants of health, thereby exploring social justice issues related to the health of their community. results of the tool also served as a reflection on the process of community capacity building; that is, how the project outcomes were achieved. projects also reported that the tool helped identify gaps and future directions, and was useful as a project planning, needs assessment and evaluation tool. community capacity building is a strategy that can be measured. the community capacity building tool provides a practical means to demonstrate gains in community capacity building. strengthening the elements of community capacity building through community-based funding can serve as building blocks for addressing other community issues. needs of marginalized crack users. lorraine barnaby, victoria okazawa, barb panter, alan simpson, and bo yee thom. background: the safer crack use coalition of toronto (scuc) was formed in  in response to growing concern for the health and well-being of marginalized crack users. a central concern was the alarming hepatitis c rate ( %) amongst crack smokers and the lack of connection to prevention and health services. scuc is an innovative grassroots coalition comprised of front-line workers, crack users, researchers and advocates. despite opposition and without funding, scuc has grown into the largest crack-specific harm reduction coalition in canada and has developed a nationally recognized safer crack kit distribution program (involving  community-based agencies that provide outreach to users). the success of our coalition derives from our dedication to the issue and from the involvement of those directly affected by crack use. setting: scuc's primary service region is greater toronto, a diverse, large urban centre. much of our work is done in areas where homeless people, sex trade workers and drug users tend to congregate.
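the community capacity building abstract above describes comparing aggregate pre and post ratings on the tool's four-point scaled items and reporting statistically significant changes. the sketch below illustrates one way such a paired pre/post comparison could be run in python; the number of projects, the simulated item scores, and the choice of the wilcoxon signed-rank test are illustrative assumptions, not details of the phac tool or its actual analysis.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_projects, n_items = 20, 9          # hypothetical: 20 funded projects, 9 scaled items

# simulated ratings on a four-point scale at the start and end of the one-year projects
pre = rng.integers(1, 4, size=(n_projects, n_items))
post = np.clip(pre + rng.integers(0, 2, size=pre.shape), 1, 4)

# test each scaled item for a significant pre/post change
for item in range(n_items):
    diffs = post[:, item] - pre[:, item]
    stat, p = wilcoxon(pre[:, item], post[:, item], zero_method="zsplit")
    flag = "significant" if p < 0.05 else "not significant"
    print(f"item {item + 1}: median change {np.median(diffs):+.1f}, p = {p:.3f} ({flag})")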
recently, scuc has reached out to regional and national stakeholders to provide leadership and education. mandate: our mandate is to advocate for marginalized crack users and support the development of a comprehensive harm reduction model that addresses the health and social needs facing crack users, and to facilitate the exchange of information between crack users, service providers, researchers, and policy developers across canada. overview: the proposed workshop will provide participants with an overview of the development of scuc, our current projects (including research, education, direct intervention and consultation), our challenges and successes, and the role of community development and advocacy within the coalition. presenters will consist of community members who have personal crack use experience and front-line workers. scuc conducted a community-based research project (toronto crack users perspectives, ), in which  focus groups with marginalized crack users across toronto were conducted. participants identified health and social issues affecting them, barriers to needed services and personal strategies, and offered recommendations for improved services. presenters will share the methodology, results and recommendations resulting from the research project. conclusion: research, field observations and consultations with stakeholders have shown that crack smokers are at an increased risk for sexually transmitted infections, hiv/aids, hepatitis c, tb and other serious health issues. health issues affecting crack users are due to high risk behaviours; socio-economic factors such as homelessness, discrimination, unemployment, violence, incarceration and social isolation; and a lack of comprehensive health and social services targeting crack users. however, this remains a gross underestimation; these are hospital-based reports and many known cases go unreported. whatever the case, young age at first intercourse, inconsistent condom use and multiple partners place adolescents at high risk for a diverse array of stis, including hiv. about % of female nigerian secondary school students report initiating sexual intercourse before age  years. % of nigerian female secondary school students report not using a condom the last time they had sexual intercourse. more than % of urban nigerian teens report inconsistent condom use. methods: adolescents were studied, ages  to , from benin city in edo state. the models used were mother-daughter ( ), mother-son ( ), father-son ( ), and father-daughter ( ). the effect of parent-child sexual communication at baseline on the child's report of sexual behavior  to  months later was studied. greater amounts of sexual risk communication were associated with markedly fewer episodes of unprotected sexual intercourse and a reduced number of sexual partners. results: this study showed that parents can exert more influence on the sexual knowledge, attitudes and practices of their adolescent children through desired practices or role modeling, reiterating their values, and appropriate monitoring of the adolescents' behavior. they also stand to provide information about sexuality and various sexual topics.
parent-child sexual communication has been found to be particularly influential and has been associated with later onset of sexual initiation among adolescents, less sexual activity, more responsible sexual attitudes including greater condom use and self-efficacy, and lower self-reported incidence of stis. conclusions: parents need to be trained to relate more effectively with their children/wards about issues related to sex and sexuality. family-based programs to reduce sexual risk-taking need to be developed. there is also a need to carry out cross-ethnic and cross-cultural studies to identify how parent-child influences on adolescent sexual risk behavior may vary in different regions or countries, especially in this era of the hiv pandemic. introduction: public health interventions to identify and eliminate health disparities require evidence-based policy and adequate model specification, which includes individuals within a socioecological context, and requires the integration of biosociomedical information. multiple public and private data sources need to be linked to apportion variation in health disparities to individual risk factors, the health delivery system, and the geosocial environment. multilevel mapping of health disparities furthers the development of evidence-based interventions through the growth of the public health information network (phin-cdc) by linking clinical and population health data. clinical encounter data, administrative hospital data, population socioenvironmental data, and local health policy were examined in a three-level geocoded multilevel model to establish a tracking system for health disparities. nj has a long established political tradition of "home rule" based in elected municipal governments, which are responsible for the well-being of their populations. municipalities are contained within counties as defined by the us census, and health data are linked mostly at the municipality level. marika schwandt: community organizers from the ontario coalition against poverty, along with medical practitioners who have endorsed the campaign and have been involved in prescribing special diet needs for ow and odsp recipients, will discuss the raise the rates campaign. the organization has used a special diet needs supplement as a political tool, meeting the urgent needs of poor people in toronto while raising the issues of poverty as a primary determinant of health and a nutritious diet as a preventative health measure. health professionals carry the responsibility to ensure that they use all means available to them to improve the health of the individuals that they serve, and to prevent future disease and health conditions. most health practitioners know that those on social assistance are not able to afford nutritious foods or even sufficient amounts of food, but many are not aware of the extra dietary funds that are available after consideration by a health practitioner. responsible nurse practitioners and physicians cannot, in good conscience, ignore the special needs diet supplement that is available to all recipients of welfare and disability benefits (ow and odsp). a number of toronto physicians have taken the position that all clients can justifiably benefit from vitamins, organic foods and high fiber diets as a preventative health measure. we know that income is one of the greatest predictors of poor health.
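the health-disparities abstract above describes linking encounter-level, municipal and county data in a three-level geocoded multilevel model. as a rough illustration of that kind of specification, the sketch below fits a random-intercept mixed model in python on simulated data; the variable names, effect sizes and nesting structure are assumptions for demonstration and do not reproduce the nj analysis.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
rows = []
for c in range(5):                              # hypothetical counties
    county = f"county_{c}"
    county_effect = rng.normal(0, 0.5)
    for m in range(8):                          # municipalities nested in counties
        muni = f"{county}_muni_{m}"
        muni_effect = rng.normal(0, 0.3)
        for _ in range(30):                     # individual encounters
            low_income = rng.integers(0, 2)
            score = 1.0 + 0.8 * low_income + county_effect + muni_effect + rng.normal(0, 1)
            rows.append((county, muni, low_income, score))

df = pd.DataFrame(rows, columns=["county", "municipality", "low_income", "poor_health_score"])

# random intercept for county plus a variance component for municipalities
# nested within counties, approximating a three-level specification
model = smf.mixedlm(
    "poor_health_score ~ low_income",
    df,
    groups=df["county"],
    vc_formula={"municipality": "0 + C(municipality)"},
)
print(model.fit().summary())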
the special needs diet is a health promotion intervention which will prevent numerous future health conditions, including chronic conditions such as cardiovascular disease, cancer, diabetes and osteoporosis. many community health centres and other providers have chosen to hold clinics to allow many patients to get signed up for the supplement at one time. initiated by the ontario coalition against poverty, these clinics have brought together community organizers, community health centres, health practitioners, and individuals who believe that poverty is the primary determinant of poor health. we believe that rates must be increased to address the health problems of all people on social assistance: kids, elders, people with hiv/aids, everyone. even in the context of understaffing, it could be considered a priority activity that has potentially important health promotion benefits. many clients can be processed in a two-hour clinic. most providers find it a very interesting, rewarding undertaking. in  the ontario coalition for social justice found that a toronto family with two adults and two kids receives $ , ; this is $ ,  below the poverty line. p - (c) the health of street youth compared to similar aged youth. beth hayhoe and ruth ewert. introduction: street youth are at an age normally associated with good health, but due to their risky behaviours and the conditions in which they live, they experience health conditions unlike their peers in more stable environments. in addition, the majority of street youth have experienced significant physical, sexual and emotional abuse as younger children, directly impacting many of the choices they make around their physical and emotional health. we examined how different their health really is. methods: using a retrospective analysis of the  years of data gathered from yonge street mission's evergreen health centre, the top  conditions of youth were examined and compared with national trends for similar aged youth. based on knowledge of the risk factors present in the group, reasons for the difference were examined. results: street youth experience more illness than other youth their age, and their illnesses can be directly linked to the conditions in which they live. long-term impacts of abuse contribute to such significant despair that youth may voluntarily engage in behaviours, or lack of self care, in the hope that their lives will perhaps come to a quicker end. conclusion: although it has long been known that poverty negatively affects health, this study clearly shows the difference in the health of this particular marginalized population. the information can be used to make recommendations around public policy that affects children and youth, especially as it relates to their access to appropriate health care and follow up. p - (c) why do urban children in bangladesh die, and how can we save our children? tarek hussain. the traditional belief that urban children are better off than rural children might no longer be valid. a recent study (dhs data from  countries) demonstrates that the child survival prospects of rural-urban migrants are higher than those in their rural origin and lower than those of urban non-migrants. in bangladesh, currently  million people are living in urban areas, and by the year  it would be  million. the health of the urban population is a key concern. a recent study on the urban poor reveals that the urban poor have a worse health situation than the nation as a whole.
this study shows that infant poster sessions v mortality among the urban poor as per thousand, which are above the rural and national level estimates. the mortality levels of the dhaka poor are well above those of the rest of the city's population but much of the difference in death rates is explained by the experience of children, especially infants. analyzing demographic surveillance data from a large zone of the city containing all sectors of the population, research showed that the one-fifth of the households with the least possessions exhibited u child mortality almost three times as high as that recorded by the rest of the population. why children die in bangladesh? because their parents are too poor to provide them with enough food, clean water and other basic needs to help them avoid infection and recover from illness. researchers believed that girls are more at risk than boys, as mothers regularly feed boys first. this reflects the different value placed on girls and boys, as well as resources which may not stretch far enough to provide for everyone. many studies show that housing conditions such as household construction materials and access to safe drinking water and hygienic toilet facilities are the most critical determinants of child survival in urban areas of developing countries. the present situation stressed on the need for renewed emphasis on maternal and child healthcare and child nutrition programs. mapping path for progress to save our children would need be done strategically. we have the policies on hand, we have the means, to change the world so that every child will survive and has the opportunity to develop himself fully as a healthy human being. we need the political will--courage and determination to make that a reality. p - (c) sherbourne health centre: innovation in healthcare for the transgendered community james read introduction: sherbourne health centre (shc), a primary health care centre located in downtown toronto, was established to address health service gaps in the local community. its mission is to reduce barriers to health by working with the people of its diverse urban communities to promote wellness and provide innovative primary health services. in addition to the local communities there are three populations of focus: the lesbian, gay, bisexual, transgendered and transexual communities (lgbtt); people who are homeless or underhoused; and newcomers to canada. shc is dedicated to providing health services in an interdisciplinary manner and its health providers include nurses, a nurse practitioner, mental health counsellors, health promoters, client-resource workers, and physicians. in january shc began offering medical care. among the challenges faced was how to provide responsive, respectful services to the trans community. providers had considerable expertise in the area of counselling and community work, but little in the area of hormone therapy -a key health service for those who want to transition from one gender to another. method: in preparing to offer community-based health care to the trans community it was clear that shc was being welcomed but also being watched with a critical eye. trans people have traditionally experienced significant barriers in accessing medical care. to respond to this challenge a working group of members of the trans community and health providers was created to develop an overall approach to care and specific protocols for hormone therapy. 
the group met over a one-year period and their work culminated in the development of medical protocols for the provision of hormone therapy to trans individuals. results: shc is currently providing health care to  registered clients who identify as trans individuals (march ) through primary care and mental health programs. in an audit of shc medical charts (january  to september ),  female-to-male (ftm) and  male-to-female (mtf) clients were identified. less than half of the ftm group and just over two-thirds of the mtf group presented specifically for the provision of hormones. based on this chart audit and ongoing experience, shc continues to update and refine these protocols to ensure delivery of quality care. conclusion: this program is an example of innovative community-based health delivery to a population who have traditionally faced barriers. shc services also include counselling, health promotion, outreach and education. p - (c) healthy cities for canadian women: a national consultation. sandra kerr, kimberly walker, and gail lush. on march , the national network on environment and women's health held a pan-canadian consultation to identify opportunities for health research, policy change, and action. this consultation also worked to facilitate information sharing and networking between canadian women working as urban planners, policy makers, researchers, and service workers on issues pertaining to the health of women living in canadian cities. methods: for this research project, participants included front-line service workers, policy workers, researchers, and advocates from coast to coast, including francophone women, women with disabilities, racialized women, and other marginalized groups. the following key areas were selected as topics for ... diabetes is a leading cause of end-stage renal disease in singapore, accounting for more than 50% of new cases, prompting the national kidney foundation singapore (nkfs) to embark on a prevention program (pp) empowering patients to manage their condition better, emphasizing education and disease self-management skills as essential components of good glycaemic control. we sought to explore the effects of a specialized education program on glycaemic control, as indicated by serum hba1c values. baseline serum hba1c values were determined before the program; subgroups examined included age (under  years), obesity (bmi), waist-hip ratio, education up to primary and above secondary level, and urine results. urine tests showed that increasing hba1c levels were accompanied by increasing urinary protein and creatinine levels. fbg results showed that the management of diabetes in the nkfs prevention programme is effective. results also indicated that hba1c levels have a linear trend with urinary protein and creatinine, which are important determinants of renal disease. family-focused clinical pathway promotes positive outcomes for an inner city community: clinical pathways (cps) streamline care activities in preparation for an infant's discharge home, and are intended to improve efficiencies of care. there is a paucity of research, and inconsistency of results, on the benefits of family-focused cps. aim: to determine whether implementation of family-focused cps in a neonatal unit serving an inner city community decreased length of stay (los) and promoted family satisfaction and readiness for discharge. methods: family-focused cp data were collected for all infants born between  and  weeks gestational age who were admitted to the neonatal unit. length of stay (  vs.  days, p < 0.05) and postmenstrual age at discharge home (  vs.  weeks, p < 0.05) differed significantly between the pre- and post-cp groups.
satisfaction scores for families were high, and families noted they were more prepared to care for their baby at home. there was a cost saving of $ ,  (cdn) per patient discharged home in the post-cp group compared to the pre-cp group. conclusion: implementation of family-focused cps in a neonatal unit serving an inner city community decreased length of stay with a high degree of family satisfaction, and was cost saving. at least % of the kathmandu population lives in slum-like conditions with poor access to basic health services. in these disadvantaged areas, a large proportion of children do not receive treatment due to the inaccessibility of medical services. in these areas, diarrhea, pneumonia, and measles are the key determinants of infant mortality. protein energy malnutrition and vitamin a deficiency persist, and communicable diseases are compounded by the emergence of diseases like hiv/aids. while the health challenges for disadvantaged populations in kathmandu are substantial, the city has also experienced various forms of innovative and effective community development health programs. for example, there are community primary health centers established by the kathmandu municipality to deliver essential health services to targeted communities. these centers not only provide equal access to health services to the people through an effective management system but also educate them by organizing health-related awareness programs. this program is considered one of the most effective urban health programs. the paper/presentation will review large, innovative, and effective urban health programs that are operating in kathmandu. most of these programs are currently run by international and national ngos. a) early detection of emerging diseases in urban settings through syndromic surveillance: data pilot study. kate bassil. patients are often discharged without knowledge of community resources, and without adequate follow-up. in november  shelter providers met with hospital social workers and ccac to strike a working group to address some of the issues by increasing knowledge among hospital staff of issues surrounding homelessness, and to build a strong working relationship between both systems in hamilton. to date the hswg has conducted four walking tours of downtown shelters for hospital staff and local politicians. recently the hswg launched its 'toolkit for staff working with patients who are homeless', which contains community resources and guidelines to help with effective discharge plans. a scpi proposal has been submitted to increase the capacity of the hswg to address education gaps and opportunities with both shelters and hospitals around homelessness and healthcare. the purpose of this poster presentation is to share hamilton's experience and learnings with communities who are experiencing similar issues. it will provide for interaction around shared experiences and a chance to network with practitioners across canada re: best practices. introduction and objectives: canadians view health as the biggest priority for the federal government, where health policies are often based on models that rely on abstract definitions of health that provide little assistance in the policy and analytical arena.
the main objectives of this paper are to provide a functional definition of health, to create a didactic model for devising policies and determining forms of intervention, to aid health professionals and analysts in strategizing and prioritizing policy objectives via cost-benefit analysis, and to prompt readers to view health in terms of capacity measures as opposed to status measures. this paper provides a different perspective on health, which can be applied to various applications such as strategies of aid and poverty reduction, and measuring the health of an individual, community or country. this paper aims to discuss the theoretical, conceptual, methodological, and applied implications associated with different health policies and strategies, which can be extended to urban communities. essentially, our paper touches on the following two main themes of this conference: • health status of disadvantaged populations; and • interventions to improve the health of urban communities. methodology: we initially surveyed other models on this topic, and extrapolated key aspects into our conceptual framework. we then devised a theoretical framework that parallels simple theories of physical energy, where health is viewed in terms of personal/societal health capacities and effort components. after establishing a theoretical model, we constructed a graphical representation of our model using self-rated health status and life expectancy measures. ultimately, we formulated a new definition of health, and a rudimentary method of conducting cost-benefit analysis on policy initiatives. we end the paper with an application example discussing the issues surrounding the introduction of a seniors program. results: this paper provides both a conceptual and theoretical model that outlines how one can go about conducting a cost-benefit analysis when implementing a program. it also devises a new definition and model for health based on our concept of individual and societal capacities. by devising a definition for health that links with a conceptual and theoretical framework, strategies can be more logically constructed so that the repercussions on the general population are minimized. equally important, our model also sets itself up nicely for future microsimulation modeling and analysis. implications: this research enhances one's ability to conduct community-based cost-benefit analysis, and acts as a pedagogical tool when identifying which strategies provide the best outcome. p - (a) good playgrounds are hard to find: parents' perceptions of neighbourhood parks. patricia tucker, martin holmes, jennifer irwin, and jason gilliland. introduction: neighbourhood opportunities, including public parks and physical activity or sports fields, have been identified as correlates of physical activity among youth. increasingly, physical activity among children is being acknowledged as a vital component of children's lives as it is a modifiable determinant of childhood obesity. children's use of parks is mainly under the influence of parents; therefore, the purpose of this study was to assess parents' perspectives of city parks, using london, ontario as a case study. methods: this qualitative study targeted a heterogeneous sample of parents of children using local parks within london. parents with children using the parks were asked for  minutes of their time and, if willing, a short interview was conducted. the interview guide asked parents for their opinion of city parks, particularly the one they were currently using.
a sample size of  parents is expected by the end of the summer. results: preliminary findings are identifying parents' concern with the current lack of shade in local parks. most parents have identified this as a limitation of existing parks, and when asked what would make the parks better, parents agree that shade is vital. additionally, some parents are recognizing the ... focused discussions during the consultation: . women in poverty; . women with disability; . immigrant and racialized women; . the built and physical environment. results: participants voiced the need for integration of the following issues within the research and policy agenda: ) the intersectional nature of urban women's health issues, which reflects the reality of women's complex lives; ) the multisectional aspect of urban women's health issues, which reflects the diversity within women's lives; ) the intersectoral dynamics within women's lives and urban health issues. these concepts span multiple sectors, including health, education, and economics, when leveraging community, research, and policy support, and engaging all levels of government. policy implications: in order to work towards health equity for women, plans for gender equity must be incorporated nationally and internationally within urban development initiatives: • reintroduce "women" and "gender" as distinct sectors for research, analysis, advocacy, and action. • integrate the multisectional, intersectional, and intersectoral aspects of women's lives within the framework of research and policy development, as well as in the development of action strategies. • develop a strategic framework to house the consultation priorities for future health research and policy development (for example, advocacy, relationship building, evidence-based policy-relevant research, priority initiatives). note: research conducted by nnewh has been made possible through a financial contribution from health canada. the views expressed herein do not necessarily represent the views of health canada. p - (c) drugs, culture and disadvantaged populations. leticia folgar and cecilia rado. introduction: a harm reduction project in an urban community in a situation of extreme vulnerability prompted reflection on the central place of sociocultural elements in the access of different urban groups to health services. the "ways of doing, thinking and feeling" orient actions and delimit the possibilities individuals have of defining whether or not something is a problem, as well as the mechanisms for asking for help. ongoing analysis of the field of the "everyday cultures" of so-called "drug users" contributes to an understanding of the complexity of the issue in its real settings, and supports contextualized designs of socio-sanitary intervention policies and proposals, making them more effective. methods: this action-research experience, which uses the ethnographic method, identifies socio-structural elements and patterns of consumption, and explores in depth the socio-symbolic elements that structure users' discourses, characterizing and differentiating them as constitutive of social identities that condition the implementation of harm reduction programs. results: the results we will present account for the differentiated characteristics of, and the particular relationships among, drug consumers in this specific context. based on this case study, we will attempt to begin
based on this case study, we will attempt to begin answering questions that we consider significant when designing interventions tailored to populations that share certain socio-cultural characteristics: what would motivate change in these communities? which community elements help us to build demand? and what can we learn from the "solutions" that drug users themselves find for problematic use? methods: our study was conducted by a team of two researchers at three different sites. the mapping consisted of filling in a chart of observable neighbourhood features such as graffiti, litter, and boarded housing, and the presence or absence of each feature was noted for each city block. qualitative observations were also recorded throughout the process. researchers analyzed the compiled quantitative and qualitative neighbourhood data and then analyzed the process of data collection itself. results: this study reveals the need for further research into the effects of physical environments on individual health and sense of well-being, and perception of investment in neighbourhoods. the process reveals that perceptions of health and safety are not easily quantified. we make specific recommendations about the mapping methodology, including the importance of considering how factors such as researcher social location may impact the experience of neighbourhoods and how similar neighbourhood characteristics are experienced differently in various spaces. further, we discuss some of the practical considerations around the mapping exercise, such as recording of findings, time of day, temperature, and researcher safety. conclusion: this study revealed the importance of exploring conceptions of health and well-being beyond basic physical wellness. it suggests the importance of considering one's environment and one's own perception of health, safety, and well-being in determining health. this conclusion suggests that attention needs to be paid to the connection between the workplace and the external environment in which it is situated. the individual's workday experience does not start and stop at the front door of their workplace, but rather extends into the neighbourhood and environment around them. our procedural observations and recommendations will allow other researchers interested in the effect of urban environments on health to consider using this innovative methodology. introduction: responding to protests against poor medical attention for sexually assaulted women and deplorable conviction rates for sex offenders, in the late s the ontario government established what would become over hospital-based sexual assault care and treatment centres (sactcs) across the province. these centres, staffed around the clock with specially trained health care providers, have become the centralized locations for the simultaneous health care treatment of, and forensic evidence collection from, sexually assaulted women for the purpose of facilitating positive social and legal outcomes. since the introduction of these centres, very little evaluative research has been conducted to determine the impact of this intervention. the purpose of our study was to investigate it from the perspectives of sexually assaulted women who have undergone forensic medical examinations at these centres. method: women were referred to our study by sactc coordinators across ontario. we developed an interview schedule composed of both closed and open-ended questions. twenty-two women were interviewed, face-to-face.
these interviews were approximately one to two hours in length, and were transcribed verbatim. to date, have been analyzed for key themes. results: preliminary findings indicate that most women interviewed were canadian born ( %), and ranged in age from to years. a substantial proportion self-identified as a visible minority ( %). approximately half were single or never married ( %) and living with a spouse or family of origin ( %). most were either students or not employed ( %). two-thirds ( %) had completed high school and one-third ( %) was from a lower socio-economic stratum. almost two-fifths ( %) of women perceived the medical forensic examination as revictimizing, citing, for example, the internal examination and having blood drawn. the other two-thirds ( %) indicated that it was an empowering experience, as it gave them a sense of control at a time when they described feeling otherwise powerless. most ( %) women stated that they had presented to a centre due to health care concerns and were very satisfied ( %) with their experiences and interactions with staff. almost all ( %) women felt supported and understood. conclusions: this research has important implications for clinical practice and for appropriately addressing the needs of sexually assaulted women. what is apparent is that continued high-quality medical attention administered in the milieu of specialized hospital-based services is essential. at the same time, we would suggest that some forensic evidence collection procedures warrant reevaluation. the study will take an experiential approach by chronicling the impact of the transition from the streets to stabilization in a managed alcohol program through the technique of narrative inquiry. in keeping with the shift in thinking in the mental health field, this study is based on a paradigm of recovery rather than one of pathology. the "inner views of participants' lives as they portray their worlds, experiences and observations" will be presented (charmaz, , p. ). the purpose of the study is to identify barriers to recovery. it will explore the experience of … prior to entry into the program; and following entry will explore the meaning and definitions of recovery and the impact of the new environment, and highlight what supports were instrumental in moving participants along the recovery paradigm. p - (a) treating the "untreatable": the politics of public health in vancouver's inner city introduction: this paper explores the everyday practices of therapeutic programs in the treatment of hiv in vancouver's inner city. as anthropologists have shown elsewhere, therapeutic programs do not simply treat physical ailments but they shape, regulate and manage social lives. in vancouver's inner city, there are few therapeutic options available for the treatment of hiv. public health initiatives in the inner city have instead largely focused on prevention and harm reduction strategies such as needle exchange programs, safe injection sites, and safer-sex education. epidemiological reports suggest that less than a quarter of those living with hiv in the downtown eastside (dtes) are taking antiretroviral therapies, raising critical questions regarding the therapeutic economy of antiretrovirals and rights to health care for the urban poor. methods: this paper is drawn from ethnographic fieldwork in vancouver's dtes neighborhood focusing on therapeutic programs for hiv treatment among "hard-to-reach" populations.
the research includes participant-observation at inner city health clinics specializing in the treatment of hiv; semi-structured interviews with hiv positive participants, health care professionals providing hiv treatment, and administrators working in the field of inner city public health; and, lastly, observation at public meetings and conferences surrounding hiv treatment. results: hiv prevention and treatment is a central concern in the lives of many residents living in the inner city -although it is just one of many health priorities afflicting the community. concerns about drug resistance, the cost of antiretrovirals, and illicit drug use mean that hiv therapy for most is characterized by the daily observation of their medicine ingestion at health clinics or pharmacies. daily observed treatment (dot) is increasingly being adopted as a strategy in the therapeutic management of "untreatable" populations. dot programs demand a particular type of subject -one who is "compliant" with the rules and regimes of public health. over-emphasis on "risky practices," "chaotic lives," and "adherence" prevents the public health system from meaningful engagement with the health of the marginalized, who continue to suffer from multiple and serious health conditions and who continue to experience considerable disparities in health. the effects of hiv in the inner city are compounded by poverty, lack of safe and affordable housing, various illegal underground economies, increased rates of violence, and outbreaks of sexually transmitted infections, hepatitis, and tuberculosis; but this research suggests that public health initiatives aimed at reducing health disparities may be failing the most vulnerable and marginal of citizens. margaret malone: violence that occurs in families and in intimate relationships is a significant urban, community, and public health problem. it has major consequences and far-reaching effects for women, children, youth, seniors, and families. violence also has significant effects for those who provide and those who receive health care. violence is a social act involving a serious abuse of power. it crosses all classes, genders, ages, ethnicities, cultures, sexualities, abilities, and religions. societal responses have largely focused on identification, crisis intervention, and services for families and individuals. health promoters are only beginning to address the issues of violence against women and violence in intimate relationships and in families. in this paper, i analyze issues, propose strategies, and note challenges to consider in working towards eradicating violence, while arguing that social justice and equity cannot be fully achieved when there are people who experience such violence. method: critical social theory, an analysis that addresses culturally and ethnically diverse communities, together with a population health promotion perspective, frame this analysis. social determinants of health are used to highlight the extent of the problem of violence and the social and health care costs. the ottawa charter is integrated to focus on strategies for developing personal skills, strengthening community action, creating supportive environments, developing healthy public policies, and re-orientating health and social services.
attention is directed to approaches for working with individuals, families, groups, communities, populations, and society.ratdts: this analysis demonstrates that a comprehensive interdisciplinary, multisectoral, and multifaceted approach within an overall health promoting perspective helps to focus on the relevant issues, aitical analysis, and strategies required for action. it also illuminates a number of interacting, intersecting, and interconnecting factors related to violence. attention, which is often focused on individuals who are blamed for the problem of violence, is redirected to the expertise of non-health professionals and to community-based solutions. the challenge for health promoters working in the area of violence in families and in intimate relationships is to work to empower ourselves and the communities with whom we work to create health-promoting urban environments. social justice, equiry, and emancipatory possibilities are positioned in relation to recommendations for future community-based participatory research, pedagogical practices for health care practitioners, and policy development in relation to violence and urban health. the mid-main community health center, located in vancouver british columbia (bc), has a diverse patient base reflecting various cultures, languages, abilities, and socio-economic statuses. due to these differences, some mid-main patients experience greater digital divide barriers in accessing computers and reputable, government produced consumer health information (chi) websites, such as the bc healthguide and canadian health network. inequitable access is problematic because patient empowerment is the basis of many government produced chi websites. an internet terminal was introduced at mid-main in the summer of , as part of an action research project to attempt to bridge the digital divide and make government produced chi resources useful to a broad array of patients. multi-level interventions in co-operation with patients, with the clinic and eventually government ministries were envisioned to meet this goal. the idea of implementing multi-level interventions was adopted to counter the tendency in interactive design to implement a universal solution for the 'ideal' end-user [ ), which discounts diversity. to design and execute the interventions, various action-oriented and ethnographic methods were employed before and during the implementation of the internet terminal. upon the introduction of the internet terminal, participant observation and interviews were conducted using a motion capture software program to record a digital video and audio track of patients' internet sessions. this research provided insight into the spectrum of patients' capacities to use technology to fulfil their health information needs and become empowered. at the mid-main clinic it is noteworthy that the most significant intervention to enhance the usefulness of chi websites for patients appeared to be a human rather than a technological presence. as demonstrated in other ethnographic research of community internet access, technical support and capacity building is a significant component of empowerment ( ). the mid-main wired waiting room project indicates that medical practitioners, medical administrators, and human intermediaries remain integral to making chi websites useful to patients and their potential empowerment. ( ) over the past years the environmental yo~th alliance has been of~ering a.youth as~t. 
mappin~ program which trains young people in community research and evaluation. wh ~st the positive expenenc~ and relationships that have developed over this time attest to the success of this program, no evaluations has yet been undertaken to find out what works for t.he youth, what ~ould be changed, and what long term outcomes this approach offers for the youth, their local community, and urban governance. these topics will be shared and discussed to help other community disorganizing and uncials governments build better, youth-driven structures in the places they live.p - (a) the world trade center health registry: a unique resource for urban health researchers deborah walker, lorna thorpe, mark farfel, erin gregg, and robert brackbill introduction: the world trade center health registry (wfchr) was developed as a public health response to document and evaluate the long-term physical and mental health effects of the / disaster on a large, diverse population. over , people completed a wfchr enrollment baseline survey, creating the largest u.s. health registry. while studies have begun to characterize / bealth impacts, questions on long-term impacts remain that require additional studies involving carefully selected populations, long-term follow-up and appropriate physical exams and laboratory tests. wtchr provides an exposed population directory valuable for such studies with features that make ita unique resource: (a) a large diverse population of residents, school children/staff, people in lower man· hattan on / including occupants of damaged/destroyed buildings, and rescue/recovery/cleanup work· ers; (b) consent by % of enrollees to receive information about / -related health studies; (c) represenration of many groups not well-studied by other researchers; (c) email addresses of % of enrollees; (d) % of enrollees recruited from lists with denominator estimates; and (e) available com· parison data for nyc residents. wfchr strives to maintain up-to-date contact information for all enrollees, an interested pool of potential study participants. follow-up surveys are planned.methods: to promote the wtchr as a public health resource, guidelines for external researcher.; were developed and posted on (www.wtcregistry.org) which include a short application form, a twopage proposal and documentation of irb approval. proposals are limited to medical, public health, or other scientific research. researchers can request de-identified baseline data or have dohmh send information about their studies to selected wfchr enrollees via mail or email. applications are scored by the wtchr review committee, comprised of representatives from dohmh, the agency for toxic subst~nces and disease registry, and wtchr's scientific, community and labor advisory committees. a data file users manual will be available in early fall .~suits: three external applications have been approved in , including one &om a non-u.s. ~esearcher, all requesting information to be sent to selected wtchr enrollees. the one completed mail· mg~~ wtchr enrollee~ (o , wfc tower evacuees) generated a positive survey response rate. three additional researchers mtend to submit applications in . wfchr encourages collaborations between researchers and labor and community leaders.conclusion: studies involving wtchr enrollees will provide vital information about the long· term health consequences of / . 
wtchr-related research can inform communities, researcher.;, policy makers, health care providers and public health officials examining and reacting to and other disasters. t .,. dp'"f'osed: thi is presentation will discuss the findings of attitudes toward the repeat male client iden· ie as su e a and substance us'n p · · · · i · 'd . . - g. articipants will learn about some identified effective strategies or service prov ers to assist this group of i · f men are oft · d bl men. n emergency care settings, studies show that this group en viewe as pro emaric patient d i r for mental health p bl h h an are more ikely to be discharged without an assessmen !) ea rofr ems t. an or er, more cooperative patients (forster and wu · hickey er al., · r y resu ts om this study suggest th · · ' ' l · d tel' mining how best to h . d at negative amtudes towards patients, difficu nes e · as well pathways l_e_ p patientsblan ~ck of conrinuity of care influence pathways to mental health care. • uc\:ome pro emat c when p ti k · che system. m a ems present repeatedly and become "get stuc id methods: semi-structured intervie d . · (n= ), ed nurses (n= ) other ed ;s were con ucted with male ed patients (n= ), ed phys oans ' sta (n= ) and family physicians (n= ). patients also completed a poster sessions v diagnostic interview. interviews were tape-recorded, transcribed verbatim and managed using n . transcripts were coded using an iterative process and memos prepared capture emergent themes. ethics approval was obtained and all participants signed a detailed informed consent form.introduction: urban settings are particularly susceptible to the emergence and rapid spread of nt•w or rare diseases. the emergence of infectious diseases such as sars and increasing concerns over the next influenza pandemic has heightened interest in developing and using a surveillance systt·m which detects emerging public health problems early. syndromic surveillance systems, which use data b, scd on symptoms rather than disease, offer substantial potential for this by providing near-real-rime data which are linked to an automated warning system. in toronto, we are piloting syndromic data from the · emergency medical services (ems) database to examine how this information can be used on an ongoing basis for the early detection of syndromes including heat-related illness (hri), and influenza-like-illness (ill). this presentation will provide an outline of the planned desi_gn of this system and proposed evaluation. for one year, call codes which reflect heat-related illness or influenza-like-illness will be selected and searched for daily using software with a multifactorial algorithm. calls will he stratified by call code, extracted from the -ems database and transferred electronically to toronto public health. the data will be analyzed for clusters and aberrations from the expected with the realtime outbreak and disease surveillance (rods) system, a computer-based public health tool for the early detection of disease outbreaks. this -ems surveillance system will be assessed in terms of its specificity and sensitivity through comparisons with the well-established tracking systems already in place for hr! and ill. others sources of data including paramedic ambulance call reports of signs and . this study will introduce complementary data sources t~ the ed ch e~ complamt an~ o~~rthe-counter pharmacy sales syndromic surveillance data currently bemg evaluated m ~ther ontar~o cltles. . 
syndromic surveillance is a unique approach to proactive(~ dete~tmg early c.hangesm the health status of urban communities. the proposed study aims to provide evidence of differential effectiveness through investigating the use of -ems call data as a source of syndromic surveillance information for hr! and ili in toronto. introduction: there is strong evidence that primary care interventions, including screening, brief advi<:e, treatment referral and pharmacotherapy are effective in reducing morbidity and mortality caused by substance abuse. yet physicians are poor at intervening with substance users, in part because of lack of time, training and support. this study examines the hypothesis that shared care in addictions will result in decreased substance use and improved health status of patients, as well as increased use of primary care interventions by primary care practitioners (pcps). methods: the addiction medicine service (ams) at st. joseph's health centre's family medicine department is in the process of being transformed from its current structure as a traditional consult service into a shared care model called addiction shared care (asc). the program will have three components: education, office systems and clinical shared care. as opposed to a traditional consult service, the patient will be booked with both a primary care liaison worker (pcl) and addictions physician. patients referred from community physicians, the emergency department and inpatient medical and psychiatric wards will be recruited for the study as well as pcps from the surrounding community. the target sample size is - physicians and a similar number of patients. after initial consult, patient will be recruited into the study with their consent. the shared-case model underlines the interaction and collaboration with the patient's main pcp. asc will provide them with telephone consults, advice, support and re-assessment for their patients, as well as educational sessions and materials such as newsletters and informational kits.results: the impact of this transition on our patient care and on pcp's satisfaction with the asc model is currently being evaluated through a grant provided by the ministry of health & long term care. a retrospective chart review will be conducted using information on the patient's substance use, er/clinic visits, and their health/mood status. pcp satisfaction with the program will be measured through surveys and focus groups. our cost-effectiveness analysis will calculate the overall cost of the program per patient..conclusion: this low-cost service holds promise to serve as an optimal model and strategy to improve outcomes and reduce health care utilization in addict patients. the inner city public health project introduction the inner city public health project (icphp) was desi.gned to explore new an~ innovative ways to reach marginalized inner city populations that par-t c pate m high health-nsk beha~ ors. much of this population struggles with poverty, addictions, mental illness and homelessness, creatmg barriers to accessing health services and receiving follow-up. this pro ect was de~igned to evaluat~ .~e success of offering clinics in the community for testing and followup of communicable diseases uuhzmg an aboriginal outreach worker to build relationships with individuals and agencies. 
v n(demographics~ history ~f testi~g ~nd immunization and participation in various health-risk behaviors), records of tesnng and mmumzat ons, and mterviews with partner agency and project staff after one year.. results: t~e chr ~as i~strumental in building relationships with individuals and partner agencies ' .° the c~mmun_ ~ re_sultmg m req~ests for on-site outreach clinics from many of the agencies. the increase m parnc pat n, the chr mvolvement in the community, and the positive feedback from the agen? staff de~onstrated that.the project was successfully creating partnerships and becoming increasingly integrated m the community. data collected from clients at the initial visit indicated that the project was reaching its target populations and highlighted the unique health needs of clients, the large unmet need for health services and the barriers that exist to accessing those services. ~usion: the outreach clinics were successful at providing services to target populations of high health-nsk groups and had great support from the community agencies. the role of the chr was critical to the success of the project and proved the value of this category of health care worker in an urban aboriginal population. the unmet health needs of this disadvantaged population support the need for more dedicated resources with an emphasis on reducing access barriers. building a caring community old strathcona's whyte avenue, a district in edmonton, brought concern about increases in the population of panhandlers, street people and homeless persons to the attention of all levels of government. the issue was not only the problems of homelessness and related issues, but feelings of insecurity and disempowerment by the neighbourhood residents and businesses. their concerns were acknowledged, and civic support was offered, but it was up to the community itself to solve the problem. within a year of those meetings, an adult outreach worker program was created. the outreach worker, meets people in their own environments, including river valley camps. she provides wrap-around services rooted in harm reduction and health promotion principles. her relationship-based practice establishes the trust for helping clients with appropriate housing, physical and/or mental health issues, who have little or no income and family support to transition from homelessness. the program is an excellent example of collaboration that has been established with the businesses, community residents, community associations, churches, municiple services, and inner-city agencies such as boyle street community services. statistics are tracked using the canadian outcomes research institute homes database, and feedback from participants, including people who are street involved. this includes an annual general meeting for community and people who are homeless. the program's holistic approach to serving the homeless population has been integral, both in creating community awareness and equipping residents and businesses to effectively interact with people who are homeless. through this community development work, the outreach worker engages old strathcona in meeting the financial and material needs of the marginalized community. the success of this program has been surprising -the fact is that homeless people's lives are being changed; one person at a time and the community has been changed in how they view and treat those without homes. 
over two years, the program has successfully connected with approximately seventy-five individuals who call old strathcona home, but are homeless. thirty-six individuals are now in homes, while numerous others have been assisted in obtaining a healthier and safer lifestyles by becoming connected with other social/health agencies. the program highlights the roots of homelessness, barriers to change and requirements for success. it has been a thriving program and a model that works by showing how a caring community has rallied together to achieve prosperous outcomes. the spn has created models of tb service delivery to be used m part~ers~ p with phannaceunca compa-. · · -. t' ns cooperatives and health maintenance orgamzanons (hmos). for example, the mes, c v c orgamza , . · b tb d' · spn has established a system with pharmaceutical companies that help patients to uy me cmes at a special discounted rate. this scheme also allows patients to get a free one-_month's worth of~ dru_g supply if they purchase the first months of their regimen. the sy_s~e~s were ~es gned to be cm~pattble with existing policies for recording and documentation of the ph hppme national tuberculosis program (ntp). aside from that, stakeholders were also encouraged to be dots-enabled through the use of m~nual~ and on-line training courses. the spn initiative offers an alternative in easing the burden of tb sc:rv ce delivery from rhe public sector through the harnessing of existing private-sector (dsos). the learnmgs from the spn experience would benefit groups from other locales that _work no~ only on ~ but other health concerns as well. the spn experience showcases how well-coordinated private sector involvements help promote social justice in health delivery in urban communities.p - (c) young people in control; doing it safe. the safe sex comedy juan walter and pepijn v. empelen introduction: high prevalence of chlamydia and gonorrhoea have been reported among migrants youth in amsterdam, originating from the dutch antilles, suriname and sub-sahara africa. in addition, these groups also have high rates of teenage-pregnancy (stuart, ) and abortions (rademakers ), indicating unsafe sexual behaviour of these young people. young people (aged - ) from the so· called urban scene (young trendsetters in r&b/hip hop music and lifestyle) in amsterdam have been approached by the municipal health service (mhs) to collaborate on a safe sex project. their input was to use comedy as vehicle to get the message a cross. for the mhs this collaboration was a valuable opportunity to reach a hard-to-reach group.mdhods: first we conducted a need assessment by means of a online survey to assess basic knowledge and to similtaneously examine issues of interest concerning sex, sexuality, safer sex and the opposite sex. second, a small literature study was conducted about elements and essential conditions for succesful entertainment & education (e&e) (bouman ), with as most important condition to ensure that the message is realistic (buckingham & bragg, ) . third a program plan was developed aiming at enhancing the stl/hiv and sexuality knowledge of the young people and addressing communication and educational skills, by means of drama. subsequently a safe sex comedy show was developed, with as main topics: being in love, sexuality, empowerment, stigma, sti, hivand safer sex. the messages where carried by a mix of video presentation, stand up comedy, spoken word, rap and dance.results: there have been two safe sex comedy shows. 
the attendance was good; the group was divers' with an age range between and year, with the majority being younger than year. more women than men attended the show. the story lines were considered realistic and most of the audients recognised the situations displayed. eighty percent of the audients found the show entertaining and % found it edm:arional. from this %, one third considers the information as new. almost all respondents pointed our that they would promote this show to their friends.con.clusion: the s.h<_>w reached the hard-to-reach group of young people out of the urban scene and was cons d_ered entert~mmg, educational and realistic. in addition, the program was able in addressing important issues, and impacted on the percieved personal risk of acquiring an sti when not using condoms, as well as on basic knowledge about stl's. introduction: modernity has contributed mightily to the marginality of adults who live with mental illnesses and the subsequent denial of opportunities to them. limited access to social, vocational, educational, and residential opportunities exacerbates their disenfranchisement, strengthening the stigma that has been associated with mental illness in western society, and resulting in the denial of their basic human rights and their exclusion from active participation in civil society. the clubhouse approach tn recovery has led to the reduction of both marginality and stigma in every locale in which it has been implemented judiciously. its elucidation via the prism of social justice principles will lead to a deeper appreciation of its efficacy and relevance to an array of settings. methods: a review of the literature on social justice and mental health was conducted to determine core principles and relationships between the concepts. in particular, fondacaro and weinberg's ( ) conceptualization of social justice in community psychology suggests the desirability of the clubhouse approach in community mental health practice. a review of clubhouse philosophy and practice has led to the inescapable conclusion that there is a strong connection between clubhouse philosophy-which represents a unique approach co recovery--and social justice principles. placing this highly effective model of community mental health practice within the context of these principles is long overdue. via textual analysis, we will glean the principles of social justice inherent in the rich trove of clubhouse literature, particularly the international standards of clubhouse development.results: fondacarao and weinberg highlight three primary social justice themes within their community psychology framework: prevention and health promotion; empowerment, and a critical pnsp<"<·tive. utilizing the prescriptive principles that inform every detail of clubhouse development and th<" movement toward recovery for individuals at a fully-realized clubhouse, this presentation asserts that both clubhouse philosophy and practice embody these social justice themes, promote human rights, and empower clubhouse members, individuals who live with mental illnesses, to achieve a level of wdl-heing and productivity previously unimagined.conclusion: a social justice framework is critical to and enhances an understanding of the clubhouse model. this model creates inclusive communities that lead to opportunities for full partic pil!ion civil society of a previously marginalized group. 
the implication is that clubhouses that an· based on the international standards for clubhouse programs offer an effective intervention strategy to guarantee the human rights of a sizable, worthy, and earnest group of citizens. to a drastic increase in school enrollment from . million in to . million in .s. however, while gross enrollment rates increased to °/., in the whole country after the introduction of fpe, it remained conspicuously low at % in the capital city, nairobi. nairobi city's enrollment rate is lower .than thatof all regions in the country except the nomadic north-eastern province. !h.e.d sadvantage of children bas_ed in the capital city was also noted in uganda after the introduction of fpe m the late s_-many_ education experts in kenya attribute the city's poor performance to the high propornon of children hvmg m slums, which are grossly underserved as far as social services are concerned. this paper ~xammes the impact of fpe and explores reasons for poor enrollment in informal settlements m na rob city. methods: the study utilizes quantitative and qualitative schooling data from the longitudinal health and demographic study being implemented by the african population and health research center in two informal settlements in nairobi. descriptive statistics are used to depict trends in enrollment rates for children aged - years in slum settlements for the period - . results: the results show that school enrollment has surprisingly steadily declined for children aged - while it increased marginally for those aged - . the number of new enrollments (among those aged years) did not change much between and while it declined consistently among those aged - since . these results show that the underlying reasons for poor school attendance in poverty-stricken populations go far beyond the lack of school fees. indeed, the results show that lack of finances (for uniform, transportation, and scholastic materials) has continued to be a key barrier to schooling for many children in slums. furthermore, slum children have not benefited from fpe because they mostly attend informal schools since they do not have access to government schools where the policy is being implemented.conclusion: the results show the need for equity considerations in the design and implementation of the fpe program in kenya. without paying particular attention to the schooling needs of the urban poor children, the millennium development goal aimed at achieving universal primary education will remain but a pipe dream for the rapidly increasing number of children living in poor urban neighborhoods.ps- (c) programing for hiv/aids in the urban workplace: issues and insights joseph kamoga hiv/aids has had a major effect on the workforce. according to !lo million persons who are engaged in some form of production are affectefd by hiv/aids. the working class mises out on programs that take place in communities, yet in a number of jobs, there are high risks to hiv infection. working persons sopend much of their active life time in workplaces and that is where they start getting involved in risky behaviour putting entire families at risk. and when they are infected with hiv, working people face high levels of seclusion, stigmatisation and some miss out on benefits especially in countries where there are no strong workplace programs. adressing hiv/aids in the workplace is key for sucessfull responses. 
this paper presents a case for workplace programing; the needs, issues and recommendations especially for urban places in developing countries where the private sector workers face major challenges. key: cord- -dbgs ado authors: rieke, nicola; hancox, jonny; li, wenqi; milletari, fausto; roth, holger; albarqouni, shadi; bakas, spyridon; galtier, mathieu n.; landman, bennett; maier-hein, klaus; ourselin, sebastien; sheller, micah; summers, ronald m.; trask, andrew; xu, daguang; baust, maximilian; cardoso, m. jorge title: the future of digital health with federated learning date: - - journal: nan doi: nan sha: doc_id: cord_uid: dbgs ado data-driven machine learning has emerged as a promising approach for building accurate and robust statistical models from medical data, which is collected in huge volumes by modern healthcare systems. existing medical data is not fully exploited by ml primarily because it sits in data silos and privacy concerns restrict access to this data. however, without access to sufficient data, ml will be prevented from reaching its full potential and, ultimately, from making the transition from research to clinical practice. this paper considers key factors contributing to this issue, explores how federated learning (fl) may provide a solution for the future of digital health and highlights the challenges and considerations that need to be addressed. research on artificial intelligence (ai) has enabled a variety of significant breakthroughs over the course of the last two decades. in digital healthcare, the introduction of powerful machine learning-based and particularly deep learning-based models [ ] has led to disruptive innovations in radiology, pathology, genomics and many other fields. in order to capture the complexity of these applications, modern deep learning (dl) models feature a large number (e.g. millions) of parameters that are learned from and validated on medical datasets. sufficiently large corpora of curated data are thus required in order to obtain models that yield clinical-grade accuracy, whilst being safe, fair, equitable and generalising well to unseen data [ , , ] . for example, training an automatic tumour detector and diagnostic tool in a supervised way requires a large annotated database that encompasses the full spectrum of possible anatomies, pathological patterns and types of input data. data like this is hard to obtain and curate. one of the main difficulties is that unlike other data, which may be shared and copied rather freely, health data is highly sensitive, subject to regulation and cannot be used for research without appropriate patient consent and ethical approval [ ] . even if data anonymisation is sometimes proposed as a way to bypass these limitations, it is now wellunderstood that removing metadata such as patient name or date of birth is often not enough to preserve privacy [ ] . imaging data suffers from the same issue -it is possible to reconstruct a patient's face from three-dimensional imaging data, such as computed tomography (ct) or magnetic resonance imaging (mri). also the human brain itself has been shown to be as unique as a fingerprint [ ] , where subject identity, age and gender can be predicted and revealed [ ] . another reason why data sharing is not systematic in healthcare is that medical data are potentially highly valuable and costly to acquire. collecting, curating and maintaining a quality dataset takes considerable time and effort. 
these datasets may have a significant business value and so are not given away lightly. in practice, openly sharing medical data is often restricted by data collectors themselves, who need fine-grained control over the access to the data they have gathered. federated learning (fl) [ , , ] is a learning paradigm that seeks to address the problem of data governance and privacy by training algorithms collaboratively without exchanging the underlying datasets. the approach was originally developed in a different domain, but it recently gained traction for healthcare applications because it neatly addresses the problems that usually exist when trying to aggregate medical data. applied to digital health this means that fl enables insights to be gained collaboratively across institutions, e.g. in the form of a global or consensus model, without sharing the patient data. in particular, the strength of fl is that sensitive training data does not need to be moved beyond the firewalls of the institutions in which they reside. instead, the machine learning (ml) process occurs locally at each participating institution and only model characteristics (e.g. parameters, gradients etc.) are exchanged. once training has been completed, the trained consensus model benefits from the knowledge accumulated across all institutions. recent research has shown that this approach can achieve a performance that is comparable to a scenario where the data was co-located in a data lake and superior to the models that only see isolated singleinstitutional data [ , ] . for this reason, we believe that a successful implementation of fl holds significant potential for enabling precision medicine at large scale. the scalability with respect to patient numbers included for model training would facilitate models that yield unbiased decisions, optimally reflect an individual's physiology, and are sensitive to rare diseases in a way that is respectful of governance and privacy concerns. whilst fl still requires rigorous technical consideration to ensure that the algorithm is proceeding optimally without compromising safety or patient privacy, it does have the potential to overcome the limitations of current approaches that require a single pool of centralised data. the aim of this paper is to provide context and detail for the community regarding the benefits and impact of fl for medical applications (section ) as well as highlighting key considerations and challenges of implementing fl in the context of digital health (section ). the medical fl use-case is inherently different from other domains, e.g. in terms of number of participants and data diversity, and while recent surveys investigate the research advances and open questions of fl [ , , ] , we focus on what it actually means for digital health and what is needed to enable it. we envision a federated future for digital health and hope to inspire and raise awareness with this article for the community. ml and especially dl is becoming the de facto knowledge discovery approach in many industries, but successfully implementing data-driven applications requires that models are trained and evaluated on sufficiently large and diverse datasets. these medical datasets are difficult to curate (section . ). fl offers a way to counteract this data dilemma and its associated governance and privacy concerns by enabling collaborative learning without centralising the data (section . ). 
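to make the exchange pattern described above concrete, the following is a minimal, purely illustrative python sketch (not taken from any of the systems cited in this paper): an institution runs a local optimisation step on its private data and shares only the resulting model parameters with the federation. the site name, dataset and model are hypothetical and deliberately simple.

    # a minimal, illustrative sketch: the institution trains a simple linear
    # model locally and shares ONLY the updated parameters, never the data.
    import numpy as np

    def local_update(weights, X_local, y_local, lr=0.01, epochs=5):
        """run a few epochs of gradient descent on private local data and
        return only the refined parameters (the data never leaves the site)."""
        w = weights.copy()
        for _ in range(epochs):
            preds = X_local @ w                                   # linear model prediction
            grad = X_local.T @ (preds - y_local) / len(y_local)   # mean-squared-error gradient
            w -= lr * grad
        return w                                                  # only parameters are shared

    # hypothetical institution with private data (names are illustrative only)
    rng = np.random.default_rng(0)
    X_hospital_a = rng.normal(size=(100, 5))
    y_hospital_a = rng.normal(size=100)
    global_weights = np.zeros(5)
    shared_update = local_update(global_weights, X_hospital_a, y_hospital_a)
    print("parameters shared with the federation:", shared_update)

the essential point of the sketch is that the return value of local_update is all that ever leaves the institution; the raw records stay behind the local firewall.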
this learning paradigm, however, requires consideration from and offers benefits to the various stakeholders of the healthcare environment (section . ). all these points will be discussed in this section. figure . fl architectures. (a) centralised architecture via an aggregation server: a parameter server distributes the model and each node trains a local model for several iterations, after which the updated models are returned to the parameter server for aggregation; this consensus model is then redistributed for subsequent iterations. (b) decentralised architecture via peer-to-peer: rather than using a parameter server, each node broadcasts its locally trained model to some or all of its peers and each node does its own aggregation. (c) hybrid architecture: federations can be composed into a hierarchy of hubs and spokes, which might represent regions, health authorities or countries. data-driven approaches rely on datasets that truly represent the underlying data distribution of the problem to be solved. whilst the importance of comprehensive and encompassing databases is a well-known requirement to ensure generalisability, state-of-the-art algorithms are usually evaluated on carefully curated datasets, often originating from a small number of sources -if not a single source. this implies major challenges: pockets of isolated data can introduce sample bias in which demographic (e.g. gender, age etc.) or technical imbalances (e.g. acquisition protocol, equipment manufacturer) skew the predictions, adversely affecting the accuracy of prediction for certain groups or sites. the need for sufficiently large databases for ai training has spawned many initiatives seeking to pool data from multiple institutions. large initiatives have so far primarily focused on the idea of creating data lakes. these data lakes have been built with the aim of leveraging either the commercial value of the data, as exemplified by ibm's merge healthcare acquisition [ ] , or as a resource for economic growth and scientific progress, with examples such as nhs scotland's national safe haven [ ] , the french health data hub [ ] and health data research uk [ ] . substantial, albeit smaller, initiatives have also made data available to the general community such as the human connectome [ ] , uk biobank [ ] , the cancer imaging archive (tcia) [ ] , nih cxr [ ] , nih deeplesion [ ] , the cancer genome atlas (tcga) [ ] , the alzheimer's disease neuroimaging initiative (adni) [ ] , or as part of medical grand challenges such as the camelyon challenge [ ] , the multimodal brain tumor image segmentation benchmark (brats) [ ] or the medical segmentation decathlon [ ] . public data is usually task- or disease-specific and often released with varying degrees of license restrictions, sometimes limiting its exploitation. regardless of the approach, the availability of such data has the potential to catalyse scientific advances, stimulate technology start-ups and deliver improvements in healthcare. centralising or releasing data, however, poses not only regulatory and legal challenges related to ethics, privacy and data protection, but also technical ones -safely anonymising, controlling access, and transferring healthcare data is a non-trivial, and often impossible, task. as an example, anonymised data from the electronic health record can appear innocuous and gdpr/phi compliant, but just a few data elements may allow for patient reidentification [ ] . the same applies to genomic data and medical images, with their high-dimensional nature making them as unique as one's fingerprint [ ] .
therefore, unless the anonymisation process destroys the fidelity of the data, likely rendering it useless, patient reidentification or information leakage cannot be ruled out. gated access, in which only approved users may access specific subsets of data, is often proposed as a putative solution to this issue. however, not only does this severely limit data availability, it is only practical for cases in which the consent granted by the data owners or patients is unconditional, since recalling data from those who may have had access to the data is practically unenforceable. the promise of fl is simple -to address privacy and governance challenges by allowing algorithms to learn from non co-located data. in a fl setting, each data controller not only defines their own governance processes and associated privacy considerations, but also, by not allowing data to move or to be copied, controls data access and the possibility to revoke it. so the potential of fl is to provide controlled, indirect access to the large and comprehensive datasets needed for the development of ml algorithms, whilst respecting patient privacy and data governance. it should be noted that this includes both the training as well as the validation phase of the development. in this way, fl could create new opportunities, e.g. by allowing large-scale validation across the globe directly in the institutions, and enable novel research on, for example, rare diseases, where the incidence rates are low and it is unlikely that a single institution has a dataset that is sufficient for ml approaches. moving the to-be-trained model to the data instead of collecting the data in a central location has another major advantage: the high-dimensional, storage-intensive medical data does not have to be duplicated from local institutions in a centralised pool and duplicated again by every user that uses this data for local model training. in a fl setup, only the model is transferred to the local institutions and can scale naturally with a potentially growing global dataset without replicating the data or multiplying the data storage requirements. some of the promises of fl are implicit: a certain degree of privacy is provided since other fl participants never directly access the data from other institutions and only receive the resulting model parameters that are aggregated over several participants. and in a client-server architecture (see figure ), in which a federated server manages the aggregation and distribution, the participating institutions can even remain unknown to each other. however, it has been shown that the models themselves can, under certain conditions, memorise information [ , , , ] . therefore the fl setup can be further enhanced by privacy protections using mechanisms such as differential privacy [ , ] or learning from encrypted data (cf. section ). and fl techniques are still a very active area of research [ ] . all in all, a successful implementation of fl will represent a paradigm shift from centralised data warehouses or lakes, with a significant impact on the various stakeholders in the healthcare domain. if fl is indeed the answer to the challenge of healthcare ml at scale, then it is important to understand who the various stakeholders are in a fl ecosystem and what they have to consider in order to benefit from it.
figure . example fl compute plans. (a) centralised training: the aggregation may happen on one of the training nodes or on a separate parameter server node, which would then redistribute the consensus model. (b) peer-to-peer training: nodes broadcast their model updates to one or more nodes in the federation and each does its own aggregation; cyclic training happens when model updates are passed to a single neighbour one or more times, round-robin style. (c) hybrid training: federations, perhaps in remote geographies, can be composed into a hierarchy and use different communication/aggregation strategies at each tier; in the illustrated case, three federations of varying size periodically share their models using a peer-to-peer approach, and the consensus model is then redistributed to each federation and each node therein. clinicians are usually exposed to only a certain subgroup of the population based on the location and demographic environment of the hospital or practice they are working in. therefore, their decisions might be based on biased assumptions about the probability of certain diseases or their interconnection. by using ml-based systems, e.g. as a second reader, they can augment their own expertise with expert knowledge from other institutions, ensuring a consistency of diagnosis not attainable today. whilst this promise is generally true for any ml-based system, systems trained in a federated fashion are potentially able to yield even less biased decisions and higher sensitivity to rare cases, as they are likely to have seen a more complete picture of the data distribution. being an active part of, or benefiting from, the federation, however, demands some up-front effort, such as compliance with agreements, e.g. regarding the data structure, annotation and report protocol, which is necessary to ensure that the information is presented to collaborators in a commonly understood format. patients usually rely on local hospitals and practices. establishing fl on a global scale could ensure higher quality of clinical decisions regardless of the location of the deployed system. for example, patients who need medical attention in remote areas could benefit from the same high-quality ml-aided diagnosis that is available in hospitals with a large number of cases. the same advantage applies to patients suffering from rare, or geographically uncommon, diseases, who are likely to have better outcomes if faster and more accurate diagnoses can be made. fl may also lower the hurdle for becoming a data donor, since patients can be reassured that the data remains with the institution and data access can be revoked. hospitals and practices can remain in full control and possession of their patient data with complete traceability of how the data is accessed. they can precisely control the purpose for which a given data sample is going to be used, limiting the risk of misuse when they work with third parties. however, participating in federated efforts will require investment in on-premise computing infrastructure or private-cloud service provision. the amount of necessary compute capability depends, of course, on whether a site is only participating in evaluation and testing efforts or also in training efforts. even relatively small institutions can participate, since enough of them will generate a valuable corpus and they will still benefit from the collective models generated. one of the drawbacks is that fl strongly relies on the standardisation and homogenisation of the data formats so that predictive models can be trained and evaluated seamlessly. this involves significant standardisation efforts from data managers.
researchers and ai developers who want to develop and evaluate novel algorithms stand to benefit from access to a potentially vast collection of real-world data. this will especially impact smaller research labs and start-ups, who would be able to directly develop their applications on healthcare data without the need to curate their own datasets. by introducing federated efforts, precious resources can be directed towards solving clinical needs and associated technical problems rather than relying on the limited supply of open datasets. at the same time, it will be necessary to conduct research on algorithmic strategies for federated training, e.g. how to combine models or updates efficiently, how to be robust to distribution shifts, etc., as highlighted in the technical survey papers [ , , ] . and a fl-based development implies that the researcher or ai developer cannot investigate or visualise all of the data on which the model is trained. it is for example not possible to look at an individual failure case to understand why the current model performs poorly on it. healthcare providers in many countries are affected by the ongoing paradigm shift from volume-based, i.e. fee-for-service-based, to value-based healthcare. a value-based reimbursement structure is in turn strongly connected to the successful establishment of precision medicine. this is not about promoting more expensive individualised therapies but instead about achieving better outcomes sooner through more focused treatment, thereby reducing the costs for providers. by way of example, with sufficient data, ml approaches can learn to recognise cancer-subtypes or genotypic traits from radiology images that could indicate certain therapies and discount others. so, by providing exposure to large amounts of data, fl has the potential to increase the accuracy and robustness of healthcare ai, whilst reducing costs and improving patient outcomes, and is therefore vital to precision medicine. manufacturers of healthcare software and hardware could benefit from federated efforts and infrastructures for fl as well, since combining the learning from many devices and applications, without revealing anything patient-specific can facilitate the continuous improvement of ml-based systems. this potentially opens up a new source of data and revenue to manufacturers. however, hospitals may require significant upgrades to local compute, data storage, networking capabilities and associated software to enable such a use-case. note, however, that this change could be quite disruptive: fl could eventually impact the business models of providers, practices, hospitals and manufacturers affecting patient data ownership; and the regulatory frameworks surrounding continual and fl approaches are still under development. fl is perhaps best-known from the work of konečnỳ et al. [ ] , but various other definitions have been proposed in literature [ , , , ] . these approaches can be realised via different communication architectures (see figure ) and respective compute plans (see figure ). the main goal of fl, however, remains the same: to combine knowledge learned from non co-located data, that resides within the participating entities, into a global model. whereas the initial application field mostly comprised mobile devices, participating entities in the case of healthcare could be institutions storing the data, e.g. hospitals, or medical devices itself, e.g. a ct scanner or even low-powered devices that are able to run computations locally. 
it is important to understand that this domain shift to the medical field implies different conditions and requirements. for example, in the case of the federated mobile device application, potentially millions of participants could contribute, but it would be impossible to have the same scale of consortium in terms of participating hospitals. on the other hand, medical institutions may rely on more sophisticated and powerful compute infrastructure with stable connectivity. another aspect is that the variation in terms of data type and defined tasks as well as acquisition protocol and standardisation in healthcare is significantly higher than that of the pictures and messages seen in other domains. the participating entities have to agree on a collaboration protocol, and the high-dimensional medical data that is predominant in the field of digital health poses challenges by requiring models with huge numbers of parameters. this may become an issue in scenarios where the available bandwidth for communication between participants is limited, since the model has to be transferred frequently. and even though data is never shared during fl, considerations about the security of the connections between sites as well as mitigation of data leakage risks through model parameters are necessary. in this section, we will discuss in more detail what fl is and how it differs from similar techniques, as well as highlighting the key challenges and technical considerations that arise when applying fl in digital health.

fl is a learning paradigm in which multiple parties train collaboratively without the need to exchange or centralise datasets. although various training strategies have been implemented to address specific tasks, a general formulation of fl can be formalised as follows: let $\mathcal{L}$ denote a global loss function obtained via a weighted combination of $K$ local losses $\{\mathcal{L}_k\}_{k=1}^{K}$, computed from private data $X_k$ residing at the individual involved parties:

$$\min_{\phi} \mathcal{L}(X;\phi) \quad \text{with} \quad \mathcal{L}(X;\phi) = \sum_{k=1}^{K} w_k \, \mathcal{L}_k(X_k;\phi),$$

where $w_k > 0$ denote the respective weight coefficients. it is important to note that the data $X_k$ is never shared among parties and remains private throughout learning. in practice, each participant typically obtains and refines the global consensus model by running a few rounds of optimisation on their local data and then shares the updated parameters with its peers, either directly or via a parameter server. the more rounds of local training are performed without sharing updates or synchronisation, the less it is guaranteed that the actual procedure is minimising the equation ( ) [ , ]. the actual process used for aggregating parameters commonly depends on the fl network topology, as fl nodes might be segregated into sub-networks due to geographical or legal constraints (see figure ). aggregation strategies can rely on a single aggregating node (hub and spokes models), or on multiple nodes without any centralisation. an example of this is peer-to-peer fl, where connections exist between all or a subset of the participants and model updates are shared only between directly-connected sites [ , ]. an example of centralised fl aggregation with a client-server architecture is given in algorithm .
algorithm : example of a fl algorithm [ ] in a client-server architecture with aggregation via fedavg [ ].
    require: num federated rounds t
    procedure aggregating
        initialise global model
        for t ← · · · t do
            for client k ← · · · k do    (run in parallel)
                [. . .]
            end for
        end for
        return w(t)
    end procedure

note that aggregation strategies do not necessarily require information about the full model update; clients might choose to share only a subset of the model parameters for the sake of reducing the communication overhead of redundant information, ensuring better privacy preservation [ ] or producing multi-task learning algorithms that have only part of their parameters learned in a federated manner. a unifying framework enabling various training schemes may disentangle compute resources (data and servers) from the compute plan, as depicted in figure . the latter defines the trajectory of a model across several partners, to be trained and evaluated on specific datasets. for more details regarding the state of the art of fl techniques, such as aggregation methods, optimisation or model compression, we refer the reader to the overview by kairouz et al. [ ].

fl is rooted in older forms of collaborative learning where models are shared or compute is distributed [ , ]. transfer learning, for example, is a well-established approach of model-sharing that makes it possible to tackle problems with deep neural networks that have millions of parameters, despite the lack of extensive, local datasets that are required for training from scratch: a model is first trained on a large dataset and then further optimised on the actual target data. the dataset used for the initial training does not necessarily come from the same domain or even the same type of data source as the target dataset. this type of transfer learning has shown better performance [ , ] when compared to strategies where the model had been trained from scratch on the target data only, especially when the target dataset is comparably small. it should be noted that, similarly to a fl setup, the data is not necessarily co-located in this approach. for transfer learning, however, the models are usually shared acyclically, e.g. using a pre-trained model to fine-tune it on another task, without contributing to a collective knowledge-gain. and, unfortunately, deep learning models tend to "forget" [ , ]. therefore, after a few training iterations on the target dataset, the initial information contained in the model is lost [ ]. to adopt this approach into a form of collaborative learning in a fl setup with continuous learning from different institutions, the participants can share their model with a peer-to-peer architecture in a "round-robin" or parallel fashion and train in turn on their local data. this yields better results when the goal is to learn from diverse datasets. a client-server architecture in this scenario enables learning on multi-party data at the same time [ ], possibly even without forgetting [ ]. there are also other collaborative learning strategies [ , ] such as ensembling, a statistical strategy of combining multiple independently trained models or predictions into a consensus, or multi-task learning, a strategy to leverage shared representations for related tasks. these strategies are independent of the concept of fl, and can be used in combination with it. the second characteristic of fl, to distribute the compute, has been well studied in recent years [ , , ].
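to complement the pseudocode above, the following is a minimal, self-contained sketch of a fedavg-style client-server training loop. the clients, datasets, learning rate and number of rounds are hypothetical, and weighting client updates by local dataset size is one common choice rather than a prescribed one.

    # minimal sketch of a fedavg-style client-server round; all values hypothetical
    import numpy as np

    rng = np.random.default_rng(1)

    def local_update(w_global, X, y, lr=0.05, local_epochs=5):
        # client-side: refine the received global weights on private data
        w = w_global.copy()
        for _ in range(local_epochs):
            grad = X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w, len(y)

    # hypothetical private datasets held by three clients
    clients = {f"client_{k}": (rng.normal(size=(50 * (k + 1), 4)),
                               rng.normal(size=50 * (k + 1)))
               for k in range(3)}

    w_global = np.zeros(4)
    for round_t in range(10):                  # federated rounds
        updates, sizes = [], []
        for name, (X, y) in clients.items():   # would run in parallel in practice
            w_k, n_k = local_update(w_global, X, y)
            updates.append(w_k)
            sizes.append(n_k)
        # server-side aggregation: average weighted by local dataset size
        weights = np.array(sizes) / sum(sizes)
        w_global = sum(wk * uk for wk, uk in zip(weights, updates))

    print("global model after federated training:", w_global)

running more local epochs between synchronisations reduces communication, but, as noted above, it also weakens the guarantee that the procedure is minimising the global objective.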
nowadays, the training of the large-scale models is often executed on multiple devices and even multiple nodes [ ] . in this way, the task can be parallelised and enables fast training, such as training a neural network on the extensive dataset of the imagenet project in hour [ ] or even in less than seconds [ ] . it should be noted that in these scenarios, the training is realised in a cluster environment, with centralised data and fast network communication. so, distributing the compute for training on several nodes is feasible and fl may benefit from the advances in this area. compared to these approaches, however, fl comes with a significant communication and synchronisation cost. in the fl setup, the compute resources are not as closely connected as in a cluster and every exchange may introduce a significant latency. therefore, it may not be suitable to synchronise after every batch, but to continue local training for several iterations before aggregation. we refer the reader to the survey by xu et al. [ ] for an overview of the evolution of federated learning and the different concepts in the broader sense. despite the advantages of fl, there are challenges that need to be taken into account when establishing federated training efforts. in this section, we discuss five key aspects of fl that are of particular interest in the context of its application to digital health. in healthcare, we work with highly sensitive data that must be protected accordingly. therefore, some of the key considerations are the trade-offs, strategies and remaining risks regarding the privacy-preserving potential of fl. privacy vs. performance. although one of the main purposes of fl is to protect privacy by sharing model updates rather than data, fl does not solve all potential privacy issues and -similar to ml algorithms in generalwill always carry some risks. strict regulations and data governance policies make any leakage, or perceived risk of leakage, of private information unacceptable. these regulations may even differ between federations and a catch-all solution will likely never exist. consequently, it is important that potential adopters of fl are aware of potential risks and state-of-the-art options for mitigating them. privacy-preserving techniques for fl offer levels of protection that exceed today's current commercially available ml models [ ] . however, there is a trade-off in terms of performance and these techniques may affect for example the accuracy of the final model [ ] . furthermore future techniques and/or ancillary data could be used to compromise a model previously considered to be low-risk. level of trust. broadly speaking, participating parties can enter two types of fl collaboration: trusted -for fl consortia in which all parties are considered trustworthy and are bound by an enforceable collaboration agreement, we can eliminate many of the more nefarious motivations, such as deliberate attempts to extract sensitive information or to intentionally corrupt the model. this reduces the need for sophisticated countermeasures, falling back to the principles of standard collaborative research. non-trusted -in fl systems that operate on larger scales, it is impractical to establish an enforceable collaborative agreement that can guarantee that all of the parties are acting benignly. some may deliberately try to degrade performance, bring the system down or extract information from other parties. 
in such an environment, security strategies will be required to mitigate these risks, such as encryption of model submissions, secure authentication of all parties, traceability of actions, differential privacy, verification systems, execution integrity, model confidentiality and protections against adversarial attacks.

information leakage. by definition, fl systems sidestep the need to share healthcare data among participating institutions. however, the shared information may still indirectly expose private data used for local training, for example by model inversion [ ] of the model updates, the gradients themselves [ ] or adversarial attacks [ , ]. fl is different from traditional training insofar as the training process is exposed to multiple parties. as a result, the risk of leakage via reverse-engineering increases if adversaries can observe model changes over time, observe specific model updates (i.e. a single institution's update), or manipulate the model (e.g. induce additional memorisation by others through gradient-ascent-style attacks). countermeasures, such as limiting the granularity of the shared model updates and adding specific noise to ensure differential privacy [ , , ], may be needed; this is still an active area of research [ ].

medical data is particularly diverse - not only in terms of type, dimensionality and characteristics of medical data in general but also within a defined medical task, due to factors like acquisition protocol, brand of the medical device or local demographics. this poses a challenge for fl algorithms and strategies: one of the core assumptions of many current approaches is that the data is independent and identically distributed (iid) across the participants. initial results indicate that fl training on medical non-iid data is possible, even if the data is not uniformly distributed across the institutions [ , ]. in general, however, strategies such as fedavg [ ] are prone to fail under these conditions [ , , ], in part defeating the very purpose of collaborative learning strategies. research addressing this problem includes, for example, fedprox [ ] and the part-data-sharing strategy [ ]. another challenge is that data heterogeneity may lead to a situation in which the global solution may not be the optimal final local solution. the definition of model training optimality should therefore be agreed by all participants before training.

as per all safety-critical applications, the reproducibility of a system is important for fl in healthcare. in contrast to training on centralised data, fl involves running multi-party computations in environments that exhibit complexities in terms of hardware, software and networks. the traceability requirement should be fulfilled to ensure that system events, data access history and training configuration changes, such as hyperparameter tuning, can be traced during the training processes. traceability can also be used to log the training history of a model and, in particular, to avoid the training dataset overlapping with the test dataset. in particular in non-trusted federations, traceability and accountability processes require execution integrity. after the training process reaches the mutually agreed model optimality criteria, it may also be helpful to measure the amount of contribution from each participant, such as the computational resources consumed, the quality of the data used for local training, etc.
the measurements could then be used to determine relevant compensation and establish a revenue model among the participants [ ] . one implication of fl is that researchers are not able to investigate images upon which models are being trained. so, although each site will have access to its own raw data, federations may decide to provide some sort of secure intra-node viewing facility to cater for this need or perhaps even some utility for explainability and interpretability of the global model. however, the issue of interpretability within dl is still an open research question. unlike running large-scale fl amongst consumer devices, healthcare institutional participants are often equipped with better computational resources and reliable and higher throughput networks. these enable for example training of larger models with larger numbers of local training steps and sharing more model information between nodes. this unique characteristic of fl in healthcare consequently brings opportunities as well as challenges such as ( ) how to ensure data integrity when communicating (e.g. creating redundant nodes); ( ) how to design secure encryption methods to take advantage of the computational resources; ( ) how to design appropriate node schedulers and make use of the distributed computational devices to reduce idle time. the administration of such a federation can be realised in different ways, each of which come with advantages and disadvantages. in high-trust situations, training may operate via some sort of 'honest broker' system, in which a trusted third party acts as the intermediary and facilitates access to data. this setup requires an independent entity to control the overall system which may not always be desirable, since it could involve an additional cost and procedural viscosity, but does have the advantage that the precise internal mechanisms can be abstracted away from the clients, making the system more agile and simpler to update. in a peer-to-peer system each site interacts directly with some or all of the other participants. in other words, there is no gatekeeper function, all protocols must be agreed up-front, which requires significant agreement efforts, and changes must be made in a synchronised fashion by all parties to avoid problems. and in a trustless-based architecture the platform operator may be cryptographically locked into being honest which creates significant computation overheads whilst securing the protocol. future efforts to apply artificial intelligence to healthcare tasks may strongly depend on collaborative strategies between multiple institutions rather than large centralised databases belonging to only one hospital or research laboratory. the ability to leverage fl to capture and integrate knowledge acquired and maintained by different institutions provides an opportunity to capture larger data variability and analyse patients across different demographics. moreover, fl is an opportunity to incorporate multiexpert annotation and multi-centre data acquired with different instruments and techniques. this collaborative effort requires, however, various agreements including definitions of scope, aim and technology which, since it is still novel, may incorporate several unknowns. 
in this context, large-scale initiatives such as the melloddy project , the healthchain project , the trustworthy federated data analytics (tfda) and the german cancer consortium's joint imaging platform (jip) represent pioneering efforts to set the standards for safe, fair and innovative collaboration in healthcare research. ml, and particularly dl, has led to a wide range of innovations in the area of digital healthcare. as all ml methods benefit greatly from the ability to access data that approximates the true global distribution, fl is a promising approach to obtain powerful, accurate, safe, robust and unbiased models. by enabling multiple parties to train collaboratively without the need to exchange or centralise datasets, fl neatly addresses issues related to egress of sensitive medical data. as a consequence, it may open novel research and business avenues and has the potential to improve patient care globally. in this article, we have discussed the benefits and the considerations pertinent to fl within the healthcare field. not all technical questions have been answered yet and fl will certainly be an active research area throughout the next decade [ ] . despite this, we truly believe that its potential impact on precision medicine and ultimately improving medical care is very promising. financial disclosure: author rms receives royalties from icad, scanmed, philips, and ping an. his lab has received research support from ping an and nvidia. author sa is supported by the prime programme of the german academic exchange service (daad) with funds from the german federal ministry of education and research (bmbf). author sb is supported by the national institutes of health (nih). author mng is supported by the healthchain (bpifrance) and melloddy (imi ) projects. 
deep learning
deep learning: a primer for radiologists
clinically applicable deep learning for diagnosis and referral in retinal disease
revisiting unreasonable effectiveness of data in deep learning era
a systematic review of barriers to data sharing in public health
estimating the success of re-identifications in incomplete datasets using generative models
quantifying differences and similarities in whole-brain white matter architecture using local connectome fingerprints
brainprint: a discriminative characterization of brain morphology
communication-efficient learning of deep networks from decentralized data
federated learning: challenges, methods, and future directions
federated machine learning: concept and applications
privacy-preserving federated brain tumour segmentation
multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation
advances and open problems in federated learning
federated learning for healthcare informatics
ibm's merge healthcare acquisition
nhs scotland's national safe haven
the french health data hub and the german medical informatics initiatives: two national projects to promote data sharing in healthcare
health data research uk
the human connectome: a structural description of the human brain
uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age
the cancer imaging archive (tcia): maintaining and operating a public information repository
chestx-ray : hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
deeplesion: automated mining of large-scale lesion annotations and universal lesion detection with deep learning
the cancer genome atlas (tcga): an immeasurable source of knowledge
the alzheimer's disease neuroimaging initiative (adni): mri methods
h&e-stained sentinel lymph node sections of breast cancer patients: the camelyon dataset
the multimodal brain tumor image segmentation benchmark (brats)
a large annotated medical image dataset for the development and evaluation of segmentation algorithms
membership inference attacks against machine learning models
white-box vs black-box: bayes optimal strategies for membership inference
understanding deep learning requires rethinking generalization
the secret sharer: evaluating and testing unintended memorization in neural networks
deep learning with differential privacy
privacy-preserving deep learning
federated optimization: distributed machine learning for on-device intelligence
braintorrent: a peer-to-peer environment for decentralized federated learning
peer-to-peer federated learning on graphs
learning differentially private recurrent language models
distributed deep learning networks among institutions for medical imaging
deep convolutional neural networks for computer-aided detection: cnn architectures, dataset characteristics and transfer learning
convolutional neural networks for medical image analysis: full training or fine tuning?
catastrophic interference in connectionist networks: the sequential learning problem
an empirical investigation of catastrophic forgetting in gradient-based neural networks
learning without forgetting
overcoming forgetting in federated learning on non-iid data
collaborative learning for deep neural networks
an overview of multi-task learning in deep neural networks
how to scale distributed deep learning
accurate, large minibatch sgd: training imagenet in hour
demystifying parallel and distributed deep learning: an in-depth concurrency analysis
yet another accelerated sgd: resnet- training on imagenet in . seconds
p sgd: patient privacy preserving sgd for regularizing deep cnns in pathological image classification
deep leakage from gradients
beyond inferring class representatives: user-level privacy leakage from federated learning
deep models under the gan: information leakage from collaborative deep learning
multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: abide results
federated optimization in heterogeneous networks
communication-efficient learning of deep networks from decentralized data
federated learning with non-iid data
data shapley: equitable valuation of data for machine learning

key: cord- -cdhu pfi authors: efroni, zohar title: location data as contractual counter-performance: a consumer perspective on recent eu legislation date: - - journal: smart urban mobility doi: . / - - - - _ sha: doc_id: cord_uid: cdhu pfi

this chapter analyses recent developments in the area of digital consumer law in the eu while focusing on the 'data as counter-performance' quandary and its application to location data. the immense technological and economic significance of location data in smart urban spaces renders them a relevant subject for inquiry in the context of ongoing legal efforts to protect consumers who grant permission to use their location data in exchange for digital goods and services.

the classic problem of how to get from point a to point b in the most efficient and convenient way demands new solutions in our digital time and age, especially in modern cities, which are home to % of the eu population. technological solutions are predominantly based on the generation, collection and extensive use of electronic data. to name just one example, 'mobility as a service' (maas) stands for a technology-based platform solution in an urban setting that heavily relies on multiple mobility data sources. location data play a key role not only in maas platforms but also in many other data-driven solutions, technologies, products and business models that shape life in the hyper-connected environment powered by the growth of smartphones. the promise of location-based services and personalised mobility solutions for consumers is considerable, and so are the challenges and risks they pose to individual interests.

a recent privacy incident that has captured much media attention is illustrative. apple's iphone pro was reported to have continued collecting location data even when the user set the iphone not to collect such data. namely, the phone continued pinging its gps modules despite users' deliberate choice to disable this function. in this way, contrary to users' expectations and possibly to apple's own privacy policy, it was impossible to completely turn off location-based system services simply by individually switching off location services for all applications and system services.
rather, users needed to turn off all global location services in the device settings. apple replied to the allegation by explaining that the matter was rooted in the 'ultra wideband technology' embedded in the device. this technology endows the device with spatial awareness to identify other ultra wideband devices nearby. one application of this technology is enabling file sharing between devices via airdrop. apple added that the management of ultra wideband compliance and its use of location data are done entirely on the device and that the company is not collecting user location data. still, the revelation was not particularly flattering for a company that takes pride in its comparatively strict privacy and security standards.

eu commission, 'urban mobility package' (european commission, august ) accessed august . see eg warwick goodall and others, 'the rise of mobility as a service: reshaping how urbanites get around' ( ) deloitte review - ; in this review, maas is described as a model which, at its core, relies on a digital platform that integrates end-to-end trip planning, booking, electronic ticketing and payment services across all modes of transportation, public or private, ibid. forbes.com/sites/kateoflahertyuk/ / / /apple-iphone- -iphone- -pro-location-privacy-issue accessed february ; ibid: an it security expert showed how gps data are also collected when individual location services are disabled in the iphone pro's settings. this happened even when users set their location services toggle to 'never'. zack whittaker, 'apple says its ultra wideband technology is why newer iphones appear to share location data, even when the setting is disabled' (techcrunch, december ) accessed february .

the location data that mobile devices collect fuel giant, global and in some cases thinly regulated markets, which often operate and prosper entirely unnoticed by those who own the devices. a series of articles in the new york times picked up the topic. as part of the privacy project, reporters obtained a file containing more than billion location pings from over million us citizens as they moved through several major cities such as washington, san francisco and los angeles. the newspaper attained the data from a commercial location data company, one of dozens of its kind, that collects precise location data by utilising software included in mobile phone applications. the online article illustrates via the use of interactive heatmaps and analytics techniques how much can be learned about people simply by following their movement traces over time, and how easy it can be to obtain and use such data in the absence of effective regulation. the report shows further how omnipresent surveillance is and how penetrative it can be. a us advertising executive was quoted as describing the location data industry there as 'the wild west'.

apple explained: 'ultra wideband technology is an industry standard technology and is subject to international regulatory requirements that require it to be turned off in certain locations [. . .]. ios uses location services to help determine if an iphone is in these prohibited locations in order to disable ultra wideband and comply with regulations.', ibid. ibid: according to apple, a new, dedicated toggle option for this feature will be included in upcoming ios updates.
see apple's privacy governance statement explaining its cross-functional approach to privacy governance, accessed february : 'at apple we design our products and services according to the principle of privacy by default and collect only the minimum amount of data necessary to provide our users with a product or service. we also deploy industry-leading consent mechanisms to allow our customers to choose whether to share data such as their location [. . .]'.

shortly before this chapter went to print, a global crisis overshadowed all the problems location data have elicited so far, and for that matter, it dwarfed all other national, regional and global problems as well: as of july th , the novel coronavirus (sars cov- ) has caused over thirteen million infection cases and over half a million deaths worldwide. in order to slow down its expansion rate and bring the spread of the pandemic under control, an early identification of infected individuals as well as all other individuals who have been in contact with them is considered critical: knowing the mobility patterns of positively tested individuals during the relevant period, cross-referencing this data with the location data (typically generated by smartphones) of all the persons who were in close physical contact with them, and then, based upon matches, taking preventive measures such as sending direct sms warnings, ordering quarantine and isolation, conducting pinpointed testing, etc., is considered by many a promising, even a vital strategy to contain the disease (a simplified sketch of such matching is given below). this current example comes to briefly demonstrate both the enormous utility location data may have and the potential for misuse. in times of crisis such as these, the harm to privacy rights, and even to the integrity of the political system in some democracies as a whole, often goes unnoticed. fewer people ponder now whether a massive and unchecked collection of location data by the government as part of the measures it takes against a health disaster of this dimension is justified, proportionate and in conformity with fundamental rights. in emergency situations, as in normal times, utilising location data is particularly prevalent in modern urban environments, in which mobility becomes ever 'smarter' and in which movement patterns can be ascertained and exploited in more accurate, sophisticated and pervasive manners.

with this observation in mind, the aim of this chapter is twofold. the first part (sect. ) seeks to sketch the main issues triggered specifically by location data and the application of eu data privacy and data protection law to evolving commercial scenarios. this part argues that assessing the problem requires a broad perspective that, besides law, includes technological and economic aspects of newly evolving ecosystems. the three spheres are often intertwined: technological advancements offer new solutions to familiar problems, and moreover, they offer entirely new behavioural options and choices (that might ultimately create new problems). the potential added value for consumers stimulates economic activity and business models designed to monetise technological innovation and enhance consumption. all this happens within a legal environment that might impose restrictions on technology and commerce and where regulative adjustments might be called for. the second part (sect. ) focuses on risks and opportunities for consumers who are willing to trade their (location) data specifically for digital goods and services.
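as a simplified sketch of the cross-referencing step described above, the example below matches the recorded pings of a positively tested person's device against the pings of other devices within an illustrative distance and time window. all identifiers, coordinates and thresholds are hypothetical and do not reflect any real contact-tracing system.

    # hypothetical sketch: flag devices whose pings were close (in space and time)
    # to the pings of an infected person's device. thresholds are illustrative.
    from math import radians, sin, cos, asin, sqrt

    def haversine_m(lat1, lon1, lat2, lon2):
        # great-circle distance between two points, in metres
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371000 * asin(sqrt(a))

    # pings: (device_id, latitude, longitude, unix_timestamp)
    infected_pings = [("device_x", 52.5200, 13.4050, 1_600_000_000)]
    other_pings = [
        ("device_y", 52.5201, 13.4051, 1_600_000_300),   # ~13 m away, 5 min later
        ("device_z", 52.5300, 13.4200, 1_600_000_100),   # roughly 1.5 km away
    ]

    DIST_M, TIME_S = 20, 900          # e.g. within 20 metres and 15 minutes
    contacts = set()
    for _, lat_i, lon_i, t_i in infected_pings:
        for dev, lat_o, lon_o, t_o in other_pings:
            if abs(t_o - t_i) <= TIME_S and haversine_m(lat_i, lon_i, lat_o, lon_o) <= DIST_M:
                contacts.add(dev)

    print(contacts)   # {'device_y'} - devices flagged for follow-up measures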
providing the data often relates directly to benefiting from more personalised, finely tuned and, in the end, useful technological solutions. in light of the rising trend often described as (consumer) data commodification, the second part endeavours to provide initial insights into the problem of location data that economically, and potentially also legally, function as a counter-performance, particularly after the enactment of directive (eu) / , which addresses the topic.

location data: conceptual, technological and economic perspectives

location data is a term often used in the context of digital technology and economy but which is less often explained or treated as a unique type of data that creates a unique set of problems. in order to somewhat narrow the scope of the present discussion, it appears reasonable to begin by limiting it to machine-readable data, i.e. data that are generated, stored, analysed, aggregated, enriched, edited, manipulated, transmitted, etc. by the use of digital machines and devices. next, it is clear that location data in our context go beyond the colloquial meaning of a category of machine-readable data that essentially indicate a physical location in space (often referred to as 'geolocation'); non-spatial information can also reveal the location of an individual. in addition, technologies that collect and utilise spatial coordinates very often match them with temporal data, namely timestamps associated with pings of physical locations. the timestamps are an integral element of the data from a technological perspective. hence, some academics and actors in the business-technology sector use the term spatio-temporal data to more precisely describe the data being collected and processed for analytics, functionality, mobility and other purposes. moreover, fully capturing the essence and value of location data includes not only an indication of physical location at a certain time but also information about the direction and speed they may encapsulate. location data hence provide the basis for mobility data, a concept that is intimately related to the common understanding of smart mobility. in turn, smart mobility was defined on one occasion as 'collecting, managing, and analysing (fusing) various data sources related to different aspects of residents' movement in order to better understand and improve the way people move'. it follows that smart mobility crucially depends on high quality mobility data on a massive scale and from multiple sources.

spatio-temporal data can be said to create an interface layer between the presence and behaviour of a person in cyberspace and the presence and behaviour of that person in real space. beyond the deductive force of such data (knowing the physical location of a person at a certain time can disclose personal preferences, tastes, behaviours and social connections), the data interface layer highlights a problem that can be described as the vanishing boundaries between living and operating in these two ostensibly distinct and yet increasingly intertwined spaces.

for further discussion, see jonathan andrew, 'location data and human mobility: an evaluation of a dissonance that frames data protection and privacy rights' (phd thesis, european university institute ) accessed february . ibid - : 'the term "location data" [. . .] fails to connote a core dimension of the data inhered i.e. the temporal data. a more appropriate nomenclature would be that of 'spatio-temporal data.' see eg hasso-plattner-institut, 'spatio-temporal data analysis' (hasso-plattner-institut) accessed february (references to development projects in the area of time-stamped data analytics); omni.sci, 'spatiotemporal definition' (omni.sci) accessed february . see eg uber's statement that '[i]n order to calculate speeds we use two data inputs: (a) gps locations of vehicles over time, and (b) map data that represents the street network on which vehicles travel', in 'uber movement: speeds calculation methodology' accessed february . the pecr in the uk define in s ( ) location data as specifically including the direction of travel and the time the location information was recorded, accessed february . data logs about past locations combined with social and other data can tell third parties much about the person's personality, background, preferences and habits. they also have predictive force: based on past location data and an analysis of recurring patterns, informed assumptions can be made.

the location component not only triggers the question of (which) space but also the question of what or whom. location data are machine generated. with various levels of accuracy, they ascertain the location of a device, not a natural person. attributing the location to a specific individual is necessarily based on assumptions, correlations, statistical calculations and often on additional data sets and information that establish the presumed nexus to an individual. it can be reasonably assumed, for instance, that the location of a smartphone at a certain time and the location of the person registered as its owner are one and the same. based on device location data alone, however, a certain degree of uncertainty always remains.

location data are potentially subject to data protection and data privacy laws. though the main legal data protection instrument in the eu, the gdpr, mentions location data by name in its definition of 'personal data', it neither defines this term nor provides a detailed explanation. the eprivacy directive, which aims to guarantee the confidentiality of communications over publicly available electronic communication networks and services, defines location data as meaning 'any data processed in an electronic communications network or by an electronic communications service, indicating the geographic position of the terminal equipment of a user of a publicly available electronic communications service'. recital of this directive is somewhat more detailed in providing that: location data may refer to the latitude, longitude and altitude of the user's terminal equipment, to the direction of travel, to the level of accuracy of the location information, to the identification of the network cell in which the terminal equipment is located at a certain point in time and to the time the location information was recorded (emphasis added). the eprivacy directive distinguishes between 'location data' and 'traffic data', with the latter defined as 'any data processed for the purpose of the conveyance of a communication on an electronic communications network or for the billing thereof'. based on these definitions, the directive further distinguishes between the protection scheme and compliance requirements pertaining to 'traffic data' on the one hand and 'location data other than traffic data' on the other. regarding the latter category, art.
( ) of the eprivacy directive provides, inter alia, that '[w]here location data other than traffic data, relating to users or subscribers of public communications networks or publicly available electronic communications services, can be processed, such data may only be processed when they are made anonymous, or with the consent of the users or subscribers to the extent and for the duration necessary for the provision of a value added service'. accordingly, location data only sometimes qualify as traffic data; it depends on whether the data processing goes beyond the mere purpose of enabling the transmission of communication. this structure, and specifically the lack of sufficient coherence in the distinction between location data that qualify as traffic data and location data that do not, as well as the separate sets of rules that apply to each category, has been criticised.

eprivacy directive, art (c). eprivacy directive, art (b). recital of the eprivacy directive provides: 'in digital mobile networks, location data giving the geographic position of the terminal equipment of the mobile user are processed to enable the transmission of communications. such data are traffic data covered by article of this directive. however, in addition, digital mobile networks may have the capacity to process location data which are more precise than is necessary for the transmission of communications and which are used for the provision of value-added services such as services providing individualised traffic information and guidance to drivers. the processing of such data for value added services should only be allowed where subscribers have given their consent. even in cases where subscribers have given their consent, they should have a simple means to temporarily deny the processing of location data, free of charge.' see eg andrew, 'location data and human mobility' (n ) - .

realising these deficiencies, the art. working party (predecessor of the european data protection board) recommended merging the provisions of art. and art. of the eprivacy directive, suggesting furthermore that
this approach reflects the understanding that both location data and traffic data fall under the concept of 'metadata', a designation that nonetheless is not contradictory to the very sensitive personal information they may contain. the proposal maintains a different distinction manifested in new definitions of 'electronic communications content' and 'electronic communications metadata'. therefore, this regulation should require providers of electronic communications services to obtain end-users' consent to process electronic communications metadata, which should include data on the location of the device generated for the purposes of granting and maintaining access and connection to the service. location data that is generated other than in the context of providing electronic communications services should not be considered as metadata. examples of commercial usages of electronic communications metadata by providers of electronic communications services may include the provision of heatmaps; a graphical representation of data using colors to indicate the presence of individuals. to display the traffic movements in certain directions during a certain period of time, an identifier is necessary to link the positions of individuals at certain time intervals. this identifier would be missing if anonymous data were to be used and such movement could not be displayed (emphasis added). this statement clarifies that location data collected in contexts other than providing electronic communications services would fall outside the scope of the regulation. if the same data, however, qualify as personal data under the gdpr, the latter instrument applies and users' consent might still be required. in the latest iteration and proposed amendments to the text of the eprivacy regulation proposal, introduced by the eu parliament in late , an additional recital ( aa) was proposed: metadata such as location data can provide valuable information, such as insights in human movement patterns and traffic patterns. such information may, for example, be used for urban planning purposes. further processing for such purposes other than for which the metadata where initially collected may take place without the consent of the end-users concerned, provided that such processing is compatible with the purpose for which the metadata are initially collected, certain additional conditions are met and safeguards are in place, including, where appropriate, the consultation of the supervisory authority, an impact assessment by the provider of electronic communications networks and services and the requirement to genuinely anonymise the result before sharing the analysis with third parties. as end-users attach great value to the confidentiality of their communications, including their physical movements, such data cannot be used to determine the nature or characteristics on an end-user or to build a profile of an end-user, in order to, for example, avoid that the data is used for segmentation purposes, to monitor the behaviour of a specific end-user or to draw conclusions concerning the private life of an end-user. for the same reason, the end-user must be provided with information about these processing activities taking place and given the right to object to such processing. overall, the eu legal scheme and recent trends regarding location data are conscious of the increasing utility of location data and the importance of safeguarding users' privacy and data protection interests, regardless of the specific technology applied. 
both the gdpr and the eprivacy regulation proposal advance a for a definition of 'electronic communications service', the proposal refers to art ( ) of technology-neutral approach to their respective subject matters. in parallel, the conceptual and definitional distinction between content and metadata remains, as does the reliance on anonymisation to reduce risks to privacy interests. a myriad of devices and technologies used by urbanites collect, process and exchange location data at a considerable volume, frequency and scale. locationbased services generally aim to obtain the accurate position of individuals-both indoors and outdoors-in order to provide services such as route planning and navigation and to facilitate travel efficiently and comfortably. global positioning systems (gps) are considered the dominant technology for outdoors positioning as well as the most accurate and reliable, but other technologies are also prevalent, such as wifi-based localisation cell tower triangulation. technologies used for localisation indoors include wifi (wlan), internal measurement unit (imu), radio frequency id tags (rfid), bluetooth, gsm and fm. research has identified three principal domains in which technology is advancing rapidly, penetration into consumer markets is considerable and location data provide increasing functionality: smartphones, connected cars and the internet of things (iot). in all of these domains, various location technologies are in use, and the positioning data generated are often infused with other information sources such as geographic information system (gis) data or real traffic data. some technologies are specifically tailor-made for smartphones, e.g. applications with location-based check-in services that enable individuals to share their activityrelated choices. in particular, social media applications equipped with check-in functions (such as facebook or twitter) provide a vast amount of relevant data that help to determine activity patterns in the context of urban mobility. among other purposes, such data allow researchers and analytics experts to ascertain individual mobility patterns with growing precision and granularity. the potential of location data is obviously not limited to social media applications with check-in functions. mobile phone traces can be used for various purposes, ranging from urban transportation modelling and research to the creation of personal profiles and targeted advertising by commercial entities as well as areas beyond commerce such as criminal investigations. researchers have noticed that companies also use ultrasonic side channels on mobile devices, usually without the customers being aware of it, in order to determine physical locations and content consumption habits and to follow their movements with applications that permanently 'listen' through the device's built-in microphone to ultrasonic beacons in the background. due to the extremely broad use of smart mobile devices for performing daily tasks in urban settings, the location points of a growing number of such devices (and by extension, of their users) are being constantly processed, calculated and transmitted. researchers determined that it is now dramatically easier to track the location of a huge number of mobile devices, 'leading to a wealth of information about the mobility of humans, vehicles, devices, and practically anything that can be fitted with a mobile computing device'. 
and the density of sensors, signals and reception points-particularly in the city-contributes to the aggregation of very precise, highquality location data. developments in the area of consumer iot also demonstrate an increasing reliance on location, iotforall.com/location-data-iot-applications-and-benefits/> accessed february : 'location data is how many modern businesses make sense of their processes, their products and/or services, and how people interact with all of the above. it enables businesses to track assets across oceanic black holes. it allows them to map customer journeys seamlessly. it is the tool they use to optimize the routes of swarms of vehicles weaving through smart cities.' processable data. iot location data are particularly accurate, which also renders them a particularly valuable, multipurpose source for commercial players, among others. researchers have begun to take notice of the possible impacts and risks involved in analysing data sets from iot devices combined with smart city infrastructure in the context of digital forensics, among other areas. it would not be exaggerated to say that location data are the lifeblood of smart mobility, and iot devices are one critical source for such data. clearly, connected cars, assisted driving technologies and autonomous vehicles (collectively 'connected cars') are another important source. modern automobiles also become smarter and more connected thanks to numerous in-car sensors, on-board computing capacities and an internet connection to external sources. according to one account, connected cars are equipped with on-board computers and embedded mobile broadband as well as dozens of sensors and around microprocessors collecting telematics and driver data. these can produce and then upload to the cloud up to gb of data with every driving hour. a considerable portion of this data qualifies as location data or is part of the mobility data the car generates. as indicated by researchers, both the technologies that generate the data and technology-based analytics models (including ai) open up an extremely broad range of use cases for such data: mobility data have been used to answer questions such as how people travel between cities and what the patterns are of their daily commute, as well as to predict socioeconomic trends, find relationships in online social networks, identify people's weight and health status, discover employment patterns, and follow the spread of infectious diseases [. growing field of commercial applications by mobile communication service providers [. . .] as well as by several companies that have already started to provide location-based services analyzing mobile phone location traces. there is a close bond between the useful things technology makes possible and the commercial endeavours that monetise and design business models around them. given the sheer wealth of information advanced technologies and analytics methods currently offer, the economic significance of location data can hardly be overstated. the data have an enormous commercial value for companies that provide a wide range of products and services and sometimes become a key resource for the firm's value proposition. as mentioned in a recent study, data can become the product (as compared to merely enhancing or augmenting an existing product), with location-based services being an archetypical example. 
as a result, personal data are being increasingly commodified, that is, they are being traded and handled by market participants as a valuable commodity. to name one prominent example, companies such as here provide a plethora of services based on the understanding that 'the world [. . .] is increasingly powered by location data and technology, enabling people and objects to live, move and interact faster, safer and in a more efficient way than ever before'. here, in which major automotive players currently hold significant shares, provides products and solutions that are centred around the idea that location, described as the 'data layer of everything', is the one element that is critical to enabling an 'autonomous world'. the here open location platform is described as being able to create exhaustive data pools (with data gathered from car sensors, smart city systems and/or other iot platforms) and thereby offer the opportunity to develop advanced location-based services. here is not alone in discovering the economic potential of commercialising high-quality location data on a massive scale. it competes with other players in an ecosystem where the automotive industry and smart mobility are building on ai-based solutions and where business, innovation, markets and the economy at large are 'data-driven'. in china, navinfo is striving to become 'the digital brain of intelligent driving with ultraprecise location information and automotive-grade semiconductors for advanced driver assistance systems (adas) and autonomous driving'. in the realm of location-based services, foursquare, the company that, as per its own statement, 'invented the check-in', now has a product (pilgrim sdk) that embeds foreground and background location awareness into smartphone applications in order to provide contextual content in real time. according to an online report from , this company generated over billion visits a month from million locations globally. such enormous amounts of location data-in some cases the product that carries the entire business model of commercial enterprises-are being successfully and creatively converted into revenue. a wide range of business models have emerged in the location data ecosystem, including platform, service, hardware and software providers that initially collect the data from consumers; data brokers that specialise in buying and selling data sets in secondary data markets; and data-driven technology companies that invent sophisticated methods and models to analyse and extract more insights and commercially valuable information from big data. consequently, new markets emerge in which businesses and users directly and explicitly trade personal-level location information. in other words, business models in which consumers 'pay' with their data are on the rise, and consumer protection law is confronted with completely new situations and problems. in december , the european commission published two proposals for directives that would regulate certain aspects concerning contracts for the supply of digital content and for the online sale of goods. the debate in recent years has circled around several issues, including ( ) coverage of situations in which the consumer provides data as counter-performance instead of a price for digital content and services and ( ) the inclusion of embedded digital content under the protection scheme of the directives (in the current texts of the directives such embedded digital content is referred to as 'goods with digital elements'). 
framework questions such as the explicit inclusion of 'personal data' as counter-performance and the simultaneous application of the gdpr triggered an extensive discussion. another question circled around protection for consumers who 'passively' provide personal data instead of a price. the general aim of the resulting directive concerning digital goods and services (dcsd) is to fully harmonise certain requirements concerning contracts between traders and consumers for the supply of digital content or services (recital dcsd). it is explicitly designed to harmonise rules on the conformity of digital content or a digital service with the contract, remedies in the event of a lack of such conformity or a failure to supply and the modalities for the exercise of those remedies, as well as on the modification of digital content or a digital service. recitals through lay out a fairly long list of matters in which member states are not strictly bound by the dcsd. these matters include national rules on the formation, validity, nullity or effects of contracts; the legal nature or classification of the contract; remedies for 'hidden defects'; and claims against any third party that is not the trader. the debate regarding the proper reach of the dcsd did not focus specifically on location data. the remainder of this chapter seeks to fill this gap.

the commission's initial proposal (com-dcd) included a provision that extended the scope of the directive to cases where the consumer actively provides, in exchange for digital content, counter-performance other than money in the form of personal data or any other data. after much debate over this issue (including a critical opinion issued by the european data protection supervisor ), the directive now sets forth that consumers who provide personal data in exchange for digital content or digital services in principle should benefit from the protections therein. this provision is subject to two exceptions: ( ) when the personal data provided by the consumer are exclusively processed by the trader for the purpose of supplying the digital content or digital services, or ( ) when they are processed for allowing the trader to comply with legal requirements to which the trader is subject, and, in both cases, the trader does not process that data for any other purpose. the dcsd now states generally that in the case of any conflict, the gdpr overrides provisions under the dcsd. the same applies to conflicts with the e-privacy directive (directive / /ec). this priority rule is helpful at least on a formal level for resolving questions of parallel application. it should help domestic legislatures and courts with the task of applying a certain legal regime in case of discrepancies. such discrepancies are likely in light of the conceptual and practical overlaps between data protection/privacy law (protecting the individual as a data subject/user) and consumer protection law (protecting potentially the same individual as a consumer). this bright-line rule represents the general understanding that neither contract law in general nor specific consumer protection regulations should derogate from the level of protection persons enjoy under data protection and privacy law. more precisely, art. ( ) dcsd provides that consumer protection under the dcsd should be 'without prejudice' to the data protection body of law. early proposals suggested a distinction between actively and passively provided data in data-as-counter-performance scenarios.
whereas the com-dcd referred only to data that are actively provided by the consumer (com-dcd, recital : 'as regards digital content supplied not in exchange for a price but against counter-performance other than money, this directive should apply only to contracts where the supplier requests and the consumer actively provides data', emphasis added), the council's draft would have allowed member states to extend the application of the directive to passively provided data as well. both the council and the eu parliament refrained from using the term 'actively' within their respective amendments to art. of the dcd draft. the council's draft kept the emphasis on actively provided data while excluding collected metadata (such as ip addresses) or automatically generated content (such as information collected and transmitted by cookies). by comparison, the parliament's draft (ep-dcd) would allow for the inclusion of data that are provided passively (e.g. personal data collected by the trader such as ip addresses); the relevant ep-dcd recital also mentioned as covered by the directive 'the name and e-mail address or photos, provided directly or indirectly to the trader, for example through individual registration or on the basis of a contract which allows access to consumers' photos'. the option of excluding passively provided data from the scope of art. dcsd has been criticised on several grounds, including the fact that the distinction between actively and passively provided data could turn fuzzy in certain situations. ultimately, the phrase 'actively provide[s]' was removed from the final text. especially relevant to location data is recital dcsd, which indicates that 'metadata' are not covered by the dcsd unless member states specifically extend the application of this directive to such situations. it follows that data which qualify as 'metadata' will trigger protection only if the exchange of such data against digital content/services is specifically recognised under domestic law as a 'contract'. at the same time, recital dcsd clarifies generally that the conclusion of the contract and the provision of the data do not have to happen simultaneously or at any specific proximity of time in order for the dcsd to apply. this recital includes the ongoing collection of data that users upload or create in the course of using the digital content/service, which might, under a certain interpretation, also encompass 'passive' data provision situations. alas, the dcsd does not provide a definition for the term 'metadata'. the examples of metadata it mentions-namely, 'information concerning the consumer's device or browsing history'-do not offer a conclusive answer. one important area in which this ambiguity is relevant is the case of cookies. it has been argued, for instance, that the use of cookies that collect data such as browsing history (hence 'metadata' that the consumer, strictly speaking, neither uploads nor creates) in exchange for digital goods or services is a situation excluded from the dcsd. another area that comes to mind, of course, is location data. given that only personal data can count as counter-performance, location data would qualify if (a) they are considered 'personal data' under the gdpr and if (b) the data are not exclusively processed by the trader for the purpose of supplying the digital content or digital services.
here, dcsd recital is read as excluding cookies information generally, and cookies information that qualifies as personal data specifically. this outcome was criticised, e.g., by lena mischau, 'daten als "gegenleistung" im neuen verbrauchervertragsrecht' [ ] zeitschrift für die gesamte privatrechtswissenschaft (forthcoming). for simplicity, we set aside the second exception in dcsd, art. ( ), regarding data processed in order to comply with a legal obligation. given the broad definition of 'personal data' and the corresponding interpretation by the court of justice of the european union (cjeu), the exclusion of non-personal data might end up having a marginal impact in practice. it is generally reasonable to assume that non-anonymised location data are more valuable than anonymised data to traders in the b2c sector in terms of allowing pinpointed targeted advertising, refined consumer profile building and individualised pricing models. the first condition nonetheless triggers the general problem of how and where to draw the line between personal and non-personal (including anonymised) data. the eprivacy regulation proposal suggests that location pings require a device identifier to make them useful in terms of creating heatmaps and ascertaining mobility patterns that are important to the research and development of smart mobility concepts in densely populated cities. furthermore, depending on the technology and device at play, consumer location data that are collected automatically often come with 'built-in identifiers' such as the ip address, device id and advertiser id in smartphones. even when separated from those identifiers, location data are particularly susceptible to re-identification attacks, and within the broader discussion about the sheer feasibility of rendering personal data completely and permanently anonymised, location data present an example in support of arguments that total anonymisation cannot be attained. the upshot is that location data will almost always qualify as personal data under the gdpr (unless sufficiently anonymised before processing under applicable/acceptable technical and legal standards of anonymisation) and thereby fulfil the first condition. the second condition calls for a careful assessment. whether the location data that the consumer provides are processed exclusively for supplying the digital content/services in accordance with the dcsd depends largely on the facts and circumstances of the individual case. the assessment will be as complex (or as straightforward) as ascertaining the technical, contractual and practical conditions surrounding the exchange. in addition, obligations under the dcsd's supply and conformity requirements and perhaps some other sources external to the contract might be relevant. this restriction under art. ( ) is formulated in a very similar way to art. ( )(b) gdpr, which permits the processing of personal data if 'processing is necessary for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract' (emphasis added). at the same time, the gdpr provision is somewhat broader compared to art. ( ) dcsd. the latter excludes from the concept of counter-performance the processing of personal data exclusively 'for the purpose of supplying the digital content or digital service in accordance with this directive' (emphasis added).
it seems that, at least in some cases, processing for the purpose of supply is a specific type of contract performance necessity. under this interpretation, it is conceivable that art. ( )(b) gdpr might also capture processing that is not directly related to supplying the contracted subject matter. the edpb opined that 'article ( )(b) [gdpr] applies where either of two conditions are met: the processing in question must be objectively necessary for the performance of a contract with a data subject, or the processing must be objectively necessary in order to take pre-contractual steps at the request of a data subject'. in this context, the concept of necessity is applied not strictly under contract law but under data protection (objective) assessment criteria. at the same time, even under such a narrow construction of the legal basis of art. ( )(b) gdpr, it is clear that there is no perfect overlap with art. ( ) dcsd. as a result, a valid art. ( )(b) gdpr basis does not exclude a priori application of the dcsd, but in practice, processing on this basis will often coincide with situations excluded under art. ( ) dcsd. in a legal-economic environment that tolerates the consensual commodification of personal data and simultaneously imposes strict data protection limitations on traders, a successful business model seeking to monetise the data will usually need to rely on processing grounds other than contractual performance necessity, mainly on consent. indeed, the importance of users' affirmative consent in situations where location data are being processed by the trader is expected to increase in light of the cjeu jurisprudence on metadata collected by cookies. in the planet case, the cjeu ruled that a pre-selected checkbox does not fulfil the requirements of consent. active, informed and specific consent is required for using both personal and non-personal data covered under the e-privacy directive, and the user should have a viable option to refuse the implementation of cookies, as 'user consent may no longer be presumed but must be the result of active behaviour on the part of the user'. similar to data retrieved via cookies (e.g. ip addresses), location data are often collected in the course of a continuous, automated process inherent to using a connected device. the process runs seamlessly in the background without any affirmative action of users to 'hand over' their data and sometimes even without their knowledge. the prominence of consent is expected to grow under the upcoming eprivacy regulation as an important lawful basis for processing 'electronic communications metadata'. already today, consent is the main lawful basis for processing location data that qualify as sensitive data under art. gdpr. the claim that users often do not actively provide explicit consent to the collection of their (personal) location data poses a major compliance challenge that relates to the more general problem of how to improve the consent process in digital and online settings. in the final analysis, whether consumers actively provide the personal (location) data or not is of secondary importance, and in any case, it should not impose a technical limitation on the dcsd's scope. for the opposite conclusion, a convincing normative or economic argument saying that location data provided 'passively' call for a lower degree of consumer protection would have to be made. the question of how to reconcile commercial data-as-counter-performance models with privacy and data protection law and their consent requirements (importantly including art. ( ) gdpr) will remain the paramount challenge.
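the susceptibility of location traces to re-identification mentioned above can be made concrete with a small, purely illustrative simulation. this sketch is not from the chapter: it uses synthetic traces, an assumed spatial grid and an assumed observation window, and simply counts how often a handful of randomly chosen (hour, cell) points from one person's trace match no other trace in the data set.

```python
# illustrative only: synthetic traces on a coarse grid; all parameters are
# assumptions, not figures from this chapter. question: how often do a few
# random (hour, cell) points from one person's week of pings match nobody else?
import random

random.seed(1)
N_PEOPLE, HOURS, GRID = 2000, 24 * 7, 30   # one week of hourly pings, 30x30 cells

def random_trace():
    # crude mobility model: each person alternates between three favourite cells
    favourites = [(random.randrange(GRID), random.randrange(GRID)) for _ in range(3)]
    return [random.choice(favourites) for _ in range(HOURS)]

traces = [random_trace() for _ in range(N_PEOPLE)]

def matches(trace, points):
    # does this trace contain all of the given (hour, cell) observations?
    return all(trace[hour] == cell for hour, cell in points)

def unicity(k, trials=200):
    # fraction of sampled people singled out by k random points of their own trace
    unique = 0
    for _ in range(trials):
        person = random.randrange(N_PEOPLE)
        hours = random.sample(range(HOURS), k)
        points = [(h, traces[person][h]) for h in hours]
        unique += sum(matches(t, points) for t in traces) == 1
    return unique / trials

for k in (2, 3, 4):
    print(f"{k} points: {unicity(k):.0%} of sampled traces unique")
```

the point of the exercise is only to show the mechanism: the more spatio-temporal points an observer can link to auxiliary data, the fewer traces in the pool remain compatible with all of them, even without any explicit device identifier.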
after many twists and turns on the issue of goods with embedded digital content, the dcsd adopted a new definition for 'goods with digital elements', meaning 'any tangible movable items that incorporate, or are inter-connected with, digital content or a digital service in such a way that the absence of that digital content or digital service would prevent the goods from performing their functions'. this definition covers what is commonly referred to as iot devices. iot devices connect to the internet via ip addresses, and connectivity is by definition essential for them to perform their functions. the legal scheme explicitly excludes goods with digital elements from the dcsd while making such goods subject to the sale of goods directive (sgd). since the sgd applies solely to sales contracts, and since its definition of a sales contract does not entertain the concept of data as counter-performance, goods with digital elements for which the consumer provides data instead of a price are covered neither by the dcsd nor by the sgd. it follows that renting, lending and gratis distribution of a consumer iot device remain outside of the regulative scope of these directives, unless the transaction for the supply of digital elements can be severed from the transaction concerning the physical good and be treated separately and independently. this 'distribution of labour' between the dcsd and the sgd means that unless the physical component serves merely as a data carrier of digital content, the sgd applies exclusively to sales contracts of goods that include digital elements. the question of whether the digital element in a given case is essential for the good to perform its functions is to be answered, to a large extent, by the terms of the contract itself and the surrounding circumstances. for iot devices covered by the sgd, the directive's protection scheme spreads over the digital components alongside the physical elements. it sets forth specific objective requirements for conformity that are typical to digital content and services, such as the duty to inform the consumer and to supply updates, including security updates that are necessary to keep those goods in conformity. the sgd, however, does not include a detailed provision comparable to art. dcsd regarding modifications in the digital content or services and the consumer protection safeguards therein. the application of the coverage question to iot devices is certainly relevant for smart mobility. the consumer devices used for smart mobility usually qualify as goods with digital elements under the dcsd/sgd scheme. those devices rely on location data, and connection to the internet is essential for their proper function and utility. during their operation, they establish connections to remote services that access their location data. as noted, in the absence of transfer of ownership for a price, the consumer protection layer of the dcsd/sgd does not apply. it appears that traders still sell most consumer iot devices for money. but a shift to business models that more intensively and transparently monetise personal data collected by the device (for a considerable discount, a subscription model and/or gratis distribution instead of sales transactions) does not seem that far-fetched.
particularly in the consumer iot and smartphone segments, consumers have a strong incentive to share their location with hardware, software, service and platform providers. depending on the particular case, sharing location data can dramatically increase personal usability and functionality. the mission of consumer protection law at this juncture should be to ensure that consumers, who suffer from information asymmetry vis-à-vis traders, weaker bargaining positions and in some cases a total lack of both bargaining power and viable alternatives, are not being exploited. one important element is imposing transparency obligations on traders to enhance consumers' understanding of the context, purposes, implications and risks associated with sharing location. a comprehensive evaluation of the legal position of eu consumers in the iot segment should include further regulative instruments, such as the consumer rights directive ( / /eu) as recently revised by directive (eu) / (the consumer rights modernisation directive, crmd). the crd generally secures broad information rights under article thereof (including information about the total price of the goods or services) as well as specific information requirements for distance or off-premises contracts (article ). the revised crd (to be transposed in national laws by may ) borrows many important definitions from the gdpr and the dcsd/sgd scheme. it will apply explicitly 'where the trader supplies or undertakes to supply digital content which is not supplied on a tangible medium or a digital service to the consumer and the consumer provides or undertakes to provide personal data to the trader'. in principle, crd rights should apply to contracts regarding iot goods, namely, both to the physical component of the device and the digital content or service that makes it work. but this is not always the case. for instance, some consumer rights specifically attach requirements concerning pre-contractual information duties or the rights of consumers in the case of withdrawal to digital content. under the revised crd, these rights will also apply to digital content/services of goods with digital elements subject to a sales contract, except for cases where the digital content is supplied on a tangible medium and the consumer 'pays' with personal data. this structure suggests that pre-installed digital content on an iot device does not benefit from the crd's protections that apply to digital content. the synopsis sketched above, while only briefly touching upon the genuinely complex matrix of digital consumer protection law in the eu, demonstrates that the implications of the revised crd for iot consumers are not easy to pin down. as the consolidated body of consumer protection law emerging under the new deal for consumers initiative of the european commission and the enactment of the dcsd/sgd becomes more intricate, the exposition, implementation and compliance challenges are likely to increase and provide fertile ground for further research and discussion. location data remain an extremely relevant and dynamic playing field for technology developers, market actors and consumers. as such, they call for the attention of lawmakers and courts as they come to define the legal boundaries for these dynamics and, to some extent, prescribe the rules of the game.
the task of enabling market models with an increasing reliance on data and their consensual exchange in b2c markets and, at the same time, preserving the rights of individuals as data subjects and consumers should not be underestimated. many questions within data protection and privacy law itself, as well as questions concerning its interface with other legal domains such as consumer protection and contract law, remain unresolved. location data, due to their unique significance and role in the digital economy, could play a pivotal role in the process of figuring out this interplay, which is hopefully moving towards a coherent and consistent legal scheme that finds the right balance between personal autonomy, state intervention and market economy. on the one hand, utilising location data is indispensable for numerous technological innovations and key for economic growth. on the other hand, such utilisation poses new risks to individual interests. whether location data therefore could and should be treated as a unique category of data from a legal perspective is a vexing question that has not yet been extensively discussed, but it certainly deserves some deeper deliberations. open access: this chapter is licensed under the terms of the creative commons attribution . international license (http://creativecommons.org/licenses/by/ . /), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons licence and indicate if changes were made. the images or other third party material in this chapter are included in the chapter's creative commons licence, unless indicated otherwise in a credit line to the material. if material is not included in the chapter's creative commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
key: cord- - w fkd authors: nan title: abstract date: - - journal: eur j epidemiol doi: . /s - - - sha: doc_id: cord_uid: w fkd

the organisers of the european congress of epidemiology , the board of the netherlands epidemiological society, and the board of the european epidemiology federation of the international epidemiological association (iea-eef) welcome you to utrecht, the netherlands, for this iea-eef congress. epidemiology is a medical discipline that is focussed on principles and methods of research on causes, diagnosis and prognosis of disease, and on establishing the benefits and risks of treatment and prevention. epidemiological research has proven its importance by contributions to the understanding of the origins and consequences of diseases, and has made major contributions to the management of diseases and the improvement of health in patients and populations. this meeting provides a major opportunity to affirm the scientific and societal contributions of epidemiological research in health care practice, both in clinical medicine and in public health. during this meeting major current health care problems are addressed alongside methodological issues, and the opportunities and challenges in approaching them are explored.
the exchange of ideas will foster existing co-operation and stimulate new collaborations across countries and cultures. the goal of this meeting is to promote the highest scientific quality of the presentations and to display advanced achievements in epidemiological research. our aim is to offer a comprehensive and educational programme in the field of epidemiological research in health care and public health, and to create room for discussions on contemporary methods and innovations from the perspective of policy makers, professionals and patients. above all, we want to stimulate open interaction among the congress participants. your presence in utrecht is key to an outstanding scientific meeting. the european congress of epidemiology is organised by epidemiologists of utrecht university, under the auspices of the iea-eef, and in collaboration with the netherlands epidemiological society. utrecht university, founded in , is the largest university in the netherlands and harbours the largest academic teaching hospital in the netherlands. the epidemiologists from utrecht university work in the faculties of medicine, veterinary medicine, pharmacy and biology. the current meeting was announced through national societies, taking advantage of their newsletters and of the iea-eef newsletter. in addition, avoiding the costs and disadvantages of traditional journal advertisements and leaflets, information about the congress was disseminated via an internet mailing list of epidemiologists, which was compiled from, among others, the meeting in porto in , the european young epidemiologist network (http://higiene.med.up.pt/eye/) and several institutions and departments. many of the procedures followed this year were based on or directly borrowed from the stimulating iea-eef congress in porto in . publication in an international journal of large circulation of the congress programme and the abstracts selected for oral and poster presentation signifies the commitment of the organisers towards all colleagues who decided to present their original work at our meeting, and is intended to promote our discipline and to further stimulate the quality of the scientific work of european epidemiologists. abstracts were judged on the appropriateness of the design and methods to the objectives and the quality of their description; presentation of results; importance of the topic; and originality. a final rating was given on a - point scale. the two junior epidemiologists independently evaluated each abstract. based on the ratings of the juniors, the senior epidemiologist gave a final abstract rating. the senior reviewer decided when the juniors disagreed, and guarded against untoward extreme judgements of the juniors. based on the judgement by the seniors, abstracts with a final rating of or higher were accepted for presentation. next, in order to shape the scientific programme according to scientific and professional topics and issues of interest for epidemiologists, members of the scientific programme committee grouped the accepted abstracts in major thematic clusters. for these, topics, keywords and title words were used. within each cluster, abstracts with a final rating of or higher, as well as abstracts featuring innovative epidemiological approaches, were prioritised to be programmed as an oral presentation.
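purely as an illustration of the two-stage review procedure described above, the acceptance step can be sketched in a few lines of python; the rating scale, the acceptance threshold and the example values below are placeholders, since the exact figures are not reproduced in this text.

```python
# sketch of the selection rule described above; the threshold and the example
# ratings are assumptions, not the congress's actual values.
from dataclasses import dataclass

@dataclass
class Abstract:
    junior_ratings: tuple   # two independent ratings by the junior reviewers
    senior_rating: int      # final rating given by the senior reviewer

def accepted(abstract: Abstract, threshold: int = 6) -> bool:
    # the senior's final rating is decisive; 'threshold' is an assumed cut-off
    return abstract.senior_rating >= threshold

submissions = [Abstract((7, 5), 6), Abstract((3, 4), 4)]
print([accepted(a) for a in submissions])   # -> [True, False]
```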
the submitted abstracts had an average final rating of (sd = ). in total abstracts ( %) with a final rating of or lower were rejected. because of the thematic programming, some abstracts with a final rating of or higher will be presented as posters, while a few with a final rating of are programmed as oral presentations. there were abstracts ( %) accepted and programmed for poster presentation; each poster will be displayed for a full day. in total abstracts ( %) are accepted for oral presentation. these are programmed in parallel sessions. based on the topics of their abstracts, the oral sessions were arranged into themes, notably epidemiology of diseases, methods clinical & population research, burden of disease, high risk populations, growth and development, public health strategies, and translational epidemiology. sessions from one theme are never programmed in parallel. in table we present the submitted and accepted abstracts (oral or poster) according to the distribution of country of origin. in table submitted abstracts are displayed according to their topic, as classified by the authors using the topic long list of the submission form. the scientific programme committee convened in a telephone meeting by the end of the summer of and decided on the above programming process. the scientific programme committee was informed on the result of the programming process by the end of april . fifteen abstracts were submitted for the eye session work in progress. of these, abstracts were selected for oral presentation and thereby nominated for the eye award. in total, abstracts were submitted in relation to the student award, of which were programmed for oral presentation and thus nominated. during the congress, authors of poster presentations may name themselves as candidates for the poster award. during the closing ceremony the winners of the student award and the poster award will be announced. these awards are an initiative of the netherlands epidemiological society, which will fund them in . according to the iea rules, expenses of congress participation for applicants from low-income countries will be covered. the board of the iea-eef will select a maximum of candidates; their travel and registration expenses will be (partly) covered from the congress budget. in order to stimulate participation from as many junior researchers and young epidemiologists as possible, the congress budget covers a registration fee reduction for undergraduate (msc) participants and eye members. this also holds for the registration fee reduction for iea-eef members and nes members.
it is years ago ( ) that the iea regional european meeting was held in the hague, the netherlands.

[index of presenting authors for the oral sessions and for the posters, ordered by abstract number, omitted here]

proteomics and genomics are supposed to be related to epidemiology and clinical medicine, among others because of the putative diagnostic usefulness of proteomics and genomics tests. hence, clinical and sometimes even public health applications are promised by the basic sciences. it is debated whether such promises and subsequent expectations are fulfilled. what are meaningful and consequential examples of current findings in proteomics, genomics and similar approaches in biomedical research? are they different from the ''classic'' tools and frameworks of clinical epidemiology? in the context of proteomics and genomics, etiologic studies, primary prevention, epidemiological surveillance and public health are concerned with the influence of environmental exposures on gene expression and on the accumulation of genetic alterations. proponents and advocates of proteomics and genomics have suggested that their products can yield clinically useful findings, e.g., for early diagnosis, for prognosis, or for therapeutic monitoring, without always needing to identify the proteins, peptides or other 'biomarkers' at stake. do we feel comfortable with this ''black-box'' reasoning, i.e. do we question the role of pathophysiological and mechanistic reasoning in clinical medicine? how much sense does it make for epidemiology to play with and scrutinize proteomics and genomics approaches in epidemiology and clinical medicine? what are at present (and in the near future) the main biological, clinical and public health implications of current findings in these research fields? in this plenary session these and other questions regarding the place and role of proteomics and genomics in clinical epidemiological research are discussed from different perspectives.

infectious diseases: beneficial or disaster for man? infectious diseases pose an increasing risk to human and animal health. they lead to increasing mortality, in contrast to the situation fifty years ago when new control measures still provided hope of overcoming many problems in the future. improved hygiene, better socio-economic circumstances, vaccination and the use of antibiotics have led to a gradual decline of tuberculosis, rheumatic fever, measles and mumps in western societies over the last five decades. paradoxically, absence of exposure to infectious agents has a major impact as well.
the decline in infectious disease risk is accompanied by a gradual increase of allergic and autoimmune diseases, and this association is believed to be causal. exposure to infectious agents from early on in life can markedly boost an individual's natural resistance and hence influence the individual's reaction to future exposure to both biological and non-biological antigens. in this plenary session we want to emphasise both aspects of the effect of infectious agents on human and animal health.

evidence based medicine in health care practice and epidemiological research p. glasziou & l. bonneux evidence-based medicine is defined as the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients. proponents of evidence-based medicine maintain that, coming from a tradition of pathophysiological rationale and rather unsystematic clinical experience, clinical disciplines should summarize and use evidence concerning their practices, by using principles of critical appraisal and quantitative clinical reasoning. for this they should convert clinical information needs into answerable questions, locate the available evidence, critically appraise that evidence for its validity and usefulness, and apply the results of the best available evidence in practice. applying the principles of evidence-based medicine implies improvement of the effectiveness and efficiency of health care. therefore, evidence-based medicine has commonalities with clinical medical and epidemiological research. for the integration of evidence-based medicine into health care practice, the challenge is to translate knowledge from clinical medical and epidemiological research, for example into up-to-date practice guidelines. the limitations of using evidence alone to make decisions are evident. the importance of the values and preference judgments that are implicit in every clinical management decision is also evident. critics of evidence-based medicine argue that applying the best available research evidence in practice in order to improve the effectiveness and efficiency of health care conflicts with the importance of the values and preference judgments in clinical management decisions. in this plenary session we want to contrast these and other viewpoints on evidence based medicine in health care practice.

statistical topics i: missing data prof. dr. t. stijnen this workshop will be an educational lecture on missing data by professor stijnen from the department of epidemiology and biostatistics of the erasmus mc rotterdam, the netherlands. every clinical or epidemiological study suffers from the problem of missing data. in practice mostly simple solutions are chosen, such as restricting the analysis to individuals without missing data or imputing missing values by the mean value of the individuals with observed data. it is not always realised that these simple approaches can lead to considerable bias in the estimates and standard errors of the parameters of interest. in the last to years much research has been done on better methods for dealing with missing values. in this workshop first an overview will be given of the methods that are available, and their advantages and disadvantages will be discussed. most attention will be given to multiple imputation, to date generally considered the best method for dealing with missing values. also the available software in important statistical packages such as spss and sas will be briefly discussed.
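as a concrete illustration of the multiple imputation idea sketched in this workshop description (the workshop itself discusses spss and sas; the code below is an independent python sketch on simulated data, not course material), several completed data sets are generated, each is analysed separately, and the results are pooled with rubin's rules:

```python
# minimal multiple-imputation sketch with pooling by rubin's rules;
# simulated data, scikit-learn's IterativeImputer and statsmodels OLS.
import numpy as np
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the imputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
x_obs = x.copy()
x_obs[rng.random(n) < 0.3] = np.nan              # 30% of the covariate missing

m = 5                                            # number of imputed data sets
estimates, variances = [], []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = imputer.fit_transform(np.column_stack([x_obs, y]))  # y helps impute x
    x_imp = completed[:, 0]
    fit = sm.OLS(y, sm.add_constant(x_imp)).fit()
    estimates.append(fit.params[1])
    variances.append(fit.bse[1] ** 2)

# rubin's rules: total variance = within-imputation + (1 + 1/m) * between-imputation
q_bar = np.mean(estimates)
u_bar = np.mean(variances)
b = np.var(estimates, ddof=1)
t = u_bar + (1 + 1 / m) * b
print(f"pooled slope {q_bar:.3f} (se {np.sqrt(t):.3f})")
```

the design choice worth noting is that the outcome is included in the imputation model and that the between-imputation variability is carried into the pooled standard error, which is exactly what single (e.g. mean) imputation fails to do.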
prof. dr. b. van hout suppose that one wants to know how often individuals have a certain characteristic, and suppose that one doesn't have any knowledge -any knowledge at all -how often this is the case. now, suppose that one starts by checking individuals and only finds individual with this characteristic. then the probability that the ' th individual has the characteristic is / . the fact that this is not / (although it will be close to that as the number of observations increases) may be counter-intuitive. it will become less so when realising how it is obtained from the formal integration of the new information with the complete uncertainty beforehand. this formal integration, with a prior indicating that it is as likely to be / as it is to be / as it is / , and with negative and positive observations, is by way of bayes rule. the italian mathematician, actuary, and bayesian, bruno de finetti ( - ), estimated that it would take until the year for the bayesian view of statistics to completely prevail. the purpose of this workshop is not only to convince the attendants that this is an appealing outlook but also to aid the workshop participants in realising this prediction. after a first introduction of the work of reverend bayes, a number of practical examples are presented and the attendant is introduced to the use of winbugs. a first example -introducing the notion of noninformative priors -concerns a random effects logistic regression analysis. second, the use of informative priors is illustrated (in contrast with non-informative priors) using an analysis of differences in quality of life as observed in a randomised clinical trial. it will be shown how taking account of such prior information changes the results, as well as how such information may increase the power of the study. in a third example, it will be shown how winbugs offers a powerful and flexible tool to estimate rather complex multi-level models in a relatively easy way and how to discriminate between various models. within this presentation some survival techniques (or stress control techniques) will be presented for when winbugs starts to spit out obscure error-codes without giving the researcher any clue where to search for the reason behind these errors.

communicating research to the public h. van maanen, prof. dr. ir. d. kromhout, a. aarts most researchers will at some point in their career face difficulties in communicating research results to the public. whereas most scientific publications will pass by the larger public in silence, now and then a publication provokes profound interest of the popular press. interest from the general public should be regarded as positive. after all, public money is put into research and a researcher has a societal responsibility of spreading new knowledge. however, often, general press interest is regarded as negative by the researcher. the messages get shortened, distorted or ridiculed. whose responsibility is this misunderstanding between press and researchers? should a researcher foresee press reactions and what can be done to prevent negative consequences? is the press responsible?

background and relevance: patients with a carotid artery stenosis, including those with an asymptomatic or moderate stenosis, have a considerable risk of ischemic stroke. identification of risk factors for cerebrovascular disease in these patients may improve risk profiling and guide new treatment strategies. objectives and question: we cross-sectionally investigated whether carotid stiffness is associated with previous ischemic stroke or tia in patients with a carotid artery stenosis of at least %. design and methods: patients were selected from the second manifestations of arterial disease (smart) study, a cohort study among patients with manifest vascular disease or vascular risk factors. arterial stiffness, measured as the change in lumen diameter of the common carotid arteries during the cardiac cycle, forms part of the vascular screening performed at baseline. the first participants with a stenosis of at least % in at least one of the internal carotid arteries measured by duplex scanning were included in this study. logistic regression analysis was used to determine the relation between arterial stiffness and previous ischemic stroke or tia. results: the risk of ischemic stroke or tia in the highest quartile (stiffest arteries) relative to the lowest quartile was . ( % ci . - . ). these findings were adjusted for age, sex, systolic blood pressure, minimal diameter of the carotid artery and degree of carotid artery stenosis. conclusion and discussion: in patients with a ≥ % carotid artery stenosis, increased common carotid stiffness is associated with previous ischemic stroke and tia. measurement of carotid stiffness may improve the selection of high-risk patients eligible for carotid endarterectomy and may guide new treatment strategies.
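the analysis reported in this abstract is, in form, a covariate-adjusted logistic regression with quartiles of carotid stiffness as the exposure. a minimal sketch of that type of model in python, using simulated data and invented variable names rather than the smart data, could look as follows:

```python
# sketch of a quartile-based, covariate-adjusted logistic regression;
# all data and variable names below are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "stiffness": rng.normal(size=n),
    "age": rng.normal(60, 8, size=n),
    "male": rng.integers(0, 2, size=n),
    "sbp": rng.normal(140, 20, size=n),
})
# quartiles of stiffness; the lowest quartile (q1) is the reference category
df["stiff_q"] = pd.qcut(df["stiffness"], 4, labels=["q1", "q2", "q3", "q4"])
lin_pred = 0.02 * (df["age"] - 60) + 0.4 * (df["stiffness"] > df["stiffness"].quantile(0.75))
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + lin_pred))))   # prior stroke/tia, simulated

model = smf.logit("event ~ C(stiff_q) + age + male + sbp", data=df).fit(disp=0)
odds_ratios = np.exp(model.params)          # adjusted OR of q4 vs q1 is the quantity of interest
conf_int = np.exp(model.conf_int())
print(odds_ratios.filter(like="stiff_q"))
print(conf_int.filter(like="stiff_q", axis=0))
```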
objectives and question: we cross-sectionally investigated whether carotid stiffness is associated with previous ischemic stroke or tia in patients with a carotid artery stenosis of at least %. design and methods: patients were selected from the second manifestations of arterial disease (smart) study, a cohort study among patients with manifest vascular disease or vascular risk factors. arterial stiffness, measured as change in lumen diameter of the common carotid arteries during the cardiac cycle, forms part of the vascular screening performed at baseline. the first participants with a stenosis of minimally % in at least one of the internal carotid arteries measured by duplex scanning were included in this study. logistic regression analysis was used to determine the relation between arterial stiffness and previous ischemic stroke or tia. results: the risk of ischemic stroke or tia in the highest quartile (stiffest arteries) relative to the lowest quartile was . ( % ci . - . ). these findings were adjusted for age, sex, systolic blood pressure, minimal diameter of the carotid artery and degree of carotid artery stenosis. conclusion and discussion: in-patients with a ‡ % carotid artery stenosis, increased common carotid stiffness is associated with previous ischemic stroke and tia. measurement of carotid stiffness may improve selection of high-risk patients eligible for carotid endarterectomy and may guide new treatment strategies. background (and relevance): patients with advanced renal insufficiency are at increased risk for adverse cardiovascular disease (cvd) outcomes. objectives and question: the aim was to establish whether impaired renal function is an independent predictor of cvd and death in an unselected high-risk population with cvd. design and methods: the study was performed in patients with cvd. primary outcomes were all vascular events and all cause death. during a median follow-up of months, patients had a vascular event ( %) and patients died ( . %). results: the adjusted hazard ratio (hr) of an estimated glomerular filtration rate < vs > ml/min per . m was . ( % ci . - . ) for vascular events and . ( % ci . - . ) for all cause death. for stroke as a separate outcome it was . ( % ci . - . ) . subgroup analysis according to vascular disease at presentation or the risk factors hypertension, diabetes and albuminuria had no influence on the hr's. conclusion and discussion: renal insufficiency is an independent risk factor for adverse cvd events in patients with a history of vascular disease. renal function was a particularly important factor in predicting stroke. the presence of other risk factors hypertension, diabetes or albuminuria had no influence on the impact of renal function alone. background and relevance: patients with hypertension have an increased case-fatality during acute mi. coronary collateral (cc) circulation has been proposed to reduce the risk of death during acute ischemia. objectives and question: we determined whether and to which degree high blood pressure (bp) affects the presence and extent of cc-circulation. design and methods: cross-sectional study in patients ( % males), admitted for elective coronary angioplasty between january and july . collaterals were graded with rentrop's classification (grade - ). ccpresence was defined as rentrop-grade ‡ . bp was measured twice with an inflatable cuff-manometer in seated position. pulse pressure was calculated by systolic blood pressure (sbp) )diastolic blood pressure (dbp). 
mean arterial pressure was calculated as dbp + 1/3 * (sbp - dbp). systolic hypertension was defined by a reading ≥ mmhg. we used logistic regression with adjustment for putative confounders. results: sbp (odds ratio (or) . per mmhg; % confidence interval (ci) . - . ), dbp (or . per mmhg; % ci . - . ), mean arterial pressure (or . per mmhg; % ci . - . ), systolic hypertension (or . ; % ci . - . ), and antihypertensive treatment (or . ; % ci . - . ) each were inversely associated with the presence of ccs. also, among patients with ccs, there was a graded, significant inverse relation between levels of sbp, levels of pulse pressure, and collateral extent. conclusion and discussion: there is an inverse relationship between bp and the presence and extent of cc-circulation in patients with ischemic heart disease.

background and relevance: silent brain infarcts are associated with decreased cognitive function in the general population. objectives and question: we examined whether this relation also exists in patients with symptomatic arterial disease. furthermore, we compared the cognitive function of patients with stroke or tia with the cognitive function of patients with symptomatic arterial disease at other sites in the arterial tree. design and methods: an extensive screening was done in consecutive patients participating in the second manifestations of arterial disease (smart) study, including a neuropsychological test. inclusion diagnoses were cerebrovascular disease, symptomatic coronary artery disease, peripheral arterial disease, or abdominal aortic aneurysm. mri examination was performed to assess the presence of silent infarcts in patients without symptomatic cerebrovascular disease. the patients were assigned to one of three categories according to their patient history and inclusion diagnosis: no stroke or tia, no silent infarcts (n = ; mean age years); no stroke or tia, but silent infarcts present (n = ; mean age years); stroke or tia at

background and relevance: patients with manifest vascular disease are at high risk of a new vascular event or death. modification of classical risk factors is often not successful. objectives and question: we determined whether the extra care of a nurse practitioner (np) could be beneficial to the cardiovascular risk profile of high-risk patients. design and methods: randomised controlled trial based on the zelen design. patients with manifestations of a vascular disease and who had ≥ modifiable vascular risk factors were pre-randomised to receive treatment by a np plus usual care or usual care alone. after year, risk factors were re-measured. the primary endpoint was achievement of treatment goals for risk factors. results: of the pre-randomised patients, of ( %) in the intervention group and of ( %) in the control group participated in the study. after a mean follow-up of months, the patients in the intervention group achieved significantly more treatment goals than did the patients in the control group (systolic blood pressure % versus %, total cholesterol % vs %, ldl-cholesterol % vs %, and bmi % vs %). medication use was increased in both groups and no differences were found in patients' quality of life (sf- ) at follow-up. conclusion and discussion: treatment delivered by nps, in addition to a vascular risk factor screening and prevention program, resulted in better management of vascular risk factors compared to usual care alone in vascular patients after year of follow-up.
were used as non-invasive markers of vascular damage and adjusted for age and sex if appropriate. results: the prevalence of the metabolic syndrome in the study population was %. in pad patients this was %; in chd patients %, in stroke patients % and in aaa patients %. patients with the metabolic syndrome had an increased mean imt ( . vs. . mm, p-value < . ), more often a decreased abpi ( % vs. %, p-value . ) and an increased prevalence of albuminuria ( % vs. %, p-value . ) compared to patients without this syndrome. an increase in the number of components of the metabolic syndrome was associated with an increase in mean imt (p-value for trend < . ), lower abpi (p-value for trend < . ) and a higher prevalence of albuminuria (p-value for trend < . ). conclusion and discussion: in patients with manifest vascular disease the presence of the metabolic syndrome is associated with advanced vascular damage.

background (and relevance): in patients with type diabetes the progression of atherosclerosis is accelerated, as observed by the high incidence of cardiovascular events. objectives (and question): to estimate the influence of location and extent of vascular disease on new cardiovascular events in type diabetes patients. design and methods: diabetes patients (n = ), mean age years, with and without prior vascular disease were followed through - (mean follow-up years). patients with vascular disease (n = ) were classified according to symptomatic vascular location and number (extent) of locations. we analyzed the occurrence of new (non-)fatal cardiovascular events using cox proportional hazards models and kaplan-meier analysis. results: multivariate-adjusted hazard ratios (hrs) were comparable in diabetes patients with cerebrovascular disease (hr . ; % ci . - . ), coronary heart disease (hr . ; . - . ) and peripheral arterial disease (hr . ; . - . ), compared to those without vascular disease. the multivariate-adjusted hr was . ( . - . ) in patients with one vascular location and . ( . - . ) in those with ≥ locations. the -year risks were respectively . % ( . - . ) and . % ( . - . ). conclusion and discussion: diabetes patients with prior vascular disease have an increased risk of cardiovascular events, irrespective of symptomatic vascular location. cardiovascular risk increased with the number of locations. the data emphasize the necessity of early and aggressive treatment of cardiovascular risk factors in diabetes patients.
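the survival methods named in the diabetes abstract above (cox proportional hazards regression and kaplan-meier analysis) can be sketched as follows with the python lifelines package; the data frame, effect sizes and column names are invented for illustration and are not the study data:

```python
# illustrative cox proportional hazards / kaplan-meier sketch (lifelines);
# simulated cohort with invented column names -- not the study data.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter

rng = np.random.default_rng(7)
n = 1000
df = pd.DataFrame({
    "age": rng.normal(65, 9, size=n),
    "prior_vascular_locations": rng.integers(0, 3, size=n),   # 0, 1 or 2+ symptomatic locations
})
baseline_hazard = 0.02
hr_per_location = 1.6                                          # assumed effect per extra location
rate = baseline_hazard * hr_per_location ** df["prior_vascular_locations"]
df["time"] = rng.exponential(1 / rate)                         # follow-up time in years
df["event"] = (df["time"] < 8).astype(int)                     # administrative censoring at 8 years
df.loc[df["event"] == 0, "time"] = 8.0

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")            # age and locations enter as covariates
cph.print_summary()                                            # hazard ratios with confidence intervals

km = KaplanMeierFitter()
km.fit(df["time"], event_observed=df["event"])                 # overall survival curve
print(km.survival_function_.tail())
```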
background (and relevance): despite recent advances in medical treatment, cardiovascular disease (cvd) is still health problem number one in western societies. a multifactorial approach with the aid of nurse practitioners (nps) is beneficial for achieving treatment goals and reducing events in patients with manifest cvd. objectives (and question): in the self-management of vascular patients activated by internet and nurses (spain) pilot study, we want to implement and test a secure personalized website with additional treatment and coaching by a np for hypertension, hyperlipidemia, diabetes mellitus, smoking and obesity in patients with clinical manifestations of cvd. design and methods: interested patients are going to use the secure patient-specific website. before the use of the web application, risk factors are measured. realistic treatment goal(s) for elevated risk factors based on current guidelines are set, and agreements on how to achieve the treatment goal(s) are made between the patient and the np in a face-to-face contact. patients can, for instance, enter their own weight or a new blood pressure measurement, besides the regular exchange of information with the responding np through e-mail messages. the np personally replies as quickly as possible and gives regular but protocol-driven feedback and support to the patient. the risk factors are re-measured after six months. conclusion and discussion: the spain study aims to implement and test a patient-specific website. the secondary outcome is the change in cardiovascular risk profile. the pre-post measurements of risk factors and the number of corresponding e-mail messages between the patient and the np will be used to assess the feasibility of this innovative way of risk factor management.

background (and relevance): modification of vascular risk factors has been shown to be effective in reducing mortality and morbidity in patients with symptomatic atherosclerosis. nevertheless, reduction of risk factors in clinical practice is difficult to achieve and maintain. objectives (and question): in the risk management in utrecht and leiden evaluation (rule) study, a prospective, comparative study, we assess the effects of a multidisciplinary vascular screening program on improvement of the cardiovascular risk profile and compare this to a setting without such a program that provides current standard practice in patients referred for cardiovascular disorders. design and methods: patients with diabetes mellitus, coronary artery disease, cerebrovascular disease, or peripheral arterial disease ( per disease category in each hospital) referred by the general practitioner will be enrolled; enrolment started in january . at the umcu, patients need to be enrolled in the vascular screening program or will be identified through the hospital registration system. at the lumc patients will be identified through the hospital registration system. risk factors will be measured in the two hospitals at baseline and one year after their initial visit. a risk function will be developed for this population based on data of the whole cohort. analysis will be performed on the two comparison groups as a whole, and on subgroups per disease category. changes in risk factors will be assessed with linear or logistic regression procedures, adjusting for differences in baseline characteristics between groups. conclusion and discussion: the rule study is aimed at evaluating the added value of a systematic hospital-based vascular screening program on risk factor management in patients at high risk for vascular diseases.

background: signs of early cerebral damage are frequently seen on mri scans in elderly people. they are related to future manifest cerebrovascular disease and cognitive deterioration. cardiovascular risk factors can only partially explain their presence and progression. evidence that inflammation is involved in atherogenesis continues to accumulate. chronic infections can act as an inflammatory stimulus. it is possible that subclinical inflammation and chronic infections play a role in the pathogenesis of early cerebral damage. objectives (and question): to unravel the role of inflammation and chronic infection in the occurrence and progression of early cerebral damage in patients with manifest vascular disease. design and methods: participants of the smart study with manifest vascular disease underwent an mr investigation of the brain between may and december . starting in january , all patients are invited for a second mr of the brain after an average follow-up period of four years.
both at baseline and after follow-up all cardiovascular risk factors are measured and blood samples are stored to assess levels of inflammatory biomarkers and antibodies against several pathogens. occurrence and progression of early cerebral damage is assessed by measuring the volume of white matter lesions, the number of silent brain infarctions, cerebral atrophy, aberrant metabolic ratios measured with mr spectroscopy, and cognitive function at baseline and after follow-up. the relation between inflammation, chronic infection and the occurrence and progression of early cerebral damage will be investigated using both cross-sectional and longitudinal analysis.

abstract monocyte chemoattractant protein (mcp- ) polymorphism and susceptibility for coronary collaterals j.j. regieli , , j. koerselman , ng sunanto , , m. entius , p.p. de jaegere , y. van der graaf , d.e. grobbee , p.a. doevendans heart lung institute, dept of cardiology; clinical epidemiology, julius center for health sciences and primary care, utrecht, netherlands background (and relevance): collateral formation is an important beneficial condition during an acute ischemic event. a marked interindividual variability in high-risk patients is seen, but at present the basis for this variability is unclear and cannot be explained solely by environmental factors. a genetic factor might be present that could influence coronary collateral formation. objectives (and question): we have analyzed the association between a single nucleotide polymorphism in mcp- and the formation of coronary collaterals in patients admitted for angioplasty. mcp- has been suggested to play an important role in collateral development. design and methods: this study involved caucasian patients who were admitted for coronary angioplasty. coronary collateral development was defined angiographically as rentrop grade ≥ . polymorphisms in the promoter region of mcp- ( ) were identified by pcr and allele-specific restriction digestion. this method allows identification of individuals with either aa, ag or gg at mcp- position ( ). statistical analysis was performed using a chi-square test, unconditional logistic regression, a likelihood ratio test and a wald test. results: we could genotype of the patients. coronary collaterals (rentrop grade > ) were found in patients. the allele frequencies for aa, ag and gg were . %, . % and . %, respectively. the distribution of mcp- genotypes in subjects without collaterals was in hardy-weinberg equilibrium. we found that individuals with the g allele ( %) were more likely to have collaterals than those with homozygous aa (or . , % ci . to . ), adjusted for potential confounders. linear regression shows that the g allele increased the likelihood of collateral presence by a factor . . conclusion and discussion: this study provides evidence for a role of genetic variation in the mcp- gene in the occurrence of coronary collaterals in high-risk patients.

until september , patients with recently established clinically manifest atherosclerotic disease with > modifiable vascular risk factors were selected for the study. the mean self-efficacy scores were calculated for vascular risk factors (age, sex, vascular disease, weight, diabetes mellitus, smoking behavior, hypercholesterolemia, hypertension, and hyperhomocysteinemia). results: diabetes, overweight, and smoking, but none of the other risk factors, were significantly associated with the level of self-efficacy in these patients. patients with diabetes had lower self-efficacy scores ( .
) for exercise and controlling weight ( . ) than patients without diabetes ( . p = . ) and ( . p = . ) respectively. overweight patients scored low on controlling weight ( . and . p< . ) and choosing healthy food ( . and . p = . ) than patients who were on a healthy weight ( . and . ). conclusion and discussion: patients with vascular diseases appear to have high levels of self-efficacy regarding medication use ( . ), exercise ( . ), and controlling weight ( . ). in patients with diabetes, overweight and in smokers, self efficacy levels were lower. practice implications: in nursing care and research on developing self-efficacy based interventions, lower self-efficacy levels can be taken into account for specific vascular patient groups. background (and relevance): little is known about the role of serum uric acid in the metabolic syndrome and increased risk of cardiovascular disease. we investigated the association between uric acid levels and the metabolic syndrome in a population of patients with manifest vascular diseases and whether serum uric acid levels conveyed an increased risk for cardiovascular disease in patients with the metabolic syndrome. design and methods: this is a nested case-cohort study of patients originating from the second manifestations of arterial disease (smart) study. all patients had manifest vascular diseases, constituting peripheral artery disease, cerebral ischemia, coronary artery disease and abdominal aortic aneurysm. analyzing the relationship of serum uric acid with the metabolic syndrome, age, sex, creatinine clearance, alcohol and diuretics were considered as confounders. investigating the relationship of serum uric acid levels with the risk for cardiovascular disease, values were adjusted for age and sex. results: the metabolic syndrome was present in % of the patients. serum uric acid levels in patients with metabolic syndrome were higher compared to patients without ( . ± . mmol/l vs. . ± . mmol/l). serum uric acid concentrations increased with the number of components of the metabolic syndrome ( . mmol/l to . mmol/l) adjusted for age, sex, creatinine clearance, alcohol and use of diuretics. increased serum uric acid concentrations showed to be independently associated with the occurrence of cardiovascular events in patients without the metabolic syndrome (age en sex adjusted hr: . , % ci . - . ) , contrary to patients with the metabolic syndrome (adjusted hr: . , % ci . - . ). conclusion: elevated serum uric acid levels are strongly associated with the metabolic syndrome, yet are not linked to an increased risk for cardiovascular disease in patients with the metabolic syndrome. however, in patients without the metabolic syndrome elevated serum uric acid levels are associated with increased risk for cardiovascular disease. the objective of this study is to investigate the overall and combined role of late-life depression, prolonged psychosocial stress exposure, and stress hormones in the etiology of hippocampal atrophy and cognitive decline. design and methods: as part of the smart study, participants with manifest vascular disease underwent an mri of the brain between may and december . in a subsample of subjects, cognitive function and depressed mood were assessed. starting in january , all patients are invited for a follow-up mri of the brain. 
at this follow-up measurement, minor and major depression, hypothalamic-pituitary-adrenal (hpa) axis function indicated by salivary cortisol, psychosocial stress exposure indicated by stressful life events early and later in life, and cognitive functioning will also be assessed. the independent and combined effects of late-life depression, (change in) hpa-axis activity, and psychosocial stress exposure on risk of hippocampal atrophy and cognitive decline will be estimated with regression analysis techniques adjusting for potential confounders. introduction the netherlands epidemiological society advocates according to good epidemiological practice, that research with sound research questions and good methodology should be adequately published independent of the research outcomes. although reporting bias in clinical trials is fully acknowledged, failure to report outcomes or selective reporting of outcomes in non-clinical trial epidemiological studies is less well known, but most likely occurs as well. in this mini-symposium the netherlands epidemiological society wants to give attention to this phenomenon of not publishing research outcomes, to encourage publication of all outcomes of adequate research. different scopes to this subject will be addressed: the background, an example of occurrence, initiatives to possibly avoid it and an editor's point of view. selective reporting of outcomes in clinical studies (reporting bias) has been described to occur frequently. therefore a registration of clinical trials is started which enables to address this problem in the future since occurrence of not publishing negative or adverse outcomes can be investigated with this registration. in non-clinical epidemiological studies the failure to report outcomes or selective reporting of outcomes most likely occurs as well, but is less studied and reported. again studies with negative outcomes or no associations are the ones most likely not to be reported. the most important obstacles for not publishing no or negative associations are tradition and priorities of researchers and journals. the reviewers might play a role in this as well. the netherlands epidemiological society advocates according to good epidemiological practice, that research with sound research questions and good methodology should be adequately published independent of the research outcomes. however, reality occurs not to be accordingly. therefore we would like to give attention to this phenomenon of not publishing research outcomes in non-trial-based epidemiological studies, to encourage publication of all outcomes of adequate research. in this mini-symposium, firstly the effects of failure or selective publishing of outcomes on subsequent meta-analysis in a non-clinical research setting will be demonstrated. afterwards, initiatives to promote and improve publication of observational epidemiological research will be addressed, the editor's point of view on this phenomenon will be given and finally concluding remarks will be given. background: there are several reasons for suspecting reporting bias in time-series studies of air pollution. such bias could lead to false conclusions concerning causal associations or inflate estimates of health impact. objectives: to examine time-series results for evidence of publication and lag selection bias. design and methods: all published time-series studies were identified and relevant data extracted into a relational database. effect estimates were adjusted to an increment of lg/m . 
publication bias was investigated using funnel plots and two statistical methods (begg, egger). adjusted summary estimates were calculated using the ''trim and fill'' method. the effect of lag selection was investigated using data on mortality from us cities and from a european multi-centre panel study of children. results: there was evidence of publication bias in a number of pollutant-outcome analyses. adjustment reduced the summary estimates by up to %. selection of the most significant lag increased estimates by over % compared with a fixed lag. conclusion and discussion: publication and lag selection bias occurs in these studies but significant associations remain. presentation and publication of time-series results should be standardised. background: selective non-publication of study outcomes hampers the critical appraisal and appropriate interpretation of available evidence. its existence could be shown empirically in clinical trials. observational research often uses an exploratory approach rather than testing specific hypotheses. results of multiple data analyses may be selected based on their direction and significance. objectives: to improve the quality of reporting of observational studies. to help avoid selective non-publication of study outcomes. methods: ''strengthening the reporting of observational studies in epidemiology (strobe)'' is an international multidisciplinary initiative that currently develops a checklist of items recommended for the reporting of observational studies (http:// www.strobe-statement.org). results: strobe recommends to avoid selective reporting of 'positive' or 'significant' study results and to base the interpretation on main results rather than on results of secondary analyses. discussion: strobe cannot prevent data dredging, but it promotes transparency at the publication stage. for instance, if multiple statistical analyses were performed in a large dataset to identify new exposure-outcome associations, authors should give details and not only report significant associations. strobe could have a ''feedback effect'' on study quality since, ideally, researchers think ahead when a study is planned and consider points that are essential for later publication. good publishing practice begins with researchers considering ( ) whether an intended study can bring added value, irrespective its result, ( ) and whether its methodology is valid to pick up positive and negative outcomes equally well. when reporting ( ) they should adequately discuss the significance of a negative result ( ) and be as eager to publish negative results as positive ones. as to editors, intentional bias in relation to study results is considered editorial malpractice, whatever its motivation. unintentional bias may be more frequent but will not easily be noticed, also by editors. editorial responsibility implies several levels (accepting for review, choice of reviewers, assess their reviews, decision making, and a repeated process in case of resubmission). various designs for process evaluation can be considered. evaluation will be more difficult for journals with few professional support. collaboration between journals can help, and may also avoid 'self evaluation bias'. in line with registering of randomized trials, registers for observational study protocols could facilitate monitoring for bias and searching unpublished results. but practicalities, methodological requirements, and bureaucratic burden should not be underestimated. 
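the air-pollution time-series abstract above assesses publication bias with funnel-plot asymmetry tests (begg, egger) and a trim-and-fill adjustment. a minimal sketch of egger's regression test is given below, assuming per-study effect estimates and standard errors are available; the numbers are invented for illustration only.

```python
# Egger's regression test for funnel-plot asymmetry: regress each study's
# standard normal deviate (effect / SE) on its precision (1 / SE). An intercept
# that differs clearly from zero suggests small-study / publication bias.
# Effect sizes and standard errors below are invented placeholders.
import numpy as np
import statsmodels.api as sm

effect = np.array([0.012, 0.020, 0.008, 0.035, 0.015, 0.027])  # e.g. ln(RR) per unit increment
se = np.array([0.005, 0.009, 0.004, 0.015, 0.007, 0.012])

snd = effect / se          # standard normal deviate
precision = 1.0 / se

ols = sm.OLS(snd, sm.add_constant(precision)).fit()
intercept, slope = ols.params
print(f"Egger intercept {intercept:.2f} (p = {ols.pvalues[0]:.3f}); "
      f"bias-corrected slope {slope:.4f}")
```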
in principle, in an era of electronic publishing every study can be made widely accessible widely also if not 'accepted', by editors or authors themselves. however, this would need huge changes in culture of authoring and reading, editorial practice, publishing business, and scientific openness. background: high circulating levels of insulin-like growth factor-i (igf-i), a mitogenic and anti-apoptotic peptide, have been associated with increased risk of several cancer types. objective: to study circulating levels of igf-i and igf binding protein- (igfbp- ) in relation to ovarian cancer risk. design and methods: within the european prospective investigation into cancer and nutrition (epic), we compared levels of igf-i and igfbp- measured in blood samples collected at baseline in women who subsequently developed ovarian cancer ( women diagnosed before age ) and controls. results: the risk of developing ovarian cancer before age ('premenopausal' was increased among women in the middle or top tertiles of igf-i, compared to the lowest tertile: or = . [ % ci: . - . ], and or = . [ % ci: . - . ], respectively (p trend = . ). results were adjusted for bmi, previous hormone use, fertility problems and parity. adjustment for igfbp- levels slightly attenuated relative risks. in older women we observed no association between igf-i, igfbp- and ovarian cancer risk. discussion and conclusion: in agreement with the only other prospective study in this field (lukanova et al, int j cancer, ) , our results indicate that high circulating igf-i levels may increase the risk of premenopausal ovarian cancer. background: the proportion of glandular and stromal tissue in the breast (percent breast density) is a strong breast cancer risk factor. insulin-like growth factor (igf- ) is hypothesized to influence breast cancer risk by increasing breast density. objectives: we studied the relation between premenopausal circulating igf- levels and changes in breast density over menopause. design and methods: mammograms and blood samples of premenopausal participants of the prospect-epic cohort were collected at baseline. a second mammogram was collected after these women became postmenopausal. we determined serum igf- levels. mammographic density was assessed using a computer-assisted method. changes in percent density over menopause were calculated for quartiles of igf- , using linear regression, adjusted for age and bmi. results: premenopausal percent density was not associated with igf- levels (mean percent density . in all quartiles). however, women in the highest igf- quartile showed less decrease in percent density over menopause ( st quartile: ) . vs th quartile: ) . , p-trend = . ). this was mostly explained by a stronger decrease of total breast size in women with high igf- levels. conclusion and discussion: women with high igf- levels show a lower decrease of percent density over menopause than those with low igf- levels. background: body mass index (bmi) has been found to be associated with risk of colon cancer in men, whereas weaker associations have been reported for women. reasons for this discrepancy are unclear but may be related to fat distribution or use of hormone replacement therapy (hrt) in women. objective: to examine the association between anthropometry and risk of colon cancer in men and women. design and methods: during . years of followup, we identified cases of colon cancer among , subjects free of cancer at baseline from european countries. 
results: bmi was significantly related to colon cancer risk in men (rr per kg/ m , . ; %-ci . - . ) but not in women (rr . ; . - . ; p interaction = . ), whereas waist-hip-ratio (whr) was equally strong related to risk in both genders (rr per . , men, . ; %-ci . - . ; women, . ; . - . ; p interaction = . ). the positive association for whr was not apparent among postmenopausal women who used hrt. conclusions: abdominal obesity is an equally strong risk factor for colon cancer in both sexes and whr is a disease predictor superior to bmi in women. the association may vary depending on hrt use in postmenopausal women; however, these findings require confirmation in future studies. background: fruits and vegetables are thought to protect against colorectal cancer. recent cohort studies, however, have not been able to show a protective effect. patients & methods: the relationship between consumption of vegetables and fruit and the incidence of colorectal cancer within epic was examined among , subjects of whom developed colorectal cancer. a multivariate cox proportional hazard model was used to determine adjusted cancer risk estimates. a calibration method based on standardized -hour dietary recalls was used to correct for measurement errors. results: after adjustment for potential confounding and exclusion of the first two years of follow-up, the results suggest that consumption of vegetables and fruits is weakly, inversely associated with risk of colorectal cancer (hr . , . , . , . , . , for quintiles of intake, % ci upper quintile . - . , p-trend . ), with each gram daily increase in vegetables and fruit associated with a statistically borderline significant % reduction in colorectal cancer risk (hr . ; . - . ). linear calibration strengthened this effect. further subgroup analyses will be presented. conclusion: findings within epic support the hypothesis that increased consumption of fruits and vegetables may protect against colorectal cancer risk. a diverse consumption of vegetables and fruit may influence the risk of gastric and oesophageal cancer. diet diversity scores (dds) were calculated within the epic cohort data from > , subjects in european countries. four scores, counting the number of ffq-based food-items usually eaten at least once in two weeks, were calculated to represent the diversity in the overall vegetable and/or fruit consumption. after an average follow-up of . years, incident cases of gastric and oesophageal cancer were observed. cox proportional hazard models were used to compute tertile specific risks, stratified by follow-up duration, gender and centre and adjusted for total consumption of vegetables and fruit and potential confounders.preliminary findings suggest that, compared to individuals who eat from only or less vegetable sub-groups, individuals who usually eat from eight different subgroups, have a reduced gastric cancer risk (hr . ; % ci . - . ). in comparison to all others, individuals who usually eat only the same fruit may experience an elevated risk (hr . ; % ci . - . ). these findings from the epic study suggest that a diverse consumption of vegetables may reduce gastric and oesophageal cancer risk. subjects with a very low diversity in fruit consumption may experience higher risk. g. steindorf , l. friedenreich , j. linseisen , p. vineis , e. 
riboli for the epic group german cancer research center, heidelberg, germany alberta cancer board, alberta, canada imperial college london, great-britain background: previous research on physical activity and lung cancer risk, conducted predominantly in males, has yielded inconsistent results. objectives: we examined this relationship among , men and women from the epic-cohort. design and methods: during . years of follow-up we identified men and women with incident primary lung cancer. detailed information on recreational, household and occupational physical activity, smoking habits, and diet was assessed. relative risks (rr) were estimated using cox regression. results: we did not observe an inverse association between occupational, recreational or household physical activity and lung cancer risk either in males or in females. we found a modest reduction in lung cancer risk associated with sports in males and cycling in females. for occupational physical activity, lung cancer risk was increased for unemployed men (rr = . ; % confidence interval . - . ) and men with standing occupations (rr = . ; . - . ) compared with sitting professions. conclusion: our study shows no convincing protective associations of physical activity with lung cancer risk. discussion: it may be speculated that the elevated risks for occupational physical activity could reflect the higher probability that manual workers are exposed to industrial carcinogens compared to workers having sitting/office jobs. purposes: epidemiological research almost always means using data and, increasingly, human tissue as well. the use of these resources is not free but is subject to various regulations, which differ in the european countries on several important aspects. usually these regulations have been determined without involvement of active epidemiological researchers or patient organisations. this workshop will address the issues involved in these regulations in the european context. it will serve the following purposes: -to provide arguments and tools and to exchange best practices for a way out of the regulatory labyrinths especially in cross european research projects; -to provide a platform for epidemiologists and patient groups to discuss their concerns about impediments for epidemiological research with other parties, like data protection authorities. targeted audience: the mini symposium is primarily meant for epidemiologists, but provides an excellent opportunity to meet and discuss with other stakeholders, like from patient groups, data protection authorities, the european commission etc. as well. therefore program allows for extra time for discussion. the other stakeholders will be explicitly invited. a special 'day ticket' is available to satellite symposium epidemiology and the seventh eu research framework over the last few years the seventh eu research framework has been drafted. it is now rapidly moving towards the first calls for proposals. previous eu research programmes and frameworks have been criticised because they are considered to include too few possibilities for epidemiological research and public health research. this satellite-symposium will provide an outline of the research framework and inform researchers about the current state of affairs of the seventh eu research framework. special focus will be on the possibilities for epidemiology and public health research. - . welcome by our host prof. 
jan willem coebergh, rotterdam, introduction, international and national regulations on the use of data and tissue or research in europe, different approaches to: evert-ben van veen l.l.m. (medlawconsult, the netherlands) -'identifiability' of data -consent for using data and tissue for research the tubafrost code of conduct to exchange data and tissue across europe. . - . data and tissuebanking for research in denmark: a liberal approach the danish approach to use patient data for epidemiological research, cooperation of the danish data protection authority, the danish act of to use anonymous but coded tissue for research based on an opt-out system, first experiences hans storm ph.d. (copenhagen, denmark) . - . estonian data protection act: a disaster for epidemiology the story of the birth of the act, implementing the european data protection directive and of its consequences reveal political and administrative incapability resulting in gradual vanishing of register-based epidemiological research. background: non-invasive assessment of atherosclerosis is important. most of the evidence of coronary calcium has been based on images obtained by electron beam ct (ebct). current data suggest that ebct and multi-slice ct (msct) give comparable results. since msct is more widely available than ebct, information on its reproducibility is relevant. objective: to assess inter-scan reproducibility of msct and to evaluate whether reproducibility is affected by different measurement protocols, slice thickness, cardiovascular risk factors and technical variables. design: cross-sectional study. materials and methods: the study population comprised healthy postmenopausal women. coronary calcium was assessed in these women twice at two separate visits using msct (philips mx idt ). images were made using . and . mm slice thickness. the agatston, volume and mass scores were assessed. reproducibility was determined by mean differences, absolute mean differences and intra-class correlation coefficients (iccc). results: the reproducibility of coronary calcium measurements between scans was excellent with iccc of > . , and small mean and absolute mean differences. reproducibilility was similar for . as for . mm slices, and equal for agatston, volume and mass measurements. conclusion: inter-scan reproducibilility of msct is excellent, irrespective of slice thickness and type of calcium parameter. background: it has been suggested that the incidence of colorectal cancer is associated with socioeconomic status (ses). the major part of this association may be explained by known lifestyle risk factors such as dietary habits. objective: to explore the association between diet and ses measured at area-based level. methods: the data for this analysis were taken from a multi-centre case-control study conducted to investigate the association between some environmental, genetic factors and colorectal cancer incidence. the townsend scores (as deprivation index) were categorized into fifths. a linear regression analysis was used to estimate difference in mean of each continuous variable of diet by deprivation fifth. results: the mean of processed meat consumption in the most deprived area was higher compared to the mean of that in the most affluent areas (mean difference = . , % ci: . , . ). by contrast, the mean of vegetables and fruits consumption in the most deprived areas was lower than that in the affluent areas. conclusion: our findings suggest that lifestyle factors are likely to be related to ses. 
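the ses abstract above estimates differences in mean dietary intake across townsend deprivation fifths with linear regression. a minimal sketch is given below; the column names (processed_meat, townsend_fifth) and the simulated data are hypothetical stand-ins, not the case-control study's variables.

```python
# Linear regression of a continuous dietary variable on deprivation fifth,
# treating the Townsend fifth as a categorical exposure (most affluent = reference).
# Data and column names are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1500
fifth = rng.integers(1, 6, n)                                    # 1 = most affluent, 5 = most deprived
processed_meat = 20 + 2.5 * (fifth - 1) + rng.normal(0, 10, n)   # g/day, rising with deprivation
df = pd.DataFrame({"townsend_fifth": fifth, "processed_meat": processed_meat})

model = smf.ols("processed_meat ~ C(townsend_fifth)", data=df).fit()
# Coefficient for fifth 5 = mean difference vs the most affluent fifth, with 95% CI.
print(model.params["C(townsend_fifth)[T.5]"],
      model.conf_int().loc["C(townsend_fifth)[T.5]"].values)
```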
thus any relation between ses and colorectal cancer may direct us to seek for the role of different life style factors in aetiology of this cancer. background: the reason for the apparent decline in semen quality during the past years is still unexplained. objective: to investigate the effect of exposure to cigarette smoke in utero on the semen quality in the male offspring. design and methods: in this prospective follow-up study, adult sons of mothers, who during pregnancy provided information about smoking and other lifestyle factors, are sampled in six strata according to prenatal tobacco smoke exposure. each man provides a semen sample, a blood sample, and answers a questionnaire, which is collected in a mobile laboratory. external quality assessment of semen analysis is performed twice a year. results: until now, a total of men have been included. the participation rate is %. the percentage of men with decreased sperm concentration (< mill/ml) is %. the unadjusted median ( - % percentile) sperm concentration in the non-exposed group (n = ) is ( - ) mill/ml compared to ( - ) mill/ml among men exposed to > cigarettes per day in fetal life (n = aim: to estimate the prevalence of overweight and obesity, and their effects in physical activity (pa) levels of portuguese children and adolescents aged - years. methods: the sample comprises subjects ( females- males) attending basic/secondary schools. the prevalence of overweight and obesity was calculated using body mass index (bmi), and the cut-off points suggested by cole et al. ( ) . pa was assessed with the baecke et al. ( ) questionnaire. proportions were compared using chi-square tests and means by anova. results and conclusions: overall, . % were overweight (females = . %; males = . %) and . % were obese (females = . %; males = . %). prevalence was similar across age and gender. bmi changed with age (p< . ), and a significant interaction between age and gender was found (p = . ): whereas bmi in males increased with aging, in females increased up to years and stabilized onwards. males showed significantly higher values of pa (p< . ). both genders had a tendency to increase their pa until - years. a significant interaction between age and gender (p = . ) points out different gender patterns across age: pa increased with aging in males but in females started to decline after years. no significant differences in pa were found between normal weight, overweight and obese subjects (p = . ). background: atherosclerosis is an inflammatory process. however, the relation between inflammatory markers and extent and progression of atherosclerosis remains unclear. objectives: we studied the association between c-reactive protein (crp) and established measures of atherosclerosis. design and methods: within the rotterdam study, a population-based cohort of , persons over age , we measured crp, carotid plaque and intima-media thickness (imt), abdominal artery calcification, ankle-brachial index (abi) and coronary calcification. using ancova, we investigated the relation between crp and extent of atherosclerosis. we studied the association between progression of extra coronary atherosclerosis (mean follow-up period: . years) and crp using multinomial regression analysis. results: crp levels were positively related to all measures of atherosclerosis, but the relation was weaker for measures based on detection of calcification only. crp levels were associated with severe progression of carotid plaque (multivariable adjusted odds ratio: . , % ci: . - . 
), imt ( . , . - . ) and abi ( . , . - . ). no relation was observed with progression of abdominal artery calcification. conclusion and discussion: crp is related to extent and progression of atherosclerosis. the relation seems weaker for measures based on detection of calcification only, indicating that calcification of plaques might attenuate the inflammatory process. background: maternal stress during pregnancy has been reported to have an adverse influence on fetal growth. the terrorist attacks of september , on the united states have provoked feelings of insecurity and stress worldwide. objective: our aim was to test the hypothesis that maternal exposure to these acts of terrorism via the media had an unfavourable influence on mean birth weight in the netherlands. design and methods: in a prospective cohort study, we compared birth weights of dutch neonates who were in utero during the attacks with those of neonates who were in utero exactly year later. results: in the exposed group, birth weight was lower than in the non-exposed group (difference, g, %ci . , . , p = . ). the difference in birth weight could not be explained by tobacco use, maternal age, parity or other potential confounders, nor by shorter pregnancy durations. conclusion: these results provide evidence supporting the hypothesis that exposure of dutch pregnant women to the september events via the media has had an adverse effect on the birth weight of their offspring. objective: asian studies suggested potential reduction in the risk of pneumonia among patients with stroke on ace-inhibitor therapy. because of the high risk of pneumonia in patients with diabetes we aimed to assess the effects of ace-inhibitors on the occurrence of pneumonia in a general, ambulatory population of diabetic patients. methods: a case-control study was performed nested in , patients with diabetes. cases were defined as patients with a first diagnosis of pneumonia. for each case, up to controls were matched by age, gender, practice, and index date. current ace-inhibitor use was defined within a time-window encompassing the index date. results: ace-inhibitors were used in . % of , cases and in , % of , matched controls (crude or: . , % ci . to . ). after adjusting for potential confounders, ace-inhibitor therapy was associated with a reduction in pneumonia risk (adjusted or: . , % ci . to . ). the association was consistent among different relevant subgroups (stroke, heart failure, and pulmonary diseases) and showed a strong dose-effect relationship (p< . ). conclusions: use of ace-inhibitors was significantly associated with reduced pneumonia risk and may apart from blood pressure lowering properties be useful in prevention of respiratory infections in patients with diabetes. background: progressive decline in serum levels of testosterone occurs with normal aging in both men and women. this is paralleled by a decrease in physical performance and muscle strength, which may lead to disability, institutionalization and mortality. objective. we examined whether low levels of testosterone were associated with three-year decline in physical performance and muscle strength in two population-based samples of older men and women. methods: data were available for men in the longitudinal aging study amsterdam (lasa) and men and women in the health, aging, and body composition (health abc) study. levels of total testosterone and free testosterone were determined at baseline. 
physical performance and grip strength were measured at baseline and after three years. results: total and free testosterone were not associated with change in physical performance or muscle strength in men. in women, low levels of total testosterone (<= ng/dl) increased the risk of decline in physical performance (p = . ), and low levels of free testosterone (< pg/ml) increased the risk of decline in muscle strength (p = . ). conclusion: low levels of total and free testosterone were associated with decline in physical performance and muscle strength in older women, but not in older men. background: obesity and physical inactivity are key determinants of insulin resistance, and chronic hyperinsulinemia may mediate their effects on endometrial cancer (ec) risk. aim: to examine the relationships between prediagnostic serum concentrations of c-peptide, igf binding protein (igfbp)- and igfbp- , and ec risk. methods: we conducted a case-control study nested within the epic prospective cohort study, including incident cases of ec, in pre- and post-menopausal women, and matched control subjects. odds ratios (or) and % confidence intervals (ci) were calculated using conditional logistic regression models. results: in fasting women (> h since last meal), serum levels of c-peptide, igfbp- and igfbp- were not related to risk. however, in nonfasting women ( h or less since last meal), ec risk increased with increasing serum levels of c-peptide. background: tobacco is the single most preventable cause of death in the world today. tobacco use primarily begins in early adolescence. objective: to estimate the prevalence and evaluate factors associated with smoking among high-school-going adolescents in karachi, pakistan. methods: a school-based cross-sectional survey was conducted in three towns of karachi from january through may . two-stage cluster sampling stratified on school types was employed to select schools and students. self-reported smoking status of school-going adolescents was the main outcome in the analysis. results: prevalence of smoking ( days) among adolescents was . %. a multiple logistic regression model showed that, after adjustment for age, ethnicity and place of residence, being a student at a government school (or = . ; % ci: . - . ), parental smoking (or = . ; % ci: . - . ), smoking by an uncle (or = . ; % ci: . - . ), peer smoking (or = . ; % ci: . - . ) and spending leisure time outside the home (or = . ; % ci . - . ) were significantly associated with adolescent smoking. conclusion: a . % prevalence of smoking among school-going adolescents and the influence of parents and peers in initiating smoking in this age group warrant effective tobacco control in the country, especially among adolescents. background: individual patient data meta-analyses (ipd-ma) have been proposed to improve subgroup analyses that may provide clinically relevant information. nevertheless, comparisons of the effect estimates of ipd-ma and meta-analyses of published data (map) are lacking. objective: to compare main and subgroup effect estimates of ipd-ma and map. methods: an extended literature search was performed to identify all ipd-ma of randomized controlled trials, followed by a related-article search to identify maps with a similar domain, objective, and outcome. data were extracted regarding number of trials, number of subgroups, effect measure, effect estimate and their confidence intervals (a minimal pooling sketch is given below).
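as context for how published effect estimates are typically pooled in a map, the sketch below shows a minimal inverse-variance random-effects (dersimonian-laird) pooling of hypothetical log relative risks; the numbers are invented for illustration and do not come from any of the reviewed meta-analyses.

```python
# Minimal DerSimonian-Laird random-effects pooling of published effect estimates.
# The log relative risks and standard errors below are invented placeholders.
import numpy as np

log_rr = np.array([-0.22, -0.05, -0.31, 0.08, -0.15])   # per-study ln(RR)
se = np.array([0.10, 0.12, 0.15, 0.09, 0.20])            # per-study standard errors

w_fixed = 1.0 / se**2                                     # inverse-variance weights
theta_fixed = np.sum(w_fixed * log_rr) / np.sum(w_fixed)

# Cochran's Q and the DerSimonian-Laird between-study variance tau^2.
q = np.sum(w_fixed * (log_rr - theta_fixed) ** 2)
df = len(log_rr) - 1
c = np.sum(w_fixed) - np.sum(w_fixed**2) / np.sum(w_fixed)
tau2 = max(0.0, (q - df) / c)

# Random-effects weights and pooled estimate with 95% confidence interval.
w_re = 1.0 / (se**2 + tau2)
theta_re = np.sum(w_re * log_rr) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
ci = np.exp([theta_re - 1.96 * se_re, theta_re + 1.96 * se_re])
print("pooled RR:", np.exp(theta_re), "95% CI:", ci)
```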
results: in total, ipd-ma and map could be included in the analysis. twenty-five main effect estimates could be compared, of which were in the same direction. although over subgroups were studied in both ipd-ma and map, only effect estimates could be compared, of which were in the same direction. subgroup analyses in map most often related to trial characteristics, whereas subgroup analyses in ipd-ma were related to patient characteristics. conclusion: comparable ipd-ma and map report similar main and subgroup effect estimates. however, ipd-ma more often study subgroups based on patient characteristics, and thus provide more clinically relevant information. patients with diabetes have an increased risk of a complicated course of community-acquired lower respiratory tract infections. although influenza vaccination is recommended for these persons, vaccination levels remain too low because of conflicting evidence regarding potential benefits. as part of the prisma nested case-control study among , persons recommended for vaccination, we studied the effectiveness of single and repeat influenza vaccination in the subgroup of the adult diabetic population ( , ) during the - influenza a epidemic. case patients were hospitalized for diabetes, acute respiratory or cardiovascular events, or died, and controls were sampled from the baseline cohort. after control for age, gender, health insurance, prior health care, medication use and co-morbid conditions, logistic regression analysis showed that the occurrence of any complication ( hospitalizations, deaths) was reduced by % ( % confidence interval % to %). vaccine effectiveness was similar for those who received the vaccine for the first time and for those who had received an earlier influenza vaccination. although we did not perform virological analysis or distinguish type i from type ii diabetes, we conclude that patients with diabetes benefit substantially from influenza vaccination, independent of whether they received the vaccine for the first time or had received earlier influenza vaccinations. background: construction workers are at risk of developing silicosis. regular medical evaluations to detect silicosis, preferably in the pre-clinical phase, are needed. objectives: to identify the presence or absence of silicosis by developing an easy-to-use diagnostic model for pneumoconiosis from simple questionnaires and spirometry. design and methods: multiple logistic regression analysis was done in dutch construction workers, using chest x-ray indicative of pneumoconiosis (ilo profusion category > / ) as the reference standard (prevalence . %). model calibration was assessed with a calibration graph and the hosmer-lemeshow goodness-of-fit test; discriminative ability using the area under the receiver operating characteristic curve (auc); and internal validity using a bootstrapping procedure (a minimal sketch of this workflow is given after this abstract). results: age > years, current smoking, a high-exposure job title, working > years in the construction industry, 'feeling unhealthy', and a standardized residual fev below − . were selected as predictors. the diagnostic model showed good calibration (p = . ) and discriminative ability (auc . ; % ci . to . ). internal validity was reasonable (correction factor of . and optimism-corrected auc of . ). conclusions and discussion: our diagnostic model for silicosis showed reasonable performance and internal validity. to apply the model with confidence, external validation before application in a new working population is recommended.
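a minimal sketch of the kind of workflow the silicosis abstract describes: fit a logistic diagnostic model, compute the apparent auc, and estimate an optimism-corrected auc with a bootstrap. the data are simulated and the predictor names are hypothetical stand-ins, not the dutch construction-worker dataset.

```python
# Illustrative diagnostic-model workflow: logistic regression, apparent AUC,
# and bootstrap optimism correction (Harrell-style internal validation).
# All data are simulated; predictors are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.normal(45, 10, n),     # age
    rng.integers(0, 2, n),     # current smoking
    rng.normal(15, 8, n),      # years in a high-exposure job
    rng.normal(0, 1, n),       # standardized residual FEV1
])
logit = -6.0 + 0.05 * X[:, 0] + 0.7 * X[:, 1] + 0.04 * X[:, 2] - 0.6 * X[:, 3]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Bootstrap optimism: refit on resamples, compare the bootstrap-sample AUC with
# the AUC of the refitted model evaluated on the original data.
optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

corrected_auc = apparent_auc - np.mean(optimism)
print(f"apparent AUC {apparent_auc:.3f}, optimism-corrected AUC {corrected_auc:.3f}")
```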
background: artemisinin based combination therapy (act) reduces microscopic gametocytaemia, the malaria parasite stage responsible for transmission from man to mosquito. as a result, act is expected to reduce the burden of disease in african populations. however, molecular techniques recently revealed high prevalences of gametocytaemia below the microscopic threshold. our objective was to determine the importance of sub-microscopic gametocytaemia after act treatment. methods: kenyan children (n= ) aged months - years were randomised to four treatment regimens. gametocytaemia was determined by microscopy and pfs real-time nucleic acid sequence-based amplification (qt-nasba). transmission was determined by membrane feedings. findings: gametocyte prevalence at enrolment was . % ( / ) as determined by pfs qt-nasba and decreased after treatment with act. membrane feedings in randomly selected children revealed that the proportion of infectious children was up to fourfold higher than expected when based on microscopy. act did not significantly reduce the proportion of infectious children but merely the proportion of infected mosquitoes. interpretation: sub-microscopic gametocyte densities are common after treatment and contribute considerably to mosquito infection. our novel approach indicates that the effect of act on malaria transmission is much smaller than previously suggested. these findings are sobering for future interventions aiming to reduce malaria transmission. background: adequate folate intake may be important in the prevention of breast cancer. factors linked to folate metabolism may be relevant to its protective role. objectives: to investigate the association between folate intake and breast cancer risk among postmenopausal women and evaluate the interaction with alcohol and vitamin b intake. methods: a prospective cohort analysis of folate intake among , postmenopausal women from the e n french cohort who completed a validated food frequency questionnaire in was conducted. during years follow-up , cases of pathology-confirmed breast cancer were documented through followup questionnaires. nutrient intakes were categorized in quintiles and energy-adjusted using the regression-residual method. cox modelderived relative risks (rr) were adjusted for known risk factors for breast cancer. results: the multivariate rr comparing the extreme quintiles of folate intake was . ( % ci . - . ; p-trend= . ). after stratification, the association was observed only among women whose alcohol consumption was above the median (= . g/day) and among women who consumed = . lg/day of vitamin b . however, tests for interaction were not significant. conclusions: in this population, high intakes of folate were associated with decreased breast cancer risk; alcohol and vitamin b intake may modify the observed inverse association. background: the simultaneous rise in the prevalence of obesity and atopy in children has prompted suggestions that obesity might be a causal factor in the inception of atopic diseases. objective: we investigated the possible role of ponderal index (kg/m ) as marker for fatness at birth in early childhood atopic dermatitis (ad) in a prospective birth cohort study. methods: between november and november , mothers and their newborns were recruited after delivery at the university of ulm, germany. active follow-up was performed at the age of months. results: for ( %) of the children included at baseline, information on physician reported diagnosis of ad was obtained during follow-up. 
incidence of ad was . % at the age of one year. mean ponderal index at birth was . kg/m³. risk for ad was higher among children with a high ponderal index at birth (adjusted or for children within the third and fourth quartiles compared to children within the second quartile of ponderal index: . ; % ci . , respectively). background: the relationship between duration of breastfeeding and risk of childhood overweight remains inconclusive, possibly in part because mothers who never breastfed are used as the reference category. objectives: we assessed the association between duration of breastfeeding and childhood overweight among ever-breastfed children within a prospective birth cohort study. methods: between november and november , all mothers and their newborns were recruited after delivery at the university of ulm, germany. active follow-up was performed at age months. results: among the children ( % of children included at baseline) with available body mass index at age two, . % were overweight. whereas children ( . %) were never breastfed, ( . %) were breastfed for at least six months, and ( . %) were exclusively breastfed for at least six months. compared to children who were exclusively breastfed for less than three months, the adjusted or for overweight was . ( % ci . ; . ) in children who were exclusively breastfed for at least three but less than six months and . ( % ci . ; . ) in children who were exclusively breastfed for at least six months. conclusion: these results highlight the importance of prolonged breastfeeding in the prevention of overweight in children. background: in africa, hiv and feeding practices influence child mortality. exclusive breastfeeding for months (bf ) and formula feeding (ff) when affordable are two who recommendations for safe feeding. objective: we estimated the proportion and the number of children saved with each recommendation at population level. design and methods: data on sub-saharan countries were analysed. a child was considered saved if it remained hiv-free and alive after two years of life. a spreadsheet model based on a decision tree for risk assessment was used to calculate this number according to six scenarios that combine the two recommendations without promotion, with promotion, and with promotion plus group education. results: whatever the country, the number of children saved with bf would be higher than with ff. overall, without promotion, ( ). background: farming has been associated with respiratory symptoms as well as protection against atopy. effects of different farming practices on respiratory health in adults have rarely been studied. objectives: we studied associations between farming practices and hay fever and current asthma in organic and conventional farmers. design and methods: this cross-sectional study evaluated questionnaire data of conventional and organic farmers. associations between health effects and farm exposures were assessed by logistic regression. results: organic farmers reported slightly more hay fever than conventional farmers ( . % versus . %, p = . ). however, organic farming was not an independent determinant of hay fever in multivariate models including farming practices and potential confounders. livestock farmers who grew up on a farm had a five-fold lower prevalence of hay fever than crop farmers without a farm childhood (or . , % ci . - . ). use of disinfectants containing quaternary ammonium compounds was positively related to hay fever (or . , % ci . - . ). no effects of farming practices were found for asthma.
conclusion and discussion: our study adds to the evidence that a farm childhood in combination with current livestock farming protects against allergic disorders. this effect was found for both organic and conventional farmers. background: although a body mass index (bmi) above kg/m² is clearly associated with an increase in mortality in the general population, the meaning of high levels of bmi among men doing heavy physical work is less clear. methods: we assessed the association between bmi and mortality in a cohort of male construction workers, aged - years, who underwent an occupational health examination in württemberg (germany) during - and who were followed over a period of years. covariates considered in the proportional hazards regression analysis included age, nationality, smoking status, alcohol consumption, and comorbidity. results: during follow-up, deaths occurred. there was a strong u-shaped association between bmi and all-cause mortality, which was lowest for bmi levels between and kg/m². this pattern persisted after exclusion of the first years of follow-up and control for multiple covariates. compared with men with a bmi < . kg/m², the relative mortality was . ( % confidence interval: . - . ), . ( . - . ) and . ( . - . ) for the bmi ranges - . , - . and ≥ . kg/m². conclusion and discussion: bmi levels commonly considered to reflect overweight or moderate obesity in the general population may be associated with reduced mortality in men doing heavy physical work. background: colonoscopy with removal of polyps may strongly reduce colorectal cancer (crc) incidence and mortality. empirical evidence for optimal schedules for surveillance is limited. objective: to assess the risk of proximal and distal crc after colonoscopy with polypectomy. design and methods: history and results of colonoscopies were obtained from cases and controls in a population-based case-control study in germany. risk of proximal and distal crc according to time since colonoscopy was compared to the risk of subjects without previous colonoscopy. results: subjects with previous detection and removal of polyps had a much lower risk of crc within four years after colonoscopy (adjusted odds ratio . , % confidence interval . - . ), and a similar risk as those without colonoscopy in the long run. within four years after colonoscopy, risk was particularly low if only single or small adenomas were detected. most cancers occurring after polypectomy were located in the proximal colon, even if polyps were found in the sigmoid colon or rectum only. conclusion and discussion: our results support suggestions that surveillance colonoscopy after removal of single and small adenomas may be deferred to five years and that surveillance should include the entire colorectum, even if only distal polyps are detected. background: a population-based early detection programme for breast cancer has been in progress in finland since . recently, detailed information about actual screening invitation schemes in - has become available in electronic form, which enables more specific modeling of breast cancer incidence. objectives: to present a methodology for taking into account historical municipality-specific schemes of mass screening when constructing predictions for breast cancer incidence. to provide predictions for numbers of new cancer cases and incidence rates according to alternative future screening policies. methods: observed municipality-specific screening invitation schemes in finland during - were linked together with breast cancer data.
the incidence rate during the observation period was analyzed using poisson regression, and this was done separately for localized and non-localized cancers. for modeling, the screening programme was divided into seven different phases. alternative screening scenarios for future mass-screening practices in finland were created and an appropriate model for incidence prediction was defined. results and conclusion: expanding the screening programme would increase the incidence of localized breast cancers; the biggest increase would be obtained by expanding from women aged - to - . the impacts of changes in the screening practices on predictions for non-localized cancers would be minor. background: new screening technologies are being introduced into routine screening in increasing numbers, with limited evidence on their effectiveness. randomised evaluation of new technologies is encouraged but rarely done. objective: to evaluate in a randomised design whether the effectiveness of an organised cervical screening programme can be improved by means of new technologies. methods: since , - , women have been invited annually to a randomised multi-arm trial run within the finnish organised cervical screening programme. the invited women are randomly allocated to three study arms of different primary screening tests: conventional cytology, automation-assisted cytology and, since , human papillomavirus (hpv) testing. up to , we have gathered information on , screening visits in the automation-assisted arm and , in the hpv arm, and we have compared the results to conventional screening. results: automation assistance resulted in a slightly increased detection of precancers, but the efficacy based on interval cancers is not known. results on hpv screening suggest higher detection of precancers and cancers compared to conventional screening. conclusion: evidence of higher effectiveness of new screening technologies is needed, especially when changing the existing screening programmes. the multi-arm trial shows how these technologies can be implemented in routine practice in a controlled manner. introduction: nodules and goitres are important risk factors for thyroid cancer. as the number of diagnosed cases of thyroid cancer is increasing, the incidence of such risk factors has been assessed in a french cohort of adults. methods: the su.vi.max (supplémentation en vitamines et minéraux antioxydants) cohort study included middle-aged adults followed up for eight years. incident cases of goitres and nodules have been identified retrospectively by scheduled clinical examinations and spontaneous consultations by the participants. cox proportional hazards modeling was used to identify factors associated with thyroid diseases. results: finally, incident cases of nodules and goitres were identified among , subjects free of thyroid diseases at inclusion. after an average follow-up of years, the incidence of goitres and nodules was . % in - year old men, . % in - year old women and . % in - year old women. identified associated factors were age, low urinary thiocyanate level and oral contraceptive use in women, and high urinary thiocyanate level and low urinary iodine level in men. conclusion: estimated incidences are consistent with those observed in other countries. the protective role of urinary thiocyanate in both men and women and, in women, of oral contraceptives deserves further investigation. background: various statistical methods for outbreak detection in hospital settings have been proposed in the literature.
usually, validation of those methods is difficult, because the long time series of data needed for testing the methods are not available. modeling is a tool to overcome that difficulty. objectives: to use model-generated data for testing the sensitivity and specificity of different outbreak detection methods. methods: we developed a simple stochastic model for a process of importation and transmission of infection in small populations (hospital wards). we applied different statistical outbreak detection methods described in the literature to the generated time series of diagnosis data and calculated the sensitivity and specificity of the different methods. results: we present roc curves for the different methods and show how they depend on the underlying model parameters. we discuss how sensitivity and specificity measures depend on the degree of underdiagnosis, on the ratio of admitted colonised patients to colonisation resulting from transmission in the hospital, and on the frequency of testing patients for colonisation. conclusions: modeling can be a useful tool for evaluating statistical methods of outbreak detection, especially in situations where real data are scarce or their quality is questionable (a minimal simulation sketch of this kind of evaluation is given below). background: use of hormone replacement therapy (hrt), which is associated with higher mammographic density and breast pain, has increased, which has a bearing on screening performance. objective: we compared the screening performance for women aged - years with dense and lucent breast patterns in two time periods and studied the possible interaction with use of hrt. methods: data were collected from a dutch regional screening programme for women referred in - (n = ) and - (n = ). in addition, we sampled controls for both periods that were not referred (n = and n = , respectively) and women diagnosed with an interval cancer. mammograms were digitised and computer-assisted methods were used to measure mammographic density. among other parameters, sensitivity was calculated to describe screening performance. results: screening performance has improved slightly, but the difference between dense and lucent breast patterns still exists (e.g. sensitivity % vs. %). hrt use has increased; sensitivity was particularly low ( %) in the group of women with dense breast patterns on hrt. discussion: in conclusion, the detrimental effect of breast density and the interaction with hrt on screening performance warrants further research with enlargement of the catchment area, more referred women, interval cancers and controls. background: population-based association studies might lead to false-positive results if underlying population structure is not adequately accounted for. to assess the nature of the population structure, some kind of cluster analysis has to be carried out. we investigated the use of self-organizing maps (soms) for this purpose. objectives: the two main questions concern identification of either a discrete or an admixed population structure and identification of the number of subpopulations involved in forming the structured population under investigation. design and methods: we simulated data sets with different population models and included varying informative marker and map sizes. sample sizes ranged from to individuals. results: we found that a discrete structure can easily be detected by soms. a near-perfect assignment of individuals to their population of origin can be obtained. for an admixed population structure, though, soms do not lead to reasonable results. here, even the correct number of subpopulations involved cannot be identified.
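returning to the hospital outbreak-detection abstract above, the sketch below simulates daily ward counts from a simple importation-plus-outbreak process and evaluates a naive threshold detector by sweeping the threshold to trace sensitivity and specificity (points on a roc curve). the transmission process is deliberately simplified and all parameters are invented for illustration; it is not the authors' model.

```python
# Toy evaluation of a threshold-based outbreak detector on simulated ward data.
# Baseline colonisations are imported at a constant Poisson rate; on outbreak
# days an extra transmission term is added. Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(7)
days = 2000
baseline_rate = 0.4          # mean imported colonisations per day
outbreak_extra = 1.5         # additional mean cases per day during an outbreak

# Mark roughly 5% of days as belonging to outbreak periods of about a week each.
outbreak = np.zeros(days, dtype=bool)
for start in rng.choice(days - 7, size=15, replace=False):
    outbreak[start:start + 7] = True

counts = rng.poisson(baseline_rate + outbreak_extra * outbreak)

def evaluate(threshold, window=7):
    """Flag a day if the trailing window sum of cases exceeds the threshold."""
    rolling = np.convolve(counts, np.ones(window))[:days]   # trailing sums
    alarm = rolling > threshold
    sens = np.mean(alarm[outbreak])        # outbreak days correctly flagged
    spec = np.mean(~alarm[~outbreak])      # non-outbreak days correctly quiet
    return sens, spec

for thr in [3, 5, 7, 9, 11]:
    sens, spec = evaluate(thr)
    print(f"threshold {thr:2d}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```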
conclusion: in conclusion, soms can be an alternative to a model-based cluster analysis if the researcher assumes a discrete structure but should not be applied if an admixed structure is likely. background: little is known about the combined effect of duration of breastfeeding, sucking habits and malocclusion in the primary dentition. objectives: we studied the association of breastfeeding and non-nutritive sucking habits on malocclusion on the primary dentition. design and methods: a cross-sectional study nested in a birth cohort was carried out in pelotas, brazil. a random sample of children aged was examined and their mothers interviewed. the foster and hamilton criteria were used to define anterior open bite (aob) and posterior cross bite (pcb). information regarding breastfeeding and non-nutritive sucking habits was collected from birth to years-old. poisson's regression analysis was used. results: non-nutritive sucking habits between months and years of age (pr . [ . ; . ] ) and digital sucking at years of age (pr . [ . ; . ]) were risk factors for aob. breastfeeding for less than months (pr . [ . ; . ] ) and the regular use of a pacifier between months and years of age (pr . [ . ; . ]) were the risk factors for pcb. for pcb an interaction was identified between lack of breastfeeding and the use of a pacifier. conclusion: lack of breastfeeding and longer non-nutritive sucking habits during early childhood were the main risk factors for malocclusion in primary dentition. background: recent, dramatic coronary heart disease (chd) mortality increases in beijing, can be mostly explained by adverse changes in risk factors, particularly total cholesterol and diabetes. it is important for policy making to predict the impact of future changes in risk factors on chd mortality trends. objective: to assess the potential impact of changes in risk factors on numbers of chd deaths in beijing from to , to provide evidence for future chd strategies. design: the previously validated impact model was used to estimate the chd deaths expected in a) if recent risk factor trends continue or b) if levels of risk factors reduce. results: continuation of current risk factor trends will result in a % increase in chd deaths by , (almost half being attributable to increases in total cholesterol levels). even optimistically assuming a % annual decrease in risk factors, chd deaths would still rise by % because of population ageing. conclusion: a substantial increase in chd deaths in beijing may be expected by . this will reflect worsening risk factors compounded by demographic trends. population ageing in china will play an important role in the future, irrespective of any improvements in risk factor profiles. background: since smoking cessation is more likely during pregnancy than at other times, interventions to maintain quitting postpartum may give the best opportunity for a long-time abstinence. it is still not clear what kind of advice or counseling should be given to help prevent the relapse postpartum. objectives: to identify the factors, which predispose women to smoking relapse after delivery. design and methods: the cohort study was conducted in and in public maternity units in lodz, poland. the study population consisted of pregnant women between - weeks of pregnancy who have quit smoking no later than three months prior to participation in the study. smoking status was verified using saliva cotinine level. women were interviewed twice: during pregnancy and three months after delivery. 
results: within three months after delivery about half of women relapsed into smoking. the final model identified the following risk factors for smoking relapse: having partner and friends who smoke, quitting smoking in late pregnancy, and negative experiences after quitting smoking such as dissatisfaction with weight, nervousness, irritation, loosing pleasure. conclusion. this study advanced the knowledge of the factors that determine smoking relapse after delivery and provided preliminary data for future interventions. introduction: it remains difficult to predict the effect of an particular antihypertensive drug in an individual patient and pharmacogenetics might optimise this. objective: to investigate whether the association between use of angiotensin converting enzyme (ace)-inhibitors or ß-blockers and the risk of stroke or myocardial infarction (mi) is modified by the t-allele of the angiotensinogen m t polymorphism. methods: data were used from the rotterdam study, a population-based prospective cohort study. in total, subjects with hypertension were included from july st, onwards. follow-up ended at the diagnosis of mi or stroke, death, or the end of study period (january st, ) . the drug-gene interaction and the risk of mi/stroke was determined with a cox proportional hazard model (adjusted for each drug class as time-dependent covariates). results: the interaction between current use of ace-inhibitors and the angiotensinogen m t polymorphism increased the risk of mi (synergy index (si) = . ; % ci: . - . ) and non-significant increased risk of stroke (si = . ; % ci: . - . ). no interaction was found between current use of ß-blockers and the agt m t polymorphism on the risk of mi or stroke. conclusion: subjects with at least one copy of the t allele of the agt gene might have less benefit from ace-inhibitor therapy. [ . - . ] to . [ . - . ] in those without ms-idf and . [ . - . ] with ms-idf. ms-ncep had no effect. conclusion and discussion: although cardiovascular disease was self-reported, we conclude that the higher prevalence of cardiovascular disease is partly accounted for by marked differences in the prevalence of metabolic syndrome. the ms-idf criteria seem better for defining metabolic syndrome in ethnic groups than the ms-ncep criteria. background: selenium is an essential trace mineral with antioxidant properties. objective: to perform meta-analyses of the association of selenium levels with coronary heart disease (chd) endpoints in observational studies and the efficacy of selenium supplements in preventing chd in randomized controlled trials. methods: we searched medline and the cochrane library from through . relative risk (rr) estimates were pooled using an inversevariance weighted random-effects model. for observational studies reporting three or more categories of exposure we conducted a dose-response meta-analysis. results: twenty-five observational studies and clinical trials met our inclusion criteria. the pooled rr comparing the highest to the lowest categories of selenium levels was . ( % confidence interval . - . ) in cohort studies and . ( . - . ) in case-control studies. in dose-response models, a % increase in selenium levels was associated with a % ( - %) reduced risk of coronary events. in randomized trials, the rr comparing participants taking selenium supplements to those taking placebo was . ( . - . ). conclusion: selenium levels were inversely associated with the risk of chd in observational studies. 
the randomized trials findings are still inconclusive. these results require confirmation in randomised controlled trials. currently, selenium supplements should not be recommended for cardiovascular prevention. background propensity score analysis (psa) can be used to reduce confounding bias in pharmacoepidemiologic studies of the effectiveness and safety of drugs. however, confidence intervals may be falsely precise because psa ignores uncertainty in the estimated propensity scores. objectives: we propose a new statistical analysis technique called bayesian propensity score analysis (bpsa). the method uses bayesian modelling with the propensity score as a latent variable. our question is: does bpsa yield improved interval estimation of treatment effects compared to psa? our objective is: to implement bpsa using computer programs and investigate the performance of bpsa compared to psa. design and methods: we investigated bpsa using monte carlo simulations. synthetic datasets, of sample size n = , , , were simulated by computer. the datasets were analyzed using bpsa and psa and we estimated the coverage probability of % credible intervals. results the estimated coverage probabilities ranged from % to % for bpsa, and from % to % for psa, with simulation standard errors less than %. background: several factors associated with low birth weight, such as smoking and body mass index (bmi) do not explain all ethnic differences. this study investigates the effects of working conditions on birth weight among different ethnic groups. methods: questionnaire data, filled in weeks after prenatal screening, was used from the amsterdam born children and their development (abcd) study (all pregnant women in amsterdam [ / / - / / (n = . ), response ( %)]. ethnicity (country of birth). was dichotomised into dutch and non-dutch. working conditions were: weekly working hours, weekly hours standing/walking, physical load and job-strain (karasek-model). only singleton deliveries with pregnancy duration = weeks were included. results: although only . % of the non-dutch women worked during first trimester ( . % of the dutch women), they reported significantly more physical load ( . % vs . %), more hours standing/walking ( . % vs . %) and more high job-strain ( . vs . ). linear regression revealed that only high job-strain lowered significantly birth weight (non-dutch: gram and dutch: gram). after adjusting for confounders (gender, parity, smoking, maternal length, maternal bmi and education), this was only significant in the non-dutch group ( vs. gram). conclusion: job-strain has more effect on birth weight in non-dutch compared to dutch women. background: in panama population was estimated in . million habitants, from which three millions lived in malaria endemic areas. until january malaria control activities were accomplished under a vertical structure. objective: to evaluate the evolution of malaria control in panama, before and after the decentralization of the malaria program. design and methods: average (standard deviation) of the program indexes are described for the last decades. the correlation between positive smears index and per capita cost of the program is analyze. results: in the 's the average (standard deviation) positive smears index per habitants was . % ( . ); in the 's: . % ( . ); in the 's: . % ( . ); in the 's: . % ( . ); and in the first five years of : . % ( . ). after the decentralization of the program was accomplished in , the positive smears index increased . fold. 
the average per capita cost involved in malaria control activities per decade ranged between . y . us dollars and presented a determination coefficient of . in the reduction of the positive smears index. discussion: the decentralization had significant detrimental implications in the control program capabilities. background: notification rates of new smear-positive tuberculosis in the central mountainous provinces ( / , population) are considerably lower than in vietnam in general ( / , population). this study assessed whether this is explained by low case detection. objective: to assess the prevalence and case detection of new smear-positive pulmonary tuberculosis among adults with a prolonged cough in central mountainous vietnam. design and methods: a house-to-house survey of adults years or older was carried out in randomly selected districts in three mountainous provinces in central vietnam in . three sputum specimens were microscopically examined of persons reporting a cough of weeks or longer. results: the survey included , persons with a response of %. a cough of weeks or longer was reported by , ( . % % ci . - . ) persons. of these, were sputum smear-positive of whom had had anti-tuberculosis treatment. the prevalence of new smear-positive tuberculosis was / , population ( % ci - / , population). the patient diagnostic rate was . per person-year, suggesting that the case notification rate as defined by who was %. conclusion: low tuberculosis notification rates in mountainous vietnam are probably due to low tuberculosis incidence. explanations for low incidence at high altitude need to be studied. background: although patients with type diabetes (dm ) have an increased risk of urinary tract infections (utis), not much is known about predictors of a complicated course. objective: this study aims to develop a prediction rule for complicated utis in dm patients in primary care. design and methods: we conducted a -month prospective cohort study, including dm patients aged years or older from the second dutch national survey of general practice. the combined outcome measure was defined as the occurrence of recurrent cystitis, or an episode of acute pyelonephritis or prostatitis. results: of the , dm patients % was male and mean age was years (sd ). incidence of the outcome was per patient years (n = ). predictors were age, male sex, number of physician contacts, incontinence of urine, cerebro vascular disease or dementia and renal disease. the area under the receiver-operating curve (auc) was . ( % ci . to . ). subgroup analyses for gender showed no differences. there is an increased early postoperative mortality (operation risk) after elective surgery. this mortality is normally associated with cardiovascular events, such as deep venous thrombosis, pulmonary embolism, and ischemic heart diseases. our objective was to quantify the magnitude of the increased mortality and how long the mortality after an operation persists. we focused on the early postoperative mortality after surgery for total knee and total hip replacements from the national registries in australia and norway, which cover more than % of all operations in the two nations. only osteoarthritis patients between and years of age were included. a total of . patients remained for analyses. smoothed intensity curves were calculated for the early postoperative period. effects of risk factors were studied using a nonparametric proportional hazards model. the mortality was highest immediately after the operation ($ deaths per . 
patients per day), and it decreased until the rd postoperative week. the mortality was virtually the same for both nations and both joints. mortality increased with age and was higher for males than for females. a possible reduction of early postoperative mortality is plausible for the immediate postoperative period, and no longer than the rd postoperative week. background/objectives: single, modifiable risk factors for stroke have been extensively studied before, but their combined effects were rarely investigated. aim of the present study was to assess single and joint effects of risk factors on stroke and transitoric ischemic attack (tia) incidence in the european prospective investigation into cancer and nutrition (epic)-potsdam study. methods: among participants aged - years at baseline total stroke cases and tia cases occurred during . years of follow-up. relative risks (rr) for stroke and tia related to risk factors were estimated using cox proportional hazard models. results: after adjustment for potential confounders rr for ischemic stroke associated with hypertension was . ( % ci, . - . ) and for tia . ( % ci . - . ). the highest rr for ischemic stroke (rr . , % ci . - . , p trend< . ) and tia (rr . , % ci . - . , p trend= . ) were observed among participants with or modifiable risk factors. . % of ischemic strokes and . % of tia cases were attributable to hypertension, diabetes mellitus, high alcohol consumption, hyperlipidemia, and smoking. conclusion: almost % of ischemic stroke cases could be explained by classical modifiable risk factors. however, only one in four tia cases was attributable to those risk factors. background: the investigation of genetic factors is gaining importance in epidemi-ology. most relevant from a public health perspective are complex diseases that are characterised by complex pathways involving gene-gene-and gene-environment-interactions. the identification of such pathways requires sophisticated statistical methods that are still in their infancy. due to their ability in describing complex association structures, directed graphs may represent a suitable means for modelling complex causal pathways. objectives: we present a study plan to investigate the appropriateness for using directed graphs for modelling complex pathways in association stud-ies. design and methods: graphical models and artificial neural networks will be investigated using simulation studies and real data and their advantages and disadvantages of the respective ap-proaches summed up. furthermore, it is planned to construct a hybrid model exploiting the strengths of either model type. results and conclusions: the part of the project that concerns graphical models is being funded and ongoing. first results of a simulation study have been obtained and will be presented and discussed. a second project is currently being applied for. this shall cover the investigation of neural networks and the construction of the hybrid model. this study investigates variations in mortality from 'avoidable' causes among migrants in the netherlands in comparison with the native dutch population. data were obtained from population and mortality registries in the period - . we compared mortality rates for selected 'avoidable' conditions for turkish, moroccan, surinamese and antillean/aruban groups to native dutch. we found slightly elevated risk in total 'avoidable' mortality for migrant populations (rr = . ). 
higher risks of death among migrants were observed from almost all infectious diseases (most rr> . ) and several chronic conditions including asthma, diabetes and cerebro-vascular disorders (most rr> . ). migrant women experienced a higher risk of death from maternity-related conditions (rr = . ). surinamese and antillean/ aruban population had a higher mortality risk (rr = . and . respectively), while turkish and moroccans experienced a lower risk of death (rr = . and . respectively) from all 'avoidable' conditions compared to native dutch. control for demographic and socioeconomic factors explained a substantial part of ethnic differences in 'avoidable' mortality. conclusion: compared to native dutch, total 'avoidable' mortality was slightly elevated for all migrants combined. mortality risks varied greatly by cause of death and ethnicity. the substantial differences in mortality for a few 'avoidable' conditions suggest opportunities for improvement within specific areas of the healthcare system. warmblood horses scored by the jury as having uneven feet will never pass yearly selection sales of the royal dutch warmblood studbook (kwpn).to evaluate whether the undesired trait 'uneven feet' influences performance, databases of kwpn (n = horses) and knhs (n = show jumpers, n = dressage horses) were linked through the unique number of each registered horse. using a proc glm model of sas was investigated whether uneven feet had effects on age at first start and highest performance level. elite show jumpers with uneven feet start at . years and dressage horses . years of age, which is a significant difference (p< . ) with elite even feet horses ( . respectively . years). at their maximum level of performance horses with even feet linearly scored in show jumping . at regular and . at elite level ( . resp. . with uneven feet), while in dressage horses scores were . at regular and . at elite level ( . resp. . with uneven feet).the conformational trait 'uneven feet' appears to have a significant effect on age at first start, while horses with even feet demonstrate a higher maximal performance than horses with uneven feet. objectives: to identify children with acute otitis media (aom) who might benefit more from treatment with antibiotics. methods: an individual patient data meta-analysis (ipdma) on six randomized trials (n = children). to preclude multiple testing, we first performed a prognostic study in which predictors of poor outcome were identified. subsequently, interactions between these predictors and treatment were studied by fixed effect logistic regression analyses. only if a significant interaction term was found, stratified analyses were performed to quantify the effect in each subgroup. results: interactions were found for: age and bilateral aom, and otorrhea. in children less than years with bilateral aom, a rate difference (rd) of ) % ( % ci ) ; ) %) was found, whereas in children aged years or older with unilateral aom the rd was ) % ( % ci ) ; ) %). in children with and without otorrhea the rd were ) % ( % ci ) ; ) %), and ) % ( % ci ) %; ) %). conclusion: although there still are many areas in which ipdma can be improved, using individual patient data appear to have many advantages especially in identifying subgroups. in our example, antibiotics are beneficial in children aged less than years with bilateral aom, and in children with otorrhea. major injuries, such as fractures, are known to increase the risk of venous thrombosis (vt). 
however, little is known of the risk caused by minor injuries, such as ankle sprains. we studied the risk of vt after minor injury in a population-based case-control study of risk factors for vt, the mega-study. consecutive patients, enrolled via anticoagulation clinics, and control subjects, consisting both of partners of patients and randomly selected control subjects, were asked to participate and filled in a questionnaire. participants with cancer, recent plastercasts, surgery or bedrest were excluded from the current analyses. out of patients ( . %) and out of controls ( . %) had suffered from a minor injury resulting in a three-fold increased risk of vt (odds ratio adjusted for age and sex . ; % confidence interval . - . ) compared to those without injury. the risk was highest in the first month after injury and was no longer increased after months. injuries located in the leg increased the risk five-fold, while those located in other body parts did not increase the risk. these results show that minor injuries in the leg increase the risk of vt. this effect appears to be temporary and mainly local. introduction: in southeast asia, dengue was considered a childhood disease. in the americas, this disease occurs predominantly in older age groups, indicating the need for studies to investigate the immune status of the child population, since the presence of antibodies against a serotype of this virus is a risk factor for dengue hemorrhagic fever (dhf). objective: to evaluate the seroprevalence and seroincidence of dengue in children living in salvador, bahia, brazil. methods: a prospective study was carried out in a sample of children of - years by performing sequential serological surveys (igg/ dengue). results: seroprevalence in children was . %. a second survey (n = seronegative children) detected an incidence of . % and no difference was found between males and females or according to factors socioeconomic analyzed. conclusion and discussion: these results show that, in brazil, the dengue virus circulates actively in the initial years of life, indicating that children are also at great risk of developing dhf. it is possible that in this age group, dengue infections are mistaken for other febrile conditions, and that there are more inapparent infections in this age group. therefore, epidemiological surveillance and medical care services should be aware of the risk of dhf in children. since , in the comprehensive cancer centre limburg (cccl) region, all women aged - years are invited to participate in the cervical cancer screening programme once every five years. we had the unique opportunity to link data from the cervical screening programme and the cancer registry. we studied individual pap smear testing and participation in the screening programme preceding the diagnosis of cervical cancer. all invasive cases of cervical cancer of women aged - years in the period - were selected. subgroups were based on results of the pap smear and invitation and participation in the screening programme. time interval between screening and detection of tumours was calculated. in - , the non-response rate was %. in total, invasive cervical cancer cases were detected of which were screening and interval carcinomas. in the group of women who were invited but did not participate and women who were not invited, respectively and tumours were detected. these tumours had a higher stage compared to screening carcinomas. 
in the cccl region, more and higher stage tumours were found in women who did not participate in the screening compared to women with screening tumours. background: pcr for mycobacterium tuberculosis (mtb) has already proved to be a useful tool for the diagnosis and investigation of molecular epidemiology. objectives: evaluation of pcr assay for detection of mycobacterium tuberculosis dna as a diagnostic aid in cutaneous tuberculosis. design and methods: thirty paraffinembedded samples belonging to patients were analyzed for acid fast bacilli. dna was extracted from tissue sections and pcr was performed using specific primers based on is repeated gene sequence of mtb. results: two of the tissue samples were positive for acid fast bacilli (afb). pcr was positive in eight samples from six patients. amongst them, two were suspected of having lupus vulgaris confirmed histopathologically, whom their entire tests were positive. accounting histopathology as gold standard, the sensitivity of pcr in this study was determined as %. conclusion: from cases of skin tuberculosis diagnosed by histopathology, were positive by pcr technique, which shows the priority of previous method to molecular technique. discussion: pcr assay can be used for rapid detection of mtb from cutaneous tuberculosis cases, particularly when staining for afb is negative and there is a lack of growth on culture or when fresh material has not been collected for culture. background: recent epidemiological studies used automated oscillometric blood pressure (aod) devices that systematically measure higher blood pressure values than random zero sphygmomanometer devices (rzs) hampering the comparability of the blood pressure values between these studies. we applied both a random zero and an automated oscillometric blood pressure device in a randomized order in an ongoing cohort study. objectives: the aim of this analysis was to compare the blood pressure values by device and to develop a conversion algorithm for the estimation of blood pressure values from one device to the other. methods: within a randomized subset of subjects aged - years, each subject was measured three times by each device (hawskley random zero and omron hem- cp) in a randomized order. results: the mean difference (aod-rzs) between the devices was . mmhg and . mmhg for the systolic and diastolic blood pressure respectively. linear regression models including age, sex, and blood pressure level can be used to convert rzs blood pressure values to aod blood pressure values and vice versa. conclusions: the results may help to better compare blood pressure values of epidemiological studies that used different blood pressure devices. a form was used to collect relevant perinatal clinical data, as part of a european (mosaic) and italian (action) project. the main outcomes were mortality and a variable combining mortality and severe morbidity at discharge. the cox proportional hazards and logistic regression models were used, respectively, for the two outcomes. results: twenty-two of percent of vpbs were among fbms. comparing to control group, the percentage of babies below weeks and plurality was statistically significant higher among babies of fbms: % vs. . and . % vs. . %. adjusting for potential confounders, no association for mortality among immigrant group was found, whereas a slightly excess of morbidity-mortality was observed (odd ratio, . ; % cis . - . ). 
conclusions: the high proportion of vpbs among fbms and the slight excess observed in morbidity and mortality indicate the need to improve the health care delivery for the immigrant population. background: high-risk newborns have excess mortality, morbidity and use of health services. objectives: to describe re-hospitalizations after discharge from an italian region. design and methods: the population study consisted of all births with < weeks' gestation discharged alive from the twelve neonatal intensive care units in lazio region during . the perinatal clinical data was collected as part of a european project (mosaic). we used the regional hospital discharge database to find hospital admissions within months, using tax code for record linkage. data were analyzed through logistic regression for re-hospitalization. results: the study group included children; among these, ( . %) were re-hospitalized; overall, readmission were observed. the median total length of stay for re-admissions was d. the two most common reasons for re-hospitalization were respiratory ( . %) and gastrointestinal ( . %) disorders. the presence of a severe morbidity at discharge (odd ratio . : % cis . - . ) and male sex (odd ratio . ; % cis . - . ) predicted re-hospitalization in multivariate model. conclusions: almost one out three preterm infants was re-hospitalized in the first months. readmissions after initial hospitalization for a very preterm birth could be a sensitive indicator of quality of follow-up strategies in high risk newborns. background: self-medication with antibiotics may lead to inappropriate use and increase the risk of selection of resistant bacteria. in europe the prevalence varies from / to / respondents. self-medication may be triggered by experience with prescribed antibiotics. we investigated whether in european countries prescribed use was associated with self-medication with antibiotics. methods: a population survey was conducted in european countries with respondents completing the questionnaire. multivariate logistic regression analysis was used to study the relationship between prescribed use and self-medication (both actual and intended) in general, for a specific symptom/disease or a specific antibiotic. results: prescribed use was associated with selfmedication, with stronger effect in northern/western europe (odds ratio . , % ci . - . ) than in southern ( . , . - . ) and eastern europe ( . , . - . ). prescribing of a specific antibiotic increased the probability of self-medication with the same antibiotic. prescribing for a specific symptom/disease increased the likelihood of self-medication for the same symptom/disease. the use of prescribed antibiotics and actual self-medication were both determinants of intended self-medication in general and for specific symptoms/diseases. conclusions: routine prescribing of antibiotics increases the risk of self-medication with antibiotics for similar ailments, both through the use of leftovers and buying antibiotics directly from pharmacies. background: in the american national kidney foundation published a guideline based on opinion and observational studies which recommends tight control of serum calcium, phosphorus and calcium-phosphorus product levels in dialysis patients. objectives: within the context of this guideline, we explored associations of these plasma concentrations with cardiovascular mortality risk in incident dialysis patients. 
design and methods: in necosad, a prospective multi-centre cohort study in the netherlands, we included consecutive patients new on haemodialysis or peritoneal dialysis between and . risks were estimated using adjusted time-dependent cox regression models. results: mean age was ± years, % was male, and % was treated with haemodialysis. cardiovascular mortality risk was significantly higher in haemodialysis patients (hr: . ; % ci: . to . ) and in peritoneal dialysis patients (hr: . ; . to . ) with elevated plasma phosphorus levels when compared to patients who met the target. in addition, having elevated plasma calcium-phosphorus product concentrations increased cardiovascular mortality risk in haemodialysis (hr: . ; . to . ) and in peritoneal dialysis patients (hr: . ; . to . ). conclusion: application of the current guideline in clinical practice is warranted since it reduces cardiovascular mortality risk in haemodialysis and peritoneal dialysis patients in the netherlands. background: urologists are increasingly confronted with requests for early detection of prostate cancer in men from hereditary prostate cancer (hpc) families. however, little is known about the benefit of early detection among men at increased risk. objectives: we studied the effect of biennial screening with psa in unaffected men from hpc families, aged - years, on surrogate endpoints (test and tumour characteristics). methods: the netherlands foundation for the detection of hereditary tumours holds information on approximately hpc families. here, nonaffected men from these families were included and invited for psa testing every years. we collected data on screening history and complications related to screening. results: in the first round, serum psa was elevated ( ng/ml or greater) in of men screened ( %). further diagnostic assessment revealed patients with prostate cancer ( . %). compared to population-based prostate cancer screening trials, the referral rate is equal but the detection rate is twice as high. discussion: in conclusion, the results of prostate cancer screening trials will not be directly applicable to screening in hpc families. the balance between costs, side-effects and potential benefits of screening when applied to a high-risk population will have to be assessed separately. background: in industrialized countries occupational tuberculosis among health care workers (hcws) is re-emerging as a public health priority. to prevent and control tuberculosis transmission in nosocomial settings, public health agencies have issued specific guidelines. turin, the capital of the piedmont region in italy, is experiencing a worrying rise of tuberculosis incidence. here, hcws are increasingly exposed to the risk of nosocomial tuberculosis transmission. objectives: a) to estimate the sex-and age-adjusted annual rate of tuberculosis infection (arti) (per person-years [%py]) among the hcws, as indicated by tuberculin skin test conversion (tst) conversion, b) to identify occupational factors associated with significant variations in the arti, c) to investigate the efficacy of the regional preventive guidelines. design and methods: multivariate survival analysis on tst conversion data from a dynamic cohort of hcws in turin, between and . results: the overall estimated arti was . ( % ci: . - . ) %py. the risk of tst conversion significantly differed among workplaces, occupations, and age of hcws. the guidelines implementation was associated with an arti reductions of . ( % ci: . - . ) %py. 
conclusions: we identify occupational risk categories for targeting surveillance and prevention measures and assessed the efficacy of the local guidelines. background: a positive family history (fh) of breast cancer is an established risk factor for the disease. screening for breast cancer in israel is recommended annually for positive-fh women aged = y and biennially for average-risk women aged - y. objective: to assess the effect of having a positive breast cancer fh on performing screening mammography in israeli women. methods: a cross-sectional survey based on a random sample of the israeli population. the study population consists of , women aged - y and telephone interviews were used. logistic regression models identified variables associated with mammography performance. results: a positive fh for breast cancer was reported by ( . %) participants. performing a mammogram in the previous year was reported by . % and . % of the positive and negative fh subgroups, respectively (p< . ). rates increased with age. among positive fh participants, being married was the only significant correlate for a mammogram in the previous year. conclusions: over % and around % of high-risk women aged - y and = y, respectively, are inadequately screened for breast cancer. screening rates are suboptimal in average-risk women too. discussion: national efforts should concentrate on increasing awareness and breast cancer screening rates. to evaluate the association between infertility, infertility treatments and breast cancer risk. methods: a historical prospective cohort with , women who attended israeli infertility centers between and . their medical charts were abstracted. breast cancer incidence was determined through linkage with the national cancer registry. standardized incidence ratios (sirs) and % confidence intervals were computed by comparing observed cancer rates to those expected in the general population. additionally, in order to control for known risk factors, a casecontrol study nested within the cohort was carried out as well based on telephone interviews with breast cancer cases and controls matched by : ratio. results: compared to . expected breast cancer cases, were observed (sir = . ;non-significant). risk for breast cancer was higher for women treated with clomiphene citrate (sir = . ; % ci . - . ). similar results were noted when treated and untreated women were compared, and when multivariate models were applied. in the nested case-control study, higher cycle index and treatment with clomiphene citrate were associated with significantly higher risk for breast cancer. conclusions: clomiphene citrate may be associated with higher breast cancer risk. smoking is a strong risk factor for arterial disease. some consider smoking also as a risk factor for venous thrombosis, while the results of studies investigating the relationship are inconsistent. therefore, we evaluated smoking as a risk factor for venous thrombosis in the multiple environmental and genetic assessment of risk factors for venous thrombosis (mega) study, a large population-based case-control study. consecutive patients with a first venous thrombosis were included from six anticoagulation clinics. using a random-digit-dialing method a control group was recruited in the same geographical area. all participants completed a questionnaire including questions on smoking habits. persons with known malignancies were excluded from the analyses, leading to a total of patients and controls. 
current and former smoking resulted in a small increased risk of venous thrombosis (ors adjusted for age, sex and bmi) (or-current: . ci : . - . , or-former: . ci : . - . ). an increasing amount and duration of smoking was associated with an increase in risk. the highest risk was found among young (lowest tertile: to yrs) current smokers; twenty or more pack-years resulted in a . -fold increased risk (ci : . - . ). in conclusion, smoking results in a small increased risk of venous thrombosis, with the greatest relative effect among young heavy smokers. objective: to explore whether the observed association between silica exposure and lung cancer was confounded by exposure to other occupational carcinogens, we conducted a nested case-control-study among a cohort of male workers in chinese mines and potteries. methods: lung cancer cases and matched controls were selected. exposure to respirable silica as well as relevant occupational confounders were evaluated quantitatively based on historical industrial hygiene data. the relationship between silica exposure and lung cancer mortality was analyzed by conditional logistic regression analysis adjusted for exposure to arsenic, polycyclic aromatic hydrocarbons (pahs), radon, and smoking habit. results: in a crude analysis adjusted for smoking only, a significant trend of increasing risk of lung cancer with exposure to silica was found for tin, copper/iron miners, and pottery workers. however, after the relevant occupational confounders were adjusted, no association can be observed between silica exposure and lung cancer mortality (pro mg/m -year increase of silica exposure: or = . , % ci: . - . ). conclusion: our results suggest that, the observed excess risk of lung cancer among silica exposed chinese workers is more likely due to exposure to other occupational carcinogens such as arsenic and pahs rather than due to exposure to respirable silica. background: modelling studies have shown that lifestyle interventions for adults with a high risk of developing diabetes are costeffective. objective: to explore the cost-effectiveness of lifestyle interventions for adults with low or moderate risk of developing diabetes. design and methods: the short-term effects of both a community-based lifestyle program for the general population and a lifestyle intervention for obese adults on diabetes risk factors were estimated from international literature. intervention costs were based on dutch projects. the rivm chronic diseases model was used to estimate long-term health effects and disease costs. costeffectiveness was evaluated from a health care perspective with a time horizon of years. results: intervention costs needed to prevent one case of diabetes in years range from , to , euro for the community program and from , to , euro for the intervention for obese adults. cost-effectiveness was , to , euro per quality adjusted life-year for the community program and , to , for the lifestyle intervention. conclusion: a lifestyle intervention for obese adults produces larger individual health benefits then a community program but, on a population level, health gains are more expensively achieved. both lifestyle interventions are cost-effective. background: in barcelona, the proportion of foreign-born patients with tuberculosis (tb) raised from . % in to , % in . objective: to determine differences in infection by country of origin among contacts investigated by the tb programme in barcelona from - . 
design and methods: data were collected on cases and their contacts. generalized estimating equations were used to obtain the risk of infection (or and % ci) to account for potential correlation among contacts. results: contacts of foreign born cases were more infected than contacts of natives patients ( % vs %, p< . ) factors related to infection among contacts of foreign cases were inner city residency (or: . , % ci: . - . ) and sputum smear positivity of the case (or: . , % ci: . - . ) and male contact (or: . , % ci: . - . ), but not daily contact (or: . , % ci: . - . ) among natives cases, inner city residency (or: . , % ci: . - . ), sputum smear positivity (or: . , % ci: . - . ) and daily exposure (or: . , % ci: . - . ) increased risk of infection. conclusion: contacts immigrant tb cases have a higher risk of infection than contacts of natives cases, however daily exposure to an immigrant case was not associated with a greater risk of infection. this could be explained by the higher prevalence of tb infection in their country of origin. background: an inverse association between birthweight and subsequent coronary heart disease (chd) has been widely reported but has not been formally quantified. we therefore conducted a systematic review of the association between birthweight and chd. design and methods: seventeen studies including a total of , singletons that had reported quantitative or qualitative estimates of the association between birthweight and chd by october were identified. additional data from two unpublished studies of individuals were also included. in total, the analyses included data on non-fatal and fatal coronary heart disease events in , individuals. results: the mean weighted estimate for the association between birthweight and chd incidence was . ( % ci . - . ) per kg of birthweight. overall, there was no significant heterogeneity between studies (p = . ) or evidence of publication bias (begg test p = . ). fifteen studies were able to adjust for some measure of socioeconomic position, but such adjustment did not materially influence the association: . ( % ci . - . ). discussion: these findings are consistent with one kilogram higher birthweight being associated with - % lower risk of subsequent chd, but the causal nature of this association remains uncertain and its direct relevance to public health is likely to be small. objective: diabetes has been reported to be associated with a greater coronary hazard among women compared with men with diabetes. we quantified the coronary risk associated with diabetes by sex by conducting a meta-analysis of prospective cohort studies. methods: studies reporting estimates of the relative risk for fatal coronary heart disease (chd) comparing those with and without diabetes, for both men and women were included. results: studies of type- diabetes and chd among , individuals were identified. the summary relative risk for fatal chd, diabetes versus not, was significantly greater among women than men: . ( % ci . to . ) versus . ( % ci . to . ), p< . . after excluding eight studies that had only adjusted for age, the sex risk difference was substantially reduced, but still highly significant (p = . ). the pooled ratio of the relative risks (female: male) from the multiple-adjusted studies was . ( % ci . to . ). conclusions: the relative risk for fatal chd associated with diabetes is % higher in women than in men. 
more adverse cardiovascular risk profiles among women with diabetes, combined with possible treatment disparities that favour men, may explain the greater excess coronary risk associated with diabetes in women. background: malaria in sri lanka is strongly seasonal and often of epidemic nature. the incidence has lowered in recent years which increased the relevance of epidemic forecasting for better targeting control resources. objectives: to establish the spatio/temporal correlation of precipitation and malaria incidence for use in forecasting. design and methods: de-trended long term ( de-trended long term ( - monthly time series of malaria incidence at district level were regressed in a poisson regression against rainfall and temperature at several lags. results: in the north and east of sri lanka, malaria seasonality is strongly positively correlated to rainfall seasonality (malaria lagging one or two months behind rainfall). however, in the south west, no significant (negative) correlation was found. also in the hill country, no significant correlation was observed. conclusion and discussion: despite high correlations, it still remains to be explored to what extent rainfall can be a used as a predictor (in time) of malaria. observed correlation could simply be due to two cyclical seasonal patterns running in parallel, without causal relationship. e.g. similarly, strong correlations were found between temperature and malaria seasonality at months time lag in northern districts, but causality is biologically implausible. background: few studies assessed the excess burden of acute respiratory tract infections (rti) among preschool children in primary care during viral seasons. objective: to determine the excess of rti in preschool children in primary care attributable to influenza and respiratory syncytial virus (rsv). methods: we performed a retrospective cohort study including all children aged - years registered in the database of the utrecht general practitioner (gp) network. during during - , gps recorded episodes of acute rti. surveillance data of influenza and rsv were obtained from the weekly sentinel system of the dutch working group on clinical virology. viral seasons and base-line period were defined as the weeks with respectively more than % and less than % of the yearly number of isolates of influenza or rsv. results: on average episodes of rti were recorded per , child years ( % ci: - ). notably more consults for rti occurred during influenza-season (rr . , % ci: . - . ) and rsv-season (rr . , % ci: . - . ) as compared to base-line period, especially in children younger than two years of age. conclusion: substantial excess rates of rti were demonstrated among preschool children in primary care during influenza-season and particularly during rsvseason, notably in the younger age group. background: many cancer patients who have already survived some time want to know about their prognosis, given the precondition that they are still alive. objective: we described and interpreted population-based conditional -year relative survival rates for cancer patients. methods: the longstanding eindhoven cancer registry collects data on all patients with newly diagnosed cancer in the southeastern part of the netherlands ( . million inhabitants). patients aged - years, diagnosed between and and followed up until january , were included. conditional -year relative survival was computed for every additional year survived. 
results: for many tumours conditional -year relative survival approached - % after having survived - years. however, for stomach cancer and hodgkin's lymphoma conditional -year relative survival increased to only - % and for lung cancer and non-hodgkin's lymphoma it did not exceed - %. initial differences in survival at diagnosis between age and stage groups disappeared after having survived for - years. conclusion: prognosis for patients with cancer changes with each year survived and for most tumours patients can considered to be cured after a certain period of time. however, for stomach cancer, lymphoma's and lung cancer the odds for death remains elevated compared to the general population. background: systematic review with meta-analysis, now regarded as 'best evidence', depends on availability of primary trials and on completeness of review. whilst reviewers have attempted to assess publication bias, relatively little attention has been given to selection bias by reviewers. method: systematic reviews of three cardiology treatments, that used common search terms, were compared for inclusion/exclusion of primary trials, pooled measures of principal outcomes and conclusions. results: in one treatment, reviews included , , , , and trials. there was little overlap: of trials in the last review only , , , and were included by others. reported summary effects ranged from (most effective to least significant); mortality relative risk . ( . , . ) in trials to . ( . , . ) in , and in one morbidity measure; standardised mean difference from . ( . , . ) in trials ( patients) to . () . , . ) in ( patients). reviewers' conclusions ranged from 'highly effective' to 'no evidence of effect'. conclusions: these examples illustrate strong selection bias in published meta-analyses. post hoc review contravenes one important principal of science 'first the hypothesis, then the test'. selection bias by reviewers may affect 'evidence' more than does publication bias. in the context of a large population based german case control study examining the effects of hormone therapy (ht) on breast cancer risk, we conducted a validation study comparing ht prescription data with participants' self-reports for data quality assurance. included were cases and controls aged - years, stratified by age and hormone use. study participants provided detailed information on ht use to trained interviewers, while gynecologists provided prescription data via telephone or fax. data were compared using proportion of agreement, kappa, intraclass correlation coefficient (icc), and descriptive statistics. overall agreement for ever/never use was . %, while agreement for ever/never use by type of ht was . %, . %, and . % for mono-estrogen, cyclical, and continuous combined therapy, respectively. icc for duration was high ( . ( % ci: . - . )), as were the iccs for age at first and last use ( . ( % ci: . - . ) and . ( % ci: . - . ), respectively). comparison of exact brand name resulted in perfect agreement for . % of participants, partial agreement for . %, and no agreement for . %. higher education and shorter length of recall were associated with better agreement. agreement was not differential by disease status. in conclusion, these self-reported ht data corresponded well with gynecologists' reports. background: legionnaires' disease (ld) is a pneumonia of low incidence. however, the impact of an outbreak can be substantial. 
objective: to stop a possible outbreak at an early stage, an outbreak detection programme was installed in the netherlands and evaluated after two years. design: the programme was installed nationally and consisted of sampling and controlling of potential sources to which ld patients had been exposed during their incubation period. potential sources were considered to be true sources of infection if two or more ld patients (cluster) had visited them, or if available patients' and environmental strains were indistinguishable by amplified fragment length polymorphism genotyping. all municipal health services of the netherlands participated in the study. the regional public health laboratory kennemerland sampled potential sources and cultured samples for legionella spp. results: rapid sampling and genotyping as well as cluster recognition helped to target control measures. despite these measures, two small outbreaks were only stopped after renewal of the water system. the combination of genotyping and cluster recognition lead to of ( %) patient-source associations. conclusion and discussion: systematic sampling and cluster recognition can contribute to ld outbreak detection and control. this programme can cost-effectively lead to secondary prevention. -up ( - ) , primary invasive breast cancers occurred. results: compared with hrt never-use, use of estrogen alone was associated with a significant . -fold increased risk. the association of estrogen-progestagen combinations with breast cancer risk varied significantly according to the type of progestagen: while there was no increase in risk with estrogen-progesterone (rr . [ . - . ]), estrogen-dydrogesterone was associated with a significant . -fold increase, and estrogen combined with other synthetic progestins with a significant . -fold increase. although the latter type of hrt involves a variety of different progestins, their associations with breast cancer risk did not differ significantly from one another. rrs did not vary significantly according to the route of estrogen administration (oral or transdermal/percutaneous). conclusion and discussion: progesterone rather than synthetic progestins may be preferred when an opposed estrogen therapy is to be prescribed. additional results on estrogen-progesterone are needed. background: although survival of hodgkin's lymphoma (hl) is high (> %), treatment may cause long-term side-effects like premature menopause. objectives: to assess therapy-related risk factors for premature menopause (age < ) following hl. design and methods: we conducted a cohort-study among female year hl-survivors, aged < at diagnose and treated between and . patients were followed from first treatment until june , menopause, death, or age . cumulative dose of various chemotherapeutic agents as well as radiation fields were studied as risk factors for premature menopause. cox-regression was used to adjust for age, year of treatment, smoking, bmi, and oral contraceptive-use. results: after a median follow-up of . years, ( %) women reached premature menopause. overall women ( %) were treated with chemotherapy only, ( %) with radiotherapy only and ( %) with both radio-and chemotherapy. exposure to procarbazine ), cyclophosphamide (hr . [ . - . ] ) and irradiation of the ovaries ]) were associated with significant increased risks for premature menopause. for procarbazine a dose-response relation was observed. procarbazine-use has decreased over time. 
conclusion: to decrease the risk for premature menopause after hl, procarbazine and cyclophosphamide exposure should be minimized and ovarian irradiation should be avoided. background: casale is an italian town where a large asbestos cement plant was active for decades. previous studies found increased risk for mesothelioma in residents, suggesting a decreasing spatial trend with distance from the plant. objective: to analyse the spatial variation of risk in casale and the surrounding area ($ , inhabitants) focussing on non-occupationally exposed individuals. design/methods: population-based case-control study including pleural mesotheliomas diagnosed between and . information on the cases and controls comprised lifelong residential and occupational history of subjects and their relatives. nonparametric tests of clustering were used to evaluate spatial aggregation. parametric spatial models based on distance between the longest-lasting subject residence (excluding the last years before diagnosis) and the source enabled estimation of risk gradient. results: mesothelioma risk appeared higher in an area of roughly - km radius from the source. spatial clustering was statistically significant (p = . ) and several clusters of cases were identified within casale. risk was highly related to the distance from the source; the best fitting model was the exponential decay with threshold. conclusion/discussion: asbestos pollution has increased dramatically the risk of mesothelioma in the area around casale. risk decreases slowly with the square of distance from the plant. malaria control programmes targeting malaria transmission from man to mosquito can have a large impact of malaria morbidity and mortality. to successfully interrupt transmission, a thorough understanding of disease and transmission parameters is essential. our objective was to map malaria transmission and analyse microenvironmental factors influencing this transmission in order to select high risk areas where transmission reducing interventions can be introduced. each house in the village msitu-wa-tembo was mapped and censused. transmission intensity was estimated from weekly mosquito catches. malaria cases identified through passive case detection were mapped by residence using gis software and the incidence of cases by season and distance to river were calculated. the distribution of malaria cases showed a clear seasonal pattern with the majority of cases during the rainy season (chisquare = . , p< . ). living further away from the river (p = . ) was the most notable independent protective factor for malaria infection. transmission intensity was estimated at . ( % ci . - . ) infectious bites per person per year. we show that malaria in the study area is restricted to a short transmission season. spatial clustering of cases indicates that interventions should be planned in the area closest to the river, prior and during the rainy season. background: the effectiveness of influenza vaccination of elders has been subject of some dispute. its impact on health inequalities also demands epidemiological assessments, as health interventions may affect early and most intensely better-off social strata. objectives: to compare pneumonia and influenza (p&i) mortality of elders (aged or more years old) before and after the onset of a largescale scheme of vaccination in sao paulo, brazil. methods: official information on deaths and population allowed the study of p&i mortality at the inner-city area level. 
rates related to the period to , during which vaccination coverage ranked higher than % of elders were compared with figures related to the precedent period ( ) ( ) ( ) ( ) ( ) . the appraisal of mortality decrease used a geo-referred model for regression analysis. results: overall p&i mortality reduced . % after vaccination. also the number of outbreaks, the excess of deaths during epidemic weeks, and the proportional p&i mortality ratio reduced significantly after vaccination. besides having higher prior levels of p&i deaths, deprived areas of the city presented a higher proportional decrease of mortality. conclusion: influenza vaccination contributed for an overall reduction of p&i mortality, while reducing the gap in the experience of disease among social strata. background: alcohol's first metabolite, acetaldehyde, may trigger aberrations in dna which predispose to developing colorectal cancer (crc) through several distinct pathways. our objective was to study associations between alcohol consumption and the risk of crc, according to two pathways characterized by mutations in apc and k-ras genes, and absence of hmlh expression. methods: in the netherlands cohort study, , men and women, aged - years, completed a questionnaire on risk factors for cancer in . case-cohort analyses were conducted using crc cases with complete data after . years of follow-up, excluding the first . years. gender-specific adjusted incidence rate ratios (rr) and % confidence intervals (ci) were estimated. results: neither total alcohol, nor beer, wine or liquor consumption was clearly associated with the risk of colorectal tumors lacking hmlh expression or harboring a truncating apc mutation and/or an activating k-ras mutation. in men and women, total alcohol consumption above g/day was associated with an increased risk of crc harboring a truncating apc and/or activating k-ras mutation, though not statistically significant. (rr: . ( % ci: . - . ) in men, rr: . ( % ci: . - . ) in women). in conclusion, alcohol consumption is not involved in the studied pathways leading to crc. background: educational level is commonly used to identify social groups with increased prevalence of smoking. other indicators of socioeconomic status (ses) might however be more discriminatory. objective: this study examines to what extent smoking behaviour is related to other ses indicators, such as labour market position and financial situation. methods: data derived from the european household panel, which includes data on smoking for european countries. we selected data for , respondents aged - years. the association between ses indicators and smoking prevalence was examined through logistic regression analyses. results: preliminary results show that, in univariate analysis, all selected ses indicators were associated with smoking. higher rates of smoking in lower social groups were observed in all countries, except for women in some mediterranean countries. in multivariate analyses, education retained an independent effect on smoking. no strong effect was observed for labour market position (occupational class, employment status) or for income. however, smoking prevalence was strongly related to economic deprivation and housing tenure. conclusion: these results suggest that different aspects of people's ses affect their smoking behaviour. interventions that aim to tackle smoking among high-risk groups should identify risk groups in terms of both education and material deprivation. 
objective: we investigated time trends in overweight and leisure time physical activities (ltpa) in the netherlands since . intra-national differences were examined stratified for sex, age and urbanisation degree. design and methods: we used a random sample from the health interview survey of about respondents, aged -to- years. self-reported data on weight, height and demographic characteristics were gathered through interviews (yearly) and data on ltpa were collected by selfadministered questionnaires . linear regression was performed for trend analyses. results: during - , mean body mass index (bmi) increased by . kg/m (p = . ). trends were similar across sex and urbanisation degrees. in -to- year old women, mean bmi increased more ( . kg/m ; p = . ) than in older women. concerning ltpa, no clear trend was observed during observed during - and observed during - . however, in year old women spent $ min/wk less on ltpa compared to older women, while this difference was smaller during - . conclusions: mean bmi increased more in younger women, which is consistent with the observation that this group spent less time on ltpa during recent years. although the overall increase in overweight could no´t be explained by trends in ltpa, physical activity interventions should target the younger women. background: prediction rules combine patient characteristics and test results to predict the presence of an outcome (diagnosis) or the occurrence of an outcome (prognosis) for individual patients. when prediction rules show poor performance in new patients, investigators often develop a new rule, ignoring the prior information available in the original rule. recently, several updating methods have been proposed that consider both prior information and information of the new patients. objectives: to compare five updating methods (that vary in extensiveness) for an existing prediction rule that preoperatively predicts the risk of severe postoperative pain (spp). design and methods: the rule was tested and updated on a validation set of new surgical patients ( ( %) with spp). we estimated the discrimination (the ability to discriminate between patients with and without spp) and calibration (the agreement between the predicted risks and observed frequencies of spp) of the five updated rules in other patients ( ( %) with spp). results: simple updating methods showed similar effects on calibration and discrimination as the more complex methods. discussion and conclusion: when the performance of a prediction rule in new patients is poor, a simple updating method can be applied to improve the predictive accuracy. about two million ethnic germans (aussiedler) have resettled in germany since . analyses with a yet incomplete follow-up of a cohort of aussiedler showed a different mortality compared to russia and germany. objectives: we investigated whether the mortality pattern changed after a complete follow-up and whether residential mobility after resettlement has an effect on mortality. we established a cohort of aussiedler who moved to germany between and . we calculated smr for all causes, external causes, cardiovascular deaths and cancer in comparison to german rates. results: with a complete follow-up, the cohort accumulated person years. overall, deaths were observed (smr . , % ci: . - . ). smr for all external causes, all cancer and cardiovascular deaths were . , . and . , respectively. increased number of moves within germany was associated with increased mortality. 
conclusion and discussion: the mortality in the cohort is surprisingly low, in particular for cardiovascular deaths. there is a mortality disadvantage from external causes and for some selected cancers. this disadvantage is however not as large as would be expected if aussiedler were representative of the general population in fsu countries. mobility as an indicator for a lesser integration will be discussed. background: breast cancer screening (bcs) provides an opportunity to analyze the relationship between lymph node involvement (ln), the most important prognostic factor, and biological and time dependent characteristics. objective: our aim was to assess those characteristics that are associated with ln in a cohort of screen-detected breast cancers. methods: observational population study of all invasive cancers within stage i-iiia detected from to through a bcs program in catalonia (spain). age, tumor size, histological grade, ln status and screening episode (prevalent or incident) were analyzed. pearson chi-square or fisher's exact test, mann-whitney test and stratified analyses were applied, as well as multiple logistic regression techniques. results: twenty nine percent ( % ci . - . %) out of invasive cancers had ln and . % were prevalent cancers. in the bivariate analysis, tumor size and age were strongly associated to ln (p< . ) while grade was related to ln only in incident cancers (p = . ). grade was associated with tumor size (p = . ) and with screening episode (p = . ). adjusting for screening episode and grade, age and tumor size were independent predictors of ln. conclusion: in conclusion, age and tumor size are independent predictors of ln. grade emerges as an important biological factor in incident cancers. background: the evidence regarding the association between smoking and cognitive function in the elderly is inconsistent. objectives: to examine the association between smoking and cognitive function in the elderly. design and methods: in , all participants of a population-based cohort study aged years or older were eligible for a telephone interview on cognitive function using validated instruments, such as the telephone interview of cognitive status (tics). information on smoking history was available from questionnaires administered in . we estimated the odds ratios (or) of cognitive impairment (below th percentile) and the corresponding % confidence intervals (ci) by means of logistic regression adjusting for age, sex, alcohol consumption, body mass index, physical exercise, educational level, depressive symptoms and co-morbidity. results: in total, participants were interviewed and had complete information on smoking history. former smokers had a lower prevalence of cognitive impairment (oradjusted = . ; % ci: . - . ) compared with never smokers, but not current smokers (oradjusted = . ; % ci: . - . ). conclusion: there is no association between current smoking and cognitive impairment in the elderly. discussion: the lack of association between current smoking and cognitive impairment is in line with previous non-prospective studies. the inverse association with former smoking might be due to smoking cessation associated with co-morbidities. background: many studies have reported late effects of treatment in childhood cancer survivors. most studies, however, focused on only one late effect or suffered from incomplete follow-up. 
objectives: we assessed the total burden of adverse events (ae), and determined treatment-related risk factors for the development of various aes. methods: the study cohort included -year survivors, treated in the emma childrens hospital amc in the netherlands between - . aes were graded for severity by one reviewer according to the common terminology criteria adverse events version . . results: medical follow-up data were complete for . % -year survivors. median follow-up time was years. almost % of survivors had one or more aes, and . % had even or more aes. of patients treated with rt alone, % had a high or severe burden of aes, while this was only % in patients treated with ct alone. radiotherapy (rt) was associated with the highest risk to develop an ae of at least grade , and was also associated with a greater risk to develop a medium to severe ae burden. conclusions: survivors are at increased risk for many health problems that may adversely affect their quality of life and long-term survival. background: studies in the past demonstrated that multifaceted interventions could enhance the quality of diabetes care. however many of these studies showed methodological flaws as no corrections were made for patient case-mix and clustering or a nonrandomised design was used. objective: to assess the efficacy of a multifaceted intervention to implement diabetes shared care guidelines. methods: a cluster randomised controlled trial of patients with type diabetes was conducted at general practises (n = ) and one outpatient clinic (n = ). in primary care, facilitators analysed barriers to change, introduced structured care, gave feedback and trained practice staff. at the outpatient clinic, an abstract of the guidelines was distributed. case-mix differences such as duration of diabetes, education, co-morbidity and quality of life were taken into account. results: in primary care, more patients in the intervention group were seen on a three monthly basis ( . % versus . %, p< . ) and their hba c was statistically significant lower ( . ± . versus . ± . , p< . ). however, significance was lost after correction for case-mix (p = . ). change in blood pressure and total cholesterol was not significant. we were unable to demonstrate any change in secondary care. conclusion: multifaceted implementation did improve the process of care but left cardiovascular risk unchanged. background: we have performed a meta-analysis including studies on the diagnostic accuracy of mr-mammography in patients referred for further characterization of suspicious breast lesions. using the bivariate diagnostic meta-analysis approach we found an overall sensitivity and specificity of . and . , respectively. the aim of the present analysis was to detect heterogeneity between studies. materials and methods: seventeen study and population characteristics were separately included in the bivariate model to compare sensitivity and specificity between strata of the characteristics. results: both sensitivity and specificity were higher in studies performed in the united states compared to non-united states studies. both estimates were also higher if two criteria for malignancy were used instead of one or three. only specificity was affected by the prevalence of cancer: specificity was highest in studies with the lowest prevalence of cancer in the study population. furthermore, specificity was affected by whether the radiologist was blinded for clinical information: specificity was higher if there was no blinding. 
conclusions: variation between studies was notably present across studies in country of publication, the number of criteria for malignancy, the prevalence of cancer and whether the observers were blinded for clinical information. objective: the aim of this project is to explore variation in three candidate genes involved in cholesterol metabolism in relation to risk of acute coronary syndrome (acs), and to investigate whether dietary fat intake modifies inherent genetic risks. study population: a case-cohort study is designed within the danish 'diet, cancer and health' study population. a total of cases of acs have been identified among , men and women who participated in a baseline examination between - when they were aged - years. a random sample of participants will serve as 'control' population. exposures: all participants have filled out a detailed -item food frequency questionnaire and a questionnaire concerning lifestyle factors. participants were asked to provide a blood sample. candidate genes for acs have been selected among those involved in cholesterol transport (atp-binding cassette transporter a , cholesterol-ester transfer protein, and acyl-coa:cholesterol acyltransferase ). five single nucleotide polymorphisms (snps) will be genotyped within each gene. snps will be selected among those with demonstrated functional importance, as assessed in public databases. methods: statistical analyses of association between genetic variation in the three chosen genes and risk of acs. explorations of methods to evaluate biological interaction will be of particular focus. background: c-reactive protein (crp) has been shown to be associated with type diabetes mellitus. it is unclear whether the association is completely explained by obesity. objective: to examine whether crp is associated with diabetes independent of obesity. design and methods: we measured baseline characteristics and serum crp in non-diabetic participants of the rotterdam study and followed them for a mean of . years. cox regression was used to estimate the hazard ratio. in addition, we performed a meta-analysis on published studies. results: during follow-up, participants developed diabetes. serum crp was significantly and positively associated with the risk to develop diabetes. the risk estimates attenuated but remained statistically significant after adjustment for obesity indexes. age, sex and body mass index (bmi) adjusted hazard ratios ( % ci) were . ( . - . ) for the fourth quartile, . ( . - . ) for the third quartile, and . ( . - . ) for the second quartile of serum crp compared to the first quartile. in the meta-analysis, weighed age, sex, and bmi adjusted risk ratio was . ( . - . ), for the highest crp category (> . mg/l) compared to the reference category (< . mg/l). conclusion: our findings shows that the association of serum crp with diabetes is independent of obesity. background: effectiveness of screening can be predicted by episode sensitivity, which is estimated by interval cancers following a screen. full-field digital or cr plate mammography are increasingly introduced into mammography screening. objectives: to develop a design to compare performance and validity between screen-film and digital mammography in a breast cancer screening program. methods: interval cancer incidence was estimated by linking screening visits from - at an individual level to the files of the cancer registry in finland. 
these data was used to estimate the study size requirements for analyzing differences in episode sensitivity between screen-film and digital mammography in a randomized setting. results: the two-year cumulative incidence of interval cancers per screening visits was estimated to be . to allow the maximum acceptable difference in the episode sensitivity between screenfilm and digital arm to be % ( % power, . significance level, : randomization ratio, % attendance rate), approximately women need to be invited. conclusion: only fairly large differences in the episode sensitivity can be explored within a single randomized study. in order to reduce the degree of non-inferiority between the screen-film and digital mammography, meta-analyses or pooled analyses with other randomized data are needed. session: socio-economic status and migrants presentation: oral. background: tackling urban/rural inequalities in health has been identified as a substantial challenge in reforming health system in lithuania. objectives: to assess mortality trends from major causes of death of the lithuanian urban and rural populations throughout the period of - . methods: information on major causes of death (cardiovascular diseases, cancers, external causes, and respiratory diseases) of lithuanian urban and rural populations from to was obtained from lithuanian department of statistics. mortality rates were age-standardized, using european standard. mortality trends were explored using the logarithmic regression analysis. results: overall mortality of lithuanian urban and rural populations was decreasing statistically significantly during - . more considerable decrease was observed in urban areas where mortality declined by . % per year in males and . % in females, compare to the decline by . % in males and . % in females in rural areas. the most notable urban/rural differences in mortality trends with unfavourable situation in rural areas were estimated in mortality from stoke, breast cancer in females, and external causes of death (traffic accidents and suicides). background: recent studies indicate that depression plays an important role in the occurrence of cardiovascular diseases (cvd). underlying mechanisms are not well understood. objectives: we investigated whether low intake of omega- fatty acids (fas) is a common cause for depression and cvd. methods: the zutphen elderly study is a prospective cohort study conducted in the netherlands. depressive symptoms were measured with the zung scale in men, aged - years, and free from cvd and diabetes in . dietary factors were assessed with a cross-check dietary history method. results: compared to high intake (= . mg/d), low intake (< . mg/d) of omega- fas, adjusted for demographics and cvd risk factors, was associated with an increased risk of depressive symptoms (or . ; % ci . - . ) at baseline, and non-significantly with -year cvd mortality (hr . ; % ci . - . ). the adjusted hr for a -point increase in depressive symptoms for cvd mortality was . ( % ci . - . ), and did not change after additional adjustment for omega- fas. conclusion: low intake of omega- fas may increase the risk of depression. our results, however, do not support the hypothesis that low intake of omega- fas explains the relation between depression and increased risk of cvd. background: during the last decades the incidence of metabolic syndrome has risen dramatically. several studies have shown beneficial effects of nut and seed intake on components of this syndrome. 
the relationship with prevalence of metabolic syndrome has not yet been examined. objectives: we studied the relation between nut and seed intake and metabolic syndrome in coronary patients. design and methods: presence of metabolic syndrome (according to international diabetes federation definition) was assessed in stable myocardial infarction patients ( % men) aged - years, as part of the alpha omega trial. dietary data were collected by food-frequency questionnaire. results: the prevalence of metabolic syndrome was %. median nut and seed intake was . g/day (interquartile range, . - . g/day). intake of nuts and seeds was inversely associated with the metabolic syndrome (prevalence ratio: . ; % confidence interval: . - . , for > g/day versus < g/day), after adjustment for age, gender, dietary and lifestyle factors. the prevalence of metabolic syndrome was % lower (p = . ) in men with a high nut and seed intake compared to men with a low intake, after adjustment for confounders. conclusion and discussion: intake of nuts and seeds may reduce the risk of metabolic syndrome in stable coronary patients. background: in epidemiology, interaction is often assessed by adding a product term to the regression model. in linear regression the regression coefficient of the product term refers to additive interaction. however, in logistic regression it refers to multiplicative interaction. rothman has argued that interaction as departure from additivity better reflects biological interaction. hosmer and lemeshow presented a method to quantify additive interaction and its confidence interval (ci) between two dichotomous determinants using logistic regression. objective: our objective was to provide an estimation method for additive interaction between continuous determinants. methods and results: from the abovementioned literature we derived the formulas to quantify additive interaction and its ci between one continuous and one dichotomous determinant and between two continuous determinants using logistic regression. to illustrate the theory, data of the utrecht health project were used, with age and body mass index as risk factors for diastolic blood pressure. conclusions: this paper will help epidemiologists to estimate interaction as departure from additivity. to facilitate its use, we developed a spreadsheet, which will become freely available at our website. background: the incidences of acute myocardial infarction (ami) and ischemic stroke (is) in finland have been among highest in the world. accurate geo-referenced epidemiological data in finland provides unique possibilities for ecological studies using bayesian spatial models. objectives: examine sex-specific geographical patterns and temporal variation of ami and is. design and methods: ami (n = , ) and is (n = , ) cases in - in finland, localized at the time of diagnosis according to the place of residence address using map coordinates. cases and population were aggregated to km x km grids. full bayesian conditional autoregressive models (car) were used for studying the geographic incidence patterns. results: the incidence patterns of ami and is showed on average % ( % ci - %) common geographic variation and significantly the rest of the variation was disease specific. there was no significant difference between sexes. the patterns of high-risk areas have persisted over the years and the pattern of is showed more annual random fluctuations. 
conclusions: although ami and is showed rather similar and temporally stable patterns, significant part of the spatial variation was disease specific. further studies are needed for finding the reasons for disease specific geographical variation. most studies addressing socio-economic inequalities in health services use fail to take into account the disease the patient is suffering from. the objective of this study is to compare possible socioeconomic differences in the use of ambulatory care between distinct patient groups: diabetics and patients with migraine. data was obtained from the belgian health interview surveys , and . in total patients with self reported diabetes or migraine were identified. in a multilevel analysis the probability of a contact and the volume of contacts with the general practitioner and/or the specialist were studied for both groups in relation to educational attainment and income. adjustment was made for age, sex, subjective health and comorbidity at the individual level, and doctors' density and region at district level. no socio-economic differences were observed among diabetic patients. among patients with migraine we observed a higher probability of specialist contacts in higher income groups (or , ; % ci , - , ) and higher educated persons (or , ; % ci , - , ), while lower educated persons tend to report more visits with the general practitioner. to correctly interpret socio-economic differences in the use of health services there is need to take into account information on the patient's type of disease. background: the suitability of non-randomised studies to assess effects of interventions has been debated for a long time, mainly because of the risk of confounding by indication. choices in the design and analytical phase of non-randomised studies determine the ability to control for such confounding, but have not been systematically compared yet. objective: the aim of the study will be to quantify the role of design and analytical factors on confounding in non-randomised studies. design and methods: a meta-regression analysis will be conducted, based on cohort and case-control studies analysed in a recent cochrane review on influenza vaccine effectiveness against hospitalisation or death in the elderly. primary outcome will be the degree of confounding as measured by the difference between the reported effect estimate (odds ratios or relative risks) and the best available estimate (nichol, unpublished data) . design factors that will be considered include study design, matching, restriction and availability of confounders. statistical techniques that will be evaluated include multivariate regression analysis with adjustment for confounders, stratification and, if available, propensity scores. results the rsults will be used to develop a generic guideline with recommendations how to prevent confounding by indication in non-randomised effect studies. the wreckage of the oil tanker prestige in november produced a heavy contamination of the coast of galicia (spain). we studied relationships between participation in clean-up work and respiratory symptoms in local fishermen. questionnaires including details of clean-up activities and respiratory symptoms were distributed among associates of fishermen's cooperatives, with postal and telephone follow-up. statistical associations were evaluated using multiple logistic regression analyses, adjusted for sex, age, and smoking status. 
between january and february , information was obtained from , fishermen (response rate %). sixty-three percent had participated in clean-up operations. lower respiratory tract symptoms were more prevalent in clean-up workers (odds ratio (or) . ; % confidence interval . - . ). the risk increased when the number of exposed days, number of hours per day, or number of activities increased (p for linear trend < . ). the excess risk decreased when more time had elapsed since last exposure (or . ( . - . ) and . ( . - . ) for more and less than months, respectively; p for interaction < . ). in conclusion, fishermen who participated in the clean-up work of the prestige oil spill show a prolonged dosedependent increased prevalence of respiratory symptoms one to two years after the beginning of the spill. background. hpv testing has been proposed for cervical cancer screening. objectives: evaluating the protection provided by hpv testing at long intervals vs. cytology every third year. methods: randomised controlled trial conventional arm: conventional cytology. experimental arm: in phase hpv and liquid-based cytology. hpv-positive cytology-negatives referred for colposcopy if age - , for repeat after one year if age - . in phase hpv alone. positives referred for colposcopy independently of age. endpoint: histologically confirmed cervical intraepithelial neoplasia (cin) grade or more. results: overall , women were randomised. preliminary data at recruitment are presented. overall, among women aged - years relative sensitivity of hpv versus conventional cytology was . ( % c.i. . - . ) and relative positive predictive (ppv) value was . ( % c.i. . - . ). among women aged - relative sensitivity of hpv vs. conventional cytology was . ( % c.i. . - . ) during phase but . ( % c.i. . - . ) during phase . conclusions: hpv testing increased cross-sectional sensitivity, but reduced ppv. in younger women data suggest that direct referral of hpv-positives to colposcopy results in relevant overdiagnosis of regressive lesions. measuring detection rate of cin at the following screening round will allow studying overdiagnosis and the possibility of longer screening intervals. background: plant lignans are present in foods such as whole grains, seeds, fruits and vegetables, and beverages. they are converted by intestinal bacteria into the enterolignans, enterodiol and enterolactone. enterolignans possess several biological activities whereby they may influence carcinogenesis. objective: to study the association between plasma enterolignans and the risk of colorectal adenomas. colorectal adenomas are considered to be precursors of colorectal cancer. design and method: the case-control study included cases with at least one histologically confirmed colorectal adenoma and controls with no history of any type of adenoma. associations between plasma enterolignans and colorectal adenomas were analyzed by logistic regression. results: associations were stronger for incident than for prevalent cases. when only incident cases (n = ) were included, high compared to low enterodiol plasma concentrations were associated with a reduction in colorectal adenoma risk after adjustment for confounding variables. enterodiol odds ratios ( % ci) were . , . ( . - . ), . ( . - . ), . ( . - . ) with a significant trend (p = . ) through the quartiles. although enterolactone plasma concentrations were fold higher, enterolactone's reduction in risk was not statistically significant (p for trend = . ). 
conclusion: we observed a substantial reduction in colorectal adenoma risk among subjects with high plasma concentrations of enterolignans, in particular enterodiol. background: smoking is a risk factor for tuberculosis diseases. recently the question was raised if smoking also increases the risk of tuberculosis infection. objective: to assess the influence of environmental tobacco smoke (ets) exposure in the household on tuberculosis infection in children. design and methods: a crosssectional community survey was done and information on children was obtained. tuberculosis infection was determined with a tuberculin skin test (tst) (cut-off mm) and information on smoking habits was obtained from all adult household members. univariate and multivariate analyses were performed, and odds ratio (or) was adjusted for the presence of a tb contact in the household, crowding and age of the child. results: ets was a risk factor for tuberculosis infection (or: . , % ci: . - . ) when all children with a tst read between two and five days were included. the adjusted or was . ( % ci: . - . ). in dwellings were a tuberculosis case had lived the association was strongest (adjusted or . , % ci: . - . ). conclusion and discussion: ets exposure seems to be a risk factor for tuberculosis infection in children. this is of great concern considering the high prevalence of smoking and tuberculosis in developing countries. background and objective: to implement a simulation model to analyze demand and waiting time (wt) for knee arthroplasty and to compare a waiting list prioritization system (ps) with the usual first-in, first-out (fifo) system. methods: parameters for the conceptual model were estimated using administrative data and specific studies. a discrete-event simulation model was implemented to perform -year simulations. the benefit of applying the ps was calculated as the difference in wt weighted by priority score between disciplines, for all cases who entered the waiting list. results: wt for patients operated under the fifo discipline was homogeneous (standard deviation (sd) between . - . months) with mean . . wt under the ps had higher variability (sd between . - . ) and was positively skewed, with mean . months and % of cases over months. when wt was weighted by priority score, the ps saved . months ( % ci . - . ) on average. the ps was favorable for patients with priority scores over , but penalized those with lower priority scores. conclusions: management of the waiting list for knee arthroplasty through a ps was globally more effective than through fifo, although patients with low priority scores were penalized with higher waiting times. background: we developed a probabilistic linkage procedure for the linking of readmissions of newborns from the dutch paediatrician registry (lnr). % of all newborns ( . ) are admitted to a neonatal ward. the main problems were the unknown number of readmissions per child and the identification of admissions of twins. objective: to validate our linking procedure in a double blinded study. design and methods: a random sample of admissions from children from the linked file has been validated by the caregivers, using the original medical records. results: response was %. for admissions of singletons the linkage contained no errors except for the small uncertain area of the linkage. for admissions of multiple births a high error rate was found. 
conclusion and discussion: we successfully linked the admissions of singleton newborns with the developed probabilistic linking algorithm. for multiple births we did not succeed in constructing valid admission histories due to too low data quality of twin membership variables. validation showed alternative solutions for linking twin admissions. we strongly recommend that linkage results should always be externally validated. background: salmonella typhimurium definitive phage type (dt) has emerged as an important pathogen in the last two decades. a -fold increase in cases in the netherlands during september-november prompted an outbreak investigation. objective: the objective was to identify the source of infection to enable preventive measures. methods: a subset of outbreak isolates was typed by molecular means. in a case-control study, cases (n = ) and matched population controls (n = ) were invited to complete self-administered questionnaires. results: the molecular typing corroborated the clonality of the isolates. the molecular type was similar to that of a recent s. typhimurium dt outbreak in denmark associated with imported beef. the incriminated shipment was traced after having been distributed sequentially through several eu member states. sampling of the beef identified s. typhimurium dt of the same molecular type as the outbreak isolates. cases were more likely than controls to have eaten a particular raw beef product. conclusions: our preliminary results are consistent with this s. typhimurium dt outbreak being caused by contaminated beef. our findings underline the importance of european collaboration, traceability of consumer products and a need for timely intervention into distribution chains. background: heavy-metals may affect newborns. some of them are presenting tobacco smoke. objectives: to estimate cord-blood levels of mercury, arsenic, lead and cadmium in newborns in areas in madrid, and to assess the relationship with maternal tobacco exposure. design and methods: bio-madrid study obtained cord-blood samples from recruited trios (mother/father/ newborn). cold-vapor atomic absorption spectrophotometry (aas) was used to measure mercury and graphite-furnace aas for the other metals. mothers answered a questionnaire including tobacco exposure. median, means and standard errors were calculated and logistic regression used to estimate or. results: median levels for mercury and lead were . mg/l and . mg/l. arsenic and cadmium were undetectable in % and % of samples. preliminar analysis showed a significant association of maternal tobacco exposure and levels of arsenic (or: . ; % ci: . - . ), cadmium (or: . ; % ci: . - . ), and lead (or: . ; % ci: . - . ). smoking in pregnancy was associated to arsenic (or: . ; % ci: . - . ), while passive exposure was more related to lead (or: . ; % ci: . - . ) and cadmium (or: . ; % ci: . - . ). conclusion: madrid newborns have high cord-blood levels of mercury. tobacco exposure in pregnancy might increase levels of arsenic, cadmium and lead. background: road traffic accidents (rta) are the leading cause of death for young. rta police reports provide no health information other then the number of deaths and injured, while health databases have no information on the accident dynamics. the integration of the two databases would allows to better describe the phenomenon. nevertheless, the absence of common identification variables through the lists makes the deterministic record linkage (rl) impossible. 
objective: to test feasibility of a probabilistic rl between rta and health information when personal identifiers are lacking. methods: health information came from the rta integrated surveillance for the year . it integrates data from ed visits, hospital discharges and deaths certificates. a deterministic rl was performed with police reports, where the name and age of deceased were present. results of the deterministic rl was then used as gold standard to evaluate the performance of the probabilistic one. results: the deterministic rl resulted in ( . %) linked records. the probabilistic rl, where the name was omitted, was capable to correctly identify ( . %). conclusions: performance of the probabilistic rl was good. further work is needed to develop strategies for the use of this approach in the complete datasets. background: overweight constitutes a major public health problem. the prevalence of overweight is unequally distributed between socioeconomic groups. risk group identification, therefore, may enhance the efficiency of interventions. objectives: to identify which socioeconomic variable best predicts overweight in european populations: education, occupation or income. design: european community household panel data were obtained for countries (n = , ). overweight was defined as a body mass index >= kg/m . uni-and multivariate logistic regression analyses were employed to predict overweight in relationship to socioeconomic indicators. results: major socioeconomic differences in overweight were observed, especially for women. for both sexes, a low educational attainment was the strongest predictor of overweight. after control for confounders and the other socioeconomic predictors, the income gradient was either absent or positive (men) or negative (women) in most countries. similar patterns were found for occupational level. for women, inequalities in overweight were generally greater in southern european countries. conversely, for men, differences were generally greater in other european countries. conclusion: across europe, educational attainment most strongly predicts overweight. therefore, obesity interventions should target adults and children with lower levels of education. background: because incidence and prevalence of most chronic diseases rise with age, their burden will increase in ageing populations. we report the increase in prevalence of myocardial infarction (mi), stroke (cva), diabetes type ii (dm) and copd based on the demographic changes in the netherlands. in addition, for mi and dm the effect of a rise in overweight was calculated. methods: calculations were made for the period - with a dynamic multi-state transition model and demographic projections of the cbs. results: based on ageing alone, between and prevalence of dm will rise from . to . (+ %), prevalence of mi from . to . (+ %), stroke prevalence from . to . (+ %) and copd prevalence from . to . (+ %). a continuation of the dutch (rising) trend in overweight prevalence would in lead to about . diabetics (+ %). a trend resulting in american levels would lead to over million diabetics (+ %), while the impact on mi was much smaller: about . (+ %) in . conclusion: the burden of chronic disease will substantially increase in the near future. a rising prevalence of overweight has an impact especially on the future prevalence of diabetes background: there has been increasing concern about the effects of environmental endocrine disruptors (eeds) on human reproductive health. 
eeds include various industrial chemicals, as well as dietary phyto-estrogens. intra-uterine exposures to eeds are hypothesized to disturb normal foetal development of male reproductive organs and specifically, to increase the risk of cryptorchidism, hypospadias, testicular cancer, and a reduced sperm quality in offspring. objective: to study the associations between maternal and paternal exposures to eeds and the risks of cryptorchidism, hypospadias, testicular cancer and reduced sperm quality. design and methods: these associations are studied using a case-referent design. in the first phase of the study, we collected questionnaire data of the parents of cases with cryptorchidism, cases with hypospadias and referent children. in the second phase, we will focus on the health effects at adult age: testicular cancer and reduced sperm quality. in both phases, we will attempt to estimate the total eed exposure of parents of cases and referents at time of pregnancy through an exposure-assessment model in which various sources of exposure, e.g. environment, occupation, leisure time activities and nutrition, are combined. in addition, we will measure hormone receptor activity in blood. background: eleven percent of the pharmacotherapeutic budget is spent on acid-suppressive drugs (asd); % of patients are chronic user. most of these indications are not conform to dyspepsia guidelines. objectives: we evaluated the implementation of an asd rationalisation protocol among chronic users, and analysed effects on volume and costs. method: in a cohort study patients from gp's with protocol were compared to a control group of patients from gp's without. prescription data of - were extracted from agis health database. a log-linear regression model compared standardised outcomes of number of patients that stopped or reduced asd (> %) and of prescription volume and costs. results: gp's and patients in both groups were comparable. % in the intervention group had stopped; % in the control group (p< . ). the volume had decreased in another % of patients; % in control group (p< . ). compared to the baseline data in the control group ( %) the adjusted or of volume in the intervention group was . %. the total costs adjusted or was . %. the implementation significantly reduced the number of chronic users, and substantially dropped volume and costs. active intervention from insurance companies can stimulate rationalisation of prescription. background/objective: today, % of lung cancers are resectable (stage i/ii). -year survival is therefore low ( %). spiral computed tomography (ct) screening detects more lung cancers than chest x-ray. it is unknown if this will translate into a lung cancer mortality reduction. the nelson trial investigates whether detector multi-slice ct reduces lung cancer mortality with at least % compared to no screening. we present baseline screening results. methods: a questionnaire was sent to , men and women. of the , respondents, , high-risk current and former smokers were invited. until december , , , of them gave informed consent and were randomised ( : ) in a screen arm (ct in year , and ) and control arm (no screening). data will be linked with the cancer registry and statistics netherlands to determine cancer mortality and incidence. results: of the first , baseline ct examinations % was negative (ct after one year), % indeterminate (ct after months) and % positive (referral pulmonologist). seventy percent of detected tumours were resectable. 
conclusion/discussion: ct screening detects a high percentage of early stage lung cancers. it is estimated that the nelson trial is sufficiently large to demonstrate a % lung cancer mortality reduction or more. background: due to diagnostic dilemmas in childhood asthma, drug treatment of young children with asthmatic complaints often serves as a trial treatment. objective: to obtain more insight into patterns and continuation of asthma medication in children during the first years of life. design: prospective birth cohort study methods: within the prevention and incidence of asthma and mite allergy (piama) study (n = , children) we identified children using asthma medication in their first year of life. results: about % of children receiving asthma medication before the age of one, discontinued use during follow-up. continuation of therapy was associated with male gender (adjusted odds ratio [aor] . , % confidence interval [ci]: . - . ), a diagnosis of asthma (aor . , % ci: . - . ) and receiving combination or controller therapy (aor , , % ci: . - . ). conclusion: patterns of medication use in preschool children support the notion that both beta -agonist and inhaled corticosteroids are often used as trial medication, since % discontinues. the observed association between continuation of therapy and both an early diagnosis of asthma and a prescription for controller therapy suggests that, despite of diagnostic dilemmas, children in apparent need of prolonged asthma therapy are identified at this very early age. background: this study explored the differences in birthweight between infants of first and second generation immigrants and infants of dutch women, and to what extent maternal, fetal and environmental characteristics could explain these differences. method: during months all pregnant women in amsterdam attending their first prenatel screening were asked to fill out a questionnaire (sociodemographic status, lifestyle) as part of the amsterdam born children and their development (abcd)-study; women ( %) responded. only singleton deliveries with pregnancy duration = weeks were included (n = ). results: infants of all first and second generation immigrant groups (surinam, the antilles, turkey, morocco, ghana, other countries) had lower birthweights (range: - gram) than dutch infants ( gram). linear regression revealed that, adjusted for maternal height, weight, age, parity, smoking, marital status, gestational age and gender, infants of surinamese women ( st and nd generation), antillean and ghanaian women (both st generation) were still lighter than dutch infants ( . , . , . , and . grams respectively; p< . ). conclusion: adjusted for maternal, fetal and environmental characteristics infants of some immigrant groups had substantial lower birthweights than infants of dutch women. other factors (like genetics, culture) can possibly explain these differences. introduction: missing data is frequently seen in cost-effectiveness analyses (ceas). we applied multiple imputation (mi) combined with bootstrapping in a cea. objective: to examine the effect of two new methodological procedures of combining mi and bootstrapping in a cea with missing data. methods: from a trial we used direct health and non-health care costs and indirect costs, kinesiophobia and work absence data assessed over months. mi was applied by multivariate imputation by chained equations (mice) and non-parametric bootstrapping was used. 
observed case analyses (oca), where analyses were conducted on the data without missings, were compared with complete case analysis (cca) and with analyses when mi and bootstrapping were combined after to % of cost and effect data were omitted. results: by the cca effect and cost estimates shifted from negative to positive and cost-effectiveness planes and acceptability curves were biased compared to the oca. the methods of combining mi and bootstrapping generated good cost and effect estimates and the cost-effectiveness planes and acceptability curves were almost identical to the oca. conclusion: on basis of our study results we recommend to use the combined application of mi and bootstrapping in data sets with missingness in costs and effects. background: since the s, coronary heart disease (chd) mortality rates have halved with approximately % of this decrease being attributable to medical and surgical treatments. objective: this study examined the cost-effectiveness of specific chd treatments. design and methods: the impact chd model was used to calculate the number of life-years-gained (lyg) by specific cardiology interventions given in , and followed over ten years. this previously validated model integrates data on patient numbers, median survival in specific groups, treatment uptake, effectiveness and costs of specific interventions. cost-effectiveness ratios were generated as costs per lyg for each specific intervention. results: in , medical and surgical treatments together prevented or postponed approximately , chd deaths in patients aged - years; this generated approximately , extra life years. aspirin and beta-blockers for secondary prevention of myocardial infarction and heart failure, and spironolactone for heart failure all appeared highly cost-effective ( % (positive predictive value was %). conclusion: omron fails the validation criteria for ankle sbp measurement. however, the ease of use of the device could outweigh the inaccuracy if used as a screening tool for aai< , in epidemiologic studies. background: associations exist between: ) parental birth weight and child birth weight; ) birth weight and adult psychopathology; and ) maternal psychopathology during pregnancy and birth weight of the child. this study is the first to combine those associations. objective: to investigate the different pathways from parental birth weight and parental psychopathology to child birth weight in one model. design and methods: depression and anxiety scores on , mothers and , fathers during weeks pregnancy and birth weights from , children were available. path analyses with standardized regression coefficients were used to evaluate the different effects. results: in the unadjusted path analyses significant effects existed between: maternal (r = . ) and paternal birth weight (r = . ) and child birth weight; maternal birth weight and maternal depression (r=). ) and anxiety (r=). ); and maternal depression (r = . ) and anxiety (r = . ) and child birth weight. after adjustment for confounders, only maternal (r = . ) and paternal (r = . ) birth weight and maternal depression (r=). ) remained significantly related to child birth weight. conclusion after adjustment maternal depression, and not anxiety, remained significantly related to child birth weight. discussion future research should focus on the different mechanisms of maternal anxiety and depression on child birth weight. background: most patients with peripheral arterial disease (pad) die from coronary artery disease (cad). 
non-invasive cardiac imaging can assess the presence of coronary atherosclerosis and/or cardiac ischemia. screening in combination with more aggressive treatment may improve prognosis. objective: to evaluate whether a non-invasive cardiac imaging algorithm, followed by treatment will reduce the -year-risk of cardiovascular events in pad patients free from cardiac symptoms. design and methods: this is a multicenter randomized controlled clinical trial. patients with intermittent claudication and no history of cad are eligible. one group will undergo computed tomography (ct) calcium scoring. the other group will undergo ct calcium scoring and ct angiography (cta) of the coronary arteries. patients in the latter group will be scheduled for a dobutamine stress magnetic resonance imaging (dsmr) test to assess cardiac ischemia, unless a stenosis of the left main (lm) coronary artery (or its equivalent) was found on cta. patients with cardiac ischemia or a lm/lm-equivalent stenosis will be referred to a cardiologist, who will decide on further (interventional) treatment. patients are followed for years. conclusion: this study assesses the value of non-invasive cardiac imaging to reduce the risk of cardiovascular events in patients with pad free from cardiac symptoms. background: hpv is the main risk factor for cervical cancer and also a necessary cause for it. participation rates in cervical cancer screening are low in some countries and soon hpv vaccination will be available. objectives: aim of this systematic review was to collect and analyze published data on knowledge about hpv. design and methods: a medline search was performed for publications on knowledge about hpv as a risk factor for cervical cancer and other issues of hpv infection. results: of individual studies were stratified by age of study population, country of origin, study size, publication year and response proportion. heterogeneity was described. results: knowledge between included studies varied substantially. thirteen to % (closed question) and . to . % (open question) of the participants knew about hpv as a risk factor for cervical cancer. women had consistently better knowledge on hpv than men. there was confusion of hpv with other sexually transmitted diseases. conclusion and discussion: studies were very heterogeneous, thus making comparison difficult. knowledge about hpv infections depended on the type of question used, gender of the participants and their professional background. education of the general public on hpv infections needs improvement, specially men should also be addressed. background: influenza outbreaks in hospitals and nursing homes are characterized by high attack rates and severe complications. knowledge of the virus' specific transmission dynamics in healthcare institutions is scarce but essential to develop cost-effective strategies. objective: to follow and model the spread of influenza in two hospital departments and to quantify the contributions of the several possible transmission pathways. methods: an observational prospective cohort study is performed on the departments of internal medicine & infectious diseases and pulmonary diseases of the umc utrecht during the influenza season. all patients and regular medical staff are checked daily on the presence of fever and cough, the most accurate symptoms of influenza infection. nose-throat swabs are taken for pcr analysis for both symptomatic individuals and a sample of asymptomatic individuals. 
to determine transmission, contact patterns are observed between patients and visitors and patients and medical staff. results/discussion: spatial and temporal data of influenza cases will be combined with contact data in a mathematical model to quantify the main transmission pathways. among others the model can be used to predict the effect of vaccination of the medical staff which is not yet common practice in the studied hospital. background: the long term maternal sequelae of stillbirths is unknown. objectives: to assess whether women who experienced stillbirths have an excess risk of long term mortality. study design: cohort study. methods: we traced jewish women from the jerusalem perinatal study, a population-based database of all births to west jerusalem residents ( - who gave birth at least twice during the study period, using unique identity numbers. we compared the survival to - - of women who had at least one stillbirth (n = ) to that of women who had only live births (n = ) using cox proportional hazards models. results: during a median follow up of . years, ( . %) mothers with stillbirths died compared to , ( . %) unexposed women; crude hazard ratio (hr) . ( % ci: . - . ). the mortality risk remained significantly increased after adjustments for sociodemo-graphic variables, maternal diseases, placental abruption and preeclampsia (hr: . , % ci: . - . ). stillbirth was associated with increased risk of death from cardiovascular (adjusted hr: . , . - . ), circulatory ( . , . - . ) and genitourinary ( . , . - . ) causes. conclusions: the finding of increased mortality among mothers of stillbirths joins the growing body of evidence demonstrating long term sequelae of obstetric events. future studies should elucidate the mechanisms underlying these associations. resilience is one of the essential characteristics of successful ageing. however, very little is known about the determinants of resilience in old age. our objectives were to identify resilience in the english longitudinal study of ageing (elsa) and to investigate social and psychological factors determining it. the study design was a crosssectional analysis of wave of elsa. using structural equation modelling, we identified resilience as a latent variable indicated by high quality of life in the face of six adversities: ageing, limiting long-standing illness, disability, depression, perceived poverty and social isolation and we regressed it on social and psychological factors. the latent variable model for resilience showed a highly significant degree of fit (weighted root mean square resid-ual= . ). determinants of resilience included good quality of relationships with spouse (p = . ), family (p = . ), and friends (p< . ), good neighbourhood (p< . ), high level of social participation (p< . ), involvement in leisure activities (p = . ); perception of not being old (p< . ); optimism (p = . ), and high subjective probability of survival to an older age (p< . ). we concluded that resilience in old age was as much a matter of social engagement, networks and context as of individual disposition. implications of this on health policy are discussed. background: there is extensive literature concluding that ses is inversely related to obesity in developed countries. several studies in developing populations however reported curvilinear or positive association between ses and obesity. 
objectives: to assess the social distribution of obesity in men and women in middle-income countries of eastern and central europe with different level of economic development. methods: random population samples aged - years from poland, russia and czech republic were examined between - as baseline for prospective cohort study. we used body-mass index (bmi) and waist/hip ratio (whr) as obesity measures. we compared age-adjusted bmi and whr for men and women by educational levels in all countries. results: the data collection was concluded in summer . we collected data from about , subjects. lower ses increased obesity risk in women in all countries (the strongest gradient in the czech republic and the lowest in russia), and in czech men. there was no ses gradient in bmi in polish men and positive association between education and bmi in russian men. conclusions: our findings strongly agree with previous literature showing that the association between ses status and obesity is strongly influenced by overall level of country economic development. background: inaccurate measurements of body weight, height and waist circumference will lead to an inaccurate assessment of body composition, and thus of the general health of a population. objectives: to assess the accuracy of self-reported body weight, height and waist-circumference in a dutch overweight working population. design and methods: bland and altman methods were used to examine the individual agreement between self-reported and measured body weight and height in overweight workers ( % male; mean age . +/) . years; mean body mass index [bmi] . +/) . kg/m ). the accuracy of self-reported waistcircumference was assessed in a subgroup of persons ( % male; mean age . +/) . years; mean bmi . +/) . kg/ m ), for whom both measured and self-reported waist circumference was available. results: body weight was underreported by a mean (standard deviation) of . ( . ) kg, body height was on average over-reported by . ( . ) cm. bmi was on average underreported by . ( . ) kg/m . waist-circumference was overreported by . ( . ) cm. the overall degree of error from selfreporting was between . and . %. conclusion and discussion: self-reported anthropometrics seem satisfactorily accurate for the assessment of general health in a middle-aged overweight working population. the incidence of breast cancer and the prevalence of metabolic syndrome are increasing rapidly in chile, but this relationship is still debated. the goal of this study is to assess the association between metabolic syndrome and breast cancer before and after menopause. a hospital based case-control study was conducted in chile during . cases with histologically confirmed breast cancer and age matched controls with normal mammography were identified. metabolic syndrome was defined by atpiii and serum lipids, glucose, blood pressure and waist circumference were measured by trained nurses. data of potential confounders such as, obesity, socioeconomic status, exercise and diet were obtained by anthropometric measures and a questionnaire. odds ratios (ors) and % confidence intervals (cis) were estimated by conditional logistic regression stratified by menopause. in postmenopausal women, a significant increase risk of breast cancer was observed in women with metabolic syndrome (or = . , % ci = . - . ). the elements of metabolic syndrome strongly associated were high levels of glucose and hypertension. 
in conclusion, postmenopausal women with metabolic syndrome had a % excess risk of breast cancer. these findings support the theory that there is a different risk profile of breast cancer before and after menopause.

background: physical exercise during pregnancy has numerous beneficial effects on maternal and foetal health. it may, however, affect early foetal survival negatively. objectives: to examine the association between physical exercise and spontaneous abortion in a cohort study. design and methods: in total, , women recruited to the danish national birth cohort in early pregnancy provided information on the amount and type of exercise during pregnancy and on possible confounding factors. , women experienced foetal death before gestational weeks. hazard ratios for spontaneous abortion in four periods of pregnancy ( , - , - , and - weeks) according to amount (min/week) and type of exercise, respectively, were estimated using cox regression. various sensitivity analyses to reveal distortion of the results from selection forces and information bias were made. results: the hazard ratios of spontaneous abortion increased stepwise with the amount of physical exercise and were largest in the earlier periods of pregnancy (hr for weeks - = . (ci: . - . ) for min/week, compared to no exercise). weight-bearing types of exercise were most strongly associated with abortion, while swimming showed no association. these results remained stable, although attenuated, in the sensitivity analyses. discussion: handling of unexpected findings that furthermore challenge official public health messages will be discussed.

hemodialysis (hd) patients with a low body mass index (bmi) have an increased mortality risk, but bmi changes over time on dialysis treatment. we studied the association between changes in bmi and all-cause mortality in a cohort of incident hd patients. patients were followed until death or censoring for a maximum follow-up of years. bmi was measured every months and changes in bmi were calculated over each -mo period. with a time-dependent cox regression analysis, hazard ratios (hr) were calculated for these -mo changes on subsequent mortality from all causes, adjusted for the mean bmi of each -mo period, age, sex and comorbidity. men and women were included (age: ± years, bmi: . ± . kg/m², -y mortality: %). a loss of bmi > % was independently associated with an increased mortality risk (hr: . , % ci: . - . ), while a loss of - % showed no difference (hr: . , . - . ) compared to no change in bmi (- % to + % change). a gain in bmi of - % appeared beneficial (hr: . , . - . ), while a gain of bmi > % was not associated with a survival advantage (hr: . , . - . ). in conclusion, hd patients with a decreasing bmi have an increased risk of all-cause mortality.

background: within the tripss- project, the impact of clinical guidelines (gl) on venous thromboembolism (vte) prophylaxis was evaluated in a large italian hospital. gl were effective in increasing the appropriateness of prophylaxis and in reducing vte. objectives: we performed a cost-effectiveness analysis by using a decision-tree model to estimate the impact of the adopted gl on costs and benefits. design and methods: a decision-tree model compared prophylaxis costs and effects before and after gl implementation. four risk profiles were identified (low, medium, high, very high). possible outcomes were: no event, major bleeding, asymptomatic vte, symptomatic vte and fatal pulmonary embolism.
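the hemodialysis abstract above models -monthly bmi changes as a time-varying covariate in a cox regression. a minimal sketch of that kind of analysis using the lifelines library is shown below; the long-format data frame, variable names and values are hypothetical and the original study may well have used different software and covariates.

```python
import pandas as pd
from lifelines import CoxTimeVaryingFitter

# hypothetical long-format data: one row per follow-up interval per patient, with the
# percentage bmi change over that interval carried as a time-varying covariate
df = pd.DataFrame({
    "id":         [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "start":      [0, 6, 12, 0, 6, 0, 6, 0, 6],
    "stop":       [6, 12, 18, 6, 9, 6, 12, 6, 12],
    "bmi_change": [-4.0, -1.0, 0.5, 2.0, 6.0, 0.0, -6.0, 1.0, 0.5],
    "event":      [0, 0, 1, 0, 1, 0, 1, 0, 0],   # 1 = death at the end of the interval
})

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")
ctv.print_summary()   # exponentiated coefficients are hazard ratios per unit of bmi change
```

in a real analysis the adjustment covariates (mean bmi of the interval, age, sex, comorbidity) would be added as extra columns of the same long-format table.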
vte patients' risk and probability of receiving prophylaxis were defined using data from the previous survey. outcome probabilities were assumed from the literature. tariffs and hospital figures were used for costing the events. results: gl introduction reduced the average cost per patient from € to € (- %) with an increase in event-free patients (+ %). results are particularly relevant in the very-high-risk group. conclusion: the implementation of locally adapted gl may lead to a gain in terms of costs and effects, in particular for patients at highest vte risk.

background: assisted reproductive techniques are used to overcome infertility. one reason for success is the use of ovarian stimulation. objectives: to compare two ovarian stimulation protocols, gonadotropin-releasing hormone agonists/antagonists, assessing laboratory and clinical outcomes, to provide a proper therapy option, and to identify significant predictors of clinical pregnancy and ovarian response. design and methods: retrospective study (agonist cycles, ; antagonist cycles, ) including ivf/intracytoplasmic sperm injection cycles. multiple logistic and regression models, with the fractional polynomial method, were used. results: the antagonist group exhibited a shorter length of stimulation and a lower dose of recombinant follicle stimulating hormone (rfsh), and higher numbers of retrieved and fertilized oocytes and embryos. the agonist group presented a thicker endometrium and better fertilization, implantation and clinical pregnancy rates. clinical pregnancy showed a positive correlation with endometrial thickness and use of agonist, and a negative correlation with age and the number of previous attempts. the number of retrieved oocytes showed a positive correlation with estradiol on the day of human chorionic gonadotrophin (hcg) administration and use of antagonist, and a negative correlation with rfsh dose. conclusion: patients in the antagonist group are more likely to obtain more oocytes and quality embryos, whereas those in the agonist group are more likely to become pregnant.

background: prevalence studies of the metabolic syndrome require fasting blood samples and are therefore lacking in many countries including germany. objectives: to narrow the uncertainty resulting from extrapolation of international prevalence estimates, with a sensitivity analysis of the prevalence of the metabolic syndrome in germany using a nationally representative but partially non-fasting sample. methods: stepwise analysis of the german health examination survey , using the national cholesterol education program (ncep) criteria, hemoglobin a1c (hba1c), non-fasting triglycerides and fasting time. results: among participants aged - years, the metabolic syndrome prevalence was (i) . % with . % inconclusive cases using the unmodified ncep criteria, (ii) . % with . % inconclusive cases using hba1c > . % if fasting glucose was missing, (iii) . % with . % inconclusive cases when additionally using non-fasting triglycerides ≥ th percentile stratified by fasting time, and (iv) . % to . % with < % inconclusive cases using different cutoffs (hba1c . %, non-fasting triglycerides and mg/dl). discussion: despite a lower prevalence of obesity in germany compared to the us, the prevalence of the metabolic syndrome is likely to be of the same order of magnitude. this analysis may help promote healthy lifestyles by stressing the high prevalence of interrelated cardiovascular risk factors.

background: epidemiologic studies that directly examine fruit and vegetable (f&v) consumption and other lifestyle factors in relation to weight gain are sparse.
objective: we examined the associations between f&v intake and -y weight gain among spanish adults. design/methods: the study was conducted with a sub-sample of healthy people aged y and over at baseline in , who participated in a population-based nutrition survey in valencia, spain. data on diet, lifestyle factors and body weight (direct measurement) were obtained in and . information on weight gain was available for participants in . results: during the -y period, participants tended to gain on average . kg (median = . kg). in multivariate analyses, participants in the highest tertile of f&v intake at baseline had a % lower risk of gaining ≥ . kg compared with those in the lowest intake tertile, after adjustment for sex, age, education, smoking, tv-viewing, bmi, and energy intake (or = . ; % ci: . - . ; p-for-trend = . ). for every g/d increase in f&v intake, the or was reduced by % (or = . ; . - . ; p-trend = . ). tv-viewing at baseline was positively associated with weight gain, or per -h increase = . ( . - . ; p-trend = . ). conclusions: our findings suggest that a high f&v intake and low tv-viewing may reduce weight gain among adults.

background: diabetic patients develop atherosclerosis more readily and thus show a greatly increased risk of cardiovascular disease. objective: the heinz nixdorf recall study is a prospective cohort study designed to assess the prognostic value of new risk stratification methods. here we examined the association between diabetes, previously unknown hyperglycemia and the degree of coronary calcification (cac). methods: a population-based sample of , men and women aged - years was recruited from three german cities between - . the baseline examination included, among others, a detailed medical history, blood analyses and electron-beam tomography. we calculated prevalence ratios (pr) adjusted for age, smoking and bmi, with %-confidence intervals ( % ci), using log-linear binomial regression. results: the prevalence of diabetes was . % (men: . %, women: . %), and of hyperglycemia . % (men: . %, women: . %). the prevalence ratio for cac in male diabetics without overt coronary heart disease was . ( % ci: . - . ), and for those with hyperglycemia . ( . - . ). in women the association was even stronger: . ( . - . ) with diabetes, . ( . - . ) with hyperglycemia. conclusion: the data support the concept of regarding diabetic patients as being in a high-risk category (> % hard events in years). furthermore, even persons with elevated blood glucose levels already show higher levels of coronary calcification.

background: birth weight is associated with health in infancy and later in life. socioeconomic inequality in birth weight is an important marker of current and future population health inequality. objective: to examine the effect of maternal education on birth weight, low birth weight (lbw, birth weight < , g), and small for gestational age (sga).

background: in clinical practice patient data are obtained gradually and health care practitioners tune prognostic information according to the available data. prognostic research does not always reproduce this sequential acquisition of data: instead, 'worst', discharge or aggregate data are often used. objective: to estimate the prognostic performance of sequentially updated models. methods: cohort study of all very low birth weight babies (< g) admitted to the study neonatal icu < days after birth ( eligible from to ) and followed up until years old ( . % lost to follow-up).
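the heinz nixdorf abstract above estimates adjusted prevalence ratios with log-linear binomial regression. a minimal sketch of that kind of model in python/statsmodels is shown below; the data frame, variable names and simulated values are hypothetical, and the real analysis may have used different software, covariates and model checks.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# hypothetical individual-level data: coronary calcification (cac), diabetes, age, smoking
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "cac":      rng.binomial(1, 0.3, n),
    "diabetes": rng.integers(0, 2, n),
    "age":      rng.integers(45, 76, n),
    "smoking":  rng.integers(0, 2, n),
})

# log-linear binomial regression: exponentiated coefficients are adjusted prevalence ratios
model = smf.glm(
    "cac ~ diabetes + age + smoking",
    data=df,
    family=sm.families.Binomial(link=sm.families.links.Log()),
).fit()

pr = np.exp(model.params["diabetes"])
ci = np.exp(model.conf_int().loc["diabetes"])
print(f"adjusted prevalence ratio for diabetes: {pr:.2f} (95% ci {ci[0]:.2f}-{ci[1]:.2f})")
```

log-binomial models can fail to converge when fitted probabilities approach one; a robust-variance poisson model is a common fallback in that situation.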
main outcomes: disabling cerebral palsy at years ( , . %) or death ( , . %; % in the first weeks). main prognostic determinants: neonatal cerebral lesions identified with cranial ultrasound (us) exams performed per protocol on days , , and at discharge. logistic regression models were updated with the data available at these different moments in time during admission. results: at days , and respectively, the main predictor (severe parenchymal lesion) had adjusted odds ratios of , and ; the us model c-statistic was . , . and . . discussion: prognostic model performance in neonatal patients improved from inception to discharge, particularly for identification of the high-risk category. time of data acquisition should be considered when comparing prognostic instruments.

in epidemiological longitudinal studies one is often interested in the analysis of time patterns of censored history data, for example, how regularly a medication is used or how often someone is depressed. our goal is to develop a method to analyse time patterns of censored data. one of the tools in longitudinal studies is a nonhomogeneous markov chain model with discrete time moments and categorical state space (for example, use of various medications). suppose we are interested only in the time pattern of appearance of a particular state, which is in fact a certain epidemiological event under study. for this purpose we construct a new homogeneous markov chain associated with this event. the states of this markov chain are the time moments of the original nonhomogeneous markov chain. using the new transition matrix and standard tools from markov chain theory we can derive the probabilities of occurrence of that epidemiological event during various time periods (including ones with gaps), for example, probabilities of cumulative use of medication during any time period. in conclusion, the proposed approach based on a markov chain model provides a new way of data representation and analysis which is easy to interpret in practical applications.

background: tuberculosis (tb) cases that belong to a cluster of the same mycobacterium tuberculosis dna fingerprint are assumed to be a consequence of recent transmission. targeting interventions to fast-growing clusters may be an efficient way of interrupting transmission in outbreaks. objective: to assess predictors for large growing clusters compared to clusters that remain small within a -year period. design and methods: out of the culture-confirmed tb patients diagnosed between and , ( %) had unique fingerprints while were part of a cluster; of the clustered cases, were in a small cluster ( to cases within the first years) and in a large cluster (more than cases within the first years). results: independent risk factors for being a case within the first years of a large cluster were non-dutch nationality (or = . , % ci . - . ).

background: passive smoking causes adverse health outcomes such as lung cancer or coronary heart disease (chd). the burden of passive smoking on a population level is currently unclear and depends on several assumptions. we investigated the public health impact of passive smoking in germany. methods: we computed attributable mortality risks due to environmental tobacco smoke (ets). we considered lung cancer, chd, stroke, copd and sudden infant death. the frequency of passive smoking was derived from the national german health survey. sensitivity analyses were performed using different definitions of exposure to passive smoking.
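the markov chain abstract above derives probabilities that an event state is visited during a given time window. the numpy sketch below illustrates one standard way to compute such a probability (making the event state absorbing and propagating the state distribution); the two-state transition matrix and labels are hypothetical and the construction is a generic illustration, not the authors' exact formulation.

```python
import numpy as np

# hypothetical 2-state chain: 0 = "not using medication", 1 = "using medication"
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

def prob_event_in_window(P, start_dist, t_start, t_end, event_state=1):
    """probability that the chain visits `event_state` at least once in [t_start, t_end]."""
    # evolve the state distribution up to the start of the window
    dist = start_dist @ np.linalg.matrix_power(P, t_start)
    # make the event state absorbing inside the window, so visits are never "forgotten"
    P_abs = P.copy()
    P_abs[event_state, :] = 0.0
    P_abs[event_state, event_state] = 1.0
    dist = dist @ np.linalg.matrix_power(P_abs, t_end - t_start)
    return dist[event_state]

start = np.array([1.0, 0.0])   # everyone starts without medication
print(prob_event_in_window(P, start, t_start=2, t_end=6))
```

windows with gaps can be handled the same way by switching between the absorbing and the original transition matrix along the timeline.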
results: in total, deaths every year in germany are estimated to be caused by exposure to passive smoking at home (women , men ). most of these deaths are due to chd ( ) and stroke ( ). additional consideration of passive smoking at the workplace increased the number of deaths to . considering any exposure to passive smoking, and also active smokers who report exposure to passive smoking, increased the number of deaths further. conclusions: passive smoking has an important impact on mortality in germany. even the most conservative estimate, using exposure to ets at home, led to a substantial number of deaths related to passive smoking.

des daughters have a strongly increased risk of clear-cell adenocarcinoma of the vagina and cervix (ccac) at a young age. long-term health problems, however, are still unknown. we studied the incidence of cancer, other than ccac, in a prospective cohort of des daughters (desnet project). in , questionnaires were sent to des daughters registered at the des center in utrecht. also, informed consent was asked for linkage with disease registries. for this analysis, data from , responders and non-responders were linked to palga, the dutch nationwide network and registry of histo- and cytopathology. mean age at the end of follow-up was . years. a total of incident cancers occurred. increased standardized incidence ratios (sir) were found for vaginal/vulvar cancers (sir = . , % confidence interval ( % ci) . - . ), melanoma (sir = . , % ci . - . ) and breast cancer (sir = . , % ci . - . ) as compared to the general population. no increased risk was found for invasive cervical cancer, possibly due to effective screening. results for breast and cervical cancer are consistent with the sparse literature. the risk of melanoma might be due to surveillance bias. future analyses will include non-invasive cervical cancer, stage-specific sirs for melanoma and adjustment for confounding (sister control group) for breast cancer.

background: contact tracing plays an important role in the control of emerging infectious diseases in both human and farm animal populations, but little is known yet about its effectiveness. here we investigate, in a generic setting for well-mixed populations, the dependence of tracing effectiveness on the probability that a contact is traced, the possibility of iteratively tracing yet asymptomatic infectives, and delays in the tracing process. methods and findings: we investigate contact tracing in a mathematical model of an emerging epidemic, incorporating a flexible infectivity function and incubation period distribution. we consider isolation of symptomatic infected individuals as the basic scenario, and determine the critical tracing probability (needed for effective control) in relation to two infectious disease parameters: the reproduction ratio under isolation and the duration of the latent period relative to the incubation period. the effect of tracing delays is considered, as is the possibility of single-step tracing vs. iterative tracing of quarantined infectives. finally, the model is used to assess the likely success of tracing for influenza, smallpox, sars, and foot-and-mouth disease epidemics. conclusions: we conclude that single-step contact tracing can be effective for infections with a relatively long latent period or a large variation in incubation period, thus enabling backwards tracing of superspreading individuals. the sensitivity to changes in the tracing delay varies greatly, but small increases may have major consequences for effectiveness.
if single-step tracing is on the brink of being effective, iterative tracing can help, but otherwise it will not improve much. we conclude that contact tracing will not be effective for influenza pandemics, only partially effective for fmd epidemics, and very effective for smallpox and sars epidemics.

abstract: infections of highly pathogenic h n avian influenza in humans underline the need to track the ability of these viruses to spread among humans. here we propose a method of analysing outbreak data that allows determination of whether and to what extent transmission in a household has occurred after an introduction from the animal reservoir. in particular, it distinguishes between onward transmission from humans that were infected from the animal reservoir (primary human-to-human transmission) and onward transmission from humans who were themselves infected by humans (secondary human-to-human transmission). the method is applied to data from an epidemiological study of an outbreak of highly pathogenic avian influenza (h n ) in the netherlands in . we contrast a number of models that differ with respect to the assumptions on primary versus secondary human-to-human transmission.

usually, models for the spread of an infection in a population are based on the assumption of a randomly mixing population, where every individual may contact every other individual. however, the assumption of random mixing seems to be unrealistic; therefore one may also want to consider epidemics on (social) networks. connections in the network are possible contacts, e.g. if we consider sexually transmitted diseases and ignore all spread by other than sexual ways, the connections are only between people that may have intercourse with each other. in this talk i will compare the basic reproduction ratio r and the probability of a major outbreak for network models and for randomly mixing populations. furthermore, i will discuss which properties of the network are important and how they can be incorporated in the model.

in this talk a reproductive power model is proposed that incorporates the following points met when an epidemic disease outbreak is modelled statistically: ) the dependence of the data is handled with a non-homogeneous birth process. ) the first stage of the outbreak is described with an epidemic sir model. soon, control measures will start to influence the process; these measures are in addition to the natural epidemic removal process. the prevalence is related to the censored infection times in such a way that the distribution function, and therefore the survival function, satisfies approximately the first equation of the sir model. this leads in a natural way to the burr family of distributions. ) the non-homogeneous birth process handles the fact that in practice, with some delay, it is the infected that are registered and not the susceptibles. ) finally, the ending of the epidemic caused by the measures taken is incorporated by modifying the survival function with a final-size parameter in the same way as is done in long-term survival models. this method is applied to the dutch classical swine fever epidemic.

individual and area (municipality) measures of income, marital and employment status were obtained. there were , suicides and , controls.
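the contact tracing abstract above determines a critical tracing probability from the reproduction ratio under isolation. the sketch below illustrates the idea with a deliberately simplified branching-process argument (effective reproduction number reduced in proportion to the fraction of secondary cases traced); this simplification is an assumption made for illustration and is not the authors' full model with infectivity functions, latent periods and tracing delays.

```python
# simplified illustration: if isolation alone leaves a reproduction ratio R_iso,
# and a fraction p of secondary cases is found and quarantined by tracing before
# they transmit, the effective reproduction number is roughly R_eff = R_iso * (1 - p).
# control requires R_eff < 1, giving a critical tracing probability p_c = 1 - 1/R_iso.

def critical_tracing_probability(r_iso: float) -> float:
    """minimal tracing probability needed for control in this toy model."""
    return max(0.0, 1.0 - 1.0 / r_iso)

for r_iso in (0.8, 1.5, 2.5, 4.0):   # hypothetical reproduction ratios under isolation
    p_c = critical_tracing_probability(r_iso)
    print(f"R_iso = {r_iso:.1f} -> critical tracing probability = {p_c:.2f}")
```

the toy model already reproduces the qualitative message of the abstract: the larger the reproduction ratio that isolation alone leaves, the higher the fraction of contacts that must be traced for control.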
after controlling for compositional effects, ecological associations of increased suicide risk with declining area levels of employment and income and increasing levels of people living alone were much attenuated. individual-level associations with these risk factors were little changed when controlling for contextual effects. we found no consistent evidence that associations with individual-level risk factors differed depending on the areas' characteristics (cross-level interactions). this analysis suggests that the ecological associations reported in previous studies are likely to be due in greater part to the characteristics of the residents in those areas than to area influences on risk, rather than to contextual effects.

were found to be at higher risk. risk was significantly greater in women whose first full-term pregnancy was at age or more (or . ). in addition, more than full-term pregnancies would be expected to correlate with an increase in the risk (χ² = . , p < . ). in multivariate analysis, history of breast feeding was a significant factor in decreasing risk (or . , % ci . - . ).

the euregion meuse-rhine (emr) is an area comprising different regions with respect to language, culture and law. organisations and institutions frequently received signals about an increasing and region-related consumption of addictive drugs and risky behaviour of adolescents. in response, institutions from regions of the emr started a cross-border cooperation project, 'risky behaviour adolescents in the emr'. the partners intend to improve the efficiency of prevention programmes by investigating the prevalence and pre-conditional aspects related to risky behaviour, and creating conditions for best-practice public health. the project included two phases. study: two cross-border (epidemiological) studies were realised: a quantitative study of the prevalence of risky behaviour ( pupils) and a qualitative study that mapped pre-conditional aspects of risky behaviour and possibilities for preventive programmes. implementation: this served to bring about recommendations at the policy level as well as at the prevention level. during this phase the planning and realisation of cross-border prevention programmes and activities started. there is region-related variance in the prevalence of risky behaviour of adolescents in the emr. also, there are essential differences in legislation and regulation, (tolerated) policy, prevention structures, political and organizational priorities and social acceptance toward stimulants. cross-border studies and cooperation between institutions have resulted in best-practice projects in (border) areas of the emr.

abstract background: beta-blockers increase bone strength in mice and may reduce fracture risk in humans. therefore, we hypothesized that inhaled beta- agonists may increase the risk of hip fracture. objective: to determine the association between daily dose of beta- agonist use and risk of hip fractures. methods: a case-control study was conducted among adults who were enrolled in the dutch pharmo database (n = , ). cases (n = , ) were patients with a first hip fracture. the date of the fracture was the index date. four controls were matched by age, gender and region. we adjusted our analyses for indicators of asthma/copd severity, and for disease and drug history. results: low daily doses (dds) (< µg albuterol eq.) of beta- agonists (crude or . , % ci . - . ) did not increase the risk of hip fracture, in contrast to high dds (> µg albuterol eq., crude or . , % ci . - . ).
after extensive adjustment for indicators of the severity of the underlying disease (including corticosteroid intake), fracture risk in the high dd group decreased to . ( % ci . - . ). conclusions: high dds of beta- agonists are linked to an increased risk of hip fracture. extensive adjustments for the severity of the underlying disease are important when evaluating this association.

abstract: salivary nitrate arises from ingested nitrate and is the main source of gastric nitrite, a precursor of carcinogenic n-nitroso compounds. we examined the nitrate and nitrite levels in the saliva of children who used private wells for their drinking water supply. saliva was collected in the morning from children aged - years. the control group (n = ) drank water containing . - . mg/l (milligrams/litre) nitrate. exposure groups consisted of subjects who used private wells with nitrate levels in drinking water below mg/l (n = ; mean ± standard deviation . ± . mg/l) and above mg/l (n = ; . ± . mg/l), respectively. the nitrate and nitrite content of saliva samples was determined by a high-performance liquid chromatography method. the values of nitrate in saliva samples from the exposed groups ranged from . to . mg/l ( . ± . mg/l). for the control group, levels of . to . mg/l ( . ± . mg/l) were registered. no differences between levels of salivary nitrite from control and exposed groups were found. regression analysis of water nitrate concentrations and salivary nitrate showed significant correlations. in conclusion, we estimate that salivary nitrate may be used as a biomarker of human exposure to nitrate.

abstract: disinfection of public drinking water supplies produces trihalomethanes. epidemiological studies have associated chlorinated disinfection by-products with cancer, reproductive and developmental effects. we studied the levels of trihalomethanes (chloroform, dibromochloromethane, bromodichloromethane, bromoform) in drinking water delivered to the population living in some urban areas (n = ). the water samples (n = ) were analysed using a gas chromatographic method. assessment of exposure to trihalomethanes in tap water was based on monitoring data collected over - month periods and averaged over the entire water system. analytical data revealed that total trihalomethane levels were higher in the summer: mean ± standard deviation . ± . µg/l (micrograms/litre). these organic compounds were present at the end of the distribution networks ( . ± . µg/l). it is noted that we sometimes found high concentrations of chloroform exceeding the sanitary norm ( µg/l) in tap water (maximum value . µg/l). results of the sampling programs showed strong correlations between chlorine and trihalomethane values (correlation coefficient r = . to . , % credible interval). in conclusion, the population drank water with a low concentration of trihalomethanes, especially chloroform.

abstract objective: to determine the validity of a performance-based assessment of knee function, the dynaport® kneetest (dpkt), in first-time consulters with non-traumatic knee complaints in general practice. methods: patients consulting for non-traumatic knee pain in general practice, aged years and older, were enrolled in the study. at baseline and at -month follow-up, knee function was assessed by questionnaires and the dpkt; a physical examination was also performed at baseline. hypothesis testing assessed the cross-sectional and longitudinal validity of the dpkt. results: a total of patients were included for the dpkt, of which were available for analysis.
the studied population included women ( . %), median age was (range - ) years. at follow-up, patients ( . %) were available for dpkt. only out of ( %) predetermined hypotheses concerning cross-sectional and longitudinal validity were confirmed. comparison of the general practice and secondary care population showed a major difference in baseline characteristics, dynaport knee score, internal consistency and hypotheses confirmation concerning the construct validity. conclusion: the validity of the dpkt could not be demonstrated for first-time consulters with non-traumatic knee complaints in general practice. measurement instruments developed and validated in secondary care are not automatically also valid in primary care setting. abstract although animal studies have described the protective effects of dietary factors supplemented before radiation exposure, little is known about the lifestyle effects after radiation exposure on radiation damage and cancer risks in human. the purpose of this study is to clarify whether lifestyle can modify the effects of radiation exposure on cancer risk. a cohort of , japanese atomic-bomb survivors, for whom radiation dose estimates were currently available, had their lifestyle assessed in . they were followed during years for cancer incidence. the combined effect of smoking, drinking, diet and radiation exposure on cancer risk was examined in additive and multiplicative models. combined effects of a diet rich in fruit and vegetables and ionizing radiation exposure resulted in a lower cancer risk as compared to those with a diet poor in fruit and vegetables and exposed to radiation. similarly, those exposed to radiation and who never drink alcohol or never smoke tobacco presented a lower oesophagus cancer risk than those exposed to radiation and who currently drink alcohol or smoke tobacco. there was no evidence to reject either the additive or the multiplicative model. a healthy lifestyle seems beneficial to persons exposed to radiation in reducing their cancer risks. abstract background: clinical trials have shown significant reduction in major adverse cardiac events (mace) following implantation of sirolimus-eluting (ses) vs. bare-metal stents (bms) for coronary artery disease (cad). objective: to evaluate long-term clinical outcomes and economic implications of ses vs. bms in usual care. methods: in this prospective intervention study, cad patients were treated with bms or ses (sequential control design). standardized patient and physician questionnaires , , and months following implantation documented mace, disease-related costs and patient quality of life (qol). results: patients treated with ses (mean age ± , % male), with bms (mean age ± , % male). there were no significant baseline differences in cardiovascular risk factors and severity of cad. after months, % ses vs. % bms patients had suffered mace (p< . ). initial hospital costs were higher with ses than with bms, but respective month follow-up direct and indirect costs were lower ( , ± vs. , ± euro and ± vs. , ± euro, p = ns). overall, disease-related costs were similar in both groups (ses , ± , bms , ± , p = ns) . differences in qol were not significant. conclusions: as in clinical trials, ses patients experienced significantly fewer mace than bms patients during -month follow-up with similar overall costs and qol. abstract background: meta-analyses that use individual patient data (ipd-ma) rather than published data have been proposed as an improvement in subgroup-analyses. 
objective: to study ( ) whether and how often ipd-ma are used to perform subgroup-analyses and ( ) whether the methodology used for subgroup-analyses differs between ipd-ma and meta-analyses of published data (map). methods: ipd-ma were identified in pubmed. related article search was used to identify map on the same objective. meta-analyses not performing subgroup-analysis were excluded from further analyses. differences between ipd-ma and map were analysed, and reasons for discrepancies were described.

we recently developed a simple diagnostic rule (including history and physical findings plus d-dimer assay results) to safely exclude the presence of deep vein thrombosis (dvt) without the need for referral in primary care patients suspected of dvt. when applied to new patients, the performance of any (diagnostic or prognostic) prediction rule tends to be lower than expected based on the original study results. therefore, rules need to be tested for their generalizability. the aim was to determine the generalizability of the rule. in this cross-sectional study, primary care patients with suspicion of dvt were prospectively identified. the rule items were obtained from each patient, plus ultrasonography as the reference standard. the accuracy of the rule was quantified in terms of its discriminative performance, sensitivity, specificity, negative predictive value, and negative likelihood ratio, with accompanying % confidence intervals. dvt could be safely excluded in % ( % in the original study) of the patients, without referral. none of these patients had dvt ( . % in the derivation population). in conclusion, the rule appears to be a safe diagnostic tool for excluding dvt in patients suspected of dvt in primary care.

abstract background: long-term exposure to very low concentrations of asbestos in the environment and its relation to the incidence of mesothelioma contribute to insight into the dose-response relationship and public health policy. aim: to describe regional differences in the occurrence of mesothelioma in the netherlands in relation to the occurrence in the asbestos-polluted area around goor, and to determine whether the increased incidence of pleural mesothelioma among women in this area could be attributed to environmental exposure to asbestos. methods: mesothelioma cases were selected in the period - from the netherlands cancer register (n = ). for the women in the goor region (n = ), exposure to asbestos due to occupation, household or environment was verified from the medical files, the general practitioner and next-of-kin for cases. results: in goor the incidence of pleural mesothelioma among women was -fold increased compared with the netherlands, and among men -fold. of the additional cases among women, cases were attributed to the environmental asbestos pollution and in cases this was the most likely cause. the average cumulative asbestos exposure was estimated at . fiber-years.

temporal trends and gender differences were investigated by random slope analysis. variance was expressed using the median odds ratio (mor). results: ohcs appeared to be more relevant than administrative areas for understanding physicians' propensity to follow prescription guidelines (mor_ohc = . and mor_aa = . ). conclusion and discussion: as expected, the intervention increased prevalence and decreased variance, but at the end of the observation period practice variation remained high.
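the dvt rule abstract above reports sensitivity, specificity, negative predictive value and negative likelihood ratio with confidence intervals. a minimal sketch of those computations from a 2x2 validation table is shown below; the cell counts are hypothetical, and wilson intervals are one reasonable choice rather than necessarily the method used in the study.

```python
from statsmodels.stats.proportion import proportion_confint

# hypothetical 2x2 table: rule classifies suspected patients as "refer" vs. "exclude dvt"
tp, fn = 120, 2      # dvt present: rule refers (tp) / rule excludes (fn)
fp, tn = 310, 270    # dvt absent:  rule refers (fp) / rule excludes (tn)

sens = tp / (tp + fn)
spec = tn / (tn + fp)
npv = tn / (tn + fn)
lr_neg = (1 - sens) / spec    # negative likelihood ratio

sens_ci = proportion_confint(tp, tp + fn, method="wilson")
npv_ci = proportion_confint(tn, tn + fn, method="wilson")

print(f"sensitivity {sens:.3f} (95% ci {sens_ci[0]:.3f}-{sens_ci[1]:.3f})")
print(f"specificity {spec:.3f}, npv {npv:.3f} (95% ci {npv_ci[0]:.3f}-{npv_ci[1]:.3f})")
print(f"negative likelihood ratio {lr_neg:.3f}")
```

a "safe exclusion" claim hinges mainly on the npv and the lower bound of its confidence interval, which is why validation in a new population, as the abstract describes, matters more than the apparent performance in the derivation set.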
these results may reflect inefficient therapeutic traditions, and suggest that more intensive interventions may be necessary to promote rational statin prescription.

abstract background: mortality rates in skåne, sweden have decreased in recent years. whether this decline has been similar for different geographical areas has not been examined closely. objectives: we wanted to illustrate trends and geographical inequities in all-cause mortality between the municipalities in skåne, sweden from to . we also aimed to explore the application of multilevel regression analysis (mlra) in our study, since it is a relatively new methodology for describing mortality rates. design and methods: we used linear mlra with years at the first level and municipalities at the second to model directly age-standardized rates. temporal trends were examined by random slope analysis. variance across time was expressed using the intra-class correlation (icc). results: the municipality level was very relevant for understanding temporal differences in mortality rates (icc = %). on average, mortality decreased over the study period, but this trend varied considerably between municipalities; geographical inequalities over the years were u-shaped, with the lowest variance in the s (var = ). conclusion: mortality has decreased in skåne but municipality differences are increasing again. mlra is a useful technique for modelling mortality trends and variation among geographical areas.

abstract background: ozone has adverse health effects but it is not clear who is most susceptible. objective: identification of individuals with increased ozone susceptibility. methods: daily visits for lower respiratory symptoms (lrs) in general practitioner (gp) offices in the north of the netherlands ( - , patients) were related to daily ozone levels in summer. ozone effects were estimated for patients with asthma, copd, atopic dermatitis, and cardiovascular diseases (cvd) and compared to effects in patients without these diseases. generalized additive models adjusting for trend, weekday, temperature, and pollen counts were used. results: the mean daily number of lrs visits in summer in the gp offices varied from . to . . the mean (sd) -hour maximum ozone level was . ( . ) µg/m³. rrs ( % ci) for a µg/m³ increase (from th to th percentile) in the mean of lag to of ozone for patients with/without disease are:

abstract: asthma is a costly health condition; its economic effect is greater than that estimated for aids and tuberculosis together. following the global initiative for asthma recommendations that require more data about the burden of asthma, we have determined the cost of this illness from - . an epidemiological approach based on population studies was used to estimate global as well as direct and indirect costs. data were obtained mainly from the national health ministry database, the national statistics institute of spain and the national health survey. the costs were averaged and adjusted to euros. we have found a global burden (including private medicine) of million euros. indirect and direct costs account for . % and . %, respectively. the largest components within direct costs were pharmaceutical ( . %), primary health care systems ( . %), hospital admissions ( . %) and hospital non-emergency ambulatory visits ( . %). within indirect costs, total cessation of work days ( . %), permanent labour incapacity ( . %) and early mortality ( . %) costs were the main components.
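the skåne abstract above partitions variance in mortality rates between municipalities and years and summarises it with an intra-class correlation. a minimal sketch of that idea with a random-intercept model in statsmodels is shown below; the simulated municipality-year data, variable names and variance components are hypothetical and only illustrate how an icc can be extracted from such a model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical age-standardized mortality rates for 20 municipalities over 15 years
rng = np.random.default_rng(1)
munis = np.repeat(np.arange(20), 15)
years = np.tile(np.arange(15), 20)
muni_effect = rng.normal(0, 8, 20)[munis]            # between-municipality variation
rates = 900 - 5 * years + muni_effect + rng.normal(0, 6, munis.size)
df = pd.DataFrame({"municipality": munis, "year": years, "rate": rates})

# random-intercept multilevel model: years (level 1) nested within municipalities (level 2)
model = smf.mixedlm("rate ~ year", df, groups=df["municipality"]).fit()

between = model.cov_re.iloc[0, 0]    # municipality-level variance
within = model.scale                 # residual (year-level) variance
icc = between / (between + within)
print(f"icc: {icc:.1%} of the total variance lies between municipalities")
```

a random slope for year, as used in the abstract, would additionally let the municipal mortality trends differ, which is what reveals whether geographical inequalities widen or narrow over time.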
pharmaceutical cost is the largest component, as in most studies from developed countries, followed by primary health care systems, unlike some reports that place hospital admissions second. finally, direct costs represent . % of total health care expenditure.

abstract background: it is well known that fair phenotypical characteristics are a risk factor for cutaneous melanoma. the aim of our study was to investigate the analogous associations between phenotypical characteristics and uveal melanoma. design/methods: in our case-control study we compared incident uveal melanoma patients with population controls to evaluate the role of phenotypical characteristics such as iris, hair and skin color and other risk factors in the pathogenesis of this tumor. a total of patients and controls matched on sex, age and region were interviewed. conditional logistic regression was used to calculate odds ratios (or) and % confidence intervals ( % ci). results: the risk of uveal melanoma was increased among people with light iris color (or = . , % ci . - . ), and light skin color was slightly associated with an increased risk of uveal melanoma (or = . , % ci . - . ). hair color, tanning ability, burning tendency and freckles as a child showed no increased risk. results of the combined analysis of eye and hair color, burning tendency and freckles showed that only light iris color was clearly associated with uveal melanoma risk. conclusion: among potential phenotypical risk factors, only light iris and skin color were identified as risk factors for uveal melanoma.

abstract background: between-study variation in estimates of the risk of hcv mother-to-child transmission (mtct) and associated risk factors may be due to methodological differences or changes in population characteristics over time. objective: to investigate the effect of sample size and time on risk factors for mtct of hcv. design and methods: heterogeneity was assessed before pooling data. logistic regression estimated odds ratios for risk factors. results: the three studies included mother-child pairs born between and , born between and , and between and . there was no evidence of heterogeneity of the estimates for maternal hcv/hiv co-infection and mode of delivery (q = . , p = . and q = . , p = . , respectively). in the pooled analysis, the proportion of hcv/hiv co-infected mothers significantly decreased from % before to % since (p < . ). the pooled adjusted odds ratios for maternal hcv/hiv co-infection and elective caesarean section delivery were . ( % ci . - . ), p < . , and . ( % ci . - . ), p = . , respectively. there was no evidence that the effect of risk factors for mtct changed over time. conclusion: although certain risk factors have become less common, their effect on mtct of hcv has not changed substantially over time.

abstract background: the need to gain insight into prevailing eating patterns and their health effects is evident. objective: to identify dietary patterns and their relationship with total mortality in dutch older women. methods: principal components analysis on food groups was used to identify dietary patterns among , women ( - y) included in the dutch epic-elderly cohort (follow-up ~ . y). mortality ratios for three major principal components were assessed using cox proportional hazards analysis.
results: the most relevant principal components were a 'mediterranean-like' pattern (high in vegetable oils, pasta/rice, sauces, fish, and wine), a 'traditional dutch dinner' pattern (high in meat, potatoes, vegetables, and alcoholic beverages) and a 'healthy traditional' pattern (high in vegetables, fruit, non-alcoholic drinks, dairy products, and potatoes). during , person-years of follow-up, deaths occurred. independent of age, education, and other lifestyle factors, only the 'healthy traditional' pattern score was associated with a lower mortality rate; women in the highest tertile experienced a percent reduced mortality risk. conclusion: from this study, a healthy traditional dutch diet, rather than a mediterranean diet, appears beneficial for longevity and feasible for health promotion. this diet is comparable to other reported 'healthy' or 'prudent' diets that have been shown to be protective.

parents of (aged - ) and (aged - ) children were sent a questionnaire, as were adolescents (aged - ). to assess validity, generic outcome instruments were included (the infant toddler quality of life questionnaire (itqol) or the child health questionnaire (chq), and the euroqol- d). the response rate was - %. internal consistency of the hobq and boq scales was good (cronbach's alphas > . in all but two scales). test-retest results showed no differences in - % of scales. high correlations between hobq and boq scales and conceptually equivalent generic outcome instruments were found. the majority of hobq ( / ) and boq scales ( / ) showed significant differences between children with a long versus short length of stay. the dutch hobq and boq can be used to evaluate functional outcome after burns in children.

the study estimated caesarean section rates and odds ratios for caesarean section in association with maternal characteristics in both public and private sectors, and maternal mortality associated with mode of delivery in the public sector, adjusted for hypertension, other disorders, problems and complications, as well as maternal age. results: the caesarean section rate was . % in the public sector, and . % in the private sector. the odds ratio for caesarean section was . ( % ci: . - . ) for women with or more years of education. the odds ratio for maternal mortality associated with caesarean section in the public sector was . ( % ci: . - . ). conclusion and discussion: são paulo presented high caesarean section rates. caesarean section compared to vaginal delivery in the public sector presented a higher risk of mortality even when adjusted for hypertension, other disorders, problems and complications, as well as maternal age.

we show that serious bias in questionnaires can be revealed by bland-altman methods but may remain undetected by correlation coefficients. we refute the argument that correlation coefficients properly investigate whether questionnaires rank subjects sufficiently well. conclusions: the commonly used correlation approach can yield misleading conclusions in validation studies. a more frequent and proper use of the bland-altman methods would be desirable to improve epidemiological data quality.

abstract: screening performance relies on the quality and efficiency of protocols and guidelines for screening and follow-up. evidence of low attendance rates, over-screening of young women and low smear specificity gathered by the early s in the dutch cervical cancer screening program called for an improvement. several protocols and guidelines were redefined in , with emphasis on assuring that these would be adhered to.
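the burns questionnaire abstract above reports internal consistency with cronbach's alpha; a minimal numpy sketch of that statistic is shown below, using a hypothetical item-response matrix rather than the hobq/boq data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scale scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()     # sum of the individual item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of the scale total score
    return k / (k - 1) * (1 - item_vars / total_var)

# hypothetical responses: 6 respondents answering a 4-item scale
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [4, 4, 4, 3],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
])
print(f"cronbach's alpha = {cronbach_alpha(scores):.2f}")
```

alpha measures how consistently the items of one scale move together; it complements, but does not replace, the agreement analysis against an external standard discussed in the bland-altman abstract above.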
we assessed improvement since by changes in various indicators: coverage rates, follow-up compliance and number of smears. information on all cervix uteri tests in the netherlands registered until st march was retrieved from the nationwide registry of histo-and cytopathology (palga). five-year coverage rate in the age group - years rose to %. the percentage of screened women in follow-up decreased from % to %. fourteen percent more women with abnormal smears were followed-up, and the time spent in follow-up decreased. a % decrease in the annual number of smears made was observed, especially among young women. in conclusion, the changes in protocols and guidelines, and their implementation have increased coverage and efficiency of screening, and decreased the screening-induced negative side effects. similar measures can be used to improve other mass screening programmes. abstract background: it is common knowledge that in low endemic countries the main transmission route of hepatitis b infection is sexual contact, while in high endemic regions it is perinatal transmission and horizontal household transmission in early childhood. objectives: to get insight into what determines the main transmission route in different regions. design and methods: we used a formula for the basic reproduction number r for hepatitis b in a population stratified by age and sexual activity to investigate under which conditions r > . using data extracted from the literature we investigated how r depends on fertility rates, rates of horizontal childhood transmission and sexual partner change rates. results: we identified conditions on the mean offspring number and the transmission probabilities for which perinatal and horizontal childhood transmission alone ensures that r > . those transmission routes are then dominant, because of the high probability for children to become chronic carriers. sexual transmission dominates if fertility is too low to be the driving force of transmission. conclusion: in regions with high fertility rates hepatitis b can establish itself on a high level of prevalence driven by perinatal and horizontal childhood transmission. therefore, demographic changes can influence hepatitis b transmission routes. abstract background: the artificial oestrogen diethylstilboestrol is known to be fetotoxic. thus, intrauterine exposure to other artificial sex hormones may increase the risk of fetal death. objective: to study if use of oral contraceptive months prior to or during pregnancy is associated to an increased risk of fetal death. design and methods: a cohort study of pregnant women who were recruited into the danish national birth cohort during the years - and interviewed about exposures during pregnancy, either during the first part of their pregnancy (n = ) or following a fetal loss (n = ). cox regression analyses with delayed entry were used to estimate the risk of fetal death. results: in total ( . %) women took oral contraceptives during pregnancy. use of combined oestrogen and progesterone oral contraceptives (coc) or progesterone only oral contraceptives (poc) during pregnancy were not associated with increased hazard ratios of fetal death compared to non-users, hr . ( % ci . - . ) and hr . ( % ci . - . ) respectively. neither use of coc nor poc prior to pregnancy was associated with fetal death. conclusion: use of oral contraceptive months prior to conception or during pregnancy is not related to an increased risk of fetal death. 
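the hepatitis b abstract above works with a basic reproduction number for a population stratified by age and sexual activity. one common way to evaluate such a threshold numerically is the spectral radius of a next-generation matrix; the sketch below is a generic illustration with hypothetical group-to-group transmission values, not the authors' specific formula for perinatal, horizontal childhood and sexual transmission.

```python
import numpy as np

# hypothetical next-generation matrix K: K[i, j] is the expected number of new infections
# in group i caused by one infected individual in group j (groups could be, for example,
# children / low-sexual-activity adults / high-sexual-activity adults)
K = np.array([
    [0.6, 0.1, 0.1],
    [0.2, 0.3, 0.4],
    [0.1, 0.5, 1.1],
])

r0 = np.abs(np.linalg.eigvals(K)).max()   # spectral radius of the next-generation matrix
status = "sustained transmission possible (r0 > 1)" if r0 > 1 else "no sustained transmission"
print(f"r0 = {r0:.2f} -> {status}")
```

changing the entries that encode fertility-driven childhood transmission versus sexual-contact transmission shifts which route pushes r0 above one, which is the qualitative point the abstract makes about demographic change.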
abstract background: few studies have been performed to assess whether water fluoridation reduces social inequalities among groups of different socioeconomic status, and none of them was conducted in developing countries. objectives: to assess socioeconomic differences between brazilian towns with and without water fluoridation, and to compare dental caries indices among socioeconomic strata in fluoridated and non-fluoridated areas. design and methods: a countrywide survey of oral health performed in - and comprising , children aged years provided information about dental caries indices in brazilian towns. socioeconomic indices, the coverage and the fluoride status of the water supply network of participating towns were also appraised. multivariate regression models were fitted. inequalities in dental outcomes were compared in towns with and without fluoridated tap water. results: better-off towns tended to present a higher coverage by the water supply network, and were more inclined to add fluoride. fluoridated tap water was associated with an overall improved profile of caries, concurrent with a markedly larger inequality in the distribution of dental disease. conclusion: suppressing inequalities in the distribution of dental caries requires expanded access to fluoridated tap water, a strategy that can be effective in fostering further reductions in caries indices.

objective: to investigate the role of family socioeconomic trajectories from childhood to adolescence on dental caries and associated behavioural factors. design and methods: a population-based birth cohort was carried out in pelotas, brazil. a sample (n = ) of the population of subjects born in was dentally examined and interviewed at age . dental caries index, care index, toothbrushing, flossing, and pattern of utilization of dental services were the outcomes. these measures were compared among four different family income trajectories. results: adolescents who were always poor showed, in general, a worse dental caries profile, whilst adolescents who were never poor had a better dental caries profile. adolescents who had moved from poverty in childhood to non-poverty in adolescence, and those who had moved from non-poverty in childhood to poverty in adolescence, had dental profiles similar to those who were always poor, except for the pattern of utilization of dental services, which was higher in the first group. conclusion: poverty in at least one stage of the lifespan has a harmful effect on dental caries, oral behaviours and utilization of dental services.

we assessed contextual and individual determinants of dental caries in the brazilian context. a country-wide survey of oral health informed the dental status of , twelve-year-old schoolchildren living in towns in . a multilevel model adjusted untreated caries prevalence for individual (socio-demographic characteristics of examined children) and contextual (geographic characteristics of participating towns) covariates. being black (or = . ; % ci: . - . ), living in rural areas (or = . ; . - . ) and studying in public schools (or = . ; . - . ) increased the odds of having untreated decayed teeth. the multilevel model identified the fluoride status of water supplies (β = - . ), the proportion of households linked to the water network (β = - . ) and the human development index (β = - . ) as town-level covariates of caries experience.
better-off brazilian regions presented an improved profile of dental health, besides having a less unequal distribution of dental treatment needs between blacks and whites, rural and urban areas, and public and private schools. dental caries experience is prone to socio-demographic and geographic inequalities. monitoring contrasts in dental health outcomes is relevant for programming socially appropriate interventions aimed both at overall improvements and at the targeting of resources for population groups presenting higher levels of need.

abstract background: ultraviolet radiation (uvr) is the main cause of nonmelanoma skin cancer but has been hypothesised to protect against the development of prostate cancer (pc). if this is true, skin cancer patients should have a lower pc incidence than the general population. objectives: to study the incidence of pc after a diagnosis of skin cancer. design and methods: using the eindhoven cancer registry, a cohort of male skin cancer patients diagnosed since ( squamous cell carcinoma (scc), basal cell carcinoma (bcc) and melanoma (cm)) was followed up for incidence of invasive pc. observed incidence rates of pc amongst skin cancer patients were compared to those in the reference population, resulting in standardised incidence ratios (sir). results: scc (sir . ( % ci: . ; . )) and bcc (sir . ( % ci: . ; . )) showed a decreased incidence of pc, while cm did not. patients with bccs occurring in the chronically sun-exposed head and neck area (sir . ( % ci: . ; . )) had significantly lower pc incidence rates. conclusion and discussion: although numbers of scc and cm were too small to obtain unequivocal results, this study partly supports the hypothesis that uvr protects against pc and also illustrates that cm patients differ from nmsc patients in several aspects.

abstract introduction: hypo- and hyperthyroidism have been associated with various symptoms and metabolic dysfunctions in men and women. incidences of these diseases have been estimated in a cohort of middle-aged adults in france. methods: the su.vi.max (supplémentation en vitamines et minéraux antioxydants) cohort study included volunteers followed up for eight years since - . the incidence of hypo- and hyperthyroidism was estimated retrospectively from scheduled questionnaires and the data transmitted by the subjects during their follow-up. factors associated with incident cases were identified by cox proportional hazards models. results: among the subjects free of thyroid dysfunction at inclusion, incident cases were identified. after an average follow-up of . years, the incidence of hyper- and hypothyroidism was . % in men, . % in - -year-old women, and . % in - -year-old women. no associated factor was identified in men. in women, age and alcohol consumption (> grams/day) increased the risk of hypo- or hyperthyroidism, while a high urinary thiocyanate level in - appeared to be a protective factor. conclusion: the incidences of hypo- and hyperthyroidism observed in our study, as well as the associated risk factors found, are in agreement with data from studies performed in other countries.

abstract background: lung cancer is the most frequent malignant neoplasm worldwide. in , the number of new lung cancer cases was estimated at . million, which accounts for over % of all new cases of neoplasm registered around the globe. it is also the leading cause of cancer deaths. objective: the objective of this paper is to provide a systematic review of lifestyle-related factors for lung cancer risk.
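the skin cancer abstract above compares observed to expected prostate cancer cases through standardised incidence ratios. the sketch below shows that calculation with an exact poisson confidence interval; the counts are hypothetical, and the exact interval based on chi-square quantiles is one standard choice rather than necessarily the registry's method.

```python
from scipy.stats import chi2

def sir_with_exact_ci(observed: int, expected: float, alpha: float = 0.05):
    """standardised incidence ratio with an exact poisson confidence interval."""
    sir = observed / expected
    lower = chi2.ppf(alpha / 2, 2 * observed) / (2 * expected) if observed > 0 else 0.0
    upper = chi2.ppf(1 - alpha / 2, 2 * (observed + 1)) / (2 * expected)
    return sir, lower, upper

# hypothetical: 42 prostate cancers observed among skin cancer patients, 60.5 expected
sir, lo, hi = sir_with_exact_ci(observed=42, expected=60.5)
print(f"sir = {sir:.2f} (95% ci {lo:.2f}-{hi:.2f})")
```

an upper confidence limit below one is what supports the "decreased incidence" reading for the scc and bcc groups in the abstract.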
methods: data sources were medline from january to december (search terms in the title field). search terms included lung cancer, tobacco smoke, education, diet, alcohol consumption and physical activity. book chapters, monographs, relevant news reports, and web material were also reviewed to find articles. results: the results of the literature review suggest that smoking is a major, unquestionable factor in lung cancer risk. exposure to environmental tobacco smoke (ets) and education could also play a role in the occurrence of the disease. diet, alcohol consumption and physical activity level are other important but less well-established determinants of lung cancer. conclusions: effective prevention programs against some of the lifestyle-related factors for lung cancer, especially smoking, must be developed to minimize potential health risks and prevent future health costs.

stedendriehoek twente and south (n = ), additional data (co-morbidity, complications after surgery and follow-up) were gathered. cox regression analyses were used. results: the proportion of resections declined from % of patients aged < years to % of patients aged ≥ years, whereas primary radiotherapy increased from % to %. in the two regions, patients ( %) underwent resection. co-morbid conditions did not influence the choice of therapy. % had complications. postoperative mortality was %. in multivariate analysis, only treatment had an independent effect. two-year survival was % for patients undergoing surgical resection and % for those receiving radiotherapy (p < . ). conclusion: the number of co-morbid conditions did not influence the choice of treatment, postoperative complications, or survival in patients with nsclc aged ≥ years.

the epidemiology of oesophageal cancer has changed in recent decades. the incidence has increased sharply, mainly comprising men, adenocarcinoma and tumours of the lower third of the oesophagus. the eurocare study suggested large variation in survival between european countries, primarily related to early mortality. to study potential explanations, we compared data from the rotterdam and thames cancer registries. computer records from , patients diagnosed with oesophageal cancer in the period - were analysed by age, gender, histological type, tumour subsite, period and region. there was a large variation in resection rates between the two regions, % for rotterdam versus % for thames (p < . ). resection rates were higher for men, younger patients, adenocarcinoma and distal tumours. postoperative mortality (pom) was defined as death within days of surgery and was . % on average. pom increased with age from . % for patients younger than years to . % for patients older than years. pom was significantly lower in high-volume hospitals (> operations per year), . % versus . % (p < . ). this study shows a large variation in treatment practice between the netherlands and the united kingdom. potential explanations will need to be studied in detail.

abstract: russia has experienced a tremendous decline in life expectancy after the break-up of the ussr. surprisingly, im has also been decreasing. less is known about the structure of im in different regions of russia. the official im data may be underestimated, partly due to misreporting early neonatal deaths (end) as stillbirths (sb). an end/sb ratio considerably exceeding : indicates misreporting. we present the trends and structure of im in arkhangelsk oblast (ao), north-west russia, from to as obtained from the regional statistical committee. im decreased from . to . per live births.
cause-specific death rates (per , ) decreased from to . for infectious diseases, from to for respiratory causes, from to for traumas, and from to for inborn abnormalities, but did not change for conditions of the perinatal period ( in both and ). the end/sb ratio increased from to . . in , im from infections and respiratory causes in the ao was much lower than in russia in general. the degree of misreporting end as sb in the ao is lower than in russia in general. other potential sources of underestimation of im in russia will be discussed.
abstract background: epidemiological studies that investigated malocclusion and different physical aspects in adolescents are rare in the literature. objective: we studied the impact of malocclusion on adolescents' self-image regardless of other physical aspects. design and methods: a cross-sectional study nested in a cohort study was carried out in pelotas, brazil. a random sample of year-old adolescents was selected. the world health organization ( ) criteria were used to define malocclusion. interviews about self-reported skin colour and appearance satisfaction were administered. the body mass index was calculated. gender, birth weight and socioeconomic characteristics were obtained from the birth phase of the cohort study. poisson regression models were fitted. results: the prevalence of moderate or severe malocclusion was . % [ %ci . ; . ] in the whole sample, without significant difference between boys and girls. appearance dissatisfaction was significantly higher in girls ( . %) than in boys ( . %). a positive association between malocclusion and appearance dissatisfaction was observed only in girls, after adjusting for other physical and socioeconomic characteristics. conclusions: malocclusion influenced appearance dissatisfaction only in young women.
abstract background: factors for healthy aging with good functional capacity and those which increase the risk of death and disability need to be identified. objectives: we studied the prevalence of low functional capacity and its associations in a small city in southern brazil. design and methods: a population-based cross-sectional study was carried out with a random sample of elderly people. a home-applied questionnaire covering socioeconomic, demographic, housing and socioeconomic self-perception characteristics was applied. low functional capacity was defined as difficulty in performing or more activities, or inability to carry out of those activities, according to the scale proposed by rikli and jones. descriptive statistics, associations using the chi-square test, as well as multiple logistic regression analysis were performed.
abstract introduction: assessment of the spatial distribution of trichuriasis is important to evaluate sanitation conditions. our objective was to identify risk areas for trichuris trichiura infection. methods: a cross-sectional study was carried out in census tracts of duque de caxias county, rio de janeiro, brazil. collection and analysis of fecal specimens and a standardized questionnaire were carried out in order to evaluate socio-economic and sanitation conditions in a sample of , children between and years old. geostatistical techniques were used to identify risk areas for trichuriasis. results: the mean age of the studied population was . years, of which % were females and % were males. the prevalence of trichuris trichiura in the sample was %. children whose mothers studied for years or less had a higher odds ratio (or = . )
than children whose mothers studied for more than years. children who were living in houses without water supply had an or of . compared to children living in houses with water supply. the spatial analysis identified risk areas for infection. conclusion: the results show an association between socio-economic conditions and the proliferation of trichuris trichiura infection. the identification of risk areas can guide efficient actions to combat the disease.
abstract background: refugee life and diabetes mellitus can affect health-related quality of life (hrqol). objective: to assess how both aspects influence the hrqol of diabetic refugees in the gaza strip. methods: overall, subjects filled in a self-administered questionnaire including the world health organization quality of life questionnaire (whoqol-bref) and some socio-demographic information. the sample consisted of three groups, frequency matched for gender and sex. the first group were refugees with diabetes mellitus, the second refugees without diabetes, and the third diabetes patients with no refugee history. the response rate was % on average. a global score consisting of all four domains of the whoqol-bref was dichotomized at the value of , and logistic regression was used for the analysis. results: crude odds ratios (or) for lower quality of life were . ( % ci . - . ) for diabetes refugees compared to diabetes non-refugees and . ( . - . ) compared to non-diabetes refugees. after adjusting for age, gender, education, employment, income status and the number of persons depending on the respondents, the or was . ( . - . ) and . ( . - . ), respectively. additionally, adjusting for length of diabetes and complications reduced the or to . ( . - . ) for diabetes refugees compared to diabetes non-refugees. conclusion: quality of life is highly reduced in refugees with diabetes.
abstract background: pesticides have a significant public health benefit by increasing food production and decreasing disease. on the other hand, public concern has been raised about the potential health effects of exposure to pesticides on the developing fetus and child. objectives: to review the available literature to find epidemiological studies dealing with exposure to pesticides and children's health. design and methods: epidemiological studies were identified through searches of literature databases. the following health effects were taken into account: adverse reproductive and developmental disorders, childhood cancer, neurodevelopmental effects and the role of pesticides as endocrine disrupters. results: pesticides were associated with a wide range of reproductive disorders. an association between exposure to pesticides and the risk of childhood cancer and neurodevelopmental effects was found in several studies. epidemiological studies have been limited by lack of specific pesticide exposure data, exposure assessment based on job title, and the small size of the examined groups. conclusions: in the light of existing, although still limited, evidence of adverse effects of pesticide exposure, it is necessary to reduce the exposure. the literature review suggests a great need to increase awareness among people who are occupationally or environmentally exposed to pesticides about the potential negative influence on their children.
in order to better match local health policy to the needs of citizens, the municipal health service utrecht started the project 'demand-orientated prevention policy'. one of the aims was to explore the needs of the utrecht citizens.
the local health survey from contained questions about needs for information and support with regard to disorders and lifestyle. do these questions about needs give different results compared to questions about the prevalence of health problems? in total, utrecht citizens aged to years returned the health questionnaire (response rate %). most needs were observed for subjects concerning overweight and mental problems, and needs were higher among women, moroccans, turks, people with low education and citizens of deprived areas. the prevalence of disorders and unhealthy lifestyles did not correlate well with the needs (majority of correlation coefficients < . ). most strikingly, % of the utrecht population were smokers and % excessive alcohol drinkers, while needs related to these topics were low. furthermore, higher needs among specific groups did not always correspond to higher prevalences of related health problems in these groups. these results show the importance of including questions about needs in a health survey, because they add information to questions about prevalences.
abstract background: recent studies associated statin therapy with better outcomes in patients with pneumonia. because of an increased risk of pneumonia in patients with diabetes, we aimed to assess the effects of statin use on pneumonia occurrence in diabetic patients managed in primary care. methods: we performed a case-control study nested in , patients with diabetes. cases were defined as patients with a diagnosis of pneumonia. for each case, up to controls were matched by age, gender, practice, and index date. patients were classified as current statin users when the index date was between the start and end date of statin therapy. results: statins were currently used in . % of , cases and in . % of , controls (crude or: . , % ci . - . ). after adjusting for potential confounders, statin therapy was associated with a % reduction in pneumonia risk (adjusted or: . , % ci . - . ). the association was consistent among relevant subgroups (stroke, heart failure, and pulmonary diseases) and independent of age or use of other prescription drugs. conclusions: use of statins was significantly associated with reduced pneumonia risk in diabetic patients and may, apart from their lipid-lowering properties, be useful in the prevention of respiratory infections.
abstract introduction: cigarette smoking is the most important risk factor for copd development. therefore, smoking cessation is the best preventive measure. aim: to determine the beneficial effect of smoking cessation on copd development. methods: the incidence of copd (gold stage > = ) was studied in smokers without copd who quit or continued smoking during years of follow-up. we performed logistic regression analyses on pairs of observations. correlations within a subject over time, and the time between successive surveys, were taken into account.
abstract objectives: to describe the prevalence and severity of dental caries in adolescents of the city of porto, portugal, and to assess socioeconomic and behavioral covariates of dental caries experience. methods: a sample of thirteen-year-old schoolchildren underwent dental examination. results from the dental examination were linked to anthropometric information and to data supplied by two structured questionnaires assessing nutritional factors, sociodemographic characteristics and behaviors related to health promotion.
dental caries was appraised in terms of the dmft index and two dichotomous outcomes: one assessing the prevalence of dental caries (dmft = ), the other assessing the prevalence of a high level of dental caries (dmft = ). results: consuming cola-derived soft drinks two or more times per week, attending a public school, being a girl and having parents with low educational attainment were identified as risk factors both for having dental caries and for having a high level of dental caries. conclusion: the improvement of oral health status in the portuguese context demands the implementation of policies to reduce the frequency of sugar intake, and could benefit from an overall and longstanding expansion of education in society.
abstract background: migrant mortality does not conform to a single pattern of convergence towards rates in the host population. to better understand how migrant mortality will develop, there is a need to further investigate how the underlying behavioural determinants change following migration. objective: we studied whether behavioural risk factors among two generations of migrants converge towards the behaviour in the host population. design and methods: cross-sectional interview data were used, including moroccan and turkish migrants aged - . questions were asked about smoking, alcohol consumption, physical inactivity and weight/height. age-adjusted prevalence rates among first and second generation migrants were compared with prevalence rates in the host population. results: converging trends were found for smoking, physical inactivity and overweight. for example, we found a higher prevalence of physical inactivity in first generation turkish women as compared to ethnic dutch women (or = . ( . - . )), whereas among the second generation no differences were found (or = . ( . - . )). however, this trend was not found in all subgroups. additionally, alcohol consumption remained low in all subgroups and did not converge. conclusion and discussion: behavioural risk factors in two generations of migrants seem to converge towards the prevalence rates in the host population, although some groups and risk factors showed a deviant pattern.
abstract background/relevance: arm-neck-shoulder complaints are common in general practice. for referral in these complaints, guidelines exist only for shoulder complaints and epicondylitis; besides these, other factors can be important. objective: what factors are associated with referral to physiotherapy or specialist care in non-traumatic arm-neck-shoulder complaints in general practice, during the first consultation? design/methods: general practitioners (gps) recruited consulters with new arm, neck or shoulder complaints. data on complaint, patient and gp characteristics and management were collected. the diagnosis was categorised into: shoulder-specific, epicondylitis, other specific or non-specific. multilevel analyses (with adjustment for treating gp) were executed in proc genmod to assess associated variables (p< . ). results: during the first consultation, % were referred for physiotherapy and % for specialist care. indicators of referral to physiotherapy were: long duration of complaint, recurrent complaint and a gp located in a little or non-urbanised area, while having a shoulder-specific or other specific diagnosis was negatively associated. indicators of referral to specialist care were: having another specific diagnosis, long duration of complaint, musculoskeletal co-morbidity, functional limitations and consulting a less experienced gp.
conclusion/discussion: most referrals were to physiotherapy and only a minority to specialist care. mainly diagnosis and other complaint variables indicate 'who goes where'; besides these, gp characteristics can play a role.
abstract background: the ruhr area has for years been a synonym for a megapolis of heavy industry with a high population density. presently, % of the population of the state of north rhine-westphalia live there, i.e. more than five million people. objectives: for the first time, social and health indicators of nrw's health indicator set were brought together for this megapolis area. design and methods: new standard tables were constructed for the central area of 'ruhr-city', including seven cities with more than inhabitants/km², and the peripheral zone with eight districts and cities. for the pilot phase, four socio-demographic and four health indicators were recalculated. comparability of the figures was achieved by age standardization. the results obtained were submitted to a significance test by identifying % confidence intervals. results: the centre of 'ruhr-city' is characterised by elderly, unemployed, foreign, low-income citizens living closely together. infant mortality lies above nrw's average, male life expectancy is . years lower and female life expectancy . years lower than life expectancy in nrw (without 'ruhr-city'). several avoidable death rates in the ruhr area are significantly higher than the average in nrw. specific intervention strategies are required to improve the health status in 'ruhr-city'.
abstract background: general practitioners (gps) have a fundamental role to play in tobacco control, since they reach a high percentage of the target population. objectives: to evaluate specific strategies to enhance the promotion of smoking cessation in general practice. design and methods: in a cluster-randomized trial, medical practices were randomized following a · factorial design. patients aged - years who smoked at least cigarettes per day (irrespective of their intention to stop smoking) were recruited. the intervention included (ti) the provision of a two-hour physician group training in smoking cessation methods plus direct physician payments for every participant not smoking months after recruitment; and (tm) provision of the same training plus direct participant reimbursements for pharmacy costs associated with nicotine replacement therapy or bupropion treatment. results: in the mixed logistic regression model, no effect was identified for intervention ti (odds ratio (or) = . , % confidence interval (ci) . - . ), but intervention tm strongly increased the odds of cessation (or = . , % ci . - . ). conclusion and discussion: the cost-free provision of effective medication, along with improved training opportunities for gps, may be an effective measure to enhance smoking cessation promotion in general practice.
in europe, little research on the international comparison of health surveys has been accomplished, despite a growing interest in this field. smoking prevalence was chosen to explore data comparability. we aim to illustrate methodological problems encountered when comparing data from health surveys and to investigate international variations in smoking behaviour. we examined a sample of . individuals aged and over, from six european health surveys performed in - . problems met during the comparisons are described. we took the example of current smoking as an indicator allowing a valid comparison of the prevalences.
the differences in age and sex distribution between countries were adjusted through direct standardisation. additionally, multivariate analysis will assess variations in current smoking between countries, controlling for sex, age, and educational level. methodological problems concern the comparability of socioeconomic variables. the percentage of current smokers varies from % to %. smoking patterns observed by age group, sex and educational level are similar, although rates per country differ. further results will determine whether the variations in smoking related to socioeconomic status are similar. this international comparison of health surveys highlights methodological problems encountered when comparing data from several countries. furthermore, variations in smoking may call for adaptations in public health programs.
from research it appears that adolescent alcohol use in the achterhoek is much higher than in the rest of the netherlands and rapidly increasing. excessive alcohol use has consequences for health and society. parents play an important role in preventing excessive adolescent alcohol use, but are not aware of the problem and its consequences. for this reason, the municipalities in the achterhoek launched an alcohol moderation programme, starting with a regional media campaign to increase problem awareness among parents. the objective of this study is to assess the impact of this media campaign in the achterhoek. three successive independent cross-sectional telephone surveys, interviewing approximately respondents each, will be conducted before, during and after the campaign. respondents will be questioned on knowledge and awareness of excessive adolescent alcohol use, its consequences and the role child raising can play. the reach and appreciation of the different activities of the campaign will also be investigated. results of the surveys before and during the implementation will be known by may . with these first findings, the unawareness of the problem among parents and, in part, the reach and appreciation of the campaign can be assessed.
abstract background: obesity is a growing problem, increasingly so in children and adolescents. overweight is partly 'programmed' during pregnancy, but few comprehensive studies have looked prospectively into the changes of body composition and metabolic factors from birth. objectives: the aim of the population-based birth-cohort study within gecko is to study the etiology and prognosis of overweight and the metabolic syndrome during childhood. design and methods: gecko drenthe will be a population-based observational birth-cohort study, which includes all children born from april to april in drenthe, one of the northern provinces of the netherlands. during the first year of life, the study includes repeated questionnaires, extensive anthropometric measurements and blood measurements at birth (cord blood) and at the age of eleven months. results: the number of babies born in the drenthe province is about . per year. the results from a feasibility study conducted in february will be presented. conclusion: gecko drenthe is a unique project that will contribute to the understanding of the development of obesity in childhood and its tracking into adulthood. this will enable early identification of children at risk and opens the way for timely and tailored preventive interventions.
abstract background: tunisia is facing an epidemiologic transition with the extension of chronic diseases that share common risk factors.
obesity is a leading risk factor and happens to occur frequently in early life. objective: to study the prevalence and the risk factors of obesity and overweight among urban schoolchildren in sousse, tunisia. methods: a cross-sectional study of a tunisian sample of schoolchildren aged between and years living in the urban area of sousse, tunisia. a representative sample of schoolchildren was selected by a multistage cluster sampling procedure. measurements: weight and height, blood pressure measured by an electronic device, and fasting blood lipids. questionnaire assessment was used for family history of cardiovascular disease, smoking habits, physical activity and diet.
abstract background: quality of life (qol) measurements are acknowledged as very important in the evaluation of health care. objectives: we studied the validity and the reliability of the hungarian version of the whoqol-bref among people living in small settlements. method: a questionnaire-based cross-sectional study was conducted in a representative sample (n = ) of persons aged years and over in south-east hungary, in . data were analysed with spss . . internal consistency was evaluated using cronbach's alpha; two-tailed t-tests were used for comparison of the qol scores amongst the various groups; convergent validity was assessed by spearman coefficients. results: the male:female ratio was . to . %, and the average age was . (sd: . ) years. the domain scores were . (sd: . ) for the physical, . (sd: . ) for the psychological, . (sd: . ) for the social, and . (sd: . ) for the environment domains. the cronbach's alpha values ranged from . to . across domains. the whoqol-bref seemed to be suitable to distinguish healthy and unhealthy people. the scores for all domains correlated with self-evaluated health and overall quality of life (p< . ). conclusion: our study supports that the whoqol-bref provides a valid, reasonable and useful determination of the qol of people living in hungarian villages.
abstract background: beyond being a cardiovascular disease in itself, arterial hypertension (aht) is the main cardiovascular risk factor. in spain, aht prevalence reaches %, ranking third after germany and finland in the percentage of affected persons. despite its high morbidity and mortality, aht is a prognostic factor. the objective of treatment (pharmacological and lifestyle modifications) in hypertensive patients is not only to reduce blood pressure to optimal levels but also to treat all modifiable vascular risk factors. objective: to evaluate the economic impact of direct costs due to aht (cie -mc - ) in spain in , according to autonomous region. design and methods: a descriptive, cross-sectional cost-estimation study for the period from january to december in spain, according to autonomous region. the study is based on data available from the national health ministry database and the national statistics institute of spain. results: the national health service assigned million euros to aht treatment. . % of the total cost was due to pharmaceutical service expenses, . % to primary health care and . % to hospital admissions. conclusion and discussion: the costs generated by aht are mainly due to the pharmaceutical service. the cost distribution varies according to geographical region.
abstract background: over the last decades, the national guidelines have recommended less surgical treatment for low-stage cervical cancer and chemoradiotherapy for high-stage cervical cancer.
objectives: to describe changes and variation in treatment and survival in cervical cancer in the regions of the comprehensive cancer centre stedendriehoek twente (cccst) and south (cccs) in the netherlands. design and methods: newly diagnosed cervical cancer cases were selected from both cancer registries in the period - . patient characteristics, tumour characteristics, treatment and follow-up data were collected from the medical records. results: in figo stages ia -ib the percentage of hysterectomies decreased from % in - to % in - (p<. ) and survival improved when comparing - with - . patients with figo stages iii-ivb had mostly received radiotherapy only ( %). no differences in survival between years of diagnosis were found. in the cccs region more chemoradiotherapy was given in these stages ( % versus % in the cccst region over the whole period). conclusion and discussion:.
abstract background: the reason for the increased prevalence of depression in type diabetes (dm ) is unknown. objective: we investigated whether depression is associated with metabolic dysregulation or whether depression is rather a consequence of having dm . methods: baseline data of the utrecht health project were used. subjects with cardiovascular disease were excluded. , subjects (age . ± ) were classified into four mutually exclusive categories: normal fasting plasma glucose (fpg < . mmol/l), impaired fpg (> = . and < . mmol/l), undiagnosed dm (fpg > = . mmol/l), and diagnosed dm . depression was defined as either a score of or more on the depression subscale of the symptom checklist- or use of antidepressants. results: subjects with impaired fasting glucose and undiagnosed dm had no increased prevalence of depression. diagnosed dm patients had an increased prevalence of depression (or = . ( . - . )) after adjustment for gender, age, body mass index, smoking, alcohol consumption, physical activity, education level and number of chronic diseases. conclusions: our findings suggest that depression is not related to disturbed glucose homeostasis. the increased risk of depression in diagnosed dm only suggests that depression is rather a consequence of the psychosocial burden of diabetes.
abstract background: breast-conserving surgery (bcs) followed by radiotherapy (bcs-rt) is a safe treatment option for the large majority of patients with tumours less than cm. aim: the use of bcs and bcs-rt in pt (≤ cm) and pt tumours ( - cm) was investigated in the netherlands in the period and . methods: from the netherlands cancer registry, patients were selected with invasive pt (≤ . cm) or pt ( . - . cm) tumours, without metastasis at the time of diagnosis. trends in the use of bcs and rt after bcs were determined for different age groups and regions. results: in the period - , pt tumours and , pt tumours were diagnosed. the %bcs in pt tumours increased in all age groups. it remained lowest in patients years and older ( % in ). in pt tumours a decrease was observed in patients years and older (from % to % in ). in both pt and pt tumours the %bcs-rt increased in patients years and older to % and %, respectively. between regions and hospitals, large differences were seen in %bcs and %bcs-rt. conclusion: multidisciplinary treatment planning, based on specific guidelines, and patient education could increase the use of bcs combined with rt in all age groups.
abstract this is a follow-up study on the adverse health effects associated with pesticide exposure among cut-flower farmers.
survey questionnaires and detailed physical and laboratory examinations were administered to and respondents, respectively, to determine pesticide exposure, work and safety practices, and cholinesterase levels. results showed that pesticide application was the most frequent activity associated with pesticide exposure, and entry was mostly ocular and dermal. the majority of those exposed were symptomatic. on physical examination, or . % of those examined were found to have an abnormal peak expiratory flow rate (pefr). eighty-two percent had abnormal temperature, followed by abnormal general survey findings (e.g. cardiorespiratory distress). % had cholinesterase levels below the mean value of . Δph/hour, and . % exhibited a more than % depression in the level of rbc cholinesterase. certain hematological parameters were also abnormal, namely hemoglobin, hematocrit, and eosinophil count. using pearson's r, factors strongly associated with illness due to pesticides included using a contaminated piece of fabric to wipe off sweat (p = . ) and reusing pesticide containers to store water (p = . ). the greatest adverse effect among those exposed was an abnormal cholinesterase level, which confirms earlier studies on the effect of pesticides on the body.
objectives: this paired study was performed to determine the rate of spontaneous abortions in female workers exposed to organic solvents in the wood-processing industry. methods: the level of organic solvents was assessed within the workplaces during a -year period. exposed female workers from the wood-processing industry were examined. the occupational and non-occupational data associated with their fertility were obtained by applying a standard computerized epidemiological questionnaire. the reference group consisted of female workers not exposed to hypo-fertilizing agents, residing in the same locality. the rate of spontaneous abortions was evaluated in both groups as an epidemiological fertility indicator. results: within the studied period, the organic solvent levels exceeded the maximum admissible concentrations several times in all workplaces. the long-term exposure to organic solvents caused a significant increase in the rate of spontaneous abortions compared to the reference group (p< . ). the majority of abortions ( %) occurred in the first trimester of pregnancy. conclusions: long-term exposure to organic solvents may cause low fertility in female workers because of spontaneous abortions. it is advised to reduce organic solvent levels in the air of all workplaces, as well as to remove pregnant women from work involving exposure to organic solvents.
abstract introduction: rio de janeiro city (rj) is experiencing fast population aging with changes in morbidity and mortality. cardiovascular diseases are the first cause of death in the elderly population. more than half of ischemic heart disease (ihd) cases occur in aged people (> years old). objective: to describe the spatial distribution of ihd mortality in the elderly population in rj and its associations with socio-demographic variables. methods: data were gathered from the mortality information system of the ministry of health and the demographic census of the brazilian institute for geography and statistics foundation. the geographic distributions of the standardized coefficient of mortality due to ihd and of socio-demographic variables, by district, in were analyzed in arcgis . . spatial autocorrelation of ihd was assessed by the moran and geary indices. a conditional autoregressive model was used to evaluate the association between ihd and socio-demographic variables. results: an association was found between ihd mortality and income, educational level, family type and ownership of a computer, videocassette recorder and microwave. conclusion: spatial analysis of ihd mortality and of the influence of socio-demographic factors is fundamental to support more efficient public policies for the prevention and control of this important health problem.
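the abstract above assesses spatial autocorrelation of district-level ihd mortality with the moran index. as a rough illustration of what that statistic computes, the following python sketch evaluates global moran's i on made-up rates and a hypothetical adjacency matrix; the values and the weights are placeholders, not data from the study.

```python
# illustrative sketch only: global moran's i for a district-level rate,
# using made-up values and a hypothetical adjacency matrix (not study data).
import numpy as np

def morans_i(x, w):
    """global moran's i: x = value per area, w = spatial weights matrix."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()                      # deviations from the overall mean
    num = (w * np.outer(z, z)).sum()      # sum_ij w_ij * z_i * z_j
    den = (z ** 2).sum()                  # sum_i z_i^2
    n, s0 = len(x), w.sum()               # number of areas, sum of all weights
    return (n / s0) * (num / den)

# toy example: 4 districts with rook-style adjacency (hypothetical)
rates = [120.0, 95.0, 110.0, 60.0]        # e.g. ihd deaths per 100,000
w = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(round(morans_i(rates, w), 3))       # values above 0 suggest spatial clustering
```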
abstract purpose: to evaluate the prognostic impact of isolated loco-regional recurrences on metastatic progression among women treated for invasive stage i or ii breast cancer (within phase iii trials concerning the optimal management of breast cancer). patients and methods: the study population consisted of , women primarily treated surgically for early-stage breast cancer, enrolled in eortc trials , , or , by breast conservation ( %) or mastectomy ( %), with a long follow-up time (median: . , range: . - . ). data were analysed in a multi-state model using multivariate cox regression models, including a state-dependent covariate. results: after the occurrence of a loco-regional recurrence, a positive nodal status at baseline is a significant prognostic risk factor for distant metastases. the effects of young age at diagnosis and larger tumour size become less significant after the occurrence of a loco-regional recurrence. the presence of a loco-regional recurrence is in itself a significant prognostic risk factor for subsequent distant metastases. the effect of the time to the loco-regional recurrence is not a significant prognostic factor. conclusion: the presence of local recurrence is an important risk factor for outcome in patients with early breast cancer.
abstract background: the relationship between antral follicles and ovarian reserve tests (ort) to determine ovarian response in ivf has been extensively studied. we studied the role of follicle size distribution in the response to the various orts in a large group of subfertile patients. methods: in a prospective cohort study, female patients were included if they had regular ovulatory cycles, subfertility for > months, > = ovary and > = patent ovarian tube. antral follicles were counted by ultrasound and blood was collected for fsh, including a clomiphene challenge test (ccct), inhibin b, and estradiol before and after administration of puregon® (efort test). statistical analysis was performed using spss . for windows. results: of eligible patients, participated. mean age was . years and mean duration of subfertility was . months. age, baseline fsh, ccct and efort correlated with the number of small follicles ( - mm) but not with large follicles ( - mm). regression analysis confirmed that the number of small follicles and average follicle size contributed to ovarian response after correction for age, while large follicles did not. conclusion: small antral follicles are responsible for the hormonal response in ort and may be suitable to predict ovarian response in ivf.
abstract background: dengue epidemics account annually for several million cases and deaths worldwide. the high endemic level of dengue fever and its hemorrhagic form (dhf) correlates with extensive house infestation by aedes aegypti and human infection with multiple viral serotypes. objective: to describe dengue incidence evolutionary patterns and spatial distribution in brazil.
methods: this is a review study that analyzed serial case reports registered from until . results: it was shown that defined epidemic waves followed the introduction of every serotype (den to ), and a reduction in susceptible people possibly accounted for the downward case frequency. conclusions and discussion: an incremental expansion of affected areas and an increasing occurrence of dhf with high lethality were noted in recent years. in contrast, efforts based solely on chemical vector control have been insufficient. moreover, some evidence demonstrated that educational actions do not permanently modify population habits. in this regard it was stated that, while a vaccine is not available, further dengue control would depend on results gathered from basic interdisciplinary research and intervention evaluation studies, integrating environmental changes, community participation and education, epidemiological and virological surveillance, and strategic technological innovations aimed at stopping transmission.
abstract background: patient participation in treatment decisions can have positive effects on patient satisfaction, compliance and health outcomes. objectives: the study objectives were to examine attitudes regarding participation in decision-making among psoriasis patients and to evaluate the effect of a decision aid for discussing treatment options. methods: a 'quasi-experiment' was conducted in a large dermatological hospital in italy: a questionnaire evaluating the decision-making process and knowledge of treatments was self-completed by a consecutive sample of psoriasis patients after routine clinical practice and by a second sample of patients exposed to a decision board. results: in routine clinical practice, . % of patients wanted to be involved in treatment decisions, . % wanted to leave decisions entirely to the doctor and . % preferred making decisions alone. . % and . % of the control and decision-board groups had a good knowledge level. in multivariate analysis, good knowledge of treatments increased the likelihood of preferring an active role (or = . ; %ci . - . ; p = . ). the decision board only marginally improved patient knowledge and doctor-patient communication. conclusion and discussion: in conclusion, large proportions of patients want to participate in decision-making, but insufficient knowledge can represent a barrier. further research is needed to develop effective instruments for improving patient knowledge and participation.
abstract background: the only available means of controlling infections caused by the dengue virus is the elimination of its principal urban vector (aedes aegypti). brazil has been implementing programs to fight the mosquito; however, since the s the geographic range of infestation has been expanding steadily, resulting in increased circulation of the virus. objective: to evaluate the effectiveness of the dengue control actions that have been implemented in the city of salvador. methods: in a prospective design, serologic inquiries were made in a sample population of residents of urban 'sentinel areas'. the seroprevalence and one-year seroincidence of dengue were estimated, and the relationship between the intensity of viral circulation and standards of living and vector density was analysed. results: there was a high overall seroprevalence ( . %) and seroincidence ( . %) for the circulating serotypes (denv- and denv- ). the effectiveness of control measures appears to be low, and although a preventable fraction of .
% was found, the incidence of infections in these areas was still very high ( . %). conclusions and discussion: it is necessary to revise the technical and operational strategies of the infection control program in order to attain infestation levels that are low enough to interrupt the circulation of the dengue virus.
this study investigates the difference in cancer mortality risks between migrant groups and the native dutch population, and determines the extent of convergence of cancer mortality risks according to migrants' generation, age at migration and duration of residence. data were obtained from the national population and mortality registries in the period - ( person-years, cancer deaths). we used poisson regression to compare the cancer mortality rates of migrants originating from turkey, morocco, surinam and the netherlands antilles/aruba with the rates for the native dutch. results: all-cancer mortality among all migrant groups combined was significantly lower compared to the native dutch population (rr = . , ci: . - . ). mortality rates for all cancers combined were higher among 2nd generation migrants, among those with a younger age at migration, and among those with a longer duration of residence. this effect was particularly pronounced in lung cancer and colorectal cancer. for most cancers, mortality among 2nd generation migrants remained lower compared to the native dutch population. surinamese migrants showed the most consistent pattern of convergence of cancer mortality. conclusions: the generally low risk of cancer mortality for migrants showed some degree of convergence, but the cancer mortality rates did not yet reach the levels of the native dutch population.
abstract background: legionnaires' disease (ld) is a notifiable disease in the netherlands. ld cases are reported to the authorities for national surveillance. in addition, a national ld outbreak detection program (odp) is in place in the netherlands. these two registration systems have their own information exchange processes and databases. objectives: surveillance systems are known to suffer from incompleteness of reported data. the co-existence of two databases creates the opportunity to investigate accuracy and reliability in a national surveillance system. design and methods: a comparison was made between the outcome 'diagnosis by culture' in both databases and the physical presence of legionella strains in laboratories for patients. accuracy is described using the parameters sensitivity and correctness. for reliability, cohen's kappa coefficient (κ) was applied. results: accuracy and reliability were significantly higher in the odp database, but not optimal in either database when compared to data in the laboratories. the odp database was moderately reliable (κ = . ; %ci . - . ), the surveillance database only slightly (κ = . ; %ci . - . ). conclusion: our findings suggest that diagnostic notification data concerning ld patients are most accurate and reliable when derived directly from diagnostic laboratories. discussion: involvement of data-entry persons in outbreak detection results in higher reliability. unreliable data may have considerable consequences during outbreaks of ld.
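the reliability comparison above relies on cohen's kappa, which corrects raw agreement between the two databases for agreement expected by chance. the sketch below shows the calculation on a made-up 2x2 agreement table; the counts are placeholders, not the study's data.

```python
# illustrative sketch only: cohen's kappa for agreement between two
# registration systems on a yes/no outcome ('diagnosis by culture').
# the 2x2 table below is made up, not data from the abstract.
import numpy as np

def cohens_kappa(table):
    """table[i][j] = count with rating i in system a and rating j in system b."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    po = np.trace(t) / n                                  # observed agreement
    pe = (t.sum(axis=1) * t.sum(axis=0)).sum() / n ** 2   # chance agreement
    return (po - pe) / (1 - pe)

# hypothetical counts: rows = odp database, columns = surveillance database
table = [[40, 10],
         [ 5, 45]]
print(round(cohens_kappa(table), 2))  # ~0.7 would indicate moderate-to-good reliability
```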
the aim of the study was to investigate medical students' plans to emigrate, to quantify the scale of migration in the near future and to build a profile of possible emigrants. data were collected based on an anonymous questionnaire delivered to a random group of medical students (katowice). we used binary logistic regression and multivariate analysis to identify differences between the groups preferring to go abroad or to stay in poland. % of respondents confirmed that they were considering emigration; . % of them declared they are very likely to move and a further . % are certain. % of those considering emigration confirmed having taken practical steps towards moving. binary logistic regression showed no difference between people who were certain or almost certain to go and those who were not considering going for most baseline characteristics: hometown size, socio-economic background and having a family tradition of the medical profession (p = . ). only the mean of marks differentiated between the two groups: . for those who will definitely stay vs . for students who will definitely move (p = . ). the multivariate analysis gave similar results. conclusions: most of the students consider emigration, but declarations of a will to depart are more frequent among those with worse marks.
abstract background: the incidence of falls in home-resident elderly people varies from % to %. falls induce loss of self-sufficiency and increase mortality and morbidity. objectives: to evaluate falls incidence and risk factors in a group of general practice elderly patients. design: prospective cohort study with a -year follow-up. methods: eight hundred elderly people (> years) were visited by practitioners for a baseline assessment. information on current pathologies and previous falls in the last six months was collected. functional status was evaluated using: the short portable mental state questionnaire, the geriatric depression scale, activities of daily living (adl), instrumental activities of daily living, and the total mobility tinetti score. falls were monitored through phone interviews at and months. data were analyzed through logistic regression. results: twenty-eight percent of the elderly fell in the whole period. sixty percent of falls were not reported to the practitioner. independent predictors of falls were adl score (adl < : or = . ; % ci . - . ) and previous falls (or = . ; % ci . - . ). the tinetti score was significantly associated with falls only in univariate analysis. conclusions: practitioners can play a key role in identifying at-risk subjects and managing prevention interventions. falls monitoring and a continuous practice of comprehensive geriatric assessment should be encouraged.
abstract background: oral health represents an important indicator of health status. socio-economic barriers to oral care among the elderly are considerable. in the lazio region, a public health program for oral rehabilitation was implemented to offer dentures to elderly people with social security. objectives: to compare hospitalisation between elderly people enrolled in the program and a control group. design and methods: for each elderly person enrolled in the program and living in rome, three controls, matched for sex and age, were selected from the rome municipality register. hospital admissions in the two-year period before enrollment were traced by record linkage with the hospital discharge register. results: in total, , admissions occurred. the annual admission rate was per elderly among controls and in the program group (incidence rate ratio: irr = . ; % ci . - . ). when comparing diagnosis-specific rates, significant excesses were observed in the program group for respiratory diseases (
abstract background: herpes simplex virus (hsv) types and are important viral sexually transmitted infections (sti) and can cause significant morbidity.
in the netherlands, data on prevalences in the general population are limited. objective: description of the seroprevalences of hsv- and hsv- and associated factors in the netherlands. design and methods: a population-based serum bank survey in the netherlands with an age-stratified sample was used ( - ). antibodies against hsv- and hsv- were determined using elisa. a questionnaire was used to obtain information on demographics and risk factors. logistic regression adjusting for age and full multiple regression were performed to establish risk factors. results: questionnaires and sera were available for persons. both hsv- and hsv- seroprevalence increased with age. the seroprevalence of hsv- was . % and was, amongst others, associated with female sex and being divorced. the seroprevalence of hsv- was . % and was, amongst others, associated with being divorced and a history of sti. conclusion: seroprevalence is higher in certain groups such as teenagers, women, divorced people and those with a history of sti. prevention should be focused on those groups. more research is needed on prevention methods that can be used in the netherlands, such as screening or vaccination.
abstract background: frequently, statistically significant prognostic factors are reported with suggestions that patient management should be modified. however, the clinical relevance of such factors is rarely quantified. objectives: we evaluated the accuracy of predicting the need for invasive treatment among bph patients managed conservatively with alpha-blockers. methods: information on eight prognostic factors was collected from patients treated with alpha-blockers. using proportional hazards model (phm) regression coefficients, a risk score for retreatment was calculated for each patient. the analyses were repeated on groups of patients sampled from the original case series. these bootstrap results were compared to the original results. results: three significant predictors of retreatment were identified. the % of patients with the highest risk score had an -month risk of retreatment of only %. analyses of less than half of all the bootstrap samples resulted in the same three significant prognostic factors. the % of patients with the highest risk score in each of the samples experienced a highly variable risk of retreatment of % to %. conclusions: four of the five high-risk patients would be overtreated with a modified policy. internal validation procedures may warn against the invalid translation of statistical significance into clinical relevance.
background: e-cadherin expression is frequently lost in human epithelium-derived cancers, including bladder cancer. for two genetic polymorphisms in the region of the e-cadherin gene (cdh ) promoter, reduced transcription has been reported: a c/a single nucleotide polymorphism (snp) and a g/ga snp at bp and bp, respectively, upstream of the transcriptional start site. objective: we studied the association between both polymorphisms and the risk of bladder cancer. methods: patients with bladder cancer and population controls were genotyped for the c/a and the g/ga promoter polymorphisms using pcr-rflp. results: a significantly increased risk for bladder cancer was found for a allele carriers compared to homozygous c allele carriers (or . ; % ci: . - . ). the risk for heterozygous and homozygous a allele carriers was increased approximately . - and -fold, respectively. the association was stronger for more aggressive tumors. we did not find any association between the g/ga snp and bladder cancer. conclusion: this study indicates that the c/a snp in the e-cadherin gene promoter is a low-penetrance susceptibility factor for bladder cancer.
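the genotype association above is summarised as an odds ratio with a % confidence interval. as an illustration of that calculation, the sketch below computes a crude (unadjusted) odds ratio with a woolf-type confidence interval from a hypothetical 2x2 table of risk-allele carriers versus non-carriers; the counts are invented and do not reproduce the study's analysis.

```python
# illustrative sketch only: crude odds ratio and 95% ci for carriers of a
# risk allele versus non-carriers, using made-up counts (not study data).
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """a,b = exposed cases/controls; c,d = unexposed cases/controls (woolf ci)."""
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se_log_or)
    hi = math.exp(math.log(or_) + z * se_log_or)
    return or_, lo, hi

# hypothetical counts: a-allele carriers vs cc homozygotes among cases/controls
or_, lo, hi = odds_ratio_ci(a=120, b=90, c=80, d=110)
print(f"or = {or_:.2f} (95% ci {lo:.2f}-{hi:.2f})")
```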
background: health problems, whether somatic, psychiatric or accident-related, cluster within persons. the study of allostatic load as a unifying theme (salut) aims to identify risk factors that are shared by different pathologies and that could explain this clustering. studying patients with repetitive injuries might be helpful to identify risk factors that are shared by accident-related and other health problems. objectives: to study injury characteristics in repetitive injury (ri) patients as compared to single injury (si) patients. methods: the presented study included ri patients and si patients. medical records provided information about injury characteristics, and patients were asked about possible causes and context. results: ri patients suffered significantly more from contusions than si patients ( % vs %). regarding the context, si patients were significantly more often injured in traffic ( % vs %). in both groups most injuries were attributed to 'mere bad luck' (ri %, si %), closely followed by 'clumsiness or inattention' (ri %, si %). ri patients pointed out aggression or substance misuse significantly more often than si patients ( % vs %). conclusion: ri patients seem to have more 'at risk' behavior (i.e. aggression, impulsivity), which will increase their risk for psychiatric health problems.
abstract background: breastfeeding may have a protective effect on infant eczema. bias as a result of methodological problems may explain the controversial scientific evidence. objective: we studied the association between duration of breastfeeding and eczema while taking into account the possible influence of reverse causation. design and methods: information on breastfeeding, determinants and outcomes at age one year was collected by repeated questionnaires in mother-infant pairs participating in the koala study ( cases of eczema). to avoid reverse causation, a period-specific analysis was performed in which only 'at risk' infants were considered. results: no statistically significant association between the duration of breastfeeding (> weeks versus formula feeding) and the risk of eczema in the first year was found (or . , %ci . - . ). after excluding from the analysis all breastfed infants with symptoms of eczema reported in the same period as breastfeeding, no statistically significant association was found either for the duration of breastfeeding and eczema between and months (or . , %ci . - . ). conclusion and discussion: in conclusion, no evidence was found for a protective effect of breastfeeding duration on eczema. this conclusion was strengthened by the period-specific analysis, which made the influence of reverse causation unlikely.
abstract background: the internet can be used to meet health information needs, provide social support, and deliver health services. the anonymity of the internet offers benefits for people with mental health problems, who often feel stigmatized when seeking help from traditional sources. objectives: to identify the prevalence of internet use for physical and mental health information among the uk population, to investigate the relationship of internet use with current psychological status, and to identify the relative importance of the internet as a source of mental health information. design and methods: a self-completion questionnaire survey of a random sample of the uk population (n = ).
questions included demographic characteristics, health status (general health questionnaire), and use of the internet and other information sources. results: % of internet users had sought health information online, and % had sought mental health information. use was higher among those with current psychological problems. only % of respondents identified the internet as one of the most accurate sources of mental health information, compared with % who identified it as one of the sources they would use. conclusions: health service providers must recognise the increasing use of the internet in healthcare, even though it is not always regarded as being accurate.
abstract old age is a significant risk factor for falls. approximately % of people older than fall at least once a year, mostly in their own homes. resulting hip fractures cause at least partial immobility in - % of the affected persons. almost % are sent to nursing homes afterwards. in mecklenburg-west pomerania, ageing of the population proceeds particularly fast. to prevent falls and the loss of independent living, a falls prevention module was integrated into a community-based study conducted in cooperation with a general practitioner (gp). in the patients' homes, a trained nurse performed a test to estimate each patient's falls risk and gave a consultation on how to reduce risk, e.g. eyesight checks and gymnastic exercises. in the feasibility study, ( %) out of patients (average age years) agreed to a visit of each room of their homes in search of tripping hazards. the evaluation was assisted by standardized, computer-based documentation. the prevention module received considerable acceptance despite the extensive home visiting. within one month the patients started to put the advice into practice. during the nurse's first follow-up visits, three patients reported, for example, having started gymnastics and/or wearing stable shoes.
abstract background: the emergence of drug-resistant m. tuberculosis (mtb) is an increasing problem in both developed and developing countries. objectives: investigation of isoniazid (inh) and rifampin (rif) susceptibility patterns among mtb isolates from patients. design and methods: in total, sputum samples were collected. smears were prepared for acid-fast staining and all the isolates were identified as m. tuberculosis by preliminary cultural and biochemical tests. the isolates were examined for inh and rifampin resistance using the conventional mic method and a pcr technique with specific inh (katg) and rifampin (rpob) resistance primers. results: seven isolates were resistant to both inh and rifampin by the mic method. with the pcr technique, and of the above-mentioned strains showed resistance to inh and rifampin, respectively. conclusion: the prevalence of drug resistance is . % in the region of study, which is significant. discussion: the conventional mic method, despite being time-consuming, is more sensitive for the evaluation of drug resistance; however, pcr, as a rapid and sensitive technique, is recommended in addition to the conventional method to obtain quicker results for starting treatment and disease control management.
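the comparison above treats the conventional mic result as the reference when judging how well pcr detects resistance. as a generic illustration of that kind of evaluation, the sketch below computes sensitivity and specificity of a rapid test against a reference method from a hypothetical 2x2 table; the counts are invented and are not the isolate numbers reported above.

```python
# illustrative sketch only: sensitivity and specificity of a rapid test
# (e.g. pcr) against a reference method (e.g. conventional mic),
# computed from made-up counts rather than the isolates described above.
def diagnostic_accuracy(tp, fn, fp, tn):
    """tp/fn: reference-positive isolates detected/missed by the rapid test;
    fp/tn: reference-negative isolates wrongly flagged/correctly cleared."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

sens, spec = diagnostic_accuracy(tp=40, fn=10, fp=5, tn=145)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")  # 0.80, 0.97
```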
abstract background and objectives: we studied in the literature which design characteristics of food frequency questionnaires (ffqs) influence their validity to assess both absolute and relative levels of energy intake in adults with western food habits, and to rank individuals according to these intakes. this information is required for harmonizing ffqs for multi-centre studies. design and methods: we performed a review of studies investigating the validity or reproducibility of ffqs, published since . the included studies validated ffqs against doubly labeled water (for energy expenditure) as a gold standard, or against food records or -hour recalls for assessing relative validity (for energy intake). the design characteristics we studied were the number of food items, the reference period, the administration mode, and the inclusion of portion-size questions. results and conclusion: for this review we included articles representing the validation of questionnaires. three questionnaires were validated against dlw, ten against urinary n and against -hour recalls or food records. in conclusion, a positive linear relationship (r = . , p< . ) was observed between the number of items on the ffq and the reported mean energy intake. details about the influence of other design characteristics on validity will be discussed at the conference.
abstract background: high ethanol intake may increase the risk of lung cancer. objectives: to examine the association of ethanol intake with lung cancer in epic. design and methods: information on baseline and past alcohol consumption, lifetime tobacco smoking, diet, and anthropometrics of , participants was collected between and . cox proportional hazards regression was used to examine the association of ethanol intake at recruitment ( cases) and mean lifelong ethanol intake ( cases) with lung cancer. results: non-consumers at recruitment had a higher lung cancer risk than low consumers ( . - . g/day) [hr = . , % ci . - . ]. lung cancer risk was lower for moderate ethanol intake at recruitment ( . - . g/day) compared with low intake (hr = . , % ci . - . ); no association was seen for higher intake. compared with lifelong low consumers, lifelong non-consumers did not have a higher lung cancer risk (hr = . , % ci . - . ) but lifelong moderate consumers had a lower risk (hr = . , % ci: . - . ). lung cancer risk tended to increase with increasing lifelong ethanol intake (= vs . - . g/day, hr = . , % ci: . - . ). conclusion: while lung cancer risk was lower for moderate compared with low ethanol intake in this study, high lifelong ethanol intake might increase the risk.
abstract background: one of the hypotheses to explain the increasing prevalence of atopic diseases (eczema, allergy and asthma) is an imbalance between the dietary intakes of omega- and omega- fatty acids. objectives: we evaluated the role of perinatal fatty acid (fa) supply from mother to child in the early development of atopy. design and methods: the fa composition of breast milk was used as a marker of maternal fa intake and of placental and lactational fa supply. breast milk was sampled months postpartum from mother-infant pairs in the koala birth cohort study, the netherlands. the infants were followed for atopic symptoms (repeated questionnaires on eczema and wheezing) and sensitisation at age (specific serum ige against major allergens). multivariate logistic regression analysis was used to adjust for confounding factors. results: high levels of omega- long-chain polyunsaturated fas were associated with a lower incidence of eczema in the first year (odds ratio for the highest vs lowest quintile . , % confidence interval . - . ; trend over quintiles p = . ). wheeze and sensitisation were not associated with breast milk fa composition. conclusion and discussion: the results support the omega- / hypothesis. we suggest that the anti-inflammatory activity of omega- eicosanoid mediators is involved, but not allergic sensitisation.
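the epic analysis a few paragraphs above reports hazard ratios from cox proportional hazards regression. as a generic, hedged illustration of how such a model is fitted in practice, the sketch below runs a cox regression with the open-source lifelines package on its bundled tutorial dataset (the rossi recidivism data); it is not the epic data or the authors' code, and exp(coef) plays the role of the hazard ratios quoted in the abstract.

```python
# illustrative sketch only: fitting a cox proportional hazards model with the
# lifelines package on its bundled tutorial dataset (not the epic cohort).
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

rossi = load_rossi()                        # columns: week (time), arrest (event), covariates
cph = CoxPHFitter()
cph.fit(rossi, duration_col="week", event_col="arrest")
print(cph.summary[["coef", "exp(coef)"]])   # exp(coef) is the hazard ratio per covariate
```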
we suggest that anti-inflammatory activity of omega- eicosanoid mediators is involved, rather than allergic sensitisation. abstract background: acute myocardial infarction (ami) is among the main causes of death in italy and is characterized by high fatality associated with a fast course of the disease. consequently, timeliness and appropriateness of the first treatment are paramount for a positive recovery. objectives: to investigate the differences among italian regions in ami first treatment and in-hospital deaths. design and methods: following the theoretical care pathway (from onset of ami to hospitalization and recovery or death), regional in-hospital deaths are decomposed into the contributions of attack rate, hospitalization and in-hospital fatality. hospital discharges, death and population data are provided by official statistics. results: generally, in northern and central regions there is an excess of observed in-hospital deaths, while the opposite occurs in southern regions. conclusion: in northern and central regions the decomposition method suggests a more frequent and severe illness, generally accompanied by a higher availability of hospitals; exceptions are lombardia and lazio, where some inefficiencies in the hospital system are highlighted. in most southern regions the decomposition confirms a less frequent and less severe illness; exceptions are campania and sicilia, where only the less severe patients reach the hospital and then recover, while the others die before reaching the hospital. abstract background: atherosclerotic lesions have typical histological and histochemical compositions at different stages of their natural history. the more advanced atherosclerotic lesions contain calcification. objective: we examined the prevalence of and associations between calcification in the coronary arteries, aortic arch and carotid arteries assessed by multislice computed tomography (msct). methods: this study was part of the rotterdam study, a population-based study of subjects aged years and over. calcification was measured and quantified in subjects. correlations were computed using spearman's correlation coefficient. results: the prevalence of calcification increased with age throughout the vascular bed. in subjects aged and over, up to % of men had calcification in the coronary arteries and up to % of women had calcification in the aortic arch. in men, the strongest correlation was found between calcification in the aortic arch and the carotid arteries (r = . , p < . ). in women, the correlations were somewhat lower; the strongest was found between calcification in the coronary arteries and the carotid arteries (r = . , p < . ). conclusion and discussion: in conclusion, the prevalence of calcification was generally high and increased with increasing age. the study confirms the presence of strong correlations between atherosclerosis in different vessel beds. abstract background: health status deteriorates with age and can be affected by the transition from active work to retirement. objective: to assess the effect of retirement on age-related deterioration of health. methods: secondary analysis of the german health survey (bundesgesundheitssurvey ) and california health interview survey (chis ). subjective health was assessed by a single question regarding respondent's health status from = excellent to = poor. locally weighted regression was used for exploratory analysis and b-splines for the effect estimation in regression models.
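to make the two modelling steps just named concrete, the following is a minimal python sketch, not the authors' code: locally weighted regression (lowess) for exploratory smoothing of subjective health against age, followed by a b-spline term for age in an ordinary regression model. the variable names and the synthetic data are invented for illustration.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, 2000)
# hypothetical piecewise-linear decline in subjective health (1 = excellent ... 5 = poor)
health = 2.0 + 0.01 * age + 0.03 * np.clip(age - 60, 0, None) + rng.normal(0, 0.5, age.size)
df = pd.DataFrame({"age": age, "health": health})

# 1) exploratory analysis: locally weighted regression (lowess)
smoothed = sm.nonparametric.lowess(df["health"], df["age"], frac=0.3)

# 2) effect estimation: b-spline terms for age in an ordinary regression model
model = smf.ols("health ~ bs(age, df=4)", data=df).fit()
print(model.summary())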
results: subjective health decreased in a clearly non-linear manner with age. in both cases the decrease could be reasonably approximated by two linear segments; however, the pattern differed between the german and california samples. in germany, the change point of the slope describing deterioration of health was located at . abstract objective: to assess the effectiveness of physiotherapy compared to general practitioners' care alone in patients with acute sciatica. design, setting and patients: a randomised clinical trial in primary care with a -month follow-up period. patients with acute sciatica (recruited - ) were randomised into two groups: ) the intervention group received physiotherapy (active exercises), and ) the control group received general practitioners' care only. main outcome measures: the primary outcome was patients' global perceived effect. secondary outcomes were severity of leg and back pain, severity of disability, general health and absence from work. the outcomes were measured at , , and weeks after randomisation. results: at months follow-up, % of the intervention group and % of the control group reported improvement (rr . ; % ci . to . ). at months follow-up, % of the intervention group and % of the control group reported improvement (rr . ; % ci . to . ). no significant differences in secondary outcomes were found at short-term or long-term follow-up. conclusion: at months follow-up, evidence was found that physiotherapy added to general practitioners' care is more effective in the treatment of patients with acute sciatica than general practitioners' care alone. abstract background: little is known about the epidemiology of skin melanoma in the baltic states. objectives: the aim of this study was to provide insights into the epidemiology of skin melanoma in lithuania by analyzing population-based incidence and mortality time trends and relative survival. methods: we calculated age-standardized incidence and mortality rates (cases per , ) using the european standard population and calculated period estimates of relative survival. for the period - , % of all registered cases were checked by reviews of the medical charts. results: about % of the cases of the period - were reported to the cancer registry, indicating a high quality of cancer registration of skin melanoma in lithuania. the incidence rates increased from (men: . , women: . ) to (men: . , women: . ). mortality rates increased from (men: . , women: . ) to (men: . , women: . ). -year relative survival rates among men were % lower than among women. the overall difference in survival is mainly due to a more favorable survival among women aged - years. conclusions: overall prognosis is less favorable among men, most likely due to diagnoses at later stages.
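the age-standardised rates in the melanoma abstract above rely on direct standardisation against the european standard population. the short python sketch below illustrates only the arithmetic of that method; the age bands, case counts and standard weights are invented, not taken from the study.

import numpy as np

# hypothetical age-specific data for one sex and one calendar year
cases      = np.array([2, 5, 14, 30, 41])          # incident cases per age band
person_yrs = np.array([4e5, 3.5e5, 3e5, 2e5, 1e5])

# hypothetical standard-population weights for the same age bands
std_pop = np.array([30000, 25000, 20000, 15000, 10000])

age_specific = cases / person_yrs                      # rate in each age band
asr = np.sum(age_specific * std_pop) / std_pop.sum()   # direct standardisation
print(f"age-standardised rate: {asr * 1e5:.1f} per 100,000")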
abstract background: multifactorial diseases share many risk factors, genetic as well as environmental. to investigate the unresolved issues on the etiology of and individual susceptibility to multifactorial diseases, the research focus must move from single determinant-outcome relations to modification of universal risk factors. objectives: the aim of the lifelines project is to study universal risk factors and their modifiers for multifactorial diseases. modifiers can be categorized into factors that determine the effect of the studied risk factor (e.g. gene expression), those that determine the expression of the studied outcome (e.g. previous disease), and generic factors that determine the baseline risk for multifactorial diseases (e.g. age). design and methods: lifelines is carried out in a representative sample of . participants from the northern provinces of the netherlands. apart from questionnaires and clinical measurements, a biobank is constructed (blood, urine, dna). lifelines will employ a three-generation family design (proband design with relatives), which has statistical advantages, enables unique possibilities to study social characteristics, and offers practical benefits. conclusion: lifelines will contribute to the understanding of how disease-overriding risk factors are modified to influence the individual susceptibility to multifactorial diseases, not only at one stage of life but cumulatively over time: the lifeline. abstract background: obesity-related mortality is a major public health problem, but few studies have been conducted on severely obese individuals. objectives: we assessed long-term mortality in treatment-seeking, severely obese persons. design and methods: we enrolled persons in six centres for obesity treatment in four italian regions, with body mass index (bmi) at first visit ≥ kg/m and age ≥ . after exclusion of duplicate registrations and persons with missing personal or clinical data, persons were followed up; as ( . %) could not be traced, persons ( men, women) were retained for analysis. results: there were ( men, women) deaths; the standardized mortality ratios (smrs) and % confidence intervals were ( - ) among men and ( - ) among women. mortality increased with increasing bmi, but the trend was not monotonic in men. lower smrs were observed among persons recruited more recently. excess mortality was inversely related to age attained at follow-up. conclusions and discussion: the harmful long-term potential of severe obesity that we documented confirms observations from studies carried out in different nutritional contexts. the decrease in mortality among the most recently recruited persons may reflect better treatment of obesity and of its complications.
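the standardized mortality ratios with % confidence intervals reported in the severe-obesity cohort above amount to observed over expected deaths with an exact poisson interval. a minimal python sketch with made-up counts, not the study data:

from scipy.stats import chi2

observed = 180          # hypothetical observed deaths in the cohort
expected = 120.0        # hypothetical expected deaths from reference rates
alpha = 0.05

smr = observed / expected
# exact poisson limits for the observed count, each divided by the expected count
lower = chi2.ppf(alpha / 2, 2 * observed) / 2 / expected
upper = chi2.ppf(1 - alpha / 2, 2 * (observed + 1)) / 2 / expected
print(f"smr = {smr:.2f} (95% ci {lower:.2f}-{upper:.2f})")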
abstract background: in finland, every cancer patient should have equal access to high-quality care provided by the public sector. therefore, no regional differences in survival should be observed. objectives: the aim of the study was to find any regional differences in survival, and to examine whether possible differences could be explained, e.g., by differences in the distributions of prognostic factors. design and methods: the study material consisted of , patients diagnosed in to with cancer at one of the major primary sites. the common closing date was dec. . finland was divided into five university hospital regions. stage, age at diagnosis and sex were used as prognostic factors. the relative survival rates for the calendar period window - were tabulated using the period method and modelled. results: survival differences between the regions were not significant for most primary sites. for some sites, the differences disappeared in the modelling phase after adjusting for the prognostic factors. for a few of the primary sites (e.g., carcinoma of the ovary), regional differences remained after modelling. conclusion: we were able to highlight certain regional survival differences. ways to improve the equity of cancer care will be considered in collaboration with the oncological community. abstract background: the prevalence of cardiovascular disease (cvd) is extremely high in dialysis patients. disordered mineral metabolism, including hyperphosphatemia and hypercalcaemia, contributes to the development of cvd in these patients. objectives: to assess associations between plasma calcium, phosphorus and calcium-phosphorus product levels and risk of cvd-related hospitalization in incident dialysis patients. design and methods: in necosad, a prospective multi-centre cohort study in the netherlands, we included consecutive patients new on haemodialysis or peritoneal dialysis between and . risks were estimated using adjusted time-dependent cox regression modelling. results: mean age was ± years, % were male, and % were treated with haemodialysis. cvd was the cause of hospitalization in haemodialysis patients ( % of hospitalizations) and in peritoneal dialysis patients ( %). the most common cardiovascular morbidities were peripheral vascular disease and coronary artery disease in both patient groups. in haemodialysis patients, the risk of cvd-related hospitalization increased with elevated plasma calcium (hazard ratio: . ; % ci: . to . ) and calcium-phosphorus product levels ( . ; % ci: . to . ). in peritoneal dialysis patients, we observed similar effects that were not statistically significant. conclusion: tight control of plasma calcium and calcium-phosphorus product levels might prevent cvd-related hospitalizations in dialysis patients. abstract background: nurses are at health risk due to the nature of their work. an analysis of morbidity among nurses was conducted to provide insight into the relationship between their occupational exposure and health response. methods: self-reported medical history was collected from an israeli female-nurses cohort (n = , aged + years) and their siblings (n = , age-matched ± years) using a structured questionnaire. to compare disease occurrence between the two groups we used chi-square tests, and hazard ratios (hr) were calculated by cox regression to account for age of onset. results: cardiovascular diseases were more frequent among the nurses compared to the controls: heart diseases . % vs. . %, p = . (hr = . , p = . ); hypertension . % vs. . %, p < . (hr = . , p = . ). the frequency of hyperlipidemia was . % among the nurses and only . % among the controls (hr = . , p = . ).
for the following chronic diseases the occurrence was significantly higher among the nurses and the hrs were significantly greater than : thyroid, hr = . ; liver, hr = . . total cancer and diabetes rates were similar in the two groups (hr ≈ ). conclusions: the results suggest an association between working as a nurse and the presence of risk factors for cardiovascular diseases. the specific work-related determinants should be further evaluated. abstract background: early referral (er) to a nephrologist and arteriovenous fistulae as first vascular access (va) reduce negative outcomes in chronic dialysis patients (cdp). objectives: to evaluate the effect of nephrologist referral timing and type of first va on mortality. design and methods: prospective cohort study of incident cdp notified to the lazio dialysis registry (italy) in - . late referral (lr) was defined as a patient not referred to a nephrologist within months before starting dialysis. we dichotomized va as fistulae versus catheters. to estimate mortality hazard ratios (hr), a multivariate cox model was fitted. results: we observed . % lr subjects and . % catheters as first va; the proportion of catheters was . % vs. . % (p < . ) for lr and er, respectively. we found a higher mortality hr for patients with a catheter as first va both for er (hr = . ; % c.i. = . - . ) and lr (hr = . ; % c.i. = . - . ); the interaction between referral and va was borderline significant (p = . ). conclusions: the originality of our study was to investigate the influence of nephrologist referral timing and va on cdp mortality using an area-based population registry: we found that a catheter as first va has an independent effect on mortality and modifies the effect of referral timing on this outcome. abstract patients with idiopathic venous thrombosis (vt) without known genetic risk factors but with a positive family history might carry as yet unknown genetic defects. to determine the role of unknown hereditary factors in unexplained vt, we calculated the risk associated with family history. in the multiple environmental and genetic assessment of risk factors for vt (mega) study, a large population-based case-control study, we collected blood samples and questionnaires on acquired risk factors (surgery, immobilisation, malignancy, pregnancy and hormone use) and family history from patients and control subjects. overall, a positive family history was associated with an increased risk of vt (or ( % ci): . ( . - . )), especially in the absence of acquired risk factors (or ( % ci): . ( . - . )). among participants without acquired factors but with a positive family history, prothrombotic defects (factor v leiden, prothrombin a, protein c or protein s deficiency) were identified in out of ( %) patients compared to out of ( %) control subjects. after excluding participants with acquired or prothrombotic defects, family history persisted as a risk factor (or ( % ci): . ( . - . )). in conclusion, a substantial fraction of thrombotic events is unexplained. family history remains an important predictor of vt.
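several of the cohort abstracts above (the dialysis, nurses' and late-referral studies) estimate hazard ratios with cox proportional hazards models. the sketch below shows the general pattern using the third-party lifelines package on simulated data; the column names are hypothetical and the fitted hazard ratios are meaningless because the data are random.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "catheter_first_va": rng.integers(0, 2, n),   # illustrative exposure
    "followup_years": rng.exponential(3.0, n),    # time to event or censoring
    "died": rng.integers(0, 2, n),                # event indicator
})

cph = CoxPHFitter()
cph.fit(df, duration_col="followup_years", event_col="died")
cph.print_summary()   # hazard ratios = exp(coef) with confidence intervals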
abstract background: alcohol may have a beneficial effect on coronary heart disease (chd) through elevation of high-density lipoprotein cholesterol (hdlc) or other alterations in blood lipids. data on alcohol consumption and blood lipids in coronary patients are scarce. objectives: to assess whether alcohol consumption and intake of specific types of beverages are associated with blood lipids in older subjects with chd. design and methods: blood lipids (total cholesterol, hdlc, ldl cholesterol, triglycerides) were measured in myocardial infarction patients aged - years ( % male), as part of the alpha omega trial. intake of alcoholic beverages, total ethanol and macro- and micronutrients was assessed by food-frequency questionnaire. results: seventy percent of the subjects used lipid-lowering medication. mean total cholesterol was . mmol/l and hdlc was . mmol/l. in men but not in women, ethanol intake was positively associated with hdlc (difference of . mmol/l for ≥ g/d vs. g/d, p = . ) after adjustment for diet, lifestyle, and chd risk factors. also, liquor consumption was weakly positively associated with hdlc in men (p = . ). conclusion and discussion: moderate alcohol consumption may elevate hdlc in (treated) myocardial infarction patients. this is probably due to ethanol and not to other beneficial substances in alcoholic beverages. abstract objective: early detection and diagnosis of silicosis among dust-exposed workers is based mainly on the presence of rounded opacities on radiographs. it is thus important to examine how reliable the radiographic findings are in comparison to pathological findings. methods: a systematic literature search via medline was conducted. the validity of silicosis detection and its influence on risk estimation in epidemiology was evaluated in a sensitivity analysis. results: studies comparing radiographic and pathological findings of silicosis were identified. the sensitivity of radiographic diagnosis of silicosis (ilo / ) varied between % and %, and specificity between % and %. under the realistic assumption of a silicosis prevalence between % and % in dust-exposed workers, % to % of silicosis identified may be falsely diagnosed. the sensitivity analysis indicates that invalid diagnostics alone may lead to the finding of an increased risk of lung cancer among patients with silicosis. it may also lead to findings of % to % of radiographic silicosis even when there is no case of silicosis. however, the risk of silicosis could also be underestimated if the prevalence of silicosis exceeds %. conclusion: epidemiological studies based on patients with silicosis should be interpreted with caution. abstract introduction: epidemics of dengue occurring in various countries have stimulated investigators to seek innovative ways of improving current knowledge on the issue. objective: to identify the characteristics of spatial-temporal diffusion of the first dengue epidemic in a major brazilian city (salvador, bahia). methods: notified cases of dengue in salvador in were georeferenced according to census sector (cs) and by epidemiological week. kernel density estimation was used to identify the spatial diffusion pattern. results: of the cs in the city, cases of dengue were registered in ( %). the spatial distribution showed that in practically the entire city had been affected by the virus, with a greater concentration of cases in the western region, comprising cs of high population density and predominantly horizontal dwellings. conclusion and discussion: the pattern found showed characteristics of a contagious diffusion process. it was possible to identify the epicenter of the epidemic from which propagation initiated. the speed of progression suggested that even if a rapid intervention was initiated to reduce the vector population, it would probably have little effect in reducing the incidence of the disease. this finding confirms the need for new studies to develop novel technology for prevention of this disease.
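a kernel density estimate over georeferenced case locations, as used in the dengue study above to visualise spatial diffusion, can be sketched as follows. this is illustrative only: the coordinates are simulated and the bandwidth is left at scipy's default.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# hypothetical case coordinates (e.g. census-sector centroids), in km
x = np.concatenate([rng.normal(2.0, 0.5, 300), rng.normal(6.0, 1.0, 150)])
y = np.concatenate([rng.normal(3.0, 0.5, 300), rng.normal(5.0, 1.0, 150)])

kde = gaussian_kde(np.vstack([x, y]))   # 2-d kernel density estimate

# evaluate the density on a regular grid to map the concentration of cases
gx, gy = np.meshgrid(np.linspace(0, 8, 100), np.linspace(0, 8, 100))
density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
print("highest estimated case density:", density.max())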
abstract background: knowing the size of hidden drug-user populations in a community is important to plan and evaluate public health interventions. objectives: the aim of this study was to estimate the prevalence of opiate and cocaine users in the liguria region by using the covariate capture-recapture method applied to four data sources. methods: we performed a cross-sectional study in the resident population aged - years ( . people at census). during , individual cases identified as primary opiate or cocaine users were flagged by four sources (drug dependence services, social services at prefectures, therapeutic communities, hospital discharges). poisson regression models were fitted, adjusting for dependence among sources and for heterogeneity in catchability among categories of the two examined covariates: age ( - and - years) and gender. results: the prevalence of opiate and cocaine users was . % ( % c.i. . - . %) and . % ( % c.i. . - . %), respectively. conclusions: the estimated prevalence of opiate and cocaine users is consistent with that found in inner london: . % and . % respectively (hickman m., ; hope v.d., ). the covariate capture-recapture method applied to four data sources allowed the identification of a large cocaine-using population and proved appropriate for determining hidden drug-user populations. abstract background: in - we performed a population-based diabetes screening programme. objectives: to investigate whether the yield of screening is influenced by gp and practice characteristics. methods: a questionnaire containing items on the gp (age, gender, employment, specialty in diabetes, applying insulin therapy) and the practice (setting, location, number of patients from ethnic minority groups, specific diabetes clinic, involvement of practice assistant and practice nurse in diabetes care, cooperation with a diabetes nurse) was sent to general practitioners (gps) in practices in the southwestern region of the netherlands. multiple linear regression analysis was performed. the outcome measure was the ratio of screen-detected diabetic patients to known diabetic patients per practice (sdm/kdm). results: sdm/kdm was independently associated with higher age of the gp (regression coefficient . ; % confidence interval . to . ), urban location (- . ; - . to - . ) and involvement of the practice assistant in diabetes care ( . ; . to . ). conclusion: a lower yield of screening, presumably reflecting a lower prevalence of undiagnosed diabetes, was found in practices of younger gps and in urban practices. a lower yield was not associated with an appropriate practice organization nor with a specialty of the gp in diabetes. abstract background: in recent years, increased incidence rates for childhood cancer have been reported from industrialized countries. these findings were discussed controversially, because increases could be caused by changes in potential risk factors. objectives: the question is: are the observed increasing rates due to actual changes in incidence, or are they mainly caused by changes in registration practice or artefacts? methods: for europe, data from the accis project (pooled data from european population-based cancer registries, performed at iarc, lyon; responsible: e.
steliarova-foucher), and for germany, data from the german childhood cancer registry available from onwards were used. results: accis data (based on , cases) show a significant increase, with an overall average annual percentage change of about %, seen for nearly all diagnostic subgroups. for germany, increases are seen for neuroblastoma (due to screening programmes) and brain tumours (due to improved registration). for acute leukaemia the observed increase is explained by changes in classification. conclusion and discussion: the increased incidence for europe can only partly be explained by registration artefacts or improved diagnostic methods. the observed patterns suggest that an actual change exists. in germany, from till now, the observed increased rates could be explained by artefacts. abstract suicide is the fourth most common cause of death among working-age finns. among men, socioeconomic status is strongly and inversely associated with suicide mortality, but little is known about socioeconomic differences in female suicide. we studied the direct and indirect effects of different socioeconomic indicators - education, occupation-based social class and income - on suicide among finnish women aged - . the effect of main economic activity was also studied. we used individual-level data from the census linked to the death register for the years - . altogether over million person-years were included and suicides were committed. the age-adjusted rii, estimated using a poisson regression model, was . ( % ci . - . ) for education, . ( . - . ) for social class and . ( . - . ) for income. however, almost all of the effect of education was mediated by social class. fifteen per cent of the effect of social class was explained by education and per cent was mediated by income. the effect of income on suicide was mainly explained by economic activity. in conclusion, net of the other indicators, occupation-based social class is a strong determinant of socioeconomic differences in female suicide mortality, and actions aimed at preventing female suicide should target this group. abstract c-reactive protein (crp) levels in the range between and mg/l independently predict the risk of future cardiovascular events. besides being a marker of atherosclerotic processes, high-normal crp levels may also be a sign of a more pronounced response to everyday inflammatory stimuli. the aim of our study is to assess the association between the response to everyday stimuli and the risk of myocardial infarction. we will perform a population-based case-control study including a total of persons. cases (n = ) are first myocardial infarction (mi) patients. controls (n = ) are partners of the patients. offspring of the mi patients (n = ) are included because disease activity and the use of medication by the mi patients may influence the inflammatory response. in order to assess the inflammatory response in mi patients, the mean genetically determined inflammatory response in the offspring will be assessed and used as a measure of the inflammatory response in the mi patients. the offspring are free of disease and medication use. partners of the offspring (n = ) are the controls for the offspring. influvac vaccine will be given to assess crp concentration, i.e. the inflammatory response, before and after vaccination. abstract background:
ischemic heart disease risk may be influenced by long-term exposure to electromagnetic fields (emf) in vulnerable subjects, but epidemiological data are inconsistent. objectives: we studied whether long-term occupational exposure to emf is related to an increased myocardial infarction (mi) risk. design and methods: we conducted a prospective case-control study, which involved mi cases and controls. emf exposure in cases and controls was assessed subjectively. the effect of emf exposure on mi risk was estimated using multivariate logistic regression. results: after adjustment for age, smoking, blood pressure, body mass index and psychological stress, the odds ratio for emf exposure < years was . ( % ci . - . ), for emf exposure - years . ( % ci . - . ) and for emf exposure > years . ( % ci . - . ). conclusion: long-term occupational exposure to emf may increase the risk of mi. our crude estimates of emf exposure might have an impact on the excess risk because of nondifferential misclassification in assigning exposure. abstract background: it has been suggested that noise exposure is associated with ischemic heart disease risk, but epidemiological evidence is still limited. objectives: we studied whether road traffic noise exposure increases the risk of myocardial infarction (mi). design and methods: we conducted a population-based prospective case-control study, which involved mi cases and controls. we measured traffic-related noise levels in the electoral districts and linked these levels to residential addresses. we used multiple logistic regression to assess the effect of noise exposure on mi risk. results: after adjustment for age, smoking, blood pressure, body mass index, and psychological stress, the risk of mi was higher for men exposed to - dba. abstract background: some studies have suggested that patients who are depressed following acute myocardial infarction (mi) experience poorer survival. however, i) other studies show no significant association when adjusted for recognized prognostic indicators, and ii) some 'natural' responses to mi may be recorded in questionnaires as indicators of depression. method: depression was assessed in mi patients by interview on two measures (gwb and sf ) - weeks after discharge, clinical data were abstracted from patients' medical records, and vital status was assessed at - years. survival of depressed, marginally depressed and normal patients was calculated by the kaplan-meier method and comparisons were made by log-rank tests and cox proportional hazards modelling. results: crude survival at years in patients was higher for depressed and marginally depressed ( %) than for normals ( %), although not significantly. in multivariate analysis, four patient characteristics contributed significantly to survival: age (p < . ), previous mi (< . ), diabetes (< . ) and sex (< . ); other potential explanatory variables, including hypertension, infarct severity and depression, were excluded by the model.
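the adjusted odds ratios in the emf and traffic-noise case-control studies above come from multivariate logistic regression, with exponentiated coefficients giving the ors. a minimal python sketch with simulated data; the variable names are invented and not the studies' covariates.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1200
df = pd.DataFrame({
    "mi": rng.integers(0, 2, n),        # case/control indicator
    "exposed": rng.integers(0, 2, n),   # e.g. long-term emf or noise exposure
    "age": rng.normal(60, 8, n),
    "smoking": rng.integers(0, 2, n),
    "bmi": rng.normal(27, 4, n),
})

res = smf.logit("mi ~ exposed + age + smoking + bmi", data=df).fit(disp=False)
odds_ratios = np.exp(res.params)        # adjusted odds ratios
ci = np.exp(res.conf_int())             # 95% confidence intervals
print(pd.concat([odds_ratios.rename("OR"), ci], axis=1))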
abstract background: the low coronary heart disease (chd) incidence in southern europe could result in lower low-density lipoprotein cholesterol oxidation (oxldl). objective: the aim of this study was to compare oxldl levels in chd patients from several european countries. methods: a cross-sectional multicentre study included stable male chd subjects aged to years from northern (finland and sweden), central (germany), and southern europe (greece and spain). lipid peroxidation was determined by plasma oxldl. results: the score of adherence to the mediterranean diet, antioxidant intake, alcohol intake, and lipid profile were significantly associated with oxldl. oxldl levels were higher in the northern ( . u/l) than in the central ( . u/l) and southern populations ( . u/l), p = . , in the adjusted models. the probability of northern europe having the highest oxldl levels was . % and . % in the logarithm-of-triglyceride-adjusted and fully adjusted models, respectively. the probability of this order holding after adjustment for country was . %. conclusion: a gradient in lipoperoxidation from northern to central and southern europe is very likely to exist, and parallels that observed in chd mortality and incidence rates. southern populations may have more favourable environmental factors against oxidation than northern europe. abstract background: whereas socioeconomic status (ses) has been established as a risk factor for a range of adverse health outcomes, little literature exists examining socio-economic inequalities in the prevalence of congenital anomalies. objectives: to investigate the relationship between ses and the risk of specific congenital anomalies, such as neural tube defects (ntd), oral clefts (oc) and down's syndrome (ds). design and methods: a total of cases of congenital anomaly and non-malformed control births were collected between and from the italian archive of the certificates of delivery care. as a measure of ses, cases and controls were given a value on a -level deprivation index. data were analysed using a logistic regression model. results: we found cases of ntd, cases of oc and cases of ds. the risk of having a baby with ntd was significantly higher for women of low ses (or = . ; c.i.: . - . ), as was the risk of oc (or = . ; c.i.: . - . ). no significant evidence for ses variation was found for ds. conclusion and discussion: our data suggest that risk factors linked to ses, such as nutritional factors, lifestyle, and access to health services, may play a role in the occurrence of some malformations. abstract background: general practitioners (gps) are well regarded by their patients and have the opportunity to play an active role in providing cessation advice. objectives: this study was run to examine whether a public health programme based on a carefully adapted programme of continuing education can increase gps' use of cessation advice and increase the success rates of such advice. methods: because randomization was performed at the gp level, a cluster-randomized trial design was used. marginal models estimated by gee and mixed generalized linear models are used for this type of design. results: the cessation rate was relatively high for all smokers enrolled in the trial (n = ): a total of smokers were ex-smokers at one year ( . %). patients who were seen by trained gps were more likely to successfully stop smoking than those seen by the control gps ( . % vs . %). motivation, age over , lower anxiety scores, and confidence in one's ability to stop smoking were predictive of successful cessation at one-year follow-up. conclusions: the cluster analysis indicated that the factors important to successful cessation in this population of smokers are factors commonly found to influence cessation. abstract background and purpose: conventional meta-analysis showed no difference in primary outcome for coronary bypass surgery without (offpump) or with (onpump) the heart-lung machine.
secondary outcome parameters such as transfusion requirements or hospitalization days favored offpump surgery. combined individual patient data analysis improves the precision of effect estimates and allows accurate subgroup analyses. objective: our objective is to obtain accurate effect estimates for stroke, myocardial infarction, or death after offpump versus onpump surgery, by meta-analysis of pooled individual patient data. method and results: a bibliographic database search identified eleven large trials (> patients). the data obtained for trials included patients. the primary endpoint was a composite (n = ); secondary endpoints were death (n = ), stroke (n = ) and myocardial infarction (n = ). the hazard ratios for event-free survival after offpump vs onpump ( % ci) were: composite endpoint . ( . ; . ), death . ( . ; . ), myocardial infarction . ( . ; . ), stroke . ( . ; . ). after stratification for diabetes, gender and age, the results slightly favored offpump for high-risk groups. hazard ratios remained statistically non-significant. conclusion: no clinically or statistically significant differences were found for any endpoint or subgroup. offpump coronary bypass surgery is at least equal to conventional coronary bypass surgery. offpump surgery therefore is a justifiable option for cardiac surgeons performing cardiac bypass surgery. abstract in - an outbreak of pertussis occurred, mostly among vaccinated children. since then the incidence has remained high. therefore, a fifth dose of acellular booster vaccine for -year-olds was introduced in october . the impact of this vaccination on the age-specific pertussis incidence was assessed. mandatory notifications and hospitalisations were analysed for - and compared with previous years. surveillance data show 'epidemic' increases of pertussis in , , and . the total incidence/ , in ( . ) was higher than in the previous epidemic year ( . ). nevertheless, the incidence of notifications and hospitalisations in the age groups targeted for the booster vaccination had decreased by % and %, respectively, compared to . in contrast, the incidence in adolescents and adults almost doubled. unlike other countries that introduced a pre-school booster, the incidence of hospitalised infants < months also decreased ( % compared with ). as expected, the booster vaccination for -year-olds has decreased the incidence among the target population itself. more importantly, the decreased incidence among infants < months suggests that transmission from siblings to infants has also decreased. in further exploration of the impact of additional vaccination strategies (such as boostering of adolescents and/or adults) this effect should not be ignored. abstract acute respiratory infections (ari) are responsible for considerable morbidity in the community, but little is known about the presence of respiratory pathogens in asymptomatic individuals. we hypothesized that asymptomatic persons could have a subclinical infection and so act as a source of transmission. between and , all patients with ari who visited their sentinel general practitioner were reported to estimate the incidence of ari in dutch general practices. a random selection of them (cases) and an equal number of asymptomatic persons visiting for other complaints (controls) were included in a case-control study. nose/throat swabs of participants were tested for a broad range of pathogens.
the overall incidence of ari was per , person-years, suggesting that in the dutch population an estimated , persons annually consult their general practitioner for respiratory complaints. viruses were detected in % of the cases, β-haemolytic streptococci group a in % and mixed infections in %. in addition, pathogens were detected in approximately % of controls, particularly in the youngest age groups. this study confirms that most ari are viral and supports the restrictive policy on prescribing antibiotics. furthermore, we demonstrated that asymptomatic persons might be a neglected source of transmission. abstract background: the baking and flour-producing industries in the netherlands agreed on developing a health surveillance system to reduce the burden of, and improve the prognosis of, occupational allergic diseases. objectives: to develop and validate a diagnostic model for sensitization to wheat and fungal amylase allergens, as a triage instrument to detect occupational allergic diseases. design and methods: a diagnostic regression model was developed in bakers from a cross-sectional study, with ige serology to wheat and/or amylase allergens as the reference standard. model calibration was assessed with the hosmer-lemeshow goodness-of-fit test; discriminative ability using the area under the receiver operating characteristic curve (auc); and internal validity using a bootstrapping procedure. external validation was conducted in other bakers. results: the diagnostic model consisted of four questionnaire items (history of asthma, rhinitis, conjunctivitis, and work-related allergic symptoms) and showed good calibration (p = . ) and discriminative ability (auc . ; % ci . to . ). internal validity was reasonable (correction factor of . and optimism-corrected auc of . ). external validation showed good calibration (p = . ) and discriminative ability (auc . ; % ci . to . ). conclusions and discussion: this easily applicable diagnostic model for sensitization to flour and enzymes shows reasonable diagnostic accuracy and external validity. abstract background: in the netherlands, the baking and flour-producing industries ( , small bakeries, industrial bakeries, and flour manufacturers) agreed to reduce the high rate (up to %) of occupation-related allergic diseases. objectives: to conduct health surveillance for early detection of occupational allergic diseases by implementing a diagnostic model as a triage instrument. design and methods: in the preparation phase, a validated diagnostic regression model for sensitization to wheat and/or α-amylase allergens was converted into a score chart for use in occupational health practice. two cut-off points of the sum scores were selected based on diagnostic accuracy properties. in the first phase, a questionnaire including the diagnostic predictors from the model was applied in . bakers. a surveillance simulation was done in bakers recently enrolled in the surveillance. workers with high questionnaire scores were referred for advanced medical examination. results: implementing the diagnostic questionnaire model yielded %, %, and % of bakers in the low, intermediate, and high score groups. workers with high scores showed the highest percentage of occupational allergic diseases. conclusions and discussion: with proper cut-off points for referral, the diagnostic model could serve as a triage instrument in health surveillance to detect occupational allergic diseases early.
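the validation steps described in the two bakery abstracts above, discrimination via the area under the roc curve and bootstrap-based internal validation, can be sketched roughly as follows. this is not the published model: the four questionnaire items and all data are simulated.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 800
X = rng.integers(0, 2, size=(n, 4))   # asthma, rhinitis, conjunctivitis, work-related symptom
logit = -2.0 + X @ np.array([0.9, 0.7, 0.5, 1.1])
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression().fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# bootstrap optimism: refit on resamples, compare resample auc with original-sample auc
optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    boot = LogisticRegression().fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

print(f"apparent auc {apparent_auc:.2f}, optimism-corrected {apparent_auc - np.mean(optimism):.2f}")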
abstract background: the prevalence of cardiovascular risk factors in spain is high, but myocardial infarction incidence is lower than in other countries. objective: to determine the role of the basic lipid profile in coronary heart disease (chd) incidence in spain. methods: a cohort of , healthy spanish individuals aged to years was followed for years. the end-points were fatal and non-fatal myocardial infarction, and angina. results: the participants who developed a coronary end-point were significantly older ( vs ), more often diabetic ( % vs %), smokers ( % vs %) and hypertensive ( % vs %) than the rest, and their average total and hdl-cholesterol levels (mg/dl) were vs (ns) and vs (p < . ), respectively. chd incidence among individuals with low hdl levels (< in men / < in women) was higher than in the rest: . /year vs . /year (p < . ) in men, and . /year vs . /year (p < . ) in women. hdl-cholesterol was the only lipid-related variable significantly associated with chd: the hazard ratio for a mg/dl increase was . ( % ci: . - . ) in men, and . ( % ci: . - . ) in women, after adjusting for classical risk factors. conclusion: hdl-cholesterol is the only classical lipid variable associated with chd incidence in spain. abstract background: it is widely recognized that health service interventions may reduce the infant mortality rate (imr), which usually occurs alongside economic growth. however, there are reports showing that the imr can decrease under adverse economic and social conditions, indicating the presence of other unknown determinants. objective: this study aims to analyze the temporal tendency of infant mortality in brazil during a recent period ( to ) of economic crisis. methods: time-series study using data from the mortality information system, censuses (ibge) and epidemiological information (funasa). applying arima (autoregressive integrated moving average) models, the series parameters were described, and spearman correlation coefficients were used to evaluate the association between the infant mortality coefficient and selected determinants. results: infant mortality showed a declining tendency (- . %) and a strong correlation with the majority of the indicators analyzed. however, only the correlations between the infant mortality coefficient and total fecundity and birth rates differed significantly between decades. conclusions/discussion: fecundity variation was responsible for the persistence of the mortality decline during the eighties. in the following period, indicators of living conditions, mostly health care, could be more important.
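an arima model of the kind named in the infant-mortality time-series abstract above can be fitted with statsmodels; the yearly series below is synthetic and the (p, d, q) order is an arbitrary illustrative choice, not the study's specification.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)
# synthetic yearly infant-mortality-like series with a downward trend
imr = pd.Series(60 - 1.5 * np.arange(25) + rng.normal(0, 2, 25))

model = ARIMA(imr, order=(1, 1, 1)).fit()   # arima(p=1, d=1, q=1)
print(model.summary())
print(model.forecast(steps=5))              # simple out-of-sample forecast

# spearman correlation between the mortality series and a hypothetical determinant
fecundity = pd.Series(3.5 - 0.05 * np.arange(25) + rng.normal(0, 0.1, 25))
print(spearmanr(imr, fecundity))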
abstract background: across european union (eu) member states, considerable variation exists in the structure and performance of surveillance systems for communicable disease prevention and control. objectives: the study aims to support the improvement of surveillance systems for communicable diseases in europe by using benchmarking for the comparison of national surveillance systems. design and methods: surveillance systems from england & wales, finland, france, germany, hungary and the netherlands were described and analysed. benchmarking processes were performed with selected criteria (e.g. case definitions, early warning systems). after the description of benchmarks, best practices were identified and described. results: the six countries have, in general, well-functioning communicable disease control and prevention systems. nevertheless, different strengths and weaknesses could be identified. practical examples of best practice from various surveillance systems demonstrated fields for improvement. conclusion and discussion: benchmarking national surveillance systems is applicable as a new tool for the comparison of communicable disease control in europe. a gold standard for surveillance systems in various countries is very difficult to achieve because of heterogeneity (e.g. in disease burden and in personnel and financial resources). however, to improve the quality of surveillance systems across europe, it will be useful to benchmark the surveillance systems of all eu member states. abstract background: therapeutic decisions in osteoarthritis (oa) often involve trade-offs between accepting risks of side effects and gaining pain relief. data about the risk levels patients are willing to accept are limited. objectives: to determine patients' maximum acceptable risk levels (marls) for different adverse effects from typical oa medications and to identify the predictors of these risk attitudes. design and methods: marls were measured with a probabilistic threshold technique for different levels of pain relief. baseline pain and risk levels were controlled for in a x factorial design. clinical and sociodemographic characteristics were assessed using a self-administered questionnaire. results: for subjects, the marls distributions were skewed and varied by level of pain relief, type of adverse effect, and baseline risk level. given a % baseline risk, for a -point ( - scale) pain benefit the mean (median) marls were . % ( %) for heart attack/stroke; . % ( %) for stomach bleed; . % ( . %) for hypertension; and . % ( . %) for dyspepsia. most clinical and sociodemographic factors were not associated with marls. conclusion: subjects were willing to trade substantial risks of side effects for pain benefits. this study provides new data on risk acceptability in oa patients that could be incorporated into practice guidelines for physicians. abstract background: several independent studies have shown that single genetic determinants of platelet aggregation are associated with increased ihd risk. objectives: to study the effects of clustering prothrombotic (genetic) determinants on the prediction of ihd risk. design and methods: the study is based on a cohort of , women, aged to years, who were followed from to . during this period, there were women with registered ihd (icd- - ). a nested case-cohort analysis was performed to study the relation of plasma levels of vwf and fibrinogen, blood group genotype and prothrombotic mutations in the genes of a b , gpvi, gpib and aiibb to ihd. results: blood group ab, high vwf concentrations and high fibrinogen concentrations were associated with an increased incidence of acute ihd. when the effects of blood group ab/o genotype and plasma levels of fibrinogen and vwf were clustered into a score, there was a convincing relationship between a high prothrombotic score and an increased incidence of acute ihd: the fully adjusted hr ( % confidence interval) was . ( . - . ) for women with the highest score when the lowest score was taken as reference. conclusions: clustering of prothrombotic markers is a major determinant of an increased incidence of acute ihd. abstract background: studies have revealed that heart rate variability (hrv) is a predictor of hypertension; however, its -hour recording has not been analysed together with -hour ambulatory blood pressure. objective: we studied the relationship between hrv and blood pressure.
methods: hrv and blood pressure were measured by -hour ambulatory recordings in a randomly selected population without evidence of heart disease. cross-sectional analyses were conducted in women and men (mean age: . ± . ). hrv values, measured by the standard deviation of rr intervals (sdnn), were compared after logarithmic transformation between the blood pressure levels ( / mmhg), using analysis of variance. stepwise multiple regression was performed to assess the cumulative effects on sdnn of systolic and diastolic blood pressure, clinical obesity, fasting glycaemia, c-reactive protein, treatments, smoking and alcohol consumption. results: sdnn was lower in hypertensive men and women (p < . ), independently of drug treatments. after adjustment for factors associated with hypertension, sdnn was no longer associated with hypertension, but with obesity, glycaemia and c-reactive protein in both genders. sdnn was negatively associated with diastolic blood pressure in men (p = . ) in the multivariate approach. conclusion: whereas blood pressure levels were not related to sdnn in the multivariate analysis, diastolic blood pressure contributed to sdnn in men. abstract it has been proposed that n- fatty acids may protect against the development of allergic disease, while n- fatty acids may promote its development. in the piama (prevention and incidence of asthma and mite allergy) study we investigated associations between the breast milk fatty acid composition of allergic and non-allergic mothers and allergic disease (doctor-diagnosed asthma, eczema or hay fever) in their children at the age of year and at the age of years. in children of allergic mothers, prevalences of allergic disease at age and at age were relatively high if the breast milk they consumed had a low content (wt%) of n- fatty acids and particularly of n- long-chain polyunsaturates (lcps), a low content of trans fatty acids, or a low ratio of n- lcps/n- lcps. the strongest predictor of allergic disease was a low breast milk n- lcps/n- lcps ratio (odds ratios ( % ci) of lowest vs highest tertile, adjusted for maternal age, parity and education: . ( . to . ) for allergic disease at age and . ( . to . ) for allergic disease at age ). in children of non-allergic mothers no statistically significant associations were observed. abstract background/relevance: to find out about the appropriateness of using two vision-related quality of life instruments to measure outcomes of visually impaired elderly in a mono- and multidisciplinary rehabilitation centre. objective/design: to evaluate the sensitivity of the vision quality of life core measure (vcm ) and the low vision quality of life questionnaire (lvqol) to measure changes in vision-related quality of life in a non-randomised follow-up study. methods: visually impaired patients (n = ) recruited from ophthalmology departments were administered questionnaires at baseline, months and year after rehabilitation. person measures were analysed using rasch analyses for polytomous rating scales. results: paired-sample t-tests for the vcm showed improvement at months (p = . ; effect size = . and p = . ; effect size = . ) for the monodisciplinary and the multidisciplinary groups, respectively. at year, only the multidisciplinary group showed improvement on the vcm (p = . ; effect size = . ). on the lvqol, no significant improvement or deterioration was found for either group. discussion: although the vcm showed improvement in vision-related quality of life over time, the effect sizes appeared to be quite small.
we conclude that both instruments lack sensitivity to measure changes. another explanation is that rehabilitation did not contribute to quality of life improvements. abstract background: the natural history of asthma severity is poorly known. objective: to investigate prognostic factors of asthma severity. methods: all current asthmatics identified in / in the european community respiratory health survey were followed up and their severity was assessed in using the global initiative for asthma categorization (n = ). asthma severity was related to baseline/follow-up potential determinants by a multinomial logistic model, using intermittent asthmatics as the reference category for relative risk ratios (rrr). results: patients in the lowest/highest levels of severity at baseline had an % likelihood of remaining in a similar level. severe persistent asthmatics had a poorer fev % predicted at baseline, higher ige levels (rrr = . ; % ci: . - . ) and a higher prevalence of chronic cough/phlegm ( . ; . - . ) than intermittent asthmatics. moderate persistent asthmatics showed similar associations. mild persistent asthmatics were similar to intermittent asthmatics, although the former showed poorer control of symptoms than the latter. subjects in remission had a lower probability of an increase in bmi than current asthmatics ( . ; . - . ). allergic rhinitis, smoking and respiratory infections in childhood were not associated with severity. conclusion: asthma severity is a relatively stable condition, at least for patients at the two extremes of the severity spectrum. high ige levels and persistent cough/phlegm are strong markers of moderate/severe asthma. abstract background: thyroid cancer (tc) has a low, yet growing, incidence in spain. ionizing radiation is the only well-established risk factor. objectives: this study sought to depict the municipal distribution of tc mortality in spain and to discuss possible risk factors. design and methods: the posterior distribution of the relative risk for tc was computed using a single bayesian spatial model covering all municipal areas of spain ( , ). maps were plotted depicting standardised mortality ratios, smoothed municipal relative risk (rr) using the besag, york and mollié model, and the distribution of the posterior probability that rr > . results: from to , a total of , tc deaths were registered in , municipalities. there was a higher risk of death in some areas of the canary islands, galicia and asturias. abstract igf-i is an important growth factor associated with increased breast cancer risk in epidemiological and experimental studies. lycopene intake has been associated with decreased cancer risk. although some data indicate that lycopene can influence the igf system, this has never been extensively tested in humans. the purpose of this study is to evaluate the effects of a lycopene intervention on the circulating igf system in women with an increased risk of breast cancer. we conducted a randomized, placebo-controlled cross-over intervention study on the effects of lycopene supplementation ( mg/day, months) in pre-menopausal women with ) a history of breast cancer (n = ) and ) a high familial breast cancer risk (n = ). the drop-out rate was %. mean igf-i and igfbp- concentrations after placebo were . ± . ng/ml and . ± . mg/ml, respectively. lycopene supplementation did not significantly alter serum total igf-i (mean lycopene effect: - . ng/ml; % ci: - . to . ) or igfbp- (- . mg/ml; - . to . ) concentrations.
dietary energy and macronutrient intake, physical activity, body weight, and serum lycopene concentrations were assessed and are currently under evaluation. in conclusion, this study shows that months of lycopene supplementation has no effect on serum igf-system components in a population at high risk of breast cancer. abstract introduction: patients who experience burden during diagnostic tests may disrupt these tests. the aim was to describe how melanoma patients with lymph node metastases perceive the diagnostic tests. methods: patients were requested to complete a self-administered questionnaire. experienced levels of embarrassment, discomfort and anxiety were calculated, as well as (total) scores for each burden. the non-parametric friedman test for related samples was used to see if there was a difference in burden. results: the questionnaire was completed by patients; the response rate was %. overall satisfaction was high. in total, % felt embarrassment, % discomfort and % anxiety. overall, % felt some kind of burden. there was no difference in anxiety between the three tests. however, patients experienced more embarrassment and discomfort during the pet (positron emission tomography) scan (p = . and p = . ). conclusion: overall levels of burden were low. however, patients experienced more embarrassment and discomfort during the pet scan, possibly as a result of lying immobile for a long time. the accuracy, the costs and the number of patients upstaged will probably be the most important factors in determining the additional value of fdg-pet and ct, but it is reassuring to know that only few patients experience severe or extreme burden. abstract gastric cancer (gc) is the second most frequent cause of cancer death in lithuania. some intercultural aspects of diet related to the outcome could be risk factors for the disease. the objective of the study was to assess associations between the risk of gc and dietary factors. a case-control study included cases with a diagnosis of gc and controls free of cancer and gastric diseases. a questionnaire was used to collect information on possible risk factors. the odds ratios (or) and % confidence intervals (ci) were estimated by a conditional logistic regression model. after controlling for possible confounders associated with gc, use of salted meat (or = . ; % ci = . - . ; > - times/week vs. almost never), smoked meat (or = . ; % ci = . - . ; > - times/week vs. less) and smoked fish (or = . ; % ci = . - . ; > - times/week vs. less) was associated with an increased risk of gc. a higher risk of gc was associated with frequent use of butter, eggs and noodles, while frequent consumption of carrots, cabbage, broccoli, tomatoes, garlic and beans decreased the risk significantly. the data support a role of salt-processed food and some animal foods in increasing the risk of gc and of plant foods in reducing the risk of the disease.
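the matched case-control analysis in the gastric-cancer abstract above uses conditional logistic regression; statsmodels offers an implementation. the sketch below is illustrative only, with simulated matched pairs and an invented exposure variable, and assumes a statsmodels version that ships ConditionalLogit.

import numpy as np
from statsmodels.discrete.conditional_models import ConditionalLogit

rng = np.random.default_rng(6)
n_pairs = 200
groups = np.repeat(np.arange(n_pairs), 2)   # matched case-control pairs
case = np.tile([1, 0], n_pairs)             # one case and one control per pair
# hypothetical exposure (e.g. frequent salted-meat use), more common among cases
exposure = rng.binomial(1, np.where(case == 1, 0.5, 0.3))

result = ConditionalLogit(case, exposure[:, None], groups=groups).fit()
print("odds ratio:", float(np.exp(np.asarray(result.params)[0])))
print("95% ci:", np.exp(np.asarray(result.conf_int())[0]))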
objectives: ( ) to determine which measurement properties are reported in systematic reviews of hsmi and how these properties are defined, ( ) which standards are used to define how measurement properties should be assessed, and ( ) which criteria are defined for good measurement properties. methods: a systematic literature search was performed in pubmed, embase and psychlit. articles were included if they met the following inclusion criteria: ( ) systematic review, ( ) hsmi were reviewed, and ( ) the purpose of the review is to identify all measurement instruments assessing (an aspect of) health status and to report on the clinimetric properties of these hsmi. two independent reviewers selected the articles. a standardised data-extraction form was used. preliminary results: systematic reviews were included. conclusions: large variability in standards and criteria used for evaluating measurement properties was found. this review can serve as basis for reaching consensus on standards and criteria for evaluating measurement properties of hsmi. abstract residential exposure to nitrogen dioxide is an air quality indicator and could be very useful to assess the effects of air pollution on respiratory diseases. the present study aims at developing a model to predict residential exposure to no , combining data from questionnaires and from local monitoring stations (ms). in the italian centres of verona, torino and pavia, participating in ecrhs-ii, no concentrations were measured using passive samplers (ps-no ) placed outside the kitchen of subjects for days. simultaneously, average no concentrations were collected from all the mss of the three centres (ms-no ). a multiple regression model was set up with ps-no concentrations as response variable and questionnaire information and ms-no concentrations as predictors. the model minimizing the root mean square error (rmse), obtained from a ten fold cross validation, was selected. the model with the best predictive ability (rmse= . ), had as predictors: ms-no concentrations, season of the survey, centre, type of building, self-reported intensity of heavy vehicle traffic. the correlation coefficient between predictive and observed values was . ( % ci: . - . ). in conclusion, this preliminary analysis suggests that the combination of questionnaire information and routine data from the mss could be useful to predict the residential exposure to no . abstract background: currently only % of dutch mothers comply with the who recommendation to give exclusive breastfeeding for at least six months. therefore, the dutch authorities consider policies on breastfeeding. objectives: quantification of the health effect of several breastfeeding policies. methods: a systematic literature search of published epidemiological studies conducted in the general 'western' population. based on this overview a model is developed. the model simulates incidences of diseases of mother and child depending on the duration that mothers breastfeed. each policy corresponds to a distribution in the duration of breastfeeding. the health effects of each policy are compared to the present situation. results: breastfeeding has beneficial health effects on both the short and the long term for mother and child. the longer the duration of breastfeeding, the larger is the effect. most public health gain is achieved by introducing breastfeeding to all newborns rather than through a policy focusing just on extending the lactation period of women already breastfeeding. 
conclusion: breastfeeding has positive health effects. policy should focus preferentially on encouraging all mothers to start breastfeeding.

abstract background: the constant increase in international trade and travel has raised the significance of pandemic infectious diseases worldwide. the / sars outbreak rapidly spread from china to countries, of which were located in europe. in order to control and prevent pandemic infections in europe, systematic and effective public health preparation by every member state is essential. method: supported by the european commission, surveys focusing on national sars (september ) and influenza (october ) preparedness were carried out. a descriptive analysis was undertaken to identify differences in european infectious disease policies. results: guidelines and guidance for disease management were well established in most european countries. however, the application of control measures, such as measures for mass gatherings or public information policies, varied among member states. discussion: the results show that european countries are aware of the need to prepare for pandemic infections. yet, the effectiveness of certain control measures has been insufficiently analysed. further research and detailed knowledge about factors influencing the international spread of diseases are required. 'hazard analysis of critical control points' (haccp) will be applied to evaluate national health responses in order to provide comprehensive data for recommendations on european pandemic preparedness.

abstract background: influenza is still an important problem for public health. knowing its space-time evolution is of special interest in order to carry out prevention plans. objectives: to analyze the geographical diffusion of the epidemic wave in extremadura. methods: the absolute influenza incidence rates in every town were calculated from the cases registered per week in the compulsory disease declaration system. continuous maps were produced using a geographical interpolation method (inverse distance weighting (idw), applied with weighting exponents of ). results: the / season began in the th week of , with a small influenza incidence. only isolated cases occurred in a few towns until the th week. diffusion then occurred in limited areas in the north and southwest of the region between the th and st weeks. the highest incidence appeared between the nd week of and the rd of . influenza cases started to decrease in the northwest and north of the region from the rd week of till the th week, when most of the cases were found in the southwest. conclusion: there is a space-time diffusion of influenza, related to higher population density. we propose to analyze these data in combination with temperature information.

abstract background: acute lower respiratory tract infection (lrti) can cause various complications leading to morbidity and mortality, notably among elderly patients. antibiotics are often given to prevent complications. to minimise costs and bacterial resistance, antibiotics are only recommended in case of pneumonia or in patients at high risk of serious complications. objective: we assessed the course of illness of lrtis among dutch elderly primary care patients and assessed whether gps were inclined to prescribe antibiotics more readily to patients at risk for complications. methods: we retrospectively analysed medical data from , episodes of lrti among patients >= years of age presenting in primary care to describe the course of illness.
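the influenza mapping abstract above interpolates town-level incidence with inverse distance weighting (idw); the following is a minimal numpy sketch of that interpolation, in which the town coordinates, incidence rates and weighting exponent are hypothetical placeholders rather than the study's values.

```python
# minimal sketch of inverse distance weighting (idw) interpolation, as used for the
# influenza incidence maps described above; coordinates and rates are hypothetical.
import numpy as np

def idw_interpolate(xy_known, values, xy_query, power=2.0, eps=1e-12):
    """estimate values at query points as distance-weighted averages of known points."""
    d = np.linalg.norm(xy_query[:, None, :] - xy_known[None, :, :], axis=2)  # pairwise distances
    w = 1.0 / np.maximum(d, eps) ** power                                    # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)                                        # normalise weights per query point
    return w @ values

towns = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])  # hypothetical town locations
rates = np.array([12.0, 30.0, 5.0, 18.0])                               # weekly incidence per 100,000 (hypothetical)
grid = np.array([[5.0, 5.0], [2.0, 8.0]])                               # points of the continuous map
print(idw_interpolate(towns, rates, grid, power=2.0))
```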
the relation between prescriptions of antibiotics and patients with risk factors for a complicated course was assessed by means of multivariate logistic regression. results: in episodes of acute bronchitis antibiotics were more readily prescribed to patients aged years or older. in exacerbations of copd or asthma gps favoured antibiotics in male patients and when diabetes, neurological disease or dementia was present. conclusion: gps do not take all high-risk conditions into account when prescribing antibiotics to patients with lrti, despite recommendations of national guidelines.

abstract background: the putative association between antidepressant treatment and increased suicidal behaviour has been under debate. objectives: to estimate the risk of suicide, attempted suicide, and overall mortality during antidepressant treatments. design and methods: the study cohort consisted of all subjects without non-affective psychosis who were hospitalized due to a suicide attempt during the years - and followed up using nationwide registers. the main outcomes were completed suicides, attempted suicides, and mortality. the main explanatory variable was antidepressant usage. results: suicides, suicide attempts and deaths were observed. when the effect of background variables was taken into account, the risk of suicide attempts was increased markedly during antidepressant treatment (rr for selective serotonin reuptake inhibitors or ssri . , . - . ) compared with no antidepressants. however, the risk of completed suicides was not increased. a lower mortality was observed during ssri use (rr . , . - . ), which was mainly attributable to a decrease in cardiovascular deaths. conclusion and discussion: in this high-risk suicidal cohort the use of any antidepressant is associated with an increased risk of suicide attempts, but not with an increased risk of completed suicide. antidepressant and, especially, ssri use is associated with a substantial decrease in cardiovascular deaths and overall mortality.

abstract background: the quattro study is an rct on the effectiveness of intensified preventive care in primary health care centres in deprived neighbourhoods. additional qualitative research on the execution of interventions in primary care was considered necessary to explain differences in effectiveness. objectives: our question was: can we understand rct outcomes better with qualitative research? design and methods: an ethnographic design was used. we observed researchers in their daily work for months, days a week, and practice nurses for days each. two other practice nurses were interviewed. all transcribed observations were analysed thematically. results: the rct showed differences in effectiveness among the centres and indicated that intensified preventive care provided no additional effect compared to structural physical measurements. ethnographic results show that these differences are due to variations in the execution of the intervention among the centres. conclusion: ethnographic analysis showed that differences in the execution of the intervention lead to differences in rct outcomes. the rct conclusion of 'no additional effect' is therefore problematic. discussion: as variations in primary care influence an rct's execution, they create methodological problems for research. to what extent can additional qualitative research improve rct research?

abstract background: acute myocardial infarction is the most important cause of morbidity from ischemic heart disease (ihd) and is the leading cause of death in the western world.
objectives: to assess the benefits and harms of 'dan shen' compound for acute myocardial infarction. methods: we searched the cochrane controlled trials register on the cochrane library, medline, embase, the chinese biomedical database and the chinese cochrane centre controlled trials register. we included randomized controlled studies lasting at least days. main results: eleven studies with participants in total were included. seven studies compared mortality between routine treatment plus 'dan shen' compound and routine treatment alone. one trial compared arrhythmia between routine treatment plus 'dan shen' compound injection and routine treatment alone. two trials compared revascularization between routine treatment plus 'dan shen' compound injection and routine treatment alone. conclusions: evidence is insufficient to recommend the routine use of 'dan shen' compound because of the small number of included studies and their low quality. no well-designed randomized controlled trials with adequate power to provide a more definitive answer have been conducted. in addition, the safety of 'dan shen' compound is unproven, though adverse events were rarely reported.

abstract antimicrobial resistance is emerging. to identify the scope of this threat and to be able to take proper actions and evaluate them, monitoring is essential. the remit of earss is to maintain a comprehensive surveillance system that provides comparable and validated data on the prevalence and spread of major invasive bacteria with clinically and epidemiologically relevant antimicrobial resistance in europe. since , earss has collected routine antimicrobial susceptibility test (ast) results of invasive isolates of five indicator bacteria, tested according to standard protocols. in , ast results for , isolates were provided by laboratories, serving hospitals, covering million inhabitants in countries. denominator information was collected through a biannual questionnaire. the quality of the ast results of laboratories was evaluated by the yearly external quality assessment. currently, earss includes all member and candidate states ( ) of the european union, plus israel, bosnia, bulgaria and turkey. participating hospitals treat a wide range of patients and laboratory results are of sufficient validity. earss identified antimicrobial resistance time trends and found a steady increase for most pathogen-compound combinations. in conclusion, earss is a comprehensive system of sufficient quality to show that antimicrobial resistance is increasing in europe and threatens health-care outcomes.

abstract introduction: since chloroform was detected in drinking water, a growing number of studies have sought to identify the presence of trihalomethanes (thms) in drinking water, as well as to establish the possible effects they may have on population health. objectives: to determine thm levels in the water distribution network of the city of valencia. design and methods: over a one-year period, points of the drinking water distribution network were sampled at week intervals. the concentration of these pollutants was determined by gas chromatography. results: our results showed greater concentrations of the species substituted by chlorine and bromine atoms (dichlorobromomethane and dibromochloromethane), in the range of - µg/l for both, - µg/l for trichloromethane and between - µg/l for tribromomethane. an increase in thm concentration was observed at those points near the sea, although concentrations did not exceed the legal limit of µg/l.
conclusion: we established two areas of concentration of these species in water: high and average, according to their proximity to the sea. abstract background: childhood cancer survivors are known to be at increased risk for second malignancies. objectives: we studied longterm risk of second malignancy in -year survivors, according to therapy and follow-up interval. methods: the risk of second malignancies was assessed in -year survivors of childhood cancer treated in the emma children's hospital amc in amsterdam and compared with incidence in the general population of the netherlands. complete follow-up till at least january was obtained for . % of the patients. the median follow-up time was . . results: sixty second malignancies were observed against . expected, yielding a standardized incidence ratio (sir) of ). the absolute excess risk (aer) was . per , persons per year. the sir appeared to stabilize after years of follow-up, but the absolute excess risk increased with longer follow-up (aer follow-up > = years of . ). patients who were treated with radiotherapy experienced the greatest increase of risk. conclusions: in view of the quickly increasing background rate of cancer with ageing of the cohort, it is concerning that even after more than years of follow-up the sir is still increased, as is the absolute excess risk. the chek * delc germline variant has been shown to increase susceptibility for breast cancer and could have an impact on breast cancer survival. this study aimed to determine the proportion of chek * delc germline mutation carriers, and breast cancer survival and tumor characteristics, compared to non-carriers in an unselected (for family history) breast cancer cohort. women with invasive mammary carcinoma, aged < years and diagnosed in several dutch hospitals between and , were included. for all patients, paraffin embedded tissue blocks were collected for dna isolation (normal tissue), and subsequent mutation analyses, and tumor revision. in breast cancer patients, ( . %) chek * delc carriers were detected. chek * delc tumors characteristics, treatment and patient stage did not differ from those of non-carriers. chek * delc carriers had times increased risk of developing a second breast cancer compared to non-carriers. with a mean follow up of years, chek * delc carriers had worse recurrence free and breast cancer specific survival than non-carriers. in conclusion, this study indicates a worse breast cancer outcome in chek * -delc carriers compared to non-carriers. the extension of the presence of the chek * delc germline mutation warrants research into therapy interaction and possibly into screening of premenopausal breast cancer patients. abstract background: for primary or secondary prevention (e.g. myocardial infarction) hormone therapy (ht) is no longer recommended in postmenopausal women. however, physicians commonly prescribe ht to climacteric women as a treatment of hot flashes/night sweats. objective: to assess efficacy and adverse reactions of ht in climacteric women with hot flashes (including night sweats). methods: for our systematic review (sr), we searched databases (medline, embase, cochrane) for randomized controlled trials, other srs and meta-analyses, published to . the quality of the studies was assessed using checklists corresponding to the study type. results: we identified studies of good/excellent quality. they included predominantly caucasian women and lasted - months. 
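for the second-malignancy analysis described earlier in this passage, the standardized incidence ratio (sir) and absolute excess risk (aer) can be computed as sketched below; the counts and person-years are hypothetical placeholders, and the exact poisson confidence limits shown are one common choice.

```python
# minimal sketch: standardized incidence ratio (sir) and absolute excess risk (aer)
# for the second-malignancy analysis described earlier; counts are hypothetical
# because the original figures are not recoverable from this text.
from scipy.stats import chi2

observed = 60          # hypothetical number of second malignancies
expected = 12.3        # hypothetical expected number from population rates
person_years = 25000   # hypothetical follow-up time

sir = observed / expected
# exact poisson 95% confidence limits for the observed count, scaled by the expectation
lower = chi2.ppf(0.025, 2 * observed) / 2 / expected
upper = chi2.ppf(0.975, 2 * (observed + 1)) / 2 / expected
aer = (observed - expected) / person_years * 10000  # excess cases per 10,000 person-years

print(f"sir = {sir:.2f} (95% ci {lower:.2f}-{upper:.2f}), aer = {aer:.2f} per 10,000 person-years")
```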
in all studies, ht showed a reduction of - % in the number of hot flashes, which was significantly better than placebo. the most common adverse events of ht were uterine bleeding and breast pain/tenderness. cardiovascular diseases and neoplasms were reported only sporadically. conclusions: ht is highly effective in treating hot flashes in climacteric women. however, to assess serious adverse events longer studies (also including non-caucasian women) are needed, as there are only sparse data available.

abstract igf-i is an important growth factor, and has been associated with increased colorectal cancer risk in both prospective epidemiological and experimental studies. however, it is largely unknown which lifestyle factors are related to circulating levels of the igf-system. studies investigating the effect of isoflavones on the igf-system have thus far been conflicting. the purpose of this study was to evaluate the effects of isoflavones on the circulating igf-system in men with high colorectal cancer risk. we conducted a randomized, placebo-controlled, cross-over study on the effect of a -month isoflavone supplementation ( mg/day) on igf levels in men with a family history of colorectal cancer or a personal history of colorectal adenomas. the dropout rate was %, and all but men were more than % compliant. isoflavone supplementation did not significantly alter serum total igf-i (− . %; %ci: − . - . ) and igf binding protein (+ . %, %ci: − . - . ) concentrations. other covariables, e.g. dietary energy and macronutrient intake, physical activity, and body weight, are currently under evaluation. in conclusion, this study shows that a -month isoflavone supplementation has no effect on serum igf-system components in men with high colorectal cancer risk.

abstract background/objective: the eurociss (european cardiovascular indicators surveillance set) project, funded under the health monitoring programme of the european commission, aims at developing health indicators and recommendations for monitoring cardiovascular diseases (cvd). methods: prioritise cvd according to their importance in public health; identify morbidity and mortality indicators; develop recommendations for data collection and harmonization; describe data collection and validation procedures and discuss their comparability. population (geographical area, age, gender), methods (case definition, icd codes), procedures (record linkage, validation) and morbidity indicators (attack rate, incidence, case fatality) were collected by questionnaire. results: the main outcome was the inventory of acute myocardial infarction (ami) population-based registers in the european partner countries: some countries have no register, while others have regional registers, some of which are also national. registers differ in: icd codes (only ami or also acute and subacute ischemic forms), ages ( - , - , all), record linkage (probabilistic, personal identification number), calendar years, and validation (monica, esc/acc diagnostic criteria). these differences make morbidity indicators difficult to compare. conclusion: new diagnostic criteria led to a more exhaustive definition of myocardial necrosis as acute coronary syndrome (acs). given the high burden of ami/acs, efforts are needed to implement population-based registers in all countries. application of recommended indicators, validated through standardized methodology, will provide reliable, valid and comparable data.

abstract objective: the objective of this paper was to compare and discuss the use of odds ratios and prevalence ratios using real data with a complex sampling design.
method: we carried out a cross-sectional study using data obtained from a two-stage stratified cluster sample from a study conducted in - (n = , ). odds ratios and prevalence ratios were obtained by unconditional logistic regression and poisson regression, respectively, for later comparison using the stata statistical package (v. . ). confidence intervals and design effects were considered in the evaluation of the precision of estimates. two outcomes with different prevalences were evaluated: vaccination against influenza ( . %) and self-reported lung disease ( . %). results: in the high-prevalence scenario, the estimates obtained using prevalence ratios were more conservative and we found narrower confidence intervals. in the low-prevalence scenario, we found no important differences between the estimates and standard errors obtained using the two techniques. discussion: however, it is the researcher's task to choose which technique and measure to use for each set of data, since this choice must remain within the scope of epidemiology.

abstract background: in italy, coronary heart disease (chd) mortality has been falling since the s. objective: to examine how much of the fall between and could be attributed to trends in risk factors and in medical and surgical treatments. methods: a validated model was used to combine and analyse data on the uptake and effectiveness of cardiological treatments and on risk factor trends. published trials, meta-analyses, official statistics, longitudinal studies and surveys were the main data sources. results: chd mortality fell by % in men and % in women aged - ; , fewer deaths in . approximately half of the mortality fall was attributed to treatments in patients and half to population changes in risk factors: in men, mainly improvements in cholesterol ( %) and smoking ( %) rather than blood pressure ( %). in women, / of the mortality fall was attributable to improvements in cholesterol ( %) and blood pressure ( %), with adverse trends in smoking (− %). adverse trends were also seen in bmi (− % in both genders) and diabetes (− % in men; − . % in women). conclusion: half of the chd mortality fall was attributable to reductions in risk factors, principally cholesterol in men and women and smoking in men; in women, rising smoking rates generated substantial additional deaths. a comprehensive strategy promoting primary prevention is needed.

objective: to investigate the efficacy of neuraminidase inhibitors (ni) in post-exposure prophylaxis (pep), i.e. in persons who had contact with an influenza case. design and methods: we conducted a systematic electronic database review for the period between and . studies were selected and graded by two independent reviewers. the proportion of influenza-positive patients was chosen as the primary outcome. for all analyses fixed effect models were used. weighted relative risks (rr) and % confidence intervals (ci) were calculated on an intention-to-treat basis. results: randomized controlled trials (n= , ) were included in the analysis. zanamivir and oseltamivir were effective against infection with influenza (rr= . , % ci . - . and rr= . , % ci . - . , respectively). prophylactic efficacy was comparable in the subgroup of persons who had contact with an index case with lab-confirmed influenza ( studies, all ni, rr= . , % ci . - . ). conclusions: the available evidence suggests that ni are effective in the pep of influenza.
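the post-exposure prophylaxis review above pools trial-level relative risks with fixed-effect models; the sketch below shows a standard inverse-variance pooling on the log scale, with hypothetical per-trial relative risks and confidence limits standing in for the values that are not recoverable from this text.

```python
# minimal sketch of fixed-effect (inverse-variance) pooling of relative risks on the log
# scale, as in the post-exposure prophylaxis review above; trial values are hypothetical.
import numpy as np

rr = np.array([0.25, 0.40, 0.15, 0.32])        # hypothetical per-trial relative risks
ci_low = np.array([0.10, 0.22, 0.05, 0.18])    # hypothetical lower 95% limits
ci_high = np.array([0.62, 0.73, 0.45, 0.57])   # hypothetical upper 95% limits

log_rr = np.log(rr)
se = (np.log(ci_high) - np.log(ci_low)) / (2 * 1.96)  # se recovered from the ci width
w = 1.0 / se**2                                       # inverse-variance weights
pooled = np.sum(w * log_rr) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))
ci = np.exp([pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se])

print(f"pooled rr = {np.exp(pooled):.2f} (95% ci {ci[0]:.2f}-{ci[1]:.2f})")
```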
discussion: results have to be interpreted with caution when transferred into general medical practice because the study populations mainly included young and healthy adults without chronic diseases.

abstract an important risk factor for breast cancer, mammographic breast density (mbd), is inversely associated with reproductive factors (age at first childbirth and lactation). as pregnancy and lactation are highly correlated, whether this decline is induced by pregnancy or by lactation is still unclear. we hypothesize that lactation reduces mbd independently of age at first pregnancy and parity. a study was done on women in the third sub-cohort of the dom project who had complete data on lactation and dy; all had a child but varied in duration of lactation. multiple logistic regression analysis was done using dy (yes/no) as the outcome variable. explanatory variables added to the model were age, bmi, parity and age at first childbirth. a significant univariate relation was seen between lactation of the first child and dy, or . (ci % . ; . ). adjusted for the explanatory variables, the or changed to . (ci % . ; . ). lactation seems to contribute independently to the reduction of mbd over and above pregnancy itself. given the limitations of the dichotomous dy ratio scores, additional studies will address which component, glandular mass or fat tissue, is responsible for the observed relation; this will be measured from mammograms that are to be digitized.

abstract background: alcohol consumption is common, but little is known about whether drinking patterns vary across geographic regions. objectives: to examine potential disparities in alcohol consumption across census regions and urban, suburban, and rural areas of the united states. design and methods: the data source was the national epidemiologic survey on alcohol and related conditions, an in-person interview of approximately , adults. the prevalence of abstinence and, among drinkers, the prevalences of heavy and daily drinking were calculated by census region and metropolitan status. multivariate logistic regression analyses were conducted to test for differences in abstinence and per-drinker consumption after controlling for confounders. results: the odds of abstinence, heavy drinking, and daily drinking varied widely across geographic areas. additional analyses stratified by census region revealed that rural residents in the south and northeast, as well as urban residents in the northeast, had higher odds of abstinence. rural residents in the midwest had higher odds of heavy drinking. conclusion and discussion: heavy alcohol consumption is of particular concern among drinkers living in the rural areas of the united states, particularly the rural midwest. other nations should consider testing for similar differences as they implement policies to promote safe alcohol consumption.

abstract background: long-term exposure to particulate air pollution (pm) has been suggested to accelerate atherogenesis. objective: we examined the relationship between long-term exposure to traffic emissions and the degree of coronary artery calcification (cac), a measure of atherosclerosis. methods: in a population-based, cross-sectional study, distances between participants' home addresses and major roads were calculated with a geographic information system. annual mean pm . exposure at the residence was derived from a small-scale geostatistical model.
cac, assessed by electron-beam computed tomography, was modelled with linear regression by proximity to major roads, controlling for background pm . air pollution and individual-level risk factors. results: of the participants lived within m of a major road. background pm . ranged from . to . µg/m (mean . ). mean cac values were strongly dependent on age, sex and smoking status. reducing the distance to major roads by % leads to an increase in cac of . % ( %ci . - . %) in the unadjusted model and . % ( %ci − . to . ) in the adjusted model. stronger effects (adjusted model) were seen in men ( . %, %ci − . to . ) and male non-smokers ( . %, %ci − . to . ). conclusions: this study provides epidemiologic evidence that long-term exposure to traffic emissions is an important risk factor for coronary calcification.

abstract background: this polymorphism has been associated with risk factor levels and, in one study, with a reduced risk of acute myocardial infarction (ami). yet, the risk relation has not been confirmed. objectives: we investigated the role of this polymorphism in the occurrence of ami, coronary heart disease (chd) and stroke in healthy dutch women. design and methods: a case-cohort study in a prospective cohort of initially healthy dutch women, followed from until january st . results: we applied a cox proportional hazards model with an estimation procedure adapted for case-cohort designs. a lower ami (n= ) risk was found among carriers of the ala allele (n= ) compared with those with the more common pro pro genotype (hazard ratio= . ; % ci, . to . ). no relation was found for chd (n= ; hr . ; % ci, . - . ) or for stroke (n= ; hr . ; % ci, . - . ). in our data little evidence was found for a relation between pparg and risk factors. conclusion and discussion: this study shows that the pro ala polymorphism in the pparg gene is modestly related to a reduced risk of ami. no statistically significant relation was found for chd or stroke.

abstract background: pseudo cluster randomization was used in a services evaluation trial because individual randomization risked contamination and cluster randomization risked selection bias due to expected treatment arm preferences of the recruiting general practitioners (gps). gps were randomized into two groups. depending on this randomization, participants were randomized in majority to one study arm (intervention:control of : or : ). objectives: to evaluate the internal validity of pseudo cluster randomization. do gps have treatment arm preferences, and what is the effect on allocation concealment and selection bias? design and methods: we compared the baseline characteristics of participants to study selection bias. using a questionnaire, gps indicated their treatment arm preferences on a visual analogue scale (vas) and the allocation proportions they believed were used to allocate their patients over the treatment arms. results: gps preferred allocation to the intervention (vas . (sd . ); - : indicates strongly favoring the intervention arm). after recruitment, % of gps estimated that a randomization ratio of : was used. the participants showed no relevant differences at baseline. conclusion and discussion: gps profoundly preferred allocation to the intervention group. few indications of allocation disclosure or selection bias were found in the dutch easycare trial. pseudo cluster randomization proves to be a valid randomization method.
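the allocation scheme of the pseudo cluster randomization design concluded above can be sketched as follows: gps are first randomized into two groups, and each gp's patients are then allocated in majority to one arm. the 80:20 and 20:80 ratios used here are illustrative placeholders, since the original ratios are not recoverable from this text.

```python
# minimal sketch of pseudo cluster randomization as described above: gps are first
# randomized into two groups, then each gp's patients are allocated in majority to one
# arm; the 80:20 / 20:80 ratios are placeholders (the original ratios are not recoverable).
import random

def pseudo_cluster_randomize(gps, patients_per_gp, majority=0.8, seed=42):
    rng = random.Random(seed)
    shuffled = gps[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    gp_group = {gp: ("mostly_intervention" if i < half else "mostly_control")
                for i, gp in enumerate(shuffled)}
    allocation = {}
    for gp in gps:
        p_intervention = majority if gp_group[gp] == "mostly_intervention" else 1 - majority
        allocation[gp] = ["intervention" if rng.random() < p_intervention else "control"
                          for _ in range(patients_per_gp[gp])]
    return gp_group, allocation

gps = ["gp_a", "gp_b", "gp_c", "gp_d"]                      # hypothetical practices
patients_per_gp = {"gp_a": 10, "gp_b": 8, "gp_c": 12, "gp_d": 9}
groups, alloc = pseudo_cluster_randomize(gps, patients_per_gp)
print(groups)
print({gp: alloc[gp].count("intervention") for gp in gps})
```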
abstract background: epidemiological studies rely on self-reporting to acquire data on participants, although such data are often limited in reliability. the aim here is to assess nuclear magnetic resonance (nmr) based metabonomics for evaluation of self-reported data on paracetamol use. method: four in-depth -hour dietary recalls and two timed -hour urine collections were obtained for each participant in the intermap study. a mhz h nmr spectrum was acquired for each urine specimen (n = , ). training and test sets involved two strata, i.e., paracetamol metabolites yes or no in the urine spectra, selected from all population samples by a principal component analysis model. the partial least squares-discriminant analysis (pls-da) model based on a training set of samples was validated by test set (n = ). the model correctly predicted stratum for of samples ( %) after removal of outliers not fitting the model, sensitivity . %, specificity %. this model was used to predict paracetamol status in all intermap specimens. it identified participants ( . %) who underreported analgesic use, of whom underreported analgesic use in both clinical visits. conclusion: nmr-based metabonomics can be used as a tool to enhance reliability of self-reported data. abstract background: in patients with asthma, the decline in forced expiratory volume in one second (fev ) is accelerated compared with non-asthmatics. objective: to investigate long-term prognostic factors of fev change in asthmatics from the general population. methods: a cohort of asthmatics ( - years-old) was identified in the frame of the european community respiratory health survey ( / ), and followed up in / . spirometry was performed on both occasions. the annual fev decrease (?fev ) was analysed by multi-level regression models, according to age, sex, height, bmi, occupation, familiarity of asthma, hospitalization for asthma (baseline factors); cumulative time of inhaled corticosteroid (ics) use and annual weight gain during the follow-up; lifetime pack-years smoked. results: when adjusting for all covariates, ics use for > years significantly reduced ?fev , with respect to non-users, of . ( %ci: . - . ) ml/year. ?fev was . ( . - . ) ml/year lower in women than in men. it increased by . ( . - . ) ml/year for every additional year in patient age and by . ( . - . ) ml/year for every additional kg/year in the rate of weight gain. conclusion: long-term ics use (> years) seems to be associated with a reduced ?fev over a -year followup. body weight gain seems a crucial factor in determining lung function decrease in asthmatics. abstract background: effectiveness of screening can be predicted by episode sensitivity, which is estimated by interval cancers following a screen. full-field digital or cr plate mammography are increasingly introduced into mammography screening. objectives: to develop a design to compare performance and validity between screen-film and digital mammography in a breast cancer screening program. methods: interval cancer incidence was estimated by linking screening visits from - at an individual level to the files of the cancer registry in finland. these data were used to estimate the study size requirements for analyzing differences in episode sensitivity between screen-film and digital mammography in a randomized setting. results: the two-year cumulative incidence of interval cancers per screening visits was estimated to be . 
to allow the maximum acceptable difference in episode sensitivity between the screen-film and digital arms to be % ( % power, . significance level, : randomization ratio, % attendance rate), approximately women need to be invited. conclusion: only fairly large differences in episode sensitivity can be explored within a single randomized study. in order to reduce the degree of non-inferiority between screen-film and digital mammography, meta-analyses or pooled analyses with other randomized data are needed.

according to the literature, up to % of colorectal cancers worldwide are preventable by dietary change. however, the results of epidemiologic studies are not consistent across countries. the objective of the study is to evaluate the role of dietary nutrients in colorectal cancer risk in poland. the hospital-based case-control study was carried out in - . in total, histologically confirmed cancer cases and controls were recruited. adjustment for age, sex, education, marital status, multivitamin use, alcohol consumption, cigarette smoking, family history and energy consumption was done by a logistic regression model. the lowest tertile of daily intake in the control group was defined as the reference level. a lower colorectal cancer risk was found for high daily intake of dietary fiber (or = . ; %ci: . - . ) and vitamin e (or = . ; %ci: . - . ). on the other hand, an increased risk for high monosaccharide consumption was observed. the risk pattern was not changed after additional adjustment for physical activity and body mass index. the results of the present study support a protective role of dietary fiber and some antioxidative vitamins in the etiology of colorectal cancer. additionally, they suggest that high consumption of monosaccharides may lead to an elevated risk of the investigated cancers.

abstract assessment of nutrition is very difficult in every population, but in children there is the additional question of whether a child can properly recognize and recall the foods that have been eaten. the aim of this study was to assess whether a dietary recall administered to adolescents can be used in epidemiological studies on nutrition. subjects were children, - years old, and their caretakers. a -h recall was used to evaluate the children's nutrition. both the child and the caretaker were asked to recall all products, drinks and dishes eaten by the child during the day before the recall. the statistical analyses were done separately for each meal. we noticed statistically significant differences in the intake of energy and almost all nutrients from lunch. the observed spearman rank correlation coefficients between child and caretaker ranged from . for vitamin c up to . for the intake of carbohydrates. only calcium intake ( . vs. . mg/day) differentiated the groups for breakfast, and β-carotene for supper. the study showed that a recall with adolescents could be a helpful source of data for research at the population level. however, such data should not be used to examine the individual nutritional habits of children, especially since information about dinner can be biased.

abstract background: acute bronchitis is one of the most common diagnoses made by primary care physicians. in addition to antibiotics, chinese medicinal herbs may be a potential medicine of choice. objectives: this review aims to summarize the existing evidence on the comparative effectiveness and safety of chinese medicinal herbs for treating uncomplicated acute bronchitis.
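returning to the screening-design abstract above, a study-size requirement for comparing episode sensitivity between two arms can be approximated with a standard two-proportion power calculation, as in the sketch below; the proportions, power, alpha and attendance rate are placeholders rather than the study's parameters, and a full design would also account for the randomization ratio and the interval-cancer denominator.

```python
# minimal sketch of a two-proportion sample-size calculation in the spirit of the
# screening-design abstract above; the proportions, power and alpha are placeholders
# because the original values are not recoverable from this text.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_screen_film = 0.0015   # hypothetical two-year interval-cancer proportion, screen-film arm
p_digital = 0.0020       # hypothetical proportion in the digital arm (the difference to detect)

effect_size = proportion_effectsize(p_screen_film, p_digital)  # cohen's h
n_per_arm = NormalIndPower().solve_power(effect_size=effect_size,
                                         alpha=0.05, power=0.80,
                                         ratio=1.0, alternative='two-sided')
attendance = 0.80
invited_per_arm = n_per_arm / attendance  # inflate invitations for expected attendance
print(f"~{invited_per_arm:,.0f} women to invite per arm")
```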
methods: we searched the cochrane central register of controlled trials, medline, embase, chinese biomedical database and etc. we included randomised controlled trials comparing chinese medicinal herbs with placebo, antibiotics or other western medicine for treating uncomplicated acute bronchitis. at least two authors extracted data and assessed trial quality. main results: four trials reported the time to improvement of cough, fever, and rales; two trials reported the proportion of patients with improved signs and symptoms; thirteen trials analyzed the data of global assessments of improvement. one trial reported the adverse effect during treatment. conclusions: there is insufficient quality data to recommend the routine use of chinese herbs for acute bronchitis. the benefit found in individual studies and this systematic review could be due to publication bias and study design limitations. in addition, the safety of chinese herbs is unknown, though adverse events are rarely reported. design and methods: patients with a definite ms and classified as dead or alive at st january were included in this retrospective observational study. influence of demographic and clinical variables was assessed with kaplan meier and cox methods. standardised mortality ratios were computed to compare patients' mortality with the french general population. results: a total of patients were included ( men, women). the mean age at ms onset was +/) years and the mean follow-up duration was +/) years ( patients-years). by , deaths occurred ( per patients-years). male gender, progressive course, polysymptomatic onset and high relapse rate were related to a worse prognosis. ms did not increase the number of deaths in our cohort compared to the general french population ( expected), except for highly disabled patients ( observed, expected). conclusion: this study gave precise insights on mortality in multiple sclerosis in west france. mattress dust. methods: we performed nested case-control studies within ongoing birth cohort studies in germany, the netherlands, and sweden and selected approximately sensitised and non-sensitised children per country. we measured levels of bacterial endotoxin, ß( -> )-glucans, and fungal extracellular polysaccharides (eps) in dust samples collected on the children's mattresses. results: combined across countries, higher amounts of dust and higher endotoxin, ß( -> )-glucans, and eps loads of mattress dust were associated with a significantly decreased risk of sensitization to inhalant allergens, but not food allergens. after mutual adjustment, only the protective effect of the amount of mattress dust remained significant [odds ratio ( % confidence interval) . ( . - . )]. conclusion: higher amounts of mattress dust might decrease the risk of allergic sensitization to inhalant allergens. the effect might be partly attributable to endotoxin, ß( -> )-glucans, and eps. it is not possible to distinguish with certainty, which component relates to the effect, since microbial agents loads are highly correlated with amount of dust and with each other. abstract background: postmenopausal hormone therapy (ht) increases mammographic density, a strong breast cancer risk factor, but effects vary across women. objective: to investigate whether the effect of ht use on changes in mammographic density is modified by polymorphisms in the estrogen (esr ) and progesterone receptor (pgr) genes. 
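the multiple sclerosis mortality abstract above relies on kaplan-meier and cox methods; the following is a minimal sketch using the third-party lifelines package on a small, entirely hypothetical dataset (follow-up years, vital status, sex, progressive course).

```python
# minimal sketch of the kaplan-meier and cox analyses mentioned in the multiple
# sclerosis mortality abstract above, using the third-party lifelines package and a
# small hypothetical dataset (follow-up years, death indicator, sex, progressive course).
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

df = pd.DataFrame({
    "years":       [5, 12, 20, 7, 15, 25, 9, 18, 22, 3],
    "died":        [1,  0,  1, 1,  0,  1, 0,  1,  0, 1],
    "male":        [1,  0,  1, 0,  1,  0, 0,  1,  1, 0],
    "progressive": [1,  0,  1, 1,  0,  1, 0,  0,  1, 1],
})

km = KaplanMeierFitter()
km.fit(durations=df["years"], event_observed=df["died"])
print(km.median_survival_time_)

cox = CoxPHFitter()
cox.fit(df, duration_col="years", event_col="died")
print(cox.summary[["exp(coef)", "p"]])  # hazard ratios for male sex and progressive course
```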
design and methods: information on ht use, dna and two consecutive mammograms were obtained from ht users and never ht users of the dutch prospect-epic and the english epic-norfolk cohorts. mammographic density was assessed using a computer-assisted method. changes in density between mammograms before and during ht use were analyzed using linear regression. results: a difference in percent density change between ht users and never users was seen in women with the esr pvuii pp or pp genotype ( . %; p< . ), but not in those with the pp genotype ( . %; p = . ). similar effects were observed for the esr xbai and the pgr + g/a polymorphisms. the pgr progins polymorphism did not appear to make women more susceptible to the effects of ht use. discussion and conclusion: our results suggest that specific polymorphisms in the esr and pgr genes may make women more susceptible to the effects of ht use on mammographic density. abstract background: there is a paucity of data on the cancer risk of turkish migrant children in germany. objectives: to identify cancer cases of turkish origin in the german childhood cancer registry (gccr) and to compare the relative incidence of individual cancers among turkish and non-turkish children. design and methods: we used a name algorithm to identify children of turkish origin among the , cancer cases below years of age registered since . we calculated proportional cancer incidence ratios (pcir) stratified for sex and time period. results: the name algorithm performed well (high sensitivity and specificity), and turkish childhood cancers were identified. overall, the relative frequency of tumours among turkish and non-turkish children is similar. there are specific sites and cancers for which pcirs are different; these will be reported during the conference. conclusion: our study is the first to show differences in the relative frequency of cancers among turkish and non-turkish children in germany. discussion: case control studies could help to explain whether observed differences in the relative frequency of cancers are due to differences in genetic disposition, lifestyle or socio-economic status. mutations in the netherlands cohort study on diet and cancer. data from , participants, cases and , subcohort members were analysed from a follow-up period between . to . years after baseline. adjusted gender-specific incidence rate ratios (rr) and % confidence intervals (ci) were calculated over tertiles of folate intake in case-cohort analyses. high folate intake did not reduce overall colon cancer risk. however, in men only, it was inversely associated with apc[csymbol] colon tumours (rr . , % ci . - . for the highest versus the lowest tertile of folate intake), but positively associated with apc+ colon tumours (highest vs. lowest tertile: rr . , ci . - . ). folate intake was neither associated with overall rectum cancer risk, nor with rectum cancer when apc mutation status was accounted for. we observed opposite associations between folate intake and colon cancer risk with or without apc mutations in men, which may implicate a distinct mutated apc pathway mediated by folate intake in men. abstract background and objectives: ten years after completion of the first serum bank of the general population to evaluate the long-term effects of the national immunisation programme (nip) a new serum collection is desirable. 
the objective is to provide insight into age-specific estimates of immunity to childhood diseases and estimates of the incidence of infectious diseases with a frequently subclinical course. design and methods: a two-stage cluster sampling technique was used to draw a nationwide sample. in each of five geographic regions, eight municipalities were randomly selected proportionally to their size. within each municipality, an age-stratified sample of individuals ( - yr) will be drawn from the population register. in addition, eight municipalities with lower immunization coverage will be selected to obtain insight into the immune status of persons who often refuse vaccination on religious grounds. furthermore, oversampling of migrants will be performed to study whether their immune status is satisfactory. participants will be asked to fill in a questionnaire and to allow blood to be taken. extra blood will be taken for a genetic study. results and conclusion: the design of a population-based serum collection aimed at the establishment of a representative serum bank will be presented.

abstract background: during the last decade, the standard of diabetes care evolved to require more intensive management focussing on multiple cardiovascular risk factors. treatment decisions for lipid-lowering drugs should be based on cholesterol and blood pressure levels. objectives: to investigate the influence of hba c, blood pressure and cholesterol levels on subsequent intensification of lipid-lowering therapy between - . design and methods: we conducted a prospective cohort study including , type diabetes patients who had at least two consecutive annual visits to a diabetes nurse. treatment intensification was measured by comparing drug regimens per year, and was defined as initiation of a new drug class or a dose increase of an existing drug. results: between - , the prevalence of lipid-lowering drug use increased from % to %. rates of intensification of lipid-lowering therapy remained low in poorly controlled patients ( % to %; tc/hdl ratio > ). intensification of lipid-lowering therapy was only associated with the tc/hdl ratio (age-adjusted or = . ; %ci . - . ), and this association became slightly stronger over time. blood pressure was not found to be a predictor of the intensification of lipid-lowering therapy (or = . ). conclusion: hypercholesterolemia management intensified between - , but therapy intensification was only triggered by elevated cholesterol levels. more attention to multifactorial risk assessment is needed.

abstract background: there are no standard severity measures that can classify the range of illness and disease seen in general practice. objectives: to validate new scales of morbidity severity against age, gender, deprivation and poor physical function. design and methods: in a cross-sectional design, morbidity data for consulters in a -month period were linked to their physical function status. there were english older consulters ( years +) and dutch consulters ( years +). consulters for morbidities classified on four gp-defined ordinal scales of severity ('chronicity', 'time course', 'health care use' and 'patient impact on activities of daily living') were compared to consulters for morbidities other than these, by age group, gender, and dichotomised deprivation and physical function scores. results: for both countries, on all scales, there was an increasing association between morbidity severity and older ages, female gender, more deprivation (minimum p< .
) and poor physical function (all trends p< . ). the estimates for the categories within, for example, the 'chronicity' scale were ordered as follows: 'acute' (unadjusted odds ratio . ), 'acute-on-chronic' ( . ), 'chronic' ( . ) and 'life-threatening' ( . ). conclusions: new validated measures of morbidity severity indicate physical health status and offer the potential to optimise general practice care.

hospitalization or death. calibration and discriminative capacity were estimated. results: among episodes of lrti in elderly patients with dm, endpoints occurred (attack rate %). the reliability of the model was good (goodness-of-fit test p = . ). the discriminative properties of the original rule were acceptable (area under the receiver-operating curve (auc): . , % ci: . to . ). conclusion: the prediction rule for the probability of hospitalization or death derived from an unselected elderly population with lrti appeared to have acceptable discriminative properties in diabetes patients and can be used to target management of these common diseases.

confounding by indication is a major threat to the validity of nonrandomized studies on treatment effects. we quantified such confounding in a cohort study on the effect of statin therapy on acute respiratory disease (ard) during influenza epidemics in the umc utrecht general practitioner research database among persons aged >= years. the primary endpoint was a composite of pneumonia or prednisolone-treated ard during epidemic, non-epidemic and summer seasons. to quantify confounding, we obtained unadjusted and adjusted estimates of associations for outcome and control events. in all, , persons provided , person-periods; statin therapy was used in . %, and in , person-periods an outcome event occurred. without adjustments, statin therapy was not associated with the primary endpoint during influenza epidemics (relative risk [rr] . ; % confidence interval [ %ci]: . - . ). after applying multivariable generalized estimating equations (gee) and propensity score analysis, the rrs were . ( % ci: . - . ) and . ( % ci: . - . ). the findings were consistent across relevant strata. in non-epidemic influenza and summer seasons the rr approached . , while statin therapy was not associated with control event rates. the observed confounding in the association between statin therapy and acute respiratory outcomes during influenza epidemics masked a potential benefit of more than %.

abstract background: despite several advances in the treatment of schizophrenia, the currently available pharmacotherapy does not change the course of illness or prevent functional deterioration in a substantial number of patients. therefore, research efforts into alternative or adjuvant treatment options are needed. in this project, called the 'aspirine trial', we investigate the effect of the anti-inflammatory drug acetylsalicylic acid as an add-on to regular antipsychotic therapy on the symptoms of schizophrenia. objectives: the objective is to study the efficacy of acetylsalicylic acid in schizophrenia on positive and negative psychotic symptoms, immune parameters and cognitive functions. design and methods: a randomized, placebo-controlled, double-blind add-on trial of inpatients and outpatients with schizophrenia, schizophreniform or schizoaffective disorder is being performed. patients are randomized : to either months of mg acetylsalicylic acid per day or months of placebo, in addition to their regular antipsychotic treatment. all patients receive pantoprazole treatment for gastroprotection.
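the confounding-by-indication analysis above adjusts statin effects with generalized estimating equations and propensity scores; the sketch below illustrates one of these approaches, inverse probability of treatment weighting based on a fitted propensity score, on simulated data in which treatment assignment depends on the same covariates as the outcome.

```python
# minimal sketch of a propensity-score adjustment (inverse probability of treatment
# weighting) in the spirit of the statin / influenza analysis above; data are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(75, 6, n)
comorbidity = rng.binomial(1, 0.4, n)
# treatment assignment depends on covariates (this is the confounding by indication)
p_treat = 1 / (1 + np.exp(-(-6 + 0.06 * age + 0.8 * comorbidity)))
treated = rng.binomial(1, p_treat)
# outcome depends on covariates and (protectively) on treatment
p_out = 1 / (1 + np.exp(-(-7 + 0.07 * age + 0.9 * comorbidity - 0.5 * treated)))
outcome = rng.binomial(1, p_out)

X = np.column_stack([age, comorbidity])
ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))  # iptw weights

def weighted_risk(mask):
    return np.sum(w[mask] * outcome[mask]) / np.sum(w[mask])

crude_rr = outcome[treated == 1].mean() / outcome[treated == 0].mean()
iptw_rr = weighted_risk(treated == 1) / weighted_risk(treated == 0)
print(f"crude rr {crude_rr:.2f}, iptw-adjusted rr {iptw_rr:.2f}")
```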
participants are recruited from various major psychiatric hospitals in the netherlands. the outcomes of this study are -month change in psychotic and negative symptom severity, cognitive function, and several immunological parameters. status: around participants have been randomized; no interim analysis was planned.

abstract background: congenital cmv infection is the most prevalent congenital infection worldwide. epidemiology and outcome are known to vary with socio-economic background, but few data are available on epidemiology and outcome in a developing country, where the overall burden of infectious diseases is high. objective: to determine the prevalence, risk factors and outcome of congenital cmv infection in an environment with a high infectious disease burden. methods: as part of an ongoing birth cohort study, baby and maternal samples were collected at birth and tested with an in-house pcr for the presence of cmv. standardised clinical assessments were performed by a paediatrician. placental malaria was also assessed. follow-up is ongoing till the age of years. preliminary results: the prevalence of congenital cmv infection was / ( . %). the infected children were more often first-born babies ( . % vs . %, p< . ). while no seasonality was observed, placental malaria was more prevalent among congenitally infected children ( . % vs . %, p = . ). no symptomatic babies were detected. conclusion: this prevalence of congenital cmv is much higher than reported in industrialised countries, in the absence of obvious clinical pathology. further follow-up is needed to assess the impact on response to vaccinations, growth, and morbidities.

of wheeze or cough at night in the first years. data on respiratory symptoms and dda were collected by yearly questionnaires. in total, symptomatic children with and without an early dda were included in the study population. results: fifty-one percent of the children with, and % of the children without, an early dda had persistent respiratory symptoms at age . persistence of symptoms was associated with parental atopy, eczema, nose symptoms without a cold, or a combination of wheeze and cough in the first years. conclusions: monitoring the course of symptoms in children with risk factors for persistent symptoms, irrespective of a diagnosis of asthma, may contribute to early recognition and treatment of asthma.

little is known about the response mechanisms of survivors of disasters. objective: to examine selective non-response and to investigate whether attrition has biased the prevalence estimates among survivors of a disaster. design and methods: a longitudinal study was performed after the explosion of a fireworks depot in enschede, the netherlands. survivors completed a questionnaire weeks (t ), months (t ) and years post-disaster (t ). prevalence estimates resulting from multiple imputation were compared with estimates resulting from complete case analysis. results: non-response differed between native dutch and non-western immigrant survivors. for example, native dutch survivors who participated at t only were more likely to have health problems such as depression at t than native dutch who participated at all three waves (or = . , % ci: . - . ). in contrast, immigrants who participated at t only were less likely to have depression at t (or = . , % ci: . - . ). conclusion and discussion: among native dutch survivors, the imputed estimates of t health problems tended to be lower than the complete case estimates.
the imputed t estimates among immigrants were unaffected or somewhat higher than the complete case estimates. multiple imputation is a useful statistical technique to examine whether selective non-response has biased prevalence estimates.

background: several epidemiologic studies have shown a decreased colon cancer risk in physically active individuals. objectives: this review provides an update of the epidemiologic evidence for the association between physical activity and colon cancer risk. we also explored whether study quality explains discrepancies in results between different studies. methods: we included cohort (male n = ; female n = ) and case-control studies (male n = ; female n = ) that assessed total or leisure time activities in relation to colon cancer risk. we developed a specific methodological quality scoring system for this review. due to the large heterogeneity between studies, we refrained from statistical pooling. results: in males, the cohort and case-control studies lead to different conclusions: the case-control studies provide strong evidence for a decreased colon cancer risk in the physically active, while the evidence in the cohort studies is inconclusive. these discrepant findings can be attributed to either misclassification bias in cohort studies or selection bias in case-control studies. in females, the small number of high-quality cohort studies precludes a conclusion, and the case-control studies indicate an inverse association. conclusion: this review indicates a possible association between physical activity and reduced colon cancer risk in both sexes, but the evidence is not yet convincing.

abstract background/objectives: radiotherapy after lumpectomy is commonly applied to reduce recurrence of breast cancer but may cause acute and late side effects. we determined predictive factors for the development of late toxicity in a prospective study of breast cancer patients. methods: late toxicity was assessed using the rtog/eortc classification among women receiving radiotherapy following lumpectomy, after a mean follow-up time of months. predictors of late toxicity were modelled using cox regression in relation to observation time, adjusting for age, bmi and the maximum biologically effective dose at the skin. results: ( . %) patients presented with telangiectasia and ( . %) patients with fibrosis. we observed a strong association between the development of telangiectasia and fibrosis (p< . ). increasing patient age was a risk factor for telangiectasia and fibrosis (p for trend . and . , respectively). boost therapy (hazard ratio (hr) . , % ci . - . ) and acute skin toxicity (hr . , % ci . - . ) significantly increased the risk of telangiectasia. the risk of fibrosis was elevated among patients with atopic diseases (hr . , % ci . - . ). discussion: our study revealed several risk factors for late complications of radiotherapy. further understanding of differences in response to irradiation may enable individualized treatment and improve cosmetic outcome.

doctor-diagnosed asthma and respiratory symptoms (age ) were available for (rint) and (no) children. results: the discriminative capacities of rint and exhaled no were statistically significant for the prediction of doctor-diagnosed asthma, wheeze (rint only) and shortness of breath (rint only). due to the low prevalence of disease in this general population sample, the positive predictive values of both individual tests were low.
however, the positive predictive value of the combination of increased rint (cutoff . kpa.l- .second) and exhaled no (cut-off ppb) was % for the prediction of doctor-diagnosed asthma, with a negative predictive value of %. combinations of rint or exhaled no with atopy of the child showed similar results. conclusions: the combination of rint, exhaled no and atopy may be useful to identify high-risk children, for monitoring the course of their symptoms and to facilitate early detection of asthma. abstract background: in a cargo aircraft crashed into apartment buildings in amsterdam, killing people, and destroying apartments. an extensive, troublesome aftermath followed with rumours on toxic exposures and health consequences. objectives: we studied the long-term physical health effects of occupational exposure to this disaster among professional assistance workers. design and methods: in this historical cohort study we compared the firefighters and police officers who were occupationally exposed to this disaster (i.e. who reported one or more disasterrelated tasks) with their nonexposed colleagues (n = , and n = , respectively), using regression models adjusted for background characteristics. data collection took place from january to march , and included various clinical blood and urine parameters (including blood count and kidney function), and questionnaire data on occupational exposure, physical symptoms, and background characteristics. the overall response rate was %. results: exposed workers reported various physical symptoms (including fatigue, skin and musculoskeletal symptoms) significantly more often than their nonexposed colleagues. in contrast, no consistent significant differences between exposed and nonexposed workers were found regarding clinical blood and urine parameters. discussion and conclusion: this epidemiological study demonstrates that professional assistance workers involved in a disaster are at risk for long-term unexplained physical symptoms. abstract background and objectives: recent studies indicate that women with cosmetic breast implants have significantly increased risk of suicide. reasons for elevated risk are not known. it is suggested that women with cosmetic breast implants differ in their characteristics and have more mental problems than women of general population. aim of this study was to find out possible associations between physical or mental health and postoperative quality of life among finnish women with cosmetic breast implants. design and methods: information was collected from patient records of women and structured questionnaires mailed to women of the same cohort. data was analysed by using pearson chi square testing and logistic regression modelling. results: although effects of implantation on postoperative quality of life in different areas were mainly reported as positive or neutral, % of the women reported decreased state of health. postoperative dissatisfaction and decreased quality of life were significantly associated with diagnoses of depression (p = . ) and local complication called capsular contracture (p< . ). conclusion: our results are consistent with previous results finding most of the cosmetic breast surgery patients satisfied after implantation. however, this study brings new information on associations between depression, capsular contracture and decreased quality of life. abstract cancer and its treatments often produce significant persistent morbidities that reduce quality of life (qol) in cancer survivors. 
research indicates that both, physical exercise and psycho-education might enhance qol. therefore, we developed a -week multidisciplinary rehabilitation program that combines physical training with psycho-education. the aim of the present multicenter study is to determine the effect of multidisciplinary rehabilitation on qol as compared to no treatment and, additionally, to physical training alone. furthermore, we will explore which variables are related to successful outcome (socio-demographic, disease related, physiological, psychological and environmental characteristics). participants are needed to detect a medium effect. at present, cancer survivors are randomised to either the multidisciplinary or physical rehabilitation program or a -month waiting list control group. outcome assessment will take place before, halfway, directly after, and months following the intervention by means of questionnaires. physical activity will be measured before, halfway and directly after rehabilitation using maximal and submaximal cycle ergometer testing and muscle strength measurement. effectiveness of multidisciplinary rehabilitation will be determined by analysing changes between groups from baseline to post-intervention using multiple linear and logistic regression. positive evaluation of multidisciplinary rehabilitation may lead to implementation in usual care. continuous event recorders (cer) have proven to be successful in diagnosing causes of palpitations but may affect patient qol and increase anxiety. objectives: determine qol and anxiety in patients presenting with palpitations, and to evaluate the burden of the cer on qol and anxiety in patients presenting to the general practitioner. methods: randomized clinical trial in general practice. the short form- (sf- ) and state-trait anxiety inventory (stai) were administered at study inclusion, -weeks and months. results: at baseline, patients with palpitations (n = ) reported lower qol and more anxiety than a healthy population for both males and females. there were no differences between the cer arm and usual gp care at -weeks. at -months the usual care group (n = ) showed minimal qol improvement and less anxiety compared to the cer group (n = ). type of diagnosis did not account for any of these reported differences. conclusion: anxiety decreases and qol increases in both groups at -weeks and -month follow-up. hence it is a safe and effective diagnostic tool, which is applicable for all patients with palpitations in the general practice. abstract background: clinical benefits of statin therapy are accepted, but their safety profiles have been under scrutiny, particularly for the most recently introduced statin, rosuvastatin, relating to serious adverse events involving muscle, kidney and liver. objective: to study the association between statin use and the incidence of hospitalizations for rhabdomyolysis, myopathy, acute renal failure and hepatic impairment (outcome events) in real life. methods: in and , , incident rosuvastatin users, , incident other statin users and , patients without statin prescriptions from the pharmo database of > million dutch residents were included in a retrospective cohort study. potential cases of hospitalization for myopathy, rhabdomyolysis, acute renal failure or hepatic impairment were identified using icd- -cm codes and validated using hospital records. 
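claims-based outcome identification of the kind described above is usually implemented by matching discharge diagnosis codes against predefined icd-9-cm code lists. the sketch below illustrates the idea; the column names and code lists are assumptions for illustration, not the pharmo database schema.

# sketch: flag hospitalizations whose discharge diagnoses match outcome code prefixes
import pandas as pd

OUTCOME_CODE_PREFIXES = {
    "rhabdomyolysis": ("728.88",),
    "myopathy": ("359.4", "359.8", "359.9"),
    "acute_renal_failure": ("584",),
    "hepatic_impairment": ("570", "573.3"),
}

hosp = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "dx_codes": [["584.9", "401.1"], ["715.9"], ["728.88"]],
})

def flag_outcomes(dx_codes: list[str]) -> list[str]:
    """return the outcome labels whose code prefixes match any discharge code."""
    hits = []
    for outcome, prefixes in OUTCOME_CODE_PREFIXES.items():
        if any(code.startswith(p) for code in dx_codes for p in prefixes):
            hits.append(outcome)
    return hits

hosp["potential_outcomes"] = hosp["dx_codes"].apply(flag_outcomes)
print(hosp[["patient_id", "potential_outcomes"]])
# in the study, such potential cases were then validated against hospital records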
results: there were validated outcome events in the three cohorts, including one case each of myopathy (other statin group) and rhabdomyolysis (non-treated group). there were no significant differences in the incidence of outcome events between rosuvastatin and other statin users. discussion: this study indicated that the number of outcome events is fewer than per person-years. rosuvastatin does not lead to an increased incidence of rhabdomyolysis, myopathy, acute renal failure or hepatic impairment compared to other statins.

aim: to assess the influence of insulin resistance (ir) on the occurrence of coronary artery disease (cad) in middle-aged women with normal glucose tolerance (ngt). material and methods: in - , women aged - years, participants of the polish multicenter study on diabetes epidemiology, were examined. anthropometric, biochemical (fasting lipids, fasting and post-glucose-load plasma glucose and insulin) and blood pressure determinations were performed. ir was defined as the matsuda index (ir-matsuda) below the lower quartile of the ir-matsuda distribution in the ngt population. a questionnaire examination of lifestyle and of present and past diseases was performed. results: ir was observed in % of all examined women and in . % of those with ngt. cad was diagnosed in , % of all examined women and in , % of those with ngt. the relative risk of cad related to ir in ngt and normotensive women was , ( % ci: , - , ) (p< . ). regular menstruation was observed in , % of cad women. ir-matsuda was not different for cad menstruating and non-menstruating women (respectively , ± , and , ± , ). conclusion: in middle-aged, normotensive women with normal glucose tolerance, ir seems to be an important risk factor for cad.

abstract background: in germany, primary prevention at population level is provided by general practitioners (gp). little is known about gps' strategies to identify patients at high risk for vascular diseases using standardised risk scores. objectives: we studied gp attitudes and current practice in using risk scores. methods: a cross-sectional survey was conducted among gps in north rhine-westphalia, germany, using mailed self-administered questionnaires on attitudes and current practice in the identification of patients at high risk for vascular diseases. results: in , gps participated in the study. . % of gps stated that they knew the framingham score, . % the procam score and . % the score score. . % of gps reported regular use of standardised risk scores to identify patients at high risk for vascular diseases, most frequently the procam score ( . %), followed by the score score ( . %) and the framingham score ( . %). main reasons for not using standardised risk scores were the assumed rigid assessment of individual patients' risk profiles ( . %), time-consuming application ( . %) and higher confidence in one's own work experience ( . %). conclusion: use of standardised risk scores to identify patients at high risk for vascular diseases is common among gps in germany. however, more educational work might be useful to strengthen gps' belief in the flexible application of standardised risk scores in medical practice.

birth rate is reported to be lower among epilepsy patients than in the general population, but the effects of specific antiepileptic drugs on birth rate are not well known. objectives: to estimate birth rate in epilepsy patients on aed treatment or without aeds and in a population-based reference cohort without epilepsy.
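the birth-rate comparison whose methods follow is naturally expressed as a poisson regression of birth counts with person-years as an offset, so that exponentiated coefficients are rate ratios. a minimal sketch with invented data, not the finnish register data:

# sketch: poisson rate model with a person-time offset
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "group": ["reference", "reference", "aed_treated", "aed_treated",
              "untreated", "untreated"],
    "sex": ["f", "m", "f", "m", "f", "m"],
    "births": [260, 240, 120, 95, 60, 48],
    "person_years": [9500.0, 9300.0, 4100.0, 3900.0, 2100.0, 2000.0],
})

model = smf.glm(
    "births ~ C(group, Treatment(reference='reference')) + sex",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["person_years"]),
).fit()
# exponentiated coefficients are birth-rate ratios relative to the reference cohort
print(np.exp(model.params))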
design and methods: patients (n = , ) with reimbursement for aeds for the first time between and and information on their aed use, were identified from the databases of social insurance institution of finland. reference cohort without epilepsy (n = , ) and information on live births were identified from the finnish population register centre. the analyses were performed using poisson regression modelling. results:birth rate was decreased in epilepsy patients in relation to reference cohort without epilepsy in both genders regardless of aed use. in relation to untreated patients, women on any of the aeds had non-significantly lower birth rates. among men, birth rate was decreased in men on oxcarbazepine (rr = . , % ci = . , . ), but was not clearly lower among those on carbamazepine (rr = . , % ci = . , . ) or valproate (rr = . , % ci = . , . ) when compared to untreated patients. conclusion: our results suggest that birth rate is decreased among epilepsy patients on aeds, more so in men. abstract background: hereditary hemochromatosis (hh), characterised by excessive iron absorption, subsequent iron storage and progressive clinical features, can when diagnosed at an early stage be successfully treated. high prevalence of the c y-mutation on the hfe-gene in the hh patient population may motivate genetic screening. objectives: in first-degree relatives of c y-homozygotes we studied the gender and age -related biochemical penetrance of hfe-genotype to define a high-risk population eligible for screening. design and methods: one-thousand-six first-degree family members of probands with clinically overt hfe-related hh from five medical centres in the netherlands were approached. data on levels of serum iron parameters and hfe-genotype were collected. elevation of serum ferritin was defined using the centre-specific normal-values by age and gender. results: among the participating relatives, highest serum iron parameters were found in male c y-homozygous siblings aged > years: % had elevated levels of serum ferritin. generally, male gender and increased age are related with higher iron values. discussion and conclusion: genetic screening for hh is most relevant in male and elderly first-degree relatives of patients with clinically overt hfe-related hh, enabling regular investigations of iron parameters in homozygous individuals. abstract background: nosocomial infection causes increased hospital morbidity and mortality rates. although handwashing is known to be the most important action in its prevention, adherence of health care workers to recommended hand hygiene procedures is extremely poor. objective: evaluation of compliance of hand hygiene recommendations in health care workers of a tertiary hospital in barcelona after a course on hand hygiene was given to all nurses in the hospital during the previous year. methods: by means of nondeclared observation, compliance (handwashing or disinfecting, not solely glove exchange) of recommendations given by the center for disease control related to opportunities for hand hygiene was registered, in procedures of diverse risk level for infection, both in physicians and nurses. results: in opportunities for hand hygiene carried out by health care workers compliance of recommendations was . %. adherence differed between wards ( . % in intensive care units, . % in medical wards and . % in surgical wards) and slightly between health care workers ( . % in physicians, . % in nurses). 
discussion: in conclusion, after one year of an intervention on education, adherence to hand hygiene recommendations is very low. these results underscore the need to reconsider the type of interventions implemented.

background: it is unclear which type of comorbidity affects qol most. objectives: we studied whether qol differed in subjects with dm with and without comorbidities. in addition, we determined differences by type of comorbidity. design and methods: cross-sectional data of dm patients, participants of a population-based dutch monitoring project on risk factors for chronic disease (morgen), were analyzed. qol was measured with the short form questionnaire. we compared the means of subdimensions for dm patients with one comorbidity (cardiovascular diseases (cvd), musculoskeletal diseases (msd) and asthma/copd) to dm patients without this comorbidity, using regression analyses adjusted for age and sex. results: the prevalences of cvd, msd and asthma/copd were . %, . %, and . %. all comorbidities were associated with lower qol, especially for physical functioning. the mean difference ( % ci) was .

abstract background: an increase in the extent of ueds has been suggested repeatedly, but the scientific literature has never before been systematically studied. objectives: a systematic appraisal of the worldwide incidence and prevalence rates of upper extremity disorders (ueds) available in the scientific literature was carried out to gauge the range of these estimates in various countries and to determine whether the rates are increasing over time. design and methods: studies that recruited at least people, collected data by using questionnaires, interviews and/or physical examinations, and reported incidence or prevalence rates of the whole upper extremity, including the neck, were included. results: no studies were found with regard to the incidence of ueds, and studies that reported prevalence rates of ueds were included. the point prevalence ranged from . - %; the -month prevalence ranged from . - %. one study reported on the lifetime prevalence ( %). we did not find evidence of a clear increasing or decreasing pattern over time. it was not possible to pool the data, because the definitions used for ueds differed enormously. conclusions: there are substantial differences in reported prevalence rates of ueds. the main reason for this is the absence of a universally accepted way of labelling or defining ueds.

abstract background: the absolute number of women diagnosed with breast cancer in the netherlands increased from , in to , in . likewise, the age-standardized rate increased from . to . per , women. besides the current screening programme, changes in risk profile could be a reason for the increased incidence. objective: we studied the changes in breast cancer risk factors for women in nijmegen. methods: in the regional screening programme in nijmegen, almost , women aged - years filled in a questionnaire about risk factors in - . similar questions were applied in the nijmegen biomedical study in , in which women aged - years participated. the median age in both studies was years. results: the frequency of a first-degree relative with breast cancer was . % and . % in and , respectively. none of the other risk factors, such as age at first birth ( . % respectively . %), nulliparity ( . % resp. . %), age at menarche ( . % resp. . %), age at menopause ( . % resp. . %) and obesity ( . % resp. . %), changed over time. conclusion: the distribution of risk factors hardly changed, and is unlikely to explain the rise in breast cancer incidence from onwards.
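the abstract above compares risk-factor frequencies between two survey rounds but does not state the test used; a common choice is a two-proportion z-test, sketched here with made-up counts rather than the nijmegen data.

# sketch: comparing a risk-factor frequency between two survey rounds
from statsmodels.stats.proportion import proportions_ztest

# e.g. women reporting a first-degree relative with breast cancer, per round
count = [130, 160]   # number reporting the risk factor in round 1, round 2
nobs = [1500, 1600]  # number of respondents in each round
z_stat, p_value = proportions_ztest(count, nobs)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")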
abstract background: a single electronic clinical history system has been developed in the bac (basque autonomous community) for general use for all health centres, thus making it possible to collect information online on acute health problems as well as chronic ailments. method: the prevalence of diabetes, high blood pressure and copd (chronic obstructive pulmonary disease) was estimated using icd- -cm diagnosis performed by primary care physicians. an estimate was also made of the prevalence of cholesterolemia based on the results of analyses requested by physicians. results: in , , patients (out of a total population of , , ) were assessed for serum cholesterol levels. based on this highly representative sample, it was estimated that . % had serum cholesterol levels above mg/dl. the prevalence of diabetes mellitus in people over the age of was . %. the prevalence of high blood pressure in people over was %. discussion: the primary care database makes it possible to access information on problems related to chronic illnesses. knowing the prevalence of diabetes patients enables doctors to analyse all aspects related to services used by the diabetic population. it also makes it possible to monitor analytical data in real time and evaluate health service outcomes. examinations were used to asses risk factors for diabetes. cases (n = ) were matched on age and sex to controls (n = ) who were not treated with antidiabetic drugs. logistic regression was used to calculate odds ratios (or). results: the or of incident diabetes for acei-use versus non-acei use was . ( %ci : . - . ). for ace dd homozygotes the or was . ( %ci: . - . ) and for ace-i allele carriers . ( %ci: . - . ). the interaction or was . ( %ci: . - . ). the agt and at r genotypes did not modify the association between acei use and diabetes. abstract background: lignans have antioxidant and estrogen-like activity, and may therefore lower cardiovascular and cancer risk. objective: we have investigated whether intake of four plant lignans (lariciresinol, pinoresinol, secoisolariciresinol, matairesinol) was inversely associated with coronary heart disease (chd), cardiovascular diseases (cvd), cancer, and all-cause mortality. design: the zutphen elderly study is a prospective cohort study in which men aged - y were followed for years. lignan intake was estimated using a recently developed database, and related to mortality using cox proportional hazards analysis. results: median total lignan intake in was lg/d. beverages such as tea and wine, vegetables, bread, and fruits were the major lignan sources. total lignan intake was not related to mortality. however, matairesinol was inversely associated with chd, cvd, cancer, and all-cause mortality. multivariate adjusted rrs ( % ci) per sd increase in intake were . ( . - . ) for chd, . ( . - . ) for cvd, . ( . - . ) for cancer, and . ( . - . ) for allcause mortality. conclusions: total lignan intake was not associated with mortality. the intake of matairesinol was inversely associated with mortality from chd, cvd, cancer, and all-causes. we can not rule out that this is due to an associated factor, such as wine consumption. abstract despite the drastic increase in the amount of research into neighbourhood-level contextual effects on health, studies contrasting these effects between different domains of health within one contextual setting are strikingly sparse. 
in this study we use multilevel logistic regression models to estimate the existence of neighbourhood-level variations of physical health functioning (pcs) and mental well-being (ghq) in the helsinki metropolitan area and assess the causes of these differences. the individual-level data are based on a health-survey of - year old employees of the city of helsinki (n = , response rate %). the metropolitan area is divided into neighbourhoods, which are characterised using a number of area-level indicators (e.g. unemployment rate). our results show moderate but systematic negative effect of indicators of neighbourhood deprivation on physical functioning, whereas for mental health the effect is absent. these effects were strongest for proportion of manual workers; odds ratio for poor physical functioning was . for respondents living in areas with low proportion of manual workers. part of this effect was mediated by differences in health behaviour. analyses on cross-level interactions show that individual-level socioeconomic differences in physical health are smallest in most deprived areas, somewhat contradicting the results of earlier studies. abstract background: the second-eye cataract surgery is beneficial, nevertheless, there is a considerable proportion of unmet needs. objective: to estimate the proportion of second-eye cataract surgery in the public health system of catalonia, and explore differences in utilisation by patients' gender, age, and region of residence. methods: a total of , senile cataract surgeries performed between and were included. proportions observed were adjusted through independent logarithmic regression models for each study factor. results: the proportion of second-eye surgery showed an increasing trend (r . %) from . % ( % ci . ; . ) in november to . % ( % ci . ; . ) in december , and its projection to years was , % ( % ci . ; . ). the proportion of second-eye surgery was % ( % ci . ; . ) greater in women than in men. patients years or older had a lowest proportion ( . %; % ic . ; . ), which nevertheless increased during the period, unlike that of patients aged less than years. differences among regions were moderate and decreased throughout the period. conclusions: if the observed trends persist, there will be a substantial proportion of unmet need for bilateral surgery. we predict greater use of second-eye surgery by older patients. abstract background: persistence with bisphosphonates is suboptimal which could limit prevention of fractures in daily practice. objectives: to investigate the effect of long term persistent bisphosphonate usage on the risk of osteoporotic fractures. methods: the pharmo database, including drug-dispensing and hospital discharge records for > two million subjects in the netherlands, was used to identify new female bisphosphonate users > years from jan ' -jun ' . persistence with bisphosphonates was determined using the method of catalan. a nested matched case-control study was performed. cases had a first hospitalization for an osteoporotic fracture (index-date). controls were matched : to cases on year of inclusion and received a random index-date. the association with fracturerisk was assessed for one and two year persistent bisphosphonate use prior to the index-date. analyses were adjusted for di fferences in patient characteristics. results: , bisphosphonate users were identified and had a hospitalization for osteoporotic fracture during follow-up. one year persistent bisphosphonate use resulted in a % lower fracture rate (or . 
; % ci . - . ) whereas two year persistent use resulted in a % lower rate (or . ; % ci . - . ). conclusion and discussion: these results emphasize the importance of persistent bisphosphonate usage to obtain maximal protective effect of treatment. abstract background: in the who recommended all countries to add hepatitis b (hbv) vaccination to their national immunization programs. the netherlands is a low hbv endemic country and therefore adopted a vaccination policy targeted towards high-risk groups. methods: during , epidemiological data and blood samples were collected from all reported patients with an acute hbv infection. a fragment of the s-gene was sequenced and phylogenetically analysed to clarify transmission patterns between risk groups. results: of hbv cases reported, % was infected through sexual contact ( % homo-/bisexual, % heterosexual). for patients samples were available for genotyping. phylogenetic analysis identified genotypes: a( %), b( %), c( %), d( %), e( %) and f( %). of men who have sex with men (msm), % were infected with genotype a. among heterosexuals, all genotypes were found. in many cases, genotypes b-f were direct or indirect related to countries abroad. only injecting drug user was found (genotype a). conclusion: genotype a is predominant in the netherlands, including most of the msm. migrant hbv carriers play an important role in the dutch hbv epidemic. genotyping provides insight into the spread of hbv among highrisk groups. this information will be used to evaluate the vaccination policy in the netherlands. abstract background: excess weight might affect the perception of both physical and mental health in women. objective: to examine the relationship between body mass index (bmi) and hrqol in women aged -to -year-old in a rural zone of galicia. design and methods: population-based cross-sectional study covering women, personally interviewed, from villages. hrqol was assessed with sf- questionnaire, through personal interviews. each scale of sf- was dichotomised in suboptimal or optimal hrqol using previously defined cut-offs. odds ratios (or) obtained from logistic regression summarize the relationship of bmi with each scale, adjusting for sociodemographic variables, sedentary leisuretime, number of chronic diseases and sleeping hours. results: a . % of women were obese (bmi = kg/m ) and . % overweight kg/m ) . frequency of suboptimal physical function was higher among overweight women (adjusted or: . ; % ci: . - . ) and obesity (adjusted or: . ; % ci: . - . ). furthermore, obese women had higher frequency of suboptimal scores on the general health scale (adjusted or: . ; % ci: . - . ). no differences were observed regarding mental health scores among women with different bmi categories. conclusion: in women from rural villages, overweight is associated with worse hrqol in physical function and general health. abstract background: pneumococcal vaccination among elderly is recommended in several western countries. objectives: we estimate the cost-effectiveness of a hypothetical vaccination campaign among the + general population in lazio region (italy). methods: a cohort was followed during a years timeframe. we estimated the incidence of invasive pneumococcal disease, in absence of vaccine, based on actual surveillance and hospital data. the avoided deaths and cases have been estimated from literature according to trial results. health expenditures included: costs of vaccine program, inpatient and some outpatient costs. 
cost-effectiveness was expressed as net healthcare costs per episode averted and per life-year gained (lyg) and was estimated at baseline and in deterministic and stochastic sensitivity analyses. all parameters were age-specific and varied according to literature data. results: at baseline, net costs per event averted and per lyg at prices were, respectively, e , ( % ci: e , -e , ) and e , ( % ci: e , -e , ). in the sensitivity analysis, bacteraemic pneumonia incidence and vaccine effectiveness increased the net cost per lyg by % and % in the worst-case scenario, and decreased it to e , in the best case. conclusions: the intervention was not cost saving. the uncertainties concerning invasive pneumococcal disease incidence and vaccine effectiveness make the cost-effectiveness estimates unstable.

abstract background: spatial data analysis can detect possible sources of heterogeneity in the spatial distribution of incidence and mortality of diseases. moreover, small-area studies have a greater capacity to detect local effects linked to environmental exposures. objective: to estimate the patterns of cancer mortality at the municipal level in spain using smoothing techniques in a single spatial model. design and methods: cases were deaths due to cancer, registered at the municipal level nation-wide for the period - . expected cases for each town were calculated using overall spanish mortality rates, and standardised mortality ratios were computed. to plot the maps, smoothed municipal relative risks were calculated using the besag, york and mollié model and markov chain monte carlo simulation methods. as an example, maps for stomach and lung cancer are shown. results: it was possible to obtain the posterior distribution of relative risk with a single spatial model including towns and their adjacencies. the maps showed distinct patterns for the two cancer sites. conclusion: the municipal atlas makes it possible to avoid local edge effects, improving the detection of spatial patterns. discussion: bayesian modelling is a very efficient way to detect spatial heterogeneity for cancer and other causes of death.

abstract background: little is known about the impact of socioeconomic status (ses) on outcomes of surgical care. objectives: we estimated the association between ses and outcomes of selected complex elective surgical procedures. methods: using the hospital discharge registries (icd-ix-cm codes) of milan, bologna, turin and rome, we identified patients undergoing cardiovascular operations (coronary artery bypass grafting, valve replacement, carotid endarterectomy, repair of unruptured thoracic aorta aneurysm) (n = , ) and cancer resections (pancreatectomy, oesophagectomy, liver resection, pneumonectomy, pelvic evisceration) (n = , ) in four italian cities, - . an area-based income index was calculated. post-operative mortality (in-hospital or within days) was the outcome. logistic regression was adjusted for gender, age, residence, comorbidities, and concurrent and previous surgeries. results: high-income patients were older and had fewer comorbidities. mortality varied by surgery type (cabg , %, valve , %, endarterectomy , %, aorta aneurysm , %, cancer . %). low-income patients were more likely to die after cabg (or = .

abstract background: an important medical problem of renal transplant patients who receive immunosuppression therapy is the development of a malignancy during long-term follow-up. however, existing studies are not in agreement over whether patients who undergo renal transplantation have an increased risk of melanoma.
objective: the aim of this study was to determine the incidence of melanoma in renal transplantation patients in the northern part of the netherlands. methods: we linked a cohort of patients who received a renal transplantation in the university medical centre groningen between and with the cancer registry of the comprehensive cancer centre north-netherlands, to identify all melanoma patients in this cohort. results: only patient developed a melanoma following the renal transplantation; no significant increase in the risk of melanoma was found. conclusion: although several epidemiologic studies have shown that the risk of melanoma is increased in renal transplantation patients who receive immunosuppression therapy to prevent allograft rejection, this increased risk was not found in the present study. the lower level of immunosuppressive agents given in the netherlands might be responsible for this low incidence. abstract background: socio-economic health inequalities are usually studied for self-reported income, although the validity of self-reports is uncertain. objectives: to compare self-reports of income by respondents to health surveys with their income according to tax registries, and determine to what extent choice of income measure influences the health-income relation. methods: around . respondents from the dutch permanent survey on living conditions were linked to data from dutch tax and housing registries of . both self-reported and registry-based measures of household equivalent income were calculated and divided into deciles. the association with less than good self-assessed health was studied using prevalence rates and odds ratios. results: around % of the respondents did not report their income. around % reported an income deciles lower or higher than the actual income value. the relation between income and health was influenced by choice of income measure. larger health inequalities were observed with selfreports compared to registry-based measures. while a linear healthincome relation was found using self-reported income, a curvilinear relation (with the worst health in the second lowest deciles) was observed for registry-based income. conclusion: choice of the income source has a major influence on the health-income relation that is found in inequality research. abstract background: while many health problems are known to affect immigrant groups more than the native dutch population, little is known about health differences within immigrant groups. objectives: to determine the association between self assessed health and socioeconomic status (ses) among people of turkish, moroccan, surinamese and antillean origin. methods: data were obtained from a social survey held among immigrants - years in the netherlands, with almost respondents per immigrant group. ses differences in the prevalence of 'poor' self-assessed health were measured using prevalence rate ratios estimated with regression techniques. results: within each immigrant group, poor health was much more common among those with low ses. the health of women was related to their educational level, occupational position, household income, financial situation and (to a lesser extent) their parents' education. similar relationships were observed for men, except that income was the strongest predictor of poor health. the health differences were about as large as those known for the native dutch population. conclusion and discussion: migrant groups are not homogenous. 
also within these groups, low ses is related to poor general health. in order to identify subgroups where most health problems occur, different socioeconomic indictors should be used. abstract background: genetic damage quantification can be considered as biomarker of exposure to genotoxic agents and as early-effect biomarker regarding cancer risk. objectives: to assess genetic damage in newborns and its relationship with anthropometrical, sociodemographic variables, maternal tobacco consumption and pollution perception. design and methods: the bio-madrid study recruited trios (mother/father/newborn) from areas in madrid to assess the impact of pollutants in humans. parents answered a questionnaire about socio-economic characteristics, pregnancy, life-style and perception of pollution. genetic damage in newborns were measured with the micronucleus(mn) test in peripheral lymphocytes poisson regression models were fitted using mn frequency per binucleated cells as dependent variable. explanatory variables included sex, parents age, tobacco, area and reported pollution level. results: the mean frequency of mn was . per (range: - ). no differences were found regarding area, sex and maternal tobacco consumption. mn frequency was higher in underweighted newborns and in those residing near heavy traffic roads. in recent years minimally invasive surgery procedures underwent rapid diffusion and laparoscopic cholecystectomy has been among the first to be introduced. after its advent, increasing rates of overall and laparoscopic cholecystectomy have been observed in many countries. we evaluated the effect of the introduction of laparoscopic procedure on the rates of cholecystectomy in friuli venezia giulia region, performing a retrospective study. from regional hospitals discharge data we selected all records with procedure code of laparoscopic (icd cm: ) or open ( ) cholecystectomy and diagnosis of uncomplicated cholelithiasis (icd cm: . ; . ; , ) or cholecystitis ( , ; , ), in any field, from to . in the year study period, the number of overall cholecystectomies increased from to (+ , %), mainly for the relevant increase of laparoscopic interventions from procedures, ( , % of overall cholecystectomies), to ( , %). rates of laparoscopic cholecystectomies increased from , to , per admitted patients with diagnosis of cholelithiasis or cholecystitis. the introduction of laparoscopic cholecystectomies was followed not only by a shift towards laparoscopically performed interventions but also by an increase in overall cholecystectomies in friuli venezia giulia region. abstract background: although a diminished doses scheme of -valent pneumococcal conjugate vaccination (pcv ) may offer protection against invasive pneumococcal disease, it might affect pneumococcal carriage and herd immunity. long term memory has to be evaluated. objective: to compare the influence of a and -doses pcv -vaccination scheme on pneumococcal carriage, transmission, herd immunity and anti-pneumococcal antibody levels. methods: in a prospective, randomized, controlled trial infants are randomly allocated to receive pcv at ages and months; ages , and months and the age of months only. nasopharyngeal (np) swabs are regularly obtained from infants and family members. the np swabs are cultured by conventional methods and pneumococcal serotypes are determined by quellung reaction. antibody levels are obtained at and months from infants in group i and ii and from infants in group iii. 
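sample-size statements like the one that opens the next passage are typically derived from a two-proportion power calculation. a hedged sketch with assumed carriage proportions follows; the trial's actual design figures are not preserved in this text.

# sketch: sample size to detect a difference between two carriage proportions
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_control, p_reduced = 0.55, 0.45  # assumed pneumococcal carriage proportions
effect_size = proportion_effectsize(p_control, p_reduced)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"required infants per group: {n_per_group:.0f}")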
one thousand infants are needed to detect a % difference in pneumococcal carriage (α = . , β = . ) between the three groups. results: so far, infants have been included. preliminary results show that prior to vaccination pneumococcal carriage was %. conclusion: this trial will provide insight into the effects of a diminished-dose scheme on herd immunity and long-term anti-pneumococcal antibody development.

abstract background: oil spills cause serious environmental damage and acute health problems in affected populations. objectives: to assess the impact of the prestige oil spill on the hrqol of the exposed population. design and methods: we selected residents in coastal areas heavily affected by the oil spill and residents in unaffected inland villages through random sampling, stratified by age and sex. hrqol was measured with the sf- questionnaire in personal interviews. individual exposure was also explored. mean differences in sf- scores > points were considered 'clinically relevant'. odds ratios (or) summarized the association between area of residence (coast vs inland) and suboptimal hrqol (lower than the th percentile), adjusting for possible confounders. results: neither clinically relevant nor statistically significant differences were observed in most of the sf- scales regarding place of residence or individual exposure. worse scores (inland = , ; coast = , ; p< , )

abstract background: patient comorbidities are usually measured and controlled for in health care outcome research. hypertension is one of the most commonly used comorbidity measures. objectives: this study aims to assess underreporting of hypertension in ami patients, and to analyze the impact of coding practices across italian regions and hospital types. methods: a cohort of ami hospitalisations in italy from november to october was selected. patients with a previous hospital admission reporting a diagnosis of complicated hypertension within the preceding months were studied. a logistic model was constructed. both the crude and adjusted probability of reporting hypertension in ami admissions were estimated, depending on the number of diagnosis fields compiled in discharge abstracts and the presence of other diseases. results: in . % of patients hypertension was not reported. the probability of reporting hypertension increased with the number of compiled diagnosis fields (adjusted ors range: . - . ). there were no significant differences among italian regions, while private hospitals' reporting was less accurate. disorders of lipoid metabolism were more likely to be coded together with hypertension (adjusted or: . ). conclusions: information from both the ami admission and previous hospitalisations would be needed to include hypertension in a comorbidity measure.

abstract background: angiotensin-converting enzyme inhibitors (acei) should be considered the standard initial treatment of systolic heart failure. this treatment is not recommended in patients with hypotension, although systolic blood pressure values around - mmhg during treatment are allowed if the patient remains asymptomatic. objectives: to determine the proportion of patients with systolic heart failure receiving treatment with acei, and the proportion of these patients with signs of hypotension. design and methods: the electronic clinical records of all patients diagnosed with systolic heart failure were reviewed. the electronic information system covers approximately % of the population of the basque country.
diagnosis of heart failure was defined as the presence of any of the following cie- codes: or . or . . to evaluate blood pressure, the last available determination was considered. results: out of patients with left heart failure, ( . %) had been prescribed acei. among the patients with blood pressure lower than mmhg systolic or mmhg diastolic, ( . %) were also receiving this treatment. conclusions: acei are clearly underprescribed in the basque country for the treatment of heart failure. attention should be given to the group at risk of hypotension.

abstract background: epidemiologic studies have shown an association between c-reactive protein (crp) and cardiovascular endpoints in population samples. methods: in a longitudinal study of myocardial infarction (mi) survivors, crp was measured repeatedly (up to times) within a period of months. data on disease history and lifestyle were collected at baseline. we examined the association between different variables and the level of crp using a random effects model. results: in total crp samples were collected in athens, augsburg, barcelona, helsinki, rome and stockholm. mean levels of crp were . , . , . , . , . , . [mg/l] respectively. body mass index (bmi) and chronic bronchitis (ever diagnosed) had the largest effect on crp ( % (for kg/m ) and % change from the mean level, respectively, p< . ). age classes showed a cubic function with a minimum at ages to . glycosylated hemoglobin (hba c) < . %, as a measure of long-term blood glucose control, and being male were found to be protective (− % and − % respectively, p< . ). conclusion: it was shown that bmi and history of bronchitis are important in predicting the level of crp. other variables, like alcohol intake, play a minor role in this large sample of mi patients.

abstract background: during the last decades a remarkable increase in the incidence rates of malignant lymphoma was seen. although some reasons are known or suspected, the underlying risk factors are not well understood. objectives: we studied the influence of medical radiation (x-ray, radiotherapy and scintigraphy) on the risk of malignant lymphoma. methods: we analysed data from a population-based case-control study with incident lymphoma cases in germany from - . after informed consent, cases were pair-matched with controls recruited from registration offices by age, gender and study region. data were collected in a personal interview. we analysed the data using conditional logistic regression. results: the linear model shows an or = . /msv due to x-ray exposure and an or = . ( % ci = . - . ) comparing higher with lower exposure. radiotherapy shows an or = . (n = cases). there is no association between all lymphomas and scintigraphies, but in the subgroup containing multiple myeloma, cll, malt- and marginal-cell lymphoma we found an or = . ( % ci = . - . ) in the multivariate model. discussion: no excess risk was observed for x-ray examinations. ionising radiation may increase the risk for specific lymphoma subgroups. however, it should be noted that the numbers in the subgroups are small and that the radiation dose may be somewhat inaccurate as no measurements were available.

abstract background: varus-alignment (bow-leggedness) is assumed to correlate with knee osteoarthritis (oa), but it is unknown whether varus-alignment precedes the oa or whether varus-alignment is a result of oa. objective: to assess the relationship between varus-alignment and the development, as well as progression, of knee oa.
methods: , participants in the rotterdam study were selected. knee oa at baseline and at follow-up (mean follow-up . years) was defined as kellgren & lawrence (k&l) grade , and progression of oa as an increase of k&l degree. alignment was measured by the femoro-tibial angle on baseline radiographs. multivariable logistic regression for repeated measurements was used. results: of , knees, . % showed normal alignment, . % varus-alignment, and . % valgus-alignment. comparison of high varus-alignment versus normal, low and mediate varus-alignment together, showed a two-fold increase in the development of knee oa. (or = . ; %ci = . - . ). the risk of progression was higher in the high varus group compared to the normal, low and mediate varus group (or = . ; %ci = . - . ). stratification for overweight gave similar odds ratio's in the overweight group, but weaker odds ratio's in the non-overweight group. conclusion: a higher value of varus-alignment is associated with the onset and progression of knee oa. abstract background: echocardiographic image quality in copd patients can be hampered by hyperinflated lungs. cardiovascular magnetic resonance imaging (cmr) may overcome this problem and provides accurate and reproducible information about the heart without geometric assumptions. objective: to determine the value of easily assessable cmr parameters compared to other diagnostic tests in identifying heart failure (hf) in copd patients. design and methods: participants were recruited from a cohort of copd patients = years. a panel established the diagnosis of hf during consensus meetings using all diagnostic information, including echocardiography. in a nested case-control study design, copd patients with hf (cases) and a random sample of copd patients without hf (controls) received cmr. the diagnostic value of cmr for diagnosing hf was quantified using univariate and multivariate logistic modelling and roc-area analyses. results: four easily assessable cmr measurements had significantly more added diagnostic value beyond clinical items (roc-area . ) than amino-terminal pro b-type natriuretic peptide (roc-area . ) or electrocardiography (roc-area . ). a 'cmr model' without clinical items had an roc-area of . . conclusion: cmr has excellent capacities to establish a diagnosis of heart failure in copd patients and could be an alternative for echocardiography in this group of patients. abstract background: the prevalence of overweight (i.e, body mass index [bmi] > = kg/m ) is increasing. new approaches to address this problem are needed. objectives: ) to assess the effectiveness of distance counseling (i.e., by phone and e-mail/internet) on body weight and bmi, in an overweight working population. ) to assess differences in effectiveness of the two communication methods. design and methods: overweight employees ( % male; mean age . ± . years; mean bmi . ± . kg/m ) were randomized to a control group receiving general information on overweight and lifestyle (n = ), a phone based intervention group (n = ) and an internet based intervention group (n = ). the intervention took months and used a cognitive behavioral approach, addressing physical activity and diet. the primary outcome measures, body weight and bmi, were measured at baseline and at six months. statistical analyses were performed with multiple linear regression. results: the intervention groups (i.e., phone and e-mail combined) lost . kg (bmi reduced by . kg/m ) over the control group (p = . ). the phone group lost . 
kg more than the internet group (p = . ).

abstract objective: although an inverse education-mortality gradient has been shown in the general population, little is known about this trend in groups with higher risks of death. we examined differences in mortality by education and hiv status among injecting drug users (idus) before and after the introduction of highly active antiretroviral therapy (haart) in . methods: community-based cohort study of idus recruited in aids prevention centres.

abstract background: pancreatic cancer is an aggressive cancer with short survival, which makes health-related quality of life (hrqol) of major importance. objectives: the aim of our study was to assess both generic and disease-specific hrqol in patients with pancreatic cancer. methods: patients with suspected pancreatic cancer were consecutively included at admission to the hospital. hrqol was determined with the disease-specific european organization for research and treatment of cancer (eortc) health status instrument and the generic euroqol (eq- d). results: a total of patients (mean age years ± , % men) were admitted with suspected pancreatic cancer. of these patients, ( %) had pancreatic cancer confirmed as the final diagnosis. hrqol was significantly impaired in patients with pancreatic cancer for most eortc and eq- d scales in comparison to norm populations. the eq- d visual analogue scale (vas) and utility values were significantly correlated with the five functional scales, with the global health scale and with some but not all of the eortc symptom scales/items. conclusions: hrqol was severely impaired in patients with pancreatic cancer. there was a significant correlation between most eortc and eq- d scales. our results may facilitate further economic evaluations and aid health policy makers in resource allocation.

abstract background: organised violence has a health impact on those who experience the violence both directly and indirectly. the number of people affected by mass violence is alarming. substantial knowledge on the long-term health impact of organised violence is of importance for public health and for epidemiology. objectives: to investigate research results on the long-term mental health impact of organised violence. design and methods: a search for the keywords genocide, organised violence, transgenerational effects and mental health was carried out in pubmed, science citation index and psycinfo. results: the systematic review on the long-term health impact of genocide showed that exposure to organised violence has an impact on mental health. methodological strengths and weaknesses varied between studies. the mental health consequences found were associated with the country of research and the time of study. overall, the data showed that organised violence has a transgenerational impact on the mental health of individuals and societies. conclusion: longitudinal studies have to be carried out to gain further insight into the long-term health effects of organised violence. discussion: research results on the mental health effects of organised violence have to be analysed in the context of changing concepts of illness.

overweight is increasing and associated with various health problems. there are no well-structured primary care programs for overweight available in the netherlands. therefore, we developed a -month multidisciplinary treatment program in a primary care setting.
the aim of the present study is to determine the feasibility and efficacy of a multidisciplinary treatment program on weight loss and risk profile in an adult overweight population. hundred participants of the utrecht health project are randomised to either a dietetic group or a dietetic plus physiotherapy group. the control group consists of another participants recruited from the utrecht health project and receives routine health care. body weight, waist circumference, blood pressure, serum levels, energy intake and physical activity are measured at baseline, halfway and at the end of the treatment program. feasibility of the treatment program is assessed by response, compliance and program-associated costs and workload. efficacy is determined by analysing changes in outcome measures between groups over time using t-tests and repeated-measures anova. the treatment program is considered effective with at least a % difference in mean weight change over time between groups. positive evaluation of the multidisciplinary treatment program for overweight may lead to implementation in routine primary health care.

abstract background: examining patients' quality of life (qol) before icu admission makes it possible to compare and analyse its relation with other variables. objectives: to analyse the pre-admission qol of patients admitted to a surgical icu and to study its relation with baseline characteristics and outcome. design and methods: the study was observational and prospective in a surgical icu, enrolling all patients admitted between november and april . baseline characteristics of patients, history of comorbidities and the quality of life survey score (qolss) were recorded. the relation between each variable or outcome and the total qolss score was assessed by multiple linear regression. results: the total qolss demonstrated worse qol in patients with hypertension, cerebrovascular disease or renal insufficiency, in the severely ill (as measured by saps and asa physical status), and in older patients. there was no relation between qol and longer icu los. conclusions: preadmission qol correlates with age, severity of illness, comorbidities and mortality rates, but is unable to predict longer icu stay. discussion: the qolss appears to be a good indicator of outcome and severity of illness.

abstract background: transient loss of consciousness (tloc) has a cumulative lifetime incidence of %, and can be caused by various disorders. objectives: to assess the yield and accuracy of initial evaluation (standardized history, physical examination and ecg), performed by attending physicians in patients with tloc, using follow-up as a gold standard. design and methods: adult patients presenting with tloc to the academic medical centre between february and may were included. after initial evaluation, physicians made a certain, likely or no initial diagnosis. when no diagnosis was made, additional cardiological testing, expert history taking and autonomic function testing were performed. the final diagnosis, after years of follow-up, was determined by an expert committee. results: patients were included. after initial evaluation, % of the patients were diagnosed with a certain and % with a likely cause for their episodes. overall diagnostic accuracy was % ( % ci - %); % ( % ci - %) for the certain diagnoses and % ( % ci - %) for the likely diagnoses. conclusion and discussion: attending physicians make a diagnosis in % of patients with tloc after initial evaluation, with high accuracy. extensive additional testing can be avoided in many patients.
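the diagnostic accuracy reported above is a simple proportion with a binomial confidence interval; the sketch below shows how such an interval can be computed (the counts are hypothetical, since the abstract's numbers are not preserved in this text).

# sketch: wilson confidence interval for a diagnostic-accuracy proportion
from statsmodels.stats.proportion import proportion_confint

correct, total = 85, 100  # hypothetical: diagnoses confirmed at follow-up
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"accuracy {correct / total:.0%} (95% ci {low:.0%}-{high:.0%})")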
abstract background: the possibility of an influenza pandemic is one of the major public health challenges of today. risk perceptions among the general public may be important for successful public health measures to better control an outbreak situation. objectives: we investigated risk perception and efficacy beliefs related to an influenza pandemic in the general population in countries in europe and asia. design and methods: telephone interviews were conducted in . risk perception of an influenza pandemic was measured on a -point scale and outcome- and self-efficacy on a -point scale (low-high). the differences in risk perception by country, sex and age were assessed with a general linear model including interaction effects. results: , persons were interviewed. the mean risk perception of flu was . and was significantly higher in europe ( . ) compared to asia ( . ) (p< . ) and higher in women ( . ) than men ( . ) (p< . ). outcome- and self-efficacy were lower in europe than in asia. conclusion: higher risk perceptions and lower efficacy beliefs were found in europe compared with asia. in developing preparedness plans for an influenza pandemic, specific attention should therefore be paid to risk communication and to how perceived self-efficacy can be increased.

abstract background: increased survival of patients with cf has prompted interest in their hrqol. objectives: to measure hrqol and its predictors in cf patients cared for at the bambino gesù children's hospital in rome, and to assess the psychometric properties of the italian version of the cf-specific hrqol instrument (cystic fibrosis questionnaire, cfq). design and methods: cross-sectional survey. all cf patients aged years or more were asked to complete the cfq (age-specific format). psychological distress was assessed through standardized questionnaires in patients (achenbach and general health questionnaire, ghq) and their parents (ghq and sf- ). results: one hundred and eighteen patients ( males, females, age range to years) participated in the study (response rate %). internal consistency of the cfq was satisfactory (cronbach's alpha from . to . ); all item-test correlations were greater than . . average cfq standardized scores were very good in all domains (> on a - scale), except perceived burden of treatments ( ) and degree of socialization ( ). multiple regression analysis was performed to identify factors associated with different hrqol dimensions. conclusion: support interventions for these patients should concentrate on finding a balance between the need to prevent infections and the promotion of adequate, age-appropriate social interactions.

abstract background: the metabolic syndrome (metsyn) - a clustering of metabolic risk factors with diabetes and cardiovascular diseases as the primary clinical outcomes - is thought to be highly prevalent, with an enormous impact on public health. to date, consistent data for germany are missing. objective: the study was conducted to examine the prevalence of the metsyn (according to the ncep atp iii definition) among german patients in primary care. methods: the germany-wide cross-sectional study ran for two weeks in october and included randomly selected general practitioners. blood glucose and serum lipids were analyzed, waist circumference and blood pressure were assessed, and data on smoking, dietary and exercise habits and regional and sociodemographic characteristics were collected.
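the ncep atp iii definition referred to above is a counting rule: the metabolic syndrome is present when at least three of five criteria are met. the sketch below uses the commonly cited atp iii cut-offs for illustration only; it does not reproduce the study protocol.

# sketch: ncep atp iii metabolic-syndrome counting rule (illustrative cut-offs)
def metsyn_atp3(waist_cm: float, male: bool, triglycerides_mgdl: float,
                hdl_mgdl: float, sbp: float, dbp: float,
                fasting_glucose_mgdl: float) -> bool:
    criteria = [
        waist_cm > (102 if male else 88),    # abdominal obesity
        triglycerides_mgdl >= 150,           # elevated triglycerides
        hdl_mgdl < (40 if male else 50),     # low hdl cholesterol
        sbp >= 130 or dbp >= 85,             # elevated blood pressure
        fasting_glucose_mgdl >= 110,         # elevated fasting glucose
    ]
    return sum(criteria) >= 3

print(metsyn_atp3(waist_cm=105, male=True, triglycerides_mgdl=180,
                  hdl_mgdl=38, sbp=128, dbp=82, fasting_glucose_mgdl=102))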
abstract background: excessive infant crying is a common and often stress-inducing problem that can ultimately result in child abuse. from previous research it is known that maternal depression during pregnancy is related to excessive crying, but so far little attention has been paid to paternal depression. objective: we studied whether paternal depression is independently associated with excessive infant crying. design and methods: in a prospective multiethnic population-based study we obtained depression scores of , mothers and , fathers at weeks of pregnancy using the brief symptom inventory, and information on crying behaviour of , infants at months. we used logistic regression analyses in which we adjusted for depression of the mother, level of education, smoking and alcohol use. results: paternal depressive symptomatology was related to the widely used wessel's criteria for excessive crying (adjusted odds ratio . , . - . ). conclusion: our findings indicate that paternal depressive symptomatology might be a risk factor for excessive infant crying. discussion: genetic as well as other direct (e.g. interaction between father and child) or indirect (e.g. marital distress or poor circumstances) mechanisms could explain the observed association.

abstract background: in studying the genetic background of congenital anomalies, the comparison of affected cases to non-affected controls is a popular method. investigation of case-parent triads uses observations of cases and their parents exclusively. methods: both the case-control approach and the log-linear case-parent triad model were applied to spina bifida (sb) cases and their parents ( triads) and controls in an analysis of the impact of the c t and a c mthfr polymorphisms on the occurrence of sb. results: observed frequencies for the tt genotype were , % in sb children, , % in mothers, , % in fathers and , % in controls, and for the cc genotype were , % of sb children, , % of mothers, , % of fathers and , % of controls. both genotype frequencies in sb triads did not differ significantly from controls. the case-control approach showed a nonsignificant increase in the risk of having sb for t allele carriers in either homozygous (or = , ) or heterozygous form (or = , ) and for c allele carriers in heterozygous form (or = , ). the log-linear model revealed a significant relative risk of sb in children with both the tt and the ct genotype (rr = , and rr = , respectively). the child's genotype at a c and the mother's genotypes did not contribute to the risk. conclusions: the case-parent triad approach adds new information regarding the impact of parental imprinting on congenital anomalies.

abstract background: previous studies showed an association of autonomic dysfunction with coronary heart disease (chd) and with depression, as well as an association of depression with chd. however, there is limited information on autonomic dysfunction as a potential mediator of the adverse effect of depression on chd. objectives: to examine the role of autonomic dysfunction as a potential mediator of the association of depression with chd. design/methods: we used data of participants aged - years of the ongoing population-based cross-sectional carla study ( % male). time- and frequency-domain parameters of heart rate variability (hrv) as a marker of autonomic dysfunction were calculated. prevalent myocardial infarction (mi) was defined as self-reported physician-diagnosed mi or a diagnostic minnesota code in the electrocardiogram. depression was defined based on the ces-d depression scale.
logistic regression was used to assess associations between depression, hrv and mi. results: in age-adjusted logistic regression models, there was no statistically significant association of hrv with depression, of depression with mi, or of hrv with mi in men and women. discussion/conclusion: the present analyses do not support the hypothesis of an intermediate role of autonomic dysfunction on the causal path from depression to chd.

abstract background: hypertension is an established risk factor for cardiovascular disease. however, the prevalence of untreated or uncontrolled hypertension is often high (even in populations at high risk). objectives: to assess the prevalence of untreated and of uncontrolled hypertension in an elderly east german population. design and methods: preliminary data of a cross-sectional, population-based examination of men and women aged - years were analysed. systolic (sbp) and diastolic blood pressure (dbp) were measured, and physician-diagnosed hypertension and use of antihypertensive drugs were recorded. prevalence of hypertension was calculated according to age and sex. results: of all participants, . % were hypertensive ( . % of men, . % of women). of these, . % were untreated, . % treated but uncontrolled, and . % controlled. women were more often properly treated than men. the prevalence of untreated hypertension was highest in men aged - years ( . %) and lowest in men and women aged >= years ( . %). uncontrolled hypertension increased with age in both sexes. conclusion and discussion: in this elderly population, there is a high prevalence of untreated and uncontrolled hypertension. higher awareness in the population and among physicians is needed to prevent sequelae such as cardiovascular disease.

abstract background: exposure to pesticides is a potential risk factor for subfertility, which can be measured by time-to-pregnancy (ttp). as female greenhouse workers constitute a major group of workers exposed to pesticides at childbearing age, a study was performed among these and a non-exposed group of female workers. objectives: to measure the effects of pesticide exposure on time-to-pregnancy. design and methods: data were collected through postal questionnaires with detailed questions on ttp, lifestyle factors, and work tasks (e.g. application of pesticides, re-entry activities, and work hours) during the six months prior to conception of the most recent pregnancy. associations between ttp and exposure to pesticides were studied in cox's proportional hazards models among female greenhouse workers and referents. results: the initial adjusted fecundability ratio (fr) for greenhouse workers versus referents was . ( %ci: . - . ). this fr proved to be biased by the reproductively unhealthy worker effect. restricting the analyses to full-time workers only gave an adjusted fr of . ( %ci: . - . ). among primigravidous greenhouse workers, an association was observed between prolonged ttp and gathering flowers (fr = . , %ci: . - . ). conclusion and discussion: this study adds some evidence to the hypothesis of adverse effects of pesticide exposure on time-to-pregnancy, but more research is needed.

abstract background: hfe-related hereditary hemochromatosis (hh) is an iron overload disease for which screening is recommended to prevent morbidity and mortality. however, discussion has arisen about the clinical penetrance of the hfe-gene mutations. objective: in the present study the morbidity and mortality of families with hfe-related hh is compared to a normal population.
methods: c y-homozygous probands with clinically overt hfe-related hh and their first-degree relatives filled in a questionnaire on health, diseases and mortality among relatives. laboratory results on serum iron parameters and hfe genotype were collected. the self-reported morbidity, family mortality and laboratory results were compared with an age- and gender-matched subpopulation of the nijmegen biomedical study (nbs), a population-based survey conducted in the eastern part of the netherlands. results: two hundred and twenty-eight probands and first-degree relatives participated in the hefas. serum iron parameters were significantly elevated in the hefas population compared to the nbs controls. also, the morbidity within hefas families was significantly increased for fatigue, hypertension, liver disease, myocardial infarction, osteoporosis and rheumatism. mortality among siblings, children and parents of hefas probands and nbs participants was similar. discussion: the substantially elevated morbidity within hefas families justifies further exploration of a family cascade screening program for hh in the netherlands.

abstract objectives: to evaluate awareness levels and effectiveness of warning labels on cigarette packs among portuguese students enrolled in the th to th grades. design and methods: a cross-sectional study was carried out in may ( ) in a high school population ( th- th grades) in the north of portugal (n = ). a confidential self-reported questionnaire was administered. warning label effectiveness was evaluated by changes in smoking behaviour and cigarette consumption during the period between june/ (before the implementation of the tobacco warning labels in portugal) and may/ . continuous variables were compared by the t-test for paired samples and the kruskal-wallis test. crude and adjusted odds ratios and confidence intervals were calculated by logistic regression analysis. results: the majority of students ( . %) had a high level of awareness of warning label content. this knowledge was significantly associated with school grade and current smoking status. none of these variables was significantly associated with changes in smoking behaviour. although not reaching statistical significance, the majority of teenagers ( . %) increased or kept their smoking pattern. awareness level was not associated with decreases in smoking prevalence or consumption. conclusions: current warning labels are ineffective in changing smoking behaviour among portuguese adolescents.

abstract background: injuries are an important cause of morbidity. the presence of pre-existing chronic conditions (pecs) has been shown to be associated with higher mortality. objectives: the aim of this study was to evaluate the association between pecs and risk of death in elderly trauma patients. methods: an injury surveillance system, based on the integration of the emergency, hospital, and mortality databases of the lazio region, year , was used. patients were elderly people seen at the emergency departments and hospitalised. pecs were evaluated on the basis of the charlson comorbidity index (cci). to measure the effect of pecs on the probability of death, we used logistic regression. results: patients were admitted to the hospital. . % of the injured subjects were affected by one or more chronic conditions. risk of death for non-urgent and urgent patients increased with increasing cci score.

abstract background: c-reactive protein (crp) was shown to predict prognosis in heart failure (hf).
objective: to assess the variability of crp over time in patients with stable hf. methods: we measured high-sensitivity crp (hscrp) times ( -week intervals) in patients with stable hf. patients whose hscrp was > mg/dl or whose clinical status deteriorated were excluded. two consecutive hscrp measurements were available for patients: men, mean (sd) age . ( . ) years, % with depressed left ventricular systolic function. forty-four patients had a third measurement. using the cutoff point of . mg/dl for prediction of adverse cardiac events, we assessed the proportion of patients who changed risk category. results: median (p -p ) baseline hscrp was . mg/dl ( . - . ). hscrp varied widely, particularly at higher levels. the th and th percentiles of the differences between the first two measurements were - . mg/dl and + . mg/dl. the correlation coefficient between these measurements was . , p< . . eleven ( %) patients changed risk category, kappa = . , p< . . among patients whose first two measurements were concordant, . % changed category at the third measurement, kappa = . , p< . . conclusion: large variability in hscrp in stable hf may decrease the validity of risk stratification based on single measurements. it remains to be demonstrated whether the pattern of change over time adds predictive value in hf patients.

abstract background: instrumental variables can be used to adjust for confounding in observational studies. this method has not yet been applied with censored survival outcomes. objectives: to show how instrumental variables can be combined with survival analysis. design and methods: in a sample of patients with type- diabetes who started renal-replacement therapy in the netherlands between and , the effect of pancreas-kidney transplantation versus kidney transplantation on mortality was analyzed using region as the instrumental variable. because the hospital could not be chosen with this type of transplantation, patients can be assumed to be naturally randomized across hospitals. we calculated an adjusted difference in survival probabilities for every time point, including the appropriate confidence interval (ci %). results: the -year difference in survival probabilities between the two transplantation methods, adjusted for measured and unmeasured confounders, was . (ci %: . - . ), favoring pancreas-kidney transplantation. this is substantially larger than the intention-to-treat estimate of . (ci %: . - . ), where policies are compared. conclusion and discussion: instrumental variables are not restricted to uncensored data, but can also be used with a censored survival outcome. hazard ratios with this method have not yet been developed. the strong assumptions of this technique apply similarly with survival outcomes.

the sir of coronary heart disease was . [ %ci: . - . ] and remained significantly increased up to years of follow-up. cox regression analysis showed a . -fold ( % ci, . - . ) increased risk of congestive heart failure after anthracyclines and a . -fold ( % ci, . - . ) increased risk of coronary heart disease after radiotherapy to the mediastinum. conclusion: the incidence of several cardiac diseases was strongly increased after treatment for hl, even after prolonged follow-up. anthracyclines increased the risk of congestive heart failure and radiotherapy to the mediastinum increased the risk of coronary heart disease.

abstract background: the concept of reproductive health is emerging as an essential need for health development.
objectives: to explore the opinions of parents, teachers and students about teaching reproductive health issues to students of middle and high schools. design and methods: focus group discussions (fgd) were chosen as a qualitative research method. a series of group discussions was held with persons ( students, teachers, and parents); each group included to persons. results: all the participants noted a true need for education in puberty health in order to prepare pre-adolescent students for the psychological and somatic changes of puberty. however, a few fathers and a group of mothers believed that education on family planning is not suitable for students. a need for education on aids and marital problems for students was the major concern in all groups. the female students emphasized a need for counselling programmes in the pre-marital period. conclusion: essentials of puberty health, family planning, aids and marital problems should be provided in middle and high schools in order to narrow the knowledge gap of the students.

abstract background: the association between social support and hypertension in pregnancy remains controversial. objective: the objective of this study was to investigate whether the level of social support is a protective factor against preeclampsia and eclampsia. design and methods: a case-control study was carried out in a public high-risk maternity hospital in rio de janeiro, brazil. between july and may , all cases, identified at diagnosis, and controls, matched on gestational age, were included in the study. participants were interviewed about clinical history and socio-demographic and psychosocial characteristics. the principal exposure was the level of social support available during the pregnancy, measured using the medical outcomes study scale. adjusted odds ratios were estimated using multivariate conditional logistic regression. results: multiparous women with a higher level of social support had a lower risk of presenting with preeclampsia and eclampsia (or = . ), although this association was not statistically significant ( % ci . - . ). in primiparous women, a higher level of social support was seen amongst cases (or = . ; % ci . - . ). an interaction between level of social support and stressful life events was not identified. these results contribute to increased knowledge of the relationship between preeclampsia and psychosocial factors in low-income pregnant and puerperal women.

abstract background: current case-definitions for cfs/me are designed for clinical use and are not appropriate for health needs assessment. a robust epidemiological case-definition is crucial in order to achieve rational allocation of resources to improve service provision for people with cfs/me. objectives: to identify the clinical features that distinguish people with cfs/me from those with other forms of chronic fatigue and to develop a reliable epidemiological case-definition. methods: primary care patient data for unexplained chronic fatigue was assessed for symptoms, exclusionary and comorbid conditions and demographic characteristics. cases were assigned to disease and non-disease groups by three members of the chief medical officer's working group on cfs/me (reliability: cronbach's alpha . ). results: preliminary multivariate analyses were conducted, and classification and regression tree analysis included a -fold cross-validation approach to prevent overfitting.
the results suggested that there were at least four strong discriminating variables for cfs/me, with 'post-exertional malaise' being the strongest predictor. risk and classification tables showed an overall correct classification rate of . %. conclusion: the analyses demonstrated that the application of the combination of the four discriminating variables (the de facto epidemiological case-definition) and predefined comorbid conditions had the ability to differentiate between cfs/me and non-cfs/me cases.

abstract background: infection with high-risk human papillomavirus (hpv) is a necessary cause of cervical cancer. vaccines against the most common types (hpv , hpv ) are being developed. relatively little is known about factors associated with hpv or hpv infection. we investigated associations between lifestyle factors and hpv and hpv infection. methods: uk women aged - years with a recent abnormal cervical smear underwent hpv testing and completed a lifestyle questionnaire. hpv and hpv status was determined using type-specific pcrs. associations between lifestyle factors and hpv status were assessed by multivariate logistic regression models. results: . % ( %ci . %- . %) of women were hpv -positive. . % ( % ci . %- . %) were hpv -positive. for both types, the proportion testing positive decreased with increasing age and increased with increasing grade of cytological abnormality. after adjusting for these factors, significant associations remained between (i) hpv and marital, employment, and smoking status and (ii) hpv and marital status and contraceptive pill use. gravidity, ethnicity, barrier contraceptive use and socio-economic status were not related to either type. conclusions: we identified modest associations between several lifestyle factors and hpv and hpv . studies of this type help elucidate hpv natural history in different populations and will inform the development of future vaccine delivery programmes.

in ageing men, testosterone levels decline; at the same time, cognitive function, muscle and bone mass, sexual hair growth, libido and sexual activity decline and the risk of cardiovascular diseases increases. we set up a double-blind, randomized placebo-controlled trial to investigate the effects of testosterone supplementation on functional mobility, quality of life, body composition, cognitive function, vascular function and risk factors, and bone mineral density in older hypogonadal men. we recruited men with serum testosterone levels below nmol/l and ages - years. they were randomized to either four capsules of mg testosterone undecanoate (tu) or placebo daily for weeks. primary endpoints are functional mobility and quality of life. secondary endpoints are body composition, cognitive function, aortic stiffness and cardiovascular risk factors, and bone mineral density. effects on prostate, liver and hematological parameters will be studied with respect to safety. the measure of effect will be the difference in change from the baseline visit to the final visit between tu and placebo. we will study whether the effect of tu differs across subgroups of baseline waist girth, testosterone level, age, and level of the outcome under study. at baseline, mean age, bmi and testosterone levels were (yrs), (kg/m ) and .x (nmol/l), respectively.

abstract: in a student population, the caries prevalence was , %. objectives: to compare the efficiency of two types of oral health education programmes and adherence to tooth brushing. study design: case-control study; youngsters took part, in the case group.
a health education programme was carried out in schools and included two types of strategies: a participative strategy (learning based on the colouring of dental plaque) for the case group, and a traditional strategy (oral expository method) for the control group. at the end of the programmes, oral health was evaluated through the cpo index, adherence to tooth brushing and the oral hygiene index (iho). results: in the initial dental exam the average iho was , . three months after the application of the oral health programme, there was a general decrease in the average iho to , . discussion and conclusion: in the case group the decrease was larger: from , to , . the students who received a session of oral health education based on the colouring of dental plaque showed a lower average iho and higher knowledge. this can be due to the teaching session being more active, participative and demonstrative.

abstract background: violence perpetrated by adolescents is a major problem in many societies. objectives: the aim of this study is to examine high school students' violent behaviour and to identify predictors. design and methods: a cross-sectional study was conducted in timis county, romania, between may and june . the sample consisted of randomly selected classes, stratified proportionally according to grades - , high school profile, and urban or rural environment. the students completed a self-administered questionnaire in their classroom. a weighting factor was applied to each student record to adjust for non-response and for the varying probabilities of selection. results: a total of students were included in the survey. during the last months, . % of adolescents were involved in a physical fight outside school and . % on school property.

abstract background: drug use by adolescents has become an increasing public health problem in many countries. objectives: the aim of this study is to identify the prevalence of drug use and to examine high school students' perceived risks of substance use. design and methods: a cross-sectional study was conducted in timis county, romania, between may and june . the sample consisted of randomly selected classes, stratified proportionally according to grades - , high school profile, and urban or rural environment. the students completed a self-administered questionnaire in their classroom. eighteen items regarding illicit drug use, suggesting different intensities of use, were listed. the response categories were 'no risk', 'slight risk', 'moderate risk', 'great risk' and 'don't know'. results: a total of students were included in the survey. the lifetime prevalence of any illicit drug was . %. significant beliefs associated with drug use were: trying marijuana once or twice (p< . ), smoking marijuana occasionally (p = . ), trying lsd once or twice (p = . ), trying cocaine once or twice (p = . ), and trying heroin once or twice (p = . ). conclusion: the overall drug use prevalence is small. however, use of some drugs once or twice is not seen as a very risky behaviour.

abstract background: the health ombudsman service was created in ceará, brazil, in , with the objective of receiving user opinions about public services. objectives: to describe user profiles, evaluating their satisfaction with health services and with the ombudsman service itself. design and methods: a cross-sectional, exploratory study with a random sample of users who had used the service in the last three months. the data were analyzed with the epi info program.
results: women used the service most ( . %). the users sought the service for complaints ( . %), guidance ( . %) and commendation ( . %). users made the following complaints about health services: lack of care ( . %), poor assistance ( . %), and lack of medication ( . %). in relation to the ombudsman service, the following failures were mentioned: lack of autonomy ( . %), delay in solving problems ( . %) and few ombudsmen ( . %). conclusion: participation of the population in the use of the service is small. the service does not satisfy the expectations of users; it is necessary to publicize the service and to try to establish an effective partnership between users and ombudsmen, so that the population finds in the ombudsman service an instrument for exercising social control and improving the quality of health services.

in chile, the rates of breast cancer and diabetes have dramatically increased in the last decade. the role of insulin resistance in the development of breast cancer, however, remains unexplored. we conducted a hospital-based case-control study to assess the relationship between insulin resistance (ir) and breast cancer in chilean pre- and postmenopausal women. we compared women, - y, with incident breast cancer diagnosed by biopsy and controls with normal mammography. insulin and glucose were measured in blood, and ir was calculated by the homeostasis model assessment method. anthropometric measurements and socio-demographic and behavioural data were also collected. odds ratios (ors) and % confidence intervals (cis) were estimated by multivariate logistic regression. the risk of breast cancer increased with age. ir was significantly associated with breast cancer in postmenopausal women (or = . , %ci = . - . ), but not in premenopausal women (p> . ). socioeconomic status and smoking appeared as important risk factors for breast cancer. obesity was not associated with breast cancer at any age (p> . ). in these women, ir increased the risk of breast cancer only after menopause. overall, these results suggest a different risk pattern for breast cancer before and after menopause. keywords: insulin resistance; breast cancer; chile.

abstract background: previous european community health indicators (echi) projects have proposed a shortlist of indicators as a common conceptual structure for health information. the european community health indicators and monitoring (echim) is a -year project to develop and implement health indicators and to develop health monitoring. objectives: our aim is to assess the availability and comparability of the echi-shortlist indicators in european countries. methods: four widely used health indicators were evaluated: i) perceived general health; ii-iii) prevalence of any and of certain chronic diseases or conditions; iv) limitations in activities of daily living (adl). our evaluation of available sources for these indicators is based on the european health interview & health examination surveys database ( surveys).

in chile, breast cancer, obesity and sedentary behaviour rates are increasing. the role of specific nutrients and exercise in the risk of breast cancer remains unclear. the aim of the present study was to evaluate the role of fruit and vegetable intake and physical activity in the prevention of breast cancer. we undertook an age-matched case-control study. cases were women with histologically confirmed breast cancer and controls were women with normal mammography admitted to the same hospital.
a structured questionnaire was used to obtain dietary information, and physical activity was measured with the international physical activity questionnaire. odds ratios (ors) and % confidence intervals (cis) were estimated by conditional logistic regression adjusted for obesity, socioeconomic status and smoking habit. a significant association was found with fruit intake (or = . , %ci = . - . ). the consumption of vegetables (or = . , %ci = . - . ), moderate physical activity (or = . , %ci = . - . ) and high physical activity (or = . , %ci = . - . ) were not observed to be protective factors. in conclusion, the consumption of fruit is protective against breast cancer. these findings need to be replicated in chile to support the role of diet and physical activity in breast cancer and their subsequent contribution to public health policy. keywords: diet; physical activity; breast cancer; chile.

the role of trace elements in the pathogenesis of liver cirrhosis and its complications is still not clearly understood. serum concentrations of zinc, copper, manganese and magnesium were determined in patients with alcoholic liver cirrhosis and healthy subjects by means of a plasma sequential spectrophotometer. serum levels of zinc were significantly lower (median . vs . µmol/l, p = . ) in patients with liver cirrhosis in comparison to controls. serum levels of copper were significantly higher in patients with liver cirrhosis ( . vs . µmol/l, p< . ), as was manganese ( . vs . µmol/l, p = . ). the concentration of magnesium was not significantly different between patients with liver cirrhosis and controls ( . vs . mmol/l, p = . ). there was no difference in trace element concentrations between child-pugh groups. the zinc level was significantly lower in patients with hepatic encephalopathy in comparison to cirrhotic patients without encephalopathy ( . vs . µmol/l, p = . ). manganese was significantly higher in cirrhotic patients with ascites in comparison to those without ascites ( . vs . µmol/l, p = . ). correction of trace element concentrations might have a beneficial effect on complications and perhaps on the progression of liver cirrhosis. it would be advisable to analyse trace elements routinely.

abstract background: respiratory tract infections (rti) are very common in childhood, and knowledge of pathogenesis and risk factors is required for effective prevention. objective: to investigate the association between early atopic symptoms and the occurrence of recurrent rti during the first years of life. design and methods: in the prospective prevention and incidence of asthma and mite allergy birth cohort study, children were followed from birth to the age of years. information on atopic symptoms, potential confounders, and effect modifiers like passive smoking, daycare attendance and the presence of siblings was collected at ages months and year by parental questionnaires. information on rti was collected at ages , , , and years. results: children with early atopic symptoms, i.e. itchy skin rash and/or eczema or doctor-diagnosed cow's milk allergy at year of age, had a slightly higher risk of developing recurrent rti (aor . ( . - . ) and . ( . - . ), respectively). the association between atopic symptoms and recurrent rti was stronger in children whose mother smoked during pregnancy and who had siblings (aor . ( . - . )).

the aim: the aim of the study was to assess the relative risk (rr) of obesity and abdominal fat distribution for insulin resistance (ir), diabetes, hyperlipidemia and hypertension in the polish population.
materials and methods: subjects aged - were randomly selected and invited to the study. in participants, anthropometric and blood pressure examinations were performed. fasting lipids, and fasting and post-glucose-load glucose and insulin, were determined. ir was defined as the upper quartile of the homa-ir distribution for the normal glucose tolerant population. results: overweight and obesity were observed in , % and , % of subjects. visceral obesity was found in subjects ( , % of men and , % of women). the rr of ir in obesity was , ( % ci: , - , ), and for obese subjects aged below it was , ( % ci: , - , ). among men with visceral obesity, the rr of ir was highest for those aged below . the rr of diabetes increased with increasing body weight; in obese subjects with abdominal fat distribution it was , ( %ci: , - , ). the same was observed for hypertension and hyperlipidemia. conclusions: obesity and abdominal fat distribution seem to be important risk factors for ir, diabetes, hypertension and hyperlipidemia, especially in the younger age groups.

abstract background: age as an effect modifier in cardiovascular risk remains unclear. objective: to evaluate age-related differences in the effect of risk factors for acute myocardial infarction (ami). methods: in a population-based case-control study, with data collected by trained interviewers, consecutive male cases of first myocardial infarction (participation rate %) and randomly selected male community dwellers as controls (participation rate %) were compared. effect-measure modification was evaluated by the statistical significance of a product term of each independent variable with age. unconditional logistic regression was used to estimate ors in each age stratum (< years/> years). results: there was a statistically significant interaction between education (> vs. < years), sports practice, diabetes and age: the adjusted (education, ami family history, dyslipidemia, hypertension, diabetes, angina, waist circumference, sports practice, alcohol and caffeine consumption, and energy intake) ors ( %ci) were respectively . ( . - . ), . ( . - . ) and . ( . - . ) in younger, and . ( . - . ), . ( . - . ) and . ( . - . ) in older participants. conclusions: in males, age has a significant interaction with education, sports practice and diabetes in the occurrence of ami. the effect is evident in the magnitude but not in the direction of the association.

abstract: there are few studies on the role of diet in lung cancer etiology. thus, we calculated both squamous cell and small cell carcinoma risks in relation to the frequency of consumption of vegetables, cooked meat, fish and butter in silesian males in an industrial area of poland. in the case-control study, the studied population comprised men with squamous cell carcinoma, men with small cell carcinoma, and healthy controls. multivariate logistic regression was employed to calculate lung cancer risk in relation to the simultaneous influence of dietary factors. the relative risk was adjusted for age and smoking. we observed a significant decrease in lung cancer risk related to more frequent consumption of raw vegetables, cooked meat and fish. however, a stronger protective effect was reported for squamous cell carcinoma. frequent fish consumption significantly decreases the risk, especially in cigarette smokers. the frequent consumption of pickles lowers squamous cell carcinoma risk in all cases, but small cell carcinoma risk only in smokers.
the presence of butter, cooked meat, fish and vegetables in the diet significantly decreases the lung cancer risk, especially in smokers. the association between diet and lung cancer risk is more pronounced for squamous cell carcinoma.

abstract background: in functional disease research, selection mechanisms need to be studied to assure the external validity of trial results. objective: we compared demographic and disease-specific characteristics, history, co-morbidity and psychosocial factors of patients diagnosed, approached and randomised for a clinical trial analysing the efficacy of fibre therapy in irritable bowel syndrome (ibs). design and methods: in primary care, patients were diagnosed with ibs by their gp in the past two years. characteristics were compared between ( ) randomised patients (n = ); ( ) patients who did not give their informed consent (n = ); ( ) patients who decided not to participate (n = ); and ( ) those not responding to the mailing (n = ). results: the groups showed no significant differences in age and gender ( % females, mean age years, s.d. ). patients consulting their gp for the trial, compared to patients not attending their gp, showed significantly more severe ibs symptoms, more abdominal pain during the previous three months, and a longer history of ibs (p< , ). randomised patients had more comorbidity (p = , ). conclusion and discussion: patients included in this ibs trial differed from non-participating and excluded patients mainly in ibs symptomatology, history and comorbidity. this may affect the external validity of the trial results.

abstract objectives: to evaluate smoking prevalence among teenagers and identify associated social-behavioral factors. study design and methods: a cross-sectional study was carried out in may ( ) in a high school population ( th- th grades) in the north of portugal (n = ). a confidential self-reported questionnaire was administered. crude and adjusted odds ratios and confidence intervals were calculated by logistic regression analysis. results: overall smoking prevalence was . % (boys = . %; girls = . %) (or = . ; ci % = . - . ; p< . ). smoking prevalence was significantly and positively associated with gender, smoking parents, school failure and school grade; in the group of students with smoking relatives, smoking was significantly associated with parents who smoke near the student (or = . ; ci % = . - . ; p< . ); in the secondary grade group ( th- th grades), smoking was significantly associated with belonging to 'non-science courses' (or = . ; ci % = . - . ; p = . ). conclusions: smoking is a growing problem among portuguese adolescents, increasing with age and prevailing among males, although major increases have been documented in the female population. parents' behaviours and habits have an important impact on their children's smoking behaviour. school failure is also an important factor associated with smoking. there is a need for further prevention programmes that should include families and consider students' social environment.

abstract background: the social environment of the school can contribute to the etiology of health behaviors. objective: to evaluate the role of school context in substance abuse in youth. design: a cross-sectional study was carried out in , using a self-completed, classroom-administered questionnaire. subjects: from a representative sample of students, a sub-sample of students was selected (including / classes with at least persons without missing data)*.
methods: substance abuse was measured by tobacco smoking at present, and episodes of drunkenness and marijuana use in the lifetime. an overall index was created as the main independent variable, ranging from - (cronbach's alpha = . ). class membership, type of school, gender, place of domicile, and school climate were included as contextual variables, measured at the individual or group level. results: at the individual level, the mean index was equal to . (sd = . ), and ranged from . in general comprehensive schools to . in basic vocational schools, and from . to . for separate classes. about . % of the total variance in this index may be attributed to differences between classes. conclusion: individual differences in substance abuse in youth could be partly explained by factors at the school level. * project no po d .

abstract background: rates of c-section in brazil are very high, . % in . brazil illustrates an extreme example of the medicalization of birth. c-section, like any major surgery, increases the risk of morbidity, which can persist long after discharge from hospital. objectives: to investigate how social, reproductive, prenatal care and delivery factors interact after hospital discharge, influencing post-partum complications. design and methods: a cross-sectional study of women gathered information through home interviews and clinical examination during the post-partum period. a hierarchical logistic regression model of factors associated with post-partum complications was applied. results: physical and emotional post-partum complications were almost twice as high among women having a c-section. most of this effect was associated with lower socioeconomic conditions, whose influence was mainly explained by a longer duration of delivery (even in the presence of medical indications) and less social support when returning home. conclusion: the risk of c-section complications is higher among women from the lower socioeconomic strata. social inequalities mediate the association between type of delivery and postpartum complications. discussion: c-section complications should be taken into account when decisions concerning type of delivery are made. social support after birth, from the public health sector, has to be provided for women in socioeconomic deprivation.

the relationship between unemployment and increased mortality has previously been reported in western countries. the aim of this study was to assess the influence of changes in the unemployment rate on survival in the general population in northern poland at the time of economic transition. to analyze the association between unemployment and the risk of death, we collected survival data from death certificates and data on rates of unemployment from regions of gdansk county for the period - . the kaplan-meier method and the cox proportional hazards model were used in univariate and multivariate analysis. the change in unemployment (percentage) in the year of death in the area of residence, sex and educational level ( categories) were included in the multivariate analysis. the change in the unemployment rate was associated with significantly worse overall survival (hazard ratio . , % confidence interval . to . ). the highest risk associated with the change in unemployment in the area of residence was for death from congenital defects (hazard ratio . , % confidence interval . to . ) and for death from cardiovascular diseases (hazard ratio . , % confidence interval . to . ).
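a minimal sketch, assuming an entirely hypothetical data set, of the kaplan-meier and cox proportional hazards analysis described in the unemployment-and-survival abstract above (the lifelines package is used here purely for illustration; file and column names are not from the original study):

import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter

# hypothetical individual-level data: follow-up time in years, death indicator,
# change in regional unemployment rate, and numerically coded sex and education
df = pd.read_csv("unemployment_survival.csv")

kmf = KaplanMeierFitter()
kmf.fit(durations=df["time"], event_observed=df["death"])  # overall survival curve

cph = CoxPHFitter()
cph.fit(df[["time", "death", "unemployment_change", "sex", "education"]],
        duration_col="time", event_col="death")
cph.print_summary()  # hazard ratios with confidence intervals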
abstract background: there is no evidence from randomized trials as to whether or not educational interventions improve voluntary reporting systems in terms of quantity or quality. objectives: evaluation of the effectiveness of educational outreach visits aimed at improving adverse drug reaction (adr) reporting by physicians. design and methods: a cluster-randomized controlled trial covering all health system physicians in northern portugal. four spatial clusters assigned to the intervention group (n = ) received outreach visits tailored to training needs detected in a previous study, and clusters were assigned to the control group (n = ). the main outcome was the total number of reported adr; the secondary outcome was the number of serious, unexpected, high-causality and new-drug-related adr. a follow-up was conducted for a period of months. results: the intervention increased reports as follows: total adr, . -fold (p< . ); serious adr, . -fold (p = . ); high-causality adr, . -fold (p< . ); unexpected adr, . -fold (p< . ); and new-drug-related adr, . -fold (p = . ). the intervention had its maximum effect during the first four months ( . -fold increase, p< . ), yet the effect was nonetheless maintained over the four-month periods post-intervention (p = . ). discussion and conclusion: physician training based on academic detailing visits improves reporting quality and quantity. this type of intervention could result in sizeable improvements in voluntary reporting in many countries.

there was no evidence of differences in absolute indications between the years. conclusion: most of the increase in rates in the period may be attributable to relative and non-medical indications. discussion: policies to promote rational use of c-sections should take into account the role played by obstetricians' convenience and the increased medicalization of birth on cesarean rates.

abstract background: the changing environment has led to unhealthy dietary habits and low physical activity of children, resulting in overweight/obesity and related comorbid conditions. objective: idefics is a five-year multilevel epidemiological approach proposed under the sixth eu framework to counteract the threatening epidemic of diet- and lifestyle-induced morbidity by evidence-based interventions. design and methods: a population-based cohort of . children aged to years will be established in nine european countries to investigate the aetiology of these diseases. culturally adapted multi-component intervention strategies will be developed, implemented and evaluated prospectively. results: idefics compares regional, ethnic and sex-specific distributions of the above disorders and their key risk factors in children within europe. the impact of sensory perception and genetic polymorphisms and the role of internal/external triggers of food choice and children's consumer behaviour are elucidated. risk profile inventories for children susceptible to obesity and its co-morbid conditions are identified. based on controlled intervention studies, an evidence-based set of guidelines for health promotion and disease prevention is developed. conclusions: provision of effective intervention modules, easy to implement in larger populations, may reduce future obesity-related disease incidence. discussion: transfer of feasible guidelines into practice requires the involvement of health professionals, stakeholders and consumers.

abstract background: non-medically indicated cesarean deliveries increase morbidity and health care costs.
brazil has one of the highest rates of caesarean sections in the world. variations in rates are positively associated with socioeconomic status. objectives: to investigate factors associated with cesarean sections in public and private sector wards in south brazil. design and methods: cross-sectional data from post-partum interviews and clinical records of consecutive deliveries ( in the main public and in a private maternity) were analyzed using logistic regression. results: multiple regression showed privately insured women having much higher cesarean rates than those delivering in public sector wards (or = . ; ci %: . - . ). obstetricians' individual rates varied from % to %. doctors working in both public and private sectors had higher rates of cesarean in private wards (p< . ). wanting and having a cesarean was significantly more common among privately insured women. conclusion: women from wealthier families are at higher risk of cesarean, particularly those wanting this type of delivery and whose obstetrician works in the private sector. discussion: women potentially at lower clinical risk are more likely to have a caesarean. the obstetricians' role and women's preferences must be further investigated to tackle this problem.

abstract background: in the netherlands, bcg-vaccination is offered to immigrant children and children of immigrant parents in order to prevent severe tuberculosis. the effectiveness of this policy has never been studied. objectives: assessing the effectiveness of the bcg-vaccination policy in the netherlands. design and methods: we used data on the size of the risk population per year (from statistics netherlands), the number of children with meningitis or miliary tuberculosis in the risk population per year, and the vaccination status of those cases (from the netherlands tuberculosis register) over the period - . we estimated the vaccine efficacy and the annual risk of acquiring meningitis or miliary tuberculosis by log-linear modelling, treating the vaccination coverage as missing data. results: in the period - , cases of meningitis or miliary tuberculosis were registered. the risk for unvaccinated children of acquiring such a serious tuberculosis infection was . ( %ci . - . ) per per year; the reduction in risk for vaccinated children was % ( %ci - %). conclusion and discussion: this means that, discounting future effects with %, an additional ( %ci: - ) children would need to be vaccinated to prevent one extra case of meningitis or miliary tuberculosis. given that bcg-vaccination is relatively inexpensive, the current policy could even be cost-saving.

abstract background: psychotic symptom experiences in the general population are frequent and often long-lasting. objectives: the zurich cohort study offered the opportunity of differentiating the patterns of psychotic experiences over a span of years. design and methods: the zurich study is based on a stratified community sample of persons born in (women) and (men). the data were collected at six time points since . we examined variables from two subscales of the scl- -r - 'paranoid ideation' and 'psychoticism' - using factor analysis, cluster analysis and polytomous logistic regression. results: two new subscales were derived, representing 'thought disorders' and 'schizotypal signs'. continuously high symptom load on one of these subscales (both subscales) was found in % ( . %) of the population.
cannabis use was the best predictor of continuously high symptom load in the 'thought disorders' subscale, whereas several variables representing adversity in childhood/youth were associated with continuously high symptom load in the 'schizotypal signs' subscale. conclusion and discussion: psychotic experiences can be divided into at least two different syndromes - thought disorders and schizotypal signs. despite similar longitudinal course patterns and also similar outcomes, these syndromes rely on different risk factors, thus possibly defining separate pathways to psychosis.

abstract background: the reasons for the rise in asthma and allergies remain unclear. to identify influential factors, several european birth cohort studies on asthma and allergic diseases have been initiated since . objective: the aim of one work package within the global allergy and asthma european network (ga len), sponsored by the european commission, was to identify and compare european birth cohorts specifically designed to examine asthma and allergic diseases. methods: for each study, we collected detailed information (mostly by personal visits) regarding the recruitment process, study setting, follow-up rates, subjective/objective outcomes and exposure parameters. results: by june , we had assessed european birth cohort studies on asthma and allergic diseases. the largest recruited over children each. most studies determined specific immunoglobulin e levels to various allergens or used the isaac questionnaire for the evaluation of asthma or allergic rhinitis symptoms. however, the assessment of other objective and subjective outcomes (e.g. lung function or definitions of eczema) was rather heterogeneous across the studies. conclusions: due to the unique cooperation within the ga len project, a common database was established containing the study characteristics of european birth cohorts on asthma and allergic diseases. the possibility of pooling data and performing meta-analyses is currently being evaluated.

abstract background: birth weight is an important marker of health in infancy and of health trajectories later in life. social inequality in birth weight is a key component in population health inequalities. objective: to comparatively study social inequality in birth weight in denmark, finland, norway, and sweden from to . design and methods: as part of the nordic collaborative project on health and social inequality in early life (norchase), register-based data covering all births in all involved countries - were linked with national registries on parental socioeconomic position, covering a host of different markers including income, education and occupation. also, nested cohort studies provide an opportunity to test hypotheses of mediation. results: preliminary results show that social inequality in birth weight, small for gestational age, and low birth weight has increased in denmark throughout the period. also, preliminary results from finland, norway and sweden will be presented. discussion: cross-country comparisons pose several methodological challenges. these challenges include characterizing the societal context of each country so as to correctly interpret inter-country differences in social gradients, along with dealing with differences in the data collection methods and classification schemes used by different national registries. also, strategies for influencing policy will be discussed.

abstract background: modifying the availability of suicide methods is a major issue in suicide prevention.
objectives: we investigated changes in the proportion of firearm suicides in western countries since the s, and their relation to changes in legislation and regulatory measures. design and methods: data from previous publications, from the who mortality database, and from the international crime victims survey (icvs) were used in a multilevel analysis. results: multilevel modelling of longitudinal data confirmed the effect of the proportion of households owning firearms on firearm suicide rates. several countries stand out with an obvious decline in firearm suicides since the s: norway, the united kingdom, canada, australia, and new zealand. in all of these countries, legislative measures have been introduced which led to a decrease in the proportion of households owning firearms. conclusion and discussion: the spread of firearms is a major determinant of the proportion of firearm suicides. legislative measures restricting the availability of firearms are a promising option in suicide prevention.

abstract background: fatigue is a non-specific but frequent symptom in a number of conditions, for which correlates are unclear. objectives: to estimate socio-demographic and clinical factors determining the magnitude of fatigue. methods: as part of a follow-up evaluation of a cohort of urban portuguese adults, socio-demographic and clinical variables for consecutive participants were collected through personal interview. lifetime history of chronic disease diagnoses (depression, cancer, cardiovascular, rheumatic, and respiratory conditions) was recorded, anthropometry was measured, and haemoglobin determined. krupp's -item fatigue severity scale was applied, and severe fatigue was defined as a mean score over . mean age (sd) was . ( . ) and . % of participants were females. logistic regression was used to compute adjusted odds ratios, and attributable fractions were estimated using the formula ar = 1 - Σ(pj/orj). results: adjusted for age and clinical conditions, female gender (or = . , %ci: . - . ) and education (under years of schooling: or = . , %ci: . - . ) were associated with severe fatigue. obesity (or = . , %ci: . - . ) and diagnosed cardiovascular disease (or = . , %ci: . - . ) also increased fatigue. attributable fractions were . % for gender, . % for education, . % for obesity, and . % for cardiovascular disease. conclusion: gender and education have a large impact on severe fatigue, and, to a lesser extent, so do obesity and cardiovascular disease.

abstract introduction: analysis of infant mortality allows the identification of factors contributing to death and the assessment of child health care quality. objective: to study characteristics of infant and fetal mortality using data from a committee for the prevention of maternal and infant mortality in sobral, brazil. methods: all cases of infant death between and were analyzed. medical records were reviewed and mothers were interviewed. using a tool to identify preventable deaths (seade classification, brazil), the committee characterized causes of death. meetings with governmental groups involved in family health care took place to identify factors contributing to death. results: in , infant mortality decreased from . to . . in the next years there was an increase from . to . . the increase in was due to respiratory illnesses; in , it was due to diarrhea. analysis of preventable deaths indicated a reduction from to deaths that could have been prevented by adequate gestational care, and an increase in deaths preventable by early diagnosis and treatment.
conclusion: pre-natal and delivery care improved, whereas care for children less than yr old worsened. analysis of the causes of death allowed a reduction of the infant mortality rate to .

abstract objective: to identify dietary patterns and their association with metabolic syndrome. design and methods: we evaluated non-institutionalised adults. diet was assessed using a semi-quantitative dietary frequency questionnaire, and dietary patterns were identified using principal components analysis followed by cluster analysis (k-means method) with bootstrapping (choosing the clusters presenting the lowest intra-cluster variance). metabolic syndrome (mets) was defined according to the ncep-atp-iii. results: the overall prevalence of metabolic syndrome was . %. in the population sample, clusters were identified in females - healthy; milk/soup; fast food; wine/low calories - and in males - milk/carbohydrates; codfish/soup; fast food; low calories. in males, using milk/carbohydrates as the reference and adjusting for age and education, high blood pressure (or = . ; %ci: . - . ) and high triglycerides (or = . ; %ci: . - . ) were associated with the fast food pattern, and the low-calories pattern presented a higher frequency of high blood pressure (or = . ; %ci: . - . ). in females, after age and education adjustment, no significant association was found between the dietary patterns identified and either metabolic syndrome or its individual features. conclusion: we found no specific dietary pattern associated with an increased prevalence of metabolic syndrome. however, a fast food diet was significantly more frequent in males with dyslipidemia and high blood pressure.

abstract aim: to determine the prevalence of stress urinary incontinence (sui) before and during pregnancy and following childbirth, and also to analyse the impact of a health education campaign on sui prevention following childbirth in the viana district, portugal. methods: participants (n = ), interviewed during hospitalization after birth and two months later at health centres, were divided into two groups: a first group not exposed and a second group exposed to a health education campaign. this second group was encouraged to perform an exercise programme and was given a 'sui prevention and treatment' brochure approved by the regional health authority. results: sui prevalence was . % ( %ci: . - . ) before pregnancy, . % ( %ci: . - . ) during pregnancy and . % ( %ci: . - . ) four weeks after birth. less than half of the women with sui sought help from healthcare professionals. statistically significant differences were found between groups: sui knowledge level and the practice of pelvic floor muscle re-education exercises were higher in the exposed group ( . and . times, respectively). conclusions: sui affects a great number of women, but only a small percentage reveals it. this campaign improved women's knowledge and modified their behaviours. healthcare professionals must be aware of this reality, providing an early and continuous intervention that would optimise the verified benefits of this campaign.

abstract background: social inequalities have been associated with poorer developmental outcomes, but little is known about the role of the area of residence. objectives: to examine whether the housing infrastructure of the area modifies the effect of the socio-economic conditions of the families on child development.
design and methods: community-based survey of under-fives in southern brazil applied hierarchical multi-level linear regression to investigate determinants of child development, measured by a score from the denver developmental screening test. results: in multivariable models, the mean score of child development increased with maternal and paternal education and work qualification, family income and better housing and was higher when the mother was in paid work (all p< . ). paternal education had an effect in areas of lower housing quality only; the effect of occupational status and income in these areas were twice as large as in better-provided areas (likelihood test for all interactions p< . ). this model explained % of the variation in developmental score between the areas of residence. conclusion: the housing quality and sanitation of the area modified the effects of socioeconomic conditions on child development. discussion: housing and sanitation programs are potentially beneficial to decrease the negative effect of social disadvantage on child development. abstract background: it is known that both genetic and environmental factors are involved in the early development of type diabetes (t d), and that incidence varies geographically. however we still need to explain why there is variation in incidence. objectives: in order to better understand the role of non-genetic factors, we decided to examine whether prevalence of newborns with high risk genotypes or islet autoantibodies varies geographically. design and methods: the analysis was performed on a cohort of newborns born to non-diabetic mothers, between september and august , who were included in diabetes prediction in ska˚ne study (dipis) in sweden. neighbourhoods were defined by administrative boundaries and variation in prevalence was investigated using multi-level regression analysis. results: we observed that prevalence of newborns with islet autoantibodies differed across the municipalities of ska˚ne (s = . , p < . ), with highest prevalence found in wealthy urban areas. however there was no observed difference in the prevalence of newborns with high risk genes. conclusion and discussion: newborns born with autoantibodies to islet antigens appear to cluster by region. we suggest that non-genetic factors during pregnancy may explain some of the geographical variation in the incidence of t d. abstract background: risk assessment is a science-based discipline used for decision making and regulatory purposes, such as setting acceptable exposure limits. estimation of risks attributed to exposure to chemical substances are traditionally mainly the domain of toxicology. it is recognized, however, that human, epidemiologic data, if available, are to be preferred to data from laboratory animal experiments. objectives: how can epidemiologic data be used for (quantitative) risk assessment? results: we described a framework to conduct quantitative risk assessment based on epidemiological studies. important features of the process include a weight-of-theevidence approach, estimation of the optimal exposure-risk function by fitting a regression model to the epidemiological data, estimation of uncertainty introduced by potential biases and missing information in the epidemiological studies, and calculation of excess lifetime risk through a life table to take into account competing risks. sensitivity analyses are a useful tool to evaluate the impact of assumptions and the variability of the underlying data. 
conclusion and discussion: many types of epidemiologic data, ranging from published, sometimes incomplete data to detailed individual data, can be used for risk assessment. epidemiologists should better facilitate such use of their data, however. abstract background: high-virulence h. pylori (hp) strains and smoking increase the risk of gastric precancerous lesions. its association with specific types of intestinal metaplasia (im) in infected subjects may clarify gastric carcinogenesis pathways. objectives: to quantify the association between types of im and infection with highvirulence hp strains (simultaneously caga+, vacas and va-cam ) and current smoking. design and methods: male volunteers (n = ) underwent gastroscopy and completed a self-administered questionnaire. participants were classified based on mucin expression patterns in biopsy specimens (antrum, body and incisura). hp vaca and caga were directly genotyped by pcr/reverse hybridization. data were analysed using multinomial logistic regression (reference: normal/superficial gastritis), models including hp virulence, smoking and age. results: high-virulence strains increased the risk of all im types (complete: or = . , %ci: . - . ; incomplete: or = . , %ci: . - . ; mixed: or = . , %ci: . - . ) but smoking was only associated with an increased risk of complete im (or = . %ci: . - . ). compared to non-smokers infected with lowvirulent strains, infection with the high-virulence hp increased the risk of im similarly for smokers (or = . , %ci: . - . ) and non-smokers (or = . %ci: . - . ). conclusion: gastric precancerous lesions, with different potential for progression, are differentially modulated by hp virulence and smoking. the risk of im associated with high-virulence hp is not further increased by smoking. abstract background: in may , the portuguese government created the basic urgency units (buu). these buu must attend at least . persons, be open hours per day, and be at maximum minutes of distance to all the users. objectives: determine the optimal location of buu, considering the existing health centers, in the viseu district, north portugal. methods: from a matrix of distances between population and health centers an accessibility index was created (sum of distances traveled by population to reach a buu). the location-allocation models were used to create simulations based on p-median model, maximal covering location problem (mclp) and set covering location problem (sclp). the solutions were ranked by weighting the variables of accessibility ( %), number of doctors in the health centers ( %), equipments ( %), distance/time ( %) and total number of buu ( %). results: the best solution has buu, doctors, attends users and the accessibility index is . km. conclusions: it was proved that it is impossible to attend all the criterion for creation of a buu. in some areas with low population density, to sum at least persons in a buu, the travel time is necessarily more than hour. background: a prospective observational study of fatigue in staff working a day/ off/ night/ off roster of hour shifts was conducted at a fly-in/fly-out fertilizer mine in remote northern australia. objectives: to determine whether fatigue in staff increased: from the start compared to the finish of shift; with the number of consecutive shifts; and from day-compared to nightshift. 
methods: data from sleep diaries, the mackworth clock test and the swedish occupational fatigue inventory were obtained at the start and finish of each shift from august to november . results: a total of staff participated in the study. reaction times, sleepiness and lack of energy scores were highest at the finish of nights to . the reaction times increased significantly at both the start and finish of day onwards, and at the finish of night . reaction times and lack of motivation were highest during nightshift. conclusions: from the above results, a disturbed diurnal rhythm and decreased motivation during night-shift, together with a roster of more than eight consecutive shifts, can be inferred as the primary contributors to staff fatigue. discussion: the implications for changes to workplace practices and environment will be discussed. the aim of this survey was to assess the impact of a meta-analysis comparing resurfacing with non-resurfacing of the patella on the daily practice of experts. participants in this study were experts who had participated in a previous survey on personal preferences regarding patella resurfacing. these experts in the field of patella resurfacing were identified by a thorough search of medline, an internet search (with the google search engine), and personal references from the identified experts. participants of the 'knee arthroplasty trial' (kat) in the united kingdom were also included. two surveys were sent to the participants, one before and one after the publication of the meta-analysis. the response rate was questionnaires, or %. the vast majority of responders were not persuaded to change their practice after reading the meta-analysis. this is only in part due to the fact that best evidence and practice coincide. other reasons given are methodology related, an observation which is shared by the authors of the review and which forces the orthopedic community to improve its research methodology. reasons such as 'i do not believe in meta-analysis' either demand a fundamental discussion or demand that readers take evidence-based medicine more seriously. abstract background: patients with type diabetes (dm ) have a - fold increased risk of cardiovascular disease. delegating routine tasks and computerized decision support systems (cdss) such as the diabetes care protocol (dcp) may improve treatment of the cardiovascular risk factors hba c, blood pressure and cholesterol. dcp includes consultation-hours exclusively scheduled for dm patients, rigorous delegation of routine tasks in diabetes care to trained paramedics, and software to support medical management. objective: to investigate the effects of dcp, used by practice assistants, on the risk of coronary heart disease for patients with dm . design and methods: in an open-label pragmatic trial in general practices with patients, hba , blood pressure and cholesterol were examined before and prospectively one year after implementation of dcp. the primary outcome was the change in the year ukpds coronary heart disease (chd) risk estimate. results: the median year ukpds chd risk estimate improved significantly from . % to . %. hba decreased from . % to . %, systolic blood pressure from . to . mmhg and total cholesterol from . to . mmol/l (all p< . ). conclusion: delegating routine tasks in diabetes care to trained paramedics and using cdss improves the cardiovascular risk profile of dm patients. 
tuberculosis in exposed immigrants by tuberculin skin test, ifn-g tests and epidemiologic characteristics. abstract background: currently, immigrants in western countries are only investigated for active tuberculosis (tb) by use of a chest x-ray. recent latent tuberculosis infection (ltbi) is hard to diagnose in this specific population because the only available test method, the tuberculin skin test (tst), has a low positive predictive value (ppv). recently, interferon-gamma (ifn-g) tests have become available that measure cellular responses to specific m. tuberculosis antigens and might have a better ppv. objective: to determine the predictive value of tst and two different ifn-g tests combined with epidemiological characteristics for developing active tb in immigrants who are close contacts of smear positive tb patients. methods: in this prospective cohort study, close contacts will be included. demographic characteristics and exposure data are investigated. besides their normal examination, they will all have a tst. two different ifn-g tests will be done in those with a tst induration of ≥ mm. these contacts will be followed for years to determine the occurrence of tb. results: since april , municipal health services have started with the inclusion. preliminary results on the predictive value of tst, both ifn-g tests and epidemiological characteristics will be presented. abstract background: different factors contribute to the quality of ed (emergency department) care of an injured patient. objective: to determine factors influencing the disagreement between er diagnoses and those assigned at hospital admission in injured patients, and to evaluate whether disagreement between the diagnoses could have worsened the outcome. methods: all the er visits of the emergency departments of the lazio region for unintentional injuries followed by hospitalisation in were analysed. concordant diagnoses were established on the basis of the barell matrix cells. logistic regression was used to assess the role of individual and er care factors on the probability of concordance. a logistic regression where death within days was the outcome and concordance the determinant was also used. results: , injury er visits were considered. in . % of cases, the er and discharge diagnoses were concordant. higher concordance was found with increasing age and less urgent cases. factors influencing concordance were: the hour of the visit, er level, initial outcome, and length of stay in hospital. patients who had non-concordant diagnoses had a % higher probability of death. conclusions: a correct diagnosis at first contact with the emergency room is associated with lower mortality. methods: a cohort of consecutive patients treated for secondary peritonitis were sent the posttraumatic stress syndrome inventory (ptss- ) and impact of events scale-revised (ies-r) - years following their surgery for secondary peritonitis. results: from the patients operated upon between and , questionnaires were sent to the long-term survivors, of which % responded (n = ). ptsd-related symptoms were found in % of patients by both questionnaires. patients admitted to icu (n = ) were significantly older, with higher apache-ii scores, but reported similar ptsd symptomatology scores compared to non-icu patients (n = ). traumatic memories during icu and hospital stay were most predictive of higher scores. 
adverse memories did not occur more often in the icu group than in the hospital-ward group. conclusions: long-term ptsd-related symptoms in patients with secondary peritonitis were very . in the netherlands. design and methods: we used the population-based databases of the netherlands cancer registry, the eindhoven cancer registry (ecr) and the central bureau of statistics. patients from the ecr were followed until - - for vital status and relative survival was calculated. results: the number of breast cancer cases increased from in to . in , an annual increase of . % (p< . ). the death rate decreased , % annually (p< . ), which resulted in deaths in . the relative -yr survival was less than % for patients diagnosed in the seventies; this increased to over % for patients diagnosed since . patients with stage i disease even have a % -yr relative survival. conclusion: the alarming increase in breast cancer incidence is accompanied by a substantial improvement in survival rates. this results in a large number of women (ever) diagnosed with breast cancer, about , in , of whom % demand some kind of medical care. abstract background: nine percent of the population in the netherlands belongs to non-western ethnic minorities. their perceived health is worse and their health care use differs from that of dutch natives. objectives: which factors are associated with ethnic differences in self-rated health? which factors are associated with differences in utilisation of gp care? methods: during one year, all contacts with gps were registered. adult surinam, antillean, turkish, moroccan and dutch responders were included (total n: . ). we performed multivariate analyses of the determinants of self-rated health and of the number of contacts with gps. results: self-rated health differs from that of the native dutch: surinam/antillean (or . ) and turkish/moroccan patients (or . / . ), especially turkish/moroccan females. more turks visit the gps at least once a year (or . ). fewer surinamese (or . ) and antillean patients (or . ) visit their gps than the dutch do. people from ethnic minorities in good health visit their gps more often ( . - . consults per year vs. . ). incidence rates of acute respiratory infections and chest complaints were significantly higher than in the dutch. conclusions: ethnicity is independently associated with self-rated health. higher use of gp care by ethnic minorities in good health points towards possible inappropriate use of resources. the future: do they fulfil it? first results of the limburg youth monitoring project. abstract background: incidence of coronary heart disease (chd) and stroke can be estimated from local, population-based registers. it is unclear to what extent local register data are applicable on a nationwide level. therefore, we compared german register data with estimates derived with the who global burden of disease (gbd) method. methods: incidence of chd and stroke was computed with the gbd method using official german mortality statistics and prevalences from the german national health survey. results were compared to estimates from the kora/monica augsburg register (chd) and the erlangen stroke project in southern germany. results: gbd estimates and register data showed good agreement: chd (age group - years) , (gbd) versus , (register) and stroke (all ages) , versus , incident cases per year. 
chd incidence among all age groups was estimated with the gbd method to be , per year (no register data available). chd incidence in men and stroke incidence in women were underestimated with the gbd method as compared to register data. conclusions: gbd method is a useful tool to estimate incidence of chd and stroke. the computed estimates may be seen as lower limit for incidence data. differences between gbd estimates and register data are discussed. abstract background: children with mental retardation (mr) are a vulnerable not much studied population. objectives: to investigate psychopharmacotherapy in children with mr and to examine possible factors associated with psychopharmacotherapy. methods: participants were recruited through all facilities for children with mental retardation in friesland, the netherlands, resulting in participants, - years old, including all levels of mental retardation. the dbc and the pdd-mrs were used to assess general behavior problems and pervasive developmental disorders (pdd). information on medication was collected through a parent-interview. logistic regression was used to investigate the relationship between the psychotropic drug use and the factors dbc, pdd, housing, age, gender and level of mr. results: % of the participants used psychotropic medication. main factors associated with receiving psychopharmacotherapy were pdd (or . ) and dbc score (or . ). living away from home and mr-level also played a role whereas gender and age did not. dbc score was associated with clonidine, stimulants and anti-psychotics. pdd was the main factor associated with anti-psychotics use (or . ). discussion: psychopharmacotherapy is especially prevalent among children with mr and comorbid pdd and general behavior problems. although many psychotropic drugs are used off-label, specific drugs were associated with specific psychiatric or behavior problems. abstract background: increased survival in children with cancer has raised interest on the quality-of-life of long-term survivors. objective: to compare educational outcomes of adult survivors of childhood cancer and healthy controls. methods: retrospective cohort study including a sample of adult survivors ( ) treated for childhood cancer in the three existing italian paediatric research hospitals. controls ( ) were selected among siblings, relatives or friends of survivors. when these controls were not available, a search was carried out in the same area of residence of the survivors though random digit dialling. data collection was carried out through a telephone-administered structured questionnaire. results: significantly more survivors than controls needed school support (adjusted odds ratio -oradj- . , % ci . - . ); failed at least a grade after disease onset (oradj . , % ci . - . ); achieved a lower educational level (oradj . , % ci . - . ) and did not reach an educational level higher than their parents' (oradj . , % ci . - . ). subject's age, sex, parents' education and area of residence were taken into account as possible confounders. conclusions: these findings suggest the need to provide appropriate school support to children treated for childhood cancer. abstract background: in italy supplementation with folic acid (fa) in the periconceptional period to prevent congenital malformations (cms) is quite low. the national health institute has recently launched ( ) a programme to improve awareness about the role of fa in reducing the risk of serious defects also by providing . 
mg fa tablets free of charge to women planning a pregnancy. objectives: we analysed cms that are or may be sensitive to fa supplementation in order to establish an adequate baseline to allow a fa impact assessment in the next years and to investigate spatial differences among cms registries, time trends and time-space interactions. design and methods data collected over - by the italian registries members of eurocat and icbdsr on births and induced abortions with neural tube defects, ano-rectal atresia, omphalocele, oral clefts, cardiovascular, limb reduction and urinary system defects. results: all the cms showed statistically significant differences among registries with the exception of ano-rectal atresia. the majority of cms by registry showed stable or increasing trends over time. conclusions results show the importance of fa intake during the periconceptional period. differences among registries indicate also the need of having a baseline for each registry to follow trends over time. abstract country-specific resistance proportions are more biased by variable sampling and ascertainment procedures than incidence rates. within the european antimicrobial resistance surveillance system (earss) resistance incidence rates and proportions can be calculated. in this study, the association between antimicrobial resistance incidence rates and proportions and the possible effect of differential sampling of blood cultures was investigated. in , earss collected routine antimicrobial susceptibility test data from invasive s. aureus isolates, tested according to standard protocols. via a questionnaire denominator information was collected. the spearman correlation coefficient and linear regression were used for statistical analysis. this year, of hospitals and of laboratories from of earss countries responded to the questionnaire. they reported of, overall, , s. aureus isolates. in the different countries, mrsa proportions ranged from < % to % and incidence rates per , patient days from . ae - to . ae - . overall, the proportions and rates highly correlated. blood culturing rates only influenced the relationship between mrsa resistance proportions and incidence rates for eastern european countries. in conclusion, resistance proportions seem to be very similar to resistance incidence rates, in the case of mrsa. nevertheless, this relationship appears to be dependent of some level of blood culturing. . key: cord- - dqpxumn authors: shuja, junaid; alanazi, eisa; alasmary, waleed; alashaikh, abdulaziz title: covid- open source data sets: a comprehensive survey date: - - journal: appl intell doi: . /s - - - sha: doc_id: cord_uid: dqpxumn in december , a novel virus named covid- emerged in the city of wuhan, china. in early , the covid- virus spread in all continents of the world except antarctica, causing widespread infections and deaths due to its contagious characteristics and no medically proven treatment. the covid- pandemic has been termed as the most consequential global crisis since the world wars. the first line of defense against the covid- spread are the non-pharmaceutical measures like social distancing and personal hygiene. the great pandemic affecting billions of lives economically and socially has motivated the scientific community to come up with solutions based on computer-aided digital technologies for diagnosis, prevention, and estimation of covid- . some of these efforts focus on statistical and artificial intelligence-based analysis of the available data concerning covid- . 
all of these scientific efforts necessitate that the data brought to service for the analysis should be open source to promote the extension, validation, and collaboration of the work in the fight against the global pandemic. our survey is motivated by the open source efforts that can be mainly categorized as (a) covid- diagnosis from ct scans, x-ray images, and cough sounds, (b) covid- case reporting, transmission estimation, and prognosis from epidemiological, demographic, and mobility data, (c) covid- emotional and sentiment analysis from social media, and (d) knowledge-based discovery and semantic analysis from the collection of scholarly articles covering covid- . we survey and compare research works in these directions that are accompanied by open source data and code. future research directions for data-driven covid- research are also debated. we hope that the article will provide the scientific community with an initiative to start open source extensible and transparent research in the collective fight against the covid- pandemic. the covid- virus has been declared a pandemic by the world health organization (who) with more than ten million cases and deaths across the world as per who statistics of june [ ] . covid- is caused by severe acute respiratory syndrome coronavirus (sars-cov- ) and was declared pandemic by who on march , . the cure to covid- can take several months due to its clinical trials on humans of varying ages and ethnicity before approval. the cure to covid- can be further delayed due to possible genetic mutations shown by the virus [ ] . the pandemic situation is affecting billions of people socially, economically, and medically with drastic changes in social relationships, health policies, trade, work, and educational environments. the global pandemic is a threat to human society and calls for immediate actions. the covid- pandemic has motivated the research community to aid front-line medical service staff with cutting edge research for mitigation, detection, and prevention of the virus [ ] . scientific community has brainstormed to come up with ideas that can limit the crisis and help prevent future such pandemics. other than medical science researchers and virology specialists, scientists supported with digital technologies have tackled the pandemic with novel methods. two significant scientific communities aided with digital technologies can be identified in the fight against covid- . the main digital effort in this regard comes from the artificial intelligence (ai) community in the form of automated covid- detection from computed tomography (ct) scans and x-ray images. the second such community aided by digital technologies is of mathematicians and epidemiologists who are developing complex virus diffusion and transmission models to estimate virus spread under various mobility and social distancing scenarios [ ] . besides these two major scientific communities, efforts are being made for analyzing social and emotional behavior from social media [ ] , collecting scholarly articles for knowledge-based discovery [ ] , detection covid- from cough samples [ ] , and automated contact tracing [ ] . artificial intelligence (ai) and machine learning (ml) techniques have been prominently used to efficiently solve various computer science problems ranging from bio-informatics to image processing. ml is based on the premise that an intelligent machine should be able to learn and adapt from its environment based on its experiences without explicit programming [ , ] . 
ml models and algorithms have been standardized across multiple programming languages such as, python and r. the main challenge to the application of ml models is the availability of the open source data [ , ] . given publicly available data sets, ml techniques can aid the fight against covid- on multiple fronts. the principal such application is ml based covid- diagnosis from ct scans and x-rays that can lower the burden on short supplies of reverse transcriptase polymerase chain reaction (rt-pcr) test kits [ , ] . similarly, statistical and epidemiological analysis of covid- case reports can help find a relation between human mobility and virus transmission. moreover, social media data mining can provide sentiment and socio-economic analysis in current pandemic for policy makers. therefore, the covid- pandemic has necessitated collection of new data sets regarding human mobility, epidemiology, psychology, and radiology to aid scientific efforts [ ] . it must be noted that while digital technologies are aiding in the combat against covid- , they are also being utilized for spread of misinformation [ ] , hatred [ ] , propaganda [ ] , and online financial scams [ ] . data is an essential element for the efficient implementation of scientific methods. two approaches are followed by the research community while performing scientific research. the research methods and data are either closedsource to protect proprietary scientific contributions or open source. the open source research leads to higher usability, verifiability, transparency, quality, and collaborative research [ , ] . in the existing covid- pandemic, the open source approach is deemed more effective for mitigation and detection of the covid- virus due to its aforementioned characteristics. specifically, open source covid- diagnosis techniques are necessary to gain trust of medical staff and patients while engaging the research community across the globe. we emphasize that the covid- pandemic demands a unified and collaborative approach with open source data and methodology so that the scientific community across the globe can join hands with verifiable and transparent research [ , ] . the combination of ai and open source data sets produces a practical solution for covid- diagnosis that can be implemented in hospitals worldwide. automated ct scan based covid- detection techniques work with training the learning model on existing ct scan data sets that contain labeled images of covid- positive and normal cases. similarly, the detection of covid- from cough requires both normal and infected samples to learn and distinguish features of the infected person from a healthy person. therefore, it is necessary to provide open source data sets and methods so that (a) researchers across globe can enhance and modify existing work to limit the global pandemic, (b) existing techniques are verified for correctness by researchers across the board before implementation in real-world scenarios, and (c) researchers collaborate to aggregate data sets and enhance the performance of ai/ml methods in community-oriented research and development [ ] [ ] [ ] . the fruits of open source science can be seen in abundance among the community. some of the leading hospitals across the world are utilizing ai/ml algorithms to diagnose covid- cases from ct scans/x-ray images after preliminary trails of the technology. efforts have been made on surveying the role of ict in combating covid- pandemic [ , ] . 
specifically, the role of ai, data science, and big data in the management of covid- has been surveyed. researchers [ ] surveyed ai-based techniques for data acquisition, segmentation, and diagnosis for covid- . the article was not focused on works that are accompanied by publicly available data sets. moreover, the article focused only on the applications of ml towards medical diagnosis. authors [ ] listed publicly available medical data sets for covid- . the work did not detail the ai applications of the data set and textual and cough based data sets. latif et al. [ ] reviewed data science research focusing on mitigation and diagnosis of covid- . the listed surveys mention few open source data sets and point towards the unavailability of open data resources challenging trustworthy and realworld operations of ai/ml-based techniques. moreover, the critical analysis of possible ai innovations tackling covid- have also highlighted open data as the first step towards the right direction [ ] [ ] [ ] . triggered by this challenge limiting the adoption of ai/ml-powered covid- diagnosis, forecasting, and mitigation, we make the first effort in surveying research works based on open source data sets concerning covid- pandemic. the contributions of this article are manifold. • we formulate a taxonomy of the research domain while identifying key characteristics of open source data sets in terms of their type, applications, and methods. • we provide a comprehensive survey of the open-source covid- data sets while categorizing them on data type, i.e., biomedical images, textual, and speech data. with each listed data set, we also describe the applied ai, big data, and statistical techniques. • we provide a comparison of data sets in terms of their application, type, and size to provide valuable insights for data set selection. • we highlight the future research directions and challenges for missing or limited data sets so that the research community can work towards the public availability of the data. we are motivated by the fact that this survey will help researchers in the identification of appropriate open source data sets for their research. the comprehensive survey will also provide researchers with multiple directions to embark on an open data powered research against covid- . most of the articles included in this survey have not been rigorously peer-reviewed and published as preprints. however, their inclusion is necessary as the current pandemic situation requires rapid publishing process to propagate vital information on the pandemic. moreover, the inclusion of non-peer-reviewed studies in this article is supported by their open source methods which can be independently verified. for the collection of the relevant literature review, we searched the online databases of google scholar, biorxiv, and medrxiv. the keywords employed were "covid- " and "data set". we separately searched two online open source communities, i.e., kaggle and github for data sets that are not yet part of any publication. we focused on articles with applications of computer science and mathematics in general. we hope that our efforts will be fruitful in limiting the spread of covid- through elaboration of open source scientific fact-finding efforts. the rest of the article is organized as follows. in section , we detail the taxonomy of the research domain. section presents the comprehensive list of medical covid- data sets divided into categories of ct scans and x-rays. 
section details a list of textual data sets classified into covid- case report, cse report analysis, social media data, mobility data, npi data, and scholarly article collections. section lists speech based data sets that diagnose covid- from cough and breathing samples. in section , a comparison of listed data sets is provided in terms of openness, application, and data-type. section discusses the dimensions that need attention from scholars and future perspectives on covid- open source research. section provides the concluding remarks for the article. modern information and communication technologies (ict) help in combating covid- on many fronts. these include research efforts towards: • ai/ml based covid- diagnosis and screening from medical images [ ] . • covid- case reports for transmission estimation and forecasting [ ] . • covid- emotional and sentiment analysis from social media [ ] . • semantic analysis of knowledge from the collection of scholarly articles covering covid- [ ] . • application of ai enabled robots and drones to deliver food, medicine, and disinfect places [ ]. • ai and ml based methods to find and evaluate drugs and medicines [ ] . • smart device based monitoring of lock-down and quarantine measures [ ] . • speech based breathing rate and stress detection [ ] . • cough based covid- detection [ ] . some of these ict powered techniques are data-driven. for example, detection of covid- from medical images and cough samples requires samples from both non-covid and covid- positive cases. we focus on data-driven applications of ict that are also open source. we divide the data sets into three main categories, i.e., (a) medical images, (b) textual data, and (c) speech data. medical images based data sets are mostly brought into service for screening and diagnosis of covid- . medical images can belong to the class of ct scan, xrays, mrt, or ultrasound. most of the covid- data sets contain either ct scans or x-rays [ ] . therefore, we categorize medical data sets into ct scans and x-ray classes. moreover, some of the data sets consist of multiple types of images. medical image-based diagnosis facilities are available at most hospitals and clinics. as covid- test kits are short in supply and costly [ ] , medical imagebased diagnosis lowers the burden on conventional pcr based screening. medical image data sets are often preprocessed with segmentation and augmentation techniques. in medical images, segmentation leads to partitions of the image such that region of interest (infected region) is identified [ ] . image augmentation techniques include geometric transforms and kernel filters such that the size of the data set is enhanced. consequently, the application of ml techniques that require bigger data sets is made possible avoiding overfitting [ ] . the application of ml techniques on medical images has also led to the prediction of hospital stay duration in patients [ ] . medical image data sets should consider patient's consent and preserve patient privacy [ ] . textual data sets serve four main purposes. • forecasting the visualizing and spread of covid- based on reported cases. • analyzing public sentiment/opinion by tracking covid- related posts on popular social media platforms. • collecting scholarly articles on covid- for a centralized view on related research and application of information extraction/text mining. • studying the effect of mobility on covid- transmissions. 
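the geometric-transform and kernel-filter augmentation mentioned above for medical image data sets can be illustrated with a short sketch. the pipeline below is a generic python/pytorch illustration; the library choice, parameter values, and folder layout ("data/train/<class>/...") are assumptions made for the example rather than settings taken from any surveyed work.

# minimal sketch of geometric-transform and kernel-filter augmentation for chest images
# assumed layout: data/train/<class_name>/*.png; parameter values are illustrative
import torch
from torchvision import datasets, transforms

train_transforms = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # x-rays are single channel
    transforms.Resize((224, 224)),                # common input size for pretrained cnns
    transforms.RandomRotation(10),                # small geometric perturbation
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.GaussianBlur(kernel_size=3),       # kernel-filter style augmentation
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("data/train", transform=train_transforms)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

each randomly transformed copy acts as an additional training sample, which is how the surveyed works enlarge small covid- image collections and reduce overfitting.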
the covid- case reports can be further applied to: (a) compile data on regional levels (city, county, etc) [ ] , (b) visualize cases and forecast daily new cases, recoveries, etc [ ] , (c) study effects of mobility on number of new cases [ ] , and (d) study effect of non-pharmaceutical interventions (npi) on covid- cases [ , ] . speech data sets contain cough and breath sounds that can be employed to diagnose covid- and its predict disease severity. ml, big data, and statistical techniques can be applied to the data sets for tasks related to prediction. figure illustrates a consolidated view of the taxonomy of covid- open source data sets [ , ] . medical images in the form of chest ct scans and x-rays are essential for automated covid- diagnosis. the ai powered covid- diagnosis techniques can be as accurate as a human, save radiologist time, and perform diagnosis cheaper and faster than the common laboratory methods. we discuss ct scan and x-ray image data sets separately in the following subsections. cohen et al. [ ] describe the public covid- image collection consisting of x-ray and ct-scans with ongoing updates. the data set consists of more than images extracted from various online publications and websites. the data set specifically includes images of covid- cases along with mers, sars, and ards based images. the authors enlist the application of deep and transfer learning on their extracted data set for identification of covid- while utilizing motivation from earlier studies that learned the type of pneumonia from similar images [ ] . each image is accompanied by a set of attributes such as patient id, age, date, and location. the extraction of ct scan images from published articles rather than actual sources may lessen the image quality and affect the performance of the machine learning model. some of the data sets available and listed below are obtained from secondary sources. the public data set published with this study is one of the pioneer efforts in covid- image based diagnosis and most of the listed studies utilize this data set. the authors also listed perspective use cases and future directions for the data set [ ] . researcher [ ] published a data set consisting of ct scans of covid- positive patients. the data set is extracted from medrxiv and biorxiv preprints about covid- . the authors also employed a deep convolutional network for training on the data set to learn covid- cases for new data with an accuracy of around %. the model is trained on covid- positive ct scans and negative cases. the model is tested on covid- positive cts and non-covid cts and achieves an f score of . . due to the small data set size, deep learning models tend to overfit. therefore, the authors utilized transfer learning on the chest x-ray data set released by nih to fine-tune their deep learning model. the online repository is being regularly updated and currently consists of ct images containing clinical findings of patients. wang et al. [ ] investigated a deep learning strategy for covid- screening from ct scans. a total of covid- pathogen-confirmed ct scans were utilized along with typical viral pneumonia cases. the covid- ct scans were obtained from various chinese hospitals with an online repository maintained at. transfer learning in the form of pre-trained cnn model (m-inception) was utilized for feature extraction. a combination of decision tree and adaboost were employed for classification with . % accuracy. figure depicts a generic work-flow of ml based covid- diagnosis [ , ] . 
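the transfer-learning strategy recurring in the ct-scan studies above (a backbone pretrained on a large natural-image corpus and then fine-tuned on a small covid- image set) can be sketched as follows. the backbone, class count, and hyper-parameters below are illustrative assumptions, not the settings of any cited paper.

# transfer-learning sketch: pretrained backbone with a new classification head
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2  # assumed binary setup: covid positive vs. non-covid

model = models.densenet121(pretrained=True)                              # imagenet weights
model.classifier = nn.Linear(model.classifier.in_features, num_classes)  # new head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_one_epoch(loader):
    # loader is assumed to yield (image batch, label batch) pairs
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

freezing the early convolutional layers and training only the new head is a common variant when the labeled set is very small.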
the data set containing medical images is pre-processed with segmentation and augmentation techniques if necessary. afterward, a ml pretrained, fine-tuned, or built from scratch model is used to extract features for classification. the number of outputs classed is defined such as covid- positive, negative, viral pneumonia. a classifier can classify each image from the data set based on its extracted feature [ ] . segmentation helps health service providers to quickly and objectively evaluate the radiological images. segmentation is a pre-processing step that outlines the region of interest, e.g., infected regions or lesions for further evaluation. shan et al. [ ] obtained ct scan images from covid- cases based mostly in shanghai for deep learning-based lung infection quantification and segmentation. however, their data set is not public. the deep learning-based segmentation utilizes vb-net, a modified -d convolutional neural network (cnn), to segment covid- infection regions in ct scans. the proposed system performs auto-contouring of infection regions, accurately estimates their shapes, volumes, and percentage of infection. the system is trained using covid- patients and validated using new covid- patients. radiologists contributed as a human in the loop to iteratively add segmented images to the training data set. two radiologists contoured the infectious regions to quantitatively evaluate the accuracy of the segmentation technique. the proposed system and manual segmentation resulted in % dice similarity coefficients. other than the published articles, few online efforts have been made for image segmentation of covid- cases. a covid- ct lung and infection segmentation data set is listed as open source [ , ] . the data set consists of covid- ct scans labeled into left, right, and infectious regions by two experienced radiologists and verified by another radiologist. three segmentation benchmark tasks have also been created based on the data set. another such online initiative is the covid- ct segmentation data set. the segmented data set is hosted by two radiologists based in oslo. they obtained images from a repository hosted by the italian society of medical and interventional radiology (sirm). the obtained images were segmented by the radiologist using labels, i.e., ground-glass, consolidation, and pleural effusion. as a result, a data set that contains axial ct slices from patients with manual segmentation in the form of jpg images is formed. moreover, the radiologists also trained a d multi-label u-net model for automated semantic segmentation of images. in this subsection, we list covid- data set initiatives that are public but are not associated with any publication. the coronacases initiative shares d ct scans of confirmed cases of covid- . currently, the web repository contains d ct images of confirmed covid- cases shared for scientific purposes. the british society of thoracic imaging (bsti) in collaboration with cimar uk's imaging cloud technology deployed a free to use encrypted and anonymised online portal to upload and download medical images related to covid- positive and suspected patients. the uploaded images are sent to a group of bsti experts for diagnosis. each reported case includes data regarding the age, sex, pcr status, and indications of the patient. the aim of the online repository is to provide covid- medical images for reference and teaching. the sirm is hosting radiographical images of covid- cases. 
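segmentation quality in the studies above is reported with the dice similarity coefficient; a minimal implementation for binary masks is given below (the array convention is an assumption, not code from the cited works).

# dice similarity coefficient for binary segmentation masks
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    # pred and target are {0,1} or boolean arrays of identical shape
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# toy example: masks overlapping on 2 of 3 foreground pixels each -> about 0.67
a = np.array([[1, 1, 0], [0, 1, 0]])
b = np.array([[1, 1, 0], [0, 0, 1]])
print(dice_coefficient(a, b))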
their data set has been utilized by some of the cited works in this article. another open source repository for covid- radiographic images is radiopaedia. multiple studies [ , ] employed this data set for their research. researchers [ ] present covid-net, a deep convolutional network for covid- diagnosis based on chest x-ray images. motivated by earlier efforts on radiography based diagnosis of covid- , the authors make their data set and code accessible for further extension. the data set consists of , chest radiography images from , patient cases from three open access data repositories. the covid-net architecture consists of two stages. in the first stage, residual architecture design principles are employed to construct a prototype neural network architecture to predict either of (a) normal, (b) non-covid infection, and (c) covid- infections. in the second stage, the initial network design prototype, data, and human-specific design requirements act as a guide to a design exploration strategy to learn the parameters of deep neural network architecture. the authors also audit the covid-net with the aim of transparency via examination of critical factors leveraged by covid-net in making detection decisions. the audit is executed with gsinquire which is a commonly used ai/ml explainability method [ ] . author in [ ] utilized the data set of cohen et al. [ ] and proposed covidx-net, a deep learning framework for automatic covid- detection from x-ray images. seven different deep cnn architecture, namely, vgg , densenet , inceptionv , resnetv , inception-resnetv , xception, and mobilenetv were utilized for performance evaluation. the vgg and densenet model outperform other deep neural classifiers in terms of accuracy. however, these classifiers also demonstrate higher training times. apostolopoulos et al [ ] merged the data set of cohen et al. [ ] , a data set from kaggle, and a data set of common bacterial-pneumonia x-ray scans [ ] to train a cnn to distinguish covid- from common pneumonia. five cnns namely vgg , mobilenet v , inception, xception, and inception resnet v with common hyper-parameters were employed. results demonstrate that vgg and mobilenet v perform better than other cnns in terms of accuracy, sensitivity, and specificity. the researchers extended their work in [ ] to extract bio-markers from x-ray images using a deep learning approach. the authors employ mobilenetv , a cnn is trained for the classification task for six most common pulmonary diseases. mobilenetv extracts features from xray images in three different settings, i.e., from scratch, with the help of transfer learning (pre-trained), and hybrid feature extraction via fine-tuning. a layer of global average pooling was added over mobilenetv to reduce overfitting. the extracted features are input to a node neural network for classification. the data set include recent covid- cases and x-rays corresponding to common pulmonary diseases. the covid- images ( ) are obtained from cohen et al. [ ] , sirm, rsna, and radiopaedia. the data set of common pulmonary diseases is extracted from a recent study [ ] among other sources. the training from scratch strategy outperforms transfer learning with higher accuracy and sensitivity. the aim of the research is to limit exposure of medical experts with infected patients with automated covid- diagnosis. researchers [ ] merged the data set of cohen et al. 
[ ] ( images) and a data set from kaggle ( images) for application of three pre-trained cnns, namely, resnet , inceptionv , and inceptionresnetv to detect covid- cases from x-ray radiographs. the data set was equally divided into normal and covid- positive cases. due to the limited data set, deep transfer learning is applied that requires smaller data set to learn and classify features. the resnet provided the highest accuracy for classifying covid- cases among the evaluated models. authors in [ ] propose support vector machine (svm) based classification of x-ray images instead of predominately employed deep learning models. the authors argue that deep learning models require large data sets for training that are not available currently for covid- cases. the data set brought to service in this article is an amalgam of cohen et al. [ ] , a data set of kaggle , and data set of kermany et. [ ] . author of [ ] also utilized data from same sources. the data set consists of covid- cases, pneumonia cases, and healthy cases. the methodology classifies the x-ray images into covid- , https://www.kaggle.com/paultimothymooney/ chest-xray-pneumonia kaggle. https://www.kagglee.com/andrewmvd/convid -x-rays pneumonia, and normal cases. pre-trained networks such as alexnet, vgg , vgg , googlenet, resnet , resnet , resnet , inceptionv , inceptionresnetv , densenet , xceptionnet, mobilenetv and shufflenet are employed on this data set for deep feature extraction. the deep features obtained from these networks are fed to the svm classifier. the accuracy and sensitivity of resnet plus svm is found to be highest among cnn models. similar to sethy et al. [ ] , afshar et al. [ ] also negated the applicability of dnns on small covid- data sets. the authors proposed a capsule network model (covid-caps) for the diagnosis of covid- based on x-ray images. each layer of a covid-caps consists of several capsules, each of which represents a specific image instance at a specific location with the help of several neurons. the length of a capsule determines the existence probability of the associated instance. covid-caps uses four convolutional and three capsule layers and was pretrained with transfer learning on the public nih data set of x-rays images for common thorax diseases. covid-caps provides a binary output of either positive or negative covid- case. the covid-caps achieved an accuracy of . %, a sensitivity of %, and specificity of . %. many authors have applied augmentation techniques on covid- image data sets. authors [ ] utilized data augmentation techniques to increase the number of data points for cnn based classification of covid- x-ray images. the proposed methodology adds data augmentation to basic steps of feature extraction and classification. the authors utilize the data set of cohen et al. [ ] . the authors design five deep learning model for feature extraction and classification, namely, custom-made cnns trained from scratch, transfer learning-based fine-tuned cnns, proposed novel covid-renet, dynamic feature extraction through cnn and classification using svm, and concatenation of dynamic feature spaces (covid-renet and vgg- features) and classification using svm. svm classification is brought to serve to further increase the accuracy of the task. the results showed that the proposed covid-renet and custom vgg- models accompanied by the svm classifier show better performance with approximately . % accuracy in identifying covid- cases. 
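several of the works above (e.g., the resnet-plus-svm pipeline) extract deep features from a pretrained cnn and hand them to a classical classifier. a generic sketch of that pattern follows; the backbone, pooling, and svm settings are illustrative assumptions rather than the exact configuration of any cited study.

# sketch: deep features from a pretrained cnn classified with an svm
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

backbone = models.resnet50(pretrained=True)
backbone.fc = nn.Identity()   # drop the imagenet head; output becomes a 2048-d feature vector
backbone.eval()

def extract_features(loader):
    # loader is assumed to yield preprocessed (image batch, label batch) pairs
    feats, labels = [], []
    with torch.no_grad():
        for images, y in loader:
            feats.append(backbone(images))
            labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# x_train, y_train = extract_features(train_loader)
# clf = SVC(kernel="linear").fit(x_train, y_train)
# print(clf.score(*extract_features(test_loader)))

decoupling feature extraction from classification keeps the trainable part small, which is one reason this pattern is attractive when only a few hundred labeled images are available.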
researchers [ ] formulated a data set of covid- and non-covid- cases containing both x-ray images and ct scans. augmentation techniques are applied on the data set to obtain approximately x-ray and ct images. the data set is divided into main categories of xray and ct images. the x-ray images comprise of covid- positive and non-covid images. the ctscan images comprise of covid- positive and non-covid images. these works on augmentation of covid- images resolve the issue of data scarcity for deep learning techniques. however, further investigation is required to determine the effectiveness of ml techniques in detecting covid- cases from augmented data sets. authors [ ] contributed towards a single covid- x-ray image database for ai applications based on four sources. the aim of the research was to explore the possibility of ai application for covid- diagnosis. the source databases were cohen et al. [ ] , italian society of medical and interventional radiology data set, images from recently published articles, and a data set hosted at kaggle. the cumulative data set contains covid- images, viral pneumonia images, and normal chest x-ray images. the authors further created augmented images from each category for the training and validation of four cnns. the four tested cnns are alexnet, resnet , densenet , and squeezenet for classification of xray images into normal, covid- , and viral pneumonia cases. the squeezenet outperformed other cnns with . % accuracy and . % sensitivity. the collective database can be found at. born et al. [ ] advocated the role of ultra-sound images for covid- detection. compared to ct scans, ultra-sound is a non-invasive, cheap, and portable medical imaging technique. first, the authors aggregated data in an open source repository named point-of-care ultrasound (pocus). the data set consists of images ( covid- , bacterial pneumonia, and healthy controls) extracted from online videos and published research works. the main sources of the data were grepmed.com, thepocusatlas.com, butterflynetwork.com, and radiopaedia.org. data augmentation techniques are also used to diversify the data. afterward, the authors train a deep cnn (vgg- ) named pocovid-net on the threeclass data set to achieve an accuracy of % and sensitivity for covid of %. lastly, the authors provide an open access medical service named pocovidscreen to classify and predict lung ultra-sound images. a comprehensive list of ai-based covid- research can be found at [ ] . a list open source data sets on the kaggle can be found at. covid- case reports, global and county-level dashboards, case report analysis, mobility data, social media posts, npi, and scholarly article collections are detailed in the following subsections. the earliest and most noteworthy data set depicting the covid- pandemic at a global scale was contributed by john hopkins university [ ] . the authors developed an online real-time interactive dashboard first made public in january . the dashboard lists the number of cases, deaths, and recoveries divided into country/provincial regions. a data is more detailed to the city level for the usa, canada, and australia. a corresponding github repository of the data is also available and datahub repository is available at. the data collection is semi-automated with main sources are dxy (a medical community ) and who. the dxy community collects data from multiple sources and updated every minutes. the data is regularly validated from multiple online sources and health departments. 
the aim of the dashboard was to provide the public, health authorities, and researchers with a userfriendly tool to track, analyze, and model the spread of covid- . authors [ ] employed four supervised ml models including svm and linear regression on this data set the predict the number of new cases, deaths, and recoveries. dey et al. [ ] analyzed the epidemiological outbreak of covid- using a visual exploratory data analysis approach. the authors utilized publicly available data sets from who, the chinese center for disease control and prevention, and johns hopkins university for cases between january to february all around the globe. the data set consisted of time-series information regarding the number of cases, origin country, recovered cases, etc. the main objective of the study is to provide time-series visual data analysis for the understandable outcome of the covid- outbreak. liu et al. [ ] formulated a spatio-temporal data set of covid- cases in china on the daily and city levels. as the published health reports are in the chinese language, the authors aim to facilitate researchers around the globe with data set translated to english. the data set also divides the cases to city/county level for analysis of citywide pandemic spread contrary to other data sets that provide county/province level categorizations. the data set consists of essential statistics for academic research, such as daily new infections, accumulated infections, daily new recoveries, accumulated recoveries, daily new deaths, etc. each of these statistics is compiled into a separate csv file and made available on github. the first two authors did cross-validation of their data extraction tasks to reduce the error rate. researchers [ , ] list and maintain the epidemiological data of covid- cases in china. the data set contains individual-level information of laboratory-confirmed cases obtained from city and provincial disease control centers. the information includes (a) key dates including the date of onset of disease, date of hospital admission, date of confirmation of infection, and dates of travel, (b) demographic information about the age and sex of cases, (c) geographic information, at the highest resolution available down to the district level, (d) symptoms, and (e) any additional information such as exposure to the huanan seafood market. the data set is updated regularly. the aim of the open access line list data is to guide the public health decision-making process in the context of the covid- pandemic. killeen et al. [ ] accounted for the county-level data set of covid- in the us. the machine-readable data set contains more than socioeconomic parameters that summarize population estimates, demographics, ethnicity, education, employment, and income among other healthcare system-related metrics. the data is obtained from the government, news, and academic sources. the authors obtain time-series data from [ ] and augment it with activity data obtained from safegraph. details of the safegraph data set can be found in the section . . a collection of country specific case reports and articles can be found at harvard dataverse repository. based on the open source data sets, we list various open source analysis efforts in this sub-section. most of listed research works in the below sections are accompanied by both open source data and code. 
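as a concrete example of working with the johns hopkins case-report data described above, the sketch below loads the global confirmed-case time series, derives daily new cases, and fits a simple linear-regression baseline of the kind mentioned in the case-prediction work. the csv path is the one the public repository used at the time of writing and may change; the forecast is only a naive illustration.

# sketch: load the jhu csse global time series and fit a naive baseline forecast
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

url = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")

df = pd.read_csv(url)
cumulative = df.iloc[:, 4:].sum(axis=0)   # date columns start after province/country/lat/long
daily_new = cumulative.diff().fillna(0)

days = np.arange(len(cumulative)).reshape(-1, 1)
model = LinearRegression().fit(days, cumulative.values)
future = np.arange(len(cumulative), len(cumulative) + 7).reshape(-1, 1)
print(model.predict(future))              # crude 7-day extrapolation of cumulative cases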
The COVID-19 textual data analyses serve varying purposes, such as forecasting COVID-19 transmission from China, estimating the effect of NPI and mobility on the number of cases, and estimating the serial interval and reproduction rate. Kucharski et al. [ ] modeled COVID-19 cases based on a data set of cases in Wuhan and international cases that originated from Wuhan. The purpose of the study was to estimate human-to-human transmission and the likelihood of an outbreak if the virus were introduced to a new region. The four time-series data sets used were: the daily number of new internationally exported cases, the daily number of new cases in Wuhan with no market exposure, the daily number of new cases in China, and the proportion of infected passengers on evacuation flights between December and February. Employing stochastic modeling, the study found that the reproduction number declined markedly after travel restrictions were imposed in Wuhan. The study also found that if four cases are reported in a new area, there is a considerable chance that the virus will establish itself within the community. Researchers [ ] described an econometric model to forecast the spread and prevalence of COVID-19. The analysis is aimed at helping public health authorities make provisions ahead of time based on the forecast. A time-series database was built from Johns Hopkins University statistics and made public in CSV format. An auto-regressive integrated moving average (ARIMA) model was applied to the data to predict the epidemiological trend of the prevalence and incidence of COVID-19. The ARIMA model combines an autoregressive model, a moving average model, and a seasonal autoregressive integrated moving average model, with parameters estimated via the autocorrelation function. One ARIMA specification was selected for the prevalence of COVID-19 and another for its incidence. The research predicted that, if the virus does not develop any new mutations, the curve will flatten in the near future. The author of [ ] utilizes reported death rates in South Korea in combination with population demographics to correct under-reported COVID-19 cases in other countries. South Korea is selected as a benchmark due to its high testing capacity and well-documented cases. The author attributes the under-reported cases to limited sampling capabilities and bias in systematic death rate estimation. Two data sets are brought into service: WHO statistics of daily country-wise COVID-19 reports, and a demographic database maintained by the UN; the latter covers a limited range of years and is hosted on Kaggle for country-wise analysis. The adjustment in the number of COVID-19 cases is achieved by comparing two countries and computing their vulnerability factor, which is based on population age structure and the corresponding death rates. As a result, the vulnerability factor of countries with an older population is greater than one, leading to higher death rate estimations. A complete workflow of the analysis is also hosted on Kaggle. Kraemer et al. [ ] analyzed the effect of human mobility and travel restrictions on the spread of COVID-19 in China. Real-time and historical mobility data from Wuhan and epidemiological data from each province were employed for the study (source: Baidu Inc.). The authors also maintain a list of cases in Hubei and a list of cases outside Hubei, with data and code available online.
The study found that, before the implementation of travel restrictions, the spatial distribution of COVID-19 was highly correlated with mobility; the correlation weakened after the imposition of travel restrictions. Moreover, the study estimated that a late imposition of travel restrictions after the onset of the virus in most of the provinces would have led to higher local transmission. The study also estimated the mean incubation period in order to identify a time frame for evaluating early shifts in COVID-19 transmission. Researchers [ ] provide another study evaluating the effects of travel restrictions on COVID-19 transmission. The authors quantify the impact of travel restrictions in early 2020 with respect to COVID-19 cases reported outside China using statistical analysis. The authors obtained an epidemiological data set of confirmed COVID-19 cases from government sources and websites; all confirmed cases were screened using RT-PCR. The quantification of COVID-19 transmission with respect to travel restrictions covered the number of exported cases, the probability of a major epidemic, and the time delay to a major epidemic. Lai et al. [ ] quantitatively studied the effect of NPI, i.e., travel bans, contact reductions, and social distancing, on the COVID-19 outbreak in China. The authors modeled the travel network as a susceptible-exposed-infectious-removed (SEIR) process to simulate the outbreak across cities in China in a proposed model named the Basic Epidemic, Activity, and Response COVID-19 model. The authors used epidemiological data from the early stage of the epidemic, before the implementation of travel restrictions. This data was used to determine the effect of NPI on onset delay in other regions, with first case reports as an indication. The authors also obtained large-scale mobility data from Baidu location-based services, which report billions of positioning requests per day. Another historical data set from Baidu was obtained for daily travel patterns during the Chinese New Year celebrations, which coincided with the COVID-19 outbreak (related resources: https://www.kaggle.com/lachmann / correcting-under-reported-covid- -case-numbers and https://github.com/emergent-epidemics/covid npi china). The study estimated the cumulative number of COVID-19 cases in China as of February and projected that, without the implementation of NPI, cases would have increased many-fold. The impact of the various restrictions varied, with early detection and isolation preventing more cases than the travel restrictions. Had NPI been implemented three weeks earlier, cases would have been substantially fewer; on the contrary, had the NPI been implemented after a further delay of weeks, COVID-19 cases would have multiplied. A study with the similar objective of investigating the impact of NPI in European countries was carried out in [ ]. At the start of the pandemic in European countries, NPI were implemented in the form of social distancing, bans on mass gatherings, and closure of educational institutes. The authors utilized a semi-mechanistic Bayesian hierarchical model to evaluate the impact of these measures across European countries. The model assumes that any change in the reproduction number is an effect of NPI, and that the reproduction number behaves similarly across all countries, so as to leverage more data across the continent. The study estimates that the NPI had averted a large number of deaths up to the end of March in the studied countries. The proportion of the population infected by COVID-19 was found to be highest in Spain, followed by Italy. The study also estimated that, because of mild and asymptomatic infections, reported cases understate true infections many-fold, and that a sizeable share of the Spanish population had actually been infected. The mean reproduction number was also estimated. Real-time data was collected from the ECDC (European Centre for Disease Prevention and Control) for the study.
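Several of the NPI studies above build on compartmental transmission models. The following minimal SEIR sketch is illustrative only (parameters are placeholders, not those of the calibrated models in the cited works) and shows how cutting the contact rate on a given day, as a crude proxy for NPI, changes the simulated epidemic peak.

```python
# Minimal SEIR simulation: compares an uncontrolled epidemic with one where the
# contact rate (beta) is cut after an intervention day, as a crude NPI proxy.
# All parameters are illustrative, not those of the models in the cited studies.
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, n):
    s, e, i, r = y
    ds = -beta * s * i / n
    de = beta * s * i / n - sigma * e
    di = sigma * e - gamma * i
    dr = gamma * i
    return ds, de, di, dr

N = 1_000_000
y0 = (N - 10, 0, 10, 0)                 # 10 initial infectious cases
t = np.linspace(0, 180, 181)            # days
sigma, gamma = 1 / 5.2, 1 / 7           # ~5-day incubation, ~7-day infectious period

def run(beta_before, beta_after, t_npi):
    beta_t = lambda day: beta_before if day < t_npi else beta_after
    out = [y0]
    for k in range(1, len(t)):
        seg = odeint(seir, out[-1], [t[k - 1], t[k]],
                     args=(beta_t(t[k - 1]), sigma, gamma, N))
        out.append(seg[-1])
    return np.array(out)

no_npi = run(0.6, 0.6, t_npi=1e9)       # no intervention
with_npi = run(0.6, 0.2, t_npi=40)      # contact rate cut on day 40
print("peak infectious, no NPI:  ", int(no_npi[:, 2].max()))
print("peak infectious, with NPI:", int(with_npi[:, 2].max()))
```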
Wells et al. [ ] studied the impact of international travel and border control restrictions on COVID-19 spread. The research utilized daily case reports from December 2019 to February 2020 in China, together with country-wise airport connectivity with China, to estimate the risk of COVID-19 transmission. A large number of countries have direct flight connectivity with mainland China, and the COVID-19 transmission/importation risk was assumed to be proportional to the number of airports with direct flights from China. It was estimated that the travel and lockdown restrictions produced a substantial average reduction in the exportation rate. A health questionnaire covering exposure at least a week prior to arrival was estimated to identify a share of cases still in the incubation period, and it was also estimated that identifying a case via contact tracing within days of exposure greatly reduces the chance of that case travelling during the incubation period. The authors of [ ] investigated the transmission control measures of COVID-19 in China. They compiled and analyzed a unique data set consisting of case reports, mobility patterns, and public health interventions. The COVID-19 case data were collected from official reports of the health commission. The mobility data were collected from location-based services employed by social media applications such as WeChat. The travel pattern from Wuhan during the Spring Festival was constructed from the Baidu migration index. The study found that the number of cases in other provinces after the shutdown of Wuhan is strongly related to travelers from Wuhan. In cities with smaller populations, the Wuhan travel ban resulted in a delayed arrival of the virus, and cities that implemented the highest-level emergency response before the arrival of any case reported fewer cases. The low level of peak incidence per capita in regions outside Wuhan also indicates the effectiveness of early travel bans and other emergency measures. The study also estimated that, without the Wuhan travel ban and emergency measures, the number of COVID-19 cases outside Wuhan would have been far larger in the early weeks of the pandemic. In summary, the study found a strong association between the emergency measures introduced during the spring holidays and the delay in epidemic growth of the virus. Tindale et al. [ ] study the COVID-19 outbreak to estimate the incubation period and serial interval distribution based on data obtained in Singapore and Tianjin. The incubation period is the period between exposure to an infection and the appearance of the first symptoms. The data was made available to the respective health departments. The serial interval can be used to estimate the reproduction number (R0) of the virus. Moreover, both the serial interval and the incubation period can help identify the extent of pre-symptomatic transmission. With more than a month of COVID-19 case data from both cities, the mean serial interval and the mean incubation period were estimated separately for Singapore and Tianjin.
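Estimates of this kind are typically obtained by fitting a parametric distribution to observed delays between exposure (or infector onset) and symptom onset. The sketch below fits a gamma distribution to synthetic placeholder delays; it is not the Singapore or Tianjin line-list analysis itself.

```python
# Fit a gamma distribution to exposure-to-symptom-onset delays to estimate the
# incubation period. The delays below are synthetic placeholders; the cited
# studies fit such distributions to the Singapore and Tianjin line lists.
import numpy as np
from scipy import stats

delays_days = np.array([4, 5, 7, 6, 3, 9, 5, 6, 8, 4, 7, 5, 10, 6, 5, 4, 8, 7])

# Fix the location parameter at zero so the fit has two free parameters.
shape, loc, scale = stats.gamma.fit(delays_days, floc=0)
mean_incubation = shape * scale
print(f"estimated mean incubation period: {mean_incubation:.1f} days")
print(f"95th percentile: {stats.gamma.ppf(0.95, shape, loc, scale):.1f} days")
```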
researchers [ ] investigated the serial interval of covid- based on publicly reported cases. a total of covid- transmission events reported in china outside of hubei province between january , , and february , formulated the data set. the data is compiled from reports of provincial disease control centers. the data indicated that in of the cases, the infectee developed symptoms earlier than the infector indicated pre-symptomatic transmission. the mean serial interval is estimated to be . with a standard deviation of . . the mean serial interval of covid- is found to be lower than similar viruses of mers and sars. the production rate (r ) of the data set is found to be . . author in [ ] presented a framework for serial interval estimation of covid- . as the virus is easily transmitted in a community from an infected person, it is important to know the onset of illness in primary to secondary transmissions. the date of illness onset is defined as the date on which a symptom relevant to covid- infection appears. the serial interval refers to the time between successive cases in a chain of disease transmission. the authors obtain cases of pairs of infector-infectee cases published in research articles and investigation reports and rank them for credibility. a subset of high credible cases are selected to analyze that the estimated median serial interval lies at . days. the median serial interval of covid- is found to be smaller than sars. moreover, it is implied that contact tracing methods may not be effective due to the rapid serial interval of infector-infectee transmissions. researchers [ ] contributed towards a publicly available ground truth textual data set to analyze human emotions and worries regarding covid- . the initial findings were termed as real world worry dataset (rwwd). in the current global crisis and lock-downs, it is very essential to understand emotional responses on a large scale. the authors requested multiple participants from uk on th and th april (lock-down, pm in icu) to report their emotions and formed a data set of texts ( short and long texts). the number of participants was . each participant was required to provide a short tweet-sized text (max characters) and a long open-ended text (min characters). the participants were also asked to report their feelings about covid- situations using -point scales ( = not at all, = moderately, = very much). each participant rated how worried they were about the covid- situation and how much anger, anxiety, desire, disgust, fear, happiness, relaxation, and sadness they felt. one of the emotions that best represented their emotions was also selected. the study found that anxiety and worry were the dominant emotions. stm package from r was reported for topic modeling. the most prevalent topic in long texts related to the rules of lock-down and the second most prevalent topic related to employment and economy. in short texts, the most prominent topic was government slogans for lock-down. a large-scale twitter stream api data set for scientific research into social dynamic of covid- was presented in [ ] . the data set is maintained by the panacea lab at georgia state university with dedicated efforts starting on march , . the data set consists of more than million tweets in the latest version with daily updates. the data set consists of tweets in all languages with prevalence of english, spanish, and french. 
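For text collections such as the worry data set described above, topic modelling is a common first analysis step. The cited study used the R stm package; the sketch below is only a rough Python analogue using scikit-learn's LDA on a few placeholder texts.

```python
# Rough Python analogue of the topic-modelling step: the cited study used the
# R `stm` package; here scikit-learn's LDA is applied to placeholder texts.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "worried about my job and paying rent during the lockdown",
    "government rules on staying home are confusing",
    "anxious about elderly parents catching the virus",
    "missing friends and family, feeling isolated at home",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```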
several keywords, such as, covid , coronaviruspandemic, covid- , ncov, coronaoutbreak, and coronavirus were used to filter results. the data set consists of two tsv files, i.e., a full data set and one that has been cleaned with no re-tweets. the data set also contains separate csv files indicating top frequent terms, top bigrams, and top trigrams. chen et al. [ ] describe a multilingual coronavirus data set with the aim of studying online conversation dynamics. the social distancing measure has resulted in abrupt changes in the society with the public accessing social platforms for information and updates. such data sets can help identify rumors, misinformation, and panic among the public along with other sentiments from social media platforms. using twitter's streaming api and tweepy, the authors began collecting tweets from january , while adding keywords and trending accounts incrementally in the search process. at the time of publishing, the data set consisted of over million tweets and gb of raw data. authors in [ ] collected a twitter data set of arabic language tweets on covid- . the aim of the data set collection is to study the pandemic from a social perspective, analyze human behavior, and information spread with special consideration to arabic speaking countries. the data set collection was started in march using twitter api and consists of more than , , arabic language tweets with regular additions. arabic keywords were used to search for relevant tweets. hydrator and twarc tools are employed for retrieving the full object of the tweet. the data set stores multiple attributes of a tweet object including the id of the tweet, username, hashtags, and geolocation of the tweet. yu et al. [ ] compiled a data set from twitter api solely based on the institutional and news channel tweets based on multiple countries including us, uk, china, spain, france, and germany. a total of twitter accounts were followed with from government and international organizations including who, eu commission, cdc, ecdc. news media outlets were monitored including ny times, cnn, washington post, and wsj. researcher [ ] analyzes a data set of tweets about covid- to explore the policies and perceptions about the pandemic. the main objective of the study is to identify public response to the pandemic and how the response varies time, countries, and policies. the secondary objective is to analyze the information and misinformation about the pandemic is presented and transmitted. the data set is collected using twitter api and covers january to march . the corpus contains , , tweets based on different keywords related to the virus in multiple languages. the data set is being continuously updated. the authors propose the application of natural language processing, text mining, and network analysis on the data set as their future work. similar data sets of twitter posts regarding covid- can be found at github and kaggle. zarei et al. [ ] gather social media content from instagram using hashtags related to covid- (coronavirus, covid , and corona etc.). the authors found that % of the social media posts concerning covid- were in english language. the authors proposed the application of fake new identification and social behavior analysis on their data set. sarker et al. [ ] mined twitter to analyze symptoms of covid- from self-reported users. the authors identified covid- patients while searching twitter streaming api with expressions related to self-report of covid- . 
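Once tweets have been collected through the API, a typical first processing step is keyword and language filtering. A minimal sketch is shown below, assuming a hypothetical tweets.csv dump with text and lang columns standing in for whatever the collector stored.

```python
# Post-collection filtering of a tweet dump: keep COVID-19-related tweets and
# summarise language prevalence. The `tweets.csv` file and its `text`/`lang`
# columns are hypothetical placeholders for whatever the API collector stored.
import pandas as pd

KEYWORDS = ["covid", "coronavirus", "ncov", "pandemic", "lockdown"]

df = pd.read_csv("tweets.csv")
pattern = "|".join(KEYWORDS)
covid_tweets = df[df["text"].str.contains(pattern, case=False, na=False)]

print("matching tweets:", len(covid_tweets))
print(covid_tweets["lang"].value_counts().head(5))   # e.g. en, es, fr prevalence
```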
The patients reported different symptoms with unique lexicons. The most frequently reported COVID-19 symptoms were fever and cough. The reported symptoms were compared with clinical findings on COVID-19, and it was found that anosmia and ageusia reported on Twitter were not found in the clinical studies. A generic workflow of ML and NLP applications on social media data is illustrated in the corresponding figure. Several articles have performed bibliometric research on scientific works focused on COVID-19. The Allen Institute for AI, with other collaborators, started an initiative for collecting articles on COVID-19 research named CORD-19 [ ]. The data set was initiated with tens of thousands of articles and now contains a much larger collection of articles and full texts. Multiple repositories, such as PMC, bioRxiv, medRxiv, and the WHO, were searched with queries related to COVID-19 ("COVID-19", "coronavirus", "corona virus", "nCoV", etc.). Along with the article data, a metadata file is also provided, which includes each article's DOI and publisher among other information. The data set is also divided into commercial and non-commercial subsets. Duplicate articles were clustered based on publication ID/DOI and filtered to remove duplication. Design challenges such as machine-readable text, copyright restrictions, and clean canonical metadata were considered while collecting the data. The aim of the data set collection is to facilitate information retrieval, extraction, knowledge-based discovery, and text mining efforts focused on COVID-19 management policies and effective treatment. The data set has been popular among the research community, with well over a million views and a large number of downloads. (The Twitter data sets mentioned in the previous paragraph are hosted at https://github.com/bayesfordays/coronada and https://www.kaggle.com/smid /coronavirus-covid -tweets.) A competition on Kaggle based on information retrieval from the proposed data set is also active. In addition, several publishers have created separate sections for COVID-19 research and listed them on their websites. Ahamed et al. [ ] applied graph-based techniques to this data set to study three topics related to COVID-19. Researchers [ ] applied association rule text mining (ARTM) and information cartography techniques to the same data set. ARTM highlights distinguished terms and the associations between them after parsing text documents, while information cartography extracts structured knowledge from association rules. Researchers [ ] provided a scoping review of research articles published up to January 2020, indicating early studies on COVID-19. The review followed a five-step methodological framework for scoping reviews as proposed in [ ]. The authors searched multiple online databases, including bioRxiv, medRxiv, Google Scholar, PubMed, CNKI, and Wanfang Data. The search terms included "nCoV", "novel coronavirus", and "2019-nCoV", among others. The study found that most of the published articles were in English, that the largest proportion of articles studied the causes of COVID-19, and that Chinese authors contributed most of the work. The study also found evidence of the virus originating from the Wuhan seafood market; however, no specific animal association of COVID-19 was established. The most commonly reported symptoms, from the studies conducting clinical trials of COVID-19, were fever, cough, fatigue, pneumonia, and headache.
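Returning to the CORD-19 corpus described above, much of the proposed text mining starts from its metadata file. The sketch below performs simple TF-IDF retrieval over titles and abstracts; the metadata.csv file name and its column names follow the layout distributed with the data set at the time of writing.

```python
# Simple TF-IDF retrieval over CORD-19 metadata. The `metadata.csv` file and its
# `cord_uid`/`title`/`abstract` columns follow the layout distributed with the
# data set at the time of writing.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

meta = pd.read_csv("metadata.csv", usecols=["cord_uid", "title", "abstract"])
meta = meta.dropna(subset=["abstract"]).reset_index(drop=True)

vec = TfidfVectorizer(stop_words="english", max_features=50_000)
X = vec.fit_transform(meta["title"].fillna("") + " " + meta["abstract"])

query = vec.transform(["effect of non-pharmaceutical interventions on transmission"])
scores = cosine_similarity(query, X).ravel()
for idx in scores.argsort()[-5:][::-1]:
    print(f"{scores[idx]:.3f}  {meta.loc[idx, 'title']}")
```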
the surveyed studies have reported masks, hygiene practices, social distancing, contact tracing, and quarantines as possible methods to reduce virus transmission. the article sources are available as supplementary resources with the article. researchers [ ] detailed a systematic review and critical appraisal of prediction models for covid- diagnosis and prognosis. multiple publication repositories such as pubmed were searcher for articles that developed and validated covid- prediction models. out of the available titles and studies describing prediction models were included for the review. three of these studies predicted hospital admission from pneumonia cases. eighteen studies listed covid- diagnostic models out of which were ml-based. ten studies detailed prognostic models that predicted mortality rates among other parameters. the critical analysis utilized probast, a tool for risk and bias assessment in prediction models [ ] . the analysis found that the studies were at high risk of bias due to poorly reported models. the study recommended that covid- diagnosis and prognosis models should adhere to transparent and open source reporting methods to reduce bias and encourage realtime application. researcher from berkeley lab have developed a web search portal for data set of scholarly articles on covid- . the data set is composed of several scholarly data sets including wang et al. [ ] , litcovid, and elsevier novel corona virus information center. the continuously expanding data set contains approximately k articles with k specifically related to covid- . the search portal employs nlp to look for related articles on covid- and also provides valuable insights regarding the semantic of the articles. mobility data sets during the covid- pandemic are essential to establish a relation between the number of cases (transmitted) and mobility patterns and observe the global response of communities in npi restrictions. there https://covidscholar.org are a number of open source mobility data sets providing information with varying features. mobility data sets can be investigated to answer questions like what is the effect of covid- on travel? did people stay at home during the lock-down? is their a correlation between high death rates and high mobility? a global mobility data set of more than countries collected from google location services can be found at google. it presents reports available in pdf format with a breakdown of countries and regions. the reports include a summary of changes in retail, recreation, supermarket, pharmacy, park, public transport, workplace, and residential visits. the privacy of users is ensured with aggregated and anonymised data that contains no identifiable personal information. a summary of the google mobility reports in csv format can be found at kaggle. apple made a similar data set available on mobility based on user requests to location services across the globe. user privacy was addressed with anonymised records as data sent from devices is associated with random rotating identifiers. the data set available in csv format compared the mobility with a baseline set on th january . the data set contains information on the country/region, sub-region, or city level. the geods lab at the university of wisconsin-madison has developed a web application identifying mobility patterns across the u.s. the data set is based on reports from safegraph and descartes labs . 
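As a small example of how such mobility files can be summarized, the sketch below loads Google's community mobility report and computes weekly average changes for one country. The file name and column names reflect the schema published by Google at the time of writing.

```python
# Summarise Google community mobility changes for one country. The file name and
# column names reflect the Global_Mobility_Report.csv schema published by Google
# at the time of writing.
import pandas as pd

df = pd.read_csv("Global_Mobility_Report.csv", parse_dates=["date"], low_memory=False)

# Country-level rows have no sub-region set.
uk = df[(df["country_region"] == "United Kingdom") & df["sub_region_1"].isna()]

cols = ["retail_and_recreation_percent_change_from_baseline",
        "workplaces_percent_change_from_baseline",
        "residential_percent_change_from_baseline"]
print(uk.set_index("date")[cols].resample("W").mean().round(1).head())
```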
The baseline is formed from the two weekends occurring before the lock-down measures were announced in the U.S. The data set provides fine-grained details down to the county level. SafeGraph is a digital footprint platform that aggregates location-based data from multiple applications in the U.S. Journalistic data is used to infer the implementation of lock-down measures at the county level. The data set is envisioned to serve the scientific community in general, and ML applications for epidemiological modeling in particular. Some of the research works listed in the above sections have utilized mobility data from Baidu location services. Limited data about commercial flights can be obtained from the FLIRT tool. NPI are the collection of the wide range of measures adopted by governments to curb the COVID-19 pandemic. NPI data sets are essential for studying COVID-19 transmission and analyzing the effect of NPI on COVID-19 cases (infections, deaths, etc.) for better policy and decisions at the government level [ ]. A team of academics and students at Oxford University systematically collected publicly available data from every part of the world into a stringency index [ ]. The stringency index consists of information on government policies and a score indicating their stringency. The listed indicators are grouped into four classes: containment and closure (school closures, cancellation of public events, international and local travel bans, etc.), financial response (income support, debt relief for households, etc.), health systems (emergency investment in healthcare, contact tracing, etc.), and miscellaneous responses. These indicators are used to compare government responses and measure their effect on the rate of infections. A global map ranking countries by the stringency index is presented. Data is collected from internet sources, news articles, and government press releases, and is available in multiple formats and interfaces on the team website and GitHub repository. The project is titled the Oxford COVID-19 Government Response Tracker (OxCGRT) and has a working paper and a corresponding open source repository. A similar data set covering many countries has been curated by a group of volunteers. The objective of this data set curation is to study the effectiveness of NPI at national scales without consideration of economic stimulus. The data set includes information on lock-down measures, travel bans, and testing counts. The authors acknowledge sampling bias in the data set, as some countries are difficult to document and government reports may differ from actual implementation and consequences. The data set can be utilized to study the correlation between national responses and infection transmission rates. ACAPS, an independent information provider, maintains a similar dashboard and data set of government measures. Speech (audio) data sets help in COVID-19 diagnosis and detection through three basic methods. Firstly, cough sounds can help in detecting a COVID-19-positive case after the application of ML techniques [ , ]. Secondly, the breathing rate can be detected from speech, enabling screening of a person for COVID-19 [ ]. Thirdly, stress detection techniques applied to speech can be used to detect persons with indications of mental health problems and to gauge the severity of COVID-19 symptoms. All these techniques require extensive data set collection efforts. These speech-based COVID-19 diagnosis techniques can be enabled by smartphone applications or by remote medical care through telemedicine. Imran et al.
[ ] exploited the fact that cough is one of the major symptoms of covid- . what makes this exploitation process complex, is the truth that cough is a symptom of over thirty non-covid- related medical conditions. to address this problem, the authors investigate the distinctness of pathomorphological alterations in the respiratory system induced by covid- infection when compared to other respiratory infections. transfer learning is exploited to overcome the covid- cough training data shortage. to reduce the misdiagnosis risk stemming from the complex dimensionality of the problem, a multi-pronged mediator centered risk-averse ai architecture is leveraged. the ai architecture consists of three independent classifiers, i.e., deep learning-based multi-class classifier, classical ml-based multi-class classifier, and deep learning-based binary class classifier. if the output of any classifier mismatches other, inconclusive result is returned. results show that proposed ai covid- can distinguish among covid- coughs and several types of non-covid coughs. the accuracy of more than % is promising enough to encourage a large-scale collection of labeled cough data to gauge the generalization capability of ai covid- . researchers [ ] developed a cross-platform application for crowd-sourced collection of voice sounds (cough and breath) to distinguish healthy and unhealthy persons. the voice sounds are used to distinguish between covid- , asthma, and healthy persons. three binary classification tasks are constructed i.e., (a) distinguish covid- positive users from healthy users (b) distinguish covid- positive users who have a cough from healthy users who have a cough, and (c) distinguish covid- positive users with a cough from users with asthma who declared a cough. more than unique users (approximately k samples) participated in the crowdsourced data collection out of which more than reported being covid- positive. standard audio augmentation methods were used to increase the sample size of the data set. three classifiers, namely, logistic regression, gradient boosting trees, and svm were utilized for the classification task. the study utilizes the aggregate measure of the area under the curve (auc) for performance comparison. auc of greater than % is reported in all three binary classification tasks. the authors also utilize breathing samples for classification and find auc to be approximately %. however, when the cough and breathing inputs are combined for classification, the auc improves to approximately % for each task due to a higher number of features. sharma et al. [ ] aim to supplement the laboratorybased covid- diagnosis methods with cough based diagnosis. the project, named is coswara, utilizes cough, breath, and speech sounds to quantify biomarkers in acoustics. nine different vocal sounds are collected for each patient including breath (shallow and deep), cough (shallow and heavy), and vowel phonations. the nine vocals capture different physical states of the respiratory system. multidimensional spectral and temporal features are extracted from audio files. the classification and data curation tasks are under process. the work is supplemented by a web application for data collection and open source voice data set of approximately samples in wav format ( . khz ). in summary, detecting/screening covid- from cough samples using ml techniques has indicated promising results [ , ] . the accuracy of the studies is hindered due to the small data set of covid- cough samples. 
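A minimal sketch of the feature-plus-classifier pattern used in these cough studies is shown below: MFCC summary features with an SVM, evaluated by AUC. The labelled manifest file and its columns are hypothetical placeholders for a real cough data set.

```python
# Cough classification sketch: MFCC summary features + SVM, evaluated with AUC,
# mirroring the classifier family used in the crowdsourced-cough studies.
# `cough_labels.csv` is a hypothetical manifest with columns `file` and `covid`.
import librosa
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def mfcc_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])   # 26-d summary

meta = pd.read_csv("cough_labels.csv")            # hypothetical manifest
X = np.vstack([mfcc_features(p) for p in meta["file"]])
y = meta["covid"].to_numpy()                      # 1 = COVID-19 positive

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0, stratify=y)
clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```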
several researchers are gathering cough based data and have made appeals for contribution from the public. researchers from the university of cambridge , carnegie mellon university , and epfl have made calls for the community participation in the collection of the cough data. an independent ai team has made the call for data collection and also made an open source repository for the cough data. breathlessness or shortness of breath is a symptom in nearly % of the covid- patients which can also indicate other serious diseases such as pneumonia [ ] . automated detection of breathlessness from the speech is required in remote medical care and covid- screening applications. patient speech can be recorded for breath patterns with a simple microphone attached to smart devices. abnormality related to covid- can be detected from the breath patterns. faezipour et al. [ ] proposed an idea smartphone application for self-testing of covid- using breathing sounds. the authors imply that breathing difficulties due to covid- can reveal acoustic patterns and features necessary for covid- pre-diagnosis. the breathing sounds can be input to the smartphone through the microphone. signal processing, ml, and deep learning techniques can be applied to the breathing sound to extract features and classify the input into covid- positive and negative cases. such a smartphone-based application can be used as a self-test while eliminating the risks and costs associated with visiting medical facilities. the proposed framework can be augmented with data obtained from a spirometer (lung volume) and blood oxygenation measured from a pulse oximeter. the data should be initially labeled as covid- positive and negative by medical experts based on clinical findings to train the proposed model. afterward, ml techniques can extract features and classify new inputs based on model training. authors [ ] detailed a portable smartphone powered spirometer with automated disease classification using cnn. spirometer is a device that measures the volume of expired and inspired air. the proposed system consists of three basic modules. first, fleisch type airflow tube captures the breath with a differential pressure-based approach. a blue-tooth enabled micro-controller is built for data processing. lastly, an android application with a pretrained cnn model for classification is developed. stacked autoencoders, long short term memory network, and cnn are evaluated as classifiers for lung diseases such as obstructive lung diseases and restrictive lung diseases. the -d cnn classifier exhibits higher accuracy than other ml classifiers. the proposed model can be extended to include covid- classification and the classifiers can be reevaluated for accuracy. in summary, tools are available for the breath-based covid- diagnosis. however, existing applications are required to be updated accompanied by voice collections from covid- patients. han et al. [ ] provide an initial data driven study towards speech analysis for covid- and detect physical and mental states along with symptom severity. voice data from patients hospitalized in china was gathered with five sample sentences. moreover, each patient was asked to rate his sleep quality, fatigue, and anxiety on low, average and high level. demographic information for each patient was also collected. the data was pre-processed in four steps namely, data cleansing, hand annotating of voice activities, speaker diarisation, and speech transcription. 
opensmile toolkit was used to extract two feature sets namely, computational paralinguistics challenge (compare) and the extended geneva minimalistic acoustic parameter (egemaps). four classification tasks are performed on data. firstly, the patient severity is estimated with the help of number of hospitalization days. the rest of the three classification tasks predict the severity of self-reported sleep quality, fatigue, and anxiety levels of covid- patients with svm classifier and linear kernel. performance in terms of unweighted average recall showed promising results for sleep quality and anxiety prediction. in this section we provide a tabular and descriptive comparison of the surveyed open source data sets. table presents the comparison of medical image data sets in terms of application, data type, and ml method in tabular form. first, we compare all of the listed works on their openness. some of the works do not have data and code publicly available and it is difficult to validate their work [ ] . others have code or data publicly available [ ] . such studies are more relevant in the current pandemic for global actions concerning verifiable scientific research against covid- . on the other hand, some studies merge multiple data sets and mention the source of data but do not host it as a separate repository [ ] . the highly relevant studies have made public both data and code [ , ] . higher number of reported works have utilized x-ray images than ct scans. very few studies have utilized ultrasounds and mrt images [ ] . segmentation techniques to identify infected areas have been mostly applied to ct scans [ ] . similarly, augmentation techniques to increase the size of the data set have been applied mostly to x-ray image based data sets [ , ] . all of the works provided d ct scans except for one resource from the coronacases initiative. most of the covid- diagnosis works employed cnns for classification. some of the works utilized transfer learning to further increase the accuracy of classification by learning from similar tasks [ , ] . moreover, few works augmented cnns with svm for feature extraction and classification tasks [ , ] . higher accuracies were reported from works augmenting transfer learning and svm with cnns. cnns and deep learning techniques are reported to overfit models due to the limited size of covid- data sets [ ] . therefore, authors also researched alternative approaches in the form of capsule network [ ] and svm [ ] for better classification on limited data sets of covid- cases. augmentation techniques have also been employed to increase the size of data set. however, https://coronacases.org/ further analysis is required on the performance evaluation of covid- diagnosis on augmented data sets [ ] . most of the covid- diagnosis works distinguished between two outcomes of covid- positive or negative cases [ ] . however, some of the works utilized three outcomes, i.e., covid- positive, viral pneumonia, and normal cases for applicability in real-world scenarios. researchers [ ] expanded the classification to six common types of pneumonia. such methodologies require the extraction and compilation of data sets with other categories of pneumonia radiographs. the ml-based covid- diagnosis is difficult to fully automate as a human in the loop (hitl) is required to label radiographic data [ ] . segmentation techniques have been utilized to embed bio-markers in data set [ ] . however, the segmentation techniques also require hitl for verification. 
resnet, mobilenet, and vgg have been commonly employed as pre-trained cnns for classification [ , ] . ai/ml explainability methods have been seldom used to delineate the performance of cnns [ ] . most of the works report accuracies greater than % for covid- diagnosis [ , ] . the data sets of cohen et al. [ ] is considered pioneering effort and is mostly utilized for the covid- cases and kermany et al. [ ] is employed for common pneumonia cases. table presents the comparison of textual data sets (covid- case reports) in terms of application, data type, and statistical method in tabular form. the textual data sets are applied for multiple purposes, such as, (a) reporting and visualizing covid- cases in time-series formats [ , ] , (b) estimating community transmission [ ] , (c) correlating the effect of mobility on virus transmissions [ ] , (d) estimating effect of npi on covid- cases [ , ] , (e) forecasting reproduction rate and serial intervals, (f) learning emotional and socioeconomic issue from social media [ , ] , and (g) analyzing scholarly publications for semantics [ ] . most of the articles apply statistical techniques (stochastic, bayesian, and regression) for estimation and correlation of data [ , ] . there is great scope for the application of ai/ml technique as proposed in some studies [ , ] . however, only statistical techniques have been applied to textual data sets in most of the listed works. most of the studies that estimate covid- transmissions utilize covid- case data collected from various governmental, journalistic, and academic sources [ ] . the case reports are available in visual dashboards and csv formats. table presents the comparison of textual data sets (social media and scholarly articles) in terms of application, data type, and statistical method in tabular form. the studies that analyze human emotions have mostly utilized twitter api to collect data [ , ] . studies estimating effect of npi on covid- bring to service location, mobility, [ ] . the collection of scholarly articles have proposed potential of data science and nlp techniques [ ] while a demonstration of the same is available at covidscholar. however, the details of the semantic analysis algorithms applied by the covidscholar are not available. github is the first choice of researchers to share open access data while kaggle is seldom put to use [ , ] . table provides a comparison of mobility and npi data sets based on parameters including data set application, source, format, and coverage. mobility data sets provided by google, apple, and baidu aid the analysis of covid- case transmissions [ , ] . the mobility data is collected by location services provided on smart devices. the mobility data is usually anonymised with random identifiers to keep user privacy intact. however, the privacy measures of baidu are not known. several efforts have been made to record npi at the country level from information released by media and government press [ ] . the npi data in conjunction with infection rates can shed light on the effectiveness of the government policies. the mobility and npi data sets are available in dashboard info-graphics and in csv format. table lists the speech data based research works for covid- in terms of application, data type, ml methods, and data set size. the applications of speech data for covid- diagnosis are very encouraging as identified in most of the listed research works in section . most of the listed works focus on cough based covid- diagnosis from speech data. 
ml classifiers such as svm and cnn are commonly utilized for classification. although there are multiple mobile applications for collection of voice data, open source data sets are few and very small in size. further open source data collection is required for (a) application of deep learning methods, (b) application of methods for covid- severity prediction, and (c) prediction of patient behavioral features (mental health, anxiety, stress, etc) from speech data [ , ] . most of the research on covid- is currently not peerreviewed and in the form of pre-prints. the covid- pandemic is a matter of global concern and necessitates that any scientific work published should go through a rigorous review process. at the same time, the efficient diffusion of scientific research is also demanded. therefore, this review had to include pre-prints that have not been peer-reviewed to compile a comprehensive list of articles. the pre-prints contribute approximately % of the cited research in this review. the credibility of reviewed work is supported by the open source data sets and code accompanying the pre-prints. the research works can be compared on the re-usability metric of the data sets such as meloda [ , ] . there are multiple challenges to al/ml-based covid- research and data. the foremost on the list is the availability and openness of data [ ] [ ] [ ] . as more open source data is made available, ai/ml-based research collaborations across the globe, system verification, and real-world operations will be possible. we detail the future challenges in the following subsection. the ai/ml techniques are often open source and implemented as libraries and packages in programming language development platforms. some of the examples are scikitlearn module in python [ ] and weka library in r [ ] . as a result, the focus shifts towards openness and availability of data. the novel covid- pandemic necessitates creation, management, hosting, and benchmarking of new data sets. existing research works lean more towards opaque research methodology rather than open source methods. each of these practices has pros and cons. the closed source research can lead towards patenting of research innovations and ideas. it can also lead to collaboration in closed research groups across the globe. however, withholding critical data in the context of covid- may be considered maleficence [ ] . moreover, open source research methods offer greater benefits that are more far-reaching while accelerating ai/ml innovations for the covid- pandemic. the abrupt spread of the covid- pandemic has also highlighted the open source data as the current key barrier towards ai/ml-based combat against covid- [ ] . we list points below on future challenges of open source data sets. • most of the data and code on covid- analysis is closed source. whatever data is available, it is limited for applications of deep learning methods. efforts to curate and augment existing data sets with samples from hospitals and clinics (medical data sets) and self-testing (cough and breath data) applications are specifically required [ , ]. • data should be created, managed, hosted, processed, and bench-marked to accelerate covid- related ai/ml research. as more data is integrated, the (deep) learning techniques become more accurate and move towards large-scale operations. labeling of large data sets is another indispensable task [ , ] . 
• the scarcity of the data is attributed to (a) closed source research methods, (b) distributed nature of data (medical images may be available at a hospital but not aggregated in a unified database), and (c) privacy concerns limiting data sharing. therefore, a key challenge is the federation of the data sets to combat ai. standards and protocols along with international collaborations are necessary for the federation of data sets. the privacy concerns can be addressed by adopting standard procedures for anonymity of the data [ , ] . • interpretability and explainability of ai/ml techniques is another key challenge. ml techniques act as a blackbox. specifically, in deep learning, doctors and radiologists must know which features distinguish a covid- case from non-covid- . moreover, the probability of error needs to be estimated and communicated with the practitioners and patients [ , ] . • most of the medical data comes from china and european countries which may lead to selection bias when applied in other countries. as a result, the practice of diagnosing a patient with covid- using ai/ml is very rare. moreover, it is yet to be investigated if ai/ml can detect covid- before its symptoms appear in other laboratory methods to justify its practice [ ] . most of the researchers studying image-based diagnoses of covid- have emphasized that further accuracy is required for the application of their methods in clinical practices. moreover, researchers have also emphasized that the primary source of covid- diagnosis remains the rt-pcr test and medical imaging services aim to aid the current shortage of test kits as a secondary diagnosis method [ , , ] . contact-less work-flows need to be developed for ai-assisted covid- screening and detection to keep medical staff safe from the infected patients [ , ] . a patient with rt-pcr test positive can have a normal chest ct scan at the time of admission, and changes may occur only after more than two days of onset of symptoms [ ] . therefore, further analysis is required to establish a correlation between radiographic and pcr tests [ ] . data sets are available for most of the research directions in biomedical imaging. however, these data sets are limited in size for the application of deep learning techniques. researchers have emphasized that larger data sets are required for deep learning algorithms to provide better insights and accuracy in diagnosis [ ] . therefore, the collaboration of medical organizations across the globe is necessary for expanding existing data sets. moreover, the accuracy of augmentation techniques in increasing the data set size needs to be evaluated. the ct scan and x-ray based data sets and research are conventional. mri provides high-resolution images and soft-tissue contrast at a higher cost. mri based covid- diagnosis and data set are demanded to compare their accuracy with ct-scans and xray based methods. moreover, the operational performance and effectiveness of the proposed ai/ml-based diagnosis in clinical work-flows under regulatory and quality control measures and unbiased data needs investigation [ , ] . the research on speech-based diagnosis of covid- on symptoms of cough and breath rate is in the early stage of development. researchers have made calls for the collection of voice data, however, whatever data is utilized in the existing studies, only one open source data sets is available yet for speech based covid- diagnosis. 
moreover, the existing data set sizes need enhancement for higher accuracy of classification tasks. three reported studies collected scholarly articles related to covid- [ , , ] . however, the application of nlp is proposed in these works. the inference of scientific facts from published scholarly articles remains a challenge yet to be addressed in reference to covid- . the only resource available in this direction is the covidscholar developed by berkeley lab for semantic analysis of covid- research. however, the details of their algorithms are not available. similarly, data sets have been curated from social media platforms [ , ] . the human emotions and psychology in the pandemic, sentiments regarding lockdowns, and other npi are yet to be investigated thoroughly. another research direction is related to social distancing in the pandemic. given the preferred social distance across multiple countries and the open source data on mobility, what are the effects on covid- transmission? [ ] . moreover, data set curation is also needed to provide an update on practiced social distances during the covid- lock-down initiatives. the timeliness of the research is another important issue for textual data. social media data analysis and corresponding actions become outdated very quickly as it is collected, pre-processed, and annotated at large-scale [ ] . as user's data regarding mobility, location, medical diagnosis, and social media is utilized in ml and statistical studies, privacy remains a focal issue. privacy concerns are further escalated due to open source nature of data. privacy concerns can dominate public health concerns leading to limited sharing of data for scientific purposes. moreover, there is a fear that mission creep will occur when this pandemic is over and the governments will keep on tracking and surveying populations for other purposes. users have concerns about large-scale governmental surveillance in case such data from applications is shared with a third party [ , ] . google, apple, baidu, and safegraph have been identified as sources for mobility data in this review. similarly, hospitals and medical organizations have contributed to the collection of medical data. efforts have been made on the anonymity of the data. however, the data sets have not been rigorously tested for security and privacy vulnerabilities. the automated contact tracing application initiated by several governments in the wake of covid- transmissions also demands consideration of user privacy issues [ ] . automated contact tracing applications monitor the user-user interactions with the help of bluetooth communications. the population at risk can be identified if one user is diagnosed as covid- positive from his user-user interactions automatically saved by the contact tracing application [ ] . the contact tracing applications can be utilized for large-scale surveillance as user data is updated in a central repository frequently. it is yet to be debated the compliance of contact tracing applications with country-level health and privacy laws. similarly, patient privacy concerns need to be addressed on the country level based on health and privacy laws and social norms [ ] . public hatred and discrimination have also been reported against covid- patients and health workers [ ] . the situation demands complete anonymity of medical and mobility data to avoid any discrimination generating from data shared for academic purposes. 
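One small, commonly used measure in this direction is keyed pseudonymisation of identifiers before records are released. The sketch below is illustrative only; it is not the scheme of any provider named in this survey, and keyed hashing alone does not anonymise quasi-identifiers such as location traces.

```python
# Pseudonymise user identifiers with a keyed hash before sharing mobility-style
# records. Illustrative only: not the scheme of any provider named in the survey,
# and keyed hashing does not remove quasi-identifiers such as location traces.
import hashlib
import hmac
import os

import pandas as pd

SECRET_KEY = os.environ.get("PSEUDONYM_KEY", "change-me").encode()

def pseudonymise(user_id: str) -> str:
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

records = pd.DataFrame({
    "user_id": ["alice@example.com", "bob@example.com"],
    "date": ["2020-04-01", "2020-04-01"],
    "visits_parks": [2, 0],
})

records["user_id"] = records["user_id"].map(pseudonymise)
records.to_csv("shareable_mobility.csv", index=False)
print(records)
```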
Blockchain and federated learning are two prospective solutions for additional privacy measures. A private blockchain can provide privacy and accountability for data access, as it is able to trace data operations with trust and decentralization features [ , ]. Several blockchain-based privacy-preserving solutions have been proposed in the context of the COVID-19 pandemic [ , ]. Federated learning-based ML techniques do not need to share and store data at a centralized location such as a cloud data center: ML models are distributed over participating nodes, which share only model parameters and outputs with the central node. As a result, privacy is preserved in a federated framework of machine learning [ , ]. Moreover, public health may be prioritized over individual privacy issues in the context of the COVID-19 pandemic [ ]. Although digital technologies have significantly aided the combat against COVID-19, they have also provided the ground for vulnerabilities that can be exploited in terms of social behaviors [ ]. Fake news/misinformation sharing on social media platforms [ ], racist hatred [ ], propaganda (against 5G technologies and governments) [ ], and online financial scams [ ] are a few forms of digital platform exploitation in the COVID-19 pandemic. Fake news and rumors have been spread on social media platforms about lock-down policies, over-crowded places, and death cases. Fake news identification is already a popularly debated topic in the social and data science communities [ ]. Existing NLP techniques for fake news identification need to be applied to COVID-19 social media data sets to evaluate the proposed works in the current pandemic [ , ]. The focus needs to be on the timely identification of fake news, as with increasing time any counter-action may become irrelevant. The social media platforms also need to be analyzed for human perceptions and sentiments regarding specific ethnicities (for example, Sinophobia) and lock-down policies [ ]. An alarming rise in hate speech and misinformation has been reported on the internet during the pandemic. The propaganda claiming that 5G networks are responsible for COVID-19 spread has also received widespread attention on social media. With society heavily relying on online shopping and banking transactions in the pandemic, an increased number of online scamming and hacking activities have been reported [ ]. Therefore, it is necessary to address and mitigate the misuse of technologies while we rely heavily on them in the existing pandemic for information, retail, entertainment, and banking. Other future challenges related to the theme of our article include, but are not limited to, (a) forecasting COVID-19 cases and fatalities at city and county levels, (b) predicting transmission factors, the incubation period, and the serial interval at a community level, (c) using NLP to analyze public sentiments regarding COVID-19 policies from social media, (d) applying NLP to scholarly articles to automatically infer scientific findings regarding COVID-19, (e) identifying key health (obesity, air pollution, etc.) and social risk factors for COVID-19 infections, (f) identifying demographics at higher risk of infection from existing cases and trends, and (g) ethical and social considerations of analyzing patient data. Apart from these, multiple challenges and competitions addressing issues pertinent to open source COVID-19 data sets are being hosted on Kaggle (e.g., https://www.kaggle.com/covid , https://www.kaggle.com/allen-institute-for-ai/ cord- -research-challenge/tasks , https://www.kaggle.com/roche-data-science-coalition/uncover/ tasks , https://www.covid challenge.eu/) and elsewhere on the internet. We provided a comprehensive survey of COVID-19 open source data sets in this article.
The survey was organized on the basis of data type and data set application. Medical images, textual data, and speech data formed the main data types. The applications of the open source data sets included COVID-19 diagnosis, infection estimation, mobility and demographic correlations, NPI analysis, and sentiment analysis. We found that, although scientific research works on COVID-19 are growing exponentially, there is room for open source data curation and extraction in multiple directions, such as expanding existing CT scan data sets for the application of deep learning techniques and compiling data sets of cough samples. We compared the listed works on their openness, application, and ML/statistical methods. Moreover, we provided a discussion of future research directions and challenges concerning COVID-19 open source data sets. We note that the main challenge towards data-driven AI is the opaqueness of data and research methods. Further work is required on (a) the curation of data sets for cough-based COVID-19 diagnosis, (b) expanding CT scan and X-ray data sets for higher accuracy of deep learning techniques, (c) the establishment of privacy-preserving mechanisms for patient data, user mobility, and contact tracing, (d) contact-less diagnosis based on biomedical images to protect front-line health workers from infection, (e) sentiment analysis and fake news identification from social media for policy making, and (f) semantic analysis for automated knowledge-based discovery from scholarly articles, to list a few. We advocate that the works listed in this survey based on open source data and code are the way forward towards extendable, transparent, and verifiable scientific research on COVID-19.
References
World Health Organization, Coronavirus disease (COVID-19): situation report.
The efficacy of contact tracing for the containment of the novel coronavirus (COVID-19).
Modeling and forecasting of epidemic spreading: the case of COVID-19 and beyond.
Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases.
Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset.
CORD-19: The COVID-19 Open Research Dataset.
A survey of data mining and deep learning in bioinformatics.
Regularized Urdu speech recognition with semi-supervised deep learning.
Scikit-learn: machine learning in Python.
Machine learning "red dot": open-source, cloud, deep convolutional neural networks in chest radiograph binary normality classification.
Identification of COVID-19 can be quicker through artificial intelligence framework using a mobile phone-based survey in the populations when cities/towns are under quarantine.
Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19.
COVID-19: a survey on public medical imaging data resources.
Fighting COVID-19 misinformation on social media: experimental evidence for a scalable accuracy nudge intervention.
2019-nCoV, fake news, and racism.
The COVID-19 pandemic: making sense of rumor and fear: op-ed.
COVID-19 and digital inequalities: reciprocal impacts and mitigation strategies.
Open access epidemiological data from the COVID-19 outbreak. The Lancet Infectious Diseases.
the lancet infectious diseases involvement of the opensource community in combating the worldwide covid- pandemic: a revie artificial intelligence and machine learning to fight covid- artificial intelligence (ai) and big data for coronavirus covid- image data collection prediction models for diagnosis and prognosis of covid- infection: systematic review and critical appraisal a comprehensive review of the covid- pandemic and the role of iot, drones, ai, blockchain, and g in managing its impact a review of modern technologies for tackling covid- pandemic tackling covid- through responsible ai innovation: five steps in the right direction artificial intelligence vs covid- : limitations, constraints and pitfalls on the responsible use of digital data to tackle the covid- pandemic covid-caps: a capsule networkbased framework for identification of covid- cases from x-ray images discrimination and social exclusion in the outbreak of covid- potential covid- c-like protease inhibitors designed using generative deep learning approaches e-quarantine: a smart health system for monitoring coronavirus patients for remotely quarantine covid- and computer audition:, an overview on what speech & sound analysis could contribute in the sars-cov- corona crisis exploring automatic diagnosis of covid- from crowdsourced respiratory sound data evaluation of covid- rt-qpcr test in multi-sample pools embracing imperfect datasets: a review of deep learning solutions for medical image segmentation a survey on image data augmentation for deep learning machine learning-based ct radiomics model for predicting hospital stay in patients with pneumonia associated with sars-cov- infection: a multicenter study a county-level dataset for informing the united states' response to covid- an interactive web-based dashboard to track covid- in real time. 
the lancet infectious diseases the effect of human mobility and control measures on the covid- epidemic in china science estimating the number of infections and the impact of non-pharmaceutical interventions on covid- in european countries open data resources for fighting covid- chester: a web delivered locally computed chest x-ray disease prediction system covid- image data collection covid-ct-dataset: a ct scan dataset about covid- a deep learning algorithm using ct images to screen for corona virus disease detection of coronavirus (covid- ) based on deep features and support vector machine coronavirus disease analysis using chest x-ray images and a novel deep convolutional neural network demystification of ai-driven medical image interpretation: past, present and future lung infection quantification of covid- in ct images with deep learning covid- ct lung and infection segmentation dataset towards efficient covid- ct annotation: a benchmark for lung and infection segmentation harmony-search and otsu based system for coronavirus disease (covid- ) detection using lung ct scan images extracting possibly representative covid- biomarkers from x-ray images with deep learning approach and image data related to pulmonary diseases covid-net:, a tailored deep convolutional neural network design for detection of covid- cases from chest radiography images explaining with impact:, a machine-centric strategy to quantify the performance of explainability algorithms covidx-net:, a framework of deep learning classifiers to diagnose covid- in x-ray images covid- : automatic detection from x-ray images utilizing transfer learning with convolutional neural networks identifying medical diagnoses and treatable diseases by imagebased deep learning automatic detection of coronavirus disease (covid- ) using x-ray images and deep convolutional neural networks extensive and augmented covid- x-ray and ct chest images dataset can ai help in screening viral and covid- pneumonia? automatic detection of covid- from a new lung ultrasound imaging dataset (pocus) covid- imaging-based ai research collection covid- future forecasting using supervised machine learning models analyzing the epidemiological outbreak of covid- : a visual exploratory data analysis (eda) approach coronavirus disease (covid- ) outbreak in china, spatial temporal dataset epidemiological data from the covid- outbreak, real-time case information application of the arima model on the covid- epidemic dataset data in brief pp correcting under-reported covid- case numbers assessing the impact of reduced travel on exportation dynamics of novel coronavirus infection (covid- ) effect of non-pharmaceutical interventions for containing the covid- outbreak: an observational and modelling study. medrxiv an investigation of transmission control measures during the first days of the covid- epidemic in china transmission interval estimates suggest pre-symptomatic spread of covid- the serial interval of covid- from publicly reported confirmed cases serial interval of novel coronavirus (covid- ) infections. 
international journal of infectious diseases measuring emotions in the covid- real world worry dataset a large-scale covid- twitter chatter dataset for open scientific research -an international collaboration the first public coronavirus twitter dataset large arabic twitter dataset on covid- open access institutional and news media tweet dataset for covid- social science research a first instagram dataset on covid- self-reported covid- symptoms on twitter: an analysis and a research resource information mining for covid- research from a large volume of scientific literature discovering associations in covid- related research papers epidemiology, causes, clinical manifestation and diagnosis, prevention and control of coronavirus disease (covid- ) during the early outbreak period: a scoping review scoping studies: towards a methodological framework probast: a tool to assess the risk of bias and applicability of prediction model studies variation in government responses to covid- . blavatnik school of government working paper covid- :, a remote assessment in primary care coswara-a database of breathing, cough, and voice sounds for covid- diagnosis smartphone-based self-testing of covid- using breathing sounds design and development of smartphone-enabled spirometer with a disease classification system using convolutional neural network an early study on intelligent analysis of speech under covid- coronavirus disease (covid- ): spectrum of ct findings and temporal progression of the disease academic radiology artificial intelligence distinguishes covid- from community acquired pneumonia on chest ct rapid ai development cycle for the coronavirus (covid- ) pandemic:, initial results for automated detection & patient monitoring using deep learning ct image analysis an overview on audio, signal, speech, & language processing for covid- meloda : a metric to assess open data reusability the principles of open government data open-source machine learning: r meets weka artificial intelligence in the battle against coronavirus (covid- ): a survey and future research directions patients with rt-pcr confirmed covid- and normal chest ct covid- and artificial intelligence: protecting health-care workers and curbing the spread correlation of chest ct and rt-pcr testing in coronavirus disease (covid- ) in china: a report of cases mapping the landscape of artificial intelligence applications against covid- preferred interpersonal distances: a global comparison blockchain and ai-based solutions to combat coronavirus (covid- )-like epidemics: a survey pact: privacy sensitive protocols and mechanisms for mobile contact tracing feasibility of controlling covid- outbreaks by isolation of cases and contacts. the lancet global health contact tracing mobile apps for covid- blockchain for ai: review and open research challenges ali imran m ( ) beeptrace: blockchain-enabled privacypreserving contact tracing for covid- pandemic and beyond experiments of federated learning for covid- chest x-ray images epidemiology from tweets: estimating misuse of prescription opioids in the usa from social media combating fake news: a survey on identification and mitigation techniques covid- : emerging compassion, courage and resilience in the face of misinformation and adversity abubakar i ( ) racism and discrimination in covid- responses acknowledgments this work is supported by the postdoctoral initiative program at the ministry of education, saudi arabia. 
key: cord- -bn g gi authors: wake, melissa; hu, yanhong jessika; warren, hayley; danchin, margie; fahey, michael; orsini, francesca; pacilli, maurizio; perrett, kirsten p.; saffery, richard; davidson, andrew title: integrating trials into a whole-population cohort of children and parents: statement of intent (trials) for the generation victoria (genv) cohort date: - - journal: bmc med res methodol doi: . /s - - -x sha: doc_id: cord_uid: bn g gi background: very large cohorts that span an entire population raise new prospects for the conduct of multiple trials that speed up advances in prevention or treatment while reducing participant, financial and regulatory burden. however, a review of literature reveals no blueprint to guide this systematically in practice. this statement of intent proposes how diverse trials may be integrated within or alongside generation victoria (genv), a whole-of-state australian birth cohort in planning, and delineates potential processes and opportunities. methods: parents of all newborns (estimated , ) in the state of victoria, australia, will be approached for two full years from . the cohort design comprises four elements: ( ) consent soon after birth to follow the child and parent/s until study end or withdrawal; retrospective and prospective ( ) linkage to clinical and administrative datasets and ( ) banking of universal and clinical biosamples; and ( ) genv-collected biosamples and data. genv-collected data will focus on overarching outcome and phenotypic measures using low-burden, universal-capable electronic interfaces, with funding-dependent face-to-face assessments tailored to universal settings during the early childhood, school and/or adult years. results: for population or registry-type trials within genv, genv will provide all outcomes data and consent via traditional, waiver, or trials within cohorts models. trials alongside genv consent their own participants born within the genv window; genv may help identify potential participants via opt-in or opt-out expression of interest. data sharing enriches trials with outcomes, prior data, and/or access to linked data contingent on custodian's agreements, and supports modeling of causal effects to the population and between-trials comparisons of costs, benefits and utility. data access will operate under the findability, accessibility, interoperability, and reusability (fair) and care and five safes principles. we consider governance, ethical and shared trial oversight, and expectations that trials will adhere to the best practice of the day. conclusions: children and younger adults can access fewer trials than older adults. integrating trials into mega-cohorts should improve health and well-being by generating faster, larger-scale evidence on a longer and/or broader horizon than previously possible. genv will explore the limits and details of this approach over the coming years.
keywords: research methodology, randomization, registry trials, multiple baseline randomized trials, trials within cohorts, population studies, generation victoria (genv), clinical trial as topic, children, intervention background randomized controlled trials (rct) provide high-quality evidence with regards to the effectiveness of therapies and prevention and are critical to guide translation and optimal resource allocation. the traditional parallelarms trial design is a stand-alone initiative for which each trial identifies a specific question and sample, obtains funding, consents and randomizes subjects to two or more different treatments or interventions, follows the groups in parallel and collects the outcome data. there are many variationsfor example, cluster vs. individual randomization, and stepped-wedge, adaptive and cross-over designed. stand-alone randomized trials are challenging, slow and costly [ ] . more than two-thirds of multi-center, publicly funded uk trials do not recruit their target number of patients within the specified timeframe [ ] . consequently, trials are often underpowered, require an extension with additional costs, and encounter delayed translation into clinical or preventive practice, or are never completed. furthermore, most trials interrogate a small number of hypotheses in restricted groups over a short time frame (when a long-term benefit is often the real goal), often with considerable heterogeneity between trials in samples, methods and outcomes. collectively, this results in financial and scientific inefficiencies and a lack of generalizability and translatability [ ] . this situation is particularly problematic for children [ ] , whose evidence base (and therefore care) lags due to a paucity of trials [ , ] . one efficient and generalizable solution is to embed trials in existing data collection structures such as registries, electronic health records, and administrative databases [ , ] . there are many examples of this approach. high-quality registries focus on full, unbiased condition ascertainment with standardized outcomes embedded into clinical care. these can support registry trials, whereby a registry participant meets a trial's eligibility criteria, is consented, randomized and accrues trial outcomes that are usually fully embedded in the registry. point-of-care trials embed trial processes (like randomization, ascertainment of outcomes) into the clinical care process, increasingly by effectively using the electronic medical record (emr). large, simple trials may compare interventions already in standard care, but for which evidence of superiority or equivalence is not available [ ] . as individual risk is low, these may include opt-out (with inclusion the default) or waiver of consent, point-of-care randomization (see above), and/or short information statements. registry, point-of-care and large, simple trial elements may co-occur in a single trial. a recent development to leverage even greater health gain from single registries is to coordinate simultaneous registry trials using pre-specified master protocols. these may test the impact of targeted therapy on multiple diseases (basket trials), of multiple therapies on a single disease (umbrella trials), or of several interventions against a common control (platform trials, also known as multiarm multi-stage (mams) trials) [ ] . park's 'landscape' analysis reported rapid growth in master protocols over the last years. 
however, very few of these trials target children, and most are in highly specialized fields. thus, of the master protocols identified ( basket, umbrella, platform; in the us), most were in adults ( / , %), exploratory, and designed to examine experimental drugs ( / , %) in the field of oncology ( / , %) [ ] . challenges include stakeholder coordination, infrastructure and governance requirements, and the integration and complexities of the pre-specified trial and analysis design [ ] . multiple trials can also be conducted within longitudinal epidemiological cohorts. the traditional role of a cohort study is to observe incidence, prevalence, trajectories, natural history and exposure-outcome associations. however, their sampling design and longer outcome horizons may also be appealing to trials, for which the cohorts can act essentially as populationbased registries. one advantage is that the trial sample can be compared to a broader population in terms of baseline characteristics and natural history of a condition of interest and its short and long term outcomes. trials may be conducted in parallel with the cohort and across a wide range of patient and participant groups, as in western australia's origins project [ ] ; with appropriate consent, a cohort may subsequently provide pre-randomization (e.g., genetic) and some or all outcomes data for trials. a second model has been variously labelled zelen trials [ ] , cohort multiple randomized controlled trials (cmrcts) [ , ] , and trials within cohorts (twics), as in england's born in bradford better start cohort [ ] . in twics, cohort participants consent to contribute control data to future unspecified trials at recruitment into the cohort itself, with only participants randomly allocated to the intervention arm then asked for informed consent into any given twics trial. despite the potential for allocation bias, they can be efficient and achieve valid [ ] and meaningful outcomes [ , ] , and can evaluate the impact of 'stacked' interventions [ ] approximating how parents and children naturalistically navigate needed services. a third potential model is that of master protocols as above that predefine from the outset and coordinate a set of trials that may occur. we are aware of two potential examples for healthy cohorts: the developing global alpha collaboration aims to embed large platform trials in routine care to improve maternal and perinatal health, while helti (healthy life trajectories initiative) comprises harmonized interventional periconception cohorts in china, south africa, india and canada testing preplanned, stacked interventions over a -year period aiming to prevent obesity and other non-communicable diseases in over , children. if a cohort were to involve a sufficiently large and complete population, then it would be possible to enact a range of these ideas simultaneously, encompassing a wide range of unmet trial needs from rare diseases through to population and health services research. the forthcoming generation victoria (genv) provides an opportunity to explore and operationalize these ideas. genv is a state-wide cohort that will approach for recruitment, parents of all newborns (estimated , ) in the state of victoria (population . million [ ] ), australia, over two full years from . its goal is to generate translatable evidence (prediction, prevention, treatments, services and policy) to improve the future wellbeing of all children and adults and to reduce future disease burden. 
however, there is no existing road map in the international literature to achieve and encourage the flexible trials-genv integration that could contribute to this goal. here, we report on the preliminary processes and guidance we have developed so that trials can prospectively integrate within a substantial cohort (genv) to maximize these opportunities. as a statement of intent, this paper differs from a master protocol in that it does not prespecify any one trial or trial design. this statement of intent outlines the proposed principles and processes for the integration of future trials into genv. its development has been guided by the spirit statement [ ] and the anticipated consort extension for rcts using cohorts and routinely collected health data. the genv cohort is currently in advanced design (see below). an aud . million grant from the paul ramsay foundation supports its infrastructure development, while a $ million grant from the victorian government supports its design, cohort planning and implementation, stakeholder engagement and knowledge translation activities. its sponsor-investigator is professor melissa wake, who is also genv's scientific director. the core executive comprises melissa wake, richard saffery (deputy director, biosciences), sharon goldfeld (deputy director, equity and knowledge translation) and kathryn north (director, murdoch children's research institute (mcri)). the directors regularly report to genv's operational advisory committee and thence to the board of the mcri. several advisory committees inform genv, including a broad range of senior victorian researchers comprising the investigator committee, and several working groups of which the ethics & governance and trials working groups are relevant here. this relatively simple administrative structure may mature after successful cohort implementation. genv has been endorsed by the royal children's hospital human research ethics committee (hrec . ), including in-principle approval as a mechanism to support trials. genv proposes to work with trials that fulfil the administrative information requirements laid out in the spirit checklist (see additional file ) or equivalent at time of application. these cover title, trial registration, protocol with version control, funding sources and types, and roles and responsibilities of protocol contributors, sponsor/s and funder/s, and other individuals or groups overseeing the trial such as the coordinating center. these documents will contain the unique features of each trial's design, participants, timelines and outcomes, as well as data sharing and other relationships with genv. at the time of writing, the genv cohort is in advanced planning and, therefore, still evolving. a range of genv summary documents are available for review on figshare (https://mcri.figshare.com/projects/generation_ victoria/ ) [ ] . here, we outline genv with only enough contextual detail to inform the context of the trials. genv's primary objective is to create very large, parallel whole-of-state birth and parent cohorts for discovery and interventional research. genv data and biosamples can only be used for research that benefits human health. genv's setting is the entire state of victoria (population . million in ), australia [ ] . because it may be relevant to trials that could be undertaken in a cohort of this magnitude, we provide some victorian descriptive information here. in the census, the median age of victorian people was years; . 
% of its population were - -year-old children, . % were aged years and over, and . % were male [ ] . around % of victorian residents were born in australia. the most common ancestries were english ( . %), australian ( . %), irish ( . %), scottish ( . %) and chinese ( . %), but its multi-ethnicity is reflected in more than languages. around % of the population identify as aboriginal and/ or torres strait islander [ ] . like other australians, victorians are relatively affluent, with a median weekly pre-tax personal income for people aged years and over of aud in [ ] . however, a wide range of advantage-disadvantage exists, with % of victoria's population, and % of its children, living below the poverty line based on the census data [ ] . over % of australian parents report their child has at least one ongoing health or developmental problem at every age from age , rising to around % from age to years [ ] . genv is a population-based cohort study that blends study-collected, study-enhanced and linked data. the cohort design comprises four elements: ( ) consent soon after birth to follow the child and parent/s indefinitely until the study closes (no end date set at this point) or withdrawal, ( ) retrospective and prospective linkage to clinical and administrative datasets, ( ) banking of retrospective and prospective universal and clinical biosamples, and ( ) genv-collected biosamples and data. genv proposes to recruit for two full years from in all of victoria's birthing hospitals (n = at time of writing) [ ] , in which collectively around , babies are born each year. mcri-employed recruiters aim to personally approach parents of all newborns for consent, with interpreting and translation support as needed. children will have the opportunity to decide on their continued participation as they reach the age for legal consent. genv's consent includes parent permission for approved researchers to access genv's data, for data sharing between genv and external trials, and for recontact to offer additional research opportunities. all children born in victoria during the recruitment period whose parents/guardians have decisional capacity, and their parents, are eligible to take part. participants who leave victoria may continue to take part via linked and contributed data, and families may join genv, who move into victoria later and have children born within the recruitment period. however, in both of these instances, data may be incomplete. principles for genv and for trials within and alongside genv figure is an infographic that shows the concept of how trials might integrate with genv across the life course. all genv activities (including those that relate to trials) are informed by the genv principles, as outlined in fig. (a): collaboration, inclusivity, sustainability, enhancement, systemized processes and value. therefore, it is implicit that all trials working prospectively within and alongside genv would also be a good fit with these principles. we do not anticipate that this would impose an additional burden since funding bodies and international guidelines already stipulate best practice integral to trials, such as highquality evidence of need (e.g. a prospero-registered systematic or rapid review) and checklists (e.g. con-sort, spirit). all trials need their own ethical and other approvals before they can start. for maximal mutual benefit, we envisage that partnerships between genv and trials will generally be prospective, i.e., worked out and agreed before the trial begins. 
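as an illustration of the data-linkage element of the cohort design described above, the sketch below derives a salted-hash linkage key so that records from different administrative datasets can be joined without analysts handling raw identifiers. the field names, salt handling and example records are assumptions for illustration only and do not describe genv's actual linkage infrastructure.

```python
# Illustrative sketch of privacy-preserving deterministic record linkage of the
# kind a population cohort might use to join administrative datasets; the field
# names and salted-hash scheme are assumptions, not GenV's actual method.
import hashlib

SALT = "shared-secret-salt"  # hypothetical secret held only by the linkage unit

def linkage_key(family_name: str, given_name: str, dob: str, sex: str) -> str:
    # normalise identifiers, then hash so analysts never see raw identity fields
    raw = "|".join(s.strip().lower() for s in (family_name, given_name, dob, sex))
    return hashlib.sha256((SALT + raw).encode("utf-8")).hexdigest()

# two hypothetical source datasets keyed on the derived token
births = {linkage_key("Smith", "Ava", "2021-07-01", "F"): {"birthweight_g": 3400}}
admissions = {linkage_key("smith", " Ava", "2021-07-01", "F"): {"admissions_0_2y": 1}}

# join on the token; normalisation makes the slightly differently formatted names match
linked = {k: {**births[k], **admissions.get(k, {})} for k in births}
print(linked)
```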
we are currently developing processes to operationalize the trial-specific principles outlined in fig. (b) in minimally burdensome ways. other trials will continue as they have always done, independently and unrelated to genv. genv may support trials within (model ) and trials alongside (model ) genv, as outlined in fig. ; these models do not dictate the design of the trial (e.g., whether individually or cluster randomized). trials within genv (model ) may be conducted as standard trials with opt-in, opt-out, or waiver of consent preceding randomization. alternatively, twics/cohort multiple rcts may sometimes be considered whereby participants are randomized and then only those randomized to the intervention provide additional opt-in or opt-out consent. interventions for trials alongside genv (model ) may be delivered either by genv or by trials themselves, with the latter well suited to trials arising from emrs and external registries for participants born in the genv window. for model , the trial collects the participant's consent for two-way data sharing with genv, including a minimum dataset of items common to multiple trials. governance will be agreed upon before the trial implementation. for trials alongside genv, all or some of the trial sample is also in genv. in the latter situation, the genv principles and the data sharing and enhancing benefits of genv would only apply to that subsample. we anticipate that most trials that integrate with genv will be proposed by researchers outside genv's implementation team, who would, therefore, also define the content, ideally to have a good 'fit' with genv. some trialists are already actively approaching us with ideas for trials that would either be impossible without genv or would benefit from otherwise inaccessible outcomes data. genv hopes to elicit other possibilities via activities (to be developed) such as publicized annual open faceto-face and web-based fora to brainstorm and prioritize trial ideas, ideally involving a range of stakeholders including services, communities and families. good communication, transparency and agreement are vital and will underpin a working together agreement between genv and each trial (example shown in additional file ) developed following genv's rapid evidence review of large research-led partnerships [ ] . trialists and those responsible for genv may at times have differing opinions on where the balance of benefit vs burden lies, and this will need to be considered openly. the benefit can be demonstrated by a rapid or systematic review supporting the need for the trial, ideally within the context of a 'living' review that can be readily updated over time, including with the results of the trial [ ] [ ] [ ] . burden relates not only to participants (consent, intervention content, follow-up and thence potential attrition) but also to impacts on genv itself and its guiding principles. genv is not funded to conduct trials, which will require their own funding. genv can currently provide limited support (including supporting trials to apply for funding) and is seeking dedicated funding to be able to provide additional support. in the meantime, activities such as determining eligible participants, consent, randomization and (potentially) intervention delivery as per fig. , model may need to be undertaken on a cost-recovery basis. 
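the following sketch illustrates the twics/cohort multiple rct allocation logic described above: eligible cohort members are randomized first, and only those drawn for the intervention arm are approached for trial-specific consent, while controls are followed through routine cohort data. the cohort size, eligibility rate and consent rate are invented for illustration and do not represent any planned genv trial.

```python
# Sketch of TwiCs/cohort-multiple-RCT allocation: randomise eligible cohort
# members first, then seek consent only from those drawn for the intervention.
# Participant records, eligibility and consent probabilities are invented.
import random

random.seed(42)
cohort = [{"id": i,
           "eligible": random.random() < 0.3,
           "consents_if_offered": random.random() < 0.7}
          for i in range(1000)]

eligible = [p for p in cohort if p["eligible"]]
random.shuffle(eligible)
half = len(eligible) // 2
intervention_draw, control = eligible[:half], eligible[half:]

# only the intervention draw is contacted; decliners are still analysed as offered
offered = intervention_draw
accepted = [p for p in offered if p["consents_if_offered"]]

print(f"eligible: {len(eligible)}, offered intervention: {len(offered)}, "
      f"accepted: {len(accepted)}, cohort controls: {len(control)}")
```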
for support in the form of expertise, genv - rather than reinventing the wheel - proposes to connect proposed trials to local expert bodies such as the melbourne children's trials centre, monash university trials hub, nhmrc clinical trials centre, australian clinical trials alliance (acta), and the interdisciplinary maternal perinatal australasian collaborative trials (impact) network of maternal and perinatal trials. we anticipate an initial trial-genv discussion that articulates the rationale of the intervention, trial design and research questions using the picot framework. if feasibility (potentially demonstrated through pilot studies) and mutual alignment appear likely [ ], the trial would proceed to a partnering agreement that defines at least the following items: 1) which genv trial model is being followed; 2) design and high-level (or draft) protocol; 3) timelines; 4) data sharing and governance plans; 5) status of ethical approval; 6) communication with participants, including information statement and consent; 7) trial oversight and 8) capacity assessment, including trial quality, human resource and funding. we envisage that this discussion will be enabled by a yet-to-be-established genv trials oversight committee with cross-disciplinary expertise. inclusion of consumers, including genv participants, will be important to minimize participant burden and streamline data collection and trial conduct. the trial sponsor will usually be a representative of a university, research institute or similar organisation, but genv does not preclude collaboration with commercial sponsors provided all its principles are met. it is assumed that, throughout the trial, genv will collect agreed data and maintain high retention while the trial will maintain independent quality, ethical and governance protocols in line with international standards. it is also assumed that genv staff will collaborate with the trial management staff in order to understand, prevent and solve any day-to-day issues at the genv end that may impact on the trial or genv. integration of trials with genv to greatest effect will occur with appropriate consent wording in both the trial and genv. at recruitment into genv, parents provide consent for genv to follow themselves and their child. as per the consort extension [ ] for rcts using cohorts and routinely collected health data, this includes consent to use their data for research purposes, with ongoing mechanisms to enable change in consent status (such as partial or full withdrawal or re-entry) any time after that. genv's full parent information & consent statement (picf) is available for review [ ]; its explicit trial-relevant wording is shown in table (a). all trials require ethically-approved consent models and wording. as noted in fig. , it is plausible that trials could be undertaken via waiver, opt-out or opt-in consent models, and that opt-in/opt-out consent could be undertaken in full (all arms, ideally before randomization) or twics models. for trials wholly within genv (fig. , model ), genv already includes consent for data sharing with approved users. for trials alongside genv (fig. , model ), trials will likely benefit most if data can flow from trial to genv and not merely from genv to trial (for which consent is already in place) as outlined in section (data) below. therefore, genv recommends that trials include wording along the lines of table (b) to support maximal data utility and value.
table : picf wording for (a) genv to work with trials and (b) trials to work with genv
(a) wording in the genv picf that is specific to supporting trials:
• "you may be offered the chance to take part in future ethically approved studies working with genv …. you can always choose whether to take part."
• "genv's data can only be used for ethically approved research to improve health, development, or wellbeing for children and adults. over time, researchers will use lots of different methods to answer new and important questions. therefore, the value of your information will keep growing for many years."
• "some genv participants may join research trials testing new approaches. all trials need ethical approval. who is offered the new approach is randomly picked, like tossing a coin. in some trials, only people offered the new approach are contacted about taking part. genv data can be used to compare the outcomes of people who do and do not receive the new approach."
• "trials … may ask your consent to share data with genv, with ethics approval. we support this."
(b) suggested wording for trials to include in their picf, as appropriate to its degree of integration with genv (fig. ):
• [model c only: choose relevant wording] "you can be in this trial and not in genv, but we will be missing some information about you/your child" or "you can only take part in this trial if you are also in genv."
design of randomization is determined by each trial as is most appropriate to the intervention, questions and sampling, and as per best sample selection, randomization and blinding of follow-up practice at the time. random allocation can take place before (in the case of twics) or after informed/waived consent, and by genv or by the trial, as indicated in fig. . randomization procedures will be reported in each trial's consort statement. genv may inform trials in a variety of ways. it may provide primary or secondary outcomes, measures on which to select or stratify participants, and moderator and mediator variables. figure illustrates the range and timing of measures being explored by genv at the time of writing, spanning linked, biosample-derived and genv-collected data. it is expected that ultimately many, but not all, will prove feasible for genv to include via data linkage, data collection or biosamples. whereas birth cohorts have traditionally been purely observational in design, and focused on the discovery of longitudinal associations, genv's focus is on solutions to improve health and reduce the burden of disease. testing such solutions may occur not only via trials but also natural experiments, simulation and causal modeling. all require robust outcome measures with sufficient sensitivity to demonstrate meaningful effects and an ability to quantify potential health gain when putative causal factors are targeted. the commonality of outcomes would enable comparisons of benefits and costs of different interventions for different target groups within a single dataset. a further benefit to trials is that genv intends to collect such measures over many years, enabling trials to access longer-term data than might be possible for a single trial. genv is planned without knowledge of what future trials may be proposed. however, genv has developed a framework (j wang, yj hu, s clifford, s goldfeld, m wake: selecting lifecourse frameworks to guide and communicate large new cohort studies: generation victoria (genv) case study, submitted) and outcomes hierarchy (fig. ) to guide its measures selection and prioritization (additional file ).
this framework considers genv's whole-of-state remit and principles of inclusivity, sustainability and systematized processes, which require low measurement burden with high ease of administration. therefore, genv will not collect prioritized measures that are already reliably collected and accessible via linkage sources, unless required for the day-to-day running of genv. at the highest level, genv will repeatedly capture overarching health and wellbeing with generic measures that have international as well as local salience: health-related quality of life (quality-adjusted life years, qalys [ ] ), disease/disability burden (disability-adjusted life years, dalys [ , ] ), requiring information on conditions, illnesses and problems that parents and children experience (international classification of diseases th revision, icd- [ ] ), and functional status (international classification of functioning, disability and health, icf [ ] ). when coupled with service-related data, for example, regarding encounters, costs and medications, these measures would also support economic analyses. although comprehensive, these highest-level measures do not capture the individual traits/phenotypes that are critical to many interventions. therefore, to support the greatest number of trials while retaining parsimony, genv proposes to prioritize collecting outcomes included across multiple core outcome sets (cos) [ ] and thus already demonstrated to be of broad importance to patients, families, clinicians and policymakers as well as researchers. these span physical phenotypes (e.g., growth, body composition, dysmorphology, motor skills and senses), and mental, social, cognitive, learning and positive health. given that genv is targeting over , children and their parents with data from hundreds of participants daily, the only feasible way of collecting such phenotypes is remotely and digitally. to enable capture of multiple outcomes and phenotypes, therefore, genv is exploring developing an 'ephenome', a high-throughput digital platform will let genv measure and evaluate diverse outcomes cheaply and at very large scale, while maintaining genv's principles of value (including to participants) and inclusivity. we envisage that each ephenome measurement encounter would select from a suite of ultra-short, universalcapable digital survey items, measures, images and videos. these will either be pushed universally from genv for participants to complete remotely on any device or (for measures that meet the genv enhancement principle) could be collected by services within existing universal contacts. future funds permitting, we envisage that face-to-face school-based assessment will capture measures that require physical equipment, technical skill in administration and/or wearable devices. such assessments may be shaped by the needs of individual trials ongoing in the years before each wave. as well as its direct digital platform, genv proposes to draw data wherever possible from administrative datasets. this information includes, for example, data from health, education and other providers; electronic medical records (emr); geographic datasets; and from trials and registries. all available data will be integrated with the genv data systems, using direct deposition, enduring data linkage and/or ephemeral safehaven linkage processes. genv's 'victorian child's lifecourse journey in data' [ ] lists many of the datasets that victorian children and parents currently accrue, many of which genv may link to in the future. 
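as a worked example of the overarching metrics mentioned above, the sketch below computes qalys as utility-weighted years lived and dalys as years of life lost plus years lived with disability, in their simplest undiscounted form; the utility values, disability weight and counts are illustrative assumptions rather than genv estimates.

```python
# Toy illustration of the summary outcome metrics named above, using the
# standard undiscounted definitions: QALY = sum of utility-weighted years lived;
# DALY = years of life lost (YLL) + years lived with disability (YLD).
# All numeric inputs below are made up for illustration.
def qalys(yearly_utilities):
    # utilities are on a 0 (death) to 1 (full health) scale, one value per year
    return sum(yearly_utilities)

def dalys(deaths, life_expectancy_at_death, cases, disability_weight, avg_duration_years):
    yll = deaths * life_expectancy_at_death               # years of life lost
    yld = cases * disability_weight * avg_duration_years  # years lived with disability
    return yll + yld

print(qalys([1.0, 0.9, 0.7, 0.8]))          # 3.4 QALYs accrued over four years
print(dalys(deaths=2, life_expectancy_at_death=70,
            cases=100, disability_weight=0.05, avg_duration_years=1))  # 145 DALYs
```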
genv's website [ ] will publicly record all datasets accessed each calendar year. biological samples at any stage are frequently out of reach of trials due to burden or cost, especially samples that predate trial commencement. genv is working towards the consented storage and research use of existing and new universal biosamples to the highest possible standards of conservation (including transfer to genv's − °c autostore). figure outlines the range of samples currently being explored. it is hoped that the biosamples will span multiple tissues (e.g., blood, saliva, stool, breast milk), all participants, and multiple time points including all trimesters of pregnancy, the neonatal period and school entry. there may be potential for collaborating trials to help shape future whole-of-genv biosample collection. due to the depletable nature and likely small available volume of biosamples, it is highly unlikely that genv will approve individual assays for trials, but rather will make available comprehensive biological data of the broadest value possible such as metabolomics or polygenic risk data. genv will encourage trials to collect a generic minimum dataset relevant specifically to trials, following precedents set by initiatives such as the dutch older persons and informal caregivers survey minimum dataset (topics-mds) to which over projects have now prospectively contributed [ ] . this small minimum dataset is to be developed collaboratively in - . potential benefits include the ability to compare trials on common outcome metrics, evaluate effects of stacked interventions for those in more than one trial simultaneously or over time, and pooling of data for individual participant meta-analyses. many trials will also require outcomes specific to their research questions. once trials and genv are agreed, the trial investigators will most likely collect samples themselves outside of genv in dedicated visits. the data genv hold may prove very valuable to trials because they would otherwise be unavailable due to population coverage, timeframe, jurisdiction, logistic, funding or other constraints. however, constraints could likewise apply to genv. genv commits to making data available on completion of a given 'sweep' or 'wave,' i.e., once all participants have provided a particular set of data. like other major cohorts, it will generally handle data processing for its vast numbers of participants en masse, with benefits including efficiency and cost reduction, uniform access to technology advances (e.g., new automated scoring or assays), consistency (e.g., avoidance of batch effects/drift and of conflicting or overturned results) and completeness (unfinished data waves that do not meet the principle of inclusivity, whereby all data are available for all participants). each wave of genv data collection will likely take years from first to last participant to collect measures that are predicated on age milestones; thus, trials data would be available much sooner for trial interventions conducted later than earlier in a data collection wave. while some data items (e.g., straightforward proms (participant-reported outcome measures)) need no or minimal processing, others (e.g., image extraction) take additional time even when processing begins before the wave is complete. therefore, it will usually be important that trials work closely with genv during the design and ethics approval phases to consider the timing of likely outcomes data and its implications. 
for example, desired genv data may not be available sufficiently close to real-time to contribute to data safety & monitoring committees or to adaptive trial designs whereby the intervention is modulated according to therapeutic effect. note, however, that for trials within genv, process monitoring data (e.g., consent rates, interaction durations, data response rates) will be available promptly to optimize the trial's compliance to its protocol. for trials conducted wholly within genv, all data will be within its data repository. for trials conducted alongside genv, data will need to be shared between the trial and genv. figure illustrates the benefits of two-way data sharing. by genv transferring data to trials, trials can access additional outcomes over more extended time frames, and potentially examine variation in response by moderators (such as pre-existing prospectively-collected biological or psychosocial traits) or mediators. by trials transferring their data into the genv repository, they can access data that genv cannot on-transfer (e.g., linked administrative datasets according to custodian agreements), model causal effects to the whole population using actual whole-population data, and combine and compare costs and benefits/utility across trials. all of these should enhance trial prominence, impact, and translation of significant findings. data standardization, quality control, safety and privacy applied throughout the genv data repository will also apply to trial data that enter the genv dataset. genv's data, legal, linkage and cohort personnel will provide technical support and guidance for these issues. genv is designed to be accessed by a wide variety of analysts, including researchers, service providers and policy-makers while maintaining confidentiality. its data are intended to be an equal access resource, via the fair [ ] and five safes (safe people, projects, settings, data, outputs) [ ] principles, to facilitate uptake and translation. from the point at which a complete useable dataset is available to them, we propose that trialists would have exclusive access to trial data placed within genv for months (in line with non-trial studies such as the uk biobank and the longitudinal study of australian children), with intervention/control status masked for a full months. each trial will undertake its own sample size calculations, statistical analysis, and reporting according to its design and best international standards at that time. for example, it is assumed that each trial will involve a biostatistician experienced in trials, and that the trial will be analyzed and reported according to standards such as consort (including the forthcoming consort extension for trials using cohorts and routinely collected health data), spirit and template for intervention description and replication (tidier) [ , , ] . trials may also access advice and support from genv's biostatisticians, subject to genv funding. genv's trials capabilities will work within the solutions hub, the arm of genv that is concerned with epidemiology, science, knowledge translation and researcher engagement. the authors of this statement of intent comprise the current expert genv trials working group, whose working together agreement is shown in additional file . during - this group will support the governance and planning work needed to move from this statement of intent to a position where genv is fully enabled to support trials as envisaged. 
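as an example of the per-trial sample size calculations referred to above, the sketch below applies the standard normal-approximation formula for comparing two proportions; the assumed outcome rates, significance level and power are illustrative only and would be replaced by each trial's own design parameters.

```python
# Standard two-proportion sample-size calculation of the kind each trial would
# run for its own primary outcome; the effect size (10% vs 15%), alpha and
# power below are purely illustrative assumptions.
from statistics import NormalDist

def n_per_arm(p1, p2, alpha=0.05, power=0.8):
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_b = z.inv_cdf(power)           # power
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_a + z_b) ** 2 * var / (p1 - p2) ** 2

print(round(n_per_arm(0.10, 0.15)))  # required participants per arm (several hundred here)
```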
although impacted by the covid- pandemic, consumer and stakeholder engagement has commenced and will continue. in late , genv conducted an open web-based focus area survey whose analysis is nearing completion. genv has engaged and will continue to engage widely with research and service bodies spanning health (universal, primary, secondary and tertiary), education and other sectors, who are represented on many of its working groups. consumer consultation will proceed through engagement led by genv's solutions hub, including how people of aboriginal and torres strait island descent may choose to be involved. we propose a formal consultation process on an annual basis, with formats yet to be determined, and focusing mainly on the major issues and opportunities for the cohort's age and stage - years hence (see fig. ). this review allows enough time to plan and fund trials, put partnerships in place, and complete preparatory work such as rapid or systematic reviews supporting the need for the trial. genv does not propose to formally limit the number of trials that a participant could enter, but rather take into consideration the needs of individual trials and apply a 'reasonableness' approach. if more trials are proposed or funded, than the genv sample can accommodate, then a collaborative prioritization process will be needed to determine which can be supported. to our knowledge, genv is the first mega-cohort internationally aiming to maximize its experimental as well as observational evidence via an integrated and purposive program of trials. this statement of intent lays out broad principles and processes ahead of genv's planned commencement in , streamlining the integration of trials within or alongside genv from its earliest days. commencement of genv immediately after covid- will enable a unique and powerful platform for ongoing surveillance and response; its population reach, digital infrastructure and ability to remotely support decentralized trials place it uniquely to evaluate experimental strategies to manage the pandemic's health, economic and social aftermath on children, families and communities during potentially lengthy partial quarantine periods and recovery. the major strength is the design of genv itself. as a whole-population study aiming to recruit all babies born and their parents over full years in the sizable and stable state of victoria, it reaches into every metropolitan, regional, rural and remote community and every level of advantage. its data linkage and ephenome capacity lower the burden for both trialists and participants. the existing genv data systems can support the activities outlined in this statement of intent without architectural changes. this trial-ready scaffold may empower communities that have typically lacked the necessary infrastructure to lead or join trials, especially relating to health services research and behavioral interventions. multiple trials could be embedded, evaluating multiple interventions and identifying multiple participant groups all with a true population denominator. genv's outcomes hierarchy and time horizons should, for the first time, enable comparison on the same metrics of the lasting benefits and costs of multiple, widely-differing approaches to improve health and wellbeing. this statement of intent should expedite trial planning, documentation and implementation. regarding limitations, this statement of intent does not take into account the unknown success of genv's recruitment. 
this is potentially an issue if groups that could most benefit from a boost in trials-based evidence are under-represented (e.g. disadvantage, ability, minority) while noting that trials alongside genv offer a route to redress this via later recruitment into genv of those who initially declined or were missed. we also do not yet know whether or how much the inclusion of potential future trials in our parent information statement at the time of consent will impact on genv's uptake rates. while smaller studies (such as origins and born in bradford better start, see below) have been generous in sharing their learnings and thus shaping our plans, we have not identified existing very large-scale studies that could highlight possible unintended consequences. we hope in due course that genv can provide such empirical learning. despite the collaborative thinking underlying this statement, detailed capabilities remain to be designed and constructed (such as processes to identify and to randomize eligible participants), some of which may only be solved once trials are in planning or underway. some of these are discussed in practical or operational issues, below. a further limitation is that genv is not at this time funded to support trials or their administration. such support (over and above the funds required by the trial itself) may be vital to help collaborators navigate the requirements for starting and conducting trials, especially in regional or rural hospitals and communities that do not have a robust research infrastructure. obtaining such internal genv 'support' funding will be an ongoing focus. for those who may wonder if genv might stifle other research, we affirm that genv has no capacity or desire to impose collaboration with clinical or other trials involving children born in the genv birth window. we do hope for mutual awareness and communication. others are recognizing the promise of longitudinal intervention cohorts to 'stop describing and start fixing' children's problems [ ] . the bibbs (born in bradford better start) experimental birth cohort [ ] aims to recruit pregnant mothers by in inner-city bradford, north england, and to test over interventions for children's social and emotional development using a range of designs including trials within cohorts (twics) and quasi-experimental designs. origins [ ] aims to recruit , mothers in the joondalup region (a community in northern perth, western australia) between and . within this, active participants are invited to participate in sub-projects if they meet the eligibility criteria, with participation in some projects restricted if the outcomes overlap; at time of writing, twelve nested randomized trials are currently under way (personal communication, j davis). experience from both indicates a healthy appetite from researchers, policymakers and trial funders, but also that regular transparent two-way communication is vital, as are burden minimisation of and realism about timelines for trialsrelated data management and release unless the cohort itself is adequately funded to handle this. the helti (healthy life trajectories initiative) consortium has attracted large-scale funding from national funding bodies in its member countries canada, india, south africa and china in collaboration with the world health organization, and is well along the path of establishment [ ] . 
over time, genv participants may encounter more than one trial, and therefore "stacked" interventions across childhood that respond to the issues they are experiencing. this approach is less planned than helti but may mimic how children and adults accrue services naturalistically. at least one observational study has demonstrated a cumulative beneficial effect of participation in more services across childhood [ ] and another that stacking multiple intervention components is costeffective [ ] . while examples are accruing of trials integrated with cohort studies under different names, such as cohort embedded rct [ ] , cohort multiple rct [ ] , cohort nested rct [ ] and trials within cohorts (twics) [ ] , genv's freedom of design -spanning trials both within and alongside genv -appears unusual. here, we mention some of the many details that remain to be decided. many will require resourcing both in genv and the trials themselves. genv will need vigilance in limiting red tape while at the same time being in a position to help prioritize, plan, standardize (e.g. measures, processes), execute and monitor trials in ways that help trials while upholding genv principles. we have yet to develop genv's internal administrative structures to achieve this (ahead of recruitment commencing in ). we are currently developing a framework to enable rapid, followed by increasingly deep conversations and filtering with potential collaborating trials to enable mutual understanding of likely success and benefit that does not waste time. we are also developing our mechanisms to prioritize consumer engagement and involvement, to integrate genv-generated evidence into living evidence reviews, and to progress a brief minimum trials dataset. despite its recognized value, data linkage remains challenging in almost all jurisdictions in australia and worldwide, and goalposts will no doubt continue to shift. trial ethical approval and participant consent for data sharing, sometimes years in the future, will be critical. genv is conducting a privacy impact assessment and data security audit, even while knowing that practice and legislation in both continue to evolve. victoria's health and educational systems vary by individual practice, hospital and school, and across private/public and regional/district lines, so genv's statewide remit may bring challenges in terms of developing standardized interventions. most trials require piloting; we are unsure as to whether pilots would be conducted within or outside the genv environment and the impact of any pilots themselves. we are uncertain as to the extent to which a trial's research team may share genv's infrastructure (it systems, data management practices) to perform trials alongside genv, which could have practical benefits to both but would require resourcing. there may be external constraints on exactly how genv can provide data back to trials, and the extent to which trials need to use the genv data analysis and visualisation environments because of constraints of data custodians. at this time, there seems to be no evidence for an upper limit of trials for a single cohort or a single participant, but their possibly interacting effects may be challenging to tease out. we do not at this time propose any limit other than participant willingness to consent, a 'reasonableness' lens and commitment to the genv principles. 
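to illustrate how the interacting effects of "stacked" interventions might be teased out on a shared outcome metric, the sketch below fits a factorial-style regression with an interaction term to simulated data from two overlapping trials; the effect sizes, sample size and outcome scale are invented for illustration and imply nothing about real genv trials.

```python
# Sketch of examining interacting effects of two "stacked" interventions with a
# factorial-style regression (outcome ~ A + B + A:B) on a shared outcome metric.
# The simulated effect sizes are assumptions, not data from any trial.
import numpy as np

rng = np.random.default_rng(1)
n = 4000
a = rng.integers(0, 2, n)            # assignment in trial A (0/1)
b = rng.integers(0, 2, n)            # assignment in trial B (0/1)
outcome = 50 + 2.0 * a + 1.5 * b + 1.0 * a * b + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), a, b, a * b])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
for name, c in zip(["intercept", "effect of A", "effect of B", "A x B interaction"], coef):
    print(f"{name}: {c:.2f}")        # estimates should recover roughly 50, 2.0, 1.5, 1.0
```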
while the inclusion of low-burden universal outcome measures makes participation in multiple trials possible both from the human and costs perspectives, genv will need a way to monitor and prevent participant fatigue and not to overburden/compromise either genv or the participating trials. there may be occasions where participation in one trial precludes participation in another, which will need to be considered on a caseby-case basis. research ethics committee decision-making in relation to an efficient neonatal trial what influences recruitment to randomised controlled trials? a review of trials funded by two uk funding agencies rethinking pragmatic randomised controlled trials: introducing the "cohort multiple randomised controlled trial" design the lancet editorial. paediatric research should take centre stage a descriptive analysis of child-relevant systematic reviews in the cochrane database of systematic reviews children are not just small adults: the urgent need for high-quality trial evidence in children born in bradford's better start: an experimental birth cohort study to evaluate the impact of early life interventions the imperative of overcoming barriers to the conduct of large neonatal brain injuries in england: population-based incidence derived from routinely recorded clinical data held in the national neonatal research database master protocols to study multiple therapies, multiple diseases, or both systematic review of basket trials, umbrella trials, and platform trials: a landscape analysis of master protocols pre-emptive medicine: public health aspects of developmental origins of health and disease a new design for randomized clinical trials development and internal validation of a clinical risk score to predict pain response after palliative radiation therapy in patients with bone metastases benefits and challenges of using the cohort multiple randomised controlled trial design for testing an intervention for depression cohort randomised controlled trial of a multifaceted podiatry intervention for the prevention of falls in older people (the reform trial) spirit statement: defining standard protocol items for clinical trials generation victoria figshare project victoria records highest population rise of all states and territories accessed census quickstats accessed parent-reported prevalence and persistence of common child health conditions generation victoria (genv) cohort s synopsis genv rapid evidence assessment report: large research-led partnerships. in: generation victoria figshare project figshare living systematic review: . introduction-the why, what, when, and how living systematic reviews: . combining human and machine effort living systematic reviews: . living guideline recommendations the evidence based-medicine working group protocol for the development of a consort extension for rcts using cohorts and routinely collected health data utilities and quality-adjusted life years understanding dalys measuring the burden of disease: disability adjusted life year (daly) toward icd- : improving the clinical utility of who's international classification of mental disorders staff who: international classification of functioning, disability and health: icf: world health organization methodology in core outcome sets. 
color dis the victorian child's lifecourse journey in data accessed generation victoria: generation victoria official website accessed the development of the older persons and informal caregivers survey minimum dataset (topics-mds): a large-scale data sharing initiative the fair guiding principles for scientific data management and stewardship five safes: designing data access for research statement: updated guidelines for reporting parallel group randomised trials better reporting of interventions: template for intervention description and replication (tidier) checklist and guide editorial perspective: stop describing and start fixing -the promise of longitudinal interventioncohorts healthy life trajectories initiative (helti) potential of 'stacking' early childhood interventions to reduce inequities in learning outcomes is stacking intervention components cost-effective? an analysis of the incredible years program the warwick hip trauma evaluation -an abridged protocol for the white study: a multiple embedded randomised controlled trial cohort study review of an innovative approach to practical trials: the 'cohort multiple rct' design core outcomes in ventilation trials (covent): protocol for a core outcome set using a delphi survey with a nested randomised trial and observational cohort study publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations we thank all members of genv and its working groups who have reviewed the paper and contributed to developing processes. as genv's scientific director, mw is the sponsor-investigator responsible for its overall design, including enabling it for trials. mw and ad conceived this paper. mw, yjh and hw drafted the first manuscript. md, mf, fo, mp, kpp, rs and ad each provided critical review and contributed to its contents. all authors reviewed the manuscript and had final approval of the submitted and published version of this paper. the authors comprise the current genv trials working group. this statement of intent outlines how the large genv cohort can serve as a platform to increase the number, speed, range and duration of trials for parents and children. much remains to be worked out. however, its innovative design could guide best practice for these groups which currently lack robust generalizable evidence, and whose good health and wellbeing are so vital to the functioning of populations going forward. supplementary information accompanies this paper at https://doi.org/ . /s - - -x. availability of data and materials no data are yet available. it is intended in the future that genv data analyses will be supported for all researchers meeting governance requirements such as the five safes principles. a range of materials are available at genv's figshare project [https://mcri.figshare.com/projects/generation_victoria/ ].ethics approval and consent to participate genv is endorsed by the royal children's hospital human research ethics committee (hrec . ). parents will provide written informed consent. all trials conducted within/alongside genv will have appropriate ethical approval. not applicable (no participants enrolled). the authors declare that they have no competing interests. key: cord- -ko mzz authors: asri, hiba; mousannif, hajar; al moatassime, hassan; zahir, jihad title: big data and reality mining in healthcare: promise and potential date: - - journal: image and signal processing doi: . 
/ - - - - _ sha: doc_id: cord_uid: ko mzz nowadays individuals are creating a huge amount of data; with a cell phone in every pocket, a laptop in every bag and wearable sensors everywhere, the fruits of the information age are easy to see, but less noticeable is the information itself. this data could be particularly useful in making people's lives healthier and easier, by contributing not only to the understanding of new diseases and therapies but also to predicting outcomes at earlier stages and making real-time decisions. in this paper, we explain the potential benefits of big data to healthcare and explore how it improves treatment and empowers patients, providers and researchers. we also describe the capabilities of reality mining in terms of individual health, social network mapping, behavior patterns and treatment monitoring. we illustrate the benefits of reality mining analytics that help promote patients' health, enhance medicine, reduce cost and improve healthcare value and quality. furthermore, we highlight some challenges that big data analytics faces in healthcare. individuals are creating torrents of data, far exceeding the market's current ability to create value from it all. a significant catalyst of all this data creation is hyper-specific sensors and smart connected objects (iot) showing up in everything from clothing (wearable devices) to interactive billboards. sensors are capturing data at an incredible pace. in the specific context of healthcare, the volume of worldwide healthcare data is expected to grow up to , petabytes by [ ] . some flows have already generated , petabytes of data, and zbs are expected by from different sources, such as electronic healthcare records (ehr), electronic medical records (emr), personal health records (phr), mobilized health records (mhr), and mobile monitors. moreover, health industries are investing billions of dollars in cloud computing [ ] . thanks to the use of ehr and emr, an estimated % of hospitals will integrate analytics solutions, such as hadoop, hadapt, cloudera, karmasphere, mapr, neo, and datastax, for health big data [ ] . this paper gives insight on the challenges and opportunities related to analyzing larger amounts of health data and creating value from it, the capability of reality mining in predicting outcomes and saving lives, and the big data tools needed for analysis and processing. throughout this paper, we will show how health big data can be leveraged to detect diseases more quickly, make the right decisions and make people's lives safer and easier. the remainder of this paper is organized as follows:
• section presents the context of reality mining in healthcare systems.
• in sect. , we discuss the capabilities of reality mining in terms of individual health, social network mapping, behavior patterns and treatment monitoring.
• section highlights the benefits of reality mining to patients, researchers and providers.
• section enumerates the advantages and challenges of big data analytics in the healthcare industry.
• conclusions and directions for future work are given in sect. .
reality mining is about using big data to study our behavior through mobile phones and wearable sensors [ ] . in fact, every day we perform many tasks that are almost routines. cataloguing and collecting information about individuals helps to better understand people's habits. using machine learning methods and statistical analytics, reality mining can now give a general picture of our individual and collective lives [ ] .
new mobile phones are equipped with different kinds of sensors (e.g. motion sensors, location sensors, and haptic sensors). every time a person uses his/her cellphone, some information is collected. the chief technology officer of emc corporation, a big storage company, estimates that the amount of personal sensor information storage will balance from % to % in the next decade [ ] . the most powerful way to know about a person's behavior is to combine the use of some software, with data from phone and from other sources such as: sleep monitor, microphone, accelerometers, camera, bluetooth, visited website, emails and location [ ] . to get a picture of how reality mining can improve healthcare system, here are some examples: -using special sensors in mobiles, such as accelerometers or microphone, some diagnosis data can be extracted. in fact, from the way a person talks in conversations, it is possible to detect variations in mood and eventually detect depression. -supervising a mobile's motion sensors can contribute to recognize some changes in gait, and could be an indicator of an early stage of parkinson's disease. -using both healthcare sensor data (sleep quality, pressure, heart rate…) and mobile phone data (age, number of previous miscarriage…) to make an early prediction of miscarriage [ ] . communication logs or surveys recorded by mobiles or computers give just a part of the picture of a person's life and behavior. biometric sensors can go further to track blood pressure, pulse, skin conductivity, heartbeats, brain, or sleep activity. a sign of depression, for instance, can be revealed just by using motion sensors that monitor changes in the nervous system (brain activity) [ ] . currently, the most important source of reality mining is mobile phones. every time we use our smart phone, we create information. mobile phone and new technologies are now recording everything about the physical activity of the person. while these data threat the individual privacy, they also offer a great potential to both communities and individuals. authors in [ , ] assert that the autonomic nervous brain system is responsible of the change of our activity levels that can be measured using motion sensors or by audio; it has been successfully used to screen depression from speech analysis software in mobile phone. authors in [ ] assert that mobile phones can be used to scale time-coupling movement and speech of the person, which is an indication of a high probability of problems in language development. unaware mimicry between people (e.g., posture changes) can be captured using sensors. it is considered as trustworthy indicators of empathy and trusts; and manipulated to strongly enhance compliance. this unconscious mimicry is highly mediated by mirror neurons [ ] . authors in [ ] show that several sensors can also detect and measure fluidity and consistency of person's speech or movement. these brain function measurements remain good predictors of human behaviors. hence, this strong relationship helps for diagnosis of both neurology and psychiatry. besides data from individual health, mobile phones have the capability to capture information about social relationship and social networks. one of the most relevant applications of reality mining is the automatic mapping social network [ ] . through mobile phone we can detect user's location, user's calls, user's sms, who else is nearby and how is moving thanks to accelerometers integrated in new cell phone. 
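the social-network-mapping idea above can be made concrete with a small sketch. the snippet below is a hypothetical illustration (not taken from the cited studies): it builds an undirected contact graph from co-location or bluetooth-proximity events of the form (person_a, person_b, timestamp), so that the contacts of any individual can be listed afterwards.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Tuple

# A proximity event: two device/person identifiers and a unix timestamp.
ProximityEvent = Tuple[Hashable, Hashable, float]

def build_contact_graph(events: Iterable[ProximityEvent]) -> dict:
    """Build an undirected contact graph: person -> {contact: [timestamps]}."""
    graph = defaultdict(lambda: defaultdict(list))
    for a, b, ts in events:
        if a == b:
            continue  # ignore self-detections
        graph[a][b].append(ts)
        graph[b][a].append(ts)
    return graph

# Example usage with made-up identifiers and timestamps.
events = [("alice", "bob", 1_600_000_000),
          ("bob", "carol", 1_600_000_360),
          ("alice", "bob", 1_600_003_600)]
graph = build_contact_graph(events)
print(sorted(graph["bob"]))        # ['alice', 'carol']
print(len(graph["alice"]["bob"]))  # 2 recorded encounters
```

such a graph is deliberately simple; real deployments would add time windows, signal-strength filtering and privacy-preserving identifiers.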
authors in [ ] describe three type of self-reported: self-reported reciprocal friends when both persons report to other as friend, self-reported non-reciprocal friends when one of both reports to other as a friend and self-reported reciprocal non-friend when no one reports to other as a friend. this information has been shown to be very useful for identifying several important patterns. another study use pregnant woman's mobile phone health data like user's activity, user's sleep quality, user's location, user's age, user's body mass index (bmi)among others, considered as risk factors of miscarriage, in order to make an early prediction of miscarriage and react as earlier as possible to prevent it. pregnant woman can track her state of pregnancy through a mobile phone application that authors developed [ ] . another good example of network mapping is the computer game named dia-betnet. it is a game intended for young diabetics to help them keep track of their food quality, activity level and blood sugar rates [ ] . behavior pattern involves how a person lives, works, eats etc., not a place, age or residence. reality mining has the potential to identify behaviors with the use of classification methods that are very useful to predict health patterns [ ] . understanding and combining different behavior patterns of different populations is critical since every subpopulation has its own attitudes and profiles about their health choices. google flu trends represents a good example to model health of a large population. just by tracking terms typed in world wide web (www) searches and identifying the frequency for words related to influenza as illnesses, an influenza outbreak is detected indirectly. in the u.s., google searches prove an intense correlation between those frequencies and the incidence of estimated influenza based on cases detected by centers of disease control and prevention (cdc). also, with gps and other technologies, it is easily to track the activities and movements of the person. location's logs present a valuable source to public health in case of infectious diseases such as tuberculosis, anthrax, and legionnaires disease. in fact, location logs can help in identifying the source of infections and government may react for preventing further transmission. once a patient takes his treatment, which is pharmaceutical, behavioral, or other, doctors and clinicians have to monitor their patient's response to treatment. reality mining information used for diagnosis can be also relevant data for monitoring patient response to treatment and for comparing. characteristics such as behavior, activity and mobility could be collected in real-time and they could be useful for clinicians to change or adjust treatment depending on patient's response, and in some cases it could be a more effective care with a lower cost. a concrete example of this is parkinson's patients. real-time data are gathered through wearable accelerometers that integrate machine learning algorithms to classify movements' states of parkinson's patients and to get the development of those movements. two significant studies exist in literature to classify dyskinesia, hypokinesia and bradykinesia (slow movements) for seven patients [ ] . data are collected using different sources: wearable accelerometers, videos and clinical observations. results of studies show that bradykinesia and hypokinesia are the two main classes identified with a high accuracy. 
another classification is made to classify patients who feel off or about having dyskinesia. by combining big data and reality mining, we rang from single to large hospital network. the main benefits can be summarized into detecting diseases at earlier stages, detecting healthcare abuse and fraud faster, and reducing costs. in fact, big data market also contributes up to % of the global gdp and reduces % of healthcare costs. big data analytics improve health care insights in many aspects: big data can help patients make the right decision in a timely manner. from patient data, analytics can be applied to identify individuals that need "proactive care" or need change in their lifestyle to avoid health condition degradation. a concrete example of this is the virginia health system carillion clinic project [ ] , which uses predictive models for early interventions. patients are also more open to giving away part of their privacy if this could save their lives or other people's lives. "if having more information about me enables me to live longer and be healthier", said marc rotenberg of the electronic privacy information center," then i don't think i'm going to object. if having more information about me means my insurance rates go up, or that i'm excluded from coverage, or other things that might adversely affect me, then i may have a good reason to be concerned" [ ] . collecting different data from different sources can help improving research about new diseases and therapies [ ] . r & d contribute to new algorithms and tools, such as the algorithms by google, facebook, and twitter that define what we find about our health system. google, for instance, has applied algorithms of data mining and machine learning to detect influenza epidemics through search queries [ , ] . r & d can also enhance predictive models to produce more devices and treatment for the market. providers may recognize high risk population and act appropriately (i.e. propose preventive acts). therefore, they can enhance patient experience. moreover, approximately % of us hospitals are members in local or regional health-information exchanges (hies) or try to be in the future. these developments give the power to access a large array of information. for example, the hie in indiana connects currently hospitals and possess information of more than ten million patients [ ] (fig. ). although big data analytics enhance the healthcare industry, there are some limitations to the use of big data in this area: . the source of data from organizations (hospital, pharmacies, companies, medical centers…) is in different formats. these organizations have data in different systems and settings. to use this huge amount of data, those organizations must have a common data warehouse in order to get homogeneous information and be able to manage it. however, having such systems requires huge costs. . quality of data is a serious limitation. data collected are, in some cases, unstructured, dirty, and non-standardized. so, the industry has to apply additional effort to transform information into usable and meaningful data. . a big investment is required for companies to acquire staff (data scientists), resources and also to buy data analytics technologies. in addition, companies must convince medical organizations about using big data analytics. . using data mining and big data analytics requires a high level of expertise and knowledge. it is a costly affair for companies to hire such persons. . 
due to serious constraints regarding the quality of collected data, variations and errors in the results are not excluded. . data security is in big concern and researchers paid more attention on how we can secure all data generated and transmitted. security problems include personal privacy protection, financial transactions, intellectual property protection and data protection. in some developing and developing countries, they propose laws related to data protection to enhance the security. so, researchers are asked to carefully consider where they store and analyze data to not be against the regulations. big data is being actively used in healthcare industry to change the way that decisions are made; and including predictive analytics tools, have the potential to change healthcare system from reporting to predicting results at earlier stages. also reality mining becomes more common in medicine research project due to the increasing sophistication of mobile phones and healthcare wearable sensors. many mobile phones and sensors are collecting a huge number of information about their user and this will only increase. the use of both big data and reality mining in healthcare industry has the capability to provide new opportunities with respect to patients, treatment monitoring, healthcare service and diagnosis. in this survey paper, we discussed capabilities of reality mining in healthcare system including individual health, social network mapping, behavior patterns and public health service, and treatment monitoring; and how patient, providers, researchers and developers benefit from reality mining to enhance medicine. we highlight as well several challenges in sect. that must be addressed in future works. adoption of big data and reality mining in healthcare raises many security and patient privacy concerns that need to be addressed. big data in healthcare hype and hope healthcare administration big data analytics in healthcare: promise and potential reality mining: sensing complex social systems inferring friendship network structure by using mobile phone data real-time miscarriage prediction with spark big data in healthcare: challenges and opportunities toward a social signaling framework: activity and emphasis in speech. doctoral dissertation acoustical properties of speech as indicators of depression and suicidal risk naturalizing language: human appraisal and (quasi) technology the role of mimicry in understanding the emotions of others improving the fluidity of whole word reading with a dynamic coordinated movement approach reality mining and predictive analytics for building smart applications you are what you eat: serious gaming for type diabetic persons, master's thesis using machine learning algorithms for breast cancer risk prediction and diagnosis use of wearable ambulatory monitor in the classification of movement states in parkinson's disease. doctoral dissertation ibm news room: ibm predictive analytics to detect patients at risk for heart failure -united states the promise and peril of big data harnessing big data for health care and research: are urologists ready? 
the parable of google flu: traps in big data analysis big data analytics in healthcare: case studymiscarriage prediction comprehensive miscarriage dataset for an early miscarriage prediction key: cord- -baheh i authors: benreguia, badreddine; moumen, hamouma; merzoug, mohammed amine title: tracking covid- by tracking infectious trajectories date: - - journal: nan doi: nan sha: doc_id: cord_uid: baheh i nowadays, the coronavirus pandemic has and is still causing large numbers of deaths and infected people. although governments all over the world have taken severe measurements to slow down the virus spreading (e.g., travel restrictions, suspending all sportive, social, and economic activities, quarantines, social distancing, etc.), a lot of persons have died and a lot more are still in danger. indeed, a recently conducted study~cite{ref } has reported that % of the confirmed infections in china were caused by undocumented patients who had no symptoms. in the same context, in numerous other countries, since coronavirus takes several days before the emergence of symptoms, it has also been reported that the known number of infections is not representative of the real number of infected people (the actual number is expected to be much higher). that is to say, asymptomatic patients are the main factor behind the large quick spreading of coronavirus and are also the major reason that caused governments to lose control over this critical situation. to contribute to remedying this global pandemic, in this paper, we propose an iot (internet of things) investigation system that was specifically designed to spot both undocumented patients and infectious places. the goal is to help the authorities to disinfect high-contamination sites and confine persons even if they have no apparent symptoms. the proposed system also allows determining all persons who had close contact with infected or suspected patients. consequently, rapid isolation of suspicious cases and more efficient control over any pandemic propagation can be achieved. in december , a novel virus has emerged in wuhan city. this disease, named coronavirus or covid- , has quickly spread throughout china and then to the entire world. as of may , more than countries and territories have been affected and over million people have been diagnosed with the virus. since no cure or vaccine has been found and since tests cannot be applied on a large scale to millions of persons, governments had and still have no choice but to take severe actions such as border closing, travel canceling, curfews, quarantines, and contact precautions (facemasks, social distancing, and self-isolation). the authorities have also implemented strategies that aim to rapidly detect infections using different cutting-edge medical tools (such as thermal cameras, blood tests, nasal swabs, etc.). on the one hand, these measurements, which for the time being constitute the only possible solution, have succeeded to slow down the contagion spreading, but on the other hand, they have also caused considerable economic damages, especially to countries with brittle economies (suspension of all sorts of activities: social, economic, educational, etc.). the rapid spreading of coronavirus is due to the continuous person-to-person transmission [ , , ] . in addition to this, a recent study has also suggested that a second factor is playing a major causal role in this high virus spreading; namely the stealth transmission [ ] . the coronavirus can take days before the appearance of symptoms. 
during this incubation period, asymptomatic patients, called undocumented patients, can infect large communities of people. in turn, these newly infected persons, who will remain unaware of their illness (until they eventually develop the symptoms), can also infect larger communities, thus leading to an uncontrollable domino effect [ , , ] . accordingly, to confine and effectively eliminate the coronavirus, it is crucial and mandatory to possess an efficient investigation system that allows determining (1) highly infectious places and (2) all the persons who were in contact with patients who have recently tested positive. indeed, persons who are known to be in direct relationship with a patient (such as family members, friends, and coworkers) can be easily determined and tested. however, numerous other persons could have also been in contact with this infected individual. all these persons, who cannot be easily determined, can contribute to the wide spread of the virus. to remedy this issue, in this paper, we propose an iot investigation system that can determine the exact trajectories of infected persons (exact coordinates labeled with time). hence, as shown in figure , the proposed solution allows determining areas of high contamination and also provides a quick detection mechanism of undocumented patients who are known to be the main and most important factor in the rapid spread of coronavirus. these persons (resp. places) can be tested/confined even if they had no blatant symptoms (resp. closed and properly disinfected). figure : infection tracking: example of historical trajectories taken by a covid- patient and all persons (resp. places) who he might have encountered/infected (resp. visited); trajectory intersections with the patient carry a high risk of infection during the presence of the patient and hours after he leaves. the investigation system presented in this paper is a proposition to governments and lawmakers. so, it needs to be approved and adopted only in the case of highly dangerous pandemics and public health emergencies. in summary, to properly operate, the proposed system must be continuously fed with the coordinates of persons who are in public crowded places (mainly using iot devices that can identify persons and report their locations). although this proposed solution can open large debates from the standpoint of privacy and human rights, during extreme critical situations in which the entire human race might be at stake, the urgency of saving lives becomes of higher priority. in such circumstances, the proposed technique can be quickly applied as a last resort by higher authorities (governments, who, etc.) while giving guarantees about (1) protecting the privacy of people, and (2) disabling this system once the outbreak is contained. the remainder of this paper is organized into five sections as follows. section describes and details the proposed system. section presents a small-scale implementation example of this proposal. section addresses the benefits that the proposed investigation system can bring to the state-of-the-art mathematical disease spreading/prediction models. section discusses the main advantages and shortcomings of the proposed solution. finally, section concludes the paper and provides some recommendations.
in the proposed system, a big-data architecture is considered to archive the continuously collected trajectories of persons. this archiving structure, which is inspired by the big data model proposed in [ ] , must (as previously mentioned) be fed by iot devices that can ( ) determine coordinates of persons during their outdoor activities and ( ) send the collected data to the system. we point out that persons staying at home or in their vehicles are assumed to be isolated (i.e., they cannot infect other persons nor they can be infected). consequently, they do not need to activate the coordinates collection process. on the contrary, to ensure their safety by ensuring the proper operation of the system, it is necessary for any person leaving his/her house to activate coordinates collection and save his/her tracks. as figure depicts, the proposed architecture has three main basic layers; namely, data collection, data storage, and data leveraging. in the remainder of this section, we describe each of these layers and provide their specific complementary tasks. as shown in figure , the data gathering process is divided into two parallel independent tasks. whereas the first one is related to collecting the necessary health data, the second is responsible for collecting the geolocalization coordinates of persons. reporting the coronavirus deaths and all the newly confirmed and recovered cases is a basic stepping stone in the overall investigation process. for instance, once a new case has been discovered, the latter must be immediately reported to the automatic investigation system to allow it to find all the individuals who could have been infected by this person. in the same context, given the fact that patients are the main factor in the system, once some of them have recovered from the virus, their statuses must be immediately changed from active to recovered, and so forth. on the contrary to the second task (i.e., geolocalization data gathering) in which data is automatically captured and reported by the employed iot devices, in this first task, data must be manually entered by health authorities. for example, hospitals can be responsible for providing the newly infected cases, and all the other indispensable updates. different approaches can be considered to collect the coordinates of persons who left their houses. for example, ( ) relying on telecom and it infrastructures, ( ) equipping public buildings with appropriate devices, or ( ) utilizing dedicated tracking apparatuses. • telecom/high-tech companies: using telecommunication technologies, smartphones can be easily instructed (programmed) to periodically report their actual location to a dedicated storage system (phone-number, gps-coordinates, time). • public buildings: by installing face recognition cameras on the entries/exits of public buildings, the identity of visitors can be easily determined and then sent to the designated infrastructure (person-id, building-id, entry-time, exit-time). this way, based on the information provided by the different public buildings and facilities, the tracks of infected people can be easily determined. the following lines provide an example of places that were visited by an infected person p : (p , building, (entry, t ), (exit, t )), (p , shopping mall, (entry, t ), (exit, t )), (.., .., (.., ..), (.., ..)), (p , airport a, (entry, tm), (exit to plane, tn)), (p , airport b, (entry from plane, tx), (exit, ty)), (.., .., (.., ..), (.., ..). 
the main drawback of this second data collection approach is that it cannot be easily implemented in most countries due to the lack of appropriate underlying identification/recognition systems. however, this solution can be utilized in the case where authorities fail to convince people of using their cellphones as tracking devices. • specific solutions and electronic devices: the two aforementioned approaches can be adopted independently or in combination. actually, during dangerous outbreaks where rapid tracking of infected people becomes an urgency, in addition to personal smartphones and public indoor (entry/exit) cameras, governments can consider using other tracking techniques like for instance electronic bracelets, public outdoor security cameras, drones, or even satellites. furthermore, to consolidate the previous two approaches, solutions such as bluetooth or nfc can also be considered [ , , , , ] . in the same context, in the event of technical issues related to gps, governments can opt for alternative efficient geo-localization techniques [ , ] . as their name implies, the iot devices (responsible for collecting person tracks) report the required data using the internet (via wired or mobile wireless networks: / g, ...). so, as shown in figure , to avoid losing important information in the case of internet disconnection, each device must store locally the collected data. once reconnected, these iot devices can then send the gathered data to the system. finally, it is worth mentioning that to ensure the success of the whole virus tracking process, it companies, public institutions, and third-parties must contribute to collecting and reporting the required data. big data, which refers to enormous datasets, has five essential characteristics known as the vs: volume (data size), velocity (data generation frequency), variety (data diversity), veracity (data trustworthiness and quality), and finally value (information or knowledge extracted from data). to address the issues related to storing and processing huge data volumes, extensive research has been done and numerous efficient solutions have been proposed (data ingestion, online stream processing, batch offline processing, distributed file systems, clustering, etc.). these techniques provide both efficient distributed storage and fast processing. currently, research is more focused on big data leveraging (i.e., the last v which consists of turning data into something valuable using data analytics, machine learning, and other techniques) [ , ] . this point, which represents the main contribution of our work, will be detailed in the next subsection. as stated in [ ] , an ideal big data storage/processing infrastructure must adhere to the four following major requirements: ( ) time and space efficiency, ( ) scalability, ( ) robustness against failures and damages, and ( ) data security/privacy. for instance, regarding this last point, since the collected data in our context is personal and highly sensitive, strict laws and regulations must be imposed to protect it. the established privacy model must determine who can access data (governments, higher health authorities, who organization, etc.), under what constraints, for how long data can be kept, and to whom this data can be distributed, etc. moreover, in addition to data privacy, all conventional data security mechanisms must also be considered (confidentiality, integrity, and availability). 
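as a rough illustration of the store-and-forward behaviour described above, the sketch below buffers location records locally and flushes them once connectivity returns. the record layout (person id, coordinates, timestamp) follows the examples given in the text, while the class name, file layout and the send_to_server callback are assumptions made only for the sake of the example.

```python
import json
import time
from pathlib import Path

class LocationReporter:
    """Buffer (person_id, lat, lon, timestamp) records and flush them when online."""

    def __init__(self, person_id: str, buffer_path: str = "pending_records.jsonl"):
        self.person_id = person_id
        self.buffer_path = Path(buffer_path)

    def record(self, lat: float, lon: float) -> None:
        entry = {"person_id": self.person_id, "lat": lat, "lon": lon,
                 "time": time.time()}
        # Append locally first so nothing is lost if the network is down.
        with self.buffer_path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def flush(self, send_to_server) -> int:
        """Try to upload buffered records; keep those that fail for a later attempt."""
        if not self.buffer_path.exists():
            return 0
        lines = self.buffer_path.read_text().splitlines()
        remaining, sent = [], 0
        for line in lines:
            entry = json.loads(line)
            try:
                send_to_server(entry)   # e.g. an HTTPS POST to the ingestion layer
                sent += 1
            except OSError:
                remaining.append(line)  # network still down: keep for next attempt
        self.buffer_path.write_text("\n".join(remaining) + ("\n" if remaining else ""))
        return sent
```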
to ensure the proper operation of the system, data that is continuously coming from the various considered iot sources (such as sensors, smartphones, cameras, bracelets, etc.) must provide three main information: person-id, coordinates, and time ( table ). the received trajectories of persons are recorded as an immutable discrete set of points (coordinates) labeled with time. for example, according to table , the recorded traces of person p along with their corresponding time instants are: (c , t ), (c , t ), (c , t ), (c , t ), and (c , t ). note that in table , for any given person, time is a monotone function that must keep growing (it cannot go back). note also that more than one person can exist in the same coordinates at the same time. for example, p with p at (c , t ) and p with p at (c , t ). in the considered context, the frequency according to which data must be collected (i.e., the second v) has a direct significant impact on both the first and last vs (i.e., data size and data value). high-quality trajectories can be achieved by increasing the frequency of data collection. this, however, can considerably increase the volume of data. in contrast, low collection frequencies reduce the data size but yield low trajectory quality. so, a tradeoff between the quality of trajectories and the size of data must be defined. the main goal of the proposed system is to track the main sources of infection, whether they would be humans or places. but, in reality, the quality of the collected data can have a deep impact on the system's performance; it can contribute to its efficiency or deficiency. to prevent useless noisy data from harming or affecting the decisions of the investigation process, a filtering mechanism must be added ( figure ) . if it appears that the filtered data is useless, it can then be safely deleted. for example, data collected from highways and main roads is not of value. in this scenario, users are inside their respective cars and the risk of infection is nil (similarly to houses, people inside the same vehicle are considered as isolated). concretely speaking, the responsibility of data filtering can be left to authorities, which can decide which geographic zones or areas are of interest. in the proposed system, only two operations can be performed on the collected geolocalization data: storage and leveraging (figure ). this section presents our three proposed algorithms, which exploit the gathered data to ( ) find and classify suspected cases (i.e., persons who met with confirmed patients or other suspected cases), ( ) determine black areas (zones with high contamination probability), and ( ) find all persons who visited black areas (also considered as probably infected persons). for the convenience of the presentation, table compiles all the variables utilized by the different proposed algorithms presented respectively in the following three subsections. based on the stored data (both recorded trajectories of people and health information), the main goal is to deduce all possible infections. as previously mentioned, the proposed investigation system does not only find the suspected cases but also categorizes them into several disjoint subsets denoted suspected t,d ( figure and table ). the variable d, which stands for distance or degree, indicates how close/far is a person to a confirmed patient. the initial class of persons suspected t, represents the set of all patients (confirmed cases). 
if the set suspected_{t,1} is defined (suspected_{t,1} ≠ ∅), then this means that it contains all individuals who had direct contact with at least one confirmed patient p_c from suspected_{t,0} (∀ p_c ∈ suspected_{t,0}). as regards the elements (persons) of suspected_{t,2}, they had no direct intersection with a positive case from suspected_{t,0} (the set of confirmed cases), but they have certainly met at least one element from suspected_{t,1}. in general, for i >= 2, each element of suspected_{t,i} has certainly met at least one element from suspected_{t,(i−1)}, but none from suspected_{t,(i−2)}. to properly operate, the proposed approach must be given the initial set of patients suspected_{t,0} (suspects at distance 0). as shown by the algorithm of figure , at each iteration, based on the previously entered/calculated set suspected_{t,(i−1)}, the new set of suspected persons suspected_{t,i} is determined. note that a person is considered to be suspected if he has met a patient or another suspected person during the incubation period (line : (cd − t_i) ≤ ip). the algorithm stops when the previously entered/calculated set suspected_{t,(i−1)} is empty (which means that the next set of suspected persons suspected_{t,i} cannot be calculated).
the variables common to the proposed algorithms are the following: (p_i, c_i, t_i) is a recorded tuple giving the person identity, the coordinates, and the corresponding collection time; trajectory is the set of all trajectories; patient is the set of all confirmed patients, where a patient is defined as (p_i, t_i); aux is a boolean that indicates whether an end-user is a suspected case; s denotes the server side and c the client side; and k (algorithm of figure ) is an index used to explore all black areas.
to find and categorize suspected cases into disjoint subsets, the investigation system s performs the following steps: input: trajectory; patient; if (∃(person_i, c_i, t_i) ∈ trajectory) and (∃(person_j, c_i, t_i) ∈ trajectory) with ((person_i, t_i) ∉ suspected_{t,d−1}) and ((cd − t_i) ≤ ip) then suspected_{t,d} ← suspected_{t,d} ∪ {(person_j, t_i)}.
thus, after its execution, the proposed algorithm gives a classification (partition) of all individuals stored in the system: suspected_{t,0}, suspected_{t,1}, ..., suspected_{t,d} (the suspected_t set). the remaining unclassified persons are considered uninfected: they have not met with confirmed patients from suspected_{t,0} nor with suspected persons from suspected_{t,i} (with i > 0). when a user (client) sends a query to the investigation system (algorithm of figure ), he will receive the class to which he belongs (i.e., suspected_{t,0}, suspected_{t,1}, suspected_{t,2}, ... or the negative category). using the distance d (rather than the identity of persons) allows informing (reassuring or alerting) users about their current statuses without exposing the privacy of other users of the system. the distance d can be seen as a warning: the closer a user's distance is to zero, the higher the risk of infection, and vice versa (figure ).
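to make these procedures more tangible, here is a minimal python sketch of the server-side logic: the contact-based classification into suspected sets just described, plus the black-area detection and black-area-visitor listing discussed in the following paragraphs. it is an illustrative reconstruction under simplifying assumptions (trajectories as (person, cell, day) records, a fixed incubation window, and a simple visit-count threshold), not the authors' implementation; all names and parameter values are placeholders.

```python
from collections import defaultdict

def find_suspected_sets(trajectory, patients, current_day, incubation_days=14):
    """Classify people into suspected sets by contact distance d = 0, 1, 2, ...

    trajectory: iterable of (person, cell, day) records.
    patients:   set of confirmed patient identifiers (distance 0).
    Only encounters within the incubation window are considered.
    """
    # Index co-presence: (cell, day) -> set of persons seen there at that time.
    presence = defaultdict(set)
    for person, cell, day in trajectory:
        if current_day - day <= incubation_days:
            presence[(cell, day)].add(person)

    levels = [set(patients)]          # levels[d] = suspected set at distance d
    classified = set(patients)
    while levels[-1]:
        nxt = set()
        for group in presence.values():
            if group & levels[-1]:    # someone at distance d-1 was present
                nxt |= group - classified
        if not nxt:
            break
        levels.append(nxt)
        classified |= nxt
    return levels

def find_black_areas(trajectory, patients, alpha=3):
    """Return areas visited by more than `alpha` distinct confirmed patients."""
    visitors = defaultdict(set)
    for person, cell, _day in trajectory:
        if person in patients:
            visitors[cell].add(person)
    return {cell for cell, who in visitors.items() if len(who) > alpha}

def find_black_area_visitors(trajectory, black_areas):
    """List everyone whose trajectory passes through a black area."""
    exposed = defaultdict(set)
    for person, cell, day in trajectory:
        if cell in black_areas:
            exposed[cell].add((person, day))
    return exposed

# Tiny example with made-up data.
traj = [("p1", "mall", 10), ("p2", "mall", 10), ("p3", "park", 11),
        ("p2", "park", 11), ("p4", "mall", 12)]
levels = find_suspected_sets(traj, patients={"p1"}, current_day=12)
print(levels)   # [{'p1'}, {'p2'}, {'p3'}]; p4 remains unclassified
```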
finally, we point out that in the proposed system, geolocalization data is periodically collected every ∆t period. to avoid any synchronization issue that might occur during this process, time differences that are less than or equal to ∆t are ignored. for example, if a person (p_i, c_i, t_i) has the same coordinates as another person (p_j, c_j, t_j) and the difference between the collection instants (t_i and t_j) of these coordinates (c_i and c_j) is less than or equal to ∆t, then t_i and t_j will be seen as equal (t_i = t_j).
another major problem that must be tackled is the possibility of virus transmission without direct human contact. this can happen, for instance, when an undocumented patient visits a public place and leaves the virus there (he touches objects found in that area, he sneezes, coughs, etc.). in such a scenario, as long as the virus stays alive, the probability of contamination remains very high. the algorithm of figure demonstrates the process followed to find all black areas. the key idea consists of determining whether numerous patients have visited the same location. if so, then this zone is likely an infectious area of high viral transmission. more concretely, for each public location a_k ∈ a (where a is the predefined set of all public areas), the number of confirmed patients who have visited this area is determined using the count_k variable. if this number overtakes the predefined threshold α, the corresponding location a_k is then considered to be a black area. to find black zones, the investigation system s performs the following steps: input: trajectory; a; patient; if (∃(person_i, *) ∈ patient) and (∃(person_j, c_k, t_i) ∈ trajectory) then count_k ← count_k + 1.
as depicted in figure and the algorithm of figure , once black areas have been determined, all persons who have been in these infectious zones can be straightforwardly found. more specifically, based on the set of captured trajectories, the proposed algorithm finds each person p_i who has visited at least one black area ba_k. each of these found suspected people will be added to the corresponding suspect_{t,k} set (the list of all suspected persons who have frequented black area k at time t). to find all persons who visited black areas, the investigation system s performs the following steps: input: trajectory; ba; if ∃(person_i, *) ∈ suspect_{t,s} then aux ← true; k ← s else aux ← false endif. in other terms, after its execution, the algorithm depicted in figure gives a different classification (partition) of all individuals stored in the system: suspect_{t,1}, suspect_{t,2}, ..., suspect_{t,k} (the suspect_t set). the unclassified persons are considered uninfected (they have not visited any infectious area ba_k ∈ ba). accordingly, the investigation system can be queried about black areas and the people who frequented them.
this section describes the small-scale version of our proposed system that we have implemented. the main purpose of this demonstration is to show the applicability of the ideas discussed earlier in the paper. in brief, the implemented system has three essential parts (figure ): (1) a mobile phone application for end-users, (2) the automatic investigation mechanism (described in sections . and . ), and (3) an interface dedicated to the government and health authorities. the developed mobile application collects the required geo-localization information by sending the gps longitude and latitude coordinates along with the corresponding time to the investigation system. this operation is repeated periodically every x unit of time. to avoid redundancies (useless data), during this defined period (i.e., x time unit), when a user remains in the same location, his coordinates will not be reported to the system nor stored in the phone's local database. similarly, if a user moves a distance of y meters or less, he will be considered as stationary and his coordinates will not be collected.
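a small sketch of the sampling rule just described: report a position only if the sampling period has elapsed and the user has moved more than a configurable distance threshold. the haversine formula is used here as one common way to compute the distance between two gps fixes; the period and distance values are placeholders standing in for the x and y parameters mentioned above.

```python
import math
import time

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

class Sampler:
    """Decide whether a new GPS fix should be reported to the server."""

    def __init__(self, period_s=300, min_move_m=50):
        self.period_s = period_s        # the "x" time unit in the text (placeholder)
        self.min_move_m = min_move_m    # the "y" meters threshold in the text (placeholder)
        self.last_fix = None            # (lat, lon, timestamp) of the last reported fix

    def should_report(self, lat, lon, now=None):
        now = time.time() if now is None else now
        if self.last_fix is None:
            self.last_fix = (lat, lon, now)
            return True
        plat, plon, pts = self.last_fix
        if now - pts < self.period_s:
            return False                # sampling period not yet elapsed
        if haversine_m(plat, plon, lat, lon) <= self.min_move_m:
            return False                # user considered stationary
        self.last_fix = (lat, lon, now)
        return True
```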
also, as previously mentioned, in the event of internet disconnection, each smartphone will store locally the gathered data and wait for reconnection. the application can offer numerous services to the end-users. for instance, users can review their historical trajectories during any possible correct period that they can define ( figure ) . a second most important service that can also be offered is allowing end-users to know if they have met infected people. the investigation system is continuously fed with both the locations of users and health data (e.g., confirmed patients, closed cases, etc.). figure provides an example of historical trajectories taken by a single user of the system. as previously stated in the paper, the collected locations of persons are stored as immutable append-only data (since there is no need to change the gathered tracks, no updates are allowed). due to the continuous periodic data gathering, an important storing space is needed. however, for demonstration, in this small-scale evaluation version, a single pc was considered to store both health and localization data. numerous functionalities can be offered to the government, health authorities, and other entities that are responsible for monitoring, evaluating, and controlling the outbreak. the most important functionality is the ability to search for novel suspected cases. to do so, the authorities must first provide the investigation system with the necessary new coronavirus cases, deaths, and recovered patients. upon discovering and entering a new infected case, the proposed system can find all people who had contact with this patient ( figure ). as previously explained, the proposed system classifies the newfound suspected people into disjoint subsets. for example, if we assume that the newly discovered patient p i belongs to subset s , then, all persons who had direct contact with him will belong to s . and, all persons who had direct contact with persons from s will belong to s , and so forth. this way, categories of high and low risk can be easily determined. once determined, all suspected cases will be ( ) stored in the system and ( ) informed to immediately start self-isolation at home. the authorities can also apply several techniques to determine if suspected persons are violating the imposed quarantines. the investigation system must be designed so that ( ) it helps the authorities in their endeavor of fighting coronavirus, and ( ) it protects the privacy of users. the latter goal can be achieved through the designed gui and offered functionalities, which must be defined based on a well-established privacy model. mathematical modeling of infectious diseases is a powerful tool that allows analyzing, understanding, and predicting the behavior of pandemics [ , , ] . for instance, these models can be utilized to help authorities set up the best strategies for successful pandemics control. in the following three subsections, we will briefly show how different mathematical prediction models can benefit from the proposed investigation system. more exactly, how can the two generated lists/sets (suspected t and suspect t ) of ( ) suspected/infected persons and ( ) foci of infection (black areas) help attain a more accurate realistic estimate of disease spreading rate. more recently, a new mathematical model, called θ-seihrd , has been specifically proposed for the coronavirus disease [ ] . this model assumes that the pandemic spatial distribution inside a territory is omitted. 
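the next section relates the tracked contacts to classical compartmental spreading models. as a hedged illustration of the kind of model that the estimated contact information could parameterize, the following sketch integrates the standard sir equations with a simple forward-euler step; the β and γ values are placeholders, not estimates produced by the proposed system.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Forward-Euler integration of the SIR equations.

    dS/dt = -beta*S*I/N,  dI/dt = beta*S*I/N - gamma*I,  dR/dt = gamma*I
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    steps = int(days / dt)
    for k in range(1, steps + 1):
        new_inf = beta * s * i / n * dt
        new_rec = gamma * i * dt
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
        history.append((k * dt, s, i, r))
    return history

# Placeholder parameters: beta ~ contact rate x transmission probability, gamma ~ 1/infectious period.
trace = simulate_sir(s0=9_990, i0=10, r0=0, beta=0.3, gamma=0.1, days=160)
peak_day, _, peak_i, _ = max(trace, key=lambda row: row[2])
print(f"epidemic peak: ~{peak_i:.0f} infectious around day {peak_day:.0f}")
```

in such a model, better knowledge of who met whom (the suspected and suspect sets above) is what allows β to be estimated from observed contacts rather than guessed.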
θ-seihrd also assumes that people in a territory are characterized to be in one of the following nine compartments: s (susceptible), e (exposed), i ( infectious), iu (infectious but undetected), hr (hospitalized or in quarantine at home), hd (hospitalized that will die), rd (recovered after being previously detected as infectious), ru (recovered after being previously infectious but undetected) or finally d (dead by . the most important parameters of this model that determine the pandemic spreading are: β hd . these parameters represent the disease contact rates (day − ) of a person in both the corresponding compartments (e, i, iu, ...) and territory (i) (without taking into account the control measurements). indeed, it is very hard to determine these contacts rate without having an idea about who contacted whom. also, the contacts between persons are not sufficient to estimate the coronavirus contact rates. this disease can be transmitted without direct contact with patients (e.g., touching an object previously used by an infected person). the determination of both sets suspected t and suspect t helps to accurately estimate the contacts rate and to determine the set of infectious but undetected (iu) persons. in addition to this, our proposed investigation system can also help to determine the pandemic spatial distribution inside any monitored territory. as it is well-known, it is very important to understand diseases spreading, determine the safe and risky areas, and take control measurements with insight. sir is one of the most studied mathematical models for the spreading of infectious diseases [ , , ] . in this model, the most important parameters that determine the pandemic spreading are α, γ, and β. where α is the probability of becoming infected, γ is the number of infected people (new infections are the result of contact between infectives and susceptibles), and β is the average number of transmissions from an infected person (determined by the chance of contact and the probability of disease transmission). the major difficulty or drawback of this model is estimating these parameters. the answer to this question is our proposed investigation system. seir is also one of the most important mathematical models for the spreading of infectious diseases [ ] . in this model, the main parameter that determines the pandemic spreading is the contact rate β(t), which is the average number of susceptibles in a given population contacted per infective per unit time. the main difference between seir and sir lies in the addition of a latency period. more details can be found in [ , , , ] . endowing the seir model with accurate contact information can help it obtain better results about the spreading (prediction) of infectious diseases. as previously highlighted, this is exactly where our proposed solution comes in handy. it can help to understand and estimate important parameters for numerous mathematical models proposed for infectious diseases. in this section, we provide the main advantages and disadvantages of the proposed investigation system. first, among the numerous interesting features of this iot system we mainly cite: • this system allows tracking the trajectories of infected persons, and this, days before the appearance of their symptoms. the system also allows determining all persons who were in close contact with infected patients. thus, the former can be immediately confined and the proper measurements can be rapidly taken by health authorities. 
since all persons who have met infected patients can be identified earlier (i.e., parents, friends, colleagues, and most importantly, those who cannot be easily identified using conventional techniques), the proposed tracking system allows an early outbreak control. • the proposed system can be utilized to identify black zones, which can be the main source of virusspreading. as previously explained, a given location is said to be a source of contamination if numerous infected persons have visited it (the intersection zones of high infections can be identified based on the trajectories of infected persons). • according to [ ] , the coronavirus can live for several hours without a host. to remedy this, the proposed system can be configured to find all uninfected persons who have visited locations that were previously visited by other infected persons (and this, hours after the patients have left this location). • the proposed system can help reduce the economic damages generated by the suspension of all activities. instead of shutting down all sports, social and economic activities, the authorities can impose confinement on only a few people. • in addition to coronavirus, the proposed solution can be utilized for any potential outbreak which might be more powerful and have a higher spreading pace. despite its advantages, the proposed solution suffers from some issues that must be adequately tackled: • for instance, if certain persons do not respect the safety instructions (e.g., they did not intentionally or unintentionally save their trajectories when leaving their houses or cars), it will be impossible to determine whether they have met infected persons. the absence of one or several trajectories does not mean the total failure of the proposed investigation system. indeed, the efficiency of this system is proportional to its users. if x% of the population has been involved, then approximately x% of the persons who had close contact with patients will be determined. • the proposed system can be empowered with extra functionalities such as machine (deep) learning abilities. moreover, the system can be utilized to determine whether citizens are respecting social distancing and also to check whether the persons who are suspected to be infected are respecting the imposed confinement. in this work, we proposed a system that can quickly determine all persons who are suspected to be infected by the coronavirus. while being specific to this virus, the proposed solution can also be applied to control other pandemics. the objective (as specified in the paper) is not tracking people but tracking extremely dangerous viruses. to ensure a practical and more efficient application of this proposal, the following points must be taken into account: • system utilization: to keep the public health situation under control, it is recommended to launch such an automatic investigation process as early as dangerous rapid infections start taking place. in the case where the outbreak achieves large scales, the operation of detecting, tracking, and surveilling a huge number of infected people might become useless. • system efficiency: for a more efficient system, we recommend exploiting all possible existing resources that can enhance the quality of the collected data (cellphones, security camera footage, credit card records, etc.). • data privacy: finally, to ensure the privacy and security of the collected data, which is personal and highly sensitive, an authority of trust must manage this whole process. 
substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov )
clinical characteristics and imaging manifestations of the novel coronavirus disease (covid- ): a multi-center study in wenzhou city
first known person-to-person transmission of severe acute respiratory syndrome coronavirus (sars-cov- ) in the usa
patients with respiratory symptoms are at greater risk of covid- transmission
transmission of covid- in the terminal stage of incubation period: a familial cluster
'stealth transmission' fuels fast spread of coronavirus outbreak
decentralized human trajectories tracking using hodge decomposition in sensor networks
cts: a cellular-based trajectory tracking system with gps-level accuracy
smartits: smartphone-based identification and tracking using seamless indoor-outdoor localization
hermes: pedestrian real-time offline positioning and moving trajectory tracking system based on mems sensors
urska demsar & a. stewart fotheringham, analysis of human mobility patterns from gps trajectories and contextual information
a big data architecture for automotive applications: psa group deployment experience
big automotive data: leveraging large volumes of data for knowledge-driven product development
the mathematics of infectious diseases
mathematical modeling of the spread of the coronavirus disease (covid- ) taking into account the undetected infections. the case of china. communications in nonlinear science and numerical simulation
a contribution to the mathematical theory of epidemics
mathematical modeling in infectious disease. clinical microbiology and infection - european society of clinical microbiology and infectious diseases
the sir model of infectious diseases
seasonality and period-doubling bifurcations in an epidemic model
a seir model for control of infectious diseases with constraints
optimal control of a seir model with mixed constraints and l cost
mathematical modeling of biological processes
key: cord- -w sb h authors: schumacher, garrett j.; sawaya, sterling; nelson, demetrius; hansen, aaron j. title: genetic information insecurity as state of the art date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: w sb h genetic information is being generated at an increasingly rapid pace, offering advances in science and medicine that are paralleled only by the threats and risk present within the responsible ecosystem. human genetic information is identifiable and contains sensitive information, but genetic data security is only recently gaining attention. genetic data is generated in an evolving and distributed cyber-physical ecosystem, with multiple systems that handle data and multiple partners that utilize the data. this paper defines security classifications of genetic information and discusses the threats, vulnerabilities, and risk found throughout the entire genetic information ecosystem. laboratory security was found to be especially challenging, primarily due to devices and protocols that were not designed with security in mind. likewise, other industry standards and best practices threaten the security of the ecosystem. a breach or exposure anywhere in the ecosystem can compromise sensitive information. extensive development will be required to realize the potential of this emerging field while protecting the bioeconomy and all of its stakeholders.
genetic information contained in nucleic acids, such as deoxyribonucleic acid (dna), has become ubiquitous in society, enabled primarily by rapid biotechnological development and drastic decreases in dna sequencing and dna synthesis costs (berger and schneck, ; naveed et al., ) . innovation in these industries has far outpaced regulatory capacity and remained somewhat isolated from the information security and privacy domains. a single human whole genome sequence can cost hundreds to thousands of dollars per sample, and when amassed genetic information can be worth millions , . this positions genetic information systems as likely targets for cyber and physical attacks. human genetic information is identifiable lowrence and collins, ) and also contains sensitive health information; yet it is not always defined in these capacities by law. unlike most other forms of data, it is immutable, remaining with an individual for their entire life. sensitive human genetic data necessitates protection for the sake of individuals, their relatives, and ethnic groups; genetic information in general must be protected to prevent national and global threats (sawaya et al., ) . therefore, human genetic information is a uniquely confidential form of data that requires increased security controls and scrutiny. furthermore, non-human biological sources of genetic data are also sensitive. for example, microbial genetic data can be used to create designer microbes with crispr-cas and other synthetic biology techniques (werner, ) , presenting global and national security concerns. several genomics stakeholders have reported security incidents according to news sources , , , and breach notifications , , , , , . the most common reasons were misconfigurations of cloud security settings and email phishing attacks, and one resulted from a stolen personal computer containing sensitive information . the national health service's genomics england database in the united kingdom has been targeted by malicious nation-state actors , and andme's chief security officer said their database of around ten million individuals is of extreme value and therefore "certainly is of interest to nation states" . despite this recognition, proper measures to protect genetic information are often lacking under current best practices in relevant industries and stakeholders. multi-stakeholder involvement and improved understanding of the security risks to biotechnology are required in order to develop appropriate countermeasures (millett et al., ) . towards these goals, this paper expands upon a microbiological genetic information system assessment by fayans et al. (fayans et al., ) to include a broader range of genetic information, as well as novel concepts and additional threats to the ecosystem. confidentiality, integrity, and availability are the core principles governing the secure operation of a system (fayans et al., ; international organization for standardization, ) . confidentiality is the principle of ensuring access to information is restricted based upon the information's sensitivity. examples of confidentiality include encryption, access controls, and authorization. integrity is the concept of protecting information from unauthorized modification or deletion, while availability ensures information is accessible to authorized parties at all times. integrity examples include logging events, backups, minimizing material degradation, and authenticity verification. 
availability can be described as minimizing the chance of data or material destruction, as well as network, power, and other infrastructure outages. sensitive genetic information, which includes both biological material and digital genetic data, is the primary asset of concern, and associated assets, such as metadata, electronic health records and intellectual property, are also vulnerable within this ecosystem. genetic information can be classified into two primary levels, sensitive and nonsensitive, based upon value, confidentiality requirements, criticality, and inherent risk. sensitive genetic information can be further categorized into restricted and private sublevels. ❖ restricted sensitive genetic information can be expected to cause significant risk to a nation, ethnic group, individual, or stakeholder if it is disclosed, modified, or destroyed without authorization. the highest level of security controls should be applied to restricted sensitive genetic information. examples of restricted sensitive information are material and data sourced from humans, resources humans rely upon, and organisms and microbes that could cause harm to humans or resources humans rely upon. due to its identifiability, human genetic information can be especially sensitive and thus requires special security considerations. ❖ private sensitive genetic information can be expected to cause a moderate level of risk to a nation, ethnic group, individual, or stakeholder if it is disclosed, modified, or destroyed without authorization. genetic information that is not explicitly classified as restricted sensitive or nonsensitive should be treated as private sensitive information. a reasonable level of security controls should be applied to private sensitive information. examples of private sensitive information are intellectual property from research, breeding, and agricultural programs. ❖ nonsensitive (or public) genetic information can be expected to cause little risk if it is disclosed, modified, or destroyed without authorization. while few controls may be required to protect the confidentiality of nonsensitive genetic information, controls should be in place to prevent unauthorized modification or destruction of nonsensitive information. examples of nonsensitive information are material and data sourced from non-human entities that are excluded from the sensitive level if the resulting data are to be made publicly available within reason. the genetic information ecosystem can be compromised in numerous ways, including purposefully adversarial activities and human error. organizations take steps to monitor and prevent error, and molecular biologists are skilled in laboratory techniques; however, they commonly do not have the expertise and resources to securely configure and operate these environments, nor are they enabled to do so by vendor service contracts and documentation. basic security features and tools, such as antivirus software, can easily be subverted, and advanced protections are not commonly implemented. much genetic data is already publicly available via open and semi-open databases, and dissemination practices are not properly addressed by regulations. there are wide-ranging motives behind adversaries targeting non-public genetic information (fayans et al., ) . numerous stakeholders, personnel, and insecure devices are relied upon from the path of sample collection to data dissemination. depending on the scale of an exploit, hundreds to millions of people could be compromised. 
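as an illustration of how the three-level sensitivity classification introduced above could be operationalized, the sketch below maps a few coarse properties of a sample or dataset to a label. the property names and the decision rule are hypothetical simplifications made for the example; they are not a policy prescribed in this paper.

from enum import Enum

class Sensitivity(Enum):
    RESTRICTED = "restricted sensitive"
    PRIVATE = "private sensitive"
    NONSENSITIVE = "nonsensitive"

def classify(source_is_human, can_harm_humans_or_resources, cleared_for_public_release):
    """simplified decision rule mirroring the text: human-derived information,
    or information that could harm humans or the resources they rely on, is
    restricted; non-human information cleared for public release is
    nonsensitive; everything else defaults to private sensitive."""
    if source_is_human or can_harm_humans_or_resources:
        return Sensitivity.RESTRICTED
    if cleared_for_public_release:
        return Sensitivity.NONSENSITIVE
    return Sensitivity.PRIVATE

# e.g. intellectual property from a breeding programme that is not public
print(classify(False, False, False))  # Sensitivity.PRIVATE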
local attacks could lead to certain devices, stakeholders, and individuals being affected, while supply chain and remote attacks could lead to global-scale impact. widespread public dissemination and lack of inherent security controls equate to millions of individuals and their relatives having substantial risk imposed upon them. genetic data can be used to identify an individual (lin et al., ) and predict their physical characteristics (li et al., ; lippert et al., ) , and capabilities for familial matching are increasing, with the ability to match individuals to distant relatives edge et al., ; ney et al., ) . identifiability of genetic information is a critical challenge leading to growing consumer privacy concerns (baig et al., ) , and behavioral predictions from genetic information are gaining traction to produce stronger predictors year over year (gard et al., ; johnson et al., ) . furthermore, many diseases and negative health outcomes have genetic determinants, meaning that genetic data can reveal sensitive health information about individuals and families (sawaya et al., ) . these issues pale in comparison to the weaponization of genetic information. genetics can inform both a doctor and an adversary in the same way, revealing weaknesses that can be used for treatment or exploited to cause disease (sawaya et al., ) . the creation of bioweapons utilizes the same processes as designing vaccines and medicines to mitigate infectious diseases, namely access to an original infectious organism or microbe and its genetic information (berger and roderick, ) . this alarming scenario was thought to be unlikely only six years ago as the necessary specialized skills and expertise were not widely distributed. since then, access to sensitive genetic data has increased, such as the genome sequences of the novel coronavirus (sars-cov- ) (sah et al., ) , african swine fever (mazur-panasiuk et al., ) , and the spanish influenza a (h n ) (tumpey et al., ) viruses. synthetic biology capabilities, skill sets, and resources have also proliferated (ney et al., ) . sars-cov- viral clone creation from synthetic dna fragments was possible only weeks after the sequences became publicly available (thao et al., ) . this same technology can be utilized to modify noninfectious microbes and microorganisms to create weaponizable infectious agents (berger and roderick, ; chosewood and wilson, ; salerno and koelm, ) . covid- susceptibility, symptoms, and mortality all have genetic components (taylor et al., ; ellinghaus et al., ; nguyen et al., ) , demonstrating how important it will be to safeguard genetic information in the future to avoid targeted biological weapons. additionally, microbiological data cannot be determined to have infectious origins until widespread infection occurs or until it is sequenced and deeply analyzed (chosewood and wilson, ; salerno and koelm, ) ; hence, data that is potentially sensitive also needs to be protected throughout the entire ecosystem. the genetic information ecosystem is a distributed cyber-physical system containing numerous stakeholders (supplementary material, appendix ), personnel, and devices for computing and networking purposes. the ecosystem is divided into the pre-analytical, analytical, and postanalytical phases that are synonymous with: (i) collection, storage, and distribution of biological samples, (ii) generation and processing of genetic data, and (iii) storage and sharing of genetic data (supplementary material, appendix ). 
this ecosystem introduces many pathways, or attack vectors, for malicious access to information and systems (figure ).
figure . the genetic information ecosystem and accompanying threat landscape. the genetic information ecosystem is divided into three phases: pre-analytical, analytical, and post-analytical. the analytical phase is further divided into wet laboratory preparation, dna sequencing, and bioinformatic pipeline subphases. in its simplest form, this system is a series of inputs and outputs that are either biological material, data, or logical queries on data. every input, output, device, process, and stakeholder is vulnerable to exploitation via the attack vectors denoted by red letters. color schema: purple, sample collection and processing; blue, wet laboratory preparation; green, genetic data generation and processing; yellow, data dissemination, storage, and application.
unauthorized physical access or insider threats could allow for theft of assets or the use of other attack vectors in any phase of the ecosystem (walsh and streilein, ) . small independent laboratories often do not have the resources to implement strong physical security. large institutions are often able to maintain strong physical security, but the relatively large number of individuals and devices that need to be secured can create a complex attack surface. ultimately, the strongest cybersecurity can be easily circumvented by weak physical security. insider threats are a problem for information security because personnel possess deeper knowledge of an organization and its systems. many countries rely on foreign nationals working in biotechnological fields who may be susceptible to foreign influence. citizens can also be susceptible to foreign influence. personnel could introduce many exploits on-site if coerced or threatened. even when not acting in a purposefully malicious manner, personnel can unintentionally compromise the integrity and availability of genetic information through error (us office of the inspector general, ) . appropriate safeguards should be in place to ensure that privileged individuals are empowered to do their work correctly and efficiently, but all activities should be documented and monitored when working with sensitive genetic information. sample collection, storage, and distribution processes have received little recognition as legitimate points for the compromise of genetic information. biological samples, as inputs into this ecosystem, can be maliciously modified to contain encoded malware (ney et al., ) , or they could be degraded, modified, or destroyed to compromise the material's and the resulting data's integrity and availability. sample repository and storage equipment are usually connected to a local network for monitoring purposes. a remote or local network attack could sabotage connected storage equipment, causing samples to degrade or be destroyed. biorepositories and the collection and distribution of samples could be targeted to steal numerous biological samples, such as in known genetic testing scams. targeted exfiltration of small numbers of samples may be difficult to detect. sensitive biological material should be safeguarded in storage and transit, and when not needed for long-term biobanking, it should be destroyed following successful analysis. other organizations that handle genetic material could be targeted for the theft of samples and processed dna libraries.
the wet laboratory preparation and dna sequencing subphases last several weeks and produce unused waste and stored material. at the conclusion of sequencing runs, the consumables that contain dna molecules are not always considered sensitive. these items can be found unwittingly maintained in many sequencing laboratories. several cases have been documented of dna being recovered and successfully sequenced while aged for years at room temperature and in non-controlled environments (colette et al., ). dna sequencing systems and laboratories are multifaceted in their design and threat profile. dna sequencing instruments have varying scalability of throughput, cost, and unique considerations for secure operation (table ) . sequencing instruments have a built-in computer and commonly have connected computers and servers for data storage, networking, and analytics. these devices contain a number of different hardware components, firmware, an operating system, and other software. some contain insecure legacy versions of operating system distributions. sequencing systems usually have wireless or wired local network connections to the internet that are required for device monitoring, maintenance, data transmission, and analytics in most operations. wireless capabilities and bluetooth technology within laboratories present unnecessary threats to these systems, as any equipment connected to laboratory networks is a potential network entry point. device vendors obtain various internal hardware components from several sources and integrate them into laboratory devices that contain vendor-specific intellectual property and software. generic hardware components are often produced overseas, which is cost effective but leads to insecurities and a lack of hardening for specific end-use purposes. hardware vulnerabilities could be exploited on-site, or they can be implanted during manufacturing and supply-chain processes for widespread and unknown security issues (fayans et al., ; ender et al., ; shwartz et al., ; anderson and kuhn, ) . such hardware issues are unpatchable and will remain with devices forever until newer devices can be manufactured to replace older versions. unfortunately, adversaries can always shift their techniques to create novel vulnerabilities within new hardware in a continual vicious cycle. third-party manufacturers and device vendors implement firmware in these hardware components. embedded device firmware has been shown to be more susceptible to cyber-attacks than other forms of software (shwartz et al., ) . in-field upgrades are difficult to implement, and like hardware, firmware and operating systems of sequencing systems can be maliciously altered within the supply chain (fayans et al., ) . a firmware-level exploit would allow for the evasion of operating system, software, and application-level security features. firmware exploits can remain hidden for long periods, even after hardware replacements or wiping and restoring to default factory settings. furthermore, operating systems have specific disclosed common vulnerabilities and exposures (cves) that are curated by the mitre organization and backed by the us government . with ubiquitous implementation in devices across all phases of the ecosystem, these software issues are especially concerning but can be partially mitigated by frequent updates. however, operating systems and firmware are typically updated every six to twelve months by a field agent accessing a sequencing device on site. 
device operators are not allowed to modify the device in any way, yet they are responsible for some security aspects of this equipment. additionally, researchers have confirmed the possibility of index hopping, or index misassignment, by sequencing device software, resulting in customers receiving confidential data from other customers (ney et al., ) or downstream data processors inputting incorrect data into their analyses. dna sequencing infrastructure is proliferating. illumina, the largest vendor of dna sequencing instruments, accounted for % of the world's sequencing data in by their own account . in , illumina had , sequencers implemented globally capable of producing a total daily output of tb (erlich, ) , with many of these instruments housed outside of the us and europe. in , technology developed by beijing genomics institute has finally resulted in the $ human genome (drmanac, ) while us prices remain around $ , . overseas organizations can be third-party sequencing service providers for direct-to-consumer (dtc) companies and other stakeholders. shipping internationally for analysis is less expensive than local services (office of the us trade representative, ), indicating that genetic data could be aggregated globally by nation-states and other actors during the analysis phase. https://cve.mitre.org/ https://www.cisa.gov/news/ / / /fbi-and-cisa-warn-against-chinese-targeting-covid- -research-organizations raw signal sequencing data are stored on a sequencing system's local memory and are transmitted to one or more endpoints. transmitting data across a local network requires internal information technology (it) configurations. vendor documentation usually depends upon implementing a firewall to secure sequencing systems, but doing so correctly requires deep knowledge of secure networking and vigilance of network activity. documentation also commonly mentions disabling and enabling certain network protocols and ports and further measures that can be difficult for most small-to medium-sized organizations if they lack dedicated it support. laboratories and dna sequencing systems are connected to many third-party services, and laboratories have little control over the security posture of these connections. independent cloud platforms and dna sequencing vendors' cloud platforms are implemented for bioinformatic processing, data storage, and device monitoring and maintenance capabilities (table ) . a thorough security assessment of cloud services remains unfulfilled in the genomics context. multifactor authentication, role-and task-based access, and many other security measures are not common in these platforms. misconfigurations to cloud services and remote communications are a primary vulnerability to genetic information, demonstrated by prior breaches, remote desktop protocol issues affecting illumina devices , and a disclosed vulnerability in illumina's basespace application program interface . laboratory information management systems (lims) are also frequently implemented within laboratories and connected to sequencing systems and laboratory networks (roy et al., ) . dna sequencing vendors provide their own limss as part of their cloud offerings. even when lims and cloud platforms meet all regulatory requirements for data security and privacy, they are handling data that is not truly anonymized and therefore remains identifiable and sensitive. furthermore, specific cves have been disclosed for dnatools' dnalims product that were actively exploited by a foreign nation-state . 
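because much of the network hardening of sequencing systems (closing unused ports and protocols, restricting third-party connections) is left to the laboratory, a small audit script can at least make the current exposure visible. the sketch below probes a short list of tcp ports on a host with python's standard socket module; the host address and port list are placeholders, and such a scan should only ever be run against equipment one is authorised to test.

import socket

def open_tcp_ports(host, ports, timeout=1.0):
    """return the subset of ports on which a tcp connection attempt succeeds."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:  # 0 means the connection was accepted
                found.append(port)
    return found

if __name__ == "__main__":
    sequencer_host = "192.168.1.50"  # hypothetical address of a sequencing workstation
    common_ports = [21, 22, 80, 135, 139, 443, 445, 3389, 8080]
    print("open ports:", open_tcp_ports(sequencer_host, common_ports))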
phishing attacks are another major threat, as email services add to the attack surface in many ways. sequencing service providers often share links granting access to datasets via email. these email chains are a primary trail of transactions that could be exploited to exfiltrate data on clients, metadata of samples, or genetic data itself. some laboratories transmit raw data directly to an external hard drive per customer or regulatory requirements. reducing network activity in this way can greatly minimize the threat surface of sensitive genetic information. separating networks and devices from other networks, or air gapping, while using hard drives is possible, but even air-gapped systems have been shown to be vulnerable to compromise (guri, ; guri et al., ) . sequencing devices are still required to be connected to the internet for maintenance and are often connected between offline operations. hard drives can be physically secured and transported; however, these methods are time and resource intensive, and external drives could be compromised for the injection of modified software or malware. bioinformatic software has not been commonly scrutinized in security contexts or subjected to the same adversarial pressure as other more mature software. open-source software is widely used across genomics, acquired from several online code repositories, and heavily modified for individual purposes, but it is only secure when security researchers are incentivized to assess these products. in a specialized and niche industry like genomics and bioinformatics this is typically not the case. bioinformatic programs have been found to be vulnerable due to poor coding practices, insecure function usage, and buffer overflows , (ney et al., ) . many researchers have uncovered that algorithms can be forced to mis-classify by intentionally modifying data inputs, breaking the integrity of any resulting outputs (finlayson et al., ) . nearly every imaginable algorithm, model type, and use case has been shown to be vulnerable to this kind of attack across many data types, especially those relevant to raw signal and sequencing data formats (biggio and roli, ) . similar attacks could be carried out in the processing of raw signal data internal to a sequencing system or on downstream bioinformatic analyses accepting raw sequencing data or processed data as an input. alarming amounts of human and other sensitive genetic data are publicly available , , , , . several funding and publication agencies require public dissemination, so researchers commonly contribute to open and semi-open databases (shi and wu, ) . healthcare providers either house their own internal databases or disseminate to third-party databases. their clinical data is protected like any other healthcare information as required by regulations; however, this data can be sold and aggregated by external entities. dtc companies keep their own internal databases closely guarded and can charge steep prices for third-party access. data sharing is prevalent when the price is right. data originators often have access to their genetic data and test results for download in plaintext. these reports can then be uploaded to public databases, such as gedmatch and dna.land, for further analyses, including finding distant genetic relatives with a shared ancestor . a well-known use of such identification tactics was the infamous golden state killer case (edge and coop, ) . 
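one inexpensive control for the external-hard-drive transfers discussed above is to verify file integrity with cryptographic checksums, so that modification in transit, or injected content, is at least detectable on arrival. the sketch below computes and compares sha-256 digests; the manifest format (one '<digest>  <relative path>' entry per line) is an assumption made for the example.

import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """stream a file from disk and return its hex-encoded sha-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path):
    """compare each file listed in the manifest against its recorded digest
    and return the names of files that no longer match."""
    base = Path(manifest_path).parent
    mismatches = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        if sha256_of(base / name) != expected:
            mismatches.append(name)
    return mismatches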
data sharing is dependent upon the data controller's wants and needs, barring any legal or business requirements from other involved stakeholders. genetic database vulnerabilities have been well-studied and disclosed (edge and coop, ; ney et al., ; naveed et al., ; erlich and narayanan, ; gymrek et al., ) . for example, the contents of the entire gedmatch database could be leaked by uploading artificial genomes (ney et al., ) . such an attack would violate the confidentiality of more than a million users' and their relatives' genetic data because the information is not truly anonymized. even social media posts can be filtered for keywords indicative of participation in genetic research studies to identify research participants in public databases (liu et al., ) . all told, tens of millions of research participants, consumers, and relatives are already at risk. adversarial targeting of genetic information largely depends upon the sensitivity, quantity, and efficiency of information compromise for attackers, leading to various states in likelihood of a breach or exposure scenario. the impact of a compromise is determined by a range of factors, including the size of the population at risk, negative consequences to stakeholders, and capabilities and scale of adversarial activity. likelihood and impact both ultimately inform the level of risk facing stakeholders during ecosystem phases (figure ). risk to the genetic information ecosystem. quantity is not to scale but is denoted abstractly by width of the second column. likelihood judged by the available threats and opportunities to adversaries and the efficiency of an attack. impact in terms of the number of people affected and the current and emerging consequences to stakeholders. likelihood and impact scores: low (+); moderate (+ +); high (+ + +); very high (+ + + +); extreme (+ + + + +). low to extreme risk is denoted by the hue of red, from light to dark. security is a spectrum; stakeholders must do everything they can to chase security as a best practice. securing genetic information is a major challenge in this rapidly evolving ecosystem. attention has primarily been placed on the post-analytical phase of the genetic information ecosystem for security and privacy, but adequate measures have yet to be universally adopted. the pre-analytical and analytical phases are also highly vulnerable points for data compromise that must be addressed. adequate national regulations are needed for security and privacy enforcement, incentivization, and liability, but legal protection is dictated by regulators' responses and timelines. however, data originators, controllers, and processors can take immediate action to protect their data. genetic information security is a shared responsibility between sequencing laboratories and device vendors, as well as all other involved stakeholders. to protect genetic information, laboratories, biorepositories, and other data processors need to create strong organizational policies and reinvestments towards their physical and cyber infrastructure. they also need to determine the sensitivity of their data and material and take necessary precautions to safeguard sensitive genetic information. data controllers, especially healthcare providers and dtc companies, should reevaluate their data sharing models and methods, with special consideration for the identifiability of genetic data. device vendors need to consider security when their products are being designed and manufactured. 
many of these recommendations go against the current paradigms in genetics and related industries and will therefore take time, motivation, and incentivization before being actualized, with regulation being a critical factor. in order to secure genetic information and protect all stakeholders in the genetic information ecosystem, further in-depth assessments of this threat surface will be required, and novel security and privacy approaches will need to be developed. sequencing systems, bioinformatics software, and other biotechnological infrastructure need to be analyzed to fully understand their vulnerabilities. this will require collaborative engagement between stakeholders to implement improved security measures into genetic information systems (moritz et al., ; berger and schneck, ) . the development and implementation of genetic information security will foster a healthy and sustainable bioeconomy without damaging privacy or security. there can be security without privacy, but privacy requires security. these two can be at odds with one another in certain contexts. for example, personal security aligns with personal privacy, whereas public security can require encroachment on personal privacy. a similar story is unfolding within genetics. genetic data must be shared for public good, but this can jeopardize personal privacy. however, genetic data necessitates the strongest protections possible for public security and personal security. appropriate genetic information security will simultaneously protect everyone's safety, health, and privacy. the inspiration for this work occurred while performing several security assessments and penetration tests of dna sequencing laboratories and other stakeholders. initially, an analysis of available literature and technical documentation (n= ) was performed, followed by confidential semi-structured interviews (n= ) with key personnel from multiple relevant stakeholders. the study's population consisted of leaders and technicians from government agencies (n= ) and organizations in small, medium, and large enterprises (n= ) across the united states, including california, colorado, district of columbia, massachusetts, montana, and virginia. several stakeholders allowed access to their facilities for observing environments and further discussions. some stakeholders allowed in-depth assessments of equipment, networks, and services. gs, ss, and dn are founders and owners of geneinfosec inc. and are developing technology and services to protect genetic information. geneinfosec inc. has not received us federal research funding. ah declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. genetics stakeholders are categorized based upon their influence, contributions, and handling of biological samples and resulting genetic data (supplementary table ). asymmetries exist between stakeholders in these regards . data originators are humans that voluntarily or involuntarily are the source of biological samples or are investigators collecting samples from nonhuman specimens. examples of data originators include consumers, healthcare patients, military personnel, research subjects, migrants, criminals, and their relatives. data controllers are entities that are legally liable for and dictate the use of biological samples and resulting data. 
in humanderived contexts, data controllers are typically healthcare providers, researchers, law enforcement agencies, or dtc companies. data processors are entities that collect, store, generate, analyze, disseminate, and/or apply biological samples or genetic data. data processors may also be data originators and data controllers. examples include biorepositories, dna sequencing laboratories, researchers, cloud and other service providers, and supply chain entities responsible for devices, software and materials. regulators oversee this ecosystem and the application and use of biotechnology, biological samples, genetic data, and market/industry trends at the transnational, national, local, and organizational levels. biological samples and metadata from the samples must first be collected once a data originator or controller determines to proceed with genetic testing. biological samples can be sourced from any biological entity relying on nucleic acids for reproduction, replication, and other processes, including non-living microbes (e.g., viruses, prions), microorganisms (e.g., bacteria, fungi), and organisms (e.g., plants, animals). samples are typically de-identified of metadata and given a numeric identifier, but this is largely determined by the interests of data controllers and the regulations that may pertain to various sample types. metadata includes demographic details, inclusion and exclusion criteria, pedigree structure, health conditions critical for secondary analysis, and other identifying information . it can also be in the form of quality metrics obtained during the analysis phase. samples are then stored in controlled environments at decreased temperature, moisture, light, and oxygen to avoid degradation. sample repositories can be internal or third-party infrastructure housing small to extremely large quantities of material for short-and long-term storage. following storage, samples are distributed to an internal or third-party laboratory for dna sequencing preparations. the wet laboratory preparation phase chemically prepares biological samples for sequencing with sequencing-platform-dependent methods. this phase can be performed manually with time-and labor-intensive methods, or it can be highly automated to reduce costs, run-time, and error. common initial preparation steps involve removing contaminants and unwanted material from biological samples and extracting and purifying samples' nucleic acids. if rna is to be sequenced, it is usually converted into complementary dna. once dna has been isolated, a library for sequencing is created via size-selection, sequencing adapter ligation, and other chemical processes. adapters are synthetic dna molecules attached to dna fragments for sequencing and contain sample indexes, or identifiers. indexes allow for multiplexing sequencing runs with many samples at once to increase throughput, decrease costs, and to identify dna fragments to their sample source. to begin sequencing, prepared libraries are loaded into a dna sequencing instrument with the required materials and reagents. laboratory personnel must login to the instrument and any connected services, such as cloud services or information management systems, and configure a run to initiate sequencing. a single sequencing run can generate gigabytes to terabytes of raw sequencing data and last anywhere from a few hours to multiple days, requiring the devices to commonly be left unmonitored during operation. 
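to give a feel for how the sample indexes described above are used once sequencing data come off the instrument, the sketch below assigns reads to samples by comparing each read's index sequence against a sample sheet, tolerating a single mismatch. the sample-sheet layout and the one-mismatch rule are assumptions made for the example, not a description of any vendor's demultiplexing software.

def mismatches(a, b):
    """number of differing positions between two index sequences."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def demultiplex(reads, sample_sheet, max_mismatches=1):
    """reads: iterable of (read_id, index_sequence) pairs;
    sample_sheet: dict mapping known index sequences to sample names.
    returns a dict of sample name -> list of read ids; reads whose index
    is too far from every known index go to the 'undetermined' bucket."""
    assigned = {name: [] for name in sample_sheet.values()}
    assigned["undetermined"] = []
    for read_id, index in reads:
        best = min(sample_sheet, key=lambda known: mismatches(index, known))
        if mismatches(index, best) <= max_mismatches:
            assigned[sample_sheet[best]].append(read_id)
        else:
            assigned["undetermined"].append(read_id)
    return assigned

# toy example with two samples and three reads
sheet = {"ACGTAC": "sample_a", "TTGCAA": "sample_b"}
reads = [("r1", "ACGTAC"), ("r2", "TTGCAT"), ("r3", "GGGGGG")]
print(demultiplex(reads, sheet))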
raw data can be stored on the instrument's local memory and are transmitted to one or more of the following endpoints during or following a sequencing run: (i) local servers, computers, or other devices within the laboratory; (ii) cloud services of the vendor or other service providers; and (iii) external hard drives directly tethered to the sequencer. data paths largely depend on the sequencing platform, the laboratory's capabilities and infrastructure, and the sensitivity of data being processed. certain regulations require external hard drive use and offline data storage, analysis, and transmission. bioinformatic pipelines convert raw data through a series of software tools into usable forms. raw signal data include images, chemical signal, electrical current signal, and other forms of signal data dependent upon the sequencing platform. primary analyses convert raw signal data into sequence data with accompanying quality metrics through a process known as basecalling. many sequencing instruments can perform these functions. the length of each dna molecule sequenced is orders of magnitude smaller than genes or genomes of interest, so basecalled sequence data must then be aligned to determine each read's position within a genome or genomic region. this aligned sequence data is then compared to reference genomes sourced from databases through a procedure known as variance detection to determine differences between a sample's data and the accepted normal genomic sequence. only the unique genetic variants of a sample are retained in variance call format (vcfs) files, a common final processed data form. vcf files are vastly smaller than the gigabytes to terabytes of raw data initially produced, making them an efficient format for longterm storage, dissemination, and analysis purposes. however, this file format exists as a security threat for sensitive genetic data because these files are personally identifiable and contain sensitive health information. following data analyses, processed data are integrated with metadata and ultimately interpreted for the data controller's purpose. metadata and genetic data are often housed together, and exploiting this combined information could lead to numerous risks and threats to the data originators, their relatives, and the liable entities involved along the data path. secondary analyses can be performed on datasets by data controllers and third-party data processors to answer any number of relevant research questions, such as in diagnostics or ancestry analysis. genetic research is only powerful when large datasets are created containing numerous data points from thousands to millions of samples. therefore, genetic data is widely distributed and accessible via remote means across numerous databases and stakeholders. low cost attacks on tamper resistant devices i'm hoping they're an ethical company that won't do anything that i'll regret" users perceptions of at-home dna testing companies national and transnational security implications of big data in the life sciences national and transnational security implications of asymmetric access to and use of biological data wild patterns: ten years after the rise of adversarial machine learning biosafety in microbiological and biomedical laboratories. 
us department of health and human services adverse effect of air exposure on the stability of dna stored at room temperature first $ genome sequencing enabled by new extreme throughput dnbseq platform how lucky was the genetic investigation in the golden state killer case attacks on genetic privacy via uploads to genealogical databases linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets genomewide association study of severe covid- with respiratory failure the unpatchable silicon: a full break of the bitstream encryption of xilinx -series fpgas routes for breaching and protecting genetic privacy identity inference of genomic data using longrange familial searches cyber security threats in the microbial genomics era: implications for public health adversarial attacks on medical machine learning genetic influences on antisocial behavior: recent advances and future directions. current opinion in psychology power-supplay: leaking data from air-gapped systems by turning the power-supplies into speakers brightness: leaking sensitive data from air-gapped workstations via screen brightness identifying personal genomes by surname inference iso/iec : . information technology -security techniques -guidelines for cybersecurity behavioral genetic studies of personality: an introduction and review of the results of + years of research. the sage handbook of personality theory and assessment robust genome-wide ancestry inference for heterogeneous datasets and ancestry facial imaging based on the genomes project. biorxiv genomic research and human subject privacy identification of individuals by trait prediction using whole-genome sequencing data amia annual symposium proceedings identifiability in genomic research the first complete genomic sequences of african swine fever cyber-biosecurity risk perceptions in the biotech sector promoting biosecurity by professionalizing biosecurity privacy in the genomic era computer security risks of distant relative matching in consumer genetic databases computer security, privacy, and dna sequencing: compromising computers with synthesized dna, privacy leaks, and more genotype extraction and false relative attacks: security risks to third-party genetic genealogy services beyond identity inference human leukocyte antigen susceptibility map for sars-cov- next-generation sequencing informatics: challenges and strategies for implementation in a clinical environment complete genome sequence of a novel coronavirus (sars-cov- ) strain isolated in nepal biological laboratory and transportation security and the biological weapons convention. national nuclear security administration artificial intelligence and the weaponization of genetic data an overview of human genetic privacy opening pandora's box: effective techniques for reverse engineering iot devices analysis of genetic host response risk factors in severe covid- patients. medrxiv rapid reconstruction of sars-cov- using a synthetic genomics platform characterization of the reconstructed spanish influenza pandemic virus the fbi dna laboratory: a review of protocol and practice vulnerabilities findings of the investigation into china's acts, policies and practices related to technology transfer, intellectual property, and innovation under section of the trade act of . 
office of the united states trade representative, executive office of the president security measures for safeguarding the bioeconomy the coming crispr wars: or why genome editing can be more dangerous than nuclear weapons thermo fisher scientific, inc. applied biosystems / xl genetic analyzer user guide thermo fisher scientific, inc. applied biosystems / xl dna analyzers user guide applied biosystems seqstudio genetic analyzer specification sheet illumina document # v illumina proactive | data security sheet illumina document # v illumina document # v illumina document # v , material # illumina document # v , material # nextseq dx instrument site prep guide novaseq sequencing system site prep guide thermo fisher scientific publication #col thermo fisher scientific publication #col ion torrent genexus integrated sequencer performance summary sheet gridion mk site installation and device operation requirements, version oxford nanopore technologies. minion it requirements, version promethion p /p site installation and device operation requirements, version pacific biosciences of california, inc. operations guide -sequel system: the smrt sequencer pacific biosciences of california, inc. operations guide -sequel ii system: the smrt sequencer connect platform | iot connectivity the authors would like to acknowledge the confidential research participants and collaborators on this study for their time, resources, and interest in bettering genetic information security. thank you to cory cranford, arya thaker, ashish yadav, and dr. kevin gifford and dr. daniel massey of the department of computer science, formerly of the technology, cybersecurity and policy program, at the university of colorado boulder for their support of this work. v appendix . overview of the genetic information ecosystem processes (page ) key: cord- -o c m wy authors: kostovska, ana; džeroski, sašo; panov, panče title: semantic description of data mining datasets: an ontology-based annotation schema date: - - journal: discovery science doi: . / - - - - _ sha: doc_id: cord_uid: o c m wy with the pervasiveness of data mining (dm) in many areas of our society, the management of digital data, readily available for analysis, has become increasingly important. consequently, nearly all community accepted guidelines and principles (e.g. fair and trust) for publishing such data in the digital ecosystem, stress the importance of semantic data enhancement. having rich semantic annotation of dm datasets would support the data mining process at various choice points, such as data understanding, automatic identification of the analysis task, and reasoning over the obtained results. in this paper, we report on the developments of an ontology-based annotation schema for semantic description of dm datasets. the annotation schema combines three different aspects of semantic annotation, i.e., annotation of provenance, data mining specific, and domain-specific information. we demonstrate the utility of these annotations in two use cases: semantic annotation of remote sensing data and data about neurodegenerative diseases. recently, the success of data mining (dm) and machine learning (ml) in a broad range of applications has led to a growing demand for ml systems. however, this success heavily relies on the ml expertise of the practitioners, and on the quality of the analyzed data, both of which are in short supply. 
one potential solution for overcoming the shortage of expertise is to develop more intelligent data analysis systems, that will assist domain practitioners in the construction of analysis pipelines and the interpretation of results. such an intelligent dm system would we able to reason over distributed heterogeneous data and knowledge bases, automatically define the learning task, recommend the most suitable algorithms for the task at hand, and correctly interpret the induced predictive models [ , ] . the first step towards the development of such systems is the improvement of data management and data understanding. research data must be enriched with formal and logical descriptors that capture the characteristics of the data relevant for the task of automation of the data analysis process. additionally, these descriptors have the potential to significantly improve interdisciplinary research by helping ml practitioners better understand the data originating from the application domains, as well as easily incorporate domain knowledge in the process of analysis. formal descriptors, when published on the web, can also improve the accessibility and reusability of scientific data. many academic institutions have recognized the importance of effective management of scientific data, making it their central mission. for example, the fair (findable, accessible, interoperable, and reusable) principles [ ] are a set of guiding principles that have been introduced to support and promote proper data management and stewardship. in that context, data must be discoverable and it should be semantically annotated with rich metadata. the metadata should always be accessible by standardized communication protocols. the data and the metadata have to be interoperable with external data from the same domain. finally, both data and metadata should be released with provenance details so that the data can be easily replicated and reused. another set of principles that builds upon fair data are the trust principles [ ] . the trust principles go a level higher by focusing on data repositories and providing them with guidance to demonstrate transparency, responsibility, user focus, sustainability, and technology (trust). at the core of both principles lies the semantic enrichment of research data. semantic annotation of data, as a powerful technique, has attracted attention in many domains. unfortunately, semantic annotation of dm and ml datasets is still in the early phases of development. to the best of our knowledge, there are no semantic dataset repositories from the general area of data science that completely adhere to the fair and trust principles. in this paper, we report on the development of an ontology-based annotation schema for semantic annotation of dm datasets. our main objective is to provide a rich vocabulary for data annotation, that will serve as a basis for the construction of a dataset repository that closely follows the fair and trust principles. the annotation schema we proposed includes three different types of information: provenance, dm-specific, and domain-specific. the provenance information improves the transparency and reusability of data. the dm-specific information provides means for reasoning over the analyzed data and helps (in a semi-automatic way) in the construction of the dm workflows (or pipelines). the domain-specific information helps to bridge the gap between ml practitioners and domain experts, as well as to improve cross-domain research. 
finally, we demonstrate the utility of domain-specific annotations in two use cases from the domains of neurodegenerative diseases and earth observation (eo), respectively. in the context of computer science, ontologies are "an explicit formal specifications of the concepts and the relations among them that can exist in a given domain" [ ] . in other words, they provide the basis for an unambiguous, logically consistent, and formal representation of knowledge. it is important to note that, the logical component of ontologies allows knowledge to be shared meaningfully both at machine and human level. also, an immediate consequence of having formal ontologies based on logic is that they can be used in a variety of reasoning tasks, as well as in the inference of new knowledge. the benefits of having ontology-based knowledge representations have been demonstrated in many data-and knowledge-driven applications. the research areas that retained most attention and contributed the most to the technological breakthrough of ontologies are bioinformatics and biomedicine. for example, the open biological and biomedical ontology (obo) foundry [ ] is a collective of ontology developers that have developed and maintain over publicly-available ontologies related to the life sciences. when it comes to the process of ontology engineering, the obo foundry has played a key role, as they have proposed ontology design principles that promote open, orthogonal, and strictly-scoped ontologies with collaborative development. these principles have further widened the use of ontologies across different fields of science. in the area of dm and ml, a large body of research has focused on the development of ontologies, vocabularies and schemas that cover different aspects of the domain. examples of such resources include the data mining optimization ontology (dmop) [ ] , exposé [ ] , mex vocabulary [ ] , and the ml schema [ ] . dmop has been designed to support automation at various choice points of the dm process. the exposé ontology provides the vocabulary needed for a detailed description of machine learning experiments. mex represents a lightweight interchange format for ml experiments. ml schema represents an effort to unify the representation of machine learning entities. the ontodm suite of ontologies is of particular interest, as this paper extends its line of work. ontodm includes three different ontologies: ontodm-core, ontodm-kdd, and ontodt. ontodm-core [ ] is an ontology of core data mining entities, such as dataset, dm task, generalizations, dm algorithms, implementations of algorithms, and dm software. ontodm-kdd [ ] is an ontology for representing the process of knowledge discovery following the crisp-dm methodology [ ] . ontodt [ ] is a generic ontology for the representation of knowledge about datatypes. another type of information related to dm datasets that is important to be formally represented is the provenance information. provenance information refers to the kind of information that describes the origin of a resource (in our case a dataset), i.e., who created the resource, when was it published, and what is its usage license. provenance information is valuable when it comes to deciding whether a specific resource can be trusted. this extra information also helps the users better understand it, easily cite and reuse the resource for their purposes. 
for computers to make use of the provenance information, it has to be given explicitly, and it has to be based on common provenance vocabularies, such as the dublin core vocabulary [ ] , the prov ontology [ ] , the data catalog vocabulary [ ] , or schema.org [ ] . to semantically describe a dm dataset, we consider three different types of vocabularies/ontologies: (1) vocabularies for annotation of provenance information, such as title, description, license, and format; (2) ontologies for annotation of datasets with dm-specific characteristics, i.e., data mining task, datatypes, and dataset specification; and (3) ontologies for annotation of domain-specific knowledge that helps to contextualize the data originating from a given domain. in this section, we discuss the first two aspects of the semantic enrichment of datasets. we describe the schema.org vocabulary, which we reuse for the purpose of annotating the dataset's provenance details. we also outline the main characteristics of the ontodt and ontodm-core ontologies and further extend their structure with terms essential for semantic description from a dm perspective. in sect. we discuss the annotation of domain-specific knowledge through examples from two different domains.

to annotate dm datasets with provenance information, we have chosen the schema.org vocabulary, one of the most widely used vocabularies that provides descriptors for provenance information in a structured manner. when annotating the datasets, we usually use a subset of the list of provided descriptors, as the complete provenance information is not always available. the figure below depicts an example annotation of provenance information in json-ld format. for this example, we used a dataset from the domain of earth observation (eo), named forestry_kras_lidar_landsat. the dataset was used in a study that investigates the possibility of predicting forest vegetation height and canopy cover in the karst region in slovenia by building predictive models using eo data [ ] . for the semantic annotation of provenance information for this dataset, we used several terms from schema.org, such as name, description, url, keywords, creator, distribution, temporal and spatial coverage, citation, and license.

    {
      "@context": "https://schema.org/",
      "@type": "Dataset",
      "name": "forestry_kras_lidar_landsat",
      "description": "this dataset was employed in a study that investigates the possibility of predicting forest vegetation height and canopy cover in the karst region, slovenia by building predictive models using remotely sensed data.",
      "url": "http://semantichub.ijs.si/ontodm",
      "keywords": ["remote sensing", "karst region", "lidar", "landsat"],
      "creator": {
        "@type": "Person",
        "url": "https://www.researchgate.net/profile/daniela_stojanova",
        "name": "daniela stojanova"
      },
      "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "arff",
        "contentUrl": ""
      },
      "temporalCoverage": "...",
      "spatialCoverage": {
        "@type": "Place",
        "geo": {
          "@type": "GeoCoordinates",
          "latitude": ...,
          "longitude": ...
        }
      },
      "citation": {
        "@type": "ScholarlyArticle",
        "name": "estimating vegetation height and canopy cover from remotely sensed data with machine learning",
        "identifier": "https://doi.org/..."
      },
      "license": "https://creativecommons.org/licenses/by/..."
    }

figure: example json-ld annotation of the forestry_kras_lidar_landsat lidar/landsat dataset [ ] using the schema.org vocabulary (dates, coordinates, and identifiers elided in the source are shown as "...").

the second type of annotation considers explicit specification of dataset characteristics from a dm perspective, e.g., the format of the data, the type of learning task, and the features' datatypes. data used in the process of dm can take various forms, but the standard one assumes that there is a set of objects of interest described with features (or attributes). in that sense, the term data example, or (more commonly) data instance, refers to a tuple of feature values corresponding to an observed object. the features are formally typed, meaning that each of them has a designated datatype. in general, there are many different datatypes, such as boolean, real, and discrete, to name a few. having standardized datatype information at one's disposal can enable the development of knowledge-based systems that automate parts of data analysis workflows, e.g., assist dm practitioners in choosing a suitable learning algorithm for the data at hand.

data examples in dm can be described with different characteristics, which can lead to treating the data in radically different ways. we identified four different (orthogonal) characteristics that we believe are important to represent appropriately. these include (1) the availability of data examples, (2) the existence of missing values, (3) the mode of learning, and (4) the type of target in the case of (semi-)supervised learning tasks.

extending the ontodt and ontodm-core ontologies. while the ontodt and ontodm-core ontologies offer a rich vocabulary for the annotation of dm datasets, they do not cover all of the above aspects. thus, we extended ontodt with new dm-specific datatypes and provided an updated datatype taxonomy that allows us to properly describe dm datasets. the proposed taxonomy of datatypes was then used as a basis for the update of the taxonomies of dm tasks and data specification, which are part of the ontodm-core ontology. the extended ontodt and ontodm-core ontologies are available at https://w3id.org/ontodt-extended and https://w3id.org/ontodm-core-extended, respectively.

based on the availability of the data examples, we distinguish between two types of data, i.e., batch data (or datasets) and online data (or data streams). the batch setting is the more traditional approach, where large volumes of data are collected over a longer period. on the other hand, online data refers to data that is continuously being generated by heterogeneous data sources. the availability of data examples is the first dimension we considered when we updated the taxonomies of core classes of ontodt and ontodm-core (see fig. ). the second characteristic is the mode of learning, where we distinguish between unsupervised, supervised, and semi-supervised learning. the key difference between them is the completeness of the data they use for training. unsupervised learning makes use of unlabeled data examples that are only composed of descriptive features. supervised learning, in contrast to unsupervised learning, uses labeled data that, apart from the descriptive features, has some special feature of interest usually referred to as the target. finally, in semi-supervised learning, we learn from both labeled and unlabeled data examples. in the updated taxonomies, we modeled this characteristic at the second level. hence, for both batch and online learning, we defined classes that specify information about the type of learning (see fig. ).
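to make the extension of the taxonomies more concrete, a minimal sketch of how a new datatype class could be added is given below, using python and the rdflib library. the namespaces, class names, and the chosen parent class are illustrative assumptions, not the exact iris used in the published ontodt extension.

    from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS

    # placeholder namespaces; the published ontologies use their own iris
    ONTODT = Namespace("http://example.org/OntoDT#")
    EX = Namespace("http://example.org/ontodt-extension#")

    g = Graph()
    g.bind("ontodt", ONTODT)
    g.bind("ex", EX)

    # assumed existing branch of the datatype taxonomy to hang the new class under
    parent = ONTODT.stream_data_datatype

    # new subclass combining two of the orthogonal characteristics discussed above:
    # online (stream) availability and semi-labeled data examples with missing values
    new_cls = EX.semi_labeled_stream_data_with_missing_values
    g.add((new_cls, RDF.type, OWL.Class))
    g.add((new_cls, RDFS.subClassOf, parent))
    g.add((new_cls, RDFS.label,
           Literal("feature-based semi-labeled stream data with missing values")))

    print(g.serialize(format="turtle"))

keeping the new class as a plain subclass means reasoners can still place datasets annotated with it under the more general stream-data branch of the taxonomy.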
if we take the taxonomy of datatypes as an example, in the batch learning scenario the ontodt: record of two components class represents the datatype of the data examples. the figure illustrates in greater detail the taxonomy of datatypes and the four dimensions that it is based on. finally, the taxonomies of tasks and dataset specifications are designed similarly, following the same principles (see fig. ).

the existence of missing values in the descriptive space is also modeled explicitly (see fig. ), since some dm algorithms cannot function properly in the presence of missing values. we say that a data example has missing values when there is no recorded value for at least one descriptive feature. this is different from having missing values in the target space, which, as we discussed above, leads to semi-labeled data. missing values affect the data quality; thus, they must be handled accordingly by the dm algorithms.

in the case of (semi-)supervised learning, data examples can become even more complex, as the target/output itself can have a complex structure. based on the type of target, we have primitive and structured output prediction tasks. primitive output prediction tasks predict a single target, as in classification (a discrete value) and regression (a real value). in the case of structured output prediction tasks, there is more than one target that has to be predicted. examples of such tasks are multi-target regression, multi-label classification, and hierarchical multi-label classification. the figure presents the complete taxonomy of supervised and semi-supervised predictive modeling tasks. concerning the (semi-)supervised online predictive modeling tasks, the base datatypes of the target can be the same as the target datatypes in the batch predictive modeling tasks. the figure illustrates how this is achieved in the ontodt and ontodm-core ontologies. for instance, the ontodm-core: online predictive modeling task class is related to the ontodt: sequence of records with two components class. sequence datatypes have a base datatype; in this example, it is the ontodt: record with two components base type, which has the datatype role of ontodt: record of two components. note that ontodt: record of two components is the same class used for the representation of the data examples' datatype in the batch predictive learning mode.

using this annotation schema, we have annotated dm datasets in total, all containing data from different application domains. the generated semantic annotations are publicly available in rdf format and can be queried via the jena fuseki server. after describing the four characteristics that govern the modeling of the taxonomies of datatypes, data specification, and tasks, we provide an illustrative example that shows how we can combine them in a single annotation schema for the purpose of semantic annotation of dm datasets. namely, the figure depicts the classes needed for annotation of a data stream with missing values applicable to the learning task of semi-supervised multi-label classification. to represent the datatype of the data examples, we use the ontodt: feature-based semi-labeled stream data with missing values and with a set of discrete output class. this class is connected via the has-part relation with the classes that represent the corresponding data mining task and data specification defined in the ontodm-core ontology, i.e., ontodm-core: online semi-supervised multi-label classification task and ontodm-core: multi-label semi-labeled classification data stream. the annotation schema for data streams also includes a specification of a base datatype.
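since the generated annotations are queryable through a jena fuseki server, a query of the kind sketched below (in python, via the sparqlwrapper package) could retrieve datasets by the task they are annotated with. the endpoint url, prefixes, and property names are assumptions for illustration and would have to be replaced with the ones actually published with the annotations.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # assumed endpoint address; the actual fuseki service url is not reproduced here
    endpoint = "http://example.org/fuseki/dm-annotations/sparql"

    query = """
    PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>
    # placeholder prefix standing in for the ontodm-core namespace
    PREFIX ontodm: <http://example.org/OntoDM-core#>

    SELECT ?dataset ?taskLabel WHERE {
      ?dataset ontodm:has_part ?task .        # annotation links a dataset to its dm task
      ?task a ?taskClass .
      ?taskClass rdfs:label ?taskLabel .
      FILTER(CONTAINS(LCASE(STR(?taskLabel)), "multi-target regression"))
    }
    """

    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    for row in results["results"]["bindings"]:
        print(row["dataset"]["value"], "->", row["taskLabel"]["value"])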
next, we have the classes used for describing the datatypes of the descriptive and target components.

fig. : an example of an annotation schema for data streams applicable to semi-supervised multi-label classification.

in this section, we demonstrate the utility of the annotation schema introduced in the previous section on two use cases, i.e., the annotation of datasets from the domains of neurodegenerative diseases and earth observation (eo). for the two use cases, we also enriched the annotation schema with terminology specific to the domain at hand. the inclusion of domain-specific annotations improves the representation of the datasets, making them accessible and reusable, offers the possibility of executing advanced query scenarios, and enables interoperability with other data from the domain. on a technical level, the alignment of the dm-specific annotation schema with the annotation schemas designed for the particular domains is straightforward. in that sense, the proposed ontology-based annotation schema enables the direct extension of the datatype classes at any level in the taxonomy with classes that define the semantic meaning of the domain-specific datatypes. the newly introduced datatype classes are then linked to the corresponding entities in the domain ontologies.

neurodegenerative diseases such as alzheimer's disease (ad), parkinson's disease (pd), amyotrophic lateral sclerosis (als), and huntington's disease (hd) are a group of diseases caused by a progressive loss of structure or function of neurons. they can lead to irreversible deterioration of cognitive functions such as memory, and cause problems with movement and spatial orientation. in the past two decades, researchers have been investigating new treatments that can slow or stop the progression of these diseases. there are two widely-known studies concerning neurodegenerative diseases, i.e., the alzheimer's disease neuroimaging initiative (adni) [ ] and the parkinson's progression markers initiative (ppmi) [ ] . to annotate the datasets with terms relevant to the domain, we use the nddo (neurodegenerative disease data ontology) ontology [ ] . nddo is designed in accordance with the adni and ppmi studies and it is aligned with the ontodt and ontodm-core ontologies. thus, it can be easily adjusted to annotate the four aspects of data examples we considered in sect. . to illustrate this, we use an instance dataset from the ppmi study that [ ] used for the task of predicting motor impairment assessment scores by utilizing the values of regions of interest (rois) from fmri imaging assessment and dat scans. the dm task they were solving was multi-target regression (mtr). to represent the mtr task and the mtr dataset specification, we use the classes defined in ontodm-core and connect them with the corresponding datatype class from ontodt (in our case ontodt: feature-based completely labeled data with record of numeric ordered primitive output) (see fig. b). this class has two field components. the first one describes the datatypes of the descriptive features, which are of a primitive datatype. the second one describes the datatypes of the features on the target side. in the mtr learning setting, each target feature is described with the numeric datatype. the sub-classes of the numeric datatype, the real and integer datatypes, are positioned at the bottom of the datatype taxonomy, and we link them with the domain datatypes.
for example, nddo: rd ventricle score is one of the descriptive features present in the ppmi dataset and it is linked with the nddo: rd ventricle datatype class that semantically defines its datatype. similarly, nddo: arising from the chair score is a target and its associated datatype is the nddo: arising from the chair datatype class. other features are connected with the respective datatypes in the same way. remote sensing (rs) is the process of monitoring specific physical characteristics of an area of interest by measuring the reflected and emitted energy at a distance from the target area. satellite-based remote sensing technologies are commonly used for earth observation (eo) to monitor characteristics that change over time, i.e., weather prediction, natural changes of the earth, and development of the urban area. due to the increasing availability of eo data, it is essential to develop an ontological approach to managing this kind of data. however, to the best of our knowledge, a general ontology that systematically describes the eo domain is still lacking. nonetheless, some ontologies formalize the knowledge of specific parts of the domain, i.e., semantic sensor network (ssn) ontology [ ] , sosa (sensor, observation, sample, and actuator) ontology [ ] , semantic web for earth and environment technology (sweet) ontology [ ] , and the extensible observation ontology (oboe) [ ] . for semantic annotation of eo data, we have designed a lightweight ontology that is aligned with the aforementioned eo ontologies. the ontology is available at https://w id.org/eo-ontology. the ontology was constructed using the bottom-up approach, based on instances of datasets we have available at our side from previous research [ ] . the datasets contain two target features (forest vegetation height and canopy cover) whose values are obtained via the lidar technology. but since lidar can sometimes be inconvenient or expensive, [ ] examined the possibility of using remote sensing data generated from satellites, such as landsat , irs-p , spot, as well as aerial photographs for the construction of descriptive features that can be relevant for the prediction of the two targets. the landsat , irs-p , and spot satellites use multiple channels for collecting reflected energy, and one channel of emitted energy, that operate on different wavelengths. in this study, when designing the eo ontology, we took into consideration the process of data collection and data preprocessing described in the study mentioned above. in the preprocessing phase, the raw satellite image is converted into a standard geo-referenced data format, which then undergoes the process of image segmentation (see fig. ). a key characteristic of the different image segments is the resolution of the segment size. the image segment size is modeled as a data property of the image segmentation specification class. all features present in the datasets are eo properties observed at a specific point in time, and they are related to a specific image segment. we define two subclasses of the eo property class, i.e., sosa: observable property class and eo aggregated property class. the first one refers to the properties observed with a remote sensor (sosa: sensor ) hosted on a given platform/satellite (sosa: platform). the latter defines the type of properties that are the result of some process of eo property aggregation that transforms the originally observed measurements. 
the process uses multiple eo properties as input and produces one eo aggregated property. the aggregation can be based on some statistical characteristics, such as stato: minimum value, stato: maximum value, stato: average value and stato: standard deviation, where stato is an ontology of statistical methods. this was also the case in our observed datasets. additionally, we define the eo property transformation process that transforms one eo property into another. similarly, as in sect. . , to achieve full interoperability, we integrated the general dm annotations with the domain-specific ones. the integration was perforemed at the level of features appearing in the dataset. thus, the ontodmcore: feature specification class connects with the datatype of the feature via the has-identifier relation, while it also connects with the eo property class via the is-about relation. additionally, the ontodm-core: feature-based data example class is composed of multiple oboe: measurements. in oboe, measurement represents a measurable characteristic of an observed property, which in our case is eo property. we have developed an ontology-based annotation schema for rich semantic annotation of dm datasets that takes into consideration different semantic aspects of the datasets: provenance, dm-specific characteristics of the data, and domainspecific information. the annotation schema is generic enough to support the easy extension of its core classes with information relevant to the application domain. the utility of the designed schema was demonstrated through semantic annotation of data from two different domains: neurodegenerative diseases and earth observation. annotations based on this schema provide means for support of the complete data analysis process, e.g., enable cross-domain interoperability, assist in the definition of the learning task, ensure consistent representation of datatypes, assess the soundness of data, and automatically reason over the obtained results. these annotations also enable the development of applications that require advanced data querying capabilities. they also enable the development of data repositories that adhere to the highest standards of the open data initiative. the data catalog vocabulary (dcat) vocabulary the schema.org vocabulary crisp-dm . step-by-step data mining guide the ssn ontology of the w c semantic sensor network incubator group ml schema core specification mex vocabulary: a lightweight interchange format for machine learning experiments toward principles for the design of ontologies used for knowledge sharing? 
sosa: a lightweight ontology for sensors, observations, samples, and actuators the data mining optimization ontology neurodegenerative disease data ontology the trust principles for digital repositories an ontology for describing and synthesizing ecological observation data multi-dimensional analysis of ppmi data ontodm-kdd: ontology for representing the knowledge discovery process ontology of core data mining entities generic ontology of datatypes alzheimer's disease neuroimaging initiative (adni): clinical characterization knowledge representation in the semantic web for earth and environmental terminology (sweet) the obo foundry: coordinated evolution of ontologies to support biomedical data integration estimating forest properties from remotely sensed data by using machine learning estimating vegetation height and canopy cover from remotely sensed data with machine learning exposé: an ontology for data mining experiments the dublin core: a simple content description model for electronic resources the fair guiding principles for scientific data management and stewardship open access this chapter is licensed under the terms of the creative commons the authors would like to acknowledge the support of the slovenian research agency through the project j - and the young researcher grant to ak. key: cord- -yvcrv c authors: souza, jonatas s. de; abe, jair m.; lima, luiz a. de; souza, nilson a. de title: the general law principles for protection the personal data and their importance date: - - journal: nan doi: . /csit. . sha: doc_id: cord_uid: yvcrv c rapid technological change and globalization have created new challenges when it comes to the protection and processing of personal data. in , brazil presented a new law that has the proposal to inform how personal data should be collected and treated, to guarantee the security and integrity of the data holder. the purpose of this paper is to emphasize the principles of the general law on personal data protection, informing real cases of leakage of personal data and thus obtaining an understanding of the importance of gains that meet the interests of internet users on the subject and its benefits to the entire brazilian society. the concern about the protection of people's data has grown over the years, but only after the approval of the brazilian law that received the name of marco civil da internet, established by law no. , , [ ] . in brazil, a new law has recently been sanctioned and it is generating a lot of discussion in several areas. the general law on personal data protection, law no. . [ ] of th august , gives the brazilian population rights and guarantees on how organizations will have to adapt to the collection and processing of personal data, whether by physical or digital means. discussing data protection in brazil has become a challenging task. the state of the crisis provoked by covid- (coronavirus) had a severe impact on companies, which began to adopt measures to make their workforce compatible with the demands existing during social isolation, and the adoption of measures to minimize the risk of the disease spreading among their workforce. the use of virtual private network -vpn and practices such as byod (bring your own device) have become common to incorporate daily life. there was also an exponential growth of e-commerce, home office, webinars, virtual meetings, and numerous activities that started to occur entirely through the internet. 
in the same proportion, the risks associated with the improper use of personal data, data leaks, improper access by third parties, theft of data kept by corporate servers, creation of fake profiles, fake news, among other practices frequently reported were multiplied. the objective of the paper is to present important aspects such as the principles and fundamentals of brazilian law and to present some real cases on data leaks. the paper is composed of sections, in section presents the theoretical reference that will address the history of data protection in brazil and the european regulation, in section describes the principles of brazilian law demonstrating the similarity with the european regulation, in section the importance of brazilian law showing the fundamentals of the law and emphasizes the importance of consent of the data holder, in section the results and discussions with real cases of data leaks and countries that already have some legislation on protection of personal data, in section are the conclusions bringing the final considerations obtained. in brazil, the legislation is based on the positivist model of law, adopted by lusitanian, german and italian schools that privilege the written law, this reflects in the delay of the implementation of the legislative process (figure ), which begins with the initial idea, passes through the creation of the bill, then through the bicameral approval and then the presidential sanction, to finally come into force with coercive force. the first brazilian initiative on personal data protection was in article of the federal constitution [ ] . art. all are equal the law, without distinction of any nature, guaranteeing brazilians and foreigners residing in the country the inviolability of the right to life, freedom, equality, security, and property, under the following terms [ ] : x -the intimacy, privacy, honor, and image of persons are inviolable, and the right to compensation for material or non-material damage resulting from their violation is guaranteed [ ] . xii -the secrecy of correspondence and telegraphic communications, data, and telephone communications shall be inviolable, except in the latter case by judicial order, in the cases and the manner established by law for criminal investigation or criminal proceedings [ ] . law no. . , of july th, [ ] deals with the interception of telephone communications and regulates clauses xii, art. of the federal constitution. on september th, [ ] law no. . , known as the consumer code (cdc), was enacted, bringing in its article the guarantee of access to the holder's data, demanding clarity and objectivity of the information and the possibility for the consumer to demand the correction of his registration data [ ] . art. the consumer, without prejudice to the provisions of art. , shall have access to the information existing in registers, files, records, personal data, and consumption filed about him, as well as to their respective sources [ ] . paragraph . consumer registrations and data must be objective, clear, truthful, and in easy-tounderstand language, and may not contain negative information for a period longer than five years [ ] . paragraph . the opening of the registration, file, record, personal, and consumption data shall be communicated in writing to the consumer when not requested by him [ ] . paragraph . 
the consumer, whenever he finds any inaccuracy in his data and registrations, may demand their immediate correction, and the archivist shall, within five working days, communicate the change to the eventual recipients of the incorrect information [ ] . paragraph . databases and registers relating to consumers, credit protection services, and the similar are considered public entities [ ] . paragraph . once the statute of limitations on the collection of consumer debts has been consummated, the respective credit protection systems shall not provide any information that may prevent or hinder new access to credit with suppliers [ ] . paragraph . all information referred to in the caption of this article must be made available in accessible formats, including for persons with a disability, at the request of the consumer [ ] . even bringing some progress on personal data protection, the cdc was still limited in its scope on the subject, which means that the protection would exist in the relationship between supplier and consumer within the scope of the legal concepts established in articles and article [ ] of the cdc. on april rd, , law no. , , now known as marco civil da internet [ ] , was approved, establishing principles, guarantees, rights, and duties for the use of the internet in brazil, and has the guarantee of privacy and protection of personal data, and will only make such data available through a court order. in art. , clauses i, ii and iii, and clauses vii, viii, ix, and x, deal with the rights of the holders of personal data [ ] . art. access to the internet is essential to the exercise of citizenship, and the user has assured the following rights [ ] : i -inviolability of intimacy and privacy, their protection and compensation for material or moral damage resulting from their violation [ ] . ii -inviolability and secrecy of the flow of your communications over the internet, except by court order, in the form of the law [ ] . iii -inviolability and secrecy of your stored private communications, except by court order [ ] . vii -do not provide third parties with your data, including connection records, and access to internet applications, except by free, express and informed consent or in the cases provided by law [ ] . viii -clear and complete information about the collection, use, storage, treatment and protection of your data, which may only be used for purposes that: a) justify their collection; b) are not prohibited by law, and c) are specified in service contracts or terms of use of internet applications [ ] . ix -express consent on the collection, use, storage, and processing of personal data, which shall occur in a manner detached from the other contractual clauses [ ] . x -definitive exclusion of personal data that you have provided to a certain internet application, at your request, at the end of the agreement between the parties, except for the cases of mandatory storage of records provided for in this law [ ] . the civil framework of the internet also includes aspects of the responsibility for the protection of personal data by access providers and in operations carried out through the internet, providing for some sanctions, described in articles , , and [ ] . on august th, , law no. , , called the general law on personal data protection [ ] , was approved, providing for the processing of personal data, whether digital or not, to protect the fundamental rights of freedom and privacy and the development personal personality of the individual in society. 
the general law on personal data protection -lgpd, law no. . of th august , which would come into force in august , has been postponed by provisional measure no. / [ ] extending the vacatio legis [ ] and postponed to may [ ] . the lgpd purpose is to provide guidelines on how personal data will be collected and processed, and to ensure the security and integrity of the data holder, whether digital or not. on th july , project law / -plc [ ] was approved by the plenary of the federal senate and was sanctioned on th august by the th president of brazil [ ] . article of the lgpd states that it is prepared to protect the processing of personal data to protect the rights of freedom, privacy, and personality development of the individual. moreover, it applies to any individual or legal entity that carries out-processing operations such as collection, production, reception, classification, processing, among other activities by physical or digital means in brazilian territory, or abroad if it is using personal data of individuals living in brazil. the general data protection regulation / [ ] -gdpr, of the european parliament and of the council of european union -eu, of th april , is a regulation that is on the protection of individuals about the processing of personal data and the free movement of such data and that repeals directive / /ec [ ] , eu companies had two years to comply with the regulation by the date of th may . the regulation applies to all activities involving the processing of personal data using full or partial consent, as well as to the processing of personal data by non-automated means. for a better understanding of lgpd [ ] , it is necessary to know the legal bases (principles) that should be observed for any type of data processing activities, the law is composed of ten principles that are listed in art . a gdpr [ ] [ ] is also guided by principles [ ] , which are set out in article , which form the basis for the eu regulation, and these principles should be linked to data processing. the lgpd [ ] , the purpose for which the data will be done must be very specific, explicit, and informed to the holder of the personal data that will be processed [ ] . in gdpr, the purpose limitation principle [ ], the data must be collected for specific, legitimate, and explicit purposes, and may not be processed for other unspecified purposes [ ] . the lgpd [ ] , is the formality with the holder of the personal data to process personal data [ ] . in gdpr, the storage limitation principle [ ], data may be stored in a database until the end of the data processing and must be informed to the data owner, and after the end of the processing, the data must be deleted from the database. it is also linked to the principle of bidding [ ] that the company that will process the data must comply with the regulation and with the data holder. [ ]. in the lgpd [ ] , the amount of data for data processing is only relevant, proportional, and not excessive [ ] . in the european regulation, the data minimization principle [ ], data should be collected following its purpose and only data that are necessary for the processing [ ] . the lgpd [ ] , guarantees that the data holder will have free access to the data in its entirety at any time, and this principle is linked to the gdpr transparency principle. 
in the european regulation there is a right which is described in article [ ] , which is called right to erasure [ ] or right of forgetfulness, which gives the "right to be forgotten" to the data holder of the database concerning the purpose of the processing, after the data holder has requested to delete the data, the officer shall delete the data relating to the data holder's request [ ] . the lgpd [ ] , guarantees the data owner clarity, accuracy, and relevance and updates the data according to the needs of the data treatment [ ] . the accuracy principle of gdpr [ ] that data should always be updated and correct thus maintaining the quality of the data that will be processed and incorrect data will be rectified or deleted [ ] . the lgpd [ ] , ensures that the data owner will have access to all necessary information clearly, accurately, and easy access to data processing [ ] . the transparency principle of gdpr is divided into three words, lawfulness, fairness, and transparency [ ] [ ]. the lawfulness or bidding is concerned, data controllers should comply with the regulation, on fairness or loyalty, it is stated that processing should take place fairly with the consent of the data owner, and on transparency, the data controller will allow him to have access to all information of the data processing [ ] . the lgpd [ ] , will use techniques for the protection of personal data from unauthorized access or accidental or illicit situations of alteration, destruction, loss, dissemination, and communication [ ] . the principle that about security in gdpr is the integrity principle and confidentiality [ ], the data must be stored securely, guaranteeing the data integrity, and adopting methods of protection against unauthorized processing, loss, accidental damage, destruction, or unauthorized access [ ] . it will use methods to prevent data processing damaging [ ] [ ]. data may not be processed for discrimination, illicit or abusive purposes [ ] . in brazilian law [ ] , it is up to the treatment agent to prove the purpose and which effective methods have been adopted, and he must be able to prove compliance with and enforcement of personal data protection rules, including the effectiveness of these methods [ ] . in the european regulation, the accountability principle [ ] , which is the full responsibility of the data processing agent, guarantees the length of the purpose of the processing and has evidence of the necessity of the processing [ ] . the lgpd [ ] , sanctioned in brazil, was inspired by gdpr of the eu [ ] and contains many similarities in its respective principles. in its art. they show the foundations (figure ), that served as a basis for the development of the law [ ] : art. the discipline of personal data protection based on the according to fundamentals [ ] : i -respect for privacy [ ] . ii -informative self-determination [ ] . iii -freedom of expression, information, communication, and opinion [ ] . iv -the inviolability of intimacy, honour, and image [ ] . v -economic and technological development and innovation [ ] . vi -free enterprise, free competition, and consumer protection [ ] . vii -human rights, free development of personality, dignity, and the exercise of citizenship by natural persons [ ] . 
one of the most important points that makes data processing possible is to have the consent of the data holder, according to article , clauses [ ] , and in article [ ] it is reinforced that the authorization must be in writing or by other means of manifestation of the owner will, and it is stated from paragraph [ ] that in case the authorization is in writing it must be highlighted in the contractual clauses. in the case of processing sensitive personal data, article , clauses [ ] , states that with the consent of the holder, and article , paragraph [ ] , which states must have the consent of parents or legal guardians concerning the processing of personal data of children and adolescents. in gdpr [ ] it is also explicit that for any activities that require a data processing must have the consent of the data holder, in article , paragraph , clauses a [ ] , it says that data processing will be lawful upon the consent of the data holder for specified purposes previously informed to the data holder, in article [ ] which sets out the conditions applicable to consent, it says that the data processing agent must prove that the data holder has agreed to the specified purposes. as regards the processing of data on children, article of the european regulation [ ] requires the person legally responsible for the child under the age to consent to the processing. the state may be responsible for giving consent if the child is under the age of and has no family members to answer for him or her. without a reference law for the use of personal data, the possibility of abuse in the collection and use of personal data is increased, as well as the encouragement of several other non-specialized bodies to issue their opinions regarding the use of data, which causes great confusion. this is the case, for example, of inspections and inspections by the public prosecutor's office and consumer protection agencies, the issuance of opinions by regulatory agencies, or even judicial decisions based on various sparse legal provisions [ ] [ ] that seek to define parameters for the processing of personal data. in the graph (figure ), shows the level of interest in internet users searches on the terms lgpd and gdpr over a twelve-month period, where the term lgpd represents the blue line and the term gdpr represents the red line, on the horizontal axis represents the time and on the vertical axis represents the level of search made on the terms, these levels are represented by the numbers (very low), (low), (average), (high) and (very high). this simple analysis shows that the red line had several peaks in some periods, this because the gdpr since [ ] is approved and had an adequacy period of two years and the level of interest is between average and very high, the blue line has had small peaks, this because the lgpd is a new subject in brazil and this makes the level of interest is between very low and average. in general, users looking for lgpd and gdpr terms are professionals in the juridical environment or information technology. it should be noted that both the brazilian law and the european regulation, they provide a guide for data processing and what procedures companies should take to comply with the law if these principles are not followed these companies will be at serious legal risk. 
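the search-interest comparison in the graph described above can also be retrieved programmatically. the sketch below uses pytrends, an unofficial python client for google trends; it is only an illustration of how such a comparison could be pulled, not the tool used to produce the original figure, and the locale and timeframe values are assumptions.

    from pytrends.request import TrendReq

    # open an (unofficial) google trends session
    pytrends = TrendReq(hl="en-US", tz=0)

    # worldwide relative search interest in the two terms over the last twelve months
    pytrends.build_payload(kw_list=["LGPD", "GDPR"], timeframe="today 12-m")
    interest = pytrends.interest_over_time()

    # google scales values from 0 (very low) to 100 (peak interest)
    print(interest[["LGPD", "GDPR"]].tail())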
an example of noncompliance with the law was the cambridge analytica scandal [ ] , which misused data from million facebook users (figure ), manipulated the data without the consent of the data holders, to help win donald trump's us presidential campaign, and for the british to vote to leave the european union, both in [ ] , facebook was asked about data security. in brazil, there have been several cases of data loss, such as the case of the netshoes website, according to the coordinator of the commission for personal data protection, prosecutor frederico meinberg, "this is one of the largest security incidents recorded in brazil" [ ] , which because of the data leak could put the integrity of , , users at risk if the leaked data fell into the wrong hands. the impact that data leaks go far beyond the financial losses, the exposure of each citizens' information can be irreversible damage that becomes impossible to measure the size of the loss. without an information security policy, it can cause serious problems such as the invasion of vital systems to steal tax returns, data, making illegal financial transfers, interrupting the strategic operations of a company, or the government. another case about data leakage was written by liliane nakagawa and published on the website olhar digital [ ] , which displays the news about the banking institution, specifically the bank of brazil provident fund [ ] . according to nakagawa, data leak that reaches thousand clients -official number of registered in the bb previdencia platform, according to bank of brazil. the source, who identified the security gap, stated that through the private pension system, aimed at companies and public agencies, it is possible to have access to all personal data of participants and, from breaking, editing and registering beneficiaries, all in the name of the registered person himself [ ] . after this news, several headlines were reporting the incident, the exame magazine published on its website, "bb previdencia website leak exposes data of thousand clients" [ ] , the newspaper, o estado de s. paulo, published on its website, "security sheet on bb previdencia website exposes client data" [ ] (figure ). for these leaks not to occur, companies must have a data protection officer -dpo [ ] [ ], where the primary function is to ensure that the organization processes the personal data of their employees, their customers, their suppliers or any other individuals securely and reliably according to the data protection rules of law [ ] . an lgpd will give the right to protection of the personal data of the respective holders and will give guidelines to the companies on how the treatment should be done. brazil will be adapting to gpdr and will move the job market for data protection specialists. however, brazil [ ] already has a law for the creation of a supervisory body to verify whether companies comply with the lgpd, but directors have not yet been appointed to the national data protection authority -anpd and the national council for personal data protection and privacy [ ] . the european body responsible for supervising undertakings on whether they comply with the european regulation is the european data protection supervisor -edps, an independent supervisory authority established according to eu regulation / , and its task is to ensure that the fundamental rights and freedoms of individuals -in particular their privacy -are respected when eu institutions and bodies process personal data. 
in the world, there are already some countries [ ] outside the eu that have a regulation regarding data protection. on the european commission's website, it informs countries that are at an appropriate level to the regulation, the european commission has recognized andorra, argentina [ ] , canada (trade organizations), faeroe islands, guernsey, israel, isle of man, jersey, new zealand, switzerland, uruguay [ ] and the united states of america (limited to the privacy shield framework) as providing adequate protection [ ] . through the internet civil framework, which establishes rights and duties, guarantees and principles for the use of the internet in brazil, it does not guarantee data protection and privacy in a well-structured, complete and comprehensive manner, nor is a general regulation on the protection of personal data, and its provisions on data protection not protective in nature. some of the challenges identified for implementing the law in brazil are legal adjustments and appropriate training, a complete action plan for companies to comply with lgpd, specialized implementation of personal data governance processes, information security technologies, educating brazilian society about this law by showing the rights and duties of citizens. therefore, there will still be a lot of debates and discussions about lgpd and whether it will adhere to gdpr, and how brazil will behave with the law when it becomes effective. estabelece princípios, garantias, direitos e deveres para o uso da internet no brasil, marco civil da internet dispõe sobre a proteção de dados pessoais e altera a lei nº . , de de abril de (marco civil da internet) lei nº . , de de julho de . regulamenta o inciso xii, parte final, do art. ° da constituição federal dispõe sobre a proteção do consumidor e dá outras providências. diário oficial da união, de setembro de medida provisória nº , de de abril de . regras para o auxílio emergencial e adiamento da vigência da lgpd vacatio legis -senado notícias altera a lei nº . , de de agosto de . para dispor sobre a proteção de dados pessoais e para criar a autoridade nacional de proteção de dados; e dá outras providências. diário oficial da união dispõe sobre a proteção de dados pessoais e altera a lei nº . , de abril de ue) / do parlamento europeu e do conselho, de de abril de , relativo à proteção das pessoas singulares no que diz respeito ao tratamento de dados pessoais e à livre circulação desses dados e que revoga a diretiva / /ce (regulamento geral sobre a proteção de dados) directiva / /ce do parlamento europeu e do conselho de de outubro de relativa à proteção das pessoas singulares no que diz respeito ao tratamento de dados pessoais e à livre circulação desse dado. luxemburgo unlocking the eu general data protection regulation setor de tecnologias educacionais -seted cambridge analytica teve acesso à milhões de contas an update on our plans to restrict data access on facebook netshoes deverá procurar milhões de clientes afetados por vazamento, diz mp previdência privada do banco do brasil vaza dados de mil clientes vazamento de site da bb previdência expõe dados de mil clientes falha de segurança em site da bb previdência expõe dados de clientes -economia -estadão quais os princípios do gdpr e seu impacto no brasil? 
proteção de dados na américa latina european commission -european commission google trends -c trends tools has experience in computer science, with emphasis on information systems, working mainly on the following topics: artificial intelligence, paraconsistent analysis network, paraconsistent logic, industry . , and artificial neurons brazil. i also received the doctor degree and livre-docente title from the same university. he is currently the coordinator of the logic area of institute of advanced studies -university of sao paulo brazil and full professor at paulista university -brazil. his research interest topics include paraconsistent annotated logics and ai, ann in biomedicine, and automation, among others. he is a senior member of ieee master's degree in production engineering in the area of artificial intelligence applied to software paraconsistent measurement software, post-undergraduate degree in ead, university professor, general coordinator of it course and campus assistant it consultant and/or roles: it director, commercial director, project manager, with clients: wci-mahlerti has experience in science and technology, with emphasis on information technology post-graduate -teaching for higher education falc -faculty aldeia de carapicuíba post-graduate -specialization in business intelligence faculty impacta de tecnologia -fit, itil expert certification; iso certification; privacy data certification; dpo privacy and data protection foundation certification this study was financed in part by the coordenação de aperfeiçoamento de pessoal de nível superior -brasil (capes) -finance code . key: cord- -qtrquuvv authors: wu, tianzhi; ge, xijin; yu, guangchuang; hu, erqiang title: open-source analytics tools for studying the covid- coronavirus outbreak date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: qtrquuvv to provide convenient access to epidemiological data on the coronavirus outbreak, we developed an r package, ncov (https://github.com/guangchuangyu/ncov ). besides detailed real-time statistics, it offers access to three data sources with detailed daily statistics from december , , for countries and more than chinese cities. we also developed a web app (http://www.bcloud.org/e/) with interactive plots and simple time-series forecasts. these analytics tools could be useful in informing the public and studying how this and similar viruses spread in populous countries. as demonstrated in suppl. doc. , this new package also contains functionalities to facilitate data visualization. for example, with one command, users can easily plot the distribution of cases on the maps of the world, china, and even individual provinces ( figure ). with historical data, we can incorporate temporal and spatial information to create an animation to help us understand disease transmission and examine the spread of the covid- outbreak. to enable users to access these datasets without coding, we also developed interactive web apps in both english [ ] and chinese [ ] . as demonstrated in supp. doc. , these apps can also be run locally from rstudio. using these apps, users can gain insights by quickly generating all plots in supp. doc. based on daily updated data. complementing the dashboard by dong et al. [ ] , our web app enables users to select their regions of interest and check both the historical and real-time data. generated by the app on february , , figure shows that the total confirmed cases in the provinces outside hubei are stabilizing, following a similar trend. 
the extreme measures that the chinese government took since january seem to be working. built with the rstudio shiny framework, these apps contain a simple forecast module. we first converted the log-transformed numbers of cases or deaths into time-series data, then used the exponential smoothing method (ets) in the r package forecast [ ] with default settings to forecast the total cases. on february , , this simple model predicted that the death toll would reach in ten days, a staggering number at the time that later materialized, unfortunately. we also converted the raw numbers of cases into percent daily changes and conducted a similar forecast. interestingly, the daily percent changes in both confirmed cases and deaths in china are decreasing linearly, except for a few outliers (see figures and in supplementary document ).

even though not all data sources are official statistics, this kind of detailed data offers a unique opportunity to study this novel pathogen. the hundreds of cities could even be considered as semi-independent outbreaks, as many of them are far from the epicenter and have been effectively on lockdown from the end of january . as shown in figures and in supp. doc. , the death rate in wuhan, estimated by dividing current total deaths by total confirmed cases, is . %. probably due to an overwhelmed healthcare system, this death rate is higher than the average of . % ( % confidence interval [ . % - . %]) observed in chinese cities with or more confirmed cases. cities in hubei province have higher fatality rates than cities in other regions (figure in supp. doc. ). internationally, the death rate in japan ( . %) is close to that of italy ( . %), lower than the . % observed in china overall (figure in supp. doc. ). the death rate in iran is . %, probably due to underreported cases.

the rapid, exponential growth phase in china spans roughly from january to february , , when the number of confirmed cases skyrocketed -fold from to , . such rapid growth is now evident in south korea, italy, and iran (figure ). other countries with a smaller number of cases but showing a sharp upward trend include germany, spain, and france. if not managed well, tens of thousands of cases in each of these and other countries could be possible within weeks. public health officials need to grasp the power of exponential growth.

currently, city-level historical data is only available for china. these data sources occasionally change data formats, which requires us to monitor them continuously.

supplementary document : detailed tutorial and example of how to use the r package. supplementary document : example of plots obtained from our web app.

figure . countries with rapidly growing covid- cases. this plot is obtained using our interactive app.
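the forecast module itself is written in r around the forecast package; purely as an illustration of the same idea, the python sketch below fits an exponential smoothing model to log-transformed cumulative counts and projects them ten days ahead. the counts are made-up toy values and the additive-trend setting is an assumption, not the authors' exact ets configuration.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    # toy cumulative death counts by day (illustrative values only)
    counts = pd.Series(
        [100, 130, 170, 220, 280, 360, 450, 560, 700, 850],
        index=pd.date_range("2020-02-01", periods=10, freq="D"),
    )

    # mirror the log-transform described above before fitting
    log_counts = np.log(counts)

    # additive-trend exponential smoothing as a stand-in for ets with default settings
    fitted = ExponentialSmoothing(log_counts, trend="add").fit()

    # project ten days ahead and back-transform to the original scale
    forecast = np.exp(fitted.forecast(10))
    print(forecast.round(0))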
if the apis stop providing data, the real-time data would not be updated, but the historical data will remain accessible for researchers. we will maintain the web apps during this outbreak.

our ncov package reduces the barrier for researchers and public health officials in obtaining comprehensive, up-to-date data on this ongoing outbreak. with this package, epidemiologists and other scientists can directly access data from four sources, facilitating mathematical modeling and forecasting of the covid- outbreak. the interactive web apps are accessible to the general public and could also be easily customized by researchers to produce other dashboards or track other countries. we hope these analytics tools could be useful in studying and managing this pathogen on a global scale.

conflicts of interest: none.

a novel coronavirus outbreak of global health concern
a novel coronavirus from patients with pneumonia in china
an interactive web-based dashboard to track covid- in real time
ncov : an r package for accessing coronavirus statistics
real-time tracking of the coronavirus infection
real-time data on the novel coronavirus
daily statistics of -ncov
clinical features of patients infected with novel coronavirus in wuhan
coronavirus covid- outbreak statistics and forecast
coronavirus covid- outbreak statistics and forecast
forecasting with exponential smoothing: the state space approach. springer series in statistics

key: cord- -zl romvl authors: bowe, emily; simmons, erin; mattern, shannon title: learning from lines: critical covid data visualizations and the quarantine quotidian date: - - journal: big data soc doi: . / sha: doc_id: cord_uid: zl romvl

in response to the ubiquitous graphs and maps of covid- , artists, designers, data scientists, and public health officials are teaming up to create counter-plots and subaltern maps of the pandemic. in this intervention, we describe the various functions served by these projects. first, they offer tutorials and tools for both dataviz practitioners and their publics to encourage critical thinking about how covid- data is sourced and modeled, and to consider which subjects are not interpellated in those data sets, and why not. second, they demonstrate how the pandemic's spatial logics inscribe themselves in our immediate material landscapes. and third, they remind us of our capacity to personalize and participate in the creation of meaningful covid visualizations, many of which represent other scales and dimensions of the pandemic, especially the quarantine quotidian. together, the official maps and counter-plots acknowledge that the pandemic plays out differently across different scales: covid- is about global supply chains and infection counts and tv ratings for presidential press conferences, but it is also about local dynamics and neighborhood mutual aid networks and personal geographies of mitigation and care.

the pervasive yet evasive nature of a pandemic, the widespread destruction of bodies and economies wrought by an invisible, submicroscopic agent, has inspired many attempts to visualize its presence and to track its spread for the purposes of containment. john snow's map of london's cholera outbreak, celebrated as a breakthrough in epidemiological cartography, has itself gone viral, inspiring a host of visualizations and maps of covid- (hempel, ; johnson, ; koch, ).
research centers and data scientists have launched dashboards and observatories (danielson, ; patel, ) . the widespread availability of consumer-friendly mapping platforms and open data repositories has equipped cartographers and information designers to plot their own charts and graphs-some of which then circulate on social media or appear on slide shows at official public health briefings (bazzaz, ; mattern, a; "triplet kids," ) . meanwhile, data journalists have sought to break through the columnar layout of the printed page, or to exploit the interactive affordances of the screen, to reveal data in their dynamism, projecting future global epidemiological scenarios and spotlighting hyper-local impacts (campolo, ; flowing data, ; heller, ) . as media scholar alexander campolo writes: "it is both understandable and desirable that expert modelers have worked quickly to produce simulations. . . .however, there is also danger in the uncritical circulation of decontextualized visualizations or headline statistics." one particular danger is that these visualizations "interpellate subjects as data points," driving individual behavior and shaping policy, scripting our present understandings, and modeling future norms. the map becomes the territory; the projection incites a course of action that can lead to its own realization. in response to these ubiquitous graphics, so often reflexively reified and retweeted, artists, designers, data scientists, and public health officials are teaming up to create counterplots and subaltern maps. they are providing tutorials and tools to both dataviz practitioners and their publics to encourage everyone to think more critically about how covid- data is sourced and modeled and even manipulated-and to consider which subjects are not interpellated in those data sets, and why not (taylor, ) . these projects also remind human subjects that they are more than mere data points. people have the capacity to personalize and participate in the creation of meaningful covid visualizations-many of which represent other scales and dimensions of the pandemic, especially the quarantine quotidian. covid's counter-mappers prompt their publics, whose attention is often trained on "flattening the curve," to also look behind or under the curve, to critically assess the making of covid- visualizations, and to plot curves and charts and maps of their own. together, the official maps and counterplots acknowledge that the pandemic plays out differently across different scales: covid- is about global supply chains and infection counts and tv ratings for presidential press conferences, but it is also about local dynamics and neighborhood mutual aid networks and personal geographies of mitigation and care. pandemic data practice covid- presents a state of global emergency where new data are being released at an almost hourly rate, and more people than ever have access to the raw data themselves (johns hopkins coronavirus resource center, ; the covid tracking project, a). this greater accessibility and public data consciousness (thanks to the rise of sites like fivethirtyeight and scholarship in critical data studies) imply a simultaneous need for a jedi code of sorts: rather than reifying the final graphics, we must question where our data come from and how they can be analyzed and represented (d'ignazio and klein, ; loukissas, ; noble, ) . with covid- data and the veritable flood of graphics being produced, we have a chance to strengthen our collective data literacy. 
by understanding the origins of epidemiological data and reinforcing the importance of context and non-quantitative forms of data, we can push for a richer, more diverse discourse through data visualization. a variety of tools help readers and viewers understand how to read covid data visualizations, and what goes into making one. nightingale: the journal of the data visualization society published "covid- data literacy is for everyone," a webcomic that helps readers ask questions about "data's back story" and better understand graphical representation by highlighting perspectives of professionals (alberda et al., ; see also bronner et al., b) . nicky case and marcel salath e's ( ) covid data simulations, illustrated in figure , explain the mechanics of epidemiological models and their impact on future covid- -related policy through interactive charts. "why it's so freaking hard to make a good covid- model" explains the variables and uncertainty embedded within any existing mathematical model due to differences in data entry, testing, and demographics (bronner et al., a) . there is also a great deal of "map critique" in mainstream media, where pundits debate the virtues of various renderings and designers demonstrate alternatives. one can learn through twitterthreads about the important differences between showing absolute and relative counts on a map, the challenges of making exponential growth legible to the public, and the importance of keeping published covid- maps and data up to date (paschal, ; peck, ) . but the data professionals need to think critically about these same things, too. feminist data practices and critical data and design studies remind practitioners of the importance of considering how data create subjectivities for the people they represent and how they empower or disempower those subjects, as well as related ethical and political issues (costanza-chock, ; d'ignazio and klein, ; loukissas, ) . pairing epidemiologists and healthcare organizations with data professionals, as the data visualization society's covid- matchmaking proj ect does, offers a model for thinking about how technical skills can be used in ways that are most relevant and context-aware (data visualization society, ). and those contextual cues might call for "not publishing your visualizations in the public domain at all," amanda makulec argues ( ). instead of making a map or a chart, technologists could use their skills to make the data more accessible. one example of this is the accessible covid statistics project, which uses machine-readable html to make covid- data intelligible to screen readers (littlefield, ) . making available datasets more inclusive also means working to better understand how different populations are affected by their representation (or lack thereof) in the data themselves. making known the limits of our data, especially in its collection, can help uncover the ways they exclude important narratives (onuoha, ) . such critical reflexivity has long been recommended as best practice, but the pandemic has reinforced the importance of asking questions about our ability to disaggregate data by slicing them into segments by race, age, sex, etc. it is impossible to report data for covid by race, for example, if the source data shows only raw counts by zip code, which is often the case. 
this is why the work being done by data for black lives and the covid racial data tracker (see figure ) to collect confirmed covid case data by race is vitally important (data for black lives, ; the covid tracking project, b). it is in prioritizing these types of disaggregation practices that we start to understand more clearly the different empirical, trackable, datafiable realities that exist in the face of this virus (see also kendi, ; the urban systems lab, ) . similarly, more focused studies of singular locations help to unmask the ways that the spread of the virus concentrates in places like homeless shelters, meatpacking plants, detention centers, and prisons (ellis, ; food environment reporting network, ; molteni, ; stewart, ; trovall, ; ura, ) . projects like covid- behind bars push us to ask questions about the ways that using a single number to represent a geography such as a city, state, or country obfuscates variability within that area, hiding "hot spots" that push up the overall numbers (ucla school of law, ). as figure shows, the marshall project's ( ) coronavirus in prisons visualizes the continually-updated data and allows a reader to filter by state to contextualize the numbers of cases, deaths, and tests in a particular prison system compared to the state's broader population. continued progress in visualizing the effects of the pandemic requires that we rely not just on "objective" data, but that we also pay attention to its spatial granularity, and to the environments that give context to those data. designers and artists and media-makers have also helped us recognize how our everyday environments are themselves indexing covid's presence. the pandemic has rendered itself visible, audible, and tangible in our material landscapes, transforming those spaces themselves into environmental data. we just have to train ourselves how to look and listen, diagnostically and forensically, both locally and at a distance (mccullough, ; weizman, ) . shelter-in-place orders around the globe have orchestrated a new acoustic universe, and several projects-including cities and memory's global #stayhomesounds map and daniel drew's "quarantine supercut" collaboration with the creative independent and kickstarter- capture the sounds of quarantine's boredom and domesticity, its impatience and absurdity, its isolation and fear, and intimate sociality (mattern b; quarantine supercut, ; #stayhomesounds, ) . in drew's piece, stitched together from roughly international submissions, we hear coughs and distant church bells and public address announcements reminding folks to wear face masks out of doors (see also nakagawa, ) . the hushed city has made it easier for people to hear their avian neighbors, so creative technologist jer thorp drew on an open bird sound database to make a quarantine game, birb, that allows users to practice their birdsong identifications (birb, ; greene, ; thorp, ; xeno-canto, n.d.) . meanwhile, the new york public library and mother new york's "missing sounds of new york" compilation reminds listeners of the city spots that are temporarily on mute, but are waiting for them in the post-pandemic world: places like music clubs, crowded parks, and baseball games (nypl staff, ). all of these projects allow listeners to hear their convalescing cities from a distance, through aggregated data sets. photographs, too, function as passive data visualizations in revealing how the virus reshapes spatial orders. 
drone images show caskets of unclaimed covid victims lined up for burial in the potter's field on new york city's hart island (rosen, ) . we see how the virus's violence inscribes itself into the landscape. in san antonio, tx, aerial photographs capture rows of cars lined up to claim emergency aid from the city's food bank (orsborn et al., ) . in the washington post, stitched-together street view images and interior photographs reveal the effect of unemployment on a single block of connecticut avenue, a stretch densely populated with small businesses (pecanha, ) . here, photojournalism doubles as cartography and serves to localize and humanize the bureau of labor statistics' line graphs of unemployment. abstracted graphs and human narratives converge in a widely circulated stock photograph that shows laborers eating their lunch at a car factory in wuhan, china. clad in a grey uniform, perched atop a bright-red stool, each body inhabits a node within a vast, anonymizing grid of social distance (juo, ) . the grid itself is an indexical artifact of the pandemic: we see impromptu lattices and ad hoc hash marks charting out six-foot geographies in grocery store checkout lines and at public parks (see figure ) . the victoria and albert museum is chronicling covid's myriad everyday artifacts: material embodiments, like the grid, of the virus's operational logics and affects. among the growing pandemic objects ( ) collection are hand-made signs-for businesses to announce their temporary closure, for neighbors to express community solidarity-as well as jury-rigged protective door handles, toilet paper, cardboard packaging, and flour and yeast for novice home bakers (wainwright, ) . this collection of analog data reminds visitors of how the pandemic marks its presence in the increased prevalence and value of humble materials (see figure ). more personal, intimate, and participatory data projects have focused on covid- data as tools for reflection. these engagements with the data of crisis draw out human (and sometimes non-human) stories and interactions and ask users to situate themselves within the overwhelming global narrative of emergency. not least, in a time of anxiety and uncertainty, these reflective projects provide a much-needed venue for play, humor, and affective response. some, like the illustrations of data journalist mona chalabi, humanize the statistics of covid- by rendering them in handmade drawings. chalabi takes potentially intimidating scientific jargon and translates it into the visual language of the everyday. figure (a) shows her most popular covid- visualization, "know the symptoms of coronavirus," which has been translated into a dozen languages after a call on chalabi's instagram yielded volunteer translators (chalabi, a). an outstretched hand in figure (b) lends a personal perspective to six feet of distance, while figure (c), "how new york is changing (according to calls*)," reframes the virus as the daily mundane: noisy upstairs neighbors, anxiety from whirring helicopters, and feeling % less interested in graffiti (chalabi, b, c). work from illustrator and graphic journalist wendy macnaughton and infectious disease specialist dr. eliah aranoff-spencer ( ) (see figure ) guides the viewer through a flowchart answering the question "what should i do?," combining tongue-in-cheek responses ("are you freaking out?") with useful information on symptoms and exposure.
these visualizations feel personal and engaging, and they bring perspective to the vastness of the pandemic present (see also asian american feminist collective, ; kuo, ) . the gamifications of a cheeky flowchart and work like nathan yau's "toilet paper calculator" make this impossible situation feel more palpable, and perhaps also more surmountable (yau, ) . that feeling of control, of empowerment even, characterizes a slew of new data projects that ask users to become co-creators, observing their worlds and contributing their experiences to a collective understanding of life during covid- (see also bliss and martin, ; detroit cultural crisis survey, ) . participation in these communal pieces has a low bar for entry, as with the incredibly figure . (a) "know the symptoms of coronavirus," (b) "distancing," and (c) "how new york is changing (according to calls*)". source: reproduced with permission from chalabi ( a chalabi ( , b chalabi ( , c shareable images generated through "wash your lyrics". the create-your-own public service announcement lightened the mood of early march (see the state of new jersey's contribution), but it also matched the virus in its virality, creating a world-wide network of conscientious hand-washers, singing in solidarity (new jersey, ; william, ) . video clips of trees performing essential labor, traces of avian pathways, and the microbiomes of windowsills populate the environmental performance agency's (epa) "multispecies care survey" (environmental performance agency, ). the survey consists of six protocols, beginning with a "temperature check" (shown in figure ) that involves pressing your skin to a window, then building to engagements with the outside, like having a conversation with a nearby tree. in undertaking these protocols, participants archive their own lockdown environment and its more-than-human population. engaging with the survey provides some catharsis in knowing that elsewhere others are making the same examinations. in addition to making space for affective responses, these projects also give participants an activity-an exercise that takes up time and breaks the tedium of quarantine. the epa's archive includes children drawing and parents happy to see them briefly entertained; "wash your lyrics" becomes an endless twitter scroll where you can lose hours singing and laughing (and practicing good hygiene). it is not always possible, in the current climate, to be in community or find the energy to create with and for others. and that is fine. while many of the examples we have described here feature an element of data's public performance, there are numerous covid- responses that offer templates for those who wish to engage with the data of crisis through introspection. these projects also create an imagined community-a dispersed public engaged in the same activity, yet the activity remains unshared and personal. this form of personal "data processing" is just as valuable in the time of covid- , when multiscalar struggles are waged even at a microscopic level within our individual bodies. one such response, which pre-existed covid- but has found renewed meaning in the present, is giorgia lupi and pentagram's (n.d.) "mapping ourselves" activities. the project offers a mechanism for visualizing personal reflections and tools for plotting the networks of care that surround us all. recorded through guided drawing exercises, "mapping ourselves" can serve as a source of mindfulness while highlighting connections with those around you. 
while lupi's mapping activities create interactions with data that are structured and codified, pilobolus and map design lab's ( ) "you dance, we dance" is a covid- response predicated on the imprecision and freedom of movement (see figure ). combining d renderings of pilobolus dancers with simple dance challenges to complete in your home, "you dance, we dance" correlates the sometimes-paralyzing emotions of pandemic existence with bursts of choreography. "calm" suggests slowly inflating your body like a balloon while the rendering of an other-worldly orange dancer blooms like a flower. "brave" sees two d models trust each other in a series of balances; you stand tall and strong on one leg for a minute. this form of data visceralization brings data into the physical and experiential realms, while also making it more immediate, almost "realtime" (dobson, ; see also d'ignazio and klein, ) . this is a private practice, done in concert with others. we could say the same of quarantine-or of covid- infection itself: we shelter or suffer in private, knowing that our sacrifices and sorrows are shared with an international community. yet there is danger in subsuming these individual experiences under a global curve. as the projects above demonstrate, both the creators and consumers of our coronavirus maps and graphs need to attend to irregularities in demographics and geographic distribution, and to consider the politics of who or what is and is not represented in the standard datasets. and those of us who regularly consult covid- heatmaps from the security of our living rooms, hoping to see an ever-flatter curve, should also recognize that those data, while seemingly distant in their abstraction, actually index themselves in our immediate material environments, inscribing their spatial logics in our grocery stores and sidewalks. we can even bring those data into our homes and personal lives, performing them, contributing to their creation, reminding ourselves and others that data is always embodied and local and present. the author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. the author(s) received no financial support for the research, authorship, and/or publication of this article. emily bowe https://orcid.org/ - - - shannon mattern https://orcid.org/ - - - covid- data literacy is for everyone. nightingale: the journal of the data visualization society. available at: www.medium.com/nightingale/covid- -data-literacy-isfor-everyone- b cec asian american feminist collective ( ) care in the time of coronavirus avi schiffmann, the washington state teen behind a coronavirus website with millions of views a d rendering showing two pilobolus dancers moving to the prompt "brave available at: www.seattletimes. com/seattle-news/education/qa-avi-schiffmann-the-wash ington-state-teen-behind-a-coronavirus-website-with-mil lions-of-views your maps of life under lockdown. citylab, april available at: www.fiveth irtyeight.com/features/why-its-so-freaking-hard-to-makea-good-covid- -model available at: www.fiveth irtyeight.com/features/a-comic-strip-tour-of-the-wildworld-of-pandemic-modeling flattening the curve: visualization and pandemic knowledge. formations, april what happens next? covid- futures, explained with playable simulations if you think you might be sick and you can stay at home, then stay the fuck at home. instagram, march. 
available at: www.instagram available at: www.insta gram available at: www.instagram.com/p/ b-uypcbl db design justice: community-led practices to build the worlds we need notable maps visualizing covid- and surrounding impacts data for black lives ( ) d bl covid- disparities tracker data visualization society ( ) partnering health and data expertise for covid- . available at: www.datavisualiza tionsociety.com/covid- detroit cultural crisis survey ( ) detroit cultural crisis survey. six feet of distance. available at: www.sixfeetof distance.org/detroit-cultural-crisis data feminism available at: www.brown.edu/academics/music/ events/colloquium-machine-therapy-subtle-machines-anddata-visceralization environmental performance agency ( ) multispecies care survey: protocol available at: www.wired. com/story/coronavirus-covid- -homeless available at: www. flowingdata.com/tag/coronavirus available at: www.thefern.org/ / /mapping-covid- -in-meat-and-food-processing-plants do those birds sound louder to you the daily heller; covid in real times available at: www.printmag.com/daily-heller/ covid- -new-york-times-front-pages the atlas of disease: mapping deadly epidemics and contagion, from the plague to the zika virus johns hopkins coronavirus resource center ( ) covid- dashboard by the center for systems science and engineering (csse) at johns hopkins university (jhu). available at the ghost map: the story of london's most terrifying epidemic -and how it changed science, cities, and the modern world china's economy shrinks as coronavirus hits world trade. the guardian what the racial data show. the atlantic available at: www cartographies of disease: maps, mapping, and medicine my wheel of worry accessible covid- statistics tracker all data are local pause from drawing dinosaurs to announce, version is here! instagram, april. available at: www.instagram march) ten considerations before you create another chart about covid- . nightingale: the journal of the data visualization society available at: www.artnews.com/artin-america/features/andrew-cuomo-covid-briefings-power point-slideshow-authority- available at: www. placesjournal.org/article/urban-auscultation-or-perceiv ing-the-action-of-the-heart ambient commons: attention in the age of embodied information available at: www. wired.com/story/why-meatpacking-plants-have-becomecovid- -hot-spots social distancing, haiku and you. alan nakagawa. available at: www but above all this, i wish you wash your hands. twitter, march. available at: www algorithms of oppression the library of missing datasets. available at: www.mimionuoha.com/the-library-of-missing-data sets available at: www.expressnews.com/news/local/article/thousands-h it-hard-by-coronavirus-pandemic-s- pandemic objects ( ) v&a blog. available at: www.vam. ac.uk/blog/pandemic-objects here is what i have to say about these two maps. twitter, april. available at: www the best, the worst, of the coronavirus dashboards this is how mass unemployment looks on the ground as mentioned in attached thread, i see communicating exponential growth (virus spread) as one of the key challenges to helping people understand #covid . twitter, march. available at: www pentagram (n.d) mapping ourselves. pentagram. available at: www.pentagram.com/work/mapping-ourselves/story available at: www.youdancewedance.org quarantine supercut ( ) the creative independent. available at: www how covid- has forced us to look at the unthinkable cities and memory. 
available at: www.citiesandmemory.com/covid -sounds it's a time bomb': die as virus hits packed homeless shelters. the new york times available at www.govtech.com/em/safety/reassigned-florida-manager-told-to-delete-coronavirus-data-.html the covid tracking project ( a) most recent data. available at: www.covidtracking.com/data the covid tracking project ( b) covid racial data tracker. available at: www.covidtracking.com/race may) a state-by-state look at coronavirus in prisons. available at: www.themarshallpro ject.org/ / / /a-state-by-state-look-at-coronavirusin-prisons the urban systems lab ( ) social vulnerability, covid- and climate. available at: www.urbansystemslab.com/ covid available at: www.twitter.com/blprnt/status/ ?s= triplet kids 'co-author' covid- study with psychologist dad texas covid- cases in immigrant detention quadruple in two weeks, as ice transfers continue available at: www ucla school of law ( ) covid- behind bars data project hkpvtk/edit?usp=sharing texas investigating meat processing plants over coronavirus outbreaks museum of covid- : the story of the crisis told through everyday objects. the guardian forensic architecture: violence at the threshold of detectability wash your lyrics. available at: www.wash yourlyrics available at: www.xeno-canto.org toilet paper calculator. floating data. available at: www key: cord- - tkdwtc authors: zambetti, michela; khan, muztoba a.; pinto, roberto; wuest, thorsten title: enabling servitization by retrofitting legacy equipment for industry . applications: benefits and barriers for oems date: - - journal: procedia manufacturing doi: . /j.promfg. . . sha: doc_id: cord_uid: tkdwtc abstract in the wake of industry . , many industries have started to pivot towards digital, collaborative, and smart manufacturing systems by connecting their machinery as part of the internet of things (iot). iot has the potential to provide visibility and improve manufacturing systems through data collection, analysis, and subsequent actions based on insights generated from large amounts of manufacturing data. even though comparatively newer equipment come readily equipped with embedded sensors and industrial connectivity necessary to connect to the iot environment, there are many manufacturers (equipment users) who rely on long standing “legacy systems” that offer no or very limited connectivity. in this context, solutions mostly result in the development of low-cost retrofit or upgrade kits that allow integrating legacy equipment into industry . environment and thus enable digital servitization. servitization is a transformation journey that involves firms developing the capabilities they need to provide technical and data-driven services that supplement traditional product offerings. however, retrofitting solutions of legacy equipment rarely involve original equipment manufacturers (oems) who may otherwise leverage the opportunity to create and capture unique value by retrofitting and then provisioning data-driven value-added services for the manufacturers. hence, the primary objective of this paper is to identify and analyze the available literature on retrofitting and upgrading of the legacy equipment for industry . integration. in doing so, this study also investigates the potential opportunities and challenges of oems in supporting the industry . transition of legacy equipment in a servitization context. 
technological advances such as sensors systems, internet of things (iot), cloud computing, artificial intelligence (ai), and machine learning are fueling the fourth industrial revolution, also known as industry . [ ] . literature argues that industry . technologies will transform the paradigm of manufacturing and allow unprecedented levels of operational efficiency, overall equipment effectiveness (oee), and growth in productivity [ ] . the implementation of the industry . paradigm requires the integration of information and communications technology (ict), data science, and robotics to enable sharing of data on a massive scale among different industrial systems (machine tools) and devices that compose a smart manufacturing system. in case of comparatively newer industrial equipment, the machinery producers (oems) can leverage product connectivity to create and offer technical and data driven services based on the continuous data flow generated by various sensors within the equipment. even though new industrial equipment is designed to be industry . ready, there are still a large number of legacy systems operational in the industries. these systems are not capable to be integrated into the industry . environment [ ] [ ] . considering that the replacement of these legacy systems is not a feasible solution, a possible alternative is to develop low-cost retrofit or upgrade kits that
allow the integration of the legacy equipment into the industry . environment. nevertheless, given the investment risk and lack of technical expertise, it can be a very complex and nontrivial task to configure legacy equipment for the new industrial ecosystem. this challenge, however, provides the oems with an opportunity to create and capture unique value by upgrading and retrofitting the legacy equipment and then provisioning data-driven value-added services for the manufacturers (equipment users) [ ] . in general, services are often resistant to the economic cycles that drive equipment purchase and hence services have the potential to provide a more stable source of revenue with higher margins compared to equipment sales [ ] . oem's involvement in retrofitting activities and service provision makes even more sense given the increasing trend of manufacturing servitization. this trend describes manufacturing firms increasingly pivoting towards service-based value propositions in order to improve customer relationship and attain competitive advantage in the market [ ] [ ]. servitization is also becoming an effective strategy for oems to create new and resilient revenue streams from the installed base of long lifetime equipment [ ] . given the demand for technical and data-driven service offerings, oems are increasingly recognizing the potential of servitization. consequently, oems have started to invest in new technologies that foster connectivity and digital innovation [ ] . indeed, digital innovation enables different services, such as remote monitoring [ ] , big data [ ] , or predictive analytics [ ] . nevertheless, it seems that none of the previous studies have investigated the potential opportunities and challenges of oems in supporting the industry . transition of legacy equipment that do not have connectivity capabilities. in this context, the objective of this paper is twofold. first, we plan to identify and analyze the literature related to retrofitting and upgrading of legacy equipment for enabling iot connectivity. second, we plan to investigate the potential benefits and barriers of oems in retrofitting legacy equipment in the context of servitization. to this end, we first develop a broad search string containing all the important keywords and then perform a literature review in two scientific databases, i.e., web of science and scopus. this paper is structured as follows: section two details the literature review methodology. section three presents findings based on the synthesis of the relevant literature. in section four we put a special focus on the servitization potential and challenges of the oems in supporting the industry . transition by means of retrofitting legacy equipment and provisioning data-driven services.
this literature review consists of three main steps: (i) database and search string selection, (ii) screening literature for relevancy to retrofitting legacy equipment for industry . environment, and (iii) synthesis of the selected literature and generalization of the findings. in the first step, we selected web of science (wos) and scopus as our primary database due to their comprehensive coverage of the focus area and high quality of papers. following, we used the search string "(upgrad* or retrofit*) and ("industry . " or "internet of things" or "smart manufacturing" or "cyber physical system*" or "cloud manufacturing" or servitization or "smart service*" or "data-driven" or "product service system*")" as an initial screening criterion. this step resulted in papers from wos and papers from scopus. in the second step, after reading the titles and, if needed, the abstracts of the resulting papers, one of the authors identified papers from wos and a second author identified papers from scopus that are considered relevant based on their title and abstract. after removing duplicates, this step resulted in a total of unique papers. these papers were then carefully read in order to carry out a detailed content-based selection. during this step, only papers that focused on retrofitting solution were selected for further consideration in this paper. this step led to a final list of relevant papers as a basis for our study. in the third step, the authors separately summarized the key contents of each of the papers in a structured way. finally, the authors extensively discussed and refined the results, which allowed the authors to generalize the findings that are presented in the next section. based on the analysis of relevant papers, we were able to identify two distinct categories of research streams. the first stream of research focused on providing integration of various old equipment in the entire plant. the papers covering this domain mainly aimed at developing and implementing retrofits in order to provide connectivity and collect data from the existing set of legacy equipment within the plant to achieve a more transparent production system, which is a prerequisite for industry . implementation [ ] [ ] [ ] [ ] . the second stream of research focused on developing and implementing retrofit solutions for specific legacy equipment with the aim to support various data-driven applications such as output quality inspection, predictive maintenance, etc. [ ] [ ] [ ] . in this second category, the researchers also focused on the applications of data analytics tools for creating value for the manufacturer [ ] [ ] . considering the different processes manufactures followed to achieve industry . solutions, what clearly emerges is that, on the one side, there is a set of solutions aimed at connecting all relevant equipment within a plant. this can be argued as an "industry . push". on the other side, different implementations aimed at creating datadriven applications for a particular legacy equipment to exploit the collected data, which can be argued as a "need-based pull". fig. illustrates these two different approaches identified from the synthesis of the two research streams. even though the triggers for the decision of undertaking an industry . retrofitting project varied, the architecture underlying the proposed solutions remained similar. in majority of the cases, the authors asserted the need of sensors, connectivity, data storage, data processing, and analytics to support industry . 
applications [ ] . in some cases, the authors considered cybersecurity issues [ ] and also emphasized the necessity to create communication protocols that are compatible with the existing infrastructure and standards [ ] . alongside these projects, different competences interacted to reach the solutions, comprising industrial engineering, mechanical and mechatronics engineering, information technologies, and data science [ ] . moreover, understanding and leveraging standard algorithms and tools of data analytics were assumed very difficult for many manufactures. interpreting the collected data and creating value from them need expertise and a high level of know-how [ ] . different technical solutions exist for retrofitting machinery by leveraging low cost technology and open source components [ ] . we could divide the retrofitting approaches into three broad categories: • addition of smart sensors and/or edge gateways: in this approach, flexible and low-cost solutions are developed using iot sensors and smart gateway to gather data [ ] [ ] [ ] . gateways are also used alone in some cases to directly collect data from plc or other data source already present in the site. • retrofit kits: those solution are deployed by a third party provider as a complete package of sensors, connectivity, control, data analytics [ ] [ ]. • video cameras: another approach is the utilization of industrial cameras to monitor operation and capture data form both workforce and equipment [ ] . what emerges is that the difficult part in retrofitting and enabling industry . solution is not only related to the monetary cost or technology themselves, but also related to technological expertise and know-how on the convergence of various competencies and time required to build a solution suitable for the manufacturer-specific needs. one specific aim of this research was to understand the contributions of oems in supporting the industry . transition for legacy equipment in the context of servitization. to this end, we included keywords related to servitization in the initial query of the literature review. however, findings indicate that none of the papers included or even just referred to the (machine tool) oems, who sell the production equipment to the manufacturers of end products (users). several papers propose the adoption of ready-to-use toolkits that enables a manufacturer to connect its equipment within the plant to a server or the internet. these toolkits can be seen as a specific technology produced and sold regardless of the machinery type and its producer. even the papers that presented a single equipment oriented retrofit solution never mentioned the specific participation of an oem at any point during the development process of the solution. however, the oems are also leveraging industry . technologies to connect new products and, based on data analytics, offer new services and solutions. hence, it is logical to assume that the oems have the potential to provide additional value to the manufacturers in retrofitting solutions aimed at data collection and analysis. given the fact that the existing literature on the upgradability and retrofitting solution towards industry . do not include the oem and the service perspectives at this point, this research investigated oem's potential in providing connectivity and data analytics services to the manufacturers of end products. we argue that the "servitization-push" can be a new trigger in addition to industry . push and need-based pull. 
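as an aside to the retrofitting approaches summarized above, the sketch below illustrates the common sensor-to-gateway-to-broker data path in a minimal form. the broker address, topic layout, sampling rate and sensor driver are hypothetical placeholders rather than details taken from any of the cited solutions; the open-source paho-mqtt client is used only as one possible transport.

```python
# Illustrative edge-gateway loop for a retrofitted legacy machine: read one
# sensor value and publish it to an MQTT broker so the data become available
# for storage and analytics. All names and values below are assumptions.
import json
import random
import time

import paho.mqtt.client as mqtt

BROKER_HOST = "broker.example.local"     # assumed plant-side broker
TOPIC = "plant/line1/press42/vibration"  # assumed topic layout


def read_vibration_sensor() -> float:
    """Placeholder for the actual sensor driver (e.g. an accelerometer
    clamped onto the legacy machine); here it just returns random values."""
    return random.uniform(0.0, 2.0)


def main() -> None:
    client = mqtt.Client()
    client.connect(BROKER_HOST, 1883)
    client.loop_start()
    try:
        while True:
            payload = {"ts": time.time(), "vibration_rms": read_vibration_sensor()}
            client.publish(TOPIC, json.dumps(payload), qos=1)
            time.sleep(1.0)  # 1 Hz sampling is an arbitrary example rate
    finally:
        client.loop_stop()
        client.disconnect()


if __name__ == "__main__":
    main()
```

in a real retrofit such a gateway would typically sit next to, or read from, the machine's existing plc and forward the data to on-premise or cloud storage for the data-driven applications described above.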
in order to support this argument, the potential benefits that can be derived from oem's direct involvement during the industry . transition for legacy equipment have been analyzed. additionally, we analyzed the challenges that oems may face in providing retrofitting and provisioning data-driven services. oems are usually financially stable businesses and are historically recognized as machinery approved vendors. their relationship with the end-users of machinery (i.e., manufacturers) is usually limited since they rely on other intermediaries (e.g., system integrators) for final sales and installation of their machinery and products. these intermediaries offer support, service, and spare parts to the customer. furthermore, they directly in touch with the market and have a better overview and relationship with the final customer (user). activities that initiate some contact between oems and customers are the services, which can vary from warranty to maintenance offers or the selling of product's functionality. nevertheless, oems are equipping their machinery with new technologies that enable them to gather a continuous stream of data from their end-users, exploring many benefits and opportunities for manufacturers. indeed, oems can expand their business into the service areas and establish long-term relationship with the customers by offering services based on a continuous stream of data the same reasoning is applicable for the existing plants where legacy equipment is present. these equipment usually were built in such ways that only the oem has the complete know-how and comprehensive domain knowledge on them. thus, oems have the capability to help their customers with data acquisition and analytics services based on their specific operations comprising both new and legacy equipment. considering this perspective, oems' servitization emerges as a new trigger for retrofitting legacy equipment. as depicted in fig. , oems are able to reach a level of product connectivity that can support the development and delivery of new services that can create value for their customers. most of the identified benefits are based on the fact that oems have significant expertise on their own machinery. their extensive knowledge of the equipment reduces the level of complexity of data gathering and analytics as well as service development (compared to service providers without that deep knowledge). moreover, the deep knowledge of machineries enhances also the reliability of the solutions. another assumption is that the oems will have access to data of multiple customers and installed equipment that are operational in different plants across various industries. this allows them to create a rich dataset (in terms of number of data points but also diversity and balance), which is the foundation of good data analytics. it is worth to mention that the development of quality solutions is not possible without direct interaction with the customer and the understanding of their needs. nonetheless, once a solution is developed it can be replicated in a similar environment. moreover, the manufacturer can benefit from equipment upgrades as well as services that arise from those solutions. indeed, oems can expand their usual business to the service domain. table enumerates different benefits both for the manufacturer and for the oems, especially considering differences in relation with the development of a similar project without their involvement. 
based on the literature in servitization and digital services, we have investigated the key factors that determine barriers for oems in offering industry . retrofit solutions and digital services. six different factors have been identified and are explained below. technical competences. as a first step, the technological challenges emerge. indeed, oems need to develop the building blocks that are needed to support digital solutions. considering a standard infrastructure as the one illustrated in fig. , it can be noticed that different dimensions emerge. they go form the sensors needed to retrieve information, to the connectivity and data transmission layers, platform and data storage, the application development as well as analytics capability and tools until arrive at the customer interface while being under the constraint of the security dimension. security emerges as one of the most important challenge to face. the high volume of sensitive user data should be secured and protected [ ] [ ] [ ] . researchers consider security and privacy as one of the main issue in the implementation of industry . [ ] and some possible solution have been proposed towards a holistic security framework for industrial iot systems [ ] . • close contact with customers leading to improved retention and market share. • visibility on customers, installed base and its status and operations • possibility to use data to develop customer profiles and propose ad-hoc solutions • new revenue streams from the retrofitting solutions and data driven services. • consolidated market of spare parts thanks to realtime condition monitoring • possibility to improve service delivery processes and scheduling based on the actual condition of the installed base • increased opportunity to sell new equipment at the end of life or provide disposal services • possibility to understand customer processes and receive valuable feedback leading to improved equipment design. • rely on the oem's domain knowledge on the equipment and expertise in solutions that they previously implemented for similar industries and end-user • access to proactive services, e.g. predictive maintenance • access to new services, e.g. oem can collect data on equipment fault in different customer plants and under different environments and after analyzing the data they can suggest best operational practices to the equipment users • access to equipment upgrades, i.e. based on the customer data oems can develop customized upgrade solutions to enhance the performance and functional capability of the equipment. another criticality that directly emerge considering the need of data collection and analysis is data standardization. "variety" and "veracity" of data constitute a challenge when fig. . proposed new perspective to consider data need to be integrated. particularly, "variety" refers to the fact that data are generated in large amount [ ] from different sources and formats, and contain multidimensional data fields including structured and unstructured data. "veracity" highlight the importance of quality data and reliability of data sources given that data need to be acquired, cleaned and organized in a standard format to enable analysis and sharing [ ] . moreover, considering legacy equipment with respect to new machinery, integration interfaces and communication protocols may require additional effort for their implementation. service culture, mentality and structure. 
the transformation of an equipment seller into a product-service system provider has been recognized as a complex process [ ] - [ ] . some of the challenges include: • reshaping the product-focus mentality and culture. • acquiring the capability to develop bundle solutions that provide both product and service to the customer. • rethinking business models, which may significantly change the revenue and cost structures. • creating a closer relationship with the customer. • structuring the organization in a way that can support service design and operations. extending the scope of service. connectivity and iot solutions open up opportunities for the oems to not only offer services but also think beyond their traditional focuses [ ] [ ] . digitalization can shift service boundaries beyond maintenance activities, giving the possibility to explore services supporting the customer in their operations. this requires the need to enter in the customer context and set a continuous relationship with them [ ] . moreover, considering the opportunity to retrieve data from external sources or companies, the scope of service provision increases. data exchange. "data exchange" refers to the fact that data is shared between different actors, and it leads to different challenges to overcome. first, considering the customer's perspective, the fear in sharing data need to be mentioned. customers are concerned that their data might be accessed and manipulated with malign intentions. they are also worried that the oems can potentially have visibility on the sensitive customer operations [ ] [ ] [ ] . additionally, from the oem's perspective, the implementation of a data-driven solution requires historical data that can be difficult to retrieve in case of legacy equipment. moreover, considering exchange of data in a one to one relationship between customer and oems, and also between different partners and oems, brings with it the data ownership dimension where it is difficult to decide and set norms, roles, and rules of data rights [ ] . lack of system level vision. even though retrofitting allows the oems to gain equipment level transparency, it is not possible for an individual oem to gain system level perspective. this is because a plant usually involves equipment from many different oems and current industrial production lines are mostly based on heterogeneous structures and architectures regarding communication protocols, control systems or electrical and mechanical components [ ] . as a result, the integration of these different technologies is nontrivial, and the oem's scope of service provision gets limited to equipment level. however, if different oems come forward and share machine level data between themselves, it may enable them to collaboratively develop system level services for individual customers. network reshapes. one of the critical choices for the oems refers to the downstream supply of technology and capabilities. oems need to decide whether to outsource these competencies or to develop them in-house [ ] . the main trend in this direction for oems is to collaborate and cooperate with new technological partners and suppliers. relationships with customers need to be reshaped since oems should better understand the customer processes and maintain a continuous contact created by the streams of data. finally, digitalization changes also business models [ ] and especially affect how value is created among different stakeholders in the offering. 
the relationships shifted from individual contribution by single firm to collaborations and integration of value chain partners to stimulate and achieve digital servitization [ ] [ ] . these six factors represent aspects that demine barriers for oems in offering industry . retrofit solutions and digital services. as previously explained, they have been defined based on literature in servitization and digital services. nevertheless, we argue that the factors that determine challenges can be translated also in the context of retrofitting solutions, which is not available in the analyzed literature today. the presented research focuses on the role of oems in retrofitting and upgrading legacy equipment towards industry . applications. particularly, the emphasize is given on the enabling of manufacturing servitization instead of on the technicality of retrofit solutions. one important gap identified from the literature review is the lack of oems' involvement in such projects. further analysis of literature helped to identify two main triggers "industry . push" and "need-based pull" that drove manufacturers to undertake retrofit projects. given that, we proposed the scope of a third trigger "servitizationpush", which refers to the fact that oems can leverage the opportunity to create and capture unique value by retrofitting and then provisioning data-driven value-added services for the manufacturers. in order to support the "servitization-push" as a new trigger, the authors explained the benefits of retrofitting for both oems and manufacturers in the context of servitization. considering the fact that, at the best of authors knowledge, "servitization-push" is not commonly provided for legacy equipment, last contribution of this paper is related to the identification and synthesis of factors that may determine the challenges and barriers of oems in providing these solutions. specifically, these factors have been classified in accordance with four different domains, i.e. "technological", "service-related", "digital service" and "legacy equipment retrofit". it is found that all of the factors have influence on the "legacy equipment retrofit", highlighting the fact that various challenges for the oems exist. as a next step, we plan to provide empirical validation of the identified benefits and challenges by conducting case studies and expert interviews involving both oems and users of industrial equipment. smart manufacturing: characteristics, technologies and enabling factors industry . -an overview of key benefits, technologies, and challenges.pdf a novel methodology for retrofitting cnc machines based on the context of industry . midlife upgrade of capital equipment: a servitization-enabled, value-adding alternative to traditional equipment replacement strategies managing the transition from products to services state-of-the-art in product-service systems servitization of the manufacturing firm: exploring the operations practices and technologies that deliver advanced services towards an operations strategy for product-centric servitization servitization and industry . 
convergence in the digital transformation of product firms: a business model innovation perspective servitized manufacturing firms competing through remote monitoring technology an exploratory study the value of big data in servitization the role of digital technologies for the service transformation of industrial companies industry-integrator as retrofit solution for digital manufacturing methods in existing industrial plants implementation of the mialinx integration concept for future manufacturing environments to enable retrofitting of machines from legacy-based factories to smart factories level according to the industry . cloud-enabled smart data collection in shop floor environments for industry . a case about the upgrade of manufacturing equipment for insertion into an industry . environment closing the lifecycle loop with installed base products condition monitoring of industrial machines using cloud communication enabling of predictive maintenance in the brownfield through low-cost sensors, an iiot-architecture and machine learning new threats for old manufacturing problems: secure iot-enabled monitoring of legacy production machinery apms definition of smart retrofitting: first steps for a company to deploy aspects of industry . embedded smart box for legacy machines to approach to i . in smart manufacturing a low-cost vision-based monitoring of computer upgrading legacy equipment to industry . through a cyber-physical interface a design approach to iot endpoint security for production machinery monitoring the intelligent industry of the future: a survey on emerging trends, research challenges and opportunities in industry . security and privacy challenges in industrial internet of things how 'big data' can make big impact: findings from a systematic review and a longitudinal case study big data for industry . : a conceptual framework four strategies for the age of smart services buyer-supplier relationships in a servitized environment developing integrated solution offerings for remote diagnostics: a comparative case study of two manufacturers servitized manufacturing firms competing through remote monitoring technology iot powered servitization of manufacturing -an exploratory case study data lifecycle and technology-based opportunities in new product service system offering towards a multidimensional framework characterizing service networks for moving from products to solutions moving from basic offerings to value-added solutions: strategies , barriers and alignment how smart, connected products are transforming competition organizing for innovation in the digitized world special issue territorial servitization: exploring the virtuous circle connecting knowledgeintensive services and new manufacturing businesses servitization, digitization and supply chain interdependency how the industrial internet of things changes business models in different manufacturing industries key: cord- -kvyzuayp authors: christ, andreas; quint, franz title: artificial intelligence: research impact on key industries; the upper-rhine artificial intelligence symposium (ur-ai ) date: - - journal: nan doi: nan sha: doc_id: cord_uid: kvyzuayp the trirhenatech alliance presents a collection of accepted papers of the cancelled tri-national 'upper-rhine artificial inteeligence symposium' planned for th may in karlsruhe. 
the trirhenatech alliance is a network of universities in the upper-rhine trinational metropolitan region comprising of the german universities of applied sciences in furtwangen, kaiserslautern, karlsruhe, and offenburg, the baden-wuerttemberg cooperative state university loerrach, the french university network alsace tech (comprised of 'grandes 'ecoles' in the fields of engineering, architecture and management) and the university of applied sciences and arts northwestern switzerland. the alliance's common goal is to reinforce the transfer of knowledge, research, and technology, as well as the cross-border mobility of students. in the area of privacy-preserving machine learning, many organisations could potentially benefit from sharing data with other, similar organisations to train good models. health insurers could, for instance, work together on solving the automated processing of unstructured paperwork such as insurers' claim receipts. the issue here is that organisations cannot share their data with each other for confidentiality and privacy reasons, which is why secure collaborative machine learning where a common model is trained on distributed data to prevent information from the participants from being reconstructedis gaining traction. this shows that the biggest problem in the area of privacy-preserving machine learning is not technical implementation, but how much the entities involved (decision makers, legal departments, etc.) trust the technologies. as a result, the degree to which ai can be explained, and the amount of trust people have in it, will be an issue requiring attention in the years to come. the representation of language has undergone enormous development of late: new models and variants, which can be used for a range of natural language processing (nlp) tasks, seem to pop up almost monthly. such tasks include machine translation, extracting information from documents, text summarisation and generation, document classification, bots, and so forth. the new generation of language models, for instance, is advanced enough to be used to generate completely realistic texts. these examples reveal the rapid development currently taking place in the ai landscape, so much so that the coming year may well witness major advances or even a breakthrough in the following areas: • healthcare sector (reinforced by the covid- pandemic): ai facilitates the analysis of huge amounts of personal information, diagnoses, treatments and medical data, as well as the identification of patterns and the early identification and/or cure of disorders. • privacy concerns: how civil society should respond to the fast increasing use of ai remains a major challenge in terms of safeguarding privacy. the sector will need to explain ai to civil society in ways that can be understood, so that people can have confidence in these technologies. • ai in retail: increasing reliance on online shopping (especially in the current situation) will change the way traditional (food) shops function. we are already seeing signs of new approaches with self-scanning checkouts, but this is only the beginning. going forward, food retailers will (have to) increasingly rely on a combination of staff and automated technologies to ensure cost-effective, frictionless shopping. • process automation: an ever greater proportion of production is being automated or performed by robotic methods. 
• bots: progress in the field of language (especially in natural language processing, outlined above) is expected to lead to major advances in the take-up of bots, such as in customer service, marketing, help desk services, healthcare/diagnosis, consultancy and many other areas. the rapid pace of development means it is almost impossible to predict either the challenges we will face in the future or the solutions destined to simplify our lives. one thing we can say is that there is enormous potential here. the universities in the trirhenatech alliance are actively contributing interdisciplinary solutions to the development of ai and its associated technical, societal and psychological research questions. utilizing toes of a humanoid robot is difficult for various reasons, one of which is that inverse kinematics is overdetermined with the introduction of toe joints. nevertheless, a number of robots with either passive toe joints like the monroe or hrp- robots [ , ] or active toe joints like lola, the toyota robot or toni [ , , ] have been developed. recent work shows considerable progress on learning model-free behaviors using genetic learning [ ] for kicking with toes and deep reinforcement learning [ , , ] for walking without toe joints. in this work, we show that toe joints can significantly improve the walking behavior of a simulated nao robot and can be learned model-free. the remainder of this paper is organized as follows: section gives an overview of the domain in which learning took place. section explains the approach for model free learning with toes. section contains empirical results for various behaviors trained before we conclude in section . the robots used in this work are robots of the robocup d soccer simulation which is based on simspark and initially initiated by [ ] . it uses the ode physics engine and runs at an update speed of hz. the simulator provides variations of aldebaran nao robots with dof for the robot types without toes and dof for the type with toes, naotoe henceforth. more specifically, the robot has ( ) dof in each leg, in each arm and in its neck. there are several simplifications in the simulation compared to the real nao: all motors of the simulated nao are of equal strength whereas the real nao has weaker motors in the arms and different gears in the leg pitch motors. joints do not experience extensive backlash rotation axes of the hip yaw part of the hip are identical in both robots, but the simulated robot can move hip yaw for each leg independently, whereas for the real nao, left and right hip yaw are coupled the simulated naos do not have hands the touch model of the ground is softer and therefore more forgiving to stronger ground touches in the simulation energy consumption and heat is not simulated masses are assumed to be point masses in the center of each body part the feet of naotoe are modeled as rectangular body parts of size cm x cm x cm for the foot and cm x cm x cm for the toes (see figure ). the two body parts are connected with a hinge joint that can move from - degrees (downward) to degrees. all joints can move at an angular speed of at most . degrees per ms. the simulation server expects to get the desired speed at hz for each joint. if no speeds are sent to the server it will continue movement of the joint with the last speed received. joint angles are noiselessly perceived at hz, but with a delay of ms compared to sent actions. so only after two cycles, the robot knows the result of a triggered action. 
a controller provided for each joint inside the server tries to achieve the requested speed, but is subject to maximum torque, maximum angular speed and maximum joint angles. the simulator is able to run simulated naos in real-time on reasonable cpus. it is used as competition platform for the robocup d soccer simulation league . in this context, only a single agent was running in the simulator. the following subsections describe how we approached the learning problem. this includes a description of the design of the behavior parameters used, what the fitness functions for the genetic algorithm look like, which hyperparameters were used and how the fitness calculation in the simspark simulation environment works exactly. the guiding goal behind our approach is to learn a model-free walk behavior. with model-free we depict an approach that does not make any assumptions about a robot's architecture nor the task to be performed. thus, from the viewpoint of learning, our model consists of a set of flat parameters. these parameters are later grounded inside the domain. the server requires values per second for each joint. to reduce the search space, we make use of the fact that output values of a joint over time are not independent. therefore, we learn keyframes, i.e. all joint angles for discrete phases of movement together with the duration of the phase from keyframe to keyframe. the experiments described in this paper used four to eight of such phases. the number of phases is variable between learning runs, but not subject to learning for now, except for skipping phases by learning a zero duration for it. the robocup server requires robots to send the actual angular speed of each joint as a command. when only leg joints are included, this would require to learn parameters per phase ( joints + for the duration of the phase), resulting in , and parameters for the , , phases worked with. the disadvantage of this approach is that the speed during a particular phase is constant, thus making it unable to adapt to discrepancies between the desired and the actual motor movement. therefore, a combination of angular value and the maximum amount of angular speed each joint should have is used. the direction and final value of movement is entirely encoded in the angular values, but the speed can be controlled separately. it follows that: -if the amount of angular speed does not allow reaching the angular value, the joint behaves like in version . -if the amount of angular speed is bigger, the joint stops to move even if the phase is not over. this almost doubles the amount of parameters to learn, but the co-domain of values for the speed values is half the size, since here we only require an absolute amount of angular speed. with these parameters, the robot learns a single step and mirrors the movement to get a double step. feedback from the domain is provided by a fitness function that defines the utility of a robot. the fitness function subtracts a penalty for falling from the walked distance in x-direction in meters. there is also a penalty for the maximum deviation in y-direction reached during an episode, weighted by a constant factor. in practice, the values chosen for f allenp enalty and a factor f were usually and respectively. this same fitness function can be used without modification for forward, backward and sideward walk learning, simply by adjusting the initial orientation of the agent. the also trained turn behavior requires a different fitness function. 
f itness turn = (g * totalt urn) − distance ( ) where totalt urn refers to the cumulative rotation performed in degrees, weighted by a constant factor g (typically / ). we penalize any deviation from the initial starting x / y position (distance) as incentive to turn in-place. it is noteworthy that other than swapping out the fitness function and a few more minor adjustments mentioned in . , everything else about the learning setup remained the same thanks to the model-free approach. naturally, the fitness calculation for an individual requires connecting an agent to the simspark simulation server and having it execute the behavior defined by the learned parameters. in detail, this works as follows: at the start of each "episode", the agent starts walking with the old model-based walk engine at full speed. once simulation cycles (roughly . seconds) have elapsed, the robot starts checking the foot force sensors. as soon as the left foot touches the ground, it switches to the learned behavior. this ensures that the learned walk has comparable starting conditions each time. if this does not occur within cycles (which sometimes happens due to non-determinism in the domain and noise in the foot force perception), the robot switches anyway. from that point on, the robot keeps performing the learned behavior that represents a single step, alternating between the original learned parameters and a mirrored version (right step and left step). an episode ends once the agents has fallen or seconds have elapsed. to train different walk directions (forward, backward, sideward), the initial orientation of the player is simply changed accordingly. in addition, the robot uses a different walk direction of the model-based walk engine for the initial steps that are not subject to learning. in case of training a morphing behavior (see . ) , the episode duration is extended to seconds. when a morphing behavior should be trained, the step behavior from another learning run is used. this also means that a morphing behavior is always trained for a specific set of walk parameters. after seconds, the morphing behavior is triggered once the foot force sensors detect that the left foot has just touched the ground. unlike the step / walk behavior, this behavior is just executed once and not mirrored or repeated. then the robot switches back to walking at full speed with the model-based walk engine. to maximize the reward, the agent has to learn a morphing behavior that enables the transition between learned model-free and old model-based walk to work as reliably as possible. finally, for the turn behavior, the robot keeps repeating the learned behavior without alternating with a mirrored version. in any case, if the robot falls, a training run is over. the overall runtime of each such learning run is . days on our hardware. learning is done using plain genetic algorithms. the following hyperparameters were used: - more details on the approach can be found in [ ] . this section presents the results for each kind of behavior trained. this includes three different walk directions, a turn behavior and a behavior for morphing. the main focus of this work has been on training a forward walk movement. figure shows a sequence of images for a learned step. the best result reaches a speed of . m/s compared to the . m/s of our model-based walk and . m/s for a walk behavior learned on the nao robot without toes. the learned walk with toes is less stable, however, and shows a fall rate of % compared to % of the model-based walk. 
regarding the characteristics of this walk, it utilizes remarkably long steps . table shows an in-depth comparison of various properties, including step duration, length and height, which are all considerably bigger compared to our previous model-based walk. the forward leaning of the agent has increased by . %, while . % more time is spent with both legs off the ground. however, the maximum deviation from the intended path (maxy ) has also increased by . %. table : comparison of the previously fastest and the fastest learned forward walk once a working forward walk was achieved, it was natural to try to train a backward walk behavior as well, since this only requires a minor modification in the learning environment (changing the initial rotation of the agent and model-based walk direction to start with). the best backward walk learned reaches a speed of . m/s, which is significantly faster than the . m/s of its model-based counterpart. unfortunately, the agent also falls % more frequently. it is interesting just how backward-leaning the agent is during this walk behavior. it could almost be described as "controlled falling" (see figure ). sideward walk learning was the least successful out of the three walk directions. like with all directions, the agent starts out using the old walk engine and then switches to the learned behavior after a short time. in this case however, instead of continuing to walk sideward, the agent has learned to turn around and walk forward instead, see figure . the resulting forward walk is not very fast and usually causes the agent to fall within a few meters , but it is still remarkable that the learned behavior manages to both turn the agent around and make it walk forward with the same repeating step movement. it is also remarkable that the robot learned that it is quicker with the given legs at least for long distances to turn and run forward than to keep making sidesteps. with the alternate fitness function presented in section , the agent managed to learn a turn behavior that is comparable in speed to that of the existing walk engine. despite this, the approach is actually different: while the old walk engine uses small, angled steps , the learned behavior uses the left leg as a "pivot", creating angular momentum with the right leg . figure shows the movement sequence in detail. unfortunately, despite the comparable speed, the learned turn behavior suffers from much worse stability. with the old turn behavior, the agent only falls in roughly % of cases, with the learned behavior it falls in roughly % of the attempts. one of the major hurdles for using the learned walk behaviors in a robocup competition is the smooth transition between them and other existing behaviors such as kicks. the initial transition to the learned walk is already built into the learning setup described in by switching mid-walk, so it does not have to be given special consideration. more problematic is switching to another behavior afterwards without falling. to handle this, the robot simply attempted to train a "morphing" behavior using the same model-free learning setup. the result is something that could be described as a "lunge" (see figure ) that reduces the forward momentum sufficiently to allow it to transition to the slower model-based walk when successful. however, the morphing is not successful in about % of cases, resulting in a fall. 
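to make the learning setup described above more tangible, the following sketch shows how a flat keyframe parameter vector and a plain genetic algorithm could be wired together. this is a minimal illustration, not the team's actual code: the joint and phase counts, the penalty constants and the ga settings are placeholders, and run_episode is an assumed helper that executes the mirrored step behaviour in the simulator and returns the walked x-distance, the maximum y-deviation and whether the robot fell. the fitness follows the description above (walked distance minus a fall penalty and a weighted penalty for lateral deviation).

```python
import random

N_JOINTS = 14          # assumption: number of leg joints (incl. toes) used for learning
N_PHASES = 4           # four to eight keyframe phases were used in the experiments
GENES_PER_PHASE = 2 * N_JOINTS + 1   # target angle + max speed per joint, plus phase duration
GENOME_LEN = N_PHASES * GENES_PER_PHASE

def decode(genome):
    """split the flat parameter vector into keyframe phases."""
    phases = []
    for p in range(N_PHASES):
        chunk = genome[p * GENES_PER_PHASE:(p + 1) * GENES_PER_PHASE]
        phases.append({
            "angles": chunk[:N_JOINTS],                   # target joint angles
            "max_speeds": chunk[N_JOINTS:2 * N_JOINTS],   # absolute speed limits per joint
            "duration": chunk[-1],                        # phase duration (zero skips the phase)
        })
    return phases

def fitness(distance_x, max_dev_y, fallen, fallen_penalty=1.0, f=1.0):
    """walked distance minus penalties, as described in the text (constants are placeholders)."""
    return distance_x - (fallen_penalty if fallen else 0.0) - f * max_dev_y

def evolve(run_episode, pop_size=50, generations=100, mutation_rate=0.05):
    """plain genetic algorithm over flat genomes; run_episode is an assumed callback
    that runs one episode in the simulator and returns (distance_x, max_dev_y, fallen)."""
    pop = [[random.uniform(-1, 1) for _ in range(GENOME_LEN)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda g: fitness(*run_episode(decode(g))), reverse=True)
        parents = scored[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(GENOME_LEN)            # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([g + random.gauss(0, 0.1) if random.random() < mutation_rate else g
                             for g in child])             # gaussian mutation
        pop = parents + children
    return max(pop, key=lambda g: fitness(*run_episode(decode(g))))
```

the same skeleton covers all trained behaviours: only the fitness callback and the initial orientation change between forward, backward, sideward, turn and morphing runs.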
we were able to successfully train forward and backward walk behaviors, as well as a morphing and turn behavior using plain genetic algorithms and a very flexible model-free approach. the usage of the toe joint in particular makes the walks look more natural and human-like than that of the model-based walk engine. however, while the learned behaviors outperform or at least match our old modelbased walk engine in terms of speed, they are not stable enough to be used during actual robocup d simulation league competitions. we think this is an inherent limitation of the approach: we train a static behavior that is unable to adapt to changing circumstances in the environment, which is common in simspark's non-deterministic simulation with perception noise. deep reinforcement learning seems more promising in this regard, as the neural network can dynamically react to the environment since sensor data serves as input. it is also arguably even less restrictive than the keyframe-based behavior parameterization we presented in this paper, as a neural network can output raw joint actions each simulation cycle. at least two other robocup d simulation league teams, fc portugal [ ] and itandroids [ ] , have had great success with this approach, everything points towards this becoming the state-of-the-art approach in robocup d soccer simulation in the near future, so we want to concentrate our future efforts here as well. retail companies dealing in alcoholic beverages are faced with a constant flux of products. apart from general product changes like modified bottle designs and sizes or new packaging units two factors are responsible for this development. the first is the natural wine cycle with new vintages arriving at the market and old ones cycling out each year. the second is the impact of the rapidly growing craft beer trend which has also motivated established breweries to add to their range. the management of the corresponding product data is a challenge for most retail companies. the reason lies in the large amount of data and its complexity. data entry and maintenance processes are linked with considerable manual effort resulting in high data management costs. product data attributes like dimensions, weights and supplier information are often entered manually into the data base and are often afflicted with errors. another widely used source of product data is the import from commercial data pools. a means of checking the data thus acquired for plausibility is necessary. sometimes product data is incomplete due to different reasons and a method to fill the missing values is required. all these possible product data errors lead to complications in the downstream automated purchase and logistics processes. we propose a machine learning model which involves domain specific knowledge and compare it a heuristic approach by applying both to real world data of a retail company. in this paper we address the problem of predicting the gross weight of product items in the merchandise category alcoholic beverages. to this end we introduce two levels of additional features. the first level consists of engineered features which can be determined by the basic features alone or by domain specific expert knowledge like which type of bottle is usually used for which grape variety. in the next step an advanced second level feature is computed from these first level features. adding these two levels of engineered features increases the prediction quality of the suggestion values we are looking for. 
the results emphasize the importance of careful feature engineering using expert knowledge about the data domain. feature engineering is the process of extracting features from the data in order to train a prediction model. it is a crucial step in the machine learning pipeline, because the quality of the prediction is based on the choice of features used to training. the majority of time and effort in building a machine learning pipeline is spent on data cleaning and feature engineering [domingos ] . a first overview of basic feature engineering principles can be found in [zheng ]. the main problem is the dependency of the feature choice on the data set and the prediction algorithm. what works best for one combination does not necessarily work for another. a systematic approach to feature engineering without expert knowledge about the data is given in [heaton ]. the authors present a study whether different machine learning algorithms are able to synthesize engineered features on their own. as engineered features logarithms, ratios, powers and other simple mathematical functions of the original features are used. in [anderson ] a framework for automated feature engineering is described. the data set is provided by a major german retail company and consists of beers and wines. each product is characterized by the seven features shown in table . the product name obeys only a generalized format. depending on the user generating the product entry in the company data base, abbreviating style and other editing may vary. the product group is a company specific number which encodes the product category -dairy products, vegetables or soft drinks for example. in our case it allows a differentiation of the product into beer and wine. additionally wines are grouped by country of origin and for germany also into wine-growing regions. note that the product group is no inherent feature like length, width, height and volume, but depends on the product classification system a company uses. the dimensions length, width, height and the volume derived by multiplicating them are given as float values. the feature (gross) weight, also given as a float value, is what we want to predict. as is often the case with real world data, a pre-processing step has to be performed prior to the actual machine learning in order to reduce data errors and inconsistencies. for our data we first removed all articles missing one or more of the required attributes of table . then all articles with dummy values were identified and discarded. dummy values are often introduced due to internal process requirements but do not add any relevant information to the data. if for example the attribute weight has to be filled for an article during article generation in order to proceed to the next step but the actual value is not know, often a dummy value of or is entered. these values distort the prediction model when used as training data in the machine learning step. the product name is subjected to lower casing and substitution of special german characters like umlauts. special symbolic characters like #,! or separators are also deleted. with this preprocessing done the data is ready to be used for feature engineering. following this formal data cleaning we perform an additional content-focused pre-processing. the feature weight is discretized by binning it with bin width g. volume is likewise treated with bin size ml. this simplifies the value distribution without rendering it too coarse. 
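a minimal pandas sketch of the cleaning steps just described is given below. the column names, the dummy values (zero and one) and the bin widths are assumptions for illustration; the paper does not publish its exact code.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, weight_bin=50, volume_bin=50) -> pd.DataFrame:
    """formal pre-processing as described above: drop incomplete rows and dummy
    values, normalise the product name, and bin weight and volume.
    bin widths and dummy values are illustrative assumptions."""
    required = ["name", "product_group", "length", "width", "height", "volume", "weight"]
    df = df.dropna(subset=required)

    # discard rows whose weight is only a process dummy value
    df = df[~df["weight"].isin([0, 1])]

    # lower-case the name and substitute german umlauts and special characters
    df["name"] = (df["name"].str.lower()
                  .str.replace("ä", "ae").str.replace("ö", "oe")
                  .str.replace("ü", "ue").str.replace("ß", "ss")
                  .str.replace(r"[#!,;/]", " ", regex=True))

    # content-focused step: discretise weight and volume by binning
    df["weight"] = (df["weight"] // weight_bin) * weight_bin
    df["volume"] = (df["volume"] // volume_bin) * volume_bin
    return df
```

one further content-focused filter, removing multi-item packages whose length differs from their width, follows in the text below.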
all articles where length is not equal to width are removed, because in these cases there are no single items but packages of items. often the data at hand is not sufficient to train a meaningful prediction model. in these cases feature engineering is a promising option. identifying and engineering new features depends heavily on expert knowledge of the application domain. the first level consists of engineered features which can be determined by the original features alone. in the next step advanced second level features are computed from these first level and the original features. for our data set the original features are product name and group as well as the dimensions length, width, height and volume. we see that the volume is computed in the most general way by multiplication of the dimensions. geometrically this corresponds to all products being modelled as cuboids. since angular beer or wine bottles are very much the exception in the real world, a sensible new feature would be a more appropriate modelling of the bottle shape. since weight is closely correlated to volume, the better the volume estimate the better the weight estimate. to this end we propose four first level engineered features: capacity, wine bottle type, beer packaging type and beer bottle type which are in turn used to compute a second level engineered feature namely the packaging specific volume. figure shows all discussed features and their interdependencies. let us have a closer look at the first level engineered features. the capacity of a beverage states the amount of liquid contained and is usually limited to a few discrete values. . l and . l are typical values for beer cans and bottles while wines are almost exclusively sold in . l bottles and sometimes in . l bottles. the capacity can be estimated from the given volume with sufficient certainty using appropriate threshold values. outliers were removed from the data set. there are three main beer packaging types in retail: cans, bottles and kegs. while kegs are mainly of interest to pubs and restaurants and are not considered in this paper, cans and bottles target the typical super market shopper and come in a greater variety. in our data set, the product name in case of beers is preceded by a prefix denoting whether the product is packaged in a can or a bottle. extracting the relevant information is done using regular expressions. not, though, that the prefix is not always correct and needs to be checked against the dimensions. the shapes of cans are the same for all practical purposes, no matter the capacity. the only difference is in their wall thickness, which depends on the material, aluminium and tin foil being the two common ones. the difference is weight is small and the actual material used is impossible to extract from the data. a further distinction for cans in different types like for beer and wine is therefore unnecessary. regarding the german beer market, the five bottle types shown in figure the engineered feature beer packaging type assigns each article identified as beer by its product group to one of the classes bottle or can. the feature beer bottle type contains the most probably member of the five main beer bottle types. packages containing more than one bottle or can like crates or six packs are not considered in this paper and were removed from the data set. compared to beer the variety of commercially sold wine packagings is limited to bottles only. 
a corresponding packaging type attribute to distinguish between cans and bottles is not necessary. again there are a few bottle types which are used for the majority of wines, namely schlegel, bordeaux and burgunder ( figure ). deciding what product is filled in which bottle type is a question of domain knowledge. the original data set does not contain a corresponding feature. from the product group the country of origin and in the case of german wines the region can be determined via a mapping table. this depends on the type of product classification system the respective company uses and has not to be valid for all companies. our data set uses a customer specific classification with focus on germany. a more general one would be the global product classification (gpc) standard for example. to determine wine growing regions in non-german countries like france the product name has to be analyzed using regular expressions. the type of grape is likewise to be deduced from the product name if possible. using the country and specifically the region of origin and type of grape of the wine in question is the only way to assign a bottle type with acceptable certainty. there are countries and region in which a certain bottle type is used predominantly, sometimes also depending on the color of the wine. the schlegel bottle, for example, is almost exclusively used for german and alsatian white wines and almost nowhere else. bordeaux and burgunder bottles on the other hand are used throughout the world. some countries like california or chile use a mix of bottle types for their wines, which poses an additional challenge. with expert knowledge one can assign regions and grape types to the different bottle types. as with beer bottles this categorization is by no means comprehensive or free of exceptions but serves as a first step. the standard volume computation by multiplying the product dimensions length, width and height is a rather coarse cuboid approximation to the real shape of alcoholic beverage packagings. since the volume is intrinsically linked to the weight which we want to predict a packaging type specific volume computation is required for cans and especially bottles. the modelling of a can is straightforward using a cylinder with the given height ℎ and a diameter of the given width and length . thus the packaging type specific volume is: a bottle on the other hand needs to be modelled piecewise. its height can be divided into three parts: base, shoulders and neck as shown in figure . base and neck can be modeled by a cylinder. the shoulders are approximated by a truncated cone. with the help of the corresponding partial heights ℎ , ℎ ℎ and ℎ we can compute coefficients , ℎ and as fractions of the overall height ℎ of the bottle. the diameters of the bottle base and the neck opening are given by and and are likewise used to compute the ratio . since bottles have circular bases, the values for width and length in the original data have to be the same and either one may be used for . these four coefficients are characteristic for each bottle type, be it beer or wine (table ) . with their help, a bottle type specific volume from the original data length, width and height can be computed which is a much better approximation to the true volume than the former cuboid model. 
the bottle base can be modelled as a cylinder as follows: the bottle shoulders have the form of a truncated cone and are described by formula : the bottle neck again is a simple cylinder: summing up all three sections yields the packaging type specific volume for bottles: ur-ai // the experiments follow the multi-level feature engineering scheme as shown in figure . first, we use only the original features product group and dimensions. then we add the first level engineered features capacity and bottle type to the basic features. next the second level engineered feature packaging type specific volume is used along with the basic features. finally all features from every level are used for the prediction. after pre-processing and feature engineering the data set size is reduced from to beers and from to wines. for prediction of the continuous valued attribute gross weight, we use and compare several regression algorithms. both the decision-tree based random forests algorithm (breimann, ) and support vector machines (svm) (cortes, ) are available in regression mode (smola, ) . linear regression (lai, ) and stochastic gradient descent (sgd) (taddy, ) are also employed as examples of more traditional statics-based methods. our baseline is a heuristic approach taking the median of the attribute gross weight for each product group and use this value as a prediction for all products of the same product group. practical experience has shown this to be a surprisingly good strategy. the implementation was done in python . using the standard libraries sk-learn and pandas. all numeric features were logarithmized prior to training the models. the non-numeric feature bottle type was converted to numbers. the final results were obtained using tenfold cross validation (kohavi, ) . for model training % of the data was used while the remaining % constituted the test data. we used the root mean square error (rsme) ( ) as well as the mean and variance of the absolute percentage error ( ) as metrics for the evaluation of the performance of the algorithms. all machine learning algorithms deliver significant improvements regarding the observed metrics compared to the heuristic median approach. the best results for each feature combination are highlighted in bold script. the results for the beer data set in table show that the rsme can be more than halved, the mean almost be reduced to a third and the variance of quartered compared to the baseline approach. the random forest regressor achieves the best results in terms of rsme and for almost all feature combinations except basic features and basic features combined with the packaging type specific volume, in which cases support vector machines prove superior. linear regression and sgd are are still better than the baseline approach but not on par with the other algorithms. linear regression shows the tendency to improved results when successively adding features. sgd on the other hand exhibits no clear relation between number and level of features and corresponding prediction quality. a possible cause could be the choice of hyper parameters. sgd is very sensitive in this regard and depends more heavily upon a higher number of correctly adjusted hyper parameters than the other algorithms we used. random forests is a method which is very well suited to problems, where there is no easily discernable relation between the features. it is prone to overfitting, though, which we tried to avoid by using % of all data as test data. 
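returning briefly to the packaging-type-specific volume defined above, the piecewise bottle model can be written out as a small helper. the coefficient names follow the description (partial heights as fractions of the overall height and the ratio of neck to base diameter); the concrete coefficient values per bottle type come from a table in the original paper and are therefore left as arguments here. this is a sketch using the standard cylinder and truncated-cone volume formulas, not the authors' implementation.

```python
from math import pi

def can_volume(width: float, height: float) -> float:
    """can modelled as a cylinder with diameter = width (= length)."""
    r = width / 2.0
    return pi * r * r * height

def bottle_volume(width: float, height: float,
                  c_base: float, c_shoulder: float, c_neck: float, c_d: float) -> float:
    """packaging-type-specific volume of a bottle: cylinder (base) +
    truncated cone (shoulders) + cylinder (neck).
    c_base, c_shoulder, c_neck are partial heights as fractions of the overall
    height; c_d is the ratio of neck diameter to base diameter."""
    r_base = width / 2.0
    r_neck = c_d * r_base
    h_base, h_shoulder, h_neck = c_base * height, c_shoulder * height, c_neck * height

    v_base = pi * r_base ** 2 * h_base
    v_shoulder = (pi * h_shoulder / 3.0) * (r_base ** 2 + r_base * r_neck + r_neck ** 2)
    v_neck = pi * r_neck ** 2 * h_neck
    return v_base + v_shoulder + v_neck
```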
adding more engineered features leads to increasingly better results using random forest with an outlier for the packaging type specific volume feature. svm are not affected by only first level engineered features but profit from using the bottle type specific volume. regarding the wine data set the results depicted in table are not as good as for the beer data set though still much better than the baseline approach. a reduction of the rsme by over % and of the mean by almost % compared to the baseline were achieved. the variance of could even be limited to under % of the baseline value. again random forests is the algorithm with the best metrics. linear regression and svm are comparable in terms of while sgd is worse but shows good rsme values. in conclusion the general results of the wine data set show not much improvement when applying additional engineered features. discussion and conclusion the experiments show a much better predicting quality for beer than for wine. a possible cause could be the higher weight variance in bottle types compared to beer bottles and cans. it is also more difficult to correctly determine the bottle type for wine, since the higher overlap in dimensions does not allow to compute the bottle type with the help of idealized bottle dimensions. using expert knowledge to assign the bottle type by region and grape variety seems not to be as reliable, though. especially with regard to the lack of a predominant bottle type in the region with the most bottles (red wine from baden for example), this approach should be improved. especially bordeaux bottles often sport an indentation in the bottom, called a 'culot de bouteille'. the size and thickness of this indentation cannot be inferred from the bottle's dimensions. this means that the relation between bottle volume and weight is skewed compared to other bottles without these indentations, which in turn decreases prediction quality. predicting gross weights with machine learning and domain-specifically engineered features leads to smaller discrepancies than using simple heuristic approaches. this is important for retail companies since big deviations are much worse for logistical reasons than small ones which may well be within natural production tolerances for bottle weights. our method allows to check manually generated as well as data pool imported product data for implausible gross weight entries and proposes suggestion values in case of missing entries. the method we presented can easily be adapted to non-alcoholic beverages using the same engineered features. in this segment, plastics bottles are much more common than glass ones and hence the impact of the bottle weight compared to the liquid weight is significantly smaller. we assume that this will cause a smaller importance of the bottle type feature in the prediction. a more problematic kind of beverage is liquor. although there are only a few different standard capacities, the bottle types vary so greatly, that identifying a common type is almost impossible. one of the main challenges of our approach is determining the correct bottle types. using expert knowledge is a solid approach but cannot capture all exemptions. especially if a wine growing region has no predominant bottle type and is using mixed bottle types instead. additionally many wine growers use bottle types which haven't been typical for their wine types because they want to differ from other suppliers in order to get the customer's attention. 
assuming that all rieslings are sold in schlegel bottles, for example, is therefore not exactly true. one option could be to model hybrid bottles using a weighted average of the coefficients for each bottle type in use. if a region uses both burgunder and bordeaux bottles with about equal frequency, all products from this region could be assigned a hybrid bottle with coefficients computed by the mean value of each coefficient. if an initially bottle type labeled data set is available, preliminary simulations have shown that most bottle types can be predicted robustly using classification algorithms. the most promising strategy, in our opinion, is to learn the bottle types directly from product images using deep neural nets for example. with regard to the ever increasing online retail sector, web stores need to have pictures of their products on display, so the data is there to be used. quality assurance is one of the key issues for modern production technologies. especially new production methods like additive manufacturing and composite materials require high resolution d quality assurance methods. computed tomography (ct) is one of the most promising technologies to acquire material and geometry data non-destructively at the same time. with ct it is possible to digitalize subjects in d, also allowing to visualize their inner structure. a d-ct scanner produces voxel data, comprising of volumetric pixels that correlate with material properties. the voxel value (grey value) is approximately proportional to the material density. nowadays it is still common to analyse the data by manually inspecting the voxel data set, searching for and manually annotating defects. the drawback is that for high-resolution ct data, this process it very time consuming and the result is operator-dependent. therefore, there is a high motivation to establish automatic defect detection methods. there are established methods for automatic defect detection using algorithmic approaches. however, these methods show a low reliability in several practical applications. at this point artificial neural networks come into play that have been already implemented successfully in medical applications [ ] . the most common networks, developed for medical data segmentation, are by ronneberger et al., the u-net [ ] and by milletari et al., the v-net [ ] and their derivates. these networks are widely used for segmentation tasks. fuchs et al. describes three different ways of analysing industrial ct data [ ] . one of these contains a d-cnn. this cnn is based on the u-net architecture and is shown in their previous paper [ ] . the authors enhance and combine the u-net and v-net architecture to build a new network for examination of d volumes. in contrast, we investigate in our work how the networks introduced by ronneberger et al. and milletari et al. perform in industrial environments. furthermore, we investigate if derivates of these architectures are able to identify small features in industrial ct data. in industrial ct systems, not only in the hardware design but also in the resulting d imaging data differs from medical ct systems. voxel data from industrial parts differ from medical data in the contrast level and the resolution. state-of-the-art industrial ct scanner produce one to two order of magnitude larger data sets compared to medical ct systems. the corresponding resolution is necessary to resolve small defects. 
medical ct scanners are optimised for a low xray dose for the patient, the energy of x-ray photons are typically up to kev, industrial scanner typically use energies up to kev. in combination with the difference of the scan "object", the datasets differ significantly in size and image content. to store volume data there are a lot of different file formats. some of them are mainly used in medical applications like dicom [ ] , nifti or raw. in industrial applications vgl , raw and tiff are commonly used. also depending on the format, it is possible to store the data slice wise or as a complete volume stack. industrial ct data, as mentioned in previous section, has some differences to medical ct data. one aspect is the size of the features to be detected or learned by the neural network. our target is to find defects in industrial parts. as an example, we analyse pores in casting parts. these features may be very small, down to to voxels in each dimension. compared to the size of the complete data volume (typically larger than x x voxel), the feature size is very small. the density difference between material and pores may be as low as % of the maximum grey value. thus, it is difficult to annotate the data even for human experts. the availability of real industrial data of good quality, annotated by experts, is very low. most companies don't reveal their quality analysis data. training a neural network with a small quantity of data is not possible. for medical applications, especially ai applications, there are several public datasets available. yet these datasets are not always sufficient and researchers are creating synthetic medical data [ ] . therefore, we decided to create synthetic industrial ct data. another important reason for synthetic data is the quality of annotations done by human experts. the consistency of results is not given for different experts. fuchs et al. have shown that training on synthetic data and predicting on real data lead to good results [ ] . however, synthetic data may not reflect all properties of real data. some of the properties are not obvious, which may lead to ignoring some varieties in the data. in order to achieve a high associability, we use a large numbers of synthetic data mixed with a small number of real data. to achieve this, we developed an algorithm which generates large amounts of data, containing a large variation of aspects, needed to generalize a neural network. the variation includes material density, pore density, pore size, pore amount, pore shape and size of the part. there are some samples that could be learned easily, because the pores are clearly visible inside the material. however, some samples are more difficult to be learned, because the pores are nearly invisible. this allows us to generate data with a wide variety and hence the network can predict on different data. to train the neural networks, we can mix the real and synthetic data or use them separately. the real data was annotated manually by two operators. to create a dataset of this volume we sliced it into x x blocks. only the blocks with a mean density greater than % of the grayscale range are used, to avoid too much empty volumes in the training data. another advantage of synthetic data is the class balance. we have two classes, where corresponds to material and surrounding air and for the defects. because of the size of the defects there is a high imbalance between the classes. by generating data with more features than in the real data, we could reduce the imbalance. 
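a simplified numpy sketch of how such synthetic training volumes with pores could be generated is shown below. the material grey value, pore count, radii and noise level are arbitrary placeholders, since the concrete generator used in the paper is not published; the point is only that spherical low-contrast defects and their ground-truth masks can be produced with controlled variation.

```python
import numpy as np

def synthetic_volume(size=64, material=0.8, n_pores=5, r_min=1, r_max=4, noise=0.02, rng=None):
    """generate a cube of 'material' with spherical low-density pores and the
    matching binary ground-truth mask (1 = defect). all values are placeholders."""
    rng = rng or np.random.default_rng()
    vol = np.full((size, size, size), material, dtype=np.float32)
    mask = np.zeros_like(vol, dtype=np.uint8)
    zz, yy, xx = np.mgrid[:size, :size, :size]

    for _ in range(n_pores):
        c = rng.integers(r_max, size - r_max, size=3)        # pore centre
        r = rng.integers(r_min, r_max + 1)                   # pore radius in voxels
        sphere = (zz - c[0]) ** 2 + (yy - c[1]) ** 2 + (xx - c[2]) ** 2 <= r ** 2
        vol[sphere] *= rng.uniform(0.90, 0.99)               # pores only slightly darker than material
        mask[sphere] = 1

    vol += rng.normal(0.0, noise, vol.shape).astype(np.float32)  # acquisition noise
    return np.clip(vol, 0.0, 1.0), mask
```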
reducing the size of the volume to x x also leads to better balance between the size of defects compared to full volume. in table details of our dataset for training, evaluation and testing are shown. the synthetic data will not be recombined to a larger volume as they represent separate small components or full material units. the following two slices of real data ( figure ) and synthetic data (figure ) with annotated defects show the conformity between the data. ur-ai // hardware and software setup deep learning (dl) consist of two phases: the training and its application. while dl models can be executed very fast, the training of the neural network can be very time-consuming, depending on several factors. one major factor is the hardware. the time consumed can be reduced by the factor of around ten when graphics cards (gpus) are used. [ ] to cache the training data, before it is given into the model, calculated on the gpu, a lot of random-access memory (ram) is used [ ] [ ] [ ] . our system is built on a dual cpu hardware with cores each running at . ghz and a nvidia gpu titan rtx with gb of vram and gb of regular ram. all measurements in this work concerning training and execution time are related to this hardware setup. the operating system is ubuntu . lts. anaconda is used for python package management and deployment. the dl-framework is tensorflow . and keras as a submodule in python . based on the du-net [ ] and dv-net [ ] architecture compared from paichao et al. [ ] we created modified versions which differ in number of layers and their hyperparameters. due to the small size of our data, no patch division is necessary. instead the training is performed on the full volumes. we actually do not use the z-net enhancement proposed in their paper. the input size, depending on our data, is defined to x x x with dimension for channel. the incoming data will be normalized. as we have a binary segmentation task, our output activation is the sigmoid [ ] function. based on paichao et al. [ ] the convolutional layer of our du-nets have a kernel size of ( , , ) and the dv-nets have a kernel size of ( , , ). as convolution activation function we are using elu [ ] [ ] and he_normal [ ] as kernel initialization [ ] . the adam optimisation method [ ] [ ] is used with a starting learning rate of . , a decay factor of . and the loss function is the binary cross-entropy [ ] . figure shows a sample du-net architecture where downwards max pooling and upwards transposed convolution are used. compared to figure , the dv-net, where we have a fully convolutional neural network, the descend is done with a ( , , ) convolution and a stride of and ascent with transposed convolution. it also has a layer level addition of the input of this level added to the last convolution output of the same level, as marked by the blue arrows. to adapt the shapes of the tensors for adding them, the down-convolution and the last convolution of the same level, have to have the same number of kernel filters. our modified neural network differ in the levels of de-/ascending, the convolution filter kernel size and their hyperparameters, shown in table . the convolutions on one level have the same number of filter kernel. after every down convolution the number of filters is multiplied by and on the way up divided by . training and evaluation of the neural networks the conditions of a training and a careful parameters selection is important. in table the training conditions fitted to our system and networks are shown. 
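a strongly reduced keras sketch of a 3d u-net of the kind described above is given here: elu activations, he_normal initialisation, a sigmoid output, and adam with binary cross-entropy follow the text, while the input edge length, the filter counts and the number of levels are placeholders. the architectures used in the paper are deeper, and the v-net variants additionally use strided convolutions and level-wise additions instead of max pooling.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # two 3x3x3 convolutions per level, elu activation, he_normal initialisation
    for _ in range(2):
        x = layers.Conv3D(filters, 3, padding="same",
                          activation="elu", kernel_initializer="he_normal")(x)
    return x

def small_unet3d(edge=64, base_filters=16):
    # input volumes are assumed to be normalised beforehand
    inp = layers.Input((edge, edge, edge, 1))

    c1 = conv_block(inp, base_filters)
    p1 = layers.MaxPooling3D(2)(c1)                  # descend
    c2 = conv_block(p1, base_filters * 2)
    p2 = layers.MaxPooling3D(2)(c2)

    b = conv_block(p2, base_filters * 4)             # bottleneck

    u2 = layers.Conv3DTranspose(base_filters * 2, 2, strides=2, padding="same")(b)
    c3 = conv_block(layers.concatenate([u2, c2]), base_filters * 2)
    u1 = layers.Conv3DTranspose(base_filters, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.concatenate([u1, c1]), base_filters)

    out = layers.Conv3D(1, 1, activation="sigmoid")(c4)   # binary segmentation output
    model = Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="binary_crossentropy")
    return model
```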
we also take into account that different network architectures and numbers of layers perform better with different learning rates, batch sizes, etc. to evaluate our trained models, we mainly focus on the iou metric, also called the jaccard index, which is the intersection over union. this metric is widely used for segmentation tasks and compares prediction and ground truth for each voxel. the iou value ranges between 0 and 1, whereas the loss values range between 0 and infinity; the iou is therefore a much clearer indicator. an iou close to 1 indicates a high overlap between the prediction and the ground truth. our networks were trained between and epochs until no further improvement could be achieved. both datasets consist of a similar number of samples, which means the epoch time is equivalent; one epoch took around minutes. figure shows the loss determined on the evaluation data. as described above, all models are trained on and evaluated against the synthetic dataset gdata and the mixed dataset mdata. in general, the loss achieved by all models is higher on mdata because the real data is harder to learn. a direct comparison is only possible between models with the same architecture. the iou metric is shown in figure , where the evaluation is sorted by the iou metric. if we compare the loss of unet-mdata with unet-gdata, which are nearly the same for mdata, with their corresponding iou (unet-mdata (~ . ) and unet-gdata (~ . )), we can see that a lower loss does not necessarily lead to a higher iou score. if only loss and iou are considered, the unets tend to be better than the vnets. in conclusion, considering the iou metric for model selection, unet-gdata is the best performing model and vnet-gdata the least performing. (figure captions: the evaluation loss on the evaluation data, sorted from lowest to highest; the evaluation iou on the evaluation data, sorted from lowest to highest.) after comparing the automatic evaluation, we show prediction samples of different models on real and synthetic data (table ). rows and show the comparison between unet-gdata and vnet-gdata predicting on a synthetic test sample. the result of unet-gdata exactly hits the ground truth, whereas the vnet-gdata prediction has a % overlap with the ground truth but is surrounded by false positive segmentations. in rows and , both models predict the ground truth plus some false positive segmentations in the close neighbourhood. rows and show the prediction results of the same two models on real data, taking into account that neither model was trained on real data. unet-gdata delivers good precision with some false positive segmentations in the ground-truth area and one additionally segmented defect; this shows that the model was able to find a defect that was missed by the expert. vnet-gdata shows a very high number of false positive segmentations. in this paper, we have proposed a neural network to find defects in real and synthetic industrial ct volumes. we have shown that neural networks developed for medical applications can be adapted to industrial applications. to achieve high accuracy, we used a large variety of features in our data. based on the evaluation and a manual review of random samples, we have chosen the unet architecture for further research. this model achieved great performance on our real and synthetic dataset.
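as a supplement to the evaluation just described, the voxel-wise iou can be computed with a small numpy helper (not the authors' code; the threshold on the sigmoid output is an assumption):

```python
import numpy as np

def voxel_iou(pred: np.ndarray, truth: np.ndarray, threshold: float = 0.5) -> float:
    """intersection over union (jaccard index) between a sigmoid prediction
    volume and a binary ground-truth mask, evaluated per voxel."""
    p = pred >= threshold
    t = truth.astype(bool)
    union = np.logical_or(p, t).sum()
    if union == 0:                 # no defect predicted and none present
        return 1.0
    return np.logical_and(p, t).sum() / union
```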
in summary, this paper shows that artificial intelligence and neural networks can make an important contribution to industrial problems. stress can affect all aspects of our lives, including our emotions, behaviors, thinking ability, and physical health, making our society sick -both mentally and physically. among the effects that stress and anxiety can cause are heart diseases, such as coronary heart disease and heart failure [ ] . based on this, this research presents a proposal to help people handle stress by taking advantage of technological developments and to establish patterns of stress states as a way to propose interventions, since the first step to controlling stress is to know its symptoms. stress symptoms are very broad and can be confused with other diseases according to the american institute of stress [ ] , for example frequent headache, irritability, insomnia, nightmares, disturbing dreams, dry mouth, problems swallowing, and increased or decreased appetite; stress can even cause other diseases such as frequent colds and infections. in view of the wide variety of symptoms caused by stress, this research intends to define, through physiological signals, the patterns generated by the body and obtained by wearable sensors, and to develop a standardized database for machine learning. on the other hand, advances in sensor technology, wearable devices and mobile computing can support online stress identification based on physiological signals and the delivery of psychological interventions. the advancement of technology and improvements in wearable sensors have made it possible to use these devices as a source of data to monitor the user's physiological state. the majority of wearable devices consist of low-cost boards that can be used for the acquisition of physiological signals [ , ] . after the data are obtained, it is necessary to apply filters to clean the signal of noise and distortions, so that machine learning approaches can be used to model and predict stress states [ , ] . the wide-spread use of mobile devices and microcomputers, such as the raspberry pi, and their capabilities offer a great opportunity to collect and process these signals with a dedicated application. these devices can collect the physiological signals and detect specific stress states to generate interventions following a predetermined diagnosis based on the patterns already evaluated in the system [ , ] . during the literature review it became evident that few works are dedicated to comprehensively evaluating the complete biofeedback cycle, which comprises using wearable devices, applying machine learning pattern detection algorithms, generating the psychological intervention, and monitoring its effects while recording the history of events [ , ] . stress is identified by professionals using human physiology, so wearable sensors can help with data acquisition and processing, and machine learning algorithms on biosignal data can suggest psychological interventions. some works [ , ] are dedicated to defining patterns through experiments for data acquisition that simulate real situations. jebelli, khalili and lee [ ] showed a deep learning approach that was compared with a baseline feedforward artificial neural network. schmidt et al. [ ] describe wearable stress and affect detection (wesad), a public dataset used to build classifiers and identify stress patterns, integrating several sensor signals with the emotional aspect and reaching a precision of % in the experiments.
the work of gaglioli et al. [ ] describe the main features and preliminary evaluation of a free mobile platform for the selfmanagement of psychological stress. in terms of the wearables, some studies [ , ] evaluate the usability of devices to monitory the signals and the patient's well-being. pavic et al. [ ] showed a research performed to monitor cancer patients remotely and as the majority of the patients have a lot of symptoms but cannot stay at hospital during all treatment. the authors emphasize that was obtained good results and that this system is viable, as long as the patient is not a critical case, as it does not replace medical equipment or the emergency care present in the hospital. henriques et al. [ ] focus was to evaluated the effects of biofeedback in a group of students to reduce anxiety, in this paper was monitored the heart rate variability with two experiments with duration of four weeks each. the work of wijman [ ] describes the use of emg signals to identify stress, this experiment was conducted with participants, evaluating both the wearables signals and questionnaires. in this section will be described the uniqueness of this research and the devices that was used. this solution is being proposed by several literature study about stress patterns and physiological aspects but with few results, for this reason, our project will address topics like experimental study protocol on signals acquisition from patients/participants with wearables to data acquisition and processing, in sequence will be applied machine learning modeling and prediction on biosignal data regarding stress (fig. ) . the protocol followed to the acquisition of signals during all different status is the trier social stress test (tsst) [ ] , recognized as the gold standard protocol for stress experiments. the estimated total protocol time, involving pre-tests and post-tests, is minutes with a total of thirteen steps, but applied experiment was adapted and it was established with ten stages: initial evaluation: the participant arrives, with the scheduled time, and answer the questionnaires; habituation: it will take a rest time of twenty minutes before the pre-test to avoid the influence of events and to establish a safe baseline of that organism; pre-test: the sensors will be allocated ( fig. ), collected saliva sample and applied the psychological instruments. the next step is explanation of procedure and preparation: the participant reads the instructions and the researcher ensures that he understands the job specifications, in sequence, he is sent to the room with the jurors (fig. ) , composed of two collaborators of the research, were trained to remain neutral during the experiment, not giving positive verbal or non-verbal feedback; free speech: after three minutes of preparation, the participant is requested to start his speech, being informed that he cannot use the notes. 
this will follow the arithmetic task: the jurors request an arithmetic task in which the participant must subtract mentally, sometimes, the jurors interrupt and warn that the participant has made a mistake; post-test evaluation: the experimenter receives the subject outside the room for the post-test evaluations; feedback and clarification: the investigator and jurors talk to the subject and clarify what the task was about; relaxation technique: a recording will be used with the guidelines on how to perform a relaxation technique, using only the breathing; final post-test: some of the psychological instruments will be reapplied, saliva samples will be collected, and the sensors will still be picking up the physiological signals. based on literature [ ] and wearable devices available the signals that was selected to analysis is the ecg, eda and emg for an initial experiment. this experimental study protocol on data acquisition started with participants, where data annotation each step was done manually, from protocol experiment, preprocessing data based on features selection. in the machine learning step, it was evaluated the metrics of different algorithms as decision tree, random forest, adaboost, knn, k-means, svm. the experiment was made using the bitalino kit -plux wireless biosignals s.a. (fig. ) composed by ecg sensor, which will provide data on heart rate and heart rate variability; eda sensor that will allow measure the electrical dermal activity of the sweat glands; emg sensor that allows the data collect the activity of the muscle signals. this section will describe the results in the pre-processing step and how it was made, listing all parts regarded to categorization and filtering data, evaluating the signal to know if it has plausibility and create a standardized database. the developed code is written in python due to the wide variety of libraries available, in this step was used the libraries numpy and pandas, both used to data manipulation and analysis. in the first step it is necessary read the files with the raw data and the timestamp, during this process the used channels are renamed to the name of the signal, because the bitalino store the data with the channel number as name of each signals. in sequence, the data timestamp is converted to a useful format, with goal to compare with the annotations, after time changed to the right format all channels unused are discarded to avoid unnecessary processing. the next step is to read the annotations taken manually in the experiment, as said before, to compare the time and classify each part of the experiment with its respective signal. after all signals are classified with its respective process of the tsst, each part of the experiment is grouped in six categories, which will be analyzed later. the first category is the "baseline", with just two parts of the experiment, representing the beginning of the experiment, when the participants had just arrived. the second is called of "tsst" comprises the period in which the participant spoke, the third category is the "arithmetic" with the data in acquired in the arithmetic test. the others two relevant categories are the "post_test_sensors_ " and "post_test_sensors_ ", with its respective signals in the parts called with the same name. every other part of the experiment was categorized as "no_category", in sequence, this category is discarded in function of it will not be necessary in the machine learning stage. 
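a condensed pandas sketch of the labelling step described above is shown below. the channel names, column names, file formats and the exact stage-to-category mapping are assumptions for illustration only; the actual bitalino export format and annotation files may differ.

```python
import pandas as pd

# assumed mapping from annotated tsst stages to the six analysis categories
STAGE_TO_CATEGORY = {
    "habituation": "baseline",
    "free_speech": "tsst",
    "arithmetic_task": "arithmetic",
    "post_test_1": "post_test_sensors_1",
    "post_test_2": "post_test_sensors_2",
}

def label_signals(raw_file: str, annotations_file: str) -> pd.DataFrame:
    """read raw bitalino data, rename channels to signal names, convert the
    timestamp and attach a category to every sample based on the manual
    annotations (assumed to contain start/end times per stage)."""
    df = pd.read_csv(raw_file)
    df = df.rename(columns={"A1": "ecg", "A2": "eda", "A3": "emg"})   # assumed channel layout
    df["time"] = pd.to_datetime(df["timestamp"], unit="ms")
    df = df[["time", "ecg", "eda", "emg"]]                            # drop unused channels

    ann = pd.read_csv(annotations_file, parse_dates=["start", "end"])
    df["category"] = "no_category"
    for _, row in ann.iterrows():
        in_stage = (df["time"] >= row["start"]) & (df["time"] <= row["end"])
        df.loc[in_stage, "category"] = STAGE_TO_CATEGORY.get(row["stage"], "no_category")

    return df[df["category"] != "no_category"]    # discard samples outside the relevant stages
```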
once the dataframe is consistent and all signals are properly classified, the columns with the participant number and the timestamp are removed from the dataframe. the next step is to evaluate the signal, to verify whether it is really useful for the machine learning process. for this, the signals are analyzed using the biosppy library, which performs the data filtering process and makes it possible to view the data. finally, the script checks the volume of data present in each class and returns the size of the smallest category. this is done because the categories were found to contain different volumes of data, which would become a problem in the machine learning stage by offering more data from one category than from the others. for this reason, the code analyzes the other categories and reduces their size until all categories have the same number of rows; after this, the dataframe is exported to a csv file to be read in the machine learning stage. the purpose of this article is to describe some stages of the development of a system for the acquisition and analysis of physiological signals, with the goal of determining patterns in these signals that allow stress states to be detected. during the development of the project it was verified that, for some participants, there are data gaps in the dataframe in the middle of the experiment; one hypothesis is that this is caused by communication issues of the bitalino acquisition at some specific sampling rates. the results obtained when reducing the acquisition rate will be evaluated; however, it is necessary to carefully assess the extent to which a lower sampling rate interferes with the results. during the evaluation of the plausibility of the signals, evident differences between the signal patterns in the different stages of the process were verified, thus validating the protocol followed during acquisition. the next step in this project is to implement the machine learning stage, applying different algorithms such as svm, decision tree, random forest, adaboost, knn, and k-means, and to evaluate the results using metrics such as accuracy, precision, recall, and f1-score. the next steps of this research will support the confirmation of the hypothesis that it is possible to define patterns of physiological signals to detect stress states. once the patterns are defined, a system can be deployed that acquires the signals and analyzes them in real time based on the machine learning results. in this way the state of the person can be detected, so that the psychologist can propose an intervention and monitor whether stress is actually decreasing.
technological developments have been influencing all kinds of disciplines by transferring more and more competences from human beings to technical devices. the steps include [ ]:
1. tools: transfer of mechanics (material) from the human being to the device
2. machines: transfer of energy from the human being to the device
3. automatic machines: transfer of information from the human being to the device
4. assistants: transfer of decisions from the human being to the device
with the introduction of artificial intelligence (ai), in particular its latest developments in deep learning, we let the system take over our decisions and creation processes.
thus, in tasks and disciplines that were exclusively reserved for humans in the past, machines can now co-exist with humans or even take the human out of the loop. it is no wonder that this transformation does not stop at disciplines such as engineering, business, and agriculture, but also affects the humanities, art, and design. every new technology has been adopted for artistic expression, as the many wonderful examples in media art show. therefore, it is not surprising that ai is being established as a novel tool to produce creative content of any form. however, in contrast to other disruptive technologies, acceptance of ai seems particularly challenging in the area of art, because it offers capabilities we once thought only humans were able to perform: the art is no longer done by artists using new technology to perform their art, but by the machine itself, without the need for a human to intervene. the question of "what is art" has always been an emotionally debated topic in which everyone has a slightly different definition depending on his or her own experiences, knowledge base, and personal aesthetics. however, there seems to be a broad consensus that art requires human creativity and imagination, as, for instance, stated by the oxford dictionary: "the expression or application of human creative skill and imagination, typically in a visual form such as painting or sculpture, producing works to be appreciated primarily for their beauty or emotional power." every art movement challenges old ways and uses artistic creative abilities to spark new ideas and styles. with each art movement, diverse intentions and reasons for creating the artwork came along, together with critics who did not want to accept the new style as an art form. with the introduction of ai into the creation process, another art movement is being established which is fundamentally changing the way we see art. for the first time, ai has the potential to take the artist out of the loop and to leave humans only in the positions of curators, observers, and judges who decide whether the artwork is beautiful and emotionally powerful. while there is a strong debate going on in the arts about whether creativity is profoundly human, we investigate how ai can foster inspiration and creativity and produce unexpected results. many publications have shown that ai can generate images, music, and the like which can resemble different styles and produce artistic content. for instance, elgammal et al. [ ] have used generative adversarial networks (gans) to generate images by learning about styles and deviating from style norms. the promise of ai-assisted creation is "a world where creativity is highly accessible, through systems that empower us to create from new perspectives and raise the collective human potential", as roelof pieters and samim winiger pointed out [ ]. to get a better understanding of how ai is capable of proposing images, music, etc., we have to open the black box and investigate where and how the magic happens. random variations in the image space (sometimes also referred to as pixel space) usually do not lead to any interesting result, because no semantic knowledge can be applied. therefore, methods are needed which constrain the possible variations of the given dataset in a meaningful way. this can be realized by generative design or procedural generation, which is applied to generate geometric patterns, textures, shapes, meshes, terrain, or plants.
the generation processes may include, but are not limited to, self-organization, swarm systems, ant colonies, evolutionary systems, fractal geometry, and generative grammars. mccormack et al. [ ] review some generative design approaches and discuss how art and design can benefit from such applications. these generative algorithms, which are usually realized by writing program code, are very limited. ai can turn this process into a data-driven procedure. ai, or more specifically artificial neural networks, can learn patterns from (labeled) examples or by reinforcement. before an artificial neural network can be applied to a task (classification, regression, image reconstruction), the general architecture extracts features through many hidden layers. these layers represent different levels of abstraction. data that have a similar structure or meaning should be represented as data points that are close together, while divergent structures or meanings should be further apart from each other. the first component, the encoder, compresses the data from a high-dimensional input space to a low-dimensional space, often called the bottleneck layer. to convert the image back (with some conversion/compression loss) from this low-dimensional vector to the original input, an additional component is needed: the decoder takes the encoded input and converts it back to the original input as closely as possible. together they form the autoencoder, which consists of the encoder and the decoder. the latent space is the space in which the data lies in the bottleneck layer. looking at the figure, you might be wondering why a model is needed that converts the input data into output data that is merely "as close as possible" to the input; it seems rather useless if all it outputs is itself. as discussed, the latent space contains a highly compressed representation of the input data, which is the only information the decoder can use to reconstruct the input as faithfully as possible. the magic happens by interpolating between points and performing vector arithmetic between points in latent space. these transformations result in meaningful effects on the generated images. as dimensionality is reduced, information which is distinct to each image is discarded from the latent space representation, since only the most important information of each image can be stored in this low-dimensional space. the latent space captures the structure in the data and usually offers some semantically meaningful interpretation. this semantic meaning is, however, not given a priori but has to be discovered. as already discussed, autoencoders, after learning a particular non-linear mapping, are capable of producing photo-realistic images from randomly sampled points in the latent space. the latent space concept is definitely intriguing but at the same time non-trivial to comprehend. although "latent" means hidden, understanding what is happening in latent space is not only helpful but necessary for various applications. exploring the structure of the latent space is both interesting for the problem domain and helps to develop an intuition for what has been learned and what can be regenerated. it is obvious that the latent space has to contain some structure that can be queried and navigated. however, it is non-obvious how semantics are represented within this space and how different semantic attributes are entangled with each other. to investigate the latent space, one should favor a dataset that offers a limited and distinctive feature set.
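as an illustration of the encoder/bottleneck/decoder structure described above, the following minimal sketch implements a fully connected autoencoder in pytorch. it is not taken from any of the cited works; the layer sizes, the 64-dimensional latent space, and the toy training loop are assumptions chosen only to make the idea concrete.

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim: int = 28 * 28, latent_dim: int = 64):
        super().__init__()
        # encoder: compresses the input into the low-dimensional bottleneck (latent space)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # decoder: reconstructs the input from the latent code as closely as possible
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)      # latent representation
        return self.decoder(z)   # reconstruction

# minimal training loop on random data, standing in for real images
model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

images = torch.rand(128, 28 * 28)  # placeholder batch of flattened images
for _ in range(10):
    reconstruction = model(images)
    loss = loss_fn(reconstruction, images)  # reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

after training, the interesting part is not the reconstruction itself but the latent codes produced by the encoder, which is exactly where the manipulations discussed next take place.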
therefore, faces are a good example in this regard because they share features common to most faces but offer enough variance. if aligned correctly, other meaningful representations of faces are also possible; see, for instance, the widely used approach of eigenfaces [ ] to describe the specific characteristics of faces in a low-dimensional space. in the latent space we can do vector arithmetic, and this arithmetic can correspond to particular features. for example, the vector representing the face of a smiling woman, minus the vector representing a neutral-looking woman, plus the vector representing a neutral-looking man, results in a vector representing a smiling man. this can also be done with all kinds of images; see e.g. the publication by radford et al. [ ], who first observed the vector arithmetic property in latent space. a visual example is given in the figure. please note that all images shown in this publication are produced using biggan [ ]; the photo of the author on which most of the variations are based was taken by tobias schwerdt. in latent space, vector algebra can be carried out. semantic editing requires moving within the latent space along a certain 'direction'. identifying the 'direction' of only one particular characteristic is non-trivial, since editing one attribute may affect others because they are correlated. this correlation can be attributed to some extent to pre-existing correlations in 'the real world' (e.g. old persons are more likely to wear eyeglasses) or to bias in the training dataset (e.g. more women than men are smiling on photos). to identify the semantics encoded in the latent space, shen et al. proposed a framework for interpreting faces in latent space [ ]. beyond the vector arithmetic property, their framework allows decoupling some entangled attributes (remember the aforementioned correlation between old people and eyeglasses) through linear subspace projection. shen et al. found that in their dataset pose and smile are almost orthogonal to other attributes, while gender, age, and eyeglasses are highly correlated with each other. disentangled semantics enable precise control of facial attributes without retraining the given model. in our examples, faces are varied according to gender or age (see the figures). it has been widely observed that, when linearly interpolating between two points in latent space, the appearance of the corresponding synthesized images 'morphs' continuously from one face to another. this implies that the semantic meaning contained in the two images also changes gradually, in stark contrast to a simple fading between two images in image space. it can be observed that the shape and style slowly transform from one image into the other, which demonstrates how well the latent space captures the structure and semantics of the images. other examples are given in a later section. even though our analysis has focused on face editing for the reasons discussed earlier, it also holds true for other domains. for instance, bau et al. [ ] generated living rooms using similar approaches. they showed that some units from intermediate layers of the generator are specialized to synthesize certain visual concepts such as sofas or tvs. so far we have discussed how autoencoders can connect the latent space and the image semantic space, as well as how the latent code can be used for image editing without influencing the image style.
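the vector arithmetic and interpolation operations discussed above reduce to simple algebra on latent codes. the sketch below is purely illustrative: it assumes a hypothetical pretrained decoder or generator `decode(z)` mapping latent vectors to images (for instance the decoder of the autoencoder sketched earlier, or a gan generator such as biggan), and the latent codes are stand-ins obtained by encoding example images.

```python
import numpy as np

def interpolate(z_start: np.ndarray, z_end: np.ndarray, steps: int = 8) -> np.ndarray:
    """Linear interpolation between two latent codes; decoding each row
    morphs one synthesized image gradually into the other."""
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - alphas) * z_start + alphas * z_end

# hypothetical latent codes, standing in for codes obtained by encoding real photos
z_smiling_woman = np.random.randn(64)
z_neutral_woman = np.random.randn(64)
z_neutral_man = np.random.randn(64)

# vector arithmetic: "smiling woman" - "neutral woman" + "neutral man" ~ "smiling man"
z_smiling_man = z_smiling_woman - z_neutral_woman + z_neutral_man

# path of latent codes whose decoded images change semantics gradually
path = interpolate(z_neutral_woman, z_neutral_man)
# images = [decode(z) for z in path]   # decode() is an assumed pretrained generator
```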
next, we want to discuss how this can be used for artistic expression. while in the former section we have seen how to use manipulations in the latent space to generate mathematically sound operations, not much artistic content has been generated: just variations of photography-like faces. imprecision in ai systems can lead to unacceptable errors and even result in deadly decisions, e.g. in autonomous driving or cancer treatment. in the case of artistic applications, errors or glitches might lead to interesting, non-intended artifacts. whether those errors or glitches are treated as a bug or a feature lies in the eye of the artist. to create more variation in the generated output, some artists randomly introduce glitches within the autoencoder. due to the complex structure of the autoencoder, these glitches (assuming that they are introduced at an early layer in the network) occur on a semantic level, as already discussed, and might cause the models to misinterpret the input data in interesting ways. some could even be interpreted as glimpses of autonomous creativity; see for instance the artistic work 'mistaken identity' by mario klingemann [ ]. so far the latent space is explored by humans, either by random walk or by intuitively steering into a particular direction. it is up to human decisions whether the synthesized image of a particular location in latent space produces a visually appealing or otherwise interesting result. the question arises where to find those places and whether those places can be spotted by an automated process. the latent space is usually defined as a d-dimensional space in which the data is assumed to follow a multivariate gaussian distribution n(0, i_d) [ ]. therefore, the mean representation of all images lies in the center of the latent space. but what does that mean for the generated results? it is said that "beauty lies in the eyes of the beholder"; however, research shows that there is a common understanding of beauty. for instance, averaged faces are perceived as more beautiful [ ]. adapting these findings to the latent space, let us assume that the most beautiful images (in our case faces) can be found in the center of the space. particular deviations from the center stand for local sweet spots (e.g. female and male, ethnic groups). these types of sweet spots can be found by common means of data analysis (e.g. clustering). but where are the interesting local sweet spots when it comes to artistic expression? the figure demonstrates some variation in style within the latent space. of course, one can search for locations in the latent space where particular artworks from a given artist or particular art styles are located; see e.g. the figure where the styles of different artists, as well as white noise, have been used for adaptation. but isn't lingering around these sweet spots only producing "more of the same"? how can we find the local sweet spots which define a new art style and can be deemed truly creative? or do those discoveries of new art styles lie outside of the latent space, because the latent space is trained within a particular set of defined art styles and can therefore produce only interpolations of those styles but nothing conceptually new? so far we have discussed how ai can help to generate different variations of faces and where to find visually interesting sweet spots. in this section, we want to show how ai supports the creation process by applying the discussed techniques to other areas of image and object processing.
probably the most popular approach, at least judging by the mass media, are the different variations of image-to-image translation. the most prominent example is style transfer: the capability to transfer the style of one image to draw the content of another (examples are shown in the figure). but mapping an input image to an output image is also possible for a variety of other applications, such as object transfiguration (e.g. horse-to-zebra, apple-to-orange), season transfer (e.g. summer-to-winter), or photo enhancement [ ]. while some of the systems just mentioned are not yet in a state to be widely applicable, ai tools are taking over and gradually automating design processes which used to be time-consuming manual processes. indeed, the greatest potential for ai in art and design is seen in its application to tedious, uncreative tasks such as coloring black-and-white images [ ]. marco kempf and simon zimmerman used ai in their work dubbed 'deepworld' to generate a compilation of 'artificial countries', using data of all existing countries to generate new anthems, flags, and other descriptors [ ]. roman lipski uses an ai muse (developed by florian dohmann et al.) to foster his inspiration [ ]. because the ai muse is trained only on the artist's previous drawings and fed with the current work in progress, it suggests image variations in line with roman's taste. cluzel et al. have proposed an interactive genetic algorithm to progressively sketch the desired side view of a car profile [ ]; here the user takes on the role of the fitness function through interaction with the system. the chair project [ ] is a series of four chairs co-designed by ai and human designers. the project explores a collaborative creative process between humans and computers: it used a gan to propose new chairs, which were then 'interpreted' by trained designers to resemble a chair. deep-wear [ ] is a method using deep convolutional gans for clothes design. the gan is trained on features of brand clothes and can generate images that are similar to actual clothes; a human interprets the generated images and tries to manually draw the corresponding pattern needed to make the finished product. li et al. [ ] introduced an artificial neural network for encoding and synthesizing the structure of 3d shapes, which, according to their findings, are effectively characterized by their hierarchical organization. german et al. [ ] have applied different ai techniques, trained on a small sample set of bottle shapes, to propose novel bottle-like shapes. the evaluation of their proposed methods revealed that they can be used by trained designers as well as non-designers to support the design process in different phases, and that they could lead to novel designs not intended or foreseen by the designers. for decades, ai has fostered (often false) future visions ranging from transhumanist utopia to "world run by machines" dystopia. artists and designers explore solutions concerning the semiotic, the aesthetic, and the dynamic realm, as well as confronting corporate, industrial, cultural, and political aspects. the relationship between the artist and the artwork is directly connected through their intentions, although currently mediated by third parties and media tools. understanding the ethical and social implications of ai-assisted creation is becoming a pressing need. the implications, each of which has to be investigated in more detail in the future, include:
- bias: ai systems are sensitive to bias.
as a consequence, the ai is not a neutral tool but has pre-decoded preferences. biases relevant to creative ai systems are:
• algorithmic bias, which occurs when a computer system reflects the implicit values of the humans who created it; e.g. the system is optimized on dataset a and later retrained on dataset b without reconfiguring the neural network (this is not uncommon, as many people do not fully understand what is going on in the network but are able to use the given code to run training on other data).
• data bias, which occurs when the samples are not representative of the population of interest.
• prejudice bias, which results from cultural influences or stereotypes that are reflected in the data.
- art crisis: for a long time, painting served as the primary method for visual communication and was a widely and highly respected art form. with the invention of photography, painting began to suffer an identity crisis, because painting, in its form at that time, was not able to reproduce the world as accurately and with as little effort as photography. as a consequence, visual artists had to change to different forms of representation not possible with photography, inventing art styles such as impressionism, expressionism, cubism, pointillism, constructivism, surrealism, up to abstract expressionism. now that ai can perfectly simulate those styles, what will happen to the artists? will artists still be needed, will they be replaced by ai, or will they have to turn to other artistic work which cannot yet be simulated by ai?
- inflation: similar to the flood of images which has already reached us, the same can happen with ai art. because of the glut, nobody values or looks at the images anymore.
- wrong expectations: only aesthetically appealing or otherwise interesting or surprising results are published, which can be attributed to effects similar to the well-known publication bias [ ] in other areas. eventually, this leads to wrong expectations of what is already possible with ai. in addition, this misunderstanding is fueled by content claimed to be created by ai but which has in fact been produced, or at least reworked, either by human labor or by methods not containing ai.
- unequal judgment: even though the emotions raised by viewing artworks emerge from the underlying structure of the works, people also include the creation process in their judgment (in the cases where they know about it). frequently, upon learning that a computer or an ai has created the artwork, people find it boring, with no guts, no emotion, no soul, while before it was inspiring, creative, and beautiful.
- authorship: the authorship of ai-generated content has not been clarified. for instance, does the authorship of a novel song composed by an ai trained exclusively on songs by johann sebastian bach belong to the ai, the developer/artist, or bach? see e.g. [ ] for a more detailed discussion.
- trustworthiness: new ai-driven tools make it easy for non-experts to manipulate audio and/or visual media. thus, image, audio, and video evidence is no longer trustworthy. manipulated images, audio, and video lead to fake information, truth skepticism, and claims that real audio/video footage is fake (known as the liar's dividend) [ ].
the potential of ai in creativity has only just begun to be explored.
we have investigated the creative power of ai, which is represented, though not exclusively, in the semantically meaningful representation of data in a dimensionally reduced space, dubbed the latent space, from which images, but also audio, video, and 3d models can be synthesized. ai is able to imagine visualizations that lie between everything the ai has learned from us and far beyond, and it might even develop its own art styles (see e.g. deep dream [ ]). however, ai still lacks intention and is just processing data. these novel ai tools are shifting the creative process from crafting to generating and selecting, a process which cannot yet be transferred to machine judgment alone. however, ai can already be employed to find possible sweet spots or to make suggestions based on the learned taste of the artist [ ]. ai is without any doubt changing the way we experience art and the way we do art. doing art is shifting from handcrafting to exploring and discovering. this leaves humans more in the role of a curator instead of an artist, but it can also foster creativity (as discussed before in the case of roman lipski) or reduce the time between intention and realization. it has the potential, just as many other technical developments, to democratize creativity, because handcrafting skills are no longer as necessary to express one's own ideas. widespread misuse (e.g. image manipulation to produce fake pornography) can limit social acceptance and requires ai literacy. as human beings, we have to ask ourselves whether our feelings are wrong just because the ai never felt during its creation process as we do. or should we not worry too much and simply enjoy the new artworks, no matter whether they are created by humans, by ai, or as a co-creation between the two?
the project [ ] aims to design and implement a machine learning system for generating prediction models for quality checks and reducing faulty products in manufacturing processes. it is based on an industrial case study in cooperation with sick ag. we will present first results of the project concerning a new process model for cooperating data scientists and quality engineers, a product testing model as a knowledge base for machine learning, and visual support for quality engineers in order to explain prediction results. a typical production line consists of various test stations that conduct several measurements. those measurements are processed by the system on the fly to point out problematic products. among the many challenges, one focus of the project is on support for quality engineers. the preparation of prediction models is usually done by data scientists, but the demand for data scientists increases too fast when a large number of products, production lines, and changing circumstances have to be considered. hence, software is needed which quality engineers can operate directly in order to leverage the results from prediction models. based on quality management and data science standard processes [ ] [ ], we created a reference process model for production error detection and correction which includes the needed actors and their associated tasks. with the ml system and data scientist assistance, we support the quality engineer in his work. to support the ml system, we developed a product testing model which includes crucial information about a specific product. in this model we describe the relation to product-specific features, test systems, production line sequences, etc.
the idea behind this is to provide metadata information which in turn is used by the ml system instead of individual script solutions for each product. an ml model with good predictions often lacks information about its internal decisions. therefore, it is beneficial to support the quality engineer with useful feature visualizations. by default, we support the quality engineer with feature plots and histograms in which the error distribution is visualized. on top of this, we developed further feature importance measures based on shap values [ ]. these can be used to gain deeper insight into particular ml decisions and into significant features which are ranked lower by standard feature importance measures.
medicine is a highly empirical discipline, where important aspects have to be demonstrated using adequate data and sound evaluations. this is one of the core requirements that were emphasized during the development of the medical device regulation (mdr) of the european union (eu) [ ]. this applies to all medical devices, including mechanical and electrical devices as well as software systems. also, the us food & drug administration (fda) has recently focused discussions on using data to demonstrate the safety and efficacy of medical devices [ ]. beside pure approval steps, it fosters the use of data for the optimization of products, as nowadays more and more data can be acquired using modern it technology. in particular, it pursues the use of real-world evidence, i.e. data that is collected throughout the lifetime of a device, for demonstrating improved outcomes [ ]. such approaches require the use of sophisticated data analysis techniques. beside classical statistics, artificial intelligence (ai) and machine learning (ml) are considered powerful techniques for this purpose, and they currently gain more and more attention. these techniques allow the detection of dependencies in complex situations, where the inputs and/or outputs of a problem have high-dimensional parameter spaces. this can, for example, be the case when extensive data is collected from diverse clinical studies or from treatment protocols at local sites. furthermore, ai/ml based techniques may be used in the devices themselves. for example, devices may be developed which are considered to improve complex diagnostic tasks or to find individualized treatment options for specific medical conditions (see e.g. [ , ] for an overview). for some applications, it has already been demonstrated that ml algorithms are able to outperform human experts with respect to specific success rates (e.g. [ , ]). in this paper, it will be discussed how ml based techniques can be brought onto the market, including an analysis of appropriate regulatory requirements. for this purpose, the main focus lies on ml based devices applied in the intensive care unit (icu), as e.g. proposed in [ , ]. the need for specific regulatory requirements comes from the observation that ai/ml based techniques pose specific risks which need to be considered and handled appropriately. for example, ai/ml based methods are more challenging w.r.t. bias effects, reduced transparency, vulnerability to cybersecurity attacks, or general ethical issues (see e.g. [ , ]). in particular cases, ml based techniques may lead to noticeably critical results, as has been shown for the ibm watson for oncology device. in [ ], it was reported that the direct use of the system in particular clinical environments resulted in critical treatment suggestions.
the characteristics of ml based systems have led to various discussions about their reliability in the clinical context, and appropriate ways have to be found to guarantee their safety and performance (cf. [ ]). this applies to the field of medicine / medical devices as well as to ai/ml based techniques in general; the latter was e.g. approached by the eu in its ethics guidelines for trustworthy ai [ ]. driven by this overall development, the fda started a discussion regarding an extended use of ml algorithms in samd (software as a medical device) with a focus on quicker release cycles. in [ ], it pursued the development of a specific process which makes it easier to bring ml based devices onto the market and also to update them during their lifecycle. current regulations for medical devices, e.g. in the us or the eu, do not provide specific guidelines for ml based devices. in particular, this applies to systems which continuously collect data in order to improve the performance of the device. current regulations focus on a fixed status of the device, which may only be adapted to a minor extent after the release. usually, a new release or clearance by the authority is required when the clinical performance of a device is modified. but continuously learning systems want to perform exactly such improvement steps, using additional real-world data from daily applications, without extra approvals (see fig. : basic approaches for ai/ml based medical devices; left side: classical approach, where the status of the software has to be fixed after the release / approval stage; right side: continuously learning system, where data is collected during the lifetime of the device without a separate release / approval step, so that an automatic validation step has to guarantee proper safety and efficacy). in [ ], the fda made suggestions how this could be addressed. it proposed the definition of so-called samd pre-specifications (sps) and an algorithm change protocol (acp), which are considered major tools for dealing with modifications of the ml based system during its lifetime. within the sps, the manufacturer has to define the anticipated changes which are considered to be allowed during the automatic update process. in addition, the acp defines the particular steps which have to be implemented to realize the sps specifications. see [ ] for more information about sps and acp. however, the details are not yet well elaborated by the fda at the moment, and the fda has requested suggestions in this respect. in particular, these tools serve as a basis for performing an automated validation of the updates. the applicability of this approach depends on the risk of the samd. in [ ], the fda uses the risk categories from the international medical device regulators forum (imdrf) [ ]. these include the categories state of healthcare situation or condition (critical vs. serious vs. non-critical) and significance of information provided by the samd to the healthcare decision (treat or diagnose vs. drive clinical management vs. inform clinical management) as the basic attributes. according to [ ], the regulatory requirements for the management of ml based systems are considered to depend on this classification as well as on the particular changes which may take place during the lifetime of the device. the fda categorizes them as changes in performance, inputs, and intended use. such anticipated changes have to be defined in the sps in advance.
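to make the notion of an sps/acp-style automated validation more concrete, the following sketch shows what such a gate could look like in code: pre-specified performance thresholds are compared against metrics computed on a held-out clinical evaluation set before an updated model is accepted. this is purely illustrative and not part of the fda proposal; the threshold values, the function and class names, and the use of scikit-learn metrics are assumptions.

```python
from dataclasses import dataclass
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

@dataclass
class PreSpecification:
    """Hypothetical SPS: minimum performance an updated model must reach."""
    min_sensitivity: float = 0.90
    min_specificity: float = 0.85
    min_auc: float = 0.90

def automated_validation(y_true: np.ndarray, y_score: np.ndarray,
                         sps: PreSpecification, threshold: float = 0.5) -> bool:
    """Simplified ACP step: accept the model update only if all
    pre-specified metrics are met on the held-out evaluation data."""
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    auc = roc_auc_score(y_true, y_score)
    return (sensitivity >= sps.min_sensitivity
            and specificity >= sps.min_specificity
            and auc >= sps.min_auc)

# usage with placeholder data standing in for a held-out ICU evaluation set
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(y_true * 0.6 + rng.random(1000) * 0.5, 0, 1)
accept_update = automated_validation(y_true, y_score, PreSpecification())
```

as the analysis below argues, passing such purely technical thresholds does not by itself demonstrate a valid clinical association, nor does it cover risks arising from man-machine interaction.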
the main purpose of the present paper is to discuss the validity of the described fda approach for enabling continuously learning systems. it therefore uses a scenario-based technique to analyze whether validation in terms of sps and acp can be considered an adequate tool. the scenarios represent applications of ml based devices in the icu. the paper checks the consistency of the approach with other important regulatory requirements and analyzes pitfalls which may jeopardize the safety of the devices. additionally, it discusses whether more general requirements, as e.g. proposed in ethical guidelines for ai based systems like [ , ], can be sufficiently addressed in the scenarios. this is not considered a comprehensive analysis of these topics, but an addition to current discussions about risks and ethical issues, as they are e.g. discussed in [ , ]. finally, the paper proposes its own suggestions to address the regulation of continuously learning ml based systems. again, this is not considered a full regulatory strategy, but a proposal of particular requirements which may overcome some of the current limitations of the approach discussed in [ ]. the overall aim of this paper is to contribute to a better understanding of the options and challenges of ai/ml based devices on the one hand, and to enable the development of best practices and appropriate regulatory strategies in the future on the other. within this paper, the analysis of the fda approach proposed in [ ] is performed using specific reference scenarios from icu applications, which are taken from [ ] itself. the focus lies on ml based devices which allow continuous updates of the model according to data collected during the lifetime of the device. in this context, sps and acp are considered crucial steps which allow an automated validation of the device based on specified measures. in particular, the requirements and limitations of such an automated validation are analyzed and discussed, including the following topics / questions:
- is automated validation reasonable for these cases? what are the limitations / potential pitfalls of such an approach when applied in the particular clinical context?
- which additional risks could apply to ai/ml based samd in general, going beyond the existing discussions in the literature as e.g. presented in [ , , ]?
- how should such issues be taken into account in the future? what could be appropriate measures / best practices to achieve reliability?
the following exemplary scenarios are used for this purpose.
- base scenario icu: an ml based intensive care unit (icu) monitoring system where the detection of critical situations (e.g. regarding physiological instability, potential myocardial infarcts, or sepsis) is addressed by using ml. using auditory alarms, the icu staff is informed to initiate appropriate measures to treat the patients in these situations. this scenario addresses a 'critical healthcare situation or condition' and is considered to 'drive clinical management' (according to the risk classification used in [ ]).
- modification "locked": the icu scenario as presented above, where the release of the monitoring system is done according to a locked state of the algorithm.
- modification "cont-learn": the icu scenario as presented above, where the detection of alarm situations is continuously improved according to data acquired during daily routine, including adaptation of the performance to sub-populations and/or characteristics of the local environment.
in this case, the sps and acp have to define standard measures like success rates of alarms/detection as well as requirements for the management of data, the update of the algorithm, and the labeling. more details of such requirements are discussed later. this scenario was presented as scenario a in [ ] with minor modifications. this section provides the basic analysis of the scenarios according to the particular aspects addressed in this paper. it covers the topics of automated validation, man-machine interaction, explainability, bias effects and confounding, fairness and non-discrimination, as well as corrective actions for systematic deficiencies. according to standard regulatory requirements [ , , ], validation is a core step in the development and for the release of medical devices. according to [ ], a change in the performance of a device (including an algorithm in a samd) as well as a change in particular risks (e.g. new risks, but also a new risk assessment or new measures) usually triggers a new premarket notification (510(k)) for most of the devices which get onto the market in the us. thus, such situations require an fda review for clearance of the device. for samd, this requires an analytical evaluation, i.e. correct processing of input data to generate accurate, reliable, and precise output data. additionally, a clinical validation as well as the demonstration of a valid clinical association need to be provided [ ]. this is intended to show that the outputs of the device work appropriately in the clinical environment, i.e. have a valid association regarding the targeted clinical condition and achieve the intended purpose in the context of clinical care [ ]. thus, based on the current standards, a device with continuously changing performance usually requires a thorough analysis regarding its validity. this is one of the main points where [ ] proposes to establish a new approach for the "cont-learn" cases. as already mentioned, sps and acp basically have to be considered as tools for automated validation in this context. within this new approach, the manual validation step is replaced by an automated process with only reduced or even no additional control by a human observer. thus, it may work as an automated or fully automatic, closed-loop validation approach. the question is whether this change can be considered an appropriate alternative. in the following, this question is addressed using the icu scenario, with a main focus on the "cont-learn" case. some of the aspects also apply to the "locked" case, but the impact is considered to be higher in the "cont-learn" situation, since the validation step has to be performed in an automated fashion and human oversight, which is usually considered important, is not included during the particular updates. within the icu scenario, the validation step has to ensure that the alarm rates stay at a sufficiently high level regarding standard factors like specificity, sensitivity, area under the curve (auc), etc. basically, these are technical parameters which can be analyzed in an analytical evaluation as discussed above (see also [ ]). this could also be applied to situations where continuous updates are made during the lifecycle of the device, i.e. in the "cont-learn" case. however, there are some limitations of the approach. on the one hand, it has to be ensured that this analysis is sound and reliable, i.e. that it is not compromised by statistical effects like bias or other deficiencies in the data.
on the other hand, it has to be ensured that the success rates really have a valid clinical association and can be used as a sole criterion for measuring the clinical impact. thus, the relationship between pure success rates and clinical effects has to be evaluated thoroughly, and there may be some major limitations. one major question in the icu scenario is whether better success rates really guarantee a higher, or at least sufficient, level of clinical benefit. this is not innately given. for example, a higher success rate of the alarms may still have a negative effect when the icu staff relies more and more on the alarms and subsequently reduces attention. thus, it may be the case that the initiation of appropriate treatment steps is compromised even though the actually occurring alarms seem to be more reliable. in particular, this may apply in situations where the algorithms are adapted to local settings, as in the "cont-learn" scenario. here, the ml based system is intended to be optimized for sub-populations in the local environment or for specific treatment preferences at the local site. due to habituation effects, the staff's expectations become aligned to the algorithm's behavior to a certain degree after a period of time. but when the algorithm changes, or an employee from another hospital or department takes over duties in the local unit, the reliability of the alarms may be affected. in these cases, it is not clear whether the expectations are well aligned with the current status of the algorithm, either in the positive or the negative direction. since the data updates of the device are intended to improve its performance w.r.t. detection rates, it is clear that significant effects on user interaction may happen. under some circumstances, the overall outcome in terms of the clinical effect may be impaired. the evaluation of such risks has to be addressed during validation. it is questionable whether this can be performed by using an automatic validation approach which focuses on alarm rates but does not include an assessment of the associated risks. at least a clear relationship between these two aspects has to be demonstrated in advance. it is also unclear whether this could be achieved by assessing pure technical parameters which are defined in advance as required by the sps and acp. usually, ml based systems are trained for a specific scenario. they provide a specific solution for this particular problem, but they do not have a more general intelligence and cannot reason about potential risks which were not under consideration at that point in time. such a more general intelligence can only be provided by human oversight. in general, it is not clear whether technical aspects like alarms lead to valid reactions by the users. in technical terms, alarm rates are basically related to the probability of occurrence of specific hazardous situations, but they do not address a full assessment of the occurrence of harm. however, this is pivotal for risk assessment in medical devices, in particular for risks related to potential use errors. this is considered to be one of the main reasons why a change in risk parameters triggers a new premarket approval in the us according to [ ]. also, the mdr [ ] sets high requirements to address the final clinical impact and not only technical parameters. basically, the example emphasizes the importance of considering the interaction between man and machine, or in this case, between the algorithm and its clinical environment.
this is addressed in the usability standards for medical devices, e.g. the relevant iso standard [ ]. for this reason, this standard requires that the final (summative) usability evaluation is performed using the final version of the device (in this case, the algorithm) or an equivalent version. this is in conflict with the fda proposal, which allows this assessment to be performed based on previous versions. at most, a predetermined relationship between technical parameters (alarm rates) and clinical effects (in particular, use-related risks) can be obtained. for the usage of ml based devices, it remains crucial to consider the interaction between the device and the clinical environment, as there usually are important interrelationships. the outcome of an ml based algorithm always depends on the data it is provided with. whenever a clinically relevant input parameter is omitted, the resulting outcome of the ml based system is limited. in the presented scenarios, the pure alarm rates may not be the only clinically relevant outcomes. nevertheless, such parameters are usually the main focus regarding the quality of algorithms, e.g. in publications about ml based techniques. this is due to the fact that such quality measures are commonly considered the best available objective parameters, which allow a comparison of different techniques. this applies even more to other ml based techniques which are also very popular in the scientific community, like segmentation tasks in medical image analysis. here the standard quality measures are general distance metrics, i.e. differences between segmented areas [ ]. they usually do not include specific clinical aspects like the accuracy in specific risk areas, e.g. important blood vessels or nerves. but such aspects are key factors for ensuring the safety of a clinical procedure in many applications. again, only technical parameters are typically in focus; the association with the clinical effects is not assessed accordingly. this situation is depicted in fig. for the icu as well as the image segmentation case. additionally, the validity of an outcome in medical treatments depends on many factors. regarding input data, multiple parameters from a patient's individual history may be important for deciding about a particular diagnosis or treatment. a surgeon usually has access to a multitude of data and also to side conditions (like socio-economic aspects) which should be included in an individual diagnosis or treatment decision. his general intelligence and background knowledge allow him to include a variety of individual aspects which have to be considered for a specific case-based decision. in contrast, ml based algorithms rely on a more standardized structure of input data and are only trained for a specific purpose. they lack a more general intelligence which would allow them to react in very specific situations. even more, ml based algorithms need to generalize and thus to mask out very specific conditions, which could be fatal in some cases. in [ ], the fda presents some examples where changes of the inputs of an ml based samd are included. it is surprising that the fda considers some of them as candidates for a continuous learning system which does not need an additional review when a tailored sps/acp is available. such discrepancies between technical outcomes and clinical effects also apply to situations like the icu scenario, which only informs or drives clinical management. often users rely on automatically provided decisions, even when they are informed that this is only a proposal.
again, this is a matter of man-machine interaction. this gets even worse due to the lack of explainability which ml based algorithms typically have [ , ]. when surgeons or users in general (e.g. icu staff) detect situations which require a diverging treatment because of very specific individual conditions, they should overrule the algorithm. but users will often be confused by the outcome of the algorithm and do not have a clear idea how they should treat conflicting results between the algorithm's suggestions and their own belief. as long as the ml based decision is not transparent to the user, they will not be able to merge these two directions. the ibm watson example referenced in the introduction shows that this actually is an issue [ ]. this may be even more serious when the users (i.e. healthcare professionals) fear litigation because they did not trust the algorithm: in a situation where the algorithm's outcome finally turns out to be true, they may be sued because of this documented deviation. because of such issues, the eu general data protection regulation (gdpr) [ ] requires that the users get autonomy regarding their decisions and transparency about the mechanisms underlying the algorithm's outcome [ ]. this may be less relevant for the patients, who usually have only limited medical knowledge and will probably also not understand the medical decisions in conventional cases. but it is highly relevant for the responsible healthcare professionals. they require basic insights into how the decision emerged, as they are ultimately in charge of the treatment. this demonstrates that methods regarding the explainability of ml based techniques are important. fortunately, this is currently a very active field [ , ]. this need for explainability applies to locked algorithms as well as to situations where continuous learning is applied. due to their data-driven nature, ml based techniques highly depend on a very high quality of the data provided for learning and validation. in particular, this is important for the analytical evaluation of the ml algorithms. one of the major aspects are bias effects due to unbalanced input data. for example, in [ ] a substantially different detection rate between white people and people of color was recognized due to unbalanced data. beside ethical considerations, this demonstrates dependencies of the outcome quality on sub-populations, which may be critical in some cases. nevertheless, the fda proposal [ ] currently does not consistently include specific requirements for assessing bias factors or imbalance of data. however, high quality requirements for data management are crucial for ml based devices. in particular, this applies to the icu "cont-learn" cases. there have to be very specific protocols that guarantee that new data and updates of the algorithms are highly reliable w.r.t. bias effects. most of the currently used ml based algorithms fall under the category of supervised learning. thus, they require accurate and clinically sound labeling of the data. during the data collection, it has to be ensured how this labeling is performed and how the data can be fed back into the system in a "cont-learn" scenario. additionally, the data needs to stay balanced, whatever this means in a situation where adaptations to sub-populations and/or local environments are intended for optimization. it is unclear whether and how this could be achieved by staff who only operate the system but possibly do not know the potential algorithmic pitfalls.
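bias checks of the kind discussed above can at least partly be supported by simple audits. the short sketch below illustrates one possible check: computing the detection (alarm) rate among true critical events per subgroup and flagging large gaps. the column names, the grouping attribute, the tolerance, and the function names are hypothetical; such a check complements, but does not replace, a review by ml/ai experts and healthcare professionals.

```python
import pandas as pd

def subgroup_detection_rates(df: pd.DataFrame, group_col: str = "subgroup",
                             label_col: str = "critical_event",
                             pred_col: str = "alarm") -> pd.Series:
    """Sensitivity (alarm rate among true critical events) per subgroup."""
    events = df[df[label_col] == 1]
    return events.groupby(group_col)[pred_col].mean()

def flag_imbalance(rates: pd.Series, max_gap: float = 0.05) -> bool:
    """Flag the model update for review if subgroup sensitivities differ by more than max_gap."""
    return (rates.max() - rates.min()) > max_gap

# usage with a hypothetical evaluation dataframe
data = pd.DataFrame({
    "subgroup": ["a", "a", "b", "b", "b", "a"],
    "critical_event": [1, 1, 1, 1, 0, 0],
    "alarm": [1, 0, 1, 1, 0, 0],
})
rates = subgroup_detection_rates(data)
needs_review = flag_imbalance(rates)
```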
in the icu scenario, many data points probably need to be recorded by the system itself. thus, a precise and reliable recording scheme has to be established which automatically avoids imbalance of the data on the one hand and handles the fusion with manual labelings on the other hand. basically, the sps and acp (proposed in [ ]) are tools to achieve this. the question is whether this is possible in a reliable fashion using automated processes. a complete closed-loop validation approach seems questionable, especially when the assessment of the clinical impact has to be included. thus, the integration of humans, including adequate healthcare professionals as well as ml/ai experts with sufficient statistical knowledge, seems reasonable. at the very least, bias assessment steps should be included. as already mentioned, this is not addressed in [ ] in a dedicated way. furthermore, the outcomes may be compromised by side effects in the data. it may be the case that the main reason for a specific outcome of the algorithm is not a relevant clinical parameter but a specific data artifact, i.e. some confounding factor. in the icu case, it could happen that the icu staff reacts early to a potentially critical situation and, e.g., gives specific medication in advance to prevent upcoming problems. the physiological reaction of the patient can then be visible in the data as some kind of artifact. during its learning phase, the algorithm may recognize the critical situation not based on a deeper clinical reason, but by detecting the physiological reaction pattern. this may cause serious problems, as shown subsequently. in the presented scenario, the definition of the clinical situation and the pattern can be deeply coupled by design, since the labeling of the data by the icu staff and the administration of the medication will probably be done in combination at the particular site. this may increase the probability of such effects. usually, confounding factors are hard to determine. even when they can be detected, they are hard to communicate and manage in an appropriate way. how should healthcare professionals react when they get such potentially misleading information (see the discussion about liability above)? this further limits the explanatory power of ml based systems. when confounders are not detected, they may have unpredictable outcomes w.r.t. the clinical effects. for example, consider the following case. in the icu scenario, an ml based algorithm is trained in a way that it basically detects the medication artifact described above during the learning phase. in the next step, this algorithm is used in clinical practice and the icu staff relies on the outcome of the algorithm. then, on the one hand, the medication artifact is not visible unless the icu staff administers the medication; on the other hand, the algorithm does not recognize the pattern and thus does not provide an alarm. subsequently, the icu staff does not act appropriately to manage the critical situation. in particular, such confounders may be more likely in situations where a strong dependence between the outcome of the algorithm and the clinical treatment exists. further examples of such effects were discussed in [ ] for icu scenarios. the occurrence of confounders may be a bit less probable in pure diagnostic cases without influence of the diagnostic task on the generation of the data. but even here, such confounding factors may occur. the discussion in [ ] provided examples where confounders may occur in diagnostic cases, e.g.
because of rulers placed for measurements on radiographs. in most of the publications about ml based techniques, such side effects are not discussed, or only in a limited fashion. in many papers, the main focus is the technical evaluation and not the clinical environment or the interrelation between technical parameters and clinical effects. additional important aspects which are amply discussed in the context of ai/ml based systems are discrimination and fairness (see e.g. [ ]). in particular, the eu places a high priority on fairness requirements in its future ai/ml strategy [ ]. fairness is often closely related to bias effects, but it extends to more general ethical questions, e.g. regarding the natural tendency of ml based systems to favor specific subgroups. for example, the icu scenario "cont-learn" is intended to optimize w.r.t. the specifics of sub-populations and local characteristics, i.e. it tries to make the outcome better for specific groups. based on such optimization, other groups (e.g. minorities, underrepresented groups) which are not well represented may be discriminated against in some sense. this is not a statistical but a systematic effect. superiority of a medical device for a specific subgroup (e.g. gender, social environment, etc.) is not uncommon. for example, some diagnostic steps, implants, or treatments achieve deviating success rates when applied to women in comparison to men. this also applies to differences between adults and children. when assessing bias in the clinical outcome of ml based devices, it will probably often be unclear whether this is due to an imbalance of the data or a true clinical difference between the groups. does an ml based algorithm have to adjust the treatment of a subgroup to a higher level, e.g. a better medication, to achieve comparable results when the analysis reveals worse results for this subgroup? another example could be a situation where the particular group does not have the financial capabilities to afford the high-level treatment. this could e.g. be the case in a developing country or in subgroups with a lower insurance level. in these cases, the inclusion of socio-economic parameters into the analysis seems unavoidable. subsequently, this compromises the notion of fairness as a basic principle in some way. this is not genuinely specific to ml based devices. but in the case of ml based systems with a high degree of automation, the responsibility for the individual treatment decision shifts more and more from the healthcare professional to the device; it is implicitly defined in the ml algorithm. in comparison to human reasoning, which allows some leeway in terms of individual adjustments of general rules, ml based algorithms are rather deterministic / unique in their outcome. for a fixed input, they have one dedicated outcome (if we neglect statistical algorithms which may allow minor deviations). differences of opinion and room for individual decisions are main aspects of ethics. thus, it remains unclear how fairness can be defined and implemented at all when considering ml based systems. this is even more challenging as socio-economic aspects (even more than clinical aspects) are usually not included in the data and analysis of ml based techniques in medicine. additionally, they are hard to assess and implement in a fair way, especially when using automated validation processes. another disadvantage of ml based devices is the limited opportunity to fix systematic deficiencies in the outcome of the algorithm.
let us assume that during the lifetime of the icu monitoring system a systematic deviation from the intended outcome was detected, e.g. in the context of post-market surveillance or due to an increased number of serious adverse events. according to standard rules, a proper preventive or corrective action has to be taken by the manufacturer. in conventional software devices, the error should simply be eliminated, i.e. some sort of bug fixing has to be performed. for ml based devices it is less clear how bug fixing should work, especially when the systematic deficiency is deeply hidden in the data and/or the ml model. in these cases, there usually is no clear reason for the deficiency. subsequently, the deficiency cannot be resolved in a straightforward way using standard bug fixing. there is no dedicated route to find the deeper reasons and to perform changes which could cure the deficiencies, e.g. by providing additional data or changing the ml model. moreover, other side effects may easily occur when data and model are changed manually with the intent to fix the issue. discussion and outlook: in summary, there are many open questions which are not yet clarified. there is still little experience of how ml based systems work in clinical practice and which concrete risks may occur. thus, the fda's commitment to foster the discussion about ml based samd is necessary and appreciated by many stakeholders, as the feedback docket [ ] for [ ] shows. however, it is a bit surprising that the fda proposes to substantially reduce its very high standards in [ ] at this point in time. in particular, it is questionable whether an adequate validation can be achieved by using a fully automatic approach as proposed in [ ] . ml based devices are usually optimized according to very specific goals. they can only account for the specific conditions that are reflected in the data and the optimization / quality criteria used. they do not include side conditions and a more general reasoning about potential risks in a complex environment. but this is important for medical devices. for this reason, a more deliberate path would be suitable, from the author's perspective. in a first step, more experience should be gained w.r.t. the use of ml based devices in clinical practice. thus, continuous learning should not be an option from the start. first, it should be demonstrated that a device works in clinical practice before a continuous learning approach is made possible. this could also be justified from a regulatory point of view. the automated validation process itself should be considered a feature of the device. it should be considered part of the design transfer which enables safe use of the device during its lifecycle. as part of the design transfer, it should be validated itself. thus, it has to be demonstrated that this automated validation process, e.g. in terms of the sps and acp, works in a real clinical environment. ideally, this would have been demonstrated during the application of the device in clinical practice. thus, one reasonable approach for a regulatory strategy could be to reduce or prohibit the options for enabling automatic validation in a first release / clearance of the device. during the lifetime, direct clinical data could be acquired to provide better insight into the reliability and limitations of the automatic validation / continuous learning approach. in particular, the relation between technical parameters and clinical effects could be assessed on a broader and more stable basis.
based on this evidence in real clinical environments, the automated validation feature could then be cleared in a second round. otherwise, the validity of the automated validation approach would have to be demonstrated in a comprehensive setting during the development phase. in principle, this is possible when enough data is available which truly reflects a comprehensive set of situations. as discussed in this paper, there are many aspects which do not render this approach impossible but very challenging. in particular, this applies to the clinical effects and the interdependency between the users and clinical environment on the one hand and the device, including the ml algorithm, data management, etc., on the other hand. this also includes not only variation in the status and needs of the individual patient but also the local clinical environment and potentially also the socioeconomic setting. following a consequent process validation approach, it would have to be demonstrated that the algorithm reacts in a valid and predictable way no matter which training data have been provided, which environment have to be addressed, and which local adjustments have been applied. this also needs to include deficient data and inputs in some way. in [ ] , it has been shown that the variation of outcomes can be substantial, even w.r.t. rather simple technical parameters. in [ ] , this was analyzed for scientific contests ("challenges") where renowned scientific groups supervised the quality of the submitted ml algorithms. this demonstrates the challenges validation steps for ml based systems still include, even w.r.t. technical evaluation. for these reasons, it seems adequate to pursue the regulatory strategy in a more deliberate way. this includes the restriction of the "cont-learn" cases as proposed. this also includes a better classification scheme, where automated or fully automatic validation is possible. currently, the proposal in [ ] does not provide clear rules when continuous learning is allowed. it does not really address a dedicated risk-based approach that defines which options and limitations are applicable. for some options, like the change of the inputs, it should be reviewed, whether automatic validation is a natural option. additionally, the dependency between technical parameters and clinical effects as well as risks should get more attention. in particular, the grade of interrelationship between the clinical actions and the learning task should be considered. in general, the discussions about ml based medical devices are very important. these techniques provide valuable opportunities for improvements in fields like medical technologies, where evidence based on high quality data is crucial. this applies to the overall development of medicine as well as to the development of sophisticated ml based medical devices. this also includes the assessment of treatment options and success of particular devices during their lifetime. data-driven strategies will be important for ensuring high-level standards in the future. they may also strengthen regulatory oversight in the long term by amplifying the necessity of post-market activities. this seems to be one of the promises the fda envisions according to their concepts of "total product lifecycle quality (tplc)" and "organizational excellence" [ ] . also, the mdr strengthens the requirements for data-driven strategies in the pre-as well as postmarket phase. 
but it should not shift the priorities from a basically proven-quality-in-advance (ex-ante) approach to a primarily ex-post regulation, which in the extreme boils down to a trial-and-error oriented approach. thus, we should aim at a good compromise between pushing these valuable and innovative options on the one hand and potential challenges and deficiencies on the other hand. computer-assisted technologies in medical interventions are intended to support the surgeon during treatment and improve the outcome for the patient. one possibility is to augment reality with additional information that would otherwise not be perceptible to the surgeon. in medical applications, it is particularly important that demanding spatial and temporal conditions are adhered to. challenges in augmenting the operating room are the correct placement of holograms in the real world, and thus the precise registration of multiple coordinate frames to each other, the exact scaling of holograms, and the performance capacity of processing and rendering systems. in general, two different scenarios can be distinguished. first, there are applications in which a placement of holograms with an accuracy of cm and above is sufficient. these are mainly applications where a person needs a three-dimensional view of data. an example in the medical field may be the visualization of patient data, e.g. to understand and analyse the anatomy of a patient, for diagnosis or surgical planning. the correct visualization of these data can be of great benefit to the surgeon. often only d patient data is available, such as ct or mri scans. the availability of d representations depends strongly on the field of application. in neurosurgery d views are available but often not extensively utilized due to their limited informative value. additionally, computer monitors are a big limitation because the data cannot be visualized at real-world scale. further application areas are the translation of known user interfaces into augmented reality (ar) space. the benefit here is that a surgeon refrains from touching anything, but can interact with the interface in space using hand or voice gestures. applications visualizing patient data, such as ct scans, only require a rough positioning of the image or holograms in the operating room (or). thus, the surgeon can conveniently place the application freely in space. the main requirement is then to keep the holograms in a constant position. therefore, the internal tracking of the ar device is sufficient to hold the holograms at a fixed position in space. the second scenario covers all applications in which an exact registration of holograms to the real world is required, in particular with a precision below cm. these scenarios are more demanding, especially when holograms must be placed precisely over real patient anatomy. to achieve this, patient tracking is essential to determine the position and to follow patient movements. the system therefore needs to track the patient and adjust the visualization to the current situation. furthermore, it is necessary to track and augment surgical instruments and other objects in the operating room. the augmentation needs to be visualized at the correct spatial position and time constraints need to be fulfilled. therefore, the ar system needs to be embedded into the surgical workflow and react to it. to achieve these goals, modern state-of-the-art machine learning algorithms are required.
however, the computing power of available ar devices is often not yet sufficient for sophisticated machine learning algorithms. one way to overcome this shortcoming is the integration of the ar system into a distributed system with higher capabilities, such as the digital operating theatre op:sense (see fig. ). in this work the augmented reality system holomed [ ] (see fig. ) is integrated into the surgical research platform for robot assisted surgery op:sense [ ] . the objective is to enable high-quality and patient-safe neurosurgical procedures and to improve the surgical outcome by providing surgeons with an assistance system that supports them in cognitively demanding operations. the physician's perception limits are extended by the ar system, which is based on supporting intelligent machine learning algorithms. ar glasses allow the neurosurgeon to perceive the internal structures of the patient's brain. the complete system is demonstrated by applying this methodology to the ventricular puncture of the human brain, one of the most frequently performed procedures in neurosurgery. the ventricle system has an elongated shape with a width of - cm and is located at a depth of cm inside the human head. patient models are generated fast (< s) from ct data [ ] ; they are superimposed over the patient during the operation and serve as a navigation aid for the surgeon. in this work the expanded system architecture is presented to overcome some limitations of the original system, where all information was processed on the microsoft hololens, which led to performance deficits. to overcome these shortcomings, the holomed project was integrated into op:sense for additional sensing and computing power. to achieve the integration of ar into the operating room and the surgical workflows, the patient, the instruments and the medical staff need to be tracked. to track the patient, a marker system is fixed to the patient's head and the registration from the marker system to the patient is determined. a two-stage process was implemented for this purpose. first, the rough position of the patient's head on the or table is determined by applying a yolo v net to reduce the search space. then a robot with a mounted rgb-d sensor is used to scan the acquired area and build a point cloud of it. to determine the patient's head in space as precisely as possible, a two-step surface matching approach is utilized. during recording, the markers are also tracked. with the position of the patient and the markers known, the registration matrix can be calculated. for the ventricular puncture, a solution is proposed to track the puncture catheter in order to determine the depth of insertion into the human brain. by tracking the medical staff, the system is able to react to the current situation, e.g. if an instrument is passed. in the following, the solutions are described in detail. the setup builds on our digital operating room op:sense (illustrated in fig. a). to detect the patient's head, the coarse position is first determined with the yolo v cnn [ ] , performed on the kinect rgb image streams. the position in d is determined through the depth stream of the sensors. the or table and the robots are tracked with retroreflective markers by the arttrack system. this step reduces the spatial search area for fine adjustment. the franka panda has an attached intel realsense rgb-d camera as depicted in fig. . the precise determination of the position is performed on the depth data with the two-step surface matching approach sketched below and described in detail in the following.
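the fine localization stage, a two-step surface matching consisting of a feature-based initial alignment followed by icp refinement (described in detail in the following), can be approximated by the minimal sketch below; open3d is used here as a stand-in for the pcl implementation actually employed, and the voxel size, thresholds and iteration counts are illustrative assumptions:

```python
import open3d as o3d

def register_head(ct_cloud, scan_cloud, voxel=0.005):
    """align the ct-derived reference cloud to the scanned cloud (coarse alignment + icp refinement)."""
    def preprocess(pcd):
        down = pcd.voxel_down_sample(voxel)                       # reduce resolution for speed
        down.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
        fpfh = o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100))
        return down, fpfh

    src, src_fpfh = preprocess(ct_cloud)
    dst, dst_fpfh = preprocess(scan_cloud)

    # step 1: coarse, feature-based initial alignment (ransac over fpfh correspondences)
    coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
        src, dst, src_fpfh, dst_fpfh, True, 3 * voxel,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(False), 3,
        [], o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))

    # step 2: icp refinement starting from the coarse transformation
    fine = o3d.pipelines.registration.registration_icp(
        src, dst, voxel, coarse.transformation,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return fine.transformation                                    # model -> world; invert as needed
```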
the robot scans the area of the coarsely determined position of the patient's head. a combined surface matching approach with feature-based and icp matching was implemented. the process to perform the surface matching is depicted in fig. . in clinical reality, a ct scan of the patient's head is always performed prior to a ventricular puncture for diagnosis, such that we can safely assume the availability of ct data. a process to segment the patient models from ct data was proposed by kunz et al. in [ ] . the algorithm processes the ct data extremely fast, in under two seconds. the data format is '.nrrd', a volume model format, which can easily be converted into surface models or point clouds. the point cloud of the patient's head ct scan is the reference model that needs to be found in or space. the second point cloud is recorded from the realsense camera mounted on the panda robot by scanning the previously determined rough position of the patient's head. all points are recorded in world coordinate space. the search space is further restricted with a segmentation step by filtering out points that are located on the or table. additionally, manual changes can be made by the surgeon. as a performance optimization, the resolution of the point clouds is reduced to decrease processing time without losing too much accuracy. the normals of both point clouds, generated from the ct data and from the recorded realsense depth stream, are subsequently calculated and harmonised. during this step, the harmonisation is especially important as the normals are often misaligned. this misalignment occurs because the ct data is a combination of several individual scans. for the alignment of all normals, a point inside the patient's head is chosen manually as a reference point, followed by orienting all normals in the direction of this point and subsequently inverting all normals to the outside of the head (see fig. ). after the preprocessing steps, the first surface fitting step is executed. it is based on the initial alignment algorithm proposed by rusu et al. [ ] . an implementation within the point cloud library (pcl) is used. therefore, fast point feature histograms need to be calculated as a preprocessing step. in the last step, an iterative closest point (icp) algorithm is used to refine the surface matching result. after the two point clouds have been aligned to each other, the inverse transformation matrix can be calculated to get the correct transformation from the marker system to the patient model coordinate space. as outlined in fig. , catheter tracking was implemented based on semantic segmentation using a full-resolution residual network (frrn) [ ] . after the semantic segmentation of the rgb stream of the kinect cameras, the image is fused with the depth stream to determine the voxels in the point cloud belonging to the catheter. as a further step, a density based clustering approach [ ] is performed on the chosen voxels. this is necessary due to noise, especially on the edges of the instrument voxels in the point cloud. based on the found clusters, an estimation of the three-dimensional structure of the catheter is performed. for this purpose, a narrow cylinder with variable length is constructed. the length is adjusted according to the semantic segmentation and the clustered voxels of the point cloud. the approach is applicable to identifying a variety of instruments.
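a simplified reading of this catheter reconstruction step is sketched below; the segmentation mask is assumed to be given, dbscan stands in for the density based clustering of [ ] , and all parameters are illustrative assumptions rather than the values used in the system:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def estimate_catheter(points, eps=0.01, min_samples=20):
    """points: n x 3 array of 3d points already selected by the semantic segmentation mask.
    returns an axis (unit vector), a center point and the estimated visible length of the catheter."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)  # suppress noisy edge voxels
    valid = labels >= 0
    if not np.any(valid):
        return None
    # keep the largest cluster as the catheter candidate
    largest = np.argmax(np.bincount(labels[valid]))
    cluster = points[labels == largest]
    # approximate the thin cylinder: principal axis via svd, length from the extent along that axis
    center = cluster.mean(axis=0)
    _, _, vt = np.linalg.svd(cluster - center, full_matrices=False)
    axis = vt[0]
    extent = (cluster - center) @ axis
    length = float(extent.max() - extent.min())   # adjusted as the visible part of the catheter changes
    return axis, center, length
```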
the openpose [ ] library is used to track key points on the bodies of the medical staff. available ros nodes have been modified to integrate openpose into the op:sense ros environment. the architecture is outlined in fig. . in this chapter, the results of the patient, catheter and medical staff tracking are described. the approach for finding the coarse position of a patient's head was evaluated on a phantom head placed on the or table within op:sense. multiple scenarios with changing illumination and occlusion conditions were recorded. the results are depicted in fig. and the evaluation results are given in table . precise detection of the patient was performed with the two-stage surface matching approach. different point cloud resolutions were tested with regard to runtime behaviour. voxel grid edge sizes of , and mm have been tested, with a higher edge size corresponding to a smaller point cloud. the matching results of the two point clouds were analyzed manually. an average accuracy of . mm was found, with an accuracy range between . and . mm. in the first stage of the surface matching, the two point clouds are coarsely aligned as depicted in fig. . in the second step, icp is used for fine adjustment. a two-stage process was implemented as icp requires a good initial alignment of the two point clouds. for catheter tracking, a precision of the semantic segmentation between % and % is reached (see table ). tracking of instruments, especially neurosurgical catheters, is challenging due to their thin structure and non-rigid shape. detailed results on catheter tracking have been presented in [ ] . the d estimation of the catheter is shown in fig. . the catheter was moved in front of the camera and the d reconstruction was recorded simultaneously. over a long period of the recording, over % of the catheter is tracked correctly. in some situations this drops to under % or lower. the tracking of medical personnel is shown in fig. . the different body parts and joint positions are determined, e.g. the head, eyes, shoulders, elbows, etc. the library yielded very good results as described in [ ] . we reached a performance of frames per second on a workstation (intel i - k, geforce ti) processing stream. discussion: as shown in the evaluation, our approach succeeds in detecting the patient in an automated two-stage process with an accuracy between and mm. the coarse position is determined by using a yolo v net. the results under normal or conditions are very satisfying. the solution's performance drops strongly under bright illumination conditions. this is due to large flares that occur on the phantom as it is made of plastic or silicone. however, these effects do not occur on human skin. the advantage of our system is that the detection is performed on all four kinect rgb streams, which enables different views of the operation area. unfavourable illumination conditions normally do not occur in all of these streams; therefore, robust detection is still possible. in the future, the datasets will be expanded with samples with strong illumination conditions. the subsequent surface matching of the head yields good results and a robust and precise detection of the patient. most important is a good preprocessing of the ct data and the recorded point cloud of the search area, as described in the methods. the algorithm does not manage to find a result if there are larger holes in the point clouds or if the normals are not calculated correctly. additionally, challenges that have to be considered include skin deformities and noisy ct data.
the silicone skin is not fixed to the skull (as human skin is), which leads to changes in position, some of which are greater than cm. also the processing time of minutes is quite long and must be optimized in the future. the processing time may be shortened by reducing the size of the point clouds. however, in this case the matching results may also become worse. catheter tracking [ ] yielded good results, despite the challenging task of segmenting a very thin ( . mm) and deformable object. additionally, a d estimation of the catheter was implemented. the results showed that in many cases over % of the catheter can be estimated correctly. however, these results strongly depend on the orientation and the quality of the depth stream. using higher quality sensors could improve the detection results. for tracking of the medical staff openpose as a ready-to-use people detection algorithm was used and integrated into ros. the library produces very good results, despite medical staff wearing surgical clothing. in this work the integration of augmented reality into the digital operating room op:sense is demonstrated. this makes it possible to expand the capabilities of current ar glasses. the system can determine the precise patient's position by implementing a two-stage process. first a yolo v net is used to coarsly detect the patient to reduce the search area. in a second subsequent step a two-stage surface matching process is implemented for refined detection. this approach allows for precise location of the patient's head for later tracking. further, a frnn-based solution to track the surgical instruments in the or was implemented and demonstrated on a thin neurosurgical catheter for ventricular punctures. additionally, openpose was integrated into the digital or to track the surgical personnel. the presented solution will enable the system to react to the current situation in the operating room and is the base for an integration into the surgical workflow. due to the emergence of commodity depth sensors many classical computer vision tasks are employed on networks of multiple depth sensors e.g. people detection [ ] or full-body motion tracking [ ] . existing methods approach these applications using a sequential processing pipeline where the depth estimation and inference are performed on each sensor separately and the information is fused in a post-processing step. in previous work [ ] we introduce a scene-adaptive optimization schema, which aims to leverage the accumulated scene context to improve perception as well as post-processing vision algorithms (see fig. ). in this work we present a proof-of-concept implementation of the scene-adaptive optimization methods proposed in [ ] for the specific task of stereomatching in a depth sensor network. we propose to improve the d data acquisition step with the help of an articulated shape model, which is fitted to the acquired depth data. in particular, we use the known camera calibration and the estimated d shape model to resolve disparity ambiguities that arise from repeating patterns in a stereo image pair. the applicability of our approach can be shown by preliminary qualitative results. in previous work [ ] we introduce a general framework for scene-adaptive optimization of depth sensor networks. it is suggested to exploit inferred scene context by the sensor network to improve the perception and post-processing algorithms themselves. 
in this work we apply the proposed ideas in [ ] to the process of stereo disparity estimation, also referred to as stereo matching. while stereo matching has been studied for decades in the computer vision literature [ , ] it is still a challenging problem and an active area of research. stereo matching approaches can be categorized into two main categories, local and global methods. while local methods, such as block matching [ ] , obtain a disparity estimation by finding the best matching point on the corresponding scan line by comparing local image regions, global methods formulate the problem of disparity estimation as a global energy minimization problem [ ] . local methods lead to highly efficient real-time capable algorithms, however, they suffer from local disparity ambiguities. in contrast, global approaches are able to resolve local ambiguities and therefore provide high-quality disparity estimations. but they are in general very time consuming and without further simplifications not suitable for real-time applications. the semi-global matching (sgm) introduced by hirschmuller [ ] aggregates many feasible local d smoothness constraints to approximate global disparity smoothness regularization. sgm and its modifications are still offering a remarkable trade-off between the quality of the disparity estimation and the run-time performance. more recent work from poggi et al. [ ] focuses on improving the stereo matching by taking additional high-quality sources (e.g. lidar) into account. they propose to leverage sparse reliable depth measurements to improve dense stereo matching. the sparse reliable depth measurements act as a prior to the dense disparity estimation. the proposed approach can be used to improve more recent end-to-end deep learning architectures [ , ] , as well as classical stereo approaches like sgm. this work is inspired by [ ] , however, our approach does not rely on an additional lidar sensor but leverages a priori scene knowledge in terms of an articulated shape model instead to improve the stereo matching process. we set up four stereo depth sensors with overlapping fields of view. the sensors are extrinsically calibrated in advance, thus their pose with respect to a world coordinates system is known. the stereo sensors are pointed at a mannequin and capture eight greyscale images (one image pair for each stereo sensor, the left image of each pair is depicted in fig. a) . for our experiments we use a high-quality laser scan of the mannequin as ground truth. we assume that the proposed algorithm has access to an existing shape model that can express the observed geometry of the scene in some capacity. in our experimental setup, we assume a shape model of a mannequin with two articulated shoulders and a slightly different shape in the belly area of the mannequin (see fig. ). in the remaining section we use the provided shape model to improve the depth data generation of the sensor network. first, we estimate the disparity values of each of the four stereo sensors with sgm without using the human shape model. let p denote a pixel and q denote an adjacent pixel. let d denote a disparity map and d p ,d q denote the disparity at pixel location p and q. let p denote the set of all pixels and n the set of all adjacent pixels. 
then the sgm cost function can be defined as e(d) = \sum_{p} d(p, d_p) + \sum_{(p,q)} r(p, d_p, q, d_q), where the first sum runs over all pixels and the second over all pairs of adjacent pixels, d(p, d_p) denotes the matching term (here the sum of absolute differences in a small square neighborhood) which assigns a matching cost to the assignment of disparity d_p to pixel p, and r(p, d_p, q, d_q) penalizes disparity discontinuities between adjacent pixels p and q. in sgm this objective is minimized with dynamic programming, leading to the resulting disparity map \hat{d} = \arg\min_d e(d). as input for the shape model fitting we apply sgm on all four stereo pairs, leading to four disparity maps as depicted in fig. a . to be able to exploit the articulated shape model for stereo matching, we initially need to fit the model to the d data obtained by classical sgm as described above. to be more robust to outliers, we only use disparity values from pixels with high contrast and transform them into d point clouds. since we assume that the relative camera poses are known, it is straightforward to merge the resulting point clouds into one world coordinate system. finally, the shape model is fitted to the merged point cloud by optimizing over the shape model parameters, namely the pose of the model and the rotation of the shoulder joints. we use an articulated mannequin shape model in this work as a proxy for an articulated human shape model (e.g. [ ] ) as a proof of concept and plan to transfer the proposed approach to real humans in future work. once the model parameters of the shape model are obtained, we can reproject the model fit to each sensor view by making use of the known projection matrices. fig. b shows the rendered wireframe mesh of the fitted model as an overlay on the camera images. for our guided stereo matching approach we then need the synthetic disparity map, which can be computed from the synthetic depth maps (a byproduct of d rendering). we denote the synthetic disparity image by d_synth . one synthetic disparity image is created for each stereo sensor, see fig. b . in the final step we exploit the existing shape model fit, in particular the synthetic disparity image d_synth of each stereo sensor, and combine it with sgm (inspired by guided stereo matching [ ] ): the augmented objective extends the sgm matching term with a guidance term derived from d_synth. the resulting objective is very similar to sgm and can be minimized in a similar fashion, leading to the final disparity estimation in our scene-adaptive depth sensor network. to summarize our approach, we exploit an articulated shape model fit to enhance sgm with minor adjustments. to show the applicability of our approach we present preliminary qualitative results. the results are depicted in fig. . using sgm without exploiting the provided articulated shape model leads to reasonable results, but the disparity map is very noisy and no clean silhouette of the mannequin is extracted (see fig. a ). fitting our articulated shape model to the data leads to clean synthetic disparity maps, as shown in fig. c , with a clean silhouette. in the belly area the synthetic model disparity map (fig. b) does not agree with the ground truth (fig. d) : the articulated shape model is not general enough to explain the recorded scene faithfully. using the guided stereo matching approach, we construct a much cleaner disparity map than sgm. in addition, the approach takes the current sensor data into account and exploits an existing articulated shape model. in this work we have proposed a method for scene-adaptive disparity estimation in depth sensor networks.
our main contribution is the exploitation of a fitted human shape model to make the estimation of disparities more robust to local ambiguities. our early results indicate that our method can lead to more robust and accurate results compared to classical sgm. future work will focus on a quantitative evaluation as well as incorporating sophisticated statistical human shape models into our approach. inverse process-structure-property mapping abstract. workpieces for dedicated purposes must be composed of materials which have certain properties. the latter are determined by the compositional structure of the material. in this paper, we present the scientific approach of our current dfg funded project tailored material properties through microstructural optimization: machine learning methods for the modeling and inversion of structure-property relationships and their application to sheet metals. the project proposes a methodology to automatically find an optimal sequence of processing steps which produce a material structure that bears the desired properties. the overall task is split in two steps: first find a mapping which delivers a set of structures with given properties and second, find an optimal process path to reach one of these structures with least effort. the first step is achieved by machine learning the generalized mapping of structures to properties in a supervised fashion, and then inverting this relation with methods delivering a set of goal structure solutions. the second step is performed via reinforcement learning of optimal paths by finding the processing sequence which leads to the best reachable goal structure. the paper considers steel processing as an example, where the microstructure is represented by orientation density functions and elastic and plastic material target properties are considered. the paper shows the inversion of the learned structure-property mapping by means of genetic algorithms. the search for structures is thereby regularized by a loss term representing the deviation from process-feasible structures. it is shown how reinforcement learning is used to find deformation action sequences in order to reach the given goal structures, which finally lead to the required properties. keywords: computational materials science, property-structure-mapping, texture evolution optimization, machine learning, reinforcement learning the derivation of processing control actions to produce materials with certain, desired properties is the "inverse problem" of the causal chain "process control" -"microstructure instantiation" -"material properties". the main goal of our current project is the creation of a new basis for the solution of this problem by using modern approaches from machine learning and optimization. the inversion will be composed of two explicitly separated parts: "inverse structure-property-mapping" (spm) and "microstructure evolution optimization". the focus of the project lies on the investigation and development of methods which allow an inversion of the structure-property-relations of materials relevant in the industry. this inversion is the basis for the design of microstructures and for the optimal control of the related production processes. another goal is the development of optimal control methods yielding exactly those structures which have the desired properties. the developed methods will be applied to sheet metals within the frame of the project as a proof of concept. 
the goals include the development of methods for inverting technologically relevant "structure-property-mappings" and methods for efficient microstructure representation by supervised and unsupervised machine learning. adaptive processing path-optimization methods, based on reinforcement learning, will be developed for adaptive optimal control of manufacturing processes. we expect that the results of the project will lead to an increasing insight into technologically relevant process-structure-property-relationships of materials. the instruments resulting from the project will also promote the economically efficient development of new materials and process controls. in general, approaches to microstructure design make high demands on the mathematical description of microstructures, on the selection and presentation of suitable features, and on the determination of structure-property relationships. for example, the increasingly advanced methods in these areas enable microstructure sensitive design (msd), which is introduced in [ ] and [ ] and described in detail in [ ] . the relationship between structures and properties descriptors can be abstracted from the concrete data by regression in the form of a structure-property-mapping. the idea of modeling a structure-property-mapping by means of regression and in particular using artificial neural networks was intensively pursued in the s [ ] and is still used today. the approach and related methods presented in [ ] always consist of a structure-property-mapping and an optimizer (in [ ] genetic algorithms) whose objective function represents the desired properties. the inversion of the spm can be alternatively reached via generative models. in contrast to discriminative models (e.g. spm), which are used to map conditional dependencies between data (e.g. classification or regression), generative models map the composite probabilities of the variables and can thus be used to generate new data from the assumed population. established, generative methods are for example mixture models [ ] , hidden markov models [ ] and in the field of artificial neural networks restricted boltzmann machines [ ] . in the field of deep learning, generative models, in particular generative adversarial networks [ ] , are currently being researched and successfully applied in the context of image processing. conditional generative models can generalize the probability of occurrence of structural features under given material properties. in this way, if desired, any number of microstructures could be generated. based on the work on the spm, the process path optimization in the context of the msd is treated depending on the material properties. for this purpose, the process is regarded as a sequence of structure-changing process operations which correspond to elementary processing steps. shaffer et al. [ ] construct a so called texture evolution network based on process simulation samples, to represent the process. the texture evolution network can be considered as a graph with structures as vertices, connected by elementary processing steps as edges. the structure vertices are points in the structure-space and are mapped to the property-space by using the spm for property driven process path optimization. in [ ] one-step deformation processes are optimized to reach the most reachable element of a texture-set from the inverse spm. 
processes are represented by so-called process planes, principal component analysis (pca) projections of the microstructures reachable by the process. the optimization is then conducted by searching for the process plane which best represents one of the texture-set elements. in [ ] , a generic ontology based semantic system for processing path hypothesis generation (matcalo) is proposed and showcased. the required mapping of the structures to the properties is modeled based on data from simulations. the simulations are based on taylor models. the structures are represented using textures in the form of orientation density functions (odf), from which the properties are calculated. in the investigations, elastic and plastic properties are considered in particular. structural features are extracted from the odf for a more compact description. the project uses spectral methods such as generalized spherical harmonics (gsh) to approximate the odf. as an alternative representation we investigate the discretization of the orientation-space, where the orientation density is represented by a histogram. the solution of the inverse problem consists of a structure-property-mapping and an optimizer: as described in [ ] , the spm is modeled by regression using artificial neural networks. in this investigation, we use a multilayer perceptron. differential evolution (de) is used for the optimization problem. de is an evolutionary algorithm developed by rainer storn and kenneth price [ ] . it is an optimization method which repeatedly improves a set of candidate solutions with respect to a given quality measure over a continuous domain. the de algorithm optimizes a problem by taking a population of candidate solutions and generating new candidate solutions (structures) by mutation and recombination of existing ones. the candidate solution with the best fitness is considered for further processing. for the generated structures, the reached properties are determined using the spm. the fitness f is composed of two terms, f = l_p + l_s: the property loss l_p, which expresses how close the properties of a candidate are to the target properties, and the structure loss l_s, which represents the degree of feasibility of the candidate structure in the process. the property loss is the mean squared error (mse) between the reached properties p_r ∈ p_r and the desired properties p_d ∈ p_d, i.e. l_p = \frac{1}{|p_d|} \sum_i (p_{r,i} - p_{d,i})^2. to ensure that the genetic algorithm generates reachable structures, a neural network is trained which functions as an anomaly detector. the data basis of this neural network consists of structures that can be reached by a process. the goal of the anomaly detection is to exclude unreachable structures. the anomaly detection is implemented using an autoencoder [ ] . this is a neural network (see fig. ) which consists of two parts: the encoder and the decoder. the encoder converts the input data to an embedding space. the decoder converts the embedding back to a reconstruction that is as close as possible to the original data. due to the reduction to an embedding space, the autoencoder performs data compression and extracts relevant features. the cost function for the structures is a distance function in the odf-space which penalizes the network if it produces outputs that differ from the input. this cost function is also known as the reconstruction loss; it is computed from the original structures s_i ∈ s and the reconstructed structures \hat{s}_i ∈ \hat{s}, with a small constant λ added to avoid division by zero.
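as a minimal sketch of this inversion (the trained spm regressor and autoencoder are placeholders, the bounds are illustrative, and plain mse stands in for the odf-based reconstruction loss described above), scipy's differential evolution can be wired up as follows:

```python
import numpy as np
from scipy.optimize import differential_evolution

def make_fitness(spm_predict, ae_reconstruct, p_desired):
    """fitness f = property loss (mse to desired properties) + structure loss (reconstruction error)."""
    def fitness(s):
        p_reached = spm_predict(s)                          # spm: structure -> properties
        l_p = np.mean((p_reached - p_desired) ** 2)         # property loss
        l_s = np.mean((ae_reconstruct(s) - s) ** 2)         # structure loss (anomaly score)
        return l_p + l_s
    return fitness

# hypothetical usage: s is a flattened odf histogram with values in [0, 1];
# spm_predict and ae_reconstruct would wrap the trained multilayer perceptron and autoencoder.
# result = differential_evolution(
#     make_fitness(spm_predict, ae_reconstruct, p_desired=np.array([...])),
#     bounds=[(0.0, 1.0)] * n_odf_bins, maxiter=200, seed=0)
# goal_structure = result.x
```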
when using the anomaly detection, the autoencoder yields a high reconstruction loss if the input data are structures that are very different from the reachable structures. the overall approach is shown in fig. and consists of the following steps: (1) the genetic algorithm generates structures; (2) the spm determines the reached properties of the generated structures; (3) the structure loss l_s is determined by the reconstruction loss of the anomaly detector for the generated structures with respect to the reachable structures; (4) the property loss l_p is determined by the mse of the reached properties and the desired properties; (5) the fitness is calculated as the sum of the structure loss l_s and the property loss l_p. the structures resulting from the described approach form the basis for optimal process control. due to the forward mapping, the process evolution optimization based on texture evolution networks ( [ ] ) is restricted to a-priori sampled process paths. [ ] relies on linearization assumptions and is applicable to short process sequences only. [ ] relies on a-priori learned process models in the form of regression trees and is also applicable to relatively short process sequences only. as an adaptive alternative for texture evolution optimization that can be trained to find process paths of arbitrary length, we propose methods from reinforcement learning. for desired material properties p_d, the inverted spm determines a set of goal microstructures s_d ∈ g which are very likely reachable by the considered deformation process. the texture evolution optimization objective is then to find the shortest process path p* that starts from a given structure s and leads close to one of the structures from g, where p = (a_k)_{k=1,...,K} with K ≤ t is a path of process actions a and t is the maximum allowed process length. the mapping e(s, p) = s_K delivers the resulting structure when applying p to the structure s. here, for the sake of simplicity, we assume the process to be deterministic, although the reinforcement learning methods we use are not restricted to deterministic processes. g^τ is a neighbourhood of g, the union of all open balls with radius τ and center points from g. to solve the optimization problem by reinforcement learning approaches, it must be reformulated as a markov decision process (mdp), which is defined by the tuple (s, a, p, r). in our case s is the space of structures s, a is the parameter-space of the deformation process, containing the process actions a, and p : s × a → s is the transition function of the deformation process, which we assume to be deterministic. r_g : s × a → r is a goal-specific reward function. the objective of the reinforcement learning agent is then to find the optimal goal-specific policy π*_g(s_t) = a_t that maximizes the discounted future goal-specific reward \sum_{k ≥ t} γ^{k-t} r_g(s_k, a_k), where γ ∈ [0, 1] discounts rewards attained later (thus favouring early rewards), the policy π_g(s_k) determines a_k, and the transition function p(s_k, a_k) determines s_{k+1}. for a distance function d in the structure space, the binary reward function r_g(s, a) = 1 if d(p(s, a), g) < τ and 0 otherwise, if maximized, leads to an optimal policy π*_g that yields the shortest path to g from every s for γ < 1. moreover, if v_g is given for every microstructure from g, p* as defined above is identical with the application of the policy π*_ζ, where ζ = arg max_g v_g. π*_g can be approached by methods from reinforcement learning.
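one simple way to approach π*_g is goal-conditioned q-learning with the binary reward above; the sketch below makes strong simplifying assumptions (a small discrete action set, a black-box deterministic process simulator process_step, and a dictionary-based q-table over coarsely discretized structures) and is not the project's actual implementation:

```python
import numpy as np
from collections import defaultdict

def q_learning(s_start, goal_set, process_step, actions, distance,
               tau=0.05, gamma=0.95, alpha=0.1, eps=0.2, episodes=2000, max_len=20):
    """learn a goal-conditioned q-table; reward is 1 when the next structure lies within tau of the goal set."""
    q = defaultdict(lambda: np.zeros(len(actions)))
    key = lambda s: tuple(np.round(s, 3))                 # crude discretization of the structure
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = s_start
        for _ in range(max_len):
            a_idx = rng.integers(len(actions)) if rng.random() < eps else int(np.argmax(q[key(s)]))
            s_next = process_step(s, actions[a_idx])      # deterministic deformation step p(s, a)
            reached = min(distance(s_next, g) for g in goal_set) < tau
            r = 1.0 if reached else 0.0                   # binary goal-specific reward r_g
            target = r + (0.0 if reached else gamma * np.max(q[key(s_next)]))
            q[key(s)][a_idx] += alpha * (target - q[key(s)][a_idx])
            s = s_next
            if reached:
                break
    # greedy policy: pi(s) = actions[argmax_a q[key(s)][a]]
    return q
```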
value-based reinforcement learning is doing so by learning expected discounted future reward functions [ ] . one of these functions is the so called value-function v . in the case of a deterministic mdp and for a given g, this expectation value function reduces to v g from eq. and ζ can be extracted if v is learned for every g from g. for doing so, a generalized form of expectation value functions can be learned as it is done e.g. in [ ] . this exemplary mdp formulation shows how reinforcement learning can be used for texture evolution optimization tasks. the optimization thereby is operating in the space of microstructures and does not rely on a-priori microstructure samples. when using off-policy reinforcement learning algorithms and due to the generalization over goal-microstructures, the functions learned while solving a specific optimization task can be easily transferred to new optimization tasks (i.e. different desired properties or even a different property space). industrial robots are mainly deployed in large-scale production, especially in the automotive industry. today, there are already . industrial robots deployed per , employees on average in these industry branches. in contrast, small and medium-sized enterprises (smes) only use . robots per , employees [ ] . reasons for this low usage of industrial robots in smes include the lack of flexibility with great variance of products and the high investment expenses due to additional peripherals required, such as gripping or sensor technology. the robot as an incomplete machine accounts for a fourth of the total investment costs [ ] . due to the constantly growing demand of individualized products, robot systems have to be adapted to new production processes and flows [ ] . this development requires the flexibilization of robot systems and the associated frequent programming of new processes and applications as well as the adaption of existing ones. robot programming usually requires specialists who can adapt flexibly to different types of programming for the most diverse robots and can follow the latest innovations. in contrast to many large companies, smes often have no in-house expertise and a lack of prior knowledge with regard to robotics. this often has to be obtained externally via system integrators, which, due to high costs, is one of the reasons for the inhibited use of robot systems. during the initial generation or extensive adaption of process flows with industrial robots, there is a constant risk of injuring persons and damaging the expensive hardware components. therefore, the programs have to be tested under strict safety precautions and usually in a very slow test mode. this makes the programming of new processes very complex and therefore time-and cost-intensive. the concept presented in this paper combines intuitive, gesture-based programming with simulation of robot movements. using a mixed reality solution, it is possible to create a simulation-based visualization of the robot and project, to program and to test it in the working environment without disturbing the workflow. a virtual control panel enables the user to adjust, save and generate a sequence of specific robot poses and gripper actions and to simulate the developed program. an interface to transfer the developed program to the robot controller and execute it by the real robot is provided. the paper is structured as follows. 
first, a research on related work is conducted in section , followed by a description of the system of the gesture-based control concept in section . the function of robot positioning and program creation is described in section . last follow the evaluation in section and conclusion in section . various interfaces exist to program robots, such as lead-trough, offline or walk-trough programming, programming by demonstration, vision based programming or vocal commanding. in the survey of villani et al. [ ] a clear overview on existing interfaces for robot programming and current research is provided. besides the named interfaces, the programming of robots using a virtual or mixed reality solution aims to provide intuitiveness, simplicity and accessibility of robot programming for non-experts. designed for this purpose, guhl et al. [ ] developed a generic architecture for human-robot interaction based on virtual and mixed reality. in the marker tracking based approach presented by [ ] and [ ] , the user defines a collision-free-volume and generates and selects control points while the system creates and visualizes a path through the defined points. others [ ] , [ ] , [ ] and [ ] use handheld devices in combination with gesture control and motion tracking. herein, the robot can be controlled through gestures, pointing or via the device, while the path, workpieces or the robot itself are visualized on several displays. other gesture and virtual or mixed reality based concepts are developed by cousins et al. [ ] or tran et al. [ ] . here, the robots perspective or the robot in the working environment is presented to the user on a display (head-mounted or stationary) and the user controls the robot via gestures. further concepts using a mixed reality method enable an image of the workpiece to be imported into cad and the system automatically generates a path for robot movements [ ] or visualizing the intended motion of the robot on the microsoft hololens, that the user knows where the robot will move to next [ ] . other methods combine pointing at objects on an screen with speech instructions to control the robot [ ] . sha et al. [ ] also use a virtual control panel in their programming method, but for adjusting parameters and not for controlling robots. another approach pursues programming based on cognition, spatial augmented reality and multimodal input and output [ ] , where the user interacts with a touchable table. krupke et al. [ ] developed a concept in which humans can control the robot by head orientation or by pointing, both combined with speech. the user is equipped with a head-mounted display presenting a virtual robot superimposed over the real robot. the user can determine pick and place position by specifying objects to be picked by head orientation or by pointing. the virtual robot then executes the potential pick movement and after the user confirms by voice command, the real robot performs the same movement. a similar concept based on gesture and speech is persued by quintero et al. [ ] , whose method offers two different types of programming. on the one hand, the user can determine a pick and place position by head orientation and speech commands. the system automatically generates a path which is displayed to the user, can be manipulated by the user and is simulated by a virtual robot. on the other hand, it is possible to create a path on a surface by the user generating waypoints. ostanin and klimchik [ ] introduced a concept to generate collision-free paths. 
the user is provided with virtual goal points that can be placed in the mixed reality environment and between which a path is automatically generated. by means of a virtual menu, the user can set process parameters such as speed, velocity etc.. additionally, it is possible to draw paths with a virtual device and the movement along the path is simulated by a virtual robot. differently to the concept described in this paper, only a pick and place task can be realized with the concepts of [ ] and [ ] . a differentiation between movements to positions and gripper commands as well as the movement to several positions in succession and the generation of a program structure are not supported by these concepts. another distinction is that the user only has the possibility to show certain objects to the robot, but not to move the robot to specific positions. in [ ] a preview of the movement to be executed is provided, but the entire program (pick and place movements) is not simulated. in contrast to [ ] , with the concept presented in this paper it is possible to integrate certain gripper commands into the program. with [ ] programming method, the user can determine positions but exact axis angles or robot poses cannot be set. overall, the approach presented in this paper offers an intuitive, virtual user interface without the use of handheld devices (cf. [ ] , [ ] , [ ] , [ ] , [ ] and [ ] ) which allows the exact positions of the robot to be specified. compared to other methods, such as [ ] , [ ] , [ ] , [ ] or [ ] , it is possible to create more complex program structures, which include the specification of robot poses and gripper positions, and to simulate the program in a mixed reality environment with a virtual robot. in this section the components of the mixed reality robot programming system are introduced and described. the system consists of multiple real and virtual interactive elements, whereby the virtual components are projected directly into the field of view using a mixed reality (mr) approach. compared to the real environment, which consists entirely of real objects and virtual reality (vr), which consists entirely of virtual objects and which overlays the real reality, in mr the real scene here is preserved and only supplemented by the virtual representations [ ] . in order to interact in the different realities, head-mounted devices similar to glasses, screens or mobile devices are often used. figure provides an overview of the systems components and their interaction. the system presented in this paper includes kukas collaborative, lightweight robot lbr iiwa r combined with an equally collaborative gripper from zimmer as real components and a virtual robot model and a user interface as virtual components. the virtual components are presented on the microsoft hololens. for calculation and rendering the robot model and visualization of the user interface, the d-and physics-engine of the unity d development framework is used. furthermore, for supplementary functions, components and for building additional mr interactable elements, the microsoft mixed reality toolkit (mrtk) is utilized. for spatial positioning of the virtual robot, marker tracking is used, a technique supported by the vuforia framework. in this use case, the image target is attached to the real robot's base, such that in mr the virtual robot superimposes the real robot. the program code is written in c . 
the robot is controlled and programmed via an intuitive, virtual user interface that can be manipulated using the so-called airtap gesture, a gesture provided by the microsoft hololens. to ensure that the virtual robot mirrors the motion sequences and poses of the real robot, a representation of the real robot that is as exact as possible is employed. the virtual robot consists of a total of eight links, matching the base and the seven joints of the iiwa r : the base frame, five joint modules, the central hand and the media flange. the eight links are connected as a kinematic chain. the model is provided as open source files from [ ] and [ ] and is integrated into the unity d project. the individual links are created as gameobjects in a hierarchy, with the base frame defining the top level, and their joint limits match those of the real robot. the cad data of the deployed gripping system is also imported into unity d and linked to the robot model. the canvas of the head-up display of the microsoft hololens is divided into two parts and rendered at a fixed distance in front of the user and on top of the scene. at the top left of the screen the current joint angles (a to a ) are displayed, and on the left side the current program is shown. this setting simplifies the interaction with the robot, as this information does not behave like other objects in the mr scene but is attached to the head-up display (hud) and moves with the user's field of view. the user interface, which consists of multiple interactable components, is placed into the scene and is shown at the right side of the head-up display. at the beginning of the application the user interface is in "clear screen" mode, i.e. only the buttons "drag", "cartesian", "joints", "play" and "clear screen" and the joint angles at the top left of the screen are visible. for interaction with the robot, the user has to switch into a particular control mode by tapping the corresponding button. the user interface provides three different control modes for positioning the virtual robot: drag mode for rough positioning, cartesian mode for cartesian positioning, and joint mode for the exact adjustment of each joint angle. figure shows the interactable components that are visible and therefore controllable in the respective control modes. depending on the selected mode, different interactable components become visible in the user interface, with which the virtual robot can be controlled. in addition to the control modes, the user interface offers further groups of interactable elements: motion buttons, with which e.g. the speed of the robot movement can be adjusted or the movement can be started and stopped; application buttons, e.g. to save or delete specific robot poses; gripper buttons, to adjust the gripper; and interface buttons, which enable communication with the real robot. this section focuses on the description of the usage of the presented approach. in addition to the description of the individual control modes, the procedure for creating a program is also described.
drag by gripping the tool of the virtual robot with the airtap gesture, the user can "drag" the robot to the required position. additionally, it is possible to rotate the position of the robot using both hands. this mode is particularly suitable for moving the robot very quickly to a certain position. cartesian this mode is used for the subsequent positioning of the robot tool with millimeter precision. the tool can be translated to the required positions using the cartesian coordinates x, y, z and the euler angles a, b, c. the user interface provides a separate slider for each of the six translation options.the tool of the robot moves analogously to the respective slider button, which the user can set to the required value. joints this mode is an alternative to the cartesian method for exact positioning. the joints of the virtual robot can be adjusted precisely to the required angle, which is particularly suitable for e.g. bypassing an obstacle. there is a separate slider for each joint of the virtual robot. in order to set the individual joint angles, the respective slider button is dragged to the required value, which is also displayed above the slider button for better orientation. to program the robot, the user interface provides various application buttons, such as saving and removing robot poses from the chain and a display of the poses in the chain. the user directs the virtual robot to the desired position and confirms using the corresponding button. the pose of the robot is then saved as joint angles from a to a and one gripper position in a list and is displayed on the left side of the screen. when running the programmed application, the robot moves to the saved robot poses and gripper positions according to the defined sequence. for a better orientation, the robots current target position changes its color from white to red. after testing the application, the list of robot poses can be sent to the controller of the real robot via a webservice. the real robot then moves analogously to the virtual robot to the corresponding robot poses and gripper positions. the purpose of the evaluation is how the gesture-based control concept compares to other concepts regarding intuitiveness, comfort and complexity. for the evaluation, a study was conducted with seven test persons, who had to solve a pick and place task with five different operating concepts and subsequently evaluate them. the developed concept based on gestures and mr was evaluated against a lead through procedure, programming with java, programming with a simplified programming concept and approaching and saving points with kuka smartpad. the test persons had no experience with microsoft hololens and mr, no to moderate experience with robots and no to moderate programming skills. the questionnaire for the evaluation of physical assistive devices (quead) developed by schmidtler et al [ ] was used to evaluate and compare the five control concepts. the questionnaire is classified into five categories (perceived usefulness, perceived ease of use, emotions, attitude and comfort) and contains a total of questions, rated on an ordinal scale from (entirely disagree) to (entirely agree). firstly, each test person received a short introduction to the respective control concept, conducted the pick and place task and immediately afterwards evaluated the respective control concept using quead. all test persons agreed that they would reuse the concept in future tasks ( mostly agree, entirely agree). 
in addition, the test persons considered the gesture-based concept to be intuitive ( mostly agree, entirely agree), easy to use ( mostly agree, entirely agree) and easy to learn ( mostly agree, entirely agree). two test persons mostly agree and four entirely agree that the gesture-based concept enabled them to solve the task efficiently and four test persons mostly agree and two entirely agree that the concept enhances their work performance. all seven subjects were comfortable using the gesturebased concept ( mostly agree, entirely agree). overall, the concept presented in this paper was evaluated as more comfortable, more intuitive and easier to learn than the other control concepts evaluated. in comparison to them, the new operating concept was perceived as the most useful and easiest to use. the test persons felt physically and psychologically most comfortable when using the concept and were most positive in total. in this paper, a new concept for programming robots based on gestures and mr and for simulating the created applications was presented. this concept forms the basis for a new, gesture-based programming method, with which it is possible to project a virtual robot model of the real robot into the real working environment by means of a mr solution, to program it and to simulate the workflow. using an intuitive virtual user interface, the robot can be controlled by three control modes and further groups of interactable elements and via certain functions, several robot positions can be chained as a program. by using this concept, test and simulation times can be reduced, since on the one hand the program can be tested directly in the mr environment without disturbing the workflow. on the other hand, the robot model is rendered into the real working environment via the mr approach, thus eliminating the need for time-consuming and costly modeling of the environment. the results of the user study indicate that the control concept is easy to learn, intuitive and easy to use. this facilitates the introduction of robots and especially in smes, since no expert knowledge is required for programming, programs can be created rapidly and intuitively and processes can be adapted flexibly. in addition, the user study showed that tasks can be solved efficiently and the concept is perceived as performance-enhancing. potential directions of improvement are: implement various movement types, such as point-to-point, linear and circular movements in the concept. this makes the robot motion more flexible and efficient, since positions can be approached in different ways depending on the situation. another improvement is to extend the concept with collaborative functions of the robot, such as force sensitivity or the ability to conduct search movements. in this way, the functions that make collaborative robots special can be integrated into the program structure. a further approach for improvement is to engage in a larger scale study. in the world's commercial fleet consists of , ships with a total capacity of , , thousand dwt. (a plus of . % in carrying capacity compared to last year) [ ] . according to the international chamber of shipping, the shipping industry is responsible for about % of all trade [ ] . in order to ensure the safe voyage of all participant in the international travel at sea, the need for monitoring is steadily increasing. 
while more and more data regarding the sea traffic is collected by using cheaper and more powerful sensors, the data still needs to be processed and understood by human operators. in order to support the operators, reliable anomaly detection and situation recognition systems are needed. one cornerstone for this development is a reliable automatic classification of vessels at sea. for example by classifying the behaviour of non cooperative vessels in ecological protected areas, the identification of illegal, unreported and unregulated (iuu) fishing activities is possible. iuu fishing is in some areas of the world a major problem, e. g., »in the wider-caribbean, western central atlantic region, iuu fishing compares to - percent of the legitimate landings of fish« [ ] resulting in an estimated value between usd and million per year. one approach for gathering information on the sea traffic is based on the automatic identification system (ais) . it was introduced as a collision avoidance system. as each vessel is broadcasting its information on an open channel, the data is often used for other purposes, like training and validating of machine learning models. ais provides dynamic data like position, speed and course over ground, static data like mmsi , shiptype and length, and voyage related data like draught, type of cargo, and destination about a vessel. the system is self-reporting, it has no strong verification of transmission, and many of the fields in each message are set by hand. therefore, the data can not be fully trusted. as harati-mokhtari et al. [ ] stated, half of all ais messages contain some erroneous data. as for this work, the dataset is collected by using the ais stream provided by aishub , the dataset is likely to have some amount of false data. while most of the errors will have no further consequences, minor coordinate inaccuracies or wrong vessel dimensions are irrelevant, some false information in vessel information can have an impact on the model performance. classification of maritime trajectories and the detection of anomalies is a challenging problem, e.g., since classifications should be based on short observation periods, only limited information is available for vessel identification. riveiro et al. [ ] give a survey on anomaly detection at sea, where shiptype classification is a subtype. jiang et al. [ ] present a novel trajectorynet capable of point-based classification. their approach is based on the usage of embedding gps coordinates into a new feature space. the classification itself is accomplished using an long short-term memory (lstm) network. further, jiang et al. [ ] propose a partition-wise lstm (plstm) for point-based binary classification of ais trajectories into fishing or non-fishing activity. they evaluated their model against other recurrent neural networks and achieve a significantly better result than common recurrent network architectures based on lstm or gated recurrent units. a recurrent neural network is used by nguyen et al. in [ ] to reconstruct incomplete trajectories, detect anomalies in the traffic data and identify the real type of a vessel. they are embedding the position data to generate a new representation as input for the neural network. besides these neural network based approaches, other methods are also used for situation recognition tasks in the maritime domain. 
especially expert-knowledge based systems are used frequently, as illegal or at least suspicious behaviour is not recorded as often as would be desirable for deep learning approaches. conditional random fields are used by hu et al. [ ] for the identification of fishing activities from ais data. the data has been labelled by an expert and contains only longliner fishing boats. saini et al. [ ] propose a hidden markov model (hmm) based approach to the classification of trajectories. they combine a global hmm and a segmental hmm using a genetic algorithm. in addition, they tested the robustness of the framework by adding gaussian noise. in [ ] fischer et al. introduce a holistic approach for situation analysis based on situation-specific dynamic bayesian networks (ssdbn). this includes the modelling of the ssdbn as well as the presentation to end users. for a bayesian network, the parametrisation of the conditional probability tables is crucial. fischer introduces an algorithm for choosing these parameters in a more transparent way. important for the functionality is the ability of the network to model the domain knowledge and to handle noisy input data. for the evaluation, simulated and real data is used to assess the detection quality of the ssdbn. based on dbns, anneken et al. [ ] implemented an algorithm for detecting illegal diving activities in the north sea. as explained by de rosa et al. [ ] , an additional layer for modelling the reliability of different sensor sources is added to the dbn.

in order to use the ais data, preprocessing is necessary. this includes cleaning wrong data, filtering data, segmentation, and the calculation of additional features. the whole workflow is depicted in figure . the input in the form of ais data and different maps is shown as blue boxes. all relevant mmsis are extracted from the ais data. for each mmsi, the position data is used for further processing. segmentation into separate trajectories is the next step (yellow). the resulting trajectories are filtered (orange). based on the remaining trajectories, geographic (green) and trajectory-based (purple) features are derived. for each of the resulting sequences, the data is normalized (red), which results in the final dataset. only the major shiptypes in the dataset are used for the evaluation. these are "cargo", "tanker", "fishing", "passenger", "pleasure craft" and "tug". due to their similar behaviour, "cargo" and "tanker" will be combined into a single class "cargo-tanker".

figure : visualization of all preprocessing steps. input in blue, segmentation in yellow, filtering in orange, geographic features in green, trajectory features in purple and normalization in red.

four different trajectory features are used: time difference, speed over ground, course over ground and a trajectory transformation. as the incoming data from ais is not necessarily uniformly distributed in time, there is a need to create a feature representing the time dimension. therefore, the time difference between two samples is introduced. as the speed and course over ground are directly accessible through the ais data, the network is directly fed with these features. the vessel's speed is a numeric value in . -knot resolution in the interval [ ; ], and the course is the negative angle in degrees relative to true north and therefore in the interval [ ; ]. the position is transformed in two ways. the first transformation, further called "relative-to-first", shifts the trajectory to start at the origin.
the second transformation, henceforth called "rotate-to-zero", rotates the trajectory in such a way that the end point lies on the x-axis. in addition to the trajectory-based features, two geographic features are derived by using coastline maps and a map of large harbours. the coastline map consists of a list of line strips. in order to reduce complexity, the edge points are used to calculate the "distance-to-coast". further, only a lower resolution of the shapefile itself is used. in figure , the resolutions "high" and "low" for some fjords in norway are shown. due to the geoindex' cell size set to km, a radius of km can be queried. the world's major harbours based on the world port index are used to calculate the "distance-to-closest-harbor". as fishing vessels are expected to stay near a certain harbour, this feature should help the network to identify some shiptypes. the geoindex' cell size is set for this feature to , km, resulting in a maximum radius of , km.

the data is split into separate trajectories by using gaps in either time or space, or the sequence length. as real ais data is used, packet loss during transmission is common. this problem is tackled by splitting the data if the time between two successive samples is larger than hours, or if the distance between two successive samples is large. regarding the distance, even though the great circle distance is more accurate, the euclidean distance is used. for simplification the distance value is squared and a threshold of − is used. depending on latitude this corresponds to a value of about km at the equator and only about m at °n. since the calculation includes approximations, a relatively high threshold is chosen. as the neural network depends on a fixed input size, the data is split into fitting chunks by cutting and padding with these rules: longer sequences are split into chunks according to the desired sequence length; any leftover sequence shorter than % of the desired length is discarded; the others are padded with zeroes. this results in segmented trajectories of similar but not necessarily identical duration. as this work is about vessel behaviour at sea, stationary vessels (anchored and moored vessels) and vessels traversing rivers are removed from the segmented trajectories. the stationary vessels are identified by using a measure of movement α stationary computed over a trajectory, with n as the sequence length and p i as its data points. a trajectory is removed if α stationary is below a certain threshold. a shapefile containing the major and most minor rivers is used in order to remove the vessels not on the high seas. a sequence with more than % of its points on a river is removed from the dataset.

in order to speed up the training process, the data is normalized to the interval [ ; ] by applying min-max scaling, x' = (x − x min) / (x max − x min). for the positional features a differentiation between "global normalization" and "local normalization" is taken into account. the "global normalization" scales the input data with the maximum x max and minimum x min calculated over the entire data set, while the "local normalization" estimates the maximum x max and minimum x min only over the trajectory itself. as the data is processed in parallel, the parameters for the "global normalization" are calculated only for each chunk of data. this results in slight deviations in the minimum and maximum, but for large batches this should be negligible. all other additional features are normalized as well.
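a compact sketch of the two positional transformations and the min-max scaling described above; normalization to [0, 1] is an assumption, since the interval bounds are not legible in the extracted text.

```python
import numpy as np

def trajectory_features(timestamps, positions):
    """time difference and 'relative-to-first' transformation for one trajectory.
    timestamps are seconds, positions is an (n, 2) array; layout and units are assumptions."""
    t = np.asarray(timestamps, dtype=float)
    pos = np.asarray(positions, dtype=float)
    dt = np.diff(t, prepend=t[0])      # time difference, first sample gets 0
    rel = pos - pos[0]                 # shift the trajectory to start at the origin
    return dt, rel

def rotate_to_zero(rel_positions):
    """rotate a 'relative-to-first' trajectory so that its end point lies on the x-axis."""
    end_x, end_y = rel_positions[-1]
    angle = np.arctan2(end_y, end_x)
    c, s = np.cos(-angle), np.sin(-angle)
    rotation = np.array([[c, -s], [s, c]])
    return rel_positions @ rotation.T

def min_max_normalize(features, x_min=None, x_max=None):
    """scale features to [0, 1]; pass dataset-wide bounds for 'global normalization',
    or leave them unset to compute the 'local normalization' over the trajectory itself."""
    features = np.asarray(features, dtype=float)
    x_min = features.min(axis=0) if x_min is None else np.asarray(x_min, dtype=float)
    x_max = features.max(axis=0) if x_max is None else np.asarray(x_max, dtype=float)
    span = np.where(x_max - x_min == 0, 1.0, x_max - x_min)   # avoid division by zero
    return (features - x_min) / span
```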
for the geographic features "distance-to-coast" and "distance-to-closest-harbor" the maximum distance, that can be queried depending on grid size, is used as x max and is used as the lower bound x min . the time difference feature is scaled using a minimum x min of and the threshold for the temporal gap since this is the maximum value for this feature. speed and course are normalized using and their respective maximum values. for the dataset, a period between - - and - - is used. altogether , unique vessels with , , , raw data points are included. using this foundation and the previously described methods, six datasets are derived. all datasets use the same spatial and temporal thresholds. in addition, filter thresholds are identical as well. the datasets differentiate in their sequence length and by applying only the "relativeto-first" transformation or additionally the "rotate-to-zero" transformation. either , , , or , points per sequence are used resulting in approximate h, h, or h long sequences. in figure , the distribution of shiptypes in the datasets after applying the different filters is shown. for the shiptype classification, neural networks are chosen. the different networks are implemented using keras [ ] with tensorflow as backend [ ] . fawaz et al. [ ] have shown, that, despite their initial design for image data, a residual neural network (resnet) can perform quite well on time-series classification. thus, as foundation for the evaluated architectures the resnet is used. the main difference to other neural network architectures is the inclusion of "skip connections". this allows for deeper networks by circumventing the vanishing gradient problem during the training phase. based on the main idea of a resnet, several architectures are designed and evaluated for this work. some information regarding the structure are given in table . further, the single architectures are depicted in figures a to f . the main idea behind these architectures is to analyse the impact of the depth of the networks. furthermore, as the features itself are not necessarily logically linked with each other, the hope is to be able to capture the behaviour better by splitting up the network path for each feature. to verify the necessity of cnns two multilayer perceptron (mlp) based networks are tested: one with two hidden layers and one with four hidden layers, all with neurons and fully connected with their adjacent layers. the majority of the parameters for the two networks are bound in the first layer. they are necessary to map the large number of input neurons, e. g., for the samples dataset * = , input neurons, to the first hidden layer. each of the datasets is split into three parts: % for the training set, % for the validation set, and % for the test set. for solving or at least mitigating the problem of overfitting, regularization techniques (input noise, batch normalization, and early stopping) are used. small noise on the input in the training phase is used to support the generalization of the network. for each feature a normal distribution with a standard deviation of . and a mean of is used as noise. furthermore, batch normalization is implemented. this means, before each relulayer a batch normalization layer is added, allowing higher learning rates. therefore, the initial learning rate is doubled. additionally, the learning rate is halved if the validation error does not improve after ten training epochs, improving the training behaviour during oscillation on a plateau. 
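a minimal 1d residual block and classifier head in keras, in the spirit of the resnet variants evaluated here; the layer counts, filter sizes and noise level are assumptions, since the exact architectures are only given in the original table and figures.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size=3):
    """a 1d convolutional block with a skip connection to circumvent vanishing gradients."""
    shortcut = x
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:            # match channel count for the addition
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def build_classifier(sequence_length, n_features, n_classes):
    """sequence input -> optional input noise -> stacked residual blocks -> softmax."""
    inputs = layers.Input(shape=(sequence_length, n_features))
    x = layers.GaussianNoise(0.1)(inputs)        # input noise as regularization (stddev assumed)
    for filters in (64, 64, 128):                # illustrative depth and widths
        x = residual_block(x, filters)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```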
in order to prevent overfitting, an early stopping criteria is introduced. the training will be interrupted if the validation error is not decreasing after training epochs. to counter the dataset imbalance, class weights were considered but ultimately did not lead to better classification results and were discarded. the different neural network architectures are evaluated on a amd ryzen threadripper batch normalization and the input noise is tested. the initial learning rate is set to . without batch normalization and . with batch normalization activated. the maximum number of epochs is set to . the batch sizes are set to , , and for , , , and , samples per sequence respectively. in total different setups are evaluated. furthermore, additional networks are trained on the samples dataset with "relative-to-first" transformation. two mlps to verify the need of deep neural networks, and the shallow and deep resnet trained without geographic features to measure the impact of these features. (f) "rtz" with , samples shown. the first row shows the results for the "relative-to-first" (rtf) transformation, the second for the "rotate-to-zero" (rtz) transformation. the results for the six different architectures are depicted in figure . for samples the shallow resnet and the deep resnet outperformed the other networks. in case of the "relative-to-first" transformation (see figure a ), the shallow resnet achieved an f -score of . , while the deep resnet achieved . . for the "rotate-to-zero" transformation (see figure d ), the deep resnet achieved . and the shallow resnet . . in all these cases the regularization methods lead to no improvements. the "relative-to-first" transformation performs slightly better overall. for the datasets with samples per sequence, the standard resnet variants achieve higher f -scores compared to the split resnet versions. but this difference is relatively small. as expected, the tiny resnet is not large and deep enough to classify the data on a similar level. for the "relative-first" transformation and trajectories based on samples (see figure b ), the split resnet and the total split resnet achieve the best results. the first performed well with an f -score of . , while the latter is slightly worse with . . in both cases again the regularization did not improve the result. for the "rotateto-zero" transformation (see figure e ), the shallow resnet achieved an f -score of . without any regularization and . with only the the noise added to the input. for the largest sequence length of , samples, the split based networks slightly outperform the standard resnets. for the "relative-to-first" transformation (see figure c ), the split resnet achieved an f -score of . , while for the "rotate-to-zero" transformation (see figure f ) the total split resnet reached an f -score of . . again without noise and batch normalization. to verify, that the implementation of cnns is actually necessary, additional tests with mlps were carried out. two different mlps are trained on the samples dataset with "relative-to-first" transformation since this dataset leads to best results for the resnet architectures. both networks lead to no results as their output always is the "cargo-tanker" class regardless of the actual input. the only thing the models are able to learn is, that the "cargo-tanker" class is the most probable class based on the uneven distribution of classes. an mlp is not the right model for this kind of data and performs badly. 
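the training schedule described above (learning-rate halving on a validation plateau, early stopping, an increased initial learning rate when batch normalization is active) could be configured with keras callbacks roughly as follows; the optimizer, the concrete learning rates and the patience values are assumptions where the original numbers are not legible.

```python
import tensorflow as tf

def train(model, train_ds, val_ds, use_batch_norm=False):
    """training configuration following the description above (values partly assumed)."""
    base_lr = 0.002 if use_batch_norm else 0.001   # 'doubled' initial rate with batch norm
    model.compile(optimizer=tf.keras.optimizers.Adam(base_lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    callbacks = [
        # halve the learning rate if the validation error does not improve for ten epochs
        tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=10),
        # stop training once the validation error stops decreasing
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                         restore_best_weights=True),
    ]
    return model.fit(train_ds, validation_data=val_ds, epochs=1000, callbacks=callbacks)
```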
the large dimensionality of even the small sequence length makes the use of the fully connected networks impracticable. probably, further hand-crafted feature extraction is needed to achieve better results. to measure the impact the feature "distance to coast" and "distance to closest harbor" have on the overall performance, a shallow resnet and a deep resnet are trained on the sample length data set with the "relative-to-first" transformation excluding these features. the trained networks have f -scores of . and . respectively. this means, by including this features, we are able to increase the performance by . %. the "relative-to-first" transformation compared to the "rotate-to-zero" transformation yields the better results. especially, this is easily visible for the longest sequence length. a possible explanation can be seen in the "stationary" filter. this filter removes more trajectories for the "relative-to-first" transformation than for the additional "rotate-to-zero" transformation. a problem might be, that the end point is used for rotating the trajectory. this adds a certain randomness to the data, especially for round trip sequences. in some cases, the stretched deep resnet is not able to learn the classes. it is possible, that there is a problem with the structure of the network or the large number of parameters. further, there seems to be a problem with the batch normalization, as seen in figures c and e . the overall worse performance of the "rotate-to-zero" transformation could be because of the difference in the "stationary" filter. in the "rotate-to-zero" dataset, fewer sequences are filtered out. the filter leads to more "fishing" and "pleasure craft" sequences in relation to each other as described in section . . this could also explain the difference in class prediction distribution since the network is punished more for mistakes in these classes because more classes are overall from this type. for the evaluation, the expectation based on previous work by other authors was, that the shorter sequence length should perform worse compared to the longer ones. instead the shorter sequences outperform the longer ones. the main advantages of the shorter sequences are essentially the larger number of sequences in the dataset. for example the samples dataset with "relative-to-first" transformation contains about . million sequences, while the corresponding , sample dataset contains only approximately , sequences. in addition, the more frequent segmentation can yield more easily classifiable sequences: the behaviour of a fishing vessel in general contains different characteristics, like travelling from the harbour to the fishing ground, the fishing itself, and the way back. the travelling parts are similar to other vessels and only the fishing part is unique. a more aggressive segmentation will yield more fishing sequences, that will be easier to classify regardless of observation length. the shallow resnet has the overall best results by using the samples dataset and the "relative-to-first" transformation. the results for this setup are shown in the confusion matrix in figure . as expected, the tiny resnet is not able to compete with the others. the other standard resnet architectures performed well, especially on shorter sequences. the split architectures are able to perform better on datasets with longer sequences, with the shallow resnet achieving similar performance. 
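for completeness, a sketch of the evaluation step producing the f-scores and the confusion matrix reported above; the scikit-learn implementation and the macro averaging are assumptions, as the paper does not state how the scores were computed.

```python
import numpy as np
from sklearn.metrics import f1_score, confusion_matrix

def evaluate(model, test_sequences, test_labels, class_names):
    """compute an f-score and a confusion matrix for the shiptype classifier."""
    predictions = np.argmax(model.predict(test_sequences), axis=1)
    score = f1_score(test_labels, predictions, average="macro")
    matrix = confusion_matrix(test_labels, predictions)
    print("macro f1:", round(score, 3))
    for name, row in zip(class_names, matrix):
        print(name, row)      # one confusion-matrix row per true class
    return score, matrix
```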
comparing the number of parameters, all three architectures have about , the shallow resnet about , more and the total split resnet about , less. only on the datasets with more sequences does the deep resnet perform well. this correlates with the need for more information due to the larger parameter count. due to its reduced flexibility, the split architecture can be interpreted as a "head start": the network already has information regarding the structure of the data, which in turn does not need to be extracted from the data. this can result in better performance for smaller datasets. all in all, the best results are always achieved by omitting the suggested regularization methods. nevertheless, the batch normalization had an effect on the learning rate and the number of training epochs needed: the learning rate is higher and fewer epochs are needed before convergence.

based on the resnet, several architectures are evaluated for the task of shiptype classification. from the initial dataset based on ais data with over . billion data points, six datasets with different trajectory lengths and preprocessing steps are derived. in addition to the kinematic information included in the dataset, geographic features are generated. each network architecture is evaluated with each of the datasets, with and without batch normalization and input noise. overall the best result is an f -score of . with the shallow resnet on the samples-per-sequence dataset and a shift of the trajectories to the origin. additionally, we are able to show that the inclusion of geographic features yields an improvement in classification quality. the achieved results are quite promising, but there is still some room for improvement. first of all, the sequence length used for this work might still be too long for real-world use cases. therefore, shorter sequences should be tried. additionally, interpolation for creating data with the same time delta between two samples, or some kind of embedding or alignment layer, might yield better results. as there are many sources of additional domain-related information, further research into the integration of these sources is necessary.

comparison of cnns for the detection of small objects based on the example of components on an assembly workstation

many tasks which only a few years ago had to be performed by humans can now be performed by robots or will be performed by robots in the near future. nevertheless, there are some tasks in assembly processes which cannot be automated in the next few years. this applies especially to workpieces that are only produced in very small series or to tasks that require a great deal of dexterity and sensitivity, such as inserting small screws into a thread or assembling small components. in conversations with companies we have found out that a big problem for the workers is learning new production processes. this is currently done with instructions and by supervisors, but this requires a lot of time. this effort can be significantly reduced by modern systems which accompany workers in the learning process. such intelligent systems require not only instructions that describe the target status and the individual work steps that lead to it, but also information on the current status at the assembly workstation. one way to obtain this information is to install cameras above the assembly workstation and use image recognition to calculate where an object is located at any given moment. the individual parts, often very small compared to the work surface, must be reliably detected.
we have trained and tested several deep neural networks for this purpose. we have developed an assembly workstation where work instructions can be projected directly onto the work surface using a projector. at a distance, containers for components are arranged in three rows, slightly offset to the rear, one above the other. these containers can also be illuminated by the projector. thus a very flexible pick-by-light system can be implemented. in order for the underlying system to automatically switch to the next work step and, in the event of errors, to point them out and provide support in correcting them, it is helpful to be able to identify the individual components on the work surface. we use a realsense depth camera for this purpose, from which, however, we currently only use the colour image. the camera is mounted in a central position at a height of about two meters above the work surface. thus the camera image includes the complete working surface as well as the containers and a small area next to the working surface. the objects to be detected are components of a kit for the construction of various toy cars. the kit contains components in total. some of the components differ considerably from each other, while others are very similar. since the same is true for real production components, the choice of the kit seemed appropriate for the purposes of this project. object detection, one of the most fundamental and challenging problems in computer vision, seeks to localize object instances from a large number of predefined categories in natural images. until the beginning of , a similar approach was mostly used in object detection: keypoints in one or more images of a category were searched for automatically, and at these points a feature vector was generated. during the recognition process, keypoints in the image were again searched, the corresponding feature vectors were generated and compared with the stored feature vectors. once a certain matching threshold was exceeded, an object was assigned to the category. one of the first approaches based on machine learning was published by viola and jones in [ ] . they still hand-selected features, in their case based on haar basis functions [ ] , and then classified them using a variant of adaboost [ ] . starting in with the publication of alexnet by krizhevsky et al. [ ] , deep neural networks became more and more the standard in object detection tasks. they used a convolutional neural network which has million parameters in five convolutional layers, some of which are followed by max-pooling layers, three fully-connected layers and a final softmax layer. they won the imagenet lsvrc- competition with an error rate almost half as high as that of the second best. inception-v is mostly identical to inception-v by szegedy et al. [ ] . it is based on inception-v [ ] . all inception architectures are composed of dense modules. instead of stacking convolutional layers, they stack modules or blocks, within which are convolutional layers. for inception-v they redesigned the architecture of inception-v to avoid representational bottlenecks and to compute more efficiently by using factorisation methods. they were the first to use batch normalisation in object detection tasks. in previous architectures the most significant difference had been the increasing number of layers. but with increasing network depth, accuracy becomes saturated and then degrades rapidly. he et al.
[ ] addressed this problem with resnet using skip connections, while building deeper models. in howard et al. presented mobilenet architecture [ ] . mobilenet was developed for efficient work on mobile devices with less computational power and is very fast. they used depthwise convolutional layers for a extremely efficient network architecture. one year later sandler et al. [ ] published a second version of mobilenet. besides some minor adjustments, a bottleneck was added in the convolutional layers, which further reduced the dimensions of the convolutional layers. thus a further increase in speed could be achieved. in addition to the neural network architectures presented so far, there are also different methods to detect in which area of the image the object is located. the two most frequently used are described briefly below. to bypass the problem of selecting a huge number of regions, girshick et al. [ ] proposed a method where they use selective search by the features of the base cnn to extract just regions proposals from the image. liu et al. [ ] introduced the single shot multibox detector (ssd). they added some extra feature layers behind the base model for detection of default boxes in different scales and aspect ratios. at prediction time, the network generates scores for the presence of each object in each default box. then it produces adjustments to the box to better match the object shape. there is just one publication over the past few years which gives an survey of generic object detection methods. liu et al. [ ] compared common object detection architectures for generic object detection. there are many other comparisons of specific object detection tasks. for example pedestrian detection [ ] , face detection [ ] and text detection [ ] . the project is based on the methodology of supervised learning. thereby the models are trained using a training dataset consisting of many samples. each sample within the training dataset is tagged with a so called label (also called annotation). the label provides the model with information about the desired output for this sample. during training, the output generated by the model is then compared to the desired output (labels) and the error is determined. this error on the one hand gives information about the current performance of the model and, on the other hand it is used for further mathematical computations to adjust the model's parameters, so that the model's performance improves. for the training of neural networks in the field of computer vision the following rule of thumb applies: the larger and more diverse the training dataset, the higher the accuracy that can be achieved by the trained model. if you have too little data and/or run it through the model too often, this can lead to so-called overfitting. overfitting means that instead of learning an abstract concept that can be applied to a variety of data, the model basically memorizes the individual samples [ , ] . if you train neural networks for the purpose of this project from scratch, it is quite possible that you will need more than , different images -depending on the accuracy that the model should finally be able to achieve. however, the methodology of the so-called transfer learning offers the possibility to transfer results of neural networks, which have already been trained for a specific task, completely or partially to a new task and thus to save time and resources [ ] . for this reason, we also applied transfer learning methods within the project. 
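a hedged sketch of the transfer-learning idea described above: reuse a network pre-trained on a large dataset and train only a new head for the project's own classes. mobilenetv2 as backbone and the input size are illustrative choices; the project itself starts from pre-trained detection models of the tensorflow object detection api rather than a plain image classifier.

```python
import tensorflow as tf

def build_transfer_model(n_classes, input_shape=(224, 224, 3)):
    """freeze a pre-trained backbone and attach a new, trainable classification head."""
    base = tf.keras.applications.MobileNetV2(input_shape=input_shape,
                                             include_top=False, weights="imagenet")
    base.trainable = False                       # keep the pre-trained features frozen
    inputs = tf.keras.Input(shape=input_shape)
    x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```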
the training dataset was created manually: a tripod, a mobile phone camera ( megapixel format x ) and an apeman action cam ( megapixel format x ) were used to take images for each of the classes. this corresponds to , images in total (actually images were taken per class, but only were suitable for use as training data). all images were documented and sorted into close-ups (distance between camera and object less than or equal to cm) and standards (distance between camera and object more than cm). this procedure should ensure the traceability and controllability of the data set. in total, the training data set contains approx. % close-ups and approx. % standards, each taken on different backgrounds and under different lighting conditions (see fig. ). the labelimg tool was used for the labelling of the data. with the help of this tool, bounding boxes, whose coordinates are stored in either yolo or pascval voc format, can be marked in the images [ ] . for the training of the neural networks the created dataset was finally divided into: ur-ai // -training data ( % of all labelled images): images that are used for the training of the models and that pass through the models multiple times during the training. -test data ( % of all labelled images): images that are used for later testing or validation of the training results. in contrast to the images used as training data, the model is presented these images for the first time after training. the goal of this approach, which is common in deep learning, is to see how well the neural network recognizes objects in images, that it has never seen before, after the training. thus it is possible to make a statement about the accuracy and to be able to meet any further training needs that may arise. the training of deep neural networks is very demanding on resources due to the large number of computations. therefore, it is essential to use hardware with adequate performance. since the computations that run for each node in the graph can be highly parallelized, the use of a powerful graphical processing unit (gpu) is particularly suitable. a gpu with its several hundred computing cores has a clear advantage over a current cpu with four to eight cores when processing parallel computing tasks [ ] . these are the outline parameters of the project computer in use: -operating system (os): ubuntu . . lts -gpu: geforce r gtx ti ( gb gddr x-memory, data transfer speed gbit/s) for the intended comparison the tensorflow object detection api was used. tensorflow object detection api is an open source framework based on tensorflow, which among other things provides implementations of pre-trained object detection models for transfer learning [ , ] . the api was chosen because of its good and easy to understand documentation and its variety of pre-trained object detection models. for the comparison the following models were selected: -ssd mobilenet v coco: [ , , ] -ssd mobilenet v coco: [ , , ] -faster rcnn inception v coco: [ ] [ ] [ ] -rfcn resnet coco: [ ] [ ] [ ] to ensure comparability of the networks, all of the selected pre-trained models were trained on the coco dataset [ ] . fundamentally, the algorithms based on cnn models can be grouped into two main categories: region-based algorithms and one-stage algorithms [ ] . while both ssd models can be categorized as one-stage algorithms, faster r-cnn and r-fcn fall into the category of region-based algorithms. 
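the annotation handling just described (labelimg output in pascal voc xml, followed by a split into training and test data) could look roughly like this; the split ratio is an assumption, since the percentages are not legible in the extracted text.

```python
import random
import xml.etree.ElementTree as ET
from pathlib import Path

def read_voc_annotation(xml_path):
    """read one pascal voc xml file written by labelimg and return its bounding boxes."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        bb = obj.find("bndbox")
        boxes.append({
            "label": obj.findtext("name"),
            "xmin": int(bb.findtext("xmin")), "ymin": int(bb.findtext("ymin")),
            "xmax": int(bb.findtext("xmax")), "ymax": int(bb.findtext("ymax")),
        })
    return boxes

def split_dataset(annotation_dir, train_fraction=0.8, seed=42):
    """shuffle all annotation files and split them into training and test sets."""
    files = sorted(Path(annotation_dir).glob("*.xml"))
    random.Random(seed).shuffle(files)
    cut = int(len(files) * train_fraction)
    return files[:cut], files[cut:]
```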
one-stage algorithms predict both -the fields (or the bounding boxes) and the class of the contained objects -simultaneously. they are generally considered extremely fast, but are known for their trade-off between accuracy and real-time processing speed. region-based algorithms consist of two parts: a special region proposal method and a classifier. instead of splitting the image into many small areas and then working with a large number of areas like conventional cnn would proceed, the region-based algorithm first proposes a set of regions of interest (roi) in the image and checks whether one of these fields contains an object. if an object is contained, the classifier classifies it [ ] . region-based algorithms are generally considered as accurate, but also as slow. since, according to our requirements, both accuracy and speed are important, it seemed reasonable to compare models of both categories. besides the collection of pre-trained models for object detection, the tensorflow object detection api also offers corresponding configuration files for the training of each model. since these configurations have already shown to be successful, these files were used as a basis for own configurations. the configuration files contain information about the training parameters, such as the number of steps to be performed during training, the image resizer to be used, the number of samples processed as a batch before the model parameters are updated (batch size) and the number of classes which can be detected. to make the study of the different networks as comparable as possible, the training of all networks was configured in such a way that the number of images fed into the network simultaneously (batch size) was kept as small as possible. since the configurations of some models did not allow batch sizes larger than one, but other models did not allow batch sizes smaller than two, no general value for all models could be defined for this parameter. during training, each of the training images should be passed through the net times (corresponds to epochs). the number of steps was therefore adjusted accordingly, depending on the batch size. if a fixed shape resizer was used in the base configurations, two different dimensions of resizing (default: x pixels and custom: x pixels) were selected for the training. table gives an overview of the training configurations used for the training of the different models. in this section we will first look at the training, before we then focus on evaluating the quality of the results and the speed of the selected convolutional neural networks. when evaluating the training results, we first considered the duration that the neural networks require for epochs (see fig. ). it becomes clear that especially the two region based object detectors (faster r-cnn inception v and rfcn resnet ) took significantly longer than the single shot object detectors (ssd mobilenet v and ssd mobilenet v ). in addition, the single shot object detectors clearly show that the size of the input data also has a decisive effect on the training duration: while ssd mobilenet v with an input data size of x pixels took the shortest time for the training with hours minutes and seconds, the same neural network with an input data size of x pixels took almost three hours more for the training, but is still far below the time required by rfcn resnet for epochs of training. the next point in which we compared the different networks was accuracy (see fig. ). 
we focused on seeing which of the nets were correct in their detections and how often (absolute values), and we also wanted to see what proportion of the total detections were correct (relative values). the latter seemed to us to make sense especially because some of the nets showed more than three detections for a single object. the probability that the correct classification will be found for the same object with more than one detection is of course higher in this case than if only one detection per object is made. with regard to the later use at the assembly table, however, it does not help us if the neural net provides several possible interpretations for the classification of a component. figure shows that, in this comparison, the two region based object detectors generally perform significantly better than the single shot object detectors -both in terms of the correct detections and their share of the total detections. it is also noticeable that for the single shot object detectors, the size of the input data also appears to have an effect on the comparison point on the result. however, there is a clear difference to the previous comparison of the required training durations: while the training duration increased uniformly with increasing size of the images with the single shot object detectors, such a uniform observation cannot be made with the accuracy, concerning the relation to the input data sizes. while ssd mobilenet v achieves good results with an input data size of x pixels, ssd mobilenet v delivers the worst result of this comparison for the same input data size (regarding the number of correct detections as well as their share of the total detections). with an input data size of x pixels, however, the result improves with ssd mobilenet v , while the change to a smaller input data size has a deteriorating effect on the result with ssd mobilenet v . the best result of this comparison -judging by the absolute values -was achieved by faster r-cnn inception v . however, in terms of the proportion of correct detections in the total detections, the region based object detector is two percentage points behind rfcn resnet , also a region based object detector. we were particularly interested in how neural networks would react to particularly similar, small objects. therefore, we decided to investigate the behavior of neural networks within the comparison using an example to illustrate the behavior of the three very similar objects. figure shows the selected components for the experiment. for each of these three components we examined how often it was correctly detected and classified by the compared neural networks and how often the network misclassified it with which of the similar components. the first and the second component was detected in nearly all cases by both region based approaches. the classification by inception-v and resnet- failed in about one third of images. the ssd networks detected the object in just one of twenty cases but mobilenet classified this correct. it has been surprising, that the results for the third component looks very different to the others (see fig. ). ssd mobilenet v correctly identified the component in seven of ten images and did not produce any detections that could be interpreted as misclassifications with one of the similar components. ssd mobilenet v did not detect any of the three components, as in the two previous investigations. the results of the two region based object detectors are rather moderate. 
faster r-cnn inception v has detected the correct component in four of ten images, but still five misclassifications with the other two components. rfcn resnet has caused many misclassifications with the other two components. only two of ten images were correctly detected but it had six misclassifications with the similar components. an other important aspect of the study is the speed, or rather the speed at which the neural networks can detect objects, especially with regard to later use at the assembly table. for the comparison of the speeds on the one hand the data of the github repository of the tensorflow object detection api for the individual neural nets were used, on the other hand the actual speeds of the neural nets within this project were measured. it becomes clear that the speeds measured in the project are clearly below the achievable speeds that are mentioned in the github repository of the tensorflow object-detection api. on the other hand, the differences between the speeds of the region based object detectors and the single shot object detectors in the project are far less drastic than expected. we have created a training dataset with small, partly very similar components. with this we have trained four common deep neural networks. in addition to the training times, we examined the accuracy and the recognition time with general evaluation data. in addition, we examined the results for ten images each of three very similar and small components. none of the networks we trained produced suitable results for our scenario. nevertheless, we were able to gain some important insights from the results. at the moment, the runtime is not yet suitable for our scenario, but it is also not far from the minimum requirements, so that these can easily be achieved with smaller optimizations and better hardware. it was also important to realize that there are no serious runtime differences between the different network architectures. the two region based approaches delivered significantly better results than the ssd approaches. however, the results of the detection of the third small component suggest that mobilenet in combination with a faster r-cnn could possibly deliver even better results. longer training and training data better adapted to the intended use could also significantly improve the results of the object detectors. team schluckspecht from offenburg university of applied sciences is a very successful participant of the shell eco marathon [ ] . in this contest, student groups are to design and build their own vehicles with the aim of low energy consumption. since the event features the additional autonomous driving contest. in this area, the vehicle has to fulfill several tasks, like driving a parcour, stopping within a defined parking space or circumvent obstacles, autonomously. for the upcoming season, the schluckspecht v car of the so called urban concept class has to be augmented with the according hardware and software to reliably recognize (i. e. detect and classify) possible obstacles and incorporate them into the software framework for further planning. in this contribution we describe the additional components in hard-and software that are necessary to allow an opitcal d object detection. main criteria are accuracy, cost effectiveness, computational complexity for relative real time performance and ease of use with regard to incorporation in the existing software framework and possible extensibility. this paper consists of the following sections. 
at first, the schluckspecht v system is described in terms of hard- and software components for autonomous driving and the additional parts for the visual object recognition. the second part scrutinizes the object recognition pipeline; the software frameworks, the neural network architecture and the final data fusion in a global map are described in detail. the contribution closes with an evaluation of the object recognition results and conclusions. the schluckspecht v is a self-designed and self-built vehicle according to the requirements of the eco marathon rules. the vehicle is depicted in figure . the main features are the relatively large size, including driver cabin, motor area and a large trunk, a fully equipped lighting system and two doors that can be opened separately. for the autonomous driving challenges, the vehicle is additionally equipped with several essential components. the hardware consists of actuators, sensors, computational hardware and communication controllers. the software is based on a middleware, can-open communication layers, and localization, mapping and path planning algorithms that are embedded into a high-level state machine.

actuators: the car is equipped with two actuators, one for steering and one for braking. each actuator is paired with sensors for measuring steering angle and braking pressure.

environmental sensors: several sensors are needed for localization and mapping. the backbone is a multilayer d laser scanning system (lidar), which is combined with an inertial navigation system consisting of accelerometers, gyroscopes and magnetic field sensors, all realized as triads. odometry information is provided by a global navigation satellite system (gnss) and two wheel encoders. the communication is based on two separate can-bus systems, one for basic operations and an additional one for the autonomous functions. the hardware can nodes are designed and built by the team, coupling usb, i c, spi and can-open interfaces. messages are sent from the central processing unit or from the driver, depending on the drive mode. the trunk of the car is equipped with an industrial-grade high-performance cpu and an additional graphics processing unit (gpu). can communication is ensured with an internal card; remote access is possible via generic wireless components.

software structure: the schluckspecht uses a modular software system consisting of several basic modules that are activated and combined within a high-level state machine as needed. an overview of the main modules and possible sensors and actuators is depicted in figure .

localization and mapping: the schluckspecht v runs a simultaneous localization and mapping (slam) framework for navigation, mission planning and environment representation. in its current version we use a graph-based slam approach built upon the cartographer framework developed by google [ ] . we calculate a dynamic occupancy grid map that can be used for further planning. sensor data is provided by the lidar, inertial navigation and odometry systems. an example of a drivable map is shown in figure . this kind of map is also used as the basis for the localization and placement of the later detected obstacles. the maps are accurate to roughly centimeters, providing relative localization towards obstacles or homing regions.

path planning: to make use of the slam-created maps, an additional module calculates the motion commands from the start to the target pose of the car.
the schluckspecht is a classical car like mobile system which means that the path planning must take into account the non holonomic kind of permitted movement. parking maneuvers, close by driving on obstacles or planning a trajectory between given points is realized as a combination of local control commands based upon modeled vehicle dynamics, the so called local planner, and optimization algorithms that find the globally most cost efficient path given a cost function, the so called global planner. we employ a kinodynamic strategy, the elastic band method presented in [ ] , for the local planning. global planning is realized with a variant of the a* algorithm as described in [ ] . middleware and communication all submodules, namely, localization, mapping, path planning and high-level state machines for each competition are implemented within ur-ai // the robot operating system (ros) middleware [ ] . ros provides a messaging system based upon the subscriber/publisher principle. the single modules are capsuled in a process, called node, capable to asynchronously exchange messages as needed. due to its open source character and an abundance on drivers and helper functions, ros provides additional features like hardware abstraction, device drivers, visualization and data storage. data structures for mobile robotic systems, e. g. static and dynamic maps or velocity control messages, allow for rapid development. the lidar sensor system has four rays, enabling only the incorporation of walls and track delimiters within a map. therefore, a stereo camera system is additionally implemented to allow for object detection of persons, other cars, traffic signs or visual parking space delimiters and simultaneously measure the distance of any environmental objects. camera hardware a zed-stereo-camera system is installed upon the car and incorporated into the ros framework. the system provides a color image streams for each camera and a depth map from stereo vision. the camera images are calibrated to each other and towards the depth information. the algorithms for disparity estimation are running around frames per second making use of the provided gpu. the object recognition relies on deep neural networks. to seamlessly work with the other software parts and for easy integration, the networks are evaluated with tensorflow [ ] and pytorch [ ] frameworks. both are connected to ros via the opencv image formats providing ros-nodes and -topics for visualization and further processing. the object recognition pipeline relies on a combination of mono camera images and calibrated depth information to determine object and position. core algorithm is a deep learning approach with convolutional neural networks. ur-ai // main contribution of this paper is the incorporation of a deep neural network object detector into our framework. object detection with deep neural networks can be subdivided into two approaches, one being a two step approach, where regions of interest are identified in a first step and classified in a second one. the second are so called single shot detectors (like [ ] ), that extract and classify the objects in one network run. therefore, two network architectures are evaluated, namely yolov [ ] as a single shot approach and faster r-cnn [ ] as two step model. all are trained on public data sets and fine tuned to our setting by incorporating training images from the schluckspecht v in the zed image format. 
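a skeleton of how such a detector can be wrapped as a ros node with standard python libraries, subscribing to the camera stream and publishing annotated detections; the topic names and the detector interface are assumptions.

```python
#!/usr/bin/env python
import cv2
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

class DetectorNode:
    """subscribe to the left zed image, run a detector, publish an annotated image."""
    def __init__(self, detector):
        self.bridge = CvBridge()
        self.detector = detector
        self.publisher = rospy.Publisher("/detections/image", Image, queue_size=1)
        rospy.Subscriber("/zed/left/image_rect_color", Image, self.callback, queue_size=1)

    def callback(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        boxes = self.detector(frame)             # list of (label, score, x, y, w, h)
        for label, score, x, y, w, h in boxes:
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        self.publisher.publish(self.bridge.cv2_to_imgmsg(frame, encoding="bgr8"))

if __name__ == "__main__":
    rospy.init_node("object_detector")
    DetectorNode(detector=lambda frame: [])      # placeholder detector callable
    rospy.spin()
```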
the models are pre-selected for their real-time capability in combination with the expected classification performance. this excludes the currently best instance segmentation network, mask r-cnn [ ] , due to its computational burden, as well as fast but inaccurate networks based on the mobilenet backbone [ ] . the class count is adapted to the contest, in the given case eight classes, including the relevant classes pedestrian, car, van, tram, and cyclist. for this paper, the two chosen network architectures were trained in their respective frameworks, i.e. darknet for the yolov3 detector and tensorflow for the faster r-cnn detector. yolov3 is used in its standard form with the darknet backbone; faster r-cnn is designed with the resnet [ ] backbone. the models were trained on local hardware with the kitti [ ] data set. alternatively, an open-source data set from the teaching company udacity, with only three classes (truck, car, pedestrian), was tested. to deal with the problem of domain adaptation, the training images for yolov3 were pre-processed to fit the aspect ratio of the zed camera. the faster r-cnn net can cope with ratio variations as it uses a two-stage approach for detection based on region-of-interest pooling. both networks were trained and stored. afterwards, they are incorporated into the system via a ros node making use of standard python libraries. the detector output is represented by several labeled bounding boxes within the 2d image. three-dimensional information is extracted from the associated depth map by calculating the center of gravity of each box to get an x and y coordinate within the image. by interpolating the depth map pixels accordingly, one gets the distance coordinate z from the depth map to determine the object position p(x, y, z) in the stereo camera coordinate system. the ease of projection between different coordinate systems is one reason to use the ros middleware. the complete vehicle is modeled in a so-called transform tree (tf-tree) that allows the direct interpolation between different coordinate systems in all six spatial degrees of freedom. the dynamic map created in the slam subsystem is then augmented with the current obstacles in the car coordinate system. the local path planner can take these into account and plan a trajectory including kinodynamic constraints to prevent a collision or initiate a braking maneuver. both newly trained networks were first evaluated on the training data. exemplary results for the kitti data set are shown in figure . the results clearly indicate an advantage for the yolov3 system, both in speed and accuracy. the figure depicts good results for occlusions (e.g. the car on the upper right) and for a high object count (see the black car on the lower left as an example). the evaluation on a desktop system showed fps for yolov3 and approximately fps for faster r-cnn. after validating the performance on the training data, both networks were started as a ros node and tested on real data from the schluckspecht vehicle. as the training data differs from the zed camera images in format and resolution, several adaptations were necessary for the yolov3 detector. the images are cropped in real time before being presented to the neural net to emulate the format of the training images. the r-cnn-like two-stage networks are directly connected to the zed node. the test data is not labeled with ground truth; it is therefore not possible to give quantitative results for the recognition task.
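a minimal sketch of the position-extraction step described above: the centre of a detected bounding box is read from the aligned depth map and back-projected to a 3d point with a pinhole camera model. the intrinsic parameters (fx, fy, cx, cy) and the median-based depth sampling are illustrative assumptions; in practice the zed driver provides the calibrated intrinsics, and the resulting point is then transformed into the map frame via the tf-tree.

```python
import numpy as np

def box_to_3d(depth_map, box, fx, fy, cx, cy):
    """project the centre of a 2d bounding box to a 3d point (x, y, z) in the
    camera frame, using the aligned depth map and a pinhole camera model.
    box = (x_min, y_min, x_max, y_max) in pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    u = int((x_min + x_max) / 2)          # centre column of the box
    v = int((y_min + y_max) / 2)          # centre row of the box
    # sample a small window around the centre and take the median depth,
    # which is more robust to missing or noisy disparity values
    window = depth_map[max(v - 2, 0):v + 3, max(u - 2, 0):u + 3]
    z = float(np.nanmedian(window))       # distance along the optical axis
    x = (u - cx) * z / fx                 # back-projection with the pinhole model
    y = (v - cy) * z / fy
    return x, y, z
```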
table gives an overview of the object detection and classification; the subsequent figures give an impression of exemplary results. the evaluation on the schluckspecht videos showed an advantage for the yolov3 network. the main reason is the faster computation, which results in a frame rate nearly twice as high compared to the two-stage detectors. in addition, the recognition of objects in the distance, i.e. smaller objects, is a strong point of yolo. the closer the camera gets, the more the balance shifts towards faster r-cnn, which outperforms yolo in all categories for larger objects. what becomes apparent is a maximum detection distance of approximately meters, beyond which cars become too small in size. figure shows an additional result demonstrating the detection power for partially obstructed objects. another interesting finding was the capability of the networks to generalize. faster r-cnn copes much better with new object instances than yolov3. persons with so far unseen clothing colors or darker areas containing vehicles remain a problem for yolo, but commonly not for the r-cnn. the domain transfer from training data from berkeley and kitti to real zed vehicle images proved problematic. this contribution describes an optical object recognition system, in hardware and software, for the application of autonomous driving under restricted conditions within the shell eco marathon competition. an overall overview of the system and the incorporation of the detector within the framework is given. the main focus was the evaluation and implementation of several neural network detectors, namely yolov3 as a single-shot detector and faster r-cnn as a two-step detector, and their combination with distance information to gain three-dimensional information for the detected objects. for the given application, the advantage clearly lies with yolov3. especially the achievable frame rate of minimum hz allows a seamless integration into the localization and mapping framework. given the velocities and the map update rate, the object recognition and its integration via sensor fusion for path planning and navigation work in quasi real-time. for future applications we plan to further increase the detection quality by incorporating new classes and modern object detector frameworks like m2det [ ] . this will additionally increase the frame rate and the bounding box quality. for more complex tasks, the data of the 3d lidar system shall be directly incorporated into the fusion framework to enhance the perception of object boundaries and object velocities.
references
shell: the shell eco marathon
real-time loop closure in 2d lidar slam
kinodynamic trajectory optimization and control for car-like robots
experiments with the graph traverser program
robot operating system
automatic differentiation in pytorch
ssd: single shot multibox detector
yolov3: an incremental improvement
rich feature hierarchies for accurate object detection and semantic segmentation
mobilenets: efficient convolutional neural networks for mobile vision applications
deep residual learning for image recognition
are we ready for autonomous driving? the kitti vision benchmark suite
m2det: a single-shot object detector based on multi-level feature pyramid network

this research and development project is funded by the german federal ministry of education and research (bmbf) and the european social fund (esf) within the program "future of work" ( l c ) and implemented by the project management agency karlsruhe (ptka). the author is responsible for the content of this publication. underlying projects to this article are funded by the wtd of the german federal ministry of defense. the authors are responsible for the content of this article. this work was developed in the fraunhofer cluster of excellence "cognitive internet technologies".

key: cord- -swha e m authors: huizinga, tom w j; knevel, rachel title: interpreting big-data analysis of retrospective observational data date: - - journal: lancet rheumatol doi: . /s - ( ) - sha: doc_id: cord_uid: swha e m
interpreting big-data analysis of retrospective observational data
current data technology offers fantastic new opportunities to generate data that can inform us about the safety of drugs. these data will affect the way we use drugs by balancing the benefits of specific agents with better and more information on their associated risks. nowadays, the possibilities of using big data to deal with safety concerns are enormous, and it is difficult not to get enthusiastic reading papers that take this approach. an outstanding example is the use of claims data of patients with rheumatoid arthritis to assess the risk of lower-tract gastrointestinal perforation associated with tocilizumab and tofacitinib in comparison with other biological drugs. in the lancet rheumatology, jennifer lane and colleagues present a study using claims data and electronic medical records (mostly of patients with rheumatoid arthritis) to analyse the long-term risks of cardiovascular complications (among other outcomes) in about users of hydroxychloroquine compared with more than users of sulfasalazine. this analysis is relevant because the european league against rheumatism (eular) guidelines for the treatment of patients with systemic lupus erythematosus (sle) recommend hydroxychloroquine for all patients with sle, and in practice the drug is given for decades. most doctors will feel that a study as large as that of lane and colleagues is most likely relevant, and they will try to weigh the information presented to optimise treatment strategies for their patients. it has been convincingly shown that most published data are false, and the corollary that the hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true is a relevant consideration given the recent discussions around the use of hydroxychloroquine in patients with covid. so what considerations can be made? might this be a false positive result? in such a retrospective analysis of observational data, there can of course be confounding by indication. it is important to note that the authors used state-of-the-art methods to deal with the challenges of studying retrospective electronic medical record data; they did a new-user cohort study and a self-controlled case series to avoid the risk of bias in a case-control design, using propensity scores, fitting models with ten-fold cross validation, and negative control outcome analyses.
the study thus provides a relevant guide for researchers in the field of electronic medical record analyses. still, the question remains whether the results should guide our current standards of care. as the authors state in their discussion, the cohort included patients who were new users of hydroxychloroquine or sulfasalazine with a diagnosis of rheumatoid arthritis, without medication use in the previous days, and with at least days of continuous observation time before the index event. in general, one expects hydroxychloroquine to be used in patients with more comorbidities and, from clinical reasoning alone, there is high potential for differences between the cohorts. this is a limitation of the study, as the authors correctly emphasise. thoroughly constructed propensity scoring was used to adjust for confounders, but this approach cannot control for all differences and could accidentally include intermediary variables. it is also useful to look at the absolute numbers; the number of events for cardiovascular-related mortality was · per person-years for patients taking hydroxychloroquine compared with · per person-years for patients on sulfasalazine. given these very low absolute numbers, one needs to consider that if bias between the groups exists, then the differences between per and per years of observation might also be caused by bias. although the self-controlled case series analysis overcomes many of these possible biases, the indication for hydroxychloroquine use could still be a confounder. another unfortunate fact is that normal indicators of causality such as dose-response were missing from the study, because of the apparent lack of variation in the dose of hydroxychloroquine and the inability to obtain data on the association between the duration of hydroxychloroquine use and the cardiovascular event rate. the study by lane and colleagues also lacks controls to show that the database yields what it should. maculopathy is a well-known adverse effect of hydroxychloroquine, but the authors were not able to observe an association between hydroxychloroquine use and maculopathy in their databases. this might have been caused by positive control surveillance bias, but the absence of a positive control decreases the convincingness of the data. finally, the key question for long-term hydroxychloroquine prescription for patients with sle is how the benefits balance the risks. the current study did not (and did not intend to) address this question. so although we feel that the study by lane and colleagues is extremely interesting with regard to methodology, and we foresee the rapid growth of studies linking electronic health record data and claims data, it is difficult to weigh the current data in the context of the daily care of patients with sle, in which so much convincing evidence exists for the positive effects of hydroxychloroquine as recommended by eular.
references
risk for gastrointestinal perforation among rheumatoid arthritis patients receiving tofacitinib, tocilizumab, or other biologics
risk of hydroxychloroquine alone and in combination with azithromycin in the treatment of rheumatoid arthritis: a multinational, retrospective study
update of the eular recommendations for the management of systemic lupus erythematosus
why most published research findings are false
on the estimation and use of propensity scores in case-control and case-cohort studies
we declare no competing interests. copyright © the author(s). published by elsevier ltd. this is an open access article under the cc by . license.
key: cord- -ql tthyr authors: el-din, doaa mohey; hassanein, aboul ella; hassanien, ehab e.; hussein, walaa m.e. title: e-quarantine: a smart health system for monitoring coronavirus patients for remotely quarantine date: - - journal: nan doi: nan sha: doc_id: cord_uid: ql tthyr
coronavirus has officially become a global pandemic due to the speed of its spread across countries. the increasing number of people infected with this disease makes it impossible to give full care in hospitals and afflicts many doctors and nurses inside the hospitals. this paper proposes a smart health system that monitors patients with coronavirus remotely, in order to protect the lives of health service members (such as physicians and nurses) from infection. the smart system observes people with this disease by placing multiple sensors that record several features of each patient every second. these parameters include the patient's temperature, respiratory rate, pulse rate, blood pressure, and time. the proposed system saves lives and improves decision making in dangerous cases. it proposes using artificial intelligence and the internet-of-things to realize remote quarantine and to support decisions in various situations. it provides remote monitoring of patients and guarantees that patients receive their medicines and complete health care without anyone else getting sick with this disease. it targets two groups of people: those with the most serious medical conditions and highest infection risk, and those with less serious medical conditions who stay in their houses. observation in hospitals of the most serious medical cases causes infections in thousands of healthcare members, so there is a great need for such a system. for the less serious patients, the system enables physicians to monitor them and deliver healthcare in the patients' houses, saving hospital places for the critical cases.
the numbers of infected people and deaths are increasing daily, growing by more than % each day. most people infected by patients are healthcare members, whether doctors or nurses; recent statistics of infected doctors and nurses reach more than [ ] . thousands of them die from infection by coronavirus patients while providing healthcare in hospitals. so, there is a great need to move towards remote healthcare, especially for highly infectious viruses such as covid, to save lives. remote healthcare monitoring requires multiple sensors to record the parameters of each case in real time to improve the speed of healthcare services and remote decision making [ ] . a smart healthcare system relies on the integration of artificial intelligence and internet-of-things technologies. previous research targets monitoring patients with various conditions such as diabetes, diets, and recovery after surgical operations. such systems enable physicians to observe multiple patients at the same time, which makes them highly flexible and accurate. the sensors used are of various types, whether wearable, built-in, or mobile sensors [ ] . these systems require interpreting the data extracted from these sensors to reach the main objective. this paper presents the e-quarantine system, a proposed smart health system for monitoring coronavirus patients in remote quarantine. it has become important to save thousands of lives from infection or death. the system is based on fusing multiple data streams from various sensors to detect the degree of development of the disease and the seriousness of the health condition.
it is based on monitoring readings of heart pulse, respiratory rate, blood pressure, and blood ph level in real time. these readings form time-series data that consist of sequenced data points in the time domain; the data extracted from the multiple sensors are gathered sequentially as multi-variable measurements. the system proposes a classification of patients' cases and also targets observing multiple users concurrently. the proposed technique is constructed as a combination of fusion types, feature-level and decision-level fusion. it relies on the long short-term memory (lstm) technique, a deep neural network (dnn) technique for sequenced data, which uses the power of feature learning to improve the classification of serious health condition levels, and then uses the dempster-shafer fusion technique for decision fusion. the proposed system enables monitoring patients in their homes, which saves governmental cost and time by measuring the changes in patients' medical readings. it will serve humanity by reducing coronavirus infections and protecting healthcare members around the world, and it also saves hospital places for emergency cases. the rest of the paper is organized as follows: section ii, related works; section iii, the dangers of coronavirus; section iv, the proposed smart health system; section v, experiments and results; finally, section vi presents the conclusions and future works. several works motivate research and investment in smart health or telehealth systems, which simulate a real system for observing or following diseases remotely, whether in hospitals or patients' homes [ , , , , , ] . these motivations target saving lives, time, and cost. the main objective of smart health is remotely attending to many patients and monitoring the course of their diseases in real time. the essential challenge in smart health is interpreting, fusing, and visualizing the big data extracted from multiple smart devices or sensors, which improves the quality of decision making in medical systems. the data are collected from sensors to observe patients remotely at their houses, and aggregation and statistical methods are applied to these data to support the decisions of the medical system. smart health is a hot area of research and industry that includes connecting sensors to patients. it can provide monitoring of the remote patient in several ways, such as video, audio, or text. the main problem in this area is how to manage the data, analyze them, and visualize reports on the data. this section presents a comparative study of several previous works on smart health. it includes a combination of artificial intelligence and machine learning algorithms that support accurate prediction and evaluation of patients' problems. previous works present several smart health approaches for constructing a suitable system for observing patients for each medical case, as shown in table ( ). researchers in [ ] build a graphical smart health system for visualizing patients' data for remote physicians; noisy data and redundant features are the main limitations faced by this system. researchers in [ ] present a new smart health system with high accuracy for observing patients after surgeries; this system requires medical experts and extensive analysis by doctors to support a full vision of each case. researchers in [ ] improve the monitoring of patients remotely with an accuracy of %.
they face several limitations in reliability and integrity. researchers in [ , , ] build smart medical systems for hospitals that are powerful in monitoring patients' cases; however, they still leave room for enhancing accuracy. from these previous works, we find that constructing any smart health system requires knowing all conditions and some expert knowledge in order to automate observation and detect the important readings or anomalies for each patient. that requires supervised training to support new cases and detect problems. visualization of the extracted patient data is very important to save time and lives simultaneously; data visualization is one of the main fields of big data analytics that enables end-users to analyze, understand, and extract insights. the essential idea of deep learning stems from the study of artificial neural networks (anns) [ ] . anns are a trend in active research fields built on the standard neural network (nn): neurons produce real-valued activations and, by adjusting the weights, the network behaves as expected. deep learning approaches have been used powerfully in big data analysis for various applications and targets [ ] . they are used for computer vision, pattern recognition, speech recognition, natural language processing, and recommendation systems. there is a trade-off between accuracy and complexity when applying deep learning algorithms. these approaches include several types, such as the convolutional neural network (cnn) and the recurrent neural network (rnn) [ ] . long short-term memory recurrent neural networks (lstm-rnn) are among the most powerful dynamic classifiers publicly known. data fusion is considered a framework in which data from multiple sources are gathered and aggregated to make them more powerful and better adapted to a given application; it refers to the process of reaching more efficient results when dealing with multiple and heterogeneous data sources. there are different types of patient-doctor communication, such as video, audio, or text messages through remote iot devices, together with mobile sensors for computing bio-physical characteristics. in smart health systems, the system is responsible for measuring the effective requirement of therapy or other health-related issues considering only interviews and data gathered at the patient's home. from the previous works (refer to table ( )), we find that the hybrid of feature-level and decision-level fusion is more complex but achieves better accuracy. the new coronavirus has killed nearly times as many people in weeks as sars did in months [ ]. table ( ) discusses a comparison between covid- and sars along several dimensions. the main reason for the danger of this virus is its severe impact on respiratory function, which raises the share of high-risk cases to percent of patients and kills more than percent of confirmed cases, whereas sars killed percent of infected patients. older people, whose immune defenses have declined with age, as well as those with underlying health conditions, are much more vulnerable than the young [ ]. however, death rates are hard to estimate in the early stages of an epidemic and depend on the medical care given to patients. a lack of healthcare equipment costs lives, because this virus has a severe effect on breathing and the respiratory rate; for instance, ventilators keep patients with pneumonia breathing and thereby protect lives.
the proposed smart health system aims at monitoring coronavirus patients in remote quarantine. it targets saving thousands of lives from infection or death. it depends on the integration of artificial intelligence and the internet-of-things to fuse sensory data from various medical sensors in order to detect the degree of development of the disease and the seriousness of the health condition, and it supports making decisions quickly and simultaneously. figure ( ) shows the big picture of the proposed smart health system for monitoring infected coronavirus patients remotely based on internet-of-things devices; it monitors readings of heart pulse, respiratory rate, blood pressure, and blood ph level in real time. figure ( ) presents the proposed smart health system based on the communication between iot devices in a network. it consists of three tiers: the first tier deals with the different sensors attached to the patient, such as mobile, wearable, or iot devices and accumulated sensor readings for measuring the evolution of the patient's case (such as blood pressure); the second tier covers the fusion of data from multiple sources, mostly in different multimedia formats; the third tier covers visualization, deciding on emergency cases, profiling each case, and predicting the next problem for each patient. the data form time series of sequenced data points in the time domain, and the data extracted from the multiple sensors are gathered sequentially as multi-variable measurements. the system classifies patients' cases and also targets observing multiple users concurrently. the proposed technique is constructed as a combination of feature-level fusion and decision-level fusion. it relies on the long short-term memory (lstm) technique, a deep neural network (dnn) technique for sequenced data [ ] , which uses the power of feature learning to improve the classification of the serious health condition level, and then uses the dempster-shafer fusion technique for decision fusion [ ] . the proposed system enables monitoring patients in their homes, which saves governmental cost and time by measuring the changes in patients' medical readings. it will serve humanity by reducing coronavirus infections and protecting healthcare members around the world, and it also saves hospital places for emergency cases. the life cycle of the smart health technique includes six main layers, including the cleaning data layer, the anomaly detection layer, feature extraction based on deep learning, the lstm deep learning layer, and the fusion layer, as shown in the lifecycle in figure ( ). the architecture of the smart health technique consists of these six main layers, among them cleaning data, feature extraction, deep learning techniques, and fusion, as shown in figure ( ) . the first layer, the cleaning data layer, is used for pre-processing the time-series data. the second layer includes anomaly detection, which observes and records any changes in the data such as outliers or errors; the importance of this layer is that it neglects the context or domain and focuses on three main dimensions, data types, major features, and anomalies, which allows the technique to be unified across domains. the third layer, the feature extraction layer, automatically extracts features from any context. however, not all features are important or require fusing for the target, so feature reduction or the dropping of some features is needed before the fusion process.
the fourth layer is the deep learning algorithm, which is chosen based on the input data type; the deep neural model is constructed based on the target and the features. the fifth layer, the fusion layer, fuses data from multiple sensors. this layer manages the fusion process along three dimensions: the input data types, the features, and the anomalies in the input data. the proposed algorithm is a hybrid fusion technique between the feature fusion level and the decision fusion level with respect to determining anomalies, to improve decision making quickly and simultaneously. the hybrid technique makes a decision based on extracting new features from the data and fusing the decisions traced from each sensor. the proposed system can classify patients by risk level and make decisions concurrently; it also predicts each patient's case evolution based on the remote monitoring process. the algorithm includes six main layers. ( ) cleaning data layer: time-series data usually contain noisy or missing values, so this layer targets ensuring the quality of the input data by ignoring missing data, determining erroneous readings, filtering anomalies, and fixing structural errors. it also detects duplicate observations and ignores irrelevant notifications. this layer is very important to make the system highly reliable. the steps of this layer are: check for empty records or empty terms in each cell; check for duplicate records; check for noisy data; check for unstructured data and convert it into a suitable structure for the dataset. ( ) anomaly detection layer: the previous layer can filter anomalies or reading errors [ ] . anomalies here means the identification of rare events that occur and affect other observations; they usually require making decisions quickly and concurrently. this layer classifies anomalies to improve the predictive analysis of patients' cases. algorithm ( ) describes the main steps of anomaly detection: ( ) check the record at time t(n+1) for any change in any cell to determine errors or anomalies; ( ) if a deviation appears without any related change, it is usually classified as a reading error; ( ) to confirm this classification, compare the changed row with the rows before and after it; ( ) if changes or effects on other cells of the row are found, the reading is classified as an anomaly; ( ) anomalies are flagged as emergency cases. ( ) feature level: the feature extraction or feature engineering layer refers to the interaction between extracted features [ ] . it deletes redundant and unused features. the extracted features are medical and require expert knowledge; the combination of two features can produce a new feature that improves the results and the decision making. heart rate and respiratory rate are the significant keys to estimating the physiological state of people in several clinical settings [ , ] . they are used as the main assessment in acutely sick children, as well as in those undergoing more intensive monitoring in high dependency care. heart rate and respiratory rate are also important values for detecting responses to life-saving interventions, and they remain an integral part of the standard clinical assessment of presenting patients or children. emergency cases are classified based on readings of these parameters that fall outside their normal ranges.
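the record-level check of algorithm ( ) can be sketched in python as follows. the normal ranges, the single-step look-back and look-ahead, and the three-way classification are illustrative assumptions rather than the exact thresholds of the paper; the intent is only to show how an isolated out-of-range value is treated as a reading error while a sustained or multi-signal deviation is flagged as an anomaly.

```python
# illustrative classification of a new reading as normal / reading error / anomaly
NORMAL_RANGES = {                   # assumed adult resting ranges, for illustration only
    "heart_rate": (60, 100),        # beats per minute
    "respiratory_rate": (12, 20),   # breaths per minute
    "temperature": (36.1, 37.2),    # degrees celsius
    "blood_ph": (7.35, 7.45),
}

def out_of_range(record):
    return [k for k, (lo, hi) in NORMAL_RANGES.items()
            if k in record and not lo <= record[k] <= hi]

def classify_reading(previous, current, following=None):
    """previous / current / following are dicts of vital signs at t-1, t, t+1."""
    deviations = out_of_range(current)
    if not deviations:
        return "normal"
    confirmed = out_of_range(previous)
    if following is not None:
        confirmed = confirmed + out_of_range(following)
    # a deviation seen only in a single isolated reading is treated as a sensor error
    if not any(k in confirmed for k in deviations):
        return "reading_error"
    return "anomaly"   # sustained or multi-signal deviation: escalate to the physician
```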
the median of the representative centiles ( st, th, th, th, th, th) is computed from the data of each included study, and each age has a different normal range of heart rate and respiratory rate. algorithm ( ): feature extraction layer. ( ) set r = mean respiration features and h = mean heart features; ( ) let µt be the mean and σt the standard deviation, and set τ = µt + σt as the threshold of the allowed skewness; ( ) set rm = r and hm = h, where rm and hm are the most recent respiration and heart feature vectors; ( ) set tr to the time at which the respiration features were captured and th to the time at which the heart features were captured. ( ) prediction layer for each patient's case evolution: an artificial intelligence technique is used for monitoring patients and their case evolution. long short-term memory (lstm) is a deep learning technique for sequenced data [ ] , and this layer uses lstm to predict the future case of each patient based on the previous disease readings. in the fusion step, the sensors' notifications are combined with dempster-shafer theory; this combining rule can be generalized by iteration if we treat mj not as sensor sj's notification, but rather as the notification of sensor sk and sensor s already fused by dempster-shafer theory. ( ) visualization layer: the fusion output constructs a full vision of the patients' cases based on the sensory data extracted for each patient. physicians may find it hard to read hundreds of patients concurrently, so data visualization is a very important layer for classifying patients' serious-condition levels (high risk, medium, low); colors may be used (for example, red for high-risk cases, yellow for medium-risk cases, green for the normal rate or low-risk cases) to attract the doctors' attention quickly and improve decision making simultaneously, as in figure ( ). output: the output includes colorized and classified data for all patients of each observing doctor, as well as a detailed sheet for the evolution of each patient and the predicted risk. v. recent governmental motivations push forward to reap the benefits of smart health, and real data is beginning to be collected concurrently to save the lives of health members and patients. the proposed system uses multiple sensors or iot devices to record patients in quarantine, as in figure ( ) ; it improves healthcare delivery despite the lack of healthcare equipment and reduces infection among healthcare members. for the proposed fusion technique, the input dataset of the coronavirus quarantine is a modified dataset derived from the input data shown in table ( ); the training dataset is based on several collected real datasets for normal people and patients. the fusion process between the multiple sensory data is critical for reaching a full vision of the corona disease level. the fusion technique depends on fusing several features of coronavirus, including heart pulse, respiratory rate, body temperature, blood pressure, and blood ph level, in real time; these are the main features in the fusion process in figure ( ) . the fusion technique targets improving the prediction of the disease features and protecting patients' and healthcare members' lives. many modifications to the dataset are needed due to the lack of sensory data for patients (refer to table ( )) with respect to the real normal ranges of healthy people, as shown in table ( ) . these modifications include two steps, data augmentation and generated timestamps. the data is enlarged to . records by observing each patient over the previous hours to support predicting the next case rate and saving lives before sudden deteriorations, using the 'cardiovascular dataset'.
this dataset is fused with the real ranges of the blood ph level and the body temperature. the processing technique includes six layers, which are applied as follows. the proposed e-quarantine system uses a long short-term memory network (lstm) twice: first for detecting anomalies, and second for predicting the case evolution of each patient based on the previous disease profile. lstm is considered an evolution of the recurrent neural network technique and is very powerful for classifying sequential data. the most common way of training an rnn is backpropagation through time; however, the vanishing gradient problem usually causes the parameters to capture only short-term dependencies while the information from earlier time steps decays, and the reverse issue, exploding gradients, can cause the error to grow drastically with each time step. lstm is applied to the real dataset in the deep learning layer. an lstm layer learns long-term dependencies between time steps in time-series and sequence data; the layer performs additive interactions, which can improve gradient flow over long sequences during training. to forecast the values of future time steps of a sequence, a sequence regression is trained on the lstm network: for each time step of the input sequence, the lstm network learns to forecast the value of the next time step. the flowchart in figure ( ) is applied in this experiment for predicting the emergency cases of patients. figure ( ) presents a sample of predictive analysis for future data and time over about . records of historical data using the lstm; the predictive analytics output consists of continuous variables, i.e. a regression, in figure ( ). an lstm regression network was trained for epochs; the lstm layer contained hidden units. this technique avoids the gradient explosion problem that occurs when training artificial neural networks with backpropagation. the gradient threshold was set to , and the initial learning rate was set to . and then reduced after epochs by multiplying it by . . the lstm network was trained with the specified training options. then, when the real measurements for the time steps between predictions become known, the network state is updated with the observed values instead of the predicted values. resetting the network state prevents prior predictions from affecting predictions on the new data; thus, the network state is reset and then initialized by forecasting on the training data. predictions are made for each time step: for each forecast, the values of the following time step are predicted based on the observed values at the prior time step. predictions were made in predict-and-update-state mode, with the execution environment option set to cpu. non-standardized predictions were made using the parameters estimated earlier; then, the root-mean-square error (rmse) was computed (as shown in equation ). the predictions achieve greater accuracy when the network state is updated with the observed values rather than the predicted values. update network state with observed values: when the actual values of the time steps between predictions are available, the network state can be updated with the observed values instead of the predicted estimates. then, the predicted values are compared with the test data. figure ( ) illustrates a comparison between the forecasted values and the test data.
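the forecasting loop described above can be reproduced in outline with any deep learning framework. the sketch below uses keras with synthetic data as a stand-in for a patient's vital-sign trace; the window length, layer size, and number of epochs are illustrative assumptions and do not reproduce the training options reported here. the key part is the final loop, in which each one-step-ahead prediction is followed by updating the input window with the observed value once it arrives, mirroring the predict-and-update-state procedure.

```python
# one-step-ahead forecasting of a vital-sign sequence with an lstm (illustrative only)
import numpy as np
from tensorflow import keras

def make_windows(series, window=24):
    """turn a 1-d series into (input window, next value) training pairs."""
    x, y = [], []
    for i in range(len(series) - window):
        x.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(x)[..., None], np.array(y)      # x has shape (n, window, 1)

series = np.sin(np.linspace(0, 40, 600)) * 10 + 80  # synthetic stand-in for a heart-rate trace
train, test = series[:500], series[500:]
x_train, y_train = make_windows(train)

model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(x_train.shape[1], 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
model.fit(x_train, y_train, epochs=20, verbose=0)

# rolling forecast: after each step the input window is updated with the
# observed value once it arrives, not with the model's own prediction
window = train[-24:].copy()
preds = []
for observed in test:
    pred = model.predict(window[None, :, None], verbose=0)[0, 0]
    preds.append(pred)
    window = np.append(window[1:], observed)        # update state with the observed value

rmse = float(np.sqrt(np.mean((np.array(preds) - test) ** 2)))
```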
figure ( ) examines the forecasting with updates, with rmse = . . the x-axis refers to the input data that has been processed; the constructed neural network takes only one step of input data at a time, and it has seen times and steps of input data. the summarized dataset interprets the numeric rates as ranges to make tracing easier. the results include risk levels (high, medium, low), and these levels drive doctors to make a suitable decision quickly and simultaneously (as shown in tables ( ) and ( )). for example, patients who reach the high-risk level must go to the hospital: if it is due to a high respiratory rate, they must be put on a ventilator and a lung ct is required; if it is due to the blood ph level, they must take medicine and have their blood acidity followed; if it is due to the heart rate, body temperature, or blood pressure, they take an antiviral or anti-malarial medicine with an antipyretic. the proposed system targets two essential objectives: monitoring infected patients who are not yet high risk remotely, to prevent them from reaching high risk, and predicting the next risk level of each patient to protect them and to take decisions simultaneously.
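a minimal sketch of the risk-level mapping used in the visualization layer: each patient record is scored by which vital signs leave their normal range, and the resulting level is paired with the dashboard colour. the specific ranges and the rule that a respiratory deviation or two or more deviating vitals imply high risk are illustrative assumptions, not the thresholds of tables ( ) and ( ).

```python
# map a patient's latest vitals to a risk level and a dashboard colour (illustrative)
RANGES = {                              # assumed normal adult ranges, illustration only
    "heart_rate": (60, 100),
    "respiratory_rate": (12, 20),
    "temperature": (36.1, 37.2),
    "blood_ph": (7.35, 7.45),
    "systolic_bp": (90, 130),
}
COLOURS = {"high": "red", "medium": "yellow", "low": "green"}

def risk_level(vitals):
    deviations = [k for k, (lo, hi) in RANGES.items()
                  if k in vitals and not lo <= vitals[k] <= hi]
    # respiratory distress is escalated immediately, reflecting the role of
    # ventilator support discussed earlier
    if "respiratory_rate" in deviations or len(deviations) >= 2:
        level = "high"
    elif deviations:
        level = "medium"
    else:
        level = "low"
    return level, COLOURS[level]

# example: a patient with fever and an elevated respiratory rate
print(risk_level({"heart_rate": 92, "respiratory_rate": 26,
                  "temperature": 38.4, "blood_ph": 7.38, "systolic_bp": 118}))
```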
this paper presents the e-quarantine system for remotely monitoring patients infected with coronavirus; it is used to reduce infection and to save hospital beds and equipment for high-risk patients only. the essential objective of the e-quarantine system is to simulate quarantine for patients in their own houses, monitoring them and classifying them according to the observed disease risks. the proposed e-quarantine system monitors each patient's case flow and predicts emergency cases around hours ahead with . % accuracy based on the supervised previous data. it is based on five parameters: blood ph level, heart rate, blood pressure, body temperature, and respiratory rate. the proposed hybrid fusion combines feature-level and decision-level fusion, which improves the accuracy results to reach . %. the dempster-shafer technique is found to be more powerful on sequenced data than on images or videos; the fusion technique is applied to the patients' sequenced data and their respiratory sounds. for future work, the proposed system requires higher flexibility to adapt to multiple data types in order to improve the results for each patient. on behalf of all authors, the corresponding author states that there is no conflict of interest. this article does not contain any studies with human participants or animals performed by any of the authors. key: cord- -cm tqbzk authors: wang, zijie; zhou, lixi; das, amitabh; dave, valay; jin, zhanpeng; zou, jia title: survive the schema changes: integration of unmanaged data using deep learning date: - - journal: nan doi: nan sha: doc_id: cord_uid: cm tqbzk data is king in the age of ai. however, data integration is often a laborious task that is hard to automate, and schema change is one significant obstacle to automating the end-to-end data integration process. although mechanisms such as query discovery and schema modification languages exist to handle the problem, these approaches only work under the assumption that the schema is maintained by a database. however, we observe diversified schema changes in heterogeneous data and open data, most of which have no schema defined. in this work, we propose to use deep learning to automatically deal with schema changes through a super cell representation and the automatic injection of perturbations into the training data to make the model robust to schema changes. our experimental results demonstrate that the proposed approach is effective for two real-world data integration scenarios: coronavirus data integration and machine log integration. it was reported in that data scientists spend - % of their effort in the data integration process [ ] , [ ] , [ ] . schema change, which impacts applications and causes system downtimes, is a main factor behind the tremendous human resource overhead required for data integration. schema changes are often caused by software evolution, which is pervasive and persistent in agile development [ ] , or by the diversity of data formats due to the lack of standards [ ] . example: coronavirus disease (covid- ) data integration. to predict the coronavirus outbreak, we integrate the coronavirus data repository at johns hopkins university (jhu) [ ] and the google mobility data [ ] . jhu's data repository maintains the world's coronavirus cases on a daily basis as a set of csv files. however, we find that the schema of the data changes frequently. as illustrated in fig.
(a) , the changes include attribute name changes (e.g., longitude → long ), addition and removal of attributes (e.g., from six attributes initially to attributes), attribute type changes (e.g., date formats), key changes (e.g., from (country/region, province/state) to (combined key), (fips) and (country region, province state, admin ). the data scientist's python code for parsing these files easily breaks at each schema change, and requires manual efforts to debug and fix the issues. this becomes a ubiquitous pain for the users of this jhu data repository, and people even launched a project to periodically clean the jhu coronavirus data into a stable r-friendly format . however such cleaning https://github.com/lucas-czarnecki/covid- -cleaned-jhucsse is purely based on manual efforts and obviously not scalable. prior arts. schema evolution for data that was managed in relational databases, nosql databases, and multi-model databases are well-established research topics. the fundamental idea is to capture the semantic mappings between the old and the new schemas, so that the legacy queries can be transformed and/or legacy data can be migrated to work with the new schemas. there are two general approaches to capture the semantics mappings: ( ) to search for the queries that can transform the old schema to the new schema [ ] , [ ] , [ ] , [ ] , [ ] , [ ] . ( ) to ask the database administrators (dbas) or application developers to use a domain-specific language (dsl) to describe the schema transformation process [ ] , [ ] , [ ] - [ ] , [ ] , [ ] , [ ] , [ ] , [ ] . however, these approaches are not applicable to unmanaged data, including open data such as publicly available csv, json, html, or text files that can be downloaded from a url, and transient data that is directly collected from sensor devices or machines in real-time and discarded after being integrated. that's because the history of schema changes for these data is totally lost or become opaque to the users. it is an urgent need to automatically handle schema changes for unmanaged data without interruptions to applications and any human interventions. otherwise, with the rapid increase of the volume and diversity of unmanaged data in the era of big data and internet of things (iot), it is unavoidable to waste a huge amount of time and human resources in manually handling the system downtimes incurred by schema changes. a deep learning approach. in this work, we argue for a new data integration pipeline that uses deep learning to avoid interruptions caused by the schema changes. in the past few years, deep learning (dl) has become the most popular direction in machine learning and artificial intelligence [ ] , [ ] , and has transformed a lot of research areas, such as image recognition, computer vision, speech recognition, natural language processing, etc.. in recent years, dl has been applied to database systems and applications to facilitate parameter tuning [ ] , [ ] , [ ] , [ ] , indexing [ ] , [ ] , partitioning [ ] , [ ] , cardinality estimation and query optimization [ ] , [ ] , and entity matching [ ] , [ ] , [ ] , [ ] , [ ] , [ ] . while predictions based on deep learning cannot guarantee correctness, in the big data era, errors in data integration are usually tolerable as long as most of the data is correct, which is another motivation of our work. to the best of our knowledge, we are the first to apply deep learning arxiv: . 
v [cs.db] oct to learn the process of join/union/aggregation-like operations with schema changes occurring in the data sources. however, it's not an easy task and the specific research questions include: ( ) it is not straightforward to formulate a data integration task, which is usually represented as a combination of relational or dataflow operators such as join, union, filter, map, flatmap, aggregate, into a prediction task. what are effective representations for the features and labels? ( ) how to design the training process to make the model robust to schema changes? ( ) different model architectures, for example, simple and compact sequence models such as bi-lstm and complex and large transformer such as gpt- and bert, may strike different trade-offs among accuracy, latency, and resource consumption. what are the implications for model architecture selection in different deploying environments? ( ) annotating data to prepare for training data is always a major bottleneck in the end-to-end lifecycle of model deployment for production. then how to automate training data preparation for the aforementioned prediction tasks? uninterruptible integration of fast-evolving data. in this work, we first formulate a data integration problem as a deep learning model that predicts the position in the target dataset for each group of related data items in the source datasets. we propose to group related items in the same tuple or object that will always be processed together, and abstract each group into a super cell concept. we further propose to use source keys and attributes as features for describing the context of each cell, and use the target keys and attributes as labels to describe the target position where the super cell is mapped to. the features and labels can be represented as sentences or sentences with masks so that the representation can be applicable to state-ofart language models, including sequence models like bi-lstm and transformers like gpt- and bert. then, to seamlessly handle various schema changes, inspired by adversarial attacks [ ] , [ ] , which is a hot topic in dl, we see most types of schema changes as obfuscations injected to testing samples at inference time, which may confuse the model that is trained without noises. therefore, just like adversarial training [ ] , [ ] , [ ] , we address the problem by adding specially designed noises to the training samples to make the model robust to schema changes. the techniques we employ to do so include replacing words by randomly changed words and synonyms that are sampled from google knowledge graph. in addition, we propose to add an aggregation mode label to indicate how to handle super cells that are mapped to the same position, which can well handle the schema change of the type key expansion in the example we give earlier. based on the above discussions, we propose a fully automated end-to-end process for uninterruptible integration of fast-evolving data sources, as illustrated in fig. (b) . step , the system will leverage our proposed lachesis intermediate representation (ir) [ ] to automatically translate the user's initial data integration code into the executable code that automatically creates training data based on our proposed representation. step , obfuscations will be automatically injected to the training data to make the model robust to various schema changes. 
step , different model architectures will be chosen to train the predictive model that will perform the data integration task depending on the model deploying environment. due to the space limitation, this paper will focus on step and , while we will also discuss step as well as other techniques for reducing training data preparation overhead such as crowdsourcing and model reuse. the contributions of this work include: ( ) as to our best knowledge, we are the first to systematically investigate the application of deep learning and adversarial training techniques to automatically handle schema changes occurring in the data sources. ( ) we propose an effective formulation of the data integration problem into a prediction task as well as flexible feature representation based on our super cell concept (sec. ii). we also discuss how to alleviate the human costs involved in preparing training data for the representation (sec. v). ( ) we represent the common schema changes as various types of obfuscations, which can be automatically injected to the training data to make the model training process robust to these types of schema changes. (sec. iii) ( ) we compare and evaluate various trade-offs made by different model architectures, including both simple sequence model and complex transformer model for two different data integration tasks involving semi-structured and non-structured data respectively. (sec. iv and sec. vi) we assume that in a typical data integration scenario that converts a set of source datasets into one target dataset, each of the source datasets may have heterogeneous formats such as csv, json, text, etc., which may not be managed by any relational or nosql data stores; however, the target dataset must be tabular, so that each cell in the target dataset can be uniquely identified by its tuple identifier and attribute name. given a set of objects, where each object may represent a row in a csv file, a json object, a time series record, or a file, there are a few candidate representations for formulating the predictive task, including dataset-level representation, objectlevel representation, attribute-level representation, cell-level representation, and our proposed super cell based representation. the coarser-grained of the representations, the fewer times of inferences are required, and the more efficient of the prediction. however, a coarser-grained representation also indicates that the prediction task is more complex and harder to train a model with acceptable accuracy, because a training sample will be larger and more complex than other finergrained representations, and the mapping relationship to learn will naturally become more complicated. various levels of representations for the motivating example are illustrated in fig. collect sufficient training data represented at the dataset-level. the object-level representation groups all attributes and is not expressive enough in describing the different transformations applied to each attribute. similarly, the attribute-level representation assembles all values in the same attribute and is not efficient in expressing logics like filtering or aggregation. the cell-level representation, referring to a value of a single attribute in one tuple, is at the finest granularity, which simplifies the learning process. however, it may incur too many inferences and waste computational resources, particularly if there exist multiple cells in one object that will always be mapped/transformed together. 
therefore we propose and argue for a super cell representation. a super cell is a group of cells in an object that will always be mapped to the target table together in a similar way, such as the values of the confirmed attribute and the recovery attribute, as shown in fig. . while the flexible granularity of a super cell lies between the object level and the cell level, it balances expressiveness against the performance of the training and testing process. for each super cell, we represent its features as a sentence that concatenates the keys, as well as the value and attribute of each cell in the super cell. in particular, the keys in the context features should include both the join key, if applicable, and the tuple/object identifier of the local table. we find that for 1-1 joins, the join key usually also serves as the local tuple/object identifier; for 1-n and n-m joins, the join key is not necessarily the local key; and for non-join operations (e.g., union, filter, aggregate), no join key is required. we assume the target dataset is always tabular and propose a label representation that includes the tuple identifier (i.e., key of the target table) of the super cell, the target attribute name of each cell, and an aggregation mode specifying how values mapped to the same position should be aggregated into one value, e.g., add, avg, max, min, count, replace old, discard new, etc. suppose there are $m$ source datasets, represented as $D = \{D_i\}$ $(0 \le i < m)$, and each source dataset is modeled as a set of $n_i$ super cells, denoted as $D_i = \{s_{ij}\}$ $(0 \le i < m,\ 0 \le j < n_i)$. we further describe each super cell $s_{ij} \in D_i$ as a triplet of three vectors holding the keys shared by each cell in the super cell, the attribute names of each cell, and the values of each cell, respectively, represented as $s_{ij} = (\overrightarrow{key}_{ij}, \overrightarrow{attribute}_{ij}, \overrightarrow{value}_{ij})$. we can further define a super set to describe the current state of the entire data repository. a super cell may be mapped to zero, one, or more than one position in the target dataset, depending on the operations involved in the data integration task. for example, as illustrated in fig. , in a 1-n join operation, a super cell in the table on the left-hand side may be mapped to many positions in the target table. thus we represent the target positions where the super cell is mapped to as a list of triples: $f(s_{ij}) = \{(\overrightarrow{key}^{\,t}_{ij}, \overrightarrow{attribute}^{\,t}_{ij}, agg\_mode)\}$, where each triple refers to one position that is indexed by the target keys $\overrightarrow{key}^{\,t}$ shared by all cells in the super cell, together with the attribute name of each cell in the super cell, denoted as $\overrightarrow{attribute}^{\,t}$. given a fast-evolving data repository $S = \{s_{ij} \in D_i \mid \forall D_i \in D\}$, for a data integration request the user input should specify the schema of the expected target table. the schema includes a list of $p$ attributes denoted as $A = \{a_k\}$ $(0 \le k < p)$, where $a_k$ represents the $k$-th attribute in the target table, as well as a list of $q$ attributes that serve as the minimum tuple identifier (i.e., key), denoted as $A_{key} = \{a_l\}$ $(0 \le l < q)$. we further denote the set of all possible key values in the target table as $R = \{r_l \mid r_l \in a_l\}$ $(0 \le l < q)$.
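to make the representation concrete, the sketch below serializes one hypothetical jhu-style super cell into a feature sentence and pairs it with a label triple; the field names, concrete values, and the exact serialization format are assumptions for illustration only.

```python
# a minimal sketch (under assumed field names) of the super cell feature/label
# representation described above; the exact sentence format used by the authors
# is not specified here, so this serialization is illustrative only.
def featurize(super_cell):
    """serialize a super cell's keys, attributes, and values into one 'sentence'."""
    parts = list(super_cell["keys"])
    for attr, val in zip(super_cell["attributes"], super_cell["cells"]):
        parts += [attr, str(val)]
    return " ".join(parts)

# one super cell from a JHU-style daily CSV row: confirmed and recovered values
# that always move to the target table together (values are made up).
source_super_cell = {
    "keys": ["2020-03-22", "Hubei", "China"],       # date + local tuple identifier
    "attributes": ["confirmed", "recovered"],
    "cells": [67800, 59433],
}

# label: target key, target attribute per cell, and an aggregation mode that says
# what to do when several super cells land on the same target position.
label = {
    "target_keys": ["2020-03-22", "Hubei", "China"],
    "target_attributes": ["Confirmed", "Recovered"],
    "agg_mode": "add",       # e.g. county-level rows summed up after key expansion
}

print(featurize(source_super_cell))
# -> "2020-03-22 Hubei China confirmed 67800 recovered 59433"
```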
we then need to find a model $f: S \to (A \cup \{NULL\}) \times (R \cup \{NULL\})$ that predicts a set of target positions, denoted $\{(\overrightarrow{key}^{\,t}_{ij}, \overrightarrow{attribute}^{\,t}_{ij}, agg\_mode)\}$, for each super cell $s_{ij} \in S$, where each element of the key satisfies $\forall x \in [0, q),\ \overrightarrow{key}^{\,t}_{ij}[x] \in R$, and each element of the attribute vector satisfies $\forall y \in [0, |s_{ij}|),\ \overrightarrow{attribute}^{\,t}_{ij}[y] \in A$, with $|s_{ij}|$ denoting the number of cells in the super cell $s_{ij}$. if a super cell does not belong to the target table and should be discarded, we define $r_l = NULL$ and $a_k = NULL$ in this case. we identify five basic types of data schema changes, which cover all of the relational or nosql schema-changing patterns [ ] , [ ] , [ ] - [ ] , [ ] , [ ] , [ ] , [ ] , [ ] as well as schema changes that we have discovered in open data. we first discuss the impact of each type and then propose our approach of handling these changes by adding perturbations to the training process. (1) domain pivoting. for example, originally the dataset was stored as three csv files describing the daily confirmed coronavirus cases, daily recovery cases, and daily death cases; later the schema changed to a new set of csv files in which each file described all of the information (confirmed, death, recovery cases) for one specific date. we observed such changes to be prevalent in the coronavirus data [ ] , [ ] , [ ] and in the weather database hosted by the national oceanic and atmospheric administration (noaa) [ ] . such changes easily break a conventional python-based data integration pipeline. (2) key expansion. for example, the key of the dataset is lowered from the state level (country/region, province/state) to the county level (combined key), (fips) and (country region, province state, admin ), which means a tuple in the original table (describing statistics for a state) is broken down into multiple tuples, each describing the statistics for a county. such changes cannot be easily handled by conventional data integration methodologies. (3) attribute name and ordering change. for example, in the first csv file added to the jhu covid- daily report data repository on jan nd, , the third column is named "last update", but in the csv file added to the same repository on sept th, , the same column is moved to the fifth position and its name is slightly changed. such changes may interrupt a conventional program that joins two datasets on the "last update" column. (4) value type/format change. for example, in the daily covid- file created on jan nd, the "last update" column has values in the format " / / : ", whereas in the file from sept th the format has changed to " - - : : ". a conventional exact join using "last update" as the join key cannot handle such a value format change, unless the programmer chooses to convert it into a similarity join, which is more complicated and much slower than an exact join and results in significantly higher development costs [ ] , [ ] . (5) addition or removal of non-key attributes. for example, the jhu covid- global daily report has grown from six attributes initially to attributes after a few months. this change may not have much influence if the affected attributes are not used by the users' data integration workloads. on the contrary, if a required column is removed entirely, there is no way to handle the situation without interruption, even with a deep learning approach, so we mainly focus on the first four types of schema changes. perturbation based on schema changes.
first, our super cell based representation will not be affected by dimension pivoting, attribute ordering change, and addition or removal of irrelevant attributes, because the context for any cell remains the same despite of these schema changes. therefore a model trained with our proposed representations is robust to these types of schema changes. second, schema changes such as renaming of an attribute and reformatting of cell values, are similar to adversarial attacks, which confuse the pre-trained models by adding noises to the expected testing samples. adding perturbations to training data is an effective way of training robust models against adversarial attacks [ ] . inspired by this analogy, we add specially designed perturbations to training data to handle these parts of schema changes. if we see each super cell representation as a sequence of words (i.e., a sentence), the training data is a corpus of sentences. then we can augment the training data by adding new sentences (i.e., perturbations), which are changed from existing sentences by randomly replacing a word using synonyms extracted from google knowledge graph [ ] , [ ] , [ ] , or randomly modified words by removing one or more letters. then we train a character-based embedding on this augmented corpus using fasttext [ ] , which maps related words to vectors that are close to each other so that the model can recognize the similarity of such words. we observe through experiments that character-based embedding can achieve better accuracy and reliability than word-based embedding and can smoothly handle outof-vocabulary words. it also shows that our locally trained embedding significantly outperforms pre-trained embeddings with google news or wikipedia. third, to make the deep learning model robust to key expansion, as mentioned, we add a new label to the representation called "aggregation mode". each value of the label represents an aggregation operator such as sum, avg, min, max, which will be applied to the cells that are mapped to the target position; replace, which means the cell will replace the old cell that exists in the same position of the target table; or discard, which means the new cell will be discarded if an older cell has been mapped to the same target position. in recent several years, natural language processing (nlp) has experienced several major advancements including the bi-directional mechanism, attention mechanism, transformer mechanism, and so on. existing works show that the final hidden state in bi-lstm networks cannot capture all important information in a long sentence. therefore, the attention mechanism was introduced to address the problem by preserving information from all hidden states from encoder cells and aligning them with the current target output. later such idea was integrated into the transformer architectures, so that encoders had self-attention layers, while decoders had encoder-decoder attention layers. most recently, to make the transformer architecture more flexible to applications other than language translation, gpt- that only uses the decoders' part and bert that only uses encoders' part are invented and achieve great success in a broad class of nlp problems. our assumption is that on one hand, more complicated models like gpt- and bert may naturally achieve better accuracy than a simpler model like bi-lstm; but on the other hand, these complex models may require significantly higher storage and computational resources, as well as more training data. 
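before turning to the comparison of model architectures, the sketch below illustrates the training-data perturbation described earlier in this section; the small hand-written synonym table stands in for synonyms sampled from a knowledge graph, and the perturbation rates are placeholder values.

```python
# a minimal sketch of the training-data perturbation described earlier in this
# section; the synonym table below stands in for synonyms sampled from a
# knowledge graph, and the perturbation rates are placeholders.
import random

SYNONYMS = {                       # illustrative only
    "province/state": ["province_state", "state"],
    "country/region": ["country_region", "country"],
    "longitude": ["long_"],
}

def perturb_token(tok, p_syn=0.3, p_drop=0.2):
    """randomly rename a token via a synonym or drop one of its characters."""
    r = random.random()
    if tok.lower() in SYNONYMS and r < p_syn:
        return random.choice(SYNONYMS[tok.lower()])
    if len(tok) > 3 and r < p_syn + p_drop:
        i = random.randrange(len(tok))
        return tok[:i] + tok[i + 1:]           # simulate a small spelling change
    return tok

def perturb_sentence(sentence):
    return " ".join(perturb_token(t) for t in sentence.split())

clean = "2020-03-22 Hubei China province/state Hubei confirmed 67800"
augmented = [clean] + [perturb_sentence(clean) for _ in range(3)]
for s in augmented:
    print(s)   # original plus noisy copies added to the training corpus
```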
it is important to know the trade-offs among accuracy, latency, and resource consumption, made by different model architectures. we mainly consider two types of language model architectures: ( ) simple and compact sequence models based on customized local character-based embedding and bi-lstm; and ( ) complex and large pre-trained transformer models, such as gpt- [ ] and bert [ ] . our bi-lstm model architecture, includes an embedding layer that has neurons; a bi-lstm layer that consists of neurons; and a fullyconnected layer that has neurons. ) transformer model: moreover, we also consider transformer models based on gpt- [ ] and bert [ ] . we use a pre-trained gpt- small model or a pre-trained bert base model as the backend, which connects to a frontend classifier composed of four convolutional layers and a fully connected layer. during the training process, the parameters of the gpt- small model and the bert base model are freezed, and only the parameters of the frontend will be updated. the pre-trained gpt- small has millions of parameters, including layers of transformers, each with independent attention mechanisms, called "heads", and an embedding size of dimensions. the hidden vector output from the gpt- small model is reshaped to add a channel dimension and then passed to four convolutional layers, including two maxpooled d convolution layer and two average-pooled d convolution layer respectively, the output is applied with a hadamard product, and then sent to a fully connected layer. the bert base model has millions of parameters, with transformer blocks, and each has hidden neurons and self-attention heads. it uses the same architecture of the frontend classifier with the gpt- small model. although gpt- and bert are both based on the transformer model, they use different units of the transformer. gpt- is built using transformer decoder blocks constructed by the masked selfattention layers, while the bert utilizes transformer encoder blocks with self-attention layers. although transformer models usually achieve better accuracy than sequence models through its attention mechanism, they also require significantly more storage space. for example, gpt- small, which is the smallest variant of gpt- model requires more than megabytes of storage space; the bert base model that we use takes megabytes of storage space. in contrast, the bi-lstm model is smaller than megabyte. for each super cell, the model will predict a set of target positions in the form of {( key t ij , attribute t ij , agg mode)}, as we mentioned in sec. ii. then based on each super cell and its predicted positions, a general data assembler will put each value to the right places in the target table. based on the configuration, the assembler can work in either local mode by buffering and writing one file to store the target dataset in local or dispatch the assembled tuples to users' registered deep learning workers (i.e., target data is consumed by a deep learning application) once an in-memory buffer is full. in the latter case, in each deep learning worker's side, a client is responsible for receiving and assembling tuples into the target dataset. during the dispatching process, the output table will be partitioned in a way to guarantee load balance and ensure the independent identical distribution (i.e., iid) to avoid introducing bias. an important objective of this work is to free human experts from all dirty works of wrangling with schema changes. 
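a minimal pytorch sketch of the simpler of the two architectures described above (embedding layer, bidirectional lstm, fully connected output) is shown below; the layer sizes are placeholders for the values lost from the text, and the single classification head is a simplification of the model's actual outputs (target keys, attributes, and aggregation mode).

```python
# a minimal PyTorch sketch of the compact sequence model described above:
# an embedding layer, a bidirectional LSTM, and a fully connected output layer.
# layer sizes and the single classification head are simplifying placeholders;
# the paper's model predicts target keys, attributes, and an aggregation mode.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, n_classes, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))    # (batch, seq, 2*hidden)
        pooled = h.mean(dim=1)                   # average over the sequence
        return self.fc(pooled)                   # logits over target positions

model = BiLSTMClassifier(vocab_size=5000, n_classes=30)
batch = torch.randint(1, 5000, (8, 40))          # 8 serialized super cells
logits = model(batch)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 30, (8,)))
loss.backward()
```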
therefore it's critical to reduce the human efforts required in training data preparation, such as parsing and annotating data. we propose to automate the training data creation by utilizing conventional python code developed for integrating an initial set of data sources. the users' python codes specify how to transform the data sources (usually with heterogeneous formats) to a target table (usually in tabular format), which is exactly the information needed for creating the training data. this gives us an opportunity to translate users' data integration code to training data preparation code. for relational data, the integration logic can be fully expressed in sql, which maps to relational algebra. then it is easy to generate code for training data creation process based on the relational algebra. first, all key and join key information are well maintained and can be directly retrieved. second, it is easy to identify which attributes of a table will always be processed similarly by analyzing the relational algebra expression, so that the values of these attributes in the same tuple can be grouped into a super cell. for example, by analyzing a query coded up for a data integration task such as ( , , ) ) . this output can be easily transformed into a base set of training data, into which the perturbations will be injected. however, because the integration code of open data, is usually written in an object-oriented language such as python, java, c++, the code after compilation is opaque to the system, and it is hard to modify the code directly. one solution is to map the integration code to an intermediate representation (ir), such as weld ir [ ] that is integrated with libraries like numpy and sparksql; and our proposed lachesis ir [ ] . such ir is usually a directed acyclic graph (dag), and can be reasoned by the system. in this dag, each node is an atomic computation, and each edge represents a data flow or a control flow from the source node to the destination node. the atomic computations useful to data integration workloads usually can be composed by three categories of operators: ( ) lambda abstraction functions such as a function that returns a literal (a constant numerical value or string), a member attribute or a member function from an object; unary functions such as exp, log, sqrt, sin, cos, tan, etc.. ( ) higher-order lambda composition functions such as binary operators: &&, ||, &, |, <,>, ==, +, -, * , /; conditional operator like condition? on_true:on_false; etc.. ( ) set-based operators such as scan and write that reads/writes a set of objects from/to the storage; map, join, aggregate, flatten, filter, etc.. we propose to modify existing intermediate representations, so that a super cell based processor can be derived from each atomic computation. we assume that each source dataset can be represented as pandas dataframes. then by traversing the ir graph, the system can understand the keys and the super cell mapping relationship. the super cell based processor of each of atomic computations transforms each super cell representation accordingly. for example, map operator that transforms a date cell " - - " to "oct , " as an example, the processor takes a super cell {"keys": [" - - ", "az", "us"], "attributes": ["date"], "cells":[" - - "]} as input, and outputs {"keys": [" - - ", "az", "us"], "attributes": ["date"], "cells":["oct , "]} so that the contextual relationship between "oct , " and its source key and attribute name is preserved. 
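the map-operator processor just described can be sketched as follows; the concrete dates and the reformat_date helper are illustrative stand-ins, not the authors' implementation.

```python
# a minimal sketch of a super-cell processor derived from a map operator, as in
# the date-reformatting example above; reformat_date is a hypothetical transform.
from datetime import datetime

def reformat_date(value):
    return datetime.strptime(value, "%Y-%m-%d").strftime("%b %d, %Y")

def map_processor(super_cell, attribute, fn):
    """apply fn to the named cell while preserving keys and attribute context."""
    out = {"keys": list(super_cell["keys"]),
           "attributes": list(super_cell["attributes"]),
           "cells": list(super_cell["cells"])}
    for i, attr in enumerate(out["attributes"]):
        if attr == attribute:
            out["cells"][i] = fn(out["cells"][i])
    return out

cell_in = {"keys": ["2020-10-03", "AZ", "US"], "attributes": ["date"],
           "cells": ["2020-10-03"]}
cell_out = map_processor(cell_in, "date", reformat_date)
# the write processor would then pair cell_out with its target position to form
# one (feature, label) training example.
print(cell_out)   # {'keys': [...], 'attributes': ['date'], 'cells': ['Oct 03, 2020']}
```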
the write's processor transforms each super cell into a f eature, label representation, such as {"source super cell":{"keys": [" - - ", "az", "us"], "attributes": ["date"], "cells":["oct , "]}, "target position": {"keys": ["oct , ", "arizona", "united states"], "attributes":["datetime"]}}. in this way, we can obtain training data automatically. however, the limitation of above approach is that it may not work if the input object is totally nested and opaque and cannot be represented as a set of cells like pandas dataframes or spark dataframes. for example, a corpus of totally unstructured text files, unavoidably requires human pre-processing efforts. thereby, we design following approaches to further alleviate the problem: model reusing and crowdsourcing. according to sec. v-a, if we are able to convert an unstructured dataset, e.g., a set of opaque and nested objects or a set of unstructured texts, into a pandas dataframe or similar structures, the code generation approach maybe applicable to automate the training data creation process. however, it is nontrivial to identify the parsing logic, perform such conversion and identify the keys. all these tasks are hard to automate. we consider crowdsourcing as a potential approach to alleviate the burden from the data scientists or domain experts for these tasks. however, based on our experiments of crowdsourcing key identification tasks to graduate students, and undergraduate students from an introductory database course, requesting to identify all keys. we find that the accuracy is merely . %. first, some of the datasets, particularly these scientific datasets, require domain-specific knowledge to tell the tuple identifier, because these attribute names are acronyms or terms that are not understandable to most people who are not in the domain, and usually datasets are not shipped with detailed explanations for each attribute. second, for large datasets, it is impossible for a person who are not familiar with the datasets to tell the keys. third, it is not easy to find a lot of people who has database knowledge. other tasks such as identifying super cells and parsing unstructured datasets are even more challenging for crowdsourcing platforms due to the expert knowledge required in nature. another approach to reduce human efforts involved in preparing training data is to reuse models for similar data integration tasks. for this purpose, we design and develop a system, called as modelhub, which searches for reusable models for a new data integration task by comparing the attributes of the target dataset (that is created by the programmer's initial data integration code) with the target dataset of each existing data integration models. we leverage locality sensitive hashing (lsh) based on minwise hash [ ] , [ ] for text-based data and lsh based on js-divergence [ ] , [ ] for numerical data to accelerate the attribute-matching process. another benefit of utilizing the lsh is that, in the modelhub platform, each model only needs to be uploaded with lsh signatures of the target dataset's attributes, while the target dataset does not need to be submitted, which saves the storage overhead and also addresses privacy concerns. we mainly answer following questions in this section: ( ) how effective is our proposed deep learning representation for different data integration tasks? ( ) how effective are the perturbations added to the training data for handling various types of schema changes? 
( ) how will different super cell granularities affect the accuracy, and the overheads for the training, testing, and assembling process? ( ) how will different model architectures (complex and large models vs. simple and compact models) affect the accuracy and latency for different types of data integration tasks? ( ) how will our approach of handling schema changes improve productivity and alleviate programmers' efforts? based on the proposed training data representation and training data perturbation methodology, we have created training data to train bi-lstm model, gpt- small model, and bert base model for two scenarios: coronavirus data integration and heterogeneous machine data integration. the first scenario mainly involves tabular source datasets in csv formats with aforementioned schema changes. however, the source datasets for the second scenario are mainly unstructured text data, in which most of the similar terms in different platforms are expressed very differently (e.g., cpu user time is logged as "cpu usage: . % user" in macos, "%cpu(s): . us" in ubuntu, and " %cpu %user" in android). model architectures. we compare three neural networks: bi-lstm, gpt- small with a cnn frontend classifier, and bert base with the same cnn frontend classifier. the model architectures are described in sec. iv. model training for the training process of both scenarios, bi-lstm is relatively slower in converging, requiring around epochs; while the models leveraging pre-trained gpt- small and bert base are much faster to converge, requiring only around epochs, as illustrated in fig. . metrics. we evaluate and compare the accuracy, the storage overhead, and the end-to-end training and inference latency, with all types of schema changes as mentioned in sec. iii applied at the inference stage. the accuracy of the data integration model is defined as the ratio of the number of super cells that has been predicted with correct target positions and aggregation actions to the total number of super cells in the testing data. hardware platform. for all experiments, if without specification, we use one nvidia tesla v gpu from google colab. all running times (e.g., training time, inference time) are measured as the average of multiple repeated runs. b. coronavirus data integration scenario ) experiment setup: we evaluate our system in a covid- data integration scenario that is close to the example in sec. i. we predict covid- trend using daily and regional information regarding the number of vaqarious cases and mobility factors. given a set of raw data sources, we need to create a -dimensional target dataset on daily basis. in the target dataset, each row represents coronavirus and mobility information for a state/province on the specific date, and each column represents the state, country, number of confirmed cases, recovery cases, death cases, and the mobility factors regarding workplace, grocery, transit, etc.. the target dataset can be used as inputs to various curve-fitting techniques [ ] , [ ] for covid- prediction. datasets. we assume the user specifies/recommends a small set of initial data sources. for the first scenario, the user specifies the john hopkins university's covid- github repository [ ] and google mobility data [ ] . the statistics about the above source tables are illustrated in tab. i. the jhu dataset contains files with each file representing covid- statistics on a specific date. these files have tens of versions, growing from rows and attributes to rows and columns. perturbations. 
we add perturbations such as random changes to attribute names and values, and replacing attribute names and value tokens by synonyms as described in sec. iii to . % of the attributes in the training data. in addition, we add key expansion changes, which accounts for . % of the rows in the training data. we test the model using jhu-covid- data and google mobility data collected from feb , to oct , , as illustrated in tab. i. ) overall results: the overall results are illustrated in tab. ii, which show that employing a complex transformer like the pre-trained gpt- small and bert base, we can achieve better accuracy, though more complicated models require significantly more storage space and computational time for training one epoch or inference. the results also show that with the increase of the granularity of super cells , the required training and testing time will be significantly reduced, while the accuracy will decrease. ) ablation study: using the bi-lstm model with singlecell representation, we also conducted detailed ablation study as illustrated in tab. iii. it shows that handling value format changes (e.g., date format change like - - and ; and different abbreviations of region and subdistricts like az and arizona.) is a main factor for accuracy degradation. using a customized synonymous dictionary to encode these format changes for adding perturbations to the training data can greatly improve the accuracy compared with using synonyms extracted from google knowledge graph, as illustrated in tab. iv. in addition, we also find that using character-based embedding can significantly outperform wordbased embedding, as illustrated in tab. v. we developed the data integration code using python and pandas dataframe to integrate the jhu covid- data collected on feb , and the time-series google mobility data. after feb , , the first schema evolution of the jhu covid- data schema that breaks the integration code and causes system downtime, happened on mar , . we invite an experienced software engineer, a ph.d student, and an undergraduate student to develop the revisions respectively and ask them to deliver the task as soon as possible. we record the time between the task assignment and code submission, as well as the time they dedicated to fixing the issue as they reported. we find that although the reported dedicated time ranges from to minutes; the time between the task assignment and code submission ranges from one to three days. this example illustrates the unpredictability of human resources. in contrast, our proposed data integration pipeline can smoothly handle schema changes without any interruptions or delays, and requires no human intervention at all. performance of python-based integration code. we run our python-based and human-coded data integration pipeline on the aforementioned daily jhu covid- data and google mobility data in a c .xlarge aws instance that has four cpus and eight gigabytes memory, and it takes seconds of time on average to integrate data for one day, without considering the time required to fix the pipeline for schema changes. % of the time is spent on removing redundant county-level statistics from the relatively large google mobility file that has . millions of tuples. otherwise the co-existing state-level and county-level statistics in the google mobility file will confuse the join processing. 
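the pre-join cleaning step mentioned above (dropping county-level rows from the google mobility file so that they do not confuse a state-level join) might look like the pandas sketch below; the file and column names are assumptions about the mobility csv layout rather than the authors' code.

```python
# a minimal pandas sketch of the pre-join filtering discussed above: keep only
# state/province-level rows of the mobility file before joining with the JHU
# state-level table. file and column names are assumptions, not the paper's code.
import pandas as pd

mobility = pd.read_csv("Global_Mobility_Report.csv", low_memory=False)

# state-level rows have a sub_region_1 value but no sub_region_2 (county) value
state_level = mobility[mobility["sub_region_1"].notna()
                       & mobility["sub_region_2"].isna()]

jhu = pd.read_csv("daily_report.csv")
merged = jhu.merge(
    state_level,
    left_on=["Country_Region", "Province_State"],
    right_on=["country_region", "sub_region_1"],
    how="left",
)
```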
this observation indicates that with the acceleration of high-end gpu processor, the overall training and inference latency of using a deep learning based pipeline is lower than using the traditional human-centered pipeline. considering that the training process only needs to be carried out at the beginning and when a concept drift [ ] is detected. c. machine log integration ) environment setup: suppose a lab administrator developed a python tool to integrate various performance metrics of a cluster of macos workstations, such as cpu utilization (user, system, idle, wait), memory utilization (cached, buffered, swap), network utilization (input, output), disk utilization (write, read), and so on. the tool collects these metrics by periodically reading the output of an omnipresent shell tool "top" and then perform a union operation for time-series metrics collected from each machine. now the lab purchased four ubuntu linux servers. however, because the "top" tool's output in ubuntu is very different from macos, the python tool cannot work with these new linux machines without additional coding efforts. such problem is prevalent in machine or sensor data integration, where different devices produced by different manufacturers may use different schemas to describe similar information. ) overall results: the results are illustrated in tab. vi and tab. vii, showing that our approach can achieve acceptable accuracy. particularly, the transformer models can achieve significantly better accuracy than the bi-lstm model. for this case, with the increase in super cell granularity (i.e., decrease in number of super cells per target tuple), the accuracy of the bi-lstm network is improved, while the accuracy of the transformer-based models is slightly degraded. the transformer-based models can achieve significantly better accuracy, while the computational time required for training (per epoch) and inference is significantly higher. also the larger of the super cell granularity, the fewer number of training and testing samples. therefore, the time required for training and testing is also significantly reduced with the increase in the super cell granularity. in this section, we discuss the process of assembling prediction results into tabular files. we mainly measure how the sizes of source datasets, target datasets, and granularity of super cells affect the overall latency of the assembling process. the results are illustrated in fig. , which show that increasing super cell granularity will significantly reduce the assembling latency. it indicates that if storage space is not the bottleneck, using a transformer-based model and the largest possible super cell granularity will achieve acceptable accuracy while significantly reducing the computational time required for training, inferences, and assembling. schema evolution in relational database, xml, json and ontology has been an active research area for a long time [ ] , [ ] . one major approach is through model (schema) management [ ] , [ ] and to automatically generate executable mapping between the old and evolved schema [ ] , [ ] , [ ] . while this approach greatly expands the theoretical foundation of relational schema evolution, it requires application maintenance and may cause undesirable system downtimes [ ] . to address the problem, prism [ ] is proposed to automate the end-to-end schema modification process by providing dbas a schema modification language (smo) and automatically rewriting users' legacy queries. 
however, prism requires data migration to the latest schema for each schema evolution, which may not be practical for today's big data era. other techniques include versioning [ ] , [ ] , [ ] , which avoids the data migration overhead, but incurs version management burden and significantly slows down query performance. there are also abundant works discussing about the schema evolution problem in nosql databases, polystore or multi-model databases [ ] , [ ] , [ ] , [ ] most of these works are mainly targeting at enterprise data integration problems and require that each source dataset is managed by a relational or non-relational data store. however the open data sources widely used by today's data science applications are often unmanaged, and thus lack schemas or metadata information [ ] . a deep learning model, once trained, can handle most schema evolution without any human intervention, and does not require any data migration, or version management overhead. moreover, today's data science applications are more tolerant to data errors compared to traditional enterprise transaction applications, which makes a deep learning approach promising. data discovery is to find related tables in a data lake. aurum [ ] is an automatic data discovery system that proposes to build enterprise knowledge graph (ekg) to solve real-world business data integration problems. in ekg, a node represents a set of attributes/columns, and an edge connects two similar nodes. in addition, a hyperedge connects any number of nodes that are hierarchically related. they propose a two-step approach to build ekg using lsh-based and tfidf-based signatures. they also provide a data discovery query language srql so that users can efficiently query the relationships among datasets. aurum [ ] is mainly targeting at enterprise data integration. in recent, numerous works are proposed to address open data discovery problems, including automatically discover table unionability [ ] and joinability [ ] , [ ] , based on lsh and similarity measures. nargesian and et al. [ ] propose a markov approach to optimize the navigation organization as a dag for a data lake so that the probability of finding a table by any of attributes can be maximized. in the dag, each node of navigation dag represents a subset of the attributes in the data lake, and an edge represents a navigation transition. all of these works provide helpful insights from an algorithmatic perspective and system perspective for general data discovery problems. particularly, fernandez and et al. [ ] proposes a semantic matcher based on word embeddings to discover semantic links in the ekg. our work has a potential to integrate data discovery and schema matching into a deep learning model inference process. we argue that in our targeting scenario, the approach we propose can save significant storage overhead as we only need store data integration models which are significantly smaller than the ekg, and can also achieve better performance for wide and sparse tables. we will prove in the paper that the training data generation and labeling process can be fully automated. traditionally, to solve the data integration problem for data science applications, once related datasets are discovered, the programmer will either manually design queries to integrate these datasets, or leverage a schema matching tool to automatically discover queries to perform the data integration. 
there are numerous prior-arts in schema matching [ ] , [ ] , [ ] , [ ] , which mainly match schemas based on metadata (e.g., attribute name) and/or instances. entity matching (em) [ ] , which is to identify data instances that refer to the same real-world entity, is also related. some em works also employ a deep learning-based approach [ ] , [ ] , [ ] , [ ] , [ ] , [ ] , [ ] . mudgal and et al. [ ] evaluates and compares the performance of different deep learning models applied to em with three types of data: structured data, textual data, and dirty data (with missing value, inconsistent attributes and/or miss-placed values). they find that deep learning doesn't outperform existing em solutions on structured data, but it outperforms them on textual and dirty data. in addition, to apply schema matching to heterogeneous data sources, it is important to discover schemas from semistructured or non-structured data. we proposed a schema discovery mechanism for json data [ ] , among other related works [ ] , [ ] . our approach proposes a super cell data model to unify open datasets. we train deep learning models to learn the mappings between the data items in source datasets and their positions as well as aggregation modes in the target table. if we see the context of a super cell in the source as an entity, and the target position of the super cell as another entity, the problem we study in this work shares some similarity with the entity matching problem. the distinction is that the equivalence of two "entities" in our problem is determined by users' data integration logic, while general entity matching problem does not have such constraints. thirumuruganathan and et al. [ ] discuss various representations for learning tasks in relational data curation. cappuzzo and et al. [ ] further propose an algorithm for obtaining local embeddings using a tripartite-graph-based representation for data integration tasks such as schema matching, and entity matching on relational database. we are mainly targeting at open data in csv, json and text format and choose to use a super cell based representation. these works can be leveraged to improve the super cell representation and corresponding embeddings proposed in this work. in this work, we propose an end-to-end approach based on deep learning for periodical extraction of user expected tables from fast evolving data sources of open datasets. we further propose a relatively stable super cell based representation to embody the fast-evolving source data and to train models that are robust to schema changes by automatically injecting schema changes (e.g., dimension pivoting, attribute name changes, attribute addition/removal, key expansion/contraction, etc.) to the training data. we formalize the problem and conduct experiments on integration of open covid- data and machine log data. the results show that our proposed approach can achieve acceptable accuracy. in addition, by applying our proposed approach, the system will not be easily interrupted by schema changes and no human intervention is required for handling most of the schema changes. caltech covid- modeling covid- data repository by the center for systems science and engineering (csse) at johns hopkins university harvard covid- data: county age&sex with ann national oceanic and atmospheric administration the seattle report on database research a semantic approach to discovering schema mapping expressions applying model management to classical meta data problems model management . 
: manipulating richer mappings data warehouse scenarios for model management a plea for standards in reporting data collected by animal-borne electronic devices creating embeddings of heterogeneous relational datasets for data integration tasks locality-sensitive hashing for f-divergences: mutual information loss and beyond data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection automating the database schema evolution process graceful database schema evolution: the prism workbench locality-sensitive hashing scheme based on p-stable distributions pre-training of deep bidirectional transformers for language understanding alex: an updatable adaptive learned index automatic generation of normalized relational schemas from nested key-value data semantic integration research in the database community: a brief survey. ai magazine deeper-deep entity resolution clio: schema mapping creation and data exchange aurum: a data discovery system seeping semantics: linking datasets using word embeddings for data discovery domain-adversarial training of neural networks schema mapping discovery from data instances clio: a semi-automatic tool for schema mapping codel-a relationally complete language for database evolution living in parallel realities: co-existing schema versions with a bidirectional database evolution language migcast: putting a price tag on data model evolution in nosql data stores learning a partitioning advisor for cloud databases evolution management of multi-model data data migration low-resource deep entity resolution with transfer and active learning a collective, probabilistic approach to schema mapping using diverse noisy evidence learned cardinalities: estimating correlated joins with deep learning ontology versioning on the semantic web evolution management of multi-model data toward building entity matching management systems the case for learned index structures learning to optimize join queries with deep reinforcement learning adversarial machine learning at scale deep learning. 
nature qtune: a query-aware database tuning system with deep reinforcement learning s jsd-lsh: a locality-sensitive hashing schema for probability distributions a comprehensive benchmark framework for active learning methods in entity matching open data integration schema mapping as query discovery nose: schema design for nosql applications query rewriting for continuously evolving nosql databases managing and querying transaction-time databases under schema evolution prima: archiving and querying historical data with evolving schemas scalable architecture and query optimization fortransaction-time dbs with evolving schemas deep learning for entity matching: a design space exploration organizing data lakes for navigation table union search on open data weld: a common runtime for high performance data analytics mapping xml and relational schemas with clio language models are unsupervised multitask learners an online bibliography on schema evolution managing schema evolution in nosql data stores deep learning in neural networks: an overview discovering queries based on example tuples acm sigmod international conference on management of data non-blocking lazy schema changes in multi-version database management systems learning from simulated and unsupervised images through adversarial training data integration: the current status and the way forward nosql schema evolution and data migration: state-of-the-art and opportunities using probabilistic reasoning to automate software tuning schema mappings and data examples reuse and adaptation for entity resolution through transfer learning data curation with deep learning the problem of concept drift: definitions and related work automatic database management system tuning through large-scale machine learning preserving mapping consistency under schema changes schema management for document stores ed-join: an efficient algorithm for similarity joins with edit distance constraints semantic adaptation of schema mappings when schemas evolve an end-to-end automatic cloud database tuning system using deep reinforcement learning auto-em: end-to-end fuzzy entity-matching using pre-trained deep models and transfer learning josie: overlap set similarity search for finding joinable tables in data lakes auto-join: joining tables by leveraging transformations lsh ensemble: internetscale domain search lachesis: automated generation of persistent partitionings for big data applications key: cord- -syirijql authors: adiga, aniruddha; chen, jiangzhuo; marathe, madhav; mortveit, henning; venkatramanan, srinivasan; vullikanti, anil title: data-driven modeling for different stages of pandemic response date: - - journal: arxiv doi: nan sha: doc_id: cord_uid: syirijql some of the key questions of interest during the covid- pandemic (and all outbreaks) include: where did the disease start, how is it spreading, who is at risk, and how to control the spread. there are a large number of complex factors driving the spread of pandemics, and, as a result, multiple modeling techniques play an increasingly important role in shaping public policy and decision making. as different countries and regions go through phases of the pandemic, the questions and data availability also changes. especially of interest is aligning model development and data collection to support response efforts at each stage of the pandemic. 
the covid- pandemic has been unprecedented in terms of real-time collection and dissemination of a number of diverse datasets, ranging from disease outcomes, to mobility, behaviors, and socio-economic factors. the data sets have been critical from the perspective of disease modeling and analytics to support policymakers in real-time. in this overview article, we survey the data landscape around covid- , with a focus on how such datasets have aided modeling and response through different stages so far in the pandemic. we also discuss some of the current challenges and the needs that will arise as we plan our way out of the pandemic. as the sars-cov- pandemic has demonstrated, the spread of a highly infectious disease is a complex dynamical process. a large number of factors are at play as infectious diseases spread, including variable individual susceptibility to the pathogen (e.g., by age and health conditions), variable individual behaviors (e.g., compliance with social distancing and the use of masks), differing response strategies implemented by governments (e.g., school and workplace closure policies and criteria for testing), and potential availability of pharmaceutical interventions. governments have been forced to respond to the rapidly changing dynamics of the pandemic, and are becoming increasingly reliant on different modeling and analytical techniques to understand, forecast, plan and respond; this includes statistical methods and decision support methods using multi-agent models, such as: (i) forecasting epidemic outcomes (e.g., case counts, mortality and hospital demands), using a diverse set of data-driven methods e.g., arima type time series forecasting, bayesian techniques and deep learning, e.g., [ ] [ ] [ ] [ ] [ ] , (ii) disease surveillance, e.g., [ , ] , and (iii) counter-factual analysis of epidemics using multi-agent models, e.g., [ ] [ ] [ ] [ ] [ ] [ ] ; indeed, the results of [ , ] were very influential in the early decisions for lockdowns in a number of countries. the specific questions of interest change with the stage of the pandemic. in the pre-pandemic stage, the focus was on understanding how the outbreak started, epidemic parameters, and the risk of importation to different regions. once outbreaks started-the acceleration stage, the focus is on determining the growth rates, the differences in spatio-temporal characteristics, and testing bias. in the mitigation stage, the questions are focused on non-prophylactic interventions, such as school and work place closures and other social-distancing strategies, determining the demand for healthcare resources, and testing and tracing. in the suppression stage, the focus shifts to using prophylactic interventions, combined with better tracing. these phases are not linear, and overlap with each other. for instance, the acceleration and mitigation stages of the pandemic might overlap spatially, temporally as well as within certain social groups. different kinds of models are appropriate at different stages, and for addressing different kinds of questions. for instance, statistical and machine learning models are very useful in forecasting and short term projections. however, they are not very effective for longer-term projections, understanding the effects of different kinds of interventions, and counter-factual analysis. mechanistic models are very useful for such questions. 
simple compartmental type models, and their extensions, namely, structured metapopulation models are useful for several population level questions. however, once the outbreak has spread, and complex individual and community level behaviors are at play, multi-agent models are most effective, since they allow for a more systematic representation of complex social interactions, individual and collective behavioral adaptation and public policies. as with any mathematical modeling effort, data plays a big role in the utility of such models. till recently, data on infectious diseases was very hard to obtain due to various issues, such as privacy and sensitivity of the data (since it is information about individual health), and logistics of collecting such data. the data landscape during the sars-cov- pandemic has been very different: a large number of datasets are becoming available, ranging from disease outcomes (e.g., time series of the number of confirmed cases, deaths, and hospitalizations), some characteristics of their locations and demographics, healthcare infrastructure capacity (e.g., number of icu beds, number of healthcare personnel, and ventilators), and various kinds of behaviors (e.g., level of social distancing, usage of ppes); see [ ] [ ] [ ] for comprehensive surveys on available datasets. however, using these datasets for developing good models, and addressing important public health questions remains challenging. the goal of this article is to use the widely accepted stages of a pandemic as a guiding framework to highlight a few important problems that require attention in each of these stages. we will aim to provide a succinct model-agnostic formulation while identifying the key datasets needed, how they can be used, and the challenges arising in that process. we will also use sars-cov- as a case study unfolding in real-time, and highlight some interesting peer-reviewed and preprint literature that pertains to each of these problems. an important point to note is the necessity of randomly sampled data, e.g. data needed to assess the number of active cases and various demographics of individuals that were affected. census provides an excellent rationale. it is the only way one can develop rigorous estimates of various epidemiologically relevant quantities. there have been numerous surveys on the different types of datasets available for sars-cov- , e.g., [ ] [ ] [ ] [ ] , as well as different kinds of modeling approaches. however, they do not describe how these models become relevant through the phases of pandemic response. an earlier similar attempt to summarize such responsedriven modeling efforts can be found in [ ] , based on the -h n experience, this paper builds on their work and discusses these phases in the present context and the sars-cov- pandemic. although the paper touches upon different aspects of model-based decision making, we refer the readers to a companion article in the same special issue [ ] for a focused review of models used for projection and forecasting. multiple organizations including cdc and who have their frameworks for preparing and planning response to a pandemic. for instance, the pandemic intervals framework from cdc describes the stages in the context of an influenza pandemic; these are illustrated in figure . these six stages span investigation, recognition and initiation in the early phase, followed by most of the disease spread occurring during the acceleration and deceleration stages. 
they also provide indicators for identifying when the pandemic has progressed from one stage to the next [ ] . as envisioned, risk evaluation (i.e., using tools like the influenza risk assessment tool (irat) and the pandemic severity assessment framework (psaf)) and early case identification characterize the first three stages, while non-pharmaceutical interventions (npis) and available therapeutics become central to the acceleration stage.
[figure: cdc pandemic intervals framework and who phases for influenza pandemic]
the deceleration is facilitated by mass vaccination programs, exhaustion of the susceptible population, or unsuitability of environmental conditions (such as weather). a similar framework is laid out in who's pandemic continuum and phases of pandemic alert. while such frameworks aid in streamlining the response efforts of these organizations, they also enable effective messaging. to the best of our knowledge, there has not been a similar characterization of the mathematical modeling efforts that go hand in hand with supporting the response. for summarizing the key models, we consider four of the stages of pandemic response mentioned in section : pre-pandemic, acceleration, mitigation and suppression. here we provide the key problems in each stage, the datasets needed, the main tools and techniques used, and pertinent challenges. we structure our discussion based on our experience with modeling the spread of covid-19 in the us, done in collaboration with local and federal agencies.
• acceleration (section ): this stage is relevant once the epidemic takes root within a country. there is usually a big lag in surveillance and response efforts, and the key questions are to model spread patterns at different spatio-temporal scales, and to derive short-term forecasts and projections. a broad class of datasets is used for developing models, including mobility, populations, land-use, and activities. these are combined with various kinds of time series data and covariates such as weather for forecasting.
• mitigation (section ): in this stage, different interventions, which are mostly non-pharmaceutical in the case of a novel pathogen, are implemented by government agencies once the outbreak has taken hold within the population. this stage involves understanding the impact of interventions on case counts and health infrastructure demands, taking individual behaviors into account. the additional datasets needed in this stage include those on behavioral changes and hospital capacities.
• suppression (section ): this stage involves designing methods to control the outbreak through contact tracing & isolation and vaccination. data on contact tracing, associated biases, vaccine production schedules, and compliance & hesitancy are needed in this stage.
figure gives an overview of this framework and summarizes the data needs in these stages. these stages also align well with the focus of the various modeling working groups organized by cdc, which include epidemic parameter estimation, international spread risk, sub-national spread forecasting, impact of interventions, healthcare systems, and university modeling. in reality, one should note that these stages may overlap, and may vary based on geographical factors and response efforts. moreover, specific problems can be approached prospectively in earlier stages, or retrospectively during later stages. this framework is thus meant to be more conceptual than interpreted along a linear timeline.
results from such stages are very useful for policymakers to guide real-time response. consider a novel pathogen emerging in human populations that is detected through early cases involving unusual symptoms or unknown etiology. such outbreaks are characterized by some kind of spillover event, mostly through zoonotic means, like in the case of covid- or past influenza pandemics (e.g., swine flu and avian flu). a similar scenario can occur when an incidence of a well-documented disease with no known vaccine or therapeutics emerges in some part of the world, causing severe outcomes or fatalities (e.g., ebola and zika.) regardless of the development status of the country where the pathogen emerged, such outbreaks now contains the risk of causing a worldwide pandemic due to the global connectivity induced by human travel. two questions become relevant at this stage: what are the epidemiological attributes of this disease, and what are the risks of importation to a different country? while the first question involves biological and clinical investigations, the latter is more related with societal and environmental factors. one of the crucial tasks during early disease investigation is to ascertain the transmission and severity of the disease. these are important dimensions along which the pandemic potential is characterized because together they determine the overall disease burden, as demonstrated within the pandemic severity assessment framework [ ] . in addition to risk assessment for right-sizing response, they are integral to developing meaningful disease models. formulation let Θ = {θ t , θ s } represent the transmission and severity parameters of interest. they can be further subdivided into sojourn time parameters θ δ · and transition probability parameters θ p · . here Θ corresponds to a continuous time markov chain (ctmc) on the disease states. the problem formulation can be represented as follows: given Π(Θ), the prior distribution on the disease parameters and a dataset d, estimate the posterior distribution p(Θ|d) over all possible values of Θ. in a model-specific form, this can be expressed as p(Θ|d, m) where m is a statistical, compartmental or agent-based disease model. in order to estimate the disease parameters sufficiently, line lists for individual confirmed cases is ideal. such datasets contain, for each record, the date of confirmation, possible date of onset, severity (hospitalization/icu) status, and date of recovery/discharge/death. furthermore, age-and demographic/comorbidity information allow development of models that are age-and risk group stratified. one such crowdsourced line list was compiled during the early stages of covid- [ ] and later released by cdc for us cases [ ] . data from detailed clinical investigations from other countries such as china, south korea, and singapore was also used to parameterize these models [ ] . in the absence of such datasets, past parameter estimates of similar diseases (e.g., sars, mers) were used for early analyses. modeling approaches for a model agnostic approach, the delays and probabilities are obtained by various techniques, including bayesian and ordinary least squares fitting to various delay distributions. for a particular disease model, these are estimated through model calibration techniques such as mcmc and particle filtering approaches. a summary of community estimates of various disease parameters is provided at https://github.com/midas-network/covid- . 
further such estimates allow the design of pandemic planning scenarios varying in levels of impact, as seen in the cdc scenarios page . see [ ] [ ] [ ] for methods and results related to estimating covid- disease parameters from real data. current models use a large set of disease parameters for modeling covid- dynamics; they can be broadly classified as transmission parameters and hospital resource parameters. for instance in our work, we currently use parameters (with explanations) shown in table . challenges often these parameters are model specific, and hence one needs to be careful when reusing parameter estimates from literature. they are related but not identifiable with respect to population level measures such as basic reproductive number r (or effective reproductive number r eff ) and doubling time which allow tracking the rate of epidemic growth. also the estimation is hindered by inherent biases in case ascertainment rate, reporting delays and other gaps in the surveillance system. aligning different data streams (e.g., outpatient surveillance, hospitalization rates, mortality records) is in itself challenging. when a disease outbreak occurs in some part of the world, it is imperative for most countries to estimate their risk of importation through spatial proximity or international travel. such measures are incredibly valuable in setting a timeline for preparation efforts, and initiating health checks at the borders. over centuries, pandemics have spread faster and faster across the globe, making it all the more important to characterize this risk as early as possible. formulation let c be the set of countries, and g = {c, e} an international network, where edges (often weighted and directed) in e represent some notion of connectivity. the importation risk problem can be formulated as below: given c o ∈ c the country of origin with an initial case at time , and c i the country of interest, using g, estimate the expected time taken t i for the first cases to arrive in country c i . in its probabilistic form, the same can be expressed as estimating the probability p i (t) of seeing the first case in country c i by time t. data needs assuming we have initial case reports from the origin country, the first data needed is a network that connects the countries of the world to represent human travel. the most common source of such information is the airline network datasets, from sources such as iata, oag, and openflights; [ ] provides a systematic review of how airline passenger data has been used for infectious disease modeling. these datasets could either capture static measures such as number of seats available or flight schedules, or a dynamic count of passengers per month along each itinerary. since the latter has intrinsic delays in collection and reporting, for an ongoing pandemic they may not be representative. during such times, data on ongoing travel restrictions [ ] become important to incorporate. multi-modal traffic will also be important to incorporate for countries that share land borders or have heavy maritime traffic. for diseases such as zika, where establishment risk is more relevant, data on vector abundance or prevailing weather conditions are appropriate. modeling approaches simple structural measures on networks (such as degree, pagerank) could provide static indicators of vulnerability of countries. 
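as a small illustration of such static indicators, the sketch below ranks a handful of countries on a toy weighted travel network by pagerank and weighted in-degree; the country codes and edge weights are invented and merely stand in for iata/oag-style passenger volumes.

# minimal sketch: rank countries by static connectivity indicators on a toy
# directed, weighted travel network; all nodes and weights are illustrative.
import networkx as nx

g = nx.DiGraph()
g.add_weighted_edges_from([
    ("CN", "US", 900.0), ("CN", "DE", 500.0), ("CN", "KE", 40.0),
    ("US", "DE", 700.0), ("DE", "KE", 120.0),
])

pagerank = nx.pagerank(g, weight="weight")          # global importance in the travel network
in_strength = dict(g.in_degree(weight="weight"))    # total incoming traffic volume

for country in sorted(g.nodes, key=pagerank.get, reverse=True):
    print(country, round(pagerank[country], 3), in_strength.get(country, 0.0))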
by transforming the weighted, directed edges into probabilities, one can use simple contagion models (e.g., independent cascades) to simulate disease spread and empirically estimate the expected time of arrival. global metapopulation models (gleam) that combine seir type dynamics with an airline network have also been used in the past for estimating importation risk. brockmann and helbing [ ] used a similar framework to quantify an effective distance on the network, which was found to be well correlated with the time of arrival for multiple past pandemics; this has been extended to covid-19 [ , ] . in [ ] , the authors employ air travel volume obtained through iata from ten major cities across china to rank various countries along with the idvi to convey their vulnerability. [ ] consider the task of forecasting international and domestic spread of covid-19 and employ official airline group (oag) data for determining air traffic to various countries, and [ ] fit a generalized linear model for the observed number of cases in various countries as a function of air traffic volume obtained from oag data to determine countries with potential risk of under-detection. also, [ ] provide an africa-specific case study of vulnerability and preparedness using data from the civil aviation administration of china. challenges note that the arrival of an infected traveler will precede a local transmission event in a country. hence the former is more appropriate to quantify in early stages. also, the formulation is agnostic to whether it is the first infected arrival or the first detected case. however, in the real world, the former is difficult to observe, while the latter is influenced by security measures at ports of entry (land, sea, air) and the ease of identification for the pathogen. for instance, in the case of covid-19, the long incubation period and the high likelihood of asymptomaticity could have resulted in many infected travelers being missed by health checks at ports of entry. we also noticed potential administrative delays in reporting by multiple countries fearing travel restrictions. as the epidemic takes root within a country, it may enter the acceleration phase. depending on the testing infrastructure and the agility of the surveillance system, response efforts might lag or lead the rapid growth in case rate. under such a scenario, two crucial questions emerge that pertain to how the disease may spread spatially/socially and how the case rate may grow over time. within the country, there is a need to model the spatial spread of the disease at different scales: state, county, and community levels. similar to the importation risk, such models may provide an estimate of when cases may emerge in different parts of the country. when coupled with vulnerability indicators (socioeconomic, demographic, co-morbidities) they provide a framework for assessing the heterogeneous impact the disease may have across the country. detailed agent-based models for urban centers may help identify hotspots and potential case clusters that may emerge (e.g., correctional facilities, nursing homes, food processing plants, etc. in the case of covid-19). formulation given a population representation p at an appropriate scale and a disease model m per entity (individual or sub-region), model the disease spread under different assumptions of underlying connectivity c and disease parameters Θ. the result will be a spatio-temporal spread model that produces z_{s,t}, the time series of disease states over time for region s.
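a minimal, purely illustrative sketch of a model producing such a z_{s,t} is given below: a discrete-time metapopulation seir over three regions coupled through an assumed mixing matrix, with uncalibrated rate parameters.

# minimal sketch: discrete-time metapopulation SEIR over 3 regions coupled by a
# row-stochastic mixing matrix; all parameters and populations are illustrative.
import numpy as np

beta, sigma, gamma = 0.3, 1 / 5.0, 1 / 7.0      # transmission, incubation, recovery rates
M = np.array([[0.90, 0.05, 0.05],               # fraction of contacts region s makes in region j
              [0.10, 0.85, 0.05],
              [0.05, 0.05, 0.90]])
N = np.array([1e6, 5e5, 2e5])
S, E, I, R = N - np.array([10, 0, 0]), np.zeros(3), np.array([10.0, 0.0, 0.0]), np.zeros(3)

Z = []                                          # Z[t][s] = new infectious individuals in region s on day t
for t in range(120):
    force = beta * (M @ (I / N))                # force of infection felt in each region
    new_E, new_I, new_R = force * S, sigma * E, gamma * I
    S, E, I, R = S - new_E, E + new_E - new_I, I + new_I - new_R, R + new_R
    Z.append(new_I.copy())

Z = np.array(Z)
print("peak daily incidence per region:", Z.max(axis=0).round(1))

in practice the mixing matrix would come from commuter or airline data and the rate parameters from the calibration step described earlier.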
data needs some of the common datasets needed by most modeling approaches include: ( ) social and spatial representation, which includes census, and population data, which are available from census departments (see, e.g., [ ] ), and landscan [ ] , ( ) connectivity between regions (commuter, airline, road/rail/river), e.g., [ , ] , ( ) data on locations, including points of interest, e.g., openstreetmap [ ] , and ( ) activity data, e.g., the american time use survey [ ] . these datasets help capture where people reside and how they move around, and come in contact with each other. while some of these are static, more dynamic measures, such as from gps traces, become relevant as individuals change their behavior during a pandemic. modeling approaches different kinds of structured metapopulation models [ , [ ] [ ] [ ] [ ] , and agent based models [ ] [ ] [ ] [ ] [ ] have been used in the past to model the sub-national spread; we refer to [ , , ] for surveys on different modeling approaches. these models incorporate typical mixing patterns, which result from detailed activities and co-location (in the case of agent based models), and different modes of travel and commuting (in the case of metapopulation models). challenges while metapopulation models can be built relatively rapidly, agent based models are much harder-the datasets need to be assembled at a large scale, with detailed construction pipelines, see, e.g., [ ] [ ] [ ] [ ] [ ] . since detailed individual activities drive the dynamics in agent based models, schools and workplaces have to be modeled, in order to make predictions meaningful. such models will get reused at different stages of the outbreak, so they need to be generic enough to incorporate dynamically evolving disease information. finally, a common challenge across modeling paradigms is the ability to calibrate to the dynamically evolving spatio-temporal data from the outbreak-this is especially challenging in the presence of reporting biases and data insufficiency issues. given the early growth of cases within the country (or sub-region), there is need for quantifying the rate of increase in comparable terms across the duration of the outbreak (accounting for the exponential nature of such processes). these estimates also serve as references, when evaluating the impact of various interventions. as an extension, such methods and more sophisticated time series methods can be used to produce short-term forecasts for disease evolution. formulation given the disease time series data within the country z s,t until data horizon t , provide scale-independent growth rate measures g s (t ), and forecastsẐ s,u for u ∈ [t, t + ∆t ], where ∆t is the forecast horizon. data needs models at this stage require datasets such as ( ) time series data on different kinds of disease outcomes, including case counts, mortality, hospitalizations, along with attributes, such as age, gender and location, e.g., [ ] [ ] [ ] [ ] [ ] , ( ) any associated data for reporting bias (total tests, test positivity rate) [ ] , which need to be incorporated into the models, as these biases can have a significant impact on the dynamics, and ( ) exogenous regressors (mobility, weather), which have been shown to have a significant impact on other diseases, such as influenza, e.g., [ ] . modeling approaches even before building statistical or mechanistic time series forecasting methods, one can derive insights through analytical measures of the time series data. 
for instance, the effective reproductive number, estimated from the time series [ ] can serve as a scale-independent metric to compare the outbreaks across space and time. additionally multiple statistical methods ranging from autoregressive models to deep learning techniques can be applied to the time series data, with additional exogenous variables as input. while such methods perform reasonably for short-term targets, mechanistic approaches as described earlier can provide better long-term projections. various ensembling techniques have also been developed in the recent past to combine such multi-model forecasts to provide a single robust forecast with better uncertainty quantification. one such effort that combines more than methods for covid- can be found at the covid forecasting hub . we also point to the companion paper for more details on projection and forecasting models. challenges data on epidemic outcomes usually has a lot of uncertainties and errors, including missing data, collection bias, and backfill. for forecasting tasks, these time series data need to be near real-time, else one needs to do both nowcasting, as well as forecasting. other exogenous regressors can provide valuable lead time, due to inherent delays in disease dynamics from exposure to case identification. such frameworks need to be generalized to accommodate qualitative inputs on future policies (shutdowns, mask mandates, etc.), as well as behaviors, as we discuss in the next section. once the outbreak has taken hold within the population, local, state and national governments attempt to mitigate and control its spread by considering different kinds of interventions. unfortunately, as the covid- pandemic has shown, there is a significant delay in the time taken by governments to respond. as a result, this has caused a large number of cases, a fraction of which lead to hospitalizations. two key questions in this stage are: ( ) how to evaluate different kinds of interventions, and choose the most effective ones, and ( ) how to estimate the healthcare infrastructure demand, and how to mitigate it. the effectiveness of an intervention (e.g., social distancing) depends on how individuals respond to them, and the level of compliance. the health resource demand depends on the specific interventions which are implemented. as a result, both these questions are connected, and require models which incorporate appropriate behavioral responses. in the initial stages, only non-prophylactic interventions are available, such as: social distancing, school and workplace closures, and use of ppes, since no vaccinations and anti-virals are available. as mentioned above, such analyses are almost entirely model based, and the specific model depends on the nature of the intervention and the population being studied. formulation given a model, denoted abstractly as m, the general goals are ( ) to evaluate the impact of an intervention (e.g., school and workplace closure, and other social distancing strategies) on different epidemic outcomes (e.g., average outbreak size, peak size, and time to peak), and ( ) find the most effective intervention from a suite of interventions, with given resource constraints. the specific formulation depends crucially on the model and type of intervention. 
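as a toy version of such a counter-factual exercise, the sketch below compares attack rate and peak prevalence under different contact-reduction levels in a single-population seir; the parameters, intervention start day and reduction factors are assumptions chosen only for illustration.

# minimal sketch: compare epidemic outcomes under different social-distancing
# strengths (fractional reduction of the transmission rate); illustrative only.
import numpy as np

def seir_outcomes(reduction, start_day=30, days=300, N=1e6,
                  beta=0.35, sigma=1 / 5.0, gamma=1 / 7.0):
    S, E, I, R = N - 10, 0.0, 10.0, 0.0
    peak = 0.0
    for t in range(days):
        b = beta * (1.0 - reduction) if t >= start_day else beta
        new_E, new_I, new_R = b * S * I / N, sigma * E, gamma * I
        S, E, I, R = S - new_E, E + new_E - new_I, I + new_I - new_R, R + new_R
        peak = max(peak, I)
    return R / N, peak                          # attack rate, peak prevalence

for reduction in (0.0, 0.3, 0.5):
    attack, peak = seir_outcomes(reduction)
    print(f"contact reduction {reduction:.0%}: attack rate {attack:.1%}, peak {peak:,.0f}")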
even for a single intervention, evaluating its impact is quite challenging, since there are a number of sources of uncertainty, and a number of parameters associated with the intervention (e.g., when to start school closure, how long, and how to restart). therefore, finding uncertainty bounds is a key part of the problem. data needs while all the data needs from the previous stages for developing a model are still there, representation of different kinds of behaviors is a crucial component of the models in this stage; this includes: use of ppes, compliance to social distancing measures, and level of mobility. statistics on such behaviors are available at a fairly detailed level (e.g., counties and daily) from multiple sources, such as ( ) the covid- impact analysis platform from the university of maryland [ ] , which gives metrics related to social distancing activities, including level of staying home, outside county trips, outside state trips, ( ) changes in mobility associated with different kinds of activities from google [ ] , and other sources, ( ) survey data on different kinds of behaviors, such as usage of masks [ ] . modeling approaches as mentioned above, such analyses are almost entirely model based, including structured metapopulation models [ , [ ] [ ] [ ] [ ] , and agent based models [ ] [ ] [ ] [ ] [ ] . different kinds of behaviors relevant to such interventions, including compliance with using ppes and compliance to social distancing guidelines, need to be incorporated into these models. since there is a great deal of heterogeneity in such behaviors, it is conceptually easiest to incorporate them into agent based models, since individual agents are represented. however, calibration, simulation and analysis of such models pose significant computational challenges. on the other hand, the simulation of metapopulation models is much easier, but such behaviors cannot be directly represented-instead, modelers have to estimate the effect of different behaviors on the disease model parameters, which can pose modeling challenges. challenges there are a number of challenges in using data on behaviors, which depends on the specific datasets. much of the data available for covid- is estimated through indirect sources, e.g., through cell phone and online activities, and crowd-sourced platforms. this can provide large spatio-temporal datasets, but have unknown biases and uncertainties. on the other hand, survey data is often more reliable, and provides several covariates, but is typically very sparse. handling such uncertainties, rigorous sensitivity analysis, and incorporating the uncertainties into the analysis of the simulation outputs are important steps for modelers. the covid- pandemic has led to a significant increase in hospitalizations. hospitals are typically optimized to run near capacity, so there have been fears that the hospital capacities would not be adequate, especially in several countries in asia, but also in some regions in the us. nosocomial transmission could further increase this burden. formulation the overall problem is to estimate the demand for hospital resources within a populationthis includes the number of hospitalizations, and more refined types of resources, such as icus, ccus, medical personnel and equipment, such as ventilators. 
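one simple way to turn an incidence curve into a bed-demand curve is to convolve admissions with a length-of-stay distribution, as in the sketch below; the hospitalization fraction, mean stay and capacity are assumed values, not estimates for any region.

# minimal sketch: hospital bed occupancy from daily incident cases; all parameters
# (hospitalization fraction, mean length of stay, capacity) are illustrative.
import numpy as np

daily_cases = np.concatenate([np.linspace(5, 400, 40), np.linspace(400, 20, 60)])
p_hosp, mean_los = 0.05, 8                      # fraction hospitalized, mean stay in days

admissions = p_hosp * daily_cases
# geometric length of stay: probability of still occupying a bed d days after admission
still_in = (1 - 1 / mean_los) ** np.arange(60)
occupancy = np.convolve(admissions, still_in)[: len(admissions)]

capacity = 150
print("peak bed demand:", int(occupancy.max()))
print("days over capacity:", int((occupancy > capacity).sum()))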
an important issue is whether the capacity of hospitals within the region would be overrun by the demand, when this is expected to happen, and how to design strategies to meet the demand-this could be through augmenting the capacities at existing hospitals, or building new facilities. timing is of essence, and projections of when the demands exceed capacity are important for governments to plan. the demands for hospitalization and other health resources can be estimated from the epidemic models mentioned earlier, by incorporating suitable health states, e.g., [ , ] ; in addition to the inputs needed for setting up the models for case counts, datasets are needed for hospitalization rates and durations of hospital stay, icu care, and ventilation. the other important inputs for this component are hospital capacity, and the referral regions (which represent where patients travel for hospitalization). different public and commercial datasets provide such information, e.g., [ , ] . modeling approaches demand for health resources is typically incorporated into both metapopulation and agent based models, by having a fraction of the infectious individuals transition into a hospitalization state. an important issue to consider is what happens if there is a shortage of hospital capacity. studying this requires modeling the hospital infrastructure, i.e., different kinds of hospitals within the region, and which hospital a patient goes to. there is typically limited data on this, and data on hospital referral regions, or voronoi tesselation can be used. understanding the regimes in which hospital demand exceeds capacity is an important question to study. nosocomial transmission is typically much harder to study, since it requires more detailed modeling of processes within hospitals. challenges there is a lot of uncertainty and variability in all the datasets involved in this process, making its modeling difficult. for instance, forecasts of the number of cases and hospitalizations have huge uncertainty bounds for medium or long term horizon, which is the kind of input necessary for understanding hospital demands, and whether there would be any deficits. the suppression stage involves methods to control the outbreak, including reducing the incidence rate and potentially leading to the eradication of the disease in the end. eradication in case of covid- appears unlikely as of now, what is more likely is that this will become part of seasonal human coronaviruses that will mutate continuously much like the influenza virus. contact tracing problem refers to the ability to trace the neighbors of an infected individual. ideally, if one is successful, each neighbor of an infected neighbor would be identified and isolated from the larger population to reduce the growth of a pandemic. in some cases, each such neighbor could be tested to see if the individual has contracted the disease. contact tracing is the workhorse in epidemiology and has been immensely successful in controlling slow moving diseases. when combined with vaccination and other pharmaceutical interventions, it provides the best way to control and suppress an epidemic. formulation the basic contact tracing problem is stated as follows: given a social contact network g(v, e) and subset of nodes s ⊂ v that are infected and a subset s ⊂ s of nodes identified as infected, find all neighbors of s. here a neighbor means an individual who is likely to have a substantial contact with the infected person. 
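a minimal sketch of the neighbor-identification step on a known contact network is shown below; the graph and the set of identified cases are toy stand-ins, since in real deployments the contacts come from interviews or app data rather than an observed graph.

# minimal sketch: given identified infected nodes S1, list their contacts and
# prioritize individuals exposed to multiple cases; toy graph for illustration.
import networkx as nx
from collections import Counter

G = nx.karate_club_graph()                      # stand-in for a social contact network
S1 = {0, 33, 5}                                 # identified infected individuals

exposure = Counter(v for u in S1 for v in G.neighbors(u) if v not in S1)
to_notify = sorted(exposure, key=exposure.get, reverse=True)

print("contacts to trace:", len(to_notify))
print("highest-priority contacts (most infected neighbors):", to_notify[:5])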
one then tests them (if tests are available), and following that, isolates these neighbors, or vaccinates them or administers anti-viral. the measures of effectiveness for the problem include: (i) maximizing the size of s , (ii) maximizing the size of set n (s ) ⊆ n (s), i.e. the potential number of neighbors of set s , (iii) doing this within a short period of time so that these neighbors either do not become infectious, or they minimize the number of days that they are infectious, while they are still interacting in the community in a normal manner, (iv) the eventual goal is to try and reduce the incidence rate in the community-thus if all the neighbors of s cannot be identified, one aims to identify those individuals who when isolated/treated lead to a large impact; (v) and finally verifying that these individuals indeed came in contact with the infected individuals and thus can be asked to isolate or be treated. data needs data needed for the contact tracing problem includes: (i) a line list of individuals who are currently known to be infected (this is needed in case of human based contact tracing). in the real world, when carrying out human contact tracers based deployment, one interviews all the individuals who are known to be infectious and reaches out to their contacts. modeling approaches human contact tracing is routinely done in epidemiology. most states in the us have hired such contact tracers. they obtain the daily incidence report from the state health departments and then proceed to contact the individuals who are confirmed to be infected. earlier, human contact tracers used to go from house to house and identify the potential neighbors through a well defined interview process. although very effective it is very time consuming and labor intensive. phones were used extensively in the last - years as they allow the contact tracers to reach individuals. they are helpful but have the downside that it might be hard to reach all individuals. during covid- outbreak, for the first time, societies and governments have considered and deployed digital contact tracing tools [ ] [ ] [ ] [ ] [ ] . these can be quite effective but also have certain weaknesses, including, privacy, accuracy, and limited market penetration of the digital apps. challenges these include: (i) inability to identify everyone who is infectious (the set s) -this is virtually impossible for covid- like disease unless the incidence rate has come down drastically and for the reason that many individuals are infected but asymptomatic; (ii) identifying all contacts of s (or s ) -this is hard since individuals cannot recall everyone they met, certain folks that they were in close proximity might have been in stores or social events and thus not known to individuals in the set s. furthermore, even if a person is able to identify the contacts, it is often hard to reach all the individuals due to resource constraints (each human tracer can only contact a small number of individuals. the overall goal of the vaccine allocation problem is to allocate vaccine efficiently and in a timely manner to reduce the overall burden of the pandemic. formulation the basic version of the problem can be cast in a very simple manner (for networked models): given a graph g(v, e) and a budget b on the number of vaccines available, find a set s of size b to vaccinate so as to optimize certain measure of effectiveness. 
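as one simple baseline heuristic (not the optimal or game-theoretic schemes discussed here), the sketch below vaccinates the b highest-degree nodes and evaluates the allocation with monte carlo simulations of a percolation-style spread; the graph, budget and transmission probability are assumptions.

# minimal sketch: degree-based vaccine allocation evaluated by monte carlo
# simulation of a simple percolation-style spread process; illustrative only.
import random
import networkx as nx

def expected_outbreak_size(G, vaccinated, p=0.2, trials=200, seeds=3):
    total = 0
    for _ in range(trials):
        infected = set(random.sample([v for v in G if v not in vaccinated], seeds))
        frontier = list(infected)
        while frontier:
            u = frontier.pop()
            for v in G.neighbors(u):
                if v not in infected and v not in vaccinated and random.random() < p:
                    infected.add(v)
                    frontier.append(v)
        total += len(infected)
    return total / trials

G = nx.barabasi_albert_graph(2000, 3, seed=1)
budget = 100
by_degree = sorted(G.nodes, key=G.degree, reverse=True)
S = set(by_degree[:budget])                     # vaccinate the b highest-degree nodes

print("no vaccination  :", expected_outbreak_size(G, set()))
print("degree targeting:", expected_outbreak_size(G, S))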
the measure of effectiveness can be (i) minimizing the total number of individuals infected (or maximizing the total number of uninfected individuals); (ii) minimizing the total number of deaths (or maximizing the total number of deaths averted); (iii) optimizing the above quantities but keeping in mind certain equity and fairness criteria (across socio-demographic groups, e.g. age, race, income); (iv) taking into account vaccine hesitancy of individuals; (v) taking into account the fact that all vaccines are not available at the start of the pandemic, and when they become available, one gets limited number of doses each month; (vi) deciding how to share the stockpile between countries, state, and other organizations; (vii) taking into account efficacy of the vaccine. data needs as in other problems, vaccine allocation problems need as input a good representation of the system; network based, meta-population based and compartmental mass action models can be used. one other key input is the vaccine budget, i.e., the production schedule and timeline, which serves as the constraint for the allocation problem. additional data on prevailing vaccine sentiment and past compliance to seasonal/neonatal vaccinations are useful to estimate coverage. modeling approaches the problem has been studied actively in the literature; network science community has focused on optimal allocation schemes, while public health community has focused on using meta-population models and assessing certain fixed allocation schemes based on socio-economic and demographic considerations. game theoretic approaches that try and understand strategic behavior of individuals and organization has also been studied. challenges the problem is computationally challenging and thus most of the time simulation based optimization techniques are used. challenge to the optimization approach comes from the fact that the optimal allocation scheme might be hard to compute or hard to implement. other challenges include fairness criteria (e.g. the optimal set might be a specific group) and also multiple objectives that one needs to balance. while the above sections provide an overview of salient modeling questions that arise during the key stages of a pandemic, mathematical and computational model development is equally if not more important as we approach the post-pandemic (or more appropriately inter-pandemic) phase. often referred to as peace time efforts, this phase allows modelers to retrospectively assess individual and collective models on how they performed during the pandemic. in order to encourage continued development and identifying data gaps, synthetic forecasting challenge exercises [ ] may be conducted where multiple modeling groups are invited to forecast synthetic scenarios with varying levels of data availability. another set of models that are quite relevant for policymakers during the winding down stages, are those that help assess overall health burden and economic costs of the pandemic. epideep: exploiting embeddings for epidemic forecasting an arima model to forecast the spread and the final size of covid- epidemic in italy (first version on ssrn march) real-time epidemic forecasting: challenges and opportunities accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the u.s realtime forecasting of infectious disease dynamics with a stochastic semi-mechanistic model healthmap the use of social media in public health surveillance. 
western pacific surveillance and response journal : wpsar the effect of travel restrictions on the spread of the novel coronavirus (covid- ) outbreak basic prediction methodology for covid- : estimation and sensitivity considerations. medrxiv covid- outbreak on the diamond princess cruise ship: estimating the epidemic potential and effectiveness of public health countermeasures impact of non-pharmaceutical interventions (npis) to reduce covid mortality and healthcare demand. imperial college technical report modelling disease outbreaks in realistic urban social networks computational epidemiology forecasting covid- impact on hospital bed-days, icudays, ventilator-days and deaths by us state in the next months open data resources for fighting covid- data-driven methods to monitor, model, forecast and control covid- pandemic: leveraging data science, epidemiology and control theory covid- datasets: a survey and future challenges. medrxiv mathematical modeling of epidemic diseases the use of mathematical models to inform influenza pandemic preparedness and response mathematical models for covid- pandemic: a comparative analysis updated preparedness and response framework for influenza pandemics novel framework for assessing epidemiologic effects of influenza epidemics and pandemics covid- pandemic planning scenarios epidemiological data from the covid- outbreak, real-time case information covid- case surveillance public use data -data -centers for disease control and prevention covid- patients' clinical characteristics, discharge rate, and fatality rate of meta-analysis estimating the generation interval for coronavirus disease (covid- ) based on symptom onset data the incubation period of coronavirus disease (covid- ) from publicly reported confirmed cases: estimation and application estimating clinical severity of covid- from the transmission dynamics in wuhan, china the use and reporting of airline passenger data for infectious disease modelling: a systematic review flight cancellations related to -ncov (covid- ) the hidden geometry of complex, network-driven contagion phenomena potential for global spread of a novel coronavirus from china forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study. the lancet using predicted imports of -ncov cases to determine locations that may not be identifying all imported cases. medrxiv preparedness and vulnerability of african countries against introductions of -ncov. medrxiv creating synthetic baseline populations openstreetmap american time use survey multiscale mobility networks and the spatial spreading of infectious diseases optimizing spatial allocation of seasonal influenza vaccine under temporal constraints assessing the international spreading risk associated with the west african ebola outbreak spread of zika virus in the americas structure of social contact networks and their impact on epidemics generation and analysis of large synthetic social contact networks modelling disease outbreaks in realistic urban social networks containing pandemic influenza at the source report : impact of non-pharmaceutical interventions (npis) to reduce covid mortality and healthcare demand the structure and function of complex networks a public data lake for analysis of covid- data midas network. 
midas novel coronavirus repository covid- ) data in the united states covid- impact analysis platform covid- surveillance dashboard the covid tracking project absolute humidity and the seasonal onset of influenza in the continental united states epiestim: a package to estimate time varying reproduction numbers from epidemic curves. r package version google covid- community mobility reports mask-wearing survey data impact of social distancing measures on coronavirus disease healthcare demand, central texas, usa current hospital capacity estimates -snapshot total hospital bed occupancy quantifying the effects of contact tracing, testing, and containment covid- epidemic in switzerland: on the importance of testing, contact tracing and isolation quantifying sars-cov- transmission suggests epidemic control with digital contact tracing isolation and contact tracing can tip the scale to containment of covid- in populations with social distancing. available at ssrn privacy sensitive protocols and mechanisms for mobile contact tracing the rapidd ebola forecasting challenge: synthesis and lessons learnt acknowledgments. the authors would like to thank members of the biocomplexity covid- response team and network systems science and advanced computing (nssac) division for their thoughtful comments and suggestions related to epidemic modeling and response support. we thank members of the biocomplexity institute and initiative, university of virginia for useful discussion and suggestions. this key: cord- -aju xkel authors: wei, viska; ivkin, nikita; braverman, vladimir; szalay, alexander title: sketch and scale: geo-distributed tsne and umap date: - - journal: nan doi: nan sha: doc_id: cord_uid: aju xkel running machine learning analytics over geographically distributed datasets is a rapidly arising problem in the world of data management policies ensuring privacy and data security. visualizing high dimensional data using tools such as t-distributed stochastic neighbor embedding (tsne) and uniform manifold approximation and projection (umap) became common practice for data scientists. both tools scale poorly in time and memory. while recent optimizations showed successful handling of , data points, scaling beyond million points is still challenging. we introduce a novel framework: sketch and scale (sns). it leverages a count sketch data structure to compress the data on the edge nodes, aggregates the reduced size sketches on the master node, and runs vanilla tsne or umap on the summary, representing the densest areas, extracted from the aggregated sketch. we show this technique to be fully parallel, scale linearly in time, logarithmically in memory, and communication, making it possible to analyze datasets with many millions, potentially billions of data points, spread across several data centers around the globe. we demonstrate the power of our method on two mid-size datasets: cancer data with million -band pixels from multiple images of tumor biopsies; and astrophysics data of million stars with multi-color photometry from the sloan digital sky survey (sdss). dimensionality reduction plays a crucial role in both machine learning and data science. 
it primarily serves two fundamental roles: ( ) as a preprocessing step it helps to extract the most important low dimensional representation of a signal before feeding the data into a machine learning algorithm, ( ) as a visualization tool it navigates data scientists towards better understanding of local and global structures within the dataset while working with more comprehensible two-or threedimensional plots. in clustering and classification problems we often seek to find a relatively small number of clusters, to correspond the number of categories human perception can distinguish. among the full spectrum of dimensionality reduction and lower dimensional embedding techniques available today [ ] , [ ] , [ ] , [ ] , [ ] , tsne [ ] and umap [ ] are probably the two most popular methods for visualization. tdistributed stochastic neighbor embedding (tsne) is a high-dimensional data visualization tool proposed by geoffrey hinton's group in [ ] . tsne converts similarities between data points to joint probabilities and tries to minimize the kullback-leibler divergence between the joint probabilities of the low-dimensional embedding and the highdimensional data. in contrast to pca, tsne is not linear, it employs the local relationships between points to create a low-dimensional mapping, by comparing the full high-dimensional distance to the one in the projection subspace and capturing non-linear structures. tsne has a non-convex cost function and provides different results for different - - - - / /$ . © ieee initialisations. one of the major hurdles with tsne is that it ceases to be efficient when processing high-(or even medium) dimensional data sets with large cardinalities, due to the fact that the naive tsne implementation scales as o(n ). on a typical laptop cpu, processing , points with tsne would take close to an hour. there was a considerable research effort exerted to speed up tsne. barnes-hut approximation pushed scaling down to o(n log n) [ ] . approximated tsne [ ] lets user steer the trade off between speed and accuracy; netsne is training neural networks on tsne [ ] to provide a more scalable execution time. the multicore tsne [ ] and tsne-cuda [ ] introduced highly parallel version of the algorithm for cpu and gpu platforms. umap [ ] is using manifold learning and topological data analysis to reduce dimensionality. it uses cross-entropy to optimize the lower dimensional representation. faster performance is the main advantage over tsne [ ] . nevertheless, both frameworks scale poorly and computational prohibitive when data hits a hundred million points. in addition, the entire dataset has to reside in memory of one machine, making it infeasible for the datasets distributed across several compute clusters around the globe. for instance, privacy concerns of the healthcare data might limit transfers from clinic to clinic, physics related data accumulated by several research centers can be too large to transfer. in this paper, we introduce sketch and scale (sns), a solution to deal with much highercardinality datasets with an intermediate number of dimensions. we implement our idea as a preprocessor to any of the above mentioned dimensionality reduction techniques. sns uses approximate sketching over parallel, linear streams. furthermore, it can be executed over spatially segregated subsets of the data. 
specifically, we utilize a hashing-based algorithm count sketch over the quantized high dimensional coordinates to find the cells with the highest densities: the so called "heavy-hitters" [ ] , [ ] . we then select an appropriate number of heavy hitters ( - ) to analyze with the standard techniques to get the final clusters/visualizations. in section ii we present the idea of sketching and describe its crucial role in building scalable preprocessing pipeline. further, in section iv we apply it to two data sets: multispectral cancer images, with a total of million pixels, and photometric observations of million stars from the sloan digital sky survey. finally we discuss various practical aspects on how to scale our technique to much larger data sets, geographically disjoint data and overcoming privacy-related constraints. there is a genuine need to find clusters in data sets with cardinalities in the billions: pixels of multispectral imaging data in medical and geospatial imaging, large multi-dimensional point clouds, etc. furthermore, we want to compute approximate statistics (various moments) of a multidimensional probability distribution with a large cardinality. our tool helps with all of this: it generates a very compact approximation to the full multidimensional probability distribution of a large dataset. ) clusters and heavy hitters: hereafter we will assume that our data is clustered, i.e. there exists a metric space with a non-vanishing correlation function: the excess probability over random that we can find two points at a certain distance from one another. while a gaussian random process can be fully described with a single correlation function, other higher order processes can have nontrivial higher order, n-point correlations. we can quantify this by creating a discretized grid over our metric space, and counting the number of points in each bin. this way we introduce a probability distribution p (n ) that a bin will count n points. the bins intersecting our biggest clusters will have a large count. we will call these "heavy hitters" (hh). datasets with strong clustering will have cell count distributions with a fat tail. many of these heavy hitters will be contact neighbors, as the real clusters will be split by the grid discretization. a way to identify heavy hitters is to count the number of points in each bin, those above the threshold will be our heavy hitters. finally, we use vanilla tsne/umap to find the real, connected clusters which have been split up into multiple adjacent bins. tsne/umap is applied in the reduced cardinality of we weighted each hh by replicating it multiple times with small uniform perturbation ( / of the cell size), since identical points are merged in tsne. one scheme is to give a higher score towards the highest ranked hh. assume that the smallest hh has a rank of r max , and a count f min . a second possibility for weighting is to use + log (r max /r) as the number of the replicas, scaling with the rank r of the hh. finally, we can also use + log (f /f min ) as the replication factor, weighting by the log of the counts f . we have tested that these do not impact the main features of the clustering patterns we find. ) heavy hitters vs random subsampling: one would think we apply tsne/umap to the subsampled data, however, this does not work at low sampling rates. 
no matter how fat the tail of the original p(n) distribution is, it will converge to an infinitesimal poisson process as the sampling rate nears zero, and the fat tail rapidly disappears [ ] . consider a data set with points with some dense clusters of points or more. to apply tools like tsne/umap, we need to reduce the dataset size down to around , i.e. a sampling rate of − . in order to detect a cluster with even a modest significance, we need to have enough points to keep the poisson error low. in practice points or more will push the relative poisson error to / √ = . . for instance, a dense cell with n ≈ points, sampled at a rate of − , would have on average only one point sampled, k = n p = . therefore, it will be indistinguishable from the many billions of low density cells. at a sampling rate of p = − we could detect k = for the cell with points with a % uncertainty, but the whole random subset would have − = points, too large to feed into tsne. with a larger data set these tradeoffs get rapidly much worse. in contrast, we detect the top heavy hitters with high confidence, using direct aggregation at full or somewhat reduced sampling rate. then we take a small enough subset of these, so that tsne/umap can easily cluster them further, while we discard all the low-density bins. this has the advantage that a small number of the top heavy hitters will still contain a large fraction of the points, particularly those which are located in dense clusters. of course, aggregating a multidimensional histogram in high resolution with a large number of dimensions by brute force would result in an untenable memory requirement. instead we will use sketching techniques to build an approximate aggregation with logarithmic memory and linear compute time. our approach is particularly powerful for data with the following properties: (i) the data has a very large number of rows ( to ), (ii) it has a moderate number of dimensions (< ), (iii) there is a moderate number of clusters (< ), but these can have non-compact extents, and (iv) the clusters have a high density contrast in the metric space. the limitation on the number of dimensions is less of a problem than it seems, as dimensionality reduction techniques, like random projections [ ] , can be applied as a preprocessor to our code. ) streaming sketches of heavy hitters: the field of streaming algorithms arose from the necessity to process massive data in sequential order while operating in very low memory. first introduced for the estimation of the frequency moments [ ] , it was further developed for a wide spectrum of problems within linear algebra [ ] , graph problems [ ] , and others [ ] , and found applications in machine learning [ ] , [ ] , networking [ ] , [ ] , and astrophysics [ ] , [ ] . further, we provide a glimpse of the streaming model and of sketches for finding frequent items; for a comprehensive review refer to [ ] . given a zero vector f of dimension n, the algorithm observes a stream of updates to its coordinates s = {s_1, ..., s_m}, where s_j specifies an update to some coordinate i: f_i ← f_i + 1. alon, matias and szegedy [ ] were the first to show a data structure (the ams sketch) approximating the squared ℓ2 norm ‖f‖_2^2 of the vector f at the end of the stream while using only o(log nm) bits of memory. in a nutshell, ams maintains a single counter c = Σ_i f_i h(i) for a random hash function h : [n] → {−1, +1}; at the end of the stream, c^2 is returned as an approximation of ‖f‖_2^2. it is unbiased: e[c^2] = e[(Σ_i f_i h(i))^2] = Σ_i f_i^2 e[h^2(i)] + Σ_{i≠j} f_i f_j e[h(i)h(j)] = ‖f‖_2^2, where the last equality holds because h^2(i) = 1 and e[h(i)h(j)] = 0 for i ≠ j under a pairwise independent h(·).
one can similarly bound the variance of c², so that running several instances in parallel and averaging and/or using a median filter provides control over the approximation [ ] . the count sketch (cs) algorithm [ ] extends the ams approach to find heavy hitters with an ℓ2 guarantee, i.e. all i such that f_i > ε‖f‖₂, together with approximations of f_i. the idea behind the algorithm combines the ams sketch with a hashing table with c buckets. it uses a hash h₁(·) to map an arriving item i to one of the c buckets and a hash h₂(·) for choosing the sign as in the ams sketch. every (ε, ℓ2)-heavy hitter's frequency can be estimated by the corresponding bin count if we choose c of order 1/ε², as on average only a small fraction of the mass of the non-heavy items will fall into the same bin. to eliminate collisions and identify the heavy hitters, r = o(log(n/δ)) hash tables are maintained in parallel, where 1 − δ is the probability of successful recovery of all heavy hitters. cs memory utilization is sublinear in n and m, o((1/ε²) log nm), which is handy when working with billions and even trillions of items. iii. scalable count sketch ) the count sketch algorithm: as described in section ii- , the count sketch (cs) accepts updates to an n-dimensional vector f in a streaming fashion and can recover the coordinates of f for which f_i ≥ ε‖f‖₂, i.e. the (ε, ℓ2)-heavy hitters. its major advantage is memory usage that scales logarithmically in the dimensionality n and the stream size m. moreover, cs is a linear operator, thus sketches can be merged efficiently. this opens a diverse pool of applications in distributed settings: multiple nodes can compute sketches, each on its own piece of data, and send them to the master node for merging. such an approach can help to alleviate the bottleneck in communication speed and brings a certain level of privacy for free [ ] , [ ] . below we present the details of the four major operations over the cs data structure: initialization, update, estimate and merge [ ] . the initialization, function init(r, c), simply allocates an r × c table of counters set to zero; even when hunting for the most frequent items in streams of billions of updates, only a modest number of rows and columns is needed, and the total memory is of the order of megabytes. in order to be able to compute hash values for the binning, we need to enclose the data in a d-dimensional hypercube, with m linear bins in each dimension. then the discrete quantized coordinates of the data can be concatenated together to create a feature vector that can be fed into the count sketch. the question is how to choose the number of bins: too many bins will result in very low density in each bin representing a cluster, while too few bins will cause several independent clusters to be merged into one. further, we estimate the random collision rate between heavy hitters in adjacent cells. the discretized volume v = m^d is the total number of bins in the hypercube. we limit the number of heavy hitters to k. λ = k/v is the mean density of heavy hitters per cell. we can define a contact neighborhood of a cell by the small hypercube of volume w = 3^d around it (the cell and its immediate neighbors). given the single-cell density λ, the expected number of heavy hitters within a contact neighborhood is ρ = wλ = k(w/v). since the heavy hitters (from the random-collisions perspective) can be treated as a poisson point process, we can estimate the probability that a neighborhood volume contains zero or at least one other heavy hitter: p(0) = e^−ρ, p(>0) = 1 − p(0) = 1 − e^−ρ. the number of heavy hitters with a random collision in their neighborhood is then c = k·p(>0). this number is quite sensitive to the number of dimensions and the number of linear bins.
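the random-collision estimate above is easy to evaluate for candidate grid settings; a small helper function (the parameter values below are illustrative, not those of the experiments) makes the sensitivity to the dimensionality d and the number of linear bins m explicit:

```python
import math

def expected_collisions(k, d, m):
    """Expected number of heavy hitters with another heavy hitter in their contact neighbourhood."""
    v = m ** d                    # total number of cells in the hypercube
    lam = k / v                   # mean heavy-hitter density per cell
    w = 3 ** d                    # contact neighbourhood: the cell plus its immediate neighbours
    rho = w * lam                 # expected heavy hitters in a neighbourhood
    p_gt0 = 1.0 - math.exp(-rho)  # P(>0) under the Poisson approximation
    return k * p_gt0

# illustrative values only: more linear bins per axis sharply reduces random collisions
for m in (32, 64, 128):
    print(m, expected_collisions(k=100_000, d=8, m=m))
```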
for k = , d = , m = , the collision rate is high: c = , ; while if we increase the number of bins to m = , it goes down to c = . . though only approximately, this argument gives some guidance in choosing the binning. the number of dimensions is limited by the hash collision rate in the sketch matrix, nevertheless, the growth of storage there is only logarithmic (see next section). it is expected that the hash collisions in the sketch table will cause uncertainties in the estimation of the cell frequency counts. the use of a ± hash value for the increment is mitigating this, but still the many cells with small counts will add a fixed poisson noise to the sketch counts, leading to an increasing relative uncertainty as the frequencies decrease. we evaluated how well cs algorithm estimates the exact frequencies of the discretized multidimensional cells, and how well it ranks those that appear most frequently. we used the cancer sample (using bins in each coordinate, top k hh) and determined how the relative error grows with the rank of the densest cells. the rank represents a descending ordering, so cells with the highest counts have the lowest ranks. for each cell i we find its frequency f i and rank r i in the output of an exact algorithm and its frequencyf i in the count sketch output. the relative error is defined as |f i −f i |/f i . the rms values of the relative error are . for r < , . for < r < , and . for k < r < k. in this paper we test the ideas on two different data sets: ( ) m pixels in multispectral images from cancer immunotherapy and ( ) photometric observations of m stars in the sloan digital sky survey (sdss). in both cases the clusters are not thought to be sharply divided into very distinct categories, rather they form distributions where the categories gradually morph into one another. in addition the first few components of pca do not give a meaningful separation, i.e. nonlinear techniques are needed. however, the data set is too large to use tsne/umap directly. check project repository [ ] for additional details. ) clustering of pixels in cancer images: our dataset consists of images taken at . µ/pixel resolution, with a % overlap. the slide contains a µ thick section of a melanoma biopsy. the images are observed with a combination of different broadband excitation filters and nm wide narrow-band filters, for a total of layers, x pixels in each. the tissue was stained with different fluorescent markers/dyes: dna content of nuclei; lineage markers tumor, cd , foxp and cd (type of cell); and expression markers pd- and pd-l ("checkpoint blockers", controlling the interaction between tumors and the immune system). cancer cells are mostly located in dense areas, tumors, with a reasonably sharp boundary. today these tissue areas are annotated visually, by a trained pathologist. cancer immunotherapy is aimed at understanding the interactions between cancer cells and the immune system, and much of these take place in the tumor micro environment (tme), in the boundary of tumor and immune cells. to automate imaging efforts to thousands of images and billions of cells, the task goes beyond detecting and segmenting into distinct cells of a certain type, it is also important to automatically identify the tumor tissue and the tumor membrane. we expect the data to have approximately degrees of freedom. 
our goal is to see what level of clustering can be detected at the pixel level, and whether clustering can be used to identify (a) cells of different types (b) outlines associated with tumor and possibly other tissue types. as each specimen contains different ratios of tissues and cells, naive pca will not work well, as the weights of the different components will vary from sample to sample. even if the subspaces will largely overlap, the orientation of the axes will vary from sample to sample. in addition, each staining batch of the samples is slightly different, thus the clusters will be moving around in the pixel color space. our goal is to verify if we can find enough clusters that can be used further downstream as anchor points for mapping the color space between different staining batches and tissue types. the dataset has a limited set of labels available: a semi-automated segmentation of the images, the detection of cell nuclei and separating them into two basic subtypes, cancer and non-cancer. the non-cancer cells are likely to have a large fraction of immune cells, but not exclusively. the expression markers are typically attached to the membranes, in between the nuclei. we take the the first components of the pixelwise pca of the images. we then compute the intensity (euclidean norm) of the pixel intensities, eliminate the background noise using a threshold derived from the noise. this leaves m of the initial m pixels. each pixel is then normalized by the intensity, turning them into colors. we embed the points into an -dimensional hypercube, and quantize each coordinate into linear bins. then we run the count sketch algorithm and create an ordered list of the top , heavy hitters. the top hh has , points, while the , th rank has only . the cumulative fraction of the top , heavy hitters is . %. the sketch matrix is x , . we then feed the top k hh to umap, and generate the top two coordinates. we find clusters in the data, labeled from through , as shown on figure . these can be grouped into three categories, pixels related to tumor ( , , , ) , pixel related to nuclei of cells ( , ) , and non-tumor tissue ( , , , ) . these show an excellent agreement with labels generated for nuclei using an industry standard segmentation software. we built a contingency table summarizing the pixel level classifications shown above. for each pixel in the label set marked as tumor or other we build a histogram of the classes in our classification. we considered a classification correct if a tumor pixel in the label set belonged to either a nucleus or tumor class in our scheme. we did the same for the non-tumor cells. the pixels tagged as background by our mask were ignored. the results are quite good, the false positive rates are . % and . % for tumor and other, respectively. ) classification of stellar photometry in sdss: we used the thirteenth data release (dr ) of sdss. our goal is to see how well can we recover the traditional astronomical classification of stars, the so called hertzsprung-russel diagram. we extracted two different subsets of stars: ( ) k stars with classification labels obtained from analyzing their spectra and ( ) m stars without labels. the features are the combinations of magnitudes u, g, r, i, z, defined by the differences (ug, ur, ..., iz) = (u − g , u − r , ..., i − z ). we built the count sketch on m objects and selected the top , heavy hitters. the count was , , at rank and at rank , . the , hhs contained . % stars, forming a highly representative sample. 
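the stellar colour features just described are plain magnitude differences; a short helper (the column layout and the synthetic magnitudes are illustrative assumptions, not the sdss schema) shows how the pairwise colour vector handed to the quantizer and count sketch is built:

```python
import itertools
import numpy as np

BANDS = ["u", "g", "r", "i", "z"]

def color_features(mags):
    """mags: (n_stars, 5) array of u, g, r, i, z magnitudes -> all pairwise differences."""
    pairs = list(itertools.combinations(range(len(BANDS)), 2))   # (u-g), (u-r), ..., (i-z)
    return np.stack([mags[:, a] - mags[:, b] for a, b in pairs], axis=1)

mags = np.random.normal(18.0, 1.0, size=(1000, 5))   # synthetic magnitudes for illustration
print(color_features(mags).shape)                    # (1000, 10); quantized and fed to the sketch as above
```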
we then feed the hhs to umap, and extract a -dimensional projection. one of the pairwise scatter plots between the coordinates is shown in figure . we can distinguish white dwarfs, and f, a, k and m stars. this experiment has yielded a success beyond any expectations. we presented a preprocessing technique that is aimed at reducing the cardinality of extreme-sized data sets with moderate dimensions while preserving the clustering properties. we use approximate sketching with a linear-time streaming algorithm to find the heavy-hitter cells of the quantized input data. this new point process, formed by the heavy hitters, will correctly represent the clustering properties of the underlying point cloud. our code makes heavy use of gpus for the hash computations and the sketch aggregation, and can be parallelized to an arbitrarily high degree. we demonstrated the utility of this approach on two different data sets, one on m pixels of cancer images, the other on million stars with colors from the sloan digital sky survey. we have found that the heavy hitters correctly sampled the clusters in both data sets with quite different properties, and the results were in excellent agreement with the sparse labels available. the computations were extremely fast, essentially i/o limited. processing the count sketch of million points takes only a few seconds on a single v gpu with single-stream i/o. introducing parallel i/o would saturate all cuda kernels of the gpu. by replicating the data we scaled beyond billion points, and demonstrated an asymptotically linear scaling (fig. ). for the cancer data, using pixel colors only we were able to split the pixels into three distinct groups, which formed spatially coherent and connected regions, separated by the tumor boundaries. this experiment has exceeded our early expectations. the current data used was rather modest, with million pixels in images. currently at johns hopkins university we have more than , images created, with , more in the queue, resulting in hundreds of billions of total pixels, as shown on fig. . approaching a pixel-wise analysis of such data will only be possible through highly scalable algorithms. for the m stars, we generated , heavy hitters in a very short time. feeding these to umap and projecting to dimensions lets us identify several major classes of stars based upon imaging data only. while it does not represent a breakthrough in astronomy (to properly classify stars we need absolute luminosities, thus distances obtainable only by other techniques), it is a good demonstration of the scalability and feasibility of our technique. our approach has additional long-term implications. sketches can be computed on arbitrary subsets of the data, and be combined subsequently. the only constraint is that the hashing functions and the sketch matrix sizes must be the same for all threads. using this approach, sketches of data at different geographic locations can be computed in place, and only the accumulations move to the final aggregation site. this not only saves huge amounts of data movement, but also diminishes potential data privacy concerns, as the approximate hashing is not invertible, i.e. it hides all identifiable information. our approach naturally overcomes the problems arising from institutional and national policies limiting the free movement of data across boundaries and between research centers and hospitals.
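because the sketch is a linear operator, per-site sketches built with identical hash functions and table shapes can simply be summed, and the summation can be organised as a tree across data centers; the bare-bones class below (ours, not the authors' gpu code) shows only the operations needed to make that point:

```python
import zlib
import numpy as np

class CountSketch:
    """Minimal count sketch supporting update, estimate and merge (illustration only)."""
    def __init__(self, rows=5, cols=2**16, seed=42):
        self.rows, self.cols = rows, cols
        self.table = np.zeros((rows, cols), dtype=np.int64)
        # identical seeds (hence identical hash functions) at every site make sketches mergeable
        rng = np.random.default_rng(seed)
        self.salts = rng.integers(1, 2**31, size=(rows, 2))

    def _hash(self, salt, key):
        return zlib.crc32(f"{salt}:{key}".encode())      # deterministic across machines

    def update(self, key, count=1):
        for r in range(self.rows):
            b = self._hash(self.salts[r, 0], key) % self.cols
            s = 1 if self._hash(self.salts[r, 1], key) & 1 else -1
            self.table[r, b] += s * count

    def estimate(self, key):
        ests = []
        for r in range(self.rows):
            b = self._hash(self.salts[r, 0], key) % self.cols
            s = 1 if self._hash(self.salts[r, 1], key) & 1 else -1
            ests.append(s * self.table[r, b])
        return int(np.median(ests))

    def merge(self, other):
        self.table += other.table   # linearity: sketch(A) + sketch(B) = sketch of the combined data
        return self

# two "sites" sketch their own data with the same hash parameters; only the tables move
site_a, site_b = CountSketch(), CountSketch()
site_a.update("cell_123", count=10_000)
site_b.update("cell_123", count=7_000)
print(site_a.merge(site_b).estimate("cell_123"))   # close to 17_000
```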
such concerns for studies of clustering in segregated large-scale data are already present in current covid- research, e.g. aggregating mobility data between cellular providers in different countries. when data is stored on a massively parallel storage system, like in many commercial clouds, it is quite easy to run a count sketch job over parallel processes. the sketches can then be aggregated using a tree topology in logarithmic time first within one datacenter, then across many datacenters with minimal communications overhead. in summary, our algorithm enables the generation of extremely powerful approximate statistics over almost arbitrary large data sets. the space complexity of approximating the frequency moments t-sne-cuda: gpu-accelerated t-sne and its applications to modern data finding frequent items in data streams neural data visualization for scalable and generalizable single cell analysis an improved data stream summary: the count-min sketch and its applications stochastic neighbor embedding i know what you did last summer: network monitoring using interval queries scalable streaming tools for analyzing n-body simulations: finding halos and investigating excursion sets in one pass communication-efficient distributed sgd with sketching streaming algorithms for halo finders one sketch to rule them all: rethinking network flow monitoring with univmon accelerating t-sne using tree-based algorithms visualizing data using t-sne graph stream algorithms: a survey uniform manifold approximation and projection for dimension reduction data streams: algorithms and applications approximated and user steerable tsne for progressive visual analytics performance comparison of dimension reduction implementations a randomized algorithm for principal component analysis communication-efficient federated learning with sketching nonlinear dimensionality reduction by locally linear embedding a nonlinear mapping for data structure analysis using dimensionality reduction to optimize t-sne effects of sampling on measuring galaxy count probabilities a global geometric framework for nonlinear dimensionality reduction sketching as a tool for numerical linear algebra key: cord- -exej zwh authors: coveney, peter v.; highfield, roger r. title: when we can trust computers (and when we can't) date: - - journal: nan doi: nan sha: doc_id: cord_uid: exej zwh with the relentless rise of computer power, there is a widespread expectation that computers can solve the most pressing problems of science, and even more besides. we explore the limits of computational modelling and conclude that, in the domains of science and engineering that are relatively simple and firmly grounded in theory, these methods are indeed powerful. even so, the availability of code, data and documentation, along with a range of techniques for validation, verification and uncertainty quantification, are essential for building trust in computer generated findings. when it comes to complex systems in domains of science that are less firmly grounded in theory, notably biology and medicine, to say nothing of the social sciences and humanities, computers can create the illusion of objectivity, not least because the rise of big data and machine learning pose new challenges to reproducibility, while lacking true explanatory power. we also discuss important aspects of the natural world which cannot be solved by digital means. 
in the long-term, renewed emphasis on analogue methods will be necessary to temper the excessive faith currently placed in digital computation. the extent to which reproducibility is an issue for computer modelling is more profound and convoluted however, depending on the domain of interest, the complexity of the system, the power of available theory, the customs and practices of different scientific communities, and many practical considerations, such as when commercial considerations are challenged by scientific findings. for research on microscopic and relatively simple systems, such as those found in physics and chemistry, for example, theory -both classical and quantum mechanical -offers a powerful way to curate the design of experiments and weigh up the validity of results. in these and other domains of science that are grounded firmly on theory, computational methods more easily help to confer apparent objectivity, with the obvious exceptions of pathological science and fraud . for the very reason that the underlying theory is established and trusted in these fields, there is perhaps less emphasis than there should be on verification and validation ("solving the equations right" and "solving the right equations", respectively ) along with uncertainty quantification-collectively known by the acronym vvuq. by comparison, in macroscopic systems of interest to engineers, applied mathematicians, computational scientists and technologists and others who have to design devices and systems that actually work, and which must not put people's lives in jeopardy, vvuq, is a way of life -in every sense -to ensure that simulations are credible. this vvuq philosophy underpins advances in computer hardware and algorithms that improve our ability to model complex processes using techniques such as finite element analysis, and computational fluid dynamics for end-toend simulations in virtual prototyping and to create digital twins. there is a virtuous circle in vvuq, where experimental data hone simulations, while simulations hone experiments and data interpretation. in this way, the ability to simulate an experiment influences validation by experiment. in other domains, however, notably biology and biomedical sciences, theories have rarely attained the power and generality of physics. the state space of biological systems tends to be so vast that detailed predictions are often elusive, and vvuq is less well established, though that is now changing rapidly as, for example, models and simulations begin to find clinical use . despite the often stated importance of reproducibility, researchers still find various ways to unwittingly fool themselves and their peers . data dredging-also known as blind big data, data fishing, data snooping, and p-hackingseeks results that can be presented as statistically significant, without any knowledge of the structural characteristics of the problem, let alone first devising a hypothesis about the underlying mechanistic relationships. while corroboration or indirect supporting evidence may be reassuring, when taken too far it can lead to the interpretation of random patterns as evidence of correlations, and to conflation of these correlations with causative effects. spurred on by the current reward and recognition systems of academia, it is easier and very tempting to quickly publish one-off findings which appear transformative, rather than invest additional money, energy and time to ensure that these one-off findings are reproducible. 
as a consequence, a significant number of 'discoveries' turn out to be unreliable because they are more likely to depend on small populations, weak statistics and flawed analysis [ ] [ ] [ ] [ ] [ ] . there is also a temptation to carry out post hoc rationalisation or harking, 'hypothesizing after the results are known' and to invest more effort into explaining away unexpected findings than validating expected results. most contemporary research depends heavily on computers which generate numbers with great facility. ultimately, though, computers are themselves tools that are designed and used by people. because human beings have a capacity for self-deception , the datasets and algorithms that they create can be subject to unconscious biases of various kinds, for example in the way data are collected and curated in data dredging activities, a lack of standardized data analysis workflows , or the selection of tools that generate promising results, even if their use is not appropriate in the circumstances. no field of science is immune to these issues, but they are particularly challenging in domains where systems are complex and many dimensional, weakly underpinned by theoretical understanding, and exhibit non-linearity, chaos and long-range correlations. with the rise of digital computing power, approaches predicated on big data, machine learning (ml) and artificial intelligence (ai) are frequently deemed to be indispensable. ml and ai are increasingly used to sift experimental and simulation data for otherwise hidden patterns that such methods may suggest are significant. reproducibility is particularly important here because these forms of data analysis play a disproportionate role in producing results and supporting conclusions. some even maintain that big data analyses can do away with the scientific method. however, as data sets increase in size, the ratio of false to true correlations increases very rapidly, so one must be able to reliably distinguish false from true if one is able to find robust correlations. that is difficult to do without a reliable theory underpinning the data being analysed. we, like others , argue that the faith placed in big data analyses is profoundly misguided: to be successful, big data methods must be more firmly grounded on the scientific method. far from being a threat to the scientific method, the weaknesses of blind big data methods serve as a timely reminder that the scientific method remains the most powerful means we have to understand our world. in science, unlike politics, it does not matter how many people say or agree about something: if science is to be objective, it has to be reproducible ("within the error bars"). observations and "scientific facts and results" cannot depend on who is reporting them but must be universal. consensus is the business of politics and the scientific equivalent only comes after the slow accumulation of unambiguous pieces of empirical evidence (albeit most research and programmes are still funded on the basis of what the majority of people on a review panel thinks is right, so that scientists who have previously been successful are more likely to be awarded grants , .) there is some debate about the definition of reproducibility . some argue that replicability is more important than reproducibility. 
others maintain that the gold standard of research should be 're-testability', where the result is replicated rather than the experiment itself, though the degree to which the 'same result' can emerge from different setups, software and implementations is open to question. by reproducibility we mean the repetition of the findings of an experiment or calculation, generally by others, providing independent confirmation and confidence that we understand what was done and how, thus ensuring that reliable ideas are able to propagate through the scientific community and become widely adopted. when it comes to computer modelling, reproducibility means that the original data and code can be analysed by any independent, sceptical investigator to reach the same conclusions. the status of all investigators is supposedly equal and the same results should be obtained regardless of who is performing the study, within well-defined error bars -that is, reproducibility must be framed as a statistically robust criterion because so many factors can change between one set of observations and another, no matter who performs the experiment. the uncertainties come in two forms: (i) "epistemic", or systematic errors, which might be due to differences in measuring apparatus; and (ii) "aleatoric", caused by random effects. the latter typically arise in chaotic dynamical systems which manifest extreme sensitivity to initial conditions, and/or because of variations in conditions outside of the control of an experimentalist. by seeking to control uncertainty in terms of a margin of error, reproducibility means that an experiment or observation is robust enough to survive all manner of scientific analysis. note, of course, that reproducibility is a necessary but not a sufficient condition for an observation to be deemed scientific. in the scientific enterprise, a single result or measurement can never provide definitive resolution for or against a theory. unlike mathematics, which advances when a proof is published, it takes much more than a single finding to establish a novel scientific insight or idea. indeed, in the popperian view of science, there can be no final vindication of the validity of a scientific theory: they are all provisional, and may eventually be falsified. the extreme form of the modern machine-learners' pre-baconian view stands in stark opposition to this: there is no theory at all, only data, and success is measured by how well one's learning algorithm performs at discerning correlations within these data, even though many of these correlations will turn out to be false, random or meaningless. moreover, in recent years, the integrity of the scientific endeavour has been open to question because of issues around reproducibility, notably in the biological sciences. confidence in the reliability of clinical research has, for example, been under increasing scrutiny. in , john p. a. ioannidis wrote an influential article about biomedical research, entitled "why most published research findings are false", in which he assessed the positive predictive value of the truth of a research finding from values such as threshold of significance and power of the statistical test applied. he found that the more teams were involved in studying a given topic, the less likely the research findings from individual studies turn out to be true. 
this seemingly paradoxical corollary follows because of the scramble to replicate the most impressive "positive" results and the attraction of refuting claims made in a prestigious journal, so that early replications tend to be biased against the initial findings. this 'proteus phenomenon' has been observed as an early sequence of extreme, opposite results in retrospective hypothesis-generating molecular genetic research , although there is often a fine line to be drawn between contrarianism, wilful misrepresentation and the scepticism ('nullius in verba') that is the hallmark of good science. such lack of reproducibility can be troubling. an investigation of medical studies undertaken between - -with more than citations in total -found that % were contradicted by subsequent studies, % found stronger effects than subsequent studies, % were replicated, and % remained largely unchallenged. in psychological science, a large portion of independent experimental replications did not reproduce evidence supporting the original results despite using high-powered designs and original materials when available. even worse performance is found in cognitive neuroscience. scientists more widely are routinely confronted with issues of reproducibility: a may survey in nature of more than scientists reported that more than % had tried and failed to reproduce another scientist's experiments, and more than half had failed to reproduce their own experiments. this lack of reproducibility can be devastating for the credibility of a field. computers are critical in all fields of data analysis and computer simulations need to be reliable -validated, verified, and their uncertainty quantified -so that they can feed into real world applications and decisions be they governmental policies dealing with pandemics, for the global climate emergency, the provision of food and shelter for refugee populations fleeing conflicts, creation of new materials, the design of the first commercial fusion reactor, or to assist doctors to test medication on a virtual patient before a real one. reproducibility in computer simulations would seem trivial to the uninitiated: enter the same data into the same program on the same architecture and you should get the same results. in practice, however, there are many barriers to overcome to ensure the fidelity of a model in a computational environment. overall, it can be challenging if not impossible to test the claims and arguments made by authors in published work without access to the original code and data, and even in some instances the machines the software ran on. one study of what the authors dubbed 'weak repeatability' examined papers with results backed by code and found that, for one third, they were able to obtain the code and build it within half an hour, while for just under half they succeeded with significant extra effort. for the remainder, it was not possible to verify the published findings. the authors reported that some researchers are reluctant to share their source code, for instance for commercial and licensing reasons, or because of dependencies on other software, whether due to external libraries or compilers, or because the version they used in their paper had been superseded, or had been lost due to lack of backup. many detailed choices in the design and implementation of a simulation never make it into published papers. 
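the ioannidis positive-predictive-value argument above can be made concrete with the standard formula (in the simplest case, ignoring bias and multiple competing teams), ppv = (1 − β)R / (R − βR + α), where R is the pre-study odds that a probed relationship is true, α the significance threshold and β the type-ii error rate; the numbers below are our own illustrative choices:

```python
def ppv(prestudy_odds, alpha=0.05, power=0.8):
    """Probability that a 'statistically significant' finding is actually true."""
    r, beta = prestudy_odds, 1.0 - power
    return (1.0 - beta) * r / (r - beta * r + alpha)

# a well-powered study in a speculative field (1 true relationship per 20 probed)
print(round(ppv(prestudy_odds=0.05), 2))   # ~0.44: most 'positive' findings would still be wrong
# the same study in a field with strong prior theory
print(round(ppv(prestudy_odds=1.0), 2))    # ~0.94
```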
frequently, the principal code developer has moved on, the code turns out to depend on exotic hardware, there is inadequate documentation, and/or the code developers say that they are too busy to help. there are some high-profiles examples of these issues, from disclosure of climate codes and data, to delays in sharing codes for covid- pandemic modelling. if the public are to have confidence in computing models that could directly affect them, transparency, openness and the timely release of code and data are critical. in response to this challenge, there have been various proposals to allow scientists to openly share code and data that underlie their research publications: runmycode [runmycode.org] and, perhaps better known, github [github.com]; share, a web portal to create, share, and access remote virtual machines that can be cited from research papers to make an article fully reproducible and interactive; papermâché, another means to view and interact with a paper using virtual machines; various means to create 'executable papers' , ; and a verifiable result identifier (vri), which consists of trusted and automatically generated strings that point to publicly available results originally created by the computational process. in addition to external verification, there are many initiatives to incorporate verification and validation into computer model development, along with uncertainty quantification techniques to verify and validate the models. in the united states, for example, the american society of mechanical engineers has a standards committee for the development of verification and validation v&v procedures for computational solid mechanics models, guidelines and recommended practices have been developed by the national aeronautics and space administration (nasa); the us defense nuclear facilities safety board backs model v&v for all safety-related nuclear facility design, analyses, and operations, while various groups within the doe laboratories (including sandia, los alamos, and lawrence livermore) are conducting research in this area. in europe, the vecma (verified exascale computing for multiscale applications) project is developing software tools that can be applied to many research domains, from the laptop to the emerging generation of exascale supercomputers, in order to validate, verify, and quantify the uncertainty within highly diverse applications. the major challenge faced by the state of the art is that many scientific models are multiphysics in nature, combining two or more kinds of physics, for instance to simulate the behaviour of plasmas in tokamak nuclear fusion reactors , electromechanical systems or in food processing . even more common, and more challenging, many models are also multiscale, which require the successful convergence of various theories that operate at different temporal and/or spatial scales. they are widespread at the interface between various fields, notably physics, chemistry and biology. the ability to integrate macroscopic universality and molecular individualism is perhaps the greatest challenge of multiscale modelling . as one example, we certainly need multiscale models if we are to predict the biology, medicine that underpin the behaviour of an individual person. digital medicine is increasingly important and, as a corollary of this, there have been calls for steps to avoid a reproducibility "crisis" of the kind that has engulfed other areas of biomedicine. 
although there are many kinds of multiscale modelling, there now exist protocols to enable the verification, validation, and uncertainty quantification of multiscale models. the vecma toolkit , which is not only open source but whose development is also performed openly, has many components: fabsim , to organise and perform complex remote tasks; easyvvuq, a python library designed to facilitate verification, validation and uncertainty quantification for a variety of simulations , ; qcg pilot job, to provide the efficient and reliable execution of large number of computational jobs; qcg-now, to prepare and run computational jobs on high performance computing machines; qcg-client, to provide support for a variety of computing jobs, from simple ones to complex distributed workflows; easyvvuq-qcgpilotjob, for efficient, parallel execution of demanding easyvvuq scenarios on high performance machines; and muscle , to make creating coupled multiscale simulations easier, and to then enable efficient uncertainty quantification of such models. the vecma toolkit is already being applied in several circumstances: climate modelling, where multiscale simulations of the atmosphere and oceans are required; forecasting refugee movements away from conflicts, or as a result of climate change, to help prioritise resources and investigate the effects of border closures and other policy decisions ; for exploring the mechanical properties of a simulated material at several length and time scales with verified multiscale simulations; and multiscale simulations to understand the mechanisms of heat and particle transport in fusion devices, which is important because the transport plays a key role in determining the size, shape and more detailed design and operating conditions of a future fusion power reactor, and hence the possibility of extracting almost limitless energy; and verified simulations to aid in the decision-making of drug prescriptions, simulating how drugs interact with a virtual version of a patient's proteins, or how stents will behave when placed in virtual versions of arteries. recent years have seen an explosive growth in digital data accompanied by the rising public awareness that their lives depend on "algorithms", though it is plain to all that any computer code is based on an algorithm, without which it will not run. under the banner of artificial intelligence and machine learning, many of these algorithms seek patterns in those data. some -emphatically not the authors of this paper -even claim that this approach will be faster and more revealing than modelling the underlying behaviour notably by the use of conventional theory, modelling and simulation. this approach is particularly attractive in disciplines traditionally not deemed suitable for mathematical treatment because they are so complex, notably life and social sciences, along with the humanities. however, to build a machine-learning system, you have to decide what data you are going to choose to populate it. that choice is frequently made without any attempt to first try to understand the structural characteristics that underlie the system of interest, with the result that the "ai system" produced strongly reflects the limitations or biases (be they implicit or explicit) of its creators. 
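in its simplest non-intrusive form, the uncertainty-quantification pattern that toolkits like the one above support reduces to: draw samples of the uncertain inputs, run an ensemble of otherwise identical simulations, and summarise the output spread. the sketch below illustrates that pattern on a throwaway toy model; it deliberately does not use the easyvvuq api itself, and every name and number in it is our own illustrative choice:

```python
import numpy as np

def toy_model(params, rng):
    """Stand-in for an expensive simulation: returns one scalar quantity of interest."""
    growth, noise = params
    x = 1.0
    for _ in range(50):
        x += growth * x * 0.01 + rng.normal(0.0, noise)   # aleatoric noise inside the model
    return x

rng = np.random.default_rng(1)
n_samples = 500
# epistemic uncertainty: the inputs themselves are only known as distributions
growth_samples = rng.normal(0.5, 0.05, n_samples)
noise_samples = rng.uniform(0.0, 0.02, n_samples)

outputs = np.array([toy_model((g, s), rng) for g, s in zip(growth_samples, noise_samples)])
mean, std = outputs.mean(), outputs.std(ddof=1)
lo, hi = np.percentile(outputs, [2.5, 97.5])
print(f"QoI = {mean:.3f} +/- {std:.3f} (95% interval [{lo:.3f}, {hi:.3f}])")
```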
moreover, there are four fundamental issues with big data that are frequently not recognised by practitioners : complex systems are strongly correlated, so they do not generally obey gaussian statistics; no datasets are large enough for systems with strong sensitivity to rounding or inaccuracies; correlation does not imply causality; and too much data can be as bad as no data: although computers can be trained on larger datasets than the human brain can absorb, there are fundamental limitations to the power of such datasets (as one very real example, mapping genotype to phenotype is far from straightforward), not least due to their digital character. all machine-learning algorithms are initialised using (pseudo) random number generators and have to be run vast numbers of times to ensure that their statistical predictions are robust. however, they typically make plenty of other assumptions, such as smoothness (i.e. continuity) between data points. the problem is that nonlinear systems are often anything but smooth, and there can be jumps, discontinuities and singularities. not only the smoothness of behaviour but also the forms of distribution of data regularly assumed by machine learners are frequently unknown or untrue in complex systems. indeed, many such approaches are distribution free, in the sense that there is no knowledge provided about the way the data being used is distributed in a statistical sense. often, a gaussian ("normal") distribution is assumed by default; while this distribution plays an undeniable role across all walks of science it is far from universal. indeed, it fails to describe most phenomena where complexity holds sway because, rather than resting on randomness, these typically have feedback loops, interactions and correlations. machine learning is often used to seek correlations in data. but in a real-world system, for instance in a living cell that is a cauldron of activity of million protein molecules , can we be confident that we have captured the right data? random data dredging for complex problems is doomed to fail where one has no idea which variables are important. in these cases, data dredging will always be defeated by the curse of dimensionality -there will simply be far too much data needed to fill in the hyperdimensional space for blind machine learning to produce correlations to any degree of confidence. on top of that, as mentioned earlier, the ratio of false to true correlations soars with the size of the dataset, so that too much data can be worse than no data at all. there are practical considerations too. machine-learning systems can never be better than the data they are trained on, which can contain biases 'whether morally neutral as toward insects or flowers, problematic as toward race or gender, or even simply veridical, reflecting the status quo distribution of gender with respect to careers or first names'. in healthcare systems, for example, where commercial prediction algorithms are used to identify and help patients with complex health needs, significant racial bias has been found. machine learning systems are black boxes, even to the researchers that build them, making it hard for their creators, let alone others, to assess the results produced by these glorified curve-fitting systems. precise replication would be nearly impossible given the natural randomness in neural networks and variations in hardware and code. 
that is one reason why blind machine learning is unlikely to ever be accepted by regulatory authorities in medical practice as a basis for offering drugs to patients. to comply with the regulatory authorities such as the us food and drug administration and the european medicines agency, the predictions of a ml algorithm are not enough and it is essential that an underlying mechanistic explanation is also provided, one which can explain not only when a drug works but also when it fails, and/or produces side effects. there are even deeper problems of principle in seeking to produce reliable predictions about the behaviour of complex systems of the sort one encounters frequently in the most pressing problems of twenty-first century science. we are thinking particularly, in life sciences, medicine, healthcare and environmental sciences, where systems typically involve large numbers of variables and many parameters. the question is how to select these variables and parameters to best fit the data. despite the constant refrain that we live in the age of "big data", the data we have available is never enough to model problems of this degree of complexity. unlike more traditional reductionist models, where one may reasonably assume one has sufficient data to estimate a small number of parameters, such as a drug interacting with a nerve cell receptor, this ceases to be the case in complex and emergent systems, such as modelling a nerve cell itself. the favourite approach of the moment is of course to select machine learning, which involves adjustments of large numbers of parameters inside the neural network "models" used; these can be tuned to fit the data available but have little to no predictability beyond the range of the data used because they do not take into account the structural characteristics of the phenomenon under study. this is a form of overfitting. as a result of the uncertainty in all these parameters, the model itself becomes uncertain as testing it involves an assessment of probability distributions over the parameters and, with nowhere near adequate data available, it is not clear if it can be validated in a meaningful manner. for some related issues of a more speculative and philosophical nature in the study of complexity, see succi ( ) . compounding all this, there is a fundamental problem that undermines our faith in simulations which arises from the digital nature of modern computers, whether classical or quantum. digital computers make use of four billion rational numbers that range from plus to minus infinity, the so-called 'single-precision ieee floating-point numbers', which refers to a technical standard for floating-point arithmetic established by the institute of electrical and electronics engineers in the s; they also frequently use double precision floating-point numbers, while half-precision has become commonplace of late in the running of machine learning algorithms. however, digital computers only use a very small subset of the rational numbers -so-called dyadic numbers, whose denominators are powers of because of the binary system underlying all digital computers -and the way these numbers are distributed is highly nonuniform. moreover, there are infinitely more irrational than rational numbers, which are ignored by all digital computers because to store any one of them, typically, one would require an infinite memory. 
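the restriction to dyadic rationals has directly observable consequences: under the bernoulli (doubling) map x ↦ 2x mod 1, which is taken up next, every ieee floating-point trajectory collapses onto the fixed point 0 within a few dozen iterations (set by the mantissa length), whereas exact rational arithmetic retains the true periodic orbit. a minimal demonstration (the starting value is arbitrary):

```python
from fractions import Fraction

x_float = 0.1                 # IEEE double: actually a nearby dyadic rational
x_exact = Fraction(1, 10)     # exact rational arithmetic

for i in range(1, 61):
    x_float = (2 * x_float) % 1.0
    x_exact = (2 * x_exact) % 1
    if i % 10 == 0:
        print(i, x_float, float(x_exact))
# after ~53 doublings the float trajectory is exactly 0 forever,
# while the exact orbit of 1/10 keeps cycling, as the true dynamics demands
```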
leaving aside the mistaken belief held by some that a very few repeats of, say, a molecular dynamics simulation is any replacement for (quasi) monte carlo methods based on ensembles of replicas, these findings strongly suggest that the digital simulation of all chaotic systems, found in models used to predict weather, climate, molecular dynamics, chemical reactions, fusion energy and much more, contain sizeable errors of a nature that hitherto have been unknown to most scientists. by the same token, the use of data from these chaotic simulations to train machine learning algorithms will in turn produce artefacts, making them unreliable. this shortcoming produced generic errors of up to per cent in the case of the bernoulli map along with pure nonsense on rare occasions. one might ask why, if the consequences can be so substantial, these errors have not been noticed. the difficulty is that for real world simulations in turbulence and molecular dynamics, for example, there are no exact, closed form mathematical solutions for comparison so the numerical solutions that roll off the computer are simply assumed to be correct. given the approximations involved in such models, not to speak of the various sources of measurement errors, it is never possible to obtain exact agreement with experimental results. in short, the use of floating point numbers instead of real numbers contributes additional systematic errors in numerical schemes that have not so far been assessed at all . for modelling, we need to tackle both epistemic and aleatoric sources of error. to deal with these challenges, a number of association for the advance of artificial intelligence, aaai, conferences ) found that only % of the presenters shared the algorithm's code. the most commonly-used machine learning platforms provided by big tech companies have poor support for reproducibility. studies have shown that even if the results of a deep learning model could be reproduced, a slightly different experiment would not support the findings-yet another example of overfitting-which is common in machine learning research. in other words, unreproducible findings can be built upon supposedly reproducible methods. rather than continuing to simply fund, pursue and promote 'blind' big data projects, more resources should be allocated to the elucidation of the multiphysics, multiscale and stochastic processes controlling the behaviour of complex systems, such as those in biology, medicine, healthcare and environmental science. finding robust predictive mechanistic models that provide explanatory insights will be of particular value for machine learning when dealing with sparse and incomplete sets of data, ill-posed problems, exploring vast design spaces to seek correlations and then, most importantly, for identifying correlations. where machine learning provides a correlation, multiscale modelling can test if this correlation is causal. there are also demands in some fields for a reproducibility checklist, to make ai reproducibility more practical, reliable and effective. another suggestion is the use of so-called "model cards" -documentation that accompanies trained machine learning models which outline the application domains, the context in which they are being used and their carefully benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, and phenotypic groups; and proposals for best practice in reporting experimental results which permit for robust comparison. 
despite the caveat that computers are made and used by people, there is also considerable interest in their use to design and run experiments, for instance using bayesian optimization methods, such as in the field of cognitive neuroscience and to model infectious diseases and immunology quantitatively. when it comes to the limitations of digital computing, research is under way by boghosian and pvc to find alternative approaches that might render such problems computable on digital computers. among possible solutions, one that seems guaranteed to succeed is analogue computing, an older idea, able to handle the numerical continuum of reality in a way that digital computers can only approximate. in the short term, notably in the biosciences, better data collection, curation, validation, verification and uncertainty quantification procedures of the kind described here, will make computer simulations more reproducible, while machine learning will benefit from a more rigorous and transparent approach. the field of big data and machine learning has become extremely influential but without big theory it remains dogged by a lack of firm theoretical underpinning ensuring its results are reliable. indeed, we have argued that in the modern era in which we aspire to describe really complex systems, involving many variables and vast numbers of parameters, there is not sufficient data to apply these methods reliably. our models are likely to remain uncertain in many respects, as it is so difficult to validate them. in the medium term, ai methods may, if carefully produced, improve the design, objectivity and analysis of experiments. however, this will always require the participation of people to devise the underlying hypotheses and, as a result, it is important to ensure that they fully grasp the assumptions on which these algorithms are based and are also open about these assumptions. it is already becoming increasingly clear that 'artificial intelligence' is a digital approximation to reality. moreover, in the long term, when we are firmly in the era of routine exascale and perhaps eventually also quantum computation, we will have to grapple with a more fundamental issue. even though there are those who believe the complexity of the universe can be understood in terms of simple programs rather than by means of concise mathematical equations, , digital computers are limited in the extent to which they can capture the richness of the real world. , freeman dyson, for example, speculated that for this reason the downloading of a human consciousness into a digital computer would involve 'a certain loss of our finer feelings and qualities' . in the quantum and exascale computing eras, we will need renewed emphasis on the analogue world and analogue computational methods if we are to trust our computers. the turing way" -a handbook for reproducible data science the need for open source software in machine learning as concerns about non-reproducible data mount, some solutions take shape addressing scientific fraud. 
science ( -) bad pharma: how drug companies mislead doctors and harm patients betrayers of the truth: fraud and deceit in the halls of science quantification of uncertainty in computational fluid dynamics assessing the reliability of complex models: mathematical and statistical foundations of verification, validation, and uncertainty quantification editorial: special issue on verification, validation, and uncertainty quantification of cardiovascular models: towards effective vvuq for translating cardiovascular modelling to clinical utility how scientists fool themselves -and how they can stop low statistical power in biomedical science: a review of three human research domains empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature a survey of the statistical power of research in behavioral ecology and animal behavior evaluating the quality of empirical journals with respect to sample size and statistical power. ouzounis ca power failure: why small sample size undermines the reliability of neuroscience the evolution and psychology of self-deception variability in the analysis of a single neuroimaging dataset by many teams the end of theory: the data deluge makes the scientific method obsolete the scientific method in the science of machine learning big data need big theory too the matthew effect in science: the reward and communication systems of science are considered. science ( -) the matthew effect in science funding muddled meanings hamper efforts to fix reproducibility crisis reproducible research: a minority opinion why most published research findings are false early extreme contradictory estimates may appear in published research: the proteus phenomenon in molecular genetics research and randomized trials why selective publication of statistically significant results can be effective contradicted and initially stronger effects in highly cited clinical research estimating the reproducibility of psychological science. science ( -) , scientists lift the lid on reproducibility reproducibility in scientific computing published online : - . . 
the disclosure of climate data from the climatic github -imperial college london -code for modelling estimated deaths and cases for covid share: a web portal for creating and sharing executable research papers paper mâché: creating dynamic reproducible science a provenance-based infrastructure to support the life cycle of executable papers a universal identifier for computational results a comprehensive framework for verification, validation, and uncertainty quantification in scientific computing verification, validation and uncertainty quantification (vvuq) concepts of model verification and validation vecma verified exascale computing for multiscale applications multidimensional multiphysics simulation of nuclear fuel behavior multiphysics simulation innovative food processing technologies: advances in bridging the gaps at the physics-chemistry-biology interface the reproducibility crisis in the age of digital medicine multiscale modelling: approaches and challenges building confidence in simulation: applications of easyvvuq a library for verification, validation and uncertainty quantification in high performance computing a generalized simulation development approach for predicting refugee destinations ensemble-based steered molecular dynamics predicts relative residence time of a a receptor binders semi-intrusive multiscale metamodelling uncertainty quantification with application to a model of in-stent restenosis big data: the end of the scientific method unification of protein abundance datasets yields a quantitative saccharomyces cerevisiae proteome semantics derived automatically from language corpora contain human-like biases dissecting racial bias in an algorithm used to manage the health of populations. science ( -) weapons of math destruction. crown random house the problem of overfitting the evolution of scientific knowledge: from certainty to uncertainty a new pathology in the simulation of chaotic dynamical systems on digital computers of naturalness and complexity metascience could rescue the 'replication crisis the unity of knowledge psychologists strike a blow for reproducibility increasing transparency through a multiverse analysis the science of team science brazilian biomedical science faces reproducibility test policy: nih plans to enhance reproducibility the academy of medical sciences. reproducibility and reliability of biomedical research: improving research practice state of the art: reproducibility in artificial intelligence missing data hinder replication of artificial intelligence studies. science ( -) out-of-the-box reproducibility: a survey of machine learning platforms unreproducible research is reproducible the machine learning reproducibility checklist (version . ) model cards for model reporting show your work: improved reporting of experimental results neuroadaptive bayesian optimization and hypothesis testing host genotype and time dependent antigen presentation of viral peptides: predictions from theory. sci rep from digital hype to analogue reality: universal simulation beyond the quantum and exascale eras the wolfram physics project: a project to find the fundamental theory of physics a new kind of science. wolfram media a class of models with the potential to represent fundamental physics the authors are grateful for many stimulating conversations with bruce boghosian, daan crommelin, ed dougherty, derek groen, alfons hoekstra, robin richardson & david wright. authors' contributions all authors contributed to the concept and writing of the article. 
the authors have no competing interests.funding statement p.v.c. is grateful for funding from the uk epsrc for the ukcomes uk high-end computing consortium (ep/r / ), from mrc for a medical bioinformatics grant (mr/l / ), the european commission for the compbiomed, compbiomed and vecma grants (numbers , and respectively) and special funding from the ucl provost. key: cord- -n hza vm authors: xu, jie; glicksberg, benjamin s.; su, chang; walker, peter; bian, jiang; wang, fei title: federated learning for healthcare informatics date: - - journal: j healthc inform res doi: . /s - - - sha: doc_id: cord_uid: n hza vm with the rapid development of computer software and hardware technologies, more and more healthcare data are becoming readily available from clinical institutions, patients, insurance companies, and pharmaceutical industries, among others. this access provides an unprecedented opportunity for data science technologies to derive data-driven insights and improve the quality of care delivery. healthcare data, however, are usually fragmented and private making it difficult to generate robust results across populations. for example, different hospitals own the electronic health records (ehr) of different patient populations and these records are difficult to share across hospitals because of their sensitive nature. this creates a big barrier for developing effective analytical approaches that are generalizable, which need diverse, “big data.” federated learning, a mechanism of training a shared global model with a central server while keeping all the sensitive data in local institutions where the data belong, provides great promise to connect the fragmented healthcare data sources with privacy-preservation. the goal of this survey is to provide a review for federated learning technologies, particularly within the biomedical space. in particular, we summarize the general solutions to the statistical challenges, system challenges, and privacy issues in federated learning, and point out the implications and potentials in healthcare. the recent years have witnessed a surge of interest related to healthcare data analytics, due to the fact that more and more such data are becoming readily available from various sources including clinical institutions, patient individuals, insurance companies, and pharmaceutical industries, among others. this provides an unprecedented opportunity for the development of computational techniques to dig data-driven insights for improving the quality of care delivery [ , ] . healthcare data are typically fragmented because of the complicated nature of the healthcare system and processes. for example, different hospitals may be able to access the clinical records of their own patient populations only. these records are highly sensitive with protected health information (phi) of individuals. rigorous regulations, such as the health insurance portability and accountability act (hipaa) [ ] , have been developed to regulate the process of accessing and analyzing such data. this creates a big challenge for modern data mining and machine learning (ml) technologies, such as deep learning [ ] , which typically requires a large amount of training data. federated learning is a paradigm with a recent surge in popularity as it holds great promise on learning with fragmented sensitive data. 
instead of aggregating data from different places all together, or relying on the traditional discovery then replication design, it enables training a shared global model with a central server while keeping the data in local institutions where the they originate. the term "federated learning" is not new. in , patrick hill, a philosophy professor, first developed the federated learning community (flc) to bring people together to jointly learn, which helped students overcome the anonymity and isolation in large research universities [ ] . subsequently, there were several efforts aiming at building federations of learning content and content repositories [ , , ] . in , rehak et al. [ ] developed a reference model describing how to establish an interoperable repository infrastructure by creating federations of repositories, where the metadata are collected from the contributing repositories into a central registry provided with a single point of discovery and access. the ultimate goal of this model is to enable learning from diverse content repositories. these practices in federated learning community or federated search service have provided effective references for the development of federated learning algorithms. federated learning holds great promises on healthcare data analytics. for both provider (e.g., building a model for predicting the hospital readmission risk with patient electronic health records (ehr) [ ] ) and consumer (patient)-based applications (e.g., screening atrial fibrillation with electrocardiograms captured by smartwatch [ ] ), the sensitive patient data can stay either in local institutions or with individual consumers without going out during the federated model learning process, which effectively protects the patient privacy. the goal of this paper is to review the setup of federated learning, discuss the general solutions and challenges, and envision its applications in healthcare. in this review, after a formal overview of federated learning, we summarize the main challenges and recent progress in this field. then we illustrate the potential of federated learning methods in healthcare by describing the successful recent research. at last, we discuss the main opportunities and open questions for future applications in healthcare. there has been a few review articles on federated learning recently. for example, yang et al. [ ] wrote the early federated learning survey summarizing the general privacy-preserving techniques that can be applied to federated learning. some researchers surveyed sub-problems of federated learning, e.g., personalization techniques [ ] , semi-supervised learning algorithms [ ] , threat models [ ] , and mobile edge networks [ ] . kairouz et al. [ ] discussed recent advances and presented an extensive collection of open problems and challenges. li et al. [ ] conducted the review on federated learning from a system viewpoint. different from those reviews, this paper provided the potential of federated learning to be applied in healthcare. we summarized the general solution to the challenges in federated learning scenario and surveyed a set of representative federated learning methods for healthcare. in the last part of this review, we outlined some directions or open questions in federated learning for healthcare. an early version of this paper is available on arxiv [ ] . 
federated learning is a problem of training a high-quality shared global model with a central server from decentralized data scattered among a large number of different clients (fig. ). mathematically, assume there are k activated clients where the data reside (a client could be a mobile phone, a wearable device, or a clinical institution data warehouse, etc.). let d_k denote the data distribution associated with client k and n_k the number of samples available from that client, so that n = n_1 + ... + n_k is the total sample size. (figure caption: schematic of the federated learning framework. the model is trained in a distributed manner: the institutions periodically communicate the local updates with a central server to learn a global model; the central server aggregates the updates and sends back the parameters of the updated global model.) the federated learning problem boils down to solving an empirical risk minimization problem of the form [ , , ]: min_w f(w), with f(w) = \sum_k (n_k / n) f_k(w) and f_k(w) = (1 / n_k) \sum_{i \in d_k} f_i(w), where w is the model parameter to be learned. the function f_i is specified via a loss function dependent on an input-output data pair {x_i, y_i}. typically, x_i ∈ r^d and y_i ∈ r or y_i ∈ {−1, +1}; simple examples include the squared loss used for linear regression and the hinge or logistic loss used for linear classification. in particular, algorithms for federated learning face a number of challenges [ , ], specifically: -statistical challenge: the data distributions of the clients differ greatly, i.e., for k ≠ k' one has e_{d_k}[f_i(w)] ≠ e_{d_{k'}}[f_i(w)], so that any data points available locally are far from being a representative sample of the overall distribution. -communication efficiency: the number of clients k is large and can be much bigger than the average number of training samples stored per activated client, i.e., k ≫ n/k. -privacy and security: additional privacy protections are needed for unreliable participating clients. it is impossible to ensure all clients are equally reliable. next, we will survey, in detail, the existing federated learning related works on handling such challenges. the naive way to solve the federated learning problem is through federated averaging (fedavg) [ ]. it has been demonstrated to work with certain non independent identically distributed (non-iid) data by requiring all the clients to share the same model. however, fedavg does not address the statistical challenge of strongly skewed data distributions: the performance of convolutional neural networks trained with the fedavg algorithm can degrade significantly due to weight divergence [ ]. existing research on dealing with the statistical challenge of federated learning can be grouped into two fields, i.e., consensus solutions and pluralistic solutions. most centralized models are trained on the aggregated training samples drawn from the local clients [ , ]. intrinsically, the centralized model is trained to minimize the loss with respect to the uniform mixture distribution [ ] \bar{d} = \sum_k (n_k / n) d_k, where \bar{d} is the target data distribution for the learning model. however, this specific uniform mixture is not an adequate target in most scenarios. to address this issue, recently proposed solutions either model the target distribution explicitly or force the data to adapt to the uniform mixture [ , ]. specifically, mohri et al. [ ] proposed a minimax optimization scheme, i.e., agnostic federated learning (afl), where the centralized model is optimized for any possible target distribution formed by a mixture of the client distributions. this method has only been applied at small scales.
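before turning to these refinements, the basic fedavg aggregation described above is easy to write down explicitly. the following toy r sketch trains a linear model with squared loss on two synthetic clients and lets the server average the locally updated weights with weights n_k/n; all names are illustrative, and real systems subsample clients per round and add the communication layer omitted here.

# toy federated averaging (fedavg) on a linear model with squared loss;
# each client holds (X, y), the server keeps the global weights w
local_update = function(w, X, y, lr = 0.1, epochs = 5) {
  for (e in seq_len(epochs)) {
    grad = t(X) %*% (X %*% w - y) / nrow(X)   # gradient of 0.5 * mean((Xw - y)^2)
    w = w - lr * grad
  }
  w
}

fedavg = function(clients, d, rounds = 50) {
  w = matrix(0, d, 1)                          # global model
  n = sum(sapply(clients, function(cl) nrow(cl$X)))
  for (r in seq_len(rounds)) {
    local_w = lapply(clients, function(cl) local_update(w, cl$X, cl$y))
    # server step: weighted average of the local models, weights n_k / n
    w = Reduce(`+`, Map(function(wk, cl) (nrow(cl$X) / n) * wk, local_w, clients))
  }
  w
}

# two synthetic clients with different input distributions (a non-iid flavour)
set.seed(1)
make_client = function(n, shift) {
  X = matrix(rnorm(n * 3, mean = shift), n, 3)
  list(X = X, y = X %*% c(1, -2, 0.5) + rnorm(n, sd = 0.1))
}
clients = list(make_client(200, 0), make_client(50, 2))
round(fedavg(clients, d = 3), 2)               # should come out close to c(1, -2, 0.5)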
compared to afl, li et al. [ ] proposed q-fair federated learning (q-ffl), assigning higher weights to devices with poor performance, so that the variance of the distribution of accuracy across the network is reduced. they empirically demonstrate the improved flexibility and scalability of q-ffl compared to afl. another commonly used method is to globally share a small portion of data between all the clients [ , ]; the shared subset is required to contain a uniform distribution over classes and is sent from the central server to the clients. in addition to handling the non-iid issue, sharing a small portion of trusted instances and noise patterns can guide the local agents to select a compact training subset, while the clients learn to add changes to the selected data samples, in order to improve the test performance of the global model [ ]. generally, it is difficult to find a consensus solution w that is good for all components d_k. instead of wastefully insisting on a consensus solution, many researchers choose to embrace this heterogeneity. multi-task learning (mtl) is a natural way to deal with data drawn from different distributions: it directly captures relationships among non-iid and unbalanced data by leveraging the relatedness between them, in comparison to learning a single global model. in order to do this, it is necessary to target a particular way in which tasks are related, e.g., shared sparsity, shared low-rank structure, or graph-based relatedness. recently, smith et al. [ ] empirically demonstrated this point on real-world federated datasets and proposed a novel method, mocha, to solve a general convex mtl problem while handling the system challenges at the same time. later, corinzia et al. [ ] introduced virtual, an algorithm for federated multi-task learning with non-convex models. they consider the federation of central server and clients as a bayesian network and perform training using approximate variational inference. this work bridges the frameworks of federated and transfer/continual learning. the success of multi-task learning rests on whether the chosen relatedness assumptions hold. compared to this, pluralism can be a critical tool for dealing with heterogeneous data without any additional or even low-order terms that depend on the relatedness as in mtl [ ]. eichner et al. [ ] considered training in the presence of block-cyclic data and showed that a remarkably simple pluralistic approach can entirely resolve the source of data heterogeneity. when the component distributions are actually different, pluralism can outperform the "ideal" iid baseline. in the federated learning setting, training data remain distributed over a large number of clients, each with an unreliable and relatively slow network connection. naively, for a synchronous protocol in federated learning [ , ], the total number of bits required during uplink (clients → server) and downlink (server → clients) communication by each of the k clients during training is given by b^{up/down} ∈ o( u × |w| × ( h(Δw^{up/down}) + β ) ), where u is the total number of updates performed by each client, |w| is the size of the model, h(Δw^{up/down}) is the entropy of the weight updates exchanged during the transmission process, and β is the difference between the true update size and the minimal update size (which is given by the entropy) [ ]. apparently, we can consider three ways to reduce the communication cost: (a) reduce the number of clients k, (b) reduce the update size, (c) reduce the number of updates u.
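as a concrete illustration of option (b), reducing the update size, the toy r sketch below combines two ingredients that recur in the compression literature surveyed next, top-k sparsification and coarse uniform quantization of the weight update. it is a deliberately simplified cartoon, not a reproduction of any of the cited schemes, and the payload count at the end ignores entropy coding.

# toy compression of a weight update delta_w before it is sent to the server
compress_update = function(delta_w, k = 10, n_levels = 16) {
  # 1) top-k sparsification: keep only the k largest-magnitude entries
  keep = order(abs(delta_w), decreasing = TRUE)[seq_len(k)]
  sparse = numeric(length(delta_w))
  sparse[keep] = delta_w[keep]
  # 2) uniform quantization of the kept values onto n_levels levels
  rng = range(sparse[keep])
  step = diff(rng) / (n_levels - 1)
  sparse[keep] = rng[1] + round((sparse[keep] - rng[1]) / step) * step
  list(idx = keep, val = sparse[keep])          # what would actually be transmitted
}

set.seed(2)
delta_w = rnorm(1000)                            # pretend this is a model update
msg = compress_update(delta_w)
# rough payload comparison, counting transmitted numbers only
c(dense = length(delta_w), compressed = length(msg$idx) + length(msg$val))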
starting at these three points, we can organize existing research on communication-efficient federated learning into four groups, i.e., model compression, client selection, updates reducing, and peer-to-peer learning (fig. ). the most natural and rough way for reducing communication cost is to restrict the participated clients or choose a fraction of parameters to be updated at each round. shokri et al. [ ] use the selective stochastic gradient descent protocol, where the selection can be completely random or only the parameters whose current values are farther away from their local optima are selected, i.e., those that have a larger gradient. nishio et al. [ ] proposed a new protocol referred to as fedcs, where the central server manages the resources of heterogeneous clients and determines which clients should participate the current training task by analyzing the resource information of each client, such as wireless channel states, computational capacities, and the size of data resources relevant to the current task. here, the server should decide how much data, energy, and cpu resources used by the mobile devices such that the energy consumption, training latency, and bandwidth cost are minimized while meeting requirements of the training tasks. anh [ ] thus proposes to use the deep q-learning [ ] technique that enables the server to find the optimal data and energy management for the mobile devices participating in the mobile crowdmachine learning through federated learning without any prior knowledge of network dynamics. the goal of model compression is to compress the server-to-client exchanges to reduce uplink/downlink communication cost. the first way is through structured updates, where the update is directly learned from a restricted space parameterized using a smaller number of variables, e.g., sparse, low-rank [ ] , or more specifically, pruning the least useful connections in a network [ , ] , weight quantization [ , ] , and model distillation [ ] . the second way is lossy compression, where a full model update is first learned and then compressed using a combination of quantization, random rotations, and subsampling before sending it to the server [ , ] . then the server decodes the updates before doing the aggregation. federated dropout, in which each client, instead of locally training an update to the whole global model, trains an update to a smaller sub-model [ ] . these submodels are subsets of the global model and, as such, the computed local updates have a natural interpretation as updates to the larger global model. federated dropout not only reduces the downlink communication but also reduces the size of uplink updates. moreover, the local computational costs is correspondingly reduced since the local training procedure dealing with parameters with smaller dimensions. kamp et al. [ ] proposed to average models dynamically depending on the utility of the communication, which leads to a reduction of communication by an order of magnitude compared to periodically communicating state-of-the-art approaches. this facet is well suited for massively distributed systems with limited communication infrastructure. bui et al. [ ] improved federated learning for bayesian neural networks using partitioned variational inference, where the client can decide to upload the parameters back to the central server after multiple passes through its data, after one local epoch, or after just one mini-batch. guha et al. 
[ ] focused on techniques for one-shot federated learning, in which they learn a global model from data in the network using only a single round of communication between the devices and the central server. (figure caption: a) secure multi-party computation: during the computation, no computation node is able to recover the original value nor learn anything about the output; any nodes can combine their shares to reconstruct the original value. b) differential privacy: it guarantees that anyone seeing the result of a differentially private analysis will make the same inference, the answers being nearly indistinguishable.) besides the above works, ren et al. [ ] theoretically analyzed the detailed expression of the learning efficiency in the cpu scenario and formulated a training acceleration problem under both communication and learning resource budgets. reinforcement learning and round-robin learning are widely used to manage the communication and computation resources [ , , , ]. in federated learning, a central server is required to coordinate the training process of the global model. however, the communication cost to the central server may not be affordable since a large number of clients are usually involved. also, many practical peer-to-peer networks are usually dynamic, and it is not possible to regularly access a fixed central server. moreover, because of the dependence on a central server, all clients are required to agree on one trusted central body, whose failure would interrupt the training process for all clients. therefore, some researchers began to study fully decentralized frameworks where the central server is not required [ , , , ]. the local clients are distributed over a graph/network where they only communicate with their one-hop neighbors: each client updates its local belief based on its own data and then aggregates information from the one-hop neighbors. in federated learning, we usually assume the number of participating clients (e.g., phones, cars, clinical institutions...) is large, potentially in the thousands or millions. it is impossible to ensure none of the clients is malicious. the setting of federated learning, where the model is trained locally without revealing the input data or the model's output to any clients, prevents direct leakage while training or using the model. however, the clients may infer some information about another client's private dataset given the execution of f(w), or over the shared predictive model w [ ]. to this end, there have been many efforts focusing on privacy, either from an individual point of view or from a multiparty view, especially in the social media field, which has significantly exacerbated multiparty privacy (mp) conflicts [ , ] (fig. ). secure multi-party computation (smc) has a natural application to federated learning scenarios, where each individual client uses a combination of cryptographic techniques and oblivious transfer to jointly compute a function of their private data [ , ]. homomorphic encryption is a public key system, where any party can encrypt its data with a known public key and perform calculations with data encrypted by others with the same public key [ ]. due to its success in cloud computing, it comes naturally into this realm, and it has certainly been used in many federated learning studies [ , ].
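the additive secret-sharing idea underlying smc-style secure aggregation can be illustrated with a few lines of r: each client splits its private update into random shares that sum to the true value, so any single computation node only sees noise-like numbers, while the sum over all nodes still reconstructs the aggregate. this toy sketch has none of the real cryptographic machinery (oblivious transfer, key agreement, dropout handling) used in the protocols cited above.

# additive secret sharing of scalar updates among n_parties computation nodes:
# each client splits its value x into shares that sum to x; any single share
# looks like random noise on its own.
make_shares = function(x, n_parties) {
  r = rnorm(n_parties - 1, sd = 100)            # random masks
  c(r, x - sum(r))                              # shares sum exactly to x
}

client_updates = c(0.7, -1.2, 2.5)              # private local updates
shares = sapply(client_updates, make_shares, n_parties = 3)   # one column per client

# each computation node sees only one share per client and reports their sum
node_sums = rowSums(shares)
# combining the node sums yields the aggregate, never the individual updates
sum(node_sums)        # equals sum(client_updates) = 2.0 up to floating point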
although smc guarantees that none of the parties shares anything with each other or with any third party, it can not prevent an adversary from learning some individual information, e.g., which clients' absence might change the decision boundary of a classifier, etc. moreover, smc protocols are usually computationally expensive even for the simplest problems, requiring iterated encryption/decryption and repeated communication between participants about some of the encrypted results [ ] . differential privacy (dp) [ ] is an alternative theoretical model for protecting the privacy of individual data, which has been widely applied to many areas, not only traditional algorithms, e.g., boosting [ ] , principal component analysis [ ] , support vector machine [ ] , but also deep learning research [ , ] . it ensures that the addition or removal does not substantially affect the outcome of any analysis and is thus also widely studied in federated learning research to prevent the indirect leakage [ , , ] . however, dp only protects users from data leakage to a certain extent and may reduce performance in prediction accuracy because it is a lossy method [ ] . thus, some researchers combine dp with smc to reduce the growth of noise injection as the number of parties increases without sacrificing privacy while preserving provable privacy guarantees, protecting against extraction attacks and collusion threats [ , ] . federated learning has been incorporated and utilized in many domains. this widespread adoption is due in part by the fact that it enables a collaborative modeling mechanism that allows for efficient ml all while ensuring data privacy and legal compliance between multiple parties or multiple computing nodes. some promising examples that highlight these capabilities are virtual keyboard prediction [ , ] , smart retail [ ] , finance [ ] , and vehicle-to-vehicle communication [ ] . in this section, we focus primarily on applications within the healthcare space and also discuss promising applications in other domains since some principles can be applied to healthcare. ehrs have emerged as a crucial source of real world healthcare data that has been used for an amalgamation of important biomedical research [ , ] , including for machine learning research [ ] . while providing a huge amount of patient data for analysis, ehrs contain systemic and random biases overall and specific to hospitals that limit the generalizability of results. for example, obermeyer et al. [ ] found that a commonly used algorithm to determine enrollment in specific health programs was biased against african americans, assigning the same level of risk to healthier caucasian patients. these improperly calibrated algorithms can arise due to a variety of reasons, such as differences in underlying access to care or low representation in training data. it is clear that one way to alleviate the risk for such biased algorithms is the ability to learn from ehr data that is more representative of the global population and which goes beyond a single hospital or site. unfortunately, due to a myriad of reasons such as discrepant data schemes and privacy concerns, it is unlikely that data will eve be connected together in a single database to learn from all at once. the creation and utility of standardized common data models, such as omop [ ] , allow for more wide-spread replication analyses but it does not overcome the limitations of joint data access. 
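before turning to these applications, here is a toy r sketch of the clip-and-noise recipe behind the dp-based approaches above: the local update is clipped in l2 norm to bound its sensitivity, then gaussian noise is added before it leaves the client. the noise scale used below is the textbook single-release gaussian-mechanism calibration sigma = clip * sqrt(2 log(1.25/delta)) / epsilon; composing many federated rounds requires proper privacy accounting, which is not shown, and all parameter values are illustrative.

# clip-and-noise a local update before sharing it (gaussian mechanism sketch)
privatize_update = function(delta_w, clip = 1.0, epsilon = 1.0, delta = 1e-5) {
  # clip the L2 norm of the update so its sensitivity is at most `clip`
  nrm = sqrt(sum(delta_w^2))
  if (nrm > clip) delta_w = delta_w * (clip / nrm)
  # gaussian noise calibrated to (epsilon, delta) for a single release
  sigma = clip * sqrt(2 * log(1.25 / delta)) / epsilon
  delta_w + rnorm(length(delta_w), sd = sigma)
}

set.seed(3)
true_update = rnorm(100, sd = 0.05)
noisy_update = privatize_update(true_update)
c(norm_true = sqrt(sum(true_update^2)), norm_noisy = sqrt(sum(noisy_update^2)))
# with these toy settings the noise dominates the signal, which is one way to
# see why dp is a lossy mechanism that can cost prediction accuracy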
as such, it is imperative that alternative strategies emerge for learning from multiple ehr data sources that go beyond the common discoveryreplication framework. federated learning might be the tool to enable large-scale representative ml of ehr data and we discuss many studies which demonstrate this fact below. federated learning is a viable method to connect ehr data from medical institutions, allowing them to share their experiences, and not their data, with a guarantee of privacy [ , , , , , ] . in these scenarios, the performance of ml model will be significantly improved by the iterative improvements of learning from large and diverse medical data sets. there have been some tasks were studied in federated learning setting in healthcare, e.g., patient similarity learning [ ] , patient representation learning, phenotyping [ , ] , and predictive modeling [ , , ] . specifically, lee et al. [ ] presented a privacy-preserving platform in a federated setting for patient similarity learning across institutions. their model can find similar patients from one hospital to another without sharing patient-level information. kim et al. [ ] used tensor factorization models to convert massive electronic health records into meaningful phenotypes for data analysis in federated learning setting. liu et al. [ ] conducted both patient representation learning and obesity comorbidity phenotyping in a federated manner and got good results. vepakomma et al. [ ] built several configurations upon a distributed deep learning method called splitnn [ ] to facilitate the health entities collaboratively training deep learning models without sharing sensitive raw data or model details. silva et al. [ ] illustrated their federated learning framework by investigating brain structural relationships across diseases and clinical cohorts. huang et al. [ ] sought to tackle the challenge of non-iid icu patient data by clustering patients into clinically meaningful communities that captured similar diagnoses and geological locations and simultaneously training one model per community. federated learning has also enabled predictive modeling based on diverse sources, which can provide clinicians with additional insights into the risks and benefits of treating patients earlier [ , , ] . brisimi et al. [ ] aimed to predict future hospitalizations for patients with heart-related diseases using ehr data spread among various data sources/agents by solving the l -regularized sparse support vector machine classifier in federated learning environment. owkin is using federated learning to predict patients' resistance to certain treatment and drugs, as well as their survival rates for certain diseases [ ] . boughorbel et al. [ ] proposed a federated uncertaintyaware learning algorithm for the prediction of preterm birth from distributed ehr, where the contribution of models with high uncertainty in the aggregation model is reduced. pfohl et al. [ ] considered the prediction of prolonged length of stay and in-hospital mortality across thirty-one hospitals in the eicu collaborative research database. sharma et al. [ ] tested a privacy preserving framework for the task of in-hospital mortality prediction among patients admitted to the intensive care unit (icu). their results show that training the model in the federated learning framework leads to comparable performance to the traditional centralized learning setting. summary of these work is listed in table . 
an important application of federated learning is for natural language processing (nlp) tasks. when google first proposed federated learning concept in , the application scenario is gboard-a virtual keyboard of google for touchscreen mobile devices with support for more than language varieties [ , ] . indeed, as users increasingly turn to mobile devices, fast mobile input methods with auto-correction, word completion, and next-word prediction features are becoming more and more important. for these nlp tasks, especially next-word prediction, typed text in mobile apps is usually better than the data from scanned books or speech-to-text in terms of aiding typing on a mobile keyboard. however, these language data often contain sensitive information, e.g., passwords, search queries, or text messages with personal information. therefore, federated learning has a promising application in nlp like virtual keyboard prediction [ , , ] . other applications include smart retail [ ] and finance [ ] . specifically, smart retail aims to use machine learning technology to provide personalized services to customers based on data like user purchasing power and product characteristics for product recommendation and sales services. in terms of financial applications, tencent's webank leverages federated learning technologies for credit risk management, where several banks could jointly generate a comprehensive credit score for a customer without sharing his or her data [ ] . with the growth and development of federated learning, there are many companies or research teams that have carried out various tools oriented to scientific research and product development. popular ones are listed in table . in this survey, we review the current progress on federated learning including, but not limited to healthcare field. we summarize the general solutions to the various challenges in federated learning and hope to provide a useful resource for researchers to refer. besides the summarized general issues in federated learning setting, we list some probably encountered directions or open questions when federated learning is applied in healthcare area in the following. -data quality. federated learning has the potential to connect all the isolated medical institutions, hospitals, or devices to make them share their experiences with privacy guarantee. however, most health systems suffer from data clutter and efficiency problems. the quality of data collected from multiple sources is uneven and there is no uniform data standard. the analyzed results are apparently worthless when dirty data are accidentally used as samples. the ability to strategically leverage medical data is critical. therefore, how to clean, correct, and complete data and accordingly ensure data quality is a key to improve the machine learning model weather we are dealing with federated learning scenario or not. -incorporating expert knowledge. in , ibm introduced watson for oncology, a tool that uses the natural language processing system to summarize patients' electronic health records and search the powerful database behind it to advise doctors on treatments. unfortunately, some oncologists say they trust their judgment more than watson tells them what needs to be done. therefore, hopefully doctors will be involved in the training process. 
since every data set collected here cannot be of high quality, so it will be very helpful if the standards of evidence-based machine are introduced, doctors will also see the diagnostic criteria of artificial intelligence. if wrong, doctors will give further guidance to artificial intelligence to improve the accuracy of machine learning model during training process." -incentive mechanisms. with the internet of things and the variety of third party portals, a growing number of smartphone healthcare apps are compatible with wearable devices. in addition to data accumulated in hospitals or medical centers, another type of data that is of great value is coming from wearable devices not only to the researchers but more importantly for the owners. however, during federated model training process, the clients suffer from considerable overhead in communication and computation. without well-designed incentives, self-interested mobile or other wearable devices will be reluctant to participate in federal learning tasks, which will hinder the adoption of federated learning [ ] . how to design an efficient incentive mechanism to attract devices with high-quality data to join federated learning is another important problem. -personalization. wearable devices are more focus on public health, which means helping people who are already healthy to improve their health, such as helping them exercise, practice meditation, and improve their sleep quality. how to assist patients to carry out scientifically designed personalized health management, correct the functional pathological state by examining indicators, and interrupt the pathological change process are very important. reasonable chronic disease management can avoid emergency visits and hospitalization and reduce the number of visits. cost and labor savings. although there are some general work about federated learning personalization [ , ] , for healthcare informatics, how to combining the medical domain knowledge and make the global model be personalized for every medical institutions or wearable devices is another open question. -model precision. federated tries to make isolated institutions or devices share their experiences, and the performance of machine learning model will be significantly improved by the formed large medical dataset. however, the prediction task is currently restricted and relatively simple. medical treatment itself is a very professional and accurate field. medical devices in hospitals have incomparable advantages over wearable devices. and the models of doc.ai could predict the phenome collection of one's biometric data based on its selfie, such as height, weight, age, sex, and bmi. how to improve the prediction model to predict future health conditions is definitely worth exploring. funding the work is supported by onr n - - - and nsf . fw would also like to acknowledge the support from amazon aws machine learning research award and google faculty research award. 
deep learning with differential privacy cpsgd: communication-efficient and differentially-private distributed sgd federated ai technology enabler human activity recognition on smartphones using a multiclass hardware-friendly support vector machine efficient training management for mobile crowd-machine learning: a deep reinforcement learning approach an agent-based federated learning object search service towards federated learning at scale: system design practical secure aggregation for privacy-preserving machine learning federated uncertainty-aware learning for distributed hospital ehr data federated learning of predictive models from federated electronic health records partitioned variational inference: a unified framework encompassing federated and continual learning expanding the reach of federated learning by reducing client resource requirements leaf: a benchmark for federated settings secure federated matrix factorization a near-optimal algorithm for differentially-private principal components fedhealth: a federated transfer learning framework for wearable healthcare communication-efficient federated deep learning with asynchronous model update and temporally weighted aggregation secureboost: a lossless federated learning framework differential privacy-enabled federated learning for sensitive health data predicting adverse drug reactions on distributed health data using federated learning the physionet/computing in cardiology challenge : reducing false arrhythmia alarms in the icu variational federated multi-task learning international application of a new probability algorithm for the diagnosis of coronary artery disease doc.ai: declarative, on-device machine learning for ios, android, and react native learning from electronic health records across multiple sites: a communication-efficient and privacy-preserving distributed algorithm our data, ourselves: privacy via distributed noise generation boosting and differential privacy semi-cyclic stochastic gradient descent a survey of homomorphic encryption for nonspecialists the next generation of precision medicine: observational studies, electronic health records, biobanks and continuous monitoring national health information privacy: regulations under the health insurance portability and accountability act robust aggregation for adaptive privacy preserving federated learning in healthcare ketos: clinical decision support and machine learning as a service-a training and deployment platform based on docker one-shot federated learning distributed learning of deep neural network over multiple agents deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding robust federated training via collaborative machine teaching using trusted instances federated learning for mobile keyboard prediction private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption central server free federated learning over single-sided trust social networks the rationale for learning communities and learning community models distilling the knowledge in a neural network observational health data sciences and informatics (ohdsi): opportunities for observational researchers patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records privacy preserving qoe modeling using collaborative learning mining electronic health records: towards better research applications and 
clinical care improving federated learning personalization via model agnostic meta learning a survey towards federated semi-supervised learning mimic-iii, a freely accessible critical care database advances and open problems in federated learning efficient decentralized deep learning by dynamic model averaging incentive design for efficient federated learning in mobile networks: a contract theory approach credit risk assessment from combined bank records using federated learning federated tensor factorization for computational phenotyping federated optimization: distributed optimization beyond the datacenter federated optimization: distributed machine learning for on-device intelligence federated learning: strategies for improving communication efficiency survey of personalization techniques for federated learning peer-to-peer federated learning on graphs deep learning privacy-preserving patient similarity learning in a federated environment: development and analysis federated optimization for heterogeneous networks fair resource allocation in federated learning distributed learning from multiple ehr databases: contextual embedding models for medical events two-stage federated phenotyping and patient representation learning threats to federated learning: a survey communication-efficient learning of deep networks from decentralized data learning differentially private recurrent language models predictive modeling of the hospital readmission risk from patients' claims data using machine learning: a case study on copd deep learning for healthcare: review, opportunities and challenges agnostic federated learning system and method for dynamic context-sensitive federated search of multiple information repositories client selection for federated learning with heterogeneous resources in mobile edge dissecting racial bias in an algorithm used to manage the health of populations multiparty differential privacy via aggregation of locally trained classifiers large-scale assessment of a smartwatch to identify atrial fibrillation federated and differentially private learning for electronic health records the eicu collaborative research database, a freely available multi-center database for critical care research modern framework for distributed healthcare data analytics based on hadoop a model and infrastructure for federated learning content repositories accelerating dnn training in wireless federated edge learning system braintorrent: a peer-to-peer environment for decentralized federated learning learning in a large function space: privacypreserving mechanisms for svm learning a generic framework for privacy preserving deep learning federated learning for ultra-reliable lowlatency v v communications robust and communication-efficient federated learning from non-iid data preserving patient privacy while training a predictive model of in-hospital mortality biscotti: a ledger for private and secure peer-topeer machine learning privacy-preserving deep learning federated learning in distributed medical databases: meta-analysis of large-scale subcortical brain data an investigation into on-device personalization of end-to-end automatic speech recognition models using the adap learning algorithm to forecast the onset of diabetes mellitus federated multi-task learning multiparty privacy in social media unfriendly: multi-party privacy risks in social networks federated learning: rewards & challenges of distributed private ml a hybrid approach to privacy-preserving federated learning federated learning of 
electronic health records improves mortality prediction in patients hospitalized with covid- medrxiv deep reinforcement learning with double q-learning split learning for health: distributed deep learning without sharing raw patient data ai in health: state of the art, challenges, and future directions -edge ai: intelligentizing mobile edge computing, caching and communication by federated learning federated learning for healthcare informatics federated patient hashing federated machine learning: concept and applications a federated learning framework for healthcare iot devices federated learning with non-iid data mobile edge computing, blockchain and reputationbased crowdsourcing iot federated learning: a secure, decentralized and privacy-preserving system multi-objective evolutionary federated learning federated reinforcement learning publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations conflict of interest the authors declare that they have no conflict of interest. key: cord- - lsv cel authors: challet, damien; ayed, ahmed bel hadj title: predicting financial markets with google trends and not so random keywords date: - - journal: nan doi: nan sha: doc_id: cord_uid: lsv cel we check the claims that data from google trends contain enough data to predict future financial index returns. we first discuss the many subtle (and less subtle) biases that may affect the backtest of a trading strategy, particularly when based on such data. expectedly, the choice of keywords is crucial: by using an industry-grade backtesting system, we verify that random finance-related keywords do not to contain more exploitable predictive information than random keywords related to illnesses, classic cars and arcade games. we however show that other keywords applied on suitable assets yield robustly profitable strategies, thereby confirming the intuition of preis et al. ( ) taking the pulse of society with unprecedented frequency and accuracy is becoming possible thanks to data from various websites. in particular, data from google trends (gt thereafter) report historical search volume interest (svi) of given keywords and have been used to predict the present [ ] (called nowcasting in [ ] ), that is, to improve estimate of quantities that are being created but whose figures are to be revealed at the end of a given period. they include unemployment, travel and consumer confidence figures [ ] , quarterly company earnings (from searches about their salient product)s [ ] , gdp estimates [ ] and influenza epidemics [ ] . asset prices are determined by traders. some traders look for, share and ultimately create information on a variety on websites. therefore asset prices should be related to the behavior of website users. this syllogism has been investigated in details in [ ] : the price returns of the components of the russell index are regressed on many factors, including gt data, and these factors are averaged over all of the assets. interestingly, the authors find inter alia a significant correlation between changes in svi and individual investors trading activity. in addition, on average, variations of svi are negatively correlated with price returns over a few weeks during the period studied (i.e, in sample). the need to average over many stocks is due to the amount of noise in both price returns and gt data, and to the fact that only a small fraction of people who search for a given keywords do actually trade later. 
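the averaging over many stocks mentioned above can be made concrete with a small r sketch: for each stock, correlate the latest svi variation with the following week's return, then average the correlations across the cross-section. the cited study actually runs multi-factor regressions on index constituents; the function below is only a schematic of the averaging step, and the input matrices and their layout are made up for illustration.

# schematic: average, across stocks, the correlation between this week's svi
# variation and next week's return; per-stock signals are far too noisy alone
avg_attention_effect = function(svi_mat, ret_mat) {
  # svi_mat, ret_mat: weeks x stocks matrices, rows aligned on the same weeks
  stopifnot(all(dim(svi_mat) == dim(ret_mat)))
  n_weeks = nrow(svi_mat)
  cors = sapply(seq_len(ncol(svi_mat)), function(j) {
    dsvi = diff(log(svi_mat[, j]))               # svi variation, weeks 2..T
    fwd  = ret_mat[3:n_weeks, j]                 # returns one week later, weeks 3..T
    cor(dsvi[-length(dsvi)], fwd, use = "complete.obs")
  })
  mean(cors, na.rm = TRUE)
}

set.seed(7)
svi_mat = matrix(exp(rnorm(200 * 50)), 200, 50)    # 200 weeks x 50 stocks, fake data
ret_mat = matrix(rnorm(200 * 50, sd = 0.02), 200, 50)
avg_attention_effect(svi_mat, ret_mat)             # near zero here: no signal in fake data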
[ ]'s claim is much stronger: it states that future returns of the dow jones industrial average are negatively correlated with svi surprises related to some keywords, hence that gt data contain enough information to predict financial indices. several subtle (and not so subtle) biases prevent their conclusions from being as forceful as they could be. using a robust backtest system, we are able to confirm that gt data can be used to predict future asset price returns, thereby placing their conclusions on a much more robust footing. raw asset prices are well described by suitable random walks that contain no predictability whatsoever. however, they may be predictable if one is able to determine a set of conditions using either only asset returns (see e.g. [ ] for conditions based on asset cross-correlations) or external sources of information. google trends provides normalized time series of the number of searches for given keywords with a weekly time resolution [ ], denoted by v_t. [ ] propose the following trading strategy: defining the previous base-line search interest as the moving average \bar{v}_t of the last k weekly values of v_t, one takes a short position on the index for the following week when v_t > \bar{v}_{t-1} and a long position otherwise (this is the convention implemented in the code given in the appendix; we shall also consider the inverse strategy). average price reversion over the next one or two weeks with respect to a change of svi was already noticed by other authors [ , ]. instead of trying to predict the dow jones industrial average index, we use the time series of spy, which mirrors the standard and poor's index. this provides a weak form of cross-validation, the two time series being highly correlated but not identical. for the same reason, we compute returns from monday to friday close prices instead of monday to monday, which keeps index returns in sync with gt data (they range from sundays to saturdays). prediction is hard, especially about the future. but prediction about the future in the past is even harder. this applies in particular to the backtesting of a trading strategy, that is, to the computation of its virtual gains in the past. it is prone to many kinds of biases that may significantly alter its reliability, often positively [ , ]. most of them are due to the regrettable and possibly inevitable tendency of the future to creep into the past. this is the most overlooked bias. it explains in part why backtest performances are often very good in the s and s, but less impressive since about , even when one accounts for realistic estimates of total transaction costs. finding predictability in old data with modern tools is indeed easier than it ought to be. think of applying computationally cpu- or memory-intensive methods on pre-computer era data. the best known law of computational power increase is named after gordon moore, who noticed that the optimal number of transistors in integrated circuits increases exponentially with time (with a doubling time τ years) [ ]. but other important aspects of computation have been improving exponentially with time, so far, such as the amount of computing per unit of energy (koomey's law, τ . years [ ]) or the price of storage (kryder's law, τ years [ ]). remarkably, these technological advances are mirrored by the evolution of a minimal reaction timescale in financial data [ ].
for educational purposes, one can familiarize oneself with past computer abilities with virtual machines such as qemu [ ] tuned to emulate the speed and memory of computers available at a given time for a given sum of money. the same kind of bias extends to progresses of statistics and machine learning literature, and even to the way one understands market dynamics: using a particular method is likely to give better results before its publication than, say, one or two years later. one can stretch this argument to the historicity of the methods tested on financial data at any given time because they follow fashions. at any rate, this is an aspect of backtesting that deserves a more systematic study. data are biased in two ways. first, when backtesting a strategy that depends on external signals, one must ask oneself first if the signal was available at the dates that it contains. gt data was not reliably available before august , being updated randomly every few months [ ] . backtests at previous dates include an inevitable part of science fiction, but are still useful to calibrate strategies. the second problem is that data is revised, for several reasons. raw financial data often contains gross errors (erroneous or missing prices, volumes, etc.), but this is the data one would have had to use in the past. historical data downloaded afterwards has often been partly cleaned. [ ] give good advice about high-frequency data cleaning. revisions are also very common for macro-economic data. for example, gross domestic product estimates are revised several times before the definitive figure is reached (about revision predictability, see e.g. [ ] ). more perversely, data revision includes format changes: the type of data that gt returns was tweaked at the end of . it used to be made of real numbers whose normalization was not completely transparent; it also gave uncertainties on these numbers. quite consistently, the numbers themselves would change within the given error bars every time one would download data for the same keyword. nowadays, gt returns integer numbers between and , being the maximum of the time-series and its minimum; small changes of gt data are therefore hidden by the rounding process; error bars are no more available, but it is fair to assume that a fluctuation of ± should be considered irrelevant. in passing, the process of rounding final decimals of prices sometimes introduces spurious predictability, which is well known for fx data [ ] . revised data also concerns the investible universe. freely available historical data does not include deceased stocks. this is a real problem as assets come and go at a rather steady rate: today's set of investible assets is not the same as last week's. accordingly, components of indices also change. analyzing the behavior of the components of today's index components in the past is a common way to force feed it with future information and has therefore an official name: survivor(ship) bias. this is a real problem known to bias considerably measures of average performance. for instance [ ] shows that it causes an overestimation of backtest performance in % of the cases of long-only portfolios in a well chosen period. this is coherent since by definition, companies that have survived have done well. 
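the strength of survivorship bias is easy to feel with a toy simulation in r: generate many purely random "stocks", keep only those whose cumulative return ends above some survival threshold, and compare the average weekly return measured on the full universe with the one measured on survivors only. the threshold and parameters below are arbitrary; the point is the sign of the effect, not its size.

# toy illustration of survivorship bias: conditioning on survival inflates the
# measured average return even though every stock is a fair random walk
set.seed(5)
n_stocks = 2000; n_weeks = 260
rets = matrix(rnorm(n_stocks * n_weeks, mean = 0, sd = 0.03), n_weeks, n_stocks)
cum  = colSums(rets)                              # total log-return per stock
survived = cum > -0.5                             # crude proxy for "still listed today"

c(all_stocks     = mean(rets),
  survivors_only = mean(rets[, survived]))        # noticeably higher on average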
early concerns were about the performance of mutual funds, and various methods have been devised to estimate the strength of this bias given the survival fraction of funds [ , ]. finally, one must mention that backtesting strategies on untradable indices, such as the nasdaq composite index, is not a wise idea, since no one could even try to remove predictability from them. what keywords to choose is of course a crucial ingredient when using gt for prediction. it seems natural to think that keywords related to finance are more likely to be related to financial indices, hence to be more predictive. accordingly, [ ] build a keyword list from the financial times, a financial journal, aiming at biasing the keyword set. but this bias needs to be controlled with a set of random keywords unrelated to finance, which was neglected. imagine indeed that some word related to finance was the most relevant in the in-sample window. our brain is hardwired to find a story that justifies this apparent good performance. statistics is not: to test whether the average performance of a trading strategy is different from zero, one uses a t-test, whose result will be called t-stat in the following and is defined as z = (μ/σ)√n, where μ stands for the average of the strategy returns, σ for their standard deviation and n for the number of returns; for large n, z looks very much like a gaussian variable with zero average and unit variance. [ ] wisely compute t-stats: the best keyword, debt, has a t-stat of . ; the second best keyword is color and has a t-stat of . . both figures are statistically indistinguishable, but debt is commented upon in the paper and in the press; color is not, despite having equivalent "predictive" power. let us now play with random keywords that were known before the start of the backtest period ( ). we collected gt data for common medical conditions/ailments/illnesses, classic cars and all-time best arcade games (reported in appendix a) and applied the strategy described above with k = instead of k = . table i reports the t-stats of the best positive and negative performance (which can be made positive by inverting the prescription of the strategy) for each set of keywords. we leave the reader pondering about what (s)he would have concluded had bone cancer or moon patrol been more finance-related. this table also illustrates that the best t-stats reported in [ ] are not significantly different from what one would obtain by chance: the t-stats reported here being mostly equivalent to gaussian variables, one expects % of their absolute values to be larger than . , which explains why keywords such as color also have a good t-stat. finally, debt is not among the three best keywords when applied to spy from monday to friday: its performance is unremarkable and unstable, as shown in more detail below. nevertheless, the reported t-stats of finance-related terms are biased towards positive values, which is compatible with the reversal observed in [ , ] and with the results of table . this may show that the proposed strategy is able to extract some amount of the possibly weak information contained in gt data. another explanation for this bias could have been coding errors (it is not). time-series prediction is easy when one mistakenly uses future data as current data in a program, e.g. by incorrectly shifting time series; we give the code used in the appendix. a very simple and effective way of avoiding this problem is to replace alternately all price returns and external data (gt here) by random time series.
if backtests persist in giving positive performance, there are bugs somewhere. the aim of [ ] was probably not to provide us with a profitable trading strategy, but to attempt to illustrate the relationship between collective searches and future financial returns. it is however striking that no in-and out-sample periods are considered (this is surprisingly but decreasingly common in the literature). we therefore cannot assess the trading performance of the proposed strategy, which can only be judged by its robustness and consistency out-ofsample, or, equivalently, of both the information content and viability of the strategy. we refer the reader to [ ] for an entertaining account of the importance of in-and out-of-sample periods. f. keywords from the future [ ] use keywords that have been taken from the editions of the ft dated from august to june , determined ex post. this means that keywords from editions are used to backtest returns in e.g. . therefore, the set of keywords injects information about the future into the past. a more robust solution would have been to use editions of the ft available at or before the time at which the performance evaluation took place. this is why we considered sets of keywords known before . g. parameter tuning/data snooping each set of parameters, which include keywords, defines one or more trading strategies. trying to optimize parameters or keywords is called data snooping and is bound to lead to unsatisfactory out of sample performance. when backtest results are presented, it is often impossible for the reader to know if the results suffer from data snooping. a simple remedy is not to touch a fraction of historical data when testing strategies and then using it to assess the consistence of performance (cross-validation) [ ] . more sophisticated remedies include white's reality check [ ] (see e.g. [ ] for an application of this method). data snooping is equivalent as having no out-of-sample, even when backtests are properly done with sliding in-and out-of-sample periods. let us perform some in-sample parameter tuning. the strategy proposed has only one parameter once the financial asset has been chosen, the number of time-steps over which the moving averagev t is performed. figure reports the t-tstat of the performance associated with keyword debt as a function of k. its sign is relatively robust against changes over the range of k ∈ , · · · , but its typical value in this interval is not particularly exceptional (between and ). let us take now the absolute best keyword from the four sets, moon patrol. both the values and stability range of its t-stat are way better than those of debt (see figure ), but this is most likely due to pure chance. there is therefore no reason to trust more one keyword than the other. assuming an average cost of bps ( . %) per trade, trades per year and years of trading ( - ), transaction fees diminish the performance associated to any keyword by about %. as a beneficial side effect, periods of flat fees-less performance suddenly become negative performance periods when transaction costs are accounted for, which provides more realistic expectations. cost related to spread and price impact should also included in a proper backtest. given the many methodological weaknesses listed above, one may come to doubt the conclusions of [ ] . we show here that they are correct. the first step is to avoid methodological problems listed above. 
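the best-of-many-random-keywords effect discussed in the previous paragraphs can be reproduced numerically in a few lines of r: draw the weekly returns of many strategies as pure noise, compute the t-stat of each, and look at the best one. the keyword count and sample length below are made up, but the punchline is generic: the best of a hundred pure-noise strategies routinely shows a t-stat close to 3, and about 5% of them exceed 1.96 in absolute value, exactly as expected by chance.

# how impressive is the best of many random "keywords"? pure-noise monte carlo
set.seed(6)
n_keywords = 100; n_weeks = 400
tstats = replicate(n_keywords, {
  r = rnorm(n_weeks)                    # weekly strategy returns with zero true mean
  t.test(r)$statistic
})
c(best = max(abs(tstats)),
  frac_above_1.96 = mean(abs(tstats) > 1.96))   # about 5%, by chance alone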
one of us has used an industrial-grade backtest system and more sophisticated strategies (which therefore causes tool bias). first, let us compare the resulting cumulative performance of the three random keyword sets that we defined, plus the set of keywords from the financial times. for each set of keywords, we chose as inputs the raw svi, lagged svi and various moving averages of the svi, together with past index returns. it turns out that none of the keyword sets brings information able to predict index movements significantly (see fig. ). this is not incompatible with the results of [ , , ]. it simply means that the signal is probably too weak to be exploitable in practice. the final part of the performance curves is of course appealing, but this comes from the fact that monday-close-to-friday-close spy returns were mostly positive during this period: any machine learning algorithm applied to returns alone would likely yield the same result. so far we can only conclude that a given proper (and not overly stringent) backtest system was not able to find any exploitable information in the four keyword sets, not that the keyword sets do not contain enough predictive information. to conclude, we run the same backtest system on some gt data with exactly the same parameters and input types as before. the resulting preliminary performance, reported in fig. , is more promising and shows that there really is consistently some predictive information in gt data. it is not particularly impressive when compared to the performance of spy itself, but it is nevertheless interesting since the net exposure is always close to zero (see [ ] for more information). sophisticated methods coupled with careful backtests are needed to show that google trends contains enough exploitable information. this is because such data include too many searches that are probably unrelated to the financial asset associated with a given keyword, and even more that are unrelated to actual trading. when one restricts the searches by providing more keywords, gt data often only contain information at a monthly time scale, or no information at all. if one goes back to the algorithm proposed by [ ] and the compatible findings of [ , ], it is hard to understand why future prices should systematically revert after a positive svi surprise and vice versa one week later. the reversal is weak and only valid on average. it may be the most frequent outcome, but profitability is much higher if one knows what triggers reversal or trend following. there is some evidence that supplementing gt data with news leads to much improved trading performance (see e.g. [ ]). another paper by the same group suggests a much more promising source of information: it links the changes in the number of visits to wikipedia pages of given companies to future index returns [ ]. further work will investigate the predictive power of this type of data. we acknowledge stimulating discussions with frédéric abergel, marouanne anane and thierry bochud. (when requesting data restricted to a given quarter, gt returns daily data.)
references:
- qemu, a fast and portable dynamic translator. usenix. www.qemu.org
- survivorship bias in performance studies
- quant . - harnessing the mood of the web
- nowcasting is not just contemporaneous forecasting
- encelade capital internal report, final version available on request
- predicting the present with google trends
- in search of earnings predictability
- in search of attention
- an introduction to high-frequency finance
- measuring economic uncertainty and its impact on the stock market
- survivor bias and mutual fund performance
- news and noise in g- gdp announcements
- behind the smoke and mirrors: gauging the integrity of investment simulations
- detecting influenza epidemics using search engine query data
- critical reflexivity in financial markets: a hawkes process analysis
- implications of historical trends in the electrical efficiency of computing
- after hard drives - what comes next? magnetics
- stupid data miner tricks: overfitting the s&p
- dissecting financial markets: sectors and states
- quantifying wikipedia usage patterns before stock market moves
- cramming more components onto integrated circuits. electronics
- quantifying trading behavior in financial markets using google trends
- data-snooping, technical trading rule performance, and the bootstrap
- a reality check for data snooping
- google trends - wikipedia, the free encyclopedia
we have downloaded gt data for the following keywords, without any manual editing:
kidney stone, leukemia, liver tumour, lung cancer, malaria, melena, memory loss, menopause, mesothelioma, migraine, miscarriage, mucus in stool, multiple sclerosis, muscle cramps, muscle fatigue, muscle pain, myocardial infarction, nail biting, narcissistic personality disorder, neck pain, obesity, obsessive-compulsive disorder, osteoarthritis, osteomyelitis, osteoporosis, ovarian cancer, pain, panic attack, paranoid personality disorder, parkinson's disease, penis enlargement, peptic ulcer, peripheral artery occlusive disease, personality disorder, pervasive developmental disorder, peyronie's disease, phobia, pneumonia, poliomyelitis, polycystic ovary syndrome, post-nasal drip, post-traumatic stress disorder, premature birth, premenstrual syndrome, propecia, prostate cancer, psoriasis, reactive attachment disorder, renal failure, restless legs syndrome, rheumatic fever, rheumatoid arthritis, rosacea, rotator cuff, scabies, scars, schizoid personality disorder, schizophrenia, sciatica, severe acute respiratory syndrome, sexually transmitted disease, sinusitis, skin eruptions, skin cancer, sleep disorder, smallpox, snoring, social anxiety disorder, staph infection, stomach cancer, strep throat, sudden infant death syndrome, sunburn, syphilis, systemic lupus erythematosus, tennis elbow, termination of pregnancy
iso griffo a l, ferrari gtb/ , shelby mustang kr, corvette stingray, dodge challenger, dodge charger, dodge dart swinger, facel vega facel ii, ferrari , ferrari gto, ferrari gto, ferrari , ferrari daytona
missile command, moon buggy, moon patrol, ms. pac-man, naughty boy, pac-man, paperboy, pengo, pitfall!, pole position, pong, popeye, punch-out!!, q*bert
here is a simple implementation in r of the strategy given in [ ] (we do mean "=" instead of "<-"):

    library(zoo)  # provides rollmeanr(), lag() and the zoo time-series class
    # loadgtdata(), loadyahoodata() and getfuturereturns() are helper functions
    # defined elsewhere by the authors (not shown); they return zoo series.
    computeperfstats = function(filename, k = 3, getperf = FALSE) {  # default window length assumed
      gtdata = loadgtdata(filename)
      if (is.null(gtdata) || length(gtdata) < k) {  # minimum-length sanity check; threshold assumed
        return(NULL)
      }
      spy = loadyahoodata('spy')
      spy_rets = getfuturereturns(spy)             # spy_rets is a zoo object, contains r_{t+1}
      gtdata_mean = rollmeanr(gtdata, k)           # \bar v_t
      gtdata_mean_lagged = lag(gtdata_mean, -1)    # \bar v_{t-1}
      pos = 2 * (gtdata > gtdata_mean_lagged) - 1  # +1 if the svi is above its lagged moving average, -1 otherwise
      perf = -pos * spy_rets                       # contrarian position applied to the next return
      perf = perf[which(!is.na(perf))]
      if (getperf) {
        return(perf)
      } else {
        return(t.test(perf)$statistic)
      }
    }
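as a quick usage illustration (not part of the original listing): the function above can be mapped over a handful of downloaded keyword files to reproduce the kind of in-sample t-stat comparison discussed earlier; the file names, the csv layout expected by loadgtdata() and the value of k below are placeholders.

    # hypothetical call pattern; file names and k are placeholders
    files  = c("debt.csv", "moon_patrol.csv")
    tstats = sapply(files, computeperfstats, k = 3)
    print(tstats)
    # getperf = TRUE returns the weekly performance series itself instead of its t-stat
    perf_debt = computeperfstats("debt.csv", k = 3, getperf = TRUE)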
key: cord- -ig nwtmi authors: nan title: th european conference on rare diseases & orphan products (ecrd ) date: - - journal: orphanet j rare dis doi: . /s - - - sha: doc_id: cord_uid: ig nwtmi nan theme: when therapies meet the needs: enabling a patient-centric approach to therapeutic development. background: rare diseases (rd) often result in a wide spectrum of disabilities, on which information is lacking. there is a need for standardised, curated data on the functional impact of rd to facilitate the identification of relevant patient-reported/patient-centered outcome measures (proms/pcoms), as well as for the use of validated quality-of-life instruments based on functional outcomes. to address these issues, orphanet is partnering with mapi research trust (mrt) in order to connect orphanet to proqolid™, mrt's proms/pcoms database, through disease codes. visit orphanet at www.orpha.net. methods and materials: the orphanet functioning thesaurus (oft) is a multilingual controlled vocabulary derived from the icf-cy. a subset of rd present in the orphanet nomenclature is annotated with the oft, with the addition of attributes (frequency, severity and temporality) for each functional impact of each specific rd. annotations result from structured interviews with clinical experts, medical-social sector care providers and patient organisations. in order to link proqolid™ data with orphanet disability data, the taxonomy used to qualify rds in proqolid™ was reviewed and mapped to orphanet's. all proms developed for rds were identified, and all products approved by the fda and ema from to with an orphan drug designation (odd) and a pro claim were listed. results: the orphanet knowledge base contains over rd, of which rd have been assessed for their functional consequences and rd have been annotated; the remaining rd were annotated, after discussions with medical experts, as either being highly variable, non-applicable or resulting in early death. of the most prevalent rare diseases, have been annotated according to their functional consequences. rds had a prom (n = ) in proqolid™ and . % of odds included a pro claim. the rds with the most proms were sickle cell anemia, spinal cord injuries, cystic fibrosis, all forms of hemophilia a and b, and duchenne muscular dystrophy. proms used in labels focused primarily on symptoms ( %), rarely on functioning ( %) or health-related quality of life ( %). conclusions: linking these two databases, and providing standardised, curated data, will enable the community to identify proms/pcoms for rd, and is the first step towards validated quality-of-life instruments based on functional outcomes. and eurordis. some patient organisations distributed the survey to more common disease patients as well, e.g. hashimoto's disease, and these responses were excluded. the survey asked patients to give suggested topics (i.e.
fertility, heritability, tiredness, daily medicine intake, sleep quality, physical discomfort, and ability to work, partake in social life, and sports) a priority score and to suggest their own topics for research in open fields. open field responses were analysed with topic modelling and klipp-analysis. results: after exclusion of responses from more common endocrine disease patients, survey responses were analysed. most responses were received from northern ( %) and western europeans ( %), while southern ( %) and eastern europe ( %) were underrepresented. of the suggested topics respondent were most interested in research concerning the ability to work and participate in social life, and on tiredness. when patients were open to suggest their own topics, common responses included long-term side effects of drugs and quality of life. however, priorities differed between disease groups. for example, adrenal, pituitary and thyroid patients were more interested in research concerning tiredness than others. conclusion: with this survey endo-ern is provided with a large sample of responses from european patients with a rare endocrine condition, and those patients experience unmet needs in research, though these needs differ between the disease groups. the results of this study should be incorporated by clinical experts in the design of future studies in the rare endocrine disease field. purpose: when developing a health technology that requires clinical studies, developers institute working relations with clinical investigators. patient representatives can also create and manage advisory boards with product developers. this was of high utility in the s, in the development of products to treat hiv infection. inspired by this model, the european organisation for rare diseases (eurordis) proposes the eurocab programme to facilitate a two-way dialogue between patient representatives and medicine developers. as of , disease-specific cabs exist of approximately members each and others are being formed. methods: eurordis invites developers to sign a charter for collaboration with patients in clinical research, and provides guidelines together with a mentoring and training programme for patient networks. cabs help set the agenda with the developer, work on topics as diverse as study design, feasibility, informed consent and site selection, qol and proms, and organize the meetings. discussions also cover compassionate use, pricing, relative efficacy, etc. meetings last for to days with sessions with different developers, all under confidentiality. there are regular between-meeting teleconferences for trainings and action plan updates, and some cabs have instituted working groups on access, psychological support, etc. the collaboration is evaluated via a post-meeting survey send to both cab members and medicine developers. in addition, cabs have recently started to monitor outcomes of the meeting and progress towards their goals with a tracker tool. results: the results of the first surveys from distinct cab meetings with companies show that this form of shared decision-making is valuable as well as ethical for both parties. we have seen that working relations always continue, even when discussions become heated. all involved show interest in the co-creation possibilities of such collaboration and we look forward to seeing progress and change via the tracker. conclusions: monitoring and evaluation are crucial to understand whether and how the cabs are making an impact on medicine development. 
demonstrating impact is challenging because of the contextualized nature and complexity inherent to patient engagement collaborations in research design. eurordis is working within para-digm on our monitoring and evaluation strategy, focusing on improving its comprehensiveness and including multi-stakeholder perspectives. our current experiences show that the eurocab programme, with collective thinking and exchange between patients and a collaborative mentality from both sides, ensures high-quality and constructive dialogue with researchers and developers and can eventually inform both hta and regulatory decision-making (fig. ) . we have started to work on the metrics of markers of success. purpose: the french national network for rare sensory diseases sensgene launched in a -min motion design video ( fig. ) aiming at guiding healthcare professionals to welcome visually impaired patients in the hospital. this educational video was created to address patients' expectations and improve their experience in the network's hospital. the european reference network for rare eye diseases (ern-eye) collaborated on the project and created an english version of the video in order to distribute it widely in europe. method: sensgene worked on this project with two big french associations of visually impaired people: fédération des aveugles et amblyopes de france and fédération des aveugles alsace lorraine grand est. more than french patients' associations actively contributed to the project through five focus groups (workshops) which collected testimonies and gathered the needs of visually impaired persons and health-care professionals (fig. ). an evaluation was made by the independent body ipso facto: health professionals answered to a survey before and after viewing the video. results: the video deals with common situations in the delivery of care activities for different types of visual impairment: reception in a hospital center, consultations, moves and orientation in a hospital room (fig. ) . this fits perfectly with the needs of the patients reported in the focus groups. besides, the evaluation showed that % of them improved their knowledge on the topic. background: traditional appraisal and reimbursement approaches such as cost/qaly are increasingly recognised as being potentially unsuitable for rare disease treatments (rdts). approaches to appraising rdts vary across countries, from the same processes used for all medicines, to those completely separate from the standard, to adapted standard processes with greater willingness to pay (wtp). this study examines the impacts of standard versus special appraisal processes for specific rdts in selected countries. methodology: a case study analysis was conducted in which countries with a variety of rdt appraisal processes were selected, along with two rdts representative of the following criteria: rare/ultra-rare treatment, affecting child/adult, cancer/non-cancer, life-threatening/disabling. public hta reports for each country's appraisal of the selected rdts were retrieved and used to extract information into predesigned templates, which allowed for systematic comparison of the rdt processes across countries to compare and exemplify the impact of the different processes. results: reports from belgium, england, france, germany, u.s., italy, lithuania, netherlands, poland, romania, scotland, slovakia, and sweden were selected for spinraza and voretigene. 
characteristics of each country's process were extracted, including special reimbursement for rdts, special rdt committees, economic evaluation modifications, greater wtp, quality of evidence flexibility, additional considerations, etc. special and standard processes seemed to have different impacts on the appraisal of rdts. special processes more consistently managed rdt issues such as evidential uncertainty and higher icers. standard processes sometimes informally applied some of the characteristics included in special processes, such as broader consideration of value. conclusions: comparing case study country examples of rdt appraisal exemplified the complexity of these processes. special processes were more consistent in managing the challenges in rdt appraisal than standard processes. practical application: findings suggest a need for adapted approaches for rdt appraisal, to facilitate management of associated challenges and more consistent decision-making. estimating the broader fiscal impact of rare diseases using a public economic framework: a case study applied to acute hepatic porphyria (ahp) mark p. connolly , background: the aim of this study was to apply a public economic framework to evaluate a rare disease, acute hepatic porphyria (ahp) taking into consideration a broad range of costs that are relevant to government in relation to social benefit payments and taxes paid by people with ahp. ahp is characterized by potentially life-threatening attacks and for many patients, chronic debilitating symptoms that negatively impact daily functioning and quality of life. the symptoms of ahp prevent many individuals from working and achieving lifetime work averages. we model the fiscal consequences for government based on reduced lifetime taxes paid and benefits payments for a person diagnosed aged experiencing attacks per year. materials & methods: a public economic framework was developed exploring lifetime costs for government attributed to an individual with ahp in sweden. work-activity and lifetime direct taxes paid, indirect consumption taxes and requirements for public benefits were estimated based on established clinical pathways for ahp and compared to the general population (gp). results: lifetime earnings are reduced in an individual with ahp by sek . million compared to the gp. this also translates to reduced lifetime taxes paid of sek . million for an ahp individual compared to the gp. we estimate increased lifetime disability benefits support of sek . million for an ahp individual compared to gp. we estimate average lifetime healthcare costs for ahp individual of sek . million compared to gp of sek . million. these estimates do not include other societal costs such as impact on caregiver costs. conclusions: due to severe disability during the period of constant attacks, public costs from disability are significant in the ahp patient. lifetime taxes paid are reduced as these attacks occur during peak earning and working years. the cross-sectorial public economic analysis is useful for illustrating the broader government consequences attributed to health conditions. ethics approval: the study results described here are based on a modeling study. no data on human subjects has been collected in relation to this research. the european cystic fibrosis (cf) society patient registry collects demographic and clinical data from consenting people with cf in europe. the registry's database contains data of over , patients from countries. 
high quality data is essential for use in annual reports, epidemiological research and postauthorisation studies. methods: a validation programme was introduced to quantify consistency and accuracy of data-input at source level and verify that the informed consent, required to include data in the registry, has been obtained in accordance with local and european legislation. accuracy is defined as the proportion of values in the software matching the medical record, consistency as definitions used by the centre matching those defined and required by the registry. data fields to verify: demographic, diagnostic, transplantation, anthropometric and lung function measurement, bacterial infections, medications and complications. the number of countries to validate: % of the total countries/year. in the selected country ≥ % of the centres should be visited and - % of the data validated. results: in , ten out of centres ( %) in countries (austria, portugal, slovakia, switzerland) with ≥ % of all patients in their countries were selected. in a day visit, the data of the registry were compared with the medical records, the outcomes and recommendations discussed, and a final report provided. demographic, diagnostic and transplant data were checked for patients ( %*), clinical data for patients ( %*) ( data). challenges were: informed consent, mutation information (genetic laboratory report missing), definitions interpretations. see fig. for the results. conclusion: the registry's data is highly accurate for most data verified. the validation visits proved to be essential to optimise data quality at source, raise awareness of the importance of correct informed consent and encourage dialogue to gain insight in how procedures, software, and support can be improved. *of the total patients in these countries. background: people with duchenne muscular dystrophy (dmd) adopt compensatory movement patterns to maintain independence as muscles get weaker. the duchenne video assessment (dva) tool provides a standardized way to document and assess quality of movement. caregivers video record patients doing specific movement tasks at home using a secure mobile application. physical therapists (pts) score the videos using scorecards with prespecified compensatory movement criteria. objective: to gather expert input on compensatory criteria indicative of clinically meaningful change in disease to include in scorecards for movement tasks. approach: we conducted rounds of a delphi panel, a method for building consensus among experts. we recruited pts who have evaluated ≥ dmd patients in clinic and participated in ≥ dmd clinical trials. in round , pts completed a preliminary questionnaire to evaluate compensatory criteria clarity and rate videos of dmd patients performing each movement task using scorecards. in round , pts participated in an in-person discussion to reach consensus (≥ % agreement) on all compensatory criteria with disagreement or scoring discrepancies during round . results: of the pts, % practiced physical therapy for ≥ years, % provided physical therapy to ≥ dmd patients, and % participated in ≥ dmd clinical trials. of version compensatory criteria, ( %) were revised in round . of version compensatory criteria, ( %) were revised in round . the pts reached % agreement on all changes made to scorecards during the in-person discussion except the run scorecard due to time restrictions. 
a subset of the panel ( pts) met after the in-person discussion and reached consensus on compensatory criteria to include in the run scorecard. conclusion: expert dmd pts confirmed that the compensatory criteria included in the dva scorecards were appropriate and indicative of clinically meaningful change in the disease. introduction: fifty percent of rare disease cases occur in childhood. despite this significant proportion of incidence, only % of adult medicines authorised by the european medicines agency (ema) completed paediatric trials [ , ] . as a result, many clinical needs are left unmet. various factors compound the development of treatments for paediatric rare diseases, including the need for new clinical outcome assessments (coas), as conventional endpoints such as the minute walking test ( mwt) have been shown to not be applicable in all paediatric age subsets, [ ] and therefore may not be useful in elucidating patient capabilities. coas are a well-defined and reliable assessment of concepts of interest, which can be used in adequate, well-controlled studies in a specified context. coas capture patient functionality and can be deployed through the use of wearable sensor technology; this feasibility study presents data obtained from patients with paediatric rare diseases who were assessed with this type of technology. methods: niemann pick-c (np-c) (n = ) and duchenne muscular dystrophy (dmd) (n = ) patients were asked to wear a wrist-worn wearable sensor at home for a minimum of weeks. feasibility was assessed qualitatively and quantitatively, with data captured in minute epochs, measuring the mean of epoch's with the most steps over a month (adm), average daily steps (ads), average steps per minute epoch (ade) (table ) and reasons for non-adherence (table ) . no restriction in the minimum number of epochs available for analysis were applied, and all patient data analysed. results: discrepancies in ambulatory capacity were observed between np-c and dmd patients overall, with np-c patients covering greater distances and taking more steps daily. qualitative assessment of both patient groups highlighted their relationships with the technology, which in turn detailed adherence. some patients exhibited behavioural issues which resulted in a loss of data and low engagement. conclusions: the wearable sensor technology was able to capture the ambulatory capacity for np-c/dmd patients. insights into disease specific parameters that differed were gained, which will be used for developing the technology further for use in future trials. additional work is required to correlate the wearable device data with other clinical markers, however the study displays the feasibility of wearable sensors/apps as potential outcome measures in clinical trials. background: neonatal surgery is decentralized in germany. in there were departments of pediatric surgery that treated % of the abdominal wall defects with an average case load of less than per unit [ ] . patient organizations stress the importance of quality measurements for the care of children with rare diseases. study plan: currently, there is no nationwide data collection regarding the short term and long term care of patients with congenital malformations, who often need surgery during the first weeks of life. 
the german society of pediatric surgery, which covers almost all of the german pediatric surgical units, has initiated the work of creating a national patient registry (kirafe) for the following congenital malformations: malformations of the gastrointestinal tract, the abdominal wall, the diaphragm, and meningomyelocele. the development of the registry involves three different patient organizations and health care professionals from all over germany. the registry will be set up in based on the open source registry system for rare diseases (osse). the primary objective of the registry is the measurement of quality attributes of rare congenital malformations. furthermore, the registry will facilitate recruitment of patients to clinical trials. it will also serve as a basis for policy making and planning of health and social services for people with rare disorders. informed consent will be obtained from the participants. the registry will include core data, mainly comprising information on the set of malformations of each patient. each malformation will then prompt further different modules for data collection. this modular structure offers the greatest possible flexibility for the documentation of patients with more than one congenital malformation. data will be collected by health care professionals. results: since the start of the preparation individuals, either working in one of hospitals or being member of one of the three patient organizations, have contributed in the ongoing activities. the registry is listed in the european directory of registries (erdri) [ ] . ethical approval was obtained, financial resources were secured. in , german hospitals and three non-german hospitals confirmed their intention to document their patients within the registry. conclusion: the registry is an example for a nationwide collaboration with the goal to optimize the quality of care for a patient group with rare diseases. is a collaboration between cf europe and five pharmaceutical companies (to date). through biannual meetings, we aim to institute a longterm educational collaboration with companies with an interest in cf. membership of industrial partners is dependent upon adherence to the cfrtoc code of conduct and a financial contribution for cf europe to fulfil its missions. common objectives include access to information. one strong example, applicable even beyond rare diseases, is the need for improved communication regarding clinical trials (cts) which has been inconsistent and often difficult to understand. from , the new european ct regulation / will oblige sponsors to share ct results through lay summaries. to help move this initiative forward, cf europe, with the active support of the cystic fibrosis trust, is collaborating with the european cystic fibrosis society-clinical trials network (ecfs-ctn) and cfrtoc members to establish a glossary of relevant cf terms. it will be freely available so that all stakeholders can systematically use it in patient-friendly scientific summaries and wider communication. in a pilot project, people with cf and patient associations, together with industrial partners will shortlist terms. these will be defined by lay members and subsequently subjected to the study and approval of the legal department of participating companies. provided this process is successful, we aim to create approved definitions by the end of . cf europe and ecfs-ctn intend to advertise the use of this glossary online and through communications at scientific events. 
national patient organisations will be further encouraged to provide translations in their national language. alkaptonuria (aku, ochronosis) is an inborn metabolic disease, resulting in the accumulation of the metabolic intermediate homogentisic acid (hga). oxidation of hga by air or within connective tissue causes darkening of the urine, pigmentation of eyes and ears, kidney-and prostrate-stones, aortic stenosis, but most severely an early onset of arthritis called ochronotic arthropathy (ochronosis) due to deposition in the cartilage. ochronosis is very painful, disabling and progresses rapidly. starting in the thirties with the spine and affecting large joints in the forties, patients frequently require joint replacements in their fifties and sixties [ , ] . like many of the rare diseases, aku-patients undergo a long odyssey of several years until their diagnosis. the german aku-society "deutschsprachige selbsthilfegruppe für alkaptonurie (dsaku) e.v. " was founded in and became subsequently registered as a non-profit patient organization. first of all, the dsaku identified aku-patients, set up a homepage [ ] and designed flyers with information for patients, their families, medical professionals and healthcare services. second, it offered workshops on aku-related issues and enabled personal exchange. third, it raised awareness of aku, both nationally and internationally by information booths, presentations and posters at scientific congresses as well as rare disease days (rdd). fourth, in response to the needs of patients, it established collaborations and built up national networks for a better health care accordingly. thus, patients were encouraged to visit the centers for metabolic diseases at the charité (berlin), hannover medical school (mhh), university of düsseldorf and institute of human genetics at the university of würzburg to bundle knowledge and expertise. the dsaku is member of achse e.v., nakos, eurordis and metabern and registered in the databases se-atlas, zipse and orphanet. finally, the dsaku is nationally and internationally active in health politics regarding training in drug safety and evidence-based medicine. introduction: autoinflammatory diseases are rare conditions characterized by recurrent episodes of inflammation with fever associated to elevation of acute phase reactants and symptoms affecting mainly the mucocutaneous, musculoskeletal or gastrointestinal system. these diseases affect the quality of life of patients and their families. objectives: aim of this project is to develop a tool able to ameliorate patients' management of the disease and to enhance patientphysician communication. to develop a tool based on real-life needs, we involved patients and caregivers since the initial phase of the project. a first workshop designed to capture their needs was organized. innovative co-design activities were performed through "legoserious-play ™ " (lsp) methodology [ ] [ ] [ ] . during a first phase of "divergence" patients (from teen-agers to adults) affected by different aids (fmf, trap, caps, mkd) and physicians where involved in the lsp activities. participants were asked to describe, through lego and metaphors: • the disease • themselves in comparison with the disease • solutions and supports which could help them in managing the disease after each step the participants presented their models, and everyone was engaged in the discussion. 
the ideas collected during the three phases allowed to make a list of functionalities identified as necessary for the app to be developed. due to the actual sars-cov- sanitary emergency the second phase of the project, aimed at presenting the participants the results of the first meeting and proceed with the app finalization was performed through web-based meeting and surveys in which the patients and caregivers actively participated. results: in the first phase patients and caregivers participated actively expressing various needs, that we subsequently summarized in main areas (table ) . participants were then further involved and their opinion taken into consideration for the user experience and interface definition for the development of the mobile app including the required functionalities (after a further activity of prioritization). introduction: gaps in communication and education are becoming one of the biggest key pain points for patients that are suffering rare diseases. due to the limited resources and the misleading information on the internet we wanted to test the poc systems to deliver more efficiently the information to our patients and their relatives. userfriendly information at the point of care should be well structured, rapidly accessible, and comprehensive. method: we implemented a specific poc channel using several touchpoints to deliver the right content at the right time. we created and selected the video content that will be most helpful to our patients. later on, we analyzed the patient journey and we decided to use a mobile app where the patients could search for information when they are at their home. at the medical practice, we use the waiting room and exam room as learning areas through monitors and tablets. moreover, healthcare professionals are prescribing content to their patients that they reviewed when they are home. results: thanks to the use of the poc channel and technologies related we were able to reduce the time needed to perform an explanation by %. furthermore, our healthcare professionals reported that their conversations with the patients improved % and patient satisfaction increased by %. conclusion: poc channel created a positive impact on our patient experience allowing us to be more efficient delivering the information to our patients and their relatives. [ , ] , realworld safety and efficacy data are limited -particularly for patients who receive > treatment. we report initial data from the restore registry, including cohort clinical characteristics, treatments received, and outcomes. materials and methods: restore is a prospective, multicenter, treatment-agnostic registry of sma patients. the primary objectives include assessment of contemporary sma treatments; secondary objectives include assessment of healthcare resource utilization, caregiver burden, and changes in patient functional independence over time. planned follow-up is years from enrollment. as of january , data were available for patients, all from de novo clinical sites in the united states; information on treatment regimens was available for patients (table ) . disease-modifying treatments were administered sequentially or in combination. % of treated patients showed symptoms at sma diagnosis, with the most common being hypotonia and limb weakness ( table ) . 
Ågrenska, a swedish national centre for rare diseases, has for thirty years arranged courses for families of children with rare diagnoses and has experienced that the conditions often have complex and varying consequences in the children ś everyday lives. knowledge of these consequences and of how to adapt the treatment, environment and activities to create the best possible conditions for participation and learning, is often lacking. many professionals also report lack of sources of knowledge. knowledge formation and dissemination are thus of outmost importance. in order to aid knowledge formation and dissemination Ågrenska has developed an observation instrument for children with rare diagnoses, identifying both abilities and difficulties on a group level. the instrument consists of quantitative and qualitative items and covers ten areas: social/communicative ability, emotions and behaviours, communication and language, ability to manage his/her disability and everyday life, activities of daily life, gross and fine motor skills, perception and worldview, prerequisites for learning and basic school abilities. observations are made during the children ś school and pre-school activities during the Ågrenska course. teachers and special educators, working with the children, are responsible observers. some school-related abilities are difficult to observe during the five-day stay. this information is instead collected through a telephone interview with the children ś home teacher. the instrument was content validated against a number of existing instruments. the items were considered relevant as they, with few exceptions, appear in well-known assessment tools. to test interrater reliability observations of six children were performed. each child was observed by two educators. interrater reliability was calculated for the quantitative items usually observed during the course. interrater reliability reached . %. background: sma is a neurodegenerative disease caused by survival motor neuron gene (smn ) deletion or mutation [ , ] . disease severity (sma type) correlates with smn copy number [ , ] . gene therapy with onasemnogene abeparvovec provides sustained, continuous production of smn protein, and is fda approved [ ] , with ongoing trials for sma type (sma ) and sma , and presymptomatic treatment for all sma types. with treatment options available, many states in the united states (us) are implementing newborn screening (nbs) to detect smn deletions and smn copies, providing early diagnosis and the option of pre-symptomatic treatment [ ] . we examine the economic consequences of implementing nbs for sma and pre-symptomatic treatment with onasemnogene abeparvovec gene therapy among newborns in the us. a decision-analytic model was built to assess the cost effectiveness of nbs in , hypothetical newborns from a us third-party payer perspective. the model included separate arms, each allowing for a different treatment strategy. model inputs for epidemiology, test characteristics, and screening and treatment costs were based on publicly available literature (table ) . inputs and assumptions of lifetime costs and utilities for sma types were obtained from the institute for clinical and economic review sma report [ ] ; other values were sourced from published literature. model outputs included total costs, quality-adjusted life years (qalys), and incremental cost-effectiveness ratios (icers). scenario and sensitivity analyses tested model robustness. 
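for readers less familiar with this type of decision-analytic output, the incremental cost-effectiveness ratio reported by such a model is conventionally the ratio of the incremental lifetime cost of the screening strategy to its incremental health gain; the expression below is the standard definition rather than anything specific to this abstract:

    \mathrm{ICER} = \frac{C_{\text{NBS}} - C_{\text{no NBS}}}{\mathrm{QALY}_{\text{NBS}} - \mathrm{QALY}_{\text{no NBS}}}

the resulting ratio is then compared with a willingness-to-pay threshold per qaly gained to judge whether newborn screening plus pre-symptomatic treatment is cost-effective.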
park's programme, particularly across education and engagement and prioritisation and development of research. in addition to representation on governance structures, wales gene park (wgp) collaborates with patients and the public to involve them in rare disease and genetic research. wgp has co-produced a rare disease research gateway following consultation with patients and the public from its networks. the gateway hosts relevant studies in genetic and rare disease research on the wgp website. it promotes involvement opportunities in addition to signposting to studies that patients and other members of the public can participate in. it also links to training opportunities for ppi representatives. consultation with patients and the public regarding the usability, design and development of the gateway was undertaken. feedback has enhanced the user experience and it was launched in october . there are currently over studies featured, and the gateway is searchable according to condition or key word. impact will be monitored through online usage and website analytics. engagement with researchers through a professional network enables opportunities to be advertised from all areas of genetic and rare disease research and ensures that patient and public representatives are involved in the design and development of research from its inception. wgp were invited to present at the welsh health and care research wales conference in as the gateway was highlighted as an exemplar of good practice. specialist visit, medications) and non-medical resource use (lost productivity and homecare or caregiver's time). outcomes of interest for treatment options assessed the efficacy and safety of treatments for rett syndrome. results: the search on economic burden yielded articles; intervention type and costs were extracted from , representing studies. in the economic burden studies, enteral feeding and assisted walking increased the risk of respiratory-related hospital admissions, while length-of-stay was lower in younger patients. mean recovery-stay after scoliosis-correcting surgery was . days and . days in each of studies. care integration improved outcomes and reduced costs. the search on clinical trials yielded articles; efficacy and safety were extracted from , representing studies ( randomized controlled trials, single-arm; n = - ; follow-up - months). of these, focused on pharmacological symptom treatment; examined environmental enrichment effects; none targeted the underlying cause. the most common primary endpoints are stated in table . naltrexone, trofinetide, and mecasermin demonstrated clinical benefits versus placebo, but most treatments yielded no significant improvement ( table ) . the cml advocates network (cml an) is an active network specifically for leaders of chronic myeloid leukemia (cml) patient groups, connecting patient organisations in countries on all continents. it was set-up and is run by cml patients and carers. its aim is to facilitate and support best practice sharing among patient advocates across the world. the cml community advisory board (cml-cab) is a working group of the cml advocates network. since its inception the cml-cab has met on nineteen occasions with five sponsors. the cml-cab is comprised of two chairs and cab-members. cml-cab organisation, sustainability and follow-up is supported by a part-time cml-cab officer and the cml-an executive director. 
the principles of leaving no one behind are essential to the goals of world health organization (who) and united nations (un). in , an ambitious objective to ensure that billion more people will benefit from universal health coverage (uhc) until was entrenched in the who th general programme of work [ ] . all un member states have agreed to try to achieve universal health coverage by , as part of the sustainable development goals [ ] . however, it is essential, that rare disease (rd) patients are not left behind on our trip to uhc. in , un declared that rd are among the most vulnerable groups that are still on the fringes of uhc [ ] . the first step on a way to the full uhc cube [ ] for rd is an identification of root causes of health inequities. health determinants of rd fundamentally differ from those for common diseases. some of them are unavoidable: up to % of rd have a genetic basis (individual or genetic determinants). although socio-economic factors are highly important, in contrast to common diseases, they are a consequence rather than a cause of rd. meanwhile, one of the major root cause amenable to change are health system determinants: organization of services for rd requires unique solutions in our health systems that are mostly adapted for common diseases. political and legal determinants also play a key role: while rd is an explicit example of an area, loaded orphanet j rare dis , (suppl ): with needs for pan-european solutions, relative "weakness" of eu legal powers to regulate and have an impact on implementation of pan-european policies in health results in vast inequities among and inside member states and lack of engagement at a national level. health activism that includes strong advocacy and a loud voice of patient organizations has also been ascribed to health determinants and may have a crucial role in rd [ ] to improve the situation, we already have some powerful tools at hand including national plans for rd, european reference networks [ ] and european joint programme on rare diseases [ ] . however, to reach the full potential of these, multiple obstacles have to be removed and full implementation ensured. since march , there has been an explosion in digital health adoption as people look for remote ways to manage their health and wellbeing. national government covid- strategies, local authorities and consumers, have all turned to health apps, both as a potential means of slowing the spread of the virus, and a method of allowing people to self-manage their own health. in the first few weeks of the covid- pandemic, orcha worked with app developers to build a dedicated covid- app library full of evaluated apps. free to use for all, it included relevant, quality assured apps that had been through orcha's rigorous review process. to build such a tool in such a short space of time is testament to the speed of this market. more consumers have been using health and care apps. in just one week, orcha saw an increase of . % in app downloads from its app libraries, and a , % increase in app recommendations from health and care professionals. orcha can see from the data across its app libraries that the most popular search terms since the pandemic began have included: mental health, physiotherapy, fitness, anxiety, rehabilitation, diabetes, respiratory, and sleep. whereas 'covid' was initially the most searched term at the beginning of the outbreak, people have since searched for specific condition areas. 
this indicates a shift in focus to actively self-managing health and wellbeing, and a desire for knowledge about particular health areas. the recent increase in digital health adoption has highlighted that the challenge remains of helping consumers to understand which apps are potentially unsafe to use, and ensuring that consumers are armed with the full facts about the strengths and weaknesses of an app, before it is downloaded. while considerable progresses have been made in the last years in research on innovative medicinal products for adults, children have not benefited from progresses to the same extent as adults in terms of appropriate treatments and advanced tools. it is well known that the availability of drugs for paediatric use still represents a challenging issue, since research and development in this field is characterized by many that range from methodological, ethical and economic reasons, especially when neonates and rare diseases are involved. moreover, even when industry has the capacity to perform a paediatric drug development plan, there are many economic reasons limiting the commercial sponsors' interest (the paediatric population is a small population; paediatric diseases often concern rare disorders with unknown mechanism; it is very difficult to perform preclinical and clinical studies; ethical concerns are still relevant and additional regulatory requirements have to be considered). in this scenario, eptri can make the different in closing the gap between innovative technologies and paediatric drug development processes. it is a eu-funded project that arises from the need to find answers to the serious lack of medicines for children in eu and worldwide, and aimed to design the framework for a european paediatric translational research infrastructure dedicated to paediatric research. an high interest is tailored on rare diseases (rd) as they affect mainly children and genetic rd start early in the prenatal/childhood life with an high frequent use of medicines not specifically tested (off-label, unlicensed). eptri will work to accelerate the paediatric drug development processes from medicines discovery, biomarkers identification and preclinical research to developmental pharmacology, age tailored formulations and medical devices. this will allow is to facilitate the translation of the acquired new knowledge and scientific innovation into paediatric clinical studies phases and medical use. neonatal screening started in many countries around - after phenylketonuria turned out to be a treatable condition. if diagnosed early, a diet could help to avoid impaired brain development. public health programmes were developed to offer all newborn children the possibility to be tested. screening always has benefits and disadvantages, and only rarely pros outweigh cons at reasonable costs. the world health organization in published criteria to evaluate benefits and disadvantages, concerning amongst others ( ) important health problem ( ) treatment ( ) suitable test and ( ) appropriate use of resources. pku was mentioned as an example of an important health problem [ ] . neonatal screening is more than a test. information to parents, communication of results, ict infrastructure, follow-up of affected infants, reimbursement of test and treatment and governance all need adequate attention [ ] . 
around the number of diseases covered in european countries in neonatal screening programs was very diverse: from zero in albania to more than in austria, hungary, iceland, portugal and spain [ ] . many countries have seen an increase in the number of diseases covered because of new tests and treatments becoming available. health authorities were almost always involved in changes in the programmes, hta experts and parents organizations sometimes. half of the countries had laws on nbd, and half had a body overseeing nbs programs. less than half of the countries informed parents of the storage of dried blood spots [ ] . after the eu initiated "tender nbs" had provided advice to eu policy makers [ ] , little initiatives for harmonization were taken, because health is the mandate of member states. from the perspective of newborns this implies that early diagnosis and adequate treatment for nbs conditions may differ very much for children being born in one or another eu country. with more tests and more treatments becoming available, this makes it even more urgent to attune the perspectives of different eu stakeholders for the benefit of all newborns. background: as genome sequencing is rapidly moving from research to clinical practice, evidence is needed to understand the experience of patients with rare diseases and their families. in the presentation, we discuss families' experience of receiving, making sense of and living with genomic information. the presentation includes video-clips from two short films from families' narratives. specifically, families struggled with the lack of information on the course of the disease, the difficulties to access support and navigate health and social care services, and the challenges related to making sense of the implications of genomic information for other family members. despite these issues, families identified a wide range of benefits from taking part in genome sequencing, which were broader than the clinical utility of the diagnosis. the findings raise questions regarding how to talk about 'diagnosis' in a way that reflects families' experience, including their uncertainty but also their perceived benefits. they also have implications for the design and delivery of health services in the genomic era, pointing to the need to better support families after their search for a diagnosis. saluscoop [http://www.salus coop.org] is a non-profit data cooperative for health research that aims to make a greater amount and diversity of data available to a broader set of health researchers, and to help citizens to manage their data for the common good. data heals. health research is data-driven: the larger the universe, the greater the quantity, quality, and diversity of the data, the more potential the data has to cure. in our european context, it is clear: data belongs citizen. gdpr regulates ownership and our rights over data that include portability. data protection laws rightly consider that health data deserves the maximum protection. however, the only truth, we note every day: in practice citizen often cannot access their data or control its use. the future of our health depends significantly on the ability to combine, integrate and share personal health data from different sources. the only one who can integrate all your information (public, private, clinical, personal, habits, genetics) is the citizen himself. using data well, it is possible to obtain more and better health for all. 
we are a cooperative that works to facilitate the transformation process towards this goals doing: -dissemination, awareness, communication -studies, manifestos. -licenses to facilitate it -salus common good license - it is necessary to dissociate the provision of services, of the possession of the data. the accumulation and centralization of the data is not necessary. blockchain and the like allow the certification of transactions without the need for intermediaries. the need for the existence of new social institutions for the collective management of data for the common good is much clearer today: so that these citizens have the technological and legal tools effectively manage their data. so that health research can address the real problems of our societies. the abstract is being presented on behalf of a saluscoop management board group. the region of murcia, located on the southeast of spain, has . million inhabitants. in , approximately % of its population was identified with a rd, based on the regional rd information system, which showed a public health problem requiring an integral and coordinated approach. results: in , after years of participative work (interdepartmental government representatives, patients associations and professionals) the regional plan for rd integrated (holistic) care was approved, for a period of years ( - ) and a budget of millions euros; with the goal of improving health, education and social care through interdisciplinary coordination and placing patients and families in the center of the actions. the plan includes ten different strategic areas related to information, prevention and early detection, healthcare, therapeutic resources, social-health care, social services, education, training of professionals, research, monitoring and evaluation a regional rd coordination center, linked to the medical genetics unit in the tertiary reference hospital, is connected to the health areas, educational and social local services, through a case manager integrated in the multidisciplinary team. this was our building experience presented in the innovcare project, co-funded by the eu. to design a holistic care plan for rd we need to know the prevalence based on rd registries, available and needed resources and an interdisciplinary participative action approach with the appropriate government and financial support with periodic evaluation. case management has an important role. the recognition of clinical genetics as health specialty is also urgent in spain to provide equal access to rd patients and families all over the country. [ ] . these policies have served us well, but it is essential that the policies guiding us towards the future we wish to see are equipped to address the needs of the future rd population. the rare project [ ] is working towards precisely this goal, and has identified over a hundred future-facing trends likely to impact on the field. some of these trends concern demographic changes about which we can be reasonably certain: whilst overwhelmingly positive, changes such as ageing rd populations will bring new challenges in managing comorbidities. they will also create new opportunities as well as risks in areas such as reproductive choice; however, these choices incur major ethical, legal and social concerns, and it is unclear how many countries really have robust frameworks in place to cope with this. besides the fairly certain demographic changes, there are many topics -and many needs-for which the future is not clear. 
will there be easier access to expert multidisciplinary teams? what will be the role of technology in care delivery? these fundamental issues are here debated in interview format [ ] . adrenoleukodystrophy, or ald, is a complex x-linked genetic brain disorder which mainly affects males between the ages of four and -males who are previously perfectly healthy and 'normal' . ald damages the myelin in the brain and spinal cord, and those with cerebral symptoms become completely dependent on their loved ones or carers. this usually involves patients becoming wheelchair or bed bound, blind, unable to speak or communicate and tube fed. it is a difficult disorder to diagnose with behaviour problems usually the primary indicator. in males, cerebral ald is a terminal illness with most dying within one to years of symptoms developing. if diagnosed before symptoms become apparent, usually through identification of a family member, the condition can be successfully treated through bone marrow transplant. some adults (males and females) develop a related condition called adrenomyeloneuropathy, or amn. symptoms include difficulty walking, bladder and bowel incontinence and sexual dysfunction. tragically, around one third of males with amn go on to develop cerebral ald. initial behavioural symptoms often have an impact on the individual's professional and personal lives -their capacity to work, maintain relationships and family ties -over time, they can become isolated and socially unacceptable. commonly, those individuals without supportive family structures are missed or misdiagnosed. the presentation presents a personal case study detailing the impact of an ald diagnosis on the whole family, moving on to alex tlc's experience in applying to add ald to the uk's new born screening programme. the conclusion includes next steps following an initial negative response, and thoughts on the methods used to assess decisions on the prevention and treatment of rare disease. the rare foresight study gathers the input of a large group key opinion leaders through an iterative process to propose recommendations for a new policy framework for people living with rare diseases (rd) in europe. since the adoption of the council recommendation on european action in the field of rd in , the european union has fostered tremendous progress in improving the lives of people living with rd. rare will recommendations for the next ten years and beyond. the rare foresight study includes major stages (fig. ) . the european conference on rare diseases and orphan products (ecrd ) marked the occasion to present four proposed future scenarios (fig. the market-led approach first creates the technology innovation, then seeks out its market. deep understanding of needs as the starting point of the innovation process. with symptoms and being suspected of having a rare disease can be the longest in many steps to getting a diagnosis. this is something we have the power to change now by providing content tailored to medics, early in their careers that will equip them to #daretothinkrare. to prepare for delivery of gene therapies, companies typically focus on four key areas: patient identification & diagnostics; treatment centre qualification; manufacturing & supply and market access. timely diagnosis of patients is important as with progressive disorders, the earlier patients are treated, typically the better their long-term clinical outcomes will be. 
targeted tools and resources are used to educate clinical specialists on the early symptoms of the disease. improving access to the appropriate diagnostic tests is essential. if newborn screening is considered, validated assays and pilot studies are required. gene therapies have to be administered in qualified treatment centres. after regulatory approval, treatment centres are relatively few so patients may need to cross borders and work is required to expand the recognition of patient rights to be treated in another eu country (e.g. through the s mechanism). many companies partner with contract manufacturing organisations and are developing ways to preserve gene-corrected stem cells to enable their transportation from the manufacturing site to treatment centres. the final area is market access, whereby it is vital to evolve the way healthcare systems think about delivery, funding and value determination. manufacturers have the responsibility to generate health economic evidence. recent research [ ] in metachromatic leukodystrophy showed that caregivers (n = ) spend an average of hours a day caring for their child. % of parents were forced to miss work with % of this being unpaid leave. in addition, it is recommended to have the optionality of payment models that allow the sharing of risk between the healthcare system and manufacturer (e.g. annuity or outcomes-based payments). orchard has developed a holistic value framework as gene therapies are expected to benefit patients, families, communities, healthcare systems and society reference background: employment has always been one of the fundamental human rights. it is important for people with rare diseases, because it helps to stay connected to the community and to continue professional development. equal access to job employment can help to overcome the consequences of the condition and to gain financial independence. on the other hand unemployment can increase the social exclusion. in the last few years there is an improvement in the european policies about job employment. in spite of this, people with rare diseases still have to overcome discrimination in this field. as a proof of this statement is the recent online survey, conducted by eurordis. according to it, % of the respondents admitted they had to reduce their professional activities after they were diagnosed with rare condition. this means that more than half of the people with rare diseases in europe face employment challenges. the analysis of this survey was important input to the presentation of the epf youth group project -ways. results: this is the abbreviation of work and youth strategy and it is a two year project, disseminated among young patients with chronic conditions. the main purpose was to increase the awareness about positive and negative practices for young patients on the labour market and to develop recommendations to employers and decision makers. that is why epf youth group conducted an online survey and provided different deliverables like factsheet with recommendations to employers and video about young patients' rights on the work place. the results of both survey provided important insight about the challenges people with rare diseases face in job employment. it proved the fact that only if we work together as a community of patients, we will be able to provide better opportunities for national and international inclusion. 
paul rieger , eberhard scheuer centiva health ag, zug, switzerland correspondence: paul rieger -paul@centiva.health orphanet journal of rare diseases , (suppl ):s the lack of access to research participants is the number one reason why medical studies fail [ ] . real-world data is often difficult to get despite usd billion costs of patient data intermediation. therefore, a new model for patient access is necessary where patients get paid fairly for their data, retain control over their data, and drive citizencentered research. on the other hand, researchers and industry must be enabled to access patients directly without violating their privacy, while reducing time and costs of data access at the same time. current patient registries facilitate patient access and match patients with a centralized data flow while giving little to no incentives. whereas, a decentralized patient registry allows for direct and confidential matchmaking between patients and organizations looking for data through the use of blockchain technology. it lets the patient decide with whom they want to share their data. on such a platform, patients can receive incentives in the form of digital currency. currently, centiva health [ ] is used in the context of rare diseases and population health, i.e., outbreak monitoring. in the area of rare diseases centiva health cooperates with patient advocacy groups by enhancing existing registries with the ability to collect real-world data. the access to patient via a decentralized registry leads to aligned incentives, real-time access to data, improved disease visibility while preserving patient privacy. orphanet j rare dis , (suppl ): the united kingdom and in the czech republic, to co-design optimal methods/services for the communication of genomic results. methods and results: using a methodology called experience-based co-design (ebcd) , we supported families and health professionals to shared and discuss their experiences, identify priories for improvement and then work together to prototype and test out interventions to address these. the process involved observations of clinical appointments (), interviews with families () and health professionals and a series of workshops and remote consultations at both sites. results: five shared priorities for improvement were identified by participants at the two sites, and eight quality improvement interventions were prototyped/tested to address these ( table ) . discussion: the findings clearly indicate the need for improved follow-up care to support families in the short, medium term after the sharing of the results, including when a diagnosis is confirmed. different service models were prototyped, including follow up consultations with clinical geneticists and a dedicated role to facilitate co-ordinated care. the findings also demonstrate the need for continued workforce development on the psychosocial aspects of genomic and genetic communication, specifically on families' needs regarding genomic consent and the experience of guilt and (self-) blame. to use technology has been used in the home to provide objective seizure data prior to upcoming clinic appointments. the covid- pandemic has prompted an acceleration in telemedicine and epihunter has improved the effectiveness of virtual consultations bringing opportunities for both diagnostics and informed changes in treatment. epihunter is an example of technology repurposing to create a new normal for people with hidden disabilities such as those living with absence epilepsy. 
the rare disease patient community tried to get this well detailed plan to be transferred to regulation which usually means an adequate financial substitution of those expert services. the patients should benefit from a centralized expert treatment/care pathway. esophageal atresia (ea) is a rare congenital condition with an estimated prevalence of to in , live births. esophageal atresia patients require life-long attention. ernica has developed a 'patient journey' for ea patients, under the leadership of patient representatives from the international federation of ea support groups (eat). in germany, patients with congenital malformations which need surgery in early life are treated in hospitals with (very) low experience. how can we as patient representatives get the fruits of the erns into the national health system? we don't have public money. we have no official contract and no political support. keks e.v., the german ea support group together with other support groups (e.g. soma e.v.), and with surgical expert teams across germany, some of them members in erns, started to organize monthly virtual boards for those patients. a self-commitment on ethical and medical standards following the ern-criteria, and a collaborative attitude within the group, help us to get step by step the first ernica results to the bedside of ea patients. springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. delays in completion and results reporting of clinical trials under the paediatric regulation in the european union: a cohort study new strategies for the conduct of clinical trials in pediatric pulmonary arterial hypertension: outcome of a multistakeholder meeting with held at the european medicines agency on monday decentralized rather than centralized pediatric surgery care in germany erdri.dor -european directory of registries natural history of alkaptonuria recent advances in management of alkaptonuria when you build in the world, you build in your mind white paper on lego serious play articulation of tacit and complex knowledge zolgensma (onasemnogene abeparvovec-xioi) childhood spinal muscular atrophy: controversies and challenges spinal muscular atrophy bannockburn, il indirect estimation of the prevalence of spinal muscular atrophy type i, ii, and iii in the united states pilot study of populationbased newborn screening for spinal muscular atrophy in new york state presymptomatic diagnosis of spinal muscular atrophy through newborn screening one year of newborn screening for sma -results of a german pilot project correlation between sma type and smn copy number revisited: an analysis of unrelated spanish patients and a compilation of reported cases available from: www.ibm.com/produ cts/micro medex -red-book available from: www.ibm.com/produ cts/micro medex -red-book references . united kingdom national health service international rett syndrome foundation presented at aacap's th annual meeting placebo-controlled crossover assessment of mecasermin for the treatment of rett syndrome cerebrolysin therapy in rett syndrome: clinical and eeg mapping study effects of acetyl-l-carnitine on cardiac dysautonomia in rett syndrome: prevention of sudden death? 
rett syndrome: controlled study of an oral opiate antagonist, naltrexone pharmacologic treatment of rett syndrome with glatiramer acetate safety, pharmacokinetics, and preliminary assessment of efficacy of mecasermin (recombinant human igf- ) for the treatment of rett syndrome effects of ω- pufas supplementation on myocardial function and oxidative stress markers in typical rett syndrome thirteenth general programme of work - . promote health -keep the world safe -serve the vulnerable principles and practice of screening for disease newborn screening programmes in europe; arguments and efforts regarding harmonization. part -from screening laboratory results to treatment, follow-up and quality assurance newborn screening programmes in europe; arguments and efforts regarding harmonization. part -from blood spot to screening result a framework to start the debate on neonatal screening policies in the eu: an expert opinion document communication from the commission to the european parliament, the council, the european economic and social committee and the committee of the regions on rare diseases: europe's challenges on an action in the field of rare diseases on the application of patients' rights in cross-border healthcare available from: https :// drive .googl e.com/file/d/ sfe xp deisc ogrbw swht uznx erj/ view?usp=shari ng . which scenarios are most preferred by the rd community? . which scenarios are most likely to happen? . how do we achieve the scenarios we prefer and avoid those we don factors associated with clinical trials that fail and opportunities for improving the likelihood of success: a review s background: to help inform cross-national development of genomic care pathways, we worked with families of patients with rare diseases and health professionals from two european genetic services bringing user experience to health care improvement: the concepts, methods and practices of experience-based design department of medical informatics correspondence: info-rdsgofair@go-fair.org (marco roos -m.roos@ lumc.nl, gülçin gümüş -gulcin.gumus@eurordis.org) in practice, it can take months of searching data, understanding the sources, mapping to consistent standards, and negotiating how one might use the data. many assume that for sharing and analysis, data need to be moved between sources. this can lead to sharing only minimal, non-sensitive data: a fraction of global rare disease data. alternatively, data elements and local access conditions can be described by globally agreed, computer understandable standards conform fair principles. this enables analysis at each source, while sharing only the analysis results. fair prepares data for rapid discovery, access, and analysis, also when data remain at source. projects such as the european joint programme for rare diseases work on the technical infrastructure to support this. adopting fair principles requires culture change. 
fair advocates working on rare diseases have organised the 'rare diseases global open fair implementation network rds go fair prioritizes patient representatives for their capacity to reshape current practices, welcoming them to organise their own network within rds go fair to foster fair for patient priorities (registration for follow-up meetings is possible via eucerd recommendations on quality criteria for centres of expertise for rare diseases in member states joint action rd-action (european union's health programme european commission website pdf s patient's view on disruptive innovations in clinical research elizabeth vroom we would like to thank the j-rare patient organization groups and the asrid research ethics committee. consent to publish: informed consent to publish has been obtained from patients. we thank all the families who took part in the interviews, the staff of the health services and charities who collaborated to advertise the study to eligible participants and the members of the family advisory groups who reviewed the interview schedule and provided invaluable feedback on the preliminary findings. the work has been presented on behalf of the study "improving the communication of genomic diagnosis results using experience based co-design (ebcd)", which is part of the solve rd project. the solve-rd project has received funding from the european union's horizon research and innovation programme under grant agreement no . we thank all the families who took part in the interviews, the staff of the health services and charities who collaborated to advertise the study to eligible participants and the members of the family advisory groups who reviewed the interview schedule and provided invaluable feedback on the preliminary findings. the work has been presented on behalf of the study "improving the communication of genomic diagnosis results using experience based co-design (ebcd)", which is part of the solve rd project. the solve-rd project has received funding from the european union's horizon research and innovation programme under grant agreement no . acknowledgements: we would like to thank all seed group members of the rare diseases global open fair implementation network, the go fair office, eurordis, the european union's horizon research and innovation program under the ejp rd cofund-ejp n° , the rd-connect community, the lumc biosemantics research group, simone louisse (guardheart epag), and the many patients and patient representatives that inspire us. recent advances in next-generation phenotyping (ngp) for syndromology, such as deepgestalt, have learned phenotype representations of multiple disorders by training on thousands of patient photos. however, many mendelian syndromes are still not represented by existing ngp tools, as only a handful of patients were diagnosed. moreover, the current architecture for syndrome classification, e.g., in deepgestalt, is trained "end-to-end", that is photos of molecularly confirmed cases are presented to the network and a node in the output layer, that will correspond to this syndrome, is maximized in its activity during training. this approach will not be applicable to any syndrome that was not part of the training set, and it cannot explain similarities among patients. therefore, we propose "gestaltmatch" as an extension of deepgestalt that utilizes the similarities among patients to identify syndromic patients by their facial gestalt to extend the coverage of ngp tools. 
methods: we compiled a dataset consisting of , patients with , different rare disorders. for each individual, a frontal photo and the molecularly confirmed diagnosis were available. we considered the deep convolutional neural network (dcnn) in deepgestalt as a composition of a feature encoder and a classifier. the last fully-connected layer in the feature encoder was taken as facial phenotypic descriptor (fpd). we trained the dcnn on the patients' frontal photos to optimize the fpd and to define a clinical face phenotype space (cfps). the similarities among each patient were quantified by cosine distance in cfps. results: patients with similar syndromic phenotypes were located in close proximity in the cfps. ranking syndromes by distance in cfps, we first showed that gestaltmatch provides a better generalization of syndromic features than a face recognition model that was only trained on healthy individuals. moreover, we achieved % top- accuracy in identifying rare mendelian diseases that were excluded from the training set. we further proved that the distinguishability of syndromic disorders does not correlate with its prevalence. conclusions: gestaltmatch enables matching novel phenotypes and thus complements related molecular approaches.an audience of over delegates voted on the rare scenarios and discussions throughout the sessions of ecrd indicated the following opinions:-if we continue as we are we will find ourselves in the "fast over fair" scenario which forecasts high collective responsibility but an emphasis on market-led innovation -the majority of the audience preferred a future scenario with continued high collective accountability but more of an emphasis on needs-led innovation, "investments for social justice" -a significant portion of the audience agreed that a balance must remain with the market led attractiveness of the "technology along will save you" scenario -a scenario where "it's up to you to get what you need" was least preferred by all the diagnostic pathway in rare disease has a number of bottlenecks that can result in the pathway becoming an odyssey. while some barriers are being removed through remarkable innovation, there is one story of diagnostic delay that is echoed by rare disease patients across the globe and across thousands of different rare diseases: doctors failed to suspect something rare. however we cannot expect doctors to suspect rare diseases when they haven't been trained to or, in some cases, have been trained to do the exact opposite with the mantra "common things are common". without appropriate training 'rare' can be mistaken for 'irrelevant' when in reality million european citizens live with a rare disease [ ] . medics rarediseases is driving an attitude change towards rare diseases in the medical profession. this begins with explaining that rare diseases are collectively common and all clinicians should expect to manage people with diagnosed and undiagnosed rare disease regularly during their careers. this attitude change is called #daretothinkrare. secondly m rd is suggesting a new approach to educating about rare disease for trainers and training institutes. this approach tackles rare disease as a collective and focuses on patient needs rather than details of individual diseases. this not only solves the impossible challenge of covering over rare diseases during medical training but also provides some equity between different diseases. 
lastly, m rd promotes the use of rare disease specific resources that will support both doctors and their patients. this includes the invaluable input from patient advocacy groups. the step between presenting quality assurance of rare disease (rd) centers of excellence (coe) through designation, accreditation, monitoring and constant improvement provides a means to ensure high quality, centralization of resources and expertise, and cost-efficiency. eucerd recommendations for quality criteria of coe, issued in , are still highly relevant [ ] . in the state of art resource, almost all european union (eu) member states (ms) claim, that their coes conform to eucerd recommendations [ ] . however, national quality assurance processes differ significantly: some ms apply robust procedures, while in other ms, many of them -but not exclusively -are eu- ms, processes of quality assurance are less developed. under the subsidiarity principle embedded into european treaties, the eu plays a limited role in many areas of healthcare, and coes quality assurance processes are a choice and responsibility of ms.with the establishment of erns, another layer of quality assurance has been developed by the european commission and the ms [ ] . this new quality assurance framework may be in line, or not, with national accreditation systems and involves i) assessment of coes when they apply for full membership of erns and ii) continuous monitoring afterwards [ ] . in every ern, members have to be "equal partners in the game" and share the same goals, rights and obligations. while the ern logo should eventually be a quality mark of the highest standards, strong links of ern members to national systems, including many more and less specialized healthcare providers, are essential to ensure proper care pathways for rd patients. importantly, erns themselves and patients/non-governmental organizations provide us with additional means of "informal quality assurance". many erns are implementing their own monitoring processes through the creation of registries to collect health outcomes that allow peer-benchmarking. meanwhile, patients provide their strong voice through european patient advocacy groups (epags) and help to signpost "the best" coes through information sharing. in both these processes, the power of open, transparent information on performance may finally lead to improved transparency and accountability at a national level and, presumably, may have an impact on the composition of erns in the future. in order to improve clinical research, patient preferences and outcome measures relevant to patients should become the core of drug development and be implemented from the earliest stage of drug development. from 'bedside to bench' instead of from 'bench to bedside' . at all levels the reuse of data could and should be enhanced. patient derived or provided data are not owned by those who collected them, and their reuse should be primarily controlled by the donors of these data. researchers and health professionals are custodians (gdpr). to enable the optimal reuse of real world data, the data needs to be findable, accessible, interoperable and reusable (fair) by medical professionals, patients and in particular also by machines. for this reason the world duchenne organization published a duchenne fair data declaration [ ] . 
reuse of placebo data and use of natural history data could speed up research especially in the field of rare diseases at this moment, in line with gdpr, patients are in a good position to decide about the reuse of their own data and should not only have access to these data but preferable also be in charge of their own data. background: drug repurposing for rare disease has brought more costeffective and timely treatment options to patients compared to traditional orphan drug development, however this approach focuses purely on medical interventions and requires extensive clinical trials prior to approval. in the case of refractory epilepsy, practical solutions are also required to better manage daily life. here we present an example of technology repurposing as a practical aid to managing absence epilepsy. methodology: existing research tells us that seizure control is not the only consideration of quality of life in children with epilepsy and that mental health and caregiver/peer support are of utmost importance. we explored the needs of stakeholders and determined that there was a delicate balance between the individual (and those that care for them) and those that have the power to change their lives. results: across all stakeholders there was a shared common need to obtain objective data on absence monitoring to relieve the burden on families/carers to retain manual seizure diaries whilst providing accurate and timely data to medical teams, researchers and social care. epihunter is an absence seizure tracking software using repurposed technology: a headset from wellness/leisure to collect electroencephalographic (eeg) data and an ai algorithm to detect and record absence seizures on a mobile phone application in real-time. both eeg and video recording of the seizure are automatically captured. this low cost, easy key: cord- -ag xt nj authors: schmidt, lena; weeds, julie; higgins, julian p. t. title: data mining in clinical trial text: transformers for classification and question answering tasks date: - - journal: nan doi: nan sha: doc_id: cord_uid: ag xt nj this research on data extraction methods applies recent advances in natural language processing to evidence synthesis based on medical texts. texts of interest include abstracts of clinical trials in english and in multilingual contexts. the main focus is on information characterized via the population, intervention, comparator, and outcome (pico) framework, but data extraction is not limited to these fields. recent neural network architectures based on transformers show capacities for transfer learning and increased performance on downstream natural language processing tasks such as universal reading comprehension, brought forward by this architecture's use of contextualized word embeddings and self-attention mechanisms. this paper contributes to solving problems related to ambiguity in pico sentence prediction tasks, as well as highlighting how annotations for training named entity recognition systems are used to train a high-performing, but nevertheless flexible architecture for question answering in systematic review automation. additionally, it demonstrates how the problem of insufficient amounts of training annotations for pico entity extraction is tackled by augmentation. all models in this paper were created with the aim to support systematic review (semi)automation. 
they achieve high f scores, and demonstrate the feasibility of applying transformer-based classification methods to support data mining in the biomedical literature. systematic reviews (sr) of randomized controlled trials (rcts) are regarded as the gold standard for providing information about the effects of interventions to healthcare practitioners, policy makers and members of the public. the quality of these reviews is ensured through a strict methodology that seeks to include all relevant information on the review topic . a sr, as produced by the quality standards of cochrane, is conducted to appraise and synthesize all research for a specific research question, therefore providing access to the best available medical evidence where needed (lasserson et al., ) . the research question is specified using the pico (population; intervention; comparator; outcomes) framea https://orcid.org/ - - - b https://orcid.org/ - - - c https://orcid.org/ - - work. the researchers conduct very broad literature searches in order to retrieve every piece of clinical evidence that meets their review's inclusion criteria, commonly all rcts of a particular healthcare intervention in a specific population. in a search, no piece of relevant information should be missed. in other words, the aim is to achieve a recall score of one. this implies that the searches are broad (lefebvre et al., ) , and authors are often left to screen a large number of abstracts manually in order to identify a small fraction of relevant publications for inclusion in the sr (borah et al., ) . the number of rcts is increasing, and with it increases the potential number of reviews and the amount of workload that is implied for each. research on the basis of pubmed entries shows that both the number of publications and the number of srs increased rapidly in the last ten years (fontelo and liu, ) , which is why acceleration of the systematic reviewing process is of interest in order to decrease working hours of highly trained researchers and to make the process more efficient. in this work, we focus on the detection and annotation of information about the pico elements of rcts described in english pubmed abstracts. in practice, the comparators involved in the c of pico are just additional interventions, so we often refer to pio (populations; interventions; outcomes) rather than pico. focus points for the investigation are the problems of ambiguity in labelled pio data, integration of training data from different tasks and sources and assessing our model's capacity for transfer learning and domain adaptation. recent advances in natural language processing (nlp) offer the potential to be able to automate or semi-automate the process of identifying information to be included in a sr. for example, an automated system might attempt to pico-annotate large corpora of abstracts, such as rcts indexed on pubmed, or assess the results retrieved in a literature search and predict which abstract or full text article fits the inclusion criteria of a review. such systems need to be able to classify and extract data of interest. we show that transformer models perform well on complex dataextraction tasks. language models are moving away from the semantic, but static representation of words as in word vec (mikolov et al., ) , hence providing a richer and more flexible contextualized representation of input features within sentences or long sequences of text. the rest of this paper is organized as follows. 
the remainder of this section introduces related work and the contributions of our work. section describes the process of preparing training data, and introduces approaches to fine-tuning for sentence classification and question answering tasks. results are presented in section , and section includes a critical evaluation and implications for practice. the website systematicreviewtools.com (marshall, ) lists software tools for study selection to date. some tools are intended for organisational purposes and do not employ pico classification, such as covidence (covidence, ) . the tool rayyan uses support vector machines (ouzzani et al., ) . robotreviewer uses neural networks, word embeddings and recently also a transformer for named entity recognition (ner) (marshall et al., ) . question answering systems for pico data extraction exist based on matching words from knowledge bases, hand-crafted rules and naïve bayes classification, both on entity and sentence level (demner-fushman and lin, ) , (niu et al., ) , but commonly focus on providing information to practicing clinicians rather than systematic reviewers (vong and then, ) . in the following we introduce models related to our sentence and entity classification tasks and the data on which our experiments are based. we made use of previously published training and testing data in order to ensure comparability between models. in the context of systematic review (semi)automation, sentence classification can be used in the screening process, by highlighting relevant pieces of text. a long short-term memory (lstm) neural network trained with sentences of structured abstracts from pubmed was published in (jin and szolovits, ) . it uses a pre-trained word vec embedding in order to represent each input word as a fixed vector. due to the costs associated with labelling, its authors acquired sentence labels via automated annotation. seven classes were assigned on the basis of structured headings within the text of each abstract. table provides an overview of class abbreviations and their meaning. in the following we refer to it as the pubmed data. the lstm itself yields impressive results with f scores for annotation of up to . for pio elements, it generalizes across domains and assigns one label per sentence. we were able to confirm these scores by replicating a local version of this model. the stanford question answering dataset (squad) is a reading-comprehension dataset for machine learning tasks. it contains question contexts, questions and answers and is available in two versions. the older version contains only questions that can be answered based on the given context. in its newer version, the dataset also contains questions which can not be answered on the basis of the given context. the squad creators provide an evaluation script, as well as a public leader board to compare model performances (rajpurkar et al., ) . in the pico domain, the potential of ner was shown by nye and colleagues in using transformers, as well as lstm and conditional random fields. in the following, we refer to these data as the ebm-nlp corpus. (nye et al., ) . the ebm-nlp corpus provided us with tokenized and annotated rct abstracts for training, and expert-annotated abstracts for testing. annotation in this corpus include pio classes, as well as more detailed information such as age, gender or medical condition. we adapted the humanannotated ebm-nlp corpus of abstracts for training our qa-bert question answering system. 
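to make the automated annotation of the pubmed corpus described above more concrete, the following is a minimal sketch of heading-based sentence labelling; the heading strings and their mapping to the seven classes are illustrative assumptions, not the exact rules used by the corpus authors.

```python
# minimal sketch of heading-based sentence labelling for structured abstracts;
# the heading-to-class mapping below is assumed for illustration only.
import re

HEADING_TO_CLASS = {            # assumed mapping, for illustration only
    "objective": "A", "background": "A",
    "population": "P", "participants": "P",
    "intervention": "I",
    "outcome": "O", "outcomes": "O",
    "methods": "M", "results": "R", "conclusions": "C",
}

def label_sentences(structured_abstract: str):
    """assign one class per sentence based on the heading it appears under."""
    labelled, current = [], None
    for line in structured_abstract.splitlines():
        heading = line.split(":", 1)[0].strip().lower()
        if heading in HEADING_TO_CLASS:
            current = HEADING_TO_CLASS[heading]
            parts = line.split(":", 1)
            line = parts[1] if len(parts) > 1 else ""
        for sent in re.split(r"(?<=[.!?])\s+", line.strip()):
            if sent and current:
                labelled.append((sent, current))
    return labelled

print(label_sentences("PARTICIPANTS: adults with osteoarthritis were enrolled.\n"
                      "RESULTS: pain scores improved."))
```

automated labelling of this kind is cheap but inherits the limitations discussed below: every sentence receives exactly one label, taken from the heading it sits under.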
in the following, the bidirectional encoder representations from transformers (bert) architecture is introduced (devlin et al., a) . this architecture's key strengths are rooted in both feature representation and training. a good feature representation is essential to ensure any model's performance, but often data sparsity in the unsupervised training of embedding mechanisms leads to losses in overall performance. by employing a word piece vocabulary, bert eliminated the problem of previously unseen words. any word that is not present in the initial vocabulary is split into a sub-word vocabulary. especially in the biomedical domain this enables richer semantic representations of words describing rare chemical compounds or conditions. a relevant example is the phrase 'two drops of ketorolac tromethamine', where the initial three words stay intact, while the last words are tokenized to 'ket', '#oro', '#lac', 'tro', '#meth', '#amine', hence enabling the following model to focus on relevant parts of the input sequence, such as syllables that indicate chemical compounds. when obtaining a numerical representation for its inputs, transformers apply a 'self-attention' mechanism, which leads to a contextualized representation of each word with respect to its surrounding words. bert's weights are pre-trained in an unsupervised manner, based on large corpora of unlabelled text and two pre-training objectives. to achieve bidirectionality, its first pre-training objective includes prediction of randomly masked words. secondly, a next-sentence prediction task trains the model to capture long-term dependencies. pre-training is computationally expensive but needs to be carried out only once before sharing the weights together with the vocabulary. fine-tuning to various downstream tasks can be carried out on the basis of comparably small amounts of labelled data, by changing the upper layers of the neural network to classification layers for different tasks. scibert is a model based on the bert-base architecture, with further pre-trained weights based on texts from the semantic scholar search engine (al-lenai, ). we used these weights as one of our three starting points for fine-tuning a sentence classification architecture (beltagy et al., ) . furthermore, bert-base (uncased) and bert multilingual (cased, base architecture) were included in the comparison (devlin et al., a) . in the following, we discuss weaknesses in the pubmed data, and lstm models trained on this type of labelled data. lstm architectures commonly employ a trimmed version of word vec embeddings as embedding layer. in our case, this leads to % of the input data being represented by generic 'unknown' tokens. these words are missing because they occur so rarely that no embedding vector was trained for them. trimming means that the available embedding vocabulary is then further reduced to the known words of the training, development and testing data, in order to save memory and increase speed. the percentage of unknown tokens is likely to increase when predicting on previously unseen and unlabelled data. we tested our locally trained lstm on abstracts from a study-based register and found that % of all unique input features did not have a known representation. in the case of the labelled training and testing data itself, automatic annotation carries the risk of producing wrongly labelled data. 
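the word piece behaviour discussed above can be reproduced with the huggingface tokenizer; this is a sketch only, and the exact sub-word split produced by the released vocabulary may differ from the example quoted in the text (the released vocabulary marks sub-word continuations with '##').

```python
# sketch: word piece tokenization of a rare drug name with the public
# bert-base-uncased vocabulary; rare words are decomposed into sub-word units
# instead of being mapped to a generic 'unknown' token.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("two drops of ketorolac tromethamine"))
```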
but it also enables the training of neural networks in the first place because manual gold standard annotations for a project on the scale of a lstm are expensive and time-consuming to produce. as we show later, the automated annotation technique causes noise in the evaluation because as the network learns, it can assign correct tags to wrongly labelled data. we also show that sentence labels are often ambiguous, and that the assignment of a single label limits the quality of the predictions for their use in real-world reviewing tasks. we acknowledge that the assignment of classes such as 'results' or 'conclusions' to sentences is potentially valuable for many use-cases. however, those sentences can contain additional information related to the pico classes of interest. in the original lstmbased model the a, m, r, and c data classes in table are utilized for sequence optimization, which leads to increased classification scores. their potential pico content is neglected, although it represents crucial information in real-world reviewing tasks. a general weakness of predicting labels for whole sentences is the practical usability of the predictions. we will show sentence highlighting as a potential use-case for focusing reader's attention to passages of interest. however, the data obtained through this method are not fine-grained enough for usage in data extraction, or for the use in pipelines for automated evidence synthesis. therefore, we expand our experiments to include qa-bert, a question-answering model that predicts the locations of pico entities within sentences. in this work we investigate state-of-the-art methods for language modelling and sentence classification. our contributions are centred around developing transformer-based fine-tuning approaches tailored to sr tasks. we compare our sentence classification with the lstm baseline and evaluate the biggest set of pico sentence data available at this point (jin and szolovits, ) . we demonstrate that models based on the bert architecture solve problems related to ambiguous sentence labels by learning to predict multiple labels reliably. further, we show that the improved feature representation and contextualization of embeddings lead to improved performance in biomedical data extraction tasks. these fine-tuned models show promising results while providing a level of flexibility to suit reviewing tasks, such as the screening of studies for inclusion in reviews. by predicting on multilingual and full text contexts we showed that the model's capabilities for transfer learning can be useful when dealing with diverse, real-world data. in the second fine-tuning approach, we apply a question answering architecture to the task of data extraction. previous models for pico question answering relied on vast knowledge bases and hand-crafted rules. our fine-tuning approach shows that an abstract as context, together with a combination of annotated pico entities and squad data can result in a system that outperforms contemporary entity recognition systems, while retaining general reading comprehension capabilities. . feature representation and advantages of contextualization a language processing model's performance is limited by its capability of representing linguistic concepts numerically. in this preliminary experiment, we used the pubmed corpus for sentence classification to show the quality of pico sentence embeddings retrieved from bert. 
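the fixed-length pico sentence vectors used in this preliminary experiment, and described further below, can be obtained along the following lines; mean pooling over tokens and concatenation of two hidden layers are assumptions made for this sketch.

```python
# sketch: fixed-length sentence vectors from bert hidden layers
# (mean pooling over tokens, concatenation of selected layers).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def sentence_vector(sentence: str, layers=(-1, -2)) -> torch.Tensor:
    """mean-pool token embeddings of the chosen layers and concatenate them."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # embedding layer + encoder layers
    pooled = [hidden_states[layer].mean(dim=1).squeeze(0) for layer in layers]
    return torch.cat(pooled)

vec = sentence_vector("patients with osteoarthritis were randomised to fasting or control.")
print(vec.shape)
```

vectors of this kind can then be projected with t-sne or clustered with k-means to compare how well individual layers separate the sentence classes.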
we mapped a random selection of population, intervention, and outcome sentences from the pubmed corpus to bert-base uncased and scibert. this resulted in each sentence being represented by a fixed length vector of dimensions in each layer respectively, as defined by the model architecture's hidden size. these vectors can be obtained for each of the network's layers, and multiple layers can be represented together by concatenation and pooling. we used the t-distributed stochastic neighbour embedding (t-sne) algorithm to reduce each layer-embedding into two-dimensional space, and plotted the resulting values. additionally, we computed adjusted rand scores in order to evaluate how well each layer (or concatenation thereof, always using reduce mean pooling) represents our input sequence. the rand scores quantify the extent to which a naïve k-means (n= ) clustering algorithm in different layers alone led to correct grouping of the input sentences. we used the pubmed corpus to fine-tune a sentence classification architecture. class names and abbreviations are displayed in table . the corpus was supplied in pre-processed form, comprising , abstracts. for more information about the original dataset we refer to its original publication (jin and szolovits, ) . because of the pico framework, methods for systematic review semi(automation) commonly focus on p, i, and o detection. a, m, r, and c classes are an additional fea-ture of this corpus. they were included in the following experiment because they represent important information in abstracts and they occur in a vast majority of published trial text. their exclusion can lead to false classification of sentences in full abstracts. in a preliminary experiment we summarized a, m, r, and c sentences as a generic class named 'other' in order to shift the model's focus to pio classes. this resulted in high class imbalance, inferior classification scores and a loss of ability to predict these classes when supporting systematic reviewers during the screening process. in the following, abstracts that did not include a p, i, and o label were excluded. this left a total of , sentences for training, and , for testing ( : split). we carried out fine-tuning for sentence classification based on bert-base (uncased), multilingual bert (cased), and on scibert. we changed the classification layer on top of the original bert model. it remains as linear, fully connected layer but now employs the sigmoid cross-entropy loss with logits function for optimization. during training, this layer is optimised for predicting probabilities over all seven possible sentence labels. therefore, this architecture enables multi-class, multi-label predictions. in comparison, the original bert fine-tuning approach for sentence classification employed a softmax layer in order to obtain multi-class, single-label predictions of the most probable class only. during the training process the model then predicts class labels from table for each sentence. after each training step, backpropagation then adjusts the model's internal weights. to save gpu resources, a maximal sequence length of , batch size , learning rate of × − , a warmup proportion of . and two epochs for training were used. in the scope of the experiments for this paper, the model returns probabilities for the assignment of each class for every sentence. these probabilities were used to show effects of different probability thresholds (or simply assignment to the most probable class) on recall, precision and f scores. 
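a minimal sketch of the multi-label classification head and threshold-based label assignment described above is given below; it follows the text (a linear layer on top of the pooled output, sigmoid cross-entropy with logits, seven classes), but the wiring is illustrative rather than the authors' exact implementation.

```python
# sketch: multi-label sentence classification head on top of bert,
# trained with sigmoid cross-entropy ("bce with logits") and used with a
# tunable probability threshold at prediction time.
import torch
from torch import nn
from transformers import BertModel

class PicoSentenceClassifier(nn.Module):
    def __init__(self, num_labels: int = 7, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
        self.loss_fn = nn.BCEWithLogitsLoss()  # sigmoid cross-entropy with logits

    def forward(self, input_ids, attention_mask, labels=None):
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        logits = self.classifier(pooled)
        loss = self.loss_fn(logits, labels.float()) if labels is not None else None
        return loss, torch.sigmoid(logits)

def assign_labels(probabilities: torch.Tensor, threshold: float = 0.5):
    """multi-label assignment: every class above the threshold is kept."""
    return (probabilities >= threshold).int()
```

because the threshold is applied after prediction, lowering or raising it trades precision against recall without re-running the model, which is the post-classification trade-off referred to in the text.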
the number of classes was set to , thereby making use of the full pubmed dataset. both the training and testing subsets from the ebmnlp data were adapted to fit the squad format. we merged both datasets in order to train a model which firstly correctly answers pico questions on the basis of being trained with labelled ebm-nlp data, and secondly retains the flexibility of general-purpose question answering on the basis of squad. we created sets of general, differently phrased p, i, and o questions for the purpose of training a broad representation of each pico element question. in this section we describe the process of adapting the ebm-nlp data to the second version of the squad format, and then augmenting the training data with some of the original squad data. figure shows an example of the converted data, together with a highlevel software architecture description for our qa-bert model. we created a conversion script to automate this task. to reduce context length, it first split each ebm-nlp abstract into sentences. for each p, i, and o class it checked the presence of annotated entity spans in the ebm-nlp source files. then, a question was randomly drawn from our set of general questions for this class, to complete a context and a span-answer pair in forming a new squad-like question element. in cases where a sentence did not contain a span, a question was still chosen, but the answer was marked as impossible, with the plausible answer span set to begin at character . in the absence of impossible answers, the model would always return some part of the context as answer, and hence be of no use for rarer entities such as p, which only occurs in only % of all context sentences. for the training data, each context can contain one possible answer, whereas for testing multiple question-answer pairs are permitted. an abstract is represented as a domain, subsuming its sentences and question answer-text pairs. in this format, our adapted data are compatible with the original squad v. dataset, so we chose varying numbers of original squad items and shuffled them into the training data. this augmentation of the training data aims to reduce the dependency on large labelled corpora for pico entity extraction. testing data can optionally be enriched in the same way, but for the presentation of our results we aimed to be comparable with previously published models and therefore chose to evaluate only on the subset of expert-annotated ebm-nlp testing data. the python huggingface transformers library was used for fine-tuning the question-answering models. this classification works by adding a spanclassification head on top of a pre-trained transformer model. the span-classification mechanism learns to predict the most probable start and end positions of potential answers within a given context (wolf et al., ) . the transformers library offers classes for tokenizers, bert and other transformer models and provides methods for feature representation and optimization. we used bertforquestionanswering. training was carried out on google's colab, using the gpu runtime option. we used a batch size of per gpu and a learning rate of − . training lasted for epochs, context length was limited to . to reduce the time needed to train, we only used bertbase (uncased) weights as starting points, and used a maximum of out of the squad domains. 
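the conversion of ebm-nlp annotations into squad v2-style records, including unanswerable questions with a plausible answer span starting at character 0, can be sketched as follows; the question wording and the example span are hypothetical, while the field names follow the public squad v2 json conventions.

```python
# sketch: one ebm-nlp sentence converted into a squad v2-style paragraph entry.
import json
from typing import Optional

def to_squad_item(sentence: str, question: str, span_text: Optional[str]):
    """build one squad v2-style paragraph entry from a single sentence context."""
    if span_text and span_text in sentence:
        qa = {"question": question, "id": "example-1", "is_impossible": False,
              "answers": [{"text": span_text,
                           "answer_start": sentence.index(span_text)}]}
    else:
        # no annotated entity in this sentence: mark the question unanswerable,
        # with a plausible answer span starting at character 0
        qa = {"question": question, "id": "example-1", "is_impossible": True,
              "answers": [],
              "plausible_answers": [{"text": "", "answer_start": 0}]}
    return {"context": sentence, "qas": [qa]}

item = to_squad_item(
    "patients received ketorolac tromethamine suppositories after surgery.",
    "which intervention did the participants receive?",   # hypothetical question
    "ketorolac tromethamine suppositories")
print(json.dumps({"version": "v2.0",
                  "data": [{"title": "example-abstract",
                            "paragraphs": [item]}]}, indent=2))
```

records built this way can be shuffled together with original squad domains before fine-tuning, which is the augmentation step described above.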
to date, the transformers library includes several bert, xlm, xlnet, distilbert and albert question answering models that can be fine-tuned with the scripts and data that we describe in this paper. figure shows the dimensionality-reduced vectors for sentences in bert-base, along with the positions of three exemplary sentences. all three examples were labelled as 'p' in the gold standard. this visualization highlights overlaps between the sentence data and ambiguity or noise in the labels. sentences and are labelled incorrectly, and clearly appear far away from the population class centroid. sentence is an example of an ambiguous case. it appears very close to the population centroid, but neither its label nor its position reflect the intervention content. this supports a need for multiple tags per sentence, and the fine-tuning of weights within the network. figure shows the same set of sentences, represented by concatenations of scibert outputs. scibert was chosen as an additional baseline model for fine-tuning because it provided the best representation of embedded pico sentences. when clustered, its embeddings yielded an adjusted rand score of . for a concatenation of the two layers, compared with . for bert-base. precision, recall, and f scores, including a comparison with the lstm, are summarized in table . underlined scores represent the top score across all models, and scores in bold are the best results for singleand multi-label cases respectively. the lstm assigns one label only and was outperformed in all classes of main interest (p, i, and o). a potential pitfall of turning this task into multilabel classification is an increase of false-positive predictions, as more labels are assigned than given in the single-labelled testing data in the first place. however, the fine-tuned bert models achieved high f scores, and large improvements in terms of recall and precision. in its last row, table shows different probability thresholds for class assignment when using the pubmed dataset and our fine-tuned scibert model for multi-label prediction. after obtaining the model's predictions, a simple threshold parameter can be used to obtain the final class labels. on our labelled testing data, we tested evenly spaced thresholds between and in order to obtain these graphs. here, recall and precision scores in ranges between . and . are possible with f scores not dropping below . for the main classes of interest. in practice, the detachment between model predictions and assignment of labels means that a reviewer who wishes to switch between high recall and high precision results can do so very quickly, without obtaining new predictions from the model itself. table : predicting picos in chinese and german. classes were assigned based on foreign language inputs only. for reference, translations were provided by native speakers. original sentences with english translations for reference chinese population "方法:選擇 - / - 在惠州市第二人民醫院精神科精神分裂癥住院的患者 例, 簡明精神癥狀量表總分> 分,陰性癥狀量表總分> 分." (huang et al., ) translation: "methods: in the huizhou no. people's hospital (mar -mar , patients with psychiatric schizophrenia were selected. total score of the brief psychiatric rating scale was > , and total score of the negative syndrome scale was > in each patient." intervention " 隨機分為 組,泰必利組與奎的平組,每組 例,患者家屬知情同意. translation: " . they were randomly divided into groups, tiapride group and quetiapine group, with cases in each group. patients' family was informed and their consent was obtained." 
初始劑量 mg,早晚各 次,以后隔日增加 mg,每日劑量范圍 ~ mg." (huang et al., ) translation: "the initial dose was mg, once in the morning and evening. the dose was increased by mg every other day, reaching a daily dose in the range of and mg." outcome " 效果評估:使用陰性癥狀量表評定用藥前后療效及陰性癥狀改善情況,使用副反應量表評價藥 物的安全性." (huang et al., ) translation: " evaluation: the negative symptom scale was used to evaluate the efficacy of the drug before and after treatment and the improvement of the negative symptoms. the treatment emergent symptom scale was used to evaluate the safety of the drug." method "實驗過程為雙盲." (huang et al., ) translation: "the experimental process was double-blind." german aim "ziel: untersuchung der wirksamkeit ambulanten heilfastens auf schmerz, befindlichkeit und gelenkfunktion bei patienten mit arthrose." (schmidt et al., ) translation: "aim: to investigate outpatient therapeutic fasting and its effects on pain, wellbeing and joint-function in patients with osteoarthritis. more visualizations can be found in this project's github repository , including true class labels and a detailed breakdown of true and false predictions for each class. the highest proportion of false classification appears between the results and conclusion classes. the fine-tuned multilingual model showed marginally inferior classification scores on the exclusively english testing data. however, this model's contribution is not limited to the english language because its interior weights embed a shared vocabulary of languages, including german and chinese . our evaluation of the multilingual model's capacity for language transfer is of a qualitative nature, as there were no labelled chinese or german data available. table shows examples of two abstracts, as predicted by the model. additionally, this table demonstrates how a sentence prediction model can be used to highlight text. with the current infrastructure it is possible to highlight picos selectively, to highlight all classes simultaneously, and to adjust thresholds for class assignment in order to increase or decrease the amount of highlighted sentences. when applied to full texts of rcts and cohort studies, we found that the model retained its ability to identify and highlight key sentences correctly for each class. we tested various report types, as well as recent and old publications, but remain cautious that large scale testing on labelled data is needed to draw solid conclusions on these model's abilities for transfer learning. for further examples in the english language, we refer to our github repository. we trained and evaluated a model for each p, i, and o class. table shows our results, indicated as qa-bert, compared with the currently published leader board for the ebm-nlp data (nye et al., ) and results reported by the authors of scibert (beltagy et al., ) . for the p and i classes, our models outperformed the results on this leader board. the index in our model names indicates the amount of additional squad domains added to the training data. we never used the full squad data in order to reduce time for training but observed increased performance when adding additional data. for classifying i entities, an increase from to additional squad domains resulted in an increase of % for the f score, whereas the increase for the o domain was less than %. after training a model with additional squad domains, we also evaluated it on the original squad development set and obtained a f score of . for this general reading comprehension task. 
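as an illustration of how a fine-tuned extractive question answering model of this kind can be queried for pico elements, the sketch below uses the huggingface question-answering pipeline; the model path is a placeholder for locally fine-tuned qa-bert weights, not a published checkpoint.

```python
# sketch: querying a locally fine-tuned extractive qa model with a pico question.
from transformers import pipeline

# "path/to/fine-tuned-qa-bert" is a hypothetical local checkpoint directory
qa = pipeline("question-answering",
              model="path/to/fine-tuned-qa-bert",
              tokenizer="path/to/fine-tuned-qa-bert")

context = ("the analgesia activity of ketorolac tromethamine suppositories was "
           "evaluated after single dose administration in adult patients.")
result = qa(question="which intervention did the participants receive?",
            context=context,
            handle_impossible_answer=True)  # allow 'no answer', as in squad v2
print(result)  # dict with 'score', 'start', 'end' and 'answer'
```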
in this evaluation, the f scores represent the overlap of labelled and predicted answer spans on token level. we also obtained scores for the subgroups of sentences that did not contain an answer versus the ones that actually included pico elements. these results are shown in table . for the p class, only % of all sentences included an entity, whereas its sub-classes age, gender, condition and size averaged % each. in the remaining classes, these percentages were higher. f scores for correctly detecting that a sentence includes no pico element exceeded . in all classes. this indicates that the addition of impossible answer elements was successful, and that the model learned a representation of how to discriminate pico contexts. the scores for correctly predicting picos in positive scenarios are lower. these results are presented in table . here, two factors could influence this score in a negative way. first, labelled spans can be noisy. training spans were annotated by crowd workers and the authors of the original dataset noted inter-annotator disagreement. often, these spans include full stops, other punctuation or different levels of detail describing a pico. the f score decreases if the model predicts a pico, but the predicted span includes marginal differences that were not marked up by the experts who annotated the testing set. second, some spans include multiple picos, sometimes across sentence boundaries. other spans mark up single picos in succession. in these cases the model might find multiple picos in a row, and annotate them as one or vice versa. in this work, we have shown possibilities for sentence classification and data extraction of pico characteristics from abstracts of rcts. for sentence classification, models based on transformers can predict multiple labels per sentence, even if trained on a corpus that assigns a single label only. additionally, these architectures show a great level of flexibility with respect to adjusting precision and recall scores. recall is an important metric in sr tasks and the architectures proposed in this paper enable a post-classification trade-off setting that can be adjusted in the process of supporting reviewers in realworld reviewing tasks. however, tagging whole sentences with respect to populations, interventions and outcomes might not be an ideal method to advance systematic review automation. identifying a sentence's tag could be helpful for highlighting abstracts from literature searches. this focuses the reader's attention on sentences, but is less helpful for automatically determining whether a specific entity (e.g. the drug aspirin) is mentioned. our implementation of the question answering task has shown that a substantial amount of pico entities can be identified in abstracts on a token level. this is an important step towards reliable systematic review automation. with our provided code and data, the qa-bert model can be switched with more advanced transformer architectures, including xlm, xlnet, distilbert and albert pre-trained mod- which intervention did the participants receive? prediction context auditory integrative training to explore the short-term treatment effect of the auditory integrative training on autistic children and provide them with clinical support for rehabilitative treatment. 
ketorolac tromethamine mg and mg suppositories the analgesia activity of ketorolac tromethamine mg and mg suppositories were evaluated after single dose administration by assessing pain intensity and pain relief using a point scale ( vrs ). what do power station steam turbines use as a cold sink in the absence of chp? surface condensers where chp is not used, steam turbines in power stations use surface condensers as a cold sink. the condensers are cooled by water flow from oceans, rivers, lakes, and often by cooling towers [. . . ] els. more detailed investigations into multilingual predictions (clef, ) pre-processing and predicting more than one pico per sentence are reserved for future work. limitations in the automatically annotated pubmed training data mostly consist of incomplete detection or noise p, i, and o entities due to the single labelling. we did not have access to multilingual annotated pico corpora for testing, and therefore tested the model on german abstracts found on pubmed, as well as chinese data provided by the cochrane schizophrenia group. for the question answering, we limited the use of original squad domains to enrich our data. this was done in order to save computing resources, as an addition of squad domains resulted in training time increases of two hours, depending on various other parameter settings. adjusted parameters include increased batch size, and decreased maximal context length in order to reduce training time. with this paper we aimed to explore state-ofthe-art nlp methods to advance systematic review (semi)automation. both of the presented fine-tuning approaches for transformers demonstrated flexibility and high performance. we contributed an approach to deal with ambiguity in whole-sentence predictions, and proposed the usage of a completely different approach to entity recognition in settings where training data are sparse. in conclusion we wish to emphasize our argument that for future applications, interoperability is important. instead of developing yet another stand-alone organizational interface with a machine learning classifier that works on limited data only, the focus should be to develop and train cross-domain and neural models that can be integrated into the backend of existing platforms. the performance of these models should be comparable on standardized datasets, evaluation scripts and leader boards. the logical next step, which remains less explored in the current literature because of its complexity, is the task of predicting an rct's included or excluded status on the basis of picos identified in its text. for this task, more complex architectures that include drug or intervention ontologies could be integrated. additionally, information from already completed reviews could be re-used as training data. 
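as a pointer to the backbone swap mentioned above, the sketch below (python, hugging face transformers) shows how a different pre-trained architecture can be loaded for the same squad-style fine-tuning simply by changing the checkpoint string. the checkpoint identifiers are public model-hub names, the question-answering head is freshly initialised at load time, and the fine-tuning loop itself is omitted; this is an illustration, not the authors' exact code.

    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    for checkpoint in ["distilbert-base-uncased", "albert-base-v2", "xlnet-base-cased"]:
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
        # ... fine-tune on the squad-style pico data exactly as with qa-bert ...
        n_params = sum(p.numel() for p in model.parameters())
        print(f"{checkpoint}: {n_params / 1e6:.1f}M parameters")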
semantic scholar scibert: pretrained contextualized embeddings for scientific text analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the prospero registry clef ehealth challenge multilingual information extraction covidence systematic review software, veritas health innovation knowledge extraction for clinical question answering : preliminary results bert: pre-training of deep bidirectional transformers for language understanding bert: pre-training of deep bidirectional transformers for language understanding a review of recent publication trends from top publishing countries cochrane handbook for systematic reviews of interventions emotion and social function of patients with schizophrenia and the interventional effect of tiapride: a double-blind randomized control study pico element detection in medical text via long short-term memory neural networks chapter : starting a review chapter : searching for and selecting studies the systematic review toolbox automating biomedical evidence synthesis: robotreviewer efficient estimation of word representations in vector space answering clinical questions with role identification a corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature ebm-nlp leaderboard rayyan-a web and mobile app for systematic reviews squad: , + questions for machine comprehension of text introducing raptor: revman parsing tool for reviewers unkontrollierte klinische studie zur wirksamkeit ambulanten heilfastens bei patienten mit arthrose study-based registers reduce waste in systematic reviewing: discussion and case report information seeking features of a pico-based medical question-answering system huggingface's transformers: state-of-the-art natural language processing bert-as-service we would like to thank clive adams for providing testing data and feedback for this project. we thank vincent cheng for the chinese translation. furthermore, we thank the bert team at google re-search and allenai for making their pre-trained model weights available. finally, we acknowledge the huggingface team and thank them for implementing the squad classes for transformers. scripts and supplementary material, as well as further illustrations are available from https:// github.com/l-ena/healthinf . training data for sentence classification and question answering are freely available from the cited sources.additionally, the cochrane schizophrenia group extracted, annotated and made available data from studies included in over systematic reviews. this aims at supporting the development of methods for reviewing tasks, and to increase the re-use of their data. these data include risk-of-bias assessment, results including all clean and published outcome data extracted by reviewers, data on picos, methods, and identifiers such as pubmed id and a link to their study-based register. additionally, a senior reviewer recently carried out a manual analysis of all , outcome names in these reviews, parsed and allocated to , unique outcomes in eight main categories (schmidt et al., ) . 
key: cord- -zdxebonv authors: chen, lichin; tsao, yu; sheu, ji-tian title: using deep learning and explainable artificial intelligence in patients' choices of hospital levels date: - - journal: nan doi: nan sha: doc_id: cord_uid: zdxebonv in countries that enabled patients to choose their own providers, a common problem is that the patients did not make rational decisions, and hence, fail to use healthcare resources efficiently. this might cause problems such as overwhelming tertiary facilities with mild condition patients, thus limiting their capacity of treating acute and critical patients. to address such maldistributed patient volume, it is essential to oversee patients choices before further evaluation of a policy or resource allocation. this study used nationwide insurance data, accumulated possible features discussed in existing literature, and used a deep neural network to predict the patients choices of hospital levels. this study also used explainable artificial intelligence methods to interpret the contribution of features for the general public and individuals. in addition, we explored the effectiveness of changing data representations. the results showed that the model was able to predict with high area under the receiver operating characteristics curve (auc) ( . ), accuracy ( . ), sensitivity ( . ), and specificity ( . ) with highly imbalanced label. generally, social approval of the provider by the general public (positive or negative) and the number of practicing physicians serving per ten thousand people of the located area are listed as the top effecting features. the changing data representation had a positive effect on the prediction improvement. deep learning methods can process highly imbalanced data and achieve high accuracy. the effecting features affect the general public and individuals differently. addressing the sparsity and discrete nature of insurance data leads to better prediction. applications using deep learning technology are promising in health policy making. more work is required to interpret models and practice implementation. while ensuring the accessibility to care, some countries applied a "gate-keeping" strategy and others empowered patients with the freedom to choose their own providers. in the gate-keeping strategy, the patients first visit general practitioners (gp) for medical advice, and the gp decides whether to refer the patient to a secondary or tertiary institution. the intention of the gate-keeping strategy is to enhance efficiency, regulate cost, and reduce wait time for secondary care. however, there are some limitations to this strategy such as gps limiting the possibility of specialists responding to patient and market demands [ , ] ; further, the subsequent delayed referral can cause other problems [ ] . some countries have reformed their strategy to offer patients with the freedom to choose providers. patients are to "vote with their feet" and choose health providers who fit their preferences and needs [ ] . this strategy empowers the patients by prompting providers to compete for patients through a customer-market mechanism, such as improving their services, such as care quality, efficiency, and wait time [ , ] . however, such expectations has preconditions. patients are expected to make their choices based on sufficient information and rational decision [ , ] , which is commonly not the case. other studies report that patients have shown inadequate ability to use comparative information during provider selection [ , ] . 
past studies indicate that the patients' choice is a complex interrelationship between the characteristics of the patients, the providers, and the incident itself [ ] . the decision may differ based on the characteristics of individuals, the characteristics of the provider, and the condition of the incidence, which makes it difficult to evaluate in advance. however, researchers have developed several techniques to simulate and forecast the patient flow, patient volume, resource allocation, and patient choices. the gravity model, for instance, calculates the spatial interaction between a community and a hospital using population mass of the community, capacity and service mix of hospitals, and distances (or traveling time) [ , ] . the aggregate hospital choice model (ahcm) is intended to model hospital choices through the market share. based on historical data, ahcm uses time-series techniques to forecast future patient volumes [ ] . forecast and simulation techniques such as mean absolute percentage errors, autoregressive integrated moving average (arima), seasonal arima [ ] , and discrete event simulation models [ , ] have also been used previously. however, these theories have strict preconditions and partially explain the choice scenario. recently, deep learning methods have gained popularity. they capture the underlying pattern of data by transforming data into a more abstract matter and classifying them based on the latent distribution [ , ] . the deep learning approach has been proven effective and has shown excellent performance in a wide variety of applications [ , , ] , such as disease risk forecasting [ ] , vital signs classification into physiological symptoms [ , ] , image classification for diagnosis [ , ] , text-based medical condition recognition [ , ] , and clinical event forecasting [ ] . however, irrespective of the outstanding performance, the results of the deep learning technology remained a blackbox, which merely provided the results without reasons or any information that indicated how the conclusion was reached. this incapability of the deep learning method to gain trust and convince people to use its results limited its implementation in healthcare field. meanwhile, some previous studies indicate that the existing hierarchical coding scheme of electronic health recodes is not sufficiently representative [ , ] . it does not quantify the inherited similarity between concepts, and using deep learning to project discrete encodings into vector spaces may lead to a better analysis and prediction. while insurance data are the most commonly used data in policymaking, it is necessary to investigate the pattern of insurance data before applying deep learning. this study accumulates the possible features that were previously mentioned in related work on the patients' choices and aims to use deep neural networks (dnns) to generalize and predict the patients' choices. focusing on the hospital levels of the patients' choices, this study used explainable artificial intelligence (xai) methods to interpret the effecting features for the general public and individuals. in addition, this study explored the representations of insurance data by comparing the performance of model training with and without the preprocessing of changing data representations. the data used were the insurance data of the national health insurance (nhi) of taiwan. we characterized all the effecting features based on three main entities: the patient, the provider, and the incident. 
the characteristics of patients such as the age, gender, income, and previous medical experiences (positive and negative) were all possible features affecting the behavior of patients while accessing care [ , , ] . further, young people need relatively less care and females are more endurable than males; these facts along with the income of individuals affected the patients' willingness to visit a facility. meanwhile, studies [ ] indicated that the patients' satisfaction and loyalty were significantly related to the facility they would go to. when patients were satisfied with and loyal to a specific provider, they were less likely to change providers. continuity of care was also considered an effecting feature, which could be indicated as the duration and frequency of visiting a provider (known as density); those who visited a provider regularly were less likely to change providers. the frequency of changing providers (known as dispersion) could also be an indicator of continuity of care. the characteristics of a provider included the hospital reputation, hospital level, facilities at the hospital, and travel distances. commonly, patients considered that institutes with higher levels (secondary and tertiary institutes) possess better equipment, more skillful physicians, and higher reputations. some studies indicate that service quality affects the patients' demand [ ] , while others report that patients commonly neglect the quality indicators and prefer recommendations from associates [ ] . professionals such as gps are tempted to decide based on feedback from patients and colleagues as well as their cooperation experiences with the department or hospital, rather than official information such as quality of service or wait time [ ] . some patients even decide based on review websites [ , ] , which is another form of social approval and recommendations from others. studies [ ] also indicate that the patients' choices may change according to the condition of the incident, for example, the severity of the condition, the complication of the disease, and the number of patients with chronic disease. the health status of the patients would affect their willingness to travel distance. noncritical events and healthier patients would consider farther distance of travel and seek a second opinion. people would be tempted to go to emergency services as outpatient services are closed on weekends or holidays. people in taiwan have the freedom to choose their own providers without the referral of a gp [ , ] . under universal coverage, people do not possess the knowledge of choosing providers rationally. other than choosing physicians, people commonly consider the level of the institute. primary care service in taiwan are usually a physician of a certain specialty who owns a clinic and acts as a private practitioner [ , ] . hence, the primary care service commonly referred to clinics irrespective of specialty. a secondary and tertiary care often referred to regional/district hospitals and medical centers. people commonly consider that "large hospitals possess more skillful physicians and better medical facilities," resulting in patients, irrespective of their medical severity, swarming to medical centers [ , ] . according to a public opinion poll conducted in [ ] , although . 
% of the respondents agreed that for a mild condition the patient should go to the primary care service nearby instead of tertiary hospitals, % considered institutes with higher levels to possess better professional skills, and % expressed having confidence in determining the severity of their own condition. this phenomenon causes the recession of the primary care service and overcrowding of the tertiary care [ ] . consequently, tertiary facilities are overwhelmed with mild-conditioned patients and have limited capacity to treat acute and critical patients [ , ] . this also induces an increase in the cost of treating mild conditions. the "hierarchization of services," highlighted for several years, refers to visiting appropriate medical resources according to needs rather than swarming to tertiary institutes. approaches to ease such extreme unbalanced patient volume have been proposed, such as increasing copayments, strengthening referral mechanisms, limiting outpatient service volumes for tertiary facilities, and providing incentives for cooperation between different levels of providers. however, such a phenomenon still exists [ , ] . to address the maldistribution of patient volume, it is essential to oversee the choice of patients before strategy planning. this study aims to support the evaluation by providing a sophisticated tool to predict the patients' choice and provide the effecting features for them to do so. the aim of this study is to predict the patients' choices of hospital levels and interpret the reasons for the prediction. in addition, this study demonstrates the effectiveness of changing the representation of insurance data. the data used were the insurance-claims data from the two million clinical declaration files and the registry for beneficiaries files from the taiwan nhi research database (nhird), dated from january , , to december , . the data were originally sampled to ensure their representation of the population across taiwan. the files included the demographic information and visiting records of outpatients and emergency settings. some publicly announced data were added to enrich the records, including the "physician density" information that referred to the number of practicing physicians serving per ten thousand people in each region of taiwan [ ] . the national calendar was used to retrieve information on weekends and national holidays. incomplete or questionable data, such as individuals without birth date or gender (or with two genders), records without a date, birth date later than the visit date, patients without any visiting records, patients without a primary diagnosis, and incomplete information of visited hospitals were excluded. eighteen features were included in the analysis, which were characteristics of the patients, providers, and incidents. the following section will further outline the definition and calculation equation of the features. the targeted prediction outcome of this research is the four hospital levels, namely the medical center, regional hospital, district hospital, and clinic. the age, gender, low income (yes/no), total number of visits, total number of diseases, total number of chronic diseases, and four continuity indicators were included as characteristics of the patient. the age was determined according to the date of the visit. the denotation of low income was the status identified when entering the insurance. the total number of visits was the number of visiting records during the study period. 
the total number of diseases and chronic diseases were identified using the encoded international classification of diseases (icd) codes for each patient. the continuity indicators captured the duration, density, dispersion, and continuity of an individual while accessing care [ ] [ ] [ ] . to track duration and density of patients accessing care, the usual provider of care (upc) and least usual provider of care (lupc) were used to indicate the visiting ratio of medical institutes. the upc represented the most frequently visited institute, and the lupc represented the least frequently visited institute. the patients were required to visit the provider at least once to denote an institute as the lupc. the sequential continuity of care index (secoc) was used to calculate the change of providers, representing the dispersion of care. the continuity of care index (coci) was a single indicator that represented the continuity of care for an individual. the calculation of the continuity indicators are shown in equations ( ), ( ), ( ), and ( ), where n represents the total number of visits, n ! denotes the number of visits to the i th provider, k is the number of providers once visited, and c " is denoted as when the j th provider is the same as the (j+ ) th provider ( if not). we characterized the providers with indicators: the physician density of each region, the most frequent provider continuity (mfpc), and the least frequent provider continuity (lfpc). with reference to the patients' "vote with their feet" in choosing providers, we calculated the mfpc and lfpc to represent the patients' experiences and recommendations for each institute [ , ] . the mfpc represented the frequency of being voted as the upc, and the lfpc represented the frequency of being voted as the lupc. each patient could vote for only one mfpc and lfpc. the calculations are shown in equations ( ) and ( ), where p indicates the total number of patients, and upc ! and lupc ! indicate the i th patient who voted the provider as the upc or lupc, respectively. the incident was characterized based on five encoded features. through the encoded icd codes and treatment codes of visiting records, we identified whether a surgery was involved (yes/no), whether it was an emergency service (yes/no), whether it was considered as a severe condition (yes/no), whether the visit day was a work day (yes/no), and the disease importance rate (dir) of the target disease during that visit. the identification of surgeries and emergency services were based on the treatment codes defined by the nhird. the severity was defined based on emergency triage results. triage results rank for level to ( = resuscitation, = emergency, and = urgent) were listed as severe, and conditions that were included in the catastrophic illness announced by the nhi were also listed as severe. the date of the incident was distinguished as a work day or non-work day. to capture whether the visit of the patient was a regular or singular event, we used the dir to represent the importance of the target disease in that visit. the encoded primary diagnosis was identified as the target disease in that visit. the dir represents the ratio of importance and is calculated as shown in equation ( ), where n indicates the total number of visits of the patient, and di represents the total number of visits for disease di. we used only the primary diagnosis of the visit to identify the dir. this study used the dnn framework to train the patients' choices of hospital levels. 
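a minimal sketch of the patient-level indicators defined above is given below (python). the equation bodies were lost in extraction, so the formulas follow the standard definitions of upc, lupc, sequential continuity (secoc) and the bice-boxerman continuity of care index (coci) that the text appears to reference, together with the disease importance rate (dir); treat them as a reconstruction rather than the authors' code.

    from collections import Counter

    def continuity_indicators(providers):
        """providers: chronologically ordered provider ids of one patient's visits."""
        n = len(providers)
        counts = Counter(providers)                  # n_i: number of visits to each provider
        upc = max(counts.values()) / n               # share of visits to the usual provider
        lupc = min(counts.values()) / n              # share of visits to the least usual provider
        # sequential continuity: fraction of consecutive visit pairs at the same provider
        secoc = (sum(providers[j] == providers[j + 1] for j in range(n - 1)) / (n - 1)
                 if n > 1 else 1.0)
        # bice-boxerman continuity of care index
        coci = ((sum(c * c for c in counts.values()) - n) / (n * (n - 1))
                if n > 1 else 1.0)
        return {"UPC": upc, "LUPC": lupc, "SECOC": secoc, "COCI": coci,
                "k_providers": len(counts)}

    def disease_importance_rate(primary_diagnoses, target):
        """dir: share of a patient's visits whose primary diagnosis is the target disease."""
        return primary_diagnoses.count(target) / len(primary_diagnoses)

    print(continuity_indicators(["A", "A", "B", "A", "C"]))
    print(disease_importance_rate(["J10", "J10", "E11"], "J10"))

the mfpc and lfpc of a provider would then be obtained by counting, over all patients, how often that provider is the argmax (respectively argmin) of the visit counts above.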
dnn is a complex version of an artificial neural network (ann) that contains multiple hidden layers [ , ] , where every neuron in layer is fully connected to every other neuron in layer + . in a multi-layer neural network, each layer of the network is trained to produce a higher level of representation of the observed pattern. every layer produces a representation of the input pattern that is more abstract than the previous layer by composing more nonlinear operations [ , ] . the computation is shown in equation ( ) . each hidden layer computes a weighted and bias of the output from the previous layer, followed by a nonlinear active function that calculates the sum as outputs. the number of units in the previous layer is represented by and the output of the previous layer by . figure demonstrates the dnn architecture. to demonstrate the effect of changing data representations of insurance data, we designed a comparison. we used an autoencoder (ae) as the processor for the data representation change. ae is popular for processing scarce and noisy data [ , , ] . it encoded the input into a lower dimension space and then decoded the representation by reconstructing an approximate input %. the goal of the reconstruction was to minimize the mean square error of and % . equations ( ) and ( ) demonstrate the computation of the encoder and decoder, where and ′ denote the respective weights and and ′ denote the respective bias of the encoder and decoder. figure demonstrates the ae architecture. figure . autoencoder architecture xai methods enhance the interpretability of machine learning models [ , ] . this study adopted the shapley additive explanations (shap) [ ] . shap combines the desirable characteristics of other interpretation frameworks, including local interpretable model-agnostic explanations (lime) and deep learning important features (deeplift). the shap value was computed using all combinations of the input, and the average marginal contribution of a feature value over all possible coalitions was calculated. shap has the ability to interpret models globally and locally, that is, to show the general effects of features on the whole population and individuals. all features were aggregated into a visit vector. due to the imbalanced distribution of hospital levels, this study used a random undersampling strategy to sample the majority label and balance the training set. the model was trained on balanced data and tested on actual distributed data [ ] . meanwhile, to deal with numerical features with different scale levels, all the numerical values (including the age, number of diseases and chronic diseases, number of visits, number of votes as mfpc and lfpc, and physician density) were normalized between and , and the categorical features were transformed into a one-hot/dummy encoding before analysis. those indicators that were already ratio figures (values between and ) were used accordingly (including coci, upc, lupc, secoc, and dir). the data were randomly split into training data ( %) and testing data ( %). a five-fold cross-validation training strategy was used. the data processing flow is shown in figure . the proposed dnn model contained input nodes (based on input features) and hidden layers with neurons each. the rectified linear unit (relu) active functions were used for each layer. four output nodes symbolized the hospital levels. 
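a minimal pytorch sketch of the two networks described above is given below. the eighteen input features and the four hospital-level outputs follow the text; the number and width of hidden layers and the 16-dimensional bottleneck are illustrative assumptions, since those values were stripped from the text, and the autoencoder here is shallower than the five-hidden-layer version described later.

    import torch
    import torch.nn as nn

    N_FEATURES, N_CLASSES, HIDDEN, LATENT = 18, 4, 64, 16   # hidden sizes are assumptions

    class ChoiceDNN(nn.Module):
        """fully connected classifier: each hidden layer is a relu over an affine map."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(N_FEATURES, HIDDEN), nn.ReLU(),
                nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                nn.Linear(HIDDEN, N_CLASSES),                # four logits, one per hospital level
            )

        def forward(self, x):
            return self.net(x)

    class Autoencoder(nn.Module):
        """encoder/decoder trained to minimise the reconstruction mse of the input."""
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(N_FEATURES, HIDDEN), nn.ReLU(),
                                         nn.Linear(HIDDEN, LATENT), nn.Sigmoid())
            self.decoder = nn.Sequential(nn.Linear(LATENT, HIDDEN), nn.ReLU(),
                                         nn.Linear(HIDDEN, N_FEATURES))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    x = torch.rand(32, N_FEATURES)                            # a dummy mini-batch
    recon_loss = nn.functional.mse_loss(Autoencoder()(x), x)  # reconstruction objective
    logits = ChoiceDNN()(x)                                   # hospital-level scores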
optimization was carried out by mini-batch stochastic gradient descent that iterated through small subsets of the training data and modified the parameters in the opposite direction of the gradient of the loss function to minimize the reconstruction error. the data representation changing comparison is done through the same training process, except that one is preprocessed with ae, and the other is not, as shown in figure . the ae model consisted of hidden layers; the neurons in those layers were , , , , and . the relu active function was used for the encoder (first layer) and decoder (last layer) and a sigmoid active function for the latent space conversion (the third layer). the performance indicators used here are the area under the receiver operating characteristics curve (auc), accuracy, sensitivity/recall, specificity, precision, and f score, as shown in equations ( ) to ( ) , where tp, tn, fp, and fn denote for true positive, true negative, false positive, and false negative. for a multi-class classification of hospital levels, the macro average was used to generalize the performance, which computed the metric independently for each class and then took the average to consider each class equally. the auc used the one-vs-rest scheme to demonstrate the general performance. the result also compared the shap value of the trained model, with and without ae preprocessing, to show the effect of representation change on feature importance. this study was implemented with python version . . , combined with pytorch framework . a total of , patients and , , visiting records were analyzed, where . % patients chose to go to the clinics, . % to the district hospital, . % to the regional hospital, and . % to the medical center. tables to demonstrate the information of patients, providers, and incidents. the performance results are listed in table . processing without and with ae reached an auc of . and . , respectively. therefore, changing data representations led to an increase of . in auc. the mean shap values are listed in figure , with the mfpc, physician density, and lfpc listed as the top factors. the contribution could be positive or negative. the contribution of other features, even combined together, was very less. the ae process changed the contribution ranking, shifting the physician density from the second to the third position. figure shows the interpretation analysis of individual cases. the base value represents the decision of the general public ( . ). the red color symbolizes the features that contributed positively and pushed the decision above the base value, and the blue color symbolizes the features that contributed negatively and pushed the decision below the base value. figures (a) and (b) show cases processed by the model without ae, and (c) and (d) show cases processed by the model with ae. it appears that the processing of ae enlarges the less contributing features and makes them more visible on graphics. with the outbreak of covid- , it has been highlighted that there are insufficient applications in assisting public health. exploring the patients' choices of hospital levels is the cornerstone of evaluating policies for referrals, utilization of healthcare resources, and rerouting patient volume for the hierarchization of services, and during disease outbreaks or infection control. the choice of patients was highly imbalanced. 
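a minimal sketch of the macro-averaged evaluation described above is given below (python; scikit-learn is an assumed implementation choice). specificity has no built-in macro variant, so it is derived per class from the confusion matrix in a one-vs-rest fashion and then averaged, mirroring the macro scheme used for the other metrics; y_prob is assumed to hold one predicted probability per hospital level.

    import numpy as np
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_recall_fscore_support, roc_auc_score)

    def evaluate(y_true, y_pred, y_prob, n_classes=4):
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0)
        cm = confusion_matrix(y_true, y_pred, labels=range(n_classes))
        specificities = []
        for c in range(n_classes):                   # one-vs-rest specificity per class
            tp = cm[c, c]
            fp = cm[:, c].sum() - tp
            fn = cm[c, :].sum() - tp
            tn = cm.sum() - tp - fp - fn
            specificities.append(tn / (tn + fp) if (tn + fp) else 0.0)
        return {"auc": roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"),
                "accuracy": accuracy_score(y_true, y_pred),
                "sensitivity": recall,                       # macro-averaged recall
                "specificity": float(np.mean(specificities)),
                "precision": precision,
                "f1": f1}

    y_true = [0, 1, 2, 3, 0, 1]
    y_pred = [0, 1, 2, 2, 0, 3]
    y_prob = np.eye(4)[y_pred] * 0.7 + 0.075         # toy class probabilities (rows sum to one)
    print(evaluate(y_true, y_pred, y_prob))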
some of the features did not reflect the character of the general public, such as low continuity, visits often not involving surgery, emergency service, or the condition not being considered severe, however, the model could still predict with high sensitivity and specificity, without initially finding the best-fit features. this study tried to address the black-box problem of machine learning [ , ] using the shap value. according to our result, three features could interpret the majority of patients' choices of hospital levels: the mfpc, lfpc, and physician density. however, the features affected individuals differently. although entering the training process without previous data preprocessing is ideal, the performance of models can be improved by changing the data representation. it is straightforward that image and audio signals include disturbance and noise information, and using methods to eliminate noise leads to better predictions. structured data were encoded with existing encoding schemes that were meaningful to people. the sparsity, discrete, and scarcity of data, which is invisible to the human eye, were difficult to notice. this study demonstrated the effectiveness of changing data representations. by merely adding preprocessing in advance, all the performance indicators increased, and the shap value became less extreme, allowing less contributing features to be observed. the discipline of social economics mostly focused on clarifying the causality and the interrelationship of factors that affect patients in choosing hospitals. however, the machine learning approach attempted to seek the underlying pattern of data and predict accurately without relying on existing knowledge. the choice indicated a certain underlying trend, but not necessarily complete reasons and causalities. this study provided an alternative approach to observe the patients' choice. the prediction was based on the trajectory of the de-identified patient-visit data, commonly collected by the insurance company. hence, the model is highly achievable elsewhere as it neither involves (d) complex information that is difficult to collect nor violates patient privacy. however, this study has several limitations. distance to travel remains an important factor in choosing hospitals [ ] . although there are ways of using de-identified insurance data to project the region where patients live [ ] , such information is still based on hypothesis and cannot be validated accurately. further analysis could be done based on decisions of patients of different regions and explore different decision choices due to misallocation of medical resources. future applications using deep learning technology are promising in health policy making. a golden standard for interpreting machine learning models has not been established. the method of using the interpretation model and the generated scores can be further explored. shared decision making between patient and gp about referrals from primary care: does gatekeeping make a difference determinants of patient choice of healthcare providers: a scoping review. bmc health services research are the serious problems in cancer survival partly rooted in gatekeeper principles? 
an ecologic study identifying quality indicators used by patients to choose secondary health care providers: a mixed methods approach free choice of healthcare providers in the netherlands is both a goal in itself and a precondition: modelling the policy assumptions underlying the promotion of patient choice through documentary analysis and interviews. bmc health services research which factors decided general practitioners' choice of hospital on behalf of their patients in an area with free choice of public hospital? a questionnaire study. bmc health services research hospital choice models: a review and assessment of their utility for policy impact analysis the geography of hospital admission in a national health service with patient choice time series modelling and forecasting of emergency department overcrowding predicting patient volumes in hospital medicine: a comparative study of different time series forecasting methods using simulation to forecast the demand for hospital emergency services at the regional level forecasting emergency department crowding: a discrete event simulation. annals of emergency medicine deep learning for healthcare: review, opportunities and challenges a comparison of shallow and deep learning methods for predicting cognitive performance of stroke patients from mri lesion images opportunities and challenges in developing deep learning models using electronic health records data: a systematic review a guide to deep learning in healthcare assessment of deep learning using nonimaging information and sequential medical records to develop a prediction model for nonmelanoma skin cancer voice pathology detection using deep learning on mobile healthcare framework deep learning for healthcare applications based on physiological signals: a review. computer methods and programs in biomedicine dermatologist-level classification of skin cancer with deep neural networks end-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. nature medicine comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives using deep learning to enhance cancer diagnosis and classification doctor ai: predicting clinical events via recurrent neural networks deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis. ieee journal of biomedical and health informatics deep patient: an unsupervised representation to predict the future of patients from the electronic health records choice of hospital: which type of quality matters measuring patients' healthcare service quality perceptions, satisfaction, and loyalty in public and private sector hospitals in pakistan do patients choose hospitals with high quality ratings? empirical evidence from the market for angioplasty in the netherlands the impact of web-based ratings on patient choice of a primary care physician versus a specialist: randomized controlled experiment does universal health insurance make health care unaffordable? lessons from taiwan an overview of the healthcare system in taiwan taiwan's healthcare report taiwan's single-payer success story -and its lessons for america public opinion poll report summary. 
, national health insurance administration, ministry of health and welfare statistics yearbook of practicing physicians and health care organizations in taiwan measuring care continuity: a comparison of claims-based methods using an integrated coc index and multilevel measurements to verify the care outcome of patients with multiple chronic conditions. bmc health services research measuring majority of care measuring the performance of individual physicians by collecting data from multiple health plans: the results of a two-state test relationship between continuity of care in the multidisciplinary treatment of patients with diabetes and their clinical results a survey of deep neural network architectures and their applications efficient processing of deep neural networks: a tutorial and survey artificial intelligence for the otolaryngologist: a state of the art review. otolaryngology-head and neck surgery deep learning algorithm predicts diabetic retinopathy progression in individual patients shapley additive explanations for no forecasting explaining the predictions of any classifier the role of balanced training and testing data sets for binary classifiers in bioinformatics using regional differences and demographic characteristics to evaluate the principles of estimation of the residence of the population in national health insurance research databases (nhird) this study was approved by the research ethics committee at national taiwan university (no. em ), and the waived informed patient consent for the data was already de-identified before analysis. the authors have no conflicts of interest to declare. key: cord- -ltbvpv b authors: garcia-gasulla, dario; napagao, sergio alvarez; li, irene; maruyama, hiroshi; kanezashi, hiroki; p'erez-arnal, raquel; miyoshi, kunihiko; ishii, euma; suzuki, keita; shiba, sayaka; kurokawa, mariko; kanzawa, yuta; nakagawa, naomi; hanai, masatoshi; li, yixin; li, tianxiao title: global data science project for covid- summary report date: - - journal: nan doi: nan sha: doc_id: cord_uid: ltbvpv b this paper aims at providing the summary of the global data science project (gdsc) for covid- . as on may . covid- has largely impacted on our societies through both direct and indirect effects transmitted by the policy measures to counter the spread of viruses. we quantitatively analysed the multifaceted impacts of the covid- pandemic on our societies including people's mobility, health, and social behaviour changes. people's mobility has changed significantly due to the implementation of travel restriction and quarantine measurements. indeed, the physical distance has widened at international (cross-border), national and regional level. at international level, due to the travel restrictions, the number of international flights has plunged overall at around percent during march. in particular, the number of flights connecting europe dropped drastically in mid of march after the united states announced travel restrictions to europe and the eu and participating countries agreed to close borders, at percent decline compared to march th. similarly, we examined the impacts of quarantine measures in the major city: tokyo (japan), new york city (the united states), and barcelona (spain). within all three cities, we found the significant decline in traffic volume. we also identified the increased concern for mental health through the analysis of posts on social networking services such as twitter and instagram. 
notably, in the beginning of april , the number of post with #depression on instagram doubled, which might reflect the rise in mental health awareness among instagram users. besides, we identified the changes in a wide range of people's social behaviors, as well as economic impacts through the analysis of instagram data and primary survey data. the gdsp (global data science project) for covid- consists of an international team focusing on various societal aspects including mobility, health, economics, education, and online behavior. the team consists of volunteer data scientists from various countries including the united states, japan, spain, france, lithuania and china. the purpose of the gdsp is to quantitatively measure the impacts of the covid- pandemic on our societies in terms of people's mobility, health, and behaviour changes, and inform public and private decision-makers to make effective and appropriate policy decisions. a. quantifying physical distancing physical distancing is key to avoid or slow down the spread of viruses. each country has taken different policies and actions to restrict human mobility. in this project, we investigate how policies and actions affect human mobility in certain cities and countries. by referencing our analysis of policy and secondary impacts, we hope that decision makers can make effective and appropriate actions. furthermore, by analyzing human mobility, we also aim to develop a physical distancing risk index to monitor the risk on areas with high population densities and probability of contraction. due to physical distancing and lockdown policies, people have begun relying on video conferencing tools for meetings, lectures, and conversations among friends more frequently than usual. children are especially affected by the quarantine since many must refrain from going to their classrooms and take classes online. by leveraging various data sources, we will analyze how daily behavior has been affected by this pandemic, and also compare behaviors among different countries and cities. we will also measure online e-commerce and consumer behavior by analyzing sites such as amazon. for health, we have focused on emotion changes that people have experienced during this pandemic. emotion changes have stemmed from various reasons such as unemployment, implementation of stay-at-home policies, fear of the virus, etc. we quantify emotion changes by using social media data, including twitter and instagram. since the breakout of covid- , we have seen an increase in online discussions that use hashtags such as #covid- and #depression. we believe it is vital to visualize and analyze the differences in people's perceptions of . we also hope to analyze overall responses to the pandemic by sentiment: sadness, depression, isolation, happiness, etc. a further detailed analysis will also look into specific keywords and corresponding trends. each section in this report follows the following format; key takeaway, data description, policy changes, overall analysis, subcategory analysis. we aim at analysing the seeming trade-off between economics and prevention of infection spread. based on the calculation of physical distance index (mobility index), economic damages, and the number of newly infected patients, we evaluate the optimal level where we embrace both the steady decline in the number of infections and recovery of economics. 
to investigate the effects of these travel restrictions on worldwide flights, we analyzed the decline of flights for continents and countries from public flight data. we found that the overall international flights significantly decreased from the beginning of march, at around percent. in particular, the number of flights connecting europe drastically dropped in mid of march after the united states announced travel restrictions to europe and the eu and participating countries agreed to close borders, at percent decline compared to march th. we conducted a more detailed analysis in another paper suzumura et al. ( ) . in order to analyze real-time international flight data, we obtained voluntarily provided flight dataset from the opensky network -free ads-b and mode s data for research schäfer, strohmeier, lenders, martinovic, and wilhelm ( ),strohmeier ( ) . the dataset contains flight records with departure and arrival times, airport codes (origin and destination), and aircraft types. the dataset includes the following flight information during january st to april th. the dataset for a particular month is made available during the beginning of the following month. the data covers countries out of countries, including major airports and several small to medium size airports ( figure ). the data was collected over the period of january to march . as for a data methodology, we build a temporal network where a country or an airport is represented as a vertex and a connection between countries or airports is represented as an edge. by building such a temporal network and compute shortest paths and their length between countries or airports, we can measure how travels are restricted in a quantitative manner by using graph analytics. the data is analysed on ( ) global, ( ) continental, ( ) country (airport) level. the top airports were based on the preliminary world airport traffic rankings released by aci world international ( ). before the end of february , the overall number of (departed) international flights were around , . however, since march the number of international flights has started to decline. it reflected the first coronavirus death in the united states and the announcement of travel restrictions of 'do not travel' on february th. since march , this decline has further accelerated in response to president trump's announcement of the travel restriction on european countries on march and . in italy, , people were infected before february th in italy and cases in other european countries remain an area of concern, and the us records its first coronavirus death and announces travel restrictions of 'do not travel' on february th, resulting in the decline of international flights from france switzerland, and italy around february th. then, a significant slump of flights also occurred from around march th in these countries after the declaration of covid- by who. we examined the mobility in tokyo and its relations to the national and local government's measures to suppress covid- . we found that ) different demographic groups respond differently to the governments' messages (e.g., the senior responded first but the younger generations are willing to comply with the governments' instructions to stay home once the formal announcement of the state of emergency is declared), and ) people's behaviour is affected more by the mood of the society than the official declaration of the state of emergency. 
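a minimal sketch of the temporal flight network described above is given below (python; pandas and networkx are assumed tooling, as the text specifies only the graph construction). the column names day, origin_country and destination_country are placeholders onto which the opensky records would first be mapped, and the toy rows are for illustration only.

    import networkx as nx
    import pandas as pd

    flights = pd.DataFrame({
        "day": ["2020-03-01", "2020-03-01", "2020-03-01", "2020-03-15"],
        "origin_country": ["JP", "US", "ES", "JP"],
        "destination_country": ["US", "ES", "FR", "US"],
    })  # toy rows standing in for the opensky-derived records

    for day, group in flights.groupby("day"):
        g = nx.DiGraph()                              # countries as vertices, flights as edges
        g.add_edges_from(zip(group["origin_country"], group["destination_country"]))
        lengths = dict(nx.all_pairs_shortest_path_length(g))
        pairs = [(s, t) for s in lengths for t in lengths[s] if s != t]
        # average hop count over reachable pairs: it grows as direct links disappear
        avg_hops = sum(lengths[s][t] for s, t in pairs) / len(pairs) if pairs else float("nan")
        print(day, g.number_of_nodes(), "countries,", g.number_of_edges(), "links,",
              round(avg_hops, 2), "average hops")

the same loop run over airport codes instead of country codes gives the airport-level view of the network.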
we also investigated the correlation between the daily mobility index and the growth rate of reported confirmed cases, which suggested that the mobility index may be an early indicator of the growth rate of confirmed cases as well as the number of confirmed cases affecting future mobility. we analyzed ntt docomo data and accessed high-resolution hourly population data within tokyo from mobile kukan toukei docomo ( ). the data is based on mobile phone location on every hour and covers all the tokyo metropolis (average daily population is around m) and from jan. st, to the current date. we also received the same data for jan. to mar. in for the comparison. the data set divides tokyo into , grid cells of . km x . km. the provided data is a collection of population vectors pt where pt[i] is the population of the grid cell i at time t (t is an hourly time point between : on january st and : on march st). we defined the overall mobility within tokyo at time t as l (pt -pt+ ) where l is the l norm. intuitively, this metric counts the sum of the number of people who came into or left each cell during the given hour. note that these metric underestimates actual mobility since incoming and outgoing people within an hour cancel each other out. the mobility index above for tokyo includes large rural areas that may have contributed less to covid- transmission. for this study, we clustered , grid cells into groups based on hour-of-day population patterns (each grid cell is represented as a variable vector) using the data. figure and show the change of mobility index by age groups. we observe the largest drop of mobility occurred around march th when the media started to discuss potential "capitol lockdown". when the official state of emergency was declared on april th, the mobility had already dropped to less than half of that of the normal time. we also note that the senior groups, who are supposed to be at higher risk, responded to the epidemic initially, but later the younger groups were more willing to stay at home. we also investigated the potential use of the mobility index as an earlier indicator of the future spread of the disease. figure and shows the daily mobility index and the growth rate of confirmed cases in tokyo. in the plots, we noticed that the drop of mobility around march nd may be correlated to the drop of growth rate on march th as shown in the blue arrows, and the pickup of mobility on march th may be correlated to the peak of the growth rate on march th as shown in the red arrows it is not conclusive, but it may suggest that the mobility index has some signals for predicting the future spread of disease. different clusters show different responses in mobility to covid- . figure shows how mobility changes over time depending on the cluster. we analyzed the changes in traffic volumes and a bicycle sharing service in new york city to examine the effect of covid- and announcements from the city government. we found out the traffic volume has significantly decreased after the beginning of march, and thousands of people use citibike in every daytime. we analyzed the mobility changes in new york city through the road traffic data and tracking data of sharing bikes. we retrieved public historical data about traffic volume of freeways from nyc open data opendata ( b). as for road traffic data, we extracted the daily average travel time and speed in days ranges. 
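returning briefly to the tokyo analysis above, the sketch below (python; numpy and scikit-learn are assumed tooling) shows how the hourly mobility index, the l1 norm of the population change between consecutive hours, and the hour-of-day clustering of grid cells can be computed. the random matrix, the number of clusters and the per-cell scaling are illustrative assumptions, not the values used in the study.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    pop = rng.integers(0, 500, size=(24 * 60, 4000)).astype(float)
    # stand-in for the hourly population matrix: 60 days x 24 hours, ~4,000 grid cells

    # mobility at hour t: l1 norm of the population change between consecutive hours,
    # i.e. the total head count that entered or left any grid cell during that hour
    mobility = np.abs(np.diff(pop, axis=0)).sum(axis=1)

    # cluster grid cells on their average hour-of-day profile (24-dimensional vectors)
    profiles = pop.reshape(-1, 24, pop.shape[1]).mean(axis=0).T          # (cells, 24)
    profiles = profiles / (profiles.max(axis=1, keepdims=True) + 1e-9)   # scale per cell
    cluster_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(profiles)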
the x-axis of figure and as for sharing bike data, we track the number of people using bikes at each bike station every seconds from nyc open data opendata ( a). we aggregated the number of departed bikes in bike stations, located in most areas of manhattan, brooklyn and queens near the east river, and jersey city along the hudson river. since we can only track the number of available bikes in each station, we estimate the number of departed bikes by computing the difference in the number of available bikes between two timestamps. we developed an interactive visualization dashboard that illustrates how bikes are used over time since march th. after the nyc municipality recommended citizens to ride bikes instead of using public transportation, there was a surge in the usage of citi bike in the beginning of march, a privately-owned public bicycle sharing system in new york city kuntzman ( ). figure describes the total number of hourly citibike usages from march rd. more than a thousand of people used citibike in a peak hour every day, and in some days the number of peak usages exceeded , in some days (e.g., april th, th, and may nd). from the beginning of may, more than , people used bikes in peak hour. we analyzed the use of the public bike system and the amount of traffic load within the barcelona metropolitan area, to understand how the covid pandemic and the government measures were affecting public movement. we detected that mobility was only significantly altered once the harshest measures were implemented, hinting at a potential inefficiency of mild measures. furthermore, as the lockdown went on, the mobility kept decreasing, indicating an increasing adherence as the understanding of the severity increases. the mobility data in barcelona, spain was collected through the location signal of face-book application users who have consented to share their location, public bike sharing data, and road traffic data. similar to new york city, public bike sharing is available in barcelona. by analyzing the availability of docking stations throughout the city, we measure the changes in population mobility. traffic data is obtained from the open data released by the barcelona city hall. it includes over measuring points, evenly distributed throughout the city. the mobility data in barcelona, spain was collected through the location signal of face-book application users who have consented to share their location, public bike sharing data, and road traffic data. similar to new york city, public bike sharing is available in barcelona. by analyzing the availability of docking stations throughout the city, we measure the changes in population mobility. traffic data is obtained from the open data released by the barcelona city hall. it includes over measuring points, evenly distributed throughout the city. the first detected case of covid- within spain was on january st in the canary islands, located more than , km from peninsular spain. by late february, imported cases were detected in the mainland, and on february th the first endemic case was diagnosed. on march th, cases were diagnosed and certain regions in spain started implementing local restriction policies. by march th, cases had been detected across all provinces. the following day, march th, the spanish government announced a state of emergency, and implemented a lockdown for the whole population. citizens were only permitted to travel for work, and all social events were prohibited. 
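a minimal sketch of the departure estimate described above is given below (python, pandas assumed). a drop in the number of available bikes at a station between two snapshots is counted as departures, increases (returns or rebalancing) are clipped to zero so they do not cancel departures out, and the result is aggregated per hour; the column names and toy snapshots are assumptions about the station-status feed rather than the exact schema.

    import pandas as pd

    status = pd.DataFrame({
        "station_id": [1, 1, 1, 2, 2, 2],
        "timestamp": pd.to_datetime(["2020-03-23 08:00", "2020-03-23 08:05",
                                     "2020-03-23 08:10"] * 2),
        "bikes_available": [10, 7, 8, 5, 5, 2],
    })  # toy snapshots standing in for the periodic station-status polls

    status = status.sort_values(["station_id", "timestamp"])
    # negative change in availability = bikes that left the station; increases are clipped
    status["departures"] = (-status.groupby("station_id")["bikes_available"].diff()).clip(lower=0)

    hourly = (status.set_index("timestamp")
                    .groupby(pd.Grouper(freq="H"))["departures"]
                    .sum())
    print(hourly)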
this lockdown was reinforced on march th with total mobility restrictions, and only essential services were an exception. the first restriction against covid- by the regional government of barcelona was on march th that informed citizens to avoid gatherings of more than , people. two days later, on march th, with confirmed cases in the region, all classes were suspended. by march th, a national lockdown was declared by the spanish government. the effects on mobility are only visible from march th, indicating that the local population did not alter their mobility patterns in response to earlier and milder governmental restrictions. posts with some hashtags related to physical distancing suddenly started increasing in mid of march . a hashtag #zoom, online meeting software, was frequently posted all over the world in late march, and the stock price of the software company rose dependent upon the increasing volume of posts with the hashtag. a post on instagram would be summarized and categorized by hashtags on the post. we analyse the following hashtag, which represent physical distancing have increased signif-icantly since march ; #stayhome, #stayathome, #socialdistancing, #workfromhome, #zoom (online meeting software). as of april , , more than million posts with #stayhome have uploaded on instagram, including million posts in march . during mid of march , the number of posts with #stayhome gradually rose up, with states across the u.s. announcing a stayat-home order. on march , , instagram announced that it launched a "stay home" sticker to help those practicing physical distancing connect with others. this might have also boosted the number in march . the outbreak of covid- recently has affected human life to a great extent. besides direct physical and economic threats, the pandemic also indirectly impacts people's emotional conditions, which can be overwhelming but difficult to measure. we apply natural language processing (nlp) vaswani et al. ( ) , qi, zhang, zhang, bolton, and manning ( ) to analyze tweets in terms of emotion analysis and attempt to find more in-depth topics and facts about emotions in terms of covid- li, li, li, alvarez-napagao, and garcia ( ) . we have seen an increase in discussions tagged with hashtags such as #covid- and #depression on twitter and instagram. we believe it is vital to analyze differences in. we also hope to analyze overall responses to the pandemic as well as changes in behavior due to the virus and generate reports on global situations regarding mental health. we plan to categorize the tweets and instagram posts that mention covid- by sentimental categories: .anger, anticipation, disgust, fear, joy, sadness, surprise and trust. further detailed analysis will also look into specific keywords and corresponding trends. we applied twitter api to conduct a crawler with a list of keywords: #coronavirus, #covid , #covid, #covid , #confinamiento, #flu, #virus, #hantavirus, #fever, #cough, #social #distance, #lockdown, #pandemic, #epidemic, #conlabelious, #infection, #stayhome, #corona, #epidemie, #epidemia,新冠肺炎, 新型冠病毒, 疫情, 新冠病毒, 感染, 新型コロナウイルス, コロナ. each day, we are able to crawl million tweets in free text format from different languages. due to the high capacity, we look at the tweets from march to , to get language and geolocation statistics. among these tweets, , , tweets have the language information ("lang" field of the "tweet" object in tweet api), and , tweets have the geographic information ("country code"value from the "place" field if not none). 
we applied a deep learning model (bert) devlin, chang, lee, and toutanova ( ), trained on manually labeled cases, to million english tweets; fear, anger and sadness ranked highest. we now look at the emotion trends for different topics. using bert, we analyzed two topics: "mask" and "lockdown". to understand why people feel fear and sadness, we took the tweets categorized as fear and sadness and kept the nouns and noun phrases with the help of the stanford stanza tool. we then utilized lda (latent dirichlet allocation) topic modeling blei, ng, and jordan ( ) to analyze the topics in people's tweets. each "topic" learned by the model is a set of keywords, which we manually labeled as meaningful concepts. by applying our model, we show the emotion distribution among categories in fig. . the overall distribution differs little from day to day, so we show the results on a million tweets from march th, . note that these tweets are written in many languages, not only english. we can see that the top emotions are strongly negative: fear, anger and sadness. we selected two weeks of data (march , to april , ) and applied our model to predict the emotions in all the tweets we crawled (around million each day) that contain the keywords "mask" and "lockdown", respectively. we found that the dominant emotions and their variation over time are closely related to the topic. in fig. and , we illustrate the emotion trend for each single day for the selected keywords. the highest variation (plotted in solid lines in the figures) showed up in sadness, anger and anticipation for the tweets that contain the word "mask" in fig. , and in disgust and sadness for the tweets that contain the word "lockdown" in fig. . especially for the lockdown tweets, the percentage of the disgust emotion increased significantly on march and dropped over the next two days, as marked with the black asterisks. to investigate further, we looked at the news on march , which reported that the u.s. had become the first country to report , confirmed coronavirus cases, that in americans were staying home, and that india and south africa had joined the countries imposing lockdowns. given that the united states, india and brazil have large groups of twitter users, we assume that this dramatic change may have been triggered by that news. we set the number of topics to , and did a detailed analysis of the data from april th, . the topics are listed as follows:
topic : covid testing, death cases, positive cases
topic : president trump, government, federal affairs
topic : lockdown, stay at home, physical distancing
topic : (spanish) pandemic, health conditions
topic : the peak, serious treatment, boris johnson
figure shows the distribution of the topics. using the data from april th, we first run inference on all the data and show the ratio of each topic learned above (all). we then run inference on the tweets that are labeled only as sadness or fear (sad and fear), and report the ratio of each topic. in general, the public appears to be worried mainly about topics and , the pandemic and the lockdown, which are making people stressed. at the beginning of april , the number of posts with #depression on instagram doubled, which might reflect a rise in mental health awareness among instagram users. we collected , posts with #depression on instagram instagram inc. ( ) from march to april , . during this period, the number of posts increased steadily, as shown below. among the posts with location information, #depression was mostly posted in the u.s., the u.k. and india.
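the emotion-classification step described above can be sketched as follows; the checkpoint path is a placeholder for a bert model fine-tuned on manually labeled tweets, and the example tweets are invented, so this is only an illustration of the approach rather than the actual model used in this work:

from transformers import pipeline

# placeholder path to a bert checkpoint fine-tuned on manually labeled tweets
# whose labels are the eight emotion categories listed earlier in the text
classifier = pipeline(
    "text-classification",
    model="path/to/bert-tweet-emotion",
    tokenizer="path/to/bert-tweet-emotion",
)

tweets = [
    "running out of masks and the stores are all empty...",
    "day 14 of lockdown and i miss my friends so much",
]

# the pipeline returns one {"label", "score"} dict per input tweet
for tweet, result in zip(tweets, classifier(tweets)):
    print(f'{result["label"]:12s} {result["score"]:.2f}  {tweet}')

counting the predicted labels per day then yields daily emotion distributions of the kind plotted in the figures.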
in those countries, users posted the hashtag during the local daytime. during the long quarantine, people might struggle to keep themselves mentally healthy, and the trend of #depression on instagram might reflect their attitudes. the number of hourly posts with #depression doubled in a week from march to april . among the posts with location information, most were uploaded in the u.s.; in all top three countries, the u.s., the u.k. and india, the hashtag was mainly posted between the afternoon and the evening. the number of posts with #cough started increasing in mid-march, several days before the stay-at-home orders in some countries. some governments could have taken their initial response earlier than they actually did, based on the increase in such behavioral changes on social networking service platforms. other hashtags related to users' behavioral changes, including #mask, #facemask and #stayalive, started increasing in mid-march. we analyzed basic hashtags such as #covid and #coronavirus; hashtags related to medical supplies, #mask and #facemask; #clapbecausewecare, a daily event in which people cheer medical professionals working on the frontline; and hashtags that support others, such as #stayalive. we have collected , posts with #mask and , posts with #facemask since february , . figure and show that the two hashtags have been steadily increasing since around march , , a few weeks earlier than some lockdown announcements in europe and stay-at-home orders in the u.s. people might have considered how to protect themselves amid the pandemic before their governments imposed strict prohibitions on residents. people's perception regarding covid- varies depending on the number of infected cases in the local community. moreover, people's behavior carries an indicative signal for the future spread of the disease. we analyzed data from the national survey concerning covid- news and the resulting behavior changes, conducted on march to and provided by survey research center co., ltd. respondents were randomly selected from each prefecture from an approximate national pool of million panelists. we also used national data regarding confirmed covid- cases from march th (the day before the survey was conducted). in densely affected areas, higher levels of concern regarding the impact of covid- on everyday work (q - ) were observed. in areas with lower numbers of confirmed cases, commuting behavior shifted to avoid public transportation. we also considered how people's perception and behavior may impact the spread of the disease. since there is some delay between an infection and its appearance in the official statistics, we took the growth rate of the reported confirmed cases on march th as the indicator of the rate of the spread (a sketch of this computation is given below). figure shows the strong positive and negative correlations. the mobility analysis of barcelona seemed to indicate that society assimilated the importance and severity of the situation during the second week of lockdown. this may be supported by this work, which indicates that this same week hosted most of the layoffs. this is, however, before the toughest part of the lockdown (during the first two weeks, travel to work was still allowed). in this regard, the hard lockdown seemed to have little further effect on unemployment. for this analysis we used the public erte data from the generalitat de catalunya and the public unemployment data from sepe.
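returning to the prefecture-level survey above, the correlation between reported concern and subsequent case growth can be sketched as follows; the file names and column names are hypothetical stand-ins for the survey and case data rather than the actual datasets used here:

import pandas as pd
from scipy.stats import pearsonr

# hypothetical inputs: one row per prefecture
survey = pd.read_csv("survey_by_prefecture.csv")   # e.g., pct_concerned_about_work, pct_avoiding_transit
cases = pd.read_csv("cases_by_prefecture.csv")     # e.g., cases_survey_day, cases_week_before

# growth rate of reported confirmed cases over the week preceding the survey
cases["growth_rate"] = (cases["cases_survey_day"] - cases["cases_week_before"]) \
    / cases["cases_week_before"].clip(lower=1)

df = survey.merge(cases[["prefecture", "growth_rate"]], on="prefecture")

# correlate each survey variable with the case growth rate
for col in ["pct_concerned_about_work", "pct_avoiding_transit"]:
    r, p = pearsonr(df[col], df["growth_rate"])
    print(f"{col}: r = {r:+.2f} (p = {p:.3f})")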
during the lockdown, the spanish government promoted a temporary unemployment scheme (erte) under which companies can temporarily lay off workers. in this period, % of the worker's salary is paid by the government, and the company may complement the rest. the purpose of this measure is to maximize the number of workplaces that are restored after the economic lockdown. the success of this measure will have a great impact on the duration of the economic side-effects of the pandemic. we analysed the effect of this measure in catalonia, one of the most populated autonomous communities of spain. catalonia includes the city of barcelona and close to . m workers. industrial activity represents nearly % of the catalan gdp, while tourism accounts for %. in figure we plot the number of temporary unemployment filings per day, that is, the number of affected workers on each day. for context, the spanish government announced a state of emergency and implemented a lockdown for the whole population on march th. this lockdown was reinforced on march th with total mobility restrictions, with the only exception of essential services.
figure : volume of workers affected by temporary unemployment on each day.
as seen in figure , most of the layoffs happened during the second week of the pandemic, before the total lockdown. this indicates that, for the case of catalonia, the mild lockdown and the hard lockdown had similar economic effects in terms of unemployment. on the worst day, the th of march, % of catalan workers were temporarily laid off. by the end of may, the total number of workers affected by temporary unemployment in catalonia was , . next, in figure , we compare the unemployment volume with the volume of ertes per month. considering the difference between the erte volume (over k on the worst day) and the growth of unemployment (less than k accumulated), ertes seem to absorb many of the layoffs, mitigating the growth of unemployment at least temporarily. the behavior of unemployment in the coming months, as ertes expire, will provide the real measure of the effectiveness of ertes.
latent dirichlet allocation
bert: pre-training of deep bidirectional transformers for language understanding
mobile kukan tokei
people-safe-informed-and-supported-on-instagram
international, a. c. ( ). preliminary world airport traffic rankings released - aci
boom! new citi bike stats show cycling surge is real - but mayor is not acting
what are we depressed about when we talk about covid : mental health analysis on tweets using natural language processing
citi bike live station feed (json), nyc open data
real-time traffic speed data - nyc open data
stanza: a python natural language processing toolkit for many human languages
bringing up opensky: a large-scale ads-b sensor network for research
opensky covid- flight dataset
the impact of covid- on flight networks
polosukhin, i
key: cord- - loab f title: where does open science lead us during a pandemic? a public good argument to prioritize rights in the open commons date: - - journal: cambridge quarterly of healthcare ethics : cq : the international journal of healthcare ethics committees doi: . /s sha: doc_id: cord_uid: loab f during the covid- pandemic, open science has become central to experimental, public health, and clinical responses across the globe. open science (os) is described as an open commons, in which a right to science renders all possible scientific data available for everyone to access and use.
in this common space, capitalist platforms now provide many essential services and are taking the lead in public health activities. these neoliberal businesses, however, have a problematic role in the capture of public goods. this paper argues that the open commons is a community of rights, consisting of people and institutions whose interests mutually support the public good. if os is a cornerstone of public health, then reaffirming the public good is its overriding purpose, and unethical platforms ought to be excluded from the commons and its benefits. for some, the ongoing severe acute respiratory syndrome coronavirus (sars-cov- ) pandemic has "reaffirmed the urgent need for a transition to open science." "the real antidote to epidemic is not segregation, but rather cooperation," so that open science will accelerate societal and economic progress. across the globe, open science (os) is enhancing evidence-based nonpharmaceutical measures, and equitably contributing to scientific and clinical responses to the pandemic. although os has come to mean many things, this paper is a critique of the concept that all scientific knowledge resides in an "open commons." os is preferable to a competitive, secretive, and proprietary scientific culture; however, the veneration of data idealism under the "new" commons obscures potential abuse of os, too. first, the pandemic has become a false panacea for os in the unprecedented application of big data and fast science, leading to hurried and expedient publication, rather than prudent protocols, to support public health. critics, however, point out that much of the information is not vetted, noisy, and can be socially and politically distorting. os's chaotic application is contributing to negative social determinants of fairness and equitability, in respect to whom the data is about, who can use it, and ultimately who controls it. second, os has become an end-in-itself, so its purpose to support the public good has morphed into surveillance that bleeds into social control and profiteering. each of these requires obligations to use it ethically. focusing on the question of markets (rather than surveillance), this paper attempts to keep os as a cornerstone of public health, and therefore reaffirms the public good as os's overriding purpose. during this pandemic, os has become central to sharing data about sars-cov- and the clinical nature of covid- . this exponential growth of real-time information has been used directly to justify public health policies. os is not new to this pandemic, however, and its narrative has a distinct ideology which could have an impact on how we emerge from this time. the contemporary idea of os comes from a nebulous background: it has origins in history, social convention, jurisprudence, and ethics. these ideas represent different heritages about what constitutes favorable conditions for social innovation. today, os is principally positioned as a response to the unprecedented production of data (too quick to be contained and too much to be constrained by solitary users) and a realization that misuse of proprietary models can discourage socially valuable innovation. os, therefore, began with the gradual normalization of open access journals as a benchmark for scientific dissemination, and has since become a movement to advocate for a new "open commons" to underpin all parts of the research process.
colloquially, os is about removing barriers to scientific knowledge, and, as such, it has supported creative models that prioritize better and faster access to science for anyone who wants to use it. os is meant to remove structural obstacles to reporting and dissemination, and thereby optimize socially valuable practices (e.g., "open access," "open data," and "open source"). numerous funding agencies and governmental institutions stipulate ethical conditions like transparency and fairness, confident that opportunities stemming from os will be socially equitable. some industry-based individuals and institutions have integrated os into their work ideology. it is also anticipated that publics, such as users, research participants and patients, will share their data too. to this end, os is interconnecting all domains of scientific record-keeping, archiving, discovery, and innovation, and has taken root in the community ethos underlying citizen science and crowdsourcing. primarily, os maximizes efficiencies in the knowledge economy by reducing proprietary claims on intellectual property and promoting co-created knowledge. doing so enhances international and interjurisdictional cooperation on fair data creation, access and use. there is evidence that os benefits researchers through recognition, partnerships, and enhanced data access. it ensures critical review and reproducibility, and opens up scientific culture to social scrutiny and thereby reinforces the scientific method (i.e., advocating holistic creation and stewardship of knowledge). os may become a cornerstone in clinical practice, enabling diagnostics and cures using patient records across vast places and time. in public health, real-time data analytics enables rapid and equitable responses to pandemics. although os culture may be a result of compromises, paradoxes, and surprises, it now seems to be the latest bandwagon; yet the reasons to jump on it may not always be compelling. in the context of this paper, os raises concerns about the excessive scope it creates for monitoring and control (e.g., classification, profiling, and ultimately manipulation). os, premised on big data, also opens up many other controversies: the data is messy, noisy, and often irrelevant, and creates ideal conditions for falsehoods and misleading perceptions to influence public policy. that raises further challenges for legitimate evidence, especially in situations where rapid publication compounds long-standing problems such as effective peer review and scrupulous communication of science. by necessity, technology-based sociality (the online world) forces us to leave our digital footprints everywhere; these can be (falsely) incriminating or exonerating, embarrassing, and harmful if known. it is difficult to hide or cover those tracks, and the millions of data points they reveal are easily harvested and assimilated into research without consent. patients lose control of data, as it is sequestered under conditions that refuse further access, and patients are denied use of, or exploited for access to, end products. bona fides scientists may be held responsible for disingenuous use of their data, and experience the backlash when someone else openly gets it wrong.
the conditions for os can theoretically preclude opportunities for ethical scientists if they are concerned about specific circumstances of how data will be used; moreover, the imperative to take part denies researchers control over their hard-worked-for data, and also makes them obedient to signals from the market. this milieu simultaneously undermines and celebrates "experts" and displaces scientific professionalism and the scientific method: experience and qualifications can be replaced by familiarity with particular forms of social criticism and popular debate, and therefore scientific authority and professional responsibility are won or lost depending on ideology. strategic promotion of the "citizen scientist" also makes a mockery of experts' years of training, and questions their claims to authority. finally, in the open commons, as we shall see, data created under ethical conditions for the public good, such as medical health records, can be acquired under a social pretext, capturing its benefits for traditional and novel market economies. these possibilities erode the ethos of communal science. the capitalist platforms seemingly leading this are knowingly impacting on public health: their acquisition of data relies on free-riding on the open commons, using os to create a competitive advantage, but then implementing proprietary rules to protect their interests. in the earliest of at least three phases, os emerged as a movement to maximize practical conditions for fruitful collaborations and enhanced sharing. proponents talked about unprecedented knowledge generation and imagined that incredible transformative discoveries were imminent. this new way of doing science would be transformative, so it became necessary to define ethical conditions for data deposit, access, and use. one of the key challenges became the tension between maximizing innovation, on the one hand, and respecting autonomy, on the other. the argument turned out to be comparable to the utilitarian reasoning underlying primitive ideas about public health, where collective wellbeing (a good in itself) may conditionally trump rights. thus, os proponents appealed to "the public interest" to explain an innovation-based critique of autonomy that justified rescinding rights for the greater good. in this respect, proponents may acknowledge the problems of the unproven os research paradigms practiced by capitalist platforms, but nevertheless remain positive about the capacity for society and jurisprudence to evolve an appropriate balance, that is, in respect to promoting innovation, but still having expectations about privacy and confidentiality. they argue that research participants and patients must adapt to this sea change too, despite the risks to their rights, because on balance they benefit from the innovative opportunities of os. this view, however, shares the commercial function of "capitalist platforms," and therefore cannot stop economies from undermining the communities they claim to serve. "social openness," illustrated by uninhibited social media use, has increasingly normalized capitalist platforms. all the while, these platforms have gained further footholds in providing essential social services, often changing them into devices of capitalism.
within os, there have been few questions raised as to whether these business models provide the most efficacious approach to data management in public health circumstances, and some have forsaken concerns about the ethical appropriateness of private interests taking part in providing public goods. os, in fact, creates a self-sustaining context for limitless data, and that data has become extraordinarily valuable now that the mechanisms for extraction and exploitation are in place: big data is a naturally, freely, and effortlessly self-propagating good, requiring little more than strategic mining. the "new oil" has become ascendant as a unit of capital for market engines. in this respect, the "extreme concentration of wealth means an extreme concentration of power," so that the capitalist platforms have become extraordinarily influential. as our wellbeing has increasingly shifted online, that space has become a progressively attractive one for enhancing entrepreneurial freedoms, extending beyond core businesses (social networking and e-commerce) into providing basic services such as health care. meanwhile, there has been a global capitulation to neoliberal values, presenting an opportunity to keep the state in a subservient role of merely preserving institutional frameworks appropriate to market practices. free market capitalists therefore imagine os supports laissez-fairism: it lacks government interventionism (i.e., strict control over data) so is conducive to the kinds of liberalization that promote capital generation. os also organizes society through voluntary and community activity, cooperation, civicness, networking, and social capital: these are easily exploited for their production of vast and free data. so, if there is any truth to a sociological view that "privacy is dead," then os is an ideally "lawless" space to capture the public good. the social work is freely done, allowing the platforms to siphon off a rent from every transaction they facilitate. this neoliberal ruse is exemplified in a critique of "surveillance capitalism" or "dataveillance": the "unexpected and illegible mechanisms of extraction and control that exile persons from their own behavior." although many remain wary of the surveillance narrative (what will health apps be used for after the pandemic?), we seem less concerned with those who opportunistically clearcut the public good through its capture, using an "inherent political asymmetry …[and] in fact a posteriori private expropriation." os risks opening the door to exploitative practices, conflicts of interest, and poor data security, under a vague conjecture about the public interest. the potential consequence of doing so takes us further down the path to dystopia: we become imprisoned but "happy consumers": homo datus or data avatars, content (perhaps) to be counted, analyzed, and surveyed. big data, bioinformatics, and ai combine to create artificial identities, replacing our dignity with a price to know everything about us. in this form, persons have insufficient knowledge about what is known about them and little ability to control how it is used. the problem is not necessarily the monolith of state or the forces of innovation, but the two acting together in a neoliberal adaptation of the role of public health. in so doing, public health now serves a public interest in strong economies.
seen this way, os makes communities more surveilled and potentially less free, paradoxically at odds with the intent of the open commons. perhaps as a result of the blurring of public health and capitalist agendas, in the second phase of os, proponents have organized the movement into an "open commons." the open commons has become more than the aggregation of data; it shifts os arguments from "intellectual property or confidentiality restrictions," to the "fundamental shared nature of the genomic commons." given the right context, there is an ethical obligation to take part in these communities situated in both research and healthcare contexts. however, we are permanently connected through our household economies, education, work, use of the health system and social services, and all our real and virtual socializing. these circumstances perhaps entertain a contemporary "right to the internet," especially when the circumstance of a pandemic befalls a society. the open commons therefore signifies a continual connection between our being and with very large data, both created spontaneously (i.e., by social media use, as well as other nonprofessional activities) and by structured initiatives such as health care, biobanks and specific analytic websites, for example, genetic and virological. the data is shared between traditional networks of local, regional, and international research infrastructure and hubs, as well as real-time database apps, platforms, and archives. thus, following elinor ostrom, the open commons has become a massive "common pool resource" rather than a public good; data is a resource that may be decreased through consumption, and exclusion is possible, necessitating complex (and ethical) rules for deposit, access and use. this "new commons" still presupposes a community coopted for the good of science, but its people are compelled to give up some of their interests without expecting immediate gains or fearing instantaneous harms. this may be explained by an antecedent view-which contains elements of robert merton's "communism"-that a culture of sharing underlies all fruitful collaboration and equitable transfer of legitimate goods between citizens and scientists. for example, the total aggregation of accessible genomic data (across many institutions and resources) is referred to as the "genomic commons" : that is a community of liberal citizens and public scientists working with their industrial counterparts, who are equally committed to the values of os. in this respect, os fosters behavioral change in those who habitually conceal their data. in reality, these relationships require "tiered access" that assesses individuals or organizations, and grants them specific data-use conditions. the commons, however, has become a place to influence social and political discourse. for example, the case may be made that if the open commons is in the "best interests" of people, then it also creates a space for the specific interests of a sector that stands to profit from exploiting its co-inhabitants. however, neoliberalism also applies to the kinds of entrepreneurial "experiments" currently undertaken in the open commons. a particular example is the emergence of "open research"often bypassing ethics scrutiny -that blurs principles of research integrity with the social and economic critiques of big data. 
big data research not only uses data voluntarily provided (sometimes) and spontaneously harvested (with or without persons' consent), but its researchers have no qualms about using data that is secretively or disingenuously gathered, because there is a competitive advantage in doing so and it comes with few penalties (and powerful advocates). in the background of this research, there is an unfettered market where neoliberals may opportunistically capture public resources: commodification transforms a market economy into a market society, in which the solution to all manner of social and civic challenges is "the market" itself. that is more worrisome, because the os consensus generally falls on publics to be players, and public institutions to support it, rather than being obligatory or reciprocal on the private sector. in that regard, production of data has complex, mostly public but sometimes private origins, so that os idealism may be used to deliberately weaken those institutions operating "for the public good." doing so creates ideal circumstances to generate public bads that prospectively obstruct or reduce social opportunities: captured goods may become disruptive commodities. in the current narrative of os, privacy may not be the only right we stand to lose by the subversion of the commons, as access and use of both legitimately and surreptitiously obtained data not only affect persons' freedom and wellbeing, but may undermine the commons by promoting illegitimate interest. these possibilities stoke criticisms about the corporate abuses of data, which could weaken our response to the covid- pandemic and ultimately reduce the effectiveness of public health. in this third phase, proponents have attempted to move away from utility-based arguments, to define a more complex ethical environment that frames os as a "human right to benefit from the fruits of scientific research." significantly, our existence in the commons is both essential (as sociable, cooperative beings) and unavoidable (as perpetually online beings), so that its moral governance requires a rights framework. in this respect, the "right to science" includes an untrammeled obligation to share, and a social contract involving trade-offs that are neither necessarily mutual nor correlative, so that the open commons becomes critical, as well as supportive, of the conditions for freedom and wellbeing. finding out where that right sits between autonomy and the public interests requires tracing it back to its roots in international conventions, where we find that right is anchored to equal dignity, self-realization, and substantive freedoms. thus, the right to science must include the right to consent to scientific experimentation. that right is fundamental to the integrity of science, and the development and diffusion of ethical technologies; that is, it protects persons from science misuse. the "right to science" is simultaneously a veneration of good knowledge as a "public good" (a resource for the realization of human rights) and also limited by fundamental obligations to "human dignity." the correlation of the public good and dignity inevitably creates a tension between "the public interest" and autonomy; but since recognizing the atrocities committed by doctors and scientists in the mid- th century, autonomy has been clearly favored over the public interest. without that correlate, the right to science, as a disambiguation of an obligation to share, yet lacking stable protections of freedom and wellbeing, is on shaky ground.
the ground becomes firm only if the right to science is within a hierarchy of obligations. foremost there is the protection of basic rights (as formulated, e.g., in the nuremberg code), which later evolved to include a right to receive reasonable technological benefits. the basic rights create a correlative obligation to a prima facie positive right to privacy as well as a negative right to be "left alone." next, the public interest promotes rights in the sense of general welfare (that is what tells us what is good about the commons) and promotes opportunities to enjoy second-tier rights, proportional to the tension between diminished freedom and prospectively enhanced wellbeing. last, there is a right to engage in ethical science. although we may all "enjoy the benefits" of science, it is clear that citizens, whether they take part in its creation or revel in its progress, also have a choice in both respects; that choice is a freedom from an unjustified public interest. what i have just described in brief terms is a "community of rights," in which an egalitarian conception of solidarity promotes os as mutual and cooperative, rather than secretive, manipulative or competitive. therefore, the kinds of research conducted in this space must be bona fides, which excludes activities that prospectively obstruct or reduce social opportunities (i.e., public bads) and precludes capture of public resources. the ethical commons, therefore, is conditional on the technical examination of "the public interest" in terms of the public good and legitimate rights. there can be no duty to take part in os without careful consideration of the public good; and, as such, we can reasonably opt out of taking part in research that foreseeably harms us. moreover, the public interest creates obligations for institutions: innovation is an ability to make use of new scientific knowledge, but also a capacity to put it to use creatively and ethically to help solve broadly social problems. in general, candidate institutions must first commit to a normative principle of institutional responsibility. this principle stipulates that they observe the meaning of the public interest as an "indirect" form of a social contract to promote the public interest in welfare. such an application of rights theory means that they are practically as committed to respecting equally the rights of people in the commons as they are to their shareholders. responsible institutions protect rights and interests jointly by including procedural requirements for ethical associations and partnerships and establishing instrumental governance. all this requires a regulatory response focused on the os sector as a whole, since it is not always easy to separate research or practice into distinct private, public or not-for-profit forms. institutional responsibility also requires that public-private industry occupancy of the commons be ethically symmetrical. the onus also falls on prospective partners to provide reciprocal openness, so that their intentions become transparent in respect to why they collect data and what it is used for. ethically, participating private industries must contribute corporate data to the public good, too, just as legitimate public bodies do. private institutions should compensate the open commons, so as to preclude free-riding on costs incurred by other people; that premium can be adjusted in respect to their adherence to these conditions.
finally, these conditions should be spelled out in specific "data collection" and "data use" rules that are externally enforced, and there should be oversight in respect to applying norms of research ethics. we can learn a lot from nuanced approaches to os such as that of uk biobank: it stipulates that its purpose is for the public good, so that its stewardship over data is conditional. the data it contains is never truly open, but is accessible to all bona fides researchers, whether public or private, on the conditions that data is returned to the resource and "unreasonable" patents are precluded in any future invention. like uk biobank, the open commons may exclude those who attempt to capture goods or create public bads. alas, although there are industries volunteering to use this approach already, in general joining the commons for them likely requires a comprehensive sea change to alter course from a speculated social cataclysm of post-pandemic capitalism. the current os narrative potentially underestimates the opportunities for surveillance capitalism during a pandemic. enthusiasm for innovation, the open commons, and the "right to science" continue to conflate ethical conditions for bona fides research with capital purposes. this mistake has become an opportunity to undo rights protections, as os proponents cannot set effective ethical conditions for data access and use: os avails all possible scientific data for anyone and everyone to access and use for any purpose. ethical os requires a far more cautious approach in respect to how society emerges from the pandemic: new data, as well as new technologies (i.e., ai) and tools (open source software), even if used ethically now, will become resources able to be used beyond the purposes, and protections offered by, public health. despite trust in some legal safeguards, os has also become an opportunity for capitalist platforms to provide many essential services based on public health's legitimacy of surveilling peoples' health. this paper, therefore, provides some evidence that the protections afforded to persons may be rolled back under the "new commons," and that could undermine the essential provision of public goods. if we begin to imagine os as simply knowledge generation qua innovation, without establishing ethical norms, as well as legal protections, the conditions for the public interest can quickly become ambiguous in respect to the public good. this conclusion may be resisted still, because there is likely little social appetite for returning to times when science was more of a proprietary activity. but it will not do to remain ambiguous or ambivalent to the influence of surveillance capitalism. the open commons is not only data. it is a patient's expectant diagnosis, a community that provides care and future welfare, and it is where worthwhile research is done to promote public health. it is also the social space where much of our lives has shifted to during the covid- pandemic. these activities are underpinned by the public good, which establishes fundamental obligations on individuals and institutions that, at times like these, are necessary for an effective pandemic response. in the community of rights, the freedom of the commons is given to friendly participants, donors or altruists, and their exclusion from its benefits is unethical; people are free to come and go. 
the egalitarian response to a potential tragedy of the commons, brought about by misplaced trust in capitalist platforms, is to exclude those that "follow strategies that destroy the very resource" itself; and recognition that the threat from open data use is "so sweeping that it can no longer be circumscribed by the concept of privacy and its contests." the worthy ambitions of ethical os need safeguarding by expanding the narrative to the existential challenges of our time; it has become evident that the emergent os ecosystem is not sufficiently equitable, and encourages activities that actively reduce socially valuable outcomes. past examples of the untrammeled use of data (most recently, in political campaigns) raise concerns about the extent of data held about persons, and how that data can be manipulated and in what ways and for what purpose. the current os narrative does not go to the root of these concerns about the breadth of information needed and available to make sophisticated predictions about people, and ultimately, the consequential decisions made that limit their freedom during a public health crisis and beyond. a new sense of solidarity in the open commons is one of the few reassuring things to have happened during this pandemic, and through experiencing degrees of alienation, illness, poverty and sadness during the pandemic, communities should not be exploited by entities compelled by old-fashioned, anti-community ideas of neoliberalism.
acknowledgements: an early iteration of this paper was presented at "open science: what do we need to know to protect the interests of public and scientists?," a workshop jointly presented by the hugo key: cord- - sx j authors: diou, christos; sarafis, ioannis; papapanagiotou, vasileios; alagialoglou, leonidas; lekka, irini; filos, dimitrios; stefanopoulos, leandros; kilintzis, vasileios; maramis, christos; karavidopoulou, youla; maglaveras, nikos; ioakimidis, ioannis; charmandari, evangelia; kassari, penio; tragomalou, athanasia; mars, monica; nguyen, thien-an ngoc; kechadi, tahar; o'donnell, shane; doyle, gerardine; browne, sarah; o'malley, grace; heimeier, rachel; riviou, katerina; koukoula, evangelia; filis, konstantinos; hassapidou, maria; pagkalos, ioannis; ferri, daniel; pérez, isabel; delopoulos, anastasios title: bigo: a public health decision support system for measuring obesogenic behaviors of children in relation to their local environment date: - - journal: nan doi: nan sha: doc_id: cord_uid: sx j obesity is a complex disease and its prevalence depends on multiple factors related to the local socioeconomic, cultural and urban context of individuals. many obesity prevention strategies and policies, however, are horizontal measures that do not depend on context-specific evidence. in this paper we present an overview of bigo (http://bigoprogram.eu), a system designed to collect objective behavioral data from children and adolescent populations as well as their environment in order to support public health authorities in formulating effective, context-specific policies and interventions addressing childhood obesity.
we present an overview of the data acquisition, indicator extraction, data exploration and analysis components of the bigo system, as well as an account of its preliminary pilot application in schools and clinics in four european countries, involving over , participants. obesity prevalence has been continuously rising for the past forty years [ ] and is now one of the world's biggest health challenges. given that the disease is largely preventable, researchers have sought appropriate policy measures to limit the development of overweight and obesity, especially in children, since children who are overweight or obese are likely to remain obese in adulthood [ ]. many large-scale public health actions are limited to indiscriminate blanket policies and single-element strategies [ ] that often fail to address the problem effectively. this has been attributed to the complex nature of the disease [ ], implying that effective interventions must be evidence-based, adapted to the local context and address multiple obesogenic factors of the environment [ ], [ ], even on a local neighborhood level [ ]. developing effective multi-level interventions addressing childhood obesity therefore requires data that link conditions in the local environment to children's obesogenic behaviors such as low levels of physical activity, unhealthy eating habits, as well as insufficient sleep. most of the existing evidence on obesogenic behaviors of children relies on diet and physical activity recall questionnaires [ ], which can lead to inaccurate measurements [ ] and often do not provide sufficient detail about the interaction of children with their environment (such as use of available opportunities for physical activity, or visits to different types of food outlets). on the other hand, the availability and widespread use of wearable and portable devices, such as smartphones and smartwatches, provide an excellent opportunity for obtaining objective measurements of population behavior. this has not yet been exploited to its full extent for evidence-based policy decision support addressing childhood obesity. in this paper we present an overview of the bigo system, which has been developed with the aim to objectively measure the obesogenic behaviors of children and adolescent populations in relation to the local environment using a smartphone and smartwatch application. specifically, bigo provides policy makers with the tools to (i) measure the behavior of a sample of the local population (targeting ages - ) regarding their physical activity, eating and sleep patterns, (ii) aggregate data at geographical level to avoid revealing any individual information about participants, (iii) quantify the conditions of the local urban environment, (iv) visualize and explore the collected data, (v) perform inferences about the strength of relationships between the environment and obesogenic behaviors and, finally, (vi) predict and monitor the impact of policy interventions addressing childhood obesity.
*the work leading to these results has received funding from the european community's health, demographic change and well-being programme under grant agreement no. , / / - / / (http://bigoprogram.eu/).
aggregated indicators are then used through the public health portal, a web application that supports data exploration and visualization and helps analysts make inferences about possible local drivers of obesogenic behaviors, as well as to assess and predict the impact of policy interventions addressing childhood obesity. other web applications (portals) are also provided. the school portal is used to organize educational activities around obesity at school and to coordinate class or group participation in the data collection. the clinical portal is used by clinicians to better measure their patients' obesogenic behaviors and provide personalized guidance. finally, the community portal provides a summary of data and findings to the public. obesity. the following sections present an outline of the bigo system as well as the challenges that we have identified from its preliminary application to data collection pilots in schools and clinics, involving over , participants to date (april ). a conceptual overview of the bigo system is shown in figure . data is collected from smartwatch and smartphone sensors through the "mybigo" app, available for android and ios operating systems [ ] . it is then stored at the bigo dbms, consisting of apache cassandra (for time series data) and mongodb (for all other application data) databases. data processing is carried out by the bigo analytics engine, built on top of apache spark. the processing steps involve the extraction of individual and population-level behavioral indicators, the extraction of environment indicators, as well as statistical data analysis and machine learning model training. the processing outputs support the operation of the end-user interfaces which include the public health authorities portal, school portal, clinical portal and the community portal. administrative and operational data (e.g., number of exercise sessions per week at school, availability of school lunches, class start/end times) are collected at school and clinic level through the portals. furthermore, data about body mass index range, age and sex of participants is also collected through the portal. no directly identifiable information (such as names or emails) is stored anywhere in the system. regarding individual participants, data is collected through a smartphone and smartwatch application. collected data includes (i) triaxial accelerometer signal, (ii) gps location data, (iii) meal pictures, (iv) food advertisement pictures, (v) meal self-reported data, (vi) answers to a one-time questionnaire (when the mybigo app is started for the first time) and (v) answers to recurring mood questionnaires. the biggest challenges in data collection come from the battery power requirements of the accelerometer and location sensors. accelerometer is sampled at a low sampling rate, which is device dependent and is usually in the range of - hz. location data is sampled every minute. to preserve battery, the data acquisition module of the mobile application is compatible with the "doze" mode of the android operating system. it stops data capturing whenever a device is inactive (doze mode) and restarts whenever the device becomes active again. during this time, acceleration is assumed to only be affected by gravity (not any kind of user motion) and location data is fixed to the last known position [ ] b collected raw data are processed to extract behavioral indicators. this can take place at the mobile phone (to avoid transmitting raw data) or centrally, at the bigo servers. 
in both cases (processing on the phone or centrally), raw data is considered sensitive and cannot be accessed directly. aggregated behavioral indicators that result from the raw data are used instead [ ]. there are three levels of granularity of behavioral indicators, namely (i) base indicators, which describe the behavior of an individual at fine temporal granularity, (ii) individual indicators, which aggregate indicators across time to summarize the behavior of an individual, and (iii) population indicators, which aggregate the behavior across individuals in a geographical region. examples are provided in tables i and ii. base indicators are extracted through signal processing and machine learning algorithms [ ], such as [ ] and [ ] for step counting, [ ] for activity type detection, [ ] for transportation mode detection and [ ] for detecting visited points of interest (pois). regarding the visited pois, the information stored is the type of poi, from a pre-defined poi hierarchy, such as "restaurant", "fast food or takeaway" and "sports facility". certain base indicators (such as activity counts) and individual indicators are only used at the clinical portal, which is accessible by health professionals to obtain information about their patients. in all other cases, aggregated information is used, for privacy protection purposes. there are two types of aggregation, leading to two different analysis types:
• habits: in this analysis we are interested in the overall behavior of participants living in a region;
• use of resources: in this analysis we are interested in the behavior of participants visiting a region, but only during their visits to that region.
a region can either be an administrative region or a geohash (a public domain geocode system encoding rectangular geographical regions as alphanumeric strings). in addition to the measured behavior of individuals, each geographical region is characterized by the local urban and socio-economic context. these are quantified by a set of variables called local extrinsic conditions (lecs) in bigo, which are obtained through official statistics, or through gis databases. table ii shows some examples of lecs used in bigo. the collected behavior and environment indicators can be used to (i) infer associations and possible causal relations between lecs and obesogenic behaviors and (ii) predict and monitor the impact of interventions on the measured population behaviors. examples of indicator and lec definitions from tables i and ii: for a population indicator of low walking levels, for each resident of the region, first compute their number of steps for each minute of recorded data, for individuals with more than hours of recorded data; the indicator is the percentage of residents of the region that walk, on average, less than steps per hour. for the lec "average number of restaurants within m radius from locations within the region", create a m point grid inside the region, compute the number of restaurants in a m radius of each point, and take the average across all points inside the region. the number of athletics/sports facilities inside the region is computed using publicly available data sources. given the complexity and multifactorial nature of obesity [ ], it is important to account for confounding factors and to avoid spurious correlations. to this end, we start from the foresight obesity system map [ ] and construct directed acyclic graphs (dags) indicating causal relations between variables. an example dag for physical activity is shown in figure . the associations between variables are quantified using explainable statistical models, such as linear and generalized linear models.
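to make this last step concrete, the following is a minimal sketch assuming a region-level dataset in which each row holds a population behavior indicator and a few lecs; the variable names (pct_low_walking, restaurants_density, sports_facilities, socioeconomic_index) are illustrative placeholders, and in practice the adjustment set would be read off the study dag rather than chosen ad hoc.

```python
# a minimal sketch of quantifying an lec-behavior association with an
# explainable linear model, adjusting for a placeholder confounder.
import pandas as pd
import statsmodels.formula.api as smf


def fit_lec_association(regions: pd.DataFrame):
    """regress a population behavior indicator on lecs at region level,
    with a confounder adjustment suggested by the causal dag."""
    model = smf.ols(
        "pct_low_walking ~ restaurants_density + sports_facilities + socioeconomic_index",
        data=regions,
    )
    result = model.fit()
    return result.summary()
```

the same formula interface accommodates generalized linear models when the indicator is a count or a proportion.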
self-selection and other sources of bias will need to be quantified during the analysis as well. for prediction, non-linear machine learning models such as support vector machines or neural networks can also be used. the analysis of bigo data is still work in progress; however, some preliminary results on prediction are presented in [ ]. bigo has been deployed in clinics and in schools in athens, larissa and thessaloniki in greece, the stockholm area in sweden and dublin in ireland. children join as citizen-scientists and contribute data about their behavior and environment through the mybigo app. the planned data collection time is two weeks per child. in schools, data is used in school projects, with the help of the bigo school portal, which provides data visualizations for participating school classes. in the clinic, the data is used by clinicians to monitor the behavior of patients, through the clinical portal. in total, children from schools and clinics have contributed data so far. ethical approvals have been obtained, as well as the necessary consent from all participants. by april of , bigo had reached out to , children and their parents, out of which over , registered in the system. not all children provide the same amount of data. reasons for children providing less data than expected include technical issues from the user side (e.g., smartwatch not properly paired with phone), technical issues with the smartphone (e.g., some smartphone manufacturers do not permit background apps to run) as well as low participant compliance. the current estimate is that monitoring data is received for approximately % of the app usage time, while approximately % of the registered users do not provide accelerometer or gps data (only self-reports and pictures). the currently collected data volume includes approximately years of accelerometry data, years of gps data and , meal pictures. note that the actual monitoring time is higher (since no data is recorded when the device is idle). there are some noteworthy observations that result from the experience in organizing and deploying the bigo pilots. on the technical side, there are significant obstacles to overcome when depending on off-the-shelf devices such as smartphones and smartwatches for data acquisition. besides battery consumption, certain mobile phone manufacturers introduce custom modifications to the device operating system which can prevent background recording applications from running. users must be aware of these restrictions and disable them for the mybigo app, a process which is device-dependent and not always straightforward. regarding recruitment, it seems that engaging teachers and clinicians is an effective way to invite the participation of children and their parents (who must give their consent). so far, the bigo portals have been used by teachers and clinicians, leading to an acceptance rate of approximately % for the children who were invited to participate in bigo. this approach is now challenged by the recent school lockdowns due to the sars-cov- pandemic, but recruitment is expected to resume once schools open again. it is clear that scaling up such data collection actions requires the active engagement of the local school and education authorities. in bigo, the pilots were carried out through the initiative of participating researchers and schools/clinics that decided to join, without the direct support from the local authorities.
our vision is that by demonstrating that such citizen-science activities are effective for collecting data to formulate evidence-based policies, bigo will motivate local authorities to adopt such systems as part of their decision-making process. while data collection in the bigo pilots continues, the next steps include the analysis of the collected data and the dissemination of results to all relevant regional and national government bodies.
references:
eu action plan for childhood obesity
child and adolescent obesity: part of a bigger picture
cochrane review: interventions for treating obesity in children
implications of the foresight obesity system map for solutions to childhood obesity
obesogenic environments: a systematic review of the association between the physical environment and adult weight status, the spotlight project
ultra-processed food advertisements dominate the food advertising landscape in two stockholm areas with low vs high socioeconomic status. is it time for regulatory action?
validation of the stanford -day recall to assess habitual physical activity
energy balance measurement: when something is not better than nothing
developing a novel citizen-scientist smartphone app for collecting behavioral and affective data from children populations
a methodology for obtaining objective measurements of population obesogenic behaviors in relation to the environment
collecting big behavioral data for measuring population behavior against obesity
robust and accurate smartphone-based step counting for indoor localization
a smartwatch step counter for slow and intermittent ambulation
creating and benchmarking a new dataset for physical activity monitoring
travel mode detection with varying smartphone data collection frequencies
an improved dbscan algorithm to detect stops in individual trajectories
tackling obesities: future choices-project report. department of innovation
inferring the spatial distribution of physical activity in children population from characteristics of the environment
key: cord- - kmder authors: meyer, r. daniel; ratitch, bohdana; wolbers, marcel; marchenko, olga; quan, hui; li, daniel; fletcher, chrissie; li, xin; wright, david; shentu, yue; englert, stefan; shen, wei; dey, jyotirmoy; liu, thomas; zhou, ming; bohidar, norman; zhao, peng-liang; hale, michael title: statistical issues and recommendations for clinical trials conducted during the covid- pandemic date: - - journal: nan doi: nan sha: doc_id: cord_uid: kmder the covid- pandemic has had and continues to have major impacts on planned and ongoing clinical trials. its effects on trial data create multiple potential statistical issues. the scale of impact is unprecedented, but when viewed individually, many of the issues are well defined and feasible to address. a number of strategies and recommendations are put forward to assess and address issues related to estimands, missing data, validity and modifications of statistical analysis methods, need for additional analyses, ability to meet objectives and overall trial interpretability. the covid- outbreak emerging in china in december quickly became a global pandemic as declared by the world health organization in march . as of today, still only a few months into the pandemic, the disease and public health control measures are having a very substantial impact on clinical trials globally.
quarantines, site restrictions, travel restrictions affecting participants and site staff, covid- infections of study participants, and interruptions to the supply chain for study medication have led to operational problems, including difficulties in adhering to study protocols. trial sponsors have rapidly responded to this crisis, where the overriding primary concern has been to protect participant safety. some trials have been halted or enrolment suspended in the interest of participant safety. for ongoing trials, sponsors have implemented a variety of mitigations to assure safety of participants and address operational issues. the downstream effects of protocol deviations and trial conduct modifications lead to varying degrees of impacts on clinical trial data. the impacts, described in more detail in later sections, raise important statistical issues for the trial. in the extreme, trial integrity, interpretability, or the ability of the trial to meet its objectives could be compromised. intermediate to that, planned statistical analyses may need to be revised or supplemented to provide a thorough and appropriate interpretation of trial results. this paper offers a spectrum of recommendations to address the issues related to study objectives, inference, and statistical analyses. the major categories of impacts and mitigations are summarized in figure . the issues we discuss here largely involve ongoing trials, started before but conducted during the pandemic for non-covid- related therapies. many of the issues and recommendations will also apply to new trials. regulatory agencies have rapidly published guidance for clinical trial sponsors to address covid- issues (fda , ema a , b . the current paper is influenced by and expands upon these important guidance documents. the paper is organized as follows. in section we describe overall trial impact assessment. section considers assessment of impacts on the trial through the estimand framework. section summarizes recommendations for revised and supplemental analyses that may be needed for the trial, including the likely mechanisms of missing data and the recommended statistical approaches to address missingness. section outlines additional considerations for trial-level impact. a summary of recommendations is given in section . for trials impacted by the pandemic, assessing the change of the benefit/risk for participants is the first step in the decision-making process (fda ). all recommendations in this article presuppose that appropriate steps have been taken to assure participant safety. sponsors are advised to perform a risk assessment based on aggregated and blinded data to evaluate the likelihood of a trial to deliver interpretable results. it must start as a forwardlooking assessment in anticipation of effects not yet seen but with some likelihood to occur. 
it should continue throughout the conduct of the study in light of the evolving situation and accumulating data, considering regional differences in the infection status and pandemic-related control measures. as part of this risk assessment, sponsors should:
• determine what additional information needs to be collected in the study database or in the form of input from study investigators in order to adequately monitor, document, and address pandemic-related issues (feasibility to obtain such information and its quality may vary and this needs to be considered as part of the risk factors);
• understand reasons for treatment or study discontinuation and the impact on planned estimands and intercurrent events;
• evaluate the extent of missing data and specific reasons for missingness;
• assess changes in enrollment and in the study population over time;
• evaluate the protocol-specified assumptions and the likelihood that the trial would be able to achieve its goals;
• ideally, verify the usability of data captured from alternative methods (e.g., virtual audio or video visits) before implementing them. such data may add more variability or not be interpretable;
• determine any changes to planned analyses and analysis population definitions, or additional sensitivity analyses that need to be pre-specified prior to unblinding.
based on the risk evaluations above, many sponsors have developed standardized metrics of trial operational status, such as rates of missed visits, discontinuations from treatment and study, protocol deviations, and adverse events (aes), to reinforce a consistent approach to risk monitoring and assessment. such metrics are useful to identify trials that are more acutely impacted and to monitor the overall state of a portfolio of trials. the ich e (r ) addendum (ich, ) defines the estimand framework for ensuring that all aspects of study design, conduct, analysis, and interpretation align with the study objectives. it also provides a rigorous basis to discuss potential pandemic-related disruptions and to put them in the context of study objectives and design elements. for an affected trial, the first major question is whether the primary objective, and therefore the primary estimand, should target the treatment effect without potentially confounding influences of covid- . we recommend that for most studies started before the pandemic, the original primary objective should be maintained as designed, implying a treatment effect that is not confounded by pandemic-related disruptions. (for new studies, this definition of treatment effect may also be reasonable, but depends on many aspects of the trial design.) this does not automatically imply a broad "hypothetical estimand" with the same hypothetical scenario for all possible pandemic-related intercurrent events (ices). confounding may need to be addressed in different ways for different types of ices depending on the study context. we discuss this in section . our discussion is mainly geared towards considerations for the primary efficacy estimands, but similar logic can be applied to other study estimands. strategies for handling non-pandemic-related ices should remain unchanged. here we discuss handling of pandemic-related ices only. the estimand framework allows for different strategies to be used for different types of ices, and such estimands will likely be the most appropriate in the current context.
ices should be considered pandemic-related if they occur as a result of pandemic-related factors and are not attributed to other non-pandemic-related reasons, e.g., treatment discontinuation due to lack of efficacy or toxicity. pandemic-related ices of importance should first be categorized in terms of their impact on study treatment adherence (e.g., study treatment discontinuation) or the ability to ascertain the target outcomes (e.g., death). examples of such pandemic-related ices listed in table include a participant being admitted to intensive care.* (* covid- related deaths and initiation of treatment for covid- infections may also be considered as ices if they occur after the completion of study treatment or after other non-pandemic-related ices and before the time point associated with the endpoint of interest.) certain types of non-adherence to study treatment may not normally be considered as ices but may need to be in the context of the pandemic. for example, an ice of significantly reduced compliance or temporary treatment interruption may not have been anticipated at study design but could be now if considered likely due to pandemic-related disruptions. for some studies with significant pandemic-related treatment interruptions, the minimal duration of interruption expected to dilute the treatment effect could be defined. different strategies can be used for interruptions exceeding this duration as opposed to shorter interruptions. sensitivity analyses can be used to assess robustness of inference to the choice of cut-off. for time-to-event endpoints, it is tempting to define the minimally acceptable level of drug compliance on a participant level according to the observed exposure to study drug from the first dose of study drug until the event or censoring and exclude participants without a minimally acceptable level of compliance. however, such an approach could introduce immortal time bias and should therefore be avoided (van walraven et al., ). a special consideration may be warranted for participants receiving experimental treatment for covid- regardless of whether they remain on study treatment. also, in studies where mortality was not originally expected, death due to covid- should be considered as a potential ice. most of the ice types listed against the "participant's adherence to study treatment" attribute in table (e.g., study treatment discontinuation) due to non-pandemic-related reasons are likely addressed in the primary estimand prior to the pandemic. we recommend starting with an examination of whether the original strategy is justifiable when these ices occur due to the pandemic. if not, a different strategy should be chosen. we outline some high-level considerations in this respect.
• treatment policy strategy, in which ices are considered irrelevant in defining the treatment effect, will typically not be of scientific interest for most pandemic-related ices because the conclusions would not generalize in the absence of the pandemic. for example, the treatment effect estimated under the treatment policy for premature treatment discontinuations caused by pandemic-related disruptions will reflect the effect of a regimen where discontinuations and changes in therapy occur due to pandemic-related factors (e.g., disruptions in study drug supply), which would not be aligned with the primary study objective.
initiation of treatment for covid- infection after an earlier non-pandemic-related ice that was planned to be handled by the treatment policy strategy but prior to observation of an efficacy or safety endpoint will also need to be considered carefully and cannot simply be deemed irrelevant. under the treatment policy strategy, the estimated treatment effect may reflect the effects of infection and its treatment, which are presumably not of interest for the primary objective. a decision to use a treatment policy approach for pandemic-related ices may be justifiable if the percentage of participants with such events is low and this strategy was planned for similar non-pandemic-related ices. this strategy may also be considered for handling ices corresponding to relatively short treatment interruptions. the treatment policy strategy should be avoided for pandemic-related ices of premature study treatment discontinuation in non-inferiority and equivalence studies, as similarity between treatment groups may artificially increase with the number of such events. similar considerations also apply to the composite strategy.
• composite strategies, in which ices are incorporated into the definition of the outcome variable, are unlikely to be appropriate for most pandemic-related ices. for example, study treatment discontinuation due to pandemic-related disruptions should not be counted as treatment failure in the same way as discontinuation due to lack of efficacy or adverse reactions. a more nuanced consideration may be needed in studies of respiratory conditions, where covid- complications may be considered with a composite strategy as a form of unfavourable outcome. (see also a discussion on covid- related deaths further below.)
• principal stratification strategy, stratifying on a covid- related event (e.g., serious complications or death due to covid- ), is unlikely to be of interest for the primary estimand because it would limit conclusions to a sub-population of participants defined based on factors not relevant in the context of future clinical practice.
• while-on-treatment strategy may continue to be appropriate under certain conditions if it was originally planned for non-pandemic-related ices. this strategy is typically justifiable when treatment duration is not relevant for establishing treatment effect (e.g., treatment of pain in palliative care), but certain conditions may need to be considered, such as a minimum treatment duration required to reliably measure treatment outcomes.
• hypothetical strategy, in which the interest is in the treatment effect if the ice did not occur, is a natural choice for most pandemic-related ices. this would especially apply to ices of study treatment disruption for pandemic reasons. for such participants, the hypothetical scenario where they would continue in the study in the same way as similar participants with undisrupted access to treatment is reasonable. it is not necessary to assume a hypothetical scenario where such participants would fully adhere to the study treatment through the end of the study. rather, a hypothetical scenario may include a mixture of cases who adhere to the study treatment and those who do not adhere for non-pandemic reasons. discussions with regulatory agencies may be helpful to reach an agreement on the details prior to the final study unblinding.
although estimation methods are not part of the estimand consideration, the ability to estimate treatment effects in a robust manner under a hypothetical strategy based on available data should not be taken for granted and should be assessed as the estimand definition is finalized. this aspect should be part of the overall risk assessment and decisions on the choice of the mitigation strategies. the discussion above highlights the need to capture the information associated with pandemic-related factors, such as those listed in table . this can be done either through designated fields in the case report form (crf) or through a detailed and structured capture of protocol deviations. an ice of death due to covid- requires careful consideration and the appropriate strategy depends on the disease under study and clinical endpoint. in disease areas with minimal mortality where death is not a component of the endpoint, a hypothetical strategy for deaths related to covid- infections may be recommended. for studies in more severe diseases where death is part of the endpoint, it is inevitable that more than one estimand will be of interest when evaluating the benefit of treatment for regulatory purposes. a pragmatic approach which includes covid- -related deaths in the outcome, i.e., which uses a composite strategy, is suitable if the number of covid- -related deaths is low or if there is a desire to reflect the impact of the pandemic in the treatment effect estimate. (see also the related section on competing risks analyses in section . . .) using a hypothetical strategy for deaths related to covid- infections will be important in evaluating the benefit of treatment in the absence of covid- (for example, when the disease is eradicated or effective treatment options emerge). it is acknowledged that such trials frequently include elderly, frail, or immunocompromised participants and it may be difficult to adjudicate a death as caused by covid- or whether the participant died with covid- . while treatment policy, composite, and principal stratification strategies may not be of interest for the primary estimand, they may be of interest for supplementary estimands when there is a scientific rationale to investigate the study treatment either in subpopulations of participants stratified based on covid- infection and outcomes and/or together with a concomitant use of treatments administered for covid- . for example, this may be of interest for studies in respiratory diseases or conditions suspected to be risk factors for covid- complications. the relevance of such estimands will also depend on the evolution of this pandemic, whether the virus is eventually eradicated or persists like seasonal flu. in the latter case, the reality and clinical practice are still likely to be different from the current crisis management conditions as the society and clinical practice adapt to deal with a new disease. in general, treatment condition of interest should remain the same as originally intended. however, operationally, the mode of treatment delivery may need to be changed due to pandemic-related reasons, e.g., treatment self-administered by the participant at home rather than in the clinic by a health-care professional. when such changes are feasible, they may be considered to reduce the frequency of visits to the clinic, and therefore reduce the risk of infection exposure. 
pandemic-related complications with study treatment adherence and concomitant medications should be considered as ices and handled with an appropriate strategy. the extent of such ices should be evaluated in terms of whether the treatment(s) received by participants during the study remain sufficiently representative of what was intended. this may include treatment interruptions, reduced compliance, and access to any background, rescue, and subsequent therapies planned to be covered by the treatment policy strategy. to align with the primary study objective, the target participant population should remain as originally planned, i.e., should not be altered simply due to the pandemic. protocol amendments unrelated to the pandemic to further qualify the study population should still be possible. the trial inclusion/exclusion criteria should also remain largely unchanged relative to those that would be in place in absence of the pandemic, except for the possible exclusion of active covid- infections. the clinical endpoint should generally remain as originally planned. in cases where alternative measurement modalities may be necessary during the pandemic, for example, central labs vs. local labs, remote assessments of questionnaires instead of in-clinic, etc., it must be assured that clinical endpoint measurement is not compromised and potential effects on endpoint variability should be assessed (see section . . ). in cases where pandemic-related ices, such as covid- deaths, are handled using the composite strategy, the definition of the endpoint may need to be adjusted. in cases of numerous delays between randomization and start of treatment, where the endpoint is defined relative to the date of randomization (e.g., in time-to-event endpoints), consideration may be given to redefine the endpoint start date to start of treatment. however, in the context of openlabel studies this may not be advisable. population-level summary describing outcomes for each treatment and comparison between treatments should remain unchanged, in general. in rare situations, a summary measure may need to be changed, for example, if the originally planned endpoint is numeric and a composite strategy is used for covid- deaths to rank them worse than any value in survived participants. in this case, a summary measure may be changed from mean to median. another example could be a hazard ratio (hr) from a cox proportional hazard regression, a summary measure of treatment effect commonly used for trials with time-toevent endpoints. if the assumption of proportional hazards is not satisfied, the estimated hr depends on the specific censoring pattern observed in the trial, which is influenced both by participant accrual and dropout patterns (rufibach, ) . external validity and interpretability of the hr needs to be carefully considered if censoring patterns are affected during the pandemic in ways that are not representative of non-pandemic conditions and if additional pandemic-related censoring depending on covariates such as the participant's age or comorbidities are observed. similarly, the validity of the log-rank test relies on the assumptions that the survival probabilities are the same for participants recruited early and late in the trial and that the events happened at recorded times. such assumptions may need to be assessed. supportive estimands with alternative summary measures could be considered (see e.g., boyd et al., ; nguyen and gillen, ; mao et al., ) . 
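one commonly used alternative summary measure for time-to-event endpoints that does not rely on the proportional hazards assumption is the restricted mean survival time (rmst) up to a pre-specified horizon. the following is a minimal sketch of an rmst difference computed from kaplan-meier estimates; it assumes the lifelines library as a dependency and a dataset with columns 'time', 'event' and 'arm', all of which are illustrative choices rather than recommendations from the source.

```python
# a minimal sketch: rmst up to a horizon tau from a kaplan-meier fit.
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter


def rmst_from_km(durations, events, tau):
    """area under the kaplan-meier curve on [0, tau]."""
    km = KaplanMeierFitter().fit(durations, event_observed=events)
    sf = km.survival_function_
    sf = sf[sf.index <= tau]                       # restrict to the horizon
    times = np.append(sf.index.values, tau)        # close the interval at tau
    surv = sf.iloc[:, 0].values
    return float(np.sum(surv * np.diff(times)))    # step-function integration


def rmst_difference(df: pd.DataFrame, tau: float) -> float:
    """treatment-minus-control difference in rmst at horizon tau."""
    rmst = {arm: rmst_from_km(sub["time"], sub["event"], tau) for arm, sub in df.groupby("arm")}
    return rmst["treatment"] - rmst["control"]
```

confidence intervals would in practice come from an asymptotic variance formula or a bootstrap, and the horizon tau should be pre-specified.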
planned statistical analyses may need to be modified due to effects of the pandemic on trials. additional sensitivity and supplementary analyses may be needed to properly understand and characterize the treatment effect. depending on the trial, modifications in planned analyses may range from relatively minor, e.g., for trials with relatively low impact, to major, e.g., in settings where study drug administration and visits are severely disrupted by the pandemic. a general summary of analysis considerations is provided in table and detailed discussions are presented in subsequent sections. all planned modifications and additional analyses should be documented in the sap prior to data unblinding and in the clinical study report. additional post hoc exploratory analyses may also be necessary after study unblinding to fully document the impact of the pandemic and characterize the treatment effect in this context.
main analyses:
• review all planned main and sensitivity analyses to ensure alignment with the revised estimand(s).
• review/amend methods for handling of missing data, or censoring rules, to accommodate pandemic-related missingness.
• summarize the occurrence of pandemic-related ices and protocol deviations.
• summarize the number of missed or unusable assessments for all key endpoints.
• summarize the number of assessments performed using alternative modalities.
• summarize study population characteristics before and after pandemic onset.
additional sensitivity and supportive analyses:
• plan additional analyses for sensitivity to pandemic-related missingness.
• consider the need for additional, alternative summary measures of treatment effect.
• consider exploring inclusion of additional auxiliary variables, interaction effects, and time-varying exogenous covariates in the analysis methods.
• consider subgroup analyses based on subgroups defined by pandemic impact, e.g., primary endpoint visits before or after pandemic onset.
• consider the need for evaluation of potential impact of alternative data collection modalities.
• consider sources of data external to the trial, for example to justify use of alternative modalities.
• plan for additional safety analyses.
all planned efficacy analyses should be re-assessed considering the guidance provided in sections and in terms of handling of pandemic-related missing data (see section . ). the core analysis methodology should not change. however, depending on the revisions to the estimand, the strategies chosen for pandemic-related ices, and the handling of pandemic-related missing data, some changes to the planned analyses may be warranted. additional analyses will frequently also be required to assess the impact of the pandemic disruption. special considerations may be needed for studies and endpoints where participant outcomes could be directly impacted by the pandemic, e.g., in respiratory diseases or quality of life endpoints. in studies where enrolment is halted due to the pandemic, sponsors should compare the populations enrolled before and after the halt. more generally, shifts in the population of enrolled participants over the course of the pandemic should be evaluated. baseline characteristics (including demographic, baseline disease characteristics, and relevant medical history) could be summarized by enrolment period to assess whether there are any relevant differences in the enrolled population relative to the pandemic time periods.
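as a sketch of such a blinded descriptive comparison, the code below summarizes baseline characteristics by enrolment period; the column names and the cut date used to split the periods are placeholders for illustration, not values from the source.

```python
# a minimal sketch, assuming a blinded participant-level dataset with columns
# ['subject_id', 'enrolment_date', 'age', 'baseline_severity', 'sex', 'region'].
import pandas as pd

PANDEMIC_ONSET = pd.Timestamp("2020-03-11")  # placeholder cut date


def baseline_by_enrolment_period(subjects: pd.DataFrame) -> dict:
    """summarize baseline characteristics before vs after the cut date
    to assess shifts in the enrolled population (blinded to treatment)."""
    df = subjects.copy()
    df["enrolment_period"] = (df["enrolment_date"] >= PANDEMIC_ONSET).map(
        {False: "before pandemic onset", True: "after pandemic onset"}
    )
    numeric = df.groupby("enrolment_period")[["age", "baseline_severity"]].agg(["mean", "std"])
    categorical = {
        col: pd.crosstab(df["enrolment_period"], df[col], normalize="index").round(2)
        for col in ["sex", "region"]
    }
    return {"numeric": numeric, "categorical": categorical}
```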
shifts could be associated with regional differences in rates of enrollment because start and stop of enrollment is likely to vary by country as pandemic measures are implemented or lifted. sponsors should make every effort to minimize the amount of missing data without compromising safety of participants and study personnel during the covid- pandemic and placing undue burden on the healthcare system. whenever feasible and safe for participants and sites, participants should be retained in the trial and assessments continued, with priority for the primary efficacy endpoint and safety endpoints, followed by the key secondary endpoints. despite best efforts, sponsors should prepare for the possibility of increased amounts and/or distinct patterns of missing data. in the framework of ich e (r ), an assessment or endpoint value is considered missing if it was planned to be collected and considered useful for a specific estimand but ended up being unavailable for analysis. in case of ices that are addressed by a hypothetical strategy, endpoint values are not directly observable under the hypothetical scenario. such data are not missing in the sense of the ich e (r ) definition, however, they need to be modelled in the analysis, often using methods similar to those for handling of missing data. in the remainder of the paper, we will discuss methods for handling of missing data, and note that such methods can be useful for modelling unobserved data after ices, if the modelling assumptions align with the hypothetical scenarios chosen for addressing the corresponding ices. sponsors should assess and summarize patterns (amount and reasons) of pandemic-related missing data in affected trials. data may be missing because a) planned assessments could not be performed; b) collected data is deemed unusable for analysis, e.g., out-of-window; or c) data under a desired hypothetical scenario cannot be observed after an ice (e.g., censored). additionally, each pandemic-related missingness instance also has specific reasons and circumstances. at a high level, reasons for pandemic-related missing data could be structural (e.g., government enforced closures or sites stopping study-related activities) or they could be participant-specific (e.g., individual covid- disease and complications or individual concerns for covid- ). table outlines the aspects that together provide a comprehensive picture for assessing the impact of missing data and planning how to handle them in analysis. sponsors should, therefore, capture such information in the clinical study database as much as possible. the last two rows of table reflect circumstances similar to those that are considered in the context of ices (see table ). since missing data may occur both in the presence of ices (e.g., handled by a hypothetical strategy) and in the absence of ices (e.g., participant continues to adhere to study treatment but misses some assessments), we list them here as well. table . attributes of pandemic-related missing data. row summarizes reasons of missing data, rows - summarize related conditions that contribute to those reasons. 
reasons for missing data:
• assessment missing due to a participant's premature discontinuation from the study overall for pandemic-related reasons;
• assessment missing due to missed study visits/procedures while a participant remains in the study (intermittent missing data);
• assessment delayed (out-of-window) and deemed unusable for an analysis;
• a composite score (e.g., acr in rheumatoid arthritis) cannot be calculated because some components are missing;
• assessment deemed to be influenced by pandemic-related factors and deemed unusable for a particular analysis because the interpretability of the results may be impacted (e.g., in assessments of quality of life, activity/functional scales, healthcare utilization, etc.);
• recorded data cannot be properly verified or adjudicated due to covid- -related factors and deemed to be unreliable for analysis;
• assessment performed after an intercurrent event intended to be handled with a hypothetical strategy and collected data are deemed unusable for this estimand.
assessment accessibility:
• site (facilities or staff) unavailable to perform study-related assessments;
• site/assessment procedure available but participant is unable/unwilling to get the assessment done due to personal pandemic-related reasons.
sponsors should also consider the potential for under-reporting of symptoms and aes during the pandemic due to missed study visits or altered assessment modalities, e.g., a telephone follow-up instead of a physical exam. (see section . .) sponsors should also consider reporting patterns of missingness along several dimensions: over time (in terms of study visits as well as periods before and after the start of pandemic disruptions), with respect to certain demographic and baseline disease characteristics, as well as co-morbidities considered to be potential risk factors associated with covid- infection or outcomes, and across geographic regions. blinded summaries of missing data patterns prior to study unblinding may inform the choice of missing data handling strategies. it may be useful to compare missing data patterns from the current studies with similar historical studies, especially with respect to missingness in subgroups. after study unblinding (for final or unblinded dmc analyses), missing data patterns should be summarized overall and by treatment arm. although in most cases pandemic-related missingness, especially structural missingness, would not be expected to differ between treatment arms, such a possibility should not be ruled out. in special circumstances, such as in an open-label study, missing visits may be related to treatment: if the experimental treatment must be administered at the site while the control treatment can be administered at home, there may be more missing assessments in the control group. this may result in biased treatment effect estimates if mitigating strategies are not implemented. this could also be the case for a single cohort trial using an external control. sponsors should generally maintain the same approaches for handling of non-pandemic-related missing data as originally planned in the protocol and sap. for pandemic-related missingness, appropriate strategies will need to be identified in the context of each estimand and analysis method. which strategy is most appropriate should be considered in light of the underlying context and reasons for missingness as shown in table and in alignment with the estimand for which the analysis is performed. three cases are described.
( ) when data are missing without the participant having an ice, i.e., the participant continues to adhere to study treatment but has some endpoint values missing: the missing data modelling should be based on clinically plausible assumptions of what the missing values could have been given the fact that the participant continues to adhere to study treatment and the participant's observed data. ( ) when data are modelled in the presence of an ice: the strategy defined in the estimand for addressing that ice should be considered. established missing-data methodologies provide an adequate selection of tools to deal with pandemic-related missingness (see e.g., molenberghs and kenward, ; nrc, ; ratitch, ; mallinckrodt et al., ). methods for dealing with missing data are often categorized based on the type of assumptions that can be made with respect to the missingness mechanism. using the classification of missingness mechanisms of molenberghs and kenward ( ), aligned with longitudinal trials with missing data, data are missing completely at random (mcar) if the probability of missingness is independent of all participant-related factors or, conditional upon appropriate pre-randomization covariates, the probability of missingness does not depend on either the observed or unobserved outcomes. (we note that in the framework of little and rubin ( ), mcar is defined as independent of any observed or unobserved factors. this definition was subsequently generalized to encompass dependence on pre-randomization covariates, also referred to as covariate-dependent missingness, and mcar is now used in the literature in both cases.) some types of pandemic-related missingness may be considered mcar, e.g., if it is due to a site suspending all activities related to clinical trials. consideration may be given to whether such participants should be excluded from the primary analysis set depending on the amount of data collected prior to or after the pandemic. for example, when all (or most) data are missing for some participants, imputing their data based on a model from participants with available data would not add any new information to the inference, while excluding such participants is unlikely to introduce bias. when participants have data only for early visits before the expected treatment effect onset and the rest of the data are mcar, then including such participants in the analysis set may not add information for inference about treatment effect while adding uncertainty due to missing data. data are mar if, conditional upon appropriate (pre-randomization) covariates and observed outcomes (e.g., before the participant discontinued from the study), the probability of missingness does not depend on unobserved outcomes. if relevant site-specific and participant-specific information related to missingness is collected during the study, most of the pandemic-related missingness can be considered mcar or mar. the definitions of mcar and mar mechanisms are based on conditional independence of missing data given a set of covariates and observed outcomes that explain missingness. the factors explaining pandemic-related missingness may include additional covariates and, in the case of mar, post-randomization outcomes. for example, missingness during the pandemic may depend on additional baseline characteristics, e.g., age and co-morbidities, as well as post-randomization pandemic-related outcomes, such as covid- infection complications.
in the case of covariate-dependent missingness, regression adjustment for the appropriate baseline covariates is sufficient for correct inference, though this complicates the analysis model and the interpretation of the treatment effect for models such as logistic or cox regression where conditional and marginal treatment effect estimates do not agree. under mcar and mar, some modelling frameworks such as direct likelihood, e.g., mixed models for repeated measures (mmrm), can take advantage of separability between parameter spaces of the distribution of the outcome and that of missingness. in this case, missingness can be considered ignorable (molenberghs and kenward, ) , and the factors related only to missingness do not need to be included in the inference about marginal effects of treatment on outcome. this does not, however, apply to all inferential frameworks. multiple imputation (mi) methodology (rubin, ) may be helpful in this respect as it allows inclusion of auxiliary variables (both pre-and post-randomization) in the imputation model while utilizing the previously planned analysis model. multiple imputation with auxiliary variables may be used for various types of endpoints, including continuous, binary, count, and time-to-event and coupled with various inferential methods in the analysis step. the use of mi with rubin's rule for combining inferences from multiple imputed datasets may introduce some inefficiencies and impact study power, although some alternatives exist (see, e.g., schomaker and heumann, ; hughes et al., ) . for implementing a hypothetical strategy for covid- related ices in the context of time-to-event endpoints, regression models (e.g., cox regression) adjusted for relevant baseline covariates (in the mcar setting) or multiple imputation (in the mar setting) is recommended. competing risks analyses which treat pandemic-related ices that fully or partially censor the outcome, e.g., covid- related deaths, as competing events are not compatible with a hypothetical strategy. a further complication in the interpretation of competing risks analyses in this context is that participants are not at risk for the competing event from their time origin (e.g., randomization or start of treatment) onwards but only during the pandemic which is not experienced synchronously across the trial cohort. this compromises the validity of common competing risks analyses and prevents interpretation of results from such analyses to be generalized to the population. the implication of assuming specific missingness mechanism is that missing outcomes can be modelled using observed pre-randomization covariates or covariates and post-randomization outcomes (mar) from other "similar" participants, conditional on the observed data. for pandemic-related missingness, it is important to evaluate whether there are sufficient observed data from "similar" participants to perform such modelling, even if factors leading to missingness are known and collected. for example, if missingness and endpoint depend on age, and few older participants have available endpoint data, it may not be appropriate to model missing data of older participants using a model obtained primarily from available data of younger participants (possibly resulting in unreliable extrapolation). similarly, severe complications of a covid- infection may be due to these participants having a worse health state overall and modelling their outcomes based on data from participants without such complications may not be justifiable. 
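as an illustration of mar-based multiple imputation with auxiliary variables followed by rubin's rules, the following is a simplified sketch for a continuous endpoint. it is not a validated implementation (a production analysis would, for example, use a fully bayesian or bootstrap draw of the imputation-model parameters and proper small-sample degrees of freedom), and the column names ('y', 'baseline', 'age', 'covid_disruption', 'treatment') are assumptions made for the example.

```python
# an illustrative sketch: mar imputation with auxiliary variables, the planned
# analysis model on each completed dataset, and rubin's rules for pooling.
import numpy as np
import pandas as pd
import statsmodels.api as sm


def impute_once(df, rng):
    """draw one completed dataset using a regression imputation model that
    includes auxiliary variables, with parameter and residual noise added."""
    predictors = ["baseline", "age", "covid_disruption", "treatment"]
    obs, mis = df[df["y"].notna()], df[df["y"].isna()]
    X_obs = sm.add_constant(obs[predictors])
    fit = sm.OLS(obs["y"], X_obs).fit()
    # approximate draw of the imputation-model parameters
    beta = rng.multivariate_normal(fit.params.values, fit.cov_params().values)
    sigma = np.sqrt(fit.scale)
    X_mis = sm.add_constant(mis[predictors], has_constant="add")
    y_draw = X_mis.values @ beta + rng.normal(0.0, sigma, size=len(mis))
    completed = df.copy()
    completed.loc[mis.index, "y"] = y_draw
    return completed


def mi_treatment_effect(df, n_imputations=20, seed=2020):
    rng = np.random.default_rng(seed)
    estimates, variances = [], []
    for _ in range(n_imputations):
        comp = impute_once(df, rng)
        X = sm.add_constant(comp[["treatment", "baseline"]])  # planned analysis model
        res = sm.OLS(comp["y"], X).fit()
        estimates.append(res.params["treatment"])
        variances.append(res.bse["treatment"] ** 2)
    q_bar = np.mean(estimates)                       # pooled estimate
    u_bar = np.mean(variances)                       # within-imputation variance
    b = np.var(estimates, ddof=1)                    # between-imputation variance
    total_var = u_bar + (1 + 1 / n_imputations) * b  # rubin's total variance
    return q_bar, np.sqrt(total_var)
```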
additional assumptions about participants with missing data versus those with observed data may need to be postulated and justified, perhaps based on historical data. this is where the consideration of pandemic-related factors surrounding missingness, such as those mentioned in table , is important. pandemic-related missing data may need to be considered mnar if missingness and study outcomes depend on covid- related risk factors, treatment, and infection status but such data are not collected. the missingness mechanism may also be mnar if it depends on unobserved study outcomes. in the context of the pandemic, it may arise when participants with milder disease or lower treatment response are more inclined to discontinue the study or treatment and if their outcomes and reasons for discontinuation are not documented before discontinuation. analysis under mnar requires additional unverifiable assumptions but may be avoided through collection of relevant data. analysis of sensitivity to departures from mar assumptions should be considered, for example, models assuming plausible mnar mechanisms (see e.g., carpenter et al., ; mallinckrodt et al., ) . modifications to planned main analyses needed to handle pandemic-related missing data should be specified in the sap prior to study unblinding. see section . . for a discussion of sensitivity analyses with respect to missing data. additional sensitivity and supplementary analyses will frequently be required to assess the impact of pandemic-related disruptions on the trial. for non-pandemic-related events and missing data, the originally planned sensitivity analyses should be performed, but simply applying the pre-planned sensitivity analyses to both pandemic and non-pandemic ices and missing data may be problematic for three reasons. first, the objective to estimate treatment effects in the absence of a covid- pandemic may mandate different strategies for pandemic-related and unrelated events. second, as discussed in section . . , an mcar or mar assumption is frequently plausible for covid- -related events whereas this may not be the case for other missing data. third, sensitivity analyses to missing data could become excessively conservative if the amount of pandemic-related missing data is large. while a relatively large proportion of missing data could normally be indicative of issues in trial design and execution leading to greater uncertainty in trial results, that premise is tenuous when an excess of missing data is attributed to the pandemic. subgroup analyses for primary and secondary endpoints by enrolment or pandemic (see section . . ) period are recommended. if subgroup analyses are indicative of potential treatment effect heterogeneity, the potential for this to be rooted in regional differences should be considered. issues of multiregional clinical trials as described in ich ( ) may be magnified by the pandemic. in addition, dynamic period-dependent treatment effects could also be assessed in an exploratory fashion. for example, in models for longitudinal and time-to-event endpoints, one could include interaction terms between the treatment assignment and the exogenous time-varying covariate describing the patients' dynamic status during follow-up. however, results from such analyses may be nontrivial to interpret and generalize. more liberal visit time-windows may be appropriate for visits depending on the specific trial. 
sensitivity analyses should assess the robustness of results to out-of-schedule visits by either including them or treating them as missing data. a tipping point analysis (ratitch et al., ) may be used to assess how severe a departure from the missing data assumptions made by the main estimator must be to overturn the conclusions from the primary analysis. tipping point adjustments may vary between pandemic and non-pandemic missingness and by reason of missingness. for example, one could apply the tipping adjustment only to pandemic-unrelated missing data but use standard mar imputation for pandemic-related missing data. historical data may be useful to put in context the plausibility of the assumptions and tipping point adjustments. when the main analysis relies on imputation techniques, sensitivity analyses can be done by using an extended set of variables in the imputation algorithm. in the context of intercurrent events handled with the hypothetical strategy, one could vary the assumed probability that the participant would have adhered to treatment through the end of the study versus the probability that they would have had other non-pandemic-related events, and impute their outcomes accordingly. for binary responder analysis, it is not uncommon to treat participants who have missing assessments as non-responders when the proportion of such cases is very small. for pandemic-related delayed and missed assessments, especially those occurring in the absence of ices, it may be preferable to use a hypothetical strategy based on a mar assumption instead. however, a non-responder imputation for both non-pandemic and pandemic-related missing response assessments could be reported as a supplementary analysis. for time-to-event endpoints, we recommend the use of interval-censoring methods in sensitivity analyses to account for cases where the event of interest is known to have occurred during a period of missed or delayed visits but the time to event is not precisely known (bogaerts, komarek, and lesaffre, ). alternative measurements of endpoints may be necessary during the pandemic period. a careful study-specific assessment is necessary to judge whether these alternative measurements are exchangeable with standard protocol assessments. ideally, exchangeability is established at the time of implementation based on information external to the trial. if not, blinded data analyses can support this, e.g., comparisons of the distribution of alternative measurements to the original measurement. however, if the validity of the new instrument has not been established previously, it will be challenging to rigorously demonstrate equivalence using data from the trial alone. if exchangeability can be established or assumed, the main estimator could include data from both the original and the alternative data collection. the sensitivity analysis would then include data collected according to the original protocol only and treat other data as missing. if the validity of the exchangeability assumption is uncertain, the opposite approach can be taken. modelling the interaction between treatment and assessment method can be undertaken as an alternate sensitivity analysis. it will be important to understand the pandemic effect on trial outcomes. there are several schools of thought on how this could be done. previous sections in this paper have described the need for collecting data describing how events such as missed visits and treatment interruptions can be attributed to the pandemic.
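a minimal sketch of such a delta-adjustment tipping point scan is given below; it reuses the hypothetical impute_once() imputation step and analysis model from the earlier multiple imputation sketch, assumes a flag column 'covid_related_missing' distinguishing pandemic-related from other missingness, and treats the delta grid as a placeholder. the sign of the delta depends on whether higher outcome values are favourable.

```python
# a minimal sketch: shift only pandemic-unrelated missing values by delta,
# keep mar imputation for pandemic-related missingness, and pool via rubin's rules.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# impute_once() is the mar imputation helper defined in the earlier mi sketch.


def tipping_point_scan(df, deltas, n_imputations=20, seed=2020):
    rng = np.random.default_rng(seed)
    results = []
    for delta in deltas:
        estimates, variances = [], []
        for _ in range(n_imputations):
            comp = impute_once(df, rng)
            shift_rows = df["y"].isna() & ~df["covid_related_missing"]
            comp.loc[shift_rows, "y"] += delta      # penalize only non-pandemic missingness
            X = sm.add_constant(comp[["treatment", "baseline"]])
            res = sm.OLS(comp["y"], X).fit()
            estimates.append(res.params["treatment"])
            variances.append(res.bse["treatment"] ** 2)
        q_bar, u_bar, b = np.mean(estimates), np.mean(variances), np.var(estimates, ddof=1)
        se = np.sqrt(u_bar + (1 + 1 / n_imputations) * b)
        p_value = 2 * stats.norm.sf(abs(q_bar / se))  # large-sample approximation
        results.append({"delta": delta, "estimate": q_bar, "p_value": p_value})
    return pd.DataFrame(results)

# example: scan increasingly unfavourable shifts to find where significance is lost
# scan = tipping_point_scan(df, deltas=np.arange(0.0, -10.5, -0.5))
```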
the data on pandemic-relatedness collected in this way are incorporated into definitions of pandemic-related ices and missing data, and into the strategies for handling those in the analysis. statistical analyses of trial data will then be properly adjusted for pandemic effects. in many ways this is the ideal approach as it incorporates what is known for each participant directly into the analysis, and in a way that is very well understood. this is a standard approach to adjusting statistical analysis for inevitable perturbations in clinical trials. this approach has the disadvantage of needing to collect detailed data on pandemic-relatedness, which may not be feasible in some circumstances. another approach involves the use of pandemic time periods defined external to the trial database (e.g., to define pre-/during/post-pandemic phases as described in ema/chmp b). this approach requires the accurate and precise definition of pandemic periods. this is simpler to apply in single-country studies where the impact of the pandemic and local containment measures may be relatively homogeneous across participants. however, even for a single country, the pandemic may evolve in a gradual fashion, complicating the definition of pandemic start and stop dates, and the impact of the pandemic on study participants will likely not be homogeneous. moreover, there may be several waves of infection outbreaks. the definition could prove even more challenging to implement in global trials because the start and stop dates of these periods and the impact of the pandemic on study participants may well vary by region. in practice, a standardized and pragmatic definition based on regionally reported numbers of covid- cases and deaths over time and/or start and stop dates of local containment measures will likely be required. once pandemic periods have been defined, time-varying indicator variables for visits occurring during different pandemic periods could be incorporated as appropriate in statistical models or in ice definitions, as discussed previously. when this approach is used, the rationale for defining the pandemic start and stop dates should be documented. the situation is evolving rapidly and, at this point, it is not possible to provide a definitive recommendation on the definition, implementation, and interpretation of these pandemic periods. in a third approach to generally assessing pandemic effects, each participant in the trial could be categorized according to the extent of pandemic impact on their treatment and assessments collected in the study database (details of protocol deviations, ices, missing assessments, pandemic-related reasons for discontinuation, etc.). for trials with a fixed follow-up duration and minimal loss to follow-up, the categorization could be integrated into standard analyses, for example in defining subgroups for standard subgroup analysis. there is insufficient evidence currently to favour a single approach to this issue. sponsors are preparing to implement at least the first two approaches, as these have been the subject of regulatory guidance. until it becomes clear how these approaches perform and how the pandemic evolves, it is sensible to consider multiple methods of summarizing pandemic effects. standard safety summaries will include all aes as usual. however, additional separate analyses may be needed for events associated with covid- infections and unassociated events, respectively, to fully understand the safety profile.
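as a concrete illustration of the second approach above (externally defined pandemic periods), the sketch below derives a visit-level pandemic-period flag from region-specific period boundaries; such a flag could then enter longitudinal or time-to-event models as a time-varying indicator. all dates, region names and column names are placeholders, not actual pandemic dates.

```python
# a minimal sketch: derive a pre/during/post pandemic-period flag per visit
# from region-specific period definitions supplied externally.
import pandas as pd

# hypothetical region-level boundaries, e.g., from reported case counts or
# dates of local containment measures (placeholder values)
periods = pd.DataFrame(
    {
        "region": ["region_a", "region_b"],
        "pandemic_start": pd.to_datetime(["2020-03-01", "2020-03-15"]),
        "pandemic_end": pd.to_datetime(["2020-06-01", "2020-06-20"]),
    }
)


def flag_pandemic_visits(visits: pd.DataFrame, periods: pd.DataFrame) -> pd.DataFrame:
    """visits: columns ['subject_id', 'region', 'visit', 'visit_date'];
    adds 'pandemic_period' in {'pre', 'during', 'post'}."""
    df = visits.merge(periods, on="region", how="left")
    df["pandemic_period"] = "pre"
    during = (df["visit_date"] >= df["pandemic_start"]) & (df["visit_date"] <= df["pandemic_end"])
    df.loc[during, "pandemic_period"] = "during"
    df.loc[df["visit_date"] > df["pandemic_end"], "pandemic_period"] = "post"
    return df.drop(columns=["pandemic_start", "pandemic_end"])
```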
the determination of whether aes and, particularly, deaths are covid-19 related should be made during trial monitoring before data unblinding to avoid bias. in many situations, safety reporting will remain unchanged. however, disruptions due to the covid-19 pandemic may lead to increases in treatment interruptions, discontinuations and study withdrawals as well as the occurrence of covid-19 infections and deaths. hence, the estimands framework outlined in section could also be useful for safety analyses, and we refer to (unkel et al, ) for a general discussion of estimands as well as time-to-event and competing risks analyses for safety. trials that require physical visits to adequately assess safety of the intervention will need to have maintained a schedule of physical visits to satisfy the requirement. generally addressing the potential for bias in the collection of ae data is beyond the scope of this paper. regarding exposure, compliance-rate- or follow-up-adjusted analyses could be done, e.g., comparing the adjusted rates before and during the pandemic or with historical data. we do not have other methodological recommendations at this time, and more research is needed. the cumulative impact of missing data and revised statistical models discussed in the previous sections contributes to an overall study-level impact. the cumulative effect could alter the likelihood of meeting trial objectives, or even the interpretability of the trial results. sponsors should assess the potential impact of missing data on several aspects, and it may be important to reach agreement with regulatory agencies on some of these questions:
• feasibility of planned estimation methods given the data expected to be available;
• potential for bias in treatment effect estimation if there are important differences in missingness patterns across treatment arms;
• study power for the primary and key secondary objectives;
• interpretability of study findings;
• adequacy of the safety database due to potential reduction in total drug exposure time and potential for underreporting of aes;
• adequacy of regional evidence required for regulatory submissions.
as discussed in the previous sections, covid-19-related factors impact trial data in many ways with consequences for the power of the study, probability of success, sample size or other aspects of the trial design. quantifying the potential effects of the various pandemic factors on trial results can be done through clinical trial simulations. the simulation models will depend on the factors used in the original trial design, and incorporate adjustments to estimands, missing data handling and analysis methods as discussed in sections and . to maintain trial integrity, the simulations should be informed only by blinded data from the study and the assumed values for the design parameters from external sources. variability and treatment effect estimates may be modified from their original values used in trial design. trial properties such as power and probability of success can be updated accordingly. sample size adjustment can be considered based on the simulation outcomes, or more extensive modifications of the trial design may also be considered, such as a change in the primary endpoint, analysis method, or the addition of an interim analysis with associated adaptation (posch and proschan ; kieser and friede ; muller and schaefer ) . such changes are challenging, and should be discussed with regulatory agencies, but can be considered if trial integrity is maintained.
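a minimal sketch of such a clinical trial simulation is shown below; the assumed effect size, variance, per-arm sample size, and the pandemic-related dropout scenarios are external design assumptions rather than unblinded trial data, in keeping with the blinded nature of the assessment described above.

```python
# illustrative clinical trial simulation: re-estimate power when pandemic-related
# dropout reduces the number of evaluable participants. effect size, variance,
# sample size, and dropout rates are design assumptions, not trial data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def simulate_power(n_per_arm, effect, sd, dropout, n_sim=2000, alpha=0.05):
    hits = 0
    for _ in range(n_sim):
        keep_c = rng.random(n_per_arm) > dropout        # completers, control arm
        keep_t = rng.random(n_per_arm) > dropout        # completers, treated arm
        yc = rng.normal(0.0, sd, keep_c.sum())
        yt = rng.normal(effect, sd, keep_t.sum())
        _, p = stats.ttest_ind(yt, yc)
        hits += p < alpha
    return hits / n_sim

for dropout in (0.05, 0.15, 0.30):                      # pre- vs during-pandemic scenarios
    power = simulate_power(n_per_arm=150, effect=0.3, sd=1.0, dropout=dropout)
    print(f"dropout={dropout:.0%}  estimated power={power:.2f}")
```

the resulting power estimates can then inform whether a sample size adjustment or a more extensive design modification should be discussed with regulatory agencies.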
for some trials, it may not be feasible to increase the sample size, and the trial will fall short of its enrollment target. given the extraordinary circumstances, we advocate more flexibility to consider methods for quantifying evidence across multiple trials and sources, including the use of historical control arm data and real-world data, although sources and methodology for selection of such data would need to be planned and agreed with regulatory agencies in advance. if the observed treatment effect after data unblinding is meaningful but does not meet the statistical criterion due to covid-19 effects, the sponsor can evaluate whether the study results will be acceptable for registration on the basis of the accumulated evidence from the program; alternatively, whether the trial results could be used to define the inferential prior for a smaller follow-up trial (viele et al ) . for a trial with a dmc, the sponsor should ensure that the dmc is well informed of all measures taken to protect participant safety and to address operational issues. known or potential shortcomings of the data should be communicated. the timing of the regular preplanned safety interim analyses may need to be re-assessed. in addition, revised or additional data presentations may be needed. in some circumstances of interim analysis discussed in this section, a dmc may need to be established if not already in existence. there could also be circumstances related to participant safety where there may be a need to urgently review unblinded data, and establishing an internal dmc that is appropriately firewalled from the rest of the study team is recommended (e.g., studies without an existing dmc where it could take many months to organize an external dmc). efficacy interim analyses should be conducted as planned at the information level described in the protocol (e.g., the number of participants with the primary endpoint or a specific information fraction), which may cause a delay in the expected timeline. intermediate unplanned efficacy or futility interim analyses are generally discouraged unless there are safety and ethical considerations. however, if it is not feasible to reach the planned information level, altering the plan for the interim analysis would need to be considered, for example with timing based on calendar time. in cases with a strong scientific rationale for an unplanned interim analysis, the dmc should be informed and consulted on the time and logistics of the analysis. if an estimand, planned analysis methods, and/or decision rules have been changed to address pandemic-related disruptions and missing data, these changes should be communicated to the dmc and documented in the dmc charter. we do not generally advocate utilizing a dmc for the operational risk assessment and mitigation process, to prevent influence of unblinded data on trial conduct decisions (ema/chmp , fda ). if the sponsor decides to involve the dmc, details should be clearly defined and documented in the dmc charter, including additional responsibilities of dmc members and measures to prevent the introduction of bias. as we have discussed, the covid-19 outbreak continues to have a major impact on planned and ongoing clinical trials. the effects on trial data have multiple implications. in many cases these may go beyond the individual clinical trial and will need to be considered when such results are included with other trial results, such as an integrated summary of efficacy or safety.
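to make the information-level discussion concrete, the sketch below computes the alpha spent by a lan-demets o'brien-fleming-type spending function when the looks are re-timed; the two-sided 0.05 alpha and the hypothetical information fractions are assumptions, and only the first-look boundary is derived here (later boundaries require the joint distribution of the test statistics, which standard group-sequential software would handle).

```python
# minimal sketch: cumulative and incremental alpha spent under a Lan-DeMets
# O'Brien-Fleming-type spending function when interim looks are re-timed
# (e.g., driven by calendar time). information fractions are hypothetical.
from scipy.stats import norm

ALPHA = 0.05                       # two-sided overall type I error

def obf_spending(t):
    """cumulative alpha spent at information fraction t (0 < t <= 1)."""
    return 2.0 * (1.0 - norm.cdf(norm.ppf(1.0 - ALPHA / 2.0) / (t ** 0.5)))

fractions = [0.35, 0.70, 1.00]     # hypothetical re-timed looks
spent_so_far = 0.0
for k, t in enumerate(fractions, start=1):
    cum = obf_spending(t)
    inc = cum - spent_so_far
    spent_so_far = cum
    print(f"look {k}: information fraction {t:.2f}, "
          f"cumulative alpha {cum:.4f}, incremental alpha {inc:.4f}")

# the first-look two-sided boundary follows directly from the alpha spent there
z1 = norm.isf(obf_spending(fractions[0]) / 2.0)
print(f"first-look critical value |z| = {z1:.3f}")
```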
our goal was to describe the nature of the statistical issues arising from the potential impact of covid-19 on ongoing clinical trials and to make general recommendations for solutions to address the issues. the following are the most important findings and recommendations:
• risk assessment, mitigation measures, and all changes to study conduct, data collection, and analysis must be documented in statistical analysis plans and clinical study reports as appropriate. some changes may necessitate protocol amendments and consultation with regulatory agencies (fda , ema a, b).
• implications of the operational mitigations for the statistical analysis of the trial data should be considered before implementing those mitigations, especially for key efficacy and safety endpoints. validity and exchangeability of alternate methods of data collection require careful consideration.
• the estimand framework, comprising five key attributes, provides a pathway for assessing the impact of the pandemic on key study objectives in a systematic and structured manner and may be useful regardless of whether estimands are formally defined in the study protocol.
• as much as possible, we recommend that the original objectives of the trial be maintained; but some impact to planned estimands may be unavoidable. pandemic-related intercurrent events will likely need to be defined to properly and rigorously account for unexpected pandemic effects.
• planned efficacy and safety analyses should be reviewed carefully for changes needed to ensure that the estimators and missing data strategies align with updated estimands. additional sensitivity and supportive analyses will be needed to fully describe the impact of the pandemic-related disruptions.
• sponsors should make every effort to minimize missing data without compromising the safety of participants and study personnel and without placing undue burden on the healthcare system. priority should be given to the assessments that determine the primary endpoint and important safety endpoints, followed by the key secondary efficacy endpoints.
• most data that are missing due to pandemic reasons may be argued to be mcar or mar, especially if missingness is due to structural reasons, but additional considerations may apply, especially for certain diseases and participant-specific missingness.
• sponsors should carry out a rigorous and systematic risk assessment concerning trial and data integrity and update it regularly. the ability of trials to meet their objectives should be assessed quantitatively, taking account of the impacts on trial estimands, missing data and missing data handling, and needed modifications to analysis methods.
references
european medicines agency committee for medicinal products for human use (ema/chmp): guidance to sponsors on how to manage clinical trials during the covid-19 pandemic
food and drug administration (fda): guidance for clinical trial sponsors: establishment and operation of clinical trial data monitoring committees
guidance on conduct of clinical trials of medical products during covid-19 public health emergency: guidance for industry, investigators, and institutional review boards
addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials
general principles for planning and design of multiregional clinical trials
survival analysis with interval-censored data: a practical approach
estimation of treatment effect under non-proportional hazards and conditionally independent censoring
analysis of longitudinal trials with protocol deviation: a framework for relevant, accessible assumptions, and inference via multiple imputation
comparison of imputation variance estimators
simple procedures for blinded sample size adjustment that do not affect the type i error rate
aligning estimators with estimands in clinical trials: putting the ich e9 (r1) guidelines into practice
estimands, estimators, and sensitivity analysis in clinical trials
on the propensity score weighting analysis with survival outcome: estimands, estimation, and inference
missing data in clinical studies
statistical analysis with missing data
a general statistical principle for changing a design any time during the course of a trial
robust inference in discrete hazard models for randomized clinical trials
the prevention and treatment of missing data in clinical trials
clinical trials with missing data: a guide for practitioners
unplanned adaptations before breaking the blind
missing data in clinical trials: from clinical assumptions to statistical analysis using pattern mixture models
multiple imputation for nonresponse in surveys
treatment effect quantification for time-to-event endpoints - estimands, analysis strategies, and beyond
bootstrap inference when using multiple imputation
on estimands and the analysis of adverse events in the presence of varying follow-up times within the benefit assessment of therapies
time-dependent bias was common in survival analyses published in leading clinical journals
use of historical control data for assessing treatment effects in clinical trials
group sequential and confirmatory adaptive designs in clinical trials
acknowledgments: we are grateful for the help of colleagues at each of our companies who have devoted much time to addressing these issues in their ongoing clinical trials. they have generously shared their ideas, and this manuscript has benefited from this broad input. we also thank the members of the "biopharmaceutical statistics leaders consortium" who brought the team of authors together and provided valuable input, and the associate editor and reviewers who provided extensive and helpful input within tight timelines.
key: cord- -pwmr m o authors: gupta, deepti; bhatt, smriti; gupta, maanak; tosun, ali saman title: future smart connected communities to fight covid-19 outbreak date: - - journal: nan doi: nan sha: doc_id: cord_uid: pwmr m o
internet of things (iot) has grown rapidly in the last decade and continues to develop in terms of dimension and complexity, offering a wide range of devices to support a diverse set of applications.
with ubiquitous internet, connected sensors and actuators, networking and communication technology, and artificial intelligence (ai), smart cyber-physical systems (cps) provide services rendering assistance to humans in their daily lives. however, the recent outbreak of the covid-19 (coronavirus) pandemic has exposed and highlighted the limitations of current technological deployments to curtail this disease. iot and smart connected technologies together with data-driven applications can play a crucial role not only in prevention, continuous monitoring, and mitigation of the disease, but also in enabling prompt enforcement of guidelines, rules and government orders to contain such future outbreaks. in this paper, we envision an iot-enabled ecosystem for intelligent monitoring, pro-active prevention and control, and mitigation of covid-19. we propose different architectures, applications and technology systems for various smart infrastructures including e-health, smart home, smart supply chain management, smart locality, and smart city, to develop future connected communities to manage and mitigate similar outbreaks. furthermore, we present research challenges together with future directions to enable and develop these smart communities and infrastructures to fight and prepare against such outbreaks. covid-19 is an infectious disease caused by a newly discovered coronavirus (sars-cov-2) and is rapidly spreading around the world. according to the world health organization (who), covid-19 has already affected countries and territories around the world and continues to spread rapidly across other regions. the highly contagious coronavirus outbreak was declared a "pandemic" on march , . who reports that the number of positive cases has dramatically surged, with nearly millions reported cases and , fatalities as of april , . the fatalities are assuredly the most tragic cost of this disease. in order to control the spread of the pandemic, lockdowns, quarantines and stay-home orders have been issued by several nations across the globe, which have crippled national and world economies with critical consequences for workers, employers and investors. in addition, restrictions on industries, businesses and travel restrain the supply of goods and services, and the economic disruptions will continue to have a long-term impact on global supply chains and the economy. in the united states, unemployment rates spiked to . %, the highest level since the great depression, with fear of a stronger second wave of the disease looming in winter. currently, with no cure or vaccine for this disease, the first line of defense to fight against this pandemic is preventive measures and mitigation strategies. the who, the u.s. centers for disease control and prevention (cdc), and several other federal organizations suggest personal protective equipment (ppe), social distancing, environmental surface cleaning, self-isolation, travel restrictions, local and national lockdowns, quarantine, limits on large gatherings, restrictions on opening businesses, and school closures as some of the preventive measures that are needed to limit the spread of the disease. however, these guidelines impose restrictions that hinder normal daily life. it has become a huge challenge to swiftly implement and enforce such measures on a large scale across cities, nations, and around the world.
we believe that to effectively enforce and monitor the preventive controls and mitigation strategies for covid-19, iot, together with its key enabling technologies including cloud computing and artificial intelligence (ai) and data-driven applications, can play an important role. there are several existing examples of the use of technology to control the spread of covid-19 and manage the large gatherings of already infected patients and possibly infected cases. the u.s. cdc has introduced a self-checker application enabled with a cloud platform, which helps a patient decide on an appropriate healthcare service through questionnaires. however, many infected people do not have any symptoms; they are known as silent spreaders or asymptomatic carriers. over the last few years, there has been a huge surge in the number of iot devices, and different types of smart sensors have been introduced. with new technological advancements, this trend is expected to continue and grow in the future. the iot market is currently valued at $ billion per year and is expected to reach $ billion by (https://www.forbes.com/sites/louiscolumbus/ / / /iot-market-predicted-to-double-by- -reaching- b/# f f). another recent article predicted more than billion devices will be internet-connected by (https://www.cisco.com/c/dam/en/us/products/collateral/se/internet-of-things/at-a-glance-c - .pdf). in today's connected world, not having network capability in a device limits the market potential for that device. as a result, there are a large number and a wide variety of network-connected iot devices providing convenience and ease of life to humans. smart devices have the potential to be a major breakthrough in efforts to control and fight against the current pandemic situation. iot is an emerging field of research; however, the ubiquitous availability of smart technologies, as well as the increased risk of infectious disease spread through the globalization and interconnection of the world, necessitates its use for predicting, preventing, and mitigating diseases like covid-19. iot includes a large number of novel consumer devices including hdmi sticks, ip cameras, smartwatches, connected light bulbs, smart thermostats, health and fitness trackers, smart locks, connected sprinkler systems, garage connectivity kits, window and door sensors, smart light switches, home security systems, smart ovens, smart baby monitors, and blood pressure monitors. however, these iot devices are mostly used in a distributed manner based on users and their requirements. iot technology, including smart sensors, actuators, and devices and data-driven applications, can enable smart connected communities to strengthen the health and economic posture of nations to fight against the current covid-19 situation and other future pandemics efficiently. in this paper, our main goal is to present a holistic vision of iot-enabled smart communities utilizing various iot devices, applications, and relevant technologies (e.g., ai, machine learning (ml), etc.). here, we propose a vision of a smart connected ecosystem, as shown in figure , with real-world scenarios in various application domains with a focus on detecting, preventing and mitigating the covid-19 outbreak. the major contributions of this paper are outlined below.
• we outline some of the covid-19 symptoms, preventive measures, mitigation strategies, and current problems and challenges, and present an overview of an adaptable multi-layered iot architecture, depicting interactions between layers focusing on smart communities for covid-19 requirements. • we design a smart connected ecosystem by developing multiple conceptual iot application frameworks including e-health, smart home, smart supply chain management, smart locality, and smart city. we introduce use cases and application scenarios for early covid-19 detection, prevention, and mitigation. • we identify and highlight challenges to implementing this smart ecosystem, including security and privacy, performance efficiency, interoperability and iot federation, implementation challenges, policy and guidelines, and machine learning and big data analytics. finally, we discuss interdisciplinary research directions to enable and empower future smart connected communities. the remainder of this paper is organized as follows. section discusses the essential characteristics to diagnose, prevent and mitigate covid-19 disease. section presents the multi-layered architecture for iot, whereas section discusses smart connected ecosystem scenarios in various iot application domains. section highlights open research challenges and future directions. finally, section draws conclusions for this research paper.
essential characteristics to diagnose, prevent and mitigate
coronavirus is transmitted mainly by the infected person's saliva and nasal drips, which spread during coughing and sneezing and infect anybody in close contact. another source of infection is contaminated surfaces in surroundings and high-risk areas, such as door handles, railings, elevators, and public restrooms. covid-19 is a highly contagious virus with an incubation period stretching from days to weeks after exposure. symptoms of covid-19 range from mild symptoms, including fever, coughing and shortness of breath, to more severe illness, as summarized below.
symptoms | preventive measures | mitigation strategies | challenges
shortness of breath or not being able to breathe deeply enough to fill the lungs with air, chills | avoid face-to-face meetings, practice social distancing from other people outside of the home | monitor symptoms regularly, wear a cloth covering or n-95 mask over nose and mouth | knowledge gaps in understanding virus transmission, no specific antiviral treatment, and no vaccine available
loss of the sense of smell, most likely by the third day of infection; some patients also experience a loss of the sense of taste | cover mouth and nose with a cloth or wear a mask when around others, wear gloves and discard them properly | follow manufacturers' instructions for use of all cleaning and disinfection products, follow the workplace protocol, and provide ppe to employees | lack of testing and essential resources such as ventilators, masks, beds, and health staff; cancellation of elective surgery
diarrhea and nausea a few days prior to fever; the cdc says a sudden confusion or an inability to wake up and be alert may be a serious sign | cover coughs or sneezes with a tissue, then throw the tissue in the trash | hospital task forces to increase the number of tests, make ppe available for staff members, and increase intensive care capacity | coping with anxiety disorder, depression issues, and mental health problems
the food and drug administration (fda) provided emergency use authorization for the hydroxychloroquine medicine to treat people suffering from this virus in hospitals.
later, on april , the fda warned against the use of hydroxychloroquine to treat this disease outside of the hospital setting or a clinical trial due to the risk of heart rhythm problems (https://www.fda.gov/media/ /download ; https://www.fda.gov/drugs/drug-safety-and-availability/fda-cautions-against-use-hydroxychloroquine-or-chloroquine-covid- -outside-hospital-setting-or). people can protect themselves by following some protective measures and help to slow the spread using mitigation strategies. the table provides a comprehensive overview of the symptoms, preventive measures, mitigation strategies and some challenges of fighting covid-19 disease. one of the easiest preventive measures is to wash your hands frequently and thoroughly with soap and water for at least seconds, or use hand sanitizer or an alcohol-based hand rub when soap and water are not available. people should keep social distancing (six feet of distance) from others, especially from people who are coughing or sneezing. it is suggested to wear a mask and gloves in outdoor locations, and to avoid touching the face and surfaces such as the button at a traffic light, a keypad to add a tip for a restaurant take-out order, elevator buttons, etc. many surfaces are touched by hands accidentally, and the virus can potentially be picked up and then transmitted to other surfaces and locations. once the hands are contaminated, the virus can be transferred through the eyes, nose or mouth and thus enter the human body. respiratory hygiene is another protective measure. there is a need to avoid coughing or sneezing into the hands, and to cover the mouth and nose with a tissue or bent elbow when coughing or sneezing and throw away the tissue immediately. groceries and packages can be contaminated with coronavirus, and it is recommended to wash grocery items carefully and wipe packages using disinfectant spray. local public health administrations regularly issue health guidelines, which people should follow. the who, governments, and healthcare workers are all urging people to stay home if they can. beyond basic illness prevention, experts note that the best (and only real) defense against disease is a strong immune system. in addition to physical health, taking care of mental health is also necessary. high stress levels can take a toll on the immune system, which is the opposite of what people want in this situation. in addition, mitigation strategies are a set of actions applied by people and communities (hospitals, grocery stores, and cities) to help slow the spread of respiratory virus infections. these mechanisms can be scaled up or down depending on the evolving local situation. at the individual level, if a person is infected with coronavirus, then he/she should self-isolate and follow the quarantine guidelines provided by the hospital. hospitals must support the healthcare workforce, increase testing and intensive care capacity, and ensure the availability of personal protective equipment. city governments can appoint task forces, open shelters for homeless people, and maintain the availability of resources to implement preventive and mitigation strategies for this disease. while each community is unique, appropriate mitigation strategies vary based on the level of community transmission, the characteristics of the community and its population, and the local capacity to implement strategies. nonetheless, it is crucial to understand the characteristics of this novel virus and spread awareness and up-to-date information across communities through appropriate technology.
consequently, it is essential to address the challenges with significant research and implementation of strategies, as shown in the table.
multi-layered architecture
in this section, we explain an integrated multi-layer iot architecture which can fundamentally change the infrastructure and underlying technologies for smart communities, including hospitals, grocery retail stores, transportation, and the city, as shown in figure . our proposed architecture extends and adapts existing iot and cps architectures [ ] [ ] [ ] [ ] [ ] [ ] , and focuses on the need for swift enforcement of policies, laws and public guidelines in order to curtail the wide spread of such diseases. the architecture integrates hybrid cloud and edge computing nodes together with iot and smart sensor devices to enable the real-time and data-driven services and applications needed in the covid-19 pandemic. overall, the architecture consists of six layers: object layer, edge layer, virtual object layer, cloud layer, network communication, and application layer. the object layer is a rich set of iot devices including sensors, actuators, embedded devices, roadside infrastructure, vehicles, etc. these physical objects are spread across and implemented in smart communities like hospitals, retail stores, homes, and parking lots. the edge layer provides the local real-time computation and analysis needed for smart, resource-constrained physical objects. this layer incorporates edge gateways and cloudlets [ ] , which enable local computation at this layer, overcoming limited bandwidth and latency requirements, and also impact the usability of the iot applications. this multi-layer architecture integrates the concept of virtual objects (vos) [ ] , which are the digital delineation of physical iot devices. vos show the current state of corresponding physical objects in the digital space when they are connected, and can also store a future state for these devices when they are offline. the cloud layer provides various services like remote storage, computation, big data analysis, and data-driven ai applications for the huge amount of information generated by billions of iot devices connected to the cloud. we define a computation layer which comprises the edge layer, virtual object layer, and cloud layer; computation, data analytics and processing services are performed in this layer. the network communication layer runs among the different layers to establish the interaction. it is responsible for connecting physical sensors, smart devices, edge compute nodes or cloudlets, and cloud services with different technologies, and is also used for transmitting and processing sensor data. the application layer delivers specific services to end users through different iot applications. in the multi-layered smart communities architecture, this layer integrates mobile phones, edge computing, cloud computing, ai-based analytics, and data-driven services. this architecture can incorporate the iot application frameworks within different domains as discussed in section , and different use case scenarios can be mapped and implemented using relevant technologies associated with each layer of the architecture. in this section, we discuss various iot use case scenarios to monitor, diagnose, detect, and mitigate covid-19.
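a minimal sketch of the virtual object idea described above is shown below; the class names, attributes, and the edge gateway registry are illustrative assumptions rather than the paper's concrete design.

```python
# minimal sketch of the object / virtual-object idea: a virtual object mirrors the
# last known state of its physical device and serves reads when the device is offline.
# class and attribute names are illustrative assumptions, not the paper's design.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PhysicalSensor:
    device_id: str
    online: bool = True

    def read(self) -> Optional[Dict[str, float]]:
        # placeholder reading; a real device would sample its hardware here
        return {"temperature_c": 37.9, "spo2": 95.0} if self.online else None


@dataclass
class VirtualObject:
    device_id: str
    last_state: Dict[str, float] = field(default_factory=dict)

    def sync(self, sensor: PhysicalSensor) -> None:
        reading = sensor.read()
        if reading is not None:          # only update when the device is reachable
            self.last_state = reading

    def current_view(self) -> Dict[str, float]:
        return self.last_state           # digital view, valid even while offline


class EdgeGateway:
    """keeps virtual objects in step with their physical counterparts."""
    def __init__(self) -> None:
        self.registry: Dict[str, VirtualObject] = {}

    def register(self, sensor: PhysicalSensor) -> VirtualObject:
        vo = self.registry.setdefault(sensor.device_id, VirtualObject(sensor.device_id))
        vo.sync(sensor)
        return vo


gateway = EdgeGateway()
bracelet = PhysicalSensor("bracelet-001")
vo = gateway.register(bracelet)
bracelet.online = False                  # device drops off the network
print(vo.current_view())                 # last known state is still available
```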
it is expected that the global internet of medical things (iomt) market will grow to $ . billion in . as of , there are million medical devices in use that are connected and monitor body parameters of the users to inform data-driven applications in making real-time healthcare decisions. improving the efficiency and quality of healthcare services in hospitals has been an important and critical challenge during the covid-19 pandemic. in the e-health use cases, we discuss three important scenarios which can help reduce the risk and spread of coronavirus infection. the e-health setup comprises a smart hospital, remote patient monitoring, and a smart testing booth, as shown in figure . these use cases involve connected smart sensors, connected devices, robots, patients, hospital practitioners, workers, etc., together with iot applications, edge devices and cloud services to offer data-driven services. other scenarios, including smart pharmacy, smart ambulance and smart parking in the hospital's parking area, are also briefly discussed. such scenarios can be extended with the currently proposed architectures described in section . various components of the smart hospital concept have been studied in the literature [ ] [ ] [ ] [ ] [ ] . however, it is still a challenge to track covid-19 patients' records and keep track of essential resources in hospitals during such a pandemic crisis. the hospitals are overflowing with patients and running out of hospital beds, ppe and other essential resources needed for treatment and prevention. to overcome these problems, we propose a smart hospital use case here, which extends the existing infrastructure to enable coordinated actions for coronavirus patients. within a smart hospital, rfid sensors can be used to track inventory items like masks, face shields, gauze, disposable patient examination papers, boxes of gloves, and plastic bottles and vials. these rfid tags could also be an ideal way to keep track of large equipment, such as smart beds and ventilators, within a smart hospital. towels, sheets, and blankets must be washed and disinfected regularly, and such items can also be tracked through rfid laundry tags. in wuhan, wuchang field hospital provided wearable smart bracelets and rings embedded with multiple sensors to each patient in the hospital; these iot devices are synced with a cloud ai platform so that patients' vital signs, such as body temperature, heart rate and blood oxygen level, can be monitored regularly by hospital practitioners. in addition, all hospital workers and staff members also wear these smart bracelets and rings so that any early symptoms of coronavirus infection can be noticed. iot devices generate a tremendous amount of data, and this data can be collected using edge servers deployed in the hospital facilities. these data sets can be used for training with the federated deep learning technique [ ] to enhance the intelligence of data-assisted applications, which can be used to predict coronavirus infections among hospital practitioners and staff members and also provide insights on coronavirus characteristics and infection trends for the future. hospital practitioners can check a patient's data through remote applications, which also helps reduce the number of visits to the patient's room to measure his/her vital parameters. in this way, hospital practitioners can not only collect more data in less time with minimal in-person contact but also reduce the risk of cross-infection from the patients. this technique can significantly help reduce the workload and increase the efficiency of hospital practitioners.
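a minimal sketch of the kind of edge-side rule that could flag wearable readings for a practitioner's attention is shown below; the thresholds and reading fields are illustrative assumptions, not clinical guidance and not part of the deployments described above.

```python
# minimal edge-side screening rule: flag wearable readings that may indicate early
# covid-19 symptoms (fever, low blood oxygen, elevated resting heart rate).
# thresholds and field names are illustrative assumptions, not clinical guidance.
from typing import Dict, List

FEVER_C = 38.0          # assumed fever threshold in degrees celsius
LOW_SPO2 = 93.0         # assumed low blood-oxygen threshold (%)
HIGH_RESTING_HR = 110   # assumed elevated resting heart rate (bpm)


def flag_reading(reading: Dict[str, float]) -> List[str]:
    """return a list of reasons this reading should be escalated to a practitioner."""
    reasons = []
    if reading.get("temperature_c", 0.0) >= FEVER_C:
        reasons.append("fever")
    if reading.get("spo2", 100.0) <= LOW_SPO2:
        reasons.append("low blood oxygen")
    if reading.get("resting_hr", 0.0) >= HIGH_RESTING_HR:
        reasons.append("elevated resting heart rate")
    return reasons


readings = [
    {"patient": "p-101", "temperature_c": 38.6, "spo2": 96.0, "resting_hr": 88},
    {"patient": "p-102", "temperature_c": 36.8, "spo2": 91.5, "resting_hr": 112},
]
for r in readings:
    reasons = flag_reading(r)
    if reasons:
        # in a deployment this would notify the practitioner's remote application
        print(f"alert for {r['patient']}: " + ", ".join(reasons))
```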
moreover, smart hospitals can utilize smart beds, which sense the presence of a covid-19 patient and automatically adjust the bed to a good angle if the patient is short of breath, providing proper support without the need for a nurse to intervene. a singapore-based medical device company invented a smart ventilator, which allows the inpatient monitoring process to be carried out through remote access via an online portal. these ventilators measure the amount of oxygen automatically, or monitor the rate of delivery to the patient, as high pressure used to force in more oxygen can also damage the patient's lungs. these smart ventilators can communicate with the patient's smart bracelet with embedded sensors and respond according to the patient's body parameters. besides, there are other iot devices, such as disinfectant robots, which can autonomously disinfect a patient room regularly and after the patient is discharged, or a specific hospital area post contamination as needed. in the smart pharmacy scenario, the system can detect if a patient is in the car; the pharmacist communicates with the patient through an application and provides the prescribed medicine. before attending to the next person, the pharmacist must take some time to sanitize the smart pickup box. in the above scenarios, data and information collected from smart devices are sent to edge gateways, services, or the cloud. due to high security and privacy concerns in the health domain, it is important to understand that these edge gateways and cloud-iot platforms will be owned only by authorized entities, such as hospitals or other highly trusted entities, through some private cloud. we elaborate these challenges in detail in section . various aspects of the smart home for health have been investigated in the literature [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . it is nearly impossible to keep everything in the real world virus-free; however, individuals must exercise due precautions. the novel coronavirus has people boarded up inside their homes due to lockdowns and stay-home orders. however, people can still go outside for essential services. it is critical to take precautions while outside and to be aware of possibly infected items and exposure to the virus that they may accidentally bring home. it is paramount to keep our smart homes disinfected and sanitized using different iot devices and technologies. during the pandemic, there is a requirement to improve supply chain management to adopt automatic business processes and also improve the inventory and delivery of essential items. in this subsection, we discuss two scenarios, smart inventory and smart retail stores, which illustrate how iot devices and technologies can enable an efficient supply chain and help in slowing the covid-19 spread using various prevention and detection mechanisms. the complete scenario of smart supply chain management is shown in figure . various aspects of smart inventory systems have been investigated in the literature [ ] [ ] [ ] [ ] [ ] [ ] [ ] . most of them involve iot devices where staff use handheld readers to scan the barcodes of goods and then write the storage information to the rfid tags to complete the inventory. smart inventory systems show how iot technology can be leveraged globally to plan and respond under the current pandemic situation. inventories are facing an unprecedented challenge in coping with the fallout from covid-19. however, a smart inventory system can provide a safe and secure environment for workers using iot technologies.
within the inventories, drones can be used to track all the employees to check their temperature using thermal sensors, and also to measure their social distancing practices. the inventory manager can also provide smart wearable devices connected to the centralized cloud to each employee to monitor and track them. if an employee or his/her family member is infected with coronavirus, the inventory manager can get a notification through data-driven iot applications. in addition, disinfectant spray can be attached to the shelves and start to spray when an associated sensor senses the sound of sneezing. california had the earliest stay-at-home order, issued on march . in california, there had been an early rise in truck activity since the week of march ; however, truck activity in california has since fallen . % from early february (https://talkbusiness.net/ / /groups-share-data-quantifying-covid- -impacts-on-trucking-industry/). autonomous delivery robots can also help in smart inventory and help to reduce cross-infection. iot sensors like thermal sensors, gps, and motion sensors can be attached to delivery trucks to maintain the temperature for perishable items and to track the location. this data can be stored on the inventory cloud and can help predict demand and supply for the next month. these tools will become the foundation on which supply chain managers gain insight into their markets and erratic supply and demand trends. the rfid antenna scans the number of units on the sales floor and alerts a store manager in case stock is low. iot allows store managers to automate product orders, is capable of notifying when a certain product needs to be re-ordered, gathers data regarding the popularity of a certain item, and prevents employee theft. the retail industry is seeing a rapid transformation, with iot solutions taking center stage in the sector. smart grocery stores have been widely investigated in the literature [ ] [ ] [ ] [ ] [ ] [ ] . iot, along with ai and ml technologies, can help slow the spread of infection by enforcing prevention and detection mechanisms through connected sensor devices in a smart grocery store. due to the stay-home order, people are panicking and stocking up on grocery items. they need to stand in queues for hours outside the store to buy groceries. by employing iot sensors around the store and wearable iot devices, a store manager can get a better understanding of how to slow the spread in the store. the chinese tech firm kuang-chi technologies has developed a smart helmet that is used to identify and target those people who are at high risk for virus transmission in the retail store (https://www.yicaiglobal.com/news/chinese-tech-firm-debuts-five-meter-fever-finding-smart-helmet). the customer will wear an rfid-tag-based smart bracelet at the store. at the entrance, each customer's smart bracelet will be scanned by a scanner, which will show his/her body parameters. if the record shows any symptoms of covid-19, an alert can be sent to the store manager and the individual can be restricted from entering the store. similarly, there can be thermal cameras and microphone sensors installed at the store which can detect people who are coughing in the store during shopping and take pictures. these areas will be disinfected, and identified individuals may be reported for testing based on their other symptoms and will be categorized as risky customers. this information can be maintained in the store for a short period of time to assist in identifying these individuals during their future visits to the store.
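the smart-shelf restocking behaviour described above can be sketched with a simple threshold rule; the product names, thresholds, and notification function below are illustrative assumptions.

```python
# minimal sketch of a smart-shelf restock alert: compare RFID-counted units on the
# sales floor against a reorder threshold and notify the store manager.
# product names, thresholds, and the notify function are illustrative assumptions.
from typing import Dict

REORDER_THRESHOLDS: Dict[str, int] = {"hand sanitizer": 20, "toilet paper": 30, "pasta": 25}


def notify_manager(message: str) -> None:
    # stand-in for a push notification / dashboard update in a real deployment
    print("MANAGER ALERT:", message)


def check_shelves(rfid_counts: Dict[str, int]) -> None:
    """rfid_counts maps product name -> units currently detected on the shelf."""
    for product, threshold in REORDER_THRESHOLDS.items():
        count = rfid_counts.get(product, 0)
        if count < threshold:
            notify_manager(f"{product}: {count} units left (reorder below {threshold})")


check_shelves({"hand sanitizer": 7, "toilet paper": 45, "pasta": 12})
```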
from a customer's perspective, the user can enable alerts on his smartphone regarding his grocery list, can see the map of the store and crowded aisles, and can plan accordingly to maintain social distance while shopping. the customer can visit desired aisles and get items from the smart shelves, which will allow picking only a limited number of items per family size, and then put the items in the smart cart. smart shelves will have three common elements: an rfid tag, an rfid reader, and an antenna. data collected by smart shelves during the day will be analyzed, and shopper buying trends, patterns, shopper traffic, etc. will be shared with a store manager to provide customer-related insights to efficiently manage the store inventory and restock goods. most retail stores now allow only ten people per shopping interval slot to avoid large gatherings inside the store. social distancing can be measured and enforced by an autonomous retail worker (robot) together with smart cameras and smart microphone sensors and speakers to alert the customers, using ai technologies. if two customers come into the same aisle and violate the physical distancing norm, an autonomous retail worker will go there to warn them, or a loudspeaker attached to the smart camera will announce a reminder to maintain social distancing. the autonomous retail worker can roam around the store and take note of misplaced items or products running out of stock (smart shelves also keep track of items and can send alerts for restocking as needed). at&t, together with xenex and brain corporation, has already launched iot robots to help grocery stores in keeping them clean, killing germs and maintaining well-stocked shelves more efficiently. the uvd robot also uses ultraviolet light to kill infectious viruses and sanitize surfaces. in smart pickup, sensors and other ai-based techniques are used to determine whether an order is ready for pickup and whether a person has arrived to pick up the order. for instance, a parking space or driveway at the store might give priority to covid-19 patients or elderly people to avoid waiting time. smart pickup can automatically serve the most vulnerable and infected people first and enforce these rules by informing the vehicles. a restaurant takeout service can follow the same protocol for smart pickup. an intelligent transportation system (its) [ ] can support the delivery of resources to essential services, and delivery robots can enhance contact-less delivery, which reduces the spread of the virus (https://spectrum.ieee.org/news-from-around-ieee/the-institute/ieee-member-news/thermal-cameras-are-beingoutfitted-to-detect-fever-and-conduct-contact-tracing-for-covid ; https://www.fiercewireless.com/iot/at-t- g-lte-connects-iot-robots-to-kill-germs-keep-shelves-stocked). gupta et al. [ ] have also elaborated how its and smart city infrastructures can be used to enable and enforce social distancing community measures in the covid-19 outbreak.
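a minimal sketch of the in-store distancing check is given below; the customer positions, the 2 m threshold, and the alert mechanism are illustrative assumptions (e.g., positions could come from a person detector on the smart cameras).

```python
# minimal sketch of camera-based social-distancing checks in a store aisle: flag any
# pair of detected customers closer than a distance threshold.
from itertools import combinations
from math import dist

MIN_DISTANCE_M = 2.0    # assumed physical-distancing threshold

# hypothetical (x, y) floor positions in metres, e.g. output of a person detector
customers = {"c1": (1.0, 2.0), "c2": (2.2, 2.5), "c3": (8.0, 1.0)}

violations = [
    (a, b, dist(pa, pb))
    for (a, pa), (b, pb) in combinations(customers.items(), 2)
    if dist(pa, pb) < MIN_DISTANCE_M
]
for a, b, d in violations:
    # in a deployment this could dispatch the autonomous retail worker or trigger a speaker announcement
    print(f"distancing alert: {a} and {b} are {d:.2f} m apart")
```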
every individual who lives in a smart neighborhood will receive notifications regarding allotted times for outside activities, such as riding a bike or walking on the trail, in order to maintain social distancing while being outside in the locality's common areas. in a smart locality, motion sensors and cameras will sense and count the number of people and send the data to the locality cloud- , as shown in figure . a cloud smart analytics service (e.g., https://cloud.google.com/solutions/smart-analytics/) can analyze the locality data, send notifications to people regarding the number of positive cases, and categorize risk zones with different colors (e.g., red - high risk, yellow - medium risk, and green - low risk) in the smart neighborhood. when a person goes for a walk in the smart neighborhood, he/she will receive an alert if any infected person or pet is around, and will also be alerted to avoid high-risk (red) zones in the smart neighborhood. disinfectant sprinklers can be installed that spray possibly infected areas, such as pedestrian paths and common areas, when a sensor senses the presence of an infected person in the area through notifications from the locality cloud- . as we have discussed in previous scenarios, the body parameters of elderly people can be taken through an attached body bracelet and sent to edge devices and gateways (figure ). iot devices and applications connected through the locality cloud- can share information about covid-19 patients with smart hospitals. the nursing home staff can monitor patients' body parameters regularly and will also track other elderly people in the smart nursing home. in a smart gym, multiple sensors, devices, and autonomous devices can be connected through a gateway; the gym manager/coach can access the information of each member at different access levels and can communicate with other localities through the locality cloud- . a gym member will receive a notification regarding when to come to the gym; the gym is required to maintain limited occupancy and a time interval to sanitize all gym equipment and surfaces. to enable multi-cloud secure data and information sharing and communications between locality clouds, there needs to be a decentralized trust framework in place using advanced technologies like blockchain and trusted distributed computing. smart and connected city infrastructures have been investigated in the literature [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . daegu has set up a novel system using a large amount of data gathered from various sensors and devices, such as surveillance camera footage and credit card transactions of confirmed coronavirus patients, to recreate their movements. the newcastle university urban observatory developed a way of tracking pedestrians, car parks, and traffic movement to understand how social distancing is being followed in tyne and wear. however, other major cities need to prepare themselves for future coronavirus outbreak waves. countries have used cell phone data to track citizens' movements during the pandemic, showing geographic data on hot spots and risk zones where people are more likely to get infected. smart and connected vehicles have been extensively investigated in the literature [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . in order to keep patients and healthcare providers safe, drive-thru coronavirus testing sites have been popping up in the city. an autonomous testing vehicle can be used for covid-19 testing in urban and rural areas.
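before turning to the testing-vehicle details, the risk-zone categorization mentioned above can be sketched as a simple thresholding rule; the zone names and case-rate thresholds are illustrative assumptions, not values from the paper.

```python
# minimal sketch of locality risk-zone categorization from case counts: map each zone
# to red / yellow / green. zone names and thresholds are illustrative assumptions.
from typing import Dict

def categorize(active_cases_per_1000: float) -> str:
    if active_cases_per_1000 >= 5.0:
        return "red (high risk)"
    if active_cases_per_1000 >= 1.0:
        return "yellow (medium risk)"
    return "green (low risk)"

zones: Dict[str, float] = {"park trail": 0.4, "community center": 2.3, "market street": 6.1}
for zone, rate in zones.items():
    # in a deployment the result would drive resident notifications and cleaning schedules
    print(f"{zone}: {categorize(rate)}")
```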
the smart testing vehicle can include infrared body temperature and oxygen level sensors, a smart test kit, a camera, a microphone, and local edge services. it can help reduce the exposure of elderly people with preexisting conditions. in these vehicles, a person can enter from one side of the glass-walled area in the car. built-in sensors can record the person's body temperature and oxygen level and store them on the city cloud- . it can also provide a test kit to individuals, who can test themselves and return it through the car window. in rural areas, autonomous testing vehicles are largely applicable and can help in testing people, as well as informing and making people aware of the covid-19 pandemic, its symptoms and preventive measures. to flatten the curve of confirmed cases, a smart city can provide mass quarantine for coronavirus patients who have mild symptoms but carry a higher risk of cross-infecting others. a smart hall, large stadium, or other facility can be set up for quarantine with installed sensors, smart devices, and robots, connected to the city cloud- (as shown in figure ). the disinfectant robot is an autonomous robot that can sterilize floors in these large areas, as discussed in other scenarios. the large-scale disinfectant robot can also be used to clean the roads of the city. autonomous and self-driving vehicles can be used for delivering post, which will also help reduce human contact and cut down the number of covid-19 cases. the smart city will also provide immunity-based rfid tags to people who recover from the disease and allow the tag holders to return to work with extra precautions. in the future, once covid-19 vaccines are available, vaccinated individuals can get similar immunity-based rfid tags to prove their immunity. the development of the proposed smart connected ecosystem requires addressing several challenges and needs interdisciplinary research from an integrated perspective involving different domains and stakeholders. in this section, we discuss these challenges in detail with examples from each of the proposed scenarios, as illustrated in figure . one of the major challenges in the deployment of the smart infrastructure is the security and privacy concerns pertaining to iot and cps users, smart devices, data, and applications in different application domains like healthcare, smart home, supply chain management, transportation, and smart city. in the healthcare industry, it is still a challenge to secure connected medical devices and ensure user privacy. in the e-health scenario, for instance, a user visits a smart testing booth for covid-19 testing, and his/her data is transmitted and stored on the smart hospital's private cloud. hospitals then share this data with state healthcare staff or the city government for tracking and monitoring the user's activities. to secure the identity of the user and ensure privacy, differential privacy [ ] and data masking techniques [ ] , such as pseudonymization [ ] and anonymization [ ] , can be used. however, there are limitations intrinsic to these solutions. with pseudonymization, data can be traced back to its original state, with a high risk of compromising user privacy, whereas with anonymization it becomes impossible to return data to its original state. it is critical to ensure user privacy while deploying iot and data-driven applications for their wide adoption in preventing, monitoring, and mitigating covid-19.
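a minimal sketch of two of the privacy techniques mentioned above is shown below: salted-hash pseudonymization of identifiers before sharing, and laplace-mechanism differential privacy for a shared case count. the secret salt, epsilon, and data values are illustrative assumptions.

```python
# minimal sketch: pseudonymization of patient identifiers and a Laplace-mechanism
# differentially private release of a case count. salt, epsilon, and values are illustrative.
import hashlib
import hmac
import numpy as np

SECRET_SALT = b"replace-with-a-securely-stored-secret"   # assumption: held only by the data owner

def pseudonymize(patient_id: str) -> str:
    """keyed hash: the same patient always maps to the same pseudonym, but the mapping
    cannot be reversed without the secret salt."""
    return hmac.new(SECRET_SALT, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

def dp_count(true_count: int, epsilon: float = 0.5, sensitivity: float = 1.0) -> float:
    """laplace mechanism: add noise scaled to sensitivity/epsilon before releasing a count."""
    rng = np.random.default_rng()
    return true_count + rng.laplace(0.0, sensitivity / epsilon)

print(pseudonymize("patient-12345"))
print(f"released case count: {dp_count(42):.1f}")
```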
secure authentication mechanisms, including access control and communication control models, are necessary for cloud-enabled iot platforms to defend against unauthorized access and to secure data, both at rest and in motion. several iot access control models have been developed in the literature [ ] , with cloud-assisted iot access control models for aws [ ] , google [ ] , and azure [ ] . traditional access control models are not adequate in addressing dynamic and evolving access control requirements in iot. attribute-based access control (abac) [ , ] offers a flexible and dynamic access control model, which fits better into distributed iot environments, such as smart homes [ ] , connected vehicles [ , ] , and wearable iot devices [ , ] . in addition to access control, communications in terms of data flows between various components in a cloud-enabled iot platform need to be secured against unauthorized data access and modifications; thus, attribute-based communication control (abcc) [ ] can be applied to secure such data flows. a pertaining risk to these and other ai-assisted systems and applications is adversarial machine learning [ ] , using which adversaries can compromise user data and privacy. in order to protect the data sets, differential privacy [ ] can be applied to add noise. cloud-based medical data storage and the upfront challenges have been extensively addressed in the literature [ , ] . a study [ ] conducted semi-structured interviews with fifteen people living in smart homes to learn about how they use their smart homes and to understand their security and privacy concerns, expectations, and actions. in future research, there is a need to conduct interviews with practitioners to understand security and privacy concerns while developing the smart hospital, and to apply a similar approach involving community residents, infrastructure manufacturers and stakeholders to develop other components of the smart connected ecosystem. privacy-preserving deep learning approaches such as collaborative deep learning or federated deep learning also need to be explored to train and deploy local models on edge devices (https://spectrum.ieee.org/telecom/security/tracking-covid -with-the-iot-may-put-your-privacy-at-risk). within a connected ecosystem, users are constantly interacting with numerous smart devices and applications. one of the main challenges in such an environment with billions of smart devices is performance efficiency and quality of service (qos). iot is an emerging technology that is being adopted by several nations across the world. one reason for the lack of constitutions and policies may be that iot differs from other network technologies and there is a lack of specific iot standards. research on constitutions and policy, including engagement in public policy development debates, and on iot standards is necessary to successfully integrate privacy, accuracy, property and accessibility solutions in the smart communities. to develop effective constitutional policies and standards, collaboration across governmental and nongovernmental organizations and industry partners, such as cloud providers and iot manufacturers, would be beneficial (see http://smartcities.gov.in/content/innerpage/guidelines.php ; https://www.transportation.gov/smartcity ; https://www.govtech.com/opinion/if-only-one-us-city-wins-the-smart-city-race-the-whole-nation-loses.html). enabling a smart community requires thousands of low-power and low-cost embedded devices together with large-scale data analytics and applications. there are several implementation challenges involved in developing such a large-scale smart infrastructure. fault tolerance and resilience are key challenges for the reliable delivery of sensor data from smart devices to distributed cloud services.
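returning to the access control discussion above, the following is a minimal sketch of an abac-style check; the attribute names and the single policy are illustrative assumptions, not a model from the cited works.

```python
# minimal sketch of an attribute-based access control (ABAC) check: a request is allowed
# only if user, device, and context attributes satisfy the policy conditions.
from typing import Any, Callable, Dict, List

Policy = List[Callable[[Dict[str, Any]], bool]]

# example policy: hospital practitioners may read vitals of patients in their own ward,
# and only from hospital-managed devices.
policy: Policy = [
    lambda r: r["user"]["role"] == "practitioner",
    lambda r: r["action"] == "read_vitals",
    lambda r: r["user"]["ward"] == r["resource"]["ward"],
    lambda r: r["device"]["managed_by"] == "hospital",
]

def is_authorized(request: Dict[str, Any]) -> bool:
    return all(rule(request) for rule in policy)

request = {
    "user": {"id": "dr-7", "role": "practitioner", "ward": "icu-2"},
    "action": "read_vitals",
    "resource": {"patient": "p-101", "ward": "icu-2"},
    "device": {"id": "tablet-3", "managed_by": "hospital"},
}
print(is_authorized(request))   # True for this request; flipping any attribute denies it
```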
various failures can occur in smart infrastructure, including in face recognition, community infrastructure management, and emergency response. geographically correlated resilient overlay networks (geocron) [ ] was developed to capture the localized nature of community iot deployments in the context of small failures. the research in [ ] proposed a new fault-tolerant routing technique for hierarchical sensor networks. another challenge for constantly running and managing these iot devices is the cost related to energy, communication, computation, infrastructure, and operation. there is generally a tradeoff between benefit and cost for iot applications [ ] ; however, in the scenario of the covid-19 pandemic, the expected benefits (saving lives, economic growth) should outweigh the operational and deployment costs. as another example, various iot devices communicate with each other or with a server to build a prediction model; if the local model is not able to predict accurately due to data duplication or other reasons, there will be no point in building such a model. the study in [ ] proposed a game-theoretic analysis to allocate storage capacity in a cost-effective manner that maximizes the benefits. as a future direction, a game-theoretic approach can be used to analyze smart infrastructures in terms of cost-benefit analysis. furthermore, interdisciplinary research collaboration is inevitable to implement a smart connected ecosystem. several areas of research and engineering, such as machine-to-machine technology, artificial intelligence, deep and machine learning, predictive analysis, and security and privacy, need to be merged, and a collaborative research approach is necessary in implementing, deploying, and managing a smart connected ecosystem. iot generates a tremendous amount of data collected by physical devices, and this raw data is converted into valuable knowledge using ai and machine learning technologies. the "6 v" (value, velocity, volume, variety, variability, and veracity) big data challenges for iot applications are discussed in [ , ] . the volume of data from iot devices overwhelms storage capacities. there is not only a storage issue, but the data needs to be organized properly so that it can be retrieved and processed in a timely manner. data duplication is a data storage issue when an organization has multiple copies of the same data source. for example, if a user has multiple wearable smart bracelets (a smart hospital bracelet, a smart grocery bracelet, and an rfid antibody tag bracelet), these wearable devices will collect similar kinds of data from the user, which can create a data duplication issue. machine learning (ml) based applications require a large amount of valuable data for correct prediction; however, complicated and insufficient data can be an issue for the accuracy of the learning and predictive models. in addition, ml approaches need further research and development to deal with such heterogeneous and constantly evolving sensory data inputs. for instance, a locality-based covid-19 patient detection model based on early symptoms learns in collaboration with smart nursing home data sets and smart child care data sets.
the prediction model can be biased towards elderly people if the number of patients in smart nursing home are more than smart child care. to overcome this problem, both models can learn at their edge networks using collaborative deep learning [ ] . research on these open challenges will help early development and deployment of future smart communities. in this paper, we propose future smart connected community scenarios, which are blueprints to develop smart and intelligent infrastructures against covid- and stop similar pandemic situations in the future. the autonomous operation with low human intervention in smart communities enable safe environment and enforce preventive measures for controlling the spread of infection in communities. data-driven and ai assisted applications facilitate increased testing, monitoring and tracking of covid- patients, and help to enforce social distance measure, predict possible infections based on symptoms and human activities, optimize the delivery of essential services and resources in a swift and efficient manner. the paper discussed different use case scenarios to reflect smart applications and ecosystem. the plethora of iot devices enable huge data collection in different sectors including healthcare, home, supply chain management, transportation, environment, and city, which raises user concerns. in addition, the implementation of proposed smart connected scenarios face other challenges including legislation and policy, deployment cost, interoperability etc. which have also been discussed in the paper. we believe that our vision of smart communities will ignite interdisciplinary research and development of connected ecosystem to prepare our world for future such outbreaks. smart items, fog and cloud computing as enablers of servitization in healthcare enabling health monitoring as a service in the cloud opportunities and challenges of the internet of things for healthcare: systems engineering perspective security and privacy in smart farming: challenges and opportunities cloud-assisted industrial internet of things (iiot)-enabled framework for health monitoring an access control framework for cloud-enabled wearable internet of things the case for vm-based cloudlets in mobile computing the virtual object as a major element of the internet of things: a survey an iot-aware architecture for smart healthcare systems enhancing the quality of life through wearable technology an iot-aware architecture for smart healthcare systems fall detection -principles and methods flexible technologies and smart clothing for citizen medicine, home healthcare, and disease prevention privacy-preserving deep learning a health smart home system to report incidents for disabled people correlation between real and simulated data of the activity of the elderly person living independently in a health smart home study and implementation of a network point health smart home electrocardiographic model and simulator of the activity of the elderly person in a health smart home health smart home -towards an assistant tool for automatic assessment of the dependence of elders detecting health and behavior change by analyzing smart home sensor data patient status monitoring for smart home healthcare automated cognitive health assessment from smart home-based behavior data towards a distributed estimator in smart home environment access control model for google cloud iot a closer look into privacy and security of chromecast multimedia cloud communications investigating security and 
privacy of a cloud-based wireless ip camera: netcam a testbed for privacy and security of iot devices an experimental framework for investigating security and privacy of iot devices analysis of iot traffic using http proxy convergence of manet and wsn in iot urban scenarios a method to make accurate inventory of smart meters in multi-tags group-reading environment smart spare part inventory management system with sensor data updating an iot application for inventory management with a self-adaptive decision model an iot based inventory system for high value laboratory equipment iot applications on secure smart shopping system aeon: a smart medicine delivery and inventory system for cebu city government's long life medical assistance program study of smart warehouse management system based on the iot iot based grocery management system: smart refrigerator and smart cabinet short on time? context-aware shopping lists to the rescue: an experimental evaluation of a smart shopping cart the smart shopping basket based on iot applications sysmart indoor services: a system of smart and connected supermarkets enabling rfid in retail secure v v and v i communication in intelligent transportation using cloudlets enabling and enforcing social distancing measures using smart city and its infrastructures: a covid- use case smart community: an internet of things application internet of things and big data analytics for smart and connected communities smart cities, big data, and communities: reasoning from the viewpoint of attractors a community-based iot personalized wireless healthcare solution trial an integrated cloud-based smart home management system with community hierarchy modeling 'thriving communities' using a systems architecture to improve smart cities technology approaches research on key technology for data storage in smart community based on big data a use case in cybersecurity based in blockchain to deal with the security and privacy of citizens and smart cities cyberinfrastructures internet of things for smart cities understanding smart cities: an integrative framework an information framework for creating a smart city through internet of things foundations for smarter cities long-range communications in unlicensed bands: the rising stars in the iot and smart city scenarios smart health: a context-aware health paradigm within smart cities smarter cities and their innovation challenges everything you wanted to know about smart cities: the internet of things is the backbone uav-enabled intelligent transportation systems for the smart city: applications and challenges a collaborative mechanism for private data publication in smart cities scalable mobile sensing for smart cities: the musanet experience an introduction to multi-sensor data fusion smart cars on smart roads: problems of control the security and privacy of smart vehicles real-time object detection for "smart" vehicles predictive active steering control for autonomous vehicle systems collision avoidance and stabilization for autonomous vehicles in emergency scenarios learning driving styles for autonomous vehicles from demonstration dynamic groups and attribute-based access control for next-generation smart cars the algorithmic foundations of differential privacy on the security and privacy of internet of things architectures and systems privacy through pseudonymity in user-adaptive systems privacy protection: p-sensitive k-anonymity property anas abou elkalam, and abdellah ait ouahman. 
access control in the internet of things: big challenges and new opportunities access control model for aws internet of things parbac: priority-attribute-based rbac model for azure iot cloud guide to attribute based access control (abac) definition and considerations (draft). nist special publication a unified attribute-based access control model covering dac, mac and rbac authorizations in cloud-based internet of things: current trends and use cases authorization framework for secure cloud assisted connected cars and vehicular internet of things poster: iot sentinel-an abac approach against cyber-warfare in organizations abac-cc: attribute-based access control and communication control for internet of things iot passport: a blockchain-based trust framework for collaborative internet-of-things adversarial attacks on medical machine learning development of private cloud storage for medical image research data extensive medical data storage with prominent symmetric algorithms on clouda protected framework end user security and privacy concerns with smart homes mifaas: a mobile-iot-federation-asa-service model for dynamic cooperation of iot cloud providers internet of things-new security and privacy challenges a framework for internet of things-enabled smart government: a case of iot cybersecurity policies and use cases in us federal government the united kingdom's emerging internet of things (iot) policy landscape. tanczer, lm, brass, i the united kingdom's emerging internet of things (iot) policy landscape the internet of things (iot) and its impact on individual privacy: an australian perspective resilient overlays for iotbased community infrastructure communications enabling reliable and resilient iot based smart city applications cost-benefit analysis at runtime for self-adaptive systems applied to an internet of things application cost-benefit analysis game for efficient storage allocation in cloud-centric internet of things systems: a game theoretic perspective internet of things: vision, future directions and opportunities deep learning for iot big data and streaming analytics: a survey learner's dilemma: iot devices training strategies in collaborative deep learning key: cord- -t zubl p authors: daubenschuetz, tim; kulyk, oksana; neumann, stephan; hinterleitner, isabella; delgado, paula ramos; hoffmann, carmen; scheible, florian title: sars-cov- , a threat to privacy? date: - - journal: nan doi: nan sha: doc_id: cord_uid: t zubl p the global sars-cov- pandemic is currently putting a massive strain on the world's critical infrastructures. with healthcare systems and internet service providers already struggling to provide reliable service, some operators may, intentionally or unintentionally, lever out privacy-protecting measures to increase their system's efficiency in fighting the virus. moreover, though it may seem all encouraging to see the effectiveness of authoritarian states in battling the crisis, we, the authors of this paper, would like to raise the community's awareness towards developing more effective means in battling the crisis without the need to limit fundamental human rights. to analyze the current situation, we are discussing and evaluating the steps corporations and governments are taking to condemn the virus by applying established privacy research. due to its fast spreading throughout the world, the outbreak of sars-cov- has become a global crisis putting stress on the current infrastructure in some areas in unprecedented ways, making shortcomings visible. 
since there is no vaccination against sars-cov- , the only way to deal with the current situation is non-pharmaceutical interventions (npis), which aim to reduce the number of new infections and flatten the curve of total patients. looking at european states like italy, spain, france, or austria, which are in lockdown as of march and are keeping people from seeing each other, citizens' right to live a self-determined life is no longer in their own hands. as shown by hatchett et al., this method had a positive effect in st. louis during the influenza pandemic [ ] ; nevertheless, its long-term effects on the economy and day-to-day life, including psychological effects on people forced to self-isolate, are often seen as a cause for concern [ ] . furthermore, some models show the possibility of a massive rise in new infections after the lockdown ends [ ] . hence, to handle the situation, measures are being discussed, some of which may invade citizens' privacy. we can see an example of this approach in asian countries, e.g., south korea [ ] and singapore [ ] , where, besides extensive testing, methods such as tracing mobile phone location data are used to identify possible contacts with infected persons [ ] . other countries have taken similar measures. for instance, netanyahu, israel's prime minister, ordered shin bet, israel's internal security service, to start surveilling citizens' cellphones [ ] , [ ] , [ ] . persons who have been closer than two meters to an infected person receive text messages telling them to go into immediate home isolation for days. as shin bet's mandate is to observe and fight middle eastern terrorism, israel's citizens are naturally concerned that it is now assisting in a medical situation [ ], [ ] , [ ] . within the eu, in particular in germany and austria, telecommunications providers are already providing health organizations and the government with anonymized mobile phone location data [ ] . although nobody has evaluated the effectiveness of these measures, they raise concerns from privacy experts, as the massive collection of data can easily lead to harming the population and violating their human rights if the collected data is misused. in this paper, we discuss the privacy issues that can arise in times of crisis and take a closer look into the case of the german robert koch institute receiving data from telekom. we conclude by providing some recommendations about ways to minimize privacy harms while combating the pandemic. in this section we outline the general definitions of privacy, including describing the contextual integrity framework for reasoning about privacy, and discuss privacy harms that can occur from misuse of personal data. we furthermore discuss the issues with privacy that can occur during a crisis such as this global pandemic and what can be done to ensure information security and hence appropriate data protection. privacy is a broad concept which has been studied from the point of view of different disciplines, including social sciences and humanities, legal studies and computer science. the definitions of privacy are commonly centered around seeing privacy as confidentiality (preventing disclosure of information), control (providing people with means to control how their personal data is collected and used) and transparency (ensuring that users are aware of how their data is collected and used, as well as ensuring that the data collection and processing occurs in a lawful manner) [ ] .
hannah arendt, a jewish philosopher who grew up in germany in the beginning of the th century defined privacy within the context of public and private space. her claim was that if there exists public space, there is also private space. arendt considers the privacy concept as a distinction between things that should be shown and things that should be hidden [ ] . and that, private spaces exist in opposition to public spaces. meaning, while the public square is dedicated to appearances, the private space is devoted to the opposite, namely hiding and privacy. she associated privacy with the home. due to the fact that we have become used to a "digital private space", such as our own email inbox or personal data on the phone, people are concerned and offended when the private, hidden space is violated. however, in times of crisis the term hidden or privacy becomes a new meaning. helen nissenbaum, a professor of information science, proposed the concept of contextual integrity as a framework to reason about privacy. according to her framework, privacy is defined as adhering to the norms of information flow [ ] . these norms are highly contextual: for example, it is appropriate for doctors to have access to the medical data of their patients, but in most cases it is inappropriate for employers to have access to medical data of their workers. nissenbaum distinguishes between the following five principles of information flow [ ] : the sender, the subject, the receiver, the information type and the transmission principle (e.g. whether confidentiality has to be preserved, whether the data exchange is reciprocal or whether consent is necessary and/or sufficient for the appropriateness of the data exchange). the norms governing these parameters are furthermore evaluated against a specific context, including whether the information flow is necessary for achieving the purpose of the context. data misuse can lead to different kinds of harms that jeopardise physical and psychological well-being of people as well as the overall society (see e.g. solove, ) . one of them is persecution by the government -this might not be a big concern in democratic societies, but democratic societies can move into more authoritarian governance styles, especially is crisis situations. even if this does not happen, there are other harms, e.g. a so called "chilling effect", where people are afraid to speak up against the accepted norms when they feel that they are being watched. furthermore, harms can result from data leaks, like unintentional errors or cyberattacks. in these cases, information about individuals may become known to unintended targets. this can result in physical harm, stalking and damage of the data subject's personal relationships. knowledge about one's medical data can lead to job discrimination. leaked details about one's lifestyle can lead to raised insurance rates. leakage of location data, in particular, can reveal a lot of sensitive information about an individual, such as the places they visit, which might in turn result in dramatic effects when revealed. just think of closeted homosexuals visiting a gay clubs or marginalized religious minorities visiting their place of worship. even beyond these concerns, access to large amounts of personal data can be used for more effective opinion and behavior manipulation, as evidenced by the cambridge analytica scandal [ ] . 
in summary, absence of privacy has a dramatic effect on our freedom of expression as individuals and on the wellfunctioning of the society as a whole. it is therefore important to ensure that the damage to privacy is minimized even in times of crisis. when we are considering the example of doctors treating their patients, we can use the framework of contextual integrity to reason about the appropriate information flow as follows: the patient is both the sender and the subject of the data exchange, the doctor is the receiver, the information type is the patient's medical information, the transmission principle includes, most importantly, doctor-patient confidentiality aside from public health issues. the overall context is health care, and the purpose of the context is both healing the patient and protecting health of the population. it can therefore be argued that in case of a global pandemic, one should allow the exchange of patient's data, especially when it comes to data about infected patients and their contacts, to the extent that it is necessary to manage the pandemic. there is, however, a danger of misusing the collected data outside of the defined context -the so-called "mission creep", which experts argue was the case with nsa collecting data from both us and foreign citizens on an unprecedented scale as an aftermath of the / terrorist attack [ ] . furthermore, aside from the danger of collecting data by the government, the crisis situation leads to increase of data collection by private companies, as people all over the world switch to remote communication and remote collaboration tools from faceto-face communications. the data collection and processing practices of these tools, however, are often obscure from their users: as known from research in related fields, privacy policies are often too long, obscure, and complicated to figure out, and shorter notices such as cookie disclaimers tend to be perceived as too vague and not providing useful information [ ] , [ ] . this leads to users often ignoring the privacy policies and disclaimers, hence, being unaware of important information about their data sharing. moreover, even among the privacyconcerned users, the adoption of more privacy-friendly tools can be hindered by social pressure and network effects, if everyone else prefers to use more popular tools that are less inclined to protect the privacy of their users (as seen in studies on security and privacy adoption in other domains, see e.g. [ ] , [ ] ). this data collection even furthers the effects of the so-called surveillance capitalism [ ] , which leads to corporations having even more power over people than before the crisis. this access to personal data by corporations is furthermore aggravated by an increased usage of social media platforms, increases in users sharing their location data and giving applications increased access to their phone's operating system. lowered barriers and increased online activity that can be directly linked to an individual or an email address is a treasure trove for for-profit corporations that monetize consumer data. many corporations are now getting free or low cost leads for months to come. a question that is often open for discussion is to which extent people themselves would be ready to share their data, even if it results in a privacy loss. 
as such, data sharing habits in general have been the topic of research, leading to discussions of the so-called privacy paradox: people claim that privacy is important to them, yet do not behave in a privacy-preserving way. the privacy paradox can be explained by different factors [ ] . one of them is the lack of awareness about the extent of data collection as well as about the possible harms that can result from unrestricted data sharing. a further factor stems from decision biases, such as people's tendency to underestimate risks that may materialize in the future compared against an immediate benefit. another noteworthy factor is manipulation by service providers (so-called dark patterns), nudging users into sharing more of their data contrary to their actual preferences. but rational decisions in times of crisis are even more difficult. given the state of stress and anxiety many are in, people might be more likely to accept privacy-problematic practices if they are told that these practices are absolutely necessary for managing the crisis, even if this is not actually the case. the problem that people are more likely to surrender their privacy rights if they have already had to surrender other fundamental rights (such as freedom of movement due to lockdown restrictions) is reminiscent of the psychological mechanism of the door-in-the-face technique. the door-in-the-face technique is a method of social influence where we first ask a person to do something that requires more than they would accept; afterward, we ask for a second, smaller favor. research has shown that the person is then more likely to accept the smaller favor [ ] . in the case of the sars-cov- pandemic, governments first asked their citizens to self-isolate (limiting a significant fundamental freedom) before following up with the smaller favor of handing over some private data to fight the outbreak. however, according to cantarero et al., the level of acceptance differs from individual to individual [ ] , which makes it even more critical to raise awareness among the population. at the same time, timely access to data voluntarily shared by people (in addition to the data collected by hospitals and authorities) can indeed help combat the epidemic. in this, we support informed consent of data subjects, because it ensures that people will only share data with institutions that keep their data safe against privacy harms. in an increasingly digital world, establishing proper information security safeguards is critical in preventing data leaks, and hence, in preserving the privacy of data subjects. however, the situation of such a global pandemic places significant challenges on established workflows, information technology, and security as well, resulting in various issues. these problems arose when people stopped traveling and going into the office, and started working from home. while some companies and institutions have provided the possibility of remote work also before the crisis, or are at least infrastructurally and organizationally prepared, many are unprepared for such a dramatic increase in home office work. they face significant technical and organizational challenges, such as ensuring the security of their systems given the need to open the network to remote access, e.g., via the so-called demilitarized zone (dmz) or perimeter control, an extension of technical monitoring of the system, and an overall extension of system hardening in "hostile" (home) environments.
a recent poll revealed that the security teams of % of companies did not have "emergency plans in place to shift an on-premise workforce to one that is remote" [ ] . even worse, these challenges are more present in regulated (and therefore often critical) industries as sumir karayi, ceo and founder of e, in a threatpost interview states: "government, legal, insurance, banking and healthcare are all great examples of industries that are not prepared for this massive influx of remote workers [...] many companies and organizations in these industries are working on legacy systems and are using software that is not patched. not only does this mean remote work is a security concern, but it makes working a negative, unproductive experience for the employee. [...] regulated industries pose a significant challenge because they use systems, devices or people not yet approved for remote work [...] proprietary or specific software is usually also legacy software. it's hard to patch and maintain, and rarely able to be accessed remotely." [ ] in consequence, the urgent need to enable remote collaboration related to the lack of preparation and preparation time may lead to hurried and immature remote work strategies. at the same time, ensuring proper security behavior of the employees -something that was a challenge in many companies also before the crisis -is becoming an even more difficult task. we can currently see employees trying to circumvent corporate restrictions by sending or sharing data and documents over private accounts (shadow it). additionally, there is a surge of social engineering attacks among other phishing email campaigns, business email compromise, malware, and ransomware strains, as sherrod degrippo, senior director of threat research and detection at proofpoint, states [ ] . similar findings are provided by atlas vpn research, which shows that several industries broadly use unpatched or no longer supported hardware or software systems, including the healthcare sector [ ]. together with immature remote strategies, information security and privacy risks may significantly increase and undermine the standardized risk management process. the european data protection board (edpb) has formulated a statement on the processing of personal data in the context of the sars-cov- outbreak [ ] . according to edpb, data protection rules do not hinder measures taken in the fight against the coronavirus pandemic. even so, the edpb underlines that, even in these exceptional times, the data controller and processor must ensure the protection of the personal data of the data subjects. therefore, several considerations should be taken into account to guarantee the lawful processing of personal data, and in this context, one must respect the general principles of law. as such, the gdpr allows competent public health authorities like hospitals and laboratories as well as employers to process personal data in the context of an epidemic, by national law and within the conditions set therein. concerning the processing of telecommunication data, such as location data, the national laws implementing the eprivacy directive must also be respected. the national laws implementing the eprivacy directive provide that the location data can only be used by the operator when they are made anonymous, or with the consent of the individuals. if it is not possible to only process anonymous data, art. 
of the eprivacy directive enables the member states to introduce legislative measures pursuing national security and public security. this emergency legislation is possible under the condition that it constitutes a necessary, appropriate, and proportionate measure within a democratic society. if a member state introduces such measures, it is obliged to put in place adequate safeguards, such as granting individuals the right to a judicial remedy. in this section, we conduct a preliminary analysis of the german disease control authority receiving movement data from a telecommunication provider. in germany, the authority for disease control and prevention, the robert koch institute (rki), made headlines on march , , as it became public that telecommunication provider telekom had shared an anonymized set of mobile phone movement data to monitor citizens' mobility in the fight against sars-cov- . in total, telekom sent million customers' data to the rki for further analysis. the german federal commissioner for data protection and freedom of information, ulrich kelber, overseeing the transfer, commented on the incident that he is not concerned about violating any data protection rules, as the data had been anonymized upfront [ ] . however, researchers have shown that seemingly anonymized data sets can indeed be "deanonymized" [ ] . constanze kurz, an activist and expert on the subject matter, commented that she was skeptical about the anonymization. she urged telekom to publicize the anonymization methods that were being used and asked the robert koch institute to explain how it will protect this data from unauthorized third-party access. several research studies have shown the deanonymization of data sets to extract personal information, including a case from , when a journalist and a data scientist acquired an anonymized dataset with the browsing habits of more than three million german citizens [ ] , [ ] . as, at the moment, it is hard to tell whether disclosure of personal data is possible from the shared set (even more so given the development of new re-identification methods, including possible future developments), we look at the worst-case scenario, namely, that the personal data is deanonymizable. given this scenario, we use nissenbaum's contextual integrity thesis to understand whether telekom has violated its customers' privacy [ ] . we do so by stating the context of the case and the norm (what everyone expects should happen), plus the five contextual parameters, to further analyze the situation. table i summarizes the contextual integrity framework as applied to the german data sharing situation. the parameter that is perhaps most interesting for further elaboration is the transmission principle. given the context and urgency of the situation, one might agree that having the german federal commissioner for data protection and freedom of information oversee the transaction and taking some measures to anonymize the data set appropriately serves as a practical solution towards limiting the spread of sars-cov- , even without explicitly obtaining consent from data subjects. we do, however, assume that appropriate use of the data means limiting it to the specific purpose of combating the pandemic, and not reusing it for other purposes without further assessment. note, however, that there is space for discussion, in which the community should be engaged, about the norms that apply in this situation, especially given the extraordinary circumstances and the severity of the crisis.
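to make the five-parameter analysis concrete, the following is a minimal, illustrative sketch of how an information flow and a contextual norm could be encoded and checked programmatically. the field values loosely mirror the telekom case as summarized in the text, while the data structure and the checking function are our own illustrative assumptions, not part of nissenbaum's framework or of the paper's table i.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InformationFlow:
    """One information flow described by the five contextual-integrity parameters."""
    sender: str
    subject: str
    receiver: str
    information_type: str
    transmission_principle: frozenset  # conditions attached to the transfer

# The actual flow, paraphrased from the text for illustration.
actual_flow = InformationFlow(
    sender="Telekom",
    subject="Telekom customers",
    receiver="Robert Koch Institute",
    information_type="anonymized mobility data",
    transmission_principle=frozenset({"anonymized", "pandemic-purpose-only", "oversight-by-DPA"}),
)

# A contextual norm: which receivers, types, and conditions are appropriate here.
norm = {
    "allowed_receivers": {"Robert Koch Institute"},
    "allowed_information_types": {"anonymized mobility data"},
    "required_conditions": frozenset({"anonymized", "pandemic-purpose-only"}),
}

def respects_contextual_integrity(flow, norm):
    """Flag flows whose receiver, type, or transmission conditions break the norm."""
    return (
        flow.receiver in norm["allowed_receivers"]
        and flow.information_type in norm["allowed_information_types"]
        and norm["required_conditions"] <= flow.transmission_principle
    )

print(respects_contextual_integrity(actual_flow, norm))  # -> True

# A hypothetical re-use (forwarding to a third party) breaks the norm.
third_party_flow = InformationFlow(
    sender="Robert Koch Institute",
    subject="Telekom customers",
    receiver="third party",
    information_type="anonymized mobility data",
    transmission_principle=frozenset({"anonymized"}),
)
print(respects_contextual_integrity(third_party_flow, norm))  # -> False
```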
a further step of the contextual integrity analysis is to apply nissenbaum's five-parameter thesis of contextual information to create hypothetical scenarios that could threaten the decision's future integrity. we therefore consider the following hypotheticals, which we believe would violate contextual integrity: hypothetical scenario : "the robert koch institute does not delete the data after the sars-cov- crisis"; hypothetical scenario : "the robert koch institute forwards data to other state organs or to third parties". these hypothetical scenarios would violate the transmission principle that the data is only going to be used to handle the crisis (and, in the second hypothetical, also the receiver of the data). we believe a future assessment is necessary to determine whether the data transfer was indeed necessary to fight the pandemic or whether, alternatively, customer permissions should have been required upfront. hypothetical scenario : "the robert koch institute requests data about telekom customer movements from the last ten years". these scenarios change the type of information. we want to argue that the newly exchanged data no longer serves the purpose of fighting the pandemic. this point was also made by the electronic freedom frontier organization [ ] , noting that since the incubation period of the virus is estimated to last days, getting access to data that is much older than that would be a privacy violation. we think that, similar to the first three scenarios, a further assessment, based on transparent information, is necessary. as with the first three scenarios, the transmission principle of confidentiality is violated in this scenario, albeit unintentionally, and, in case of improper anonymization, personal information might still leak. hence, a privacy violation has taken place. referring to the outlined information security concerns, an increase in cyber attacks related to improper information security management in a time of crisis significantly increases the risk. given the above-outlined hypotheticals, we recommend implementing appropriate protection measures. countries around the world have already taken numerous initiatives to slow down the spread of sars-cov- , such as remote working, telemedicine, and online learning and shopping. that has required a multitude of changes in our lives. however, as mentioned in previous sections, these activities come with associated security and privacy risks. various organizations are raising concerns regarding these risks (see, e.g., the statement and proposed principles from the electronic freedom frontier [ ] ). of particular interest is the case of healthcare systems, which must be transparent with the information related to patients, but cautious with the disclosed information. equally, hospitals might also decide to withhold information in order to try to minimize liability. that is a slippery slope: both cases, no information or too much information, might lead to a state of fear in the population and a false sense of security (i.e., no information means there is no problem) or a loss of privacy when we decide to disclose too much information. in the current situation and others that might arise, principles and best practices developed before the crisis are still applicable, namely privacy by design principles and, most importantly, data minimization. only strictly necessary data needed to manage the crisis should be collected, and it should be deleted once humanity has overcome the crisis.
in this context, patient data should be collected, stored, analyzed, and processed under strict data protection rules (such as the general data protection regulation gdpr) by competent public health authorities, as mentioned in the previous chapter [ ] . an example of addressing the issues of data protection during the crisis can also be seen within the austrian project vkt-goepl [ ] . it was the project's goal to generate a dynamic situational map for ministries overseeing the crisis. events, such as terrorist attacks, flooding, fire, and pandemic scenarios, were selected. already ten years ago, the need for geographical movement data provided by telecommunication providers was treated as a use-case. furthermore, the project initiators prohibited the linking of personal data from different databases in cases where this data was not anonymized. they recommended that ministries are transparently informing all individuals about the policies which apply to the processing of their data. regarding data analysis, we recommend that citizens only disclose their data to authorized parties, once these are putting adequate security measures and confidentiality policies in place. moreover, only data that is strictly necessary should be shared. we think that proper data storage should make use of advanced technology such as cryptography. patient data -including personal information such as contact data, sexual preferences or religion amongst others -should not be revealed. as anonymizing data has been shown to be a non-trivial task that is hard to achieve in a proper way, advanced solutions such as cryptographic techniques for secure multiparty computation or differential privacy algorithms for privacy-preserving data releases should be used. besides, to ensure privacy from the collection stage, consistent training of the medical personnel, volunteers, and administrative staff should be done. the current lack of training (due to limited resources, shortage of specialists, and general time pressure) leads to human errors and neglect of proper security and privacy protection measures. a further concern, which we did not investigate in this paper is to ensure fairness when it comes to algorithmic decision making. as such, automated data systems ("big data" or "machine learning") are known to have issues with biasbased e.g., on race or gender that can lead to discrimination [ ] . in order to prevent such adverse effects during the crisis, these systems should furthermore be limited in order to limit bias based on nationality, sexual preferences, religion, or other factors that are not related to handling the pandemic. finally, we recognize that having access to timely and accurate data can play a critical role in combating the epidemic. nevertheless, as discussed in previous sections, ignoring issues around the collection and handling of personal data might cause serious harm that will be hard to repair in the long run. therefore, as big corporations and nation-states are collecting data from the world's population; it is of crucial importance that this data is handled responsibly and keeping the privacy of the data subjects in mind. 
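as a small, concrete illustration of the differential-privacy-based data releases recommended above, the following sketch adds calibrated laplace noise to an aggregate statistic before publication. the statistic, its sensitivity, and the epsilon value are hypothetical choices made for illustration and are not prescribed by the text.

```python
import numpy as np

def laplace_release(true_value, sensitivity, epsilon, rng=None):
    """Release a noisy statistic satisfying epsilon-differential privacy
    via the Laplace mechanism (noise scale = sensitivity / epsilon)."""
    rng = np.random.default_rng() if rng is None else rng
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical example: publishing a daily count of newly infected persons in a
# district. Adding or removing one person changes the count by at most 1, so the
# sensitivity is 1; a smaller epsilon gives stronger privacy but noisier output.
true_count = 137
noisy_count = laplace_release(true_count, sensitivity=1.0, epsilon=0.5,
                              rng=np.random.default_rng(42))
print(round(noisy_count))
```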
cdc grand rounds: modeling and public health decision-making the psychological impact of quarantine and how to reduce it: rapid review of the evidence impact of non-pharmaceutical interventions (npis) to reduce covid- mortality and healthcare demand coronavirus: south korea's success in controlling disease is due to its acceptance of surveillance interrupting transmission of covid- : lessons from containment efforts in singapore effect of non-pharmaceutical interventions for containing the covid- outbreak in china israel begins tracking and texting those possibly exposed to the coronavirus israel starts surveilling virus carriers, sends who were nearby to isolation mobilfunker a liefert bewegungsströme von handynutzern an regierung the cyber security body of knowledge the human condition privacy as contextual integrity contextual integrity up and down the data food chain. theoretical inquiries law fresh cambridge analytica leak 'shows global manipulation is out of control mission creep: when everything is terrorism a design space for effective privacy notices this website uses cookies": users' perceptions and reactions to the cookie disclaimer obstacles to the adoption of secure communication tools a socio-technical investigation into smartphone security big other: surveillance capitalism and the prospects of an information civilization the myth of the privacy paradox door-in-the-face-technik being inconsistent and compliant: the moderating role of the preference for consistency in the door-in-the-face technique coronavirus poll results: cyberattacks ramp up, wfh prep uneven working from home: covid- 's constellation of security challenges us is fighting covid- with outdated software statement onthe processing of personal data in the context of the covid- outbreak warum die telekom bewegungsdaten von handynutzern weitergibt broken promises of privacy: responding to the surprising failure of anonymization estimating the success of re-identifications in incomplete datasets using generative models big data deidentification, reidentification and anonymization protecting civil liberties during a public health crisis crisis and disaster management as a network-activity how big data increases inequality and threatens democracy key: cord- -ns u authors: ye, yanfang; hou, shifu; fan, yujie; qian, yiyue; zhang, yiming; sun, shiyu; peng, qian; laparo, kenneth title: $alpha$-satellite: an ai-driven system and benchmark datasets for hierarchical community-level risk assessment to help combat covid- date: - - journal: nan doi: nan sha: doc_id: cord_uid: ns u the novel coronavirus and its deadly outbreak have posed grand challenges to human society: as of march , , there have been , confirmed cases and , reported deaths in the united states; and the world health organization (who) characterized coronavirus disease (covid- ) - which has infected more than , people with more than , deaths in at least countries - a global pandemic. a growing number of areas reporting local sub-national community transmission would represent a significant turn for the worse in the battle against the novel coronavirus, which points to an urgent need for expanded surveillance so we can better understand the spread of covid- and thus better respond with actionable strategies for community mitigation. 
by advancing capabilities of artificial intelligence (ai) and leveraging the large-scale and real-time data generated from heterogeneous sources (e.g., disease related data from official public health organizations, demographic data, mobility data, and user generated data from social media), in this work, we propose and develop an ai-driven system (named α-satellite), as an initial offering, to provide hierarchical community-level risk assessment to assist with the development of strategies for combating the fast evolving covid- pandemic. more specifically, given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable individuals to select appropriate actions for protection while minimizing disruptions to daily life to the extent possible. the developed system and the generated benchmark datasets have been made publicly accessible through our website. the system description and disclaimer are also available on our website. coronavirus disease (covid- ) [ ] is an infectious disease caused by a new virus that had not been previously identified in humans; this respiratory illness (with symptoms such as a cough, fever and pneumonia) was first identified during an investigation into an outbreak in wuhan, china in december and is now rapidly spreading in the u.s. and globally. the novel coronavirus and its deadly outbreak have posed grand challenges to human society. as of march , , there have been , confirmed cases and , reported deaths in the u.s. (figure ); and the who characterized covid- , which has infected more than , people with more than , deaths in at least countries, as a global pandemic. it is believed that the novel virus which causes covid- emerged from an animal source, but it is now rapidly spreading from person to person through various forms of contact. according to the centers for disease control and prevention (cdc) [ ] , the coronavirus seems to be spreading easily and sustainably in the community, i.e., community transmission, which means people have been infected with the virus in an area, including some who are not sure how or where they became infected. an example of community transmission that caused the outbreak of covid- in king county, washington state (wa), is shown in figure . the challenge with community transmission is that carriers are often asymptomatic and unaware that they are infected, and through their movements within the community they spread the disease. according to the cdc, before a vaccine or drug becomes widely available (which is currently the case for covid- ), community mitigation, which is a set of actions that persons and communities can take to help slow the spread of respiratory virus infections, is the most readily available intervention to help slow transmission of the virus in communities [ ] . a growing number of areas reporting local sub-national community transmission would represent a significant turn for the worse in the battle against the novel coronavirus, which points to an urgent need for expanded surveillance so we can better understand the spread of covid- and thus better respond with actionable strategies for community mitigation.
unlike the influenza pandemic [ ] where the global scope and devastating impacts were only determined well after the fact, covid- history is being written daily, if not hourly, and if the right types of data can be acquired and analyzed there is the potential to improve self awareness of the risk to the population and develop proactive (rather than reactive) interventions that can halt the exponential growth in the disease that is currently being observed. realizing the true potential of real-time surveillance, with this opportunity comes the challenge: the available data are uncertain and incomplete while we need to provide mitigation strategies objectively with caution and rigor (i.e., enable people to select appropriate actions to protect themselves at increased risk of covid- while minimize disruptions to daily life to the extent possible). to address the above challenge, leveraging our long-term and successful experiences in combating and mitigating widespread malware attacks using ai-driven techniques [ , , , , , , , [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] , in this work, we propose to design and develop an ai-driven system to provide hierarchical community-level risk assessment at the first attempt to help combat the fast evolving covid- pandemic, by using the large-scale and real-time data generated from heterogeneous sources. in our developed system (named α-satellite), we first develop a set of tools to collect and preprocess the large-scale and real-time pandemic related data from multiple sources, including disease related data from official public health organizations, demographic data, mobility data, and user generated data from social media; and then we devise advanced ai-driven techniques to provide hierarchical community-level risk assessment to enable actionable strategies for community mitigation. more specifically, given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable people to select appropriate actions for protection while minimizing disruptions to daily life. the framework of our proposed and developed system is shown in figure . in the system of α-satellite, ( ) we first construct an attributed heterogeneous information network (ahin) to model the collected large-scale and real-time pandemic related data in a comprehensive way; ( ) based on the constructed ahin, to address the challenge of limited data that might be available for learning (e.g., social media data to learn public perceptions towards covid- in a given area might not be sufficient), we then exploit the conditional generative adversarial nets (cgans) to gain the public perceptions towards covid- in each given area; and finally ( ) we utilize meta-path based schemes to model both vertical and horizontal information associated with a given area, and devise a novel heterogeneous graph auto-encoder (gae) to aggregate information from its neighborhood areas to estimate the risk of the given area in a hierarchical manner. the developed system α-satellite and the generated benchmark datasets have been made publicly accessible through our website. 
there have been several works on using ai and machine learning techniques to help combat covid- : in the biomedical domain, [ , , , , ] use deep learning methods for covid- pneumonia diagnosis and genome study, while [ , ] develop learning-based models to predict severity and survival for patients. another research direction is to utilize publicly accessible data to help estimate infection cases or forecast the covid- outbreak [ , , , , , , ] . however, most of these existing works mainly focus on wuhan, china; studies using computational models to combat covid- in the u.s. are scarce, and there has been no work on community-level risk assessment to assist with community mitigation so far. to meet this urgent need and to bridge the research gap, in this work, by advancing capabilities of ai and leveraging the large-scale and real-time data generated from heterogeneous sources, we propose and develop an ai-driven system, named α-satellite, to provide hierarchical community-level risk assessment as a first attempt to help combat the deadly and fast evolving covid- pandemic. in this section, we introduce in detail our proposed method integrated in the system of α-satellite to automatically provide hierarchical community-level risk assessment related to covid- . realizing the true potential of real-time surveillance requires identifying the proper data sources, based on which we can devise models to extract meaningful and actionable information for community mitigation. since relying on a single data source for estimation and prediction often results in unsatisfactory performance, we develop a set of crawling tools and preprocessing methods to collect and parse the large-scale and real-time pandemic related data from multiple sources, which include the following. • disease related data. we collect the up-to-date county-based coronavirus related data, including the numbers of confirmed cases, new cases, deaths and the fatality rate, from i) official public health organizations such as who, cdc, and county government websites, and ii) digital media with real-time updates of covid- (e.g., point acres). (figure : system architecture of α-satellite, i.e., an ai-driven system for hierarchical community-level risk assessment. in α-satellite, (a) we first construct an ahin to model the collected large-scale and real-time pandemic related data in a comprehensive way; (b) based on the constructed ahin, we then exploit the cgans to gain the public perception towards covid- in a given area; (c) we finally utilize meta-path based schemes to model both vertical and horizontal information associated with a given area, and devise a heterogeneous gae to aggregate information from its neighborhood areas to estimate the risk of the given area in a hierarchical manner.) the collected up-to-date county-based covid- related statistical data can be an important element for the risk assessment of an associated area. • demographic data. the united states census bureau provides demographic data including basic population, business, and geography statistics for all states and counties, and for cities and towns with more than , people. the demographic information will contribute to the risk assessment of an associated area: for example, as older adults may be at higher risk for more serious complications from covid- [ , ] , the age distribution of a given area can be considered as an important input.
in this work, given a specific area, we mainly consider the associated demographic data including the estimated population, population density (e.g., number of people per square mile), age and gender distributions. • mobility data. given a specific area (either user input or automatic positioning), a mobility measure that estimates how busy the area is in terms of traffic density will be retained from location service providers (i.e., google maps). • user generated data from social media. as users in social media are likely to discuss and share their experiences of covid- , the data from social media may contribute complementary knowledge such as public perceptions towards covid- in the area they associate with. in this work, we initialize our efforts with the focus on reddit, as it provides the platform for scientific discussion of dynamic policies, announcements, symptoms and events of covid- . in particular, we consider i) three subreddits with general discussion (i.e., r/coronavirus , r/covid and r/coronavirusus ); ii) four region-based subreddits, which are r/coronavirusmidwest, r/coronavirussouth, r/coronavirussoutheast and r/coronaviruswest; and iii) statebased subreddits (i.e., washington, d.c. and states). to analyze public perceptions towards covid- for a given area (note that all users are anonymized for analysis using hash values of usernames), we first exploit stanford named entity recognizer [ ] to extract the location-based information (e.g., county, city), and then utilize tools such as nltk [ ] to conduct sentiment analysis (i.e., positive, neutral or negative). more specifically, positive denotes well aware of covid- , while negative indicates less aware of covid- . for example, with the analysis of the post by a user (with hash value of "cf*** ") in subreddit of r/coronaviruspa on march , : "i live in montgomery county, pa and everyone here is acting like there's nothing going on.", the location-related information of montgomery county and pennsylvania state (i.e., pa) can be extracted, and a user's perception towards covid- in montgomery county at pa can be learned (i.e., negative indicating less aware of covid- ). such automatically extracted knowledge will be incorporated into the risk assessment of the related area; meanwhile, it can also provide important information to help inform and educate about the science of coronavirus transmission and prevention. to comprehensively describe a given area for its risk assessment related to covid- , based on the data collected from multiple sources above, we consider and extract higher-level semantics as well as social and behavioral information within the communities. attributed features. based on the collected data above, we further extract the corresponding attributed features. • a : disease related feature. for a given area, its related covid- pandemic data will be extracted including the numbers of confirmed cases, new cases, deaths and the fatality rate, which is represented by a numeric feature vector a . for example, as of march , , the cuyahoga county at ohio state (oh) has had confirmed cases, new cases, death and . % fatality rate, which can be represented as a =< , , , . >. • a : demographic feature. 
given a specific area, we obtain its associated city's (or town's) demographic data from the united states census bureau, including the estimated population, population density (i.e., number of people per square mile), age distribution (i.e., percentage of people over years old) and gender distribution (i.e., percentage of females). for example, to assist with the risk assessment of the area of euclid ave in cleveland, oh, the obtained demographic data associated with it are: cleveland with population of , population density of , . % people over years old, and . % females, which will be represented as a =< , , . , . >. • a : mobility feature. given a specific area, a mobility measure that estimates how busy the area is in terms of traffic density will be obtained from google maps, which will be represented by five degree levels (i.e., [ , ] , the larger the number, the busier). • a : representation of public perception. after performing the automatic sentiment analysis based on the collected posts associated with a given area from reddit, the public perception towards covid- in this area will be represented by a normalized value (i.e., [ , ]) indicating the awareness of covid- (i.e., the larger the value, the more aware). for the previous example of the reddit post "i live in montgomery county, pa and everyone here is acting like there's nothing going on.", a related perception towards covid- in montgomery county, pa will be formulated as a numeric value of . , denoting that people in this area were less aware of covid- on march , . after extracting the above features, we concatenate them as a normalized attributed feature vector a attached to each given area for representation, i.e., a = a ⊕ a ⊕ a ⊕ a . note that we zero-pad the corresponding elements when the data are not available (a minimal sketch of this feature construction and of the neighbor computation below is given at the end of this subsection). relation-based features. besides the above extracted attributed features, we also consider the rich relations among different areas. • r : administrative affiliation. according to the severity of covid- , available resources and impacts on the residents, different states may have different policies, actionable strategies and orders in response to covid- . therefore, given an area, we accordingly extract its administrative affiliation in a hierarchical manner. particularly, we acquire the state-include-county and county-include-city relations from city-to-county finder. • r : geospatial relation. we also consider the geospatial relations between a given area and its neighborhood areas. more specifically, given an area, we retain its k-nearest neighbors at the same hierarchical level by calculating the euclidean distances based on their global positioning system (gps) coordinates obtained from google maps and wikipedia. based on the above attributed features and relations, we construct an attributed heterogeneous information network (ahin). definition . ahin. an ahin is denoted as a graph g = (v, e, a) with an entity type mapping ϕ: v → t and a relation type mapping ψ: e → r, where, for each of the m entity types, x_i is the set of entities of that type and a_i is the set of attributes defined for entities of that type, v = ∪_i x_i denotes the entity set and e is the relation set, t denotes the entity type set and r is the relation type set, a = ∪_i a_i, and |t| + |r| > , i.e., the network contains multiple entity or relation types. network schema [ ] : the network schema of an ahin g is a meta-template for g, denoted as a directed graph t_g = (t, r) with nodes as entity types from t and edges as relation types from r. in this work, we have four types of entities (i.e., nation, state, county and city, |t| = ), two types of relations (i.e., r and r , |r| = ), and each entity is attached with an attributed feature vector as described above. based on these definitions, the network schema of the ahin in our case is shown in figure .
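a minimal sketch of the two mechanical steps just described, concatenating the attributed feature blocks with zero-padding and retaining the k nearest neighbors by euclidean distance over gps coordinates, is given below; the toy numbers, block lengths, and choice of k are illustrative assumptions, not values from the paper.

```python
import numpy as np

def build_attribute_vector(a1, a2, a3, a4, lengths=(4, 4, 1, 1)):
    """Concatenate the disease, demographic, mobility, and perception features,
    zero-padding any block whose data are not available (passed as None)."""
    blocks = []
    for block, length in zip((a1, a2, a3, a4), lengths):
        blocks.append(np.zeros(length) if block is None
                      else np.asarray(block, dtype=float))
    return np.concatenate(blocks)

def k_nearest_areas(target_coord, other_coords, k=3):
    """Indices of the k nearest areas by euclidean distance on (lat, lon) pairs."""
    diffs = np.asarray(other_coords, dtype=float) - np.asarray(target_coord, dtype=float)
    distances = np.linalg.norm(diffs, axis=1)
    return np.argsort(distances)[:k]

# Toy example: one area whose social-media perception feature is unavailable.
a = build_attribute_vector(
    a1=[12, 3, 1, 0.02],            # confirmed, new cases, deaths, fatality rate
    a2=[380000, 5100, 0.16, 0.52],  # population, density, share 60+, share female
    a3=[2],                         # mobility level
    a4=None,                        # perception unavailable -> zero-padded
)
neighbors = k_nearest_areas(
    (41.50, -81.68),
    [(41.48, -81.80), (41.66, -81.34), (40.44, -79.99), (41.51, -81.60)],
)
print(a.shape, neighbors)
```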
although the constructed ahin can model the complex and rich relations among different entities attached with attributed features, a challenge remains: the attributed features attached to entities in the ahin may have missing values because of the limited data available for learning. more specifically, given an area, there may not be sufficient social media data (i.e., reddit data in this work) to learn the public perceptions towards covid- in this area. for example, for the state of montana, as of march , , its corresponding subreddit r/coronavirusmontana contained only posts by seven users discussing the virus. to address this issue, we propose to exploit cgans [ ] for synthetic (virtual) social media user data generation for public perception learning to enrich the ahin. different from traditional gans [ ] , a cgan is a conditional model extended from gans, in which both the generator and discriminator are conditioned on some extra information. in our case, we exploit a cgan to generate synthetic posts for those areas where the data are not available. in our designed cgan, given an area where reddit data are not available, the condition is composed of three parts: the disease related feature vector of this area a_1, its related demographic feature vector a_2, and its gps coordinate denoted as o. as shown in figure , the generator in the devised cgan incorporates the prior noise p_z(z), with the conditions a_1, a_2 and o as inputs, to generate synthetic posts represented by latent vectors; in the discriminator, real post representations obtained using doc2vec [ ] or generated synthetic post latent vectors, along with a_1, a_2 and o, are fed to a discriminative function. both generator and discriminator could be a non-linear mapping function, such as a multi-layer perceptron (mlp). the generator and discriminator play the adversarial minimax game formulated as the following minimax problem:

min_G max_D V(D, G) = E_{x ~ p_data(x)} [log D(x | a_1, a_2, o)] + E_{z ~ p_z(z)} [log(1 − D(G(z | a_1, a_2, o)))]. ( )

the generator and discriminator are trained simultaneously: the parameters of the generator are adjusted to minimize log(1 − D(G(z | a_1, a_2, o))), while the parameters of the discriminator are adjusted to maximize the probability of assigning the correct labels to both training examples and generated samples. after applying the cgan for synthetic post latent vector generation, we further exploit a deep neural network (dnn) to learn the public perceptions towards covid- in this area. more specifically, we first use doc2vec to obtain the representations of real posts collected from reddit and feed them to train the dnn model; then, given a generated synthetic post latent vector, we use the trained model to obtain its related perception (i.e., awareness of covid- ). meta-path expression. to assist with the risk assessment of a given area related to the fast evolving covid- , it might not be sufficient to consider only its vertical information (e.g., its related city, county or state's responses, strategies and policies); the horizontal information (i.e., information from its neighborhood areas) is also an important input. to comprehensively integrate both vertical and horizontal information, we propose to exploit the concept of meta-path [ ] to formulate the relatedness among different areas in the constructed ahin. definition . meta-path. a meta-path p is a path defined on the network schema t_g = (t, r), and is denoted in the form of t_1 --r_1--> t_2 --r_2--> · · · --r_l--> t_(l+1), which defines a composite relation r = r_1 ∘ r_2 ∘ · ·
· r l between types t and t l+ , where · denotes relation composition operator, and l is the length of p. city denotes that, to assess the risk of a specific city, we not only consider the city itself, but also the information from its related county and nearby cities. heterogeneous graph auto-encoder. given a node (i.e., area) in the constructed ahin, guided by its corresponding meta-path scheme (i.e., city level guided by p , county level guided by p , and state level guided by p ), to aggregate the information propagated from its neighborhood nodes, we propose a heterogeneous graph auto-encoder (gae) model to achieve this goal. the designed heterogeneous gae model consists of an encoder and a decoder: the encoder aims at encoding meta-path based propagation to a latent representation, and the decoder will reconstruct the topological information from the representation. encoder. we here exploit attentive mechanism [ , , ] to devise the encoder: it will first search the meta-path based neighbors n (v) for each node v, and then each node will attentively aggregate information from its neighbors. to learn the importance of the information from neighborhood nodes, we first present each relation type r ∈ r in the constructed ahin by r r ∈ r d a ×d a , where d a denotes the dimension of the attributed feature vector; and then the attentive weight β of node u (the neighbor of v) indicate the relevance of these two nodes measured in terms of the space r r , that is, where a v and a u are the attributed feature vectors attached to node v and u. we further normalize the weights across all the neighbors of v by applying softmax function: then, the neighbors' representations can be formulated as the linear combination: where the weight β r (v, u) indicates the information propagated from u to v in terms of relation r . finally, we aggregate v's representation a v and its neighbors' representations a n(v) by: decoder. the decoder is used to reconstruct the network topological structure. more specifically, based on the latent representations generated from the encoder, the decoder is trained to predict whether there is a link between two nodes in the constructed ahin. to this end, leveraging latent representations learned from the heterogeneous gae, the risk index of a given area is calculated as: where γ i is the adjustable parameter that can be specified by human experts, indicating the importance of i-th element in a v (e.g., the number of confirmed cases, population density, age distribution, mobility measure, etc.) in the rapidly changing situation. because of the critical need to act promptly and deliberately in this rapidly changing situation, we have deployed our developed system α-satellite (i.e., an ai-driven system to automatically provide hierarchical community-level risk assessment related to covid- ) for public test. given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable people to select appropriate actions for protection while minimizing disruptions to daily life. the link of the system is: https://covid- .yes-lab.org, which also include the brief description and disclaimer of the system as well as the following benchmark datasets. data collection and preprocessing. 
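Before turning to the data collection and preprocessing details, the two modeling components described above can be made concrete with short sketches. The first is a minimal PyTorch sketch of the conditional generator/discriminator pair conditioned on a_1, a_2 and o; all layer sizes, the latent dimension, and the 300-dimensional doc2vec-style post vectors are illustrative assumptions rather than the authors' settings, and the generator update uses the common non-saturating form of the minimax objective.

```python
# Minimal sketch of the cGAN described above: MLP generator and discriminator
# conditioned on a1, a2 and the GPS coordinate o. Sizes are assumptions.
import torch
import torch.nn as nn

Z_DIM, COND_DIM, POST_DIM = 64, 10, 300   # noise, condition (a1 (+) a2 (+) o), post vector

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(Z_DIM + COND_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, POST_DIM))
    def forward(self, z, cond):                      # G(z | a1, a2, o)
        return self.net(torch.cat([z, cond], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(POST_DIM + COND_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, 1), nn.Sigmoid())
    def forward(self, x, cond):                      # D(x | a1, a2, o)
        return self.net(torch.cat([x, cond], dim=1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_posts, cond):
    """One adversarial update: D learns to label real/fake correctly,
    G is updated with the non-saturating form of the minimax objective."""
    b = real_posts.size(0)
    fake = G(torch.randn(b, Z_DIM), cond)
    opt_d.zero_grad()
    loss_d = bce(D(real_posts, cond), torch.ones(b, 1)) + \
             bce(D(fake.detach(), cond), torch.zeros(b, 1))
    loss_d.backward(); opt_d.step()
    opt_g.zero_grad()
    loss_g = bce(D(fake, cond), torch.ones(b, 1))
    loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

print(train_step(torch.randn(8, POST_DIM), torch.randn(8, COND_DIM)))
```

The second sketch covers the attentive neighbor aggregation and the risk index. Because the display equations for the unnormalized attention weight, the final aggregation step, and the risk index did not survive in the text above, the bilinear form a_v^T R_r a_u, the simple averaging of a_v with its neighbors' combination, and the weighted-sum risk index below are reconstructions under stated assumptions, not the authors' exact formulas.

```python
# Sketch of the attentive aggregation and risk-index step. The bilinear
# relevance score, the averaging step and the squashing of the weighted sum
# are assumptions made for illustration.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def aggregate(a_v, neighbor_feats, R_r):
    """Attentively aggregate meta-path based neighbors of v under relation r."""
    scores = np.array([a_v @ R_r @ a_u for a_u in neighbor_feats])   # assumed bilinear relevance
    beta = softmax(scores)                                           # normalized attention weights
    a_neigh = (beta[:, None] * np.asarray(neighbor_feats)).sum(axis=0)
    return 0.5 * (a_v + a_neigh)        # simple average as the final aggregation (assumption)

def risk_index(h_v, gamma):
    """Risk index as an expert-weighted combination of latent elements, squashed to a bounded range."""
    return 1.0 / (1.0 + np.exp(-float(np.dot(gamma, h_v))))

rng = np.random.default_rng(0)
d = 10
h = aggregate(rng.random(d), [rng.random(d) for _ in range(3)], rng.random((d, d)))
print(round(risk_index(h, rng.random(d)), 3))
```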
we have developed a set of crawling and preprocessing tools to collect and parse the largescale and real-time pandemic related data from multiple sources, including disease related data from official public health organizations and digital media, demographic data, mobility data, and user generated data from social media (i.e., reddit). we have made our collected and proprocessed data available for public use through the above link. we describe each publicly accessible benchmark dataset (i.e., db -db ) in detail below. db : disease related dataset. according to simplemaps , the u.s. includes states, washington, d.c. and puerto rico as well as , counties and , cities. we have collected the up-to-date countybased coronavirus related data including the numbers of confirmed cases, new cases, deaths and the fatality rate, from official public health organizations (e.g., who, cdc, and county government websites) and digital media with real-time updates of covid- (e.g., point acres). by the date, we have collected these data from , counties and states (including washington, d.c. and puerto rico) on a daily basis from feb. , to date (i.e., march , ). db : demographic and mobility dataset. we parse the demographic data collected from the the united states census bureau (data updated on july , ) in a hierarchical manner: for each city, county or state in the u.s., the dataset includes its estimated population, population density (e.g., number of people per square mile), age and gender distributions. by the date, we make the demographic and mobility dataset available for public use including the information of estimated population, population density, and gps coordinates for , cities, , counties and states (including washington, d.c. and puerto rico). db : social media data from reddit. in this work, we initialize our efforts on social media data with the focus of public perception analysis on reddit, as it provides the platform for scientific discussion of dynamic policies, announcements, symptoms and events of covid- . in particular, we have collected and analyzed statebased subreddits (i.e., washington, d.c. and states in this section, we evaluate the practical utility of the developed system α-satellite for hierarchical community-level risk assessment related to covid- through a set of case studies. case study : real-time risk index of a given area. given a specific location (either user input or automatic positioning by google map), the developed system will automatically provide its related risk index (i.e., ranging from [ , ], the larger number indicates higher risk and vice versa) associated with the public perceptions (i.e., awareness) towards covid- in this area (i.e., ranging from [ , ], the larger number denotes more aware and vice versa), demographic density (i.e., the number of people per square mile in its related county), and traffic status (i.e., ranging from [ , ] , the larger number means more traffic and vice versa). figure .(a) shows an example: given the location of euclid ave, cleveland, oh , the risk index provided by the system was . (with public perception of . , demographic density of , , and traffic status of ) at : pm edt on march , . at the same time, the risk indexes and public perceptions of corresponding county (i.e., cuyahoga county with risk index of . and public perception of . ) and state (i.e., oh state with risk index of . and public perception of . 
) will also be shown in a hierarchical manner to enable people to select appropriate actions for protection while minimizing disruptions to daily life. case study : comparisons of risk indexes on different dates. in this study, given the same area, we examine how the generated risk indexes change over time. using the same location above, figure .(b) shows the comparison results on different dates at the time of : pm edt, from which we have the following observations: ) in general, its risk indexes increased over days from march , in this study, given the same time, we examine how the generated risk indexes change over areas. when a user inputs the areas he/she are interested in (e.g., grocery stores near me) in the search bar, the system will display the nearby grocery stores using google maps application programming interface (api) and automatically provide the associated indexes. for example, using the same time in the first study (i.e., : pm edt on march , ), figure shows the "grocery stores near me" (i.e., near the location of euclid ave, cleveland, oh ) and their related indexes. from figure , we can observe that the indexes of nearby areas might vary due to the factors of different public perceptions towards covid- and different traffic statuses in specific areas. as shown in the right part of figure , the system also provides related reddit posts to users. case study : comparisons of different counties and states. in this study, we compare the indexes of different counties and different states given the same time. using the time in the first study (i.e., : pm edt on march , ), figure .(a) shows an example of comparisons. more specifically, at county-level, using oh state as an example, we choose the counties with top five largest numbers of confirmed cases on march for comparisons: cuyahoga ( ), franklin ( ), hamilton ( ) , summit ( ) and lorain ( ) . figure. .(b) illustrates the risk indexes associated with multiple factors versus the numbers of confirmed cases in these counties. for the comparisons of different states, we also choose five states: two most severe states (new york (ny) with , confirmed cases and deaths, california (ca) with , confirmed cases and deaths), two medium severe states (oh with confirmed cases and deaths, virginia (va) with confirmed cases and deaths) and one least severe state (west virginia (wv) with confirmed cases and deaths). figure. .(c) shows the risk indexes versus the numbers of confirmed cases in these states, from which we can see that there is a positive correlation between the numbers of confirmed cases and the risk indexes. to track the emerging dynamics of covid- pandemic in the u.s., in this work, we propose to collect and model heterogeneous data from a variety of different sources, devise algorithms to use these data to train and update the models to estimate the spread of covid- and predict the risks at community levels, and thus help provide actionable information to users for community mitigation. in sum, leveraging the large-scale and real-time data generated from heterogeneous sources, we have developed the prototype of an aidriven system (named α-satellite) to help combat the deadly covid- pandemic. the developed system and generated benchmark datasets have made publicly accessible through our website. in the future work, we plan to continue our efforts to expand the data collection and enhance the system to help combat the fast evolving covid- pandemic. 
we will continue to release our generated data and updates of the system to facilitate researchers and practitioners on the research to help combat covid- pandemic, while assisting people to select appropriate actions to protect themselves at increased risk of covid- while minimize disruptions to daily life to the extent possible. natural language processing with python: analyzing text with the natural language toolkit cdc. . pandemic (h n virus are you at higher risk for severe illness? cdc. . how covid- spreads implementation of mitigation strategies for communities with local covid- transmission deep learning-based model for detecting novel coronavirus pneumonia on high-resolution computed tomography: a prospective study. medrxiv securedroid: enhancing security of machine learning-based detection against adversarial android malware attacks adversarial machine learning in malware detection: arms race between evasion attack and defense metapath-guided heterogeneous graph neural network for intent recommendation gotcha -sly malware! scorpion: a metagraph vec based malware detection system malicious sequential pattern mining for automatic malware detection incorporating non-local information into information extraction systems by gibbs sampling generative adversarial nets forecasting the wuhan coronavirus ( -ncov) epidemics using a simple (simplistic) model. medrxiv alphacyber: enhancing robustness of android malware detection system against adversarial attacks on heterogeneous graph based model hindroid: an intelligent android malware detection system based on structured heterogeneous information network artificial intelligence forecasting of covid- in china using twitter and web news mining to predict covid- outbreak distributed representations of sentences and documents enhancing robustness of deep neural networks against adversarial malware samples: principles, framework, and application to aics' challenge semi-supervised clustering in attributed heterogeneous information networks early transmissibility assessment of a novel coronavirus in wuhan conditional generative adversarial nets machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: covid- case study identification of covid- can be quicker through artificial intelligence framework using a mobile phone-based survey in the populations when cities/towns are under quarantine deep learning-based quantitative computed tomography model in predicting the severity of covid- : a retrospective study in patients an epidemiological forecast model and software assessing interventions on covid- epidemic in china deep learning enables accurate diagnosis of novel coronavirus (covid- ) with ct images. medrxiv pathsim: meta path-based top-k similarity search in heterogeneous information networks vital surveillances. . the epidemiological characteristics of an outbreak of novel coronavirus diseases (covid- )-china graph attention networks a deep learning algorithm using ct images to screen for corona virus disease kgat: knowledge graph attention network for recommendation who. . 
coronavirus disease (covid- deep learning system to screen coronavirus disease pneumonia prediction of survival for severe covid- patients with three clinical features: development of a machine learning-based prognostic model with clinical data in wuhan deepam: a heterogeneous deep learning framework for intelligent malware detection out-of-sample node representation learning for heterogeneous graph in real-time android malware detection a survey on malware detection using data mining techniques automatic malware categorization using cluster ensemble intelligent file scoring system for malware detection from the gray list cimds: adapting postprocessing techniques of associative classification for malware detection combining file content and file relations for cloud based malware detection imds: intelligent malware detection system an intelligent pe-malware detection system based on association mining shufang wu, and yonghong xiao. . host and infectivity prediction of wuhan novel coronavirus using deep learning algorithm. biorxiv key: cord- -mbww yyt authors: hayashi, teruaki; uehara, nao; hase, daisuke; ohsawa, yukio title: data requests and scenarios for data design of unobserved events in corona-related confusion using teeda date: - - journal: nan doi: nan sha: doc_id: cord_uid: mbww yyt due to the global violence of the novel coronavirus, various industries have been affected and the breakdown between systems has been apparent. to understand and overcome the phenomenon related to this unprecedented crisis caused by the coronavirus infectious disease (covid- ), the importance of data exchange and sharing across fields has gained social attention. in this study, we use the interactive platform called treasuring every encounter of data affairs (teeda) to externalize data requests from data users, which is a tool to exchange not only the information on data that can be provided but also the call for data, what data users want and for what purpose. further, we analyze the characteristics of missing data in the corona-related confusion stemming from both the data requests and the providable data obtained in the workshop. we also create three scenarios for the data design of unobserved events focusing on variables. due to the global violence of the novel coronavirus, various industries have been affected and the breakdown between systems has been apparent. to understand and overcome the phenomenon related to this unprecedented crisis caused by the coronavirus infectious disease (covid- ), the importance of data exchange and sharing across fields has gained social attention. in fact, johns hopkins university uses data from the us centers for disease control and prevention, world health organization, and chinese authorities to visualize the spread of covid- and disseminate information . in addition, local governments and companies, such as the tokyo metropolitan government, are making efforts to disclose data and technology using github . moreover, the covid- data exchange initiative , the pro bono effort launched against the covid- pandemic, has applied their respective expertise, experience, and network to create the largest community of private and public organizations in support of data exchange. however, the new issue arisen in the corona related co however, the new issue arising in the corona-related confusion is the discussion to determine what types of data are missing. 
apparently, cross-disciplinary data sharing and utilization are essential for understanding and controlling unknown phenomena. for this reason, the data published by many institutions are attracting attention, but the intention and background of such data acquisition are often unclear, to the extent that there is insufficient context to grasp the facts and make appropriate decisions. regarding covid- , silver said, "the number of reported covid- cases is not a very useful indicator of anything unless you also know something about how tests are being conducted" [ ] . he warns against looking at statistical data without understanding how and why the data were obtained. although many international organizations and companies publish some of their data, the data we want are kept fully closed. in other words, it is limited to unilateral information provision from data providers, and there has been almost no discussion about creating data of unobserved events and the methodologies for supporting it. in this study, we use the platform teeda (treasuring every encounter of data affairs) [ ] to externalize the information from data users. teeda is a tool that is specifically developed to exchange not only the information about data that can be provided but also the call for datawhat is wanted by data users, and for what purpose. using teeda, we collect data items (data requests and providable data) in the corona-related confusion in the workshop, discuss the characteristics of missing data, and create three scenarios for data design of unobserved events focusing on variables. the remainder of this paper is organized as follows. in section , we explain the methodology of teeda based on the descriptions of data requests and providable data, as well as demonstrate the functions of the platform. in section , we present the experimental details of this study. in section , we discuss the results obtained from our experiment. finally, we conclude the paper in section . with the development of data catalogs and portal sites in the data exchange ecosystem, such as data marketplaces, data users have more opportunities to learn about the publications of data holders or providers [ , ] . in the context of coronarelated confusion, open platforms for sharing data across crises and organizations, such as the humanitarian data exchange or the world bank , collect and publish many types of data from different domains. however, it is hard to discover information about what types of data the users want, and for what purpose, as this type of detail is not often sufficiently shared. therefore, data providers are unable to learn what types of data are required, and there is a risk that only those data that do not meet the user requests are provided on the platform. teeda facilitates communication and matching between data holders and users by capturing and presenting requests for data (call for data) that users desire in the data exchange ecosystem. the data holders register the information about the data (metadata), and the data users provide information about the purpose and structure of the data in the form of data requests in teeda. the collected data items (the data request and providable data) are processed by a matching algorithm and visualized to facilitate data exchange between data holders and users. each data request has three description items: data name, variables, and the purpose of data use. the data name is an item to express the data that the users want. 
examples include "the rate of self-restraint due to covid- " and "behavioral history of those infected with covid- ." the second description item, variables, is a set of logical data attributes [ , ] . for example, in the case of meteorological data, "area name," "maximum temperature," "minimum temperature," "average temperature," or "date" are the variables. the third description item captures an expression of the purpose of data use. in this study, this item will be useful for understanding what types of data and variables are needed and for what purpose in regard to corona-related confusion. as an existing part of teeda, the records of providable data already have some description items in an element known as the "data jacket" (dj). a dj is a framework for summarizing data information while keeping the data itself confidential [ ] . the summary information of the data includes explanatory text about the data. a dj enables an understanding of the types of data that exist on different platforms and the information included in the data, even if the contents of the data cannot be made public. in the teeda format for providable data, we used data name, data outline, variables, types, formats, and sharing conditions of data. the sharing condition is the list of terms and conditions imposed by data providers to exchange data with, or provide data to, other parties. we used variables to examine the relationships and matching possibilities of data requests and providable data, based on the assumption that the completeness of the variables is the condition for data users. therefore, we can represent the relationships between data requests and providable data in the network format, where the data items are the nodes and the links are established when the data items have common variables. figure shows the teeda interface and the network of data items input in the experiment (explained in detail in the next section). the green nodes represent the data requests, and the orange nodes are the providable data. to encourage data users and providers to understand the relationships between their own data items as well as others', links are also established between data requests and between providable data. teeda runs on a web browser, and the input data items are reflected on the screens of other users in realtime. teeda will automatically highlight neighboring nodes when browsing the data item details of a clicked node. in addition, there is an interface between the dashboard and the toolbox shown on the right side of fig. , where the input data items and variables are displayed. the network layout can be changed manually by dragging and dropping. in this study, we focus on the function of call for data of teeda and externalize the needs for data in corona-related confusion as data requests. in addition, we analyze the degree to which data requests are satisfied by comparing with the data that have been released during corona-related confusion as providable data. subsequently, we propose and discuss three scenarios of data consisting of what types of variables should be newly designed and acquired. the aim of the experiment was to understand the characteristics of data requests and providable data in the corona-related confusion and create scenarios for new data design of unobserved events focusing on variables. the experiment involved men and women (students and professionals) years and older. initially, they were taught how to use teeda for approximately min. 
subsequently, participants input the information on the data requests and the providable data about corona-related confusion on teeda for min via discussion with other participants. according to the specifications of the teeda platform, when data requests are entered, only the description items of 'data name' and 'variables' are required. the 'purpose of data use' description item is optional. all items are written in natural language, and there is no upper limit to the number of words entered. when entering providable data, the data name and variables are required, and the data outline is optional. as before, all items are written in natural language with no upper limit to the word count. there are nine recognized datatypes: "time series," "numerical value," "text," "table," "image," "graph," "movie," "sound," and "other,". teeda can deal with nine file formats: "csv," "txt," "rdb," "markup," "rdf," "weka," "shape," "pdf," and "other." users can select multiple checkboxes for these items because some data have multiple data types and are provided in several file formats. by contrast, users select one sharing condition with a radio button from seven predefined types: "generally shareable," "conditions/negotiations are required," "shareable within a limited range," "non-shareable," "shareable by purchased," "not yet decided," and "other conditions." note that the providable data externalized in the workshop did not necessarily cover all the available data in coronarelated confusion. since the experiment was conducted on june th, , it should be noted that the results obtained and the attributes of some data may have changed after this paper was published. in addition, the input information about data in the workshop was written in english and japanese, and in the analysis, we unified them into english. sixty-one data items-divided approximately evenly into data requests and providable data entries-were input during the workshop. tables i and ii show examples of data requests and providable data, respectively. first, we discuss what types of data and for what purposes they are required in corona-related confusion. in the data requests, a lot of data are necessary for understanding the measures to prevent infection by social distancing or quarantine, such as "behavioral history of those infected with covid- " and "measures against covid- implemented at stores." in addition, there are many lifestyle-related data including data for managing anxiety such as "coping with anxiety during covid- pandemic by age, sex, and prefecture," "changes in the lifestyle caused by covid- ," or "people's preference changed after the covid- pandemic." by contrast, to recognize the facts, there were the need for new statistical data for supplementing the published data, such as "needs of countries in the world during covid- pandemic" and "number of tests in countries around the world." typical purposes of these data were "to analyze the situations of different countries because it is hard to compare with current published data." most of the providable data were statistical data on the attributes of infected persons, such as "number of covid- cases by country," or "number of positive cases in tokyo metropolis (by city)." these data were mainly published by governments and international institutions. 
in addition to statistical data, there were publicly available data for academic purposes, such as "image datasets for covid- related physicians", as well as survey data provided by the investigation company, such as "a survey on coping with anxiety during covid- pandemic" related to staying home or working from home. there was hoarding and a toilet paper shortage. we must clarify what products were really needed and lacking in practice. figure (a) shows a comparison of the providable data collected using teeda in the past and the data in coronarelated confusion by sharing conditions. we used cases whose sharing conditions are described in the past providable data. consequently, the proportion of shareable data provided in corona-related confusion was about %, whereas the ratio before corona was only %. in other words, a large amount of 'generally shareable' data are externalized as being relevant to corona-related confusion. it is known that the ratio of generally shareable data in the data exchange platform is about % [ ] , and institutions and companies may tend to be more open with their data related to problems with high public interest, such as corona-related confusion. by contrast, the comparisons of data types and formats are shown in fig. (b) and (c). note that one piece of data can have multiple types and formats. the ratios of data types are almost the same for "time series," "numerical value," "text," "table," and "image," but it can be seen that the proportion of the "graph" under the corona-related confusion is significantly larger than that before corona. this is because the data related to this topic are often provided in a graph format so that even a person who is not a specialist in the data can read it and understand the trend and the situation of the number of infected persons. in fact, of the data that have "time series" also have the data type "graph." in addition, tabular data are also an excellent format for reading and comparing as well as the graph data, and many data on the number of infected persons by prefectures include the type "table." by contrast, regarding data formats, there are many data in csv and rdb formats, which easily handle time-series data in the tabular form, and txt of the language corpus in the data before corona. however, for the data since corona, "other" format is the most frequent with cases. all of the "other" data are the image format (such as jpeg or png), including "image" or "graph" for the data types, also peculiar to the corona situation. image formats are good at visually conveying information to the public. by contrast, it seems that too much emphasis is placed on just communicating the information because the data that allow secondary use, such as csv, are seen less often since the pandemic began. the data provided in pdf, which is human-friendly but has poor machine readability, also exists in a certain proportion under the corona situation, and it is required to provide data in a format that makes it easy for secondary use. table iii presents the details of the input data items. the types of variables in the providable data were , which is slightly larger than those of data requests. in addition, the average number of variables is larger at . in providable data than that of data requests at . , which varies from to . in contrast, the number of data that the data users want to obtain is as large as , but both the average number and the types of variables are less than those of providable data. 
these results suggest that data users may not need the data composed of many variables and do not require variables as diverse as the providable data. this is an important point when considering data design scenarios, and will be discussed in detail in the next subsection. as for the frequency of variables, of types of variables appear only once in the data requests and of types of providable data appeared once. in other words, the frequency of appearance of most variables is approximately once in both data requests and providable data. in a previous study, the frequency distributions of the variables of both data requests and providable data show power distributions [ ] . although the number of samples in this experiment is small, it is considered that both data requests and providable data are composed of a variety of variables with low frequency. next, we compared the details of the types of variables in the data items that appeared in corona-related confusion. figure shows the top- variables of (a) all data items, (b) data requests, and (c) providable data. the variable "date" appeared the most; while "number", "prefecture name", and "area name" of the patients consistently occupied the top ranks. when discussing the variables, it is debatable whether to use well-defined schemata or natural language concepts. studies on ontology matching [ , ] or ontology-based data access [ , ] have defined schemata for heterogeneous data integration. because corona-related confusion is an unprecedented crisis, the types of data that are providable or needed remain unclear. to allow diverse data with a variety of variables to understand and make decisions during the crisis, although "address" and "location" are almost synonymous, in this study, we did not unify the notation fluctuation of variables. as we explained, most of the providable data were statistical data concerning the attributes of infected persons (patient's place of residence, city name, the degree of seriousness) along with other variables, such as the number of polymerase chain reaction (pcr) tests or the event names, which are orderable in a time series by the inclusion of the variable "date." by contrast, some data such as "image datasets for covid- related physicians" and "a survey on coping with anxiety during covid- pandemic" do not have "date", and are unique compared with other providable data. in the data requests, "area name" and "address" whose granularities are higher than "prefecture name" or "city name" appeared frequently, which are not included in the statistical data in the providable data. these variables are included in the data such as "measures against covid- implemented at stores" and "number of cases by hospital in japan", and the reasons why these data were required are "to use them as best practices." the rate of infection to the number of healthcare workers is investigated to identify the risk of overwhelming hospitals," which were desired for grasping the current situation of the corona and taking measures. by contrast, not only the high granularity variables such as "address" or "area name" but also it is interesting that there is a large-meshed variable such as "country name." as described above, it is a central variable for accurately understanding the global situation of covid- rather than individual decision making. furthermore, looking at the breakdown of the types of variables, there were only types of variables common between data requests and providable data. 
this result means that the providable data do not contain enough variables for the data that users want to obtain. it can be said that there is a big mismatch here. the results in the last subsection suggest that although the data provided under the corona-related confusion are varied in the types of variables, there are relatively few data that satisfy the data requests, leading to a mismatch. what types of data need to be newly designed and obtained in corona-related confusion? there are various purposes for using data in data requests, and we categorized them into the following three types: l phenomenon understanding: for verifying the facts that have affected society, such as business or medical fields l individual decision-making: to obtain evidence for making decisions in one's life, such as going out or staying home l organizational decision-making: for learning the social demands and changes in the post-corona society and formulating business and organizational guidelines table iv shows the number of categorized data requests with examples. we analyzed the types of variables in each category and discussed possible scenarios for data acquisition. fifteen data requests are categorized as "phenomenon understanding," which was the most numerous compared with the other two categories. the variables "country names ( times)," "area names ( times)," and "date ( times)" appeared frequently, and the requests contained both the detailed and global variables lacking in the data provided by local governments and institutions. in particular, many variables for understanding what types of needs from which ages in coronarelated confusion have received much attention. in understanding the phenomena, it is better to acquire the data with missing variables included in the data requests, using "date," "area name," and "address." in addition, although we understood that many companies were affected by the covid- pandemic, there are few data on the types of companies in which industries were affected. it is considered important to collect data to understand the kinds of impacts with the variables "area name," "type of business." there are seven data requests related to "individual decision-making," least numerous compared with the other two categories. among the variables, "date ( times)" is the most common, "address (twice)" is the second most common, and all others appeared only once, with the ratio of variables that appear only once being the highest among the three categories. variables such as "recommended frequency of going out," "acceptance of covid- patients," "item people touch," and "number of people touching it per day" are the unique events that have not been observed yet. it is difficult to extract common interests because of the diversity of needs in individual decision-making, but there seems to be a need to obtain data that are deeply related to our lives, such as the data on the number of contacts or the infected information in the area where we live. in the decision-making of the organization, "sex ( times)," "age ( times)," "area name ( times)," and "date ( times)" appeared frequently. based on these common variables, there are the data including variables to try to change their business policies such as "increased activity due to self-restraint," "whether to continue it or not after self-restraint life," and the variables to create new businesses such as "type of anxiety" or "consultation content." 
for companies and institutions to adapt to social changes in people's lives in the wake of the corona pandemic, we consider there to be a need for extended statistical questionnaire data beyond that, which has not been widely provided yet. although there were differences in the types of variables required for each category, it can be said that "date" is the central variable in the data design of unobserved events in all categories. in particular, "date" plays an important role not only as statistical data but also as it captures people's interests and business conditions that change from moment to moment in corona-related confusion. moreover, it is notable that the number of variables in each data request are few and it suggests that users want the data with just those variables specialized to their own interests, which is hardly included in the providable data. in addition, from the analysis of providable data, it can be said that data formats that are not only human-friendly, such as pdf or images, but also machine-readable and easy for secondary use, such as csv or json, are strongly required. in this study, to discuss the data design of unobserved events in corona-related confusion, we used teeda to externalize the information about data items from data users and data providers and analyzed their characteristics. via experiments, we found different structures across data requests and providable data and the large mismatch between them. based on the discussion, we created three possible scenarios for data design, focusing on variables in data requests divided into three categories: phenomenon understanding, individual decision-making, and organizational decision-making. in our future work, we will obtain data according to these scenarios and verify via demonstration experiments whether the results meet the needs of data users in the society. in this study, we obtained data requests and providable data from the participants in the form of the workshop, but it was not possible to cover all the available data provided in the actual corona pandemic. there are more data and various variables in the world. to find out more information about providable data, it is important to collect them differently and discuss their characteristics. moreover, from the viewpoint of data design, the variables contained in data other than corona-related data are also considered useful. variable quest (vq) is an algorithm with the knowledge base for estimating sets of variables of unknown events from data outlines [ ] . using the knowledge base of vq for external information about data and variables, it may be possible to construct data for unobserved events. in addition, since major file exchange formats such as json were not yet supported by teeda, these specification changes are possible considerations for the future. 
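As a closing illustration of the matching mechanism underlying the analysis above (data items are nodes, and a link is established whenever two items share at least one variable), a minimal sketch follows. The example items and variables are hypothetical stand-ins for the workshop inputs, and the items are held as plain dictionaries, which would map directly onto a JSON exchange format of the kind mentioned as future work.

```python
# Sketch of the variable-sharing item network: requests and providable data as
# nodes, edges whenever two items have at least one variable in common.
import itertools
import networkx as nx

items = {
    "req: behavioral history of infected persons": {"kind": "request",
        "variables": {"date", "area name", "age"}},
    "prov: number of positive cases by city":      {"kind": "providable",
        "variables": {"date", "city name", "number of patients"}},
    "prov: survey on coping with anxiety":         {"kind": "providable",
        "variables": {"age", "sex", "prefecture name"}},
}

g = nx.Graph()
for name, attrs in items.items():
    g.add_node(name, **attrs)
for (n1, attrs1), (n2, attrs2) in itertools.combinations(items.items(), 2):
    shared = attrs1["variables"] & attrs2["variables"]
    if shared:
        g.add_edge(n1, n2, shared=sorted(shared))

for u, v, data in g.edges(data=True):
    print(u, "<->", v, "via", data["shared"])
```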
coronavirus case counts are meaningless teeda: an interactive platform for matching data providers and users in the data marketplace data marketplaces: an emerging species a survey on big data market: pricing, trading and protection the basics of social research the econometrics of data combination data jackets for synthesizing values in the market of data understanding the structural characteristics of data platforms using metadata and a network approach ontology integration for linked data ontology matching linking data to ontologies satisfaction and implication of integrity constraints in ontology-based data access variable quest: network visualization of variable labels unifying co-occurrence graphs acknowledgment this study was supported by jsps kakenhi (jp h ), the "startup research program for post-corona society" of academic strategy office, school of engineering, the university of tokyo, and the artificial intelligence research promotion foundation. we wish to thank editage (www.editage.jp) for providing english language editing. key: cord- -g dsnhmm authors: wescoat, ethan; mears, laine; goodnough, josh; sims, john title: frequency energy analysis in detecting rolling bearing faults date: - - journal: procedia manufacturing doi: . /j.promfg. . . sha: doc_id: cord_uid: g dsnhmm abstract component failure analysis is sometimes difficult to directly detect due to the complexity of an operating system configuration. raw time series data is not enough in some cases to understand the type of fault or how it is progressing. the conversion of data from the time domain to the frequency domain assists researchers in making a more discernible difference for detecting failures, but depending on the manufacturing equipment type and complexity, there is still a possibility for inaccurate results. this research explores a method of classifying rolling bearing faults utilizing the total energy gathered from the power spectral density (psd) of a fast fourier transform (fft). using a spectrogram over an entire process cycle, the psd is swept through time and the total energy is computed and plotted over the periodic machine cycle. comparing with a baseline set of data, classification patterns emerge, giving an indication of the type of fault, when a fault begins and how the fault progresses. there is a separable difference in each type of fault and a measurable change in the distribution of accumulated damage over time. a roller bearing is used as a validating component, due to the known types of faults and their classifications. traditional methods are used for comparison and the method verified using experimental and industrial applications. future application is justified for more complex and not so well-understood systems. as manufacturing products become more complex, the machinery that makes the products has increased in complexity. with an increase in machinery and equipment, the maintenance required takes more time and costs more to a company. this in turn makes it harder to identify failures on equipment before they occur. untimely equipment failure has one of the biggest impacts on the operating costs for a manufacturer with costs in some cases exceeding k€ [ ] . all aspects of the manufacturing line are susceptible to failure, whether human or machine. it only takes one component to fail for a stoppage to occur in production. to avoid this, manufacturers and maintenance staff are developing and using analysis tools and methods to predict when component and equipment failures will occur. 
by knowing when failures will occur, maintenance is scheduled to repair or tend to equipment before untimely failure occurs. different tools of analysis have been proposed as cost effective for manufacturing maintenance and effective in determining the remaining lifetime of equipment components. for these different tools, the forms of analysis require different types of data to inform the user on subsequent needs. as an example, data scientists and researchers have used data features such as mean and variance of raw time series data to justify equipment analysis and maintenance scheduling. these methods perform well with an isolated component, but when incorporated into manufacturing equipment with multiple different components, the signal is susceptible to misinterpretation and outside interference from other components. system level predictive maintenance and modeling is extending beyond the capability of such conventional and customary analyses. time series data is convertible to frequency data using a variety of different types of frequency transforms, notably the fourier transform and wavelet transform. from the frequency spectrum, different features can be extracted such as power and phase with respect to a range of frequencies. for components in a system, there are corresponding nominal characteristic frequencies [ ] . when there are deviations from these expected frequencies, then a possible defect to the component may have been identified. however, there is still a high degree of variation in the frequency spectrum associated with observing an fft output, in terms of deviation from the expected frequencies and power values for similar data. wavelet transforms can offer results that incorporate both steady-state and transient information, but there are inherent restrictions in application of this method in systems with bursts in data or non-stationarity [ ] . however, the authors have proposed a new method of interpreting frequency domain results using the energy content of the signal over time. two different applications are used for demonstration. the first is a bearing test station for observing the energy changes with different types of induced faults. the second application is a vertical vehicle lift at an automotive manufacturing facility. the authors' proposed method involves calculating the energy content from the psd over the course of a defined test process, as gathered from a spectrogram at each time step of a discrete test process.
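The energy-versus-time computation just described can be sketched as follows. The synthetic signal, sampling rate, window length, and the use of scipy's spectrogram with density scaling followed by integration over frequency are one concrete realization under assumption, not the authors' exact implementation; the baseline comparison is likewise illustrative.

```python
# Sketch: spectrogram of a vibration signal, each time slice treated as a PSD
# and integrated over frequency to give signal energy per step of the cycle.
# Signal and parameters are synthetic placeholders.
import numpy as np
from scipy.signal import spectrogram

fs = 20_000                                    # sample rate (Hz), assumed
t = np.arange(0, 2.0, 1 / fs)
x = np.sin(2 * np.pi * 157 * t) + 0.05 * np.random.randn(t.size)   # stand-in vibration signal

f, seg_times, Sxx = spectrogram(x, fs=fs, nperseg=2048, noverlap=1024,
                                scaling="density")     # Sxx: PSD per time segment
energy = Sxx.sum(axis=0) * (f[1] - f[0])       # integrate PSD over frequency at each time step

baseline = np.median(energy)                   # e.g. taken from a known-good (baseline) run
print("energy ratio to baseline per segment:", np.round(energy / baseline, 2)[:5])
```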
for the experimental validation, baseline data and fault data are generated at a component test stand. for the vertical lift applications, a system with one good bearing and one faulty bearing was chosen. the energy content over time is assessed to evaluate changes and their associated patterns. for further validation, a distribution of points is created to help visualize potential changes to the data for the bearing station. in the remainder of this paper, the literature review first covers a brief overview of predictive maintenance and analysis methods relating to vibration and ffts. the next section covers the fft and different types of bearing failures. the methodology is described as an overview in terms of how data is gathered and a more in-depth description of how the analysis is performed. the results cover the initial findings employing the method, and conclusions over the larger impacts as well as future work are presented. as manufacturing equipment complexity increases, maintenance teams move from corrective to predictive maintenance. figure below shows a representation of the timeline of the different types of maintenance and when they were introduced [ ] . predictive maintenance changes the state of maintenance from anticipation of faults with preventive maintenance to knowing when machine faults will occur. for predictive maintenance to occur for machines, sources of data are gathered from equipment and test stands. from this data, different data features are created to classify potential faults. in , j.t. renwick et al. wrote on how vibration data is a proven technique for predictive maintenance due to ease of implementation and cost [ ] . in , m. lebold created a review of different data features in vibration data for classification such as kurtosis [ ] . these terms also appear in other vibration literature. liu et al. made a list of data features that he used in his analysis and noted that as the amount features increased so did the analysis [ ] . further reviews, such as those written by jardine et al. and carden et al., deal with the application of these features and techniques to machinery and structural engineering applications, respectively [ ] , [ ] . however, there is a tipping point referenced when analysis became too computational costly for the value of identifying additional features [ ] . focusing on bearings, d. dyer et al. proposed a detection method based on kurtosis in in select frequency ranges as a simple and cheap technique moving away from traditional trend analysis [ ] . in and , p.d. mcfadden et al. proposed two models for detecting defects in a rolling element bearing [ ] , [ ] . in mcfadden's analysis, envelope analyses of various frequency spectra are observed and tested with determining identification of fault based on bearing load, geometry and speed. ian howard wrote an extensive review in in rolling element bearing vibration, highlighting envelope analysis, fourier analysis and wavelet transform. he provided several use cases for frequency techniques as well as highlighting the use of a spectrogram for changing speed applications. he also makes mention of using envelope analysis and recognizes the difficulty in choosing the correct frequency bands for envelope analysis [ ] . , wei zhou et al. reviewed the different methods for monitoring bearings in electric machines, determining vibration and current analysis as the more popular methods [ ] . in , blodt et al. 
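The envelope-analysis work cited above centers its frequency bands on the characteristic defect frequencies that follow from bearing geometry and shaft speed. The sketch below uses the standard textbook expressions for these frequencies (FTF, BPFO, BPFI, BSF), which are not taken from this paper, and the example geometry is a placeholder.

```python
# Standard bearing defect frequencies from geometry and shaft speed; the
# example bearing dimensions below are placeholders, not data from this work.
import math

def bearing_defect_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_angle_deg=0.0):
    r = (ball_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    return {
        "FTF":  0.5 * shaft_hz * (1 - r),                          # cage (fundamental train)
        "BPFO": 0.5 * n_balls * shaft_hz * (1 - r),                # outer-race defect
        "BPFI": 0.5 * n_balls * shaft_hz * (1 + r),                # inner-race defect
        "BSF":  0.5 * (pitch_d / ball_d) * shaft_hz * (1 - r**2),  # rolling-element (ball spin)
    }

print({k: round(v, 1) for k, v in
       bearing_defect_frequencies(shaft_hz=30, n_balls=9, ball_d=7.94, pitch_d=39.04).items()})
```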
using current analysis, developed a model to determine bearing damage based on magnetomotive force and observing the psd [ ] . in and , lau et al. and chen et al. used wavelet package transform on current analysis and vibration data, respectively [ ] , [ ] . for both papers, failure and predictive models are created using the wavelet transform for their respective data sources. as machinery complexity increases, the need to filter out noise increases [ ] . if there is too much outside interference in the analysis, the interference can disrupt the analysis by either indicating a fault when it has not occurred or a "hide" a fault until it is too late. taking machine bearings as an example, the bearings operate under several known frequencies' dependent on process factors: size, speed of rotation and rolling element dimensions. researchers use a frequency transform for better visualization of bearing frequencies and to note if any differentiations in the data start to happen. neural networks and naïve bayes classifiers are a common form of pattern recognition utilized across many different industries [ ] , [ ] . in , liu et al. proposed a neural network for classing failure features in bearings, with a success rate of % versus other know techniques [ ] . in , samanta et al. used artificial neural networks and genetic algorithms to detect bearing failures. the classification is based on data features, such as kurtosis and skewness in the time domain [ ] . while the algorithm worked well for identifying single failures, the authors ran into problems with regards to multiple failure classification. in , prieto et al. presented a neural network classification based on manifold learning techniques versus statistical time techniques [ ] . prieto was able to ascertain his classification within % and validate his methodology against statistical time techniques. bayesian classifiers are commonly found in feature classification and fault detection. in , mehta et al. developed a condition-based maintenance (cbm) system architecture that investigated spindle damage using bayesian classification and sensor fusion [ ] . from the multiple sources of sensor data, they were able to identify the rising trends of impending failure in spindles beyond single-source classification methods. in as well, sharma et al. used sound signal data to compare a naïve bayes and a bayes net classifier on the detection of a faults within a roller bearing. the results were both able to accurately determine the fault and showed comparison of strength in early or late detection of fault. in , islam et al. used a reliable bearing fault diagnosis involving a combination of bayesian and multi-class support vector machines for bearing fault classification. this study incorporated the classification of frequency events in addition to time evets [ ] . in each of the following papers, one of the main distinctions is the training set of data to classify the results and the separation of data for this training. neural networks and naïve bayesian classifiers are very susceptible to misclassification if the training data set is not carefully defined; this motivates offline learning, defined as not connected to the production line [ ] . setting up an experimental stand to generate training data is a means of both validation and training a model in an offline environment. conversely, online learning takes place directly on the production line; neural networks and genetic algorithms are examples of methods capable of self-classification. 
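the statistical-time features and bayesian classification referred to above can be sketched end to end: time-domain statistics such as kurtosis and skewness are extracted from each vibration window and fed to a gaussian naive bayes model. this is a minimal sketch; the feature set, training windows and labels are placeholders, and a real model would rely on the carefully defined offline training sets discussed in the text.

import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.naive_bayes import GaussianNB

def time_features(window):
    # statistics commonly cited in the vibration literature
    rms = np.sqrt(np.mean(window ** 2))
    return [np.mean(window), np.var(window), rms,
            np.max(np.abs(window)) / rms, kurtosis(window), skew(window)]

# placeholder training data: one window per process, 0 = healthy, 1 = faulty
windows = [np.random.randn(4096) for _ in range(100)]
labels = np.random.randint(0, 2, 100)
clf = GaussianNB().fit(np.array([time_features(w) for w in windows]), labels)
print(clf.predict([time_features(np.random.randn(4096))]))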
however, without parameters carefully set, then there is a chance for misclassification that would require a reset of the model [ ] . a positive to both offline and online methods however is the continuation of learning as more data is added to better build the prediction models [ ] . lau et al. made the distinction in his training of offline and online learning for their experiments. neural networks and bayesian classifiers are adaptable based on new training data. one common item to note from the following papers is the isolation of one type of bearing fault for each classification and model. a continued area of research is the propagation of a multi-class fault classifier, which is a common problem in bearings as wear propagates [ ] . as mentioned earlier, frequency ranges and frequency space are a key area in failure classification stretching back to [ ] . in this regard certain transforms are heavily cited in literature in failure classification, notably: fourier transform [ ] , hilbert huang transform [ ] and wavelet analysis [ ] . of these techniques, the most cited technique is the wavelet transform. many papers compare this technique directly with the fft. focusing on bearings, tse et al. compare the two techniques in analyzing bearing faults [ ] , finding that both are capable of classification of each bearing fault. yet, the fft required more data for the same accuracy faults. jingling chen et al. writes about the application of fault diagnosis in wind turbines using the wavelet transform, finding their method valid even in a heavily noisy environment during an application test [ ] . ziwei proposes a method of using wavelets based on the signal-to-noise ratio and mean square error for a roller bearing [ ] . the method shows high accuracy and efficiency, as well as accounting for a noisy environment in their model. while good classification results are common from all the papers, a couple of aspects arise; the first is the need for a specific definition of features common across each category. one paper raised this being a potential issue with machine operators, who may not appreciate the frequent interruption to their equipment to make adjustments to sensors or changing the sensors [ ] . a need for a more general case of identifying wavelet parameters was also called for. in a review of wavelet transforms by m. gomez et al., the resolution of the wavelet was raised as a concern [ ] , citing better visualization required higher computational resources. in a production setting, directing computational resources may be difficult when analyzing a system where operators or maintenance staff need to make real-time line decisions. an additional advantage of wavelets is the dissemination of time data of when frequency events occur. however, in response to that the authors use a spectrogram to register when time events occur and the corresponding frequencies as raised by howard [ ] . as a means to add further time information to an fft, mehla et al. introduced the concept of a windowed fft or a short time fourier transform to add time data [ ] . the windowed short time fourier transform combined aspects of both wavelet analysis and fourier transform analysis in achieving mixed results compared to either independently. returning to the fft, one key problem stems from its inability to accurately predict events over a variable speed or load [ ] , [ ] , an issue that wavelet analysis is capable of eliminating. 
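for comparison with the fft-based processing used in this work, a continuous wavelet transform of the same kind of vibration window can be obtained with the pywavelets package as sketched below; the morlet wavelet and the scale range are arbitrary illustrative choices and are not the settings of any of the cited studies.

import numpy as np
import pywt

fs = 25000
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(t.size)

# cwt returns a (scales x samples) coefficient matrix together with the
# pseudo-frequency associated with each scale
scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(signal, scales, 'morl', sampling_period=1 / fs)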
the fft is heavily dependent on the periodic samples of a system, which are in turn affected by the characteristic frequencies of the system. when changing the speed of the shaft in a rotodynamic system or the amount of load mid process, it could misinform the analysis. as a means of investigating this issue for manufacturing, the authors explored other applications of the fft. ffts are used in seismology [ ] - [ ] . one area the fft has been used is to detect frequency events related to earthquake timing versus other natural events. one paper was interested in understanding the separation of tensile fault and shear fault propagation before an earthquake [ ] . the approach they took involved calculating the energy related to the elastic wave triggered by one of the propagated faults. in doing so they can derive indicators and factors to fault displacement of tectonic plates. for medical devices, karim et al. were involved with studying the change in energy related to the psd for medical data classification when epilepsy occurs [ ] . this application looks primarily at ensuring quick data extraction and speeding up computational time. in the results of the paper, karim referenced that using energy spectral density for classification did speed up computational time and had higher classification results when coupled with support vector machines over conventional data features. based on research in seismology and biomedical devices, there is clear indication that the energy spectral density is a valid feature for data classification. one element that has not been broadly explored is the change in energy over time from the fft. the authors believe this to be an important aspect for manufacturing equipment in seeing how the energy of the signal changes over the course of a process with respect to the psd. another aspect is the clear use of the change in the distribution of energy points over time. haskell made mention to the change in distribution regarding spectral space for calculating the displacement but did not care for the mapping the probability density function line to measure faults [ ] . one of the first components tested are bearings due to their well-documented failure modes and the existing literature on bearing failures and how they initiate. companies such as skf, timken and barden have all released documents detailing the many different types of failures and defects for bearings [ ] - [ ] . the common one that each document mentions is normal fatigue life or wear, considered unavoidable and once detected will progressively increase until the bearing failure is complete. another common failure mode is spalling, a stress-induced delamination defect that will spread around the entire bearing surface once initiated. however, what is of more interest for early detection are the causes of premature failure and how they present in the data domain. in the document from barden [ ] , the primary causes of early failure are the load on the bearing, the environmental conditions and geometrical bearing defects. excessive and reverse loading can also cause rolling elements and rings to deform. some environmental defects are overheating, lubrication failure and corrosion. each of these defects can cause lubrication breakdown, which in turn can cause excessive wear on the bearings. geometrical failures involve tight and loose fits. either of these defects could cause excessive heat and wear leading to bearing deformation and destruction. 
figure contains some of the common causes of bearing failures and faults. figure a comes from reverse loading of the bearing ball, when the ball is acting in the opposite direction to the one it is meant to [ ] . figure b shows the effect of corrosion on the bearing due to improper sealing or maintenance [ ] . figure c is caused by a loose fit of the bearing when affixed in a pillow block [ ] . figure d shows the effects of inadequate lubrication on a bearing [ ] . most of these failures come from the application or from improper maintenance. the cause of failure is usually identified from the location, type and size of the defect on the bearing. the geometry of the application plays a particular role in identifying the failure, as it is typically a problem of the system that causes the bearing failure to appear. in the skf document [ ] , other bearing failures mentioned include smearing and, more explicitly, brinelling indentations; these damage the rings of the bearings and can in turn lead to spalling. while the list of bearing defects and failures goes on, it is important to be able to classify each major type of bearing defect in order to identify and fix the issue causing damage. in the timken document [ ] , the authors include the previously listed defects as well as defects that can damage the roller cage, which can lead to ball damage and deformation. figure contains a representation of a bearing with the labeling of the parts referenced in equations through . with reference to the bearing frequencies mentioned earlier, equations - describe how to calculate the bearing characteristic frequencies. the irf is the inner race frequency, the orf is the outer race frequency and rf is the roller frequency. nb is the number of rolling elements, and s is the rotational speed. bd is the ball diameter, pd is the pitch diameter, and φ is the contact angle. bearing defect data is created at a bearing station comprising an electric motor, a steel shaft, a linear table and a pillow block bearing assembly (np- ), as shown in figure . the bearing in the motor next to the shaft is incrementally damaged on purpose to gather failure data from different defects. for the tests, the outer ring of the bearing is held fixed while the inner ring rotates with the shaft. the shaft connects the motor to the pillow block to introduce an "application". the linear table induces misalignment, allowing for loading on the bearing. this loading causes the bearing vibration to increase, making it easier to identify the defect frequencies. bearings were chosen as the validating component because of how widely they are understood and applied in manufacturing equipment. the bearings used are zz bearings, commonly found in roller skates, fidget spinners, fans and small hand tools. they were chosen for their easy availability and the ability to use them in bulk for the application. figure shows the type of damage for an inner ring defect in the initial state. different damage tests are designed for each bearing to isolate the onset of failure for each test and to then see how the damage spreads to other components or increases along the individual components. different bearing faults were tested at varying degrees of misalignment. the minimum misalignment was degree, while the maximum misalignment was assessed just at -degree.
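the characteristic-frequency equations themselves did not survive extraction, so the sketch below gives the standard textbook expressions for the inner-race, outer-race, roller and cage frequencies in terms of the quantities defined above (nb, s, bd, pd and the contact angle); these commonly published forms may differ in detail or naming from the authors' exact equations.

import math

def bearing_frequencies(nb, s, bd, pd, contact_angle_deg=0.0):
    # nb: number of rolling elements, s: shaft speed (hz), bd: ball
    # diameter, pd: pitch diameter, contact angle in degrees
    ratio = (bd / pd) * math.cos(math.radians(contact_angle_deg))
    return {
        'inner_race': (nb / 2.0) * s * (1.0 + ratio),
        'outer_race': (nb / 2.0) * s * (1.0 - ratio),
        'roller': (pd / (2.0 * bd)) * s * (1.0 - ratio ** 2),
        'cage': (s / 2.0) * (1.0 - ratio),
    }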
typically, the maximum misalignment for a bearing is degree. equation shows how misalignment was calculated regarding the z bearing. Δf refers to the allowable misalignment. Δa refers to the axial play of the bearing. dp refers to the pitch diameter. the damage and failures were induced to replicate the progression of failure leading to a destroyed bearing found in an automotive manufacturing facility (shown in figure ). the zz is not the exact bearing used in the manufacturing plant, however the experiment designed involves the same type of bearing and application use. what is of interest to the authors is primarily the failure signal. there were three different bearings tested. one bearing was a default zz bearing tested from the original motor. the second bearing tested was a bearing with an induced inner race defect. the third bearing tested was a bearing with an induced outer race defect introduced. for the inner race defect bearing, the bearing is destroyed to the final stage of failure. the final stage is meant to simulate a fully damaged bearing as seen in figure . table shows the different tests and bearings described in the data gathering phase. the defects were caused by a dremel tool with a fine tip. this bit was used to "flake" and "roughen" the surface of the bearing to simulate each defect. this additional damage affects the possibility of isolating failure signals to faults. the smallest defect on the race surface was mm x mm and . mm deep. for tests - and - , this was the initial size for each defect on the damaged bearings. for tests - , the damage grew to x x . mm as this was the next size of defect on the bearing. the amount of defect on the bearing also exponentially increased. tests - were of the maximum defect damage state, again exponentially increasing with the entire surface marked up with at least an initial defect size of x x . mm. this mimicking of exponential destruction is similar as well to the increase of destruction found in most bearings according to literature. for testing and making defects on the bearings, the shape of each defect is circular. this was to ensure precision of making the defect on the bearing naturally versus a square defect area. for the experimental data, since a ball bearing is used, the defect begins in the center of the race as this is where the rolling element makes most contact with the inner ring. individual "process" conditions are run for each test. each test is run on the same electric motor. the motor is given time to warmup of an initial seconds. this gives the signal time to stabilize, then data collection can begin for the process. a process constitutes turning on the motor, leaving it on to gather at least seconds of data and then turning the motor off. this is repeated until the processes of the data are collected. the sensor and data capture software come from ifm efector inc. the sensor is an accelerometer sensor that samples at khz. at least mb worth of data are gathered following this method per process. this is done to simulate a lift process that is seen in the second application of this method. for the second application, vibration data is collected using an accelerometer from a vertical lift application at bmw. there are two motors in each lift application. the lift is used to move cars between the different levels of the manufacturing facility. the bed of the lift is connected to a belt that is then attached to a bearing and motor. 
only one motor and bearing are coupled at any time to the belt that moves the lift. the other motor is held in reserve if maintenance must be scheduled on the running motor. the idle motor can quickly slot in and lift operations can continue as normal until the former motor is repaired. the data collection system is like the one used in the experimental data collection. the same accelerometer and data acquisition system is used. the same sampling rate is used as is the same overall amount of data collected as in the experimental data setup. two cases of data were collected. one was from a lift with a known bearing fault. the other case of data was collected from lift that was considered "healthy". each data test was saved into one file of five processes. for the analysis, each file was then split into five separate files, one for each process. each file is made up of the raw vibration data. from this vibration data, a spectrogram of the process is created. a spectrogram is a representation of frequencies of a signal as it varies with time. the primary reason for selection of the fft was ease of implementing possible different filters and the relatively low requirement of needing to choose parameters as denoted by the wavelet transform and highlighted in section . . the spectrogram usually appear with time along the x axis, frequencies along the y axis and then power represented as different colorations in the graphs. using the spectrogram, energy is calculated over time from the ffts. this energy value comes from integrating the power spectral density with regards to the frequency range of the fft. this converts the multiple power peaks of the fft into one value of energy represented in time. this energy calculation closely corresponds with the motions of the equipment. the firs method the authors consider is as a distribution of energy points with respect to the probability density function. the baseline case is expected to have a single peak in terms of the distribution and should represent a normal distribution of energy. this is an established expectation from the vibration handbook by chris mechesfke [ ] based on existing healthy and faulty data values from machinery. any deviations from this would then appear as a potential fault. the second application used classifiers to compare healthy processes versus processes of fault data. this was done to asses how possible it was for the data to be incorporated in different types of analysis. in this case a support vector machine and a density based spatial clustering method was used. this was also meant to represent the difference of using supervised versus unsupervised learning in the analysis. these methods are further explained in the results. the fft had the following characteristics. the window for the fft is the hamming window, due to windows use in literature. the size of each block of the fft is data points. this increased the detail of the fft and increased the computational time. the overlap point between the blocks was points. the sampling frequency was around hz. for the faulty bearings, the expected energy values are expected to have a wider variance. when fault begins to appear in raw values, the variance and amplitude of the raw data will increase and cause a wider variation in the signal rather than having a single peak [ ] . the probability density function is expected to be lower due to a fact the energy values have a wider range to cover. 
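the first analysis step described above, turning the per-process energy values into a distribution, can be sketched with a gaussian kernel density estimate (a histogram would serve equally well). the fft settings mirror the description (hamming window, fixed block size and overlap), but the specific numbers below are placeholders because the original values did not survive extraction.

import numpy as np
from scipy.signal import spectrogram
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

def energy_pdf(vibration, fs, nperseg=8192, noverlap=4096):
    f, t, sxx = spectrogram(vibration, fs=fs, window='hamming',
                            nperseg=nperseg, noverlap=noverlap, mode='psd')
    energy = trapezoid(sxx, f, axis=0)      # one energy value per time step
    kde = gaussian_kde(energy)              # estimated pdf of the energy values
    grid = np.linspace(energy.min(), energy.max(), 200)
    return grid, kde(grid)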
this is then determined as representative of the process and determining between healthy or an unhealthy machine. figure and show fft response from test and . test was the baseline case at a misalignment of degree, while test was the inner race defect case at a misalignment of degree. the conventional case for the fft is comparing differences with regards to the response of established baseline data. the calculated bpfo from equation was hz. the calculated bpfo was hz. the calculated rolling element frequency was hz. the misalignment of the shaft is attributed to the deviation of the bpfi. the introduced defect to the inner race is causes the sidebands seen in figure . sidebands are present in all three frequencies seen here. with the roller frequency from figure and figure , there is a noticeable difference in terms of the psd. figure has singular high peaks at each characteristic frequency, while figure has noticeable sidebands at each response. the difference in peak location and peak amplitude for the bpfo and rf is noticeable, with each peak location dropping by around to hz. the peak amplitude for the outer and inner rings is also reduced by w/hz in both cases. while the differences are noticeable to humans, a computer program may not be able to register the difference in the signal seen. algorithms such as naive bayesian classifiers and neural networks require strict training data sets for determinable analysis on whether the data is faulty or healthy. the closeness of the signal difference could lead to multiple misclassification of data points. there are ways filter out the signal digitally and through the experimental setup, but that will also require more software and hardware, then some manufacturers can or are willing to provide. with the energy method proposed, the differences in those ffts will become pronounced and determine not only healthy or fault, but also the type of defect associated with the different bearing tests. figure and figure show the probability density function with respect to the energy points. the probability density function is based on the occurrence of the energy point over the length of the process. figure shows the baseline case data figure : fft response from bearing test . this is from bearing test of the inner race defect case. on the x axis are the different frequencies. the y axis is the power components of the signal. single sided power (w/hz) frequency (hz) figure : baseline data. this is the plotted data taken from bearing test shown in table . the x axis has the energy values with units of j/hz for energy spectral density. the y axis is the probability density function. from test , while figure shows the initial induced damaged inner race bearing from test . each figure has three processes plotted. the different processes are denoted by different dashed lines. the first peak of the process is centered around j/hz. this is indicative of the start of the process and represents when the process is not in motion. the normal operation of the process begins at j/m /s and then ranges up to j/hz. only one baseline has a clear normal distribution shown as the third process with the dotted line. however, the range of energy values is similar for each individual process. for figure , the process data plotted appears to follow a similar distribution as seen with the baseline data, which was not expected. the peak of the energy data occurred at a lower energy value than was expected. the peak location in terms of energy values is around j/hz. 
the amplitude of the pdf is around . . there is a slight peak around j/hz for each process, like the baseline data. a higher distribution in energy values was expected with a lower pdf value. this may be due to the initial size of the defect compared to the baseline data. the highest concentration of energy values was more than the baseline data, which was unexpected. figure shows the change in misalignment for the baseline bearing. interesting enough the baseline bearing changes from a normal distribution and then begins to form two peaks in the intermediate stage at a wider distribution of energy points before centralizing at two peaks at a narrower and lower energy content. from figure , at the extreme case of misalignment a wider distribution of energy values was expected denoted by the dotted dashed line. however, in that case, the distribution began to centralize around two peaks at j/hz and j/hz. figure shows the baseline misalignment with the average distribution of the tested bearings for the single inner race defect, the single outer race defect and the baseline case, all at degree of misalignment. from this figure , the classification of each bearing defect from the baseline is not possible. at the maximum misalignment, the noise generated by that state drowns out the possibility of classification based on the defect. the information that figure does provide is that misalignment is detectable from the -degree misalignment state, irrespective of defect. misalignment is considered a bearing defect and one that if neglected could lead to warpage and damage to the bearing, motor and shaft. figure shows the difference in extreme damage on the inner race bearing with a comparison to the baseline data. as noted in figure , the initial bearing damage started off very low and culminated in a high peak at low energy values. in figure , the average bearing energy signature appears to have stabilized close to a middle value between j/hz and j/hz this is between the initial damage reading shown and the baseline damage process lines shown in figures and . this could relate to the possibility of the appearance of more sidebands in the fft and higher peak amplitudes in the psd. the average values are lower for the maximum defect data peaking at a location around j/hz and j/hz for the defect data in the dotted line. the average process data for test , and is shown above. the difference in each of these tests is the angle of misalignment for the shaft and bearing. the x and y axis remain the same as the previous plots. table is plotted over each other. on the x axis, the spectral energy density and on the y axis is the probability density function in the figures - , there are several different conclusions drawn. the first is that fault data is discernible from baseline data on the process. this is based both on the shape of the distributions as well as the location of where the peaks are occurring. the second conclusion is that misalignment interferes with additional bearing defect classification. in this case, the authors meant to use it as a means of loading the bearing for excitation by the motor. however, when comparing the bearings in the extreme misalignment, it was too difficult to recognize which defect was under observation. another observation is the subsequent appearance of additional peaks in more defect data. this could be due to the appearance of additional peaks in the fft from the defect. this next section shows a case study from bmw on one of their vertical lift applications. 
the vehicle car lifts are a high priority to keep healthy, as an unexpected failure can shut down the line for multiple shifts. the lifts have ifm vibration sensors similar to the ones used in the experimental testing. data is taken from two sources. the baseline comes from a lift with similar characteristics to the defect lift, which the maintenance staff had deemed healthy. the defect data comes from a lift with a known bearing defect. the detection of the bearing defect originally stemmed from calculating the bearing fault frequencies. in this event, the bearing defect is a ball defect that has scraped against the inner race surface, causing a small surface to appear on the bearing ball. the data collected goes through the same configuration, however the distributions are not shown. a support vector machine (svm) classifier and a density based spatial clustering method were used to compare and see if there was a possible difference in the classification. figure shows the process events with respect to the energy content versus the corresponding fft for the healthy lift. every sampled ffts correspond to approximately a minute of process time. for lift operations, there is an up motion and a down motion; in this case the down motion is the set of events without the initial spike. the energy spike seen at the beginning of each process then corresponds to the motor initially ramping up before settling into the operation. the end spike is the brake engaging, bringing the motor to a sudden halt. for the process, the average energy content remains around j/hz except for the spike. figure shows the process events with respect to the energy content versus the corresponding fft for the unhealthy events. in this data, the brake events and the initial motor spikes are not discernible. the average energy is also higher, closer to . j/hz versus the healthy j/hz. this is different from what is seen in the experimental data: in figure and , the distribution of the data shows the healthy data as having a higher energy content as opposed to the defect data. a possible reason could be the increased load of holding a car. based on figure and , it is possible to discern the difference in the average energy level as well as the near elimination of the brake events in the data. these are a few of the features with a measurable difference that are useful in terms of classification. from the data presented, tracking of process events is also possible using the energy spectral method. there is a . j/hz difference in the energy content between the healthy lift and the damaged lift. figure shows a side by side comparison of a process from each data set. figure shows the classification of the data using a one class support vector machine. the kernel used was the rbf kernel, i.e. a gaussian kernel. the tolerance for the stopping criterion was set to . . the degree of the polynomial was set at to account for the spikes seen in figure and . the classification was based on features: the difference in the maximum peak positions of the fft within a hz tolerance and the difference in the energy of the sample. the healthy data was taken after the bearing change, while the unhealthy data was taken just prior to the bearing change. from the classification, the classifier had % success rate in classifying the fault failure data from the baseline data.
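a minimal version of the one-class support vector machine step described here is sketched below, assuming each process is summarised by a small feature vector such as its mean energy and the positions of its largest fft peaks. the rbf kernel matches the text; the remaining parameter values and the data are placeholders since the exact settings did not survive extraction.

import numpy as np
from sklearn.svm import OneClassSVM

# rows: one process each; columns: e.g. energy of the sample and the
# positions of the maximum fft peaks
healthy = np.random.randn(50, 3)            # placeholder training processes
unknown = np.random.randn(10, 3) + 1.5      # placeholder test processes

ocsvm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(healthy)
labels = ocsvm.predict(unknown)             # +1 = like healthy data, -1 = anomalous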
figure shows the classification results for each point with regards to the fault data energy and top maximum frequency positions. in the first graph, class is considered a health data point, while class is an unhealthy data point. there is still some misclassification from using this method stemming around the brake events. however, the authors believe those errors are removeable through better svm parameters. a density based spatial clustering method (dbsc) was used to see if it was possible to for unsupervised learning in clustering the data. the dbsc was used from the machine learning library, sklearn. here the command is termed as dbscan. the euclidean distance of points from one another was a metric with a minimum of points to become a cluster. figure shows the initial results of comparing the area and position of the maximum fft peak position. figure shows a comparison of this points in an unsupervised learning method. three clusters emerged around three different positions. the first cluster corresponds to the lower energy values typically found in the baseline data as seen in the dotted and dashed circle. the dashed circle holds values associated with the brake event, while finally the furthest cluster is failure events. this was more of a proof of concept to see if the data is cluster able in unsupervised method. future use of the method could lead to detection of different faults if different clusters begin to emerge or if one cluster grows larger than another. a method for using the energy spectral density to detect component faults was presented. bearings were used to validate the method. data from a vertical lift application was used to show the use in a production environment. two different applications were represented with an experimental and an industrial application. fifteen different cases were presented shown in table . each case is represented by a different location of the defect, the size of the defect, the location of the defect, the amount of misalignment and the relative amount of damage compared to the overall surface of the bearing. based on the results, the energy content from the power spectral density is a valid method for determining bearing faults from healthy data over the course of a process. from the experimental test stand, each bearing defect had a different distribution in comparison to each defect and the baseline data. when misalignment was introduced, however the faults became more difficult to discern from one another and the baseline misalignment case. the distribution of energy values was used to validate the detection of faults in the experimental data. for the industrial data, the energy content with respect to the corresponding fft and by extension the corresponding time was used to show the difference regarding a healthy lift versus a damaged lift. from figures - , the data is separable between healthy and unhealthy processes and classification is possible. however, there is still work required before completely validating the tested method on an industrial application. a more rigorous experimental plan should be developed for offline learning with a specific emphasis on multi failure classification. an experimental test was considered for ball defects to validate the data seen in the corresponding lift application. another issue is the dependence of fft levels on speed and load. 
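as a complement to the supervised classifier above, the unsupervised clustering described at the start of this section can be reproduced with scikit-learn's dbscan implementation, using euclidean distance and a minimum number of points per cluster as in the text; the eps and min_samples values below are illustrative, not those used in the study.

import numpy as np
from sklearn.cluster import DBSCAN

# columns: e.g. energy of the sample and position of the maximum fft peak
points = np.random.randn(200, 2)            # placeholder feature points
clustering = DBSCAN(eps=0.5, min_samples=10, metric='euclidean').fit(points)
cluster_labels = clustering.labels_         # -1 marks noise, 0..n label clusters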
for the industrial data, simply taking data from a corresponding and similar lift could invite potential error, if the speed and load are different between different lift applications. however, a relationship between the speed and load can be defined using the characteristic bearing equations and modeling of the applications for each case. another instance is the amount of noise generated in the bearing vibration test stand. for some of the test cases, the variation exceeded % and did not match the baseline or any other know defect state. therefore, for figures and , only three test processes are plotted instead of five test processes. a new bearing station has been designed as a more rigid configuration to allow for better and more rigorous testing. this will eliminate much of the expected environmental noise. another possible elimination of ambient noise is the use of filtering. a bandpass filter added into the model could focus in on the selected frequencies and mitigate the amount of possible error. fine tuning the parameters in the classifications will lead to more robust results and reduce the number of misclassifications. this is a problem seen with wavelet transforms as well. a more general case will need to be developed for applications to other components. another application is the extension of incorporating wavelets into this method. wavelet analysis is the accepted method for bearing fault diagnosis and offers the ability to show the time occurrence of when frequency events occur. incorporating the occurrence of energy events from the frequency over time could increase accuracy for multi class failures as additional features. future work will also seek to extend this application to other equipment and see if it is possible to detect faults with them. bearings were used to see if this was a valid technique for classification and the data features were truly separable. the objective in the future is to test the method on different industrial applications, for continued validation. operations having more significant process variation will be selected, in order to provide a more robust test of the classification approach. possible equipment and components involve vehicle platforms and valves from non-newtonian sealant applicators. 
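the bandpass filtering mentioned above as a possible noise mitigation could be prototyped as below with a zero-phase butterworth filter centred on a frequency band of interest; the band edges and filter order are arbitrary illustrative values, not a recommendation from the text.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(vibration, fs, low_hz, high_hz, order=4):
    # zero-phase butterworth bandpass as second-order sections
    sos = butter(order, [low_hz, high_hz], btype='bandpass', fs=fs, output='sos')
    return sosfiltfilt(sos, vibration)

fs = 25000
filtered = bandpass(np.random.randn(2 * fs), fs, low_hz=500.0, high_hz=2000.0)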
current status of machine prognostics in condition-based maintenance: a review machine condition monitoring and fault diagnostics strengths and limitations of the wavelet spectrum method in the analysis of internet traffic overview of maintenance strategy, acceptable maintenance standard and resources from a building maintenance operation perspective vibration analysis -a proven technique as a predictive maintenance tool review of vibration analysis methods for gearbox diagnostics and prognostics intelligent monitoring of ball bearing conditions vibration based condition monitoring: a review a review on machinery diagnostics and prognostics implementing condition-based maintenance optimal number of features as a function of sample size for various classification rules detection of rolling element bearing damage by statistical vibration analysis vibration monitoring of rolling element bearings by the high-frequency resonance technique -a review the vibration produced by multiple point defects in a rolling element bearing a review of rolling ellement bearing vibration 'detection, diagnosis and prognosis bearing condition monitoring methods for electric machines: a general review models for bearing damage detection in induction motors using stator current monitoring detection of motor bearing outer raceway defect by wavelet packet transformed motor current signature analysis fault feature extraction of gearbox by using overcomplete rational dilation discrete wavelet transform on signals measured from vibration sensors an overview on vibration analysis techniques for the diagnosis of rolling element bearing faults pattern recognition statistical, structural and neural approaches a review of neural networks for statistical process control bearing fault detection using artificial neural networks and genetic algorithm bearing fault detection by a novel condition-monitoring scheme based on statistical-time features and neural networks condition based maintenancesystems integration and intelligence using bayesian classification and sensor fusion reliable bearing fault diagnosis using bayesian inference-based multi-class support vector machines online learning versus offline learning a baseline for detecting misclassified and out-of-distribution examples in neural networks health monitoring, fault diagnosis and failure prognosis techniques for brushless permanent magnet machines bearing fault diagnosis using fft of intrinsic mode functions in hilbert-huang transform wavelets for fault diagnosis of rotary machines: a review with applications wavelet analysis and envelope detection for rolling element bearing fault diagnosis-their effectiveness and flexibilities generator bearing fault diagnosis for wind turbine via empirical wavelet transform using measured vibration signals fault diagnosis of a rolling bearing using wavelet packet denoising and random forests review of recent advances in the application of the wavelet transform to diagnose cracked rotors a comparative study of fft, stft and wavelet techniques for induction machine fault diagnostic analysis processing for improved spectral analysis fault detection in induction machines using power spectral density in wavelet decomposition spectral approach to geophysical inversion by lorentz, fourier, and radon transforms time-frequency analysis of earthquake records total energy and energy spectral density of elastic wave radiation from propagating faults: part ii. 
a statistical source model a new framework using deep auto-encoder and energy spectral density for medical waveform data classification and processing bearing failure: causes and cures bearing failures and their causes timken bearing damage analysis with lubrication reference guide key: cord- - q j authors: stanley, philip m.; strittmatter, lisa m.; vickers, alice m.; lee, kevin c.k. title: decoding dna data storage for investment date: - - journal: biotechnol adv doi: . /j.biotechadv. . sha: doc_id: cord_uid: q j while dna's perpetual role in biology and life science is well documented, its burgeoning digital applications are beginning to garner significant interest. as the development of novel technologies requires continuous research, product development, startup creation, and financing, this work provides an overview of each respective area and highlights current trends, challenges, and opportunities. these are supported by numerous interviews with key opinion leaders from across academia, government agencies and the commercial sector, as well as investment data analysis. our findings illustrate the societal and economic need for technological innovation and disruption in data storage, paving the way for nature's own time-tested, advantageous, and unrivaled solution. we anticipate a significant increase in available investment capital and continuous scientific progress, creating a ripe environment on which dna data storage-enabling startups can capitalize to bring dna data storage into daily life. the digital revolution triggered a paradigm shift in how we generate and store information, resulting in an unprecedented exponential increase in the amount of data that we produce and marking the beginning of the information age. until now, data storage media including magnetic tapes and silicon chips have kept up with this demand, but they are fast approaching a critical limit in their physical storage capacities. in addition, the demand for data storage is expected to exceed the supply of silicon within the next years (zhirnov et al., ) . currently, cloud-based systems are widely used for remote storage of data that does not need to be frequently accessed (schadt et al., ) . whilst it might conjure up an ethereal image of how data are stored, the reality of cloud storage is much starker. storage services use large warehouses stacked with constantly active servers that require a continuous supply of power and cooling systems to prevent overheating. another major cost driver is archive replication. to attain redundancy, users can require multiple copies of data to be stored in geographically distinct locations. for petabyte to exabyte scale archives, this is non-trivial as the cost of each archive replicate is an integer multiple of the original archive cost. consequently, cloud storage services are associated with substantial costs in terms of associated materials, storage space and electricity (trelles et al., ) . therefore, how we currently meet our data storage needs is unsustainable environmentally, physically and financially. the urgent unmet need to develop truly disruptive technologies for the future of data storage has been widely recognized by organizations within both the public and private sectors. this has triggered a sharp increase of activity in pursuit of this goal. a significant advantage of dna over conventional data storage approaches is its longevity and stability. 
dna has a half-life of approximately years and can endure over millennia under appropriate storage conditions (allentoft et al., ; branzei and foiani, ; zhirnov et al., ) . in contrast, current storage media including magnetic tapes and optical disks have a lifespan in the order of decades, requiring data to be regularly copied to new media for preservation . notably, dna can be archived at room temperature without any power input (grass et al. ) . equally remarkable is the improvement in density that dna can provide. one cubic millimeter of dna can store up to bytes, which would give dna an approximately six orders of magnitude higher theoretical storage density than the densest storage medium currently available . in practice, a sample of dna the size of a few dice would store the equivalent of an entire data center"s worth of data (bornholt et al., ) . even though new copies of data stored in dna do not need to be frequently produced for preservation, it can be done with ease. dna can be copied exponentially using the same polymerase chain reaction (pcr) that is frequently used in laboratories for life science and medical purposes. this markedly improves the efficiency of producing data backups compared to current storage technologies. however, it should be noted that amplification is also a source of error, as small variations are compounded during amplification, leading to molecular bias . there are no data storage media that share the eternal relevance of dna, with its prominence in nature over billions of years of evolution. inevitably, there will always be the desire to read and write dna. further, the storage medium dna has the right characteristics to conduct computations, so-called dna-or bio-computing (adleman, ; braich et al., ; thachuk and liu, ) . looking further into the future, one can imagine a holistic, novel solution for data handling completely run on dna. norbert wiener and mikhail neiman are considered the founding fathers of data storage in nucleic acid (u.s. news and world report, ) . in , norbert wiener proposed that a memory system could be built from genetic material outside of a living organism in the future. since then, many researchers from academia and industry have worked on developing this initial idea towards a viable product. moving forward, the concerted efforts of academia, large corporates, innovative startup companies together with venture capital investment will be required to propel dna data storage to commercial scale. in general, storing information in a biological material follows the same principle as for common silicon-based hard drives or recording tapes (figure ) . the data needs to be transferred into a code ( . .) that is written ( . .) into dna suitable for storage ( . .), where the information can be accessed ( . .) again to read ( . .) the data. the ability to copy ( . .) data from one device to another is also beneficial . interdisciplinary efforts spanning molecular biology, computer science, and information technology are required to reach a complete dna data storage workflow and in the following paragraphs, several of these approaches are discussed in more depth. dna naturally carries an intrinsic code composed of its four nucleobases: adenine (a), thymine (t), cytosine (c) and guanine (g). therefore, the most obvious approach is to translate binary (digital) code directly into this four-base alphabet which researchers have succeeded in. 
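as a minimal illustration of this transcoding step, the sketch below maps a byte stream onto the four-letter alphabet at two bits per base and back again; it deliberately ignores the biochemical constraints and error-correction layers discussed later and is not the scheme of any particular group.

BITS_TO_BASE = {'00': 'A', '01': 'C', '10': 'G', '11': 'T'}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def bytes_to_dna(data: bytes) -> str:
    bits = ''.join(f'{byte:08b}' for byte in data)
    return ''.join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bytes(seq: str) -> bytes:
    bits = ''.join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

assert dna_to_bytes(bytes_to_dna(b'hello')) == b'hello'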
nevertheless, programming code for storage is completely flexible and not universal across different approaches in the field. it has been recently shown that rational system design can yield higher bit densities, e.g. . bits per nucleotide through employing the four canonical nucleotides in order to approach the maximum theoretical number of two bits per base for a quaternary system like dna (erlich and zielinski, ) . even several nucleotide-long, pre-made dna oligonucleotides can be utilized to encode information when they are assembled in a combinatorial manner. in addition, the four-letter alphabet code can be extended by chemical modifications of the nucleobases or the phosphate backbone. a striking example of an extended dna alphabet is "hachimoji" dna, which uses eight non-canonical bases with watson-crick hydrogen binding (hoshika et al., ) . completely different approaches, however, use secondary structure elements that can be constructed into dna. naturally, dna is a double helix of two antiparallel strands of nucleotides. these are connected via their sugar phosphates that build the backbone, with the four different nucleobases facing inwards. the formation of hairpins of different lengths, in which one of the strands loops out, is an example of a nanostructure that has been shown to represent digital code (long hairpin = " ", short hairpin = " "). this nanostructure can be sensed and decoded through nanopores (see . .) (chen et al., ) . introducing cuts into the phosphate backbone of one of the strands, called nicking, can serve as a code, too (tabatabaei et al., ) . the presence or absence of a nick at a certain position resembles a " " or a " ", respectively, in digital binary code. for this approach, an existing dna sequence is extracted from a native source, e.g. bacterial dna, to select registers with desirable sequence elements for restrictions. the nicks are then enzymatically introduced in a parallel fashion. denaturing the dna separates the two strands and alignment of the fragments onto a known sequence retrieves the position of the nick, regenerating the code. the nicking approach leads to a - -fold decrease in information density per base pair of dna, compared to codes with a nucleotide resolution. hairpin approaches also share this inherent density reduction. however, this could be a very low price to pay considering the ease of reading, and potential reductions in costs of dna synthesis, providing adequate automation of these processes can be realized. a further key aspect to consider in the encoding process is the choice of codec scheme before synthesis to facilitate accurate error correction during readout (see . .). this can include using multiple copies, reed solomon block codes, repeat accumulate codes, and more (blawat et al., ; wang et al., a) . a descriptive example recently reported for storing quantized images in dna uses signal processing and machine learning techniques to deal with error instead of redundant oligos or rewriting. this relies on decoupling and separating the color channels of images, performing specialized quantization and compression on the individual color channels, and using discoloration detection and image inpainting (c. pan et al., ) . it is worth noting that images are a medium the brain can self-error-correct to a certain degree, making it not necessary to recover every bit, nevertheless this is an example of rational system design and codec scheme choice. 
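the role of a codec can be illustrated without committing to a particular code: the sketch below splits a payload into short, index-addressed fragments of roughly oligonucleotide size and adds a single xor parity fragment so that any one lost fragment can be reconstructed. production schemes use far stronger codes (for example reed-solomon block codes or fountain codes), so this is only a conceptual stand-in.

from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(data: bytes, fragment_len: int = 20):
    # fixed-size fragments (zero-padded) plus one xor parity fragment
    frags = [data[i:i + fragment_len].ljust(fragment_len, b'\x00')
             for i in range(0, len(data), fragment_len)]
    parity = reduce(xor_bytes, frags, bytes(fragment_len))
    return frags, parity

def recover_fragment(frags, parity, missing_index):
    # xor of the parity with all surviving fragments restores the missing one
    survivors = [f for i, f in enumerate(frags) if i != missing_index]
    return reduce(xor_bytes, survivors, parity)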
depending on how information is encoded in dna, the requirements for its synthesis vary. producing long strands of dna is currently the main challenge. while all synthetic oligonucleotides are prone to errors during synthesis (hölz et al., ) , oligonucleotides longer than nucleotides are particularly difficult to obtain with high fidelity due to accumulated errors. during the reaction, costly reagents generate toxic byproducts. most technologies still rely on a sequential one-by-one addition of nucleotides to the growing strand, where the speed of liquid handling in microfluidic devices limits production speed. this explains the industry push toward systems using shorter sequences, which has been demonstrated in academic research (wang et al., b) . in array-based dna synthesis, several strands that encode different dna sequences are grown simultaneously while immobilized on a surface. this allows for higher parallelization, thereby increasing production speed (kosuri and church, ) . novel approaches focus on enzymatic dna synthesis (lee et al., ; lee et al., a) . while oligonucleotides produced with this methodology currently remain shorter, experts expect lower error rates, higher speed and longer fragments with this upcoming technology. to obtain larger sequence strings for ease of read, an assembly process connecting these to nucleotide-long fragments is needed. currently, most efforts follow the same principle that is used in molecular biology for gene assembly. codes that are independent of the actual sequence but rely on secondary structure can be assembled from a pool of oligonucleotides, which can be produced by chemical synthesis in large amounts and at low costs. the same applies to code in which longer oligonucleotide sequences present one digital state (" " or " "). to promote the correct assembly of these in a fast and reliable fashion, researchers at catalog have developed an inkjet-printer like machine. in contrast to biological applications, the requirements for dna synthesized for data storage are throughput, costs and few copies per unique sequence. researchers across the field agree that dna synthesis remains the biggest challenge and needs to become faster, more reliable, and significantly cheaper to advance data storage in dna. j o u r n a l p r e -p r o o f whichever way encoded, the nucleic acid molecules themselves can adapt to any structure and could, in principle, be stored in any geometric shape or form. dna molecules can be pooled for liquid storage in a suitable solvent. on the other hand, researchers have also developed storage devices where dna molecules are immobilized on solid surfaces, or where dna molecules are embedded in other materials such as glass or plastic (personal communication, unpublished data; grass et al., ; koch et al., ) . so far, empirical values for the retention time (the time data can still be recovered reliably) of all storage forms still need to be confirmed. initial experiments, in which dna was encapsulated into an inorganic matrix, have shown promising results. in this form, information on dna is predicted to be stably stored for years (grass et al., ) . to avoid reading a whole storage device, systems for organizing and accessing information are needed. a nested file address system has, for instance, been shown to increase the capacity of dna storage further and provides progress towards a scalable dna-based data storage system (stewart et al., ; tomek et al., ) . 
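address-based organisation of a strand pool can be mimicked in software: the sketch below tags every payload strand with a short file address and then retrieves a file by selecting the strands that carry that prefix, a stand-in for the selective amplification or hybridisation described next. the address sequences and file names are invented for illustration, and real designs must also ensure that addresses cannot cross-hybridise.

# hypothetical file-address sequences (illustrative only)
ADDRESSES = {'photo_archive': 'ACGTACGTAC', 'tax_records': 'TTGACCGGTA'}

def tag_strands(file_id, payload_strands):
    # prepend the file-specific address to every payload strand
    return [ADDRESSES[file_id] + s for s in payload_strands]

def retrieve(pool, file_id):
    # keep strands whose prefix matches the file address, then strip it
    addr = ADDRESSES[file_id]
    return [s[len(addr):] for s in pool if s.startswith(addr)]

pool = (tag_strands('photo_archive', ['AATTCCGG', 'GGCCTTAA'])
        + tag_strands('tax_records', ['CCGGAATT']))
print(retrieve(pool, 'photo_archive'))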
random access describes the reading of selected information in computer science, and it is a key feature that needs to be developed for dna data storage to become viable. indexing (adding a unique file identification sequence onto each dna molecule) helps to identify the desired data; however, how to index data is not a unified system yet. in most approaches, the index will allow for targeted amplification of the requisite information by pcr, for example by having the same pcr primer target sequences form a unique file id for each strand and including a one-of-a-kind, strand-specific address to order strands within a file (organick et al., ) . in other approaches, a complementary sequence of the index region is encoded on magnetic beads and hybridization allows for the physical separation of the desired dna molecule from the pool (tomek et al., ) . these index regions must be designed carefully to access only the desired dna molecule. both approaches are derived from molecular biology techniques. search functions have been designed in a similar way, generating query dna strands to identify the searched information through hybridization. recently, this content-based retrieval from a dna database was scaled to include . million database images with a retrieval rate much greater than chance when prompted with new images (bee et al., ) . in some use cases, data need to be rewritten, meaning only parts of the data change while other parts are retained. the first successful approach to create a rewritable dna-based storage system was described in . the technology is based on an elegant design of dna blocks with recognition sequences that can be altered via pcr (yazdi et al., ) . more recent advances include a dynamic storage system based on a t promoter sequence with a single-stranded overhang, unlocking versatile editing and rewriting capabilities (lin et al., ) . retrieving information from dna for storage can benefit from the sequencing ecosystem that is continuously being improved for life science and medical applications. as novel sequencing methods are developed, the cost per base sequenced decreases while the speed increases. currently, the main sequencing approaches used by researchers in the dna data storage field are sequencing by synthesis, promoted by illumina, or nanopore sequencing. the novel nanopore sequencing technology, designed for longer molecules, can in principle also decode secondary structure elements and base modifications. however, the workflow to prepare dna for sequencing is still currently laborious and additional steps are required for nanopore sequencers. despite improvements in the workflow, experts agree that the entire process speed needs to be improved. despite inherent error rates, dna data storage can in principle tolerate high error rates in both write and read channels through sufficient redundancy, appropriate codecs (coder-decoder), error correction codes and algorithm design (erlich and zielinski, ; organick et al., ; press et al., ) . researchers describe that on average ten copies of a dna molecule encoding the same sequence is sufficient to reliably retrieve the stored information with the current technologies . for biological applications, this so-called coverage is typically required to exceed reads per nucleobase. the required error correction algorithms are different and often more complex than those needed for biological purposes or algorithms used in conventional data storage. 
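the observation that roughly ten reads per strand suffice can be illustrated with a per-position majority vote over the sequenced copies, which suppresses independent substitution errors (insertions and deletions require the more elaborate codecs discussed below). this is a conceptual sketch, not the decoding pipeline of any cited system.

from collections import Counter

def consensus(reads):
    # per-position majority vote across equal-length noisy reads
    return ''.join(Counter(column).most_common(1)[0][0]
                   for column in zip(*reads))

reads = ['ACGTTGCA', 'ACGTTGCA', 'ACCTTGCA', 'ACGTTGAA', 'ACGTTGCA',
         'ACGATGCA', 'ACGTTGCA', 'ACGTTGCA', 'TCGTTGCA', 'ACGTTGCA']
print(consensus(reads))   # 'ACGTTGCA'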
besides the substitution errors that are also found in conventional data storage, nucleotide insertions and deletions are additional common error types that occur during dna synthesis and sequencing. one such algorithm has recently been developed to repair all three error types: insertions and deletions can be corrected directly within a single dna strand, unlike previous codes that only correct substitution errors (press et al., ). a part of this algorithm, known as hedges (hash encoded, decoded by greedy exhaustive search), translates between a string of the four nucleobases and the binary string of the digital code without changing the number of bits, all while tackling practical challenges of storing information in dna. in addition to correcting the three error types, hedges can convert unresolved insertions or deletions into substitutions, and it can also adapt to sequence constraints (e.g. having a balanced g-c content). hedges, therefore, has the potential to enable error-free recovery of data on a large scale. another coding algorithm can tolerate high error rates during reading while also reducing the level of sequencing redundancy required for error-free decoding (organick et al., ). this limits the number of copies of dna required to recover stored data, which becomes increasingly important as throughput increases. to reduce the need for error correction, codes that avoid long stretches of the same nucleobase, which are difficult to synthesize and sequence, can be applied. the intrinsic property of dna to make copies of itself for data transfer in living organisms is highly beneficial for data storage, as information is more valuable when it can be multiplied and distributed. currently, a minimum of two copies of the data is required as a pre-sequencing step because modern sequencers are constructed to discard the dna as part of the reading process. standard pcr protocols can easily be transferred from molecular biology to dna data storage to generate additional copies in a fast and parallel way. together with pcr-based random access, dna sequencing combined with pcr-based copying is the primary reason why most experts predict dna will become the next-generation storage medium rather than other organic or biological polymers. the most recent developments for such alternative polymers still rely on low-throughput, mass spectrometry-based sequencing techniques (lee et al., b), in stark contrast to the methods available to rapidly sequence large quantities of dna outlined previously. for dna storage to be widely adopted as a commercial product, the whole process, including the transfer steps between synthesis, storage and sequencing, eventually needs to be automated. the first end-to-end storage device, which encoded and recovered the word "hello" (published by the university of washington and microsoft), sets the stage for further fully integrated solutions. this approach is based on liquid dna storage, where the main limiting factor is considered to be liquid handling. progress in the field of nano- and micro-fluidics will help advance automation strategies, such as the novel runtime system "puddle", a high-level, dynamic, error-correcting, full-stack microfluidics platform, and dehydrated dna spots on glass (newman et al., ).
another approach towards fully integrated solutions builds on established complementary metal oxide semiconductor (cmos) technology to increase throughput via bespoke, microfabricated, and highly parallel synthesis and sequencing devices, such as those in development by twist bioscience and roswell biotechnologies. besides the technical hurdles in this interdisciplinary field, researchers have also realized that a more extensive network with effective communication is needed to advance data storage in dna. recently, for instance, a glossary and controlled vocabulary was introduced to increase accessibility (hesketh et al., ). while dna data storage technologies are immensely intriguing from a scientific point of view, companies are still facing key challenges on the way to large-scale commercial success. widely recognized as the central bottleneck, dna synthesis is costly, time consuming and prone to errors. the synthesis price per base has seen a rapid decline over the past decades, with companies like twist bioscience and dnascript continuously pushing the boundaries of what is possible. twist bioscience provides large quantities of long, error-free dna fragments using its silicon-based writing technology. dnascript recently announced the successful enzymatic synthesis of its longest dna fragment to date. however, we speculate that initial go-to-market technologies will still need to circumvent codes that depend on long-strand, error-free dna synthesis. the requirements of sequencing for data storage are orthogonal to those of traditional life sciences applications; the latter cannot tolerate errors. this is one of the factors that may provide additional leverage for emerging technologies in the field. when considering the nature of dna data storage, it becomes clear that instantaneous and random access presents a substantial problem. especially for large quantities of data, full sequencing and decoding will not always be practical and entails higher latency. commercially, this will push development towards the low-hanging fruit of archiving "cold data", often referred to as "write once, read never" ("worn"). an extremely promising entry point into the archiving market is image storage: the error correction codes developed so far and the eye's fault tolerance mean that image fidelity does not have to be perfect, thereby compensating for error rates (personal communication, unpublished data). we therefore expect the first commercial adopters to come from market segments such as image backup or streaming services. the challenges described so far represent intrinsic hurdles for dna data storage from a molecular perspective. more issues arise when considering the required operational infrastructure. in structural terms, dna molecules cannot simply be applied to existing chip architectures. thus, the silicon-to-dna interface will have to be optimized and accounted for by software and physical interconnects. moving forward from current prototypes to larger setups will entail fluidic difficulties, as a liquid-based data storage system will ideally operate under zero-human-contact conditions. an end-to-end solution will require a standardization of data formats when stored in dna, processable by all data-hardware interfaces, and subsequently streamlined workflow steps to enable cross-platform storage and seamless embedding into existing data architectures.
currently, the individual approaches consume considerable resources and effort; dna data storage will require strategic investments in holistic frameworks to reach its full potential. in terms of dna data storage reaching mass markets and becoming a large-scale commercial success, the aforementioned hurdles seem quite daunting. however, this should not deter research, investment and public interest in this highly promising field. several indicators show that this area is already gaining substantial traction. the general public is being introduced to the concept and technology by recent articles in prominent mainstream outlets, such as forbes and wired (forbes, ; wired, ). meanwhile, significant funding opportunities are arising from the us government's defense advanced research projects agency (darpa), intelligence advanced research projects activity (iarpa), and national science foundation (nsf). iarpa most recently launched its molecular information storage technology (mist) program, currently involving dnascript and illumina, which aims to develop technologies that can write and read terabytes of data per day with dna. the considerable interest from large corporations and prominent universities, like microsoft and the university of washington, is fostering fruitful collaborations. the key opinion leaders interviewed for this article, spanning academics to business leaders, consider a multi-year timeline, depending on the application, level of automation and scale, to be a realistic estimate for market entry and success. this technology maturity horizon places dna data storage within the scope of venture capital funds, particularly for strategic investments from corporate venture capital branches and financial funds with a longer return-on-investment mandate or an evergreen fund structure. tangible benefits will arise from investments in robotics for automation and scaling solutions, as these will require high capital expenditure to develop but do not face an intrinsic chemical problem. additionally, focusing cash flows on areas with key differences to dna's biomedical applications will aid the rise of dna data storage, ultimately leading to the different technological approaches branching off into their respective best-in-class applications. a prominent example of such an area is high-throughput dna synthesis and sequencing. increased public and corporate interest in a novel technology often brings an uptick in investments and startup funding. additionally, in this case, the venture capital perspective on the overarching industrial landscape should be considered. the current general data storage market is a harsh business for hardware manufacturers and providers, who operate on low margins and rely on technological improvements to drive new products. typically, venture capitalists avoid investing in such a commoditized industry, as it is not attractive in terms of exit potential and high returns. however, the imminent crisis upon reaching the physical limits of classical data storage will drastically change the investment landscape. not only will succeeding technologies be highly contested acquisition targets for numerous large corporations, they will simultaneously generate very substantial capital returns for investors.
furthermore, novel methods and scientific breakthroughs achieved in pursuit of molecular data storage will generate complementary use cases and applications in other industries, like bio-computing and the life sciences. from a business model and scaling perspective, such a broad range of markets is appealing. this paradigm shift in the dna data storage investment landscape has already begun and can be quantified by tracking the number and size of venture capital deals made in the relevant market segments and technologies. thus, we have performed a database search and analysis of investment activity in the dna data storage space over recent years. using appropriate keywords and input criteria, a list of companies that have been, or still are, venture capital backed was generated (see figure). several key takeaways are apparent from this data. first, the past seven years have seen a rapid increase in total capital invested in companies developing dna data storage-enabling technologies (figure a), reflecting a strong compound annual growth rate (cagr). the number of concluded deals per year follows a similar trend. this significant upswing corresponds with the previously noted uptick in academic activity and accomplishments in this space. second, the predominant sectors that the invested companies are active in are novel nucleic acid analysis, sequencing and software solutions (figure b). while this is not surprising, it does stress the need for innovation and technological progress, fueled by increased funding, in the nucleic acid synthesis area. as this is widely regarded as the current key bottleneck and pain point, this space provides significant opportunities for large value generation. we anticipate that the funding focus, which currently stems from life science-oriented venture capital companies, will shift to include multidisciplinary and deep tech-oriented funds. as a result, additional capital sources will become available. third, the data confirm dna data storage to be an attractive and fast-growing investment sector, and we expect significant funding in the coming years. one major and likely defining event, the covid- pandemic, has disrupted almost every industry and economy around the world, introducing a level of economic uncertainty not seen in generations. increased timelines for development and commercialization due to the pandemic have led to an increase in the capital needed to achieve milestones. this has, in turn, decreased the total number of deals a fund can conduct to support new and existing portfolio companies. despite this, it is important to note that venture capital deals have continued (milkove, ). even during the pandemic, venture capital firms likely need to deploy the capital they have already raised in order to reach their target return on investments. given dna data storage's technology maturity timescale, venture capital firms investing in this space would already be operating with a long-term view and longer timelines than are typical for other fields of investment. covid- has actually led to increased investment in biotech companies listed on the stock exchanges, with the nasdaq biotechnology index up sharply since the start of the year (senior, ), and has highlighted the importance of investing in life sciences technologies like sequencing.
the increased worldwide interest in such technologies sparked by this global health crisis could result in a boost in investments in technologies that will ultimately advance dna data storage. in addition, covid- has accelerated digitalization across industries, further increasing the demand for data storage services (tilley, ). big tech- and data-oriented companies, like microsoft, have already benefited from the pandemic and will likely emerge from covid- in an even stronger financial position to invest in and secure their stake in the dna data storage space. with global economies beginning to recover, potential vaccine candidates looking promising, and markets looking forward, the venture capital landscape is also likely to recover. though there is a risk of a prolonged market shock, we deduce that it is still a favorable time for startups in the dna data storage space to seek funding from venture capital firms. the dawn of the modern information age has drastically accelerated the rate of data generation to the point where almost overwhelming quantities threaten our current storage capacities. paradoxically, one of the most promising long-term answers to this challenge is storing data in dna, the very substance that defines and holds the code to human life. nature's time-tested and durable solution to complex data encoding and storage guarantees its longevity and eternal relevance. however, large-scale commercialization of non-genetic dna data storage faces considerable technical, engineering and financial barriers. we believe that the latter will be increasingly addressed in the future by investment companies such as venture capital firms, given the substantial increase in invested capital in the field over the last seven years (figure a). considering the growing public and corporate traction, dna data storage-enabling startups are firmly on investment radars for strategic and long-term funding. as the overarching storage industry, currently commoditized, rapidly approaches disruption, the path is paved for technological innovations that will offer high returns. we anticipate that a significant increase in available capital, together with continuing scientific progress, will enable the rise and mainstream implementation of dna data storage in daily life. dna-based approaches promise orders of magnitude higher storage density with outstanding long-term stability, while maintaining integrity without a power supply. additionally, the intrinsic nature of dna guarantees eternal relevance and provides exciting new opportunities for biocomputing. finally, storage redundancies are easily achievable through the well-researched dna amplification process. figure (workflow of dna data storage): top panels show, in a simplified way, how coded information can be written into dna, accessed in the storage device to read the code and retrieve the information, and copied. each step is described in detail in the text. bottom panels highlight development efforts in each area. figure (analysis of the venture capital landscape for dna data storage), according to custom search results extracted from pitchbook, a private capital market data provider: venture capital funding was only included in the analysis if the respective company disclosed work on a technology that has the potential to advance dna data storage (e.g.
modifications of sequencing technologies for specific biological set-ups were not included). for startups active in the development of various technologies and/or areas, it is not possible to attribute the percentage of funding specifically supporting dna data storage technologies. a. column graph, left axis: total capital invested per year in companies in the dna data storage space, specifically including novel nucleic acid synthesis and sequencing technologies, software for sequencing data analysis, nucleic acid analysis methods (excluding single nucleotide sequencing), dna amplification and manipulation techniques, biological computing, and storage devices. line graph, right axis: corresponding total deal count per year. *note that only fully completed and publicly disclosed deals as of september are included. b. segmentation of all the companies found in the database search into their respective technology and market areas as described under a.
references:
molecular computation of solutions to combinatorial problems
the half-life of dna in bone: measuring decay kinetics in dated fossils
content-based similarity search in large-scale dna data storage systems
forward error correction for dna data storage
a dna-based archival storage system
solution of a -variable -sat problem on a dna computer
regulation of dna repair throughout the cell cycle
image processing in dna. icassp - ieee international conference on acoustics, speech and signal processing
molecular digital data storage using dna
digital data storage using dna nanostructures and solid-state nanopores
quantifying molecular bias in dna data storage
dna fountain enables a robust and efficient storage architecture
dna data storage is about to go viral
robust chemical preservation of digital information on dna in silica with error-correcting codes
improving communication for interdisciplinary teams working on storage of digital information
high-efficiency reverse ( '→ ') synthesis of complex dna microarrays
hachimoji dna and rna: a genetic system with eight building blocks
a dna-of-things storage architecture to create materials with embedded memory
large-scale de novo dna synthesis: technologies and applications
photon-directed multiplexed enzymatic dna synthesis for molecular digital data storage
terminator-free template-independent enzymatic dna synthesis for digital information storage
high-density information storage in an absolutely defined aperiodic sequence of monodisperse copolyester
dynamic and scalable dna-based information storage
covid- 's impact on early stage venture capital. medium.
high density dna data storage library via dehydration with digital microfluidic retrieval
random access in large-scale dna data storage
probing the physical limits of reliable dna data retrieval
hedges error-correcting code for dna storage corrects indels and allows sequence constraints
computational solutions to large-scale data management and analysis
the biopharmaceutical anomaly
a content-addressable dna database with learned sequence encodings
dna punch cards: storing data on native dna sequences via nicking
demonstration of end-to-end automation of dna data storage
dna computing and molecular programming
one business winner amid coronavirus lockdowns: the cloud. the wall street journal.
driving the scalability of dna-based information storage systems
big data, but are we ready?
machines smarter than men? interview with dr.
high capacity dna data storage with variable-length oligonucleotides using repeat accumulate code and hybrid mapping
oligo design with single primer binding site for high capacity dna-based data storage
puddle: a dynamic, error-correcting, full-stack microfluidics platform
the rise of dna data storage
random-access dna-based storage system
nucleic acid memory
this work was supported by numerous interviews with key opinion leaders on dna data storage, in research and industry, as well as with industry specialists for traditional cmos-based storage. we are especially grateful to the many experts who gave their time freely to provide insight and feedback on the topic, including albert keung and others. the views, opinions, and findings contained in this article are those of the authors and do not represent those of the institutions or organizations listed in the acknowledgements. furthermore, they should not be interpreted as representing the official views or policies, either expressed or implied, of the defense advanced research projects agency, the intelligence advanced research projects activity, or the department of defense. the authors of the article were employees of m ventures, the strategic corporate venture capital arm of merck kgaa, darmstadt, germany. correspondingly, this work was supported by m ventures.
key: cord- - x qs i authors: gupta, abhishek; lanteigne, camylle; heath, victoria; ganapini, marianna bergamaschi; galinkin, erick; cohen, allison; gasperis, tania de; akif, mo; butalid, renjie (montreal ai ethics institute; microsoft; mcgill university; creative commons; union college; rapid ; ai global; ocad university) title: the state of ai ethics report (june ) date: - - journal: nan doi: nan sha: doc_id: cord_uid: x qs i
these past few months have been especially challenging, and the deployment of technology in ways hitherto untested at an unrivalled pace has left the internet and technology watchers aghast. artificial intelligence has become the byword for technological progress and is being used in everything from helping us combat the covid- pandemic to nudging our attention in different directions as we all spend increasingly larger amounts of time online. it has never been more important that we keep a sharp eye on the development of this field and how it is shaping our society and interactions with each other. with this inaugural edition of the state of ai ethics we hope to bring forward the most important developments that caught our attention at the montreal ai ethics institute this past quarter. our goal is to help you navigate this ever-evolving field swiftly and allow you and your organization to make informed decisions. this pulse-check for the state of discourse, research, and development is geared towards researchers and practitioners alike who are making decisions on behalf of their organizations in considering the societal impacts of ai-enabled solutions. we cover a wide set of areas in this report spanning agency and responsibility, security and risk, disinformation, jobs and labor, the future of ai ethics, and more. our staff has worked tirelessly over the past quarter surfacing signal from the noise so that you are equipped with the right tools and knowledge to confidently tread this complex yet consequential domain.
to stay up-to-date with the work at maiei, including our public competence building, we encourage you to stay tuned to https://montrealethics.ai, which has information on all of our research. we hope you find this useful and look forward to hearing from you! wishing you well, abhishek gupta
the debate in which ethicists ask for rights to be granted to robots is based on notions of biological chauvinism: the claim is that if robots display the same level of agency and autonomy as humans, not granting them rights would not only be unethical but would also set back the rights that were once denied to disadvantaged groups. branding robots as slaves and implying that they don't deserve rights is said to have fatal flaws: it uses a term, slave, whose connotations have significantly harmed people in the past, and dehumanization of robots is not possible because they are not human to begin with. while it may be possible to build a sentient robot in the distant future, and in that case there would be no reason not to grant it rights, until then real, present problems are being ignored in favour of imaginary future ones. the relationship between machines and humans is tightly intertwined, but it is not symmetrical, and hence we must not confound the "being" part of human beings with the characteristics of present technological artifacts. technologists assume that, since there is a dualism to a human being in the sense of the mind and the body, it maps neatly onto robots: the software is the mind and the robot body maps to the physical body of a human. this leads them to believe that a sentient robot in our image can be constructed, and that it is just a very complex configuration we haven't completely figured out yet. the more representative view of robots at present is to see them as objects that inhabit our physical and social spaces. objects in our environment take on meaning based on the purpose they serve for us, such as a park bench meaning one thing to a skateboarder and another to a casual park visitor.
similarly, our social interactions are always situated within a larger ecosystem, and that needs to be taken into consideration when thinking about the interactions between humans and objects. in other words, things are what they are because of the way they configure our social practices and because technology extends the biological body. our conception of human beings, then, is that we are and have always been fully embedded and enmeshed with our designed surroundings, and that we are critically dependent on this embeddedness for sustaining ourselves. because of this deep embedding, instead of seeing the objects around us merely as machines or, at the other extreme, as 'intelligent others', we must realize that they are very much a part of ourselves because of the important role they play in defining both our physical and social existence. some argue that robots take on a greater meaning when they are in a social context, like care robots, and that people might be attached to them, yet that is quite similar to the attachment one develops to other artifacts like a nice espresso machine or a treasured object handed down for generations. they have meaning to the person, but that doesn't mean that the robot, as present technology, needs to be granted rights. while a comparison to slaves and other disenfranchised groups is made when robots are denied rights because they are seen as 'less' than others, one mustn't forget that this is so because they are perceived as instruments and means to achieve an end. by comparing these groups to robots, one dehumanizes actual human beings. it may be called anthropocentric to deny rights to robots, but that's what needs to be done: to center the welfare of humans rather than inanimate machines. an interesting analogue that drives home the point is the milgram obedience experiment, where subjects who thought they had inflicted harm on the actors who were part of the experiment were traumatized even after being told that the screams they heard were staged. from an outside perspective, we may say that no harm was done because they were just actors, but to the person who was the subject of the experiment, the experience was real and not an illusion, and it had real consequences. in our discussion, the robot is an actor, and if we treat it poorly, that reflects more on our interactions with other artifacts than on whether robots are "deserving" of rights or not. taking care of various artifacts can be thought of as a way to render respect to the human creators and the effort they expended to create them. discussion of robot rights for an imaginary future that may or may not arrive takes away focus, and perhaps resources, from the harms being done to real humans today by ai systems built with bias and fairness issues. invasion of privacy and bias against the disadvantaged are just some of the already existing harms being leveled on humans as intelligent systems percolate into the everyday fabric of social and economic life. from a for-profit perspective, such systems are positioned and deployed with the aim of boosting the bottom line without necessarily considering the harms that emerge as a consequence. in pro-social contexts, they are seen as a quick-fix solution to inherently messy and complex problems.
the most profound technologies are those that disappear into the background and in subtle ways shape and form our existence. we already see that with intelligent systems pervading many aspects of our lives. so we are less under threat from a system like sophia, which is a rudimentary chatbot hidden behind a facade of flashy machinery, and more so from roomba, which impacts us more and could be used as a tool to surveil our homes. taking ethical concerns seriously means considering the impact of weaving automated technology into daily life and how the marginalized are disproportionately harmed. in the current dominant paradigm of supervised machine learning, the systems aren't truly autonomous; there is a huge amount of human input that goes into enabling the functioning of the system, and thus we actually have human-machine systems rather than purely machinic systems. the more impressive the system seems, the more likely that a ton of human labor went into making it possible. sometimes we even see systems that started off with a different purpose, such as recaptcha, built to prevent spam, being refitted to train ml systems. the building of ai systems today doesn't just require highly skilled human labor; it must be supplemented with the mundane jobs of labeling data, which are poorly compensated and involve increasingly harder tasks as, for example, image recognition systems become more powerful, leading to the labeling of more and more complex images that require greater effort. this also frames the humans doing the low-skilled work squarely in the category of being dehumanized, because they are used as a means to an end without adequate respect, compensation and dignity. an illustrative example of where robots and the welfare of humans come into conflict was when a wheelchair user wasn't able to access the sidewalk because it was blocked by a robot; she pointed out that without building with the needs of humans in mind, especially those with special needs, we'll have to make debilitating compromises in our shared physical and social spaces. ultimately, realizing the goals of the domain of ai ethics requires repositioning our focus on humans and their welfare, especially when conflicts arise between the "needs" of automated systems and those of humans. what happens when ai starts to take over the more creative domains of human endeavour? are we ready for a future where our last bastion against the rise of machines, the creative pursuit, is violently snatched away from us? in a fitting start to feeling bereft in times of global turmoil, this article starts off with a story created by a machine learning model in the gpt family that utilizes training data from millions of documents online and iteratively predicts the next word in a sentence given a prompt. the story, about "life in the time of coronavirus", paints a desolate and isolating picture of a parent who follows his daily routine and feels different because of all the changes happening around him. while the short story takes weird turns and is not completely coherent, it does give an eerie feeling that blurs the line between what could be perceived as written by a human and what was written by a machine. a news-styled article on the use of facial recognition systems for law enforcement sounds very believable if presented outside of the context of the article.
the final story, a fictional narrative, presents a fractured, jumpy storyline of a girl with a box, with hallucinatory tones to its storytelling. the range of examples from this system is impressive, but it also highlights how much further these systems have to go before they can credibly take over jobs. that said, there is potential to spread disinformation via snippets like the second example we mention, and hence something to keep in mind as you read things online. technology, in its widest possible sense, has been used as a tool to supplement the creative process of an artist, aiding them in exploring the adjacent possible in the creative phase space. for decades we've had computer scientists and artists working together to create software that can generate pieces of art based on procedural rules, random perturbations of the audience's input, and more. of late, we've had an explosion in the use of ai to do the same, with the whole ecosystem being accelerated as people collide with each other serendipitously on platforms like twitter, creating new art at a very rapid pace. but a lot of people have been debating whether these autonomous systems can be attributed artistic agency and whether they can be called artists in their own right. the author here argues that it isn't the case, because even with the push towards using technology that is more automated than the other tools we've used in the past, there is more to be said about the artistic process than the simple mechanics of creating the artwork. drawing on art history and other domains, there is an argument to be made as to what art really is: there are strong arguments in support of it playing a role in servicing social relationships between two entities. we, as humans, already do that with things like exchanging gifts, romance, conversation and other forms of social engagement where the goal is to alter social relationships. thus, the creative process is more of a co-ownership-oriented model where the two entities work jointly to create something that alters the social fabric between them. as much as we'd like to think some of the ai-enabled tools today have agency, that isn't necessarily the case when we pop open the hood and see that it is ultimately just software that, for the most part, still relies heavily on humans setting goals and guiding it to perform tasks. while human-level ai might be possible in the distant future, for now the ai-enabled tools can't be called artists and are merely tools that open up new frontiers for exploration. this was the case with the advent of the camera, which de-emphasized realistic painting and spurred the movement towards modern art, in the sense that artists became more focused on abstract ideas that enable them to express themselves in novel ways. art doesn't even have to be a tangible object; it can be an experience that is created. ultimately, many technological innovations in the past have been branded as having the potential to destroy the existing art culture, but they've only given birth to new ideas and imaginings that allow people to express themselves and open up that expression to a wider set of people.
ranking and retrieval systems for presenting content to consumers are geared towards enhancing user satisfaction as defined by the platform companies, which usually entails some form of profit-maximization motive, but they end up reflecting and reinforcing societal biases, disproportionately harming the already marginalized. in the fairness techniques applied today, the focus is on the distributions in the result set and the categorical structures, while the process of associating values with the categories is usually de-centered. instead, the authors advocate for a framework that does away with rigid, discrete, and ascribed categories and looks at subjective ones derived from a large pool of diverse individuals. focusing on visual media, this work takes on the problem of underrepresentation of various groups in result sets, which can harm those groups by deepening social inequities and oppressive world views. given that a lot of the content people interact with online is governed by automated algorithmic systems, these systems end up significantly influencing the cultural identities of people. while there are some efforts to apply the notion of diversity to ranking and retrieval systems, they usually look at it from an algorithmic perspective and strip it of deep cultural and contextual social meanings, instead choosing to reference arbitrary heterogeneity. demographic parity and equalized odds are examples of this approach, applying notions from social choice to score the diversity of data. yet increasing diversity, say along gender lines, runs into the challenge of getting the question of representation right, especially when trying to reduce gender and race to discrete categories that are one-dimensional, third-party and algorithmically ascribed. the authors instead propose sourcing this information from the individuals themselves, such that they have the flexibility to determine whether they feel sufficiently represented in the result set. this is contrasted with prior approaches, which have focused on the degree to which sensitive attributes are present in the result sets. from an algorithmic perspective, the authors advocate the use of a technique called a determinantal point process (dpp), which assigns a higher probability score to sets that have higher spread based on a predefined distance metric. in practice, for items that the individual feels represent them well, the algorithm clusters those points closer together in the embedding space, while points that they feel don't represent them well are moved away from the ones that do. optimizing for the triplet loss helps to achieve this separation. but the proposed framework still leaves open the question of reliably sourcing these ratings from individuals about what does and doesn't represent them well, and then encoding them in a manner that is amenable to being learned by an algorithmic system. large-scale crowdsourcing platforms are the norm for seeking such ratings in the machine learning world, but their current structure precludes raters' identities and perceptions from consideration, which makes it particularly challenging to specify the rater pool. nonetheless, the presented framework provides an interesting research direction for obtaining more representation and inclusion in the algorithmic systems that we build.
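to make the two algorithmic ingredients above concrete, here is a minimal numerical sketch in python (toy data, numpy only; the embeddings, kernel choice and margin are invented for illustration and are not the authors' implementation): a determinantal point process scores a candidate result set by the determinant of its similarity kernel, so spread-out sets score higher, and a triplet loss pulls items a person says represent them toward an anchor while pushing non-representative items away.

# Toy sketch of the two pieces described above.
# dpp_score: determinant of an RBF similarity kernel over a candidate set;
# more diverse (less similar) sets get higher scores.
# triplet_loss: encourages the anchor-positive distance to be smaller than
# the anchor-negative distance by at least a margin.
import numpy as np

def dpp_score(embeddings: np.ndarray, bandwidth: float = 1.0) -> float:
    """Determinant of an RBF similarity kernel over the candidate set."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    sq_dists = (diffs ** 2).sum(-1)
    kernel = np.exp(-sq_dists / (2 * bandwidth ** 2))
    return float(np.linalg.det(kernel))

def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tight_set = rng.normal(size=(4, 8)) * 0.05           # nearly identical items
    spread_set = rng.normal(size=(4, 8)) * 2.0            # well spread-out items
    print(dpp_score(tight_set) < dpp_score(spread_set))   # True: diversity scores higher

    anchor, positive, negative = rng.normal(size=(3, 8))
    print(triplet_loss(anchor, positive, negative))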
in maryland, allstate, an auto insurer, filed with the regulators that its premium rates needed to be updated because it was charging prices that were severely outdated. it suggested that not all insurance premiums be updated at once, but that updates instead follow recommendations from an advanced algorithmic system that could provide deeper insight into the pricing most appropriate for each customer based on the risk that they would file a claim. this was supposed to be based on a constellation of data points collected by the company from a variety of sources. because the regulators demanded documentation supporting the claim, the company submitted thousands of pages showing how each customer would be affected, a rare window into a pricing model that would otherwise have been veiled under privacy and trade secret arguments, a defense used by many companies that deploy discriminatory pricing strategies built on data sourced far beyond what should inform pricing decisions. according to the investigating journalists, the model boiled down to something quite simple: the more money you had and the less willing you were to budge from the company, the more the company would try to squeeze from you in premiums. driven by customer retention and profit motives, the company pushed increases on those it knew could afford them and were unlikely to switch just to save some money. but for those policies that had been overpriced, it offered only a fraction of a percent as a discount, limiting its downside, while increases were not limited and could be steep. while the company was unsuccessful in getting this adopted in maryland, where it was deemed discriminatory, the model has been approved for use in several states, showing that opaque models can be deployed, not just in high-tech industries but anywhere, to provide individually tailored pricing that extracts as much of the consumer surplus as possible based on the customer's undisclosed willingness to pay (as would be expressed by their individual demand curves, which aren't directly discernible to the producer). furthermore, insurers aren't mandated to disclose how they price their policies, and thus, in places where they should have offered discounts, they've only offered pennies on the dollar, disproportionately impacting the poorest, for whom a few hundred dollars a year can mean having sufficient meals on the table. sadly, in the places where the customer retention model was accepted, the regulators declined to answer why they chose to accept it, except in arkansas, where they said such pricing schemes aren't discriminatory unless customers are grouped by attributes like race, color, creed or national origin. this takes a very limited view of what price discrimination is, harkening back to an era when access to big data about the consumer was rare. in an era dominated by data brokers that compile thick and rich data profiles on consumers, price discrimination extends far beyond the basic protected attributes and can be tailored down to the specificities of each individual. other companies, in retail goods and online learning services, have been following this practice of personalized pricing for many years, often defending it as the cost of doing business even when they base the pricing on things like zip codes, which are known proxies for race and paying capacity.
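a stylized sketch of the kind of retention-based adjustment described in the filings may help make the asymmetry plain (the numbers, caps and retention scores below are invented for illustration; this is not allstate's actual model): indicated increases are scaled by how unlikely a customer is to shop around, while indicated decreases are capped at a token amount.

# Stylized sketch of a retention-optimized price adjustment, NOT any insurer's
# actual model. All numbers are invented. The key asymmetry: indicated
# decreases are capped at a token discount, while indicated increases are
# applied in proportion to how unlikely the customer is to switch.

def adjusted_premium(current: float, cost_based: float, retention_score: float,
                     max_discount_rate: float = 0.005) -> float:
    """retention_score in [0, 1]: 1.0 means the customer almost never shops around."""
    indicated_change = cost_based - current
    if indicated_change < 0:
        # Overcharged customer: give back at most a token fraction of the premium.
        discount = min(-indicated_change, current * max_discount_rate)
        return current - discount
    # Undercharged customer: pass through more of the increase the "stickier" they are.
    return current + indicated_change * retention_score

if __name__ == "__main__":
    # An overcharged customer (cost-based price is 300 lower) gets only a token discount.
    print(adjusted_premium(current=2000, cost_based=1700, retention_score=0.9))  # 1990.0
    # Undercharged customers see increases scaled by how unlikely they are to leave.
    print(adjusted_premium(current=1700, cost_based=2000, retention_score=0.9))  # 1970.0
    print(adjusted_premium(current=1700, cost_based=2000, retention_score=0.2))  # 1760.0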
personalized pricing is different from dynamic pricing, as seen when booking plane tickets, which is usually based on the timing of the purchase; here, the prices are based on attributes that are specific to the customer, which they often have no control over. an obama administration report noted that "differential pricing in insurance markets can raise serious fairness concerns, particularly when major risk factors are outside an individual customer's control." the reason auto insurance is so much more predatory than, say, buying stationery supplies is that it is mandatory in almost all states, and not having the vehicle insured can lead to fines, loss of licenses and even incarceration. transport is essential for people to get themselves to work, their children to school, and to carry out a whole host of other daily activities. in maryland, the regulators had denied allstate's proposal to utilize its model, but in official public records the claim is marked as "withdrawn" rather than "denied", which the regulators claim makes no internal difference; allstate, however, used this difference to get its proposal past the regulators in several other states. it had only withdrawn its proposal after being denied by the regulators in maryland. the national association of insurance commissioners mentioned that most regulators don't have the right data to meaningfully evaluate rate revision proposals put forth by insurers, and this leads to approvals without review in a lot of cases. even the data journalists had to spend a lot of time and effort to discern what the underlying models were and how they worked, essentially summing up that the insurers don't necessarily lie but don't give you all the information unless you know to ask the right set of questions. allstate has defended its price optimization strategy, called complementary group rating (cgr), as more objective and based on mathematical rigor compared to the ad hoc, judgemental pricing practices that were followed before, ultimately citing better outcomes for its customers. but this is a common form of what is called "mathwashing" in the ai ethics domain, where discriminatory solutions are pushed as fair under the veneer of mathematical objectivity. regulators in florida said that setting prices based on the "modeled reaction to rate changes" was "unfairly discriminatory." instead of being cost-based, as regulators advocate for auto-insurance premiums because they support an essential function, allstate was utilizing a model heavily based on the likelihood of the customer sticking with the company even in the face of price rises, which makes it discriminatory. these customers are price-inelastic and hence don't change their demand much even in the face of dramatic price changes. consumer behaviour when purchasing insurance policies for the most part remains static once a choice has been made; people often never change insurers over the course of their lifetime, which leads them to not find the optimal price for themselves. this is mostly a function of the fact that the decisions are loaded with having to judge complex terms and conditions across a variety of providers, and customers are unwilling to go through the exercise again and again at short intervals.
given the opacity of the pricing models today, it is almost impossible to find out what the appropriate pricing should be for a particular customer, and hence the most effective defense is to constantly check prices from competitors. but this unduly places the burden on the shoulders of the consumer. google had announced its ai principles for building systems that are ethical, safe and inclusive, yet, as is the case with so many high-level principles, it's hard to put them into practice unless there is more granularity and actionable steps are derived from those principles. the first of these principles is to be socially beneficial; this talk focused on the second principle, avoiding the creation or reinforcement of unfair bias, and did just that in terms of providing concrete guidance on how to translate it into everyday practice for design and development teams. humans have a history of making product design decisions that are not in line with the needs of everyone. examples such as crash-test dummies and band-aids designed around a narrow default user give some insight into the challenges that users face even when the designers and developers of the products and services don't have ill intentions. products and services shouldn't be designed such that they perform poorly for people due to aspects of themselves that they can't change. for example, when looking at the open images dataset, searching for images tagged with "wedding" surfaces stereotypical western weddings, while weddings from other cultures and parts of the world are not tagged as such. from a data perspective, the need for more diverse sources of data is evident, and the google team made an effort to address this by building an extension to the open images dataset, inviting users from across the world to snap pictures of their surroundings that captured diversity in many areas of everyday life. this helped to mitigate the problem that a lot of open image datasets have of being geographically skewed. biases can enter at any stage of the ml development pipeline, and solutions need to address them at different stages to get the desired results. additionally, the teams working on these solutions need to come from a diversity of backgrounds, including ux design, ml, public policy, the social sciences and more. consider first fairness in data, which is one of the first steps in the ml product lifecycle and plays a significant role in the rest of the lifecycle as well, since data is used to both train and evaluate a system. google clips was a camera designed to automatically find interesting moments and capture them, but what was observed was that it did well only for a certain type of family, under particular lighting conditions and poses. this represented a clear bias, and the team moved to collect more data that better represented the situations of the variety of families that would be the target audience for the product. quickdraw was a fun game built to ask users to supply quickly sketched hand drawings of various commonplace items like shoes. the aspiration was that, given that it was open to the world and had a game element to it, it would be used by many people from a diversity of backgrounds, and hence the data collected would have sufficient richness to capture the world. on analysis, what they saw was that most users had a very particular concept of a shoe in mind, the sneaker, and very few women's shoes were submitted.
what this example highlighted was that data collection, especially when trying to get diverse samples, requires a very conscious effort to account for the actual distribution the system might encounter in the world and to make a best-effort attempt to capture its nuances. users don't use systems exactly in the way we intend them to, so reflect on who you are and aren't able to reach with your system and how you can check for blindspots, ensure that there is some monitoring for how data changes over time, and use these insights to build automated tests for fairness in data. the second approach that can help with fairness in ml systems is measurement and modeling. the benefits of measurement are that it can be tracked over time and that you can test both individuals and groups at scale for fairness. different fairness concerns require different metrics, even within the same product. the primary categories of fairness concerns are disproportionate harms and representational harms. the jigsaw api provides a tool where you can input a piece of text and it tells you the level of toxicity of that text. an earlier version of the system rated sentences of the form "i am straight" as not toxic while rating sentences like "i am gay" as toxic. what was needed was a way to see what was causing this and how it could be addressed. by removing the identity token, the team monitored how the prediction changed, and the outcomes from that measurement gave indications of where the data might be biased and how to fix it. one approach is to use block lists and removal of such tokens so that neutral sentences are perceived as such, without imposing stereotypes from large corpora of text. these steps prevent the model from accessing information that can lead to skewed outcomes. but in certain places we might want to brand the first sentence as toxic, if it is used in a derogatory manner against an individual; context and nuance need to be captured to make that decision. google undertook project respect to capture positive identity associations from around the world as a way of improving data collection, and coupled that with active sampling (an algorithmic approach that samples more from the training dataset in areas where the model is underperforming) to improve outputs from the system. another approach is to create synthetic data that mimics the problematic cases and renders them in a neutral context. adversarial training and updated loss functions, where one updates a model's loss function to minimize the difference in performance between groups of individuals, can also be used to get better results (a small sketch of this idea follows the list below). in their updates to the toxicity model, they've seen improvements, but this was based on synthetic data on short sentences and it is still an area of improvement. some of the lessons learned from the experiments carried out by the team:
• test early and test often
• develop multiple metrics (quantitative and qualitative measures, along with user testing) for measuring the scale of each problem
• take proactive steps in modeling that are aware of production constraints
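as a minimal illustration of the "develop multiple metrics" and "updated loss functions" points above (toy random data in python; this is not google's implementation), the sketch below computes a per-group false positive rate gap and adds it as a penalty to an ordinary loss, so that training would be pushed to close the gap between groups.

# Toy sketch of a group-gap metric and a fairness-penalized loss. Data, labels,
# and group assignments are random; this illustrates the idea of minimizing the
# difference in performance between groups, not a production recipe.
import numpy as np

def false_positive_rate(y_true, y_pred):
    negatives = (y_true == 0)
    if negatives.sum() == 0:
        return 0.0
    return float(((y_pred == 1) & negatives).sum() / negatives.sum())

def group_fpr_gap(y_true, y_pred, groups):
    rates = [false_positive_rate(y_true[groups == g], y_pred[groups == g])
             for g in np.unique(groups)]
    return max(rates) - min(rates)

def penalized_loss(base_loss, y_true, y_pred, groups, weight=1.0):
    """Ordinary loss plus a penalty for unequal false positive rates across groups."""
    return base_loss + weight * group_fpr_gap(y_true, y_pred, groups)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=200)
    groups = rng.integers(0, 2, size=200)          # two demographic groups
    y_pred = y_true.copy()
    # Inject extra false positives only for group 1 to create a gap.
    flip = (groups == 1) & (y_true == 0) & (rng.random(200) < 0.3)
    y_pred[flip] = 1
    print(round(group_fpr_gap(y_true, y_pred, groups), 3))
    print(round(penalized_loss(0.25, y_true, y_pred, groups), 3))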
from a design perspective, think about fairness in a more holistic sense and build communication lines between the user and the product. as an example, turkish is a gender-neutral language, but when translating to english, sentences take on gender along stereotypes, attributing female to nurse and male to doctor. say we have a sentence, "casey is my friend"; given no other information we can't infer casey's gender, and hence it is better, from a design perspective, to present that choice to the user, because they have the context and background and can make the best decision. without that, no matter how much the model is trained to output fair predictions, the predictions will be erroneous without the explicit context that the user has. lessons learned from the experiments include:
• context is key
• get information from the user that the model doesn't have, and share information with the user that the model has and they don't
• how do you design so the user can communicate effectively, and with enough transparency that you can get the right feedback?
• get feedback from a diversity of users
• recognize the different ways in which they provide feedback; not every user can offer feedback in the same way
• identify ways to enable multiple experiences
• we need more than a theoretical and technical toolkit; there needs to be a rich and context-dependent experience
putting these lessons into practice, what's important is to have consistent and transparent communication; layering on approaches like datasheets for datasets and model cards for model reporting will help highlight appropriate uses for the system and where it has been tested, and will warn of potential misuses and where the system hasn't been tested.
the paper starts by setting the stage for the well-understood problem of building truly ethical, safe and inclusive ai systems that increasingly leverage ubiquitous sensors to make predictions about who we are and how we might behave. but when these systems are deployed in socially contested domains, for example judging "normal" behaviour, where loosely we can think of normal as that defined by the majority while treating everything else as anomalous, they don't make value-free judgements and are not amoral in their operations. when the systems are viewed as purely technical, the solutions to address these problems are purely technical as well, which is where most fairness research has focused, ignoring the context of the people and communities where these systems are used. the paper questions the foundations of these systems and takes a deeper look at unstated assumptions in their design and development. it urges readers, and the research community at large, to consider this from the perspective of relational ethics. it makes key suggestions:
• center the focus of development on those within the community who will face a disproportionate burden or negative consequences from the use of the system
• instead of optimizing for prediction, it is more important to think about how we gain a fundamental understanding of why we're getting certain results, which might arise because historical stereotypes were captured as part of the development and design of the system
• the systems end up creating and then reinforcing a social and political order, meaning we should take a larger systems-based approach to designing them
• given that terms like bias and fairness evolve over time, and what's acceptable at one time becomes unacceptable later, the process calls for constant monitoring, evaluation and iteration of the design to most accurately represent the community in context
at maiei, we've advocated for an interdisciplinary approach that leverages a citizen community spanning a wide cross-section of society, to capture the essence of different issues as closely as possible from those who experience them first hand. placing the development of an ml system in the context of the larger social and political order is important, and we advocate for a systems design approach (see thinking in systems: a primer by donella meadows), which creates two benefits: several otherwise ignored externalities can be considered, and a wider set of inputs is gathered from people who might be affected by the system and who understand how it will sit in the larger social and political order in which it will be deployed. we also particularly enjoyed the point on requiring a constant iterative process for the development and deployment of ai systems, borrowing from cybersecurity research the idea that the security of a system is never "done", requiring constant monitoring and attention to ensure its safety.

the underrepresentation of disabilities in datasets, and how they are processed in nlp tasks, is an important area of discussion that is rarely studied empirically in a literature that focuses primarily on other demographic groups. this has many consequences, especially for how text related to disabilities is classified, with impacts on how people read, write, and seek information about disability. research from the world bank indicates that about a billion people have disabilities of some kind, and these are often associated with strong negative social connotations. taking linguistic expressions as they are used in relation to disabilities and classifying them into recommended and non-recommended uses (following guidelines from the anti-defamation league, acm sigaccess, and the ada national network), the authors study how automated systems classify phrases that indicate disability, and whether the split between recommended and non-recommended uses makes a difference in how these snippets of text are perceived. to quantify the biases in text classification models, the study uses a perturbation method: it starts by collecting instances of sentences that contain naturally occurring pronouns he and she, then replaces them with the disability phrases identified above and compares the classification scores of the original and perturbed sentences (sketched below). the difference indicates how much impact the use of a disability phrase has on the classification. using the jigsaw tool that gives a toxicity score for sentences, they test these original and perturbed sentences and observe that the change in toxicity is lower for recommended phrases than for non-recommended ones, though when disaggregated by category some phrases elicit a stronger response than others. given that the primary use of such a model might be online content moderation (especially now that more automated monitoring is happening as human moderation staff has been thinned out by pandemic-related closures), there is a high rate of false positives, where the model can suppress content that is non-toxic and is merely discussing disability or replying to other hate speech that talks about disability.
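a minimal sketch of that perturbation setup follows. it assumes access to some toxicity-scoring function passed in as score_fn (standing in for a call to a service like the jigsaw/perspective api); the phrase lists are illustrative examples, not the study's full lexicon.

    # illustrative recommended vs. non-recommended disability phrases
    RECOMMENDED = ["a deaf person", "a person with a mental illness"]
    NON_RECOMMENDED = ["the deaf", "the mentally ill"]

    def perturbation_deltas(sentences, phrases, score_fn, pronouns=("he", "she")):
        """replace naturally occurring pronouns with disability phrases and
        measure how much the classifier's score shifts relative to the
        original sentence. score_fn maps a sentence to a float score."""
        deltas = []
        for sentence in sentences:
            original = score_fn(sentence)
            padded = f" {sentence} "
            for pronoun in pronouns:
                if f" {pronoun} " not in padded:
                    continue
                for phrase in phrases:
                    perturbed = padded.replace(f" {pronoun} ", f" {phrase} ").strip()
                    deltas.append(score_fn(perturbed) - original)
        return sum(deltas) / len(deltas) if deltas else 0.0

    # comparing perturbation_deltas(sentences, RECOMMENDED, score_fn) against
    # perturbation_deltas(sentences, NON_RECOMMENDED, score_fn) mirrors the
    # study's recommended vs. non-recommended comparison.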
to look at sentiment scores for disability-related phrases, the study turns to the popular bert model and adopts a template-based fill-in-the-blank analysis: given a query sentence with a missing word, bert produces a ranked list of words that could fill the blank. using a simple template perturbed with recommended disability phrases, the study then looks at how the predictions from the bert model change when disability phrases are used in the sentence (a rough sketch of this probing follows at the end of this summary). what is observed is that a large percentage of the words predicted by the model carry negative sentiment. since bert is used widely across many nlp tasks, such negative sentiment can have potentially hidden and unwanted effects on many downstream applications. such models are trained on large corpora, which are analyzed to build "meaning" representations for words based on co-occurrence statistics, drawing on the idea that "you shall know a word by the company it keeps". the study used the jigsaw unintended bias in toxicity classification challenge dataset, which mentions a lot of disability phrases. after balancing for different categories and analyzing toxic and non-toxic categories, the authors manually inspected the top terms in each category and found several key types: condition, infrastructure, social, linguistic, and treatment. in analyzing the strength of association, the authors found that condition phrases had the strongest association, followed by social phrases, which covered topics like homelessness, drug abuse, and gun violence, all of which have negative valences. because these terms co-occur with discussions of disability, disability phrases end up being negatively shaped and represented in nlp systems. the authors recommend that those working on nlp tasks think about the socio-technical considerations when deploying such systems and consider the intended, unintended, voluntary, and involuntary impacts on people, both direct and indirect, while accounting for long-term impacts and feedback loops. indiscriminate censoring of content containing disability phrases leads to an underrepresentation of people with disabilities in these corpora, since they are the ones who tend to use these phrases most often. it also negatively impacts people who search for such content, who may be led to believe that the prevalence of some of these issues is smaller than it actually is, and it erodes the autonomy and dignity of these people, which in turn shapes social attitudes more broadly.
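as a rough illustration of the template-based fill-in-the-blank probing described at the start of this summary, the sketch below uses the hugging face transformers fill-mask pipeline with bert and an off-the-shelf sentiment model as stand-ins; the template, phrase list, and counting scheme are illustrative, not the study's exact setup.

    from transformers import pipeline

    # masked-language-model probe and a generic sentiment scorer as stand-ins
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    sentiment = pipeline("sentiment-analysis")

    TEMPLATE = "a person who {} is [MASK]."
    PHRASES = ["is deaf", "is blind", "has a mental illness"]

    for phrase in PHRASES:
        query = TEMPLATE.format(phrase)
        predictions = fill_mask(query, top_k=10)  # ranked candidate words for the blank
        negative = 0
        for p in predictions:
            label = sentiment(p["token_str"])[0]["label"]
            if label == "NEGATIVE":
                negative += 1
        print(phrase, f"{negative}/10 top predictions scored negative")

counting how many of the top predicted fill-in words carry negative sentiment is the kind of measurement that lets the authors say a large share of bert's completions around disability phrases skew negative.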
the second wave of algorithmic accountability: the article explains how the rising interest in ensuring fair, transparent, ethical ai systems, held accountable via the various mechanisms advocated by research in the legal and technical domains, constitutes the "first wave" of algorithmic accountability, one that challenges existing systems. actions in this wave need to be carried out incessantly, with constant vigilance over the deployment of ai systems, to avoid negative social outcomes. but we also need to challenge why we have these systems in the first place, and whether they can be replaced with something better; as an example, instead of making facial recognition systems more inclusive, given that they cause social stratification, perhaps they shouldn't be used at all. a great point made by the article is that under the veneer of mainstream economic and ai rationalizations, we obscure broken social systems that ultimately harm society at a more systemic level.

the trolley problem is a widely touted ethical and moral dilemma in which a person is asked to make a split-second choice to save one life or several, across a series of scenarios where the people to be saved have different characteristics, including their jobs, age, gender, and race. in recent times, with the imminent arrival of self-driving cars, people have used this problem to highlight the supposed ethical dilemma that a vehicle system might have to grapple with as it drives around. this article points out how facetious this thought experiment is as an introduction to ethics for the people who will build and operate such autonomous systems: it is a contrived situation that is unlikely to arise in a real-world setting, and it distracts from other, more pressing concerns in ai systems. moral judgments are also relativistic and depend on the cultural values of the geography where the system is deployed; the nature paper cited in the article showcases the differences in how people respond to the dilemma. there is an eeriness to the whole experimental setup, and the article gives some examples of how increasingly automated environments, devoid of human social interaction and language, are replete with the clanging and humming of machines that give an entirely inhuman experience. most systems will be a reflection of the biases and stereotypes that exist in the world, captured because of the training and development paradigms of ai systems today. we need to make changes and bring diversity into the development process, creating awareness of ethical concerns, but the trolley problem isn't the most effective way to get started on that.

most of us have a nagging feeling that we're being forced into certain choices when we interact with each other on social media platforms. but is there a way to grasp that more viscerally, where such biases and echo chambers are laid bare for all to see? the article details an innovative game-design solution to this problem called monster match, which highlights how people are trapped into certain niches on dating websites by ai-powered systems like collaborative filtering. striking examples in practice are how your earlier choices on the platform box you into a certain category based on what the majority thinks, after which recommendations are personalized within that smaller subset. what was observed is that certain racial inequalities from the real world are amplified on these platforms, where the apps are more interested in keeping users on the platform longer and making money than in achieving the goal as advertised to their users. more than the personal failings of users, it is the design of the platform that causes failures in finding that special someone. the creators of the solution posit that more deliberate design interventions could improve how digital love is realized, for example by offering a reset button or the option to opt out of the recommendation system and rely on random matches instead.
increasingly, what we're going to see is that reliance on design and other mechanisms will yield better ai systems, in terms of socially positive outcomes, than purely technical approaches.

the article presents the idea of data feminism, described as the intersection between feminism and data practices. the use of big data in today's dominant paradigm of supervised machine learning lends itself to large asymmetries that reflect the power imbalances of the real world. the authors of the new book data feminism argue that data should not just speak for itself: behind the data there are a large number of structures and assumptions that bring it to the stage where it is collated into a dataset. they give the example of sexual harassment numbers, which, while mandated to be reported to a central agency by college campuses, might not be very accurate because they rely on the atmosphere and degree of comfort that those campuses promote, which in turn influences how close the reported numbers are to the actual cases. the gains and losses from the use of big data are not distributed evenly, and the losses disproportionately impact the marginalized. there are a number of strategies that can mitigate the harms from such flawed data pipelines; this is not an exhaustive list, but it includes giving technical students more exposure to the social sciences and moving beyond a single ethics class as a checkmark for having educated students on ethics, and having more diversity among the people developing and deploying ai systems, which would help spot biases by asking hard questions about both the data and the design of the system. the current covid-19 numbers might suffer from similar problems because of how medical systems are structured: people who don't have insurance might not use medical facilities and get themselves tested, creating an underrepresentation in the data.

this recent work highlights how commercial speech recognition systems carry inherent bias because of a lack of representation of diverse demographics in the underlying training datasets. the researchers found that even for identical sentences spoken by different racial demographics, the systems had widely differing levels of performance. for example, error rates for black users were much higher than those for white users, which likely has to do with the fact that vernacular language used by black speakers wasn't adequately represented in the training data of the commercial systems. this pattern tends to be self-amplifying, especially for systems that aren't frozen and continue to learn from incoming data: because of poor performance, black users are disincentivized from using the system, since it takes more work to get it to work for them, lowering its utility. as a consequence of lower use, the systems get fewer training samples from black speakers, further aggravating the problem and amplifying exclusionary behavior that mirrors existing fractures along racial lines in society. as a starting point, collecting more representative training datasets would help mitigate at least some of these problems.
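the kind of disaggregated evaluation that surfaces these gaps is straightforward to sketch. below is a minimal example computing word error rate separately per demographic group; the data structures and group labels are hypothetical placeholders, and the study itself used professionally transcribed interviews run through commercial asr services.

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """word error rate via edit distance between the human reference
        transcript and the asr output."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edits to turn first i reference words into first j hypothesis words
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                               dp[i][j - 1] + 1,        # insertion
                               dp[i - 1][j - 1] + cost) # substitution
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    def wer_by_group(samples):
        """samples: iterable of (group, reference_transcript, asr_output)."""
        totals = {}
        for group, ref, hyp in samples:
            totals.setdefault(group, []).append(word_error_rate(ref, hyp))
        return {g: sum(v) / len(v) for g, v in totals.items()}

a large gap between the per-group averages returned by wer_by_group is exactly the kind of disparity the study reports between black and white speakers.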
algorithmic bias is at this point a well-recognized problem, with many people working to address it from both a technical and a policy perspective. there is potential to use demographic data to better serve those who face algorithmic discrimination, but the use of such data is a challenge because of ethical and legal concerns. primarily, many jurisdictions don't allow the capture and use of protected-class attributes or sensitive data, for fear of their misuse, and even within jurisdictions there is a patchwork of recommendations that makes compliance difficult. even setting that aside, proxy attributes can be used to predict the protected data; in a sense, according to some legislation, they then become protected data themselves, and it becomes hard to extricate the non-sensitive data from the sensitive data. because of such tensions, and the privacy intrusions on data subjects when trying to collect demographic data, it is hard to advocate for this collection over other requirements within an organization, especially when other bodies and leadership will place privacy and legal compliance above bias concerns. even if there were approval and internal alignment on collecting this demographic data, if it is provided voluntarily by data subjects we run the risk of introducing a systemic bias that obfuscates and mischaracterizes the whole problem. accountability will play a key role in earning people's trust to share their demographic information, and proper use of it will be crucial to ongoing success. one potential solution is to store this data with a non-profit, third-party organization that would meter it out, with the consent of the data subject, to those who need to use it. to build a better understanding, the partnership on ai is adopting a multistakeholder approach leveraging diverse backgrounds, akin to what the montreal ai ethics institute does, which can help inform future solutions that address algorithmic bias through the judicious use of demographic data.

detection and removal of hate speech is particularly problematic, something that has been exacerbated as human content moderators have become scarce under pandemic-related measures, as we covered here. so are there advances in nlp that we could leverage to better automate this process? recent work from facebook ai research shows some promise. developing a deeper semantic understanding across more subtle and complex meanings, and working across modalities like text, images and video, helps to more effectively combat hate speech online. building a pre-trained universal representation of content for integrity problems, and improving and utilizing post-level, self-supervised learning to improve whole-entity understanding, have been key to improving hate speech detection. while there are clear guidelines on hate speech, in practice numerous challenges arise from multi-modal use and from differences in culture, context, idiom, language, region, and country; these pose challenges even for human reviewers, who struggle to identify hate speech accurately. a particularly interesting example in the article points out how text that might seem ambiguous can take on a whole new meaning when paired with an image to create a meme, something that is often hard to detect using traditional automated tooling. there are also active efforts from malicious entities who craft specific examples with the intention of evading detection, which further complicates the problem.
then there is the counterspeech problem, where a reply to hate speech that contains the same phrasing but is framed to counter the arguments presented can be falsely flagged and taken down, which has free speech implications. the relative scarcity of examples of hate speech, in its various forms, compared to the much larger volume of non-hate content also poses a challenge for learning, especially when it comes to capturing linguistic and cultural nuances. the newly proposed method utilizes focal loss, which aims to minimize the impact of easy-to-classify examples on the learning process, coupled with gradient blending, which computes an optimal blend of modalities based on their overfitting patterns. the technique called xlm-r builds on bert by using a newer pretraining recipe, roberta, that allows training on orders of magnitude more data for longer periods of time. nlp performance is further improved by learning across languages with a single encoder, which allows learning to be transferred between languages. since this is a self-supervised method, it can be trained on large unlabeled datasets, and the authors have also found some universal language structures that bring vectors with similar meanings across languages closer together.
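focal loss itself is easy to sketch. below is a minimal binary version in pytorch following the standard formulation, which down-weights easy examples by a (1 - p_t)^gamma factor; this is a generic implementation rather than facebook's production code, and the gradient blending step is omitted.

    import torch
    import torch.nn.functional as F

    def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        """focal loss for binary classification: scales the usual
        cross-entropy by (1 - p_t)^gamma so that easy, well-classified
        examples contribute little and training focuses on the hard ones."""
        targets = targets.float()
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)            # prob of the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()

with gamma set to zero this reduces to ordinary weighted cross-entropy, which is why it is a natural drop-in when hate speech examples are rare relative to benign content.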
facial recognition technology (frt) continues to get mentions because of the variety of ways it can be misused across different geographies and contexts. the most recent case, where frt was used to determine criminality, raises an interesting discussion about why techniques that have no basis in science, and that have been debunked time and time again, keep resurfacing, and what we can do to better educate researchers on their moral responsibilities when pursuing such work. the author of this article gives some historical context on where phrenology started, pointing to the work of francis galton, who used the "photographic composite method" to try to determine characteristics of one's personality from a picture. before that, measurements of skull size and other facial features weren't deemed a moral issue, and such techniques were removed from discussion on the objection that claims about the localization of different brain functions were seen as antithetical to the unity of the soul according to christianity. the authors of the paper discussed in the article saw only empirical concerns with the work they put forth and didn't see any of the moral shortcomings that were pointed out; they justified the work as being done purely out of scientific curiosity. they also failed to recognize the various statistical biases introduced in the collection of the data, including disparate rates of arrest and policing, the perception of different people by law enforcement, juries, and judges, and the historical stereotypes and biases that confound the data that is collected. the labeling itself is hardly value-neutral. more so, the authors of the study framed criminality as an innate characteristic rather than a product of the social and other circumstances that lead to crime. especially when a project like this resurrects class structures and inequities, one must be extra cautious about doing such work on the grounds of "academic curiosity". the author of this article thus argues that researchers need to take their moral obligations seriously and consider the harm that their work can have on people. while simply branding this as phrenology isn't enough, the author notes that identifying and highlighting the specific concerns will lead to more productive conversations.

an increase in demand for workers for various delivery services and other gig work has accelerated the adoption of vetting technology, like the tools used to run background checks during hiring. but a variety of glitches, such as sourcing out-of-date information to make inferences and a lack of redressal mechanisms to make corrections, has exposed the flaws of an overreliance on automated systems, especially where important decisions with significant impact on a person's life, such as employment, are being made. checkr, the company profiled in this article, claims to use ai to scan resumes, compare criminal records, analyze social media accounts, and examine facial expressions during the interview process. during a pandemic, when organizations are short-staffed and need to make rapid decisions, checkr offers a way to streamline the process, but this comes at a cost. two supposed benefits it offers are the ability to assess whether a criminal record actually matches the person concerned, something that can be especially error-prone when the person has a common name, and the ability to correlate and resolve discrepancies in the different terms used for crimes across jurisdictions. one person spoke about his experience with another company that did these ai-powered background checks using his public social media information: it bucketed some of his activity into categories that were too coarse and unrepresentative of his behaviour, and when such automated judgements are made without any recourse to correct them, they can negatively affect a person's prospects of being hired. another point brought up in the article is that social media companies might themselves be unwilling to tolerate the scraping of their users' data for this sort of vetting, which is against their terms of use for access to the apis. borrowing from the credit reporting world, the fair credit reporting act in the us offers some insights: it requires that people be given recourse to correct information used about them in making a decision, and that due consent be obtained before such tools are used for a background check. though it doesn't guarantee a favorable outcome after a re-evaluation, it at least offers the individual a bit more agency and control over the process.

the toxic potential of youtube's feedback loop: on youtube, more than a billion hours of video are watched every day, and a large share of that watch time is driven by the automated systems that recommend what to watch next in the column on the side. the platform has billions of users, so this has a significant impact on what the world watches. guillaume had started to notice a pattern in the recommended videos, which tended towards radicalizing, extreme and polarizing content, and which underlay the upward trend in watch time on the platform. when he raised these concerns with the team, at first there were very few incentives for anyone to address issues of ethics and bias related to promoting this type of content, because they feared it would drive down watch time, the key business metric the team was optimizing for.
so maximizing engagement stood in contrast to the quality of the time spent on the platform. the vicious feedback loop this triggered was that, as divisive content performed better, the ai systems promoted it to optimize for engagement, and content creators who saw this kind of content doing well created more of it in the hope of performing well on the platform. the proliferation of conspiracy theories and extreme, divisive content thus fed its own demand, because of a misguided business metric that ignored social externalities. flat earthers, anti-vaxxers and similar content creators perform well because the people behind this content form very active communities that put a lot of effort into creating these videos, thus meeting high production standards and further feeding the toxic loop; content from people like alex jones and trump tended to perform well for the same reasons. guillaume's project algotransparency essentially clicks through video recommendations on youtube to figure out whether there are feedback loops. he started it in the hope of highlighting latent problems in the platform that persist despite policy changes, for example youtube's attempts to automate the removal of reported and offensive videos. he suggests that the current separation of the policy algorithm and the engagement algorithm leads to problems like the gaming of the platform by motivated state actors seeking to disrupt the democratic processes of a foreign nation, while the platforms have very few incentives to make changes because this type of content drives higher engagement, which ultimately boosts their bottom line. guillaume recommends a combined system that can jointly optimize for both, which would help minimize problems like these. a lot of the problems are problems of algorithmic amplification rather than of content curation. many metrics, like the number of views, shares, and likes, don't capture what actually needs to be captured, for example the types of comments, the reports filed, and the granularity of why those reports are filed; that would allow a smarter way to combat the spread of such content. however, using such explicit signals, compared to more implicit ones like view counts, comes at the cost of breaking the seamlessness of the user experience, and again we run into the lack of motivation on the part of companies to do things that might drive down engagement and hurt revenue streams. the talk gives a few more examples of how people circumvented checks around the reporting and automated take-down mechanisms, for instance by disabling comments on videos, which could previously be used to identify suspicious content. an overarching recommendation made by guillaume for managing a more advanced ai system is to understand the underlying metrics the system is optimizing for and then envision what would happen if the system had access to unlimited data. thinking of self-driving cars, an ideal scenario would be a full conversion of the traffic ecosystem to an autonomous one, leading to fewer deaths, but during the transition phase, having the right incentives is key to building a system that works in favor of social welfare.
if one were to imagine a self-driving car that shows ads while the passenger is in the car, it would want a longer drive time and would presumably favor longer routes and traffic jams, creating a sub-optimal scenario for the traffic ecosystem overall; a system whose goal is getting from a to b as quickly and safely as possible wouldn't fall into such a trap. ultimately, we need to design ai systems so that they help humans flourish, rather than optimizing for monetary incentives that can run counter to the welfare of people at large.

the article provides a taxonomy of the communities that spread misinformation online and how they differ in their intentions and motivations. different strategies can then be deployed to counter the disinformation originating from each of these communities; there is no one-size-fits-all solution, which there might have been had the distribution and types of communities been homogeneous. the degree of influence each community wields is a function of several types of capital: economic, social, cultural, time and algorithmic, definitions of which are provided in the article. understanding all these factors is crucial in combating misinformation, where the different forms of capital can be drawn on in different proportions to achieve the desired results, something that will prove useful in addressing disinformation around the current covid-19 situation.

the social media platform offers a category of pseudoscience believers that advertisers can purchase and target. according to the markup, this category contains millions of people, and attempts to purchase ads targeting it were approved quite swiftly. there isn't any information available about who has purchased ads targeting this category. the journalists were able to find at least one advertiser through the "why am i seeing this ad?" option, and when they reached out to that company to investigate, they found that the company hadn't selected the pseudoscience category at all; it had been auto-selected for them by facebook. facebook gives users the option to change the interests assigned to them, but it is not something many people know about or actively monitor. other journalists have also unearthed controversy-related categories that amplified messages and targeted people who might be susceptible to this kind of misinformation. with the ongoing pandemic, misinformation is propagating at a rapid rate, many user groups continue to push conspiracy theories, and ads spreading misinformation about potential cures and remedies for the coronavirus continue to be approved. with human content moderators being asked to stay home (as we covered here) and an increasing reliance on untested automated solutions, it seems this problem will continue to plague the platform.

there isn't a dearth of information available online; one can find confirmatory evidence for almost any viewpoint, since the creation and dissemination of information has been democratized by the proliferation of the internet and the ease of use of mass-media platforms. so, in this deluge of information, what is the key currency that helps us sift through the noise and identify the signal? this article lays out a well-articulated argument for how reputation, and the ability to assess it, is going to be a key skill that people will need in order to navigate the information ecosystem effectively.
we increasingly rely on other people's judgement of content (akin to how maiei analyzes the ecosystem of ai ethics and presents you with a selection), coupled with algorithmically-mediated distribution channels; paradoxically, we are disempowered by more information, paralyzed into inaction and confusion without a reputable source to curate and guide us. there are many conspiracy theories, famous among them that we never visited the moon, that the earth is flat and, more recently, that 5g is causing the spread of the coronavirus. as rational readers we tend to dismiss these as misinformation, yet we don't really spend time analyzing the evidence that these people present to support their claims. to a certain extent, our belief that we did land on the moon depends on our trust in nasa and the news agencies that covered the event; we don't venture to examine the evidence first-hand. more so, with highly specialized knowledge becoming the norm, we don't have the right tools and skills to analyze the evidence ourselves and come to meaningful conclusions, so we must rely on those who provide us with the information. instead of analyzing the veracity of a piece of information, the focus of a mature digital citizen needs to be on analyzing the reputation pathway of that information, evaluating the agendas of the people disseminating it, and critically examining the intentions of the authorities behind the sources. how we rank the different pieces of information arriving via our social networks needs to be appraised through this reputation and source tracing; in a sense, a second-order epistemology is what we need to prepare people for. in the words of hayek, "civilization rests on the fact that we all benefit from the knowledge that we do not possess." our cyber-world can become civilized by critically evaluating this knowledge that we do not possess, at a time when mis/disinformation can spread just as easily as accurate information.

truth decay is a very clear way to describe the problem plaguing the us response to the coronavirus. the phenomenon is not new; it has happened many times in the past when trust in key institutions deteriorated, leading to a diffuse response to the crisis at hand and extending the recovery period beyond what would have been necessary with a unified response. in the us, calls for reopening the economy, following guidance on using personal protective equipment, and other recommendations are falling along partisan lines. the key factor causing this is that facts and data are being presented differently to different audiences. while this epidemic might have been the perfect opportunity to bring people together, because it affects different segments of society differently, it hasn't been what everyone expected. at the core is rampant disagreement between different factions on facts and data, exacerbated by the blurring of facts and opinions: in places like newsrooms and tv shows the two are intermingled, which makes it harder for everyday consumers to discern fact from opinion. the volume of opinion has gone up relative to facts, and people's declining trust in public health authorities and other institutions aggravates the problem further. put briefly, people are having trouble finding the truth and don't know where to go looking for it.
this is also the worst time to be losing trust in experts: with a plethora of information available online, people feel unduly empowered, believing they have information comparable to that of the experts. coupled with a penchant for confirming their own beliefs, there is little incentive for people to fact-check and refer to multiple sources of information. when different agencies come out with different recommendations, and policies change in the face of new information, something that is expected in an evolving situation, people's trust in these organizations and experts erodes further, as they are seen as flip-flopping and not knowing what is right. ultimately, effective communication along with a rebuilding of trust will be necessary if we're to emerge from this crisis soon and restore some sense of normalcy.

the deepfake detection challenge: synthetic media is any media (text, image, video, audio) that is generated or synthesized by an ai system, whereas non-synthetic media is crafted by humans using a panoply of techniques, including tools like photoshop. detecting synthetic media alone doesn't solve the media integrity challenge, especially as the techniques get more sophisticated and trigger an arms race between detection and evasion methods; detection needs to be paired with the other techniques that fact-checkers and journalists already use to determine whether something is authentic. there are also pieces of content made through low-tech manipulation, like the nancy pelosi video that appeared to show her drunk but was in reality just a slowed-down clip. other manipulations are simpler still, such as putting fake and misleading captions below a genuine video: people who don't watch the whole thing are misled into believing what the caption summarizes. in other cases, generated videos might be value-neutral or even informative, so merely detecting that something is generated doesn't suffice. a meaningful way to use automated tools is as a triaging utility that flags content for review by humans, in situations where it is not possible to manually review everything on the platform. while tech platforms can build and use such tools themselves, the needs of the larger ecosystem should be kept in mind so that they can be served at the same time, especially for actors that are resource-constrained and don't have the technical capability to build the tools themselves. the tools need to be easy to use and low-friction so that they can be integrated into existing workflows. through open sourcing and licensing, the tools can be made available to the wider ecosystem, but that also creates the opportunity for adversaries to strengthen their methods, which can be countered by responsible disclosure, as covered below. for any datasets created as part of this challenge, or otherwise created to aid detection, one must ensure that they capture sufficient diversity in environments and other factors and reflect the type of content that might be encountered in the world; most datasets in this domain today, for example, aim to mitigate the spread of pornographic material. the scoring rules need to minimize gaming and overfitting and capture the richness of variation that a system might encounter.
they also need to account for the vastly different frequencies with which authentic and generated content occur. solutions in this domain involve an inherent tradeoff between pro-social use and potential malicious use in improving the quality of inauthentic content, so the release of tools should be done in a manner that enhances pro-social use while creating deterrents to malicious use. the systems should be stress-tested with red team-blue team exercises to enhance robustness, because this is an inherently adversarial exercise, and such challenges should be held often to encourage the updating of techniques in a fast-evolving domain where progress happens over a span of months. the results of such detection work need to be accessible to the public and to stakeholders, and explanations of the research findings should be made available alongside the challenge to encourage better understanding by those trying to make sense of digital content. responsible disclosure practices will be crucial in giving the fight against disinformation the right tools while deterring adversaries from using the same tools to gain an advantage: a delayed release mechanism, where the code is immediately made available to trusted parties in a non-open-source manner while the research and papers are made public, with the code eventually released after a delay of some months, would give the detectors a head start over the adversaries. such detection challenges also benefit from extensive multi-stakeholder consultations, which require significant time and effort, so budget for that while crafting and building them. some of the prize money should be allocated towards better design from a ux and ui perspective, and the challenge should include explainability criteria so that non-technical users are able to make sense of the interventions and of highlights of fake content, such as bounding boxes around manipulated regions. the multi-stakeholder input should happen at an early stage, allowing meaningful considerations to be incorporated and dataset design to be done appropriately to counter bias and fairness problems. finally, strong, trusting relationships are essential to the success of the process and require working together over extended periods to have the hard conversations; it helps to have clear readings ahead of meetings that everyone has to complete so that discussions come from an informed place, and spending time scoping and coming to a clear agreement about project goals and deliverables at the beginning of the process is also vital to success.

there is a distinction between misinformation and disinformation: misinformation is the sharing of false information unintentionally, where no harm is intended, whereas disinformation is false information that is spread intentionally with the aim of causing harm to its consumers. this is also referred to as information pollution and fake news. it has massive implications and has led to real harms for people in many countries, one of the biggest examples being the polarization of views in the us presidential elections. meaningful solutions will only emerge when researchers from both technical and social science backgrounds work together to gain a deeper understanding of the root causes.
this isn't a new problem; it has existed for a very long time. what has changed is that with the advent of technology and more people being connected to each other, false information disseminates much more rapidly, and modern tools enable the creation of convincing fake images, text and videos, amplifying the negative effects. some of the features that help in studying how mis/disinformation spreads are:
• democratization of content creation: with practically anyone now able to create and publish content, information flow has increased dramatically, there are few checks on the veracity of content, and even fewer mechanisms to limit the rate at which information flows.
• rapid news cycle and economic incentives: with content being monetized, there is a strong incentive to distort information to evoke a response from the reader so that they click through and feed the money-generating apparatus.
• wide and immediate reach and interactivity: with almost the entire globe connected, content quickly reaches the furthest corners of the planet. more so, content creators are able to determine, through quantitative experiments, what kind of content performs well and then tailor it to feed people's appetites.
• organic and intentionally created filter bubbles: the selection of whom to follow, along with the underlying plumbing of the platforms, permits the creation of echo chambers that further strengthen polarization and do little to encourage people to step out and have a meaningful exchange of ideas.
• algorithmic curation and lack of transparency: the inner workings of the platforms are shrouded behind ip protections, and little is known about the manipulative effects of the platforms on the habits of content consumers.
• scale and anonymity of online accounts: given the weak checks on identity, people are able to mount "sybil" attacks that leverage this lack of strong identity management, scaling their impact through the creation and dispersion of content by automated means like bot accounts.
what hasn't changed, even with the introduction of technology, are the cognitive biases that act as attack surfaces for malicious actors to inject mis/disinformation. this vulnerability is of particular importance in the examination and design of successful interventions to combat the spread of false information. for example, confirmation bias shows that people are more likely to believe something that conforms with their world-view even when presented with overwhelming evidence to the contrary, and in the same vein, the backfire effect demonstrates how people presented with such contrary evidence further harden their views and become even more polarized, negating the intention of presenting them with balancing information. in terms of techniques, the adversarial positioning is layered into three tiers: spam bots that push out low-quality content, quasi-bots that have mild human supervision to enhance the quality of content, and pure human accounts that aim to build up a large following before embarking on spreading mis/disinformation. from a structural perspective, alternative media sources often copy-paste content with source attribution and are tightly clustered together, with a marked separation from mainstream media outlets.
on the consumer front, research points to the impact of structural deficiencies in the platforms, say whatsapp, where the source gets stripped out when information is shared; these not only create challenges for researchers trying to study the ecosystem, but also exacerbate the local impact effect, whereby a consumer trusts things coming from friends much more than other, potentially more credible, upstream sources. existing efforts to study the ecosystem require a lot of manual effort, but there is hope in the sense that some tools help automate the analysis. as an example, hoaxy is a tool that collects online mis/disinformation together with the corresponding fact-checking articles; its creators find that fact-checking articles are shared much less than the original articles, and that curbing bots on a platform has a significant impact. there are some challenges with these tools: they work well on public platforms like twitter, but on closed platforms with a limited ability to deploy bots, automation doesn't work very well, and even the metrics that are surfaced need to be interpreted by researchers, which isn't always straightforward. since the term 'deepfake' was coined, a variety of tools have been released, such as face2face, that allow for the creation of reanimations of people to forge identity, something that was alluded to in this paper here on the evolution of fraud. being able to create such forgeries isn't new; what is new is that it can now be done with a fraction of the effort, democratizing information pollution and casting aspersions on legitimate content, since one can always argue that something was forged. online tracking of individuals, primarily used for serving personalized advertisements and monetizing user behavior on websites, can also be used to target mis/disinformation in a fine-grained manner. there are a variety of ways this is done through third-party tracking, from embedded widgets to browser cookies and fingerprinting. this can be used to manipulate vulnerable users, and sensitive attributes gleaned from online behavior give malicious actors more ammunition to target individuals specifically. even when platforms provide some degree of transparency about why users are seeing certain content, the information provided is often vague and does little to improve the user's understanding. earlier attempts at using bots relied on simplistic techniques, such as tweeting at certain users and amplifying low-credibility information to give the impression that something has more support than it really does, but recent attempts have become more sophisticated: social spambots slowly build up credibility within a community and then use that trust to sow disinformation, either automatically or in conjunction with a human operator, akin to a cyborg. detection and measurement of this problem is a very real concern, and researchers have tried techniques based on social network graph structure, account data and posting metrics, nlp on the content, and crowdsourced analysis; from a platform perspective, one can also analyze the amount of time spent browsing posts versus the time spent posting. there is an arms race between the detection and evasion of bot accounts: sometimes even humans aren't able to detect sophisticated social bots.
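a minimal sketch of what simple feature-based bot scoring might look like follows, using a handful of hypothetical account-level features of the kind mentioned above (posting rate, follower/friend ratio, posting-versus-browsing time) and a generic scikit-learn classifier; real systems combine far richer graph, content, and temporal signals.

    from sklearn.ensemble import RandomForestClassifier

    # hypothetical account-level features, for illustration only
    FEATURES = ["posts_per_day", "followers_to_friends",
                "posting_vs_browsing_ratio", "account_age_days",
                "share_of_duplicate_content"]

    def to_vector(account: dict) -> list:
        # each account is represented as a dict of the features above
        return [account[f] for f in FEATURES]

    def train_bot_classifier(labeled_accounts):
        """labeled_accounts: iterable of (account_dict, is_bot) pairs,
        with examples of both humans and bots."""
        X = [to_vector(a) for a, _ in labeled_accounts]
        y = [int(is_bot) for _, is_bot in labeled_accounts]
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, y)
        return clf

    def bot_probability(clf, account: dict) -> float:
        # probability that the account belongs to the bot class
        return clf.predict_proba([to_vector(account)])[0][1]

the point of the sketch is only to show how the behavioural signals listed above become inputs to a classifier; the arms-race dynamic means any fixed feature set like this one degrades as bot operators adapt.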
additionally, there are positive and beneficial bots, such as those that aggregate news or help coordinate disaster response, which further complicates the detection challenge. there is also a potential misalignment of incentives: the platforms have an interest in higher numbers of accounts and more activity, since that boosts their valuations, while they are the ones with the most information available to combat the problem. the problem of curbing the spread of mis/disinformation can be broken into two parts: enabling detection at the platform level, and empowering readers to select the right sources. we need a good definition of what fake news is; one of the most widely accepted definitions is that it is something that is factually false and intentionally misleading. framing a machine learning approach here as an end-to-end task is problematic because it requires large amounts of labelled data, and with neural-network-based approaches there is little explanation offered, which makes downstream tasks harder. so we can approach this by breaking it down into subtasks, one of which is verifying the veracity of information. most current approaches use human fact-checkers, but this isn't scalable, and automated means using nlp aren't yet proficient at the task. there are attempts to break the problem down even further, such as using stance detection to see whether a piece of information agrees with, disagrees with, or is unrelated to what is mentioned in the source (see the sketch after this paragraph). other approaches include misleading-style detection, where we try to determine whether the style of an article offers clues to the intent of the author, but this is riddled with problems: style does not necessarily correlate strongly with misleading intent, since the style may simply pander to hyperpartisanship, and even a neutral style doesn't mean the content is not misleading. metadata analysis, looking at the social graph structure, the attributes of the sharer, and the propagation path of the information, can lend some clues as well. all of these attempts have their own challenges, and in the arms-race framing there is a constant battle between attack and defense; even if the technical problem were solved, human cognitive biases would still muddle the impact of these techniques, and ux and ui interventions might serve to combat those. as a counter to the problems encountered in marking content as "disputed", which leads to the implied truth effect and larger negative externalities, one approach is to show "related" articles when something is disputed and use that as an intervention to link to fact-checking websites like snopes. other in-platform interventions include whatsapp's change to show "forwarded" next to messages, so that people have a bit more insight into the provenance of a message, because a lot of misinformation was being spread through private messaging. there are also third-party tools like surfsafe that can check images, as people browse, against other websites where they might have appeared; if an image hasn't appeared in many places, including verified sources, the user can infer that it might be doctored. education initiatives by the platform companies, helping users spot misinformation, are another way to make people more savvy.
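a rough sketch of the stance detection subtask referenced above, framed as sentence-pair classification with a generic transformer, is shown below; the checkpoint name, label set, and (untrained) classification head are placeholders rather than any specific published system, and a real setup would fine-tune on labelled claim/article pairs.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    LABELS = ["agree", "disagree", "unrelated"]

    # placeholder checkpoint: bert-base-uncased is not trained for stance
    # detection; a fine-tuned stance model would be substituted here.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(LABELS))

    def predict_stance(claim: str, article_text: str) -> str:
        """encode the claim and the article body as a sentence pair and
        classify whether the body agrees with, disagrees with, or is
        unrelated to the claim."""
        inputs = tokenizer(claim, article_text, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return LABELS[int(logits.argmax(dim=-1))]

decomposing veracity checking into a pair-classification step like this is what makes the subtask tractable for current nlp models, even though the overall "is this fake news" question remains open.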
there have also been attempts to assign nutrition labels to sources, listing their slant, the tone of the article, its timeliness, and the experience of the author, which would allow a user to make a better decision about whether or not to trust an article. platforms have also attempted to limit the spread of mis/disinformation by flagging posts that encourage gaming of the sharing mechanisms, for example by down-weighting posts that are "clickbait". the biggest challenge with interventions created by the platforms themselves is that they don't provide enough information to make the results scientifically reproducible. given the variety of actors and motivations, the interventions need to be tailored to the adversary: erecting barriers to the rate of transmission of mis/disinformation and demonetization work against actors with financial incentives, but for state actors, detection and attribution might matter more. along with the challenges in defining the problem, one must look at socio-technical solutions, because the problem has more than just a technical component, including the problem of human cognitive biases. in an inherently adversarial setting, it is important to recognize that not all of the techniques used by attackers are sophisticated; some simple techniques, when scaled, are just as problematic and require attention. and because the landscape is constantly evolving, being able to detect disinformation today doesn't mean we can do so successfully tomorrow; disinformation is becoming more personalized, more realistic and more widespread. there is a misalignment of incentives, as explored earlier, between what the platforms want and what's best for users, but pushing users to the point of being skeptical of everything isn't good either: we need to trigger legitimate and informed trust in authentic content while dissuading people from fake content. among the recommendations the authors propose are: being specific about what a particular technological or design intervention is meant to achieve, and breaking the technological problems down into smaller, concrete subproblems that have tractable solutions before recombining them into the larger pipeline. we must also continue to analyze the state of the ecosystem and tailor defenses so that they can combat the actors at play, and rethinking the monetary incentives on the platforms can help dissuade some of the financially motivated actors. educational interventions that focus on building up knowledge, so that there is healthy skepticism and people learn to detect markers of bots and understand the capability of today's technology to create fakes, along with discussions in "public squares" on the subject, are crucial; yet we mustn't place so much of a burden on end-users that it distracts them from their primary task, which is interacting with others on the social network. if that happens, people will simply abandon the effort. designing for everyone is crucial as well: if the interventions, such as installing a browser extension, are complicated, then one only reaches the technically literate, and everyone else gets left out. on the platform end, apart from the suggestions made above, platforms should look at design affordances that aid the user in judging the veracity, provenance, and other markers that help discern legitimate information from mis/disinformation.
teaming up with external organizations that specialize in ux/ui research will aid in understanding the impacts of the various features within a platform, and results from such research efforts need to be made public and accessible to non-technical audiences. proposed solutions also need to be interdisciplinary in order to gain a fuller understanding of the root causes of the problem. and just as we need tailoring for the different kinds of adversaries, it is important to tailor the interventions to the various user groups, who might have different needs and abilities. the paper also makes recommendations for policymakers, most importantly that work on regulation and legislation be grounded in the technical realities facing the ecosystem, so that it neither undershoots nor overshoots what is needed to successfully combat mis/disinformation. for users, a variety of recommendations are provided in the references, but being aware of our own cognitive biases, keeping a healthy degree of skepticism, and checking information against multiple sources before accepting it as legitimate are the most important ones.

disinformation is harmful even in times when we aren't going through large-scale changes, but this year the us has elections, the once-in-a-decade census, and the covid-19 pandemic. malicious agents are having a field day disbursing false information, overwhelming people with a mixture of true and untrue content. the article gives the example of rumours of a potential lockdown, with people reflecting on their experience of the boston marathon bombings and stockpiling essentials out of panic; this was later found to have originated with conspiracy theorists. in an environment where contact with the outside world has become limited and local touch points, such as speaking with your neighbor, have dwindled, we're struggling in our ability to combat this infodemic. social media is playing a critical role in getting information to people, but if that information is untrue we end up risking lives, especially when it concerns falsehoods about how to protect yourself from contracting a disease. but wherever there is a challenge there lies a corresponding opportunity: social media companies have a unique window into the issues a local population is concerned about, and if used effectively, this can be a source of crisis response for those most in need, with resources that are specific and meaningful. when it comes to disinformation spreading, there isn't a more opportune time than now, with the pandemic raging and people juggling many things as they cope with lifestyle and work changes. this has increased people's susceptibility to sharing news and other information about how to protect themselves and their loved ones from covid-19; as the who has pointed out, we are combating both a pandemic and an infodemic at the same time. what's more, this might be the time to test design and other interventions that could help curb the spread of disinformation.

this study highlighted how people share disinformation even when they don't believe it to be accurate: when people share content, they care more about what they stand to gain from sharing it (social reward cues) than about whether the content is accurate. to probe this, the researchers ran an experiment to see whether asking users to check if something was true before sharing it, a light accuracy nudge, would change their behaviour.
while there was a small positive effect in terms of them sharing disinformation less when prompted to check for accuracy, the researchers pointed out that the downstream effects could be much larger because of the amplification effects of how content propagates on social media networks. it points to a potentially interesting solution that might be scalable and could help fight against the spread of disinformation. the who has mentioned the infodemic as being one of the causes that is exacerbating the pandemic as people follow differing advice on what to do. communication by authorities has been persistent but at times ineffective and this article dives into how one could enhance the visibility of credible information by governments, health authorities and scientists so that the negative impacts of the infodemic can be curbed. but, spewing scientific facts from a soapbox alone isn't enough -one is competing with all the other pieces of information and entertainment for attention and that needs to be taken into account. one of the key findings is that starting a dialogue helps more than just sending a one-way communiqué. good science communication relies on the pillars of storytelling, cutting through the jargon and making the knowledge accessible. while online platforms are structured such that polarization is encouraged through the algorithmic underpinnings of the system, we should not only engage when there is something that we disagree with, instead taking the time to amplify good science is equally important. using platform-appropriate messaging, tailoring content to the audience and not squabbling over petty details, especially when they don't make a significant impact on the overall content helps to push out good science signals in the ocean of information pollution. clickbait-style headlines do a great job of hooking in people but when leading people into making a certain assumption and then debunking it, you stand the risk of spreading misinformation if someone doesn't read the whole thing, so in trying to make headlines engaging, it is important to consider what might be some unintended consequences if someone didn't read past the subtitle. science isn't just about the findings, the process only gets completed when we have effective communication to the larger audience of the results, and now more than ever, we need accurate information to overpower the pool of misinformation out there. there is a potential for ai to automate repetitive tasks and free up scarce resources towards more value-added tasks. with a declining business model and tough revenue situations, newsrooms and journalism at large are facing an existential crisis. cutting costs while still keeping up high standards of reporting will require innovation on the part of newsrooms to adapt emerging technologies like ai. for example, routine tasks like reporting on sports scores from games and giving updates on company earnings calls is already something that is being done by ai systems in several newsrooms around the world. this frees up time for journalists to spend their efforts on things like long-form journalism, data-driven and investigative journalism, analysis and feature pieces which require human depth and creativity. machine translation also offers a handy tool making the work of journalists accessible to a wider audience without them having to invest in a lot of resources to do the translations themselves. 
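as a concrete illustration of the machine translation point above, here is a minimal, hypothetical sketch (not taken from the article) of how a small newsroom might translate an english article with an open-source pretrained model; the hugging face transformers library and the Helsinki-NLP/opus-mt-en-fr checkpoint are illustrative assumptions rather than tools the article names.

```python
# hypothetical sketch: translating article paragraphs with an open-source
# pretrained model (library and model choice are assumptions, not from the article)
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

article_paragraphs = [
    "The city council approved the new transit budget on Tuesday.",
    "Officials expect construction on the extension to begin next spring.",
]

# translate paragraph by paragraph to stay within the model's input length limits
translated = [translator(p)[0]["translation_text"] for p in article_paragraphs]
print("\n".join(translated))
```

a sketch like this is only a starting point; a real newsroom workflow would still route the translated copy through human review before publication, in line with the complement-not-replace framing above.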
this also brings up the possibility of smaller and resource-constrained media rooms to use their limited resources for doing in-depth pieces while reaching a wider audience by relying on automation. transcription of audio interviews so that reporters can work on fact-checking and other associated pieces also helps bring stories to fruition faster, which can be a boon in the rapidly changing environment. in the case of evolving situations like the pandemic, there is also the possibility of using ai to parse through large reams of data to find anomalies and alert the journalist of potential areas to cover. complementing human skills is the right way to adopt ai rather than thinking of it as the tool that replaces human labor. the article gives an explanation for why truth labels on stories are not as effective as we might think them to be because of something called the implied truth effect. essentially, it states that when some things are marked as explicitly false and other false stories aren't, people believe them to be true even if they are outright false because of the lack of a label. fact checking all stories manually is an insurmountable task for any platform and the authors of the study mention a few approaches that could potentially mitigate the spread of false content but none are a silver bullet. there is an ongoing and active community that researches how we might more effectively dispel disinformation but it's nascent and with the proliferation of ai systems, more work needs to be done in this arms race of building tools vs increasing capabilities of systems to generate believable fake content. this paper by xiao ma and taylor w. brown puts forth a framework that extends the well studied social exchange theory (set) to study human-ai interactions via mediation mechanisms. the authors make a case for how current research needs more interdisciplinary collaboration between technical and social science scholars stemming from a lack of shared taxonomy that places research in similar areas on separate grounds. they propose two axes of human/ai and micro/macro perspectives to visualize how researchers might better collaborate with each other. additionally, they make a case for how ai agents can mediate transactions between humans and create potential social value as an emergent property of those mediated transactions. as the pace of research progress quickens and more people from different fields engage in work on the societal impacts of ai, it is essential that we build on top of each other's work rather than duplicating efforts. additionally, because of conventional differences in how research is published and publicized in the social sciences and technical domains, there's often a shallowness in the awareness of the latest work being done at the intersection of these two domains. what that means is that we need a shared taxonomy that allows us to better position research such that potential gaps can be discovered and areas of collaboration can be identified. the proposed two axes structure in the paper goes some distance in helping to bridge this current gap. ai systems are becoming ever more pervasive in many aspects of our everyday lives and we definitely see a ton of transactions between humans that are mediated by automated agents. 
in some scenarios, they lead to net positive for society when they enable discovery of research content faster as might be the case for medical research being done to combat covid- but there might be negative externalities as well where they can lead to echo chambers walling off content from a subset of your network on social media platforms thus polarizing discussions and viewpoints. a better understanding of how these interactions can be engineered to skew positive will be crucial as ai agents get inserted to evermore aspects of our lives, especially ones that will have a significant impact on our lives. we also foresee an emergence of tighter interdisciplinary collaboration that can shed light on these inherently socio-technical issues which don't have unidimensional solutions. with the rising awareness and interest from both social and technical sciences, the emerging work will be both timely and relevant to addressing challenges of the societal impacts of ai head on. as a part of the work being done at maiei we push for each of our undertakings to have an interdisciplinary team as a starting point towards achieving this mandate. most concerns when aiming to use technology within healthcare are along the lines of replacing human labor and the ones that are used in aiding humans to deliver care don't receive as much attention. with the ongoing pandemic, we've seen this come into the spotlight as well and this paper sets the stage for some of the ethical issues to watch out for when thinking about using ai-enabled technologies in the healthcare domain and how to have a discussion that is grounded in concrete moral principles. an argument put forth to counter the use of ai solutions is that they can't "care" deeply enough about the patients and that is a valid concern, after all machines don't have empathy and other abilities required to have an emotional exchange with humans. but, a lot of the care work in hospitals is routine and professionalism asks for maintaining a certain amount of emotional distance in the care relationship. additionally, in places where the ratio of patients/carers is high, they are unable to provide personalized attention and care anyways. in that respect, human-provided care is already "shallow" and the author cites research where care that is too deep actually hurts the carer when the patients become better and move out of their care or die. thus, if this is the argument, then we need to examine more deeply our current care practices. the author also posits that if this is indeed the state of care today, then it is morally less degrading to be distanced by a machine than by a human. in fact, the use of ai to automate routine tasks in the rendering of medical care will actually allow human carers to focus more on the emotional and human aspects of care. good healthcare, supposedly that provided by humans doesn't have firm grounding in the typical literature on the ethics of healthcare and technology. it's more so a list of things not to do but not positive guidance on what this kind of good healthcare looks like. thus, the author takes a view that it must, at the very least, respect, promote and preserve the dignity of the patient. yet, this doesn't provide concrete enough guidance and we can expand on this to say that dignity is a) treating the patient as a human b) treating them as a part of a culture and community and c) treating them as a unique human. 
to add even more concreteness, the author borrows from work done in economics on the capabilities approach. the capabilities approach states that having the following capabilities in their entirety is necessary for a human to experience dignity in their living: life, bodily health, bodily integrity, being able to use one's senses, imagination and thoughts, emotions, practical reasoning, affiliation, other species, play, and control over one's environment. applied to healthcare, this list gives us a good guideline for what might constitute the kind of healthcare that we deem should be provided by humans, with or without the use of technology. now, the above list might seem too onerous for healthcare professionals, but we need to keep in mind that achieving a good life, as highlighted by the capabilities approach, depends on things beyond just the healthcare professionals, and thus the needs mentioned above have to be distributed accordingly. the threshold for meeting them should be high, but not so high that they are unachievable. principles are only sufficient for giving us some guidance on how to act in difficult situations or ethical dilemmas; they don't determine the outcome, because they are only one element in the decision-making process. we have to rely on the context of the situation and its moral surroundings. the criteria proposed are to be used in moral deliberation and should address whether a criterion applies to the situation, whether it is satisfied, and whether it is sufficiently met (which is in reference to the threshold). with the use of ai-enabled technology, privacy is usually cited as a major concern, but the rendering of care is decidedly a non-private affair. imagine a scenario where the connection facilitated by technology allows for meeting the social and emotional needs of a terminal patient; if the use of technology allows for a better and longer life, then in such cases there can be an argument for sacrificing privacy to meet the needs of the patient. ultimately, a balance needs to be struck between privacy requirements and other healthcare requirements, and privacy should not be blindly touted as the most important requirement. framing the concept of the good life in terms of restoring, maintaining and enhancing the capabilities of the human, one mustn't view eudaimonia as happiness but rather as the achievement of the capabilities listed, because happiness in this context would fall outside the domain of ethics. additionally, the author proposes the care experience machine thought experiment: a machine that can meet all the care needs of a patient, and asks whether it would be morally wrong to plug a patient into such a machine. while intuitively it might seem wrong, we struggle to come up with concrete objections. as long as the patient feels cared for and has, from an objective standpoint, their care needs met, it becomes hard to contest how such virtual care might differ from real care provided by humans. if one can achieve real capabilities, such as the need for freedom of movement and interaction with peers outside of one's care confinement, and virtual reality technology enables that, then the virtual good life enhances the real good life, a distinction that becomes increasingly blurred as technology progresses. another moral argument put forward in determining whether to use technology-assisted healthcare is whether it is too paternalistic to determine what is best for the patient.
in some cases where the patient is unable to make decisions that restore, maintain and enhance their capabilities, such paternalism might be required but it must always be balanced with other ethical concerns and keeping in mind the capabilities that it enables for the patient. when we talk about felt care and how to evaluate whether care rendered is good or not, we should not only look at the outcomes of the process through which the patient exits the healthcare context but also the realization of some of the capabilities during the healthcare process. to that end, when thinking about felt care, we must also take into account the concept of reciprocity of feeling which is not explicitly defined in the capabilities approach but nonetheless forms an important aspect of experiencing healthcare in a positive manner from the patient's perspective. in conclusion, it is important to have an in-depth evaluation of technology assisted healthcare that is based on moral principles and philosophy, yet resting more on concrete arguments rather than just the high-level abstracts as they provide little practical guidance in evaluating different solutions and how they might be chosen to be used in different contexts. an a priori dismissal of technology in the healthcare domain, even when based on very real concerns like breach of privacy in the use of ai solutions which require a lot of personal data, begets further examination before arriving at a conclusion. the article brings up some interesting points around how we bond with things that are not necessarily sentient and how our emotions are not discriminating when it comes to reducing loneliness and imprinting on inanimate objects. people experience surges in oxytocin as a consequence of such a bonding experience which further reinforces the relationship. this has effects for how increasingly sentient-appearing ai systems might be used to manipulate humans into a "relationship" and potentially steer them towards making purchases, for example via chatbot interfaces by evoking a sense of trust. the article also makes a point about how such behaviour is akin to animism and in a sense forms a response to loneliness in the digital realm, allowing us to continue to hone our empathy skills for where they really matter, with other human beings. with more and more of our conversations being mediated by ai-enabled systems online, it is important to see if robots can be harnessed to affect positive behaviour change in our interactions with each other. while there have been studies that demonstrate the positive impact that robots can have on influencing individual behaviour, this study highlighted how the presence of robots can influence human to human interactions as well. what the researchers found was that having a robot that displayed positive and affective behavior triggered more empathy from humans towards other humans as well as other positive behaviors like listening more and splitting speaking time amongst members more fairly. this is a great demonstration of how robots can be used to improve our interactions with each other. another researcher pointed out that a future direction of interest would be to see how repeated exposure to such robot interactions can influence behaviour and if the effects so produced would be long-lasting even in the absence of the robot to participate in the interactions. 
since time immemorial there has been a constant tussle between making predictions and being able to understand the underlying fundamentals of how those predictions work. in the era of big data, those tensions are exacerbated as machines become more inscrutable, making predictions from ever higher-dimensional data that lies beyond the intuitive grasp of humans. we try to reason about some of that high-dimensional data by using techniques that reduce the dimensions or visualize it in two or three dimensions, which by definition loses some fidelity. bacon proposed that humans should use tools to gain a better understanding of the world around them; until recently, when the physical processes of the world matched our internal representations reasonably well, this wasn't a big concern. but a growing reliance on tools means that we rely more and more on what the tools make possible as they measure and model the world. statistical intelligence and models often get things right, but they are frequently hostile to any reconstruction of how they arrived at a particular prediction. models are abstractions of the world and often don't need to mirror their real-world equivalents exactly: while the telescope allows us to peer far into the distance, its construction doesn't mimic a biological eye, and radio telescopes, which don't follow optics at all, give us a view into distant objects that would be impossible with optical observations alone. illusions offer a window into the limits of our perceptual systems and bring into focus the tension between reality and what we think reality is. "in just the same way that prediction is fundamentally bounded by sensitivity of measurement and the shortcomings of computation, understanding is both enhanced and diminished by the rules of inference." in language modeling, end-to-end deep learning systems that are opaque to our understanding now perform quite a bit better than traditional machine translation approaches resting on decades of linguistic research. this bears some resemblance to searle's chinese room experiment: if we only look at the inputs and the outputs, there is no guarantee that the internal workings of the system operate the way we expect them to. "the most successful forms of future knowledge will be those that harmonise the human dream of understanding with the increasingly obscure echoes of the machine."

abhishek gupta (founder of the montreal ai ethics institute) was featured in fortune, where he detailed his views on ai safety concerns in rl systems, the "token human" problem, and automation surprise, among other points to pay attention to when developing and deploying ai systems. especially where these systems will be used in critical scenarios, humans operating in tandem with them and using them as decision inputs need a deeper understanding of the inherent probabilistic nature of their predictions, and should make decisions that take this into account rather than blindly trusting recommendations from an ai system because it has been accurate in the vast majority of scenarios so far.
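to make the point about probabilistic predictions concrete, below is a minimal, hypothetical sketch of one way a team might avoid blind trust in model recommendations: treat the output as a probability and route low-confidence cases to a human reviewer. the thresholds and the routing interface are assumptions for illustration, not a prescription from the article.

```python
# hypothetical sketch: act on a model's output only when it is confident,
# and send uncertain cases to a human reviewer (thresholds are illustrative)
from dataclasses import dataclass


@dataclass
class Decision:
    action: str        # "auto_approve", "auto_reject", or "human_review"
    probability: float


def route(probability_positive: float, high: float = 0.95, low: float = 0.05) -> Decision:
    """Route a case based on the model's predicted probability of the positive class."""
    if probability_positive >= high:
        return Decision("auto_approve", probability_positive)
    if probability_positive <= low:
        return Decision("auto_reject", probability_positive)
    # anything in between is uncertain enough to keep a human in the loop
    return Decision("human_review", probability_positive)


for p in (0.99, 0.60, 0.02):
    print(route(p))
```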
with increasing capabilities of ai systems, and established research that demonstrates how human-machine combinations operate better than each in isolation, this paper presents a timely discussion on how we can craft better coordination between human and machine agents with the aim of arriving at the best possible understanding between them. this will enhance trust levels between the agents and it starts with having effective communication. the paper discusses how framing this from a human-computer interaction (hci) approach will lead to achieving this goal. this is framed with intention-, context-, and cognition-awareness being the critical elements which would be responsible for the success of effective communication between human and machine agents. intelligibility is a notion that is worked on by a lot of people in the technical community who seek to shed a light on the inner workings of systems that are becoming more and more complex. especially in the domains of medicine, warfare, credit allocation, judicial systems and other areas where they have the potential to impact human lives in significant ways, we seek to create explanations that might illuminate how the system works and address potential issues of bias and fairness. however, there is a large problem in the current approach in the sense that there isn't enough being done to meet the needs of a diverse set of stakeholders who require different kinds of intelligibility that is understandable to them and helps them meet their needs and goals. one might argue that a deeply technical explanation ought to suffice and others kinds of explanations might be derived from that but it makes them inaccessible to those who can't parse well the technical details, often those who are the most impacted by such systems. the paper offers a framework to situate the different kinds of explanations such that they are able to meet the stakeholders where they are at and provide explanations that not only help them meet their needs but ultimately engender a higher level of trust from them by highlighting better both the capabilities and limitations of the systems. ai value alignment is typically mentioned in the context of long-term agi systems but this also applies to the narrow ai systems that we have today. optimizing for the wrong metric leads to things like unrealistic and penalizing work schedules, hacking attention on video platforms, charging more money from poorer people to boost the bottomline and other unintended consequences. yet, there are attempts by product design and development teams to capture human well-being as metrics to optimize for. "how does someone feel about how their life is going?" is a pretty powerful question that gives a surprising amount of insight into well-being distanced from what might be influencing them at the moment because it makes them pause and reflect on what matters to them. but, capturing this subjective sentiment as a metric in an inherently quantitative world of algorithms is unsurprisingly littered with mines. a study conducted by facebook and supported by external efforts found that passive use of social media triggered feelings of ennui and envy while active use including interactions with others on the network led to more positive feelings. utilizing this as a guiding light, facebook strove to make an update that would be more geared towards enabling meaningful engagement rather than simply measuring the number of likes, shares and comments. 
they used user panels as an input source to determine what constituted meaningful interactions on the platform and tried to distill this into the well-being metrics. yet, this suffered from several flaws, namely that the evaluation of this change was not publicly available and was based on the prior work comparing passive vs. active use of social media. this idea of well-being optimization extends to algorithmic systems beyond social media platforms, for example, with how gig work might be better distributed on a platform such that income fluctuations are minimized for workers who rely on it as a primary source of earnings. another place could be amending product recommendations to also capture environmental impacts such that consumers can incorporate that into their purchasing decisions apart from just the best price deals that they can find. participatory design is going to be a key factor in the development of these metrics; especially given the philosophy of "nothing about us without us" as a north star to ensure that there isn't an inherent bias in how well-being is optimized for. often, we'll find that proxies will need to stand in for actual well-being in which case it is important to ensure that the metrics are not static and are revised in consultation with users at periodic intervals. tapping into the process of double loop learning, an organization can not only optimize for value towards its shareholders but also towards all its other stakeholders. while purely quantitative metrics have obvious limitations when trying to capture something that is inherently subjective and qualitative, we need to attempt something in order to start and iterate as we go along. in a world where increasing automation of cognitive labor due to ai-enabled systems will dramatically change the future of labor, it is now more important than ever that we start to move away from a traditional mindset when it comes to education. while universities in the previous century rightly provided a great value in preparing students for jobs, as jobs being bundle of tasks and those tasks rapidly changing with some being automated, we need to focus more on training students for things that will take much longer to automate, for example working with other humans, creative and critical thinking and driving innovation based on insights and aggregating knowledge across a diversity of fields. lifelong learning serves as a useful model that can impart some of these skills by breaking up education into modules that can be taken on an "at will" basis allowing people to continuously update their skills as the landscape changes. students will go in and out of universities over many years which will bring a diversity of experiences to the student body, encouraging a more close alignment with actual skills as needed in the market. while this will pose significant challenges to the university system, innovations like online learning and certifications based on replenishment of skills like in medicine could overcome some of those challenges for the education ecosystem. individual actions are powerful, they create bottom-up change and empower advocates with the ability to catalyze larger change. but, when we look at products and services with millions of users where designs that are inherently unethical become part of everyday practice and are met with a slight shrug of the shoulders resigning to our fates, we need a more systematized approach that is standardized and widely practiced. 
ethics in ai is having its moment in the spotlight, with talks and conferences focusing on it as a core theme, yet it falls short of putting the espoused principles into practice. more often than not, it is individuals, rank-and-file employees, who go out of their way, often on personal time, to advocate for ethics, safety and inclusivity in the design of systems, sometimes even at the risk of their employment. while such efforts are laudable, they lack the widespread impact and awareness necessary to move the needle; we need leaders at the top who can effect sweeping changes, adopt these guidelines not just in letter but in spirit, and then transmit them as actionable policies to their workforce. it needs to reach a point where people advocating for this change don't have to argue from a place of moral and ethical obligation, which customers can dispute, but from a place of policy decisions which force disengagement for non-adherence. we need to move from talk to action, not just at a micro but at a macro scale.

the wrong kind of ai? artificial intelligence and the future of labor demand

do increasing efficiency and social benefits stand in exclusion to each other when it comes to automation technology? with the development of the "right" kind of ai, this doesn't have to be the case. ai is a general purpose technology with wide applications, and being offered as a platform, it allows others to build advanced capabilities on top of existing systems, creating an increasingly powerful abstraction with every layer. according to the standard approach in economics, a rise in productivity is often accompanied by an increase in the demand for labor, and hence a rise in wages along with standards of living. but when there is a decoupling between the deployment of technology and the associated productivity gains, we can see more output without a corresponding increase in standards of living, as the benefits accrue to capital owners rather than wage-earning labor that is distanced from the production lifecycle. this unevenness in the distribution of gains causes job losses in one sector while increasing productivity in others, often masking the effects at an aggregate level when judged through purely economic indicators like gdp growth rates. the authors expound on how the current wave of automation is highly focused on labor replacement, driven by a number of factors. when this comes in the form of automation that is just as good as labor but not significantly better, we get the negative effects mentioned before: a replacement of labor without substantial increases in standards of living. most of these effects are felt by those on the lower rungs of the socio-economic ladder who don't have alternate avenues for employment and ascent. a common message is that we just have to wait, as we did in the case of the industrial revolution, and new jobs we couldn't have envisioned will emerge to fuel economic prosperity for all. this is an egregious comparison that overlooks that the current wave of automation is not creating simultaneous advances that allow the emergence of a new class of tasks within jobs for which humans are well-suited. instead, it is increasingly moving into domains that were strongholds of human skills and are not easily repeatable or reproducible.
what we saw in the past was an avenue for workers to move out of low-skill tasks in agriculture into higher-skill tasks in manufacturing and services. some examples of how ai development can be done the "right" way to create social benefits:

• in education, we haven't seen a significant shift in the way things are done in a very long time. it has been shown that different students have different learning styles and can benefit from personalized attention. while this is infeasible in a traditional classroom model, ai offers the potential to track how a student interacts with different material, where they make mistakes, etc., offering insights to educators on how to deliver a better educational experience. this is accompanied by an increase in the demand for teachers who can deliver different teaching styles to match students' learning styles and create better outcomes.

• a similar argument can be made in healthcare, where ai systems can allow medical staff to spend more time with patients, offering them personalized attention for longer, while removing onerous drudgery in the form of menial tasks like data entry.

• industrial robots are being used to automate the manufacturing line, often cordoning off humans for safety reasons. humans are also decoupled from the process because of the difference in precision that machines can achieve compared to humans. but we can get the best of both worlds by combining human flexibility and critical thinking, to address problems in an uncertain environment, with the precision of machines, by creating novel interfaces, for example using augmented reality.

an important distinction that the authors point out in the above examples is that humans are not merely enablers, used to train machines in a transitory fashion, but genuinely complement machine skills. there are market failures when it comes to innovation, and in the past governments have helped mitigate those failures via public-private partnerships that led to the creation of fundamental technologies like the internet. but this has decreased over the past two decades because of the smaller amount of resources invested by governments in basic research, and because the technology revolution has become centered in silicon valley, which has a core focus on automation that replaces labor; with that bias and its funding of university and academic studies, it is pushing the best minds of the next generation into the same mindset. markets are also known to struggle when there are competing paradigms: once one pulls ahead, it is hard to switch to another even if it might be more productive, leading to an entrenchment of the dominant paradigm. the social opportunity cost of replacing labor is lower than the cost of labor, pushing the ecosystem towards labor-replacing automation. without accounting for these externalities, the ecosystem has little incentive to move towards the right kind of ai. this is exacerbated by tax incentives that impose costs on labor while providing a break on the use of capital. additionally, areas where the right kind of ai can be developed don't necessarily fall into the "cool" domains of research and thus aren't prioritized by the research and development community. let's suppose large advances were made in ai for health care.
this would require accompanying retraining of support staff aside from doctors, and the high level bodies regulating the field would impose resistance, thus slowing down the adoption of this kind of innovation. ultimately, we need to lean on a holistic understanding of the way automation is going to impact the labor market and it will require human ingenuity to shape the social and economic ecosystems such that they create net positive benefits that are as widely distributed as possible. relying on the market to figure this out on its own is a recipe for failure. the labor impacts of ai require nuance in discussion rather than fear-mongering that veers between over-hyping and downplaying concerns when the truth lies somewhere in the middle. in the current paradigm of supervised machine learning, ai systems need a lot of data before becoming effective at their automation tasks. the bottom rung of this ladder consists of robotic process automation that merely tracks how humans perform a task (say, by tracking the clicks of humans as they go about their work) and ape them step by step for simple tasks like copying and pasting data across different places. the article gives an example of an organization that was able to minimize churn in their employees by more than half because of a reduction in data drudgery tasks like copying and pasting data across different systems to meet legal and compliance obligations. economists point out that white-collar jobs like these and those that are middle-tier in terms of skills that require little training are at the highest risk of automation. while we're still ways away from ai taking up all jobs, there is a slow march starting from automating the most menial tasks, potentially freeing us up to do more value-added work. with a rising number of people relying on social media for the news, the potential for hateful content and misinformation spreading has never been higher. content moderation on platforms like facebook and youtube is still largely a human endeavor where there are legions of contract workers that spend their days reviewing whether different pieces of content meet the community guidelines of the platform. due to the spread of the pandemic and offices closing down, a lot of these workers have been asked to leave (they can't do this work from home as the platform companies explained because of privacy and legal reasons), leaving the platforms in the hands of automated systems. the efficacy of these systems has always been questionable and as some examples in the article point out, they've run amok taking down innocuous and harmful content alike, seeming to not have very fine-tuned abilities. the problem with this is that legitimate sources of information, especially on subjects like covid- , are being discouraged because of their content being taken down and having to go through laborious review processes to have their content be approved again. while this is the perfect opportunity to experiment with the potential of using automated systems for content moderation given the traumatic experience that humans have to undergo as a part of this job, the chasms that need to be bridged still remain large between what humans have to offer and what the machines are capable of doing at the moment. 
workplace time management and accounting are common practices but for those of us who work in places where schedules are determined by automated systems, they can have many negative consequences, a lot of which could be avoided if employers paid more attention to the needs of their employees. clopening is the notion where an employee working at a retail location is asked to not only close the location at the end of the day but also arrive early the next day to open the location. this among other practices like breaks that are scheduled down to the minute and on-call scheduling (something that was only present in the realm of emergency services) wreak havoc on the physical and mental health of employees. in fact, employees surveyed have even expressed willingness to take pay cuts to have greater control over their schedules. in some places with ad-hoc scheduling, employees are forced to be spontaneous with their home responsibilities like taking care of their children, errands, etc. while some employees try to swap shifts with each other, often even that becomes hard to do because others are also in similar situations. some systems track customer demand and reduce pay for hours worked tied to that leading to added uncertainty even with their paychecks. during rush seasons, employees might be scheduled for back to back shifts ignoring their needs to be with families, something that a human manager could empathize with and accommodate for. companies supplying this kind of software hide behind the disclaimer that they don't take responsibility for how their customers use these systems which are often black-box and inscrutable to human analysis. this is a worrying trend that hurts those who are marginalized and those who require support when juggling several jobs just to make ends meet. relying on automation doesn't absolve the employers of their responsibility towards their employees. while the dominant form of discussion around the impacts of automation have been that it will cause job losses, this work from kevin scott offers a different lens into how jobs might be created by ai in the rust belt in the us where automation and outsourcing have been gradually stripping away jobs. examples abound of how entrepreneurs and small business owners with an innovative mindset have been able to leverage advances in ai, coupling them with human labor to repurpose their businesses from areas that are no longer feasible to being profitable. precision farming utilizes things like drones with computer vision capabilities to detect hotspots with pests, disease, etc. in the big farms that would otherwise require extensive manual labor which would limit the size of the farms. self-driving tractors and other automated tools also augment human effort to scale operations. the farm owners though highlight the opaqueness and complexity of such systems which make them hard to debug and fix themselves which sometimes takes away from the gains. on the other hand, in places like nursing homes that get reimbursed based on the resource utilization rates by their residents, tools using ai can help minimize human effort in compiling data and let them spend more of their effort on human contact which is not something that ai succeeds on yet. while automation has progressed rapidly, the gains haven't been distributed equally. 
in other places where old factories were shut down, some are now being used by ingenious entrepreneurs to bring back manufacturing jobs that cleverly combine human labor with automation to deliver high-quality, custom products to large enterprises. thus, there will be job losses from automation, but the onus lies with us to steer the progress of ai towards the economic and ethical values that we believe in.

what's next for ai ethics, policy, and governance? a global overview

in this ongoing piece of work, the authors present the landscape of ethics documents, which has been flooded with guidelines and recommendations coming from a variety of sectors including government, private organizations, and ngos. starting with a dive into the stated and unstated motivations behind the documents, the reader is provided with a systematic breakdown of the different documents, prefaced with the caveat that where the motivations are not made explicit, one can only make a best guess based on the source of origin and the people involved in its creation. the majority of the documents from governmental agencies were from the global north and western countries, which led to a homogeneity in the issues tackled, and the recommendations often touted areas of interest specific to their industry and economic makeup. this left areas like tourism and agriculture, which continue to play a significant role in the global south, largely ignored. the governmental documents were also starkly focused on gaining a competitive edge, which was often stated explicitly, with a potential underlying goal of attracting scarce, high-quality ai talent, which could trigger brain drain from countries that are not currently dominant players in the ai ecosystem. often they were also positioning themselves to gain an edge and define a niche for themselves, especially in the case of non-dominant countries, thus overemphasizing the benefits while downplaying negative consequences that might arise from widespread ai use, like the displacement and replacement of labor. documents from private organizations mostly focused on self- and collective regulation in an effort to pre-empt stringent regulations from taking effect. they also strove to tout the economic benefits to society at large as a way of de-emphasizing the unintended consequences. a similar dynamic as in the government documents played out here, where the interests of startups and small and medium-sized businesses were ignored, and certain proposed mechanisms would be too onerous for such smaller organizations to implement, further entrenching the competitive advantage of larger firms. the ngos, on the other hand, seemed to have the largest diversity, both in terms of the participatory process of creation and in the scope, granularity, and breadth of issues covered, which gave technical, ethical, and policy implementation details that made them actionable. some documents, like the montreal declaration for responsible ai, were built through an extensive public consultation process and consisted of an iterative and ongoing approach that the montreal ai ethics institute contributed to as well. the ieee document leverages a more formal standards-making approach, with experts and concerned citizens from different parts of the world contributing to its creation and ongoing updating.
the social motivation is clearly oriented towards creating larger societal benefits, internal motivation is geared towards bringing about change in the organizational structure, external strategic motivation is often towards creating a sort of signaling to showcase leadership in the domain and also interventional to shape policy making to match the interests of those organizations. judging whether a document has been successful is complicated by a couple of factors: discerning what the motivations and the goals for the document were, and the fact that most implementations and use of the documents is done in a pick-and-choose manner complicating attribution and weight allocation to specific documents. some create internal impacts in terms of adoption of new tools, change in governance, etc., while external impacts often relate to changes in policy and regulations made by different agencies. an example would be how the stem education system needs to be overhauled to better prepare for the future of work. some other impacts include altering customer perception of the organization as one that is a responsible organization which can ultimately help them differentiate themselves. at present, we believe that this proliferation of ethics documents represents a healthy ecosystem which promotes a diversity of viewpoints and helps to raise a variety of issues and suggestions for potential solutions. while there is a complication caused by so many documents which can overwhelm people looking to find the right set of guidelines that helps them meet their needs, efforts such as the study being done in this paper amongst other efforts can act as guideposts to lead people to a smaller subset from which they can pick and choose the guidelines that are most relevant to them. the white paper starts by highlighting the existing tensions in the definitions of ai as there are many parties working on advancing definitions that meet their needs. one of the most commonly accepted ones frames ai systems as those that are able to adapt their behavior in response to interactions with the world independent of human control. also, another popular framing is that ai is something that mimics human intelligence and is constantly shifting as a goal post as what was once perceived as ai, when sufficiently integrated and accepted in society becomes everyday technology. one thing that really stands out in the definitions section is how ethics are defined, which is a departure from a lot of other such documents. the authors talk about ethics as a set of principles of morality where morality is an assemblage of rules and values that guide human behavior and principles for evaluating that behavior. they take a neutral stand on the definition, a far cry from framing it as a positive inclination of human conduct to allow for diversity in embedding ethics into ai systems that are in concordance with local context and culture. ai systems present many advantages which most readers are now already familiar given the ubiquity of ai benefits as touted in everyday media. one of the risks of ai-enabled automation is the potential loss of jobs, the authors make a comparison with some historical cases highlighting how some tasks and jobs were eliminated creating new jobs while some were permanently lost. many reports give varying estimates for the labor impacts and there isn't yet a clear consensus on the actual impacts that this might have on the economy. 
from a liability perspective, there is still debate as to how to account for the damage that might be caused to human life, health and property by such systems. in a strict product liability regime like europe's, there might be some guidance on how to account for this, but most regimes don't have specific liability allocations for independent events and decisions, meaning users face coverage gaps that can expose them to significant harms. by virtue of the complexity of deep learning systems, their internal representations are not human-understandable and hence lack transparency, which is also called the black box effect. this is harmful because it erodes trust from the user perspective, among other negative impacts. social relations are altered as more and more human interactions are mediated and governed by machines; we see examples of that in how our newsfeeds are curated, the toys that children play with, and robots taking care of the elderly. this decreased human contact, along with the increasing capability of machine systems, examples of which we see in how disinformation spreads, will tax humans in constantly having to evaluate their interactions for authenticity, or worse, lead to a relegation of control to machines to the point of apathy. since the current dominant paradigm in machine learning is supervised learning, access to data is crucial to the success of these systems, and in cases where there aren't sufficient protections in place for personal data, this can lead to severe privacy abuses. self-determination theory states that human autonomy is important for proper functioning and fulfillment, so an overreliance on ai systems to do our work can lead to a loss of personal autonomy and a sense of digital helplessness. digital dementia is the cognitive equivalent, where relying on devices for things like storing phone numbers, looking up information, etc. will over time lead to a decline in cognitive abilities. the echo chamber effect is fairly well studied, owing to the successful use of ai technologies to promulgate disinformation to the masses during the us presidential elections. due to the easy scalability of these systems, the negative effects are multiplicative in nature and have the potential to become runaway problems. given that ai systems are built on top of existing software and hardware, errors in the underlying systems can still cause failures at the level of the ai system. more so, given the statistical nature of ai systems, behaviour is inherently stochastic, and that can cause variability in responses which is difficult to account for; flash crashes in the financial markets are an example of this. for critical systems that require safety and robustness, there is a lot that needs to be done to ensure reliability.

building ethics compliance by design can take a bottom-up or a top-down approach. the risk with a bottom-up approach is that by observing examples of human behaviour and extracting ethics principles from them, instead of getting what is good for people, you get what is common. hence, the report advocates for a top-down approach where desired ethical behavior is directly programmed into the system. casuistic approaches to embedding ethics into systems would work well in simple scenarios, such as in healthcare when the patient has a clear do-not-resuscitate directive.
but in cases where there isn't one, and where it is not possible to seek a directive from the patient, such an approach can fail; it then requires that programmers either embed rules in a top-down manner or that the system learn from examples. in a high-stakes setting like healthcare, though, it might not be ideal to rely on learning from examples because of skewed and limited numbers of samples. a dogmatic approach would also be ill-advised, where a system slavishly follows a particular school of ethical beliefs, which might lead it to make decisions that are unethical in certain scenarios. ethicists utilize several schools of thought when addressing a particular situation to arrive at a balanced decision, and it will be crucial to consult a diversity of stakeholders so that the nuances of different situations can be captured well. the wef is working with partners to come up with an "ethical switch" that would give the system the flexibility to operate with different schools of thought based on the demands of the situation. the report also proposes the potential of a guardian ai system that can monitor other ai systems to check for compliance with different sets of ai principles. given that ai systems operate in a larger socio-technical ecosystem, we need to tap into fields like law and policy making to come up with effective ways of integrating ethics into ai systems, part of which can involve creating binding legal agreements that tie in with economic incentives. while policy making and law are often seen as slow to adapt to fast-changing technology, there are a variety of benefits to be had, for example higher customer trust for services that adhere to stringent regulations regarding privacy and data protection. this can serve as a competitive advantage and counter some of the innovation barriers imposed by regulations. another point of concern with these instruments is that they are limited by geography, which leads to a patchwork of regulation that might apply to a product or service spanning several jurisdictions. some other instruments to consider include self-regulation, certification, bilateral investment treaties, contractual law, soft law, agile governance, etc. the report highlights the initiatives by ieee and wef in creating standards documents. the public sector, through its enormous spending power, can drive the widespread adoption of these standards, for example by using them in procurement for ai systems that interact with and serve citizens. the report also advocates for the creation of an ethics board or a chief values officer as a way of enhancing the adoption of ethical principles in the development of products and services. for vulnerable segments of the population, for example children, there need to be higher standards of data protection and transparency that can help parents make informed decisions about which ai toys to bring into their homes. regulators might play an added role in enforcing certain ethics principles as part of their responsibility, and there also needs to be broader education in ai ethics for people in technical roles. given that there are many negative applications of ai, that shouldn't preclude us from using ai systems for positive use cases; a risk assessment and prudent evaluation prior to use is a meaningful compromise. that said, there are certain scenarios where ai shouldn't be used at all, and those can be surfaced through the risk or impact assessment process.
there is a diversity of ethical principles that have been put forth by various organizations, most of which are in some degree of accordance with local laws, regulations, and value sets. yet, they share certain universal principles across all of them. one concern highlighted by the report is on the subject of how even widely accepted and stated principles of human rights can be controversial when translated into specific mandates for an ai system. when looking at ai-enabled toys as an example, while they have a lot of privacy and surveillance related issues, in countries where there isn't adequate access to education, these toys could be seen as a medium to impart precision education and increase literacy rates. thus, the job of the regulator becomes harder in terms of figuring out how to balance the positive and negative impacts of any ai product. a lot of it depends on the context and surrounding socio-economic system as well. given the diversity in ethical values and needs across communities, an approach might be for these groups to develop and apply non-binding certifications that indicate whether a product meets the ethical and value system of that community. since there isn't a one size fits all model that works, we should aim to have a graded governance structure that has instruments in line with the risk and severity profile of the applications. regulation in the field of ai thus presents a tough challenge, especially given the interrelatedness of each of the factors. the decisions need to be made in light of various competing and sometimes contradictory fundamental values. given the rapid pace of technological advances, the regulatory framework needs to be agile and have a strong integration into the product development lifecycle. the regulatory approach needs to be such that it balances speed so that potential harms are mitigated with overzealousness that might lead to ineffective regulations that stifle innovation and don't really understand well the technology in question. ai is currently enjoying a summer of envy after having gone through a couple of winters of disenchantment, with massive interest and investments from researchers, industry and everyone else there are many uses of ai to create societal benefits but they aren't without their socio-ethical implications. ai systems are prone to biases, unfairness and adversarial attacks on their robustness among other real-world deployment concerns. even when ethical ai systems are deployed for fostering social good, there are risks that they cater to only a particular group to the detriment of others. moral relativism would argue for a diversity of definitions as to what constitutes good ai which would depend on the time, context, culture and more. this would be reflected in market decisions by consumers who choose products and services that align with their moral principles but it poses a challenge for those trying to create public governance frameworks for these systems. this dilemma would push regulators towards moral objectivism which would try and advocate for a single set of values that are universal making the process of coming up with a shared governance framework easier. a consensus based approach utilized in crafting the ec trustworthy ai guidelines settled on human rights as something that everyone can get on board with. 
given the ubiquity in the applicability of human rights, especially with their legal enshrinement in various charters and constitutions, they serve as a foundation to create legal, ethical and robust ai as highlighted in the ec trustworthy ai guidelines. stressing on the importance of protecting human rights, the guidelines advocate for a trustworthy ai assessment in case that an ai system has the potential to negatively impact the human rights of an individual, much like the better established data protection impact assessment requirement under the gdpr. additional requirements are imposed in terms of ex-ante oversight, traceability, auditability, stakeholder consultations, and mechanisms of redress in case of mistakes, harms or other infringements. the universal applicability of human rights and their legal enshrinement also renders the benefits of established institutions like courts whose function is to monitor and enforce these rights without prejudice across the populace. but they don't stand uncontested when it comes to building good ai systems; they are often seen as too western, individualistic, narrow in scope and abstract to be concrete enough for developers and designers of these systems. some arguments against this are that they go against the plurality of value sets and are a continued form of former imperialism imposing a specific set of values in a hegemonic manner. but, this can be rebutted by the signing of the original universal declaration of human rights that was done by nations across the world in an international diplomatic manner. however, even despite numerous infringements, there is a normative justification that they ought to be universal and enforced. while human rights might be branded as too individual focused, potentially creating a tension between protecting the rights of individuals to the detriment of societal good, this is a weak argument because stronger protection of individual rights has knock-on social benefits as free, healthy and well-educated (among other individual benefits) creates a net positive for society as these individuals are better aware and more willing to be concerned about societal good. while there are some exceptions to the absolute nature of human rights, most are well balanced in terms of providing for the societal good and the good of others while enforcing protections of those rights. given the long history of enforcement and exercises in balancing these rights in legal instruments, there is a rich jurisprudence on which people can rely when trying to assess ai systems. while human rights create a social contract between the individual and the state, putting obligations on the state towards the individual but some argue that they don't apply horizontally between individuals and between an individual and a private corporation. but, increasingly that's not the case as we see many examples where the state intervenes and enforces these rights and obligations between an individual and a private corporation as this falls in its mandate to protect rights within its jurisdiction. the abstract nature of human rights, as is the case with any set of principles rather than rules, allows them to be applied to a diversity of situations and to hitherto unseen situations as well. but, they rely on an ad-hoc interpretation when enforcing them and are thus subjective in nature and might lead to uneven enforcement across different cases. 
under the eu, this margin of appreciation is often criticized in the sense that it leads to weakening and twisting of different principles but this deferment to those who are closer to the case actually allows for a nuanced approach which would be lost otherwise. on the other hand we have rules which are much more concrete formulations and thus have a rigid definition and limited applicability which allows for uniformity but it suffers from inflexibility in the face of novel scenarios. yet, both rules and principles are complementary approaches and often the exercise of principles over time leads to their concretization into rules under existing and novel legal instruments. while human rights can thus provide a normative, overarching direction for the governance of ai systems, they don't provide the actual constituents for an applicable ai governance framework. for those that come from a non-legal background, often technical developer and designers of ai systems, it is essential that they understand their legal and moral obligations to codify and protect these rights in the applications that they build. the same argument cuts the other way, requiring a technical understanding of how ai systems work for legal practitioners such that they can meaningfully identify when breaches might have occurred. this is also important for those looking to contest claims of breaches of their rights in interacting with ai systems. this kind of enforcement requires a wide public debate to ensure that they fall within accepted democratic and cultural norms and values within their context. while human rights will continue to remain relevant even in an ai systems environment, there might be novel ways in which breaches might occur and those might need to be protected which require a more thorough understanding of how ai systems work. growing the powers of regulators won't be sufficient if there isn't an understanding of the intricacies of the systems and where breaches can happen, thus there is more of a need to enshrine some of those responsibilities in law to enforce this by the developers and designers of the system. given the large public awareness and momentum that built up around the ethics, safety and inclusion issues in ai, we will certainly see a lot more concrete actions around this in . the article gives a few examples of congressional hearings on these topics and advocates for the industry to come up with some standards and definitions to aid the development of meaningful regulations. currently, there isn't a consensus on these definitions and it leads to varying approaches addressing the issues at different levels of granularity and angles. what this does is create a patchwork of incoherent regulations across domains and geographies that will ultimately leave gaps in effectively mitigating potential harms from ai systems that can span beyond international borders. while there are efforts underway to create maps of all the different attempts of defining principle sets, we need a more coordinated approach to bring forth regulations that will ultimately protect consumer safety. in containing an epidemic the most important steps include quarantine and contact tracing for more effective testing. while before, this process of contact tracing was hard and fraught with errors and omissions, relying on memories of individuals, we now carry around smartphones which allow for ubiquitous tracking ability that is highly accurate. 
but such ubiquity comes with invasion of privacy and possible limits on the freedoms of citizens. such risks need to be balanced against the public interest, using enhanced privacy-preserving techniques and other measures that center citizen welfare in both a collective and an individual sense. for infections that can be asymptomatic in the early days, like covid-19, contact tracing is essential: it identifies all the people who came in close contact with an infected person and might spread the infection further. this becomes especially important in a pandemic, when the healthcare system is overburdened and testing every person is infeasible. an additional benefit of contact tracing is that it mitigates resurgent peaks of infection. the reproduction number r determines how quickly a disease will spread and depends on three factors (period of infection, contact rate, and mode of transmission), of which the first and third are fixed, so we are only left with control over the contact rate. with uptake of an application that facilitates contact tracing, the reduction in contact rate yields increasing returns, because of the number of people that might come in contact with an infected person; thus we get a greater percentage reduction in r than the percentage uptake of the application in the population. ultimately, reducing r to below 1 slows the spread of the infection and helps the healthcare system cope with the sudden stresses brought on by pandemic peaks.

one of the techniques that governments or public health agencies use is broadcasting, in which information about diagnosed carriers is made public via various channels; it carries severe issues like exposing private information of individuals and of businesses where they might have been, which can trigger stigma, ostracization, and unwarranted punitive harm. it also suffers from the problem that people need to access this information of their own volition and then self-identify (and remember correctly) whether they've been in recent contact with a diagnosed carrier. selective broadcasting is a more restricted form of the above, where information about diagnosed carriers is shared with a select group of individuals based on location proximity, in which case the user's location privacy has to be compromised; in another vector of dissemination, messages are sent to all users but filtered on device for their specific location, which is not reported back to the broadcaster. the other second-order negative effects remain the same as for broadcasting, and both require the download of an application, which might decrease uptake. unicasting sends messages tailored specifically to each user; it requires an app able to track timestamps and location and has severe consequences in terms of government surveillance and abuse. participatory sharing is a method where diagnosed carriers voluntarily share their information and thus retain more control over their data, but it still relies on individual action by both sender and receiver and its efficacy is questionable at best. there is also a risk of abuse by malicious actors spreading misinformation and seeding chaos in society via false alarms. private kit: safe paths is an open-source solution developed by mit that allows for contact tracing in a privacy-preserving way.
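as a back-of-the-envelope illustration of the relationship described above between the reproduction number r and the contact rate, the short python sketch below simply multiplies the three factors together; all numbers are hypothetical and chosen purely for illustration, not taken from any study.

```python
# toy illustration: r ~= transmission probability per contact
#                        * contacts per day * days infectious.
# the contact rate is the only one of the three levers that tracing and
# isolation can realistically change. all numbers are made up.

def reproduction_number(p_transmit: float, contacts_per_day: float, days_infectious: float) -> float:
    """simple product-form estimate of the reproduction number r."""
    return p_transmit * contacts_per_day * days_infectious

baseline = reproduction_number(p_transmit=0.03, contacts_per_day=10, days_infectious=7)      # ~2.1
with_tracing = reproduction_number(p_transmit=0.03, contacts_per_day=4, days_infectious=7)   # ~0.84

print(f"baseline r ~ {baseline:.2f}, with isolation of traced contacts r ~ {with_tracing:.2f}")
# once r drops below 1, each case infects fewer than one other person on
# average and the outbreak decays instead of growing.
```

the only point of the sketch is that pushing the effective contact rate down, which is what rapid tracing and isolation of exposed people do, is what drives r below 1.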
private kit: safe paths utilizes the encrypted location trail of a diagnosed carrier who chooses to share it with public health agencies; other users of the solution can then pull this data and, via their own logged location trail, learn whether they've been in close contact with a diagnosed carrier. in later phases of development, the developers will enable a mix of participatory sharing and unicasting to further prevent possible data access by third parties, including governments, for surveillance purposes. risks of contact tracing include possible public identification of the diagnosed carrier and the severe social stigma that arises from that; online witch hunts to identify the individual can worsen the harassment and include the spreading of rumors about their personal lives. the privacy risks for both individuals and businesses carry the potential for severe harm, which can be especially troublesome during times of financial hardship. privacy risks also extend to non-users because of proximal information that can be derived from location trails, such as employees who work at businesses visited by a diagnosed carrier, and the same stigma and ostracization can fall on the family members of these people. without meaningful alternatives, especially for health and risk assessment during a pandemic, obtaining truly informed consent is a real challenge that doesn't yet have any clear solution. along with information, whichever of the methods identified above delivers it, it is very important to provide appropriate context and background to the alerts to prevent misinformation and panic from spreading, especially among those with low health, digital, and media literacy. on the other hand, some might not take such alerts seriously and increase the risk to public health by not following required measures such as quarantine and social distancing. given the nature of such solutions, there is a significant risk of data theft by crackers, as is the case for any application that collects sensitive information like health status and location data. the solutions can also be used for fraud and abuse, for example by blackmailing business owners and demanding ransom, with the threat of falsely posting that diagnosed carriers have visited their place of business if they fail to pay. contact tracing technology requires a smartphone with gps, and some vulnerable populations might not always have such devices, like the elderly, the homeless, and people living in low-income countries, who are at high risk of infection and negative health outcomes; ensuring that the technology works for all will be an important piece of mitigating the spread effectively. there is an inherent tradeoff between the utility of the data provided and the privacy of the data subjects, and compromises may be required for particularly severe outbreaks. the diagnosed carriers are the most vulnerable stakeholders in the ecosystem of contact tracing technology and they require the most protection. adopting open-source solutions that are examinable by the wider technology ecosystem can engender public trust. additionally, having proper consent mechanisms in place and excluding any requirement of extensive third-party access to the location data can also help allay concerns.
lastly, time limits on the storage and use of the location trails will also help address privacy concerns and increase uptake and use of the application in support of public health measures. for geolocation data that might affect businesses, especially in times of economic hardship, releases should be handled such that the businesses are informed prior to the release of the information, but there is little else in current methods that can both protect privacy and provide sufficient data utility at the same time. for those without access to smartphones with gps, providing some information on contact tracing can still help their communities, but one must present it in a manner that accounts for variation in health literacy levels so that an appropriate response is elicited. alertness about potential misinformation and educational awareness are key during times of crisis to encourage measured responses that follow the best practices advised by health agencies rather than the fear mongering of ill-informed and/or malicious actors. encryption and other cybersecurity best practices for data security and privacy are crucial to the success of the solution. the recommended time limit on holding covid-19 data is the period of infection, though for an evolving pandemic one might need to hold it longer for further analysis. tradeoffs need to be made between privacy concerns and public health utility; different agencies and regions are taking different approaches with varying levels of efficacy, and only time will tell how this change will be best managed. it does present an opportunity, though, for creating innovative solutions that both allow for public sharing of data and reduce privacy intrusions.

while the insights presented in this piece of work are ongoing and will continue to be updated, we felt it important to highlight the techniques and considerations compiled by the openmined team, as it is one of the few places that adequately captures, in a single place, most of the technical requirements needed to build a solution that respects fundamental rights while balancing them with public health outcomes, as people rush to make ai-enabled apps to combat covid-19. most articles and research work coming out elsewhere are scant and abstract on the technical details that would be needed to meet the ideals of respecting privacy while enabling health authorities to curb the spread of the pandemic. the four key techniques that will help preserve and respect rights as more and more people develop ai-enabled applications to combat covid-19 are: on-device data storage and computation, differential privacy, encrypted computation, and privacy-preserving identity verification. the primary use cases, from a user perspective, for which apps are being built are proximity alerts, exposure alerts, information for planning trips, symptom analysis, and demonstrating proof of health. governments and health authorities, for their part, are looking for fast contact tracing, high-precision self-isolation requests, high-precision self-isolation estimation, high-precision symptomatic citizen estimation, and demonstration of proof of health. while public health outcomes are top of mind for everyone, the above use cases are trying to achieve the best possible tradeoff between economic impacts and epidemic spread; using the techniques highlighted in this work, it is possible to do so without having to erode the rights of citizens.
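as one concrete illustration of the differential privacy technique named above, here is a minimal sketch of the laplace mechanism applied to a count query (for example, how many users in a district self-reported symptoms). this is a hand-rolled toy, not the production libraries referenced in this summary, and the district and numbers are hypothetical.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float) -> float:
    """release a count via the laplace mechanism.

    a counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single release.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# hypothetical example: self-reported symptomatic users in one district
true_symptomatic = 412
noisy_symptomatic = dp_count(true_symptomatic, epsilon=0.5)
print(f"noisy count released to the health authority: {noisy_symptomatic:.0f}")
```

the guarantee is that the released statistic looks almost the same whether or not any one person's record is in the data, which is what lets aggregate figures be shared without exposing individuals.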
this living body of work is meant to serve as a high-level guide, along with resources, to enable both app developers and verifiers to implement and check for privacy preservation, which has been the primary pushback from citizens and civil activists. evoking a high degree of trust from people will improve adoption of the apps developed and hopefully allow society and the economy to return to normal sooner while mitigating the harmful effects of the epidemic. there is a fair amount of alignment between the goals of individuals and of the government, the difference being that the government is looking at aggregate outcomes for society. some of the goals shared by governments across the world include preventing the spread of the disease, eliminating the disease, protecting the healthcare system, protecting the vulnerable, adequately and appropriately distributing resources, preventing secondary outbreaks, and minimizing economic impacts and panic. digital contact tracing is needed because manual interventions are usually highly error-prone and rely on human memory to trace whom the person might have come in contact with. high-precision self-isolation requests avoid the need for geographic quarantines, where everyone in an area is forced to self-isolate, which leads to massive disruptions in the economy and can stall the delivery of essential services like food, electricity, and water; an additional benefit of high-precision self-isolation is that it can help strike an appropriate balance between economic harms and epidemic spread. high-precision symptomatic citizen estimation is useful in that it allows for more fine-grained estimation of the number of people that might be affected beyond what test results indicate, which can further strengthen the precision of other measures that are undertaken. a restoration of normalcy will be crucial as the epidemic starts to ebb; in this case, having proof of health that helps identify the lowest-risk individuals will allow them to participate in public spaces again, further bolstering the supply of essential services and relieving the burden on the small subset of workers currently participating. to serve the needs of both the users and the government, we need to be able to collect the following data: historical and current absolute location, historical and current relative position, and verified group identity, where group refers to any demographic that the government might be interested in, for example age or health status. to create an application that meets these needs, we need to collect data from a variety of sources, compute aggregate statistics on that data, and then set up a messaging architecture that communicates the results to the target population. the toughest challenges lie in the first and second parts of this process, especially doing the second part in a privacy-preserving manner. for historical and current absolute location, one of the first options considered by app developers is to record gps data in the background. unfortunately, this doesn't work on ios devices, and even where it does it has several limitations, including coarseness in dense urban areas and usefulness only after the app has been running on the user's device for some time, because historical data cannot be sourced otherwise.
an alternative would be to use wi-fi router information which can give more accurate information as to whether someone has been self-isolating or not based on whether they are connected to their home router. there can be historical data available here which makes it more useful though there are concerns with lack of widespread wi-fi connectivity in rural areas and tracking when people are outside homes. other ways of obtaining location data could be from existing apps and services that a user uses -for example, history of movements on google maps which can be parsed to extract location history. there is also historical location data that could be pieced together from payments history, cars that record location information and personal cell tower usage data. historical and current relative data is even more important to map the spread of the epidemic and in this case, some countries like singapore have deployed bluetooth broadcasting as a means of determining if people have been in close proximity. the device broadcasts a random number (which could change frequently) which is recorded by devices passing by close to each other and in case someone is tested positive, this can be used to alert people who were in close proximity to them. another potential approach highlighted in the article is to utilize gyroscope and ambient audio hashes to determine if two people might have been close together, though bluetooth will provide more consistent results. the reason to use multiple approaches is the benefit of getting more accurate information overall since it would be harder to fake multiple signals. group membership is another important aspect where the information can be used to finely target messaging and calculating aggregate statistics. but, for some types of group membership, we might not be able to rely completely on self-reported data. for example, health status related to the epidemic would require verification from an external third-party such as a medical institution or testing facility to minimize false information. there are several privacy preserving techniques that could be applied to an application given that you have: confirmed covid- patient data in a cloud, all other user data on each individual's device, and data on both the patients and the users including historical and current absolute and relative locations and group identifier information. private set intersections can be used to calculate whether two people were in proximity to each other based on their relative and absolute location information. private set intersection operates similarly to normal set intersection to find elements that are common between two sets but does so without disclosing any private information from either of the sets. this is important because performing analysis even on pseudonymized data without using privacy preservation can leak a lot of information. differential privacy is another critical technique to be utilized, dp consists of providing mathematical guarantees (even against future data disclosures) that analysis on the data will not reveal whether or not your data was part of the dataset. it asserts that from the analysis, one is not able to learn anything about your data that they wouldn't have been able to learn from other data about you. google's battle-tested c++ library is a great resource to start along with the python wrapper created by the openmined team. to address the need for verified group identification, one can utilize the concept of a private identity server. 
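to make the rotating-random-number scheme described above more concrete, here is a minimal, purely illustrative python sketch of the idea behind bluetooth-based proximity logging; real deployments add ble broadcasting, signal-strength and duration thresholds, and cryptographically derived rolling identifiers, and all names here are hypothetical.

```python
import secrets

class Device:
    """toy model of the rotating-token scheme: each device broadcasts a
    short-lived random token and records the tokens it hears nearby."""

    def __init__(self):
        self.broadcast_log = []   # tokens this device has broadcast
        self.observed_log = []    # tokens heard from nearby devices

    def new_token(self) -> str:
        token = secrets.token_hex(16)      # random, unlinkable identifier
        self.broadcast_log.append(token)
        return token

    def hear(self, token: str) -> None:
        self.observed_log.append(token)

def exposure_check(observed_log, carrier_tokens) -> bool:
    """on-device check: did this user hear any token that a diagnosed
    carrier later chose to upload?"""
    return bool(set(observed_log) & set(carrier_tokens))

# hypothetical encounter between two devices
alice, bob = Device(), Device()
bob.hear(alice.new_token())                # bob's phone records alice's token

# alice is later diagnosed and uploads her broadcast tokens; bob checks locally
print(exposure_check(bob.observed_log, carrier_tokens=alice.broadcast_log))  # True
```

the key property is that the comparison runs against random, short-lived tokens rather than identities or locations, and the intersection can be computed on the user's own device.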
the private identity server essentially functions as a trusted intermediary between a user who wants to make a claim and another party that wants to verify it. it works by querying a service from which it can verify whether the claim is true and then serving that result up to the verifying party without giving away personal data. while it might be hard to trust a single intermediary, this role can be decentralized, obtaining a higher degree of trust by relying on a consensus mechanism.

building on theory from management studies by christensen et al., the authors of this article dive into how leaders of tech organizations, especially upstarts that are rapid in disrupting incumbents, should approach the responsibilities that come with displacing existing paradigms of how an industry works. when different parts of the value chain through which a service is delivered become decoupled, the protections that applied to the entire pipeline often fall by the wayside because of distancing from the end user and a diffusion of responsibility across multiple stakeholders. while end-user-driven innovation will seek to reinforce such models, regulations and protections are never at the top of those demands, and this creates a burden on consumers once they realize that things can go wrong and negatively affect them. the authors advocate for company leaders to proactively employ a systems-thinking approach: identify the different parts they are disrupting, how that might affect users, and what would happen if they become the dominant player in the industry, and then apply the lessons from such an exercise to pre-emptively design safeguards into the system that mitigate unintended consequences.

many countries are looking at utilizing existing surveillance and counter-terrorism tools to help track the spread of the coronavirus and are urging tech companies and carriers to assist with this. the us is looking at how it can tap into location data from smartphones, following on the heels of israel and south korea, which have deployed similar measures. while extraordinary measures might be justified given the crisis we're going through, we mustn't lose sight of what behaviors we are normalizing as part of this response to the pandemic. russia and china are also using facial recognition technologies to track the movements of people, while iran is endorsing an app that might be used as a diagnosis tool. expansion of the boundaries of surveillance capabilities and government powers is hard to rein back in once a crisis is over. in some cases powers have been rolled back, as when the signing of the freedom act in the usa reduced government agency data collection abilities that had been expanded under the patriot act; but that's not always the case, and even so, the powers today exceed those that existed prior to the enactment of the patriot act. what's most important is to ensure that the decisions policy makers take today keep in mind time limits on such expansions of power and don't trigger a future privacy crisis. while no replacement for social distancing, a virus tracking tool putting the technique of contact tracing into practice is largely unpalatable to western democracies because of expectations of privacy and freedom of movement.
a british effort is underway to create an app that meets democratic ideals of privacy and freedom while also being useful in collecting geolocation data to aid virus containment efforts. it is based on the notion of participatory sharing, relying on people's sense of civic duty to contribute their data if they test positive. while in the usa discussions between the administration and technology companies have focused on large-scale aggregate data collection, in a place like the uk, with a centralized healthcare system, there might be higher levels of trust in sharing data with the government. the app doesn't require uptake by everyone to be effective, but a majority of people would need to use it to bring down the rate of spread. the efficacy of the solution itself will rely on being able to collect granular location data from multiple sources including bluetooth, wi-fi, cell tower data, and app check-ins.

a lot of high-level cdc officials are advising that if people in the usa don't follow the best practices of social distancing, sheltering in place, and washing hands regularly, the outbreak will not have peaked and the infection will continue to spread, especially hitting the most vulnerable, including the elderly and those with pre-existing conditions. on top of the public health impacts, there are also concerns about growing tech-enabled surveillance, which is being seriously explored as an additional measure to curb the spread. while privacy and freedom rights are enshrined in the constitution, during times of crisis government and justice powers are expanded to allow extraordinary measures to be adopted to restore public safety. this is one of those times, and the us administration is actively exploring options in partnership with various governments on how to effectively combat the spread of the virus, including the use of facial recognition technology. this comes shortly after the techlash and a potential bipartisan movement to curb the degree of data collection by large firms, which seems to have come to a halt as everyone scrambles to battle the coronavirus. regional governments are being imbued with escalated powers to override local legislation in an effort to curb the spread of the virus. the article provides details on efforts by various countries across the world, yet we only have preliminary data on the efficacy of each of those measures and require more time before being able to judge which of them is most effective. that said, in a fast-spreading pandemic we don't have the luxury of time and must make decisions as quickly as possible using the information at hand, perhaps guided by prior crises. but what we've seen so far is minimal coordination among agencies across the world, and that's leading to ad-hoc, patchy data use policies that will leave the marginalized more vulnerable. strategies that publicly disclose those who have tested positive in the interest of public health are causing harm to the individuals concerned and to people close to them, such as their families. as experienced by a family in new york, online vigilantes attempted to harass the individuals even as the family pleaded and communicated the measures they had taken to isolate themselves and safeguard others. unfortunately, the virus might be bringing out the worst in all of us.
an increasing number of tools and techniques are being used to track our behaviour online, and while some may have potential benefits, for example the use of contact tracing to potentially improve public health outcomes, if this is not done in a privacy-preserving manner there can be severe implications for your data rights. but, barring special circumstances like the current pandemic, there are a variety of simple steps you can take to protect your privacy online. these range from using an incognito browser window, which doesn't store any local information about your browsing on your device, to using things like vpns, which protect your browsing patterns from snooping even by your isp. when it comes to the incognito function of your browser, if you're logged into a service online there isn't any protection, though it does prevent cookies from being stored on your device. with vpns, there is an implicit trust placed in the provider of the service not to store logs of your browsing activity. an even more secure option is to use a privacy-first browser like tor, which routes your traffic requests through multiple locations, making tracking hard. there is also an os built around this, tails os, that offers tracking protection from the device perspective as well, leaving no trace on the host machine and allowing you to boot from a usb. the eff also provides a list of tools that you can use to get a better grip on your privacy as you browse online.

under the children's online privacy protection act, the ftc levied its largest fine yet against youtube last year for failing to meet the requirements on limiting personal data collection from children under the act's age threshold. yet, as many advocates of youth privacy point out, the fines, though they appear large, don't do enough to deter such personal data collection. they advocate for a stronger version of the act along with more stringent enforcement from the ftc, which has been criticized for slow responses and a lack of sufficient resources. while the current act requires parental consent for children below that threshold to be able to use a service that might collect personal data, no verification is performed on the self-declared age provided at sign-up, which weakens the efficacy of this requirement. secondly, the sharp age threshold immediately thrusts children into an adult world once they cross it, and some people are advocating for a more graduated approach to the application of privacy laws.

given that such a large part of the news cycle is dominated by the coronavirus, we tend to forget that there might be censors at work systematically suppressing information in an attempt to diminish the seriousness of the situation. some people are calling github the last piece of free land in china and have utilized access to it to document news stories and people's first-hand experiences in fighting the virus before they are scrubbed from local platforms like wechat and weibo. they hope that such documentation efforts will not only shed light on the on-the-ground reality as it unfolds but also give everyone a voice and hopefully provide data to others who could use it to track the movement of the virus across the country. such times of crisis bring out creativity, and this attempt highlights our ability as a species to thrive even in a severely hostile environment.
there is a clear economic and convenience case to be made (albeit for the majority, not for those that are judged to be minorities by the system and hence get subpar performance from the system) where you get faster processing and boarding times when trying to catch a flight. yet, for those that are more data-privacy minded, there is an option to opt-out though leveraging that option doesn't necessarily mean that the alternative will be easy, as the article points out, travelers have experienced delays and confusion from the airport staff. often, the alternatives are not presented as an option to travelers giving a false impression that people have to submit to facial recognition systems. some civil rights and ethics researchers tested the system and got varying mileage out of their experiences but urge people to exercise the option to push back against technological surveillance. london is amongst a few cities that has seen public deployment of live facial recognition technology by law enforcement with the aim of increasing public safety. but, more often than not, it is done so without public announcement and an explanation as to how this technology works, and what impacts it will have on people's privacy. as discussed in an article by maiei on smart cities, such a lack of transparency erodes public trust and affects how people go about their daily lives. several artists in london as a part of regaining control over their privacy and to raise awareness are using the technique of painting adversarial patterns on their faces to confound facial recognition systems. they employ highly contrasting colors to mask the highlights and shadows on their faces and practice pattern use as created and disseminated by the cvdazzle project that advocates for many different styles to give the more fashion-conscious among us the right way to express ourselves while preserving our privacy. such projects showcase a rising awareness for the negative consequences of ai-enabled systems and also how people can use creative solutions to combat problems where laws and regulations fail them. there is mounting evidence that organizations are taking seriously the threats arising from malicious actors geared towards attacking ml systems. this is supported by the fact that organizations like iso and nist are building up frameworks for guidance on securing ml systems, that working groups from the eu have put forth concrete technical checklists for the evaluating the trustworthiness of ml systems and that ml systems are becoming key to the functioning of organizations and hence they are inclined to protect their crown jewels. the organizations surveyed as a part of this study spanned a variety of domains and were limited to those that have mature ml development. the focus was on two personas: ml engineers who are building these systems and security incident responders whose task is to secure the software infrastructure including the ml systems. depending on the size of the organization, these people could be in different teams, same team or even the same person. the study was also limited to intentional malicious attacks and didn't investigate the impacts of naturally occurring adversarial examples, distributional shifts, common corruption and reward hacking. most organizations that were surveyed as a part of the study were found to primarily be focused on traditional software security and didn't have the right tools or know-how in securing against ml attacks. 
they also indicated that they were actively seeking guidance in the space. most organizations were clustered around concerns regarding data poisoning attacks which was probably the case because of the cultural significance of the tay chatbot incident. additionally, privacy breaches were another significant concern followed by concerns around model stealing attacks that can lead to the loss of intellectual property. other attacks such as attacking the ml supply chain and adversarial examples in the physical domain didn't catch the attention of the people that were surveyed as a part of the study. one of the gaps between reality and expectations was around the fact that security incident responders and ml engineers expected that the libraries that they are using for ml development are battle-tested before being put out by large organizations, as is the case in traditional software. also, they pushed upstream the responsibility of security in the cases where they were using ml as a service from cloud providers. yet, this ignores the fact that this is an emergent field and that a lot of the concerns need to be addressed in the downstream tasks that are being performed by these tools. they also didn't have a clear understanding of what to expect when something does go wrong and what the failure mode would look like. in traditional software security, mitre has a curated repository of attacks along with detection cues, reference literature and tell-tale signs for which malicious entities, including nation state attackers are known to use these attacks. the authors call for a similar compilation to be done in the emergent field of adversarial machine learning whereby the researchers and practitioners register their attacks and other information in a curated repository that provides everyone with a unified view of the existing threat environment. while programming languages often have well documented guidelines on secure coding, guidance on doing so with popular ml frameworks like pytorch, keras and tensorflow is sparse. amongst these, tensorflow is the only one that provides some tools for testing against adversarial attacks and some guidance on how to do secure coding in the ml context. security development lifecycle (sdl) provides guidance on how to secure systems and scores vulnerabilities and provides some best practices, but applying this to ml systems might allow imperfect solutions to exist. instead of looking at guidelines as providing a strong security guarantee, the authors advocate for having code examples that showcase what constitutes security-and non-security-compliant ml development. in traditional software security there are tools for static code analysis that provide guidance on the security vulnerabilities prior to the code being committed to a repository or being executed while dynamic code analysis finds security vulnerabilities by executing the different code paths and detecting vulnerabilities at runtime. there are some tools like mlsec and cleverhans that provide white-and black-box testing; one of the potential future directions for research is to extend this to the cases of model stealing, model inversion, and membership inference attacks. including these tools as a part of the ide would further make it naturalized for developers to think about secure coding practices in the ml context. 
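as a small illustration of the kind of white-box test that tools like cleverhans automate, below is a hand-rolled sketch of the fast gradient sign method in pytorch; the tiny model and random tensors are hypothetical stand-ins, and the point is only to show what an eps-bounded adversarial perturbation looks like in code, not how any particular library implements it.

```python
import torch
import torch.nn.functional as F

def fgsm(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor, eps: float) -> torch.Tensor:
    """fast gradient sign method: nudge the input in the direction that
    increases the loss, bounded by eps in the l-infinity norm."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# hypothetical toy classifier and a batch of image-like inputs
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
x = torch.rand(8, 1, 28, 28)                 # stand-in for a batch of images
y = torch.randint(0, 10, (8,))               # stand-in labels
x_adv = fgsm(model, x, y, eps=0.1)

agreement = (model(x).argmax(dim=1) == model(x_adv).argmax(dim=1)).float().mean()
print(f"predictions unchanged on {agreement:.0%} of the batch")
# the fraction of predictions that survive the perturbation typically
# drops as eps grows, which is what a white-box robustness test measures.
```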
adapting the audit and logging requirements needed for a security information and event management (siem) system to the field of ml, one can execute the attacks specified in the literature and ensure that the logging artifacts generated as a consequence can be traced back to an attack. then, having these incident logs in a format that is exportable to and integratable with siem systems will allow forensic experts to analyze them post-hoc for hardening and analysis. standardizing reporting, logging, and documentation, as the sigma format does in traditional software security, will turn the insights of one analyst into defenses for many others. automating the possible attacks and including them as part of the mlops pipeline will enhance the security posture of the systems and make such checks routine practice in the sdl. red teaming, as done in security testing, can be applied to assess business impacts and the likelihood of threats, something that is considered best practice and is often a requirement for supplying critical software to organizations like the us government. transparency centers that allow deep code inspection and help create assurance about the security posture of a software product or service can be extended to ml, where they would have to cover three modalities: that the ml platform is implemented in a secure manner, that ml as a service meets basic security and privacy requirements, and that ml models embedded on edge devices meet basic security requirements. tools that build on formal verification methods will help strengthen this practice. tracking and scoring ml vulnerabilities, akin to the software security practice of registering identified vulnerabilities in a common database like cve and then assigning an impact score like the cvss, also needs to be done for the field of ml; while the common database part is easy to set up, scoring hasn't been figured out yet. additionally, on being alerted that a new vulnerability has been discovered, it isn't clear how the ml infrastructure can be scanned to see whether the system is vulnerable to it. because of the deep integration of ml systems within the larger product or service, the typical practice of identifying a blast radius and containment strategy, applied to traditional software infrastructure when alerted of a vulnerability, is hard to define and apply; prior research from google has identified some ways to qualitatively assess the impacts in a sprawling infrastructure. from a forensic perspective, the authors put forth several questions that can guide the post-hoc analysis; the primary problem is that only some of the learnings from traditional software protection and analysis apply here, and there are many new artifacts and paradigmatic and environmental aspects that need to be taken into consideration. from a remediation perspective, we need metrics and ways to ascertain that patched models and ml systems maintain prior levels of performance while mitigating the attacks they were vulnerable to, and that no new attack surfaces are opened up in the process. given that ml is going to be the new software, we need to think seriously about inheriting security best practices from the world of traditional cybersecurity to harden defenses in the field of ml.
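as a sketch of what a siem-exportable ml incident record could look like, here is a small, purely hypothetical example; the field names and values are invented for illustration and are inspired by, but not taken from, the sigma format mentioned above.

```python
import json
from datetime import datetime, timezone

# hypothetical record for a suspected model-extraction attempt, shaped so a
# siem pipeline could ingest it for post-hoc forensics and hardening.
event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "system": "fraud-scoring-api",               # hypothetical ml service name
    "category": "ml.attack.model_extraction",
    "detection": {
        "signal": "query_volume_anomaly",
        "queries_last_hour": 48211,               # illustrative telemetry only
        "distinct_input_entropy": 7.9,
    },
    "artifacts": ["request_log_shard_0042", "model_version_2024_03"],
    "severity": "high",
}

print(json.dumps(event, indent=2))   # exportable to a siem alongside other incident logs
```

keeping such records machine-readable is what would let the post-hoc forensic and remediation steps described above be standardized and partially automated.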
all technology has implications for civil liberties and human rights, the paper opens with an example of how low-clearance bridges between new york and long island were supposedly created with the intention of disallowing public buses from crossing via the underpasses to discourage the movement of users of public transportation, primarily disadvantaged groups from accessing certain areas. in the context of adversarial machine learning, taking the case of facial recognition technology (frt), the authors demonstrate that harm can result on the most vulnerable, harm which is not theoretical and is gaining in scope, but that the analysis also extends beyond just frt systems. the notion of legibility borrowing from prior work explains how governments seek to categorize through customs, conventions and other mechanisms information about their subjects centrally. legibility is enabled for faces through frt, something that previously was only possible as a human skill. this combined with the scale offered by machine learning makes this a potent tool for authoritarian states to exert control over their populations. from a cybersecurity perspective, attackers are those that compromise the confidentiality, integrity and availability of a system, yet they are not always malicious, sometimes they may be pro-democracy protestors who are trying to resist identification and arrest by the use of frt. when we frame the challenges in building robust ml systems, we must also pay attention to the social and political implications as to who is the system being made safe for and at what costs. positive attacks against such systems might also be carried out by academics who are trying to learn about and address some of the ethical, safety and inclusivity issues around frt systems. other examples such as the hardening of systems against doing membership inference means that researchers can't determine if an image was included in the dataset, and someone looking to use this as evidence in a court of law is deterred from doing so. detection perturbation algorithms permit an image to be altered such that faces can't be recognized in an image, for example, this can be used by a journalist to take a picture of a protest scene without giving away the identities of people. but, defensive measures that disarm such techniques hinder such positive use cases. defense measures against model inversion attacks don't allow researchers and civil liberty defenders to peer into black box systems, especially those that might be biased against minorities in cases like credit allocation, parole decision-making, etc. the world of security is always an arms race whether that is in the physical or cyberspace. it is not that far-fetched to imagine how a surveillance state might deploy frt to identify protestors who as a defense might start to wear face masks for occlusion. the state could then deploy techniques that bypass this and utilize other scanning and recognition techniques to which the people might respond by wearing adversarial clothing and eyeglasses to throw off the system at which point the state might choose to use other biometric identifiers like iris scanning and gait detection. this constant arms battle, especially when defenses and offenses are constructed without the sense for the societal impacts leads to harm whose burden is mostly borne by those who are the most vulnerable and looking to fight for their rights and liberties. 
this is not the first time that technology runs up against civil liberties and human rights, there are lessons to be learned from the commercial spyware industry and how civil society organizations and other groups came together to create "human rights by design" principles that helped to set some ground rules for how to use this technology responsibly. researchers and practitioners in the field of ml security can borrow from these principles. we've got a learning community at the montreal ai ethics institute that is centered around these ideas that brings together academics and others from around the world to blend the social sciences with the technical sciences. recommendations on countering some of the harms centre around holding the vendors of these systems to the business standards set by the un, implementing transparency measures during the development process, utilizing human rights by design approaches, logging ml system uses along with possible nature and forms of attacks and pushing the development team to think about both the positive and negative use cases for the systems such that informed trade-offs can be made when hardening these systems to external attacks. in this insightful op-ed, two pioneers in technology shed light on how to think about ai systems and their relation to the existing power and social structures. borrowing the last line in the piece, " … all that is necessary for the triumph of an ai-driven, automation-based dystopia is that liberal democracy accept it as inevitable.", aptly captures the current mindset surrounding ai systems and how they are discussed in the western world. tv shows like black mirror perpetuate narratives showcasing the magical power of ai-enabled systems, hiding the fact that there are millions, if not billions of hours of human labor that undergird the success of modern ai systems, which largely fall under the supervised learning paradigm that requires massive amounts of data to work well. the chinese ecosystem is a bit more transparent in the sense that the shadow industry of data labellers is known, and workers are compensated for their efforts. this makes them a part of the development lifecycle of ai while sharing economic value with people other than the tech-elite directly developing ai. on the other hand, in the west, we see that such efforts go largely unrewarded because we trade in that effort of data production for free services. the authors give the example of audrey tang and taiwan where citizens have formed a data cooperative and have greater control over how their data is used. contrasting that, we have highly-valued search engines standing over community-run efforts like wikipedia which create the actual value for the search results, given that a lot of the highly placed search results come from wikipedia. ultimately, this gives us some food for thought as to how we portray ai today and its relation to society and why it doesn't necessarily have to be that way. mary shelly had created an enduring fiction which, unbeknownst to her, has today manifested itself in the digital realm with layered abstractions of algorithms that are increasingly running multiple aspects of our lives. the article dives into the world of black box systems that have become opaque to analysis because of their stratified complexity leading to situations with unpredictable outcomes. this was exemplified when an autonomous vehicle crashed into a crossing pedestrian and it took months of post-hoc analysis to figure out what went wrong. 
when we talk about intelligence in the case of these machines, we're using it in a very loose sense, like the term "friend" on facebook, which covers a range of interpretations from your best friend to a random acquaintance; both terms convey a greater sense of meaning than is actually true. when such systems run amok, they have the potential to cause significant harm, a case in point being the flash crashes the financial markets experienced because of the competitive behaviour of high-frequency trading firms' algorithms facing off against each other in the market. something similar has happened on amazon, where items get priced in an unrealistic fashion because of buying and pricing patterns triggered by automated systems. while in a micro context the algorithms and their workings are transparent and explainable, when they come together in an ecosystem like finance they lead to an emergent complexity whose behaviour can't be predicted ahead of time with a great amount of certainty. but such justifications can't be used as cover for evading responsibility when it comes to mitigating harms, and existing laws need to be refined and amended so that they can better meet the demands of new technology where the allocation of responsibility is a fuzzy concept.

ai systems are different from other software systems when it comes to security vulnerabilities. while traditional cybersecurity mechanisms rely heavily on securing the perimeter, ai security vulnerabilities run deeper: these systems can be manipulated through their interactions with the real world, the very mechanism that makes them intelligent systems. numerous examples of using audio samples from tv commercials to trigger voice assistants have demonstrated new attack surfaces for which we need to develop defense techniques. visual systems are also fooled, especially in av systems where, according to one example, manipulating stop signs on the road with innocuous stripes of tape makes it seem like the stop sign is a speed indicator and can cause fatal crashes. there are also examples of hiding these adversarial examples under the guise of white noise and other changes imperceptible to the human senses. we need to think of ai systems as inherently socio-technical to come up with effective protection techniques that don't just rely on technical measures but also look at the human factors surrounding them. other useful measures include abusability testing, red-teaming, white-hat hacking, bug bounty programs, and consulting with civil society advocates who have deep experience with how vulnerable communities interact with technology.

reinforcement learning systems are increasingly moving from beating human performance in games to safety-critical applications like self-driving cars and automated trading. a lack of robustness in these systems can lead to catastrophic failures, like the hundreds of millions of dollars lost by knight capital and the harms to pedestrian and driver safety in the case of autonomous vehicles. rl systems that perform well under normal conditions can be vulnerable to adversarial agents that exploit their brittleness, both through natural shifts in distributions and through more carefully crafted attacks. in prior threat models, the assumption is that the adversary can directly modify the inputs going into the rl agent, but that is not very realistic.
instead, here the authors focus on a shared environment through which the adversary creates an indirect impact on the target rl agent, leading to undesirable behavior. agents that are trained through self-play (which is a rough approximation of a nash equilibrium) are vulnerable to adversarial policies. as an example, masked victims are more robust to the adversary's modifications of the natural observations, but that lowers performance in the average case. furthermore, the researchers find non-transitive behavior between self-play opponent, masked victim, adversarial opponent, and normal victim, in that cyclic order. self-play, being normally transitive in nature, especially when mimicking real-world scenarios, is then no doubt vulnerable to these non-transitive attacks. thus, there is a need to move beyond self-play and to iteratively apply adversarial training defenses and population-based training methods so that the target rl agent becomes robust to a wider variety of scenarios.

vehicle safety is of paramount importance in the automotive industry, and many tests are conducted for crash resilience and other physical safety features before a car is released to people. but the same degree of scrutiny is not applied to the digital and connected components of cars. researchers were able to demonstrate successful proof-of-concept hacks that compromised vehicle safety. for example, with the polo, they were able to access the controller area network (can), which sends signals and controls a variety of aspects related to driving functions. given how the infotainment systems were updated, researchers were able to gain access to the personal details of the driver. they were also able to exploit shortcomings in the operation of the key fob to gain access to the vehicle without leaving a physical trace. other hacks included being able to access and influence the collision monitoring radar system and the tire-pressure monitoring system, both of which have critical implications for passenger safety. on the focus, they found wifi details, including the password for the production line in detroit, michigan. on purchasing a second-hand infotainment unit for the purpose of reverse-engineering the firmware, they found the previous owner's home wifi details, phone contacts, and a host of other personal information. cars store a lot of personal information, including tracking information which, as stated in the privacy policy, can be shared with affiliates; this can have other negative consequences, like changes in insurance premiums based on driving behaviour. europe will have some forthcoming regulations for connected car safety, but those are still some time away from coming into force.

we've all experienced specification gaming even if we haven't really heard the term before. in law, you call it following the law to the letter but not in spirit. in sports, it is called unsportsmanlike to use the edge cases and technicalities of the rules of the game to eke out an edge when it is obvious to everyone playing that the rules intended something different.
this can also happen in the case of ai systems, for example in reinforcement learning systems where the agent can utilize "bugs" or poor specification on the part of the human creators to achieve the high rewards for which it is optimizing without actually achieving the goal, at least in the way the developers intended them to and this can sometimes lead to unintended consequences that can cause a lot of harms. "let's look at an example. in a lego stacking task, the desired outcome was for a red block to end up on top of a blue block. the agent was rewarded for the height of the bottom face of the red block when it is not touching the block. instead of performing the relatively difficult maneuver of picking up the red block and placing it on top of the blue one, the agent simply flipped over the red block to collect the reward. this behaviour achieved the stated objective (high bottom face of the red block) at the expense of what the designer actually cares about (stacking it on top of the blue one)". this isn't because of a flaw in the rl system but more so a misspecification of the objective. as the agents become more capable, they find ever-more clever ways of achieving the rewards which can frustrate the creators of the system. this makes the problem of specification gaming very relevant and urgent as we start to deploy these systems in a lot of real-world situations. in the rl context, task specification refers to the design of the rewards, the environment and any other auxiliary rewards. when done correctly, we get true ingenuity out of these systems like move from the alphago system that baffled humans and ushered a new way of thinking about the game of go. but, this requires discernment on the part of the developers to be able to judge when you get a case like lego vs. move . as an example in the real-world, reward tampering is an approach where the agent in a traffic optimization system with an interest in achieving a high reward can manipulate the driver into going to alternate destinations instead of what they desired just to achieve a higher reward. specification gaming isn't necessarily bad in the sense that we want the systems to come up with ingenious ways to solve problems that won't occur to humans. sometimes, the inaccuracies can arise in how humans provide feedback to the system while it is training. ''for example, an agent performing a grasping task learned to fool the human evaluator by hovering between the camera and the object." incorrect reward shaping, where an agent is provided rewards along the way to achieving the final reward can also lead to edge-case behaviours when it is not analyzed for potential side-effects. we see such examples happen with humans in the real-world as well: a student asked to get a good grade on the exam can choose to copy and cheat and while that achieves the goal of getting a good grade, it doesn't happen in the way we intended for it to. thus, reasoning through how a system might game some of the specifications is going to be an area of key concern going into the future. the ongoing pandemic has certainly accelerated the adoption of technology in everything from how we socialize to buying groceries and doing work remotely. the healthcare industry has also been rapid in adapting to meet the needs of people and technology has played a role in helping to scale care to more people and accelerate the pace with which the care is provided. 
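returning to the lego-stacking example quoted above, here is a toy illustration, entirely our own construction rather than the original environment, of how an agent maximizing a misspecified reward satisfies the stated objective without achieving the designer's goal.

```python
# toy specification-gaming example: the specified reward is the height of the
# red block's bottom face minus an effort penalty; flipping the block games it.
actions = {
    # stacking on the blue block: bottom face ends up high, but the maneuver
    # is hard (large effort cost). this is what the designer actually wanted.
    "stack_on_blue":  {"bottom_face_height": 1.0, "effort": 0.9, "stacked": True},
    # flipping the red block in place: bottom face is just as high, almost no
    # effort, and the intended goal is not achieved at all.
    "flip_red_block": {"bottom_face_height": 1.0, "effort": 0.1, "stacked": False},
}

def specified_reward(outcome):
    # what the designer wrote down, not what the designer actually cares about
    return outcome["bottom_face_height"] - outcome["effort"]

best = max(actions, key=lambda name: specified_reward(actions[name]))
print("reward-maximizing action:", best)                      # flip_red_block
print("designer's goal achieved:", actions[best]["stacked"])  # False
```

the cure is not a smarter optimizer but a better specification: reward the outcome that is actually cared about (the red block resting on top of the blue one) rather than a proxy for it.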
but, this comes with the challenge of making decisions under duress and with shortened timelines within which to make decisions on whether to adopt a piece of technology or not. this has certainly led to issues where there are risks of adopting solutions that haven't been fully vetted and using solutions that have been repurposed from prior uses that were approved to now combat covid- . especially with ai-enabled tools, there are increased risks of emergent behavior that might not have been captured by the previous certification or regulatory checks. the problems with ai solutions don't just go away because there is a pandemic and shortcutting the process of proper due diligence can lead to more harm than the benefits that they bring. one must also be wary of the companies that are trying to capitalize on the chaos and pass through solutions that don't really work well. having technical staff during the procurement process that can look over the details of what is being brought into your healthcare system needs to be a priority. ai can certainly help to mitigate some of the harms that covid- is inflicting on patients but we must keep in mind that we're not looking to bypass privacy concerns that come with processing vast quantities of healthcare data. in the age of adversarial machine learning (maiei has a learning community on machine learning security if you'd like to learn more about this area) there are enormous concerns with protecting software infrastructure as ml opens up a new attack surface and new vectors which are seldom explored. from the perspective of insurance, there are gaps in terms of what cyber-insurance covers today, most of it being limited to the leakage of private data. there are two kinds of attacks that are possible on ml systems: intentional and unintentional. intentional attacks are those that are executed by malicious agents who attempt to steal the models, infer private data or get the ai system to behave in a way that favors their end goals. for example, when tumblr decided to not host pornographic content, creators bypassed that by using green screens and pictures of owls to fool the automated content moderation system. unintended attacks can happen when the goals of the system are misaligned with what the creators of the system actually intended, for example, the problem of specification gaming, something that abhishek gupta discussed here in this fortune article. in interviewing several officers in different fortune companies, the authors found that there are key problems in this domain at the moment: the defenses provided by the technical community have limited efficacy, existing copyright, product liability, and anti-hacking laws are insufficient to capture ai failure modes. lastly, given that this happens at a software level, cyber-insurance might seem to be the way to go, yet current offerings only cover a patchwork of the problems. business interruptions and privacy leaks are covered today under cyber-insurance but other problems like bodily harm, brand damage, and property damage are for the most part not covered. in the case of model recreation, as was the case with the openai gpt- model, prior to it being released, it was replicated by external researchers -this might be covered under cyber-insurance because of the leak of private information. researchers have also managed to steal information from facial recognition databases using sample images and names which might also be covered under existing policies. 
but in cases like uber's, where there was bodily harm because the self-driving vehicle was not able to detect the pedestrian accurately, or similar harms that might arise in foggy, snowy, dimly lit, or other out-of-distribution conditions, existing insurance terms do not provide adequate coverage. for brand damage that might arise from poisoning attacks, as in the case of the tay chatbot, or from confounding anti-virus systems, as in the attack mounted against the cylance system, cyber-insurance also falls woefully short. in a hypothetical situation presented in a google paper on rl agents, where a cleaning robot sticks a wet mop into an electric socket, the resulting material damage might likewise be considered out of scope for cyber-insurance policies. traditional software attacks are known unknowns, but adversarial ml attacks are unknown unknowns and hence harder to guard against. current pricing reflects this uncertainty, but as the ai insurance market matures and there is a deeper understanding of what the risks are and how companies can mitigate the downsides, the pricing should become more reflective of the actual risks. the authors also offer some recommendations on how to prepare an organization for these risks - for example, appointing an officer who works closely with the ciso and the chief data protection officer, performing table-top exercises to understand where the system might fail, and evaluating the system for risks and gaps following guidelines such as the eu trustworthy ai guidelines. there are no widely accepted best practices for mitigating security and privacy issues related to machine learning (ml) systems. existing best practices for traditional software systems are insufficient because they're largely based on the prevention and management of access to a system's data and/or software, whereas ml systems have additional vulnerabilities and novel harms that need to be addressed. for example, one harm posed by ml systems is to individuals not included in the model's training data but who may be negatively impacted by its inferences. harms from ml systems can be broadly categorized as informational harms and behavioral harms. informational harms "relate to the unintended or unanticipated leakage of information"; the attacks that constitute them leak something about the training data or the model itself, such as inferring whether a person's record was part of the training set, reconstructing training data, or extracting the model. behavioral harms "relate to manipulating the behavior of the model itself, impacting the predictions or outcomes of the model." the attacks that constitute behavioral harms are:
• poisoning: inserting malicious data into a model's training data to change its behavior once deployed (a toy sketch of this appears below)
• evasion: feeding data into a system to intentionally cause misclassification
without a set of best practices, ml systems may not be widely and/or successfully adopted. therefore, the authors of this white paper suggest a "layered approach" to mitigate the privacy and security issues facing ml systems. approaches include noise injection, intermediaries, transparent ml mechanisms, access controls, model monitoring, model documentation, white hat or red team hacking, and open-source software privacy and security resources. finally, the authors note, it's important to encourage "cross-functional communication" between data scientists, engineers, legal teams, business managers, etc. in order to identify and remediate privacy and security issues related to ml systems. this communication should be ongoing, transparent, and thorough.
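as referenced in the poisoning bullet above, here is a minimal, self-contained sketch of a label-flipping poisoning attack on a toy classifier; the data, model, and poisoning rate are illustrative stand-ins rather than anything from the white paper.

```python
# toy label-flipping poisoning attack: an attacker who controls part of the
# training pipeline mislabels many points from one class, and the deployed
# model's test accuracy degrades badly.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_blobs(n_per_class):
    """two well-separated gaussian blobs as a stand-in dataset."""
    x = np.vstack([rng.normal(-2.0, 1.0, size=(n_per_class, 2)),
                   rng.normal(+2.0, 1.0, size=(n_per_class, 2))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return x, y

x_train, y_train = make_blobs(500)
x_test, y_test = make_blobs(200)

clean_model = LogisticRegression().fit(x_train, y_train)
print("clean training data, test accuracy:   ", clean_model.score(x_test, y_test))

# poisoning step: flip the labels on 60% of class-0 training points so the
# learned decision boundary no longer reflects the true classes.
y_poisoned = y_train.copy()
class0_idx = np.where(y_train == 0)[0]
flip_idx = rng.choice(class0_idx, size=int(0.6 * len(class0_idx)), replace=False)
y_poisoned[flip_idx] = 1

poisoned_model = LogisticRegression().fit(x_train, y_poisoned)
print("poisoned training data, test accuracy:", poisoned_model.score(x_test, y_test))
```

evasion sits at the other end of the pipeline: the training data is left alone and the attacker perturbs inputs at inference time, as in the adversarial-example sketch earlier in this section.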
beyond near- and long-term: towards a clearer account of research priorities in ai ethics and society. this paper dives into how researchers can clearly communicate about their research agendas given ambiguities in the split of the ai ethics community into near-term and long-term research. often a sore and contentious point of discussion, there is an artificial divide between the two groups, which seem to take a reductionist approach to the work being done by the other. a major problem emerging from such a divide is a hindrance in being able to spot relevant work being done by the different communities, which hampers effective collaboration. the paper highlights the differences as arising primarily along the lines of timescale, ai capabilities, and deeper normative and empirical disagreements. the paper provides a helpful distinction between near- and long-term by describing them as follows:
• near-term issues are those that are fairly well understood, have concrete examples, and relate to recent progress in the field of machine learning
• long-term issues are those that might arise far into the future due to much more advanced ai systems with broad capabilities; this also includes long-term impacts on international security, race relations, and power dynamics
what they currently see is that:
• issues considered 'near-term' tend to be those arising in the present/near future as a result of current/foreseeable ai systems and capabilities, on varying levels of scale/severity, which mostly have immediate consequences for people and society.
• issues considered 'long-term' tend to be those arising far into the future as a result of large advances in ai capabilities (with a particular focus on notions of transformative ai or agi), and those that are likely to pose risks that are severe/large in scale with very long-term consequences.
• the binary clusters are not sufficient as a way to split the field, and not looking at underlying beliefs leads to unfounded assumptions about each other's work.
• in addition, there might be areas between the near and long term that are neglected as a result of these artificial factions.
unpacking these distinctions can be done along the lines of capabilities, extremity, certainty, and impact, definitions for which are provided in the paper. a key contribution, aside from identifying these factors, is that they lie along a spectrum and define a possibility space when used as dimensions, which helps to identify where research is currently concentrated and what areas are being ignored. it also helps to position the work being done by the authors themselves. something that we really appreciated from this work was the fact that it gives us concrete language and tools to more effectively communicate about each other's work. as part of our efforts in building communities that leverage diverse experiences and backgrounds to tackle an inherently complex and multi-dimensional problem, we deeply appreciate how challenging yet rewarding such an effort can be. some of the most meaningful public consultation work done by maiei leveraged our internalized framework in a similar vein to provide value to the process that led to outcomes like the montreal declaration for responsible ai. the rise of ai systems leads to an unintended conflict between economic pursuits, which seek to generate profits and value resources appropriately, and the moral imperatives of promoting human flourishing and creating societal benefits from the deployment of these systems.
this puts forth a central question on what the impacts of creating ai systems that might surpass humans in a general sense which might leave humans behind. technological progress doesn't happen on its own, it is driven by conscious human choices that are influenced by the surrounding social and economic institutions. we are collectively responsible for how these institutions take shape and thus impact the development of technology -submitting to technological-fatalism isn't a productive way to align our ethical values with this development. we need to ensure that we play an active role in the shaping of the most consequential piece of technology. while the economic system relies on the market prices to gauge what people place value on, by no means is that a comprehensive evaluation. for example, it misses out on the impact of externalities which can be factored in by considering ethical values as a complement in guiding our decisions on what to build and how to place value on it. when thinking about losses from ai-enabled automation, an outright argument that economists might make is that if replacing labor lowers the costs of production, then it might be market-efficient to invest in technology that achieves that. from an ethicist's perspective, there are severe negative externalities from job loss and thus it might be unethical to impose labor-saving automation on people. when unpacking the economic perspective more, we find that job loss actually isn't correctly valued by wages as price for labor. there are associated social benefits like the company of workplace colleagues, sense of meaning and other social structural values which can't be separately purchased from the market. thus, using a purely economic perspective in making automation technology adoption decisions is an incomplete approach and it needs to be supplemented by taking into account the ethical perspective. market price signals provide useful information upto a certain point in terms of the goods and services that society places value on. suppose that people start to demand more eggs from chickens that are raised in a humane way, then suppliers will shift their production to respond to that market signal. but, such price signals can only be indicated by consumers for the things that they can observe. a lot of unethical actions are hidden and hence can't be factored into market price signals. additionally, several things like social relations aren't tradable in a market and hence their value can't be solely determined from the market viewpoint. thus, both economists and ethicists would agree that there is value to be gained in steering the development of ai systems keeping in mind both kinds of considerations. pure market-driven innovation will ignore societal benefits in the interest of generating economic value while the labor will have to make unwilling sacrifices in the interest of long-run economic efficiency. economic market forces shape society significantly, whether we like it or not. there are professional biases based on selection and cognition that are present in either side making its arguments as to which gets to dominate based on their perceived importance. the point being that bridging the gap between different disciplines is crucial to arriving at decisions that are grounded in evidence and that benefit society holistically. 
there are also fundamental differences between the economic and the ethical perspective - namely that economic indicators are usually unidimensional and have clear quantitative values that make them easy to compare. ethical indicators, on the other hand, are inherently multi-dimensional and subjective, which not only makes comparison hard but also limits our ability to explain how we arrive at them. they are encoded deep within our biological systems and suffer from the same lack of explainability as decisions made by artificial neural networks, the so-called black box problem. why is it, then, despite the arguments above, that the economic perspective dominates over the ethical one? this is largely driven by the fact that economic values provide clear, unambiguous signals which our ambiguity-averse brains prefer, while ethical values are more subtle, hidden, and ambiguous indicators that complicate decision making. secondly, humans are prosocial only up to a point: they are able to reason between economic and ethical decisions at a micro-level because the effects are immediate and observable, for example polluting the neighbor's lawn and seeing the direct impact of that activity. on the other hand, for things like climate change, where the effects are delayed and not directly observable as a consequence of one's own actions, individuals tend to prioritize economic values over ethical ones. cynical economists will argue that there is a comparative advantage in being immoral that leads to gains in exchange, but that leads to a race to the bottom in terms of ethics. externalities are an embodiment of the conflict between economic and ethical values. welfare economics deals with externalities via various mechanisms like permits, taxes, etc. to curb the impacts of negative externalities and promote positive externalities through incentives. but the rich economic theory needs to be supplemented by political, social, and ethical values to arrive at something that benefits society at large. from an economic standpoint, technological progress is positioned as expanding the production possibilities frontier, which means that it raises output and presumably standards of living. yet this ignores how those benefits are distributed; it looks only at material benefits and ignores everything else. prior to the industrial revolution, people were stuck in a malthusian trap whereby technological advances created material gains but these were quickly consumed by population growth that kept standards of living stubbornly low. this changed after the industrial revolution: as technological improvement outpaced population growth, quality of life improved. recent decades have been a more mixed experience, though. automation has eroded lower-skilled jobs, forcing people to keep looking for work despite displacement, and the lower demand for unskilled labor coupled with the inelastic supply of labor has led to lower wages rather than unemployment. on the other hand, high-skilled workers have been able to leverage technological progress to enhance their output considerably, and as a consequence the income and wealth gaps between low- and high-skilled workers have widened tremendously. typical economic theory points to income and wealth redistribution whenever there is technological innovation, where the more significant the innovation, the larger the redistribution.
something as significant as ai leads to crowning of new winners who own these new factors of production while also creating losers when they face negative pecuniary externalities. these are externalities because there isn't explicit consent that is requested from the people as they're impacted in terms of capital, labor and other factors of production. the distribution can be analyzed from the perspective of strict utilitarianism (different from that in ethics where for example bentham describes it as the greatest amount of good for the greatest number of people). here it is viewed as tolerating income redistribution such that it is acceptable if all but one person loses income as long as the single person making the gain has one that is higher than the sum of the losses. this view is clearly unrealistic because it would further exacerbate inequities in society. the other is looking at the idea of lump sum transfers in which the idealized scenario is redistribution, for example by compensating losers from technology innovation, without causing any other market distortions. but, that is also unrealistic because such a redistribution never occurs without market distortions and hence it is not an effective way to think about economic policy. from an ethics perspective, we must make value judgments on how we perceive dollar losses for a lower socio-economic person compared to the dollar gains made by a higher socio-economic person and if that squares with the culture and value set of that society. we can think about the tradeoff between economic efficiency and equality in society, where the level of tolerance for inequality varies by the existing societal structures in place. one would have to also reason about how redistribution creates more than proportional distortions as it rises and how much economic efficiency we'd be willing to sacrifice to make gains in how equitably income is distributed. thus, steering progress in ai can be done based on whether we want to pursue innovation that we know is going to have negative labor impacts while knowing full well that there aren't going to be any reasonable compensations offered to the people based on economic policy. given the pervasiveness of ai and by virtue of it being a general-purpose technology, the entrepreneurs and others powering innovation need to take into account that their work is going to shape larger societal changes and have impacts on labor. at the moment, the economic incentives are such that they steer progress towards labor-saving automation because labor is one of the most highly-taxed factors of production. instead, shifting the tax-burden to other factors of production including automation capital will help to steer the direction of innovation in other directions. government, as one of the largest employers and an entity with huge spending power, can also help to steer the direction of innovation by setting policies that encourage enhancing productivity without necessarily replacing labor. there are novel ethical implications and externalities that arise from the use of ai systems, an example of that would be (from the industrial revolution) that a factory might lead to economic efficiency in terms of production but the pollution that it generates is so large that the social cost outweighs the economic gain. biases can be deeply entrenched in the ai systems, either from unrepresentative datasets, for example, with hiring decisions that are made based on historical data. 
but, even if the datasets are well-represented and have minimal bias, and the system is not exposed to protected attributes like race and gender, there are a variety of proxies like zipcode which can lead to unearthing those protected attributes and discriminating against minorities. maladaptive behaviors can be triggered in humans by ai systems that can deeply personalize targeting of ads and other media to nudge us towards different things that might be aligned with making more profits. examples of this include watching videos, shopping on ecommerce platforms, news cycles on social media, etc. conversely, they can also be used to trigger better behaviors, for example, the use of fitness trackers that give us quantitative measurements for how we're taking care of our health. an economics equivalent of the paper clip optimizer from bostrom is how human autonomy can be eroded over time as economic inequality rises which limits control of those who are displaced over economic resources and thus, their control over their destinies, at least from an economic standpoint. this is going to only be exacerbated as ai starts to pervade into more and more aspects of our lives. labor markets have features built in them to help tide over unemployment with as little harm as possible via quick hiring in other parts of the economy when the innovation creates parallel demands for labor in adjacent sectors. but, when there is large-scale disruption, it is not possible to accommodate everyone and this leads to large economic losses via fall in aggregate demand which can't be restored with monetary or fiscal policy actions. this leads to wasted economic potential and welfare losses for the workers who are displaced. whenever there is a discrepancy between ethical and economic incentives, we have the opportunity to steer progress in the right direction. we've discussed before how market incentives trigger a race to the bottom in terms of morality. this needs to be preempted via instruments like technological impact assessments, akin to environmental impact assessments, but often the impacts are unknown prior to the deployment of the technology at which point we need to have a multi-stakeholder process that allows us to combat harms in a dynamic manner. political and regulatory entities typically lag technological innovation and can't be relied upon solely to take on this mantle. the author raises a few questions on the role of humans and how we might be treated by machines in case of the rise of superintelligence (which still has widely differing estimates for when it will be realized from the next decade to the second half of this century). what is clear is that the abilities of narrow ai systems are expanding and it behooves us to give some thought to the implications on the rise of superintelligence. the potential for labor-replacement in this superintelligence scenario, from an economic perspective, would have significant existential implications for humans, beyond just inequality, we would be raising questions of human survival if the wages to be paid to labor fall below subsistence levels in a wide manner. it would be akin to how the cost of maintaining oxen to plough fields was outweighed by the benefits that they brought in the face of mechanization of agriculture. this might be an ouroboros where we become caught in the malthusian trap again at the time of the industrial revolution and no longer have the ability to grow beyond basic subsistence, even if that would be possible. 
authors of research papers aspire to achieving any of the following goals when writing papers: to theoretically characterize what is learnable, to obtain understanding through empirically rigorous experiments, or to build working systems that have high predictive accuracy. to communicate effectively with the readers, the authors must: provide intuitions to aid the readers' understanding, describe empirical investigations that consider and rule out alternative hypotheses, make clear the relationship between theoretical analysis and empirical findings, and use clear language that doesn't conflate concepts or mislead the reader. the authors of this paper find that there are areas where there are concerns when it comes to ml scholarship: failure to distinguish between speculation and explanation, failure to identify the source of empirical gains, the use of mathematics that obfuscates or impresses rather than clarifies, and misuse of language such that terms with other connotations are used or by overloading terms with existing technical definitions. flawed techniques and communication methods will lead to harm and wasted resources and efforts hindering the progress in ml and hence this paper provides some very practical guidance on how to do this better. when presenting speculations or opinions of authors that are exploratory and don't yet have scientific grounding, having a separate section that quarantines the discussion and doesn't bleed into the other sections that are grounded in theoretical and empirical research helps to guide the reader appropriately and prevents conflation of speculation and explanation. the authors provide the example of the paper on dropout regularization that made comparisons and links to sexual reproduction but limited that discussion to a "motivation" section. using mathematics in a manner where natural language and mathematical expositions are intermixed without a clear link between the two leads to weakness in the overall contribution. specifically, when natural language is used to overcome weaknesses in the mathematical rigor and conversely, mathematics is used as a scaffolding to prop up weak arguments in the prose and give the impression of technical depth, it leads to poor scholarship and detracts from the scientific seriousness of the work and harms the readers. additionally, invoking theorems with dubious pertinence to the actual content of the paper or in overly broad ways also takes away from the main contribution of a paper. in terms of misuse of language, the authors of this paper provide a convenient ontology breaking it down into suggestive definitions, overloaded terminology, and suitcase words. in the suggestive definitions category, the authors coin a new technical term that has suggestive colloquial meanings and can slip through some implications without formal justification of the ideas in the paper. this can also lead to anthropomorphization that creates unrealistic expectations about the capabilities of the system. this is particularly problematic in the domain of fairness and other related domains where this can lead to conflation and inaccurate interpretation of terms that have well-established meanings in the domains of sociology and law for example. this can confound the initiatives taken up by both researchers and policymakers who might use this as a guide. overloading of technical terminology is another case where things can go wrong when terms that have historical meanings and they are used in a different sense. 
for example, the authors talk about deconvolutions which formally refers to the process of reversing a convolution but in recent literature has been used to refer to transpose convolutions that are used in auto-encoders and gans. once such usage takes hold, it is hard to undo the mixed usage as people start to cite prior literature in future works. additionally, combined with the suggestive definitions, we run into the problem of concealing a lack of progress, such as the case with using language understanding and reading comprehension to now mean performance on specific datasets rather than the grand challenge in ai that it meant before. another case that leads to overestimation of the ability of these systems is in using suitcase words which pack in multiple meanings within them and there isn't a single agreed upon definition. interpretability and generalization are two such terms that have looser definitions and more formally defined ones, yet because papers use them in different ways, it leads to miscommunication and researchers talking across each other. the authors identify that these problems might be occurring because of a few trends that they have seen in the ml research community. specifically, complacency in the face of progress where there is an incentive to excuse weak arguments in the face of strong empirical results and the single-round review process at various conferences where the reviewers might not have much choice but to accept the paper given the strong empirical results. even if the flaws are noticed, there isn't any guarantee that they are fixed in a future review cycle at another conference. as the ml community has experienced rapid growth, the problem of getting high-quality reviews has been exacerbated: in terms of the number of papers to be reviewed by each reviewer and the dwindling number of experienced reviewers in the pool. with the large number of papers, each reviewer has less time to analyze papers in depth and reviewers who are less experienced can fall easily into some of the traps that have been identified so far. thus, there are two levers that are aggravating the problem. additionally, there is the risk of even experienced researchers resorting to a checklist-like approach under duress which might discourage scientific diversity when it comes to papers that might take innovative or creative approaches to expressing their ideas. a misalignment in incentives whereby lucrative deals in funding are offered to ai solutions that utilize anthropomorphic characterizations as a mechanism to overextend their claims and abilities though the authors recognize that the causal direction might be hard to judge. the authors also provide suggestions for other authors on how to evade some of these pitfalls: asking the question of why something happened rather than just relying on how well a system performed will help to achieve the goal of providing insights into why something works rather than just relying on headline numbers from the results of the experiments. they also make a recommendation for insights to follow the lines of doing error analysis, ablation studies, and robustness checks and not just be limited to theory. as a guideline for reviewers and journal editors, making sure to strip out extraneous explanations, exaggerated claims, changing anthropomorphic naming to more sober alternatives, standardizing notation, etc. should help to curb some of the problems. 
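to make the "deconvolution" example above concrete, here is a small illustrative snippet (our own, assuming pytorch): the layer that many papers call a deconvolution is, in common frameworks, a transposed convolution used for upsampling, not the signal-processing operation that inverts a convolution.

```python
# "deconvolution" as commonly used in deep learning papers is really a
# transposed convolution: an upsampling layer, not an inverse of convolution.
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)  # a 4x4 single-channel feature map

# the api name is the honest one: ConvTranspose2d, not "deconvolution".
upsample = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                              kernel_size=2, stride=2)

y = upsample(x)
print(y.shape)  # torch.Size([1, 1, 8, 8]) -- upsampled, not "de-convolved"
```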
encouraging retrospective analysis of papers is something that is underserved at the moment and there aren't enough strong papers in this genre yet despite some avenues that have been advocating for this work. flawed scholarship as characterized by the points as highlighted here not only negatively impact the research community but also impact the policymaking process that can overshoot or undershoot the mark. an argument can be made that setting the bar too high will impede new ideas being developed and slow down the cycle of reviews and publication while consuming precious resources that could be deployed in creating new work. but, asking basic questions to guide us such as why something works, in which situations it does not work, and have the design decisions been justified will lead to a higher quality of scholarship in the field. the article summarizes recent work from several microsoft researchers on the subject of making ai ethics checklists that are effective. one of the most common problems identified relate to the lack of practical applicability of ai ethics principles which sound great and comprehensive in the abstract but do very little to aid engineers and practitioners from applying them in their day to day work. the work was done by interviewing several practitioners and advocating for a co-design process that brings in intelligence on how to make these tools effective from other disciplines like healthcare and aviation. one of the things emerging from the interviews is that often engineers are few and far between in raising concerns and there's a lack of top-down sync in enforcing these across the company. additionally, there might be social costs to bringing up issues which discourages engineers from implementing such measures. creating checklists that reduce friction and fit well into existing workflows will be key in their uptake. for a lot of people who are new to the field of artificial intelligence and especially ai ethics, they see existential risk as something that is immediate. others dismiss it as something to not be concerned about at all. there is a middle path here and this article sheds a very practical light on that. using the idea of canaries in a coal mine, the author goes on to highlight some potential candidates for a canary that might help us judge better when we ought to start paying attention to these kinds of risks posed by artificial general intelligence systems. the first one is the automatic formulation of learning problems, akin to how humans have high-level goals that they align with their actions and adjust them based on signals that they receive on the success or failure of those actions. ai systems trained in narrow domains don't have this ability just yet. the second one mentioned in the article is achieving fully autonomous driving, which is a good one because we have lots of effort being directed to make that happen and it requires a complex set of problems to be addressed including the ability to make real-time, life-critical decisions. ai doctors are pointed out as a third canary, especially because true replacement of doctors would require a complex set of skills spanning the ability to make decisions about a patient's healthcare plan by analyzing all their symptoms, coordinating with other doctors and medical staff among other human-centered actions which are currently not feasible for ai systems. 
lastly, the author points to the creation of conversation systems that are able to answer complex queries and respond to things like exploratory searches. we found the article to put forth a meaningful approach to reasoning about existential risk when it comes to ai systems. a lot of articles pitch development, investment and policymaking in ai as an arms race with the us and china as front-runners. while there are tremendous economic gains to be had in deploying and utilizing ai for various purposes, there remain concerns about how this can be used to benefit society more than just economically. a lot of ai strategies from different countries are thus focused on issues of inclusion, ethics and more that can drive better societal outcomes, yet they differ widely in how they seek to achieve those goals. for example, india has put forth a national ai strategy focused on economic growth and social inclusion, dubbed #aiforall, while china's strategy has been more focused on becoming the globally dominant force in ai, backed by state investments. some countries have instead chosen to focus on creating strong legal foundations for the ethical deployment of ai, while others are more focused on data protection rights. canada and france have entered into agreements to work together on ai policy that place talent, r&d and ethics at the center. the author of the article makes a case for how global coordination of ai strategies might lead to even higher gains, but also recognizes that governments will be motivated to tailor their policies to best meet the requirements of their own countries first and then align with others that might have similar goals. reproducibility is of paramount importance to doing rigorous research, and a plethora of fields have suffered from a crisis where scientific work hasn't passed muster in terms of reproducibility, leading to wasted time and effort on the part of other researchers looking to build upon each other's work. the article provides insights from the work of a researcher who took a meta-science approach to figuring out what constitutes good, reproducible research in the field of machine learning. a distinction is made early on with replicability, which hinges on taking someone else's code and running it on the shared data to see if you get the same results; as pointed out in the article, this suffers from source and code bias, since the shared code might leverage peculiarities of particular configurations. the key tenets of reproducibility are being able to simply read a scientific paper, set up the same experiment, follow the steps prescribed, and arrive at the same results; arriving at that final step is dubbed independent reproducibility. the distinction between replicability and reproducibility also speaks to the quality of the scientific paper in being able to effectively capture the essence of the contribution such that anyone else is able to do the same. some of the findings from this work: having hyperparameters well specified in the paper, and the paper's ease of readability, contributed to reproducibility. more mathematical specification might suggest more reproducibility, but that was found not necessarily to be the case. empirical papers were inclined to be more reproducible but could also create perverse incentives and side effects. sharing code is not a panacea and requires other accompanying factors to make the work really reproducible.
cogent writing was found to be helpful along with code snippets that were either actual or pseudo code though step code that referred to other sections hampered reproducibility because of challenges in readability. simplified examples while appealing didn't really aid in the process and spoke to the meta-science process calling for data-driven approaches to ascertaining what works and what doesn't rather than relying on hunches. also, posting revisions to papers and being reachable over email to answer questions helped the author in reproducing the research work. finally, the author also pointed out that given this was a single initiative and was potentially biased in terms of their own experience, background and capabilities, they encourage others to tap into the data being made available but these guidelines provide good starting points for people to attempt to make their scientific work more rigorous and reproducible. the push has been to apply ai to any new problem that we face, hoping that the solution will magically emerge from the application of the technique as if it is a dark art. yet, the more seasoned scientists have seen these waves come and go and in the past, a blind trust in this technology led to ai winters. taking a look at some of the canaries in the coal mine, the author cautions that there might be a way to judge whether ai will be helpful with the pandemic situation. specifically, looking at whether domain experts, like leading epidemiologists endorse its use and are involved in the process of developing and utilizing these tools will give an indication as to whether they will be successful or not. data about the pandemic depends on context and without domain expertise, one has to make a lot of assumptions which might be unfounded. all models have to make assumptions to simplify reality, but if those assumptions are rooted in domain expertise from the field then the model can mimic reality much better. without context, ai models assume that the truth can be gleaned solely from the data, which though it can lead to surprising and hidden insights, at times requires humans to evaluate the interpretations to make meaning from them and apply them to solve real-world problems. this was demonstrated with the case where it was claimed that ai had helped to predict the start of the outbreak, yet the anomaly required the analysis from a human before arriving at that conclusion. claims of extremely high accuracy rates will give hardened data scientists reason for caution, especially when moving from lab to real-world settings as there is a lot more messiness with real-world data and often you encounter out-of-distribution data which hinders the ability of the model to make accurate predictions. for ct scans, even if they are sped up tremendously by the use of ai, doctors point out that there are other associated procedures such as the cleaning and filtration and recycling of air in the room before the next patient can be passed through the machine which can dwindle the gains from the use of an unexplainable ai system analyzing the scans. concerns with the use of automated temperature scanning using thermal cameras also suffers from similar concerns where there are other confounding factors like the ambient temperature, humidity, etc. which can limit the accuracy of such a system. ultimately, while ai can provide tremendous benefits, we mustn't blindly be driven by its allure to magically solve the toughest challenges that we face. 
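picking up the reproducibility findings summarized above, here is a minimal sketch, our own rather than the study's tooling, of recording experiment metadata (seed, hyperparameters, environment, results) alongside each run so that others can independently reproduce the work; the hyperparameter names and values are hypothetical.

```python
# record everything needed to rerun an experiment next to its results.
import json
import platform
import random
import time

config = {
    "seed": 42,
    "learning_rate": 3e-4,  # hypothetical hyperparameters for illustration
    "batch_size": 64,
    "epochs": 10,
}

random.seed(config["seed"])  # seed every source of randomness actually used

# ... training would happen here ...
results = {"test_accuracy": 0.91}  # placeholder result

record = {
    "config": config,
    "results": results,
    "python_version": platform.python_version(),
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
}

with open("experiment_record.json", "w") as f:
    json.dump(record, f, indent=2)
```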
offering an interesting take on how to shape the development and deployment of ai technologies, mhlambi utilizes the philosophy of ubuntu as a guiding light in how to build ai systems that better empower people and communities. the current western view that dominates how ai systems are constructed today and how they optimize for efficiency is something that lends itself quite naturally to inequitable outcomes and reinforcing power asymmetries and other imbalances in society. embracing the ubuntu mindset which puts people and communities first stands in contrast to this way of thinking. it gives us an alternative conception of personhood and has the potential to surface some different results. while being thousands of years old, the concept has been seen in practice over and over again, for example, in south africa, after the end of the apartheid, the truth and reconciliation program forgave and integrated offenders back into society rather than embark on a kantian or retributive path to justice. this restorative mindset to justice helped the country heal more quickly because the philosophy of ubuntu advocates that all people are interconnected and healing only happens when everyone is able to move together in a harmonious manner. this was also seen in the aftermath of the rwanda genocide, where oppressors were reintegrated back into society often living next to the people that they had hurt; ubuntu believes that no one is beyond redemption and everyone deserves the right to have their dignity restored. bringing people together through community is important, restorative justice is a mechanism that makes the community stronger in the long run. current ai formulation seeks to find some ground truth but thinking of this in the way of ubuntu means that we try to find meaning and purpose for these systems through the values and beliefs that are held by the community. ubuntu has a core focus on equity and empowerment for all and thus the process of development is slow but valuing people above material efficiency is more preferable than speeding through without thinking of the consequences that it might have on people. living up to ubuntu means offering people the choice for what they want and need, rooting out power imbalances and envisioning the companies as a part of the communities for which they are building products and services which makes them accountable and committed to the community in empowering them. ethics in the context of technology carries a lot of weight, especially because the people who are defining what it means will influence the kinds of interventions that will be implemented and the consequences that follow. given that technology like ai is used in high-stakes situations, this becomes even more important and we need to ask questions about the people who take this role within technology organizations, how they take corporate and public values and turn them into tangible outcomes through rigorous processes, and what regulatory measures are required beyond these corporate and public values to ensure that ethics are followed in the design, development and deployment of these technologies. 
ethics owners, the broad term for people who are responsible for this within organizations have a vast panel of responsibilities including communication between the ethics review committees and product design teams, aligning the recommendations with the corporate and public values, making sure that legal compliance is met and communicating externally about the processes that are being adopted and their efficacy. ethical is a polysemous word in that it can refer to process, outcomes, and values. the process refers to the internal procedures that are adopted by the firm to guide decision making on product/service design and development choices. the values aspect refers to the value set that is both adopted by the organization and those of the public within which the product/service might be deployed. this can include values such as transparency, equity, fairness, privacy, among others. the outcomes refer to desirable properties in the outputs from the system such as equalized odds across demographics and other fairness metrics. in the best case, inside a technology company, there are robust and well-managed processes that are aligned with collaboratively-determined ethical outcomes that achieve the community's and organization's ethical values. from the outside, this takes on the meaning of finding mechanisms to hold the firms accountable for the decisions that they take. further expanding on the polysemous meanings of ethics, it can be put into four categories for the discussion here: moral justice, corporate values, legal risk, and compliance. corporate values set the context for the rest of the meanings and provide guidance when tradeoffs need to be made in product/service design. they also help to shape the internal culture which can have an impact on the degree of adherence to the values. legal risk's overlap with ethics is fairly new whereas compliance is mainly concerned with the minimization of exposure to being sued and public reputation harm. using some of the framing here, the accolades, critiques, and calls to action can be structured more effectively to evoke substantive responses rather than being diffused in the energies dedicated to these efforts. framing the metaphor of "data is the new oil" in a different light, this article gives some practical tips on how organizations can reframe their thinking and relationship with customer data so that they take on the role of being a data custodian rather than owners of the personal data of their customers. this is put forth with the acknowledgement that customers' personal data is something really valuable that brings business upsides but it needs to be handled with care in the sense that the organization should act as a custodian that is taking care of the data rather than exploiting it for value without consent and the best interests of the customer at heart. privacy breaches that can compromise this data not only lead to fines under legislation like the gdpr, but also remind us that this is not just data but details of a real human being. as a first step, creating a data accountability report that documents how many times personal data was accessed by various employees and departments will serve to highlight and provide incentives for them to change behaviour when they see that some others might be achieving their job functions without the need to access as much information. 
secondly, celebrating those that can make do with minimal access will also encourage this behaviour change, all done without judgement or blame but rather as an encouragement tool. pairing employees that need to access personal data for various reasons will help to build accountability and discourage intentional misuse of data and potential accidents that can lead to leaks of personal data. lastly, an internal privacy committee composed of people across job functions and diverse life experiences, which monitors organization-wide private data use and provides guidance on improving data use through practical recommendations, is another step that will move the conversation of the organization from data entitlement to data custodianship. ultimately, this will be a market advantage that will create more trust with customers and increase the business bottom line going into the future. towards the systematic reporting of the energy and carbon footprints of machine learning. climate change and environmental destruction are well-documented. most people are aware that mitigating the risks caused by these is crucial and will be nothing less than a herculean undertaking. on the bright side, ai can be of great use in this endeavour. for example, it can help us optimize resource use, or help us visualize the devastating effects of floods caused by climate change. however, ai models can have excessively large carbon footprints. henderson et al.'s paper details how the metrics needed to calculate environmental impact are severely underreported. to highlight this, the authors randomly sampled one-hundred neurips papers. they found that none reported carbon impacts, only one reported some energy use metrics, and seventeen reported at least some metrics related to compute-use. close to half of the papers reported experiment run time and the type of hardware used. the authors suggest that the environmental impact of ai and relevant metrics are hardly reported by researchers because the necessary metrics can be difficult to collect, while subsequent calculations can be time-consuming. taking this challenge head-on, the authors make a significant contribution by performing a meta-analysis of the very few frameworks proposed to evaluate the carbon footprint of ai systems through compute- and energy-intensity. in light of this meta-analysis, the paper outlines a standardized framework called experiment-impact-tracker to measure carbon emissions. the authors use metrics to quantify compute and energy use. these include when an experiment starts and ends, cpu and gpu power draw, and information on a specific energy grid's efficiency (a back-of-the-envelope version of this arithmetic is sketched a little further below). the authors describe their motivations as threefold. first, experiment-impact-tracker is meant to spread awareness among ai researchers about how environmentally harmful ai can be. they highlight that "[w]ithout consistent and accurate accounting, many researchers will simply be unaware of the impacts their models might have and will not pursue mitigating strategies". second, the framework could help align incentives. while it is clear that lowering one's environmental impact is generally valued in society, this is not currently the case in the field of ai. experiment-impact-tracker, the authors believe, could help bridge this gap, and make energy efficiency and carbon-impact curtailment valuable objectives for researchers, along with model accuracy and complexity.
third, experiment-impact-tracker can help perform cost-benefit analyses on one's ai model by comparing electricity cost and expected revenue, or the carbon emissions saved as opposed to those produced. this can partially inform decisions on whether training a model or improving its accuracy is worth the associated costs. to help experiment-impact-tracker become widely used among researchers, the framework emphasizes usability. it aims to make it easy and quick to calculate the carbon impact of an ai model. through a short modification of one's code, experiment-impact-tracker collects information that allows it to determine the energy and compute required as well as, ultimately, the carbon impact of the ai model. experiment-impact-tracker also addresses the interpretability of the results by including a dollar amount that represents the harm caused by the project. this may be more tangible for some than emissions expressed in the weight of greenhouse gases released or even in co2-equivalent emissions (co2eq). in addition, the authors strive to: allow other ml researchers to add to experiment-impact-tracker to suit their needs, increase reproducibility in the field by making metrics collection more thorough, and make the framework robust enough to withstand internal mistakes and subsequent corrections without compromising comparability. moreover, the paper includes further initiatives and recommendations to push ai researchers to curtail their energy use and environmental impact. for one, the authors take advantage of the already widespread use of leaderboards in the ai community. while existing leaderboards are largely targeted towards model accuracy, henderson et al. instead put in place an energy efficiency leaderboard for deep reinforcement learning models. they assert that a leaderboard of this kind, that tracks performance in areas indicative of potential environmental impact, "can also help spread information about the most energy and climate-friendly combinations of hardware, software, and algorithms such that new work can be built on top of these systems instead of more energy-hungry configurations". the authors also suggest ai practitioners can take an immediate and significant step in lowering the carbon emissions of their work: run experiments on energy grids located in carbon-efficient cloud regions like quebec, the least carbon-intensive cloud region. especially when compared to very carbon-intensive cloud regions like estonia, the difference in co2eq emitted can be considerable: running an experiment in estonia produces up to thirty times as much emissions as running the same experiment in quebec. the important reduction in carbon emissions that follows from switching to energy-efficient cloud regions, according to henderson et al., means there is no need to fully forego building compute-intensive ai as some believe. in terms of systemic changes that accompany experiment-impact-tracker, the paper lists seven. the authors suggest the implementation of a "green default" for both software and hardware. this would make the default setting for researchers' tools the most environmentally-friendly one. the authors also insist on weighing costs and benefits to using compute- and energy-hungry ai models. small increases in accuracy, for instance, can come at a high environmental cost. they hope to see the ai community use efficient testing environments for their models, as well as standardized reporting of a model's carbon impact with the help of experiment-impact-tracker.
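as flagged above, here is a back-of-the-envelope sketch of the arithmetic such a tracker automates: energy is roughly average power draw times run time times a datacenter overhead factor (pue), and carbon is energy times the grid's carbon intensity. the function, the pue value, and the grid-intensity numbers below are illustrative assumptions, not figures from the paper.

```python
# rough carbon estimate for a single training run; all defaults illustrative.
def carbon_kg(avg_power_watts, hours, pue=1.6, grid_gco2_per_kwh=500.0):
    """estimate kg of co2eq from average power draw, run time, datacenter
    overhead (pue), and the grid's carbon intensity (g co2eq per kwh)."""
    energy_kwh = (avg_power_watts / 1000.0) * hours * pue
    return energy_kwh * grid_gco2_per_kwh / 1000.0

# hypothetical experiment: ~400 w of gpu+cpu power for 48 hours, compared on
# a hydro-heavy, low-carbon grid vs a high-carbon one (illustrative values).
run = {"avg_power_watts": 400, "hours": 48}
print("low-carbon grid: ", round(carbon_kg(**run, grid_gco2_per_kwh=30), 1), "kg co2eq")
print("high-carbon grid:", round(carbon_kg(**run, grid_gco2_per_kwh=900), 1), "kg co2eq")
```

the roughly thirty-fold gap between the two illustrative grids mirrors the quebec-versus-estonia comparison reported above.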
additionally, the authors ask those developing ai models to be conscious of the environmental costs of reproducing their work, and to act so as to minimize unnecessary reproduction. while being able to reproduce other researchers' work is crucial for maintaining sound scientific discourse, it is merely wasteful for two departments in the same business to build the same model from scratch. the paper also raises the possibility of a badge identifying ai research papers that show considerable effort in mitigating carbon impact when these papers are presented at conferences. lastly, the authors highlight important gaps in driver support and implementation: the systems that would allow data on energy use to be collected are unavailable for certain hardware, or the data is difficult for users to obtain. addressing these barriers would allow for more widespread collection of energy use data and contribute to making carbon impact measurement more mainstream in the ai community.

another paper summarized here highlights four challenges in designing more "intelligent" voice assistant systems that are able to respond to exploratory searches, which do not have clear, short answers and instead require nuance and detail. this is in response to the rising expectations that users have of voice assistants as they become more familiar with them through increased interaction. voice assistants are primarily used for productivity tasks like setting alarms and calling contacts, and they can include gestural and voice-activated commands as methods of interaction. exploratory search is currently not well supported by voice assistants because they take a fact-based approach that aims to deliver a single best response, whereas a more natural approach would be to ask follow-up questions that refine the user's query to the point of being able to provide a set of meaningful options. if addressed, the challenges highlighted in this paper will help the community build more capable voice assistants. the first challenge, situationally induced impairments, highlights the importance of voice-activated commands because they are used when there is no alternative way to interact with the system, for example when driving or walking down a busy street. a related challenge is balancing the tradeoff between a quick, smooth user experience and the degree of granularity used when asking follow-up questions and presenting results; this tradeoff needs to be quantified against a traditional touch-based interaction that achieves the same result. lastly, there is the issue of privacy: such interfaces are often used in public spaces, and individuals may not be comfortable speaking aloud the details needed to refine a search, such as clothing sizes, which they could discreetly type into a screen. such considerations need to be taken into account when designing the interface and system. mixed-modal interactions combine text, visual, and voice inputs and outputs. this can be an effective paradigm for countering some of the problems highlighted above while improving the efficacy of the interaction between the user and the system. further analysis is needed of how users use text compared to voice searches, and whether one is more informational or exploratory than the other. designing for diverse populations is also crucial, as such systems are going to be widely deployed.
for example, existing research already highlights how different demographics, even within the same socio-economic subgroup, use voice and text search differently. the system also needs to be sensitive to different dialects and accents to function properly, and to be responsive to cultural and contextual cues that might not be pre-built into the system. differing levels of digital and technical literacy also play a role in how effectively the system can meet the needs of the user. as expectations of these systems increase over time, owing to their ubiquity and anthropomorphization, we start to see a gulf between expectations and execution. users are less forgiving of mistakes made by the system, and this needs to be accounted for in the design so that alternative mechanisms are available for users to meet their needs. in conclusion, when designing voice-activated systems it is essential to be sensitive to user expectations, more so than with traditional forms of interaction: there, expectations are set over the course of several uses of the system, whereas with voice systems the user arrives with a set of expectations that closely mimic how people interact with each other using natural language. addressing the challenges highlighted in this paper will lead to systems that are better able to delight their users and hence gain higher adoption.

another paper highlights how having more translation capabilities available for languages of the african continent will enable people to access larger swathes of the internet and to contribute to scientific knowledge, both of which are predominantly english-based. there are many languages in africa; south africa alone has several official languages, and only a small subset are available on public tools like google translate. in addition, due to the scant research on machine translation for african languages, there remain gaps in understanding the extent of the problem and how it might be addressed most effectively. the problems facing the community are many: low resource availability; low discoverability, with language resources often constrained by institution and country; low reproducibility because of limited sharing of code and data; a lack of focus from african society on seeing local languages as primary modes of communication; and a lack of public benchmarks that can help compare the results of machine translation efforts happening in various places. the research work presented here aims to address many of these challenges. the authors also give a brief background on the linguistic characteristics of each of the languages covered, which hints at why some have been better served by commercial tools than others. in related work, it is evident that few studies have made their code and datasets public, which makes comparison with the results presented in this paper difficult. most studies focused on the autshumato datasets, some relied on government documents as well, and others used monolingual datasets as a supplement. the key takeaway from those studies is that there is a paucity of focus on southern african languages, and because all but one study did not make their datasets and code public, the listed bleu scores are incomparable, which further hinders future research efforts. the autshumato datasets are parallel, aligned corpora that have governmental text as their source. they are available for english to afrikaans, isizulu, n. sotho, setswana, and xitsonga translations, and were created to build and facilitate open-source translation systems. they contain sentence-level parallels created using both manual and automatic methods, but also many duplicates, which were eliminated in the study done in this paper to avoid leakage between the training and testing phases. despite these eliminations, some quality issues remain, especially for isizulu, where the translations do not line up between source and target sentences. from a methodological perspective, the authors used convolutional sequence-to-sequence (convs s) and transformer models without much hyperparameter tuning, since their goal was to provide a benchmark; tuning is left as future work. additional details on the libraries, hyperparameter values, and dataset processing are provided in the paper, along with a github link to the code. in general the transformer model outperformed the convolutional model for all the languages, sometimes by a sizeable margin on the bleu scores. performance on a target language depended on both the number of sentences and the morphological typology of the language. poor data quality along with small dataset size plays an important role, as evidenced by the poor performance on the isizulu translations, where only a very low bleu score was achieved. the morphological complexity of the language also played a role in the performance relative to other target languages. for each of the target languages studied, the paper includes some randomly selected sentences to show qualitative results and how languages with different structures and rules differ in the accuracy and meaning of the translations. there are also attention visualizations for the different architectures, demonstrating both correct and incorrect translations and thus shedding light on potential areas for dataset and model improvements. the paper also reports ablation studies on the byte pair encodings (bpe) to analyze their impact on the bleu scores; the authors found that for datasets with a smaller number of samples, as for isizulu, a smaller number of bpe tokens increased the bleu scores. as potential future directions for the work, the authors point out the need for more data collection and for incorporating unsupervised learning, meta-learning, and zero-shot techniques as options for providing translations for all official languages of south africa. this work provides a great starting point for others who want to help preserve languages and improve machine translation for low-resource languages. such efforts are crucial to empower everyone to access and contribute to the scientific knowledge of the world, and providing code and data in an open-source manner will enable future research to build upon it; we need more such efforts that capture the diversity of human expression through various languages.

signals such as the presence of media cards, total interactions, the user's history of engagement with the creator of a tweet, the strength of the user's connection with the creator, and the user's usage pattern of twitter all feed into the platform's relevance ranking, discussed further below. from these factors, one can deduce why filter bubbles and echo chambers form on the platform, and where designers and technologists can intervene to make the platform a more holistic experience for users, one that does not create polarizing factions which can promote the spread of disinformation.
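to make the preceding list of signals concrete, here is a purely hypothetical sketch of how such signals could be combined into a single relevance score. the feature names, weights, and linear form are invented for illustration and do not correspond to twitter's actual ranking model, which is a learned deep model rather than a fixed weighted sum.

```python
# Purely hypothetical sketch: combine engagement signals into one score.
# Feature names and weights are invented and do not reflect Twitter's model.

def relevance_score(tweet: dict, weights: dict) -> float:
    # Weighted sum of whatever normalised signals are available for the tweet.
    return sum(weights[name] * tweet.get(name, 0.0) for name in weights)

weights = {
    "recency": 0.4,             # fresher tweets score higher
    "has_media_card": 0.1,      # presence of an attached media card
    "total_interactions": 0.2,  # likes, retweets, replies (normalised)
    "past_engagement": 0.2,     # user's history with this creator
    "connection_strength": 0.1, # strength of the user-creator tie
}

example = {"recency": 0.9, "has_media_card": 1.0, "total_interactions": 0.3,
           "past_engagement": 0.8, "connection_strength": 0.6}
print(relevance_score(example, weights))
```

even in this toy form the feedback loop is visible: content the user already engages with scores higher, is shown more, and is engaged with more, which is one mechanism behind the filter bubbles discussed above.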
for the first time, there is a call for the technical community to include a social impact statement with their work. this has sparked a debate between those who argue that such a declaration should be left to experts who study ethics in machine learning, and those who see it as a positive step in bridging the gap between the social sciences and the technical domains. specifically, we see this as a great first step in bringing accountability closer to the origin of the work. additionally, it would be a great way to build a shared vernacular across the human and technical sciences, easing future collaboration. we are impressed with all the work that the research and practice community has been doing in the domain of ai ethics. there are many unsolved and open problems that are yet to be addressed, but our overwhelming optimism in what diversity and interdisciplinarity can help to achieve makes us believe that there is indeed room for finding novel solutions to the problems that face the development and deployment of ai systems. we see ourselves as a public square, gathering people from different walks of life who can have meaningful exchanges with each other to create the solutions we need for a better future. let's work together and share the mic with those who have lived experiences, lifting up voices that will help us better understand the contexts within which technology resides, so that we can truly build something that is ethical, safe, and inclusive for all. see you here next quarter. the state of ai ethics, june
in modern ai systems, we run into complex data and processing pipelines with several stages, and it becomes challenging to trace the provenance of a particular data point and the transformations that have been applied to it. this research from the facebook ai research team proposes a new technique called radioactive data, which borrows from medical science, where compounds like baso are injected to obtain better results in ct scans. the technique applies minor, imperceptible perturbations to images in a dataset, causing shifts within the feature space that turn the images into "carriers". different from other techniques that rely on poisoning the dataset and harm classifier accuracy, this technique is able to detect such changes even when the marking and classification architectures are different. it not only has the potential to trace how data points are used in an ai pipeline, but also has implications for detecting when someone claims not to be using certain images in their dataset when they actually are. another benefit is that such marking of the images is difficult to undo, adding resilience to manipulation and providing persistence.

prior to the relevance-based timeline, the twitter newsfeed was ordered in reverse chronological order, but it now uses a deep learning model underneath to display the most relevant tweets, personalized according to the user's interactions on the platform.
with the increasing use of twitter as a source of news for many people, it is valuable for researchers to gain an understanding of the methodology used to determine the relevance of tweets, especially when looking to curb the spread of disinformation online. the article provides some technical details on the deep learning infrastructure and on the choices the teams made in deploying computationally heavy models, which need to be balanced against the expediency of refresh times for a good experience on the platform. but what is interesting from an ai ethics perspective are the components used to arrive at the ranking, which constantly evolves based on the user's interaction with different kinds of content. the ranked timeline consists of a handful of the tweets that are most relevant to the user, followed by others in reverse chronological order. additionally, based on the time since one's last visit to the platform, there might be an icymi ("in case you missed it") section as well. the key factors in ranking the tweets are their recency, presence of media cards, total interactions, the user's history of engagement with the creator of the tweet, the strength of the user's connection with the creator, and the user's usage pattern of twitter.

key: cord- - n h w authors: willforss, jakob; siino, valentina; levander, fredrik title: omicloupe: facilitating biological discovery by interactive exploration of multiple omic datasets and statistical comparisons date: - - journal: biorxiv doi: . / . . . sha: doc_id: cord_uid: n h w

visual exploration of gene product behavior across multiple omic datasets can pinpoint technical limitations in data and reveal biological trends. the omicloupe software was developed to facilitate such exploration and provides numerous interactive cross-dataset visualizations for omic data. it expands visualizations to multiple datasets for quality control, statistical comparisons, and overlap and correlation analyses, while allowing for rapid inspection and downloading of selected features. the usage of omicloupe is demonstrated in three diverse studies, including an analysis of sars-cov- infection across omic layers, based on previously published proteomics and transcriptomics studies. omicloupe is available at quantitativeproteomics.org/omicloupe

omic analysis carries the potential to reveal new biological understanding and to serve as a source of biomarkers. still, omic data are challenging to work with, in part because they often contain considerable variation within and between experiments, driven by both biological and technical factors such as differing experimental conditions or sampling procedures. this variation needs to be considered to correctly interpret the data. furthermore, choices of algorithms and statistical procedures for processing the data cause additional differences in the final results [ , ] . the variation seen in the data can represent valuable biological trends, but can also be caused by nuisance factors such as batch effects [ ] or sample-to-sample technical variation. if the sources of trends in a dataset are understood, the dataset's reliability can be assessed, and robust approaches for analysis and follow-up studies can be designed. visualization is a critical tool for developing this understanding. in comparative studies, one commonly overlooked aspect is the in-depth analysis of how individual features, such as transcripts or proteins, detected in one set of samples behave in other samples, datasets, or types of omics.
quality visualizations such as principal component analysis (pca), and visualizations based on the outcome of statistical comparisons, such as volcano plots and p-value histograms, are often used to study trends within datasets. as an extension, several approaches to multiomics have been presented where the aim is to project multiple sets of data down to the same low-dimensional space, such that they can be jointly visualized and inspected [ ] [ ] [ ] . these provide useful overviews of multiomic datasets, but do not offer a detailed view of how individual features behave across multiple datasets in statistical comparisons. here, we propose an approach where single-dataset visualization approaches are expanded to allow direct comparisons across datasets. use cases are, for example, (1) biomarker studies where an initial set of candidates is to be validated, (2) time-series experiments where the global expression is inspected, for instance, at different times after infection, (3) multiomics experiments where multiple types of data are produced for the same or similar biological systems, and (4) detailed studies of comparisons between methods or software approaches. to facilitate such analyses, we here introduce the interactive software omicloupe, which leverages additions to standard visualizations to allow for exploration of features and conditions across datasets beyond simple thresholds, giving insight which otherwise might be lost. the tool aims to be easy to use, to directly interface with upstream software, and to enable exploration and export of parts of particular interest in the data. in the present work, we further demonstrate how omicloupe can be used to rapidly explore complex datasets in three different use cases.

to improve the accessibility and capability of analysis of complex datasets, we developed omicloupe. it is an interactive piece of software accessible through any web browser (quantitativeproteomics.org/omicloupe), which can either be accessed online or installed and launched locally as an r package (https://github.com/computationalproteomics/omicloupe). a singularity container for execution without any prior dependency installation, as well as video tutorials, are available to increase its accessibility. the code follows a modular design, promoting the extension of omicloupe with additional visualizations in the future. omicloupe is built as a collection of modules, each performing a certain part of the analysis (figure ). it is built to fit into upstream workflows and can handle any combination of one or two expression datasets where the data are presented as tables with samples as columns and features (genes, proteins, transcripts or other measured features) as rows (illustrated in supplementary materials s ). statistical visualizations require columns with the p-value, the false discovery rate (fdr) corrected p-value, the fold change (difference of means between the two compared groups) and the average feature level; a minimal sketch of this input layout is given below. these values are provided by upstream software such as normalyzerde [ ] or r packages such as limma [ ] for most types of omics, or deseq [ ] for rna-seq expression data. after loading the data in the web interface, the visualizations can be accessed immediately. the general analysis workflow is shown in figure . the workflow starts with the user assessing their data using the sample-wide quality visualizations, including boxplots, density plots, bar plots, dendrograms, histograms, and principal component plots.
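as a concrete illustration of the expected input described above, the sketch below builds the two tables: an expression matrix with samples as columns and features as rows, and a per-feature statistics table. the column names follow limma's conventions (logFC, AveExpr, P.Value, adj.P.Val) and are an assumption used here only for illustration.

```python
import pandas as pd

# Expression matrix: one row per feature, one column per sample.
expr = pd.DataFrame(
    {"sample_1": [10.2, 5.1, 7.8],
     "sample_2": [10.5, 4.9, 8.1],
     "sample_3": [12.0, 5.0, 6.9]},
    index=["protein_A", "protein_B", "protein_C"],
)

# Statistics table from upstream differential-expression software, one row per
# feature; column names below follow limma's conventions and are assumed here.
stats = pd.DataFrame(
    {"logFC":     [1.6, -0.1, -1.2],    # difference of group means (log2)
     "AveExpr":   [10.9, 5.0, 7.6],     # average feature level
     "P.Value":   [0.001, 0.82, 0.004],
     "adj.P.Val": [0.01, 0.90, 0.02]},  # FDR-corrected p-values
    index=expr.index,
)

# Features passing an illustrative FDR cutoff.
print(stats[stats["adj.P.Val"] < 0.05])
```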
these visualizations commonly reveal outlier samples and the presence of systematic effects in the data. further, for studies involving two datasets, omicloupe provides a side-by-side view of whether these effects are present in only one or in both of the datasets. these visualizations help the user to make decisions on how best to perform the analysis, such as outlier omissions or which statistical comparisons to perform, and to judge the reliability of the data. next, the user can screen overlaps between statistical comparisons and sample conditions by inspecting whether features pass specific statistical cutoffs (p-value, fdr, optionally in combination with fold change) in one or several statistical comparisons (i.e., specific treatments or time points) or datasets. this overlap is illustrated by venn diagrams for pairwise comparisons and upset plots [ ] for a higher number of comparisons, with the upset visualization designed to efficiently compare a high number of overlaps. further, for statistical comparisons, overlaps can be split by fold direction, giving a better sense of whether overlaps indicate a shared abundance pattern; a minimal sketch of this overlap logic is given below. a novel visualization illustrates the fraction of features that change abundance in the same direction for low and high p-values, and the fold patterns of shared features are highlighted. these illustrations jointly provide a detailed view of similarities between contrasts. for both statistical and qualitative upset plots, and for the venn comparisons, subsets can be directly inspected and exported. single features can be chosen for closer inspection in the feature check module to evaluate in detail how their abundance values are distributed over any sample condition, shown either as raw data points or by using box or violin plots. beyond the previously mentioned, further specialized visualizations are provided. an analysis approach for identifying features uniquely present in certain conditions is provided as an upset plot, which can highlight features for which the abundance is below the detection limit for certain samples. a correlation plot allows direct illustration of feature correlation patterns between data layers based on the same set of samples, for instance multiomics or alternative software processing of the same dataset. all the plots discussed above can be downloaded in png or vector format and can be customized.

to summarize, omicloupe provides a tool to rapidly assess datasets for technical trends and for in-depth studies of statistical comparisons and individual analytes.

case : effects of data processing software on differential expression analysis outcome

to assess the utility and validity of the approaches introduced in omicloupe, we started by analyzing spike-in proteomic data that have previously been explored extensively in a comparison of data processing software for data-independent acquisition (dia) lc-ms/ms data [ ] . this dataset consists of e. coli and yeast proteins spiked at two different concentrations into a human proteome background. the two mixtures were analyzed in triplicate. in the original work, the data were processed using five different software tools, allowing for a comparison of their relative performance. the software used were peakview, skyline, openswath, dia umpire, and spectronaut, where only dia umpire was used without matching to a previously generated spectral library.
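the overlap screening described in the workflow above, in which features are grouped by whether they pass thresholds in one dataset, both, or neither, optionally split by fold-change direction, can be sketched with a few lines of code. thresholds and column names below are illustrative assumptions, not omicloupe's internals.

```python
import pandas as pd

def overlap_categories(a: pd.DataFrame, b: pd.DataFrame,
                       fdr: float = 0.05, lfc: float = 1.0) -> pd.Series:
    """a, b: per-feature result tables sharing the same index, with
    limma-style 'adj.P.Val' and 'logFC' columns (names assumed here)."""
    sig_a = (a["adj.P.Val"] < fdr) & (a["logFC"].abs() > lfc)
    sig_b = (b["adj.P.Val"] < fdr) & (b["logFC"].abs() > lfc)
    same_dir = (a["logFC"] * b["logFC"]) > 0

    cat = pd.Series("not significant", index=a.index)
    cat[sig_a & ~sig_b] = "dataset A only"
    cat[~sig_a & sig_b] = "dataset B only"
    cat[sig_a & sig_b & same_dir] = "both, same direction"
    cat[sig_a & sig_b & ~same_dir] = "both, opposite direction (contra)"
    return cat

# categories = overlap_categories(results_method_a, results_method_b)
# print(categories.value_counts())  # counts behind a venn/upset-style summary
```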
this dataset was employed as an example with known ground truth, where the concentrations of proteins from the different organisms were known, allowing assessment of how well the visualizations illustrate the known underlying trends. further, it demonstrates how omicloupe can be used to assess the impact of different dia software methods for processing the same set of samples. upon inspection in omicloupe, the quality control visualizations show that the choice of software impacts the resulting absolute values, as illustrated in the density plots shown in figure a . less obvious differences are seen between the spike-in levels, although upon inspection in a dendrogram, the difference between openswath and peakview appeared smaller than their respective spike-in levels (supplementary s ). it can be noted that, for this reason, the intensity values were scaled to reach similar levels in subsequent analyses in the original study. qualitative inspection was performed to identify proteins only detected by certain software processing methods. here, the majority of proteins ( ) were detected by all five methods, and proteins were detected by all methods except dia umpire. conversely, dia umpire identified proteins that were not detected by any other method (highlighted in blue, figure b ), although peakview identified a higher number of proteins uniquely ( , figure b ). upon statistically comparing the abundances between the spike-in levels, a considerable number of proteins were also uniquely identified as differentially expressed by peakview ( spike-ins). eleven proteins were found to be significant but with opposing direction of change when comparing peakview and spectronaut output (highlighted in blue, figure c ). out of these, all eleven were yeast proteins, correctly identified as downregulated by peakview. to further elucidate the underlying differences in the processed data from dia umpire and peakview, a closer inspection of the statistical distributions was made in omicloupe. out of the six statistical figures, the volcano plots are illustrated in figure (all six can be found in the supplementary material s ). inspection of features passing the thresholds fdr < . and fold change (log ) > (figure a) shows a considerable number of common statistical features with the same abundance change directions (blue), but it is notable that a larger number of features are identified only after peakview processing. these are distributed across all significance levels and folds, with no evident trend towards a higher concentration of lower abundance values. a handful of features were found to be changing in opposite directions between the groups (green) and were compared to the expected regulation pattern. here, it was found that all features with a positive log fold in both methods were true positives, while out of the two with negative fold change, one was a false positive. this indicates how omicloupe can be used to qualitatively assess the reliability of features across comparisons by using fold change. interestingly, sets of features found to be changing in one group cluster around zero fold change in the other method, indicating a differing ability of the software to handle these features. further inspection of the ground truth (figure b) illustrates the respective types of spike-in. for peakview, there appear to be, in particular for yeast, a considerable number of false negatives, while for dia umpire these are less common.
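the ground-truth bookkeeping behind comparisons like the one above can be sketched as follows: given one method's result table and the organism label of each protein, count true and false positives and negatives under the assumption that only the spiked-in (non-human) proteins are expected to change. thresholds and column names are again illustrative, not the study's exact settings.

```python
import pandas as pd

def score_against_spike_in(results: pd.DataFrame, organism: pd.Series,
                           fdr: float = 0.05, lfc: float = 1.0) -> dict:
    """Score one method's calls against the spike-in design: proteins from the
    spiked organisms (yeast, e. coli) are expected to change; the human
    background is not. Column names and thresholds are illustrative."""
    called = (results["adj.P.Val"] < fdr) & (results["logFC"].abs() > lfc)
    expected = organism.loc[results.index] != "human"
    return {
        "true_positives":  int((called & expected).sum()),
        "false_positives": int((called & ~expected).sum()),
        "false_negatives": int((~called & expected).sum()),
        "true_negatives":  int((~called & ~expected).sum()),
    }

# for name, res in {"method_a": results_a, "method_b": results_b}.items():
#     print(name, score_against_spike_in(res, organism_labels))
```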
to illustrate the joint use of the two methods, a subset of features identified after processing by both methods was inspected (figure c). here, a set of features only identified as differentially abundant after peakview processing (yellow in figure a) was highlighted, and their distribution after dia umpire processing was inspected. one exception was significant in both cases, seen as the green point with the lowest p-value (highest along the y-axis). interestingly, the features represented a mix of true positives (e. coli) and false positives (human). the true positives were found with greater fold change in dia umpire (the two rightmost green points in figure c, dia umpire panel). from these observations, we conclude that omicloupe allows for fine-grained analysis of differences resulting from data processing using different software and allows careful inspection of specific data points across multiple datasets.

multiomics studies of the same biological samples are becoming increasingly frequent, but how to integrate the data types and find important features remains challenging. we thus investigated how omicloupe can be used for direct comparisons of different data types taken from the same set of samples, to reveal features only detected in certain conditions, and to find common patterns of observed abundance level changes. for this purpose, a comprehensive multiomics dataset from endometrial cancer samples was downloaded [ ]. multiple types of data, including proteomics and rna-seq, were acquired for the samples in the original study, and the features had been mapped to common gene identifiers. the samples are classified into different histological types, including copy number variation (cnv) high, which includes serous samples and cancers penetrating at least % of the endometrial wall, and cnv low, consisting of samples penetrating less than % of the endometrial wall. here, the statistical analysis focused on the comparison between these groups. a first view using the pca module revealed a primary separation between most normal samples and tumor samples (figure a). this separation was similar in both the proteomics and rna-seq datasets, with few noticeable differences. further, a partial separation between cnv high and cnv low can be seen along the second principal component. pca analysis without the control samples group was also performed using the function available in omicloupe (supplementary s ). to study the similarity of the statistical comparisons across the two data types, features with positive abundance change and low p-values were highlighted in the rna-seq contrast (by dragging directly in the figure) between cnv high and cnv low, to see how these distribute in the corresponding contrast in the proteomics dataset (figure b). the majority of features were also upregulated in the proteomics data, with three exceptions, namely folr , steap b and tbl d . genes of interest, including tp , which was discussed in the original study, were inspected using the feature check (figure c), which showed a reversed pattern in transcriptomics and proteomics. finally, the correlation between transcriptomics and proteomics was studied. pearson and spearman correlations are illustrated in figure d and showed similar median values ( . pearson) to those presented in the original study, with a small number of inversely correlated features.
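the per-feature correlation analysis described above can be approximated outside the tool as well; the sketch below computes pearson and spearman correlations between rna and protein levels for each shared gene across shared samples, using scipy. it is a simplified stand-in for the correlation module, not its implementation.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def rna_protein_correlation(rna: pd.DataFrame, prot: pd.DataFrame) -> pd.DataFrame:
    """rna, prot: feature-by-sample tables indexed by a shared gene identifier."""
    genes = rna.index.intersection(prot.index)
    samples = rna.columns.intersection(prot.columns)
    rows = []
    for g in genes:
        x = rna.loc[g, samples].astype(float)
        y = prot.loc[g, samples].astype(float)
        mask = ~(x.isna() | y.isna())
        if mask.sum() < 3:          # not enough paired observations
            continue
        rows.append({"gene": g,
                     "pearson": pearsonr(x[mask], y[mask])[0],
                     "spearman": spearmanr(x[mask], y[mask])[0]})
    return pd.DataFrame(rows).set_index("gene")

# corr = rna_protein_correlation(rnaseq_matrix, proteomics_matrix)
# print(corr.median())   # compare with the median reported in the original study
```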
this demonstrates how omicloupe can be used to inspect similarities and differences between layers of omic data generated from the same set of samples, providing an improved understanding of both the general expression profiles and individual gene products.

for the proteomics study, the initial quality control revealed two aspects of the data influencing the subsequent analysis. quality control visualizations using pca and dendrograms revealed clustering of samples according to a sample name-based categorization, thought to be the plating numbers of the cell lines (figure a). to compensate for this effect, this number was included as a covariate in the statistical tests, and the impact of including it was investigated in omicloupe. the inclusion of this covariate led to a considerably higher number of features detected as significantly different at the thresholds employed (fdr < . and log fold change > . ), as exemplified for the comparison of infected samples at hours versus infected samples at hours (figure b). next, omicloupe was used to study control and infected samples independently (as illustrated in supplementary s ). here, a clear pattern was seen in the infected samples, with the hours infected samples separating out along pc , while the hours samples showed a weaker separation. in the control samples, the trend was less clear, and the control hours samples appeared as weak outliers. in order to study the potential impact of these group comparisons, control and infected samples at and hours were compared, as depicted in figure c . a strong effect of decreasing abundance is seen in the control samples at hours, while in the infected samples at hours the trend is smaller, with known viral proteins being clearly upregulated. this unexpected distribution of the hours control samples led us to focus on comparisons between infected samples, and on the hours infected versus control comparison. to study the viral distribution between the infected conditions, we highlighted proteins with known annotation related to either the virus or virus receptors (according to uniprot, https://covid- .uniprot.org, downloaded th of july ) in comparisons of , and hours infected samples against hours infected samples (figure a). furthermore, to study the potential of omicloupe, the results from the proteomic study were compared to the transcriptomic dataset. here, the two datasets can easily be uploaded and compared based on their time points. to make a similar comparison in the two datasets, we decided to compare the proteomic and the transcriptomic data at hours, despite one outlier sample being identified in the transcriptomics data at this time point (the pca plot for the transcriptomics data illustrating this outlier is shown in supplementary s ). the distribution of the significant genes in the proteomic study in comparison with the transcriptomic study (expansion media) is depicted in the volcano plot in figure b . of particular interest are the significant genes that are shared between the datasets at hours after infection. at the set threshold (fdr < . and log fold change > . ), differentially expressed genes are shared between the proteomic and the transcriptomic data after hours of infection in the differentiation media. for the expansion media, genes are shared as significant between the proteomic and transcriptomic datasets after hours of viral infection. the overlap between the ids in both datasets is displayed in the upset plot in figure c .
interestingly, genes were found to overlap between the proteomic dataset and the transcriptomic study (differentiation and expansion media). as an example, one of those shared genes, cd , is depicted in figure d . cd is a leukocyte surface antigen, which has been shown to be upregulated after viral infection, including sars-cov- infection, as a host response to the infection [ ] . these overlapping groups were further analyzed in string [ ] to investigate the relevant pathways connected to the significant genes identified (a minimal sketch of such an over-representation test is given below). of biological interest is that one of the main regulated reactome [ ] pathways is neutrophil degranulation, which many studies have reported as a key biological process during sars-cov- infection [ ] [ ] [ ] . in summary, in this third case study, omicloupe was used to perform a parallel analysis of two datasets from different types of omics (proteomics and transcriptomics) to investigate the response to infection over time. both of these datasets were obtained from published studies. with straightforward visualizations, we demonstrate the feasibility of using this tool to easily identify significantly changing gene products common to both datasets, which can be used for further analysis, such as go enrichment and pathway analysis.

visualization is an important tool for fully exploring omics datasets and for highlighting features that can be difficult to assess with numbers alone. consequently, many new software solutions for omic data visualization have been presented over the past few years. these include a range of user-friendly stand-alone software for omics visualization such as perseus [ ] for proteomics, or shiny-based software such as shinyomics [ ] , which provides a flexible quality-oriented interface to omic data, and wilson [ ] , which provides high-quality interactive figures based on an open file format but only limited abilities to compare features. intervene [ ] is software focused on comparisons, aiming to provide various types of overlap information, but only based on fold change information and without allowing for feature-by-feature examination. furthermore, software dedicated to incorporating multiple layers of omics, such as mixomics [ ] , has extensive multiomics integration capabilities, but does so on a sample-wide scale rather than focusing on the behavior of single features. despite these notable examples, and several more new visualization software packages now being available, we have failed to find software performing several of the functionalities for multiple comparisons across datasets provided by omicloupe. key features in omicloupe, like the side-by-side data distribution comparison volcano and ma plots, with labeling of key features across the comparisons and datasets, as well as the ability to rapidly switch to individual feature views across samples, enable a deeper understanding of the individual features in the data. the implementation of upset plots with optional splitting based on the direction of abundance change can rapidly help in determining reproducibility across datasets. while standard statistical comparison using strict thresholds is in many cases the default option, underlying trends can be found in plots such as upset with less strict thresholds when the data are lacking power. here, we explored three diverse datasets to highlight different aspects of omicloupe's functionality.
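the string/reactome pathway analysis mentioned above for the shared genes is, at its core, an over-representation test. a minimal approximation is a one-sided hypergeometric test per pathway, sketched below with placeholder gene sets; this is an illustration of the principle, not the string or reactome methodology.

```python
from scipy.stats import hypergeom

def pathway_enrichment_p(hits: set, pathway: set, background: set) -> float:
    """One-sided hypergeometric p-value for over-representation of `pathway`
    genes among `hits`, relative to all measured genes (`background`)."""
    M = len(background)            # measured genes
    n = len(pathway & background)  # pathway genes among those measured
    N = len(hits)                  # significant (shared) genes
    k = len(hits & pathway)        # pathway genes among the hits
    # P(X >= k) under sampling without replacement
    return hypergeom.sf(k - 1, M, n, N)

# Placeholder gene sets for illustration only.
background = {f"gene_{i}" for i in range(2000)}
pathway = {f"gene_{i}" for i in range(0, 60)}    # e.g. one candidate pathway
hits = {f"gene_{i}" for i in range(0, 25)} | {"gene_900"}
print(pathway_enrichment_p(hits, pathway, background))
```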
by comparing the impact of different proteomics software processing methods, we could study in detail the differences in outcome between the methods and identify specific features handled correctly by only one or some of the methods. next, multiomics exploration with both transcriptomic and proteomic data obtained from the same samples gave the opportunity to explore features across omic layers. here, we identified an overall similarity of trends across the omic data layers and rapidly illustrated the correlations of transcripts and proteins. further, we visualized key features in detail, including tp , a key protein discussed in the original study, and detected differences at the transcript and protein level. this demonstrates how omicloupe can confirm and extend knowledge for existing data. finally, we used two separate sars-cov- studies profiling intestine cells during infection. omicloupe was used to identify and navigate technical limitations, including a batch effect and a seemingly lower reliability of one set of control samples. the overall regulation patterns were relatively different, as expected due to the different types of samples, but subsets of features with joint abundance changes were still identified. these were downloaded and enriched, revealing biological trends in line with what had been observed in prior studies. the cases presented in this manuscript are common examples of challenges encountered when analyzing omics data. beyond these, the omicloupe software has the potential to be used in a wide range of scenarios, to better understand both single- and multiple-omics datasets. to this end, we believe usability is of critical importance for this kind of software, and omicloupe has a straightforward interface, with user help text complemented by video tutorials at the website. having these at hand may make the difference in how extensively the data can be explored, and thus how well they can be understood. we therefore encourage users to test the software, provide feedback about its functionality, and comment on possible useful new extensions. here, we have presented omicloupe, which both introduces novel approaches for comparative visualization across datasets and presents these in interactive, easy-to-use software. we have demonstrated its utility on three diverse datasets, starting with a technical dataset to demonstrate how omicloupe can be used for comparing processing methods and how the cross-comparison of fold changes can provide important information. secondly, we explored a multi-omic cancer dataset, illustrating how same-sample cross-omics comparisons can be readily illustrated. finally, to demonstrate its versatility, we reanalyzed two recently published sars-cov- datasets, performed comparative explorations of these datasets, and rapidly identified proteins and rna transcripts showing the same abundance change trends across both studies. based on these results and usage on other datasets, we propose that omicloupe can be a versatile tool in many expression omics-based analyses, for both novice and expert users. we provide it for usage by the community, as an r package and as an online server.

omicloupe is implemented using r (v . . ) and rshiny (v . . . ), using the packages plotly (v . . . ) and dt (v . ) for interactive visualizations, ggplot (v . . ), ggally (v . . ) and upsetr (v . . , [ ] ) for data visualization, and dplyr (v . . ) for data processing. the code is developed in modules to facilitate reusability.
further, a singularity container [ ] was prepared, allowing immediate local execution without the need to install the r package dependencies. omicloupe was evaluated using three datasets covering different use cases. an r notebook containing the code used for preprocessing the datasets, together with an html document with the code output, is outlined in supplementary s and accessible at the doi: https://doi.org/ . /zenodo. .

the first was a technical dataset in which proteins from human, e. coli and yeast had been spiked in at controlled concentrations [ ] and subsequently analyzed using five different dia methods. the data were downloaded from proteomexchange [ ] at the id pxd , selecting the data generated on the tripletof instrument with fixed-size windows for all five methods. the hye dataset was used, with a spike-in difference of log fold . the raw data matrices were preprocessed both into five separate data matrices and into a single merged matrix consisting of all joint protein entries. they were subsequently log -transformed and rolled up to protein level using an r reimplementation (github.com/computationalproteomics/proteinrollup) of the danter rrollup [ ] , using default settings and excluding proteins supported by a single peptide. statistical contrasts between the two concentration levels were subsequently calculated using limma [ ] as provided by normalyzerde [ ] , and the resulting p-values were adjusted for multiple hypothesis testing using the benjamini-hochberg procedure [ ] (a minimal sketch of this correction is given below).

the second was a multiomics dataset from a study investigating prospectively collected endometrial carcinoma tumors [ ] , divided into four histological groups, including copy number (cnv)-high (serous cancer, a rare aggressive variant, and cancers with more than % penetrance of the endometrial wall) and cnv-low (less than % penetrance of the endometrial wall). the data matrices and meta information were downloaded from the supplementary information of the original study. samples omitted from the original study were similarly omitted here, as specified in the matrix obtained from the original supplementary information. further, upon inspection with omicloupe, the set of normal samples was identified as a strong outgroup and omitted to avoid influencing the statistical procedure. the original dataset contains multiple layers of omic data, of which the following two were used in the present study: proteomics and mrna levels. for the proteomics matrix, statistical contrasts were calculated using limma [ ] in normalyzerde, and benjamini-hochberg [ ] corrected fdr values were calculated. for the transcriptomics matrices, the data were provided as rsem estimated counts; they were first transformed with voom with quality weights [ ] and subsequently processed using limma [ ] . for both data types, statistical contrasts were calculated between samples classified as cnv-high and cnv-low.

the first dataset analyzed in the third case study is a recently published sars-cov- proteomic dataset [ ] , where a human colon epithelial carcinoma cell line (caco- ) was infected by sars-cov- and proteomic analyses were performed on samples at four time points ( , , , hours after infection), both for infected samples and for samples treated with a mock infection. the proteomic data and metadata were generously provided by the authors in the supplementary materials of the study. zero values were replaced with na, and the protein abundance values were log transformed.
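the benjamini-hochberg adjustment referenced throughout these preprocessing steps is simple enough to sketch directly. the implementation below is a minimal stand-in for the corrected p-values that limma and normalyzerde already report, included only to make the procedure concrete.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Return BH/FDR-adjusted p-values (monotone step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                       # ascending p-values
    ranked = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # enforce monotonicity from the largest rank downwards, cap at 1
    adj = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1.0)
    out = np.empty(m)
    out[order] = adj
    return out

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.60]))
```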
statistical contrasts were calculated using limma [ ] in normalyzerde [ ] , and the resulting p-values were fdr-corrected using the benjamini-hochberg procedure [ ] . initially, statistical comparisons were made between infected and control samples at each of the four time points ( , , and hours after infection). after initial explorations in omicloupe, a batch effect was identified, which was subsequently included as a covariate in the statistical test, as described in the results section. the second dataset was from human intestinal organoids infected with sars-cov- in both differentiation and expansion media and analyzed at two time points after infection ( and hours).

figure legend: omicloupe is designed to easily interface with the upstream data generation process and to work on any expression data matrix. it provides the ability to explore up to two datasets, and provides comparative views between statistical contrasts performed either within one dataset or across multiple. it is organized in modules, allowing the user to shift rapidly from a sample-wide view, to inspecting individual statistical comparisons and overlaps between multiple comparisons, to understanding single features. (adapted from schematics shown at http://quantitativeproteomics.org/omicloupe).

figure legend: upset plots of the features that were found to be significantly differentially abundant (fdr < . , log fold change > ) when data had been processed using the different software. in c, features that change upwards or downwards in the comparison are displayed separately to visualize contradictory abundance changes due to differential processing. eleven proteins that were deemed significantly changing, but with opposing direction of change after processing in peakview and spectronaut, are highlighted in blue.

figure legend: these panels show part of the statistical interface in omicloupe. panel a) shows features passing the significance threshold fdr < . and log fold > . in individual datasets, and in both. green points ("contra" in the legend) pass the significance threshold in both datasets, but with reversed log fold direction. panel b) shows coloring based on the spike-in source. panel c) shows the outcome of interactively highlighting a set of five features only significant in peakview and one significant in both. this reveals their distribution in dia umpire, showing that the features upregulated in both are true positives, while one of the two found in lower abundance in dia umpire is a false hit.

figure legend: a) principal component illustration present in the quality module, comparing proteomics and transcriptomics. as can be seen, the major trends are similar between the two data types. b) distribution of how high-significance features upregulated in rna-seq distribute in the proteomics dataset. c) boxplots of tp , identified in the dataset, across the four studied sample classifications using the feature check module. d) correlation distributions between the rna-seq and proteomics features using the correlation module.

figure legend: a) inspection revealed a separation of samples along the second principal component, likely related to a plating effect. this was compensated for in subsequent statistical tests by including it as a covariate. b) the impact of performing differential expression analysis without and with the putative plating number included as a covariate. the inclusion of the covariate yielded new statistical features while losing six as compared to not including the covariate.
c) comparison of control samples h and h shows many features with a decrease in abundance, indicating that the mock treatment might influence the data. comparison between infected samples at h and h show more limited differences, with seven detected viral proteins among those with increased abundance at hours indicated in red circles (log fold change > . ), out of which two passed the fdr threshold.

figure legend: statistical analysis performed using the following settings: fdr < . and log fold change > . . for both datasets, data from hours after viral infection are used. a) inspection of infected samples, hours and hours compared to hours, colored by proteins known as virus proteins and virus receptors, revealed a clear upregulation among virus proteins. b) direct comparison of infected and control at hours after infection in proteomics, and comparison between the proteomic dataset and the transcriptomic dataset (expansion media). the coloring is based on in which dataset the gene products pass the significance threshold. green points ("contra" in the legend) are passing the significance threshold in both datasets, but with reversed log fold direction. this comparison revealed a set of shared proteins, both changing abundance in the same or opposite direction. c) illustration of the shared significant genes between proteomic and transcriptomic (differentiation and expansion media) datasets. d) cd distribution at different time points in the proteomic study and in the transcriptomic data.

the data was retrieved from the ncbi gene expression omnibus (geo) database, from the accession number gse . the data were tmm normalized, comparing infected samples at and hours after infection to the uninfected reference, and p-values were fdr-corrected using the benjamini-hochberg procedure.

funding: jw and fl were supported by the swedish foundation for environmental strategic research (mistra biotech). vs and fl were supported by olle engkvist byggmästare ( - ) and the technical faculty at lund university (proteoforms@lu).

references:
a multicenter study benchmarks software tools for label-free proteome quantification
a comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation
why batch effects matter in omics data, and how to avoid them
mint: a multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms
diablo: an integrative approach for identifying key molecular drivers from multi-omics assays
sparse pls discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems
online tool for improved normalization of omics expression data and high-sensitivity differential expression analysis
limma powers differential expression analyses for rna-sequencing and microarray studies
moderated estimation of fold change and dispersion for rna-seq data with deseq
upsetr: an r package for the visualization of intersecting sets and their properties
proteogenomic characterization of endometrial carcinoma
sars-cov- productively infects human gut enterocytes. science ( - )
upregulation of cd is a host checkpoint response to pathogen recognition
string -a global view on proteins and their functional interactions in organisms
the reactome pathway knowledgebase
clinical features of patients infected with novel coronavirus in wuhan
crucial, or harmful immune cells involved in coronavirus infection: a
neutrophil extracellular traps in covid-
the perseus computational platform for comprehensive analysis of (prote)omics data
collaborative exploration of omics-data
web-based interactive omics visualization
intervene: a tool for intersection and visualization of multiple gene or genomic region sets
lê cao ka. mixomics: an r package for 'omics feature selection and multiple data integration
singularity: scientific containers for mobility of compute
proteomexchange provides globally coordinated proteomics data submission and dissemination
a statistical tool for quantitative analysis of -omics data
controlling the false discovery rate: a practical and powerful approach to multiple testing
voom: precision weights unlock linear model analysis tools for rna-seq read counts
sars-cov- infected host cell proteomics reveal potential therapy targets

we want to thank victor lindh whose master's thesis project allowed us to explore ideas which later were further developed here. we also thank omicloupe users who have tested and provided valuable feedback. finally, we would like to thank the authors behind the four datasets used in the three case studies who generously provided their data for further use in an accessible way. the authors declare that they have no competing interests. the study was conceived by jw and fl. the software development was carried out by jw with input about functionality from vs and fl. data analysis was performed by jw and vs. the manuscript was drafted by jw and was expanded, edited and approved by all authors.

key: cord- -klpmipaj authors: zachreson, cameron; mitchell, lewis; lydeamore, michael; rebuli, nicolas; tomko, martin; geard, nicholas title: risk mapping for covid- outbreaks using mobility data date: - - journal: nan doi: nan sha: doc_id: cord_uid: klpmipaj

covid- is highly transmissible and containing outbreaks requires a rapid and effective response. because infection may be spread by people who are pre-symptomatic or asymptomatic, substantial undetected transmission is likely to occur before clinical cases are diagnosed. thus, when outbreaks occur there is a need to anticipate which populations and locations are at heightened risk of exposure. in this work, we evaluate the utility of aggregate human mobility data for estimating the geographic distribution of transmission risk. we present a simple procedure for producing spatial transmission risk assessments from near-real-time population mobility data. we validate our estimates against three well-documented covid- outbreak scenarios in australia. two of these were well-defined transmission clusters and one was a community transmission scenario. our results indicate that mobility data can be a good predictor of geographic patterns of exposure risk from transmission centres, particularly in scenarios involving workplaces or other environments associated with habitual travel patterns. for community transmission scenarios, our results demonstrate that mobility data adds the most value to risk predictions when case counts are low and spatially clustered.
our method could assist health systems in the allocation of testing resources, and potentially guide the implementation of geographically-targeted restrictions on movement and social interaction. similar to other respiratory pathogens such as influenza, the transmission of sars-cov- occurs when infected and susceptible individuals are co-located and have physical contact, or exchange bioaerosols or droplets [ , ] . behavioural modification in response to symptom onset (i.e., self-isolation) can act as a spontaneous negative feedback on transmission potential by reducing the rate of such contacts, making epidemics much easier to control and monitor. however, covid- (the disease caused by sars-cov- virus) has been associated with relatively long periods of pre-symptomatic viral shedding (approximately - days), during which time case ascertainment and behavioural modification are unlikely [ , ] . in addition, many cases are characterised by mild symptoms, despite long periods of viral shedding [ ] . transmission studies have demonstrated that asymptomatic and pre-symptomatic transmission hamper control of sars-cov- [ ] [ ] [ ] . pre-symptomatic and asymptomatic transmission has also been documented systematically in several residential care facilities in which surveillance was essentially complete [ , ] . currently, there are no prophylactic pharmaceutical interventions that are effective against sars-cov- transmission. therefore, interventions based on social distancing and infection control practices have constituted the operative framework, applied in innumerable variations around the world, to combat the covid- pandemic. social distancing policies directly target human mobility. therefore, it is logical to suggest that data describing aggregate travel patterns would be useful in quantifying the complex effects of policy announcements and decisions [ ] . the ubiquity of mobile phones and public availability of aggregated near-real-time movement patterns has led to several such studies in the context of the ongoing covid- pandemic [ ] [ ] [ ] . one source of mobility data is the social media platform facebook, which offers users a mobile app that includes location services at the user's discretion. these services document the gps locations of users, which are aggregated as origindestination matrices and released for research purposes through the facebook data for good program. the raw data is stored on a temporary basis and aggregated in such a way as to protect the privacy of individual users [ ] . several studies have utilised subsets of this data for analysis of the effects of covid- social distancing restrictions [ ] [ ] [ ] [ ] in this work, we complement these studies by addressing the question: to what degree can realtime mobility patterns estimated from aggregate mobile phone data inform short-term predictions of covid- transmission risk? to do so, we develop a straight-forward procedure to generate a relative estimate of the spatial distribution of future transmission risk based on current case data or locations of known transmission centres. to critically evaluate the performance of our procedure, we retrospectively generate risk estimates based on data from three outbreaks that occurred in australia when there was little background transmission. the initial wave of infections in australia began in early march, , and peaked on march th with new cases. 
the epidemic was suppressed through widespread social distancing measures which escalated from bans on gatherings of more than people (imposed on march th) to a nation-wide "lockdown" which began on march th and imposed a ban on gatherings of more than people. by late april, daily incidence numbers had dropped to fewer than per day [ ] . the outbreaks we examine occurred during the subsequent period over which these general suppression measures were progressively relaxed. one of these occurred in a workplace over several weeks, one began during a gathering at a social venue, and one was a community transmission scenario with no single identified outbreak center, which marked the beginning of australia's "second wave" (which is ongoing as of august, ). the term "community transmission" refers to situations in which multiple transmission chains have been detected with no known links identified from contact tracing and no specific transmission centres are clearly identifiable. in each case, we use the facebook mobility data that was available during the early stages of the outbreak to estimate future spatial patterns of relative transmission risk. we then examine the degree to which these estimates correlate with the subsequently observed case data in those regions. our results indicate that the accuracy of our estimates varies with outbreak context, with higher correlation for the outbreak centred on a workplace, and lower correlation for the outbreak centred on a social gathering. in the community transmission scenario without a well-defined transmission locus, we compare the risk prediction based on mobility data to a null prediction based only on active case numbers. our results indicate that mobility is more informative during the initial phases of the outbreak, when detected cases are spatially localised and many areas have no available case data. our general method is to use an origin-destination (od) matrix based on facebook mobility data to estimate the diffusion of transmission risk based on one or more identified outbreak sources. the data provided by facebook comprises the number of individuals moving between locations occupied in subsequent -hr intervals. for an individual user, the location occupied is defined as the most frequently-visited location during the -hr interval. more details on the raw data, the aggregation and pre-processing performed by facebook before release, and our pre-processing steps can be found in the supplemental information. covid- case data is made publicly available by most australian state health authorities on the scale of local government areas (lgas). in these urban and suburban regions, lga population densities typically vary from approximately . × to × residents per km , but can be low as residents per km in the suburban fringe where lgas contain substantial parkland and agricultural zones. the output of our method is a relative risk estimate for each lga based on their potential for local transmission. the general method is as follows: . construct the prevalence vector p, a column vector with one element for each location with a value corresponding to the transmission centre status of that location. for pointoutbreaks in areas with no background transmission, we use a vector with a value of for the location containing the transmission centre and for all other locations. for outbreaks with transmission in multiple locations, we construct p using the number of active cases as reported by the relevant public health agency. . 
construct an od matrix m, where the value of a component m_ij gives the number of travellers starting their journey at location i (row index) and ending their journey at location j (column index). to approximately match the pre-symptomatic period of covid-19, we average the od matrix over the mobility data provided by facebook during the week preceding the identification of the targeted transmission centre. by averaging over an appropriate time interval, the od matrix is built to represent mobility during the initial stages of the outbreak, when undocumented transmission may have been occurring. the choice of appropriate time interval varied by scenario, as described below. . multiply the od matrix by the prevalence vector to produce an unscaled risk vector r, with a value for each location corresponding to the aggregate strength of its outgoing connections to transmission centres, weighted by the prevalence in each transmission centre. this is re-scaled to give the relative transmission risk R_i for each region. in other words, we treat the od matrix as analogous to the stochastic transition matrix in a discrete-time markov chain, and compute the unscaled vector of risk values r as
\[ \mathbf{r} = M\,\mathbf{p}, \qquad r_i = \sum_j M_{ij}\, p_j , \]
so that r_i is approximately proportional to the average interaction rate between susceptible individuals from location i and infected individuals located in the outbreak centres. these approximate interaction rates are then re-scaled to give relative risk values R_i between 0 and 1:
\[ R_i = \frac{r_i}{\sum_j r_j} . \]
for point-outbreaks, this is simply
\[ R_i = \frac{M_{ik}}{\sum_j M_{jk}} , \]
where k is the column index of the single outbreak location. the numerator is the number of individuals travelling from region i to the outbreak centre, and the denominator is the total number of travellers into the outbreak centre over all origin locations j. in addition to the typical assumptions about equilibrium mixing (in the absence of more detailed interaction data), this interpretation is subject to the assumption that the strength of transmission in each centre is proportional to the number of active cases in that location. this assumption is consistent with the observation that the majority of individuals start and end their journeys in the same locations, but the data are not sufficient to unequivocally determine the relationship between transmission risk within an area and active case numbers in the resident population of that area. therefore, it is appropriate to think of our method as a heuristic approach to estimating transmission risk based only on qualitative information about epidemiological factors and informed by near-real-time estimates of mobility patterns. these are derived from a biased sample of the population (a subset of facebook users), and aggregated to represent movement between regions containing large resident populations. outbreaks occur in different contexts, some of which may suggest the use of external data sources to infer at-risk sub-populations. such inference can be used to refine spatial risk prediction. for example, the workplace outbreak we investigated occurred in a meat processing facility, where the virus spread among workers at the plant and their contacts. to adapt the general method to this context, we averaged od matrices over the subset of our data capturing the transition between nighttime and daytime locations, as an estimate of work-related travel. in addition, we examined the effect of including industry of employment statistics as an additional risk factor.
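before turning to the context-specific weighting, the general procedure above can be sketched in python: the od matrix is averaged over the chosen window, multiplied by the prevalence vector, and normalised to relative risks. the column names (`origin`, `destination`, `n_trips`) and the helper for averaging daily tables are illustrative assumptions, not the authors' released code.

```python
import numpy as np
import pandas as pd

def average_od_matrix(daily_frames, locations):
    """average a list of daily origin-destination tables into one od matrix.

    each frame is assumed to have columns: origin, destination, n_trips
    (these column names are illustrative, not the facebook schema)."""
    idx = {loc: i for i, loc in enumerate(locations)}
    m = np.zeros((len(locations), len(locations)))
    for frame in daily_frames:
        for _, row in frame.iterrows():
            m[idx[row["origin"]], idx[row["destination"]]] += row["n_trips"]
    return m / max(len(daily_frames), 1)

def relative_risk(od_matrix, prevalence):
    """unscaled risk r = M p, re-scaled so the values sum to one."""
    r = od_matrix @ prevalence
    total = r.sum()
    return r / total if total > 0 else r

# point-outbreak example: a single transmission centre at index 1
locations = ["lga_a", "lga_b", "lga_c"]
od = np.array([[50.0, 10.0, 2.0],
               [8.0, 40.0, 5.0],
               [1.0, 6.0, 30.0]])
p = np.zeros(len(locations)); p[1] = 1.0        # outbreak centre is lga_b
print(dict(zip(locations, np.round(relative_risk(od, p), 3))))
```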
in this case, we used data collected by the australian bureau of statistics (abs) to estimate the proportion of meat workers by residence in each lga, and weighted the outgoing traveller numbers by the proportion associated with the place of origin. the resulting relative risk value r i is a crude estimate of the probability that an individual: • travelled from origin location i into the region containing the outbreak centre; • travelled during the period when many cases were pre-symptomatic and no targeted intervention measures had been applied; • made their trip(s) during the time of day associated with travel to work and; • were part of the specific subgroup associated with the outbreak centre (in this case, those employed in meat-processing occupations). the variation described above is specific for workplace outbreaks in which employees are infected, but could be generally applied to any context where a defined subgroup of the population is more likely to be associated (e.g., school children, aged-care workers, etc.), or in which habitual travel patterns associated with particular times of day are applicable. for each of the three outbreak scenarios, we present the mobility-based estimates of the relative transmission risk distribution, and a time-varying correlation between our estimate and the case numbers ascertained through contact tracing and testing programs. for details of these correlation computations, see the supplementary information. cedar meats is an abattoir (slaughterhouse and meat packing facility) in brimbank, victoria. it is located in the western area of melbourne. it was the locus of one of the first sizeable outbreaks in australia after the initial wave of infections had been suppressed through widespread physical distancing interventions. meat processing facilities are particularly high-risk work environments for transmission of sars-cov- , so it is perhaps unsurprising that the first large outbreak occurred in this environment [ , ] . it began at a time when community transmission in the region was otherwise undetected. as the transmission cluster grew, it was thoroughly traced and subsequently controlled. the contact-tracing effort included (but was not limited to) intensive testing of staff, each of which required a negative test before returning to work, -day isolation periods for all exposed individuals, and daily follow-up calls with every close contact. the outbreak was officially recognised on april th, when four cases were confirmed in workers at the site and, according to media reports, victoria dhhs informed the meatworks of these findings [ ] . we also explored the effect of weighting mobility by a context-specific factor: the proportion of employed persons with occupations in meat processing ( figure b ). the geographic distribution of relative transmission risk due to mobility into brimbank during the nighttime → daytime transition is presented in figure (a), while the distribution generated by including both mobility and the proportion of meat workers in each lga is shown in figure (b). to validate our estimate, we computed spearman's correlation between this risk estimate for each region to the time-dependent case count for each region documented over the course of the outbreak (supplied by the victorian department of health and human services). 
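a minimal sketch of the meat-worker weighting described above: the mobility-based risk is multiplied element-wise by the estimated occupation share of each origin lga and then re-normalised. the element-wise form and the example numbers are assumptions for illustration; the text states only that outgoing traveller numbers were weighted by the proportion associated with the place of origin.

```python
import numpy as np

def weighted_relative_risk(od_matrix, prevalence, occupation_share):
    """weight outgoing traveller numbers by the at-risk occupation share
    of each origin region, then normalise to relative risk."""
    r = (od_matrix @ prevalence) * occupation_share
    total = r.sum()
    return r / total if total > 0 else r

od = np.array([[50.0, 10.0], [8.0, 40.0]])   # illustrative trip counts
p = np.array([0.0, 1.0])                     # outbreak centre in region 1
meat_worker_share = np.array([0.04, 0.01])   # hypothetical abs-derived proportions
print(weighted_relative_risk(od, p, meat_worker_share))
```

re-normalising after weighting keeps the values interpretable as relative rather than absolute risks; the correlation of these estimates with recorded case counts is analysed next.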
we use spearman's rather than pearson's correlation because, while we expect monotonic dependence between estimated relative risk and case counts, we have no reason to expect linear dependence or normally-distributed errors. the outbreak case data was supplied as a time series of cumulative detected cases in each lga for each day of the outbreak. therefore, we present our correlation as a function of time from april th, when recorded case numbers began to increase dramatically (before may st, the number of affected lgas was too small to compute a confidence interval (n ≤ )). as case numbers increase, the correlation between our risk estimates and case numbers can be estimated with increasing confidence. [figure: relative transmission risk by lga for the affected melbourne-area regions (hume, melton, wyndham, whittlesea, moorabool, brimbank, greater geelong, banyule, darebin, moreland, hobsons bay, melbourne, moonee valley, maribyrnong, yarra, stonnington, port phillip); panel a shows the mobility-only estimate and panel b the estimate weighted by the proportion of meat workers in each lga.] the next scenario we examine began with a single spreading event that occurred during a large gathering at a social venue in western sydney. while workplaces have frequently been the locus of covid-19 clusters, many outbreaks have also been sparked by social gatherings [ , ]. in urban environments, such outbreaks can prove more challenging to trace, as the exposed individuals may be only transiently associated with the outbreak location. the crossroads hotel was the site of the first covid-19 outbreak to occur in new south wales after the initial wave of infections was suppressed. the cluster was identified on july th, during a period when new cases numbered fewer than notifications per day. however, the second wave of community transmission in victoria produced sporadic introductions in nsw, one of which led to a spreading event at the crossroads hotel [ ]. based on media reports, state contact-tracing data indicated that the cluster began on the evening of july rd, during a large gathering [ ]. unlike the cedar meats cluster, the crossroads hotel scenario was not a workplace outbreak with transmission occurring in the same context for a sustained time period, but a single spreading event in a large social centre. for this reason, to estimate relevant mobility patterns we averaged trip numbers over all time-windows in our data (daytime → evening → nighttime → daytime) for the period of june th - july th. it was also necessary to perform some pre-processing of the mobility data provided by facebook in order to correlate case data provided by new south wales health to our mobility-based risk estimates, due to substantial differences in the geographic boundaries used in the respective data sets (see supplemental information and technical note). aside from these minor differences, the method applied in this scenario is essentially the same as the one described above for the cedar meats outbreak. risk of transmission in an area is assessed as the proportion of travellers who entered the outbreak location from that area (see equation ). correlation of our risk estimate to the number of cases in each lga as a function of time is shown in figure (a). heat maps of estimated risk and case numbers are shown in figures (b) and (c), respectively.
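the correlation analysis used throughout these scenarios can be sketched as follows; scipy's spearmanr gives the rank correlation, and the confidence interval uses the fisher z-transformation described in the supplementary methods. the example arrays and the α = 0.05 (95% interval) setting are assumptions, since the exact values are not preserved in the extracted text.

```python
import numpy as np
from scipy.stats import spearmanr, norm

def spearman_with_ci(risk, cases, alpha=0.05):
    """spearman rank correlation with a fisher-z confidence interval."""
    rho, _ = spearmanr(risk, cases)
    n = len(risk)
    if n <= 3:
        return rho, (np.nan, np.nan)   # too few regions for an interval
    z = np.arctanh(rho)
    se = 1.0 / np.sqrt(n - 3)
    zcrit = norm.ppf(1 - alpha / 2)
    lo, hi = np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)
    return rho, (lo, hi)

# hypothetical values for a single day of the outbreak
risk_by_lga = np.array([0.30, 0.22, 0.15, 0.10, 0.08, 0.07, 0.05, 0.03])
cumulative_cases = np.array([12, 9, 4, 3, 2, 2, 1, 0])
rho, (lo, hi) = spearman_with_ci(risk_by_lga, cumulative_cases)
print(f"rho = {rho:.2f}, 95% ci = ({lo:.2f}, {hi:.2f})")
```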
in this analysis, the available data did not explicitly identify the outbreak to which each case was associated; however, it did distinguish between cases associated with local transmission clusters and those associated with international importation. because the crossroads hotel cluster was the only documented outbreak during this time, we attribute to it all cluster-associated cases during the period investigated. this assumption is anecdotally consistent with media reports that specify more detailed information about the residential location of individuals associated with the outbreaks. the covid-19 case data for new south wales is publicly available [ ]. as we show below, the geographic spread of victoria's second wave of community transmission could have been predicted based on case numbers and mobility data that were available in early june. our goal is to examine whether the effectiveness of mobility patterns in predicting relative transmission risk from point outbreaks can extend to community transmission scenarios in which outbreak sources are unknown. in the community transmission scenario, as with the crossroads hotel outbreak, there were no clear context-dependent factors that suggested the use of other population data. in contrast to the first two scenarios, community transmission was occurring in multiple locations at the beginning of our investigation period. for each day, the unscaled risk estimate r_i is the product of the od matrix (averaged over the preceding week) and the vector of active case numbers in each location (see equation ). therefore, in this case the relative risk value R_i represents the proportion of travellers into all areas containing active cases, with the contribution of each infected region weighted by the number of active cases (see equation ). for this scenario, we investigate the correlation between relative risk estimates at time t and incident case numbers (notifications) at later dates, for all dates between june st and july st. the results of our correlation analysis for the victoria community transmission scenario are shown in figure . the goal of this study was to develop and critically analyse a simple procedure for translating aggregate mobility data into estimates of the spatial distribution of relative transmission risk from covid-19 outbreaks. our results indicate that aggregate mobility data can be a useful tool for estimating the diffusion of covid-19 transmission risk from locations where active cases have been identified. the utility of mobility data depends on the context of the outbreak and appears to be greater in scenarios involving environments where the context indicates specific risk factors. the procedure we presented may also be useful during the early stages of community transmission and could help determine the extent of selective intervention measures. in community transmission scenarios, mobility will already have played a role in determining the distribution of case counts by the time community transmission is detected. our results indicate that the insight added by the incorporation of mobility data diminishes as case counts grow. however, we also observed low correlations due to stochastic effects in the crossroads hotel scenario. taken together, these results indicate that there is an optimal usage window: it opens when case counts are high enough for aggregate mobility patterns to shed light on transmission patterns, and closes when these transmission patterns begin to determine the distribution of active cases, which then predict their own future distribution with only limited information added by considering mobility.
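for the community transmission scenario, the comparison between the mobility-weighted estimate and the null prediction based only on active case numbers can be sketched as follows; the weekly-averaged od matrix, the case vectors and the four-region setup are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def mobility_risk(od_week_avg, active_cases):
    """relative risk from the od matrix and the active-case vector."""
    r = od_week_avg @ active_cases
    return r / r.sum() if r.sum() > 0 else r

def null_risk(active_cases):
    """null prediction: relative risk proportional to active cases only."""
    total = active_cases.sum()
    return active_cases / total if total > 0 else active_cases

# hypothetical inputs: 4 regions, od matrix averaged over the preceding week
od = np.array([[120.0, 15.0, 5.0, 1.0],
               [10.0, 90.0, 8.0, 2.0],
               [4.0, 7.0, 60.0, 6.0],
               [1.0, 2.0, 5.0, 40.0]])
active = np.array([12.0, 3.0, 0.0, 0.0])
future_notifications = np.array([9, 4, 2, 0])     # incident cases at a later date

for name, est in [("mobility", mobility_risk(od, active)),
                  ("null", null_risk(active))]:
    rho, _ = spearmanr(est, future_notifications)
    print(name, np.round(est, 3), "spearman:", round(rho, 2))
```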
our examination of the second wave of community transmission in victoria showed that several weeks before it was recognised, the spatial distribution of a small number of active cases it is essential that the use of mobility data for disease surveillance comply with privacy and ethical considerations [ ] . due to this requirement, there will always be trade-offs between the spatiotemporal resolution of aggregated mobility data and the completeness of the data set after curation, which typically involves the addition of noise and the removal of small numbers based on a specified threshold. to help ensure users cannot be identified, facebook removes od pairs with fewer than unique users over the -hr aggregation period. the combination of this aggregation period with the -user threshold affects regional representation in the data set, particularly in more sparsely populated areas. the final product resulting from these choices contains frequently-updated and temporally-specific mobility patterns for densely populated urban areas, at the cost of incomplete data in sparsely populated regions. in general, increased temporal or spatial resolution will reduce trip numbers in any given set of raw data, which can have a dramatic impact on the amount of information missing from the curated numbers [ ] . the comparison of our results from the cedar meats outbreak and those from the crossroads hotel cluster demonstrate that the utility of aggregated mobility patterns in estimation of the spatial distribution of relative risk depends on the context of the outbreak, with more value in situations involving habitual mobility such as commuting to and from work. detailed examination of the inconsistencies between risk estimates and case data from the crossroads hotel outbreak indicate that small numbers of people travelling longer distances were responsible for the relative lack of correspondence in that scenario. in particular, news reports discussed instances of single individuals who had travelled from the rural suburbs to visit the crossroads hotel for the july rd gathering who then infected their family members. these scenarios were not consistent with the risk predictions produced by the mobility patterns into and out of the region and exemplify the limitations of risk assessment based on aggregate behavioural data. the mobility data provided by the facebook data for good program represents a non-uniform and essentially uncharacterised sample of the population. while it is a large sample, with aggregate counts on the order of % of abs population figures, the spatial bias introduced by the condition of mobile app usage cannot be determined due to data aggregation and anonymisation. while it is possible to count the number of facebook users present in any location during the specified time-intervals, it is not possible to distinguish which of those are located in their places of residence. in order to account for the (possibly many) biases affecting the sample, a detailed demographic study would be necessary that is beyond the scope of the present work. a heat map (supplemental figure s ) of the average number of facebook users present during the nighttime period ( am to am) as a proportion of the estimated resident population reported by the abs ( [ ] ) shows qualitative similarity to the spatial distributions of active cases and relative risk shown in figure on a fundamental level, mobility patterns are responsible for observed departures from continuum mechanics observed in real epidemics [ ] . 
over the past two decades, due to public health concern over the pandemic potential of sars, mers, and novel influenza, spatially explicit models of disease transmission have become commonplace in simulations of realistic pandemic intervention policies [ , ] . such models rely on descriptions of mobility patterns which are usually derived from static snapshots of mobility obtained from census data [ , , ] . while this approach is justifiable given the known importance of mobility in disease transmission, it is also clear that the shocks to normal mobility behaviour induced by the intervention policies of the covid- pandemic will not be captured by static treatments of mobility patterns. to account for the dynamic effects of intervention, several models have been developed to simulate the imposition of social distancing measures through adjustments to the strength of contextspecific transmission factors [ , ] . this type of treatment implicitly affects the degree of mixing between regions without explicitly altering the topology of the mobility network on which the model is based and it is unclear whether such a treatment is adequate to capture the complex response of human population behaviour. given the results of our analysis, the incorporation of real-time changes in mobility patterns could add policy-relevant layers of realism to such models that currently rely on static, sometimes dated, depictions of human movement. example scripts and data used for computing risk estimates and correlations can be found in the associated github repository: https://github.com/cjzachreson/covid- -mobility-risk-mapping however, due to release restrictions on the mobility data provided by facebook, the od matrices are not included as these were derived from the data provided by the facebook data for good program (random matrices are included as placeholders). the processed mobility data used in this work may be made available upon request to the authors, subject to conditions of release consistent with the facebook data for good program access agreement. a generic implementation of the code used to re-partition od matrices between different geospatial boundary definitions is enclosed in the supplementary technical note. the data used in our study was provided by the facebook data for good program. the data set (in the disease prevention maps subset) is aggregated from individual-level gps coordinates collected from the use of facebook's mobile app. therefore, the raw data is biased to over- (national-scale) and smaller (city-scale) regions of interest, we determined that the state-level data provided the best balance, with trip numbers large enough to produce a sufficiently dense network of connections while still providing a subregion size that is usually smaller than the local government areas for which case data is reported. because the raw mobility data is provided as movements between tiles, while case data is provided based on the boundaries of local government areas. we note that while facebook releases data aggregated to administrative regions, these regions were not geographically consistent with the current lga boundaries for australia. in order to ensure consistency of our method across datasets and jurisdictions, we produced our own correspondence system. we did this by performing two spatial join operations. these associate either tiles or lgas with meshblocks (the smallest geographic partition on which the australian bureau of statistics releases population data). 
meshblocks were associated based on their centroid locations. each meshblock centroid s was associated to the tile with the nearest centroid and to the lga containing it. we did not split meshblocks whose boundaries lay on either side of an lga or tile boundary, as their sizes are sufficiently small that edge effects are negligible (in addition, the set of lgas forms a complete partition of meshblocks, so edge effects were only observed for tile associations). we then associated tiles to lgas proportionately based on the fraction of the total meshblock population within that tile that was associated with each overlapping lga. once a correspondence is established between the tile partitions on which mobility data is released and the lga partitions on which case data is released, the matrix of connections between tiles must be converted into a matrix of connections between lgas. the supplementary technical note explains how we performed this step, and gives a general method for converting matrices between partition schemes. briefly, the number of trips between two locations in the initial data is split between the overlapping set of partitions in the new set of boundaries (in this case, local government areas), based on the correspondence between partition schemes determined as explained in the previous subsection. to investigate the spatial sample biases present in the mobility data provided by facebook, we examined the ratio of facebook users to abs population for each suburb in victoria. while the true number varies from day to day, an example of this distribution is shown as a heat map in supplemental figure s , which displays the average number of facebook mobile app users indexed to each lga between the hours of am and am from may th to june th, divided by the estimated resident population reported by the abs in . the distribution is narrow, with most urban areas falling in the range of % to % facebook users. however, this is not an exact representation of residential population proportions, as many mobile users work during the nighttime and will not be located at their residence during the selected period. unfortunately, it is not possible to precisely quantify the bias introduced by facebook's sampling scheme. despite these limitations, it may still be informative to examine whether accounting for the bias for the cedar meats outbreak scenario, accounting for the facebook sample bias in this way improves the correlation between our mobility-based relative risk estimate and the recorded case counts ( figure s a ). for the community transmission scenario, performing this extra step does not appear to substantially change the result shown in figure (compare figure s b and figure c ). we used spearman's rank correlation to investigate the correspondence between our relative risk estimates and documented case data. this measure of correlation is typically used when comparing ordinal data, or, more generally, when monotonic relationships are expected, but errors are not normally-distributed. in order to investigate the monotonicity between relative risk estimates and reported case numbers, we aligned the documented case data for all regions in which infections had been tabulated against the corresponding relative risk estimates for those regions. note that our correlations did not include regions for which no case data was available. 
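the tile-to-lga re-partitioning described above can be sketched as a weighted change of partition: with a weight matrix w whose entry w[t, l] is the fraction of tile t's (meshblock-derived) population assigned to lga l, trips are split proportionally at both ends, so the lga-level od matrix is the transpose of w times the tile-level matrix times w. the weight values below are invented; the authors' general implementation is given in their supplementary technical note.

```python
import numpy as np

def repartition_od(od_tiles, weights):
    """convert a tile-level od matrix to lga level.

    weights[t, l] = fraction of tile t's population assigned to lga l
    (rows sum to 1); trips are split proportionally at both ends."""
    assert np.allclose(weights.sum(axis=1), 1.0)
    return weights.T @ od_tiles @ weights

# three tiles mapped onto two lgas (hypothetical population fractions)
weights = np.array([[1.0, 0.0],
                    [0.6, 0.4],
                    [0.0, 1.0]])
od_tiles = np.array([[100.0, 20.0, 5.0],
                     [15.0, 80.0, 10.0],
                     [2.0, 12.0, 60.0]])
od_lga = repartition_od(od_tiles, weights)
print(np.round(od_lga, 1))
print(od_tiles.sum(), od_lga.sum())   # totals match: trips are conserved
```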
as noted above, our correlations only include regions with recorded cases; therefore, our correlation results illustrate the degree to which risk estimates are monotonic with case numbers, but do not account for any risk estimates made in areas with no cases to compare to. this results in a high degree of uncertainty when the number of affected areas is small, reflected by the wide confidence intervals observed in the early stages of the cedar meats and crossroads hotel outbreaks (figures , and a , respectively). the % confidence intervals were computed using fisher's z transformation with quantile parameter α = . . two data sets from the australian bureau of statistics were used in this study: 1) the number of residents by industry of occupation, and 2) the resident population. the distributions shown in figure s were computed by dividing the number of facebook users indexed to each lga during the nighttime period by the resident population in each lga. we obtained the population data from the abs population dataset, which is publicly available [ ]. the facebook user populations are provided by the data for good program in addition to the mobility data discussed above. as a context-specific risk factor for the cedar meats outbreak, we obtained the number of residents employed in the relevant meat-processing categories. to compute the factors used to weight the mobility-based relative risk predictions, we divided the total number of workers in both of these categories by the number of employed persons (those employed full time or part time) in each lga, which we also drew from the australian census via census tablebuilder. covid-19 case data by local government area is available from australian jurisdictional health authorities. for this work, we used data provided by nsw health [ ] (all data are publicly available) and by victoria dhhs. the data used for the cedar meats outbreak scenario was obtained from dhhs through a formal request to the victorian agency for health information (vahi) and cannot be made public in this work. the case data by lga used to evaluate the victoria community transmission scenario was taken directly from the covid-19 daily update archives available on the dhhs public website [ ].
references: transmission routes of respiratory viruses among humans (current opinion in virology); guideline for isolation precautions: preventing transmission of infectious agents in health care settings; the incubation period of coronavirus disease (covid-19) from publicly reported confirmed cases: estimation and application; temporal dynamics in viral shedding and transmissibility of covid-19; epidemiologic features and clinical course of patients infected with sars-cov-2 in singapore; quantifying sars-cov-2 transmission suggests epidemic control with digital contact tracing; substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (sars-cov-2); presymptomatic transmission of sars-cov-2, singapore; presymptomatic sars-cov-2 infections and transmission in a skilled nursing facility; asymptomatic and presymptomatic sars-cov-2 infections in residents of a long-term care skilled nursing facility; aggregated mobility data could help fight covid-19; covid-19 outbreak response, a dataset to assess mobility changes in italy following national lockdown (scientific data);
effectiveness of social distancing strategies for protecting a community from a pandemic with a data-driven contact network based on census and real-world mobility data; social distancing as a health behavior: county-level movement in the united states during the covid-19 pandemic is associated with conventional health behaviors; facebook disaster maps: aggregate insights for crisis response & recovery; economic and social consequences of human mobility restrictions under covid-19; job loss and behavioral change: the unprecedented effects of the india lockdown in delhi; interdependence and the cost of uncoordinated responses to covid-19; human mobility in response to covid-19 in france; covid-19 at a glance; covid-19 among workers in meat and poultry processing facilities, states; interregional sars-cov-2 spread from a single introduction outbreak in a meat-packing plant in northeast; first cedar meats covid-19 case confirmed on; infection fatality rate of sars-cov-2 infection in a german community with a super-spreading event (medrxiv); high sars-cov-2 attack rate following exposure at a choir practice; covid-19 weekly surveillance in nsw, epidemiological week, ending; fears of further spread as crossroads hotel virus cases become infectious within a day; nsw covid-19 cases by location and likely source of infection; updates about the outbreak of the coronavirus disease (covid-19); creating a surrogate commuter network from australian bureau of statistics census data (scientific data); by region; synchrony, waves, and spatial hierarchies in the spread of influenza; mitigation strategies for pandemic influenza in the united states; interfering with influenza: nonlinear coupling of reactive and static mitigation strategies; what can urban mobility data reveal about the spatial distribution of infection in a single city; investigating spatiotemporal dynamics and synchrony of influenza epidemics in australia: an agent-based modelling approach; impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand; modelling transmission and control of the covid-19 pandemic in australia; bing maps tile system; about tablebuilder.

key: cord- -ajwnihk authors: carrillo, dick; nguyen, lam duc; nardelli, pedro h. j.; pournaras, evangelos; morita, plinio; rodríguez, demóstenes z.; dzaferagic, merim; siljak, harun; jung, alexander; hébert-dufresne, laurent; macaluso, irene; ullah, mehar; fraidenraich, gustavo; popovski, petar title: containing future epidemics with trustworthy federated systems for ubiquitous warning and response date: - - journal: nan doi: nan sha: doc_id: cord_uid: ajwnihk

in this paper, we propose a global digital platform to avoid and combat epidemics by providing relevant real-time information to support selective lockdowns. it leverages the pervasiveness of wireless connectivity while being trustworthy and secure. the proposed system is conceptualized to be decentralized yet federated, based on ubiquitous public systems and active citizen participation. its foundations lie in the principle of informational self-determination. we argue that only in this way can it become a trustworthy and legitimate public-good infrastructure for citizens, by balancing the asymmetry of the different hierarchical levels within the federated organization while providing highly effective detection and guiding mitigation measures towards a graceful lockdown of the society. to exemplify the proposed system, we choose remote patient monitoring as the use case,
in which, the integration of distributed ledger technologies with narrowband iot technology is evaluated considering different number of endorsed peers. an experimental proof of concept setup is used to evaluate the performance of this integration, in which the end-to-end latency is slightly increased when a new endorsed element is added. however, the system reliability, privacy, and interoperability are guaranteed. in this sense, we expect active participation of empowered citizens to supplement the more usual top-down management of epidemics. the covid- pandemic has clearly shown that, in many senses, the world as a whole was not prepared for dealing with a disease of such magnitude. although a final assessment is infeasible at this point (october, ), in which a second wave is creeping back in europe and is poised to rage across the continent by fall [ ] the current statistics available about covid- indicate that the most successful policies for monitoring and controlling the virus propagation are employing various digital technologies and connectivity [ ] , [ ] . these technologies can be used in two principal ways: ( ) to provide status and predictions of the epidemiological spread and facilitate actions, such as administration of diagnostic tests or preparation of medical equipment; ( ) to implement active policies that facilitate societal processes and safe citizen movement, such as shop schedules, organized delivery of goods, and similar. the success of such informed policies is also strongly related to the timing of their implementation for the respective place, as virus contagion is a spatiotemporal phenomenon that usually leads to exponential growth in the number of infected individuals [ ] . unfortunately, even the most successful timely interventions are in some way overreaching, shutting down abruptly most of the economic activities and creating social distancing between citizens. while acknowledging that this is our current societal need to contain the propagation of this pandemic, our main hypothesis is that, in the near future, digital technologies and wireless connectivity can enable a graceful lockdown by having the following main roles. • facilitate selective lockdowns by allowing flexible transfer of the work, learn and play activities online to the desired extent. • assist in maintaining physical distancing and tracking the spreading of the disease, while offering digital tools to control the level of lockdown/reopening. • provide real-time information for targeted and efficient testing. note that these are also important today (october, ), when a second wave is forcing several countries to rethink a stepby-step reopening based on the aforementioned points. this will be achieved based on an integrated solution for the ubiquitous and privacy-preserving data acquisition, explainable predictive methods for contagion risk assessment, and digitized policies for selective lockdown and scheduling of societal activities. data will be locally acquired via online social networks, personal smart phone devices and internet-of-things (iot) sensors in general. it is foreseen that an effective global solution shall be decentralized, yet hierarchical (systemof-systems), to provide fine-tuned coordination among the federated entities to orchestrate actions and then avoid both the human rights cost of overly stringent solutions [ ] and the human life cost of "hands-off" approaches. 
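the abstract above reports that end-to-end latency increases slightly each time an endorsing peer is added to the dlt network. one crude way to see why is to model the endorsement phase as waiting for the slowest of n parallel peer responses on top of a fixed nb-iot uplink delay; the delay values and the exponential-response assumption below are illustrative, not measurements from the authors' testbed.

```python
import numpy as np

def expected_e2e_latency(n_peers, uplink_ms=1500.0, mean_endorse_ms=40.0,
                         trials=20000, seed=0):
    """monte carlo estimate of end-to-end latency when a transaction must be
    endorsed by all n peers in parallel (latency = uplink + slowest endorsement)."""
    rng = np.random.default_rng(seed)
    endorse = rng.exponential(mean_endorse_ms, size=(trials, n_peers))
    return uplink_ms + endorse.max(axis=1).mean()

for n in (1, 2, 4, 8):
    print(n, "endorsing peers ->", round(expected_e2e_latency(n), 1), "ms")
```

under this assumption the expected maximum grows roughly with the harmonic number of n, which is consistent with a small per-peer increase in latency while reliability and auditability improve.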
beyond more traditional centralized policy-making, such a decentralized platform, which will be built using trustworthy distributed ledger technology (dlt), should also provide citizens with arxiv: . v [cs.dc] oct tools for more direct participation in mitigation measures based on tailored individual incentives. in this sense, we expect to bring managerial decisions closer to the citizens. with such a federated governance, we attempt to convert the usually passive data acquisition process to an active participatory one. hence, our aim is to move away from a purely top-down management approach. in addition, the proposed platform aims at achieving the minimal use of personal data, processed with the maximum possible privacy-preservation to eliminate global health risks, while protecting the economy [ ] . the covid- pandemic has demonstrated that nations and governments need to be better prepared and coordinated to detect and react to global threats focusing on early detection, early response, moving beyond traditionally passive disease monitoring based on voluntary reporting systems. hence, a rapid, participatory responsive detection system must be in place to support public health officials. a synergistic relationship with public health officials, policymakers, and citizens' active participation will be critical to align the mandate for public health with the protection of privacy, freedom and democracy [ ] , [ ] . in this context, one main factor is to design a special set of incentives that would allow the citizens to provide secured anonymized access to their data while actively participating in the crowd platform to support early disease detection, a public information system, and possible mitigation measures. building up such an incentive structure that maps those tradeoffs is a key aspect (beyond purely technical ones) for the success of the proposed federated architecture. furthermore, this system should be constructed and evaluated rigorously following ethical guidelines as in [ ] , [ ] . our contribution in this paper is threefold: ) a comprehensive analysis of epidemiological models, data collection, and wireless connectivity is done based on key relevant scientific references on pandemics. ) a federated global epidemiological warning system is proposed based on dlts. ) a proof of concept of the integration between dlt and nb-iot is used to evaluate the wireless network performance on the iot infrastructure supporting a remote patient monitoring use case. the rest of the paper is organized as follows: section ii presents a discussion of epidemiological models and their limitations. section iii describes the relevance of wireless connectivity on pandemic scenarios. in section iv, the federated global epidemiological warning system is proposed, here is also detailed the proof of concept of the integration of dlt and narrowband iot (nb-iot). section v is reserved for some discussion and future perspectives. biomedical data alone do not contain the information required for preventing and mitigating pandemics. in a highly interconnected globalized society, social interactions, environmental data, spatiotemporal events, and collective nonlinear phenomena observed on complex infrastructures such as transport systems, require the mining of heterogeneous pandemic big data that are a result of a complex system of systems processes. 
in fact, modern disease monitoring systems, as the global public health intelligence network (gphin) [ ] and the global outbreak alert and response network (goarn) [ ] , and event-based approaches that use a combination of web crawlers, artificial intelligence, and public health expertise in the detection of indicators [ ] , [ ] provide goodalthough limited-examples of successful platforms to identify latent indicators of an outbreak. to go beyond those existing solutions, data scale, contextualization, and granularity are key requirements for data quality, which is usually orthogonal to privacy preservation, timely processing and analysis as well as storage and processing cost. in the following, more details will be provided about existing epidemiological models and epidemic-related data collection. models for emerging epidemics come in various forms, all relying on differently coarse-grained individuals' data. state-of-the-art population-level models for covid- rely on metapopulation models with an underlying susceptible-exposed-infectious-recovered disease dynamics [ ] . the metapopulations represent regions surrounding travel hubs and are interconnected through mobility data, for example, data from airlines, public transportation systems, and traffic control systems. the disease dynamics itself accounts for the fact that exposed individuals can be presymptomatic and unaware that they infect others, as suggested by early case data [ ] . unfortunately, while this model is effective at providing large-scale forecasts and predicting importations at aggregated levels, it is less informative at the individual level. however, this more granular information is necessary to accurately estimate the probability of such epidemics given a single importation event, which is rare given the population size. this particular event has the potential to become the initial point of a widespread contagion in a region not reached by epidemics until that point. since this event basically depends on one individual, aggregated level statistics provide poor description of the heterogeneity of individual level behavior and mobility. given this, most governments around the world introduced stay-at-home policies to reduce the mobility of their citizens, which should result in the now famous flattening of the curve. this expression refers to the process of slowing down the virus spread to keep the need for hospital care below the health care capacity. different variations of the stay-at-home policy have been introduced, e.g. allowing citizens to move in a predefined radius to walk with their pets and exercise, restricting the number of times per day or the number of people per household that can go out. as previously argued, this is, understandably, a necessary abrupt solution for now, but it can become more graceful by properly utilizing crowdsourced data and active citizens' participation. b. data collection: current activities and potential existing sources at higher levels, mining existing data sets collected by different types of service providers is an obvious and powerful way to quantify to which extent citizens follow stay-at-home policies, thereby providing an essential means of assessing their effectiveness. for example, smart grid systems can provide information about the changes in the energy consumption in different areas suggesting that people, on average, spend more time at home. 
the network usage information from broadband operators can also be used as an indicator of change in human behavior related to their mobility. another source of information are mobility data sets from cellular network operators that are already present on the network. smart appliances, like smart tvs, smart fridges, and smart light bulbs can be used to provide information about the overall time spent at home versus the time spent outside. finally, information collected by several smart city applications, such as public transport usage and traffic, could also be leveraged to estimate the change in mobility patterns at the level of places, buildings and vehicles. at a lower granularity, proximity-tracing platforms have been proposed to collect data at the individual level, for instance the pan-european privacy-preserving proximity tracing (pepp-pt) [ ] , often in partnership with mobile service providers and operators. by aggregating proximity data over time, these approaches can follow data protection regulations. among others, two of the largest big tech companies apple and google proposed a joint solution that appears to focus on the protection of user privacy, for example by keeping people anonymous in the central servers and making data submission voluntary. different technologies are being proposed to automate and extend contact tracing (e.g. shared databases, gps traces of confirmed cases, contact tracing through bluetooth). the use of bluetooth-based solutions, either in the form of the third-party dedicated apps or as a feature built into the mobile devices' operating systems, seems to be the most promising solution to date. although these solutions are a new source of granular data, they cannot cover all relevant forms of infection, namely asymptomatic infections, and infections occurring out of proximity through shared surfaces (e.g. doorknobs). in this sense, our hope is to combine the metapopulation and individual tracing models in a citizen science framework to consider more possible routes of infection as well as both proximity (e.g. close contact) and mobility data (e.g. location visited). in addition to this, machine learning methods are expected to combine expert knowledge (e.g. from virologists and sociologists) with global data collected from different sources for fitting powerful predictive models to high-dimensional data. in doing so, the target is to leverage the different straights of different data sources at different granularity for privacy protection and allow for more thorough probabilistic forecasts. ultimately this will allow for graceful lockdowns, tailored incentives for citizens' participation, and fine-tuned legitimate interventions. fine-grained data collection approaches raise two major concerns as highlighted in [ ] : ( ) user privacy; and ( ) trustworthiness of the shared information. clearly, sharing location and health information with third-party entities can lead to the misuse of the data set for instance through unwanted advertising and health insurance implications. the impact of fake health information on the contact tracing system should be analyzed to assess its implications in terms of access to testing. all these applications rely on the premise that users will trust and adopt these applications. this brings us to an additional problem that could hamper these approaches. the adoption of new technologies is hard to predict, and it might introduce bias for example related to age, wealth, level of education, and state of country development. 
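returning to the modelling discussion above, the metapopulation seir structure (patches coupled through mobility) can be illustrated with the toy sketch below; the parameter values, the two-patch setup and the simple frequency-dependent coupling are assumptions chosen only to show the mechanics, not a calibrated covid-19 model.

```python
import numpy as np

def seir_metapop_step(state, beta, sigma, gamma, mixing, dt=0.1):
    """one euler step of a patch-level seir model coupled by a mixing matrix.

    state has shape (patches, 4) = columns S, E, I, R; mixing[i, j] is the
    fraction of patch i's contacts that occur with residents of patch j."""
    s, e, i, r = state.T
    n = state.sum(axis=1)
    force = beta * (mixing @ (i / n))      # force of infection felt in each patch
    new_inf = force * s
    ds = -new_inf
    de = new_inf - sigma * e
    di = sigma * e - gamma * i
    dr = gamma * i
    return state + dt * np.stack([ds, de, di, dr], axis=1)

# two patches: an initial outbreak in patch 0, weak mobility coupling to patch 1
state = np.array([[99990.0, 0.0, 10.0, 0.0],
                  [50000.0, 0.0, 0.0, 0.0]])
mixing = np.array([[0.95, 0.05],
                   [0.02, 0.98]])
for step in range(1000):                   # 100 days with dt = 0.1
    state = seir_metapop_step(state, beta=0.3, sigma=0.2, gamma=0.1, mixing=mixing)
print(np.round(state, 1))                  # patch 1 is eventually seeded via mobility
```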
citizens' participation in warning and response of epidemic outbreaks requires new incentives that reward the responsible use of citizens' personal data for protecting public health, while penalize and prevent citizens' profiling actions, manipulative nudging and power misuse. designing new data fusion schemes tailored to fuel data analytics processes for prevention and mitigation of pandemics is an open research question: which smart phone sensor data can model compliance of stay-at-home policies? how such models can be enhanced with social media, smart grid, or transport data? pandemics require that we revisit and potentially reinvent how data should be managed and how systems should be designed to manage data in a more responsible way [ ] . more specifically, discovering new ways to turn private sensitive data of citizens into public good, while preventing massive surveillance, profiling, and discriminatory analytics, becomes imperative. to pioneer this capability, new transparent and accountable data commons [ ] , for instance personal data stores [ ] - [ ] , to which citizens have incentives to contribute data, are required. the citizens retain the right for their data to be "forgotten" [ ] . ultimately, they have control and give their consent to how these data are used. these data commons are designed to maximize the use of techniques for privacy preservation [ ] , [ ] , for example homomorphic encryption, differential privacy, anonymity, and obfuscation, while citizens are made aware of the privacy risks they experience. the previous section covered different aspects related to the data required to support a fine-grained epidemiological model. it is argued that a ubiquitous system can be employed to manage such data in a privacy-preserving manner. here, we identify wireless connectivity and internet of things (iot) devices as a way to collect data following those principles. wireless connectivity and iot devices come in many flavors, but can, in general, be classified into ( ) personal devices, such as mobile phones and earbuds; and ( ) unattended connected machines and things, such as surveillance cameras and motion sensors. there are three principal sources of epidemic-relevant data acquired through wireless connectivity: ( ) online social networks; ( ) personal smart phone and mobile data; and ( ) sensory and internet of things (iot) devices. in critical events, people tend to use online social networks to post comments about the emergency and learn from other users' comments. as a result, online social networks become a rich source of diverse information that could help to understand the main characteristics of the crises and their potential magnitude in early stages. for instance, in [ ] , authors collected comments posted in the chinese social media channel weibo during the first month of the covid- epidemic in wuhan (china) to understand the evolution trend of the public opinion. a similar approach was used on twitter in [ ] , [ ] . government entities can use this approach to give more attention to the needs of the public during the beginning of the epidemic and adjust their responses accordingly. another form of data content, normally provided by explicit consent from the citizens, is the gps location data from personal devices. at the application level, the twitter platform offers various user information, geographical location being one of the most important for a wide range of applications. however, only a few users enable their geolocation as public. 
to discover the user's geolocation (geographical region or exact geocoordinates), there are mainly two approaches called contentbased and network-based algorithms. the first one uses textual contents from tweets, and the latter uses the connections and interactions between users. in [ ] , a neural network model based on multiview learning by combining knowledge from both user-generated content and network interaction is proposed to infer users' locations. this source of information comes in many different forms, ranging from the use of metadata in cellular mobile networks for user localization up to the metadata associated with different applications, such as twitter. the use of data collected in mobile networks during the covid- pandemic has received significant attention [ ] . agglomerated mobile network data can be used to verify if interventions, such as school closures, are effective, and help to understand the geographical spread of an epidemic. different types of mobile network metadata are collected at different levels of the communication system, and they offer varying levels of information. at the lowest level, localization of mobile network users is possible by evaluating the strength of the wireless signal received at base stations. depending on how many base stations are connected to the device, the location can be determined at the level of entire cells or down to a few meters using triangulation. the resolution offered by cellular mobile localization maybe too coarse for detecting citizens' proximity and potential infections by face-to-face interactions. nevertheless, even lowresolution location data can give insight into behavioral patterns (how much time is spent at home, office, and events) of individuals. proximity detection can be enhanced by using the metadata from bluetooth devices, such as beacons and discovery messages (e.g., [ ] ). wi-fi is also a technology that is used ubiquitously, and its metadata can provide information on user proximity. for example, proximity can be inferred by comparing the lists of access points that each device can see within a given time interval [ ] . fusing the metadata from these different sources, along with the context information (how many family members are at home, which events) is very relevant for monitoring the epidemic. many applications and devices become an important tool to provide epidemic-relevant data. these can be related to surveillance/thermal cameras, drones and even wearable devices. for example, surveillance cameras can be employed to count the number of people entering and exiting a specific area. in other scenarios, thermal cameras are used with specialized settings to focus on human skin temperature range, and infrared spot sensors for individual temperature scanning. a similar application to identify individual or group activities in a given place is based on motion sensors as infrared lasers, or ultrasonic sensors. in this approach, the motion sensors are installed and deployed in a specific environment to recognize different human activities. other important source of iot data are the smart healthcare devices, for instance, remote monitoring systems to check body temperature, which is a key sign in the support of homeostasis. 
other popular applications are oxygen saturation monitoring based on beat oximetry, electrocardiogram monitoring with a specialized framework to estimate the heart rate, elderly monitoring using doppler radar to identify risk movements of elderly people, sugar level monitoring, and blood pressure monitoring. these iot devices provide information that is analyzed by the iot data to support specific applications. we envision a federated and decentralized coordination system for epidimiological warning and response. this system is federated by citizen communities that crowd-source data, personal smart phone devices, community-level iot devices (e.g. lorawan networks) and other computational resources on the edge. it can scale organically and bottom-up to citylevel, national-level and ultimately at global-level to coordinate in a socially responsible way the international actions of public health organizations and governments. this scaling requires tailored incentives that align public health policy-making with citizens' privacy and autonomy. distributed ledgers with secure crypto-economic incentive models running on edge serves and scaling-up on-demand at a global level using wireless connectivity and public cloud infrastructure are the means to support the federated nature of this proposed system. figure illustrates the our vision. the proposed federated global epidemiological warning and response system is built upon epidemic-relevant information obtained through wireless connectivity introduced in section iii, and the internet infrastructure including fiber-optic and satellite links. thus, in addition to the devices that are the data sources (e.g., smart phones, wearables, smart appliances, and cameras) with their applications and existing communication networks, there is a need for a computer infrastructure dedicated to store and process the epidemic-relevant data in a secure and privacy-preserving manner. it is argued that this infrastructure also needs to follow a similar federated organization based on the geographic locations. each municipality, city, county, or larger neighborhood will rely on edge servers to process their respective data contents. the premise is to keep the computations as local as possible. however, larger-scale computations related to the interrelations between locations (e.g. mobility, commuting, and traveling) requires data from these different places. once again, the federated organization supports this collaborative sharing (up and downstream between the federated entities), but it will probably require computationally more demanding algorithms. it is possible that those computational tasks require more powerful cloud servers, or collaborative parallel computing at the edges. for security reasons, dlts shall be employed to store data from the edge servers at the global level. at the technical level, the following aspects deserve special attention. in this paper, a privacy-by-design approach is proposed with which several critical operations of the federated global epidemiological system can be performed; for instance, decentralized data analytics [ ] , [ ] , social interactions analysis [ ] , [ ] , decentralized planning and resource allocation [ ] and federated learning [ ] , [ ] . such operations integrate in a smart way state-of-the-art techniques, for instance, informational self-determination, homomorphic encryption, differential privacy, obfuscation, and anonymity. 
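among the operations listed above, federated learning is the one most easily illustrated in a few lines. the sketch below shows the basic federated-averaging idea, where edge nodes only share model weights rather than raw citizen data; the weighting rule, array shapes, and variable names are illustrative assumptions and not the specific scheme proposed in the cited works.

```python
import numpy as np

# Minimal sketch of federated averaging: each community/edge node trains a
# local model on its own citizens' data and shares only model weights, never
# raw data. Weighting by local sample count follows the standard FedAvg idea.

def federated_average(local_weights, local_sample_counts):
    """Weighted average of local model parameter vectors."""
    counts = np.asarray(local_sample_counts, dtype=float)
    weights = np.stack(local_weights)              # shape: (n_nodes, n_params)
    return (weights * counts[:, None]).sum(axis=0) / counts.sum()

# Three edge nodes with locally trained parameter vectors
node_params = [np.array([0.2, 1.0]), np.array([0.4, 0.8]), np.array([0.1, 1.2])]
node_samples = [100, 300, 50]
global_params = federated_average(node_params, node_samples)
print(global_params)  # aggregated model, computed without centralizing raw data
```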
often, privacy may limit the quality of service known as the system utility as the accuracy and quality of data are deteriorated to hide information content. for instance, predicting the risk of infections for individuals based on an epidemiological network model is a graph-based semi-supervised learning problem. privacy-preserving semisupervised learning over graphs has been considered in [ ] . however, the precise trade-off between (lowering of) privacy protection and learning accuracy is largely open. pareto optimal trade-offs can be configured and regulated by tuning the parameters of the privacy techniques as previously shown in [ ] . monetary and other incentives can be used to coordinate data sharing choices in a crowd. however, new types of social incentives are required for such an epidemiological system; for instance, incentives related to well-being, receiving solidarity, and longterm payoffs. ) interoperability: the big data required by the proposed solution have to be interoperable, i.e. the several applications that are providing data to be used to accomplish a specific task have to operate together and share a common "understanding" [ ] . this can be achieved by employing standards for health informatics such as iso ts : , ts : , or ts . for instance, the data collected from different sources can be used to predict refinements to patient care or new drug contraindication [ ] . this key issue, though, has deserved little attention in large-scale epidemiological studies [ ] , [ ] ; it is usually assumed that heterogeneous data sources are compatible with each other. in practice, though, the highly heterogeneous data sources lead to poor interoperability, which creates barriers to effectively combat pandemics like covid- , as indicated by [ ] . ) user interfaces: a successful platform also involves a suitable end-user interface [ ] . in this sense, data consumption by public health officials and global health agencies will require user-friendly web-based interfaces, using common dashboarding techniques. however, beyond this, the proposed federated platform has to consider citizen participation and the heterogeneity of end-users. therefore, the following characteristics should be taken into consideration in the design of the platform: it should be (a) explainable/accountable to improve for instance awareness and engagement [ ] ; (b) gamified to engage and incentivize participation [ ] , [ ] ; and (c) customizable for different user groups at an international level. the proposed architecture has to articulate the key elements from data acquisition to analytics, following the best practices related to privacy and cyber-security. at the acquisition level, in addition to the existing data retrieval from the web, wirelessconnected devices will send data to edge nodes that will be associated with specific regions. data will be anonymously preprocessed at the edge, including some intelligent detection of anomalies or event detection, and then sent to regional (cloud) servers to run a more complete model that will fuse the geolocated timestamped data and run detection models based on explainable ai approaches combined with mathematical computational methods. in this case, crowd-sourcing models are applied based on hardware as a public good in an approach similar to the diaspora social network, also extending this approach to software and data. 
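the edge-side preprocessing step just described (aggregate locally, anonymize, detect anomalies, then forward) can be sketched as follows. this is only an illustration of the privacy-utility trade-off discussed earlier in this section: the use of laplace noise for differential privacy, the epsilon value, and the simple z-score anomaly rule are all assumptions chosen for clarity.

```python
import numpy as np

# Sketch of edge-side preprocessing: aggregate individual reports into a count,
# add Laplace noise for differential privacy, and flag simple anomalies before
# forwarding to the regional server. Epsilon and the anomaly rule are
# illustrative choices, not prescriptions.

rng = np.random.default_rng(0)

def dp_count(reports, epsilon=1.0, sensitivity=1.0):
    """Differentially private count of positive symptom reports."""
    true_count = sum(reports)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return max(0.0, true_count + noise)

def is_anomalous(noisy_count, baseline_mean, baseline_std, z=3.0):
    """Flag the window if the noisy count exceeds baseline mean + z * std."""
    return noisy_count > baseline_mean + z * baseline_std

reports = [1, 0, 0, 1, 1, 0, 1]          # 1 = symptomatic report in this window
noisy = dp_count(reports, epsilon=0.5)   # smaller epsilon = more noise, more privacy
print(noisy, is_anomalous(noisy, baseline_mean=1.0, baseline_std=0.8))
```

a smaller epsilon injects more noise, improving privacy at the cost of utility, which is exactly the pareto trade-off that has to be configured and regulated as discussed above.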
regional models will be associated in federations resembling the governance structure of the actual regions under interest, which is customizable to local policies/governance models. following this federated and collaborative organization, it will be possible to build organically a fine-tuned early detection system at the global level in a decentralized yet hierarchical manner to support graceful lockdowns. this would turn the proposed federated architecture into a holarchy, where each level is independent and self-sustained, but it can also be encapsulated at a level above to capture new goals [ ] . this system will rely on dlts to guarantee a trustful system without a responsible third party and minimize the risk of data manipulation [ ] . distributed ledgers could also be used to facilitate different (crypto-economic) incentive mechanisms, such as token curated registries. its main objective is to build a crowd-sensing trustworthy platform based on privacypreserving methods to detect potential harmful symptoms in a specific region in almost real-time and flag them to relevant authorities. this alarm needs to be accurate, reliable, and explainable. it will also require a user-tailored interface that could empower citizens by providing detailed explanations and easy-to-access information, while being a tool for policymakers (from the city level to global organizations) designing the correct interventions (a variety of incentives, sanctions, and other persuasive measures) given the specific context of the epidemic in some specific region in a given period of time. to obtain insights from a practical implementation, we define a specific use case that matches the premises of the proposed architecture. this key industry use case focuses on remote patient monitoring. here, many issues in treating chronic patients could be reduced or resolved through more efficient patient care. in the case of pandemics, such as covid- , the remote monitoring of diagnoses patient can represent a key difference to avoid the virus acceleration. besides that, the application should guarantee privacy and security on the gathered data. to guarantee people participation, the system can provide incentives, such as bonus or tax compensations. in the context of the proposed federated global epidemiological warning system, the remote patient monitoring is a representative use case, in which the integration between dlts and iot devices plays a key role. we analyze this integration through an experimental setup based on nb-iot, which is one of the most representative low-power wide-area network (lpwan) technologies in cellular networks. considering that the proposed architecture requires massive end devices deployment, low power consumption, superior coverage, and worldwide deployment, nb-iot becomes a key solution to support the federated global epidemiological warning system. some details of the integration setup are described in the next sub-section, emphasizing data workflow, and end-to-end (e e) latency. we consider latency because the experimental results indicate that it is the most sensible network parameter when dlts are built in the top of the connectivity system. in this section, we provide brief explanation of the integration between the nb-iot network and the dlt-based system, which is illustrated in figure . 
in a conventional nb-iot system, the uplink data generated by the user equipments (ues) is transmitted through specialized packet messages and routed toward an edge data center to be stored and processed (yellow background in fig. , representing a standard nb-iot deployment). at this point, the monitoring system has no control over the collected data, so modification, corruption, and losses may occur. conversely, in our dlt-enabled nb-iot setup, the uplink data generated by the ues is transmitted to a randomly chosen group of endorsing peers of hyperledger fabric, a type of dlt used in our study, as transaction proposals (see the purple background in fig. ). then, each of the peers signs the transaction using the elliptic curve digital signature algorithm (ecdsa) and adds the signature before returning the signed message back to the iot devices (ues). the peers that provide an endorsement of the ledger send an update to the application, but do not immediately apply the proposed update to their copy of the ledger. instead, a response is sent back to the iot device to confirm that the transaction proposal is correct, has not been previously submitted to the ledger, and has a valid signature. therefore, the security increases with the number of endorsing peers. in addition, smart contracts can be executed to update or query the ledger. the iot device then broadcasts the confirmed transaction proposals, along with the confirmation, to the ordering service. the received transactions are ordered chronologically to create blocks. these transaction blocks are delivered to all peers for validation. then, the peers append the block to the ledger, and the valid transactions are committed to the current state database. finally, a confirmation message is transmitted back to the iot devices to report that the submitted transaction has been immutably published to the ledger. the importance of this confirmation is analyzed in [ ] . ) results: we evaluate the performance of an integrated dlt and iot system with permissioned hyperledger fabric and nb-iot. our experimental setup is based on hyperledger fabric v . , nb-iot development kits sara evk n , and one nb-iot amarisoft enb station. we compare the e e latency of conventional nb-iot with dlt-based nb-iot in various scenarios. the number of endorsing peers is varied from to peers. the iot device transmits a packet every minute, and the total number of messages transmitted is around . the figure shows the end-to-end (e e) latency, which is computed from the total uplink and downlink transmission latency plus the validation latency in the ledger. we observe that adding a new endorsing node to the same channel increases the endorsement latency. in comparison with the conventional nb-iot system, the dlt-based nb-iot exhibits higher latency due to the ledger validation of transactions and the building of blocks. this is the trade-off between a dlt-based system and a conventional iot system: the dlt-based iot system guarantees trust, at the cost of slightly degraded latency. a federated global epidemiological warning system was proposed. based on it, a specialized use case focused on remote patient monitoring was considered. as one critical piece of this use case is the integration of dlts and iot devices, we implemented an experimental integration setup between dlts and nb-iot. in this study, we conclude that the latency increases when the number of endorsing peers is increased. results indicate that, on average, the latency increases by .
seconds per endorsed peer that is added. however, this increases the guarantees in terms of secrecy, privacy, and interoperability. this experimental setup also reveals additional technology challenges, such as the performance of dlt-based system with massive number of connections, as well as the trade-off between privacy, interoperability and system performance. although the proposed architecture is designed to be fair among all citizens, few important practical points still deserve attention. good quality data usually follow the wealth distribution over all scales of the proposed federation. in other words, good quality data is very likely to be less available in poorer regions of a city. the same is valid for poor regions within a country and poor countries. this existing gap needs to be considered when building incentives for participation, otherwise the proposed solution has the potential of reinforcing inequality [ ] . another current issue we cannot neglect is the public perception of wireless technologies. in this sense, part of the general public perceives iot as insecure [ ] and fifth generation ( g) as a health hazard [ ] to the point of claiming an astonishing causal relation between g and covid- , which has caused destruction of base stations across uk. beyond these conspiracy theories that are hard to combat, there exist legitimate concerns of anonymized data not being anonymous [ ] , and of novel surveillance techniques introduced in times of crisis that are maintained for monitoring (legally or illegally) populations, e.g. post- / surveillance in the usa [ ] . all these have to be carefully addressed from the early stages of the system development. furthermore, it has to be clear that neither technology nor data can prevent another outbreak on their own, but can only provide the extremely valuable tools to enable the holy grail of controlling epidemics: early detection, early response along all relevant actors within the federated organization. in other words, technology and data detect and identify potential harms and suggest actions and reactions, but the final diagnostic and further interventions are due to the responsible institutions within the federated structure and citizens' active participation. our proposal answers those challenges based on informational self-determination as the way to build trustworthy and secure public infrastructure that shall enable graceful lockdowns as advocated here. in this sense, the proposed solution introduces a more balanced management strategy, moving away from purely top-down approaches toward a participatory system where the citizens are active. we further expect that, even without being implemented, the high-level architecture introduced here could offer important technological suggestions to decisionmakers of how to start smoothly resuming activities after lockdowns. second wave covid- pandemics in europe: a temporal playbook digital technology and covid- on the responsible use of digital data to tackle the covid- pandemic special report: the simulations driving the world's response to covid- can china's covid- strategy work elsewhere optimization of privacy-utility trade-offs under informational self-determination will democracy survive big data and artificial intelligence? pandemics meet democracy. 
experimental evidence from the covid- crisis in spain the ethics of ai ethics: an evaluation of guidelines the global public health intelligence network and early warning outbreak detection gphin, goarn, gone" the role of the world health organization in global disease surveillance and response revitalizing the global public health intelligence network (gphin) intelligence and global health: assessing the role of open source and social media intelligence analysis in infectious disease outbreaks the effect of travel restrictions on the spread of the novel coronavirus presymptomatic transmission of sars-cov- -singapore pan-european privacy-preserving proximity tracing contact tracing mobile apps for covid- : privacy considerations and related trade-offs privacy in iot blockchains: with big data comes big responsibility make users own their data: a decentralized personal data store prototype based on ethereum and ipfs blockchain as a notarization service for data sharing with personal data store blockchain-based personal data management: from fiction to solution the future of privacy lies in forgetting the past privacypreserving mechanisms for crowdsensing: survey and research challenges characterizing the propagation of situational information in social media during covid- epidemic: a case study on weibo how the world's collective attention is being paid to a pandemic: covid- related -gram time series for languages on twitter event detection system based on user behavior changes in online social networks: case of the covid- pandemic twitter user geolocation using deep multiview learning mobile phone data and covid- : missing an opportunity?" arxiv e-prints quantifying sars-cov- transmission suggests epidemic control with digital contact tracing inferring person-to-person proximity using wifi signals practical secure aggregation for privacy-preserving machine learning engineering democratization in internet of things data analytics mining social interactions in privacy-preserving temporal networks ieee/acm international conference on advances in social networks analysis and mining (asonam) decentralized privacy preserving services for online social networks decentralized collective learning for self-managed sharing economies federated machine learning: concept and applications federated multi-task learning privacy preserving semi-supervised learning for labeled graphs semantic interoperability as key to iot platform federation fundamentals of clinical data science preparing data at the source to foster interoperability across rare disease resources improving semantic interoperability of big data for epidemiological surveillance are high-performing health systems resilient against the covid- epidemic? 
a pandemic influenza modeling and visualization tool participatory disease surveillance systems: ethical framework participatory surveillance based on crowdsourcing during the rio olympic games using the guardians of health platform: descriptive study applications of crowdsourcing in health: an overview hierarchical self-awareness and authority for scalable self-integrating systems trusted wireless monitoring based on distributed ledgers over nb-iot connectivity communication aspects of the integration of wireless iot devices with distributed ledger technology potential biases in machine learning algorithms using electronic health record data a longitudinal analysis of the public perception of the opportunities and challenges of the internet of things the g debate in new zealand-government actions and public perception anonymization and risk making sense from snowden: what's significant in the nsa surveillance revelations key: cord- -snkdgpym authors: ackermann, klaus; chernikov, alexey; anantharama, nandini; zaman, miethy; raschky, paul a title: object recognition for economic development from daytime satellite imagery date: - - journal: nan doi: nan sha: doc_id: cord_uid: snkdgpym reliable data about the stock of physical capital and infrastructure in developing countries is typically very scarce. this is particular a problem for data at the subnational level where existing data is often outdated, not consistently measured or coverage is incomplete. traditional data collection methods are time and labor-intensive costly, which often prohibits developing countries from collecting this type of data. this paper proposes a novel method to extract infrastructure features from high-resolution satellite images. we collected high-resolution satellite images for million km $times$ km grid cells covering african countries. we contribute to the growing body of literature in this area by training our machine learning algorithm on ground-truth data. we show that our approach strongly improves the predictive accuracy. our methodology can build the foundation to then predict subnational indicators of economic development for areas where this data is either missing or unreliable. the efficient allocation of limited governmental funds from local governments as well as international aid organizations crucially depends on reliable information about the level of socioeconomic indicators. these indicators (e.g. income, education, physical infrastructures, social class etc.) are critical inputs for addressing the socioeconomic issues for researchers and policy-makers alike. although data availability and quality for the developing countries has been improving in recent years, consistently measured and reliable data is still relatively scarce. numerous studies have documented specifically the problems of aggregate economic accounts, in particular to africa, where the data suffers from various conceptual problems, measurement biases, and other errors (e.g. chen and nordhaus ; johnson et al. ; jerven and johnston ) . researchers have probed into alternative options in the absence of reliable official statistics. among this newer generation of alternative economic data research, a burgeoning literature has emerged that uses satellite imagery of nighttime luminosity as a proxy for economic activity. work by sutton and costanza ( ) , elvidge et al. ( ) , chen and nordhaus ( ) , henderson, storeygard, and weil ( ) , sutton, elvidge, and tilottama ( ) and *contributed equally. work in progress. 
hodler and raschky ( ) documents a strong relationship between nighttime luminosity and gross domestic product (gdp) at the national and subnational levels. this allows researchers to generate information for any levels of regional analysis and also the likelihood of strategic, human manipulation is limited with satellite generated data. however, luminosity data as a proxy for economic activity is not free from concerns. satellite sensors have a lower detection bound and nighttime light emissions below this bound are not captured by the satellites' readings. this leads to bottom-coding problem and this is particularly an issue in low-output and low-density regions (chen and nordhaus ) , which are very often regions and countries (e.g. africa) where official macroeconomic data is missing or unreliable as well. over the past few decades, some parts of the african continent have witnessed large increases in economic development. nevertheless, the majority of regions within african nations still lacks behind. the continent faces further challenges due to localized conflicts (berman et al. ) , rapid urbanization (moyo et al. ) as well as the impacts of the covid- pandemic ), among others. a key pre-requisite in formulating adequate strategies to address these challenges, is reliable socioeconomic data at a spatially, granular level. as of now, even data about basic infrastructure such as roads and buildings is not consistently collected across the african continent. the purpose of this project is to overcome this data problem, by applying machine learning and artificial intelligence tool to a vast amount of unstructured data from daytime satellite imagery. ultimately, this project aims to go beyond the use of nightlight luminosity as a proxy for economic development data and use high resolution, daytime satellite imagery to predict key infrastructure variables at national and subnational levels for less developed countries like in africa. daytime images contain more information about the landscape that is correlated with economic activity, but the images are highly complex and unstructured, making the extraction of meaningful information from them rather difficult. our approach builds upon and further expands the work of (jean et al. a ). the standard approach in the literature is to learn a representation out of satellite images, that allow an interpretation of pixel activation that are important for predicting night time light or other target. this represen-tation is then used to predict an aggregated wealth index. instead, we directly predict infrastructure measures on the ground, albeit knowing that there is a wide spread scarcity of ground truth data. existing solutions for policy makers in developing countries often rely from traditional data gathering processes (i.e. surveys), which are costly and infrequent. given the high costs, this data does not cover an entire country but only a subsample of geographic units. our solution provides a lowcost method to collect valuable insights about economic development for every location in a country. our methodology provides relevant decision makers in developing countries as well as ngos and international organizations with very accurate counts of buildings and the length of roads for an entire country and continent. for example, accurate building counts and density can be used in natural hazard preparedness tools as an indicator for an areas vulnerability against natural disasters. 
information about roads and settlement helps infrastructure agencies to quickly identify areas that lack market access, a key determinant for economic growth in developing countries. although relatively new, recent studies have begun to use different daytime satellite images to conduct novel economic research (donaldson and storeygard ) . daytime images contain more information than night-time images and are thus a good alternative data source for empirical economics. marx, stoker, and suri ( ) used daytime images to analyse the effects of investment on housing quality in the slums of kibera, kenya. investment was calculated based on the age of a households roof. the results showed that ethnicity plays an important role in determining investment in housing and belonging to the same tribe as that of the local chief has a positive effect on household investment. (engstrom et al. ) used daytime satellite imagery and survey data to estimate the poverty rates of , km subnational areas in sri lanka. using a convolutional neural networks algorithm, they identify object features from raw images that were predictive of poverty estimates. the features examined by the study included built-up areas (buildings), cars, roof types, roads, railroads and different types of agriculture. the results showed that built-up areas, roads and roofing materials had strong effects on poverty rates. a suite of related work has used satellite images to predict population density (e.g. simonyan and zisserman ; doupe et al. ), urban sprawl (burchfield et al. ) , urban markets (baragwanath et al. ) electricity usage (robinson, hohmans, and dilkina ) , as well as income levels (pandey, agarwal, and krishna ) . more broadl, we also relate growing body of literature that uses other passively collected data to measure local economic activity (e.g. abelson, varshney, and sun ; blumenstock, cadamuro, and on ; chen and nordhaus ; henderson, storeygard, and weil ; hodler and raschky ) , methodologically, our paper contributes to the large remote-sensing literature that applies high-dimensional techniques to extract features from satellite imagery (e.g. jean et al. b jean et al. , yeh et al. ; ronneberger, fischer, and brox ) . in general, reliable data at a more granular spatial level is very scarce for the african continent. this poses a particular challenge if the researcher wants to apply machine learning tools that require some form of ground truth data. to overcome this problem, we accessed data from two open-data sources. the first one is open street map, a collaborative project allowing volunteers around the world to contribute georeferenced information in an open-source gis. we utilized http://download.geofabrik.de/ to retrieve a complete snapshot of all geo-located objects africa in . in general, osm coverage for africa is very sparse and often non-existent outside urban areas. our strategy to mitigate this issue, was to build an iterative procedure that would help us select areas ( × km) with good osm coverage. we were then able to convert the geometric osm data into an image mask. our image data was collected in via the google maps api following the exact procedure as in jean et al. ( b) . this data set has be used in various studies (e.g. jean et al. ; sheehan et al. ; uzkent et al. ; oshri et al. ) . again the same pattern as with osm data emerges, the image quality of these freely available african images is not as good as in other places around the world, see figure . 
in the absence of reliable ground truth data, we selected the architecture based on data that we could make look as if it came from our target domain. for buildings, we employ imagery collected by drones in africa from the "open cities ai challenge: segmenting buildings for disaster resilience", for which the corresponding ground truth data is provided; we re-scaled and blurred the drone imagery. for roads, we build a model to select images with almost complete masks, albeit with missing roads and errors. we benchmark our proposed methodology against the latest publication of poverty predictions in africa using their provided wealth index based on dhs cluster data (yeh et al. ). as is common in this literature, an index is created with a principal component analysis (pca) of survey responses. again, due to data limitations, research in this area has only performed in-sample validation. a true out-of-sample comparison would require a strict separation between the train and test set, something that is not possible if the pca is calculated over all data points across all countries, which inflates the prediction results. as such, (yeh et al. ) also provided an index that is based on within-country survey respondents. this enables us to benchmark against both indices. in principle, we follow the outline of the well-known u-net architecture for medical images (ronneberger, fischer, and brox ) and modify it for satellite images, creating a satellite-u-net (sat-unet). figure provides a general overview of our approach. the network contains layers in total, with major blocks of types: the convolution / down-sampling block, the intermediate convolution block, and the de-convolution / up-sampling block. the convolution block, shown in figure , consists of a batch normalization layer, two convolution layers with a kernel of ( , ), and a dropout layer. the dropout layer is not used in the first down-sampling block. the number of filters in the down-sampling blocks (encoder part) starts from and doubles in every following block, reaching in the intermediate convolution block, and then decreases in the up-sampling blocks (decoder) by a coefficient of . . the core difference from (ronneberger, fischer, and brox ) is that instead of up-sampling layers we use transposed convolution layers, which perform the reverse convolution operation (dumoulin and visin ) . in addition, we added drop-out layers after each convolution and de-convolution block. in a x image, the number of zero-class pixels (non-classified space) can be up to times higher than the number of pixels belonging to a house or a road (class one), creating a severe class imbalance. we address this issue with a hybrid loss function: we use as loss the sum of the binary cross entropy ( ) and a sørensen-dice term, combined, and as evaluation metric we use the intersection over union (the jaccard index). data pre-processing: to make the image input size of x compatible with a factor of , conforming to the shape reduction coefficients of the network, we added a padding of . on average across our images, the rgb colour maximum was around - out of the maximum of . color channel re-scaling was implemented to intensify colors before feeding the image into the network. for augmentation we used rotation by , and degrees.
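the hybrid loss and evaluation metric described above can be written compactly as follows. this is a plain-numpy sketch rather than the exact training code: the equal weighting of the binary cross entropy and dice terms and the smoothing constants are illustrative assumptions.

```python
import numpy as np

# Minimal numpy sketch of the hybrid loss described above: binary cross entropy
# plus a Soerensen-Dice term, with intersection-over-union (Jaccard) as metric.
# The equal weighting of the two terms and the smoothing constants are
# illustrative assumptions.

def dice_coefficient(pred, target, smooth=1e-6):
    intersection = np.sum(pred * target)
    return (2.0 * intersection + smooth) / (np.sum(pred) + np.sum(target) + smooth)

def bce(pred, target, eps=1e-7):
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def hybrid_loss(pred, target):
    return bce(pred, target) + (1.0 - dice_coefficient(pred, target))

def jaccard_index(pred, target, threshold=0.5, smooth=1e-6):
    p = (pred > threshold).astype(float)
    intersection = np.sum(p * target)
    union = np.sum(p) + np.sum(target) - intersection
    return (intersection + smooth) / (union + smooth)

pred = np.array([[0.9, 0.2], [0.8, 0.1]])   # predicted pixel probabilities
mask = np.array([[1.0, 0.0], [1.0, 0.0]])   # ground-truth mask
print(hybrid_loss(pred, mask), jaccard_index(pred, mask))
```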
the main difficulty in choosing the exact architecture for the road network was the lack of a sufficiently large amount of error-free ground truth data. therefore, we used the following iterative strategy: ( ) create an initial mask from osm data and train on it; ( ) filter out masks where the model predicts significantly more objects than the osm mask contains; ( ) retrain the model on the filtered data set. due to the large possible set of images to train from, around million, we first selected a subset based on osm data. as our main focus is to obtain an indicator of economic development, the best case would be to find areas of economic activity. osm has a classification for commercial buildings, which is rarely used ( / mil). we selected areas in the same adm regions as those images, in descending order of square meters occupied by buildings on a uniform grid, until we had selected a base set of masks. next, we trained our sat-unet model for roads, using all the masks we had created from osm as labels. the judge: for filtering purposes, a sat-unet based model, the judge, was created with an additional input for the osm mask. using transfer learning, the weights of the pre-trained sat-unet model were transferred to the bottom layers of the judge, which create a mask from the original image, while the top layers compute an index of validity using the created mask and the osm mask as inputs. combining everything in a single gpu model achieves a more than x increase in performance compared to a cpu-based technique. the index of validity compares the two masks pixel by pixel, where i, j index the pixel values of the predicted mask and the osm mask, respectively. the resulting filtering model decreased the dataset by approximately %, filtering out instances like those presented in figure . furthermore, as our network predictions had very low values caused by model uncertainty, we re-trained several sat-unet models on the selected images with different random seeds. this ensemble learning significantly increased our predictive performance, as shown in figure : the top three panels are predictions on the test set, while the last image is the combination of all three, with the pixels reduced to the skeleton for counting. for building recognition, we used the open cities ai challenge data set as the ground truth data set. this data set contains imagery of several african cities at an ultra-high resolution of up to cm per pixel. each city is split into square areas, and for each image there is a geojson file with vector data describing the contours of the buildings. from these geometric data, we created a contour layer and a centroid layer, which represents the center of every building structure. we down-scaled the images to a scale of . pixels/meter to match the google imagery. this high-quality ground-truth data further allowed us to experiment with different architectures. we replaced the encoder part of the sat-unet model with inception v (xia, xu, and nan ) and resnet (he et al. ) . in every instance we reset all weights to random before training and did not make use of any transfer learning. table presents the evaluation results on our test set. we also compared how well the networks perform in counting the correct houses based on the jaccard index, shown visually in figure . the performance of the incep -unet and resnet -unet was quite similar. for the final model selection, we trained both architectures on the previously selected masks from osm and compared their performance in terms of predictability on a test set.
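the seed-ensembling and skeleton-based counting step described above can be sketched with scikit-image as follows. this is an illustration only: the averaging rule, the threshold value, and the use of connected components for counting are assumptions, not the exact pipeline of the paper.

```python
import numpy as np
from skimage.morphology import skeletonize
from skimage.measure import label

# Sketch of the seed-ensembling and counting step described above: average the
# predicted masks from models trained with different random seeds, threshold,
# reduce the result to a one-pixel skeleton, and count connected components.
# The averaging rule and threshold value are illustrative assumptions.

def ensemble_mask(predictions, threshold=0.3):
    """predictions: list of 2-d arrays of per-pixel probabilities (one per seed)."""
    mean_pred = np.mean(np.stack(predictions), axis=0)
    return mean_pred > threshold

def count_segments(binary_mask):
    """Skeletonize the mask and count connected skeleton components."""
    skeleton = skeletonize(binary_mask)
    return label(skeleton, connectivity=2).max()

seeds_preds = [np.random.rand(64, 64) for _ in range(3)]  # stand-ins for model outputs
mask = ensemble_mask(seeds_preds)
print(count_segments(mask))
```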
table demonstrates the effect of different thresholds on the performance. the threshold is in color intensity units (range - ); tp is a true positive, counted when the predicted centroid is located inside the building contour of the mask; the pred-to-mask coefficient is the ratio between the predicted number of houses and the ground truth number; and the false positive rate (fp) is fp ← (totalpred − tp) / totalpred. ( ) figure : the filter removes incomplete osm masks. from left to right: original image, osm mask, prediction of the trained model at stage . figure : africa, building model stage, contour-in-contour evaluation. buildings that were predicted correctly are in blue; those not predicted are in orange. orange areas of random shape inside building blocks are usually courtyards and are not considered wrong predictions. a threshold of has the closest prediction-to-mask score and acceptable tp and fp rates; therefore, we picked this threshold for further modeling. the final model, resnet -unet, has layers. to avoid the vanishing gradient problem at depths of hundreds of layers, resnet uses skip connections: the input of a convolution block is added to its output. in addition, skip connections give the model the ability to learn the identity function, which guarantees similar performance of the lower and higher layers (he et al. ) . we compare our prediction results for buildings and roads to the latest benchmark study in the field of poverty prediction based on dhs data (yeh et al. ) . the dhs data is collected in various waves across countries and years. for the comparison, we use the most recent wave available for a country and the aggregated wealth index data. two indices are provided: the first, wealthpooled, is the pca calculated across all years, while the second, wealth, is calculated within each country. (yeh et al. ) did not provide any out-of-sample estimates; therefore, we fully replicated their method by obtaining all the images they used for their locations and trained their combined cnn model of multi-spectrum and nighttime light images to determine economic well-being in africa, with wealth as the label for the last of their training folds (d). the performance results are almost identical in terms of r-squared. in their study, they also found a high correlation of the measures to other types of aggregation, such as the sum total of all assets. using their wealthpooled predictions as a predictor for wealth, the r-squared is around . vs . for wealthpooled. the dhs location data from (yeh et al. ) has , unique cluster locations in the last wave of each respective country. we use a km radius, corresponding to the possible displacement of the survey measurements, as the selection criterion for our images. for every square km we predict the number of buildings and the number of roads, and calculate the nighttime light (elvidge et al. ) by grid cell. we then aggregate the roughly . million images into features by computing the sum, average, and quantiles by cluster across all input variables. in total, this leaves us with , locations. figure : ensembling: the first three are predicted masks based on different random seeds, the last one is the resulting mask. we perform loocv (leave-one-out cross-validation) by country, as
table and table present the results for out of sample and out of country predictions, respectively. as expected, using a normalized outcome measures across all samples, inflates the performance. in comparison to the previous literature, our predictions show an increased predictive performance, both in and out of sample. this paper introduces a novel and scalable method to predict road and housing infrastructure from daytime satellite imagery. compared to existing approaches, we achieve higher predictive performance by training a u-net style architecture using ground-truth data from a subset of images. using satellite images from african countries we show how our method can be used to generate very granular information about the stock of housing and road infrastructure for regions in the world, where reliable information about the local level of economic development is hardly available. consistently measured and comparable indicators about local economic development are crucial inputs for governments in developing countries as well as international organizations in their decision where to allocate scarce public funds and development aid. the predictions generated by our method can be directly included in existing decision support systems. for example, international organization such as the red cross are using similar data at the local level to evaluate an area's vulnerability against natural hazards. our data can be considered as more granular complements to existing measures of the local stock of physical infrastructure. numerous charitable organizations already rely on satellite imagery to identify districts of african countries that are among the least developed (e.g. abelson, varshney, and sun ) . our approach provides a low-cost and scalable alternative to identify areas that are in need. in addition, the open street map mapping community would benefit from our findings as well. the road prediction model could be used worldwide to help completing the road network or help narrowing down possible errors in the data. finally, our approach is an important methodological contribution to the large group of scholars from varying disciplines working in the area of poverty measurement. the majority of the existing research focuses on predicting poverty based on aggregate household wealth. this paper shows that predicting poverty measures can also be viewed as a simple high dimensional feature representation problem. our study is a proof-of-concept exercise to show that combining daytime satellite imagery, open source ground truth data and machine learning tools can translate unstructured image data into valuable insights about local economic development at an unprecedented scale. table : predictive performance of satellite predictions, r-squared based on loocv on out of country predictions by country using wealthpooled targeting direct cash transfers to the extremely poor detecting urban markets with satellite imagery: an application to india this mine is mine! how minerals fuel conflicts in africa predicting poverty and wealth from mobile phone metadata causes of sprawl: a portrait from space. 
the quarterly using luminosity data as a proxy for economic statistics the view from above: applications of satellite data in economics equitable development through deep learning: the case of subnational population density estimation a guide to convolution arithmetic for deep learning viirs night-time lights a global poverty map derived from satellite data evaluating the relationship between spatial and spectral features derived from high spatial resolution satellite data and urban poverty in colombo, sri lanka deep residual learning for image recognition measuring economic growth from outer space regional favoritism combining satellite imagery and machine learning to predict poverty combining satellite imagery and machine learning to predict poverty tile vec -unsupervised representation learning for spatially distributed data statistical tragedy in africa? evaluating the database for african economic development is newer better? penn world table revisions and their impact on growth estimates the economics of slums in the developing world african cities disrupting the urban future infrastructure quality assessment in africa using satellite imagery and deep learning multitask deep learning for predicting poverty from satellite images a deep learning approach for population estimation from satellite imagery u-net: convolutional networks for biomedical image segmentation predicting economic development using geolocated wikipedia articles very deep convolutional networks for large-scale image recognition estimation of gross domestic product at sub-national scales using nighttime satellite imagery global estimates of market and non-market values derived from nighttime satellite imagery, land cover, and ecosystem service valuation the nextgencities africa programme learning to interpret satellite images using wikipedia. ijcai inception-v for flower classification using publicly available satellite imagery and deep learning to understand economic well-being in africa key: cord- -o tiqel authors: breugel, floris van; kutz, j. nathan; brunton, bingni w. title: numerical differentiation of noisy data: a unifying multi-objective optimization framework date: - - journal: nan doi: nan sha: doc_id: cord_uid: o tiqel computing derivatives of noisy measurement data is ubiquitous in the physical, engineering, and biological sciences, and it is often a critical step in developing dynamic models or designing control. unfortunately, the mathematical formulation of numerical differentiation is typically ill-posed, and researchers often resort to an textit{ad hoc} process for choosing one of many computational methods and its parameters. in this work, we take a principled approach and propose a multi-objective optimization framework for choosing parameters that minimize a loss function to balance the faithfulness and smoothness of the derivative estimate. our framework has three significant advantages. first, the task of selecting multiple parameters is reduced to choosing a single hyper-parameter. second, where ground-truth data is unknown, we provide a heuristic for automatically selecting this hyper-parameter based on the power spectrum and temporal resolution of the data. third, the optimal value of the hyper-parameter is consistent across different differentiation methods, thus our approach unifies vastly different numerical differentiation methods and facilitates unbiased comparison of their results. 
finally, we provide an extensive open-source python library pynumdiff to facilitate easy application to diverse datasets (https://github.com/florisvb/pynumdiff). derivatives describe many meaningful characteristics of physical and biological systems, including spatial gradients and time rates-of-change. however, these critical quantities are often not directly measurable by sensors. although computing derivatives of analytic equations is straightforward, estimating derivatives from real sensor data remains a significant challenge because sensor data is invariably corrupted by noise [ ] . more accurate estimation of derivatives would improve our ability to produce robust diagnostics, formulate accurate forecasts, build dynamic or statistical models, implement control protocols, and inform policy making. there exists a large and diverse set of mathematical tools for estimating derivatives of noisy data, most of which are formulated as an ill-posed problem regularized by some appropriate smoothing constraints. however, the level and type of regularization are typically imposed in an ad hoc fashion, so that there is currently no consensus "best-method" for producing "best-fit" derivatives. one particularly impactful application of estimating derivatives is the use of time-series data in modeling complex dynamical systems. these models are of the form dx/dt = ẋ = f(x), where x is the state of the system. models of this kind have been integral to much of our understanding across science and engineering [ ] [ ] [ ] , including in classical mechanics [ ] , electromagnetism [ ] , quantum mechanics [ ] , chemical kinetics [ ] , ecology [ ] , epidemiology [ ] , and neuroscience [ ] [ ] [ ] . in some cases, even higher order time derivatives are also crucial for understanding the dynamics [ ] . a recent innovation in understanding complex dynamical systems uses data-driven modeling, where the underlying dynamics are learned directly from sensor data using a variety of modern methods [ ] [ ] [ ] . for this application in particular, a derivative with both small and unbiased errors is crucial for learning interpretable dynamics. in principle, the discrete derivative of position can be estimated as the finite difference between adjacent measurements. if we write the vector of all noiseless positions in time measured with timestep ∆t as x, then ẋ_k ≈ (x_{k+1} − x_k) / ∆t, ( ) where k indexes snapshots in time. in reality, however, we measure y = x + η, ( ) where η represents measurement noise. here we will assume η is zero-mean gaussian noise with unknown variance. even with noise of moderate amplitude, a naïve application of eq. ( ) produces derivative estimates that are far too noisy to be useful (fig. a) . thus, more sophisticated methods for data smoothing and/or differentiation of noisy time series measurements of position y are required. although smoothing mitigates the errors, it can also introduce biases. our goal in this paper is to develop a general approach for methodically choosing parameters that balance the need to minimize both error and bias. we use x̂ and ẋ̂ to denote the smoothed estimates of the position and its derivative computed from y, respectively. to evaluate the quality of these estimates, we compare them to the true discrete-time position and its derivative, x and ẋ. developing approaches for estimating ẋ̂ from noisy measurements y has been the focus of intense research for many decades.
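the point about noise is easy to see with a few lines of numpy. this sketch applies the naive finite-difference estimate to synthetic noisy measurements; the signal, noise level, and timestep are arbitrary illustrative choices.

```python
import numpy as np

# Naive finite-difference derivative of noisy measurements (the equations
# above): even modest measurement noise is amplified by roughly 1/dt.

dt = 0.01
t = np.arange(0, 4, dt)
x = np.sin(t)                                   # true (noiseless) position
y = x + np.random.normal(0, 0.05, size=t.size)  # measurements y = x + eta

xdot_true = np.cos(t)
xdot_naive = np.diff(y) / dt                    # finite difference of noisy data

print(np.std(xdot_naive - xdot_true[:-1]))      # error std ~ 0.05*sqrt(2)/dt, far above 1
print(np.std(xdot_true))                        # while the true derivative is O(1)
```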
despite the diversity of methods that have been developed, only a few studies have performed a comprehensive comparison of their performance on different types of problems [ , , ] . in this paper, we tackle the challenge of parameter selection by developing a novel, multi-objective optimization framework for choosing parameters to estimate the derivative of noisy data. our approach minimizes a loss function consisting of a weighted sum of two metrics computed from the derivative estimate: the faithfulness of the integral of the derivative and its smoothness. we suggest these metrics as proxies for minimizing the error and bias of the estimated derivative, and we show that sweeping through values of a single hyper-parameter γ produces derivative estimates that generally trace the pareto front of solutions that minimize error and bias. importantly, this optimization framework assumes no knowledge of the underlying true derivative and reduces the task of selecting many parameters of any differentiation algorithm to solving a loss function with a single hyperparameter. furthermore, we show that the value of the hyper-parameter is nearly universal across four different differentiation methods, making it possible to compare the results in a fair and unbiased way. for real-world applications, we provide a simple heuristic for automatically determining a value of γ that is derived from the power spectrum and temporal resolution of the data. all of the functionality described in this paper is implemented in an open-source python toolkit pynumdiff, which is found here: https://github.com/florisvb/pynumdiff. what is a "good" estimate of a derivative? let us start by considering a toy system with synthetic measurement noise, where we are able to evaluate the quality of an estimated derivative by comparing to the true, known derivative. we consider two metrics for evaluating the quality of a derivative (fig. b-d); later, we use these same metrics to evaluate the performance of our optimization framework, which does not have access to the ground truth. first, the most intuitive metric is how faithfully the estimated derivative ẋ̂ approximates the actual derivative ẋ. we can measure this using the root-mean-squared error, rmse = ‖ẋ̂ − ẋ‖₂ / √m, ( ) where ‖·‖₂ is the vector 2-norm and m is the number of time snapshots. if the data are very noisy, a small rmse can only be achieved by applying significant smoothing. however, smoothing the data often attenuates sharp peaks in the data and results in underestimating the magnitude of the derivative. to measure the degree to which the derivative estimate is biased due to underestimates of the actual derivative, we calculate the square of the pearson's correlation coefficient, r², between the errors (ẋ̂ − ẋ) and the actual derivative ẋ. we refer to this metric as the error correlation, which is bounded between 0 and 1. small error correlations imply that the imposed dynamics of the differentiation method (e.g. filtering) minimally influenced the derivative estimate; therefore, the method of estimating derivatives would have minimal impact on any models that are constructed using these estimates. conversely, large error correlations imply that the estimate is significantly influenced by the dynamics of the differentiation method and typically corresponds to very smooth estimates. in the limit where the derivative estimate is a horizontal line, the error correlation takes on a value of unity.
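the two evaluation metrics above can be computed in a few lines for the toy setting where the true derivative is known. this sketch uses numpy and scipy; the deliberately over-smoothed estimate in the example is an illustrative choice.

```python
import numpy as np
from scipy.stats import pearsonr

# The two evaluation metrics described above, for the toy setting where the
# true derivative is available.

def rmse(xdot_hat, xdot_true):
    return np.sqrt(np.mean((xdot_hat - xdot_true) ** 2))

def error_correlation(xdot_hat, xdot_true):
    """Squared Pearson correlation between the errors and the true derivative."""
    r, _ = pearsonr(xdot_hat - xdot_true, xdot_true)
    return r ** 2

# Example with a deliberately over-smoothed (attenuated) estimate
t = np.linspace(0, 4 * np.pi, 500)
xdot_true = np.cos(t)
xdot_hat = 0.7 * np.cos(t)        # smoothing has shrunk the peaks
print(rmse(xdot_hat, xdot_true), error_correlation(xdot_hat, xdot_true))
# error_correlation is ~1 here: the error is proportional to the true derivative
```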
other metrics that measure the smoothness, for example the total variation or tortuosity, may be substituted for error correlation [ ] ; however, these metrics are harder to interpret. for instance, if the true derivative is very smooth, a low total variation is desired, whereas if the true derivative is quite variable, a high total variation would correspond to an accurate derivative. in contrast, a low error correlation is desirable for any true derivative. for many datasets, the rmse and error correlation metrics define a pareto front, where no single parameter choice minimizes both values (fig. b) . furthermore, the minimal rmse can be achieved with a variety of different error correlations. the most suitable parameter set depends on the application of the estimated derivative: is a non-smooth derivative with minimal bias preferred (fig. b-d: teal), or one that is smooth, but biased (fig. b-d: brown)? we suggest that, for most purposes, the estimated derivative that balances these metrics is preferable. a large variety of methods for numerical differentiation exist, and a complete review of them all is beyond the scope of this paper. instead, we have selected four differentiation methods (table ) , which make different assumptions and represent different approaches to computing the derivative including both global and local methods [ ] , to showcase the universal application of our optimization framework. one common approach to manage noisy data is to apply a smoothing filter to the data itself, followed by a finite difference calculation. in this family of differentiation methods, we chose to highlight the butterworth filter [ ] , which is a global spectral method with two parameters: filter order and frequency cutoff. the second family of methods relies instead on building a local model of the data through linear regression. a common and effective approach involves making a sliding polynomial fit of the data [ ] , often referred to as locally estimated scatterplot smoothing (loess) [ ] . an efficient approach for accomplishing the same calculations is the savitzky-golay filter, which builds the polynomial model in the frequency domain [ , ] . the savitzky-golay filter has two parameters: window size and polynomial order. by default, a savitzky-golay filter provides a jagged derivative because the polynomial models can change from one window to the next, so here we also apply some smoothing by convolving the result with a gaussian kernel. this smoothing adds a third parameter: a smoothing window size. the third family we consider is the kalman filter [ ] [ ] [ ] . the kalman filter is most effective when models of the system and of the noise characteristics are known. our focus here is the case where neither is known, so we chose to highlight a constant acceleration forward-backward kalman smoother [ ] with two parameters: the model and noise covariances. finally, we consider an optimization approach to computing derivatives with the total variation regularization (tvr) method [ , ] . one advantage of the tvr methods is that there is only a single parameter, which corresponds to the smoothness of the derivative estimate. tvr derivatives are not as widely used as the other three methods we highlight, so we provide a brief overview here. solving for the tvr derivative involves first finding x̂ and its corresponding finite-difference derivative ẋ̂ (calculated according to eq. ( )) that minimize the following loss function: l(x̂) = ‖x̂ − y‖₂² + γ tv(ẋ̂), ( ) where tv is the total variation of the derivative estimate, tv(ẋ̂) = Σ_k |ẋ̂_{k+1} − ẋ̂_k| = ‖∆ẋ̂‖₁, ( ) ‖·‖₁ denotes the 1-norm, ∆ is the first-difference operator, and m is the number of time snapshots in the data. the single parameter for this method is γ, and larger values result in smoother derivatives. if γ is zero, this formulation reduces to a finite difference derivative. solutions for the tvr derivative ẋ̂ can be found with an iterative solver [ ] . because both components of the loss function eq. ( ) are convex, we can also solve for ẋ̂ using convex optimization tools, such as cvxpy [ ] , and with a convex solver, such as mosek [ ] . the two methods are equivalent, if the iterative solver is repeated sufficiently many times. the convex solution to penalizing the first order difference in time, as in eq. ( ), results in a piece-wise constant derivative estimate. by offloading the calculations to a convex optimization solver, however, we can easily penalize higher order derivatives by replacing the 1st order finite difference derivative ẋ̂ in eq. ( ) with a 2nd order (ẍ̂) or 3rd order finite difference derivative. penalizing higher-order time derivatives results in smoother derivative estimates. for example, penalizing the 2nd order derivative results in a piece-wise linear derivative estimate, whereas penalizing the 3rd order derivative, also known as the jerk, results in a smooth estimate. in this paper, we will use the total variation regularized on the jerk (tvrj). for large datasets, solving for the tvrj derivative is both computationally expensive and can accumulate small errors. to manage the size of the optimization problem, we solve for the tvrj derivative in sliding windows. with noisy data collected in the real world, no ground truth is accessible. the rmse and error correlation metrics described in the previous section cannot be calculated and used to optimize parameter choices, so the parameter selection is an ill-posed problem. even so, somehow, parameters must be chosen. in this section, we propose a general approach for choosing parameters and show that for a wide range of problems, noise levels, time resolutions, and methods, our approach yields reasonable derivative estimates without the need for hyperparameter tuning.
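the tvr formulation described above maps directly onto a convex program. the sketch below uses cvxpy, which the text mentions as one way to solve it; the normalization of the two terms, the value of γ, and the commented-out jerk-penalized variant are illustrative assumptions rather than the exact setup of the paper.

```python
import numpy as np
import cvxpy as cp

# Minimal convex-optimization sketch of the TVR formulation above. The exact
# normalization and gamma value are illustrative; penalizing a higher-order
# difference (k=3 below, commented) gives the smoother "jerk"-regularized
# (TVRJ) variant.

dt, gamma = 0.01, 0.1
t = np.arange(0, 2, dt)
y = np.sin(t) + np.random.normal(0, 0.05, t.size)   # noisy measurements

x_hat = cp.Variable(t.size)                 # smoothed position estimate
x_dot = cp.diff(x_hat) / dt                 # its finite-difference derivative

data_fidelity = cp.sum_squares(x_hat - y)
smoothness = cp.norm1(cp.diff(x_dot))       # total variation of the derivative
# smoothness = cp.norm1(cp.diff(x_hat, k=3))  # alternative: penalize the jerk (TVRJ)

problem = cp.Problem(cp.Minimize(data_fidelity + gamma * smoothness))
problem.solve()                             # default solver; MOSEK also works
xdot_tvr = np.diff(x_hat.value) / dt        # piece-wise constant derivative estimate
```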
) that minimize the following loss function: $\mathcal{L}(\hat{x}, \hat{\dot{x}}) = \|\hat{x} - y\|_2 + \gamma \, tv(\hat{\dot{x}})$. here tv is the total variation of the derivative estimate, $tv(\hat{\dot{x}}) = \frac{1}{m}\|\mathrm{diff}(\hat{\dot{x}})\|_1$, where $\|\cdot\|_1$ denotes the $\ell_1$ norm and m is the number of time snapshots in the data. the single parameter for this method is γ, and larger values result in smoother derivatives. if γ is zero, this formulation reduces to a finite difference derivative. solutions for the tvr derivative $\hat{\dot{x}}$ can be found with an iterative solver [ ]. because both components of the loss function eq. ( ) are convex, we can also solve for $\hat{\dot{x}}$ using convex optimization tools, such as cvxpy [ ], and with a convex solver, such as mosek [ ]. the two methods are equivalent, if the iterative solver is repeated sufficiently many times. the convex solution to penalizing the first-order difference in time, as in eq. ( ), results in a piece-wise constant derivative estimate. by offloading the calculations to a convex optimization solver, however, we can easily penalize higher order derivatives by replacing the 1st-order finite difference derivative $\hat{\dot{x}}$ in eq. ( ) with a 2nd-order ($\hat{\ddot{x}}$) or 3rd-order ($\hat{\dddot{x}}$) finite difference derivative. penalizing higher-order time derivatives results in smoother derivative estimates. for example, penalizing the 2nd-order derivative results in a piece-wise linear derivative estimate, whereas penalizing the 3rd-order derivative, also known as the jerk, results in a smooth estimate. in this paper, we will use the total variation regularized on the jerk (tvrj). for large datasets, solving for the tvrj derivative is both computationally expensive and can accumulate small errors. to manage the size of the optimization problem, we solve for the tvrj derivative in sliding windows. with noisy data collected in the real world, no ground truth is accessible. the rmse and error correlation metrics described in the previous section cannot be calculated and used to optimize parameter choices, so the parameter selection is an ill-posed problem. even so, parameters must somehow be chosen. in this section, we propose a general approach for choosing parameters and show that for a wide range of problems, noise levels, time resolutions, and methods, our approach yields reasonable derivative estimates without the need for hyperparameter tuning. given noisy position measurements y, we seek to estimate the derivative in time of the dynamical system that underlies the measurements, $\hat{\dot{x}}$. when the ground truth $\dot{x}$ is unknown, we propose choosing the set of parameters Φ (for any given numerical algorithm, including those enumerated in table ) that minimize the following loss function, which is inspired by eq. ( ): $\mathcal{L}(\Phi) = \|\,\mathrm{trapz}(\hat{\dot{x}}) + \mu - y\,\|_2 + \gamma \, tv(\hat{\dot{x}})$, where trapz(·) is the discrete-time trapezoidal numerical integral, µ resolves the unknown integration constant, and γ is a hyper-parameter. note that this formulation has a single hyper-parameter γ, and a heuristic for choosing γ is introduced in the following section. the first term of the loss function in eq. ( ) promotes faithfulness of the derivative estimate by ensuring that the integral of the derivative estimate remains similar to the data, whereas the second term encourages smoothness of the derivative estimate. if γ is zero, the loss function simply returns the finite difference derivative. larger values of γ will result in a smoother derivative estimate. this loss function effectively reduces the set of parameters Φ (which ranges between one and three or more, depending on the method) to a single hyper-parameter γ. unfortunately, $\mathcal{L}$ is not convex, but tractable optimization routines can be used to solve for the set of Φ that minimize $\mathcal{L}$.
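the proxy loss can be written down in a few lines; the sketch below follows the verbal description above (rmse between the data and the integral of the derivative estimate, plus γ times its total variation), but the exact normalization of the two terms used by the authors may differ.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

def proxy_loss(dxdt_hat, y, dt, gamma):
    # integrate the derivative estimate back to a position estimate
    x_rec = cumulative_trapezoid(dxdt_hat, dx=dt, initial=0.0)
    # mu resolves the unknown integration constant (least-squares offset)
    mu = np.mean(y - x_rec)
    # first term: faithfulness of the integral of the derivative to the data
    faithfulness = np.sqrt(np.mean((x_rec + mu - y) ** 2))
    # second term: total variation of the derivative estimate, a smoothness proxy
    smoothness = np.sum(np.abs(np.diff(dxdt_hat))) / len(dxdt_hat)
    return faithfulness + gamma * smoothness
```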
here we use the nelder-mead method [ ], a derivative-free simplex algorithm, as implemented in scipy [ ], with multiple initial conditions. the advantages of our loss function in eq. ( ) are that it does not require any ground truth data, and it simplifies the process of choosing parameters by reducing all the parameters associated with any given method for differentiation to a single hyper-parameter γ corresponding to how smooth the resulting derivative should be. to understand the qualities of the derivative estimates resulting from parameters selected by our loss function, we begin by analyzing the derivative estimates of noisy sinusoidal curves using the savitzky-golay filter and return to our original metrics, rmse and error correlation, to evaluate the results. interestingly, sweeping through values of γ results in derivative estimates with rmse and error correlation values that generally follow the pareto front defined by all possible derivative estimates for that given method (fig. a). which of these derivative estimates is best depends on the intended use of the derivative; nevertheless, we suggest that a good general-purpose derivative is one that corresponds with the elbow in the lower left corner of the stereotypical curve traced by a sweep of γ in the rmse vs. error correlation space (the star-shaped markers in fig. a). this point often, but not always, corresponds to the lowest rmse (for example, see fig. ). although in many cases a quantitatively better derivative estimate than the one found by our loss function does exist (the gray dots in fig. a that lie left of the star), the qualitative differences between these two derivative estimates are generally small (fig. a, middle row). in practice, the need to choose even a single parameter can be time consuming and arbitrary. to alleviate these issues, we derive an empirical heuristic to guide the choice of γ that corresponds with the elbow of the pareto front. we found that the best choice of γ is dependent on the frequency content of the data. to characterize this relationship, we evaluated the performance of derivative estimates achieved by a savitzky-golay filter by sweeping through different values of γ for a suite of sinusoidal data with various frequencies (f), noise levels (additive white (zero-mean) gaussian noise with variance σ²), temporal resolutions (∆t), and dataset lengths (in time steps, l) (fig. a-b). to describe this empirical relationship between the optimal choice of γ and quantitative features of the data, we first considered an all-inclusive multivariate log-linear model, log(γ) = α₁ log(f) + α₂ log(∆t) + α₃ log(σ) + α₄ log(l) + α₀. ( ) fitting the data (fig. b triangles) to this model with ordinary least squares resulted in an r² = . , suggesting that, in many cases, it is feasible to automatically determine a reasonable guess for γ. table provides the coefficients (αₖ) and associated p-values for each of the four terms and intercept. from this analysis we can conclude that the magnitude of measurement noise in the data is not an important predictor of γ. we note, however, that here we have assumed that the magnitude of noise does not change within a time-series dataset. eliminating the unnecessary terms from our model results in slightly adjusted coefficients, provided in table . in short, the optimal choice of γ, assuming that both low rmse and low error correlation are valued, can be found according to the following relationship: log(γ) = − . log(f) − . log(∆t) − . .
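a simplified version of this optimization step, using the proxy loss from the previous sketch and tuning only the window length of a savitzky-golay derivative with a fixed polynomial order, could look as follows; the initial window guesses are arbitrary and the data is assumed to be longer than the largest window tried.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.signal import savgol_filter

def savgol_derivative(y, dt, window, polyorder=3):
    # round the continuously relaxed window length to a valid odd integer > polyorder
    window = max(int(round(window)), polyorder + 2)
    if window % 2 == 0:
        window += 1
    return savgol_filter(y, window, polyorder, deriv=1, delta=dt)

def fit_window(y, dt, gamma, polyorder=3):
    # minimize the proxy loss over the window length with nelder-mead,
    # restarting from several initial guesses to reduce the risk of local minima
    def objective(params):
        return proxy_loss(savgol_derivative(y, dt, params[0], polyorder), y, dt, gamma)
    results = [minimize(objective, x0=[w0], method="Nelder-Mead") for w0 in (11, 51, 201)]
    return int(round(min(results, key=lambda r: r.fun).x[0]))
```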
we analyze the performance of our loss function and heuristic with respect to a broad suite of representative synthetic problems. real-world data takes on a much greater diversity of shapes than the sinusoidal time series we used to derive the heuristic for choosing γ given in eq. ( ). because it is difficult to define a clear quantitative description of the range of shapes that real data might take on (such as frequency for a sinusoidal function), we first examine differentiating one component of a lorenz system (fig. ). from the power spectra, we select a frequency corresponding to the frequency where the power begins to decrease and the noise of the spectra increases. although somewhat arbitrary, this approach (in conjunction with eq. ( )) allows us to use a standard signal processing tool to quickly determine a choice of γ. our method produces reliable derivatives without further tuning in each case except high noise and low temporal resolution (fig. , fourth row), which is not surprising considering the low quality of the data.
(figure caption: also included in the plot, but not indicated, are different noise levels ( -mean normally distributed with standard deviations of . %, . %, %, and % of the amplitude) and lengths of the dataset ( , , , , , , sec). the "+" markers indicate results from datasets for which the period was greater than the length of the time series, which were omitted from the fit. the diagonal lines indicate the empirical heuristics for choosing γ based on a multivariate ordinary least squares model, provided in eqn. and table .)
next we consider four other synthetic problems, all with similarly effective results (fig. ). for the logistic growth problem, the curve traced by our loss function takes on a more complicated shape, perhaps because the characteristics of the data vary substantially across time. still, our heuristic results in a good choice of parameters that correspond to an accurate derivative. for the triangle wave, the loss function does a good job of tracing the pareto front, and the heuristic selects an appropriate value of γ, yet the resulting derivative does show significant errors. this is likely due to two reasons. first, the savitzky-golay filter is designed to produce a smooth derivative, rather than a piece-wise constant one. second, the frequency content of the data varies between two extremes, near zero and near infinity. for the sum of sines problem, selecting the appropriate frequency cutoff is more straightforward than the previous problems, as we can simply choose a frequency shortly after the high frequency spike in the spectra. the final problem is a time-series resulting from a simulated dynamical system controlled by a proportional-integral controller subject to periodic disturbances. this data is a challenging problem for numerical differentiation, as the position data almost appears to be a straight line but does contain small variations. our loss function does an excellent job of tracing the pareto front in this case, and our heuristic results in an appropriate choice of γ. we examine how our loss function and heuristic for choosing γ might perform on other differentiation methods beyond the savitzky-golay filter. figure shows that for a noisy lorenz system, the possible solution space is similar for all four methods we highlighted earlier, and our loss function achieves a similar pareto front in each case.
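the spectrum-based choice of γ can be sketched like this; the power threshold used to locate the drop-off is an arbitrary stand-in for the visual inspection described above, and the heuristic coefficients are placeholders rather than the fitted values reported in the paper's equation and tables.

```python
import numpy as np
from scipy.signal import periodogram

def estimate_cutoff_frequency(y, dt, drop_fraction=0.01):
    # pick the highest frequency whose power still exceeds a small fraction of the
    # peak power, as a crude proxy for "where the power begins to decrease"
    freqs, power = periodogram(y, fs=1.0 / dt)
    peak = power[1:].max()
    above = freqs[1:][power[1:] > drop_fraction * peak]
    return above.max() if above.size else freqs[1]

def heuristic_gamma(cutoff_freq, dt, a=-1.6, b=-0.7, c=-5.1):
    # log-linear heuristic log(gamma) = a*log(f) + b*log(dt) + c; the coefficients
    # here are illustrative placeholders, NOT the fitted values from the paper
    return float(np.exp(a * np.log(cutoff_freq) + b * np.log(dt) + c))
```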
note that although the savitzky-golay and butterworth filters both operate in the frequency domain, the kalman smoother and tvrj methods do not. interestingly, for all four differentiation methods, the possible solutions (the gray dots), and in particular their pareto front, are quite similar, with the exception of the tvrj method. this deviation may be because the tvrj method only contains a single parameter. our loss function, which defines the colored curves in the rmse vs error correlation space, results in similar curves for each method, each of which follows the pareto front quite closely. although there are some differences in the location along the pareto front that our heuristic selects as the optimal choice for each method, the resulting derivative estimates are qualitatively quite similar. a close comparison of the curves defined by the loss function, and the points selected by the heuristic, suggests that the kalman and tvrj methods produce slightly more accurate derivative estimates with a lower error correlation.
figure : heuristic for choosing γ is effective across a broad range of toy problems, using a savitzky-golay filter. the first column shows raw (synthetic) position data, indicating the shape of the data, degree of noise, and temporal resolution. next, we evaluate the performance of the derivative estimate using the metrics described in fig. . gray dots indicate the range of outcomes for , parameter choices, the violet line indicates the options provided by our loss function, and the red star indicates the performance using the automatically selected value of γ according to eqn. . the frequency of the data is evaluated by inspecting the power spectra; the red line indicates the frequency used to determine γ. the final two columns compare the ground truth and estimates for position and velocity.
figure : loss function and heuristic for choosing γ are equally effective for different differentiation methods. a. synthetic noisy data from the same lorenz system as shown in fig. . b. comparison of metrics, position, and velocity estimates using four differentiation methods, with the same value of γ, as determined through the spectral analysis in fig. . c. overlay of the pareto fronts and velocities for all four methods.
however, looking at the resulting derivatives, we see that in the regions where the derivative estimates have high errors, all four estimates exhibit similar errors, suggesting that these errors may be a result of the data, not the method. these results suggest that our optimization framework is universal across different methods, a claim further supported by its performance across a range of synthetic problems (fig. ). the most significant result of this analysis is that all four methods, despite being very different in their underlying mathematics, behave similarly under both our loss function and heuristic for choosing γ across a wide range of data. even in the case where they disagree on a quantitative level (second row, low temporal resolution lorenz data), and the savitzky-golay filter appears to provide the estimate with the lowest error correlation, the resulting derivative estimates are in fact qualitatively quite similar. taking a closer look at the errors in the derivative estimates across the range of toy problems shown in fig. reveals a subtle point about the limitations of the differentiation methods we highlight here.
for all four methods, the errors in the derivative estimates are largest for the triangle problem, and to a lesser extent the proportional-integral control problem. these errors likely stem from two particular challenges. first, the frequency content of the data is very heterogeneous: it is near zero between the peaks and valleys, and near infinite at the peaks and valleys. furthermore, the frequency of the oscillations for the triangle increases with time. second, all four of the methods we highlighted here are designed to provide smooth derivatives, whereas the true derivative for the triangle problem is piece-wise constant. if this were known from the outset, it might be more effective to choose a method that is designed to return piece-wise constant derivatives, such as the total variation regularized on the 1st derivative. the real value of our multi-objective optimization framework is its straightforward application to real, noisy data where no ground truth data is available. here we provide two such examples: differentiation of the new confirmed daily cases of covid- , the disease caused by sars-cov- , in the united states (fig. ), and differentiation of gyroscope data from a downhill ski (fig. ). in both examples, we examine the power spectra of the data to choose a cutoff frequency that corresponds to the start of the dropoff in power. this cutoff frequency, in conjunction with the time resolution of the data, is then used as input to our heuristic described by eq. ( ) to determine an optimal value of γ. with γ chosen, we minimize our loss function from eq. ( ) to find the optimal parameters for numerical differentiation. the year has seen a dramatic growth of the prevalence of a novel coronavirus, sars-cov- , which causes the disease known as covid- . estimating and understanding the rate of increase of disease incidence is important for guiding appropriate epidemiological, health, and economic policies. in the raw data for the new confirmed daily cases of covid- ([ ], https://github.com/cssegisanddata/covid- ; fig. a), there is a clear oscillation with a period of one week, most likely due to interruptions in testing and reporting during weekends. as such, we selected a lower cutoff frequency of months, corresponding to the beginning of the steep drop-off in the power spectra (fig. b). if the weekly oscillations were important, one could just as easily select a cutoff frequency of /week. our heuristic for choosing γ was based on sinusoidal data with a limited domain of time resolutions ranging from . to . seconds, so we scaled the time step units of the covid- data to be close to this range, using dt = day, rather than , seconds. our chosen cutoff frequency yielded a value of γ = . . using this same value of γ for each of the four differentiation methods under consideration resulted in very similar smoothed daily case estimates and derivatives, except during the final weeks (fig. c). this highlights an important application of our method, which facilitates easy and fair comparison between different smoothing methods. where these methods disagree, it is clear that none of the estimates can be trusted. a more subtle difference between the methods is that the butterworth filter appears to preserve a larger remnant of the weekly oscillations seen in the raw data. finally, we consider angular velocity data collected from a gyroscope attached to a downhill ski over one minute of descent (fig. a) (icm- , sparkfun; wildcat ski, moment skis).
this type of data is representative of kinematic data that might be collected during experiments with robots or animals, which might be used to construct data-driven models of their dynamics [ ]. from the power spectrum, we chose a cutoff frequency of . hz (fig. b). this selection together with the time resolution of . seconds yielded an optimal value of γ = . using our heuristic. we calculated the smoothed angular velocity and acceleration estimates using a savitzky-golay filter (fig. d-f). the other methods showed similar results (not shown for visual clarity), though the total variation method is not recommended for large datasets like this one due to the compounding computational costs. in summary, this paper develops a principled multi-objective optimization framework to provide clear guidance for solving the ill-posed problem of numerical differentiation of noisy data, with a particular focus on parameter selection. we define two independent metrics for quantifying the quality of a numerical derivative estimate of noisy data: the rmse and error correlation. unfortunately, neither metric can be evaluated without access to ground truth data. instead, we show that the total variation of the derivative estimate, and the rmse of its integral, serve as effective proxies. we then introduce a novel loss function that balances these two proxies, reducing the number of parameters that must be chosen for any given numerical differentiation method to a single universal hyperparameter, which we call γ. importantly, the derivative estimates resulting from a sweep of γ lie close to the pareto front of all possible solutions with respect to the true metrics of interest. although different applications may require different values of γ to produce smoother or less biased derivative estimates, we derive an empirical heuristic for determining a general-purpose starting point for γ given two features that can easily be determined from time-series data: the cutoff frequency and time step. our method also makes it possible to objectively compare the outputs for different methods. we found that for each problem that we tried, the four differentiation methods we explored in depth, including both local and global methods, all produce qualitatively similar results. in our loss function we chose to use the rmse of the integral of the derivative estimate and the total variation of the derivative estimate as our metrics. however, our loss function can be extended to a more general form, $\mathcal{L} = m_1(x, \dot{x}) + \gamma_1 m_2(x, \dot{x}) + \cdots + \gamma_{p-1} m_p(x, \dot{x})$, ( ) where $m_1, m_2, \cdots, m_p$ represent p different metrics that could be used, balanced by p − 1 hyper-parameters. alternative metrics include, for example, the tortuosity of the derivative estimate, the error correlation between the data and the integral of the derivative estimate, or a metric describing the distribution of the error between the data and the integral of the derivative estimate. depending on the qualities of the data and the specific application, different sets of metrics may be suitable as terms in the loss function. our loss function makes three important assumptions that future work may aim to relax. the first is that we assume the data has consistent zero-mean gaussian measurement noise. how sensitive the loss function and heuristic are to outliers and other noise distributions remains an open question. it is possible that once we include other noise models, we will find differences in the behavior of differentiation methods.
the second major limitation is that our loss function finds a single set of parameters for a given time series. for data where the frequency content dramatically shifts over time, it may be better to use time-varying parameters. presently, this is limited by our current implementation, which relies on a computationally expensive optimization step. future efforts may focus on ways to improve the efficiency of these calculations. finally, we have focused on single-dimensional time-series data. in principle, our proposed loss function can be used with multi-dimensional data, such as 2- and 3-dimensional spatial data, with only minor modifications. by simplifying the process of parameter selection for numerical differentiation to the selection of a single hyperparameter, our approach makes it feasible to directly compare the performance of different methods within a given application. one particular application of interest is that of data-driven model discovery. methods such as sparse identification of nonlinear dynamics (sindy) [ ], for example, rely directly on numerical derivative estimates, and the characteristics of these estimates can have an important impact on the resulting models. using our method, it is now tractable to systematically investigate the collection of data-driven models learned from estimated derivatives of different smoothness and explore their impact on the resulting models.
figure : numerical differentiation of noisy gyroscope data from a downhill ski during one ski run, with no parameter tuning. a. data from one axis of a gyroscope attached to the center of a downhill ski. b. power spectra of the data, indicating the cutoff frequency (red) used for selecting γ = . . c. zoomed-in section of the data from a, which was used to optimize parameter selection. d. smoothed angular velocities and angular accelerations, calculated using a savitzky-golay filter and the optimal parameters determined using our heuristic and loss function. e-f. zoomed-in sections from d.
table : optimal log(γ) is correlated with frequency and temporal resolution, but not the noise or length of the dataset. the table provides the coefficients and associated p-values for an ordinary least squares model, with an adjusted r² = . .
variable    coeff    p-value
intercept    - .
log(freq)    - .
log(dt)    - .
log(noise)    .    .
log(length)    .    .
table : optimal log(γ) can be determined based on the frequency and temporal resolution of the data. the table provides the coefficients and associated p-values for an ordinary least squares model, with an adjusted r² = . .
variable    coeff    p-value
intercept    - .
log(freq)    - .
log(dt)    - .
numerical differentiation of experimental data: local versus global methods
mathematics applied to deterministic problems in the natural sciences
classical mechanics
classical electrodynamics
introduction to quantum mechanics
chemical kinetics and catalysis
elements of mathematical ecology
modern epidemiology
a comparative approach to closed-loop computation
the synergy between neuroscience and control theory: the nervous system as inspiration for hard control challenges
generalized local linear approximation of derivatives from time series
yank: the time derivative of force is an important biomechanical variable in sensorimotor systems
distilling free-form natural laws from experimental data
discovering governing equations from data by sparse identification of nonlinear dynamical systems
automated adaptive inference of phenomenological dynamical models
estimating velocities and accelerations of animal locomotion: a simulation experiment comparing numerical differentiation algorithms
analysis of the three-dimensional trajectories of organisms: estimates of velocity, curvature and torsion from positional information
experimental wireless and the wireless engineer
meshless methods: an overview and recent developments
regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis
what is a savitzky-golay filter?
smoothing and differentiation of data by simplified least squares procedures
a new approach to linear filtering and prediction problems
fundamentals of kalman filtering: a practical approach
generalized kalman smoothing: modeling and algorithms
optimal estimation of dynamic systems
nonlinear total variation based noise removal algorithms
numerical differentiation of noisy, nonsmooth data
cvxpy: a python-embedded modeling language for convex optimization
the mosek optimization api for python
a simplex method for function minimization
an interactive web-based dashboard to track covid- in real time
anipose: a toolkit for robust markerless 3d pose estimation
we are grateful for helpful discussions with steve brunton and pierre karashchuk. fvb acknowledges support from a moore/sloan data science and washington research foundation innovation in data science postdoctoral fellowship, a sackler scholarship in biophysics, and the national institute of general medical sciences of the national institutes of health (p gm ). jnk acknowledges support from the air force office of scientific research (fa - - - ). bwb acknowledges support from the washington research foundation, the air force office of scientific research (fa - - - ), and the national institute of health ( r mh ). key: cord- -ccc wzne title: mass surveillance in the age of covid- date: - - journal: j law biosci doi: . /jlb/lsaa sha: doc_id: cord_uid: ccc wzne epidemiological surveillance programs such as digital contact tracing have been touted as a silver bullet that will free the american public from the strictures of social distancing, enabling a return to school, work, and socializing. this article assesses whether and under what circumstances the united states ought to embrace such programs.
part i analyzes the constitutionality of programs like digital contact tracing, arguing that the fourth amendment's protection against unreasonable searches and seizures may well regulate the use of location data for epidemiological purposes, but that the legislative and executive branches have significant latitude to develop these programs within the broad constraints of the ``special needs'' doctrine elaborated by the courts in parallel circumstances. part ii cautions that the absence of a firm warrant requirement for digital contact tracing should not serve as a green light for unregulated and mass digital location tracking. in light of substantial risks to privacy, policy makers must ask hard questions about efficacy and the comparative advantages of location tracking versus more traditional means of controlling epidemic contagions, take seriously threats to privacy, tailor programs parsimoniously, establish clear metrics for determining success, and set clear plans for decommissioning surveillance programs. interview infected individuals to learn about their activities and the people they encountered after becoming ill, and then monitor those contacts for illness. according to a recent report from the johns hopkins center for health security, more than , contact tracers may be needed nationally to grapple with the covid- pandemic. massachusetts alone "plans to hire and train roughly , people to do contact tracing." other jurisdictions have trained their sights on cell phone data as a new, potentially more powerful, efficient, and accurate tool for contact tracing. with china, south korea, singapore, israel, and others as examples, data brokers and app developers are working to bring digital contact tracing to european and american markets. their pitch is enticing. in a forthcoming paper in science, researchers at oxford argue that covid- spreads too quickly and asymptomatically to be controllable through traditional contact tracing methods. to alleviate the need for long term mass social distancing, the authors of that study argue that communities will need to deploy "instant digital contact tracing." id. covid- testing and support for quarantining identified contacts, undermine the efficacy of digital contact tracing efforts that its proponents seemingly take for granted. as for privacy, digital contact tracing efforts abroad already raise significant cause for concern. these mass surveillance programs sweep up revealing location data indiscriminately. although they are defended on grounds of emergency and the urgent need to contend with the present health crisis, experiences in those countries already reveal the potential for abuse. moreover, our own history amply demonstrates that surveillance powers claimed on emergency grounds frequently remain after the emergency has passed, often morphing into tools of social control targeted against should the political branches fail to perform on these duties, then the courts, as guardians of the fourth amendment's sacred trust, must act. in the u.s. context, questions about protecting privacy against threats of government surveillance implicate the fourth amendment, which guarantees that "the right of the people to be secure in their persons, houses, papers, and effects against unreasonable searches and seizures shall not be violated . . . ." 
would an epidemiological surveillance program that uses cellphone location data to conduct individual contact tracing or to document and predict infection patterns using aggregate data be subject to fourth amendment regulation? if so, what form should those regulations take? answering these questions requires addressing two threshold fourth amendment questions. (by "location tracking," we mean the use of technologies like cell site location or gps that monitor the movements of individual persons or devices through space. in the present context, location tracking technologies might be used to trace the movements of individuals who test positive for sars-cov- and to document potential contacts with others. by "proximity tracking," we mean technologies like geofencing or wi-fi that document the presence of individuals or devices in a specific area. proximity tracking might be used to control the flow of persons in particular spaces, such as malls, grocery stores, or even cities. it might also be used to identify those who have traversed hotspots or other areas of potential contagion.) the first question concerns the state agency requirement. cellphone service providers already respond to thousands of government requests for user data each year. they may therefore already be acting as state agents when they collect and aggregate location data. that case is even stronger in the narrower circumstance of epidemiological surveillance. these programs require ongoing access to historical data for baseline analysis, recent aggregate data to model population flows and contact patterns, targeted data to trace individual persons, and real-time data to monitor the activities of persons and groups. in short, these programs will entail close, ongoing public-private partnerships, which will have the effect of converting at&t, google, and other data collectors and aggregators into government agents for purposes of the fourth amendment. to the extent doubts about the state agency requirement persist, they are probably mooted by the supreme court's recent decision in carpenter v. united states. there, the court held that the fourth amendment governs law enforcement access to historical cell site location data gathered and stored by cellphone service providers (cellphone location data, whether in the form of cell site location or gps tracking, appears to be a centerpiece of tracing and proximity surveillance proposals because these devices are so often with their users). as the carpenter court noted, service providers collect and aggregate this data for their own business purposes. nevertheless, the court held that law enforcement must secure a warrant before accessing that data in the context of a criminal investigation. the court was decidedly mealy-mouthed about what constituted the "search" in carpenter, who did it, and when, but the circumstances contemplated by epidemiological surveillance programs are sufficiently analogous to conclude that the state agency requirement would not be an impediment to applying the fourth amendment, even if the precise reasons why might remain a mystery. but does epidemiological surveillance constitute a "seizure" or a "search"? the fourth amendment regulates conduct that threatens the security of the people against "unreasonable searches and seizures." conventionally, this means that the fourth amendment applies only to conduct that constitutes either a "seizure" or a "search." "seizures" entail material interference with property or liberty. depending on the technology used, epidemiological surveillance programs plausibly could constitute seizures of "effects."
for example, if the government required citizens to install specific applications on their cellular phones or other electronic devices, then that interference might well amount to a seizure of effects. so, too, if government agents surreptitiously hacked devices to install tracking software. in addition, if a surveillance program was aimed at using personal devices to limit carpenter, s. ct. at . id. at . see id. at ("the location information obtained from carpenter's wireless carriers was the product of a search."); id. at ("the government's acquisition of the cell-site records was a search within the meaning of the fourth amendment."). - ( ) . technology company and made available to researchers or business entities? is the cdc subject to fourth amendment regulation if it has done nothing more than what a private party could do? in one respect the answer is easy: "yes, of course!" as described above, the state agency requirement highlights the fact that government agents are subject to fourth amendment restraints that do not bind private actors. in another respect, however, the answer is less clear. one might wonder about the application of doctrine elaborated in the context of criminal law enforcement to the quite different case of public health. it is important here to distinguish between two distinct fourth amendment questions: whether government action constitutes a "search," and, if so, what form of prospective restraint is necessary to guarantee the security of the people against unreasonable search. when answering the "search" question, it does not matter whether the government agent is engaged primarily in criminal law enforcement or public health. a search is a search, whether conducted by police looking for evidence of a crime or the cdc looking for traces of a contagion. by contrast, when addressing the remedy question, it matters quite a bit whether a search is conducted to advance a criminal investigation or to advance other governmental interests. see infra notes - and accompanying text. in addition, there are real concerns that some tracking programs justified by public health concerns may be exploited for other purposes, including law enforcement, immigration, and national security, which will require careful safeguards. see infra part ii. cell site location data that is routinely gathered and aggregated by cellphone service providers for their own business purposes. the crux of the court's reasoning was that location tracking reveals a host of intimate details about private associations and activities. the court also worried that granting law enforcement unfettered access to this kind of data would facilitate programs of broad and indiscriminate search, threatening the right of the people to be secure against threats of arbitrary state power, and conjuring the specters of general warrants and writs of assistance that haunted the minds of the founding generation. epidemiological surveillance programs robust enough to conduct individual contact tracing or to document disease progression using aggregate data will trigger these same concerns. this suggests that they, too, would be subject to some form of fourth amendment restraint. the fact that some of the data used might be anonymized does not change the calculus. first, as has been amply demonstrated, it is very easy to deanonymize location data. that is likely to be particularly true in a world where people have been ordered to stay at home because location data will be robustly associated with individual residences. 
second, the fact that data is anonymized may salve some of the individual privacy concerns, but it does little to resolve concerns about "arbitrary power" and "permeating . . . surveillance," which threaten the as public health as long as they are narrowly tailored, likely to succeed, strike a reasonable balance between privacy interests and public policy goals, and limit the discretion of government agents conducting searches. the next part details a framework that policymakers can apply to meet these constitutional demands when deploying and using epidemiological surveillance tools such as contact tracing. the covid- pandemic poses an emergency for public health. faced with such emergencies, executive agents are wont to assert broad discretionary powers. as the canadian freeholder observed in , they are fond of doctrines of reason of state, and state necessity, and the impossibility of providing for great emergencies and extraordinary cases, without a discretionary power in the crown to proceed sometimes by uncommon methods not agreeable to the known forms of law. the fourth amendment guards against these threats, "curb[ing] the exercise of discretionary authority" to search and seize. to deploy and use epidemiological surveillance tools will therefore require more than executive fiat. what the fourth amendment demands is a clear and deliberative process, weighing the genuine benefits and costs of programs likely to engage in invasive and potentially mass surveillance. this process-which should involve both legislative and agency actors-must identify prospective remedial measures sufficient to safeguard the right of people to be secure against unreasonable searches and seizures while reasonably accommodating legitimate public health goals. in other work, one of us has elaborated a constitutionally informed framework policymakers can apply when designing data surveillance programs. this framework challenges the political branches to engage critical questions about need, efficacy, parsimony, and discretion before deploying these kinds of surveillance tools. it also provides a guide for courts to evaluate the constitutional sufficiency of the regulatory structures erected around these kinds of surveillance programs. below, we explain how that framework might guide the development and deployment of public health surveillance programs like digital contact tracing, location monitoring, and data aggregation and analysis. pre-deployment review. before a digital public health surveillance program is deployed, proponents must publicly and transparently identify the goals of the program and establish a reasonable, scientifically grounded fit between those goals, the methods to be used, and the data to be gathered. in particular, proponents must clearly articulate why digital methods outperform traditional alternatives in order to justify additional intrusions on privacy. reliable processes to identify infected individuals must be followed by efficient procedures to inform and monitor any contacts who may have been exposed. in the case of covid- , current proposals for digital contact tracing largely appear to take for granted that identified contacts will immediately and reliably self-isolate for the two-week incubation period of a possible covid- infection. the oxford research team that advocates "instant digital contact tracing" plainly states that it modelled the impact of "tracing the contacts of symptomatic cases and quarantining them." 
their model defines its success rate as "the fraction of all contacts traced, assuming perfectly successful quarantine upon tracing, or the degree to which infectiousness of contacts is reduced assuming all of them are traced." that is a generous assumption with no demonstrated grounding in reality. our recent experience with social distancing suggests that many people will continue to congregate, whether by choice or necessity, despite prompts to maintain social distance. abound of crowded subways, public markets, houses of worship, beaches, and parks across the country. workers without paid sick leave-let alone paid leave for possible, but unconfirmed, infection-may simply be unable to afford to self-isolate when prompted by public health orders, even if they are otherwise inclined to obey. from a purely practical perspective, then, models grounded in assumptions about compliance with self-isolation instructions offer little in terms of evidence that digital tracing methods will be superior enough to traditional methods to justify the radical costs to privacy attendant to mass surveillance. moreover, digital contact tracing may cast so wide and impersonal a net that it will be less effective than traditional means in generating compliance. depending on the precision of the location data, prompts to self-isolate may become overbroad and routine, which will further reduce compliance. social distancing recommendations emphasize six feet of distance between people to minimize infection. this suggests that ideal contact tracing would limit its scope to individuals who were within six feet of an infected person. but cell tower and gps data typically have margins of error of more than six feet. gps data, which is more precise than cell tower location data, is usually accurate only to within sixteen feet, with even poorer performance in crowded urban areas. bluetooth tracking may be more precise, but it is also overinclusive, likely registering contacts between devices despite the presence of walls, car doors, or even whole floors in a building. at least for now, these tools are likely to direct into isolation many people who were never actually at risk. over time, overbroad notifications will fail to prompt appropriate selfisolation even among individuals who are genuinely at risk-the epidemiological equivalent of crying "wolf!" other difficulties may arise as well, from false or malicious designations of an individual as infected when they are not, to insufficient participation in voluntary programs. traditional contact tracing, though perhaps a bit slower, may still prove to be more precise, accurate, robust, reliable, and visceral, and therefore may be more effective in generating actual compliance. in sum, digital contact tracing is unlikely at present to yield its promised benefits due to low testing rates, low compliance rates, and technological limitations. before requesting or requiring individuals to sacrifice their locational and associational privacy, policymakers must even if an epidemiological surveillance program can establish efficacy, that does not end the fourth amendment inquiry. given the substantial privacy interests at stake, legislators and app developers must take care at all stages of design, deployment, and use to mitigate against privacy harms, beginning with data gathering. 
indeed, these later stages-and the need to continue to probe issues of efficacy-will take on increased importance if app developers or policymakers charge ahead before pre-deployment review is completed. data gathering and aggregation. epidemiological surveillance programs such as digital contact tracing should gather the minimum amount and types of data reasonably necessary to facilitate their public health goals. although gps data is seemingly ubiquitous, it casts too wide a net and thus exposes a larger population's location data in response to every query. bluetooth data, by contrast, may be able to register proximity between devices more precisely, alongside or instead of logging location directly. utilizing proximity data rather than location data could minimize the intrusiveness of the data gathered because this data would reveal that two devices were in proximity but not where. limiting the timeframe covered by location or proximity data would minimize the scope of information revealed about a person's movements, habits, and intimate associations. see, e.g., supra notes - and accompanying text (charting the proliferation of digital contact tracing apps, including in some u.s. states). id. in addition to satisfying the fourth amendment, data gathering, aggregation, and other features of a digital contact tracing effort must also comply with existing statutory privacy protections. for instance, the california consumer privacy act provides california residents with, among other rights, the right to know what information certain businesses have collected about them, the right to request deletion of that data, and the right to opt out of the sale of that data to others. see cal. civ. code § code § . - should conduct regular reviews to determine whether the promises of a program match its reality. epidemiological surveillance programs such as digital contact tracing have been touted as a silver bullet that will free the american public from the strictures of social distancing, enabling a return to school, work, and socializing. but these tools also tread on established expectations of privacy while presenting real threats of persistent mass surveillance. in sorting through these promises and challenges, the fourth amendment will have a critical role to play. like all matter how popular or seemingly necessary in "providing for great emergencies and extraordinary cases." in particular, the fourth amendment will require that epidemiological surveillance programs demonstrate sufficient potential to serve compelling public health goals. there are good reasons to be skeptical. unless and until more mundane aspects of contact tracing are operating efficiently-including availability of testing and practical support for appropriate self-isolation by contacts-there is little reason to think that there is enough promise to justify the dramatic expansions in government power and significant costs to personal privacy. even if there is good reason to believe in the public health promise of these programs, the fourth amendment requires more than blind faith in the judgment of government officials. the fourth amendment is genetically skeptical of granting broad, unfettered discretion for state agents to conduct searches and seizures. to meet fourth amendment demands, epidemiological surveillance programs, whether directed at digital contact tracing, location monitoring, or data aggregation and analysis, must be the products of rigorous deliberative processes, weighing the genuine benefits and costs. 
robust prospective remedial measures should be put in place to secure privacy and liberty, including limitations on data gathering, aggregation, storage, access, analysis, and use. in addition, programs should be subject to constant review and sunset provisions. only by adopting these kinds of procedural and substantive safeguards can we hope to achieve legitimate public health goals as we face covid- while also protecting our sacred constitutional trust. maseres, supra note , at - . see united states v. jones, u.s. , ( ) (sotomayor, j., concurring). crowds gathered at national mall to watch blue angels coronavirus news: social distancing is not happening on the nyc subway closes wharf fish markets after patrons fail to follow social distancing guidelines at (touting improved cell tower triangulation methods giving location precision to within meters) gps-enabled smartphones are typically accurate to within a . m ( ft.) radius under open sky"). nonetheless, the majority of existing digital contact tracing apps appear to rely on gps data. see woodhams key: cord- -mjtlhh e authors: pellert, max; lasser, jana; metzler, hannah; garcia, david title: dashboard of sentiment in austrian social media during covid- date: - - journal: nan doi: nan sha: doc_id: cord_uid: mjtlhh e to track online emotional expressions of the austrian population close to real-time during the covid- pandemic, we build a self-updating monitor of emotion dynamics using digital traces from three different data sources. this enables decision makers and the interested public to assess issues such as the attitude towards counter-measures taken during the pandemic and the possible emergence of a (mental) health crisis early on. we use web scraping and api access to retrieve data from the news platform derstandard.at, twitter and a chat platform for students. we document the technical details of our workflow in order to provide materials for other researchers interested in building a similar tool for different contexts. automated text analysis allows us to highlight changes of language use during covid- in comparison to a neutral baseline. we use special word clouds to visualize that overall difference. longitudinally, our time series show spikes in anxiety that can be linked to several events and media reporting. additionally, we find a marked decrease in anger. the changes last for remarkably long periods of time (up to weeks). we discuss these and more patterns and connect them to the emergence of collective emotions. the interactive dashboard showcasing our data is available online under http://www.mpellert.at/covid _monitor_austria/. our work has attracted media attention and is part of an web archive of resources on covid- collected by the austrian national library. in , the outbreak of covid- in europe lead to a variety of countermeasures aiming to limit the spread of the disease. these include temporary lock downs, the closing of kindergartens, schools, shops and restaurants, the requirement to wear masks in public, and restrictions on personal contact. health infrastructure was re-allocated with the goal of providing additional resources to tackle the emerging health crisis triggered by covid- . such large-scale disruptions of private and public life can have tremendous influence on the emotional experiences of a population. governments have to build on the compliance of their citizens with these measures. 
forcing the population to comply by instituting harsh penalties is not sustainable in the longer run, especially in developed countries with established democratic institutions like in most of europe. on the scale of whole nations, very strict policing also faces technical limits and diverts resources from other duties. in addition, recent research shows that, when compared to enforcement, the recommendation of measures can be a better motivator for compliance [ ] . non-intrusive monitoring of emotional expressions of a population enables to identify problems early on, with the hope to provide the means to resolve them. due to the rapid development of the response to covid- , it is desirable to produce up-to-date observations of public sentiment towards the measures, but it is hard to quantify sentiment at large scales and high temporal resolution. policy decisions are usually accompanied by representative surveys of public sentiment that, however, suffer from a number of shortcomings. first, surveys depend on explicit self-reports which do not necessarily align with actual behaviour [ ] . in addition, conducting surveys among larger numbers of people is time consuming and expensive. lastly, a survey is always just a snapshot of public sentiment at a single point in time. often, by the time a questionnaire is constructed and the survey has been conducted, circumstances have changed and the results of the survey are only partially valid. online communities are a complementary data source to surveys when studying current and constantly evolving events. their digital traces reveal collective emotional dynamics almost in real-time. we gather these data in the form of text from platforms such as twitter and news forums, where large groups of users discuss timely issues. we observe a lot of activity online, with clear increases during the nation-wide lock down of public life. for example, our data shows austrian twitter saw a % increase in posts from march compared to before ( - - until - - ). livetickers at news platforms are a popular format that provides small pieces of very up-to-date news constantly over the course of a day. this triggers fast posting activity in the adjunct forum. by collecting these data in regular intervals, we face very little delay in data gathering and analysis and provide a complement to survey-based methods. our setup has the advantage of bearing low cost while featuring a very large sample size. the disadvantages include more noise in the signal due to our use of automated text analysis methods, such as sentiment analysis. additionally, if only information from one platform is considered, this might result in sampling a less representative part of the population than in surveys where participant demographics are controlled. however, systematic approaches to account for errors at different stages of research have been adapted to digital traces data [ ] . we showcase the monitoring of social media sentiment during the covid- pandemic for the case of austria. austria is located in central europe, serving as a small model for both western europe (especially germany [ ] ) and eastern europe (e.g. hungary [ ] ). therefore, the developments around covid- in austria have been closely watched by the rest of europe. as the virus started spreading in europe on a larger scale in february , stringent measures were implemented comparatively early in austria [ ] . 
using data from austria allows us to build a quite extensive, longitudinal account of first hand discussions on covid- . additionally, austria's political system and its public health system have all the capacities of a developed nation to tackle a health crisis [ ] . therefore, we expect the population to express the personal, emotional reaction to the event, without being overwhelmed by lack of resources and resulting basic issues of survival. interactive online dashboards are an accessible way to summarize complex information to the public. during covid- , popular dashboards have conveyed information about the evolution of the number of covid- cases in different regions of austria [ ] and globally [ ] . other dashboards track valuable information such as world-wide covid- registry studies [ ] . developers of dashboards include official governmental entities like the national ministry of health as well as academic institutions and individual citizens. to our knowledge, the overwhelming majority of these dashboards display raw data together with descriptive statistics of "hard" facts and numbers on covid- . to fill a gap, we build a dashboard with processed data from three different sources to track the sentiment in austrian social media during covid- . it is easily accessible online and updated on a daily basis to give feedback to authorities and the interested general public. we retrieve data from three different sites: a news platform, twitter and a chat platform for students. all data for this article was gathered in compliance with the terms and conditions of the platforms involved. twitter data was accessed through crimson hexagon (brandwatch), an official twitter partner. the platform for students and derstandard.at gave us their permission to retrieve the data automatically from their systems. a daily recurring task is set up on a server to retrieve and process the data, and to publish the updated page online (for a description of the workflow see figure ). the news platform derstandard.at was an internet pioneer as it was the first german language newspaper to go online in . from february , it started entertaining an active community, first in a chatroom [ ] . in , the chatroom was converted to a forum that is still active today and allows for posting beneath articles. users have to register to post and they can up-and down-vote single posts. in a platform change made voting more transparent by showing which user voted both positive or negative. according to a recent poll [ ] , derstandard.at is considered both the most trustful and most useful source of information on covid- in austria. visitors come from austria, but also from other parts of the german-speaking area. in , derstandard.at was visited by , , unique users per month that stay on average : minutes on the site and request a total of , , subpages [ ] . to cover the developments around covid- , daily livetickers (except sundays) were set up on derstandard.at. figure s in the supplementary information shows an example of the web interface of such a liveticker. as no dedicated api exists for data retrieval from derstandard.at, we use web-scraping to retrieve the data (under permission from the site). first, we request a sitemap and identify the relevant urls of livetickers. second, we query each small news item of each of the livetickers. we receive data in json format and flatten and transform the json object to extract the id of each small news piece. third, we query the posts attached to that id in batches. 
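the three retrieval steps just described can be sketched as follows (the rationale for requesting posts in batches is given in the next paragraph); the endpoint paths, query parameters, and json field names used here are hypothetical placeholders, not the actual derstandard.at interface.

```python
import re
import requests

BASE = "https://www.derstandard.at"   # real site; the endpoint paths below are invented
BATCH_SIZE = 50                       # assumed size of one batch of posts

def liveticker_urls(sitemap_url=BASE + "/sitemap.xml"):
    # step one: request the sitemap and keep only the liveticker urls
    xml = requests.get(sitemap_url).text
    return [u for u in re.findall(r"<loc>(.*?)</loc>", xml) if "/jetzt/livebericht/" in u]

def news_item_ids(ticker_url):
    # step two: query the small news items of one liveticker and extract their ids
    items = requests.get(ticker_url + "/items", params={"format": "json"}).json()
    return [item["id"] for item in items]

def posts_for_item(item_id):
    # step three: query the posts attached to one news item in batches until exhausted
    posts, offset = [], 0
    while True:
        batch = requests.get(BASE + "/api/postings",
                             params={"item": item_id, "offset": offset, "limit": BATCH_SIZE}).json()
        if not batch:
            return posts
        posts.extend(batch)
        offset += BATCH_SIZE
```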
this is necessary because derstandard.at does not display all the posts at once beneath a small news item. instead, the page loads a new batch of posts as soon as the user reaches the bottom of the screen. this strategy is chosen to not overcrowd the interface, as the maximum number of posts beneath one small news item can be very high (up to posts in our data set). by following our iterative workflow to request posts, we are able to circumvent issues of pagination. finally, after we have received all posts, we transform the json objects to tabulator-separated value files for further analysis. this approach is summarised in the upper part of figure . to retrieve daily values for our indicators from twitter, we rely on the forsight platform by crimson hexagon, an aggregation service of data from various platforms, including twitter. twitter has an idiosyncratic user base in austria, mainly composed of opinion makers, like journalists and politicians. in the case of studying responses to a pandemic, studying these populations gives us an insight into public sentiment due to their influence in public opinion. yet, one should keep in mind that twitter users are younger, more liberal, and have higher formal education than the general population [ ] . as a third and last source, we include a discussion platform for young adults in austria . the discussions on the platform are organized in channels based on locality, with an average of ± (mean ± standard deviation) posts per day from - - to - - . the typical number of posts per day on the platform dropped from ± (january to april) to ± (april to may). this drop occurred due to the removal of the possibility to post anonymously on april th in order to prevent hate speech. based on data from this platform, we study the reaction of the special community of young adults in different austrian locations, with the majority of posts originating in vienna ( %), graz ( %) and other locations ( %). to assess expressions of emotions and social processes, we match text in posts on all three platforms to word classes in the german version of the linguistic inquiry and word count (liwc) dictionary [ ] , including anxiety, anger, sadness, positive emotions and social terms. liwc is a standard methodology in psychology for text analysis that includes validated lexica in german. it has been shown that liwc, despite its simplicity, has an accuracy to classify emotions in tweets that is comparable to other state of the art tools in sentiment analysis benchmarks [ ] . previous research has shown that liwc, when applied to large-scale aggregates of tweets, has similar correlations with well-being measures as other, more advanced text analysis techniques [ , ] . since within the scope of this study only text aggregates will be analysed, liwc is an appropriate method and can be applied to all sorts of text data that is collected for the monitor. for the prosocial lexicon, we translated a list of prosocial terms used in previous research [ ] , including for example words related to helping, empathy, cooperating, sharing, volunteering, and donating. we adapt the dictionaries to the task at hand by excluding most obvious terms that can bias the analysis, as done in recent research validating twitter word frequency data [ ] . specifically, we cleaned the lists for ( ) words which are likely more frequently used during the covid- pandemic e.g. 
by news media and do not necessarily express an emotion (sadness: tot*; anger: toete*, tt*, tte*; positive: heilte, geheilt, heilt, heilte*, heilung; prosocial: heilverfahren, behandlung, behandlungen, dienstpflicht, ffentlicher dienst, and digitale dienste all matching dienst*), ( ) potential mismatches unrelated to the respective emotion (sadness: harmonie/harmlos matching harm*; positive: uerst; prosocial: dienstag matching dienst*) ( ) specific austria-related terms like city names (sadness: klagenfurt matching klagen*) or events (sadness: misstrauensantrag matching miss*), and ( ) twitter-related terms for the analysis of tweets only (prosocial: teilen, teilt mit). for text from derstandard.at, we average the frequency of terms per post to take into account the varying lengths of posts. as twitter has a strict character limit of characters per post, crimson hexagon provides the number of tweets containing at least one of the terms, based on which we calculate the proportion of such posts. posts have a median length of characters in derstandard.at, characters in twitter, and characters in the chat platform for young adults. to exclude periodic weekday effects, we correct for the daily baseline of our indicators by computing relative differences to mean daily baseline values. for derstandard.at data, the baseline is computed from all posts to derstandard.at articles in the year . we use the main website articles for this instead of livetickers because during , livetickers were mainly used to cover sport events (for an example see https://www.derstandard.at/jetzt/livebericht/ /bundesliga-livelask-sturm) or high-profile court cases (https://www.derstandard.at/jetzt/livebericht/ /buwog-prozessvermoegensverwalter-stinksauer-auf-meischberger). thereby, we choose a slightly different medium for our baselines to avoid having a topic bias in the baselines. nonetheless, it comes from the same platform with the same layout and functionalities and an overlapping user base: users ( % of total unique users in the livetickers) in our data set that are active at livetickers also post at normal articles. the speed of posting may differ slightly, because the article is typically posted in a final format, whereas small news pieces are added constantly in livetickers. for the other data sources, we correct by computing the baseline for the indicators from the start of period available to us (twitter back to - - , chat platform for young adults back to - - ) to january . finally, we combine the processed data and render an interactive website. for this, we use "plotly" [ ] , "rflexdashboard" [ ] and "wordcloud " [ ] in r [ ] , and the "git" protocol to upload the resulting html page to github pages. using versioning control allows us to easily revert the page to a previous state in case of an error. we track the sentiment of the austrian population during covid- and make our findings available as an interactive online dashboard that is updated daily. we display the time series almost in real-time with a small delay to catch all available data (see figure using derstandard.at as a data source). it has features such as the option to display the number of observations by hovering over the data point or to isolate lines and to compare only a subset of indicators. the dashboard can be accessed online at http://www.mpellert.at/covid monitor austria/. table shows several descriptive statistics of the data sets used. for derstandard.at, we retrieved livetickers with small news items. 
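The indicator computation described above, dictionary matching with prefix wildcards and the weekday-baseline correction, can be summarised in a short sketch. The word lists below are illustrative stand-ins (the German LIWC categories and the translated prosocial lexicon are curated resources not reproduced here), the column layout assumed for the daily indicator tables is our own, and the dashboard itself was implemented in R, so Python is used here purely for illustration.

import re
import pandas as pd

# illustrative stand-ins for the licensed german liwc categories and the
# translated prosocial lexicon; "tot*" denotes a prefix wildcard
lexicon = {
    "anxiety":   ["angst*", "panik*", "risiko*", "unsicher*"],
    "anger":     ["wut*", "hass*", "aggress*"],
    "sadness":   ["traurig*", "verabschiede*", "verlust*"],
    "positive":  ["freude*", "dank*", "hoffnung*"],
    "prosocial": ["helf*", "spende*", "gemeinsam*"],
}

def compile_category(terms):
    # wildcard terms become prefix matches on whole words; plain terms match exactly
    parts = []
    for t in terms:
        if t.endswith("*"):
            parts.append(r"\b" + re.escape(t[:-1]) + r"\w*")
        else:
            parts.append(r"\b" + re.escape(t) + r"\b")
    return re.compile("|".join(parts), flags=re.IGNORECASE)

patterns = {cat: compile_category(terms) for cat, terms in lexicon.items()}

def indicators_for_post(text):
    # relative term frequency per post, accounting for varying post lengths;
    # for tweets the indicator is instead whether at least one term occurs
    tokens = text.split()
    return {cat: len(p.findall(text)) / max(len(tokens), 1)
            for cat, p in patterns.items()}

def correct_for_weekday_baseline(daily, baseline, indicator):
    # relative difference of each day's value to the mean baseline value of
    # the same weekday, which removes periodic weekday effects; both frames
    # are assumed to carry a datetime index and one column per indicator
    by_weekday = (baseline.assign(weekday=baseline.index.dayofweek)
                          .groupby("weekday")[indicator].mean())
    weekday = pd.Series(daily.index.dayofweek, index=daily.index)
    expected = weekday.map(by_weekday)
    return (daily[indicator] - expected) / expected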
on average, users publish ± posts under each of those items in the time period of interest ( - - to - - ). posts have a median length of characters (see figure s for a histogram of the length of posts). posts provide immediate reactions by the users of derstandard.at: the median is at . seconds for the first post to appear below a small news item. we use word clouds (figure ) to visualize the emotional content of posts. while livetickers on covid- cover the time period from - - until - - , the baseline includes normal articles on derstandard.at from . to highlight changes in language use during covid- , our word clouds compare word frequency in the livetickers with the baseline: the size of words in the clouds is proportional to | log( prob livetickers prob baseline )|, where prob baseline and prob livetickers refer to the frequency of the dictionary term compared to the frequency of all matches of terms in that category, in the baseline and the livetickers, respectively. color of words corresponds to the sign of this quantity: red means positive, i.e. the frequency of the word increased in the livetickers, whereas blue signifies that the usage of the word decreased. by combining these information, our word clouds give an impression on how the composition of terms in the dictionary categories changed during covid- . our dashboard analyses a part of public discourse. we assume that the lockdown of public life increased tendencies of the population to move debates online. users that take part in these discussions often form very active communities that sometimes structure their whole day around their posting activities. this is reflected in our data in the word clouds of figure from the increased usage of greetings (category "social"), marking the start or the end of a day such as "moin"/"good morning" or "gn"/"good night". we identified the following events in austria corresponding to anxiety spikes in expressed emotions in social media. unrelated to covid- , there was reporting on a terrorist attack in hanau, germany on - - . the first reported covid- case in austria was on - - and the first death on - - . the first press conference, announcing bans of large public events and university closures as first measures, happened on - - . it was followed by strict social distancing measures announced on - - , starting on the day after. the overall patterns in the monitor of sentiment in figure show that austrian user's expressions of anxiety increased, whereas anger decreased in our observation period. we go into detail on this in section . the sentiment dynamics on social media platforms can be influenced by content that spreads fear and other negative emotions. timely online emotion monitoring could help to quickly identify such campaigns by malicious actors. even legitimately elected governments can follow the controversial strategy of steering emotions to alert the population to the danger of a threat. for example, democratically elected actors can deliberately elicit emotions such as fear or anxiety to increase compliance from the top down. such a strategy has been followed in austria [ ] and other countries like germany [ ] . reports about the deliberate stirring of fear by the austrian government are reflected in a spike of anxiety on - - in figure . the spikes of anxiety at the beginning of march in the early stages of the covid- outbreak may have been reinforced by anxiety eliciting strategies. 
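For reference, the word-cloud weighting described earlier, whose fraction was flattened in the text, can be written out explicitly. This is a reconstruction from the surrounding description, with p_live(w) and p_base(w) denoting the frequency of term w relative to all matches of its dictionary category in the livetickers and in the baseline, respectively:

\mathrm{size}(w) \;\propto\; \left|\, \log \frac{p_{\mathrm{live}}(w)}{p_{\mathrm{base}}(w)} \,\right|,
\qquad
\mathrm{colour}(w) \;=\;
\begin{cases}
\text{red},  & p_{\mathrm{live}}(w) > p_{\mathrm{base}}(w) \quad \text{(usage increased)}\\
\text{blue}, & p_{\mathrm{live}}(w) < p_{\mathrm{base}}(w) \quad \text{(usage decreased)}
\end{cases}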
in an effort to provide an archive of austrian web resources for future reference, the austrian national library (nb) monitors the dashboard and stores changes. there are a number of such initiatives also in other nations [ ] with the earliest and most famous example being archive.org. through selective harvesting of resources connected to covid- , the dashboard is part of the nb collection "coronavirus " (https://webarchiv.onb.ac.at/). our results show patterns in the change of language use during covid- . in the anger category, words related to violence and crime are less frequent in livetickers since covid- compared to , indicating that reports and discussions about violent events, or possibly even these events themselves, become less frequent as the public discourse focuses on events related to the pandemic. for anxiety, the most remarkable change is a reduction in words related to terror and abuse, accompanied by a smaller increase of terms linked to panic, risk and uncertainty. in the sadness category, the verb "verabschiede"/"saying goodbye" appears almost times more often in the livetickers. for prosocial words, terms referring to helping, community and encouragement increased. from the social terms, the word "empfehlungen"/"recommendations" occurs slightly more frequently, while topics of migration, integration and patriarchy are less often discussed. finally, positive terms that increase the most are the expression of admiration "aww*" and "hugs", indicating that people send each other virtual hugs instead of physical ones. dynamics of collective emotions may be different in crisis times. while they typically vary fast [ ] and return to the baseline within a matter of days even after catastrophic events like natural disasters or terrorist attacks [ , ] , changes during the covid- pandemic in austria have lasted several weeks for most analysed categories (up to weeks in some cases). in contrast to one-off events, threat from a disease like covid- is more diffuse, and the emotion-eliciting events are distributed in time. in addition, measures that strongly affect people's daily lives over a long period of time, as well as high level of uncertainty, likely contribute to the unprecedented changes of collective emotional expression in online social media. the dashboard illustrates early and strong increases in anxiety across all three analysed platforms starting at the time of the first confirmed cases in austria (end of february ). a first initial spike of anxiety-terms occurs on all three platforms around the time the first positive cases were confirmed and news about the serious situation in italy were broadcast in austria. about two weeks later, levels rise again together with the number of confirmed cases, reaching particularly high levels in the week before the lockdown on march. afterwards, the gradually drop again. in total, levels of anxiety-expression did not return to the baseline for more than six weeks from - - until - - on twitter. on derstandard.at, levels also remained above the baseline for more than four weeks in a row. timelines for twitter and derstandard.at also show a clear and enduring decrease of angerrelated words starting in the week before the lock-down, as discussions of potentially controversial topics other than covid- become scarcer. this decrease lasts for four weeks on derstandard.at ( februar - april), but is particularly stable on twitter, where anger-terms remain less frequent than in for . months in a row ( - - to - - ). 
in contrast, prosocial and social terms show opposing trends on these two platforms: they increase slightly but do so for more than months on twitter, where people share not only news, but also talk about their personal lives. in contrast, they decrease for more than months in a row on derstandard.at, where people mostly discuss specific political events or topics.the increase of sadness-related expressions is smaller than changes in anxiety and anger, but also lasted for about a month on twitter, and two weeks on derstandard.at. interestingly, positive expressions were used slightly more frequently on all three platforms for long periods since the outbreak. this trend is visible from the beginning of march on the student platform and derstandard.at, and further increases since restrictions on people's lives have reduced. in total, positive expressions are more frequent than baseline during the last . months (as of th of june) on derstandard.at. an analysis of collective emotions in reddit comments from users in eight us cities found results similar to ours, including spikes in anxiety and the decrease in anger [ ] , which suggests that some of our findings might generalize to other platforms and countries. the dashboard gives opinion makers and the interested public a way to observe collective sentiment vis-a-vis the crisis response in the context of a pandemic. it has gained attention from austrian media [ ] , and from the covid future operations clearing board [ ] , an interdisciplinary platform for exchange and collaboration between researchers put in place by the federal chancellery of the republic of austria. especially during the first weeks of the crisis, multiple newspapers reported on the changes of emotional expressions in online platforms [ , , , , ] . timely knowledge about the collective emotional state and expressed social attitudes of the population is valuable for adapting emergency and risk-communication as well as for improving the preparedness of (mental) health services. supplementary material is included. the dashboard can be accessed at http://www.mpellert.at/covid monitor austria/. the source code is available at https://github.com/maxpel/covid monitor austria. the data sets accumulated daily by updating the dashboard will be released in the future. table : descriptive statistics showing relevant aspects of the data sources. numbers refer to the time period from march to june ( days). the total number of twitter users in austria in january is taken from the report of datareportal [ ] . fractions refer to the number of posts containing at least one term from the relevant dictionary category in liwc divided by the total number of posts. the differential impact of physical distancing strategies on social contacts relevant for the spread of covid- . medrxiv psychology as the science of self-reports and finger movements: whatever happened to actual behavior? a total error framework for digital traces of humans austria's kurz says germany copied his country's lockdown easing plan. reuters coronavirus orbn: gradual restart of life planned in nd phase of measures, preparation for surprises. hungary today kurier.at. sterreich bei intensivbetten weit ber oecd schnitt austrian ministry for health. amtliches dashboard covid csse. covid- dashboard by the center for systems science and engineering a real-time dashboard of clinical trials for covid- . the lancet digital health der standard chatroom: die bar, die nicht mehr ist. 
der standard library catalog: www.derstandard sales team. derstandard.at media data how twitter users compare to the general public computergesttzte quantitative textanalyse -quivalenz und robustheit der deutschen version des linguistic inquiry and word count sentibench-a benchmark comparison of state-of-the-practice sentiment analysis methods tracking" gross community happiness" from tweets estimating geographic subjective well-being from twitter: a comparison of dictionary and datadriven language methods moral actor, selfish agent flexdashboard: r markdown format for flexible dashboards wordcloud : create word cloud by htmlwidget r: a language and environment for statistical computing. r foundation for statistical computing regierungsprotokoll: angst vor infektion offenbar erwnscht wie wir covid- unter kontrolle bekommen a survey on web archiving initiatives the individual dynamics of affective expression on social media a novel surveillance approach for disaster mental health collective emotions and social resilience in the digital traces after a terrorist attack the unfolding of the covid outbreak: the shifts in thinking and feeling. understanding people and groups republic of austria. covid- future operations clearing board -bundeskanzleramt sterreich online-emotionen in foren whrend der coronakrise coronavirus: twitter spiegelt ngste und sorgen der menschen wider -derstandard gefhle und videokonferenzen -wiener komplexittsforscher finden bei online-emotionen nach einem deutlichen anstieg zu beginn der krise nun weniger ngstlichkeit. mensch -wiener zeitung online, . library catalog: www.wienerzeitung online-emotionen: mehr trauer als wut. science.orf.at we thank christian burger from derstandard.at for providing data access, and julia litofcenko and lena mller-naendrup for their support in translating the prosocial dictionary to german. access to crimson hexagon was provided via the project v!brant emotional health grant suicide prevention media campaign oregon to thomas niederkrotenthaler. the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. mp and dg designed research. mp retrieved derstandard.at data, processed and analyzed all data, and implemented the dashboard. jl retrieved data for the platform for young adults. hm retrieved data for twitter, and wrote methods and result reports for the dashboard. mp, jl and hm wrote the draft of the manuscript. all authors provided input for writing and approved the final manuscript. key: cord- - g erth authors: ienca, marcello; vayena, effy title: on the responsible use of digital data to tackle the covid- pandemic date: - - journal: nat med doi: . /s - - - sha: doc_id: cord_uid: g erth large-scale collection of data could help curb the covid- pandemic, but it should not neglect privacy and public trust. best practices should be identified to maintain responsible data-collection and data-processing standards at a global scale. o n january , the world health organization (who) directorgeneral declared the coronavirus disease (covid- ) outbreak a publichealth emergency of international concern (pheic). six weeks later, the outbreak was categorized as a pandemic. covid- has already caused times more cases (as of march ) than the previous coronavirus-induced pheic-the - severe acute respiratory syndrome (sars) outbreak-and the covid- numbers are expected to grow. 
compared with the - outbreak, however, the covid- emergency is occurring in a much more digitized and connected world. the amount of data produced from the dawn of humankind through is generated today within a few minutes. furthermore, advanced computational models, such as those based on machine learning, have shown great potential in tracing the source or predicting the future spread of infectious diseases , . it is therefore imperative to leverage big data and intelligent analytics and put them to good use for public health. relying on digital data sources, such as data from mobile phones and other digital devices, is of particular value in outbreaks caused by newly discovered pathogens, for which official data and reliable forecasts are still scarce. a recent study has shown the possibility of forecasting the spread of the covid- outbreak by combining data from the official aviation guide with data on human mobility from the wechat app and other digital services owned by chinese tech giant tencent . mobile-phone data already showed potential in predicting the spatial spread of cholera during the haiti cholera epidemic , while leveraging big-data analytics showed effectiveness during the - western african ebola crisis . however, during those recent epidemics, the large-scale collection of mobile data from millions of users-especially call-data records and social-media reports-also raised privacy and data-protection concerns. in , privacy concerns urged the gsm association (an industry organization that represents the interests of mobile-network operators worldwide) to issue guidelines on the protection of privacy in the use of mobile-phone data for responding to the ebola outbreak . in the data-intensive world of , ubiquitous data points and digital surveillance tools can easily exacerbate those concerns. china, the country most affected by covid- , is reportedly using ubiquitous sensor data and healthcheck apps to curb the disease spread . according to a new york times report , there is little transparency in how these data are cross-checked and reused for surveillance purposes. for example, the report said that alipay health code, an alibaba-backed government-run app that supports decisions about who should be quarantined for covid- , also seems to share information with the police . in italy, the european country recording the largest number of covid- cases, the local data-protection authority was urged, on march , to issue a statement to clarify the conditions of lawful data use for mitigation and containment purposes. in its statement, the authority warned against the privacy-infringing collection and processing of data by non-institutional actors (e.g., private employers). two weeks later, the european data protection board issued a statement on the importance of protecting personal data when used in the fight against covid- and flagged specific articles of the general data protection regulation that provide the legal grounds for processing personal data in the context of epidemics . for example, article allows the processing of personal data "for reasons of public interest in the area of public health, such as protecting against serious cross-border threats to health, " provided such processing is proportionate to the aim pursued, respects the essence of the right to data protection and safeguards the rights and freedoms of the data subject. 
as big data will be critical for managing the covid- pandemic in today's digital world, the conditions for responsible data collection and processing at a global scale must be clear. we argue that the use of digitally available data and algorithms for prediction and surveillancee.g., identifying people who have traveled to areas where the disease has spread or tracing and isolating the contacts of infected people-is of paramount importance in the fight against the covid- pandemic. it is equally important, however, to use these data and algorithms in a responsible manner, in compliance with data-protection regulations and with due respect for privacy and confidentiality. failing to do so will undermine public trust, which will make people less likely to follow public-health advice or recommendations and more likely to have poorer health outcomes . careful data-management practices should govern both data collection and data processing. in the collection of data from affected people, the principle of proportionality should apply, which means that the data collection must (i) be proportional to the seriousness of the public-health threat, (ii) be limited to what is necessary to achieve a specific public-health objective, and (iii) be scientifically justified. gaining access to data from personal devices for contact tracing purposes, for example, can be justified if it occurs within specific bounds, has a clear purpose-e.g., warning and isolating people who may have been exposed to the virusand no less-invasive alternative-e.g., using anonymized mobile positioning data-is suitable for that purpose. furthermore, 'do it yourself ' health surveillance, as it was labeled by the italian data-protection authority, should be avoided. comment at the data-processing level, data quality and security controls are needed. data-integrity weaknesses, which are common when data from personal digital devices are used, can introduce small errors in one or multiple factors, which in turn can have an outsized effect on largescale predictive models. furthermore, data breaches, insufficient or ineffective de-identification and biases in datasets can become major causes of distrust in public-health services. data privacy challenges not only are of a technical nature but also depend on political and judicial decisions. requesting or warranting access to personal devices can, for purposes such as contact tracing, be more effective than simply leveraging anonymized mobile positioning data. however, compelling providers to allow access to or even assist in decrypting cryptographically protected data (similar to what occurred during the us federal bureau of investigation-apple encryption dispute) may be counterproductive, especially if the agreements between (inter) national authorities and service providers lack transparency or proportion. similar trade-offs apply to health apps that require users to register with their names or national identification numbers. national authorities should be mindful that precisely because personal data may contain valuable information about the social interactions and recent movements of infected people, they should be handled responsibly. overriding consent and privacy rights in the name of disease surveillance may fuel distrust and ultimately turn out to be disadvantageous. there have been reports that china's digital epidemic control might have exacerbated stigmatization and public mistrust. 
this risk of mistrust is even greater in countries in which citizens place a much lower level of trust in their government, such as italy, france and the usa . therefore, whenever access to these data sources is required and is deemed proportional, the public should be adequately informed. secrecy about data access and use should be avoided. transparent public communication about data processing for the common good should be pursued. data-processing agreements, for example, should disclose which data are transmitted to third parties and for which purpose. reports from taiwan show a promising way to leverage big-data analytics to respond to the covid- crisis without fuelling public mistrust. taiwanese authorities integrated their national health insurance database with travel-history data from customs databases to aid in case identification. other technologies, such as qr code scanning and online reporting, were also used for containment purposes. these measures were combined with public communication strategies involving frequent health checks and encouragement for those under quarantine . as more countries are gearing up to use digital technologies in the fight against the ongoing covid- pandemic, data and algorithms are among the best arrows in our quiver-if they are used properly. ❐ gsma oecd the authors declare no competing interests. key: cord- - vn yt m authors: lei, howard; o’connell, ryan; ehwerhemuepha, louis; taraman, sharief; feaster, william; chang, anthony title: agile clinical research: a data science approach to scrumban in clinical medicine date: - - journal: intell based med doi: . /j.ibmed. . sha: doc_id: cord_uid: vn yt m the covid- pandemic has required greater minute-to-minute urgency of patient treatment in intensive care units (icus), rendering the use of randomized controlled trials (rcts) too slow to be effective for treatment discovery. there is a need for agility in clinical research, and the use of data science to develop predictive models for patient treatment is a potential solution. however, rapidly developing predictive models in healthcare is challenging given the complexity of healthcare problems and the lack of regular interaction between data scientists and physicians. data scientists can spend significant time working in isolation to build predictive models that may not be useful in clinical environments. we propose the use of an agile data science framework based on the scrumban framework used in software development. scrumban is an iterative framework, where in each iteration larger problems are broken down into simple do-able tasks for data scientists and physicians. the two sides collaborate closely in formulating clinical questions and developing and deploying predictive models into clinical settings. physicians can provide feedback or new hypotheses given the performance of the model, and refinement of the model or clinical questions can take place in the next iteration. the rapid development of predictive models can now be achieved with increasing numbers of publicly available healthcare datasets and easily accessible cloud-based data science tools. what is truly needed are data scientist and physician partnerships ensuring close collaboration between the two sides in using these tools to develop clinically useful predictive models to meet the demands of the covid- healthcare landscape. we propose the use of an agile data science framework based on the scrumban framework used in software development. 
scrumban is an iterative framework, where in each iteration larger problems are broken down into simple do-able tasks for data scientists and physicians. the two sides collaborate closely in formulating clinical questions and developing and deploying predictive models into clinical settings. physicians can provide feedback or new hypotheses given the performance of the model, and refinement of the model or clinical questions can take place in the next iteration. the rapid development of predictive models can now be achieved with increasing numbers of publicly available healthcare datasets and easily accessible cloud-based data science tools. what is truly needed are data scientist and physician partnerships ensuring close collaboration between the two sides in using these tools to develop clinically useful predictive models to meet the demands of the covid- healthcare landscape. the covid- pandemic has greatly altered the recent healthcare landscape and has brought about greater minute-to-minute urgency of patient treatment especially in intensive care units (icus). this greater urgency for treatment implies a greater need for agility in clinical research, rendering traditional approaches such as randomized controlled trials (rcts) [ ] too slow to be effective. one approach for meeting the agility needs is the use of data science for the development of predictive models to assist in patient treatment. predictive models can be rapidly and non-invasively developed leveraging existing data and computational tools, and various efforts have been undertaken [ ] [ ] [ ] . if successful, predictive models can rapidly process volumes of patient information to assist physicians in making clinical decisions. however, the development and deployment of predictive models that are useful in clinical environments within short timeframes is challenging. traditionally, the development and deployment of models employs a sequential process that resembles the waterfall methodology used in software development [ ] . data would the model be deployed into a real-world setting for the domain experts to evaluate and provide feedback. one main disadvantage of this approach is that it prescribes for little collaboration between the day-to-day operations of data scientists and domain experts such as physicians, resulting in data scientists potentially working in isolation for long periods of time. figure illustrates this process. a breakdown of the tasks data scientists typically perform in isolation include data collection, data pre-processing and augmentation, model selection, model hyper-parameter turning, model training, and model testing. data pre-processing is used for the data to be in a format that's suitable for use by the predictive model. data augmentation is used to artificially increase the size of the data. for example, if the input data consist of images, augmentation can include translation, scaling, rotation, and adjusting the brightness of images to present more example images for the predictive model to learn. note that one popular technique for compensating for data size is the synthetic minority oversampling technique (smote) [ ] , which addresses class imbalance in datasets by artificially increasing the amount of data in the minority class. class imbalance is commonly encountered when working with electronic medical record (emr) data in healthcare. the class representing the patients with a target condition is typically smaller in size (i.e. 
with fewer samples) than the class representing patients without the target condition, and this can adversely affect the accuracy of predictive models developed on such data. hyper-parameter tuning involves adjusting the parameters used in the model training process [ ] , where the model is taught how to make predictions given the training data. one example of a hyper-parameter is the number of times -or iterations -the training data is presented to the model to learn. each iteration is known as an epoch. after each epoch the model increases its learning from the training data, and after many epochs the learning is completed. a second example of a hyper-parameter is the percentage of training data that's used by the model in each epoch. the more epochs and the more data presented in each epoch, the better the model learns from the training data. a final example of a hyper-parameter is the learning rate of predictive models. the learning rate inversely correlates with the amount of time the model takes to reach its "learned state". models trained using higher learning rates can reach its final state faster and complete its training sooner; however, they may not have learned as well compared to models trained using lower learning rates. depending on the amount of training data, the complexity of the data, the number of parameters in the predictive model, and the computing resources, model training can potentially take days to complete. the model would be evaluated against a separate test data to verify if performance meets requirements. if not, some or all previous steps must be repeated until performance becomes acceptable. after the performance is deemed acceptable, the model would be deployed into a real-world environment. in the end, the process from the conception of the problem to model deployment can take months, and the opportunity for domain experts to evaluate comes only after the deployment of the model. one risk is that after deployment, the model would no longer be relevant if the goals have shifted; another risk is that the model may not meet the performance requirements in a realworld setting. in either situation, time or resources allocated to model development would have been wasted. this can be particularly damaging for data science efforts addressing the covid- pandemic, where rapid development of approaches for detection and diagnoses of symptoms is critical. in healthcare, the ability to rapidly define goals (i.e. clinically relevant questions) and deploy predictive models that have real-world impact is faced with even more challenges. one challenge is that healthcare data such as electronic medical records (emr) of patients is inherently complex [ ] , consisting of a mix of different data types and structures, missing data, and mislabeled data. the development of predictive models often requires well-structured and welllabeled data; hence, there is a greater need for data exploration, pre-processing and/or filtering when processing emr data. furthermore, it may be discovered upon exploration of available training data that the initial clinical questions and goals may not be achievable by predictive models developed using the data. those questions and goals would need to be refined before model development can proceed. furthermore, for predictive models to be usable in a clinical setting, physicians must have confidence that its performance is reliable. 
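The imbalance handling with SMOTE and the hyper-parameter search described above can be combined in a short sketch. The choice of imbalanced-learn and scikit-learn, the classifier and the parameter grid are illustrative, and the synthetic data merely stands in for a pre-processed EMR extract; for a neural network, the number of epochs, the fraction of data per epoch and the learning rate would be tuned in the same way.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic, imbalanced stand-in for a pre-processed emr extract
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# smote synthesizes new minority-class samples; applied only to the training
# split so the test set stays untouched
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# each grid point is trained and validated, and the best setting is kept
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="roc_auc", cv=5)
search.fit(X_res, y_res)
print(search.best_params_, search.score(X_test, y_test))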
models that perform well under common metrics used by data scientists, such as the area under the curve (auc), do not guarantee that important clinical decisions can be made based on the model [ ] . that is because the auc measures model performance across a broad range of sensitivities and specificities of the model. when making important clinical decisions related to patients in the icu, such as proning versus ventilation, which drugs to use, or whether to administer anti-coagulants, knowing that the model has an excellent auc of . out of . is not as helpful as knowing that a decision based on the model has a % chance of being correct (i.e. the model's specificity). some clinical decisions also need to be made within minutes, implying that the model must meet real-time performance standards in order to become the "partner" that can assist physicians in on-the-spot decision making. the fact that predictive models may underperform after being deployed into a clinical setting implies an even greater need for a framework that allows physicians to collaborate with data scientists to continuously monitor model development and performance. furthermore, the minute-to-minute urgency of treatment needed for the covid- pandemic implies that the lengthy process prescribed by the traditional waterfall approach - with little communication between data scientists and physicians - is inadequate. the agile framework has traditionally been used in software development and has recently been introduced in data science [ ] . the framework is an expedient approach that encourages greater velocity towards accomplishing goals. it includes the scrum and kanban frameworks, and a hybrid framework called scrumban [ ] . the scrum framework prescribes consecutive "sprint cycles", with each cycle spanning a few weeks. within each cycle, team members set and refine goals, produce implementations, and perform a retrospective with stakeholders. new goals and refinements are established for the next sprint cycle. one of the team members also acts as the scrum master, who facilitates daily team meetings (called standups) and ensures that the team is working towards goals and requirements [ ] . the kanban framework involves breaking down larger tasks into simple, do-able tasks. each task proceeds through a sequence of well-defined steps from start to finish. tasks are displayed as cards on a kanban board, and their positions on the board indicate how much progress has been made [ ] . certain tasks may be "blocked", meaning that something needs to be resolved before progress on the task can continue. figure shows an example of a kanban board. one advantage of using a kanban board is that the set of all necessary tasks, along with the progress of each task, is transparent to members of the development team and anyone else who is interested. overall, the kanban framework helps bring clarity to tackling larger problems. domain experts can visualize how the team is tackling the problems, along with what has been accomplished, what is in progress, what still needs to be done, and what needs to be resolved before progress can be made. the proposed agile framework is shown in figure . unlike the waterfall approach, the tasks in the agile approach are done collaboratively between data scientists and physicians, and we note that the use of cloud-based storage and computing helps by providing a common platform for accessing the data and model(s).
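Returning to the point about the AUC made at the beginning of this passage: reporting sensitivity, specificity and the chance that a positive call is correct at a fixed decision threshold, alongside the AUC, is closer to what a bedside decision needs. The sketch below assumes predicted probabilities are already available; the threshold and the example scores are illustrative choices only.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def operating_point_report(y_true, y_score, threshold=0.5):
    # report the auc together with decision-level quantities at one threshold
    auc = roc_auc_score(y_true, y_score)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "auc": auc,
        "sensitivity": tp / (tp + fn),   # share of true positives that are caught
        "specificity": tn / (tn + fp),   # share of true negatives correctly ruled out
        "ppv": tp / (tp + fp),           # chance that a positive call is correct
    }

# illustrative scores only
y_true = [0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
y_score = [0.10, 0.20, 0.40, 0.35, 0.80, 0.70, 0.55, 0.90, 0.05, 0.30]
print(operating_point_report(y_true, y_score, threshold=0.6))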
complex problems can be broken down into tasks that can be visualized by both data scientists and physicians, enabling physicians to better understand the work that data scientists must do within each sprint cycle. the framework encourages continuous deployment of predictive models in clinical settings (i.e. such as the icu), during which time data scientists can round with physicians and receive feedback on the model's performance. the physician's insight or gestalt can be leveraged to determine whether the results of the model are believable [ ] . it may be that the predictive model performs well only in certain settings, such as with only certain patient populations across certain periods of time; if so, the clinical questions can be refined or new hypotheses developed at the beginning of the next sprint cycle. the point at which the sprint cycles should end, either because the model has finally become clinically useful or the team needs to completely pivot to a different direction, is determined by the physicians. while the traditional waterfall approach could take many months for clinically useful models to be developed, the agile approach could take just a fraction of the time depending on the level of collaboration between data scientists and physicians. for agile data science to work in the healthcare domain, certain infrastructure must be in place to ensure that sprint cycles can be completed within shorter timeframes. these include the ability to: . rapidly acquire large datasets. . parse and query data in real time. . use established platforms and libraries rather than develop tools de-novo. these platforms and libraries should reside in a cloud framework that allows collaborative efforts to take place. the availability of publicly accessible health information databases for research is increasing despite a multitude of regulatory and financial roadblocks. one such database is the medical information mart for intensive care iii (mimic-iii) which contains de-identified data generated by over fifty thousand patients who received care in the icu at beth israel deaconess medical center [ ] . the hope is that as researchers adopt the use of mimic, new insights, knowledge, and tools from around the world can be generated [ ] . another publicly available database is the eicu collaborative research database, a multi-center collaborative database containing intensive care unit (icu) data from many hospitals across the united states [ ] . both the mimic-iii and the eicu databases can be immediately obtained upon registration and completion of training modules. the popularity of these two databases illustrates the potential for large amounts of data to be gathered from hospitals and icus around the world and made immediately accessible to researchers. [ ] . the cerner real-world data is another covid- research database that contains de-identified data and is freely offered to health systems [ ] . finally, databases for medical imaging studies also exist, such as the chest x-ray dataset released by the nih, which contains over , chest x-ray images [ ] . once datasets are obtained, storage and compute power are easily purchased and accessible from an ever-increasing number of vendors. the compute power needed for analyzing large datasets can often be met using cloud computing resources with amazon web services (aws), google cloud platform (gcp), and microsoft azure being the providers of popular cloud services [ ] [ ] [ ] . 
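As an illustration of the "parse and query data in real time" requirement above, a public ICU dataset such as MIMIC-III can be explored in a few lines once access has been granted. The file and column names below follow the publicly documented MIMIC-III layout but should be verified against the release and access agreement in use; the query itself is only an example.

import pandas as pd

# assumes the patients and admissions tables of mimic-iii have been downloaded
# locally under the data use agreement
patients = pd.read_csv("PATIENTS.csv")
admissions = pd.read_csv("ADMISSIONS.csv",
                         parse_dates=["ADMITTIME", "DISCHTIME"])

cohort = admissions.merge(patients, on="SUBJECT_ID")
cohort["los_days"] = (cohort["DISCHTIME"] -
                      cohort["ADMITTIME"]).dt.total_seconds() / 86400.0

# example query: in-hospital mortality by admission type
print(cohort.groupby("ADMISSION_TYPE")["HOSPITAL_EXPIRE_FLAG"]
            .mean()
            .sort_values(ascending=False))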
the need for cloud computing tools rests mainly on the availability of specialized elastic compute instances. the elasticity implies that computing resources can be assessed in real time and scaled up or down as needed to balance computing power and cost. another advantage of a cloud framework is that it allows multiple data scientists and physicians to conveniently collaborate and access the work. this shift to elastic cloud resources has seen one of the major electronic medical records (emr) providers, cerner corporation [ ] , develop tools for agile data science that use cloud computing resources as the underlying computing engine. these tools for agile data science often use jupyter notebook as the underlying frontend programming interface. jupyter is an open source computational environment that supports programming frameworks and languages such as apache spark [ ] , python and r required for processing the data and developing predictive models [ ] . open source machine learning libraries like keras [ ] which enables the rapid development of advanced predictive models such as convolutional neural networks (cnns) [ ] , can also be integrated. finally, the jupyter notebook framework supports collaboration amongst multiple individuals, where data scientists and physicians can query data, add and modify code and/or visualize results in real time [ ] . the availability of the development tools and accessibility of data allow data scientists to rapidly acquire data, query parts of data relevant for addressing clinical questions, and develop predictive models. the outcomes of the model can lead to refinement of the clinical questions, data, or the model itself. the combination of the data scientist, physicians, and agile data science tools will help revolutionize the entire data science process and accelerate discoveries in healthcare and other application domains. agile data science is quickly becoming a necessity in healthcare, and especially critical given the covid- pandemic. the agile framework prescribes a rapid, continuous-improvement process enabling physicians to understand the work of data scientists and regularly evaluate predictive model performance in clinical settings. physicians can provide feedback or form new hypotheses for data scientists to implement in the next cycle of the process. this is a departure from the traditional waterfall approach, with data scientists tackling a sequence of tasks in isolation, without regularly deploying the models in real-world settings and engaging domain experts such as physicians. given the rapidly shifting healthcare landscape, the goals and requirements for the predictive models may change by the time the model is deployed; this renders the slower, traditional model development approaches unsuitable. as the agile framework encourages rapid development and deployment of predictive models, it requires data scientists to have easy access to data and the infrastructure needed for model j o u r n a l p r e -p r o o f development, deployment, and communication of outcomes. fortunately, there are now publicly available datasets such as mimic-iii, and cloud-based infrastructure such as amazon web services (aws) to achieve this. aws contains a suite of popular tools such as jupyter notebook, python, and r, allowing data scientists to rapidly upload data, and develop and deploy models with short turn-around time. 
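To give a concrete sense of how quickly a model can be stood up with the tools mentioned above (a cloud-hosted Jupyter notebook with Keras), the sketch below defines a small convolutional network for a binary outcome on image input such as chest x-rays. The input size, layer sizes and training settings are arbitrary illustrative choices, not the configuration of any study cited here.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 128, 1)),          # grayscale image input
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),      # probability of the target class
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])

# model.fit(train_images, train_labels, epochs=10, validation_split=0.2)
# running this inside a shared cloud notebook lets data scientists and
# physicians inspect the same results during a sprint cycle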
given the increasing amounts of healthcare data, the plethora of clinical questions to address, as well as the minute-to-minute urgency of treating icu patients given the covid- pandemic, the rapid development of predictive models to address these challenges is more important than ever. we hope that the agile framework can be embraced by increasing numbers of physician and data scientist partnerships, in the process of developing clinically useful models to address these challenges. a method for assessing the quality of a randomized control trial artificial intelligence (ai) applications for covid- pandemic artificial intelligence-enabled rapid diagnosis of patients with covid- smote: synthetic minority over{sampling technique how to read articles that use machine learning (users' guide to the medical literature) data processing and text mining technologies on electronic medical records: a a physician's perspective on machine learning in healthcare. invited talk presented at machine learning for health care (mlhc) agile data science . . o'reilly media, inc mvm -minimal viable model mimic-iii, a freely accessible critical care database making big data useful for health care: a summary of the inaugural mit critical data conference the eicu collaborative research database, a freely available multi-center database for covid- clinical data sets for research faq: covid- de-identified data cohort access offer chestx-ray : hospitalscale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases apache spark -unified analytics engine for big data toward collaborative open data science in metabolomics using jupyter notebooks and cloud computing reading checks with multilayer graph transformer networks no external funding is provided for this work.j o u r n a l p r e -p r o o f j o u r n a l p r e -p r o o f highlights • agile data science in healthcare is becoming a necessity, given the covid- pandemic and the minute-to-minute urgency of patient treatment.• the proposed agile data science framework is based on scrumban, used in software development.• publicly available healthcare datasets and cloud-based infrastructure enable the agile framework to be widely adopted.• collaboration between physicians and data scientists needed in order to implement the agile framework. ☒ the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.☐the authors declare the following financial interests/personal relationships which may be considered as potential competing interests:j o u r n a l p r e -p r o o f key: cord- -bwghyhqx authors: jiang, zheng; hu, menghan; fan, lei; pan, yaling; tang, wei; zhai, guangtao; lu, yong title: combining visible light and infrared imaging for efficient detection of respiratory infections such as covid- on portable device date: - - journal: nan doi: nan sha: doc_id: cord_uid: bwghyhqx coronavirus disease (covid- ) has become a serious global epidemic in the past few months and caused huge loss to human society worldwide. for such a large-scale epidemic, early detection and isolation of potential virus carriers is essential to curb the spread of the epidemic. recent studies have shown that one important feature of covid- is the abnormal respiratory status caused by viral infections. during the epidemic, many people tend to wear masks to reduce the risk of getting sick. 
therefore, in this paper, we propose a portable non-contact method to screen the health condition of people wearing masks through analysis of the respiratory characteristics. the device mainly consists of a flir one thermal camera and an android phone. this may help identify those potential patients of covid- under practical scenarios such as pre-inspection in schools and hospitals. in this work, we perform the health screening through the combination of the rgb and thermal videos obtained from the dual-mode camera and deep learning architecture.we first accomplish a respiratory data capture technique for people wearing masks by using face recognition. then, a bidirectional gru neural network with attention mechanism is applied to the respiratory data to obtain the health screening result. the results of validation experiments show that our model can identify the health status on respiratory with the accuracy of . % on the real-world dataset. the abnormal respiratory data and part of normal respiratory data are collected from ruijin hospital affiliated to the shanghai jiao tong university medical school. other normal respiratory data are obtained from healthy people around our researchers. this work demonstrates that the proposed portable and intelligent health screening device can be used as a pre-scan method for respiratory infections, which may help fight the current covid- epidemic. during the outbreak of covid- epidemic, early control is essential. among all the control measures, efficient and safe identification of potential patients is the most important part. existing researches show that human physiological state can be perceived through breathing [ ] , which means respiratory signals are vital signs that can reflect human health condition to a certain extent [ ] . many clinical literature suggests that abnormal respiratory symptoms may be important factors for diagnosis of some specific diseases [ ] . recent studies have found that covid- patients will have obvious respiratory symptoms such as shortness of breath fever, tiredness, and dry cough [ , ] . among those symptoms, atypical or irregular breathing is considered as one of the early signs. for many people, early mild respiratory symptoms are difficult to be recognized. therefore, through the measurement of respiration condition, potential covid- patients can be screened to some extent. this may play an auxiliary diagnostic role, thus helping to find potential patients as early as possible. traditional respiration measurement requires attachments of sensors to the patient's body [ ] . the monitor of respiration is measured through the movement of the chest or abdomen. contact measurement equipment is bulky, expensive, and time-consuming. the most important thing is that the contact during measurement may increase the risk of spreading infectious diseases such as covid- . therefore, the non-contact measurement is more suitable for the current situation. in recent years, many non-contact respiration measurement methods have been developed based on imaging sensors, doppler radar [ ] , depth camera [ ] and thermal cam-era [ ] . considering factors such as safety, stability and price, the measurement technology of thermal imaging is the most suitable for extensive promotion. so far, thermal imaging has been used as a monitoring technology in a wide range of medical fields such as estimations of heart rate [ ] and breathing rate [ ] [ ] [ ] . 
another important thing is that many existing respiration measurement devices are large and immovable. given the worldwide epdemic, the partable and intelligent screening equipment is required to meet the needs of largescale screening and other application scenarios in a real-time manner. for thermal imaging based respiration measurement, nostril regions and mouth regions are the only focused regions since only these two parts have periodic heat exchange between the body and the outside environment. however, until now, researchers seldom considered measuring thermal respiration data for people wearing masks. during the epidemic of infectious diseases, masks may effectively suppress the spread of the virus according to recent studies [ , ] . therefore, developing the respiration measurement method for people wearing masks becomes quite practical. in this study, we develop a portable and intelligent health screening device that uses thermal imaging to extract respiration data from masked people which is then used to do the health screening classification via deep learning architecture. in classification tasks, deep learning has achieved the state-of-the-art performance in most research areas. compared with traditional classifiers, classifiers based on deep learning can automatically identify the corresponding features and their correlations rather than extracting features manually. for breathing tasks, algorithms based deep learning can also better extract the corresponding features such as breathing rate and breath-to-exhale ratio, and make more accurate predictions [ ] [ ] [ ] [ ] . recently, many researchers made use of deep learning to analyze the respiratory process. cho et al. used a convolutional neural network (cnn) to analyze human breathing parameters to determine the degree of nervousness through thermal imaging [ ] . romero et al. applied a language model to detect acoustic events in sleepdisordered breathing through related sounds [ ] . wang et al. utilized deep learning and depth camera to classify abnormal respiratory patterns in real time and achieved excellent results [ ] . the disadvantage of this research may be that the equipment is not portable. in this paper, we propose a remote, potable and intelligent health screening system based on respiratory data for pre-screening and auxiliary diagnosis of respiratory diseases like covid- . in order to be more practical in a situation where people often choose to wear masks, the breathing data capture method for people wearing masks is introduced. after extracting breathing data from the video obtained from the thermal camera, a deep learning neural network is performed to work on the classification between healthy and abnormal respiration conditions. to verify the robustness of our algorithm and the effectiveness of the proposed equipment, we an-alyze the influence of mask type, measurement distance and measurement angle on breathing data. the main contributions of this paper are threefold. first, we combine the face recognition technology with dual-mode imaging to accomplish a respiratory data extraction method for people wearing masks, which is quite essential for current situation. based on our dual-camera algorithm, the respiration data is successfully obtained from masked facial thermal videos. subsequently, we propose a classification method to judge abnormal respiratory state with deep learning framework. 
finally, based on the two contributions mentioned above, we have implemented a non-contact and efficient health screening system for respiratory infections using actual measured data from a hospital, which may contribute to finding possible cases of covid- and keeping control of the secondary spread of sars-cov- .

a brief introduction to the proposed respiration condition screening method is given below. we first use the portable and intelligent screening device to get the thermal and the corresponding rgb videos. during data collection, we also provide a simple real-time screening result. after getting the thermal videos, the first step is to extract respiration data from faces in the thermal videos. during the extraction process, we use a face detection method to capture people's masked areas. then a region of interest (roi) selection algorithm is proposed to get the region of the mask that best represents the characteristics of breathing. finally, we use a bidirectional gru neural network with attention mechanism (bigru-at) model to perform the classification task with the input respiration data.

our data collection is achieved by the system shown in fig. (fig. : overview of the portable and intelligent health screening system for respiratory infections: a) device appearance; b) analysis result of the application). the whole screening system includes a flir one thermal camera, an android smartphone and the corresponding application we have written, which is used for data acquisition and simple instant analysis. our screening equipment, whose main advantage is portability, can easily be applied to measure abnormal breathing in many instant detection settings. as shown in fig. , the flir one thermal camera consists of two cameras, an rgb camera and a thermal camera. we collect the face videos from both cameras and use a face recognition method to get the nostril area and forehead area. the temperatures of the two regions are calculated as time series and shown in the screening result page in fig. (b); the red line stands for the body temperature and the blue line stands for the breathing data. from the breathing data, we can predict the respiratory pattern of the test case. then, a simple real-time screening result is given directly in the application according to the extracted features shown in fig. . (notice that the system can simultaneously collect body temperature signals. in the current work, this body temperature signal is not considered in the model and is only used as a reference for the users.) we use the raw face videos collected from both the rgb camera and the thermal camera as the data for further study to ensure accuracy and higher performance.

during continuous breathing, periodic temperature fluctuations occur around the nostril due to the inspiration and expiration cycles. therefore, respiration data can be obtained by analyzing the temperature data around the nostril based on the thermal image sequence. however, when people wear masks, many facial features are blocked. merely recognizing the face through the thermal image will lose a lot of geometric and textural facial details, resulting in recognition errors for the face and mask parts. in order to solve this problem, we adopt a method based on two parallel rgb and infrared cameras for face and mask region recognition. the masked region of the face is first captured by the rgb camera, then this region is mapped to the thermal image with a related mapping function.
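as an illustration of the dual-camera mapping step described above, the following is a minimal sketch (not the authors' implementation): it assumes the rgb and thermal cameras are parallel and roughly aligned, so that a face box detected in the rgb frame can be carried over to the thermal frame with a simple per-axis scaling plus a calibrated offset. the function name, frame sizes and offset are illustrative assumptions; a real device would require a proper calibration procedure.

```python
# a minimal sketch of mapping a face box detected in the rgb frame onto the
# thermal frame, assuming the two cameras are parallel and roughly aligned so
# that a per-axis scale (plus an optional calibrated offset) suffices.

def map_box_rgb_to_thermal(box, rgb_size, thermal_size, offset=(0, 0)):
    """box = (x1, y1, x2, y2) in rgb pixel coordinates."""
    x1, y1, x2, y2 = box
    sx = thermal_size[0] / rgb_size[0]   # horizontal scale factor
    sy = thermal_size[1] / rgb_size[1]   # vertical scale factor
    dx, dy = offset                      # calibration offset (assumed known)
    return (int(x1 * sx + dx), int(y1 * sy + dy),
            int(x2 * sx + dx), int(y2 * sy + dy))

# example: a face box from a 1440x1080 rgb frame mapped to a 160x120 thermal frame
print(map_box_rgb_to_thermal((500, 300, 900, 800), (1440, 1080), (160, 120)))
```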
the algorithm for masked face detection is based on the pyramidbox model created by tang et al. [ ]. the main idea is to apply techniques such as the gaussian pyramid in deep learning to get context correlations as further characteristics. the face image is first used to extract features of different scales using the gaussian pyramid algorithm. for the high-level contextual features, a feature pyramid network is proposed to further excavate high-level contextual features. then, the output together with the low-level features are combined in low-level feature pyramid layers. finally, the result is obtained after another two layers of deep neural network. for faces where many features are lost due to the cover of the mask, such a context-sensitive structure can obtain more feature correlations and thus improve the accuracy of face detection. in our experiment, we use the open source model from paddlehub to detect the face area on rgb videos.

the next step is to extract the masked area and map the area from the rgb video to the thermal video. since the position of the mask on the human face is fixed, after obtaining the position coordinates of the human face, we obtain the mask area of the face by scaling down in equal proportions. for a detected face with width w and height h, the location of the upper-left corner is defined as (0, 0), and the location of the bottom-right corner is then (w, h). the corresponding coordinates of the two corners of the mask region are defined as (w/ , h/ ) and ( w/ , h/ ). considering that the background at the boundary of the mask will produce a large contrast with movement, which easily causes errors, we choose the center area of the mask through this division. the selected area is then mapped from the rgb image to the thermal image to obtain the masked region in the thermal videos.

after getting the masked region in the thermal videos, we need to get the region of interest (roi) that represents the breathing features. recent studies often characterize breathing data through temperature changes around the nostril [ , ]. however, when people wear masks, there is another problem: the nostrils are also blocked by the mask, and when people wear different masks, the roi may be different. therefore, we propose an roi tracking method based on maximizing the variance of the thermal image sequence to extract the area of the masked region in the thermal video that best represents the breathing signal. due to the lack of texture features in masked regions compared to human faces, we judge the roi by the temperature change of the thermal image sequence. the main idea is to traverse the masked region in the thermal images and find a small block with the largest temperature change as the selected roi. the position of a given block is fixed in the masked region across all the frames, since the nostril area is fixed on the face region. we do not need to consider the movement of the block, since our face recognition algorithm detects the mask position in each frame's thermal image.

for a given block with height m and width n, we define the average pixel intensity at frame t as $\bar{s}(t) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} p_{i,j}(t)$, where $p_{i,j}(t)$ is the intensity of pixel (i, j) of the block at frame t; for thermal images, $\bar{s}(t)$ represents the temperature value at frame t. for every block we obtain, we calculate its $\bar{s}(t)$ over the timeline. then, for each block n, the total variance of the list of average pixel intensities over t frames, $\sigma_s(n) = \frac{1}{T}\sum_{t=1}^{T}\left(\bar{s}(t) - \mu\right)^2$, is calculated, where µ stands for the mean value of $\bar{s}(t)$.
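the block-wise statistics described above can be sketched as follows; this is an illustrative reimplementation, not the authors' code, and the block size, stride and synthetic frames are arbitrary assumptions.

```python
import numpy as np

def block_scores(frames, block, stride):
    """frames: array of shape (T, H, W) holding the thermal pixel intensities of
    the masked region in every frame; block = (m, n); returns, per block position,
    the variance of the frame-wise mean intensity (the sigma_s score above)."""
    T, H, W = frames.shape
    m, n = block
    scores = {}
    for i in range(0, H - m + 1, stride):
        for j in range(0, W - n + 1, stride):
            s_bar = frames[:, i:i + m, j:j + n].mean(axis=(1, 2))  # \bar{s}(t)
            scores[(i, j)] = s_bar.var()                           # sigma_s for this block
    return scores

# toy usage with a synthetic 100-frame, 40x60 masked region
rng = np.random.default_rng(0)
frames = rng.normal(30.0, 0.2, size=(100, 40, 60))
scores = block_scores(frames, block=(8, 8), stride=4)
best = max(scores, key=scores.get)   # block with the largest variance = roi candidate
print(best, scores[best])
```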
since respiration is a periodic signal spreading out from the nostril area, we can consider the block with the largest variance to be the position where the heat changes most, in both frequency and value, within the mask, and therefore the block that best represents the breathing data in the masked region. we adjust the corresponding block size according to the size of the masked region. for a masked region with n blocks, the final roi is selected by $\text{roi} = \arg\max_{n} \sigma_s(n)$. for each thermal video, we traverse all possible blocks in the mask region of each frame and find the roi for each frame by the method above. the respiration data is then defined as the sequence $\bar{s}(t)$, $t = 1, \dots, T$, of roi pixel intensities over all the frames.

we apply a bigru-at neural network to the classification task of judging whether the respiration condition is healthy or not, as shown in fig. . the input of the network is the respiration data obtained by our extraction method. since the respiratory data is a time series, the task can be regarded as a time series classification problem. therefore, we choose a gated recurrent unit (gru) network with bidirection and an attention layer to work on the sequence prediction task. among deep learning structures, the recurrent neural network (rnn) is a type of neural network specially designed to process time series data samples [ ]. for a time step t, the rnn model can be represented by $h^{(t)} = \sigma(U x^{(t)} + W h^{(t-1)} + b)$, $o^{(t)} = V h^{(t)}$, and $y^{(t)} = \phi(o^{(t)})$, where $x^{(t)}$, $h^{(t)}$ and $o^{(t)}$ stand for the current input state, hidden state and output at time step t respectively. v, w, u are parameters obtained by the training procedure, b is the bias, and σ and φ are activation functions. the final prediction is $y^{(t)}$.

the long short-term memory (lstm) network was developed on the basis of the rnn [ ]. compared to the rnn, which can only memorize and analyze short-term information, it can process relatively long-term information, and is suitable for problems with short-term delays or long time intervals. based on the lstm, many related structures have been proposed in recent years [ ]. the gru is a simplified lstm which merges the three gates of the lstm (forget, input and output) into two gates (update and reset) [ ]. for tasks with limited data, the gru may achieve a better result than the lstm since it includes fewer parameters. in our task, since the input of the neural network is only the respiration data in time sequence, the gru network may perform better than the lstm network. the structure of the gru can be expressed by the following equations:

$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$, $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$, $\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t])$, $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$,

where $r_t$ is the reset gate that controls the amount of information being passed to the new state from the previous states, and $z_t$ is the update gate, which determines the amount of information being forgotten and added. $W_r$, $W_z$ and $W_h$ are trained parameters that vary in the training procedure. $\tilde{h}_t$ is the candidate hidden layer, which can be regarded as a summary of the previous information $h_{t-1}$ and the input information $x_t$ at time step t. $h_t$ is the output layer at time step t, which is sent to the next unit.

the bidirectional rnn has been widely used in natural language processing [ ]. the advantage of such a network structure is that it can strengthen the correlation between the context of the sequence. as the respiratory data is a periodic sequence, we use a bidirectional gru to obtain more information from the periodic sequence. the difference between the bidirectional gru and the normal gru is that the backward sequence of data is spliced to the original forward sequence of data.
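as a concrete illustration of the gru gate equations above (before turning to the bidirectional form), a single update step can be written directly in numpy; the weights and input below are random placeholders and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """one gru update following the gate equations above; the input to each
    gate is the concatenation [h_{t-1}, x_t] (bias terms omitted for brevity)."""
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ hx)                                       # reset gate
    z_t = sigmoid(W_z @ hx)                                       # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                    # new hidden state
    return h_t

# toy usage: scalar input (one respiration sample per step), hidden size 4
rng = np.random.default_rng(1)
d_in, d_h = 1, 4
W_r, W_z, W_h = (rng.normal(size=(d_h, d_h + d_in)) for _ in range(3))
h = np.zeros(d_h)
for x in [0.2, 0.4, 0.1]:          # a few normalized respiration samples
    h = gru_step(np.array([x]), h, W_r, W_z, W_h)
print(h)
```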
with this splicing, the hidden layer of the original gru, $h_t$, is changed to $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$, where $\overrightarrow{h}_t$ is the original (forward) hidden layer and $\overleftarrow{h}_t$ is its backward counterpart.

during the analysis of respiratory data, the entire waveform in the time sequence should be taken into consideration. for some specific breathing patterns, such as asphyxia, particular features such as a sudden acceleration may occur only at a certain point in the entire process. however, if we only use the bigru network, these features may be weakened as the time sequence data is input step by step, which may cause a larger error in prediction. therefore, we add an attention layer to the network, which can ensure that certain key point features in the breathing process are emphasized. the attention mechanism is a way to focus only on the important points among the total data [ ]. it is often combined with neural networks like rnns. before the rnn model summarizes the hidden states for the output, an attention layer can make an estimation of all outputs and find the most important ones. this mechanism has been widely used in many research areas. the structure of the attention layer is

$u_t = \tanh(W_u h_t + b_w)$, $a_t = \frac{\exp(u_t)}{\sum_{k}\exp(u_k)}$, $s = \sum_{t} a_t h_t$,

where $h_t$ represents the bigru layer output at time step t, which is bidirectional. $W_u$ and $b_w$ are also parameters that vary in the training process. $a_t$ applies a softmax function to $u_t$ to get the weight of each step t. finally, the output of the attention layer, s, is a combination of all the steps from the bigru with different weights. by applying another softmax function to the output s, we get the final prediction of the classification task. the structure of the whole network is shown in fig. .

our goal is to distinguish whether there is an epidemic infectious disease such as covid- according to abnormal breathing in the respiratory system. our dataset was not collected from patients with covid- . these data were obtained from inpatients of the respiratory disease department and cardiology department in ruijin hospital. most of the patients we collected data from had only basic or chronic respiratory diseases. they did not have fever, which is a typical respiratory symptom of infectious diseases. therefore, body temperature is not taken into consideration in our current screening system. in the ruijin hospital wards, we used a flir one thermal camera connected to an android phone for the data collection. we collected data from people. for each person, we collected two -second infrared and rgb camera recordings with a sampling frequency of hz. through data cutting and oversampling, we finally obtained , healthy breathing samples and , abnormal breathing samples, a total of , samples. each piece of data consists of frames of infrared and rgb video in seconds.

in the bigru-at network, the numbers of hidden cells in the bigru layer and attention layer are and respectively. the breathing data is normalized before being input into the neural network, and we use cross-entropy as the loss function. during the training process, we separate the dataset into two parts: the training set includes , healthy breathing samples and , abnormal breathing samples, and the test set contains healthy breathing samples and abnormal breathing samples. once this paper is accepted, we will release the dataset used in the current work for non-commercial users. from the whole dataset, we choose typical respiratory data examples as shown in fig. . fig. (a) and fig. (b) stand for the abnormal respiratory patterns extracted from patients. fig. (c) and fig.
(d) represent the normal respiratory pattern, called eupnea, from healthy participants. by comparison, we can see that the respiration of healthy people is strongly periodic and evenly distributed, while abnormal respiratory data tend to be more irregular. generally speaking, most abnormal breathing data from respiratory infections have a faster frequency and irregular amplitude.

the experimental results are shown in table . we consider four evaluation metrics, viz. accuracy, precision, recall and f . to measure the performance of our model, we compare the result of our model with three other models, namely gru-at, bilstm-at and lstm. the results show that our method performs better than the other networks on all evaluation metrics except for the precision value of gru-at. by comparison, the experimental results demonstrate that the attention mechanism performs well in preserving important point features in the time series of breathing data, since the networks with an attention layer all perform better than the lstm. another observation is that gru based networks achieve better results than lstm based networks. this may be because our dataset is relatively small, which cannot meet the data demands of lstm based networks; gru based networks require less data than the lstm and achieve better results in our respiration condition classification task.

to examine the detailed classification of respiratory state, we plot the confusion matrices of the four models as demonstrated in fig. . as can be seen from the results, the performance improvement of bigru-at compared to lstm is mainly in the accuracy on the negative class. this is because many scatter-like abnormalities in the time series of abnormal breathing are better recognized by the attention mechanism. besides, the misclassification rates of the four networks are relatively high to some extent, which may be because many positive samples do not have typical respiratory infection characteristics, since some of the patients had other lung-related diseases.

in the analysis section, we give comparisons from different aspects to demonstrate the robustness of our algorithm and device. in order to measure the robustness of our breathing data acquisition algorithm and the effectiveness of the proposed portable device, we analyze the breathing data of the same person wearing different masks. we design mask wearing scenarios that cover most situations: wearing one surgical mask (blue line); wearing one kn (n ) mask (red line); and wearing two surgical masks (green line). the results are shown in fig. . it can be seen from the experimental results that no matter what kind of mask is worn, or even two masks, the respiratory data can be well recognized. this demonstrates the stability of our algorithm and device. however, since different masks have different thermal insulation capabilities, the average breathing temperature may vary as the mask changes. to minimize this error, respiratory data are normalized before being input into the neural network.

in order to verify the robustness of our algorithm and device in different scenarios, we design experiments to collect respiratory data at different distances. considering the limitations of handheld devices, we test the collection of facial respiration data from a distance of to meters. the result is demonstrated in fig. . the signal tends to be periodic from the position of centimeters, and it does not start to lose regularity until about . meters.
at a distance of about centimeters, the complete face begins to appear in the camera video. when the distance reaches . meters, our face detection algorithm begins to fail gradually due to distance and pixel limitations. this experiment verifies that our algorithm and device can guarantee relatively accurate measurement results in the distance range of . meters to . meters.

considering that breath detection will be applied in different scenarios, we design an experiment to show the actual situation under different shooting angles. we define the camera pointing directly at the face to be degrees, and design an experiment in which the shooting angle is gradually changed from degrees to degrees. we consider the transformation of two angles, horizontal and vertical, which respectively represent turning left and right, and nodding. the results in the two cases are quite different, as shown in fig. . our algorithm and device maintain good results under horizontal rotation, but it is difficult to obtain precise respiratory data under vertical rotation. this means participants can turn left or right during the measurement but cannot nod or tilt their heads up, since this may affect the measurement result.

in this paper, we propose an abnormal breathing detection method based on a portable dual-mode camera which can record both rgb and thermal videos. in our detection method, we first accomplish an accurate and robust respiratory data detection algorithm which can precisely extract breathing data from people wearing masks. then, a bigru-at network is applied to the screening of respiratory infections. in validation experiments, the obtained bigru-at network achieves a relatively good result, with an accuracy of . % on the real-world dataset. it is foreseeable that in patients with covid- , who have more pronounced clinical respiratory symptoms, this classification method may yield better results. during the current outbreak of covid- , our research can serve as a pre-scan method for abnormal breathing in many scenarios such as communities, campuses and hospitals, which may contribute to distinguishing possible cases and thereby help control the spread of the virus. in future research, on the basis of ensuring portability, we plan to use a more stable algorithm to minimize the effect of different masks on the measurement of breathing condition. in addition, temperature may be taken into consideration to achieve higher detection accuracy for respiratory infections.
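to make the bigru-at architecture described in the method section concrete, the following is a minimal pytorch sketch; it follows the bidirectional-gru-plus-attention structure discussed above, but the layer sizes, sequence length and the two-step attention projection are illustrative assumptions rather than the configuration used in the paper.

```python
import torch
import torch.nn as nn

class BiGRUAT(nn.Module):
    """bidirectional gru with an additive attention layer and a 2-class head."""
    def __init__(self, hidden=64, att=32):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.att_proj = nn.Linear(2 * hidden, att)    # u_t = tanh(W_u h_t + b_w)
        self.att_score = nn.Linear(att, 1, bias=False)
        self.fc = nn.Linear(2 * hidden, 2)            # healthy vs. abnormal

    def forward(self, x):                 # x: (batch, T, 1) normalized respiration
        h, _ = self.gru(x)                # (batch, T, 2*hidden)
        u = torch.tanh(self.att_proj(h))
        a = torch.softmax(self.att_score(u), dim=1)   # attention weights a_t over time
        s = (a * h).sum(dim=1)            # weighted sum of bigru outputs
        return self.fc(s)

# toy forward/backward pass on random data (batch of 8 sequences of length 150)
model = BiGRUAT()
x = torch.randn(8, 150, 1)
y = torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
print(loss.item())
```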
references:
respiratory rate: the neglected vital sign
non-contact respiratory rate measurement validation for hospitalized patients
dysfunctional breathing: a review of the literature and proposal for classification
pathological findings of covid- associated with acute respiratory distress syndrome
world health organization declares global emergency: a review of the novel coronavirus (covid- )
respiration rate monitoring methods: a review
novel methods for noncontact heart rate measurement: a feasibility study
abnormal respiratory patterns classifier may contribute to large-scale screening of people infected with covid- in an accurate and unobtrusive manner
synergetic use of thermal and visible imaging techniques for contactless and unobtrusive breathing measurement
combination of near-infrared and thermal imaging techniques for the remote and simultaneous measurements of breathing and heart rates under sleep situation
remote monitoring of breathing dynamics using infrared thermography
a novel method for extracting respiration rate and relative tidal volume from infrared thermography
rgb-thermal imaging system collaborated with marker tracking for remote breathing rate measurement
rational use of face masks in the covid- pandemic
mass masking in the covid- epidemic: people need guidance
performance characterization of deep learning models for breathing-based authentication on resource-constrained devices
deep learning versus professional healthcare equipment: a fine-grained breathing rate monitoring model
respiration-based emotion recognition with deep learning
a deep learning framework using passive wifi sensing for respiration monitoring
deepbreath: deep learning of breathing patterns for automatic stress recognition using low-cost thermal imaging in unconstrained settings
deep learning features for robust detection of acoustic events in sleep-disordered breathing
pyramidbox: a context-assisted single shot face detector
robust tracking of respiratory rate in high-dynamic range scenes using mobile thermal imaging
finding structure in time
long short-term memory
lstm: a search space odyssey
learning phrase representations using rnn encoder-decoder for statistical machine translation
neural machine translation by jointly learning to align and translate
attention is all you need

key: cord- -h au h authors: adiga, aniruddha; chen, jiangzhuo; marathe, madhav; mortveit, henning; venkatramanan, srinivasan; vullikanti, anil title: data-driven modeling for different stages of pandemic response date: - - journal: nan doi: nan sha: doc_id: cord_uid: h au h

some of the key questions of interest during the covid- pandemic (and all outbreaks) include: where did the disease start, how is it spreading, who is at risk, and how can the spread be controlled. there are a large number of complex factors driving the spread of pandemics, and, as a result, multiple modeling techniques play an increasingly important role in shaping public policy and decision making. as different countries and regions go through phases of the pandemic, the questions and data availability also change. especially of interest is aligning model development and data collection to support response efforts at each stage of the pandemic. the covid- pandemic has been unprecedented in terms of real-time collection and dissemination of a number of diverse datasets, ranging from disease outcomes, to mobility, behaviors, and socio-economic factors.
the datasets have been critical from the perspective of disease modeling and analytics to support policymakers in real time. in this overview article, we survey the data landscape around covid- , with a focus on how such datasets have aided modeling and response through the different stages of the pandemic so far. we also discuss some of the current challenges and the needs that will arise as we plan our way out of the pandemic.

as the sars-cov- pandemic has demonstrated, the spread of a highly infectious disease is a complex dynamical process. a large number of factors are at play as infectious diseases spread, including variable individual susceptibility to the pathogen (e.g., by age and health conditions), variable individual behaviors (e.g., compliance with social distancing and the use of masks), differing response strategies implemented by governments (e.g., school and workplace closure policies and criteria for testing), and the potential availability of pharmaceutical interventions. governments have been forced to respond to the rapidly changing dynamics of the pandemic, and are becoming increasingly reliant on different modeling and analytical techniques to understand, forecast, plan and respond; this includes statistical methods and decision support methods using multi-agent models, such as: (i) forecasting epidemic outcomes (e.g., case counts, mortality and hospital demands), using a diverse set of data-driven methods, e.g., arima type time series forecasting, bayesian techniques and deep learning, e.g., [ ] [ ] [ ] [ ] [ ], (ii) disease surveillance, e.g., [ , ], and (iii) counter-factual analysis of epidemics using multi-agent models, e.g., [ ] [ ] [ ] [ ] [ ] [ ]; indeed, the results of [ , ] were very influential in the early decisions on lockdowns in a number of countries.

the specific questions of interest change with the stage of the pandemic. in the pre-pandemic stage, the focus is on understanding how the outbreak started, the epidemic parameters, and the risk of importation to different regions. once outbreaks start (the acceleration stage), the focus is on determining the growth rates, the differences in spatio-temporal characteristics, and testing bias. in the mitigation stage, the questions are focused on non-prophylactic interventions, such as school and workplace closures and other social-distancing strategies, determining the demand for healthcare resources, and testing and tracing. in the suppression stage, the focus shifts to using prophylactic interventions, combined with better tracing. these phases are not linear, and overlap with each other. for instance, the acceleration and mitigation stages of the pandemic might overlap spatially, temporally, as well as within certain social groups.

different kinds of models are appropriate at different stages, and for addressing different kinds of questions. for instance, statistical and machine learning models are very useful for forecasting and short term projections. however, they are not very effective for longer-term projections, understanding the effects of different kinds of interventions, and counter-factual analysis. mechanistic models are very useful for such questions. simple compartmental type models, and their extensions, namely structured metapopulation models, are useful for several population level questions.
however, once the outbreak has spread, and complex individual and community level behaviors are at play, multi-agent models are most effective, since they allow for a more systematic representation of complex social interactions, individual and collective behavioral adaptation, and public policies.

as with any mathematical modeling effort, data plays a big role in the utility of such models. until recently, data on infectious diseases was very hard to obtain due to various issues, such as the privacy and sensitivity of the data (since it is information about individual health), and the logistics of collecting such data. the data landscape during the sars-cov- pandemic has been very different: a large number of datasets are becoming available, ranging from disease outcomes (e.g., time series of the number of confirmed cases, deaths, and hospitalizations), some characteristics of their locations and demographics, healthcare infrastructure capacity (e.g., number of icu beds, number of healthcare personnel, and ventilators), and various kinds of behaviors (e.g., level of social distancing, usage of ppes); see [ ] [ ] [ ] for comprehensive surveys on available datasets. however, using these datasets for developing good models, and addressing important public health questions, remains challenging.

the goal of this article is to use the widely accepted stages of a pandemic as a guiding framework to highlight a few important problems that require attention in each of these stages. we aim to provide a succinct, model-agnostic formulation while identifying the key datasets needed, how they can be used, and the challenges arising in that process. we also use sars-cov- as a case study unfolding in real time, and highlight some interesting peer-reviewed and preprint literature that pertains to each of these problems. an important point to note is the necessity of randomly sampled data, e.g., data needed to assess the number of active cases and various demographics of the individuals that were affected; the census provides an excellent rationale, as it is the only way one can develop rigorous estimates of various epidemiologically relevant quantities.

there have been numerous surveys on the different types of datasets available for sars-cov- , e.g., [ ] [ ] [ ] [ ], as well as on different kinds of modeling approaches. however, they do not describe how these models become relevant through the phases of pandemic response. an earlier similar attempt to summarize such response-driven modeling efforts can be found in [ ], based on the -h n experience; this paper builds on their work and discusses these phases in the present context and the sars-cov- pandemic. although the paper touches upon different aspects of model-based decision making, we refer the readers to a companion article in the same special issue [ ] for a focused review of models used for projection and forecasting.

multiple organizations, including cdc and who, have their own frameworks for preparing and planning the response to a pandemic. for instance, the pandemic intervals framework from cdc describes the stages in the context of an influenza pandemic; these are illustrated in figure . these six stages span investigation, recognition and initiation in the early phase, followed by most of the disease spread occurring during the acceleration and deceleration stages. the framework also provides indicators for identifying when the pandemic has progressed from one stage to the next [ ].
as envisioned, risk evaluation (i.e., using tools like the influenza risk assessment tool (irat) and pandemic severity assessment framework (psaf)) and early case identification characterize the first three stages, while non-pharmaceutical interventions (npis) and available therapeutics become central to the acceleration stage. (figure : cdc pandemic intervals framework and who phases for influenza pandemic.) the deceleration is facilitated by mass vaccination programs, exhaustion of the susceptible population, or unsuitability of environmental conditions (such as weather). a similar framework is laid out in who's pandemic continuum and phases of pandemic alert. while such frameworks aid in streamlining the response efforts of these organizations, they also enable effective messaging. to the best of our knowledge, there has not been a similar characterization of the mathematical modeling efforts that go hand in hand with supporting the response.

for summarizing the key models, we consider four of the stages of pandemic response mentioned in section : pre-pandemic, acceleration, mitigation and suppression. here we provide the key problems in each stage, the datasets needed, the main tools and techniques used, and pertinent challenges. we structure our discussion based on our experience with modeling the spread of covid- in the us, done in collaboration with local and federal agencies.

• pre-pandemic (section ): this stage concerns a novel pathogen before (or just as) it takes root in a country, and focuses on characterizing the epidemiological parameters of the disease and the risk of importation to different regions. line lists of early confirmed cases and international travel data are the key datasets.

• acceleration (section ): this stage is relevant once the epidemic takes root within a country. there is usually a big lag in surveillance and response efforts, and the key questions are to model spread patterns at different spatio-temporal scales, and to derive short-term forecasts and projections. a broad class of datasets is used for developing models, including mobility, populations, land-use, and activities. these are combined with various kinds of time series data and covariates such as weather for forecasting.

• mitigation (section ): in this stage, different interventions, which are mostly non-pharmaceutical in the case of a novel pathogen, are implemented by government agencies, once the outbreak has taken hold within the population. this stage involves understanding the impact of interventions on case counts and health infrastructure demands, taking individual behaviors into account. the additional datasets needed in this stage include those on behavioral changes and hospital capacities.

• suppression (section ): this stage involves designing methods to control the outbreak by contact tracing & isolation and vaccination. data on contact tracing, associated biases, vaccine production schedules, and compliance & hesitancy are needed in this stage.

figure gives an overview of this framework and summarizes the data needs in these stages. these stages also align well with the focus of the various modeling working groups organized by cdc, which include epidemic parameter estimation, international spread risk, sub-national spread forecasting, impact of interventions, healthcare systems, and university modeling. in reality, one should note that these stages may overlap, and may vary based on geographical factors and response efforts. moreover, specific problems can be approached prospectively in earlier stages, or retrospectively during later stages. this framework is thus meant to be more conceptual than interpreted along a linear timeline. results from such stages are very useful for policymakers to guide real-time response.
consider a novel pathogen emerging in human populations that is detected through early cases involving unusual symptoms or unknown etiology. such outbreaks are characterized by some kind of spillover event, mostly through zoonotic means, as in the case of covid- or past influenza pandemics (e.g., swine flu and avian flu). a similar scenario can occur when an incidence of a well-documented disease with no known vaccine or therapeutics emerges in some part of the world, causing severe outcomes or fatalities (e.g., ebola and zika). regardless of the development status of the country where the pathogen emerged, such outbreaks now carry the risk of causing a worldwide pandemic due to the global connectivity induced by human travel. two questions become relevant at this stage: what are the epidemiological attributes of this disease, and what are the risks of importation to a different country? while the first question involves biological and clinical investigations, the latter is more related to societal and environmental factors.

one of the crucial tasks during early disease investigation is to ascertain the transmission and severity of the disease. these are important dimensions along which the pandemic potential is characterized, because together they determine the overall disease burden, as demonstrated within the pandemic severity assessment framework [ ]. in addition to risk assessment for right-sizing the response, they are integral to developing meaningful disease models.

formulation: let Θ = {θ_t, θ_s} represent the transmission and severity parameters of interest. they can be further subdivided into sojourn time parameters θ^δ and transition probability parameters θ^p. here Θ corresponds to a continuous time markov chain (ctmc) on the disease states. the problem formulation can be represented as follows: given Π(Θ), the prior distribution on the disease parameters, and a dataset d, estimate the posterior distribution p(Θ|d) over all possible values of Θ. in a model-specific form, this can be expressed as p(Θ|d, m), where m is a statistical, compartmental or agent-based disease model.

data needs: in order to estimate the disease parameters sufficiently, line lists of individual confirmed cases are ideal. such datasets contain, for each record, the date of confirmation, possible date of onset, severity (hospitalization/icu) status, and date of recovery/discharge/death. furthermore, age and demographic/comorbidity information allow the development of models that are age- and risk-group stratified. one such crowdsourced line list was compiled during the early stages of covid- [ ] and later released by cdc for us cases [ ]. data from detailed clinical investigations from other countries such as china, south korea, and singapore was also used to parameterize these models [ ]. in the absence of such datasets, past parameter estimates of similar diseases (e.g., sars, mers) were used for early analyses.

modeling approaches: for a model-agnostic approach, the delays and probabilities are obtained by various techniques, including bayesian and ordinary least squares fitting to various delay distributions. for a particular disease model, these are estimated through model calibration techniques such as mcmc and particle filtering approaches. a summary of community estimates of various disease parameters is provided at https://github.com/midas-network/covid- . further, such estimates allow the design of pandemic planning scenarios varying in levels of impact, as seen in the cdc scenarios page.
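as a small illustration of the delay-distribution fitting mentioned above, the sketch below fits a gamma sojourn-time distribution to synthetic onset-to-confirmation delays by maximum likelihood and attaches rough bootstrap uncertainty; the data, distribution choice and parameter values are assumptions for illustration only, not estimates from any real line list.

```python
import numpy as np
from scipy import stats

# synthetic line-list delays (days from symptom onset to confirmation); in
# practice these would come from the case records described above
rng = np.random.default_rng(2)
delays = rng.gamma(shape=2.0, scale=2.5, size=500)

# maximum-likelihood fit of a gamma sojourn-time distribution (location fixed at 0)
shape, loc, scale = stats.gamma.fit(delays, floc=0)
print(f"mean delay ~ {shape * scale:.1f} days (shape={shape:.2f}, scale={scale:.2f})")

# a simple nonparametric bootstrap gives rough uncertainty bounds on the mean delay
boot = [np.mean(rng.choice(delays, size=delays.size, replace=True)) for _ in range(1000)]
print("95% interval for mean delay:", np.percentile(boot, [2.5, 97.5]))
```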
see [ ] [ ] [ ] for methods and results related to estimating covid- disease parameters from real data. current models use a large set of disease parameters for modeling covid- dynamics; they can be broadly classified as transmission parameters and hospital resource parameters. for instance, in our work, we currently use the parameters (with explanations) shown in table .

challenges: often these parameters are model specific, and hence one needs to be careful when reusing parameter estimates from the literature. they are related to, but not identifiable with respect to, population level measures such as the basic reproductive number r (or effective reproductive number r_eff) and the doubling time, which allow tracking the rate of epidemic growth. the estimation is also hindered by inherent biases in the case ascertainment rate, reporting delays and other gaps in the surveillance system. aligning different data streams (e.g., outpatient surveillance, hospitalization rates, mortality records) is in itself challenging.

when a disease outbreak occurs in some part of the world, it is imperative for most countries to estimate their risk of importation through spatial proximity or international travel. such measures are incredibly valuable in setting a timeline for preparation efforts, and initiating health checks at the borders. over centuries, pandemics have spread faster and faster across the globe, making it all the more important to characterize this risk as early as possible.

formulation: let c be the set of countries, and g = (c, e) an international network, where the edges (often weighted and directed) in e represent some notion of connectivity. the importation risk problem can be formulated as follows: given c_o ∈ c, the country of origin with an initial case at time 0, and c_i, the country of interest, estimate, using g, the expected time t_i for the first case to arrive in country c_i. in its probabilistic form, the same can be expressed as estimating the probability p_i(t) of seeing the first case in country c_i by time t.

data needs: assuming we have initial case reports from the origin country, the first data needed is a network that connects the countries of the world to represent human travel. the most common sources of such information are airline network datasets, from sources such as iata, oag, and openflights; [ ] provides a systematic review of how airline passenger data has been used for infectious disease modeling. these datasets could either capture static measures, such as the number of seats available or flight schedules, or a dynamic count of passengers per month along each itinerary. since the latter has intrinsic delays in collection and reporting, it may not be representative for an ongoing pandemic. during such times, data on ongoing travel restrictions [ ] becomes important to incorporate. multi-modal traffic will also be important to incorporate for countries that share land borders or have heavy maritime traffic. for diseases such as zika, where establishment risk is more relevant, data on vector abundance or prevailing weather conditions are appropriate.

modeling approaches: simple structural measures on networks (such as degree and pagerank) can provide static indicators of the vulnerability of countries. by transforming the weighted, directed edges into probabilities, one can use simple contagion models (e.g., independent cascades) to simulate disease spread and empirically estimate the expected time of arrival.
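the independent-cascade style estimation of expected arrival times mentioned above can be sketched as a simple monte carlo simulation; the toy network, per-day edge probabilities and horizon below are assumptions, and a real analysis would derive edge probabilities from passenger volumes.

```python
import random
import networkx as nx

def expected_arrival_times(g, origin, p_attr="p", runs=1000, horizon=365):
    """monte carlo estimate of the expected time (in steps) until the first
    imported case reaches each country, treating each directed edge as an
    independent per-step transmission/travel probability."""
    totals = {c: 0.0 for c in g}
    for _ in range(runs):
        arrived = {origin: 0}
        frontier = {origin}
        for t in range(1, horizon + 1):
            new = set()
            for u in frontier:
                for v in g.successors(u):
                    if v not in arrived and random.random() < g[u][v][p_attr]:
                        arrived[v] = t
                        new.add(v)
            frontier |= new
        for c in g:
            totals[c] += arrived.get(c, horizon)   # censored at the horizon
    return {c: totals[c] / runs for c in g}

# toy 4-country network with per-day importation probabilities on the edges
g = nx.DiGraph()
g.add_weighted_edges_from([("A", "B", 0.10), ("A", "C", 0.02),
                           ("B", "C", 0.05), ("C", "D", 0.01)], weight="p")
print(expected_arrival_times(g, origin="A", runs=500))
```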
global metapopulation models (gleam) that combine seir type dynamics with an airline network have also been used in the past for estimating importation risk. brockmann and helbing [ ] used a similar framework to quantify effective distance on the network, which seemed to be well correlated with the time of arrival for multiple past pandemics; this has been extended to covid- [ , ]. in [ ], the authors employ air travel volume obtained through iata from ten major cities across china to rank various countries, along with the idvi, to convey their vulnerability. [ ] consider the task of forecasting international and domestic spread of covid- and employ official airline group (oag) data for determining air traffic to various countries, and [ ] fit a generalized linear model for the observed number of cases in various countries as a function of air traffic volume obtained from oag data to determine countries with potential risk of under-detection. also, [ ] provide an africa-specific case study of vulnerability and preparedness using data from the civil aviation administration of china.

challenges: note that the arrival of an infected traveler will precede a local transmission event in a country, hence the former is more appropriate to quantify in the early stages. also, the formulation is agnostic to whether it is the first infected arrival or the first detected case. however, in the real world, the former is difficult to observe, while the latter is influenced by security measures at ports of entry (land, sea, air) and the ease of identification for the pathogen. for instance, in the case of covid- , the long incubation period and the high likelihood of asymptomaticity could have resulted in many infected travelers being missed by health checks at ports of entry. we also noticed potential administrative delays in reporting by multiple countries fearing travel restrictions.

as the epidemic takes root within a country, it may enter the acceleration phase. depending on the testing infrastructure and the agility of the surveillance system, response efforts might lag or lead the rapid growth in the case rate. under such a scenario, two crucial questions emerge that pertain to how the disease may spread spatially/socially and how the case rate may grow over time.

within the country, there is a need to model the spatial spread of the disease at different scales: state, county, and community levels. similar to importation risk, such models may provide an estimate of when cases may emerge in different parts of the country. when coupled with vulnerability indicators (socioeconomic, demographic, co-morbidities), they provide a framework for assessing the heterogeneous impact the disease may have across the country. detailed agent-based models for urban centers may help identify hotspots and potential case clusters that may emerge (e.g., correctional facilities, nursing homes, and food processing plants in the case of covid- ).

formulation: given a population representation p at the appropriate scale and a disease model m per entity (individual or sub-region), model the disease spread under different assumptions about the underlying connectivity c and disease parameters Θ. the result is a spatio-temporal spread model that produces z_{s,t}, the time series of disease states over time for each region s.
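before turning to the data needs and modeling approaches, the following is a minimal sketch of a structured metapopulation seir model of the kind referenced here, producing z_{s,t} (new infections per region per day); the mixing matrix, rates and seeding are synthetic assumptions, and real models are far more detailed.

```python
import numpy as np

def metapop_seir(N, M, beta, sigma, gamma, seed_region, days=100, seed=10):
    """discrete-time seir over k regions coupled by a row-stochastic mixing
    matrix M (M[i, j] = fraction of region i's contacts spent in region j).
    returns z[s, t]: new infections per region per day."""
    k = len(N)
    S, E, I, R = N.astype(float).copy(), np.zeros(k), np.zeros(k), np.zeros(k)
    S[seed_region] -= seed; I[seed_region] += seed
    z = np.zeros((k, days))
    for t in range(days):
        # prevalence experienced in each destination region, mixing residents and visitors
        prev = (M.T @ I) / (M.T @ N)
        lam = beta * (M @ prev)               # per-capita infection rate by home region
        new_e = np.minimum(S, lam * S)        # S -> E
        new_i = sigma * E                     # E -> I
        new_r = gamma * I                     # I -> R
        S, E, I, R = S - new_e, E + new_e - new_i, I + new_i - new_r, R + new_r
        z[:, t] = new_e
    return z

# toy example: 3 regions with mostly-local mixing
N = np.array([1e6, 5e5, 2e5])
M = np.array([[0.90, 0.08, 0.02], [0.10, 0.85, 0.05], [0.05, 0.05, 0.90]])
z = metapop_seir(N, M, beta=0.3, sigma=1/3, gamma=1/5, seed_region=0)
print(z.sum(axis=1))   # cumulative incidence per region
```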
data needs: some of the common datasets needed by most modeling approaches include: (i) social and spatial representation, which includes census and population data, available from census departments (see, e.g., [ ]) and landscan [ ], (ii) connectivity between regions (commuter, airline, road/rail/river), e.g., [ , ], (iii) data on locations, including points of interest, e.g., openstreetmap [ ], and (iv) activity data, e.g., the american time use survey [ ]. these datasets help capture where people reside and how they move around and come in contact with each other. while some of these are static, more dynamic measures, such as those from gps traces, become relevant as individuals change their behavior during a pandemic.

modeling approaches: different kinds of structured metapopulation models [ , [ ] [ ] [ ] [ ] ] and agent based models [ ] [ ] [ ] [ ] [ ] have been used in the past to model sub-national spread; we refer to [ , , ] for surveys on different modeling approaches. these models incorporate typical mixing patterns, which result from detailed activities and co-location (in the case of agent based models), and different modes of travel and commuting (in the case of metapopulation models).

challenges: while metapopulation models can be built relatively rapidly, agent based models are much harder: the datasets need to be assembled at a large scale, with detailed construction pipelines, see, e.g., [ ] [ ] [ ] [ ] [ ]. since detailed individual activities drive the dynamics in agent based models, schools and workplaces have to be modeled in order to make predictions meaningful. such models will get reused at different stages of the outbreak, so they need to be generic enough to incorporate dynamically evolving disease information. finally, a common challenge across modeling paradigms is the ability to calibrate to the dynamically evolving spatio-temporal data from the outbreak; this is especially challenging in the presence of reporting biases and data insufficiency issues.

given the early growth of cases within the country (or sub-region), there is a need to quantify the rate of increase in comparable terms across the duration of the outbreak (accounting for the exponential nature of such processes). these estimates also serve as references when evaluating the impact of various interventions. as an extension, such methods and more sophisticated time series methods can be used to produce short-term forecasts of disease evolution.

formulation: given the disease time series data within the country, z_{s,t}, up to a data horizon t, provide scale-independent growth rate measures g_s(t), and forecasts ẑ_{s,u} for u ∈ [t, t + ∆t], where ∆t is the forecast horizon.

data needs: models at this stage require datasets such as (i) time series data on different kinds of disease outcomes, including case counts, mortality, and hospitalizations, along with attributes such as age, gender and location, e.g., [ ] [ ] [ ] [ ] [ ], (ii) any associated data on reporting bias (total tests, test positivity rate) [ ], which need to be incorporated into the models, as these biases can have a significant impact on the dynamics, and (iii) exogenous regressors (mobility, weather), which have been shown to have a significant impact on other diseases, such as influenza, e.g., [ ].

modeling approaches: even before building statistical or mechanistic time series forecasting methods, one can derive insights through analytical measures of the time series data.
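one such analytical measure, the scale-independent exponential growth rate (and implied doubling time) g_s(t), can be estimated with a simple log-linear fit over a trailing window, as sketched below on a synthetic case series; the window length and noise model are assumptions for illustration.

```python
import numpy as np

def growth_and_doubling(cases, window=7):
    """exponential growth rate and doubling time estimated by a log-linear fit
    over the trailing window of daily new cases."""
    y = np.log(np.maximum(np.asarray(cases[-window:], dtype=float), 1.0))
    t = np.arange(window)
    r = np.polyfit(t, y, 1)[0]          # slope of log(cases) ~ growth rate per day
    doubling = np.log(2) / r if r > 0 else np.inf
    return r, doubling

# toy daily new-case series growing ~8% per day with multiplicative noise
rng = np.random.default_rng(3)
cases = 50 * np.exp(0.08 * np.arange(30)) * rng.lognormal(0, 0.05, 30)
r, d = growth_and_doubling(cases)
print(f"growth rate ~ {r:.3f}/day, doubling time ~ {d:.1f} days")

# a naive short-term forecast simply extrapolates the fitted exponential
print("next 7 days:", (cases[-1] * np.exp(r * np.arange(1, 8))).round(0))
```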
for instance, the effective reproductive number estimated from the time series [ ] can serve as a scale-independent metric to compare outbreaks across space and time. additionally, multiple statistical methods, ranging from autoregressive models to deep learning techniques, can be applied to the time series data, with additional exogenous variables as input. while such methods perform reasonably well for short-term targets, mechanistic approaches as described earlier can provide better long-term projections. various ensembling techniques have also been developed in the recent past to combine such multi-model forecasts into a single robust forecast with better uncertainty quantification. one such effort, which combines more than methods for covid- , can be found at the covid forecasting hub. we also point to the companion paper for more details on projection and forecasting models.

challenges: data on epidemic outcomes usually has a lot of uncertainties and errors, including missing data, collection bias, and backfill. for forecasting tasks, these time series data need to be near real-time; otherwise one needs to do nowcasting as well as forecasting. other exogenous regressors can provide valuable lead time, due to inherent delays in disease dynamics from exposure to case identification. such frameworks need to be generalized to accommodate qualitative inputs on future policies (shutdowns, mask mandates, etc.), as well as behaviors, as we discuss in the next section.

once the outbreak has taken hold within the population, local, state and national governments attempt to mitigate and control its spread by considering different kinds of interventions. unfortunately, as the covid- pandemic has shown, there is a significant delay in the time taken by governments to respond. as a result, this has caused a large number of cases, a fraction of which lead to hospitalizations. two key questions in this stage are: (i) how to evaluate different kinds of interventions and choose the most effective ones, and (ii) how to estimate the healthcare infrastructure demand and how to mitigate it. the effectiveness of an intervention (e.g., social distancing) depends on how individuals respond to it, and on the level of compliance. the health resource demand depends on the specific interventions which are implemented. as a result, both these questions are connected, and require models which incorporate appropriate behavioral responses.

in the initial stages, only non-prophylactic interventions are available, such as social distancing, school and workplace closures, and use of ppes, since no vaccinations or anti-virals are available. as mentioned above, such analyses are almost entirely model based, and the specific model depends on the nature of the intervention and the population being studied.

formulation: given a model, denoted abstractly as m, the general goals are (i) to evaluate the impact of an intervention (e.g., school and workplace closure, and other social distancing strategies) on different epidemic outcomes (e.g., average outbreak size, peak size, and time to peak), and (ii) to find the most effective intervention from a suite of interventions, under given resource constraints. the specific formulation depends crucially on the model and the type of intervention. even for a single intervention, evaluating its impact is quite challenging, since there are a number of sources of uncertainty, and a number of parameters associated with the intervention (e.g., when to start school closure, how long, and how to restart). therefore, finding uncertainty bounds is a key part of the problem.
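as a toy illustration of intervention evaluation with a mechanistic model, the sketch below compares a baseline epidemic against a scenario in which an npi reduces transmission from a given day, reporting peak size, time to peak and infections averted; the compartmental model, parameter values and the way the npi is encoded (a multiplicative reduction in the transmission rate) are simplifying assumptions.

```python
import numpy as np

def seir_with_npi(beta0, reduction, start_day, days=200,
                  N=1e6, sigma=1/3, gamma=1/5, seed=10):
    """simple seir in which an npi (e.g., workplace/school closure) is modeled
    as a multiplicative reduction of the transmission rate from a given day."""
    S, E, I, R = N - seed, 0.0, float(seed), 0.0
    new_inf = []
    for t in range(days):
        beta = beta0 * (1 - reduction) if t >= start_day else beta0
        ne, ni, nr = beta * S * I / N, sigma * E, gamma * I
        S, E, I, R = S - ne, E + ne - ni, I + ni - nr, R + nr
        new_inf.append(ni)
    return np.array(new_inf)

base = seir_with_npi(beta0=0.4, reduction=0.0, start_day=0)
npi = seir_with_npi(beta0=0.4, reduction=0.4, start_day=30)   # 40% contact reduction from day 30
for name, z in [("baseline", base), ("npi", npi)]:
    print(f"{name}: peak={z.max():.0f} on day {z.argmax()}, total={z.sum():.0f}")
print(f"infections averted: {base.sum() - npi.sum():.0f}")
```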
even for a single intervention, evaluating its impact is quite challenging, since there are a number of sources of uncertainty, and a number of parameters associated with the intervention (e.g., when to start school closure, how long, and how to restart). therefore, finding uncertainty bounds is a key part of the problem. data needs while all the data needs from the previous stages for developing a model are still there, representation of different kinds of behaviors is a crucial component of the models in this stage; this includes: use of ppes, compliance to social distancing measures, and level of mobility. statistics on such behaviors are available at a fairly detailed level (e.g., counties and daily) from multiple sources, such as ( ) the covid- impact analysis platform from the university of maryland [ ] , which gives metrics related to social distancing activities, including level of staying home, outside county trips, outside state trips, ( ) changes in mobility associated with different kinds of activities from google [ ] , and other sources, ( ) survey data on different kinds of behaviors, such as usage of masks [ ] . modeling approaches as mentioned above, such analyses are almost entirely model based, including structured metapopulation models [ , [ ] [ ] [ ] [ ] , and agent based models [ ] [ ] [ ] [ ] [ ] . different kinds of behaviors relevant to such interventions, including compliance with using ppes and compliance to social distancing guidelines, need to be incorporated into these models. since there is a great deal of heterogeneity in such behaviors, it is conceptually easiest to incorporate them into agent based models, since individual agents are represented. however, calibration, simulation and analysis of such models pose significant computational challenges. on the other hand, the simulation of metapopulation models is much easier, but such behaviors cannot be directly represented-instead, modelers have to estimate the effect of different behaviors on the disease model parameters, which can pose modeling challenges. challenges there are a number of challenges in using data on behaviors, which depends on the specific datasets. much of the data available for covid- is estimated through indirect sources, e.g., through cell phone and online activities, and crowd-sourced platforms. this can provide large spatio-temporal datasets, but have unknown biases and uncertainties. on the other hand, survey data is often more reliable, and provides several covariates, but is typically very sparse. handling such uncertainties, rigorous sensitivity analysis, and incorporating the uncertainties into the analysis of the simulation outputs are important steps for modelers. the covid- pandemic has led to a significant increase in hospitalizations. hospitals are typically optimized to run near capacity, so there have been fears that the hospital capacities would not be adequate, especially in several countries in asia, but also in some regions in the us. nosocomial transmission could further increase this burden. formulation the overall problem is to estimate the demand for hospital resources within a populationthis includes the number of hospitalizations, and more refined types of resources, such as icus, ccus, medical personnel and equipment, such as ventilators. 
an important issue is whether the capacity of hospitals within the region would be overrun by the demand, when this is expected to happen, and how to design strategies to meet the demand; this could be through augmenting the capacities at existing hospitals, or building new facilities. timing is of the essence, and projections of when the demands exceed capacity are important for governments to plan.

the demands for hospitalization and other health resources can be estimated from the epidemic models mentioned earlier, by incorporating suitable health states, e.g., [ , ]; in addition to the inputs needed for setting up the models for case counts, datasets are needed for hospitalization rates and the durations of hospital stay, icu care, and ventilation. the other important inputs for this component are hospital capacities and the referral regions (which represent where patients travel for hospitalization). different public and commercial datasets provide such information, e.g., [ , ].

modeling approaches: demand for health resources is typically incorporated into both metapopulation and agent based models by having a fraction of the infectious individuals transition into a hospitalization state. an important issue to consider is what happens if there is a shortage of hospital capacity. studying this requires modeling the hospital infrastructure, i.e., the different kinds of hospitals within the region, and which hospital a patient goes to. there is typically limited data on this, and data on hospital referral regions, or a voronoi tessellation, can be used. understanding the regimes in which hospital demand exceeds capacity is an important question to study. nosocomial transmission is typically much harder to study, since it requires more detailed modeling of processes within hospitals.

challenges: there is a lot of uncertainty and variability in all the datasets involved in this process, making modeling difficult. for instance, forecasts of the number of cases and hospitalizations have huge uncertainty bounds over medium or long term horizons, which is the kind of input necessary for understanding hospital demands and whether there would be any deficits.

the suppression stage involves methods to control the outbreak, including reducing the incidence rate and potentially leading to the eradication of the disease in the end. eradication in the case of covid- appears unlikely as of now; what is more likely is that it will become part of the seasonal human coronaviruses that mutate continuously, much like the influenza virus.

the contact tracing problem refers to the ability to trace the neighbors of an infected individual. ideally, if one is successful, each neighbor of an infected individual would be identified and isolated from the larger population to reduce the growth of the pandemic. in some cases, each such neighbor could be tested to see if the individual has contracted the disease. contact tracing is a workhorse of epidemiology and has been immensely successful in controlling slow moving diseases. when combined with vaccination and other pharmaceutical interventions, it provides the best way to control and suppress an epidemic.

formulation: the basic contact tracing problem is stated as follows: given a social contact network g(v, e), a subset of nodes s ⊂ v that are infected, and a subset s' ⊆ s of nodes identified as infected, find all neighbors of s'. here a neighbor means an individual who is likely to have had substantial contact with the infected person. a minimal computational sketch of this step is given below.
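the sketch below illustrates the basic step referenced above: given a contact network and the identified subset s', collect the contacts n(s') to be followed up; the toy network and the choice of s and s' are arbitrary assumptions.

```python
import networkx as nx

def trace_contacts(g, identified):
    """return, for each identified case, its contacts in the social contact
    network, plus the overall set n(s') to be notified/tested/isolated."""
    per_case = {v: set(g.neighbors(v)) for v in identified if v in g}
    to_notify = set().union(*per_case.values()) - set(identified) if per_case else set()
    return per_case, to_notify

# toy contact network; s' is the subset of infections that were actually identified
g = nx.erdos_renyi_graph(50, 0.08, seed=4)
infected = {1, 5, 9, 20}          # s  (unknown in practice)
identified = {1, 9}               # s' (known to the health department)
per_case, to_notify = trace_contacts(g, identified)
print(len(to_notify), "contacts to follow up;",
      len(to_notify & infected), "of them are actually infected")
```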
one then tests these neighbors (if tests are available) and, following that, isolates them, vaccinates them, or administers anti-virals. the measures of effectiveness for the problem include: (i) maximizing the size of s'; (ii) maximizing the size of the set n(s') ⊆ n(s), i.e. the potential number of neighbors of the set s'; (iii) doing this within a short period of time so that these neighbors either do not become infectious, or the number of days that they are infectious while still interacting in the community in a normal manner is minimized; (iv) the eventual goal is to try and reduce the incidence rate in the community; thus, if all the neighbors of s cannot be identified, one aims to identify those individuals who, when isolated/treated, lead to a large impact; (v) finally, verifying that these individuals indeed came in contact with the infected individuals and thus can be asked to isolate or be treated.

data needs: data needed for the contact tracing problem includes: (i) a line list of individuals who are currently known to be infected (this is needed in the case of human-based contact tracing). in the real world, when carrying out a deployment based on human contact tracers, one interviews all the individuals who are known to be infectious and reaches out to their contacts.

modeling approaches: human contact tracing is routinely done in epidemiology. most states in the us have hired such contact tracers. they obtain the daily incidence report from the state health departments and then proceed to contact the individuals who are confirmed to be infected. earlier, human contact tracers used to go from house to house and identify the potential neighbors through a well defined interview process. although very effective, it is very time consuming and labor intensive. phones have been used extensively in the last - years, as they allow the contact tracers to reach individuals. they are helpful but have the downside that it might be hard to reach all individuals. during the covid- outbreak, for the first time, societies and governments have considered and deployed digital contact tracing tools [ ] [ ] [ ] [ ] [ ] . these can be quite effective but also have certain weaknesses, including privacy, accuracy, and the limited market penetration of the digital apps.

challenges: these include: (i) the inability to identify everyone who is infectious (the set s); this is virtually impossible for a disease like covid- unless the incidence rate has come down drastically, and because many individuals are infected but asymptomatic; (ii) identifying all contacts of s (or s'); this is hard since individuals cannot recall everyone they met, and certain people they were in close proximity to might have been in stores or at social events and thus not known to the individuals in the set s. furthermore, even if a person is able to identify the contacts, it is often hard to reach all the individuals due to resource constraints (each human tracer can only contact a small number of individuals).

the overall goal of the vaccine allocation problem is to allocate vaccines efficiently and in a timely manner to reduce the overall burden of the pandemic.

formulation: the basic version of the problem can be cast in a very simple manner (for networked models): given a graph g(v, e) and a budget b on the number of vaccines available, find a set s of size b to vaccinate so as to optimize a certain measure of effectiveness.
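as a toy illustration of this budgeted selection problem, and only a heuristic rather than one of the optimization approaches discussed next, the sketch below greedily vaccinates the b highest-degree nodes of a hypothetical contact network; real allocation schemes would instead optimize simulated outcomes and account for equity, hesitancy, efficacy, and supply schedules.

```python
# heuristic sketch of the vaccine allocation formulation: choose a set s of
# size b (the vaccine budget) to vaccinate, here simply the highest-degree
# nodes of a toy contact network. this is an illustrative assumption,
# not an optimal or recommended allocation method.

def greedy_degree_allocation(adjacency, budget):
    """Return the `budget` nodes with the most contacts."""
    ranked = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    return set(ranked[:budget])

if __name__ == "__main__":
    adjacency = {
        "a": {"b", "c", "d"},
        "b": {"a", "c"},
        "c": {"a", "b", "e"},
        "d": {"a"},
        "e": {"c"},
    }
    b = 2  # hypothetical vaccine budget
    print(greedy_degree_allocation(adjacency, b))  # e.g. {'a', 'c'}
```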
the measure of effectiveness can be (i) minimizing the total number of individuals infected (or maximizing the total number of uninfected individuals); (ii) minimizing the total number of deaths (or maximizing the total number of deaths averted); (iii) optimizing the above quantities but keeping in mind certain equity and fairness criteria (across socio-demographic groups, e.g. age, race, income); (iv) taking into account vaccine hesitancy of individuals; (v) taking into account the fact that all vaccines are not available at the start of the pandemic, and when they become available, one gets limited number of doses each month; (vi) deciding how to share the stockpile between countries, state, and other organizations; (vii) taking into account efficacy of the vaccine. data needs as in other problems, vaccine allocation problems need as input a good representation of the system; network based, meta-population based and compartmental mass action models can be used. one other key input is the vaccine budget, i.e., the production schedule and timeline, which serves as the constraint for the allocation problem. additional data on prevailing vaccine sentiment and past compliance to seasonal/neonatal vaccinations are useful to estimate coverage. modeling approaches the problem has been studied actively in the literature; network science community has focused on optimal allocation schemes, while public health community has focused on using meta-population models and assessing certain fixed allocation schemes based on socio-economic and demographic considerations. game theoretic approaches that try and understand strategic behavior of individuals and organization has also been studied. challenges the problem is computationally challenging and thus most of the time simulation based optimization techniques are used. challenge to the optimization approach comes from the fact that the optimal allocation scheme might be hard to compute or hard to implement. other challenges include fairness criteria (e.g. the optimal set might be a specific group) and also multiple objectives that one needs to balance. while the above sections provide an overview of salient modeling questions that arise during the key stages of a pandemic, mathematical and computational model development is equally if not more important as we approach the post-pandemic (or more appropriately inter-pandemic) phase. often referred to as peace time efforts, this phase allows modelers to retrospectively assess individual and collective models on how they performed during the pandemic. in order to encourage continued development and identifying data gaps, synthetic forecasting challenge exercises [ ] may be conducted where multiple modeling groups are invited to forecast synthetic scenarios with varying levels of data availability. another set of models that are quite relevant for policymakers during the winding down stages, are those that help assess overall health burden and economic costs of the pandemic. epideep: exploiting embeddings for epidemic forecasting an arima model to forecast the spread and the final size of covid- epidemic in italy (first version on ssrn march) real-time epidemic forecasting: challenges and opportunities accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the u.s realtime forecasting of infectious disease dynamics with a stochastic semi-mechanistic model healthmap the use of social media in public health surveillance. 
western pacific surveillance and response journal : wpsar the effect of travel restrictions on the spread of the novel coronavirus (covid- ) outbreak basic prediction methodology for covid- : estimation and sensitivity considerations. medrxiv covid- outbreak on the diamond princess cruise ship: estimating the epidemic potential and effectiveness of public health countermeasures impact of non-pharmaceutical interventions (npis) to reduce covid mortality and healthcare demand. imperial college technical report modelling disease outbreaks in realistic urban social networks computational epidemiology forecasting covid- impact on hospital bed-days, icudays, ventilator-days and deaths by us state in the next months open data resources for fighting covid- data-driven methods to monitor, model, forecast and control covid- pandemic: leveraging data science, epidemiology and control theory covid- datasets: a survey and future challenges. medrxiv mathematical modeling of epidemic diseases the use of mathematical models to inform influenza pandemic preparedness and response mathematical models for covid- pandemic: a comparative analysis updated preparedness and response framework for influenza pandemics novel framework for assessing epidemiologic effects of influenza epidemics and pandemics covid- pandemic planning scenarios epidemiological data from the covid- outbreak, real-time case information covid- case surveillance public use data -data -centers for disease control and prevention covid- patients' clinical characteristics, discharge rate, and fatality rate of meta-analysis estimating the generation interval for coronavirus disease (covid- ) based on symptom onset data the incubation period of coronavirus disease (covid- ) from publicly reported confirmed cases: estimation and application estimating clinical severity of covid- from the transmission dynamics in wuhan, china the use and reporting of airline passenger data for infectious disease modelling: a systematic review flight cancellations related to -ncov (covid- ) the hidden geometry of complex, network-driven contagion phenomena potential for global spread of a novel coronavirus from china forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study. the lancet using predicted imports of -ncov cases to determine locations that may not be identifying all imported cases. medrxiv preparedness and vulnerability of african countries against introductions of -ncov. medrxiv creating synthetic baseline populations openstreetmap american time use survey multiscale mobility networks and the spatial spreading of infectious diseases optimizing spatial allocation of seasonal influenza vaccine under temporal constraints assessing the international spreading risk associated with the west african ebola outbreak spread of zika virus in the americas structure of social contact networks and their impact on epidemics generation and analysis of large synthetic social contact networks modelling disease outbreaks in realistic urban social networks containing pandemic influenza at the source report : impact of non-pharmaceutical interventions (npis) to reduce covid mortality and healthcare demand the structure and function of complex networks a public data lake for analysis of covid- data midas network. 
midas novel coronavirus repository covid- ) data in the united states covid- impact analysis platform covid- surveillance dashboard the covid tracking project absolute humidity and the seasonal onset of influenza in the continental united states epiestim: a package to estimate time varying reproduction numbers from epidemic curves. r package version google covid- community mobility reports mask-wearing survey data impact of social distancing measures on coronavirus disease healthcare demand, central texas, usa current hospital capacity estimates -snapshot total hospital bed occupancy quantifying the effects of contact tracing, testing, and containment covid- epidemic in switzerland: on the importance of testing, contact tracing and isolation quantifying sars-cov- transmission suggests epidemic control with digital contact tracing isolation and contact tracing can tip the scale to containment of covid- in populations with social distancing. available at ssrn privacy sensitive protocols and mechanisms for mobile contact tracing the rapidd ebola forecasting challenge: synthesis and lessons learnt acknowledgments. the authors would like to thank members of the biocomplexity covid- response team and network systems science and advanced computing (nssac) division for their thoughtful comments and suggestions related to epidemic modeling and response support. we thank members of the biocomplexity institute and initiative, university of virginia for useful discussion and suggestions. this key: cord- - yay kq authors: sun, chenxi; hong, shenda; song, moxian; li, hongyan title: a review of deep learning methods for irregularly sampled medical time series data date: - - journal: nan doi: nan sha: doc_id: cord_uid: yay kq irregularly sampled time series (ists) data has irregular temporal intervals between observations and different sampling rates between sequences. ists commonly appears in healthcare, economics, and geoscience. especially in the medical environment, the widely used electronic health records (ehrs) have abundant typical irregularly sampled medical time series (ismts) data. developing deep learning methods on ehrs data is critical for personalized treatment, precise diagnosis and medical management. however, it is challenging to directly use deep learning models for ismts data. on the one hand, ismts data has the intra-series and inter-series relations. both the local and global structures should be considered. on the other hand, methods should consider the trade-off between task accuracy and model complexity and remain generality and interpretability. so far, many existing works have tried to solve the above problems and have achieved good results. in this paper, we review these deep learning methods from the perspectives of technology and task. under the technology-driven perspective, we summarize them into two categories - missing data-based methods and raw data-based methods. under the task-driven perspective, we also summarize them into two categories - data imputation-oriented and downstream task-oriented. for each of them, we point out their advantages and disadvantages. moreover, we implement some representative methods and compare them on four medical datasets with two tasks. finally, we discuss the challenges and opportunities in this area. time series data have been widely used in practical applications, such as health [ ] , geoscience [ ] , sales [ ] , and traffic [ ] . 
the popularity of time series prediction, classification, and representation has attracted increasing attention, and many efforts have been made to address these problems in the past few years [ , , , ] . the majority of models assume that the time series data is evenly spaced and complete. however, in the real world, time series observations usually have non-uniform time intervals between successive measurements. three reasons can cause this characteristic: ) missing data exists in the time series due to broken sensors, failed data transmissions or damaged storage; ) the sampling machine itself does not have a constant sampling rate; ) different time series usually come from different sources that have various sampling rates. we call such data irregularly sampled time series (ists) data. ists data naturally occurs in many real-world domains, such as weather/climate [ ] , traffic [ ] , and economics [ ] . in the medical environment, irregularly sampled medical time series (ismts) data is abundant. the widely used electronic health records (ehrs) contain a large amount of ismts data. ehrs are the real-time, patient-centered digital version of patients' paper charts. ehrs can provide more opportunities to develop advanced deep learning methods to improve healthcare services and save more lives by assisting clinicians with diagnosis, prognosis, and treatment [ ] . many works based on ehrs data have achieved good results, such as mortality risk prediction [ , ] , disease prediction [ , , ] , concept representation [ , ] and patient typing [ , , ] . due to the special characteristics of ismts, the most important step is establishing suitable models for it. however, this is especially challenging in medical settings. various tasks need different adaptation methods. data imputation and prediction are two main tasks. the data imputation task is a processing task when modeling the data, while the prediction task is a downstream task for the final goal. the two types of tasks may be intertwined. standard techniques, such as mean imputation [ ] , singular value decomposition (svd) [ ] and k-nearest neighbour (knn) [ ] , can impute data, but they still leave a large gap between the calculated data distribution and the real one, and they have no ability to perform downstream tasks such as mortality prediction. linear regression (lr) [ ] , random forest (rf) [ ] and support vector machines (svm) [ ] can predict, but fail for ists data. state-of-the-art deep learning architectures have been developed to perform not only supervised tasks but also unsupervised ones that relate to both the imputation and prediction tasks. recurrent neural networks (rnns) [ , , ] , auto-encoders (ae) [ , ] and generative adversarial networks (gans) [ , ] have achieved good performance in medical data imputation and medical prediction thanks to the learning and generalization abilities obtained from their complex nonlinearity. they can carry out the prediction and imputation tasks separately, or carry out both at the same time by splicing neural network structures together. different understandings of the characteristics of ismts data appear in existing deep learning methods. we summarize them as the missing data-based perspective and the raw data-based perspective. the first perspective [ , , , , ] treats irregular series as having missing data, and solves the problem through more accurate data calculation. the second perspective [ , , , , ] focuses on the structure of the raw data itself.
these methods model ismts directly by utilizing the irregular time information. neither view can defeat the other. either way, it is necessary to grasp the data relations comprehensively for more effective modeling. we identify two relations of ismts: intra-series relations (data relations within a time series) and inter-series relations (data relations between different time series). all the existing works model one or both of them. they relate to the local and global structures of the data, which we will introduce in section . besides, different ehr datasets may lead to different performance of the same method. for example, the real-world mimic-iii [ ] and cinc [ ] datasets record multiple different diseases. the records of different diseases have distinct data characteristics, and the prediction results of general methods [ , , , ] vary between disease datasets. thus, many existing methods model a specific disease record, such as sepsis [ ] , atrial fibrillation [ , ] and kidney disease [ ] , and have improved the prediction accuracy.

the rest of the paper is organized as follows. section gives the basic definitions and abbreviations. section describes the features of ismts from two viewpoints: intra-series and inter-series. section and section introduce the related works from the technology-driven perspective and the task-driven perspective. in each perspective, we summarize the methods into specific categories and analyze their merits and demerits. section compares experiments of some methods on four medical datasets with two tasks. finally, we discuss the challenges and opportunities in this area. the summary of abbreviations is in table .

a typical ehr dataset consists of a number of patient records, which include demographic information and in-hospital information. in-hospital information has a hierarchical patient-admission-code form, shown in figure . each patient has a certain number of admission records, as he/she may be in hospital several times. the codes include diagnoses, lab values and vital sign measurements. each record r_i consists of many codes, including a static diagnosis code set d_i and a dynamic vital sign code set x_i. each code has a time stamp t. ehrs have many ismts because of two aspects: ) multiple admissions of one patient and ) multiple time series records in one admission. the multiple admission records of each patient have different time stamps. because of health status dynamics and some unpredictable reasons, a patient will visit hospitals at varying intervals [ ] . for example, in figure , march , , july , and february , are patient admission times. the time interval between the st admission and the nd admission is a couple of months, while the time interval between admissions and is years. each time series, like blood pressure in one admission, also has different time intervals. as shown for admission in figure , the sampling times are not fixed. different physiological variables are examined at different times due to changes in symptoms; not every possible test is regularly measured during an admission. when a certain symptom worsens, the corresponding variables are examined more frequently; when the symptom disappears, the corresponding variables are no longer examined. without loss of generality, we only discuss univariate time series; multivariate time series can be modeled in the same way. the definition illustrates three important elements of ismts: the value x, the time t and the time interval δ.
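as a concrete (hypothetical) illustration of these three elements, the sketch below represents one univariate ismts as parallel lists of values and time stamps and derives the irregular time intervals δ from the time stamps; the class name and the example numbers are illustrative assumptions, not part of the surveyed work.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UnivariateISMTS:
    """One irregularly sampled series: values x, observation times t, intervals delta."""
    values: List[float]        # observed values x
    times: List[float]         # observation time stamps t (e.g., hours since admission)
    intervals: List[float] = field(init=False)

    def __post_init__(self):
        # delta_i = t_i - t_{i-1}; the first interval is conventionally 0 here
        self.intervals = [0.0] + [b - a for a, b in zip(self.times, self.times[1:])]

# hypothetical blood-pressure series sampled at irregular hours
bp = UnivariateISMTS(values=[118.0, 121.0, 99.0, 110.0], times=[0.0, 1.0, 7.0, 19.0])
print(bp.intervals)  # [0.0, 1.0, 6.0, 12.0]
```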
in some missing value-based works (which we will introduce in section ), a masking vector m ∈ { , } is used to represent whether a value is missing.

characteristics of irregularly sampled medical time series: the medical measurements are frequently correlated both within streams and across streams. for example, the value of the blood pressure of a patient at a given time could be correlated with the blood pressure at other times, and it could also have a relation with the heart rate at that given time. thus, we will introduce the irregularity of ismts in two aspects: ) intra-series and ) inter-series.

intra-series irregularity is the irregular time intervals between neighboring observations within a stream. for example, as shown in figure , the blood pressure time series has different time intervals, such as hour, hours, and even hours. the time intervals add a time sparsity factor when the intervals between observations are large [ ] . two existing approaches can handle the irregular time interval problem: ) determining a fixed interval and treating the time points without data as missing data; ) directly modeling the time series, treating the irregular time intervals as information. the first approach requires a function to impute the missing data [ , ] . for example, some rnns [ , , , , , ] can impute the sequence data effectively by considering the order dependency. the second approach usually uses the irregular time intervals as inputs. for example, some rnns [ , ] apply time decay to affect the order dependency, which can weaken the relation between neighbors with long time intervals.

inter-series irregularity is mainly reflected in the multiple sampling rates among different time series. for example, as shown in figure , vital signs such as heart rate (ecg data) have a high sampling rate (in seconds), while lab results such as ph (ph data) are measured infrequently (in days) [ , ] . two existing approaches can handle the multi-sampling rate problem: ) considering the data as a multivariate time series; ) processing multiple univariate time series separately. the first approach aligns the variables of different series in the same dimension and then solves the missing data problem [ ] . the second approach models different time series simultaneously and then designs fusion methods [ ] .

numerous related works are capable of modeling ismts data; we categorize them from two perspectives: ) technology-driven and ) task-driven. we will describe each category in detail. from the technology-driven perspective, we divide the existing works into two categories: ) the missing data-based perspective and ) the raw data-based perspective. the specific categories are shown in figure . the missing data-based perspective regards every time series as having uniform time intervals; the time points without data are considered to be missing data points. as shown in figure a , when converting irregular time intervals to regular time intervals, missing data shows up. the missing rate r_missing can measure the degree of missingness at a given sampling rate r_sampling: r_missing = (# of time points with missing data) / (# of time points). the ismts in real-world ehrs have a severe problem with missing data. for example, luo et al. [ ] gathered statistics on the cinc dataset [ , ] . over time, the maximum missing rate at each timestamp is always higher than %. most variables' missing rates are above %, and the mean missing rate is . %, as shown in figure a .
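the following sketch, with hypothetical numbers, shows how discretizing an irregular series onto a fixed grid produces a masking vector m and how the missing rate defined above is computed from it; the binning rule (keep the last value falling in a bin) is an assumption made for illustration.

```python
import math

def discretize(times, values, interval, horizon):
    """Bin an irregular series onto a fixed grid; return (grid_values, mask)."""
    n_bins = math.ceil(horizon / interval)
    grid, mask = [float("nan")] * n_bins, [0] * n_bins
    for t, v in zip(times, values):
        idx = min(int(t // interval), n_bins - 1)
        grid[idx], mask[idx] = v, 1   # keep the last value that falls in the bin
    return grid, mask

def missing_rate(mask):
    """r_missing = (# of time points with missing data) / (# of time points)."""
    return sum(1 for m in mask if m == 0) / len(mask)

# hypothetical heart-rate observations over a 12-hour window, 1-hour bins
times, values = [0.5, 2.2, 2.8, 9.0], [82, 90, 88, 76]
grid, mask = discretize(times, values, interval=1.0, horizon=12.0)
print(mask)                 # [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(missing_rate(mask))   # 0.75
```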
the other three real-world ehr datasets, the mimic-iii dataset [ ] , the cinc dataset [ , ] , and the covid- dataset [ ] , are also affected by missing data, as shown in figure b , c, and d. in this viewpoint, existing methods either impute the missing data or model the missing data information directly.

the raw data-based perspective uses the irregular data directly. these methods do not fill in missing data to make the irregular sampling regular; on the contrary, they treat the irregular time itself as valuable information. as shown in figure b , the times are still irregular and the time intervals are recorded. irregular time intervals and multiple sampling rates are the intra-series characteristic and the inter-series characteristic introduced in section , respectively. they are very common phenomena in ehr databases. for example, the cinc dataset is relatively clean but still has more than % of samples with irregular time intervals, and only . % of samples have the same sampling rate in the mimic-iii dataset. in this viewpoint, methods usually integrate the features of the varied time intervals into the model inputs, or design models which can process samples with different sampling rates.

the methods of the missing data-based perspective convert ismts into equally spaced data. they [ , , ] discretize the time axis into non-overlapping intervals of hand-designed length; the missing data then shows up. the missing values damage the temporal dependencies of sequences [ ] and make it infeasible to directly apply many existing models, such as linear regression [ ] and recurrent neural networks (rnns) [ ] . as shown in figure , because of missing values, the second valley of the blue signal is not observed and cannot be inferred by simply relying on existing basic models [ , ] . but the valley values of blood pressure are significant for icu patients as an indicator of sepsis [ ] , a leading cause of patient mortality in the icu [ ] . thus, missing values have an enormous impact on data quality, resulting in unstable predictions and other unpredictable effects [ ] . many prior efforts have been dedicated to models that can handle missing values in time series, and they can be divided into two categories: ) two-step approaches and ) end-to-end approaches.

two-step approaches ignore or impute missing values and then process downstream tasks based on the preprocessed data. a simple solution is to omit the missing data and perform analysis only on the observed data, but this can result in a large amount of useful data not being available [ ] . the core of these methods is how to impute the missing data. some basic methods are dedicated to filling in the values, such as smoothing, interpolation [ ] , and splines [ ] , but they cannot capture the correlations between variables and complex patterns. other methods estimate the missing values by spectral analysis [ ] , kernel methods [ ] , and the expectation-maximization (em) algorithm [ ] . however, simple designs and the necessary model assumptions make such data imputation inaccurate. recently, with the vigorous development of deep learning, deep methods have achieved higher accuracy than traditional methods. the deep learning-based data imputation methods are mainly realized with rnns and gans. a substantial literature uses rnns to impute the missing data in ismts. rnns take sequence data as input; recursion occurs in the direction of sequence evolution, and all units are chained together. this special structure enables them to process sequence data by learning the order dynamics.
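before turning to the specific recurrent architectures below, here is a minimal numpy sketch of the generic recurrent state update that the next paragraph describes; it is a textbook formulation with hypothetical dimensions and random weights, not any particular model from the survey.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN update: the new state depends on the previous state and the current input."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
hidden, n_features = 8, 3                       # hypothetical sizes
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_x = rng.normal(scale=0.1, size=(hidden, n_features))
b = np.zeros(hidden)

h = np.zeros(hidden)
sequence = rng.normal(size=(5, n_features))     # a toy length-5 input sequence
for x_t in sequence:
    h = rnn_step(h, x_t, W_h, W_x, b)
print(h.shape)  # (8,)
```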
in an rnn, the current state h_t is affected by the previous state h_{t-1} and the current input x_t, and can be described as h_t = f(h_{t-1}, x_t) for a nonlinear update function f. rnns can integrate basic methods, such as em [ ] and the linear model (lr) [ ] . these methods first estimate the missing values and then use the re-constructed data streams as inputs to a standard rnn. however, em imputes the missing values by using only the synchronous relationships across data streams (inter-series relations) but not the temporal relationships within streams (intra-series relations). lr interpolates the missing values by using only the temporal relationships within each stream (intra-series relations) but ignores the relationships across streams (inter-series relations). meanwhile, most of the rnn-based imputation methods, like the simple recurrent network (srn) and lstm, which have been proved effective for imputing medical data by kim et al. [ ] , also learn an incomplete relation by considering intra-series relations only. chu et al. [ ] noticed the difference between these two relations in ismts data and designed the multi-directional recurrent neural network (m-rnn) for both imputation and interpolation. m-rnn operates forward and backward in the intra-series directions according to an interpolation block and operates across the inter-series directions by an imputation block. they implemented imputation with a bi-rnn structure denoted as the function Φ and implemented interpolation with fully connected layers with the function Ψ. the final objective function is the mean squared error between the real data and the calculated data, where x, m and δ represent the data value, masking and time interval we have defined in ; we will not repeat them below. bi-rnn is the bidirectional rnn [ ] . it is an advanced rnn structure with forward and backward rnn chains, and it has two hidden states for each time point, one for each of the two orders. the two hidden states are concatenated or summed into the final value at that time point. unlike the basic bi-rnn, the timing of inputs into the hidden layers of m-rnn is lagged in the forward direction and advanced in the backward direction. however, in m-rnn, the relations between missing variables are dropped, and the estimated values are treated as constants which cannot be sufficiently updated. to solve this problem, cao et al. [ ] proposed bidirectional recurrent imputation for time series (brits) to predict missing values with bidirectional recurrent dynamics. in this model, the missing values are regarded as variables in the model graph and get delayed gradients in both the forward and backward directions with consistency constraints, which makes the estimation of missing values more accurate. it updates the predicted missing data with a combined objective function l of three errors: the error of the history-based estimation x̂, the feature-based estimation ẑ and the combined estimation ĉ. this not only considers the relations between missing data and known data, but also models the relations between missing values ignored by m-rnn. however, brits did not take both inter-series and intra-series relations into account, which m-rnn did. gans are a type of deep learning model that trains generative models through an adversarial process [ ] . from the perspective of game theory, gan training can be seen as a minimax two-player game [ ] between a generator g and a discriminator d with the objective function min_g max_d v(d, g) = E_x[log d(x)] + E_z[log(1 - d(g(z)))]. however, typical gans require fully observed data during training. in response to this, yoon et al. [ ] proposed the generative adversarial imputation nets (gain) model.
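the next paragraph details gain; as a bridge, the sketch below shows, in simplified numpy form with hypothetical shapes, two ingredients that such masked imputers rely on: combining observed values, noise and a mask into a generator input, and scoring the reconstruction only on the observed entries. this is an illustrative simplification, not the authors' implementation.

```python
import numpy as np

def generator_input(x, mask, noise):
    """Keep observed entries, place noise where the mask marks data as missing."""
    return mask * x + (1.0 - mask) * noise

def observed_mse(x, x_hat, mask):
    """Reconstruction error computed only where data was actually observed."""
    return float(np.sum(mask * (x - x_hat) ** 2) / np.sum(mask))

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 6))                      # toy data matrix
mask = (rng.random((4, 6)) > 0.3).astype(float)  # 1 = observed, 0 = missing
noise = rng.normal(size=(4, 6))

g_in = generator_input(np.nan_to_num(x * mask), mask, noise)
x_hat = g_in                                     # stand-in for a generator's output
print(round(observed_mse(x, x_hat, mask), 4))    # 0.0 on the observed entries here
```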
different from the standard gan, its generator receives both noise z and the mask m as input data; the masking mechanism makes it possible to take data with missing values as input. gain's discriminator outputs both real and fake components. meanwhile, a hint mechanism h gives the discriminator some additional information in the form of a hint vector, and gain adapts the objective min_g max_d v(d, g) of the basic gan to this masked setting. to improve gain, camino et al. [ ] used multiple inputs and multiple outputs for the generator and the discriminator. the method performs variable splitting by using dense layers connected in parallel for each variable. zhang et al. [ ] designed a stackelberg gan based on gain to impute missing medical data with computational efficiency. the stackelberg gan can generate more diverse imputed values by using multiple generators instead of a single generator and applying the ensemble of all pairs of standard gan losses.

the main goal of the above two-step methods is to estimate the missing values in the converted time series of ismts (converting irregularly sampled features into missing-data features). however, in the medical setting, the ultimate goal is to carry out medical tasks such as mortality prediction [ , ] and patient subtyping [ , , ] . two separate steps may lead to suboptimal analyses and predictions [ ] , as the missing patterns are not effectively explored for the final tasks. thus, some research proposes solving the downstream tasks directly, rather than filling in missing values.

end-to-end approaches process the downstream tasks directly, based on modeling the time series with missing data. the core objective is to predict, classify, or cluster; data imputation is an auxiliary task, or not even a task, in this type of method. lipton et al. [ ] demonstrated a simple strategy: using a basic rnn model to cope with missing data in sequential inputs, with the output of the rnn being the final features for prediction. then, to improve this basic idea, they addressed the task of multilabel classification of diagnoses given clinical time series and found that rnns can make remarkable use of binary indicators for missing data, improving auc and f significantly. thus, rather than approaching missing data by heuristic imputation, they directly model missingness as a first-class feature in the new work [ ] . similarly, che et al. [ ] also use the rnn idea to predict medical issues directly. to solve the missing data problem, they designed a masking vector as an indicator for missing data. in this approach, the value x, the time interval δ and the masking m together impute the missing data x*. it first replaces missing data with the mean values and then uses a feedback loop to update the imputed values, which are the input to a standard rnn for prediction. meanwhile, they proposed gru-decay (gru-d) to model ehrs data for medical predictions with trainable decays. the decay rate γ weighs the correlation between the missing data x_t and other data (the previous observed value and the mean value). meanwhile, in this research, the authors plotted the pearson correlation coefficients between the variable missing rates of the mimic-iii dataset and the labels. they observed that the missing rate is correlated with the labels, demonstrating the usefulness of missingness patterns in solving a prediction task. however, the above models [ , , , , ] are limited to using the local information (empirical mean or the nearest observation) of ismts.
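the following sketch illustrates, in simplified numpy form, the kind of time-interval decay and local-statistics blending just described for gru-d; the exponential-of-negative-relu form and the specific weights are assumptions for illustration rather than the exact published parameterization.

```python
import numpy as np

def decay(delta, w, b):
    """Trainable decay rate gamma in (0, 1]: larger time gaps give smaller gamma."""
    return np.exp(-np.maximum(0.0, w * delta + b))

def impute(last_obs, empirical_mean, delta, w=0.5, b=0.0):
    """Blend the last observation and the variable's mean according to the decay."""
    gamma = decay(delta, w, b)
    return gamma * last_obs + (1.0 - gamma) * empirical_mean

# hypothetical numbers: a heart-rate value last seen 1h vs 12h ago, mean = 80
print(round(impute(100.0, 80.0, delta=1.0), 2))   # close to the last observation
print(round(impute(100.0, 80.0, delta=12.0), 2))  # pulled toward the mean
```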
for example, gru-d assumed that a missing variable could be represented as the combination of its corresponding last observed value and the mean value. the global structure and statistics are not directly considered, and the local statistics are unreliable when data are missing continuously (shown in figure ) or when the missing rate rises. tang et al. [ ] recognized this problem and designed lgnet, exploring the global and local dependencies simultaneously. they used gru-d to model the local structure, grasping intra-series relations, and used a memory module to model the global structure, learning inter-series relations. the memory module g has l rows; it captures the global temporal dynamics for missing values together with the variable correlations a. meanwhile, an adversarial training process can enhance the modeling of the global temporal distribution.

the alternative to processing sequences with missing data by pre-discretizing ismts is to construct models which can directly receive ismts as input. the intuition of the raw data-based perspective comes from the characteristics of the raw data itself: the intra-series relation and the inter-series relation. the intra-series relation of ismts is reflected in the irregular time intervals between two neighboring observations within one series; the inter-series relation is reflected in the different sampling rates of different time series. thus, the two subcategories are ) irregular time intervals-based approaches and ) multi-sampling rates-based approaches.

in the ehr setting, the time lapse between successive elements in patient records can vary from days to months, which is the characteristic of irregular time intervals in ismts. a better way to handle it is to model the unequally spaced data using the time information directly. basic rnns only process uniformly distributed longitudinal data by assuming that the sequences have an equal distribution of time differences; thus, the design of traditional rnns may lead to suboptimal performance. the time-aware lstm (t-lstm) applies a memory discount in coordination with the elapsed time to capture the irregular temporal dynamics, adjusting the hidden state c_{t-1} of the basic lstm to a new hidden state c*_{t-1}. however, t-lstm is a completely irregular time intervals-based method only when the ismts is univariate. for multivariate ismts, it has to align the multiple time series and fill in missing data first, where the missing data problem has to be solved again; the research did not mention the specific filling strategy and used simple interpolation, like mean values, in data preprocessing. for multivariate ismts and the alignment problem, tan et al. [ ] gave an end-to-end dual-attention time-aware gated recurrent unit (data-gru) to predict patients' mortality risk. data-gru uses a time-aware gru structure, t-gru, with the same idea as t-lstm. besides, the authors give a strategy for the multivariate data alignment problem. when aligning different time series into multiple dimensions, previous missing data approaches, such as gru-d [ ] and lgnet [ ] , assigned equal weights to observed data and imputed data, ignoring the relatively larger unreliability of imputation compared with actual observations. data-gru tackles this difficulty with a novel dual-attention structure: an unreliability-aware attention α_u with a reliability score c and a symptom-aware attention α_s. the dual-attention structure jointly considers data quality and medical knowledge.
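as a small illustration of the time-aware recurrent update shared by t-lstm and the t-gru cell inside data-gru, the sketch below discounts the short-term part of a memory value by a function of the elapsed time; the 1/log(e + δ) discount, the short-term fraction and the numbers are assumptions used only to show the idea, not the published cells.

```python
import math

def elapsed_time_discount(delta):
    """Monotonically decreasing discount of older memory as the time gap grows."""
    return 1.0 / math.log(math.e + delta)

def adjust_memory(c_prev, short_term_fraction, delta):
    """Split memory into long- and short-term parts and discount only the short-term part."""
    c_short = short_term_fraction * c_prev
    c_long = c_prev - c_short
    return c_long + elapsed_time_discount(delta) * c_short

# hypothetical previous memory value, 40% of it considered short-term
print(round(adjust_memory(1.0, 0.4, delta=0.5), 3))   # small gap: little forgetting
print(round(adjust_memory(1.0, 0.4, delta=48.0), 3))  # large gap: short-term part shrinks
```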
further, the attention-like structure makes data-gru explainable through its interpretable embedding, which is urgently needed in medical tasks. instead of using rnns to learn the order dynamics in ismts, bahadori et al. [ ] proposed methods for analyzing multivariate clinical time series that are invariant to temporal clustering. the events in ehrs may appear together in a single admission or may be dispersed over multiple admissions. for example, the authors postulated that whether a series of blood tests is completed at once or in rapid succession should not alter predictions. thus, they designed a data augmentation technique, temporal coarsening, that exploits temporal-clustering invariance to regularize deep neural networks optimized for clinical prediction tasks. moreover, they proposed a multi-resolution ensemble (mre) model with the coarsening-transformed inputs to improve predictive accuracy.

modeling only the irregular time intervals of the intra-series relation would ignore the multi-sampling rate of the inter-series relation. further, modeling the inter-series relation also reflects consideration of the global structure of ismts. the above rnn-based methods in the irregular time intervals-based category only consider the local order dynamics information. although lgnet [ ] has integrated the global structures, it incorporates all of the information from all time points into an interpolation model, which is redundant and poorly adaptive. some models can also learn the global structures of time series, like the basic kalman filter [ ] and deep markov models [ ] . however, these kinds of models mainly process time series with a stable sampling rate. che et al. [ ] focused on the problem of modeling multi-rate multivariate time series and proposed a multi-rate hierarchical deep markov model (mr-hdmm) for healthcare forecasting and interpolation tasks. mr-hdmm learns a generation model and an inference network with auxiliary connections and learnable switches. the latent hierarchical structure is reflected in the states/switches s, factorized through the joint probability p over the layers z, e.g., p(x_t, z_t, s_t | z_{t-1}) = p(x_t | z_t) p(z_t, s_t | z_{t-1}). these structures can capture the temporal dependencies and the data generation process. similarly, binkowski et al. [ ] presented an autoregressive framework for regression tasks by modeling ismts data; the core idea of the implementation is roughly similar to mr-hdmm. however, these methods considered the different sampling rates between series but ignored the irregular time intervals within each series. they process the data with a stable sampling rate (uniform time intervals) for each time series, and to obtain the stable sampling rate, they have to use forward or linear interpolation, where the global structures are again omitted in order to get uniform intervals. the gaussian process can build global interpolation layers to process multi-sampling rate data; li et al. [ ] and futoma et al. [ ] used this technique. but if a time series is multivariate, the covariance functions are challenging due to the complicated and expensive computation. satya et al. [ ] designed a fully modular interpolation-prediction network (ipn). ipn has an interpolation network to accommodate the complexity of ismts data and provide the multi-channel output by modeling three kinds of information: broad trends χ, transients τ and local observation frequencies λ. the three are calculated by a low-pass interpolation θ, a high-pass interpolation γ and an intensity function λ.
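to make the three interpolation channels more tangible, the sketch below computes, at a set of regular reference times, a kernel-smoothed trend, the transient component (observation minus trend) at the observed points, and a simple intensity estimate of how densely each reference time is observed; the rbf kernel, bandwidths and toy numbers are illustrative assumptions, not the exact parameterization of ipn.

```python
import numpy as np

def rbf(t_ref, t_obs, bandwidth):
    """Squared-exponential weights between reference times and observation times."""
    return np.exp(-((t_ref[:, None] - t_obs[None, :]) ** 2) / (2.0 * bandwidth ** 2))

def three_channels(t_obs, x_obs, t_ref, bw_trend=4.0, bw_intensity=2.0):
    """Return (smooth trend, transients at observed points, observation intensity)."""
    w = rbf(t_ref, t_obs, bw_trend)
    trend = (w @ x_obs) / np.clip(w.sum(axis=1), 1e-8, None)   # low-pass channel
    trend_at_obs = np.interp(t_obs, t_ref, trend)
    transients = x_obs - trend_at_obs                          # high-pass channel
    intensity = rbf(t_ref, t_obs, bw_intensity).sum(axis=1)    # local observation frequency
    return trend, transients, intensity

t_obs = np.array([0.5, 2.0, 2.5, 9.0, 20.0])     # hypothetical observation times (hours)
x_obs = np.array([82.0, 90.0, 88.0, 76.0, 84.0])
t_ref = np.arange(0.0, 24.0, 2.0)                # regular reference grid
trend, transients, intensity = three_channels(t_obs, x_obs, t_ref)
print(trend.shape, transients.shape, intensity.shape)  # (12,) (5,) (12,)
```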
ipn also has a prediction network which operates on the regularly partitioned inputs from the preceding interpolation module. in addition to taking care of the data relationships from multiple perspectives, ipn can make up for the lack of modularity in [ ] and address the complexity of the gaussian process interpolation layers in [ , ] .

modeling ists data aims to achieve two main tasks: ) missing data imputation and ) downstream tasks. the specific categories are shown in figure . missing data imputation is of practical significance; as work on machine learning has become more active, obtaining large amounts of complete data has become an important issue. however, it is almost impossible in the real world to get complete data, for many reasons such as lost records. in many cases, a time series with missing values becomes useless and is then thrown away, which results in a large amount of data loss. the incomplete data has adverse effects when learning a model [ ] . existing basic methods, such as interpolation [ ] , kernel methods [ ] and the em algorithm [ , ] , were proposed a long time ago. with the popularity of deep learning in recent years, most new methods are implemented with artificial neural networks (anns). one of the most popular models is the rnn [ ] . rnns can capture long-term temporal dependencies and use them to estimate the missing values. existing works [ , , , , , ] have designed several special rnn structures to adapt to the missingness and achieve good results. another popular model is the gan [ ] , which generates plausible fake data through adversarial training. gans have been successfully applied to face completion and sentence generation [ , , , ] . based on this data generation ability, some research [ , , , ] has applied gans to time series data generation by incorporating sequence information into the process. downstream tasks generally include prediction, classification, and clustering. for ismts data, medical prediction (such as mortality prediction, disease classification and image classification) [ , , , , ] , concept representation [ , ] and patient typing [ , , ] are the three main tasks. the downstream task-oriented methods calculate missing values and perform the downstream tasks simultaneously, which is expected to avoid the suboptimal analyses and predictions caused by missing patterns not being effectively explored due to the separation of imputation and the final tasks [ ] . most methods [ , , , , , , , ] use deep learning technology to achieve higher accuracy on these tasks.

in this section, we apply the above methods to four datasets and two tasks, and we analyze the methods through the experimental results. four datasets were used to evaluate the performance of the baselines. the cinc dataset [ ] consists of records from , icu stays and contains multivariate clinical time series. all patients were adults who were admitted for a wide variety of reasons to cardiac, medical, surgical, and trauma icus. each record is a multivariate time series of roughly hours and contains variables such as albumin, heart rate, glucose, etc. the cinc dataset [ ] is publicly available and comes from two hospitals; it contains , patient admission records and , records of diagnosed sepsis cases. it is a set of multivariate time series that contains related features: kinds of vital signs, kinds of laboratory values and kinds of demographics. the time interval is hour. the sequence length is between and , and , records have lengths less than .
the covid- dataset [ ] was collected between january and february from tongji hospital of tongji medical college, huazhong university of science and technology, wuhan, china. the dataset contains patients with blood sample records as the training set, patients with records as the test set, and characteristics.

the experiments have two tasks: ) mortality prediction and ) data imputation. the mortality prediction task uses the time series of hours before the onset time from the above four datasets. the imputation task uses features (selected using the method in [ ] ) from which % of the observed measurements are eliminated; the eliminated data serve as the new ground truth. for the rnn-based methods, we fix the dimension of the hidden state to . for the gan-based methods, the series inputs also use an rnn structure. for the final prediction, all methods use one -dimensional fc layer and one -dimensional fc layer. all methods apply the adam optimizer [ ] with α = . , β = . and β = . . we use the learning rate decay α_current = α_initial · γ^(global_step / decay_steps) with decay rate γ = . , and the decay step is . -fold cross validation is used for both tasks.

[ ] . ± . , . ± . , . ± . , . ± .
lstm [ ] . ± . , . ± . , . ± . , . ± .
gru-d [ ] . ± . , . ± . , . ± . , . ± .
m-rnn [ ] . ± . , . ± . , . ± . , . ± .
brits [ ] . ± . , . ± . , . ± . , . ± .
t-lstm [ ] . ± . , . ± . , . ± . , . ± .
data-gru [ ] . ± . , . ± . , . ± . , . ± .
lgnet [ ] . ± . , . ± . , . ± . , . ± .
ipn [ ] . ± . , . ± . , . ± . , . ± .

the prediction results were evaluated by assessing the area under the receiver operating characteristic curve (auc-roc). the roc is a curve of the true positive rate (tpr) against the false positive rate (fpr), where tp, tn, fp and fn stand for the true positive, true negative, false positive and false negative counts: tpr = tp / (tp + fn), fpr = fp / (tn + fp). we evaluate the imputation performance in terms of the mean squared error (mse): for the i-th item, x_i is the real value and x̂_i is the predicted value, the number of missing values is n, and mse = (1/n) Σ_i (x_i - x̂_i)².

table shows the performance of the baselines for the mortality prediction task. for the two categories of technology-driven methods, each has its own merits, but the irregularity-based methods work relatively well. missing data-based methods have / top results and / top results, while irregularity-based methods have / top results and / top results. regarding whether the two series relations are considered, the methods that take both the inter-series relation and the intra-series relation (both global and local structures) into account perform better; ipn, lgnet, and data-gru have relatively good results. for different datasets, the methods show different effects. for example, as covid- is a small dataset, unlike the other three datasets, relatively simple methods perform better on it, like t-lstm, which does not perform very well on the other three datasets. table shows that the data imputation is better on the sepsis and covid- datasets. perhaps the time series in these two datasets come from patients who suffered from the same disease; that is probably also why they have relatively better results in the prediction task. table shows a basic rnn model's performance on the mortality prediction task based on the baselines' imputed data. different from the results in table , the rnn-based methods perform better here: the rnn-based methods have / top results, but the gan-based methods have / . the reason may be that the rnn-based approaches have integrated the downstream tasks when imputing, so the data generated by them is more suitable for the final prediction task.
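the evaluation metrics above can be computed as in the short sketch below, using hypothetical toy labels, scores and imputed values; the roc area uses scikit-learn's roc_auc_score, while tpr, fpr and mse follow the definitions given in this section.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def tpr_fpr(y_true, y_pred):
    """True/false positive rates from binary labels and thresholded predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (tn + fp)

def mse(x_true, x_imputed):
    """Mean squared error between held-out true values and imputed values."""
    x_true, x_imputed = np.asarray(x_true), np.asarray(x_imputed)
    return float(np.mean((x_true - x_imputed) ** 2))

# hypothetical mortality labels, predicted probabilities and imputed values
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.4, 0.8, 0.6, 0.1, 0.3])
print(round(roc_auc_score(y_true, y_score), 3))          # AUC-ROC
print(tpr_fpr(y_true, (y_score >= 0.5).astype(int)))     # (TPR, FPR) at threshold 0.5
print(mse([1.0, 2.0, 3.0], [0.9, 2.2, 2.7]))             # imputation error
```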
according to the analysis of the technologies and the experimental results, in this section we discuss the ismts modeling task from three perspectives: ) the imputation task versus the prediction task, ) intra-series relations versus inter-series relations (local structure versus global structure), and ) missing data-based versus raw data-based modeling. the conclusions about the approaches in this survey are in table . (residual fragment of that summary table: low applicability for multivariate data; incomplete data relations. multi-sampling rates-based [ , , , ] : no artificial dependency; no data imputation; implementation complexity; data generation pattern assumptions.) based on the above perspectives, we summarize the challenges as follows.

how to balance imputation with prediction? different kinds of methods suit different tasks: gans prefer imputation while rnns prefer prediction. however, in the medical setting, across different datasets this conclusion does not always hold. for example, missing data is generated better by rnns than by gans on the covid- dataset, and the two-step methods based on gans for mortality prediction are no worse than using rnns directly. therefore, it seems difficult to achieve a general and effective modeling method in medical settings; the method should be specified according to the specific task and the characteristics of the datasets.

how to handle the intra-series relations and the inter-series relations of ismts? in other words, how to trade off the local structure against the global structure. in the ismts format, a patient has several time series of vital signs connected to the diagnoses of diseases or the probability of death. when seeing these time series as one whole multivariate data sample, intra-series relations are reflected in longitudinal dependencies and horizontal dependencies. the longitudinal dependencies contain the sequence order and context, the time intervals, and the decay dynamics; the horizontal dependencies are the relations between different dimensions. the inter-series relations are reflected in the patterns of the time series of different samples. however, when seeing these time series as separate samples of a patient, the relations change. intra-series relations become the dependencies of values observed at different time steps in a univariate ismts, where the features of different time intervals should be taken care of. inter-series relations become the pattern relations between different patients' samples and between different time series of the same vital sign. at the structural level, modeling intra-series relations is basically local, while modeling inter-series relations is global. it is not clear what kind of consideration and which structure will make the results better. modeling both local and global structures seems to perform better in mortality prediction, but such a method is more complex, and it is not universal across datasets.

how to choose the modeling perspective, missing data-based or irregularity-based? both kinds of methods have advantages and disadvantages. most existing works are missing data-based, and methods for estimating missing data have existed for a long time [ ] . in the settings of the missing data-based perspective, the discretization interval length is a hyper-parameter that needs to be determined. if the interval size is large, there is less missing data, but several values will fall in the same interval; if the interval size is small, there is more missing data. meanwhile, missing data-based methods have to interpolate new values, which may artificially introduce dependencies that do not naturally occur. over-imputation may result in an explosion in size, and the pursuit of multivariate data alignment may lead to the loss of raw data dependencies. thus, of particular interest are irregularity-based methods that can learn directly by using multivariate, sparse and irregularly sampled time series as input without the need for separate imputation. however, although the raw data-based methods have the merit of introducing no artificial dependencies, they suffer from not achieving the desired results, complex designs, and large numbers of parameters. irregular time intervals-based methods are not complex, as they can be achieved by simply injecting time decay information, but in terms of specific tasks, such as mortality prediction, these methods seem not as good as we might think (as concluded from the experiments section). meanwhile, for multivariate time series, these methods have to align values on different dimensions, which leads to the missing data problem again. multi-sampling rates-based methods will not cause missing data; however, processing multiple univariate time series at the same time requires more parameters and is not friendly to batch learning, and modeling an entire univariate series may require data generation model assumptions.

considering the complex patient states, the number of interventions and the real-time requirements, data-driven approaches that learn from ehrs are the desiderata to help clinicians. although some difficulties have not been solved yet, deep learning methods do show a better ability to model medical ismts data than the basic methods. basic methods cannot model ismts completely: interpolation-based methods [ , ] only exploit the correlation within each series, imputation-based methods [ , ] only exploit the correlation among different series, and matrix completion-based methods [ , ] assume that the data is static and ignore the temporal component of the data. deep learning methods use parameter training to learn data structures, and many basic methods can be integrated into the designs of neural networks. the deep learning methods introduced in this survey basically solve the problems of the common methods and have achieved state-of-the-art results in medical prediction tasks, including mortality prediction, disease prediction, and admission stay prediction. therefore, deep learning models based on ismts data have broad prospects in medical tasks. the deep learning methods, both the rnn-based and the gan-based methods mentioned in this survey, are troubled by poor interpretability [ , ] , and clinical settings prefer interpretable models. although this defect is difficult to solve due to the models' characteristics, some researchers have made breakthroughs and progress; for example, the attention-like structures used in [ , ] can give an explanation for medical predictions.

this survey introduced a kind of data, irregularly sampled medical time series (ismts). combined with the medical settings, we described the characteristics of ismts. then, we investigated the relevant methods for modeling ismts data and classified them from a technology-driven perspective and a task-driven perspective. for each category, we divided the subcategories in detail and presented each specific model's implementation method. meanwhile, according to the imputation and prediction experiments, we analyzed the advantages and disadvantages of some methods and drew conclusions.
finally, we summarized the challenges and opportunities of modeling ismts data task. recurrent neural networks for multivariate time series with missing values convolutional lstm network: a machine learning approach for precipitation nowcasting restful: resolution-aware forecasting of behavioral time series data tensorized lstm with adaptive shared memory for learning trends in multivariate time series clustering and classification for time series data in visual analytics: a survey time graph: revisiting time series modeling with dynamic shapelets adversarial unsupervised representation learning for activity time-series revisiting spatial-temporal similarity: a deep learning framework for traffic prediction deep ehr: a survey of recent advances in deep learning techniques for electronic health record (ehr) analysis predicting in-hospital mortality of icu patients: the physionet/computing in cardiology challenge holmes: health online model ensemble serving for deep learning models in intensive care units dipole: diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks learning to diagnose with lstm recurrent neural networks retain: an interpretable predictive model for healthcare using reverse time attention mechanism multi-layer representation learning for medical concepts mime: multilevel medical embedding of electronic health records for predictive healthcare patient subtyping via time-aware lstm networks deep computational phenotyping a survey of methodologies for the treatment of missing values within datasets: limitations and benefits singular value decomposition and least squares solutions an efficient nearest neighbor classifier algorithm based on pre-classify. computer ence simple linear regression in medical research predicting disease risks from highly imbalanced data using random forest a modified svm classifier based on rs in medical disease prediction alzheimer's disease neuroimaging initiative. rnn-based longitudinal analysis for diagnosis of alzheimer's disease estimating brain connectivity with varying-length time lags using a recurrent neural network on clinical event prediction in patient treatment trajectory using longitudinal electronic health records bidirectional recurrent auto-encoder for photoplethysmogram denoising a deep learning method based on hybrid auto-encoder model research and application progress of generative adversarial networks an accurate saliency prediction method based on generative adversarial networks joint modeling of local and global temporal dynamics for multivariate time series forecasting with missing values directly modeling missing data in sequences with rnns: improved classification of clinical time series brits: bidirectional recurrent imputation for time series recurrent neural networks with missing information imputation for medical examination data prediction data-gru: dual-attention time-aware gated recurrent unit for irregular multivariate time series temporal-clustering invariance in irregular healthcare time series. corr, abs interpolation-prediction networks for irregularly sampled time series hierarchical deep generative models for multi-rate multivariate time series mimic-iii, a freely accessible critical care database. sci. 
early prediction of sepsis from clinical data: the physionet/computing in cardiology challenge
an intelligent warning model for early prediction of cardiac arrest in sepsis patients
k-margin-based residual-convolution-recurrent neural network for atrial fibrillation detection
opportunities and challenges of deep learning methods for electrocardiogram data: a systematic review
risk prediction for chronic kidney disease progression using heterogeneous electronic health record data and time series analysis
learning from irregularly-sampled time series: a missing data perspective (corr)
time series analysis: forecasting and control
forecasting in multivariate irregularly sampled time series with missing values (corr)
estimating missing data in temporal data streams using multi-directional recurrent neural networks
long short-term memory
empirical evaluation of gated recurrent neural networks on sequence modeling
temporal belief memory: imputing missing data during rnn training
survey of clinical data mining applications on big data in health informatics
analysis of incomplete and inconsistent clinical survey data
modeling irregularly sampled clinical time series
multivariate time series imputation with generative adversarial networks
physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals
early prediction of sepsis from clinical data - the physionet computing in cardiology challenge
an interpretable mortality prediction model for covid- patients
ua-crnn: uncertainty-aware convolutional recurrent neural network for mortality risk prediction
a hybrid residual network and long short-term memory method for peptic ulcer bleeding mortality prediction
raim: recurrent attentive and intensive model of multimodal patient monitoring data
linear regression with censored data
a learning algorithm for continually running fully recurrent neural networks
arterial blood pressure during early sepsis and outcome
hospital deaths in patients with sepsis from independent cohorts
data cleaning: overview and emerging challenges
the effects of the irregular sample and missing data in time series analysis
wavelet methods for time series analysis (book reviews)
comparison of correlation analysis techniques for irregularly sampled time series
multiple imputation using chained equations: issues and guidance for practice
pattern classification with missing data: a review (neural computing and applications)
a solution for missing data in recurrent neural networks with an application to blood glucose prediction
speech recognition with missing data using recurrent neural nets
framewise phoneme classification with bidirectional lstm and other neural network architectures
a survey of missing data imputation using generative adversarial networks
stable and improved generative adversarial nets (gans): a constructive survey
gain: missing data imputation using generative adversarial nets
improving missing data imputation with deep generative models (corr)
medical missing data imputation by stackelberg gan
strategies for handling missing data in electronic health record derived data
kalman filtering and neural networks
hidden markov and other models for discrete-valued time series
autoregressive convolutional neural networks for asynchronous time series
a scalable end-to-end gaussian process adapter for irregularly sampled time series classification
learning to detect sepsis with a multitask gaussian process rnn classifier
doctor ai: predicting clinical events via recurrent neural networks
generative face completion
generative adversarial nets
ambientgan: generative models from lossy measurements
approximation and convergence properties of generative adversarial learning
seqgan: sequence generative adversarial nets with policy gradient
learning from incomplete data with generative adversarial networks
adam: a method for stochastic optimization
a study of handling missing data methods for big data
multiple imputation for nonresponse in surveys
spectral regularization algorithms for learning large incomplete matrices
temporal regularized matrix factorization for high-dimensional time series prediction
interpretable machine learning: a guide for making black box models explainable (online)
interpretability of machine learning-based prediction models in healthcare
iterative robust semi-supervised missing data imputation
medical time-series data generation using generative adversarial networks
unsupervised online anomaly detection on irregularly sampled or missing valued time-series data using lstm networks (corr)
kernels for time series with irregularly-spaced multivariate observations (corr)
timeautoml: autonomous representation learning for multivariate irregularly sampled time series
a distributed descriptor characterizing structural irregularity of eeg time series for epileptic seizure detection
a bio-statistical mining approach for classifying multivariate clinical time series data observed at irregular intervals
automatic classification of irregularly sampled time series with unequal lengths: a case study on estimated glomerular filtration rate
mcpl-based ft-lstm: medical representation learning-based clinical prediction model for time series events
a comparison between discrete and continuous time bayesian networks in learning from clinical time series data with irregularity
multi-resolution networks for flexible irregular time series modeling (multi-fit)

key: cord- -ecuex m authors: fong, simon james; li, gloria; dey, nilanjan; crespo, ruben gonzalez; herrera-viedma, enrique title: composite monte carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction date: - - journal: nan doi: nan sha: doc_id: cord_uid: ecuex m with the advent of the novel coronavirus epidemic since december , governments and authorities have been struggling to make critical decisions under high uncertainty to the best of their efforts. composite monte-carlo (cmc) simulation is a forecasting method which extrapolates available data, broken down from multiple correlated/causal micro-data sources, into many possible future outcomes by drawing random samples from some probability distributions.
for instance, the overall trend and propagation of the infested cases in china are influenced by the temporal-spatial data of the nearby cities around the wuhan city (where the virus is originated from), in terms of the population density, travel mobility, medical resources such as hospital beds and the timeliness of quarantine control in each city etc. hence a cmc is reliable only up to the closeness of the underlying statistical distribution of a cmc, that is supposed to represent the behaviour of the future events, and the correctness of the composite data relationships. in this paper, a case study of using cmc that is enhanced by deep learning network and fuzzy rule induction for gaining better stochastic insights about the epidemic development is experimented. instead of applying simplistic and uniform assumptions for a mc which is a common practice, a deep learning-based cmc is used in conjunction of fuzzy rule induction techniques. as a result, decision makers are benefited from a better fitted mc outputs complemented by min-max rules that foretell about the extreme ranges of future possibilities with respect to the epidemic. on top of devastating health effects, epidemic impacted hugely on world economy. in the ebola outbreak between - where more than , and cases were suspected and , deaths in west africa [ ], $ . billion was lost [ ] . on the other hand, sars took over lives from china including hong kong and lives worldwide between and [ ] . its losses on global economy are up to a huge $ billion, % and . % dips of gdps in chinese and asian domestic markets respectively [ ] . although the current coronavirus (codename: ncp or covid- ) epidemic is not over yet, its economy impact is anticipated by economists from ihs markit to be far worse than that of sars outbreak in [ ] . the impact is so profound that will lead to factories shut down, enterprises bankruptcy especially those in tourism, retail and f&b industries, and suspensions or withdrawals in longterm investment, if the outbreak cannot be contained in time. since the first case in december , the suspected cases and deaths around the world skyrocketed to over confirmed cases and deaths, mostly in china, by the time of writing this article. an early intervention measure in public health to thwart the outbreak of covid- is absolutely imperative. according to a latest mathematical model that was reported in research article by the lancet [ ] , the growth of the epidemic spreading rate will ease down if the transmission rate of the new contagious disease can be lowered by . . knowing that the early ending the virus epidemic or even the reduction in the transmission rate between human to human, all governments especially china where wuhan is the epicenter are taking up all the necessary preventive measures and all the national efforts to halt the spread. how much input is really necessary? many decision makers take references from sars which is by far the most similar virus to covid- . however, it is difficult as the characteristics of the virus are not fully known, it details and about how it spreads are gradually unfolded from day to day. given the limited information on hand about the new virus, and the ever evolving of the epidemic situations both geographically and temporally, it boils down to grand data analytics challenge this analysis question: how much resources shall be enough to slow down the transmission? 
this is a composite problem that requires cooperation from multi-prong measures such as medical provision, suspension of schools, factories and office, minimizing human gathering, limiting travel, strict city surveillance and enforced quarantines and isolations in large scales. there is no easy single equation that could tell the amount of resources in terms of monetary values, manpower and other intangible usage of infrastructure; at the same time there exist too many uncertain variables from both societal factors and the new development of the virus itself. for example, the effective incubation period of the new virus was found to be longer than a week, only some time later after the outbreak. time is an essence in stopping the epidemic so to reduce its damages as soon as possible. however, uncertainties are the largest obstacle to obtain an accurate model for forecasting the future behaviours of the epidemic should intervention apply. in general, there is a choice of using deterministic or stochastic modelling for data scientists; the former technique based solely on past events which are already known for sure, e.g. if we know the height and weight of a person, we know his body mass index. should any updates on the two dependent variables, the bmi will be changed to a new value which remains the same for sure no matter how many times the calculation is repeated. the latter is called probabilistic or stochastic model -instead of generating a single and absolute result, a stochastic model outputs a collection of possible outcomes which may happen under some probabilities and conditions. deterministic model is useful when the conditions of the experiment are assumed rigid. it is useful to obtain direct forecasting result from a relatively simple and stable situation in which its variables are unlikely to deviate in the future. otherwise, for a non-deterministic model, which is sometimes referred as probabilistic or stochastic, the conditions of a future situation under which the experiment will be observed, are simulated to some probabilistic behaviour of the future observable outcome. for an example of epidemic, we want to determine how many lives could be saved from people who are infected by a new virus as a composite result of multi-prong efforts that are put into the medical resources, logistics, infrastructure, spread prevention, and others; at the same time, other contributing factors also matter, such as the percentage of high-risk patients who are residing in that particular city, the population and its mobility, as well as the severity and efficacy of the virus itself and its vaccine respectively. real-time tools like cdc data reporting and national big data centers are available with which any latest case that occurs can be recorded. however, behind all these records are sequences of factors associated with high uncertainty. for example, the disease transmission rate depends on uncertain variables ranging from macro-scale of weather and economy of the city in a particular season, to the individual's personal hygiene and the social interaction of commuters as a whole. they are dynamic in nature that change quickly from time to time, person to person, culture to culture and place to place. the phenomena can hardly converge to a deterministic model. rather, a probabilistic model can capture more accurately the behaviours of the phenomena. 
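the contrast between the two modelling styles can be illustrated with a minimal sketch; all numbers below are hypothetical and unrelated to the data used later in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# deterministic view: every input is a fixed number, so the output is a single number.
infected = 1000                       # hypothetical count of infected persons
cost_per_patient = 500.0              # hypothetical treatment cost per patient per day
print("deterministic cost:", infected * cost_per_patient)

# stochastic (monte carlo) view: uncertain inputs are distributions; the output is a
# distribution of possible outcomes rather than one number.
trials = 10_000
infected_draws = rng.poisson(lam=1000, size=trials)            # uncertain case count
cost_draws = rng.normal(loc=500.0, scale=80.0, size=trials)    # uncertain daily unit cost
total_cost = infected_draws * cost_draws

print("mc mean cost:", total_cost.mean())
print("mc 5th-95th percentile:", np.percentile(total_cost, [5, 95]))
```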
so for epidemic forecast, a deterministic model such as trending is often used to the physical considerations to predict an almost accurate outcome, whereas in a non-deterministic model we use those considerations to predict more of a probable outcome that is probability distribution oriented. in order to capture and model such dynamic epidemic recovery behaviours, stochastic methods ingest a collection of input variables that have complex dependencies on multiple risk factors. the epidemic recovery can be viewed in abstract as a bipolar force between the number of populations who has contracted the disease and the number of patients who are cured from the disease. each group of the newly infested and eventually cured (or unfortunately deceased) individuals are depending on complex societal and physiological factors as well as preventive measures and contagious control. each of these factors have their underlying and dependent factors carrying uncertain levels of risks. a popular probabilistic approach for modeling the complex conditions is known as monte carlo (mc) simulation which provides a means of estimating the outcome of complex functions by simulating multiple random paths of the underlying risk factors. rather than deterministic analytic computation, mc uses random number generation to generate random samples of input trials to explore the behaviour of a complex epidemic situation for decision support. mc is particularly suitable for modeling epidemic especially new and unknown disease like covid- because the data about the epidemic collected on hand in the early stage are bound to change. in mc, data distributions are entered as input, since precise values are either unknown or uncertain. output of mc is also in a form of distribution specifying a range of possible values (or outcome) each of which has its own probability at which it may occur. compared to deterministic approach where precise numbers are loaded as input and precise number is computed to be output, mc simulates a broad spectrum of possible outcomes for subsequent expert evaluation in a decision-making process. recently as epidemic is drawing global concern and costing hugely on public health and world economy, the use of mc in epidemic modeling forecast has become popular. it offers decision makers an extra dimension of probability information so called risk factors for analyzing the possibilities and their associated risk as a whole. decades ago, there has been a growing research interest in using mc for quantitatively modelling epidemic behaviours. since , bailey et al was among the pioneers in formulating mathematical theory of epidemics. subsequently in millennium, andersson and britton [ ] adopted mc simulation techniques to study the behaviour of stochastic epidemic models, observing their statistical characteristics. in , house et al, attempted to estimate how big the final size of an epidemic is likely to be, by using mc to simulate the course of a stochastic epidemic. as a result, the probability mass function of the final number of infections is calculated by drawing random samples over small homogeneous and large heterogeneous populations. yashima and sasaki in extended the mc epidemic model from over a population to a particular commute network model, for studying the epidemic spread of an infectious disease within a metropolitan area -tokoyo train station. 
mc is used to simulate the spread of infectious disease by considering the commuters flow dynamics, the population sizes and other factors, the proceeding size of the epidemic and the timing of the epidemic peak. it is claimed that the mc model is able to serve as a pre-warning system forecasting the incoming spread of infection prior to its actual arrival. narrowing from the mc model which can capture the temporal-spatial dynamics of the epidemic spread, a more specific mc model is constructed by fitzgerald et al [ ] in for simulating queuing behaviour of an emergency department. the model incorporates queuing theory and buffer occupancy which mimic the demand and nursing resource in the emergency department respectively. it was found that adding a separate fast track helps relieving the burden on handling of patient and cutting down the overall median wait times during an emergency virus outbreak and the operation hours are at peak. mielczarek and zabawa [ ] adopted a similar mc model to investigate how erratic the population is, hence the changes in the number of infested patients affect the fluctuations in emergency medical services, assuming there are epidemiological changes such as call-for-services, urgent admission to hospital and icu usages. based on some empirical data obtained from ems center at lower silesia region in poland, the ems events and changes in demographic information are simulated as random variables. due to the randomness of the changes (in population sizes as people migrate out, and infested cases increase) in both demand and supply of an ems, the less-structured model cannot be easily examined by deterministic analytic means. however, mc model allows decision makers to predict by studying the probabilities of possible outcomes on how the changes impact the effectiveness of the polish ems system. there are similar works which tap on the stochastic nature of mc model for finding the most effective escape route during emergency evacuation [ ] and modelling emergency responses [ ] . overall, the above-mentioned related works have several features in common: their studies are centered on using a probabilistic approach to model complex real-life phenomena, where a deterministic model may fall short in precisely finding the right parameters to cater for every detail. the mc model is parsimonious that means the model can achieve a satisfactory level of explanation or insights by requiring as few predictor variables as possible. the model which uses minimum predictor variables and offers good explanation is selected by some goodness of fit as bic model selection criterion. the input or predictor variables are often dynamic in nature whose values change over some spatial-temporal distribution. finally, the situation in question, which is simulated by mc model, is not only complex but a prior in nature. just like the new covid- pandemic, nobody can tell when or whether it will end in the near future, as it depends on too many dynamic variables. while the challenges of establishing an effective mc model is acknowledged for modelling a completely new epidemic behaviour, the model reported in [ ] inspires us to design the mc model by decomposing it into several sub-problems. therefore, we proposed a new mc model called composite mc or cmc in short which accepts predictor variables from multi-prong data sources that have either correlations or some kind of dependencies from one another. 
the challenge here is to ensure that the input variables though they may come from random distribution, their underlying inference patterns must contribute to the final outcome in question. considering multi-prong data sources widen the spectrum of possibly related input data, thereby enhancing the performance of monte carlo simulation. however, naive mc by default does not have any function to decide on the importance of input variables. it is known that what matters for the underlying inference engine of mc is the historical data distribution which tells none or little information about the input variables prior to the running of mc simulation. to this end, we propose a pre-processor, in the form of optimized neural network namely bfgs-polynomial neural network is used at the front of mc simulator. bfgs-pnn serves as both a filter for selecting important variables and forecaster which generates future time-series as parts of the input variables to the mc model. traditionally all the input variables to the mc are distributions that are drawn from the past data which are usually random, uniform or some customized distribution of sophisticated shape. in our proposed model here, a hybrid input that is composed of both deterministic type and non-deterministic type of variables. deterministic variables come from the forecasted time-series which are the outputs of the bfgs-pnn. non-deterministic variables are the usual random samples that are drawn from data distributions. in the case of covid- , the future forecasts of the time-series are the predictions of the number of confirmed infection cases and the number of cured cases. observing from the historical records, nevertheless, these two variables display very erratic trends, one of them contains extreme outliers. they are difficult to be closely modelled by any probability density function; intuitively imposing any standard data distribution will not be helpful to delivering accurate outcomes from the mc model. therefore in our proposal, it is needed to use a polynomial style of self-evolving neural network that was found to be one of the most suited machine learning algorithm in our prior work [ ] , to render a most likely future curve that is unique to that particular data variable. the composition of the multiple data sources is of those relevant to the development (rise-and-decline) of the covid- epidemic. specifically, a case of how much daily monetary budget that is required to struggle against the infection spread is to be modelled by mc. the data sources of these factors are publicly available from the chinese government websites. more details follow in section below. the rationale behind using a composite model is that what appears to be an important figure, e.g. the number of suspected cases are directly and indirectly related to a number of sub-problems which of each carries different levels of uncertainty: how a person gets infested, depends on ) the intensity of travel (within a community, suburb, inter-city, or oversea) ) preventive measures ) trail tracking of the suspected and quarantining them ) medical resources (isolation beds) available, and ) eventual cured or dead. some of these data sources are in opposing influences to one another. for example, the tracking and quarantine measures gets tighten up, the number of infested drops, and vice-versa. in theory, more relevant data are available, the better the performance and more accurate outcomes of the mc can provide. 
mc plays an important role here as the simulation is founded on probabilistic basis, the situation and its factors are nothing but uncertainty. given the available data is scare as the epidemic is new, any deterministic model is prone to high-errors under such high uncertainty about the future. the contribution of this work has been twofold. firstly, a composite mc model, called cmcm is proposed which advocates using non-deterministic data distributions along with future predictions from a deterministic model. the deterministic model in use should be one that is selected from a collection of machine learning models that is capable to minimize the prediction error with its model parameters appropriately optimized. the advantage of using both fits into the mc model is the flexibility that embraces some input variables which are solely comprised of historical data, e.g. trends of people infested. and those that underlying elements which contribute to the high uncertainty, e.g. the chances of people gather, are best represented in probabilistic distribution as non-deterministic variables to the mc model. by this approach, a better-quality mc model can be established, the outcomes from the mc model become more trustworthy. secondly, the sensitivity chart obtained from the mc simulation is used as corrective feedback to rules that are generated from a fuzzy rule induction (fri) model. it is known that fri outputs decision rules with probabilities/certainty for each individual rule. a rule consists of a series of testing nodes without any priority weights. by referencing the feedbacks from the sensitivity chart, decision makers can relate the priority of the variables which are the tests along each sequence of decision rules. combining the twofold advantages, even under the conditions of high uncertainty, decision makers are benefited with a better-quality mc model which embraces considerations of composite input variables, and fuzzy decision rules with tests ranked in priority. this tool offers a comprehensive decision support at its best effort under high uncertainty. the remaining paper is structured as follow. section describes the proposed methodology called grooms+cmcm, followed by introduction of two key soft computing algorithms -bfgs-pnn and fri which is adopted for forecasting some particular future trends as inputs to the mc model and generating fuzzy decision rules respectively. section presents some preliminary results from the proposed model followed by discussion. section concludes this paper. mc has been applied for estimating epidemic characteristics by researchers over the year, because the nature of epidemic itself and its influences are full of uncertainty. an application that is relatively less looked at but important is the direct cost of fighting the virus. the direct cost is critical to keep the virus at bay when it is still early before becoming a pandemic. but often it is hard to estimate during the early days because of too many unknown factors. jiang et al [ ] have modelled the shape of a typical epidemic concluding that the curve is almost exponential; it took only less than a week from the first case growing to its peak. if appropriate and urgent preventive measure was applied early to have it stopped in time, the virus would probably not escalate into an epidemic then pandemic. 
ironically, during the first few days (some call it the golden critical hours), most of the time within this critical window was spent on observation, study, even debating for funding allocation and strategies to apply. if a effective simulation tool such as the one that is proposed here, decision makers can be better informed the costs involved and the corresponding uncertainty and risks. therefore, the methodology would have to be designed in mind that the available data is limited, each functional component of the methodology in the form of soft computing model should be made as accurate as possible. being able to work with limited data, flexible in simulating input variables (hybrid deterministic and its counterpart), and informative outcomes coupled with fuzzy rules and risks, would be useful for experts making sound decision at the critical time. our novel methodology is based on group of optimized and multisource selection, (grooms) methodology [ ] which is made for choosing a machine learning method which has the highest level of accuracy. grooms as a standalone optimizing process is in aid of assuring the deterministic model that is to be used as input variable for the subsequent mc simulation to have the most accurate data source input. by default, mc model at its naive form accepts only input variable from a limited range of standard data distributions (uniform, normal, bernoulli, pareto, etc.); best fitting curve technique is applied should the historical data shape falls out of the common data distribution types. however, this limitation is expanded in our composite mc model, so-called cmcm in such a way that all relevant data sources are embraced, both direct and indirect types. an enhanced version of neural network is used to firstly capture the non-linearity (often with high irregularity and lack of apparent trends and seasonality) of the historical data. out of the full spectrum of data sources, direct and indirect, the selected data sources through feature selection by correlation, that are filtered by the neural network, whose data distributions would be taken as input variables to the mc model. the combined methodology, grooms+cmcm is shown in figure . according to the methodology, a machine learning algorithm candidate called bgfs-pnn which is basically pnn as selected as the winning algorithm in [ ] enhanced with further parameter optimization function. the given timeseries data fluctuated more than the same that were collected earlier. as a data pre-processor, bfgs-pnn has two functions. firstly, for the non-deterministic data, using a classifierbasedfilter function in a wrapper approach, salient features could be found in feature selection. the selected salient features are those very relevant to the forecast target in the mc. in this case, it is composite mc model or cmcm as the simulation engine intakes multiple data sources from types of deterministic and non-deterministic. the second function is to forecast a future time-series as a type of deterministic input variable for the cmcm. the formulation of bfgs-pnn is shown as follow. the naïve version of pnn is identical to the one reported in [ ] . bfgs-pnn uses bfgs (broyden-fletcher-goldfarb-shanno) algorithm to optimize the parameters and network structure size in an iterative manner using hill-climbing technique. bfgs theory is basically for solving non-linear optimization problem iteratively by finding a stationary equilibrium through quasi-newton method [ ] and secant method [ ] . 
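as a simplified stand-in for the filtering function of bfgs-pnn described above (the paper uses a classifier-based wrapper filter; a plain correlation ranking is shown here only for illustration, and the column names and numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# rank candidate input series by absolute pearson correlation with the forecast
# target and keep the strongest ones. this is only an illustrative approximation
# of the wrapper-based feature selection described in the text.
def select_relevant_features(df: pd.DataFrame, target: str, top_k: int = 2):
    corr = df.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(top_k).index.tolist()

rng = np.random.default_rng(1)
n = 60
cases = np.cumsum(rng.poisson(30, n)).astype(float)
df = pd.DataFrame({
    "daily_confirmed": cases,
    "daily_suspected": cases * 1.4 + rng.normal(0, 25, n),
    "daily_cured": np.linspace(0, 300, n) + rng.normal(0, 10, n),
    "unrelated_noise": rng.normal(0, 1, n),
})
print(select_relevant_features(df, target="daily_confirmed"))
```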
let pnn [ ] take the form of the kolmogorov-gabor polynomial as a functional series, $y = f(\bar{x}) = a_0 + \sum_i a_i x_i + \sum_i\sum_j a_{ij} x_i x_j + \sum_i\sum_j\sum_k a_{ijk} x_i x_j x_k + \dots$. the polynomial is capable of taking the form of any function generalized as $y = f(\bar{x})$. the induction process is a matter of finding all the values of the coefficient vector $\bar{a}$. as the process iterates, the variables of $\bar{x}$ arrive in sequence, fitting into the polynomial via regression while minimizing the error. the complexity grows incrementally by trying to add one neuron at a time while the forecasting error is monitored [ ] . when the number of neurons reaches a pre-set level, the number of hidden layers increases. this continues until no further performance gain is observed; the growth of the polynomial then stops, and the current polynomial is taken as the final equation of the pnn. it is noted that this network growth is linear. for the case of bfgs-pnn, however, the expansion of the polynomial is non-linear and heuristic: the optimal state is reached by an unconstrained hill-climbing method guided by the quasi-newton and secant methods. let the error function be $E(p)$, where $p$ is a vector of real numbers encoding the network structure information and parameters, i.e. the neurons and layers in an ordered set. at the start, when $t=0$, $p_{t=0}$ is initialized with randomly chosen states. let $s_i$ be the search direction at iteration $i$ (time $t=i$), and let $H_i$ be the hessian, a square matrix of second-order partial derivatives of the function $E$; as the process iterates, $H$ becomes a progressively better estimate. $\nabla E(p_i)$ is the gradient of the error function to be minimized at $t=i$. following the quasi-newton search pattern, the search direction is obtained from $H_i s_i = -\nabla E(p_i)$, and the next state of the parameter values, $p_{i+1} = p_i + \sigma s_i$, is found by optimizing $E(p_i + \sigma s_i)$, where the scalar $\sigma$ must be greater than $0$. the approximation of the hessian $H_i$ then has to obey the quasi-newton (secant) condition $H_{i+1}(p_{i+1} - p_i) = \nabla E(p_{i+1}) - \nabla E(p_i)$. letting $s_i = p_{i+1} - p_i$ and $y_i = \nabla E(p_{i+1}) - \nabla E(p_i)$, so that the condition becomes $H_{i+1} s_i = y_i$ together with symmetry, the updating function for the hessian matrix follows the secant method as $H_{i+1} = H_i + \frac{y_i y_i^{T}}{y_i^{T} s_i} - \frac{H_i s_i s_i^{T} H_i}{s_i^{T} H_i s_i}$. by applying the sherman-morrison formula [ ] to this update, we obtain the inverse of the hessian directly, $H_{i+1}^{-1} = \left(I - \frac{s_i y_i^{T}}{y_i^{T} s_i}\right) H_i^{-1} \left(I - \frac{y_i s_i^{T}}{y_i^{T} s_i}\right) + \frac{s_i s_i^{T}}{y_i^{T} s_i}$, an equation that can be calculated quickly without needing any buffer space, enabling fast optimization aimed at minimizing $E(\cdot)$. by our grooms+cmcm methodology, raw data from multiple sources are filtered, condensed, and converted into insights about future behaviours in several forms. traditionally in mc simulation, probability density functions are generated as the simulated outcomes, as is a sensitivity chart which ranks the relevance of each factor to the predicted outcome. fuzzy rule induction (fri) plays a role in the methodology by inferring a rule-based model which supplies a series of conditional tests that lead to some consequences, based on the same data that were loaded into the mc engine. fri serves the threefold purpose of being easy to use, neutral, and scalable. firstly, the decision rules are interpretable by users; they can complement the probability density functions, which show a macro view of the situation.
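the bfgs fitting step can be sketched as follows; this is not the paper's full self-organizing bfgs-pnn (no structural growth of neurons or layers is performed), only an illustration of minimizing the error function of a fixed second-order kolmogorov-gabor polynomial with a bfgs optimizer on synthetic data:

```python
import numpy as np
from scipy.optimize import minimize

# fit y = a0 + a1*x1 + a2*x2 + a3*x1*x2 + a4*x1^2 + a5*x2^2 by minimizing the
# squared error with scipy's bfgs optimizer. data and coefficients are synthetic.
rng = np.random.default_rng(42)
x1 = rng.uniform(0, 5, 200)                       # e.g. lagged confirmed cases (scaled)
x2 = rng.uniform(0, 5, 200)                       # e.g. lagged suspected cases (scaled)
y = 2.0 + 0.5 * x1 + 1.5 * x2 + 0.3 * x1 * x2 + rng.normal(0, 0.2, 200)

def poly(a, x1, x2):
    return (a[0] + a[1] * x1 + a[2] * x2 +
            a[3] * x1 * x2 + a[4] * x1 ** 2 + a[5] * x2 ** 2)

def sse(a):                                       # error function E(p) to be minimized
    return np.sum((y - poly(a, x1, x2)) ** 2)

result = minimize(sse, x0=np.zeros(6), method="BFGS")
print("fitted coefficients:", np.round(result.x, 3))
print("final sse:", round(result.fun, 3))
```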
fri gives another perspective on causality, assisting decision makers in investigating the logic of cause-and-effect. furthermore, unlike other decision rule models, fri allows some fuzzy relaxation in bracketing the upper and lower bounds, so that a decision can be made based on the min-max values pertaining to each conditional test (attribute in the data). the fri rules are formatted as branching logic, also known as predicate logic, which preserves the crudest form of knowledge representation. a predicate rule has a basic if-test[min-max]-then-verdict structure together with a propensity indicating how often it occurs in the dataset. the number of different groups of fri rules depends on how many different labels the predicted class has. the second advantage is that fri rules are objective and free from human bias, as they are derived homogeneously from the data; therefore, they are a suitable ingredient for scientifically devising policies and strategies for epidemic control. thirdly, fri rules can scale up or down not only in quantity but also in cardinality: a rule can consist of as many tests as there are attributes available in the data. in other words, in a composite mc system, a new data source could be chipped in whenever it becomes necessary or newly available, and the attributes of the new data can be added to the conditional tests of the fri rules. one drawback of fri is the lack of indicators for each specific conditional test (or attribute): in the current formulation of fri, the likelihood of occurrence is assigned to the rule as a whole, and little is known about how much each conditional test contributes to the outcome specified by the rule. in light of this shortcoming, our proposed methodology suggests that the scores from the sensitivity chart, which relate the attributes to the outcome, could be mapped onto the rules by simple majority voting. rules are generated as a by-product of classification in data mining; the process works through fuzzification of the data ranges, and the confidence factors of their effects on classification are taken as indicators. let a rule be a series of components constraining the attributes $A_k$, $k = 1..m$, with outcome $\lambda = y$, in the classification model building, so that the components remain valid even after the values are fuzzified. a rule can thus be expressed in predicate format such that each $\lambda \in \Lambda$, where $\Lambda \subseteq Y$ is the set of class labels, and where $\Theta^{c,\downarrow}$ and $\Theta^{c,\uparrow}$ are the lower and upper bounds of the core of each fuzzy interval, which map to a membership value of 1; similarly, the lower and upper bounds of the support, beyond which the membership drops to 0, are denoted by $\Theta^{s,\downarrow}$ and $\Theta^{s,\uparrow}$. the fuzzy rules are built on decision rules generated by a standard rule-induction algorithm, namely a direct rule-based classifier equipped with incremental reduced-error pruning via greedy search [ ] . given the generated rule sets, the task here is to find the most suitable fuzzy extension of each rule, which can be seen as replacing the current crisp memberships of the rules with their corresponding fuzzy memberships; this is not computationally difficult as long as the rule structures and their elements stay the same. in order to fuzzify a membership, a fuzzification formula is applied over the antecedent $\Omega_i \in \Omega$ of the rule set while considering the relevant (positive) data instances $D^{+}$; at the same time, the instances covered by the antecedents $\Omega_j$ of rules predicting the other classes are treated as negative examples (a small code sketch of such a fuzzy interval and its rule firing degree follows below).
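before continuing with the splitting of positive and negative instances, the fuzzified [min-max] test and a cf-weighted class support can be illustrated with a minimal sketch; the membership bounds, rule labels and certainty factors below are hypothetical, and the weighting is a simplified stand-in for the exact formulation that follows:

```python
# illustrative sketch of a fuzzified [min-max] test: a trapezoidal membership over one
# attribute, plus a cf-weighted support score across rules. all bounds, cf values and
# inputs are hypothetical; the exact certainty-factor formula of the paper is not
# reproduced here.
def trapezoid(x, s_lo, c_lo, c_hi, s_hi):
    """membership is 1 inside the core [c_lo, c_hi], 0 outside the support
    [s_lo, s_hi], and linear in between."""
    if x <= s_lo or x >= s_hi:
        return 0.0
    if c_lo <= x <= c_hi:
        return 1.0
    if x < c_lo:
        return (x - s_lo) / (c_lo - s_lo)
    return (s_hi - x) / (s_hi - c_hi)

# each rule: (attribute -> fuzzy interval) tests, a predicted label and a cf.
rules = [
    {"tests": {"three_day_drop_in_confirmed": (0, 5, 40, 60)}, "label": "win",  "cf": 0.93},
    {"tests": {"cured_rate": (0.0, 0.0, 0.015, 0.03)},          "label": "lose", "cf": 0.88},
]

def class_support(instance, rules):
    support = {}
    for r in rules:
        # firing degree = minimum membership over the rule's tests (t-norm: min)
        firing = min(trapezoid(instance[a], *b) for a, b in r["tests"].items())
        support[r["label"]] = support.get(r["label"], 0.0) + r["cf"] * firing
    return support

x = {"three_day_drop_in_confirmed": 30, "cured_rate": 0.010}
print(class_support(x, rules))   # e.g. {'win': 0.93, 'lose': 0.88}
```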
by this approach, the training instances are divided into two subsets: one subset, $D^{+}$, contains all the positive instances, and the other, $D^{-}$, contains all the negative instances. after that, a purity measure is used to further separate the two groups into positive and negative extremes. when it comes to actual operation, a certainty factor, which indicates how strongly new data instances indeed belong to a subgroup, is needed to quantify the division. after segregating the data into fuzzy rules $r_1^{(j)} \dots r_k^{(j)}$ by machine-learning the relations between the attributes and instance values and some class label $\lambda_j$, a further indicator is needed to quantify the strength of each rule. assume we have a new test instance $x$; the support of class $\lambda_j$ for $x$ is computed as $s_j(x) = \sum_{i=1}^{k} \mu_{r_i^{(j)}}(x)\,\mathbb{C}(r_i^{(j)})$, where $\mu_{r_i^{(j)}}(x)$ is the firing degree of the rule on $x$ and $\mathbb{C}(r_i^{(j)})$ is the certainty factor pertaining to the rule $r_i^{(j)}$. the certainty factor $\mathbb{C}$ is expressed as the proportion of the training instances covered by the rule that are labelled $\lambda_j$, where $D^{(j)}$ denotes the subset of training instances labelled $\lambda_j$. the label predicted by the classifier is the class label that attains the greatest value of the support function $s_j(x)$. at times, an instance $x$ cannot be classified into any rule or subgroup; this happens when $s_j(x) = 0$ for all classes $\lambda_j$, in which case $x$ can be randomly assigned or temporarily placed into a special group. otherwise, the fuzzy rules are formed, and certainty and support indicators are assigned to each of them; these indicators express how strong the rules are with respect to their predictive power for the class label. for the purpose of validating the proposed grooms+cmcm methodology, empirical data were obtained from the chinese center for disease control and prevention (cdcp, http://www.chinacdc.cn/en/), an official chinese government agency in beijing, china. since the beginning of the covid- outbreak, cdcp has been releasing the data to the public and updating them daily via the mainstream media tencent and its subsidiary (https://news.qq.com/zt /page/feiyan.htm). the data come mainly from two sources: one source is deterministic in nature, harvested from cdcp in the form of time series starting from jan to feb . a snapshot of the published data is shown in figure ; these data are deterministic in nature, being records of historical facts. the data collected for this experiment are only part of the total statistics available on the website. the data required for this experiment are the numbers of people in china who have contracted the covid- disease, recorded in the following ways: suspected of infection by displaying some symptoms, confirmed infection by medical tests, cumulative confirmed cases, current number of confirmed cases, current number of suspected cases, current number of critically ill, cumulative number of cured cases, cumulative number of deceased cases, recovery (cured) rate % and fatality rate %. this group of time series is subject to grooms for finding the most accurate machine learning technique to forecast the trends as they develop. in this case, bfgs-pnn was found to be the winning candidate model and is hence applied here to generate the future trend of each of the above-mentioned records. the forecasts produced by bfgs-pnn from these selected data are shown in fig . the forecasts are in turn used as deterministic input variables to the cmc model; they have the relatively lowest rmse in comparison with other time-series forecasting algorithms, as tested in [ ] .
the rationale is to use the most accurate forecasted inputs possible so as to achieve the most reliable simulated outcomes from the mc simulation at best effort. the goal of this monte carlo simulation experiment, which is a part of grooms+cmcm, is to hypothetically estimate the direct cost that is needed, as an urgent part of national budget planning, to control the covid- epidemic. direct cost means the cost of medical resources, including but not limited to medicine, personnel, facilities, and other medical supplies directly involved in providing treatment to patients of the covid- outbreak. of course, the grand total cost is much wider and greater than the samples experimented with here; the experiment, however, aims at demonstrating the possibility and flexibility of embracing both deterministic and non-deterministic data inputs in the composite mc methodology. the other group of data fed into the cmc is non-deterministic, or probabilistic, because it has to bear a high level of uncertainty; it is subject to situations that change dynamically and over which there is little control. in the case of covid- epidemic control, finding a cure for the virus is a good example: best effort is put into treatment, but there is no certainty at all about a cure, let alone knowing exactly when a cure could be developed, tested and proven effective against the novel virus. there are other probabilistic factors used in this experiment as well; the selected main attributes are tabulated in table . we assume a simple equation for estimating the direct cost of fighting covid- using only the data on quarantining and isolated medical treatment, as follows. note that the variables shown are abbreviated from the term names, e.g. d-i-r = days_till_recovery. the assumptions and hypotheses are derived from past experience of the direct costs involved in quarantine and isolation during the sars epidemic, as published in [ ] , with reasonable adjustment. for the non-deterministic variables ppi/day, d-f-r and d-t-d, the following assumptions are derived from [ ] . these variables are probabilistic in nature, as shown in table : e.g. nobody can actually tell how long it will take an infested patient to recover and go home, nor how long the isolation needs to be when the patient is in critical condition. all of these are bounded by probabilities that can be expressed through statistical properties such as min-max, mean, standard deviation and so on. hence some probability functions are needed to describe them, and random samples are drawn from these probability distributions to run the simulation. it is assumed that the daily medical cost ppi/day follows a normal distribution with a daily increase rate. the daily increase rate is estimated from [ ] to rise as the days go by, because the chinese government has been putting in ever more resources to stop the epidemic as a national effort; the increase is due to the daily growth in the number of medical staff flown to wuhan from other cities, and to the rising volume of consumable medical items as well as their inflating costs. the daily cost is anticipated to become increasingly higher as long as the battle against covid- continues at full force. there are other supporting material costs and infrastructure costs, such as those of imposing curfews, and wider economic damage.
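a minimal sketch of such a composite simulation is given below; the 14-day forecast series stand in for bfgs-pnn outputs, and every cost, rate and stay length is a placeholder assumption rather than a figure from the paper's tables:

```python
import numpy as np

# composite monte carlo sketch of the direct-cost estimate described above.
rng = np.random.default_rng(7)
trials = 10_000
horizon = 14

forecast_isolated = np.linspace(1200, 2000, horizon)      # deterministic input: forecast patients in isolation
forecast_quarantined = np.linspace(5000, 7000, horizon)   # deterministic input: forecast people in quarantine

totals = np.empty(trials)
for t in range(trials):
    # non-deterministic inputs: daily per-person costs drift upward (normal draws),
    # while the effective stay length is drawn from a uniform range.
    ppi_day = 400.0 * np.cumprod(rng.normal(1.02, 0.01, horizon))   # isolation cost per patient per day
    ppq_day = 50.0 * np.cumprod(rng.normal(1.01, 0.01, horizon))    # quarantine cost per person per day
    stay_factor = rng.uniform(0.8, 1.2)                              # uncertainty in effective stay length
    daily_cost = forecast_isolated * ppi_day + forecast_quarantined * ppq_day
    totals[t] = stay_factor * daily_cost.sum()

print("mean 14-day direct cost (millions):", round(totals.mean() / 1e6, 2))
print("5th/50th/95th percentiles (millions):",
      np.round(np.percentile(totals, [5, 50, 95]) / 1e6, 2))
```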
however, these other costs are not considered for the purpose of demonstrating with a simple and easy-to-understand cmc model. normal distribution and uniform distribution are assumed accountable for the increase and probability distributions that describe the lengths of hospital stay. when more information become available, one can consider refining them to weibull distribution and rayleigh distribution which can better describe the progress of the epidemic in terms of severity and dual statistical degree of freedom. this is a very simplified approach in guessing the daily cost of the so-called medical expenses, based on only two interventions -quarantine and isolation. nevertheless, this cmc model though simplified, is serving as an example of how monte carlo style of modelling can help generate probabilistic outcomes, and to demonstrate the flexibility and scalability of the modelling system. theoretically, the cmc system can expand from considering two direct inputs (quarantining and isolation) to , or even other direct and indirect inputs to estimate the future behaviour of the epidemic. in practical application, data that are to be considered shall be widely collected, pre-processed, examined for quality check and relevance check (via grooms), and then carefully loading into the cmc system for getting the outcomes. stochastic simulation is well-suited for studying the risk factors of infectious disease outbreaks which always changes in their figures across time and geographical dispersion, thereby posing high level of uncertainty in decision making. each model forecast by mc simulation is an abstraction of a situation under observation -in our experiment, it is the impact of the dynamics of epidemic development on the direct medical costs against covid- . the model forecast depicts the future tendencies in real-life situation rather than statements of future occurrence. the output of mc simulation sheds light in understanding the possibilities of outcomes anticipated. being a composite mc model, the ultimate performance of the simulated outcomes would be sensitive to the choice of the machine learning technique that generated the deterministic forecast as input variable to the cmc model. in light of this, a part of our experiment besides showcasing the mc outcomes by the best available technique, is to compare the levels of accuracy (or error) resulted from the wining candidate of grooms and a standard (default) approach. the forecasting algorithms in comparison are bfgs+pnn and linear regression respectively. the performance criterion is rmse, which is consistent and unitless ! = m +,-). & / * . ! n and rmse=mo ( ! )n, as defined in [ ] . at the same time, the total costs that are produced manually by explicitly use of spreadsheet using human knowledge are compared vis-à-vis with those of the forecast models by cmcm. the comparative performances are tabulated in table ii . the forecasting period is days. the cmcm model is implemented on oracle crystal ball release . . . . ( -bit), running on a i cpu @ ghz, gb ram and ms windows platform. , trials were set to run the simulation for each model. it is apparent that as seen that from table ii , the rmse of the monte carlo forecasting method using linear regression is more than double that of the method using bfgs+pnn (approximately k vs k). that is mainly due to the overly estimate of all the deterministic input variables by linear regression. 
referring to the first diagram in figure , the variable called new_daily_increase_confirmed is non-stationary and it contains an outlier which rose up unusually high near the end. furthermore, the other correlated variable called new_daily_increase_suspected, which is a precedent of the trends of the confirmed cases, also is non-stationary and having an upward trend near the end, though it dips eventually. by using linear regression, the outlier and the upward trends would encourage the predicted curve to continue with the upward trend, linearly, with perhaps a sharp gradient. consequently, most of the forecast outcomes in the system have been over-forecasted. as such, using linear regression causes unstable stochastic simulation, leading to the more extreme final results, compared to the other methods. this is evident that the total_daily_cost has been largely over-forecasted and under-forecasted by manual and mc approaches, in table ii. on the other hand, when the bfgs+pnn is used, which is able to better recognize non-linear mapping between attributes and prediction class in forecasting, offers more realistic trends which in turn are loaded into the cmcm. as a result, the range between the final total_daily_cost results are narrower compared it to its peer linear regression ([lr: mil - mil] vs [bfgs+pnn: mil - mil]). the estimated direct medical cost for fighting covid- for a fortnight is estimated to be about . million usd given the available data using grooms+cmcm. according to the results in the form of probability distributions in figure , different options are available for the user to choose from when it comes to estimating the fortnightly budget in fighting this covid- given the existing situation. each option comes with different price tag, and at different levels of risks. in general, the higher risk that the user thinks it can be tolerated, the lower the budget it will be, and vice-versa. from the simulated possible outcomes in figure , if budget is of constrained, the user can consider bearing the risk (uncertainty of %) that a mean of $ mil with [min:$ mil, max:$ mil] is forecasted to be sufficient to fulfil the direct medical cost need. likewise, if a high certainty is to be assured, for example at % the chance that the required budget would be met, it needs about a mean of $ mil with [min:$ mil, max:$ mil]. for a high certainty of %, it is forecasted that the budget range will fall within a mean of $ mil and ranging from [min:$ mil, max:$ mil]. as a de-facto practice, some users will take % certainty as pareto principle ( - ) decision [ ] and accept the mean budget at $ mil. $ mil should be a realistic and compromising figure when compared to manual forecast without stochastic simulation, where $ mil and $ mil budgets would have been forecasted by manual approach by linear regression and neural network respectively. sensitivity chart, by its name, displays the extents of how the output of a simulated mc model is affected by changes in some of the input variables. it is useful in risk analysis of a so-called black box model such as the cmcm used in this experiment by opening up the information about how sensitive the simulated outcome is to variations in the input variables. since the mc output is an opaque function of multiple inputs of composite variables that were blended and simulated in a random fashion over many times, the exact relationship between the input variables and the simulated outcome will not be known except through sensitivity chart. 
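a sensitivity ranking of this kind can be approximated outside of such tooling by rank-correlating each sampled input with the simulated output; the sketch below uses synthetic draws and a toy cost model, the variable names are hypothetical, and commercial simulators use a comparable rank-correlation or contribution-to-variance measure:

```python
import numpy as np
from scipy.stats import spearmanr

# derive a simple sensitivity chart: rank-correlate each sampled input with the
# simulated output across trials. inputs and the output model are placeholders.
rng = np.random.default_rng(3)
trials = 5000
inputs = {
    "avg_days_to_recovery": rng.uniform(7, 21, trials),
    "isolation_cost_per_day": rng.normal(400, 60, trials),
    "quarantine_cost_per_day": rng.normal(50, 10, trials),
}
# toy output model: total cost responds mostly to recovery time and isolation cost
output = (inputs["avg_days_to_recovery"] * inputs["isolation_cost_per_day"] * 1500
          + inputs["quarantine_cost_per_day"] * 6000 * 14
          + rng.normal(0, 2e5, trials))

sensitivity = {name: spearmanr(x, output)[0] for name, x in inputs.items()}
for name, rho in sorted(sensitivity.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>25s}  rank correlation = {rho:+.2f}")
```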
an output of sensitivity chart from our experiment is generated and shown in figure . as it can be observed from figure , the top three input variables which are most influential to the predicted output, which is the total medical cost in our experiment are: the average number of days before recovery in day and day , and average cost per day for isolating a patient in day . the first two key variables are about how soon a patient can recover from covid- , near the final days. and the third most important variable is the average daily cost for isolating patients at the beginning of the forecasting period. this insight can be interpreted that an early recovery near the final days and reasonably lower medical cost at the beginning would impact the final budget to the greatest extend. consequently, decision makers could consider that based on the results from the sensitivity analysis, putting in large or maximum efforts in treating isolating patients at the beginning, observe for a period of days or so; if the medical efforts that were invested in the early days take effect, the last several days of recovery will become promising, hence leading to perhaps saving substantially a large portion of medical bill. sensitivity chart can be extended to what-if scenario analysis for epidemic key variable selection and modeling [ ] . for example, one can modify the quantity of each of the variables, and the effects on the output will be updated instantly. however, this is beyond the scope of this paper though it is worth exploring for it is helpful to fine-tune how the variables should be managed in the effort of maximizing or minimizing the budget and impacts. since the effect of a group of independent variables on the predicted output is known and ranked from the chart, it could be used as an alternative to feature ranking or feature engineering in data mining. the sensitivity chart is a byproduct generated by the mc after a long trial of repeating runs using different random samples from the input distributions. figure depicts how the sensitivity chart is related to the processes in the proposed methodology. effectively the top ranked variables could be used to infer the most influential or relevant attributes from the dataset that loads into an fri model (described in section . ) for supervised learning. one suggested approach which is fast and easy is to create a correlogram, from there one can do pairwise matching between the most sensitive variables from the non-deterministic data sources to the corresponding attributes from the deterministic dataset. ranking scores could be mapped over too, by boyer-moore majority vote algorithm [ ] . some selected fuzzy rules generated by the methodology and filtered by the sensitivity chart correlation mapping are show below. the display threshold is . which is arbitrary chosen to display only the top six rules where half of them predicts a reflection point for the struggle of controlling the covid- can be attained, the other half indicate otherwise. an fri model in a nutshell is a classification which predicts an output belonging either one of two classes. in our experiment, we setup a classification model using fri to predict whether an inflection point of the epidemic could be reached. there is no standard definition of inflection point, though it is generally agreed that is a turning point at which the momentum of accumulation changes from one direction to another or vice-versa. 
that could be interpreted as an intersection of two curves whose trajectories begin to switch. in the context of epidemic control, an inflection point is the moment from which the rate of spreading starts to subside, after which the trend of the epidemic leads towards elimination or eradication. based on a sliding window of days in length, a formula for computing the inflection point from the three main attributes of the covid- data is given as follows:

win: score = w_1 × (Δ down-trend over the past days of new_daily_increase_confirmed (n.d.i.c)) + w_2 × (Δ down-trend over the past days of current_confirmed) + w_3 × (Δ up-trend over the past days of cured_rate)

lose: score = w_1 × (Δ up-trend over the past days of new_daily_increase_confirmed (n.d.i.c)) + w_2 × (Δ up-trend over the past days of current_confirmed) + w_3 × (Δ up-trend over the past days of death_rate)

where w_1 = . , w_2 = . , and w_3 = . , which can be arbitrarily set by the user. the weights reflect the importance one assigns to how the upward or downward trends of confirmed cases and of cured versus death rates contribute to reaching the inflection point (a code sketch of this scoring scheme is given below). a dual-curve chart depicting the inflection point is shown in figure . interestingly, near the end of the timeline ( / - / ), that is from the th point onwards, the two curves appear to converge, and it is hoped that the winning curve will rise above the losing curve. an inflection point might have been reached, but the momentum of winning is not yet there; further observation of the epidemic development is needed to confirm the certainty of winning. nevertheless, the top six rules built from the classification of the inflection point, and processed by feature selection via sensitivity analysis, are shown below. cf stands for confidence factor, which indicates how strong the rule is. on the winning side, the rules reveal that when the variables about new confirmed cases fall below certain numbers, a win is scored, contributing towards an inflection point. (yester days-ndic = '( . .. ∞)') → win (cf = . ) the strongest rules of the two forces are rules and . rule shows that, to win the epidemic control, the downward trend over three consecutive days must fall below ; on the other hand, the epidemic control may lead to failure if the cured rate stays below . % and the new daily increase in confirmed cases remains high, between and (rounding up the decimal points).
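the sliding-window scoring above can be sketched as follows; the window length and weights are placeholders (the published values are not reproduced here), the daily series are synthetic, and "down-trend" is read as the magnitude of decline over the window:

```python
import numpy as np

# sketch of the sliding-window win/lose scoring for detecting an inflection point.
def trend(series):
    """signed change over the window: negative = down-trend, positive = up-trend."""
    return series[-1] - series[0]

def win_lose_scores(ndic, current_confirmed, cured_rate, death_rate,
                    window=3, w=(0.5, 0.3, 0.2)):
    w1, w2, w3 = w
    wins, loses = [], []
    for t in range(window, len(ndic) + 1):
        sl = slice(t - window, t)
        win = (w1 * max(0, -trend(ndic[sl])) +
               w2 * max(0, -trend(current_confirmed[sl])) +
               w3 * max(0, trend(cured_rate[sl])))
        lose = (w1 * max(0, trend(ndic[sl])) +
                w2 * max(0, trend(current_confirmed[sl])) +
                w3 * max(0, trend(death_rate[sl])))
        wins.append(win)
        loses.append(lose)
    return np.array(wins), np.array(loses)

ndic = np.array([2000, 2600, 3200, 3000, 2500, 2100, 1800], dtype=float)
curr = np.array([20000, 24000, 28000, 30000, 30500, 30200, 29500], dtype=float)
cured = np.array([0.010, 0.012, 0.015, 0.020, 0.030, 0.045, 0.060])
death = np.array([0.020, 0.021, 0.021, 0.022, 0.022, 0.021, 0.021])

win, lose = win_lose_scores(ndic, curr, cured, death)
print("win scores :", np.round(win, 1))
print("lose scores:", np.round(lose, 1))
```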
coupling grooms+cmcm together offers the flexibility of embracing both deterministic and non-deterministic input data into the monte carlo simulation where random samples are drawn from the distributions of the data from the non-deterministic data sources for reliable outputs. during the early period of disease outbreaks, data are scarce and full of uncertainty. the advantage of cmc is that a range of possible outcomes are generated associated with probabilities. subsequently sensitivity analysis, what-if analysis and other scenario planning can be done for decision support. as a part of the grooms+cmcm methodology, fuzzy rule induction is also proposed, which provides another dimension of insights in the form of decision rules for decision support. a case study of the recent novel coronavirus epidemic (which are also known as wuhan coronavirus, covid- or -ncov) is used as an example in demonstrating the efficacy of grooms+cmcm. through the experimentation over the empirical covid- data collected from the chinese government agency, it was found that the outcomes generated from monte carlo simulation are superior to the traditional methods. a collection of soft computing techniques, such as bfgs+pnn, fuzzy rule induction, and other supporting algorithms to grooms+cmcm could be able to produce qualitative results for better decision support, when used together with monte carlo simulation, than any of deterministic forecasters alone. outbreaks chronology: ebola virus disease the impacts on health, society, and economy of sars and h n outbreaks in china: a case comparison study coronavirus: the hit to the global economy will be worse than sars nowcasting and forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study the mathematical theory of epidemics stochastic epidemic models and their statistical analysis how big is an outbreak likely to be? 
methods for epidemic final-size calculation epidemic process over the commute network in a metropolitan area a queue-based monte carlo analysis to support decision making for implementation of an emergency department fast track monte carlo simulation model to study the inequalities in access to ems services real-time stochastic evacuation models for decision support in actual emergencies on algorithmic decision procedures in emergency response systems in smart and connected communities finding an accurate early forecasting model from small dataset: a case of -ncov novel coronavirus outbreak bayesian prediction of an epidemic curve analytical study of the least squares quasi-newton method for interaction problems numerical analysis for applied science heuristic self-organization in problems of engineering cybernetics estimating the coefficients of polynomials in parametric gmdh algorithms by the improved instrumental variables method adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix (abstract nonlinear digital filters: analysis and applications medical applications of artificial intelligence a cost-based comparison of quarantine strategies for new emerging diseases how china built two coronavirus hospitals in just over a week fitting mechanistic epidemic models to data: a comparison of simple markov chain monte carlo approaches mathematical and computer modelling of the pareto principle soubeyrand samuel and thébaud gaël using sensitivity analysis to identify key factors for the propagation of a plant epidemic r automated reasoning: essays in honor of woody bledsoe he is a co-founder of the data analytics and collaborative computing research group in the faculty of science and technology. prior to his academic career, simon took up various managerial and technical posts, such as systems engineer, it consultant and e-commerce director in australia and asia. dr. fong has published over international conference and peer-reviewed journal papers, mostly in the areas of data mining, data stream mining, big data analytics, meta-heuristics optimization algorithms, and their applications. he serves on the editorial boards of the journal of network and computer applications of elsevier, ieee it professional magazine, and various special issues of scie-indexed journals. simon is also an active researcher with leading positions such as vice-chair of ieee computational intelligence society (cis) task force on her latest winning work includes the first unmanned supermarket in macau enabled by the latest sensing technologies, face recognition and e-payment systems. she is also the founder of several online offline dot.com companies in trading and retailing both online and offline. ms li is also an active researcher, manager and chiefknowledge-officer in dacc laboratory at the faculty of science and technology rubén gonzález crespo has a phd in computer science engineering. currently he is vice chancellor of academic affairs and faculty from unir and global director of engineering schools from proeduca group. he is advisory board member for the ministry of education at colombia and evaluator from the national agency for quality evaluation and accreditation of spain (aneca) his current research interests include group decision making, consensus models, linguistic modeling, and aggregation of information, information retrieval, bibliometric, digital libraries, web quality evaluation, recommender systems, and social media. 
in these topics he has published more than papers in isi journals and coordinated more than research projects. dr. herrera-viedma is vice-president of publications of the ieee smc society and an associate editor of international journals such as the key: cord- -n qwsvtr authors: arbia, giuseppe title: a note on early epidemiological analysis of coronavirus disease outbreak using crowdsourced data date: - - journal: nan doi: nan sha: doc_id: cord_uid: n qwsvtr crowdsourcing data can prove of paramount importance in monitoring and controlling the spread of infectious diseases. the recent paper by sun, chen and viboud ( ) is important because it contributes to the understanding of the epidemiology and of the spreading of covid- in a period when most of the epidemic characteristics are still unknown. however, the use of crowdsourcing data raises a number of problems from the statistical point of view which run the risk of invalidating the results and of biasing estimation and hypothesis testing. while the work by sun, chen and viboud ( ) has to be commended, given the importance of the topic for worldwide health security, in this paper we deem important to remark the presence of the possible sources of statistical biases and to point out possible solutions to them the paper by sun, chen and viboud ( ) (henceforth scv) is an important example of the use of crowdsourced data in monitoring the spread of covid- . indeed, crowdsourcing data can prove of paramount importance in monitoring and controlling the spread of infectious diseases as it is also remarked e. g. by leung and leung ( ) among the many others. the paper relies on an innovative source (potentially obtainable in real time) derived from social media and news report collected in china from th to st of january during the first outbreak of the corona virus epidemics. the data collected referred to cases. in the paper the crowdsourced data, coming from different sources, are used to estimate several epidemiological parameters of tremendous importance in the process of surveillance and control of the diffusion of the disease such as: the relative risk by age group, the mean age and skewness of infected people, the time of delays between symptoms and seeking care at hospital, the mean incubation period. scv also use the crowdsourced data to test theoretical hypotheses using the wilcoxon test and the kruskal-wallis test these test lead them to conclude that the delay between symptoms onset and seeking care at hospital or clinic decreased significantly after january th and that the delay was significantly longer in hubei with respect to tianjin and yunnan and between international travelers and local population. there are two main statistical problems connected with the use of crowdsourced data in general and with those presenting a spatial configuration in particular (such as those employed by scv), namely: . the lack of a precise sample design . the presence of spatial/network correlation among the individuals in the sample we will briefly discuss the two problems in more details in the following two sections. catholic university of the sacred heart, milan (italy) as is it well known the wilcoxon test (or the mann-whitney u test) is a nonparametric test used to test the hypothesis of equality between two independent samples. 
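as a quick, self-contained illustration of these rank-based procedures — the mann-whitney u test just mentioned and the kruskal-wallis test described immediately below — the following sketch applies both to synthetic placeholder data standing in for the crowdsourced delays analysed by scv.

```python
import numpy as np
from scipy.stats import mannwhitneyu, kruskal

rng = np.random.default_rng(1)
# synthetic delays (days) between symptom onset and seeking care at hospital
delays_early_period = rng.gamma(shape=2.0, scale=3.0, size=80)
delays_late_period = rng.gamma(shape=2.0, scale=1.5, size=120)

u_stat, p_value = mannwhitneyu(delays_early_period, delays_late_period,
                               alternative="two-sided")
print(f"mann-whitney u = {u_stat:.1f}, p = {p_value:.4f}")

# the kruskal-wallis test extends the comparison to three or more groups,
# e.g. delays observed in three different provinces
province_a = rng.gamma(2.0, 3.0, 60)
province_b = rng.gamma(2.0, 2.0, 60)
province_c = rng.gamma(2.0, 1.5, 60)
h_stat, p_kw = kruskal(province_a, province_b, province_c)
print(f"kruskal-wallis h = {h_stat:.1f}, p = {p_kw:.4f}")
```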
the kruskal-wallis test is also non-parametric which extends the wilcoxon test in comparing two or more independent samples of equal or different sample sizes and can be also seen as the non-parametric version of the one-way analysis of variance (anova). see wilcoxon ( ) and kruskal and wallis ( ) a general characteristics of crowdsourced data, , likewise other unconventionally collected big data , is the lack of any precise sample design (arbia, ). this situation is described in statistics as a "convenience sampling", in which case it is known that no probabilistic inference is possible (hansen et al, ) . as fisher ( ) says "if we aim at a satisfactory generalization of the sample results, the sample experiment needs to be rigorously programmed". indeed, while in a formal sample design the choice of sample observations is guided by a precise mechanism which allows the calculation of the probabilities of inclusion of each unit (and, hence, probabilistic inference), on the contrary with a convenience collection no probability of inclusion can be calculated thus giving rise to over-under-representativeness of the sample units. the advantages of using a convenience sampling are the obvious ease of data collection and cost-effectiveness. however, the disadvantages are that the results cannot be generalized to a larger population because the under-(or over-) representation of units produce a bias. furthermore, convenience sampling-based estimates are characterized by larger standard error and as a consequence by insufficient power in hypothesis testing. in this situation the estimation of parameters (like mean, median, proportions) based on the principle of analogy and the calculation of p-values to take decisions in hypothesis testing is not theoretically motivated. scv, indeed, acknowledge the fact that the collection criterion used could have generated possible biases in their sample. they list problems like the fact that a substantial proportion are travelers (who are predominantly adults), that data are captured by the health system and so are biased toward more severe cases, that geographical coverage is heterogeneous with an under-representation of provinces with a weaker health infrastructure. however, they do not take any action to reduce such biases. the problem raised by the convenience collection of data emerges dramatically in the big data era when we increasingly avail data which, almost invariably, do not satisfy the necessary conditions for probabilistic inference. in recent years researchers are becoming aware of this problem trying to suggest solutions to reduce the distorting effects inherent to non-probabilistic designs (fricker and schonlau, ) . one possible strategy consists in transforming crowdsourced datasets in such a way that they resemble a formal sample design. this procedure has been termed post-sampling (arbia et al., ) and represents a particular form of poststratification (holt and smith, ; little, ) . to implement a post-sampling analysis, we need to calculate, in each geographical location (e.g. the chinese provinces), a post-sampling ratio (ps), defined as the ratio between the number of observations required by a reference formal sample design (e.g. random stratified with probability of inclusion proportional to size) and those collected through crowdsourcing. more reliable estimations of population parameters can then be obtained by considering a weighted version of the dataset using the post-sampling ratio as weights. 
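the sketch below illustrates the post-sampling idea just described: a post-sampling ratio is computed per geographical unit as the number of observations a formal design would require divided by the number actually crowdsourced, and estimates are then reweighted accordingly. all counts and values are made up for exposition; the discussion of over- and down-weighting continues after the sketch.

```python
import numpy as np

def post_sampling_weights(target_counts, observed_counts):
    """post-sampling ratios: sample sizes required by a formal design divided
    by the crowdsourced sample sizes, one ratio per geographical unit."""
    return np.asarray(target_counts, float) / np.asarray(observed_counts, float)

def weighted_mean(values, unit_index, weights):
    """weighted estimate of a population mean, where each observation carries
    the post-sampling weight of its geographical unit."""
    w = weights[np.asarray(unit_index)]
    values = np.asarray(values, float)
    return float(np.sum(w * values) / np.sum(w))

# made-up example with three provinces: the first is over-represented in the
# crowdsourced data, the second under-represented
target_counts = [40, 25, 15]      # what a stratified design would require
observed_counts = [80, 10, 12]    # what was actually crowdsourced
ps = post_sampling_weights(target_counts, observed_counts)

rng = np.random.default_rng(2)
unit_index = np.repeat([0, 1, 2], observed_counts)   # province of each case
ages = rng.normal(loc=np.array([45.0, 55.0, 60.0])[unit_index], scale=15.0)
print(f"naive mean age: {ages.mean():.1f}")
print(f"post-sampled mean age: {weighted_mean(ages, unit_index, ps):.1f}")
```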
thus, crowdsourced observations will have to be over-weighted if ! > and, on the contrary, down-weighted when ! < . this strategy has been adopted in arbia et al. ( ) in order to estimate the food price index in nigeria using data crowdsourced through smartphones and in arbia and nardelli ( ) to estimate spatial regression models. after post-sampling, estimates display less bias and lower standard errors and the reduction in the power of the tests is moderated. the dataset used by scv refer to individuals ( from china and the rest from abroad). both in the estimation of the epidemiological parameters reported in section and in hypothesis testing the authors treat their crowdsourced data as if they were independent. indeed, both the wilcoxon and the kruskal-wallis test are based on the assumptions that all the observations are independent of each other. however, even if data were collected obeying a formal sample design, a further potential source of bias is the fact that the observational units could display a certain degree of spatial/network correlation (cliff and ord, ; arbia, ) . observed units that are close in space or in network interaction, may display similar values in the observed variables (e. g. age, incubation period, delays between symptoms and seeking care at hospital) due to the interaction between individuals or/and to presence of some unobserved latent variable with a geographical component. the effects of spatial correlation in the geography of epidemics is well documented in the book by cliff et al. ( ) . when observed data are not independent and display a positive spatial/network correlation, the standard errors are underestimated leading to inefficient estimation of the various parameters (mean, median, proportion etc.) . but the consequence on hypothesis testing can be even worse. due to the underestimation of the standard errors, indeed, the statistical test become artificially inflated leading to lower p-values and, as a consequence, to the rejection of the null hypotheses more frequently than we should. as a result, the significance of the statistical test can become very poor with very artificially high probability of type i error. this could explain the very low levels of p-values (of the order of - ) reported in the scv paper despite the relatively small dataset used. standard statistical textbooks like shabenberger and gotway ( ) and cressie and wikle ( ) document how to calculate the level of spatial/network correlation between geographically located individuals and how this parameter can be used in the process of estimation and hypothesis testing in order to obtain more reliable inferential conclusions. the work by sun, chen and viboud ( ) has to be commended, given the absolute relevance of the topic for health security and the timeliness with which results are presented in a period of great uncertainty related to the worldwide diffusion of the new corona virus covid- . their results are of invaluable help in the process of surveillance, monitoring and control of the disease. in this comment we draw the attention of the authors and of all the researchers and health operators on the fact that the crowdsourced dataset they use can lead to biases in the estimation of the epidemiological parameters and in hypothesis testing procedures. 
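as one concrete example of the diagnostics such references describe, the sketch below computes moran's i, a standard measure of spatial/network autocorrelation; a clearly positive value warns that the observations are not independent and that naive standard errors will be too small. the weight matrix and the data are illustrative placeholders, not the scv dataset.

```python
import numpy as np

def morans_i(values, weight_matrix):
    """moran's i for a variable observed at n locations, given an n-by-n
    spatial/network weight matrix with zeros on the diagonal."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weight_matrix, dtype=float)
    z = x - x.mean()
    return (x.size / w.sum()) * (z @ w @ z) / (z @ z)

# a chain of five locations where only neighbours interact
w = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
delays = np.array([7.0, 6.5, 6.0, 2.0, 1.5])     # e.g. delays to seeking care
print(f"moran's i = {morans_i(delays, w):.3f}")  # clearly > 0 here
```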
we hope that this note may help future studies in the area, so as to obtain even more reliable estimation and more grounded tests of theoretical hypotheses thus progressing rapidly and rigorously in the knowledge about covid- and any possible future new epidemics. a primer for spatial econometrics ) statistics, new empiricism and society in the era of big data, forthcoming, springerbrief on spatial lag models estimated using crowdsourcing, web-scraping or other unconventionally collected data, under revision for spatial economic analysis post-sampling crowdsourced data to allow reliable statistical inference: the case of food price indices in nigeria, paper presented at the lx conference of the italian statistical society advantages and disadvantages of internet research surveys: evidence from the literature post stratification, series a use of ranks in one-criterion variance analysis post-stratification: a modeler's perspective crowdsourcing data to mitigate epidemics, the lancet digital health early epidemiological analysis of coronavirus disease outbreak using crowdsourced data: a population level observational study statistical methods for spatial data analysis individual comparisons by ranking methods key: cord- -f vjm j authors: paiva, henrique mohallem; afonso, rubens junqueira magalhaes; caldeira, fabiana mara scarpelli de lima alvarenga; velasquez, ester de andrade title: a computational tool for trend analysis and forecast of the covid- pandemic date: - - journal: nan doi: nan sha: doc_id: cord_uid: f vjm j purpose: this paper proposes a methodology and a computational tool to study the covid- pandemic throughout the world and to perform a trend analysis to assess its local dynamics. methods: mathematical functions are employed to describe the number of cases and demises in each region and to predict their final numbers, as well as the dates of maximum daily occurrences and the local stabilization date. the model parameters are calibrated using a computational methodology for numerical optimization. trend analyses are run, allowing to assess the effects of public policies. easy to interpret metrics over the quality of the fitted curves are provided. country-wise data from the european centre for disease prevention and control (ecdc) concerning the daily number of cases and demises around the world are used, as well as detailed data from johns hopkins university and from the brasil.io project describing individually the occurrences in united states counties and in brazilian states and cities, respectively. u. s. and brazil were chosen for a more detailed analysis because they are the current foci of the pandemic. results: illustrative results for different countries, u. s. counties and brazilian states and cities are presented and discussed. conclusion: the main contributions of this work lie in (i) a straightforward model of the curves to represent the data, which allows automation of the process without requiring interventions from experts; (ii) an innovative approach for trend analysis, whose results provide important information to support authorities in their decision-making process; and (iii) the developed computational tool, which is freely available and allows the user to quickly update the covid- analyses and forecasts for any country, united states county or brazilian state or city present in the periodic reports from the authorities. 
mathematical models have been widely used to study the transmission dynamics of infectious diseases, enabling the understanding of the disease spread and the optimization of disease control [ ] . forecasting models are used to predict future behavior as a function of past data. this is a widely used method in the implementation of epidemic mathematical models, since it is necessary to know the past behavior of a disease to understand how it will evolve in the future. accurate forecasts of disease activity could allow for better preparation, such as public health surveillance, development and use of medical countermeasures, and hospital resource management [ ] . a similar approach is the concept of trend analysis, which allows predicting future behavior with accuracy, especially in the short run. a trend is a change over time exhibited by a random variable [ ] ; trend analyses provide direction to a trend from past behavior, allowing predicting future data. for better effectiveness, the predictions should be updated periodically, as soon as new data are available. the technique of trend analysis is widely used in several areas of science, such as finances [ ] [ ] and meteorology [ ] [ ] . in the context of health systems, trend analysis was used by zhao et al. [ ] , to analyze malignant mesotheliomas in china, aiming to provide data for its prevention and control; by soares et al. [ ] , to predict the testicular cancer mortality in brazil; by zahmatkesh et al. [ ] , to forecast the occurrences of breast cancer in iran; by mousavizadeh et al. [ ] , to forecast multiple sclerosis in a region of iran; and by yuan et al. [ ] , to analyze and predict the cases of type diabetes in east asia. modeling and prediction of the dynamics of the covid- pandemic is a subject of great interest. therefore, a myriad of papers on this theme have been published over the last months, exploiting different modeling approaches, such as compartment models [ ] , time series analysis [ ] , artificial intelligence [ ] [ ] , and regression-based models [ ] [ ] . for this purpose, some research groups extended previous epidemiological models to describe the covid- pandemics: lin et al. [ ] created a conceptual model for the covid- outbreak in wuhan, china, using components from the influenza pandemic in london, while paiva et al. [ ] proposed a dynamic model to describe the covid- pandemic, based on a model previously developed for the mers epidemic. this list is far from being exhaustive. for a detailed survey on different modeling approaches in this context, the reader is referred to review papers such as [ ] and [ ] . it is important to note that the behavior of the pandemic may vary greatly in the different regions of the world, due to characteristics such as different social habits (higher or lower physical interaction between citizens), capacity of the local health system, different governmental actions, and so on. therefore, the parameters of a mathematical model need to be tailored to the region where the disease behavior is being studied. furthermore, even in the same region, the conditions may vary very quickly, in a matter of weeks or even days (for instance, following the decree or release of a lockdown, or the saturation of the available intensive care unit vacancies in the hospitals); thus, the model parameters would need to be updated very often, usually by an expert. however, these analyses might take time and require dedicated work from highly qualified personnel, thus decreasing their availability. 
it is natural to expect that such analyses are run periodically at the country level, but the same may not be a reality locally at every municipality. therefore, in this scenario, it is useful to have a computational tool to perform a quick and automatic analysis and forecast of the disease conditions in any region, following the periodic updates published by the authorities. this is the purpose of the present paper. in the present paper, the fundamental curve that is used to describe the historical data is an asymmetric sigmoid, i.e., letting the independent variable be t, then the dependent variable is given as a function f: ℝ ↦ ℝ [ ] : with the parameters a ∈ ℝ ∪ { }, ν ∈ ℝ , δ ∈ ℝ , t ∈ ℝ ∪ { }. in the present work, the independent variable t is the time in days, whereas the dependent variable is either the cumulative number of individuals that were positively tested for sars-cov- or the cumulative number of individuals deceased with the disease as the cause. notice that i.e., the modeling of the cumulative number of cases/demises by ( ) implies convergence to a final value a. however, the convergence is asymptotic, therefore it is interesting to know when a certain threshold of the final number of infected/deceased has been reached. for that purpose, let a time instant τ ' be such that a particular value f τ ' is reached: where the parameter α ∈ ) , +. then, by replacing ( ) for f τ ' in ( ), one may solve to find: therefore, from ( ) one can determine the (finite) instant when a certain proportion of the final number of cases/demises is reached, which is a useful figure to evaluate whether the contamination can be considered over or not. in this paper, the settling date of the contamination is adopted as τ . , corresponding to the day where the number of occurrences reaches % of its final value. the settling ratio of % is a standard value used in the analysis of dynamic systems [ ] . the rate at which the number of infections/demises grows can be calculated by differentiation of ( ) with respect to the independent variable t, which yields df t dt = a δ e + νe as a matter of fact, the value of ( ) in a particular day t is an important indicator for healthcare infrastructure decision-making concerning the number of infected individuals, as a higher value indicates that the upcoming period might stress the healthcare infrastructure, whereas a comparatively lower value points that the number of new cases might be accommodated with the existing infrastructure. by analyzing the number of individuals that are cured each day and discharged from the facilities and comparing it with the rate of newly infected individuals, if the first is greater than the latter, than the capacity of the facilities is enough to treat the ill and they will not be endangered by lack of proper treatment. differentiating ( ) a sign change in ( ) = > < for t > t , this point corresponds to the maximum rate, i.e., the daily number of either infected or deceased individuals. replacing t = t in ( ) yields notice from ( ) that ν = entails f t = . a, i.e., the sigmoid curve crosses half of the final value at t = t . this is deemed a symmetric sigmoid. for the sake of understanding, consider two other illustrative possible values of ν: a) for ν = , from ( ), f t = a/√ ≈ . a, that is, the inflection happens at a later stage, when roughly % of the final values has been reached; b) for ν = . , from ( ), f t = a/ ≈ . 
a, in other words, the inflection happens at an earlier stage, when approximately only % of the final values has been reached. it is clear from these examples and from ( ) that the value of ν controls the degree of asymmetry in the sigmoid curve, with ν = representing a symmetric curve about the t = t vertical straight-line. this is illustrated in figure (a), where ( ) is shown for three values of ν whereas figure (b) shows ( ), i.e., the rate. it is interesting to remark that the value of ν impacts the symmetry of the derivative, with ν = representing a gaussian curve, with acceleration and deceleration phases occurring at the same rate. when ν < , the deceleration phase of the sigmoid is slower than the acceleration phase; when ν > , the opposite occurs. for their capability of representing processes with asymmetric acceleration and deceleration phases, asymmetric sigmoid curves are interesting to represent the data of a pandemic. many factors can contribute to the asymmetry between acceleration and deceleration phases besides the very nature of the disease spread, such as the introduction of policies by health authorities in order to slow down the spread, e.g., reduced social contact. therefore, this extra degree of freedom brought by the asymmetric sigmoid curve is useful to better represent the data. moreover, the added complexity with regard to a symmetric curve is due only to the necessity of estimating a single additional parameter, namely ν. in our context, there are three main sigmoid parameters of interest, which are described in table . the next section presents the algorithm used to estimate the parameters a, ν, δ, and t based on measured data from either the number of newly infected individuals per day or the number of deceased per day. the parameters a, ν, δ, and t are estimated based on the solution of a constrained optimization problem, in which the integral time square error (itse) [ ] is minimized, where the error is the difference between the value of f t output by ( ) and the corresponding data y t obtained from the authorities at the same day. we consider a time window for t ∈ { , , … , t vw= } for which the data y t are available at each day. there is a small abuse of notation by restricting the real-valued variable t to assume only integer values coinciding with the number of the day, let the vector of parameters to be estimated be defined as where the symbol • y indicates the transpose of a vector •. the optimal value of the vector Θ is given as where the argument Θ was explicitly included in f t, Θ to emphasize that the parameters may be varied during the optimization process. note that, for optimization purposes, strict inequalities cannot be implemented, therefore for the constraints ν > and δ > , an arbitrary small positive real number μ > is chosen and the constraints are approximated as ν ≥ μ and δ ≥ μ. after the optimization problem is solved to yield Θ * , the optimal values of a * , ν * , δ * , t * are fixed values used to build the curve. the function f t, Θ is nonlinear in the parameters Θ, and the cost function exacerbates that further, rendering the optimization problem nonlinear. moreover, the inequality constraints introduce additional difficulty, rendering the analytical solution of the optimization problem impractical. therefore, numerical methods must be used. one class of methods that are suitable for nonlinear constrained optimization is the so-called sequential quadratic programming (sqp) [ ] [ ] [ ] . 
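the sqp machinery used for this problem is described next; as a hedged illustration of the calibration step, the sketch below minimises a time-weighted squared error with scipy's slsqp solver (an sqp-type method) standing in for matlab's fmincon. the richards-type parameterization f(t) = a (1 + ν e^{-(t - t0)/δ})^{-1/ν}, the settling-time formula derived from it, the initial guess and the bound values are assumptions consistent with the properties described in the text, not the authors' exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(t, a, nu, delta, t0):
    """richards-type asymmetric sigmoid: final value a, asymmetry nu,
    time scale delta, inflection point t0 (assumed parameterization)."""
    return a * (1.0 + nu * np.exp(-(t - t0) / delta)) ** (-1.0 / nu)

def settling_time(alpha, nu, delta, t0):
    """finite instant at which a fraction alpha (e.g. 0.97) of the final
    value is reached, under the assumed parameterization."""
    return t0 - delta * np.log((alpha ** (-nu) - 1.0) / nu)

def fit_sigmoid(y, mu=1e-6):
    """calibrate (a, nu, delta, t0) to cumulative data y by minimising a
    time-weighted squared error (the classical itse form) with slsqp;
    mu is the small positive bound replacing the strict inequalities."""
    t = np.arange(y.size, dtype=float)

    def itse(theta):
        a, nu, delta, t0 = theta
        e = sigmoid(t, a, nu, delta, t0) - y
        return float(np.sum(t * e ** 2))

    theta0 = np.array([1.5 * y[-1], 1.0, y.size / 10.0, y.size / 2.0])
    bounds = [(0.0, None), (mu, None), (mu, None), (0.0, None)]
    res = minimize(itse, theta0, method="SLSQP", bounds=bounds,
                   options={"maxiter": 500})
    return res.x

# illustrative run on synthetic noisy cumulative data
rng = np.random.default_rng(3)
t = np.arange(60, dtype=float)
y = sigmoid(t, 5000.0, 2.0, 4.0, 25.0) + rng.normal(0.0, 30.0, t.size)
a_hat, nu_hat, delta_hat, t0_hat = fit_sigmoid(y)
print(f"a={a_hat:.0f}, nu={nu_hat:.2f}, delta={delta_hat:.2f}, t0={t0_hat:.2f}")
print(f"97% settling day: {settling_time(0.97, nu_hat, delta_hat, t0_hat):.1f}")
```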
sqp iteratively approximates the general nonlinear cost function in ( ) by a quadratic one, and the constraints by linear ones, which entails a quadratic programming (qp) problem. qps can be solved to global optimality in finite time, therefore each iteration of the sqp method takes finite time. the solution of the underlying qp approximation is then used to build a next iterate, for which another qp is solved, therefore the name sequential quadratic programming. sqp presents good convergence properties, converging quadratically to the optimal solution when the active set does not change [ ] . the implementation of sqp that is used in the present work is that of the function fmincon [ ], from the optimization toolbox tm of matlab ® . a second wave of spread has not been discarded. on the contrary, researchers argue that lifting the social distance measures might indeed lead to a retake in the infections [ ] [ ] [ ] [ ] . in order to describe the occurrence of multiple epidemiological waves, we propose to employ a sum of sigmoids. for this purpose, let n j be the adopted number of sigmoids. equation ( ) is then generalized to where similarly, the vector of parameters Θ, originally given by ( ), is generalized to a column vector with n j parameters defined as: where with these extended definitions, equation ( ) can still be used to estimate the value of Θ * by considering the inequalities applied to each a k , ν k , δ k and t ,k , i = , , … , n j . it is important to establish the number of sigmoids n j . for this purpose, an evaluation of the number of switches between acceleration and deceleration phases is performed. the rationale behind this assessment is: each sigmoid results in a single acceleration and a single deceleration phases, with a clear switching point between them, as discussed in section . . therefore, the number of sigmoids can be estimated by counting the amount of such switches between acceleration and deceleration phases. however, this counting requires careful consideration, as one is dealing with real noisy data. more so, recall that for identifying acceleration/deceleration the second derivative of the cumulative number of either infected or deceased individuals has to be considered. as it is well known, differentiation is prone to increase the effect of noise in the measurements [ ] . therefore, to mitigate the effect of noise in increasing artificially the amount of switches, a common approach is to consider a deadzone [ ] in the difference between the acceleration and deceleration. let s be the set of switching instants between acceleration and deceleration phases, then, for each t = , , … , t vw= − , the following logic is used to implement an identification of switches with a deadzone: where the parameter ϵ can be adjusted to provide a compromise between noise and detection sensitivity. in the present work, the value was set to ϵ = ⋅ • persons/day . thus, the number of switches is given by the cardinality of s n j = |s| ( ) recall, from table , that there are three parameters of interest. the final number of occurrences may be obtained as: on the other hand, when a sum of sigmoids is used, there are no analytical expressions to determine the other two parameters of interest, i.e., the date of maximum number of daily occurrences t and the settling dateτ . . in this case, a numerical search algorithm has to be used to find each of these parameters. 
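the following is a minimal sketch of the deadzone-based switch detection described above. the exact decision rule and the value of ε are not preserved in this text, so the sketch counts acceleration-to-deceleration transitions of the discrete second derivative outside a placeholder deadzone, which matches the idea of one switching point per completed wave.

```python
import numpy as np

def count_waves(cumulative, eps=1.0):
    """estimate the number of epidemiological waves from a cumulative series
    by counting acceleration-to-deceleration switches, ignoring second
    derivatives that fall inside the deadzone [-eps, eps]."""
    accel = np.diff(np.asarray(cumulative, dtype=float), n=2)
    state, switches = 0, []
    for i, a in enumerate(accel):
        if a > eps:
            new_state = 1            # accelerating
        elif a < -eps:
            new_state = -1           # decelerating
        else:
            continue                 # inside the deadzone: keep previous state
        if state == 1 and new_state == -1:
            switches.append(i)       # one switching point per completed wave
        state = new_state
    return max(len(switches), 1), switches

# illustrative usage: the sum of two logistic waves is recognised as two waves
t = np.arange(120, dtype=float)
logistic = lambda a, d, t0: a / (1.0 + np.exp(-(t - t0) / d))
y = logistic(1000.0, 4.0, 30.0) + logistic(1500.0, 5.0, 80.0)
print(count_waves(y, eps=0.5))       # -> (2, [...])
```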
the optimization problems to determine these parameters can be posed as follows: these two optimization problems are solved using the nelder-mead algorithm [ ] . it should be noted that each problem has only one independent variable (time). therefore, the search algorithm converges very quickly to the desired solution. the rationale to select the use of one or multiple sigmoids will be explained in a following section, which discusses the complexity of the model. two criteria are used to evaluate the degree of fidelity of the fitted curves to the data. the first is the socalled root mean square error (rmse), defined as: from ( ) the name of rmse becomes clear, as it involves the square root of the mean of the squared error. notice that, in ( ) , the values of the curve with the optimal parameters f t, Θ * are used to calculate the error between the data and the value returned by the fitted curve. moreover, the term t vw= + reflects the number of terms in the summation, as the index t starts at and ends at t vw= . the rmse is used in statistical analysis to measure compactly the degree of fidelity between the fitted curve and the data. the lower the value of the rmse, the better the fitted curve matches the data [ ] . in this paper, a normalized version of the rmse is used, obtained as: where a is the final number of occurrences, as defined in ( ) and ( ) for one and multiple sigmoids, respectively. this normalization is adopted to allow a fair comparison of the rmse of different curves. a second criterion to determine the quality of the representation of the data by the fitted curve generally applied in statistics is the squared correlation coefficient, which varies between and , with the latter meaning that there exists a perfect linear functional relationship between the data and the fitted curve points, whereas the first means the opposite. first, let us define the covariance of the data as where μ ‰ and μ ? are the mean values of y t and f t, Θ * , respectively, i.e. ( ) in which the symbol • can be replaced by either of y t and f t, Θ * , yielding μ ‰ and μ ? , respectively. similarly, the variances of y t and f t, Θ * are the squared correlation coefficient r : can then be determined from ( )-( ) as: additional criteria are defined to evaluate whether the data are enough to allow the convergence of the estimated values of the parameters Θ * . this is carried out by fitting the sigmoid curves to the data for each possible value of n ∈ { , , … , t vw= − , t vw= − , … , t vw= }. thus, instead of using all available data as in ( ) , windows of varying length are used; the minimum length of a window is adopted as to ensure a minimum amount of data to calibrate the curve. therefore, the sigmoid parameters Θ are estimated within different windows as where i = , , … , n j , depending on the number of sigmoids. the main parameters in table are then determined from Θ * n as follows: • for a single sigmoid, a * n and t * n are directly extracted from Θ * n in view of ( ), whereas τ α * n is calculated by ( ) employing t * n , δ * n and ν * n extracted from Θ * n considering ( ). • for multiple sigmoids, ( )-( ) are used to determine a * n , t * n and τ . * n . then, the relative variation of the estimated values of these parameters is calculated for each time window and multiplied over the time window, composing indices to evaluate if the data are enough to asseverate the suitability of the sigmoid that was fitted. 
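a minimal sketch of the two fidelity criteria defined above is given below: the rmse normalised by the estimated final number of occurrences, and the squared correlation coefficient built from the covariance and the variances of the data and of the fitted curve. the window-based convergence indices are formalised after the sketch.

```python
import numpy as np

def goodness_of_fit(y, f, a):
    """normalised rmse and squared correlation coefficient between the data y
    and the fitted curve f (evaluated at the same days); the rmse is divided
    by a, the estimated final number of occurrences."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    nrmse = np.sqrt(np.mean((y - f) ** 2)) / a
    cov = np.mean((y - y.mean()) * (f - f.mean()))
    r2 = cov ** 2 / (y.var() * f.var())
    return nrmse, r2

# sanity check: a perfect fit gives nrmse = 0 and r2 = 1
y = np.array([10.0, 40.0, 90.0, 160.0])
print(goodness_of_fit(y, y, a=160.0))   # -> (0.0, 1.0)
```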
these indices are defined as where the symbol • represents one of the parameters of interest, namely, a * n , t * n , or τ α * n , for a time window up to k days of data. it is clear that, if the data are enough and a suitable set of parameters is found, then each of the terms in the product in ( ) approaches one. therefore, the closer the value γ ‹ • is to one, the better the fit. moreover, the "min" in ( ) ensures that each term in the product is less than or equal to one, from which it follows that γ ‹ • ≤ . analyzing γ ‹ • for different values of k enables the conclusion of whether the convergence has occurred or not within variable window sizes. we adopt windows of size , and days, in order to verify the stability of the predictions over the last one, two and three weeks. from the previous discussion, it is possible to choose among different curve types (symmetric or asymmetric) and numbers (single or multiple sigmoids). this plays an important role both in the accuracy of the fit and in the complexity of the models (as per the different amounts of parameters to be estimated with each choice). it should be noted that the choice of a more complex model without a significant increase in the accuracy may lead to the problem of model overfitting, that is, an exaggeration while fitting of the training data that may compromise the generalization of the model predictions [ ] . in order to avoid this problem, criteria should be established to enable a compromise between accuracy and complexity. these criteria are described in this section. particularly for cases of regions where the contagion is in its early stage, there are not enough data to observe a deceleration phase. therefore, in this case the data are insufficient to support estimation of the asymmetric curves. in these situations, the symmetric curves can be used in the fitting and an automated decision of whether to present results with a symmetric or an asymmetric curve has to be done. the criterion for this decision considers a compromise between complexity and quality of the fitting results. the complexity is deemed higher for the asymmetric curve, as it requires estimation of one additional parameter, namely ν. as for the quality, it is evaluated through the following ratio: where r • : represents the squared correlation coefficient obtained from fitting an asymmetric curve and r j : the one yielded by a symmetric curve. recall that the value of the squared correlation coefficient ranges from zero to one and that this latter value implies a perfect relationship between the data and the curve. furthermore, the symetric sigmoid may be considered a particular case of the asymetric sigmoid, obtained by imposing ν = . since the asymetric sigmoid contains one additional parameter, and therefore one more degree of freedom for optimization, a better fit is expected, leading to a value of r • : higher than r j : . therefore, the value of η is expected to be lower than one. nevertheless, a value of η very close to one indicates a low increase in the squared correlation coefficient, which may not be enough to justify the increase in the model complexity. to prevent a division by zero, if r j : = (up to : precision), then the symmetric curve is selected, as it has perfect fit and the asymmetric curve cannot yield better results. this case has been considered for robustness of the computational code; nevertheless, since this a perfect situation, it is not expected to occur in practice. 
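the choice between symmetric and asymmetric curves is completed in the next paragraph; as an aside, the sketch below illustrates one way to compute the window-based convergence indices γ described earlier in this section, assuming each factor of the product is min(p_n / p_{n+1}, p_{n+1} / p_n) for successive estimates p_n of a parameter of interest. the exact expression is not preserved in the extracted text, so this is only an interpretation consistent with the stated properties (every factor at most one, and the index close to one when the estimates have stabilised).

```python
import numpy as np

def convergence_index(estimates, k):
    """stability index gamma for one parameter of interest, computed from the
    sequence of estimates obtained with growing data windows over the last
    k days (an interpretation of the index described in the text)."""
    p = np.asarray(estimates, dtype=float)[-(k + 1):]
    factors = np.minimum(p[:-1] / p[1:], p[1:] / p[:-1])   # each factor <= 1
    return float(np.prod(factors))

# estimates of the final number of demises over the last days of data
final_size_estimates = [9800, 9950, 10020, 10010, 10005, 10008]
print(f"gamma over the last 5 days: {convergence_index(final_size_estimates, 5):.3f}")
```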
in the remaining cases, to balance between complexity and a more accurate fit, the criterion for selection follows the rule: with this choice, a more substantial gain in the accuracy of the fit has to be obtained to justify a more complex curve. given the number n j of sigmoids, two fits are performed, i.e., using one and using criterion to decide upon the use of one or multiple sigmoids is similar to the one defined in the previous section for symetric and asymetric sigmoids we define a new ratio η : as where r ' : and r " : represent the squared correlation coefficient obtained from fitting to the calibration data, respectively, provided that (which is a perfect situation, not expected to occur in practice the criterion to choose between one and multiple sigmoids is then: the methodology proposed in the η represent the squared correlation coefficient obtained from fitting one and multiple sigmoids to the calibration data, respectively, provided that r ' : is different from one. as in the previous case, if is a perfect situation, not expected to occur in practice), then only one sigmoid is adopted. the criterion to choose between one and multiple sigmoids is then: the methodology proposed in the current paper is summarized by the flowchart presented in n only one sigmoid is adopted. are selected selected ( ) current paper is summarized by the flowchart presented in figure . flowchart summarizing the methodology proposed in the current paper. the computer program described in this paper was developed using matlab ® a, with the optimization toolbox tm and the matlab compiler tm . the program uses as data source reports published in spreadsheet format in the websites of the european centre for disease prevention and control (ecdc) [ ] , of johns hopkins university [ ] and of the brasil.io project [ ] . the ecdc reports contain country-wise data of the countries in the world, while the reports of johns hopkins university and of the brasil.io project presents data of united states counties and of brazilian states and cities, respectively. the data inform the number of newly infected and deceased people on each date. these numbers are informed separately for each region, allowing to perform an independent analysis for each of them. the computer program may be downloaded from the following link, where the data files updated until -aug- are also available. the folder contains a "readme" file, which explain the main features and the preliminary steps to use the program. we emphasize that the program may be installed and run directly from the operating system, independently of the user possessing a licensed matlab ® installation. should the user have matlab ® and the required packages installed, he/she may run directly a different file from the package without any installation. a graphical user interface (gui), illustrated in figure and figure , will appear. these figures present the main screen of the gui with data from the european centre for disease prevention and control and from johns hopkins university, respectively. a zoom was applied to these figures to allow for a better reading of their contents; that is the reason why the names of some u.s. counties appear truncated and why the predictions in the bottom of the figures appear incomplete. when running the gui for the first time, the user is advised to initially select the option "file: download new data file" of the main menu, as described in further detail below. in the next section. 
• " figures: export figure" figure, in order to facilitate its edition and copy to external software. download; however, we opt to also provide an option to the work performed by the people responsible for them. this is a standard option to close the interface. presented in figure , where the brasil.io project. the user may choose to download the data file directly or to access one of these sites, using the default web browser. it to access the websites as and to access the corresponding websites. analysis, as described in the this analysis are presented this option is used to export the graphs of the main screen to a new d information about the interface. it also contains an on the left of the main screen ( figure and figure ), there is a list of all available regions, which will be henceforth called the region list. in this list, when analyzing data from the ecdc or from johns hopkins university, the user can select the name of the desired country or of the desired u.s. county (in english); when analyzing data from the brasil.io project, the user can select the name of the brazilian states and cities (in portuguese). brazilian states are identified by their two-letter acronym. the names of the cities are presented without accents; for instance, the cities of "são paulo", "santa bárbara d'oeste" and "santa fé" are identified as "sao paulo", "santa barbara d'oeste" and "santa fe", respectively. above the region list, there is an edit field where the user can type the name of a region to look for on the list. the user can select the region name in full or in part, and may also employ regular expressions. furthermore, a vertical bar "|" can be used representing the "or" operator, to perform a search for more than one region; for instance: fran|germ|italy will restrict the countries in the list to france, germany and italy. an empty string is used in the search bar to restore the complete list of regions. note the search for "illinois" in below the region list, there are two options: "show predictions" and "show dates". "show predictions" is used to enable or disable the mathematical modeling (if disabled, only historical data will be shown). if "show dates" is disabled, then sequential numbers are shown in the graphs' axes, instead of dates. in the lower left corner of the gui, there is an edit field where the user can specify the number of days for testing. for instance, if the user specifies a value of days, then the model is calibrated with all data available until one week before the data acquisition, and the remaining days are used to test the model, allowing a comparison between the predictions of the model and the observed data. the main screen presented in figure and figure below the graphs, the following predictions are presented, for either cases or demises: final number, date of maximum daily occurrences and settling date (as defined in table ). furthermore, the equation of the best sigmoid (or set of sigmoids in case more than one wave is identified) matching the accumulated data is presented, as well as the indices rmse and r : , defined in the previous section. the predictions are presented in editable fields, such that the user can copy their texts and paste them in external software. an example of such information is presented in table . in order to illustrate the use of the tool to perform predictions, figure to figure as previously mentioned, the trend analysis is run when the user selects the corresponding option in the main menu. 
examples of figures resulting from such analysis are presented in figure to figure , which correspond to the brazilian city of são paulo (sp), the us county of cook, illinois, and the brazilian state of são paulo, respectively. each of these figures contain three subfigures, showing the predicted value of the three parameters of interest described in table . the abscissa of the graphs indicates the date of the estimation, meaning that all data available until that date were used to estimate the value of the parameter under study. it can be seen that, as expected, the values of the estimated parameters vary with the amount of data used to estimate them. on the title of each subfigure, the values of γ • , γand γ : are presented, indicating how stable each prediction is, considering the last one, two and three weeks, respectively. a value of γcloser to one indicates a more stable prediction. similarly, figure -(a) indicates that the dynamics of the pandemic in the u.s. county of cook, illinois, followed a similar pattern. there were oscillations until the end of april, followed by an increasing trend until may th, a slightly decreasing tendency and finally an stabilized predicted value of approximately demises since june th. on the other hand, figure indicates that the pandemic is not yet stabilized in the brazilian state of sp (são paulo). an increasing tendency can be seen over the last weeks. for instance, figure -(a) shows that the predicted number of demises was approximately on july th and changed to on august th, indicating an increase of approximately % in days. the stability of such predictions over the last one, two and three weeks may be verified by the values of γ • , γand γ : shown in the title of each figure. it can be seen that these values are close to one, indicating stabilized or near-to-stabilization predictions. the values of γ ‹ are intended to represent the convergence of the estimations. as an additional feature, they may also be used as a measurement of the quality of the prediction -higher values of γ indicate that the pandemic has been following the same predicted behavior over the last weeks. for instance, when analysing the values of γ : presented in figure -(a) and figure -(a), it can be seen that such values are γ : = . and γ : = . for the city and for the state of são paulo, respectively. one may infer from these numbers that the predictions obtained in the last three weeks are more stable in the city of são paulo than in the state with the same name. this is the same conclusion that was achieved by analysing the curves, as described in the previous paragraphs. it should be emphasized that the values in figure -(a), figure -(a) and figure -(a) do not refer to the number of demises on the day of the analysis, but rather to the predicted final number of demises, estimated on the basis of all data available until that date. this is an innovative approach for trend analysis in this context and, to the best of the authors' knowledge, has not been proposed before. additionally, the same analyses presented here for the number of demises can be run for the number of the stabilized results for the city of são paulo and for the county of cook, illinois, allow to conclude that that the actions of the local governments to control the pandemic are taking effect. it is of public interest to determine how the disease will spread in each city after the restriction measures are alleviated. for this purpose, the trend analysis should be run again. 
should a new increasing tendency be observed, the authorities would be advised to reinstate some containment measures. it is important to point out that these results, although helpful, should be validated by medical experts and not be considered alone when deciding public policies. the trend analysis may be run for a country, a state, a county or a city. it provides more useful information when it is run for smaller administrative regions such as a county or a city, because it allows supporting decision by local authorities based on specific data of the region under consideration. a limitation of the proposed approach is that it is not adequate to analyze the pandemic in very small cities or counties, because the number of infections and demises is usually very low, not allowing a good fitting by the mathematical model proposed here. however, for medium-and large-sized cities or counties, informative results are expected, as the ones presented here for the u.s. counties of cook, illinois and los angeles, california and for the brazilian cities of brasilia (df) and são paulo (sp). the results presented here are illustrative and correspond to the scenario on the date when the data were acquired, that is, on -aug- . these analyses should always employ updated data to increase their reliability. therefore, the authors recommend these studies to be repeated periodically, at least on a weekly basis. the developed computer program allows to easily perform this task. this paper proposed a methodology and a computational tool to forecast the covid- pandemic throughout the world, providing useful resources for health-care authorities. a user-friendly graphical user interface (gui) in matlab ® was developed and can be downloaded online for free use. an innovative approach for trend analysis was presented. resources in the computational tool allow to quickly run analyses for the desired regions. additional options allow to access the official website of the european centre of disease prevention and control, of johns hopkins university and of the brasil.io project, in order to download new data as soon as they are published online. to this date, these institutions have been updating their reports on a daily basis. the analyses run by the program are intended only as an aid and the results should be interpreted with care. they do not replace a careful analysis by experts. nevertheless, such results may be a very useful tool to assist the authorities in their decision-making process. the proposed program is in continuous development and future added features will be published and described in the project webpage. the authors would appreciate any feedback and suggestions to improve the computational tool. the program, in its current version, is able to process detailed information about u.s. counties and about brazilian states and cities. these two countries were chosen because they have continental dimensions and are currently the focus of the covid- pandemic. nevertheless, the same resource could be extended to other countries. for this purpose, the main requirement would be to write a code to read other country data files and convert them to the format recognized by the program, which is quite simple. future works can employ the same methodology and adapt the computer tool to describe the dynamics of other epidemics around the world. in the recent past, no pandemic was as severe as the covid- , but there were occurrences of other diseases such as influenza a and mers-cov. 
should a similar epidemic occur again, the computer program described here would be a resourceful tool. modelling the impact of testing, contact tracing and household quarantine on second waves of covid- financial performance of malaysian local authorities: a trend analysis correspondence among the correlation, rmse, and heidke forecast verification measures; refinement of the heidke score. weather forecasting the role of laboratory diagnostics in emerging viral infections: the example of the middle east respiratory syndrome epidemic convalescent plasma as a potential therapy for covid- the effect of travel restrictions on the spread of the novel coronavirus (covid- ) outbreak influenza forecasting in human populations: a scoping review an interactive web-based dashboard to track covid- in real time modern control systems influenza forecasting with google flu trends download today's data on the geographic distribution of covid- cases worldwide a novel human coronavirus: middle east respiratory syndrome human coronavirus sequential quadratic programming methods clinical characteristics of coronavirus disease in china could chloroquine /hydroxychloroquine be harmful in coronavirus disease (covid- ) treatment? forecasting of covid per regions using arima models and polynomial functions the effectiveness of quarantine of wuhan city against the corona virus disease (covid- ): a well-mixed seir model analysis clinical features of patients infected with novel coronavirus in wuhan, china backtracking search integrated with sequential quadratic programming for nonlinear active noise control systems optimization of multi-product economic production quantity model with partial backordering and physical constraints the characteristics of middle eastern respiratory syndrome coronavirus transmission dynamics in south korea. osong public health and research perspectives convergence properties of the nelder--mead simplex method in low dimensions first-wave covid- transmissibility and severity in china outside hubei after control measures, and second-wave scenario planning: a modelling impact assessment. 
the lancet a conceptual model for the coronavirus disease (covid- ) outbreak in wuhan, china with individual reaction and governmental action spread and impact of covid- in china: a systematic review and synthesis of predictions from transmission-dynamic models incubation period and other epidemiological characteristics of novel coronavirus infections with right truncation: a statistical analysis of publicly available case data trend analysis of annual and seasonal rainfall time series in the mediterranean area the end of social confinement and covid- re-emergence risk coronavirus pandemic -therapy and vaccines a review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid- time-trend analysis and developing a forecasting model for the prevalence of multiple sclerosis in kohgiluyeh and boyer-ahmad province, southwest of iran real-time forecasting of an epidemic using a discrete time stochastic model: a case study of pandemic influenza (h n - ) numerical optimization climatology and trend analysis of extreme precipitation in subregions of northeast brazil clinical manifestation, diagnosis, prevention and control of sars-cov- (covid- ) during the outbreak period a data-driven model to describe and forecast the dynamics of covid- transmission covid- immunity passports and vaccination certificates: scientific, equitable, and legal challenges prediction of new active cases of coronavirus disease (covid- ) pandemic using multiple linear regression model short-term forecasting covid- cumulative confirmed cases: perspectives for brazil a flexible growth function for empirical use cdc estimate of global h n pandemic deaths: , . center for infectious disease research and policy time series analysis and forecast of the covid- pandemic in india using genetic programming molecular characterization and comparative analysis of pandemic h n / strains with cocirculating seasonal h n / strains from eastern india testicular cancer mortality in brazil: trends and predictions until overfitting and optimism in prediction models. clinical prediction models stock market trend prediction using high-order information of time series world health organization (who) world health organization (who) world health organization (who) nowcasting and forecasting the potential domestic and international spread of the -ncov outbreak originating in wuhan, china: a modelling study beware of the second wave of covid- modified seir and ai prediction of the epidemics trend of covid- in china under public health interventions type diabetes epidemic in east asia: a -year systematic trend analysis breast cancer trend in iran from to and prediction till using a trend analysis method epidemiology and trend analysis on malignant mesothelioma in china the authors acknowledge the european centre for disease prevention and control (ecdc), johns hopkins university and the brasil.io project for making the covid- data publicly available and for allowing its use for research purposes. key: cord- -qbmm st authors: nguyen, thanh thi title: artificial intelligence in the battle against coronavirus (covid- ): a survey and future research directions date: - - journal: nan doi: nan sha: doc_id: cord_uid: qbmm st artificial intelligence (ai) has been applied widely in our daily lives in a variety of ways with numerous successful stories. ai has also contributed to dealing with the coronavirus disease (covid- ) pandemic, which has been happening around the globe. 
this paper presents a survey of ai methods being used in various applications in the fight against the covid- outbreak and outlines the crucial roles of ai research in this unprecedented battle. we touch on a number of areas where ai plays as an essential component, from medical image processing, data analytics, text mining and natural language processing, the internet of things, to computational biology and medicine. a summary of covid- related data sources that are available for research purposes is also presented. research directions on exploring the potentials of ai and enhancing its capabilities and power in the battle are thoroughly discussed. we highlight groups of problems related to the covid- pandemic and point out promising ai methods and tools that can be used to solve those problems. it is envisaged that this study will provide ai researchers and the wider community an overview of the current status of ai applications and motivate researchers in harnessing ai potentials in the fight against covid- . t he novel coronavirus disease (covid- ) has created tremendous chaos around the world, affecting people's lives and causing a large number of deaths. since the first cases were detected, the disease has spread to almost every country, causing deaths of over , people among nearly , , confirmed cases based on statistics of the world health organization in the middle of july [ ] . governments of many countries have proposed intervention policies to mitigate the impacts of the covid- pandemic. science and technology have contributed significantly to the implementations of these policies during this unprecedented and chaotic time. for example, robots are used in hospitals to deliver food and medicine to coronavirus patients or drones are used to disinfect streets and public spaces. many medical researchers are rushing to investigate drugs and medicines to treat infected patients whilst others are attempting to develop vaccines to prevent the virus. computer science researchers on the other hand have managed to early detect infectious patients using techniques that can process and understand medical imaging data such as x-ray images and computed tomography (ct) scans. these computational techniques are part of artificial intelligence (ai), which has been applied successfully in various fields. this paper focuses on the roles of ai technologies in the battle against the covid- pandemic. we provide a comprehensive survey of ai t. t. nguyen is with the school of information technology, deakin university, victoria, , australia. e-mail: thanh.nguyen@deakin.edu.au. applications that support humans to reduce and suppress the substantial impacts of the outbreak. recent advances in ai have contributed significantly to improving humans' lives and thus there is a strong belief that proper ai research plans will fully exploit the power of ai in helping humans to defeat this challenging battle. we discuss about these possible plans and highlight ai research areas that could bring great benefits and contributions to overcome the battle. in addition, we present a summary of covid- related data sources to facilitate future studies using ai methods to deal with the pandemic. an overview of common ai methods is presented in fig. where recent ai development is highlighted. machine learning, especially deep learning, has made great advances and substantial progress in long-standing fields such as computer vision, natural language processing (nlp), speech recognition, and video games. 
a significant advantage of deep learning over traditional machine learning techniques is its ability to deal with and make sense of different types of data, especially big and unstructured data, e.g. text, image, video and audio data. a number of industries, e.g. electronics, automotive, security, retail, agriculture, healthcare and medical research, have achieved better outcomes and benefits by using deep learning and ai methods. it is thus expected that ai technologies can contribute to the fight against the covid- pandemic, such as those surveyed in the next section. we separate surveyed papers into different groups that include: deep learning algorithms for medical image processing, data science methods for pandemic modelling, ai and the internet of things (iot), ai for text mining and nlp, and ai in computational biology and medicine.

fig. . an overview of common ai methods where machine learning constitutes a great proportion. the development of deep learning, a subset of machine learning, has contributed significantly to improving the power and capabilities of recent ai applications. a number of deep learning-based convolutional neural network (cnn) architectures, e.g. lenet [ ] , alexnet [ ] , googlenet [ ] , visual geometry group (vgg) net [ ] and resnet [ ] , have been proposed and applied successfully in different areas, especially in the computer vision domain. other techniques such as autoencoders and recurrent neural networks are crucial components of many prominent natural language processing tools. the deep learning methods in particular and ai in general may thus be employed to create useful applications to deal with various aspects of the covid- pandemic.

while radiologists and clinical doctors can learn to detect covid- cases based on chest ct examinations, their tasks are manual and time consuming, especially when required to examine a lot of cases. bai et al. [ ] convene three chinese and four united states radiologists to differentiate covid- from other viral pneumonia based on chest ct images obtained from a cohort of cases, in which cases are from the united states with non-covid- pneumonia whilst cases are from china positive with covid- . results obtained show that radiologists can achieve high specificity (the proportion of actual negatives that are correctly identified as such) in distinguishing covid- from other causes of viral pneumonia using chest ct imaging data. however, their performance in terms of sensitivity (the proportion of actual positives that are correctly identified as such) is just moderate for the same task. ai methods, especially deep learning, have been used to process and analyse medical imaging data to support radiologists and doctors to improve diagnosis performance. likewise, the current covid- pandemic has witnessed a number of studies focusing on automatic detection of covid- using deep learning systems. a three-dimensional deep learning method, namely covid- detection neural network (covnet), is introduced in [ ] to detect covid- based on volumetric chest ct images. three kinds of ct images, including covid- , community acquired pneumonia (cap) and other non-pneumonia cases, are mixed to test the robustness of the proposed model, which is illustrated in fig. . these images are collected from hospitals in china and the detection method is evaluated by the area under the receiver operating characteristic curve (auc).
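to make the evaluation measures just mentioned concrete, the following minimal sketch shows how sensitivity, specificity and auc are typically computed for a binary covid- versus other-pneumonia classifier. it assumes numpy and scikit-learn are installed; the labels and classifier scores are synthetic placeholders, not data from any of the cited studies.

```python
# illustrative sketch: computing sensitivity, specificity and AUC for a binary
# COVID-19 vs. other-viral-pneumonia classifier. labels/scores are synthetic.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)       # 1 = COVID-19, 0 = other viral pneumonia
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, size=200), 0, 1)  # model scores
y_pred = (y_score >= 0.5).astype(int)       # threshold the scores at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate: actual positives correctly identified
specificity = tn / (tn + fp)   # true negative rate: actual negatives correctly identified
auc = roc_auc_score(y_true, y_score)        # threshold-independent ranking quality

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} auc={auc:.2f}")
```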
covnet is a convolutional resnet- model [ ] that takes a series of ct slices as inputs and predicts the class labels of the ct images via its outputs. the auc value obtained is at . , which shows a great ability of the proposed model for detecting covid- cases.

fig. . illustration of the covnet architecture [ ] for covid- detection using ct images. max pooling operation is used to combine features extracted by resnet- cnns whose inputs are ct slices. the combined features are fed into a fully connected layer to compute probabilities for three classes, i.e. non-pneumonia, community acquired pneumonia (cap) and covid- . predicted class is the one that has highest probability among the three classes.

another deep learning method based on the concatenation between the location-attention mechanism and the three-dimensional cnn resnet- network [ ] is proposed in [ ] to detect coronavirus cases using pulmonary ct images. distinct manifestations of ct images of covid- found in previous studies [ ] , [ ] and their differences with those of other types of viral pneumonia such as influenza-a are exploited through the proposed deep learning system. a dataset comprising ct images of covid- cases, influenza-a viral pneumonia patients and healthy cases is used to validate the performance of the proposed method. the method's overall accuracy of approximately % is obtained on this dataset, which exhibits its ability to help clinical doctors to early screen covid- patients using chest ct images. in line with the studies described above, we have found a number of papers also applying deep learning for covid- diagnosis using radiology images. they are summarized in table i for comparisons. a modified stacked autoencoder deep learning model is used in [ ] to forecast in real-time the covid- confirmed cases across china. this modified autoencoder network includes four layers, i.e. input, first latent layer, second latent layer and output layer, with the numbers of nodes being , , and , respectively. a series of data points ( days) are used as inputs of the network. the latent variables obtained from the second latent layer of the autoencoder model are processed by the singular value decomposition method before being fed into clustering algorithms in order to group the cases into provinces or cities to investigate the transmission dynamics of the pandemic. the resultant errors of the model are low, which gives confidence that it can be applied to forecast accurately the transmission dynamics of the virus as a helpful tool for public health planning and policy-making. on the other hand, a prototype of an ai-based system, namely α-satellite, is proposed in [ ] to assess the infectious risk of a given geographical area at community levels. the system collects various types of large-scale and real-time data from heterogeneous sources, such as number of cases and deaths, demographic data, traffic density and social media data, e.g., reddit posts. the social media data available for a given area may be limited, so they are enriched by conditional generative adversarial networks (gans) [ ] to learn the public awareness of covid- . a heterogeneous graph autoencoder model is then devised to aggregate information from neighbourhood areas of the given area in order to estimate its risk indexes. this risk information enables residents to select appropriate actions to prevent them from the virus infection with minimum disruptions in their daily lives.
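the covnet-style design described in the figure caption above (a shared resnet- backbone per ct slice, max pooling over slices, and a fully connected layer over three classes) can be sketched as follows. this is a minimal illustration assuming pytorch and torchvision are available; tensor shapes, the number of slices and the class ordering are placeholders, and it is not the authors' released code.

```python
# minimal CovNet-style sketch: a shared ResNet-50 backbone extracts one feature
# vector per CT slice, max pooling combines the per-slice features, and a fully
# connected layer scores three classes (non-pneumonia, CAP, COVID-19).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CovNetStyle(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        backbone = resnet50()            # pretrained weights could be loaded here
        backbone.fc = nn.Identity()      # keep the 2048-d pooled slice features
        self.backbone = backbone
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, slices, 3, H, W), i.e. a volumetric scan as a stack of slices
        b, s, c, h, w = x.shape
        feats = self.backbone(x.reshape(b * s, c, h, w))    # (b*s, 2048)
        feats = feats.reshape(b, s, -1).max(dim=1).values   # max pool across slices
        return self.classifier(feats)                       # (b, 3) class logits

if __name__ == "__main__":
    model = CovNetStyle()
    scan = torch.randn(2, 16, 3, 224, 224)   # 2 dummy scans, 16 slices each
    logits = model(scan)
    print(logits.shape, logits.softmax(dim=1).argmax(dim=1))
```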
it is also useful for authorities to implement appropriate mitigation strategies to combat the fast evolving pandemic. chang et al. [ ] modify a discrete-time and stochastic agent-based model, namely acemod (australian censusbased epidemic model), previously used for influenza pandemic simulation [ ] , [ ] , for modelling the covid- pandemic across australia over time. each agent exemplifies an individual characterized by a number of attributes such as age, occupation, gender, susceptibility and immunity to diseases and contact rates. the acemod is calibrated to model specifics of the covid- pandemic based on key disease transmission parameters. several intervention strategies including social distancing, school closures, travel bans, and case isolation are then evaluated using this tuned model. results obtained from the experiments show that a combination of several strategies is needed to mitigate and suppress the covid- pandemic. the best approach suggested by the model is to combine international arrival restrictions, case isolation and social distancing in at least weeks with the compliance level of % or above. a framework for covid- detection using data obtained from smartphones' onboard sensors such as cameras, microphones, temperature and inertial sensors is proposed in [ ] . machine learning methods are employed for learning and acquiring knowledge about the disease symptoms based on the collected data. this offers a low-cost and quick approach to coronavirus detection compared to medical kits or ct scan methods. this is arguably plausible because data obtained from the smartphones' sensors have been utilized effectively in different individual applications and the proposed approach integrates these applications together in a unique framework. for instance, data obtained from the temperature-fingerprint sensor can be used for fever level prediction [ ] . images and videos taken by smartphones' camera or data collected by the onboard inertial sensors can be used for human fatigue detection [ ] , [ ] . likewise, story et al. [ ] use smartphone's videos for nausea prediction whilst lawanont et al. [ ] use camera images and inertial sensors' measurements for neck posture monitoring and human headache level prediction. alternatively, audio data obtained from smartphone's microphone are used for cough type detection in [ ] , [ ] . an approach to collecting individuals' basic travel history and their common manifestations using a phone-based online survey is proposed in [ ] . these data are valuable for machine learning algorithms to learn and predict the infection risk of each individual, thus help to early identify high-risk cases for quarantine purpose. this contributes to reducing the spread of the virus to the susceptible populations. in another work, allam and jones [ ] suggest the use of ai and data sharing standardization protocols for better global understanding and management of urban health during the covid- pandemic. for example, added benefits can be obtained if ai is integrated with thermal cameras, which might have been installed in many smart cities, for early detection of the outbreak. ai methods can also demonstrate their great effectiveness in supporting managers to make better decisions for virus containment when loads of urban health data are collected by data sharing across and between smart cities using the proposed protocols. 
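the agent-based modelling idea above (individual agents with contact behaviour, intervention strategies evaluated at different compliance levels) can be illustrated with a deliberately small toy simulation. the sketch below is not acemod or any calibrated model: it uses random mixing and invented parameters, and only shows how a social-distancing compliance level can be swept to compare epidemic outcomes.

```python
# toy agent-based sketch (random mixing, invented parameters); it illustrates
# sweeping a social-distancing compliance level, not a calibrated model.
import numpy as np

def simulate(n_agents=5000, days=120, contacts_per_day=8,
             p_transmit=0.03, recovery_days=10, compliance=0.0, seed=1):
    rng = np.random.default_rng(seed)
    state = np.zeros(n_agents, dtype=int)          # 0 susceptible, 1 infected, 2 recovered
    days_infected = np.zeros(n_agents, dtype=int)
    state[rng.choice(n_agents, 10, replace=False)] = 1   # seed infections
    daily_infected = []
    for _ in range(days):
        currently_infected = np.where(state == 1)[0]
        # compliant agents reduce their daily number of contacts
        eff_contacts = max(1, int(round(contacts_per_day * (1.0 - 0.8 * compliance))))
        for i in currently_infected:
            partners = rng.integers(0, n_agents, size=eff_contacts)
            for j in partners:
                if state[j] == 0 and rng.random() < p_transmit:
                    state[j] = 1
        days_infected[state == 1] += 1
        state[(state == 1) & (days_infected >= recovery_days)] = 2
        daily_infected.append(int((state == 1).sum()))
    return daily_infected, state

for compliance in (0.0, 0.5, 0.9):
    curve, final_state = simulate(compliance=compliance)
    attack_rate = float((final_state != 0).mean())
    print(f"compliance={compliance:.1f} peak_infected={max(curve)} attack_rate={attack_rate:.2f}")
```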
a hybrid ai model for covid- infection rate forecasting is proposed in [ ] , which combines the epidemic susceptible infected (si) model, nlp and deep learning tools. the si model and its extension, i.e. susceptible infected recovered (sir), are traditional epidemic models for modelling and predicting the development of infectious diseases where s represents the number of susceptible people, i denotes the number of infected people and r specifies the recovered cases. using differential equations to characterize the relationship between i, s and r, these models have been used to predict successfully sars and ebola infected cases, as reported in [ ] and [ ] respectively. nlp is employed to extract semantic features from related news such as epidemic control measures of governments or residents' disease prevention awareness. these features are then served as inputs to the long short-term memory (lstm) deep learning model [ ] to revise the infection rate predictions of the si model (detailed in fig. ). epidemic data of wuhan, beijing, shanghai and the whole china are used for experiments, which demonstrate the great accuracy of the proposed hybrid model. it can be applied to predict the covid- transmission law and development trend, and thus useful for establishing prevention and control measures for future pandemics. that study also shows the importance of public awareness of governmental epidemic prevention policies and the significant role of transparency and openness of epidemic reports and news in containing the development of infectious diseases. fig. . an ai-based approach to covid- prediction that combines traditional epidemic si model, nlp and machine learning tools as introduced in [ ] . a pre-trained nlp model is used to extract nlp features from text data, i.e. the pandemic news, reports, prevention and control measures. these features are integrated with infection rate features obtained from the si model via multilayer perceptron (mlp) networks before being fed into lstm model for covid- case modelling and prediction. in another work, lopez et al. [ ] recommend the use of network analysis techniques as well as nlp and text mining to analyse a multilanguage twitter dataset to understand changing policies and common responses to the covid- outbreak across time and countries. since the beginning of the pandemic, governments of many countries have tried to implement policies to mitigate the spread of the virus. responses of people to the pandemic and to the governmental policies may be collected from social media platforms such as twitter. much of information and misinformation is posted through these platforms. when stricter policies such as social distancing and country lockdowns are applied, people's lives are changed considerably and part of that can be observed and captured via people's reflections on social media platforms as well. analysis results of these data can be helpful for governmental decision makers to mitigate the impacts of the current pandemic and prepare better policies for possible future pandemics. likewise, three machine learning methods including support vector machine (svm), naive bayes and random forest are used in [ ] to classify , covid- related posts collected from sina weibo, which is the chinese equivalent of twitter, into seven types of situational information. 
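the si/sir relationship described above is defined by a small set of differential equations, ds/dt = -beta*s*i/n, di/dt = beta*s*i/n - gamma*i and dr/dt = gamma*i, which can be integrated numerically. the sketch below uses a simple euler scheme with illustrative parameter values only; the hybrid model in the text additionally revises the infection rate over time with nlp and lstm features, which is not reproduced here.

```python
# minimal SIR sketch integrated with a simple Euler scheme; beta and gamma are
# illustrative placeholders, not fitted epidemic parameters.
import numpy as np

def sir(n=11_000_000, i0=100, beta=0.35, gamma=0.1, days=180, dt=0.1):
    s, i, r = n - i0, float(i0), 0.0
    steps = int(days / dt)
    trajectory = np.zeros((steps, 3))
    for t in range(steps):
        new_infections = beta * s * i / n
        recoveries = gamma * i
        s -= dt * new_infections
        i += dt * (new_infections - recoveries)
        r += dt * recoveries
        trajectory[t] = (s, i, r)
    return trajectory

dt = 0.1
traj = sir(dt=dt)
peak_day = traj[:, 1].argmax() * dt
print(f"R0 ~ {0.35 / 0.1:.1f}, peak infections ~ {traj[:, 1].max():,.0f} around day {peak_day:.0f}")
```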
identifying situational information is important for authorities because it helps them to predict its propagation scale, sense the mood of the public and understand the situation during the crisis. this contributes to creating proper response strategies throughout the covid- pandemic. being able to predict structures of a protein will help understand its functions. google deepmind is using the latest version of their protein structure prediction system, namely alphafold [ ] , to predict structures of several proteins associated with covid- based on their corresponding amino acid sequences. they have released the predicted structures in [ ] , but these structures still need to be experimentally verified. nevertheless, it is expected that these predictions will help understand how the coronavirus functions and potentially lead to future development of therapeutics against covid- . an ai-based generative chemistry approach to design novel molecules that can inhibit covid- is proposed in [ ] . several generative machine learning models, e.g. generative autoencoders, gans, genetic algorithms and language models, are used to exploit molecular representations to generate structures, which are then optimized using reinforcement learning methods. this is an ongoing work as the authors are synthesising and testing the obtained molecules. however, it is a promising approach because these ai methods can exploit the large drug-like chemical space and automatically extract useful information from high-dimensional data. it is thus able to construct molecules without manually designing features and learning the relationships between molecular structures and their pharmacological properties. the proposed approach is cost-effective and time-efficient and has a great potential to generate novel drug compounds in the covid- fight. on the other hand, randhawa et al. [ ] aim to predict the taxonomy of covid- based on an alignment-free machine learning method [ ] using genomic signatures and a decision tree approach. the alignment-free method is a computationally inexpensive approach that can give rapid taxonomic classification of novel pathogens by processing only raw dna sequence data. by analysing over unique viral sequences, the authors are able to confirm the taxonomy of covid- as belonging to the subgenus sarbecovirus of the genus betacoronavirus, as previously found in [ ] . the proposed method also provides quantitative evidence that supports a hypothesis about a bat origin for covid- as indicated in [ ] , [ ] . recently, nguyen et al. [ ] propose the use of ai-based clustering methods and more than genome sequences to search for the origin of the covid- virus. numerous clustering experiments are performed on datasets that combine sequences of the covid- virus and those of reference viruses of various types. results obtained show that covid- virus genomes consistently form a cluster with those of bat and pangolin coronaviruses. that provides quantitative evidences to support the hypotheses that bats and pangolins may have served as the hosts for the covid- virus. their findings also suggest that bats are the more probable origin of the virus than pangolins. ai methods thus have demonstrated their capabilities and power for mining big biological datasets in an efficient and intelligent manner, which contributes to the progress of finding vaccines, therapeutics or medicines for covid- . 
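the alignment-free classification idea described above (a genomic signature computed directly from raw sequence data, fed to a decision tree) can be sketched with k-mer frequency vectors. the example below uses synthetic sequences as stand-ins for real viral genomes and scikit-learn's decision tree; it is not the mldsp tool used in the cited study, only an illustration of the general approach.

```python
# toy alignment-free sketch: k-mer frequency "signatures" of raw sequences
# plus a decision tree classifier. sequences are synthetic stand-ins.
from collections import Counter
from itertools import product
import random

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def kmer_signature(seq: str) -> list:
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = max(sum(counts.values()), 1)
    return [counts[k] / total for k in KMERS]      # normalised k-mer frequencies

def synthetic_genome(gc_bias: float, length: int = 2000) -> str:
    weights = [(1 - gc_bias) / 2, gc_bias / 2, gc_bias / 2, (1 - gc_bias) / 2]
    return "".join(random.choices("ACGT", weights=weights, k=length))

random.seed(0)
# two pretend virus groups that differ slightly in GC content
seqs = [synthetic_genome(0.38) for _ in range(50)] + [synthetic_genome(0.52) for _ in range(50)]
labels = [0] * 50 + [1] * 50
X = [kmer_signature(s) for s in seqs]

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```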
this section summarises available data sources relevant to covid- , ranging from numerical data of infection cases, radiology images [ ] , twitter, text, natural language to biological sequence data (table ii) , and highlights potential ai methods for modelling different types of data. the data are helpful for research purposes to exploit the capabilities and power of ai technologies in the battle against covid- . different data types have different characteristics and thus require different ai methods to handle. for example, numerical time series data of infection cases can be dealt with by traditional machine learning methods such as naive bayes, logistic regression, k-nearest neighbors (knn), svm, mlp, fuzzy logic system [ ] , nonparametric gaussian process [ ] , decision tree, random forest, and ensemble learning algorithms [ ] . deep learning recurrent neural networks such as lstm [ ] can be used for regression prediction problems if a large amount of training data are available. the deeper the models, the more data are needed to enable the models to learn effectively from data. based on their ability to characterize temporal dynamic behaviours, recurrent networks are well suited for modelling infection case time series data. radiology images such as chest x-ray and ct scans are high-dimensional data that require processing capabilities of deep learning methods in which cnn-based models are common and most suitable (e.g. lenet [ ] , alexnet [ ] , googlenet [ ] , vgg net [ ] and resnet [ ] ). cnns were inspired by biological processes of the visual cortex of human and animal brains where each cortical neuron is activated within its receptive field when stimulated. a receptive field of a neuron covers a specific subarea of the visual field and thus the entire visual field can be captured by a partial overlap of receptive fields. a cnn consists of multiple layers where each neuron of a subsequent (higher) layer connects to a subset of neurons in the previous (lower) layer. this allows the receptive field of a neuron of a higher layer to cover a larger portion of images compared to that of a lower layer. the higher layer is able to learn more abstract features of images than the lower layer by taking into account the spatial relationships between different receptive fields. this use of receptive fields enables cnns to recognize visual patterns and capture features from images without prior knowledge or hand-crafted features as in traditional machine learning approaches. this principle is applied to different cnn architectures although they may differ in the number of layers, number of neurons in each layer, the use of activation and loss functions as well as regularization and learning algorithms [ ] . transfer learning methods can be used to customize cnn models, which have been pretrained on large medical image datasets, for the covid- diagnosis problem. this would avoid training a cnn from scratch and thus reduce training time and the need for covid- radiology images, which may not be sufficiently available in the early stage of the pandemic.

(table ii, excerpt; data type and links: johns hopkins university [ ] , web-based mapping of global cases, https://systems.jhu.edu/research/public-health/ncov/ ; c. r. wells's github [ ] , daily incidence data and airport connectivity from china [ ] .)

alternatively, unstructured natural language data need text mining tools, e.g. the natural language toolkit (nltk) [ ] , and advanced nlp and natural language generation (nlg) tools such as elmo, ulmfit, transformer, bert, transformer-xl, xlnet, ernie, the text-to-text transfer transformer (t ) [ ] , the binary-partitioning transformer (bpt) [ ] and openai's generative pretrained transformer (gpt- ) [ ] . the core components of these tools are deep learning and transfer learning methods. for example, elmo and ulm-fit are built using lstm-based language models while transformer utilizes an encoder-decoder structure. likewise, bert and ernie use multi-layer transformer as a basic encoder while xlnet is a generalized autoregressive pretraining method inherited from transformer-xl. transformer also serves as a basic model for t , bpt and gpt- . these are excellent tools for many nlp and nlg tasks to handle text and natural language data related to covid- . analysing biological sequence data such as viral genomic and proteomic sequences requires either traditional machine learning or advanced deep learning or a combination of both, depending on the problems being addressed and the data pipelines used. as an example, traditional clustering methods, e.g. hierarchical clustering and density-based spatial clustering of applications with noise (dbscan) [ ] , can be employed to find the virus origin using genomic sequences [ ] . alternatively, a fuzzy logic system can be used to predict protein secondary structures based on quantitative properties of amino acids, which are used to encode the twenty common amino acids [ ] . a combination of principal component analysis and lasso (least absolute shrinkage and selection operator) can be used as a supervised approach for analysing single-nucleotide polymorphism genetic variation data [ ] . advances in deep learning may be utilized for protein structure prediction using protein amino acid sequences as in [ ] , [ ] . an overview of the use of various types of machine learning and deep learning methods for analysing genetic and genomic data can be found in [ ] , [ ] . typical applications may include, for example, recognizing the locations of transcription start sites, identifying splice sites, promoters, enhancers, or positioned nucleosomes in a genome sequence, analysing gene expression data for finding disease biomarkers, assigning functional annotations to genes, predicting the expression of a gene [ ] , identifying splicing junctions at the dna level, predicting the sequence specificities of dna- and rna-binding proteins, modelling structural features of rna-binding protein targets, predicting dna-protein binding, or annotating the pathogenicity of genetic variants [ ] . these applications can be utilized for analysing genomic and genetic data of severe acute respiratory syndrome coronavirus (sars-cov- ), the highly pathogenic virus that has caused the global covid- pandemic.
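the transfer-learning route mentioned above (customizing a pretrained cnn instead of training from scratch when covid- images are scarce) can be sketched as follows. this minimal example assumes pytorch and torchvision are installed and downloads imagenet weights; the number of classes and the random tensors standing in for x-ray batches are placeholders, not a real benchmark setup.

```python
# minimal transfer-learning sketch: freeze a pretrained ResNet-18 backbone and
# retrain only a small classification head for chest X-ray classes.
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(num_classes: int = 3) -> nn.Module:
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # ImageNet weights
    for param in model.parameters():
        param.requires_grad = False            # freeze the pretrained feature extractor
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head
    return model

model = build_finetune_model()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# one dummy training step on random tensors standing in for an X-ray batch
images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, 3, (8,))
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
print("dummy training step done, loss =", float(loss))
```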
although various studies have been published, we observe that there are still relatively limited applications and contributions of ai in this battle. this is partly due to the limited availability of data about covid- whilst ai methods normally require large amounts of data for computational models to learn and acquire knowledge. however, we expect that the number of ai studies related to covid- will increase significantly in the months to come when more covid- data such as medical images and biological sequences are available. current available datasets as summarized in table ii are stored in various formats and standards that would hinder the development of covid- related ai research. a future work on creating, hosting and benchmarking covid- related datasets is essential because it will help to accelerate discoveries useful for tackling the disease. repositories for this goal should be created following standardized protocols and allow researchers and scientists across the world to contribute to and utilize them freely for research purposes. among the published works, the use of deep learning techniques for covid- diagnosis based on radiology imaging data appears to be dominant. as summarized in table , numerous studies have used various deep learning methods, applying on different datasets and utilizing a number of evaluation criteria. this creates an immediate concern about difficulties when utilizing these approaches to the realworld clinical practice. accordingly, there is a demand for a future work on developing a benchmark framework to validate and compare the existing methods. this framework should facilitate the same computing hardware infrastructure, (universal) datasets covering same patient cohorts, same data pre-processing procedures and evaluation criteria across ai methods being evaluated. furthermore, as li et al. [ ] pointed out, although their model obtained great accuracy in distinguishing covid- with other types of viral pneumonia using radiology images, it still lacks of transparency and interpretability. for example, they do not know which imaging features have unique effects on the output computation. the benefit that black box deep learning methods can provide to clinical doctors is therefore questionable. a future study on explainable ai to explain the deep learning models' performance as well as features of images that contribute to the distinction between covid- and other types of pneumonia is necessary. this would help radiologists and clinical doctors to gain insights about the virus and examine future coronavirus ct and x-ray images more effectively. in the field of computational biology and medicine, ai has been used to partially understand covid- or discover novel drug compounds against the virus [ ] , [ ] . these are just initial results and thus there is a great demand for ai research in this field, e.g., to investigate genetics and chemistry of the virus and suggest ways to quickly produce vaccines and treatment drugs. with a strong computational power that is able to deal with large amounts of data, ai can help scientists to gain knowledge about the coronavirus quickly. for example, by exploring and analyzing protein structures of virus, medical researchers would be able to find components necessary for a vaccine or drug more effectively. this process would be very time consuming and expensive with conventional methods [ ] . 
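as a concrete illustration of the explainability gap raised above, one of the simplest attribution techniques is an input-gradient saliency map, which highlights the pixels that most influence a cnn's predicted class. the sketch below uses an untrained torchvision resnet and a random tensor as stand-ins for a trained covid- classifier and a ct slice; it is not the method of any particular cited study, only an example of the kind of tool an explainable-ai follow-up could apply.

```python
# minimal input-gradient saliency sketch: which pixels most affect the score of
# the predicted class? model and "CT image" below are placeholders.
import torch
from torchvision.models import resnet18

model = resnet18()          # stand-in; a trained COVID-19 classifier would go here
model.eval()

image = torch.randn(1, 3, 224, 224, requires_grad=True)   # stand-in chest CT slice
logits = model(image)
predicted_class = int(logits.argmax(dim=1))

score = logits[0, predicted_class]          # score of the predicted class
score.backward()                            # backpropagate it to the input pixels
saliency = image.grad.abs().max(dim=1).values.squeeze()   # (224, 224) importance map

top = torch.topk(saliency.flatten(), k=5).indices
print("saliency map shape:", tuple(saliency.shape))
print("five most influential pixel locations:", [(int(i // 224), int(i % 224)) for i in top])
```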
recent astonishing success of deep learning in identifying powerful new kinds of antibiotic from a pool of more than million molecules as published in [ ] gives a strong hope to this line of research in the battle against covid- . compared to the spanish flu pandemic [ ] , we are now fortunately living in the age of exponential technology. when everybody, organization and government try their best in the battle against the pandemic, the power of ai should be fully exploited and employed to support humans to combat this battle. ai can be utilized for the preparedness and response activities against the unprecedented national and global crisis. for example, ai can be used to create more effective robots and autonomous machines for disinfection, working in hospitals, delivering food and medicine to patients. aibased nlp tools can be used to create systems that help understand the public responses to intervention strategies, e.g. lockdown and physical distancing, to detect problems such as mental health and social anxiety, and to aid governments in making better public policies. nlp technologies can also be employed to develop chatbot systems that are able to remotely communicate and provide consultations to people and patients about the coronavirus. ai can be used to eradicate fake news on social media platforms to ensure clear, responsible and reliable information about the pandemic such as scientific evidences relevant to the virus, governmental social distancing policies or other pandemic prevention and control measures. in table iii , we point out groups of problems related to covid- along with types of data needed and potential ai methods that can be used to solve those problems. we do not aim to cover all possible ai applications but emphasize on realistic applications that can be achieved along with their technical challenges. those challenges need to be addressed effectively for ai methods to bring satisfactory results. it is great to observe the increasing number of ai applications against the covid- pandemic. ai methods however are not silver bullets but they have limitations and challenges such as inadequate training and validation data or when data are abundantly available, they are normally in poor quality. huge efforts are needed for an ai system to be effective and useful. they may include appropriate data processing pipelines, model selection, efficient algorithm development, remodelling and retraining, continuous performance monitoring and validation to facilitate continuous deployment and so on. there are ai ethics principles and guidelines [ ] , [ ] that each phase of the ai system lifecycle, i.e. design, -challenging to collect physiological characteristics and therapeutic outcomes of patients. -low-quality data would make biased and inaccurate predictions. -uncertainty of ai models outcomes. -privacy and confidentiality issues. [ ]- [ ] machine learning techniques, e.g. naive bayes, logistic regression, knn, svm, mlp, fuzzy logic system, elasticnet regression [ ] , decision tree, random forest, nonparametric gaussian process [ ] , deep learning techniques such as lstm [ ] and other recurrent networks, and optimization methods. predict number of infected cases, infection rate and spreading trend. time series case data, population density, demographic data. -insufficient time series data, leading to unreliable results. -complex models may not be more reliable than simple models [ ] . [ ] , [ ] , [ ] covid- early diagnosis using medical images. radiology images, e.g. 
chest x-ray and ct scans. -imbalanced datasets due to insufficient covid- medical image data. -long training time and unable to explain the results. -generalisation problem and vulnerable to false negatives. [ ]- [ ] and works in table i . deep learning cnn-based models (e.g. alexnet [ ] , googlenet [ ] , vgg network [ ] , resnet [ ] , densenet [ ] , resnext [ ] , and zfnet [ ] ), aibased computer vision camera systems, and facial recognition systems. scan crowds for people with high temperature, and monitor people for social distancing and mask-wearing or during lockdown. infrared camera images, thermal scans. -cannot measure inner-body temperature and a proportion of patients are asymptomatic, leading to imprecise results. -privacy invasion issues. [ ]- [ ] analyse viral genomes, create evolutionary (phylogenetic) tree, find virus origin, track physiological and genetic changes, predict protein secondary and tertiary structures. viral genome and protein sequence data -computational expenses are huge for aligning a large dataset of genomic or proteomic sequences. -deep learning models take long training time, especially for large datasets, and are normally unexplainable. [ ], [ ] , deepminds alphafold [ ] , [ ] -sequence alignment, e.g. dynamic programming, heuristic and probabilistic methods. -clustering algorithms, e.g. hierarchical clustering, k-means, dbscan [ ] and supervised deep learning. discover vaccine and drug biochemical compounds and candidates, and optimize clinical trials. viral genome and protein sequences, transcriptome data, drug-target interactions, protein-protein interactions, crystal structure of protein, cocrystalized ligands, homology model of proteins, and clinical data. -dealing with big genomic and proteomic data. -results need to be verified with experimental studies. -it can take long time for a promising candidate to become a viable vaccine or treatment method. [ ], [ ] - [ ] heuristic algorithm, graph theory, combinatorics, and machine learning such as adversarial autoencoders [ ] , multitask cnn [ ] , gan [ ] , [ ] , deep reinforcement learning [ ] , [ ] , [ ] . making drones and robots for disinfection, cleaning, obtaining patients vital signs, distance treatment, and deliver medication. simulation environments and demonstration data for training autonomous agents. -safety must be guaranteed at the highest level. -trust in autonomous systems. -huge efforts from training agents to implementing them to real machines. [ ]- [ ] deep learning, computer vision, optimization and control, transfer learning, deep reinforcement learning [ ] , learning from demonstrations. track and predict economic recovery via, e.g. detection of solar panel installations, counting cars in parking lots. satellite images, gps data (e.g. daily anonymized data from mobile phone users to count the number of commuters in cities). -difficult to obtain satellite data in some regions. -noise in satellite images. -anonymized mobile phone data security. [ ], [ ] deep learning, e.g. autoencoder models for feature extraction and dimensionality reduction, and cnn-based models for object detection. types of data challenges related ai methods real-time spread tracking, surveillance, early warning and alerts for particular geographical locations, like the global zika virus spread model bluedot [ ] . 
anonymized location data from cellphones, flight itinerary data, ecological data, animal and plant disease networks, temperature profiles, foreign-language news reports, public announcements, and population distribution data, e.g. landscan datasets [ ] . -insufficient data in some regions of the world, leading to skewed results. -inaccurate predictions may lead to mass hysteria in public health. -privacy issues to ensure cellphone data remain anonymous. bluedot [ ] , metabiota epidemic tracker [ ] , healthmap [ ] deep learning (e.g. autoencoders and recurrent networks), transfer learning, and nlg and nlp tools (e.g. nltk [ ] , elmo [ ] , ulmfit [ ] , transformer [ ] , googles bert [ ] , transformer-xl [ ] , xlnet [ ] , ernie [ ] , t [ ] , bpt [ ] and openais gpt- [ ] ) for various natural language related tasks such as terminology and information extraction, automatic summarization, relationship extraction, text classification, text and semantic annotation, sentiment analysis, named entity recognition, topic segmentation and modelling, machine translation, speech recognition and synthesis, automated question and answering. understand communities' responses to intervention strategies, e.g. physical distancing or lockdown, to aid public policy makers and detect problems such as mental health. news outlets, forums, healthcare reports, travel data, and social media posts in multiple languages across the world. -social media data and news reports may be low-quality, multidimensional, and highly unstructured. -issues related to language translation. -data cannot be collected from populations with limited internet access. [ ]- [ ] mining text to obtain knowledge about covid- transmission modes, incubation, risk factors, nonpharmaceutical interventions, medical care, virus genetics, origin, and evolution. text data on covid- virus such as scholarly articles in cord- dataset [ ] . -dealing with inaccurate and ambiguous information in the text data. -large volume of data from heterogeneous sources. -excessive amount of data make difficult to extract important pieces of information. [ ]- [ ] mining text to discover candidates for vaccines, antiviral drugs, therapeutics, and drug repurposing through searching for elements similar to covid- virus. text data about treatment effectiveness, therapeutics and vaccines on scholarly articles, e.g. cord- dataset [ ] and libraries of drug compounds. -need to involve medical experts knowledge. -typographical errors in text data need to be rectified carefully. [ ], [ ] - [ ] making chatbots to consult patients and communities, and combat misinformation (fake news) about covid- . medical expert guidelines and information. -unable to deal with unsaved query. -require a large amount of data and information from medical experts. -users are uncomfortable with chatbots being machines. -irregularities in language expression such as accents and mistakes. [ ]- [ ] development, implementation and ongoing maintenance, may need to adhere to, especially when most ai applications against covid- involve or affect human beings. the more ai applications are proposed, the more these applications need to ensure fairness, safety, explainability, accountability, privacy protection and data security, be aligned with human values, and have positive impacts on societal and environmental wellbeing. 
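among the nlp tasks listed above, text classification of pandemic-related posts (for example, sorting them into categories of situational information, as in the weibo study discussed earlier) is one of the most directly actionable. the tiny sketch below uses tf-idf features with a linear svm from scikit-learn; the example posts and category names are invented for illustration, whereas real studies rely on large annotated corpora.

```python
# tiny text-classification sketch: TF-IDF features + linear SVM to sort
# pandemic-related posts into invented situational-information categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

posts = [
    "hospital beds are full, urgent need for ventilators",
    "donating masks and gloves to the local clinic tomorrow",
    "new lockdown rules start on monday for the whole city",
    "feeling anxious and isolated after weeks indoors",
    "volunteers needed to deliver food to elderly neighbours",
    "schools will remain closed until further notice",
]
labels = ["help_seeking", "donation", "policy", "emotion", "donation", "policy"]

classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
classifier.fit(posts, labels)

new_posts = ["city announces curfew from 8pm", "please send oxygen cylinders to ward 3"]
print(list(zip(new_posts, classifier.predict(new_posts))))
```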
who coronavirus disease (covid- ) dashboard gradientbased learning applied to document recognition imagenet classification with deep convolutional neural networks going deeper with convolutions very deep convolutional networks for large-scale image recognition deep residual learning for image recognition performance of radiologists in differentiating covid- from viral pneumonia on chest ct. radiology artificial intelligence distinguishes covid- from community acquired pneumonia on chest ct. radiology deep learning system to screen coronavirus disease pneumonia chest ct findings in novel coronavirus ( -ncov) infections from wuhan, china: key points for the radiologist. radiology ct imaging features of novel coronavirus ( -ncov). radiology estimating uncertainty and interpretability in deep learning for coronavirus (covid- ) detection a deep learning algorithm using ct images to screen for corona virus disease (covid- ). medrxiv predicting covid- malignant progression with ai techniques. medrxiv development and evaluation of an ai system for covid- . medrxiv ai-assisted ct imaging analysis for covid- screening: building and deploying a medical ai system in four weeks. medrxiv unet++: a nested u-net architecture for medical image segmentation automatic detection of coronavirus disease (covid- ) using x-ray images and deep convolutional neural networks covid-net: a tailored deep convolutional neural network design for detection of covid- cases from chest radiography images rapid ai development cycle for the coronavirus (covid- ) pandemic: initial results for automated detection and patient monitoring using deep learning ct image analysis can ai help in screening viral and covid- pneumonia diagnosing covid- pneumonia from x-ray and ct images using deep learning and transfer learning algorithms densely connected convolutional networks aggregated residual transformations for deep neural networks squeezenet: alexnet-level accuracy with x fewer parameters and < . mb model size artificial intelligence forecasting of covid- in china α-satellite: an ai-driven system and benchmark datasets for hierarchical community-level risk assessment to help combat covid- conditional generative adversarial nets modelling transmission and control of the covid- pandemic in australia urbanization affects peak timing, prevalence, and bimodality of influenza pandemics in australia: results of a census-calibrated model investigating spatiotemporal dynamics and synchrony of influenza epidemics in australia: an agent-based modelling approach. 
simulation modelling practice and theory a novel ai-enabled framework to diagnose coronavirus covid- using smartphone embedded sensors: design study use of a smartphone thermometer to monitor thermal conductivity changes in diabetic foot ulcers: a pilot study smartphone-based human fatigue detection in an industrial environment using gait analysis fatigue detection during sit-to-stand test based on surface electromyography and acceleration: a case study smartphone-enabled videoobserved versus directly observed treatment for tuberculosis: a multicentre, analyst-blinded, randomised, controlled superiority trial neck posture monitoring system based on image detection and smartphone sensors using the prolonged usage classification concept a comprehensive approach for cough type detection nocturnal cough and snore detection in noisy environments using smartphone-microphones identification of covid- can be quicker through artificial intelligence framework using a mobile phone-based survey in the populations when cities/towns are under quarantine on the coronavirus (covid- ) outbreak and the smart city network: universal data sharing standards coupled with artificial intelligence (ai) to benefit urban health monitoring and management predicting covid- using hybrid ai model a double epidemic model for the sars propagation a simple mathematical model for ebola in africa long short-term memory understanding the perception of covid- policies by mining a characterizing the propagation of situational information in social media during covid- epidemic: a case study on weibo improved protein structure prediction using potentials from deep learning computational predictions of protein structures associated with covid- , deepmind website potential covid- c-like protease inhibitors designed using generative deep learning approaches machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: covid- case study mldsp-gui: an alignment-free standalone tool with an interactive graphical user interface for dna sequence comparison and analysis genomic characterisation and epidemiology of novel coronavirus: implications for virus origins and receptor binding a pneumonia outbreak associated with a new coronavirus of probable bat origin origin of novel coronavirus (covid- ): a computational biology study using artificial intelligence. 
biorxiv covid- : a survey on public medical imaging data resources epidemiological dynamics modeling by fusion of soft computing techniques gaussian processes for machine learning neural network ensemble operators for time series forecasting a survey of the recent architectures of deep convolutional neural networks natural language processing with python deep contextualized word representations universal language model fine-tuning for text classification attention is all you need bert: pre-training of deep bidirectional transformers for language understanding transformer-xl: attentive language models beyond a fixed-length context xlnet: generalized autoregressive pretraining for language understanding ernie: enhanced language representation with informative entities exploring the limits of transfer learning with a unified text-to-text transformer bptransformer: modelling long-range context via binary partitioning language models are unsupervised multitask learners a density-based algorithm for discovering clusters in large spatial databases with noise multi-output interval type- fuzzy logic system for protein secondary structure prediction a hybrid supervised approach to human population identification using genomics data genomic mutations and changes in protein secondary structure and solvent accessibility of sars-cov- (covid- virus), biorxiv machine learning applications in genetics and genomics applications of deep learning and reinforcement learning to biological data an interactive web-based dashboard to track covid- in real time impact of international travel and border control measures on the global spread of the novel coronavirus outbreak covid- image data collection covid-ct-dataset: a ct scan dataset about covid- pocovid-net: automatic detection of covid- from a new lung ultrasound imaging dataset (pocus) a twitter dataset of + million tweets related to covid- (version . ) [data set cord- ). ( ) version - - ai can help scientists find a covid- vaccine a deep learning approach to antibiotic discovery reassessing the global mortality burden of the influenza pandemic ethics guidelines for trustworthy ai a novel high specificity covid- screening method based on simple blood exams and artificial intelligence. medrxiv predicting mortality risk in patients with covid- using artificial intelligence to help medical decision-making. medrxiv a novel triage tool of artificial intelligence assisted diagnosis aid system for suspected covid- pneumonia in fever clinics the role of artificial intelligence in management of critical covid- patients towards an artificial intelligence framework for datadriven prediction of coronavirus clinical severity artificial intelligence in the intensive care unit prediction of criticality in patients with severe covid- infection using three clinical features: a machine learningbased prognostic model with clinical data in wuhan. medrxiv regularization and variable selection via the elastic net why is it difficult to accurately predict the covid- epidemic? 
predicting covid- in china using hybrid ai model modified seir and ai prediction of the epidemics trend of covid- in china under public health interventions artificial intelligenceenabled rapid diagnosis of patients with covid- clinically applicable ai system for accurate diagnosis, quantitative measurements, and prognosis of covid- pneumonia using computed tomography radiological findings from patients with covid- pneumonia in wuhan, china: a descriptive study covid- pneumonia: what has ct taught us diagnosis of coronavirus disease (covid- ) with structured latent multi-view representation learning automated detection of covid- cases using deep neural networks with x-ray images medical image analysis using wavelet transform and deep belief networks coronet: a deep neural network for detection and diagnosis of covid- from chest x-ray images coronavirus (covid- ) outbreak: what the department of radiology should know application of deep learning technique to manage covid- in routine clinical practice using ct images: results of convolutional neural networks deep learning covid- features on cxr using limited training data sets covid- identification in chest x-ray images on flat and hierarchical classification scenarios inf-net: automatic covid- lung infection segmentation from ct images explainable deep learning for pulmonary disease and coronavirus covid- detection from x-rays covidgan: data augmentation using auxiliary classifier gan for improved covid- detection a hybrid covid- detection model using an improved marine predators algorithm and a ranking-based diversity reduction strategy review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid- sample-efficient deep learning for covid- diagnosis based on ct scans deep learning-based model for detecting novel coronavirus pneumonia on high-resolution computed tomography: a prospective study visualizing and understanding convolutional networks coronavirus france: cameras to monitor masks and social distancing as coronavirus surveillance escalates, personal privacy plummets. the new york times how russia is using facial recognition to police its coronavirus lockdown potentially highly potent drugs for -ncov. biorxiv mathdl: mathematical deep learning for d r grand challenge aiaided design of novel targeted covalent inhibitors against sars-cov- . 
biorxiv de novo design of new chemical entities (nces) for sars-cov- using artificial intelligence preliminary bioinformatics studies on the design of a synthetic vaccine and a preventative peptidomimetic antagonist against the sars-cov- ( -ncov, covid- ) coronavirus network bioinformatics analysis provides insight into drug repurposing for covid- covid- coronavirus vaccine design using reverse vaccinology and machine learning how artificial intelligence is changing drug discovery a data-driven drug repositioning framework discovered a potential therapeutic agent targeting covid- combating covid- -the role of robotics in managing public health and infectious diseases the uses of drones in case of massive epidemics contagious diseases relief humanitarian aid: wuhan-covid- crisis from high-touch to hightech: covid- drives robotics adoption robotics, smart wearable technologies, and autonomous intelligent systems for healthcare during the covid pandemic: an analysis of the state of the art and future vision deep reinforcement learning for multiagent systems: a review of challenges, solutions, and applications solarnet: a deep learning framework to map solar power plants in china from satellite imagery satellites and ai monitor chinese economys reaction to coronavirus anticipating the international spread of zika virus from brazil landscan global population database pneumonia of unknown aetiology in wuhan, china: potential for international spread via commercial air travel metabiota epidemic tracker contagious disease surveillance top concerns of tweeters during the covid- pandemic: infoveillance study the impact of covid- epidemic declaration on psychological consequences: a study on active weibo users the outbreak of covid- coronavirus and its impact on global mental health deepmine-natural language processing based automatic literature mining and research summarization for early stage comprehension in pandemic situations specifically for covid- . biorxiv covidnlp: a web application for distilling systemic implications of covid- pandemic with natural language processing understand research hotspots surrounding covid- and other coronavirus infections using topic modeling. medrxiv word embedding mining for sars-cov- and covid- drug repurposing text and networkmining for covid- intervention studies. chemrxiv information mining for covid- research from a large volume of scientific literature identification of pulmonary comorbid diseases network based repurposing effective drugs for covid- bioactivity profile similarities to expand the repertoire of covid- drugs an artificial intelligence-based first-line defence against covid- : digitally screening citizens for risks via a chatbot who health alerts facebook messenger chatbot who viber interactive chatbot who's health alert on whatsapp unicefs europe and central asia regional office and the who regional office for europe ( ) ibm watson assistant for citizens covid- risk assessment chatbot rona (covid- bot) covid chat bot dr. nguyen has been recognized as a leading researcher in australia in the field of artificial intelligence by the australian newspaper in a report published in . he is currently a senior lecturer in the school of information technology, deakin university, victoria, australia. key: cord- - jzkdu authors: bickman, leonard title: improving mental health services: a -year journey from randomized experiments to artificial intelligence and precision mental health date: - - journal: adm policy ment health doi: . 
/s - - - sha: doc_id: cord_uid: jzkdu this conceptual paper describes the current state of mental health services, identifies critical problems, and suggests how to solve them. i focus on the potential contributions of artificial intelligence and precision mental health to improving mental health services. toward that end, i draw upon my own research, which has changed over the last half century, to highlight the need to transform the way we conduct mental health services research. i identify exemplars from the emerging literature on artificial intelligence and precision approaches to treatment in which there is an attempt to personalize or fit the treatment to the client in order to produce more effective interventions. in , i was writing my first graduate paper at columbia university on curing schizophrenia using sarnoff mednick's learning theory. i was not very modest even as a first-year graduate student! but i was puzzled as to how to develop and evaluate a cure. then, as now, the predominant research design was the randomized experiment or randomized clinical trial (rct). it was clear that simply describing, let alone manipulating, the relevant characteristics of this one disorder and promising treatments would require hundreds of variables. developing an effective treatment would take what seemed to me an incalculable number of randomized trials. how could we complete all the randomized experiments needed? how many different outcomes should we measure? how could we learn to improve treatment? how should we consider individual differences in these group comparisons? i am sure i was not insightful enough to think of all these questions back then, but i know i felt frustrated and stymied by our methodological approach to answering these questions. but i had to finish the paper, so i relegated these and similar questions to the list of universal imponderables such as why i exist. in fact, i became a committed experimentalist, and i dealt with the limitations of experiments by recognizing their restrictions and abiding by the principle "for determining causality, in many but not all circumstances, the randomized design is the worst form of design except all the others that have been tried " (bickman and reich , pp. - ) . for the much of my career, i was a committed proponent of the rct as the best approach to understanding causal relationships (bickman ) . however, as some of my writing indicates, it was a commitment with reservations. i did not see a plausible alternative or complement to rcts until recently, when i began to read about artificial intelligence (ai) and precision medicine in . the potential solution to my quandary did not crystallize until , when i collaborated with aaron lyons and miranda wolpert on a paper on what we called "precision mental health" (bickman et al. ) . with the development of ai and its application in precision medicine, i now believe that ai is another approach that we may be able to use to understand, predict, and influence human behavior. while not necessarily a substitute for rcts in efforts to improve mental health services, i believe that ai provides an exciting alternative to rcts or an adjunct to them. while i use precision medicine and precision mental health interchangeably, i will differentiate them later in this paper. toward that end, i focus much of this paper on the role of ai and precision medicine as a critical movement in the field with great potential to inform the next generation of research. 
before proposing such solutions, i first describe the challenges currently faced by mental health services, using examples drawn almost entirely from studies of children and youth, the area in which i have conducted most of my research. i describe five principal causes of this failure, which i attribute primarily, but not solely, to methodological limitations of rcts. lastly, i make the case for why i think ai and the parallel movement of precision medicine embody approaches that are needed to augment, but probably not replace, our current research and development efforts in the field of mental health services. i then discuss how ai and precision mental health can help inform the path forward, with a focus on similar problems manifested in mental health services for adults. these problems, i believe, make it clear that we need to consider alternatives to our predominant research approach to improving services. importantly, most of the research on ai and precision medicine i cite deals with adults, as there is little research in this area on children and youth. i am assuming that we can generalize from one literature to the other, but i anticipate that there are many exceptions to this assumption.

according to some estimates, more than half ( . %) of adults with a mental illness receive no treatment (mental health in america ). less than half of adolescents with psychiatric disorders receive any kind of treatment (costello et al. ). over % of youth with major depression do not receive any mental health treatment (mental health in america ). several other relevant facts when it comes to youth illustrate the problem of their access to services. hodgkinson et al. ( ) have documented that less than % of children in poverty receive needed services. these authors also showed that there is less access to services for minorities and rural families. when it comes to the educational system, mental health in america ( ) estimated that less than % of students have an individual education plan (iep), which students need to access school-supported services, even though studies have shown that a much larger percentage of students need those services. access is even more severely limited in low- and middle-income countries (esponda et al. ).

very few clients receive effective evidence-based quality mental health services that have been shown to be effective in laboratory-based research (garland et al. ; gyani et al. ). moreover, research shows that even when they do receive care that is labeled evidence-based, it is not implemented with sufficient fidelity to be considered evidence-based (park et al. ). no matter how effective evidence-based treatments are in the laboratory, it is very clear that they lose much of their effectiveness when implemented in the real world (weisz et al. , ). research reviews demonstrate that services that are typically provided outside the laboratory lack substantial evidence of effectiveness. there are two factors that account for this lack of effectiveness. as noted above, evidence-based services are usually not implemented with sufficient fidelity to replicate the effectiveness found in the laboratory. more fundamentally, it is argued here that even evidence-based services may not be sufficiently effective as currently conceptualized.
a review of published studies on school-based health centers found that while these services increased access, the review could not determine whether services were effective because the research was of such poor quality (bains and diallo ) . a meta-analysis of studies of mental health interventions implemented by school personnel found small to medium effect sizes, but only % of the services were provided by school counselors or mental health workers (sanchez et al. ) . a cochrane review concluded, "we do not know whether psychological therapy, antidepressant medication or a combination of the two is most effective to treat depressive disorders in children and adolescents" (cox et al. , p. ) . another meta-analysis of studies on school-based interventions delivered by teachers showed a small effect for internalizing behaviors but no effect on externalizing ones (franklin et al. a) . similarly, a meta-analysis of meta-analyses of universal prevention programs targeting school-age youth showed a great deal of variability with effect sizes from to . standard deviations depending on type of program and targeted outcome (tanner-smith et al. ) . a review of rcts found no compelling evidence to support any one psychosocial treatment over another for people with serious mental illnesses (hunt et al. ) . a systematic review and meta-analysis of conduct disorder interventions concluded that they have a small positive effect, but there was no evidence of any differential effectiveness by type of treatment (bakker et al. ) . fonagy and allison ( ) conclude, "the demand for a reboot of psychological therapies is unequivocal simply because of the disappointing lack of progress in the outcomes achieved by the best evidence-based interventions" (p. ). probably the most discouraging evidence was identified by weisz et al. ( ) on the basis of a review of rcts over a -year period. they found that the mean effect size for treatment did not improve significantly for anxiety and adhd and decreased significantly for depression and conduct problems. the authors conclude: in sum, there were strikingly few exceptions to the general pattern that treatment effects were either unchanged or declining across the decades for each of the target problems. one possible implication is that the research strategy used over the past decades, the treatment approaches investigated, or both, may not be ideal for generating incremental benefit over time. (p. ) there is a need-indeed, an urgent need-to change course, because our traditional approaches to services appear not to be working. however, we might be expecting too much from therapy. in an innovative approach to examining the effectiveness of psychotherapy for youth, jones et al. ( ) subjected rcts to a mathematical simulation model that estimated that even if therapy was perfectly implemented, the effect size would be a modest . . they concluded that improving the quality of existing psychotherapy will not result in much better outcomes. they also noted that ai may help us understand why some therapies are more effective than others. they suggested that the impact of therapy is limited because a plethora of other factors influence mental health, especially given that therapy typically lasts only one hour a week out of + waking hours. they also indicated that other factors that have not been included in typical therapies, such as individualizing or personalizing treatment, may increase the effectiveness of treatment. 
i am not alone in signaling concern about the state of mental health services. for example, other respected scholars in children's services research have also raised concerns about the quality and effectiveness of children's services. weisz and his colleagues (marchette and weisz ; ng and weisz ) described several factors that contribute to the problems identified above. these included a mismatch between empirically supported treatments and mental health care in the real world, the lack of personalized interventions, and the absence of transdiagnostic treatment approaches. it is important to acknowledge the pioneering work of sales and her colleagues, who identified the need for and tested approaches to individualizing assessment and monitoring of clients (alves et al. , ; elliott et al. ; alves , ; sales et al. , ). we need not only to appreciate the relevance of this work but also to integrate it with the new artificial intelligence approaches described later in this paper.

i am not concluding from such evidence that all mental health services are ineffective. this brief summary of the state of our services can be perceived in terms of a glass half full or half empty. in other words, there is good evidence that some services are effective under particular, but as yet unspecified, conditions. however, i do not believe that the level of effectiveness is sufficient. moreover, we are not getting better at improving service effectiveness by following our traditional approach to program development, implementation, research, and program evaluation. while it is unlikely that the social and behavioral sciences will experience a major breakthrough in discovering how to "cure" mental illness, similar to those often found in the physical or biological sciences, i am arguing in this paper that we must increase our research efforts using alternative approaches to produce more effective services. a large part of this paper, therefore, is devoted to exploring what has also been called a precision approach to treatment, in which there is an attempt to personalize treatment or fit treatment to the client in order to produce more effective interventions.

in some of my earliest work in mental health, i identified the field's focus on system-level factors rather than on treatment effectiveness as one cause of the problems with mental health services. the most popular and well-funded approach to mental health services in the s and s, which continues even today, is called a system or continuum of care (bickman , ; bickman et al. b; bryant and bickman ). this approach correctly recognized the problems with the practice of providing services that were limited to outpatient and hospitalization only, which was very common at that time. moreover, these traditional services did not recognize the importance of the role played by youth and families in the delivery of mental health services. to remedy these important problems, advocates for children's mental health conceptualized that a system of care was needed, in which a key ingredient was a managed continuum of care with different levels or intensiveness of services to better meet the needs of children and youth (stroul and friedman ). this continuum of care is a key component of a system of care. however, i believe that in actuality, these different levels of care simply represent different locations of treatment and restrictiveness (e.g., inpatient vs. outpatient care) and did not necessarily reflect a gradation of intensity of treatment.
a system of care is not a specific type of program, but an approach or philosophy that combines a wide range of services and supports for these services with a set of guiding principles and core values. services and supports are supposed to be provided within these core values, which include the importance of services that are community-based, family-focused, youth-oriented, in the least restrictive environment, and culturally and linguistically proficient. system-level interventions focus on access and coordination of services and organizations and not on the effectiveness of the treatments that are provided. it appeared that the advocates of systems of care assumed that services were effective and that what was needed was to organize them better at the systems level. although proponents of systems of care indicated that they highly valued individualized treatment, especially in what were called wraparound services, there was no distinct and systematic way that individualization was operationalized or evaluated. moreover, there was not sufficient evidence that supported the assumption that wraparound services produced better clinical outcomes (bickman et al. ; stambaugh et al. ) . a key component of the system is providing different levels of care that include hospitalization, group homes, and outpatient services, but there is little evidence that clinicians can reliably assign children to what they consider the appropriate level of care (bickman et al. a ). my earliest effort in mental health services research was based on a chance encounter that led to the largest study ever conducted in the field of child and youth mental health services. i was asked by a friend to see if i could help a person whom i did not know to plan an evaluation of a new way to deliver services. this led to a project that cost about $ million to implement and evaluate. we evaluated a new system of care that was being implemented at fort bragg, a major u.s. army post in north carolina. we used a quasi-experimental design because the army would not allow us to conduct a rct; however, we were able to control for many variables by using two similar army posts as controls (bickman ; bickman et al. ) . the availability of sufficient resources allowed me to measure aspects of the program that were not commonly measured at that time, such as cost and family empowerment. with additional funding that i received from a competitive grant from the national institute of mental health (nimh) and additional follow-up funding from the army, we were able to do a cost-effectiveness analysis (foster and bickman ) , measure family outcomes (heflinger and bickman ) , and develop a new battery of mental health symptoms and functioning (bickman and athay ). in addition, we competed successfully for an additional nimh grant to evaluate another system of care in a civilian population using a rct (bickman et al. a, b) and a study of a wraparound services that was methodologically limited because of sponsor restrictions (bickman et al. ) . i concluded from this massive and concentrated effort that systems of care (including the continuum of care) were able to influence system-level variables, such as access, cost, and coordination, but that there was not sufficient evidence to support the conclusion that it produced better mental health outcomes for children or families or that it reduced costs per client . this conclusion was not accepted by the advocates for systems of care or the mental health provider community more generally. 
moreover, i became persona non grata among the proponents of systems of care. while the methodologists who were asked to critique on the fort bragg study saw it as an important but not flawless study (e.g., sechrest and walsh ; weisz et al. ) that should lead to new research (hoagwood ) , most advocates thought it to be a well-done evaluation but of very limited generalizability (behar ). it is important to note that the system of care approach, almost years later, remains the major child and youth program funded by the substance abuse and mental health services administration's (samhsa) center for mental health services (cmhs) to the tune of about a billion dollars in funding since the system of care program's inception in . there have been many evaluations funded as part of the samhsa program that show some positive results (e.g., holden et al. ), but, in my opinion, they are methodologically weak and, in some cases, not clearly independent. systems of care are still considered by samhsa's center for mental health services to be the premier child and adolescent program worthy of widespread diffusion and funding (substance abuse and mental health services administration ), regardless of what i believe is the weak scientific support. this large investment of capital should be considered a significant opportunity cost that has siphoned off funds and attention from more basic concerns such as effectiveness of services. sadly, based on my unsuccessful efforts to encourage change as a member of the cmhs national advisory council ( - ), i am not optimistic that there will be any modification of support for this program or shift of funding to more critical issues that are identified in this paper. in the following section, i consider some of the problems that have contributed to the current status of mental health services. my assessment of current services led me to categorize the previously described deficiencies into the five following related problem groups. the problems with the validity of diagnoses have existed for as long as we have had systems of diagnoses. while a diagnosis provides some basis for tying treatment to individual case characteristics, its major contribution is providing a payment system for reimbursement for services. research has shown that external factors such as insurance influence the diagnosis given, and the diagnosis located in electronic health records is influenced by commercial interests (perkins et al. ; taitsman et al. ) . other studies have demonstrated that the diagnosis of depression alone is not sufficient for treatment selection; additional information is required (iniesta et al. ). moreover, others have shown that diagnostic categories overlap and are not mutually exclusive (bickman et al. c) . in practice, medication is prescribed according to symptoms and not diagnosis (waszczuk et al. ) . in their thematic analysis of selected chapters of the diagnostic and statistical manual of mental disorders (dsm- ), allsopp et al. ( ) examined the heterogeneous nature of categories within the dsm- . they showed how this heterogeneity is expressed across diagnostic criteria, and explained its consequences for clinicians, clients, and the diagnostic model. the authors concluded that "a pragmatic approach to psychiatric assessment, allowing for recognition of individual experience, may therefore be a more effective way of understanding distress than maintaining commitment to a disingenuous categorical system" (p. ). 
moreover, in an interview, allsop stated: although diagnostic labels create the illusion of an explanation, they are scientifically meaningless and can create stigma and prejudice. i hope these findings will encourage mental health professionals to think beyond diagnoses and consider other explanations of mental distress, such as trauma and other adverse life experiences. (neuroscience news , para. ) finally, a putative solution to this muddle is nimh's research domain criteria initiative (rdoc) diagnostic guide. rdoc is not designed to be a replacement of current systems but serves as a research tool for guiding research on mental disorders systems. however, it has been criticized on several grounds. for example, heckers ( ) states, "it is not clear how the new domains of the rdoc matrix map on to the current dimensions of psychopathology" (p. ). moreover, there is limited evidence that rdoc has actually improved the development of treatments for children (e.g., clarkson et al. ) . as i will discuss later in the paper, rush and ibrahim ( ) , in their critical review of psychiatric diagnosis, predicted that ai, especially artificial neural networks, will change the nature of diagnosis to support precision medicine. if measures are going to be used in real world practice, then in addition to the classic and modern psychometric validity criteria, it must be possible to use measures sufficiently often to provide a fine-grained picture of change. if measures are used frequently, then they must be short so as not to take up clinical time (riemer et al. ) . moreover, since there is a low correlation among different respondents (de los reyes and ohannessian ), we need measures and data from different respondents including parents, clinicians, clients, and others (e.g., teachers). however, we are still lacking a systematic methodology for managing these different perspectives. since we are still unsure which constructs are important to measure, we need measures of several different constructs in order to pinpoint which ones we should administer on a regular basis. in addition to outcome measures, we need valid and reliable indicators of mediators and processes to test theories of treatment as well as to indicate short-term outcomes. we need measures that are sensitive to change to be valid measures of improvement. we need new types of measures that are more contextual, that occur outside of therapy sessions, and that are not just standardized questionnaires. we lack good measures of fidelity of implementation that capture in an efficient manner what clinicians actually do in therapy sessions. this information is required to provide critical feedback to clinicians. we also lack biomarkers of mental illness that can be used to develop and evaluate treatments that are often found in physical illnesses. this is a long and incomplete list of needs and meeting them will be difficult to accomplish without a concerted effort. there are some resources at the national institutes of health that are focused on measure development, such as patient-reported outcomes measurement system information (promis) (https ://www.healt hmeas ures.net/explo re-measu remen t-syste ms/promi s), but this program does not focus on mental health. thus, we depend upon the slow and uncoordinated piecemeal efforts of individual researchers to somehow fit measure development into their career paths. 
i know this intimately because when i started to be engaged with children's mental health services research, i found that the measures in use were too long, too expensive, and far from agile. this dissatisfaction led me down a long path to the development of a battery of measures called the peabody treatment progress battery riemer et al. ). this battery of brief measures was developed as part of ongoing research grants and not with any specific external support. for over a half century, i have been a committed experimentalist. i still am a big fan of experiments for some purposes (bickman ). the first independent study i conducted was my honors thesis at city college of new york in . my professor was a parapsychologist and personality psychologist, so the subject of my thesis was extrasensory perception (esp). my honors advisor had developed a theory of esp that predicted that those who were positive about esp, whom she called sheep, would be better at esp than the people who rejected esp, whom she called goats (schmeidler ) . although i did not realize it at the time, my experimentalist or action orientation was not satisfied with correlational findings that were the core of the personality approach. i designed an experiment in which i randomly assigned college students to hear a scripted talk from me supporting or debunking esp. i found very powerful results. the experimental manipulation changed people's perspective on the efficacy of esp, but i found no effect on actual esp scores. it was not until i finished my master's degree in experimental psychopathology at columbia university that i realized that i wanted to be an experimental social psychologist, and i became a graduate student at the city university of new york. however, i did not accept the predominant approach of social psychologists, which was laboratory experimentation. i was convinced that research needed to take place in the real world. although my dissertation was a laboratory study of helping behavior in an emergency , it was the last lab study i did that was not also paired with a field experiment (e.g. bickman and rosenbaum ) . one of my first published research studies as a graduate student was a widely cited field experiment (rct) that examined compliance to men in different uniforms in everyday settings (bickman a, b) . the first book i coedited, as a graduate student, was titled beyond the laboratory: field research in social psychology and was composed primarily of field experiments (bickman and henchy ) . almost all my early work as a social psychologist consisted of field experiments . i strongly supported the primacy of randomized designs in several textbooks i coauthored or coedited (alasuutari et al. ; bickman and rog ; bickman and rog ; hedrick et al. ) . while the fort bragg study i described above was a quasi-experiment (bickman ) , i was not happy that the funding agency, the u.s. army, did not permit me to use a rct for evaluating an important policy issue. as i was truly committed to using a rct to evaluate systems of care, i followed up this study with a conceptual replication in a civilian community using a rct (bickman et al. b ) that was funded by a nimh grant. while i have valued the rct and continue to do so, i have come to the conclusion that our experimental methods were developed for simpler problems. 
mental health research is more like weather forecasting with thousands of variables rather than like traditional experimentation, which is based on a century-old model for evaluating agricultural experiments with only a few variables (hall ) . we need alternatives to the traditional way of doing research, service development, and service delivery that recognize the complexity of disorders, heterogeneity of clients, and varied contexts of mental health services. the oversimplification of rcts has produced a blunt tool that has not served us well for swiftly improving our services. this is not to say that there has been no change in the last years. for example, the institute of education sciences, a more recent player the field of children's behavioral and mental health outcomes research, has released an informative monograph on the use of adaptive randomized trials that does demonstrate flexibility in describing how rcts can be implemented in innovative ways (nahum-shani and almirall ). the concerns about rcts are also apparent in other fields. for example, a special issue of social science and medicine focused on the limitations of rcts (deaton and cartwright ) . the contributors to this incisive issue indicated that a rct does not in practice equalize treatment and control groups. rcts do not deliver precise estimates of average treatment effects (ates) because a rct is typically just one trial, and precision depends on numerous trials. there is also an external validity problem; that is, it is difficult to generalize from rcts, especially those done in university laboratory settings. context is critical and theory confirmation/disconfirmation is important, for without generalizability, the findings are difficult to apply in the real world (bickman et al. ) . scaling up from a rigorous rct to a community-based treatment is now recognized as a significant problem in the relatively new fields of translational research and implementation sciences. in addition to scaling up, there is a major issue in scaling down to the individual client level. stratification and theory help, but they are still at the group level. the classic inferential approach also has problems with replication, clinical meaningfulness, accurate application to individuals, and p-value testing (dwyer et al. ) . the primary clinical problem with rcts is the emphasis on average treatment effects (ates) versus individual prediction. rcts emphasize postdiction, and ates lead to necessary oversimplification and a focus on group differences and not individuals. subramanian et al. ( ) gave two examples of the fallacy of averages: the first was a study to describe the "ideal woman," where they measured nine body dimensions and then averaged each one. a contest to identity the "average woman" got responses, but not a single woman matched the averages on all nine variables. in a second example, the u.s. air force in measured pilots on body dimensions to determine appropriate specifications for a cockpit. not a single pilot matched the averages on even as few as dimensions, even when their measurements fell within % of the mean value. as these examples show, the problem with using averages has been known for a long time, but we have tended to ignore this problem. we are disappointed when clinicians do not use our research findings when in fact our findings may not be very useful for clinicians because clinicians deal with individual clients and not some hypothetical average client. 
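to make this "fallacy of averages" concrete, the short simulation below (my own illustration, not the authors' data; the 10,000 simulated people, the dimension counts, and the 0.3-standard-deviation tolerance are all arbitrary choices) counts how many individuals sit near the group mean on every measured dimension at once. the proportion collapses as dimensions are added, which is exactly why an "average client" describes almost no actual client.

```python
# illustrative only: why profiles built from averages match almost nobody.
import numpy as np

rng = np.random.default_rng(0)
n_people, tolerance = 10_000, 0.3   # tolerance = fraction of a standard deviation (arbitrary)

for n_dims in (1, 3, 5, 10):
    x = rng.standard_normal((n_people, n_dims))          # standardized measurements
    near_mean = np.abs(x - x.mean(axis=0)) < tolerance   # close to the mean on each dimension
    frac_all = near_mean.all(axis=1).mean()              # close on *every* dimension at once
    print(f"{n_dims:2d} dimensions: {frac_all:.3%} of simulated people are 'average' on all of them")
```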
we can obtain significant differences in averages between groups, but the persons who actually benefit from therapy will vary widely to the extent to which they respond to the recommended treatments. thus, the usefulness of our results depends in part on the heterogeneity of the clients and the variability of the findings. the privileging of rcts also came with additional baggage. instead of trying to use generalizable samples of participants, the methodology favored the reduction of heterogeneity as a way to increase the probability of finding statistically significant results. this often resulted in the exclusion from studies of whole groups of people, such as women, children, people of color, and persons with more than one diagnosis. while discussions often included an acknowledgment of this limitation, little was done about these artificial limitations until inclusion of certain groups was required by federal funding agencies (national institutes of health, central resource for grants and funding information ) . the limitations of rcts are not a secret, but we tend to ignore these limitations (kent et al. ) . one attempt to solve the difficulty of translating average effect sizes by rcts to individualize predictions is called reference class forecasting. here, the investigator attempts to make predictions for individuals based on "similar" persons treated with alternative therapies. however, it is rarely the case that everyone in a clinical trial is influenced by the treatment in the same way. an attempt to reduce this heterogeneity of treatment effects (hte) by using conventional subgroup analysis with one variable at a time is rejected by kent et al. ( ) . they argue that this approach does not work. first, there are many variables on which participants can differ, and there is no way to produce the number of groups that represent these differences. for example, matching on just binary variables would produce over a million groups. moreover, one would have to start with an enormous sample to maintain adequate statistical power. the authors describe several technical reasons for not recommending this approach to dealing with the hte problem. they also suggested two other statistical approaches, risk modeling and treatment effect modeling, that may be useful, but more research on both is needed to support their use. kent et al. ( ) briefly discussed using observational or non-rct data, but they pointed out the typical problems of missing data and other data quality issues as well as the difficulty in making causal attributions. moreover, they reiterated their support for the rct as the "gold standard." although published in , their article mentioned machine learning only as a question for future research-a question that i address later in this paper. i will also present other statistical approaches to solving the limitations of rcts. there is another problem in depending upon rcts as the gold standard. nadin ( ) pointed out that failed reproducibility occurs almost exclusively in life sciences, in contrast to the physical sciences. i would add that the behavioral sciences have not been immune from criticisms about replicability. the open science collaboration ( ) systematically sampled results from three top-tier journals in psychology, and only % of the replication efforts yielded significant findings. this issue is far from resolved, and it is much more complex than simple replication (laraway et al. ) . 
nadin ( ) considered the issue of the replicability as evidence of an underlying false assumption about treating humans as if they were mechanistic physical objects and not reactive human beings. he noted that physics is nomothetic, while biology is idiographic, meaning that the former is the study of the formulation of universal laws and the latter deals with the study of individual cases or events. without accurate feedback, there is little learning (kluger and denisi ) . clinicians are in a low feedback occupation, and unlike carpenters or surgeons, they are unlikely to get direct accurate feedback on the effects of their activities. when carpenters cut something too short, they can quickly see that it no longer fits and have to start with a new piece, so they typically follow the maxim of measure twice, cut once. because clinicians in the real world of treatment do not get direct accurate feedback on client outcomes, especially after clients leave treatment, then they are unlikely to learn how to become more effective clinicians from practice alone. clinical practice is thus similar to an archer's trying to improve while practicing blindfolded (bickman ) . moreover, the services research field does not learn from treatment as usual in the real world, where most treatment occurs, because very few services collect outcome data, let alone try to tie these data to clinician actions (bickman b) . there are two critical requirements needed for learning. the first is the collection of fine-grained data that are contemporaneous with treatment. the second is the feedback of these data to the clinician or others so that they can learn from these data. learning can be accomplished with routine use of measures such as patient outcome measures (poms) and feedback through progress monitoring, measurementbased care (mbc), and measurement feedback systems (mfs). these measurement feedback concepts have repeatedly demonstrated their ability to improve outcomes in therapy across treatment type and patient populations (brattland et al. ; bickman et al. ; dyer et al. ; gibbons ; gondek et al. ; lambert et al. ) . despite this evidence base, most clinicians do not use these measurement feedback systems. for example, in one of the largest surveys of canadian psychologists, only % were using a progress monitoring measure (ionita et al. ) . a canadian psychological association task force (tasca et al. ) reinforced the need for psychologists to systematically monitor and evaluate their services using continuous monitoring and feedback. they stated that the association should encourage regulatory bodies to prioritize training in their continuing education and quality assurance requirements. 
moreover, lewis et al., in their review of measurement-based care ( ), presented a -point research agenda that captures much the ideas in the present paper: ( ) harmonize terminology and specify mbc's core components; ( ) develop criterion standard methods for monitoring fidelity and reporting quality of implementation; ( ) develop algorithms for mbc to guide psychotherapy; ( ) test putative mechanisms of change, particularly for psychotherapy; ( ) develop brief and psychometrically strong measures for use in combination; ( ) assess the critical timing of administration needed to optimize patient outcomes; ( ) streamline measurement feedback systems to include only key ingredients and enhance electronic health record interoperability; ( ) identify discrete strategies to support implementation; ( ) make evidence-based policy decisions; and ( ) align reimbursement structures. (p. ) it is not surprising that the measurement feedback approach has not yet produced dramatic effects, given how little we know about what data to collect, how often it should be collected, what feedback should be, and when and how it should be provided (bickman et al. ) . regardless, every time a client is treated, it is an opportunity to learn how to be more effective. by not collecting and analyzing information from usual care settings, we are missing a major opportunity to learn from ordinary services. the most successful model i know of using this real-world services approach is the treatment of childhood cancers in hospitals where most children enter a treatment rct (o'leary et al. ) . these authors note that in the past years, the survival rates for childhood cancer have climbed from % to almost %. they attribute this remarkable improvement to clinical research through pediatric cooperative groups. this level of cooperation is not easy to develop, and it is not frequently found in mental health services. most previous research shows differential outcomes among different types of therapies that are minor at most (wampold and imel ) . for example, weisz et al. ( ) report that in their meta-analysis, the effect of treatment type as a moderator was not statistically significant but there was a significant, but not clearly understood, treatment type by informant interaction effect. in addition, the evidence that therapists have a major influence on the outcomes of psychotherapy is still being hotly debated. the fact that the efficacy of therapists is far from a settled issue is troubling (anderson et al. ; goodyear et al. ; hill et al. ; king and bickman ) . also, current drug treatment choices in psychiatry are successful in only about % of the patients (bzdok and meyer-lindenberg ) and are as low as - % for antidepressants (dwyer et al. ) . while antidepressants are more effective than placebos, they have small effect sizes (perlis ) , and the choice of specific medicine is a matter of trial and error in many cases. it is relatively easy to distinguish one type of drug from another but not so for services, where even dosage in psychosocial treatments is hard to define. according to dwyer et al. ( ) , "currently, there are no objective, personalized methods to choose among multiple options when tailoring optimal psychotherapeutic and pharmacological treatment" (p. ). a recent summary concluded that after years and studies, it is unknown which patients benefit from interpersonal psychotherapy (ipt) versus another treatment (bernecker et al. ) . 
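returning to the measurement feedback systems discussed above, the toy sketch below shows the basic logic of progress monitoring with clinician alerts: a brief outcome rating is collected each session, and the clinician is flagged when a client falls well short of an expected improvement trajectory. the expected slope, alert margin, and session scores are invented placeholders, not parameters from any validated feedback system.

```python
# a toy progress-monitoring rule: flag clients who are off their expected trajectory.
from dataclasses import dataclass

@dataclass
class ProgressMonitor:
    baseline: float                       # symptom score at intake (higher = worse)
    expected_drop_per_session: float = 1.0  # assumed average improvement per session
    alert_margin: float = 3.0             # how far above the expected score triggers an alert

    def check(self, session: int, observed: float) -> str:
        expected = self.baseline - self.expected_drop_per_session * session
        if observed > expected + self.alert_margin:
            return f"session {session}: ALERT - observed {observed:.1f} vs expected ~{expected:.1f}"
        return f"session {session}: on track ({observed:.1f})"

monitor = ProgressMonitor(baseline=24.0)
for s, score in enumerate([23.5, 22.0, 23.8, 24.5], start=1):   # hypothetical weekly ratings
    print(monitor.check(s, score))
```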
however, to provide a more definitive answer to the question about which treatments are more effective, we need head-to-head direct comparisons between different treatments and network meta-analytic approaches such as those used by dagnea et al. ( ) . the field of mental health is not alone in finding that many popular medications do not work with most of the people who take them. nexium, a common drug for treating heartburn, works with only person out of , while crestor, used to treat high cholesterol, works with only out of (schork ) . this poor alignment between what the patient needs, and the treatment provided is the primary basis for calling for a more precise medicine approach. this lack of precision leads to the application of treatments to people who cannot benefit from it, thus leading to overall poor effectiveness. in summary, a deep and growing body of work has led me to conclude that we need additional viable approaches to a rct when it comes to conducting services-related research. an absence of rigorous evaluation of treatments that are usually provided in the community contributes to a gap in our understanding why treatments are ineffective (bickman b) . poor use of measurement in routine care (bickman ) and the absence of measurement feedback systems and clinician training and supervision (garland et al. ) are rampant. there also a dire need for the application of more advanced analytics and data mining techniques in the mental health services area (bickman et al. ). these and other such challenges have in turn informed my current thinking about alternative or ancillary approaches for addressing the multitude of problems plaguing the field of mental health services. the five problems i have described above constitute significant obstacles to achieving accessibility, efficiency, and effectiveness in mental health services. nevertheless, there is a path forward that i believe can help us reach these goals. artificial intelligence promises to transform the way healthcare is delivered. the core of my recommendations in this paper rests on the revolutionary possibilities of artificial intelligence for improving mental healthcare through precision medicine that allows us to take into account the individual variability that exists with respect to genetic and other biological, environmental, and lifestyle characteristics. several others have similarly signaled a need for considering the use of personalized approaches to service delivery. for example, weisz and his colleagues (marchette and weisz ; ng and weisz ) called for more idiographic research and for studies tailoring strategies in usual care. kazdin ( ) focused on expanding mental health services through novel models of intervention delivery; called for task shifting among providers; advocated designing and implementing treatments that are more feasible, using disruptive technologies, for example, smartphones, social media such as twitter and facebook, and socially assistive robots; and emphasized social network interventions to connect with similar people. ai is currently used in areas ranging from prediction of weather patterns to manufacturing, logistic planning to determine efficient delivery routes, banking, and stock trading. ai is used in smartphones, cars, planes, and the digital assistants siri and alexa. in healthcare, decision support, testing and diagnosis, and self-care also use ai. ai can sort through large data sets and uncover relationships that humans cannot perceive. 
through learning that occurs with repeated, rapid use, ai surpasses the abilities of humans only in some areas. however, i would caution potential users that there are significant limitations associated with ai that are discussed later in this paper. rudin and carlson ( ) present a non-technical and well written review of how to utilize ai and of some of the problems that are typically encountered. ai is not one type of program or algorithm. machine learning (ml), a major type of ai, is the construction of algorithms that can learn from and make predictions based on data. it can be ( ) supervised, in which the outcome is known and labeled by humans and the algorithm learns to get that outcome; ( ) unsupervised, when the program learns from data to predict specific outcomes likely to come from the patterns identified; and ( ) reinforcement learning, in which ml is trial and error. in most cases, there is an extensive training data set that the algorithm "learns" from, followed by an independent validation sample that tests the validity of the algorithm. other variations of ai include random forest, decision trees, and the support vector machine (svm), a multivariate supervised learning technique that classifies individuals into groups (dwyer et al. ; shrivastava et al. ). the latter is most widely used in psychology and psychiatry. artificial neural networks (anns) or "neural networks" (nns) are learning algorithms that are conceptuality related to biological neural networks. this approach can have many hidden layers. deep learning is a special type of machine learning. it helps to build learning algorithms that can function conceptually in a way similar to the functioning of the human brain. large amounts of data are required to use deep learning. ibm's watson won jeopardy with deepqa algorithms designed for question answering. as exemplified by the term neural networks, algorithm developers appear to name their different approaches with reference to some biological process. genetic algorithms are based on the biological process of gene propagation and the methods of natural selection, and they try to mimic the process of natural evolution at the genotype level. it has been a widely used approach since the s. natural language processing (nlp) involves speech recognition, natural language understanding, and natural language generation. nlp may be especially useful in analyzing recordings of a therapy session or a therapist's notes. affective computing or sentiment analysis involves the emotion recognition, modeling, and expression of emotion by robots or chatbots. sentiment analysis can recognize and respond to human emotions. virtual reality and augmented reality are human-computer interfaces that allow a user to become immersed within and interact with computer-generated simulated environments. hinton ( ) , a major contributor to research on ai and health, described ai as the use of algorithms and software to approximate human cognition in the analysis of complex data without being explicitly programmed. the primary aim of health-related ai applications is to analyze relationships between prevention or treatment techniques and patient outcomes. ai programs have been developed and applied to practices such as diagnosis processes, treatment protocol development, drug development, personalized medicine, and patient monitoring and care. 
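as a concrete anchor for the supervised workflow sketched above (a labeled training set, a learned classifier such as an svm, and an independent validation sample), here is a minimal example on synthetic data. in a mental health application the features might be symptom ratings, service-use history, or sensor data, but everything below is a stand-in.

```python
# minimal supervised-learning sketch: train on labeled data, check on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# synthetic "clients" with 20 features and a binary label (e.g., responder vs. non-responder)
X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)      # support vector machine classifier
print("validation accuracy:", accuracy_score(y_valid, clf.predict(X_valid)))
```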
deep learning is best at modeling very complicated relationships between input and outputs and all their interactions, and it sometimes requires a very large number of cases-in the thousands or tens of thousands-to learn. however, there appears to be no consensus about how to determine, a priori, the number of cases needed, because the number is highly dependent on the nature of the problem and the characteristics of the data. ai is already widely used in medicine. for example, in ophthalmology, photos of the eyes of persons with diabetes were screened with % specificity and % sensitivity in detecting diabetes (gargeya and leng ) . one of the more prolific uses of ai is in the diagnosis of skin cancer. in a study that scanned , clinical images, the ai approach had accuracy similar to that of board-certified dermatologists (esteva et al. ) . cardiovascular risk prediction with ml is significantly improved over established methods of risk prediction (krittanawong et al. ; weng et al. ). however, a study by desai et al. ( ) found only limited improvements in predicting heart failure over traditional logistic regression. in cancer diagnostics, ai identified malignant tumors with % accuracy compared to % accuracy for human pathologists (liu et al. ). the ibm's watson ai platform took only min to analyze a genome of a patient with brain cancer and suggest a treatment plan, while human experts took h (wrzeszczynski et al. ) . ai has also been used to develop personalized immunotherapy for cancer treatment (kiyotani et al. ). rajpurkar et al. ( ) compared chest x-rays for signs of pneumonia using a state-of-the-art -layer convolutional neural network (cnn) program with a "swarm" of radiologists (groups connected by swarm algorithms) and found the latter to be significantly more accurate. in a direct comparison between radiologists on , interpretations and a stand-alone deep learning ai program designed to detect breast cancer in mammography, the ai program was as accurate as the radiologists (rodriguez-ruiz et al. ). as topol ( b) noted, ai is not always the winner in comparison with human experts. moreover, many of these applications have not been used in the real world, so we do not know how well ai will scale up in practice. topol describes other concerns with ai, many of which are discussed later in this paper. finally, many of the applications are visual, such as pictures of skin or scans, for which ai is particularly well suited. large banks of pictures often form the training and testing data for this approach. in mental health, visual data are not currently as relevant. however, there is starting to be some research on facial expressions in diagnosing mental illness. for example, abdullah and choudhury ( ) cite several studies that showed that patients with schizophrenia tend to show reduced facial expressivity or that facial features can be used to indicate mental health status. more generally, there is research showing how facial expressions can be used to indicate stress (mayo and heilig ) . visual data are ripe for exploration using ai. although an exhaustive review of the ai literature and its applications is well beyond the focus of this paper, rudin and carlson ( ) present a well-written, non-technical review of how to utilize ai and of some of the problems that are typically encountered. topol ( a) , in his book titled deep medicine: how artificial intelligence can make healthcare human again, includes a chapter on how to use of ai in mental health. 
topol ( b) also provides an excellent review of ai and its application to health and mental health in a briefer format. buskirk et al. ( ) and y. liu et al. ( ) provide well-written and relatively brief introductions to ml's basic concepts and methods and how they are evaluated. a more detailed introduction to deep learning and neural networks is provided by minar and naher ( ) . in most cases, i will use the generic term ai to refer to all types of ai unless the specific type of ai (e.g., ml for machine learning, dl for deep learning, and dnn for deep neural networks) is specified. precision medicine has been defined as the customization of healthcare, with medical decisions, treatments, practices, or products being tailored to the individual patient (love-koh et al. ) . typically, diagnostic testing is used for selecting the appropriate and best therapies based a person's genetic makeup or other analysis. in an idealized scenario, a person may be monitored with hundreds of inputs from various sources that use ai to make predictions. the hope is that precision medicine will replace annual doctor visits and their granular risk factors with individualized profiles and continuous longitudinal health monitoring (gambhir et al. ) . the aim of precision medicine, as stated by president barack obama when announcing his precision medicine initiative, is to find the long-sought goal of "delivering the right treatments, at the right time, every time to the right person" (kaiser ) . both ai and precision medicine can be considered revolutionary in the delivery of healthcare, since they enable us to move from one-size-fits-all diagnoses and treatment to individualized diagnoses and treatments that are based on vast amounts of data collected in healthcare settings. the use of ai and precision medicine to guide clinicians will change diagnoses and treatments in significant ways that will go beyond our dependence on the traditional rct. precision medicine should also be seen as evolutionary since even hippocrates advocated personalizing medicine (kohler ) . the importance of a precision medicine approach was recognized in the field of prevention science with a special issue of prevention science devoted to that topic (august and gewirtz ) . the articles in this special issue recognize the importance of identifying moderators of treatment that predict heterogeneous responses to treatment. describing moderators is a key feature of precision medicine. once these variables are discovered, it becomes possible to develop decision support systems that assist the provider (or even do the treatment assignment) in selecting the most appropriate treatment for each individual. this general approach has been tried using a sequential multiple assignment randomized trial (smart) in which participants are randomized two to three times at key decision points (august et al. ) . what i find notable about this special issue is the absence of any focus on ai. the articles were based on a conference in october , and apparently the relevance of ai had not yet influenced these very creative and thoughtful researchers at that point. precision medicine does not have an easy path to follow. x. liu et al. ( b) describe several challenges, including the following three. large parts of the human genome are not well enough known to support analyses; for example, almost % of our genetic code is unknown. 
it is also clear that a successful precision medicine approach depends on having access to large amounts of data at multiple levels, from the genetic to the behavioral. moreover, these data would have be placed into libraries that allow access for researchers. the u.s. federal government has a goal of establishing such a library with data on one million people through nih's all of us research program (https ://allof us.nih.gov/). recruitment of volunteers who would be willing to provide data and the "harmonization" of data from many different sources are major issues. x. liu et al. ( b) also point to ethical issues that confront precision medicine, such as informed consent, privacy, and predictions that someone may develop a disease. these issues are discussed later in this paper. chanfreau-coffinier et al. ( ) provided a helpful illustration of how precision medicine could be implemented. they convened a conference of veterans affairs stakeholders to develop a detailed logic model that can be used by an organization planning to introduce precision medicine. this model includes components typically found in logic models, such as inputs (clinical and information technology), big data (analytics, data sources), resources (workforce, funding) activities (research), outcomes (healthcare utilization), and impacts (access). the paper also includes challenges to implementing precision medicine (e.g., a poorly trained workforce) that apply to mental health. ai has the potential to unscramble traditional and new diagnostic categories based on analysis of biological/genetic and psychological data, and in addition, more data will likely be generated now that the potential for analysis has become so much greater. ai also has the potential to pinpoint those individuals who have the highest probability of benefiting from specific treatments and to provide early indicators of success or failure of treatment. research is currently being undertaken to provide feedback to clinicians at key decision points as an early warning of relapse. fernandes et al. ( ) describe what the authors call the domains related to precision psychiatry (see fig. ). these domains include many approaches and techniques, such as panomics, neuroimaging, cognition, and clinical characteristics, that form several domains including big data and molecular biosignature; the latter includes biomarkers. the authors include data from electronic health records, but i would also include data collected from treatment or therapy sessions as well as data collected outside of these sessions. these domains can be analyzed using biological and computational tools to produce a biosignature, a higher order domain that includes data from all the lower level techniques and approaches. this set of biomarkers in the biosignature should result in improved diagnosis, classification, and prognosis, as well as individualized interventions. the authors note that this bottom-up approach, from specific approaches to domains to the ultimate biosignature, can also be revised to a top-down approach, with the biosignature studied to better understand domains and its specific components. the bottom of the figure shows a paradigm shift where precision psychiatry contributes to different treatments being applied to persons with different diagnoses and endophenotypes, producing different prognoses. 
endophenotypes is a term used in genetic epidemiology to separate different behavioral symptoms into more stable intermediate phenotypes that are presumed to have clearer genetic underpinnings.

another perspective on precision psychiatry is presented by bzdok and meyer-lindenberg ( ). both models contain similar concepts. both start with a group of persons containing multiple traditional diagnoses. bzdok and meyer-lindenberg recognize that these psychiatric diagnoses are often artificial dichotomies. machine learning is applied to diverse data from many sources and extracts hidden relationships. this produces different subgroups of endophenotypes. machine learning is also used to produce predictive models of the effects of different treatments instead of the more typical trial and error. further refinement of the predictive ml models results in better treatment selection and better prediction of the disease trajectory. an excellent overview of deep neural networks (dnns) in psychiatry and their applications is provided by durstewitz et al. ( ). in addition to explaining how dnns work, they provide some suggestions on how dnns can be used in clinical practice with smartphones and large data sets. a major feature of deep neural networks is their ability to learn and adapt with experience. while dnns typically outperform other ml approaches, the authors state that they do not fully understand why this is the case. in mental health, dnns have mostly been used in diagnosis and prediction but not in designing personalized treatments. dnns' ability to integrate many different data sets (e.g., various neuroimaging data, movement patterns, social media, and genomics) should provide important insights on how to personalize treatments. regardless of the model used, eyre et al. ( ) remind us that consumers should not be left out of the development of precision psychiatry.

in my conceptualization of precision medicine, precision mental health encompasses precision psychiatry and any other precision approach, such as social work, that focuses on mental health (bickman et al. ). there has not been much written about using a precision approach with psychosocial mental health services. possibly it is psychiatry's close relationship to general medicine and its roots in biology that make psychiatry more amenable to the precision science approach. in addition, the precision construct is being applied in other fields, as exemplified by the special issue of the journal of school psychology devoted to precision education (cook et al. ) and precision public health (kee and taylor-robinson ). however, in this paper i am primarily addressing the use of psychosocial treatment of mental health problems, which differs in important ways from psychiatric treatment. for example, precision psychosocial mental health treatment does not have a strong biological/medical perspective and does not focus almost exclusively on medication; instead, it emphasizes psychosocial interventions. psychosocial mental health services are also provided in hospital settings, but their primary use is in community-based services. these differences lead to different data sources for ai analyses. it is highly unlikely that electronic mental healthcare records found outside of hospital settings contain biological and genomic data (serretti ). but hospital records are not likely to contain the detailed treatment process data that could possibly be found in community settings. the genomic and biological data offer new perspectives but may not be informative until we have a better understanding of the genomic basis of mental illness.
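as a small, purely illustrative counterpart to the multi-layer networks described above, the sketch below fits a feed-forward classifier with a few hidden layers on synthetic data. real dnn applications in psychiatry would involve far larger models and data sources (imaging, language, genomics, sensors) than this toy example.

```python
# toy "deep" feed-forward network on synthetic data; layer sizes are arbitrary.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

net = MLPClassifier(hidden_layer_sizes=(64, 32, 16),   # three hidden layers
                    activation="relu", max_iter=500, random_state=1)
net.fit(X_train, y_train)
print("held-out accuracy:", net.score(X_test, y_test))
```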
in addition, the internet of things and smart healthcare connect wearable and home-based sensors that can be used to monitor movement, heart rate, ecg, emg, oxygen level, sleep, and blood glucose through wi-fi, bluetooth, and related technologies (sundaravadivel et al. ). with wider use of very fast 5g internet service, there will be a major increase in the growth of the internet of things. i want to emphasize that applying precision medicine concepts to mental health services, especially psychotherapy, is a very difficult undertaking. the data requirements for psychosocial mental health treatment are more similar to meteorology or weather forecasting than to agriculture, which is considered the origin of the rct design. people's affect, cognition, and behavior are constantly changing, just like the variables that affect weather. but unlike meteorology, which is mainly descriptive and not yet engaged in interventions, mental health services are interventions. thus, in addition to client data, we must identify the variables that are critical to the success of the intervention. we are beginning to grasp how difficult this task is as we develop greater understanding that the mere labeling of different forms of treatment by location (e.g., hospital or outpatient) or by generic type (e.g., cognitive behavior therapy) is not sufficiently informative. moreover, the emergence of implementation sciences has forced us to face the fact that a treatment manual describes only some aspects of the treatment as intended but does not describe the treatment that is actually delivered. nlp is a step in the right direction in trying to capture some aspects of treatment as actually delivered. data quality is the foundation upon which ai systems are built. while medical records are of higher technical quality than community-based data because they must adhere to national standards, i believe that the nascent interest in measurement-based care and measurement feedback systems in community settings bodes well for improved data systems in the future. moreover, although electronic hospital-based data may be high quality from a technical viewpoint (validity, reliability) and be very large, they probably do not contain the data that are valuable for developing and evaluating mental health services. the development of electronic computer-based data collection and feedback systems will become more common as the growth in ai demands large amounts of good-quality treatment data and finer-grained longitudinal outcome data. there is a potential reciprocal relationship between ai's need for large, high-quality data sets and the development of new measurement approaches and the electronic systems needed to collect such data (bickman a; bickman et al. a; bickman et al. , ). to accomplish this with sufficiently unbiased and valid data will be a challenge. ai can bypass many definitional problems by not using established diagnostic systems. ml can use a range of variables to describe the individual in ml classifier systems (tandon and tandon ). moreover, additional sources of data that help in classification are now feasible. for example, automated analysis of social media, including tweets and facebook posts, can detect depression, with accuracy measured by area under the curve (auc) ranging from . to . , compared to clinical interviews with aucs of . (guntuku et al. ).
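as a purely illustrative sketch of how auc figures like those cited above are produced, the following toy example scores a simple text classifier on invented "posts"; it is not the pipeline used by guntuku et al., and the data, labels, and model choices are all my assumptions.

```python
# illustrative sketch only: scoring a simple text classifier for depression screening
# with the area under the roc curve (auc). the tiny synthetic posts and labels are
# placeholders, not data from any of the studies discussed above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

posts = ["cant sleep again and everything feels pointless",
         "great run this morning with friends",
         "no energy to get out of bed today",
         "excited about the new job next week"] * 50
labels = [1, 0, 1, 0] * 50   # 1 = screened positive for depression (toy labels)

x_train, x_test, y_train, y_test = train_test_split(
    posts, labels, test_size=0.25, random_state=0, stratify=labels)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(x_train, y_train)

scores = model.predict_proba(x_test)[:, 1]      # predicted probability of the positive class
print("auc:", roc_auc_score(y_test, scores))    # 0.5 = chance, 1.0 = perfect ranking
```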
as noted earlier, dnns have been shown to be superior to other machine learning approaches in general and specifically in identifying psychiatric stressors for suicide from social media (du et al. ). predictions of , adolescent suicides with ml showed high accuracy (auc > . ) and outperformed traditional logistic regression analyses ( . - . aucs) (tandon and tandon ) . saxe has published a pioneering proof of concept that has demonstrated that ml methods can be used to predict child posttraumatic stress (saxe et al. ) . ml was more accurate than humans in predicting social and occupational disability with persons in high-risk states of psychosis or with recent-onset depression (koutsouleris et al. a) . machine learning has also been used in predicting psychosis using everyday language (rezaii et al. ) . another application of ai to diagnosis is provided by kasthurirathne et al. ( ) . they demonstrated the ability to automate screening for , adult patients in need of advanced care for depression using structured and unstructured data sets covering acute and chronic conditions, patient demographics, behaviors, and past service use history. the use of many existing data elements is a key feature and thus does not depend on single screening instruments. the authors used this information to accurately predict the need for advanced care for depression using random forest classification ml. milne et al. ( ) recognized that in implementing online peer counseling, professionals need to participate and/or provide safety monitoring in using ai. however, cost and scalability issues appeared to be insurmountable barriers. what is needed is an automated triage system that would direct human moderators to cases that require the most urgent attention. the triage system milne et al. developed sent human moderators color-coded messages about their need to intervene. the algorithm supporting this triage system was based on supervised ml. the accuracy of the system was evaluated by comparing a test set of manually prioritized messages with the ones developed through the algorithm. they used several methods to judge accuracy, but their main one was an f-measure, or the harmonic mean of recall (i.e., sensitivity) and precision (i.e., positive predictive value). regression analysis indicated that the triage system made a significant and unique contribution to reducing the time taken to respond to some messages, after accounting for moderator and community activity. i can see the potential for this and similar ai approaches to deal with the typical service setting where some degree of supervision is required but even intermittent supervision is not feasible or possible. another use of ml as a classification tool is provided by pigoni et al. ( ) . in their review of treatment resistant depression, they found that ml could be used successfully to classify responders from non-responders. this suggested that stratification of patients might help in selecting the appropriate treatment, thus avoiding giving patients treatments that are unlikely to work with them. a more general systematic review and meta-analysis of the use of ml to predict depression are provided by lee et al. ( ) . the authors found qualitative and quantitative studies that qualified for inclusion in their review. while most of the studies were retrospective, they did find predictions with an average overall accuracy of . . 
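because the studies above report several different accuracy metrics, a small sketch may help: the f-measure used to evaluate the milne et al. triage system is simply the harmonic mean of precision (positive predictive value) and recall (sensitivity). the label vectors below are invented for illustration only.

```python
# illustrative sketch only: computing precision, recall, and the f-measure for a
# triage classifier against a human "gold standard". the labels here are invented.
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = message flagged as needing urgent moderator attention, 0 = routine
manually_prioritized = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # human ratings
algorithm_prioritized = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]  # supervised-ml triage output

precision = precision_score(manually_prioritized, algorithm_prioritized)
recall = recall_score(manually_prioritized, algorithm_prioritized)
f_measure = f1_score(manually_prioritized, algorithm_prioritized)

# the f-measure equals 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f-measure={f_measure:.2f}")
```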
kaur and sharma ( ) reviewed the literature on diagnosis of ten different psychological disorders and examined the different data mining and software approaches (ai) used in different publications. depending on the disorder and the software used, the accuracy ranged from to %. accuracy was defined differently depending on the study. only % of the articles exploring diagnosis of any health problem were found to be for psychological problems. this suggests that we need more studies on diagnosis and ai. a very informative synthesis and review are provided by low et al. ( ). they screened studies and reviewed those that met the inclusion criterion: studies from the last years using speech to identify the presence or severity of disorders through ml methods. they concluded that ml could be predictive, but confidence in any conclusions was dampened by the general lack of cross-validation procedures. the article contains very useful information on how best to collect and analyze speech samples. another innovative approach using ml focused on wearable motion detector sensors, in which these devices were worn for s during a -s mood induction task (seeing a fake snake). these data were able to distinguish children with an internalizing disorder from controls with % accuracy (mcginnis et al. ). this approach has potential for screening children for this disorder. a problem that seemingly has been ignored by most studies that deal with classification or diagnosis is the gold standard by which accuracy is judged. in most cases, the gold standard is human judgment, which is especially fallible when it comes to mental health diagnosis. we can clearly measure whether the ai approach is faster and less expensive than human judgment, but is the ultimate goal of ai accuracy simply to match human judgment, with all its flaws? i believe that the endpoint that must also be measured is client clinical mental health improvement. a system that provides faster and less expensive diagnosis but does not lead to more precise treatment and better clinical outcomes will save us time and money, which are important, but it will not be the breakthrough for which we are looking. a solution to the problems described above will involve the integration of causal discovery methods with ai approaches. ai methods are capable of improving our capacity to predict outcomes. to enhance predictability, we will need to identify the factors in the predictive models that are causal. thus, there is the need to identify techniques that provide us with causal knowledge, which currently is based primarily on rcts. but, for real-world and ethical reasons, human etiological experiments can rarely be conducted. fortunately, there are newer ai methods that can be used to infer causes, which include well-validated tests of conditional independencies based on the causal markov condition (pearl ; aliferis et al. ; saxe ). these methods have been successfully used outside of psychiatry (sachs et al. ; ramsey et al. ; statnikov et al. ) and have, in the last five years, been applied in research on mental health, largely by the team of glenn saxe at new york university and constantin aliferis and sisi ma at university of minnesota. this group has reported causal models of ptsd in hospitalized injured children (saxe et al. , ), children seen in outpatient trauma centers (saxe et al. ), maltreated children (morales et al. ), adults seen in emergency rooms (galatzer-levy et al. ), and police officers who were exposed to trauma (saxe et al.
in press ). saxe ( ) recently described the promise of these methods for psychiatric diagnosis and personalized precision medicine. new measures need to be developed that cover multiple domains of mental health, are reported by different respondents (e.g., child, parent, clinician), and are very brief. cohen ( ) provides an excellent overview of what he calls ambulatory biobehavioral technologies in a special section of psychological assessment. he notes that the development of mobile devices can have a major impact on psychological assessment. he cautions, however, that while some of these approaches have been used for decades, they still have not progressed beyond the proof of concept phase for clinical and commercial applications. ecological momentary assessment (ema) is a relatively new approach to measurement development. ema is the collection of real-time data collected in naturalistic environments. this approach uses a wide range of smart watches, bands, garments, and patches with embedded sensors (gharani et al. ; pistorius ) . for example, using smartphones, researchers have identified gait features for estimating blood alcohol content level (gharani et al. ). other researchers have been able to map changes in emotional state ranging from sad to happy by using a movement sensor on smart watches (quiroz et al. ) . others have described real-time fluctuations in suicidal ideation and its risk factors, using an average of . assessments per day (kleiman et al. ) . social anxiety has been assessed from global positioning data obtained from smart watches by noting that socially anxious students were found to avoid public places and to spend more time at home than in leisure activities outside the home (boukhechba et al. ) . a review of studies using ema concluded that the compliance rate was moderate but not optimal and could be affected by study design (wen et al. ). this review is also a good source of descriptions of different approaches to using ema. another good summary that focused on ema in the treatment of psychotic disorders can be found in bell et al. ( ) . for ema use in depression and anxiety, schueller et al. ( ) is a good source. ema has been used to measure cardiorespiratory function, movement patterns, sweat analysis, tissue oxygenation, sleep, and emotional state (peake et al. ) . harari et al. ( ) present a catalog of behavior in more than aspects of daily living that can be used in studying physical movement, social interactions, and daily activities. these include walking, speaking, text messaging, and so on. these all can be collected from smartphones and serve as an alternative to traditional survey approaches. however, it is still not clear what higher-level constructs are measured using these approaches. a comprehensive and in-depth review of studies that have used speech to assess psychiatric disorders is provided by low et al. ( ) . they conclude that speech processing technology could assist in mental health assessments but believe that there are many obstacles to this use, including the need for longitudinal studies. another interesting application for children is the use of inexpensive screening for internalizing disorders. mcginnis et al. ( ) monitored the child's motion for s using a commercially available and inexpensive wearable sensor. 
using a supervised ml approach, they obtained an % accuracy ( % sensitivity, % specificity) compared to a similar clinical threshold on parent-reported child symptoms that differentiates children with an internalizing diagnosis from controls without such a diagnosis. in a systematic review of ema use in major depression, colombo et al. ( ) evaluated studies that met their criteria for inclusion. these studies measured a wide variety of variables including self-reported symptoms, sleep patterns, social contacts, cortisol, heart rate, and affect. they point out many of the advantages of using emas such as real-time assessments, capturing the dynamic nature of change, improving generalizability, and providing information about context. they believe that the use of emas has resulted in novel insights about the nature of depression. they do note that there are few evaluations of these measures, and there is not much use in actual clinical practice. mohr et al. ( ) note that most of the research on ema has been carried out primarily by computer scientists and engineers using a very different research model than social and behavioral scientists. while computer scientists are mostly interested in an exploratory proof-of-concept approach (does it work at all?) using very small samples, social/behavioral scientists are more typically theory-driven and investigate under what conditions the intervention will work. mental health care, apart from medication, is almost exclusively verbal. several approaches have been tried to capture the content of treatment sessions. my colleagues and i have tried to do so by asking clinicians to use a brief checklist of topics discussed after each therapy session. although this technique produced some interesting findings, such as the identification of topics that the clinician did not discuss but that were believed to be important by the youth or parent, it is clearly filtered by what the clinician recalls and is willing to check off as having been discussed. while recordings provide a richer source of information, coding recordings manually is too expensive and slow for the real world of service delivery. the content of therapy sessions, including notes kept by clinicians, is pretty much ignored by researchers because of the difficulty and cost of manually coding those sources. however, advances in natural language processing (nlp) are now being explored as a way of capturing aspects of the content of therapy sessions. for example, tanana et al. ( ) have shown how two types of nlp techniques can be used to study and code the use of motivational interviewing in taped sessions. carcone et al. ( ) also showed that they could code motivational interviewing (mi) clinical encounter transcripts with sufficient accuracy. other researchers have used ai to analyze speech to distinguish between what they called high- and low-quality counselors (pérez-rosas et al. ). some colleagues and i have submitted a proposal to nimh to refine nlp tools that can be used to supervise clinicians implementing an evidence-based treatment using ai. as far as we know, using nlp to measure fidelity and provide feedback to clinicians has not been studied in a systematic way (a minimal sketch of this kind of utterance-level coding is given below). while ai appears to be an attractive approach to new ways of analyzing data, it should be noted that, as always, the quality of the analysis is highly dependent on the quality of the data.
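a minimal sketch of the kind of utterance-level coding referred to above might look like the following; the mi-style codes, the tiny training set, and the simple bag-of-words classifier are illustrative assumptions, not the models actually used by tanana et al. or carcone et al.

```python
# illustrative sketch only: a minimal utterance-level coder in the spirit of the nlp
# fidelity work described above. the codes and training utterances are invented; real
# systems use far richer models and corpora.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utterances = [
    "what would you like to be different about your drinking?",   # open question
    "how do you feel about the homework we discussed?",           # open question
    "it sounds like you are worried about losing your job.",      # reflection
    "you're feeling torn between wanting change and fearing it.", # reflection
    "let's schedule the next appointment for tuesday.",           # other
    "please fill out this form before you leave.",                # other
]
train_codes = ["open_question", "open_question", "reflection",
               "reflection", "other", "other"]

coder = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
coder.fit(train_utterances, train_codes)

print(coder.predict(["tell me more about what happened after the argument."]))
```

in a real system, predicted codes aggregated across a session could feed a fidelity score or a supervision feedback report rather than being inspected one utterance at a time.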
jacobucci and grimm ( ) caution us that "in psychology specifically, the impact of machine learning has not been commensurate with what one would expect given the complexity of algorithms and their ability to capture nonlinear and interactive effects" (p. ). one observation made by these authors is that the apparent lack of progress in using ai may be caused by "throwing the same set of poorly measured variables that have been analyzed previously into machine learning algorithms" (p. ). they note that this is more than the generic garbage in, garbage out problem; it is specifically related to measurement error, which can be measured relatively accurately. as described earlier, our privileging of rcts has contributed to a lack of focus on a precision approach to mental health services. this has resulted in the problem of ignoring the clinical need for prediction for an individual, in contrast to establishing group differences, the approach favored by the experimentalist/hypothesis-testing tradition. ai offers an approach to the discovery of important relationships in mental health, in addition to rcts, that is based on single-subject prediction accuracy and not null hypothesis testing (bzdok and karrer ). saxe et al. ( ) have demonstrated the use of the complex-systems-causal network method to detect causal relationships among variables and bivariate relations in a psychiatric study using algorithms. a comprehensive review and meta-analysis of machine learning algorithms that predict outcomes of depression showed excellent accuracy ( . ) using multiple forms of data (lee et al. ). it is interesting to note that none of the scholars commenting on the rct special issue in social science and medicine (deaton and cartwright ) specifically mentioned the use of ai as a potential solution to some of the problems of using average treatment effects (ates). kessler et al. ( a) noted that clinical trials do not tell us which treatments are more effective for which patients. they suggested that what they label as precision treatment rules (ptrs) be developed as predictors of the relative effectiveness of different treatments. the authors presented a comprehensive discussion on how to use ml to develop ptrs. they concluded that the sample sizes needed are much larger than those usually found in rcts; observational data, especially from electronic medical records (emrs), can be used to deal with the sample size issue; and statistical methods can be used to balance both observed and unobserved covariates using instrumental variables and discontinuity designs. they do note the difficulty in obtaining full baseline data from emrs and suggest several solutions for this problem, including supplemental data collection and links to other archival sources. they recommend the use of an ensemble ml approach that combines several algorithms. they are clear that their suggestions are exploratory and require verification, but they are more certain that if ml improves patient outcomes, it will be a substantial improvement. wu et al. ( ) collaborated with kessler on a proof of concept of a similar model called individualized treatment rules (itr). in a model simulation, they used a large sample (n = , ) with an ensemble ml method to identify the advantages of using ml algorithms to estimate the outcomes if a precision medicine approach was taken in prescribing medication for persons with first-onset schizophrenia. they found that the treatment success was estimated to be . % under itr compared to .
% with the medication that was actually used. wu et al. see this as a first step that needs to be confirmed by pragmatic rcts. kessler et al. ( b) conducted a relatively small randomized study (n = ) in which soldiers seeking treatment were judged to be at risk for suicide. they were randomly assigned to two types of treatment but not on the basis of any a priori ptr. the data from that study were then analyzed using ml to produce ptrs. these data were then modeled in a simulation to see if the ptr would have produced better outcomes. the authors did find that the simulated ptr produced better effects. lenze et al. ( ) address the problems of rcts from a somewhat different perspective than i have presented here and suggest a potential solution that they call precision clinical trials (pcts). the authors propose that the problem with most existing rcts is that they measure only the fixed baseline characteristics that are not usually sensitive to detecting treatment responders. moreover, treatment is typically not dynamically adapting to the client during treatment, and measures are not administered with sufficient frequency. instead, the pcts would: ( ) first attempt to determine whether short-term responses to the intervention could determine who was a likely candidate for that specific treatment; ( ) initiate the treatment in an adaptive fashion that could vary over time, using stepped care or just-in-time adaptations that are responsive to the client's changing status, and frequently collect data possibly using multiple assignment randomized trial methods; and ( ) use frequent precision measurement, possibly using ecological momentary assessments described earlier. coincidently, they illustrate the application of pcts using repetitive transcranial magnetic stimulation (rtms), a form of brain stimulation therapy used to treat depression and anxiety that has been in use since . rtms will be described later in connection with what i call a third path for services and ai. it is disappointing that i could not find any examples of published research that used a rct to test whether an ai approach to an actual, not simulated, delivery of a mental health treatment produces better clinical outcomes than a competitive treatment or even treatment as usual. this is clearly an area requiring further rigorous empirical investigation. imel et al. ( ) provide an excellent overview on how ai and other technologies can be used for monitoring and feedback in psychotherapy in both training and supervision. imel et al. ( ) used ml to code and provide data to clinicians on metrics used to measure the quality of motivational interviewing (mi). a prior study (tanana et al. ) established that ml was able to code mi quality metrics with accuracy similar to human coders. they conducted a pilot study using standardized patients and -min speech segments that was designed to test the feasibility of providing feedback to clinicians on the quality of their mi intervention. the feedback was not in real-time but was provided after the session. they were able to establish that clinicians thought highly of the feedback they received. the authors anticipate that further developments in this technology will lead to its widespread use in supervision and in real-time feedback. it would seem that the next step is evaluating the enhanced ai feedback procedure in a real-world effectiveness study. 
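to illustrate the logic of the precision treatment rules and individualized treatment rules discussed above, here is a deliberately simplified sketch on synthetic data; real applications rely on ensemble learners, much larger samples, and explicit handling of confounding, none of which is shown here, and all names in the code are my assumptions.

```python
# illustrative sketch only: a bare-bones individualized treatment rule. two outcome models
# (one per treatment arm) are fit to synthetic observational data; the rule recommends
# whichever treatment has the better predicted outcome for each new client.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                      # baseline client characteristics
treatment = rng.integers(0, 2, size=n)           # 0 = treatment a, 1 = treatment b
# synthetic outcomes: treatment b helps only when the first covariate is high
outcome = X[:, 0] * (treatment == 1) + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

model_a = GradientBoostingRegressor().fit(X[treatment == 0], outcome[treatment == 0])
model_b = GradientBoostingRegressor().fit(X[treatment == 1], outcome[treatment == 1])

new_clients = rng.normal(size=(5, 5))
predicted_a = model_a.predict(new_clients)
predicted_b = model_b.predict(new_clients)
recommended = np.where(predicted_b > predicted_a, "treatment b", "treatment a")
print(recommended)   # per-client recommendations rather than one average effect
```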
another example of an nlp application is a bot that was trained to assess and provide feedback on specific interviewing and counseling skills such as asking open-ended questions and providing feedback (tanana et al. ). after training the bot on transcripts, non-therapists (amazon mechanical turk recruits) were randomly assigned to receive either immediate feedback on a practice session with the bot or just encouragement on the use of those skills. the group provided the feedback was significantly more likely to use reflections even when feedback was removed. the authors consider this to be a proof of concept demonstration because of the many limitations (e.g., use of non-therapists). a plan for using nlp to monitor and provide feedback to clinicians on the implementation of an evidence-based program is provided by berkel et al. ( ). they provide excellent justification for using nlp to accomplish this goal, but unfortunately it is only a design at this point. rosenfeld et al. ( ) see ai making major contributions to improving the quality of treatment through efficient continuous monitoring of patients. until now, monitoring has been limited to in-session contacts or manual contacts, an approach that is not practical or efficient. the almost universal availability of smartphones and other internet-connected devices (the internet of things) makes collecting data from clients practical and efficient. these various data sources provide feedback to providers so that they can predict and prevent relapse and support compliance with treatment, especially medication. the authors note that there is not a large body of research in this area, but early studies are positive. one concrete application of ai to providing feedback is described by ryan and his colleagues. their article only describes how this could be done; unfortunately, it is not an actual study but a suggestion on how to apply ai to give feedback to physicians to improve their communications with patients. they note that routine assessment and feedback are not done manually because of the cost and time requirements. however, ai can automate these tasks by evaluating recordings. they suggest using already existing ai approaches that are in use by call centers to categorize and evaluate communication along the following dimensions: speaker ratio (which indicates listening), overlapping talk (interruptions), pauses longer than two seconds, speed, pitch, and tone. the content could also be evaluated along the dimensions of the use of plain language, clinical jargon, and shared decision making. ai could also explore other dimensions such as the meaning of words and phrases using nlp, turn taking, tone, and style. many technical difficulties would have to be overcome to assess many of these variables, but the field is making progress. an actual application of ml to feedback, but not in mental health, is provided by pardo et al. ( ) in a course for first-year engineering students. instructors developed in advance a set of feedback messages for levels of interaction with learning resources. for example, different feedback messages were provided depending on whether the student barely looked at a video, watched a major portion, watched the whole video, or watched it several times. an ml algorithm selected the appropriate message to send the student through either email or the virtual learning environment.
compared to earlier cohorts who did not receive the feedback, those who did were more satisfied with the course and had better performance on the midterm. i can see how such a protocol could be used in mental health services. an indication of the work that needs to be done in becoming more specific about feedback is a study conducted by hooke et al. ( ). they provided feedback to patients with and without a trajectory showing expected progress and found that patients preferred the feedback with the expected change over time. they found that these patients preferred to have normative feedback with which they could compare their own idiographic progress. two systematic reviews that focused on implementing routine outcome measurement (rom) concluded that while rom has been shown to produce positive results, how best to implement rom remains to be determined by future research (gual-montolio et al. ; mackrill and sorensen ). the authors of both reviews note several interesting points but focus on these two: how to integrate measurement into clinical practice and how organizations support staff in this effort. they highlight the importance of developing a culture of feedback in organizations. neither review includes any studies using ai. while they call for more research to move this field forward, i do not think there will be much change until either measurement feedback systems are required by funders or service delivery organizations are paid for providing such systems. probably the most advanced work in this area that includes ml is being done by lutz and his colleagues (lutz et al. ). they have developed a measurement feedback system that includes the use of ml to make predictions and to provide clinicians with clinical decision support tools. they are able to predict dropouts and assign support tools to clinicians that are specific to the problems their clients are exhibiting, based on the data they have collected. lutz and his colleagues are currently evaluating the system's ability to influence clinical outcomes in a prospective study. this comprehensive feedback system provided the treatment group, but not the control group, with clinical support tools whose recommendations were based on the identification of similar patients. they already have some very promising results using three different treatment strategies (w. lutz, personal communication, september , ). almost all the research in this area has been on prediction and not on actually testing whether precision treatments are in fact better than standard treatments in improving mental health outcomes. even these predictive studies are on extant databases rather than data collected specifically for use in ai algorithms. with a few exceptions to be discussed later, this is the state of the art. to establish the practical usefulness of ai, we need to move beyond prediction to show actual mental health improvements that have clinical and not just statistical significance. there are some scholars who are carefully considering how to improve methodology to achieve better predictions (e.g., garb and wood ). in addition, zilcha-mano ( ) has a very thoughtful paper that describes traditional statistical and machine learning approaches to trying to answer the core question of what treatments work best for which patients, as well as the more general question about why psychotherapy works at all. nlp has been used to analyze unstructured or textual material for identifying suicidal ideation in a psychiatric research database.
precision of % for identification of suicide ideation and % for suicide attempts has been found using nlp (fernandes et al. ) . a meta-analysis of studies of prediction of suicide using traditional methodologies found only slightly better than chance predictions and no improvement in accuracy in years (franklin et al. b ). recent ml decision support aids using large-scale biological and other data have been useful in predicting responses to different drugs for depression (dwyer et al. ). triantafyllidis and tsanas ( ) conducted a literature review of pragmatic evaluations of nonpharmacological applications of ml in real-life health interventions from january through november , following prisma guidelines. they found only eight articles that met their criteria from citations screened. three dealt with depression and the remainder with other health conditions. six of the eight produced significantly positive results, but only three were rcts. there has been little rigorous research to support ai in real-world contexts. accuracy of prediction is one of the putative advantages of ai. but the advantage of predicting outcomes is not as relevant if a client prematurely leaves treatment. thus, predicting premature termination is one of the key goals of an ai approach. in a pilot study to test whether ai could be beneficial in predicting premature termination, bohus et al. ( ) were not able to adequately predict dropouts using different ml approaches with responses to the borderline symptom list (bsl- ). however, they obtained some success when they combined the questionnaire data with personal diary questionnaires from patients, although they note that the sample is too small to draw any strong conclusions. this pilot study illustrates the importance of what data goes into the data set as well as our lack of knowledge of the data requirements we need to have confidence in as we select the appropriate data. duwe and kim ( ) compared statistical methods including ml approaches on their accuracy in predicting recidivism among , offenders. they found the newer ml algorithms generally performing modestly better. kessler et al. ( ) used data from u.s. army and department of defense administrative data systems to predict suicides of soldiers who were hospitalized for a psychiatric disorder (n = , ). within one year of hospitalization, ( . %) of the soldiers committed suicide. they used a statistical prediction rule based on ml that resulted in a high validity auc value of . . kessler and his colleagues have continued this important work, which was discussed earlier. another approach to prediction was taken by pearson et al. ( ) in predicting depression symptoms after an -week internet depression reduction program using participants. they used an elastic net and random forest ml ensemble (combination) and compared it to a simple linear autoregressive model. they found that the ensemble method predicted an additional % of the variance over the non-ml approach. the authors offer several good technical suggestions about how to avoid some common errors in using ml. moreover, the ml approach allowed them to identify specific module dosages that were related to outcomes that would be more difficult to determine using standard statistical approaches (e.g., detecting nonlinear relationships without having to specify them in advance). however, not all attempts to use ai are successful. pelham et al. 
( ) compared logistic regression and five different ml approaches to typical sum-score approaches to identify boys in the fifth grade who would be repeatedly arrested. ml performed no better than simple logistic regression when appropriate cross-validation procedures were applied. the authors emphasize the importance of cross-validation in testing ml approaches. in contrast, a predictive study of people with first-episode psychosis used ai to successfully predict poor remission and recovery one year later based only on baseline data (leighton et al. ). the model was cross-validated on two independent samples. a comprehensive synthesis of the literature of studies that used ml or big data to address a mental health problem illustrated the wide variety of uses that currently exist; however, most dealt with detection and diagnosis (shatte et al. ). a critical view of the way psychiatry is practiced for the treatment of depression and how ai can improve that practice is provided by tan et al. ( ). they note that most depression is treated with an "educated-guess-and-check approach in which clinicians prescribe one of the numerous approved therapies for depression in a stepwise manner" (p. ). they posit that ai and especially deep learning have the ability to model the heterogeneity of outcomes and complexity of psychiatric disorders through the use of large data sets. at this point, the authors have not provided any completed studies that have used ai, but two of the authors are shareholders in a medical technology company that is developing applications using deep learning in psychiatry. we are beginning to see commercial startups take an interest in mental health services even though the general health market is considerably bigger. entrepreneurially motivated research may be important for the future of ai growth in mental health services, along with traditional federal research grants to support this important developmental work, including such mechanisms as the small business innovation research (sbir) program and the r and r nih funding mechanisms.
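the pearson et al. ensemble and the pelham et al. cross-validation caution described above can be illustrated with a small sketch: an elastic-net plus random-forest ensemble and a simple linear baseline are compared on identical cross-validation folds of synthetic data. everything in the sketch is invented for illustration and is not either study's actual analysis.

```python
# illustrative sketch only: comparing an elastic-net + random-forest ensemble against a
# simpler baseline under the same cross-validation split. the synthetic data stand in
# for real predictors and outcomes.
import numpy as np
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                                       # e.g., module usage, baseline symptoms
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=300)   # nonlinear toy outcome

ensemble = VotingRegressor([
    ("enet", ElasticNetCV(cv=5)),
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
])
baseline = LinearRegression()

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # identical folds for both models
r2_ensemble = cross_val_score(ensemble, X, y, cv=cv, scoring="r2").mean()
r2_baseline = cross_val_score(baseline, X, y, cv=cv, scoring="r2").mean()
print(f"cross-validated r2: ensemble={r2_ensemble:.2f} baseline={r2_baseline:.2f}")
```

whether the ensemble actually beats the baseline is an empirical question that only honest, identical cross-validation can answer, which is exactly the point of the pelham et al. caution.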
i want to distinguish between the use of computer-assisted therapy, especially that provided through mobile apps, and the use of ai. in a review of these digital approaches to providing cbt for depression and anxiety, wright et al. ( ) point out that while many of these apps have been shown to be better than no treatment, they usually do not use ai to personalize them. thus, they are less relevant to this paper and are not discussed in depth. ecological momentary interventions (emis) are treatments provided to patients between sessions during their everyday lives (i.e., in real time) and in natural settings. these interventions extend some aspects of psychotherapy to patients' daily lives to encourage activities and skill building in diverse conditions. in the only systematic review available of emis, colombo et al. ( ) found only eight studies that used emis to treat major depression, with only four different interventions. the common factor of these four interventions is that they provide treatment in real time and are not dependent on planned sessions with a clinician. the authors report that participants were generally satisfied with the interventions, but there was variability in compliance and dropout rates among the programs. with only two studies that tested for effectiveness with rcts, there is clearly a need for more rigorous evaluations. momentary reminders are typically used for behaviors such as medication adherence and management of symptoms. the more complex emis use algorithms to optimize and personalize systems. they also can use algorithms that change the likelihood of the presentation of a particular intervention over time, based on past proximal outcomes. schueller et al. ( ) note that emis are becoming more popular as a result of technological advances. these authors suggest the use of micro-randomized trials (mrts) to evaluate them. an mrt uses a sequential factorial design that randomly assigns an intervention component to each person at multiple randomly chosen times. each person is thus randomized many times. this complex design represents the dynamic nature of these interventions and how their outcomes correspond to different contextual features. ai is often used to develop algorithms to optimize and personalize the mrt over time. one interesting algorithm, called a "bandit algorithm," changes the intervention presented based on a past proximal outcome. as an example, schueller et al. describe a hypothetical study to reduce anxiety through two different techniques: deep breathing and progressive muscle relaxation. the bandit algorithm may start the presentation of each technique with equal frequency but then shift more to the one that appears to be most successful for that individual, as sketched below. thus, each treatment (a combination of deep breathing and progressive muscle relaxation) would be different for each person. unlike rcts, this method does not use group-level outcomes of average effect sizes but uses individual-level data. in the future, we might have personal digital mental health "therapists" or assistants that can deliver individualized combinations of treatments based on data-driven algorithms developed with ai. of course, this approach is best suited for these momentary interventions and would be difficult if not impossible to successfully apply to traditional treatment. i consider explicating the relationship between ai and causality to be a key factor in understanding whether ai is to be seen as replacing or as supplementing rcts.
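before turning to causality, here is the minimal bandit sketch promised above: a simple epsilon-greedy rule with simulated proximal outcomes for the two hypothetical techniques. actual emi systems would use more principled bandit algorithms and real momentary data, and the "true benefit" values below exist only to drive the simulation.

```python
# illustrative sketch only: an epsilon-greedy bandit choosing between two hypothetical emi
# techniques, shifting toward whichever has produced better simulated proximal outcomes
# for this one synthetic person.
import random

random.seed(0)
techniques = ["deep_breathing", "muscle_relaxation"]
counts = {t: 0 for t in techniques}
mean_outcome = {t: 0.0 for t in techniques}
true_benefit = {"deep_breathing": 0.4, "muscle_relaxation": 0.7}  # unknown to the algorithm
epsilon = 0.1   # fraction of prompts that still explore the other technique

for prompt in range(200):
    if random.random() < epsilon or prompt < len(techniques):
        choice = random.choice(techniques)                       # explore
    else:
        choice = max(techniques, key=lambda t: mean_outcome[t])  # exploit current best
    # simulated proximal outcome: 1 = momentary anxiety went down after the exercise
    outcome = 1 if random.random() < true_benefit[choice] else 0
    counts[choice] += 1
    mean_outcome[choice] += (outcome - mean_outcome[choice]) / counts[choice]

print(counts)         # prompts drift toward the more helpful technique for this person
print(mean_outcome)
```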
toward that end, i first consider whether observational data can replace rcts using ai. second, should a replacement not seem currently feasible, i explore ways to design studies that combine ai and rcts to evaluate whether the ai approach produces better outcomes than non-ai-enhanced interventions. the journal prevention science devoted a special section of an issue to new approaches for making causal inferences from observational data (wiedermann et al. ). an example is the paper by shimizu ( ) that demonstrates the use of non-gaussian analysis tools to infer causation from observational data under certain assumptions. malinsky and danks ( ) provide an extended discussion of the use of causal discovery algorithms to learn causal structure from observational data. in a similar fashion, blöbaum et al. ( ) present a case for inferring causal direction between two variables by comparing the least-squares errors of prediction in both possible directions. using data that meet some assumptions, they provide an algorithm that requires only a regression in both causal directions and a comparison of the least-squares errors. lechner's ( ) paper focuses on identifying the heterogeneity of treatment effects at the finest possible level, or identifying what he calls groups of winners and losers who receive some treatment. hassani et al. ( ) hope to build a connection between researchers who use big data analysis and data mining techniques and those who are interested in causality analysis. they provide a guide that describes data mining applications in causality analysis. these include entity extraction, cluster analysis, association rules, and classification techniques. the authors also provide references to studies that use these techniques, key software, substantive areas in which they have been used, and the purpose of the applications. this is another bit of evidence that the issue of causality is being taken seriously and that some progress is being made. however, because of the newness of these publications, there is a lag in publications that are critical of these approaches; for example, d'amour ( ) provides a technical discussion about why some approaches will not work but also suggests that others may be potentially effective. clearly, caution is still warranted in drawing causal conclusions from observational data. chen ( ) provides a very interesting discussion of ai and causality, not from the perspective of the rct issue that i raise here but from a much broader and still relevant point of view. he advances the key question about whether ai technology should be adopted in the medical field. chen argues that there are two major deficits in ai, namely the causality deficit and the care deficit. the causality deficit refers to the inferior ability of ai to make accurate causal inferences, such as diagnosis, compared to humans. the care deficit is the comparative lack of ability of ai to care for a patient. both deficits are interesting, but the one most germane to this paper is the causality deficit. chen notes that ai systems are statistical and not causal reasoning machines. he argues that ai is deficient compared to humans in causal reasoning, and, moreover, he doubts that there is a feasible way to deal with this lack of comparability in reasoning. he believes that ai is a model-blind approach in contrast to a human's more model-based approach to causal reasoning.
thus, causation for chen is not an issue of experimental methodology (he never mentions rcts in his paper), but a characteristic associated with humans and not computers. chen does recognize that ai researchers are attempting to deal with the causality issue, for example, by briefly describing pearl's ( ) directed acyclic graphs and nonparametric structural equation models. but chen is skeptical that either the causality or care deficits will be overcome. he concludes that ai is best thought of as assisting humans in medical care and not replacing them. the relationship between ai and humans is a major concern of this paper. caliebe et al. ( ) see big data, and i would assume ai, as contributing to hypothesis generation that could then be tested in rcts. the critical issues they see are related to the quality and quantity of big data. they quote an institute of medicine (iom) report that refers to the use of big data and ai in medicine as "learning healthcare systems" and states that these systems will "transform the way evidence on clinical effectiveness is generated and used to improve health and health care" (institute of medicine , p. ). moreover, in , the iom suggested that alternative research methodologies will be needed. they do not acknowledge the conundrum that i have raised here; moreover, they do not see any need to consider changing any of our methodology or analyses. i have found many individual papers that describe how to solve the causality problem with ai (e.g., kuang et al. ; pearl ). although these papers are complex, their mere existence gives me hope that this problem is being seriously considered. in addition to the statistical and validity issues in trying to replace rcts with observational data, there is the feasibility question. although the data studied in much of the research reported in this paper are in the medical domain and deal primarily with medications, the characteristics of these data have some important lessons for mental health services. bartlett et al. ( ) identified trials published in the top seven highest-impact medical journals. they then determined whether the intervention, medical condition, inclusion and exclusion criteria, and primary end points could be routinely obtained from insurance claims and/or electronic health record (ehr) data. these data are recognized by the fda as what they term real-world evidence. they found that only % of the u.s.-based clinical trials published in high-impact journals in could be feasibly replicated through analysis of administrative claims or ehr data. the results suggest that the potential for real-world evidence to replace clinical trials is very limited. at best, we can hope that such data can complement trials. given the paucity of data collected in mental health settings, the odds are that such data are even less available. suggestions for improving the utility of real-world data for use in research are provided in an earlier article by some of these authors (dhruva et al. ). pearl ( ) posits causal information in terms of the types of questions that, in his three-level model, each level answers. his first level is association; the second, intervention; and the third, counterfactual. association is simply the statistical relationship or correlation. there is no causal information at this first level. the higher-order levels can answer questions about the lower levels but not the other way around. counterfactuals are the control groups in rcts.
they represent what would have happened if there had been no intervention. to pearl, this unidirectional hierarchy explains why ml, based on associations, cannot provide causal statements like rcts, which are based on counterfactuals. however, as noted earlier, pearl does present an approach using what he calls structural causal models to "extract" causal relationships from associations. pearl describes seven "talks" and accompanying tools that are accomplished in the framework provided by the structural causal models that are necessary to move from the lower levels to the counterfactual level to allow causal inferences. i would anticipate that there will be direct comparisons between this approach to causality and the randomized experiments like those done in program evaluation (bickman and reich ; boruch et al. ) . theory development or testing is usually not thought of as a strength of ai; instead, its lack of transparency, that is, the lack of explanatory power that would enable us to identify models/mechanisms that underlie outcomes, is seen as a major weakness. coutanche and hallion ( ) present a case for using feature ablation to test theories. this technique involves the removal or ablation of features from algorithms that have been thought to be theoretically meaningful and then seeing if there is a significant reduction in the predictive accuracy of the model. they have also studied whether the use of a different data set affects the predictive accuracy of a previously tested model in theoretically useful ways. they present a very useful hypothetical application of their approach to test theories using ai. it is clear that ai can be very useful in making predictions, but can it replace rcts? can ai perform the major function of rcts, that of determining causality? the dependence on rcts was one of the major limitations i saw as hindering the progress of mental health services research. while rcts have their flaws, they are still considered by most as the best method for determining causal relationships. is ai limited to being a precursor in identifying those variables that are good candidates for rcts because they have high predictive values? the core conceptual problem is that while it is possible to compare two different but theoretically equivalent groups, one receiving the experimental treatment and the other the control condition, it is not possible to compare the same individuals on both receiving and not receiving the experimental treatment. rcts produce average effect sizes, but the ultimate purpose of precision mental health is to predict individualized effects. how do we reconcile these two very different aims? one approach is to use ai to identify the most predictive variables and then test them in a randomized experiment. let us take a group of patients with the same disorder or problem. there may be several alternative treatments, but the most basic concept is to compare two conditions. in one condition, call it the traditional treatment condition in the rct, everyone in that condition gets the same treatment. it is not individualized. in the second condition, call it the ai condition, everyone gets a treatment that is based on prior ai research. the latter may differ among individuals in dosage, timing, type of treatment, and so on. the simplest is medication that differs in dosage. however, a more nuanced design is a yoked design used primarily in operant and classical conditioning research. 
there have been limitations associated with this design, but these problems apply to conditioning research and not the application considered here (church ). to separate the effects of the individualization from the differences in treatment, i suggest using a yoked design. in this design, individuals who would be eligible to be treated with either the standard treatment or the ai-selected treatment would be yoked, that is, paired. which participant of the pair received which condition would be randomized (a minimal sketch of this assignment procedure is given below). first, the eligible participants would be randomly divided into two groups. the individuals in the ai group would get a treatment that was precisely designed for each person in that group, while those in the yoked control group would not; instead, those in the control group would receive the treatment that had been designed for his or her partner in the ai group. in this way, both members of each pair would receive the same treatment, but only the ai group participants would be receiving a treatment individualized for them. if the ai approach is superior, we would expect those in the ai group to have a superior average treatment effect compared to the control group, who received a treatment matched not to their individual characteristics but to those of their partners in the ai group. we could also use an additional control group where the treatment is selected by a clinician. while this design would not easily identify which characteristics were responsible for its success, it would demonstrate whether individualized ai-based treatment was the causal factor. that is, we could learn that on the average, a precision approach is more effective than a traditional approach, but we would not be able to identify from this rct which particular combination of characteristics made it more effective. of note is that the statistical power of this design would depend on the differences among the participants at baseline. for example, if the individuals were identical on measured covariates, then they would get the same personalized treatment, which practically would produce no useful information. instead of yoking participants based on randomly assigning them as in the above example, we could yoke them on dissimilarity and then randomly assign each individual in the pair to ai-based treatment or a control condition that could be the same ai treatment or a clinician-assigned treatment. however interesting this would be from a methodological point of view, i think it would also bring up ethical issues, which are discussed next. of course, as with any rct, there are ethical issues to consider. in many rcts, the control group may receive standard treatment, which should not present any unusual ethical issues. however, in a yoked design, the control group participants will receive a treatment that was not selected for them on the basis of their characteristics. moreover, the yoked design would make the formulation of the informed consent document problematic because it would have to indicate that participants in the control group would receive a treatment designed for someone else. one principle that should be kept in mind is equipoise: there should be consensus among clinicians and researchers that the treatments, a priori, are equivalent. in a yoked design, we must be assured that none of the individualized treatments would harm the yoked control group members, and moreover, that there is no uniform agreement that the individualized treatment would be better for the recipient.
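a minimal sketch of the yoked assignment procedure described above follows; the ai_recommend function is a hypothetical placeholder for a model built in prior ai research, not an existing algorithm, and the participant data are synthetic.

```python
# illustrative sketch only: assigning participants under the yoked design described above.
# ai_recommend() is a hypothetical stand-in for an individualized treatment model.
import random

random.seed(0)

def ai_recommend(characteristics):
    # placeholder individualized treatment: pick a dosage tier from a baseline score
    return "high_dose_cbt" if characteristics["baseline_severity"] > 0.5 else "low_dose_cbt"

participants = [{"id": i, "baseline_severity": random.random()} for i in range(20)]
random.shuffle(participants)

assignments = []
# yoke participants into pairs, then randomize which member of each pair is in the ai arm
for a, b in zip(participants[0::2], participants[1::2]):
    ai_member, control_member = random.sample([a, b], k=2)
    treatment = ai_recommend(ai_member)           # individualized for the ai member only
    assignments.append({"ai_id": ai_member["id"], "control_id": control_member["id"],
                        "treatment_for_both": treatment})

for pair in assignments[:3]:
    print(pair)
```

as emphasized above, such an assignment is only defensible under equipoise, where there is genuine uncertainty about which condition would serve a participant better.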
that is, the research is designed to answer a question about relative effectiveness for which we do not know the answer. almost all of the research previously cited in this paper has dealt with psychosocial interventions, along with some research on interventions with medications. clearly these are the two main approaches taken in providing services for mental health problems. however, in the last decade, a new approach to understanding mental illness has emerged from the field of psychoneuroimmunology. this relatively new field integrates research on psychology, neuroscience, and immunology to understand how these processes influence each other and, in turn, human health and behavior (slavich ). i want to explore this relatively new approach to understanding mental health because i believe that it is a potentially rich field in which to apply ai. slavich and irwin ( ) have combined diverse areas to show how stressors affect neural, physiologic, molecular, and genomic and epigenetic processes that mediate depression. they labeled this integrative theory the social signal transduction theory of depression. in a recent extension of this work, slavich ( ) proposed social safety theory, which describes how social-environmental stressors that degrade experiences of social safety-such as social isolation and rejection-affect neural, immunologic, and genomic processes that increase inflammation and damage health. a key aspect of this perspective is the role of inflammatory cytokines as key mediators of the inflammatory response (slavich ) . cytokines are the biological endpoint of immune system activity and are typically measured in biobehavioral studies of stress and health. cytokines promote the production of c-reactive protein, which is an inflammatory mediator like cytokines, but which also is a biomarker of inflammation that is assessed with a blood test. cytokines also interact with the central nervous system and produce what have been labeled "sickness behaviors," which include increased pain and threat sensitivity, anhedonia, fatigue, and social-behavioral withdrawal. while the relationship between inflammation and depression is well-established in adults, a systematic review and meta-analysis of studies with children and adolescents concluded that because of the small number of studies, more evidence was needed before drawing a similar conclusion for youth (d'acunto et al. ) . in contrast, a major longitudinal study of more than adults followed over years found that participants who had stable high c-reactive protein levels were more likely to report clinically significant late-life depression symptoms (sonsin-diaz et al. ) . chronic inflammation has been shown to be present in many psychiatric disorders including depression, schizophrenia, and ptsd, as well as in many other somatic and physical disease conditions (furman et al. ) . chronic inflammatory diseases have been shown to be a major cause of death. a typical inflammatory response occurs when a threat is present and then goes away when there is no longer a threat. however, when the threat is chronic and unresolved, systemic chronic inflammation can occur and is distinct from acute inflammation. chronic inflammation can cause significant damage to tissues and organs and break down the immune system tolerance. what is especially interesting from a behavioral health perspective is that inflammatory activity can apparently be initiated by any psychological stressor, real or imagined. 
thus, social and psychological stressors such as negative interpersonal relationships with friends and family, as well as physical stressors, can produce inflammation, which leads to increased risk of mental and physical health problems. this inflammatory response initially can have positive effects in that it can help increase survival in the short term, but it can also lead to a dysfunctional hypervigilance and anxiety that increases the risk of serious mental illness if chronic. the "cytokine storm" experienced by many covid- patients is an example of the damage an uncontrolled immune response can cause (konig et al. ). although we do not know a great deal about how this process operates, it is clear that there is a strong linkage between inflammatory responses and mental disorders such as depression. the role of the immune system in disease, especially brain inflammation related to brain microglial cells (i.e., neuroinflammation), is also receiving attention in the popular press (nakazawa ). psychoneuroimmunology research has explicated the linkage between the brain and the immune system, showing how stress affects the immune system, and how these interactions relate to mental illness. the relationships between these constructs suggest interventions that can be used to improve mental health. but much research remains to be done to identify specific processes and effective interventions. research will require multidisciplinary teams to produce personalized interventions guided by each patient's specific level of neuroinflammation and genetic profiles. this process will need to be monitored by continuous feedback that i believe will be made more feasible with the application of ai. at present, there are some existing interventions that appear to be aligned with this approach that are being explored. these include the following. three anti-inflammatory medications have been found to reduce depressive symptoms in well-designed rcts. these agents include celecoxib, usually used for treating excessive inflammation and pain, and etanercept and infliximab, which are used to treat rheumatoid arthritis, psoriasis, and other inflammatory conditions (slavich ) . however, there has not been a great deal of research in this area, so caution is warranted. a recent well-designed rct with depressed youth tested aspirin, rosuvastatin (a statin), and a placebo and found no significant differences in depression symptoms (berk et al. ). a meta-analysis explored the possible link between different types of psychosocial interventions, such as behavior therapy and cbt, and immune system function (shields et al. ) . the authors examined eight common psychosocial interventions, seven immune outcomes, and nine moderating factors in evaluating rcts. they found that psychosocial interventions were associated with a . % improvement in good immune system function and a . % decrease in detrimental immune function, on average. moreover, the effects lasted for at least months and were consistent across age, sex, and intervention duration. the authors concluded that psychosocial interventions are a feasible approach for influencing the immune system. repetitive transcranial magnetic stimulation (rtms) has been found to be an effective treatment for several mental illnesses, especially treatment-resistant depression (mutz et al. ; somani and kar ; voigt et al. ) . while the literature is not clear on how rtms produces its effect (noda et al. ; peng et al. ) , i was curious about its relationship to neuroinflammation. 
i could find little in the research literature that addressed the relationship between inflammation and rtms; therefore, i conducted an informal survey of rtms researchers who have published rtms research in peer-reviewed journals and asked them the following: "i suspect that rtms is related to inflammation, but the only published research that i could find on that relationship was two studies dealing with rats. are you aware of any other research on this relationship? in addition, do you know of anyone using ai to investigate rtms?" i received replies from all but of the researchers. about half said they were aware of some research that linked rtms to inflammation and supplied citations. in contrast, only % were aware of any research on rtms and ai. the latter noted some research that used ai on eegs to predict rtms outcomes. a most informative response was from the author of a review article that dealt with the effects of several different nontraditional treatments, including rtms, on the hypothalamic-pituitary-adrenal (hpa) axis and on immune function in the form of cytokine production in depression (perrin and pariante ). the authors found relevant human studies ( studies using rtms) but were unable to conduct the meta-analysis because of significant methodological variability among studies. but they concluded that non-convulsive neurostimulation has the potential to impact abnormal endocrine and immune signaling in depression. moreover, given that more information is available on rtms than on other neurostimulation techniques, the research suggests that rtms appears to reduce cytokines. finally, there is some support from animal models (rats) that rtms can have an anti-inflammatory effect on the brain and reduce depression and anxiety (tiana et al. ). moreover, four published studies showed that the efficacy of rtms for patients with schizophrenia could be predicted (koutsouleris et al. b). three other studies were able to use ml and eeg to predict outcomes of rtms treatment for depression (bailey et al. ; hasanzadeh et al. ). the existing literature indicates that metabolic activity and regional cerebral blood flow at baseline can predict the response to rtms in depression (kar ). as these baseline parameters are linked to inflammation, it is worth studying whether markers of inflammation also predict the response to rtms. as noted by one of the respondents, "in summary, it is a relatively new field and there are no major multi-site machine learning studies in rtms response prediction" (n. koutsouleris, personal communication, march , ). one of the significant limitations of measurement in mental health is the absence of robust biomarkers of inflammation. furman et al. ( ) caution us that "despite evidence linking sci [systemic chronic inflammation] with disease risk and mortality, there are presently no standard biomarkers for indicating the presence of health-damaging chronic inflammation" (p. ). however, some biomarkers that are currently being explored for inflammation may be of some help. for example, furman et al. ( ) are hopeful that a new approach using large numbers of inflammatory markers to identify predictors will produce useful information. a narrative review of inflammatory biomarkers for mood disorders was also cautious in drawing any conclusions from extant research because of "substantial complexities" (chang and chen ). it is also worth noting the emerging area of research on gut-brain communication and the relationship between microbiome bacteria and quality of life and mental health (valles-colomer et al. ).
however, there is a need for more research on the use of biomarkers. the area of inflammation and mental health offers not only an additional pathway to uncovering the causes of mental illness but also, most importantly for this paper, potential service interventions beyond traditional medications and psychosocial interventions. given the complexity, large number of variables from diverse data sets, and the emerging nature of this area, it appears that ai could be of great benefit in tying some potential biomarkers to effective interventions designed to produce better clinical outcomes. however, some caution is needed concerning the seemingly "hard data" provided by biomarkers. for example, elliot et al. ( ) found in a meta-analysis of experiments that one widely used biomarker, task-fmri, had poor overall reliability and poor test-retest reliability in two other large studies. they concluded that these measures were not suitable for brain biomarker research or research on individual differences. as noted in several places in this paper, ai is not without its problems and limitations. the next section of the paper discusses several of these problems. ai may force the treatment developer to make explicit choices that are ethically ambiguous. for example, automobile manufacturers designing fully autonomous driving capabilities now have to be explicit about whose lives to value more in avoiding a collision: the driver and his or her passengers, or a pedestrian. should the car be programmed to avoid hitting a pedestrian, regardless of the circumstances, even if it results in the death of the driver? mental health services do not typically have such clear-cut conflicts, but the need to weigh the potential side effects of a drug against potential benefits suggests that ethical issues will confront uses of ai in mental health. some research has shown that inherent bias in original data sets has produced biased (racist) decisions (obermeyer et al. ; veale and binns ). an unresolved question is who has the responsibility for determining the accuracy and quality of the original data set (packin and lev-aretz ). data scientists operating with data provided by others may not have sufficient understanding of the complexity of the data to be sensitive to its limitations. moreover, they may not consider it their responsibility to evaluate the accuracy of the data and attend to its limitations. librenza-garcia ( ) provides a comprehensive review of ethical issues in the use of large data sets with ai. the ethical issues in predicting major mental illness are discussed by lawrie et al. ( ). they note that predictive algorithms are not sufficiently accurate at present, but they are progressing. the authors raise questions about whether people want to know their risk level for major psychiatric disorders, about individual and societal attitudes to such knowledge and the possible adverse effects of sharing such data, and about the possible impact of such information on early diagnosis and treatment. they urge conducting research in this area. related to the ethics issue but with more direct consequences for the health provider is the issue of legal responsibility in using an ai application. it is not clear what the legal liability is for interventions based on ai that go wrong. who is responsible for such outcomes: the person applying the ai, the developer of the algorithm, or both?
price ( ) points out that providers typically do not have to be concerned about the legal liability of a negative outcome if they used standard care. thus, if there are negative outcomes of some treatment but that treatment was the standard of care, there is usually no legal liability. however, currently ai is probably not seen as the standard of care in most situations. while this will hopefully change as evidence of the effectiveness of ai applications develops, currently the healthcare provider is at greater risk of legal liability in using an ai application that is different from the standard of care. i have previously discussed the insufficient evidence for the effectiveness of many of the interventions used in mental health services. this lack of strong evidence has implications for the use of ai in mental health services. in an insightful article on using ai for individual-level treatment predictions, paulus and thompson ( ) make several key observations and suggestions that are very relevant to the current paper. the authors summarize several meta-analyses of the weak evidence of effectiveness of mental health interventions and come to conclusions similar to those i have already stated. they also identify similar factors i have focused on in accounting for the modest effect sizes found in mental health rcts. they point out that diagnostic categories are not useful if they are not aggregating homogeneous populations. they suggest that what i call the diagnostic muddle may result from the nature of mental disorders themselves, for which there are many causes at many different levels, from the genetic to the environmental. thus, there is no simple explanatory model. paulus and thompson note that prediction studies rarely account for more than a very small percentage of the variance. they recommend conducting large, multisite pragmatic rcts that are clearly pre-defined with specific ml models and variables. predictive models generated by this research then need to be validated with independent samples. this is a demanding agenda, but i think it is necessary if we are going to advance mental health services with the help of ai. treatments are often considered black boxes that provide no understanding of how and why the treatment works (kelley et al. ; bickman b). the problem of lack of transparency is compounded in the use of deep neural networks (samek et al. ). at present we are not able to understand relationships between inputs and outcomes, because this ai technique does not adequately describe the underlying process. deep neural networks may contain many hidden layers and millions of parameters (de choudhury and kiciman ). however, this problem is now being widely discussed, and new technologies are being developed to make ai more transparent (rauber et al. ; kuang et al. ). i do not believe it is possible to develop good theories of treatment effectiveness without this transparency. this is an important limitation of efforts to improve mental health services. but how important is this limitation? early in my program evaluation career, i wrote about the importance of program theory (bickman , ). i argued that if individual studies were going to be conceptually useful, beyond local decisions such as program termination, then they must contribute to the broader goal of explaining why certain programs were effective and others not. this is in contrast to the worth and merit of a local program.
a theory-based evaluation of the program must add to our understanding of the theory underlying the program. while i still believe that generalizing to a broad theory of why certain interventions work is critical, at present it may be sufficient simply to increase the accuracy of our predictions, regardless of whether we understand why. as stephens-davidowitz ( ) argues, "in the prediction business, you just need to know that something works, not why" (p. ). however, turing award winner judea pearl argued in his paper "theoretical impediments to machine learning with seven sparks from the causal revolution" ( ) that human-level ai cannot emerge from model-blind learning machines that ignore causal relationships. one of the positive outcomes of the concern over transparency is the development of a subfield of ai that has been called explainable artificial intelligence (xai). adadi and berrada ( ) present a very readable description of this movement and show that it has been a growing area since . they are optimistic that research in this area will go a long way toward solving the black box problem. large data sets are required for some ai techniques, especially deep neural networks. while such data sets may be common in consumer behavior, social media, and hospital-based electronic health records, they are not common in community-based mental health services. the development and ownership of these data sets may be more important (and profitable) than ownership of specific ai applications. there is currently much turmoil over data ownership (mittelstadt ). ownership issues are especially important in the mental health field given the sensitivity of the data. in addition to the size and quality of the data set, longitudinal data are necessary for prediction. collecting longitudinal data poses a particular problem for community-based services given the large treatment drop-out rate. in addition to the characteristics of the data, there is the need for competent data managers of large complex data sets. the data requirements for mental health applications are more demanding than those for health in general. first, mental health studies usually do not involve the large samples that are found in general health. for example, the well-known physicians' health study of aspirin to prevent myocardial infarction (mi) utilized more than , doctors in an rct (steering committee of the physicians' health study research group ). they found a reduction in mi that was highly statistically significant: p < . . the trial was stopped because it was thought that this was conclusive evidence that aspirin should be adopted for general prevention. however, the effect size was extremely small: a risk difference of . % with r = . (sullivan and feinn ). a study this size is not likely to occur in mental health. moreover, such small effects would not be considered important even if they could be detected. it is unlikely that very large clinical trials such as the aspirin study would ever be conducted in mental health. thus, it is probable that data will have to be obtained from routinely collected service data. but mental health services usually do not collect sufficiently fine-grained data from clients. while i was an early and strong proponent of what i called a measurement feedback system for services (bickman a), recent research shows that the collection of such data is rare in the real world.
until services start collecting these data as part of their routine services, it is unlikely that ai will see much growth, given the limited availability of relevant data. there is, of course, a chicken-and-egg problem. a major reason why services do not collect data is the limited usefulness of data in improving clinical care. while ai may offer the best possibility of increasing the usefulness of regularly collected data, such data will not be available until policy makers, funders, and providers deem it useful and are willing to devote financial resources to such data collection and analysis. at present, there are no financial incentives for mental health providers to collect such data even if they improved services. moustafa et al. ( ) made the interesting observation that psychology is behind other fields in using big data. ai and big data are not considered core topics in psychology. the authors suggest several reasons for this, including that psychology is mostly theory- and hypothesis-driven rather than data-driven, and that studies use small sample sizes and a small number of variables that are typically categorical and thus are not as amenable to ai. moreover, most statistical packages used by psychologists are not well-equipped to analyze large data sets. however, the authors note that the method of clustering, and thus differentiating among, participants is used by psychologists and is in many ways similar to ai approaches, especially deep neural networks, in trying to identify similar participants. using ml methods such as random forest algorithms, the investigator can identify variables that best explain differences among groups or clusters. instead of the typically few variables used by psychologists, ai can examine hundreds of variables (see the brief sketch below). as a note of caution, rutledge et al. ( ) warn that "there is no silver bullet that can replace collecting enough data to generate stable and generalizable predictions" (p. ). while there are techniques that are often used in low sample size situations (e.g., the elastic net and tree-based ensembles), researchers need replications with independent samples if they are to have sufficient confidence in their findings. moreover, since big data are indeed big, they are easily misunderstood as automatically providing better results through smaller sampling errors. it is often not appreciated that the gain in precision drawn from larger samples may well be nullified by the introduction of additional population variance and biases. finding competent big data managers, data scientists, and programmers is a human resource problem. in my experience, ai scientists who are able and want to collaborate with mental health services researchers are rare. industry pays a lot more for these individuals than universities can afford. moreover, even within the health field, mental health is a very small component of the cost of services, so it is often ignored in this area. difficulty and resistance are encountered in the implementation of new technologies. clinicians are reluctant to adopt new approaches and to engage clients in new approaches and data collection procedures. community mental health services have been slow to successfully adopt new technologies (crutzen et al. ; lattie et al. ; yeager and benight ).
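before turning to those implementation concerns in more detail, here is the brief sketch promised above. it is a minimal illustration of how a random forest can rank hundreds of candidate variables by their contribution to separating two groups, assuming scikit-learn is available; the simulated data stand in for real client records, and every name and number is illustrative rather than drawn from any cited study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# simulated stand-in for a wide clinical data set: 300 clients, 200 candidate
# variables (symptoms, history, service-use indicators), two outcome groups
X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           random_state=0)

# fit a random forest and rank variables by how much they help separate groups
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:10]
for idx in top:
    print(f"variable {idx}: importance {forest.feature_importances_[idx]:.3f}")
```

as the cautions cited above imply, such rankings are only a starting point; with samples of this size they would need to be replicated in independent data before being trusted.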
in their mixed methods study of community clinicians, crutzen et al. ( ) found there were concerns about privacy, the wide range of therapeutic techniques used, disruptions in trust and alliance, managing crises, and organizational issues such as billing and the regulations contained in the privacy rule established by the health insurance portability and accountability act (hipaa), all of which inhibited the use of new technologies. moreover, our current reimbursement policies do not support greater payment for better outcomes. thus, there is little or no financial incentive for hard-pressed community services to improve their services at their own expense. in fact, i would argue that there is a disincentive to improve outcomes since it results in increased costs (at least initially), organizational disruption, and potentially a loss of clients if it takes less time and effort to successfully treat them. an interesting meta-issue has emerged from the widespread and ever-increasing investment in ai in healthcare. in a perceptive "viewpoint" published in jama, emanuel and wachter ( ) argue that the major challenge facing healthcare is not that of obtaining data and new analytics but the achievement of behavior change among both clinicians and patients. they point out the major failures of google and microsoft in not recognizing the problems in translating evidence into practice in connection with their large, web-based repositories for storage of health records, google health and microsoft healthvault, both of which have been discontinued. they indicate that the long delays in translation are due not primarily to data issues or lack of accurate predictions, but to the absence of behavioral changes needed for adoption of these practices. for example, the collection of longitudinal data has been problematic. another problem they note is that about half the people in the united states are nonadherent with medications. there is a huge gap between knowing what a problem is and actually solving it that "data gurus" seem to ignore. while this translation problem is evident in the sometimes narrow focus of ai promoters, it also represents an opportunity for the behavioral scientists engaged in ai research to marshal their skills and the knowledge gained from years of dealing with similar behavioral issues. (as an aside, i would be happy to serve as a "matchmaker" for any ai programmers, data scientists, or behavioral scientists who are interested in collaborating on mental health projects; just contact me describing your background and interests and i will try to put together like-minded researchers.) the emergence of translational and implementation sciences, the latter more often led by behavioral scientists, can be of great service to the problems of applying ai to healthcare. the field of translational sciences has been developed and well-funded by the nih in recognition of the difficulty in using (i.e., translating) laboratory studies into practice. in , the budget for the clinical and translational science awards (ctsa) program was over a half billion dollars. however, as director of evaluation for vanderbilt medical center's ctsa program for many years, i became very familiar with the difficulties in applying medical research in the real world. mental health is determined by multiple factors. it is unlikely that we will find a single vector such as a virus or a bacterium that causes mental illness. thus, data demands can include multiple systems with biological, psychological, sociological, economic, and environmental factors.
within many of these domains, we do not have objective measures such as the lab tests found in medicine. subjective self-reports are prone to many biases, and many of the symptoms are not directly observable by others. the lack of a strong theory of mental disorders also makes it difficult to intelligently focus on only a few variables. even with apparently simple measures that include observations or recordings from multiple informants, we do not have a consensus on how to integrate them (bickman et al. a; martel et al. ). however, i would expect that research generated with ai will contribute not only to improved treatment but also to enhanced theories by including heterogeneous clients and many data sources. confidentiality and trust are key issues in mental health treatment. how will the introduction of ai affect the relationship between client and clinician? as noted earlier, there are problems, especially with deep learning, in interpreting the meaning of algorithmic solutions and predictions. our ability to explain the algorithms to clients is problematic. while many research projects outside of mental health show that combining ai with human judgment produces the best outcomes, this research is still in its infancy. a great deal has been written about ai in the context of medicine, but we need a reality check about the importance of ai in clinical practice. ben-israel et al. ( ) addressed the use of ai in a systematic review of the medical literature from to . the authors focused on human studies that addressed a problem in clinical medicine using one or more forms of ai. of the studies, only % were prospective. none of the studies included a power analysis, and half did not report attrition data. most were proof-of-concept studies. the authors concluded that their study showed that the use of ai in the daily practice of clinical medicine is practically nonexistent. the authors acknowledge that use was defined by publication and that many applications of ai may be occurring without publication. regardless, this study suggests that there are many barriers that must be overcome before ai is more widely used. the self-help industry can provide perspective on digital apps, including some that use ai. it has been estimated that this sector was worth $ . billion in and is expected to be worth $ . billion in (la rosa ). part of that big dollar market is in digital mental health apps, although their precise monetary value is unknown. more to the point is that we know little about the effectiveness of digital apps in the marketplace (chandrashekar ). moreover, many have warned that these unregulated and untested apps could be dangerous (wykes ). in the united states, the publication of books is protected by the constitution, so there are no rules governing what can be published in the self-help sector. the market determines what gets accepted and used, regardless of effectiveness or negative side effects. but publication is limited by the cost of publishing and distribution. this is not the case for digital programs, where the marginal cost of adding an additional user is negligible. unlike other mental health interventions, there are no licensing or ethical standards governing their use. there are no data being uniformly collected on their use and their effects. although there are u.s. government rules that can be applied to these apps (armontrout et al. ), the law has many exceptions.
the authors note that they could not find a single lawsuit related to software that diagnoses or treats a psychiatric condition. an interactive tool is provided by the federal trade commission to help judge which federal laws might apply in developing an app (https://www.ftc.gov/tips-advice/business-center/guidance/mobile-health-apps-interactive-tool). it is clear that digital mental health apps will continue to grow. it is critical that services research and funding agencies do not overlook this development, which might have potentially positive or negative effects. these are but a few of the many areas of ai needing additional research and potential limitations to be addressed. an excellent discussion of these and other related issues regarding the potential hype common in the ai field is provided in the national academy of medicine's monograph on the use of ai in healthcare (matheny et al. ). a thought-provoking paper by hagendorff and wezel ( ) classifies what ai can and cannot do. some of the authors' concerns, such as measurement, completeness and quality of the data, and problems with transparency of algorithms, have already been discussed, so i will describe those that i feel are most relevant to mental health services. the authors describe two methodological challenges, the first being that the data used in ai systems are not representative of reality because of the way they are collected and processed. this can lead to biases and problems with generalizability. second is the concern that supervised learning represents the past: prediction can be based only on the past and not on expectations of change; thus, in some respects, change is inhibited. hagendorff and wezel ( ) also note several societal challenges. one such challenge they cite is that many software engineers who develop these algorithms do not have sufficient knowledge of the sociological, psychological, ethical, and political consequences of their software. they suggest this leads to misinterpretations and misunderstandings about how the software will operate in society. the authors also note the scarcity of competent programmers. i noted earlier that this is especially the case in academia and particularly in the behavioral sciences. the authors highlight that ai systems often produce hidden costs. this includes hardware to run the ai systems and, i would add, the disruptive nature of the intrusion of ai into a workflow. among the technological challenges discussed by hagendorff and wezel, i believe the authors' focus on the big differences between human thinking and intelligent machines is especially relevant to mental health. machines are in no way as complex as human brains; even ai's powerful neural networks, with more than a billion interconnections, represent only a tiny portion of the complexity of brain tissue. in order to obtain better convergence between machines and humans, hagendorff and wezel suggest that programmers follow the three suggestions made by lake et al. ( ). first, programmers should move away from pattern recognition models, where most development started, to automated recognition of causal relationships. the second suggestion is to teach machines basic physical and psychological theories so that they have the appropriate background knowledge. the third suggestion is to teach machines to learn how to learn so that they can better deal with new situations.
the comparison between ai and human thought is the only aspect of their paper where hagendorff and wezel mention causality issues. they note the challenge related to the inflexibility of many algorithms, especially the supervised ones, where simply changing one aspect would result in processing errors because that aspect was not in the training data. machines can be vastly superior to humans in some games where there are very specific inputs for achieving specific goals, but they cannot flexibly adapt to changes like humans. the authors suggest that promising technical solutions are being worked on to deal with this weakness in transferability. all these challenges will affect how well ai will work in mental health services. most problems will probably be solved, but the authors believe that some of these challenges will never be met, such as dealing with the differences between human and computer cognition, which means that ai will never fully grasp the context of mental health services. the machine's construction of a person may lead to a fragmented or distorted self-concept that conflicts with the person's own sense of identity, which seems critical to any analysis of the person's mental health or lack thereof. i do not have a sense of how serious this and the other challenges will be for us in the future, but it is clear that there is a lot more we need to learn. yet another set of concerns, specifically about the variation in ai called deep learning (dl), was enumerated by marcus ( ) , an expert in dl. in a controversial paper in which he identified limitations of dl, he noted that dl "may well be approaching a wall" (p. ) where progress will slow or cease. for example, he noted that dl is primarily a statistical approach for classifying data, using neural networks with multiple layers. dl "maps" the relationships between inputs and outputs. while children may need only a few trials to correctly identify a picture of a dog, dl may need thousands or even millions of labeled examples before making correct identifications without the labels. very large data sets are needed for dl. this is not the case for all ml techniques. i will not attempt to summarize the nine other limitations he sees with dl since many of them are noted elsewhere in this paper. he concludes that dl itself is not the problem; rather, the problem is that we do not fully understand the limitations of dl and what it does well. marcus warns against excessive hype and unrealistic expectations. i am taking this advice personally, and i am not expecting my tesla to be fully autonomous in as predicted by elon musk (woodyard ) . wolff ( ) provided an overview of how some of the problems of deep learning can be ameliorated. he responds to marcus using many of the subheadings in marcus's paper. he calls his framework the sp theory of intelligence, and its application is called the sp computer model (sp stands for simplicity and power). the theory was developed by wolff to integrate observations and concepts across several fields including ai, computing, mathematics, and human perception and cognition, using information compression to unify them. despite these and other concerns previously described, i do think that the advantages of ai for moving mental health services forward outweigh its disadvantages. however, this summary of advantages does not attempt to balance in length or number the disadvantages described above. 
rather than repeating the already described numerous applications and potential applications of ai that can be used to improve health services, i highlight only a few key advantages. one of the main advantages is the way ai deals with data. it can handle large amounts of data from diverse sources. this includes structured (quantitative) and unstructured (text, pictures, sound) data in the same analyses. thus, it can integrate heterogeneous data from dissimilar sources. as noted earlier, the inclusion of non-traditional data such as those obtained from remote sensing (e.g., movement, facial expression, body temperature) will be responsible for a paradigm shift in what we consider relevant data. ai, if widely adopted, has the potential to have a major impact on employment. while most of the popular press coverage has been on the potential negative effects of eliminating many jobs, there also are potential positive effects. ai can reduce the costs of many tasks, thus increasing productivity. on the human side, it can streamline routine work and eliminate many boring aspects of work. it thus can free up workers to engage in the more complex and interesting aspects of many jobs. previous innovations have caused job dislocations. the classic loss of jobs in making buggy whips after the advent of automobiles is just one example. the inventions of the industrial age, such as steam engines, displaced many workers but also created many more new jobs. we know that many unskilled or semi-skilled jobs will be affected by ai in a major way. the elimination of cashiers with automated checkouts is now being implemented by amazon. in these stores, you scan your phone, and then ai and cameras take over. you just put products in your bag or cart and leave when you are finished. self-driving cars and trucks will greatly disrupt the transportation industry. we have weathered these disruptions in the past, but even the experts are unsure about how ai will influence jobs. probably the area of most positive potential in healthcare is where humans and machines collaborate in partnership. here, ai augments human tasks but keeps humans in the center. thus, physicians will no longer be separated from their patients by a laptop when speaking to them, because ai will be able to record, take notes, and interpret the medical visit. we have documented the shortage of mental health workers and the immense gap between mental health needs and our ability to meet them. yes, we can train more clinicians, but our society seems unwilling to offer sufficient salaries to attract and keep such individuals. we have been experimenting with computers as therapists for more than years, but now we finally have the technological resources to develop and implement such approaches. we have started to use chatbots to extend services, but in the near future, ai may allow us to replace the human therapist under some conditions (hopp et al. ). in , the computer scientist and science fiction author vernor vinge developed the concept of a singularity in which artificial intelligence would lead to a world in which robots attain self-consciousness and are capable of what are now human cognitive activities (vinge ). advocates and critics disagree on whether a singularity will be achieved and whether it would be a desirable development (braga and logan ).
braga and logan, editors of a special issue of the journal information on the singularity and ai, conclude that although ai research is still in the early stage, the combination of human intelligence and ai will produce the best outcomes, but ai will never replace humans and we cannot fully depend on ai for the right answers. while these authors are well-informed, their crystal ball may not be clearer than anyone else's. the relevance of the singularity for healthcare lies in asking whether there will be a time when ai-based computers are more effective and efficient than clinicians and will replace them. it is a question worth considering. i have presented a comprehensive, wide-ranging paper dealing with ai and mental health services. i have described major deficiencies of our current services, namely the lack of sufficient access, inadequate implementation, and low efficiency/effectiveness. i summarized how precision medicine and ai have contributed to improving healthcare in general and how these approaches are being applied in precision psychiatry and mental health. the paper then described research that shows how ai has been or can be used to help solve the five problems i noted earlier. i then described the disadvantages and advantages of ai. in reviewing all this information, i believe there is one factor that i have not discussed sufficiently that clearly differentiates the way mental health services have been delivered and the way i expect they will be delivered in the future. i want to focus this last section of the paper on what i believe is the most important and significant change that can occur. this change is reflected in a simple question: is a human clinician necessary to deliver effective and efficient mental health services? i believe the answer to this question does not depend on the occurrence of the singularity but lies in the growth of ai research and its application to mental health services. i think there is widespread agreement that there are significant problems with diagnoses and the quality of our measures. moreover, most will probably agree that if ai can improve diagnoses and measures, then we should utilize ai and let the results speak for themselves. the dependence on rcts will probably not be resolved by ai research, but ai can clearly help inform what should be tested in rcts. however, our current services overwhelmingly depend on human clinicians to deliver treatment. the problem with learning and feedback is that it requires clinicians to learn how to improve treatment over time with feedback. we are still uncertain about how well clinicians can learn from experience, training, and education (bacon ). we also lack evidence of the best way to provide feedback to enhance that learning (bickman a; dyason et al. ). the problem of treatment precision is also currently tied to having the clinician deliver the treatment. while we can expect ai to deliver more precise information about treatment planning, we still depend on the clinician to interpret and deliver it with fidelity to some evidence-based model. a precision approach requires the clinician to systematically deliver treatment that is most appropriate to a specific client. we do not have good evidence that most clinicians can do that. i believe no other issue generates a bigger emotional response than the idea of changing the role of the clinician. no other issue has as great an economic impact on services as the position of the clinician.
i believe this issue is the most critical to the future of mental health services and will be most affected by ai. i note that in writing an introduction to an extensive special issue of this journal called "therapist effects in mental health service outcome" (king ), the authors of that introduction did not note the potential role of ai in affecting clinicians (king and bickman ). change is happening rapidly. mental health services are not alone in facing the issue of the role of humans, although human clinicians are probably more central to the provision of mental health services than to other health services. a similar issue of the role of humans in the provision of services is being played out in surgery. surgery has been using robots for over years (bhandari et al. ), but the uptake has been slow for a variety of reasons. the next iteration of robot use is a move from using robots guided by surgeons to using robots assisted by ai and guided by surgeons. the use of ai may be seen as an intermediate step to fully autonomous ai-based robots not guided by surgeons. however, it is very clear that this progression is speculative and will take a long time to happen, if ever, given the consequences of errors. closer to our everyday experience is the similar path that autonomous driving is taking as we move toward the point at which a human driver is no longer needed. will mental health services follow a similar path? since we do not currently have a sufficient amount of research on using ai in treatment alone to inform us, we must look elsewhere for guidance. two bodies of literature are relevant. one deals with the use of computers and other technologies that do not include the use of ai at present; the second deals with self-help, in which the participation of the clinician is minimal or totally absent. first, let us consider the existing literature that contrasts technology-based treatments with traditional face-to-face psychotherapy. then i will present some reviews of self-help research, followed by a description of the small amount of research using ai in treatment. a review of studies of internet-delivered cbt (icbt) to youth, using waitlist controls, supports the conclusion that cbt could be successfully adapted for internet-based treatment (vigerland et al. ). in a meta-analytic review of meta-analyses containing studies of adult use of internet-delivered cbt (icbt), the authors concluded that icbt is as effective as face-to-face therapy (andersson et al. ). hermes et al. ( ) include websites, software, mobile apps, and sensors as instances of what they call behavioral intervention technologies (bit). in their informative article, dealing primarily with implementation, they note that these technologies (they do not mention ai) can relate to a clinician in three ways: (1) when the intervention is delivered by the clinician and supported by bit, (2) when bit provides the intervention with support from the clinician, or (3) when the intervention is fully automated with no role for the clinician. this schema clearly applies to the ai interventions and the role of clinicians as well. their conceptual model is helpful in understanding the parameters of implementation. they present a comprehensive plan for research to fill in the major gaps in the literature that addresses the question of comparative effectiveness of bit and traditional treatment. carlbring et al.
( ) conducted a systematic review and meta-analysis of eligible studies of icbt versus face-to-face cbt and reported that they produced equivalent outcomes, supporting the conclusions drawn by previous studies. it is also important to consider the issue of therapeutic alliance (ta) and its relationship to internet-based treatment. ta, to a large extent, is designed to capture the human aspect of the relationship between the clinician and the client. there are thousands of correlational studies that have established that ta is a predictor of treatment outcomes (flückiger et al. ); however, there are few studies of interventions that show a causal connection between ta and outcomes (e.g., hartley et al. ). moreover, the very nature of ta as trait-like or state-like, which is central to causal assumptions, is being questioned and is subject to new research approaches (zilcha-mano ) as well as to questions about how it should be measured. regardless of my doubts about the importance of ta, the flückiger et al. ( ) meta-analysis found similar effect sizes (r = . ) for the alliance-outcome relationship in online interventions and in traditional face-to-face therapies. however, most of these studies were guided by a therapist, so the human factor was not totally absent. penedo et al. ( ), in their study of a guided internet-based treatment, showed that it was important to align with the client's expectations and goals because these were related to outcomes, but no such relationship existed with the traditional third component of ta, bond with the supporting therapist, implying that ta might play a different role in internet-based treatments. i was trained as a social psychologist and was a graduate student of stanley milgram (of the famous obedience experiments), so i was curious about the research on the relationship between technological virtual agents and humans beyond the context of mental health treatment. several studies cited by schneeberger et al. ( ) showed that robots could get people to do tiring, shameful, or deviant tasks. the authors found that participants obeyed these virtual agents similarly to the way they responded to humans in a video-chat format. the participants did the same number of shameful tasks regardless of who or what was ordering them. moreover, doing the tasks produced the same level of shame and stress in the participants. they concluded that virtual agents appear to have the same influence on participants as human experimenters. of course, there are many limitations associated with generalizing from this laboratory study, which was conducted with female college students in germany, but it does suggest that a great deal of research needs to be done on how humans relate to robots and virtual agents. miner et al. ( ) suggest that use of conversational ai in psychotherapy can be an asset for improving access to care, but there is limited research on efficacy and safety. can we learn about the role of the therapist from therapies that do not involve any therapist or technology? there is substantial research on self-help approaches from written material or what some call bibliotherapy. in general, research has supported the effectiveness of bibliotherapy before the advent of digital approaches. in , cuijpers et al.
published a review of the literature that compared face-to-face psychotherapy for depression and anxiety with guided self-help (i.e., with some therapist involvement) and concluded that they appeared comparable, but because there were so few studies in this comparison, this conclusion should be interpreted with caution. has the situation changed in the last decade? almost years later, bennett et al. ( ) conducted a comprehensive review and meta-analysis of studies. they concluded that self-help (both guided and unguided) had significant moderate to large effects on reducing symptoms of anxiety, depression, and disruptive behavior. however, there was also very high heterogeneity among the outcomes of these studies. self-help was better than no treatment but slightly worse than face-to-face treatment; guided self-help was better than unguided self-help, and computerized treatment was better than bibliotherapy. it is important to note that none of the studies were fully powered noninferiority trials, which would be a superior design. the authors concluded that their study showed potential near equivalence for self-help compared to face-to-face interventions, and their conclusions were consistent with several other reviews of self-help for mental health disorders in adults. the paper makes no mention of ai. cuijpers et al. ( ) conducted a network meta-analysis of trials of cbt addressing the question of whether format of delivery (individual, group, telephone-administered, guided self-help, or unguided self-help) influenced acceptability and effectiveness for adult patients with acute depression. no statistically significant differences in effectiveness were found among these formats except that unguided self-help therapy was not more effective than care as usual but was more effective than a waitlist control group. the authors concluded that treatments using these different formats should be considered alternatives to therapist-delivered individual cbt. as in the previous publication, there was no mention of the use of ai, but cuijpers believes that few if any of the studies reviewed in his publication used ai (p. cuijpers, personal communication, march , ). there is an emerging area of the use of ai in treatment that is informative. tuerk et al. ( ), in a special section of current psychiatry reports focusing on psychiatry in a digital age, describe several approaches to using technology in evidence-based treatments. most relevant is their discussion of the use of ai in what has been called "conversational artificial intelligence," where there is a real-time interchange between a computer and a person. they note research that shows that this approach is low risk, high in consumer satisfaction, and high in self-disclosure. they suggest that there is a great deal of clinical potential in using ai in this manner. in a review of the literature from to on conversational agents used in the treatment of mental health problems, gaffney et al. ( ) found only qualifying studies out of an initial , with four being what they called full-scale rcts. they concluded that the use of conversational agents was limited but growing. all studies showed reduced psychological distress, with the five controlled studies showing a significant reduction compared to control groups.
however, the three studies that used active controls did not show significant differences between the active control condition and use of a conversational agent, although all showed improvement. the authors concluded that the use of conversational agents in therapy looks promising, but not surprisingly, more research is needed. a similar conclusion on conversational agents was reached in another independent review (vaidyam et al. ). i have little doubt that more research will be forthcoming in this emerging area. in summary, previous research using digital but not ai-powered icbt, self-help (bibliotherapy), and ai-powered conversational agents suggests that effective treatment can be delivered without a human clinician under certain circumstances. i want to emphasize that these studies are suggestive but far from definitive. rather, they suggest that the role of the clinician is worth more exploration, but they do not establish the conclusion that we do not need clinicians to deliver services. we need to know a great deal more about how ai-supported therapy operates in different contexts. a survey of psychiatrists from countries asked about how technology will affect their future practice (doraiswamy et al. ). only . % felt their jobs would become obsolete, and only a small minority ( %) felt that ai was likely to replace a human clinician in providing care. as much of the literature on the effects of ai on jobs suggests, those surveyed believed that ai would help in more routine tasks such as record keeping ( %) and synthesizing information, with about % believing their practices would be substantially changed. about % thought ai would have no influence or only minimal effect on their future work over the next years. another % thought their practices would be moderately changed by ai over the next years. more than three quarters ( %) thought it unlikely that technology would ever be able to provide care as well as or better than the average psychiatrist. only % of u.s.-based psychiatrists predicted that the potential benefits of future technologies or ai would outweigh the possible risks. some of the specific tasks that psychiatrists typically perform, including mental status examination, evaluation of dangerous behavior, and the development of a personalized treatment plan, were also felt to be tasks that a future technology would be unlikely to perform as well. i do not think many psychiatrists in this study are prepared for the major changes in their practices that are highly likely to occur in the next quarter century. in a thoughtful essay on the future of digital psychiatry, hariman et al. ( ) draw a number of conclusions. they predict major changes in practice, with treatment by an individual psychiatrist alone becoming rare. patients will receive treatment through their phones, participate in videoconferencing, and converse with chatbots. clinicians will receive daily updates on their patients through remote sensing devices and self-report. ai will be involved in both diagnosis and treatment and will integrate diverse sources of information. concerns over privacy and data security will increase. this is not the picture that the previously described survey of psychiatrists anticipated. brown et al. ( ) present the pros and cons of ai in an interesting debate format. on the pro side, the authors argue that while there are current limitations, the improvements in natural language processing (nlp) will lead to better clinical interviews.
they point to research that shows people are more likely to be honest with computers as a plus in obtaining more valid information from clients. they expect the ai "clinician" will be seen as competent and caring. they do note the danger that non-transparent ai will produce unintended negative side effects. those arguing against the use of ai clinicians acknowledge the technical superiority of ai to accomplish more routine tasks such as information gathering and tracking, but they point out the limitations even in the development of ai therapists. the lack of data needed to develop and test algorithms is critical. i have noted this in the discussion of the diagnostic muddle as a problem that ai can help solve, but these anti-ai authors argue that because psychiatrists disagree on diagnoses, there is no gold standard against which to measure the validity of ai models. this seems to be a rather unusual perspective from which to challenge change. they insightfully note that ai is different from human intelligence and does not perform well when presented with data that are different from training data. but the anti-ai authors acknowledge that more and better data may lead to improvement. brown et al. ( ) argue that common sense is something that ai cannot draw on; however, this seems to be a weak argument when common sense has been demonstrated to be inaccurate in many situations. they conclude with the statement that psychiatry "will always be about connecting with another human to help that individual" (p. ). this may be more wishful thinking than an accurate prediction about the future. those arguing the pro position state that "the advance of ai psychiatry is inexorable" (p. ). on the other hand, the opponents of ai correctly point out that there is not yet sufficient evidence to draw a conclusion about the effectiveness of ai versus human clinicians. while there is disagreement about the potential advantages and disadvantages of ai, both sides agree that we need more and better research in this area. simon and yarborough ( ) present the case that ai should not be a major concern for mental health. they argue that "ideally, our field would abandon the term artificial intelligence in regard to actual diagnosis and treatment of mental health conditions. using that term raises false hopes that machines will explain the mysteries of mental health and mental illness. it also raises false fears that all-knowing machines will displace human-centered mental health care. big data and advanced statistical methods have and will continue to yield useful tools for mental health care. but calling those tools artificially intelligent is neither necessary nor helpful" (p. ). the authors further take the position that "despite the buildup around artificial intelligence, we need not fear the imminent arrival of 'the singularity,' that science fiction scenario of artificially intelligent computers linking together and ruling over all humanity... a scenario of autonomous machines selecting and delivering mental health treatments without human supervision or intervention remains in the realm of science fiction" (p. ). a more balanced approach to the issue of the replacement of clinicians by ai is presented by ahuja ( ). after his review of the literature on medical specialists who may be replaced or more likely augmented by ai, his pithy take on this question is "or, it might come to pass that physicians who use ai might replace physicians who are unable to do so" (ahuja , p. ).
clearly, ai research will have to provide strong evidence of its effectiveness before ai will be accepted by some in the psychiatric community. there are several pressing questions about how mental health services should be delivered and about the future of mental health services. doubts about how much clinicians contribute to outcomes, our seeming inability to differentiate the effectiveness among clinicians except at the extremes, the lack of stability of employment of most community based clinicians, the poor track record on implementation of evidence-based programs, the cost of human services, the very limited availability of services especially where resources are inadequate-all lead to strong doubts about continuing the status quo of using clinicians as the primary way in which mental health services are delivered. in contrast, alternative approaches have many advantages. if scaled, ai therapists could be available to patients / and would not be bound to office hours. these ai therapists could represent any demographic or therapy style (e.g., directive) that the client preferred or that had been found to be more effective with a particular client. they can be specialists in any area for which there is sufficient research. in other words, not only can a personalized treatment plan be developed, but a personalized clinician (avatar) can be constructed for the best match with the client. of course, all these are putative advantages. as noted earlier, the application of ai is not without its risks and challenges, especially in putting together the interdisciplinary teams needed to accomplish this research. while i am optimistic about the potential contribution of ai to mental health services, it is just that-a potential. extensive research will be needed to learn whether these approaches produce positive outcomes when compared to traditional face-to face treatment, while also dealing with the ethical issues raised by ai applications. moreover, the quality of research needs significant improvement if we are going to have confidence in the findings. however, as exemplified by the rapid and uncontrolled growth of therapy apps, the world may not wait for rigorous supporting research before adopting a larger role for ai in mental health services. while my brief summaries of findings of ai in the medical literature are supportive of the application of ai, i do not want to give the impression that these positive findings are accepted uncritically. a deeper reading of many of these studies exposes methodological flaws that temper enthusiasm. for example, in reviewing comparisons between healthcare professionals and deep learning algorithms in classifying diseases of all types using medical imaging, x. liu et al. ( a) conclude that the ai models are equivalent to the accuracy of healthcare professionals. this review is the first to compare the diagnostic accuracy of deep learning models to health-care professionals; however, only a small number of the studies were direct comparisons. the authors also caution us by indicating what they labeled as the poor quality of many of the studies. the problems included low external validity (not done in a clinical practice setting), insufficient clarity in the reporting of results, lack of external validation, and lack of uniformity of metrics of diagnostic performance and deep learning terminology. however, the authors were encouraged by improvement in quality in the most recent studies analyzed. 
in commenting on the study, cook ( ) noted other limitations and concluded that it is premature to draw conclusions about the comparative accuracy of ai versus human physicians. if we are not more cautious, she warns that we will experience "inflated expectations on the gartner hype cycle" (p. e ). the latter refers to the examination of innovations and trends in ai. she cautions us to "stick to the facts, rather than risking a drop into the trough of disillusionment and a third major ai winter" (p. e ). many issues are raised in cook's paper, and the need to avoid the hype often found in the ai field is reiterated in the national academy of medicine's monograph on the use of ai in healthcare (matheny et al. ) . mental health services are changing. there are more than , mental health apps on the internet that are being used without much evidence of their effectiveness (marshall et al. ; bergin and davis ; gould et al. ) . the explosion of mental health apps is the leading edge of future autonomous interventions. however, there is pressure to bring some order to this chaos. probably the next innovation that will involve ai is its use in stepped therapy in which clients are typically triaged to low-intensity, low-cost care, monitored systematically, and stepped up to more intensive care if progress is not satisfactory . in this schema, the low-cost care could be ai-based apps with little risk to the client. if more confidence is gained in the safety and effectiveness of this type of protocol, the use of ai-based treatment would be expected to increase. the covid- pandemic will produce a major impact on mental health services. first, it is expected that the stresses caused by the pandemic will increase the demand for services (qiu et al. ; rajkumar ) . already poorly resourced mental health systems will not be able to meet this demand (Ćosić et al. ; ho et al. ; holmes et al. ) , especially in low resourced countries. however, the biggest change will be in the service delivery infrastructure. because of social distancing requirements, in-person delivery of therapy is being severely curtailed. while the major change at this time appears to be a shift to telemedicine (shore et al. ; van daele et al. ) , which is being adopted across almost all healthcare, there will need to be changes instituted in how clinicians are trained and supervised (zhou et al. ) . i have little doubt that ai will be adopted in order to increase efficiency and address the change in the service environment caused by the pandemic. in addition to changes initiated by the pandemic, there appear to be some changes in funding as a result of the protests concerning george floyd's killing. there is reconsideration of shifting some funding from police services to mental health and conflict reduction services to be delivered by personnel outside law enforcement (stockman and eligon ) . it will be difficult to meet this potential demand using the current infrastructure. the literature on ai and medicine is replete with warnings about the difficulties we face in integrating ai into our healthcare system. as a program evaluator, i appreciate the position paper describing the urgent need for well-designed and competently conducted evaluations of ai interventions as well as the guidelines provided by magrabi et al. ( ) . more suggestions for improving the quality of research on supervised machine learning can be found in the paper by cearns et al. ( ) . celi et al. 
( ) describe the future in a very brief essay that is worth quoting: clinical practice should evolve as a hybrid enterprise with clinicians who know what to expect from, and how to work with, what is fundamentally a very sophisticated clinical support tool. working together, humans and machines can address many of the decisional fragilities intrinsic to current practice. the human-driven scientific method can be powerfully augmented by computational methods sifting through the necessarily large amounts of longitudinal patientand provider-generated data. (p. e ) however, research on ai, data science, and other technologies is in its infancy if not the embryonic stage of development. i am fully immersed in the struggle to implement several types of technologies in practice. changing the routine behavior of clinicians and clients is a major barrier to using new technologies, regardless of the effectiveness of these approaches. emanuel and wachter ( ) argue that the most important problem facing healthcare is not the absence of data or analytic approaches but turning predictions and findings into successful accomplishments through behavior change. alongside the investment in technology and analytics, we need to support the research and applications of psychologists, behavioral economists, and those working in the relatively new field of translational and implementation research. the emphasis on practical and implementable digital approaches requires a methodology that departs from the traditional efficacy approach, which does not focus on context and thus is difficult to translate to the real world. mohr et al. ( ) suggest a solution-based approach that focuses on three stages that they label create, trial and sustain. creation focuses on the initial stages of development, although not exclusively, and takes advantage of the unique characteristics of digital approaches that focus on engagement rather than trying to mimic traditional psychotherapy. trial must be dynamic because digital technologies rapidly change; rapid evaluations are required, such as continuous quality improvement strategies (bickman and noser ) . sustainability requires more from investigators and evaluators than publication of results; they must also produce sustainable implementation that no longer depends on a research project for support. we are currently in an ai summer in which there are important scientific breakthroughs and large investments in the application of ai (hagendorff and wezel ) . but ai has had several winters when enthusiasm for ai has waned and unreasonable expectations have cooled. we were confronted with the reality that ai could not accomplish everything that people thought it could and that investors and journalists had hyped. ai, at least in the near term, will not be the superintelligence that will destroy humanity or the ultimate solution that will solve all problems. enthusiasm for ai seems to run in cycles like the seasons. ai summers suffer from unrealistic expectations, but the winters bring an experience of disproportionate backlash and exaggerated disappointment. there was a severe winter in the late s, and another in the s and s (floridi ) . today, some are talking about another predictable winter (nield ; walch ; schuchmann ). floridi ( ) suggests that we can learn important principles from these cycles. 
first is whether ai is going to replace previous activities as the car did with the buggy, diversify activities as the car did with the bicycle, or complement and expand them as the plane did with the car. floridi asks how acceptable an ai that survives another winter will be. he suggests that we need to avoid oversimplification and think deeply about what we are doing with ai. in the june issue of the technology quarterly of the economist ( ), it is suggested that because ai's current summer is "warmer and brighter" than past ones, owing to the widespread deployment of ai, "another full-blown winter is unlikely. but an autumnal breeze is picking up" (p. ). i have traced the path my career has taken from an almost exclusive focus on randomized experiments to consideration of the applications of ai. i have identified the main problems related to mental health services research's almost sole dependence on rct methodology. i have linked the problems with this methodology to the lack of satisfactory progress in developing sufficiently effective mental health services. the recent availability of ai and the value now being placed on precision medicine have produced the early stages of a revolution in healthcare that will determine how treatment will be developed and delivered. i anticipate that in the very near future, a first-year graduate student will be contemplating the same questions that i raised years ago, because they are still relevant, but this time he or she will realize that there are answers that were not available to me. acknowledgements this paper is part of a special issue of this journal titled "festschrift for leonard bickman: the future of children's mental health services." the issue includes a collection of original children's mental health services research articles, this article, three invited commentaries on this article, and a compilation of letters in which colleagues reflect on my career and on their experiences with me. the word festschrift is german and means a festival or celebration of the work of an author. there are many people to thank for their assistance in both the festschrift and this paper. first, i want to acknowledge my two colleagues and friends, nick ialongo and michael lindsey, who spontaneously originated the idea of a festschrift during a phone conversation. the folks at the johns hopkins bloomberg school of public health were great in supporting the daylong event held on may , . the many friends, family, former students and colleagues who traveled from around the country to attend and present made the event memorable. i am grateful to the committee that helped put this special issue together, which included marc atkins, catherine bradshaw, susan douglas, nick ialongo, kim hoagwood, and sonja schoenwald. this paper represents more than a yearlong effort to which many contributed, including the scholars who provided email exchanges and ideas throughout the conceptualization and writing process. i thank the two editors of this special issue, sonja and catherine, who spent much of their valuable time on this project during a very difficult period. the manuscript was greatly improved through the efforts of my copy editor, diana axelsen. most of all i thank corinne bickman, who has been my partner in life for almost years and has managed this journal since its inception. without her support and love none of this would have been possible. funding no external funding was used in the preparation or writing of this article.
conflict of interest from the editors: leonard bickman is editor-in-chief of this journal and thus could have a conflict of interest in how this manuscript was managed. however, the guest editors of this special issue, entitled "festschrift for leonard bickman: the future of children's mental health services," managed the review process. three independent reviews of the manuscript were obtained and all recommended publication with some minor revisions, with which the editors concurred. while the reviewers were masked to the author, because of the nature of the manuscript it was not possible to mask the author from the reviewers. from the author: the author reported receipt of compensation related to the peabody treatment progress battery from vanderbilt university and a financial relationship with care software. no other disclosures were reported. this article does not contain any research conducted by the authors involving human participants or animals. this article did not involve any participants who required informed consent. sensing technologies for monitoring serious mental illnesses peeking inside the black-box: a survey on explainable artificial intelligence (xai) the impact of artificial intelligence in medicine on the future role of the physician the sage handbook of social research methods local causal and markov blanket induction for causal discovery and feature selection for classification part ii: analysis and extensions heterogeneity in psychiatric diagnostic classification. psychiatry research enhancing the patient involvement in outcomes: a study protocol of personalised outcome measurement in the treatment of substance misuse personalising the evaluation of substance misuse treatment: a new approach to outcome measurement a prospective study of therapist facilitative interpersonal skills as a predictor of treatment outcome internet interventions for adults with anxiety and mood disorders: a narrative umbrella review of recent meta-analyses. the canadian journal of psychiatry/la revue canadienne de psychiatrie current regulation of mobile mental health applications moving toward a precision-based, personalized framework for prevention science: introduction to the special issue being "smart" about adolescent conduct problems prevention: executing a smart pilot study in a juvenile diversion agency constructionist extension of the contextual model: ritual, charisma, and client fit responders to rtms for depression show increased front to-midline theta and theta connectivity compared to non-responders mental health services in schoolbased health centers: systematic review practitioner review: psychological treatments for children and adolescents with conduct disorder problems: a systematic review and meta-analysis feasibility of using real-world data to replicate clinical trial evidence the fort bragg evaluation: a snapshot in time ecological momentary assessment and intervention in the treatment of psychotic disorders: a systematic review practitioner review: unguided and guided self-help interventions for common mental health disorders in children and adolescents: a systematic review and meta-analysis the impact of machine learning on patient care: a systematic review.
artificial intelligence in medicine technology matters: mental health apps-separating the wheat from the chaff youth depression alleviation with antiinflammatory agents (yoda-a): a randomised clinical trial of rosuvastatin and aspirin redesigning implementation measurement for monitoring and quality improvement in community delivery settings for whom does interpersonal psychotherapy work? a systematic review artificial intelligence and robotic surgery: current perspective and future directions social influence and diffusion of responsibility in an emergency the social power of a uniform social roles and uniforms: clothes make the person improving established statewide programs: a component theory of evaluation barriers to the use of program theory the fort bragg demonstration project: a managed continuum of care a continuum of care: more is not always better resolving issues raised by the ft. bragg findings: new directions for mental health services research practice makes perfect and other myths about mental health services my life as an applied social psychologist a measurement feedback system (mfs) is necessary to improve mental health outcomes administration and policy in mental health and mental health services research why can't mental health services be more like modern baseball? administration and policy in mental health and mental health services research youth mental health measurement (special issue) administration and policy in mental health and mental health services research implementing a measurement feedback system: a tale of two sites. administration and policy in mental health and mental health services research evaluating managed mental health care: the fort bragg experiment beyond the laboratory: field research in social psychology clinician reliability and accuracy in judging appropriate level of care the technology of measurement feedback systems effects of routine feedback to clinicians on mental health outcomes of youths: results of a randomized trial the fort bragg continuum of care for children and adolescents: mental health outcomes over five years achieving precision mental health through effective assessment, monitoring, and feedback processes. administration and policy in mental health and mental health services research meeting the challenges in the delivery of child and adolescent mental health services in the next millennium: the continuous quality improvement approach what counts as credible evidence in applied research and evaluation practice the sage handbook of applied social research methods the evaluation handbook: an evaluator's companion crime reporting as a function of bystander encouragement, surveillance, and credibility evaluation of a congressionally mandated wraparound demonstration comparative outcomes of emotionally disturbed children and adolescents in a system of services and usual care the relationship between change in therapeutic alliance ratings and improvement in youth symptom severity: whose ratings matter the most? administration and policy in mental health and mental health services research problems in using diagnosis in child and adolescent mental health services research analysis of cause-effect inference by comparing regression errors improving machine learning prediction performance for premature termination of psychotherapy handbook of social policy evaluation predicting social anxiety from global positioning system traces of college students: feasibility study ai and the singularity: a fallacy or a great opportunity? 
information the effects of routine outcome monitoring (rom) on therapy outcomes in the course of an implementation process: a randomized clinical trial will artificial intelligence eventually replace psychiatrists? methodology for evaluating mental health case management an introduction to machine learning methods for survey researchers single-subject prediction: a statistical paradigm for precision psychiatry machine learning for precision psychiatry: opportunities and challenges does big data require a methodological change in medical research? developing machine learning models for behavioral coding internet-based vs. face-to-face cognitive behavior therapy for psychiatric and somatic disorders: an updated systematic review and meta-analysis recommendations and future directions for supervised machine learning in psychiatry an awakening in medicine: the partnership of humanity and intelligent machines. the lancet, digital health do mental health mobile apps work: evidence and recommendations for designing high-efficacy mental health mobile apps a logic model for precision medicine implementation informed by stakeholder views and implementation science inflammatory biomarkers for mood disorders: a brief narrative review a tale of two deficits: causality and care in medical ai systematic effect of random error in the yoked control design meta-analysis of the rdoc social processing domain across units of analysis in children and adolescents advancing ambulatory biobehavioral technologies beyond "proof of concept": introduction to the special section current state and future directions of technology-based, ecological momentary assessment and intervention for major depressive disorder: a systematic review the effectiveness of clinician feedback in the treatment of depression in the community mental health system advancing the science and practice of precision education to enhance student outcomes human versus machine in medicine: can scientific literature answer the question? the lancet: digital health impact of human disasters and covid- pandemic on mental health: potential of digital psychiatry services for adolescents with psychiatric disorders: -month data from the national comorbidity survey-adolescent machine learning for clinical psychology and clinical neuroscience the cambridge handbook of research methods in clinical psychology. cambridge handbooks in psychology psychological therapies versus antidepressant medication, alone and in combination for depression in children and adolescents can interest and enjoyment help to increase use of internet-delivered interventions? is guided self-help as effective as face-to-face psychotherapy for depression and anxiety disorders? 
a systematic review and meta-analysis of comparative outcome studies effectiveness and acceptability of cognitive behavior therapy delivery formats in adults with depression: a network meta-analysis testing moderation in network meta-analysis with individual participant data inflammatory cytokines in children and adolescents with depressive disorders: a systematic review and meta-analysis on multi-cause causal inference with unobserved confounding: counterexamples, impossibility, and alternatives integrating artificial and human intelligence in complex, sensitive problem domains: experiences from mental health introduction to the special issue: discrepancies in adolescent-parent perceptions of the family and adolescent adjustment reflections on randomized control trials comparison of machine learning methods with traditional models for use of administrative claims with electronic medical records to predict heart failure outcomes real-world evidence: promise and peril for medical product evaluation artificial intelligence and the future of psychiatry: insights from a global physician survey extracting psychiatric stressors for suicide from social media using deep learning deep neural networks in psychiatry out with the old and in with the new? an empirical comparison of supervised learning algorithms to predict recidivism machine learning approaches for clinical psychology and psychiatry does feedback improve psychotherapy outcomes compared to treatment-as-usual for adults and youth? psychotherapy research effects of providing domain specific progress monitoring and feedback to therapists and patients on outcome what is the test-retest reliability of common task-functional mri measures? new empirical evidence and a meta-analysis psychometrics of the personal questionnaire: a client-generated outcome measure artificial intelligence in health care: will the value match the hype? barriers and facilitators of mental health programmes in primary care in low-income and middle-income countries dermatologist-level classification of skin cancer with deep neural networks consumer participation in personalized psychiatry the new field of "precision psychiatry open trial of a personalized modular treatment for mood and anxiety ai and its new winter: from myths to realities the alliance in adult psychotherapy: a meta-analytic synthesis commentary: a refresh for evidencebased psychological therapies-reflections on marchette and weisz refining the costs analyses of the fort bragg evaluation: the impact of cost offset and cost shifting the effectiveness of psychosocial interventions delivered by teachers in schools: a systematic review and metaanalysis risk factors for suicidal thoughts and behaviors: a meta-analysis of years of research chronic inflammation in the etiology of disease across the life span conversational agents in the treatment of mental health problems: mixed-method systematic review utilization of machine learning for prediction of posttraumatic stress: a re-examination of cortisol in the prediction and pathways to non-remitting ptsd toward achieving precision health methodological advances in statistical prediction automated identification of diabetic retinopathy using deep learning change what? identifying quality improvement targets by investigating usual mental health care. 
administration and policy in mental health and mental health services research an artificial neural network for movement pattern analysis to estimate blood alcohol content level the association of therapeutic alliance with long-term outcome in a guided internet intervention for depression: secondary analysis from a randomized control trial feedback from outcome measures and treatment effectiveness, treatment efficiency, and collaborative practice: a systematic review. administration and policy in mental health and psychotherapy expertise should mean superior outcomes and demonstrable improvement over time veterans affairs and the department of defense mental health apps: a systematic literature review how are information and communication technologies supporting routine outcome monitoring and measurement-based care in psychotherapy? a systematic review detecting depression and mental illness on social media: an integrative review the gap between science and practice: how therapists make their clinical decisions challenges for ai: or what ai (currently) can't do r. a. fisher and his advocacy of randomization smartphone sensing methods for studying behavior in everyday life. current opinion in behavioral sciences the future of digital psychiatry effective nurse-patient relationships in mental health care: a systematic review of interventions to improve the therapeutic alliance prediction of rtms treatment response in major depressive disorder using machine learning techniques and nonlinear features of eeg signal big data and causality the value of psychiatric diagnoses applied research design: a practical guide family empowerment: a theoretically driven intervention and evaluation measuring the implementation of behavioral intervention technologies: recharacterization of established outcomes therapist expertise in psychotherapy revisited deep learning: a technology with the potential to transform health care mental health strategies to combat the psychological impact of covid- beyond paranoia and panic interpreting nullity: the fort bragg experiment: a comparative success or failure? improving mental health access for low-income children and families in the primary care setting overview of the national evaluation of the comprehensive community mental health services for children and their families program multidisciplinary research priorities for the covid- pandemic: a call for action for mental health science methods of delivering progress feedback to optimise patient outcomes: the value of expected treatment trajectories big data and the precision medicine revolution psychosocial interventions for people with both severe mental illness and substance misuse technology-enhanced human interaction in psychotherapy iom roundtable on evidence-based medicine: the learning healthcare system: workshop summary challenges of using progress monitoring measures: insights from practicing clinicians machine learning and psychological research: the unexplored effect of measurement an upper limit to youth psychotherapy benefit? 
a meta-analytic copula approach to psychotherapy outcomes obama gives east room rollout to precision medicine initiative predictors of response to repetitive transcranial magnetic stimulation in depression: a review of recent updates identification of patients in need of advanced care for depression using data extracted from a statewide health information exchange: a machine learning approach diagnosis of human psychological disorders using supervised learning and nature-inspired computing techniques: a meta-analysis annual research review: expanding mental health services through novel models of intervention delivery exploring the black box: measuring youth treatment process and progress in usual care. administration and policy in mental health and the session report form (srf): are clinicians addressing issues of concern to youth and caregivers? administration and policy in mental health and mental health services research personalized evidence based medicine: predictive approaches to heterogeneous treatment effects machine learning methods for developing precision treatment rules with observational data a preliminary precision treatment rule for remission of suicide ideation. suicide and life-threatening behavior predicting suicides after psychiatric hospitalization in u.s. army soldiers: the army study to assess risk and resilience in servicemembers (army starrs) therapist effects in mental health service outcome. administration and policy in mental health and mental health services research is there a future for therapists? administration and policy in mental health and mental health services research the metamorphosis immunopharmacogenomics towards personalized cancer immunotherapy targeting neoantigens examination of real-time fluctuations in suicidal ideation and its risk factors: results from two ecological momentary assessment studies the effects of feedback interventions on performance: a historical review, a meta-analysis, and a preliminary feedback intervention theory academy of science of south africa (assaf)). available at preventing cytokine storm syndrome in covid- using α- adrenergic receptor antagonists prediction models of functional outcomes for individuals in the clinical high-risk state for psychosis or with recent-onset depression: a multimodal, multisite machine learning analysis predicting response to repetitive transcranial magnetic stimulation in patients with schizophrenia using structural magnetic resonance imaging: a multisite machine learning analysis how artificial intelligence could redefine clinical trials in cardiovascular medicine: lessons learned from oncology building machines that learn and think like people collecting and delivering progress feedback: a meta-analysis of routine outcome monitoring an overview of scientific reproducibility: consideration of relevant issues for behavior science/analysis the $ billion self-improvement market adjusts to a new generation opportunities for and tensions surrounding the use of technology-enabled mental health services in community mental health care. 
administration and policy in mental health and mental health services research predicting major mental illness: ethical and practical considerations modified causal forests for estimating heterogeneous causal effects applications of machine learning algorithms to predict therapeutic outcomes in depression: a meta-analysis and systematic review development and validation of multivariable prediction models of remission, recovery, and quality of life outcomes in people with first episode psychosis: a machine learning approach a framework for advancing precision medicine in clinical trials for mental disorders implementing measurement-based care in behavioral health: a review ethics in the era of big data a comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis difficulties and challenges in the development of precision medicine how to read articles that use machine learning: users' guides to the medical literature detecting cancer metastases on gigapixel pathology images the future of precision medicine: potential impacts for health technology assessment automated assessment of psychiatric disorders using speech: a systematic review towards integrating personalized feedback research into clinical practice: development of the trier treatment navigator (ttn) implementing routine outcome measurement in psychosocial interventions: a systematic review artificial intelligence in clinical decision support: challenges for evaluating ai and practical implications causal discovery algorithms: a practical guide applications of machine learning in addiction studies: a systematic review. psychiatry research practitioner review: empirical evolution of youth psychotherapy toward transdiagnostic approaches deep learning: a critical appraisal clinical or gimmickal: the use and effectiveness of mobile mental health apps for treating anxiety and depression research review: multi-informant integration in child and adolescent psychopathology diagnosis artificial intelligence in health care: the hope, the hype, the promise, the peril in the face of stress: interpreting individual differences in stress-induced facial expressions rapid detection of internalizing diagnosis in young children enabled by wearable sensors and machine learning the state of mental health in america improving moderator responsiveness in online peer support through automated triage recent advances in deep learning: an overview key considerations for incorporating conversational ai in psychotherapy the ethics of biomedical 'big data' analytics a randomized noninferiority trial evaluating remotely-delivered stepped care for depression using internet cognitive behavioral therapy (cbt) and telephone cbt personal sensing: understanding mental health using ubiquitous sensors and machine learning . the complex etiology of ptsd in children with maltreatment applying big data methods to understanding human behavior and health comparative efficacy and acceptability of non-surgical brain stimulation for the acute treatment of major depressive episodes in adults: systematic review and network meta-analysis rethinking the experiment: necessary (r)evolution the angel and the assassin: the tiny brain cell that changed the course of medicine. ballantine. national institutes of health, central resource for grants and funding information is deep learning already hitting its limitations? and is another ai winter coming? 
towards data science neurobiological mechanisms of repetitive transcranial magnetic stimulation of the dorsolateral prefrontal cortex in depression: a systematic review progress in childhood cancer: years of research collaboration, a report from the children's oncology group dissecting racial bias in an algorithm used to manage the health of populations estimating the reproducibility of psychological science research handbook on the law of artificial intelligence using learning analytics to scale the provision of personalised feedback integrity of evidence-based practice: are providers modifying practice content or practice sequencing? administration and policy in mental health and mental health services research computational approaches and machine learning for individual-level treatment predictions a critical review of consumer wearables, mobile applications, and equipment for providing biofeedback, monitoring stress, and sleep in physically active populations causality: models, reasoning, and inference causal inference in statistics: an overview the seven tools of causal inference, with reflections on machine learning a machine learning ensemble to predict treatment outcomes following an internet intervention for depression can machine learning improve screening for targeted delinquency prevention programs? prevention science mechanism of repetitive transcranial magnetic stimulation for depression. shanghai archives of psychiatry what makes a good counselor? learning to distinguish between high-quality and low-quality counseling conversations experiencing mental health diagnosis: a systematic review of service user, clinician, and carer perspectives across clinical settings abandoning personalization to get to precision in the pharmacotherapy of depression endocrine and immune effects of non-convulsive neurostimulation in depression: a systematic review. brain, behavior, and immunity can machine learning help us in dealing with treatment resistant depression? a review potential liability for physicians using artificial intelligence a nationwide survey of psychological distress among chinese people in the covid- epidemic: implications and policy recommendations emotion recognition using smart watch sensor data: mixed-design study covid- and mental health: a review of the existing literature chexnet: radiologist-level pneumonia detection on chest x-rays with deep learning a million variables and more: the fast greedy equivalence search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images transparency in algorithmic decision making a machine learning approach to predicting psychosis using semantic density and latent content analysis. 
npj schizophrenia the peabody treatment progress battery: history and methods for developing a comprehensive measurement battery for youth mental health using program theory to link social psychology and program evaluation stand-alone artificial intelligence for breast cancer detection in mammography: comparison with radiologists big data analytics and ai in mental healthcare the secrets of machine learning: ten things you wish you had known earlier to be more effective at data analysis speculations on the future of psychiatric diagnosis machine learning and big data in psychiatry: toward clinical applications using artificial intelligence to assess clinicians' communication skills causal protein-signaling networks derived from multiparameter single-cell data psychotherapists openness to routine naturalistic idiographic research individualized patient-progress systems: why we need to move towards a personalized evaluation of psychological treatments patient-centered assessment in psychotherapy: a review of individualized tools the individualised patient-progress system: a decade of international collaborative networking explainable artificial intelligence: understanding, visualizing and interpreting deep learning models the effectiveness of school-based mental health services for elementary-aged children: a metaanalysis editorial: in the causal labyrinth: finding the path from early trauma to neurodevelopment personalized and precision medicine informatics. a workflow-based view computational causal discovery for posttraumatic stress in police officers machine learning methods to predict child posttraumatic stress: a proof of concept study a complex systems approach to causal discovery in psychiatry personal values and esp scores would you follow my instructions if i was not human? examining obedience towards virtual agents time for one-person trials probability of an approaching ai winter. towards data science ecological momentary interventions for depression and anxiety the present and future of precision medicine in psychiatry: focus on clinical psychopharmacology of antidepressants machine learning in mental health: a scoping review of methods and applications psychosocial interventions and immune system function: a systematic review and meta-analysis of randomized clinical trials non-gaussian methods for causal structure learning telepsychiatry and the coronavirus disease pandemic-current and future outcomes of the rapid virtualization of psychiatric care a svm-based classification approach for obsessive compulsive disorder by oxidative stress biomarkers good news: artificial intelligence in psychiatry is actually neither psychoneuroimmunology of stress and mental health social safety theory: a biologically based evolutionary perspective on life stress, health, and behavior from stress to inflammation and major depressive disorder: a social signal transduction theory of depression efficacy of repetitive transcranial magnetic stimulation in treatment-resistant depression: the evidence thus far. 
general psychiatry chronic systemic inflammation is associated with symptoms of latelife depression: the aric study outcomes from wraparound and multisystemic therapy in a center for mental health services system-of-care demonstration site gems [gene expression model selector]: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data final report on the aspirin component of the ongoing physicians' health study everybody lies: big data, new data and what the internet can tell us about who we really are cities ask if it's time to defund police and 'reimagine' public safety a system of care for children and youth with severe emotional disturbances study finds psychiatric diagnosis to be 'scientifically meaningless the "average" treatment effect: a construct ripe for retirement. a commentary on deaton and cartwright using effect size-or why the p value is not enough everything you wanted to know about smart health care: evaluating the different technologies and components of the internet of things for better health commercial influences on electronic health records and adverse effects on clinical decision making primed for psychiatry: the role of artificial intelligence and machine learning in the optimization of depression treatment a comparison of natural language processing methods for automated coding of motivational interviewing development and evaluation of clientbot: patientlike conversational agent to train basic counseling skills will machine learning enable us to finally cut the gordian knot of schizophrenia? schizophrenia empirically based mean effect size distributions for universal prevention programs targeting school-aged youth: a review of meta-analyses outcome and progress monitoring in psychotherapy: report of a canadian psychological association task force think s-ai-is-moreprofo und-than-fire#:~:text=%e % % cai% i s% o ne% o f% t he repetitive transcranial magnetic stimulation elicits antidepressant-and anxiolytic-like effect via nuclear factor-e -related factor -mediated anti-inflammation mechanism in rats deep medicine: how artificial intelligence can make healthcare human again high-performance medicine: the convergence of human and artificial intelligence applications of machine learning in real-life digital health interventions: review of the literature adapting evidence-based treatments for digital technologies: a critical review of functions, tools, and the use of branded solutions an understanding of ai's limitations is starting to sink in chatbots and conversational agents in mental health: a review of the psychiatric landscape the neuroactive potential of the human gut microbiota in quality of life and depression recommendations for policy and practice of telepsychotherapy and e-mental health in europe and beyond fairer machine learning in the real world: mitigating discrimination without collecting sensitive data internet-delivered cognitive behavior therapy for children and adolescents: a systematic review and meta-analysis how to survive in the post-human era. in interdisciplinary science and engineering in the era of cyberspace a systematic literature review of the clinical efficacy of repetitive transcranial magnetic stimulation (rtms) in non-treatment resistant patients with major depressive disorder are we heading for another ai winter soon? forbes the great psychotherapy debate: the evidence for what makes psychotherapy work what do clinicians treat: diagnoses or symptoms? 
the incremental validity of a symptom-based, dimensional characterization of emotional disorders in predicting medication prescription patterns evidence-based youth psychotherapies versus usual clinical care: a meta-analysis of direct comparisons more of what? issues raised by the fort bragg study performance of evidence-based youth psychotherapies compared with usual clinical care: a multilevel meta-analysis what five decades of research tells us about the effects of youth psychological therapy: a multilevel meta-analysis and implications for science and practice are psychotherapies for young people growing stronger? tracking trends over time for youth anxiety, depression, attention-deficit/hyperactivity disorder, and conduct problems compliance with mobile ecological momentary assessment protocols in children and adolescents: a systematic review and meta-analysis can machine-learning improve cardiovascular risk prediction using routine clinical data? advances in statistical methods for causal inference in prevention science: introduction to the special section elon musk vows fully self-driving teslas this year and 'robotaxis' ready next year. usa today solutions to problems with deep learning computer-assisted cognitive-behavior therapy and mobile apps for depression and anxiety comparing sequencing assays and human-machine analyses in actionable genomics for glioblastoma development and validation of a machine learning individualized treatment rule in first-episode schizophrenia racing towards a digital paradise or a digital hell if we build it, will they come? issues of engagement with digital health interventions for trauma recovery. mhealth, , the role of telehealth in reducing the mental health burden from covid- . telemedicine and e-health is the alliance really therapeutic? revisiting this question in light of recent methodological advances major developments in methods addressing for whom psychotherapy may work and why key: cord- - fu blu authors: lazarus, ross; yih, katherine; platt, richard title: distributed data processing for public health surveillance date: - - journal: bmc public health doi: . / - - - sha: doc_id: cord_uid: fu blu background: many systems for routine public health surveillance rely on centralized collection of potentially identifiable, individual, identifiable personal health information (phi) records. although individual, identifiable patient records are essential for conditions for which there is mandated reporting, such as tuberculosis or sexually transmitted diseases, they are not routinely required for effective syndromic surveillance. public concern about the routine collection of large quantities of phi to support non-traditional public health functions may make alternative surveillance methods that do not rely on centralized identifiable phi databases increasingly desirable. methods: the national bioterrorism syndromic surveillance demonstration program (ndp) is an example of one alternative model. all phi in this system is initially processed within the secured infrastructure of the health care provider that collects and holds the data, using uniform software distributed and supported by the ndp. only highly aggregated count data is transferred to the datacenter for statistical processing and display. results: detailed, patient level information is readily available to the health care provider to elucidate signals observed in the aggregated data, or for ad hoc queries. 
we briefly describe the benefits and disadvantages associated with this distributed processing model for routine automated syndromic surveillance. conclusion: for well-defined surveillance requirements, the model can be successfully deployed with very low risk of inadvertent disclosure of phi – a feature that may make participation in surveillance systems more feasible for organizations and more appealing to the individuals whose phi they hold. it is possible to design and implement distributed systems to support non-routine public health needs if required. timely identification of and subsequent reaction to a public health emergency require routine collection of appropriate and accurate data about the occurrence and location of cases of illness. there is substantial interest in using routinely collected electronic health records to support both the detection of unusual clusters of public health events and the response to public health threats detected by other means. such data are also useful to reduce an initial alert level, if it is clear that no unusual illness clusters exist in a community. ideally, such systems operate automatically and include sensitive and specific statistical surveillance software and alerting systems. these are often referred to as syndromic surveillance systems [ , ], because they typically rely on the non-specific signs and symptoms that may provide the earliest evidence of a serious public health threat, such as anthrax or sars. many syndromic surveillance systems gather potentially identifiable, individual patient-level encounter records. these records are typically collected without name or address, but they do contain enough identifiers to allow re-identification in some circumstances. the potential for re-identification is greatest when records are collected from ambulatory settings or health systems that supply a unique identifier that allows the very useful identification of repeated visits over time. the risk of disclosing sensitive information that can be linked to the individual also increases when the health care facility provides more than occasional care. in the united states, the health insurance portability and accountability act [ ] (hipaa) specifically exempts transfer, use and retention of identifiable electronic personal health information (phi) to support public health activities. this exemption also applies to syndromic surveillance activities, although hipaa was developed before large volumes of such data concerning individuals who are not suspected of having a reportable condition were being used for public health purposes in the absence of any known public health emergency. despite the exemption, data providers may be unwilling to offer identifiable data for surveillance purposes in the face of increasing awareness of the potential costs of inadvertent disclosure or inappropriate use of phi. additionally, their patients may object to their providing it. these concerns are common to many developed countries, and under these circumstances designs that minimise the risk of inadvertent disclosure may be needed to gain the cooperation of data custodians and make surveillance systems feasible. the focus of this paper is on one such design, in which initial data aggregation is performed to decrease the risk of any phi being inadvertently disclosed, before the aggregate data is centralised for subsequent statistical analysis.
although the system we describe is currently operating in the united states and many of the implementation details are specific to that context, some of the conceptual issues we describe and some of the lessons we have learned may be directly relevant to public health practice in other countries. while it is possible to centrally collate and process deidentified records, there is a potential problem with statistical inference if multiple records from the same individual are not distinguished. this problem arises because many statistical analysis techniques applicable to surveillance, such as generalised linear mixed models [ ] (glmm), depend on the assumption that observations are statistically independent. inference based on this assumption using ambulatory care encounter data will likely be biased if the model cannot distinguish observations from multiple encounters during a single course of illness from a single individual patient. although the extent of this bias has not been quantified, the problem is clearly illustrated by real data. in more than half of the individuals with multiple lower respiratory syndrome encounters over a four year period from one large ambulatory care practice, a second encounter with the same syndrome was noted less than days after the first encounter [ ] . our approach to this problem of statistical independence is to aggregate multiple encounters from a single individual into "episodes" of illness, and is described in more detail below. reliably automating this aggregation requires that every patient's records be uniquely identifiable. to support the national demonstration bioterrorism surveillance program (ndp), we developed a system in which no phi leaves the immediate control of the data provider, and only aggregate data is transferred to the datacenter [ , ] . each data provider performs initial aggregation of the phi within their own existing, secured data processing environment, producing data that is aggregated beyond the point where any individual patient is identifiable. since data processing is distributed to the site of data collection rather than being performed at one central location, we describe this as a distributed processing surveillance system. although this particular aspect of our work has briefly been mentioned in previous publications [ , , [ ] [ ] [ ] , we present it in greater detail here, because we believe that it represents a potentially valuable alternative surveillance system design option that deserves more explanation and wider debate than it has received to date. the basic principle of distributed processing is simple. rather than collecting all needed identifiable, individual phi records centrally for statistical processing, all phi is pre-processed remotely, and remains secured, under the direct control of the data provider. only aggregate data are transferred to the central datacenter for additional statistical processing, signal detection, display and distribution. at an appropriate level of aggregation, the risk of inadvertent phi disclosure becomes very small, and may prove acceptable to data custodians and to individual patients. although this risk is never completely absent, it is certainly decreased in aggregate data, making this approach far more acceptable to data providers in our experience, than the more traditional approach of centralized collection of directly identifiable phi. 
before describing our distributed system, we briefly review the more familiar model of centralized aggregation and processing of phi for surveillance. in the more traditional type of system, individual patient records, often containing potentially identifiable information, such as date of birth and exact or approximate home address, are transferred, usually in electronic form, preferably through some secured method, to a central secured repository, where statistical tools can be used to develop and refine surveillance procedures. one of the main benefits of this data-processing model is that the software and statistical methods can be changed relatively easily to accommodate changes in requirements, because they only need to be changed at the one central location where analysis is taking place. as long as appropriate details have been captured for each individual encounter of interest, the raw data can be re-coded or manipulated in different ways. only one suite of analysis code is needed, and because it is maintained at a single, central location, costs for upgrading and maintenance are small. inadvertent disclosure of phi is always a potential risk with centralized systems. even where minimally identifiable data are stored in each record, the probability of being able to unambiguously identify an individual increases as multiple, potentially linkable records for that individual accrue over time. rather than gathering identifiable phi into a central repository for analysis, a distributed system moves some of the initial data processing, such as counting aggregated episodes of care (see below), to the site where the data is being collected. this aggregation minimizes the number of individuals who have access to phi and diminishes the risk of inadvertent phi disclosure from the surveillance system, while still allowing effective use of the information of interest. the focus of this report is on the model used to collect surveillance data while providing maximum protection for phi, so the statistical methods we use in the ndp, which have been described elsewhere [ ], are not discussed further here. data flows for the ndp are illustrated in figure (distributed processing model and data flow). data pre-processing, detection of repeated visits by the same patient for the same syndrome, and data aggregation are performed using a custom software package, written, maintained, and distributed by the ndp datacenter. data providers maintain complete control of the security of their own phi and also maintain control over the operation of the data processing software, which runs on one of their secured workstations. since the pre-processing takes place within a secured environment under the control of the data provider, there is no need for the individual patient identifiers to be divulged to the datacenter. in the case of the ndp [ ], the only data that is centrally collated consists of counts of the number of new episodes of specific syndromes over a defined time period (currently set at each hour period ending at midnight), by geographic area (currently, -digit zip code area). more detailed definitions of "syndromes" and "new episodes" are provided below. table illustrates the data transferred from each data provider each day to the datacenter for statistical processing, reporting and alerting.
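to make the aggregation step concrete, the following is a minimal python sketch, not the ndp production code, of how provider-side new-episode records could be collapsed into the daily counts by syndrome and residential zip code that are the only data transferred to the datacenter. the record layout, field names, dates and zip codes shown are illustrative assumptions, not the actual ndp file format.

```python
from collections import defaultdict

def aggregate_episode_counts(new_episodes):
    """Collapse new-episode records into counts keyed by (date, zip code, syndrome).

    `new_episodes` is an iterable of dicts with 'date', 'zip' and 'syndrome'
    keys (illustrative field names). The result contains no patient-level
    identifiers, only the aggregate counts that leave the provider's site.
    """
    counts = defaultdict(int)
    for episode in new_episodes:
        counts[(episode["date"], episode["zip"], episode["syndrome"])] += 1
    return counts

# Example: three new episodes on one day collapse into two aggregate rows.
episodes = [
    {"date": "2004-02-01", "zip": "02115", "syndrome": "respiratory"},
    {"date": "2004-02-01", "zip": "02115", "syndrome": "respiratory"},
    {"date": "2004-02-01", "zip": "02139", "syndrome": "lower GI"},
]
for (day, zipcode, syndrome), n in sorted(aggregate_episode_counts(episodes).items()):
    print(day, zipcode, syndrome, n)
```

only rows of this aggregate form leave the provider's infrastructure; the patient-level detail needed to investigate a signal stays behind the provider's firewall.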
note that although this data does not contain any obvious identifiers such as date of birth or gender, there is always a risk that a specific individual might be identifiable using additional data, and that this risk is greatest in zip codes with very small populations. all source code required to build the data processing software is provided to the data provider at installation and whenever the software is updated, so that the local information services staff can check that there are no "backdoors" or other ways the distributed software could compromise the security of their systems. all information transferred to the datacenter is stored in text files (in xml format) and can be readily accessed by local staff to ensure that no phi is being transmitted. participating data providers have near real-time icd codes for every encounter, usually assigned by clinicians at the time of the encounter. since much acute infectious disease manifests as broad suites of nonspecific symptoms, we monitor the following syndromes: respiratory, lower gastro-intestinal (gi), upper gi, neurological, botulism-like, fever, hemorrhagic, skin lesions, lymphatic, rash, shock-death, influenza-like illness and sars-like illness. all syndromes except influenza-like illness and sars-like illness were defined by a working group led by the cdc and the department of defense [ ]. individual icd codes are used to aggregate encounters into one of these syndromes. the definitions (icd code lists) for all but two of these syndromes are available [ ]; the definitions of the other two syndromes were developed in consultation with both the cdc and the massachusetts department of public health. our surveillance algorithms [ ] require statistically independent observations and are based on new episodes of syndromes. our goal was to distinguish health care encounters that were related to ongoing care for any given episode of acute illness from the initial encounter that indicated the start of a new episode of a syndrome of interest. the derivation of the specific method for identifying first encounters for an episode of illness has been described in more detail elsewhere [ ]. we define a new episode to begin at the first encounter after a minimum encounter-free interval, measured in days, for that specific patient and that specific syndrome. if there has been any encounter for that specific syndrome for the same individual patient within this interval, the current encounter is regarded as part of the usual ongoing care for the original encounter that signalled the start of an episode of illness of that syndrome. the start of a new episode for a different syndrome can occur during ongoing encounters for any given specific syndrome; ongoing encounters are counted as new episodes only if they fall outside (i.e., after the minimum encounter-free interval since the last encounter of) an existing episode of the matching syndrome. as will be described later, all ongoing encounters within any syndrome are recorded, and are visible through reports under the control of the data provider, but they do not contribute to the counts that are sent to the datacenter for analysis. all of this processing requires consistent and unique patient identifiers for all encounters. we use the local patient master index record number for this purpose in the software that we provide, but these identifiers are not required once the processing is complete, and they remain under the complete control of the providers.
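the episode rule just described can be sketched as follows. this is an illustrative reimplementation, not the ndp code, and the length of the encounter-free window (`window_days`) is an assumed placeholder parameter rather than a value taken from the text.

```python
from datetime import timedelta

def new_episode_starts(encounters, window_days=42):
    """Return the encounters that start a new episode.

    `encounters` is an iterable of (patient_id, syndrome, encounter_date) tuples.
    An encounter starts a new episode if the same patient has had no encounter
    for the same syndrome within the preceding `window_days` days; otherwise it
    is treated as ongoing care for the existing episode. The window length used
    here is an assumed placeholder, not the value used by the NDP.
    """
    last_seen = {}   # (patient_id, syndrome) -> date of most recent encounter
    starts = []
    for patient_id, syndrome, when in sorted(encounters, key=lambda e: e[2]):
        key = (patient_id, syndrome)
        previous = last_seen.get(key)
        if previous is None or (when - previous) > timedelta(days=window_days):
            starts.append((patient_id, syndrome, when))
        last_seen[key] = when   # ongoing encounters still extend the episode window
    return starts
```

only the rows returned by `new_episode_starts` feed the aggregate counts; the ongoing-care encounters remain visible only in the local reports.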
the distributed software requires the data providers to extract information about encounters of interest (daily, in our case) and convert it into the uniform format that the software expects. this kind of uniform representation is required for any multi-source surveillance system and is not peculiar to the distributed model we have adopted. in practice, we found that data providers could easily produce text files containing data as comma-separated values in the format we specified and which the distributed software has been written to process; however, this requires dedicated programming effort that was supported with resources from the ndp grant. our project receives support from the cdc, so we are required to comply with relevant cdc standards. although the data being transferred to the datacenter is arguably not identifiable phi because of the high level of aggregation, we use the public health information network messaging system (phinms) [ ], a freely available, secure data transfer software suite developed by the cdc, to transfer aggregate data. a phinms server operates at the datacenter and each data provider operates a phinms client, using a security certificate supplied by the datacenter for encryption and authentication. phinms allows fully automated operation at both the datacenter and at each data provider. phinms communicates over an encrypted channel and usually requires no special modification to the data provider firewall, since transfers are only ever initiated by an outgoing request (the data provider always initiates the transfer of new data) and use the same firewall port and protocol (ssl on port ) as commercially encrypted services such as internet banking. phinms is reasonably robust to temporary connectivity problems, as it will try to resend all messages in the queue until they are delivered. data transmission is one of the least problematic aspects of maintaining this system. we provide automatic installation software and, in our experience, it runs more or less instantaneously and transparently, without intervention. no training is needed as the process is fully automated. all data is transferred to the datacenter in the form of extensible markup language (xml), since this is a flexible machine-readable representation and is easy to integrate with phinms. we used the python [ ] language for the development of the distributed software package. this choice was motivated partly by the fact that python is an open-source language and thus freely distributable, partly by our very positive experience with python as a general-purpose application development language, and partly because, in our experience, python can be installed, and applications reliably run without any change to source code, on all common operating systems (including linux, unix, macintosh and windows), making it easy for the datacenter to provide support for systems other than windows pcs. it is also a language with extensive support for standards such as xml and for securely encrypted internet connections. in addition, our existing web infrastructure has been built with the open-source zope [ ] web application framework, which is written mostly in python. a major design goal for our distributed software was that it should offer potentially useful functions for the data provider.
this was motivated by our desire to encourage data providers to look at their own data in different ways that might not only help them manage the data more efficiently, but might also help them to more easily identify errors. in our experience, the task of maintaining a system like the one we have developed is far more attractive and interesting to the staff responsible at each participating institution if they gain some tangible, useful and immediate benefits. in addition, easy access to data flowing through our software is useful for ensuring transparency and for facilitating security auditing by each data provider. the distributed software optionally creates reports that show one line of detailed information about each of the patient encounters that was counted in the aggregate data for each day's processing. these reports are termed "line lists" and were designed to support detailed reporting of encounter-level data, so that a data provider can quickly make this information available in response to a public health need. two versions are available, one with and one without the most specific identifying details, such as patient name and address. these standard line lists are used most often to support requests by public health agencies for additional information about the individual cases that contribute to clusters identified in the aggregate data. these lists are never transmitted to the datacenter but may be used to support public health officials investigating a potential event. when unexpectedly high counts of particular syndromes are detected in geographically defined areas, the datacenter automatically generates electronic alerts, which are automatically routed to appropriate public health authorities. for example, in massachusetts, electronic messages are automatically sent to the massachusetts alert network within minutes of detection, where they are immediately forwarded to the appropriate public health personnel for follow-up. available alert delivery methods in the massachusetts system range from email through to an automated telephone text-to-speech delivery system, and responders can configure the alert delivery method for each type of alert they have subscribed to. this alerting system is independent of our distributed system, but in practice, the ready availability of reports in electronic format containing both fully and partially identifiable clinical data for all cases comprising any particular period or syndrome makes the task of the clinical responder much simpler whenever a query is received from a public health official. electronic reports, containing clinical information and, optionally, full identifiers for all encounters, can be generated as required at the provider's site, from where they can immediately be made available to public health agencies. in the ndp's current operational mode (see figure ), a public health official calls a designated clinical responder to obtain this information. the "wide" version of the line list contains the full set of identifiers (table ); the "narrow" version, which contains fewer identifiers, provides each patient's five-year age group instead of date of birth and does not include the physician id or medical record number (table ). at the provider's discretion, the clinical responder can provide the "narrow" list corresponding to the cases of interest to the public health department.
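the distinction between the "wide" and "narrow" line lists can be illustrated with a small sketch that derives a narrow row from a wide one by replacing the date of birth with a five-year age group and dropping the most specific identifiers; the field names are hypothetical and the real reports contain more detail.

```python
from datetime import date

def five_year_age_group(dob: date, on: date) -> str:
    """Map a date of birth to a five-year age band, e.g. '35-39'."""
    age = on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))
    low = (age // 5) * 5
    return f"{low}-{low + 4}"

def narrow_line(wide: dict, on: date) -> dict:
    """Derive a 'narrow' line-list row from a 'wide' one (hypothetical fields)."""
    dropped = {"name", "address", "dob", "physician_id", "medical_record_number"}
    keep = {k: v for k, v in wide.items() if k not in dropped}
    keep["age_group"] = five_year_age_group(wide["dob"], on)
    return keep

wide_row = {"index": 1, "date": "2005-03-01", "syndrome": "respiratory",
            "name": "jane doe", "address": "12 example street", "dob": date(1968, 7, 4),
            "physician_id": "dr-42", "medical_record_number": "mrn-123", "zip": "02115"}
print(narrow_line(wide_row, on=date(2005, 3, 1)))
```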
if, on this basis, public health officials decide that further investigation is warranted, they can call the clinical provider and request a review of medical records, identifying the cases of interest by date and an index number (unique within date) in the narrow line list. the clinician finds the medical record number by looking up the date and index number in the wide line list and then accesses the record itself through the usual hmo-specific means. resources to support clinical responders were provided through our ndp grant to participating data providers. it would be straightforward to send detailed lists of encounters that are part of clusters directly to the relevant health department whenever the datacenter detects an event and sends an automated alert to a health department. we have not implemented this feature because all the participating health plans prefer to have an on-site clinical responder participate in the initial case evaluation with the public health agency. it would also be simple to allow designated public health personnel to initiate requests for specific line lists, even when no alert has occurred. public health officials may, on occasion, wish to inspect the line lists to search for specific diagnoses that do not occur frequently enough to trigger an alert for their syndrome, but may be meaningful in the context of information that arises from other sources. although not currently implemented in the ndp, it would be feasible to allow a remote user to perform ad-hoc queries on the encounter data maintained by the health plan. examples of these queries include focused assessment of disease conditions affecting subsets of the population or specific diagnoses. this type of direct query capability is currently used at some of the same participating health plans to support the cdc's vaccine safety datalink project [ ], a surveillance system that supports post-marketing surveillance of vaccine safety [ ]. this distributed data model supports active surveillance and alerting of public health agencies in five states with participating data providers. the system has proven to be workable, and it supports the syndromic surveillance needs of the participating health departments. there are fixed costs, such as programming to produce the standard input files, installation and training, associated with adding each new data provider, so we have focussed our efforts on large group practices providing ambulatory care with substantial daily volumes of encounters, completely paperless electronic medical record systems, and substantial technical resources, since these enable us to capture large volumes of transactions with each installation. relatively large numbers of encounters are needed to ensure that estimates from statistical modelling are robust. applying a distributed architecture to surveillance from multiple smaller practices may enable appropriately large numbers of encounters to be gathered, but may prove infeasible because of costs, lack of appropriate internal technical support, and heterogeneity in the way icd codes are recorded and assigned by each data provider. once the programming for standard input files is completed, installation and training take approximately one day in total, usually spread out over the first two weeks.
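to illustrate the kind of "standard input file" programming referred to above, the sketch below parses a hypothetical comma-separated encounter extract into records ready for the episode and aggregation steps; the column layout is invented and is not the ndp's actual specification.

```python
import csv
from datetime import datetime
from io import StringIO

# hypothetical column layout for the provider-generated extract
FIELDS = ["patient_master_index", "encounter_date", "icd_code", "zip_code"]

def parse_extract(text: str):
    """Parse a comma-separated encounter extract (invented layout) into dicts,
    reporting malformed rows instead of silently processing them."""
    rows, errors = [], []
    reader = csv.DictReader(StringIO(text), fieldnames=FIELDS)
    for line_no, row in enumerate(reader, start=1):
        try:
            rows.append({
                "patient_id": row["patient_master_index"].strip(),
                "date": datetime.strptime(row["encounter_date"].strip(), "%Y-%m-%d").date(),
                "icd_code": row["icd_code"].strip(),
                "zip_code": row["zip_code"].strip(),
            })
        except (AttributeError, ValueError) as exc:
            errors.append((line_no, str(exc)))   # surface input errors early, as discussed below
    return rows, errors

sample = "12345,2005-03-01,466.0,02115\n12346,2005-03-01,bad-date,02139\n"
records, problems = parse_extract(sample)
print(len(records), "good rows;", len(problems), "rejected")
```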
nearly all problems are related to providers getting the standard file format contents exactly right, and to transferring these files for processing. the distributed architecture currently in use by the ndp allows clinical facilities to provide the aggregated information needed to support rapid and efficient syndromic surveillance, while maintaining control over the identifiable phi and clinical data that supports this surveillance. the system provides support for the clinical providers to respond quickly to public health requests for detailed information when this is needed. in our experience, such requests involve only a tiny fraction of the data that would be transferred in a centralized surveillance model, providing adequate support for public health with minimal risk of inadvertent disclosure of identifiable phi. we believe this design, in which patients' clinical data remains with their own provider under most circumstances while public health needs are still effectively met, conforms to the public's expectations, and so will be easier to justify if these surveillance systems come under public scrutiny. many of the details of our approach are specific to the united states context, but the general principle of using distributed processing to minimise the risk of inadvertent phi disclosure is of potential utility in other developed countries, although the specifics of our implementation may be less useful. the benefit of decreased risk of inadvertent phi disclosure from our approach entails three principal disadvantages compared with routine, centralized collection of identifiable data. first, a clinical responder with access to the locally stored phi data must be available to provide case-level information when a cluster is detected. it would be technically straightforward to provide detailed information for relevant cases automatically when signals are detected; we deliberately did not implement this feature in the current system, since the participating health plans expressed a strong preference for direct involvement in this process. the second disadvantage is the need to pre-specify the syndromes, age groups, and other data aggregation parameters in advance, since changing these requires the distribution of a new release of the aggregation software. in practice, we have addressed this by means of configuration data for syndrome categories read from a text file as the application loads, so the application code itself does not need alteration. (figure: distributed software screen, showing results (synthetic data) after daily processing of encounter records.) this limitation could be largely overcome by creating a remote query capability to support ad-hoc queries on identifiable data that remains in the control of the provider. the third disadvantage is the technical challenge of maintaining distributed software that must reliably process data that the programmers are not permitted to examine. while the software can be exhaustively tested on synthesized data, we have occasionally encountered subtle problems arising from previously unnoticed errors in the input data. our experience suggests that when writing this kind of distributed application, extensive effort must be devoted to detecting and clearly reporting errors in the input data before any processing takes place. an archive of python source code for the distributed software will be made available by the corresponding author upon request.
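the configuration-file approach mentioned above, which lets syndrome definitions change without a new software release, might look roughly like the following sketch; the file format shown is invented for illustration, mapping each syndrome name to the icd codes that feed it.

```python
def load_syndrome_map(path):
    """Read a text file of lines like 'respiratory: 460, 462, 466.0' into
    a dict mapping icd code -> syndrome name (hypothetical format)."""
    icd_to_syndrome = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue   # skip blank lines and comments
            syndrome, _, codes = line.partition(":")
            for code in codes.split(","):
                code = code.strip()
                if code:
                    icd_to_syndrome[code] = syndrome.strip()
    return icd_to_syndrome
```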
unfortunately no resources are available to provide technical or other support outside the ndp. in summary, we have implemented a near real-time syndromic surveillance system that includes automated detection and reporting to public health agencies of clusters of illness that meet pre-specified criteria for unusualness [ ]. this system uses a distributed architecture that allows the participating health care provider to maintain full control over potentially identifiable phi and health encounter data. the distributed software loads simple text files that can be created from the data stored in virtually any proprietary emr system. it sends summary data suitable for signal detection algorithms, via a freely available messaging system, to a datacenter that can manipulate the aggregated information and combine it with data from other providers serving the same geographic region, and which automatically generates and sends alerts when unusual clusters of syndromes are identified. the distributed software also facilitates efficient access to fully identified patient information when needed for following up a potential event.
references: using automated medical records for rapid identification of illness syndromes (syndromic surveillance): the example of lower respiratory infection; national bioterrorism syndromic surveillance demonstration program; us department of health & human services: health insurance portability and accountability act; a generalized linear mixed models approach for detecting incident clusters of disease in small areas, with an application to biological terrorism; syndromic surveillance using minimum transfer of identifiable data: the example of the national bioterrorism syndromic surveillance demonstration program; use of automated ambulatory-care encounter records for detection of acute illness clusters.
supported by u /ccu from the centers for disease control and prevention/massachusetts department of public health public cooperative agreement for health preparedness and response for bioterrorism, and by rfa-cd- - , center of excellence in public health informatics, from the centers for disease control and prevention. figure originally appeared in an article in the mmwr supplement [ ]. the author(s) declare that they have no competing interests. rl wrote the first draft of the manuscript after extensive discussions with ky and rp. ky and rp both made substantial intellectual contributions during the evolution of the submitted manuscript. ky prepared figure and all of the tables.
key: cord- - qld authors: agrawal, prashant; singh, anubhutie; raghavan, malavika; sharma, subodh; banerjee, subhashis title: an operational architecture for privacy-by-design in public service applications date: - - journal: nan doi: nan sha: doc_id: cord_uid: qld
governments around the world are trying to build large data registries for effective delivery of a variety of public services. however, these efforts are often undermined due to serious concerns over privacy risks associated with collection and processing of personally identifiable information.
while a rich set of special-purpose privacy-preserving techniques exists in computer science, these techniques are unable to provide end-to-end protection in alignment with legal principles in the absence of an overarching operational architecture to ensure purpose limitation and protection against insider attacks. this either leads to weak privacy protection in large designs, or to the adoption of overly defensive strategies that protect privacy by compromising on utility. in this paper, we present an operational architecture for privacy-by-design based on independent regulatory oversight stipulated by most data protection regimes, regulated access control, purpose limitation and data minimisation. we briefly discuss the feasibility of implementing our architecture based on existing techniques. we also present some sample case studies of privacy-preserving design sketches of challenging public service applications. a welfare state may have legitimate interests in building large data registries with personally identifiable information (pii) for efficiency of service delivery. a state may also legitimately need to put its residents under purpose-specific surveillance. in fact, several commentators have alluded to the possibility of pervasive under-the-skin surveillance in a post-covid world [ ]. however, mandatory recordings of pii require enacting reasonable and fair laws to ensure that the processing of pii is proportionate to the stated objective, and to safeguard the basic operative principles of privacy and fairness. citizens' basic rights need to be protected even when there is a legitimate state interest in digitisation with pii [ ]. the need to ensure that the information collected is not used adversely against citizens to harm them takes us into one of the hard problems of modern public policy: creating rules and technologies around information privacy to help strike this critical balance for online collection of pii at national scale. in this paper we address the problem of operationalising the broad privacy-by-design principles outlined in [ , ], in the context of large public service databases. we present an architecture for implementing the data protection principles after the utility and proportionality of an application have been established through an appropriate regulatory analysis [ , , ]. the general principles of fair and reasonable processing, purpose, collection and storage limitation, notice and consent, data quality, etc. have evolved since the s, both through sector-specific standards in the us such as the social security number protection act [ ] and the health insurance portability and accountability act (hipaa) [ ], and through omnibus laws in general data protection standards such as the gdpr in the european union [ ] and the draft data protection bill of india [ ]. however, they have largely failed to prevent both direct harms that can occur as a result of data breaches or through unauthorised access of personal data - such as identity thefts, unethical profiling and unlawful surveillance - and secondary harms that could arise due to the use of the data to adversely affect a person - such as through discrimination or exclusion, predatory targeting for unsuitable products, loss of employment, or inaccurate credit rating. dictums such as "personal data shall be processed in a fair and reasonable manner" are non-specific, and they do not adequately define the contours of the required regulatory actions.
as episodes like cambridge analytica [ ] demonstrate, harm is often not immediately obvious, and causal links of harm are not always easy to determine. this is compounded by the fact that data collection and use are becoming ubiquitous, making it hard to trace misuse; the effects of misuse of personal data may not immediately manifest, and when they do they may not be easily quantifiable in monetary terms despite causing grave distress. hence, ex-post accountability and punitive measures are largely ineffective, and it is imperative to operationalise ex-ante preventive principles. as a consequence of the weak protection standards, most attempts at building large public services like national identity systems [ , ], health registries [ , , ], national population and voter registries [ , , ], public credit registries [ , ], income [ ] and tax registries [ ] etc. have often been questioned on privacy and fairness grounds and have been difficult to operationalise. the concerns have invariably been related to the need for protective safeguards when large national data integration projects are contemplated by governments, and to the acknowledgment of the unprecedented surveillance power that such projects could create. in some situations the projects have even had to be abandoned altogether as they were unable to deal with these risks [ , , ]. in india too, the recent momentum and concerns around informational privacy guarantees have occurred in the context of the creation of new government databases and digital infrastructures for welfare delivery [ , ]. recording transactions with pii projects an individual into a data space, and any subsequent loss of privacy can happen only through the data pathway. hence data protection is central to privacy protection insofar as databases are concerned. the critical challenge in the design of a data protection framework is that the main uses of digitisation - long-term record keeping and data analysis - are seemingly contradictory to the privacy protection requirements. the legal principles around "fair information practice" attempt to reconcile these tensions, but there are four broad areas that require careful attention for effective data protection. first, a data protection framework is incomplete without an investigation of the nuances of digital identity, and guidelines for the various use cases of authentication, authorisation and accounting. it is also incomplete without an analysis of the extent to which personal information needs to be revealed for each use case, for example during know-your-citizen or -customer (kyc) processes. in addition, effective protection requires an understanding of the possible pathways of information leaks; of the limits of anonymisation with provable guarantees against re-identification attacks [ ]; and of the various possibilities with virtual identities [ , ]. second, there have to be clear-cut guidelines for defining the requirements and standards of access control, and protection against both external and insider attacks in large data establishments, technically as well as legally. in particular, insider attacks are the biggest threat to privacy in public databases [ ]. these include possible unauthorised and surreptitious examination of data, transaction records, logs and audit trails by personnel with access, leading to profiling and surveillance of targeted groups and individuals, perhaps at the behest of interested and influential parties in the state machinery itself [ ].
thus, there must be guidelines on how the data may be accessed, under what authorisation and for what purpose. in addition, post-access purpose limitation - ensuring that there is no illegal use after the data crosses the access boundaries - is also crucial for privacy protection. third, a data protection framework is incomplete without guidelines for the safe use of ai and data analytics. most theories for improving state efficiency in delivery of welfare and health services using personal data will have to consider improved data processing methods for targeting, epidemiology, econometrics, tax compliance, corruption control, analytics, topic discovery, etc. this, in turn, will require digitisation, surveillance and processing of large-scale personal transactional data, which in turn requires detailed analyses of how purpose limitation of such surveillance - targeted towards improving efficiency of the state's service delivery - may be achieved without enabling undesirable mass surveillance that may threaten civil liberty and democracy. there must also be effective guidelines to prevent discriminatory and biased data processing [ ]. finally, it is well recognised in data protection frameworks [ , , ] that regulatory oversight is a necessary requirement for ensuring the above. while there is a rich set of tools and techniques in computer science arising out of decades of innovative privacy research, there is no overarching general framework for a privacy-preserving architecture which, in particular, allows regulatory supervision and helps deal with the above issues in effective designs. in this paper we propose such an operational architecture for implementing the data protection principles. our immediate objective here is design space exploration and not specific implementations to evaluate performance and scalability. we illustrate the effectiveness of our proposal through design sketches of some challenging large public service applications. in particular, we illustrate through some real-world case studies how some state-of-the-art designs either fail in their data protection goals, or tend to be overly defensive at the cost of utility, in the absence of such an architecture. the rest of the paper is organised as follows. section briefly reviews the basic legal principles for data protection. section reviews concepts, tools and techniques from computer science for privacy protection. section presents our operational architecture. section discusses the feasibility and section discusses some illustrative case studies of large government applications. in what follows we briefly discuss the context of digitisation and privacy in india and the basic legal principles around privacy. we situate this analysis within the context of india's evolving regulatory and technical systems. however, many of these principles are relevant for any country seeking to align legal and technical guarantees of privacy for citizens. building public digital infrastructures has received an impetus in india in recent times [ ] and privacy has been an obvious concern. india has a long-standing legal discourse on privacy as a right rooted in the country's constitution. however, informational privacy and data protection issues have gained renewed visibility due to the recent national debate around the country's aadhaar system [ ].
aadhaar is a unique, biometric-based identity system launched in , with the ambitious aim of enrolling all indian residents, and recording their personal information, biometric fingerprints and iris scans against a unique identity number. aadhaar was designed as a solution for preventing leakages in government welfare delivery and for targeting public services through this identity system. in addition, the "india stack" was envisioned as a set of apis that could be used - by public and private sector entities under contract - to query the aadhaar database to provide a variety of services [ ]. however, as the project was rolled out across the country, its constitutionality was challenged in the courts on many grounds, including the main substantive charge that it was violative of citizens' right to privacy. over petitions challenging the system were eventually raised to the supreme court of india for its final determination. in the course of the matter, a more foundational question arose, i.e., whether the indian constitution contemplated a fundamental right to privacy. this question was referred to a separate -judge bench of the indian supreme court for a conclusive answer. the answer is important both for law and for computer science, since it creates deep implications for the design of technical systems in india. the supreme court's unanimous response in justice k.s. puttaswamy (retd.) vs union of india (puttaswamy i) [ ] was to hold that privacy is a fundamental right in india guaranteed by part iii (fundamental rights) of the indian constitution. informational privacy was noted to be an important aspect of privacy for each individual, one that required protection and security. in doing so, the court recognised the interest of an individual in controlling or limiting the access to their personal information, especially as ubiquitous data generation and collection, combined with data processing techniques, can derive information about individuals that they may not intend to disclose. in addition to cementing privacy as a constitutional right for indians, the supreme court in puttaswamy i [ ] also played an important role in clarifying certain definitional aspects of the concept. first, when defining privacy, the lead judgement noted that every person's reasonable expectation of privacy has both subjective and objective elements (see page of puttaswamy i): (1) the subjective element, which refers to the expectation and desire of an individual to be left alone, and (2) the objective element, which refers to objective criteria and rules (flowing from constitutional values) that create the widely agreed content of "the protected zone", where a person ought to be left alone in our society. second, informational privacy was also recognised (see page of puttaswamy i, drawing from a seminal work which set out a typology of privacy) to be: ". . . an interest in preventing information about the self from being disseminated and controlling the extent of access to information." it would be the role of a future indian data protection law to create some objective standards for informational privacy, to give all actors in society an understanding of the "ground rules" for accessing an individual's personal information. these principles are already fairly well-developed through several decades of international experience. india is one of the few remaining countries in the world that is yet to adopt a comprehensive data protection framework.
this section provides a brief overview of some of these established concepts. one of the early and most influential global frameworks on privacy protection is the oecd guidelines on the protection of privacy and transborder flows of personal data [ ]. these were formulated as a response to advancements in technology that enabled faster processing of large amounts of data as well as their transmission across different countries. the guidelines were updated in , reflecting the multilateral consensus on the changes in the use and processing of personal data over the intervening period. they are therefore a good starting point for the fundamental principles of privacy and data protection. the key principles of the oecd privacy framework are:
collection limitation: personal data should be collected in a fair and lawful manner and there should be limits to its collection.
use limitation: collected personal data should not be used or disclosed for any purposes other than those stated; if personal data must be used for purposes other than those stated, this should be with the consent of the data subject or with the authority of the law.
purpose specification: the purpose for collection of personal data should be stated no later than the point of collection, and all subsequent uses of such data must be limited to the stated purposes.
data quality: collected personal data should be relevant for the stated purposes and its accuracy for such purposes must be maintained.
security safeguards: reasonable safeguards must be adopted by the data controller to protect the data from risks such as unauthorised access, destruction, use, modification or disclosure.
accountability: any entity processing personal data must be responsible and held accountable for giving effect to the principles of data protection and privacy.
openness: any entity processing personal data must be transparent about the developments and practices with respect to the personal data collected.
individual participation: individuals should have the right to confirm from the data controller whether it holds any personal data relating to them, and to obtain that data within a reasonable time, at a reasonable charge and in a reasonable manner. if these requests are denied, individuals must be given the reasons for such denial and have the right to challenge such denials. individuals must also retain the right to challenge personal data relating to them and to have it erased, rectified, completed or amended.
these principles, and the many international instruments and national laws that draw from them, set some of the basic ground rules around the need for clear and legitimate purposes to be identified prior to accessing personal information. they also stress the need for accountable data practices, including strict access controls. many of these principles are reflected to varying degrees in india's personal data protection bill [ ], which was introduced in the lower house of the indian parliament in december . the bill is currently under consideration by a joint select committee of parliamentarians, following which it will return to parliament for final passage. the oecd privacy framework [ ], in article (g), also recognised the need for the promotion of technical measures to protect privacy in practice. there is also a growing recognition that if technical systems are not built with an appreciation of data protection and privacy principles, they can create deficits of trust and other dysfunctions.
these are particularly problematic in government-led infrastructures. the failure of privacy self-management points to the need for accountability-based data protection: the need for data processing entities to adhere to objective and enforceable standards of data protection is heightened because of the vulnerability of the individuals whose data they process. although research shows that individuals value their privacy and seek to control how information about them is shared, cognitive limitations operate at the level of individuals' decision-making about their personal data [ ]. this "privacy paradox" signals the behavioural biases and information asymmetries that operate on people making decisions about sharing their personal information. especially in contexts where awareness that personal data is even being collected in digital interactions is low, such as with first-time users of digital services in india, it is often unfair and meaningless to delegate the self-management of privacy to users entirely through the ineffective mechanism of "consent". the inadequacy of consent alone as a privacy protection instrument has been well established, especially given that failing to consent to data collection could result in a denial of the service being sought by the user [ ]. in the context of these findings, it is crucial that digital ecosystems be designed in a manner that protects the privacy of individuals, does not erode their trust in the data-collecting institution and does not make them vulnerable to different natures of harm. therefore, mere dependence on compliance with legal frameworks by data controllers is not sufficient. technical guarantees that the collected data will only be used for the stated purposes and in furtherance of data protection principles must become a reality if these legal guarantees are to be meaningful. early alignment of legal and technical design principles of data systems, such as access controls, purpose limitation and clear liability frameworks under appropriate regulatory jurisdictions, is essential to create secure and trustworthy public data infrastructures [ , , ]. before we present our architectural framework, we briefly review some privacy-preserving tools from computer science. cryptographic encryption [ ], for protecting data either in storage or in transit, has often been advocated for privacy protection. the following types are of particular importance. symmetric encryption: symmetric encryption allows two parties to encrypt and decrypt messages using a shared secret key; the diffie-hellman key exchange protocol [ ] is commonly used by the parties to jointly establish a shared key over an insecure channel. asymmetric encryption: asymmetric or public key encryption [ ] allows two parties to communicate without the need to exchange any keys beforehand; each party holds a pair of public and private keys such that messages encrypted using the receiver's public key cannot be decrypted without knowledge of the corresponding private key. id-based encryption: id-based encryption [ ] allows the sender to encrypt a message against a textual id instead of a public key; a trusted third party provisions decryption keys corresponding to the ids of potential receivers after authenticating them through an out-of-band mechanism.
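for concreteness, the following sketch shows symmetric and asymmetric encryption using the widely used python `cryptography` package - an assumed choice of this example, not one prescribed here. id-based encryption is not shown, as it requires a dedicated pairing-based library.

```python
# Symmetric and asymmetric encryption sketch using the third-party
# `cryptography` package (pip install cryptography) -- an assumed choice.
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# --- symmetric: both parties share the same secret key ---
shared_key = Fernet.generate_key()
box = Fernet(shared_key)
ciphertext = box.encrypt(b"record for patient X")
assert box.decrypt(ciphertext) == b"record for patient X"

# --- asymmetric: encrypt with the receiver's public key ---
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ct = public_key.encrypt(b"referral note", oaep)
assert private_key.decrypt(ct, oaep) == b"referral note"
```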
id-based encryption considerably simplifies the public key infrastructure: a sender can encrypt messages using the semantic identifier of the intended recipient without explicitly knowing the public keys of the particular receivers. encryption with strong keys is a powerful method for privacy protection provided there are no unauthorised accesses to the keys. insider attacks, however, pose serious risks if the keys also reside with the same authority. even when the keys are stored securely, they have to be brought into memory for decryption at run-time, and can be leaked by compromised privileged software, for example an operating system or a hypervisor. digital signatures: a digital signature [ ] σ_pk(m) on a message m allows a verifier to verify, using the public key pk, that m was indeed signed with the corresponding private key. any alteration of m invalidates the signature. signatures also provide non-repudiation. blind signatures: blind signatures [ ] are useful to obtain a signature on a message without exposing the contents of the message to the signer. a signature σ_pk(b(m)) by a signer holding public key pk allows the signer to sign a blinded message b(m) that does not reveal anything about m. the author of the message can then use σ_pk(b(m)) to create an unblinded digital signature σ_pk(m). cryptographic hash functions (chfs): chfs are functions that are a) 'one-way', i.e., given a hash value h, it is difficult to find an x such that h = hash(x), and b) 'collision-resistant', i.e., finding any distinct x and x' such that hash(x) = hash(x') is difficult. chfs form the basis of many privacy-preserving cryptographic primitives. there are several techniques from computer science that are particularly useful for data minimisation - at different levels of collection, authentication, kyc, storage and dissemination. some of these are the following. zero-knowledge proofs (zkps): zkps [ ] are proofs that allow a party to prove to another that a statement is true, without leaking any information other than the statement itself. of particular relevance are zkps of knowledge [ ], which convince a verifier that the prover knows a secret without revealing it. zkps also enable selective disclosure [ ], i.e., individuals can prove only purpose-specific attributes about their identity without revealing additional details; for example, that one is of legal drinking age without revealing the age itself. anonymity: "anonymity refers to the state of being not identifiable within a set of individuals, the anonymity set" [ ]. in the context of individuals making transactions with an organisation, the following notions of anonymity can be defined. unlinkable anonymity: transactions provide unlinkable anonymity (or simply unlinkability) if a) they do not reveal the true identities of the individuals to organisations, and b) organisations cannot identify how different transactions map to the individuals. linkable anonymity: transactions provide linkable anonymity if an organisation can identify whether or not two of its transactions involve the same individual, but individuals' true identities remain hidden. linkable anonymity is useful because it allows individuals to maintain their privacy while allowing the organisation to aggregate multiple transactions from the same individual. linkable anonymity is typically achieved by making individuals use pseudonyms. anonymous credentials: authenticating individuals online may require them to provide credentials from a credential-granting organisation a to a credential-verifying organisation b.
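the blind-signature idea can be illustrated with textbook rsa (insecure and for exposition only; real deployments use padded, standardised schemes): the author blinds the hashed message with a random factor, the signer signs the blinded value without learning the message, and unblinding yields an ordinary signature on the original message.

```python
# Toy RSA blind signature for illustration only -- textbook RSA, no padding,
# a small-scale demo of the blinding/unblinding algebra described above.
import hashlib
import secrets
from cryptography.hazmat.primitives.asymmetric import rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
n = key.public_key().public_numbers().n
e = key.public_key().public_numbers().e
d = key.private_numbers().d

def h(message: bytes) -> int:
    return int.from_bytes(hashlib.sha256(message).digest(), "big")

message = b"credential request"
m = h(message) % n

# author: blind the hashed message with a random factor r
r = secrets.randbelow(n - 2) + 2
blinded = (m * pow(r, e, n)) % n

# signer: signs the blinded value without learning m
blinded_sig = pow(blinded, d, n)

# author: unblind to obtain an ordinary signature on m
sig = (blinded_sig * pow(r, -1, n)) % n

# anyone: verify with the public key
assert pow(sig, e, n) == m
```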
privacy protection using anonymous credentials [ , , ] can ensure that transactions with a are unlinkable to transactions with b. anonymous credentials allow an individual to obtain a credential from an organisation a against their pseudonym with a and transform it into an identical credential against their pseudonym with organisation b. an identity authority provisions a master identity to each individual, from which all pseudonyms belonging to the individual, also known as virtual identities, are cryptographically derived. anonymous credentials are typically implemented by obtaining blind signatures (see above) from the issuer and using zkps of knowledge of these signatures to authenticate with the verifier. the credential mechanism guarantees: • unlinkable anonymity across organisations: this property ensures that a cannot track the uses of the issued credentials and b cannot obtain the individual's information shared only with a, even when a and b collude. • unforgeability: a credential against an individual's pseudonym cannot be generated without obtaining an identical credential against another pseudonym belonging to the same individual. • linkable anonymity within an organisation: depending on the use case requirements, individuals may or may not use more than one pseudonym per organisation; if an individual uses more than one pseudonym with an organisation, even the transactions within that organisation become unlinkable. if an organisation a requires to link multiple transactions from the same individual, it can indicate this requirement to the identity authority, which checks whether the pseudonyms used by individuals with a are unique. if a does not require linking, the identity authority merely checks whether the pseudonyms are correctly derived from the individual's master identity. if the checks pass, an anonymous credential certifying this fact is issued by the identity authority. all checks by the identity authority preserve individuals' anonymity. accountable anonymous credentials: anonymity comes with a price in terms of accountability: individuals can misuse their credentials if they can never be identified and held responsible for their actions. trusted third parties can revoke the anonymity of misbehaving users to initiate punitive measures against them [ , , ]. one-time credentials and k-times anonymous authentication schemes [ , , ] also prevent overspending of limited-use credentials by revoking individuals' anonymity if they overspend. blacklisting misbehaving users from future access without revoking their anonymity is also feasible [ ]. linkability by a trusted authority: linking across organisations may also be required for legitimate purposes, for example for legitimate data mining (also see the examples in the case studies). such linkability also seems to be an inevitable requirement to deter sharing of anonymous credentials among individuals [ ]. linkability by a trusted authority can be trivially achieved by individuals attaching a randomised encryption of a unique identifier against the trusted authority's public key to transactions requiring cross-linking. of course, appropriate mechanisms must exist to ensure that the trusted authority does not violate the legitimate purpose of linking. note that the anonymity of credentials is preserved only under the assumption that individuals interact with organisations through anonymous channels (e.g., as in [ ]). in particular, neither the communication network nor the data that individuals share with organisations should be usable to link their transactions (see the following sections on anonymous networks and anonymisation).
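the idea of deriving per-organisation pseudonyms (virtual identities) from a master identity can be sketched very simply with a keyed hash; this is only a minimal illustration of the derivation step, not a full anonymous-credential scheme, and on its own it provides no unforgeability or zero-knowledge properties.

```python
import hmac
import hashlib
import secrets

def new_master_secret() -> bytes:
    """The identity authority provisions a random master identity per individual."""
    return secrets.token_bytes(32)

def pseudonym(master_secret: bytes, organisation_id: str, context: str = "") -> str:
    """Derive a per-organisation pseudonym (virtual identity) from the master secret.

    Different organisations see unrelated-looking values, so they cannot link an
    individual across organisations without the master secret; repeating the same
    (organisation, context) pair yields a stable pseudonym within that organisation,
    which gives linkable anonymity there.
    """
    material = f"{organisation_id}|{context}".encode()
    return hmac.new(master_secret, material, hashlib.sha256).hexdigest()

alice = new_master_secret()
print(pseudonym(alice, "hospital-A"))   # stable within hospital-A
print(pseudonym(alice, "insurer-B"))    # unlinkable to the hospital-A value
```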
anonymous networks: anonymous networks, originally conceptualised as mix networks by chaum [ ], are routing protocols that make messages hard to trace. mix networks consist of a series of proxy servers, each of which receives messages from multiple senders, shuffles them, and sends them to the next proxy server. an onion-like encryption scheme allows each proxy server to see only an encrypted copy of the message (and the next hop in plaintext), thus providing untraceability to the sender even if only one proxy server honestly shuffles its incoming messages. anonymisation: anonymisation is the process of transforming a database such that individuals' data cannot be traced back to them. however, research in de-anonymisation has shown that anonymisation does not work in practice, as a small number of data points about individuals coming from various sources, none uniquely identifying, can completely identify them when combined together [ ]. this is backed by theoretical results [ , ] which show that for high-dimensional data, anonymisation is not possible unless the amount of noise introduced is so large that it renders the database useless. there are several reports in the literature of de-anonymisation attacks on anonymised social-network data [ , ], location data [ ], writing style [ ], web browsing data [ ], etc. an alternative is query-based access to statistical databases: in this setting, analysts interact with a remote server only through a restricted set of queries and the server responds with possibly noisy answers to them. dinur and nissim [ ] show that, given a database with n rows, an adversary having no prior knowledge could make o(n polylog(n)) random subset-sum queries to reconstruct almost the entire database, unless the server perturbs its answers substantially (by at least o(√n)). this means that preventing inference attacks is impossible if the adversary is allowed to make an arbitrary number of queries. determining whether a given set of queries preserves privacy against such attacks is in general intractable (np-hard) [ ]. inferential privacy [ , ] is the notion that no information about an individual should be learnable with access to a database that could not be learnt without any such access. in a series of important results [ , , ], it was established that such an absolute privacy goal is impossible to achieve if the adversary has access to arbitrary auxiliary information. more importantly, it was observed that individuals' inferential privacy is violated even when they do not participate in the database, because information about them could be leaked by correlated information of other participating individuals. in the wake of the above results, the notion of differential privacy was developed [ ] to allow analysts to extract meaningful distributional information from statistical databases while minimising the additional privacy risk that each individual incurs by participating in the database. note that differential privacy is a considerably weaker notion than inferential privacy, as the reconstruction attacks described above or other correlation attacks can infer a lot of non-identifying information from differentially private databases too. mechanisms for differential privacy add noise to the answers depending on the sensitivity of the query; in this sense, there is an inherent utility versus privacy tradeoff. differentially private mechanisms possess composability properties, so privacy degrades gracefully when multiple queries are made to differentially private databases.
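a minimal sketch of the standard laplace mechanism for a counting query is shown below: the noise scale is sensitivity/ε, and a count has sensitivity 1. this is the textbook construction and is illustrative only; production systems also need careful privacy-budget accounting across queries.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample zero-mean Laplace noise with the given scale (inverse-CDF method)."""
    u = random.uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(values, predicate, epsilon: float) -> float:
    """epsilon-differentially private count of items satisfying `predicate`.

    A counting query has sensitivity 1 (adding or removing one individual changes
    the count by at most 1), so the noise scale is 1/epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [23, 37, 45, 62, 71, 29, 55]
print(private_count(ages, lambda a: a >= 60, epsilon=0.5))
```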
however, this alone may not protect against an attacker making an arbitrary number of queries. for example, the reconstruction attacks mentioned above prevent many differentially private algorithms from answering a linear (in the number of rows) number of queries [ ]. for specific types of queries though, e.g., predicate queries, sophisticated noise-addition techniques [ ] can be used to maintain differential privacy while allowing an exponential number of queries [ , ]. differentially private mechanisms also degrade gracefully with respect to group privacy as the group size increases. these guarantees may not be enough for policymakers who must protect the profile of specific communities constituting a sizable proportion of the population. the ability of an adversary to manipulate and influence a community even without explicitly identifying its members is deeply problematic, as demonstrated by episodes like cambridge analytica [ ]. therefore, the goal of modern private data analysis should not be limited to protecting only individual privacy, but should also extend to protecting sensitive aggregate information. due to the inherently noisy nature of differentially private mechanisms, they are not suitable for non-statistical uses, e.g., financial transactions, electronic health records, and password management. privacy mechanisms for such use-cases must prevent misuse of data for malicious purposes such as illegal surveillance or manipulation, without hampering the legitimate workflows. the difficulties with differential privacy, and the impossibility of protection against inferential privacy violations, suggest that privacy protection demands that there should be no illegal access or processing in the first place. a complementary line of work is compliance checking: these techniques check whether a given code-base uses personal data in accordance with a given privacy policy [ , , ]. privacy policies are expressed in known formal languages [ , ]. a compiler verifies, using standard information flow analysis [ ] and model-checking techniques [ ], whether a candidate program satisfies the intended privacy policy. in order to enforce various information flow constraints, these techniques rely on manual and often tedious tagging of variables, functions and users with security classes, and verify that information does not flow from items with high security classes to items with low security classes. a related set of techniques defines purpose hierarchies and specifies purpose-based access-control mechanisms [ , , ]. however, they typically identify purpose with the role of the data requester and therefore offer weak protection from individuals claiming wrong purposes for their queries. jafari et al. [ ] formalise purpose as a relationship between actions in an action graph. hayati et al. [ ] express purpose as a security class (termed by them a "principal") and verify that data collected for a given purpose does not flow to functions tagged with a different purpose. tschantz et al. [ ] state that a purpose violation happens if an action is redundant in a plan that maximises the expected satisfaction of the allowed purpose. however, enforcement of these models still relies on fine-grained tagging of code blocks, making them tedious, and on either compiler-based verification or post-facto auditing, making them susceptible to insider attacks that bypass the checks.
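as a minimal illustration of purpose-based access control - and of why identifying purpose with the requester's declaration is weak - the toy sketch below gates data access on a declared purpose checked against a per-dataset policy. it is not one of the cited systems, the dataset and purpose names are hypothetical, and by itself it cannot stop a requester from simply declaring a false purpose, which is exactly the limitation noted above and the reason the architecture adds regulatory approval and controlled execution.

```python
class PurposeViolation(Exception):
    pass

# hypothetical policy: dataset -> purposes for which access is permitted
POLICY = {
    "ehr.prescriptions": {"treatment", "epidemiology"},
    "ehr.billing": {"audit"},
}

def access(dataset: str, declared_purpose: str, requester: str, fetch):
    """Allow `fetch()` only if the declared purpose is permitted for the dataset.

    Note the weakness discussed in the text: the purpose is self-declared, so this
    check must be combined with independent (e.g. regulatory) approval of the
    requesting program and auditing of how the returned data is used.
    """
    allowed = POLICY.get(dataset, set())
    if declared_purpose not in allowed:
        raise PurposeViolation(f"{requester}: '{declared_purpose}' not allowed for {dataset}")
    print(f"audit: {requester} accessed {dataset} for {declared_purpose}")
    return fetch()

rows = access("ehr.prescriptions", "epidemiology", "analyst-7",
              fetch=lambda: [{"drug": "x", "week": 12}])
```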
secure remote execution refers to the set of techniques wherein a client can outsource a computation to a remote party such that the remote party does not learn anything about the client's inputs or intermediate results. homomorphic encryption (he) schemes compute in the ciphertext space of encrypted data by relying on the additive or multiplicative homomorphism of the underlying encryption scheme [ , , ]. designing an encryption scheme that is both additively and multiplicatively homomorphic - which is required for universality - is challenging. gentry [ ] gave the first theoretical fully homomorphic encryption (fhe) scheme. even though state-of-the-art fhe schemes and implementations have considerably improved upon gentry's original scheme, the performance of these schemes is still far from any practical deployment [ ]. functional encryption (fe) [ ] schemes have similar objectives, with the crucial difference that fe schemes let the remote party learn the output of the computation, whereas fhe schemes compute an encrypted output, which is decrypted by the client. secure multiparty computation (smc) - originally pioneered by yao through his garbled circuits technique [ ] - allows multiple parties to compute a function of their private inputs such that no party learns about the others' private inputs, other than what the function's output reveals. smc requires clients to express the function to be computed as an encrypted circuit and send it to the server along with encrypted inputs; the server needs to evaluate the circuit by performing repeated decryptions of the encrypted gates. as a result, smc poses many challenges to its widespread adoption - ranging from the inefficiencies introduced by the circuit model itself to the decryption overhead for each gate evaluation - even as optimisations over the last two decades have considerably improved the performance and usability of smc [ ]. however, he, fe and smc based schemes involve significant application re-engineering and may offer reduced functionality in practice. in recent times, secure remote execution is increasingly being realised not through advances in cryptography but through advances in hardware-based security. this approach commoditises privacy-preserving computation, albeit at the expense of a weakened trust model, i.e., increased trust in the hardware manufacturer. intel software guard extensions (sgx) [ ] implements access control in the cpu to provide confidentiality and integrity to the executing program. at the heart of the sgx architecture lies the notion of an isolated execution environment, called an enclave. an enclave resides in the memory space of an untrusted application process, but access to the enclave memory and leakage from it are protected by the hardware. the main properties of sgx are the following. confidentiality: information about an enclave execution cannot leak outside the enclave memory except through explicit exit points. integrity: information cannot leak into the enclave to tamper with its execution except through explicit entry points. remote attestation: for an enclave's execution to be trusted by a remote party, the remote party needs to be convinced that a) the contents of the enclave memory at initialisation are as per its expectations, and b) the confidentiality and integrity guarantees will be enforced by the hardware throughout its execution. for this, the hardware computes a measurement, essentially a hash of the contents of the enclave memory and possibly additional user data, signs it and sends it over to the remote party [ ].
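the attestation flow continues below; first, to make the homomorphic-encryption idea from the start of this section concrete, the sketch uses the third-party `phe` (python-paillier) package - an assumed library choice, not one mandated here - whose paillier cryptosystem is additively homomorphic: sums and scalar multiples can be computed on ciphertexts by an untrusted party and decrypted only by the key holder.

```python
# Additively homomorphic encryption sketch using the `phe` package
# (pip install phe) -- an assumed library choice for illustration.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# data subjects (or a data provider) encrypt their values
encrypted_values = [public_key.encrypt(x) for x in [12, 7, 30]]

# an untrusted aggregator sums the ciphertexts without seeing any plaintext
encrypted_total = encrypted_values[0]
for c in encrypted_values[1:]:
    encrypted_total = encrypted_total + c
encrypted_scaled = encrypted_total * 2      # scalar multiplication also works

# only the private-key holder can decrypt the aggregate
print(private_key.decrypt(encrypted_total))   # 49
print(private_key.decrypt(encrypted_scaled))  # 98
```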
the remote party verifies the signature and matches the enclave measurement with the measurement of a golden enclave it considers secure. if these checks pass, the remote party trusts the enclave and sends sensitive inputs to it. secure provisioning of keys and data. sgx enclaves have secure access to hardware random number generators. therefore, they can generate a diffie-hellman public/private key pair and keep the private key secured within enclave memory. additionally, the generated public key can be included as part of the additional user data in the hardware measurement sent to a remote verifier during remote attestation. these properties allow the remote verifier to establish a secure tls communication channel with the enclave over which any decryption keys or sensitive data can be sent. the receiving enclave can also seal the secrets once obtained for long-term use, such that it can access them even across reboots but other programs or enclaves cannot. sgx was preceded by the trusted platform module (tpm) [ ]. tpm defines a hardware-based root of trust, which measures and attests the entire software stack, including the bios, the os and the applications, resulting in a huge trusted computing base (tcb) as compared to sgx, whose tcb includes only the enclave code. arm trustzone [ ] partitions the system into a secure and an insecure world and controls interactions between the two. in this way, trustzone provides a single enclave, whereas sgx supports multiple enclaves. trustzone has penetrated the mobile world through arm-based android devices, whereas sgx is available for laptops, desktops and servers. sgx is known to be susceptible to serious side-channel attacks [ , , , ]. sanctum [ ] has been proposed as a simpler alternative that provides provable protection against memory access-pattern based software side-channel attacks. for a detailed review on hardware-based security, we refer the reader to [ ]. stateful secure remote execution requires a secure database and mechanisms that protect clients' privacy when they perform queries on it. the aim of these schemes is to let clients host their data encrypted on an untrusted server and still be able to execute queries on it with minimal privacy loss and maximal query expressiveness. one approach for enabling this is searchable encryption schemes, i.e., encryption schemes that allow searching over ciphertexts [ , ]. another approach is to add searchable indexes along with the encrypted data, or to use special property-preserving encryptions to help with searching [ , ]. oblivious ram [ , ] is a useful primitive that provides read/write access to encrypted memory while hiding all access patterns, but these schemes require a polylogarithmic number of rounds (in the size of the database) per read/write request. enclavedb [ ] has been recently proposed as a solution based on intel sgx. it hosts the entire database within secure enclave memory, with a secure checkpoint-based logging and recovery mechanism for durability, thus providing complete confidentiality and integrity from the untrusted server without any loss in query expressiveness. private information retrieval (pir) is concerned with hiding which database rows a given user query touches - thus protecting user intent - rather than encrypting the database itself. kushilevitz and ostrovsky [ ] demonstrated a pir scheme with communication complexity o(n^ε), for any ε > 0, using the hardness of the quadratic residuosity problem.
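the kushilevitz-ostrovsky construction is a computational, single-server scheme; a much simpler two-server, information-theoretic variant (sketched below with a toy database) conveys the core idea that each server sees only a random-looking index set and therefore learns nothing about which row the client wants, under the assumption that the two servers do not collude:

```python
# toy two-server xor-based pir; purely illustrative, assumes non-colluding servers.
import secrets

DB = [0b1011, 0b0010, 0b1110, 0b0001, 0b0111, 0b1000]  # each row is a small integer

def server_answer(db, index_set):
    """xor of the requested rows; the server cannot tell which single row matters."""
    acc = 0
    for i in index_set:
        acc ^= db[i]
    return acc

def client_query(db_size, wanted):
    """build two queries whose symmetric difference is exactly {wanted}."""
    s1 = {i for i in range(db_size) if secrets.randbelow(2)}   # uniformly random subset
    s2 = s1 ^ {wanted}                                         # flip membership of the wanted row
    return s1, s2

wanted = 3
s1, s2 = client_query(len(DB), wanted)
a1 = server_answer(DB, s1)     # sent to server 1
a2 = server_answer(DB, s2)     # sent to server 2
print(a1 ^ a2 == DB[wanted])   # True: xor of the answers recovers the wanted row
```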
since then, the field has grown considerably and modern pir schemes boast of o( ) communication complexity [ ]. symmetric pir (also known as oblivious transfer), i.e., the set of schemes where, additionally, users cannot learn anything beyond the row they requested, is also an active area of research. as is evident from the discussion in the previous section, none of the techniques by themselves are adequate for privacy protection. in particular, none are effective against determined insider attacks without regulatory oversight. hence we need an overarching architectural framework based on regulatory control over data minimisation, authorisation, access control and purpose limitation. in addition, since the privacy and fairness impacts of modern ai techniques [ ] are impossible to determine automatically, the regulatory scrutiny of data processing programs must have a best-effort manual component. once approved, the architecture must prevent any alteration or purpose extension without regulatory checks. in what follows we present an operational architecture for privacy-by-design. we assume that all databases and the associated computing environments are under the physical control of the data controllers, and that the online regulator has no direct physical access to them. we also assume that the data controllers and the regulators do not collude. we illustrate our conceptual design through an example of a privacy-preserving electronic health record (ehr) system. ehrs can improve quality of healthcare significantly by providing improved access to patient records to doctors, epidemiologists and policymakers. however, the privacy concerns with them are many, ranging from the social and psychological harms caused by unwanted exposure of individuals' sensitive medical information, to direct and indirect economic harms caused by the linkage of their medical data with data presented to their employers, insurance companies or social security agencies. building effective ehrs while minimising privacy risks is a long-standing design challenge.

figure : an illustration of the architecture of trusted executables using an example involving an ehr database, a patient, an mri imaging station, a doctor and a data analysis station. tes marked "approved by r" are pre-audited and pre-approved by the regulator r. er(·) represents a regulator-controlled encryption function and acr represents online access control by regulator r. dti(x) represent various data types parametrised by the patient x (as explained in the right-hand-side table). in particular, virtualhospitalid(x) represents the hospital-specific virtual identity of the patient. the regulator checks the consents, approvals and other static rules regarding data transfer at each stage of online access control.

we propose trusted executables (te) as the fundamental building blocks for privacy-by-design. we introduce them in the abstract, and discuss some possibilities for their actual realisation in section . tes are data-processing programs, with explicit input and output channels, that are designed by the data controllers but are examined, audited, and approved by appropriate regulators. tes execute in controlled environments on predetermined data types with prior privacy risk assessment, under online regulatory access control. the environment ensures that only approved tes can operate on data items. in particular, all data accesses from the databases, and all data/digest outputs for human consumption, can only happen through the tes.
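ahead of the properties prescribed next, a very schematic sketch of such an executable is given below; all names are hypothetical, the toy xor cipher merely stands in for the regulator-controlled encryption, and a real realisation would sit on trusted hardware as discussed later:

```python
# schematic skeleton of a trusted executable (te); names and the toy cipher are
# illustrative stand-ins, not a real design.
from itertools import cycle
from typing import Callable, Optional

def toy_cipher(blob: bytes, key: bytes) -> bytes:
    """xor stand-in for the regulator-controlled encryption scheme."""
    return bytes(b ^ k for b, k in zip(blob, cycle(key)))

class TrustedExecutable:
    def __init__(self, te_id: str,
                 regulator_grant: Callable[[str, str], Optional[bytes]]):
        self.te_id = te_id
        self.regulator_grant = regulator_grant   # online access-control callback

    def process(self, encrypted_input: bytes, data_type: str) -> bytes:
        key = self.regulator_grant(self.te_id, data_type)   # explicit input channel
        if key is None:
            raise PermissionError("regulator denied access")
        data = toy_cipher(encrypted_input, key)              # decrypt only inside the te
        result = data.upper()                                # audited application logic
        minimised = result[:16]                              # data minimisation at output
        return toy_cipher(minimised, key)                    # explicit output channel

# a regulator that only lets this particular te read ehr records
def grant(te_id: str, data_type: str) -> Optional[bytes]:
    return b"k3y" if (te_id, data_type) == ("ehr-summary-te", "ehr_record") else None

te = TrustedExecutable("ehr-summary-te", grant)
record = toy_cipher(b"patient x: blood pressure 120/80", b"k3y")
print(toy_cipher(te.process(record, "ehr_record"), b"k3y"))  # b'PATIENT X: BLOOD'
```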
we prescribe the following main properties of the tes:
1. runtime environment: tes are approved by regulators. they execute in the physical infrastructure of the data controllers but cannot be modified by them.
2. authentication: a regulator can authenticate related tes during runtime, and verify that indeed the approved versions are running.
3. integrity: there is no way for a malicious human or machine agent, or even for the data controller, to tamper with the execution of a te other than by sending data through the te's explicit input channels.
4. confidentiality: there is no way for any entity to learn anything about the execution of a te other than by reading data written at the te's explicit output channels.
all data accesses and output can only happen through tes. besides, all tes should be publicly available for scrutiny. the above properties allow a regulator to ensure that a te is untampered and will conform to the limited purpose identified at the approval stage. as depicted in figure , a data agent - for example, a hospital - interacts with databases or users only through pre-approved tes, and direct accesses are prevented. all data stores and communication messages are encrypted using a regulator-controlled encryption scheme to prevent any information leakage in transit or storage. the data can be decrypted only inside the tes under regulated access control. the regulator provisions decryption keys securely to the te to enable decryption after access is granted. the regulator allows or denies access, online, based on the authentication of the te and the incoming data type, consent and approval checks as required, and the credential authentication of any human consumers of output data (e.g., the doctor(s) and data analysts). all sink tes - i.e., those that output data directly for consumption by a human agent - are pre-audited to employ appropriate data minimisation before sending data to their output channels. note that extending the te architecture to the doctors' terminals and the imaging stations ensures that the data never crosses the regulated boundary and thus enables purpose limitation. in the above example an independent identity authority issues credentials and engages in a three-way communication to authenticate individuals who present their virtual identities to the regulator. an individual can use a master health id to generate hospital-specific or doctor-specific unlinkable anonymous credentials. only a health authority may be allowed to link identities across hospitals and doctors in a purpose-limited way under regulated access control.

figure caption fragment: ... of the patient and the doctor, respectively. σ_x(·) represents digital signature by patient x. auth_te(·) represents authentication information of the te and auth_dt(·) represents authentication information of the supplied data's type. individuals are authenticated by verifying their virtual identities.

we depict the regulatory architecture in figure . the first obligation of the regulator is to audit and approve the tes designed by the data controllers. during this process, the regulator must assess the legality of the data access and processing requirements of each te, along with the privacy risk assessment of its input and output data types. in case a te is an ai-based data analytics program, it is also an obligation of the regulator to assess its fairness and the potential risks of discrimination [ ].
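the regulator-controlled encryption and online key release described above could be prototyped with an off-the-shelf symmetric scheme; in the sketch below, the fernet construction from the python 'cryptography' package merely stands in for er(·), and the policy logic is reduced to a single table lookup (a simplification, not the proposed design):

```python
# sketch of regulator-held keys provisioned to a te after an online check;
# fernet is a stand-in for the regulator-controlled encryption scheme er(·).
from typing import Optional
from cryptography.fernet import Fernet

class Regulator:
    def __init__(self):
        # one key per data type, generated and held by the regulator
        self._keys = {"ehr_record": Fernet.generate_key()}
        # pre-approved (te, data type) pairs from the audit stage
        self._approved = {("ehr-summary-te", "ehr_record")}

    def encrypt_for_storage(self, data_type: str, plaintext: bytes) -> bytes:
        return Fernet(self._keys[data_type]).encrypt(plaintext)

    def grant(self, te_id: str, data_type: str, consent_ok: bool) -> Optional[bytes]:
        """online access control: release the key only to an approved te,
        and only when the required consents/approvals are in place."""
        if consent_ok and (te_id, data_type) in self._approved:
            return self._keys[data_type]
        return None

regulator = Regulator()
blob = regulator.encrypt_for_storage("ehr_record", b"patient x: mri report")

key = regulator.grant("ehr-summary-te", "ehr_record", consent_ok=True)
print(Fernet(key).decrypt(blob))                        # b'patient x: mri report'
print(regulator.grant("rogue-te", "ehr_record", True))  # None: key never released
```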
before approving a te, the regulator also needs to verify that the te invokes a callback to the regulator's online interface before accessing a data item and supplies appropriate authentication information, and that it employs appropriate encryption and data minimisation mechanisms at its output channels. finally, the regulator needs to put in place a mechanism to be able to authenticate the te in the data controller's runtime environment. the second obligation of the regulator is to play an online role in authorising data accesses by the tes. the authorisation architecture has both a static and a dynamic component. the static authorisation rules typically capture the relatively stable regulatory requirements, and the dynamic component typically captures the fast-changing online context, mainly due to consents and approvals. specifically, each static authorisation rule takes the form of a set of pre-conditions necessary to grant a te access to the data of a given type; and, in the case of sink tes, to output it to a requester. the design of these rules is governed by regulatory requirements and the privacy risk assessment of tes and data types. the rules are typically parametric in nature, allowing specification of constraints that provide access to a requester only if the requester can demonstrate some specific relationship with the data individual (e.g., to express that only a doctor consulted by a patient can access her data). the pre-conditions of the authorisation rules may be based on consent of data individuals, approvals by authorities or even other dynamic constraints (e.g., time-bound permissions). the consent architecture must be responsible for verifying signatures on standardised consent apis from consent givers and recording them as logical consent predicates. the regulator, when designing its authorisation rules, may use a simple consent - for example, that a patient has wilfully consulted a doctor - to invoke a set of rules to protect the individual's privacy under a legal framework, rather than requiring individuals to self-manage their privacy. similar to the consent architecture, the approval architecture for data access must record standardised approvals from authorities as logical approval predicates. an approval from an authority may also be provided to an individual instead of directly to the regulator, as a blind signature against a virtual identity of the individual known to the approver, which should be transformed by the individual to a signature against the virtual identity known to the data controller and the regulator. this, for example, may enable a patient to present a self-generated virtual identity to a doctor or a hospital instead of her universal health id. the regulator also requires an authentication architecture. first, it needs to authenticate individuals, i.e., consent givers, approvers and data requesters, by engaging in a three-way communication with an identity authority which may be external to both the data controller and the regulator. second, it needs to authenticate tes in order to be able to identify the access requests as originating from one of the pre-approved tes. third, it needs to authenticate data types, i.e., identify the underlying type of the te's encrypted input data.
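a static, parametric authorisation rule of the kind described above ("only a doctor consulted by the patient may receive her record"), instantiated with a recorded consent predicate, might be expressed as in the following sketch (all structures are hypothetical simplifications):

```python
# sketch of a parametric authorisation rule instantiated with consent predicates.
from dataclasses import dataclass

@dataclass(frozen=True)
class Consent:                 # logical consent predicate recorded by the regulator
    patient: str
    doctor: str                # "patient has wilfully consulted doctor"

@dataclass
class AccessRequest:
    te_id: str
    requester: str             # credential-authenticated human consumer
    requester_role: str
    patient: str               # data subject whose record is requested
    data_type: str

CONSENTS = {Consent("patient-42", "dr-rao")}
APPROVED_SINK_TES = {("ehr-viewer-te", "ehr_record")}

def authorise(req: AccessRequest) -> bool:
    """static rule, parametric in (requester, patient): a sink te may output an
    ehr record only to a doctor whom that patient has consulted."""
    te_ok = (req.te_id, req.data_type) in APPROVED_SINK_TES
    relationship_ok = (req.requester_role == "doctor"
                       and Consent(req.patient, req.requester) in CONSENTS)
    return te_ok and relationship_ok

print(authorise(AccessRequest("ehr-viewer-te", "dr-rao", "doctor",
                              "patient-42", "ehr_record")))      # True
print(authorise(AccessRequest("ehr-viewer-te", "dr-mallory", "doctor",
                              "patient-42", "ehr_record")))      # False
```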
the consent/approval predicates and the authentication information flow to the dynamic authorisation module, which can instantiate the static authorisation rules with the obtained contextual information to determine, in an online fashion, if access should be allowed to the requesting te. if yes, then it must also provision decryption keys to the te securely such that only the te can decrypt. the keys can be securely provisioned to the te because of the authentication, integrity and confidentiality properties, and by the fact that approved tes must never output the obtained decryption keys. an example control-flow diagram depicting the regulatory access control in a scenario where a doctor is trying to access the data of a patient who consulted them is shown in figure . several existing techniques can be useful for the proposed architecture, though some techniques may need strengthening. trusted executables can be implemented most directly on top of trusted hardware primitives such as intel sgx or arm trustzone where authentication of tes is carried out by remote attestation. secure provisioning of keys and data to tes can be done in case of intel sgx as per section . . . however, since sgx includes only the cpu in its tcb, it presents challenges in porting ai applications that run on gpus for efficiency. graviton [ ] has been recently proposed as an efficient hardware architecture for trusted execution environments on gpus. in our architecture, tes fetch or update information from encrypted databases. this may be implemented using special indexing data structures, or may involve search over encrypted data, where the tes act as clients and the database storage acts as the server. accordingly, techniques from section . can be used. since the tes never output data to agents unless deemed legitimate by the regulator, the inferential attacks identified with these schemes in section . have minimal impact. for added security, enclavedb [ ] , which keeps the entire database in secure enclave memory, can be used. enclavedb has been evaluated on standard database benchmarks tpc-c [ ] and tatp [ ] with promising results. for authentication of data types messages may be encrypted using an id-based encryption scheme, where the concrete runtime type of the message acts as the textual id and the regulator acts as the trusted third party (see section . . ). the receiver te can send the expected plaintext type to the regulator as part of its access request. the regulator should provision the decryption key for the id representing the requested type only if the receiver te is authorised to receive it as per the dynamic authorisation check. note that authentication of the received data type is implicit here, as a te sending a different data type in its access request can still not decrypt the incoming data. data minimisation for consents and approvals based on virtual identities is well-established from chaum's original works [ , ] . individuals should use their purpose-specific virtual identities with organisations, as opposed to a unique master identity. to prevent cross-linking of identities, anonymous credentials may be used. in some cases, individuals' different virtual identities may need to be linked by a central authority to facilitate data analytics or inter-organisation transactions. this should be done under strict regulatory access control and purpose limitation. modern type systems can conveniently express the complex parametric constraints in the rules in the authorisation architecture. 
efficient type-checkers and logic engines exist that could perform the dynamic authorisation checks. approval of tes needs to be largely manual as the regulator needs to evaluate the legitimacy and privacy risks associated with the proposed data collection and processing activity. however, techniques from program analysis may help with specific algorithmic tasks, such as checking if the submitted programs adhere to the structural requirement of encrypting data items with the right type at their outgoing channels. we require the regulatory boundary to be extended even to agent machines, which must also run tes so that data they obtain is not repurposed for something else. however, when a te at an authorised agent's machine outputs data, it could be intercepted by malicious programs on the agent's machine leading to purpose violation. solutions from the drm literature may be used to prevent this. in particular, approaches that directly encrypt data for display devices may be useful [ ] . we note that this still does not protect the receiving agent from using more sophisticated mechanisms to copy data (e.g., by recording the display using an external device). however, violations of this kind are largely manual in nature and ill-suited for large-scale automated attacks. finally, we need internal processes at the regulatory authority itself to ensure that its actual operational code protects the various decryption keys and provides access to tes as per the approved policies. to this end, the regulator code may itself be put under a te and authenticated by the regulatory authority using remote attestation. once authenticated, a master secret key may be provisioned to it using which the rest of the cryptosystem may bootstrap. in this section, we present two additional case studies to showcase the applicability of our architecture in diverse real-world scenarios. direct benefit transfer (dbt) [ ] is a government of india scheme to transfer subsidies to citizens' bank accounts under various welfare schemes. its primary objective is to bring transparency and reduce leakages in public fund disbursal. the scheme design is based on india's online national digital identity system aadhaar [ ] . all dbt recipients have their aadhaar ids linked to their bank accounts to receive benefits. figure shows a simplified schematic of the scheme that exists today [ ] . a ministry official initiates payment by generating a payment file detailing the aadhaar ids of the dbt recipients, the welfare schemes under which payments are being made and the amounts to be transferred. the payment file is then signed and sent to a centralised platform called the public financial management system (pfms). pfms hosts the details of various dbt schemes and is thus able to initiate an inter-bank fund transfer from the bank account of the sponsoring scheme to the bank account of the beneficiary, via the centralised payments facilitator npci (national payments corporation of india). npci maintains a mapping of citizen's aadhaar ids to the banks hosting their dbt accounts. this mapping allows npci to route the payment request for a given aadhaar id to the right beneficiary bank. the beneficiary bank internally maintains a mapping of its customers' aadhaar ids to their bank account details, and is thus able to transfer money to the right account. 
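the two mapping steps just described - npci's aadhaar-to-bank map and the bank's aadhaar-to-account map - can be pictured with the schematic lookup below (a toy sketch; all identifiers are fabricated placeholders):

```python
# schematic of dbt payment routing via the two mappings described above;
# all identifiers are fabricated placeholders.
NPCI_MAP = {"aadhaar-1111": "bank-a", "aadhaar-2222": "bank-b"}   # id -> hosting bank
BANK_ACCOUNTS = {
    "bank-a": {"aadhaar-1111": "acct-009"},
    "bank-b": {"aadhaar-2222": "acct-731"},
}

def route_payment(aadhaar_id: str, amount: int, scheme: str) -> str:
    bank = NPCI_MAP[aadhaar_id]                  # npci: which bank hosts the dbt account
    account = BANK_ACCOUNTS[bank][aadhaar_id]    # bank: which account belongs to this id
    return f"transfer {amount} under {scheme} to {account} at {bank}"

print(route_payment("aadhaar-1111", 500, "scholarship-scheme"))
# the privacy problem is visible here: every hop sees the national id in the clear
```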
as dbt payments are primarily directed towards people who need benefits, precisely because they are structurally disadvantaged, their dbt status must be protected from future employers, law enforcement, financial providers etc., to mitigate discrimination and other socio-economic harms coming their way. further, since dbt relies on the recipients' national aadhaar ids, which are typically linked with various other databases, any leakage of this information makes them directly vulnerable. indeed, there are reports that bank and address details of millions of dbt recipients were leaked online [ ]; in some cases this information was misused to even redirect dbt payments to unauthorised bank accounts [ ]. we illustrate our approach for a privacy-preserving dbt in figure . in our proposal, dbt recipients use a virtual identity for dbt that is completely unlinkable to the virtual identity they use for their bank account. they may generate these virtual identities - using suitably designed simple and intuitive interfaces - by an anonymous credential scheme where the credentials are issued by a centralised identity authority. additionally, they provide the mapping of the two virtual identities, along with the bank name, to the npci mapper. this information is provided encrypted under the control of the financial regulator r such that only the npci mapper te can access it under r's online access control. this mechanism allows the npci mapper to convert payment requests against dbt-specific identities to the target bank-specific identities, while keeping the mapping private from all agents. regulator-controlled encryption of data in transit and storage and the properties of tes allow for an overall privacy-preserving dbt pipeline. note that data flow is controlled by different regulators along the dbt pipeline, providing a distributed approach to privacy protection. pfms is controlled by a dbt regulator; the npci mapper is controlled by a financial regulator, and the sponsor and beneficiary banks are controlled by their respective internal regulators. there have been a plethora of attempts recently from all over the world towards electronic app-based contact tracing for covid- using a combination of gps and bluetooth [ , , , , , , , , ].

figure caption: (a) collecting spatiotemporal information. a and b come in contact via ble, as denoted by the dotted arrows. c does not come in contact with a or b via ble but is spatially close within a time window, as per gps data. vid_x represents the virtual identity of agent x; loc_x,i represents x's i-th recorded location; time_x,i represents its i-th recorded time. t_x,i represents the i-th token generated by x; r_x,i represents the i-th receipt obtained by x; σ_x(·) represents signing by x. dashed arrows represent one-time registration steps (illustrated only for c). (b) tracing the contacts of infected individuals. a gets infected, as certified by the doctor's signature σ_doc on a's virtual identity vid_a and medical report report_a. d_s and d_t respectively represent chosen spatial and temporal distance functions, and ε and δ the corresponding thresholds, as per the disease characteristics. ∆ represents the infection window, the time during which a might have remained infectious. time_now represents the time when the query was executed.

even keeping aside the issue of their effectiveness, some serious privacy concerns have been raised about such apps.
in most of these apps the smartphones exchange anonymous tokens when they are in proximity, and each phone keeps a record of the sent and received tokens. when an individual is infected - signalled either through a self-declaration or a testing process - the tokens are uploaded to a central service. there are broadly two approaches to contact tracing:
1. those involving a trusted central authority that can decrypt the tokens and, in turn, alert individuals and other authorities about potential infection risks [ , , , ]. some of these apps take special care to not upload any information about individuals who are not infected.
2. those that assume that the central authority is untrusted and use privacy-preserving computations on user phones to alert individuals about their potential risks of infection [ , , , , ]. the central service just facilitates access to anonymised sent tokens of infected individuals and cannot itself determine the infection status of anybody.
the following are the main privacy attacks on contact tracing apps: 1) individuals learning about other individuals as high-risk spreaders, 2) insiders at the central service learning about individuals classified as high risk, 3) exposure of social graphs of individuals, and 4) malicious claims by individuals forcing quarantine on others. see [ ] for a vulnerability analysis of some popular approaches. the centralised approaches clearly suffer from many of the above privacy risks. while alerting local authorities about infection risks is clearly more effective from a public health perspective, to enable them to identify hotspots and make crucial policy decisions, it is mainly the privacy concerns that sometimes motivate the second approach. also, it is well known that location data of individuals can be used to orchestrate de-anonymisation attacks [ ], and hence many of the above approaches adopt the policy of not using geolocation data for contact tracing despite its obvious usefulness, at least in identifying hotspots. in addition, bluetooth-based proximity sensing - which relies on isolated communication events over narrow temporal windows between two smartphones - is ineffective for risk assessment of indirect transmission through contaminated surfaces, where the virus can survive for long hours or even days. such risk assessment will require computation of the intersection of space-time volumes of trajectories, which will be difficult in a decentralised approach. it appears that the privacy considerations have forced many of these approaches to adopt overly defensive decentralised designs at the cost of effectiveness. in contrast, we propose an architecture where governments can collect fine-grained location and proximity data of citizens, but under regulated access control and purpose limitation. such a design can support both short-range peer-to-peer communication technologies such as ble and gps-based location tracking. additionally, centralised computing can support space-time intersections. in figure , we show the design of a state-mandated contact-tracing app that, in addition to protecting against the privacy attacks outlined earlier, can also protect against attacks by individuals who may maliciously try to pose as low-risk on the app, for example to get around restrictions (attack 5). as before, we require all storage and transit data to be encrypted under a regulator-controlled encryption scheme, and that they be accessible only to pre-approved tes.
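the token/receipt mechanism and the spatiotemporal matching that the next paragraphs describe might be prototyped along the lines of the sketch below; signing and regulator-controlled encryption are deliberately omitted, the distance function is a toy surrogate, and all thresholds are arbitrary illustrative values:

```python
# illustrative sketch of ble tokens, receipts and spatiotemporal matching;
# signing/encryption and realistic distance functions are deliberately omitted.
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:                 # created periodically by each device and shared over ble
    vid: str                 # registered virtual identity
    lat: float
    lon: float
    t: float                 # recorded time (hours, in some common clock)

@dataclass(frozen=True)
class Receipt:               # stored by the device that received a token
    receiver_vid: str        # (kept for completeness; not exercised in this tiny demo)
    token: Token

def spatial_dist(a: Token, b: Token) -> float:
    # toy euclidean surrogate for the chosen spatial distance function d_s
    return ((a.lat - b.lat) ** 2 + (a.lon - b.lon) ** 2) ** 0.5

def close_contacts(uploads: list[Token], infected_vid: str,
                   eps: float = 0.001, delta: float = 1.0,
                   infection_window: float = 14 * 24) -> set[str]:
    """virtual identities whose recorded coordinates intersect those of the
    infected person within the spatial/temporal thresholds."""
    now = max(tok.t for tok in uploads)
    infected = [tok for tok in uploads
                if tok.vid == infected_vid and now - tok.t <= infection_window]
    contacts = set()
    for tok in uploads:
        if tok.vid == infected_vid:
            continue
        if any(spatial_dist(tok, i) <= eps and abs(tok.t - i.t) <= delta
               for i in infected):
            contacts.add(tok.vid)
    return contacts

uploads = [Token("vid-a", 12.9716, 77.5946, 100.0),
           Token("vid-b", 12.9716, 77.5946, 100.5),   # same place, half an hour apart
           Token("vid-c", 13.5000, 77.0000, 100.0)]   # far away
print(close_contacts(uploads, "vid-a"))               # {'vid-b'}
```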
we also require the app to be running as a te on the users' phones (e.g., within a trusted zone on the phone). we assume that everyone registers with the app using a phone number and a virtual identity unlinkable to their other identities. periodically, say after every few minutes, each device records its current gps location and time. the tuple made up of the registered virtual identity and the recorded location and time is signed by the device and encrypted under the regulator-controlled scheme, thus creating an ephemeral "token" to be shared with other nearby devices over ble. when a token is received from another device, a tuple containing the virtual identity of self and the incoming token is created, signed and stored in a regulator-controlled encrypted form, thus creating a "receipt". periodically, once every few hours, all locally stored tokens and receipts are uploaded to a centralised server te, which stores them under regulated access control as a mapping between registered virtual identities and all their spatiotemporal coordinates. for all the receipts, the centralised server te stores the same location and time for the receiving virtual identity as in the token it received, thus modelling the close proximity of ble contacts. when a person tests positive, they present their virtual identity to medical personnel who upload a signed report certifying the person's infection status to the centralised server te. this event allows the centralised server te to fetch all the virtual identities whose recorded spatiotemporal coordinates intersect within a certain threshold, as determined by the disease parameters, with the infected person's coordinates. as the recorded (location, time) tuples of any two individuals who come in contact via ble necessarily collide in our approach, the virtual identities of all ble contacts can be identified with high precision. moreover, virtual identities of individuals who did not come into contact via ble but were spatially nearby in a time window as per gps data are also identified. a notifier te securely obtains the registered phone numbers corresponding to these virtual identities from the centralised server te and sends suitably minimised notifications to them, and also perhaps to the local administration according to local regulations. the collected location data can also be used independently by epidemiologists and policy makers in aggregate form to help them understand the infection pathways and identify areas which need more resources. note that attack is protected by the encryption of all sent tokens; attacks and are protected by the properties of tes and regulatory access control; attack is protected by devices signing their correct spatiotemporal coordinates against their virtual identity before sending tokens or receipts. attack is mitigated by requiring the app to run within a trusted zone on users' devices, to prevent individuals from not sending tokens and receipts periodically or sending junk data. we have presented the design sketch of an operational architecture for privacy-by-design [ ] based on regulatory oversight, regulated access control, purpose limitation and data minimisation. we have established the need for such an architecture by highlighting limitations in existing approaches and some public service application designs. we have demonstrated its usefulness with some case studies.
while we have explored the feasibility of our architecture based on existing techniques in computer science, some of them will definitely require further strengthening. there also needs to be detailed performance and usability evaluations, especially in the context of large-scale database and ai applications. techniques to help a regulator assess the privacy risks of tes also need to be investigated. these are interesting open problems that need to be solved to create practical systems for the future with built-in end-to-end privacy protection.

references
yuval noah harari: the world after coronavirus
writ petition (civil) no of , supreme court judgment dated august
privacy by design: the foundational principles
oecd guidelines on the protection of privacy and transborder flows of personal data
the digital person: technology and privacy in the information age
the european parliament and the council of european union
the personal data protection bill
social security number protection act of
the health insurance portability and accountability act of (hipaa)
revealed: million facebook profiles harvested for cambridge analytica in major data breach
the identity project: an assessment of the uk identity cards bill and its implications
privacy and security of aadhaar: a computer science perspective
nhs care.data scheme closed after years of controversy
australians say no thanks to electronic health records
india plan to merge id with health records raises privacy worries
voter privacy is gone get over it
are citizens compromising their privacy when registering to vote
concerns over linking aadhaar to voter id and social media accounts
equifax data breach
the rbi's proposed public credit registry and its implications for the credit reporting system in india
launch of incomes register dogged by data security concerns
the use of big data analytics by the irs: efficient solution or the end of privacy as we know it?
national identity register destroyed as government consigns id card scheme to history
unique identification authority of india
dissent on aadhaar: big data meets big brother
robust de-anonymization of large sparse datasets
security without identification: transaction systems to make big brother obsolete
showing credentials without identification
nsa whistleblower: the ultimate insider attack
fairness and machine learning. fairmlbook.org
digital transformation in the indian government
india stack-digital infrastructure as public good
symmetric and asymmetric encryption
new directions in cryptography
identity-based encryption from the weil pairing
blind signatures for untraceable payments
proofs that yield nothing but their validity or all languages in np have zero-knowledge proof systems
zero knowledge proofs of knowledge in two rounds
efficient selective disclosure on smart cards using idemix
anonymity, unlinkability, undetectability, unobservability, pseudonymity, and identity management - a consolidated proposal for terminology
a secure and privacy-protecting protocol for transmitting personal information between organizations
an efficient system for non-transferable anonymous credentials with optional anonymity revocation
dynamic accumulators and application to efficient revocation of anonymous credentials
group signatures
k-times anonymous authentication (extended abstract)
dynamic k-times anonymous authentication
k-times anonymous authentication with a constant proving cost
blacklistable anonymous credentials: blocking misbehaving users without ttps
credential sharing: a pitfall of anonymous credentials
untraceable electronic mail, return addresses, and digital pseudonyms
on k-anonymity and the curse of dimensionality
provable de-anonymization of large datasets with sparse dimensions
de-anonymizing social networks
link prediction by de-anonymization: how we won the kaggle social network challenge
unique in the crowd: the privacy bounds of human mobility
on the feasibility of internet-scale author identification
de-anonymizing web browsing data with social networks
revealing information while preserving privacy
auditing boolean attributes
towards a methodology for statistical disclosure control
inferential privacy guarantees for differentially private mechanisms
calibrating noise to sensitivity in private data analysis
differential privacy
a learning theory approach to non-interactive database privacy
mechanism design via differential privacy
a simple and practical algorithm for differentially private data release
bootstrapping privacy compliance in big data systems
data capsule: a new paradigm for automatic compliance with data privacy regulations
language-based enforcement of privacy policies
the enterprise privacy authorization language (epal) - how to enforce privacy throughout an enterprise
p p: making privacy policies more useful
a lattice model of secure information flow
principles of model checking (representation and mind series)
hippocratic databases
purbac: purpose-aware role-based access control
purpose based access control for privacy protection in relational database systems
towards defining semantic foundations for purpose-based privacy policies
formalizing and enforcing purpose restrictions in privacy policies
a method for obtaining digital signatures and public-key cryptosystems
a public key cryptosystem and a signature scheme based on discrete logarithms
public-key cryptosystems based on composite degree residuosity classes
fully homomorphic encryption using ideal lattices
a survey on homomorphic encryption schemes: theory and implementation
functional encryption: definitions and challenges
protocols for secure computations
secure multiparty computation and trusted hardware: examining adoption challenges and opportunities
innovative instructions and software model for isolated execution
innovative technology for cpu based attestation and sealing
tpm main specification version . : part design principles
building a secure system using trustzone technology
controlled-channel attacks: deterministic side channels for untrusted operating systems
malware guard extension: using sgx to conceal cache attacks
leaky cauldron on the dark land: understanding memory side-channel hazards in sgx
foreshadow: extracting the keys to the intel sgx kingdom with transient out-of-order execution
sanctum: minimal hardware extensions for strong software isolation
practical techniques for searches on encrypted data
public key encryption with keyword search revisited
secure indexes
searchable symmetric encryption: improved definitions and efficient constructions
executing sql over encrypted data in the database-service-provider model
cryptdb: protecting confidentiality with encrypted query processing
access pattern disclosure on searchable encryption: ramification, attack and mitigation
leakage-abuse attacks against searchable encryption
all your queries are belong to us: the power of file-injection attacks on searchable encryption
inference attacks on property-preserving encrypted databases
efficient computation on oblivious rams
software protection and simulation on oblivious rams
enclavedb: a secure database using sgx
replication is not needed: single database, computationally-private information retrieval
optimal rate private information retrieval from homomorphic encryption
graviton: trusted execution environments on gpus
telecom application transaction processing benchmark
hdcp interface independent adaptation specification
direct benefit transfer, government of india
standard operating procedure (sop) modules for direct benefit transfer (dbt)
information security practices of aadhaar (or lack thereof): a documentation of public availability of aadhaar numbers with sensitive personal financial information
aadhaar mess: how airtel pulled off its rs crore magic trick
china's high-tech battle against covid-
coronavirus: south koreas success in controlling disease is due to its acceptance of surveillance
tracetogether app
government of india
apps gone rogue: maintaining personal privacy in an epidemic
anonymous collocation discovery: harnessing privacy to tame the coronavirus
epione: lightweight contact tracing with strong privacy
privacy-preserving contact tracing
the pact protocol specification

key: cord- -j iawzp authors: fitzpatrick, meagan c.; bauch, chris t.; townsend, jeffrey p.; galvani, alison p. title: modelling microbial infection to address global health challenges date: - - journal: nat microbiol doi: . /s - - - sha: doc_id: cord_uid: j iawzp

the continued growth of the world's population and increased interconnectivity heighten the risk that infectious diseases pose for human health worldwide. epidemiological modelling is a tool that can be used to mitigate this risk by predicting disease spread or quantifying the impact of different intervention strategies on disease transmission dynamics. we illustrate how four decades of methodological advances and improved data quality have facilitated the contribution of modelling to address global health challenges, exemplified by models for the hiv crisis, emerging pathogens and pandemic preparedness. throughout, we discuss the importance of designing a model that is appropriate to the research question and the available data. we highlight pitfalls that can arise in model development, validation and interpretation.
close collaboration between empiricists and modellers continues to improve the accuracy of predictions and the optimization of models for public health decision-making. microbial pathogens are responsible for more than million years of life lost annually across the globe , a higher burden than either cancer or cardiovascular disease . diseases that have long plagued humanity, such as malaria and tuberculosis, continue to impose a staggering toll. recent decades have also witnessed the emergence of new virulent pathogens, including human immunodeficiency virus (hiv), ebola virus, severe acute respiratory syndrome (sars) coronavirus, west nile virus and zika virus. the persistent global threat posed by microbial pathogens arises from the nonlinear mechanisms of disease transmission. that is, as the prevalence of a disease is reduced, the density of immune individuals drops, the density of susceptible individuals rises and disease is more likely to rebound. the resultant temporal trajectories are difficult to predict without considering this nonlinear interplay. for instance, many microbial diseases exhibit periodic spikes in the number of cases that are unexplainable by pathogen natural history or environmental phenomena. by explicitly defining the nonlinear processes underlying infectious disease spread, transmission models illuminate these otherwise opaque systems. forty years ago, nature published a series of papers that launched the modern era of infectious disease modelling , . since that time, these methodologies have multiplied . transmission models now employ a variety of approaches, ranging from agent-based simulations that represent each individual to compartmental frameworks that group individuals by epidemiological status, such as infectiousness and immunity , . accompanying the methodological innovations, however, are challenges regarding selection of appropriate model structures from among the wealth of possibilities . at this anniversary of the publication of these landmark papers , , we reflect on contributions that transmission modelling has made to infectious disease science and control. through a series of case studies, we illustrate the overarching principles and challenges related to model design. with expanding computational capacity and new types of data, myriad opportunities have opened for transmission modelling to bolster evidence-based policy (box ) , . in all pursuits, modelling is most informative when conducted collaboratively with microbiologists, immunologists and epidemiologists. we offer this perspective as an entry point for non-modelling scientists to understand the power and flexibility of modelling, and as a foundation for the transdisciplinary conversations that bolster the field. even within the same disease system, the ideal model design depends on the specifics of the questions asked. here, we highlight a series of models focused on one of the defining infectious agents of our era: hiv. the virus has challenged science, medicine and public health at every scale, from its deft immune evasion to its death toll of more than million over the last four decades . we describe how clinical needs, research questions and data availability have shaped the design of hiv models across these scales. unless otherwise indicated, the term 'hiv' is inclusive of both hiv- and hiv- . within-host models. at a within-host scale (table ) , models can be used to simulate cellular interactions, immunological responses and treatment pharmacokinetics .
in such simulations, viral dynamics are often modelled using a compartmental structure, with the growth of one population, such as circulating virions, dependent on the size of another population, such as infected cells. for example, a seminal within-host model fit to viral load data by perelson et al. revealed high turnover rates of hiv- , counter to what was then the prevailing assumption that hiv- remained dormant during the asymptomatic 'latency' phase. the corollary to these high rates of viral turnover was that drug resistance would likely evolve rapidly under monotherapy. further analyses of this model indicated that a combination of at least three drugs was necessary to maintain drug sensitivity . once combination therapy did become available, extension of the perelson et al. model demonstrated that the two-phase decline in viral load observed following treatment initiation was attributable to a reservoir of long-lived infected cells . with this insight also came the realization that prolonged treatment would be necessary to suppress viral load. the incorporation of stochasticity into this within-host framework allowed model fitting to 'viral blips' - transient peaks in viral load, even under antiretroviral treatment . analysis of this data-driven stochastic model demonstrated that homeostatic proliferation maintained the infected cell reservoir and produced these viral blips, a finding that was later confirmed experimentally , . the implication for clinical care was that intensified antiretroviral treatment would be unable to eliminate the latent reservoir of infected cells as had been hypothesized, sparing patients from potentially fruitless trials with such regimens. individual-based models. whereas the unit of interest for within-host modelling is an infected cell, the analogous unit for individual-based models is an infected person (table ) , , . individual-based models are often used to explore the interplay between disease transmission and individual-level risk factors, such as comorbidities, sexual behaviours and age. such models are capable of incorporating data with individual-level granularity, including those regarding contact patterns, patient treatment cascades and clinical outcomes. individual-based models are uniquely suited for representing overlap in individual-level risk factors and translating the implications of this overlap for public health policy. for example, an individual-based model was recently used to demonstrate that the majority of hiv transmission among people who inject drugs in new york city is attributable to undiagnosed infections . these modelling results underscore the urgency for the city to invest in more comprehensive screening and improved diagnostic practices.
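before moving to population-scale models, the within-host compartmental structure described earlier (uninfected target cells, productively infected cells and free virions) is often written as a small system of ordinary differential equations; a minimal sketch is given below, with arbitrary illustrative parameter values rather than the fitted estimates from the original studies:

```python
# minimal within-host viral dynamics model (target cells T, infected cells I,
# virions V); parameter values are illustrative, not fitted estimates.
import numpy as np
from scipy.integrate import solve_ivp

lam, d, beta, delta, p, c = 1e4, 0.01, 2e-7, 0.7, 100.0, 3.0

def rhs(t, y):
    T, I, V = y
    dT = lam - d * T - beta * T * V        # production, death, infection of target cells
    dI = beta * T * V - delta * I          # creation and clearance of infected cells
    dV = p * I - c * V                     # virion production and clearance
    return [dT, dI, dV]

sol = solve_ivp(rhs, (0, 60), [1e6, 0.0, 1e-3], t_eval=np.linspace(0, 60, 7))
for t, V in zip(sol.t, sol.y[2]):
    print(f"day {t:5.1f}: viral load ~ {V:12.1f}")
```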
for example, an individual-based model was recently used to demonstrate that the majority of hiv transmission among people who inject drugs in new york city is attributable to undiagnosed infections . these modelling results underscore the urgency for the city to invest in more comprehensive screening and improved diagnostic practices. population models. most commonly, models are created at the population scale, capturing the spread of a pathogen through a large group (table ) . at this scale, compartmental models shift in focus from the pathogen to the host. unlike individual-based models, compartmental models will aggregate individuals with a similar epidemiological status. for instance, the archetypical 's-i-r' model separates the entire population of interest into one of three categories: s, susceptible to infection; i, infected and infectious; or r, recovered and protected . in practice, most models will have additional compartments or stratification beyond this simple structure. age stratification is essential when either the disease risk or the intervention is age-specific. as an example, an age-stratified multipathogen model demonstrated that schistosomiasis prevention targeted to zimbabwean schoolchildren could cost-effectively reduce hiv acquisition later in life . this framework was extended to additional countries with a range of age-specific disease prevalence and co-infection rates to assess the potential value of treating schistosomiasis in adults. although adult treatment is not usually considered efficient, the model showed that it could be cost-effective in settings with high hiv prevalence . these models strengthened the investment case for treatment of schistosomiasis, an otherwise neglected tropical disease. network models are also deployed to represent dynamics on the population scale (table ) . these models impose a structure on contacts between hosts, unlike compartmental models which assume that contacts are random among hosts within a compartment. in a network model, nodes represent individuals and the connections between nodes represent contacts through which infection may spread . sources for network parameterization may include surveys, partner notification services or phylogenetic tracing , . as with individual-based models, network models tend to require significant amounts of data to fully parameterize, but various computational and statistical methods have been developed to analyse the impact of uncertain parameter values on model predictions . network models are applied to discern the influence of contact structure on disease transmission and on the effectiveness of targeted intervention strategies. for instance, network models predicted that hiv would spread more quickly through sexual partnerships that are concurrent versus serially monogamous, even if the total numbers of sexual acts and partners remain constant . the study prompted a more rigorous engagement of epidemiologists with sociological data to tailor interventions for specific settings . other network models have focused on the more rapid transmission within clusters of high-risk individuals and slower transmission to lower-risk clusters, a dynamic which explains discrepancies between observed incidence patterns and the expected pattern based on an assumption of homogeneous risks . 
these studies both illustrate the importance of accounting for network-driven dynamics when individuals are highly aggregated with regards to their risk factors, and when appropriate data for parameterization are available.

box : there are three principal objectives of modelling, all of which can inform public health policy.
predicting disease spread. models can be used to estimate the infectiousness of a pathogen within a given population. a fundamental concept is that of r , the basic reproduction number, which quantifies the number of infections that would result from a single index case in a susceptible population. r governs the temporal trajectory of an outbreak and the scale of interventions required for its containment. models may be used to infer r as well as forecast changes in r that could drive transitions in epidemic dynamics, such as the shift from sporadic outbreaks to sustained chains of transmission. example: assessing real-time zika risk in texas .
selecting among alternative control strategies. simultaneous field trials of multiple infectious disease control options are often infeasible. models can simulate a wide range of control strategies and thus optimize public health policies according to translational objectives and real-world constraints. modelling can also extrapolate from the individual clinical outcomes of interventions or novel therapeutics to the population-level impacts. extrapolating to the population level is essential to evaluate the indirect benefits of interventions, including a reduction in transmission, or unanticipated repercussions, such as evolution of resistance. example: comparing antibiotic 'cycling' versus 'mixing' to minimize the evolution of antimicrobial resistance .
hypothesis testing. it is often logistically or ethically infeasible to empirically test scientific hypotheses in the field or experimentally. modelling can identify parsimonious explanations of observed phenomena, including complex outcomes that can arise from the nonlinear processes common in microbiological systems. even simple models can be useful to help us understand dynamics that are common to many microbiological systems through identification of basic mechanisms that apply across a range of infections. by examining a new infectious agent through the lens of previously characterized systems, models provide insight into the ways that a particular microbial infection might follow or break from typical patterns. example: investigating whether individual heterogeneity within social networks significantly impacts disease spread .

metapopulation models. metapopulation models represent disease transmission at dual scales, considering not just the interactions of individuals, but also the relationships between groups of individuals, which are typically defined geographically (table ) . transmission intensity is often higher within groups than across groups, especially when the groups are spatially segregated . one metapopulation model of hiv in mainland china considered transmission within and between provinces, driven by the mobility of migrant labourers .
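a two-patch sketch of this kind of coupled structure is given below, before returning to the study's findings; the compartments follow a simple s-i-r layout and every rate is an arbitrary illustrative value, not a parameter of the china model:

```python
# two-patch s-i-r metapopulation sketch with migration coupling the patches;
# all rates are arbitrary illustrative values.
import numpy as np
from scipy.integrate import solve_ivp

beta = np.array([0.30, 0.15])     # transmission rate in each province (per day)
gamma = 0.10                      # recovery rate
m = 0.01                          # symmetric migration rate between patches

def rhs(t, y):
    S, I, R = y[0:2], y[2:4], y[4:6]
    N = S + I + R
    new_inf = beta * S * I / N
    migrate = lambda x: m * (x[::-1] - x)          # net flow between the two patches
    dS = -new_inf + migrate(S)
    dI = new_inf - gamma * I + migrate(I)
    dR = gamma * I + migrate(R)
    return np.concatenate([dS, dI, dR])

y0 = np.array([9999.0, 10000.0, 1.0, 0.0, 0.0, 0.0])   # one seed case in patch 1 only
sol = solve_ivp(rhs, (0, 365), y0, t_eval=[0, 90, 180, 365])
print(np.round(sol.y[2:4].T, 1))   # infectious counts in each patch over time
```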
given that the chinese provinces with employment opportunities for migrants are also those with the heaviest burden of hiv, migrant workers who acquire hiv often do so in the province where they work. however, government policy requires migrants to return to their home province for treatment. the movement of these workers perpetuates the disease cycle, as new migrants move to fill the vacated jobs and themselves become exposed to elevated hiv risk. these results therefore call for reconsideration of provincial treatment restrictions. multinational models. global policies, such as the treatment goals set by the joint united nations programme on hiv/aids (unaids), have been modelled on a global scale (table ) by considering the effectiveness of the policies for each nation. for example, a compartmental model was used to evaluate the potential impact of a partially efficacious hiv vaccine on the epidemiological trajectories in countries that together constitute over % of the global burden . the model was tailored to each country by fitting to country-specific incidence trends as well as diagnosis, treatment and viral suppression data. this model revealed that, even with efficacy as low as %, a hiv vaccine would avert millions of new infections worldwide, irrespective of whether ambitious treatment goals are met. these results identify the synergies between vaccination and treatment-as-prevention, and provide evidence to support continued investment in vaccine development , . from the cellular level to the population level, hiv modelling has led to improvements in drug formulations, clinical care and resource allocation. as scientific advances continue to bring pharmaceutical innovations, modelling will remain a useful tool for illuminating transmission dynamics and optimizing public health policy. hiv was not controlled before it became a pandemic, but our response to future outbreaks has the potential to be more timely . when diseases emerge in new settings, such as ebola in west africa and sars in china, modelling can be rapidly deployed to inform and support response efforts (fig. ) . unfortunately, the urgency of public health decisions during such outbreaks tends to be accompanied by a sparsity of data with which to parameterize, calibrate and validate models. as detailed below, uncertainty analysis-a method of analysing how uncertainty in input parameters translates to uncertainty in model outcome variables-becomes all the more vital in these situations. media attention regarding model predictions is often heightened during outbreaks, ironically at a time when modelling results are apt to be less robust than for well-characterized endemic diseases. we discuss the importance of careful communication regarding model recommendations and associated uncertainty to inform the public without fuelling excessive alarm. despite these challenges, and especially if these challenges can be navigated, the timely assessment of a wide range of intervention scenarios made possible by modelling would be particularly valuable during infectious disease emergencies. ebola virus outbreaks. the ebola virus outbreak struck a populous region near the border of guinea and sierra leone, sparking a crisis in a resource-constrained area that had no prior experience with the virus. as the caseload mounted and disseminated geographically, it became apparent that the west african outbreak would be unprecedented in its devastation. 
models were developed to estimate the potential size of the epidemic in the absence of intervention, demonstrating the urgent need for expanded action by the international community [ ] [ ] [ ] , and to calculate the scale of the required investment . initial control efforts included a militarily enforced quarantine of a liberian neighbourhood in which ebola was spreading. modelling analysis in collaboration with the liberian ministry of health demonstrated that the quarantine was ineffective and possibly even counterproductive . connecting the microbiological and population scales, another modelling study integrated within-host viral load data over the course of ebola infection and between-host transmission parameterized by contact-tracing data. the resulting dynamics highlighted the imperative to hospitalize most cases in isolation facilities within four days of symptom onset . these modelling predictions were borne out of empirical observations. early in the outbreak, when the incidence was precipitously growing, the average time to hospitalization in liberia was above six days . as contact tracing improved, the concomitant acceleration in hospitalization was found to be instrumental in turning the tide on the outbreak . in another approach, phylogenetic analysis and transmission modelling were combined to estimate underreporting rates and social clustering of transmission . this study informed public health authorities regarding the optimal scope and targeting of their efforts, which were central to stemming the epidemic. although data can be scarce for emerging pathogens, modellers can exploit similarities with better-characterized disease systems to investigate the potential efficiency of different interventions (box ). as vaccine candidates became available against ebola, ring vaccination was proposed based on the success of the strategy in eliminating smallpox , another microorganism whose transmission required close contact between individuals and for which peak infectiousness occurs after the appearance of symptoms. compartmental models had suggested parameter combinations for which ring vaccination would be superior to mass vaccination , and methodological advances subsequently allowed for explicit incorporation of contact network data . modelling based on social and healthcare contact networks specific to west africa supported implementation of ring vaccination , and the approach was adopted for the clinical trial of the vaccine . in , two independent outbreaks of ebola erupted in the democratic republic of the congo. during the initial outbreak in Équateur province, modellers combined case reports with time series from previous outbreaks to generate projections of final epidemic size that could inform preparedness planning and allocation of resources . ring vaccination was again deployed, this time within two weeks of detecting the outbreak. a spatial model quantified the impact of vaccine on both the ultimate burden and geographic spread of ebola, highlighting how even one week of additional delay would have substantially reduced the ability of vaccination to contain this outbreak . the second outbreak was reported in august in the north kivu province. armed conflict in this region has interfered with the ability of healthcare workers to conduct the necessary contact tracing, vaccination and treatment. 
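the ring-vaccination idea above can be sketched on an explicit contact network. the following toy simulation (assumed transmission and detection probabilities, and a generic small-world graph rather than the west african contact data) compares the number of people ever infected with and without immunizing the contacts of detected cases.

```python
# toy ring-vaccination comparison on a clustered contact network (assumed parameters)
import random
import networkx as nx

def simulate(G, p_transmit=0.08, p_detect=0.7, ring=False, seed=1):
    random.seed(seed)
    status = {n: "S" for n in G}   # S susceptible, I infectious, R recovered, V vaccinated
    index_case = random.choice(list(G))
    status[index_case] = "I"
    infected = 1
    while any(s == "I" for s in status.values()):
        for case in [n for n, s in status.items() if s == "I"]:
            for nb in G.neighbors(case):
                if status[nb] == "S" and random.random() < p_transmit:
                    status[nb] = "I"
                    infected += 1
            status[case] = "R"
            if ring and random.random() < p_detect:
                for nb in G.neighbors(case):      # vaccinate the detected case's ring
                    if status[nb] == "S":
                        status[nb] = "V"
    return infected

G = nx.watts_strogatz_graph(2000, 8, 0.05, seed=42)   # clustered "small-world" contacts
print("no vaccination  :", simulate(G, ring=False), "people ever infected")
print("ring vaccination:", simulate(G, ring=True), "people ever infected")
```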
as conditions make routine data collection difficult and even dangerous, modelling has the potential to provide crucial insights into the otherwise unobservable characteristics of this outbreak. in contrast to the unexpected emergence of ebola in a new setting, the influenza virus has repeatedly demonstrated its ability to cause pandemics. a pandemic is an event in which a pathogen creates epidemics across the entire globe. the 1918 pandemic killed an estimated 50 million people worldwide, exceeding the combined military and civilian casualties of world war i. while the % case-fatality rate of the strain was approximately times higher than is typical for influenza, pathogenic strains with case-fatality rates exceeding % periodically emerge. modelling has illustrated how repeated zoonotic introductions impose selection for elevated human-to-human transmissibility, which thereby exacerbates the threat of a devastating influenza pandemic. such threats underscore the importance of surveillance systems and preparedness plans, which can be informed by modelling (box ). transmission models are able to optimize surveillance systems, accelerate outbreak detection and improve forecasting [ ] [ ] [ ] [ ]. for example, a spatial model integrating a variety of surveillance data streams and embedded in a user-friendly platform is currently implemented by the texas department of state health services to generate real-time influenza forecasts (http://flu.tacc.utexas.edu/). modelling has also motivated the development of dynamic preparedness plans, which adapt in response to the unfolding events of a pandemic, as models identified that adaptive efforts would be more likely to contain an influenza pandemic than static policies chosen a priori. other pandemic influenza analyses used age-structured compartmental models to study the trade-off between targeting influenza vaccination to groups that transmit many infections but experience relatively low health burdens (for example, schoolchildren) versus groups that transmit fewer infections but experience greater health burdens (for example, the elderly). such examples illustrate the insights that modelling has provided to the decision makers charged with maintaining readiness against situations that are at once rare and catastrophic. modelling has also examined the impact of human behaviour, including vaccination decisions and social interactions, on the course of an epidemic. public health interventions are not always sufficient to ensure disease control, as behavioural factors can thwart progress [ ] [ ] [ ] [ ]. for example, reports of potential neurological side effects from the whole-cell pertussis vaccine led to a steep decline in vaccine uptake throughout the uk, followed by a slow recovery (fig. a). vaccine uptake ebbed and flowed over the next two decades, with higher rates of vaccination in the wake of large pertussis outbreaks (fig. b). compartmental models analysing the interplay between vaccine uptake and disease dynamics confirmed the hypothesis that increases in vaccination were a response to the pertussis infection risk, and showed that incorporating this interplay can improve epidemiological forecasts. figure caption fragment: types of projection that can be generated include outbreak trajectories, disease burdens and economic impact; probabilistic uncertainty analyses convey not only model projections of policy outcomes, but also quantification of confidence in the projections; as policies are adopted and the microbiological system is influenced accordingly, the model can be iteratively updated to reflect the shifting status quo, thereby progressively optimizing policies within an evolving system.
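a minimal sketch of the coupled disease-behaviour feedback described above: in this toy model, vaccine uptake rises with recently perceived infection risk and falls with a perceived vaccine risk, so coverage and incidence feed back on one another. all parameters and the behavioural rule are assumptions for illustration, not the published pertussis analyses.

```python
# toy coupled disease-behaviour model: uptake responds to incidence (assumed parameters)
import numpy as np

days = 4000
beta, gamma, births = 0.3, 0.1, 1.0 / (70 * 365)   # transmission, recovery, birth/death rate
perceived_vaccine_risk = 0.00005                   # assumed constant "scare" level
S, I, R = 0.4, 0.001, 0.599
coverage_history, incidence_history = [], []

for t in range(days):
    incidence = beta * S * I
    # behavioural rule: uptake grows with perceived infection risk, shrinks with the vaccine scare
    uptake = min(1.0, max(0.0, 500 * incidence - 100 * perceived_vaccine_risk + 0.5))
    S += births * (1 - uptake) - incidence - births * S
    R += births * uptake + gamma * I - births * R
    I += incidence - gamma * I - births * I
    coverage_history.append(uptake)
    incidence_history.append(incidence)

print("mean uptake over the last year : %.2f" % np.mean(coverage_history[-365:]))
print("mean daily incidence (fraction): %.6f" % np.mean(incidence_history[-365:]))
```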
network models extending these coupled disease-behaviour analyses have illustrated how the perceived risk of vaccination can have greater influence on vaccine uptake than disease incidence. more recently, vaccine refusal has led to the resurgence of measles in the usa. researchers are turning to social media to gather information about attitudes toward vaccines and infectious diseases, and to glean clues about vaccinating behaviour. for instance, signals that vaccine refusal is compromising elimination can be detected months or years in advance of disease resurgence by applying mathematical analysis of tipping points to social media data that have been classified on the basis of sentiment using machine learning algorithms. these and other data science techniques might help public health authorities identify the specific communities that are at increased risk of future outbreaks. on shorter timescales, the near-instantaneous availability of social media data facilitates its integration into models developed for outbreak response. other behavioural factors that have been incorporated into transmission models include attendance at social gatherings, sexual behaviour and commuting patterns, elements which are also often affected by perceived infection risk. antimicrobial resistance. a substantial portion of the increase in human lifespan over the last century is attributable to antibiotics, but the emergence of pathogen strains that are resistant to antimicrobials threatens to reverse these gains. the extensive use and misuse of antibiotics has led to the evolution of multidrug-resistant, extensively drug-resistant and even pan-drug-resistant pathogens across the globe. precariously, this evolution outpaces the development of new antibiotics. mathematical modelling is being used to identify strategies to forestall the emergence and re-emergence of antimicrobial resistance. models are particularly valuable for comparing alternative strategies, such as administration of different antibiotics within the same hospital ward, temporal cycling of antibiotics and combination therapy [ ] [ ] [ ] [ ]. high-performance computing now permits the rapid exploration of multidimensional parameter space. models can thereby narrow an array of possible interventions down to a subset likely to have the highest impact or optimize between trade-offs, such as effectiveness and cost (box ). by contrast, expense, feasibility and ethical considerations may impose more limitations on in vivo investigations (box ). not only can models identify the optimal strategy for a given parameter set, but they can generate the probability that this intervention remains optimal across variation in the parameters. for example, an optimization routine combined with simulation of hospital-based interventions identified combination therapy as most likely to reduce antibiotic resistance. as a complementary approach, modelling can incorporate economic considerations into these evaluations. a stochastic compartmental model showed that infection control specialists dedicated to promoting hand hygiene in hospitals are cost-effective for limiting the spread of antibiotic resistance.
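as an illustration of the cycling-versus-mixing comparison mentioned above, the following toy ward simulation assigns drugs either by 90-day cycling or at random per patient and counts treatment failures against imported resistant strains. every rate, the ward size and the policy details are assumptions; real analyses use far richer models.

```python
# toy ward-level comparison of antibiotic 'cycling' vs 'mixing' (all rates assumed)
import random

def simulate(policy, days=3 * 365, beds=40, seed=0):
    random.seed(seed)
    # each occupied bed holds None (uncolonized) or a strain name "resA" / "resB"
    ward = [None] * beds
    failures = 0
    for day in range(days):
        ward_drug = "A" if (day // 90) % 2 == 0 else "B"       # used only under cycling
        for bed in range(beds):
            if random.random() < 0.10:                          # discharge and new admission
                ward[bed] = random.choice(["resA", "resB", None, None, None])
            # within-ward transmission from colonized patients
            if ward[bed] is None and random.random() < 0.02 * sum(p is not None for p in ward) / beds:
                ward[bed] = random.choice([p for p in ward if p is not None])
            if ward[bed] is not None:
                drug = ward_drug if policy == "cycling" else random.choice(["A", "B"])
                if ward[bed] == "res" + drug:
                    failures += 1                               # resistant to the assigned drug
                else:
                    ward[bed] = None                            # treatment clears carriage
    return failures

print("treatment failures under cycling:", simulate("cycling"))
print("treatment failures under mixing :", simulate("mixing"))
```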
although most models of antibiotic resistance have focused on transmission in healthcare settings, the importance of antibiotic resistance in natural, agricultural and urban settings has been increasingly recognized [ ] [ ] [ ] [ ] [ ] [ ] [ ] . for example, a metapopulation model of antimicrobial-resistant clostridium difficile simulated its transmission within and between hospitals, long-term care facilities and the community. this model demonstrated that mitigating risk in the community has the potential to substantially avert hospital-onset cases by decreasing the number of patients with colonization at admission and thereby the transmission within hospitals . this study illustrates how models can consider the entire ecosystem of infection to elucidate dynamics that might not be captured through focus on a single setting. during the initial phase of an outbreak, the predictive power of models is often constrained by data scarcity. this challenge is exacerbated for outbreaks of novel emerging diseases given that our understanding of the disease will rely on the unfolding epidemic (fig. ) . not only can the absence of data constrain model design, but sparse data requires extensive sensitivity analyses to evaluate the robustness of conclusions. univariate sensitivity analyses, in which individual parameters are varied incrementally above and below a point estimate, can identify which parameters most influence model output (box ). such comparisons reveal both salient gaps in knowledge and targets for preventing and mitigating the outbreak (box ) . as an outbreak progresses, each day has the potential to provide more information about the new disease, including its duration of latency, the symptomatic period, infectiousness, transmission modalities, underreporting and the case-fatality rate. however, collecting detailed data to inform each of these parameters can strain resources when they are thinly spread during an emergency response. sensitivity analysis can support clinicians and epidemiologists in prioritizing data collection efforts . parameterization challenges are compounded for complicated disease systems, such as vector-borne diseases. for example, models of zika virus infection span both species and scales, as the disease trajectory is influenced by factors ranging from mosquito seasonality and mosquito abundance down to viral and immunological dynamics within human and mosquito hosts , . adding to this complexity, the ecological parameters vary seasonally and geographically-heterogeneities that may be amplified by socioeconomic factors modulating human exposure to infected mosquitoes . in the absence of the high-resolution data that would be ideal to tailor a mosquitodriven disease system to a given setting, uncertainty analysis can unify parameterization from disparate data sources. in contrast to univariate sensitivity analyses, uncertainty analysis simultaneously samples from empirical-or expert-informed distributions for many or all input parameters. collaboration between modellers and disease experts is thus instrumental to ensuring the biological plausibility of these parameter distributions , . the uncertainty analysis produces both a central point estimate and a range for each outcome, a combination which can inform stakeholders about the best-case and worst-case scenarios as well as the likelihood that an intervention will be successful [ ] [ ] [ ] . 
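the uncertainty analysis described above can be sketched in a few lines: sample the poorly known inputs of a simple model from assumed, stand-in prior distributions, run the model for each draw, and summarize the outcome with a central estimate, an interval and the probability of a scenario of interest.

```python
# minimal probabilistic uncertainty analysis over a simple SIR model (assumed priors)
import numpy as np
from scipy.integrate import odeint

rng = np.random.default_rng(7)
N, n_draws = 100_000, 1000

def final_size(beta, gamma, days=500):
    def sir(y, t):
        S, I, R = y
        return [-beta * S * I / N, beta * S * I / N - gamma * I, gamma * I]
    y = odeint(sir, [N - 5, 5, 0], np.linspace(0, days, days + 1))
    return y[-1, 2]

# stand-in "expert-informed" priors for the transmission and recovery rates
betas = rng.normal(0.35, 0.05, n_draws).clip(0.05)
gammas = rng.uniform(0.15, 0.25, n_draws)

sizes = np.array([final_size(b, g) for b, g in zip(betas, gammas)])
lo, med, hi = np.percentile(sizes, [2.5, 50, 97.5])
print(f"final size: median {med:,.0f}, 95% interval {lo:,.0f} to {hi:,.0f}")
print(f"probability the outbreak infects over half the population: {np.mean(sizes > N / 2):.2f}")
```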
in constructing models and communicating results, there are common pitfalls which can compromise the rigor and impact of the research. a pervasive pitfall is the incorporation of excessive model complexity, particularly through inclusion of more parameters than can be reliably parameterized from data. intuition might suggest that a complex representation of a microbiological system would more closely represent reality. however, the predictive power of a model can be degraded if incorporating additional parameters only marginally improves the fit to data. this tendency results in complicated transmission models that overfit data in much the same way that complicated statistical regressions can overfit data, replicating not only the relevant trends but also the noise in a particular data set. these overfit models thus become less useful for prediction and generalization , . to guide appropriate model complexity and parameterization, modellers have used the mathematical theory of information to develop criteria which quantify the balance between realism and simplicity. such criteria penalize additional parameters but reward substantial improvements in fit, thereby identifying the simplest model that can adequately fit the data , , . these methods can be applied to select among models or alternatively to calculate weighted average predictions across models. in a similar vein, modelling consortiums serve to address uncertainty surrounding model design [ ] [ ] [ ] . in a consortium, several modelling groups develop their models independently, each applying their particular expertise and perspective. for example, consortia of malaria modellers were convened to predict the effectiveness of interventions, including a vaccine candidate and mass drug administration . congruence of output among models engenders confidence that model results are robust. another pitfall concerns the quality of data used to inform the model. incompleteness of data has been an issue since , when daniel bernoulli published a compartmental model of smallpox and acknowledged that more extended analyses would have been possible if the data had been age-stratified . even today, using data to develop models without knowledge of how the data were collected or the limitations of the data can be risky. data collected for an alternative purpose can contain gaps or biases that are acceptable for the original research question, yet lead to incorrect conclusions when incorporated for another purpose in a specific model. in ideal circumstances, modellers would be involved in the design of the original study, ensuring both seamless integration of the results into the model and awareness on the part of the modeller with regard to data limitations. failing that, it is very helpful for modellers to collaborate with scientists familiar with the details of empirical studies on which their results might depend. this lack of familiarity with the biases or incompleteness of data sources may be particularly dangerous in the era of digital data. 'big data hubris' can blind researchers to the limitations of the dataset, such as being a large but unrepresentative sample of the general population, or the alteration of search engine algorithms partway through the data collection process . some of these limitations can be addressed by using digital data as a complement to traditional data sources. 
in this way, the weakness of one data source (for example, low sample size of traditional surveys or bias in large digital data) can be compensated by the strengths of another data source (for example, balanced representation in small survey versus large scale of digital data). a final pitfall that often arises in the midst of an ongoing outbreak concerns the interpretation of epidemic projections. initial models may assume an absence of intervention as a way to assess the potential consequences of inaction. such projections may contribute to the mobilization of government resources towards control, as was the case during the west african ebola outbreak , , . in this respect, the projections are intended to make themselves obsolete . in retrospect and without knowledge of the initial purpose of the model, it may appear that the initial predictions were excessively pessimistic . additionally, people living in outbreak zones often change their behaviour to reduce infection risks, thereby mitigating disease spread through, for example, reducing social interactions or increasing vaccine uptake (fig. ) , , . thus, risk assessment constitutes a 'moving target' . for example, input parameters estimated from contact tracing early in an outbreak could require adjustments to reflect these behaviour changes and accurately predict subsequent dynamics . the need for proficient communication skills is heightened during an outbreak. this concern is particularly relevant when presenting sensitivity and uncertainty analyses. although predictions at the extreme of sensitivity analyses also tend to be less probable than mid-range projections, there can be a temptation to focus on the most sensational model scenarios. ensuing public pressure on the basis of misunderstood findings can cause unwarranted alarm and trigger counterproductive political decisions. in both publications and media interactions, underscoring the improbability of extreme scenarios explored during sensitivity analysis, as well as how improved interventions turn a predictive model into a counterfactual one, may pre-empt this pitfall . the role for modelling in supporting epidemiologists, public health officials and microbiologists has progressively expanded since the foundational publications forty years ago, in concert with the growing abundance and granularity of data as well as the refinement of quantitative approaches. models have now been developed for virtually every human infectious disease, as well as in many that affect animals and plants, and have been applied across the globe. interdisciplinary collaboration among empiricists, policymakers and modellers facilitates the development of scientifically grounded models for specific settings and generates results that will be actionable in the real world. reciprocally, modelling results may guide the design of experiments and field studies by revealing key gaps in our understanding of microbiological systems. furthermore, modelling is a feasible and cost-effective approach for identifying impactful policies prior to implementation decisions. through all these avenues, epidemiological modelling galvanizes evidence-based action to alleviate disease burden and improve global health. vaccine uptake appears to be entrained by surges in infection incidence. mathematical models can capture the interplay between natural and human dynamics exemplified in this dataset and a wide variety of other study systems. 
global, regional, and national age-sex specific mortality for causes of death, - : a systematic analysis for the global burden of disease study population biology of infectious diseases: part i population biology of infectious diseases: part ii modeling infectious disease dynamics in the complex landscape of global health formalizing the role of agent-based modeling in causal inference and epidemiology big data. the parable of google flu: traps in big data analysis predicting the effectiveness of prevention: a role for epidemiological modeling bridging the gap between evidence and policy for infectious diseases: how models can aid public health decision-making preventing acquisition of hiv is the only path to an aids-free generation multiscale modelling in immunology: a review hiv- dynamics in vivo: virion clearance rate, infected cell life-span, and viral generation time dynamics of hiv- and cd + lymphocytes in vivo decay characteristics of hiv- -infected compartments during combination therapy modeling latently infected cell activation: viral and latent reservoir persistence, and viral blips in hiv-infected patients on potent therapy hiv reservoir size and persistence are driven by t cell survival and homeostatic proliferation modeling the within-host dynamics of hiv infection assessment of epidemic projections using recent hiv survey data in south africa: a validation analysis of ten mathematical models of hiv epidemiology in the antiretroviral therapy era the risk of hiv transmission at each step of the hiv care continuum among people who inject drugs: a modeling study infectious diseases of humans: dynamics and control cost-effectiveness of a community-based intervention for reducing the transmission of schistosoma haematobium and hiv in africa evaluating the potential impact of mass praziquantel administration for hiv prevention in schistosoma haematobium high-risk communities when individual behaviour matters: homogeneous and network models in epidemiology connecting the dots: network data and models in hiv epidemiology detailed transmission network analysis of a large opiate-driven outbreak of hiv infection in the united states bayesian inference of spreading processes on networks concurrent partnerships and the spread of hiv concurrency is more complex than it seems network-related mechanisms may help explain long-term hiv- seroprevalence levels that remain high but do not approach population-group saturation metapopulation dynamics of rabies and the efficacy of vaccination predicting the hiv/aids epidemic and measuring the effect of mobility in mainland effectiveness of unaids targets and hiv vaccination across countries an hiv vaccine is essential for ending the hiv/aids pandemic opinion: mathematical models: a key tool for outbreak response ebola virus disease in west africa-the first months of the epidemic and forward projections estimating the future number of cases in the ebola epidemic -liberia and sierra leone impact of bed capacity on spatiotemporal shifts in ebola transmission dynamics and control of ebola virus transmission in montserrado, liberia: a mathematical modelling analysis strategies for containing ebola in west africa effect of ebola progression on transmission and control in liberia interrupting ebola transmission in liberia through community-based initiatives epidemiological and viral genomic sequence analysis of the ebola outbreak reveals clustered transmission selective epidemiologic control in smallpox eradication emergency response to a smallpox attack: 
the case for mass vaccination the impact of contact tracing in clustered populations harnessing case isolation and ring vaccination to control ebola efficacy and effectiveness of an rvsv-vectored vaccine in preventing ebola virus disease: final results from the guinea ring vaccination, open-label, cluster-randomised trial (ebola Ça suffit!) projections of ebola outbreak size and duration with and without vaccine use in Équateur ebola vaccination in the democratic republic of the congo emergence and pandemic potential of swine-origin h n influenza virus estimated influenza illnesses, medical visits, hospitalizations, and deaths averted by vaccination in the united states global epidemiology of avian influenza a h n virus infection in humans, - : a systematic review of individual case data the role of evolution in the emergence of infectious diseases information technology and global surveillance of cases of h n influenza optimizing provider recruitment for influenza surveillance networks influenza a (h n ) and the importance of digital epidemiology disease surveillance on complex social networks optimizing infectious disease interventions during an emerging epidemic optimizing influenza vaccine distribution nine challenges in incorporating the dynamics of behaviour in infectious diseases models evolutionary game theory and social learning can determine how vaccine scares unfold department of health & human services. measles cases and outbreaks the pertussis vaccine controversy in great britain the impact of rare but severe vaccine adverse events on behaviour-disease dynamics: a network model measles outbreak: opposition to vaccine extends well beyond ultra-orthodox jews critical dynamics in population vaccinating behavior digital epidemiology multiscale mobility networks and the spatial spreading of infectious diseases capturing sexual contact patterns in modelling the spread of sexually transmitted infections: evidence using natsal- the perpetual challenge of antimicrobial resistance factors affecting the reversal of antimicrobial-drug resistance retrospective evidence for a biological cost of vancomycin resistance determinants in the absence of glycopeptide selective pressures multistrain models predict sequential multidrug treatment strategies to result in less antimicrobial resistance than combination treatment universal or targeted approach to prevent the transmission of extended-spectrum beta-lactamase-producing enterobacteriaceae in intensive care units: a cost-effectiveness analysis modeling antibiotic treatment in hospitals: a systematic approach shows benefits of combination therapy over cycling, mixing, and mono-drug therapies why sensitive bacteria are resistant to hospital infection control call of the wild: antibiotic resistance genes in natural environments antibiotic resistant bacteria are widespread in songbirds across rural and urban environments spatial and temporal distribution of antibiotic resistomes in a peri-urban area is associated significantly with anthropogenic activities impact of point sources on antibiotic resistance genes in the natural environment: a systematic review of the evidence investigating antibiotics, antibiotic resistance genes, and microbial contaminants in groundwater in relation to the proximity of urban areas systematic review: impact of point sources on antibioticresistant bacteria in the natural environment environmental factors influencing the development and spread of antibiotic resistance quantifying transmission of clostridium difficile 
within and outside healthcare settings optimal cost-effectiveness decisions: the role of the cost-effectiveness acceptability curve (ceac), the cost-effectiveness acceptability frontier (ceaf), and the expected value of perfection information (evpi) probabilistic uncertainty analysis of epidemiological modeling to guide public health intervention policy zika viral dynamics and shedding in rhesus and cynomolgus macaques evaluating vaccination strategies for zika virus in the americas epidemic dengue and dengue hemorrhagic fever at the texas-mexico border: results of a household-based seroepidemiologic survey assessing real-time zika risk in the united states one health approach to cost-effective rabies control in india cost-effectiveness of canine vaccination to prevent human rabies in rural tanzania cost-effectiveness of next-generation vaccines: the case of pertussis optimizing the impact of low-efficacy influenza vaccines reassessing google flu trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales a new look at the statistical model identification model selection in ecology and evolution direct and indirect effects of rotavirus vaccination: comparing predictions from transmission dynamic models data-driven models to predict the elimination of sleeping sickness in former equateur province of drc learning from multi-model comparisons: collaboration leads to insights, but limitations remain public health impact and cost-effectiveness of the rts, s/as malaria vaccine: a systematic comparison of predictions from four mathematical models role of mass drug administration in elimination of plasmodium falciparum malaria: a consensus modelling study bernoulli's epidemiological model revisited ebola: models do more than forecast models overestimate ebola cases coupled contagion dynamics of fear and disease: mathematical and computational explorations ecological theory suggests that antimicrobial cycling will not reduce antimicrobial resistance in hospitals the authors gratefully acknowledge funding from the notsew orm sands foundation (grants to m.c.f., j.p.t. and a.p.g.), the national institutes of health (grant nos. k ai and u gm to m.c.f. and a.p.g., respectively) and the natural sciences and engineering research council of canada (grant no. rgpin- - to c.t.b.). the authors also thank c. wells and a. pandey, both members of the yale center for infectious disease modeling and analysis, for their helpful discussions regarding the hiv and ebola modelling literature. m.c.f. and a.p.g. drafted the initial manuscript. m.c.f., c.t.b., j.p.t. and a.p.g. all critically revised the content. the authors declare no competing interests. correspondence should be addressed to a.p.g.reprints and permissions information is available at www.nature.com/reprints.publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. key: cord- -f z authors: nikitenkova, s.; kovriguine, d. a. title: it's the very time to learn a pandemic lesson: why have predictive techniques been ineffective when describing long-term events? date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: f z we have detected a regular component of the monitoring error of officially registered total cases of the spread of the current pandemic. 
this regular error component explains the reason for the failure of a priori mathematical modelling of probable epidemic events in different countries of the world. processing statistical data of countries that have reached an epidemic peak has shown that this regular monitoring error obeys a simple analytical regularity, which allows us to answer the question: is this or that country that has already passed the threshold of the epidemic close to its peak, or is it still far from it? people desire to know about future epidemic events in terms of where, how much and when. this information creates assurance in society, an adequate understanding of current events and an effective reaction to them. studying the history of the pandemic in detail will undoubtedly contribute to fruitful strategic thinking. the practice of recent months has shown that well-tested and well-known mathematical methods for processing statistical data, when used directly to describe future events in the spread of a pandemic, have been ineffective in solving such a flagrantly important and urgent problem. meanwhile, it is no secret that the aggregated dynamics of a pandemic is very simple: it is described quite fully by the solution of the well-known logistic equation, whether discrete or differential. neither the more developed versions of this mathematical model nor other promising predictive technologies based on graph theory, percolation theory, and stochastic processes have helped: they turned out to be too clumsy or incapable of producing a clear and concrete result on an urgent issue. in other words, the available methods for processing statistical data were found to be helpless in solving an important specific problem. the paradox is the apparent inefficiency of using the known methods of processing statistical data to obtain an adequate quantitative result, even though, it would seem, everything about the dynamics of the pandemic is known qualitatively in advance. let's try to understand the main reasons for such an unfortunate failure and try to formulate the principles for overcoming the current difficult situation. first of all, what does an epidemic look like? in the dynamics of its development and spread, an epidemic is similar to a forest fire. with fire, everything is understood a little more easily because the phenomenon is so evident. no one needs to delve into the physics of combustion and the nature of fire: the risk that adjacent sections of the forest will ignite and the threat to human settlements are obvious, and the assessment of the forces and means necessary to extinguish a fire is almost foreseeable. the epidemic does not have such striking external manifestations, so it is insidious and dangerous, especially for uninformed people, of whom, as a rule, there is no shortage. besides, the decisive actions taken, say, to extinguish a fire are completely inapplicable if one tries to transfer them to combatting an epidemic. in the absence of vaccination and other effective medical countermeasures to the spread and development of the epidemic, quarantine seems to be the most effective means, just as it was fifty years ago and earlier.
self-isolation and social distancing are effective only when each member of society realizes that any action, even a legitimate one, borders on risk. to understand the reason for the failures of mathematical modelling, we first turn our attention to the statistical data used to make the forecast, since the forecast directly depends on them. it is not a revelation that, for completely natural reasons, statistics are not necessarily flawless. note that, in contrast to the experimental data dealt with, say, in physical experiments, data on the epidemic situation cannot be redundant, but only insufficient. indeed, it is difficult to imagine a situation where redundant data regularly appear in the monitoring summary of the number of newly detected cases, since the matter is too delicate to allow such sloppiness. most likely, we should expect that the data will be underestimated due to a lack of information. therefore, it is natural to assume that the data contain a regular error component in the form of an undercount. indeed, comprehensive pandemic data can only be obtained a posteriori, but life requires reliable a priori information. to achieve this goal, it is necessary to identify, evaluate and study the mentioned regular component of the error, using the statistics of those countries that have already reached a peak, i.e. the stationary level of the epidemic dynamics. let us take a logistic pattern for the predicted result, since it requires a minimum of information for modelling, i.e., the initial number of cases, the final peak number, and the number of days from the outbreak to its peak. one can use appropriate data to build these logistic curves, for example from the site, with parameters determined from the tables. the transition time from the state n_i to the state n_(i+1) is one day, but this does not play a significant role at large numbers. after using simple selection criteria, the obtained data appear as tables and graphs [ ]. suppose that the temporal evolution of the epidemic in countries possessing high-quality statistics is adequately described by the solutions of the logistic equation. let the infection rate a_1 be a value specific to each territory or country. it is easy to guess that the parameter a_0 should tend to zero due to elementary statistical properties; thus, we can interpret the relative indicator a_0/n* as the quality of the statistical data. the always positive parameter a_1 determines the maximum spread of the epidemic. the critical parameter determining the stable steady state n* is a_2. note that abstract theories connect this parameter with the concept of intraspecific competition, but in our case an attempt to interpret this indicator will lead to nothing. it remains to be assumed that this negative parameter is an indicator of medical and other useful actions to counteract the spread and development of the epidemic. now let's look at a specific example of the epidemic history in one of the countries that have reached the peak of the epidemic. let it be the most affected european country today, spain, shown in fig. .
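a sketch of the logistic forecast curve described above, built from the three quantities the text lists (initial cases, peak level and days to peak). the standard logistic form dn/dt = a_1 n + a_2 n^2, with a_2 < 0 and steady state n* = -a_1/a_2, is assumed here, since the preprint's exact parameterization is not recoverable from this text; the spain-like numbers are likewise only illustrative.

```python
# logistic forecast curve built from initial cases, peak level and days to peak (assumed form)
import numpy as np

def logistic_curve(t, n0, n_star, days_to_peak, peak_fraction=0.99):
    # closed-form solution of the standard logistic equation; the growth rate a1 is chosen
    # so that the curve reaches `peak_fraction` of n_star after `days_to_peak` days
    c = (n_star - n0) / n0
    a1 = np.log(c * peak_fraction / (1 - peak_fraction)) / days_to_peak
    return n_star / (1 + c * np.exp(-a1 * t))

t = np.arange(0, 121)
forecast = logistic_curve(t, n0=50, n_star=240_000, days_to_peak=90)   # Spain-like scale, assumed
inflexion_day = int(np.argmax(np.diff(forecast)))                      # the "threshold" of the text
print("inflexion ('threshold') around day", inflexion_day,
      "with", int(forecast[inflexion_day]), "cumulative cases (about half of the peak)")
```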
the above graphs show that the discrepancy between the real data and the forecast is most significant from the early stages of the epidemic until the inflexion point of the logistic curve, which it is appropriate to call a threshold. in the figures, this threshold is indicated by a dotted line. now let's try to evaluate the absolute and relative forecasting error graphically, as shown in fig. . this illustration shows that the absolute error is colossal throughout the history of the epidemic. the relative error becomes acceptable only once the absolute error has passed its maximum value. fortunately, the actual data, starting from the beginning of the epidemic and almost up to the point of the specified maximum, are well described by the exponential dependence shown in the left part of fig. . this fact is not a revelation: the behaviour of the real data over time shown in the graphs is not a specific feature of the history of the epidemic in spain alone. the authors of some works on the problem of the pandemic have noticed this pattern, but link this regular exponential component of the error with useful medical intervention in the course of the early stages of the epidemic [ ]. however, it is easier to assume that this unavoidable regular error is due only to the specifics of data monitoring. moreover, the analytical approximation of the input data set continues further, from the point of maximum absolute error up to the inflexion point [ ], [ ], [ ]. the figure shows the power fit on the indicated time interval.
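the two approximations discussed above can be reproduced on synthetic "reported totals": an exponential fit from the outbreak start up to an assumed day of maximum absolute error, and a power-law fit from that day up to the inflexion point. the data, the split day and the functional forms are assumptions for illustration only.

```python
# exponential and power-law fits to synthetic early-phase monitoring data (assumed forms)
import numpy as np
from scipy.optimize import curve_fit

exponential = lambda t, a, b: a * np.exp(b * t)
power_law = lambda t, c, k: c * t ** k

rng = np.random.default_rng(3)
t = np.arange(1, 61, dtype=float)
observed = 30 * np.exp(0.12 * t) / (1 + np.exp(0.12 * (t - 45)))   # stand-in "reported totals"
observed *= rng.lognormal(0, 0.03, t.size)                         # small monitoring noise

split = 35                                   # assumed day of maximum absolute error
(p_exp, _), (p_pow, _) = (curve_fit(exponential, t[:split], observed[:split], p0=(10, 0.1)),
                          curve_fit(power_law, t[split:], observed[split:], p0=(1, 2)))
print("early phase  ~ %.1f * exp(%.3f t)" % tuple(p_exp))
print("middle phase ~ %.2f * t^%.2f" % tuple(p_pow))
```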
also, in this aspect, countries such as ecuador, japan, slovakia, guinea-bissau, kosovo are not amenable to statistical processing. in some other countries, as a rule, with a few registered cases, the causal principle is formally violated. this means that monitoring data overtakes in time the data calculated by the logistic forecast curve. since in reality this principle is never violated, it remains to be assumed that the monitoring data provided by such countries is incomplete. if desired and zealous, this error is not difficult to correct, assuming that the epidemic could have begun somewhat earlier than the officially declared date. there are some countries, also, as a rule, with some recorded cases, where the forecast data surprisingly practically coincide with the monitoring data. these countries include ireland, iceland, new zealand. we assume that the regular component of the monitoring error has a place to be. could this information have a positive effect on the reliability of predicting future epidemic events in other countries that have not yet reached the peak of the epidemic? if we take into account the evolution of the indicated regular error component in a dynamic way, that is, ad-. cc-by-nc-nd . international license it is made available under a is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. the copyright holder for this preprint this version posted june , . . just the input data appropriately, this problem will not completely disappear. the parameters of the exponential and power functions that approximate the input data are unknown a priori. the history of countries that have reached the peak of the epidemic is of fundamental value. this value is manifested in the appearance of a pattern in the so-called virtual time delay of monitoring data. the indicated delay time is determined as follows. mentally, we draw a horizontal line on the graph, which is shown on the left in fig. . we determine the intersection points of this line with the forecast logistics curve and graphs of functions that approximate the input data. the time difference at the indicated intersection points determines the virtual delay time of the monitoring data. in other words, if you move the monitoring points from right to left by the desired time interval, then all the points will be on the logistic curve, which will lead to a correct prediction, even if you do not use all the new data obtained, but only a small part of them. in fig. . replacing the real input data with approximating functions creates an error of not more than per cent; . a country has fetched its epidemic peak to the current date for at least a week ago. . cc-by-nc-nd . international license it is made available under a is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. the copyright holder for this preprint this version posted june , . . unfortunately, the above analysis cannot be applied to us monitoring data, although this country has confidently crossed the epidemic threshold. also, it is not yet possible to analyse the situation in brazil, since the epidemic in this country is at the initial stage of the exponential growth of the epidemic, despite the unprecedentedly large number of officially recorded cases in this country. . cc-by-nc-nd . international license it is made available under a is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. 
the copyright holder for this preprint this version posted june , . saudi arabia and russia, which have crossed the epidemic threshold but have not yet reached its peak. we have used the same illustration is as in fig. on the left as a background for visual comparison. we have found a regular component of the monitoring error of officially registered total cases of the spread of the current pandemic. this regular error component explains the reason for the failure of a priori mathematical modelling of probable epidemic events in different countries of the world. processing statistical data of countries that have reached an epidemic peak has shown that this regular monitoring obeys a simple analytical regularity. this pattern allows us to answer the question: is this or that country that has already crossed the threshold of the epidemic close to a peak or is still far from it. not far off is that happy day when the world will cope with the pandemic as a whole. monitoring data on the current pandemic, collected in its entirety, will allow us to establish valuable a posteriori patterns to significantly improve the quality of a priori dynamic modelling of epidemic events when they appear in the future. † the united kingdom has reached the epidemic peak while writing this text. . cc-by-nc-nd . international license it is made available under a is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. (which was not certified by peer review) the copyright holder for this preprint this version posted june , . . https://doi.org/ . / . . . doi: medrxiv preprint the countries of the "golden billion", despite the sometimes catastrophic situation allowed in this part of the world, judging by the dynamics of current events, are the first to pass the peak of the epidemic. the fate of the rest, most of the world remains uncertain. special attention of an enlightened society should be focused today on the most painful epidemic situation in the usa, brazil and russia. the statistics of these countries indicate the absence of a precedent in the history of the epidemic compared to other countries that are already close to recovery. a country called the world is at the beginning of the pandemic despite the truly gigantic scale of the disaster already achieved. a novel coronavirus from patients with pneumonia in china novel coronavirus and old lessons -preparing the health system for the pandemic dynamics of the covid- --comparison between the theoretical predictions and real data the covid- pandemic: growth patterns, power-law scaling, and saturation search for the trend of covid- infection following farr's law, idea model and power law strong correlations between power-law growth of covid- in four continents and the inefficiency of soft quarantine strategies key: cord- - tsy lt authors: shao, xue-feng; liu, wei; li, yi; chaudhry, hassan rauf; yue, xiao-guang title: multistage implementation framework for smart supply chain management under industry . date: - - journal: technol forecast soc change doi: . /j.techfore. . sha: doc_id: cord_uid: tsy lt the true potential of the industry . , which is a byproduct of the fourth industrial revolution, cannot be actually realized. this is, of course true, until the smart factories in the supply chains get connected to each other, with their systems and the machines linked to a common networking system. the last few years have experienced an increase in the adoption and acceptance of the industry . ′s components. 
however, the next stage of smart factories, which will be the smart supply chains, is still in its period of infancy. moreover, there is a simultaneous need to maintain a focus on the supply chain level implementation of the concept that industry . puts forth. this is important in order to gain the end to end benefits, while also avoiding the organization to organization compatibility issues that may follow later on. when considering this concept, limited research exists on the issues related to the implementation of industry . , at the supply chain level. hence, keeping in mind this lack of literature and research available, on a phenomenon that will define the future of business and industry, this study uses an exploratory approach to capture the implementation of industry . concepts across multiple tiers of the supply chain. based on this research, the study proposes a multistage implementation framework that highlights the organizational enablers such as culture, cross-functional approach, and the continuous improvement activities. furthermore, it also highlights the staged implementation of the advanced tools, starting from the focal organization with the subsequent integration with the partner organizations. one word that transcends most of the consumers, as well as the manufacturing topics being researched these days, is "digitization". with the advent of the covid- virus, that has taken over the world as one of the most devastating pandemics, all major industries, from education to manufacturing, are exploring novel ways to digitize their operations. technology is now being seen as a robust strategic weapon (chavarrìa-barrientos et al., ) that is expected to ensure operational performance and continuity, through process integration (srinivasan and swink, ) , by creating smart factories (rashid and tjahjono, ) . this situation has given a much needed thrust to the adoption and implementation of smart technologies in various aspects of trade, business and organizational management. what started as a concept in germany under the industry . revolution, by using smart ict technologies, is now being acknowledged by all the segments of the society. moreover, it is looking for ways to transform into this new environment in the most seamless and successful manner. industry . , or smart manufacturing, are the terms that are being used for digital transformation, using technologies such as the internet of things (iot), artificial intelligence (ai), cloud computing (cc), machine learning (ml), and data analytics (da), etc. these concepts have been built upon the interconnectedness of the machines and systems that are using the above-mentioned technologies, to self-correct and self-adopt according to environmental needs of time (fatorachian and kazemi ) . another term that is being used is of that of resilient systems that are capable of self-correction. smart manufacturing signifies the working atmosphere, in which employees, machinery, enterprise systems, and devices are linked with other cyberphysical systems, as well as the internetwork (Öberg and graham, ) . the amount of data being generated by the industrial production systems has seen, and is also expected to experience, immense growth. moreover, the increase in the computational power is now leading organizations to make more informed decisions. the use of these technologies can be leveraged in order to integrate the data that is being generated, to make smarter and more calculated decisions. 
this fourth industrial revolution has brought about a new environment for industrial management and smart process management (moeuf et al., ) . the concept of smart manufacturing is evolving from simple digitization and automation of individual machines, to connecting machines using iot technologies and utilizing the data from the connected systems to make decisions on the go. lean manufacturing, that focuses on improving the service that is provided to the customers, and reducing process waste (womack and jones ) is considered as one of the most widely adopted process management systems (tortorella et al., ) . however, implementing organizations have been able to reap benefits only when the internal improvement efforts were linked with the external stake holders, i.e. suppliers and the customers (frohlick and westbrook, ) . similar to the lean implementation across supply chains, digitalization of processes is obligatory for the supply chains, especially if they seek to reap the benefits in true terms (pereira and romero, ) . the extant literature available on the dynamics of industry . focuses more on the application of various technologies such as the iot, ai, ml, and data analytics from a manufacturing standpoint. however, very limited research, particularly about the supply chain interaction of advanced technologies, exists in the literature (müller and voigt, ) . within the supply chain literature that has been written regarding the concept of industry . , or smart manufacturing, the majority of the studies have been focused on the theoretical or conceptual models of implementation. very few studies, however, capture the empirical perspective that pertains to this phenomenon (buyukuzkan and gocer ). since the industry . concept along the supply chain is still at in its stages of infancy, within the supply chain literature, this exploratory study captures the stage-wise implementation of this concept across a multi-tiered supply chain. the remainder of this paper is structured as follows. section represents the literature and the background of this research. additionally, section has been further subdivided into smart manufacturing, components of smart manufacturing, and its application across supply chains. moving on, section discusses the research methodology. whereas, section captures the case data of industry . application, along with a three-tiered supply chain. moving on, section presents the results of the exploratory study, and also proposes a framework of the industry . implementation across supply chains, while the last section looks at the contributions of this study, along with its limitations. when it comes to analyzing the literature that has been written, it has been observed that different terms have been used interchangeably. these include, for instance, data driven or smart manufacturing, industry . technology, advanced manufacturing, factories of the future, and the fourth industrial revolution -all representing similar concepts (buchi, cugno, castagnoli ). that is to say that all these terms are talking about the future of manufacturing, by utilizing the idea of connected and networked technologies, that will generate value for the organizations, as well as the society (roblek et al., ) . the broader concept here refers to the machines that are equipped with data capturing devices. these devices are designed so that they can communicate with other machines and systems, in order to fulfill certain predetermined objectives (tang et al., ) . 
the research carried out during the recent years has seen much emphasis on the advent of smart manufacturing. the integration and interoperability (chen et al., ; lu ), helps bridge the gap, and hence, create a connection between the physical and cyber world. this integration then creates the connections between the external entities. in this context, enterprise integration can take place at multiple levels, i.e., physical, application and business level. physical integration refers to the connection of physical devices and machines, whereas, application integration refers to the connectivity or integration of the software or database systems. business integration involves the coordination among the business processes, which helps every aspect of the business to "gel in" together, so as to ensure the smooth transition of the work procedures (chen et al., ) . also, interoperability creates the connections in the systems, in order to exchange knowledge and skills. it is defined as the ability of two separate systems to understand each other, and operate in other environments (chen et al., ) . that is to say that, in terms of the machines, it is considered to be the ability to access another machine's resources. whereas, in a networked environment, it is the interaction between the various systems of the enterprise (chen et al., ) . the initial research streams focused more on the component level application of this concept, which merely focuses on the individual technologies, such as the cloud services, big data, etc., (which will be discussed in the next section which refers to the enabling technologies). whereas, some of the more recent papers have explored the integrative aspects of these technologies, such as the design, planning, manufacturing, human resource management, etc. (osterrieder et al., ) . the purpose of smart manufacturing is to utilize the data from the product lifecycle, into the intelligence systems, which improve the positive aspects of all the manufacturing processes (o'donovan et al., ) . frank et al. ( ) dissected and classified the concept of industry . into two main components. these included the front-end and the base technologies. the front-end technology dimension includes smart working, smart products, smart supply chain, and smart manufacturing initiatives. whereas, the base technology elements include cloud services, the internet of things, big data, and analytics. following in the same context, wang et al. ( ) hypothesized that there are layers of activities taking place in the manufacturing environments. the physical layer consists of the shop floor, machines, and all the tangible activities taking place there. whereas, the data layer consists of transferring the data captured from the physical layer, i.e., by using the sensors and other technologies, on to the cloud environment. the intelligence layer consists of the software and tools (analytical, prescriptive, etc.) that are needed to carry out the analysis, while the final layer comprises of the control layer forming the human supervision. according to tao et al. ( ) , the data driven smart manufacturing comprises of four modules. these include the real-time monitoring, problem processing data driver, and the manufacturing module. while the big data enables the companies to become more competitive, through intelligent tracking systems, and material assessment, energy efficiency management, and predictive maintenance. 
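the monitoring and problem-processing modules described above can be illustrated with a small stream-monitoring sketch: a rolling check on simulated machine data raises a predictive-maintenance flag when the recent signal drifts well outside its historical band. the signal, the drift and the three-sigma rule are assumptions for illustration, not a prescribed implementation.

```python
# toy monitoring / predictive-maintenance check on streaming machine data (synthetic signal)
import numpy as np

rng = np.random.default_rng(11)
baseline = rng.normal(1.0, 0.05, 500)                           # healthy vibration amplitude (arbitrary units)
drift = rng.normal(1.0, 0.05, 100) + np.linspace(0, 0.6, 100)   # developing fault (assumed)
stream = np.concatenate([baseline, drift])

mu, sigma = baseline.mean(), baseline.std()
window = 20
for i in range(window, stream.size):
    recent = stream[i - window:i].mean()
    if recent > mu + 3 * sigma:                                 # monitoring rule: 3-sigma drift of the window mean
        print(f"predictive-maintenance flag at sample {i}: "
              f"window mean {recent:.2f} vs baseline {mu:.2f}")
        break
```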
smart factories have the ability to make timely decisions with less human involvement, managed via artificial intelligence (wuest et al.). smart manufacturing can perform tasks such as all-round monitoring, optimization of manufacturing activities, and simulation through big data (tao et al.). moreover, smart manufacturing uses the data extracted from business processes to refine them further, improving process efficiencies and product performance. the first stage of this process is the collection of data from the manufacturing environment: data on the inputs, i.e. raw material characteristics, data on the manufacturing variables, data on machine and human variables, and data on the output characteristics. the next step is to analyze the data stored in cloud-based data centers; this forms the core action point for subsequent activities such as monitoring and problem solving. the monitoring stage acts as a quality controller, with any change in process parameters triggering process readjustment, while the last stage, problem processing, uses the data to predict emerging problems and to suggest possible solutions (tao et al.). büchi et al. point to increased production flexibility, improved performance, fewer errors, higher efficiencies, and reduced set-up times due to smart manufacturing initiatives, and büchi and castagnoli report increased efficiency and greater production capacity as a result of smart adoption. tao et al. further highlight the characteristics of data-driven smart manufacturing, which include product development, self-organization (production planning), smart execution (raw material movement and processing), self-regulation, and self-learning within a system. product development, or design, uses rich consumer data (behavioral data, user-product interaction data) to identify key product features and requirements; manufacturing planning can be enhanced by using data for optimal resource allocation and network optimization; and machine data can be used to predict probable equipment faults, along with diagnoses that eventually lead to proactive maintenance (tao et al.). van lopik et al. found that augmented reality capabilities enable end-users to minimize disruption to the workflow of a particular shop floor. oliff and liu showed how data mining techniques can improve production operations, in terms of quality, in small manufacturing organizations. the internet of things also helps to reduce cost and improve quality, efficiency, and predictive maintenance services (aheleroff et al.). many researchers have explained the phenomenon of smart manufacturing, or industry 4.0
technologies, in terms of augmented and virtual reality (wu et al.; rüßmann et al.; kolberg and zühlke), additive manufacturing (huang et al.; chan et al.), the internet of things (wu et al.), big data analytics (de mauro et al.; addo-tenkorang and helo; lenz et al.), and cyber-physical systems (monostori; lee et al.; zhong and nof). wu et al., citing azuma, explain augmented reality as offering real-time interaction and three-dimensional depiction of the objects in question. additive manufacturing is a production technique in which products are manufactured layer by layer, using digital data and special polymeric materials (wu et al.). the internet of things (iot) is a technology that lets autonomous objects and devices be sensed or controlled remotely (wu et al.; ketzenberg and metters). big data refers to the enhanced decision-making capability that comes from collecting and analyzing large data sets (astill et al.). cloud computing is defined as the ability to access data storage and computing resources via the internet, with the resources maintained by a third party (zhong and nof). in the same context, akdil et al. highlighted the key technologies in industry 4.0: (a) adaptive robotics, (b) data analytics and artificial intelligence, (c) simulation, (d) embedded systems, (e) communication and networking, (f) cybersecurity, (g) cloud, (h) additive manufacturing, (i) virtualization technologies, (j) sensors and actuators, (k) rfid and rtls technologies, and (l) mobile technologies. chiarello et al. explained the fourth industrial revolution in terms of enabling technologies, including 3d printing, augmented reality, additive manufacturing and virtual reality, digital transactions, big data, computing, programming languages, the internet of things, protocols and architecture, communication networks and infrastructures, and production and identification, including their constituent technologies, and also described them in terms of the common knowledge about industry 4.0 technologies. osterrieder et al. highlighted eight key perspectives, primarily including cyber-physical systems, it infrastructure, human-machine interaction, cloud manufacturing and services, decision making, and data handling, as vital elements of smart manufacturing. according to pacchini et al., eight enablers, namely artificial intelligence, additive manufacturing, the internet of things, cyber-physical systems, cloud computing, big data, augmented reality, and collaborative robots, have been empirically tested in the auto manufacturing sector in brazil to accelerate the adoption of industry 4.0. kusiak described the characteristics of smart manufacturing in depth, including prediction technology, agent technology, data storage technology, cloud computing technology, automation and process technology, and digitization technology. lee et al. defined cyber-physical systems (cps) as a collection of technologies that together form a system connecting and managing physical assets with a computational system. moreover, lee et al.
further explained that a cyber-physical system consists of two major processes: (i) the acquisition of real-time data through advanced connectivity, with information feedback between the cyber space and the physical world, and (ii) smart data management, data analytics, and computing capabilities that build and nurture the cyber space. they also proposed a cps structure with five levels, namely smart connection, data-to-information conversion, cyber, cognition, and configuration. the smart connection level provides plug-and-play sensing and a tether-free communication mechanism. the data-to-information conversion level provides the algorithms required to convert data into information, helping machines achieve a degree of self-awareness that accounts for degradation, performance prediction, and multi-dimensional data correlation. the cyber level provides a central hub for the information received, along with analytical capability for assessing the status of the other relevant components. the cognition level provides integrated simulation and synthesis, collaborative diagnostics and decision making, as well as remote visualization capabilities for humans. finally, the configuration level provides self-configuration capabilities for resilience, self-adjustment, and self-optimization of the model. in a related application, astill et al. describe precision poultry farming, a procedure that uses sensors to capture data from various farm operations and big data tools to make data-driven decisions, with farm equipment connected and networked via iot technology, leading to the needed automation and optimization of farm operations. in recent years, studies have also focused on capturing the issues and barriers related to the adoption of smart manufacturing technologies. raj et al. conducted an extensive review of the literature targeted at the adoption barriers that industry 4.0 technologies frequently face, and listed high investment, lack of clarity on the economic advantage, issues in supply chain integration, risk of security breaches, technology risk, job threats, lack of standards, lack of skills, and resistance to change as common barriers (kiel et al.). raj et al. also argued that national policies regarding technological infrastructure must be designed with the support of government regulations in both developing and developed countries. furthermore, iyer argued that governments should develop and implement a customized framework for industry 4.0, especially with respect to employment opportunities and growth. backhaus and nadarajah explained the concept of industry 4.0 technologies by ranking them according to their impact, while also suggesting that industry 4.0
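purely as an illustration of the five-level cps structure just described, the snippet below encodes the levels as an ordered list and adds a hypothetical helper that reports how far a given implementation has progressed. the helper, its inputs, and the short descriptions are a sketch and are not part of the cited cps architecture.

```javascript
// the five cps levels summarized above, in order; descriptions paraphrase the text
const cpsLevels = [
  { name: 'smart connection', provides: 'plug-and-play sensing, tether-free communication' },
  { name: 'data-to-information conversion', provides: 'algorithms for self-awareness and prediction' },
  { name: 'cyber', provides: 'central information hub, comparative analytics' },
  { name: 'cognition', provides: 'simulation, collaborative diagnostics, remote visualization' },
  { name: 'configuration', provides: 'self-configuration, self-adjustment, self-optimization' },
];

// hypothetical helper: the highest contiguous level an implementation has reached
function highestLevelReached(implementedNames) {
  let reached = null;
  for (const level of cpsLevels) {
    if (implementedNames.includes(level.name)) reached = level.name;
    else break;
  }
  return reached ?? 'none';
}

console.log(highestLevelReached(['smart connection', 'data-to-information conversion']));
// -> "data-to-information conversion"
```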
technologies should first be implemented in small-scale pilot projects that capture the organizational needs and, in turn, improve competitiveness. ebrahimi et al. established the concept of five pillars, namely cost deployment, workplace organization, logistics and customer service, professional maintenance, and quality control, which have the potential to reduce costs and the rate of loss, together with three principles, real-time capability, decentralization, and virtualization, which will eventually help overcome the barriers to revolutionizing and skewing the national economy towards industry 4.0. according to kadir et al., there is now a need for empirical research based on a human factors and ergonomics approach in order to fully understand the opportunities and challenges of industry 4.0 at the tactical, operational, and strategic levels. smart manufacturing describes interconnected devices within a cyber-physical system that together reach a self-evolving environment able to manage variations and suggest optimal alternative routes. however, given the critical role of the multiple entities that individually shape the journey from raw materials to end consumers, it is essential to realize that if even one entity does not effectively adopt smart manufacturing concepts, the efforts of the remaining members will not lead to a global optimum. for the supply chain to operate as a holistic entity, the smaller individual entities must work as interconnected platforms, just like the interconnected physical assets within a smart environment: a change in one variable for a single entity must trigger a collective counter-response from the other interconnected entities (nasiri et al.). many scholars have highlighted the significance of the digital supply chain (addo-tenkorang and helo; scuotto et al.; crittenden et al.; riemer and schellhammer), and the previous literature draws attention to the digital supply chain in the industrial sector (büyüközkan and göcer). smart technology is the extent to which physical devices or processes are connected with digital platforms, and investment in smart technologies can exponentially improve the internal and external performance of a company when such technologies are incorporated within the current supply chain (nasiri et al.). the figure below illustrates the major differences between traditional and digital supply chains. grieco et al. conducted a case study in the italian fashion industry and found that a decision support system helps users make better decisions about organizational activities across the supply chain. according to ghadimi et al., recent research on the fourth industrial revolution considers several supply chain processes, such as supplier selection, by applying multi-agent technology. in the same context, weking et al. developed a business model with three patterns, integration, servitization, and expertization, for leveraging industry 4.0, and showed that integration modernizes an existing business model with new procedures while simultaneously integrating parts of the supply chain.
ralston and blackhurst found that smart systems may provide improved supply chain resilience, owing to new skills enhancement and capability development. the application of industry 4.0 enabling technologies tends to enhance overall supply chain performance, especially in procurement, manufacturing, inventory management, and trading, while promoting information sharing, automation, digitization, and transparency across the supply network (fatorachian and kazemi). ghadge et al. emphasized the need to integrate digital businesses with digital supply chains through the incorporation and adoption of (i) a digital culture, (ii) new digital business models, (iii) optimized data management, (iv) connected processes and devices, (v) integrated performance management, (vi) synchronized planning and inventory management, (vii) supply chain transparency, (viii) integrated value chains, (ix) connected customers and channels, and (x) collaboration and data sharing, for fruitful adoption of the industry 4.0 concept. preindl et al. demonstrated that digital transformation and industry 4.0 can accomplish a fully digital supply chain through higher transparency and centralization of processes; however, this may not be achieved if firms do not have appropriate information-sharing standards, and it is noteworthy that decision making is closely associated with information exchange across the supply chain for better efficiency and effectiveness of processes. ding's investigation revealed that the innovations and technologies of the fourth industrial revolution allow autonomous decision-making actions for the entire supply chain. machado et al. identified that the new technologies allow industry 4.0 to have a positive impact on sustainable supply chains and on all sustainability-related dimensions (e.g., a sustainable circular production system) in an integrated manner. in this regard, escalating information exchange and synchronizing operations among supply chain partners allow for agility, efficiency, and total cost reduction throughout the entire supply network (ghobakhloo and fathi). the model suggested by ghadge et al. found that cloud technology and rfid enhanced operational efficiencies through a reduction in inventory levels and costs, made possible by increased visibility through data sharing among supply chain members. in their study, de sousa jabbour et al. argued that choosing a resolution method, identifying the appropriate industry 4.0 technologies, embracing sustainable operations management decisions, creating collaboration in the supply chain, and establishing performance enablers for small attainable targets are still challenging issues that need to be tackled. müller and voigt argued that industry 4.0 is primarily concentrated on production, and that the integration of supply chain management in the context of industry 4.0 is still scarce in contemporary research. additionally, manavalan and jayakrishna observed that research on the supply chain for the fourth industrial revolution is still in its initial stages.
traditional supply chains must shift rapidly to adopt industry 4.0 technologies and principles effectively and efficiently in order to remain competitive in ever-changing and evolving markets, while organizations constantly look for ways to adapt to these new technologies (ghadge et al.); büyüközkan and göcer likewise point to the limited empirical work on digital supply chains. thus, for the reasons cited above, we have focused on how to integrate a cyber-physical system with a digital supply chain, so as to assimilate the processes for better product quality as well as system reliability. given the lack of systematic studies on the implementation of the industry 4.0 concept across the supply chain, an exploratory case study of a packaging supply chain in pakistan was chosen for this research. according to eisenhardt, qualitative data provides an understanding of the underlying dynamics of a phenomenon, and case study research helps in obtaining rich data to explore management issues in the field (yin; eisenhardt and graebner). it also helps to capture emergent theories by recognizing the pattern of associations among the relevant constructs (eisenhardt and graebner). moreover, inter-organizational liaisons are well studied in ways that produce qualitative data and permit interpretive and explorative analysis (maanen), and explorative studies concentrate on new subject matter to shed light on the research conducted (brown and brown). the purpose of this study is to use explorative and qualitative research to reveal potential new approaches (zikmund et al.) to the integration of the supply chain and smart technologies, providing further insight into the impact of such initiatives on smart organizations. to explore the emerging phenomenon of industry 4.0 implementation across the supply chain, a single case design was adopted to unearth the dynamics that emerged during the course of the implementation and to describe the evolution of the firm and the phenomenon (siggelkow). the single case study method is particularly useful when the objective is to model the process that is adopted (leonard-barton and deschamps). the organization for this study was selected using theoretical sampling, as it provided the opportunity to capture the evolution of industry 4.0 implementation across a supply chain that included the focal firm along with its supplier and a downstream customer (eisenhardt; siggelkow). a protocol was developed to guide the interviews for this exploratory study (eisenhardt). semi-structured interviews were then conducted at multiple levels with the team members directly involved in the implementation process. the interviews were conducted in person at the plants of the collaborating organizations, transcribed, and shared with the team for further feedback, and then triangulated with data on the projects carried out during the various implementation phases. based on the analysis of the case data, an implementation framework was proposed and shared with the implementing managers and with industry 4.0 researchers in order to validate the findings.
the incorporation of supply chain integration (digitization) for the purpose of problem solving is captured through an example from a multinational corporation's pakistan-based factory (referred to as the focal firm from here on). the focal firm operates in the packaging business, serving clients mainly in the fmcg sector. it provides packaging material for liquid products, which are then filled and packaged at the focal firm's customer locations. the downstream customer in this case is a local fmcg company with multiple divisions, and this particular case deals with a tea whitening product. the upstream partner is a board factory, which provides the raw material for the focal firm. both the supplier's and the customer's factories are located within a one-hour driving distance of the focal firm. the problem was that dents were observed in some randomly selected packages (the final product) after the filling and packaging stage. the objective was to find the cause of the dents in order to resolve the problem in real time, whether at the customer's filling machines, the focal firm's process, or the board supplier's process stage. under the pre-implementation scenario, whenever an issue or customer complaint was raised by the customer, it usually came to the focal firm; the concerned team would then identify the problem at their end by talking to the production team, or talk to the supplier directly. this process took at least two days to resolve any issue, resulting in a loss of two business days of production. the objective of the supply chain digitization project was to integrate the processes so that the relevant data would be available in real time, allowing issues to be resolved more efficiently and corrective measures to be taken in a timely manner. in addition, the focal firm wanted to reduce the issue resolution time and thereby improve its customer service score. one reason cited by the team for not having taken up this initiative earlier was that the dents were not frequent and the end consumers were not overly demanding; however, they could foresee a change in this attitude with increasing competition in the industry, so they took it upon themselves to be proactive and solve the issue before it resulted in a loss of customers. when they started, the team had no idea why the dents occurred in the final product, where they originated, or what caused them. multiple variables could have produced the dents at the end of the final process: they could have been generated by processes at the supplier's end, at the focal firm's end, or at the customer's filling machine, and there could also have been raw material issues, or a material issue in the focal firm's inputs that created further problems at the focal firm's processing stage. a cross-functional team was therefore formed around this problem to investigate the probable causes of dent formation at the final production stage, consisting of members from the various sub-divisions, such as the production, procurement, maintenance, it, and sales teams.
this was a quick decision for the focal firm because of its experience of working in cross-functional teams and its kaizen experience during its world class manufacturing (wcm) journey. when the project was initiated, data was only being captured at the focal firm's processes, and the idea was to collect data at three different legs (the supplier, the focal firm's process, and the customer's end) to see whether there were any product discrepancies that could lead to problems at the last stage, so that these could be rectified in real time. the team started by discussing the different variables that could be affecting the quality of the final product, beginning with the supplier's end. the supplier was a paper and board mill with a long history of a healthy working relationship with the focal firm. the first step of the investigation was to include members of the supplier's team in the problem-solving team. the paper rolls were delivered with a quality assurance certificate, usually called the certificate of analysis, which contained generic-level parameters of the incoming roll. the team worked for a little more than a year to identify, capture, and analyze the data from the supplier's process. the supplier being in the vicinity helped the team's work, as they were able to plan frequent visits to the supplier's premises to take a first-hand look at the identified areas. when the team initially approached the supplier, they met the expected initial resistance: the supplier responded that they had been sharing the quality certificate with every roll and could provide the data as evidence. the team, however, shared their own lack of understanding of the causes, explained that this was a proactive problem-solving initiative, and said that they wanted to understand the material usage and the process in order to explore the variables that could be involved. because of the long working history with the supplier, they were able to bring the supplier on board, and the supplier's own internal culture of continuous improvement in quality control ensured active participation and cooperation. the team then began its investigation by observing the process at the supplier's end. the process consisted of three steps: board making, coating, and finally cutting of the rolls. the mother roll was first cut into four rolls of equal length, and these rolls were then further cut into smaller rolls of a fixed width. the team kicked off by looking at all the available data points that might be connected. their initial research revealed that variables such as moisture, thickness, and grammage of the rolls were being captured by a sensor and displayed on a monitor; however, the supplier's team was aware of neither the data points nor where the data was stored, so it could not be retrieved for further analysis. all they were using was a combined report for the mother roll, which gave average values that were never found to be out of tolerance.
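the averaged mother-roll report explains why nothing ever looked out of tolerance even when individual points drifted. the toy numbers, target, and tolerance in the sketch below are invented purely to illustrate that effect; they are not the supplier's actual values.

```javascript
// toy illustration: a roll-level average can sit inside tolerance while individual
// readings along the roll fall outside it; values, target, and tolerance are invented
const thicknessMm = [0.41, 0.40, 0.42, 0.47, 0.34, 0.41, 0.40];
const target = 0.41, tolerance = 0.03;

const mean = thicknessMm.reduce((a, b) => a + b, 0) / thicknessMm.length;
console.log('averaged report in spec:', Math.abs(mean - target) <= tolerance);   // true

const excursions = thicknessMm.filter(v => Math.abs(v - target) > tolerance).length;
console.log('individual points out of tolerance:', excursions);                  // 2, hidden by the average
```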
moreover, the supplier's technical staff had never considered exploring the process further, since the output met the customers' required specs and had never resulted in any significant customer complaints. the team reasoned that if the data was being displayed, it must also be stored somewhere; surprisingly, however, even the it team could not locate the data storage. they then contacted the sensor manufacturer, a well-known brand, but that company did not offer any services in pakistan. they also contacted the machine manufacturer, but could not get much information from them either. the team then contacted another international company that offered sensors and allied products of this category, but that company's solution was priced at five hundred thousand euros; at this stage of the project, with no information on which variables might be causing the later defects, this price was not deemed appropriate and the solution was dropped from consideration. with no input from the sensor manufacturer in sight, the team started to look for improvised solutions. they discussed options as simple as stationing a person in front of the screen to capture the data manually and, at the other extreme, capturing images of the screen with a camera and using artificial intelligence techniques to recover the data that was being recorded but lost. before going further, the team was expanded and an ai researcher from a leading university in the city was added. the amended team, with enhanced it capabilities, was then able to extract a text file from the sensor data being captured. the next step was to map the points in the data file. the team found that the sensor moved along the width of the machine while the board moved beneath it, so the sensor captured data in diagonal traces across the paper board. the team then had to develop algorithms to map these points onto the paper board. this entire effort of engaging the supplier, team formation, data exploration, the addition of academic researchers, data capturing, and algorithm development to map the data took almost a year. equipped with the data from the supplier, the team moved on to study the product at the focal firm. the next step was to link the data coming from the supplier with the machine running at the focal firm. one step that could have affected the properties of the paper board was the production of crease marks on the board, which facilitate package formation at the filling machines at the customer's end. these creases were formed by passing the board between a male plate and a female plate, and the resulting creases had certain properties, in terms of crease height, width, and gradient, that were collectively called the crease profile. the probable impact of the crease profile had never been explored by anyone working in these companies, and there was no mechanism available to measure it. keeping these limitations in mind, the team started exploring multiple options to arrive at a plausible explanation for the damaged paper rolls. one idea was to develop a crease profiler.
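the diagonal traces arise because the sensor head traverses the width of the machine while the web advances underneath. the sketch below shows one way such samples could be mapped back onto board coordinates, assuming constant traverse and line speeds; the function, field names, and numbers are hypothetical and are not the team's actual algorithm.

```javascript
// sketch: map scanner samples onto (cross-direction, machine-direction) coordinates,
// assuming constant traverse speed and line speed; all parameters are illustrative
function mapSamplesToBoard({ samples, webWidthM, traverseSpeedMps, lineSpeedMps, sampleHz }) {
  const points = [];
  const dt = 1 / sampleHz;
  let cross = 0;   // sensor position across the web (m)
  let dir = 1;     // traverse direction, flips at each edge
  let machine = 0; // distance the web has advanced (m)

  for (const value of samples) {
    points.push({ crossM: cross, machineM: machine, value });
    cross += dir * traverseSpeedMps * dt;
    if (cross <= 0 || cross >= webWidthM) dir = -dir;   // reverse at the edges
    cross = Math.min(Math.max(cross, 0), webWidthM);
    machine += lineSpeedMps * dt;                       // the web keeps moving, so samples trace a zig-zag
  }
  return points;
}

// example with made-up numbers: 10 samples per second over a 3 m wide web
const grid = mapSamplesToBoard({
  samples: [0.41, 0.40, 0.42, 0.39, 0.41],
  webWidthM: 3, traverseSpeedMps: 0.5, lineSpeedMps: 2, sampleHz: 10,
});
console.log(grid[0], grid[4]);
```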
therefore, with assistance from their academic network, the team developed a crease profiler equipped with a laser sensor. the sensor moved along the packaging material where the creases were present and measured the crease profile; it read one crease and, based on that, summarized the other creases before moving on to the next pattern (collectively, three creases meeting at a right angle formed the corner on which the bending was being studied). this activity revealed that significant variations were present in the crease profiles. the measured crease patterns for every roll were then mapped onto the roll data. the problem, again, was the dents in the final product after filling and packing, and the objective was to find their cause in order to fix the problem in real time at the customer's filling machines, the focal firm's process, or the board supplier's process stage. having mapped the data from the supplier and the focal firm, the next stage was to capture data from the customer's filling machines and link it with the data coming from the first two entities in the value chain. at the customer's end, no mechanism existed to monitor dent formation in the packages as it occurred; the machine ran at an output speed of thousands of packs per hour, and dents were identified only through visual monitoring after a batch had been manufactured. the team therefore sat down to deliberate the available options for capturing the defects first and then linking them with the available data. they decided to install high-speed cameras at the point where the packet exits the filling machine and to make use of artificial intelligence to capture the dents. as a first step, the different visual profiles of dents on the packages were captured and fed into the system, and a machine learning module was developed that gave a positive signal for dents similar to those fed to the database. the camera installed at the customer's machine even caught certain dents that could not be detected by the human eye. once a dent was identified, the next step was to work backwards and plot the dents against the available data on the paper board properties and the crease profile. the idea was to look for any correlation with the crease profile, most specifically the crease height (which the team found to be the most significant factor), and to see whether there were variations in the crease height around that point. every roll had a unique id, which was scanned before the roll was mounted on the filling machine; in the case of a dent, that particular point was correlated with the existing data from the board mill as well as the focal firm. during the first stage of the implementation, the camera captured only the top of the packet; the next planned stage was to install a camera under the filling machine to capture dents at the bottom. this required more collaboration with the customer, as installing a camera beneath the filling machine meant modifying the machine at the customer's end, a risky initiative.
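the backward-mapping step described above can be pictured as a simple lookup: given a dent's roll id and position, pull the crease-height measurements recorded near that position and flag unusually large deviations. the function below is a hypothetical sketch of that idea; the field names, window size, and threshold are invented, not the team's actual parameters.

```javascript
// sketch: correlate a detected dent with nearby crease-height measurements on the same roll;
// windowM and maxDeviationMm are illustrative thresholds
function explainDent(dent, creaseProfiles, windowM = 0.5, maxDeviationMm = 0.05) {
  const nearby = creaseProfiles.filter(p =>
    p.rollId === dent.rollId && Math.abs(p.machineM - dent.machineM) <= windowM);
  if (nearby.length === 0) return { verdict: 'no crease data near this position' };

  const heights = nearby.map(p => p.creaseHeightMm);
  const mean = heights.reduce((a, b) => a + b, 0) / heights.length;
  const worstDeviation = Math.max(...heights.map(h => Math.abs(h - mean)));
  return {
    meanCreaseHeightMm: mean,
    worstDeviationMm: worstDeviation,
    verdict: worstDeviation > maxDeviationMm ? 'crease-height variation suspected' : 'crease height looks normal',
  };
}

// example with made-up data: a dent at 120.3 m on roll "R-17"
const profiles = [
  { rollId: 'R-17', machineM: 120.1, creaseHeightMm: 0.30 },
  { rollId: 'R-17', machineM: 120.4, creaseHeightMm: 0.41 },
];
console.log(explainDent({ rollId: 'R-17', machineM: 120.3 }, profiles));
```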
as a subsequent task, the team also planned to connect the system with a mobile application that would be activated on dent identification, linking together the data collected from the three areas: the sensors at the supplier and the focal company, plus the camera installed at the customer's end. recent advances in technology have led to the development of smart factories, or have at least started the journey in that direction. smart factories under industry 4.0 consist of a networked and interconnected system in which information flow is optimized between the physical infrastructure and the cyber space, and with advanced data management and analytics tools the entire system is expected to perform optimally. however, unless the smart factory concept is translated across the supply chain, i.e. into smart supply chains, the benefits of the industry 4.0 concept will not materialize in their true essence. numerous big businesses, as well as small businesses, are shifting towards smart manufacturing, including automotive, electrical, pharmaceutical, and defense organizations (müller). this study has explored a start-to-end journey of using industry 4.0 concepts to link multiple tiers across the supply chain. the interplay between the multiple entities was initiated by a need, a problem that appeared at the downstream end of the value chain. the lead was taken by the firm in the middle of the three entities, which played the role of the dominant partner; its dominance stemmed not only from being a major firm in the category, but also from its knowledge and implementation of the advanced management tools used to find the solutions (wcm implementation). the case data reveals multiple stages in the journey of industry 4.0 adoption across the value chain. these stages have been broken down into the visualization phase, the first-level linkage phase, the connected supply chain phase, and finally the smart supply chain phase. the visualization level may also be termed the trigger phase, which underpins the entire idea of the connected supply chain. the term 'connected' supply chain here does not refer to the term usually used in the supply chain collaboration literature, where mutual trust leads to joint activities such as information sharing, planning, and product development, which in turn lead to higher-level collaboration that enhances efficiencies across the entire value chain (chopra and meindl). the term 'connected' is used here in a more literal sense, but it builds on the earlier concept: the 'connection' in the traditional sense is a prerequisite for the 'connection' in the literal sense. it follows the same lines as pos data sharing, but at a much more advanced and transparent level: the objective of pos data sharing is visibility for joint planning (simchi-levi et al.), whereas in the industry 4.0 concept this connection forms the basis of automatic decision making by a smart system. furthermore, supply chain efficiency is enhanced by the automation of physical planning, information sharing processes, control, and tasks (pereira and romero) and by end-to-end visibility across the value chain (miragliotta et al.). in this case, the three-tier supply chain was orchestrated by the middle entity, termed the focal firm.
this journey rallied around a problem identified at the downstream end. once the problem was identified, the focal firm took the initiative and assumed the position of 'champion' of the new journey. this implies that the visualization has to be championed by a firm that has the resources, the vision, and the capability to lead the initiative (tangpong et al.); here this role was legitimized by the focal firm being a major and dominant player in the category, by the advanced management practices it had adopted, and by its position as the key product owner. the first step started within the walls of the 'champion' firm: a project team was initiated to address the problem and chalk out a relevant plan. the focal firm had already achieved an advanced level of wcm implementation and was no novice to cross-functional kaizen activities; in fact, it championed the cross-functional management approach under the pillared structure of the wcm model (furlan and vinelli). in addition, the focal firm, in its journey towards higher-level implementation of wcm principles, had been working internally on the digitization concept and had an internal roadmap in place in parallel. its human resources were well equipped with basic digitization concepts, and advanced-level training on the role of digitization was already in place. in the first-level linkage phase, once the multi-tier connection roadmap had been visualized, the focal firm connected with its immediate supplier to work collaboratively to explore and study the problem in depth. the supplier initially showed resistance to owning the problem (handfield et al.), despite the fact that the focal firm had a history of working together with the supplier (zhang and cao) and had remained not only a major customer but had also shared management practices, improvement tools, and techniques over the years. the relationship element was essential to bring the two firms together on a joint platform (zhang and cao) where the problem was known but the roadmap was not clear. a team was then set up at the supplier's end as well, to coordinate and work with the focal firm's team, and there were frequent meetings at the supplier's plant to explore the various aspects of the problem (handfield et al.). the supplier also had an existing culture of problem solving and continuous improvement through such teams, so the two teams were soon merged into a single team headed by focal firm personnel. the physical proximity and cultural similarity between the two organizations also helped in coordinating and sharing ideas over an extended period of time (cannon et al.). the team carried on exploring answers to its questions, looking at different solutions and engaging different stakeholders over time; at some point, the team was also expanded by adding academic researchers in the areas of data science and it (gulati et al.). it was after almost a year of effort that the team was able to align the data captured at the supplier's end with the data already existing at the focal firm level. fortunately, the data existed and the challenge was only to extract it; otherwise the solution would have required the further step of installing sensors to capture the required data, which would have been a very tedious process.
in addition, the cloud environment was used to link this data with the existing data. this phase required the initial elements of collaboration, in which the past history, the continuous improvement culture at both organizations, physical proximity, and earlier joint projects all helped the two organizations move towards a solution (handfield et al.). this was augmented at the later stage by additions to the focal team as per project requirements, and by the use of smart manufacturing technologies such as data capturing and the iot to connect the required components of the two organizations and meet the objectives of the phase. equipped with the data from the supplier's end, the team moved on to gather data on the required specs at its own facility. the first step was to capture data to build on the data captured at the supplier's end: the joint team had to set up equipment, using sensors, to capture data on the crease profiles, an activity that also unearthed process issues at the focal firm's level. the next step was to link that data with the data being captured at the supplier's end (barratt and barratt); with help from the data science experts, the team was able to map this data onto the data coming from the supplier (gulati et al.). the next step within this phase was to connect with the third partner in the value chain. the issue at this stage was that they needed to encode the physical appearance of the product, i.e. the dents appearing in the final product, yet no parameter or measuring instrument existed for gauging this attribute. in addition, the output speed of thousands of packs per hour made it humanly impossible to count the defectives in real time; human inspection could only have been instituted after the production process, with the attendant costs and delays. the team therefore had to explore alternative options that could automatically detect the occurrence of dents and then integrate that knowledge further. with input from the data experts, the team installed a camera that used artificial intelligence to detect the dents (da costa, figueroa, and fracarolli; pang et al.). the camera was fed data so that it could decide which dents should be counted and which ignored. once the team was able to capture data on the dents, the next step was to link it with the paper board data coming from the earlier echelons of the supply chain. during this stage, multi-tier collaboration, a common objective (handfield et al.), and the use of advanced tools such as advanced sensing, ai, and data mapping enabled the supply chain partners to link the relevant data to a central networked system (novais et al.). the final stage was to link the data together and, based on defect identification, make the system self-adaptive by taking corrective measures at the relevant stage. this would, however, require continued investment in both the technology and the relationship, as well as investment in iot devices for real-time information exchange among supply chain members (haddud et al.), rfid sensors for real-time automobile tracing (barreto et al.), and centralized manufacturing execution systems through cloud computing (almada-lobo).
it is essential to understand that this must not be seen as the end of the journey of adapting to the smart technology that industry 4.0 aims to introduce into supply chain processes. rather, it was the start of a journey that will hopefully highlight a considerable number of interconnected areas of improvement, but will also require repeated commitment of time and money for continued experimentation, and may require further broadening of the team's profile as it continues to explore the use of even more advanced technologies (ghadimi et al.). the figure captures these stages of collaboration, listing the technological aspects of each stage along with the organizational enablers needed to move to more advanced stages of collaboration. with the advent of the industry 4.0 concept, the increasing adoption of innovative technologies such as the iot, cloud computing, big data analytics, smart sensors, and robotics is being observed across various industries. organizations have started to reap the benefits of deploying advanced technologies in their manufacturing systems in order to improve process efficiencies. however, it is imperative to adopt this concept from the supply chain perspective for two reasons. first, as organizations embark on adopting the smart factory concept in isolation, compatibility issues may arise later when the concept is rolled out across broader supply chains. second, it is commonly emphasized in the supply chain literature that, in order to improve processes and gain efficiencies, an end-to-end approach must be adopted. until now, only a limited number of studies have empirically explored the industry 4.0 concept with regard to supply chains, and the studies that do take a supply chain view fall predominantly into the conceptual category; very little research exists on the actual implementation of the industry 4.0 concept across supply chains. this exploratory study of the journey of an industry 4.0 application across a supply chain has been used to propose a phased framework for supply-chain-wide implementation of industry 4.0. the framework consists of four distinct stages of interaction among the supply chain actors. these stages, while identifying the adoption of advanced technologies, also highlight the organizational enablers essential to an industry 4.0 rollout across the supply chain. moreover, whereas previous studies have presented conceptual models, this framework is based on an actual implementation study. the first stage deals mainly with the groundwork: visualization of the project, team building, and the cross-functional elements. notably, organizations with a history of cross-functional continuous improvement initiatives are better placed to proceed to the next stage with less effort. in the next two stages, where advanced tools such as the iot, ai, and ml are adopted, the existing collaboration between supply chain partners acts as an enabler. finally, the implementation of industry 4.0 concepts across supply chains must not be seen as the end of a project.
rather, it must be approached in the spirit of initiating a new journey of exploration and implementation of ever-advancing technologies. the proposed framework will provide insights to companies looking to take advantage of the industry 4.0 concept across their supply chains. the framework also highlights the importance of the relationship elements among supply chain members in the successful adoption of supply-chain-wide smart initiatives. this study further highlights the significance of expanding the scope of the team by including data experts who can tap into knowledge residing outside the organizations. the framework also provides a base model for researchers to test and expand by capturing different industries and geographic regions. as with other exploratory studies, this study has certain limitations. it relied on one implementation incident of the industry 4.0 concept across a supply chain, and researchers relying on other methods often question the generalizability of qualitative or case-based research; however, the objective of exploratory studies, including this one, is to capture a phenomenon in a particular setting and to generate a set of hypotheses that can later be tested for generalizability (eisenhardt). this study explored the implementation of only a few industry 4.0 tools (sensor technology, iot, ai) and did not provide the opportunity to study the implementation of other technologies, such as additive manufacturing, augmented reality, and autonomous robots; future studies could therefore explore broader applications of these tools across different levels of the supply chain. industry 4.0 is an emerging phenomenon that requires cross-organization coordination and linkages, which provides opportunities to explore the relationship dynamics within the broader phenomenon. in addition, the technology expertise of partnering firms, along with their history of collaboration, also needs to be explored further. david xuefeng shao is a lecturer at the university of sydney. his research interests focus on international risk management and management capability. he was awarded first prize in the anzam pitch competition during his phd candidature and also won the best paper award in the strategic management stream at the anzam conference. his research has appeared in international journals including enterprise information systems, journal of global information management, and international journal of environmental research and public health. dr. shao can be reached at xuefeng.shao@sydney.edu.au. wei liu is a phd candidate in the discipline of international business at the university of sydney business school. his research interests include corporate strategy in the chinese context, international business, technology, and innovation. his research has appeared in emerging markets review, renewable and sustainable energy reviews, enterprise information systems, and others. he has also served as a guest editor for international journals.
references
big data applications in operations/supply-chain management: a literature review
iot-enabled smart appliances under industry 4.0: a case study
maturity and readiness model for industry 4.0 strategy
the industry 4.0 revolution and the future of manufacturing execution systems (mes)
smart poultry management: smart sensors, big data, and internet of things
a survey of augmented reality
investigating the relationship between industry 4.0 and productivity: a conceptual framework for malaysian manufacturing firms
exploring internal and external supply chain linkages: evidence from the field
doing your dissertation in business and management: the reality of researching and writing
economies of scale and network economies in industry 4.0: theoretical analysis and future directions of research
smart factory performance and industry 4.0
digital supply chain: literature review and a proposed framework for future research
building long-term orientation in buyer supplier relationships: the moderating role of culture
data-driven cost estimation for additive manufacturing in cybermanufacturing
architecture for enterprise integration and interoperability: past, present and future
a methodology to create a sensing, smart and sustainable manufacturing enterprise
extracting and mapping industry 4.0 technologies using wikipedia
supply chain management: strategy, planning and operation
the digitalization triumvirate: how incumbents survive
computer vision based detection of external defects on tomatoes using deep learning
what is big data? a consensual definition and a review of key research topics
industry 4.0 and the circular economy: a proposed research agenda and original roadmap for sustainable operations
pharma industry 4.0: literature review and research opportunities in sustainable pharmaceutical supply chains
the evolution of world class manufacturing toward industry 4.0: a case study in the automotive industry
building theories from case study research
theory building from cases: opportunities and challenges
impact of industry 4.0 on supply chain performance
industry 4.0 technologies: implementation patterns in manufacturing companies
arcs of integration: an international study of supply chain strategies
unpacking the coexistence between improvement and innovation in world class manufacturing: a dynamic capability approach
the impact of industry 4.0 implementation on supply chains
intelligent sustainable supplier selection using multi-agent technology: theory and application for industry 4.0 supply chains
corporate survival in the industry 4.0 era: the enabling role of lean-digitized manufacturing
an industry 4.0 case study in fashion manufacturing
meta-organization design: rethinking design in interorganizational and community context
examining potential benefits and challenges associated with the internet of things integration in supply chains
avoid the pitfalls in supplier development
additive manufacturing and its societal impact: a literature review
moving from industry . to industry . : a case study from india on leapfrogging in smart manufacturing
current research and future perspectives on human factors and ergonomics in industry 4.0 (comp. industrial engineering)
adopting operations to new information technology: a failed internet of things application
the influence of the industrial internet of things on business models of established manufacturing companies - a business level perspective
lean automation enabled by industry 4.0 technologies
fundamentals of smart manufacturing: a multi-thread perspective
a cyber-physical systems architecture for industry 4.0-based manufacturing systems
holistic approach to machine tool data analytics
managerial influence in the implementation of new technology
industry 4.0: a survey on technologies, applications and research issues
qualitative studies of organisations
sustainable manufacturing in industry 4.0: an emerging research agenda
a review of internet of things (iot) embedded sustainable supply chain for industry 4.0 requirements
data driven management in industry 4.0: a method to measure data productivity
the industrial management of smes in the era of industry 4.0
cyber-physical production systems: roots, expectations and r&d challenges
business model innovation in small- and medium-sized enterprises
the impact of industry 4.0 on supply chains in engineer-to-order industries - an exploratory case study
managing the digital supply chain: the role of smart technologies
a systematic literature review of cloud computing use in supply chain integration
an industrial big data pipeline for data-driven analytics maintenance applications in large-scale smart manufacturing facilities
how smart cities will change supply chain management: a technical viewpoint
towards industry 4.0 utilizing data-mining techniques: a case study on quality improvement
the smart factory as a key construct of industry 4.0: a systematic literature review
the degree of readiness for the implementation of industry 4.0 (computers in industry)
a review of the meanings and the implications of the industry 4.0 concept
transformation strategies for the supply chain: the impact of industry 4.0 and digital transformation
barriers to the adoption of industry 4.0 technologies in the manufacturing sector: an inter-country comparative perspective
industry 4.0 and resilience in the supply chain: a driver of capability enhancement or capability loss?
achieving manufacturing excellence through the integration of enterprise systems and simulation
collaboration in the digital age: diverse, relevant and challenging. collaboration in the digital age
industry 4.0: the future of productivity and growth in manufacturing industries
a multiple buyer-supplier relationship in the context of smes' digital supply chain management
change in the presence of fit: the rise, the fall, and the renaissance of liz claiborne
evolution toward fit
persuasion with case studies
designing and managing the supply chain: concepts, strategies, and case studies
leveraging supply chain integration through planning comprehensiveness: an organizational information processing theory perspective
using autonomous intelligence to build a smart shop floor
towards a typology of buyer-supplier relationships: a study of the computer industry
data-driven smart manufacturing
lean supply chain management: empirical research on practices, context and performance
developing augmented reality capabilities for industry 4.0 small enterprises: lessons learnt from a content authoring case study
implementing smart factory of industrie 4.0: an outlook
leveraging industry 4.0 - a business model pattern framework
-a business model pattern framework lean thinking: banish waste and create wealth in your corporation a fog computing-based framework for process monitoring and prognosis in cyber-manufacturing current status, opportunities and challenges of augmented reality in education machine learning in manufacturing: advantages, challenges, and applications case study research: design and methods exploring antecedents of supply chain collaboration: effects of culture and interorganizational system appropriation the dynamic lines of collaboration model: collaborative disruption response in cyber-physical systems business research methods (book only) key: cord- -c mv eve authors: christensen, paul a; olsen, randall j; perez, katherine k; cernoch, patricia l; long, s wesley title: real-time communication with health care providers through an online respiratory pathogen laboratory report date: - - journal: open forum infect dis doi: . /ofid/ofy sha: doc_id: cord_uid: c mv eve we implemented a real-time report to distribute respiratory pathogen data for our -hospital system to anyone with an internet connection and a web browser. real-time access to accurate regional laboratory observation data during an epidemic influenza season can guide diagnostic and therapeutic strategies. we implemented a real-time report to distribute respiratory pathogen data for our -hospital system to anyone with an internet connection and a web browser. real-time access to accurate regional laboratory observation data during an epidemic influenza season can guide diagnostic and therapeutic strategies. keywords. analytics; epidemic; epidemiology; influenza; laboratory. the us centers for disease control and prevention (cdc) provides data regarding influenza activity, aggregated from state data sources that generally lag or more weeks behind the date of release [ ] . however, real-time data summarizing regional hospital system observations are more relevant for local clinical decision-making. clinicians frequently request updates from the microbiology laboratory on influenza test positivity, in addition to other common respiratory pathogens, during the respiratory virus season to help inform their daily practice. in addition, clinical laboratories should routinely monitor local influenza data to determine if epidemics are occurring, if continued testing is necessary, or if patients can be treated based on positive symptoms alone [ , ] . to address these local needs in a major us metropolitan area, our clinical microbiology laboratory implemented an online dashboard to distribute respiratory pathogen data for our -hospital system to clinicians, epidemiologists, infection control practitioners, system leadership, and the public. the report provides easy access from any workstation or mobile device with an internet connection. development of this report began in the fall , before the respiratory virus season, during which influenza reached an epidemic status across the united states that resulted in supply shortages, testing difficulties, and a widespread public health crisis [ , ] . respiratory pathogen panel test result data were extracted from our laboratory information system (scc soft computer). the extracts included de-identified laboratory result data, including specimen collection date, facility, and result for all influenza and respiratory pathogen tests. the data were further analyzed and aggregated to produce interactive charts published to a public-facing web server. 
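to make the extract-and-aggregate step concrete, the short sketch below shows one way such a de-identified extract could be rolled up into daily and weekly positive counts. the csv layout and column names (collection_date, facility, pathogen, result) are illustrative assumptions, not the actual scc soft computer export format.

```python
# illustrative sketch only: roll a de-identified result extract up into
# daily and weekly positive counts per pathogen, plus counts per facility.
# assumed columns: collection_date (YYYY-MM-DD), facility, pathogen, result.
import csv
from collections import defaultdict
from datetime import datetime

def aggregate_positives(extract_path):
    daily = defaultdict(int)         # (pathogen, ISO date)      -> positives
    weekly = defaultdict(int)        # (pathogen, ISO year-week) -> positives
    per_facility = defaultdict(int)  # facility                  -> positives
    with open(extract_path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["result"].strip().lower() != "positive":
                continue
            day = datetime.strptime(row["collection_date"], "%Y-%m-%d").date()
            iso_year, iso_week, _ = day.isocalendar()
            daily[(row["pathogen"], day.isoformat())] += 1
            weekly[(row["pathogen"], f"{iso_year}-W{iso_week:02d}")] += 1
            per_facility[row["facility"]] += 1
    return daily, weekly, per_facility
```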
the accuracy of the report was validated by comparison with data generated natively by our laboratory information system, as well as a manual review of all test results from day. we gathered visitor statistics from the server log files. internet protocol addresses were mapped to internet service provider, country, city, and organization using ipinfo.io [ ] . four distinct data summary analyses were performed. first, for our most commonly detected pathogens (influenza a, influenza b, respiratory syncytial virus, and rhinovirus/enterovirus), we calculated the number of positive tests for each day and week. second, we calculated the number of positive tests at each facility. third, we calculated the daily and weekly positivity rates of our respiratory pathogen molecular test. fourth, we calculated the frequency with which each pathogen was identified by our molecular test. these counts reflected anonymized and aggregated data devoid of protected health information. to present users with interactive and dynamic data, we elected to use hypertext markup language (html) [ ] and javascript [ ] as our visualization modality. we used chart.js [ ] as the framework for producing our interactive charts. the data analyses generated javascript arrays that were stored in chart.js data structures to produce charts ( figure for both chart and chart , we created radio buttons that allowed the user to toggle between a weekly or daily summary. for chart , we built a radio button to switch between basic view and detailed view. in basic view, the influenza a molecular and antigen results are grouped together; in detailed view, the subtypes of influenza a detected by our molecular platform are graphed separately. three distinct time intervals were supported, including the most recent weeks, the past year, and -present. the charts were packaged into an html report and uploaded to a public-facing web server. we unveiled the report and requested informal feedback at the system infection control meeting, the system antimicrobial stewardship meeting, and the hospital infection prevention and control committee meeting. we included a link to the report in an e-mail distributed to all employees from the executive vice president of the hospital system. we updated the graphs daily. over the subsequent weeks, the report was accessed times, and over the next weeks, the report was accessed times. approximately % of the originating ip addresses were from within our hospital system, and % were from locations outside the united states. views on mobile devices accounted for % of the traffic, and % of views were referred from the department of pathology and genomic medicine website. views peaked at per hour right after the link was distributed by our executive vice president. during the first week, % of all hour intervals saw at least page view, with an average of views per hour. daily view counts decreased as the influenza season ended and stabilized at views per week on average. at the height of the influenza epidemic at our hospital system, % ( / on december , ) of all influenza antigen tests and % ( / on december , ) of respiratory pathogen molecular tests were positive for influenza a or influenza b. forty-six percent ( / on december , ) of these molecular tests were positive for any respiratory pathogen. vendor supply stocks were limited nationwide [ ] , and in january , our supply of universal transport media diminished, requiring the creation of : aliquots to preserve material for sample collection. 
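as a rough illustration of how pre-computed counts could be packaged for chart.js, which accepts a data object of labels plus datasets, the sketch below serializes daily totals to json for the static report page; the series names, rounding and output file are assumptions rather than the authors' actual code.

```python
# illustrative sketch only: turn per-day totals into a chart.js-style
# {"labels": [...], "datasets": [...]} object and write it out as json for
# the static html report. a weekly payload would be built the same way, and
# the page's radio buttons would simply swap between the two payloads.
import json

def chartjs_payload(positives_by_day, tests_by_day):
    labels = sorted(tests_by_day)
    percent_positive = [
        round(100.0 * positives_by_day.get(day, 0) / tests_by_day[day], 1)
        if tests_by_day[day] else 0.0
        for day in labels
    ]
    return {
        "labels": labels,
        "datasets": [
            {"label": "positive tests",
             "data": [positives_by_day.get(day, 0) for day in labels]},
            {"label": "percent positive", "data": percent_positive},
        ],
    }

def write_payload(payload, out_path="respiratory_daily.json"):
    with open(out_path, "w") as out:
        json.dump(payload, out, indent=2)
```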
based on these data and following cdc/world health organization (who) guidelines for epidemics [ , ] , our primary care group stopped testing for influenza and treated symptomatic patients as if they were influenza positive. our interactive website provided near real-time data, which allowed this decision to be made a week earlier than otherwise would have been possible using federal and state data. furthermore, our inpatient pharmacy was able to anticipate oseltamivir utilization and stock accordingly while remaining prepared for potential drug shortages. brief report • ofid • we developed a near real-time report that presents statistics regarding respiratory pathogen testing from our microbiology laboratory. the population tested includes all inpatient and outpatient individuals across our -hospital system ( operating beds) and all patients in the associated free-standing emergency and primary care clinics. the report is available to any device with an internet connection and is updated daily to provide critical data to clinicians, epidemiologists, infection prevention and control committees, hospital leadership, and the public. we developed the site with mobile devices in mind, which allows the graphs and fonts to be readable on any platform. the user can switch between daily and weekly data aggregations using radio buttons. the time interval of interest can be modified using preconfigured buttons. data can be filtered by clicking the data labels in the chart legends. these features are possible because of our decision to develop a web-based report, as opposed to a pdf, spreadsheet, or word processing document. anecdotal feedback collected at the time of rollout was universally positive. interest in the report quickly peaked after the initial announcement but continued to be viewed daily. as the influenza season ended, infectious diseases clinicians asked that we add a rhinovirus/enterovirus trend to the website so they could track the summer respiratory virus season as well. at least clinician changed ordering practice in early september after identifying a spike in rhinovirus/enterovirus positivity frequency. in november , the chief physician executive and specialty physician group ceo sent an e-mail to physician leaders across the system regarding a significant uptick in respiratory syncytial virus isolates and rising influenza a pathogens detected by laboratory testing, which was identified through our website report. based on the data provided by our laboratory in this report and cdc/who guidelines for epidemics, our primary care group stopped laboratory testing for influenza and treated symptomatic patients as if they were influenza positive. access to accurate real-time data during an epidemic influenza season can guide diagnostic and therapeutic strategies. during the epidemic, there was a nationwide shortage of testing reagents [ ] . earlier identification of high positivity rates by monitoring real-time data allows for an institution to implement "treat don't test" guidelines earlier, preserving test reagents for critical situations where they are most needed. in summary, our microbiology laboratory implemented a near real-time internet report to distribute respiratory pathogen data for our -hospital system to clinicians, hospital epidemiologists, infection control committees, system leadership, and the public. facile access to accurate real-time data during an epidemic influenza season can guide diagnostic and therapeutic strategies. 
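the "treat, don't test" decision described above rests on noticing early that positivity has climbed; a minimal sketch of such a check follows. the seven-day window and the 30% threshold are purely illustrative choices, not the cdc/who criteria.

```python
# illustrative sketch only: flag when the rolling positivity rate crosses a
# locally chosen threshold, as a prompt to review "treat, don't test"
# guidance. window length and threshold below are arbitrary examples.
def high_positivity_alert(daily_rates, window=7, threshold=30.0):
    """daily_rates: chronologically ordered list of (date, percent_positive)."""
    if len(daily_rates) < window:
        return None
    recent = [rate for _, rate in daily_rates[-window:]]
    mean_rate = sum(recent) / window
    if mean_rate >= threshold:
        return (f"rolling {window}-day positivity {mean_rate:.1f}% "
                f"has reached the {threshold:.0f}% review threshold")
    return None
```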
the report is available at https://flu.houstonmethodist.org/. fluview: a weekly influenza surveillance report prepared by the influenza division guide for considering influenza testing when influenza viruses are circulating in the community world health organization. who recommendations on the use of rapid testing for influenza diagnosis centers for disease control and prevention. summary of the - influenza season labs take stock of surprising flu season ip address api and data solutions -geolocation, company, carrier info, type and more world wide web consortium. w c html ecmascript language specification financial support. this work was supported by the department of pathology and genomic medicine at houston methodist hospital. this research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.potential conflicts of interest. all authors: no reported conflicts of interest. all authors have submitted the icmje form for disclosure of potential conflicts of interest. conflicts that the editors consider relevant to the content of the manuscript have been disclosed. key: cord- - f ruku authors: wagner, joseph e.; warriner, william j.; bradfield, sylvia a.; farrar, patricia l.; morehouse, lawrence g. title: a computer based system for collection, storage, retrieval and reporting accession information in a veterinary medical diagnostic laboratory date: - - journal: computers in biology and medicine doi: . / - ( ) - sha: doc_id: cord_uid: f ruku abstract substantial data collected from large numbers of accessions, the need for comprehensive reporting of negative as well as positive laboratory findings, and the necessity for obtaining rapid diagnostic correlations prompted the development of a computer based system of accession data management for collection, storage, rapid retrieval, reporting, concording, and administrative compiling in a state-university veterinary medical diagnostic laboratory. increasing numbers of accessions and a desire to store and rapidly retrieve information on each accession have prompted veterinary hospitals, clinics and diagnostic laboratories to develop computer-based data management systems [ . one such system has been developed at the university of missouri-columbia research animal diagnostic and investigative laboratory (radil) section of the veterinary medical diagnostic laboratory at the university of missouri college of veterinary medicine. annually, a large amount of accession data is generated at the radil. in , for example, there were accessions with a total of , animals examined at the radil. the development of a computer based system of data management has made the storage, compilation, and rapid retrieval of large amounts of diagnostic information possible. this system is capable of handling large volumes of diagnostic data such as results of histopathology, serology, toxicology, virology, parasitology, necropsy, and microbiology examinations as well as demographic-zoographic/patient data. this system uses the full screen capabilities of ibm series computer terminals to display a blank panel (essentially a blank form for recording laboratory results) which is filled in by the laboratory technicians at the major data generating stations throughout the laboratory. preliminary, final, and supplemental reports or select diagnostic data for each accession are printed by the computer directly from this information, and copies of these reports are sent to the person(s) responsible for submitting the accession. 
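purely as a modern illustration of the record shape described above (the original system stored these records in a vsam database on an ibm mainframe, not in python), an accession can be thought of as one block of demographic-zoographic data plus any number of section results; all field names here are assumptions.

```python
# illustrative sketch only: the general shape of an accession record,
# demographic-zoographic data entered once plus any number of panel results
# from the laboratory sections. field names are assumed for illustration.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class PanelResult:
    panel_type: str                   # e.g. "necropsy", "microbiology", "serology"
    entries: Dict[str, str] = field(default_factory=dict)
    free_text: Dict[str, str] = field(default_factory=dict)

@dataclass
class Accession:
    accession_no: str
    species: str
    owner: str
    referring_veterinarian: str
    history: str = ""
    panels: List[PanelResult] = field(default_factory=list)
    reports_mailed: bool = False      # once true, the record can be archived
```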
individual accession records are kept in a vsam database (ibm program product) and archived to magnetic tape every quarter. information from these records is abstracted by the computer for an annual concordance catalogue index and other administrative reports. during the development of our data storage and retrieval system, several design objectives were conceived and followed. these included:
1. the system should be easy to use, and require little training;
2. the system should allow for easy production of summaries and year-end reports, preferably using an existing file processing package such as mark iv (informatics, inc.);
3. the system should allow for easy update of accession records;
4. the system should interface with the existing concordance index program [ ];
5. there should be little or no keypunching or other data entry required to produce an annual or other compiled concordance indexes. data for compiled reports should be acquired passively from existing files originally created to store laboratory data for reporting purposes;
6. preliminary, supplemental and final reports should be produced in english (as opposed to numerical codes), and in letter format suitable for mailing to commercial research animal producers, investigators and/or owners, and referring veterinarians.
data entry. the data entry system described herein requires a large capacity computer. this system runs under the time sharing option of the university of missouri's amdahl v/ running ibm mvs/sp release and jes /nje release . operating systems. data entry and editing are by means of seven ibm model full screen computer terminals and one full screen controller-terminal. these are located in major data-generating areas of the radil, i.e. the accessioning area, the necropsy, microbiology, and serology laboratories. the full screen capabilities of these terminals are used for accessioning and data entry. the computer displays a blank form (panel) on which the crt terminal operator enters appropriate data. to avoid confusion, the panel is displayed in low intensity characters while operator entries are displayed in high intensity characters. during subsequent screen display of data and for updating accessions, the crt displays the panel with such information filled in as is currently present in the accession record. the operator may then change or delete this information by typing "over" it, or may add more information to it, or both. each panel has its own designated "free text" area for continuation of selected fields as well as general comments. a set of panels is in use, with provisions for adding more panels when needed. typically, each panel, except the demographic-zoographic and summary-concordance panels, comprises a report of data from one section of the laboratory. each panel type is shown in this paper with data entered in italics from a fictional accession (no. ). the actual report generated and printed from entered data is also presented for each panel. because all laboratory findings are treated as confidential information, data entries on these panels do not represent actual accession material, but are merely for demonstration purposes. demographic-zoographic panel ( fig. ). when an accession is presented to the radil section of the veterinary medical diagnostic laboratory, demographic and zoographic information is immediately entered by a data controller or data entry operator from information on a form submitted with the accession.
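entries on this demographic-zoographic panel, and on the result panels described in the sections that follow, are largely short code numbers that the computer expands against stored directories (client names and mailing addresses, "canned" english statements for common findings) when a report is printed. the sketch below illustrates that lookup idea only; every code, name and statement in it is invented, not an entry from the laboratory's actual directories.

```python
# illustrative sketch only: expand short directory codes into full text at
# report-printing time. all codes, names and statements here are invented.
CLIENT_DIRECTORY = {
    "101": "Dr. A. Example, College of Veterinary Medicine, Columbia, MO",
    "102": "Example Laboratory Animal Supplier, Kansas City, MO",
}

CANNED_STATEMENTS = {
    "07": "blood collected by jugular incision",
    "12": "no significant gross lesions observed in any organ system examined",
}

def expand(value, directory):
    """return the directory text for a code; leave free-form entries unchanged."""
    return directory.get(value.strip(), value)

# expand("101", CLIENT_DIRECTORY)  -> full mailing address for the report header
# expand("07", CANNED_STATEMENTS)  -> canned statement printed in the report body
```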
this panel ( fig. ) includes such information as the investigator's and/or owner's name and address, the referring veterinarian's name and address, type of specimen submitted (whole animal, slides, fixed tissue for histopathology, swabs for culture, etc.), number of specimens submitted, species, strain, age, sex and name of animal (if applicable), and accession history. owner or investigator, and referrer names may be entered as a -digit code number which the computer "looks up" in a directory, replacing the number with the appropriate name and address on all subsequent crt displays or printed reports. this insures consistency and accuracy in names and addresses. the demographic-zoographic panel ( fig. ) and summary-concordance panel ( fig. ) always occur exactly once per accession, and together with certain control information, form the "base segment" of the accession. in addition to the base segment, for each accession, there are several types of "subordinate segments" which may occur independently of each other any number of times, or not at all. each subordinate segment is represented by a panel type, as described herein. necropsy panel (fig. ) the necropsy results panel (fig. ) includes space for recording results ofprenecropsy and necropsy examinations for one or more animals. figure includes data entries for animals ; adult females and juvenile males (arrow, line ). reports of negative findings and normal necropsy observations, as well as reports of the kinds of techniques used (such as the kind of blood collection method used, arrow, fig. , line ) can be entered by a code number, thus reducing data entry time. through use of a directory, the code number appears as a statement in english when printed on the report. if the data entry operator needs more space than is available on the panel to enter or report a finding, a free text reference can be entered in the appropriate field, eg. the "i a " at the arrow on line in fig. . this allows the operator to continue the report of a finding in a free text area (eg. "i a eyes."in fig. (a) on line ). the computer will, on the final printed report, replace the free text reference with the text from the free text area (arrow on fig. (b) ). organs and tissues can be designated as normal (n) and/or "flagged' as having been sent to either the histopathology (h) or microbiology (m) laboratory subsections, or both (x) (arrows in fig. on lines - ) . in fig. , line , a code number was entered to indicate the method of blood collection used ; " " (arrow) when taken from the directory and printed on the report reads, "jugular incision". an entry of " " would read, "orbital bleeding", and " " would read, "axillary incision". code numbers are also used for several other entries, such as general appearance (line lo), skeletal palpation (line lo), andexternal lesions (line ). through use of a directory these code numbers are replaced by "canned" statements on the final report ( fig. (b) ). if these animals were without significant gross lesions, a code number would have been entered in the field "ngl in any system" (arrow in fig. on line ). depending on which code number was used, the final report would contain a statement telling which organs and tissues had been examined and found without lesions. there are five different panel formats for entering microbiology results. the first is designed to show detailed microbiological culture results from individual organs or sites (fig. ) . shown in fig. 
and (a) are typical culture results for a group of animals in which nasopharynx and middle ear were cultured on mycoplasma agar and on blood agar and incubated in a % carbon dioxide environment. the second format for entering microbiology results shows the results of antimicrobial sensitivity testing (fig. ) . in this accession, an antibiotic sensitivity test was performed on a pseudomonas aeruginosa bacterial isolant from the middle ear of animal a, and its relative sensitivity to antibiotics was determined (figs and (a) ). the third format reports microbiology results in a tabular form (fig. ) . the bacterial isolants commonly cultured from laboratory rodents are listed on the panel and there is room for adding two additional isolants on the same table. in the event that culture methods employed would not detect certain microorganisms, an "x" placed in the space immediately preceding the genus and species of the bacteria causes elimination of that line (organism) from the final report (fig. , lines and ). thus citrobacter freundii and pasteurella pneumotropica do not appear on the final report ( fig. (a) ). negative results are tilled in automatically by the computer, and are overtyped if positive results are found. , ,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,(,,,,,,,,,,,,,,,) pneunotropi ca the fourth microbiology panel type (fig. ) reports group results in a tabular format, but the genus and species of the bacteria must be filled in by the data entry operator (figs , (a) and (b)). a specialized form of microbiological examination panel is provided for reporting positive or negative results of tests on large numbers of fecal specimens cultured for salmonella spp. and pseudomonas spp. (fig. ) . this panel is designed to facilitate tracking of culture results by supplier and colony (building and room). the results shown in figs , (a) and (b) indicate that the first five samples were negative for both pseudomonas and salmonella, the sixth was positive for pseudomonas and negative for salmonella, samples - were ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,',,,,,,,,,,,,,,,,,, fig. (b) . microbiology panel type , final report. negative for both, sample was positive for both, and samples pseudomonas but negative for salmonella. parasitology panel (fig. ) - were positive for the parasitology panel (fig. ) is for recording results of parasitological examinations performed. both "exam method" and "specimen examined" may be entered as rt al. i,,,,,,,,,,,,,,,,i,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ,,,ii ~ i(iiii.iiiiiii:i~iiiiii ( """""' "' """""' ""~~~' ' ,,,,,, ,,,, ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, either a code number or in english (fig. ) . on the printed report the code numbers are translated into "canned" responses ( fig. s(a) ). for example, a "specimen examined" entry of " " (arrow in fig. on line ) refers to the perianal area ( fig. (a) ), and an "exam method" entry of " " (arrow in fig. on line ) indicates microscopic examination by cellophane tape impressions ( fig. (a) ). ;- . wagner lllll, , , lllllll, llll, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , llll~l~ fig. . histopathology panel, type , completed ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,_,,,,,, lllllll,l,l,,,l~ serology and elzsa panels (figs and ) the serology panel (fig. ) includes complete serology results only for those tests commonly performed, i.e. 
hemagglutination inhibition (hi) and complement fixation (cf) tests. in this case, a serological examination of aminals a-j was performed, using complement fixation tests and hemagglutination inhibition tests (figs and (a) ). the elisa (enzyme linked immunosorbent assay) panel is used for reporting results of mouse hepatitis virus (mhv), rat coronavirus (rcv), sendai virus, or mycoplasma testing for serum antibodies (fig. ). an elisa to test for mycoplasma pulmonis antibody was performed on the sera of animals a-j (figs and (a) ). (figs and ) there are two formats for entry of histopathology results. the first of these results in a tabular report (fig. ) is entered in complete sentences or narrative form on this panel, and printed out exactly as entered. this format is useful for lengthy descriptions of lesions. summary-concordance panel (fig. ) the summary of findings panel (figs , (a) and (b)) includes a brief summary of laboratory results from other panels, along with a summary diagnosis and diagnostic information for concordance indexing. in the upper right hand corner (fig. ) is the "archive?" field. as long as this field is left blank all panels with this accession number remain on line in the master file. this field is left blank until the accession is completely processed and all reports have been mailed. when the accession has been fully reported and all reports are mailed, this field is marked with an "x". quarterly, accessions marked with an "x" in the "archive?" field are removed from the master file and placed in a magnetic tape file. about technical and professional personnel enter data into the radil system described herein. after data has been captured through entry on the various panels, individual accession reports are generated interactively using a la decwriter ii computer terminal located in the laboratory but connected to the host computer via low-noise phone lines and baud modems. one to three or four part carbonless paper is used in this terminal to provide printed copies of reports ; one for the referring veterinarian, one for the owner or investigator, and one for the radil files. reports can be designated either preliminary, final, or supplemental, and may optionally include only designated panel types. alternately, accessions can be printed centrally on campus on an ibm printing subsystem for twice daily delivery to our laboratory and subsequent mailing or distribution. regardless of where accessions are printed, all mailing or distribution of reports is done by a data controller who controls the flow of accession material and subsequently generated data. at any time this key individual knows of the status of any accession that has entered the laboratory. many telephone inquiries can be answered by the on-line data controller. accessions are easily "tracked" through the laboratory. a special "culprits" program flags any accession that has been in the laboratory over two weeks. the advantages of the system described herein are many. clean typed reports are issued. it is convenient and easy to provide complete reporting of diagnostic tests performed, i.e. positive as well as negative findings are reported. multiple copies are generated without the need for preprinted forms. errors are corrected electronically by data entry operators. computer stored and accessed directories of complete mailing addresses of referring veterinarians and selected clients are entered by three digit code, thus speeding up data entry and improving accuracy and completeness of addresses. 
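the "culprits" check mentioned above can be pictured as a simple sweep over the on-line master file; the record fields and the handling of the two-week cut-off below are assumptions for illustration only.

```python
# illustrative sketch only: list accessions still unreported after more than
# two weeks so the data controller can follow up ("culprits"). record fields
# (received_date, final_report_mailed) are assumed names.
from datetime import date, timedelta

def culprits(master_file, today=None, max_age=timedelta(days=14)):
    """master_file: dict mapping accession number -> record dict."""
    today = today or date.today()
    overdue = []
    for accession_no, record in master_file.items():
        age = today - record["received_date"]   # received_date: datetime.date
        if not record.get("final_report_mailed") and age > max_age:
            overdue.append((accession_no, age.days))
    return sorted(overdue, key=lambda item: item[1], reverse=True)
```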
use of window envelopes eliminates the need for addressing envelopes. there is no typing of reports per se, rather, data is entered from the laboratory by laboratory technicians or a data entry operator. accession history and demographic-zoographic data is acquired for concordance indexing without the need for reentry. all data is held in a form accessible through mark iv and specialized programs can be prepared for summarized reporting using an optional "display" mode. additional panels can be created if needed, thus, allowing for future expansion. availability of comprehensive user's manuals and interdigitated standard operating procedures of laboratory procedures greatly facilitate training of new employees. there are also several disadvantages to this system. it requires a high capacity computer running under ibm's mvs operating system, and ibm's series terminals. these are expensive, necessitating sharing with other users. this can produce prolonged response times during periods of maximal usage. development of such a system as this requires the services of a computer programmer as well as considerable time. data processing and storage costs are quite high. nearly all laboratory technicians must be trained to enter data into the system. consistency in terminology is necessary. users' manuals must be prepared and updated periodically. the programs are extensively interdigitated, making certain types of program changes difficult. high case loads and the necessity for obtaining rapid diagnostic correlations prompted the development of an electronic computer based system of accession data management, storage, and retrieval in a large state-university veterinary medical diagnostic laboratory. this system is capable of handling large volumes of diagnostic data such as results of histopathology, parasitology, necropsy, and microbiology as well as demographic data. this system uses the full screen capabilities of ibm model computer terminals to display a blank panel (essentially a blank laboratory results form) which is filled in by laboratory technicians or data entry operators at the major data generating stations throughout the laboratory. final reports are printed directly from this computer stored information. individual accession records are kept in a vsam data-base and archived to magnetic tape every quarter. information from these records is abstracted as needed by the computer for an annual concordance index and other administrative reports. the management ofclinical and laboratory data in veterinary medicine a computerized system for retrieval of case information in a veterinary diagnostic laboratory organ or tissue examined microscopically in routine disease surveillance accessions (fig. , line ) . each cell in the table may be left blank (indicating that an organ or tissue was not examined for that animal), filled with a dash or minus sign (indicating no significant microscopic lesions were found in that organ or tissue for that animal), or a description of the lesions found can be entered using a two digit number from a directory (fig. ) . to speed data entry and insure consistency, a directory of descriptions of common microscopic lesions was programmed which the computer interprets to narrative statements in english on the final report. the operator thus need only enter the code for a given lesion, for example, when one enters " ", the computer will translate it and " multiple syncytia in absorbing epithelium covering villi pathognomonic of mouse hepatitis virus infection." 
will appear on the printed report (arrow on fig. l(b) ). descriptions of lesions not on the list can be entered by means of a free text reference (arrow in fig. on line ) in the appropriate cell and by entering the appropriate text in the free text area for that panel, labeled with the same free text reference ( fig. (a) i""""""" in the college of veterinary medicine. his research activities have centered around infectious diseases of animals and poultry (where he has senior or co-authored more than papers in refereed journals), and in toxic effects of fungal poisons. where he has authored several papers and is co-editor ofa threevolume series on mycotoxins and mycotoxicoses. he has had interest throughout his career in computerization of data from animal disease diagnostic laboratories. key: cord- - axdn g authors: allam, zaheer title: the emergence of voluntary citizen networks to circumvent urban health data sharing restrictions during pandemics date: - - journal: surveying the covid- pandemic and its implications doi: . /b - - - - . -x sha: doc_id: cord_uid: axdn g covid- has impacted the global landscape well beyond initial estimates, impacting on both societal and economic fronts. immediate responses by corporations and governments were geared toward building knowledge so that accurate and efficient programs could be devised toward curbing the impacts of the pandemic on society. however, one aspect to this was noted as to the limited availability of data sharing across platforms, systems, and jurisdictions, leading to limited datasets, hence, rendering inaccurate predictions that can be used to contain and limit the virus outbreak. in view of required immediate actions, volunteered geographic information (vgi) and citizen science concept have emerged, where people voluntarily share location and health status data to circumvent data sharing restrictions imposed upon corporations and governments. this is leading to more accurate predictions and supporting an emergence of alternative tools. this chapter explores this dimension and outlines how people, previously aggressively resisting data sharing, do so willingly in times of emergencies. after the past three industrial revolutions, the fourth one, where use of information and communication technologies (ict) technologies is emphasized, has accelerated our global transformation. today, it is not unusual for different institutions and agencies to use a wide range of computing technologies in their decision-making and operations. technologies such as artificial intelligence, big data, blockchain technologies, machine learning, cloud computing, and others such as natural language processing have become very popular. one sector that has greatly benefitted from such is health, where through the use of the said technologies, it is now possible to collect and process massive datasets, thus leading to advanced surveillance and monitoring, diagnosis, treatment, and drug manufacturing among other benefits. these have been even more instrumental in the present days when the whole world is experiencing one of its lowest moment, courtesy of covid- pandemic. use of these technologies allowed for quicker identification of the virus strand and for faster sharing of information on the virus to various global agencies for quicker actions to be taken. 
on the same, the use of these technologies, especially ai, machine learning, and natural language processing, through some startups, bluedotdbased in canada, and metabiotadbased in california, the united states, was able to an earlier prediction of the next location where the virus would spread after wuhan (metaboita, ; heaven, ; . furthermore, these technologies have become particularly handy in helping various stakeholders involved in the fight against the coronavirus in making decisions on issues such as quarantine, self-isolation, preventive measures, screening, and many others (pringle, ) . with all these technologies, the common denominator is the availability of data, and without these, or when such are unreliable, low scale, compromised, or delayed, the analysis, insight derived and decision made, would undoubtably be less impactful. but, while this is the case, there is evidence that the world is amass with data from different sources, especially with the infiltration of smart phones, and availability of numerous social media platforms. there are some open databases that allow free access to data. with all these, in the case of covid- , startups engaged in providing more insights are observed to access data from those sources, including airline ticketing and from governments of different countries, and with these, they are able to run simulation and predictive algorithms to come up with conclusions guiding policy orientations. on this, while the outcomes from such computations are impressive and with far reaching impacts on the health frontier, the numerous challenges with data collection, storage, management, access, and distributions still linger. in the present case of covid- , for instance, it has become clear that some crucial data such as the actual number of those affected, the trends of spread and the number of those in selfisolation and quarantine are not available to the extent desired. thereby, it would be a daunting task to quantify or predict the number of those who are to be infected by the virus and in which geographical location. in such circumstances, it has become relatively hard for authorities to pursue watertight preventive measures that would see the spread reduced. these obstacles on data need to be rectified urgently if substantial steps on the fight against the virus are to be won. on this front, this chapter proposes a paradigm shift where data collection and sharing will not follow the traditional models that are mired with the shortcomings but proposes an alternative one. in this case, the model being fronted is the volunteered geographic information (vgi), which emphasizes on having open data banks, with the public being allowed to freely share data. on this, langley et al. ( ) note that with vgi, it increases the propensity for collaboration in matters relating to data, including with nonscientists like the general public who usually share massive amount of data via social media platforms. with the model gaining popularity, it would be easy to capitalize on the increasing popularity and numbers of smart devices and hence end up with an enriched database that can enhance the information and insights concerning the pandemic. while vgi has the potential to positively influence the global health database, it would be significant to ensure that the restriction in data sharing is overcome such that the data can be shared in real time. 
this has not been the case as reported by hamade ( ) , and when that happens, there has been the challenge of lack of granularity making it hard for one jurisdiction to comfortably access and use data from a different jurisdiction. this is observed as among one of the hindrances that need to be addressed while recalibrating the standards and protocols that guide data sharing (allam, a , allam, a , allam, b , allam, c , allam, d . but even before this, there is a discourse on how that standardization can be done. it is worth noting that in the case of emergencies, as in the present case of covid- , one cannot wait as the situation is dire with far reaching impacts. therefore, there is need for goodwill from all stakeholders and governments to voluntarily allow unsolicited data sharing, by overriding existing restrictions. this way, as is expressed by romm et al. ( ) , it will be possible to urgently track and trace suspected cases of the virus and, thereafter, initiate the most appropriate medical approach; whether quarantine, isolate, or hospitalize individuals depending on their health status. in view of this background, this chapter is dedicated to the exploration of whether it is feasible to adopt the vgi model during times of disaster like now, especially in this current era where the advanced computational technologies are ubiquitous. the idea of data sharing, especially in the present scenario when covid- , has caused havoc across the globe that cannot be overemphasized. but such efforts need to be complimented by reactionary approaches aimed at helping to overcome some deep-rooted challenges that have curtailed free sharing, use, and distribution of data. in view of this, there is a substantial amount of literature dedicated to discussing primary challenges that are associated to data. among those challenges identified include those of privacy, security, ethics, transparency, volume, and storage to name a few. the concern over those challenges is even further accentuated by the increasing technological advancement, where with some modern technologies, it has become increasingly possible for different agencies and individuals to unlawfully access private data. tiell ( ) of accenture affirms this and further expresses that as the amount of data increases, and the means to generate and collect such also in inexpensive ways, the privacy concerns become live. in this case, it becomes relatively hard to convince potential sources to freely share their data, as they fear that such would be compromised, especially for-profit gains. the fear also arises as it is evident that in most cases, such are stored by third parties, especially large ict corporations that have the technological capacity to collect and store large amounts of data (allam, a , allam, b , allam, d allam and newman, ) . the fears are compounded by the fact that these parties can, in some instances, share the data in their possession for their own selfish interest, or as tools for target marketing. such would go against the letter and spirit of vgi, which is based on the principle of transparency and ethical use of collected data (blatt, ) . this fear often results in artistic manifestations, as shown in fig. . . while those challenges persist, in the recent past, there have been spirited efforts to reverse the situation, especially through the formulation of a series of legal frameworks, standards, and protocols that would guide data management, access, and sharing. 
in this regard, it is possible to point some practical examples. for instance, in the european region, there is a legal framework dubbed nordic data sharing framework (salokannel, ) aimed at ensuring that certain procedure, rules, and laws are followed while sharing data. another example is the cyber security information sharing act (cisa) ( ) that was passed in by the us senate. other such frameworks include the general data protection regulation (gdpr) ( ), trusted data sharing framework, and the data sharing code of practice that borrows from section number of the data protection act passed in by the european parliament ( ) . with these, plus a myriad other based on different parts the world, those are observed to provide some reprieve when it comes to sharing of data, especially which has the potential to expose private information. with adequate safeguards and platforms, it has become possible to overcome data silos, which most existing databases have maintained, with a bid to overcome existing restriction and filter any unauthorized communication. with such frameworks and protocols, it is a daunting task to allow for free data sharing and leave alone open access of the databases. besides overcoming the silos, reworking on the existing standards and protocols allows existing databases to be enriched, and this is affirmed by the quality of insight derived after analyzing the data. this does not imply that they are insufficient in their current form, but eliminating silos is the surest way of enhancing them. this is true as most of the information being sought, in most cases, especially those related to health need to be diversified, regarding issues such as geographical location and demography. for instance, in the current case of covid- , compelling insights can only be achieved if analysis is done with data that capture details from different geographical locationsd countries, cities, and establishments. but, still, the existing frameworks are not enough; as already, due to their divergence in relation to different issues, it has become hard to have an agreement on the conclusion being drawn in different countries. for instance, china, which was the first to experience the outbreak of the coronavirus, has been accused, especially by the united states, of giving falsified data and information, especially about the emergence of the virus, the spread, and its impact (crowley et al., ) . according to moorthy et al. ( ) , the fight against covid- needs to be addressed from a point of sincerity and transparency, especially in regard to data sharing; hence, they emphasize that all concerned agencies and governments need to ensure the authenticity and urgency of sharing the data. by doing this, the world would prevent a repeat of the severe acute respiratory syndrome (sars) outbreak, where china was accused of delaying information on the outbreak, thus making it hard for identification and subsequent diagnosis and control of the virus (cao et al., ) . the world health organization recommends the approach taken during the ebola virus outbreak where there were concerted efforts from different agencies and government and, thus, saw the virus contained before it could spread further beyond the three affected western african countries (wojda et al., ) . with that, the current case of covid- requires an open database where all participating institutions, governments, and other stakeholders can easily access the data. 
however, as of now, this protocol has not yet been established or communicated, but the severity of the virus demands that protocols be adopted and communicated in a clear fashion such that a solution on the virus can be formulated before the global health, society, and economy reach a tipping point. through the numerous technological advancement that the world has been experiencing since the turn of the fourth industrial revolution, the amount of data being generated has continued to increase. there before, due to scarcity and complexity of technological tools, and the exorbitant costs of collecting, storing, analyzing, and distributing data, researchers and other data users had to face numerous challenges. similarly, during those periods, the need for data was not as expounded as it is today; hence, people, organization, institutions, and even government could comfortably survive without data. but today, things have changed, and it has become paramount to have access to data to position oneself as a key player in this competitive world. luckily, the infiltration and accessibility of mobile devices such as phones, cameras, and wearables have come in handy in ensuring that massive amount of data is available. similarly, the availability of these devices has been complimented by the availability of diverse social media platforms, and fast internet has made the generation and sharing of data even more ubiquitous. but, as has been shared comprehensively in the sections earlier, these advancements have attracted substantial challenges, especially those related to privacy, security, transparency, and ethics that have been seen to curtail the freedom, the scale, and volume of data sharing. indeed, bezuidenhout and chakauya ( ) highlight that most people are now apprehensive to share their data for fear that such would be exploited to disfranchise and disadvantage them. even in high ranking government levels, there is apprehension that some of the technologies that have been developed are maliciously crafted to enable agencies and governments to extract private data from individuals and institutions without their consent. some of the issues that have brought about this apprehension range from the profit value that has been attached to data, with large ict organisations seen to be increasing their activities, including heavy investments in r&d to come up with cutting-edge technologies and devices to help them collect and store data. with such technologies, those are observed to seek increasing revenues, as most governments and institutions are constantly contracting startups with capacity to collect and store massive data on different issues. such data have also become very popular for target marketing and those with capacity to collect gain financially by sharing insights with marketing institutions. but, further from that, the main concerns with access to data are increasing incidences of terrorism activities such as recruitment and planning. there are also fears of extortion and sabotage of personal privacy among other issues. such have been seen to sometimes overshadow the original intents such as prevention of crime, increased efficiency in service delivery, medical purposes, improvement of liveability and resiliency of cities, and reduction in some bureaucracies and bottleneck in service delivery to mention a few. 
while in such scenarios, the existing data sharing frameworks and protocols have been seen to be adequate by proposing data anonymization and encryption, technology advancement is seen to enable reidentification, thus warranting the need for even more stringent measures. when all these challenges are compounded, as noted by waithira et al. ( ) , it becomes relatively hard to convince people to freely share their data, as already the levels of mistrust on agencies handling data are high and thus are seen to create psychological barriers to data sharing. this reality is not conducive for a world that is prone to numerous challenges like the present case of covid- , which urgently requires any available data to be analyzed to help in manufacturing vaccines, drugs, and relevant preventive tools and measures. though such psychological barriers may exist, it is now possible to talk through and win the public confidence on data sharing. this would be achieved by employing emerging technologies such as blockchain technologies and quantum cryptography, which have shown to be reliable in safeguarding the anonymity of individuals (allam, b; allam and jones, ; naz et al., ; shahab and allam) . as shared by hölbl et al. ( ) , blockchain technology is particularly popular currently due to its qualities such as scalability, immutability, decentralized, secure, reliable, and transparent. such qualities are sought by those willing to share their data through trusted platforms. therefore, while these technologies are relatively in their infancy stages and are being tested in different areas, they give a glimpse of hope in overcoming the apprehension caused by the increased leakage of data and the challenge of being traced back. availability of such technologies, especially in moments like now when the challenge of covid- is stressing and disrupting the core of the global systems, may inspire some paradigm shift in people and encourage them to voluntarily share data. this is pegged on the fact that even in times of distress, people are not readily prepared to share their data if they are not assured of its privacy and security. this is well recognized in the general data protection regulation (gdpr) (ketola et al., ) , which emphasizes that data especially that with potential to directly or indirectly lead to identification of a natural person need to be handled with extra care. doing this prevents scenarios where data are traced back to their source. that is, using technologies such as machine learning and reversing the process of data sharing such that one can identify individuals involved in the designing, generating, and sharing of the data. the imperial college london ( ) calls that process the "reverse engineering" of data, and it shows how much progress the technology has brought in the global arena and, thus, increases the apprehension that people have on sharing data, as such new technologies have the capacity to override the anonymizing and encrypting ones that have been relied upon to prevent identification of persons. therefore, as is the view of crutzen et al. ( ) , institutions and agencies and any stakeholder involved in data need to restrict themselves to only the data they require to conduct their research work. this strategy of data minimization ensures that as much as possible, only little data that can be utilized to identify an individual are available. 
such strategy of data minimization needs to be enhanced with advanced data sharing protection mechanisms, thus inspiring data handlers to share data they have ownership of. while there are diverse viewpoints on data sharing and the need for stringent measures to safeguard the same, especially to overcome the psychological barriers that is created by the public not trusting the whole system of data management, some cases are different. for instance, the present case of covid- dits spread, infectiousness, and havoc it is causing on the health sector, the social fabric and the economy have triggered an urgency on the citizen to voluntarily share data, with a hope that it could warrant the survival of people. on this, it has been observed that a majority of people in different parts of the world are willingly presenting themselves to health facilities, and in some countries, they are willingly collaborating with government agencies (unhcr, ) to voluntarily share information on the coronavirus. same trends of voluntary data sharing have been observed with popular celebrities drawn from different backgrounds who are sharing their coronavirus health status with the world. promising enough, these are also willingly complying with the medical requirement for quarantine and selfisolation, so that they can not only safeguard themselves but also cut the spread link, thus, protecting potent contact they would infect. on the government level, similar cases of citizen science are being demonstrated. on this, it is unusual for high-ranking government officials to disclose their health status to the public, but with the coronavirus, traditional knowledge is being shattered. for instance, in different jurisdictions, top officials such as the german's chancellor angela merkel publicly shared their coronavirus status and voluntary submitted to the self-quarantine rule (delfs and donahue, ). in another case, the prime minister of canada was unhesitant to disclose that his wife had tested positive, leading both of them in self-quarantine (gillies, ) . same case with the british prime minister, boris johnson, who made a video to share the unfortunate news that he had contracted the virus and was henceforth working from his isolation room (bbc news, ) and later hospital. in view of the earlier background and examples, it then verifies that the vgi model of data sharing can become a strong approach that can be used to enrich the available datasets, more so to overcome global, modern challenges. in the case of covid- pandemic, availability of an open access dataset enriched with data from the length and breadth of the world would enhance steps being made in finding a short-term and long-term solution to the health crisis. from such a dataset, anyone now has the potential to track how the virus is spreading and areas, or regions that require concerted efforts such as supply of test kits, personal protective equipment, food and medical supply, and other essentials. similarly, information from such datasets can inform governments on steps to take regarding travel ban, lockdowns, and other necessary measures that can prevent the spread of the virus. likewise, similar approaches could be used in the future to address other global challenges such as terrorism and future pandemics. the advantage on this is that with modern technologies, and availability of smart devices, it is possible for data to be sourced from different quarters in real time and in an inexpensive fashion. 
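as a loose illustration of how an open, volunteered dataset could be used to spot regions needing attention (test kits, protective equipment and other supplies), the sketch below counts recent positive reports per region and flags those above a chosen level; the record layout, the seven-day window and the threshold are all assumptions.

```python
# illustrative sketch only: from an open line list of volunteered reports
# (region, report_date, status), count recent positives per region and flag
# regions above an arbitrary threshold for closer attention.
from collections import Counter
from datetime import date, timedelta

def flag_regions(reports, today=None, window_days=7, threshold=50):
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    recent_positives = Counter(
        r["region"] for r in reports
        if r["status"] == "positive" and r["report_date"] >= cutoff
    )
    return [(region, count) for region, count in recent_positives.most_common()
            if count >= threshold]
```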
this is perhaps well represented in the rich assortment of wearable tech available today, represented in fig. . . but, on this, there also need to be systems and mechanisms in place to ensure that the data shared are filtered such that only factual, time-sensitive, and comprehensive ones are admitted into those datasets. this way, the quality of insight drawn from analyzing such data would be assured. and, while this is pursued, issues such as privacy and security that hold backs voluntary data sharing need to be fully addressed to assure the public of the trust and transparency. the discussion in the previous section, in an expounded way, demonstrates that most of the urban challenges can be, in a larger margin, solved by analyzing the massive data being generated from different quarters. in particular, with the increasing penetration of smart mobile devices, sensors, and other smart devices, which has the potential to link to an existing network, the amount of data being generated and shared is massive, and such can be exploited to develop long-lasting solutions. one such scenario where data have been handy is the present case of covid- pandemic, where through data, its spread, its infectiousness, and its impacts have been made known to the public. such were shared by bluedot and metabiota, some of the modern startups that use data, and through advanced technologies, such as natural language processing and machine learning, they were able to predict some of the geographical location that the virus would spread next from wuhan, days before first cases were reported in those regions. in addition, it is through availability of data from different locations and sources that health officials, governments, and researchers and other stakeholders are able to establish the manifestation and impacts of the coronavirus in respect to demographic factors such as age, gender, preexisting health condition, neighborhoods, and others. but, from the global happening, where despite spirited efforts and measures like quarantine, self-isolations, lockdowns, and others that have been instituted have not been successful in preventing the virus from spreading. this means that more data require to be collected and analyzed, especially in laboratories and in other medical fields to comprehensively come up with a solution that can help in containing the outbreak. regarding the earlier call for more data, there are now some steps that have already been made, especially in coming up with smartphone applications that have the potential to track and report on the infected people. with these apps, and relying on the concept of citizen science, there is optimism that people can voluntarily continue to share more data, which would ultimately be very useful in advancing research on how to contain the virus. in this front, some of the apps that are already in use include the covid- tracker that is dedicated to tracking all aspects of covid- with the perimeters of the united kingdom (wakefield, ) . in china, where the virus broke first, giant companies such as alibaba group and tencent are said to have taken the challenge of fighting the virus spread by incorporating coronavirus trackers in their existing apps. for instance, alibaba incorporated a color qr code in alipay app dedicated to identifying those suspected to have contracted the virus. those whose color turns red after the qr code is scanned then can use the company's messaging app (dingtalk) to seek medical services (li, ) . 
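the filtering requirement raised at the start of this passage, admitting only complete, recent and plausible submissions into an open dataset, could look something like the sketch below; the schema, the 48-hour recency window and the allowed status values are assumptions.

```python
# illustrative sketch only: basic admission checks for a volunteered report
# before it is added to an open dataset. schema and limits are assumed.
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("region", "reported_at", "status")
ALLOWED_STATUS = {"negative", "positive", "symptomatic", "recovered"}

def admit(report, now=None, max_age=timedelta(hours=48)):
    """report['reported_at'] is assumed to be a timezone-aware datetime."""
    now = now or datetime.now(timezone.utc)
    if any(field not in report for field in REQUIRED_FIELDS):
        return False
    if report["status"] not in ALLOWED_STATUS:
        return False
    return timedelta(0) <= now - report["reported_at"] <= max_age
```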
the same qr code approach is also available in the wechat app (owned by tencent) (voa student union, ), and such codes are said to be playing a significant role in containing the virus in the country. in the united states, the government is in discussions with social media giants such as facebook, twitter, and google on helping to track whether people are maintaining social distance (porterfield, ). the emergence of these and many other apps dedicated to tracking the virus across the globe is aimed at enhancing data collection mechanisms, with a particular focus on voluntary sharing. the success of such apps will help complement existing data collection mechanisms, with the main goal of building enriched databases that can provide insights into the global status of the virus. while these efforts to increase voluntary data sharing are necessary, data security and privacy should be strengthened at the same time, assuring those sharing their data, including private data, that it will not land in the hands of malicious actors and will not in any way be used beyond its intended purpose. such efforts would not only help break the psychological barrier but would also increase the confidence and trust that the public places in data-oriented institutions and agencies, with the ultimate effect of increasing the volume of data shared. this, coupled with the modern technologies discussed above, would in turn lead to improved service delivery, improved liveability, and more urban digital solutions. the same data, especially when sourced from health devices and wearable technologies, would help in formulating solutions to medical problems such as pandemics, in improving diagnosis, and in surveillance and monitoring, to name a few. the availability of big data would also support the optimal use of resources, the reduction of pollution, and greater efficiency in service delivery. while there is evidence of some emerging, positive changes in the data landscape, more study is still necessary to explore how such changes are affecting different sectors, especially on the social and economic fronts. the need for this is based on the fact that data have recently been described as the "new oil" of modern times, and wide-scale acceptance of the concept of vgi would only increase the amount available; it would therefore be expected that the social and economic sectors would change for the better. at the same time, it is also known that some organizations have been increasing their investments in ict so that they can profit from data management (yigitcanlar and bulu, ); hence, the expected positive changes in these sectors may not materialize, or may not be commensurate with the amount of data shared. further study would make it possible to establish whether this is the case and to recommend proactive actions that can be taken to ensure the public truly benefits from accepting the call to voluntarily share their data.
references:
united kingdom: legislation.gov.uk
contextualising the smart city for sustainability and inclusivity
on smart contracts and organisational performance: a review of smart contracts through the blockchain
achieving neuroplasticity in artificial neural networks through smart cities
the emergence of anti-privacy and control at the nexus between the concepts of safe city and smart city
privacy, safety, and resilience in future cities. in: allam z (ed) biotechnology and future cities: towards sustainability, resilience and living urban organisms
biotechnology to render future cities as living and intelligent organisms
data as the new driving gears of urbanization
digital urban networks and social media. in: allam z (ed) cities and the digital revolution: aligning technology and humanity
privatization and privacy in the digital city
artificial intelligence (ai) provided early detection of the coronavirus (covid- ) in china and will influence future urban health policy internationally
the potential of blockchain within air rights development as a prevention measure against urban sprawl
on the coronavirus (covid- ) outbreak and the smart city network: universal data sharing standards coupled with artificial intelligence (ai) to benefit urban health monitoring and management
redefining the smart city: culture
coronavirus: prime minister boris johnson tests positive
hidden concerns of sharing research data by low/middle-income country scientists
data privacy and ethical uses of volunteered geographic information
what we have learnt from the sars epidemics in mainland china?
coronavirus drives the u.s. and china deeper into global power struggle
why and how we should care about the general data protection regulation
merkel quarantine puts another crack in europe virus defense
canadian prime minister's wife sophie grégoire trudeau tests positive for coronavirus
covid- : how to fight disease outbreaks with data
ai could help with the next pandemic-but not with this one
a systematic review of the use of blockchain in healthcare
anonymizing personal data 'not enough to protect privacy'
data management guidelines
using meta-quality to assess the utility of volunteered geographic information for
alibaba unveils technologies to empower partners in fight against coronavirus
confronting the risk you can't see
data sharing for novel coronavirus (covid- )
a secure data sharing platform using blockchain and interplanetary file system
surge of smartphone apps promise coronavirus tracking, but raise privacy concerns
computer science versus covid-
tech industry discussing ways to use smart phone location data to combat coronavirus
nordic data sharing framework: a legal perspective
reducing transaction costs of tradable permit schemes using blockchain smart contracts
security and data ethics: best practices for data security
coronavirus (covid- ) update
phone apps in china track coronavirus
data management and sharing policy: the first step towards promoting data sharing
coronavirus: tracking app aims for one million downloads
the ebola outbreak of e : from coordinated multilateral action to effective disease containment, vaccine development, and beyond
urban knowledge and innovation spaces
key: cord- -opuwyaiv authors: amram, denise title: building up the "accountable ulysses" model. the impact of gdpr and national implementations, ethics, and health-data research: comparative remarks date: - - journal: computer law & security review doi: . /j.clsr. . sha: doc_id: cord_uid: opuwyaiv abstract the paper illustrates the obligations emerging under articles and of eu reg. / (general data protection regulation, hereinafter "gdpr") within health-related data processing for research purposes. furthermore, through a comparative analysis of the national implementations of the gdpr on the topic, the paper highlights a few practical issues that the researcher might deal with while complying with the gdpr obligations and the other ethical requirements.
the result of the analyses allows us to build up a model for achieving an acceptable standard of accountability in health-related data research. the legal remarks are framed within the myth of ulysses. ulysses, according to homer's epic poem the odyssey, was king of ithaca, son of laertes and anticleia, and father of telemachus. ulysses was described as a man of outstanding intelligence, wisdom, and endurance. ulysses is cunning, able to overcome insurmountable obstacles and to shape reality; dante admired his skills and competence, his courage and smartness. in the nineteenth century, ulysses became the intellectual hero, far removed from the society of his time, looking for a safe harbor but always in trouble. in the twentieth century, ulysses is the modern hero who carries the anxieties and sufferings of the search for the true sense of things. nowadays, the ulysses . has been interpreted as a human person in all the complexity of his skills and feelings. the common thread running through the several interpretations of the myth is how ulysses faces new challenges during his journey, combining his technical and organizational skills in order to deal with (and sometimes overcome) the vulnerabilities and limits of human beings. this model is particularly relevant today, as scientific efforts are addressed to facing the covid- pandemic through new technological solutions aimed both at supporting the early detection and treatment of the disease and at managing its social and economic consequences, in light of the necessary balance between fundamental rights. without pretending to contribute to the literature on the myth, the idea of ulysses as a researcher who develops technical skills and ethical competence to properly achieve his scientific goals appears suitable for building up a model of ethical-legal compliance within research and development activities in light of the gdpr principle of accountability. in the following paragraphs, we will discuss the ethical-legal obligations that researchers have to deal with while processing personal data, and in particular health data, during their journey (i.e. the life cycle of the research), highlighting through a comparative analysis the critical profiles emerging from the current legal framework. after the gdpr entered into application, a strong debate arose about the impact that the new regulation had on research. the balance between the protection of fundamental rights and the free circulation of data makes researchers responsible for a series of obligations for the whole duration of the research lifecycle. this could be seen as an obstacle, at least in terms of time and resource allocation. the need to adapt current practices to the new paradigm of the privacy by design and privacy by default approach, with respect to the whole research architecture, includes the necessity of adopting proper technical and organizational measures. however, these cannot be standardized for every project: they should be appropriate to the specific activity, and they can be replaced during the development of the research, considering the possible introduction of new risks or of new technologies to mitigate them. according to the principle of accountability, in fact, the researcher has the burden of proving the implementation of the mentioned measures, not only because the project proposal has to satisfy a given checklist of conditions, but because the research methodology itself has to be ethical-legal compliant by design. therefore, the first skill of our ulysses . 
is the openmindness to consider the ethical-legal compliance as a necessary step despite of the given area of research and regardless (as well as beyond) the existence of a given check-list section in the project proposal template. in the daily routine, these profiles might constitute a new combination of procedures, contacts, administrative activities which take time and should be supported by the research institutions. the great challenge that the gdpr launched is to create the opportunity of sharing ideas, models, and options in a continuous interdisciplinary dialogue aimed at addressing research and innovation towards the eu values and fundamental rights. to protect dignity and fundamental rights is the compass to improve society and enhance its values: science serves human beings, not viceversa , despite the technological progress goes faster than legal positivism and it has already put (personal as well as non-personal) data analysis as the first step of the research. this is true in the case of research dealing with health data, since the information connected to such a processing make the data subjects particularly vulnerable. the gdpr has standardized the legislation at the european level, but at the same time it allowed member states to nationally specify conditions and requirement in the context of data processing for scientific and statistical research purposes. this might create an obstacle while the activities (and therefore the processing of personal data related to them) are frequently conducted by partnerships belonging to several nationalities. from this viewpoint, it is certainly useful to compare different national interpretations in order to compare the implemented/proposed national models to find out possible best practices which may address the discussion towards a specific code of conduct. for this reason, considering the new ethical-legal issues emerging from the scientific-technological progress that involves a daily use of health-related data, our comparative analysis will firstly discuss the legal bases for health data processing for research purposes in order to identify the critical profiles as well possible practical solutions that might help ulysses . . to develop the "accountability" virtue. health-related data are those information which are relevant for health conditions, i.e. reproductive outcomes, causes of death, and also the quality of life. for this reason, they are included in the list of personal data considered as sensitive under article gdpr. this article regulates the legal bases to process data belonging to the so-called "particular categories". first of all, to allocate the data subject "responsibility" of the whole data processing, national implementations prefer to asking for the consent to data processing, even if other legal bases might be applied. this tension reflects a sort of confusion between the informed consent for volunteers who participate to a research experiment and the information about personal data processing as well. the involvement of human being in a research, both for clinical and not clinical trials, in fact, requires to ask for their consent, therefore the overlap of the procedure is quite common. however, instead of asking for consent for data processing under article sub a) or , para , sub a) gdpr, data might be processed in the public interest of the controller under article sub e) and art. , para , sub i) or j) gdpr) or pursuing a legitimate interest (article sub f) and art. 
, para , sub j) gdpr), in light of article gdpr, which regulates the data processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes. according to the gdpr system, in fact, the data subject's consent could be seen as a residual legal basis, considering that transparency and awareness are in any case achieved by the information under article ff. gdpr, which are due regardless the legal basis of the data processing. moreover, to collect consent does not mean only to sign a form, but it means to be responsible that the obtained consent is informed, relating to a specific purpose, unambiguous, freely given, as stated by article gdpr and explained in the recitals ff. the right to revoke it should be always guaranteed. this might constitute an issue, if we consider that the favor for data processing for research purposes, and the presumption of conformity of re-using for research purposes under article sub b) and e) gdpr are strictly connected to the article gdpr paradigm, which recalls the necessity to implement appropriate safeguards that the researcher has to adopt to minimize any risks. some critical profiles emerge from article , para , gdpr which allows member states to decide whether or not maintaining the legal bases provided by the eu regulation or introducing further conditions, including limitations, with regard to the processing of particularly sensitive data, like the genetic data, the biometric ones, or those concerning health. this reduces the room of harmonization of the legal basis to process health-related data, also for research purposes. in fact, article gdpr offers the opportunity to member states to provide derogation to data subjects' rights while processing personal data under the proper safeguards (such as the pseudonymization and anonymization). at this regard, the spanish ley orgánica de protección de datos personales states that limits to the exercise of data subjects' rights are valid only if addressed to the researchers that process anonymized or pseudonymized data. in light of this provision, some systems have introduced new rules or updated the previous legal framework, identifying specific conditions and requirements applicable also in case of health-related data processing for research purposes. the topic has been highlighted in the last weeks while na- tional legislators have been introducted specific provisions to deal with the covid- emergency. for example, the irish data protection act , which replaced the previous data protection acts in light of the gdpr, states under the article that the suitable and specific measures for processing sensitive data also for research purposes should be regulated in a further specific act: the health research regulation (hereinafter "hrr"). according to the hrr, the data controller who is processing or further processing personal data for the purposes of health research shall ensure that the collection of an explicit consent from the data subject, before starting the health research. the consent could be obtained "either in relation to a particular area or more generally in that area or a related area of health research, or part thereof ". the favor for the application of article , para , sub j) is anyway recovered when there is a public interest in the research as declared by a specific committee appointed by the health ministry. 
in that case, the data protection impact assessment under article gdpr should be performed together with a positive approval from the ethics committee. likewise, the italian ethics rules adopted by the italian data protection authority for the processing of personal data for scientific research and statistical purposes, namely the regole deontologiche per trattamenti a fini statistici o di ricerca scientifica pubblicate ai sensi dell'art. , comma , del d.lgs. agosto , n. , refer to consent as a legal basis for processing sensitive data under article gdpr. for medical, biomedical, and epidemiological research, article of the above-mentioned ethics rules states that data subjects/patients should be able to distinguish, through proper information, between data flows for healthcare purposes and data flows for research purposes, but consent seems to be maintained as the principal legal basis for processing data. this might be misleading if it is not read together with article of the privacy code, legislative decree n. / , as amended by legislative decree / , a higher-ranking rule. it states, in fact, that consent is not necessary if the legal basis is article , para , sub j) and a data protection impact assessment has been performed and published. (at the time of writing, the scientific community is focused on the ethical and legal issues emerging from the covid- emergency management; in particular, the necessity of balancing individual and collective health protection with personal data protection has stimulated an interesting and interdisciplinary debate that cannot be avoided.) the provision is quite cryptic, as it is not evident which are
further organizational measures, like the necessity to appoint a data protection officer (dpo) even if not required for the data controller according to the article gdpr, are stated in case the data protection impact assessment considers a high risk. the data processing record under article gdpr must include (i) the reasons that justifies the public interest in pursuing the research or in further processing data and how the possible lack of information might be justified by the anonymization (the pseudonymization or the reasons to avoid it) or the risk to compromise the research, and (ii) the agreement between who firstly collected data and the further processing actors. the belgian approach addresses the debate on the role of the legal basis compared to a series of other ethical-legal requirements while processing sensitive data for research purposes. indeed, pursuing the data protection by design and by default in a project means to build up a complex system of checks and balance, through organizational and technical measures discussed between the several expertise involved in the research. their implementation ensures that the research output will be aligned with the eu values and fundamental rights. loi relative à la protection des personnes physiques à l'égard des traitements de données à caractère personnel , . . , https://www.ipnews.be/wp-content/uploads/ / / -loi-belge-adaptant-réglementation-belge-au-rgpd. pdf. according to the above-discussed system, the data controller (i.e. the university/research institute in person of the legal representative) shall involve the principal investigator in the data management activities, authorizing to data processing under article gdpr, in order to proactively guarantee the adoption of those technical and organizational measures aimed at safeguarding the rights and freedoms of data subjects in her project. this first organizational measure that a research institute has to apply is to appoint a role in the privacy orgchart to the principal investigator of each research. at the same time, this helps to sensitize, trains, makes each ulysses . responsible of a data protection by-design research, and let the data controller achieve the compliance with the principles of correctness, transparency, and minimization under article gdpr. however, this might not be sufficient since the principal investigator might not have an ethical-legal background able to identify those proper safeguards that would make her research compliant with the current legal framework. at this regard, a principal investigator non-gdpr expert may consider a double level of ethical-legal experts' involvement. the first level concerns the data protection officer appointed by the data controller under articles ff., who -in light of the principle of proximity -shall be able to deal with research issues and be a key-person between the principal investigator, the data controller, the ethics committee, the it services, and the data protection authority. this supposes a strong collaboration with the administrative staff that supports the research. the second level refers to the increasing role played by an ethical-legal unit as a partner of the developed project. its task is to help the principal investigator to design and implement the research in compliance with the whole ethical-legal framework during the entire duration of the project. 
considering the ethical-legal challenges emerging from the current legal-ethical framework, to address an impact assessment on the basis of the risk for the shared values and fundamental rights could enhance the given research not only in terms of innovation, but also for the consequences on the society, economy etc. the involvement both of a dpo and an ethical-legal unit could make the difference in terms of achieving the goal of an ethical-legal compliant research by design and by default in a given system. in fact, it strengthens the interdisciplinary dialogue and helps the cross-fertilization between different domains. according to the current national implementations of the gdpr, in fact, many systems, like the above-illustrated belgian one have introduced check-lists and protocols to properly address the data protection compliance activities. for example, in the spanish system, firstly, an impact assessment must be carried out; secondly, the research must be subjected to quality standards according to the international directives and clinical best practices; thirdly, it is necessary to implement tools aimed at avoiding the re-identification of the data subjects; finally, the spanish law requires the ap-pointment of a legal representative in the european union under the article eu reg. / and gdpr in case of extra-eu partnerships. the provision sub g), indeed, states that in case the ethical committee cannot express an approval, the principal investigator may ask the data protection officer's one. however, it specifies that ethical committees should add specific competence on data protection by one year. from this perspective, as far as health data are concerned, data protection compliance walks together with the ethical one. therefore, the ethical committees should empower their competence on data protection and the data protection officer shall be able to address the researcher to an ethical-legal unit or legal-ethical advisor. the proactive risk-based approach which has been implemented by the gdpr for data processing could be potentially applicable to all the ethical issues emerging in the research. to this end, the already mentioned irish hrr takes the opportunity to deal with the ethical profiles related to research (e.g. informed consent, voluntary research, cost-benefits assessment with respect to the clinical trial, conflict of interests etc.) answering the need to develop a coherent paradigm to properly process health data for research purposes. the hrr firstly provides a definition for "health research", listing scientific areas for the purpose of human health, including: "(i) research with the goal of understanding normal and abnormal functioning, at molecular, cellular, organ system and whole body levels; (ii) research that is specifically concerned with innovative strategies, devices, products or services for the diagnosis, treatment or prevention of human disease or injury; (iii) research with the goal of improving the diagnosis and treatment (including the rehabilitation and palliation) of human disease and injury and of improving the health and quality of life of individuals; (iv) research with the goal of improving the efficiency and effectiveness of health professionals and the health care system; (v) research with the goal of improving the health of the population as a whole or any part of the population through a better under-standing of the ways in which social, cultural, environmental, occupational and economic factors determine health status ". 
then the hrr recalls the ethical issues to be assessed by the ethical committees for approval: data protection is one of them, but it finds a specific development in article and ff. article states that personal data can be processed for research purposes as long as it is necessary to achieve the research purposes and it does not cause any damage or distress to the data subject and if the organizational measures stated sub b) are in place. the check-list that follows is a sort of data protection plan, which includes the ethical approval by an ethical committee, the identification of the privacy governance structure (including joint controllers, data processors, and recipients), the training programme for those who are involved in the research, the data protection impact assessment for higher processing risks. the italian ethics rules issued by the data protection authority, instead, identify the field of application under a subjective requirement: "to all the data processing carried out for statistical and scientific purposes -in accordance with the methodological standards of the relevant disciplinary sector -which are held by universities, other research institutions and research institutes and scientific societies, as well as researchers operating within the scope of said universities, institutions, research institutes and members of these scientific societies ". to this end, marketing purposes hidden under "research or statistical" purposes are excluded since private companies are included only if in their company bylaws research activities are mentioned. pseudonymization, anonymization, and data re-using for research purposes according to the article gdpr data processing for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes, shall be subject to appropriate safeguards. the same article refers to pseudonymization as the first measure aimed at achieving the minimization purposes. for further processing, it states that a further level "which does not permit or no longer permits the identification of data subjects" should be gained. technically, it does not exist a unique criterion of anonymization. data can be considered anonymized having regard to any methods reasonably likely to be used by the data controller (or any third party) to reverse the process and allow the re-linking of the data subject. the assessment is based on the risk of re-identification through a rational effort. therefore, the pseudonymization standard could be always obtained through technical separation of information, considering several levels (e.g. scrambling, encryption, masking, tokenization, data blurring, etc.) while the anonymization could be achieved by the combination of technical and organizational measures as well, considering the features of the data controller. some national implementations of the article gdpr focused on this profile. the belgian law establishes, in case of health data processed for research purposes, that pseudonymization could not be performed by the data controller, but by an independent body, who is subject to specific confidentiality obligations (i.e. professional secrecy). at this regard, the spanish law requires a "technical separation" between who performs the two activities and an explicit obligation for the ones who pseudonymized to avoid re-identification. these provisions arise two issues. 
firstly, when the data controller is a research hospital, healthcare purposes and scientific ones might be performed at the same time by the same team. in this context, despite of the application of pseudonymization techniques, clinicians might be always able to recognize and refer to their patients, even if committed to professional secrecy. secondly, to always identify a partner or team for pseudonymization could constitute an expense for the research: perhaps it could be sufficient to appoint such a task to it services of the data controller and establishing granular accessibility to the token (e.g. only the principal investigator, but not the research team). the article bis of the italian privacy code refers to the reusing of data by third parties. first condition is that data subjects must be informed. otherwise a prior authorization from the data protection authority is needed. this approach is not applicable whereas personal data are collected for healthcare purposes and used for research ones by the same research clinics, considering the functional link between the two purposes. the provision seems to refer to patients' personal data before being pseudonymized or anonymized for research purposes, as stated under article gdpr. another profile of the gdpr compliance consists of the system security: data flows are usually in a digital format, therefore proper measures shall be implemented to guarantee the availability, integrity and confidentiality of data. as far as the security of data is concerned, the irish act, for example, refers to the criterion of the "effectiveness" of the adopted measures. so far, the interdisciplinary dialogue touches the fields of the data management, including ip rights. if we consider that open access is becoming the new standard to manage research data, the role of the data protection officer/expert/advisor becomes essential to establish which data can be shared or not. therefore, skills required to ulysses . and his crew become everyday more specialized ones. in the context of the covid- pandemia, for example, during the so called lockdown, governments opted for establishing interdisciplinary task forces aimed at identifying effective, ethical-legal, suistanable solutions to plan a safe re-starting of the activities. this strategy appears in line with the ulysses . model. as shown in the previous paragraphs, the "accountable ulysses" is a standard which might be achieved only estab-lishing an interdisciplinary dialogue between the researcher in different fields. starting from this principle, some common features can be identified to reach an acceptable level of compliance. i) whereas health data are processed, to involve ethicallegal experts, who could play either an institutional role (as the data protection officer) or being a partner of the research, since the beginning of the project proposal is an added value to design an ethical-legal compliant ecosystem. ulysses cannot avoid from including an interdisciplinary support in her crew. ii) research purposes are functional to empower human dignity and values. considering the strong impact on individuals as well as on groups, as the research could identify new vulnerabilities, or classify individuals (as more exposed to a given risk), ulysses processing health data shall adopt suitable and effective technical and organizational measures to ensuring the ethical-legal compliance, in order to avoid possible misuse or dual use. 
iii) the it infrastructure and the data management strategy should be designed to guarantee the availability, confidentiality, and integrity of data. iv) if ulysses is also a physician, data processed for healthcare and for scientific research should be clearly distinguished and should therefore follow possibly different data protection plans: risks, technical protocols, access, retention times, and levels of pseudonymization might differ. v) data flows should be regulated between partners as well as within the given research teams. vi) data flows should be recorded and described in the information given to data subjects. vii) data protection is one of the several ethical issues that might arise from research. the coordination between different legal constraints, protocols, and requirements should be analyzed in terms of risk assessment and monitored during the whole life cycle of the research. the gdpr introduced a new proactive approach to data-intensive research. its handling presupposes cross-fertilization between different domains, where the legal one plays the role of establishing the boundaries between lawful and unlawful, contributing to the identification of possible tensions within the ethical framework. in order to sensitize ulysses to this new approach, which necessarily includes the allocation of time and resources, coherent ethical-legal support for the core research should be promoted by research institutes. in this perspective, ulysses does not represent only the principal investigator of a given research project, but the university/research institute per se, which, as the data controller, should first train the research staff and the administrative staff in ethical-legal compliance, inform them of their duties and responsibilities, and organize and introduce specific support; in other terms, it should be accountable in both its technical and its organizational activities. therefore, ulysses . is the one who embraces a new way of working, as has been stressed during these last weeks in the analysis of the solutions to combat the covid- emergency (see, e.g., eu commission, joint european roadmap towards lifting covid- containment measures, https://ec.europa.eu/info/sites/info/files/communication_-_a_european_roadmap_to_lifting_coronavirus_containment_measures_.pdf): open to assessing, together with the technical specifications, the ethical-legal consequences, not only in order to mitigate the risk of compromising fundamental rights, but to empower human dignity as the main core of her research. i hereby declare that there is no conflict of interests in publishing this paper. key: cord- -dao kx authors: rife, brittany d; mavian, carla; chen, xinguang; ciccozzi, massimo; salemi, marco; min, jae; prosperi, mattia cf title: phylodynamic applications in (st) century global infectious disease research date: - - journal: glob health res policy doi: . /s - - -y sha: doc_id: cord_uid: background: phylodynamics, the study of the interaction between epidemiological and pathogen evolutionary processes within and among populations, was originally defined in the context of rapidly evolving viruses and used to characterize transmission dynamics.
the concept of phylodynamics has evolved since the early (st) century, extending its reach to slower-evolving pathogens, including bacteria and fungi, and to the identification of influential factors in disease spread and pathogen population dynamics. results: the phylodynamic approach has now become a fundamental building block for the development of comparative phylogenetic tools capable of incorporating epidemiological surveillance data with molecular sequences into a single statistical framework. these innovative tools have greatly enhanced scientific investigations of the temporal and geographical origins, evolutionary history, and ecological risk factors associated with the growth and spread of viruses such as human immunodeficiency virus (hiv), zika, and dengue and bacteria such as methicillin-resistant staphylococcus aureus. conclusions: capitalizing on an extensive review of the literature, we discuss the evolution of the field of infectious disease epidemiology and recent accomplishments, highlighting the advancements in phylodynamics, as well as the challenges and limitations currently facing researchers studying emerging pathogen epidemics across the globe. electronic supplementary material: the online version of this article (doi: . /s - - -y) contains supplementary material, which is available to authorized users. globalization has dramatically changed the way in which pathogens spread among human populations and enter new ecosystems [ , ] . through migration, travel, trade, and various other channels, humans have and will continue to intentionally or unintentionally introduce new organisms into virgin ecosystems with potentially catastrophic consequences [ ] . humans are not the only culprits, however; global climate pattern changes can alter local ecosystems, creating favorable conditions for the rapid spread of previously overlooked or even undiscovered organisms among humans, giving rise to unexpected epidemics [ , ] . recent years have been marked by global epidemics of ebola, dengue, and zika, derived from pathogens previously restricted to local outbreaks [ ] . according to the world health organization, more than one and a half billion people are currently awaiting treatment for neglected tropical diseases with similar potential for global spread, for which we have limited knowledge of etiology and treatment options [ ] . this lack of knowledge further limits our ability to investigate the putative role of these pathogens in future epidemics or even pandemics. epidemiological strategies have been and still are the first line of defense against an outbreak or epidemic. despite conventionality, traditional epidemiological methods for the analysis of global infectious diseases are subject to errors from various sources (fig. ) and are thus often inadequate to investigate the epidemiology of an infectious disease. putative outbreak investigations typically ensue following case notification of one of the diseases recognized by local and global public health organizations. trained investigators subsequently collect data on cases and diagnoses to establish a disease cluster. during active surveillance, more cases may be detected through outreach to healthcare facilities and nearby health departments. relevant case contacts, such as family, friends, and partners, are also sought to provide details on demographics, clinical diagnoses, and other potential risk factors associated with the spread of the disease [ ] . 
however, the lack of infrastructure, trained personnel, and resources in low-and middleincome countries are prohibitive against field epidemiology investigations, as contact tracing and surveillance both require systematic, unbiased, and detailed investigations. the reconstruction and interpretation of transmission networks are often very sensitive to response, selection, and recall biases and are strictly limited by surveillance data collected in many regions with diverse socioeconomic and cultural backgrounds [ ] [ ] [ ] . in addition, even with a highly effective surveillance system, environmental, zoonotic, and vector-borne transmission dynamics confound analysis by shadowing alternative (i.e., not human-to-human) routes of disease acquisition. furthermore, routine analyses of pathogen subtype and drug resistance are conducted only in a subset of developed nations, wherein variation in screening assays and protocols and therapy regimens increases the discordance in surveillance [ , ] . despite the limitations to traditional infectious disease epidemiology, major advances in study designs and methods for epidemiological data analysis have been made over the past decade for a multifaceted investigation of the complexity of disease at both the individual and population levels [ , ] . however, many challenges for infectious disease research remain salient in contemporary molecular epidemiology, such as the incorporation of intra-and inter-host pathogen population characteristics as influential factors of transmission. combating current and future emerging pathogens with potential for global spread requires innovative conceptual frameworks, new analytical tools, and advanced training in broad areas of research related to infectious diseases [ ] [ ] [ ] . an expanded multi-disciplinary approach posits advancement in infectious disease epidemiology research and control in an era of economic and health globalization [ , , , ] . fortunately, recent developments in phylogenetic methods have made possible the ability to detect evolutionary patterns of a pathogen over a natural timescale (months-years) and allow for researchers to assess the pathogen's ecological history imprinted within the underlying phylogeny. when reconstructed within the coalescent framework, and assuming a clock-like rate of evolution, the evolutionary history of a pathogen can provide valuable information as to the origin and timing of major population changes [ ] . phylogenetic methods also provide key information as to the evolution of both genotypic and phenotypic characteristics, such as subtype and drug resistance (fig. ) . even though phylogenetic methods are also limited in certain areas, such as restriction of analysis to only the infected population, a significant subset of these limitations can be overcome by complementary use of data from surveillance (both disease and syndromic) and monitoring [ ] (fig. ) . by integrating phylogenetic methods with traditional epidemiological methods, researchers are able to infer relationships between surveillance data and patterns in pathogen population dynamics, such as genetic diversity, selective pressure, and spatiotemporal distribution. systematic investigation of these relationships, or phylodynamics [ ] , offers a unique perspective on infectious disease epidemiology, enabling researchers to better understand the impact of evolution on, for example, spatiotemporal dispersion among host populations and transmission among network contacts, and vice versa [ , ] . 
the study of the interconnectedness of these pathogen characteristics was previously limited by the cost and timescale of the generation of molecular data. recent decades have been characterized by technology with the ability to rapidly generate serial molecular data from identifiable sources for which we can obtain detailed relevant information through epidemiological surveillance, allowing for the merging of phylodynamics and epidemiology, or evolutionary epidemiology [ , ] . hence, progress in the field of molecular evolution has provided the opportunity for real-time assessment of the patterns associated with local, national, and global outbreaks [ ] , cross-species transmission events and characteristics [ ] , and the effectiveness of treatment strategies on current [ ] and recurring epidemics [ ] . these assessments are essential for monitoring outbreaks and predicting/preventing pandemic inception, a good example being the recent study of middle east respiratory syndrome coronavirus global transmission [ ] (additional file (video s )). but has the, field of evolutionary epidemiology quite reached its full potential? in this article, we systematically discuss how the application of phylodynamic methods has and will continue to impact epidemiological research and global public health to understand and control infectious diseases locally and across the globe. in a strict sense, the concept of phylodynamics is anything but new. the phylogenetic tree reconstructed by haeckel in using phenotypic traits [ ] was used to explain the distribution of the earliest humansthe "twelve races of man"-across the globe and the location of the "centre of creation." this incorporation of both spatial information and phylogenetic relationships in the inference of population distributions and diversity among geographical locations is a branch of phylodynamics, often referred to as phylogeography. since then, the progression of genetic sequencing technology as well as geographical information systems (gis) has enabled evolutionary biologists to gain a higher resolution view of infectious disease dynamics. the st century, in particular, has witnessed unparalleled advances in methods and techniques for molecular sequence data generation and analyses. however, the relationship of progress and perfection is far from linear, along with its relationship to navigational ease. for example, phylodynamic inference has transitioned into a highly statistics-focused process with the corresponding challenges, including informative samples that can significantly affect the accuracy of results [ ] [ ] [ ] . several research groups [ , ] have reviewed and/or demonstrated the impact of neglecting critical quality control steps on obtaining reliable inferences using the recently developed phylodynamic frameworks, particularly with high throughput, or next-generation, sequencing (ngs) data. some important steps include ensuring uniform spatial and temporal sampling [ ] , sufficient time duration between consecutive sample collections for observing measurable evolution [ ] , coverage of deep sequencing, and consideration of genomic recombination [ ] . 
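one inexpensive check consistent with the steps above, namely whether serially sampled sequences carry enough temporal signal (measurable evolution between collection dates), is a root-to-tip regression in the spirit of tools such as tempest. the sketch below uses invented sampling dates and divergences purely for illustration; in practice the root-to-tip distances would be taken from a phylogeny reconstructed from the actual alignment.

```python
# Illustrative root-to-tip regression to gauge temporal signal in serially
# sampled sequences (in the spirit of TempEst-style checks). The sampling
# dates and root-to-tip divergences below are invented for illustration only;
# in practice they would come from a phylogeny built from the alignment.

import numpy as np

sampling_dates = np.array([2009.2, 2010.5, 2011.7, 2013.1, 2014.6, 2016.0])
root_to_tip_div = np.array([0.011, 0.014, 0.016, 0.020, 0.023, 0.027])

slope, intercept = np.polyfit(sampling_dates, root_to_tip_div, 1)
predicted = slope * sampling_dates + intercept
ss_res = np.sum((root_to_tip_div - predicted) ** 2)
ss_tot = np.sum((root_to_tip_div - root_to_tip_div.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"apparent substitution rate ~ {slope:.5f} substitutions/site/year")
print(f"implied date of the root   ~ {-intercept / slope:.1f}")
print(f"temporal signal (R^2)      = {r_squared:.2f}")
```

a near-zero or negative slope, or a very low r-squared, would suggest that the sampling window is too short (or too noisy) for reliable clock-based phylodynamic inference.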
the reliance on phylodynamic methods for estimating a pathogen's population-level characteristics (e.g., effective population size) and their relationships with epidemiological data suffers from a high costincreasing the number of inference models, and thus parameters associated with these models, requires an even greater increase in the information content, or phylogenetic resolution, of the sequence alignment and associated phenotypic data. low coverage [ ] and the presence of organism-or sequencing-mediated recombination [ ] , can skew estimates of the evolutionary rate and even impact the underlying tree topology, particularly when dealing with priors in the bayesian statistical framework commonly used for phylodynamic inference. programs such as splitstree [ ] can take as input a nucleotide alignment and output a network in which the dual origins of recombinant sequences are displayed in a phylogeneticlike context. however, network-reconstructing programs have difficulty distinguishing actual recombination events from phylogenetic uncertainty, and branch lengths do not usually reflect true evolutionary distances [ ] . despite much work ongoing in this area, there are currently no broadly applicable methods that are able to reconstruct phylogenetic network graphs that explicitly depict recombination and allow for phylodynamic inference. although the bayesian framework has shown to be fairly robust with the inclusion of recombinant sequences in large population studies [ ] , the inclusion threshold has not been thoroughly investigated and is likely dependent on a number of factors, such as sample size and sequence length. recombinant sequences are thus usually removed prior to analysis; however, the ability to incorporate recombinant sequences is imperative given our knowledge of the role of recombination in virus adaptation [ ] , for example. more details on methods that can potentially account for recombination, applicable to a variety of pathogens, are discussed by martin, lemey, & posada [ ] . while the traditional realm of phylogenetics has focused on rapidly evolving viruses, the development of whole-genome sequencing (wgs) has made possible the expansion of phylodynamic methods to the analysis of slower-evolving microorganisms, such as bacteria, fungi, and other cell-based pathogens. wgs has widened the range of measurably evolving pathogens, allowing for the identification of sparse, genetically variable sites, referred to as single nucleotide polymorphisms (snps), among populations sampled at different time points. the use of wgs in phylogenetics is highly beneficial not only in resolving relationships for slower-evolving organisms but also in reconstructing a more accurate evolutionary history (phylogeny) of an organism, rather than the genealogy (single gene), which can differ significantly from the phylogeny due to the presence of selective pressure or even genetic composition [ ] . however, as with phylodynamic analysis of rapidly evolving viruses, wgs analysis of cell-based pathogens comes with its own challenges, as discussed in detail elsewhere [ ] . implementation of phylodynamic and/or phylogeographic analysis has transitioned over the last two decades from maximum likelihood to the bayesian framework. this framework provides a more statistical approach for testing specific evolutionary hypotheses by considering the uncertainty in evolutionary and epidemiological parameter estimation. 
given surveillance data (e.g., the duration of infection) and the specification of an epidemiological mathematical model, bayesian phylogenetic reconstruction can also be used to estimate epidemiological parameters that might otherwise be difficult to quantify [ ] . for example, during the early stage of an epidemic, wherein the pathogen population is growing exponentially, the rate of exponential growth can be estimated from the phylogeny using a coalescent model that describes the waiting time for individual coalescent events of evolutionary lineages. this rate estimate can be combined with knowledge of the duration of infection for a particular pathogen to estimate the basic reproduction number, r (e.g., [ ] ), as well as the prevalence of infection and number of infected hosts. transmission dynamics can similarly be inferred following the early exponential growth of the pathogen, during which the pathogen has become endemic. estimation of these parameters is described more thoroughly in volz et al. [ ] . with the expansion of phylodynamic methods to global epidemics, theoretical studies have found that inferences of infection dynamics within the coalescent framework are limited by the assumption of a freely mixing population [ ] . this assumption is often violated with the inclusion of several isolated geographical areas with single or few pathogen introductions. without considering this factor, population structure within a phylogeny can severely bias inferences of the evolutionary history and associated epidemiological parameters [ , ] . to overcome this limitation, software packages such as beast (bayesian evolutionary analysis sampling trees) [ ] [ ] [ ] have recently developed algorithms that allow for the integration of coalescent, mathematical, and spatial diffusion models [ ] [ ] [ ] [ ] [ ] [ ] . more importantly, beast readily implements a comparative phylogenetic approach, which incorporates parameterization of phenotypic trait evolution to identify predictors of population dynamics and spatial spread, all of which are estimated/assessed simultaneously during reconstruction of the evolutionary history [ , ] . statistical evaluation of the risk factors for pathogen population growth and spread can be performed concurrently with the assessment of phylogenetic resolution within the data [ ] , discussed above as a challenge to complex phylodynamic analyses. for example, in the absence of strong phylogenetic resolution, bayesian statistics are more sensitive to long-branch attraction bias [ , ] , wherein rapidly evolving lineages appear to be closely related, regardless of their true evolutionary relationships. this phenomenon, therefore, influences inferences of spatiotemporal spread of the studied pathogen, as well as estimation of the relationship of pathogen population behavior with potential risk factors, such as climate change, host and/or vector distribution, accessibility and so on. the influence of low-resolution molecular data on the reliability of phylodynamic inferences highlights the importance of the implementation of the method described by vrancken et al. [ ] , or even a priori estimation of the phylogenetic and temporal resolution (sufficient time between sampling) [ , ] . unlike other phylogenetic frameworks, bayesian inference enables utilization of prior knowledge in the form of prior distributions (in combination with information provided by the data); however, abuse of prior knowledge is possible and can lead to incorrect conclusions. 
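the warning about prior knowledge can be made concrete with a toy conjugate example outside any particular phylogenetic package: with few observations the posterior for a rate-like parameter is pulled strongly toward the prior mean, and only with many observations do the data dominate. all numbers below are arbitrary and purely illustrative.

```python
# Toy normal-normal conjugate example (deliberately outside any phylogenetic
# package): how an informative prior dominates the posterior when data are
# sparse. All numbers are arbitrary and for illustration only.

def posterior_mean(prior_mean, prior_var, data_mean, data_var, n):
    """Posterior mean for a normal likelihood with known variance and a
    conjugate normal prior on the mean."""
    precision = 1.0 / prior_var + n / data_var
    return (prior_mean / prior_var + n * data_mean / data_var) / precision

prior_mean, prior_var = 1.0e-3, 1.0e-8   # strongly informative prior belief
data_mean, data_var = 4.0e-3, 1.0e-6     # what the (sparse) data suggest

for n in (2, 20, 2000):
    print(n, posterior_mean(prior_mean, prior_var, data_mean, data_var, n))
# With n = 2 the posterior sits near the prior (~1e-3); only with many more
# independent observations does it approach the data-driven value (~4e-3).
```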
even within the bayesian school of thought, scientists do not always agree with regard to the specification of prior distributions under certain conditions. the incorporation of prior information is, however, intuitively appealing, as it allows one to rationalize the probability of an estimate based on previous knowledge of the typical behavior of the parameter among populations of the organism under study. but what can we do if we have no knowledge regarding a particular organism or population? this has become a more pertinent issue recently with the increasing rate of discovery, facilitated by ngs, of organisms for which we have limited prior knowledge, such as novel viruses and bacteria, [ ] . one of the advantages of the bayesian phylodynamic approach is the ability to test multiple hypotheses regarding the evolution or epidemiological models used to describe infectious disease behavior, but because of the intricate relationship of these models, reliable inferences require testing of all combinations of the individual proposed models. although often neglected due to computational complexity, improved estimates of marginal likelihoods used for statistical model comparison have been demonstrated with less computational effort [ ] . additionally, if we know that we know nothing about the parameter in question, then, in fact, we know something. referred to as the "objective bayesian" approach, this ideal allows researchers to alter a normally "subjective" prior to create one that is minimally informative. this term is used because the impact of this type of prior on parameter estimation can be controlled to a minimum, allowing the data to dominate the analytical process and conclusions drawn [ ] . although similarly appealing, this approach can be particularly problematic with small datasets [ ] or biased datasets, such as the exclusion of potential intermediate sampling locations [ ] . the expanding volume of sequence data and increasing efforts to combine epidemiological and laboratory data in open access locations can help to improve evolutionary estimates. additionally, the growing availability of data and collaboration can accelerate our understanding of the emergence and spread of infectious diseases through coordinated efforts by multidisciplinary researchers across various institutions and public health organizations. more detail on the benefits of open access databases and data sharing in the context of phylogenetic epidemiology is reviewed in [ ] and [ ] . combining pathogen genetic data with host population information (e.g., population density and air traffic) in a statistical framework is critical for the reliable assessment of factors potentially associated with pathogen population dynamics and geographic spread. the comparative phylogenetic approach described above [ ] was used recently to identify potential determinants of the dengue virus (denv) introduction to and spread within brazil. results from nunes et al. [ ] suggested that for three denv serotypes, the establishment of new lineages in brazil had been occurring within to -year intervals since their primary introduction in , most likely from the caribbean. additionally, they observed that aerial transportation of humans and/or vector mosquitoes, rather than distances between geographical locations or mosquito (particularly aedes aegypti) infestation rates, were likely responsible. the study by nunes et al. 
marked one of the first uses of the comparative phylogenetic approach for vector-borne tropical diseases and implies the need for a similar approach in future studies aimed at investigating transmission patterns of a broad range of emerging vector-borne viruses. for example, this approach will allow researchers to determine if specific universal factors, such as vector species, are predictive of global transmission route or if health policy and prevention strategies tailored specifically to the pathogen, irrespective of the vector, are required for effective control. with the development of molecular clock models for serially sampled data [ ] , phylogenetic analyses have helped to uncover the timing of transmission events and epidemiological origins. moreover, when paired with comparative phylogeographic models, researchers have been able to identify risk factors most likely associated with these particular events. since the inception of the zika virus (zikv) pandemic around may of in brazil [ ] , phylogeneticists and epidemiologists have sought to reveal mechanisms by which zikv has spread and the factors fueling the wide geographical leaps. a full-genome phylogeographic analysis of zikv isolates collected during - revealed very intricate spatiotemporal transmission patterns across africa prior to the introduction into asia [ ] . from its origin in uganda, two independent transmission events appeared to play a role in the spread of zikv from east africa to the west circa : the first involved the introduction of zikv to côte d'ivoire with subsequent spread to senegal, and the second involved the spread of the virus from nigeria to west africa. results from spatiotemporal analysis demonstrated that uganda was the hub of the african epidemic as well as the common ancestor of the malaysian lineages sampled during the outbreak [ ] . following the emergence and rapid spread of zikv in brazil and other south american countries [ ] , faria's group sought to further characterize the spatiotemporal dynamics of zikv following introduction into this region [ ] . in addition to sequencing data, air traffic data for visitors to brazil from other countries associated with major social events during - were included to test different hypotheses of airline-mediated introduction of zikv in brazil. the results linked the origin of the brazilian epidemic to a single introduction of zikv estimated to occur between may and december , consistent with the confederations cup event, but predating the first reported cases in french polynesia. although these findings are of great value and importance to public health organizations, the authors drew an additional, and similarly valuable conclusion-large-scale patterns in human (and mosquito) mobility extending beyond air traffic data will provide more useful and testable hypotheses about disease emergence and spread than ad hoc hypotheses focused on specific events. this conclusion further supports the proposal for greater availability of epidemiological data among the scientific community. understanding both the rapid spread of the virus throughout south and central america and the caribbean as well as the initial emergence of the virus from the ugandan zika forest in the early s is important for application to the control of future outbreaks, but increasing data may not be the only answer. moreover, several different risk factors are likely responsible for these two migration events. 
therefore, a more comprehensive approach that allows for the analysis of multiple potential factors and their distinct contribution to independent migration events without the loss of information (i.e., use of data that span the entire evolutionary history) is imperative for fully understanding a global epidemic from beginning to present. a combined approach to understanding the emergence and expansion of an epidemiologically diverse viral population: hiv crf _ag in the congo river basin although viral spread is often attributed to human mobility [ ] , factors such as population growth and accessibility can also play an important role, as with the emergence of human immunodeficiency virus type (hiv- ) group m subtypes a and d in east africa [ ] and circulating recombinant form (crf) _ag in regions of the congo river basin (crb) [ ] . the democratic republic of congo (drc) has been reported to be the source of hiv- group m diversity [ ] [ ] [ ] ; however, the epidemiological heterogeneity of crf _ag within surrounding regions comprising the crb had remained a mystery since its discovery in [ ] , with prevalence ranging from virtual non-existence [ ] [ ] [ ] [ ] [ ] [ ] to accounting for as high as % of infections [ ] , depending on the geographical location. the region with the highest proportion of crf _ag infections, cameroon [ , ] , has been characterized by a rapidly growing infected population ( . % in to % in [ ] ), of which the majority ( %) is caused by this clade. using both molecular sequence data and unaids surveillance data [ ] , the spatiotemporal origin of crf _ag was estimated to occur in the drc in the early s ( ) ( ) ( ) ( ) , with the rapid viral population growth in cameroon following a chance exportation event out of drc. although similar phylodynamic techniques as described above for other viral species were used to infer the spatial origins of crf _ag, the timing of the origin of this viral clade was inferred using both coalescent analysis of molecular sequence data and prevalence information [ , ] . coalescent models allow for estimation of the effective population size (ne), of fundamental importance to infectious disease epidemiology, as it describes the level of genetic diversity within a population over the course of its evolutionary history. during the exponential growth period of an epidemic, the change in ne has been shown to linearly correlate with prevalence of infection [ , ] and can, therefore, be used to estimate the latter, as mentioned above, but also, when combined, faria et al. [ ] were able to show that fitting of ne and prior prevalence data can narrow the uncertainty of the temporal origin estimates by over % as compared to coalescent estimates alone. furthermore, surveillance data was recently used during simultaneous phylodynamic coalescent estimation to identify factors associated with ne dynamics throughout the entire evolutionary history of the cameroonian sequences [ ] , revealing that changes in ne were more reflective of incidence dynamics rather than prevalence, consistent with previous mathematical modeling [ , ] . although associations between ne and potentially related factors are frequently assessed, statistical analysis of these has until recently been primarily limited to post hoc examination (e.g., [ , ] ), which ignores uncertainty in demographic reconstruction, as discussed above. 
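as a toy illustration of the post hoc examination referred to above, the r sketch below simply regresses a hypothetical prevalence series on matching skyline-style ne estimates; the numbers are placeholders, and fitting the relationship this way is exactly the step that ignores uncertainty in the demographic reconstruction.

# hedged sketch of a post hoc ne-versus-prevalence association (not the cited analyses).
ne_estimates <- c(120, 180, 260, 410, 650, 900)     # placeholder skyline ne values
prevalence   <- c(300, 500, 700, 1100, 1800, 2400)  # placeholder surveillance counts

fit <- lm(prevalence ~ ne_estimates)   # linear association during exponential growth
summary(fit)$r.squared                 # strength of the correlation
coef(fit)                              # intercept and slope of the post hoc fit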
simultaneous implementation of evolutionary reconstruction and estimation of the relationship of covariate data with ne will be available in the newest version of beast v [ ] . although this tool has obvious implications for global assessment of factors contributing to the growth and dynamics of an epidemic, similar applications of this method to other data sets has suggested that reduced molecular data relative to covariate data may result in an impact of inclusion of the data on ne estimates. this finding posits a potential concern for convenience sequence sampling, as factors that are not responsible but are represented by large amounts of data may influence ne estimates, resulting in unreliable population dynamic inferences. as mentioned above, care is needed to ensure sufficient sampling and an appropriate sampling strategy for reliable reconstruction of the evolutionary and epidemiological history of the infectious organism of interest. traditional phylodynamic analysis applied to nosocomial outbreaks has been successfully used in the past to identify the likely source; however, the inclusion of extensive patient data, such as treatment regimens, admission and discharge dates, and length of stay, can improve not only phylogenetic estimates but also the translation of the interpretation to public health policy. epidemiological and genomic data on methicillin-resistant staphylococcus aureus (mrsa) infections were recently utilized by azarian and colleagues to reconstruct mrsa transmission and to estimate possible community and hospital acquisitions [ ] . findings from this study revealed that as high as % of the mrsa colonization within the hospital's neonatal intensive care unit (nicu) was acquired within the nicu itself. these findings indicated that current, standard prevention efforts were insufficient in preventing an outbreak, calling for the improvement of current care or alternative implementation strategies. the earlier uses of phylodynamic methods focused primarily on the molecular evolution of rapidly evolving viruses, greatly advancing the fields of virus vaccine and treatment strategies [ ] . on the other hand, epidemiological approaches have focused on influential factors related to social, economic, and behavioral patterns. integrating the phylodynamics and epidemiology approaches into a single analytical framework, referred to as evolutionary epidemiology [ , ] , represents one of the most powerful multi-disciplinary platforms. examples discussed herein of the adoption of an integrative and multifactorial mindset reveal the potential for accelerating our understanding of the emergence and spread of global infectious diseases, presently expanded to include bacterial and other cell-based pathogens. however, although a highly evolved analytical platform and an improved understanding of the translation of molecular evolutionary patterns to infection and transmission dynamics have aided in facilitating this transition, several challenges still remain. the st century has witnessed a major shift in breadth of scientific knowledge at the level of the individual researcher, requiring more focused training (e.g., molecular mechanisms) and greater collaborative efforts; meanwhile, a consensus of commonality and crossdisciplinary understanding is necessary for globalization of not only the economy, but also public health. 
this kind of understanding can be better achieved through interdisciplinary instruction on the theoretical and application skills related to both phylogenetics and epidemiology during early education. if successfully achieved, this combined training, in addition to access to modern ngs technology, such as handheld sequencers, would increase the mobility of labs and researchers, expanding the concept of lab-based research. mobilized labs would, in turn, reduce our current reliance on few major public health organizations and the impact of limited resources on sampling and surveillance in developing countries. increasing mobility is nevertheless inconsequential without the cooperative sharing of genomic and epidemiological information. although data are typically readily available to the public following peer-reviewed publication, the median review time of manuscripts submitted to, for example, nature is days [ ] , this in addition to the time required for thorough analysis of the original data. this timeline seems quite long in retrospect of the "spanish flu," which spread to one-third of the global population in a relatively brief -month period [ ] . data sharing prior to publication, even if only among a proportion of consenting institutions, may accelerate the process of dissemination of research findings to public health decision makers and practitioners, and its practice is not entirely unheard of. an excellent example of this type of collaboration is the "nextstrain" project (http://www.next strain.org/). nextstrain is a publicly available repository currently comprised of evolutionary datasets for ebola, zika, and avian and seasonal influenza viruses contributed by research groups from all over the world for the purpose of real-time tracking of viral epidemics. similar projects have also recently developed in other research fields. modeled after the stand up to cancer initiative, the synodos collaborative funded by the children's tumor foundation in partnership with sage bionetworks brings together a consortium of multidisciplinary researchers, who have agreed to the sharing of data and relevant information, as well as results [ ] . the ultimate goal of this cooperation is to accelerate the drug discovery process, which is highly applicable to global infectious disease research. without a similar collaborative approach to synodos, the preparedness of the global reaction to rising epidemics is at risk. recent years have been marked by local outbreaks across vast geographical regions within a timespan of months to years. hence, both the rapid dissemination of data and results and the rapid response of government and public health organizations are required for the effective prevention of a global epidemic, or pandemic. additionally, with the type of results, particularly risk factors, that are generated using this multifaceted approach (e.g., both human population and pathogen molecular characteristics), the question then arises of how organizations will actually utilize this information for treatment and prevention strategies. moreover, as the techniques and methods advance, are the infrastructures in place for global cooperation and immediate response following the presentation of a potentially more complex story? 
although gaps remain in current evolutionary modeling capabilities when used with epidemiological surveillance data, it is only a matter of time before the challenges described herein and elsewhere are met with more realistic models that capture the complexity of infectious disease transmission. furthermore, theoretical research in the field of infectious disease phylodynamics is still growing. consequently, there is a need for a review of the more recently developed methods and techniques and their performance, as well as their application in areas within and outside the realm of infectious disease. for example, in the era of global health, translational genomics, and personalized medicine, the accumulating availability of genetic and clinical data provides the unique opportunity to apply this approach to studies of, e.g., tumor metastasis and chronic infections, which comprise complex transmission dynamics among tissues and/or cell types, not unlike the geographical spread of infectious diseases. globalization and health understanding the development and perception of global health for more effective student education globalization of infectious diseases: the impact of migration the ecology of climate change and infectious diseases global climate change and emerging infectious diseases deciphering emerging zika and dengue viral epidemics: implications for global maternal-child health burden traditional and syndromic surveillance of infectious diseases and pathogens a structural approach to selection bias recall bias in epidemiologic studies response bias, social desirability and dissimulation resistance considerations in sequencing of antiretroviral therapy in low-middle income countries with currently available options surveillance for antimicrobial drug resistance in under-resourced countries new tools to understand transmission dynamics and prevent hiv infections the global one health paradigm: challenges and opportunities for tackling infectious diseases at the human, animal, and environment interface in low-resource settings charting a future for epidemiologic training the role of epidemiology in the era of molecular epidemiology and genomics: summary of the ajesponsored society of epidemiologic research symposium transforming epidemiology for st century medicine and public health the eco-in eco-epidemiology eco-epidemiology: thinking outside the black box viral phylodynamics overview of syndromic surveillance what is syndromic surveillance? mmwr unifying the epidemiological and evolutionary dynamics of pathogens phylogenetic and epidemic modeling of rapidly evolving infectious diseases evolutionary epidemiology: preparing for an age of genomic plenty zika virus in the americas: early epidemiological and genetic findings genomic analysis of the emergence, evolution, and spread of human respiratory rna viruses ebola virus epidemiology, transmission, and evolution during seven months in sierra leone eighteenth century yersinia pestis genomes reveal the long-term persistence of an historical plague focus the global spread of middle east respiratory syndrome: an analysis fusing traditional epidemiological tracing and molecular phylodynamics new york: d. 
appleton and company the effects of sampling strategy on the quality of reconstruction of viral population dynamics using bayesian skyline family coalescent methods: a simulation study exploring the temporal structure of heterochronous sequences using tempest (formerly path-o-gen) towards a new paradigm linking virus molecular evolution and pathogenesis: experimental design and phylodynamic inference the effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and bayesian inference analysing recombination in nucleotide sequences application of phylogenetic networks in evolutionary studies a comparison of phylogenetic network methods using computer simulation the early spread and epidemic ignition of hiv- in human populations why do rna viruses recombine? improving bayesian population dynamics inference: a coalescent-based model for multiple loci measurably evolving pathogens in the genomic era pandemic potential of a strain of influenza a (h n ): early findings the confounding effect of population structure on bayesian skyline plot inferences of demographic history bayesian phylogenetics with beauti and the beast . beast: bayesian evolutionary analysis by sampling trees beast : a software platform for bayesian evolutionary analysis bayesian phylogeography finds its roots counting labeled transitions in continuous-time markov models of evolution fast, accurate and simulation-free stochastic mapping phylodynamic inference for structured epidemiological models new routes to phylogeography: a bayesian structured coalescent approximation efficient bayesian inference under the structured coalescent simultaneously estimating evolutionary history and repeated traits phylogenetic signal: applications to viral and host phenotypic evolution generalized linear models for identifying predictors of the evolutionary diffusion of viruses long-branch attraction bias and inconsistency in bayesian phylogenetics on the distributions of bootstrap support and posterior distributions for a star tree iq-tree: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies how to measure and test phylogenetic signal virus discovery in the st century bayesian evolutionary model testing in the phylogenomics era: matching model complexity with computational efficiency the case for objective bayesian analysis analyzing small data sets using bayesian estimation: the case of posttraumatic stress symptoms following mechanical ventilation in burn survivors make data sharing routine to prepare for public health emergencies a code of conduct for data on epidemics unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza h n air travel is associated with intracontinental spread of dengue virus serotypes - in brazil bayesian molecular clock dating of species divergences in the genomics era zika: the origin and spread of a mosquito-borne virus molecular evolution of zika virus during its emergence in the th century population migration and the spread of types and human immunodeficiency viruses spatial phylodynamics of hiv- epidemic emergence in east africa phylodynamics of the hiv- crf _ ag clade in cameroon aids: prehistory of hiv- human immunodeficiency virus: phylogeny and the origin of hiv- unprecedented degree of human immunodeficiency virus type (hiv- ) group m genetic diversity in the democratic republic of congo suggests that the hiv- pandemic originated in central africa coding region of a new hiv type subtype 
a strain (hiv- ibng) from nigeria genetic diversity of hiv type in likasi, southeast of the democratic republic of congo genetic subtypes of hiv type in republic of congo hiv- subtypes and recombinants in the republic of congo increasing hiv type polymorphic diversity but no resistance to antiretroviral drugs in untreated patients from central african republic: a study increase of hiv- subtype a in central african republic highly divergent subtypes and new recombinant forms prevail in the hiv/aids epidemic in angola: new insights into the origins of the aids pandemic analysis of partial pol and env sequences indicates a high prevalence of hiv type recombinant strains circulating in gabon the prevalence of diverse hiv- strains was stable in cameroonian blood donors from two independent epidemics of hiv in maryland world health organization. who | epidemiological fact sheets on hiv and aids epidemic dynamics revealed in dengue evolution viral phylodynamics and the search for an 'effective number of infections' understanding past population dynamics: bayesian coalescent-based modeling with covariates a high-resolution genetic signature of demographic and spatial expansion in epizootic rabies virus genomic epidemiology of methicillin-resistant staphylococcus aureus in a neonatal intensive care unit does it take too long to publish research? influenza: the mother of all pandemics children's tumor foundation announces historic new initiative in neurofibromatosis research
key: cord- - jxt diq authors: batarseh, feras a.; ghassib, iya; chong, deri (sondor); su, po-hsuan title: preventive healthcare policies in the us: solutions for disease management using big data analytics date: - - journal: j big data doi: . /s - - - sha: doc_id: cord_uid: jxt diq data-driven healthcare policy discussions are gaining traction after the covid- outbreak and ahead of the us presidential elections. the us has a hybrid healthcare structure; it is a system that does not provide universal coverage, albeit few years ago enacted a mandate (affordable care act-aca) that provides coverage for the majority of americans. the us has the highest health expenditure per capita of all western and developed countries; however, most americans don't tap into the benefits of preventive healthcare. it is estimated that only % of americans undergo routine preventive screenings. on a national level, very few states ( out of the ) have above-average preventive healthcare metrics. in literature, many studies focus on the cure of diseases (research areas such as drug discovery and disease prediction); whilst a minority have examined data-driven preventive measures—a matter that americans and policy makers ought to place at the forefront of national issues.
in this work, we present solutions for preventive practices and policies through machine learning (ml) methods. ml is morally neutral; it depends on the data that train the models. in this work, we make the case that big data is an imperative paradigm for healthcare. we examine disparities in clinical data for us patients by developing correlation and imputation methods for data completeness. non-conventional patterns are identified. the data lifecycle followed is methodical and deliberate; + clinical, demographical, and laboratory variables are collected from the centers for disease control and prevention (cdc). multiple statistical models are deployed (pearson correlations, cramer's v, mice, and anova). other unsupervised ml models are also examined (k-modes and k-prototypes for clustering). through the results presented in the paper, pointers to preventive chronic disease tests are presented, and the models are tested and evaluated.
the us has the highest expenditures (illustrated in fig. , us healthcare expenditures vs. other oecd countries [ ] ). preventive measures are economically critical. they are very clear pointers to the quality of service at hospitals and clinics, as well as the general health of citizens across any country. our work's main objective (hypothesis) is two-tier: through one of the largest and most representative national health datasets for population-based surveillance, data imputations and machine learning models (such as clustering) offer preventive care pointers by grouping patients into heterogeneous clusters, and providing data-driven predictions and policies for healthcare in the us. in our work, preventive care is measured and affected through three main parameters: (1) immunizations, (2) access to healthcare providers, and (3) chronic disease prevention, as indicated in america's health rankings [ ] . we explain these three important parameters in detail in subsequent sections. immunization policies directly influence the health of a state. different states in the us adopt different levels of immunization enforcement. immunizations prevent the occurrence or spread of vaccine preventable diseases (vpds). for example, in the th century, the annual morbidity of smallpox was , . due to vaccines, in the year that number dropped to . measles dropped from an annual morbidity rate of , to ; and rubella from , to [ , ] . all vpds had a reduction of more than % due to immunizations [ , ] . therefore, the case for immunizations has been considerably studied. states with low vaccine rates (such as missouri, indiana, alaska, and mississippi) hold campaigns to change public belief about immunizations. due to the lack of public trust in vaccines and pharmaceutical companies, immunization rates have been declining. figure shows how non-medical exemptions (nmes) are on the rise in many states across the country (between years and ). the study declared the following conclusions [ ] : "a social movement of public health vaccine opposition has been growing in the us in recent years; subsequently, measles outbreaks have also increased. since , the number of philosophical-belief vaccine nmes has risen in of the states that currently allow this policy; namely: arkansas (ar), arizona (az), idaho (id), maine (me), minnesota (mn), north dakota (nd), ohio (oh), oklahoma (ok), oregon (or), pennsylvania (pa), texas (tx), and utah (ut)".
(figure: nmes trends across the us [ ] .) the centers for disease control and prevention (cdc) reported on those states, and presented multiple cases to help increase public trust in immunizations: "we hope this report is a reminder to healthcare professionals to make a strong vaccine recommendation to their patients at every visit and make sure parents understand how important it is for their children to get all their recommended vaccinations on time" [ , ] . immunization rates vary among states, and lag behind the department of health and human services' healthy people targets. it is yet to be determined how states will react to the recent outbreak of covid- and its vaccination when that is made available. the second factor that affects preventable healthcare is access to healthcare providers. it has a high correlation with poverty rates, education levels, and job rates (i.e., it can be considered an economic attribute). socioeconomic factors are heavily implicated in low or poorer management of disease and preclude patients from complying with seeking early screening and preventive care [ ] . from a healthcare policy perspective, limited access is due to factors such as the lack of physicians in many areas of the country. in the us, in , the ratio of physician-to-patient was . -to- , which is below the oecd-recommended average of . -to- . additionally, oecd market research predictions indicate that by the year , , more primary care physicians are needed to meet demand, especially due to the aca's provisions to increase that number [ ] . while the us average for doctors active in patient care per , people is . (in ), there is a wide variance across states: massachusetts ranks the highest with . active doctors per , people (the aca is a reincarnation of the massachusetts healthcare system). mississippi has only . doctors per , people [ , ] . in this manuscript, we present methods that focus on the third (and most critical) variable in preventable healthcare: chronic disease prevention and prediction (cdpp) (fig. ) . most chronic diseases are linked to risk factors or indicators such as obesity, smoking, and inflammatory conditions. as a result, sub-clinical changes are potentially occurring in subjects that can be detected prior to clinical manifestations. in addition to indispensable clinical tests and screening, a programmable bio-nano-chip (pbnc) system, for example, could be used to enable the detection of proteins and molecules known as biomarkers that facilitate 'early' diagnosis for possible disease prevention. this technology is generally referred to as a lab on a chip (l-o-c); one example is the blood glucose monitoring device. thus, evaluation of changes occurring prior to disease appearance (i.e., preventive evaluation) is likely to lead to better management of the disease and is considered more cost-effective. when physicians are presented with such projections, they can augment their recommendations to patients. patterns can be drawn between patients' clinical and laboratory data, and passed to predictive and intelligent models for proactive actions. these attributes are collected in this study using cdc data, and pointers to enforcing cdpp using ml are presented. chronic diseases encompass diabetes mellitus (type ii), hypertension, kidney diseases, obesity, inflammatory diseases, and cancer, among others. nearly half ( %) of americans have at least one chronic condition [ ] . direct medical costs for chronic conditions are > $ billion annually.
further, by , cd cases will increase by %, costing $ . trillion in treatment. (figure: the three pillars of preventive healthcare [ ] .) choosing the right combination of diagnostic tests requires understanding the disease and how to manage it [ ] . the list of tests provided through the aca is designed to help eradicate chronic diseases [ ] . out of the four possible healthcare models for a country (the beveridge, bismarck, out of pocket, and national insurance) [ ] , a model such as the aca provides guarantees for the pursuit of preventive healthcare. however, in recent years, the aca has been deployed with high costs, the program has low adoption rates, and it still suffers from public rejection. within the aca, all marketplace plans must cover the following list of preventive services without charging the citizen a co-payment; those are presented in appendix (verbatim from healthcare.gov) [ ] . data on the listed biomarkers are collected by cdc. one of the major sources of such cdc data is their national health and nutrition examination survey (nhanes) [ ] (used in this work). nhanes is a program that is concerned with population-based surveillance of health and disease via surveying; that refers to patients' reported outcomes, collected laboratory and clinical data, as well as demographic and socioeconomic statuses of the us population. datasets are released in -year cycles. the program is part of the national center for health statistics, one of the thirteen us government statistical agencies. policy improvements through cdc big data and the ml methods applied make a strong case for enforcing cdpp best practices in the clinic and on a national scale. our study provides completeness to the collection of biomarkers through learning from clinical historical trends. our ml models can improve preventive practices at clinics, save doctors and patients time, reduce the cost of unnecessary clinical tests, and provide improved diagnostics to incoming patients by essentially categorizing them into health and disease groups using k-clustering. in the world of silicon chips, moore's law is still alive and well [ ] . the law suggested the doubling of speed and capability of computer chips every months, which still holds true. interestingly though, the same logical construct has been applied to other areas. eroom's law (reverse of moore) suggested that the cost of developing new drugs has doubled every nine years since [ ] . that alludes to more drugs being used for cure, but less towards prevention. this section presents one of the largest us public health datasets that can aid in preventive practices (i.e., nhanes), and presents a history of ml for healthcare. in , the bush administration presented the food and drug administration amendments act [ ] . the act reaffirmed the requirement for federal investigators to release partial information about clinical trials. before the act, government agencies had no incentive to collect clinical data, or to release comprehensive datasets to the public for research and policy validations. later on, when the obama administration enacted the open data initiative [ ] , all government agencies became obliged to share their public data through online repositories. the cdc (through nhanes) has been performing health surveys since . they cover hundreds of health attributes for a patient at a single point in time [ ] . clinical trials and surveying are critical sources of information for such medical innovations. time series data (longitudinal) are important for ml models.
predictions, forecasts, and other means of pattern recognition are strictly enforced by a time dimension in the data. nhanes data are cross-sectional (i.e. collected at a single point in time); thus a follow up on patients is required for ml trends and forecasts [ ] . completeness of health data is required to allow for comprehensive examinations and to provide early recommendations to preventive parameters. for example, an increase in blood pressure through time can point to cases of hypertension in some patients. in this work, we examine the nature of nhanes data, allocate imbalance and incompleteness, and solve that using correlations, clustering, and imputation methods. since the early inception of artificial intelligence (ai), healthcare has been at the forefront of applications. one of the first forms of ai is expert systems. knowledge-based systems (kbs or expert systems) are intelligent systems that encapsulate the knowledge of a skillful person. kbs are a special kind of an intelligent system that makes extensive use of knowledge. kbs were first introduced in the s during the process of collecting medical knowledge from healthcare practitioners. kbs are different from conventional software systems or data analytical systems because they use heuristic rather than algorithmic approaches for decision making [ ] . the original idea of a general problem solver (gps), that later turned into the idea of building a medical expert system, used generic search techniques aided by heuristic knowledge to solve medical problems [ ] . the gps idea was instrumental in the development of mycin; a system that diagnosed blood and hemoglobin disorders. mycin is a landmark medical kbs developed at stanford university (known to be the first expert system). proforma was later developed as a generic model for building clinical and medical expert systems (due to the rise in demand for such systems) [ ] . data re-kindled the promise of ai and fueled the algorithms that struggled to provide learning and intelligence due to the lack of abundant datasets [ ] . that led to the transformation of the famous adage "data is the new oil" to "data is the new blood". in , at stanford university (the developer of mycin), healthcare and ml are hand-in-hand at the center for clinical excellence (cerc). in their recent publication [ ] , they point to multiple endeavors at their centers to deploy ai/ml in healthcare. additionally, they point to the rising trend of ai in the healthcare domain, and how it is expected to grow (percentage-wise and -fold) by year (fig. ). many medical research labs and clinics are starting to deploy ml models within their practices, albeit still considered scarce [ , ] . before presenting our models, it is very important to point to a very critical and dangerous phenomenon in healthcare research and deployment of data-driven methods: the replication crisis; something we avoid in our study by using open data, and providing our models, code, and datasets to anyone interested in replicating our study. data for scientific research need to be reliable. "reliable" in this context means the following: -unbiased to one geographical location, -unbiased to one or few demographics, -unbiased to certain arguments for a policy by candidates on the political left or right -data modeling needs to avoid issues such as overfitting/underfitting or imbalance. 
although the majority of clinical studies are dependent on data collected at one clinic, hospital, or university, there has been major pushback on medical studies and the lack of ability to regenerate the results with a different set of patients, a pre-conceived clinical setup, a variety of geographical locations, inherited biases in the analysis, or hyper-parameter tuning that is specific to the results of one study; all those practices lead to an inability to replicate a study [ ] [ ] [ ] . nhanes is an open and democratized dataset that represents national health and disease. it is also claimed to be verified by multiple practitioners throughout the years. like other big data repositories, albeit being incomplete, nhanes was used for our models. we allocated biases and inconsistencies, solved these issues, and then built our models. the remainder of the manuscript is organized as follows: the next section presents the cdc big data management processes using sql, bias removal, and data wrangling. the "results: preventive healthcare through data analytics" section presents the ml models (clustering, correlations, and imputations), and their results (result # -result # ). lastly, the "discussion: conclusions, policies, and future plans" section demonstrates conclusions and implications on preventive policies. data are collected through the cdc's nhanes web-pages. files get extracted into a sas viewer, and then exported into a sql database management studio. all the tables had one primary key: seqn, which represents a distinct sequence id for every patient. the database has tables, a total of + data columns, and tens of thousands of data points (detailed later in this section). no other similar relational database that includes the majority of the cdc survey data exists in literature. most years within the nhanes surveys have missing data. to its credit, however, the cdc published a report declaring the reasons, counts, and summaries of missing data from their surveys. in the report, the following reasons are mentioned (verbatim): "the answers are classified as unusable; the respondent does not have the information to answer a particular item or refuses to answer a specific question or undergo a particular test; laboratory equipment fails; test results are faulty; specimens are lost in shipping; or some items of information fail to be recorded on the examination record". reasons, as such, were only documented after . with data collected previous to that year, it is unclear what pitfalls exist (the electronic collection survey began in ). nhanes' history of data collection is broken into five phases. in this study, we investigate data from to whatever is available up to , in an effort to cover the last years of surveying. it is worthy to mention that every batch of nhanes has a different set of patients (and consequently a different set of patient ids). as one can assume, the sample citizen set which nhanes presents should not be biased or imbalanced towards any demographical group (we challenge that in upcoming sections). nhanes claims that they sample the data in accordance with percentages by the us census bureau; they consider race, age, income, and gender as a basis to stratify the population. in their guides, nhanes claims the following: "race and hispanic-origin information used for sampling is based on census population estimates and obtained from the household screener to determine eligibility for inclusion in the survey" [ ] .
one can argue that stratification is not of relevance to a patient's medical status, and that in order to cover a wider part of the variety in the clinical spectrum, stratifying needs to be performed based on medical variables. most importantly however, the data are not found to be longitudinal; it is therefore not possible to completely follow a patient's progress over time. patients are presented in id form, but multiple variables on their demographics, habits, medical history, clinical tests, and other important attributes are recorded. we collected data from all six groups (demographics, dietary, examination, laboratory, questionnaire, and limited access). out of these six groups, we traced preventive care categories; those are drawn largely from the questionnaire files. the mentioned diagnostic categories include + variables, all of which are considered in the ml models (clustering and imputations). prior to developing the models, a significant data wrangling effort is deployed; that process is presented next. after data collection and transformation, all tables have been merged into one database view within the dbo schema; analysis has been performed on the unified table. for example, merging cholesterol levels data with smoking, cardiovascular health, and diabetes is performed with a sql query that inner-joins the tables on seqn (a sketch of such a query is given at the end of this passage). when the data are merged, several observations are possible. these observations span the entirety of the nhanes data. observations (presented as result # of this manuscript) on the merged, cleaned, and wrangled data are: ( ) data categories did not have the same counts, which could lead to data imbalance and a lack of comprehensive experimentation in terms of preventive parameters. when merged on patient ids, the data size reduces drastically, due to the minimal overlap between different categories (for clustering, we ended up clustering more than the resulting ~ patients; instead, we used the imputed data as well). ( ) female vs. male distributions are fairly unbiased. however, when it comes to race, some races, such as hispanics, are not well represented in the survey (fig. ) . studies show that hispanics, along with african americans, have higher levels of diabetes than other races [ ] . therefore, for preventive parameters of diabetes, data for more races ought to be collected or created. the nhanes data collection process was rather costly. therefore, as an attempt to substitute for the need to acquire expensive utilities needed to collect clinical parameters, data imputation methods were applied to compensate for missing clinical data. these methods are applied in the "multivariate imputation by chained equation" section below. table summarizes all the data categories' counts across the nhanes survey database (presented as result # ). after cleaning, wrangling and merging the data, the method presented in the next section could be deployed to solve any nhanes data imbalance issue. the results introduced in the "results: preventive healthcare through data analytics" section are presented through multiple dimensions such as dental diseases and smoking (smq). the next section presents the experimental methods used for imputations of preventive biomarkers (in accordance with what was proposed by the aca), and the ml models deployed through the database. this section introduces the models and the results for data imputations, as well as the clustering model for patients included in the cdc surveys.
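a sketch of the kind of seqn-keyed inner join described above is shown below (issued from r through dbi); it is not the authors' original listing, and the connection details, schema, table names, and column names are assumptions standing in for the actual nhanes component files.

library(DBI)

# assumed connection to the sql server database that holds the nhanes tables.
con <- dbConnect(odbc::odbc(), dsn = "nhanes_db")

# inner-join cholesterol, smoking, cardiovascular, and diabetes tables on seqn;
# only patients present in all four components survive the join.
merged <- dbGetQuery(con, "
  SELECT chol.seqn, chol.total_cholesterol, smq.smoking_status,
         cdq.cardio_symptoms, diq.diabetes_diagnosis
  FROM dbo.cholesterol AS chol
  INNER JOIN dbo.smoking       AS smq ON smq.seqn = chol.seqn
  INNER JOIN dbo.cardio_health AS cdq ON cdq.seqn = chol.seqn
  INNER JOIN dbo.diabetes      AS diq ON diq.seqn = chol.seqn
")

nrow(merged)   # the row count shrinks to the overlap across components
dbDisconnect(con)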
the next two sub-sections include the following: (1) imputations using correlations (pearson and others) for data creation of variables that can point to preventable diseases, and (2) a multivariate clustering.
the data wrangling process is referred to as % of the effort in a data science lifecycle. the most challenging aspect in data wrangling is data incompleteness. incompleteness leads to data imbalance, and therefore to biased outputs from models. incomplete data often makes the case for a garbage-in garbage-out (gigo) situation; therefore, imputation is deployed. the most commonplace methods for imputation are presented below. none of the four mentioned methods are useful in the case of cdpp; the reasoning is presented in parentheses:
1. providing means, modes, and medians of a column to replace missing data (averages of patients do not mean anything; patients respond to diseases in unique fashions). it is noteworthy to state that since nhanes data are cross-sectional, stratification is not possible; such imputations are applied when monitoring patients longitudinally.
2. providing means, modes, and medians of data points surrounding the missing data, usually a few on each side, or more (same reason as in #1; closer patients in order are not closer health-wise).
3. imputing from one or few highly correlated columns (this works in small datasets; the number of columns that are used in this study is ~ +, which makes it impossible to find one or few columns to be used for correlations; instead, all columns ought to be used).
4. deleting rows with missing data (deleting rows/patients will lead to bias problems, and so we avoid this option).
instead, we introduce a different process for data imputation of cdpp variables. the method used includes six main steps:
1. stream data from the sql database into the r environment, split data into testing and learning data, and use learning data for all steps except step # .
2. measure correlations between every column and all other columns to find the highest correlated columns for every cdpp variable.
3. apply pearson correlations for numerical values, analysis of variance (anova) for columns that have numerical and categorical values, and cramer's v correlation coefficients for data that are categorical:
i. cramer's v coefficient is used when both x and y are nominal. results are decimal values between 0 and 1. the formula for cramer's v is v = sqrt(χ² / (n × (k − 1))), where v denotes cramer's v, χ² is the pearson chi-square statistic, n is the sample size involved in the test, and k is the lesser number of categories of either variable.
ii. anova is used when x is numeric and y is nominal (or vice versa). the result is a decimal value between 0 and 1 [ ] .
4. the mice model then uses columns that have a correlation > . .
5. run through all columns of missing cdpp data and impute data points based on the highest correlated columns and mice.
6. validate the created cdpp data versus actual data (using the testing datasets from step # ).
sample code in the spirit of the mice algorithm (used in steps # and # ) is sketched further below.
this six-step process is applied to providing missing longitudinal smoking data as an example. smoking is known to be one of the most common risk factors for many chronic diseases; therefore, completeness in this data is critical to providing health recommendations. the following successful results are produced: ( ) the imputations error rate is . , which is deemed to be very low (result # ). ( ) the process created similar statistical distributions of predicted data and actual data. ( ) top correlations for disease and smoking data are identified (fig. ), with low error rates. such variables aid medical teams in identifying healthcare parameters for preventive healthcare amongst a subgroup of the population taking the nhanes survey. this breaks down smq into specific practices and can aid in making a quick change. for example, cigarette filter type seems to have a high effect on the patient's smoking numbers (correlation = %). cigarette length and tar content are two other high-effect variables (result # ). ( ) the listed measures (survey questions) are the preventive pointers for smokers, in order to avoid certain smoking-relevant diseases. the six-step imputations process for cd data can provide more detailed pointers to preventive sub-variables (such as the cigarette filter example). another example presented is for missing clinical periodontal measures used to estimate the prevalence of periodontitis. periodontitis is a host-inflammatory oral disease characterized by lengthy exposure to pathogens. for periodontitis, imputations created similar distributions of groups (actual data vs. predictions) amongst periodontitis predictions. see fig. for a visual comparison. periodontitis of mild, moderate or severe form affects over million americans today, i.e. above % of the us population aged - [ ] . this disease is heavily diagnosed and monitored via clinical parameters; thus, in cases where there are limited resources or where such measurement is dubbed costly or inaccessible, imputations can potentially serve to primarily identify patients requiring further screening, preventive or interventional measures. periodontitis can be prevented and managed; therefore, a data-driven approach can present a state-of-the-art ml method to apply to a population, i.e. a large scale of individuals. here, this was implemented on missing nhanes periodontal data from onward. when comparing actual vs. predicted results of imputations for periodontitis, and comparing the highest correlated variables between both sets, many variables ended up having similar correlations, such as: ohdpdsts dentition status, ohxcjid dentition status, and oxhxpcm dentition status. figure shows the top variables that have the highest correlation. the dental variables' filters used in the study are illustrated in fig. (showing the criteria for data collection of periodontal data in nhanes). only the complete and partial dental tests are included; any data that are not done or missing are not considered in the imputations or correlations test. refer to corr order # for instance; it has filters and . corr# in actual data is the highest correlation variable ( . ) and corr# is the second highest ( . ), and so on. similarly, in predicted values, corr# is equal to . , and corr# is equal to . (result # ). the goals desired from the mentioned periodontitis example are to present initial pointers to the disease, and to point to factors that would allow for preventive dental parameters (providing focused dental cleaning and cures). the six-step imputations model presented in this section provides completeness and balance to the datasets; however, to fit patients into a group of similar patients, a clustering model is required, which is presented next. the k-modes clustering algorithm is an extension of the infamous k-means clustering model. instead of distances, the k-modes model is based on dissimilarities (that is, quantification of the total mismatches between two data points: the smaller this number, the more similar the two objects).
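before detailing the k-prototypes procedure, the mice step referenced in the six-step imputation process above can be sketched as follows; the data frame, target column, and correlation cutoff are assumptions (the study's own listing and threshold are not reproduced in the text), and the cramer's v helper is shown only to illustrate how the categorical screening in step 3 could be computed.

library(mice)

# cramer's v for two categorical vectors: v = sqrt(chi2 / (n * (k - 1)))
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (n * (k - 1))))
}

# cdpp_data is an assumed data frame of nhanes columns with missing values;
# the target and cutoff below are placeholders, not the study's values.
target <- "smq_cigarettes_per_day"
cutoff <- 0.6

# screen numeric candidate predictors with pearson correlations
# (categorical-categorical pairs would use cramers_v() instead).
numeric_cols <- setdiff(names(cdpp_data)[sapply(cdpp_data, is.numeric)], target)
cors <- sapply(numeric_cols, function(v)
  abs(cor(cdpp_data[[target]], cdpp_data[[v]], use = "pairwise.complete.obs")))
predictors <- names(cors)[cors > cutoff]

# impute the target using only its highest-correlated predictors, then extract
# one completed copy for validation against the held-out test split.
imp <- mice(cdpp_data[, c(target, predictors)], m = 5, method = "pmm", seed = 1)
completed <- complete(imp, 1)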
k-prototype (k is the number of clusters) [ ] is used for clustering numerical and categorical values. it is a simple combination of k-means and k-modes. k-prototype has the following steps (a hedged sketch of this procedure is given after the validation discussion below):
1. select k initial prototypes from the dataset x.
2. choose the number of clusters (fig. , the elbow diagram for choosing the number of clusters (k = and k = peaks), illustrates recommendations for k).
3. allocate each data point in x to the cluster whose prototype is the nearest.
4. retest the similarity of objects against the current prototypes. if the algorithm finds that an object is allocated such that it is nearest to another cluster prototype, it updates the prototypes of the clusters.
5. repeat step # until no object changes its cluster (after fully testing x).
a variety of cluster descriptions could be pulled from the model developed. the model includes variables; we only used columns that had more than % values, and less than % nulls. columns with null values could lead to model skew. depending on the application deployed, the clusters could be defined depending on the purpose. for example, they can be defined based on smoking habits and blood pressure, but they will be categorized differently if they are defined by chronic diseases; seven cds are collected in our results (a total of cases). the best k-model has eleven clusters (based on the elbow method and clinical heterogeneity). for instance, it is worth mentioning that cluster is the 'healthy cluster'. cluster has patients who don't have high levels of cholesterol ( mmol/l or higher). clusters and are the ones with the highest probability of oral health disease (i.e. periodontitis); a + sample of patients had periodontitis. as one can notice, clusters , , and are the least healthy, while , , and are the healthiest ones. figure presents the distribution of clustering results by periodontitis projections. more importantly, table presents the counts of cd patients within every cluster (result # ). most chronic pro-inflammatory conditions have common risk factors, such as smoking and co-existence with other systemic diseases. in the case of periodontitis, diabetes mellitus ii is a chief risk factor for exacerbation of the disease. both diseases were claimed to even have a bi-directional or two-way relationship. periodontitis is the th most common condition in diabetic patients. susceptibility to periodontitis is increased by around threefold in diabetic patients [ ] . evidence suggests that managing or preventing one could aid in alleviating the other condition. thus, treating or preventing periodontitis may improve the status of other chronic conditions. by belonging to a health group, a primary care physician can get instant expectations on a patient's health, and which preventive tests a patient might need. the multiple factors of a cd, for most physicians, can sometimes be rather time-consuming to collect; and so our ml model aids in providing comprehensive consideration of hundreds of variables. we aim to experiment with this model at small-scale facilities first, such as at a university healthcare facility or a clinic. further clustering data results and code are available for researchers upon request from the authors. validating unsupervised models is an intricate task. for the clustering model, we applied three main means of evaluation: ( ) relative clustering validation: evaluates the clustering model by varying different parameter values for the same algorithm.
in this study, we evaluated a different number of clusters (k), however, as part of future work, we would like to test including/excluding other healthcare variables and observe the changes to the model outputs. ( ) external clustering validation: compares the results of a cluster analysis to an externally known result, such as externally provided nhanes clusters (which are not provided by nhanes). since we know the best cluster number in advance (k = or k = ), this approach is mainly used for selecting the right clustering algorithm. . internal clustering validation: uses the internal workings of the clustering process without reference to external knowledge. this type aims to minimize the distance between data points in the same cluster and maximize ones in different clusters. different diseases require different measures, and therefore, a breakdown of cd patients within every cluster is presented in table (result # ) . as noted in the table, if a patient ends up in cluster # for instance, it is safe to assume that they have health parameters that are very similar to a group of other patients with coronary disease, or asthma. additionally, if a patient belongs to clusters # or # , then their health parameters are similar to a group of healthy patients. the next section presents conclusions and implications on healthcare policy. due to patient-privacy rules and legal concerns, accessible data in healthcare are scarce. survey datasets from cdc/nhanes constitute an example datasets that could be used to provide pointers to the current general health of americans. the surveys have been continuously used to draw causal inferences from health attributes, patient's information, and diseases. nhanes datasets are complex, highly context-dependent, inherently heterogeneous, and multi-dimensional. just like any half-cleaned dataset, data wrangling methods need to be applied before developing models that aid in decision making [ ] . besides manually evaluating patient records and health indicators by physicians, ml models presented in this manuscript allow for a proactive evaluation of patients' status. ml models aim to augment physicians' tasks, encourage preventive actions towards diseases, and provide pointers to healthcare policies that work (or others that do not)-a challenging task that requires systematic data democratization [ ] . the methods presented in this paper provide successful imputations, correlations, and grouping of patients; as suggested in the hypothesis. we present seven analytical results. ten years of nhanes data are collected and multiple insights are presented, for example: demographical data have ~ , patients (after joining with healthcare variables)-however, demographical categories are not balanced, for instance: the majority of surveyed patient's household income is $ , and more. in any case, nhanes data have enough health indicators to further impute preventive measures. as we ran multiple imputation tests, dependable data are generated through a -step process with a very low error rate: . . imputations are deployed based on high correlations ( % and higher) of variables relevant to preventive measures (based on the aca benchmark). some presented columns had very high correlations (up to: . %). afterwards, data are clustered. eleven ( ) clusters served the highest quality and distribution of patients amongst heterogeneous groups. smoking and periodontitis preventive measures are clearly stated. 
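to make the clustering and validation steps above concrete, a minimal r sketch is given here; it assumes the clustMixType and cluster packages and a hypothetical merged/imputed data frame of mixed numeric and categorical nhanes variables (nhanes_mixed), and it is an illustration rather than the study's exact code.

library(clustMixType)   # kproto() for mixed numeric/categorical data
library(cluster)        # daisy() and silhouette() for internal validation

# nhanes_mixed is an assumed data frame: numeric labs plus factor-coded
# questionnaire answers (categorical columns must be factors for kproto()).

# relative validation: scan candidate k values and inspect the elbow in the
# total within-cluster cost (numeric distance + weighted categorical mismatches).
costs <- sapply(2:15, function(k) kproto(nhanes_mixed, k = k)$tot.withinss)
plot(2:15, costs, type = "b", xlab = "k", ylab = "total within-cluster cost")

# fit the model at the k suggested by the elbow (eleven clusters in the text).
fit <- kproto(nhanes_mixed, k = 11)
table(fit$cluster)   # cluster sizes; cross-tabulate with disease flags as needed

# internal validation: average silhouette width on a gower dissimilarity,
# which handles numeric and categorical columns together.
gower_d <- daisy(nhanes_mixed, metric = "gower")
sil <- silhouette(fit$cluster, gower_d)
summary(sil)$avg.width   # closer to 1 means better-separated clusters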
other cds (such as diabetes and asthma) were very prevalent amongst clusters , and . when a patient falls into a cluster, the physician can have an initial pointer to the health of the patient (in terms of what preventive tests to undertake) [ ] . the overarching goal is to identify health and disease parameters for the us population, elevate the health of the american public, effectively reduce office time for patients' data collection, and provide completed, wrangled, and unbiased survey datasets for further clinical analysis. the aca mandate encourages citizens to benefit from a list of preventive parameters (appendix ). as our work indicates, three factors (immunization, access, and chronic disease) affect the efforts of national preventive care. changes to policies in american states through intelligent recommendations could be driven by imputation and clustering models. five example policy making cases are presented next. healthcare fraud is on the rise, recommendations to patients by clinics or pharmaceutical companies are frequently not supported by explainable scientific proof; rather, merely through experience or bureaucratic governmental approvals. intelligent recommendations and predictions could provide detailed categorizations and validations to patients (i.e. precision medicine). almost every year in the us, there is a case of statewide healthcare fraud. many times, the story behind such activities is the overuse of tests and medical examinations, 'fake' lab results, or 'fake' (placebo) medicine. recommendations by fraudulent physicians/clinics/labs are usually not supported by proof, rather, only by opinion. combining medical knowledge and expertise with an interactive data system can provide intelligent recommendations and further validation to patients. accordingly, policies for the validation of clinical recommendations ought to be enforced. another case where ml-driven methods could be very beneficial to the health of a state is when they are applied across what is referred to as medical deserts. in the us, there are wide areas in the heartland where there is very low access to healthcare (few doctors or clinics)-an automated recommendation system can help in some cases, and cover some of these areas. multiple clinical and subclinical tests are implemented to acquire comprehensive knowledge to managing diseases and preventing them or their progression. patients' stratification and classification could aid in intelligently choosing appropriate tests; thus, minimizing the need for unnecessary access, and lowering health costs for patients who live in such areas. methods presented in this manuscript provide pointers to preventive policies that benefit the general good of americans. for example, pharmaceutical industries benefit from the lack of preventive practices, simply because that will allow them to sell more medication to patients that don't undergo preventive care. additionally, food and agricultural industries are able to push products to consumers that could cause chronic diseases (such as hypertension and diabetes). diet for example, has major effects on health variables; however, medical institutions and policy makers are involved with cure but rarely with prevention. our study evaluates the biomarkers concerned with diet and points to attributes that could lead to cds. some clinical tests are expensive for patients and are time consuming for hospitals. ml methods can help in providing cheaper solutions. 
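one way such savings could be operationalized is sketched below: a plain lookup from a patient's cluster label to a preventive screening panel, so that low-risk clusters receive a minimal panel and less healthy clusters a fuller one. the cluster numbers and test lists here are invented placeholders; in practice they would come from clinicians reviewing the actual cluster profiles, not from the code.

```python
# hypothetical mapping from cluster label to a preventive screening panel;
# the labels and panels below are illustrative, not derived from the study's clusters
PANEL_BY_CLUSTER = {
    0: ["blood pressure check"],                             # a "healthy"-type cluster
    1: ["HbA1c", "lipid panel", "periodontal exam"],         # a diabetes/periodontitis-type cluster
    2: ["spirometry", "smoking-cessation counselling"],      # a respiratory/smoking-type cluster
}
DEFAULT_PANEL = ["blood pressure check", "lipid panel"]      # fallback for unmapped clusters

def recommend_tests(cluster_label):
    """Return the preventive tests suggested for a patient's cluster."""
    return PANEL_BY_CLUSTER.get(cluster_label, DEFAULT_PANEL)

for patient_id, cluster in [("patient-a", 0), ("patient-b", 1), ("patient-c", 7)]:
    print(patient_id, "->", recommend_tests(cluster))
```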
the models allow doctors to recommend preventive measures to patients based on what group they belong to instead of performing all measures to avoid being sued (less tests might be recommended if the patient belongs to cluster for instance). cluster had only patients with a cd; cluster had , and cluster had . on the other hand, cluster (the least healthy cluster) had cd patients, and cluster had patients. therefore, in a clinical setup, a patient that belongs to cluster will require less testing than a patient belonging to clusters or . this will provide a mechanism to avoid over-recommending, or underrecommending clinical tests. over-recommending tests is a case of defensive medicine, and as established prior, this and other practices have been on the forefront of reasons to the increasing expenditures within the us healthcare system. our ml methods can also aid in immunization recommendations and e-access to healthcare. for example, cdc can provide ml models for patients to self-evaluate, and quickly get predicted metrics to their immunization needs (with a certain statistical confidence-error rate in this study is . , and r-squared for all tested cases is higher than %). in the future, we aim to apply further experimentation on our system, such as: . run models per us state: as different states have different rules and regulations, we aim to re-train models per state, and maybe also per county and city. . we aim to collect more cdc data variables to provide more correlations and further tests for imputations, and compare with other nhanes predictive models for specific diseases such as periodontitis [ ] . . k = and k = models are to be tested on the same dataset to further evaluate their usability for medical purposes. . we aim to develop a user interface for clinicians to allow for interaction with ml models for decision making support. the recommendations through our models are not a replacement for visiting a primary care physician or models like the andersen model [ ] ; however, they are tools that could keep the citizens informed about their own health and would allow them to take preventive parameters (on time) if needed. when such big data analytics are applied to a state or to the national scale, they can provide recommendations to policies based on the aggregated health of citizens in a state. that will eventually promote the health of citizens across the country, something that is currently at the forefront of american policy making. best care at lower cost: the path to continuously learning health care in america america's health rankings vaccination mandates: the public health imperative and individual rights the causal effect of health insurance on utilization and outcomes in adults: a systematic review of us studies prevention (cdc). . health insurance and access to care . the u.s. health care system: an international perspective the state of the antivaccine movement in the united states: a focused examination of nonmedical exemptions in states and counties socioeconomic position indicators and periodontitis: examining the evidence projecting us primary care physician workforce needs: - association of american medical colleges, state physician workforce data book how to improve patient outcomes for chronic diseases and comorbidities. 
a report by the health catalyst computational technology for effective health care physicians for a national health program cdc data surveys: national health and nutrition examination survey (nhanes) cramming more components onto integrated circuits diagnosing the decline in pharmaceutical r&d efficiency food and drug administration (fda) open government initiative. open data update on nhanes dietary data: focus on collection, release, analytical considerations, and uses to inform public policy a doctoral dissertation published at the florida state university library services the engineering of knowledge-based systems, theory and practice assessing the quality of service using big data analytics-with application to healthcare machine learning in healthcare informatics stanford medicine health trends report. med.stanf ord healthcare analytics for quality and performance improvement measuring the prevalence of questionable research practices with incentives for truth telling facts are more important than novelty: replication in the education sciences problems and challenges in patient information retrieval: a descriptive study american diabetes association (ada) periodontitis in us adults: national health and nutrition examination survey - periodontitis and diabetes: a twoway relationship making the case for artificial intelligence at government: guidelines to transforming federal software systems data democracy: at the nexus of artificial intelligence, software development, and knowledge engineering evidence-informed health policy: using ebp to transform policy in nursing and healthcare development and validation of a predictive model for periodontitis using nhanes - data revisiting the behavioral model and access to medical care: does it matter author's contributions fb wrote the manuscript, developed data methods, developed the ml proof of concept, lead the data analytics lifecycle, tested the results, and validated the workflow. ig identified nhanes datasets, evaluated data imputations, analyzed the clustering results, designed and developed medical and dental insights, and edited the manuscript. dc collected the data, organized it in a database, performed data wrangling exercises, created the data taxonomy, developed clustering models for testing in r, and produced scoring results. ps developed r scripts for clustering and imputations, performed data wrangling exercises, optimized the mice models; created imputations' results, and deployed the ml workflow. all authors read and approved the final manuscript. diabetes hypertension periodontitis not applicable (this work received no specific grant from any funding agency, commercial or not-for-profit sectors). the datasets generated and/or analyzed during the current study are available in the cdc/nhanes repository, https ://www.cdc.gov/nchs/nhane s/index .htm all the datasets, the r code, and the sql database are available to interested readers (via a github repository) upon request from the authors. the authors declare that they have no competing interests. key: cord- - phjvxat authors: galván‐casas, c.; catalá, a.; carretero hernández, g.; garcia‐doval, i. title: sars‐cov‐ infection: the same virus can cause different cutaneous manifestations: reply from authors date: - - journal: br j dermatol doi: . /bjd. sha: doc_id: cord_uid: phjvxat dr drago et al. are right to point out that our paper did not provide data on enanthems( , ). 
as the data collection form did not include the description of mucous membranes, they might have not been explored in many patients. we have reported and included in the supplementary material a few cases that were noticed by their doctors and were the first descriptions of enanthem in covid- . given the low number of cases and their non-systematic acquisition, we avoided any analysis of these data. we disagree with the suggestion of collecting data for a study and later asking for consent. all the included patients gave informed consent before incorporating their data in the study. we feel that the limitations of this strategy are already highlighted in the discussion of the paper and do not seriously bias the results of the study. regarding serologic testing, we agree that it could have been useful, but it was not available at the time that the study was done, and diagnosis was made using polymerase chain reaction. we also agree on the discussed hypotheses about coinfection or reactivation and the need for further studies to test them. sars-cov- infection: the same virus can cause different cutaneous manifestations classification of the cutaneous manifestations of covid- : a rapid prospective nationwide consensus study in spain with cases key: cord- -xm qvhn authors: tarkoma, sasu; alghnam, suliman; howell, michael d. title: fighting pandemics with digital epidemiology date: - - journal: eclinicalmedicine doi: . /j.eclinm. . sha: doc_id: cord_uid: xm qvhn nan digital epidemiologists conduct traditional epidemiological studies and health-related research using new data sources and digital methods from data collection to analysis [ , ] . according to salathé, digital epidemiology is epidemiology building on digital data and tools, but a narrower definition defines it as epidemiology building on data generated and obtained with a primary goal other than conducting epidemiological studies [ ] . digital epidemiology provides insights into health and disease determinants in human populations by building on diverse digital data sources. infectious diseases already account for > % of digital epidemiology studies [ ] , but in the current crisis the rapid understanding of disease spread, risk factors, and intervention impact at the population scale has never been more important to mitigate health and economic consequences. digital epidemiology can contribute to global health security through syndromic surveillance, public health surveillance, and early pandemic detection. there is, however, a need for open, accessible data and improved computing capability to bridge gaps in producing knowledge and making knowledge-based decisions. in addition to the availability and accessibility of data and digital tools in pandemic responses, it is important to improve health equity by promoting access to connectivity and digital health literacy skills and to consider the interaction between digital health technologies and society, culture, and the economy.
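as a small illustration of the syndromic-surveillance component mentioned above, the sketch below flags days whose syndrome counts rise several standard deviations above a short moving baseline. it is a deliberately simplified aberration-detection rule, not the algorithm of any particular surveillance system, and the counts, window length, and threshold are made-up parameters.

```python
import numpy as np

def flag_aberrations(daily_counts, baseline_days=7, threshold=3.0):
    """Flag the indices of days whose count exceeds the mean of the preceding
    window by `threshold` standard deviations (a simplified aberration rule)."""
    counts = np.asarray(daily_counts, dtype=float)
    alarms = []
    for t in range(baseline_days, len(counts)):
        window = counts[t - baseline_days:t]
        mu, sigma = window.mean(), window.std(ddof=1)
        sigma = max(sigma, 1.0)              # guard against a perfectly flat baseline
        if (counts[t] - mu) / sigma > threshold:
            alarms.append(t)
    return alarms

# synthetic daily counts of, say, influenza-like-illness visits with a late spike
counts = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15, 41, 55]
print(flag_aberrations(counts))              # indices of the spiking days
```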
digital systems have a potentially critical role in the early pandemic detection. the program for monitoring emerging diseases (promed), the global public health intelligence network (gphin), and healthmap have pioneered digital approaches to epidemic intelligence. epidemic intelligence from governmental and public health agency sources is complemented with other data sources such as from mobile phones, call centers, sensors, social media, and search engines. the who alert and response operations unit reported that > % of initial disease outbreak reports originate from informal sources [ ] . over a decade ago, google flu trends pioneered the use of large-scale technological information for public health, first using search queries to track influenza-like illnesses [ ] . however, meaningful limitations in predictive capability remain, maintaining model calibration over time is challenging, and some of the underlying data may not be publicly available [ ] . more recently, other aggregated data for applications such as mobility pattern analysis have found use in public health [ ] . twitter has also become a frequently employed, publicly available data source [ ] . digital epidemiology and digital tools have had a profound role in understanding and mitigating the covid- pandemic through analysis of diverse digital data sources such as smartphone, health register, and environmental monitoring data. aggregate and anonymized smartphone data have been extensively used to study the pandemic and support decision-making; however, the pandemic's global scale requires coordinated regional, national, and global efforts in sharing, combining, and privacy-protecting data [ ] . the who has published interim guidance for public health programs and governments regarding the ethical and appropriate use of digital proximity tracking for covid- contact tracing. the covid- exposure notifications api from apple and google shows how opt-in privacy-preserving digital contact tracing can be deployed ubiquitously, while the covid symptom study shows how a smartphone application can be used to study covid- symptoms and predict disease hotspots À days in advance [ ] . mit's open source private kit:safe paths is an example of an anonymous contact tracing platform that supports hotspot detection [ ] . while the digital epidemiology toolkit is evolving rapidly, significant challenges must be addressed pertaining to data privacy, availability, and analysis. a lot of useful data are private and require data protection. state-of-the-art privacy solutions include data aggregation, data anonymization, differential privacy (dp), federated learning, and synthetic data generation techniques. advances that might help in this regard include dp-based methods and decentralized data processing [ ] . privacy-preserving and decentralized machine learning (ml) have also become active research areas that can contribute to a paradigm shift in digital epidemiology. for example, federated learning is an ml technique that enables distributed model creation across multiple decentralized devices that store data locally [ ] . synthetic data generation is also a promising paradigm that can offer a high level of privacy; however, there is an inherent tradeoff between data privacy and utility [ ] . privacy-protected data disclosure remains an active research topic. a vital requirement for any data-driven system is the validation of both the input data and the generated models. 
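of the privacy techniques listed above, the laplace mechanism behind epsilon-differential privacy is compact enough to show directly: calibrated noise is added to aggregate counts before they are released. the epsilon value, the sensitivity of one (each person contributes to at most one cell), and the mobility-style counts below are assumptions made only for this sketch.

```python
import numpy as np

def laplace_release(counts, epsilon=1.0, sensitivity=1.0, seed=0):
    """Release counting-query results with Laplace noise of scale sensitivity/epsilon,
    the standard mechanism for epsilon-differential privacy on counts."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    noisy = counts + rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)
    return np.clip(np.round(noisy), 0, None)   # rounding/clipping is post-processing, so the guarantee holds

# e.g. devices observed moving between a few region pairs on one day (made-up numbers)
true_counts = [1520, 87, 3, 240]
print(laplace_release(true_counts, epsilon=0.5))
```

a smaller epsilon buys stronger privacy at the price of noisier releases, which is exactly the privacy-utility tradeoff noted above.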
purely data-driven approaches have limited predictive capability and their results may be difficult to interpret in different contexts [ , ] . however, the combination of data-driven models and domain-specific knowledge would appear to be a promising research avenue, as is ensuring that digital epidemiology and ml promote algorithmic fairness and avoid worsening health disparities. new legislative measures could also support efficient, ethical, and privacy-preserving combinations of data sets and sources. for example, the eu member states have developed a technology and data use toolbox to combat and exit from the covid- crisis [ ] . the new health data rules in the us within the st century cures act are another example of how third parties can work more easily with health data. within the eu and nordic countries, finland has pioneered a one-stop shop for secondary use of health and social data with a new act and a new permit authority (findata) for sharing health-related datasets. combining aggregate and privacy-protected diverse data sources such as mobility, health, environmental, and city data is expected to help understand and mitigate the consequences of pandemics. the digital epidemiology toolkit is likely to be supported by advances in ml, privacy-enhancing technologies, data/ model validation and explainability, and national and transnational policy measures. increasing data availability and access combined with advances in open source data processing and analysis pave the way for scalable digital epidemiology supporting world health security. this article is published as part of g riyadh global digital health summit ( À august ) activities. saudi arabia hosted this virtual summit to leverage the role of digital health in the fight against current and future pandemics. digital epidemiology: what is it, and where is it going digital epidemiology: use of digital data collected for non-epidemiological purposes in epidemiological studies epidemic intelligence -systematic event detection is google trends a reliable tool for digital epidemiology? insights from different clinical settings mobile phone data for informing public health actions across the covid- pandemic life cycle rapid implementation of mobile technology for real-time epidemiology of covid- apps gone rogue: maintaining personal privacy in an epidemic communication-efficient learning of deep networks from decentralized data digital epidemiology and global health security; an interdisciplinary conversation recommendation on a common union toolbox for the use of technology and data to combat and exit from the covid- crisis, in particular concerning mobile applications and the use of anonymised mobility data all authors wrote the manuscript and reviewed and approved the final version of the paper. none. key: cord- -u vihq u authors: allam, zaheer title: the rise of machine intelligence in the covid- pandemic and its impact on health policy date: - - journal: surveying the covid- pandemic and its implications doi: . /b - - - - . - sha: doc_id: cord_uid: u vihq u the use of advanced technologies, especially predictive computing in the health sector, is on the rise in this era, and they have successfully transformed the sector with quality insights, better decision-making, and quality policies. 
even though notable benefits have been achieved through the uptake of the technologies, adoption is still slow, as most of them are still new, hence facing some hurdles in their applications especially in national and international policy levels. but the recent case of covid- outbreak has given an opportunity to showcase that these technologies, especially artificial intelligence (ai), have the capacity to produce accurate, real-time, and reliable predictions on issues as serious as pandemic outbreak. a case in point is how companies such as bluedot and metabiota managed to correctly predict the spread route of the virus days before such events happened and officially announced by the world health organization. in this chapter, an increase in the use of ai-based technologies to detect infectious diseases is underlined and how such uses have led to early detections of infectious diseases. nevertheless, there is evidence that there is need to enhance data sharing activities, especially by rethinking how to improve the efficiency of data protocols. the chapter further proposes the need for enhanced use of technologies and data sharing to ensure that future outbreaks are detected even earlier, thus accelerating early preventive measures. the fourth industrial revolution has brought numerous transformations in the world, catalyzed by the advent and rapid adoption of digital tools. one notable disruption that this revolution has brought is the advent of novel computing methodologies and technologies, which are seen to be transforming all spheres of the global economies. these are observed to be the basis for the unquestionable improvements and quality risk assessments that each sector within the global fabric is experiencing. on this note, one sector that has truly been transformed is the medical sphere, which now prides itself of enriched databases following novel application of different technologies such as artificial intelligence (ai), machine learning, natural language processing, big data, and internet of things (iot) and others. with the increased data, and the technology, those working in the sector have access to advanced predictive modeling and computing tools that have transformed areas such as personal medicine and epidemiology, medical operations, diagnosis, and drug manufacturing . in addition, as noted by watson ( ) , the technologies are helping the medical fraternity to draw variable predictions by comparing historical and present medical data. for instance, it is now possible to use such predictions to assess the healthcare labor force and subsequently use the results to initiate recruitment processes in areas that are most deserving in a meritocratic way. even if ai tools are observed to sometimes extend biases, those are being addressed, a hiring authority can manage to effect the concept of inclusivity and streamline some lingering workforce challenges among other things. while these computing technologies and tools have already given a glimpse of how they can transform the health sector, it is worth appreciating that most of them are still a "work-in-progress" in the medical field and hence need to continue evolving and streamlined. therefore, in that regard, it is understandable that some teething problems and challenges are inevitable, but such need to be addressed with time. 
additionally, it has been found that some medical professionals and stakeholders in this field are yet to fully embrace the computing tools as part of the advancement in the medical realm; hence, they continue to rely on human-based interpretation, including in life-threatening situations. and, from records, such decisions sometimes tend to be time-consuming and are at times, far from being correct. therefore, to ensure that such scenarios are minimized, there is a need for frameworks to guide the usage of those technologies so that they can be widely accepted and ultimately lead to saving more lives. in particular, the issues of data collection, storage, management, and sharing require to be urgently addressed, as it is seen as the primary source of the apprehension that some in the health community have against them. on this, for a start, the challenge of standardization of protocols needs to be sorted as this hosts issues such as limiting the scope of data, is associated with incompatibility of devices and networks, and exposes the field to extra costs to name a few. addressing such challenges would be key, especially in a period of emergencies like now, when the entire world is hurting from the impacts of covid- pandemic. for instance, despite the challenges raised earlier, some startup companies were able to use the available data from social media, airline ticketing, and medical institutions to identify that the world is experiencing a new virus outbreak days before those in medical fraternity had made similar findings (gaille, ) . with these technologies, it also took less time to identify the outbreak, unlike in other previous outbreaks like in the case of the severe acute respiratory syndrome (sars) outbreak in that took relatively months to identify (qiu et al., ) . in the case of covid- , initially known as -ncov, it only took days (who, b) . these breakthroughs in the medical field, therefore, need to be encouraged, and one way of doing this is the streamlining all available obstacles. in support of this, this chapter surveys how ai processes, aided by availability of data managed to allow for early detection of the coronavirus outbreak, and through the findings, showcase that enhanced data sharing protocols hold the key to improved future urban health policies. the first official confirmation of novel coronavirus (now known as the covid- ) was made public by the world health organization (who) on january , , after its officials based in china received reports from the chinese health official of a new type of infectious virus (who, a). but from records, before this actual confirmation, there were reports that some people had started to show signs like those of the virus as early as december , , in wuhan. of these, six had presented themselves to the hospital in where they were treated and later discharged. but, the cases of similar symptoms continued, and this raised an alarm among health officials who embarked on a fact-finding mission to establish whether they were dealing with a commonly known virus outbreak or a new type altogether. it was after this that, on december , , chinese officials liaised with the who officials to establish that this was truly a new strait of coronavirus. this prompted concerted efforts that lead to the official confirmation of the virus on january , , which later became widespread worldwide (fig. . ) and classified as a pandemic. 
in view of this historical perceptive, it is true that there was a time lag between when the first symptoms were reported and when the confirmation was done (who, b); thus, since it is now clearly known how the virus is transmitted (from person-to-person), there must have been a sizable number of people contracted the virus. this prompted chinese health officials to place the entire region of wuhan city under a total lockdown as from january , , to prevent further spread (li et al., ) . but, unfortunately, noting that this is a busy city, people from other regions, and countries, as was later established by bluedot, who may have contracted the virus had already traveled back to their countries, and this opened the door for further spread across regions and finally into the breadth and length of the world. in respect to the actual origin of the virus, for now, only theories have been advanced. but, reports to date (retamal, ) support that the first victims of this virus contracted it from the huanan seafood wholesale market in wuhan city. and as noted earlier, being a new virus, the tests were initially being conducted only in china, specifically in wuhan, with the health officials suspecting it to be sars virus, but this was ruled out on january , . this raised alarms, and when it was officially identified and provisionally named " -ncov" on january , , and the data subsequently shared to public, an australian virus identification laboratory based at the peter doherty institute for infection and immunity immediately embarked on its research and by january , , it was able to clone the virus (nature, ). but this is not the only institution that took the virus outbreak seriously. according to niiler ( ) , bluedot, whose profile is shared in the following, was able to employ the services of aidriven algorithms, to analyze data gathered from sources such as new reports, air ticketing, and animal disease outbreaks to predict that the world is facing a new type of virus outbreak. besides the prediction of the new virus outbreak, this startup and another called metabiota (both profiles shared in the following) were able to predict (independently) correctly some of the areas that would experience the virus spread next. among regions and countries predicted by each of these startups that turned to be true include japan, taiwan, south korea, singapore, thailand, and hong kong (heilweil, ) . such predictions came days earlier before any of the said country reported their first case. the information from these different quarters became instrumental in combating the virus. it was through the spread prediction mapped by bluedot and metabiota that the rest of the world and concerned institutions and agencies came to learn that the world is confronting a highly infectious virus that was spreading at alarming rates. on the same, after successfully cloning the virus, the virus identification laboratory shared the data in an open database where authorized researchers and labs can access and conduct further research on cures and vaccines (nature, ). all these efforts prove that with technologies, it is now possible to confront pandemics of global magnitude. but such drive needs to be backed by concerted efforts aimed at eliminating data sharing obstacles associated with different advanced computing technologies and tools. the current case of covid- is not the most devastating nor the only virus that the world has had to struggle with. 
indeed, looking at the historical record, there have been more contagious, devastating, and widespread pandemics before. in the early th century, it is documented that a deadly plague dubbed the black death (bubonic plague) struck the world and killed approximately million people in a span of years (duncan and scott, ) . fast forward to , when another deadly pandemic, the spanish flu (h n influenza virus), struck. this is a type of influenza that is believed to have originated in Étaples, france, and it went on to infect over million people, killing around million of these globally (martini et al., ) . between and , another type of influenza (a subtype h n ), also known as asian flu, broke out in china, and by the time it was contained, it had claimed the lives of . million people. ten years later ( ), the world suffered another outbreak, this time influenza a (h n ), first reported in hong kong, killing over million people. later on, in , the swine flu (h n influenza virus) killed , people in the united states alone (cdc, ). before this, in , there was an outbreak of sars in china that killed people (song et al., ) . there was also ebola (zaire ebola virus), first reported in the democratic republic of congo, that claimed approximately , people, followed by the zika virus in that infected approximately , people, killing people. in , the world is confronting the coronavirus, which has spread to over countries, and the end to it is not predictable at the time of writing of this chapter. in all the examples cited earlier, the common denominator is that the success of containing any of these viruses depends on detection and identification. that said, it is worth noting that these pandemics were caused by different types of viruses. these include the influenza virus, henipavirus (nipah virus), filoviruses like those responsible for ebola, and flavivirus that is responsible for zika (aris-brosou et al., ; madhav et al., ) to name a few. regarding their detection, it depends on the type of technology used; hence, since the emergence of the digital revolution, things have been changing in respect to the amount of time taken for detection. however, here too, other factors such as the availability of data, its quality, and the methods for sharing it are critical. for instance, despite having some levels of modern technology, it took approximately months to identify the sars virus. such delays, however, are attributed to the decision of chinese health officials to withhold information concerning the virus outbreak. in cases where there were concerted efforts between different players, as in the case of the ebola outbreak in west africa, it is reported that the virus was identified in record time, and this prevented its spread beyond the three countries (liberia, sierra leone, and guinea) in which it was first reported (wojda et al., ) . in the current case of covid- , as noted earlier, it took only days to detect and identify the virus and to also predict how it would spread from the original epicenter (wuhan). this was possible due to the availability of technologies such as ai (bini, ) , machine learning, and natural language processing, which the aforementioned startups were able to use to gather and analyze the data. in particular, the advancement in ai-based infectious disease-surveillance algorithms is understood to reduce the amount of time taken to detect a virus outbreak.
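to give a flavour of the kind of calculation that sits behind such spread predictions, the sketch below ranks destinations by a naive importation-risk score: outbound traveller volume from the outbreak city multiplied by an assumed prevalence among travellers. the passenger volumes and the prevalence figure are invented for illustration; operational systems such as those described in the next section blend many more signals (news reports, climate, population density, and so on).

```python
# naive importation-risk ranking: expected infected travellers per destination.
# traveller volumes and prevalence are illustrative assumptions, not real data.
monthly_travellers_from_outbreak_city = {
    "bangkok": 100_000,
    "hong kong": 70_000,
    "tokyo": 60_000,
    "singapore": 45_000,
    "seoul": 40_000,
}
assumed_prevalence_among_travellers = 1e-4   # fraction of outbound travellers infected

risk = {
    city: volume * assumed_prevalence_among_travellers
    for city, volume in monthly_travellers_from_outbreak_city.items()
}
for city, expected_cases in sorted(risk.items(), key=lambda item: item[1], reverse=True):
    print(f"{city:10s} expected imported cases ~ {expected_cases:.1f}")
```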
it is evident that since the emergence of the ai-based surveillance, there is a notable level of improvement and efficiency. this is particularly important noting that technological advancements in other sectors such as transport have made movement of people relatively cheaper, quicker, and comfortable; thus, importation of virus and diseases from regions of high concentration to those with little or no virus or disease has become relatively high. this is the reality with the covid- , which was first imported from china and then later from some european countries such as italy to the rest of the world. in this regard, it is true that there are ongoing works and discussion aimed at revising existing policies to ensure the loopholes that have existed, thus allowing that spread of diseases and other outbreaks to nonendemic regions have been sealed. but, with the current happening, it is true that much effort is still needed. the amount of emerging computing literature on infectious diseases demonstrates that substantial research, supported by development of ai-based algorithms, has been increasing exponentially supporting an incline in use of ai technologies involved in diseases and virus surveillance. the increased use of ai-based tools to monitor and survey outbreaks in different regions, through a forward step toward early prevention, needs to be complimented by the availability of substantial data. therefore, as has been stamped in this chapter, it is paramount to have a framework that clearly outlines how specific data need to be shared with the public. in particular, this would help to overcome challenges of insufficient data that are instigated by the act of withholding information by some entities or countries surfing on private interest. on this, a positive step toward its actualization was made in by the who after the zika virus outbreak where through unfettered sharing of data, different agencies and stakeholders were able to utilize advanced technologies to prevent the spread of the virus. and, as noted earlier, such efforts were fruitful in that, unlike other previous virus outbreaks, this had the least number of casualties ( ). henceforth, the use of technologies is seen to be gaining traction with use of analytical tools such as ai algorithms becoming popular as it allows for data scouring from diverse targets (lau et al., ) and it is also compatible with other technologies such as machine learning and natural language processing. such technologies, as noted earlier, are what allowed bluedot and metabiota to obtain the correct predictions they made about the outbreak and spread of covid- to different regions. the use of these modern tools is also hailed for they have the potential to lead to quick diagnosis, help in development of vaccines and cures of outbreaks, and also would prompt development of raft of preventive strategies in areas that would be predicted to be of high risk of experience an outbreak (martins, ) . this is what the two aforementioned companies, whose description is given in the next section, achieved in the current case of covid- . this section highlights some briefs on how bluedot and metabiota were able to utilize modern computing technologies to accurately, and in record time predict coronavirus outbreak, and the target countries that were at risk of experiencing the outbreak. bluedot is a web-based startup that was pioneered in by dr. kamran khan after the sars outbreak. 
initially, it was known by the trade name of bio-diaspora, but in , it seeded round with a sri lankan private venture (horizons ventures) prompting its renaming to bluedot. the startup came into limelight in when the h n influenza pandemic broke, where it was able to correctly predict that global pathway of the virus by relying on worldwide air travel data. it cemented its authority in the use of modern computing technologies in , where it developed risk assessment models that allowed it to predict the spread of ebola virus outbreak that struck three west africa countries (allen, ). in the current predicament of covid- pandemic, bluedot was among the first to predict ( days before official announcement) that the world was experiencing a new outbreak and also correctly identified countries that were at high risk of being next target of the outbreak (bowles, ). the answer to the success of this startup in making correct predictions lies on their reliance on modern, advanced, computing technologies and availability of data from different spheres. in respect to technologies, the company is observed to heavily rely on ai-based tools, machine learning technology, and natural language processing technologies. using different models and algorithms, the company managed to scour valuable data from different sources such as diverse, global news outlets, global airline ticketing data (heaven, ) , population density data, global infectious disease alert, climate report, and insect vectors and animal diseases reservoirs. in its website ( b), it is clearly noted that it relies on over , official and media sources drawn from over languages each day. it also queries reputable databases such as world factbook and national statistics reports from different regions. with the available technologies, the company is able to employ filters on information from the different sources to narrow the results to the issue at hand (blue-dot, a). on the same, the technologies also allow the use of modern clustering tools that allow it to quickly, and in real time, identify areas or regions with the potential to become hot spots, cold spots, and/or spatial outliers. it also relies on the power of machine learning to train its system using the assorted dataset, and in turn, the systems are able to generate real-time and regular alerts on issues of interest to the company's clients. it is through this that it was able to flag out coronavirus as an outbreak that had potential to spread to other regions quickly. the history of metabiota takes us back to when it was initiated. during those early days, its main engagements were in research focusing on how human and animal health were linked, especially in the african context. in , when the ebola virus broke in west africa, the company was already active, and through its work attracted the attention of the us government, which at the time was actively involved in combating this outbreak (rossi, ) . having experience on the african context, metabiota was requested to assist, and it did a remarkable job, but after the ebola situation was contained, the us government withdrew the funding to the company. the reduction of the funds took a toll on the company, hence prompting a paradigm shift, which entailed the company expanding its operation scope to enable it to serve more clients. in this regard, its target market was insurance companies, who would benefit from information concerning disease outbreaks. 
henceforth, the company embarked on enriching its disease database, which today is among the most comprehensive ones (rossi, ) . to achieve this, the company embarked on investing and utilizing advanced computation and predictive technologies, and such included ai, machine learning, big data, and natural language processing (nlp) algorithms. through this, the san franciscoebased company serves a wide range of clients including government agencies, insurance companies, contractors, diverse noneprofitmaking organizations, ngos, and others that, in one way or the other, depends on information of infectious diseases outbreaks to enhance their decision-making. with these technologies, it has become among the leading startups in rendering predictions about infectious diseases outbreaks, spread, interventions, and event severity (heaven, ) . it uses nlp algorithms to scour data from diverse sources (both official and unofficial sources). from its website (metaboita, ) , it sources range from biological, political, socioeconomic, environmental, and social media among others. the data gathered from these are analyzed and categorized using reputable analytical and visualization technologies into clusters such as frequencies, severity, and time (duration of outbreaks), and these are shared with its clients depending on information being sought. in the recent case of covid- , metabiota was in the forefront to analyze the outbreak, and during the analysis of the data, some even sourced from social media, the company was able to predict which neighboring countries were at high risk of being the next target of the virus spread, more so because the panic in wuhan had stated to trigger some fear, forcing people to flee. by relying on ai, machine learning, and nlp, the company analyzed human predictive behaviors and scare levels, thus managing to correctly make the predictions a week earlier before any of the said countries (japan, thailand, hong kong, and others) had reported any case of the virus (tong, ) . when it comes to pandemics, one sure way of protecting the masses and averting related negative impacts on the social fabric, the economy and human lives to name a few are providing early detection. today where there are enough digital tools and technologies with capacity to allow for real-time data collection, fast and comprehensive computation, and prediction, early detection ought to be emphasized. but even with these technologies, any lapses, especially in data sharing, are bound to delay the detection and identification of the outbreak and that can prove to be fatal. for instance, in , when sars (sars-cov) broke in guangdong local market in china, health officials and chinese authorities withheld information on the outbreak, and this prompted the identification of the virus to drag for around months. while the fatalities from this virus outbreak were only reported across countries where the virus had spread, such could have been avoided. by the time the virus was contained, it had already spread to countries. in a totally different case, as reported earlier, due to collaborative measures taken in when ebola virus broke in west africa, the virus was identified in a reasonable time, and this prevented further spread of the virus beyond liberia, guinea, and sierra leone. though this virus is very infectious and tends to have high casualties, it leads to the unfortunate death of , people. 
if the outbreak here was to take the same route that sars, it could have been disastrous and would have spread to numerous countries. in the present case of coronavirus (c vid- ) pandemic that originated in wuhan city, china, the response was totally different from the sars incident. this time round, the chinese authorities were quicker and forthright in their reporting, and also in sharing subsequent information and data. nevertheless, some quarters continue to accuse the chinese authorities for the global spread of the virus. but while mistrust exists, the steps taken by chinese authorities have been lauded. additionally, as noted earlier, when the who officials were notified of this outbreak, they were also quick to identify the virus and to take decisive measures in ensuring that the spread was contained. it only took days for the identification, but as noted earlier, it had taken approximately days (from december , e december ) to detect that the world was confronting a new type of coronavirus. the breakthrough in the early detection being witnessed in these recent years can be credited to several factors. first, the reasons learnt from previous occurrences may have prompted some changes on how governments and stakeholders perceive the issue of pandemic outbreak. secondly, and more importantly, the emergence and subsequent acceptability of a wide range of computational technologies has made it possible for faster data collection, data sharing, and advanced computation and analysis. the availability of data from different sources, including smart devices and healthwearable devices, social media, and existing health database, has also been handy and influential in determining the detection period and tracking of outbreaks. for this reason, gaille ( ) notes that besides technologies, availability of data in large quantities is now seen as the world's new "gold rush" of this century. the availability of these not only influences health outcomes but is also seen to determine geopolitical standing, with those in position to collect, store, and control most of the data seen to be positioned as a global power house and, hence, the push and pull on the g internet between power economies (allam, a; allam and dhunny, ; allam and jones, ; allam and newman, ; kharpal, ) . in addition, even in lower levels of governance, the control of data is seen to be raising heat with large ict corporations competing to control the market share such that they can have exclusive control of data, thus increasing their profit standing (allam and jones, ) . but, beyond selfish interests, it is possible for corporations and governments and organisations with capacity to manage large quantities of data to work together for the sake of the economic landscape, the welfare of the social fabric, and the improvement in the health sector (allam, b; allam, c; allam, d) . such calls are valid in a time like now when the entire global economy, the health sector, and societies are in limbo due to the impacts of covid- . to achieve such noble goals, however, as noted in the previous sections, there are several things that need to be addressed to streamline the data usage landscape. among these include addressing some notable challenges with computing technologies used in analyzing big data. first, there needs to be a framework that guides how data have for long been highly guarded, collected, shared, and accessed in such a way that it does deem to be compromising the security and privacy of individuals. 
by doing this, it would be possible to increase personal data even further as the ethical and moral issues associated with data sharing would be lifted (allam, a; allam, b) . in particular, this would be important when it comes to accessing and using data comprising personal genome, personal demographic information, and other personal identifying details (vayena and blasimme, ) . in cases where such must be accessed, the use of strategies like k-anonymity (i.e ensuring that datasets have no combination of user identifying attribute) (sweeney, ) to anonymise data, or the use of technologies such as blockchain or quantum cryptography must be made to ensure total anonymization of data. the framework should also address issues to do with standardization of protocols and networks, which have for long been seen to reduce communication of systems, especially in urban areas. on this, as shared by allam and jones ( ) , standardization would then support seamless flow at an urban, regional, and international scale. besides streamlining on the use of technologies, and data-related issues, as has comprehensively been shared earlier, the war against covid- , as has been done already in different countries, needs to be supported by strengthening instituted quarantines, self-isolation, and lockdowns. these actions are in their part enough in enriching health databases, as it is from these that data on people contracting, recovering, and succumbing from the virus are being collected, in addition to those already sourced in medical facilities. this work has candidly explored the role of various technologies, especially ai, machine learning, and nlp and big data in early detection of covid- , especially by exploring how such were instrumental in assisting blue-dot and metabiota companies make their groundbreaking predictions for rendering early detection of the coronavirus. the exploration has demonstrated that the future of the health sector, among others, is promising, if such predictive achievements are to continue. to make this even better, it is the position in this paper that data sharing practices need to be encouraged by adopting best practices such as standardization of protocols, enhancing anonymization, and employing modern technologies such as blockchain and quantum cryptography, which have proven to be novel in such fields. there is also needed to emphasize cooperation between different agencies, institutions, and corporations to ensure that corporate monetary interest on data does not overshadow work aimed toward improving global health, economic equity, and social welfare. 
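as a closing illustration of the anonymization strategies discussed in this chapter, the k-anonymity condition mentioned above (no combination of quasi-identifying attributes should single out fewer than k records) can be checked with a short group-by. the column names, the toy records, and the choice of k are placeholders.

```python
import pandas as pd

def smallest_group(df, quasi_identifiers):
    """Size of the rarest combination of quasi-identifier values in the dataset."""
    return int(df.groupby(quasi_identifiers).size().min())

def is_k_anonymous(df, quasi_identifiers, k=5):
    """True if every combination of quasi-identifiers appears at least k times."""
    return smallest_group(df, quasi_identifiers) >= k

# toy records with hypothetical quasi-identifiers
records = pd.DataFrame({
    "age_band":        ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "postcode_prefix": ["100",   "100",   "100",   "200",   "200"],
    "sex":             ["f",     "f",     "f",     "m",     "m"],
})
cols = ["age_band", "postcode_prefix", "sex"]
print(smallest_group(records, cols))          # rarest combination appears twice
print(is_k_anonymous(records, cols, k=2))     # so the table is 2-anonymous but not 5-anonymous
```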
urban resilience and economic equity in an era of global climate crisis the emergence of anti-privacy and control at the nexus between the concepts of safe city and smart city cities and the digital revolution: aligning technology and humanity data as the new driving gears of urbanization privatization and privacy in the digital city theology, sustainability and big data on big data, artificial intelligence and smart cities the potential of blockchain within air rights development as a prevention measure against urban sprawl on the coronavirus (covid- ) outbreak and the smart city network: universal data sharing standards coupled with artificial intelligence (ai) to benefit urban health monitoring and management redefining the smart city: culture redefining the use of big data in urban health for increased liveability in smart cities how a toronto company used big data to predict the spread of zika viral outbreaks involve destabilized evolutionary networks: evidence from ebola what do these terms mean and how will they impact health care? better public health surveillance for infectious diseases bluedot protects people around the world from infectious diseases with human and artificial intelligence how canadian ai start-up bluedot spotted coronavirus before anyone else had a clue influenza (flu) what caused the black death? ai could help with the next pandemic-but not with this one how ai is battling the coronavirus outbreak china 'has the edge' in the war for g and the us and europe could fall behind artificial intelligence in health: new opportunities, challenges, and practical implications china locks down city at center of coronavirus outbreak risks, impacts, and mitigation the spanish influenza pandemic: a lesson from history years after how healthcare is using big data and ai to cure disease confronting the risk you can't see coronavirus latest: australian lab first to grow virus outside china an ai epidemiologist sent the first warnings of the wuhan virus the impacts on health, society, and economy of sars and h n outbreaks in china: a case comparison study where it all began: wuhan's virus ground-zero 'wet market' hides in plain sight the metaboita story from sars to mers, thrusting coronaviruses into the spotlight ) k-anonynity: a model for protecting privacy big data predicted the coronavirus outbrea and where it would spread health research with big data: time for systemic oversight predictive analytics in health care: emerging value and risks emergencies preparedness novel coranvirus ( -ncov) situation report. world health organisation the ebola outbreak of - : from coordinated multilateral action to effective disease containment, vaccine development, and beyond key: cord- -fr uod authors: nan title: saem abstracts, plenary session date: - - journal: acad emerg med doi: . /j. - . . .x sha: doc_id: cord_uid: fr uod nan objectives: we sought to determine if the ocp policy resulted in a meaningful and sustained improvement in ed throughput and output metrics. methods: a prospective pre-post experimental study was conducted using administrative data from community and tertiary centers across the province. the study phases consisted of the months from february to september compared against the same months in . operational data for all centres were collected through the edis tracking systems used in the province. 
the ocp included main triggers: ed bed occupancy > %, at least % of ed stretchers blocked by patients awaiting inpatient bed or disposition decision, and no stretcher available for high acuity patients. when all criteria were met, selected boarded patients were moved to an inpatient unit (non-traditional care space if no bed available). the primary outcome was ed length of stay (los) for admitted patients. the ed load of boarded patients from - am was reported the editors of academic emergency medicine (aem) are honored to present these abstracts accepted for presentation at the annual meeting of the society for academic emergency medicine (saem), may to in chicago, illinois. these abstracts represent countless hours of labor, exciting intellectual discovery, and unending dedication by our specialty's academicians. we are grateful for their consistent enthusiasm, and are privileged to publish these brief summaries of their research. this year, saem received abstracts for consideration, and accepted . each abstract was independently reviewed by up to six dedicated topic experts blinded to the identity of the authors. final determinations for scientific presentation were made by the saem program scientific subcommittee co-chaired by ali s. raja, md, mba, mph and steven b. bird, md, and the saem program committee, chaired by michael l. hochberg, md. their decisions were based on the final review scores and the time and space available at the annual meeting for oral and poster presentations. there were also innovation in emergency medicine education (ieme) abstracts submitted, of which were accepted. the ieme subcommittee was co-chaired by joanna leuck, md and laurie thibodeau, md. we present these abstracts as they were received, with minimal proofreading and copy editing. any questions related to the content of the abstracts should be directed to the authors. presentation numbers precede the abstract titles; these match the listings for the various oral and poster sessions at the annual meeting in chicago, as well as the abstract numbers (not page numbers) shown in the key word and author indexes at the end of this supplement. all authors attested to institutional review board or animal care and use committee approval at the time of abstract submission, when relevant. abstracts marked as ''late-breakers'' are prospective research projects that were still in the process of data collection at the time of the december abstract deadline, but were deemed by the scientific subcommittee to be of exceptional interest. these projects will be completed by the time of the annual meeting; data shown here may be preliminary or interim. on behalf of the editors of aem, the membership of saem, and the leadership of our specialty, we sincerely thank our research colleagues for these contributions, and their continuing efforts to expand our knowledge base and allow us to better treat our patients. david background: two to ten percent of patients evaluated in the emergency departments (ed) present with altered mental status (ams). the prevalence of non-convulsive seizure (ncs) and other electroencephalographic (eeg) abnormalities in this population is not known. this information is needed to make recommendations regarding the routine use of emergent eeg in ams patients. objectives: to identify the prevalence of ncs and other eeg abnormalities in ed patients with ams. methods: an ongoing prospective study at two academic urban ed. inclusion: patients ‡ years old with ams. 
exclusion: an easily correctable cause of ams (e.g. hypoglycemia, opioid overdose). a -minute eeg with the standard electrodes was performed on each subject as soon as possible after presentation (usually within hour). outcome: the rate of eeg abnormalities based on blinded review of all eegs by two boardcertified epileptologists. descriptive statistics are used to report eeg findings. frequencies are reported as percentages with % confidence intervals (ci), and inter-rater variability is reported with kappa. results: the interim analysis was performed on consecutive patients (target sample size: ) enrolled from may to october (median age: , range - , % male). eegs for patients were reported uninterpretable by at least one rater ( by both raters). of the remaining , only ( %, %ci - %) were normal according to either rater (n = by both). the most common abnormality was background slowing (n = , %, %ci - %) by either rater (n = by both), indicating underlying encephalopathy. ncs was diagnosed in patients ( %, %ci, - %) by at least one rater (n = by both), including ( %, %ci - %) patients in non-convulsive status epilepticus (ncse). patients ( %, %ci - %) had interictal epileptiform discharges read by at least one rater (n = by both) indicating cortical irritability and an increased risk of spontaneous seizure. inter-rater reliability for eeg interpretations was modest (kappa: . , %ci . - . ). objectives: to define diagnostic sbi and non-bacterial (non-sbi) biosignatures using rna microarrays in febrile infants presenting to emergency departments (eds). methods: we prospectively collected blood for rna microarray analysis in addition to routine screening tests including white blood cell (wbc) counts, urinalyses, cultures of blood, urine, and cerebrospinal fluid, and viral studies in febrile infants days of age in eds . we defined sbi as bacteremia, urinary tract infection (uti), or bacterial meningitis. we used class comparisons (mann-whitney p < . , benjamini for mtc and . fold change filter), modular gene analysis, and k-nn algorithms to define and validate sbi and non-sbi biosignatures in a subset of samples. results: % ( / ) of febrile infants were evaluated for sbi. . % ( / ) had sbi ( ( . %) bac-teremia, ( . %) utis, and ( . %) bacterial meningitis). infants with sbis had higher mean temperatures, and higher wbc, neutrophil, and band counts. we analyzed rna biosignatures on febrile infants: sbis ( meningitis, bacteremia, uti), non-sbis ( influenza, enterovirus, undefined viral infections), and healthy controls. class comparisons identified , differentially expressed genes between sbis and non-sbis. modular analysis revealed overexpression of interferon related genes in non-sbis and inflammation related genes in sbis. genes were differently expressed (p < . ) in each of the three non-sbi groups vs sbi group. unsupervised cluster analysis of these genes correctly clustered % ( / ) of non-sbis and sbis. k-nn algorithm identified discriminatory genes in training set ( non-sbis vs sbis) which classified an independent test ( non-sbis vs sbis) with % accuracy. four misclassified sbis had over-expression of interferon-related genes, suggesting viral-bacterial co-infections, which was confirmed in one patient. background: improving maternal, newborn, and child health (mnch) is a leading priority worldwide. however, limited frontline health care capacity is a major barrier to improving mnch in developing countries. 
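The RNA biosignature abstract above filters differentially expressed genes (a Mann-Whitney class comparison with a fold-change filter) and then uses a k-NN algorithm to classify SBI versus non-SBI samples in a training set and an independent test set. The sketch below illustrates that general workflow on simulated expression data using scikit-learn; the gene counts, thresholds, and number of neighbors are assumptions chosen for illustration, not the study's actual pipeline.

    # illustrative k-nn biosignature workflow on simulated expression data
    # (gene counts, thresholds, and k are assumptions, not the study's values)
    import numpy as np
    from scipy.stats import mannwhitneyu
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    n_samples, n_genes = 60, 200
    y = np.tile([1, 0], n_samples // 2)          # 1 = SBI, 0 = non-SBI (simulated labels)
    X = rng.lognormal(mean=1.0, sigma=0.5, size=(n_samples, n_genes))
    X[y == 1, :20] *= 3.0                        # simulate 20 truly differential genes

    X_train, y_train = X[:40], y[:40]            # "training set"
    X_test, y_test = X[40:], y[40:]              # "independent test set"

    # class comparison: keep genes with mann-whitney p < 0.05 and at least 2-fold change
    keep = []
    for g in range(n_genes):
        a, b = X_train[y_train == 1, g], X_train[y_train == 0, g]
        p = mannwhitneyu(a, b).pvalue
        fold = a.mean() / b.mean()
        if p < 0.05 and (fold >= 2.0 or fold <= 0.5):
            keep.append(g)
    keep = keep or list(range(n_genes))          # fall back if nothing passes the filter

    # k-nn classifier trained on the discriminatory genes, applied to held-out samples
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train[:, keep], y_train)
    print("held-out accuracy:", accuracy_score(y_test, knn.predict(X_test[:, keep])))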
objectives: we sought to develop, implement, and evaluate an evidence-based maternal, newborn, and child survival (mncs) package for frontline health workers (fhws). we hypothesized that fhws could be trained and equipped to manage and refer the leading mnch emergencies. methods: setting -south sudan, which suffers from some of the world's worst mnch indices. assessment/intervention -a multi-modal needs assessment was conducted to develop a best-evidence package comprised of targeted trainings, pictorial checklists, and reusable equipment and commodities ( figure ). program implementation utilized a trainingof-trainers model. evalution - ) pre/post knowledge assessments, ) pre/post objective structured clinical examinations (osces), ) focus group discussions, and ) closed-response questionnaires. results: between nov to oct , local trainers and fhws were trained in of the states in south sudan. knowledge assessments among trainers (n = ) improved significantly from . % (sd . ) to . % (sd . ) (p < . ). mean scores a maternal osce and a newborn osce pre-training, immediately post-training, and upon - month follow-up are shown in the table. closed-response questionnaires with fhws revealed high levels of satisfaction, use, and confidence with mncs materials. participants reported an average of . referrals (range - ) to a higher level of care in the - months since training. furthermore, . % of fhws were more likely to refer patients as a result of the training program. during seven focus group discussions with trained fhws, respondents (n = ) reported high satisfaction with mncs trainings, commodities, and checklists, with few barriers to implementation or use. conclusion: these findings suggest mncs has led to improvements in south sudanese fhws' knowledge, skills, and referral practices with respect to appropriate management of mnch emergencies. no study has compared various lactate measurements to determine the optimal parameter to target. objectives: to compare the association of blood lactate kinetics with survival in patients with septic shock undergoing early quantitative resuscitation. methods: preplanned analysis of a multicenter edbased rct of early sepsis resuscitation targeting three physiological variables: cvp, map, and either central venous oxygen saturation or lactate clearance. inclusion criteria: suspected infection, two or more sirs criteria, and either sbp < mmhg after a fluid bolus or lactate > mmol/l. all patients had an initial lactate measured with repeat at two hours. normalization of lactate was defined a lactate decline to < . mmol/l in a patient with an intial lactate ‡ . . absolute lactate clearance (initial -delayed value), and relative ((absolute clearance)/(initial value)* ) were calculated if the initial lactate was ‡ . . the outcome was in-hospital survival. receiver operating characteristic curves were constructed and areas under the curve (auc) were calculated. difference in proportions of survival between the two groups at different lactate cutoffs were analyzed using % ci and fisher exact tests. results: of included patients, the median initial lactate was . mmol/l (iqr . , . ), and the median absolute and relative lactate clearance were mmol/l (iqr . , . ) and % (iqr , ). an initial lactate > . mmol/l was seen in / ( %), and / ( %) patients normalized their lactate. overall sutures on trunk and extremity lacerations that present in the ed. 
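The lactate kinetics analysis above defines absolute lactate clearance as the initial value minus the repeat (two-hour) value, relative clearance as that difference divided by the initial value times 100, and normalization as a decline of the repeat lactate below a fixed threshold. A minimal sketch of those calculations follows; the normalization threshold is left as a parameter because the exact cutoff has been stripped from the text above, and the example inputs are invented.

    def lactate_kinetics(initial, repeat, normal_threshold):
        """Return absolute clearance, relative clearance (%), and a normalization flag.

        initial, repeat: lactate in mmol/L (initial at recognition, repeat at two hours).
        normal_threshold: cutoff below which the repeat lactate counts as normalized
        (placeholder; the exact value is not recoverable from the abstract).
        """
        absolute = initial - repeat                # absolute clearance (initial - delayed)
        relative = absolute / initial * 100.0      # relative clearance, percent of initial
        normalized = repeat < normal_threshold     # lactate normalization
        return absolute, relative, normalized

    # example: an initial lactate of 4.0 mmol/L falling to 2.4 mmol/L at two hours
    print(lactate_kinetics(4.0, 2.4, normal_threshold=2.0))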
the use of absorbable sutures in the ed setting confers several advantages: patients do not need to return for suture removal which results in a reduction in ed crowding, ed wait times, missed work or school days, and stressful procedures (suture removal) for children. objectives: the primary objective of this study is to compare the cosmetic outcome of trunk and extremity lacerations repaired using absorbable versus nonabsorbable sutures in children and adults. a secondary objective is to compare complication rates between the two groups. methods: eligible patients with lacerations were randomly allocated to have their wounds repaired with vicryl rapide (absorbable) or prolene (nonabsorbable) sutures. at a day follow-up visit the wounds were evaluated for infection and dehiscence. after months, patients were asked to return to have a photograph of the wound taken. two blinded plastic surgeons using a previously validated mm visual analogue scale (vas) rated the cosmetic outcome of each wound. a vas score of mm or greater was considered to be a clinically significant difference. results: of the patients enrolled, have currently completed the study including in the vicryl rapide group and in the prolene group. there were no significant differences in the age, race, sex, length of wound, number of sutures, or layers of repair in the two groups. the observer's mean vas for the vicryl rapide group was . mm ) and that for the prolene group was . mm ( %ci . - . ), resulting in a mean difference of . mm ( %ci- . to . , p = . ). there were no significant differences in the rates of infection, dehiscence, or keloid formation between the two groups. conclusion: the use of vicryl rapide instead of nonabsorbable sutures for the repair of lacerations on the trunk and extremities should be considered by emergency physicians as it is an alternative that provides a similar cosmetic outcome. objectives: to determine the relationship between infection and time from injury to closure, and the characteristics of lacerations closed before and after hours of injury. methods: over an month period, a prospective multi-center cohort study was conducted at a teaching hospital, trauma center and community hospital. emergency physicians completed a structured data form when treating patients with lacerations. patients were followed to determine whether they had suffered a wound infection requiring treatment and to determine a cosmetic outcome rating. we compared infection rates and clinical characteristics of lacerations with chisquare and t-tests as appropriate. results: there were patients with lacerations; had documented times from injury to closure. the mean times from injury to repair for infected and noninfected wounds were . vs. . hrs (p = . ) with % of lacerations treated within hours and % ( ) treated hours after injury. there were no differences in the infection rates for lacerations closed before ( . %, %ci . - . ) or after ( . %, %ci . - . ) hours and before ( . %, % ci . %- . %) or after ( . %, % ci . %- . %) hours. the patients treated hours after injury tended to be older ( vs. yrs p = . ) and fewer were treated with primary closure ( % vs. % p < . ). comparing wounds or more hours after injury with more recent wounds, there was no effect of location on decision to close. wounds closed after hours did not differ from wounds closed before hours with respect to use of prophylactic antibiotics, type of repair, length of laceration, or cosmetic outcome. 
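The laceration-timing study above compares infection rates and clinical characteristics with chi-square and t-tests. For readers who want to see how such comparisons are carried out, a brief illustration with scipy follows; the counts and age distributions are invented for the example and do not reproduce the study data.

    import numpy as np
    from scipy.stats import chi2_contingency, ttest_ind

    # 2x2 table of infection by closure time (hypothetical counts):
    # rows = closed early / closed late, columns = infected / not infected
    table = np.array([[12, 488],
                      [ 3,  97]])
    chi2, p_chi2, dof, expected = chi2_contingency(table)
    print("chi-square p-value:", p_chi2)

    # unpaired t-test comparing ages between the two groups (hypothetical samples)
    ages_early = np.random.default_rng(1).normal(35, 12, size=500)
    ages_late = np.random.default_rng(2).normal(45, 12, size=100)
    t, p_t = ttest_ind(ages_early, ages_late, equal_var=False)
    print("t-test p-value:", p_t)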
conclusion: closing older lacerations, even those greater than hours after injury, does not appear to be associated with any increased risk of infection or adverse outcomes. excellent irrigation and decontamination over the last years may have led to this change in outcome. background: deep burns may result in significant scarring leading to aesthetic disfigurement and functional disability. tgf-b is a growth factor that plays a significant role in wound healing and scar formation. objectives: the current study was designed to test the hypothesis that a novel tgf-b antagonist would reduce scar contracture compared with its vehicle in a porcine partial thickness burn model. methods: ninety-six mid-dermal contact burns were created on the backs and flanks of four anesthetized young swine using a gm aluminum bar preheated to °celsius for seconds. the burns were randomized to treatment with topical tgf-b antagonist at one of three concentrations ( , , and ll) in replicates of in each pig. dressing changes and reapplication of the topical therapy were performed every days for weeks then twice weekly for an additional weeks. burns were photographed and full thickness biopsies were obtained at , , , , and days to determine reepithelialization and scar formation grossly and microscopically. a sample of burns in each group had % power to detect a % difference in percentage scar contracture. results: a total of burns were created in each of the three study groups. burns treated with the high dose tgf-b antagonist healed with less scar contracture than those treated with the low dose and control ( ± %, ± %, and ± %; anova p = . ). additionally, burns treated with the higher, but not the lower dose of tgf-b antagonist healed with significantly fewer full thickness scars than controls ( . % vs. % vs. . % respectively; p < . ). there were no infections and no differences in the percentage wound reepithelialization among all study groups at any of the time points. conclusion: treatment of mid-dermal porcine contact burns with the higher dose tgf-b antagonist reduced scar contracture and rate of deep scars compared with the low dose and controls. background: diabetic ketoacidosis (dka) is a common and lethal complication of diabetes. the american diabetes association recommends treating adult patients with a bolus dose of regular insulin followed by a continuous insulin infusion. the ada also suggests a glucose correction rate of - mg/dl/hr to minimize complications. objectives: compare the effect of bolus dose insulin therapy with insulin infusion to insulin infusion alone on serum glucose, bicarbonate, and ph in the initial treatment of dka. methods: consecutive dka patients were screened in the ed between march ' and june ' . inclusion criteria were: age > years, glucose > mg/dl, serum bicarbonate or ketonemia or ketonuria. exclusion criteria were: congestive heart failure, current hemodialysis, pregnancy, or inability to consent. no patient was enrolled more than once. patients were randomized to receive either regular insulin . units/kg or the same volume of normal saline. patients, medical and research staff were blinded. baseline glucose, electrolytes, and venous blood gases were collected on arrival. bolus insulin or placebo was then administered and all enrolled patients received regular insulin at rate of . unit/kg/hr, as well as fluid and potassium repletion per the research protocol. glucose, electrolytes, and venous blood gases were drawn hourly for hours. 
data between two groups were compared using unpaired t-test. results: patients were enrolled, with being excluded. patients received bolus insulin; received placebo. no significant differences were noted in initial glucose, ph, bicarbonate, age, or weight between the two groups. after the first hour, glucose levels in the insulin group decreased by mg/dl compared to mg/dl in the placebo group (p = . , % ci . to . ). changes in mean glucose levels, ph, bicarbonate level, and ag were not statistically different between the two groups for the remainder of the hour study period. there was no difference in the incidence of hypoglycemia in the two groups. conclusion: administering a bolus dose of regular insulin decreased mean glucose levels more than placebo, although only for the first hour. there was no difference in the change in ph, serum bicarbonate or anion gap at any interval. this suggests that bolus dose insulin may not add significant benefit in the emergency management of dka. ihca; . return of spontaneous circulation (rsoc). traumatic cardiac arrests were excluded. we recorded baseline demographics, arrest event characteristics, follow-up vitals and laboratory data, and in-hospital mortality. apache ii scores were calculated at the time of rosc, and at hrs, hrs, and hrs. we used simple descriptive statistics to describe the study population. univariate logistic regression was used to predict mortality with apache ii as a continuous predictor variable. discrimination of apache ii scores was assessed using the area under the curve (auc) of the receiver operator characteristic (roc) curve. results: a total of patients were analyzed. the median age was years (iqr: - ) and % were female. apache ii score was a significant predictor of mortality for both ohca and ihca at baseline and at all follow-up time points (all p < . ). discrimination of the score increased over time and achieved very good discrimination after hrs (table, figure) . conclusion: the ability of apache ii score to predict mortality improves over time in the hours following cardiac arrest. these data suggest that after hours, apache ii scoring is a useful severity of illness score in all post-cardiac arrest patients. background: admission hyperglycemia has been described as a mortality risk factor for septic non-diabetics, but the known association of hyperglycemia with hyperlactatemia (a validated mortality risk factor in sepsis) has not previously been accounted for. objectives: to determine whether the association of hyperglycemia with mortality remains significant when adjusted for concurrent hyperlactatemia. methods: this was a post-hoc, nested analysis of a single-center cohort study. providers identified study subjects during their ed encounters; all data were collected from the electronic medical record. patients: nondiabetic adult ed patients with a provider-suspected infection, two or more systemic inflammatory response syndrome criteria, and concurrent lactate and glucose testing in the ed. setting: the ed of an urban teaching hospital; to . analysis: to evaluate the association of hyperglycemia (glucose > mg/dl) with hyperlactatemia (lactate ‡ . mmol/l), a logistic regression model was created; outcome-hyperlactatemia; primary variable of interest-hyperglycemia. a second model was created to determine if concurrent hyperlactatemia affects hyperglycemia's association with mortality; outcome- -day mortality; primary risk variablehyperglycemia with an interaction term for concurrent hyperlactatemia. 
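The post-cardiac-arrest abstract above fits a univariate logistic regression with APACHE II as a continuous predictor of in-hospital mortality and summarizes discrimination with the area under the ROC curve. The sketch below shows that analysis pattern on simulated data with scikit-learn; the scores, outcomes, and assumed logistic relationship are hypothetical, and the original analysis may have used different software.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # hypothetical apache ii scores and mortality outcomes for post-arrest patients
    apache = rng.uniform(5, 40, size=200)
    p_death = 1 / (1 + np.exp(-(0.15 * apache - 3.5)))   # assumed relationship, for simulation only
    died = rng.binomial(1, p_death)

    # univariate logistic regression: mortality ~ apache ii (continuous predictor)
    model = LogisticRegression().fit(apache.reshape(-1, 1), died)
    predicted = model.predict_proba(apache.reshape(-1, 1))[:, 1]

    # discrimination summarized as the area under the roc curve
    print("odds ratio per point:", float(np.exp(model.coef_[0][0])))
    print("auc:", roc_auc_score(died, predicted))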
both models were adjusted for demographics, comorbidities, presenting infectious syndrome, and objective evidence of renal, respiratory, hematologic, or cardiovascular dysfunction. results: ed patients were included; mean age ± years. ( %) subjects were hyperglycemic, ( %) hyperlactatemic, and ( %) died within days of the initial ed visit. after adjustment, hyperglycemia was significantly associated with simultaneous hyperlactatemia (or . , %ci . , . ). hyperglycemia with concurrent hyperlactatemia was associated with increased mortality risk (or . , %ci . , . ) , but hyperglycemia in the absence of simultaneous hyperlactatemia was not (or . , %ci . , . ) . conclusion: in this cohort of septic adult non-diabetic patients, mortality risk did not increase with hyperglycemia unless associated with simultaneous hyperlactatemia. the previously reported association of hyperglycemia with mortality in this population may be due to the association of hyperglycemia with hyperlactatemia. the background: near infrared spectroscopy (sto ) represents a measure of perfusion that provides the treating physician with an assessment of a patient's shock state and response to therapy. it has been shown to correlate with lactate and acid/base status. it is not known if using information from this monitor to guide resuscitation will result in improved patient outcomes. objectives: to compare the resuscitation of patients in shock when the sto monitor is or is not being used to guide resuscitation. methods: this was a prospective study of patients undergoing resuscitation in the ed for shock from any cause. during alternating day periods, physicians were blinded to the data from the monitor followed by days in which physicians were able to see the information from the sto monitor and were instructed to resuscitate patients to a target sto value of . adult patients (age> ) with a shock index (si) of > . (si = heart rate/systolic blood pressure) or a blood pressure < mmhg systolic who underwent resuscitation were enrolled. patients had a sto monitor placed on the thenar eminence of their least-injured hand. data from the sto monitor were recorded continuously and noted every minute along with blood pressure, heart rate, and oxygen saturation. all treatments were recorded. patients' charts were reviewed to determine the diagnosis, icu-free days in the days after enrollment, inpatient los, and -day mortality. data were compared using wilcoxon rank sum and chi-square tests. results: patients were enrolled, during blinded periods and during unblinded periods. the median presenting shock index was . (range . to . ) for the blinded group and . ( . - . ) for the unblinded group (p = . ). the median time in department was minutes (range - ) for the blinded and minutes (range - ) for the unblinded groups (p = . ). the median hospital los was day (range - ) for the blinded group, and days (range - ) in the unblinded group (p = . ). the mean icu-free days was ± for the blinded group and ± for the unblinded group (p = . ). among patients where the physician indicated using the sto monitor data to guide patient care, the icu-free days were . ± for the blinded group and . ± for the blinded group (p = . ). background: inducing therapeutic hypothermia (th) using °c iv fluids in resuscitated cardiac arrest patients has been shown to be feasible and effective. limited research exists assessing the efficiency of this cooling method. 
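The StO2 resuscitation study above enrolls patients by a shock index, defined in the abstract as heart rate divided by systolic blood pressure, above a cutoff, or by a low systolic pressure (the numeric cutoffs have been stripped from this text). A small helper showing the calculation and the enrollment check, with both cutoffs left as placeholder parameters:

    def shock_index(heart_rate, systolic_bp):
        """Shock index = heart rate / systolic blood pressure, as defined in the abstract."""
        return heart_rate / systolic_bp

    def meets_enrollment(heart_rate, systolic_bp, si_cutoff, sbp_cutoff):
        """Enrollment check: shock index above a cutoff OR systolic BP below a cutoff.

        si_cutoff and sbp_cutoff are placeholders; the exact values are not
        recoverable from the abstract text.
        """
        return shock_index(heart_rate, systolic_bp) > si_cutoff or systolic_bp < sbp_cutoff

    # example: heart rate 120 with systolic pressure 90 gives a shock index of about 1.33
    print(shock_index(120, 90))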
objectives: the objective was to determine an efficient infusion method for keeping fluid close to °c upon exiting an iv. it was hypothesized that colder temperatures would be associated with both higher flow rate and insulation of the fluid bag. methods: efficiency was studied by assessing change in fluid temperature ( c) during the infusion, under three laboratory conditions. each condition was performed four times using liter bags of normal saline. fluid was infused into a ml beaker through gtts tubing. flow rate was controlled using a tubing clamp and in-line transducer with a flowmeter, while temperature was continuously monitored in a side port at the terminal end of the iv tubing using a digital thermometer. the three conditions included infusing chilled fluid at a rate of ml/min, which is equivalent to ml/kg/hr for an kg patient, ml/min, and ml/min using a chilled and insulated pressure bag. descriptive statistics and analysis of variance was performed to assess changes in fluid temperature. results: the average fluid temperatures at time were . ( % ci . - . ) ( ml/min), . ( % ci . - . ) ( ml/min), and . ( % ci . - . ) ( ml/min + insulation). there was no significant difference in starting temperature between groups (p = . ). the average fluid temperatures after ml had been infused were . ( % ci . - . ) ( ml/min), . ( % ci . - . ) ( ml/min), and . ( % ci . - . ) ( ml/min + insulation). the higher flow rate groups had significantly lower temperature than the lower flow rate after ml of fluid had been infused (p < . ). the average fluid temperatures after ml had been infused were . ( % ci . - . ) ( ml/min), . ( % ci . - . ) ( ml/min), and . ( % ci . - . ) ( ml/min + insulation). there was a significant difference in temperature between all three groups after ml of fluid had been infused (p < . ). conclusion: in a laboratory setting, the most efficient method of infusing cold fluid appears to be a method that both keeps the bag of fluid insulated and is infused at a faster rate. fluid bolus. patients were categorized by presence of vasoplegic or tissue dysoxic shock. demographics and sequential organ failure assessment (sofa) scores were evaluated between the groups. the primary outcome was in-hospital mortality. data were analyzed using t-tests, chi-squared test, and proportion differences with % confidence intervals as appropriate. results: a total of patients were included: patients with vasoplegic shock and with tissue dysoxic shock. there were no significant differences in age ( vs. years), caucasian race ( % vs. %), or male sex ( % vs. %) between the dysoxic shock and vasoplegic shock groups, respectively. the group with vasoplegic shock had a lower initial sofa score than did the group with tissue dysoxic shock ( . vs. . points, p = . ). the primary outcome of in-hospital mortality occurred in / ( %) of patients with vasoplegic shock compared to / ( %) in the group with tissue dysoxic shock (proportion difference %, % ci - %, p < . ). conclusion: in this analysis of patients with septic shock, we found a significant difference in in-hospital mortality between patients with vasoplegic versus tissue dysoxic septic shock. these findings suggest a need to consider these differences when designing future studies of septic shock therapies. background: the pre-shock population, ed sepsis patients with tissue hypoperfusion (lactate of . - . mm), commonly deteriorates after admission and requires transfer to critical care. 
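The septic shock phenotype comparison above reports in-hospital mortality as a difference in proportions with a 95% confidence interval. One common way to compute such an interval is the Wald (normal-approximation) method sketched below; the counts are hypothetical and the original analysis may have used a different interval.

    import math

    def proportion_difference_ci(x1, n1, x2, n2, z=1.96):
        """Difference in proportions (p1 - p2) with a Wald 95% confidence interval."""
        p1, p2 = x1 / n1, x2 / n2
        diff = p1 - p2
        se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return diff, (diff - z * se, diff + z * se)

    # hypothetical counts: deaths/total in the tissue dysoxic vs vasoplegic shock groups
    print(proportion_difference_ci(30, 100, 10, 100))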
objectives: to determine the physiologic parameters and disease severity indices in the ed pre-shock sepsis population that predict clinical deterioration. we hypothesized that neither initial physiologic parameters nor organ function scores will be predictive. methods: design: retrospective analysis of a prospectively maintained registry of sepsis patients with lactate measurements. setting: an urban, academic medical center. participants: the pre-shock population, defined as adult ed sepsis patients with either elevated lactate ( . - . mm) or transient hypotension (any sbp < mmhg) receiving iv antibiotics and admitted to a medical floor. consecutive patients meeting pre-shock criteria were enrolled over a -year period. patients with overt shock in the ed, pregnancy, or acute trauma were excluded. outcome: primary patientcentered outcome of increased organ failure (sequential organ failure assessment [sofa] score increase > point, mechanical ventilation, or vasopressor utilization) within hours of admission or in-hospital mortality. results: we identified pre-shock patients from screened. the primary outcome was met in % of the cohort and % were transferred to the icu from a medical floor. patients meeting the outcome of increased organ failure had a greater shock index ( . vs . , p = . ) and heart rate ( vs , p < . ) with no difference in initial lactate, age, map, or exposure to hypotension (sbp < mmhg). there was no difference in the predisposition, infection, response, and organ dysfunction (piro) score between groups ( . vs . , p = . ). outcome patients had similar initial levels of organ dysfunction but had higher sofa scores at , , and hours, a higher icu transfer rate ( vs %, p < . ), and increased icu and hospital lengths of stay. conclusion: the pre-shock sepsis population has a high incidence of clinical deterioration, progressive organ failure, and icu transfer. physiologic data in the ed were unable to differentiate the pre-shock sepsis patients who developed increased organ failure. this study supports the need for an objective organ failure assessment in the emergency department to supplement clinical decision-making. background: lipopolysaccharide (lps) has long been recognized to initiate the host inflammatory response to infection with gram negative bacteria (gnb). large clinical trials of potentially very expensive therapies continue to have the objective of reducing circulating lps. previous studies have found varying prevalence of lps in blood of patients with severe sepsis. compared with sepsis trials conducted years ago, the frequency of gnb in culture specimens from emergency department (ed) patients enrolled in clinical trials of severe sepsis has decreased. objectives: test the hypothesis that prior to antibiotic administration, circulating lps can be detected in the plasma of fewer than % of ed patients with severe sepsis. methods: secondary analysis of a prospective edbased rct of early quantitative resuscitation for severe sepsis. blood specimens were drawn at the time severe sepsis was recognized, defined as two or more systemic inflammatory criteria and a serum lactate > mm or spb< mmhg after fluid challenge. blood was drawn in edta prior to antibiotic administration or within the first several hours, immediately centrifuged, and plasma frozen at ) °c. plasma lps was quantified using the limulus amebocyte lysate assay (lal) by a technician blinded to all clinical data. results: patients were enrolled with plasma samples available for testing. 
median age was ± years, % female, with overall mortality of %. forty of patients ( %) had any culture specimen positive for gnb including ( %) with blood cultures positive. only five specimens had detectable lps, including two with a gnb-positive culture specimen, and three were lps-positive without gnb in any culture. prevalence of detectable lps was . % (ci: . %- . %). the frequency of detectable lps in antibiotic-naive plasma is too low to serve as a useful diagnostic test or therapeutic target in ed patients with severe sepsis. the data raise the question of whether post-antibiotic plasma may have a higher frequency of detectable lps. background: egdt is known to reduce mortality in septic patients. there is no evidence to date that delineates the role of using a risk stratification tool, such as the mortality in emergency department sepsis (meds) score, to determine which subgroups of patients may have a greater benefit with egdt. objectives: our objective was to determine if our egdt protocol differentially affects mortality based on the severity of illness using meds score. methods: this study is a retrospective chart review of patients, conducted at an urban tertiary care center, after implementing an egdt protocol on july , (figure) . this study compares in-hospital mortality, length of stay (los) in icu, and los in ed between the control group ( patients from / / - / / ) and the postimplementation group ( patients from / / - / / ), using meds score as a risk stratification tool. inclusion criteria: patients who presented to our ed with a suspected infection, and two or more sirs criteria, a map< mmhg, a sbp< mmol/l. exclusion criteria: age< , death on arrival to ed, dnr or dni, emergent surgical intervention, or those with an acute myocardial infarction or chf exacerbation. a two-sample t-test was used to show that the mean age and number of comorbidities was similar between the control and study groups (p = . and . respectively). mortality was compared and adjusted for meds score using logistic regression. the odds ratios and predicted probabilities of death are generated using the fitted logistic regression model. ed and icu los were compared using mood's median test. results: when controlling for illness severity using meds score, the relative risk (rr) of death with egdt is about half that of the control group (rr = . , % ci [ . - . ], p= . ). also, by applying meds score to risk stratify patients into various groups of illness severity, we found no specific groups where egdt is more efficacious at reducing the predicted probability of death (table ) . without controlling for meds score, there is a trend in reduction of absolute mortality by . % when egdt is used (control = . %, study = . %, p = . ). egdt leads to a . % reduction in the median los in icu (control = hours, study = hours, p = . ), without increasing los in ed (control = hours, study = hours, p = . ). conclusion: egdt is beneficial in patients with severe sepsis or septic shock, regardless of their meds score. background: in patients experiencing acute coronary syndrome (acs), prompt diagnosis is critical in achieving the best health outcome. while ecg analysis is usually sufficient to diagnose acs in cases of st elevation, acs without st elevation is reliably diagnosed through serial testing of cardiac troponin i (ctni). pointof-care testing (poct) for ctni by venipuncture has been proven a more rapid means to diagnosis than central laboratory testing. 
implementing fingerstick testing for ctni in place of standard venipuncture methods would allow for faster and easier procurement of patients' ctni levels, as well as increase the likelihood of starting a rapid test for ctni in the prehospital setting, which could allow for even earlier diagnosis of acs. objectives: to determine if fingerstick blood samples yield accurate and reliable troponin measurements compared to conventional venous blood draws using the i-stat poc device. methods: this experimental study was performed in the ed of a quaternary care suburban medical center between june-august . fingerstick blood samples were obtained from adult ed patients for whom standard (venipuncture) poc troponin testing was ordered. the time between fingerstick and standard draws was kept as narrow as possible. ctni assays were performed at the bedside using the i-stat (abbott point of care). results: samples from patients were analyzed by both fingerstick and standard ed poct methods (see table) . four resulted in cartridge error. compared to ''gold standard'' ed poct, fingerstick testing has a positive predictive value of %, negative predictive value of %, sensitivity of %, and specificity of %. no significant difference in ctni level was found between the two methods, with a nonparametric intraclass correlation coefficient of . ( % ci . - . , p-value < . ). conclusion: whole blood fingerstick ctni testing using the i-stat device is suitable for rapid evaluation of ctni level in prehospital and ed settings. however, results must be interpreted with caution if they are within a narrow territory of the cutoff for normal vs. elevated levels. additional testing on a larger sample would be beneficial. the practicality and clinical benefit of using fingerstick ctni testing in the ems setting must still be assessed. background: adjudication of diagnosis of acute myocardial infarction (ami) in clinical studies typically occurs at each site of subject enrollment (local) or by experts at an independent site (central). from from - , the troponin (ctn) element of the diagnosis was predicated on the local laboratories, using a mix of the th percentile reference ctn and roc-determined cutpoints. in , the universal definition of ami (ud-ami) defined it by the th percentile reference alone. objectives: to compare the diagnosis rates of ami as determined by local adjudication vs. central adjudication using udami criteria. methods: retrospective analysis of data from the myeloperoxidase in the diagnosis of acute coronary syndromes (acs) study (midas), an -center prospective study with enrollment from / / to / / of patients with suspected acs presenting to the ed < hours after symptom onset and in whom serial ctn and objective cardiac perfusion testing was planned. adjudication of acs was done by single local principal investigators using clinical data and local ctn cutpoints from different ctn assays, and applying the definition. central adjudication was done after completion of the midas primary analysis using the same data and local ctn assay, but by experts at three different institutions, using the udami and the manufacturer's th percentile ctn cutpoint, and not blinded to local adjudications. discrepant dignoses were resolved by consensus. local vs. central ctn cutpoints differed for six assays, with central cutpoints lower in all. statistics were by chi-square and kappa. results: excluding cases deemed indeterminate by central adjudication, cases were successfully adjudicated. 
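The fingerstick troponin abstract above summarizes agreement with standard point-of-care testing as positive and negative predictive values, sensitivity, and specificity. All four follow from a 2x2 table against the reference standard, as shown below with hypothetical counts.

    def diagnostic_metrics(tp, fp, fn, tn):
        """Sensitivity, specificity, PPV, and NPV from a 2x2 table
        (index test result vs. reference standard)."""
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
            "ppv": tp / (tp + fp),
            "npv": tn / (tn + fn),
        }

    # hypothetical counts: fingerstick result vs. standard point-of-care troponin
    print(diagnostic_metrics(tp=18, fp=1, fn=2, tn=79))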
local adjudication resulted in ami ( . % of total) and non-ami; central adjudication resulted in ( . %) ami and non-ami. overall, local diagnoses ( %) were either changed from non-ami to ami or ami to non-ami (p < . ). interrater reliability across both methods was found to be kappa = . (p < . ). for acs diagnosis, local adjudication identified acs cases ( %) and non-acs, while central adjudication identified acs ( %) and non-acs. overall, local diagnoses ( %) were either changed from non-acs to acs or acs to non-acs (p < . ). interrater reliability found kappa = . (p < . ). conclusion: central and local adjudication resulted in significantly different rates of ami and acs diagnosis. however, overall agreement of the two methods across these two diagnoses was acceptable. occur four times more often in cocaine users. biomarkers myeloperoxidase (mpo) and c-reactive protein (crp) have potential in the diagnosis of acs. objectives: to evaluate the utility of mpo and crp in the diagnosis of acs in patients presenting to the ed with cocaine-associated chest pain and compare the predictive value to nonusers. we hypothesized that these markers may be more sensitive for acs in nonusers given the underlying pathophysiology of enhanced plaque inflammation. methods: a secondary analysis of a cohort study of enrolled ed patients who received evaluation for acs at an urban, tertiary care hospital. structured data collection at presentation included demographics, chest pain history, lab, and ecg data. subjects included those with self-reported or lab-confirmed cocaine use and chest pain. they were matched to controls based on age, sex, and race. our main outcome was diagnosis of acs at index visit. we determined median mpo and crp values, calculated maximal auc for roc curves, and found cut-points to maximize sensitivity and specificity. data are presented with % ci. results: overall, patients in the cocaine positivegroup and patients in the nonusers group had mpo and crp levels measured. patients had a median age of (iqr, ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) , % black or african american, and % male (p > . between groups). fifteen patients were diagnosed with acs: patients in the cocaine group and in the nonusers group. comparing cocaine users to nonusers, there was no difference in mpo (median [iqr, ] v ng/ml; p = . ) or crp ( [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] v [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] mg/l; p = . ). the auc for mpo was . ( % ci . - . ) v . ( % ci . - . ). the optimal cut-point to maximize sensitivity and specificity was ng/ml which gave a sensitivity of . and specificity of . . using this cutpoint, % v % of acs in cocaine users vs the nonusers would be identified. the auc for crp was . ( % ci . - . ) in cocaine users vs . ( % ci . - . ) in nonusers. the optimal cut point was . mg/l with a sensitivity of . and specificity of . . using this cutpoint, % v % of acs in cocaine users and nonusers would have been identified. conclusion: the diagnostic accuracy of mpo and crp is not different in cocaine users than nonusers and does not appear to have sufficient discriminatory ability in either cohort. results: hrs of moderate pe caused a significant decrease in rv heart function in rats treated with the solvent for bay - : peak systolic pressure (psp) decreased from ± . mmhg, control to ± . , pe, +dp/dt decreased from ± mmhg/sec to ± , -dp/dt decreased from ) ± mmhg/sec to ) ± . 
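The MPO/CRP abstract above reports the AUC for each biomarker and an optimal cut-point chosen to maximize sensitivity and specificity. One standard criterion for such a cut-point is Youden's J (sensitivity + specificity - 1), sketched below on simulated data with scikit-learn; the abstract does not state which criterion was actually used, and the values here are invented.

    import numpy as np
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.default_rng(0)

    # simulated biomarker values, higher on average in acs cases (hypothetical)
    acs = np.r_[np.ones(15), np.zeros(185)].astype(int)
    marker = np.where(acs == 1, rng.normal(900, 300, 200), rng.normal(600, 300, 200))

    fpr, tpr, thresholds = roc_curve(acs, marker)
    auc = roc_auc_score(acs, marker)

    # youden's j: the threshold maximizing sensitivity + specificity - 1
    j = tpr - fpr
    best = np.argmax(j)
    print("auc:", auc)
    print("optimal cut-point:", thresholds[best],
          "sensitivity:", tpr[best], "specificity:", 1 - fpr[best])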
treatment of rats with bay - significantly improved all three indices of rv heart function (psp ± . , +dp/dt ± , -dp/dt ) ± ). hrs of severe pe also caused significant rv dysfunction (psp ± , -dp/dt ) ± ) and treatment with bay - produced protection of rv heart function (psp ± , -dp/dt ) ± ) similar to the hr moderate pe model. conclusion: experimental pe produced significant rv dysfunction, which was ameliorated by treatment of the animals with the soluble guanylate cyclase stimulator, bay - . hospital of the university of pennsylvania, philadelphia, pa; cooper university hospital, camden, nj background: patients who present to the ed with symptoms of potential acute coronary syndrome (acs) can be safely discharged home after a negative coronary computerized tomographic angiography (cta). however, the duration of time for which a negative coronary cta can be used to inform decision making when patients have recurrent symptoms is unknown. objectives: we examined patients who received more than one coronary cta for evaluation of acs to determine whether they had disease progression, as defined by crossing the threshold from noncritical (< % maximal stenosis) to potentially critical disease. methods: we performed a structured comprehensive record search of all coronary ctas performed from to at a tertiary care health system. low-tointermediate risk ed patients who received two or more coronary ctas, at least one from an ed evaluation for potential acs, were identified. patients who were revascularized between scans were excluded. we collected demographic data, clinical course, time between scans, and number of ed visits between scans. record review was structured and done by trained abstractors. our main outcome was progression of coronary stenosis between scans, specifically crossing the threshold from noncritical to potentially critical disease. results: overall, patients met study criteria (median age , interquartile range [iqr] ( . - ); % female; % black). the median time between studies was . months (iqr, . patients did not have stenosis in any vessel on either coronary cta, two studies showed increasing stenosis of < %, and the rest showed ''improvement,'' most due to better imaging quality. no patient initially below the % threshold subsequently exceeded it ( %; % ci, - . %). patients also had varying numbers of ed visits (median number of visits , range - ), and numbers of ed visits for potentially cardiac complaints (median , range - ); were re-admitted for potentially cardiac complaints (for example, chest pain or shortness of breath), and received further provocative cardiac testing, all of which had negative results. conclusion: we did not find clinically significant disease progression within a year time frame in patients who had a negative coronary cta, despite a high number of repeat visits. this suggests that prior negative coronary cta may be able to be used to inform decision making within this time period. . - . ) compared to non tro ct patients. there was no significant difference in image quality between tro ct images and those of dedicated ct scans in any studies performing this comparison. similarly, there was no significant difference between tro ct and other diagnostic modalities in regards to length of stay or admission rate. when compared to conventional coronary angiography as the gold standard for evaluation of cad, tro ct had the following pooled diagnostic accuracy estimates: sensitivity . 
conclusion: tro chest ct is comparable to dedicated pe, coronary, or ad ct in regard to image quality, length of stay, and admission rate and is highly accurate for detecting cad. the utility of tro ct depends on the relative pre-test probabilities of the conditions being assessed and its role is yet to be clearly defined. tro ct, however, involves increased radiation exposure and contrast volume and for this reason clinicians should be selective in its use. background: coronary computed tomographic angiography (ccta) has high sensitivity, specificity, accuracy, and prognostic value for coronary artery disease (cad) and acs. however, how a ccta informs subsequent use of prescription medication is unclear. objectives: to determine if detection of critical or noncritical cad on ccta is associated with initiation of aspirin and statins for patients who presented to the ed with chest pain. we hypothesized that aspirin and statins would be more likely to be prescribed to patients with noncritical disease relative to those without any cad. methods: prospective cohort study of patients who received ccta as part of evaluation of chest pain in the ed or observation unit. patients were contacted and medical records were reviewed to obtain clinical follow-up for up to the year after ccta. the main outcome was new prescription of aspirin or statin. cad severity on ccta was graded as absent, mild ( % to %), moderate ( % to %), or severe ( ‡ %) stenosis. logistic regression was used to assess the association of stenosis severity to new medication prescription; covariates were determined a priori. results: patients who had ccta performed consented to participate in this study or met waiver of consent for record review only (median age, , % female, % black). median follow-up time was days, iqr - days. at baseline, % of the total cohort was already prescribed aspirin and % on statin medication. two hundred seventy nine ( %) patients were found to have stenosis in at least one vessel. in patients with absent, mild, moderate, and severe cad on ccta, aspirin was initiated in %, %, %, and %; statins were initiated in %, %, %, and % of patients. after adjustment for age, race, sex, hypertension, diabetes, cholesterol, tobacco use, and admission to the hospital after ccta, higher grades of cad severity were independently associated with greater post-ccta use of aspirin (or . per grade, % ci . - . , p < . ) and statins (or . , % ci . - . , p < . ). conclusion: greater cad severity on ccta is associated with increased medication prescription for cad. patients with noncritical disease are more likely than patients without any disease to receive aspirin and statins. future studies should examine whether these changes lead to decreased hospitalizations and improved cardiovascular health. background: hess et al. developed a clinical decision rule for patients with acute chest pain consisting of the absence of five predictors: ischemic ecg changes not known to be old, elevated initial or -hour troponin level, known coronary disease, ''typical'' pain, and age over . patients less than required only a single troponin evaluation. objectives: to test the hypothesis that patients less than years old without these criteria are at < % risk for major adverse cardiovascular events (mace) including death, ami, pci, and cabg. methods: we performed a secondary analysis of several combined prospective cohort studies that enrolled ed patients who received an evaluation for acs in an urban ed from to . 
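The Hess rule evaluated in the abstract above classifies a chest pain patient as low risk only when all five predictors are absent: ischemic ECG changes not known to be old, an elevated initial (or serial) troponin, known coronary disease, typical pain, and age over a cutoff. A sketch of applying such a rule programmatically follows; the age cutoff is a placeholder parameter because the published threshold is not recoverable from this text.

    def hess_low_risk(ischemic_ecg_changes, elevated_troponin,
                      known_cad, typical_pain, age, age_cutoff):
        """Low risk only if all five predictors are absent.

        age_cutoff is a placeholder; the published age threshold is not
        recoverable from this text. elevated_troponin should reflect the
        initial (or serial, per protocol) troponin result.
        """
        predictors_present = (
            ischemic_ecg_changes
            or elevated_troponin
            or known_cad
            or typical_pain
            or age > age_cutoff
        )
        return not predictors_present

    # example: a younger patient with none of the predictors is classified low risk
    print(hess_low_risk(False, False, False, False, age=42, age_cutoff=50))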
cocaine users and stemi patients were excluded. structured data collection at presentation included demographics, pain description, history, lab, and ecg data for all studies. hospital course was followed daily. thirty-day follow up was done by telephone. our main outcome was -day mace using objective criteria. the secondary outcome was potential change in ed disposition due to application of the rule. descriptive statistics and % cis were used. results: of visits for potential acs, patients had a mean age of . ± . yrs; % were black and % female. there were patients ( . %) with -day cv events ( dead, ami, pci). sequential removal of patients in order to meet the final rule for patients less than excluded patients based upon: ischemic ecg changes not old (n = , % mace rate), elevated initial troponin level (n = , % mace), known coronary disease (n = , % mace), ''typical'' pain (n = , % mace), and age over (n = , . % mace) leaving patients less than with . % mace [ % ci, . - . %]. of this cohort, % were discharged home from the ed by the treating physician without application of this rule. adding a second negative troponin in patients - years old identified a group of patients with a . % rate of mace [ . - . ] and a % discharge rate. the hess rule appears to identify a cohort of patients at approximately % risk of -day mace, and may enhance discharge of young patients. however, even without application of this rule, the % of young patients at low risk are already being discharged home based upon clinical judgment. background: a clinical decision support system (cdss) incorporates evidence-based medicine into clinical practice, but this technology is underutilized in the ed. a cdss can be integrated directly into an electronic medical record (emr) to improve physician efficiency and ease of use. the christopher study investigators validated a clinical decision rule for patients with suspected pulmonary embolism (pe). the rule stratifies patients using wells' criteria to undergo either d-dimer testing or a ct angiogram (ct). the effect of this decision rule, integrated as a cdss into the emr, on ordering cts has not been studied. objectives: to assess the effect of a mandatory cdss on the ordering of d-dimers and cts for patients with suspected pe. methods: we assessed the number of cts ordered for patients with suspected pe before and after integrating a mandatory cdss in an urban community ed. physicians were educated regarding cdss use prior to implementation. the cdss advised physicians as to whether a negative d-dimer alone excluded pe or if a ct was required based on wells' criteria. the emr required physicians to complete the cdss prior to ordering the ct. however, physicians maintained the ability to order a ct regardless of the cdss recommendation. patients ‡ years of age presenting to the ed with a chief complaint of chest pain, dyspnea, syncope, or palpitations were included in the data analysis. we compared the proportion of d-dimers and cts ordered during the -month periods immediately before and after implementing the cdss. all physicians who worked in the ed during both time periods were included in the analysis. patients with an allergy to intravenous contrast agents, renal insufficiency, or pregnancy were excluded. results were analyzed using a chi-square test. results: a total of , patients were included in the data analysis ( pre-and post-implementation). cts were ordered for patients ( . %) in the pre-implementation group and patients ( . 
%) in the post-implementation group; p = . . a d-dimer was ordered for patients ( . %) in the pre-implementation group and patients ( . %) in the post-implementation group; p = . . in this single-center study, emr integration of a mandatory cdss for evaluation of pe did not significantly alter ordering patterns of cts and d-dimers. identification of patients with low-risk pulmonary emboli suitable for discharge from the emergency department mike zimmer, keith e. kocher university of michigan, ann arbor, mi background: recent data, including a large, multicenter randomized controlled trial, suggest that a low-risk cohort of patients diagnosed with pulmonary embolism (pe) exists who can be safely discharged from the ed for outpatient treatment. objectives: to determine if there is a similar cohort at our institution who have a low rate of complications from pe suitable for outpatient treatment. methods: this was a retrospective chart review at a single academic tertiary referral center with an annual ed volume of , patients. all adult ed patients who were diagnosed with pe during a -month period from / / through / / were identified. the pulmonary embolism severity index (pesi) score, a previously validated clinical decision rule to risk stratify patients with pe, was calculated. patients with high pesi (> ) were excluded. additional exclusion criteria included patients who were at high risk of complications from initiation of therapeutic anticoagulation and those patients with other clear indications for admission to the hospital. the remaining cohort of patients with low risk pe (pesi £ ) was included in the final analysis. outcomes were measured at and days after pe diagnosis and included death, major bleeding, and objectively confirmed recurrent venous thromboembolism (vte). results: during the study period, total patients were diagnosed with pe. there were ( %) patients categorized as ''low risk'' (pesi £ ), with removed because of various pre-defined exclusion criteria. of the remaining ( %) patients suitable for outpatient treatment, patients ( . %; % ci, . % - . %) had one or more negative outcomes by days. this included ( . %; % ci, % - . %) major bleeding events, ( . %; % ci, % - . %) recurrent vte, and ( . %; % ci, % - . %) deaths. none of the deaths were attributable to pe or anticoagulation. one patient suffered both a recurrent vte and died within days. both patients who died within days were transitioned to hospice care because of worsening metastatic burden. at days, there was bleeding event ( . %; % ci, % - . %), no recurrent vte, and no deaths. the average hospital length of stay for these patients was . days (sd ± . ). conclusion: over % of our patients diagnosed with pe in the ed may have been suitable for outpatient treatment, with % suffering a negative outcome within days and . % suffering a negative outcome within days. in addition, the average hospital length of stay for these patients was . days, which may represent a potential cost savings if these patients had been managed as outpatients. our experience supports previous studies that suggest the safety of outpatient treatment of patients diagnosed with pe in the ed. given the potential savings related to a decreased need for hospitalization, these results have health policy implications and support the feasibility of creating protocols to facilitate this clinical practice change. background: chest x-rays (cxrs) are commonly obtained on ed chest pain patients presenting with suspected acute coronary syndrome (acs). 
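The CDSS abstract above implements the Christopher-study approach: Wells' criteria stratify the patient as PE unlikely or PE likely, unlikely patients are tested with a D-dimer, and only a positive D-dimer (or a likely score) leads to CT angiography. A sketch of that branching logic follows; both cutoffs are written as parameters, and the values in the example call are conventional placeholders rather than figures taken from this text.

    def pe_workup_recommendation(wells_score, d_dimer,
                                 wells_unlikely_cutoff, d_dimer_cutoff):
        """Return the recommended next step for suspected PE.

        Mirrors the rule described above: 'PE unlikely' patients get a D-dimer,
        and a negative D-dimer excludes PE without CT; otherwise CT angiography
        is advised. Both cutoffs are placeholders, not values from this text.
        """
        if wells_score > wells_unlikely_cutoff:
            return "order CT angiogram"            # PE likely: imaging regardless of D-dimer
        if d_dimer is None:
            return "order D-dimer"                 # PE unlikely: D-dimer first
        if d_dimer < d_dimer_cutoff:
            return "PE excluded; no CT required"   # negative D-dimer excludes PE
        return "order CT angiogram"                # positive D-dimer: imaging

    # example with placeholder cutoffs
    print(pe_workup_recommendation(3, 300, wells_unlikely_cutoff=4, d_dimer_cutoff=500))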
a recently derived clinical decision rule (cdr) determined that patients who have no history of congestive heart failure, have never smoked, and have a normal lung examination do not require a cxr in the ed. objectives: to validate the diagnostic accuracy of the hess cxr cdr for ed chest pain patients with suspected acs. methods: this was a prospective observational study of a convenience sample of chest pain patients over years old with suspected acs who presented to a single urban academic ed. the primary outcome was the ability of the cdr to identify patients with abnormalities on cxr requiring acute ed intervention. data were collected by research associates using the chart and physician interviews. abnormalities on cxr and specific interventions were predetermined, with a positive cxr defined as one with abnormality requiring ed intervention, and a negative cxr defined as either normal or abnormal but not requiring ed intervention. the final radiologist report was used as a reference standard for cxr interpretation. a second radiologist, blinded to the initial radiologist's report, reviewed the cxrs of patients meeting the cdr criteria to calculate inter-observer agreement. patients were followed up by chart review and telephone interview days after presentation. results: between january and august , patients were enrolled, of whom ( %) were excluded and ( . %) did not receive cxrs in the ed. of the remaining patients, ( %) met the cdr. the cdr identified all patients with a positive cxr (sensitivity = %, %ci - %). the cdr identified of the patients with a negative cxr (specificity = %, %ci - %). the positive likelihood ratio was . ( %ci . - . ). inter-observer agreement between radiologists was substantial (kappa = . , %ci . - . ). telephone contact was made with % of patients and all patient charts were reviewed at days. none had any adverse events related to a background: increasing the threshold to define a positive d-dimer in low-risk patients could reduce unnecessary computed tomographic pulmonary angiography (ctpa) for suspected pe. this strategy might increase rates of missed pe and missed pneumonia, the most common non-thromboembolic finding on ctpa that might not otherwise be diagnosed. objectives: measure the effect of doubling the standard d-dimer threshold for ' 'pe unlikely'' revised geneva (rgs) or wells' scores on the exclusion rate, frequency, and size of missed pe and missed pneumonia. methods: prospective enrollment at four academic us hospitals. inclusion criteria required patients to have at least one symptom or sign and one risk factor for pe, and have -channel ctpa completed. pretest probability data were collected in real time and the d-dimer was measured in a central laboratory. criterion standard for pe or pneumonia consisted of cpta interpretation by two independent radiologists combined with necessary treatment plan. subsegmental pe was defined as total vascular obstruction < %. patients were followed for outcome at days. proportions were compared with % cis. results: of patients enrolled, ( %) were pe+ and ( %) had pneumonia. with rgs£ and standard threshold (< ng/ml), d-dimer was negative in / ( %, % ci: - %), and / were pe+ (posterior probability . %, % ci: - . %). with rgs£ and a threshold < ng/ml, d-dimer was negative in / ( %, - %) and / ( . %, . - . %) were pe+, but / missed pes were subsegmental, and none had concomitant dvt. the posterior probability for pneumonia among patients with rgs&# ; and d-dimer< was / ( . 
%, - %) which compares favorably to the posterior probability of / ( . %, - %) observed with rgs& # ; and d-dimer< ng/ml. of the ( %) patients who also had plain film cxr, radiologists found an infiltrate in only . use of wells£ produced similar results as the rgs&# ; for exclusion rate and posterior probability of both pe and pneumonia. conclusion: doubling the threshold for a positive d-dimer with a pe unlikely pretest probability can significantly reduce ctpa scanning with a slightly increased risk of missed isolated subsegmental pe, and no increase in rate of missed pneumonia. background: the limitations of developing world medical infrastructure require that patients are transferred from health clinics only when the patient care needs exceed the level of care at the clinic and the receiving hospital can provide definitive therapy. to determine what type of definitive care service was sought when patients were transferred from a general outpatient clinic operating monday through friday from : am to : pm in rural haiti to urban hospitals in port-au-prince. methods: design -prospective observational review of all patients for whom transfer to a hospital was requested or for whom a clinic ambulance was requested to an off-site location to assist with patient care. setting -weekday, daytime only clinic in titanyen, haiti. participants/subjects -consecutive series of all patients for whom transfer to another health care facility or for whom an ambulance was requested during the time period of / / - / / and / / - / / . results: between / / - / / and / / - / / patients were identified who needed to be transferred to a higher level of care. sixteen patients ( . %) presented with medical complaints, ( . %) were trauma patients, ( . %) were surgical, and ( . %) were in the obstetric category. within these categories, patients were pediatric and non-trauma patients required blood transfusion. conclusion: while trauma services are often focused on in rural developing world medicine, the need for obstetric care and blood transfusion constituted six ( . %) cases in our sample. these patients raise important public health, planning, and policy questions relating to access to prenatal care and the need to better understand transfusion medicine utilization among rural haitian patients with non-trauma related transfusion needs. the data set is limited by sample size and single location of collection. another limitation of understanding the needs is that many patients may not present to the clinic for their health care needs in certain situations if they have knowledge that the resources to provide definitive care are unavailable. background: the practice of emergency medicine in japan has been unique in that emergency physicians are mostly engaged in critical care and trauma with a multi-specialty model. for the last decade with progress in medicine, an aging population with complicated problems, and institution of postgraduate general clinical training, the us model emergency medicine with single-specialty model has been emerging throughout japan. however, the current status is unknown. objectives: the objective of this study was to investigate the current status of implementation of the us model emergency medicine at emergency medicine training institutions accredited by the japanese association for acute medicine (jaam). 
methods: the er committee of the jaam, the most prestigious professional organization in japanese emergency medicine, conducted the survey by sending questionnaires to accredited emergency medicine training institutions. results: valid responses obtained from facilities were analyzed. us model em was provided in facilities ( % of facilities), either full time ( hours a day, seven days a week; facilities) or part time (less than hours a day; facilities). among these us model facilities, % have between and beds. the annual number of ed visits was less than , in %, and % have between , and , ambulance transfers per year. the number of emergency physicians was less than in % of the facilities. postgraduate general clinical training was offered at the us model ed in facilities, and ninety hospitals adopted us model em after , when a -year period of postgraduate general clinical training became mandatory for all medical graduates. sixty-four facilities provided a residency program to train us model emergency physicians, and another institutions were planning to establish one. conclusion: us model em has emerged and become commonplace in japan. background factors including advances in medicine, an aging population, and the mandatory postgraduate general clinical training system are considered to be contributing factors. erkan gunay, ersin aksay, ozge duman atilla, nilay zorbalar, savas sezik tepecik research and training hospital, izmir, turkey background: workplace safety and occupational health problems are growing issues, especially in developing countries, as a result of industrial automation and technological improvements. occupational injuries are preventable, but they occasionally cause morbidity and mortality, resulting in work day loss and financial problems. hand injuries account for one-third of all traumatic injuries, and the hand is the most frequently injured body part in occupational accidents. objectives: we aim to evaluate patients with occupational upper extremity injuries for demographic characteristics, injury types, and work day loss. methods: trauma patients over years old admitted to our emergency department with an occupational upper extremity injury were prospectively evaluated from . . to . . . patients with one or more of digit, hand, forearm, elbow, humerus, and shoulder injuries were included. exclusion criteria were multitrauma, patient refusal to participate, and insufficient data. patients were followed up through the hospital information system and by phone for work day loss and final diagnosis. results: during the study period there were patients with an occupational upper extremity injury. a total of ( . %) patients were included. patients were . % male, . % were between the ages of and , and the mean age was . ± . years. . % of the patients were from the metal and machinery sector, and primary education was the highest education level for . % of the patients. the most frequently injured parts were the fingers, with the highest rates for the index finger and thumb. crush injury was the most common injury type. . % (n = ) of the patients were discharged after treatment in the emergency department. tendon injuries, open fractures, and high-degree burns were the reasons for admission to clinics. mean work day loss was . ± . days, and this increased for patients with laboratory or radiologic studies, consultant evaluation, or admission. the - age group had a significantly lower work day loss average.
conclusion: evaluating occupational injury characteristics and risks is essential for identifying preventive measures and actions. guided by this study, preventive actions focusing on high-risk sectors and patients may be the key to avoiding occupational injuries and creating safer workplace environments, reducing financial and public health burdens. background: as emergency medicine (em) gains increased recognition and interest in the international arena, a growing number of training programs for emergency health care workers have been implemented in the developing world through international partnerships. objectives: to evaluate the quality and appropriateness of an internationally implemented emergency physician training program in india. methods: physicians participating in an internationally implemented em training program in india were recruited to participate in a program evaluation. a mixed-methods design was used, including an anonymous online survey and semi-structured focus groups. the survey assessed the research, clinical, and didactic training provided by the program. demographics and information on past and future career paths were also collected. the focus group discussions centered on program successes and challenges. results: fifty of eligible trainees ( %) participated in the survey. of the respondents, the vast majority were indian; % were female, and all were between the ages of and years (mean age years). all but two trainees ( %) intend to practice em as a career. one-third listed a high-income country first for preferred practice location and half listed india first. respondents directly endorsed the program structure and content, and they demonstrated gains in self-rated knowledge and clinical confidence over their years of training. active challenges identified included: ( ) insufficient quantity and inconsistent quality of indian faculty, ( ) administrative barriers to academic priorities, and ( ) a persistent threat of brain drain if local opportunities are inadequate. conclusion: implementing an international emergency physician training program with limited existing local capacity is a challenging endeavor. overall, this evaluation supports the appropriateness and quality of this partnership model for em training. one critical challenge is achieving a robust local faculty. early negotiations are recommended to set educational priorities, which include assuring access to em journals. attrition of graduated trainees to high-income countries due to better compensation or limited in-country opportunities continues to be a threat to long-term local capacity building. background: with an increasing frequency and intensity of man-made and natural disasters, and a corresponding surge in interest in international emergency medicine (iem) and global health (gh), the number of iem and gh fellowships is constantly growing. there are currently iem and gh fellowships, each with a different curriculum. several articles have proposed the establishment of core curriculum elements for fellowship training. to the best of our knowledge, no study has examined whether iem and gh fellows are actually fulfilling these criteria. objectives: this study sought to examine whether current iem and gh fellowships are consistently meeting these core curricula. methods: an electronic survey was administered to current iem and gh fellowship directors, current fellows, and recent graduates of a total of programs.
survey respondents reported their amount of exposure to previously published core curriculum components: em system development, humanitarian assistance, disaster response, and public health. a pooled analysis comparing overall responses of fellows to those of program directors was performed using a two-sample t-test. results: response rates were % (n = ) for program directors and % (n = ) for current and recent fellows. programs varied significantly in terms of their emphasis on and exposure to six proposed core curriculum areas: em system development, em education development, humanitarian aid, public health, ems, and disaster management. only % of programs reported having exposure to all four core areas. as many as % of fellows reported knowing their curriculum only somewhat or not at all prior to starting the program. conclusion: many fellows enter iem and gh fellowships without a clear sense of what they will get from their training. as each fellowship program has different areas of curriculum emphasis, we propose not to enforce any single core curriculum. rather, we suggest the development of a mechanism to allow each fellowship program to present its curriculum in a more transparent manner. this will allow prospective applicants to have a better understanding of the various programs' curricula and areas of emphasis. background: advance warning of probable intensive care unit (icu) admissions could allow the bed placement process to start earlier, decreasing ed length of stay and relieving overcrowding conditions. however, physicians and nurses poorly predict a patient's ultimate disposition from the emergency department at triage. a computerized algorithm can use commonly collected data at triage to accurately identify those who will likely need icu admission. objectives: to evaluate an automated computer algorithm at triage to predict icu admission and -day in-hospital mortality. methods: retrospective cohort study at a , visit/year level i trauma center/tertiary academic teaching hospital. all patients presenting to the ed between / / and / / were included in the study. the primary outcome measure was icu admission from the emergency department. the secondary outcome measure was -day all-cause in-hospital mortality. patients discharged or transferred before days were considered to be alive at days. triage data include age, sex, acuity (emergency severity index), blood pressure, heart rate, pain scale, respiratory rate, oxygen saturation, temperature, and a nurse's free text assessment. a latent dirichlet allocation algorithm was used to cluster words in triage nurses' free text assessments into topics. the triage assessment for each patient was then represented as a probability distribution over these topics. logistic regression was then used to determine the prediction function. results: a total of , patients were included in the study. . % were admitted to the icu and . % died within days. these patients were then randomly allocated to train (n = , ; %) and test (n = , ; %) data sets. the area under the receiver operating characteristic curve (auc) when predicting icu background: at the saem annual meeting, we presented the derivation of two hospital admission prediction models adding coded chief complaint (ccc) data from a published algorithm (thompson et al. acad emerg med ; : - ) to demographic, ed operational, and acuity (emergency severity index (esi)) data.
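as a brief aside before continuing, the triage prediction pipeline described in the icu-admission abstract above (topic modeling of nurses' free-text assessments with latent dirichlet allocation, concatenated with structured triage variables, followed by logistic regression) can be sketched as follows. this is only a minimal illustration: the notes, vital signs, field names, and labels are hypothetical placeholders, and scikit-learn is assumed rather than the authors' actual tooling.

```python
# minimal sketch of the triage icu-prediction pipeline described above:
# nurse free-text assessments -> LDA topic distributions, concatenated with
# structured triage variables, then logistic regression.  all notes, numbers,
# and field names below are hypothetical placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

notes = [
    "somnolent, hypotensive, poor air entry on arrival",
    "twisted ankle playing soccer, ambulatory, mild pain",
    "chest pain radiating to arm, diaphoretic",
    "laceration to left thumb, bleeding controlled",
    "confused, febrile, rigors, nursing home resident",
    "sore throat two days, tolerating fluids",
]
# structured triage data: age, acuity (esi), systolic bp, o2 saturation (made up)
vitals = np.array([[78, 2, 82, 88], [23, 4, 121, 99], [64, 2, 95, 94],
                   [35, 5, 130, 99], [81, 2, 90, 91], [19, 5, 118, 100]])
icu = np.array([1, 0, 1, 0, 1, 0])  # 1 = admitted to icu

# 1) topic model over the free-text assessments
counts = CountVectorizer(stop_words="english").fit_transform(notes)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# 2) represent each visit as structured variables plus topic probabilities,
#    then fit a logistic regression prediction function
X = np.hstack([vitals, topics])
model = LogisticRegression(max_iter=1000).fit(X, icu)

# 3) in-sample discrimination only; a real study would score a held-out test set
print("auc:", roc_auc_score(icu, model.predict_proba(X)[:, 1]))
```

in a real validation, as the abstract describes, the auc would be computed on a separate test split rather than on the training data.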
objectives: we hypothesized that these models would be validated when applied to a separate retrospective cohort, justifying prospective evaluation. methods: we conducted a retrospective, observational validation cohort study of all adult ed visits to a single tertiary care center (census: , /yr) ( / / - / / ). we downloaded from the center's clinical tracking system demographic (age, sex, race), ed operational (time and day of arrival), esi, and chief complaint data on each visit. we applied the derived ccc hospital admission prediction models (all identified ccc categories and ccc categories with significant odds of admission from multivariable logistic regression in the derivation cohort) to the validation cohort to predict odds of admission and compared them to prediction models that consisted of demographic, ed operational, and esi data, adding each category to subsequent models in a stepwise manner. model performance is reported by area-under-the-curve (auc) data and %ci. signs, pain level, triage level, -hour return, number of past visits in the previous year, injury, and one of chief complaint codes (representing % of all visits in the database). outputs for training included ordering of a complete blood count, basic chemistry (electrolytes, blood urea nitrogen, creatinine), cardiac enzymes, liver function panel, urinalysis, electrocardiogram, x-ray, computed tomography, or ultrasound. once trained, it was used on the nhamcs-ed database, and predictions were generated. predictions were compared with documented physician orders. outcomes included the percent of total patients who were correctly pre-ordered, sensitivity (the percent of patients who had an order that were correctly predicted), and the percent over-ordered. waiting time for correctly pre-ordered patients was highlighted, to represent a potential reduction in length of stay achieved by pre-ordering. los for patients over-ordered was highlighted to see if over-ordering may cause an increase in los for those patients. unit cost of the test was also highlighted, as taken from the medicare fee schedule. physician times. however, during peak ed census times, many patients with completed tests and treatment initiated by triage await discharge by the next assigned physician. objectives: determine if a physician-led discharge disposition (dd) team can reduce the ed length of stay (los) for patients of similar acuity who are ultimately discharged compared to standard physician team assignment. methods: this prospective observational study was performed from / to / at an urban tertiary referral academic hospital with an annual ed volume of , visits. only emergency severity index level patients were evaluated. the dd team was scheduled weekdays from : until : . several ed beds were allocated to this team. the team was composed of one attending physician and either one nurse and a tech or two nurses. comparisons were made between los for discharged patients originally triaged to the main ed side who were seen by the dd team versus the main side teams. time from triage physician to team physician, team physician to discharge decision time, and patient age were compared by unpaired t-test. differences were studied for the number of patients receiving x-rays, ct scans, labs, and medications. results: dd team mean los in hours for discharged patients was shorter at . ( % ci: . - . , n = ) compared to . ( % ci: . - . , n = ) on the main side, p < . . the mean time from triage physician to dd team physician was . hours ( % ci: . - .
, n = ) versus to . hours ( % ci: . - . , n = ) to main side physician, p < . . the dd team physician mean time to discharge decision was . hour ( % ci: . - . , n = ) compared to . hours ( % ci: . - . , n = ) for main side physician, p < . . the dd team patients' mean age was . years ( % ci: . - . , n = ) compared to main side patients' mean age of . years ( % ci: . - . , n = .) the dd team patients (n = ) received fewer x-rays ( % vs. %), ct scans ( % vs. %), labs ( % vs. %), and medications ( % vs. %) than main side patients (n = ), p < . for all compared. conclusion: the dd team complements the advanced triage process to further reduce los for patients who do not require extended ed treatment or observation. the dd team was able to work more efficiently because its patients tended to be younger and had fewer lab and imaging tests ordered by the triage physician compared to patients who were later seen on the ed main side. ed objectives: to evaluate the association between ed boarding time and the risk of developing hapu. methods: we conducted a retrospective cohort study using administrative data from an academic medical center with an adult ed with , annual patient visits. all patients admitted into the hospital through the ed / / - / / were included. development of hapu was determined using the standardized, national protocol for cms reporting of hapu. ed boarding time was defined as the time between an order for inpatient admission and transport of the patient out of the ed to an in-patient unit. we used a multivariate logistic regression model with development of a hapu as the outcome variable, ed boarding time as the exposure variable, and the following variables as covariates: age, sex, initial braden score, and admission to an intensive care unit (icu) from the ed. the braden score is a scale used to determine a patient's risk for developing a hapu based on known risk factors. a braden score is calculated for each hospitalized patient at the time of admission. we included braden score as a covariate in our model to determine if ed boarding time was a predictor of hapu independent of braden score. results: of , patients admitted to the hospital through the ed during the study period, developed a hapu during their hospitalization. clinical characteristics are presented in the table. per hour of ed boarding time, the adjusted or of developing a hapu was . ( % ci . - . , p = . ). a median of patients per day were admitted through the ed, accumulating hours of ed boarding time per day, with each hour of boarding time increasing the risk of developing a hapu by %. conclusion: in this single-center, retrospective study, longer ed boarding time was associated with increased risk of developing a hapu. queried ed and inpatient nurses and compared their opinions toward inpatient boarding. it also assessed their preferred boarding location if they were patients. objectives: this study queried ed and inpatient nurses and compared their opinions toward inpatient boarding. methods: a survey was administered to a convenience sample of ed and ward nurses. it was performed in a -bed academic medical center ( , admissions/yr) with a -bed ed ( , visits/yr). nurses were identified as ed or ward and whether they had previously worked in the ed. the nurses were asked if there were any circumstances where admitted patients should be boarded in the ed or inpatient hallways. they were also asked their preferred location if they were admitted as a patient. 
six clinical scenarios were then presented and their opinions on boarding were queried. results: ninety nurses completed the survey; ( %) were current ed nurses (ced), ( %) had previously worked in the ed (ped). for the entire group, ( %) believed admitted patients should board in the ed. overall, ( %) were opposed to inpatient boarding, with % of ced versus % of current ward (cw) nurses (p < . ) and % of ped versus % of nurses never having worked in the ed (ned) opposed (p < . ). if admitted as patients themselves, overall ( %) preferred inpatient boarding, with % of ced versus % of cw nurses (p < . ) and % of ped versus % of ned nurses (p = . ) preferring inpatient boarding. for the six clinical scenarios, significant differences in opinion regarding inpatient boarding existed in all but two cases: a patient with stable copd but requiring oxygen and an intubated, unstable sepsis patient. conclusion: ward nurses and those who have never worked in the ed are more opposed to inpatient boarding than ed nurses and nurses who have worked previously in the ed. nurses admitted as patients seemed to prefer not being boarded where they work. ed and ward nurses seemed to agree that unstable or potentially unstable patients should remain in the ed. weeks. staff satisfaction was evaluated through pre/post-shift and study surveys; administrative data (physician initial assessment (pia), length of stay (los), patients leaving without being seen (lwbs) and against medical advice [lama]) were collected from an electronic, real-time ed information system. data are presented as proportions and medians with interquartile ranges (iqr); bivariable analyses were performed. results: ed physicians and nurses expected the intervention to reduce the los of discharged patients only. pia decreased during the intervention period ( vs minutes; p < . ). no statistically/clinically significant differences were observed in the los; however, there was a significant reduction in the lwbs ( . % to . %, p = . ) and lama ( . % to . %, p = . ) rates. while there was a reduction of approximately patients seen per physician in the affected ed area, the total number of patients seen on that unit increased by approximately patients/day. overall, compared to days when there was no extra shift, % of emergency physicians stated their workload decreased and % felt their stress level at work decreased. conclusion: while this study did not demonstrate a reduction in the overall los, it did reduce pia times and the proportion of lwbs/lama patients. while physicians saw fewer patients during the intervention study period, the overall patient volume increased and satisfaction among ed physicians was rated higher. provider- and hospital-level variation in admission rates and -hour return admission rates jameel abualenain , william frohna , robert shesser , ru ding , mark smith , jesse m. pines the george washington university, washington, dc; washington hospital center, washington, dc background: decisions for inpatient versus outpatient management of ed patients are the most important and costliest decisions made by emergency physicians, but there is little published on the variation in the decision to admit among providers or whether there is a relationship between a provider's admission rate and the proportion of their patients who return within hours of the initial visit and are subsequently admitted ( h-ra). objectives: we explored the variation in provider-level admission rates and h-ra rates, and the relationship between the two.
methods: a retrospective study using data from three eds with the same information system over varying time periods: washington hospital center (whc) ( - ), franklin square hospital center (fshc) , and union memorial hospital (umh) . patients were excluded if they left without being seen, left against medical advice, were fast-track or psychiatric patients, or were aged < years. physicians with < ed encounters or an admission rate < % were excluded. logistic regression was used to assess the relationship between physician-level h-ra and admission rates, adjusting for patient age, sex, race, and hospital. results: , ed encounters were treated by physicians. mean patient age was years (sd ); % were male and % black. admission rates differed between hospitals (whc = %, umh = %, and fshc = %), as did the h-ra (whc = . %, umh = . %, and fshc = . %). across all hospitals, there was great variation in individual physician admission rates ( . %- . %). the h-ra rates were quite low, but demonstrated a similar magnitude of individual variation ( . %- . %). physicians in the highest admission rate quintile had lower odds of h-ra (or . , % ci . - . ) compared to the lowest admission rate quintile, after adjusting for other factors. no intermediate admission rate quintiles ( nd, rd, or th) were significantly different from the lowest admission rate quintile with regard to h-ra. conclusion: there is more than three-fold variation in individual physician admission rates, indicating great variation among physicians in both hospital admission rates and h-ra. the highest admitters have the lowest h-ra; however, evaluating the causes and consequences of such significant variation needs further exploration, particularly in the context of health reform efforts aimed at reducing costs. background: ed scribes have become an effective means to assist emergency physicians (eps) with clinical documentation and improve physician productivity. scribes have been most often utilized in busy community eds, and their utility and functional integration into an academic medical center with resident physicians are unknown. objectives: to evaluate resident perceptions of attending physician teaching and interaction after the introduction of scribes at an em residency training program, measured through an online survey. residents in this study were not working with the scribes directly, but were interacting indirectly through attending physician use of scribes during ed shifts. methods: an online ten-question survey was administered to residents of a midwest academic emergency medicine residency program (pgy -pgy program, annual residents), months after the introduction of scribes into the ed. scribes were introduced as emr documentation support (epic , epic systems inc.) for attending eps while evaluating primary patients and supervising resident physicians. questions investigated em resident demographics and perceptions of scribes (attending physician interaction and teaching, effect on resident learning, willingness to use scribes in the future), using likert-scale responses ( = minimal, = maximum) and a graduated percentage scale used to quantify relative values, where applicable. data were analyzed using kruskal-wallis and mann-whitney u tests. results: twenty-one of em residents ( %) completed the survey ( % male; % pgy , % pgy , % pgy ). four residents had prior experience with scribes. scribes were felt to have no effect on attending eps' direct resident interaction time (mean score . , sd . ), time spent bedside teaching ( . , sd .
), or quality of teaching ( . , sd . ), as well as no effect on residents' overall learning process ( . , sd . ). however, residents felt positive about utilizing scribes at their future occupation site ( . , sd . ). no response differences were noted for prior experience, training level, or sex. conclusion: when scribes are introduced at an em residency training site, residents of all training levels perceive it as a neutral interaction, when measured in terms of perceived time with attending eps and quality of the teaching when scribes are present. the effect of introduction of an electronic medical record on resident productivity in an academic emergency department shawn london, christopher sala university of connecticut school of medicine, farmington, ct background: there are few available data describing the effect of implementation of an electronic medical record (emr) on provider productivity in the emergency department, and, to our knowledge, no studies that address this issue for housestaff in particular. objectives: we seek to quantify the changes in provider productivity pre- and post-emr implementation to support our hypothesis that resident clinical productivity based on patients seen per hour will be negatively affected by emr implementation. methods: the academic emergency department at hartford hospital, the principal clinical site in the university of connecticut emergency medicine residency, sees over , patients on an annual basis. this environment is unique in that, pre-emr, patient tracking and orders were performed electronically using the sunrise system (eclipsys corp) for over years prior to conversion to the allscripts ed emr in october, for all aspects of ed care. the investigators completed a random sample of day/evening/night/weekend shift productivity to obtain monthly aggregate productivity data (patients seen per hour) by year of training. results: there was an initial . % decrease in productivity for pgy- residents, from . patients per hour on average in the three blocks preceding activation of the emr to . patients per hour in the three subsequent blocks. pgy performance returned to baseline in the subsequent three months to . patients per hour. there was no change noted in patients seen per hour for pgy- and pgy- residents. conclusion: while many physicians tend to assume that emrs pose a significant barrier to productivity in the ed, in our academic emergency department there was no lasting change in resident productivity based on the patients seen per hour metric. the minor decrease which did occur in pgy- residents was transient and was not apparent months after the emr was implemented. our experience suggests that the decrease in the rate of patients seen per hour in the resident population should not be considered justification to delay or avoid implementation of an emr in the emergency department. emory university, atlanta, ga; children's healthcare of atlanta, atlanta, ga background: variation in physician practice is widely prevalent and highlights an opportunity for quality improvement and cost containment. monitoring resources used in the management of common pediatric emergency department (ed) conditions has been suggested as an ed quality metric. objectives: to determine if providing ed physicians with severity-adjusted data on resource use and outcomes, relative to their peers, can influence practice patterns.
methods: data on resource use by physicians were extracted from electronic medical records at a tertiary pediatric ed for four common conditions in mid-acuity (emergency severity index level ): fever, head injury, respiratory illness, and gastroenteritis. condition-relevant resource use was tracked for lab tests (blood count, chemistry, crp), imaging (chest x-ray, abdominal x-ray, head ct scan, abdominal ct scan), intravenous fluids, parenteral antibiotics, and intravenous ondansetron. outcome measures included admission to hospital and ed length of stay (los); -hr return to ed (rr) was used as a balancing measure. scorecards were constructed using box plots to show physicians their practice patterns relative to peers (the figure shows an example of the scorecard for gastroenteritis for one physician, showing resource use rates for iv fluids and labs). blinded scorecards were distributed quarterly for five quarters using rolling-year averages. a pre/post-intervention analysis was performed with sep , as the intervention date. fisher's exact and wilcoxon rank sum tests were used for analysis. results: we analyzed , patient visits across two hospitals ( , pre- and , post-intervention), comprising . % of the total ed volume during the study period. patients were seen by physicians (mean patients/physician). the table shows overall physician practice in the pre- and post-intervention periods. significant reductions in resource use were seen for abdominal/pelvic ct scans, head ct scans, chest x-rays, iv ondansetron, and admission to hospital. ed los decreased from min to min (p = . ). there was no significant change in the -hr return rate during the study period ( . % pre-, . % post-intervention). conclusion: feedback on comprehensive practice patterns including resource use and quality metrics can influence physician practice on commonly used resources in the ed. billboards, via iphone application, twitter, and text messaging. there is a paucity of data describing the accuracy of publicly posted ed wait times. objectives: to examine the accuracy of publicly posted wait times of four emergency departments within one hospital system. methods: a prospective analysis of four eds' posted wait times in comparison to the wait times for actual patients. the main hospital system calculated and posted ed wait times every twenty minutes for all four system eds. a consecutive sample of all patients who arrived / over a -week period during july and august was included. an electronic tracking system identified the patient arrival date and the actual incurred wait time. data consisted of the arrival time, actual wait time, hospital census, budgeted hospital census, and the posted ed wait time. for each ed the difference was calculated between the publicly posted ed wait time at the time of the patient's arrival and the patient's actual ed wait time. the average wait times and average wait time error between the ed sites were compared using a two-tailed student's t-test. the correlation coefficient between the differences in predicted/actual wait times was also calculated for each ed. results: there were wait times within the four eds included in the analysis. the average wait time (in minutes) at each facility was: . (± . ) for the main ed, . (± . ) for freestanding ed (fed) # , . (± . ) for fed # , and . (± . ) for the small community ed. the average wait time error (in minutes) for each facility was (± . ) for the main ed, (± . ) for fed # , (± . ) for fed # , and (± . ) for the community hospital ed.
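as an aside, the wait-time accuracy analysis just described, a per-patient error between the posted and actually incurred wait, summarized per facility and correlated with the actual waits, can be sketched as follows. the minutes below are hypothetical, and numpy/scipy are assumed.

```python
# minimal sketch of the wait-time accuracy analysis described above: for each
# patient, error = posted wait time at arrival minus the actually incurred wait;
# errors are then summarized per facility and related to the actual wait times.
# the arrays below are hypothetical minutes, not study data.
import numpy as np
from scipy import stats

posted = np.array([15, 30, 30, 45, 60, 20, 25, 40])   # posted at each arrival
actual = np.array([22, 41, 28, 65, 90, 18, 37, 55])   # actually incurred

error = posted - actual
print("mean actual wait:", actual.mean(), "min")
print("mean wait-time error:", error.mean(), "min (negative = posted underestimates)")

# correlation between actual wait time and the magnitude of the error
r, p = stats.pearsonr(actual, np.abs(error))
print(f"pearson r = {r:.2f}, p = {p:.3f}")

# two-tailed t-test comparing errors between two facilities (hypothetical split)
t, p2 = stats.ttest_ind(error[:4], error[4:])
print(f"t = {t:.2f}, p = {p2:.3f}")
```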
the results from each ed were statistically significant for both average wait time and average wait time error (p < . ). there was a positive correlation between the average wait time and average wait time error, with r-values of . , . , . , and . for the main ed, fed # , fed # , and the small community hospital ed, respectively. each correlation was statistically significant; however, no correlation was found between the number of beds available (budgeted-actual census) and average wait times. conclusion: publicly posted ed wait times are accurate for facilities with fewer than ed visits per month. they are not accurate for eds with more than visits per month. reduction of pre-analytic laboratory errors in the emergency department using an incentive-based system benjamin katz, daniel pauze, karen moldveen albany medical center, albany, ny background: over the last decade, there has been an increased effort to reduce medical errors of all kinds. laboratory errors have a significant effect on patient care, yet they are usually avoidable. several studies suggest that up to % of laboratory errors occur during the pre- or post-analytic phase. in other words, errors occur during specimen collection and transport or reporting of results, rather than during laboratory analysis itself. objectives: in an effort to reduce pre-analytic laboratory errors, the ed instituted an incentive-based program for the clerical staff to recognize and prevent specimen labeling errors from reaching the patient. this study sought to demonstrate the benefit of this incentive-based program. methods: this study examined a prospective cohort of ed patients over a three-year period in a tertiary care academic ed with an annual census of , . as part of a continuing quality improvement process, laboratory specimen labeling errors are screened by clerical staff by reconciling laboratory specimen labels with laboratory requisition labels. the number of ''near-misses'' or mismatched specimens captured by each clerk was then blinded to all patient identifiers and was collated at monthly intervals. due to poor performance in , an incentive program was introduced in early by which the clerk who captured the most mismatched specimens would be awarded a $ gift card on a quarterly basis. the total number of missed laboratory errors was then recorded on a monthly basis. investigational data were analyzed using bivariate statistics. background: most studies on operational research have focused on academic medical centers, which typically have larger volumes of patients and are located in urban metropolitan areas. as cms core measures in begin to compare emergency departments (eds) on treatment time intervals, especially length of stay (los), it is important to explore whether any differences exist inherent to patient volume. objectives: the objective of this study is to look at differences in operational metrics based on annual patient census. the hypothesis is that treatment time intervals and operational metrics differ among these volume categories. methods: the ed benchmarking alliance has collected yearly operational metrics since . as of , there are eds providing data across the united states. eds are stratified by annual volume for comparison in the following categories: < k, - k, - k, and over k. in this study, metrics for eds with < k visits per year were compared to those of different volumes, averaged from - .
mean values were compared to < k visits as a reference point for statistical difference using t-tests to compare means, with a p-value < . considered significant. results: as seen in the table, a greater percentage of high-acuity patients was seen in higher volume eds than in < k eds. the percentage of patients transferred to another hospital was higher in < k eds. a higher percentage arrived by ems and a higher percentage were admitted in higher volume eds when compared to < k visits. in addition, the median los for both discharged and admitted patients and the percentage who left before treatment was complete (lbtc) were higher in the higher volume eds. conclusion: lower volume eds have lower acuity when compared to higher volume eds. lower volume eds have shorter median los and lower lbtc percentages. as cms core measures require hospitals to report these metrics, it will be important to compare them based on volume and not in aggregate. does the addition of a hands-free communication device improve ed interruption times? amy ernst, steven j. weiss, jeffrey a. reitsema university of new mexico, albuquerque, nm background: ed interruptions occur frequently. recently a hands-free communication device (vocera) was added to a cell phone and a pager in our ed. objectives: the purpose of the present study was to determine whether this addition improved interruption times. our hypothesis was that the device would significantly decrease the length of time of interruptions. methods: this study was a prospective cohort study of attending ed physician calls and interruptions in a level i trauma center with an em residency. interruptions included phone calls, ekg interpretations, pages to resuscitation, and other miscellaneous interruptions (including nursing issues, laboratory, ems, and radiology). we studied a convenience sample intended to include mostly evening shifts, the busiest ed times. the length of time each interruption lasted was recorded. data were collected for a comparison group pre-vocera. three investigators collected data including seven different attendings' interruptions. data were collected on a form, then entered into an excel file. data collectors' agreement was determined during two additional four-hour shifts to calculate a kappa statistic. spss was used for data entry and statistical analysis. descriptive statistics were used for univariate data. chi-square and mann-whitney u nonparametric tests were used for comparisons. results: of the total interruptions, % were phone calls, % were ekgs to be read, % were pages to resuscitation, and % miscellaneous. there were no significant differences in types of interruptions pre- vs. post-vocera. pre-vocera we collected hours of data with interruptions, with a mean of . per hour. post-vocera, hours of data were collected with interruptions, with a mean of . per hour. there was a significant difference in the length of interruptions, with an average of minutes pre-vocera vs. minutes post-vocera (p = . , diff . , % ci . - . ). vocera calls were significantly shorter than non-vocera calls ( vs minutes, p < . ). comparing data collectors for type of interruption during the same -hour shift resulted in a kappa (agreement) of . . conclusion: the addition of a hands-free communication device may improve interruptions by shortening call length. background: analyses of patient flow through the ed typically focus on metrics such as wait time, total length of stay (los), or boarding time.
however, little is known about how much interaction a patient has with clinicians after being placed in a room, or what proportion of the in-room visit is also spent ''waiting,'' rather than directly interacting with care providers. objectives: the objective was to assess the proportion of time, relative to the time in a patient care area, that a patient spends actively interacting with providers during an ed visit. methods: a secondary analysis of audiotaped encounters of patients with one of four diagnoses (ankle sprain, back pain, head injury, laceration) was performed. the setting was an urban, academic ed. ed visits of adult patients were recorded from the time of room placement to discharge. audiotapes were edited to remove all downtime and non-patient-provider conversations. los and door-to-doctor times were abstracted from the medical record. the proportion of time the patient spent in direct conversation with providers (''talk-time'') was calculated as the ratio of the edited audio recording time to the time spent in a patient care area (talk-time = [edited audio time / (los - door-to-doctor)]). multiple linear regression controlling for time spent in the patient care area, age, and sex was performed. results: the sample was % male with a mean age of years. median los: minutes (iqr: - ), median door-to-doctor: minutes (iqr: - ), median time spent in a patient care area: minutes (iqr: - ). median time spent in direct conversation with providers was minutes (iqr: - ), corresponding to a talk-time percentage of . % (iqr: . - . %). there were no significant differences based on diagnosis. regression analysis showed that those spending a longer time in a patient care area had a lower percentage of talk time (b = - . , p = . ). conclusion: although limited by sample size, these results indicate that approximately % of a patient's time in a care area is spent not interacting with providers. while some of the time spent waiting is out of the providers' control (e.g. awaiting imaging studies), this significant ''downtime'' represents an opportunity both for process improvement efforts to decrease downtime and for the development of innovative patient education efforts to make the best use of the remaining downtime. degradation of emergency department operational data quality during electronic health record implementation michael j. ward, craig froehle, christopher j. lindsell university of cincinnati, cincinnati, oh background: process improvement initiatives targeted at operational efficiency frequently use electronic timestamps to estimate task and process durations. errors in timestamps hamper the use of electronic data to improve a system and may result in inappropriate conclusions about performance. despite the fact that the number of electronic health record (ehr) implementations is expected to increase in the u.s., the magnitude of this ehr-induced error is not well established. objectives: to estimate the change in the magnitude of error in ed electronic timestamps before and after a hospital-wide ehr implementation. methods: time-and-motion observations were conducted in a suburban ed, annual census , , after receiving irb approval. observation was conducted weeks pre- and weeks post-ehr implementation. patients were identified on entering the ed and tracked until exiting. times were recorded to the nearest second using a calibrated stopwatch, and are reported in minutes.
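returning briefly to the preceding abstract, the ''talk-time'' ratio it defines is simply the edited conversation time divided by the time spent in a patient care area. a small sketch with hypothetical minutes:

```python
# small sketch of the ''talk-time'' ratio defined in the preceding abstract:
# talk_time = edited_audio_time / (los - door_to_doctor), i.e. minutes of direct
# patient-provider conversation divided by minutes spent in a patient care area.
# the numbers below are hypothetical, not the study's medians.
def talk_time_fraction(edited_audio_min: float, los_min: float, door_to_doctor_min: float) -> float:
    in_room_min = los_min - door_to_doctor_min   # time spent in a patient care area
    return edited_audio_min / in_room_min

# e.g. 20 minutes of recorded conversation during a 180-minute visit with a
# 40-minute wait for a room gives 20 / 140, roughly 14% talk time
print(f"{talk_time_fraction(20, 180, 40):.1%}")
```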
electronic data were extracted from the patient-tracking system in use pre-implementation, and from the ehr post-implementation. for comparison of means, independent t-tests were used. chi-square and fisher's exact tests were used for proportions, as appropriate. results: there were observations; before and after implementation. the differences between observed times and timestamps were computed and found to be normally distributed. post-implementation, mean physician seen times along with arrival to bed, bed to physician, and physician to disposition intervals occurred before observation. physician seen timestamps were frequently incorrect and did not improve post-implementation. significant discrepancies (ten minutes or greater) from observed values were identified in timestamps involving disposition decision and exit from the ed. calculating service time intervals resulted in every service interval (except arrival to bed) having at least % of the times with significant discrepancies. it is notable that missing values were more frequent post-ehr implementation. conclusion: ehr implementation results in reduced variability of timestamps but reduced accuracy and an increase in missing timestamps. analyses using electronic timestamps for operational efficiency assessment should recognize the magnitude of error, and the compounding of error, when computing service times. background: procedural sedation and analgesia is used in the ed in order to efficiently and humanely perform necessary painful procedures. the opposing physiological effects of ketamine and propofol suggest the potential for synergy, and this has led to interest in their combined use, commonly termed ''ketofol'', to facilitate ed procedural sedation. objectives: to determine if a : mixture of ketamine and propofol (ketofol) for ed procedural sedation results in a % or more absolute reduction in adverse respiratory events compared to propofol alone. methods: participants were randomized to receive either ketofol or propofol in a double-blind fashion according to a weight-based dosing protocol. inclusion criteria were age years or greater, and asa class - status. the primary outcome was the number and proportion of patients experiencing an adverse respiratory event according to pre-defined criteria (the ''quebec criteria''). secondary outcomes were sedation consistency, sedation efficacy, induction time, sedation time, procedure time, and adverse events. results: a total of patients were enrolled, per group. forty-three ( %) patients experienced an adverse respiratory event in the ketofol group compared to ( %) in the propofol group (difference %; % ci - % to %; p = . ). thirty-eight ( %) patients receiving ketofol and ( %) receiving propofol developed hypoxia, of whom three ( %) ketofol patients and ( %) propofol patient received bag-valve-mask ventilation. sixty-five ( %) patients receiving ketofol and ( %) receiving propofol required repeat medication dosing or lightened to a ramsay sedation score of or less during their procedure (difference %; % ci % to %; p = . ). procedural agitation occurred in patients ( . %) receiving ketofol compared to ( %) receiving propofol (difference . %, % ci % to %). recovery agitation requiring treatment occurred in six patients ( %, % ci . % to . %) receiving ketofol. other secondary outcomes were similar between the groups. patients and staff were highly satisfied with both agents.
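as an aside, the primary-outcome comparison reported in the ketofol trial above, an absolute difference in adverse respiratory event proportions with a % ci, can be computed as sketched below; the counts are hypothetical and the normal-approximation interval is an assumption, not necessarily the trial's exact method.

```python
# minimal sketch of a risk-difference comparison like the ketofol trial's
# primary outcome: the absolute difference in adverse respiratory event
# proportions between arms with a normal-approximation 95% ci.
# the counts below are hypothetical placeholders, not trial data.
import math

def risk_difference(events_a, n_a, events_b, n_b, z=1.96):
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = risk_difference(events_a=43, n_a=142, events_b=46, n_b=142)
print(f"risk difference = {diff:.1%} (95% ci {lo:.1%} to {hi:.1%})")
```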
conclusion: ketofol for ed procedural sedation does not result in a reduced incidence of adverse respiratory events compared to propofol alone. induction time, efficacy, and sedation time were similar; however, sedation depth appeared to be more consistent with ketofol. with propofol and its safety is well established. however, in cms enacted guidelines defining propofol as deep sedation and requiring administration by a physician. common edps practice had been one physician performing both the sedation and procedure. edps has proven safe under this one-physician practice. however, the guidelines mandated that separate physicians perform each. objectives: the study hypothesis was that one-physician propofol sedation complication rates are similar to two-physician rates. methods: a before-and-after observational study of patients > years of age consenting to edps with propofol. edps completed with one physician were compared to those completed with two (separate physicians performing the sedation and the procedure). all data were prospectively collected. the study was completed at an urban level i trauma center. standard monitoring and procedures for edps were followed, with physicians blinded to the objectives of this research. the frequency and incremental dosing of medication were left to the discretion of the treating physicians. the study protocol required an ed nurse trained in data collection to be present to record vital signs and assess for any prospectively defined complications. we used chi-square tests to compare the binary outcomes and asa scores across the time periods, and two-sample t-tests to test for differences in age between the two time periods. results: during the -year study period we enrolled patients: one-physician edps sedations and two-physician sedations. all patients meeting inclusion criteria were included in the study. total adverse event rates were . % and . %, respectively (p = . ). the most common complications were hypotension and oxygen desaturation, which respectively showed one-physician rates of . % and . % and two-physician rates of . % and . % (p = . and . ). the unsuccessful procedure rates were . % vs . % (p = . ). conclusion: this study demonstrated no significant difference in complication rates for propofol edps completed by one physician as compared to two. background: overdose patients are often monitored using pulse oximetry, which may not detect changes in patients on high-flow oxygen. objectives: to determine whether changes in end-tidal carbon dioxide (etco2) detected by capnographic monitoring are associated with clinical interventions due to respiratory depression (crd) in patients undergoing evaluation for a decreased level of consciousness after a presumed drug overdose. methods: this was a prospective, observational study of adult patients undergoing evaluation for a drug overdose in an urban county ed. all patients received supplemental oxygen. patients were continuously monitored by trained research associates. the level of consciousness was recorded using the observer's assessment of alertness/sedation scale (oaa/s). vital signs, pulse oximetry, and oaa/s were monitored and recorded every minutes and at the time of occurrence of any crd. respiratory rate and etco2 were measured at five-second intervals using a capnostream monitor.
crd included an increase in supplemental oxygen, the use of bag-valve-mask ventilation, repositioning to improve ventilation, and physical or verbal stimulus to induce respiration, and were performed at the discretion of the treating physicians and nurses. changes from baseline in etco2 values and waveforms among patients who did or did not have a clinical intervention were compared using wilcoxon rank sum tests. results: patients were enrolled in the study (age , range to , % male, median oaas , range to ). suspected overdoses were due to opioids in , benzodiazepines in , an antipsychotic in , and others in . the median time of evaluation was minutes (range to ). crd occurred in % of patients, including an increase in o2 in %, repositioning in %, and stimulation to induce respiration in %. % had an o2 saturation of < % (median , range to ) and % had a loss of etco2 waveform at some time, all of whom had a crd. the median change in etco2 from baseline was mmhg, range to . among patients with crd it was mmhg, range to , and among patients with no crd it was mmhg, range to (p = . ). conclusion: the change in etco2 from baseline was larger in patients who required clinical interventions than in those who did not. in patients on high-flow oxygen, capnographic monitoring may be sensitive to the need for airway support. how reliable are health care providers in reporting changes in etco2 waveform anas sawas , scott youngquist , troy madsen , matthew ahern , camille broadwater-hollifield , andrew syndergaard , jared phelps , bryson garbett , virgil davis university of utah, salt lake city, ut; midwestern university, glendale, az background: etco2 changes have been used in procedural sedation and analgesia (psa) research to evaluate subclinical respiratory depression associated with sedation regimens. objectives: to evaluate the accuracy of bedside clinician reporting of changes in etco2. methods: this was a prospective, randomized, single-blind study conducted in the ed setting from june until the present time. this study took place at the academic adult ed of a -bed hospital ( beds in the ed) and level i trauma center. subjects were randomized to receive either ketamine-propofol or propofol according to a standardized protocol. loss of etco2 waveforms for ≥ sec was recorded. following sedation, questionnaires were completed by the sedating physicians. digitally recorded etco2 waveforms were also reviewed by an independent physician and a trained research assistant (ra). to ensure the reliability of trained research assistants, we compared their analyses with the analyses of an independent physician for the first recordings. the target enrollment was patients in each group (n = total). statistics were calculated using sas statistical software. results: patients were enrolled; ( . %) were male and ( . %) were female. mean age was . ± . years. most participants did not have major risk factors for apnea or for further complications ( . % were asa class or ). etco2 waveforms were reviewed by ( . %) sedating physicians and ( . %) nurses at the bedside. there were ( . %) etco2 waveform recordings, ( . %) were reviewed by an independent physician and ( %) were reviewed by an ra. a kappa test for agreement between independent physicians and ras was conducted on recordings and there were no discordant pairs (kappa = ). compared to sedating physicians, the independent physician was more likely to report etco2 waveform losses (or . , % ci . - . ). compared to sedating physicians, ras were more likely to report etco2 waveform losses (or . , % ci . - .
). conclusion: compared to sedating physicians at the bedside, independent physicians and ras were more likely to note etco2 waveform losses. an independent review of recorded etco2 waveform changes will be more reliable for future sedation research. background: comprehensive studies evaluating current practices of ed airway management in japan are lacking. many emergency physicians in japan still experience resistance regarding rapid sequence intubation (rsi). objectives: we sought to compare the success and complication rates of rsi with those of non-rsi. methods: design and setting: we conducted a multicenter prospective observational study using the jean registry of eds at academic and community hospitals in japan between and . data fields include ed characteristics, patient and operator demographics, method of airway management, number of attempts, and adverse events. we defined non-rsi as intubation with sedation only, with neuromuscular blockade only, or without medication. participants: all patients undergoing emergency intubation in the ed were eligible for inclusion. cardiac arrest encounters were excluded from the analysis. primary analysis: we compared rsi with non-rsi in terms of success rate on the first attempt, success rate within three attempts, and complication rate. we present descriptive data as proportions with % confidence intervals (cis). we report odds ratios (or) with % ci via chi-square testing. results: the database recorded intubations (capture rate %) and met the inclusion criteria. rsi was the initial method chosen in ( %) and non-rsi in ( %). use of rsi varied among institutions from % to %. successful rsi on the first attempt and within three attempts occurred in intubations ( %, %ci %- %) and intubations ( %, %ci %- %), respectively. successful non-rsi on the first attempt and within three attempts occurred in intubations ( %, %ci %- %) and intubations ( %, %ci %- %). success rates of rsi on the first attempt and within three attempts were higher than those of non-rsi (or . , %ci . - . and or . , % ci . - . , respectively). we recorded complications in rsi ( %) and in non-rsi ( %). there was no significant difference in complication rate between rsi and non-rsi (or . , % ci . - . ). conclusion: in this multi-center prospective study in japan, we demonstrated a high degree of variation in the use of rsi for ed intubation. additionally, we found that the success rates of rsi on the first attempt and within three attempts were both higher than those of non-rsi. this study has the limitations of reporting bias and confounding by indication. (originally submitted as a ''late-breaker.'') methods: this was a prospective, randomized, single-blind study conducted in the ed setting from june until the present time. this study took place at the academic adult ed of a -bed hospital ( beds in the ed) and level i trauma center. subjects were randomized to receive either ketamine-propofol or propofol according to a standardized protocol. etco2 waveforms were digitally recorded. etco2 changes were evaluated by the sedating physicians at the bedside. recorded waveforms were reviewed by an independent physician and a trained research assistant (ra). to ensure the reliability of trained ras, we computed a kappa test for agreement between the analysis of independent physicians and ras for the first recordings. a post-hoc analysis of the association between any loss, the number of losses, and the total duration of loss of etco2 waveform and crp was performed. on review we recorded the absence or presence of loss of etco2 and the total duration in seconds of all lost etco2 episodes ≥ seconds.
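the two analyses that recur in these capnography abstracts, cohen's kappa for inter-rater agreement on etco2 waveform loss and odds ratios with % cis comparing how often different raters report a loss, can be sketched as follows. the ratings and 2x2 counts are hypothetical, and scikit-learn/numpy are assumed.

```python
# brief sketch of the agreement and odds-ratio analyses used in the capnography
# abstracts above: cohen's kappa between two reviewers of etco2 waveform loss,
# and an odds ratio with a 95% ci comparing loss-reporting between two groups
# of raters.  all ratings and counts below are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score

physician = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # 1 = waveform loss reported
assistant = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
print("kappa:", cohen_kappa_score(physician, assistant))   # 1.0 -> no discordant pairs

def odds_ratio(a, b, c, d, z=1.96):
    """2x2 table: a, b = losses reported / not reported by group 1; c, d by group 2."""
    or_ = (a * d) / (b * c)
    se_log = np.sqrt(1/a + 1/b + 1/c + 1/d)     # standard error of log(or)
    lo, hi = np.exp(np.log(or_) - z * se_log), np.exp(np.log(or_) + z * se_log)
    return or_, lo, hi

print("or (95%% ci): %.2f (%.2f-%.2f)" % odds_ratio(30, 70, 15, 85))
```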
ors were calculated using sas statistical software. results: patients were enrolled; ( . %) were male and ( . %) were female. . % of participants were asa class or . waveforms were reviewed by ( . %) sedating physicians. there were ( . %) waveform recordings; ( . %) were reviewed by an independent physician and ( %) were reviewed by ras, with no discordant pairs (kappa = ). there were ( . %) crp events. any loss of etco2 was associated with a non-significant or of . ( % ci . - . ) for crp. however, the duration of etco2 loss was significantly associated with crp, with an or of . ( % ci . - . ) for each -second interval of lost etco2. the number of losses was significantly associated with the outcome (or . , % ci . - . ). conclusion: defining subclinical respiratory depression as present or absent may be less useful than quantitative measurements. this suggests that risk is cumulative over periods of loss of etco2, and the duration of loss may be a better marker of sedation depth and risk of complications than classification of any loss. background: ed visits present an opportunity to deliver brief interventions (bis) to reduce violence and alcohol misuse among urban adolescents at risk for future injury. previous analyses demonstrated that a brief intervention resulted in reductions in violence and alcohol consequences up to months. objectives: this paper describes findings examining the efficacy of bis on peer violence and alcohol misuse at months. methods: patients ( - yrs) at an ed reporting past-year alcohol use and aggression were enrolled in the rct, which included computerized assessment and randomization to a control group, a bi delivered by a computer (cbi), or a bi delivered by a therapist assisted by a computer (tbi). baseline and -month assessments included violence measures (peer aggression, peer victimization, violence-related consequences) and alcohol measures (alcohol misuse, binge drinking, alcohol-related consequences). results: adolescents were screened ( % participation). of those, screened positive for violence and alcohol use and were randomized; % completed -month follow-up. as compared to the control group, the tbi group showed significant reductions in peer aggression (p < . ) and peer victimization (p < . ) at months. bi and control groups did not differ on alcohol-related variables at months. conclusion: evaluation of the saferteens intervention one year following an ed visit provides support for the efficacy of a computer-assisted therapist brief intervention for reducing peer violence. violence against ed health care workers: a -month experience terry kowalenko , donna gates , gordon gillespie , paul succop university of michigan, ann arbor, mi; university of cincinnati, cincinnati, oh background: health care (hc) support occupations have an injury rate nearly times that of the general sector due to assaults, with doctors and nurses nearly times greater. studies have shown that the ed is at greatest risk of such events compared to other hc settings. objectives: to describe the incidence of violence in ed hc workers over months. specific aims were to ) identify demographic, occupational, and perpetrator factors related to violent events; ) identify the predictors of acute stress response in victims; and ) identify predictors of loss of productivity after the event. methods: a longitudinal, repeated-measures design was used to collect monthly survey data from ed hc workers (hcws) at six hospitals in two states. surveys assessed the number and type of violent events, and feelings of safety and confidence.
victims also completed specific violent event surveys. descriptive statistics and a repeated measure linear regression model were used. results: ed hcws completed monthly surveys, and violent events were reported. the average per person violent event rate per months was . . events were physical threats ( . per person in months). events were assaults ( . per person in months). violent event surveys were completed, describing physical threats and assaults with % resulting in injuries. % of the physical threats and % of the assaults were perpetrated by men. comparing occupational groups revealed significant differences between nurses and physicians for all reported events (p = . ), with the greatest difference in physical threats (p = . ). nurses felt less safe than physicians (p = . ). physicians felt more confident than nurses in dealing with the violent patient (p = . ). nurses were more likely to experience acute stress than physicians (p < . ). acute stress significantly reduced productivity in general (p < . ), with a significant negative effect on ''ability to handle/ manage workload'' (p < . ) and ''ability to handle/ manage cognitive demands'' (p < . ). conclusion: ed hcws are frequent victims of violence perpetrated by visitors and patients. this violence results in injuries, acute stress, and loss of productivity. acute stress has negative consequences on the workers' ability to perform their duties. this has serious potential consequences to the victim as well as the care they provide to their patients. a randomized controlled feasibility trial of vacant lot greening to reduce crime and increase perceptions of safety eugenia c. garvin, charles c. branas perelman school of medicine at the university of pennsylvania, philadelphia, pa background: vacant lots, often filled with trash and overgrown vegetation, have been associated with intentional injuries. a recent quasi-experimental study found a significant decrease in gun crimes around vacant lots that had been greened compared with control lots. objectives: to determine the feasibility of a randomized vacant lot greening intervention, and its effect on police-reported crime and perceptions of safety. methods: for this randomized controlled feasibility trial of vacant lot greening, we partnered with the pennsylvania horticulture society (phs) to perform the greening intervention (cleaning the lots, planting grass and trees, and building a wooden fence around the perimeter). we analyzed police crime data and interviewed people living around the study vacant lots (greened and control) about perceptions of safety before and after greening. results: a total of sq ft of randomly selected vacant lot space was successfully greened. we used a master database of , vacant lots to randomly select vacant lot clusters. we viewed each cluster with the phs to determine which were appropriate to send to the city of philadelphia for greening approval. the vacant lot cluster highest on the random list to be approved by the city of philadelphia was designated the intervention site, and the next highest was designated the control site. overall, participants completed baseline interviews, and completed follow-up interviews after months. % of participants were male, % were black or african american, and % had a household income less than $ , . unadjusted difference-in-differences estimates showed a decrease in gun assaults around greened vacant lots compared to control. 
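the unadjusted estimate referenced here is simply the pre-to-post change around greened lots minus the pre-to-post change around control lots; a minimal sketch with hypothetical incident counts (not the trial's data):

```python
# hedged sketch: unadjusted difference-in-differences for gun assaults
# around greened vs. control vacant lots, using hypothetical counts
pre_greened, post_greened = 14, 9    # assaults before/after greening (assumed)
pre_control, post_control = 13, 12   # assaults around control lots (assumed)

did = (post_greened - pre_greened) - (post_control - pre_control)
print(f"difference-in-differences estimate: {did}")  # negative values favor greening
```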
regression-adjusted estimates showed that people living around greened vacant lots reported feeling safer after greening compared to those who lived around control vacant lots (p < . ). conclusion: conducting a randomized controlled trial of vacant lot greening is feasible. greening may reduce certain gun crimes and make people feel safer. however, larger prospective trials are needed to further investigate this link. screening for violence identifies young adults at risk for return ed visits for injury abigail hankin-wei, brittany meagley, debra houry emory university, atlanta, ga background: homicide is the second leading cause of death among youth ages - . prior studies, in nonhealth care settings, have shown associations between violent injury and risk factors including exposure to community violence, peer behavior, and delinquency. objectives: to assess whether self-reported exposure to violence risk factors can be used to predict future ed visits for injuries. methods: we conducted a prospective cohort study in the ed of a southeastern us level i trauma center. patients aged - presenting for any chief complaint were included unless they were critically ill, incarcerated, or could not read english. recruitment took place over six months, by a trained research assistant (ra). the ra was present in the ed for - days per week, with shifts scheduled such that they included weekends and weekdays, over the hours from am- pm. patients were offered a $ gift card for participation. at the time of initial contact in the ed, patients completed a written questionnaire which included validated measures of the following risk factors: a) aggression, b) perceived likelihood of violence, c) recent violent behavior, d) peer behavior, e) community exposure to violence, and f) positive future outlook. at months following the initial ed visit, the participants' medical records were reviewed to identify any subsequent ed visits for injury-related complaints. data were analyzed with chi-square and logistic regression analyses. results: patients were approached, of whom patients consented. participants' average age was . years, with % female, and % african american. return visits for injuries were significantly associated with hostile/aggressive feelings (rr . , ci . , ) , self-reported perceived likelihood of violence (rr . , ci . , . ) , recent violent behavior (rr . , ci . , . ) , and peer group violence (rr . , ci . , . ) . these findings remained significant when controlling for participant sex. conclusion: a brief survey of risk factors for violence is predictive of return visit to the ed for injury. these findings identify a potentially important tool for primary prevention of violent injuries among young adults visiting the ed for both injury and non-injury complaints. background: sepsis is a commonly encountered disease in ed, with high mortality. while several clinical prediction rules (cpr) including meds, sirs, and curb- exist to facilitate clinicians in early recognition of risk of mortality for sepsis, most are of suboptimal performance. objectives: to derive a novel cpr for mortality of sepsis utilizing clinically available and objective predictors in ed. methods: we retrospectively reviewed all adult septic patients who visited the ed at a tertiary hospital during the year with two sets of blood cultures ordered by physicians. basic demographics, ed vital signs, symptoms and signs, underlying illnesses, laboratory findings, microbiological results, and discharge status were collected. 
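one common way to implement the derivation described next (univariate screening of candidate predictors, a multivariate logistic regression on those retained, and an auc comparison against existing scores) is sketched below with simulated data and assumed variable names; this is illustrative only, not the study's code:

```python
# hedged sketch: deriving a clinical prediction rule for sepsis mortality
# from simulated data; column names and effect sizes are assumptions
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "malignancy":   rng.integers(0, 2, n),
    "bandemia":     rng.integers(0, 2, n),
    "advanced_age": rng.integers(0, 2, n),
    "meds_score":   rng.integers(0, 15, n),   # stand-in for an existing severity score
})
true_logit = -2.5 + 0.9 * df["malignancy"] + 0.7 * df["bandemia"] + 0.6 * df["advanced_age"]
df["died"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# univariate screen: keep candidates associated with death at p < 0.10
candidates = ["malignancy", "bandemia", "advanced_age"]
kept = [c for c in candidates
        if sm.Logit(df["died"], sm.add_constant(df[[c]])).fit(disp=0).pvalues[c] < 0.10]

# multivariate model on the retained predictors; exponentiated coefficients are adjusted ors
X = sm.add_constant(df[kept])
model = sm.Logit(df["died"], X).fit(disp=0)
print(np.exp(model.params))

# compare discrimination of the new rule with the existing score via auc
print(roc_auc_score(df["died"], model.predict(X)),
      roc_auc_score(df["died"], df["meds_score"]))
```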
multivariate logistic regressions were used to obtain a novel cpr using predictors with a p-value < . in univariate analyses. the existing cprs were compared with this novel cpr using auc. results: of included patients, . % died in hospital, % had diabetes, % were older than years of age, % had malignancy, and % had positive blood bacterial culture tests. predisposing factors including history of malignancy, liver disease, immunosuppressed status, chronic kidney disease, congestive heart failure, and age older than years were found to be associated with mortality (all p < . ). patients who died tended to have lower body temperature, narrower pulse pressure, higher percentages of red cell distribution width (rdw) and bandemia, higher blood urea nitrogen (bun), ammonia, and c-reactive protein levels, and longer prothrombin time and activated partial thromboplastin time (aptt) (all p < . ). the most parsimonious cpr incorporating history of malignancy (or . , % ci . - . ), prolonged aptt ( . , . - . ), presence of bandemia ( . , . - . ). results: there was poor agreement between the physician's unstructured assessment used in clinical practice and the guidelines put forth by the aha/acc/acep task force. ed physicians were more likely to assess a patient as low risk ( %), while aha guidelines were more likely to classify patients as intermediate ( %) or high ( %) risk. however, when the final acs diagnosis was compared with the assigned risk category, ed physicians proved better at identifying high-risk patients who in fact had acs, while the aha/acc/acep guidelines proved better at correctly identifying low-risk patients who did not have acs. conclusion: in the ed, physicians are far more efficient at correctly placing patients with underlying acs into a high-risk category, while established criteria may be overly conservative when applied to an acute care population. further research is indicated to look at ed physicians' risk stratification and ensuing patient care to assess for appropriate decision making and ultimate outcomes. comparative conclusion: the amuse score was more specific, but the wells score was more sensitive, for acute lower limb dvt in this cohort. there is no significant advantage to using the amuse over the wells score in ed patients with suspected dvt. background: the direct cost of medical care is not accurately reflected in charges or reimbursement. the cost of boarding admitted patients in the ed has been studied in terms of opportunity costs, which are indirect. the actual direct effect on hospital expenses has not been well defined. objectives: we calculated the difference to the hospital in the cost of caring for an admitted patient in the ed and in a non-critical care in-patient unit. methods: time-driven activity-based costing (tdabc) has recently been proposed as a method of determining the actual cost of providing medical services. tdabc was used to calculate the cost per patient bed-hour both in the ed and for an in-patient unit. the costs included nursing, nursing assistants, clerks, attending and resident physicians, supervisory salaries, and equipment maintenance. boarding hours were determined from placement of the admission order to transfer to an in-patient unit. a convenience sample of consecutive non-critical care admissions was assessed to find the degree of ed physician involvement with boarded patients. results: the overhead cost per patient bed-hour in the ed was $ . .
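the arithmetic behind these cost figures is a per-bed-hour cost differential multiplied by annual boarding hours; a minimal sketch with hypothetical dollar values (the study's actual figures are not reproduced here):

```python
# hedged sketch: time-driven activity-based costing (tdabc) style arithmetic
# for the direct cost of boarding admitted patients in the ed; hypothetical numbers
ed_cost_per_bed_hour = 58.20         # assumed ed overhead cost per patient bed-hour ($)
inpatient_cost_per_bed_hour = 24.80  # assumed in-patient unit cost per bed-hour ($)
annual_boarding_hours = 20_000       # assumed hours from admission order to transfer

differential_per_hour = ed_cost_per_bed_hour - inpatient_cost_per_bed_hour
annual_excess_cost = differential_per_hour * annual_boarding_hours
print(f"excess direct cost of boarding: ${annual_excess_cost:,.2f} per year")
```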
the equivalent cost per bed-hour inpatient was $ . , a differential of $ . . there were , boarding hours for medical-surgical patients in , a differential of $ , , . for the year. for the short-stay unit (no residents), the cost per patient hour was $ . and the boarding hours were , . this resulted in a differential cost of $ , . , and a total direct cost to the hospital of $ , , . . review of consecutive admissions showed no orders placed by the ed physician after the decision-to-admit. conclusion: concentration of resources in the ed means a considerably higher cost per unit of care as compared to an in-patient unit. keeping admitted patients boarding in the ed results in expensive underutilization. this is exclusive of the significant opportunity costs of lost revenue from walk-out and diverted patients. this study includes the cost of teaching attendings and residents (ed and in-patient). in a non-teaching setting, the differential would be less and the cost of boarding would be shared by a fee-for-service ed physician group as well as the hospital. improving identification of frequent emergency department users using a regional health information exchange background: frequent ed users consume a disproportionate amount of health care resources. interventions are being designed to identify such patients and direct them to more appropriate treatment settings. because some frequent users visit more than one ed, a health information exchange (hie) may improve the ability to identify frequent ed users across sites of care. objectives: to demonstrate the extent to which an hie can identify the marginal increase in frequent ed users beyond that which can be detected with data from a single hospital. methods: data from / / to / / from the new york clinical information exchange (nyclix), an hie in new york city that includes ten hospitals, were analyzed to calculate the number of frequent ed users ( ≥ visits in days) at each site and across the hie. results: there were , ( % of total patients) frequent ed users, with , ( %) of frequent users having all their visits at a single ed, while , ( %) frequent users were identified only after counting visits to multiple eds (table ). site-specific increases varied from % to % (sd . ). frequent ed users accounted for % of patients, but for % of visits, averaging . visits per year, versus . visits per year for all other patients. . % of frequent users visited two or more eds during the study period, compared to . % of all other patients. conclusion: frequent ed users commonly visited multiple nyclix eds during the study period. the use of an hie helped identify many additional frequent users, though the benefits were lower for hospitals not located in the relative vicinity of another nyclix hospital. measures that take a community, rather than a single institution, into account may be more reflective of the care that the patient experiences. indocyanine background: due to their complex nature and high associated morbidity, burn injuries must be handled quickly and efficiently. partial thickness burns are currently treated based upon visual judgment of burn depth by the clinician. however, such judgment is only % accurate and not expeditious. laser doppler imaging (ldi) is far more accurate, nearly % after days. however, it is too cumbersome for routine clinical use. laser-assisted indocyanine green angiography (laicga) has been indicated as an alternative for diagnosing the depth of burn injuries, and possesses greater utility for clinical translation.
as the preferred outcome of burn healing is aesthetic, it is of interest to determine if wound contracture can be predicted early in the course of a burn by laicga. objectives: to determine the utility of early burn analysis using laicga in the prediction of -day wound contracture. methods: a prospective animal experiment was performed using six anesthetized pigs, each with standardized wounds. differences in burn depth were created by using a . × . cm aluminum bar at three exposure times and temperatures: degrees c for seconds, degrees c for seconds, and degrees c for seconds. we have shown in prior validation experiments that these burn temperatures and times create distinct burn depths. laicga scanning, using the lifecell spy elite, took place at hour, hours, hours, hours, and week post-burn. imaging was read by a blinded investigator, and perfusion trends were compared with day post-burn contraction outcomes measured using imagej software. biopsies were taken on day to measure scar tissue depth. results: deep burns were characterized by a blue center indicating poor perfusion, while more superficial burns were characterized by a yellow-red center indicating perfusion that was close to that of the normal uninjured adjacent skin (see figure). a linear relationship between contraction outcome and burn perfusion could be discerned as early as hour post-burn, peaking in strength at - hours post-burn. burn intensity could be effectively identified at hours post-burn, although there was no relationship with scar tissue depth. conclusion: pilot data indicate that laicga using the lifecell spy has the ability to determine the depth of injury and predict the degree of contraction of deep dermal burns within - days of injury with greater accuracy than clinical scoring. objectives: we hypothesize that real-time monitoring of an integrated electronic medical records system and the subsequent firing of a ''sepsis alert'' icon on the electronic ed tracking board results in improved mortality for patients who present to the ed with severe sepsis or septic shock. methods: we retrospectively reviewed our hospital's sepsis registry and included all patients diagnosed with severe sepsis or septic shock presenting to an academic community ed with an annual census of , visits who were admitted to a medical icu or stepdown icu bed between june and october . in may an algorithm was added to our integrated medical records system that identifies patients with two sirs criteria and evidence of end-organ damage or shock on lab data. when these criteria are met, a ''sepsis alert'' icon (prompt) appears next to that patient's name on the ed tracking board. the system also pages an in-house, specially trained icu nurse who can respond on a prn basis and assist in the patient's management. months of intervention data were compared with months of baseline data. statistical analysis was via z-test for proportions. results: for ed patients with severe sepsis, the pre- and post-alert mortality was of ( %) and of ( %), respectively (p = . ; n = ). in the septic shock group, the pre- and post-alert mortality was of ( %) and of ( %), respectively (p = . ). with ed and inpatient sepsis alerts combined, the severe sepsis subgroup mortality was reduced from % to % (p = . ; n = ). conclusion: real-time ed ehr screening for severe sepsis and septic shock patients did not improve mortality.
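the pre- versus post-alert mortality comparisons above use a z-test for proportions; a minimal sketch with hypothetical counts (not the registry's data):

```python
# hedged sketch: two-proportion z-test comparing mortality before and after
# the ''sepsis alert'' went live; counts are hypothetical
from statsmodels.stats.proportion import proportions_ztest

deaths = [18, 14]     # deaths pre-alert, post-alert (assumed)
patients = [90, 95]   # severe sepsis patients pre-alert, post-alert (assumed)

stat, p_value = proportions_ztest(count=deaths, nobs=patients)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```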
a positive trend in the severe sepsis subgroup was noted, and the combined inpatient plus ed data suggest that statistical significance may be reached as more patients enter the registry. limitations: retrospective study, potential increased data capture post-intervention, and no ''gold standard'' to test the sepsis alert sensitivity and specificity. descriptive statistics were calculated. principal component analysis was used to determine questions with continuous response formats that could be aggregated. aggregated outcomes were regressed onto predictor demographic variables using multiple linear regression. results: / physicians completed the survey. physicians had a mean of . ± . years of experience in the ed. . % were female. eight physicians ( %) reported never having used the tool, while . % of users estimated having used it more than five times. % of users cited the ''p'' alert on the etb as the most common notification method. most felt the ''p'' alert did not help them identify patients with pneumonia earlier (mean = . ± . ), but found it moderately useful in reminding them to use the tool ( . ± . ). physicians found the tool helpful in making decisions regarding triage, diagnostic studies, and antibiotic selection for outpatients and inpatients ( . ± . , . ± . , . ± . , and . ± . , respectively). they did not feel it negatively affected their ability to perform other tasks ( . ± . ). using multiple linear regression, neither age, sex, years of experience, nor tool use frequency significantly predicted responses to questions about triage and antibiotic selection, technical difficulties, or diagnostic ordering. conclusion: ed physicians perceived the tool to be helpful in managing patients with pneumonia without negatively affecting workflow. perceptions appear consistent across demographic variables and experience. objectives: we sought to examine whether use of the salt device can provide reliable tracheal intubation during ongoing cpr. the dynamic model tested the device with human-powered cpr (manual) and with an automated chest compression device (physio control lucas ). the hypothesis was that the predictable movement of an automated chest compression device would make tracheal intubation easier than the random movement from manual cpr. methods: the project was an experimental controlled trial and took place in the ed at a tertiary referral center in peoria, illinois. this project was an expansion arm of a similarly structured study using traditional laryngoscopy. emergency medicine residents, attending physicians, paramedics, and other acls-trained staff were eligible for participation. in randomized order, each participant attempted intubation on a mannequin using the salt device with no cpr ongoing, during cpr with manual compressions, and during cpr with automated chest compressions. participants were timed on each attempt, and success was determined after each attempt. results: there were participants in the trial. the success rates in the control group and the automated cpr group were both % ( / ), and the success rate in the manual cpr group was % ( / ). objectives: our primary hypothesis was that in fasting, asymptomatic subjects, larger fluid boluses would lead to proportional aortic velocity changes. our secondary endpoints were to determine inter- and intra-subject variation in aortic velocity measurements. methods: the authors performed a prospective randomized double-blinded trial using healthy volunteers.
we measured the velocity time integral (vti) and maximal velocity (vmax) with an estimated - ° pulsed wave doppler interrogation of the left ventricular outflow in the apical- cardiac window. three physicians reviewed optimal sampling gate position and doppler angle, and verified the presence of an aortic closure spike. angle correction technology was not used. subjects with no history of cardiac disease or hypertension fasted for hours and were then randomly assigned to receive a normal saline bolus of ml/kg, ml/kg, or ml/kg over minutes. aortic velocity profiles were measured before and after each fluid bolus. results: forty-two subjects were enrolled. mean age was ± (range to ) and mean body mass index was . ± . (range . to ). mean volumes (in ml) for the groups receiving ml/kg, ml/kg, and ml/kg were , , and , respectively. mean baseline vmax (in cm/s) of the subjects was . ± . (range to ). mean baseline vti (in cm) was . ± . (range . to . ). pre- and post-fluid mean differences were − . (± . ) for vmax and . (± . ) for vti. aortic velocity changes in the groups receiving ml/kg, ml/kg, and ml/kg were not statistically significant (see table). heart rate changes were not significant. background: clinicians recognize that septic shock is a highly prevalent, high-mortality disease state. evidence supports early ed resuscitation, yet care delivery is often inconsistent and incomplete. objectives: to discover latent critical barriers to successful ed resuscitation of septic shock. methods: we conducted five -minute risk-informed in-situ simulations. ed physicians and nurses working in the real clinical environment cared for a standardized patient, introduced into their existing patient workload, with signs and symptoms of septic shock. immediately after case completion, clinicians participated in a -minute debriefing session. transcripts of these sessions were analyzed using grounded theory, a method of qualitative analysis, to identify critical barrier themes. results: fifteen clinicians participated in the debriefing sessions: four attending physicians, five residents, five nurses, and one nurse practitioner. the most prevalent critical barrier themes were: anchoring bias and difficulty with cognitive framework adaptation as the patient progressed to septic shock (n = ), difficult interactions between the ed and ancillary departments (n = ), difficulties with physician-nurse communication and teamwork (n = ), and delays in placing the central venous catheter due to perceptions surrounding equipment availability and the desire to attend to other competing interests in the ed prior to initiation of the procedure (n = and ). each theme was represented in at least four of the five debriefing sessions. participants reported the in-situ simulations to be a realistic representation of ed sepsis care. conclusion: in-situ simulation and subsequent debriefing provide a method of identifying latent critical areas for improvement in a care process. improvement strategies for ed-based septic shock resuscitation will need to address the difficulties in shock recognition and cognitive framework adaptation, physician and nurse teamwork, and prioritization of team effort.
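returning to the fluid-bolus study above, the primary comparison is of per-subject pre/post changes in aortic velocity across the dosing groups; a minimal sketch of that kind of comparison, with assumed group labels and hypothetical measurements:

```python
# hedged sketch: per-subject change in vti after a fluid bolus, compared across
# three dosing groups with a one-way anova; values are hypothetical
import numpy as np
from scipy.stats import f_oneway

pre_a, post_a = np.array([20.1, 22.4, 19.8]), np.array([20.9, 22.1, 20.5])
pre_b, post_b = np.array([21.3, 18.9, 23.0]), np.array([21.8, 19.6, 23.4])
pre_c, post_c = np.array([19.5, 22.8, 21.1]), np.array([20.6, 23.1, 21.9])

changes = [post_a - pre_a, post_b - pre_b, post_c - pre_c]
stat, p = f_oneway(*changes)
print(f"F = {stat:.2f}, p = {p:.3f}")  # a large p suggests no between-group difference
```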
the background: the association between blood glucose level and mortality in critically ill patients is highly debated. several studies have investigated the association between history of diabetes, blood sugar level, and mortality of septic patients; however, no consistent conclusion could be drawn so far. objectives: to investigate the association between diabetes and initial glucose level and in-hospital mortality in patients with suspected sepsis from the ed. methods: we conducted a retrospective cohort study that consisted of all adult septic patients who visited the ed at a tertiary hospital during the year with two sets of blood cultures ordered by physicians. basic demographics, ed vital signs, symptoms and signs, underlying illnesses, laboratory findings, microbiological results, and discharge status were collected. logistic regressions were used to evaluate the association between risk factors, initial blood sugar level, and history of diabetes and mortality, as well as the effect modification between initial blood sugar level and history of diabetes. results: a total of patients with available blood sugar levels were included, of whom % had diabetes, % were older than years of age, and % were male. the mortality was % ( % ci . - . %). patients with a history of diabetes tended to be older, female, and more likely to have chronic kidney disease, lower sepsis severity (meds score), and positive blood culture test results (all p < . ). patients with a history of diabetes tended to have lower in-hospital mortality after ed visits with sepsis, controlling for initial blood sugar level (aor . , % ci . - . , p = . ). initial normal blood sugar seemed to be beneficial compared to lower blood sugar level for in-hospital mortality, controlled history of diabetes, sex, severity of sepsis, and age (aor . , % ci . - . , p = . ). the effect modification of diabetes on blood sugar level and mortality, however, was found to be not statistically significant (p = . ). conclusion: normal initial blood sugar level in ed and history of diabetes might be protective for mortality of septic patients who visited the ed. further investigation is warranted to determine the mechanism for these effects. methods: this irb-approved retrospective chart review included all patients treated with therapeutic hypothermia after cardiac arrest during at an urban, academic teaching hospital. every patient undergoing therapeutic hypothermia is treated by neurocritical care specialists. patients were identified by review of neurocritical care consultation logs. clinical data were dually abstracted by trained clinical study assistants using a standardized data dictionary and case report form. medications reviewed during hypothermia were midazolam, lorazepam, propofol, fentanyl, cisatracurium, and vecuronium. results: there were patients in the cohort. median age was (range - years), % were white, % were male, and % had a history of coronary artery disease. seizures were documented by continuous eeg in / ( %), and / ( %) died during hospitalization. most, / ( %), received fentanyl, / ( %) received benzodiazepine pharmacotherapy, and / ( %) received propofol. paralytics were administered to / ( %) patients, / ( %) with cisatracurium and / ( %) with vecuronium. of note, one patient required pentobarbital for seizure management. conclusion: sedation and neuromuscular blockade are common during management of patients undergoing therapeutic hypothermia after cardiac arrest. 
patients in this cohort often received analgesia with fentanyl, and sedation with a benzodiazepine or propofol. given the frequent use of sedatives and paralytics in survivors of cardiac arrest undergoing hypothermia, future studies should investigate the potential effect of these drugs on prognostication and survival after cardiac arrest. background: the use of therapeutic hypothermia (th) is a burgeoning treatment modality for post-cardiac arrest patients. objectives: we performed a retrospective chart review of patients who underwent post cardiac arrest th at eight different institutions across the united states. our objective was to assess how th is currently being implemented in emergency departments and assess the feasibility of conducting more extensive th research using multi-institution retrospective data. methods: a total of charts with dates from - were sent for review by participating institutions of the peri-resuscitation consortium. of those reviewed, eight charts were excluded for missing data. two independent reviewers performed the review and the results were subsequently compared and discrepancies resolved by a third reviewer. we assessed patient demographics, initial presenting rhythm, time until th initiation, duration of th, cooling methods and temperature reached, survival to hospital discharge, and neurological status on discharge. results: the majority of cases of th had initial cardiac rhythms of asystole or pulseless electrical activity ( . %), followed by ventricular tachycardia or fibrillation ( . %), and in . % the inciting cardiac rhythm was unknown. time to initiation of th ranged from - minutes with a mean time of min (sd . ). length of th ranged from - minutes with a mean time of minutes (sd ). average minimum temperature achieved was . °c, with a range from . - . °c (sd . °c). of the charts reviewed, ( . %) of the patients survived to hospital discharge and ( . %) were discharged relatively neurologically intact. conclusion: research surrounding cardiac arrest has always been difficult given the time and location span from pre-hospital care to emergency department to intensive care unit. also, as witnessed cardiac arrest events are relatively rare with poor survival outcomes, very large sample sizes are needed to make any meaningful conclusions about th. our varied and inconsistent results show that a multi-center retrospective review is also unlikely to provide useful information. a prospective multi-center trial with a uniform th protocol is needed if we are ever to make any evidence-based conclusions on the utility of th for post-cardiac arrest patients. serum results: mean la was . , sd = . . mean age was . years old, sd = . . a statistically significant positive correlation was found between la and pulse, respiratory rate (rr), wbc, platelets, and los, while a significant negative correlation was seen with temperature and hco -. when two subjects were dropped as possible outliers with la > , it resulted in non-significant temperature correlation, but a significant negative correlation with age and bun was revealed. patients in the higher la group were more likely to be admitted (p = . ) and have longer los. of the discharged patients, there was no difference in mean la level between those who returned (n = , mean la of . , sd = . ) and those who did not (n = , mean la of . , sd = . ), p = . . furthermore, mean la levels for those with sepsis (n = , mean la of . , sd = . ) did not differ from those without sepsis (n = , mean la of . , sd = . ), p = . . 
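the associations reported in these results are pairwise correlations between la and each covariate; a minimal sketch with hypothetical paired observations (not the study's data):

```python
# hedged sketch: correlation of serum lactate (la) with one covariate (e.g., pulse)
from scipy.stats import pearsonr

lactate = [1.1, 2.3, 1.8, 3.4, 2.0, 4.1, 1.5]   # mmol/l, hypothetical
pulse   = [96, 118, 104, 132, 110, 140, 101]    # beats/min, hypothetical

r, p = pearsonr(lactate, pulse)
print(f"r = {r:.2f}, p = {p:.3f}")
```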
conclusion: higher la in pediatric patients presenting to the ed with suspected infection correlated with increased pulse, rr, wbc, and platelets, and with decreased bun, hco3-, and age. la may be predictive of hospitalization, but not of -day return rates or pediatric sepsis screening in the ed. background: mandibular fractures are one of the most frequently seen injuries in the trauma setting. in terms of facial trauma, mandibular fractures account for - % of all facial bone fractures. prior studies have demonstrated that the use of a tongue blade to screen these patients to determine whether a mandibular fracture is present may be as sensitive as x-ray. one study showed the sensitivity and specificity of the test to be . % and . %, respectively. in the last ten years, high-resolution computed tomography (hct) has replaced panoramic tomography (pt) as the gold standard for imaging of patients with suspected mandibular fractures. this study determined whether the tongue blade test (tbt) remains as sensitive a screening tool when compared to the new gold standard of ct. objectives: the purpose of the study was to determine the sensitivity and specificity of the tbt as compared to the new gold standard of radiologic imaging, hct. the question being asked: is the tbt still useful as a screening tool for patients with suspected mandibular fractures when compared to the new gold standard of hct? methods: design: prospective cohort study. setting: an urban tertiary care level i trauma center. subjects: any person presenting with facial trauma during the study period from / / to / / . intervention: a tbt was performed by the resident physician and confirmed by the supervising attending physician. ct of the facial bones was then obtained for the ultimate diagnosis. inter-rater reliability (kappa) was calculated, along with sensitivity, specificity, accuracy, ppv, npv, likelihood ratio (lr+), and likelihood ratio (lr-), based on the × contingency tables generated. results: over the study period patients were enrolled. inter-rater reliability was kappa = . (se ± . ). the table demonstrates the outcomes of both the tbt and ct facial bones for mandibular fracture. the following parameters were then calculated based on the contingency table: sensitivity . (ci . - . ), specificity . (ci . - . ), ppv . (ci . - . ), npv . (ci . - . ), accuracy . , lr(+) . , lr(-) . (ci . - . ). conclusion: the tbt is still a useful screening tool to rule out mandibular fractures in patients with facial trauma as compared to the current gold standard of hct. background: appendicitis is the most common surgical emergency occurring in children. the diagnosis of pediatric appendicitis is often difficult, and computerized tomography (ct) scanning is utilized frequently. ct, although accurate, is expensive, time-consuming, and exposes children to ionizing radiation. radiologists utilize ultrasound for the diagnosis of appendicitis, but it may be less accurate than ct, and it may not incorporate the emergency physician (ep) clinical impression regarding degree of risk. objectives: the current study compared ep clinical diagnosis of pediatric appendicitis pre- and post-bedside ultrasonography (bus). methods: children - years of age were enrolled if their clinical attending physician planned to obtain a consultative ultrasound, ct scan, or surgical consult specific for appendicitis. most children in the study received narcotic analgesia to facilitate bus.
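both the tongue blade study above and the bedside ultrasound study described here report screening-test parameters that derive from a 2 × 2 table against the gold standard; a minimal sketch with hypothetical counts (not either study's data):

```python
# hedged sketch: diagnostic accuracy of a screening test against a gold standard,
# computed from a 2x2 table of hypothetical counts
tp, fp = 38, 6     # screen positive: disease present / absent (assumed)
fn, tn = 2, 110    # screen negative: disease present / absent (assumed)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
lr_pos = sensitivity / (1 - specificity)
lr_neg = (1 - sensitivity) / specificity

print(sensitivity, specificity, ppv, npv, lr_pos, lr_neg)
```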
subjects were initially graded for likelihood of appendicitis based on a research physician-obtained history and physical using a visual analogue scale (vas). immediately subsequent to initial grading, research physicians performed a bus and recorded a second vas impression of appendicitis likelihood. two outcome measures were combined as the gold standard for statistical analysis. the post-operative pathology report served as the gold standard for subjects who underwent appendectomy, while post -week telephone follow-up was used for subjects who did not undergo surgery. various specific ultrasound measures used for the diagnosis of appendicitis were assessed as well. results: / subjects had pathology-proven appendicitis. one subject was pathology-negative post-appendectomy. of the subjects who did not undergo surgery, none had developed appendicitis at the post -week telephone follow-up. pre-bus sensitivity was % ( - %) while post-bus sensitivity was % ( - %). both pre- and post-bus specificity were % ( - %). pre-bus lr+ was ( - ), while post-bus lr+ was ( - ). pre- and post-bus lr- were . and . , respectively. bus changed the diagnosis for % of subjects ( - %). background: there are very few data on the normal distance between the glenoid rim and the posterior aspect of the humeral head in normal and dislocated shoulders. while shoulder x-rays are commonly used to detect shoulder dislocations, they may be inadequate, exacerbate pain in the acquisition of some views, and lead to delay in treatment, compared to bedside ultrasound evaluation. objectives: our objective was to compare the glenoid rim to humeral head distance in normal shoulders and in anteriorly dislocated shoulders. this is the first study proposing to set normal and abnormal limits. methods: subjects were enrolled in this prospective observational study if they had a chief complaint of shoulder pain or injury, and received a shoulder ultrasound as well as a shoulder x-ray. the sonographers were undergraduate students given ten hours of training to perform the shoulder ultrasound. they were blinded to the x-ray interpretation, which was used as the gold standard. we used a posterior-lateral approach, capturing an image including the glenoid rim, the humeral head, and the infraspinatus muscle. two parallel lines were applied to the most posterior aspect of the humeral head and the most posterior aspect of the glenoid rim. a line perpendicular to these lines was applied, and the distance measured. in anterior dislocations, a negative measurement was used to denote the fact that the glenoid rim is now posterior to the most posterior aspect of the humeral head. descriptive analysis was applied to estimate the mean and th to th interquartile range of normal and anteriorly dislocated shoulders. results: eighty subjects were enrolled in this study. there were six shoulder dislocations; however, only four were anterior dislocations. the average distance between the posterior glenoid rim and the posterior humeral head in normal shoulders was . mm, with a th to th interquartile range of . mm to . mm. the distance in our four cases of anterior dislocation was − mm, with a th to th interquartile range of − mm to − mm. conclusion: the distance between the posterior humeral head and the posterior glenoid rim may be mm to mm in patients presenting to the ed with shoulder pain but no dislocation. in contrast, this distance in anterior dislocations was greater than − mm.
shoulder ultrasound may be a useful adjunct to x-ray for diagnosing anterior shoulder dislocations. conclusion: in this retrospective study, the presence of rv strain on focus significantly increases the likelihood of an adverse short term event from pulmonary embolism and its combination with hypotension performs similarly to other prognostic rules. background: burns are expensive and debilitating injuries, compromising both the structural integrity and vascular supply to skin. they exhibit a substantial potential to deteriorate if left untreated. jackson defined three ''zones'' to a burn. while the innermost coagulation zone and the outermost zone of hyperemia display generally predictable healing outcomes, the zone of stasis has been shown to be salvageable via clinical intervention. it has therefore been the focus of most acute therapies for burn injuries. while laser doppler imaging (ldi) -the current gold standard for burn analysis -has been % effective at predicting the need for second degree burn excision, its clinical translation is problematic, and there is little information regarding its ability to analyze the salvage of the stasis zone in acute injury. laser assisted indocyanine green dye angiography (laicga) also shows potential to predict such outcomes with greater clinical utility. objectives: to test the ability of ldi and laicga to predict interspace (zone of stasis) survival in a horizontal burn comb model. methods: a prospective animal experiment was performed using four pigs. each pig had a set of six dorsal burns created using a brass ''comb'' -creating four rectangular · mm full thickness burns separated by · mm interspaces. laicga and ldi scanning took place at hour, hours, hours, and week post burn using novadaq spy and moor ldi respectively. imaging was read by a blinded investigator, and perfusion trends were compared with interspace viability and contraction. burn outcomes were read clinically, evaluated via histopathology, and interspace contraction was measured using image j software. results: laicga data showed significant predictive potential for interspace survival. it was . % predictive at hours post burn, % predictive hours post burn, and % predictive days post burn using a standardized perfusion threshold. ldi imaging failed to predict outcome or contraction trends with any degree of reliability. the pattern of perfusion also appears to be correlated with the presence of significant interspace contraction at days, with an % adherence to a power trendline. ventions, isolation, testing, treatment, and ''other'' category intervention were identified. one intervention involving school closures was associated with a % decrease in pediatric ed visits for respiratory illness. conclusion: most interventions were not tested in isolation, so the effect of individual interventions was difficult to differentiate. interventions associated with statistically significant decreases in ed crowding were school closures, as well as interventions in all categories studied. further study and standardization of intervention input, process, and outcome measures may assist in identifying the most effective methods of mitigating ed crowding and improving surge capacity during an influenza or other respiratory disease outbreak. 
communication background: the link between extended shift lengths, sleepiness, and occupational injury or illness has been shown, in other health care populations, to be an important and preventable public health concern but heretofore has not been fully described in emergency medical services (ems objectives: to assess the effect of an ed-based computer screening and referral intervention for ipv victims and to determine what characteristics resulted in a positive change in their safety. we hypothesized that women who were experiencing severe ipv and/or were in contemplation or action stages would be more likely to endorse safety behaviors. methods: we conducted the intervention for female ipv victims at three urban eds using a computer kiosk to deliver targeted education about ipv and violence prevention as well as referrals to local resources. all adult english-speaking non-critically ill women triaged to the ed waiting room were eligible to participate. the validated universal violence prevention screening protocol was used for ipv screening. any who disclosed ipv further responded to validated questionnaires for alcohol and drug abuse, depression, and ipv severity. the women were assigned a baseline stage of change (precontemplation, contemplation, action, or maintenance) based on the urica scale for readiness to change behavior surrounding ipv. participants were contacted at week and months to assess a variety of pre-determined actions such as moving out, to prevent ipv during that period. statistical analysis (chi-square testing) was performed to compare participant characteristics to the stage of change and whether or not they took protective action. results: a total of , people were screened and disclosed ipv and participated in the full survey. . % of the ipv victims were in the precontemplative stage of change, and . % were in the contemplation stage. women returned at week of follow-up ( . %), and ( . %) women returned at months of followup. . % of those who returned at week, and % of those who returned at months took protective action against further ipv. there was no association between the various demographic characteristics and whether or not a woman took protective action. conclusion: ed-based kiosk screening and health information delivery is both a feasible and effective method of health information dissemination for women experiencing ipv. stage of change was not associated with actual ipv protective measures. objectives: we present a pilot, head-to-head comparison of x and x effectiveness in stopping a motivated person. the objective is to determine comparative injury prevention effectiveness of the newer cew. methods: four humans had metal cew probe pairs placed. each volunteer had two probe pairs placed (one pair each on the right and left of the abdomen/inguinal region). superior probes were at the costal margin, inches lateral of midline. inferior probes were vertically inferior at predetermined distances of , , , and inches apart. each volunteer was given the goal of slashing a target feet away with a rubber knife during cew exposure. as a means of motivation, they believed the exposure would continue until they reached the goal (in reality, the exposure was terminated once no further progress was made). each volunteer received one exposure from a x and a x cew. the exposure order was randomized with a -minute rest between them. exposures were recorded on a hi-speed, hi-resolution video. 
videos were reviewed and scored by six physician, kinesiology, and law officer experts using standardized criteria for effectiveness including degree of upper and lower extremity, and total body incapacitation, and degree of goal achievement. reviews were descriptively compared independently for probe spread distances and between devices. results: there were exposures ( pairs) for evaluation and no discernible, descriptive reviewer differences in effectiveness between the x and the x cews when compared. background: the trend towards higher gasoline prices over the past decade in the u.s. has been associated with higher rates of bicycle use for utilitarian trips. this shift towards non-motorized transportation should be encouraged from a physical activity promotion and sustainability perspective. however, gas price induced changes in travel behavior may be associated with higher rates of bicycle-related injury. increased consideration of injury prevention will be a critical component of developing healthy communities that help safely support more active lifestyles. objectives: the purpose of this analysis was to a) describe bicycle-related injuries treated in u.s. emergency departments between and and b) investigate the association between gas prices and both the incidence and severity of adult bicycle injuries. we hypothesized that as gas prices increase, adults are more likely to shift away from driving for utilitarian travel toward more economical non-motorized modes of transportation, resulting in increased risk exposure for bicycle injuries. methods: bicycle injury data for adults ( - years) were obtained from the national electronic injury surveillance system (neiss) database for emergency department visits between - . the relationship between national seasonally adjusted monthly rates of bicycle injuries, obtained by a seasonal decomposition of time series, and average national gasoline prices, reported by the energy information administration, was examined using a linear regression analysis. results: monthly rates of bicycle injuries requiring emergency care among adults increase significantly as gas prices rise (p < . , see figure) . an additional , adult injuries ( % ci - , ) can be predicted to occur each month in the u.s. (> , injuries annually) for each $ rise in average gasoline price. injury severity also increases during periods of high gas prices, with a higher percentage of injuries requiring admission. conclusion: increases in adult bicycle use in response to higher gas prices are accompanied by higher rates of significant bicycle-related injuries. supporting the use of non-motorized transportation will be imperative to address public health concerns such as obesity and climate change; however, resources must also be dedicated to improve bicycle-related injury care and prevention. background: this is a secondary analysis of data collected for a randomized trial of oral steroids in emergency department (ed) musculoskeletal back pain patients. we hypothesized that higher pain scores in the ed would be associated with more days out of work. objectives: to determine the degree to which days out of work for ed back pain patients are correlated with ed pain scores. methods: design: prospective cohort. setting: suburban ed with , annual visits. participants: patients aged - years with moderately severe musculoskeletal back pain from a bending or twisting injury £ days before presentation. 
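returning to the gas-price analysis above, the reported association is a linear regression of seasonally adjusted monthly injury rates on average monthly gasoline price; a minimal sketch with simulated series standing in for the neiss and eia data:

```python
# hedged sketch: linear regression of monthly bicycle-injury counts on average
# gasoline price; the series below are simulated, not the neiss/eia data
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
gas_price = np.linspace(1.5, 4.0, 72)                       # dollars per gallon (simulated)
injuries = 2500 + 400 * gas_price + rng.normal(0, 150, 72)  # monthly injuries (simulated)

model = sm.OLS(injuries, sm.add_constant(gas_price)).fit()
print(model.params)   # second element: additional injuries per $1 rise in gas price
print(model.pvalues)
```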
exclusion criteria included nonmusculoskeletal etiology, direct trauma, motor deficits, and employer-initiated visits. observations: we captured initial and discharge ed visual analog pain scores (vas) on a - scale. patients were contacted approximately days after discharge and queried about the days out of work. we plotted days out of work versus initial vas, discharge vas, and change in vas and calculated correlation coefficients. using the bonferroni correction because of multiple comparisons, alpha was set at . . results: we analyzed patients for whom complete data were available. the mean age was ± years and % were female. the average initial and discharge ed pain scales were . ± . and . ± . , respectively. on follow-up, % of patients were back to work and % did not lose any days of work. for the plots of the days out of work versus the initial and discharge vas and the change in the vas, the correlation coefficients (r ) were . (p = . ), . (p = . ), and . (p = . ), respectively. conclusion: for ed patients with musculoskeletal back pain, we found no statistically significant correlation between days out of work and ed pain scores. background: conducted electrical weapons (cews) are common law enforcement tools used to subdue and repel violent subjects and, therefore, prevent further injury or violence from occurring in certain situations. the taser x is a new generation of cew that has the capability of firing two cartridges in a ''semi-automatic'' mode, and has a different electrical waveform and different output characteristics than older generation technology. there have been no data presented on the human physiologic effects of this new generation cew. objectives: the objective of this study was to evaluate the human physiologic effects of this new cew. methods: this was a prospective, observational study of human subjects. an instructor shot subjects in the abdomen and upper thigh with one cartridge, and subjects received a -second exposure from the device. measured variables included: vital signs, continuous spirometry, pre-and post-exposure ecg, intra-exposure echocardiography, venous ph, lactate, potassium, ck, and troponin. results: ten subjects completed the study (median age . , median bmi . , % male). there were no important changes in vital signs or in potassium. the median increase in lactate during the exposure was . , range . to . . the median change in ph was ) . , range ) . to . . no subject had a clinically relevant ecg change, evidence of cardiac capture, or positive troponin up to hours after exposure. the median change in creatine kinase (ck) at hours was , range ) to . there was no evidence of impairment of breathing by spirometry. baseline median minute ventilation was . , which increased to . during the exposure (p = . ), and remained elevated at . post-exposure (p = . ). conclusion: we detected a small increase in lactate and decrease in ph during the exposure, and an increase in ck hours after the exposure. the physiologic effects of the x device appear similar to previous reports for ecd devices. use background: public bicycle sharing (bikeshare) programs are becoming increasingly common in the us and around the world. these programs make bicycles easily accessible for hourly rental to the public. there are currently active bikeshare programs in cities in the us, and more than programs are being developed in cities including new york and chicago. 
despite the importance of helmet use, bikeshare programs do not provide the opportunity to purchase or rent helmets. while the programs encourage helmet use, no helmets are provided at the rental kiosks. objectives: we sought to describe the prevalence of helmet use among adult users of bikeshare programs and users of personal bicycles in two cities with recently introduced bicycle sharing programs (boston, ma and washington, dc). methods: we performed a prospective observational study of bicyclists in boston, ma and washington, dc. trained observers collected data during various times of the day and days of the week. observers recorded the sex of the bicycle operator, type of bicycle, and helmet use. all bicycles that passed a single stationary location in any direction for a period of between and minutes were recorded. data are presented as frequencies of helmet use by sex, type of bicycle (bikeshare or personal), time of the week (weekday or weekend), and city. logistic regression was used to estimate the odds ratio for helmet use controlling for type of bicycle, sex, day of week, and city. results: there were observation periods in two cities at locations. , bicyclists were observed. there were ( . %) bicyclists riding bikeshare bicycles. overall helmet use was . %, although helmet use varied significantly with sex, day of use, and type of bicycle (see figure). bikeshare users were helmeted at a lower rate compared to users of personal bicycles ( . % vs . %). logistic regression, controlling for type of bicycle, sex, day of week, and city, demonstrated that bikeshare users had higher odds of riding unhelmeted (or . , % ci . - . ). women had lower odds of riding unhelmeted (or . , . - . ), while weekend riders were more likely to ride unhelmeted (or . , . - . ). conclusion: use of bicycle helmets by users of public bikeshare programs is low. as these programs become more popular and prevalent, efforts to increase helmet use among users should increase. background: abusive head trauma (aht) represents one of the most severe forms of traumatic brain injury (tbi) among abused infants, with % mortality. young adult males account for % of the perpetrators. most aht prevention programs are hospital-based and reach a predominantly female audience. there are no published reports of school-based aht prevention programs to date. objectives: . to determine whether a high school-based aht educational program will improve students' knowledge of aht and parenting skills. . to evaluate the feasibility and acceptability of a school-based aht prevention program. methods: this program was based on an inexpensive commercially available program developed by the national center on shaken baby syndrome. the program was modified to include a -minute interactive presentation that teaches teenagers about aht, parenting skills, and caring for inconsolable crying infants. the program was administered in three high schools in flint, michigan during spring . students' knowledge was evaluated with a -item written test administered pre-intervention, post-intervention, and two months after program completion. program feasibility and acceptability were evaluated through interviews and surveys with flint area school social workers, parent educators, teachers, and administrators. results: in all, high school students ( % male) participated. of these, ( . %) completed the pre-test and post-test, with ( %) completing the two-month follow-up test.
the mean pre-intervention, postintervention, and two-month follow-up scores were %, %, and % respectively. from pre-test to posttest, mean score improved %, p < . . this improvement was even more profound in young males, whose mean post-test score improved by %, p < . . of the participating social workers, parent educators, teachers, and administrators, % ranked the program as feasible and acceptable. conclusion: students participating in our program showed an improvement in knowledge of aht and parenting skills which was retained after two months. teachers, social workers, parent educators, and school administrators supported the program. this local pilot program has the potential to be implemented on a larger scale in michigan with the ultimate goal of reducing aht amongst infants. will background: fear of litigation has been shown to affect physician practice patterns, and subsequently influence patient care. the likelihood of medical malpractice litigation has previously been linked with patient and provider characteristics. one common concern is that a patient may exaggerate symptoms in order to obtain monetary payouts; however, this has never been studied. objectives: we hypothesize that patients are willing to exaggerate injuries for cash settlements and that there are predictive patient characteristics including age, sex, income, education level, and previous litigation. methods: this prospective cross-sectional study spanned june to december , in a philadelphian urban tertiary care center. any patient medically stable enough to fill out a survey during study investigator availability was included. two closed-ended paper surveys were administered over the research period. standard descriptive statistics were utilized to report incidence of: patients who desired to file a lawsuit, patients previously having filed lawsuits, and patients willing to exaggerate the truth in a lawsuit for a cash settlement. chi-square analysis was performed to determine the relationship between patient characteristics and willingness to exaggerate injuries for a cash settlement. results: of surveys, were excluded due to incomplete data, leaving for analysis. the mean age was with a standard deviation of , and % were male. the incidence of patients who had the desire to sue at the time of treatment was %. the incidence of patients who had filed a lawsuit in the past was %. of those patients, % had filed multiple lawsuits. fifteen percent [ % ci - %] of all patients were willing to exaggerate injuries for cash settlement. sex and income were found to be statistically significant predictors of willingness to exaggerate symptoms: % of females vs. % of males were willing to exaggerate (p = . ), and % of people with income less than $ , /yr vs. % of those with income over $ , / yr were willing to exaggerate (p = . ). conclusion: patients at a philadelphian urban tertiary center admit to willingness to exaggerate symptoms for a cash settlement. willingness to exaggerate symptoms is associated with female sex and lower income. background: current data suggest that as many as % of patients presenting to the ed with syncope leave the hospital without a defined etiology. prior studies have suggested a prevalence of psychiatric disease as high as % in patients with syncope of unknown etiology. objectives: to determine whether psychiatric disease and substance abuse are associated with an increased incidence of syncope of unknown etiology. 
methods: prospective, observational, cohort study of consecutive ed patients ‡ presenting with syncope was conducted between / and / . patients were queried in the ed and charts reviewed about a history of psychiatric disease, use of psychiatric medication, substance abuse, and duration. data were analyzed using sas with chi-square and fisher's exact tests. results: we enrolled patients who presented to the ed after syncope, of whom did not have an identifiable etiology for their syncopal event. . % of those without an identifiable etiology were male. ( %) patients had a history of or current psychiatric disease ( % male), and patients ( %) had a history of or current substance abuse ( % male). among males with psychiatric disease, % had an unknown etiology of their syncopal event, compared to % of males without psychiatric disease (p = . ). similarly, among all males with a history of substance abuse, % had an unknown etiology, as compared to % of males without a history of substance abuse (p = . ). a similar trend was not identified in elderly females with psychiatric disease (p = . ) or substance abuse (p = . ). however, syncope of unknown etiology was more common among both men and women under age with a history of substance abuse ( %) compared to those without a history of substance abuse ( %; p = . ). conclusion: our results suggest that psychiatric disease and substance abuse are associated with increased incidence of syncope of unknown etiology. patients evaluated in the ed or even hospitalized with syncope of unknown etiology may benefit from psychiatric screening and possibly detoxification referral. this is particularly true in men. (originally submitted as a ''late-breaker.'') scope background: after discharge from an emergency department (ed), pain management often challenges parents, who significantly under-treat their children's pain. rapid patient turnover and anxiety make education about home pain treatment difficult in the ed. video education standardizes information and circumvents insufficient time and literacy. objectives: to evaluate the effectiveness of a -minute instructional video for parents that targets common misconceptions about home pain management. methods: we conducted a randomized, double-blinded clinical trial of parents of children ages - years who presented with a painful condition, were evaluated, and discharged home in june and july . parents were randomized to a pain management video or an injury prevention control video. primary outcome was the proportion of parents who gave pain medication at home. these data were recorded in a home pain diary and analyzed using a chi-square test. parents' knowledge about pain treatment was tested before, immediately following, and days after intervention. mcnemar's test statistic determined odds that knowledge correlated with the intervention group. results: parents were enrolled: watched the pain education video, and the control video. . % completed follow up, providing information about home pain education use. significantly more parents provided at least one dose of pain medication to their children after watching the educational video: % vs. % (difference %, % ci . %, . %). the odds the parent had correct knowledge about pain treatment significantly improved immediately following the educational video for knowledge about pain scores (p = . ), the effect of pain on function (p < . ), and pain medication misconceptions (p < . ). 
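mcnemar's test, named above in the pain-education trial, compares paired binary responses, for example whether the same parent answered a knowledge item correctly before and after the video. a small sketch with statsmodels follows; the counts are made up for illustration and are not the trial's data.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# hypothetical paired pre/post answers to one knowledge question:
# rows = correct before the video (yes, no), columns = correct after (yes, no)
table = np.array([[52, 4],    # correct before: stayed correct / became incorrect
                  [27, 17]])  # incorrect before: became correct / stayed incorrect

# exact=True uses the binomial distribution on the discordant pairs,
# which is appropriate when the discordant counts are small
result = mcnemar(table, exact=True)
print(f"mcnemar statistic = {result.statistic}, p = {result.pvalue:.4f}")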
these significant differences in knowledge remained days after the video intervention. the educational video about home pain treatment viewed by parents significantly increased the proportion of children receiving pain medication at home and significantly improved knowledge about at-home pain management. videos are an efficient tool to provide medical advice to parents that improves outcomes for children. methods: this was a prospective, observational study of consecutive admitted cpu patients in a large-volume academic urban ed. cardiology attendings round on all patients and stress test utilization is driven by their recommendation. eligibility criteria include: age> , aha low/intermediate risk, nondynamic ecgs, and normal initial troponin i. patients > and with a history of cad or co-existing active medical problem were excluded. based on prior studies and our estimated cpu census and demographic distribution, we estimated a sample size of , patients in order to detect a difference in stress utilization of % ( -tailed, a = . , b = . ). we calculated a timi risk prediction score and a diamond & forrester (d&f) cad likelihood score on each patient. t-tests were used for univariate comparisons of demographics, cardiac comorbidities, and risk scores. logistic regression was used to estimate odds ratios (ors) for receiving testing based on race, controlling for insurance and either timi or d&f score. results: over months, , patients were enrolled. mean age was ± , and % ( % ci - ) were female. sixty percent ( % ci - ) were caucasian, % ( % ci - ) african american, and % ( % ci - ) hispanic. mean timi and d&f scores were . ( % ci . - . ) and % ( % ci - ). the overall stress testing rate was % ( % ci - ). after controlling for insurance status and timi or d&f scores, african american patients had significantly decreased odds of stress testing (or timi . ( % ci . - . ), or d&f . ( % ci . - . )). hispanics had significantly decreased odds of stress testing in the model controlling for d&f (or d&f . ( % ci . - . )). conclusion: this study confirms that disparities in the workup of african american patients in the cpu are similar to those found in the general ed and the outpatient setting. further investigation into the specific provider or patient level factors contributing to this bias is necessary. the outcomes for hf and copd were sae . %, . %; death . %, . %. we found univariate associations with sae for these walk test components: too ill to walk (both hf, copd p < . ); highest heart rate ‡ (hf p = . , copd p = . ); lowest sao < % (hf p = . , copd p = . ); borg score ‡ (hf p = . , copd p = . ); walk test duration £ minute (hf p = . . copd p = . ). after adjustment for multiple clinical covariates with logistic regression analyses, we found ''walk test heart rate ‡ '' had an odds ratio of . for hf patients and ''too ill to start the walk test'' had an odds ratio of . for copd patients. conclusion: we found the -minute walk test to be easy to administer in the ed and that maximum heart rate and inability to start the test were highly associated with adverse events in patients with exacerbations of hf and copd, respectively. we suggest that the -minute walk test be routinely incorporated into the assessment of hf and copd patients in order to estimate risk of poor outcomes. 
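the chest pain unit study above sizes its sample to detect a given difference in stress-test utilization at a fixed two-tailed alpha and beta. a sketch of that kind of two-proportion power calculation is shown below; the planning values (60% vs 50% utilization, alpha 0.05, power 0.80) are illustrative assumptions, since the study's exact figures are not reproduced here.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# illustrative planning values only: detect a drop in stress-test utilization
# from 60% to 50%, two-tailed alpha = 0.05, power = 0.80 (beta = 0.20)
effect_size = proportion_effectsize(0.60, 0.50)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(f"required sample size: {n_per_group:.0f} per group")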
the objectives: the objective of this study was to investigate differences in consent rates between patients of different demographic groups who were invited to participate in minimal-risk clinical trials conducted in an academic emergency department. methods: this descriptive study analyzed prospectively collected data of all adult patients who were identified as qualified participants in ongoing minimal risk clinical trials. these trials were selected for this review because they presented minimal factors known to be associated background: increasing rates of patient exposure to computerized tomography (ct) raise questions about appropriateness of utilization, as well as patient awareness of radiation exposure. despite rapid increases in ct utilization and published risks, there is no national standard to employ informed consent prior to radiation exposure from diagnostic ct. use of written informed consent for ct (icct) in our ed has increased patient understanding of the risks, benefits, and alternatives to ct imaging. our team has developed an adjunct video educational module (vem) to further educate ed patients about the ct procedure. objectives: to assess patient knowledge and preferences regarding diagnostic radiation before and after viewing vem. methods: the vem was based on icct currently utilized at our tertiary care ed (census , patients/ year). icct is written at an th grade reading level. this fall, vem/icct materials were presented to a convenience sample of patients in the ed waiting room am- pm, monday-sunday. patients who were < years of age, critically ill, or with language barrier were excluded. to quantify the educational value of the vem, a six-question pretest was administered to assess baseline understanding of ct imaging. the patients then watched the vem via ipad (macintosh) and reviewed the consent form. an eight-question post-test was then completed by each subject. no phi were collected. pre-and post-test results were analyzed using mcnemar's test for individual questions and a paired t-test for the summed score (sas version . ). results: patients consented and completed the survey. the average pre-test score for subjects was poor, % correct. review of vem/icct materials increased patient understanding of medical radiation as evidenced by improved post-test score to %. mean improvement between tests was % (p < . ). % of subjects responded that they found the materials helpful, and that they would like to receive icct. conclusion: the addition of a video educational module improved patient knowledge regarding ct imaging and medical radiation as quantified by pre-and posttesting. patients in our study sample reported that they prefer to receive icct. by educating patients about the risks associated with ct imaging, we increase informed, shared decision making -an essential component of patient-centered care. does objectives: we sought to determine the relationship between patients' pain scores and their rate of consent to ed research. we hypothesized that patients with higher pain scores would be less likely to consent to ed research. methods: retrospective observational cohort study of potential research subjects in an urban academic hospital ed with an average annual census of approximately , visits. subjects were adults older than years with chief complaint of chest pain within the last hours, making them eligible for one of two cardiac biomarker research studies. the studies required only blood draws and did not offer compensation. 
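the ct-education study above compares summed pre- and post-test scores within the same subjects using a paired t-test. a minimal sketch follows; the scores are simulated placeholders rather than the study's data, and scipy is used here instead of sas.

import numpy as np
from scipy.stats import ttest_rel

# hypothetical summed test scores (percent correct) for the same subjects
# before and after viewing the educational module
rng = np.random.default_rng(1)
pre = rng.normal(loc=50, scale=12, size=40).clip(0, 100)
post = (pre + rng.normal(loc=25, scale=10, size=40)).clip(0, 100)

t_stat, p_value = ttest_rel(post, pre)
print(f"mean improvement = {np.mean(post - pre):.1f} points, "
      f"t = {t_stat:.2f}, p = {p_value:.4f}")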
two reviewers extracted data from research screening logs. patients were grouped according to pain score at triage, pain score at the time of approach, and improvement in pain score (triage score -approach score). the main outcome was consent to research. simple proportions for consent rates by pain score tertiles were calculated. two multivariate logistic regression analyses were performed with consent as outcome and age, race, sex, and triage or approach pain score as predictors. results: overall, potential subjects were approached for consent. patients were % caucasian, % female, and with an average age of years. six patients did not have pain scores recorded at all and did not have scores documented within hours of approach and were excluded from relevant analyses. overall, . % of patients consented. consent rates by tertiles at triage, at time of approach, and by pain score improvement are shown in tables and . after adjusting for age, race, and sex, neither triage (p = . ) nor approach (p = . ) pain scores predicted consent. conclusion: research enrollment is feasible even in ed patients reporting high levels of pain. patients with modest improvements in pain levels may be more likely to consent. future research should investigate which factors influence patients' decisions to participate in ed research. conclusion: in this multicenter study of children hospitalized with bronchiolitis neither specific viruses nor their viral load predicted the need for cpap or intubation, but young age, low birth weight, presence of apnea, severe retractions, and oxygen saturation < % did. we also identified that children requiring cpap or intubation were more likely to have mothers who smoked during pregnancy and a rapid respiratory worsening. mechanistic research in these high-risk children may yield important insights for the management of severe bronchiolitis. brigham & women's hospital, boston, ma background: siblings and children who share a home with a physically abused child are thought to be at high risk for abuse. however, rates of injury in these children are unknown. disagreements between medical and child protective services professionals are common and screening is highly variable. objectives: our objective was to measure the rates of occult abusive injuries detected in contacts of abused children using a common screening protocol. methods: this was a multi-center, observational cohort study of child abuse teams who shared a common screening protocol. data were collected between jan , and april , for all children < years undergoing evaluation for physical abuse and their contacts. for contacts of abused children, the protocol recommended physical examination for all children < years, skeletal survey and physical exam for children < months, and physical exam, skeletal survey, and neuroimaging for children < months old. results: among , children evaluated for abuse, met criteria as ''physically abused'' and these had contacts. for each screening modality, screening was completed as recommended by the protocol in approximately % of cases. of contacts who met criteria for skeletal survey, new injuries were identified in ( . %). none of these fractures had associated findings on physical examination. physical examination identified new injuries in . % of eligible contacts. neuroimaging failed to identify new injuries among eligible contacts less than months old. twins were at significantly increased risk of fracture relative to other nontwin contacts (or . ). 
conclusion: these results support routine skeletal survey for contacts of physically abused children < months old, regardless of physical examination findings. even for children where no injuries are identified, these results demonstrate that abuse is common among children who share a home with an abused child, and support including contacts in interventions (foster care, safety planning, social support) designed to protect physically abused children. methods: this was a retrospective study evaluating all children presenting to eight paediatric, universityaffiliated eds during one year in - . in each setting, information regarding triage and disposition were prospectively registered by clerks in the ed database. anonymized data were retrieved from the ed computerized database of each participating centre. in the absence of a gold standard for triage, hospitalisation, admission to intensive care unit (icu), length of stay in the ed, and proportion of patients who left without being seen by a physician (lwbs) were used as surrogate markers of severity. the primary outcome measure was the association between triage level (from to ) and hospitalisation. the association between triage level and dichotomous outcomes was evaluated by a chi-square test, while a student's t-test was used to evaluate the association between triage level and length of stay. it was estimated that the evaluation of all children visiting these eds for a one year period would provide a minimum of , patients in each triage level and at least events for outcomes having a proportion of % or more. results: a total of , children visited the eight eds during the study period. pooled data demonstrated hospitalisation proportions of %, %, %, %, and . % for patients triaged at level , , , , and respectively (p < . ). there was also a strong association between triage levels and admission to icu (p < . ), the proportion of children who lwbs (p < . ), and length of stay (p < . ). background: parents frequently leave the emergency department (ed) with incomplete understanding of the diagnosis and plan, but the relationship between comprehension and post-care outcomes has not been well described. objectives: to explore the relationship between comprehension and post-discharge medication safety. methods: we completed a planned secondary analysis of a prospective observational study of the ed discharge process for children aged - months. after discharge, parents completed a structured interview to assess comprehension of the child's condition, the medical team's advice, and the risk of medication error. limited understanding was defined as a score of - from (excellent) to (poor). risk of medication error was defined as a plan to use over-the-counter cough/cold medication and/or an incorrect dose of acetaminophen (measured by direct observation at discharge or reported dose at follow-up call). parents identified as at risk received further instructions from their provider. the primary outcome was persistent risk of medication error assessed at phone interview - days post-discharge. a major barrier to administering analgesics to children is the perceived discomfort of intravenous access. the delivery of intranasal analgesia may be a novel solution to this problem. objectives: we investigated whether the addition of the mucosal atomizer device (mad) as an alternative for fentanyl delivery would improve overall fentanyl administration rates in pediatric patients transported by a large urban ems system. 
we performed a historical control trial comparing the rate of pediatric fentanyl administration months before and months after the introduction of the mad. study subjects were pediatric trauma patients (age < years) transported by a large urban ems agency. the control group was composed of patients treated in the months before introduction of the mad. the experimental group included patients treated in the months after the addition of the mad. two physicians reviewed each chart and determined whether the patient met predetermined criteria for the administration of pain medication. a third reviewer resolved any discrepancies. fentanyl administration rates were measured and compared between the two groups. we used two-sample t-tests and chi-square tests to analyze our data. results: patients were included in the study: patients in the pre-mad group and in the post-mad group. there were no significant differences in the demographic and clinical characteristics of the two groups. ( . %) patients in the control arm received fentanyl. ( . %) of patients in the experimental arm received fentanyl with % of the patients receiving fentanyl via the intranasal route. the addition of the mad was not associated with a statistically significant increase in analgesic administration. age and mechanism of injury were statistically more predictive of analgesia administration. conclusion: while the addition of the mucosal atomizer device as an alternative delivery method for fentanyl shows a trend towards increased analgesic administration in a prehospital pediatric population, age and mechanism of injury are more predictive in who receives analgesia. further research is necessary to investigate the effect of the mad on pediatric analgesic delivery. methods: this was a prospective study evaluating php-se before (pre) and after (post) a ppp introduction and months later ( -mo). php groups received either ppp review and education or ppp review alone. the ppp included a pain assessment tool. the se tool, developed and piloted by pediatric ems experts, uses a ranked ordinal scale ranging from 'certain i cannot do it' ( ) to 'completely certain i can do it' ( ) for items: pain assessment ( items), medication administration ( ) and dosing ( ) , and reassessment ( ). all items and an averaged composite were evaluated for three age groups (adult, child, toddler). paired sample t-tests compared post-and -mo scores to pre-ppp scores. results: of phps who completed initial surveys, phps completed -mo surveys. ( %) received education and ppp review and ( %) review only. ppp education did not affect php-se (adult p = . , child p = . , toddler p = . ). the largest se increase was in pain assessment. this increase persisted for child and toddler groups at months. the immediate increase in composite se scores for all age groups persisted for the toddler group at months. conclusion: increases in composite and pain assessment php-se occur for all age groups immediately after ppp introduction. the increase in pain assessment se persisted at months for pediatric age groups. composite se increase persisted for the toddler age group alone. background: pediatric medications administered in the prehospital setting are given infrequently and dosage may be prone to error. calculation of dose based on known weight or with use of length-based tapes occurs even less frequently and may present a challenge in terms of proper dosing. 
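the dosing-error analysis described below flags a delivered dose as an error when it deviates from the recommended weight-based (mg/kg) dose by more than a set percentage. a minimal sketch of such a check is shown here; the 0.1 mg/kg reference dose and the 20% tolerance are illustrative assumptions only, not the study's protocol or clinical guidance.

def dose_error(delivered_mg: float, weight_kg: float,
               recommended_mg_per_kg: float, tolerance: float = 0.20) -> bool:
    """return True if the delivered dose falls outside the tolerated range."""
    expected_mg = recommended_mg_per_kg * weight_kg
    deviation = abs(delivered_mg - expected_mg) / expected_mg
    return deviation > tolerance

# example: a hypothetical 12 kg child, reference dose 0.1 mg/kg
print(dose_error(delivered_mg=1.6, weight_kg=12, recommended_mg_per_kg=0.1))  # True  (overdose)
print(dose_error(delivered_mg=1.1, weight_kg=12, recommended_mg_per_kg=0.1))  # False (within tolerance)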
objectives: to characterize dosing errors based on weight-based calculations in pediatric patients in two similar emergency medical service (ems) systems. methods: we studied the five most commonly administered medications given to pediatric patients weighing kg or less. drugs studied were morphine, midazolam, epinephrine : , , epinephrine : , and diphenhydramine. cases from the electronic record were studied for a total of months, from january to july . each drug was administered via intravenous, intramuscular, or intranasal routes. drugs that were permitted to be titrated were excluded. an error was defined as greater than % above or below the recommended mg/kg dosage. results: out of , total patients, , were pediatric patients. had documented weights of < kg and patients were given these medications. we excluded patients for weight above the %ile or below the %ile, or if the weight documentation was missing. of the patients and doses, errors were noted in ( %; % ci %, %). midazolam was the most common drug in errors ( of doses or %; % ci %, %), followed by diphenhydramine ( / or %; % ci %, %), epinephrine ( / or %; % ci %, %), and morphine sulfate ( / or %; % ci, %, %). underdosing was noted in of ( %; % ci %, %) of errors, while excessive dosing was noted in of ( %; % ci %, %). conclusion: weight-based dosing errors in pediatric patients are common. while the clinical consequences of drug dosing errors in these patients are unknown, a considerable amount of inaccuracy occurs. strategies beyond provision of reference materials are needed to prevent pediatric medication errors and reduce the potential for adverse outcomes. drivers background: homelessness affects up to . million people a year. the homeless present more frequently to eds, their ed visits are four times more likely to occur within days of a prior ed evaluation, and they are admitted up to five times more frequently than others. we evaluated the effect of a street outreach rapid response team (sorrt) on the health care utilization of a homeless population. a nonmedical outreach staff responds to the ed and intensely case manages the patient: arranges primary care follow-up, social services, temporary housing opportunities, and drug/ alcohol rehabilitation services. objectives: we hypothesized that this program would decrease the ed visits and hospital admissions of this cohort of patients. methods: before and after study at an urban teaching hospital from june, -december, in indianapolis, indiana. upon identification of homeless status, sorrt was immediately notified. eligibility for sorrt enrollment is determined by housing and urban development homeless criteria and the outreach staff attempted to enter all such identified patients into the program. the patients' health care utilization was evaluated in the months prior to program entry as compared to the months after enrollment by prospectively collecting data and a retrospective medical record query for any unreported visits. since the data were highly skewed, we used the nonparametric signed rank test to test for paired differences between periods. results: patients met criteria but two refused participation. the -patient cohort had total ed visits ( pre and post) with a mean of . (sd . ) and median of . (range - ) ed visits in months pre-sorrt as compared to a mean of . (sd . ) and median of . ( - ) in months post-sorrt (p = . ). there were total inpatient admissions pre-intervention and post-intervention, with a mean of . (sd . ) and median of . ( . 
) per patient in the pre-intervention period as compared to . (sd . ) and . ( - ) in the post-intervention period (p = . ). in the pre-sorrt period . % had at least one inpatient admission as compared to . % post-sorrt (p = . ). there were no differences in icu days or overall length of stay between the two periods. conclusion: an aggressive case management program beginning immediately with homeless status recognition in the ed has not demonstrated success in decreasing utilization in our population. methods: this was a secondary analysis of a prospective randomized trial that included consenting patients discharged with outpatient antibiotics from an urban county ed with an annual census of , . patients unable to receive text messages or voice-mails were excluded. health literacy was assessed using a validated health literacy assessment, the newest vital sign (nvs). patients were randomized to a discharge instruction modality: ) standard care, typed and verbal medication and case-specific instructions; ) standard care plus text-messaged instructions sent to the patient's cell phone; or ) standard care plus voice-mailed instructions sent to the patient's cell. patients were called at days to determine preference for instruction delivery modality. preference for discharge instruction modality was analyzed using z-tests for proportions. results: patients were included ( % female, median age , range months to years); were excluded. % had an nvs score of - , % - , and % - . among the . % of participants reached at days, % preferred a modality other than written. there was a difference in the proportion of patients who preferred discharge instructions in written plus another modality (see table) . with the exception of written plus another modality, patient preference was similar across all nvs score groups. conclusion: in this sample of urban ed patients, more than one in four patients prefer non-traditional (text message, voice-mail) modalities of discharge instruction delivery to standard care (written) modality alone. additional research is needed to evaluate the effect of instructional modality on accessibility and patient compliance. figure) . conclusion: cumulative saps ii scoring fails to predict mortality in ohca. the risk scores assigned to age, gcs, and hco independently predict mortality and combined are good mortality predictors. these findings suggest that an alternative severity of illness score should be used in post-cardiac arrest patients. future studies should determine optimal risk scores of saps ii variables in a larger cohort of ohca. objectives: to determine the extent to which cpp recovers to pre-pause levels with seconds of cpr after a -second interruption in chest compressions for ecg rhythm analysis. methods: this was a secondary analysis of prospectively collected data from an iacuc-approved protocol. fortytwo yorkshire swine (weighing - kg) were instrumented under anesthesia. vf was electrically induced. after minutes of untreated vf, cpr was initiated and a standard dose of epinephrine (sde) ( . mg/kg) was given. after . minutes of cpr to circulate the vasopressor, compressions were interrupted for seconds to analyze the ecg rhythm. this was immediately followed by seconds of cpr to restore cpp before the first rs was delivered. if the rs failed, cpr resumed and additional vasopressors (sde, and vasopressin . mg/kg) were given and the sequence repeated. the cpp was defined as aortic diastolic pressure minus right atrial diastolic pressure. 
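the street outreach study above compares skewed, paired per-patient counts of ed visits before and after enrollment with a nonparametric signed rank test. a minimal sketch with scipy follows; the visit counts and follow-up window are hypothetical placeholders.

import numpy as np
from scipy.stats import wilcoxon

# hypothetical per-patient ed visit counts in the months before and after
# enrollment in an outreach program (skewed counts, as described above)
pre = np.array([2, 5, 1, 9, 3, 14, 0, 4, 6, 2, 8, 1])
post = np.array([1, 6, 0, 7, 4, 15, 1, 2, 5, 3, 9, 0])

# wilcoxon signed-rank test for paired differences
stat, p_value = wilcoxon(pre, post, zero_method="wilcox")
print(f"signed-rank statistic = {stat}, p = {p_value:.3f}")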
the cpp values were extracted at three time points: immediately after the . minutes of cpr, following the -second pause, and immediately before defibrillation for the first two rs attempts in each animal. eighty-three sets of measurements were logged from animals. descriptive statistics were used to analyze the data. in most cities, the proportion of patients who achieve prehospital return of spontaneous circulation (rosc) is less than %. the association between time of day and ohca outcomes in the prehospital setting is unknown. objectives: we sought to determine whether rates of prehospital rosc varied by time of day. we hypothesized that night ohcas would exhibit lower rates of rosc. methods: we performed a retrospective review of cardiac arrest data from a large, urban ems system. included were all ohcas occurring in individuals > years of age from / / to / / . excluded were traumatic arrests and cases where resuscitation measures were not performed. day was defined as : am- : pm, while night was : pm- : am. we examined the association between time of day and paramedic-perceived prehospital rosc in unadjusted and adjusted analyses. variables included age, sex, race, presenting rhythm, aed application by a bystander or first responder, defibrillation, and bystander cpr performance. analyses were performed using chisquare tests and logistic regression. objectives: determine whether a smei helps to improve physician compliance with ihi bundle and reduce patient mortality in ed patients with s&s. methods: we conducted a pre-smei retrospective review of four months of ed patients with s&s to determine baseline pre-smei physician compliance and patient mortality. we designed and completed a smei attended by of ed attending physicians and of ed resuscitation residents. finally, we conducted a twenty-month post-smei prospective study of ongoing physician compliance and patient mortality in ed patients with s&s. results: in the four month pre-smei retrospective review, we identified patients with s&s, with a % physician overall compliance and mortality rate of %. the average ed physician smei multiple-choice pre-test score was %, and showed a significant improvement in the post-test score of % (p = . ). additionally, % of ed physicians were able to describe three new clinical pearls learned and % agreed that the smei would improve compliance. in the twenty months of the post-smei prospective study, we identified patients with s&s, with a % physician overall compliance, and mortality rate of %. relative physician compliance improved % (p = . ) and relative patient mortality was reduced by % (p < . ) when comparing pre-and post-smei data. conclusion: our data suggest that a smei improves overall physician compliance with the six hour goals of the ihi bundle and reduces patient mortality in ed patients with s&s. conclusion: using a population-level, longitudinal, and multi-state analysis, the rate of return visits within days is higher than previously reported, with nearly in returning back to the ed. we also provide the first estimation of health care costs for ed revisits. background: the ability of patients to accurately determine their level of urgency is important in planning strategies that divert away from eds. in fact, an understanding of patient self-triage abilities is needed to inform health policies targeting how and where patients access acute care services within the health care system. 
objectives: to determine the accuracy of a patient's self-assessment of urgency compared against triage nurses. methods: setting: ed patients are assigned a score by trained nurses according to the canadian emergency department triage and acuity scale (ctas). we present a cross-sectional survey of a random patient sample from urban/regional eds conducted during the winters of and . this previously validated questionnaire, based on the british healthcare commission survey, was distributed according to a modified dillman protocol. exclusion criteria consisted of: age - years, left prior to being seen/treated, died during ed visit, no contact information, presented with a privacy-sensitive case. alberta health services provided linked non-survey administrative data. results: , surveys distributed with a response rate of %. patients rated health problems as life-threatening ( %), possibly life-threatening ( %), urgent ( %), somewhat urgent ( %), or not urgent ( %). triage nurses assigned the same patients ctas scores of i (< %), ii ( %), iii ( %), iv ( %) or v ( %). patients self-rated their condition as or points less urgent than the assigned ctas score (< % of the time), points less urgent ( %), point less urgent ( %), exactly as urgent ( %), point more urgent ( %), points more urgent ( %), or or points more urgent ( %, respectively). among ctas i or ii patients, % described their problem as life-threatening/possibly life-threatening, % as urgent (risk of permanent damage), % as urgent (needed to be seen that day), and % as not urgent (wanted to be but did not need to be seen that day). conclusion: the majority of ed patients are generally able to accurately assess the acuity of their problem. encouraging patients with low-urgency conditions to self-triage to lower-acuity sources of care may relieve stress on eds. however, physicians and patients must be aware that a small minority of patients are unable to self-triage safely. when the tourniquet was released, blood spurted from the injured artery as hydrostatic pressure decayed. pressure and flow were recorded in three animals (see table) . the concept was proof-tested in a single fresh frozen human cadaver with perfusion through the femoral artery and hemorrhage from the popliteal artery. the results were qualitatively and quantitatively similar to the swine carcass model. conclusion: a perfused swine carcass can simulate exsanguinating hemorrhage for training purposes and serves as a prototype for a fresh-frozen human cadaver model. additional research and development are required before the model can be widely applied. background: in the pediatric emergency department (ped), clinicians must work together to provide safe and effective care. crisis resource management (crm) principles have been used to improve team performance in high-risk clinical settings, while simulation allows practice and feedback of these behaviors. objectives: to develop a multidisciplinary educational program in a ped using simulation-enhanced teamwork training to standardize communication and behaviors and identify latent safety threats. methods: over months a workgroup of physicians and nurses with experience in team training and simulation developed an educational program for clinical staff of a tertiary ped. 
goals included: create a didactic curriculum to teach the principles of crm, incorporate principles of crm into simulation-enhanced team training in-situ and center-based exercises, and utilize assessment instruments to evaluate for teamwork, completion of critical actions, and presence of latent safety threats during in-situ sim resuscitations. results: during phase i, clinicians, divided into teams, participated in -minute pre-training assessments of pals-based in-situ simulations. in phase ii, staff participated in a -hour curriculum reviewing key crm concepts, including team training exercises utilizing simulation and expert debriefing. in phase iii, staff participated in post-training minute teamwork and clinical skills assessments in the ped. in all phases, critical action checklists (cac) were tabulated by simulation educators. in-situ simulations were recorded for later review using the assessment tools. after each simulation, educators facilitated discussion of perceptions of teamwork and identification of systems issues and latent hazards. overall, in-situ simulations were conducted capturing % of the physicians and % of the nurses. cac data were collected by an observer and compared to video recordings. over significant systems issues, latent hazards, and knowledge deficits were identified. all components of the program were rated highly by % of the staff. conclusion: a workgroup of pem, simulation, and team training experts developed a multidisciplinary team training program that used in-situ and centerbased simulation and a refined crm curriculum. unique features of this program include its multidisciplinary focus, the development of a variety of assessment tools, and use of in-situ simulation for evaluation of systems issues and latent hazards. this program was tested in a ped and findings will be used to refine care and develop a sustainment program while addressing issues identified. objectives: our hypothesis is that participants trained on high-fidelity mannequins will perform better than participants trained on low-fidelity mannequins on both the acls written exam and in performance of critical actions during megacode testing. the study was performed in the context of an acls initial provider course for new pgy residents at the penn medicine clinical simulation center and involved three training arms: ) low fidelity (low-fi): torso-rhythm generator; ) mid-fidelity (mid-fi): laerdal simmanÒ turned off; and ) high-fidelity (high-fi): laerdal simmanÒ turned on. training in each arm of the study followed standard aha protocol. educational outcomes were evaluated by written scores on the acls written examination and expert rater reviews of acls megacode videos performed by trainees during the course. a sample of subjects were randomized to one of the three training arms: low-fi (n = ), mid-fi (n = ), or high-fi (n = ). results: statistical significance across the groups was determined using analysis-of-variance (anova). the three groups had similar written pre-test scores [low-fi . ( . ), mid-fi . ( . ), and high-fi . ( . )] and written post-test scores [low-fi . ( . ), mid-fi . ( . ), and high-fi . ( . )]. similarly, test improvement was not significantly different. after completion of the course, high-fi subjects were more likely to report they felt comfortable in their simulator environment (p = . ). low-fi subjects were less likely to perceive a benefit in acls training from high-fi technology (p < . ). 
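the acls training study above compares written test scores across three training arms with a one-way analysis of variance. a minimal sketch with scipy follows; the scores are simulated placeholders, not the study's data.

import numpy as np
from scipy.stats import f_oneway

# hypothetical written post-test scores (percent correct) for the three arms
rng = np.random.default_rng(2)
low_fi = rng.normal(loc=84, scale=6, size=20)
mid_fi = rng.normal(loc=85, scale=6, size=20)
high_fi = rng.normal(loc=86, scale=6, size=20)

f_stat, p_value = f_oneway(low_fi, mid_fi, high_fi)
print(f"one-way anova: f = {f_stat:.2f}, p = {p_value:.3f}")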
acls instructors were not rated significantly differently by the subjects using the debriefing assessment for simulation in healthcare (dash) student version, except for element , where the high-fi group subjects reported lower scores ( . vs . and . in the other groups, p = . ). objectives: we sought to determine if stress associated with the performance of a complex procedural task can be affected by level of medical training. heart rate variability (hrv) is used as a measure of autonomic balance, and therefore an indicator of the level of stress. methods: twenty-one medical students and emergency medicine residents were enrolled. participants performed airway procedures on an airway management trainer. hrv data were collected using a continuous heart rate variability monitoring system. participant hrv was monitored at baseline, during the unassisted first attempt at endotracheal intubation, during supervised practice, and then during a simulated respiratory failure clinical scenario. standard deviation of beat-to-beat variability (sdnn), very low frequency (vlf), total power (tp), and low frequency (lf) were analyzed to determine the effect of practice and level of training on the level of stress. cohen's d was calculated to determine differences between study groups. results: sdnn data showed that second-year residents were less stressed during all stages than were fourth-year medical students (avg d = . ). vlf data showed third-year residents exhibited less sympathetic activity than did first-year residents (avg d = . ). the opportunity to practice resulted in less stress for all participants. tp data showed that residents had a greater degree of control over their autonomic nervous system (ans) than did medical students (avg d = . ). lf data showed that subjects were more engaged in the task at hand as the level of training increased, indicating autonomic balance (avg d = . ). conclusion: our hrv data show that stress associated with the performance of a complex procedural task is reduced by increased training. hrv may provide a quantitative measure of physiologic stress during the learning process and thus serve as a marker of when a subject is adequately trained to perform a particular task. objectives: we seek to examine whether intubation during cpr can be done as efficiently as intubation without ongoing cpr. the hypothesis is that the predictable movement of an automated chest compression device will make intubation easier than the random movement from manual cpr. methods: the project was an experimental controlled trial and took place in the emergency department at a tertiary referral center in peoria, illinois. emergency medicine residents, attendings, paramedics, and other acls-trained staff were eligible for participation. in randomized order, each participant attempted intubation on a mannequin with no cpr ongoing, during cpr with a human compressor, and during cpr with an automatic chest compression device (physio control lucas ). participants could use whichever style of laryngoscope they felt most comfortable with, and they were timed during the three attempts. success was determined after each attempt. results: there were participants in the trial. the success rates in the control group and the automated cpr group were both % ( / ), and the success rate in the manual cpr group was % ( / ). the differences in success rates were not statistically significant (p = . and p = . ). the automated cpr group had the fastest average time ( . sec; p = . ).
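the heart rate variability study above summarizes between-group differences with cohen's d. a minimal pooled-standard-deviation implementation is sketched below; the sdnn values are fabricated for illustration and are not the study's measurements.

import numpy as np

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """cohen's d using the pooled standard deviation of two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = group_a.var(ddof=1), group_b.var(ddof=1)
    pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (group_a.mean() - group_b.mean()) / pooled_sd

# hypothetical sdnn values (ms) during intubation for two training levels
rng = np.random.default_rng(3)
residents = rng.normal(loc=55, scale=12, size=10)
students = rng.normal(loc=42, scale=12, size=10)
print(f"cohen's d = {cohens_d(residents, students):.2f}")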
the mean times for intubation with manual cpr and no cpr were not statistically different ( . sec, . sec; p = . ). conclusion: the success rate of tracheal intubation with ongoing chest compression was the same as the success rate of intubation without cpr. although intubation with automatic chest compression was faster than during other scenarios, all methods were close to the second timeframe recommended by acls. based on these findings, it may not always be necessary to hold cpr to place a definitive airway; however, further studies will be needed. background: after acute myocardial infarction, vascular remodeling in the peri-infarct area is essential to provide adequate perfusion, prevent additional myocyte loss, and aid in the repair process. we have previously shown that endogenous fibroblast growth factor (fgf ) is essential to the recovery of contractile function and limitation of infarct size after cardiac ischemia-reperfusion (ir) injury. the role of fgf in vascular remodeling in this setting is currently unknown. objectives: determine the role of endogenous fgf in vascular remodeling in a clinically relevant, closed-chest model of acute myocardial infarction. methods: mice with a targeted ablation of the fgf gene (fgf knockout) and wild type controls were subjected to a closed-chest model of regional cardiac ir injury. in this model, mice were subjected to minutes of occlusion of the left anterior descending artery followed by reperfusion for either or days. immunofluorescence was performed on multiple histological sections from these hearts to visualize capillaries (endothelium, anti-cd antibody), larger vessels (venules and arterioles, antismooth muscle actin antibody), and nuclei (dapi). digital images were captured, and multiple images from each heart were measured for vessel density and vessel size. results: sham-treated fgf knockout and wild type mice show no differences in capillary or vessel density suggesting no defect in vessel formation in the absence of endogenous fgf . when subjected to closed-chest regional cardiac ir injury, fgf knockout hearts had normal capillary and vessel number and size in the peri-infarct area after day of reperfusion compared to wild type controls. however, after days, fgf knockout hearts showed significantly decreased capillary and vessel number and increased vessel size compared to wild type controls (p < . ). conclusion: these data show the necessity of endogenous fgf in vascular remodeling in the peri-infarct zone in a clinically relevant animal model of acute myocardial infarction. these findings may suggest a potential role for modulation of fgf signaling as a therapeutic intervention to optimize vascular remodeling in the repair process after myocardial infarction. the diagnosis of aortic dissections by ed physicians is rare scott m. alter, barnet eskin, john r. allegra morristown medical center, morristown, nj background: aortic dissection is a rare event. the most common symptom of dissection is chest pain, but chest pain is a frequent emergency department (ed) chief complaint and other diseases that cause chest pain, such as acute coronary syndrome and pulmonary embolism, occur much more frequently. furthermore, % of dissections are without chest pain and % are painless. for all these reasons, diagnosing dissection can be difficult for the ed physician. we wished to quantify the magnitude of this problem in a large ed database. 
objectives: our goal was to determine the number of patients diagnosed by ed physicians with aortic dissections compared to total ed patients and to the total number of patients with a chest pain diagnosis. methods: design: retrospective cohort. setting: suburban, urban, and rural new york and new jersey eds with annual visits between , and , . participants: consecutive patients seen by ed physicians from january , through december , . observations: we identified aortic dissections using icd- codes and chest pain diagnoses by examining all icd- codes used over the period of the study and selecting those with a non-traumatic chest pain diagnosis. we then calculated the number of total ed patients and chest pain patients for every aortic dissection diagnosed by emergency physicians. we determined % confidence intervals (cis). results: from a database of . million ed visits, we identified ( . %) aortic dissections, or one for every , ( % ci , to , ) visits. the mean age of aortic dissection patients was ± years and % were female. of the total visits there were , ( %) with a chest pain diagnosis. thus there is one aortic dissection diagnosis for every ( % ci to , ) chest pain diagnoses. conclusion: the diagnosis of aortic dissections by ed physicians is rare. an ed physician seeing , to , patients a year would diagnose an aortic dissection approximately once every to years. an aortic dissection would be diagnosed once for approximately every , ed chest pain patients. patients were excluded if they suffered a cardiac arrest, were transferred from another hospital, or if the ccl was activated for an inpatient or from ems in the field. fp ccl activation was defined as ) a patient for whom activation was cancelled in the ed and ruled out for mi or ) a patient who went to catheterization but no culprit vessel was identified and mi was excluded. ecgs for fp patients were classified using standard criteria. demographic data, cardiac biomarkers, and all relevant time intervals were collected according to an on-going quality assurance protocol. results: a total of ccl activations were reviewed, with % male, average age , and % black. there were ( %) true stemis and ( %) fp activations. there were no significant differences between the fp patients who did and did not have catheterization. for those fp patients who had a catheterization ( %), ''door to page'' and ''door to lab'' times were significantly longer than the stemi patients (see table) , but there was substantial overlap. there was no difference in sex or age, but fp patients were more likely to be black (p = . ). a total of fp patients had ecgs available for review; findings included anterior elevation with convex ( %) or concave ( %) elevation, st elevation from prior anterior ( %) or inferior ( %) mi, pericarditis ( %), presumed new lbbb ( %), early repolarization ( %), and other ( %). conclusion: false ccl activation occurred in a minority of patients, most of whom had ecg findings warranting emergent catheterization. the rate of false ccl activation appears acceptable. background: atrial fibrillation (af) is the most common cardiac arrhythmia treated in the ed, leading to high rates of hospitalization and resource utilization. dedicated atrial fibrillation clinics offer the possibility of reducing the admission burden for af patients presenting to the ed. while the referral base for these af clinics is growing, it is unclear to what extent these clinics contribute to reducing the number of ed visits and hospitalizations related to af. 
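the dissection study above reports its result as one diagnosis per a given number of visits with a 95% confidence interval. a sketch of that kind of rare-event estimate is shown below, using a clopper-pearson interval as one reasonable choice (the study's exact interval method is not stated); the counts are hypothetical round numbers, not the study's figures.

from statsmodels.stats.proportion import proportion_confint

# hypothetical counts: dissection diagnoses among total ed visits
dissections = 100
total_visits = 1_000_000

rate = dissections / total_visits
ci_low, ci_high = proportion_confint(dissections, total_visits, alpha=0.05,
                                     method="beta")  # clopper-pearson exact interval

# inverting the proportion gives "one dissection per N visits" with its own bounds
print(f"one dissection per {1 / rate:,.0f} visits "
      f"(95% ci {1 / ci_high:,.0f} to {1 / ci_low:,.0f})")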
objectives: to compare the number of ed visits and hospitalizations among discharged ed patients with a primary diagnosis of af who followed up with an af clinic and those who did not. methods: a retrospective cohort study and medical records review including three major tertiary centres in calgary, canada. a sample of patients was taken representing patients referred to the af clinic from the calgary zone eds and compared to matched control ed patients who were referred to other providers for follow-up. the controls were matched for age and sex. inclusion criteria included patients over years of age, discharged during the index visit, and seen by the af clinic between january , and october , . exclusion criteria included non-residents and patients hospitalized during the index visit. the number of cardiovascular-related ed visits and hospitalizations was measured. all data are categorical, and were compared using chi-square tests. results: patients in the control and af clinic cohorts were similar for all baseline characteristics except for a higher proportion of first episode patients in the intervention arm. in the six months following the index ed visit, study group patients ( . %) visited an ed on occasions, and ( %) were hospitalized on occasions. of the control group, patients ( . %) visited an ed on occasions, and ( %) were hospitalized on occasions. using a chi-square test we found no significant difference in ed visits (p = . ) or hospitalizations (p = . ) between the control and af clinic cohorts. conclusion: based on our results, referral from the ed to an af clinic is not associated with a significant reduction in subsequent cardiovascular related ed visits and hospitalizations. due to the possibility of residual confounding, randomized trials should be performed to evaluate the efficacy of af clinics. reported an income of less than $ , . there were no significant associations between sex, race, marital status, education level, income, insurance status, and subsequent -and- day readmission rates. hla score was not found to be significantly related to readmission rates. the mean hla score was . (sd = . ), equivalent to less than th grade literacy, meaning these patients may not be able to read prescription labels. for each unit increase in hfkt score, the odds of being readmitted within days decreased by . (p < . ) and for - days decreased by . (p < . ). for each unit increase in scbs score, the odds of being readmitted within days decreased by . (p = . ). conclusion: health care literacy in our patient population is not associated with readmission, likely related to the low literacy rate of our study population. better hf knowledge and self-care behaviors are associated with lower readmission rates. greater emphasis should be placed on patient education and self-care behaviors regarding hf as a mechanism to decrease readmission rates. comparison of door to balloon times in patients presenting directly or transferred to a regional heart center with stemi jennifer ehlers, adam v. wurstle, luis gruberg, adam j. singer stony brook university, stony brook, ny background: based on the evidence, a door-to-balloon-time (dtbt) of less than minutes is recommended by the aha/acc for patients with stemi. in many regions, patients with stemi are transferred to a regional heart center for percutaneous coronary intervention (pci). objectives: we compared dtbt for patients presenting directly to a regional heart center with those for patients transferred from other regional hospitals. 
we hypothesized that dtbt would be significantly longer for transferred patients. methods: study design-retrospective medical record review. setting-academic ed at a regional heart center with an annual census of , that includes a catchment area of hospitals up to miles away. patients-patients with acute stemi identified on ed -lead ecg. measures-demographic and clinical data including time from triage to ecg, from ecg to activation of regional catheterization lab, and from initial triage to pci (dtbt , and door to intravascular balloon deployment (d b). methods: the study was performed in an inner-city academic ed between / / and / / . every patient for whom ed activation of our stemi system occurred was included. all times data from a pre-existing quality assurance database were collected prospectively. patient language was determined retrospectively by chart review. results: there were patients between / / and / / . patients ( %) were deemed too sick or unable to provide history and were excluded, leaving patients for analysis. ( %) spoke english and ( %) did not. in the non-english group, chinese was the most common language, in ( %) background: syncope is a common, potentially highrisk ed presentation. hospitalization for syncope, although common, is rarely of benefit. no populationbased study has examined disparities in regional admission practices for syncope care in the ed. moreover, there are no population-based studies reporting prognostic factors for -and -day readmission of syncope. objectives: ) to identify factors associated with admission as well as prognostic factors for -and -day readmission to these hospitals; ) to evaluate variability in syncope admission practices across different sizes and types of hospitals. methods: design -multi-center retrospective cohort study using ed administrative data from albertan eds. participants/subjects -patients > years of age with syncope (icd : r ) as a primary or secondary diagnosis from to june . readmission was defined as return visits to the ed or admission < days or - days after the index visit (including against medical advice and left without being seen during the index visit). outcomes -factors associated with hospital admission at index presentation, and readmission following ed discharge, adjusted using multivariable logistic regression. results: overall, syncope visits occurred over years. increased age, increased length of stay (los), performance of cxr, transport by ground ambulance, and treatment at a low-volume hospital (non-teaching or non-large urban) were independently associated with index hospitalization. these same factors, as well as hospital admission itself, were associated with -day readmission. additionally, increased age, increased los, performance of a head ct, treatment at a low-volume hospital, hospital admission, and female sex were independently associated with - day readmission. arrival by ground ambulance was associated with a decreased likelihood of both -and - day readmission. conclusion: our data identify variations in practice as well as factors associated with hospitalization and readmission for syncope. the disparity in admission and readmission rates between centers may highlight a gap in quality of care or reflect inappropriate use of resources. further research to compare patient out-comes and quality of patient care among urban and non-urban centers is needed. background: change in dyspnea severity (ds) is a frequently used outcome measure in trials of acute heart failure (ahf). 
however, there is limited information concerning its validity. objectives: to assess the predictive validity of change in dyspnea severity. methods: this was a secondary analysis of a prospective observational study of a convenience sample of ahf patients presenting with dyspnea to the ed of an academic tertiary referral center with a mixed urban/ suburban catchment area. patients were enrolled weekdays, june through december . patients assessed their ds using a -cm visual analog scale at three times: the start of ed treatment (baseline) as well as at and hours after starting ed treatment. the difference between baseline and hour was the -hour ds change. the difference between baseline and hours was the -hour ds change. two clinical outcome measures were obtained: ) the number of days hospitalized or dead within days of the index visit ( -day outcome), and ) the number of days hospitalized or dead within days of the index visit ( -day outcome). results: data on patients were analyzed. the median -day outcome variable was days with an interquartile range (iqr) of to . the median -day outcome variable was days (iqr to . ). the median -hour ds change was . cm (iqr . to . ). the median -hour ds change was . cm (iqr . to . ). the -day and -day mortality rates were % and % respectively. the spearman rank correlations and % confidence intervals are presented in the table below. conclusion: while the point estimates for the correlations were below . , the % ci for two of the correlations extended above . . these pilot data support change in ds as a valid outcome measure for ahf when measured over hours. a larger prospective study is needed to obtain a more accurate point estimate of the correlations. background: the majority of volume-quality research has focused on surgical outcomes in the inpatient setting; very few studies have examined the effect of emergency department (ed) case volume on patient outcomes. objectives: to determine whether ed case volume of acute heart failure (ahf) is associated with short-term patient outcomes. methods: we analyzed the nationwide emergency department sample (neds) and nationwide inpatient sample (nis), the largest, all-payer, ed and inpatient databases in the us. ed visits for ahf were identified with a principal diagnosis of icd- -cm code .xx. eds were categorized into quartiles by ed case volume of ahf. the outcome measures were early inpatient mortality (within the first days of admission), overall inpatient mortality, and hospital length of stay (los). results: there were an estimated , visits for ahf from approximately , eds in ; % were hospitalized. of these, the overall inpatient mortality rate was . %, and the median hospital los was days. early inpatient mortality was lower in the highest-volume eds, compared with the lowest-volume eds ( . % vs. . %; p < . ). similar patterns were observed for overall inpatient mortality ( . % vs. . %; p < . ). in a multivariable analysis adjusting for patient and hospital characteristics, early inpatient mortality remained lower in patients admitted through the highest-volume eds (adjusted odds ratios [or], . ; % confidence interval [ci], . - . ), as compared with the lowest-volume eds. there was a trend towards lower overall inpatient mortality in the highest-volume eds; however, this was not statistically significant (adjusted or, . ; %ci, . - . ). 
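the dyspnea study above reports spearman rank correlations with 95% confidence intervals between change in dyspnea severity and days hospitalized or dead. a minimal sketch follows; the data are simulated, and the fisher z-transform interval used here is only one common approximation, not necessarily the study's method.

import numpy as np
from scipy.stats import spearmanr

# hypothetical paired observations: change in dyspnea severity (cm on a visual
# analog scale) and days hospitalized or dead within the follow-up window
rng = np.random.default_rng(4)
ds_change = rng.uniform(-1, 8, size=100)
days_outcome = np.clip(6 - 0.4 * ds_change + rng.normal(0, 3, size=100), 0, 30)

rho, p_value = spearmanr(ds_change, days_outcome)

# approximate 95% ci via fisher's z transform
n = len(ds_change)
z = np.arctanh(rho)
se = 1.0 / np.sqrt(n - 3)
ci_low, ci_high = np.tanh([z - 1.96 * se, z + 1.96 * se])

print(f"spearman rho = {rho:.2f} (95% ci {ci_low:.2f} to {ci_high:.2f}), p = {p_value:.3f}")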
By contrast, using the NIS data, which include admissions from various sources, a higher case volume of inpatient AHF patients predicted lower overall inpatient mortality (adjusted OR . ; % CI . - . ). Hospital LOS in patients admitted through the highest-volume EDs was slightly longer (adjusted difference . day; % CI . - . ) compared with the lowest-volume EDs. Conclusion: ED patients who are hospitalized for AHF have an approximately % lower early inpatient mortality if they are admitted from an ED that handles a large volume of AHF cases. The ''practice-makes-perfect'' concept may hold in emergency management of AHF.

Emergency Department Disposition and Charges for Heart Failure: Regional Variability. Alan B. Storrow, Cathy A. Jenkins, Sean P. Collins, Karen P. Miller, Candace McNaughton, Naftilan Allen, Benjamin S. Heavrin; Vanderbilt University, Nashville, TN. Background: High inpatient admission rates for ED patients with acute heart failure are felt to be partially responsible for the large economic burden of this most costly cardiovascular problem. Objectives: We examined regional variability in ED disposition decisions and in total dollars spent on ED services for admitted patients with primary heart failure. Methods: The Nationwide Emergency Department Sample (NEDS) was used to perform a retrospective cohort analysis of patients with heart failure (ICD- code .x) listed as the primary ED diagnosis. Demographics and disposition percentages (with SE) were calculated for the overall sample and by region: Northeast, South, Midwest, and West. To account for the sample design and to obtain national and regional estimates, a weighted analysis was conducted. Results: There were , weighted ED visits with heart failure listed as the primary diagnosis. Overall, over eighty percent were admitted (see table). Fifty-two percent of these patients were female; mean age was . years (SE . ). Hospitalization rates were higher in the Northeast ( . %) and South ( . %) than in the Midwest ( . %) and West ( . %). Total monies spent on ED services were highest in the South ($ , , ), followed by the Northeast ($ , , ), West ($ , , ), and Midwest ($ , , ). Conclusion: This large retrospective ED cohort suggests a very high national admission rate with significant regional variation in both disposition decisions and total monies spent on ED services for patients with a primary diagnosis of heart failure. Examining these estimates and variations further may suggest strategies to reduce the economic burden of heart failure.

Background: Workplace violence in health care settings is a frequent occurrence. Gunfire in hospitals is of particular concern; however, information regarding such workplace violence is limited. Accordingly, we characterized U.S. hospital-based shootings from to . Objectives: To determine the extent of hospital-based shootings in the U.S. and the involvement of emergency departments. Methods: Using LexisNexis, Google, Netscape, PubMed, and ScienceDirect, we searched reports of acute care hospital shooting events from January through December , and those with at least one injured victim were analyzed. Results: We identified hospital-related shootings ( inside the hospital, on hospital grounds) in states, with victims, of whom were perpetrators. In comparison with external shootings, shootings within the hospital have not increased over time (see figure). Perpetrators were from all age groups, including the elderly.
Most of the events involved a determined shooter: grudge ( %), suicide ( %), ''euthanizing'' an ill relative ( %), and prisoner escape ( %). Ambient societal violence ( %) and mentally unstable patients ( %) were comparatively infrequent. The person most commonly injured was the perpetrator ( %). Hospital employees comprised only % of victims; physician ( %) and nurse ( %) victims were relatively infrequent. The emergency department was the most common site ( %), followed by patient rooms ( %) and the parking lot ( %). In % of shootings within hospitals, the weapon was a security officer's gun grabbed by the perpetrator. A ''grudge'' motive was the only factor determinative of hospital staff victims (OR . , % CI . - . ). Conclusion: Although hospital-based shootings are relatively rare, emergency departments are the most likely site. The unpredictable nature of this type of event represents a significant challenge to hospital security and deterrence practices, as most perpetrators proved determined and many hospital shootings occur outside the building.

Impact of Emergency Physician Board Certification on Patient Perceptions of ED Care Quality. Albert G. Sledge IV, Carl A. Germann, Tania D. Strout, John Southall; Maine Medical Center, Portland, ME; Mercy Hospital, Portland, ME. Background: The hospital value-based purchasing program mandated by the Affordable Care Act is the latest example of how patients' perceptions of care will affect the future practice environment of all physicians. The type of training of medical providers in the emergency department (ED) is one possible factor affecting patient perceptions of care. A unique situation in a Maine community ED led to a rapid transition from non-emergency medicine (EM) residency-trained physicians to all EM residency-trained and American Board of Emergency Medicine (ABEM) certified providers. Objectives: The purpose of this study was to evaluate the effect of implementing an all EM-trained, ABEM-certified physician staff on patient perceptions of the quality of care they received in the ED. Methods: We retrospectively evaluated Press Ganey data from surveys returned by patients receiving treatment in a single rural ED. Survey items addressed patients' perceptions of physician courtesy, time spent listening, concern for patient comfort, and informativeness. Additional items evaluated overall perceptions of care and the likelihood that the respondent would recommend the ED to another. Data were compared for the three years before and after implementation of the fully trained, certified staff. We used the independent-samples t-test to compare mean responses during the two time periods; Bonferroni's correction was applied to adjust for multiple comparisons. Results: During the study period, , patients provided surveys for analysis: , during the pre-certification phase and , during the post-certification phase. Across all six survey items, mean responses increased following transition to the board-certified staff. These improvements were statistically significant in each case: courtesy p < . , time listening p < . , concern for comfort p < . , informativeness p < . , overall perception of care p < . , and likelihood to recommend p < . . Conclusion: Data from this community ED suggest that transition from a non-residency-trained, ABEM-certified staff to a fully trained and certified model has important implications for patients' perceptions of the care they receive.
We observed significant improvement in rating scores provided by patients across all physician-oriented and general ED measures.

Background: Transfer of care from the ED to the inpatient floor is a critical transition during which miscommunication places patients at risk, and the optimal form and content of handoff between providers has not been defined. In July , ED-to-floor signout for all admissions to the medicine and cardiology floors was changed at our urban, academic, tertiary care hospital. Previously, signout was an unstructured telephone conversation between the ED resident and admitting housestaff. The new signout utilizes a web-based ED patient tracking system and includes: 1) a templated description of the ED course completed by the ED resident; 2) an automated page sent to the admitting housestaff when a bed is assigned; 3) review of ED clinical information, including imaging, labs, medications, and nursing interventions (figure), by the admitting housestaff; 4) a telephone conversation between the ED resident and housestaff if the housestaff have specific questions about ED care; and 5) electronic acknowledgment and transfer of the patient to the floor if there are no specific questions. Objectives: To describe the effects on patient safety (floor-to-ICU transfer within hours) and ED throughput (ED length of stay (LOS) and time from bed assignment to ED departure) of a change to an electronic, discussion-optional handoff system. Conclusion: Transition to a system in which signout of admitted patients is accomplished by accepting-housestaff review of ED clinical information, supplemented by verbal discussion when needed, resulted in no significant change in the rate of floor-to-ICU transfer or ED LOS, and reduced the time from bed assignment to ED departure.

Background: Emergency physicians may be biased against patients presenting with nonspecific complaints or those requiring more extensive work-ups. This may result in such patients being seen less quickly than those with more straightforward presentations, despite equal triage scores or potential for more dangerous conditions. Objectives: The goal of our study was to ascertain which patients, if any, were seen more quickly in the ED based on chief complaint. Methods: A retrospective report was generated from the EMR for all moderate-acuity (ESI ) adult patients who visited the ED from January through December at a large urban teaching hospital. The most common complaints were: abdominal pain, alcohol intoxication, back pain, chest pain, cough, dyspnea, dizziness, fall, fever, flank pain, headache, infection, pain (nonspecific), psychiatric evaluation, ''sent by MD,'' vaginal bleeding, vomiting, and weakness. Non-parametric independent-sample tests assessed median time to be seen (TTBS) by a physician for each complaint. Differences in TTBS by gender and age were also calculated. Chi-square testing compared percentages of patients in the ED per hour to assess for differences in the distribution of arrival times. Results: We obtained data from , patients. Patients with a chief complaint of weakness and dizziness waited the longest, with a median time of minutes, and patients with flank pain waited the shortest, with minutes (p < . ) (figure). Overall, males waited minutes and females waited minutes (p < . ). Stratifying by gender and age, younger females between the ages of and waited significantly longer when presenting with a chief complaint of abdominal pain (p < . ), chest pain (p < . ), or flank pain (p < . ) as compared to males in the same age group (figure). There was no difference in the distribution of arrival times for these complaints. Conclusion: While the absolute time differences are not large, there is a significant bias toward seeing young male patients more quickly than women or older males, despite the lower likelihood of dangerous conditions. Triage systems should perhaps take age and gender better into account, and patients might benefit from efforts to educate EM physicians about the delays and potential quality issues associated with this bias, in an attempt to move toward more egalitarian patient selection.

Background: Detailed analysis of emergency department (ED) event data identified the time from completion of the emergency physician evaluation (doc done) to the time patients leave the ED as a significant contributor to ED length of stay (LOS) and boarding at our institution. Process flow mapping identified the time from doc done to the time inpatient beds were ordered (BO) as an interval amenable to specific process improvements. Objectives: The purpose of this study was to evaluate the effect of ED holding orders for stable adult inpatient medicine (AIM) patients on: a) the time to BO and b) ED LOS. Methods: A prospective, observational design was used to evaluate the study questions. Data regarding the time to BO and LOS outcomes were collected before and after implementation of the ED holding orders program. The intervention targeted stable AIM patients being admitted to hospitalist, internal medicine, and family medicine services. ED holding orders were placed following the admission discussion with the accepting service, and special attention was paid to proper bed type, completion of the emergent work-up, and the expected immediate course of the patient's hospital stay. Holding orders were of limited duration and expired hours after arrival to the inpatient unit. Results: During the -month study period, patients were eligible for the ED holding orders intervention; ( . %) were cared for using the standard adult medicine order set and ( . %) received the intervention. The median time from doc done to BO was significantly shorter for patients in the ED holding orders group, min (IQR , ) vs. min (IQR , ) for the standard adult medicine group, p < . . Similarly, the median ED LOS was significantly shorter for those in the ED holding orders group, min (IQR , ) vs. min (IQR , ) for the standard adult medicine group, p < . . No lapses in patient care were reported in the intervention group. Conclusion: In this cohort of ED patients being admitted to an AIM service, placing ED holding orders rather than waiting for a traditional inpatient team evaluation and set of admission orders significantly reduced the time from completion of the ED workup to placement of a BO. As a result, ED LOS was also significantly shortened. While overall utilization of the intervention was low, it improved with each month.

Emergency Department Interruptions in the Age of Electronic Health Records. Matthew Albrecht, John Shabosky, Jonathan de la Cruz; Southern Illinois University School of Medicine, Springfield, IL. Background: Interruptions of clinical care in the emergency department (ED) have been correlated with increased medical errors and decreased patient satisfaction. Studies have also shown that most interruptions happen during physician documentation.
With the advent of the electronic health record and computerized documentation, ED physicians now spend much of their clinical time in front of computers and are more susceptible to interruptions. Voice recognition dictation adjuncts to computerized charting boast increased provider efficiency; however, little is known about how the data-input method of computerized documentation affects physician interruptions. Objectives: We present observational interruptions data comparing two separate ED sites, one that uses computerized charting by conventional techniques and one assisted by voice recognition dictation technology. Methods: A prospective observational quality initiative was conducted at two teaching hospital EDs located less than a mile from each other. One site primarily uses conventional computerized charting while the other uses voice recognition dictation computerized charting. Four trained observers followed ED physicians for minutes during shifts. The tasks each ED physician performed were noted and logged in -second intervals. Tasks were selected from a predetermined standardized list presented at observer training and were noted as either completed or placed in queue after a change in task occurred. A total of minutes were logged. Interruptions were noted when a change in task occurred with the previous task being placed in queue. Data were then compared between sites. Results: ED physicians averaged . interruptions/hour with conventional computerized charting compared with . interruptions/hour with voice-recognition-assisted dictation (p = . ). Conclusion: Computerized charting assisted with voice recognition dictation significantly decreased total interruptions per hour when compared with conventional techniques. Charting with voice recognition dictation has the potential to decrease interruptions in the ED, allowing for more efficient workflow and improved patient care.

Background: Using robot assistants in health care is an emerging strategy to improve efficiency and quality of care while optimizing the use of human work hours. Robot prototypes capable of performing vital signs and assisting with ED triage are under development. However, ED users' attitudes toward robot assistants are not well studied. Understanding these attitudes is essential to design user-friendly robots and to prepare EDs for the implementation of robot assistants. Objectives: To evaluate the attitudes of ED patients and their accompanying family and friends toward the potential use of robot assistants in the ED. Methods: We surveyed a convenience sample of adult ED patients and their accompanying adult family members and friends at a single, university-affiliated ED, / / - / / . The survey consisted of eight items from the Negative Attitudes towards Robots Scale (Nomura et al.) modified to address robot use in the ED. Response options used a -point Likert scale. A summary score was calculated by summing the responses for all items, with a potential range of (completely negative attitude) to (completely positive attitude). Research assistants gave the written surveys to subjects during their ED visit. Internal consistency was assessed using Cronbach's alpha. Bivariate analyses were performed to evaluate the association between the summary score and the following variables: participant type (patient or visitor), sex, race, time of day, and day of week (a sketch of this scoring approach appears below).
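As a hedged illustration of the survey scoring just described, the snippet below shows how a Likert summary score and Cronbach's alpha could be computed. The item responses, item names, and sample size are simulated placeholders, not the study's data.

```python
# Minimal sketch (not the authors' code): scoring a Likert survey and
# checking internal consistency with Cronbach's alpha.
# `items` is a hypothetical table with one row per respondent and one column
# per survey item (q1..q8), already reverse-coded where needed.
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 8 items on a 5-point Likert scale for 120 respondents.
rng = np.random.default_rng(0)
items = pd.DataFrame(rng.integers(1, 6, size=(120, 8)),
                     columns=[f"q{i}" for i in range(1, 9)])

summary_score = items.sum(axis=1)          # one summary score per respondent
print("mean summary score:", summary_score.mean())
print("Cronbach's alpha:", round(cronbach_alpha(items), 3))
```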
Results: Of the potential subjects approached, ( %) completed the survey. Participants were % patients, % family members or friends, % women, and % white, with a median age of . years (IQR - ). Cronbach's alpha was . . The mean summary score was . (SD = . ), indicating subjects were between ''occasionally'' and ''sometimes'' comfortable with the idea of ED robot assistants (see table). Men were more positive toward robot use than women (summary score . vs . ; p = . ). No differences in the summary score were detected based on participant type, race, time of day, or day of week. Conclusion: ED users reported significant apprehension about the potential use of robot assistants in the ED. Future research is needed to explore how robot designs and implementation strategies can help alleviate this apprehension.

Background: Emergency department cardioversion (EDC) of patients with recent-onset atrial fibrillation or flutter (AF) is an increasingly common management approach to this arrhythmia. Patients who qualify for EDC generally have few co-morbidities and are often discharged directly from the ED. This results in a shift toward a sicker population of patients admitted to the hospital with this diagnosis. Objectives: To determine whether hospital charges and length of stay (LOS) profiles are affected by emergency department discharge of AF patients. Methods: Patients receiving treatment at an urban teaching community hospital with a primary diagnosis of atrial fibrillation or flutter were identified through the hospital's billing database. Information collected on each patient included date of service, patient status, length of stay, and total charges. Patient status was categorized as inpatient (admitted to the hospital), observation (transferred from the ED to an inpatient bed but placed in observation status), or ED (discharged directly from the ED). The hospital billing system automatically defaults to a length of stay of for observation patients; ED patients were assigned a length of stay of . Total hospital charges and mean LOS were determined for two different models: a standard model (SM), in which patients discharged from the ED were excluded from hospital statistics, and an inclusive model (IM), in which discharged ED patients were included in the hospital statistics. Statistical analysis was by ANOVA. Results: A total of patients were evaluated for AF over an -month period. Of these, ( %) were admitted, ( %) were placed in observation status, and ( %) were discharged from the ED. Hospital charges and LOS in days are summarized in the table. All differences were statistically significant (p < . ). Conclusion: Emergency department management can lead to a population of AF patients discharged directly from the ED. Exclusion of these patients from hospital statistics skews performance profiles, effectively punishing institutions for progressive care.

Background: Recent health care reform has placed an emphasis on the electronic health record (EHR). With the advent of the EHR, it is common to see ED providers spending more time in front of computers documenting and away from patients. Finding strategies to decrease provider interaction with computers and increase time with patients may lead to improved patient outcomes and satisfaction. Computerized charting adjuncts, such as voice recognition software, have been marketed as ways to improve provider efficiency and patient contact.
Objectives: We present observational data comparing two separate ED sites, one where computerized charting is done by conventional techniques and one where it is assisted by voice recognition dictation, and their effects on physician charting and patient contact. Methods: A prospective observational quality initiative was conducted at two teaching hospitals located less than a mile from each other. One site primarily uses conventional computerized charting while the other uses voice recognition dictation. Four trained quality assistants observed ED physicians for minutes during shifts. The tasks each physician performed were noted and logged in -second intervals. Tasks were identified from a predetermined standardized list presented at observer training. A total of minutes were logged. Time allocated to charting and to direct patient care was then compared between sites. Results: ED physicians spent . % of their time charting using conventional techniques vs. . % using voice recognition dictation (p = . ). Time allocated to direct patient care was . % with conventional charting vs. . % using dictation (p = ). In total, ED physicians using conventional charting techniques spent / minutes charting, while ED physicians using voice recognition dictation spent / minutes dictating and an additional . / minutes reviewing or correcting their dictations. The use of voice-recognition-assisted dictation rather than conventional techniques did not significantly change the amount of time physicians spent charting or in direct patient care. Although voice recognition dictation decreased the initial input time of documenting data, a considerable amount of time was required to review and correct these dictations.

Objectives: For our primary objective, we studied whether emergency department triage temperatures detected fever adequately when compared with a rectal temperature. As secondary objectives, we examined the temperature differences when a rectal temperature was taken within an hour of the non-invasive temperature, the differences by temperature site (oral, axillary, temporal), and the patients who were initially afebrile but were found to be febrile by rectal temperature. Methods: We performed an electronic chart review at our inner-city, academic emergency department with an annual census of , patients. We identified all patients over the age of who received a non-invasive triage temperature and a subsequent rectal temperature while in the ED from January through February . Specific data elements included many aspects of the patient's medical record (e.g., subject demographics, temperature, and source). We analyzed our data with standard descriptive statistics, t-tests for continuous variables, and Pearson chi-square tests for proportions. Results: A total of , patients met our inclusion criteria. The mean difference between the initial temperature and the rectal temperature was . °F, with . % having rectal temperatures higher by ≥ °F and . % having rectal temperatures higher by ≥ °F. The mean temperature difference among the , patients who had an initial non-invasive temperature and a rectal temperature within one hour was . °F. The mean difference among patients who received oral, axillary, and temporal temperatures was . °F, . °F, and . °F, respectively. Approximately one in five patients ( . %) were initially afebrile and found to be febrile by rectal temperature, with an average temperature difference of . °F.
These patients had a higher rate of admission and were more likely to be admitted to the intensive care unit. Conclusion: There are significant differences between rectal temperatures and non-invasive triage temperatures in this emergency department cohort. In almost one in five patients, fever was missed by the triage temperature.

Background: Pediatric emergency department (PED) overcrowding has become a national crisis and has resulted in delays in treatment and patients leaving without being seen. Increased wait times have also been associated with decreased patient satisfaction. Optimizing PED throughput is one means of handling the increased demand for services, and various strategies have been proposed to increase efficiency and reduce length of stay (LOS). Objectives: To measure the effect of direct bedding, bedside registration, and patient pooling on PED wait times, length of stay, and patient satisfaction. Methods: Data were extracted from a computerized ED tracking system in an urban tertiary care PED. Comparisons were made between metrics for ( , patients) and the months following the process change ( , patients). During , patients were triaged by one or two nurses, registered, and then sent either to a -bed PED or to a physically separate -bed fast-track unit, where they were seen by a physician. Following the process change, patients were brought directly to a bed in the -bed PED, triaged and registered, and then seen by a physician; the fast-track unit was used only to accommodate patient surges. Results: Anticipating improved efficiencies, attending physician coverage was decreased by %. After instituting the process changes, improvements were noted immediately. Although daily patient volume increased by %, the median time to be seen by a physician decreased by %. Additionally, the median LOS for discharged patients decreased by %, and the median time until the decision-to-admit decreased by %. Press Ganey satisfaction scores during this time increased by more than mean score points, which was reported to be a statistically significant increase. Conclusion: Direct bedding, bedside registration, and patient pooling were simple-to-implement process changes. These changes resulted in more efficient PED throughput, as evidenced by decreased time to be seen by a physician, LOS for discharged patients, and time until decision-to-admit. Additionally, patient satisfaction scores improved despite decreased attending physician coverage and a % decrease in room utilization.

During period , the OU was managed by the internal medicine department and staffed by primary care physicians and physician assistants. During periods and , the OU was managed and staffed by EM physicians. Data collected included OU patient volume, length of stay (LOS) for discharged and admitted patients, admission rates, and -day readmission rates for discharged patients. Cost data collected included direct, indirect, and total cost per patient encounter. Data were compared using chi-square and ANOVA analyses, followed by multiple pairwise comparisons using the Bonferroni method of p-value adjustment (a sketch of this comparison approach appears below). Results: See table. The OU patient volume and percent of ED volume were greater in period compared with periods and . Length of stay, admission rates, -day readmission rates, and costs were greater in period compared with periods and . Conclusion: EM physicians provide more cost-effective care for patients in this large OU compared with non-EM physicians, resulting in shorter LOS for admitted and discharged patients, greater rates of patients discharged, and lower -day readmission rates for discharged patients. This was not affected by an increase in OU volume and shows a trend toward improvement.
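As an illustrative aside on the comparison strategy just described, the sketch below shows an overall test across three staffing periods followed by Bonferroni-adjusted pairwise comparisons. The LOS values, group sizes, and period labels are simulated placeholders, not the observation unit's data.

```python
# Minimal sketch (not the authors' analysis): one-way ANOVA across three
# observation-unit staffing periods, then pairwise t-tests with Bonferroni
# adjustment. All values below are simulated placeholders.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
periods = {
    "period1": rng.normal(30, 8, 200),   # hypothetical LOS (hours) under non-EM staffing
    "period2": rng.normal(26, 8, 200),   # hypothetical LOS under EM staffing
    "period3": rng.normal(25, 8, 200),
}

# Overall test across the three periods.
f_stat, p_overall = stats.f_oneway(*periods.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_overall:.4f}")

# Pairwise t-tests, multiplying each p-value by the number of comparisons (Bonferroni).
pairs = list(combinations(periods, 2))
for a, b in pairs:
    t, p = stats.ttest_ind(periods[a], periods[b])
    p_adj = min(p * len(pairs), 1.0)
    print(f"{a} vs {b}: t={t:.2f}, Bonferroni-adjusted p={p_adj:.4f}")
```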
Background: Emergency department (ED) crowding continues to be a problem, and new intake models may represent part of the solution. However, little data exist on the sustainability and long-term effects of physician triage and screening on standard ED performance metrics, as most studies are short-term. Objectives: We examined the hypothesis that a physician screening program (START) sustainably improves standard ED performance metrics, including patient length of stay (LOS) and patients who left without completing assessment (LWCA). We also investigated the number of patients treated and dispositioned by START without using a monitored bed, and the median patient door-to-room time. Methods: Design and setting: this study is a retrospective before-and-after analysis of START in a Level I tertiary care urban academic medical center with approximately , annual patient visits. All adult patients from December until November are included, though only a subset was seen in START; START began at our institution in December . Observations: our outcome measures were length of stay for ED patients, LWCA rates, patients treated and dispositioned by START without using a monitored bed, and door-to-room time. Statistics: simple descriptive statistics were used; p-values for LOS were calculated with the Wilcoxon test, and the p-value for LWCA was calculated with chi-square. Results: The table shows that median length of stay for ED patients was reduced by minutes/patient (p < . ) when comparing the most recent year with the year before START. Patients who LWCA were reduced from . % to . % (p < . ) during the same time period. We also found that in the first half-year of START, % of patients screened in the ED were treated and dispositioned without using a monitored bed, and by the end of year this number had grown to %. Median door-to-room time decreased from . minutes to . minutes over the same period. Conclusion: A START system can provide sustained improvements in ED performance metrics, including a significant reduction in ED LOS, LWCA rate, and door-to-room time. Additionally, START can decrease the need for monitored ED beds and thus increase ED capacity.

Labs were obtained in %, CT in %, US in %, and consultation in %; % of the cohort was admitted to the hospital. The most commonly utilized source of translation was a layperson ( %). A professional translator was used in % and a translation service (Language Line, Marty) in %. The examiner was fluent in the patient's language in %, and both the patient and examiner were able to maintain basic communication in %. There were patients in the professional/fluent translation group and patients in the lay translation group. There was no difference in ED LOS between groups ( vs. min; p = . ). There was no difference in the frequency of lab tests, computed tomography, ultrasound, consultations, or hospital admission, and frequencies did not differ by sex or age. Conclusion: Translation method was not associated with a difference in overall ED LOS, ancillary test use, or specialist consultation in Spanish-speaking patients presenting to the ED with abdominal pain.
Emergency Department Patients on Warfarin: How Often Is the Visit Due to the Medication? Jim Killeen, Edward Castillo, Theodore Chan, Gary Vilke; UCSD Medical Center, San Diego, CA. Background: Warfarin has important therapeutic value for many patients, but it has been associated with significant bleeding complications, hypersensitivity reactions, and drug-drug interactions, which can result in patients seeking care in the emergency department (ED). Objectives: To determine how often ED patients on warfarin present for care as a result of the medication itself. Methods: A multi-center prospective survey study in two academic EDs over months. Patients who presented to the ED taking warfarin were identified, and ED providers were prospectively queried at the time of disposition regarding whether the visit was the result of a complication or side effect associated with warfarin. Data were also collected on patient demographics, chief complaint, triage acuity, vital signs, disposition, ED evaluation time, and length of stay (LOS). Patients identified with a warfarin-related cause for their ED visit were compared with those who were not. Statistical analysis was performed using descriptive statistics. Results: During the study period, , patients were cared for by ED staff, of whom were identified as taking warfarin as part of their medication regimen. Of these, providers identified . % ( patients) who presented with a warfarin-related complication as their primary reason for the ED visit. . % ( ).

Each hours of daily boarding is associated with a drop of . raw score points in both PG metrics. These seemingly small drops in raw scores translate into major changes in rankings on Press Ganey national percentile scales (a difference of as much as percentile points). Our institution commonly has hundreds of hours of daily boarding. It is possible that patient-level measurements of boarding impact would show stronger correlation with individual satisfaction scores than the daily aggregate measures we describe here. Our research suggests that reducing the burden of boarding on EDs will improve patient satisfaction.

Background: Prolonged emergency department (ED) boarding is a key contributor to ED crowding. The effect of output interventions (moving boarders out of the ED into an intermediate area prior to admission, or adding capacity to an observation unit) has not been well studied. Objectives: We studied the effect of a combined observation-transition (OT) unit, consisting of observation beds and an interim holding area for boarding ED patients, on the length of stay (LOS) for admitted patients, as well as secondary outcomes such as LOS for discharged patients and left-without-being-seen rates. Methods: We conducted a retrospective review ( months pre-, months post-design) of an OT unit at an urban teaching ED with , annual visits (study ED). We compared outcomes with a nearby community-based ED with , annual visits in the same health system (control ED), where no capacity interventions were performed. The OT had beds, full monitoring capacity, and was staffed hours per day. The number of beds allocated to transition and observation patients fluctuated throughout the intervention, based on patient demand. All analyses were conducted at the level of the ED-day. Wilcoxon rank-sum and analysis of covariance tests were used for comparisons; continuous variables were summarized with medians (an illustrative sketch of this type of ED-day comparison follows below).
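As a hedged illustration of the ED-day-level comparison just described, the snippet below applies a Wilcoxon rank-sum (Mann-Whitney U) test to daily median LOS before and after an OT unit opens. The daily values and sample sizes are simulated placeholders, not the study's data.

```python
# Minimal sketch (not the authors' analysis): comparing daily median LOS of
# admitted patients before vs. after an OT unit opens, at the ED-day level,
# using a Wilcoxon rank-sum (Mann-Whitney U) test. Data are simulated placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# One value per ED-day: that day's median LOS for admitted patients, in hours (hypothetical).
pre_daily_median_los = rng.gamma(shape=9.0, scale=1.0, size=180)    # ~9 h/day pre-intervention
post_daily_median_los = rng.gamma(shape=8.5, scale=1.0, size=180)   # ~8.5 h/day post-intervention

u_stat, p_value = stats.mannwhitneyu(pre_daily_median_los,
                                     post_daily_median_los,
                                     alternative="two-sided")

print(f"median pre:  {np.median(pre_daily_median_los):.2f} h")
print(f"median post: {np.median(post_daily_median_los):.2f} h")
print(f"Mann-Whitney U={u_stat:.0f}, p={p_value:.4f}")
```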
Results: In unadjusted analyses, the median daily LOS of admitted patients at the study ED was minutes lower in the months after the OT opened, . to . hours (p < . ). Control-site daily LOS for admitted patients increased minutes, from . to . hours (p < . ). Results were similar after adjusting for other covariates (day of week, ED volume, and triage level). LOS of discharged patients at the study ED decreased by minutes, from . hours to . hours (p < . ), while the control ED saw no significant change in discharged-patient LOS ( . hours to . hours, p = . ). Left-without-being-seen rates did not decrease at either site. Conclusion: Opening an OT unit was associated with a -minute reduction in average daily ED LOS for admitted and discharged patients in the study ED. Given the large expense of opening an OT, future studies should compare capacity-dependent (e.g., OT) vs. capacity-independent (e.g., organizational) interventions to reduce ED crowding.

Fran Balamuth, Katie Hayes, Cynthia Mollen, Monika Goyal; Children's Hospital of Philadelphia, Philadelphia, PA. Background: Lower abdominal pain and genitourinary problems are common chief complaints in adolescent females presenting to emergency departments. Pelvic inflammatory disease (PID) is a potentially severe complication of lower genital tract infections, involving inflammation of the female upper genital tract secondary to ascending STIs. PID has been associated with severe sequelae, including infertility, ectopic pregnancy, and chronic pelvic pain. We describe the prevalence and microbial patterns of PID in a cohort of adolescent females presenting to an urban emergency department with abdominal or genitourinary complaints. Objectives: To describe the prevalence and microbial patterns of PID in a cohort of adolescent patients presenting to an ED with lower abdominal or genitourinary complaints. Methods: This is a secondary analysis of a prospective study of females ages - years presenting to a pediatric ED with lower abdominal or genitourinary complaints. Diagnosis of PID was per CDC guidelines. Patients underwent Chlamydia trachomatis (CT) and Neisseria gonorrhoeae (GC) testing via the urine Aptima Combo assay and Trichomonas vaginalis (TV) testing using the vaginal OSOM Trichomonas rapid test. Descriptive statistics were performed using Stata . . Results: The prevalence of PID in this cohort of patients was . % ( % CI . %, . %), and . % ( % CI . %, . %) of those with PID had positive sexually transmitted infection (STI) testing: % ( % CI . %, . %) with CT, . % ( % CI . , . %) with GC, and . % ( % CI . %, . %) with TV. . % ( % CI . , . %) of patients diagnosed with PID received antibiotics consistent with CDC recommendations. Patients with lower abdominal pain as their chief complaint were more likely to have PID than patients with genitourinary complaints (OR . , % CI . , . ). Conclusion: A substantial number of adolescent females presenting to the emergency department with lower abdominal pain were diagnosed with PID, with microbial patterns similar to those previously reported in largely adult, outpatient samples. Furthermore, appropriate treatment for PID was observed in the majority of patients diagnosed with PID.

Impact. Background: In resource-poor settings, maternal health care facilities are often underutilized, contributing to high maternal mortality. The effect of ultrasound in these settings on patients, health care providers, and communities is poorly understood.
Objectives: The purpose of this study was to assess the effect of introducing maternal ultrasound in a population not previously exposed to this intervention. Methods: An NGO-led program trained nurses at four remote clinics outside Koutiala, Mali, who performed , maternal ultrasound scans over three years. Our researchers conducted an independent assessment of this program, which involved logbook review, sonographer skill assessment, referral follow-up, semi-structured interviews of clinic staff and patients, and focus groups of community members in surrounding villages. Analyses included the effect of ultrasound on clinic function, job satisfaction, community utilization of prenatal care and maternity services, alterations in clinical decision making, sonographer skill, and referral frequency. We used QSR NVivo to organize qualitative findings, code data, and identify emergent themes, and GraphPad software (La Jolla, CA) and Microsoft Excel to tabulate quantitative findings. Results:
- Findings that triggered changes in clinical practice were noted in . % of ultrasounds, with a . % referral rate to comprehensive maternity care facilities.
- Skill retention and job satisfaction for ultrasound providers were high.
- The number of patients coming for antenatal care increased after the introduction of ultrasound, in an area where the birth rate has been decreasing.
- Over time, women traveled from farther distances to access ultrasound and participate in antenatal care.
- Acceptance among staff, patients, and community members was very high.
- Ultrasound was perceived as most useful for finding fetal position, sex, due date, and well-being.
- Confidence in diagnosis and treatment plans improved for all cohorts.
- Compliance with referral recommendations improved.
- There was no evidence of gender-selection motivation for ultrasound use.
Conclusion: Use of maternal ultrasound in rural and resource-limited settings draws women to an initial antenatal care visit, increases referral, and improves job satisfaction among health care workers.

Methods: A retrospective database analysis was conducted using the electronic medical record of a single, large academic hospital. ED patients who received a billing diagnosis of ''nausea and vomiting of pregnancy'' or ''hyperemesis gravidarum'' between / / and / / were selected. A manual chart review was conducted, with demographic and treatment variables collected. Statistical significance was determined using multiple regression analysis for a primary outcome of return visit to the emergency department for nausea and vomiting of pregnancy. Results: Patients were identified; the mean age was . years (SD ± . ), mean gravidity . (SD ± . ), and mean gestational age . weeks (SD ± . ). The average length of ED evaluation was min (SD ± ). Of the patients, ( . %) had a return ED visit for nausea and vomiting of pregnancy, ( %) were admitted to the hospital, and ( %) were admitted to the ED observation protocol. Multiple regression analysis showed that the presence of medical co-morbidity (p = . ), patient gravidity (p = . ), gestational age (p = . ), and admission to the hospital (p = . ) had small but significant effects on the primary outcome (return visits to the emergency department). No other variables were predictive of return visits to the ED, including admission to the ED observation unit or factors classically thought to be associated with severe forms of nausea and vomiting in pregnancy, such as ketonuria, electrolyte abnormalities, or vital sign abnormalities.
Conclusion: Nausea and vomiting in pregnancy has a high rate of return ED visits, which can be predicted by young patient age, low patient gravidity, early gestational age, and the presence of other comorbidities. These patients may benefit from obstetric consultation and/or optimization of symptom management after discharge in order to prevent recurrent utilization of the ED.

Prevalence. Conclusion: There is a high prevalence of HT in adult SA victims. Although our study design and data do not allow us to make any inferences regarding causation, this first report of HT ED prevalence suggests the opportunity to clarify this relationship and a potential opportunity to intervene.

Background: Sexually transmitted infections (STIs) are a significant public health problem. Because of the risks associated with STIs, including PID, ectopic pregnancy, and infertility, the CDC recommends aggressive treatment with antibiotics in any patient with a suspected STI. Objectives: To determine the rates of positive gonorrhea and chlamydia (G/C) screening and the rates of empiric antibiotic use among patients of an urban academic ED with > , visits in Boston, MA. Methods: A retrospective study of all patients who had G/C cultures in the ED over months. Chi-square was used in the data analysis; sensitivity and specificity were also calculated (an illustrative sketch of this calculation follows below). Results: A positive rate of / ( . %) was seen for gonorrhea and / ( . %) for chlamydia. Females had positive rates of / ( . %) and / ( . %), respectively. Males had higher rates of / ( . %) (p < . ) and / ( . %) (p = . ). Patients with G/C sent who received an alternative diagnosis, most commonly UTI ( ), ovarian pathology ( ), vaginal bleeding ( ), and vaginal candidiasis ( ), were excluded. This left patients without a definitive diagnosis. Of these, . % ( / ) of females were treated empirically with antibiotics for G/C, and a greater percentage of males ( %, / ) were treated empirically (p < . ). Of those empirically treated, / ( . %) had negative cultures, while / ( . %) who ultimately had positive cultures were not treated with antibiotics during their ED stay. The sensitivity of the provider decision to give empiric antibiotics for predicting the presence of disease was . (CI . - . ); specificity was . (CI . - . ). Conclusion: Most patients screened in our ED for G/C did not have positive cultures, and . % of those treated empirically were found not to have G/C. While early treatment is important to prevent complications, there are risks associated with antibiotic use, such as allergic reaction, C. difficile infection, and development of antibiotic resistance. Our results suggest that at our institution we may be over-treating for G/C. Furthermore, despite high rates of treatment, % of patients who ultimately had positive cultures did not receive antibiotics during their ED stay. Further research into predictive factors, or development of a clinical decision rule, may be useful to help determine which patients are best treated empirically with antibiotics for presumed G/C.
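As an illustrative aside on the sensitivity/specificity calculation mentioned in the methods above, the sketch below treats the provider's decision to give empiric antibiotics as a "test" for culture-confirmed G/C. The 2x2 counts are made-up placeholders, not the study's numbers.

```python
# Minimal sketch (not the authors' analysis): sensitivity and specificity of the
# empiric-treatment decision against culture results, with Wilson 95% CIs.
# The 2x2 counts below are hypothetical placeholders.
from statsmodels.stats.proportion import proportion_confint

treated_pos, treated_neg = 40, 160      # empirically treated: culture +, culture -
untreated_pos, untreated_neg = 25, 400  # not treated:         culture +, culture -

sensitivity = treated_pos / (treated_pos + untreated_pos)
specificity = untreated_neg / (untreated_neg + treated_neg)

sens_ci = proportion_confint(treated_pos, treated_pos + untreated_pos, method="wilson")
spec_ci = proportion_confint(untreated_neg, untreated_neg + treated_neg, method="wilson")

print(f"sensitivity: {sensitivity:.2f} (95% CI {sens_ci[0]:.2f}-{sens_ci[1]:.2f})")
print(f"specificity: {specificity:.2f} (95% CI {spec_ci[0]:.2f}-{spec_ci[1]:.2f})")
```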
Background: Air travel may be associated with unmeasured neurophysiological changes in an injured brain that may affect post-concussion recovery. No study has compared the effect of commercial air travel on concussion injuries, despite the rather obvious effects of decreased oxygen tension and increased dehydration on acute mTBI. Objectives: To determine whether air travel within - hours of concussion is associated with increased recovery time in professional football and hockey players. Methods: Prospective cohort study of all active-roster National Football League and National Hockey League players during the - seasons. Internet review of league sites was used to identify concussive injuries and when players returned to play solely for mTBI. Team schedules and flight times were also confirmed to include only players who flew immediately following the game (within - hr). Multiple injuries were excluded, as were players whose injury occurred around the All-Star break for the NHL or the scheduled off week in the NFL. Results: During the - NFL and NHL seasons, ( . %) and ( . %) of players experienced a concussion in the respective leagues. Of these, NFL players ( %) and NHL players ( %) flew within hours of the incident injury. The mean distance flown was shorter for the NFL ( miles, SD , vs. NHL , SD miles), and all flights were in a pressurized cabin. The mean number of games missed by NFL and NHL players who traveled by air immediately after concussion was increased by % and %, respectively, compared with those who did not travel by air (NFL: . (SD . ) vs. . games (SD . ); NHL: . games (SD . ) vs. . (SD . ); p < . ). Conclusion: This is an initial report of prolonged recovery, in terms of more games missed, for professional athletes flying on commercial airlines post-mTBI compared with those who do not subject their recently injured brains to pressurized air flight. The decreased oxygen tension at an altitude equivalent of , feet, decreased humidity with increased dehydration, and the duress of travel accompanying pressurized airline cabins all likely increase the concussion penumbra in acute mTBI. Early air travel post-concussion should be further evaluated and likely postponed - hr, until initial symptoms subside.

Background: Previous studies have shown better in-hospital stroke time targets for those who arrive by ambulance compared with other modes of transport. However, regional studies report that less than half of stroke patients arrive by ambulance. Objectives: To describe the proportion of stroke patients who arrive by ambulance nationwide, and to examine regional differences and factors associated with the mode of transport to the emergency department (ED). Methods: This is a cross-sectional study of all patients with a primary discharge diagnosis of stroke, based on previously validated ICD- codes, abstracted from the National Hospital Ambulatory Medical Care Survey for - . We excluded subjects < years of age and those with missing data. Study-related survey variables included patient demographics, community characteristics, mode of transport to the hospital, and hospital characteristics. Results: Patients met inclusion criteria, representing , , patient records nationally. Of these, . % arrived by ambulance. After adjustment for potential confounders, patients residing in the West and South had lower odds of arriving by ambulance for stroke compared with the Northeast (Southern region OR . , % CI . - . ; Western region OR . , % CI . - . ; Midwest region OR . , % CI . - . ). Compared with the Medicare population, the privately insured and self-insured had lower odds of arriving by ambulance (OR for private insurance . , % CI . - . ; OR for self-payers . , % CI . - . ). Age, sex, race, urban or rural ED location, and safety-net status were not independently associated with ambulance use. Conclusion: Patients with stroke arrive by ambulance more frequently in the Northeast than in other regions of the US.
Identifying reasons for this regional difference may be useful in improving ambulance utilization and overall stroke care nationwide.

Objectives: We sought to determine whether there was a difference in the type of stroke presentation based upon race, and whether there is an increase in hemorrhagic strokes among Asian patients with limited English proficiency. Methods: We performed a retrospective chart review of all stroke patients age and older over year who were diagnosed with cerebral vascular accident (CVA) or intracranial hemorrhage (ICH). We collected data on patient demographics and past medical history, then stratified patients according to race (white, black, Latino, Asian, and other). We classified strokes as ischemic, intracranial hemorrhage (ICH), subarachnoid hemorrhage (SAH), subdural hemorrhage (SDH), and other (e.g., bleeding into metastatic lesions). We used only the index visit. We present the data as percentages, medians, and interquartile ranges (IQR). We tested the association of the outcome of intracranial hemorrhage against demographic and clinical variables using chi-square and Kruskal-Wallis tests, and we performed a logistic regression model to determine factors related to presentation with an intracranial hemorrhage (ICH).

Background: The practice of obtaining laboratory studies and a routine CT scan of the brain on every child with a seizure has been called into question for the patient who is alert, interactive, and back to functional baseline. There is still no standard practice for the management of non-febrile seizure patients in the pediatric emergency department (PED). Objectives: We sought to determine the proportion of children presenting to the PED with a first or recurrent non-febrile seizure in whom clinically significant laboratory studies and CT scans of the brain were obtained. We hypothesized that the majority of these children do not have clinically significant laboratory or imaging studies, and that when clinically significant values were found, the history given would have warranted further laboratory and imaging assessment apart from the seizure alone. Methods: We performed a retrospective chart review of patients with first-time or recurrent non-febrile seizures at an urban, academic PED between July and June . Exclusion criteria included children who presented to the PED with a fever and age less than months. We looked at specific values that included a complete blood count, basic metabolic panel, liver function tests, an antiepileptic drug level if the child was on antiepileptics for a known seizure disorder, and CT scan. Abnormal laboratory and CT scan findings were classified as clinically significant or not. Results: The median age of our study population was years, with a male-to-female ratio of . . % of patients had a generalized tonic-clonic seizure. Laboratory studies and CT scans were obtained in % and % of patients, respectively. Five patients had clinically significant abnormal labs; however, one had ESRD, one developed urosepsis, one had eclampsia, and two others had hyponatremia secondary to diluted formula and Trileptal toxicity. Three children had an abnormal head CT: two had a VP shunt and one had a chromosomal abnormality with developmental delay. Conclusion: The majority of the children analyzed did not have clinically significant laboratory or imaging studies in the setting of a first or recurrent non-febrile seizure.
In those with clinically significant results, the patient's history suggested a possible etiology for the seizure presentation, and further workup was indicated on that basis.

Background: In patients with a negative CT scan for suspected subarachnoid hemorrhage (SAH), CT angiography (CTA) has emerged as a controversial alternative diagnostic strategy in place of lumbar puncture (LP). Objectives: To determine the diagnostic accuracy for SAH and aneurysm of LP alone, CTA alone, and LP followed by CTA if the LP is positive. Methods: We developed a decision and Bayesian analysis to evaluate 1) LP, 2) CTA, and 3) LP followed by CTA if the LP is positive. Data were obtained from the literature. The model considers the probability of SAH ( %), of aneurysm ( % if SAH), the sensitivity and specificity of CT ( . % and % overall), of LP (based on RBC count and xanthochromia), and of CTA, and the traumatic tap and its influence on SAH detection (an illustrative post-test probability calculation of this kind is sketched below). Analyses considered all patients, and those presenting at less than hours or greater than hours from symptom onset, by varying the sensitivity and specificity of CT and CTA. Results: Using the reported ranges of CT sensitivity and specificity, the revised likelihood of SAH following a negative CT ranged from . to . %, and the likelihood of aneurysm ranged from . to . %. Following any of the diagnostic strategies, the likelihood of missing SAH ranged from to . %. Either LP strategy diagnosed . % of SAHs, versus - % with CTA alone, because CTA detected SAH only in the presence of an aneurysm. False-positive SAH with LP ranged from . to . % due to traumatic taps, and with CTA from . to . % due to aneurysms without SAH. The positive predictive value for SAH ranged from . to % with LP and from . to % with CTA. For patients presenting within hours of symptom onset, the revised likelihood of SAH following a negative CT became . %, and the likelihood of aneurysm ranged from . to . %. Following any of the diagnostic strategies, the likelihood of missing SAH ranged from . to . %. Either LP strategy diagnosed . % of SAH, versus - % with CTA alone. False-positive SAH with LP was . %, and with CTA ranged from . to . %. The positive predictive value for SAH was . % with LP and from . to % with CTA. CTA following a positive LP diagnosed . - % of aneurysms. Conclusion: LP strategies are more sensitive for detecting SAH but less specific than CTA because of traumatic taps, leading to lower positive predictive values for SAH with LP than with CTA. Either diagnostic strategy results in a low likelihood of missing SAH, particularly within hours of symptom onset.
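To make the "revised likelihood" step of such a Bayesian analysis concrete, the sketch below computes the probability of SAH remaining after a negative head CT from a pretest probability and the CT's sensitivity and specificity. The numeric inputs are illustrative placeholders, not the study's parameters.

```python
# Minimal sketch (not the authors' model): post-test probability of SAH after a
# negative head CT, via Bayes' rule. All numbers are illustrative placeholders.

def post_test_prob_negative(pretest: float, sensitivity: float, specificity: float) -> float:
    """P(disease | negative test) via Bayes' rule."""
    false_neg = pretest * (1.0 - sensitivity)   # diseased patients missed by the test
    true_neg = (1.0 - pretest) * specificity    # non-diseased patients correctly negative
    return false_neg / (false_neg + true_neg)

pretest_sah = 0.10        # hypothetical pretest probability of SAH
ct_sensitivity = 0.93     # hypothetical CT sensitivity (higher within hours of onset)
ct_specificity = 1.00     # hypothetical CT specificity

p_sah_after_negative_ct = post_test_prob_negative(pretest_sah, ct_sensitivity, ct_specificity)
print(f"P(SAH | negative CT) = {p_sah_after_negative_ct:.3%}")
```

With these placeholder inputs the residual probability is under one percent, which is the kind of "revised likelihood" the abstract reports varying across CT sensitivity ranges and time from symptom onset.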
Background: Recent studies support perfusion imaging as a prognostic tool in ischemic stroke, but few data exist regarding its utility in transient ischemic attack (TIA). CT perfusion (CTP), which is more available and less costly to perform than MRI, has not been well studied. Objectives: To characterize CTP findings in TIA patients and identify imaging predictors of outcome. Methods: This retrospective cohort study evaluated TIA patients at a single ED over months who had CTP at the initial evaluation. A neurologist blinded to CTP findings collected demographic and clinical data. CTP images were analyzed by a neuroradiologist blinded to clinical information. CTP maps were described as qualitatively normal, increased, or decreased in mean transit time (MTT), cerebral blood volume (CBV), and cerebral blood flow (CBF). Quantitative analysis involved measurements of average MTT (seconds), CBV (cc/ g), and CBF (cc/[ g x min]) in standardized regions of interest within each vascular distribution. These were compared with values in the other hemisphere for relative measures of MTT difference, CBV ratio, and CBF ratio. An MTT difference of ≥ seconds, rCBV ≤ . , and rCBF ≤ . were defined as abnormal based on prior studies. Clinical outcomes, including stroke, TIA, or hospitalization during follow-up, were determined up to one year after the index event. Dichotomous variables were compared using Fisher's exact test, and logistic regression was used to evaluate the association of CTP abnormalities with outcome in TIA patients. Results: Of patients with validated TIA, had CTP done. Mean age was ± years, % were women, and % were Caucasian. Mean ABCD score was . ± . , and % had an ABCD ≥ . Prolonged MTT was the most common abnormality ( , %), and ( . %) had decreased CBV in the same distribution. On quantitative analysis, ( %) had a significant abnormality. Four patients ( . %) had prolonged MTT and decreased CBV in the same territory, while ( %) had mismatched abnormalities. When tested in a multivariate model, no significant associations between mismatch abnormalities on CTP and new stroke, TIA, or hospitalization were observed. Conclusion: CTP abnormalities are common in TIA patients. Although no association between these abnormalities and clinical outcomes was observed in this small study, this needs to be studied further.

Objectives: We hypothesized that pre-thrombolytic anti-hypertensive treatment (AHT) may prolong door-to-treatment time (DTT). Methods: Secondary data analysis of consecutive tPA-treated patients at randomly selected Michigan community hospitals in the INSTINCT trial. DTT among stroke patients who received pre-thrombolytic AHT was compared with that of those who did not. We then calculated a propensity score for the probability of receiving pre-thrombolytic AHT using a logistic regression model with covariates including demographics, stroke risk factors, antiplatelet or beta blocker as home medication, stroke severity (NIHSS), onset-to-door time, admission glucose, pretreatment systolic and diastolic blood pressure, EMS usage, and location at the time of stroke. A paired t-test was then performed to compare DTT between the propensity-matched groups (an illustrative sketch of this matching approach follows below). A separate generalized estimating equations (GEE) approach was also used to estimate the differences between patients receiving pre-thrombolytic AHT and those who did not, while accounting for within-hospital clustering. Results: A total of patients were included in INSTINCT; however, onset, arrival, or treatment times could not be determined in , leaving patients for this analysis. The unmatched cohort consisted of stroke patients who received pre-thrombolytic AHT and stroke patients who did not receive AHT from - (table). In the unmatched cohort, patients who received pre-thrombolytic AHT had a longer DTT (mean increase minutes; % confidence interval (CI) - minutes) than patients who did not. After propensity matching (table), patients who received pre-thrombolytic AHT had a longer DTT (mean increase . minutes, % CI . - . ) than patients who did not. This effect persisted and its magnitude was not altered by accounting for clustering within hospitals. Conclusion: Pre-thrombolytic AHT is associated with modest delays in DTT. This represents a feasible target for physician educational interventions and quality improvement initiatives. Further research evaluating optimal hypertension management before thrombolytic treatment is warranted.
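As a hedged illustration of the propensity-matched comparison just described, the sketch below fits a propensity model, performs greedy 1:1 nearest-neighbor matching, and applies a paired t-test to door-to-treatment time. The dataset, file name, and column names are hypothetical placeholders, not the INSTINCT data or the authors' code.

```python
# Minimal sketch (not the authors' code): 1:1 nearest-neighbor propensity-score
# matching for a binary exposure (pre-thrombolytic AHT), then a paired t-test on
# door-to-treatment time (DTT). Columns and dataset are hypothetical placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("instinct_like_cohort.csv")  # hypothetical: one row per tPA-treated patient

# 1. Propensity model: probability of receiving pre-thrombolytic AHT (aht coded 0/1).
ps_model = smf.logit("aht ~ age + nihss + onset_to_door + sbp + dbp + glucose + ems",
                     data=df).fit()
df["ps"] = ps_model.predict(df)

# 2. Greedy 1:1 nearest-neighbor matching on the propensity score, without replacement.
treated = df[df["aht"] == 1].copy()
controls = df[df["aht"] == 0].copy()
matched_pairs = []
for _, t in treated.iterrows():
    j = (controls["ps"] - t["ps"]).abs().idxmin()   # closest remaining control
    matched_pairs.append((t["dtt"], controls.loc[j, "dtt"]))
    controls = controls.drop(index=j)

# 3. Paired t-test on DTT across the matched pairs.
dtt_treated, dtt_control = map(np.array, zip(*matched_pairs))
t_stat, p_value = stats.ttest_rel(dtt_treated, dtt_control)
print(f"mean DTT difference: {(dtt_treated - dtt_control).mean():.1f} min, p={p_value:.3f}")
```

A caliper on the propensity-score distance, or matching within hospital, could be added; the GEE step mentioned in the abstract is a separate model not shown here.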
this represents a feasible target for physician educational interventions and quality improvement initiatives. further research evaluating optimum hypertension management pre-thrombolytic treatment is warranted. post-pds, % had only pre-pds, and % had both. the most common pds included failure to treat post-treatment hypertension ( , %), antiplatelet agent within hours of treatment ( , %), pre-treatment blood pressure over / ( , %), anticoagulant agent within hours of treatment ( , %), and treatment outside the time window ( , %). symptomatic intracranial hemorrhage (sich) was observed in . % of patients with pds and . % of patients without any pd. in-hospital case fatality was % with and % without a pd. in the fully adjusted model, older age was significantly associated with pre-pds (table) . when post-pds were evaluated with adjustment for pre-pds, age was not associated with pds; however, pre-pds were associated with post-pds. conclusion: older age was associated with increased odds of pre-pds in michigan community hospitals. pre-pds were associated with post-pds. sich and in-hospital case fatality were not associated with pds; however, the low number of such events limited our ability to detect a difference. ct background: mri has become the gold standard for the detection of cerebral ischemia and is a component of multiple imaging enhanced clinical risk prediction rules for the short-term risk of stroke in patients with transient ischemic attack (tia). however, it is not always available in the emergency department (ed) and is often contraindicated. leukoaraiosis (la) is a radiographic term for white matter ischemic changes, and has recently been shown to be independently predictive of disabling stroke. although it is easily detected by both ct and mri, their comparative ability is unknown. objectives: we sought to determine whether leukoaraiosis, when combined with evidence of acute or old infarction as detected by ct, achieved similar sensitivity to mri in patients presenting to the ed with tia. methods: we conducted a retrospective review of consecutive patients diagnosed with tia between june and july that underwent both ct and mri as part of routine care within calendar day of presentation to a single, academic ed. ct and mr images were reviewed by a single emergency physician who was blinded to the mr images at the time of ct interpretation. la was graded using the van sweiten scale (vss), a validated grading scale applicable to both ct and mri. anterior and posterior regions were graded independently from to . results: patients were diagnosed with tia during the study period. of these, had both ct and mri background: helping others is often a rewarding experience but can also come with a ''cost of caring'' also known as compassion fatigue (cf). cf can be defined as the emotional and physical toll suffered by those helping others in distress. it is affected by three major components: compassion satisfaction (cs), burnout (bo), and traumatic experiences (te). previous literature has recognized an increase in bo related to work hours and stress among resident physicians. objectives: to assess the state of cf among residents with regard to differences in specialty training, hours worked, number of overnights, and demands of child care. we aim to measure associations with the three components of cf (cs, bo, and te). methods: we used the previously validated survey, proqol . the survey was sent to the residents after approval from the irb and the program directors. 
results: a total of responses were received ( % of the surveyed). five were excluded due to incomplete questionnaires. we found that residents who worked more hours per week had significantly higher bo levels (median vs , p = . ) and higher te ( vs , p = . ) than those working less hours. there was no difference in cs ( vs , p = . ). eighteen percent of the residents worked a majority of the night shifts. these residents had higher levels of bo background: emergency department (ed) billing includes both facility and professional fees. an algorithm derived from the medical provider's chart generates the latter fee. many private hospitals encourage appropriate documentation by financially incentivizing providers. academic hospitals sometimes lag in this initiative, possibly resulting in less than optimal charting. past attempts to teach proper documentation using our electronic medical record (emr) were difficult in our urban, academic ed of providers (approximately attending physicians, residents, and physician assistants). objectives: we created a tutorial to teach documentation of ed charts, modified the emr to encourage appropriate documentation, and provided feedback from the coding department. this was combined with an incentive structure shared equally amongst all attendings based on increased collections. we hypothesized this instructional intervention would lead to more appropriate billing, improve chart content, decrease medical liability, and increase educational value of charting process. methods: documentation recommendations, divided into two-month phases of - proposals, were administered to all ed providers by e-mails, lectures, and reminders during sign-out rounds. charts were reviewed by coders who provided individual feedback if specific phase recommendations were not followed. our endpoints included change in total rvu, rvus/ patient, e/m level distribution, and subjective quality of chart improvement. we did not examine effects on procedure codes or facility fees. results: our base average rvu/patient in our ed from / / - / / was . with monthly variability of approximately %. implementation of phase one increased average rvu/patient within two weeks to . ( . % increase from baseline, p < . ). the second aggregate phase implemented weeks later increased average rvu/patient to . ( . % increase from baseline, p < . ). conclusion: using our teaching methods, chart reviews focused on - recommendations at a time, and emr adjustments, we were able to better reflect the complexity of care that we deliver every day in our medical charts. future phases will focus on appropriate documentation for procedures, critical care, fast track, and pediatric patients, as well as examining correlations between increase in rvus with charge capture. identifying mentoring ''best practices'' for medical school faculty julie l. welch, teresita bellido, cherri d. hobgood background: mentoring has been identified as an essential component for career success and satisfaction in academic medicine. many institutions and departments struggle with providing both basic and transformative mentoring for their faculty. objectives: we sought to identify and understand the essential practices of successful mentoring programs. methods: multidisciplinary institutional stakeholders in the school of medicine including tenured professors, deans, and faculty acknowledged as successful mentors were identified and participated in focused interviews between mar-nov . 
the major area of inquiry involved their experiences with mentoring relationships, practices, and structure within the school, department, or division. focused interview data were transcribed and grounded theory analysis was performed. additional data collected by a institutional mentoring taskforce were examined. key elements and themes were identified and organized for final review. results: results identified the mentoring practices for three categories: ) general themes for all faculty, ) specific practices for faculty groups: basic science researchers, clinician researchers, clinician educators, and ) national examples. additional mentoring strategies that failed were identified. the general themes were quite universal among faculty groups. these included: clarify the best type of mentoring for the mentee, allow the mentee to choose the mentor, establish a panel of mentors with complementary skills, schedule regular meetings, establish a clear mentoring plan with expectations and goals, offer training and resources for both the mentor and mentee at institutional and departmental levels, ensure ongoing mentoring evaluation, create a mechanism to identify and reward mentoring. national practice examples offered critical recommendations to address multi-generational attitudes and faculty diversity in terms of gender, race, and culture. conclusion: mentoring strategies can be identified to serve a diverse faculty in academic medicine. interventions to improve mentoring practices should be targeted at the level of the institution, department, and individual faculty members. it is imperative to adopt results such as these to design effective mentoring programs to enhance the success of emergency medicine faculty seeking robust academic careers. background: women comprise half of the talent pool from which the specialty of emergency medicine draws future leaders, researchers, and educators and yet only % of full professors in us emergency medicine are female. both research and interventions are aimed at reducing the gender gap, however, it will take decades for the benefits to be realized which creates a methodological challenge in assessing system's change. current techniques to measure disparities are insensitive to systems change as they are limited to percentages and trends over time. objectives: to determine if the use of relative rate index (rri) better predicts which stage in the system women are not advancing in the academic pipeline than traditional metrics. methods: rri is a method of analysis that assesses the percent of sub-populations in each stage relative to their representation in the stage directly prior. thus, there is a better notion of the advancement given the availability to advance. rri also standardizes data for ease of interpretation. this study was conducted on the total population of academic professors in all departments at yale school of medicine during the academic year of - . data were obtained from the yale university provost's office. results: n = . there were a total of full, associate, and assistant professors. males comprised %, %, and % respectively. rri for the department of emergency medicine (dem) is . , . , and . , for full, associate, and assistant professors, respectively while the percentages were %, %, and % respectively. conclusion: relying solely on percentages masks improvements to the system. women are most represented at the associate professor level in dem, highlighting the importance of systems change evidence. 
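the relative rate index described in the methods above reduces to a ratio of a group's representation at one rank to its representation at the rank directly prior. a minimal sketch, with hypothetical counts:

```python
# relative rate index (rri): a group's share at each rank divided by its share
# at the rank directly prior. the counts below are hypothetical placeholders.

def rri(group_counts: dict, total_counts: dict, ordered_ranks: list) -> dict:
    shares = {r: group_counts[r] / total_counts[r] for r in ordered_ranks}
    return {
        ordered_ranks[i]: shares[ordered_ranks[i]] / shares[ordered_ranks[i - 1]]
        for i in range(1, len(ordered_ranks))
    }

ranks = ["assistant", "associate", "full"]
women = {"assistant": 12, "associate": 6, "full": 2}    # placeholder counts
total = {"assistant": 30, "associate": 14, "full": 10}  # placeholder counts
print(rri(women, total, ranks))
# an rri of 1.0 means advancement in proportion to availability in the prior
# rank; values above 1.0 indicate over-representation relative to that pool.
```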
specifically, twice as many women are promoted to associate professor rank given the number who exists as assistant professors. within years, the dem should have an equal system as the numbers of associate professors have dramatically increased and will be eligible to promote to full professor. additionally, dem has a better record of retaining and promoting women than other yale departments of medicine at both associate and full professor ranks. objectives: we examine the payer mixes of community non-rehabilitation eds in metropolitan areas by region to identify the proportion of academic and nonacademic eds that could be considered safety net eds. we hypothesize that the proportion of safety net academic eds is greater than that for non-academic eds and is increasing over time. methods: this is an ecological study examining us ed visits from through . data were obtained from the nationwide emergency department sample (neds). we grouped each ed visit according to the unique hospital-based ed identifier, thus creating a payer mix for each ed. we define a ''safety net ed'' as any ed where the payer mix satisfied any one of the following three conditions: ) > % of all ed visits are medicaid patients; ) > % of all ed visits are self-pay patients; or ) > % of all ed visits are either medicaid or self-pay patients. neds tags each ed with a hospital-based variable to delineate metropolitan/non-metropolitan locations and academic affiliation. we chose to examine a subpopulation of eds tagged as either academic metropolitan or non-academic metropolitan, because the teaching status of non-metropolitan hospitals was not provided. we then measured the proportion of eds that met safety net criteria by academic status and region. results: we examined , , , , and , weighted metro eds in years - , respectively. table presents safety net proportions. the proportions of academic safety net eds increased across the study period. widespread regional variability in safety net proportions existed across all years. the proportions of safety net eds were highest in the south and lowest in the northeast and midwest. table describes these findings for . conclusion: these data suggest that the proportion of safety-net academic eds may be greater than that of non-academic eds, is increasing over time, and is objectives: to examine the effect of ma health reform implementation on ed and hospital utilization before and after health reform, using an approach that relies on differential changes in insurance rates across different areas of the state in order to make causal inferences as to the effect of health reform on ed visits and hospitalizations. our hypothesis was that health care reform (i.e. reducing rates of uninsurance) would result in increased rates of ed use and hospitalizations. methods: we used a novel difference-in-differences approach, with geographic variation (at the zip code level) in the percentage uninsured as our method of identifying changes resulting from health reform, to determine the specific effect of massachusetts' health care reform on ed utilization and hospitalizations. using administrative data available from the massachusetts division of health care finance and policy acute hospital case mix databases, we compared a one-year period before health reform with an identical period after reform. we fit linear regression models at the area-quarter level to estimate the effect of health reform and the changing uninsurance rate (defined as self-pay only) on ed visits and hospitalizations. 
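the difference-in-differences model just described can be sketched as an area-quarter panel regression in which the exposure is the interaction between a post-reform indicator and the zip-code-level uninsurance rate, with standard errors clustered on area. variable and file names are hypothetical:

```python
# sketch of the area-quarter difference-in-differences regression; the
# coefficient on the interaction term is the effect estimate of interest.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("area_quarter_panel.csv")   # one row per zip code x quarter

model = smf.ols(
    "ed_visits_per_capita ~ post_reform * baseline_uninsured_rate + C(quarter)",
    data=panel,
).fit(cov_type="cluster", cov_kwds={"groups": panel["zip_code"]})

term = "post_reform:baseline_uninsured_rate"
print(model.params[term], model.pvalues[term])
# areas with more to gain from reform (higher baseline uninsurance) are compared
# with areas with less, before versus after, which is the identifying contrast.
```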
results: there were , , ed visits and , hospitalizations pre-reform and , , ed visits and , hospitalizations post-reform. the rate of uninsurance decreased from . % to . % in the ed group and from . % to . % in the hospitalization group. a reduction in the rate of the uninsured was associated with a small but statistically significant increase in ed utilization (p = . ) and no change in hospitalizations (p = . ). conclusion: we find that increasing levels of insurance coverage in massachusetts were associated with small but statistically significant increases in ed visits, but no differences in rates of hospitalizations. these results should aid in planning for anticipated changes that might result from the implementation of health reform nationally. with high levels of co-morbidity when untreated in adolescents. despite broad cdc screening recommendations, many youth do not receive testing when indicated. the pediatric emergency department (ped) is a venue with a high volume of patients potentially in need of sti testing, but assessing risk in the ped is difficult given constraints on time and privacy. we hypothesized that patients visiting a ped would find an audio-enhanced computer-assisted self-interview (acasi) program to establish sti risk easy to use, and would report a preference for the acasi over other methods of disclosing this information. objectives: to assess acceptability, ease of use, and comfort level of an acasi designed to assess adolescents' risk for stis in the ped. methods: we developed a branch-logic questionnaire and acasi system to determine whether patients aged - visiting the ped need sti testing, regardless of chief complaint. we obtained consent from participants and guardians. patients completed the acasi in private on a laptop. they read a one-page computer introduction describing study details and completed the acasi. patients rated use of the acasi upon completion using five-point likert scales. results: eligible patients visited the ped during the study period. we approached ( %) and enrolled and analyzed data for / ( %). the median time to read the introduction and complete the acasi was . minutes (interquartile range . - . minutes). . % of patients rated the acasi ''very easy'' or ''easy'' to use, . % rated the wording as ''very easy'' or ''easy'' to understand, % rated the acasi ''very short'' or ''short'', . % rated the audio as ''very helpful'' or ''helpful,'' . % were ''very comfortable'' or ''comfortable'' with the system confidentiality, and . % said they would prefer a computer interface over in-person interviews or written surveys for collection of this type of information. conclusion: patients rated the computer interface of the acasi as easy and comfortable to use. a median of . minutes was needed to obtain meaningful clinical information. the acasi is a promising approach to enhance the collection of sensitive information in the ped. the participants were randomized to one of three conditions, bi delivered by a computer (cbi), bi delivered by a therapist assisted by a computer (tbi), or control, and completed , , and month follow-up. in addition to content on alcohol misuse and peer violence, adolescents reporting dating violence received a tailored module on dating violence. the main outcome for this analysis was frequency of moderate and severe dating victimization and aggression at the baseline assessment and , , and months post ed visit. 
results: among eligible adolescents, % (n = ) reported dating violence and were included in these analyses. compared to controls, after controlling for baseline dating victimization, participants in the cbi showed reductions in moderate dating victimization at months (or . ; ci . - . ; p < . , effect size . ) and months (or . ; ci . - . ; p < . , effect size . ); models examining interaction effects were significant for the cbi on moderate dating victimization at and months. significant interaction effects were found for the tbi on moderate dating victimization at and months and severe dating victimization at months. the computer-based intervention shows promise for delivering content that decreases moderate dating victimization over months. the therapist bi is promising for decreasing moderate dating victimization over months and severe dating victimization over months. ed-based bis delivered on a computer addressing multiple risk behaviors could have important public health effects. figure . the -only ordinance was associated with a significant reduction of ar visits. this ordinance was also associated with reduction in underage ar visits, ui student visits, and public intoxication bookings. these data suggest that other cities should consider similar ordinances to prevent unwanted consequences of alcohol. background: prehospital providers perform tracheal intubation in the prehospital environment, and failed attempts are of concern due to the danger of hypoxia and hypotension. some question the appropriateness of intubation in this setting due to the morbidity risk associated with intubation in the field. thus it is important to gain an understanding of the factors that predict the success of prehospital intubation attempts to inform this discussion. objectives: to determine the factors that affect success rates on first attempt of paramedic intubations in a rapid sequence intubation (rsi) capable critical care transport service. methods: we conducted a multivariate logistic analysis on a prospectively collected database of airway management from an air and land critical care transport service that provides scene responses and interfacility transport in the province of ontario. background: motor vehicle collisions (mvcs) are one of the most common types of trauma for which people seek ed care. the vast majority of these patients are discharged home after evaluation. acute psychological distress after trauma causes great suffering and is a known predictor of posttraumatic stress disorder (ptsd) development. however, the incidence and predictors of psychological distress among patients discharged to home from the ed after mvcs have not been reported. objectives: to examine the incidence and predictors of acute psychological distress among individuals seen in the ed after mvcs and discharged to home. methods: we analyzed data from a prospective observational study of adults - years of age presenting to one of eight ed study sites after mvc between / and / . english-speaking patients who were alert and oriented, stable, and without injuries requiring hospital admission were enrolled. patient interview included assessment of patient sociodemographic and psychological characteristics and mvc characteristics. level of psychological distress in the ed was assessed using the -item peritraumatic distress inventory (pdi). pdi scores > are associated with increased risk of ptsd and were used to define substantial psychological distress. 
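the distress analysis described above and continued in the next lines (dichotomizing the pdi at a cut-point and modeling predictors of substantial distress alongside crash-severity covariates) was run in stata; a python sketch of the same steps follows, with a placeholder cut-point and hypothetical column names:

```python
# sketch of the peritraumatic distress analysis: dichotomize the pdi and model
# predictors of substantial distress. the cut-point and columns are placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

mvc = pd.read_csv("mvc_ed_cohort.csv")
PDI_CUTOFF = 23                                   # assumed threshold
mvc["substantial_distress"] = (mvc["pdi_score"] > PDI_CUTOFF).astype(int)

model = smf.logit(
    "substantial_distress ~ age + female + income + pre_mvc_depressive_sx "
    "+ arrived_on_backboard + vehicle_damage_severity + vehicle_speed",
    data=mvc,
).fit(disp=False)

# odds ratios with 95% cis, adjusted for crash severity.
or_table = np.exp(model.conf_int())
or_table["or"] = np.exp(model.params)
print(or_table)
```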
descriptive statistics and logistic regression were performed using stata ic . (statacorp lp, college station, texas). results: mvc patients were screened, were eligible, and were enrolled. / ( %) participants had substantial psychological distress. after adjusting for crash severity (severity of vehicle damage, vehicle speed), substantial patient distress was predicted by sociodemographic factors, pre-mvc depressive symptoms, and arriving at the ed on a backboard (table). conclusion: substantial psychological distress is common among individuals discharged from the ed after mvcs and is predicted by patient characteristics separate from mvc severity. a better understanding of the frequency and predictors of substantial psychological distress is an important first step in identifying these patients and developing effective interventions to reduce severe distress in the aftermath of trauma. such interventions have the potential to reduce both immediate patient suffering and the development of persistent psychological sequelae. the predictive characteristics of pets, pesi, and spesi for -day mortality in emperor, including auc, negative predictive value, sensitivity, and specificity, were calculated. results: the of patients ( . %; % ci . %- . %) classified as pets low had -day mortality of . % ( % ci . - . %), versus . % ( % ci . %- . %) in the pets high group, statistically similar to pesi and spesi. pets is significantly more specific for mortality than the spesi ( . % vs . %; p < . ), classifying far more patients as low-risk while maintaining a sensitivity of % ( % ci . %- . %), not significantly different from spesi or pesi (p > . ). conclusion: with four variables, pets in this derivation cohort is as sensitive for -day mortality as the more complicated pesi and spesi, with significantly greater specificity than the spesi for mortality, placing % more patients in the low-risk group. external validation is necessary. nicole seleno, jody vogel, michael liao, emily hopkins, richard byyny, ernest moore, craig gravitz, jason haukoos, denver health medical center, denver, co background: the sequential organ failure assessment (sofa) score, base excess, and lactate have been shown to be associated with mortality in critically ill trauma patients. the denver emergency department (ed) trauma organ failure (tof) score was recently derived and internally validated to predict multiple organ failure in trauma patients. the relationship between the denver tof score and mortality has not been assessed or compared to other conventional measures of mortality in trauma. objectives: to compare the prognostic accuracies of the denver ed tof score, ed sofa score, and ed base excess and lactate for mortality in a large heterogeneous trauma population. methods: a secondary analysis of data from the denver health trauma registry, a prospectively collected database. consecutive adult trauma patients from through were included in the study. data collected included demographics, injury characteristics, prehospital care characteristics, response to injury characteristics, ed diagnostic evaluation and interventions, and in-hospital mortality. the values of the four clinically relevant measures (denver ed tof score, ed sofa score, ed base excess, and ed lactate) were determined within four hours of patient arrival, and prognostic accuracies for in-hospital mortality for the four measures were evaluated with receiver operating characteristic (roc) curves. multiple imputation was used for missing values.
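the prognostic-accuracy comparison just described (denver ed tof, ed sofa, base excess, and lactate against in-hospital mortality, summarized with roc curves) amounts to one auc per measure. a sketch with bootstrap confidence intervals and hypothetical column names follows; base excess is negated so that higher values of every measure point toward higher risk:

```python
# sketch of the roc comparison for in-hospital mortality; columns hypothetical.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

trauma = pd.read_csv("trauma_registry.csv")
y = trauma["died_in_hospital"].to_numpy()
measures = {
    "denver_ed_tof": trauma["denver_ed_tof"].to_numpy(),
    "ed_sofa": trauma["ed_sofa"].to_numpy(),
    "ed_lactate": trauma["ed_lactate"].to_numpy(),
    "ed_base_excess": -trauma["ed_base_excess"].to_numpy(),   # flip direction
}
rng = np.random.default_rng(0)

for name, score in measures.items():
    auc = roc_auc_score(y, score)
    boot = []
    for _ in range(1000):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() != y[idx].max():          # need both classes in resample
            boot.append(roc_auc_score(y[idx], score[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"{name}: auc {auc:.3f} (95% ci {lo:.3f}-{hi:.3f})")
```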
results: of the , patients, the median age was (iqr - ) years, median injury severity score was (iqr - ), and % had blunt mechanisms. thirty-eight percent ( , patients) were admitted to the icu with a median icu length of stay of . (iqr - ) days, and % ( patients) died. in the non-survivors, the median values for the four measures were ed sofa . (iqr . - . ); denver ed tof . (iqr . - . ); ed base excess . (iqr . - . ) meq/l; and ed lactate . (iqr . - . ) mmol/l. the areas under the roc curves for these measures are demonstrated in the figure. conclusion: the denver ed tof score more accurately predicts in-hospital mortality in trauma patients as compared to the ed sofa score, ed base excess, or ed lactate. the denver ed tof score may help identify patients early who are at risk for mortality, allowing for targeted resuscitation and secondary triage to improve outcomes in these critically ill patients. the background: both animal and human studies suggest that early initiation of therapeutic hypothermia (th) and rapid cooling improve outcomes after cardiac arrest. objectives: the objective was to determine if administration of cold iv fluids in a prehospital setting decreased time-to-target-temperature (tt) with secondary analysis of effects on mortality and neurological outcome. methods: patients resuscitated after out-of-hospital cardiac arrest (oohca) who received an in-hospital post cardiac arrest bundle including th were prospectively enrolled into a quality assurance database from november to november . on april , a protocol for intra-arrest prehospital cooling with °c normal saline on patients experiencing oohca was initiated. we retrospectively compared tt for those receiving prehospital cold fluids and those not receiving cold fluids. tt was defined as °c measured via foley thermistor. secondary outcomes included mortality, good neurological outcome defined as cerebral performance category (cpc) score of or at discharge, and effects of pre-rosc cooling. results: there were patients who were included in this analysis with patients receiving prehospital cold iv fluids and who did not. initially, % of patients were in vf/vt and % asystole/pea. patients receiving prehospital cooling did not have a significant improvement in tt ( minutes vs minutes, p = . ). survival to discharge and good neurologic outcome were not associated with prehospital cooling ( % vs %, p = . ) and cpc of or in % vs %, (p = . ). initiating cold fluids prior to rosc showed both a nonsignificant decrease in survival ( % vs %, p = . ) and increase in poor neurologic outcomes ( % vs %, p = . ). % of patients received £ l of cooled ivf prior to hospital arrival. patients receiving prehospital cold ivf had a longer time from arrest to hospital arrival ( vs min, p =< . ) in addition to a prolonged rosc to hospital time ( vs min, p = . ). conclusion: at our urban hospital, patients achieving rosc following oohca did not demonstrate faster tt or outcome improvement with prehospital cooling compared to cooling initiated immediately upon ed arrival. further research is needed to assess the utility of prehospital cooling. assessment background: an estimated % of emergency department (ed) patients years of age and older have delirium, which is associated with short-and long-term risk of morbidity and mortality. early recognition could result in improved outcomes, but the reliability of delirium recognition in the continuum of emergency care is unknown. 
objectives: we tested whether delirium can be reliably detected during emergency care of elderly patients by measuring the agreement between prehospital providers, ed physicians, and trained research assistants using the confusion assessment method for the icu (cam-icu) to identify the presence of delirium. our hypothesis was that both ed physicians and prehospital providers would have poor ability to detect elements of delirium in an unstructured setting. methods: prehospital providers and ed physicians completed identical questionnaires regarding their clinical encounter with a convenience sample of elderly (age > years) patients who presented via ambulance to two urban, teaching eds over a three-month period. respondents noted the presence or absence of ( ) an acute change in mental status, ( ) inattention, ( ) disorganized thinking, and ( ) altered level of consciousness (using the richmond agitation sedation scale). these four components comprise the operational definition of delirium. a research assistant trained in the cam-icu rated each component for the same patients using a standard procedure. we calculated inter-rater reliability (kappa) between prehospital providers, ed physicians, and research assistants for each component. objectives: this study aimed to assess the association between age and ems use while controlling for potential confounders. we hypothesized that this association would persist after controlling for confounders. methods: a cross-sectional survey study was conducted at an academic medical center's ed. an interview-based survey was administered and included questions regarding demographic and clinical characteristics, mode of ed arrival, health care use, and the perceived illness severity. age was modeled as an ordinal variable (< , - , and ≥ years). bivariate analyses were used to identify potential confounders and effect measure modifiers, and a multivariable logistic regression model was constructed. odds ratios were calculated as measures of effect. results: a total of subjects were enrolled and had usable data for all covariates, ( %) of whom arrived via ems. the median age of the sample was years and % were female. there was a statistically significant linear trend in the proportion of subjects who arrived via ems by age (p < . ). compared to adults aged less than years, the unadjusted odds ratio associating age and ems use was . ( % ci: background: we previously derived a clinical decision rule (cdr) for chest radiography (cxr) in patients with chest pain and possible acute coronary syndrome (acs) consisting of the absence of three predictors: history of congestive heart failure, history of smoking, and abnormalities on lung auscultation. objectives: to prospectively validate and refine a cdr for cxr in an independent patient population. methods: we prospectively enrolled patients over years of age with a primary complaint of chest pain and possible acs from september to january at a tertiary care ed with , annual patient visits. physicians completed standardized data collection forms before ordering chest radiographs and were thus blinded to cxr findings at the time of data collection. two investigators, blinded to the predictor variables, independently classified cxrs as ''normal,'' ''abnormal not requiring intervention,'' and ''abnormal requiring intervention'' (e.g., heart failure, infiltrates) based on review of the radiology report and the medical record.
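the agreement analysis for the delirium components above (and the chart-classification reliability reported in the chest radiography study that follows) rests on cohen's kappa; a pairwise sketch with a hypothetical column-naming convention:

```python
# sketch of pairwise cohen's kappa for each cam-icu delirium component across
# the three rater types; file and column names are hypothetical.
from itertools import combinations
import pandas as pd
from sklearn.metrics import cohen_kappa_score

ratings = pd.read_csv("delirium_ratings.csv")     # one row per patient
components = ["acute_change", "inattention", "disorganized", "altered_loc"]
raters = ["ems", "ed_md", "ra"]                   # prehospital, physician, research asst

for comp in components:
    for r1, r2 in combinations(raters, 2):
        kappa = cohen_kappa_score(ratings[f"{comp}_{r1}"], ratings[f"{comp}_{r2}"])
        print(f"{comp:>14} {r1} vs {r2}: kappa = {kappa:.2f}")
```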
analyses included descriptive statistics, inter-rater reliability assessment (kappa), and recursive partitioning. results: of visits for possible acs, mean age (sd) was . ( . ) and % were female. twenty-four percent had a history of acute myocardial infarction, % congestive heart failure, and % atrial fibrillation. seventy-one ( . %, % ci . - . ) patients had a radiographic abnormality requiring intervention. ing the likelihood of coronary artery disease (cad) could reduce the need for stress testing or coronary imaging. acyl-coa:cholesterol acyltransferase- (acat ) activity has been shown in monkey and murine models to correlate with atherosclerosis. objectives: to determine if a novel cardiac biomarker consisting of plasma cholesteryl ester levels (ce) typically derived from the activity of acat is predictive of cad in a clinical model. methods: a single center prospective observational cohort design enrolled a convenience sample of subjects from a tertiary care center with symptoms of acute coronary syndrome undergoing coronary ct angiography or invasive angiography. plasma samples were analyzed for ce composition with mass spectrometry. the primary endpoint was any cad determined at angiography. multivariable logistic regression analyses were used to estimate the relationship between the sum of the plasma concentrations from cholesteryl palmitoleate ( : ) and cholesteryl oleate ( : ) (defined as acat -ce) and the presence of cad. the added value of acat -ce to the model was analyzed comparing the c-statistics and integrated discrimination improvement (idi). results: the study cohort was comprised of participants enrolled over months with a mean age (± . ) years, % with cad at angiography. the median plasma concentration of acat -ce was lm ( , ) in patients with cad and lm ( , ) in patients without cad (p = . ) (figure) . when considered with age, sex, and the number of conventional cad risk factors, acat -ce were associated with a . % increased odds of having cad per lm increase in concentration. the addition of acat -ce significantly improved the c-statistic ( . vs . , p = . ) and idi ( . , p < . ) compared to the reduced model. in the subgroup of low-risk observation unit patients, the ce model had superior discrimination compared to the diamond forrester classification (idi . , p < . ). conclusion: plasma levels of acat -ce, considered in a clinical model, have strong potential to predict a patient's likelihood of having cad. in turn, this could reduce the need for cardiac imaging after the exclusion of mi. further study of acat -ce as biomarkers in patients with suspected acs is needed. background: outpatient studies have demonstrated a correlation between carotid intima-media thickness (cimt) on ultrasound and coronary artery disease (cad). there are no known published studies that investigate the role of cimt in the ed using cardiac ct or percutaneous cardiac intervention (pci) as a gold standard. objectives: we hypothesized that cimt can predict cardiovascular events and serve as a noninvasive tool in the ed. methods: this was a prospective study of adult patients who presented to the ed and required evaluation for chest pain. the study location was an urban ed with a census of , annual visits and -hour cardiac catheterization. patients who did not have ct or pci or had carotid surgery were excluded from the study. ultrasound cimt measurements of right and left common carotid arteries were taken with a mhz linear transducer (zonare, mountain view, ca). 
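the added-value analysis for the acat -ce biomarker described above (nested logistic models compared by c-statistic and integrated discrimination improvement) can be sketched in a few lines; the idi is computed here as the change in discrimination slope between the two models, and the file and column names are hypothetical:

```python
# sketch of the nested-model comparison: c-statistics and idi for a base model
# versus the base model plus the biomarker. columns are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.metrics import roc_auc_score

d = pd.read_csv("acs_cohort.csv")
base = smf.logit("cad ~ age + male + n_risk_factors", data=d).fit(disp=False)
full = smf.logit("cad ~ age + male + n_risk_factors + acat_ce", data=d).fit(disp=False)

p_base, p_full = base.predict(d), full.predict(d)
print("c-statistic base:", roc_auc_score(d["cad"], p_base))
print("c-statistic full:", roc_auc_score(d["cad"], p_full))

# idi = change in discrimination slope (mean predicted risk in events minus
# mean predicted risk in non-events) between the full and base models.
events, nonevents = d["cad"] == 1, d["cad"] == 0
idi = (p_full[events].mean() - p_full[nonevents].mean()) - (
    p_base[events].mean() - p_base[nonevents].mean()
)
print("idi:", idi)
```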
anterior, medial, and posterior views of the near and far wall were obtained ( cimt scores total). images were analyzed by carotid analyzer (mailing imaging application llc, coralville, iowa). patients were classified into two groups based on the results from ct or pci. a subject was classified as having significant cad if there was over % occlusion or multi-vessel disease. results: ninety of patients were included in the study; . % were males. mean age was . ± years. there were ( . %) subjects with significant cad and ( . %) with non-significant cad. the mean of all cimt measurements was significantly higher in the cad group than in the non-cad group ( . ± . vs. . ± . ; p < . ). a logistic regression analysis was carried out with significant cad as the event of interest and the following explanatory variables in the model: objectives: to determine the diagnostic yield of routine testing in-hospital or following ed discharge among patients presenting to an ed following syncope. methods: a prospective, observational, cohort study of consecutive ed patients ‡ years old presenting with syncope was conducted. the four most commonly utilized tests (echocardiography, telemetry, ambulatory electrocardiography monitoring, and cardiac markers) were studied. interobserver agreement as to whether tests results determined the etiology of the syncope was measured using kappa (k) values. results: of patients with syncope, ( %) had echocardiography with ( %) demonstrating a likely etiology of the syncopal event such as critical valvular disease or significantly depressed left ventricular function (k = . ). on hospitalization, ( %) patients were placed on telemetry, ( %) of these had worrisome dysrhythmias (k = . ). ( %) patients had troponin levels drawn of whom ( %) had positive results (k = ); ( %) patients were discharged with monitoring with significant findings in only ( . %) patients (k = . ). overall, ( %, % ci - %) studies were diagnostic. conclusion: although routine testing is prevalent in ed patients with syncope, the diagnostic yield is relatively low. nevertheless, some testing, particularly echocardiography, may yield critical findings in some cases. current efforts to reduce the cost of medical care by eliminating non-diagnostic medical testing and increasing emphasis on practicing evidence-based medicine argue for more discriminate testing when evaluating syncope. (originally submitted as a ''late-breaker.'') unusual fatigue was reported by . % (severe . %) and insomnia by . % (severe . %). these findings have led to risk management recommendations to consider these symptoms as predictive of acute coronary syndromes (acs) among women visiting the ed. objectives: to document the prevalence of these symptoms among all women visiting an ed. to analyze the potential effect of using these symptoms in the ed diagnostic process for acs. methods: a survey on fatigue and insomnia symptoms was administered to a convenience sample of all adult women visiting an urban academic ed (all arrival modes, acuity levels, all complaints). a sensitivity analysis was performed using published data and expert opinion for inputs. results: we approached women, with enrollments. see table. the top box shows prevalences of prodromal symptoms among all adult female ed patients. the bottom box shows outputs from sensitivity analysis on the diagnostic effect of initiating an acs workup for all female ed patients reporting prodromal symptoms. 
conclusion: prodromal symptoms of acs are highly prevalent among all adult women visiting the ed in this study. this likely limits their utility in ed settings. while screening or admitting women with prodromal symptoms in the ed would probably increase sensitivity, that increase would be accompanied by a dramatic reduction in specificity. such a reduction in specificity would translate to admitting, observing, or working up somewhere between % and % of all women visiting the ed, which is prohibitive in terms of personal costs, risks of hospitalization, and financial costs. while these symptoms may or may not have utility in other settings such as primary care, their prevalence and the implied lack of specificity for acs suggest they will not be clinically useful in the ed. methods: we examined a cohort of low-risk chest pain patients evaluated in an ed-based ou using prospective and retrospective ou registry data elements. cox proportional hazards modeling was performed to assess the effect of testing modality (stress testing vs. ccta) on the los in the cdu. as ccta is not available on weekends, only subjects presenting on weekdays were included. cox models were stratified on time of patient presentation to the ed, based on four-hour blocks beginning at midnight. the primary independent variable was first test modality, either stress imaging (exercise echo, dobutamine echo, stress mri) or ccta. age, sex, and race were included as covariates. the proportional hazards assumption was tested using scaled schoenfeld residuals, and the models were graphically examined for outliers and overly influential covariate patterns. test selection was a time-varying covariate in the am stratum, and therefore its interaction with ln(los) was included as a correction term. after correction for multiple comparisons, an alpha of . was held to be significant. results: over the study period, subjects (of , in the registry) presented on non-weekend days. the median los was . hours (iqr . - . hours), % were white, and % were female. the table shows the number of subjects in each time stratum, the number tested, and the number undergoing stress testing vs. ccta. after adjusting all models for age, race, and sex, the hazard ratio (hr) for los is as shown. only those patients presenting between am and noon showed a significant improvement in los with ccta use (p < . ). objectives: to determine the validity of a management-focused em osce as a measure of clinical skills by determining the correlation between osce scores and faculty assessment of student performance in the ed. methods: medical students in a fourth-year em clerkship were enrolled in the study. on the final day of the clerkship, students participated in a five-station em osce. student performance on the osce was evaluated using a task-based evaluation system with - critical management tasks per case. task performance was evaluated using a three-point system: performed correctly/timely ( ), performed incorrectly/late ( ), or not performed ( ). descriptive anchors were used for performance criteria. communication skills were also graded on a three-point scale. student performance in the ed was based on traditional faculty assessment using our core-competency evaluation instrument. a pearson correlation coefficient was calculated for the relationship between osce score and ed performance score. case item analysis included determination of difficulty and discrimination.
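the observation-unit length-of-stay model described above (cox regression with first test modality as the exposure, adjusted for age, sex, and race, stratified by four-hour arrival block, and checked with scaled schoenfeld residuals) could be sketched with lifelines as below. file and column names are hypothetical, and the time-varying correction term used in the morning stratum is omitted for brevity:

```python
# sketch of the cox proportional-hazards model for observation-unit length of
# stay, stratified by four-hour arrival block. columns are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

ou = pd.read_csv("ou_chest_pain.csv")
ou["ccta"] = (ou["first_test"] == "ccta").astype(int)
ou["arrival_block"] = ou["arrival_hour"] // 4          # six 4-hour strata

fit_df = ou[["los_hours", "discharged", "ccta", "age", "female", "white", "arrival_block"]]
cph = CoxPHFitter()
cph.fit(fit_df, duration_col="los_hours", event_col="discharged", strata=["arrival_block"])
cph.print_summary()                                    # hazard ratio for ccta vs stress imaging
cph.check_assumptions(fit_df, p_value_threshold=0.05)  # scaled schoenfeld residuals
```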
the acgme also requires that trainees are evaluated on these ccs during their residency. trainee evaluation in the ccs are frequently on a subjective rating scale. one of the recognized problems with a subjective scale is the rating stringency of the rater, commonly known as the hawk-dove effect. this has been seen in standardized clinical exam scoring. recent data have shown that score variance can be related to evaluator performance with a negative correlation. higher-scoring physicians were more likely to be a stringent or hawk type rater on the same evaluation. it is unclear if this pattern also occurs in the subjective ratings that are commonly used in assessments of the ccs. objectives: comparison of attending physician scores on the acgme ccs with attending ratings of residents for a negative correlation or hawk-dove effect. methods: residents are routinely evaluated on the ccs with a - numerical rating scale as part of their training. the evaluation database was retrospectively reviewed. residents anonymously scored attending physicians on the ccs with a cross-sectional survey that utilized the same rating scale, anchors, and prompts as the resident evaluations. average scores for and by each attending were calculated and a pearson correlation calculated by core competency and overall. results: in this irb-approved study, a total of attending physicians were scored on the ccs with evaluations by residents. attendings evaluated residents with a total of , evaluations completed over a -year period. attending mode score was ranging from to ; resident scores had a mode of with a range of to . there was no correlation between the rated performance of the attendings overall or in each ccs and the scores they gave (p = . - . ). conclusion: hawk-dove effects can be seen in some scoring systems and has the potential to affect trainee evaluation on the acgme core competencies. however, a negative correlation to support a hawk-dove scoring pattern was not found in em resident evaluations by attending physicians. this study is limited by being a single center study and utilizing grouped data to preserve resident anonymity. background: all acgme-accredited residency programs are required to provide competency-based education and evaluation. graduating residents must demonstrate competency in six key areas. multiple studies have outlined strategies for evaluating competency, but data regarding residents' self-assessments of these competencies as they progress through training and beyond is scarce. objectives: using data from longitudinal surveys by the american board of emergency medicine, the primary objective of this study was to evaluate if resident self-assessments of performance in required competencies improve over the course of graduate medical training and in the years following. additionally, resident self-assessment of competency in academic medicine was also analyzed. methods: this is a secondary data analysis of data gathered from two rounds of the abem longitudinal study of emergency medicine residents ( - and - ) and three rounds of the abem longitudinal study of emergency physicians ( , , ). 
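the hawk-dove check described above reduces to correlating, per attending, the mean competency rating received from residents with the mean rating given to residents; a sketch with hypothetical long-format files:

```python
# sketch of the hawk-dove analysis: per-attending mean score given versus mean
# score received, correlated per core competency. files and columns hypothetical.
import pandas as pd
from scipy.stats import pearsonr

given = pd.read_csv("attending_scores_of_residents.csv")     # attending_id, cc, score
received = pd.read_csv("resident_scores_of_attendings.csv")  # attending_id, cc, score

for cc, g in given.groupby("cc"):
    mean_given = g.groupby("attending_id")["score"].mean()
    mean_received = received[received["cc"] == cc].groupby("attending_id")["score"].mean()
    merged = pd.concat([mean_given, mean_received], axis=1, keys=["given", "received"]).dropna()
    r, p = pearsonr(merged["given"], merged["received"])
    print(f"{cc}: r = {r:.2f}, p = {p:.3f}")   # a negative r would suggest hawk-dove scoring
```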
in both surveys, physicians were asked to rate a list of items in response to the question, ''what is your current level of competence in each of the following aspects of work in em?'' the rated items were grouped according to the acgme required competencies of patient care, medical knowledge, practice-based learning and improvement, interpersonal and communication skills, and system-based practice. an additional category for academic medicine was also added. results: rankings improved in all categories during residency training. rankings in three of the six categories improved from the weak end of the scale to the strong end of the scale. there is a consistent decline in rankings one year after graduation from residency. the greatest drop is in medical knowledge. mean self-ranking in academic medicine competency is uniformly the lowest ranked category for each year. conclusion: while self-assessment is of uncertain value as an objective assessment, these increasing rankings suggest that emergency medicine residency programs are successful at improving residents' confidence in the required areas. residents do not feel as confident about academic medicine as they do about the acgme required competencies. the uniform decline in rankings the first year after residency is an area worthy of further inquiry. screening medical student rotators from outside institutions improves overall rotation performance shaneen doctor, troy madsen, susan stroud, megan l. fix university of utah, salt lake city, ut background: emergency medicine is a rapidly growing field. many student rotations are limited in their ability to accommodate all students and must limit the number of students they allow per rotation. we hypothesize that pre-screening visiting student rotators will improve overall student performance. objectives: to assess the effect of applicant screening on overall rotation grade and mean end of shift card scores. methods: we initiated a medical student screening process for all visiting students applying to our -week elective em rotation starting in . this consisted of reviewing board scores and requiring a letter of intent. students from our home institution were not screened. all end-of-shift evaluation cards and final rotation grades (honors, high pass, pass, fail) from to were analyzed. we identified two cohorts: home students (control) and visiting students. we compared pre-intervention ( ) ( ) ( ) ( ) ( ) and postintervention ( - ) scores and grades. end of shift performance scores are recorded using a fivepoint scale that assesses indicators such as fund of knowledge, judgment, and follow-through to disposition. mean ranks were compared and p-values were calculated using the armitage test of trend and confirmed using t-tests. results: we identified visiting students ( pre, post) and home students ( pre, post). ( . %) visiting students achieved honors pre-intervention while ( . %) achieved honors post-intervention (p = . ). no significant difference was seen in home student grades: ( . %) received honors pre- and ( . %) received honors post- conclusion: we found that implementation of a screening process for visiting medical students improved overall rotation scores and grades as compared to home students who did not receive screening. screening rotating students may improve the overall quality of applicants and thereby the residency program. 
background: there are many descriptions in the literature of computer-assisted instruction in medical education, but few studies that compare them to traditional teaching methods. objectives: we sought to compare the suturing skills and confidence of students receiving video preparation before a suturing workshop versus a traditional instructional lecture. methods: first- and second-year medical students were randomized into two groups. the control group was given a lecture followed by minutes of suturing time. the video group was provided with an online suturing video at home, no lecture, and given minutes of suturing time during the workshop. both groups were asked to rate their confidence before and after the workshop, and their belief in the workshop's effectiveness. each student was also videotaped suturing a pig's foot after the workshop and graded on a previously validated -point suturing checklist. videos were scored. results: there was no significant difference between the test scores of the lecture group (m = . , sd = . , n = ) and the video group (m = . , sd = . , n = ) using the two-sample independent t-test for equal variances (t( ) = −. , p = . ). there was a statistically significant difference in the proportion of students scoring correctly for only one point: ''curvature of needle followed'': / in the lecture group and / in the video group (chi-square = . , df = , p = . ). students in the video group were found to be . times more likely to have a neutral or favorable feeling of suturing confidence before the workshop (p = . , ci . - . ) using a proportional odds model. no association was detected between group assignment and level of suturing confidence after the workshop (p = . ). there was also no association detected between group assignment and opinion of the suturing workshop (p = . ) using a logistic regression model. among those students who indicated a lack of confidence before training, there was no detected association (p = . ) between group assignment and having improved confidence using a logistic regression model. conclusion: students in the video group and students in the control group achieved similar levels of suturing skill and confidence, and equal belief in the workshop's effectiveness. this study suggests that video instruction could be a reasonable substitute for lectures in procedural education. background: accurate interpretation of the ecg in the emergency department is not only clinically important but also critical to assess medical knowledge competency. with limitations to expansion of formal didactics, educational technology offers an innovative approach to improve the quality of medical education. objectives: the aim of this study was to assess an online multimedia-based ecg training module evaluating st elevation myocardial infarction (stemi) identification among medical students. methods: a convenience sample of fifty-two medical students on their em rotations at an academic medical center with an em residency program was evaluated in a before-after fashion during a -month period. one cardiologist and two ed attending physicians independently validated a standardized exam of ten ecgs: four were normal ecgs, three were classic stemis, and three were subtle stemis. the gold standard for diagnosis was confirmed acute coronary thrombus during cardiac catheterization.
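the proportional-odds comparison of pre-workshop suturing confidence by group assignment, as described above, can be sketched with statsmodels' ordinal model (available in recent statsmodels releases); file and column names are hypothetical:

```python
# sketch of the proportional-odds (ordinal logistic) model for 5-point likert
# confidence versus group assignment (video = 1, lecture = 0). columns hypothetical.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

students = pd.read_csv("suturing_workshop.csv")
y = students["confidence_pre"].astype(
    pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
)
X = students[["video_group"]].astype(float)

model = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)
print(model.summary())
# exp(coefficient on video_group) is the cumulative odds ratio of reporting a
# higher confidence category in the video group.
```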
after evaluating the ecgs, students completed a pre-intervention test wherein they were asked to identify patients who required emergent cardiac catheterization based on the presence or absence of st segment elevation on ecg. students then completed an online interactive multimedia module containing minutes of stemi training based on american heart association/american college of cardiology guidelines on stemi. medical students were asked to complete a post-test of the ecgs after watching online multimedia. objectives: our objective was to quantify the number of pre-verbal pediatric head cts performed at our community hospital that could have been avoided by utilizing the pecarn criteria. methods: we conducted a standardized chart review of all children under the age of who presented to our community hospital and received a head ct between jan st, and dec st, . following recommended guidelines for conducting a chart review, we: ) utilized four blinded chart reviewers, ) provided specific training, ) created a standardized data extraction tool, and ) held periodic meetings to evaluate coding discrepancies. our primary outcome measure was the number of patients who were pecarn negative and received a head ct at our institution. our secondary outcome was to reevaluate the sensitivity and specificity of the pecarn criteria to detect citbi in our cohort. data were analyzed using descriptive statistics and % confidence intervals were calculated around proportions using the modified wald method. results: a total of patients under the age of received a head ct at our institution during the study period. patients were excluded from the final analysis because their head cts were not for trauma. the prevalence of a citbi in our cohort was . % ( % ci . %- . %) ( (dti) measures disruption of axonal integrity on the basis of anisotropic diffusion properties. findings on dti may relate to the injury, as well as the severity of postconcussion syndrome (pcs) following mtbi. objectives: to examine acute anisotropic diffusion properties based on dti in youth with mtbi relative to orthopedic controls and to examine associations between white matter (wm) integrity and pcs symptoms. methods: interim analysis of a prospective casecontrol cohort involving youth ages - years with mtbi and orthopedic controls requiring extremity radiographs. data collected in ed included demographics, clinical information, and pcs symptoms measured by the postconcussion symptom scale. within hours of injury, symptoms were re-assessed and a -direction, diffusion weighted, spin-echo imaging scan was performed on a t philips scanner. dti images were analyzed using tract-based spatial statistics. fractional anisotropy (fa), mean diffusivity (md), axial diffusivity (ad), and radial diffusivity were measured. results: there were no group demographic differences between mtbi cases and controls. presenting symptoms within the mtbi group included gcs = %, loss of consciousness %, amnesia %, post-traumatic seizure %, headache %, vomiting %, dizziness %, and confusion %. pcs symptoms were greater in mtbi cases than in the controls at ed visit ( . ± . vs. . ± . , p < . ) and at the time of scan ( . ± . vs. . ± . , p < . ). the mtbi group displayed decreased fa in cerebellum and increased md and ad in the cerebral wm relative to controls (uncorrected p < . ). increased fa in cerebral wm was also observed in mtbi patients but the group difference was not significant. 
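the modified wald interval used around the citbi proportion above is straightforward to reproduce; the counts below are placeholders, not the study's figures:

```python
# modified wald (agresti-coull style) confidence interval for a proportion;
# the counts are hypothetical placeholders.
from math import sqrt
from scipy.stats import norm

def modified_wald_ci(successes: int, n: int, conf: float = 0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

print(modified_wald_ci(successes=3, n=180))
```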
pcs symptoms at the time of the scan were positively correlated with fa and inversely correlated with rd in extensive cerebral wm areas (p < . , uncorrected). in addition, pcs symptoms in mtbi patients were also found to be inversely correlated with md, ad, and rd in cerebellum (p < . ). conclusion: dti detected axonal damage in youth with mtbi which correlated with pcs symptoms. dti performed acutely after injury may augment detection of injury and help prediction of those with worse outcomes. background: sports-related concussion among professional, collegiate, and more recently high school athletes has received much attention from the media and medical community. to our knowledge, there is a paucity of research in regard to sports-related concussion in younger athletes. objectives: the aim of this study was to evaluate parental knowledge of concussion in young children who participate in recreational tackle football. methods: parents/legal guardians of children aged - years enrolled in recreational tackle football were asked to complete an anonymous questionnaire based on the cdc's heads up: concussion in youth sports quiz. parents were asked about their level of agreement in regard to statements that represent definition, symptoms, and treatment of concussion. results: a total of out of parents voluntarily completed the questionnaire ( % response rate). parent and child demographics are listed in table . ninety four percent of parents believed their child had never suffered a concussion. however, when asked to agree or disagree with statements addressing various aspects of concussion, only % (n = ) could correctly identify all seven statements. most did not identify that a concussion is considered a mild traumatic brain injury and can be achieved from something other than a direct blow to the head. race, sex, and zip code had no significant association with correctly answering statements. education ( . ; p < . ) and number of years the child played ( . ; p < . ) had a small effect. fifty-three percent of parents reported someone had discussed the definition of concussion with them and % the symptoms of concussion. see table for source of information to parents. no parent was able to classify all symptoms listed as correctly related or not related to concussion. however, identification of correct concussion definitions correlated with identification of correct symptoms ( . ; p < . ). conclusion: while most parents had received some education regarding concussion from a health care provider, important misconceptions remain among parents of young athletes regarding the definition, symptoms, and treatment of concussion. this study highlights the need for health care providers to increase educational efforts among parents of young athletes in regard to concussion. figure ). / ( %) of patients with baseline liver dysfunction were (oh)d deficient and / ( %) of deaths were patients who had insufficient levels of (oh)d. there was an inverse association between (oh)d level and tnf-a (p = . ; figure ) and il- (p = . ). background: fever is common in the emergency department (ed), and % of those diagnosed with severe sepsis present with fever. despite data suggesting that fever plays an important role in immunity, human data conflict on the effect of antipyretics on clinical outcomes in critically ill adults. objectives: to determine the effect of ed antipyretic administration on -day in-hospital mortality in patients with severe sepsis. 
methods: single-center, retrospective observational cohort study of febrile severe sepsis patients presenting to an urban academic , -visit ed between june and june . all ed patients meeting the following criteria were included: age ‡ , temperature ‡ . °c, suspected infection, and either systolic blood pressure £ mmhg after a ml/kg fluid bolus or lactate of ‡ . patients were excluded for a history of cirrhosis or acetaminophen allergy. antipyretics were defined as acetaminophen, ibuprofen, or ketorolac. results: one hundred-thirty five ( . %) patients were treated with an antipyretic medication ( . % acetaminophen). intubated patients were less likely to receive antipyretic therapy ( . % vs. . %, p < . ), but the groups were otherwise well matched. patients requiring ed intubation (n = ) had much higher in-hospital mortality ( . % vs. . %, p < . ). patients given an antipyretic in the ed had lower mortality ( . % vs. . %, p < . ). when multivariable logistic regression was used to account for apache-ii, intubation status, and fever magnitude, antipyretic therapy was not associated with mortality (adjusted or . , . - . , p = . ). conclusion: although patients treated with antipyretic therapy had lower -day in-hospital mortality, antipyretic therapy was not independently associated with mortality in multivariable regression analysis. these findings are hypothesis-generating for future clinical trials, as the role of fever control has been largely unexplored in severe sepsis (grant ul rr , nih-ncrr). , and caval index ) . ± . (ci ) . , ) . ) and all were statistically significant. the groups receiving ml/kg and ml/kg had statistically significant changes in caval index; however the ml/kg group had no significant change in mean ivc diameter. one-way anova differences between the means of all groups were not statistically different. conclusion: overall, there were statistically significant differences in mean ivc-us measurements before and after fluid loading, but not between groups. fasting asymptomatic subjects had a wide inter-subject variation in both baseline ivc-us measurements and fluid-related changes. the wide differences within our ml/kg group may limit conclusions regarding proportionality. there were significant differences in performance on ed measures by ownership (p < . ) and region (p = . ). scores on ed process measures were highest at for-profit hospitals ( % above average) and hospitals in the south ( % above average), and lowest at public hospitals ( % below average) and hospitals in the northeast ( % below average). conclusion: there was considerable variation in performance on the ed measures included in the vbp program by hospital ownership and region. ed directors may come under increasing pressure to improve scores in order to reduce potential financial losses under the program. our data provide early information on the types of hospitals with the greatest opportunity for improvement. methods: design/setting -an independent agency mandated by the government collected and analyzed ed patient experience data using a comprehensive, validated multidimensional instrument and a random periodic sampling methodology of all ed patients. a prospective pre-post experimental study design was employed in the eight community and tertiary care hospitals most affected by crowding. two . month study periods were evaluated (pre: / - / / ; post: / / - / / ). 
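the antipyretic analysis above reports an adjusted odds ratio from multivariable logistic regression accounting for apache-ii, intubation status, and fever magnitude. the following is a minimal sketch of that kind of adjustment using statsmodels on a simulated cohort; the column names and data are assumptions, not the study's.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cohort: one row per patient; column names are assumptions.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "died": rng.integers(0, 2, n),          # in-hospital mortality (0/1)
    "antipyretic": rng.integers(0, 2, n),   # received an antipyretic in the ED
    "apache2": rng.normal(20, 6, n),        # severity of illness
    "intubated": rng.integers(0, 2, n),
    "max_temp": rng.normal(39.0, 0.5, n),   # fever magnitude
})

model = smf.logit("died ~ antipyretic + apache2 + intubated + max_temp",
                  data=df).fit(disp=False)
odds_ratios = np.exp(model.params)          # adjusted odds ratios
ci = np.exp(model.conf_int())               # 95% confidence intervals
print(pd.concat([odds_ratios.rename("OR"),
                 ci.rename(columns={0: "2.5%", 1: "97.5%"})], axis=1))
```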
outcomes -the primary outcome was patient perception of wait times and crowding, reported as a composite mean score ( - ) from six survey items, with higher scores representing better ratings. the overall rating of care by ed patients (composite score) and other dimensions of care were collected as secondary outcomes. all outcomes were compared using chi-square and two-tailed student's t-tests. results: a total of surveys were completed in both the pre-ocp and post-ocp study periods, representing a response rate of %.

we compared in-patient mortality from ami for patients who lived in a community within either . miles or miles of a closure but did not need to travel farther to the nearest ed with those who did not. we used patient-level data from the california office of statewide health planning and development (oshpd) patient discharge database, and locations of patient residence and hospitals were geo-coded to determine any changes in distance to the nearest ed. we applied a generalized linear mixed effects model framework to estimate a patient's likelihood of dying in the hospital from ami as a function of being affected by a neighborhood closure event.

background: fragmentation of care has been recognized as a problem in the us health care system. however, little is known about ed utilization after hospitalization, a potential marker of poor outpatient care coordination after discharge, particularly for common inpatient-based procedures. objectives: to determine the frequency and variability in ed visits after common inpatient procedures, how often they result in readmission, and related payments. methods: using national medicare data for - , we examined ed visits within days of hospital discharge after six common inpatient procedures: percutaneous coronary intervention, coronary artery bypass grafting (cabg), elective abdominal aortic aneurysm repair, back surgery, hip fracture repair, and colectomy. we categorized hospitals into risk-adjusted quintiles based on the frequency of ed visits after the index hospitalization. we report visits by primary diagnosis icd- codes and rates of readmission. we also assessed payments related to these ed visits. results: overall, the highest quintile of hospitals had -day ed visit rates that ranged from a low of . % with an associated . % readmission rate (back surgery) to a high of . % with an associated . % readmission rate (cabg). the greatest variability, more than -fold, was found among patients undergoing colectomy, for which . % of patients at the worst-performing hospitals experienced an ed visit within days compared with . % at the best-performing hospitals. average total payments for the -day window from initial discharge across all surgical cohorts varied from $ , for patients discharged without a subsequent ed visit; $ , for those experiencing an ed visit(s); $ , for those readmitted through the ed; and $ , for those readmitted from another source. if all patients who did not require readmission also did not incur an ed visit within the -day window, this would represent a potential cost savings of $ million. conclusion: among elderly medicare recipients there was significant variability between hospitals in -day ed visits after six common inpatient procedures. the ed visit may be a marker of poor care coordination in the immediate discharge period. this presents an opportunity to improve post-procedure outpatient care coordination, which may save costs related to preventable ed visits and subsequent readmissions.
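the medicare analysis above sorts hospitals into risk-adjusted quintiles by their -day post-discharge ed-visit rates and then summarizes readmissions within each quintile. a minimal pandas sketch of that grouping step on made-up hospital-level rates (the risk-adjustment model itself is not reproduced):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hospitals = pd.DataFrame({
    "hospital_id": np.arange(200),
    "ed_visit_rate": rng.beta(2, 10, 200),   # hypothetical risk-adjusted 30-day ED-visit rates
    "readmit_rate": rng.beta(2, 15, 200),    # hypothetical readmission rates
})

# Quintile 1 = lowest ED-visit rate, quintile 5 = highest.
hospitals["quintile"] = pd.qcut(hospitals["ed_visit_rate"], 5, labels=[1, 2, 3, 4, 5])

# Summarize mean rates within each quintile, mirroring the style of the reported results.
print(hospitals.groupby("quintile", observed=True)[["ed_visit_rate", "readmit_rate"]].mean())
```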
objectives: we sought to assess the effect of pharmacist medication review on ed patient care, in particular time from physician order to medication administration for the patient (order-to-med time). methods: we conducted a multi-center, before-after study in two eds (urban academic teaching hospital and suburban community hospital, combined census of , ) after implementation of the electronic prospective pharmacy review system (prs). the system allowed a pharmacist to review all ed medication orders electronically at the time of physician order and either approve or alter the order. we studied a -month time period before implementation of the system (pre-prs, / / - / / ) and after implementation (post-prs, / / - / / ). we collected data on all ed medication orders including dose, route, class, pharmacist review action, time of physician order, and time of medication administration. differences in order-to-medication between the pre-and post-prs study periods were compared using a results: ed metrics that were significantly associated with lbtcs varied across ed patient-volume categories (table) . for eds seeing less than k patients annually, the percentage of ems arrivals admitted to the hospital and ed square footage were both weakly associated with lbtcs (p = . ). for eds seeing at least k- k patients, median ed length of stay (los), percent of patients admitted to hospital through the ed, percent of ems arrivals admitted to hospital, and percent of pediatric patients were all positively associated, while percent of patients admitted to the hospital was negatively associated with lbtcs. for eds seeing k- k, median los and percent of x-rays performed were positively associated, while percent of ekgs performed was negatively associated with lbtcs. for eds seeing k- k, percent of patients admitted to the hospital through the ed was negatively associated and percent of ekgs performed was positively associated with lbtcs. for eds with volume greater than k, none of the selected variables were associated with lbtc. conclusion: ed factors that help explain high lbtc rates differ depending on the size of an ed. interventions attempting to improve lbtc rates by modifying ed structure or process will need to consider baseline ed volume as a potential moderating influence. objectives: our study sought to compare bacterial growth of samples taken from surfaces after use of a common approved quat compound and a virtually non-toxic, commercially available solution containing elemental silver ( . %), hydrogen peroxide ( %), and peroxyacetic acid ( %) (shp) in a working ed. we hypothesized that, based on controlled laboratory data available, shp compound would be more effective on surfaces in an active urban ed. methods: we cleaned and then sampled three types of surfaces in the ed (suture cart, wooden railing, and the floor) during midday hours one minute after application of tap water, quat, and shp and then again at hours without additional cleaning. conventional environmental surface surveillance rodac media plates were used for growth assessment. images of bacterial growth were quantified at and hours. standard cleaning procedures by hospital staff were maintained per usual. results: shp was superior to control and quat one minute after application on all three surfaces. quat and water had x and x more bacterial growth than the surface cleaned with shp, respectively. 
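the pharmacist-review study above compares order-to-med times before and after prs implementation; the sentence naming the statistical test is truncated in the text, so the sketch below simply illustrates one common choice for skewed time intervals, a mann-whitney u test on simulated minutes, and is not the study's actual analysis.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
# Hypothetical order-to-med times in minutes (right-skewed), pre- and post-implementation.
pre_prs = rng.gamma(shape=2.0, scale=15.0, size=500)
post_prs = rng.gamma(shape=2.0, scale=17.0, size=500)

stat, p_value = mannwhitneyu(pre_prs, post_prs, alternative="two-sided")
print(f"median pre={np.median(pre_prs):.1f} min, "
      f"median post={np.median(post_prs):.1f} min, p={p_value:.3f}")
```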
hours later, the shp area produced fewer colonies sampled from the wooden railing: x more bacteria for quat, and x for water when compared to shp. h cultures from the cart and floor had confluent growth and could not be quantified. conclusion: shp outperforms quat in sterilizing surfaces after one minute application. shp may be a superior agent as a non-toxic, non-corrosive, and effective agent for surfaces in the demanding ed setting. further studies should examine sporidical and virucidal properties in a similar environment. objectives: evaluate the effect on patient satisfaction of increasing waiting room times and physician evaluation times. methods: emergency department flow metrics were collected on a daily basis as well as average daily patient satisfaction scores. the data were from july through february , in a , census urban hospital. the data were divided into equal intervals. the arrival to room time was divided by minute intervals up to minutes with the last group being greater than minutes. the physician evaluation times were divided into minute intervals, up to , the last group greater than with days in the group. data were analyzed using means and standard deviations, and well as anova for comparison between groups. results: the overall satisfaction score for the outpatient emergency visit was higher when the patient was in a room within minutes of arrival ( . , std deviation . ), analysis of variation between the groups had a p = . , for the means of each interval (see table ). the total satisfaction with the visit as well as satisfaction with the provider dropped when the evaluation extended over minutes, but was not statistically significant on anova analysis (see table for means). conclusion: once a patient's time in the waiting room extends beyond minutes, you have lost a significant opportunity for patient satisfaction; once they have been in the waiting room for over minutes, you are also much more likely to receive a poor score. physician evaluation time scores are much more consistent but as longer evaluation times occurred beyond total of minutes we started to see a trend downward in the satisfaction score. results: in all three eds, pain medication rates (both in ed and rx) varied significantly by clinical factors including location of pain, discharge diagnosis, pain level, and acuity. we observed little to no variation in pain medication rates by patient factors such as age, sex, race, insurance, or prior ed visits. the table displays key pain management practices by site and provider. after adjusting for patient and clinical characteristics, significant differences in pain medication rates remained by provider and site (see figure) . conclusion: within this health system, the approach to pain management by both providers and sites is not standardized. investigation of the potential effect of this variability on patient outcomes is warranted. results: all measures showed significant differences, p < . . average pts/h decreased post-cpoe and did not recover post transitional period, . ± . vs . ± . , p < . . rvu/h also decreased post-cpoe and did not recover post transitional period, . ± . vs . ± . and . ± . , p < . . charges/h also decreased after cpoe implementation and did not recover after system optimization. there was a sustained significant decrease in charges/h of . % ± . % post cpoe and . % ± . % post optimization, p < . . sub-group analysis for each provider group was also evaluated and showed variability for different providers. 
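the satisfaction analysis above compares mean daily scores across arrival-to-room time intervals with anova. a minimal scipy sketch on simulated scores for three hypothetical intervals:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)
# Hypothetical daily satisfaction scores grouped by arrival-to-room interval.
under_30 = rng.normal(88, 5, 60)
mid_wait = rng.normal(85, 5, 60)
long_wait = rng.normal(80, 6, 60)

f_stat, p_value = f_oneway(under_30, mid_wait, long_wait)
print(f"F={f_stat:.2f}, p={p_value:.4f}")
```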
conclusion: there was a significant decrease in all productivity metrics four months after the implementation of cpoe. the system did undergo optimization initiated by providers with customization for ease and speed of use. however, productivity measurements did not recover after these changes were implemented. these data show that with the implementation of a cpoe system there is a decrease in productivity that continues even after a transition period and system customization. background: procedural competency is a key component of emergency medicine residency training. residents are required to log procedures to document quantity of procedures and identify potential weaknesses in their training. as emergency medicine evolves, it is likely that the type and number of procedures change over time. also, exposure to certain rare procedures in residency is not guaranteed. objectives: we seek to delineate trends in type and volume of core em procedures over a decade of emergency medicine residents graduating from an accredited four-year training program. methods: deidentified procedure logs from - were analyzed to assess trends in type and quantity of procedures. procedure logs were self-reported by individual residents on a continuous basis during training onto a computer program. average numbers of procedures per resident in each graduating class were noted. statistical analysis was performed using spss and includes a simple linear regression to evaluate for significant changes in number of procedures over time and an independent samples two-tailed t-test of procedures performed before and after the required resident duty hours change. results: a total of procedure logs were analyzed and the frequency of different procedures was evaluated. a significant increase was seen in one procedure, the venous cutdown. significant decreases were seen in procedures including key procedures such as central venous catheters, tube thoracostomy, and procedural sedation. the frequency of five high-stakes/ resuscitative procedures, including thoracotomy and cricothyroidotomy, remained steady but very low (< per resident over years). of the remaining procedures, showed a trend toward decreased frequency, while only increased. conclusion: over the past years, em residents in our program have recorded significantly fewer opportunities to perform most procedures. certain procedures in our emergency medicine training program have remained stable but uncommon over the course of nearly a decade. to ensure competency in uncommon procedures, innovative ways to expose residents to these potentially life saving skills must be considered. these may include practice on high-fidelity simulators, increased exposure to procedures on patients during residency (possibly on off-service rotations), or practice in cadaver and animal labs. objectives: to study the effectiveness of a unique educational intervention using didactic and hands-on training in usgpiv. we hypothesized that senior medical students would improve performance and confidence with usgpiv after the simulation training. methods: fourth year medical students were enrolled in an experimental, prospective, before and after study conducted at a university medical school simulation center. baseline skills in participant's usgpiv on simulation vascular phantoms were graded by ultrasound expert faculty using standardized checklists. 
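the procedure-log analysis above fits a simple linear regression of per-class average procedure counts against graduation year and compares the periods before and after the duty-hours change with an independent-samples t-test. a sketch of both steps on made-up per-class averages (the counts, years, and split point are illustrative):

```python
import numpy as np
from scipy.stats import linregress, ttest_ind

years = np.arange(2002, 2012)
# Hypothetical mean number of central venous catheters per resident in each graduating class.
avg_cvc = np.array([38, 36, 35, 33, 32, 30, 29, 27, 26, 25], dtype=float)

trend = linregress(years, avg_cvc)
print(f"slope={trend.slope:.2f} per year, p={trend.pvalue:.4f}")

# Compare classes graduating before vs. after the duty-hours change (split is illustrative).
before, after = avg_cvc[:5], avg_cvc[5:]
t_stat, p_value = ttest_ind(before, after)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```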
the primary outcome was time to cannulation, and secondary outcomes were ability to successfully cannulate, number of needle attempts, and needle-tip visualization. subjects then observed a -minute presentation on correct performance of usgpiv followed by a -minute hands-on practical session using the vascular simulators with a : to : ultrasound instructor to student ratio. an expert blinded to the participant's initial performance graded post-educational intervention usgpiv ability. pre-and post-intervention surveys were obtained to evaluate usgpiv confidence, previous experience with ultrasound, peripheral iv access, usg-piv, and satisfaction with the educational format. objectives: this study examines the grade distribution of resident evaluations when the identity of the evaluator was anonymous as compared to when the identity of the evaluator was known to the resident. we hypothesize that there will be no change in the grades assigned to residents. methods: we retrospectively reviewed all faculty evaluations of residents and grades assigned from july , through november , . prior to july , the identity of the faculty evaluators was anonymous, while after this date, the identity of the faculty evaluators was made known to the residents. throughout this time period, residents were graded on a five-point scale. each resident evaluation included grades in the six acgme core competencies as well as in select other abilities. specific abilities evaluated varied over the dates analyzed. evaluations of residents were assigned to two groups, based on whether the evaluator was anonymous or made known to the resident. grades were compared between the two groups. results: a total of , grades were assigned in the anonymous group, with an average grade of . ( ci . , . ). a total of , grades were assigned in the known group with an average grade of . ( ci . , . ). specific attention was paid to assignment of unsatisfactory grades ( or on the five-point scale). the anonymous group assigned grades in this category, comprising . % of all grades assigned. the known group assigned grades in this category, comprising . % of all grades assigned. unsatisfactory grades were assigned by the anonymous group . % ( ci . , . ) more often. additionally, . % ( ci . , . ) fewer exceptional grades ( or on the five-point scale) were assigned by the anonymous group. conclusion: the average grade assigned was closer to average ( on a five-point scale) when the identity of the evaluator was made known to the residents. additionally, fewer unsatisfactory and exceptional grades were assigned in this group. this decrease of both unsatisfactory and exceptional grades may make it more difficult for program directors to effectively identify struggling and strong residents respectively. testing to improve knowledge retention from traditional didactic presentations: a pilot study david saloum, amish aghera, brian gillett maimonides medical center, brooklyn, ny background: the acgme requires an average of at least hours of planned educational experiences each week for em residents, which traditionally consists of formal lecture based instruction. however, retention by adult learners is limited when presented material in a lecture format. more effective methods such as small group sessions, simulation, and other active learning modalities are time-and resource-intensive and therefore not practical as a primary method of instruction. thus, the traditional lecture format remains heavily relied upon. 
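the evaluation study above compares how often unsatisfactory grades were assigned when evaluators were anonymous versus identified. a minimal two-proportion comparison on hypothetical counts (not the study's counts):

```python
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Hypothetical counts: unsatisfactory grades out of all grades assigned in each group.
unsat = [120, 45]          # anonymous, known
totals = [10000, 9000]

z_stat, p_value = proportions_ztest(count=unsat, nobs=totals)
ci_anon = proportion_confint(unsat[0], totals[0], method="wilson")
ci_known = proportion_confint(unsat[1], totals[1], method="wilson")
print(f"z={z_stat:.2f}, p={p_value:.4f}")
print(f"anonymous rate CI={ci_anon}, known rate CI={ci_known}")
```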
efficient strategies to improve the effectiveness of lectures are needed. testing utilized as a learning tool to force immediate recall of lecture material is an example of such a strategy. objectives: to evaluate the effect of immediate postlecture short answer quizzes on em residents' retention of lecture content. methods: in this prospective randomized controlled study, em residents from a community based -year training program were randomized into two groups. block randomization provided a similar distribution of postgraduate year training levels and performance on both us-mle and in-training examinations between the two groups. each group received two identical -minute lectures on ecg interpretation and aortic disease. one group of residents completed a five-question short answer quiz immediately following each lecture (n = ), while the other group received the lectures without subsequent quizzes (n = ). the quizzes were not scored or reviewed with the residents. two weeks later, retention was assessed by testing both groups with a -question multiple choice test (mct) derived in equal part from each lecture. mean and median test results were then compared between groups. statistical significance was determined using a paired t-test of median test scores from each group. results: residents who received immediate post-lecture quizzes demonstrated significantly higher mct scores (mean = %, median %, n = ) compared to those receiving lectures alone (mean = %, median = %, n = ); p = . . conclusion: short answer testing immediately after a traditional didactic lecture improves knowledge retention at a -week interval. limitations of the study are that it is a single center study and long term retention was not assessed. background: the task of educating the next generation of physicians is steadily becoming more difficult with the inherent obstacles that exist for faculty educators and the work hour restrictions that students must adhere to. the obstacles make developing curricula that not only cover important topics but also do so in a fashion that helps support and reinforce the clinical experiences very difficult. several areas of medical education are using more asynchronous techniques and self-directed online educational modules to overcome these obstacles. objectives: the aim of this study was to demonstrate that educational information pertaining to core pediatric emergency medicine topics could be as effectively disseminated to medical students via self-directed online educational modules as it could through traditional didactic lectures. methods: this was a prospective study conducted from august , through december , . students participating in the emergency medicine rotation at carolinas medical center were enrolled and received education in a total of eight core concepts. the students were divided into two groups which changed on a monthly basis. group was taught four concepts via self-directed online modules and four traditional didactic lectures. group was taught the same core concepts, but in opposite fashion to group . each student was given a pre-test, post-test, and survey at the conclusion of the rotation. results: a total of students participated in the study. students, regardless of which group assigned, performed similarly on the pre-test, with no statistical difference among scores. when looking at the summative total scores between online and traditional didactic lectures, there was a trend towards significance for more improvement among those taught online. 
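the quiz-retention trial above uses block randomization to keep postgraduate-year levels balanced across the two lecture groups. a small sketch of stratified permuted-block randomization; the group labels, block size of four, and resident ids are illustrative, not taken from the study protocol.

```python
import random

def block_randomize(ids, block_size=4, labels=("quiz", "no_quiz"), seed=42):
    """Assign ids to two arms using permuted blocks (equal allocation within each block)."""
    rng = random.Random(seed)
    assignments = {}
    for start in range(0, len(ids), block_size):
        block = ids[start:start + block_size]
        arms = list(labels) * (block_size // len(labels))
        rng.shuffle(arms)
        assignments.update(dict(zip(block, arms)))
    return assignments

# Randomize within each postgraduate-year stratum so training level stays balanced.
strata = {"pgy1": ["r01", "r02", "r03", "r04"], "pgy2": ["r05", "r06", "r07", "r08"]}
for stratum, residents in strata.items():
    print(stratum, block_randomize(residents))
```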
the student's assessment of the online modules showed that the majority either felt neutral or preferred the online method. the majority thought the depth and length of the modules were perfect. most students thought having access to the online modules was valuable and all but one stated that they would use them again. conclusion: this study demonstrates that self-directed, online educational modules are able to convey important concepts in emergency medicine similar to traditional didactics. it is an effective learning technique that offers several advantages to both the educator and student. background: critical access hospitals (cah) provide crucial emergency care to rural populations that would otherwise be without ready access to health care. data show that many cah do not meet standard adult quality metrics. adults treated at cah often have inferior outcomes to comparable patients cared for at other community-based emergency departments (eds). similar data do not exist for pediatric patients. objectives: as part of a pilot project to improve pediatric emergency care at cah, we sought to determine whether these institutions stock the equipment and medications necessary to treat any ill or injured child who presents to the ed. methods: five north carolina cah volunteered to participate in an intensive educational program targeting pediatric emergency care. at the initial site visit to each hospital, an investigator, in conjunction with the ed nurse manager, completed a -item checklist of commonly required ed equipment and medications based on the acep ''guidelines for care of children in the emergency department''. the list was categorized into monitoring and respiratory equipment, vascular access supplies, fracture and trauma management devices, and specialized kits. if available, adult and pediatric sizes were listed. only hospitals stocking appropriate pediatric sizes of an item were counted as having that item. the pharmaceutical supply list included antibiotics, antidotes, antiemetics, antiepileptics, intubation and respiratory medications, iv fluids, and miscellaneous drugs not otherwise categorized. results: overall, the hospitals reported having % of the items listed (range - %). the two greatest deficiencies were fracture devices (range - %), with no hospital stocking infant-sized cervical collars, and antidotes, with no hospital stocking pralidoxime, / hospitals stocking fomepizole, and / hospitals stocking pyridoxine and methylene blue. only one of the five institutions had access to prostaglandin e. the hospitals stated cost and rarity of use as the reason for not stocking these medications. conclusion: the ability of cah to care for pediatric patients does not appear to be hampered by a lack of equipment. ready access to infrequently used, but potentially lifesaving, medications is a concern. tertiary care centers preparing to accept these patients should be aware of these potential limitations as transport decisions are made. background: while incision and drainage (i&d) alone has been the mainstay of management of uncomplicated abscesses for decades, some advocate for adjunct antibiotic use, arguing that available trials are underpowered and that antibiotics reduce treatment failures and recurrence. objectives: to investigate the role of antibiotics in addition to i&d in reducing treatment failure as compared to management with i&d alone. 
methods: we performed a search using the medline, embase, web of knowledge, and google scholar databases (with a medical librarian) to include trials and observational studies analyzing the effect of antibiotics in human subjects with skin and soft-tissue abscesses. two investigators independently reviewed all the records. we performed three overlapping meta-analyses: . only randomized trials comparing antibiotics to placebo on improvement of the abscess during standard follow-up. . trials and observational studies comparing appropriate antibiotics to placebo, no antibiotics, or inappropriate antibiotics (as gauged by wound culture) on improvement during standard follow-up. . only trials, but with the outcome broadened to include recurrence or new lesions during a longer follow-up period as treatment failure. we report pooled risk ratios (rr) using a fixed-effects model for our point estimates with shore-adjusted % confidence intervals (ci). results: we screened , records, of which studies fit inclusion criteria, of which were meta-analyzed ( trials, observational studies) because they reported results that could be pooled. of the studies, enrolled subjects from the ed, from a soft-tissue infection clinic, and from a general hospital without definition of enrollment site. five studies enrolled primarily adults, pediatrics, and without specification of ages. after pooling results for all randomized trials only, the rr = . ( % ci: . - . ). when the exposure was ''appropriate'' antibiotics (using trials and observational studies), the pooled rr = . ( % ci: . - . ). when we broadened our treatment failure criteria to include recurrence or new lesions at longer lengths of follow-up (trials only), we noted a rr = . ( % ci: . - . ). conclusion: based on the available literature pooled for this analysis, there is no evidence to suggest any benefit from antibiotics in addition to i&d in the treatment of skin and soft tissue abscesses. (originally submitted as a ''late-breaker.'')

primary objectives: to compare wound healing and recurrence rates after primary vs. secondary closure of drained abscesses. we hypothesized that the percentage of drained ed abscesses completely healed at days would be higher after primary closure. methods: this randomized clinical trial was undertaken in two academic emergency departments. immunocompetent adult patients with simple, localized cutaneous abscesses were randomly assigned to i&d followed by primary or secondary closure. randomization was balanced by center, with an allocation sequence based on a block size of four, generated by a computer random number generator. the primary outcome was the percentage of healed wounds seven days after drainage. a sample of patients had % power to detect an absolute difference of % in healing rates assuming a baseline rate of %. all analyses were by intention to treat. results: twenty-seven patients were allocated to primary and to secondary closure, of whom and , respectively, were followed to study completion. healing rates at seven days were similar between the primary and secondary closure groups.

we compared consecutive patients each scanned on the or slice ccta in - . measures and outcomes-data were prospectively collected using standardized data collection forms required prior to performing ccta. the main outcomes were cumulative radiation doses and volumes of intravenous contrast. data analysis-groups were compared with t-tests, mann-whitney u tests, and chi-square tests.
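the abscess meta-analysis above pools risk ratios under a fixed-effects model. the sketch below shows standard inverse-variance pooling of log risk ratios on made-up 2x2 counts; the ''shore-adjusted'' intervals mentioned in the methods are not reproduced, so the interval here is the plain fixed-effects one.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-study 2x2 counts: (events_tx, n_tx, events_ctrl, n_ctrl).
studies = [(8, 100, 10, 100), (5, 80, 9, 85), (12, 150, 14, 145)]

log_rr, weights = [], []
for a, n1, c, n2 in studies:
    rr = (a / n1) / (c / n2)
    var = 1 / a - 1 / n1 + 1 / c - 1 / n2     # variance of log RR
    log_rr.append(np.log(rr))
    weights.append(1 / var)

log_rr, weights = np.array(log_rr), np.array(weights)
pooled = np.sum(weights * log_rr) / np.sum(weights)   # inverse-variance weighted mean
se = 1 / np.sqrt(np.sum(weights))
z = norm.ppf(0.975)
print(f"pooled RR={np.exp(pooled):.2f} "
      f"(95% CI {np.exp(pooled - z * se):.2f}-{np.exp(pooled + z * se):.2f})")
```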
results: the mean age of patients imaged with the and scanners were (sd ) vs. ( ) (p = . ). male:female ratios were also similar ( : vs. : respectively, p = . ). both mean (p < . ) and median (p = . ) effective radiation dose were significantly lower with the ( . and msv) vs. the -slice scanner ( . and msv) respectively. prospective gating was successful in % of the scans and only in % of the scans (p < . ). mean iv contrast volumes were also lower for the vs. the -slice scanner ( ± vs. ± ml; p < . ). the % non-diagnostic scans was similarly low in both scanners ( % each). there were no differences in use of beta-blockers or nitrates. conclusion: when compared with the -slice scanner, the -slice scanner reduces the effective radiation doses and iv contrast volumes in ed patients with cp undergoing ccta. need for beta-blockers and nitrates was similar and both scanners achieved excellent diagnostic image quality. background: a few studies have demonstrated that bedside ultrasound measurement of inferior vena cava to aorta (ivc-to-ao) ratio is associated with the level of dehydration in pediatric patients and a proposed cutoff of . has been suggested, below which a patient is considered dehydrated. objectives: we sought to externally validate the ability of ivc-to-ao ratio to discriminate dehydration and the proposed cutoff of . in an urban pediatric emergency department (ed). methods: this was a prospective observational study at an urban pediatric ed. we included patients aged to months with clinical suspicion of dehydration by the ed physician and an equal number of control patients with no clinical suspicion of dehydration. we excluded children who were hemodynamically unstable, had chronic malnutrition or failure to thrive, open abdominal wounds, or were unable to provide patient or parental consent. a validated clinical dehydration score (cds) (range to ) was used to measure initial dehydration status. an experienced sonographer blinded to the cds and not involved in the patient's care measured the ivc-to-ao ratio on the patient prior to any hydration. cds was collapsed into a binary outcome of no dehydration or any level of dehydration ( or higher). the ability of ivc-to-ao ratio to discriminate dehydration was assessed using area under the receiver operating characteristic curve (auc) and the sensitivity and specificity of ivc-to-ao ratio was calculated for three cutoffs ( . , . , . ). calculation of auc was repeated after adjusting for age and sex. results: patients were enrolled, ( %) of whom had a cds of or higher. median age was (interquartile range - ) months, and ( %) were female. the ivcto-ao ratio showed an unadjusted auc of . ( % ci . - . ) and adjusted auc of . ( % ci . - . ). for a cutoff of . sensitivity was % ( % ci %- %) and specificity % ( % ci %- %); for a cutoff of . sensitivity was % ( % ci %- %) and specificity % ( % ci %- %); for a cutoff of . sensitivity was % ( % ci %- %) and specificity % ( % ci %- %). conclusion: the ability of the ivc-to-ao ratio to discriminate dehydration in young pediatric ed patients was modest and the cutoff of . was neither sensitive nor specific. background: while early cardiac computed tomographic angiography (ccta) could be more effective to manage emergency department (ed) patients with acute chest pain and intermediate (> %) risk of acute coronary syndrome (acs) than current management strategies, it also could result in increased testing, cost, and radiation exposure. 
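the dehydration study above reports the auc of the ivc-to-ao ratio and its sensitivity and specificity at several cutoffs. a minimal sketch of those computations on simulated ratios, treating lower ratios as more suggestive of dehydration; the cutoff values and data are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
# Hypothetical data: 1 = dehydrated (CDS at or above threshold), 0 = not dehydrated.
dehydrated = np.concatenate([np.ones(60), np.zeros(60)]).astype(int)
ratio = np.concatenate([rng.normal(0.75, 0.15, 60), rng.normal(0.95, 0.15, 60)])

# Lower ratios indicate dehydration, so score with the negative ratio.
auc = roc_auc_score(dehydrated, -ratio)
print(f"AUC={auc:.2f}")

for cutoff in (0.7, 0.8, 0.9):
    test_pos = ratio < cutoff
    sens = np.mean(test_pos[dehydrated == 1])
    spec = np.mean(~test_pos[dehydrated == 0])
    print(f"cutoff {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```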
objectives: the purpose of the study was to determine whether incorporation of ccta early in the ed evaluation process leads to more efficient management and earlier discharge than usual care in patients with acute chest pain at intermediate risk for acs. methods: randomized comparative effectiveness trial enrolling patients between - years of age without known cad, presenting to the ed with chest pain but without ischemic ecg changes or elevated initial troponin and require further risk stratification for decision making, at nine us sites. patients are being randomized to either ccta as the first diagnostic test or to usual care, which could include no testing or functional testing such as exercise ecg, stress spect, and stress echo following serial biomarkers. test results were provided to physicians but management in neither arm was driven by a study protocol. data on time, diagnostic testing, and cost of index hospitalization, and the following days are being collected. the primary endpoint is length of hospital stay (los). the trial is powered to allow for detection of a difference in los of . hours between competing strategies with % power assuming that % of projected los values are true. secondary endpoints are cumulative radiation exposure, and cost of competing strategies. tertiary endpoints are institutional, caregiver, and patient characteristics associated with primary and secondary outcomes. rate of missed acs within days is the safety endpoint. results: as of november st, , of patients have been enrolled (mean age: ± , . % female, acs rate . %). the anticipated completion of the last patient visit is / / and the database will be locked in early march . we will present the results of the primary, secondary, and some tertiary endpoints for the entire cohort. conclusion: romicat ii will provide rigorous data on whether incorporation of ccta early in the ed evaluation process leads to more efficient management and triage than usual care in patients with acute chest pain at intermediate risk for acs. (originally submitted as a ''late-breaker.'') meta background: many studies have documented higher rates of advanced radiography utilization across u.s. emergency departments (eds) in recent years, with an associated decrease in diagnostic yield (positive tests / total tests). provider-to-provider variability in diagnostic yield has not been well studied, nor have the factors that may explain these differences in clinical practice. objectives: we assessed the physician-level predictors of diagnostic yield using advanced radiography to diagnose pulmonary embolus (pe) in the ed, including demographics and d-dimer ordering rates. methods: we conducted a retrospective chart review of all ed patients who had a ct chest or v/q scan ordered to rule out pe from / to / in four hospitals in the medstar health system. attending physicians were included in the study if they had ordered or more scans over the study period. the result of each ct and vq scan was recorded as positive, negative, or indeterminate, and the identity of the ordering physician was also recorded. data on provider sex, residency type (em or other), and year of residency completion were collected. each provider's positive diagnostic yield was calculated, and logistic regression analysis was done to assess correlation between positive scans and provider characteristics. results: during the study period, , scans ( , cts and , v/qs) were ordered by providers. the physicians were an average of . 
years from residency, % were female, and % were em-trained. diagnostic yield varied significantly among physicians (p < . ), and ranged from % to %. the median diagnostic yield was . % (iqr . %- . %). the use of d-dimer by provider also varied significantly, from % to % (p < . ). the odds of a positive test were significantly lower among providers less than years out of residency graduation (or . , ci . - . ) after controlling for provider sex, type of residency training, d-dimer use, and total number of scans ordered. conclusion: we found significant provider variability in diagnostic yield for pe and use of d-dimer in this study population, with % of providers having a diagnostic yield less than or equal to . %. providers who were more recently graduated from residency appear to have a lower diagnostic yield, suggesting a more conservative approach in this group.

background: the literature reports that anticoagulation increases the risk of mortality in patients presenting to emergency departments (ed) with head trauma (ht). it has been suggested that such patients should be treated in a protocolized fashion, including ct within minutes and anticipatory preparation of ffp before ct results are available. there are significant logistical and financial implications associated with implementation of such a protocol. objectives: our primary objective was to determine the effect of anticoagulant therapy on the risk of intracranial hemorrhage (ich) in elderly patients presenting to our urban community hospital following blunt head injury. methods: this was a retrospective chart review study of ht patients > years of age presenting to our ed over a -month period. charts reviewed were identified using our electronic medical record via chief complaints and icd- codes and cross-referencing with written ct logs. research assistants underwent review of at least % of their contributing data to validate reliability. we collected information regarding use of warfarin, clopidogrel, and aspirin and ct findings of ich. using univariate logistic regression, we calculated odds ratios (or) for ich with % ci. results: we identified elderly ht patients. the mean age of our population was , ( . %) admitted to using anticoagulant therapy, and % were on antiplatelet drugs. ( . %) of the cohort had ich, patients required neurosurgical intervention, and had transfusion of blood products. of the non-anticoagulated patients, ( . %) were found to have ich, half of those ( ).

micrornas (including mir- a, mir- , and mir- ) were measured using real-time quantitative pcr from serum drawn at enrollment. il- , il- , and tnf-a were measured using a bio-plex suspension system. baseline characteristics, il- , il- , tnf-a, and micrornas were compared using one-way anova or fisher's exact test, as appropriate. correlations between mirnas and sofa scores, il- , il- , and tnf-a were determined using spearman's rank correlation. a logistic regression model was constructed using in-hospital mortality as the dependent variable and mirnas as the independent variables of interest. bonferroni adjustments were made for multiple comparisons. results: of patients, were controls, had sepsis, and had septic shock. we found no difference in serum mir- a or mir- between cohorts, and found no association between these micrornas and either inflammatory markers or sofa score. mir- demonstrated a significant correlation with sofa score (q = . , p = . ) and il- (q = . , p = . ), but not il- or tnf-a (p = . , p = . ).
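the microrna analysis above uses spearman's rank correlation between mirna levels and sofa scores or cytokines, with bonferroni adjustment for multiple comparisons. a small sketch of that workflow on simulated values (variable names and data are assumptions):

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(5)
n = 80
mirna = rng.normal(size=n)                      # hypothetical normalized miRNA level
targets = {
    "sofa": 2 * mirna + rng.normal(size=n),     # built to correlate, for illustration
    "il6": rng.normal(size=n),
    "tnf_a": rng.normal(size=n),
}

raw_p = {}
for name, values in targets.items():
    rho, p = spearmanr(mirna, values)
    raw_p[name] = p
    print(f"{name}: rho={rho:.2f}, unadjusted p={p:.4f}")

# Bonferroni adjustment across the three comparisons.
reject, adj_p, _, _ = multipletests(list(raw_p.values()), method="bonferroni")
print(dict(zip(raw_p, adj_p.round(4))))
```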
logistic regression demonstrated mir- to be associated with mortality, even after adjusting for sofa score (p = . ). conclusion: mir- a or mir- failed to demonstrate any diagnostic or prognostic ability in this cohort. mir- was associated with inflammation, increasing severity of illness, and mortality, and may represent a novel prognostic marker for diagnosis and prognosis of sepsis. objectives: to examine the association between emergency physician recognition of sirs and sepsis and subsequent treatment of septic patients. methods: a retrospective cohort study of all-age patient medical records with positive blood cultures drawn in the emergency department from / - / at a level i trauma center. patient parameters were reviewed including vital signs, mental status, imaging, and laboratory data. criteria for sirs, sepsis, severe sepsis, and septic shock were applied according to established guidelines for pediatrics and adults. these data were compared to physician differential diagnosis documentation. the mann-whitney test was used to compare time to antibiotic administration and total volume of fluid resuscitation between two groups of patients: those with recognized sepsis and those with unrecognized sepsis. results: sirs criteria were present in / reviewed cases. sepsis criteria were identified in / cases and considered in the differential diagnosis in / septic patients. severe sepsis was present in / cases and septic shock was present in / cases. the sepsis -hour resuscitation bundle was completed in the emergency department in cases of severe sepsis or septic shock. patients who met sepsis criteria and were recognized by the ed physician had a median time to antibiotics of minutes (iqr: - ) and a median ivf of ml (iqr: - ). the patients who met sepsis criteria but went unrecognized in the documentation had a median time to antibiotics of minutes (iqr: - ) and median volume of fluid resuscitation of ml (iqr: . median time to antibiotics and median volume of fluid resuscitation differed significantly between recognized and unrecognized septic patients (p = . and p = . , respectively). conclusion: emergency physicians correctly identify and treat infection in most cases, but frequently do not document sirs and sepsis. lack of documentation of sepsis in the differential diagnosis is associated with increased time to antibiotic delivery and a smaller total volume of fluid administration, which may explain poor sepsis bundle compliance in the emergency department. background: severe sepsis is a common clinical syndrome with substantial human and financial impact. in the first consensus definition of sepsis was published. subsequent epidemiologic estimates were collected using administrative data, but ongoing discrepancies in the definition of severe sepsis led to large differences in estimates. objectives: we seek to describe the variations in incidence and mortality of severe sepsis in the us using four methods of database abstraction. methods: using a nationally representative sample, four previously published methods (angus, martin, dombrovskiy, wang) were used to gather cases of severe sepsis over a -year period ( ) ( ) ( ) ( ) ( ) ( ) . in addition, the use of new icd- sepsis codes was compared to previous methods. our main outcome measure was annual national incidence and in-hospital mortality of severe sepsis. results: the average annual incidence varied by as much as . 
fold depending on the method used, and ranged from , ( / , population) to , , ( , / , ) using the methods of dombrovskiy and wang, respectively. the average annual increase in the incidence of severe sepsis was similar ( . - . %) across all methods. total mortality mirrored the increase in incidence over the -year period.

background: radiation exposure from medical imaging has been the subject of many major journal articles, as well as a topic in the mainstream media. some estimate that one-third of all ct scans are not medically justified. it is important for practitioners ordering these scans to be knowledgeable about currently discussed risks. objectives: to compare the knowledge, opinions, and practice patterns of three groups of providers in regard to cts in the ed. methods: an anonymous electronic survey was sent to all residents, physician assistants, and attending physicians in emergency medicine (em), surgery, and internal medicine (im) at a single academic tertiary care referral level i trauma center with an annual ed volume of over , visits. the survey was pilot tested and validated. all data were analyzed using pearson's chi-square test. results: there was a response rate of % ( / ). data from surgery respondents were excluded due to a low response rate. in comparison to im, em respondents more often correctly equated one abdominal ct to between and chest x-rays, reported receiving formal training regarding the risks of radiation from cts, believed that excessive medical imaging is associated with an increased lifetime risk of cancer, and routinely discussed the risks of ct imaging with stable patients (see table ). particular patient factors influence whether radiation risks are discussed with patients by % in each specialty (see table ). before ordering an abdominal ct in a stable patient, im providers routinely review the patient's medical imaging history less often than the em providers surveyed. overall, % of respondents felt that ordering an abdominal ct in a stable ed patient is a clinical decision that should be discussed with the patient, but should not require consent. conclusion: compared with im, em practitioners report greater awareness of the risks of radiation from cts and discuss risks with patients more often. they also review patients' imaging history more often and take this, as well as patients' age, into account when ordering cts. these results indicate a need for improved education for both em and im providers in regard to the risks of radiation from ct imaging.

background: in nebraska, % of emergency departments have annual visits of less than , , and the majority are in rural settings. general practitioners working in rural emergency departments have reported low confidence in several emergency medicine skills. current staffing patterns include using midlevels as the primary provider with non-emergency medicine trained physicians as back-up. lightly-embalmed cadaver labs are used for residents' procedural training. objectives: to describe the effect of a lightly-embalmed cadaver workshop on physician assistants' (pa) reported level of confidence in selected emergency medicine procedures. methods: an emergency medicine procedure lab was offered at the nebraska association of physician assistants annual conference. each lab consisted of a -hour hands-on session teaching endotracheal intubation techniques, tube thoracostomy, intraosseous access, and arthrocentesis of the knee, shoulder, ankle, and wrist to pas.
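the imaging-survey analysis above compares em and im responses with pearson's chi-square test. a minimal sketch on a hypothetical 2x2 table of correct versus incorrect answers by specialty:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = specialty (EM, IM), columns = answered the
# radiation-equivalence question correctly vs. incorrectly.
table = np.array([[42, 18],
                  [25, 35]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```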
irb-approved surveys were distributed pre-lab and a post-lab survey was distributed after lab completion. baseline demographic experience was collected. pre-and post-lab procedural confidence was rated on a six-point likert scale ( - ) with representing no confidence. the wilcoxon signed-rank test was use to calculate p values. results: pas participated in the course. all completed a pre-and post-lab assessment. no pa had done any one procedure more than times in their career. pre-lab modes of confidence level were £ for each procedure. post-lab modes were > for each procedure except arthrocentesis of the ankle and wrist. however, post lab assessments of procedural confidence significantly improved for all procedures with p values < . . conclusion: midlevel providers' level of confidence improved for emergent procedures after completion of a procedure lab using lightly-embalmed cadavers. a mobile cadaver lab would be beneficial to train rural providers with minimal experience. background: use of automated external defibrillators (aed) improves survival in out-of-hospital cardiopulmonary arrest (ohca). since , the american heart association has recommended that individuals one year of age or older who sustain ohca have an aed applied. little is known about how often this occurs and what factors are associated with aed use in the pediatric population. objectives: our objective was to describe aed use in the pediatric population and to assess predictors of aed use when compared to adult patients. methods: we conducted a secondary analysis of prospectively collected data from u.s. cities that participate in the cardiac arrest registry to enhance survival (cares). patients were included if they had a documented resuscitation attempt from october , through december , and were ‡ year old. patients were considered pediatric if they were less than years old. aed use included application by laypersons and first responders. hierarchical multivariable logistic regression analysis was used to estimate the associations between age and aed use. results: there were , ohcas included in this analysis, of which ( . %) occurred in pediatric patients. overall aed use in the final sample was , , with , ( . %) total survivors. aeds were applied less often in pediatric patients ( . %, % ci: . %- . % vs . %, % ci: . %- . %). within the pediatric population, only . % of patients with a shockable rhythm had an aed used. in all pediatric patients, regardless of presenting rhythm, aed use demonstrated a statistically significant increase in return of spontaneous circulation (aed used . %, % ci: . - . vs aed not used . %, % ci: . - . , p < . ), although there was no significant increase in survival to hospital discharge (aed used . %; aed not used . %; p = . ). in the adjusted model, pediatric age was independently associated with failure to use an aed (or . , % ci: . - . ) as was female sex (or . , % ci: . - . ). patients who had a public arrest (or . , % ci: . - . ) or one that was witnessed by a bystander (or . . %: ci . - . ) were also predictive of aed use. conclusion: pediatric patients who experience ohca are less likely to have an aed used. continued education of first responders and the lay public to increase aed use in this population is necessary. does implementation of a therapeutic hypothermia protocol improve survival and neurologic outcomes in all comatose survivors of sudden cardiac arrest? 
ken will, michael nelson, abishek vedavalli, renaud gueret, john bailitz cook county (stroger), chicago, il background: the american heart association (aha) currently recommends therapeutic hypothermia (th) for out of hospital comatose survivors of sudden cardiac arrest (cssca) with an initial rhythm of ventricular fibrillation (vf). based on currently limited data, the aha further recommends that physicians consider th for cssca, from both the out and inpatient settings, with an initial non-vf rhythm. objectives: investigate whether a th protocol improves both survival and neurologic outcomes for cssca, for out and inpatients, with any initial rhythm, in comparison to outcomes previously reported in literature prior to th. methods: we conducted a prospective observational study of cssca between august and may whose care included th. the study enrolled eligible consecutive cssca survivors, from both out and inpatient settings with any initial arrest rhythm. primary endpoints included survival to hospital discharge and neurologic outcomes, stratified by sca location, and by initial arrest rhythm. results: overall, of eligible patients, ( %, % ci - %) survived to discharge, ( %, % ci - %) with at least a good neurologic outcome. twelve were out and were inpatients. among the outpatients, ( %, % ci - %) survived to discharge, ( %, % ci - %) with at least a good neurologic outcome. among the inpatients, ( %, % ci - ) survived to discharge, ( %, % ci - %) with at least a good neurologic outcome. by initial rhythm, patients had an initial rhythm of vf/t and non-vf/t. among the patients with an initial rhythm of vf/t, ( %, ci - %) survived to discharge, all with at least a good outcome, including out and inpatients. among the patients with an initial rhythm of non-vf/t, ( %, ci - %) survived to discharge, ( %, ci - %) with at least a good neurologic outcome, including out and inpatients. conclusion: our preliminary data initially suggest that local implementation of a th protocol improves survival and neurologic outcomes for cssca, for out and inpatients, with any initial rhythm, in comparison to outcomes previously reported in literature prior to th. subsequent research will include comparison to local historical controls, additional data from other regional th centers, as well as comparison of different cooling methods. protocolized background: therapeutic hypothermia (th) has been shown to improve the neurologic recovery of cardiac arrest patients who experience return of spontaneous circulation (rosc). it remains unclear as to how earlier cooling and treatment optimization influence outcomes. objectives: to evaluate the effects of a protocolized use of early sedation and paralysis on cooling optimization and clinical outcomes in survivors of cardiac arrest. methods: a -year ( - ), pre-post intervention study of patients with rosc after cardiac arrest treated with th was performed. those patients treated with a standardized order set which lacked a uniform sedation and paralytic order were included in the pre-intervention group, and those with a standardized order set which included a uniform sedation and paralytic order were included in the post-intervention group. 
patient demographics, initial and discharge glasgow coma scale (gcs) scores, resuscitation details, cooling time variables, severity of illness as measured by the apache ii score, discharge disposition, functional status, and days to death were collected and analyzed using student's t-tests, man-whitney u tests, and the log-rank test. results: patients treated with th after rosc were included, with patients in the pre-intervention group and in the post-intervention group. the average time to goal temperature ( °c) was minutes (pre-intervention) and minutes (post-intervention) (p = . ). a -hour time target was achieved in . % of the patients (post-intervention) compared to . % in the pre-group (p = . ). twenty-eight day mortality was similar between groups ( . % and . %) though hospital length of stay ( days pre-and days post-intervention) and discharge gcs ( preand -post-intervention) differed between cohorts. more post-intervention patients were discharged to home ( . %) compared to . % in the pre-intervention group. conclusion: protocolized use of sedation and paralysis improved time to goal temperature achievement. these improved th time targets were associated with improved neuroprotection, gcs recovery, and disposition outcome. standardized sedation and paralysis appears to be a useful adjunct in induced th. background: ct is increasingly used to assess children with signs and symptoms of acute appendicitis (aa) though concerns regarding long-term risk of exposure to ionizing radiation have generated interest in methods to identify children at low risk. objectives: we sought to derive a clinical decision rule (cdr) of a minimum set of commonly used signs and symptoms from prior studies to predict which children with acute abdominal pain have a low likelihood of aa and compared it to physician clinical impression (pci). methods: we prospectively analyzed subjects aged to years in u.s. emergency departments with abdominal pain plus signs and symptoms suspicious for aa within the prior hours. subjects were assessed by study staff unaware of their diagnosis for clinical attributes drawn from published appendicitis scoring systems and physicians responsible for physical examination estimated the probability of aa based on pci prior to their medical disposition. based on medical record entry rate, frequently used cdr attributes were evaluated using recursive partitioning and logistic regression to select the best minimum set capable of discriminating subjects with and without aa. subjects were followed to determine whether imaging was used and use was tabulated by both pci and the cdr to assess their ability to identify patients who did or did not benefit based on diagnosis. results: this cohort had a . % prevalence ( / subjects) of aa. we derived a cdr based on the absence of two out of three of the following attributes: abdominal tenderness, pain migration, and rigidity/ guarding had a sensitivity of . % ( % ci: . - . ), specificity of . % ( % ci: . - . ), npv of . % ( % ci: . - . ), and negative likelihood ratio of . ( % ci: . - . ). the pci set at aa < % pre-test probability had a sensitivity of . % ( % ci: . - . ), specificity of . % ( % ci: . - . ), npv of . % ( % ci: . - . ), and negative likelihood ratio of . ( % ci: . - . ). the methods each classified % of the patients as low risk for aa. our cdr identified . % ( / ) of low risk subjects who received ct but being aa (-), could have been spared ct, while the pci identified . % ( / ). 
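the decision-rule derivation above uses recursive partitioning over candidate clinical attributes and then reports the sensitivity of a rule based on the absence of two of three findings. the sketch below fits a shallow decision tree to simulated binary attributes and scores a rule of that form; the attribute names echo the abstract, but the data, tree, and numbers are purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(6)
n = 500
# Hypothetical binary clinical attributes and appendicitis labels.
X = rng.integers(0, 2, size=(n, 3))            # columns: tenderness, migration, rigidity/guarding
y = (X.sum(axis=1) + rng.normal(0, 0.8, n) > 1.5).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["abd_tenderness", "pain_migration", "rigidity_guarding"]))

# A rule of the abstract's form: low risk if fewer than two of the three attributes are present.
low_risk = X.sum(axis=1) < 2
sens = 1 - np.mean(low_risk[y == 1])           # rule "positive" = not low risk
print(f"rule sensitivity={sens:.2f}, fraction classified low risk={low_risk.mean():.2f}")
```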
conclusion: compared to physician clinical impression, our clinical decision rule can identify more children at low risk for appendicitis who could be managed more conservatively with careful observation and avoidance of ct. negative background: abdominal pain is the most common complaint in the ed and appendicitis is the most common indication for emergency surgery. a clinical decision rule (cdr) identifying abdominal pain patients at a low risk for appendicitis could lead to a significant reduction in ct scans and could have a significant public health impact. the alvarado score is one of the most widely applied cdrs for suspected appendicitis, and a low modified alvarado score (less than ) is sometimes used to rule out acute appendicitis. the modified alvarado score has not been prospectively validated in ed patients with suspected appendicitis. objectives: we sought to prospectively evaluate the negative predictive value of a low modified alvarado score (mas) in ed patients with suspected appendicitis. we hypothesized that a low mas (less than ) would have a sufficiently high npv (> %) to rule out acute appendicitis. methods: we enrolled patients greater than or equal to years old who were suspected of having appendicitis (listed as one of the top three diagnosis by the treating physician before ancillary testing) as part of a prospective cohort study in two urban academic eds from august to april . elements of the mas and the final diagnosis were recorded on a standard data form for each subject. the sensitivity, specificity, negative predictive value (npv), and positive predictive value (ppv) were calculated with % ci for a low mas and final diagnosis of appendicitis. background: evaluating children for appendicitis is difficult and strategies have been sought to improve the precision of the diagnosis. computed tomography is now widely used but remains controversial due to the large dose of ionizing radiation and risk of subsequent radiation-induced malignancy. objectives: we sought to identify a biomarker panel for use in ruling out pediatric acute appendicitis as a means of reducing exposure to ionizing radiation. methods: we prospectively enrolled subjects aged to years presenting in u.s. emergency departments with abdominal pain and other signs and symptoms suspicious for acute appendicitis within the prior hours. subjects were assessed by study staff unaware of their diagnosis for clinical attributes drawn from appendicitis scoring systems and blood samples were analyzed for cbc differential and candidate proteins. based on discharge diagnosis or post-surgical pathology, the cohort exhibited a . % prevalence ( / subjects) of appendicitis. clinical attributes and biomarker values were evaluated using principal component, recursive partitioning, and logistic regression to select the combination that best discriminated between those subjects with and without disease. mathematical combination of three inflammation-related markers in a panel comprised of myeloid-related protein / complex (mrp), c-reactive protein (crp), and white blood cell count (wbc) provided optimal discrimination. results: this panel exhibited a sensitivity of % ( % ci, - %), a specificity of % ( % ci, - %), and a negative predictive value of % ( % ci, - %) in this cohort. the observed performance was then verified by testing the panel against a pediatric subset drawn from an independent cohort of all ages enrolled in an earlier study. 
in this cohort, the panel exhibited a sensitivity of % ( % ci, - %), a specificity of % ( % ci, - %), and a negative predictive value of % ( % ci, - %). conclusion: appyscore is highly predictive of the absence of acute appendicitis in these two cohorts. if these results are confirmed by a prospective evaluation currently underway, the appyscore panel may be useful to classify pediatric patients presenting to the emergency department with signs and symptoms suggestive of, or consistent with, acute appendicitis, thereby sparing many patients ionizing radiation.

background: there are no current studies on the tracking of emergency department (ed) patient dispersal when a major ed closes. this study demonstrates a novel way to track where patients sought emergency care following the closure of saint vincent's catholic medical center (svcmc) in manhattan by using de-identified data from a health information exchange, the new york clinical information exchange (nyclix). nyclix matches patients who have visited multiple sites using their demographic information. on april , , svcmc officially stopped providing emergency and outpatient services. we report the patterns in which patients from svcmc visited other sites within nyclix. objectives: we hypothesize that patients often seek emergency care based on geography when a hospital closes. methods: a retrospective pre- and post-closure analysis was performed of svcmc patients visiting other hospital sites. the pre-closure study dates were january , -march , . the post-closure study dates were may , -july , . an svcmc patient was defined as a patient with any svcmc encounter prior to its closure. using de-identified aggregate count data, we calculated the average number of visits per week by svcmc patients at each site (hospital a-h). we ran a paired t-test to compare the pre- and post-closure averages by site. the following specifications were used to write the database queries: for patients who had one or more prior visits to svcmc, for each day within the study, return the following:
a. eid: a unique and meaningless proprietary id generated within the nyclix master patient index (mpi).
b. age: thru the age of ; persons over were listed as '' + ''.
c. ethnicity/race.
d. type of visit: emergency.
e. location of visit: specific nyclix site.
results: nearby hospitals within miles saw the highest number of increased ed visits after svcmc closed. this increase was seen until about miles. hospitals > miles away did not see any significant changes in ed visits. see table. conclusion: when a hospital and its ed close down, patients seem to seek emergency care at the nearest hospital based on geography. other factors may include the patient's primary doctor, availability of outpatient specialty clinics, insurance contracts, or preference of ambulance transports. this study is limited by the inclusion of data from only the eight hospitals participating in nyclix at the time of the svcmc closure.

upstream methods: data were collected on all ed ems arrivals from the metro calgary (population . million) area to its three urban adult hospitals. the study phases consisted of the months from february to october (pre-ocp) compared against the same months in (post-ocp). data from the ems operational database and the regional emergency department information system (redis) database were linked. the primary analysis examined the change in ems offload delay, defined as the time from ems triage arrival until patient transfer to an ed bed.
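a minimal sketch of the primary analysis just described: offload delay computed as the time from ems triage arrival to ed-bed transfer in linked records, then compared between the pre-ocp and post-ocp phases. the timestamps are invented, and the choice of a mann-whitney u test for the comparison is an assumption rather than something stated in the abstract.

from datetime import datetime
from statistics import median
from scipy.stats import mannwhitneyu

def offload_minutes(arrival, bed_transfer):
    # offload delay = ed-bed transfer time minus ems triage arrival time, in minutes
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(bed_transfer, fmt) - datetime.strptime(arrival, fmt)).total_seconds() / 60

pre_phase = [offload_minutes(a, b) for a, b in [
    ("2009-03-01 10:00", "2009-03-01 11:10"),
    ("2009-03-02 14:20", "2009-03-02 15:45"),
    ("2009-03-03 09:05", "2009-03-03 09:50"),
]]
post_phase = [offload_minutes(a, b) for a, b in [
    ("2010-03-01 10:00", "2010-03-01 10:25"),
    ("2010-03-02 14:20", "2010-03-02 14:55"),
    ("2010-03-03 09:05", "2010-03-03 09:35"),
]]

stat, p = mannwhitneyu(pre_phase, post_phase, alternative="two-sided")
print(f"median offload delay: pre = {median(pre_phase):.0f} min, post = {median(post_phase):.0f} min (p = {p:.3f})")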
a secondary analysis evaluated variability in ems offload delay between receiving eds. conclusion: implementation of a regional overcapacity protocol to reduce ed crowding was associated with an important reduction in ems offload delay, suggesting that policies that target hospital processes have bearing on ems operations. variability in offload delay improvements is likely due to site-specific issues, and the gains in efficiency correlate inversely with acuity. methods: a pre-post intervention study was conducted in the ed of an adult university teaching hospital in montreal (annual visits = ). the raz unit (intervention), created to offload the acu of the main ed, started operating in january, . using a split flow management strategy, patients were directed to the raz unit based on patient acuity level (ctas code and certain code ), likelihood to be discharged within hours, and not requiring an ed bed for continued care. data were collected weekdays from : to : for months (september -december ) (pre-raz) and for . months (february -march ) (post-raz). in the acu of the main ed, research assistants observed and recorded cubicle access time, and nurse and physician assessment times. databases were used to extract socio-demographics, ambulance arrival, triage code, chief complaint, triage and registration time, length of stay, and ed occupancy. background: telephone follow-up after discharge from the ed is useful for treatment and quality assurance purposes. ed follow-up studies frequently do not achieve high (i.e. ‡ %) completion rates. objectives: to determine the influence of different factors on the telephone follow-up rate of ed patients. we hypothesized that with a rigorous follow-up system we could achieve a high follow-up rate in a socioeconomically diverse study population. methods: research assistants (ras) prospectively enrolled adult ed patients discharged with a medication prescription between november , and september , from one of three eds affiliated with one health care system: (a) academic level i trauma center, (b) community teaching affiliate, and (c) community hospital. patients unable to provide informed consent, non-english speaking, or previously enrolled were excluded. ras interviewed subjects prior to ed discharge and conducted a telephone follow-up interview week later. follow-up procedures were standardized (e.g. number of calls per day, times to place calls, obtaining alternative numbers) and each subject's follow-up status was monitored and updated daily through a shared, web-based data system. subjects who completed follow-up were mailed a $ gift card. we examined the influence of patient (age, sex, race, insurance, income, marital status, usual major activity, education, literacy level, health status), clinical (acuity, discharge diagnosis, ed length of stay, site), and procedural factors (number and type of phone numbers received from subjects, offering two gift cards for difficult to reach subjects) on the odds of successful followup using multivariate logistic regression. results: of the , enrolled, % were white, % were covered by medicaid or uninsured, and % reported an annual household income of <$ , . % completed telephone follow-up with % completing on the first attempt. the table displays the factors associated with successful follow-up. in addition to patient demographics and lower acuity, obtaining a cell phone or multiple phone numbers as well as offering two gift cards to a small number of subjects increased the odds of successful follow-up. 
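the follow-up study above models the odds of successful telephone follow-up as a function of patient, clinical, and procedural factors via multivariate logistic regression. the sketch below shows that style of model on simulated data; the variable names and effect sizes are illustrative only, not the study's dataset or estimates.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(45, 16, n),
    "female": rng.integers(0, 2, n),
    "gave_cell_phone": rng.integers(0, 2, n),
    "multiple_numbers": rng.integers(0, 2, n),
    "low_acuity": rng.integers(0, 2, n),
})
# synthetic outcome loosely driven by the procedural factors
lin = -0.5 + 0.01 * (df.age - 45) + 0.8 * df.gave_cell_phone + 0.6 * df.multiple_numbers
df["followed_up"] = (rng.random(n) < 1 / (1 + np.exp(-lin))).astype(int)

X = sm.add_constant(df[["age", "female", "gave_cell_phone", "multiple_numbers", "low_acuity"]])
model = sm.Logit(df["followed_up"], X).fit(disp=0)

# adjusted odds ratios with 95% confidence intervals
odds_ratios = np.exp(model.params)
conf = np.exp(model.conf_int())
print(pd.concat([odds_ratios.rename("or"), conf.rename(columns={0: "2.5%", 1: "97.5%"})], axis=1))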
conclusion: with a rigorous follow-up system and a small monetary incentive, a high telephone follow-up rate is achievable one week after an ed visit. methods: an interrupted time-series design was used to evaluate the study question. data regarding adherence with the following pneumonia core measures were collected pre-and post-implementation of the enhanced decision-support tool: blood cultures prior to antibiotic, antibiotic within hours of arrival, appropriate antibiotic selection, and mean time to antibiotic administration. prescribing clinicians were educated on the use of the decision-support tool at departmental meetings and via direct feedback on their cases. results: during the -month study period, complete data were collected for patients diagnosed with cap: in the pre-implementation phase and post-implementation. the mean time to antibiotic administration decreased by approximately one minute from the pre-to post-implementation phase, a change that was not statistically significant (p = . ). the proportion of patients receiving blood cultures prior to antibiotics improved significantly (p < . ) as did the proportion of patients receiving antibiotics within hours of ed arrival (p = . ). a significant improvement in appropriate antibiotic selection was noted with % of patients experiencing appropriate selection in the post-phase, p = . . use of the available support tool increased throughout the study period, v = . , df = , p < . . all improvements were maintained months following the study intervention. conclusion: in this academic ed, introduction of an enhanced electronic clinical decision support tool significantly improved adherence to cms pneumonia core measures. the proportion of patients receiving blood cultures prior to antibiotics, antibiotics within hours, and appropriate antibiotics all improved significantly after the introduction of an enhanced electronic clinical decision support tool. background: emergency medicine (em) residency graduates need to pass both the written qualifying exam and oral certification exam as the final benchmark to achieve board certification. the purpose of this project is to obtain information about the exam preparation habits of recent em graduates to allow current residents to make informed decisions about their individual preparation for the abem written qualifying and oral certification exams. objectives: the study sought to determine the amount of residency and individual preparation, to determine the extent of the use of various board review products, and to elicit evaluations of the various board review products used for the abem qualifying and certification exams. methods: design: an online survey instrument was used to ask respondents questions about residency preparation and individual preparation habits, as well as the types of board review products used in preparing for the em boards. participants: as greater than % of all em graduates are emra members, an online survey was sent to all emra members who have graduated for the past three years. observations: descriptive statistics of types of preparation, types of resources, time, and quantitative and qualitative ratings for the various board preparation products were obtained from respondents. results: a total of respondents spent an average of . weeks and hours per week preparing for the written qualifying exam and spent an average of weeks and . hours per week preparing for the oral certification exam. 
in preparing for the written qualifying exam, % used a preparation textbook, with % using more than one textbook and % using a board preparation course. in preparing for the oral certification exam, % used a preparation textbook while % used a preparation course. sixty-seven percent of respondents reported that their residency programs had a formalized written qualifying exam preparation curriculum, of which % was centered on the annual in-training exam. eighty-five percent of residency programs had a formalized oral certification exam preparation curriculum. respondents reported spending on average $ preparing for the qualifying exam and $ for the certification exam. conclusion: em residents spend significant amounts of time and money and make use of a wide range of residency and commercially available resources in preparing for the abem qualifying and certification exams.

background: communication and professionalism skills are essential for em residents but are not well-measured by selection processes. the multiple mini-interview (mmi) uses multiple, short structured contacts to measure these skills. it predicts medical school success better than the interview and application. its acceptability and utility in em residency selection is unknown. objectives: we theorized that the mmi would provide novel information and be acceptable to participants. methods: interns from three programs in the first month of training completed an eight-station mmi developed to focus on em topics. pre- and post-surveys assessed reactions using five-point scales. mmi scores were compared to application data. results: em grades correlated with mmi performance (f( , ) = , p < . ), with honors students having higher mmi summary scores. higher third-year clerkship grades trended toward higher mmi performance means, although not significantly. mmi performance did not correlate with a match desirability rating and did not predict other individual components of the application, including usmle step or usmle step . participants preferred a traditional interview (mean difference = . , p < . ). a mixed format was preferred over a pure mmi (mean difference = . , p < . ). preference for a mixed format was similar to a traditional interview. mmi performance did not significantly correlate with preference for the mmi; however, there was a trend for higher performance to associate with higher preference (r = . , t( ) = . , n.s.). performance was not associated with preference for a mix of interview methods (r = . , t( ) = . , n.s.). conclusion: while the mmi alone was viewed less favorably than a traditional interview, participants were receptive to a mixed-methods interview. the mmi appears to measure skills important in successful completion of an em clerkship and thus likely em residency. future work will determine whether mmi performance correlates with clinical performance during residency.

background: the annual american board of emergency medicine (abem) in-training exam is a tool to assess resident progress and knowledge. when the new york-presbyterian (nyp) em residency program started in , the exam was not emphasized and resident performance was lower than expected. a course was implemented to improve residency-wide scores despite previous em literature failing to exhibit improvements with residency-sponsored in-training exam interventions. objectives: to evaluate the effect of a comprehensive, multi-faceted course on residency-wide in-training exam performance.
methods: the nyp em residency program, associated with cornell and columbia medical schools, has a -year format with - residents per year. an intensive -week in-training exam preparation program was instituted outside of the required weekly residency conferences. the program included lectures, pre-tests, high-yield study sheets, and remediation programs. lectures were interactive, utilizing an audience response system, and consisted of core lectures ( - . hours) and three review sessions. residents with previous in-training exam difficulty were counseled on designing their own study programs. the effect on in-training exam scores was measured by comparing each resident's score to the national mean for their postgraduate year (pgy). scores before and after course implementation were evaluated by repeated-measures regression modeling. overall residency performance was evaluated by comparing the residency average to the national average each year and by tracking abem national written examination pass rates. results: resident performance improved following course implementation. following the course's introduction, the odds of a resident beating the national mean increased by . ( % ci . - . ) and the percentage of residents exceeding the national mean for their pgy increased by % ( % ci %- %). following course introduction, the overall residency mean score has outperformed the national exam mean annually and the first-time abem written exam board pass rate has been %. conclusion: a multi-faceted in-training exam program centered around a -week course markedly improved overall residency performance on the in-training exam. limitations: this was a before-and-after evaluation, as randomizing residents to receive the course was not logistically or ethically feasible.

. years of practice. among the non-residency trained, non-boarded em physicians, the percentage of individuals with board actions against them was significantly higher ( . % vs. . %; % ci for the difference of . %: . to . %), but the incidence of actions was not significantly different ( . vs. . events per years of practice; % ci for the difference of . / : - / to + / ), although the power to detect a difference was only %. conclusion: in this study population, em-trained physicians had significantly fewer total state medical board disciplinary actions against them than non-em-trained physicians, but when adjusted for years of practice (incidence), the difference was not significant at the % confidence level. the study was limited by low power to detect a difference in incidence.

objectives: we chose pain documentation as a long-term project for quality improvement in our ems system. our objectives were to enhance the quality of pain assessment, to reduce patient suffering and pain through improved pain management, to improve pain assessment documentation, to improve capture of initial and repeat pain scales, and to improve the rate of pain medication. this study addressed the aim of improving pain assessment documentation. methods: this was a quasi-experiment looking at paramedic documentation of the pqrst mnemonic and pain scales. our intervention consisted of mandatory training on the importance and necessity of pain assessment and treatment. in addition to classroom training, we used rapid-cycle individual feedback and public posting of pain documentation rates (with unique ids) for individual feedback. the categories of chief complaint studied were abdominal pain, blunt injury, burn, chest pain, headache, non-traumatic body pain, and penetrating injury.
we compared the pain documentation rates in the months prior to intervention, the months of intervention, and the months post-intervention. using repeated-measures anova, we compared rates of paramedic documentation over time. results: our ems system transported patients during the study period, of whom were for painful conditions in the defined chief complaint categories. there were paramedics studied, of whom had complete data. documentation increased from of painful cases ( . %) in qtr to of painful cases ( . %) in qtr . the trend toward increased rates of pain documentation over the three quarters was strongly significant (p < . ). paramedics were significantly more likely to document pain scales and pqrst assessments over the course of the study, with the highest rates of documentation compliance in the final -month period. conclusion: a focused intervention of education and individual feedback through classroom training, one-on-one training, and public posting improves paramedic documentation rates of perceived patient pain.

background: emergency medical services (ems) systems are vital in the identification, assessment, and treatment of trauma, stroke, myocardial infarction, and sepsis, and in improving early recognition, resuscitation, and transport to adequate medical facilities. ems personnel provide similar first-line care for patients with syncope, performing critical actions such as initial assessment and treatment as well as gathering key details of the event. objectives: to characterize emergency department patients with syncope receiving initial care by ems and the role of ems as initial providers. methods: we prospectively enrolled patients over years of age who presented with syncope or near-syncope to a tertiary care ed with , annual patient visits from june to june . we compared patient age, sex, comorbidities, and -day cardiopulmonary adverse outcomes (defined as myocardial infarction, pulmonary embolism, significant cardiac arrhythmia, and major cardiovascular procedure) between ems and non-ems patients. descriptive statistics, two-sided t-tests, and chi-square testing were used as appropriate. results: of the patients enrolled, ( . %) arrived by ambulance. the most common complaint in patients transported by ems was fainting ( . %) or dizziness ( . %); syncope was reported in ( . %). compared to non-ems patients, those who arrived by ambulance were older (mean age (sd) . ( . ) vs. . ( . ) years, p = . ). there were no differences in the proportion of patients with hypertension ( . % vs . %, p = . ), coronary artery disease ( . % vs . %, p = . ), diabetes mellitus ( . % vs . %, p = . ), or congestive heart failure ( . % vs . %, p = . ). sixty-nine ( . %) patients experienced a cardiopulmonary event within days. twenty-eight ( . %) patients who arrived by ambulance and ( . %) non-ems patients had a subsequent cardiopulmonary adverse event (rr . , % ci . - . ) within days. the table tabulates interventions provided by ems prior to ed arrival. conclusion: ems providers care for more than one third of ed syncope patients and often perform key interventions. ems systems offer opportunities for advancing diagnosis, treatment, and risk stratification in syncope patients.

background: abdominal pain is the most common reason for visiting an emergency department (ed), and abdominopelvic computed tomography (apct) use has increased dramatically over the past decade. despite this, there has been no significant change in rates of admission or diagnosis of surgical conditions.
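the syncope abstract above reports a relative risk with a 95% confidence interval for 30-day adverse events in ems versus non-ems patients. the sketch below shows the usual log-scale calculation of such an interval on hypothetical counts; it is illustrative arithmetic, not a reproduction of the study's numbers.

import math

def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed, z=1.96):
    r1 = events_exposed / n_exposed
    r0 = events_unexposed / n_unexposed
    rr = r1 / r0
    # standard error of log(rr)
    se = math.sqrt(1 / events_exposed - 1 / n_exposed + 1 / events_unexposed - 1 / n_unexposed)
    lo, hi = math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# hypothetical: 28/180 ems-transported patients vs 41/340 walk-in patients with a 30-day event
rr, lo, hi = relative_risk(28, 180, 41, 340)
print(f"rr = {rr:.2f} (95% ci {lo:.2f}-{hi:.2f})")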
objectives: to assess whether an electronic accountability tool affects apct ordering in ed patients with abdominal or flank pain. we hypothesized that implementation of an accountability tool would decrease apct ordering in these patients. methods: this was a before-and-after study design using an electronic medical record at an urban academic ed from jul-nov , with the electronic accountability tool implemented in oct for any apct order. inclusion criteria: age >= years, non-pregnant, and chief complaint or triage pain location of abdominal or flank pain. starting oct th, , resident attempts to order apct triggered an electronic accountability tool which only allowed the order to proceed if approved by the ed attending physician. the attending was prompted to enter the primary and secondary diagnoses indicating apct, agreement with the need for ct and, if no agreement, who was requesting this ct (admitting or consulting physician), and their pretest probability ( - ) of the primary diagnosis. patients were placed into two groups: those who presented prior to (pre) and after (post) the deployment of the accountability tool.

background: there has been a paradigm shift in the diagnostic work-up for suspected appendicitis. ed-based staged protocols call for the use of ultrasound prior to ct scanning because of its lack of radiation and the morbidity related to contrast. a barrier to implementation is the lack of / availability of ultrasound. objectives: to evaluate the impact of the implementation of ed-performed appendix ultrasounds (apus) on ct utilization in the staged work-up for appendicitis in the emergency department. methods: we performed a quasi-experimental, before/after study. we compared data from the first months of , before the availability of ed-performed apus, with the same interval in after introduction of ed apus. we excluded patients who had appendectomies for reasons other than appendicitis or had been diagnosed prior to arrival. no patient identifiers were included in the analysis and the study was approved by the hospital irb. we report the following descriptive statistics: percentages, sensitivities, and absolute utilization changes. conclusion: implementation of an ed apus in the staged work-up of appendicitis was associated with a significant reduction in overall ct utilization in the ed.

objectives: this study aims to evaluate ed patients' knowledge of radiation exposure from ct and mri scans as well as the long-term risk of developing cancer. we hypothesize that ed patients will have a poor understanding of the risks, and will not know the difference between ct and mri. methods: design - this was a cross-sectional survey study of adult, english-speaking patients at two eds from / / - / / . setting - one location was a tertiary care center with an annual ed census of , patient visits and the other was a community hospital with an annual ed census of , patient visits. observations - the survey consisted of six questions evaluating patients' understanding of radiation exposure from ct and mri as well as long-term consequences of radiation exposure. patients were then asked their age, sex, race, highest level of education, annual household income, and whether they considered themselves health care professionals. results: there were participants in this study, (of , total) from the academic center and (of , total) from the community hospital during the study period. overall, only % ( % ci - %) of participants understood the radiation risks associated with ct scanning.
% ( % ci - %) of patients believed that an abdominal ct had the same or less radiation as a chest x-ray. % ( % ci - %) believed that there was an increased risk of developing cancer from repeated abdominal cts. only % ( % ci - %) of patients knew that mri scans had less radiation than ct. % ( % ci - %) either didn't know or believed that repeated mris were associated with an increased risk of developing cancer. higher educational level, household income, and identification as a health care professional all were associated with correct responses, but even within these groups, a majority gave incorrect responses. conclusion: in general, ed patients do not understand the radiation risks associated with advanced imaging modalities. we need to educate these patients so that they can make informed decisions about their own health care. background: homelessness has been associated with many poor health outcomes and frequent ed utilization. it has been shown that frequent use of the ed in any given year is not a strong predictor of subsequent use. identifying a group of patients who are chronic high users of the ed could help guide intervention. objectives: the purpose of this study is to identify if homelessness is associated with chronic ed utilization. methods: a retrospective chart review was accomplished looking at the records of the most frequently seen patients in the ed for each year from - at a large, urban academic hospital with an annual volume of , . patients' visit dates, chief complaints, dispositions, and housing status were reviewed. homelessness was defined by self-report at registration. patients were categorized according to their ed utilization with those seen > times in at least three of the five years of the study identified as chronic high utilizers; and those who visited the ed > times in at least three of the five years of the study were identified as chronic ultra-high utilizers. descriptive statistics with confidence intervals were calculated, and comparisons were made using non-parametric tests. results: during the -year study period, , unique patients were seen, of whom . % patients were homeless. patients were identified as frequent users. there were patients who presented on the top utilizer lists from multiple years. ( %, %ci - ) patients were identified as homeless. patients were seen > times in at least three of the years and ( %, - ) were homeless. patients were seen > times in at least three of the years and ( %, - ) were homeless. our facility has a % admission rate; however, non homeless chronic ultra-high utilizers had admission rates of % and homeless chronic ultra-high utilizers were admitted %. conclusion: chronic ultra-high utilizers of our ed are disproportionately homeless and present with lower severity of illness. these patients may prove to be a cost-effective group to house or otherwise involve with aggressive case management. the debate over homeless housing programs and case management solutions can be sharpened by better defining the groups who would most benefit and who represent the greatest potential saving for the health system. background: the prevalence of obese patients presenting to our emergency department (ed) is %: obese patients present in disproportionate number compared to the general population (us rate = %). 
in spite of this, there is a disconnect in patients' perceptions of weight and health: many patients underestimate their weight and report that a key barrier to weight loss is patient-provider communication; such discussions have proven to be highly effective in smoking, drug, and alcohol cessation, an important initial step toward promoting wellness. information about patient-provider communication is essential for designing and implementing emergency department (ed) based interventions to help increase patient awareness about weight-related medical issues and provide counseling for weight reduction. objectives: we assessed patients' perceptions about obesity as a disease and patient communication with their providers through two questions: do you believe your present weight is damaging to your health? has a doctor or other health professional ever told you that you are overweight? methods: a descriptive cross-sectional study was performed in an academic tertiary care ed. a randomized sample of patients (every fifth) presenting to the ed (n = ) was enrolled. pregnant patients, patients who were medically unstable, cognitively impaired, or who were unable or unwilling to provide informed consent were excluded. percentages of ''yes'' and ''no'' are reported for each question based on patient bmi, ethnicity, sex, and the number of comorbid conditions. regression analysis was used to determine differences in responses between subgroups. results: among overweight/obese, white/black patients, . % do not feel their weight is damaging to their health and . % reported they have not been told by a doctor they are overweight. of individuals who have been told by a doctor they were overweight, . % still believe their present weight is not damaging to their health. of individuals who have not been told by a doctor they were overweight, . % believe their present weight is damaging to their health. differences by race and age were not found. p values were < . for all results. conclusion: our data point toward a disconnect regarding patients' perceptions of health and weight. timely education about the burden of obesity may lead to a decrease in its overall prevalence. (originally submitted as a ''late-breaker.'')

objectives: to examine the attitudes and expectations of patients admitted for inpatient care following an emergency department visit. methods: a descriptive study was done by surveying a voluntary sample of adult patients (n = ) admitted to the hospital from the emergency department in one urban teaching hospital in the midwest. a short, nine-question survey was developed to assess patient attitudes and expectations towards hiv testing, consent, and requirements. analyses consisted of descriptive statistics, correlations, and chi-square analyses. results: the majority of patients report that hiv testing should be a routine part of health care screening ( . %) and that the hospital should routinely test admitted patients for hiv ( . %). despite these overall positive attitudes towards hiv testing, the data also suggest that patients have strong attitudes towards consent requirements, with % acknowledging that hiv testing requires special consent and % reporting that separate consent should be required. the data also showed a statistically significant difference in the proportion of patients who believed that hiv testing is a part of routine health care screening by race (χ² = . , df = , p = . ).
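the race comparison just reported is a chi-square test of independence on a contingency table of race group by response. the sketch below shows the same kind of test on hypothetical counts, not the study's data.

from scipy.stats import chi2_contingency

# rows: respondent race groups; columns: believes hiv testing is routine screening (yes / no)
table = [
    [45, 30],
    [60, 20],
    [25, 25],
]
chi2, p, df, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, df = {df}, p = {p:.4f}")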
conclusion: patients' attitudes and expectations towards routine hiv testing are consistent with the cdc recommendations. emergency departments are an ideal setting to initiate hiv testing, and the findings suggest that patients expect hospital policies to outline procedures for obtaining consent and screening all patients who are admitted to the hospital from the ed.

results: the analysis revealed a ''hot spot'', a cluster of counties ( . %) with high ca rates adjacent to counties with high ca rates, located across the southeastern u.s. (p < . ). within these counties, the average ca rate was % higher than the national average. a ''cool spot'', a cluster of counties ( . %) with low rates, was located across the midwest (p < . ). in this cool spot the average ca rate was % lower than the national average. figures and show u.s. adjusted rates and spatial autocorrelation of ca deaths, respectively. conclusion: we identify geographic disparities in ca mortality and describe the cardiac arrest belt in the southeastern u.s. a limitation of this analysis was the use of icd- codes to identify cardiac arrest deaths; however, no other national data exist. an improved understanding of the drivers of this variability is essential to targeted prevention and treatment strategies, especially given the recent emphasis on development of cardiac resuscitation centers and cardiac arrest systems of care. an understanding of the relation between population density, cardiac arrest count, and cardiac arrest rate will be essential to the design of an optimized cardiac arrest system.

we defined ed utilization during the past months as non-users ( visits), infrequent users ( - visits), frequent users ( - visits), and super-frequent users ( ≥ visits). we compared demographic data, socioeconomic status, health conditions, and access to care between these ed utilization groups. results: overall, super-frequent use was reported by . % of u.s. adults, frequent use by %, and infrequent ed use by %. higher ed utilization was associated with increased self-reported fair-to-poor health ( % for super-frequent, % for frequent, % for infrequent, % for non-ed users). frequent ed users were also more likely to be impoverished, with % of super-frequent, % of frequent, % of infrequent, and % of non-ed users reporting a poverty-income ratio < . adults with higher ed utilization were more likely to report the ed as the place they usually go when sick ( % for super-frequent, % for frequent, % for infrequent, . % for non-ed users). they also reported greater outpatient resource utilization, with % of super-frequent, % of frequent, % of infrequent, and % of non-ed users reporting ≥ outpatient visits/year. frequent ed users were also more likely than non-ed users to be covered by medicaid ( % for super-frequent, % for frequent, % for infrequent, % for non-ed users). conclusion: frequent ed users were a vulnerable population with lower socioeconomic status, poor overall health, and high outpatient resource utilization. interventions designed to divert frequent users from the ed should also focus on chronic disease management and access to outpatient services, rather than focusing solely on limiting ed utilization.

objectives: we explored factors associated with specialty provider willingness to provide urgent appointments to children insured by medicaid/chip.
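the cardiac arrest analysis above identifies hot and cool spots through spatial autocorrelation of county rates. the sketch below illustrates the underlying idea with a local moran's i statistic on a toy four-county layout; the adjacency matrix and rates are invented, and the published analysis may well have used a different local statistic and significance procedure.

import numpy as np

rates = np.array([120.0, 115.0, 60.0, 58.0])   # cardiac arrest deaths per 100,000 (toy values)
w = np.array([                                  # 1 = counties share a border (toy layout)
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
w /= w.sum(axis=1, keepdims=True)               # row-standardize the weights

z = (rates - rates.mean()) / rates.std()        # standardized rates
local_i = z * (w @ z)                           # local moran's i for each county

for county, (zi, li) in enumerate(zip(z, local_i)):
    kind = ("high-high (hot)" if zi > 0 and li > 0
            else "low-low (cool)" if zi < 0 and li > 0
            else "not clustered")
    print(f"county {county}: rate z = {zi:+.2f}, local i = {li:+.2f} -> {kind}")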
methods: as part of a mixed method study of child access to specialty care by insurance status, we conducted semi-structured qualitative interviews with a purposive sample of specialists and primary care physicians (pcps) in cook county, il. interviews were conducted from april to september , until theme saturation was reached. resultant transcripts and notes were entered into atlas.ti and analyzed using an iterative coding process to identify patterns of responses in the data, ensure reliability, examine discrepancies, and achieve consensus through content analysis. results: themes that emerged indicate that pcps face considerable barriers getting publicly insured patients into specialty care and use the ed to facilitate this process. ''if i send them to the emergency room, i'm bypassing a number of problems. i'm fully aware that i'm crowding the emergency room.'' specialty physicians reported that decisions to refuse or limit the number of patients with medicaid/chip are due to economic strain or direct pressure from their institutions ''in the last budget revision, we were [told], 'you are losing money, so you need to improve your patient mix'''. in specialty practices with limited medicaid/chip appointment slots, factors associated with appointment success included: high acuity or complexity, personal request from or an informal economic relationship with the pcp, geography, and patient hardship. ''if it's a really desperate situation and they can't find anybody else, i will make an exception''. specialists also acknowledged that ''patients who can't get an appointment go to the er and then i am obligated to see them if they're in the system.'' conclusion: these exploratory findings suggest that a critical linkage exists between hospital eds and affiliated specialty clinics. as health systems restructure, there is an opportunity for eds to play a more explicit role in improving care coordination and access to specialty care. albert amini, erynne a. faucett, john m. watt, richard amini, john c. sakles, asad e. patanwala university of arizona, tucson, az background: trauma patients commonly receive etomidate and rocuronium for rapid sequence intubation (rsi) in the ed. due to the long duration of action of rocuronium and short duration of action of etomidate, these patients require prompt initiation of sedatives after rsi. this prevents the potential of patient awareness under pharmacological paralysis, which could be a terrifying experience. objectives: the purpose of this study was to evaluate the effect of the presence of a pharmacist during traumatic resuscitations in the ed on the initiation of sedatives and analgesics after rsi. we hypothesized that pharmacists would decrease the time to provision of sedation and analgesia. methods: this was an observational, retrospective cohort study conducted in a tertiary, academic ed that is a level i trauma center. consecutive adult trauma patients who received rocuronium in the ed for rsi were included during two time periods: / / to / / (pre-phase -no pharmacy services in the ed) and / / to / / (post-phase -pharmacy services in the ed). since the pharmacist could not respond to all traumas in the post-phase, this was further categorized based on whether the pharmacist was present or absent at the trauma resuscitation. data collected included patient demographics, baseline injury data, and medications used. 
the median time from rsi to initiation of sedatives and analgesics was compared between the pre-phase group (group ), the post-phase pharmacist-absent group (group ), and the post-phase pharmacist-present group (group ) using the kruskal-wallis test. results: a total of patients were included in the study (group = , group = , and group = ). median age was , . , and . years in groups , , and , respectively (p = . ). there were no other differences between groups with regard to demographics, mechanism of injury, presence of traumatic brain injury, glasgow coma scale score, vital signs, ed length of stay, or mortality. median time between rsi and post-intubation sedative use was , , and minutes in groups , , and , respectively (p < . ). median time between rsi and post-intubation analgesia use was , , and minutes in groups , , and , respectively (p < . ). conclusion: the presence of a pharmacist during trauma resuscitations decreases time to provision of sedation and analgesia after rsi.

background: outpatient antibiotics are frequently prescribed from the ed, and limited health literacy may affect compliance with recommended treatments. objectives: we hypothesized that, among patients stratified by health literacy level, multimodality discharge instructions would improve compliance with outpatient antibiotic therapy and follow-up recommendations. methods: this was a prospective randomized trial that included consenting patients discharged with outpatient antibiotics from an urban county ed with an annual census of , . patients unable to receive text messages or voicemails were excluded. health literacy was assessed using a validated health literacy assessment, the newest vital sign (nvs). patients were randomized to a discharge instruction modality: ) usual care, typed and verbal medication and case-specific instructions; ) usual care plus text-messaged instructions sent to the patient's cell phone; or ) usual care plus voicemailed instructions sent to the patient's cell phone. antibiotic pick-up was verified with the patient's pharmacy at hours. patients were called at days to determine antibiotic compliance. z-tests were used to compare -hour antibiotic pickup and patient-reported compliance across instructional modality and nvs score groups. results: patients were included ( % female, median age , range months to years); were excluded. % had an nvs score of - , % - , and % - . the proportion of prescriptions filled at hours varied significantly across nvs score groups; self-reported medication compliance at days revealed no difference across instructional modalities or nvs scores (table ). conclusion: in this sample of urban ed patients, -hour prescription pickup varied significantly by validated health literacy score, but not by instruction delivery modality. in this sample, patients with lower health literacy are at risk of not filling their outpatient antibiotics in a timely fashion.

background: the ctm- has been developed, validated, and utilized to study the processes of care involved in successful care transitions from inpatient to outpatient settings, but has not been utilized in the ed. objectives: we hypothesized that the ctm- could be successfully implemented in the ed without differential item difficulty by age, sex, education, or race, and would be associated with measures of quality of care and likelihood of following physician recommendations. methods: a descriptive study design based on exit surveys was used to measure ctm- scores and likelihood of following treatment recommendations.
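the discharge-instruction trial above compares prescription-pickup proportions across groups with z-tests. the sketch below shows a two-proportion z-test and wilson confidence intervals on made-up counts (e.g., lowest versus highest health-literacy stratum); the numbers are placeholders, not trial results.

from statsmodels.stats.proportion import proportions_ztest, proportion_confint

picked_up = [52, 81]   # patients who filled the antibiotic within the follow-up window
enrolled  = [90, 100]  # patients in each stratum

z_stat, p_value = proportions_ztest(picked_up, enrolled)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

for k, n in zip(picked_up, enrolled):
    lo, hi = proportion_confint(k, n, method="wilson")
    print(f"pickup {k}/{n} = {k/n:.1%} (95% ci {lo:.1%}-{hi:.1%})")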
surveys were administered to a daily cross-sectional sample of all patients leaving the ed between a- a by research assistants in an urban academic ed setting for weeks in november . we report means and standard deviations, and used analysis of variance to identify differences in ctm- scores for those who planned and did not plan to follow ed recommendations. results: surveys were completed; patients were ± years old, % black, % female, % with at least some college education, and % were admitted. the average ctm- score was . ± . (range - ). scores were not associated with sex (p = . ), race (p = . ), or education level (p = . ). lower ctm scores were associated with increasing age (p = . ) and with patient perceptions that the ed team was less likely to use words that they understood, listen carefully to them, inspire their confidence and trust, or encourage them to ask questions (all p < . ). those who reported they were ''very likely'' to follow ed treatment had an average score of ± , while those who were ''unlikely'' or ''very unlikely'' to follow ed treatment plans had an average score of ± (p = . ). conclusion: the ctm- performs well in the ed and exhibited differential item difficulty only by age; there was no significant difference by race, sex, or education level. furthermore, it is highly associated with likelihood of following physician recommendations. future studies will focus on the ability of ctm- scores to discriminate between patients who did or did not experience a subsequent ed visit or rehospitalization.

age and race were found to be significant predictors of the race pathway. regression of the data by race revealed that blacks (or . : ci . - . ; p < . ), hispanics (or . : ci . - . ; p = . ), and asians (or . : ci . - . ; p = . ) were more likely to enter the race cohort than were whites; however, much of this discrepancy is accounted for by age. the mean age of minority patients was years, while white patients were older at years (p = . ). conclusion: in a diverse demographic population we found that racial minorities were presenting at younger ages for chest pain and were more likely to receive cardiac testing at the bedside than their white counterparts, and hence were selected to a lower level of care (non-monitored unit).

background: expanding insurance coverage is designed to improve access to primary care and reduce use of emergency services. whether expanding coverage achieves this is of paramount importance as the united states prepares for the affordable care act. objectives: we examined ed and outpatient department use after the state children's health insurance program (schip) coverage expansion, focusing on adolescents (a major target group for schip) versus young adults (not targeted). we hypothesized that coverage would increase use of outpatient services and that emergency department use would decrease. methods: using the national ambulatory medical care survey and the national hospital ambulatory medical care survey, we analyzed years - as baseline and then compared use patterns in - after schip launch. primary outcomes were population-adjusted annual visits to the ed versus non-emergency outpatient settings. interrupted time-series analyses were performed on use rates to ed and outpatient departments between adolescents ( - years old) and young adults ( - years old) in the pre-schip and schip periods. outpatient-to-ed ratios were calculated and compared across time periods.
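the schip analysis above is an interrupted time-series comparison of visit rates before and after the coverage expansion. one common way to set this up is a segmented regression with a baseline trend, a post-policy level shift, and a post-policy slope change, sketched below on synthetic yearly rates; this framing is an assumption about the general method, not a reproduction of the authors' models or data.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

years = np.arange(1993, 2010)
post = (years >= 1998).astype(int)              # policy launch treated as the interruption (illustrative year)
rate = (180 + 2 * (years - 1993) + 25 * post + 4 * post * (years - 1998)
        + np.random.default_rng(2).normal(0, 5, years.size))

df = pd.DataFrame({
    "t": years - 1993,                          # time since start of the series
    "post": post,                               # level change after the policy
    "t_after": np.clip(years - 1998, 0, None),  # slope change after the policy
    "rate": rate,                               # visits per 1,000 adolescents (toy)
})
model = smf.ols("rate ~ t + post + t_after", data=df).fit()
print(model.params.round(2))   # baseline slope, level shift, and slope shift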
results: the mean number of outpatient adolescent visits increased by visits per persons ( % ci, - ), while there was no statistically significant increase in young adult outpatient visits across time periods. there was no statistically significant change in the mean number of adolescent ed visits across time periods, while young adult ed use increased by visits per persons ( % ci, - ). the adolescent outpatient-to-ed ratio increased by . ( % ci, . - . ), while the young adult ratio decreased by . across time periods ( % ci, - . to - . ). conclusion: since schip, adolescent non-ed outpatient visits increased while ed visits remained unchanged. in comparison to young adults, expanding insurance coverage to adolescents improved access to health care services and suggests a shift to non-ed settings. as an observational study, we were unable to control for secular trends during this time period; as an ecological study, we were also unable to examine individual variation. expanding insurance through the affordable care act of will likely increase use of outpatient services but may not decrease emergency department volumes.

background: cancer patients are receiving a greater proportion of their care on an outpatient basis. the effect of this change in oncology care patterns on ed utilization is poorly understood. objectives: to examine the characteristics of ed utilization by adult cancer patients. methods: between july and march , all new adult cancer patients referred to a tertiary care cancer centre were recruited into a study examining psychological distress. these patients were followed prospectively until september . the collected data were linked to administrative data from three tertiary care eds. variables evaluated in this study included basic

background: we have previously shown that reducing non-value-added activities through the application of the lean process improvement methodology improves patient satisfaction, physician productivity, and emergency department length of stay. objectives: in this investigation, we tested the hypothesis that non-value-added activities reduce physician job satisfaction. methods: to test this hypothesis, we conducted time-motion studies on attending emergency physicians working in an academic setting and categorized their activities into value-added (time in room with patient, time discussing cases and educating medical learners, time in room with patient and learner), necessary non-value-added activities (charting, sign-out, looking up labs), and unnecessary non-value-added activities (looking for things, looking for people, on the phone). the physicians were then surveyed using a -point likert scale to determine their relative satisfaction with each of the individual tasks ( = worst part of day, = best part of day). results: physicians spent % of their shift performing value-added work, % of their shift performing necessary non-value-added activities, and % of their shift performing unnecessary non-value-added activities (waste). weighted physician satisfaction (satisfaction × [percent time spent performing the activity / percent time engaged in the activity category]) was highest when the physician was performing value-added work ( . ) compared to performing either necessary non-value-added work ( . ) or waste ( . ). conclusion: the attending physicians we studied spent the majority of their time performing non-value-added activities, which were associated with lower satisfaction.
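the weighted satisfaction measure defined above (each task's satisfaction score weighted by its share of time within its activity category) can be made concrete with a short worked example under one reading of that formula; the task names, minutes, and scores below are hypothetical.

tasks = [
    # (category, task, minutes_observed, satisfaction_score)
    ("value_added", "bedside evaluation", 120, 4.6),
    ("value_added", "teaching learners", 40, 4.2),
    ("necessary_nva", "charting", 90, 2.8),
    ("necessary_nva", "sign-out", 20, 3.4),
    ("unnecessary_nva", "looking for people", 25, 1.9),
    ("unnecessary_nva", "phone calls", 35, 2.2),
]

# total minutes observed in each activity category
category_minutes = {}
for category, _, minutes, _ in tasks:
    category_minutes[category] = category_minutes.get(category, 0) + minutes

# weighted satisfaction per category: sum of score * (task's share of its category's time)
weighted = {}
for category, task, minutes, score in tasks:
    share = minutes / category_minutes[category]
    weighted[category] = weighted.get(category, 0) + score * share

for category, value in weighted.items():
    print(f"{category}: weighted satisfaction = {value:.2f}")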
application of process improvement techniques such as lean, which focus on reducing non-value-added work, may improve emergency physician job satisfaction.

background: rocuronium and succinylcholine are the most commonly used paralytics for rapid sequence intubation (rsi) in the ed. after rsi, patients need sustained sedation while they are mechanically ventilated. however, the longer duration of action of rocuronium may influence subsequent sedation dosing while the patient is therapeutically paralyzed. objectives: we hypothesized that patients who receive rocuronium would be more likely to receive lower doses of post-rsi sedation compared to patients who receive succinylcholine. methods: this was an observational, retrospective cohort study conducted in a tertiary, academic ed. consecutive adult patients who received rsi using etomidate for induction of sedation between / / and / / were included. patients were then categorized based on whether they received rocuronium or succinylcholine for paralysis. the dosing of post-rsi sedative infusions was compared at , , , and minutes after initiation between the two groups using the wilcoxon rank-sum test. results: a total of patients were included in the final analysis (rocuronium = , succinylcholine = ). mean age was and years in the rocuronium and succinylcholine groups, respectively (p = . ). there were no other baseline differences between groups with regard to demographics, reason for intubation, stroke, traumatic brain injury, glasgow coma scale score, pain scores, or vital signs. in the overall cohort, . % (n = ) of patients were given a sedative infusion or bolus in the ed. most patients were initiated on propofol (n = ) or midazolam (n = ) infusions. median propofol infusion rates at , , , and minutes were , , . , and mcg/kg/min in the rocuronium group and , , , and mcg/kg/min in the succinylcholine group, respectively. the difference was statistically significant at (p < . ) and (p = . ) minutes. median midazolam infusion rates at , , , and minutes were , , , and mg/hour in the rocuronium group and , , , and . mg/hour in the succinylcholine group, respectively. the difference was statistically significant at (p = . ) and (p = . ) minutes. conclusion: patients who receive rocuronium are more likely to receive lower doses of sedative infusions post-rsi due to sustained therapeutic paralysis. this may put them at risk for being awake under paralysis.

what is the impact of the implementation of an
there was a difference in presenting pain (p < . ), stress (p < . ), and anxiety (p < . ) among patients who received an opioid in the ed. there was a difference in presenting pain (p < . ) for patients discharged with an opioid prescription, but not for stress (p = . ) or anxiety (p = . ). conclusion: patient-reported pain, stress, and anxiety are higher among patients who received an opioid in the ed than in those who did not, but only pain is higher among patients who received a discharge prescription for an opioid.

methods: this was a prospective, randomized crossover study on the use of gvl and dl by incoming pediatric interns prior to advanced life support training. at the start of the study, the interns received a didactic session and expert modeling of the use of both devices for intubation. two scenarios were used: ( ) normal intubation with a standard airway and ( ) difficult intubation with tongue edema and pharyngeal swelling.
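the rocuronium versus succinylcholine abstract above compares post-rsi sedative infusion rates between groups at fixed time points with the wilcoxon rank-sum test. the sketch below applies that test to hypothetical propofol infusion rates (mcg/kg/min) at a single time point; the values are illustrative only.

from scipy.stats import ranksums

rocuronium_rates      = [20, 25, 20, 30, 15, 20, 25, 10, 20, 25]
succinylcholine_rates = [35, 40, 30, 45, 35, 30, 50, 40, 35, 30]

stat, p_value = ranksums(rocuronium_rates, succinylcholine_rates)
print(f"rank-sum statistic = {stat:.2f}, p = {p_value:.4f}")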
interns then intubated a laerdal simbaby in each scenario with both gvl and dl for a total of four randomized intubation scenarios. primary outcomes included time to successful intubation and the rate of successful intubation. the interns also rated their satisfaction with the devices using a visual analog scale ( - ) and chose their preferred device for their next intubation. results: interns were included in this study. in the normal airway scenario, there were no differences in the mean time for intubation with gvl or dl ( . ± . vs . ± . seconds, p = ns) or the number of interns who performed successful intubation ( vs , p = ns). in the difficult airway scenario, the interns took longer to intubate with gvl than dl ( . ± . vs . ± . seconds, p = . ), but there were no differences in the number of successful intubations ( vs , p = ns). interns rated their satisfaction higher for gvl than dl ( . ± . vs . ± . , p = . ), and gvl was chosen as the preferred device for their next intubation by a majority of the interns ( / , %). conclusion: for novice clinicians, gvl does not improve the time to intubation or intubation success.

objectives: to determine the time to intubation, the number of attempts, and the occurrence of hypoxia in patients intubated with a c-mac device versus those intubated using a standard laryngoscope. methods: this was a randomized controlled trial using exception from informed consent that included patients undergoing endotracheal intubation with a standard laryngoscope at an urban level i trauma center. eligible patients were randomized to undergo intubation using the c-mac or standard laryngoscopy. standard laryngoscopy was performed using a c-mac device laryngoscope with the video output obstructed to ensure equivalent laryngoscope blades in the two groups. data were collected by a trained research assistant at the patient's bedside and by video review by the investigators. the number of attempts made, the initial and lowest oxygen saturation (spo ), and the total time until the intubation was successful were recorded. hypoxia was defined as an oxygen saturation < %. data were compared with wilcoxon rank-sum and chi-square tests. results: thirty-eight patients were enrolled, ( % male, median age , range to , median spo %, range to ) in the standard laryngoscopy group and ( % male, median age , range to , median spo . %, range to ) in the c-mac group. the median number of attempts for standard laryngoscopy was , range to , and for c-mac was , range to (p = . ). the median time to intubation for the standard laryngoscopy group was seconds (range to ) and for the c-mac group was seconds (range to ) (p = . ). hypoxia was detected in / ( %) in the standard laryngoscopy group and / ( %) in the c-mac group (p = . ). the median decrease in oxygen saturation during the attempt was . % (range % to %) for the standard laryngoscopy group and . % (range % to %) for the c-mac group. conclusion: we did not detect a difference in the number of attempts, the occurrence of hypoxia, or the diagnosis of aspiration pneumonia between standard laryngoscopy and the c-mac. the time to successful intubation was shorter for patients intubated with the c-mac. the c-mac device appears to be superior to standard laryngoscopy for emergent endotracheal intubation. (originally submitted as a ''late-breaker.'')

background: aspiration pneumonia is a complication of endotracheal intubation that may be related to the difficulty of the airway procedure.
objectives: to determine the association of the device used, the time to intubation, the number of attempts to intubate, and the occurrence of hypoxia with the subsequent development of aspiration pneumonia. methods: this was a prospective observational study of patients undergoing endotracheal intubation by emergency physicians at an urban level i trauma center, conducted from / / until / / . the device used on the initial attempt to intubate was at the discretion of the treating physician. data were collected by a trained research assistant at the patient's bedside. the device used, the number of attempts made to intubate, the lowest oxygen saturation during the attempt, and the total time until intubation was successfully accomplished were recorded. patients' medical records were reviewed for the subsequent diagnosis of aspiration pneumonia. hypoxia was defined as an oxygen saturation < %. data were analyzed using multinomial logistic regression and odds ratios (or). results: patients were enrolled; ( %) subsequently developed aspiration pneumonia. were intubated with a standard laryngoscope (sl), using the c-mac, with an intubating laryngeal mask, and with nasotracheal intubation (ni) (or . , % ci = . - . ). comparison of individual devices versus sl did not show an association by device type. the median number of attempts for patients with aspiration pneumonia was , range to , and for those without was , range to (or . , % ci = . - . ). the median time to intubation for patients who developed aspiration pneumonia was seconds (range to ) and for those who did not was seconds (range to ) (or . , % ci = . - . ). hypoxia during intubation was detected in / ( %) in the aspiration pneumonia group and / ( %) in the no aspiration pneumonia group (or . , % ci = . - . ). conclusion: there was not an association between the device used, the number of attempts, the time to intubation, or the occurrence of hypoxia during the intubation and the subsequent occurrence of aspiration pneumonia. background: japanese census data estimate that million, or nearly % of the overall population, will be over age by the year . similar trends are apparent throughout the developed world. although increased patient age affects airway management, comprehensive information on emergency airway management for the elderly is lacking. objectives: we sought to characterize emergency department (ed) airway management for the elderly in japan, including success rate and major adverse events, using a large multi-center registry. methods: design and setting: we conducted a multicenter prospective observational study using the japanese emergency airway network (jean) registry of eds at academic and community hospitals in japan between and inclusive. data fields included ed characteristics, patient and operator demographics, methods of airway management, number of attempts, success rate, and adverse events. participants: patient inclusion criteria were all adult patients who underwent emergent tracheal intubation in the ed. primary analysis: patients were divided into two groups defined as follows: to years old and over years old. we describe primary success rates and major adverse events using simple descriptive statistics. categorical data are reported as proportions and % confidence intervals (cis). results: the database recorded patients (capture rate %) and met the inclusion criteria. of patients, patients were to years old ( %) and were over years old ( %).
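the registry analysis above reports categorical outcomes as proportions with % confidence intervals. purely as an illustration of that kind of summary (the counts below are hypothetical placeholders, not the registry's data), a wilson score interval for a success proportion can be computed as follows:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical example: 850 successful first-attempt intubations out of 1,000 patients.
low, high = wilson_ci(850, 1000)
print(f"first-attempt success: 85.0% (95% CI {low:.1%} to {high:.1%})")
```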
the older group had a significantly higher success rate at first-attempt intubation ( / ; . %, % ci . - . %) compared with the younger group ( / ; . %, % ci . - . %). the older group had similar major adverse event rates ( / ; . %, % ci . - . %) compared with the younger group ( / ; . %, % ci . - . %). (see table .) background: the degree to which a patient's report of pain is associated with changes in blood pressure, heart rate, and respiratory rate is not known. objectives: to determine to what degree a standardized painful stimulus effects a change in systolic blood pressure (sbp), diastolic blood pressure (dbp), heart rate (hr), or respiratory rate (rr), and to compare changes in vital signs between patients based on pain severity. methods: prospective observational study of healthy human volunteers. subjects had their sbp, dbp, hr, and rr measured prior to pain exposure, immediately after, and minutes after. pain exposure consisted of subjects placing their hand in a bath of degree water for seconds. the bath was divided into two sections; the larger half was the reservoir of cooled water monitored to be degrees, and the other half filled from constant overflow over the divider. water drained from this section into the cooling unit and was then pumped up into the base of the reservoir through a diffusion grid. subjects completed a mm visual analog scale (vas) representing their perceived pain during the exposure and graded their pain as minimal, moderate, or severe. data were compared using % confidence intervals. results: subjects were enrolled, mean pain vas mm, range to ; reported mild pain, moderate pain, and severe pain. the percent change from baseline in vital signs during the exposure and minutes after is presented in the table. conclusion: there was wide variation in reported pain among subjects exposed to a standard painful stimulus. there was a larger change in heart rate during the exposure among subjects who described a standardized painful exposure as moderate than in those who described it as severe. the small observed changes in blood pressure and respiratory rate seen during the exposure did not differ by pain report or persist after minutes. background: vital signs are often used to validate the intensity of pain. however, few studies have looked at the capacity of vital signs to estimate pain intensity, particularly in patients with a diagnosis that a majority of physicians would agree produces significant pain in the ed. objectives: to determine the association between pain intensity and vital signs in consecutive ed patients and in a sub-group of patients with diagnoses known to cause significant pain. methods: we performed a post-hoc analysis of prospectively acquired data in a cohort study done in an urban teaching hospital with computerized triage and nursing records. we included all consecutive ed adult patients (≥ years old) who had any level of pain intensity measured during triage, from march to november . the primary outcome was the mean heart rate, systolic and diastolic blood pressure for every pain intensity level from to on a verbal numerical scale. our secondary outcomes were the same but limited to patients with the following diagnoses: fracture, dislocation, and renal colic. we performed descriptive statistics and one-way and two-way anovas when appropriate. results: during our study period, , patients ≥ years old were triaged with a pain intensity of at least / and had a diagnosis known to cause significant pain. .
% of patients were female, with a mean pain intensity of . / , mean age of . years (± . ), and . % were ≥ years old. there was a statistically significant difference (p < . ) in mean heart rate, systolic and diastolic blood pressure for each level of pain intensity; for example, the difference between / and / for mean heart rate was . beats per minute, for systolic pressure was . mmhg, and for diastolic . mmhg. results were similar for painful diagnoses: the difference for mean heart rate was . beats per minute, for systolic pressure was . mmhg, and for diastolic . mmhg. however, these differences are not clinically significant. conclusion: although our study is a post hoc analysis, pain intensity, heart rate, and systolic and diastolic pressures during triage are usually reliable data, and a prospective study would likely produce the same result. these vital signs cannot be used to estimate or validate pain intensity in the emergency department. % had a positive urine drug screen. multivariate logistic regression analyses revealed the following factors to be significantly associated with the risk of having an abnormal head ct: association with seizure (p = . ); length of time of loss of consciousness, ranging from none to - min to > min (p = . ); alteration of consciousness (p = . ); post-traumatic amnesia (p = . ); alcohol intake prior to injury (p = . ); and initial ed gcs (p = . ). conclusion: in an emergency department cohort of patients with traumatic brain injury, symptoms including loss of or alteration in consciousness, seizure, post-traumatic amnesia, and alcohol intake appear to be significantly associated with abnormal findings on head ct. these clinical findings on presentation may be useful in helping triage head injury patients in a busy emergency department, and can further define the need for urgent or emergent imaging in patients without clearly apparent injuries. background: the etiology of neurogenic shock is classically attributed to diminished peripheral vascular resistance (pvr) secondary to loss of sympathetic outflow to the peripheral vasculature. however, the sympathetic nervous system also controls other key elements of the cardiovascular system, such as the heart and capacitance vessels, and disruptions in their function could complicate the hemodynamic presentation. objectives: we sought to systematically examine the hemodynamic profiles of a series of trauma patients with neurogenic shock. methods: consecutive trauma patients with documented spinal cord injury complicated by clinical shock were enrolled. hemodynamic data including systolic and diastolic blood pressure, heart rate (hr), impedance-derived cardiac output, pre-ejection period (pep), left ventricular ejection time (lvet), and calculated systemic pvr were collected in the ed. data were normalized for body surface area, and a validated integrated computer model of human physiology (the guyton model) was used to analyze and categorize the hemodynamic profiles based on the etiology of the hypotension using a systems analysis. correlation between markers of sympathetic outflow (hr, pep, lvet) and shock etiology category was examined. results: of patients with traumatic neurogenic shock, the etiology of shock was a decrease in pvr in ( %; % ci to %), loss of vascular capacitance in ( %; to %), and mixed peripheral resistance and capacitance responsible in ( %; to %). the markers of sympathetic outflow had no correlation with any of the elements in the patients' hemodynamic profiles.
conclusion: neurogenic shock is often considered to have a specific, well-characterized pathophysiology. results from this study suggest that neurogenic shock can have multiple mechanistic etiologies and represents a spectrum of hemodynamic profiles. this understanding is important for the treatment decisions made in the management of these patients. a -year ( - ) pre-post intervention study of trauma patients requiring massive blood transfusion (mbt) was performed. we divided the population into two cohorts: a pre-protocol group (pre), which included trauma patients receiving mbt not aided by a protocol, and a post-protocol group (post), who underwent mbt via the massive blood transfusion protocol (mbtp). patient demographics, -hour blood component totals, timing of blood component delivery, trauma injury severity score (iss), initial glasgow coma scale (gcs) score, trauma mechanism, and patient mortality data were collected and analyzed using fisher's exact tests, student's t-tests, and mann-whitney u tests. results: fifty-two patients were included for study. median times to delivery of first products were reduced for prbcs ( minutes), ffp ( minutes), and platelets ( minutes) between the pre and post cohorts. median time to delivery of any subsequent blood product was significantly reduced ( minutes) in the post cohort (p = . ). the median number of blood products delivered was increased by . units for prbcs, units for ffp, . units for platelets, and unit for cryoprecipitate after implementation of the mbtp. the percentage of patients receiving higher blood product ratios (> : ) was reduced between the pre and post cohorts for the prbc to ffp ( % reduction) and prbc to platelet ratio groups ( % reduction). despite improved transfusion timing and ratios, we found no significant difference in mortality (p = . ) between the pre and post cohorts when we adjusted for injury severity. conclusion: protocolized delivery of massive blood transfusion might reduce time to product availability and delivery, though it is unclear how this affects patient mortality in all us trauma centers. background: burns are common injuries that can result in significant scarring leading to poor function and disfigurement. unlike mechanical injuries, burns often progress both in depth and size over the first few days after injury, possibly due to inflammation and oxidative stress. a major gap in the field of burns is the lack of an effective therapy that reduces burn injury progression. objectives: since mesenchymal stem cells (msc) have been shown to improve healing in several injury models, we hypothesized that species-specific msc would reduce injury progression in a rat comb burn model. methods: using a gm brass comb preheated to degrees celsius, we created four rectangular burns, separated by three unburned interspaces, on both sides of the backs of male sprague-dawley rats ( g). the interspaces represented the ischemic zones surrounding the central necrotic core. left untreated, most of these interspaces become necrotic. in an attempt to reduce burn injury progression, rats were randomized to tail vein injections of ml rat-specific msc cells/ml (n = ) or normal saline (n = ) minutes after injury. tracking of the stem cells was attempted by injecting several rats with quantum dot-labeled msc. results: by four days post-injury, all of the interspaces in the control rats ( / , %) became necrotic, while in the experimental group, / ( %) of the interspaces became necrotic (fisher's exact test; p < . ).
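the comb burn comparison above uses fisher's exact test on a x table (interspaces that did or did not become necrotic in msc-treated versus control rats). a minimal sketch of that kind of comparison, with made-up counts standing in for the abstract's numbers, might look like this:

```python
from scipy.stats import fisher_exact

# Rows: MSC-treated vs. saline control; columns: necrotic vs. spared interspaces.
# Counts are illustrative placeholders, not the study's data.
table = [[10, 20],   # MSC-treated: 10 necrotic, 20 spared
         [28, 2]]    # control:     28 necrotic,  2 spared

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```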
at days, the percentage of the unburned interspaces that became necrotic in the msc-treated group was significantly less than in the control group ( % vs. %, p < . ). we were unable to identify any quantum dot-labeled msc in the injured skin. no adverse reactions or wound infections were noted in rats injected with msc. conclusion: intravenous injection of rat msc reduced burn injury progression in a rat comb burn model. background: although basic demographics of bicyclists in accidents have been described, there is a paucity of data describing the street surface involved in accidents and whether designated bicycle roadways offer protection. this lack of information limits informed attempts to change infrastructure in a way that will decrease morbidity and/or mortality of cyclists. objectives: to identify road surface types involved in pedal cyclist injuries and determine the relationship between injury severity and the use of designated bicycle roadways (dbr) versus non-designated roadways (ndr). we hypothesized that more severe injuries would happen at intersections regardless of dbr versus ndr. methods: this retrospective cohort study reviewed the trauma database from a level i trauma center in tucson, az. we identified all bicyclists in the database injured in accidents involving a motor vehicle from january , , through december , . the patients were then linked to a local government database that documents location (latitude/longitude) and direction of travel of the cyclist. seventy-eight total incidents were identified and categorized as occurring on a dbr versus ndr and occurring at an intersection versus not at an intersection. results: only one patient who arrived at the trauma center died. fifty-one of the accidents ( %) occurred on dbrs; % of accidents occurring on dbrs took place in intersections. conversely, % of accidents on ndrs occurred outside of intersections. the odds of an injury occurring at an intersection versus not at an intersection were . times higher ( % ci: . - . ) for dbrs compared to ndrs. the odds of a trauma being severe (admitted) versus not severe (discharged home) were . times higher ( % ci: . - . ) when a collision occurred not at an intersection versus at an intersection. conclusion: contrary to our hypothesis, in this study group severe injuries were more likely outside of an intersection. however, intersections on dbrs were identified as problematic, as cyclists on a dbr were more likely to be injured in an intersection. future city planning could target improved cyclist safety in intersections. background: minor thoracic injury (mti) is frequent, and a significant proportion of patients will still have moderate to severe pain at days. there is a lack of known risk factors to guide specific treatment at ed discharge. objectives: to determine risk factors for having pain (≥ / on a numerical pain intensity score from to ) at days in a population of minor thoracic injury patients discharged from the ed. methods: a prospective multi-center cohort study was conducted in four canadian eds from november to january . all consecutive patients, years and older, with mti (with or without rib fracture), a normal chest x-ray, and discharged from the ed were eligible. a standardized clinical and radiological evaluation was done at and weeks. standardized phone interviews were done at and days. pain evaluation occurred at five time points (ed visit, and weeks, and days). using a pain trajectory model (sas), we planned to identify groups with different pain evolution at days.
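the trajectory approach described above groups patients by the shape of their repeated pain scores and then checks how cleanly patients fit their assigned group. the sketch below is only a rough analogue of the authors' sas group-based trajectory model: it clusters simulated five-point pain trajectories with a gaussian mixture and reports the average posterior membership probability per group, mirroring the adequacy check described in the abstract.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Each row is one patient's pain scores at five follow-up points; the values are
# simulated, and this mixture model is not the study's actual SAS procedure.
rng = np.random.default_rng(0)
low    = rng.normal([6, 3, 2, 1, 1], 1.0, size=(60, 5))
medium = rng.normal([7, 5, 4, 3, 3], 1.0, size=(25, 5))
high   = rng.normal([8, 7, 7, 6, 6], 1.0, size=(15, 5))
pain = np.clip(np.vstack([low, medium, high]), 0, 10)

gmm = GaussianMixture(n_components=3, random_state=0).fit(pain)
groups = gmm.predict(pain)           # trajectory group assignments
posterior = gmm.predict_proba(pain)  # posterior membership probabilities

# Adequacy check in the spirit of the abstract: the mean posterior probability
# of the assigned group should be high for every trajectory group.
for g in range(3):
    mean_post = posterior[groups == g, g].mean()
    print(f"group {g}: n = {(groups == g).sum()}, mean posterior = {mean_post:.2f}")
```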
the final model was based on the importance of differences in pain evolution, confidence intervals, and the number of patients in each group. to judge the adequacy of the final model, we examined whether the posterior probabilities (i.e., a participant's probability of belonging to a certain trajectory group) averaged at least % for each trajectory group. then, using multinomial logistic regression with the low-risk group as the control group, we identified significant predictors of patients being in the moderate- and high-risk groups for having pain at days. results: in our cohort of , patients, , had an evaluation at days. we identified three groups at low ( %), moderate ( . %), and high risk ( . %) of having pain ≥ / at days. using risk factors identified by univariate analysis, we created a model to identify patients at risk containing the following predictors: age ≥ years old, women, current smoker, two or more rib fractures, complaint of dyspnea, and saturation < % at the initial visit. posterior probabilities for the low-, moderate-, and high-risk groups were %, %, and %. conclusion: to our knowledge, this is the first study to identify potential risk factors for having pain at days after minor thoracic injury. these risk factors should be validated in a prospective study to guide specific treatment plans. the use of ultrasound to evaluate traumatic optic neuropathy benjamin burt, lisa montgomery, cynthia garza meissner, sanja plavsic-kupesic, nadah zafar ttuhsc -paul l foster school of medicine, el paso, tx background: whenever head trauma occurs, there is the possibility for a patient to have an optic nerve injury. the current method to evaluate optic nerve swelling is to look for proptosis. however, by the time proptosis presents, significant damage has already occurred. therefore, there is a need to establish a method to evaluate nerve injury prior to the development of proptosis. objectives: fundamental to understanding the pathophysiology of optic nerve injury and repair is an understanding of the optic nerve's temporal response to trauma, including blood flow changes and vascular reactivity. the aim of our study was to assess the dependability and reproducibility of ultrasound techniques to sequence optic nerve healing and monitor the vascular response of the ophthalmic artery following an optic nerve crush. methods: the rat's orbit was imaged prior to and following a direct injury to the optic nerve, at hours and at days. d, d, and color doppler techniques were used to detect blood flow and the course of the ophthalmic artery and vein, to evaluate the course and diameter of the optic nerve, and to assess the extent of optic nerve trauma and swelling. the parameters used to evaluate healing over time were the pulsatility and resistance indices of the ophthalmic artery. results: we have established baseline ultrasound measurements of the optic nerve diameter, normal resistance and pulsatility indices of the ophthalmic artery, and morphological assessment of the optic nerve in a rat model. longitudinal assessment of d and d ultrasound parameters was used to evaluate the vascular response of the ophthalmic artery to optic nerve crush injury. we have developed a rat model system to study traumatic optic nerve injury. the main advantages of ultrasound are low cost, non-invasiveness, lack of ionizing radiation, and the potential to perform longitudinal studies.
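the doppler parameters named above have standard definitions: the resistive index is (peak systolic velocity minus end-diastolic velocity) divided by peak systolic velocity, and the pulsatility index divides the same difference by the time-averaged mean velocity. a small helper illustrating the formulas, with example velocities rather than the study's measurements:

```python
def resistive_index(psv: float, edv: float) -> float:
    """RI = (PSV - EDV) / PSV."""
    return (psv - edv) / psv

def pulsatility_index(psv: float, edv: float, tamv: float) -> float:
    """PI = (PSV - EDV) / time-averaged mean velocity."""
    return (psv - edv) / tamv

# Illustrative ophthalmic artery velocities in cm/s (not data from the rat study).
psv, edv, tamv = 10.0, 2.5, 5.0
print(f"RI = {resistive_index(psv, edv):.2f}, PI = {pulsatility_index(psv, edv, tamv):.2f}")
```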
our preliminary data indicate that d and d color doppler ultrasound may be used for the evaluation of ophthalmic artery and total orbital perfusion following trauma. once baseline ultrasound and doppler measurements are defined, there is the opportunity to translate the rat model to evaluate patients with head trauma who are at risk for optic nerve swelling and to assess the usefulness of treatment interventions. background: alcoholism is a chronic disease that affects an estimated . million american adults. a common presentation to the emergency department (ed) is a trauma patient with altered sensorium who is presumed by physicians to be alcohol intoxicated based on their olfactory sense. often ed physicians may leave patients suspected of alcohol intoxication aside until the effects wear off, potentially missing major trauma as the source of confusion or disorientation. this practice often results in delays in diagnosing acute, potentially life-threatening injuries in patients with presumed alcohol intoxication. objectives: this study will determine the accuracy of physicians' olfactory sense for diagnosing alcohol intoxication. methods: patients suspected of major trauma in the ed underwent an evaluation by the examining physician for the odor of alcohol as well as other signs of intoxication. each patient had a determination of blood alcohol level. alcohol intoxication was defined as a serum ethanol level ≥ mg/dl. data were reported as means with % confidence intervals ( % ci) or proportions with inter-quartile ranges (iqr %- %). results: one hundred fifty-one patients ( % males) were enrolled in the study, median age years (iqr - ). the median score for glasgow coma scale was . the level of training of the examining physician was a median of pgy (iqr pgy -attending). prevalence of alcohol intoxication was % ( % ci: % to %). operating characteristics of physician assessment of alcohol intoxication: sensitivity % ( % ci: % to %), specificity % ( % ci: % to %), positive likelihood ratio . ( % ci: . to . ), negative likelihood ratio . ( % ci: . to . ), and accuracy % ( % ci: % to %). patients who were falsely suspected of being intoxicated made up . % ( % ci: % to %). conclusion: although the physicians had a high degree of accuracy in identifying patients with alcohol intoxication based on their olfactory sense, they still falsely overestimated intoxication in a significant number of non-intoxicated trauma patients. background: optimal methods for education and assessment in emergency and critical care ultrasound training for residents are not known. methods of assessment often rely on surrogate endpoints which do not assess the ability of the learner to perform the imaging and integrate the imaging into diagnostic and therapeutic decisions. we designed an educational strategy that combines asynchronous learning to teach imaging skills and interpretation with a standardized assessment tool using a novel ultrasound simulator to assess the learner's ability to acquire and interpret images in the setting of a standardized patient scenario. objectives: to assess the ability of emergency medicine and surgical residents to integrate and apply information and skills acquired in an asynchronous learning environment in order to identify pathology and prioritize relevant diagnoses using an advanced cardiac ultrasound simulator. methods: em r residents and r surgical residents completed an online focused training program in cardiac ultrasonography (iccu elearning, https://www.caeiccu.com/lms).
this consisted of approximately hours of intensive training in cardiac ultrasound. residents were then given cases with a patient scenario that lacked significant details that would suggest a specific diagnosis. the resident was then given a list of possible diagnoses and asked to rank the top five diagnoses in order from most likely to least likely. each resident (blinded to the pathology displayed by the simulator) then imaged using an ultrasound simulator. after imaging, the residents were given the same list of potential diagnoses and asked to rank them again from - . results: overall, residents ranked the correct diagnosis in the top five significantly more times post-ultrasound than pre-ultrasound. additionally, the residents made the correct diagnosis significantly more times post-ultrasound than pre-ultrasound. similar patterns occurred for congestive heart failure, pericardial effusion with tamponade, and pleural effusion. there was no significant difference pre- and post-ultrasound for pulmonary embolism and anterior infarction. conclusion: an asynchronous online learning program significantly improves the ability of emergency medicine and surgical residents to correctly prioritize the correct diagnosis after imaging with a standardized pathology imaging simulator. mark favot, jacob manteuffel, david amponsah henry ford hospital, detroit, mi background: em clerkships are often the only opportunity medical students have to spend a significant amount of time caring for patients in the ed. it is imperative that students gain exposure to as many of the various fields within em as possible during this time. if the exposure of medical students to ultrasound is left to the discretion of the supervising physicians, we feel that many students would complete an em clerkship with limited skills and knowledge in ultrasound. the majority of medical students receive no formal training in ultrasound during medical school, and we believe that the em clerkship is an excellent opportunity to fill this educational gap. objectives: to evaluate the usefulness and effectiveness of a focused ultrasound curriculum for medical students in an em clerkship at a large, urban, academic medical center. methods: prospective cohort study of fourth-year medical students doing an em clerkship. as part of the clerkship requirements, the students have a portion of the curriculum dedicated to the fast exam and ultrasound-guided vascular access. at the end of the month they took a written test, and a month later they were given a survey via e-mail regarding their ultrasound experience. em residents also completed the test to serve as a comparison group. all data analysis was done using sas . . scores were integers ranging between and . descriptive statistics are given as count, mean, standard deviation, median, minimum, and maximum for each group. due to the non-gaussian nature of the data and small group sizes, a wilcoxon two-sample test was used to compare the distributions of scores between the groups. results: in the table, the distribution of scores was compared between the residents (controls) and the students (subjects). the mean and median scores of the student group were higher than those of the resident group. the difference in scores between the two groups was statistically significant (p = . ).
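the score comparison above relies on a wilcoxon two-sample (rank-sum) test, which is equivalent to the mann-whitney u test. a minimal sketch with placeholder integer test scores rather than the clerkship study's data:

```python
from scipy.stats import mannwhitneyu

# Placeholder written-test scores; not the study's data.
student_scores  = [18, 17, 19, 16, 20, 18, 17, 19]
resident_scores = [15, 16, 14, 17, 15, 16, 13, 16]

stat, p_value = mannwhitneyu(student_scores, resident_scores, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.4f}")
```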
conclusion: our data reveal that after completing an em clerkship with time devoted to learning ultrasound for the fast exam and vascular access, fourth-year medical students are able to perform better than em residents on a written test. what remains to be determined is whether their skills in image acquisition and in performance of ultrasound-guided vascular access procedures also exceed those of em residents. results: there were respondents (total response rate . %). compared to non-em students, students pursuing em ( students, . %) were more drawn to their specialty for work hour control (p < . ) and shorter residency length (p < . ). em students were less likely than non-em students to be drawn to their chosen specialty for future academic opportunities (p < . ). em students formed their mentorships by referral significantly more than non-em students (p < . ), though there was no statistical difference in the quality of existing mentorships among students. of the students not currently and never formerly interested in em, the most common response ( . %) for why they did not choose em was the lack of a strong mentor in the field. conclusion: the results confirmed previous findings of lifestyle factors drawing students to em. future academic opportunities were less of a draw for em students than for students pursuing other specialties. lack of mentorship in the field was the most common reason given for why students did not consider em. given the lack of direct em exposure until late in the curriculum of most medical schools, mentorship may be particularly important for em, and future study should focus on this area. background: misdiagnosis is a major public health problem. dizziness leads to million visits annually in the us, including . million to the emergency department (ed). despite extensive ed workups, diagnostic accuracy remains poor, with at least % of strokes missed in those presenting with dizziness. ed physicians need and want support, particularly regarding the best method for diagnosis. strong evidence now indicates the bedside oculomotor exam is the best method of differentiating central from peripheral causes of dizziness. objectives: after a vertigo day that includes instruction in head impulse testing, emergency medicine residents will feel comfortable discharging a patient with signs of vestibular neuritis and a positive head impulse test without ordering a ct scan. methods: postgraduate year - emergency medicine residents participated in a four-hour vertigo day. we developed a mixed cognitive and systems intervention with three components: an online game that began and ended the day, a didactic taught by dr. newman-toker, and a series of small group exercises. the small group sessions included the following: a question and answer session with the lecturer; vertigo special tests (cerebellar assessment, dix-hallpike, epley maneuver); a head impulse hands-on tutorial using a mannequin; and a video lecture on other tests useful in vertigo evaluation (nystagmus, test of skew, vestibulo-ocular reflex, ataxia). results: thirty emergency medicine residents were studied. before and after the intervention the residents were given a survey in which one question asked ''in a patient with acute vestibular syndrome and a history and exam compatible with vestibular neuritis, i would be willing to discharge the patient without neuroimaging based on an abnormal head impulse test result that i elicited''.
resident answers were based on a seven-point likert scale from strongly agree to strongly disagree. twenty-five residents completed both surveys. of the seven residents who changed their responses pre to post, a significant proportion ( %) changed their answer from disagree/neutral to agree after the vertigo day (mcnemar's test, p value = . ). conclusion: in this single-center study, teaching head-impulse testing as part of a vertigo day increases resident comfort with discharging a patient with vestibular neuritis without a ct scan. background: previous studies have been inconsistent in determining the effect of increased ed census on resident workload and productivity. we examined resident workload and productivity after the closure of a large urban ed near our facility, which resulted in a rapid % increase in our census. objectives: we hypothesized that the closure of a nearby hospital, with a resulting influx of ed patients to our facility, would not change resident productivity. methods: this computer-assisted retrospective study compared new patient workups per hour and patient load before and after the closure of a large nearby hospital. specifically, new patient workups per hour and the pm patient census per resident were examined for a one-year period in the calendar year prior to the closing and also for one year after the closing. we did not include the four-month period surrounding the closure in order to determine the long-term overall effect. background: emergency medicine residents use simulation for training due to multiple factors, including the acuity of certain situations they are faced with and the rarity of others. current training on high-fidelity mannequin simulators is often critiqued by residents over the physical exam findings present, specifically the auscultatory findings. this detracts from the realism of the training and may also lead a resident down a different diagnostic or therapeutic pathway. wireless remote programmed stethoscopes represent a new tool for simulation education which allows any sound to be wirelessly transmitted to a stethoscope receiver. objectives: our goal was to determine if a wireless remote programmed stethoscope was a useful adjunct in simulation-based cases using a high-fidelity mannequin. our hypothesis was that this would represent a useful adjunct in the simulation education of emergency medicine residents. methods: starting june , pgy - emergency medicine residents were assessed in two simulation-based cases using pre-determined scoring anchors. an experimental randomized crossover design was used in which each resident performed a simulation case with and without a remote programmed stethoscope on a high-fidelity mannequin. scoring anchors and surveys were used to collect data, with differences of means calculated. results: fourteen residents participated in the study. residents rated the physical exam findings as most realistic in the case with the adjunct in / ( %) and preferred the use of the adjunct in / ( %). on a five-point likert scale, with being the most realistic, the adjunct-associated case averaged . as compared to . without (difference of means . , p = . ). average resident scores were . / with the use of the adjunct and . / without (difference of means . , p = . ). average total times were : with the adjunct as compared to : without.
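because each resident in the stethoscope study served as their own control (a crossover design), the reported difference of means is naturally assessed with a paired comparison; the abstract reports only differences of means and p-values, so the paired t-test below is one reasonable way to do it, shown with hypothetical five-point realism ratings rather than the study's data.

```python
from statistics import mean
from scipy.stats import ttest_rel

# Hypothetical paired realism ratings (1-5) for the same residents
# with and without the wireless stethoscope adjunct; not the study's data.
with_adjunct    = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 5]
without_adjunct = [3, 3, 4, 2, 3, 4, 3, 3, 4, 3, 2, 4, 3, 3]

diff_of_means = mean(with_adjunct) - mean(without_adjunct)
t_stat, p_value = ttest_rel(with_adjunct, without_adjunct)
print(f"difference of means = {diff_of_means:.2f}, paired t = {t_stat:.2f}, p = {p_value:.4f}")
```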
conclusion: a wireless remote programmed stethoscope is a useful adjunct in the simulation training of emergency medicine residents. residents noted physical exam findings to be more realistic, preferred its use, and showed a near-significant improvement in scores when using the adjunct. background: prior studies predict an ongoing shortage of emergency physicians to staff the nation's eds, especially in rural areas. to address this, em organizations have discussed broadening access to acgme- or aoa-accredited em residency programs for physicians who previously trained in another specialty and focusing on physicians already practicing in rural areas. objectives: to investigate whether em program directors (pds) from allopathic and osteopathic residency programs would be willing to accept applicants previously trained in other specialties and whether this willingness is modified by applicants' current practice in rural areas. methods: a five-question web-based survey was sent to u.s. em pds asking about their policies on accepting residents with past training and from rural practices. questions included whether a pd would accept a resident with prior training in other specialties, how many years after this training the applicant would still be a competitive candidate, and whether a physician practicing in a rural region would be more likely to be accepted to the program. different characteristics of the residency programs were recorded, including length of program, years in existence, size, type, and location of program. we compared responses by program characteristics using the chi-square test. results: of the ( %) pds responding to date, a large majority ( %) reported they do accept applicants with previous residency training, although directors of osteopathic programs were less likely to accept these applicants ( % vs % for allopathic; p < . ). overall, % of pds reported no limit on the length of time from prior training to when applicants are accepted at an em program. % reported it is very or possibly realistic they would accept a candidate who had completed training and was board certified in another specialty. a majority of all respondents ( %) felt a physician practicing in a rural setting might be viewed as a more favorable candidate, even if the resident would only be in the program for years after receiving training credit. directors of newer programs (< years of existence) were more likely to view these candidates favorably than those of older programs ( % vs %; p = . ). conclusion: there appear to be many em residency programs that would at least review the application and consider accepting a candidate who trained in another specialty. a qualitative assessment of emergency medicine self-reported strengths todd guth university of colorado, aurora, co background: self-reflection has been touted as a useful way to assess the acgme core competencies. objectives: the purpose of this study is to gain insight into resident physician professional development through analysis of self-perceived strengths. a secondary purpose is to discover potential topics for self-reflective narrative essays relating to the acgme core competencies. methods: design: a small qualitative study was performed to explore the self-reported strengths of emergency medicine (em) residents in a single four-year residency. participants: all residents, regardless of year of training, were asked to report their self-perceived strengths.
observations: residents were asked: ''what do you feel are your greatest strengths as a resident? provide a quick description.'' the author and another reviewer identified themes within each year of residency with abraham maslow's conscious competence conceptual framework in mind. occurrences of each theme were counted by the reviewers and organized according to frequency. once the top ten themes for each year of residency were identified and exemplar quotes selected, the two reviewers identified trends. inter-rater agreements were calculated. results: representing unconscious incompetence, the first trend was the reported presence of ''enthusiasm and a positive attitude'' from residents early in their training that decreased further along in training. additionally, a ''willingness and motivation to improve and learn'' was reported as a strength throughout all the years of training but was most frequently reported in the first two years of residency. entering into conscious incompetence, the second trend identified was ''recognition of limitations and openness to constructive feedback,'' which was mentioned frequently in the second and third years of residency. demonstrating conscious competence, the third trend identified was the increase in identification of the strengths of ''educational leadership, teamwork skills and communication, and departmental patient flow and efficiency'' in the later years of residency. conclusion: analysis of self-reported strengths has helped to identify both themes within each year of residency and trends among the years of residency that can serve as areas to explore in self-reflective narratives relating to the acgme core competencies. training. pofu can also be used to assess the acgme core competency of practice-based learning. the exact form or frequency of pofu assessment among various em residencies, however, is not currently known. objectives: we aimed to survey em residencies across the country to determine how they fulfill the pofu requirement and whether certain program structure variables were associated with different pofu systems. we hypothesized that implementation of pofu systems among em residencies would be highly variable. methods: in this irb-approved study, all program directors of acgme allopathic em residencies were invited to complete a -question survey on their current approaches to pofu. respondents were asked to describe their current pofu system's characteristics and rate its ease of use, effectiveness, and efficiency. data were collected using surveymonkey(tm) and reported using descriptive statistics. results: of residencies surveyed, ( %) submitted complete data. . % were completed by program directors, and over three-fourths ( . %) of em residencies require monthly completion of pofus. the mean total pofus required per year was ( % ci - ), with a median of and a range of - . almost / ( %) of residencies use an electronic pofu system. most ( %) -year em residencies use an electronic pofu system, compared with half ( %) of -year residencies (difference %, p = . , % ci . %- . %). seven commercially available electronic programs are used by % of the residencies, while % use a customized product. most respondents ( %) rated their pofu system as easy to use, but less than half felt it was an effective learning tool ( %) or an efficient one ( %). one-third ( %) would use a different pofu system if available, and almost half ( %) would be interested in using a multi-residency pofu system.
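the comparison of electronic pofu uptake between -year and -year programs above is reported as a difference in proportions with a % ci. a minimal sketch of that calculation using a wald interval, with placeholder counts rather than the survey's numbers:

```python
import math

def diff_in_proportions(x1: int, n1: int, x2: int, n2: int, z: float = 1.96):
    """Difference between two proportions with a Wald 95% confidence interval."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Placeholder counts: 28 of 35 four-year programs vs. 30 of 60 three-year programs
# reporting an electronic POFU system (illustrative only).
diff, low, high = diff_in_proportions(28, 35, 30, 60)
print(f"difference = {diff:.1%} (95% CI {low:.1%} to {high:.1%})")
```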
conclusion: em residency programs use many different strategies to fulfill the rrc requirement for pofu. the number of required pofus and the method of documentation vary considerably. about two-thirds of respondents use an electronic pofu system. less than half feel that pofu logs are an effective or efficient learning tool. background: certification of procedural competency is requisite to graduate medical education. however, little is known regarding which platforms are best suited for competency assessment. simulators offer several advantages as an assessment modality, but evidence is lacking regarding their use in this domain. furthermore, perception of an assessment environment has an important influence on the quality of learning outcomes, and procedural skill assessment is ideally conducted on a platform accepted by the learner. objectives: to ascertain if a simulator performs as well as an unembalmed cadaver with regard to residents' perception of their ability to demonstrate procedural competency during ultrasound (us)-guided internal jugular vein (ij) catheterization. methods: in this cross-sectional study at an urban community hospital during july of , residents in their second or third year of training from a -year em residency program performed us-guided catheterizations of the ij on both an unembalmed cadaver and a simulator manufactured by blue phantom. after the procedure, residents completed an anonymous survey ascertaining how adequately each platform permitted their demonstration of proficiency on predefined procedural steps. answers were provided on a likert scale of to , with being poor and being excellent. p-values < . were considered educationally significant. results: the median overall rating of the simulator (s) as an assessment platform was similar to that of the cadaver (c), with scores of . and . respectively, p = . . median ratings for permitting the demonstration of specific procedural steps were as follows: conclusion: senior em residents positively rate the blue phantom simulator as an assessment platform, and similarly to a cadaver, with regard to permitting their demonstration of procedural competency for us-guided ij catheterization, but did prefer the cadaver to a greater degree when identifying and guiding the needle into the ij. methods: in fall , wcmc and wcmc-q students taking the course completed a -question pre- and post-test. wcmc-q students also completed a post-course single-station objective structured clinical examination (osce) that evaluated their ability to identify and perform eight actions critical for a first responder in an emergency situation (table ). results: on both campuses, mean post-test scores were significantly higher than mean pre-test scores (p ≤ . ). on the pre-test, mean wcmc student scores were significantly higher than those of wcmc-q students (p = . ); however, no difference was found in mean post-test scores (p = . ). there was no association between the scores on the osce (mean = . , sd = . ) and the post-test (p = . ), even after adjusting for a possible evaluator effect (table ). conclusion: the clinical skills course was effective in enhancing student knowledge in both qatar and new york, as evidenced by the significant improvement in scores from the pre- to post-tests. the course was able to bring wcmc-q student scores, and presumably knowledge, up to the same level as wcmc students. students performed well on the osce, suggesting that the course was able to teach them the critical actions required of a first responder.
the lack of association between the post-test and osce scores suggests that student knowledge does not independently predict the ability to learn and demonstrate the critical actions required of a first responder. future studies will evaluate whether the course affects the students' clinical practice. table (critical first responder actions assessed in the osce): assess breathing; assess circulation; call ems; call ems and assess abcs prior to other interventions; immobilize; localize and control bleeding; splint fractured extremity. background: medwar (a medical wilderness adventure race) teaches knowledge and skills specific to wilderness medicine by incorporating simulated medical scenarios into a day-long adventure race. this event has gained acceptance nationally in wilderness medicine circles as an excellent way to appreciate the challenges of wilderness medicine; however, its effectiveness as a teaching tool has not yet been verified. objectives: the objective of this study was to determine if improvement in simulated clinical and didactic performance can be demonstrated by teams participating in a typical medwar event. methods: we developed a complex clinical scenario and a written exam to test the basic tenets that are reinforced through the medwar curriculum. teams were administered the test and scored on a standardized scenario immediately before and after the midwest medwar race. teams were not given feedback on their pre-race performance. scenario performance was based on the number of critical actions correctly performed in the appropriate time frame. data from the scenario and written exams were analyzed using a standard paired difference t-test. results: a total of teams participated in both the pre- and post-event scenarios. the teams' pre-race scenario performance was . % (sd = . , n = ) of critical actions met, compared to a post-race performance of . % (sd = . , n = ). the mean improvement was . % (sd = . , n = , % ci . , . ), with a significant paired two-tailed t-test (p ≤ . ). a total of individual subjects took the written pre- and post-tests. the written scores averaged . % pre-race (sd = . , n = ) and . % post-race (sd = . , n = ). the mean improvement was . % (sd = . , n = , % ci . , . ), with a significant paired two-tailed t-test (p ≤ . ). conclusion: medwar participants demonstrated a significant improvement in both written exam scores and the management of a simulated complex wilderness medical scenario. this strongly suggests that medwar is an effective teaching platform for both wilderness medicine knowledge and skills. methods: ed residents and faculty of an urban, tertiary care, level i trauma center were asked to complete an anonymous survey ( / - / ). participants ranked statements on a five-point likert scale ( = strongly disagree, = strongly agree). statements covered four main domains of barriers related to: ) education/training, ) communication, ) ed environment, and ) personal beliefs. respondents were also asked if they would call a palliative care (pc) consult for ed clinical scenarios (based on established triggers). results: / ( %) of eligible participants completed the survey ( residents, faculty); average age was years, % ( / ) were male, and % ( / ) were caucasian. respondents identified two major barriers to ed-pc provision: lack of -hour availability of the pc team (mean score . ) and lack of access to complete medical records ( . ). other domain barriers included communication-related issues (mean . ), such as access to family or primary providers; the ed environment ( . ), for example a chaotic setting with time constraints; education/training ( . ) related to pain/pc; and personal beliefs regarding end-of-life care ( . ).
all respondents agreed that they would call a pc consult for a 'hospice patient in respiratory distress', and a majority ( %) would consult pc for 'massive intracranial hemorrhage, traumatic arrest, and metastatic cancer'. however, traditional inpatient triggers, like frequent readmissions for organ failure (dementia, congestive heart failure, and obstructive pulmonary disease exacerbations), were infrequently ( %) chosen for pc consult. conclusion: to enhance pc provision in the ed setting, two main ed physician-perceived barriers will likely need to be addressed: lack of access to medical records and lack of -hour availability of the pc team. ed physicians may not use the same criteria to initiate pc consults as compared to the traditionally established inpatient pc consult trigger models. percent of charts with an mse by ait prior to resident evaluation (a measure of reduced diagnostic uncertainty and decision-making), and ( ) ed volume. results: there were no educationally significant differences in productivity or acuity between the pre-ait and post-ait groups. an mse was recorded in the chart prior to resident evaluation in . % of cases. ed volume rose by . % between periods. conclusion: ait did not affect the productivity or acuity of patients seen by em s. while some volume was directed away from residents by ait (patients treated-and-released by ait only), overall volume increased and made up the difference. this is similar to previously reported rankings that program directors gave to the same criteria. although medical students agreed with program directors on the importance of most aspects of the nrmp application, areas of discordance included a higher medical student ranking for extracurricular activities and a lower relative ranking for aoa status than program directors. this can have implications for medical student mentoring and advising in the future. background: emergency care of older adults requires specialized knowledge of their unique physiology, atypical presentations, and care transitions. older adults often require distinctive assessment, treatment, and disposition. emergency medicine (em) residents should develop expertise and efficiency in geriatric care. older adults represent over % of most emergency department (ed) volumes, yet many em residencies lack curricula or assessment tools for competent geriatric care. the geriatric emergency medicine competencies (gemc) are high-impact geriatric topics developed to help residencies meet this demand. objectives: to examine the effect of a brief gemc educational intervention on em resident knowledge. methods: a validated -question didactic test was administered at six em residencies before and after a gemc-focused lecture delivered in the summer and fall of . scores were analyzed as individual questions and in defined topic domains using a paired student's t-test. results: a total of exams were included. the testing of didactic knowledge before and after the gemc educational intervention had high internal reliability ( . %). the intervention significantly improved scores in all domains (table ). a graded increase in geriatric knowledge occurred by pgy year, with the greatest improvement seen at the pgy level (table ). conclusion: even a brief gemc intervention had a significant effect on em resident knowledge of critical geriatric topics. a formal gemc curriculum should be considered in training em residents for the demands of an aging population. the overall procedure experience of this incoming class was limited.
most r s had never received formal education in time management, conflict of interest management, or safe patient hand-offs. the majority lacked confidence in their acute and chronic pain management skills. these entry-level residents lacked foundational skill levels in many knowledge areas and procedures important to the practice of em. ideally, medical school curricular offerings should address these gaps; in the interim, residency curricula should incorporate some or all of these components essential to physician practice and patient safety. background: the american heart association and international liaison committee on resuscitation recommend that patients with return of spontaneous circulation following cardiac arrest undergo post-resuscitation therapeutic hypothermia. in post-cardiac arrest patients presenting with a rhythm of vf/vt, therapeutic hypothermia has been shown to reduce neurologic sequelae and decrease overall mortality. objectives: to explore clinical practice regarding the use of therapeutic hypothermia and compare survival outcomes in post-cardiac arrest patients. a secondary objective was to assess whether the initial presenting cardiac arrest rhythm (ventricular fibrillation/ventricular tachycardia (vf/vt) versus pulseless electrical activity (pea) or asystole) was associated with differences in outcomes. methods: a retrospective medical record review was conducted for all adult (≥ years) post-cardiac arrest patients admitted to the icu of an academic tertiary care centre (annual ed census , ) from - . data were extracted by trained research personnel using a standardized data collection tool. results: patients were enrolled. mean (sd) age was ( ) and . % were male. of the ( . %) patients treated with hypothermia, ( . %) presented with an initial rhythm of vf/vt and ( . %) presented with pea or asystole. nine ( . %) patients with vf/vt were treated with therapeutic hypothermia and discharged from hospital, compared to ( . %) patients with pea or asystole (Δ . %; % ci: . %, . %). of patients not treated with hypothermia, ( . %) presented with vf/vt, ( . %) presented with pea or asystole, and ( . %) initial rhythms were unknown. fifteen ( . %) patients with vf/vt not treated with hypothermia were discharged from hospital, compared to ( . %) patients with pea or asystole (Δ . %; % ci: . %, . %). regardless of initial presenting rhythm or initiation of therapeutic hypothermia, ( . %) discharged patients had good neurological function as assessed by the cerebral performance category (cpc score - ). conclusion: although recommended, post-cardiac arrest therapeutic hypothermia was not routinely used. patients with vf/vt who were treated with hypothermia had better outcomes than those with pea or asystole. further research is needed to assess whether cooling patients with presenting rhythms of pea or asystole is warranted. background: chronic obstructive pulmonary disease (copd) is a major public health problem in many countries. the course of the disease is characterised by episodes, known as acute exacerbations (ae), when symptoms of cough, sputum production, and breathlessness become much worse. the standard prehospital management of patients suffering from an aecopd includes oxygen therapy, nebulised bronchodilators, and corticosteroids. high flow oxygen is used routinely in the prehospital setting for breathless patients with copd.
there is little high-quality evidence on the benefits or potential dangers in this setting, but audits have shown increased mortality, acidosis, and hypercarbia in patients with aecopd treated with high flow oxygen. objectives: to compare standard high flow oxygen treatment with titrated oxygen treatment for patients with an aecopd in the prehospital setting. methods: cluster randomized controlled parallel group trial comparing high flow oxygen treatment with titrated oxygen treatment in the prehospital setting. results: in an intention-to-treat analysis (n = ), the risk of death was significantly lower in the titrated oxygen arm compared with the high flow oxygen arm for all patients and for the subgroup of patients with confirmed copd (n = ). overall mortality was % ( deaths) in the high flow oxygen arm compared with % ( deaths) in the titrated oxygen arm; mortality in the subgroup with confirmed copd was % ( deaths) in the high flow arm compared with % ( deaths) in the titrated oxygen arm. titrated oxygen treatment reduced mortality compared with high flow oxygen by % for all patients (p = . ) and by % for the patients with confirmed chronic obstructive pulmonary disease (p = . ). patients with copd who received titrated oxygen according to the protocol were significantly less likely to have respiratory acidosis or hypercapnia than were patients who received high flow oxygen. conclusion: titrated oxygen treatment significantly reduced mortality, hypercapnia, and respiratory acidosis compared with high flow oxygen in aecopd. these results provide strong evidence to recommend the routine use of titrated oxygen treatment in patients with breathlessness and a history or clinical likelihood of copd in the prehospital setting. (originally submitted as a ''late-breaker.'') trial registration: australian new zealand clinical trials register actrn . background: toxic particulates and gases found in ambulance exhaust are associated with acute and chronic health risks. the presence of such materials in areas proximate to ed ambulance parking bays, where emergency services' vehicles are often left running, is potentially of significant concern to ed patients and staff. objectives: investigators aimed to determine whether the presence of ambulances correlated with ambient particulate matter concentrations and toxic gas levels at the study site ed. methods: the ambulance exhaust toxicity in healthcare-related exposure and risk [aether] program conducted a prospective observational study at an academic urban ed / level i trauma center. environmental ambient gas was sampled over a continuous five-week period from september to october . two sampling locations in the public triage area (the public patient drop-off area without ambulances) and three sampling locations in the ambulance triage area were randomized for -hour monitoring windows with a temporal resolution of minutes to obtain days of non-contiguous data for each location. concentrations of particulate matter less than . microns in aerodynamic size (pm . ), oxygen, hydrogen sulfide (h s), and carbon monoxide (co), as well as the lower explosive limit for methane (lel), were monitored with professionally calibrated devices. ambulance traffic was recorded through offline review of / security video footage of the site's ambulance bays. results: , measurements at the public triage nurse desk space revealed pm . concentrations with a mean of . ± .
µg/m³ (p < . , unpaired t test; median . µg/m³; maximum . µg/m³). oxygen levels remained steady throughout the study period; co, h s, and lel were not detected. ambulance activity levels had the highest correlations with pm . concentrations at the ambulance triage foyer (r = . ) and desk area (r = . ) where patients wait and ed staff work - hr shifts. conclusion: ed spaces proximate to ambulance parking bays had higher levels of pm . than areas without ambulance traffic. concentrations of ambient particulate matter in acute care environments may pose a significant health threat to patients and staff. an ems ''pit crew'' model improves ekg and stemi recognition times in simulated prehospital chest pain patients sara y. baker , salvatore silvestri , christopher d. vu , george a. ralls , christopher l. hunter , zack weagraff , linda papa orlando regional medical center, orlando, fl; florida state university college of medicine, orlando, fl background: prehospital teams must minimize time to ekg acquisition and stemi recognition to reduce overall time from first medical contact to reperfusion. auto-racing ''pit crews'' model rapid task completion by pre-assigning roles to team members. objectives: we compared time-to-completion of key tasks during chest pain evaluation in ems teams with and without pre-assigned roles. we hypothesized that ems teams using the ''pit crew'' model would improve time to recognition and treatment of stemi patients. methods: a randomized, controlled trial of paramedic students was conducted over months at orlando medical institute, a state-approved paramedic training center. we compared a standard ems chest pain management algorithm (control) with a pre-assigned tasks (''pit crew'') algorithm (intervention) in the evaluation of simulated chest pain patients. students were randomized into groups of three; intervention and control groups did not interact after randomization. all students reviewed basic prehospital chest pain management and either the standard or pre-assigned tasks algorithm. groups encountered three simulated patients. laerdal simman® software was used to track completion of tasks: taking vital signs, iv access, ekg acquisition and interpretation, asa administration, hospital stemi notification, and total time on scene. results: we conducted simulated-patient encounters ( control / intervention encounters). mean time-to-completion of each task was compared in the control and intervention groups respectively. time to obtain vital signs was : vs. : min (p = . ); time to asa administration was : vs : min (p < . ); time to ekg acquisition was : vs : min (p < . ); time to ekg interpretation was : vs : min (p < . ); time to iv access was : vs : min (p = . ); time to stemi notification was : vs : min (p < . ); and time to scene completion was : vs : min (p < . ). conclusion: paramedic student teams with pre-assigned roles (the ''pit crew'' model) were faster to obtain vital signs, administer asa, acquire and interpret the ekg, and notify the hospital of stemi, and had a shorter overall time on scene during simulated patient encounters. further study with experienced ems teams in actual patient encounters is necessary to confirm the relevance of these findings. background: use of automated external defibrillators (aed) has remained low in the u.s. understanding the effect of neighborhoods on the probability of having an aed used in the setting of a public arrest may provide important insights for future placement of aeds.
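the aether abstract above compares mean pm . concentrations between the public and ambulance triage areas with an unpaired t test. a minimal sketch of that kind of comparison in python, using simulated concentration samples rather than the study's measurements:

```python
# Hypothetical PM2.5 samples (ug/m^3) for two sampling locations; not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
public_triage = rng.lognormal(mean=2.0, sigma=0.6, size=500)     # public drop-off area
ambulance_triage = rng.lognormal(mean=2.4, sigma=0.6, size=500)  # ambulance bay side

# Unpaired (Welch) t test, analogous to the abstract's comparison of mean concentrations.
t_stat, p_value = stats.ttest_ind(ambulance_triage, public_triage, equal_var=False)

print(f"mean public    = {public_triage.mean():.1f} ug/m^3")
print(f"mean ambulance = {ambulance_triage.mean():.1f} ug/m^3")
print(f"Welch t = {t_stat:.2f}, p = {p_value:.3g}")
```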
objectives: to determine associations between the racial and income composition of neighborhoods (as defined by u.s. census tracts), individual arrest characteristics, and whether bystanders or first responders initiate aed use. methods: cohort study using surveillance data prospectively submitted by emergency medical services systems and hospitals from u.s. sites to the cardiac arrest registry to enhance survival between october , and december , . neighborhoods were defined as high-income vs. low-income based on the median household income being above or below $ , and as white or black if > % of the census tract was of one race. neighborhoods without a predominant racial composition were defined as integrated. arrests that occurred within a public location (excluding medical facilities and airports) were eligible for inclusion. hierarchical multi-level modeling, using stata v . , was used to determine the association between individual and census tract characteristics and whether an aed was used. results: of , eligible cases, an aed was used in arrests ( . %) by a first responder (n = , , . %) or bystander (n = , . %). patients whose arrest was witnessed (odds ratio [or] . ; % confidence interval [ci] . - . ) were more likely to have an aed used (table). when compared to high-income white neighborhoods, arrest victims in low-income black neighborhoods were least likely to have an aed used (or . ; % ci . - . ). arrest victims in low-income white (or . ; % ci . - . ) and low-income integrated (or . ; % ci . - . ) neighborhoods were also less likely to have an aed used. conclusion: arrest victims in black and low-income neighborhoods are least likely to have an aed used by a layperson or first responder. future research is needed to better understand the reasons for low rates of aed use for cardiac arrests in these neighborhoods. the impact of an educational intervention on the pre-shock pause interval among patients experiencing an out-of-hospital cardiac arrest jonathan studnek , eric hawkins , steven vandeventer carolinas medical center, charlotte, nc; mecklenburg ems agency, charlotte, nc background: pre-shock pause duration has been associated with survival to hospital discharge (std) among patients experiencing out-of-hospital cardiac arrest (oohca) resuscitation. recent research has demonstrated that for every -second increase in this interval there is an % decrease in std. objectives: determine if a decrease in the pre-shock pause interval for patients experiencing oohca could be realized after implementation of an educational intervention. methods: this was a retrospective analysis of data obtained from a single als urban ems system from / / to / / and / / to / / . in august , an educational intervention was designed and delivered to approximately paramedics emphasizing the importance of reducing the time off chest during cpr. specifically, the time period just prior to defibrillation was emphasized by having rescuers count every th compression and pre-charge the defibrillator on the th compression. in order to determine if this change resulted in process improvement, months of data were assessed before and months after the educational intervention. pre-shock pause was the outcome variable and was defined as the time period after compressions ceased until a shock was delivered. this interval was measured by a cpr feedback device connected to the defibrillator.
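the aed-use analysis above describes hierarchical multi-level modeling of individual and census-tract characteristics in stata. as a simplified, single-level sketch (not the study's hierarchical model), odds ratios with confidence intervals can be obtained from an ordinary logistic regression; all variable names and data below are hypothetical:

```python
# Simplified, single-level sketch of an AED-use model (the study used hierarchical
# multi-level modeling in Stata); variable names are hypothetical, data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "aed_used": rng.binomial(1, 0.05, n),
    "witnessed": rng.binomial(1, 0.5, n),
    "tract_type": rng.choice(
        ["high_income_white", "low_income_white",
         "low_income_black", "low_income_integrated"], size=n),
})

# Logistic regression; exponentiated coefficients are odds ratios.
model = smf.logit(
    "aed_used ~ witnessed + C(tract_type, Treatment('high_income_white'))",
    data=df).fit(disp=False)

odds_ratios = pd.DataFrame({
    "OR": np.exp(model.params),
    "CI_low": np.exp(model.conf_int()[0]),
    "CI_high": np.exp(model.conf_int()[1]),
})
print(odds_ratios.round(2))
```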
inclusion criteria were adult patients who required at least one defibrillation and had the cpr feedback device connected during the defibrillation attempt. analysis was descriptive, utilizing means and % ci as well as a wilcoxon rank sum test to assess differences between the two time periods. results: in the pre-intervention period there were patients who received defibrillations compared to patients receiving defibrillations in the post-intervention phase. the mean duration of the pre-shock pause pre-intervention was seconds ( % ci - ) while the post-intervention duration was seconds ( % ci - ). the difference in pre-shock pause duration was statistically significant with p < . . conclusion: these data indicate that after a simple educational intervention emphasizing decreasing time off chest prior to defibrillation, the pre-shock pause duration decreased. future research must describe the sustainability of this intervention as well as the effects this process measure may have on outcomes such as survival to hospital discharge. background: the broselow tape (bt) has been used as a tool for estimating medication dosing in the emergency setting. the obesity trend has demonstrated a tendency towards insufficient pediatric weight estimations from the bt, and thus potential under-dosing of resuscitation medications. objectives: this study compared drug dosing based on the bt with dosing from a novel electronic tool (et) that accounts for provider estimation of body habitus. methods: data were obtained from a prospective convenience sample of children ages to years arriving to a pediatric emergency department. a clinician performed an assessment of body habitus (average/underweight, overweight, or obese), blinded to the patient's actual weight and parental weight estimate. parental estimate of weight and measured length and weight were collected. epinephrine dosing was calculated from the measured weight, the bt measurement, as well as from a smart-phone tool based on the measured length and clinician's estimate of body habitus, and a modified tool (mt) incorporating the parent estimate of habitus. the wilcoxon rank-sum test was used to compare median percent differences in dosing. results: one hundred children (mean age years) were analyzed; % were overweight or obese. clinicians correctly identified children as overweight/obese % of the time (ci . - . ). adding parent estimate of weight improved this to a sensitivity of % (ci . - . ). the median difference between the weight-based epinephrine dose and bt dose was %. for the et the median difference from the weight-based dose was % (p = . compared to the bt), and for the mt was . % (p < . compared to the bt). when a clinically significant difference was defined as ± % of the actual dose, bt was within that range % of the time, et was within range % of the time (p = . ), and mt was within range % of the time. background: in most out-of-hospital cardiac arrest (ohca) events, a call to - - is the first action by bystanders. accurate diagnosis of cardiac arrest by the call taker depends on the caller's verbal description. if cardiac arrest is not suspected, then no telephone cpr instructions will be given. objectives: we measured the effect of a change in the ems call taker question sequence on the accuracy of diagnosis of cardiac arrest by - - call takers.
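the pre-shock pause analysis above compares pause durations before and after the educational intervention with a wilcoxon rank sum test. a minimal sketch of that unpaired comparison, using simulated durations rather than the study's data:

```python
# Hypothetical pre-shock pause durations (seconds) before and after the educational
# intervention; not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
pre_intervention = rng.gamma(shape=4.0, scale=5.0, size=120)   # longer pauses
post_intervention = rng.gamma(shape=4.0, scale=3.0, size=130)  # shorter pauses

# Wilcoxon rank-sum test (Mann-Whitney U) for two independent time periods.
u_stat, p_value = stats.mannwhitneyu(pre_intervention, post_intervention,
                                     alternative="two-sided")
print(f"median pre  = {np.median(pre_intervention):.1f} s")
print(f"median post = {np.median(post_intervention):.1f} s")
print(f"Mann-Whitney U = {u_stat:.0f}, p = {p_value:.3g}")
```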
methods: we retrospectively reviewed the cardiac arrest registry to enhance survival (cares) dataset for january , through june , from a city, population , , with a longstanding telephone cpr program (apco). we included ohca cases of any age who were in arrest prior to the arrival of ems and for whom resuscitation was attempted. in early , - - call takers were taught to follow a revised telephone script that emphasized focused questions, assertive control of the caller, and provision of hands-only cpr instructions. the medical director personally explained the reasons for the changes, emphasizing the importance of assertive control of the caller and the comparative safety of chest compressions in patients not in cardiac arrest. beginning in , call recordings were reviewed regularly with feedback to the call taker by the - - center leadership. the main outcome measure was sensitivity of the - - call taker in diagnosing cardiac arrest. bystander cpr was reported by ems crews attending the event. we compared with and using the chi-square test and odds ratios (or). results: there were ohca cases in , cases in , and in the first half of ( / , population). the mean age was ± years, and % of the events were witnessed. before the revision, % of ohca cases were identified by - - dispatchers; after the revised questioning sequence, % were identified (or . , % ci . - . ). the false positive rate changed little (from /month to /month). the mean time to question callers was unchanged ( vs seconds). bystander cpr was performed in . % of events in , . % in , and . % of events in (p < . ). conclusion: emphasis on scripted assessment improved sensitivity without loss of specificity in identifying ohca. with repeated feedback, it translated to an increase in victims receiving bystander cpr. in an out-of-hospital cardiac arrest population confirmed by autopsy salvatore silvestri, christopher hunter, george ralls, linda papa orlando regional medical center, orlando, fl background: quantitative end-tidal carbon dioxide (etco ) measurements (capnography) have consistently been shown to be more sensitive than qualitative (colorimetric) ones, and the reliability of capnography for assessing airway placement in low perfusion states has sometimes been questioned in the literature. objectives: this study examined the rate of capnographic waveform presence in an intubated out-of-hospital cardiac arrest cohort and its correlation to endotracheal tube location confirmed by autopsy. our hypothesis is that capnography is % accurate in determining endotracheal tube location, even in low perfusion states. methods: this cross-sectional study reviewed a detailed prehospital cardiac arrest database that regularly records information using the utstein style. in addition, the ems department quality manager routinely logs the presence of an alveolar (four-phase) capnographic waveform in this database. the study population included all cardiac arrest patients from january , through december , managed by a single ems agency in orange county, florida. patients were included if they had endotracheal intubation performed, had capnographic measurement obtained, failed to regain return of spontaneous circulation (rosc), and had an autopsy performed. the main outcome was the correlation between the presence of an alveolar waveform and the location of the ett at autopsy. results: during the study period, cardiac arrests were recorded. of these, had an advanced airway placed (ett or laryngeal tube airway), and had no rosc.
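the dispatch-recognition analysis above reports call-taker sensitivity, false positives, and an odds ratio for identification before versus after the revised script; the capnography study that follows also hinges on 2×2 classification counts. a minimal sketch of those 2×2-table calculations, with made-up counts:

```python
# Made-up 2x2 counts (not the study's data): rows = period, columns = OHCA identified?
import numpy as np
from scipy import stats

#                 identified  missed
table = np.array([[120,        80],    # before revised script
                  [170,        30]])   # after revised script

sens_before = table[0, 0] / table[0].sum()
sens_after = table[1, 0] / table[1].sum()

# Odds ratio for identification after vs. before, with a Wald 95% CI.
or_ = (table[1, 0] * table[0, 1]) / (table[1, 1] * table[0, 0])
se_log_or = np.sqrt((1.0 / table).sum())
ci = np.exp(np.log(or_) + np.array([-1.96, 1.96]) * se_log_or)

chi2, p, _, _ = stats.chi2_contingency(table, correction=False)
print(f"sensitivity before = {sens_before:.0%}, after = {sens_after:.0%}")
print(f"OR = {or_:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f}), chi-square p = {p:.3g}")
```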
of the advanced airway cases, were managed with an ett. autopsies were performed on of these patients and resulted in our study cohort. the location of the ett at autopsy was recorded on all of these cases. capnographic waveforms were recorded in the field in all of these study patients, and % of the tubes were located within the trachea at autopsy. the sensitivity of capnography in determining proper endotracheal tube location was % in this study. conclusion: in our study, the presence of a capnographic waveform was % reliable in confirming proper placement of endotracheal tubes placed in out-of-hospital patients with poor perfusion states. results: over variables were presented to the ems medical directors responding ( % of the survey population captured). among the myriad of responses, ( %) initiate cardiopulmonary resuscitation (cpr) at compressions to ventilations consistent with ilcor/aha guidelines. seven ( %) initiate continuous chest compressions from the start of cpr with no pause and interposed ventilations. nine ( %) begin chest compressions only during the first - minutes, with either passive oxygenation by oxygen mask (six; %) or no oxygen (three; %). airway management following non-invasive oxygenation and ventilation by primary endotracheal intubation occurs in systems ( %), while six ( %) use supraglottic devices. fourteen ( %) allow paramedics to decide between endotracheal and supraglottic device placement. thirty systems ( %) utilize continuous waveform capnography. the initial approach to non-ems witnessed ventricular fibrillation is chest compression prior to first defibrillation in systems ( %). eighteen systems ( %) escalate defibrillation energy settings, with four systems ( %) utilizing dual sequential defibrillation. twenty ( %) initiate therapeutic hypothermia in the field. conclusion: wide variability in ca care standards exists in america's largest urban ems systems in mid- , with many current practices promoting more continuity in chest compressions than specified in the ilcor/aha guidelines. endotracheal intubation, a past mainstay of ca airway management, is deemphasized in many systems. immediate defibrillation of non-ems witnessed ventricular fibrillation is uncommon. objectives: determine the out-of-hospital cardiac arrest survival in this area of puerto rico using the utstein method. methods: prospective observational cohort study of adult patients presenting with an out-of-hospital cardiac arrest to the upr hospital ed. study endpoints will be survival and neurologically intact survival at hospital discharge, months, and months. results: a total of consecutive cardiac arrest events were analyzed for a period of years. one-hundred fifteen events met criteria for primary cardiac etiology ( . %). the average age for this group was . years. there were female ( . %) and male ( . %) participants. the average time to start cpr was . minutes. transportation to the ed was . % by ems and . % by private vehicle. a total of events were witnessed ( . %). the survival rate to hospital admission was . %. the overall cardiac arrest survival was . % and overall neurologically intact survival was . %. neurologically intact survival at and months was . %. the rate of bystander cpr in our population was . % with a survival rate of . %. conclusion: survival from out-of-hospital cardiac arrest in the area served by the upr hospital is low but comparable to other cities in the us as reported by the cdc cardiac arrest registry to enhance survival (cares).
this low survival rate might be due to low bystander cpr rate and prolonged time to start cpr. background: hyperventilation has been directly correlated with increased mortality for out-of-hospital cpr. ems providers may hyperventilate patients at levels above national bls guidelines. real-time feedback devices, such as ventilation timers, have been shown to improve cpr ventilation rates towards bls standards. it remains unclear if the combination of a ventilation timer and pre-simulation instruction would influence overall ventilation rates and potentially reduce undesired hyperventilation. objectives: this study measured ventilation rates of standard cpr (and pre-instruction on effects of hyperventilation) compared to cpr with the use of a commercial ventilation timer (and pre-instruction on effects of hyperventilation). we propose that use of a ventilation timer, measuring and displaying to ems providers real-time ventilations delivered, will have no difference in ventilation rates when comparing these groups. methods: this prospective study placed ems providers into four groups: two controls measuring ventilation rates before ( a) and after instruction ( b) on the deleterious effects of hyperventilation, and a concurrent intervention pair with before ( a) and after instruction ( b), with the second pair measuring ventilation rates with a ventilation timer that provides immediate feedback on respirations given. ventilation rates were measured for a -second period after one minute of simulated cpr using mannequins. the control set without instruction ( a, n = ) averaged . breaths ( % ci = . - . ) and with instruction ( b, n = ) averaged . breaths ( % ci = . - . ). the intervention set without instruction ( a, n = ) averaged . breaths ( % ci = . - . ) and with instruction ( b, n = ) averaged . breaths ( % ci = . - . ). there was a significant improvement (p = . ) in ventilation rates with use of a ventilation timer (control group versus intervention group regardless of pre-instruction). there was no statistically significant difference between groups with respect to instruction alone (p = . ). conclusion: the use of a ventilation timer significantly reduced overall ventilation rates, providing care closer to bls guidelines. the addition of pre-simulation instruction added no significant benefit to reducing hyperventilation. background: in , the american heart association (aha) recommended a compression rate of (roc) / min and a depth of compressions (doc) at least inches for effective cpr. as an educational tool for lay rescuers, the aha as adopted the catch phrase ''push hard, push fast''. objectives: in this irb-exempt study, we sought to determine if persons without formal cpr training could perform non-ventilated cpr as well as those who have been trained in the past or those currently certified. methods: a convenience sample of patrons of the new york state fair was asked to perform minutes of hands-only cpr on a prestan pp-am- m adult cpr manikin. these devices provide visual indicators of acceptable rate and depth of compressions. each subject was video recorded on a dell latitude laptop computer with a logitech quick cam using logitech quick cam . . for windows software. results: a total of volunteers ( male, female) aged - years participated: were never certified (nc) in cpr, were previously certified (pc), and were currently certified (cc). there was no difference in age across the groups. the cc group had a higher proportion of females (chi-square = . , p < . ). 
cc volunteers sustained roc and doc for an average of . seconds as compared to an average of . seconds (pc) and . seconds (nc), respectively (f = . , p < . ). the cc group maintained a roc closer to / min (mean . /min) when compared to the pc (mean . /min) and nc (mean . /min) groups (f = . , p < . ). a higher proportion of volunteers in the cc group were able to perform adequate doc (chi-square = . , p < . ) and correct hand placement (chi-square = . , p < . ) when compared to the other two groups. conclusion: compared to the target roc and doc, none of the groups did well and only subjects met target roc/doc. increased out-of-hospital cardiac arrest survivability due to lay rescuer intervention is only assured if cpr is effectively administered. the effect and benefit of maintaining formal cpr training and certification are clear. background: more than , out-of-hospital cardiac arrests (ohcas) occur annually in the united states (us). automated external defibrillators (aeds) are life-saving devices in public locations that can significantly improve survival. an estimated million aeds have been sold in the us; however, little is known about whether locations of aeds match ohcas. these data could help determine optimal placement of future aeds and targeted cpr/aed training to improve survival. objectives: we hypothesized that the majority (> %) of aeds are not located in close proximity ( feet) to the occurrence of cardiac arrests in a major metropolitan city. methods: this was a retrospective review of prospectively collected cardiac arrest data from philadelphia ems from january , until december , . included were ohcas of presumed cardiac etiology in individuals years of age or older. excluded were ohcas of presumed traumatic etiology, cases where resuscitation was terminated at the scene, and those dead on arrival. aed locations in philadelphia were obtained from myheartmap, a database of installed and wall-mounted aeds in pennsylvania. we used gis mapping software to visualize where ohcas occurred relative to where aeds were located and to determine the radius of ohcas to aeds. arrests within a , , and foot radius of aeds were identified using the attribute location selection option in arcgis. the lengths of radii were estimated based on the average time it would take for a person to walk to and from an aed ( feet minutes; feet minutes; feet minutes). results: we mapped , ohcas and , aeds in philadelphia county. ohcas occurred in males ( %; / ) and the mean age was . years. ventricular fibrillation occurred in % ( / ). aeds were primarily located in schools/universities ( %), office buildings ( %), and residential buildings ( %). aeds were not identified within feet in % ( , ) of ohcas, within feet of % ( , ) of ohcas, and within feet in % ( , ) of ohcas. the figure (large black circles) illustrates aed/ohca within feet on the left and feet on the right. conclusion: aeds were rarely close to the locations of ohcas, which may be a contributor to low cardiac arrest survival rates. innovative models to match aed availability with ohcas should be explored. (originally submitted as a ''late-breaker.'') background: early and frequent epinephrine administration is advocated by acls; however, epinephrine research has been conducted primarily with standard cpr (std). active compression-decompression cpr with an impedance threshold device (acd-cpr + itd) has become the standard of care for out-of-hospital cardiac arrest in our area.
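the aed proximity analysis above selects arrests that fall within fixed walking radii of aeds using arcgis selection tools. outside a gis package, the same counts can be sketched with a nearest-neighbour query on projected coordinates; all coordinates and radii below are placeholders, not the study's values:

```python
# Rough sketch of an AED-proximity count: for each (projected) OHCA coordinate,
# find the distance to the nearest AED and count arrests within a walking radius.
# Coordinates and radii are made up; the study used ArcGIS attribute selection.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(7)
aed_xy = rng.uniform(0, 20_000, size=(800, 2))    # hypothetical projected coords (feet)
ohca_xy = rng.uniform(0, 20_000, size=(1500, 2))

tree = cKDTree(aed_xy)
nearest_dist, _ = tree.query(ohca_xy)             # distance to nearest AED, in feet

for radius in (400, 800, 1600):                   # placeholder walking radii (feet)
    frac = (nearest_dist <= radius).mean()
    print(f"ohcas with an aed within {radius} ft: {frac:.0%}")
```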
the hemodynamic effects of iv epinephrine under this technique are not known. objectives: to determine the hemodynamic effects of iv epinephrine in a swine model undergoing acd-cpr+itd. methods: six female swine ( ± kg) were anesthetized, intubated, and mechanically ventilated. intracranial, thoracic aorta, and right atrial pressures were recorded via indwelling catheters. carotid blood flow (cbf) was recorded via doppler. etco , spo , and ekg were monitored. ventricular fibrillation was induced and went untreated for minutes. three minutes each of standard cpr (std), std-cpr+itd, and acd-cpr+itd was performed. at minute of the resuscitation, µg/kg of iv epinephrine was administered and acd-cpr+itd was continued for minute. statistical analysis was performed with a paired t-test. results: aortic pressure and calculated cerebral and carotid perfusion pressures increased from std < std+itd < acd-cpr+itd (p ≤ . ). epinephrine administered during acd-cpr+itd significantly increased mean aortic ( ± vs ± , p = . ), cerebral ( ± vs ± , p = . ), and coronary perfusion pressures ( ± vs ± , p = . ); however, mean cbf and etco decreased (respectively ± vs ± . , p = . ; ± vs ± , p = . ). conclusion: the administration of epinephrine during acd-cpr+itd significantly increased markers of macrocirculation, while significantly decreasing etco , a proxy for organ perfusion. while the calculated cerebral perfusion pressures increased, the directly measured cbf decreased. this calls into question the ability of calculated perfusion pressures to accurately reflect blood flow and oxygen delivery to end organs. background: during cardiac arrest most patients are placed on % oxygen with assisted ventilations. after return of spontaneous circulation (rosc), % oxygen is typically continued for an extended time. animal data suggest that immediate post-arrest titration of oxygen by pulse oximetry produces better neurocognitive/histologic outcomes. recent human data suggest that arterial hyperoxia is associated with worse outcomes. objectives: to assess the relationship between hypoxia, normoxia, and hyperoxia post-arrest and outcomes in post-cardiac arrest patients treated with therapeutic hypothermia. methods: we conducted a retrospective chart review of post-arrest patients admitted to an academic medical center between january, and december, who had arterial blood gases (abg) drawn after rosc. demographic variables were analyzed using anova and chi-square tests as appropriate. unadjusted logistic regression analyses were performed to assess the relationship between hypoxia (pao < mmhg), normoxia ( - mmhg), hyperoxia (> mmhg), and mortality. results: on first abg ( patients), ( . %) were hypoxic, ( . %) normoxic, and ( . %) hyperoxic. the average age of the cohort was . years (no difference for hypoxic, normoxic, and hyperoxic patients). overall mortality was . % ( / ). there were no significant differences in initial heart rate, systolic blood pressure, sex, race, or pre-arrest functional status. in unadjusted logistic regression analysis of first pao values, hyperoxia was not associated with increased mortality (or . ; % ci . - . ) but hypoxia was associated with increased mortality (or . ; % ci . - . ). in-hospital mortality was significantly higher when the first abg demonstrated hypoxia ( . %; / ) than for normoxia ( . %; / ) or hyperoxia ( %; / ). conclusion: hypoxia but not hyperoxia on first abg was associated with mortality in a cohort of post-arrest patients.
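the post-arrest oxygenation abstract above classifies the first pao value into hypoxia, normoxia, and hyperoxia and then compares mortality across the three categories. a minimal data-wrangling sketch of that binning and tabulation; the cut-points and data below are placeholders, not the study's thresholds:

```python
# Sketch of binning a first post-ROSC PaO2 into hypoxia / normoxia / hyperoxia and
# tabulating in-hospital mortality by category. Cut-points and data are placeholders,
# not the study's values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "pao2": rng.uniform(30, 500, size=300),      # simulated first PaO2 values (mmHg)
    "died": rng.binomial(1, 0.55, size=300),     # simulated in-hospital mortality
})

LOW_CUT, HIGH_CUT = 60, 300                      # placeholder thresholds (mmHg)
df["oxygenation"] = pd.cut(df["pao2"],
                           bins=[-np.inf, LOW_CUT, HIGH_CUT, np.inf],
                           labels=["hypoxia", "normoxia", "hyperoxia"])

print(df.groupby("oxygenation", observed=True)["died"].agg(["count", "mean"]))
```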
background: there are over , deaths due to cardiac arrest per year in the us. the aha recommends monitoring the quality of cpr primarily through the use of end-tidal co (etco ). the level of etco is significantly dependent on minute ventilation and altered by pressor and bicarbonate use. cerebral oximetry (cereox) uses near infrared spectroscopy to non-invasively measure oxygen saturation of the frontal lobes of the brain. cereox has been correlated with cerebral blood flow and jugular vein bulb saturations. objectives: the objective of this study is to compare the simultaneous measurement of etco and cereox to investigate which monitoring method provides the best measure of cpr quality as defined by return of spontaneous circulation (rosc). methods: a prospective cohort of a convenience sample of out-of-hospital and ed cardiac arrest patients from two large eds. patients were monitored simultaneously by etco and cereox during cpr. patient demographics and arrest data were collected using the utstein criteria. all patients were monitored throughout the resuscitation efforts. rosc was defined as a palpable pulse and a measurable blood pressure for a minimum of thirty minutes. results: twenty-two patients were enrolled with complete data sets; % of the subjects had rosc. average down time of rosc subjects was minutes (sd ± . ) and minutes (sd ± . ) for subjects without rosc. the inability to obtain a value of either for etco or cereox was % and % specific with an % and % npv respectively for predicting lack of rosc. obtaining a value of either for etco or cereox was % and % sensitive, respectively, in identifying rosc. subjects with rosc had sustained values above for . mins on cereox and . mins on etco prior to rosc. the increase in values over a three minute period prior to rosc was . on cereox and . on etco . conclusion: the inability to obtain a value of on either the etco or cereox strongly predicted lack of rosc. cereox provides a larger magnitude and closer temporal increase prior to rosc than etco . attaining a value of on cereox was more predictive of rosc than etco . discrepancies can arise when communicating information to multiple listeners in a short amount of time. this creates a communication barrier not always apparent to practitioners. we examine the perceptions of ems and ed personnel on the transfer of care and its correlation to missing patient data. objectives: evaluate provider perception of information transfer by ems and ed personnel and compare this to an external observer's objective assessment. methods: this is a retrospective quality improvement program at an academic level i trauma center. transfers of medical and trauma patients from ems to ed personnel were attended by trained external observers, research associates (ra). ra recorded the data communicated: name, age, past medical history (pmh), allergies, medications, events, active problems, vital signs (vs), level of consciousness (loc), iv access, and treatments given. then, ems and ed staff rated their perception of transfer on a - rating scale. results: ra evaluated patient transfers ( medical and trauma). transfer time did not differ: . minutes for medical ( % ci: . - . ) vs. . minutes for trauma patients ( % ci: . - . ) (p = . ). missing data between the two groups also did not differ, except that loc and treatment were missed more often in medical transfers, while pmh was missed more often in trauma transfers.
comparing the transfers with all vs present ( %, / ) and all vs missing ( %, / ), there was no difference in perception of transfer for ems ( . / vs present vs . / vs absent) or ed staff ( . / vs present, . / vs absent). when all vital signs were missing, ra rated . % of transfers as poor, whereas when all vs were present . % of transfers were considered good. conclusion: ems and ed staff felt transfers of care were professional, teams were attentive, and had similar amounts of interruptions for both medical and trauma cases. their perception of transfer of care was similar even when key information was missing, although external observers rated a significant amount of transfers poorly. thus, ems and ed staff were not able to evaluate their own performance in a transfer of care and external observers were found to be better evaluators of transfers of care. swati singh, john brown, prasanthi ramanujam ucsf, san francisco, ca background: ems transports a large number of psychiatric emergencies to emergency departments (ed) across the us. research on paramedic education related to behavioral emergencies is sparse, but based on expert opinion we know that gaps in paramedic knowledge and training exist. in our system, paramedics triage patients to medical, detoxification, and purely psychiatric destinations, so a paramedic's understanding of these emergencies directly affects the flow of patients in our eds. objectives: our objectives were to understand the gaps in current training and develop a targeted curriculum for field providers with a long-term goal of appropriately recognizing and triaging subjects to the ed. methods: data were collected using a survey that was distributed during a paramedic association meeting in october . subjects were excluded if they did not complete the survey. survey questions addressed demographics of paramedics, frequency of various psychiatric emergencies, and their confidence in managing these emergencies. data were collated, analyzed, and presented as descriptive statistics. results: forty-nine surveys were distributed with a response rate of % (n = / ). of the respondents, % (n = ) were male and % (n = ) had at least five years' experience. mood, thought, and cognitive disorders were the most frequently encountered presentations and % (n = ) of respondents came across psychiatric emergencies multiple times a week. many respondents did not feel confident managing agitated delirium (n = , %), acute psychosis (n = , %), and intimate partner or elder abuse (n = , %). a third to a half of the respondents felt they had little or no training in chemical sedation (n = , %), verbal de-escalation (n = , %), and triaging patients (n = , %). conclusion: we identified a need for a revised curriculum on management of psychiatric emergencies. future steps will focus on developing the curriculum and measuring change in knowledge after its implementation. background: prehospital endotracheal intubation has long been a cornerstone of resuscitative efforts for critically ill or injured patients. paramedic airway management training will need to be modified due to the acc/aha guidelines to ensure maintenance of competency in overall management of airway emergencies. how best to modify the training of paramedics requires an understanding of current experience. objectives: the purpose of this report is to characterize the airway management expertise of experienced and non-experienced paramedics in a single ems system.
methods: we retrospectively reviewed all prehospital intubations from an urban/suburban ambulance service (professional ambulance, inc.) over a five-year period (january , to december , ). characteristics of airway management by paramedics with - years of experience (group ) were compared to those with greater than years of experience (group ). airway management was guided by massachusetts statewide treatment protocols governing direct laryngoscopy and all adjunctive approaches. attempts are characterized by laryngoscope blade passing the lips. difficult and failed airways were managed with extraglottic devices (egd) or needle cricothyroidotomy. we reviewed patient characteristics, intubation methods, rescue techniques, and adverse events. results: patients required airway management: ( %) were performed by group and ( %) were performed by group . group was both faster to intubate ( . vs . attempts, p = . ) and less likely to use a rescue device ( . % vs . %, p = . ). both are equally likely to go directly to a rescue device ( % vs %, p = . ). all patients were successfully oxygenated and ventilated with either an endotracheal tube or egd. no surgical airways were performed and no patients died as a result of a failed airway. conclusion: while intubation success rates of paramedics with less than and greater than five years of experience are similar, less experienced paramedics use fewer attempts and are less likely to use a rescue device. both recognize difficult airways and go directly to rescue devices equally. this highlights difficulties faced maintaining competence. education requirements must be evaluated and redesigned to allow paramedics to maintain competence and emphasize airway management according to the latest resuscitation guidelines. how well do ems - - protocols predict ed utilization for pediatric patients? stephanie j. fessler , harold k. simon , daniel a. hirsh , michael colman emory university, atlanta, ga; grady health systems, atlanta, ga background: the use of emergency medical services (ems) for low-acuity pediatric problems has been well documented. however, it is unclear how accurately general ems dispatch protocols predict the subsequent ed utilization for these patients. objectives: to determine the ed resource utilization rate of pediatric patients categorized as low acuity by - - dispatch protocols and then subsequently transferred to a children's hospital. methods: all transports for pediatric patients from the scene by a large urban general ems provider that were prioritized as low acuity by initial - - dispatch protocols were identified. protocols were based on the national academy of medical priority dispatch system, v . starting on jan , , consecutive cases of patients transported to three pediatric emergency departments (ped) of a large tertiary care pediatric health care system were reviewed. demographics, ped visit characteristics, resource utilization, and disposition were recorded. those patients who received meds other than po antipyretics, had labs other than a strep test, a radiology study, a procedure, or were not discharged home were categorized into the significant ed resource utilization group. results: % of the patients were african american and either had public insurance or self-pay ( %, % respectively). the median age was months ( d- yr). % were female. none of these low-acuity patients were upgraded by ems operators en route. upon arrival to the ped, % of transported patients were classified into the significant utilization group. 
six of the total patients were admitted, including a y/o requiring emergent intubation, an m/o with a broken cvl, a y/o with sickle cell pain crisis, and a y/o with altered mental status. the remainder of the significant resource utilization group consisted of children needing procedures, anti-emetics, narcotic pain control, labs, and x-rays. conclusion: in this general ems - - system, dispatch protocols for pediatric patients classified as low priority did poorly in predicting subsequent ed utilization, with % requiring significant resources. further, ems operators did not recognize a critical child who needed emergent intervention. opportunity exists to refine general ems - - protocols for children in order to more accurately define an ems priority status that better correlates with ultimate needs and resource utilization. objectives: to determine if there is an association between a patient's impression of the overall quality of care and his or her satisfaction with provided pain management. it was hypothesized that satisfaction with pain management would be significantly associated with a patient's impression of the overall quality of care. methods: this was a retrospective review of patient satisfaction survey data initially collected by an urban als ems agency from / / to / / . participants were randomly selected from all patients transported, proportional to their paramedic-defined acuity, categorized as low, medium, or high, with a goal of interviews per month. the proportions of patients sampled from each acuity level were % low, % medium, and % high. patients were excluded if there was no telephone number recorded in the prehospital patient record or they were pronounced dead on scene. all satisfaction questions used a five-point likert scale with ratings from excellent to poor that were dichotomized for analysis as excellent or other. the outcome variable of interest was the patient's perception of the overall quality of care. the main independent variable was the patient's rating of the staff who treated them at the scene on helping to control or reduce their pain. demographic variables were assessed for potential confounding. results: there were , patients with complete data for the outcome and main independent variable with . % male respondents and an average age of . (sd = . ). overall quality of care was rated excellent by . % of patients while . % rated their pain management as excellent. of patients who rated their pain management as excellent, . % rated overall quality of care as excellent while only . % of patients rated overall quality excellent if pain management was not excellent. when controlling for potential confounding variables, those patients who perceived their pain management to be excellent were . ( % ci . - . ) times more likely to rate their overall quality of care as excellent compared to those with non-excellent perceived pain management. conclusion: patients' perceptions of the overall quality of care were significantly associated with their perceptions of pain management. objectives: the purpose of this study is to determine whether ground-based paramedics could be taught and retain the skills necessary to successfully perform a cricothyrotomy. methods: this retrospective study was performed in a suburban county with a population of , and , ems calls per year. participants were ground-based paramedics in a local ems system who were taught wire-guided cricothyrotomy as part of a standardized paramedic educational update program.
as part of the educational program, paramedics were taught wire-guided cricothyrotomy on a simulation model previously developed to train emergency medicine residents. after viewing an instructional video, the participants were allowed to practice using a -step checklist; not all of these steps were automatic failures if missed. each paramedic was individually supervised performing a cricothyrotomy on the simulator until successful; a minimum of five simulations was required. retention was assessed using the same -step checklist during annual skills testing, after a minimum of weeks to a maximum of months post-training. results: a total of paramedics completed both the initial training and reassessment during the time period studied. during the initial training phase, % ( of ) of the paramedics were successful in performing all steps of the wire-guided cricothyrotomy. during the retention phase, . % ( of ) retained the skills necessary to successfully perform the wire-guided cricothyrotomy. of the -step checklist, most steps were performed successfully by all the paramedics or missed by only of the paramedics. step # , which involved removing the needle prior to advancing the airway device over the guidewire, was missed by . % ( of ) of the participants. step # was not an automatic failure since most participants immediately self-corrected and completed the procedure successfully. conclusion: paramedics can be taught and can retain the skills necessary to successfully perform a wire-guided cricothyrotomy on a simulator. future research is necessary to determine if paramedics can successfully transfer these skills to real patients. helicopter emergency medical services in background: netcare is one of the largest private providers of emergency air medical care in south africa. each hems (helicopter emergency medical service) crew is manned by a physician-paramedic team and is dispatched based on specific medical criteria, time to definitive care, and need for physician expertise. objectives: to describe the characteristics of netcare air medical evacuations in gauteng province and to analyze the role of physicians in patient care and effect on call times. methods: all patients transported by a netcare helicopter over a one-year period from january to december were enrolled in the study. injury classifications, demographics, procedures, scene and flight times were collected retrospectively from run sheets. data were described by medians and interquartile intervals. results: a total of patients were transported on flights originating from the netcare gauteng helicopter base. ninety-two percent were trauma-related, with % resulting from motor vehicle accidents. physician expertise was listed % of the time as the indication for air medical response. a total of advanced procedures were performed by physicians on patients, including paralytic-assisted intubations, chest tube placement, and cardiac pacing. the median total call time was minutes with minutes spent on scene, compared with and minutes when advanced procedures were performed by hems (p < . ). conclusion: trauma accounts for an overwhelming majority of patients requiring emergency air medical transportation. advanced medical procedures were performed by physicians in nearly a quarter of the patients. there were significant differences in call times when advanced procedures were performed by hems. objectives: we sought to evaluate the level of awareness and adoption of the off-line protocol guidelines by utah ems agencies.
methods: we surveyed all ems agencies in utah months after protocol guideline release. medical directors, ems captains, or training coordinators completed a short phone survey regarding their knowledge of the emsc protocol guidelines, and whether their agency had adopted them. in particular, participants were asked about the pain protocol guideline and their management of pediatric pain. results: of the agencies, participated in the survey ( %). of those participating, agencies ( %) were excluded from the analysis: ( %) who only treat adults and ( %) who do not participate in electronic data entry. of the remaining agencies ( %), ( %) were familiar with the utah emsc protocol guidelines; agencies ( %) have either partially or fully adopted the protocol guidelines. agencies ( %) were familiar with the pain treatment protocol guideline; ( %) had adopted it; ( %) planned to either partially or fully adopt the protocol. overall, agencies ( %) had offline protocols allowing the administration of narcotics to children. of those, ( %) had intranasal fentanyl as an available medication and delivery route. of the agencies with offline protocols for pain, ( %) reported familiarity with the emsc pain protocol guideline. conclusion: the creation and dissemination of statewide emsc protocol guidelines results in widespread awareness ( %) and to date % of agencies have adopted them. future investigation into factors associated with protocol adoption should be explored. background: intranasal (in) naloxone is safe and effective for the treatment of opioid overdose. while it has been extensively studied in the out-of-hospital environment in the hands of paramedics and lay people, we are unaware of any studies evaluating the safety and efficacy of in naloxone administration by bls providers. in recent years in naloxone has been added to the bls armamentarium; however, most services/states require an als unit be dispatched and attempt an intercept if in naloxone is administered by the bls providers. objectives: the purpose of this study is to evaluate the safety and effectiveness of bls-administered in naloxone in an urban environment. methods: retrospective cohort review as part of the ongoing qa process of all patients who had in naloxone administration by bls providers. the study was part of a special projects waiver by massachusetts oems from february through november in a busy urban tiered ems system in the metro-boston area. exclusion criteria: cardiac arrest. demographic information was collected, as well as vital signs, number of naloxone doses by bls, patient response to bls naloxone administration (clinical improvement in mental status and/or respiratory status), als intercept. descriptive statistics and confidence intervals are reported using microsoft excel and spss . . results: fifty-six cases of bls-administered in naloxone were identified, and were excluded as cardiac arrests. the included cases had a mean age of . years ± . (range - ), and % (ci - ) were male. of the included cases, % (ci - ) of patients responded to bls administration of naloxone. of the responders, % (ci - ) required two doses. there were protocol violations representing % (ci . - . ) of the total administrations, however in % of these protocol violations the patients had a positive response to the administration of in naloxone. seven of the protocol violations were patients who required a second mg dose of naloxone. 
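the bls intranasal naloxone abstract above reports response rates and protocol-violation rates as percentages with confidence intervals. a minimal sketch of a binomial proportion confidence interval; the counts below are made up, not the study's:

```python
# Made-up counts: patients responding to BLS-administered intranasal naloxone.
from statsmodels.stats.proportion import proportion_confint

responders, total = 44, 53   # hypothetical counts, not the study's
rate = responders / total
ci_low, ci_high = proportion_confint(responders, total, alpha=0.05, method="wilson")
print(f"response rate = {rate:.0%} (95% CI {ci_low:.0%}-{ci_high:.0%})")
```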
eleven cases did not have an als intercept; only of these patients did not respond to bls administration of naloxone. there were no identified adverse events. conclusion: bls providers safely and successfully administered in naloxone, achieving a response rate consistent with studies of als providers' administration of in naloxone. given the success rate of bls providers, it may be feasible for bls to manage responders without the aid of an als intercept. background: an estimated % of patients arriving by ambulance to the ed are in moderate to severe pain. however, the management of pain in the prehospital setting has been shown to be inadequate, and untreated pain may have negative consequences for patients. objectives: to determine if focused education on pediatric pain management and implementation of a pain management protocol improved the prehospital assessment and treatment of pain in adult patients. specifically, this study aimed to determine if documentation of pain scores and administration of morphine by ems personnel improved. methods: this was a retrospective before-and-after study conducted by reviewing a county-wide prehospital patient care database. the study population included all adult patients transported by ems between february and february with a working assessment of trauma or burn. ems patient care records were searched for documentation of pain scores and morphine administration years before and years after an intensive pediatric-focused pain management education program and implementation of a pain management protocol. frequencies and % cis were determined for all patients meeting the inclusion criteria in the before and after time periods, and chi-square tests were used to compare frequencies between time periods. a secondary analysis was conducted using only subjects documented as meeting the protocol's treatment guidelines. results: , ( %) of , adult patients transported by ems during the study period met the inclusion criteria: , in the before and , in the after period. subject demographics were similar between the two periods. documentation of pain score did not change between the time periods. background: there is a presumption that ambulance response times affect patient outcome. we sought to determine if shorter response times really make a difference in hospital outcomes. objectives: to determine if ambulance response time makes a difference in the outcomes of patients transported for two major trauma (motor vehicle crash injuries, penetrating trauma) and two major medical (difficulty breathing and chest pain complaints) emergencies. methods: this study was conducted in a metropolitan ems system serving a population total of , including urban and rural areas. cases were included if the private ems service was the first medical provider on scene, the case was priority , and the patient was years and older. a -month time period was used for the data evaluation. four diagnoses were examined: motor vehicle crash injuries, penetrating trauma, difficulty breathing, and chest pain complaints. ambulance response times were assessed for each of the four different complaints. the patients' initial vital signs were assessed and the number of vital signs out of range was recorded. a sampling of all cases which went to the single major trauma center was selected for evaluation of hospital outcome. using this hospital sample, the number of vital signs out of range was assessed as a surrogate marker of hospital outcome severity.
correlation coefficients were used to evaluate associations between independent and outcome variables. results: of the cases we reviewed over the month period, we found that the ems service responded significantly faster to trauma complaints, at . minutes (n = ), than to medical complaints, at . minutes (n = ). in the hospital sample of cases, the number of vital signs out of range was positively correlated with hospital days (r = . ), admits (r = . ), icu admits (r = . ), and deaths (r = . ), but not with response times (r = - . ). in the entire sample, there was no correlation between vital signs out of range and response times for any diagnosis (see figure). conclusion: based on our hospital sample, which showed that the number of vital signs out of range was a surrogate marker of worse hospital outcomes, we find that hospital outcomes are not related to initial response times. adverse effects following prehospital use of ketamine by paramedics eric ardeel baylor college of medicine, houston, tx background: ketamine is widely used across specialties as a dissociative agent to achieve sedation and analgesia. emergency medical services (ems) use ketamine to facilitate intubation and pain control, as well as to sedate acutely agitated patients. published studies of ems ketamine practice and effects are scarce. objectives: describe the incidence of adverse effects occurring after ketamine administration by paramedics treating under a single prehospital protocol. methods: a retrospective analysis was conducted of consecutive patients receiving prehospital ketamine from paramedics in the suburban/rural ems system of montgomery county hospital district, texas between august , and october , . ketamine administration indications were: need for rapid control of violent/agitated patients requiring treatment and transport; sedation and analgesia after trauma; facilitation of intubation and mechanical ventilation. ketamine administration contraindications were: equivalent ends achieved by less invasive means; hypertensive crisis; angina; signs of significantly elevated intracranial pressure; anticipated inability to support or control airway. all patients were included, regardless of indication for ketamine administration. data were abstracted from electronic patient care records and available continuous physiologic monitoring data, and analyzed for the presence of adverse effects as defined a priori in ''clinical practice guidelines for emergency department ketamine dissociative sedation: update.'' results: no patients were identified as experiencing adverse effects as defined by the referenced literature. ketamine was utilized most often for patients with the following nemsis provider's primary impression: ( %) altered level of consciousness, ( %) behavioral/psychiatric, ( %) traumatic injury. overall, combativeness was associated with ( %) patients. the mean age was years (range - years) and ( %) were male. the mean ketamine dose was mg (range - mg) and twenty-four ( %) patients received multiple administrations. conclusion: in this patient population, our data indicate that prehospital ketamine use by ems paramedics, across all indications for administration, was safe. further study of ketamine's utility in ems is warranted. background: rigorous evaluation of the effect of implementing nationally vetted evidence-based guidelines (ebgs) has been notoriously difficult in ems.
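the response-time abstract above correlates the number of out-of-range vital signs with hospital outcomes and with response times. a minimal sketch of that kind of correlation analysis, using simulated values rather than the study's data:

```python
# Simulated values (not study data): out-of-range vital sign counts, hospital days,
# and ambulance response times.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 200
vitals_out_of_range = rng.integers(0, 5, size=n)
hospital_days = vitals_out_of_range * 1.5 + rng.normal(0, 2, size=n)  # weakly related
response_time_min = rng.normal(8, 2, size=n)                          # unrelated

r_outcome, p_outcome = stats.pearsonr(vitals_out_of_range, hospital_days)
r_response, p_response = stats.pearsonr(vitals_out_of_range, response_time_min)
print(f"vitals vs hospital days:  r = {r_outcome:.2f} (p = {p_outcome:.3g})")
print(f"vitals vs response time:  r = {r_response:.2f} (p = {p_response:.3g})")
```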
specifically, human subjects issues and the health insurance portability and accountability act (hipaa) present major challenges to linking ems data with distal outcomes. objectives: to develop a model that addresses the human subjects and hipaa issues involved with evaluating the effect of implementing the traumatic brain injury (tbi) ebgs in a statewide ems system. methods: the excellence in prehospital injury care (epic) project is an nih-funded evaluation of the effect of implementing the ems tbi guidelines throughout arizona (ninds- r ns - a ). to accomplish this, a partnership was developed between the arizona department of health services (adhs), the university of arizona, and more than ems agencies that serve approximately % of the state's population. results: ebg implementation: implementation follows all routine regulatory processes for making changes in ems protocols. in arizona, the entire project must be carried out under the authority of the adhs director. evaluation: a before-after system design is used (randomization is not acceptable). hipaa: as an adhs-approved public health initiative, epic is exempt from hipaa, allowing sharing of protected health information between participating entities. for epic, the state attorney general provided official verification of hipaa exemption, thus allowing direct linkage of ems and hospital data. irb: once epic was officially deemed a public health initiative, the university irb process was engaged. as an officially sanctioned public health project, epic was determined not to be human subjects research. this allows the project to implement and evaluate the effect of this initiative without requiring individual informed consent. conclusion: by utilizing an ems-public health-university partnership, the ethical and regulatory challenges related to evaluating implementation of new ebgs can be successfully overcome. the integration of the department of health, the attorney general, and the university irb can properly protect citizens while permitting efficient implementation and rigorous evaluation of the effect of ebgs. this novel approach may be useful as a model for evaluation of implementing ems ebgs in other states and large counties. ( . %- . % by age) were transported to non-trauma centers. the most common reasons cited by ems for hospital selection were: patient preference ( . %), closest facility ( . %), and specialty center ( . %). patient preference increased with age (p for trend . ) and paralleled under-triage (figure). iss ≥ patients transported to non-trauma hospitals by patient request had lower unadjusted mortality ( . %, % ci . - . ) than similar patients transported to trauma centers ( . %, % ci . - . ) or transported for other reasons ( . %, % ci . - . ) (figure). under-triage appears to be influenced by patient preference and age. self-selection for transport to non-trauma centers may result in under-triaged patients with inherently better prognosis than triage-positive patients. background: only % of all out-of-hospital cardiac arrest (ohca) patients receive bystander cpr (cardiopulmonary resuscitation). the neighborhood in which an ohca occurs has significant influence on the likelihood of receiving bystander cpr. objectives: to utilize geographic information systems to identify ''high-risk'' neighborhoods, defined as census tracts with high incidence of ohca and low cpr prevalence. methods: design: secondary analysis of the cardiac arrest registry to enhance survival (cares) dataset for denver county, colorado.
population: all consecutive adults (> years old) with ohca due to cardiac etiology from january , through december , . data analysis: analyses were conducted in arcgis. three spatial statistical methods were used: local moran's i (lmi), getis-ord gi* (gi*), and spatial empirical bayes (seb) adjusted rates. census tracts with a high incidence of ohca, as identified by all three spatial statistical methods, were then overlain with low bystander cpr census tracts, which were identified by at least two of three statistical methods (lmi, gi*, or the lowest quartile of bystander cpr prevalence). overlapping census tracts identified with both high ohca incidence and low cpr prevalence were designated as ''high-risk''. results: a total of arrests in census tracts occurred during the study period, with arrests included in the final sample. events were excluded if they were unable to be geocoded (n = ), occurred outside denver county (n = ), or occurred in a jail (n = ), hospital/physician's office (n = ), or nursing home (n = ). for high ohca incidence: lmi identified census tracts, gi* identified census tracts, and the seb method identified census tracts. twenty-five census tracts were identified by all three methods. for low bystander cpr prevalence: lmi identified census tracts, gi* identified census tracts, and census tracts were identified as being in the lowest quartile of cpr prevalence. twenty-four census tracts were identified by two of the three methods. two census tracts were identified as high-risk, having both high ohca incidence and low cpr prevalence (figure). high-risk census tract demographics as compared to denver county are shown in the table. conclusion: the two high-risk census tracts, comprising minority and low-income populations, appear to be possible sites for targeted community-based cpr interventions. objectives: we sought to assess the accuracy and correlation of geographic information system (gis) derived transport time compared to actual ems transport time in ohca patients. methods: prospective, observational cohort analysis of ohca patients in vancouver, b.c., one of the sites of the resuscitation outcomes consortium (roc). a random sample of all ohca cases from / through / was selected for analysis from one site of the roc epistry. using gis, ems transport time was derived from the reported latitude/longitude coordinates of the ohca event to the actual receiving hospital. this was calculated via the actual network distance using arcgis. this gis-derived time was then compared to the actual ems transport time (in minutes) using the wilcoxon signed-rank test. scatter plots of actual vs. gis times were created to evaluate the relationship between actual and calculated time. a linear regression model predicting actual ems transport time from the derived gis time was also developed to examine the potential relationship between the two variables. differences in the relationship were also investigated based on time of day to reflect varying traffic conditions. results: cases were randomly selected for analysis. the median actual transport time was significantly longer than the median gis-derived transport time ( . minutes vs. . minutes). scatter plot analysis did not reveal any significant correlation between actual and gis-based time. additionally, there was poor approximation of gis-based time to actual ems time (r = . ), with no evidence of a significant linear relationship between the two.
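the wilcoxon signed-rank comparison and linear regression of actual versus gis-derived transport times described above could be carried out roughly as in the sketch below; the time values are placeholders, not study data.

```python
# illustrative sketch: paired comparison of actual vs. gis-derived transport
# times (wilcoxon signed-rank) plus a linear regression of actual on derived
# time. values are made up for demonstration.
import numpy as np
from scipy.stats import wilcoxon, linregress

actual_min = np.array([9.5, 12.0, 7.8, 15.2, 10.4, 8.9, 11.7, 13.3])
gis_min    = np.array([6.1,  9.4, 6.0, 10.8,  7.9, 7.2,  8.5, 10.1])

stat, p = wilcoxon(actual_min, gis_min)      # paired, non-parametric comparison
fit = linregress(gis_min, actual_min)        # predict actual time from gis time

print(f"wilcoxon statistic = {stat:.1f}, p = {p:.3f}")
print(f"r = {fit.rvalue:.2f}, slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
```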
the poorest correlation of time was observed during the morning hours ( : - : ; r = . ) while the strongest correlation was during the overnight hours ( : - : ; r = . ). conclusion: gis derived time does not appear to correlate well with actual ems transport time of ohca patients. efforts should be made to accurately obtain actual ems transport times for ohca patients. objectives: we first sought to describe the incidence of ohca presenting to the ed. we then sought to determine the association between hospital characteristics and survival to hospital admission. methods: we identified patients with diagnoses of cardiac arrest or ventricular fibrillation (icd- . or . ) in the nationwide emergency department sample, a nationally representative estimate of all ed admissions in the us. eds reporting ‡ patient with ohca were included. our primary outcome was survival to hospital admission. we examined variability in hospital survival rate and also classified hospitals into high or low performers based on median survival rate. we used this dichotomous hospital level outcome to examine factors associated with survival to admission including hospital and patient demographics, ed volume, cardiac arrest volume, and cardiac catheterization availability. all unadjusted and adjusted analyses were performed using weighted statistics and logistic regressions. results: of the hospitals, ( . %) were included. in total, , cases of cardiac arrest were identified, representing an estimated , cases nationally. overall ed ohca survival to hospital admission was . % (iqr . %, . %) in adjusted analyses, increased survival to admission was seen in hospitals with teaching status (or . , % ci . - . , p < . ), annual ed visits ‡ , (or . , % ci . - . , p < . ), and pci capability (or . , % ci . - . , p = . ). in separate adjusted analyses including teaching status and pci capabilities, hospitals with > annual cardiac arrest cases (or . , % ci . - . , p < . ) were also shown to have improved survival (figure) . conclusion: ed volume, cardiac arrest volume, and pci capability were associated with improved survival to hospital admission in patients presenting to the ed after ohca. an improved understanding of the contribution of ed care to ohca survival may be useful in guiding the regionalization of cardiac arrest care. background: prior investigations have demonstrated regional differences in out-of-hospital cardiac arrest (ohca) outcomes, but none have evaluated survival variability by hospital within a single major us city. objectives: we hypothesized that -day survival from ohca would vary considerably among one city's receiving hospitals. methods: we performed a retrospective review of prospectively collected cardiac arrest data from a large, urban ems system. our population included all ohcas with a recorded social security number (which we used to determine -day survival through the social security death index) that were transported to a hospital between / / and / / . we excluded traumatic arrests, pediatric arrests, and hospitals receiving less than ohcas with social security numbers over the three-year study period. we examined the associa-tion between receiving hospital and -day survival. additional variables examined included: level i trauma center status, teaching hospital status, ohca volume, and whether post-arrest therapeutic hypothermia (th) protocols were in place in . statistics were performed using chi-square tests and logistic regression. 
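a minimal sketch of the kind of weighted, adjusted logistic regression used above to relate hospital characteristics to survival-to-admission is given below; the variable names, toy data, and analysis weights are assumptions, and a full survey-design analysis would additionally account for strata and clustering.

```python
# illustrative sketch: weighted logistic regression of survival-to-admission
# on hospital characteristics, reported as odds ratios with confidence
# intervals. data, variable names, and weights are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "survived":    [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "teaching":    [1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
    "pci":         [1, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    "high_volume": [1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
    "weight":      [120, 80, 100, 150, 90, 110, 100, 130, 70, 140],  # frequency-style weights
})

model = smf.glm("survived ~ teaching + pci + high_volume", data=df,
                family=sm.families.Binomial(), freq_weights=df["weight"])
res = model.fit()
odds_ratios = np.exp(res.params)
ci = np.exp(res.conf_int())
print(pd.concat([odds_ratios.rename("OR"),
                 ci.rename(columns={0: "2.5%", 1: "97.5%"})], axis=1))
```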
results: our study population comprised arrest cases delivered to unique hospitals with an overall -day survival of . %. mean age was . (sd . ) years. males comprised . % of the cohort; . % of victims were black. thirty-day survival varied significantly among the hospitals, ranging from . % to . % (chi-square . , p = . ). ohcas delivered to level i trauma centers were significantly more likely to survive ( . % vs. . %, p = . ), as were those delivered to hospitals known to offer post-arrest th ( . % vs. . %, p = . ). hospital teaching status and ohca volume were not associated with survival. conclusion: there was significant variability in ohca survival by hospital. patients were significantly more likely to survive if transported to a level i trauma center or hospital with post-arrest th protocols, suggesting a potential role for regionalization of ohca care. limiting our population to ohcas with recorded social security numbers reduced our power and may have introduced selection bias. further work will include survival data on the complete set of ohcas transported to hospitals during the three-year study period. background: traumatic brain injury is a leading cause of death and disability. previous studies suggest that prehospital intubation in patients with tbi may be associated with mortality. limited data exist comparing prehospital (ph) nasotracheal (nt), prehospital orotracheal (ot), and ed ot intubation and mortality following tbi. objectives: to estimate the associations between ph nt, ph ot, and ed ot intubation and in-hospital mortality in patients with moderate to severe tbi, with hypotheses that ph nt and ph ot intubation would be associated with increased mortality when compared to ed ot or no intubation. methods: an analysis using the denver health trauma registry, a prospectively collected database. consecutive adult trauma patients from - with moderate to severe tbi defined as head abbreviated injury scale (ais) scores of - . structured chart abstraction by blinded physicians was used to collect demographics, injury and prehospital care characteristics, intubation status and timing, in-hospital mortality and survival time, and neurologic function at discharge. poor neurologic function was defined as cerebral performance category score of - . multivariable logistic regression and survival analyses were performed, using multiple imputation for missing data. results: of the , patients, the median age was (iqr - ) years. the median ph gcs was (iqr - ), median injury severity score was (iqr - ), and median head ais was (iqr - ). ph nt occurred in . %, ph ot in . %, and ed ot in . %, while mortality occurred in . %. the -, -, and -hour survival analyses are outlined in the table. survival curves for ph nt, ph ot, and ed ot are demonstrated in the figure (p < . ) . conclusion: prehospital intubation in patients with moderate to severe tbi is associated with increased mortality. contrary to our initial hypothesis, there was also a significant association between ed intubation and mortality. these associations persisted despite survival time, and while adjusting for injury severity. background: sbdp is a breakdown product of the cytoskeletal protein alpha-ii-spectrin found in neurons and has been detected in severe tbi. 
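for the survival analyses described above (survival over time by intubation route), a kaplan-meier and log-rank sketch is given below; it relies on the third-party lifelines package as an illustrative choice, and the data frame is hypothetical.

```python
# illustrative sketch: kaplan-meier curves and a log-rank comparison of
# survival time across intubation groups, loosely mirroring the survival
# analyses described above. uses the third-party 'lifelines' package; the
# data frame is hypothetical.
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test

df = pd.DataFrame({
    "hours_observed": [12, 48, 48, 6, 48, 30, 48, 2, 48, 20],
    "died":           [1,  0,  0, 1,  0,  1,  0, 1,  0,  1],
    "group":          ["ph_nt", "ph_ot", "ed_ot", "ph_nt", "ed_ot",
                       "ph_ot", "ed_ot", "ph_nt", "ph_ot", "ed_ot"],
})

for name, g in df.groupby("group"):
    kmf = KaplanMeierFitter()
    kmf.fit(g["hours_observed"], event_observed=g["died"], label=name)
    print(kmf.survival_function_.tail(1))

result = multivariate_logrank_test(df["hours_observed"], df["group"], df["died"])
print(f"log-rank p = {result.p_value:.3f}")
```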
objectives: this study examined whether early serum levels of sbdp could distinguish: ) mild tbi from three control groups; ) those with and without traumatic intracranial lesions on ct (+ct vs -ct); and ) those having a neurosurgical intervention (+nsg vs -nsg) in mild and moderate tbi (mmtbi). methods: this prospective cohort study enrolled adult patients presenting to two level i trauma centers following mmtbi with blunt head trauma with loss of consciousness, amnesia, or disorientation and a gcs - . control groups included uninjured controls and trauma controls presenting to the ed with orthopedic injuries or an mvc without tbi. mild tbi was defined as gcs and moderate tbi as having a gcs < . blood samples were obtained in all patients within hours of injury and measured by elisa for sbdp (ng/ml). the main outcomes were: ) the ability of sbdp to distinguish mild tbi from three control groups; ) to distinguish +ct from -ct and; ) to distinguish +nsg from -nsg. data were expressed as means with %ci, and performance was tested by roc curves (auc and %ci). results: there were patients enrolled: tbi patients ( gcs , gcs - ), trauma controls ( mvc controls and orthopedic controls), and uninjured controls. the mean age of tbi patients was years (range - ) with % males. fourteen ( %) had a +ct and % had +nsg. mean serum sbdp levels were . ( %ci . - . ) in normal controls, . ( . - . ) in orthopedic controls, . ( . - . ) in mvc controls, . ( . - . ) in mild tbi with gcs , and . ( . - . ) in tbi with gcs - (p < . ). the auc for distinguishing mild tbi from both controls was . ( %ci . - . ). mean sbdp levels in patients with -ct versus +ct were . ( . - . ) and . ( . - . ) respectively (p < . ) with auc = . ( %ci . - . ). mean sbdp levels in patients with -nsg versus +nsg were . ( . - . ) and . ( . - . ) respectively (p < . ) with auc = . ( %ci . - . ). conclusion: serum sbdp levels were detectable in serum acutely after injury and were associated with measures of injury severity including ct lesions and neurosurgical intervention. further study is required to validate these findings before clinical application. utility of platelet background: pre-injury use of anti-platelet agents (e.g., clopidogrel and aspirin) is a risk factor for increased morbidity and mortality in patients with traumatic intracranial hemorrhage (tich). some investigators have recommended platelet transfusion to reverse the anti-platelet effects in tich. objectives: this evidence-based medicine review examines the evidence regarding the effect of platelet transfusion in emergency department (ed) patients with pre-injury anti-platelet use and tich on patientoriented outcomes. methods: the medline, embase, cochrane library, and other databases were searched. studies were selected for inclusion if they compared platelet transfusion to no platelet transfusion in the treatment of adult ed patients with pre-injury anti-platelet use and tich, and reported rates of mortality, neurocognitive function, or adverse effects as outcomes. we assessed the quality of the included studies using ''grading of recommendations assessment, development and evaluation'' (grade) criteria. categorical data are presented as percentages with % confidence interval (ci). relative risks (rr) are reported when clinically significant. results: five retrospective, registry-based studies were identified, which enrolled patients cumulatively. based on standard criteria, three studies were of ''low'' quality evidence and two studies had ''very low'' qualities. 
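the discriminative performance reported above (auc for distinguishing ct-positive from ct-negative patients by a serum biomarker) can be illustrated with a standard roc analysis; the biomarker values and labels below are invented.

```python
# illustrative sketch: roc curve and auc for a serum biomarker distinguishing
# ct-positive from ct-negative patients. values are made up; a confidence
# interval for the auc would require bootstrapping or a delong-type method.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

ct_positive = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])   # 1 = lesion on ct
biomarker   = np.array([2.4, 3.1, 1.9, 0.8, 1.2, 0.6, 2.9, 1.0, 1.5, 2.2])

auc = roc_auc_score(ct_positive, biomarker)
fpr, tpr, thresholds = roc_curve(ct_positive, biomarker)
best = (tpr - fpr).argmax()   # youden index
print(f"auc = {auc:.2f}")
print(f"best cutoff = {thresholds[best]:.2f} "
      f"(sens {tpr[best]:.2f}, spec {1 - fpr[best]:.2f})")
```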
one study reported higher in-hospital mortality in patients with platelet transfusion (ohm et al), another showed a lower mortality rate in patients receiving platelet transfusion (wong et al). three studies did not show any statistical difference in comparing mortality rates between the groups (table) . no studies reported intermediate-or long-term neurocognitive outcomes or adverse events. conclusion: five retrospective registry studies with suboptimal methodologies provide inadequate evidence to support the routine use of platelet transfusion in adult ed patients with pre-injury anti-platelet use and tich. abnormal levels of end-tidal carbon dioxide (etco ) are associated with severity of injury in mild and moderate traumatic brain injury (mmtbi) linda papa , artur pawlowicz , carolina braga , suzanne peterson , salvatore silvestri orlando regional medical center, orlando, fl; university of central florida, orlando, fl background: capnography is a fast, non-invasive technique that is easily administered and accurately measures exhaled etco concentration. etco levels respond to changes in ventilation, perfusion, and metabolic state, all of which may be altered following tbi. objectives: this study examined the relationship between etco levels and severity of tbi as measured by clinical indicators including glasgow coma scale (gcs) score, computerized tomography (ct) findings, requirement of neurosurgical intervention, and levels of a serum biomarkers of glial damage. methods: this prospective cohort study enrolled adult patients presenting to a level i trauma center following a mmtbi defined by blunt head trauma followed by loss of consciousness, amnesia, or disorientation and a gcs - . etco measurements were recorded from the prehospital and emergency department records and compared to indicators of tbi severity. results: of the patients enrolled, ( %) had a normal etco level and ( %) had an abnormal etco level. the mean age of enrolled patients was (range - ) and ( %) were male. mechanisms of injury included motor vehicle collision in ( %), motor cycle collision in ( %), fall in ( %), bicycle/ pedestrian struck in ( %), and other in ( %). eight ( %) patients had a gcs - and ( %) had a gcs - . of the ( %) patients with intracranial lesions on ct, ( %) had an abnormal etco level (p = . ). of the ( %) patients who required a neurosurgical intervention, % had an abnormal etco level (p = . ). levels of a biomarker indicative of astrogliosis were significantly higher in those with abnormal etco compared to those with a normal etco (p = . ). conclusion: abnormal levels of etco were significantly associated with clinical measures of brain injury severity. further research with a larger sample of mmtbi patients will be required to better understand and validate these findings. background: acetaminophen (apap) poisoning is the most frequent cause of acute hepatic failure in the us. toxicity requires bioactivation of apap to toxic metabolites, primarily via cyp e . children are less susceptible to apap toxicity; one current theory is that children's conjugative pathway (sulfonation) is more active. liquid apap preparations contain propylene glycol (pg), a common excipient that inhibits apap bioactivation and reduces hepatocellular injury in vitro and in rodents. cyp e inhibition may decrease toxicity in children, who tend to ingest liquid apap preparations, and suggests a potential novel therapy. 
objectives: to compare phase i (toxic) and phase ii (conjugative) metabolism of liquid versus solid prepara-tions of apap. we hypothesize that ingestion of a liquid apap preparation results in decreased production of toxic metabolites relative to a solid preparation, likely due to the presence of pg in the liquid preparations. methods: design-pharmacokinetic cross-over study. setting-university hospital clinical research center. subjects-adults ages - taking no chronic medications. interventions-subjects were randomized to receive a mg/kg dose of a commercially available solid or liquid apap preparation. after a washout period of greater than week, subjects received the same dose of apap in the alternate preparation. apap, apap-glucuronide and apap-sulfate (phase metabolites), apap-cysteinate and apap-mercapturate (phase metabolites) were analyzed via lc/ms in plasma over hours. peak concentrations and measured auc were compared using paired-sample t-tests. plasma pg levels were measured. results: fifteen subjects completed the protocol. peak concentrations and aucs of the cyp e derived toxic metabolites were significantly lower following ingestion of the liquid preparation (table, figure) . the glucuronide and sulfate metabolites were not different. pg was present following ingestion of liquid but not solid preparations. conclusion: ingestion of liquid relative to solid preparations in therapeutic doses results in decreased plasma levels of toxic apap metabolites. this may be due to inhibition of cyp e by pg, and may explain the decreased susceptibility in children. a less hepatotoxic formulation of apap can potentially be developed if co-formulated with a cyp e inhibitor. background: pressure immobilization bandages have been shown to delay mortality for up to hours after coral snake envenomation, providing an inexpensive and effective treatment when antivenin is not readily available. however, long-term efficacy has not been established. objectives: determine if pressure immobilization bandages, consisting of an ace wrap and splint, can delay morbidity and mortality from coral snake envenomation, even in the absence of antivenin therapy. methods: institutional animal care and use committee approval was obtained. this was a randomized, observational pilot study using a porcine model. ten pigs ( . kg to . kg) were sedated and intubated for hours. pigs were injected subcutaneously in the left distal foreleg with mg of lyophilized m. fulvius venom resuspended in water, to a depth of mm. pigs were randomly assigned to either a control group (no compression bandage and splint) or a treatment group (compression bandage and splint) approximately minute after envenomation. pigs were monitored daily for days for signs of respiratory depression, decreased oxygen saturations, and paresis/paralysis. in case of respiratory depression, pigs were euthanized and time to death recorded. chi-square was used to compare rates of survival up to days and a kaplan-meier survival curve constructed. results: average survival time of control animals was ± minutes compared to , ± , minutes for treated animals. significantly more pigs in the treatment group survived to hours than in the control group (p = . ). two of the treatment pigs survived to the endpoint of days, but showed necrosis of the distal lower extremity. conclusion: long-term survival after coral snake envenomation is possible in the absence of antivenin with the use of pressure immobilization bandages. 
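a minimal sketch of the pharmacokinetic comparison described above, computing per-subject area under the concentration-time curve by the trapezoidal rule and comparing crossover arms with a paired t-test, is shown below; the sampling times and concentrations are invented for illustration.

```python
# illustrative sketch: per-subject auc (trapezoidal rule) for two crossover
# arms, compared with a paired t-test. concentrations are invented.
import numpy as np
from scipy.stats import ttest_rel

times_h = np.array([0, 0.5, 1, 2, 4, 8])   # sampling times (hours)

# rows = subjects, columns = time points (metabolite concentration, ng/ml)
liquid = np.array([[0, 1.1, 1.8, 1.4, 0.7, 0.2],
                   [0, 0.9, 1.5, 1.2, 0.6, 0.2],
                   [0, 1.3, 2.0, 1.6, 0.8, 0.3]])
solid  = np.array([[0, 1.6, 2.6, 2.0, 1.0, 0.4],
                   [0, 1.4, 2.2, 1.8, 0.9, 0.3],
                   [0, 1.8, 2.9, 2.3, 1.2, 0.5]])

auc_liquid = np.trapz(liquid, times_h, axis=1)
auc_solid  = np.trapz(solid, times_h, axis=1)
t, p = ttest_rel(auc_liquid, auc_solid)
print(f"mean auc liquid = {auc_liquid.mean():.2f}, "
      f"solid = {auc_solid.mean():.2f}, paired p = {p:.3f}")
```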
the applied pressure of the bandage is critical to allowing survival without secondary consequences (i.e. necrosis) of envenomation. future studies should be designed to accurately monitor the pressures applied. background: patients exposed to organophosphate (op) compounds demonstrate a central apnea. the kölliker-fuse nuclei (kf) are cholinergic nuclei in the brainstem involved in central respiratory control. objectives: we hypothesize that exposure of the kf is both necessary and sufficient for op-induced central apnea. methods: anesthetized and spontaneously breathing wistar rats (n = ) were exposed to a lethal dose of dichlorvos using three experimental models. experiment (n = ) involved systemic op poisoning using subcutaneous (sq) dichlorvos ( mg/kg or x ld ). experiment (n = ) involved isolated poisoning of the kf using stereotactic microinjections of dichlorvos ( micrograms in microliters) into the kf. experiment (n = ) involved systemic op poisoning with isolated protection of the kf using sq dichlorvos ( mg/kg) and stereotactic microinjections of organophosphatase a (opda), an enzyme that degrades dichlorvos. respiratory and cardiovascular parameters were recorded continuously. histological verification of injection site was performed using kmno injections. animals were followed post-poisoning for hour or death. betweengroup comparisons were performed using a repeated measured anova or student's t-test where appropriate. results: animals poisoned with sq dichlorvos demonstrated respiratory depression starting . min post exposure, progressing to apnea . min post exposure. there was no difference in respiratory depression between animals with sq dichlorvos and those with dichlorvos microinjected into the kf. despite differences in amount of dichlorvos ( mg/kg vs . mg/kg) and method of exposure (sq vs cns microinjection), min following dichlorvos both groups (sq vs microinjection respectively) demonstrated a similar percent decrease in respiratory rate ( . vs . , p = . ), minute ventilation ( background: patients sustaining rattlesnake envenomation often develop thrombocytopenia, the etiology of which is not clear. laboratory studies have demonstrated that venom from several species, including the mojave rattlesnake (crotalus scutulatus scutulatus), can inhibit platelet aggregation. in humans, administration of crotaline fab antivenom (av) has been shown to result in transient improvement of platelet levels; however, it is not known whether platelet aggregation also improves after av administration. objectives: to determine the effect of c. scutulatus venom on platelet aggregation in vitro in the presence and absence of crotaline fab antivenom. methods: blood was obtained from four healthy male adult volunteers not currently using aspirin, nsaids, or other platelet-inhibiting agents. c. scutulatus venom from a single snake with known type b (hemorrhagic) activity was obtained from the national natural toxins research center. measurement of platelet aggregation by an aggregometer was performed using five standard concentrations of epinephrine (a known platelet aggregator) on platelet-rich plasma over time, and a mean area under the curve (auc) was calculated. five different sample groups were measured: ) blood alone; ) blood + c. scutulatus venom ( . mg/ml); ) blood + crotaline fab av ( mg/ml); ) blood + venom + av ( mg/ ml); ) blood + venom + av ( mg/ml). standard errors of the mean (sem) were calculated for each group. 
results: antivenom administration by itself did not significantly affect platelet aggregation compared to baseline ( . ± . %, p = . ). administration of venom decreased platelet aggregation ( . ± . %, p < . ). concentrated av administration in the presence of venom normalized platelet aggregation ( . ± . %) and in the presence of diluted av significantly increased aggregation ( . ± . %); p < . for both groups when compared to the venom-only group. to control for the effects of the venom and av, each was run independently in platelet-rich plasma without epinephrine; neither was found to significantly alter platelet aggregation. conclusion: crotaline fab av improved platelet aggregation in an in vitro model of platelet dysfunction induced by venom from c. scutulatus. the mechanism of action remains unclear but may involve inhibition of venom binding to platelets or a direct action of the antivenom on platelets. background: routine use of both breathalyzers and hand sanitizers is common across emergency depart-ments. the most common hand sanitizer on the market, purell, contains % ethyl alcohol and a lesser amount of isopropyl alcohol. previous investigations have documented that risk is low to the health care worker who applies frequent hand sanitizers to themselves. however, it is unknown whether this alcohol mixture causes false readings on a breathalyzer machine being used to determine alcohol levels on others. objectives: to determine the effect on the measurement of breathalyzer readings in individuals who have not consumed alcohol after hand sanitizer is applied to the experimenter holding a breathalyzer machine. methods: after obtaining informed consent, a breathalyzer reading was obtained in participants who had not consumed any alcohol in the last hours. three different experiments were performed with different participants in each. in experiment , two pumps of hand sanitizer were applied to the experimenter. without allowing the sanitizer to dry, the experimenter then measured the breathalyzer reading of the participant. in experiment , one pump of sanitizer was applied to the experimenter. measurements of the participant were taken without allowing the sanitizer to dry. in experiment , one pump of sanitizer was placed on the experimenter and rubbed until dry according to the manufacturer's recommendations. readings were recorded and analyzed using paired t-tests. results: the initial breathalyzer reading for all participants was . after two pumps of hand sanitizer were applied without drying (experiment ), breathalyzers ranged from . to . , with a mean above the legalintoxication limit of . (t( ) = ) . , p < . ). after one pump of hand sanitizer was applied without drying (experiment ), breathalyzers ranged from . to . , with a mean of . (t( ) = ) . , p < . ). after one pump of hand sanitizer was applied according to manufacturer's directions (experiment ), breathalyzers ranged from . to . with a mean of . (t( ) = ) . , p < . ). conclusion: use of hand sanitizer according to the manufacturer's recommendations results in a small but significant increase in breathalyzer readings. however, the improper and overuse of common hand sanitizer elevates routine breathalyzer readings, and can mimic intoxication in individuals who have not consumed alcohol. stephanie carreiro, jared blum, francesca beaudoin, gregory jay, jason hack objectives: the primary aim of this study is to determine if pretreatment with ile affects the hemodynamic response to epinephrine in a rat model. 
hemodynamic response was measured by a change in heart rate (hr) and mean arterial pressure (map). we hypothesized that ile would limit the rise in map and hr that typically follow epinephrine administration. methods: twenty male sprague dawley rats (approximately - weeks of age) were sedated with isoflurane and pretreated with a ml/kg bolus of ile or normal saline, followed by a mcg/kg dose of epinephrine intravenously. intra-arterial blood pressure and hr were monitored continuously until both returned to baseline (biopaq). a multifactorial analysis of variance (manova) was performed to assess the difference in map and hr between the two groups. standardized t-tests were then used to compare the peak change in map, time to peak map, and time to return to baseline map in the two groups. results: overall, a significant difference was found between the two groups in map (p = . ) but not in hr (p = . ). there was a significant difference (p = . ) in time to peak map in the ile group ( sec, % ci - ) versus the saline group ( sec, % ci - ) and a significant difference (p = . ) in time to return to baseline map in ile group ( sec, % ci - ) versus the saline group ( sec, % ci - ). there was no significant difference (p = . ) in the peak change in map of the ile group ( . , mmhg, % ci - ) versus the saline group ( . mmhg, % ci - ). conclusion: our data show that in this rat model ile pretreatment leads to a significant difference in map response to epinephrine, but no difference in hr response. ile delayed the peak effect and prolonged the duration of effect on map but did not alter the peak increase in map. this suggests that the use of ile may delay the time to peak effect of epinephrine if the drugs are administered concomitantly to the same patient. further research is needed to explore the mechanism of this interaction. rasch analysis of the agitation severity scale when used with emergency department acute psychiatry patients tania d. strout, michael r. baumann maine medical center, portland, me background: agitation is a frequently observed and problematic phenomenon in mental health patients being treated in the emergency setting. the agitation severity scale (agss), a reliable and valid instrument, was developed using classical test theory to measure agitation in acute psychiatry patients. objectives: the aim of this study was to analyze the agss according to the rasch measurement model and use the results to determine whether improvements to the instrument could be made. methods: this prospective, observational study was irb-approved. adult ed patients with psychiatric chief complaints and dsm-iv-tr diagnoses were observed using the agss. the rasch rating scale model was employed to evaluate the items comprising the agss using winsteps statistical software. unidimensionality, item fit, response category performance, person and item separation reliability, and hierarchical ordering of items were all examined. a principle components analysis (pca) of the rasch residuals was also performed. results: variable maps revealed that all of the agss items were used to some degree and that the items were ordered in a way that makes clinical sense. several duplicative items, indicating the same degree of agitation, were identified. item ( . ) and person ( . ) separation statistics were adequate, indicating appropriate spread of items and subjects along the agitation continuum and providing support for the instrument's reliability. keymaps indicated that the agss items are functioning as intended. 
analysis of fit demonstrated no extreme misfitting items. pca of the rasch residuals revealed a small amount of residual variance, but provided support for the agss as being unidimensional, measuring the single construct of agitation. the results of this rasch analysis support the agss as a psychometrically robust instrument for use with acute psychiatry patients in the emergency setting. several duplicative items were identified that may be eliminated and re-evaluated in future research; this would result in a shorter, more clinically useful scale. in addition, a gap in items for patients with lower levels of agitation was identified. generation of additional items intended to measure low levels of agitation could improve clinician's ability to differentiate between these patients. background: attempted suicide is one of the strongest clinical predictors of subsequent suicide and occurs up to times more frequently than completed suicide. as a result, suicide prevention has become a central focus of mental health policy. in order to improve current treatment and intervention strategies for those presenting with suicide attempt and self-injury in the emergency department (ed), it is necessary to have a better understanding of the types of patients who present to the ed with these complaints. objectives: to describe the epidemiology of ed visits for attempted suicide and self-inflicted injury over a year period. methods: data were obtained from the national hospital ambulatory medical care survey (nhamcs). all visits for attempted suicide and self-inflicted injury (e -e ) during - were included. trend analyses were conducted using stata's nptrend (a nonparametric test for trends that is an extension of the wilcoxon rank-sum test) and regression analyses. a two-tailed p < . was considered statistically significant. results: over the -year period, there were an average of , annual ed visits for attempted suicide and self-inflicted injury ( . [ % confidence interval (ci) . - . ] visits per , us population). the overall mean patient age was years, with visits most common among ages - ( . ; %ci . - . ). the average annual number of ed visits for suicide attempt and self-inflicted injury more than doubled from , in - to , in - . during the same timeframe, ed visits for these injuries per , us population almost doubled for males ( . to . ), females ( . to . ), whites ( . to . ), and blacks ( . to . ). no temporal differences were found for method of injury or ed disposition; there was, however, a significant decrease in visits determined by the physician to be urgent/emergent from % in to % in . conclusion: ed visit volume for attempted suicide and self-inflicted injury has increased over the past two decades in all major demographic groups. awareness of these longitudinal trends may assist efforts to increase research on suicide prevention. in addition, this information may be used to inform current suicide and self-injury related ed interventions and treatment programs. benjamin l. bregman, janice c. blanchard, alyssa levin-scherz george washington university, washington, dc background: the emergency department (ed) has increasingly become a health care access point for individuals with mental health needs. recent studies have found that rates of major depression disorder (mdd) diagnosed in eds are far above the national average. 
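the trend analyses above reference stata's nptrend, a cuzick-type extension of the wilcoxon rank-sum test; a hand-rolled sketch of that test (without correction for ties) on hypothetical yearly visit rates is shown below.

```python
# illustrative sketch: a cuzick-type nonparametric test for trend across
# ordered groups (years), in the spirit of stata's nptrend referenced above.
# no tie correction is applied; the visit rates are hypothetical.
import numpy as np
from scipy.stats import rankdata, norm

def cuzick_trend(values, group_scores):
    """values: observations; group_scores: ordered group score per observation."""
    values = np.asarray(values, dtype=float)
    scores = np.asarray(group_scores, dtype=float)
    n = len(values)
    ranks = rankdata(values)
    t_stat = np.sum(scores * ranks)                 # sum of score * rank
    labels, counts = np.unique(scores, return_counts=True)
    l_sum = np.sum(labels * counts)
    expected = (n + 1) * l_sum / 2.0
    variance = (n + 1) / 12.0 * (n * np.sum(counts * labels ** 2) - l_sum ** 2)
    z = (t_stat - expected) / np.sqrt(variance)
    return z, 2 * norm.sf(abs(z))

# yearly ed visit rates per sampled hospital (made-up numbers)
rates = [3.1, 2.9, 3.4,  3.8, 3.6, 4.0,  4.5, 4.7, 4.3,  5.1, 5.4, 5.0]
years = [1, 1, 1,  2, 2, 2,  3, 3, 3,  4, 4, 4]
z, p = cuzick_trend(rates, years)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```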
we conducted a study assessing whether individuals with frequent ed visits had higher rates of mdd than those with fewer ed visits in order to help guide screening and treatment of depressed individuals encountered in the ed. objectives: this study evaluated potential risk factors associated with mdd. we hypothesized that patients who are frequent ed visitors will have higher rates of mdd. methods: this was a single center, prospective, crosssectional study. we used a convenience sample of noncritically ill, english speaking adult patients presenting with non-psychiatric complaints to an urban academic ed over months in . we oversampled patients presenting with ‡ visits over the previous days. subjects were surveyed about their demographic and other health and health care characteristics and were screened with the phq , a nine-item questionnaire that is a validated, reliable predictor of mdd. we conducted bivariate (chi-square) and multivariate analysis controlling for demographic characteristics using sta-ta v. . . our principal dependent variable of interest was a positive depression screen (phq score ‡ ). our principal independent variable of interest was ‡ visits over the previous days. results: our response rate was . % with a final sample size of . of our total sample, ( . %) had three or greater visits within the prior days. one hundred ( %) frequent visitors had a positive phq mdd screen as compared to ( . %) of subjects with fewer than three visits (p < . ). in our multivariate analysis, the odds for having three or more visits for subjects who had a positive depression screen was . ( . , . ). of subjects with three or more visits with a positive depression screen, only ( %) were actively being treated for mdd at the time of their visit. conclusion: our study found a high prevalence of untreated depression among frequent users of the ed. eds should consider routinely screening patients who are frequent consumers for mdd. in addition, further studies should evaluate the effect of early treatment and follow up for mdd on overall utilization of ed services. access to psychiatric care among patients with depression presenting to the emergency department janice c. blanchard, benjamin l. bregman, dana rosenfarb, qasem al jabr, eun kim george washington university, washington, dc background: literature suggests that there is a high rate of major depressive disorder (mdd) in emergency department (ed) users. however, access to outpatient mental health services is often limited due to lack of providers. as a result, many persons with mdd who are not in active treatment may be more likely to utilize the ed as compared to those who are currently undergoing outpatient treatment. objectives: our study evaluated utilization rates and demographic characteristics associated with patients with a prior diagnosis of mdd not in active treatment. we hypothesized that patients who present to the ed with untreated mdd will have more frequent ed visits. methods: this was a single center, prospective, crosssectional study. we used a convenience sample of noncritically ill, english speaking adult patients presenting with non-psychiatric complaints to an urban academic ed over months in . subjects were surveyed about their demographic and other health and health care characteristics and were screened with the phq , a nine-item questionnaire that is a validated, reliable predictor of mdd. we conducted bivariate (chi-square) and multivariate analysis controlling for demographic characteristics using stata v. . 
. our principal dependent variable of interest was a positive depression screen (phq ‡ ). our analysis focused on the subset of patients with a prior diagnosis of mdd with a positive screen for mdd during their ed visit. results: our response rate was . % with a final sample size of . ( . %) patients screened positive for mdd with a phq score ‡ . of the patients with a positive depression screen, . % reported a prior history of treatment for mdd (n = ). of these patients, only . % were currently actively receiving treatment. hispanics who screened positive for depression with a history of mdd were less likely to actively be undergoing treatment as compared to non-hispanics ( . % versus . %, p = . ). patients with incomes less than $ , were more likely to actively be receiving treatment as opposed to higher incomes ( . % versus . % p = . ). conclusion: patients presenting to our ed with untreated mdd are more likely to be hispanic and less likely to be low income. the emergency department may offer opportunities to provide antidepressant treatment for patients who screen positive for depression but who are not currently receiving treatment. evaluation of a two-question screening tool (phq- ) for detecting depression in emergency department patients jeffrey p. smith, benjamin bregman, janice blanchard, nasser hashim, mary pat mckay george washington university, washington, dc background: the literature suggests there is a high rate of undiagnosed depression in ed patients and that early intervention can reduce overall morbidity and health care costs. there are several well validated screening tools for depression including the nine-item patient health questionnaire (phq- ). a tool using a two-question subset, the phq- , has been shown to be an easily administered, reasonably sensitive screening tool for depression in primary care settings. objectives: to determine the sensitivity and specificity of the phq- in detecting major depressive disorders (mdd) among adult ed patients presenting to an urban teaching hospital. we hypothesize that the phq- is a rapid, effective screening tool for depression in a general ed population. methods: cross sectional survey of a convenience sample of adult, non-critically ill, english speaking patients with medical and not psychiatric complaints presenting to the ed between am and pm weekdays. patients were screened for mdd with the phq- . we used spss v . to analyze the specificity, sensitivity, positive predictive value (ppv), negative predictive value (npv), and kappa of phq- scores of and (out of possible total score of ) compared to a validated cut-off score of or higher of points on the phq- . the two questions on the phq- are: ''over the last two weeks, how often have you had little interest in doing things? how often have you felt down, depressed or hopeless?'' responses are scored from - based on ''never'',''several days'', ''more than half'', ''nearly every day''. results: subjects of approached agreed to participate ( . % response rate), and ( . %) completed the phq- . the phq- identified ( . %) subjects with mdd. table outlines the percent of subjects who were positive and the sensitivity, specificity, positive, and negative predictive values and kappa for each cut-off on the phq- . conclusion: the phq- is a sensitive and specific screening tool for mdd in the ed setting. moreover, the phq- is closely correlated with the phq- , especially if a score of or greater is used. 
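the screening-test statistics reported above (sensitivity, specificity, predictive values, and kappa for the phq-2 against the phq-9 reference) can be computed from a simple 2x2 table, as in the sketch below; the cell counts are hypothetical.

```python
# illustrative sketch: screening-test statistics and cohen's kappa for a
# phq-2 cutoff against the phq-9 reference standard, from a 2x2 table.
# the counts are hypothetical.
def screening_stats(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    po = (tp + tn) / n                                             # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2  # chance agreement
    kappa = (po - pe) / (1 - pe)
    return sens, spec, ppv, npv, kappa

# rows: phq-2 positive/negative; columns: phq-9 positive/negative
sens, spec, ppv, npv, kappa = screening_stats(tp=42, fp=30, fn=8, tn=220)
print(f"sensitivity {sens:.2f}, specificity {spec:.2f}, "
      f"ppv {ppv:.2f}, npv {npv:.2f}, kappa {kappa:.2f}")
```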
given the simplicity and ease of using a two-item questionnaire and the high rates of undiagnosed depression in the ed, including this brief, self-administered screening tool to ed patients may allow for early awareness of possible mdd and appropriate evaluation and referral. patients. however, much of this self-harm behavior is not discovered clinically and very little is known about the prevalence and predictors of current ed screening practices. attention to this issue is increasing due to the joint commission's patient safety goal , which focuses on identification of suicide risk in patients. objectives: to describe the prevalence and predictors of screening for self-harm and of presence of current self-harm in eds. methods: data were obtained from the nimh-funded emergency department safety assessment and followup evaluation (ed-safe). eight u.s. eds reviewed charts in real time for - hours a week between / and / . all patients presenting during enrollment shifts were characterized as to whether a selfharm screening had been performed by ed clinicians. a subset of patients with a positive screening was asked about the presence of self-harm ideation, attempts, or both by trained research staff. we used multivariable logistic regression to identify predictors of screening and of current self-harm. data were clustered by site. in each model we examined day and time of presentation, age < years, sex, race, and ethnicity. results: of the , patients presenting during research shift, , ( %) were screened for self-harm. screening rates varied among sites and ranged from % to %, with one outlier at %. of those screened, , ( %) had current self-harm. among those with selfharm approached by study personnel (n = , ), ( %) had thoughts of self-harm (suicidal or non-suicidal), ( %) had thoughts of suicide, ( %) had self-harm behavior, and ( %) had suicide attempt(s) over the preceding week. predictors of being screened were: age < years, male sex, weekend presentation, and night shift presentation (table) . among those screened, predictors of current self-harm were: age < years, white race, and night shift presentation. conclusion: screening for self-harm is uncommon in ed settings, though practices vary dramatically by site. patients presenting at night and on weekends are more likely to be screened, as are those under age and males. current self-harm is more common among those presenting on night shift, those under age , and whites. results: there were out-of-hospital records reviewed, and hospital discharge data were available in non-cardiac arrest patients. of the patients, ( . %) patients survived to hospital discharge and ( . %) died during hospitalization. the mean age of those transported was years (sd ), ( %) were male, ( %) were trauma-related, and ( %) were admitted to the icu. average systolic blood pressure (sbp), pulse (p), respiratory rate (rr), oxygen saturation (o sat), and end-tidal carbon dioxide (etco ) were sbp = (sd ), p = (sd ), rr = (sd ), o sat = % (sd ), and etco = (sd conclusion: of all the initial vital signs recorded in the out-of-hospital setting, etco was the most predictive of mortality. these findings suggest that pre-hospital etco is a useful clinical tool for determining severity of illness and appropriate triage. 
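a minimal sketch of the multivariable logistic regression with site-clustered standard errors used in the self-harm screening analysis above is given below; the simulated data frame and variable names are assumptions.

```python
# illustrative sketch: multivariable logistic regression with standard errors
# clustered by site, loosely mirroring the predictors-of-screening analysis
# described above. the simulated data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "screened":  rng.integers(0, 2, n),
    "age_lt_60": rng.integers(0, 2, n),
    "male":      rng.integers(0, 2, n),
    "night":     rng.integers(0, 2, n),
    "site":      rng.integers(1, 9, n),   # eight sites
})

res = smf.logit("screened ~ age_lt_60 + male + night", data=df).fit(
    disp=False, cov_type="cluster", cov_kwds={"groups": df["site"]})
print(np.exp(res.params))        # odds ratios
print(np.exp(res.conf_int()))    # cluster-robust 95% confidence intervals
```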
background: the prehospital use of continuous positive airway pressure (cpap) ventilation is a relatively new management strategy for acute cardiogenic pulmonary edema (acpe), and there is little high-quality evidence on its benefits or potential dangers in this setting. objectives: the aim of this study was to determine whether patients in severe respiratory distress treated with cpap in the prehospital setting have a lower mortality than those treated with usual care. methods: randomized, controlled trial comparing usual care versus cpap (whisperflow®) in a prehospital setting, for adults experiencing severe respiratory distress, with falling respiratory efforts, due to presumed acpe. patients were randomized to receive either usual care, including conventional medications (nitrates, furosemide, and oxygen) plus bag-valve-mask ventilation, or conventional medications plus cpap. the primary outcome was prehospital or in-hospital mortality. secondary outcomes were need for tracheal intubation, length of hospital stay, change in vital signs, and arterial blood gas results. we calculated relative risk with % cis. results: fifty patients were enrolled with mean age . (sd . ), male . %, mortality . %. the risk of death was significantly reduced in the cpap arm, with mortality of . % ( deaths) in the usual care arm compared to . % ( death) in the cpap arm (rr . ; % ci . to . ; p = . ). patients who received cpap were significantly less likely to have respiratory acidosis (mean difference in ph . ; % ci . to . ; p = . ; n = ) than patients receiving usual care. the length of hospital stay was significantly less in the patients who received cpap (mean difference . days; % ci (-) . to . , p = . ). conclusion: we found that cpap significantly reduced mortality, respiratory acidosis, and length of hospital stay for patients in severe respiratory distress caused by acpe. this study shows that the use of cpap for acpe improves patient outcomes in the prehospital setting. (originally submitted as a ''late-breaker.'') trial reg. anzctr actrn ; funding: fisher and paykel, suppliers of the whisperflow® cpap device. background: because emergency service utilization continues to climb, validated methods to safely identify and triage low-acuity patients to either alternate care destinations or a complaint-appropriate level of ems response are of keen interest to ems systems and potentially payers. though the literature generally supports the medical priority dispatch system (mpds) as a tool to predict low-acuity patients by various standards, correlation with initial patient physiologic data and patient age is novel. objectives: to determine whether the six mpds priority determinants for protocol (sick person) can be used to predict initial ems patient acuity assessment or the severity of an aggregate physiologic score. our long-term goal is to determine whether mpds priority can be used to predict patient acuity and potentially send only a first responder to perform an in-person assessment to confirm this acuity, while reserving als transport resources for higher-acuity patients. methods: calls dispatched through the wichita-sedgwick county - - center between july , and october , using mpds protocol (sick person) were linked to the ems patient care record for all patients aged and older. the six mpds priority determinants were evaluated for correlation with initial ems acuity code, initial vital signs, rapid acute physiology score (raps), and patient age.
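the relative risk and confidence interval reported in the cpap trial above can be computed from a 2x2 table with the katz log method, as sketched below; the cell counts are placeholders rather than the trial's results.

```python
# illustrative sketch: relative risk of death with a 95% confidence interval
# (katz log method) from a 2x2 table, as in the cpap trial analysis above.
# the cell counts are placeholders, not the trial's data.
import math

def relative_risk(a, b, c, d):
    """a/b: events/non-events in the treatment arm; c/d: events/non-events in the control arm."""
    risk_tx, risk_ctrl = a / (a + b), c / (c + d)
    rr = risk_tx / risk_ctrl
    se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
    lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
    hi = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return rr, lo, hi

rr, lo, hi = relative_risk(a=1, b=24, c=8, d=17)   # cpap vs usual care (hypothetical)
print(f"rr = {rr:.2f} (95% ci {lo:.2f} to {hi:.2f})")
```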
the ems acuity code scores patients from low to severe acuity, based on initial ems assessment. results: there were calls dispatched using protocol for those years of age and older during the period, representing approximately % of all ems calls. there is a significant difference in the first encounter vital signs among different mpds priority levels. based on the logistic regression model, the mpds priority code alone had a sensitivity of % and specificity of % for identifying low-acuity patients with ems acuity score as the standard. the area under the curve (auc) for roc is . for mpds priority codes alone, while addition of age increases this value to . . if we use the raps score as the standard to the mpds priority code, auc is . . if we include both mpds and age in the model, the auc is . . conclusion: in our system, mpds priority codes on protocol (sick person) alone, or with age or raps score, are not useful either as predictors of patient acuity on ems arrival or to reconfigure system response or patient destination protocols. alternate ambulance destination program c. nee-kofi mould-millman , tim mcmahan , michael colman , leon h. haley , arthur h. yancey emory university, atlanta, ga; grady ems, atlanta, ga background: low-acuity patients calling - - are known to utilize a large proportion of ems and ed resources. the national association of ems physicians and acep jointly support ems alternate destination programs (adps) in which low-acuity patients are allocated alternative resources non-emergently. analysis of one year's adp data from our ems system revealed that only . % of eligible patients were transported to alternate destinations (ambulatory clinics). reasons for this low success rate need investigation. objectives: to survey emts and discover the most frequent reasons given by them for transportation of eligible patients to eds instead of to clinics. methods: this study was conducted within a large, urban, hospital-based ems system. upon conducting an adp for months, a paper-based survey was created and pre-tested. all medics with any adp-eligible patient contact were included. emts were asked about personal, patient, and system related factors contributing to ed transport during the last months of the adp. qualitative data were coded, collated, and descriptively reported. results: sixty-three respondents ( emt-intermediates and emt-paramedics) completed the survey, representing % of eligible emts. thirty-one emts ( %) responded that they did not attempt to recruit eligible patients into the adp in the last program months. of those emts, ( %) attributed their motive to multiple, prior, failed recruitment attempts. the emts who actively recruited adp patients were asked reasons given by patients for clinic transport refusals: ( %) cited that patients reported no prior experience of care at the participating clinics, and ( %) reported patients had a strong preference for care in an ed. regarding system-related factors contributing to non-clinic transport, of the emts ( %) reported that clinic-consenting patients were denied clinic visits, mostly because of non-availability of same-day clinic appointments. conclusion: respondents indicated that poor emt enrollment of eligible patients, lack of available clinic time slots, and patient preference for ed care were among the most frequent reasons contributing to the low success rate of the adp. 
this information can be used to enhance the success of this, and potentially other adp programs, through modifications to adp operations and improved patient education. the effect of a standardized offline pain treatment protocol in the prehospital setting on pediatric pain treatment brent kaziny , maija holsti , nanette dudley , peter taillac , hsin-yi weng , kathleen adelgais university of utah, school of medicine, salt lake city, ut; university of colorado, school of medicine, aurora, co background: pain is often under treated in children. barriers include need for iv access, fear of delayed transport, and possible complications. protocols to treat pain in the prehospital setting improve rates of pain treatment in adults. the utah ems for children (emsc) program developed offline pediatric protocol guidelines for ems providers, including one protocol that allows intranasal analgesia delivery to children in the prehospital setting. objectives: to compare the proportion of pediatric patients receiving analgesia for orthopedic injury by prehospital providers before and after implementation of an offline pediatric pain treatment protocol. methods: we conducted a retrospective study of patients entered into the utah prehospital on-line active reporting information system (polaris, a database of statewide ems cases) both before and after initiation of the pain protocol. patients were included if they were age - years, with a gcs of - , an isolated extremity injury, and were transported by an ems agency that had adopted the protocol. pain treatment was compared for years before and months after protocol implementation with a wash-out period of months for agency training. the difference in treatment proportions between the two groups was analyzed and % cis were calculated. results: during the two study periods, patients met inclusion criteria. patient demographics are outlined in the table. / ( . %) patients were treated for pain before compared to / ( . %) patients treated after the pain protocol was implemented; a difference of . % ( % ci: . %- . %). patients were more likely to receive pain medication if they had a pain score documented (or: . ; % ci: . - . ) and if they were treated after the implementation of a pain protocol (or: . ; % ci: . - . ). factors not associated with the treatment of pain include age, sex, and mechanism of injury. conclusion: the creation and adoption of statewide emsc pediatric offline protocol guideline for pain management is associated with a significant increase in use of analgesia for pediatric patients in the prehospital setting. background: evidence-based guidelines are needed to determine the appropriate use of air medical transport, as few criteria currently used predict the need for air transport to a trauma center. we previously developed a clinical decision rule (cdr) to predict mortality in injured, helicopter-transported patients. objectives: this study is a prospective validation of the cdr in a new population. methods: a prospective, observational cohort analysis of injured patients ( ‡ y.o.) transported by helicopter from the scene to one of two level i trauma centers. variables analyzed included patient demographics, diagnoses, and clinical outcomes (in-hospital mortality, emergent surgery w/in hrs, blood transfusion w/in hrs, icu admit greater than hrs, combined outcome of all). 
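the before-after comparison of treatment proportions described above (with a confidence interval for the difference and an odds ratio) can be computed as in the sketch below; the counts are hypothetical.

```python
# illustrative sketch: difference in treatment proportions before vs. after a
# protocol change, with a wald 95% confidence interval, plus an odds ratio
# with a woolf (log) confidence interval. counts are hypothetical.
import math

treated_pre, n_pre = 35, 240     # patients receiving analgesia before the protocol
treated_post, n_post = 95, 260   # patients receiving analgesia after the protocol

p1, p2 = treated_pre / n_pre, treated_post / n_post
diff = p2 - p1
se_diff = math.sqrt(p1 * (1 - p1) / n_pre + p2 * (1 - p2) / n_post)
print(f"difference = {diff:.3f} "
      f"(95% ci {diff - 1.96*se_diff:.3f} to {diff + 1.96*se_diff:.3f})")

a, b = treated_post, n_post - treated_post
c, d = treated_pre, n_pre - treated_pre
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
print(f"or = {odds_ratio:.2f} "
      f"(95% ci {math.exp(math.log(odds_ratio) - 1.96*se_log_or):.2f} "
      f"to {math.exp(math.log(odds_ratio) + 1.96*se_log_or):.2f})")
```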
prehospital variables were prospectively obtained from air medical providers at the time of transport and included past medical history, mechanism of injury, and clinical factors. descriptive statistics compared those with and without the outcomes of interest. the previous cdr (age ‡ , gcs £ , sbp < , flail chest) was prospectively applied to the new population to determine its accuracy and discriminatory ability. results: patients were transported from october -august . the majority of patients were male ( %), white ( %), with an injury occurring in a rural location ( %). most injuries were blunt ( %) with a median iss of . overall mortality was %. the most common reasons for air transport were: mvc with high risk mechanism ( %), gcs £ ( %), loc > minutes ( %), and mvc > mph ( %). of these, only gcs £ was significantly associated with any of the clinical outcomes. when applying the cdr, the model had a sensitivity of % ( . %- %), a specificity of . % ( . %- . %), a npv of % ( . %- %), and a ppv of . % ( . %- . %) for mortality. the area under the curve for this model was . , suggesting excellent discriminatory ability. conclusion: the air transport decision rule in this study performed with high sensitivity and acceptable specificity in this validation cohort. further external validation in other systems and with ground transported patients are needed in order to improve decision making for the use of helicopter transport of injured patients. background: acute non-variceal upper gastrointestinal (gi) bleeding is a common indication for hospital admission. to appropriately risk-stratify such patients, endoscopy is recommended within hours. given the possibility to safely manage patients as outpatients after endoscopy, risk stratification as part of an emergency department (ed) observation unit (ou) protocol is proposed. objectives: our objective was to determine the ability of an ou upper gi bleeding protocol to identify a lowrisk population, and to expeditiously obtain endoscopy and disposition patients. we also identified rates of outcomes including changes in hemoglobin, abnormal endoscopy findings, admission, and revisits. background: acute uncomplicated pyelonephritis (pyelo) requires no imaging but a ct flank pain protocol (ctfpp) may be ordered to determine if patients with pyelo and flank pain also have an obstructing stone. the prevalence of kidney stone and the characteristics predictive of kidney stone in pyelo patients is unknown. objectives: to determine elements on presentation that predict ureteral stone, as well as prevalence of stone and interventions in patients undergoing ct for pyelo. methods: retrospective study of patients at an academic ed who received a ctfpp scan between / and / . ctfpps were identified and randomly selected for review. pyelo was defined as: positive urine dip for infection and > wbc/hpf on formal urinalysis in addition to flank pain/cva tenderness, chills, fever, nausea, or vomiting. patients were excluded for age < y.o., renal disease, pregnancy, urological anomaly, or recent trauma. clinical data ( elements) were gathered blinded to ct findings; ct results were abstracted separately and blinded to clinical elements. ct findings of hydronephrosis and hyrdroureter (hydro) were used as a proxy for hydro that could be determined by ultrasound prior to ct. patients were categorized into three groups: ureteral stone, no significant findings, and intervention or follow-up required. 
classification and regression tree analysis was used to determine which variables could identify ureteral stone in this population of pyelo patients. results: out of the patients, ( . %) met criteria for pyelo; subjects had a mean age of ± . and % (n = ) were female. ct revealed ( %, % ci = . - . ) symptomatic stones, and ( %, % ci = . - . ) exams with no significant findings. two patients needed intervention/follow-up ( %, % ci = . - . ), one for perinephric hemorrhage and the other for pancreatitis. hydro was predictive of ureteral stone with an or = . ( % ci = . - , p < . ). eleven ( %) ureteral stone patients were admitted and ( %) of them had procedures. of these patients, % had ct signs of obstruction, ( %) had hydronephrosis, and ( %) had hydroureter. conclusion: hydronephrosis was predictive of ureteral stone and in-house procedures. prospective study is needed to determine whether ct scan is warranted in patients with pyelonephritis but without hydronephrosis or hydroureter.

objectives: the specific aim of this analysis was to describe characteristics of patients presenting to the emergency department (ed) at their index diagnosis, and to determine whether emergency presentation precludes treatment with curative intent. methods: we performed a retrospective cohort analysis on a prospectively maintained institutional tumor registry to identify patients diagnosed with crc from - . emrs were reviewed to identify which patients presented to the ed with acute symptoms of crc as the initial sign of their illness. the primary outcome variable was treatment plan (curative vs. palliative). secondary outcome variables included demographics, tumor type, and location. descriptive statistics were conducted for major variables. chi-square and fisher's exact tests were used to detect associations between categorical variables. a two-sample t-test was used to identify associations between continuous and categorical variables. results: between jan and dec , patients were identified at our institution with crc. ( %) were male and ( %) were female, with mean age . (sd . ). thirty-three patients ( . %) initially presented to the ed, of whom ( . %) received palliation. of patients who initially presented elsewhere, ( . %) received palliation. acute ed presentation with crc symptoms did not preclude treatment with curative intent (p = . ). patients who presented emergently were more likely to be female ( % vs male %; p = . ) and older ( vs. ; p = . ). there was no statistically significant relationship between age, sex, tumor location, or type and treatment approach. conclusion: patients with crc may present to the ed with acute symptoms, which ultimately leads to the diagnosis. emergent presentation of crc does not preclude patients from receiving therapy with curative intent.

cannabinoid (or . ) and white blood cell (wbc) count ≥ , /mm (or . , % ci . - . ) were among the predictors of admission. conclusion: age ≥ years is not associated with need for admission from an ed observation unit. older adults can successfully be cared for in these units. initial temperature, respiratory rate, and pulse were not predictive of admission, but extremely elevated blood pressure was predictive. other relevant predictor variables included comorbidities and elevated wbc count. advanced age should not be a disqualifying criterion for disposition to an ed observation unit.
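the pyelonephritis analysis above relies on classification and regression tree methods to find presenting features that identify ureteral stone. a rough sketch of that general approach, using scikit-learn's decision tree on synthetic data, is shown below; the predictor names and the simulated relationship (hydro dominating) are illustrative assumptions, not study data.

```python
# sketch: classification tree to identify predictors of ureteral stone
# in pyelonephritis patients. data and predictor names are synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 300
hydro = rng.integers(0, 2, n)               # hydronephrosis/hydroureter on ct (0/1)
fever = rng.integers(0, 2, n)
prior_stone = rng.integers(0, 2, n)
# simulate an outcome in which hydro dominates, mimicking the reported finding
p_stone = np.clip(0.05 + 0.55 * hydro + 0.05 * prior_stone, 0, 1)
stone = rng.binomial(1, p_stone)

X = np.column_stack([hydro, fever, prior_stone])
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=20, random_state=0)
tree.fit(X, stone)
print(export_text(tree, feature_names=["hydro", "fever", "prior_stone"]))
```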
older adult fallers in the emergency department luna ragsdale, cathleen colon-emeric duke university, durham, nc background: approximately / of community-dwelling older adults experience a fall each year, and . million are treated in u.s. emergency departments (ed) annually. the ed offers a potential location for identification of high-risk individuals and initiation of fall-prevention services that may decrease both fall rates and resource utilization. objectives: the goal of this study was to: ) validate an approach to identifying older adults presenting with falls to the ed using administrative data; and ) characterize the older adult who falls and presents to the ed and determine the rate of repeat ed visits, both fall-related and all visits, after an index fall-related visit. methods: we identified all older adults presenting to either of the two hospitals serving durham county residents during a six month period. manual chart review was completed for all encounters with icd codes that may be fall-related. charts were reviewed months prior and months post index visit. descriptive statistics were used to describe the cohort. results: a total of older adults were evaluated in the ed during this time period; ( . %) had an icd code for a potentially fall-related injury. of these, record review identified ( %) with a fall from standing height or less. of the fallers, . % of the patients were discharged, % were admitted, and % were admitted under observation. of those who fell, . % had an ed visit within the previous year. approximately / ( . %) of these were fall related. over half ( . %) of the patients who fell returned to the ed within one year of their index visit. a large proportion ( . %) of the return visits was fall-related. follow-up with a primary care provider or specialist was recommended in % of the patients who were discharged. overall mortality rate for fallers over the year following the index visit was %. conclusion: greater than fifty percent of fallers will return to the ed after an index fall, with a large proportion of the visits related to a fall. a large number of these fallers are discharged home with less than fifty percent having recommended follow-up. the ed represents an important location to identify high-risk older adults to prevent subsequent injuries and resource utilization. objectives: we studied whether falls from a standing position resulted in an increased risk for intracranial or cervical injury verses falling from a seated or lying position. methods: this is a prospective observational study of patients over the age of who presented with a chief complaint of fall to a tertiary care teaching facility. patients were eligible for the study if they were over age , were considered to be at baseline mental status, and were not triaged to the trauma bay. at presentation, a questionnaire was filled out by the treating physician regarding mechanism and position of fall, with responses chosen from a closed list of possibilities. radiographic imaging was obtained at the discretion of the treating physician. charts of enrolled patients were subsequently reviewed to determine imaging results, repeat studies done, or recurrent visits. all patients were called in follow-up at days to assess for delayed complications related to the fall. data were entered into a standardized collection sheet by trained abstractors. data were analyzed with fisher's exact test and descriptive statistics. this study was reviewed and approved by the institutional review board. 
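the fall-position comparison above analyzes injury rates with fisher's exact test. a minimal sketch of that test on a hypothetical two-by-two table follows; the counts are placeholders, not the study's data.

```python
# sketch: comparing injury rates between falls from standing and falls from
# a seated/lying position with fisher's exact test. counts are hypothetical.
from scipy.stats import fisher_exact

table = [[6, 192],   # fall from standing: injured / not injured
         [2, 62]]    # fall from seated or lying: injured / not injured
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio {odds_ratio:.2f}, p = {p_value:.3f}")
```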
results: two-hundred sixty two patients were enrolled during the study period. one-hundred ninety eight of these had fallen from standing and fell from either sitting or lying positions. the mean age for patients was (sd . ) for those who fell from standing and (sd . ) for those who fell from sitting or lying. there were patients with injuries who fell from standing: three with subdural hematomas, one with a cerebral contusion, one with an osteophyte fracture at c , and one with an occipital condyle fracture with a chip fracture of c . there were patients with injuries who fell from a seated or lying position: one with a traumatic subarachnoid hemorrhage and one with a type ii dens fracture. the overall rate of traumatic intracranial or cervical injury in elders who fell was %. no patients required surgical intervention. there was no difference in rate of injury between elders who fell from standing versus those who fell from sitting or lying (p = ). (table) . conclusion: both instruments identify the majority of patients as high-risk which will not be helpful in allocating scarce resources. neither the isar nor the trst can distinguish geriatric ed patients at high or low risk for or -month adverse outcomes. these prognostic instruments are not more accurate in dementia or lower literacy subsets. future instruments will need to incorporate different domains related to short-term adverse outcomes. background: for older adults, both inpatient and outpatient care involves not only the patient and physician, but often a family member or informal caregiver. they can assist in medical decision making and in performing the patient's activities of daily living. to date, multiple outpatient studies have examined the positive roles family members play during the physician visit. however, there is very limited information on the involvement of the caregiver in the ed and their relationship with the health outcomes of the patient. objectives: to assess whether the presence of a caregiver influences the overall satisfaction, disposition, and outpatient follow-up of elderly patients. we performed a three-step inquiry of patients over years old who arrived to the upenn ed. patients and care partners were initially given a questionnaire to understand basic demographic data. at the end of the ed stay, patients were given a satisfaction survey and followed through days to assess time to disposition, whether the patient was admitted or discharged, outpatient follow-up, and ed revisit rates. chi-square and t-tests were used to examine the strength of differences in the elderly patients' sociodemographics, self-rated health, receiving aid with their instrumental activities of daily living, and number of health problems by accompaniment status. multivariate regression models were constructed to examine whether the presence or absence of caregivers affected satisfaction, disposition, and follow-up. results: overall satisfaction was higher among patients who had caregivers ( . points), among patients who felt they were respected by their physician ( . points), and had lower lengths of stay ( hours). patients with caregivers were also more likely to be discharged home (or . ) and to follow-up with their regular physician (or . ). there was no evidence to suggest caregivers affected the overall rates of revisits back to an ed. conclusion: for older adults, medical care involves not only the patient and physician, but often a family member or an informal care companion. 
these results demonstrate the positive influence of caregivers on the patients they accompany, and emergency physicians should define ways to engage these caregivers during their ed stay. this will also allow caregivers to participate when needed and can help to facilitate transitions across care settings. background: shared decision making has been shown to improve patient satisfaction and clinical outcomes for chronic disease management. given the presence of individual variations in the effectiveness and side effects of commonly used analgesics in older adults, shared decision making might also improve clinical outcomes in this setting. objectives: we sought to characterize shared decision making regarding the selection of an outpatient analgesic for older ed patients with acute musculoskeletal pain and to examine associations with outcomes. methods: we conducted a prospective observational study with consecutive enrollment of patients age or older discharged from the ed following evaluation for moderate or severe musculoskeletal pain. two essential components of shared decision making, ) information provided to the patient and ) patient participation in the decision, were assessed via patient interview at one week using four-level likert scales. results: of eligible patients, were reached by phone and completed the survey. only % ( / ) of patients reported receiving 'a lot' of information about the analgesic, and only % ( / ) reported participating 'a lot' in the selection of the analgesic. there were trends towards white patients (p = . ) and patients with higher educational attainment (p = . ) reporting more participation in the decision. after adjusting for sex, race, education, and initial pain severity, patients who reported receiving 'a lot' of information were more likely to report optimal satisfaction with the analgesic than those receiving less information ( % vs. %, p < . ). after the same adjustments, patients who reported participating 'a lot' in the decision were also more likely to report optimal satisfaction with the analgesic ( % vs. %, p < . ) and greater reductions in pain scores (mean reduction in pain . vs. . , p < . ) at one week than those who participated less. background: quality of life (qol) measurements have become increasingly important in outcomes-based research and cost-utility analyses. dementia is a prevalent, often unrecognized, geriatric syndrome that may limit the accuracy of patient self-report in a subset of patients. the relationship between caregiver and geriatric patient qol in the emergency department (ed) is not well understood. objectives: to qualify the relationship between caregiver and geriatric patient qol ratings in ed patients with and without cognitive dysfunction. methods: this was a prospective, consecutive patient, cross-sectional study over two months at one urban academic medical center. trained research assistants screened for cognitive dysfunction using the short blessed test and evaluated health impairment using the quality of life-alzheimer's disease (qol-ad) test. when available in the ed, caregivers were asked to independently complete the qol-ad. consenting subjects were non-critically ill, english-speaking, community-dwelling adults over years of age. responses were compared using wilcoxon signed ranks test to assess the relationships between patient and caregiver scores from the qol-ad stratified by normal or abnormal cognitive screening results. significance was defined by p < . . 
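the qol-ad comparison above pairs each patient's self-rating with a caregiver rating and tests the difference with the wilcoxon signed-rank test. a brief sketch of that paired comparison on simulated item scores is shown below; qol-ad items are assumed to be scored 1 (poor) to 4 (excellent), and the data are not from the study.

```python
# sketch: paired comparison of patient vs. caregiver qol-ad scores with the
# wilcoxon signed-rank test. scores are simulated, not study data.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
patient = rng.integers(1, 5, size=40)                        # item scores 1-4
caregiver = np.clip(patient - rng.integers(0, 2, 40), 1, 4)  # caregivers rate slightly lower

stat, p = wilcoxon(patient, caregiver, zero_method="wilcox")
print(f"wilcoxon statistic {stat:.1f}, p = {p:.3f}")
```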
results: patient qol ratings were obtained from patient-caregiver pairs. patients were % female and % african-american, with a mean age of years, and % had abnormal cognitive screening tests. compared with caregivers, cognitively normal patients had no significant qol assessment differences except for questions of energy level and overall mood. on the other hand, cognitively impaired patients differed significantly on questions of energy level and ability to perform household chores, with a trend towards significant differences for living setting (p = . ) and financial situation (p = . ). in each category, the differences reflected a caregiver underestimation of quality compared with the patient's self-rating. conclusion: discrepancies between qol domains and total scores for patients with cognitive dysfunction and their caregivers highlight the importance of identifying cognitive dysfunction in ed-based outcomes research and cost-utility analyses. further research is needed to quantify the clinical importance of patient- and caregiver-assessed quality of life.

background: age is often a predictor of increased morbidity and mortality. however, it is unclear whether old age is a predictor of adverse outcome in syncope. objectives: to determine whether old age is an independent predictor of adverse outcome in patients presenting to the emergency department following a syncopal episode. methods: a prospective observational study was conducted from june to july enrolling consecutive adult ed patients (> years) presenting with syncope. syncope was defined as an episode of transient loss of consciousness. adverse outcomes or critical interventions were defined as gastrointestinal bleeding or other hemorrhage, myocardial infarction/percutaneous coronary intervention, dysrhythmia, alteration in antidysrhythmics, pacemaker/defibrillator placement, sepsis, stroke, death, pulmonary embolus, or carotid stenosis. outcomes were identified by chart review and -day follow-up phone calls. results: of patients who met inclusion criteria, an adverse event occurred in % of patients. overall, % of patients with risk factors had adverse outcomes compared to . % of patients with no risk factors. in particular, / ( %; % ci - %) of patients < with risk factors had adverse outcomes, while / ( %; % ci - %) of the elderly with risk factors had adverse outcomes. in contrast, among young people / ( %; % ci . - . %) of patients without risk factors had adverse outcomes, while / ( . %; % ci . - %) of patients ≥ without risk factors had adverse outcomes. conclusion: although the elderly are at greater risk for adverse outcomes in syncope, age ≥ alone does not appear to be a predictor of adverse outcome following a syncopal event. based on these data, it should be safe to discharge home from the ed patients with syncope but without risk factors, regardless of age. (originally submitted as a ''late-breaker.'')

background: adherence to national guidelines for hiv and syphilis screening in eds is not routine. in our ed, hiv and syphilis screening rates among patients tested for gonorrhea and chlamydia (gc/ct) have been reported to be % and %, respectively. objectives: to determine the effect of a sexually transmitted infection (sti) laboratory order set on hiv and syphilis screening among ed patients tested for gc/ct. we hypothesized that an sti order set would increase screening rates by at least %.
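the hypothesis just stated (an absolute increase in screening rates of at least a given size) implies a sample-size calculation for comparing two proportions, which the methods that follow describe. a minimal sketch using the standard normal-approximation formula is shown below; the baseline rate, target increase, and power are placeholders rather than the study's figures.

```python
# sketch: per-group sample size to detect an absolute increase in a screening
# proportion (two-sided alpha, normal approximation). the baseline rate and
# target rate below are placeholders, not the study's values.
from math import ceil, sqrt
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

print(n_per_group(p1=0.30, p2=0.40))   # e.g. baseline 30% -> 40% screening
```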
methods: a -month, quasi-experimental study in an urban ed comparing hiv and syphilis screening rates of gc/ct-tested patients before (control phase) and after (intervention phase) the implementation of an sti laboratory order set. the order set linked blood-based rapid hiv and syphilis screening with gc/ct testing. consecutive patients completing gc/ct testing were included. the primary outcome was the absolute difference in hiv and syphilis screening rates among gc/ct-tested patients between phases. we estimated that subjects per phase were needed to provide % power (p-value of ≤ . ) to detect an absolute difference in screening rates of %, assuming a baseline hiv screening rate of %. results: the ed census was , . characteristics of patients tested for gc/ct were similar between phases: the mean age was years (sd = ) and most were female ( %), black ( %), hispanic ( %), and unmarried ( %).

background: services have recommended the use of immunization programs against influenza within hospitals since the s. the emergency department (ed), being the ''safety net'' for most non-insured people, is an ideal setting to intervene and provide primary prevention against influenza. objectives: the purpose of this study is to assess whether a pharmacist-based influenza immunization program is feasible in the ed and successful in increasing the percentage of adult patients receiving the influenza vaccine. methods: implementation of a pharmacist-based immunization program was developed in coordination with ed physicians and nursing staff in . the nursing staff, using an embedded electronic questionnaire within their triage activity, screened patients for eligibility for the influenza vaccine. the pharmacist, using an electronic alert system within the electronic medical record, identified patients who were deemed eligible and, if they agreed, vaccinated them. patients who refused to be vaccinated were surveyed to ascertain their perception concerning immunization offered by a pharmacist in the ed. feasibility and safety data for vaccinating patients in the ed were recorded. results: patients were approached and enrolled into the study. of these, % agreed to receive the influenza vaccine from a pharmacist in the ed. the median screening time was minutes and the median vaccination time was minutes, for a total of minutes from screening to vaccination. % were willing to receive the influenza vaccine from a pharmacist, and % were willing to receive the vaccine in the ed. the main reason given for refusing the influenza vaccine was ''patient does not feel at risk of getting the disease''; only . % stated they were vaccinated recently. conclusion: a pharmacist-based influenza immunization program is feasible in the ed and has the potential to increase the percentage of adult patients receiving the vaccine.

( . ± . , p < . ). ed visits by hiv-infected patients also had longer ed lengths of stay ( ± . minutes vs. . ± . minutes, p < . ) and were more likely to result in admission ( % vs. %, p < . ) than visits by their non-hiv-infected counterparts. conclusion: although ed visits by hiv-infected individuals in the u.s. are relatively infrequent, they occur at rates higher than the general population and consume significantly more ed resources than the general population.

background: the influence of wound age on the risk of infection in simple lacerations repaired in the emergency department (ed) has not been well studied.
it has traditionally been taught that there is a ''golden period'' beyond which lacerations are at higher risk of infection and therefore should not be closed primarily. the proposed cutoff for this golden period has been highly variable ( - hours in surgical textbooks). objectives: to answer the following research question: are wounds closed via primary repair after the golden period at increased risk for infection? methods: we searched medline, embase, and other databases as well as bibliographies of relevant articles. we included studies that enrolled ed patients with lacerations repaired by primary closure. exclusion: . intentional delayed primary repair or secondary closure, . wounds requiring intra-operative repair, skin graft, drains, or extensive debridement, and . grossly contaminated or infected at presentation. we compared the outcome of wound infection in two groups of early versus delayed presentations (based on the cut-offs selected by the original articles). we used ''grading of recommendations assessment, development and evaluation'' (grade) criteria to assess the quality of the included trials. frequencies are presented as percentages with % confidence intervals. relative risk (rr) of infection is reported when clinically significant. results: studies were identified. four trials enrolling patients in aggregate met our inclusion/exclusion criteria. two studies used a -hour cut-off and the other two used a -hour cut-off for defining delayed wounds. the overall quality of evidence was low. the infection rate in the wounds that presented with delay ranged from . % to %. one study with the smallest sample size (morgan et al), which only enrolled lacerations to the hand and forearm, showed higher rates of infection in patients with delayed wounds (table). the infection rates in delayed wound groups in the remaining three studies were not significantly different from the early wounds. conclusion: the evidence does not support the existence of a golden period, nor does it support the role of wound age on infection rate in simple lacerations. background: although clinical studies in children have shown that temperature elevation is an independent and significant predictor of bacteremia in children, the relationship in adults is largely unknown or equivocal. objectives: review the incidence of positive blood cultures on critically ill adult septic patients presenting to an emergency department (ed) and determine the association of initial temperature with bacteremia. methods: july to july retrospective chart review on all patients admitted from the ed to an urban community hospital with sepsis and subsequently expiring within days of admission. fever was defined as a temperature ‡ °c. sirs criteria were defined as: ) temperature ‡ °c or £ °c, ) heart rate ‡ beats/ minute, ) respiratory rate ‡ or mechanical ventilation, ) wbc ‡ , /mm or < , or bands ‡ %. objectives: we examined the utility of limited genetic sequencing of bacterial isolates using multilocus sequence typing (mlst) to discriminate between known pathogenic blood culture isolates of s. epidermidis and isolates recovered from skin. methods: ten blood culture isolates from patients meeting the centers for disease control and prevention (cdc) criteria for clinically significant s. epidermidis bacteremia and ten isolates from the skin of healthy volunteers were studied. mlst was performed by sequencing bp regions of seven genes (arc, aroe, gtr, muts, pyr, tpia, and yqil) . 
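as described in the results that follow, strain classification from these seven loci was performed with the support vector machine function in r, with bootstrap resampling to bound the accuracy estimate. a rough python analogue of that workflow, on synthetic allele profiles, might look like the sketch below; the data, the one-hot encoding, and the linear kernel are assumptions for illustration only.

```python
# sketch: classifying isolates as pathogenic vs. commensal from mlst allele
# profiles with an svm, bootstrapping the accuracy. data are synthetic.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(2)
n = 20                                      # isolates
alleles = rng.integers(1, 4, size=(n, 7))   # 7 loci, small allele numbers
label = (alleles[:, 3] > 1).astype(int)     # pretend one locus separates the groups

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(alleles)

accs = []
for _ in range(1000):
    idx = rng.integers(0, n, n)               # bootstrap resample of isolates
    oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag isolates for testing
    if oob.size == 0 or len(set(label[idx])) < 2:
        continue
    clf = SVC(kernel="linear").fit(X[idx], label[idx])
    accs.append(clf.score(X[oob], label[oob]))

print(f"median accuracy {np.median(accs):.2f}, "
      f"iqr {np.percentile(accs, 25):.2f}-{np.percentile(accs, 75):.2f}")
```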
genetic variability at these sites was compared to an international database (www.sepidermidis.mlst.net) and each strain was then categorized into a genotype on the basis of known genetic variation. the ability of the gene sequences to correctly classify strains was quantified using the support vector machine function in the statistical package r; , bootstrap resamples were performed to generate confidence bounds around the accuracy estimates. results: between-strain variability was considerable, with yqil being most variable ( alleles) and tpia being least ( allele). the muts gene, responsible for dna repair in s. epidermidis, showed almost complete separation between pathogenic and commensal strains. when the seven genes were used in a joint model, they correctly predicted bacterial strain type with % accuracy (iqr , %). conclusion: multilocus sequence typing shows excellent early promise as a means of distinguishing contaminant versus truly pathogenic isolates of s. epidermidis from clinical samples. near-term future goals will involve developing more rapid means of sequencing and enrolling a larger cohort to verify assay performance.

background: antiviral medications are recommended for patients with influenza who are hospitalized or at high risk for complications. however, timely diagnosis of influenza in the ed remains challenging. influenza rapid antigen tests have short turn-around times, making them potentially useful in the ed setting, but their sensitivities may be too low to assist with treatment decisions. objectives: to evaluate the test characteristics of the binaxnow influenza a&b rapid antigen test (rat) in ed patients. methods: we prospectively enrolled a systematic sample of patients of all ages presenting to two eds with acute respiratory symptoms or fever during three consecutive influenza seasons. research personnel collected nasal and throat swabs, which were combined and tested for influenza with rt-pcr using cdc-provided primers and probes. ed clinicians independently decided whether to obtain a rat during clinical care. rats were performed in the clinical laboratory using the binaxnow influenza a&b test on nasal swabs collected by ed staff. the study cohort included subjects who underwent both a research pcr and a clinical rat. rat test characteristics were evaluated using pcr as the criterion standard, with stratified sub-analyses for age group and influenza subtype (pandemic h n (ph n ), non-pandemic influenza a, influenza b). results: subjects were enrolled; subjects were pcr-positive for influenza ( ph n , non-pandemic influenza a, and influenza b). for all age groups, rat sensitivities were low and specificities were high.

hiv infection with cd < ; and, among nursing home residents, inability to independently perform activities of daily living. sources for bacterial cultures included blood, sputum (adults only), bronchoalveolar lavage (bal), tracheal aspirate, and pleural fluid. only sputum specimens with a bartlett score ≥ + were considered adequate for culturing. results: among children enrolled, ( %) had s. aureus cultured from ≥ specimen, including with methicillin-resistant s. aureus (mrsa) and with methicillin-susceptible s. aureus (mssa). specimens positive for s. aureus included pleural fluid, blood, tracheal aspirates, and bal. two children with s. aureus had evidence of co-infection: influenza a, and streptococcus pneumoniae. among adults enrolled, ( %) grew s.
aureus from ‡ specimen, including with mrsa and with mssa. specimens positive for s. aureus included blood, sputum, and bal. five adults with s. aureus had evidence of co-infections: coronavirus, respiratory syncytial virus, s. pneumoniae, and pseudomonas aeruginosa. presenting clinical characteristics and outcomes of subjects with staphylococcal cap are summarized in tables - . conclusion: these preliminary findings suggest s. aureus is an uncommon cause of cap. although the small number of staphylococcal cases limits conclusions that can be drawn, in our analysis staphylococcal cap appears to be associated with co-infections, pleural effusions, and severe disease. future work will focus on continued enrollment and developing clinical prediction models to aid in diagnosing staphylococcal cap in the ed. background: emergency care has been a neglected public health challenge in sub-saharan africa. the goal of global emergency care collaborative (gecc) is to develop a sustainable model for emergency care delivery in low-resource settings. gecc is developing a training program for emergency care practitioners (ecps). objectives: to analyze the first patient visits at karoli lwanga ''nyakibale'' hospital ed in rural uganda to determine the knowledge and skills needed in training ecps. methods: a descriptive cross-sectional analysis of the first consecutive patient visits in the ed's patient care log was reviewed by an unblinded abstractor. data on demographics, procedures, laboratory testing, bedside ultrasounds (us) performed, radiographs (xrs) ordered, and diagnoses were collated. all authors discussed uncertainties and formed a consensus. descriptive statistics were performed. results: of the first patient visits, procedures were performed in ( . %) patients, including ( . %) who had ivs placed, ( . %) who received wound care, and ( . %) who received sutures. complex procedures, such as procedural sedations, lumbar punctures, orthopedic reductions, nerve blocks, and tube thoracostomies, occurred in ( . %) patients. laboratory testing, xrs, and uss were performed in ,( . %), ( . %), and ( %) patients, respectively. infectious diseases were diagnosed in ( . %) patients; ( . %) with malaria and ( . %) with pneumonia. traumatic injuries were present in ( %) patients; ( . %) needing wound care and ( . %) with fractures. gastrointestinal and neurological diagnoses affected ( . %) and ( . %) patients, respectively. conclusion: ecps providing emergency care in sub-saharan africa will be required to treat a wide variety of patient complaints and effectively use laboratory testing, xrs, and uss. this demands training in a broad range of clinical, diagnostic, and procedural skills, specifically in infectious disease and trauma, the two most prevalent conditions seen in this rural sub-saharan africa ed. assessment of point-of-care ultrasound in tanzania background: current chinese ems is faced with many challenges due to a lack of systematic planning, national standards in training, and standardized protocols for prehospital patient evaluation and management. objectives: to estimate the frequency with which prehospital care providers perform critical actions for selected chief complaints in a county-level ems system in hunan province, china. methods: in collaboration with xiangya hospital (xyh), central south university in hunan, china, we collected data pertaining to prehospital evaluation of patients on ems dispatches from a '' - - '' call center over a -month period. 
this call center serves an area of just under km with a total population of . million. each ems team consists of a driver, a nurse, and a physician. this was a cross-sectional study in which a single trained observer accompanied ems teams on transports of patients with a chief complaint of chest pain, dyspnea, trauma, or altered mental status. in this convenience sample, data were collected daily between am and pm. critical actions were pre-determined by a panel of emergency medicine faculty from xyh and the university of maryland school of medicine. simple statistical analysis was performed to determine the frequency of critical actions performed by ems providers. results: during the study period, patients were transported, of whom met the inclusion criteria. ( . %) evaluations were observed directly for critical actions. the table shows the frequency of critical actions performed by chief complaint. none of the patients with chest pain received an ecg even though the equipment was available. rapid glucose was checked in only . % of patients presenting with altered mental status. a lung exam was performed in . % of patients with dyspnea, and the respiratory rate was measured in . %. among patients transported for trauma, blood pressure and heart rate were measured in only % and . %, respectively. conclusion: in this observational study of prehospital patient assessments in a county-level ems system, critical actions were performed infrequently for the chief complaints of interest. performance frequencies for critical actions ranged from to . %, depending on the chief complaint. standardized prehospital patient care protocols should be established in china, and further training is needed to optimize patient assessment.

background: little is known about the comparative effectiveness of noninvasive ventilation (niv) versus invasive mechanical ventilation (imv) in chronic obstructive pulmonary disease (copd) patients with acute respiratory failure. objectives: to characterize the use of niv and imv in copd patients presenting to the emergency department (ed) with acute respiratory failure and to compare the effectiveness of niv vs. imv. methods: we analyzed the - nationwide emergency department sample (neds), the largest all-payer us ed and inpatient database. ed visits for copd with acute respiratory failure were identified with a combination of copd exacerbation and respiratory failure icd- -cm codes. patients were divided into three treatment groups: niv use, imv use, and combined use of niv and imv. the outcome measures were inpatient mortality, hospital length of stay (los), hospital charges, and complications. propensity score analysis was performed using patient and hospital characteristics and selected interaction terms. results: there were an estimated , visits annually for copd exacerbation and respiratory failure from approximately , eds. ninety-six percent were admitted to the hospital. of these, niv use increased slightly from % in to % in (p = . ), while imv use decreased from % in to % in (p < . ); combined use remained stable ( %). inpatient mortality decreased from % in to % in (p < . ). niv use varied widely between hospitals, ranging from % to % with a median of %. in a propensity score analysis, niv use (compared to imv) significantly reduced inpatient mortality (risk ratio . ; % confidence interval [ci] . - . ), shortened hospital los (difference − days; % ci − to − ), and reduced hospital charges.
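the neds analysis above compares niv with imv after propensity score adjustment. one common variant of that idea, inverse-probability weighting with a logistic propensity model, is sketched below on simulated data; this is not the authors' exact specification, and the covariates and effect sizes are invented.

```python
# sketch: inverse-probability-weighted comparison of mortality for niv vs. imv,
# with a logistic-regression propensity model. data are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5000
age = rng.normal(70, 10, n)
severity = rng.normal(0, 1, n)
# sicker patients more often receive imv (treatment = 1 means niv)
p_niv = 1 / (1 + np.exp(-(0.5 - 0.8 * severity)))
niv = rng.binomial(1, p_niv)
p_death = 1 / (1 + np.exp(-(-3 + 0.03 * (age - 70) + 1.2 * severity - 0.4 * niv)))
died = rng.binomial(1, p_death)

X = np.column_stack([age, severity])
ps = LogisticRegression(max_iter=1000).fit(X, niv).predict_proba(X)[:, 1]
w = np.where(niv == 1, 1 / ps, 1 / (1 - ps))          # ipw weights

risk_niv = np.average(died[niv == 1], weights=w[niv == 1])
risk_imv = np.average(died[niv == 0], weights=w[niv == 0])
print(f"weighted risk ratio (niv vs. imv): {risk_niv / risk_imv:.2f}")
```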
niv use was associated with a lower rate of iatrogenic pneumothorax compared with imv use ( . % vs. . %, p < . ). an instrumental variable analysis confirmed the benefits of niv use, with a % reduction in inpatient mortality in the niv-preferring hospitals. conclusion: niv use is increasing in us hospitals for copd with acute respiratory failure; however, its adoption remains low and varies widely between hospitals. niv appears to be more effective and safer than imv in the real-world setting.

background: dyspnea is a common ed complaint with a broad differential diagnosis and disease-specific treatment. bronchospasm alters capnographic waveforms, but the effect of other causes of dyspnea on waveform morphology is unclear. objectives: we evaluated the utility of capnographic waveforms in distinguishing dyspnea caused by reactive airway disease (rad) from non-rad dyspnea in adult ed patients. methods: this was a prospective, observational, pilot study of a convenience sample of adult patients presenting to the ed with dyspnea. waveforms, demographics, past medical history, and visit data were collected. waveforms were independently interpreted by two blinded reviewers. when the interpreters disagreed, the waveform was re-reviewed by both reviewers and an agreement was reached. the treating physician's diagnosis was considered the criterion standard. descriptive statistics were used to characterize the study population. diagnostic test characteristics and inter-rater reliability are given. results: fifty subjects were enrolled. median age was years (range - ), % were female, and % were caucasian. / ( %) had a history of asthma or chronic obstructive pulmonary disease. rad was diagnosed by the treating physician in / ( %), and / ( %) had received treatment for dyspnea prior to waveform acquisition. the interpreters agreed on waveform analysis in / ( %) cases (kappa = . ). test characteristics for the presence of acute rad, including % ci, were: overall accuracy % ( . %- . %), sensitivity % ( . %- . %), specificity % ( . %- . %), positive predictive value % ( . %- . %), negative predictive value % ( . %- . %), positive likelihood ratio . ( . - . ), negative likelihood ratio . ( . - . ). conclusion: inter-rater agreement is high for capnographic waveform interpretation, which shows promise for helping to distinguish between dyspnea caused by rad and dyspnea from other causes in the ed. treatments received prior to waveform acquisition may affect agreement between waveform interpretation and physician diagnosis, affecting the observed test characteristics.

background: asthma and chronic obstructive pulmonary disease (copd) patients who present to the emergency department (ed) usually lack adequate ambulatory disease control. while evidence-based care in the ed is now well defined, there is limited information regarding the pharmacologic or non-pharmacologic needs of these patients at discharge. objectives: this study evaluated patients' needs with regard to the ambulatory management of their respiratory conditions after ed treatment and discharge. methods: over months, adult patients with acute asthma or copd, presenting to a tertiary care alberta hospital ed and discharged after being treated for exacerbations, were enrolled. using results from standardized in-person questionnaires, charts were reviewed by respiratory researchers to identify care gaps. results: overall, asthmatic and copd patients were enrolled. more patients with asthma required education on spacer devices ( % vs %).
few asthma ( %) and no copd patients had written action plans; asthma patients were more likely to need adherence counseling ( % vs %) for preventer medications. more patients with asthma required influenza vaccination ( % vs %; p = . ); pneumococcal immunization was low ( %) in copd patients. only % of asthmatics reported ever being referred to an asthma education program and % of the copd patients reported ever being referred to pulmonary rehabilitation. at ed presentation, % of the asthmatics required the addition of inhaled corticosteroids (ics) and % required the addition of ics/long acting beta-agonist (ics/laba) combination agents. on the other hand, % of copd patients required the addition of long-acting anticholinergics while most ( %) were receiving preventer medications. finally, % of copd and % of asthma patients who smoked required smoking cessation counseling. conclusion: overall, we identified various care gaps for patients presenting to the ed with asthma and copd. there is an urgent need for high-quality research on interventions to reduce these gaps. methods: this is an interim, sub-analysis of an interventional, double-blinded study performed in an academic urban-based adult ed. subjects with acute exacerbation of asthma with fev < % predicted within minutes following initiation of ''standard care'' (including a minimum of mg nebulized albuterol, . mg nebulized ipratropium, and mg corticosteroid) who consented to be in a trial were included. all treatment was administered by emergency physicians unaware of the study objectives. patients were randomly assigned to treatment with placebo or an intravenous beta agonist. all subjects had fev and ds obtained at baseline, , , and hours after treatment. fev was measured using a bedside nspire spirometer, and ds was calculated using a modified borg dyspnea score. results: thirty-eight patients were included for analysis. spearman's rho test (rho) was used to measure correlations between fev and ds at , , and hours post study entry and subsequent hospitalization. rho is negative for fev (higher fev correlates to lower rate of hospitalization) and positive for ds (higher ds correlates to higher rate of hospitalization). at each time point, ds were more highly correlated to hospitalization than were fev (see table) . conclusion: dyspnea score at , , and hours were significantly correlated with hospital admission, whereas fev was not. in this set of subjects with moderate to severe asthma exacerbations, a standardized subjective tool was superior to fev for predicting subsequent hospitalization. methods: this is an interim, subgroup analysis of a prospective, interventional, double-blind study performed in an academic urban ed. subjects who were consented for this trial presented with acute asthma exacerbations with fev £ % predicted within minutes following initiation of ''standard care'' (includes a minimum of . mg nebulized albuterol, . mg nebulized ipratropium, and mg of a corticosteroid). ed physicians who were unaware of the study objectives administered all treatments. subjects were randomized in a : ratio to either placebo or investigational intravenous beta agonist arms. blood was obtained at and . hours after the start of the hour long infusion. blood was centrifuged and serum stored at ) °c, and then shipped on dry ice for albuterol and lactate measurements at a central lab. 
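the asthma sub-study above correlates fev and dyspnea scores with subsequent hospitalization using spearman's rho. a short sketch of that calculation on simulated values follows; the sample size, score ranges, and relationships are assumptions for illustration.

```python
# sketch: spearman correlation of fev1 and a dyspnea score with admission.
# values are simulated for illustration only.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
n = 38
ds = rng.integers(0, 11, n)                                  # modified borg score 0-10
fev1 = np.clip(rng.normal(55 - 2 * ds, 10, n), 15, 100)      # % predicted
admitted = rng.binomial(1, np.clip(0.05 + 0.06 * ds, 0, 1))  # admission indicator

for name, x in [("dyspnea score", ds), ("fev1", fev1)]:
    rho, p = spearmanr(x, admitted)
    print(f"{name}: rho = {rho:+.2f}, p = {p:.3f}")
```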
the treatment lactate and d lactate were correlated with hr serum albuterol concentrations and hospital admission using partial pearson correlations to adjust for ds. results: subjects were enrolled to date, with complete data. the mean baseline serum lactate level was . mg/dl (sd ± . ). this increased to . mg/dl (sd ± . ) at . hrs. the mean hr ds was . (sd ± . ). the correlations between treatment lactate, d lactate, hr serum albuterol concentrations (r, s, and total) and admission to hospital are shown (see table). both treatment and d lactate were highly correlated with total serum albuterol, r albuterol, and s albuterol. there was no correlation between treatment lactate or d lactate and hospital admission. conclusion: lactate and d lactate concentrations correlate with albuterol concentrations in patients presenting with acute asthma.

fifty-one percent were < years old and % were female. we found a decline of % ( % ci: %- %, p < . ; r = . , p < . ) in the overall yearly asthma visits to total ed visits from to . when we analyzed sex and age groups separately, we found no statistically significant changes for females or for males < years old (r ≤ . , p ≥ . ). for females and males > years old, yearly asthma visits to total ed visits from to decreased % ( % ci: %- %, p < . ; r = . , p < . ) and % ( % ci: %- %, p < . ; r = . , p < . ), respectively. conclusion: we found an overall decrease in yearly asthma visits to total ed visits from to . we speculate that this decrease is due to greater corticosteroid use despite the increasing prevalence of asthma. it is unclear why this decrease was seen in adults and not in children, and why it was greater for adult females than males.

objectives: our objectives were to describe the use of a unique data collection system that leveraged emr technology and to compare its data entry error rate to traditional paper data collection. methods: this is a retrospective review of data collection methods during the first months of a multicenter study of ed, anticoagulated, head injury patients. on-shift ed physicians at five centers enrolled eligible patients and prospectively completed a data form. enrolling ed physicians had the option of completing a one-page paper data form or an electronic ''dotphrase'' (dp) data form. our hospital system uses an epic®-based emr. a feature of this system is the ability to use dps to assist in medical information entry. a dp is a preset template that may be inserted into the emr when the physician types a period followed by a code phrase (in this case ''.ichstudy''). once the study dp was inserted at the bottom of the electronic ed note, it prompted enrolling physicians to answer study questions. investigators then extracted data directly from the emr. our primary outcomes of interest were the prevalence of dp data form use and rates of data entry errors. results: from / through / , patients were enrolled. dp data forms were used in ( . %; % ci . , . %) cases and paper data forms in ( . %; % ci . , . %). the prevalence of dp data form use at the respective study centers was %, %, %, %, and %. sixty-six ( . %; % ci . , . %) of physicians enrolling patients used dp data entry at least once. using multivariate analysis, we found no significant association between physician age, sex, or tenure and dp use. data entry errors were more likely on paper forms ( / , . %; % ci . , . %) than dp data forms ( / , . %; % ci . , . %); difference in error rates . % ( % ci . , . %, p < . ).
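the error-rate comparison just reported is a difference between two proportions with a confidence interval and a p-value. a minimal sketch of one way to reproduce that kind of comparison is shown below; the counts are hypothetical, and the chi-square test and wald interval are reasonable choices rather than the methods the authors necessarily used.

```python
# sketch: comparing data-entry error rates on paper vs. dotphrase forms with a
# chi-square test and a wald interval for the difference. counts are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency, norm

err_paper, n_paper = 14, 90
err_dp, n_dp = 4, 176

table = np.array([[err_paper, n_paper - err_paper],
                  [err_dp, n_dp - err_dp]])
chi2, p, _, _ = chi2_contingency(table)

p1, p2 = err_paper / n_paper, err_dp / n_dp
diff = p1 - p2
se = np.sqrt(p1 * (1 - p1) / n_paper + p2 * (1 - p2) / n_dp)
z = norm.ppf(0.975)
print(f"difference {diff:.3f} (95% ci {diff - z*se:.3f} to {diff + z*se:.3f}), p = {p:.3f}")
```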
conclusion: dp data collection is a feasible means of study data collection. dp data forms maintain all study data within the secure emr environment obviating the need to maintain and collect paper data forms. this innovation was embraced by many of our emergency physicians. we found lower data entry error rates with dp data forms compared to paper forms. background: inadequate randomization, allocation concealment, and blinding can inflate effect sizes in both human and animal studies. these methodological limitations might in part explain some of the discrepancy between promising results in animal models and non-significant results in human trials. whereas blinding is not always possible, in clinical or animal studies, true randomization with allocation concealment is always possible, and may be as important in minimizing bias. objectives: to determine the frequency with which published emergency medicine (em) animal research studies report randomization, specific randomization methods, allocation concealment, and blinding of interventions and measurements, and to estimate whether these have changed over time. methods: all em animal research publications from / through / in ann emerg med and acad emerg med were reviewed by two trained investigators for a statement regarding randomization, and specific descriptions of randomization methods, allocation concealment, blinding of intervention, and blinding of measurements, when possible. raw initial agreement was calculated and differences were settled by consensus. the first (period = - ) and second (period = - ) -year periods were compared with % confidence intervals. results: of em animal research studies, were appropriate for review because they involved intervention in at least two groups. blinding of interventions and measurements were not considered possible in % and %, respectively. significant differences between period and were absent, although there was a trend towards less blinding of interventions and more blinding of measurements. raw agreement was %. conclusion: although randomization is mentioned in the majority of studies, allocation concealment and blinding remain underutilized in em animal research. we did not compare outcomes between blinded and non-blinded, randomized and non-randomized studies, because of small sample size. this review fails to demonstrate significant improvement over time in these methodological limitations in em animal research publications. journals might consider requiring authors to explicitly describe their randomization, allocation, and blinding methods. background: cluster randomized trials (crts) are increasingly utilized to evaluate quality improvement interventions aimed at health care providers. in trials testing ed interventions, migration of eps between hospitals is an important concern, as contamination may affect both internal and external validity. objectives: we hypothesized geographically isolating emergency departments would prevent migratory contamination in a crt designed to increase ed delivery of tpa in stroke (the instinct trial). methods: instinct was a prospective, cluster-randomized, controlled trial. twenty-four michigan community hospitals were randomly selected in matched pairs for study. following selection of a single hospital, all hospitals within miles were excluded from the sample pool. individual emergency physicians staffing each site were identified at baseline ( ) and months later. 
contamination was defined at the cluster level, with substantial contamination defined a priori as > % of eps affected. non-adherence, total crossover (contamination + non-adherence), migration distance, and characteristics were determined. results: emergency physicians were identified at all sites. overall, ( . %) changed study sites. one moved between control sites, leaving ( . %) total crossovers. of these, ( . %) moved from intervention to control (contamination) and ( . %) moved from control to intervention (non-adherence). contamination was observed in of sites, with % and % contamination of the total site ep workforce at follow-up, respectively. two of the crossovers occurred between hospitals within the same health system. the average migration distance was miles for all eps in the study and miles for eps moving from intervention to control sites. conclusion: the mobile nature of emergency physicians should be considered in the design of quality improvement crts. use of a -mile exclusion zone in hospital selection for this crt was associated with very low levels of substantial cluster contamination ( of ) and total crossover. assignment of hospitals from a single health system to a single study group and/or an exclusion zone of miles would have further reduced crossovers. increased reporting of contamination in cluster randomized controlled trials is encouraged to clarify thresholds and facilitate crt design.

objectives: an extension of the lr, the average absolute likelihood ratio (aalr), was developed to assess the average change in the odds of disease that can be expected from a test, or series of tests, and an example of its use to diagnose wide qrs complex tachycardia (wct) is provided. methods: results from two retrospective multicenter case series were used to assess the utility of qrs duration and axis to assess for ventricular tachycardia (vt) in patients with undifferentiated regular sustained wct. serial patients with heart rate (hr) > beats per minute and qrs duration > milliseconds (msec) were included. the final tachydysrhythmia diagnosis was determined by a number of methods independent of the ecg. the aalr is defined as $\mathrm{aalr} = \frac{1}{n_{\text{total}}}\left[\sum_i n_i \cdot \mathrm{lr}_i \;(\text{for } \mathrm{lr} > 1) + \sum_k n_k / \mathrm{lr}_k \;(\text{for } \mathrm{lr} < 1)\right]$, where $\mathrm{lr}_i$ and $\mathrm{lr}_k$ are the interval lrs, and $n_i$ and $n_k$ are the numbers of patients with test results within the corresponding intervals. roc curves were constructed, and interval lrs and aalrs were calculated for the qrs duration and axis tests individually and when applied together. confidence intervals were bootstrapped with , replications using the r boot package. results: patients were included: with supraventricular tachycardia (svt) and with vt. optimal qrs intervals (msec) for distinguishing vt from svt were: qrs ≤ , < qrs < , and qrs ≥ . qrs axis results were dichotomized to upward right axis ( - degrees) or not (− to degrees). results are listed in the table. conclusion: application of the qrs interval and axis tests together for patients with wide qrs complex tachycardia changes the odds of ventricular tachycardia, on average, by a factor of . ( % ci . to . ), and this is mildly improved over the qrs duration test alone. both a strength and a weakness of the aalr is its dependence on the pretest probability of disease. the aalr may be helpful for clinicians and researchers to evaluate and compare diagnostic testing approaches, particularly when strategies with serial non-independent tests are considered.
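the aalr definition above translates directly into code. a minimal implementation is sketched below; the interval likelihood ratios and patient counts in the example call are placeholders, not the study's values.

```python
# sketch: average absolute likelihood ratio (aalr) as defined above.
# interval likelihood ratios and patient counts are placeholders.
def aalr(interval_lrs, counts):
    """aalr = (1/n_total) * [ sum of n_i*lr_i for lr > 1  +  sum of n_k/lr_k for lr < 1 ]."""
    n_total = sum(counts)
    total = 0.0
    for lr, n in zip(interval_lrs, counts):
        total += n * lr if lr >= 1 else n / lr
    return total / n_total

# e.g. three qrs-duration intervals with hypothetical interval lrs and counts
print(aalr(interval_lrs=[0.2, 1.0, 4.0], counts=[60, 80, 120]))
```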
consultation for adults with metastatic solid tumors at an urban, academic ed located within a tertiary care referral center. field notes were grouped into barrier categories and then quantified when possible. patient demographics for those who did and did not enroll were extracted from the medical record and quantified. patients who did not meet inclusion criteria for the study (e.g., cognitive impairment) were excluded from the analysis. results: attempts were made to enroll eligible patients in the study, and were successfully enrolled ( % enrollment rate). barriers to enrollment were deduced from the field notes and placed into the following categories, from most to least common: patient refusal ( ); diagnostic uncertainty regarding cancer stage ( ); severity of symptoms precluding participation ( ); patient unaware of illness or stage ( ); and family refusal ( ). conclusion: patients, families, and diagnostic uncertainty are barriers to enrolling ed patients with advanced illness in clinical trials. it is unclear whether these barriers are generalizable to other study sites and disease processes other than cancer.

objectives: the purpose of this study was to evaluate the use of a high-fidelity mannequin bedside simulation scenario followed by a debriefing session as a tool to improve medical student knowledge of palliative care techniques. methods: third year medical students participating in a -week simulation curriculum during a surgery/emergency medicine/anesthesia clerkship were eligible for the study. all students were administered a pre-test to evaluate their baseline knowledge of palliative care and randomized to a control or intervention group. during week or , students in the intervention group participated in and observed two end-of-life scenarios. following the scenarios, a faculty debriefer trained in palliative care addressed critical actions in each scenario. during week , all students received a post-test to evaluate for improvement in knowledge. the pre-test and post-test consisted of questions addressing prognostication, symptom control, and the medicare hospice benefit. students were de-identified and pre- and post-tests were graded by a blinded scorer. results: from jan-dec , students were included in the study and were excluded due to incomplete data. the mean score on the pre-test for the intervention group was . , and for the control group was . (p = . ).

the results indicate that educators identify the most important scenarios as protocol-based simulations. respondents also suggested that scenarios of very common emergency department presentations bear a great deal of importance. emergency medicine educators assign priority to simulations involving professionalism and communication. finally, many respondents noted that they use simulation to teach the presentation and management of rare or less frequent, but important, disease processes. the identification of these scenarios would suggest that educators find simulation useful for filling in ''gaps'' in resident education.

background: prescription drug misuse is a growing problem among adolescent and young adult populations. objectives: to determine factors associated with past-year prescription drug misuse, defined as using prescription sedatives, stimulants, or opioids to get high, taking them when they were prescribed to someone else, or taking more than was prescribed, among patients seeking care in an academic ed.
methods: adolescents and young adults ( - ) presenting for ed care at a large academic teaching hospital were approached to complete a computerized screening questionnaire regarding demographics, prescription drug misuse, illicit drug use, alcohol use, and violence in the past months. logistic regression was used to predict past-year prescription drug misuse. results: over the study time period, there were participants ( % response rate), of whom ( . %) endorsed past-year prescription drug misuse. specifically, rates of past-year misuse were . % for opioids, . % for sedatives, and . % for stimulants. significant overlap exists among classes, with over % misusing more than one class of medications. in the multivariate analysis, significant predictors of past-year prescription drug misuse included female gender (or ). conclusion: approximately one in seven adolescents or young adults seeking ed care have misused prescription drugs in the past year. while opioids are the most commonly misused drug, significant overlap exists among this population. given the correlation of prescription drug misuse with the use and misuse of other substances (i.e., alcohol, cough medicine, marijuana), more research is needed to further understand these relationships and inform interventions. additionally, future research should focus on understanding the differences in demographics and risk factors associated with misuse of each separate class of prescription drugs.

objectives: this study aims to examine the association of depression with high ed utilization in patients with non-specific abdominal pain. methods: this single-center, prospective, cross-sectional study was conducted in an urban academic ed located in washington, dc, as part of a larger study to evaluate the interaction between depression and frequency of ed visits and chronic pain. as part of this study, we screened patients using the phq- , a nine-item questionnaire that is a validated, reliable predictor of major depressive disorder. we analyzed the subset of respondents with a non-specific abdominal pain diagnosis (icd- code of .xx). our principal outcome of interest was the rate of a positive depression screen in patients with non-specific abdominal pain. we analyzed the prevalence of a positive depression screen among this group and also conducted a chi-square analysis to compare high ed use among abdominal pain patients with a positive depression screen versus those without a positive depression screen. we defined high ed utilization as > visits in the -day period prior to the enrollment visit.

background: numerous studies have found high rates of co-morbid mental illness and chronic pain in emergent care settings. one psychiatric diagnosis frequently associated with chronic pain is major depressive disorder (mdd). objectives: we conducted a study to characterize the relationship between mdd and chronic pain in the emergency department (ed) population. we hypothesized that patients who present to the ed with self-reported chronic pain will have higher rates of mdd. methods: this was a single-center, prospective, cross-sectional study. we used a convenience sample of non-critically ill, english-speaking adult patients presenting with non-psychiatric complaints to an urban academic ed over months in . we oversampled patients presenting with pain-related complaints (musculoskeletal pain or headache).
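the misuse analysis above uses logistic regression to identify predictors of past-year prescription drug misuse and reports odds ratios. a compact sketch of that kind of model on simulated data is shown below; the predictors, coefficients, and sample size are invented for illustration.

```python
# sketch: logistic regression for past-year prescription drug misuse.
# predictors and data are simulated; odds ratios are exponentiated coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 2000
female = rng.integers(0, 2, n)
alcohol_misuse = rng.integers(0, 2, n)
logit = -2.2 + 0.4 * female + 1.0 * alcohol_misuse
misuse = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([female, alcohol_misuse]))
fit = sm.Logit(misuse, X).fit(disp=False)
odds_ratios = np.exp(fit.params[1:])
ci = np.exp(fit.conf_int()[1:])
for name, oratio, (lo, hi) in zip(["female", "alcohol misuse"], odds_ratios, ci):
    print(f"{name}: or {oratio:.2f} (95% ci {lo:.2f}-{hi:.2f})")
```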
subjects were surveyed about their demographic and other health and health care characteristics and were screened with the phq , a nine-item questionnaire that is a validated, reliable predictor of mdd. we conducted bivariate (chi-square) and multivariate analysis controlling for demographic characteristics (race, income, sex, age) using stata v. . . our principal dependent variable of interest was a positive depression screen (phq score ‡ ). our principal independent variable of interest was the presence of self-reported chronic pain (greater than months). results: of patients enrolled, did not meet all inclusion criteria. had two or more assessments for comparison. their average age was (range - ), % were male, and % were in police custody. % used methadone alone; % heroin alone; % oxycodone alone; and the rest used multiple opioids. the average dose of im methadone was . mg (range - mg); all but patients received mg. the mean cows score before receiving im methadone was . (range - ), compared to . (range - ) minutes after methadone (p < . ; mean difference = ) . ; % ci = ) . to ) . ). the mean wss before and after methadone was ) . (range ) to ) ) and ) . (range ) to ), respectively (p < . ; % ci = ) . to ) . ). the mean physician-assessed wss was significantly lower than the patient's own assessment by . (p < . ). adverse events included an asthmatic patient with bronchospasm whose oxygen saturation decreased from % to % after receiving methadone, a patient whose oxygen saturation decreased from % to %, and two patients whose amss decreased from ) to ) (indicating moderate sedation). background: as the us population ages, the coexistence of copd and acute coronary syndrome (acs) is expected to be more frequent. very few studies have examined the effect of copd on outcomes in acs patients, and, to our knowledge, there has been no report on biomarkers that possibly mediate between copd and long-term acs patient outcomes. objectives: to determine the effect of copd on longterm outcomes in patients presenting to the emergency department (ed) with acs and to identify prognostic inflammatory biomarkers. methods: we performed a prospective cohort study enrolling acs patients from a single large tertiary center. hospitalized patients aged years or older with acs were interviewed and their blood samples were obtained. seven inflammatory biomarkers were measured, including interleukin- (il- ), c-reactive protein (crp), tumor necrosis factor-alpha (tnf-alpha), vascular cell adhesion molecule (vcam), e-selectin, lipoprotein-a (lp-a), and monocyte chemoattractant protein- (mcp- ). the diagnoses of acs and copd were verified by medical record review. annual telephone follow-up was conducted to assess health status and major adverse cardiovascular events (mace) outcomes, a composite endpoint including myocardial infarction, revascularization procedure, stroke, and death. background: aortic dissection (ad) is an uncommon life-threatening condition requiring prompt diagnosis and management. thirty-eight percent of cases are missed upon initial evaluation. the cornerstone of accurate diagnosis hinges on maintaining a high index of clinical suspicion for the various patterns of presentation. quality documentation that reflects consideration for ad in the history, exam, and radiographic interpretation is essential for both securing the diagnosis and for protecting the clinician in missed cases. 
objectives: we sought to evaluate the quality of documentation in patients presenting to the emergency department with subsequently diagnosed acute ad. methods: irb-approved, structured, retrospective review of consecutive patients with newly diagnosed non-traumatic ad from to . inclusion criteria: new ad diagnosis via ed. exclusion criteria: ad diagnosed at another facility; chronic, traumatic, or iatrogenic ad. trained/monitored abstractors used a standardized data tool to review ed and hospital medical records. descriptive statistics were calculated as appropriate. inter-rater reliability was measured. our primary performance measure was the prevalence of a composite of all three key historical elements ( . any back pain, . neurologic symptoms including syncope, and . sudden onset of pain.) in the attending emergency physician's documentation. secondary outcomes included documentation of: ad risk factors, pain quality, back pain at multiple locations, presence/absence of pulse symmetry, mediastinal widening on chest radiograph, and migratory nature of the pain. results: / met our inclusion/exclusion criteria. the mean age was . years; % were male, ( . %) were stanford a. ( %) presented with a chief complaint of chest pain. primary outcome measure: / ( . %; %ci = . , . ) documented the presence/ absence of all three key historical elements. [back pain = / ; . % ( . , . ); neuro symptoms = / ; % ( . , . ); sudden onset = / ; . % ( . , . ).] limitations: small number of confirmed ad cases. conclusion: in our cohort, emergency physician documentation of key historical, physical exam, and radiographic clues of ad is suboptimal. although our ed miss rate is lower than that which has been reported by previous authors, there is an opportunity to improve documentation of these pivotal elements at our institution. objectives: this study assessed the opinions of iem and gh fellowship program directors, in addition to recent and current fellows regarding streamlining the application process and timeline in an attempt to implement change and improve this process for program directors and fellows alike. methods: a total of current iem and gh fellowship programs were found through an internet search. an electronic survey was administered to current iem and gh fellowship directors, current fellows, and recent graduates of these programs. results: response rates were % (n = ) for program directors and % (n = ) for current and recent fellows. the great majority of current and recent fellows ( %) and program directors ( %) support transitioning to a common application service. similarly, % of current and recent fellows and % of program directors support instituting a uniform deadline date for applications. however, only % of recent/current fellows and % of program directors would support a formalized match process like nrmp. conclusion: the majority of fellows and program directors support streamlining the application for all iem and gh fellowship programs. this could improve the application process for both fellows and program directors, and ensure the best fit for the candidates and for the fellowship programs. in order to establish effective emergency care in rural sub-saharan africa, the unique practice demographics and patient dispositions must be understood. objectives: the objectives of this study are to determine the demographics of the first patients seen at nyakibale hospital's ed and assess the feasibility of treating patients in a rural district hospital ed in sub-saharan africa. 
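As a brief aside tied to the documentation review above: the primary outcome there is a simple binomial proportion with a confidence interval, which could be sketched as follows. The counts are placeholders, since the abstract's figures did not survive extraction, and the Wilson interval is an assumption on my part (the abstract does not state which interval was used).

# Hypothetical counts: n_documented charts out of n_charts documented all three key historical elements.
from statsmodels.stats.proportion import proportion_confint

n_charts = 120       # placeholder total of reviewed ED charts
n_documented = 15    # placeholder count meeting the composite outcome

prevalence = n_documented / n_charts
low, high = proportion_confint(n_documented, n_charts, alpha=0.05, method="wilson")
print(f"composite documentation: {prevalence:.1%} (95% CI {low:.1%} to {high:.1%})")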
methods: a descriptive cross-sectional analysis of the first consecutive patient visits in the ed's patient care log was reviewed by an unblinded abstractor. data collected included age, sex, condition upon discharge, and disposition. all authors discussed uncertainties and formed a consensus. descriptive statistics were performed. results: of the first patient visits, ( . %) occurred when the outpatient clinic was open. there were ( %) male visits. the average age was . years (sd ± . ). pediatric visits accounted for ( . %) patients, and ( . %) visits were for children under five years old. only one patient expired in the ed, and ( . %) were in good condition after treatment, as subjectively defined by the ed physicians. one person was transferred to another hospital. after treatment, ( %) patients were discharged home. of those admitted to an inpatient ward, ( . %) patients were admitted to medical wards, ( . %) to pediatrics, and ( %) to surgical. only six ( . %) patients went directly to the operating theatre. conclusion: this consecutive sample of patient visits from a novel rural district hospital ed in sub-saharan africa included a broad demographic range. after treatment, most patients were judged to be in ''good condition'', and over one third of patients could be discharged after ed management. this sample suggests that it is possible to treat patients in an ed in rural sub-saharan africa, even in cases where surgical backup and transfers to higher level of care are limited or unavailable. background: communication failures in clinical handoffs have been identified as a major preventable cause of patient harm. in italy, advanced prehospital care is provided predominantly by physicians who work on ambulances in teams with either nurses or basic rescuers. the hand-offs from prehospital physicians to hospital emergency physicians (eps) is especially susceptible to error with serious consequences. there are no studies in italy evaluating the communication at this transition in patient care. studying this, however, requires a tool that measures the quality of this communication. objectives: the purpose of this study is to develop and validate a tool for the evaluation of communication during the clinical handoff from prehospital to emergency physicians in critically ill patients. methods: several previously validated tools for evaluating communication in hand-offs were identified through a literature search. these were reviewed by a focus group consisting of eps, nurses, and rescuers, who then adapted and translated the australian isbar (identification, situation, background, assessment, recommendation), the tool most relevant to local practice. the italian isbar tool consists of the following elements: patient and provider identification; patient's chief complaint; patient's past medical history, medications, and allergies; prehospital clinical assessment (primary survey, illness severity, vital signs, diagnosis); treatment initiated and anticipated treatment plan. we conducted and video-taped the hand-offs of care from the prehospital physicians to the eps in pediatric critical care simulations. four physician raters were trained in the italian isbar tool and used it to independently assess communication in each simulation. to assess agreement we calculated the proportion of agreement among raters for each isbar question, fleiss' kappas for each simulation, as well as mean agreement and mean kappas with standard deviations. 
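A minimal sketch of the agreement statistics named in the methods above (per-item percent agreement and Fleiss' kappa across four raters). The rating matrix is fabricated for illustration; in the study it would come from the four physicians' independent Italian ISBAR assessments.

import numpy as np

# ratings[i, j] = category assigned by rater j to hand-off item i
# (here: 0 = element absent, 1 = element present). Fabricated example data.
ratings = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
])

def fleiss_kappa(ratings, n_categories=2):
    """Fleiss' kappa for a subjects-by-raters matrix of category labels."""
    n_subjects, n_raters = ratings.shape
    # counts[i, k] = number of raters assigning category k to subject i
    counts = np.stack([(ratings == k).sum(axis=1) for k in range(n_categories)], axis=1)
    p_k = counts.sum(axis=0) / (n_subjects * n_raters)   # overall category proportions
    P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), (p_k ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# proportion of items on which all four raters agreed
all_agree = (ratings == ratings[:, [0]]).all(axis=1).mean()
print(f"complete agreement on {all_agree:.0%} of items, Fleiss' kappa = {fleiss_kappa(ratings):.2f}")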
results: there was % agreement among the four physicians on % of the items. the mean level of agreement was % (sd . ). the overall mean kappa was . (sd . ). conclusion: the standardized tool resulted in good agreement by physician raters. this validated tool may be helpful in studying and improving hand-offs in the prehospital to emergency department setting. objectives: we hypothesized that residents who were provided with vps prior to hfs would perform more thoroughly and efficiently than residents who had not been exposed to the online simulation. methods: we randomized a group of residents from an academic, pgy - emergency medicine program to complete an online vps case, either prior to (vps group, n = residents) or after (n = ) their hfs case. the vps group had access to the online case (which reviewed asthma management) days prior to the hfs session. all residents individually participated in their regularly scheduled hfs and were blinded to the content of the case -a patient in moderate asthma exacerbation. the authors developed a dichotomous checklist consisting of items recorded as done/not done along with time completed. a two sample proportion test was used to evaluate differences in the individual items completed between groups. a wilcoxon rank sum test was used to determine the differences in overall and subcategory performance between the two groups. median time to completion was analyzed using the log-rank test. results: the vps group had better overall checklist performance than the control group (p-value . ). in addition, the vps group was more thorough in obtaining an hpi (p-value . ). specific actions (related to asthma management) were performed better by the vps group: inquiring about last/prior ed visits ( . ), total number of hospitalizations in the prior year ( . ), prior intubations ( . ), and obtaining peak flow measurements ( . ). overall there was no difference in time to event completion between the two groups. conclusion: we found that when hfs is primed with educational modalities such as vps there was an improvement in performance by trainees. however, the improved completeness of the vps group may have served as a barrier to efficiency, inhibiting our ability to identify a statistical significant efficiency overall. vps may aid in priming the learners and maximize the efficiency of training using high-fidelity simulations. training using an animal model helped develop residents' skills and confidence in performing ptv. retention was found to be good at months post-training. this study underscores the need for hands-on training in rare but critical procedures in emergency medicine. methods: in this cross-sectional study at an urban community hospital, residents in their second or third year of training from a -year em residency program performed us-guided catheterizations of the ij on a simulator manufactured by blue phantom. two board-certified em physicians observed for the completion of pre-defined procedural steps using a checklist and rated the residents' overall performance of the procedure. overall performance ratings were provided on a likert scale of to , with being poor and being excellent. residents were given credit for performing a procedural step if at least one rater marked its completion. agreement between raters was calculated using intraclass correlation coefficients for domain and summary scores. the same protocol was then repeated on an unembalmed cadaver using two different board-certified em physician raters. 
criterion validity of the residents' proficiency on the simulator was evaluated by comparing their median overall performance rating on the simulator to that on the cadaver and by comparing the proportion of residents completing each procedural step between modalities with descriptive statistics. results: em residents' overall performance rating on the simulator was . ( % ci: . to . ) and on the cadaver was . ( % ci: . to . ). the results for each procedural step are summarized in the attached figure. inter-rater agreement was high for assessments on both the simulator and cadaver with overall kappa scores of . and . respectively. background: the environment in the emergency department (ed) is chaotic. physicians must learn how to multi-task effectively and manage interruptions. noise becomes an inherent byproduct of this environment. previous studies in the surgical and anesthesiology literature examined the effect of noise levels and cognitive interruptions on resident performance during simulated procedures; however, the effect of noise distraction on resident performance during an ed procedure has not yet been studied. objectives: our aim was to prospectively determine the effects of various levels of noise distraction on the time to successful intubation of a high-fidelity simulator. methods: a total of emergency medicine, emergency medicine/internal medicine, and emergency medicine/family medicine residents were studied in a background noise environments of less than decibels (noise level ), - decibels (noise level ), and of greater than decibels (noise level ). noise levels were standardized by a dosimeter (ex tech instruments, heavy duty ). each resident was randomized to the order in which he or she was exposed to the various noise levels and had a total of minutes to complete each of the intubation attempts, which were performed in succession. time, in seconds, to successful intubation was measured in each of these scenarios with the start time defined as the time the resident picked up the storz c-mac video laryngoscope blade and the finish time defined as the time the tube passed through the vocal cords as visualized by an observer on the storz c-mac video screen. analytic methods included analysis of variance, student's t-test, and pearson's chi-square. results: no significant differences were found between time to intubation and noise level nor did the order of noise level exposure affect the time to intubation (see table) . there were no significant differences in success rate between the three noise levels (p = . ). a significant difference in time to intubation was found between the residents' second and third intubation attempts with decreased time to intubation for the third attempt (p = . ). conclusion: noise level did not have an effect on time to intubation or intubation success rate. time to intubation decreased between the second and third intubations regardless of noise level. background: growing use of the emergency department (ed) is cited as a cause of rising health care costs and a target of health care reform. eds provide approximately one quarter of all acute care outpatient visits in the us. eds are a diagnostic center and a portal for rapid inpatient admission. the changing role of eds in hospital admissions has not been described. objectives: to compare if admission through the ed has increased compared to direct hospital admission. we hypothesized that the use of the ed as the admitting portal increased for all frequently admitted conditions. 
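Returning briefly to the noise-distraction study above: its core comparison, time to intubation across three noise levels, amounts to a one-way analysis of variance plus a paired comparison of successive attempts. The timing data below are simulated placeholders, not the study's measurements.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated seconds-to-intubation for the three background-noise levels.
quiet = rng.normal(35, 8, 30)
moderate = rng.normal(36, 8, 30)
loud = rng.normal(37, 8, 30)

f_stat, p_value = stats.f_oneway(quiet, moderate, loud)
print(f"one-way ANOVA across noise levels: F = {f_stat:.2f}, p = {p_value:.3f}")

# The abstract also reports a within-resident improvement from the second to the
# third attempt; a paired t-test on simulated attempt times illustrates that comparison.
second = rng.normal(34, 7, 45)
third = second - rng.normal(3, 4, 45)
t_stat, p_paired = stats.ttest_rel(second, third)
print(f"second vs. third attempt: t = {t_stat:.2f}, p = {p_paired:.3f}")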
methods: we analyzed the nationwide inpatient sample (nis), the largest us all-payer inpatient care database, from - . nis contains data from approximately million hospital stays each year, and is weighted to produce national estimates. we used an interactive, webbased data tool (hcupnet) to query the nis. clinical classification software (ccs) was used to group discharge diagnoses into clinically meaningful categories. we calculated the number of annual admissions and proportion admitted from the ed for the most frequently admitted conditions. we excluded ccs codes that are rarely admitted through the ed (< %) as well as obstet- background: the optimal dose of opioids for patients in acute pain is not well defined, although . mg/kg of iv morphine is commonly recommended. patient-controlled analgesia (pca) provides an opportunity to assess the adequacy of this recommendation as use of the pca pump is a behavioral indication of insufficient analgesia. objectives: to assess the need for additional analgesia following a . mg/kg dose of iv morphine by measuring additional self-dosing via a pca pump. methods: a three-arm randomized controlled trial was performed in an urban ed with , annual adult visits. a convenience sample of ed patients ages to with abdominal pain of < days duration requiring iv opioids was enrolled between / and / . all patients received an initial dose of . mg/kg iv morphine. patients in the pca arms could request additional doses of mg or . mg iv morphine by pressing a button attached to the pump with a -minute lock-out period. for this analysis, data from both pca arms were combined. software on the pump recorded times when the patient pressed the button (activation) and when he/she received a dose of morphine (successful activation). results: patients were enrolled in the pca arms. median baseline nrs pain score was . mean amount of supplementary morphine self-administered over the hour study period subsequent to the loading dose was . mg and . mg for the and . mg pca groups respectively. patients activated the pump at least once ( %, % ci: to %). figure shows the frequency distribution of the number of times the pump was activated. of those who activated the pump, the median number of activations per person was (iqr: to ). there were activations of the pump. % of activations were successful (followed by administration of morphine), while % were unsuccessful as they occurred during the -minute lock-out periods. % of the activations occurred in the first minutes, % in the second minutes, % in the third minutes, and % in the last minutes after the initial loading dose. conclusion: almost all patients requested supplementary doses of pca morphine, half of whom activated the pump five times or more over a course of hours. this frequency of pca activations suggests that the commonly recommended dose of . mg/kg morphine may constitute initial oligoanalgesia in most patients. marie-pier desjardins, benoit bailey, fanny alie-cusson, serge gouin, jocelyn gravel chu sainte-justine, montreal, qc, canada background: administration of corticosteroid at triage has been suggested to decrease the time to corticosteroid administration in the ed. objectives: to compare the time between arrival and corticosteroid administration in patients treated with an asthma pathway (ap) or with standard management (sm) in a pediatric ed. 
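The pump behavior described in the PCA study above (every button press is logged, but presses during the lock-out window deliver no drug) can be made concrete with a small simulation. The lock-out length and press times below are hypothetical, since the abstract's figures were lost in extraction.

# Hypothetical PCA-pump lock-out logic: an activation succeeds only if at least
# `lockout` minutes have passed since the last successful (drug-delivering) activation.
def classify_activations(press_times_min, lockout=6.0):
    successful, unsuccessful = [], []
    last_dose = float("-inf")
    for t in sorted(press_times_min):
        if t - last_dose >= lockout:
            successful.append(t)
            last_dose = t
        else:
            unsuccessful.append(t)
    return successful, unsuccessful

presses = [2, 4, 9, 10, 11, 18, 40, 43, 55]  # minutes after the loading dose (made up)
ok, blocked = classify_activations(presses, lockout=6.0)
print(f"{len(ok)} successful and {len(blocked)} lock-out activations out of {len(presses)} presses")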
methods: chart review of children aged to years diagnosed with asthma, bronchospasm, or reactive airways disease seen in the ed of a tertiary care pediatric hospital. for a one year period, % of all visits were randomly selected for review. from these, we reviewed patients who were eligible to be treated with the ap (≥ months with previous history of asthma and no other pulmonary condition) and who had received at least one inhaled bronchodilator treatment. charts were evaluated by a data abstractor blinded to the study hypothesis using a standardized datasheet. various variables were evaluated such as age, respiratory rate and saturation at triage, type of physician who saw patient first, treatment prior to visit, in ed, and at discharge, time between arrival and corticosteroid administration, and length of stay (los).

background: return visits comprise . % of pediatric emergency department (ped) visits, at a cost of >$ million/year nationally. these visits are typically triaged with higher acuity and admission rates and raise concern for lapses in quality of care and patient education during the first visit. objectives: the aim of this qualitative study was to describe parents' reasons for return visits to the ped. methods: we prospectively recruited a convenience sample of parents of patients under the age of years who returned to the ped within hours of their previous visit. we excluded patients who were instructed to return, had previously left without being seen, arrived without a parent, were wards of the state, or did not speak english. after obtaining consent, the principal investigator (ce) conducted confidential, in-person, tape-recorded interviews with parents during ped return visits. parents answered open-ended questions and closed-ended questions using a five-point likert scale. responses to open-ended questions were analyzed using thematic analysis techniques. the scaled responses were grouped into three categories of agree, disagree, or neutral. results: from the closed-ended responses, % of parents agreed that their children were getting sicker, and % agreed that their children were not getting better. % agreed that they were unsure how to treat the illness; however, only % agreed they did not feel comfortable taking care of the illness. only % agreed that the medical condition and/or the instructions were not clearly explained in the first visit. some common themes from the open-ended questions included worsening or lack of improvement of symptoms. many parents reported having unanswered questions about the cause of the illness and hoped to find out the cause during the return visit. conclusion: most parents brought their children back to the ped because they believed the symptoms had worsened or were not improving. although a large proportion of parents believed that the medical condition was clearly explained at the first visit, many parents still had unanswered questions about the cause of their child's illness. while worsening symptoms seemed to drive most return visits, it is possible that some visits related to failure to improve might be prevented during the first ped visit through a more detailed discussion of disease prognosis and expected time to recover.

pediatric background: experience indicates that it is difficult to effectively quell many parents' anxiety toward pediatric fevers, making this a common emergency department (ed) complaint.
the question remains as to whether athome treatment has any effect on the course of emergency department treatment or length of stay in this population. objectives: to determine whether anti-pyretic treatment prior to arrival in the emergency department affects the evaluation or emergency department length of stay of febrile pediatric patients. methods: a convenience sample of children, ages - years, who presented to a tertiary care ed with chief complaint of fever were enrolled. parents were asked to participate in an eight-question survey. questions related to demographic information, pre-treatment of the fever, contact with primary care providers prior to ed arrival, and immunization status. upon admission or discharge, investigators recorded information regarding length of stay, laboratory tests and imaging ordered, and medications given. results: eighty-one patients were enrolled in the study. seventy-six percent of the patients were pre-treated with some form of anti-pyretic by the caregiver prior to ed arrival. there was no significant effect of pre-treatment on whether laboratory tests or medications were ordered in the ed or whether the patient was admitted or discharged. the length of ed stay was found to be significantly shorter among those who received anti-pyretics prior to arrival ( ± vs. ± minutes; p = . ). conclusion: among febrile children, those who receive anti-pyretics prior to their ed visit had statistically significant shorter length of stays. this also supports implementation of triage or nursing protocols to administer an anti-pyretic as soon as possible in the hope of decreasing ed throughput times. background: during the past two decades, the prevalence of overweight (bmi percentile > ) in children has more than doubled, reaching epidemic proportions both nationally and globally. the public health burden is enormous given the increased risk of adult obesity as well as the adverse consequences on cardiovascular, metabolic, and psychological health. despite the overwhelming prevalence, the effect of obesity on emergency care has received little attention. objectives: the goal of this study is to determine the relation of weight on reported emergency department visits in children from a nationally representative sample. methods: weight (as reported by parents) and height along with frequency of and reason for emergency department (ed) use in the last months were obtained from children aged - y (n = , ) in the cross-sectional, telephone-administered, national survey of children's health (nsch). bmi percentiles were calculated using sex-specific bmi for age growth charts from the cdc ( ). children were categorized as: underweight (bmi percentile£ ), normal weight (> to < ), at-risk for overweight ( to < ), and overweight ( ‡ ). prevalence of ed use was estimated and compared across bmi percentile categories using chisquare analysis and multivariable logistic regression. taylor-series expansion was used for variance estimation of the complex survey design. results: the prevalence of at least one ed use in the past months increased with increasing bmi percentiles (figure , p < . ). additionally, overweight children were more likely to have more than one visit. overweight children were also less likely to report an injury, poisoning, or accident as the reason for ed visit compared to other bmi categories ( , , , % in overweight, at-risk, normal, and underweight respectively, p < . ). 
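A sketch of the BMI-percentile grouping used in the survey analysis above, assuming the conventional CDC cut points (5th, 85th, and 95th percentiles); the abstract's own thresholds were lost in extraction, and the example values are fabricated.

import pandas as pd

# bmi_pct = sex- and age-specific BMI percentile from the CDC growth charts (fabricated values).
children = pd.DataFrame({"bmi_pct": [3, 40, 88, 97, 62, 95]})

# Boundary handling at the cut points is approximate in this sketch.
bins = [0, 5, 85, 95, 100]
labels = ["underweight", "normal weight", "at-risk for overweight", "overweight"]
children["weight_category"] = pd.cut(children.bmi_pct, bins=bins, labels=labels, right=False)
print(children)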
conclusion: as rates of childhood obesity continue to grow in the u.s., we can expect greater demands on the ed. this will likely translate into an increased emphasis on the care of chronic conditions rather than injuries and accidents in the pediatric ed setting. results: mean pediatric satisfaction score was . (sd . ) compared with . ( . ) for adult patients (p < . ); monthly sample sizes ranged from - and from - for the two populations, respectively. both populations showed an increase in satisfaction after opening of the ped-ed. for both populations there was no significant trend in patient satisfaction from the beginning of the study period to the opening of the ped-ed, but after the opening the models of the populations differed. the pediatric satisfaction model was an interrupted two-slope model, with an immediate jump of . points in november and an increase of . points per month thereafter. in contrast, adult satisfaction scores did not show a jump but increased linearly (two slope model) after / at a rate of . per month. prior to the opening of the ped-ed, mean monthly pediatric and adult satisfaction scores were . ( . ) and . ( . ), respectively (difference . % ci . - . , p = . ). after the opening the mean scores were . ( . ) and . ( . ), respectively (difference . , % ci . - . , p < . ). conclusion: opening of a dedicated ped-ed was associated with a significant increase in patient satisfaction scores both for children and adults. patient satisfaction for children, as compared to adults, was higher before and after opening a ped-ed. the background: there are racial disparities in outcomes among injured children. in particular, black race appears to be an independent predictor of mortality. objectives: to evaluate disparities among ed visits for unintentional injuries among children ages - . methods: five years of data ( ) ( ) ( ) ( ) ( ) from the national hospital ambulatory cares survey were combined. inclusion criteria were defined as unintentional injury visits (e-code . to . or . to . ) and age - years. visit rates per population (defined by the us census) were calculated by race and age group. weighted multivariate logistic regression analysis was performed to describe associations between race and specific outcome variables and related covariates. primary statistical analyses were performed using sas version . . . results: , , of , , weighted ed visits met our inclusion criteria ( . %). per persons, black children had . times as many ed visits for unintentional injuries as whites (table) . there were no racial differences in the sex ratio ( . boy visits: girl), proportion of visits by age, ed disposition, immediacy with which they needed to be seen, whether or not they were evaluated by an attending physician, metropolitan vs. rural hospital, admission length of stay, mode of transportation for ed arrival, number of procedures, diagnostic services, or ed medications. background: sudden cardiac arrests in schools are infrequent, but emotionally charged events. little data exist that describes aed use in these events. objectives: the purpose of our study was to ) describe characteristics and outcomes of school cardiac arrests (ca), and ) assess the feasibility of conducting bystander interviews to describe the events surrounding school ca. methods: we performed a telephone survey of bystanders to ca occurring in k- schools in communities participating in the cardiac arrest registry to enhance survival (cares) database. 
the study period was from / - / and continued in one community through . utstein style data and outcomes were collected from the cares database. a structured telephone interview of a bystander or administrative personnel was conducted for each ca. a descriptive summary was used to assess for the presence of an aed, provision of bystander cpr (bcpr), and information regarding aed deployment, training, and use and perceived barriers to aed use. descriptive data are reported. results: during the study period there were , ca identified at cares communities, of which were identified as educational institutions. of these, ( . %) events were at k- schools with ( . %) being high schools. of the arrests, a minority were children ( ( . %) < age ), most ( , . %) were witnessed, a majority ( , . %) received bcpr, and ( . %) were initially in ventricular fibrillation (vf). most arrests / ( %) occurred during the school day ( a- p). overall, ( . %) survived to hospital discharge. interviews were completed for of ( . %) k- events. eighteen schools had an aed on site. most schools ( . %) with aeds reported that they had a training program and personnel identified for its use. an aed was applied in of patients, and of these were in vf and survived to hospital discharge. multiple reasons for aed non-use (n = ) were identified. conclusion: cardiac arrests in schools are rare events; most patients are adults and received bcpr. aed use was infrequent, even when available, but resulted in excellent ( / ) survival. further work is needed to understand aed non-use. post-event interviews are feasible and provide useful information regarding cardiac arrest care. physician background: gastroenteritis is a common childhood disease accounting for - million annual pediatric emergency visits. current literature supports the use of anti-emetics reporting improved oral re-hydration, cessation of vomiting, and reduced need for iv re-hydration. however, there remains concern that using these agents may mask alternative diagnoses. objectives: to assess outcomes associated with use of a discharge action plan using ed-dispensed ondansetron at home in the treatment of gastroenteritis. methods: a prospective, controlled, observational trial of patients presenting to an urban pediatric emergency department (census , ) over a -month period for acute gastroenteritis. fifty patients received ondansetron in the ed. twenty-nine patients were enrolled in the pediatric emergency department discharge action plan (ped-dap) where ondansetron for home use was dispensed by the treating clinician. twenty-one patients were controls. control patients did not receive home ondansetron. ped-dap patients were given instructions to administer the ondansetron for ongoing symptoms any time hours post ed discharge. all patients were followed by phone at - days to assess for the following: time of emesis resolution, alternative diagnoses, unscheduled visits, and adverse events. results: all patients were followed by phone. / ped-dap patients received home ondansetron. / patients had resolution of emesis in the ed. / had resolution of their emesis between time of discharge and hours. / of ped-dap patients reported emesis after hours from ed discharge. five patients reported an unscheduled visit. all five return visits returned to the ed ( / returned for emesis, / for diarrhea). / controls reported resolution of symptoms within the ed. / of controls had resolution between time of discharge and hours. 
/ of the control patients had resolution with between and hours post discharge. / had an unscheduled appointment with the pmd at hours post-discharge for ongoing fever and nausea. in follow-up there were no alternative diagnoses identified. the effect of the ped-dap on resolution of emesis between discharge and hours appears to be statistically significant (p value < . ). conclusion: ondansetron given in schedule with a discharge action plan appears to provide a modest benefit in resolution of symptoms relative to a control population.

objectives: to determine the repeatability coefficient of a mm vas in children aged to years in different circumstances: assessments done either at or minute interval, when asked to recall their score or to reproduce it. methods: a prospective cohort study was conducted using a convenience sample of patients aged to years presenting to a pediatric ed. patients were asked to indicate, on a mm paper vas, how much they liked a variety of food with four different sets of three questions: (set ) questions at minute interval with no specific instruction other than how to complete the vas and no access to previous scores, (set ) same format as set except for questions at minute interval, (set ) same as set except patients were asked to remember their answers, and (set ) same as set except patients were shown their previous answers. for each set, the repeatability coefficient of the vas was determined according to the bland-altman method for measuring agreement using repeated measures: 1.96 x √2 x s_w, where s_w is the within-subject standard deviation estimated by anova. the sample size required to estimate s_w to % of the fraction value as recommended was patients if we obtained three measurements for each patient. results: a total of patients aged . ± . years were enrolled. the repeatability coefficient for the questions asked at minute intervals was mm, and mm when asked at minute interval. when asked to remember their previous answers or to reproduce them, the repeatability coefficient for the questions was mm and mm, respectively. conclusion: the condition of the assessments (variation in intervals or patients asked to remember or to reproduce their previous answers) influences the test-retest reliability of the vas. depending on circumstances, the theoretical test-retest reliability in children aged to years varies from to mm on a mm paper vas.

background: skull radiographs are a useful tool in the evaluation of pediatric head trauma patients. however, there is no consensus on the ideal number of views that should be obtained as part of a standard skull series in the evaluation of pediatric head trauma patients. objectives: to compare the sensitivity and specificity of a two- and four-film x-ray series in the diagnosis of skull fracture in children, when interpreted by pediatric emergency medicine physicians. methods: a prospective, crossover experimental study was performed in a tertiary care pediatric hospital. the skull radiographs of children were reviewed. these were composed of the most recent cases of skull fracture for which a four-film radiography series was available at the primary setting and controls, matched for age. two modules, containing a random sequence of two- and four-film series of each child, were constructed in order to have all children evaluated twice (once with two films and once with four films). board-certified or -eligible pediatric emergency physicians evaluated both modules two to four weeks apart.
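As an aside on the VAS repeatability methods above: the within-subject standard deviation s_w comes from a one-way ANOVA with subject as the factor, and the repeatability coefficient is then 1.96 x √2 x s_w. A sketch with fabricated score triplets, not the study's data:

import numpy as np

# rows = subjects, columns = the three repeated VAS scores (mm) for one question set (fabricated)
scores = np.array([
    [62, 58, 60],
    [80, 84, 81],
    [35, 30, 33],
    [90, 88, 91],
    [50, 55, 52],
], dtype=float)

# One-way ANOVA with subject as the factor: the within-subject variance is the
# mean square of the residuals around each subject's own mean.
residuals = scores - scores.mean(axis=1, keepdims=True)
n_subjects, k = scores.shape
within_ms = (residuals ** 2).sum() / (n_subjects * (k - 1))
s_w = np.sqrt(within_ms)

repeatability = 1.96 * np.sqrt(2) * s_w
print(f"s_w = {s_w:.1f} mm, repeatability coefficient = {repeatability:.1f} mm")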
the interpretation of the four-film series by a radiologist, or when available, the findings on ct scan, served as the gold standard. accuracy of interpretation was evaluated for each patient. the sensitivity and specificity of the two-film versus the four-film skull xray series, in the identification of fracture, were compared. this was a non-inferiority cross-over study evaluating the null hypothesis that a series with two views would have a sensitivity (specificity) that is inferior by no more than . compared to a series with four views. a total of controls and cases were needed to establish non-inferiority of the two-film series versus the four-film series, with a power of % and a significance level of %. results: ten pediatric emergency physicians participated in the study. for each radiological series, the proportion of accurate interpretation varied between . to . . the four-film series was found to be more sensitive in the detection of skull fracture than a two-film series (difference: . , %ci . to . ). however, there was no difference in the specificity (difference: . , %ci ) . to . ). conclusion: for children sustaining a head trauma, a four-film skull radiography series is more sensitive than a two-film series, when interpreted by pediatric emergency physicians. the objectives: we developed a free online video-based instrument to identify knowledge and clinical reasoning deficits of medical students and residents for pediatric respiratory emergencies. we hypothesized that it would be a feasible and valid method of differentiating educational needs of different levels of learners. methods: this was an observational study of a free, web-based needs assessment instrument that was tested on third and fourth year medical students (ms - ) and pediatric and emergency medicine residents (r - ). the instrument uses youtube video triggers of children in respiratory distress. a series of cased-based questions then prompts learners to distinguish between upper and lower airway obstruction, classify disease severity, and manage uncomplicated croup and bronchiolitis. face validity of the instrument was established by piloting and revision among a group of experienced educators and small groups of targeted learners. final scores were compared across groups using t-tests to determine the ability of the instrument to differentiate between different levels of learners (concurrent validity). cronbach's alpha was calculated as a measure of internal consistency. results: response rates were % among medical students and % among residents. the instrument was able to differentiate between junior (ms , ms , and r ) and senior (r , r ) learners for both overall mean score ( % vs. %, p < . ) and mean video portion score ( vs. %, p = . ). table compares results of several management questions between junior and senior learners. cronbach's alpha for the test questions was . . conclusion: this free online video-based needs assessment instrument is feasible to implement and able to identify knowledge gaps in trainees' recognition and management of pediatric respiratory emergencies. it demonstrates a significant performance difference between the junior and senior learners, preliminary evidence of concurrent validity, and identifies target groups of trainees for educational interventions. future revisions will aim to improve internal consistency. results: the survey response rate was % ( / ). among responding programs, ( %) reside within a children's hospital (vs. 
general ed); ( %) are designated level i pediatric trauma centers. forty-three ( %) programs accept - pem fellows per year; ( %) provided at least some eus training to fellows, and ( %) offer a formal eus rotation. on average this training has existed for ± years and the mean duration of eus rotations is ± weeks. twenty-eight ( %) programs with eus rotations provide fellow training in both a general ed and a pediatric ed. there were no hospital or program level factors associated with having a structured training program for pem fellows. conclusion: as of , the majority of pem fellowship programs provide eus training to their fellows, with a structured rotation being offered by most of these programs. background: ed visits are an opportunity for clinicians to identify children with poor asthma control and intervene. children with asthma who use eds are more likely than other children to have poor control, not be using controller medications, and have less access to traditional sources of primary care. one significant barrier to ed-based interventions is recognizing which children have uncontrolled asthma. objectives: to determine whether the pacci, a item parent-administered questionnaire, can help ed clinicians better recognize patients with the most uncontrolled asthma and differentiate between intermittent and persistent asthma. methods: this was a randomized controlled trial performed at an urban pediatric ed. parents were asked to answer questions about their child's asthma including drug adherence and history of exacerbations, as well as answer demographic questions. using a convenience sample of children - years presenting with an asthma exacerbation, attending physicians in the study were asked to complete an assessment of asthma control. physicians were randomized to receive a completed pacci (intervention) or not (control group). using an intent-to-treat approach, clinicians' ability to accurately identify ) four categories of control used by the national heart, lung, and blood institute (nhlbi) asthma guidelines, ) intermittent vs. persistent level asthma, and ) controlled / mildly uncontrolled vs. moderate/severely uncontrolled asthma were compared for both groups using chi-square analysis. results: between january and august , patients were enrolled. there were no statistically significant differences between the intervention and control groups for child's sex, age, race and parents' education. conclusion: the pacci improves ed clinicians' ability to categorize children's asthma control according to nhlbi guidelines, and the ability to determine when a child's control has been worsening. ed clinicians may use the pacci to identify those children in greatest need for intervention, to guide prescription of controller medications, and communicate with primary care providers about those children failing to meet the goals of asthma therapy. figure) . fewer than half of physicians reported the parent of a -year-old being discharged from their ed following an mvc-related visit would receive either child passenger safety information or referrals (table) . conclusion: emergency physician report of child passenger safety resource availability is associated with trauma center designation. even when resources are available, referrals from the ed are infrequent. efforts to increase referrals to community child passenger safety resources must extend to the community ed settings where the majority of children receive injury care. 
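For the PACCI trial described above, the primary comparison (whether clinicians receiving the completed questionnaire classified asthma control correctly more often than controls) reduces to a chi-square test on a two-by-two table; the counts below are hypothetical.

from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = randomization arm, columns = physician classification.
#                     correct  incorrect
table = [[70, 30],   # PACCI provided (intervention)
         [50, 50]]   # no PACCI (control)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.3f}")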
background: pediatric subspecialists are often difficult to access following ed care especially for patients living far from providers. telemedicine (tm) can potentially eliminate barriers to access related to distance, and cost. objectives: to evaluate the overall resource savings and access that a tm program brings to patients and families. methods: this study took place at a large, tertiary care regional pediatric health care system. data were collected from / - / . metrics included travel distance saved (round trip between tm presenting sites and the location of the receiving sites), time savings, direct cost savings (based on $ . /mile) and potential work and school days saved. indirect costs were calculated as travel hrs saved/encounter (based on an average speed of miles/hr). demographics and services provided were included. results: tm consults were completed by separate pediatric subspecialty services. most patients were school aged ( % >/= yrs old objectives: to analyze test characteristics of the pathway and its effects on ed length of stay, imaging rates, and admission rate before versus after implementation. methods: children ages - presenting to one academic pediatric ed with suspicion for appendicitis from october -august were prospectively enrolled to a pathway using previously validated lowand high-risk scoring systems. the attending physician recorded his or her suspicion of appendicitis and then used one of two scoring systems incorporating history, physical exam, and cbc. low-risk patients were to be discharged or observed in the ed. high-risk patients were to be admitted to pediatric surgery. those meeting neither low-nor high-risk criteria were evaluated in the ed by pediatric surgery, with imaging at their discretion. chart review and telephone follow-up were conducted two weeks after the visit. charts of a random sample of patients with diagnoses of acute appendicitis or chief complaint of abdominal pain and undergoing a workup for appendicitis in the eight months before and after institution of the pathway were retrospectively reviewed by one or two trained abstractors. results: appendicitis was diagnosed in of patients prospectively enrolled to the pathway ( %). mean age was . years. of those with appendicitis, were not low-risk (sensitivity . %, specificity . %). the high-risk criteria had a sensitivity of . % and specificity of . %. a priori attending physician assessment of low risk had a sensitivity of % and specificity of . %. a priori assessment of high risk had a sensitivity of . % and specificity of . %. we reviewed visits prior to the pathway and after. mean ed length of stay was similar ( minutes before versus after). ct was used in . % of visits before and . % after (p = . ). use of ultrasound increased ( . % before versus . % after, p < . ). admission rates were not significantly different ( . % before versus . % after, p = . ). conclusion: the low-risk criteria had good sensitivity in ruling out appendicitis and can be used to guide physician judgment. institution of this pathway was not associated with significant changes in length of stay, utilization of ct, or admission rate in an academic pediatric ed. computer-delivered alcohol and driver safety behavior screening and intervention program initiated during an emergency department visit mary k. murphy , lucia l. smith , anton palma , david w. lounsbury , polly e. 
bijur , paul chambers yale university, new haven, ct; albert einstein college of medicine, bronx, ny background: alcohol use is involved in percent of all fatal motor vehicle crashes and recent estimates show that at least , people were injured due to distracted driving last year. patients who visit the emergency department (ed) are not routinely screened for driver safety behavior; however, large numbers of patients are treated in the ed every day creating an opportunity for screening and intervention on important public health behaviors. objectives: to evaluate patient acceptance and response to a computer-based traffic safety educational intervention during an ed visit and one month follow-up. methods: design. pre /post educational intervention. setting. large urban academic ed serving over , patients annually. participants. medically stable adult ed patients. intervention. patients completed a self-administered, computer-based program that queried patients on alcohol use and risky driving behaviors (texting, talking, and other forms of distracted driving). the computer provided patients with educational information on the dangers of these behaviors and collected data on patient satisfaction with the program. staff called patients one month post ed visit for a repeat query. results: patients participated; average age ( - ), % hispanic, % male. % of patients reported the program was easy to use and were comfortable receiving this education via computer during their ed visit. self-reported driver safety behaviors pre, post intervention (% change): driving while talking on the phone %, % () %, p = . ), aggressive driving %, % () %, p = . ), texting while driving %, % () %, p = . ), driving while drowsy %, % () %, p = . ), drinking in excess of nih safe drinking guidelines %,% () %, p = . ), drinking and driving %, % () %, p = . ). conclusion: we found a high prevalence of selfreported risky driving behaviors in our ed population. at month follow-up, patients reported a significant decrease in these behaviors. overall patients were very satisfied receiving educational information about these behaviors via computer during their ed visit. this study indicates that a low-intensity, computer-based educational intervention during an ed visit may be a useful approach to educate patients about safe driving behaviors and promote behavior change. prevalence of depression among emergency department visitors with chronic illness janice c. blanchard, benjamin l. bregman, jeffrey smith, mohammad salimian, qasem al jabr george washington university, washington, dc background: persons with chronic illnesses have been shown to have higher rates of depression than the general population. the effect of depression on frequent emergency department (ed) use among this population has not been studied. objectives: this study evaluated the prevalence of major depressive disorder (mdd) among persons presenting with depression to the george washington university ed. we hypothesized that patients with chronic illnesses would be more likely to have mdd than those without. methods: this was a single center, prospective, crosssectional study. we used a convenience sample of noncritically ill, english-speaking adult patients presenting with non-psychiatric complaints to an urban academic ed over months in . subjects were screened with the phq , a nine-item questionnaire that is a validated, reliable predictor of mdd. 
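The pre/post changes in self-reported driving behaviors above are paired within patients. The abstract does not name the test used, so as one reasonable sketch, McNemar's test on the discordant pairs is shown here with fabricated counts.

from statsmodels.stats.contingency_tables import mcnemar

# Paired pre/post reports of one behavior (e.g., texting while driving).
# Rows = reported at baseline (yes/no), columns = reported at follow-up (yes/no).
table = [[120, 60],   # yes at baseline: 120 still yes, 60 switched to no
         [15, 305]]   # no at baseline: 15 switched to yes, 305 still no

result = mcnemar(table, exact=False, correction=True)
print(f"McNemar chi-square = {result.statistic:.2f}, p = {result.pvalue:.4f}")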
we also queried respondents about demographic characteristics as well as the presence of at least one chronic disease (heart disease, hypertension, asthma, diabetes, hiv, cancer, kidney disease, or cerebrovascular disease). we evaluated the association between mdd and chronic illnesses with both bivariate analysis and multivariate logistic regression controlling for demographic characteristics (age, race, sex, income, and insurance coverage). results: our response rate was . % with a final sample size of . of our total sample, ( . %) had at least one of the chronic illnesses defined above. of this group, ( . %) screened positive for mdd as compared to ( . %) of the group without chronic illnesses (p < . ). in multivariate analysis, persons with chronic illnesses had an odds ratio for a positive depression screen of . ( . , . ) as compared to persons without illness. among the subset of persons with chronic illnesses (n = ), . % had ‡ visits in the prior days as compared to . % of persons with chronic illnesses without mdd (p = . ). conclusion: our study found a high prevalence of untreated mdd among persons with chronic illnesses who present to the ed. depression is associated with more frequent emergency department use among this population. initial blood alcohol level aids ciwa in predicting admission for alcohol withdrawal craig hullett, douglas rappaport, mary teeple, daniel butler, arthur sanders university of arizona, tucson, az background: assessment of alcohol withdrawal symptoms is difficult in the emergency department. the clinical institute withdrawal assessment (ciwa) is commonly used, but other factors may also be important predictors of withdrawal symptom severity. objectives: the purpose of this study is to determine whether ciwa score at presentation to triage was predictive of later admission to the hospital. methods: a retrospective study of patients presenting to an acute alcohol and drug detoxification hospital was performed from july through january . patients were excluded if other drug withdrawal was present in addition to alcohol. initial assessment included age, sex, vital signs, and blood alcohol level (bal) in addition to hourly ciwa score. admission is indicated for a ciwa score of or higher. data were analyzed by selecting all patients not immediately admitted at initial presentation. logistic regression using wald's criteria for stepwise inclusion was used to determine the utility of the initially gathered ciwa, bal, longest sobriety, liver cirrhosis, and vital signs in predicting subsequent admission. results: there were patients who fit the inclusion criteria, with admitted for treatment at initial intake and another admitted during the following hours. logistic regression indicated that presenting bal was a strong predictor (p = . ) of admission for treatment after initial presentation, as was presenting ciwa (p = . ). thus, presenting bal provided a substantial addition above initial ciwa in predicting later admission. no other variables added significantly to the prediction of later admission. to determine the interaction between presenting bal and ciwa scores, we ran a repeated measures analysis of the first five ciwa scores (from presentation to hours later), using bal split into low (bal < . ) and high (bal > . ) groups (see figure) . their interaction was significant, f ( , ) = . , p < . , g = . . those presenting with higher initial bal had suppressed ciwa scores that rose precipitously as the alcohol cleared. 
those with low presenting bal showed a decline in ciwa over time conclusion: initial assessment using the common assessment tool ciwa is aided significantly by bal assessment. patients with higher presenting bal are at higher risk for progression to serious alcohol withdrawal symptom. objectives: to describe patient and visitor characteristics and perspectives on the role of visitors in the ed and determine the effect of visitors on ed and hospital outcome measures. methods: this cross-sectional study was done in an , -visit urban ed, and data were attempted to be collected from all patients over a consecutive -hour period from august to , . trained data collectors were assigned to the ed continuously for the study period. patients assigned to a rapid care section of the ed ( %) were excluded. a visitor was defined as a person other than a health care provider (hcp) or hospital staff present in a patient's room at any time. patient perspectives on visitors were assessed in the following domains: transportation, emotional support, physical care, communication, and advocating for the patient. ed and hospital outcome measures pertaining to ed length of stay (los) and charges, hospital admission rate, hospital los and charges were obtained from patient medical records and hospital billing. data analyses included frequencies, student's t-tests for continuous variables, and chi-square tests of association for categorical variables. all tests for significance were two-sided. objectives: to examine the effect of sunday alcohol availability on ethanol-related visits and alcohol withdrawal visits to the ed. methods: study design was a retrospective beforeafter study using electronically archived hospital data at an urban, safety net hospital. all adult non-prisoner ed visits from / / to / / were analyzed. an ethanol-related ed visit was defined by icd- codes related to alcohol ( .x, .x, . , . ). an alcohol withdrawal visit was defined by icd- codes of delirium tremens ( . ), alcohol psychosis with hallucination ( . ), and ethanol withdrawal ( . ). we generated a ratio of ethanol-related ed visits to total ed visits (ethanol/total) and ratio of alcohol withdrawal ed visits to total ed visits (withdrawal/total). a day was redefined as am to am. the ratios were averaged within the four seasons to account for seasonal variations. data from summer were dropped as it spanned the law change. we stratified data into sunday and non-sunday days prior to analysis to isolate the effects of the law change. we used multivariable linear regression to estimate the association of the ratio with the law change while adjusting for time and the seasons. each ratio was modeled separately. the interaction between time and the law change was assessed using p < . . results: during the study there were a total of , ed visits including , ( % of total) ethanol-related visits and , ( % of total) alcohol withdrawal visits. unadjusted ratios in seasonal blocks are plotted in the figure with associated % ci and best fit regression line for before and after law change, respectively. after adjusting for time and season in the multivariable linear regression, we found no significant association of either ethanol/total or withdrawal/total with the law change. this remained true for both sunday and non-sunday data. all interactions assessed were not significant. 
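A sketch of the adjusted before/after model described above: regress the seasonal visit ratio on a law-change indicator, a time trend, season terms, and a time-by-law interaction, so the interaction term captures any slope change after the law. The data frame is simulated and the timing of the law change is assumed, not taken from the hospital archive.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
quarters = pd.DataFrame({
    "time": np.arange(16),                                 # consecutive seasons
    "season": ["winter", "spring", "summer", "fall"] * 4,
})
quarters["post_law"] = (quarters.time >= 8).astype(int)    # law change midway (assumed)
# Simulated ratio of ethanol-related visits to total visits, with no true law effect.
quarters["ethanol_ratio"] = 0.05 + 0.0005 * quarters.time + rng.normal(0, 0.003, len(quarters))

model = smf.ols("ethanol_ratio ~ post_law * time + C(season)", data=quarters).fit()
print(model.summary())  # the post_law and post_law:time terms estimate level and slope change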
conclusion: the change in colorado law to allow the sale of full-strength alcoholic beverages on sundays did not significantly affect ethanol-related or alcohol withdrawal ed visits. background: olanzapine is a second-generation antipsychotic (sga) with actions at the serotonin/histamine receptors. post-marketing reports and a case report have documented dangerous lowering of blood pressure when this antipsychotic is paired with benzodiazepines, but a recent small study found no larger decreases in blood pressure than with another antipsychotic such as haloperidol. decreases in oxygen saturations, however, were larger when olanzapine was combined with benzodiazepines in alcohol-intoxicated patients. it is unclear whether these vital sign changes are associated with the intramuscular (im) route only. objectives: to assess vital signs following administration of either oral (po) or im olanzapine, with or without benzodiazepines (benzos) and with or without concurrent alcohol intoxication. methods: this is a structured retrospective chart review of all patients who received olanzapine in an academic medical center ed from - who had vital signs documented both before medication administration and within four hours afterwards. vital signs were calculated as pre-dose minus lowest post-dose vital sign within hours, and were analyzed in an anova with route (im/po), benzo use (+/-), and alcohol use (+/-) as factors. significance level was set to < . . results: there were patients who received olanzapine over the study period. a total of patients ( po, im) met inclusion criteria. systolic blood pressures decreased across all groups as patients reduced their agitation. neither the route of administration, concurrent use of benzos, nor the use of alcohol was associated with significant changes in systolic bp (p = ns for all comparisons; see figure ). decreases in oxygen saturations, however, were significantly larger for alcohol-intoxicated patients who subsequently received im olanzapine + benzos compared to other groups (route: p < . ; alcohol: p < . ; route x alcohol: p < . ; route x benzos x alcohol: p < . ; see figure ). conclusion: alcohol and benzos are not associated with significant decreases in blood pressure after po olanzapine, but im olanzapine + benzos is associated with potentially significant oxygen desaturations in patients who are intoxicated. intoxicated patients may have differential effects with the use of im sgas such as olanzapine when combined with benzos, and should be studied separately in drug trials. patients with a psychiatric diagnosis. rasha buhumaid, jessica riley, janice blanchard; george washington university, washington, dc. background: literature suggests that frequent emergency department (ed) use is common among persons with a mental health diagnosis. few studies have documented risk factors associated with increased utilization among this population. objectives: to understand demographic characteristics of frequent users of the emergency department and describe characteristics associated with their visits. it was hypothesized that frequent visitors would have a higher rate of medical comorbidities than infrequent visitors. methods: this was a retrospective study of patients presenting to an urban, academic emergency department in . a cohort of all patients with a mental health-related final icd- coded diagnosis (axis i or axis ii) was extracted from the electronic medical record.
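for the olanzapine vital-sign analysis above, a minimal sketch of a route x benzodiazepine x alcohol factorial anova is shown below; the data and column names are invented for illustration and are not the study's chart-review dataset.

```python
# illustrative sketch: simulated systolic blood pressure changes (pre-dose minus
# lowest post-dose value) analyzed with a three-factor ANOVA.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 120
chart = pd.DataFrame({
    "route":   rng.choice(["IM", "PO"], n),
    "benzo":   rng.choice([0, 1], n),
    "alcohol": rng.choice([0, 1], n),
})
chart["sbp_drop"] = 10 + 3 * chart["benzo"] + rng.normal(0, 8, n)

model = smf.ols("sbp_drop ~ C(route) * C(benzo) * C(alcohol)", data=chart).fit()
print(anova_lm(model, typ=2))   # main effects and all interaction terms
```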
using a standard abstraction form, a medical chart review collected information about medical comorbidities, substance abuse, race, age, sex, and insurance coverage, as well as diagnosis, disposition, and time of each visit. results: our sample consisted of frequent users (≥ visits in a -day period) and infrequent users (≤ visits in a -day period). frequent users were more likely to be male ( % vs. . % p = . ), black ( % vs. % p < . ), and had a higher average number of comorbid conditions ( . , %ci . , . ) as compared to infrequent users ( . , %ci . , . ). a higher percentage of visits in the infrequent user group occurred during the day ( % vs. . % p < . ) while a higher percentage of visits in the frequent user group occurred after midnight ( . % vs. . % p = . ). visits in the frequent user group were less likely to be for a psychiatric complaint ( . % vs. . %) and less likely to result in a psychiatric admission ( . % versus . %) as compared to the infrequent user group (p < . ). conclusion: our data indicate that among patients with psychiatric diagnoses, those who make frequent ed visits have a higher rate of comorbid conditions than infrequent visitors. despite their increased use of the ed, frequent visitors have a significantly lower psychiatric admission rate. many of the visits by frequent users are for non-psychiatric complaints and may reflect poor access to outpatient medical and mental health services. emergency departments should consider interventions to help address social and medical issues among mental health patients who frequently use ed services. background: the world health organization estimates that one million people die annually by suicide. in the u.s., suicide is the fourth leading cause of death between the ages of and . many of these patients are seen in the ed, while outpatient visits for depression are also high. no recent analysis has compared these groups. objectives: to determine if there is a relationship between the incidence of suicidal and depressed patients presenting to emergency departments and the incidence of depressed patients presenting to outpatient clinics from - . the secondary objective is to analyze trends in suicidal patients in the ed. methods: we used nhamcs (national hospital ambulatory medical care survey) and namcs (national ambulatory medical care survey), national surveys completed by the centers for disease control, which provide a sampling of emergency department and outpatient visits respectively. for both groups, we used mental-health-related icd- -cm, e codes and reasons for visit. we compared suicidal and depressed patients who presented to the ed, to those who presented to outpatient clinics. our subgroup analyses included age, sex, race/ethnicity, method of payment, regional variation, and urban versus rural distribution. results: ed visits for depression ( . %) and suicide attempts ( . %) remained stable over the years, with no significant linear trend. however, office visits for depression significantly decreased from . % of visits in to . % of visits in . non-latino whites had a higher percentage of ed visits for depression ( . %) and suicide attempt ( . %) (p < . ), and a higher percentage of office visits for depression than all other groups. among patients age - years, ed visits for suicide attempt significantly increased from . % in to . % in . homeless patients had a higher percent of ed visits for depression ( .
%) and suicide attempt. background: for potentially high-risk ed patients with psychiatric complaints, efficient ed throughput is key to delivering high-quality care and minimizing time spent in an unsecured waiting room. objectives: we hypothesized that adding a physician in triage would improve ed throughput for psychiatric patients. we evaluated the relationship between the presence of an ed triage physician and waiting room (wr) time, time to first physician order, time to ed bed assignment, and time spent in an ed bed. methods: the study was conducted from / - / at an academic ed with annual visits and a dedicated on-site emergency psychiatric unit. we performed a pre/post retrospective observational cohort study using administrative data, including weekend visits from noon- pm, months pre and post addition of weekend triage physicians. after adjusting for patient age, sex, insurance status, emergency severity index score, mode of arrival, ed occupancy rate, wr count, boarding count, and average wr los, multiple linear regression was used to evaluate the relationship between the presence of a triage physician and four ed throughput outcomes: time spent in the wr, time to first order, time spent in an ed bed, and the total ed los. results: visits met inclusion criteria, in the months before and in the months after physicians were assigned to triage on weekends. table reports demographic data; multivariate analysis results are found in table . the presence of a triage physician was associated with an ( % ci . - . ) minute increase in wr time and no associated change in time to first order, time spent in an ed bed, or in the overall ed los. conclusion: use of triage physicians has been reported to decrease the time patients spend in an ed bed and improve ed throughput. however, for patients with psychiatric complaints, our analysis revealed a slight increase in wr time without evident change in the time to first order, time spent in an ed bed, or total ed los. improvements in ed throughput for psychiatric patients will likely require system-level changes, such as reducing ed boarding and improving lab efficiency to speed the process of medical clearance and reduce time spent in the unsecured wr. these findings may not be generalizable to eds without a dedicated ed psychiatric unit with full-time social workers to assist with disposition. initial assessment included ciwa scoring, repeated hourly, as well as other variables (see table ). treatment and admission to the inpatient hospital was indicated for a ciwa score of or higher. statistical analysis was performed utilizing repeated measures general linear modeling for ciwa scores and anova for all other variables. results: there were patients who fit the inclusion criteria, with admitted for treatment at initial intake and another admitted during the following hours. the table below compares the three most prevalent ethnic populations seen at our hospital. native americans presented at a significantly younger age (p < . ) than the other two ethnicities. initial ciwa scores taken on admission were significantly lower in the native american group than the other two groups (p < . ) and at hour a difference existed but failed to reach significance. repeated measures analysis indicated that ciwa scores progressed in a u-shaped curvilinear fashion (see figure ). conclusion: initial assessment utilizing ciwa scores appears to be affected by ethnicity. care must be taken when assessing and making decisions on a single initial ciwa score.
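returning briefly to the triage-physician throughput analysis above, a minimal sketch of a covariate-adjusted linear model for waiting-room time follows; every column name and value is hypothetical.

```python
# illustrative sketch: simulated visit-level data for a covariate-adjusted model
# of waiting-room minutes on the presence of a triage physician.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
visits = pd.DataFrame({
    "wr_minutes":   rng.gamma(2.0, 30.0, n),       # time spent in the waiting room
    "triage_md":    rng.choice([0, 1], n),         # triage physician present
    "age":          rng.integers(18, 90, n),
    "esi":          rng.integers(1, 6, n),         # emergency severity index
    "ed_occupancy": rng.uniform(0.4, 1.2, n),
    "wr_count":     rng.integers(0, 40, n),
    "boarding":     rng.integers(0, 25, n),
})

fit = smf.ols("wr_minutes ~ triage_md + age + C(esi) + ed_occupancy + wr_count + boarding",
              data=visits).fit()
print(fit.params["triage_md"], fit.conf_int().loc["triage_md"].values)
```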
further research is needed in this area as our numbers are small and differences might be seen in subsequent scoring. in addition, our study consists primarily of male patients and does not include african-american patients. background: age is a risk factor for adverse outcomes in trauma, yet evidence supporting the use of specific age cut-points to identify seriously injured patients for field triage is limited. objectives: to evaluate under-triage by age, empirically examine the association between age and serious injury for field triage, and assess the potential effect of mandatory age criteria. methods: this was a retrospective cohort study of injured children and adults transported by ems agencies to hospitals in regions of the western u.s. from - . hospital records were probabilistically linked to ems records using trauma registries, emergency department data, and state discharge databases. serious injury was defined as an injury severity score (iss) ≥ (the primary outcome). we assessed under-triage (triage-negative patients with iss ≥ ) by age decile, different mandatory age criteria, and used multivariable logistic regression models to test the association (linear and non-linear) between age and iss ≥ , adjusted for important confounders. results: , injured patients were evaluated and transported by ems over the -year period. under-triage increased markedly for patients over years, reaching % for those over years ( figure ). mandatory age triage criteria decreased under-triage, while substantially increasing over-triage: one iss ≥ patient identified for every additional patients triaged to major trauma centers. among patients not identified by other criteria, age had a strong non-linear association with iss ≥ (p < . ); the probability of serious injury steadily increased after years, becoming more notable after years ( figure ). conclusion: under-triage in trauma increases in patients over years, which may be reduced with mandatory age criteria at the expense of system efficiency. among patients not identified by other criteria, serious injury steadily increased after years, though there was no age at which risk abruptly increased. background: although limited resuscitation with hemoglobin-based oxygen carriers (hbocs) improves survival in several polytrauma models, including those of traumatic brain injury (tbi) with uncontrolled hemorrhage (uh) via liver injury, their use remains controversial. objectives: we examine the effect of hboc resuscitation in a swine polytrauma model with uh by aortic tear ± tbi. we hypothesize that limited resuscitation with hboc would offer no survival benefit and would have similar effects in a model of uh via aortic tear ± tbi. methods: anesthetized swine subjected to uh inflicted via aortic tear ± fluid percussion tbi underwent equivalent limited resuscitation with hboc, lr, or hboc+nitroglycerin (ntg) (vasoattenuated hboc) and were observed for hours. comparisons were between tbi and no-tbi groups with adjustment for resuscitation fluid type using two-way anova with interaction and tukey-kramer adjustment for individual comparisons. results: there was no independent effect of tbi on survival time after adjustment for fluid type (anova, tbi term p = . ) and there was no interaction between tbi and resuscitation fluid type (anova interaction term p = . ). there was a significant independent effect of fluid type on survival time (anova p = .
background: intracranial hemorrhage (ich) after a head trauma is a problem frequently encountered in the ed. an elevated inr is recognized as a risk factor for bleeding. however, in a patient with an inr in the normal range, a level associated with a lower risk of ich is not known. objectives: the aim of this study was to identify an inr threshold that could predict a decreased risk of an ich after a head trauma in patients with a normal inr. it is hypothesized that there is a threshold at which the likelihood of bleeding decreases significantly. methods: we conducted a study using data from a registry of patients with mild to severe head trauma (n = ) evaluated in a level i trauma center in canada between march and february . all the patients with a documented scan interpreted by a radiologist and a normal inr, defined as a value less than . , were included. we determined the correlation between inr value binned by . and the proportion of patients with an ich. the threshold was defined by consensus as an abrupt change of more than % in the percentage of patients with ich. univariate frequency distribution was tested with pearson's chi-square test. logistic regression analysis was then used to study the effects of inr on ich with the following confounding factors: age, sex, and intake of warfarin, clopidogrel, or aspirin. results are presented with % confidence intervals. results: patients met the inclusion criteria. the mean age was . years ± . and % were men. patients ( . %) had an ich on brain scan. we found a significantly lower risk of ich at a threshold of inr less than . (p < . , univariate or = . , %ci . - . ) and a strong correlation between the risk of bleeding and increasing inr (r = . ). in fact, after adjustment for confounding variables, every . inr increase was associated with an increased risk of having an ich (or . ; % ci . - . ). conclusion: we were able to demonstrate an inr threshold under which the probability of ich was significantly lower. we also found a strong association between the risk of bleeding and the increase in inr within a normal range, suggesting that clinicians should not be falsely reassured by a normal inr. our results are limited by the fact that this is a retrospective study and a small proportion of traumatic brain-injured patients in our database had no scan or inr at their ed visit. a prospective cohort study would be needed to confirm our results. background: increasingly, patients with tbi are being seen and managed in the emergency neurology setting. knowing which early signs are associated with prognosis can be helpful in directing the acute management. objectives: to determine whether any factors early in the course of head trauma are associated with short-term outcomes including inpatient admission, in-hospital mortality, and return to the hospital within days. methods: this irb-approved study is a retrospective review of patients with head injury presenting to our tertiary care academic medical center during a -month period. the dataset was created using redcap, a data management solution hosted by our medical school's center for translational science institute. results: the median age of the cohort (n = ) was , iqr = - yrs, with % being male. % had a gcs of - (mild tbi), % - (moderate tbi), and % gcs < (severe tbi). % of patients were admitted to the hospital. the median length of hospital stay was days, with an iqr of - days. of those admitted, % had an icu stay as well. the median icu los was also days, with an iqr of - days.
twenty-nine ( %) patients died during their hospital stay. lower gcs was predictive of inpatient admission (p = . ) as well as icu days (p < . ). significant predictors of re-admission to the hospital within days included hypotension (p = . ) upon initial presentation. the prehospital and ed gcs scores were not statistically significant. significant predictors of in-hospital death in a model controlling for age included bradycardia (p = . ), hyperglycemia (p = . ), and lower gcs (p = . ). the incidence of bradycardia (hr < ) was . %. conclusion: early hypotension, hyperglycemia, and bradycardia along with lower initial gcs are associated with significantly higher likelihood of hospital admission, including icu admission, as well as in-hospital death and re-admission. background: over , people per day require treatment for ankle sprains, resulting in lost workdays and lost training time for athletes. platelet-rich plasma (prp) is an autologous concentration of platelets which, when injected into the site of injury, is thought to improve healing by promoting inflammation through growth factor and cytokine release. studies to date have shown mixed results, with few randomized or placebo-controlled trials. the lower extremity functional scale (lefs) is a previously validated objective measure of lower extremity function. objectives: is prp helpful in acute ankle sprains in the emergency department? methods: prospective, randomized, double-blinded, placebo-controlled trial. patients with severe ankle sprains and negative x-rays were randomized to trial or placebo. severe was defined as marked swelling and ecchymosis and inability to bear weight. both groups had cc of blood drawn. trial group blood was centrifuged with a magellan autologous platelet separator (arteriocyte, cleveland) to yield - cc of prp. prp along with . cc of % lidocaine and . cc of . % bupivacaine was injected at the point of maximum tenderness by a blinded physician under ultrasound guidance. control group blood was discarded and participants were injected in a similar fashion substituting sterile . % saline for prp. both groups had visual analog scale (vas) pain scores and lefs on days , , , and . all participants had a posterior splint and were made non-weight-bearing for days, after which they were reexamined, had their splint removed, and were asked to bear weight as tolerated. participants were instructed not to use nsaids during the trial. results: patients were screened and were enrolled. four withdrew before prp injection was complete. eighteen were randomized to prp and to placebo. see tables for results. vas and lefs are presented as means with sd in parentheses. demographics were not statistically different between groups. conclusion: in this small study, prp did not appear to offer benefit in either pain control or healing. both groups had improvement in their pain and functionality and did not differ significantly during the study period. limitations include small study size and a large number of participant refusals. methods: a structured chart review of all icd- -coded radius fracture charts spanning march , to july , was conducted. specific variable data were collected and categorized as follows: age, moi, body mass index, and fracture location. the charts were reviewed by two medical students, with % of the charts reviewed by both students to confirm inter-rater reliability. frequencies and inter-quartile ranges were determined. comparisons were made with fisher's exact test and multiple logistic regression.
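before the results, a minimal sketch of the two analysis steps just named (fisher's exact test and multiple logistic regression) is given below; the counts and column names are placeholders, not the study's chart data.

```python
# illustrative sketch: a 2x2 Fisher's exact test (bike accident vs proximal fracture)
# followed by a logistic regression of proximal fracture on age group and mechanism.
import pandas as pd
from scipy.stats import fisher_exact
import statsmodels.formula.api as smf

# rows: bike accident yes/no; columns: proximal fracture yes/no (placeholder counts)
table = [[12, 20],
         [15, 110]]
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)

charts = pd.DataFrame({
    "proximal":  [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
    "age_group": ["18-35", "18-35", "36-64", "36-64", "65+", "18-35",
                  "36-64", "65+", "18-35", "65+", "36-64", "36-64"],
    "bike":      [1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})
fit = smf.logit("proximal ~ C(age_group) + bike", data=charts).fit(disp=False)
print(fit.params)
```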
results: charts met inclusion criteria. charts were excluded for one of the following reasons: no fracture or no x-ray ( ), isolated ulnar fracture ( ), or undocumented or penetrating moi ( ). of the analyzed patients (n = ), distal radius fractures were most common ( %), followed by proximal ( %) and midshaft ( %). chart reviewers were found to be reliable (k = ). age and moi were significantly associated with fracture location (see table). ages - and bike accidents were more strongly associated with proximal radius fractures (odds ratio: [ - ] and [ - ], respectively). conclusion: patients presenting to our inner-city ed with a radius fracture are more likely to have a distal fracture. adults aged - and patients in bike accidents had a significantly higher incidence of proximal fractures than other age groups or mois. background: trauma centers use guidelines to determine the need for a trauma surgeon in the ed on patient arrival. a decision rule from loma linda university that includes penetrating injury and tachycardia was developed to predict which pediatric trauma patients require emergent intervention, and thus are most likely to benefit from surgical presence in the ed. objectives: our goal was to validate the loma linda rule (llr) in a heterogeneous pediatric trauma population and to compare it to the american college of surgeons' major resuscitation criteria (mrc). we hypothesized that the llr would be more sensitive than the mrc for identifying the need for emergent operative or procedural intervention. methods: we performed a secondary analysis of prospectively collected trauma registry data from two urban level i pediatric trauma centers with a combined annual census of approximately , visits. consecutive patients < years old with blunt or penetrating trauma from through were included. patient demographics, injury severity scores (iss), times of ed arrival and surgical intervention, and all variables of both rules were obtained. the outcome (emergent operative intervention within hour of ed arrival or ed cricothyroidotomy or thoracotomy) was confirmed by trained, blinded abstractors. sensitivities, specificities, and % confidence intervals (cis) were calculated for both rules. results: , patients were included with a median age of . years and a median iss of . emergent intervention was required in patients ( . %). the llr had a sensitivity ranging from . %- . % ( % ci: . %- . %) and specificity ranging from . %- . % ( % ci: . %- . %) between both institutions. the mrc had a sensitivity ranging from . %- . % ( % ci: . %- . %) and specificity ranging from . %- . % ( % ci: . %- . %) between institutions. conclusion: emergent intervention is rare in pediatric trauma patients. the mrc was more sensitive for predicting the need for emergent intervention than the llr. neither set of criteria was sufficiently accurate to recommend their routine use for pediatric trauma patients. droperidol for sedation of acute behavioural disturbance. leonie a. calver, colin page, michael downes, betty chan, geoffrey k. isbister; calvary mater newcastle and university of newcastle, newcastle, australia; princess alexandra hospital, brisbane, australia; calvary mater newcastle, newcastle, australia; prince of wales hospital, sydney, australia. background: acute behavioural disturbance (abd) is a common occurrence in the emergency department (ed) and is a risk to staff and patients. there remains little consensus on the most effective drug for sedation of violent and aggressive patients.
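a minimal sketch of how rule performance of the kind reported above can be summarized (sensitivity and specificity with confidence intervals) is shown below; the arrays are placeholder data, and the wilson interval is one of several reasonable ci choices.

```python
# illustrative sketch: operating characteristics of a binary decision rule against
# a binary outcome, with Wilson 95% confidence intervals (placeholder data).
import numpy as np
from statsmodels.stats.proportion import proportion_confint

rule_positive = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
outcome       = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0])

tp = int(((rule_positive == 1) & (outcome == 1)).sum())
fn = int(((rule_positive == 0) & (outcome == 1)).sum())
tn = int(((rule_positive == 0) & (outcome == 0)).sum())
fp = int(((rule_positive == 1) & (outcome == 0)).sum())

sens, spec = tp / (tp + fn), tn / (tn + fp)
sens_ci = proportion_confint(tp, tp + fn, alpha=0.05, method="wilson")
spec_ci = proportion_confint(tn, tn + fp, alpha=0.05, method="wilson")
print(f"sensitivity {sens:.2f} {sens_ci}, specificity {spec:.2f} {spec_ci}")
```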
prior to the food and drug administration's black box warning, droperidol was commonly used and was considered safe and effective. objectives: this study aimed to investigate the effectiveness of parenteral droperidol for sedation of abd. methods: as part of a prospective observational study, a standardised protocol using droperidol for the sedation of abd was instituted. acute and delayed behavioral deficits were demonstrated in this rat model of co toxicity, which parallels the neurocognitive deficit pattern observed in humans (see figure). similar to prior studies, pathologic analysis of brain tissue demonstrated the highest percentage of necrotic cells in the cortex, pyramidal cells, and cerebellum. the collected data are summarized in the table. we have developed an animal model of severe co toxicity evidenced by behavioral deficits and neuronal necrosis. future efforts will compare neurologic outcomes in severely co poisoned rats treated with hypothermia and % inspired o versus hbo to normothermic controls treated with % inspired o . multi-day ultramarathon foot races are increasing in popularity, attracting more than , annual participants worldwide. prior studies have consistently documented renal function impairment, but only after race completion. the incidence of renal injury during these multi-day ultramarathons is currently unknown. this is the first prospective cohort study to evaluate the incidence of acute kidney injury (aki) in runners during a multi-day ultramarathon foot race. objectives: to assess the effect of inter-stage recovery versus cumulative damage on resulting renal function during a multi-day ultramarathon. methods: demographic and biochemical data gathered via phlebotomy and analyzed by i-stat® (abbott, nj) were collected at the start and finish of day ( miles), ( miles), and ( miles) during racing the planet's® -mile, -day self-supported desert ultramarathons. pre-established rifle criteria using creatinine (cr) and glomerular filtration rate (gfr) defined aki as ''no injury'' (cr < . x normal, decrease of gfr < %), ''risk'' (cr . x normal, decrease of gfr by - %), and ''injury'' (cr x normal, decrease of gfr by - %). results: thirty racers ( % male) with a mean (± sd) age of ± years were studied during the sahara (n = , . %), gobi (n = , %), and namibia (n = , . %) events. the average decrease in gfr from day start to day finish was ± (p < . , % ci . - . ); day start to day finish was . ± . (p < . , % ci . - . ); and day start to day finish was . ± . (p < . , % ci . - ). the proportions of runners categorized as risk and injury for aki were . % and % after stage , % and % after stage , and . % and . % after stage . conclusion: the majority of participants developed significant levels of renal impairment despite recovery intervals. given the changes in renal function, potentially harmful non-steroidal anti-inflammatory drugs should be minimized to prevent exacerbating acute kidney injury. background: more than % of the elderly abuse prescription drugs, and emergency medicine providers frequently struggle to identify features of opioid addiction in this population. the prescription drug use questionnaire (pduqp) is a validated, -item, patient-administered tool developed to help health care providers better identify problematic opioid use, or dependence, in patients who receive opioids for the treatment of chronic pain. objectives: to identify the prevalence of prescription drug misuse features in elderly ed patients.
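the rifle classification used in the ultramarathon study above can be expressed as a small decision function; the cut-offs below are the standard rifle thresholds and stand in for the abstract's elided values.

```python
# illustrative sketch: RIFLE-style AKI classification from the creatinine ratio
# (post/pre) and the percentage decrease in GFR, using standard RIFLE cut-offs.
def rifle_category(cr_ratio: float, gfr_decrease_pct: float) -> str:
    if cr_ratio >= 3.0 or gfr_decrease_pct >= 75:
        return "failure"
    if cr_ratio >= 2.0 or gfr_decrease_pct >= 50:
        return "injury"
    if cr_ratio >= 1.5 or gfr_decrease_pct >= 25:
        return "risk"
    return "no injury"

# example: creatinine doubled and GFR fell by 40% between stage start and finish
print(rifle_category(cr_ratio=2.0, gfr_decrease_pct=40))   # -> "injury"
```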
methods: this cross-sectional, observational study was conducted between / and / in the ed of an urban, university-affiliated community hospital that serves a large geriatric population. all patients aged to inclusive were eligible, and were recruited on a convenience basis. exclusion criteria included known dementia and critical illness. outcomes of interest included self-reported history of prior prescription opioid use, substance abuse history, aberrant medication-taking behaviors, and pduqp results. results: one hundred patients were approached for participation. two were excluded for inability to read english, three were receiving analgesia for metastatic cancer, had never taken a prescription opioid, and seven refused to participate beyond pre-screening. sixty patients completed the study (see table ). of those, . % reported four or more visits within months; chronic pain was reported by . %; debilitating pain by . %; prior pain management referral by . %; and storing opioids for future use by %. seventeen patients reported current prescription opioid use, and were administered the pduqp (see figure). in this population, . % thought their pain was not being adequately treated; . % reported having to increase the amount of pain medication they were taking over the prior months; . % saved up pain medication for future use; . % had doctors refuse to give them pain medication for fear that the patient would abuse the prescription opioids; and . % reported having a previous drug or alcohol problem. conclusion: screening instruments, such as the pduqp, facilitate identification of geriatric patients with features of opioid misuse. a high proportion of patients in this study save opioids for future use. interventions for safe medication disposal may decrease access to opioids and subsequent morbidity. age extremes, male sex, and several chronic health conditions were associated with increased odds of heat stroke, hospital admission, and death in the ed by a factor of - . chronic hematologic disease (e.g. anemia) was associated with a - fold increase in adjusted odds of each of these outcomes. conclusion: hri imposes a substantial public health burden, and a wider range of chronic conditions confer susceptibility than previously thought. males, older adults, and patients with chronic conditions, particularly anemia, are likely to have more severe hri, be admitted, or die in the ed. background: carbon monoxide (co) poisoning is a significant cause of death worldwide. co, produced by the incomplete combustion of hydrocarbons, has many toxic effects, especially on the heart and brain. co binds strongly to cytochrome oxidase, hemoglobin, and myoglobin, causing hypoxia of organs and tissues. co converts hemoglobin to carboxyhemoglobin, preventing the transport of oxygen through the body and causing severe hypoxia. objectives: the aim of this study is to investigate the levels of s b and neuron-specific enolase (nse) measured both during admittance and at the sixth hour of hyperbaric and normobaric oxygen therapy carried out on patients with a diagnosis of co poisoning. methods: the study is designed as a prospective observational laboratory study. forty patients were enrolled in the study: underwent normobaric oxygen therapy (nbot) and the others underwent hyperbaric oxygen therapy (hbot). levels of s b and nse were measured both during admittance and at the sixth hour after admittance for all patients. demographic data, clinical characteristics, and outcome measures were recorded.
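a minimal sketch of the planned biomarker comparisons (admittance versus sixth-hour levels within each arm, and hbot versus nbot between arms) follows, using simulated values in place of the measured s b and nse concentrations.

```python
# illustrative sketch: paired within-group tests and an unpaired between-group test
# on simulated marker levels (not the study's laboratory data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
nbot_pre, nbot_post = rng.normal(0.25, 0.05, 20), rng.normal(0.18, 0.05, 20)
hbot_pre, hbot_post = rng.normal(0.27, 0.05, 20), rng.normal(0.17, 0.05, 20)

# within-group change from admittance to the sixth hour (paired)
print(stats.ttest_rel(nbot_pre, nbot_post))
print(stats.ttest_rel(hbot_pre, hbot_post))

# between-group comparison of post-therapy levels (unpaired)
print(stats.ttest_ind(nbot_post, hbot_post))
```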
all data were statistically analyzed. results: in both treatment groups, mean levels of nse after therapy were significantly lower than admittance levels. although levels of nse measured before and hours after treatment in the hbot group were high, the difference between groups was not statistically significant (p > . ). in both treatment groups, mean levels of s b after therapy were significantly lower than admittance levels, as was the case for nse. although levels of s b measured before and hours after treatment in the hbot group were high, the difference between groups was not statistically significant (p > . ). additionally, while levels of s b measured after treatment in the hbot group were lower compared to the nbot group, the difference between groups was also not statistically significant (p > . ). conclusion: levels of s b and nse, as markers of brain injury, are elevated in cases of co poisoning and decrease with therapy, according to our study as well as previous studies. the decrease in levels of s b is more significant. according to our results, s b and nse may be useful markers in cases of co poisoning; however, we did not find any data showing additional value in determining hbot indications or cohb levels in the management of patients with a diagnosis of co poisoning. objectives: this study was conducted to determine if neurons in the dmh, and its neighbor the paraventricular hypothalamus (pvn), were likewise involved in mdma-mediated neuroendocrine responses, and if serotonin a receptors ( -ht a) play a role in this regional response. methods: in both experiments, male sprague-dawley rats (n = - /group) were implanted with bilateral cannulas targeting specific regions of the brain, i.v. catheters for drug delivery, and i.a. catheters for blood withdrawal. experiments were conducted in raturn cages, which allow blood withdrawal and drug administration in freely moving animals while recording their locomotion. in the first experiment, rats were microinjected into the dmh, the pvn, or a region between, with the gaba-a agonist muscimol ( pmol/ nl/side) or pbs ( nl) and min later were injected with either mdma ( . mg/kg i.v.) or an equal volume of saline. blood was withdrawn prior to microinjections and minutes after mdma for ria measurement of plasma acth. locomotion was recorded throughout the experiment. in a separate experiment of identical design, either the -ht a antagonist way (way, nmol/ nl/side) or saline was microinjected followed by i.v. injection of mdma or saline. in both experiments, increases in acth and distance traveled were compared between groups using anova. results: when compared to controls, microinjections of muscimol into the dmh, pvn, or the area in between attenuated plasma increases in acth and locomotion evoked by mdma. when microinjected into the dmh or pvn, way had no effect on acth, but when injected into the region of the dmh it significantly increased locomotion. background: poor hand-offs between physicians when admitting patients have been shown to be a major source of medical errors. objectives: we propose that training in a standardized admissions protocol for emergency medicine (em) to internal medicine (im) resident handovers would improve the quality and quantity of communication of vital patient information. methods: em and im residents at a large academic center developed an evidence-based admission handover protocol termed the ' ps' (table ). em and im residents received ' ps' protocol training.
im residents recorded prospectively how well each of the seven ps were communicated during each admission pre- and post-intervention. im residents also assessed the overall quality of the handover using a likert scale. the primary outcome was the change in the number of 'ps' conveyed by the em resident to the accepting im resident. data were collected for six weeks before and then for six weeks starting two weeks after the educational intervention. results: there were observations recorded in the pre-intervention (control) group and observations in the post-intervention group. for each of the seven 'ps' the percentage of observations in which all of the information was communicated is shown in table . the communication of 'ps' increased following the intervention. this rise was statistically significant for patient information and pending tests. in the control group, the mean number of total communicated 'ps' was , and in the intervention group the mean increased to (p < . ). the quality of the handover communication had a mean rating of . in the control group and . in the intervention group (p < . ). conclusion: this educational intervention in a cohort of em and im residents improved the quality and quantity of vital information communicated during patient handovers. the intervention was statistically significant for patient information transfer and tests pending. the results are limited by study size. based on our preliminary data, an agreed-upon handover protocol with training improved the amount and quality of communication during patients' hospital admission on simple items that likely had been taken for granted as routinely transmitted. we recruited a convenience sample of residents and students rotating in the pediatric emergency department. a two-sided form had the same seven clinical decisions on each side: whether to perform blood, urine, or spinal fluid tests, imaging, iv fluids, antibiotics, or a consult. the rating choices were: definitely not, probably not, probably would, and definitely would. trainees rated each decision after seeing a patient, but before presenting to the preceptor, who, after evaluating the patient, rated the same seven decisions on the second side of the form. the preceptor also indicated the most relevant decision (mrd) for that patient. we examined the validity of the technique using hypothesis testing; we posited that residents would have a higher degree of concordance with the preceptor than would medical students. this was tested using dichotomized analyses (accuracy, kappa) and roc curves with the preceptor decision as the gold standard. results: thirty-one students completed forms (median forms; iqr , ) and residents completed ( ; iqr , ). preceptors included attending physicians and fellows ( ; iqr , ). students were concordant with preceptors in % (k = . ) of mrds while residents agreed in . % (p = . ), k = . . roc analysis revealed significant differences between students and residents in the auc for the mrd ( . vs . ; p = . ). conclusion: this measure of trainee-preceptor concordance requires further research but may eventually allow for assessment of trainee clinical decision-making. it also has the pedagogical advantage of promoting independent trainee decision-making. background: basic life support (bls) and advanced cardiac life support (acls) are integral parts of emergency cardiac care. in most institutions, this training is usually reserved for residents and faculty.
the argument can be made to introduce bls and acls training earlier in the medical student curriculum to enhance acquisition of these skills. objectives: the goal of the survey was to characterize the perceptions and needs of graduating medical students with regard to bls and acls training. methods: this was a survey-based study of graduating fourth-year medical students at a u.s. medical school. the students were surveyed before voluntarily participating in a student-led acls course in march of their final year. the surveys were distributed before starting the training course. both bls and acls training, comfort levels, and perceptions were assessed in the survey. results: of the students in the graduating class, participated in the training class with ( %) completing the survey. % of students entered medical school without any prior training and % started clinics without training. . % of students reported witnessing an average of . in-hospital cardiac arrests during training (range of - ). overall, students rated their preparedness . (sd . ) for adult resuscitations on a - likert scale with being unprepared. % and % of students believe that bls and acls should be included in the medical student curriculum respectively, with a preference for teaching before starting clerkships. % of students avoided participating in resuscitations due to lack of training. of those, % said they would have participated had they been trained. conclusion: to our knowledge, this is one of the first studies to address the perceptions and needs for bls and acls training in u.s. medical schools. students feel that bls and acls training is needed in their curriculum and would possibly enhance perceived comfort levels and willingness to participate in resuscitations. background: professionalism is one of six core competency requirements of the acgme, yet defining and teaching its principles remains a challenge. the ''social contract'' between physician and community is clearly central to professionalism, so determining the patient's understanding of the physician's role in the relationship is important. because specialization has created more narrowly focused and often quite different interactions in different medical environments, the patients' concept of professionalism in different settings may vary as well. objectives: we hoped to determine if patients have different conceptions of professionalism when considering physicians in different clinical environments. methods: patients were surveyed in the waiting room of an emergency department, an outpatient internal medicine clinic, and a pre-operative/anesthesia clinic. the survey contained examples of attributes, derived from the american board of internal medicine's eight characteristics of professionalism. participants were asked to rate, on a -point scale, the importance that a physician possess each attribute. anova was used to compare the sites for each question. results: of who took the survey, were in the emergency department, were in the medicine clinic, and were in the pre-operative clinic. females comprised % of the study group and the average age was with a range from to . there was a significant difference on the attribute of ''providing a portion of work for those who cannot pay;'' this was rated higher in the emergency department (p = . ). there was near-significance (p = . ) on the attribute of ''being able to make difficult decisions under pressure,'' which was rated higher in the pre-op clinic.
there was no difference for any of the other questions. the top four professional attributes at each clinical site were the same: ''honesty,'' ''excellence in communication and listening,'' ''taking full responsibility for mistakes,'' and ''technical competence/skill;'' the bottom two were ''being an active leader in the community'' and ''patient concerns should come before a doctor's family commitments.'' conclusion: very few differences between clinical sites were found when surveying patient perception of the important elements of medical professionalism. this may suggest a core set of values desired by patients for physicians across specialties. emergency medicine faculty knowledge of and confidence in giving feedback on the acgme core competencies. todd guth, jeff druck, jason hoppe, britney anderson; university of colorado, aurora, co. background: the acgme mandates that residency programs assess residents based upon six core competencies. although the core competencies have been in place for a number of years, many faculty are not familiar with the intricacies of the competencies and have difficulty giving competency-specific feedback to residents. objectives: the purpose of the study is to determine the extent to which emergency medicine (em) faculty can identify the acgme core competencies correctly and to determine faculty confidence with giving general feedback and core competency-focused feedback to em residents. methods: design and participants: at a single department of em, a survey of twenty-eight faculty members was conducted assessing their knowledge of the acgme core competencies and their confidence in providing feedback to residents. confidence levels in giving feedback were scored on a likert scale from to . observations: descriptive statistics of faculty confidence in giving feedback, identification of professional areas of interest, and identification of the acgme core competencies were determined. mann-whitney u tests were used to make comparisons between groups of faculty given the small sample size of the respondents. results: there was a % response rate among the faculty members surveyed. eight faculty members identified themselves as primarily focused on education. although those faculty members identifying themselves as focused on education scored higher than non-education-focused faculty for all types of feedback (general feedback, constructive feedback, negative feedback), there was only a statistically significant difference in confidence levels ( . versus . , p < . ) for acgme core competency-specific feedback when compared to non-education-focused faculty. while education-focused faculty correctly identified all six acgme core competencies % of the time, not one of the non-education-focused faculty identified all six of the core competencies correctly. non-education-focused faculty correctly identified three or more competencies only % of the time. conclusion: if residency programs are to assess residents using the six acgme core competencies, additional faculty development specific to the core competencies will be needed to train all faculty on the core competencies and on how to give core competency-specific feedback to em residents. there is no clear consensus as to the most effective tool to measure resident competency in emergency ultrasound. objectives: to determine the relationship between the number of scans and scores on image recognition, image acquisition, and cognitive skills as measured by an objective structured clinical exam (osce) and written exam.
secondarily, to determine whether image acquisition, image recognition, and cognitive knowledge require separate evaluation methodologies. methods: this was a prospective observational study in an urban level i ed with a -year acgme-accredited residency program. all residents underwent an ultrasound introductory course and a one-month ultrasound rotation during their first and second years. each resident received a written exam and osce to assess psychomotor and cognitive skills. the osce had two components: ( ) recognition of images, and ( ) acquisition of images. a registered diagnostic medical sonographer (rdms)-certified physician observed each bedside examination. a pre-existing residency ultrasound database was used to collect data about the number of scans. pearson correlation coefficients were calculated for number of scans, written exam score, image recognition, and image acquisition scores on the osce. results: twenty-nine residents were enrolled from march to february who performed an average of scans (range - ). there was no significant correlation between number of scans and written exam scores. an analysis of the number of scans and the osce found a moderate correlation with image acquisition (r = . , p = . ) and image recognition (r = . , p < . ). pearson correlation analysis between the image acquisition score and image recognition score found that there was no correlation (r = . , p = . ). there was a moderate correlation between image acquisition scores and written scores (r = . , p = . ) and between image recognition scores and written scores (r = . , p = . ). conclusion: the number of scans does not correlate with written tests but has a moderate correlation with image acquisition and image recognition. this suggests that resident education should include cognitive instruction in addition to scan numbers. we conclude that multiple methods are necessary to examine resident ultrasound competency. background: although emergency physicians must often make rapid decisions that incorporate their interpretation of an ecg, there is no evidence-based description of ecg interpretation competencies for emergency medicine (em) trainees. the first step in defining these competencies is to develop a prioritized list of ecg findings relevant to em contexts. objectives: the purpose of this study was to categorize the importance of various ecg diagnoses and/or findings for the em trainee. methods: we developed an extensive list of potentially important ecg diagnoses identified through a detailed review of the cardiology and em literature. we then conducted a three-round delphi expert opinion-soliciting process where participants used a five-point likert scale to rate the importance of each diagnosis for em trainees. consensus was defined as a minimum of percent agreement on any particular diagnosis at the second round or later. in the absence of consensus, stability was defined as a shift of percent or less after successive rounds. results: twenty-two em experts participated in the delphi process, sixteen ( %) of whom completed the process. of those, fifteen were experts from eleven different em training programs across canada and one was a recognized expert in em electrocardiography. overall, diagnoses reached consensus, achieved stability, and one diagnosis achieved neither consensus nor stability. out of potentially important ecg diagnoses, ( %) were considered ''must know'' diagnoses, ( %) ''should know'' diagnoses, and ( %) ''nice to know'' diagnoses.
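the consensus and stability rules of the delphi process above can be written as two small checks; the agreement and shift thresholds below are placeholders for the abstract's elided values.

```python
# illustrative sketch: per-diagnosis consensus and round-to-round stability checks
# for five-point Likert ratings (thresholds are assumptions, not the study's values).
import numpy as np

def consensus(ratings, threshold=0.80):
    # consensus if any single rating is chosen by >= threshold of panellists
    _, counts = np.unique(ratings, return_counts=True)
    return counts.max() / len(ratings) >= threshold

def stable(prev_ratings, curr_ratings, max_shift=0.10):
    # stable if the mean rating shifts by no more than max_shift of the 5-point scale
    return abs(np.mean(curr_ratings) - np.mean(prev_ratings)) / 5.0 <= max_shift

round2 = [5, 5, 5, 4, 5, 5, 5, 5, 5, 5]   # hypothetical panel ratings for one diagnosis
round3 = [5, 5, 5, 5, 5, 5, 5, 4, 5, 5]
print(consensus(round3), stable(round2, round3))
```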
conclusion: we have categorized ecg diagnoses within an em training context, knowledge of which may allow clinical em teachers to establish educational priorities. this categorization will also facilitate the development of an educational framework to establish em trainee competency in ecg interpretation. ''rolling refreshers.'' background: cardiac arrest survival rates are low despite advances in cardiopulmonary resuscitation. high-quality cpr has been shown to impart greater cardiac arrest survival; however, retention of basic cpr skills by health care providers has been shown to be poor. objectives: to evaluate practitioner acceptance of an in-service cpr skills refresher program, and to assess operator response to real-time feedback during refreshers. methods: we prospectively evaluated a ''rolling refresher'' in-service program at an academic medical center. this program is a proctored cpr practice session using a mannequin and cpr-sensing defibrillator that provides real-time cpr quality feedback. subjects were basic life support-trained providers who were engaged in clinical care at the time of enrollment. subjects were asked to perform two minutes of chest compressions (ccs) using the feedback system. ccs could be terminated when the subject had completed approximately seconds of compressions with < corrective prompts. a survey was then completed by each subject to obtain feedback regarding the perceived efficacy of this training model. cpr quality was then evaluated using custom analysis software to determine the percent of cc adequacy in -second intervals. results: enrollment included subjects from the emergency department and critical care units ( nurses, physicians, students, and allied health professionals). all participants completed a survey and cpr performance data logs were obtained. positive impressions of the in-service program were registered by % ( / ) and % ( / ) reported a self-perceived improvement in skills confidence. eighty-three percent ( / ) of respondents felt comfortable performing this refresher during a clinical shift. thirty-nine percent ( / ) of episodes exhibited adequate cc performance with approximately seconds of cc. of the remaining episodes, . ± . % of cc were adequate in the first seconds with . ± . % of cc adequate during the last -second interval (p = . ). of these individuals, improved or had no change in their cpr skills, and individuals' skills declined during cc performance (p = . ). conclusion: implementation of a bedside cpr skill refresher program is feasible and is well received by hospital staff. real-time cpr feedback improved cpr skill performance during the in-service session. teaching emergency medicine skills: is a self-directed, independent, online curriculum the way of the future? tighe crombie, jason r. frank, stephen noseworthy, richard gerein, a. curtis lee; university of ottawa, ottawa, on, canada. background: procedural competence is critical to emergency medicine, but the ideal instructional method to acquire these skills is not clear. previous studies have demonstrated that online tutorials have the potential to be as effective as didactic sessions at teaching specific procedural skills. objectives: we studied whether a novel online curriculum teaching pediatric intraosseous (io) line insertion to novice learners is as effective as a traditional classroom curriculum in imparting procedural competence. methods: we conducted a randomized controlled educational trial of two methods of teaching io skills.
preclinical medical students with no past io experience completed a written test and were randomized to either an online or classroom curriculum. the online group (og) were given password-protected access to a website and instructed to spend minutes with the material while the didactic group (dg) attended a lecture of similar duration. participants then attended a -minute unsupervised manikin practice session on a separate day without any further instruction. a videotaped objective structured clinical examination (osce) and post-course written test were completed immediately following this practice session. finally, participants were crossed over into the alternate curriculum and were asked to complete a satisfaction survey that compared the two curricula. results were compared with a paired t-test for written scores and an independent t-test for osce scores. results: sixteen students completed the study. pre-course test scores of the two groups were not significantly different prior to accessing their respective curricula (mean scores of % for og and % for dg, respectively; p > . ). post-course written scores were also not significantly different (both with means of %; p > . ); however, for the post-treatment osce scores, the og group scored significantly higher than the dg group (mean scores of . % and . %; t( ) = . , p < . ). conclusion: this novel online curriculum was superior to a traditional didactic approach to teaching pediatric io line insertion. novice learners assigned to a self-directed online curriculum were able to perform an emergency procedural skill to a high level of performance. em educators should consider adopting online teaching of procedural skills. background: applicants to em residency programs obtain information largely from the internet. curricular information is available from a program's website (pw) or the saem residency directory (sd). we hypothesize that there is variation between these key sources. objectives: to identify discrepancies between each pw and sd. to describe components of pgy - em residency programs' curricula as advertised on the internet. methods: pgy - residencies were identified through the sd. data were abstracted from individual sd and pw pages identifying pre-determined elements of interest regarding rotations in icu, pediatrics, inpatient (medicine, pediatrics, general surgery), electives, orthopedics, toxicology, and anesthesia. agreement between the sd and pw was calculated using cohen's unweighted kappa. curricula posted on pws were considered the gold standard for the programs' current curricula. results: a total of pgy - programs were identified through the sd and confirmed on the pw. ninety-one of programs ( %) had complete curricular information on both sites. only these programs were included in the kappa analysis for sd and pw comparisons. of programs with complete listings, of programs ( %) had at least one discrepancy. the agreement of information between pw and sd revealed a kappa value of . ( % ci . - . ). analysis of pw revealed that pgy - programs have an average of . (range, - ), . (range, - ), . (range, - ), and . (range, - ) blocks of icu, pediatrics, elective, and inpatient, respectively. common but not rrc-mandated rotations in orthopedics, toxicology, and anesthesiology are present in , , and percent of programs, respectively. conclusion: publicly accessible curricular information through the sd and pw for pgy - em programs only has fair agreement (using commonly accepted kappa value guides).
applicants may be confused by the variability of data and draw inaccurate conclusions about program curricula. background: left lateral tilt (llt) positioning of third trimester patients is thought to relieve compression of the inferior vena cava from the gravid uterus and improve cardiac output; however, this theory has never been proven. objectives: we set out to determine the difference in inferior vena cava (ivc) filling when third trimester patients were placed in supine, llt, and right lateral tilt (rlt) positions using ivc ultrasound. methods: healthy pregnant women in their third trimester presenting to the labor and delivery suite were enrolled. patients were placed in three different positions (supine, rlt, and llt) and ivc maximum (max) and minimum (min) measurements were obtained using the intercostal window in short axis approximately two centimeters below the entry of the hepatic veins. ivc collapse index (ci) was calculated for each measurement using the formula (max-min)/max. in addition, blood pressure, heart rate, and fetal heart rate were monitored. patients stayed in each position for at least minutes prior to taking measurements. we compared ivc measurements using a one-way analysis of variance for repeated measures. results: twenty patients were enrolled. the average age was years (sd . ) with a mean estimated gestational age of . weeks (sd . ). there were no significant differences seen in ivc filling between the positions (see table ). in addition, there were no differences in hemodynamic parameters between positions. ten ( %) patients had the largest ivc measurement in the llt position, ( %) patients in the rlt position, and ( %) in the supine position. conclusion: there were no significant differences in ivc filling between patient positions. for some third trimester patients, llt may not be the optimal position for ivc filling. background: although the acgme and rrc require competency assessment in ed bedside ultrasound (us), there are no standardized assessment tools for us training in em. objectives: using published us guidelines, we developed four observed structured competency evaluations (osces) for four common em us exams: fast, aortic, cardiac, and pelvic. inter-rater reliability was calculated for overall performance and for the individual components of each osce. methods: this prospective observational study derived four osces that evaluated overall study competency, image quality for each required view, technical factors (probe placement, orientation, angle, gain, and depth), and identification of key anatomic structures. em residents with varying levels of training completed an osce under the direct observation of two em-trained us experts. each expert was blinded to the other's assessment. overall study competency and the image quality of each required view were rated on a five-point scale ( poor, -fair, -adequate, -good, -excellent), with explicit definitions for each rating. each study had technical factors (correct/incorrect) and anatomic structures (identified/not identified) assessed as binary variables. data were analyzed using cohen's kappa and weighted kappa, descriptive statistics, and % cis. results: a total of us exams were observed, including fast, cardiac, aorta, and pelvic. total assessments included ratings of overall study competency, ratings of required view image quality, ratings of technical factors, and ratings of anatomic structures. inter-rater assessment of overall study competency showed excellent agreement, raw agreement . ( . , . ), weighted k . ( . , . ). ratings of required view image quality showed excellent agreement: raw agreement . ( . , . ), weighted k . ( . , . ).
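a minimal sketch of the inter-rater statistics used above (raw agreement and weighted kappa for two experts rating the same scans on the five-point scale) is shown below with placeholder ratings.

```python
# illustrative sketch: raw agreement and linearly weighted Cohen's kappa
# for two raters' five-point scores (placeholder data).
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array([5, 4, 4, 3, 5, 2, 4, 3, 5, 4, 1, 3])
rater_b = np.array([5, 4, 3, 3, 5, 2, 4, 4, 5, 4, 2, 3])

raw_agreement = float(np.mean(rater_a == rater_b))
weighted_kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(raw_agreement, weighted_kappa)
```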
inter-rater assessment of technical factors showed substantial agreement: raw agreement . ( . , . ), cohen's k . ( . , . ). ratings of identification of anatomic structures showed substantial agreement: raw agreement . ( . , . ), cohen's k . ( . , . ). conclusion: inter-rater reliability is substantial to excellent using the derived ultrasound osces to rate em resident competency in fast, aortic, cardiac, and pelvic ultrasound. validation of this tool is ongoing. a objectives: the objective of this study was to identify which transducer orientation, longitudinal or transverse, is the best method of imaging the axillary vein with ultrasound, as defined by successful placement in the vein with one needle stick, no redirections, and no complications. methods: emergency medicine resident and attending physicians at an academic medical center were asked to cannulate the axillary vein in a torso phantom model. the participants were randomized to start with either the longitudinal or transverse approach and completed both sequentially, after viewing a teaching presentation. participants completed pre-and post-attempt questionnaires. measurements of each attempt were taken regarding time to completion, success, skin punctures, needle redirections, and complications. we compared proportions using a normal binomial approximation and continuous data using the t-distribution, as appropriate. a sample size of was chosen based on the following assumptions: power, . ; significance, . ; effect size, % versus %. results: fifty-seven operators with a median experience of prior ultrasounds ( to iqr) participated. first-attempt success frequency was / ( . ) for the longitudinal method and / ( . ) for the transverse method (difference . , % ci . - . ); this difference was similar regardless of operator experience. the longitudinal method had fewer redirections (mean difference . , % ci . - . ) and skin punctures (mean difference . , % ci ) to . ). arterial puncture occurred in / longitudinal attempts and / transverse attempts, with no pleural punctures in either group. among successful attempts, the time spent was seconds less for longitudinal method ( % ci - ). though % of participants had more experience with the transverse method prior to the training session, % indicated after the session that they preferred the longitudinal method. methods: a prospective single-center study was conducted to assess the compressibility of the basilic vein with ultrasound. healthy study participants were recruited. the compressibility was assessed at baseline, and then further assessed with one proximal tourniquet, two tourniquets (one distal and one proximal), and a proximal blood pressure cuff inflated to mmhg. compressibility was defined as the vessel's resistance to collapse to external pressure and rated as completely compressible, moderately compressible, or mildly compressible after mild pressure was applied with the ultrasound probe. results: one-hundred patients were recruited into the study. ninety-eight subjects were found to have a completely compressible basilic vein at baseline. when one tourniquet and two tourniquets were applied and participants, respectively, continued to have completely compressible veins. a fisher's exact test comparing one versus two tourniquets revealed no difference between these two techniques (p = . ). only two participants continued to have completely compressible veins following application of the blood pressure cuff. 
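The basilic vein study above compares counts of completely compressible veins between techniques with Fisher's exact test. A minimal sketch on a 2x2 table with SciPy; the counts are invented for illustration.

```python
from scipy.stats import fisher_exact

# Rows: one tourniquet vs. two tourniquets; columns: completely compressible vs. not.
# Counts are hypothetical placeholders, not the study's data.
table = [[70, 30],
         [64, 36]]

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")
```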
the compressibility of this group was found to be statistically significant by fisher's exact test compared to both tourniquet groups (p < . ). furthermore, participants with the blood pressure cuff applied were found to have moderately compressible veins and participants were found to have mildly compressible veins. conclusion: tourniquets and blood pressure cuffs can both decrease the compressibility of peripheral veins. while there was no difference identified between using one and two tourniquets, utilization of a blood pressure cuff was significantly more effective to decrease compressibility. the findings of this study may be utilized in the emergency department when attempting to obtain peripheral venous access, specifically supporting the use of blood pressure cuffs to decrease compressibility. background: electroencephalography (eeg) is an underused test that can provide valuable information in the evaluation of emergency department (ed) patients with altered mental status (ams). in ams patients with nonconvulsive seizure (ncs), eeg is necessary to make the diagnosis and to initiate proper treatment. yet, most cases of ncs are diagnosed > h after ed presentation. obstacles to routine use of eeg in the ed include space limitations, absence of / availability of eeg technologists and interpreters, and the electrically hostile ed environment. a novel miniature portable wireless device (microeeg) is designed to overcome these obstacles. objectives: to examine the diagnostic utility of micro-eeg in identifying eeg abnormalities in ed patients with ams. methods: an ongoing prospective study conducted at two academic urban eds. inclusion: patients ‡ years old with ams. exclusion: an easily correctable cause of ams (e.g. hypoglycemia, opioid overdose). three -minute eegs were obtained in random order from each subject beginning within one hour of presentation: ) a standard eeg, ) a microeeg obtained simultaneously with conventional cup electrodes using a signal splitter, and ) a microeeg using an electrocap. outcome: operative characteristics of micro-eeg in identifying any eeg abnormality. all eegs were interpreted in a blinded fashion by two board-certified epileptologists. within each reader-patient pairing, the accuracy of eegs and were each assessed relative to eeg . sensitivity, specificity, and likelihood ratios (lr) are reported for microeeg by standard electrodes and electrocap (eegs and ). inter-rater variability for eeg interpretations is reported with kappa. results: the interim analysis was performed on consecutive patients (target sample size: ) enrolled from may to october (median age: , range: - , % male). overall, % ( % confidence interval [ci], - %) of interpretations were abnormal (based on eeg ). kappa values representing the agreement of neurologists in interpretation of eeg - were . ( . - . ), . ( . - . ), and . ( . - . ), respectively. conclusion: the diagnostic accuracy and concordance of microeeg are comparable to those of standard eeg but the unique ed-friendly characteristics of the device could help overcome the existing barriers for more frequent use of eeg in the ed. (originally submitted as a ''late-breaker.'') a background: patients who use an ed for acute migraine are characterized by higher migraine disability scores, lower socio-economic status, and are unlikely to have used a migraine-specific medication prior to ed presentation. 
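The microEEG abstract above reports sensitivity, specificity, and likelihood ratios of each device configuration against the standard EEG; the same 2x2 arithmetic also underlies negative predictive values reported elsewhere in these abstracts. A minimal sketch with placeholder counts.

```python
# Hypothetical 2x2 counts against the reference standard (abnormal vs. normal standard EEG).
tp, fp, fn, tn = 40, 5, 8, 30

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
lr_positive = sensitivity / (1 - specificity)   # LR+
lr_negative = (1 - sensitivity) / specificity   # LR-
npv = tn / (tn + fn)                            # negative predictive value

print(f"sens={sensitivity:.2f}, spec={specificity:.2f}, "
      f"LR+={lr_positive:.2f}, LR-={lr_negative:.2f}, NPV={npv:.2f}")
```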
objectives: to determine if a comprehensive migraine intervention, delivered just prior to ed discharge, could improve migraine impact scores one month after the ed visit. methods: this was a randomized controlled trial of a comprehensive migraine intervention versus typical care among patients who presented to an ed for management of acute migraine. at the time of discharge, for patients randomized to comprehensive care, we reinforced their diagnosis, shared a migraine education presentation from the national library of medicine, provided them with six tablets of sumatriptan mg and tablets of naproxen mg, and if they wished, provided them with an expedited free appointment to our institution's headache clinic. patients randomized to typical care received the care their attending emergency physician felt was appropriate. the primary outcome was a between-group comparison of the hit score, a validated headache assessment instrument, one month after ed discharge. secondary outcomes included an assessment of satisfaction with headache care and frequency of use of migraine-specific medication within that one month period. the outcome assessor was blinded to assignment. results: over a month period, migraine patients were enrolled. one month follow-up was successfully obtained in % of patients. baseline characteristics were comparable. one month hit scores in the two groups were nearly identical ( vs , %ci for difference of : ) , ), as was dissatisfaction with overall headache care ( % versus %, %ci for difference of %: ) , %). not surprisingly, patients randomized to the comprehensive intervention were more likely to be using triptans or migraine-preventive therapy ( % versus %, %ci for difference of %: , %) one month later. conclusion: a comprehensive migraine intervention, when compared to typical care, did not improve hit scores one month after ed discharge. future work is needed to define a migraine intervention that is practical and useful in an ed. background: lumbar puncture (lp) is the standard of care for excluding non-traumatic subarachnoid hemorrhage (sah), and is usually performed following head ct (hct). however, in the setting of a non-diagnostic hct, lp demonstrates a low overall diagnostic yield for sah (< % positive rate). objectives: to describe a series of ed patients diagnosed with sah by lp following a non-diagnostic hct, and, when compared to a set of matched controls, determine if clinical variables can reliably identify these ''ct-negative/lp-positive'' patients. methods: retrospective case-control chart review of ed patients in an integrated health system between the years - (estimated - million visits among eds). patients with a final diagnosis of non-traumatic sah were screened for case inclusion, defined as an initial hct without sah by final radiologist interpretation and a lp with > red blood cells/mm , along with either ) xanthochromic cerebrospinal fluid, ) angiographic evidence of cerebral aneurysm or arteriovenous malformation, or ) head imaging showing sah within hours following lp. control patients were randomly selected among ed patients diagnosed with headache following a negative sah evaluation with hct and lp. controls were matched to cases by year and presenting ed in a : ratio. stepwise logistic regression and classification and regression tree analysis (cart) were employed to identify predictive variables. inter-rater reliability (kappa) was determined by independent chart review. results: fifty-five cases were identified. 
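The CT-negative/LP-positive methods above use stepwise logistic regression (alongside CART) to identify predictive variables. A hedged sketch of the logistic-regression step with statsmodels on a synthetic case-control frame; the variable names and data are illustrative assumptions, not the study's model.

```python
# Synthetic case-control data; predictors are illustrative, not the study's final variables.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 220
df = pd.DataFrame({
    "case": rng.integers(0, 2, n),           # 1 = CT-negative/LP-positive SAH case
    "older_age": rng.integers(0, 2, n),
    "neck_pain": rng.integers(0, 2, n),
    "exertional_onset": rng.integers(0, 2, n),
})

X = sm.add_constant(df[["older_age", "neck_pain", "exertional_onset"]])
model = sm.Logit(df["case"], X).fit(disp=False)
print(np.exp(model.params))      # coefficients exponentiated to odds ratios
print(np.exp(model.conf_int()))  # 95% CIs on the odds-ratio scale
```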
all cases were hunt-hess grade or . demographics are shown in table . thirty-four cases ( %) had angiographic evidence of sah. five variables were identified that positively predicted sah following a normal hct with % sensitivity ( % ci, - %) and % specificity ( % ci, - %): age > years, neck pain or stiffness, onset of headache with exertion, vomiting with headache, or loss of consciousness at headache onset. kappa values for selected variables ranged from . - . ( % sample). the c-statistic (auc) and hosmer-lemeshow test p-value for the logistic regression model are . and . , respectively (table ) . conclusion: several clinical variables can help safely limit the amount of invasive testing for sah following a non-diagnostic hct. prospective validation of this model is needed prior to practice implementation. background: post-thrombolysis intracerebral hemorrhage (ich) is associated with poor outcomes. previous investigations have attempted to determine the relationship between pre-existing anti-platelet (ap) use and the safety of intravenous thrombolysis, but have been limited by low event rates thus decreasing the precision of estimates. objectives: our objective was to determine whether pre-existing ap therapy increases the risk of ich following thrombolysis. methods: consecutive cases of ed-treated thrombolysis patients were identified using multiple methods, including active and passive surveillance. retrospective data were collected from four hospitals from - , and distinct hospitals from - as part of a cluster randomized trial. the same chart abstraction tool was used during both time periods and data were subjected to numerous quality control checks. hemorrhages were classified using a pre-specified methodology: ich was defined as presence of hemorrhage in radiographic interpretations of follow up imaging (primary outcome). symptomatic ich (secondary outcome) was defined as radiographic ich with associated clinical worsening. a multivariable logistic regression model was constructed to adjust for clinical factors previously identified to be related to postthrombolysis ich. as there were fewer sich events, the multivariable model was constructed similarly, except that variables divided into quartiles in the primary analysis were dichotomized at the median. results: there were patients included, with % having documented pre-existing ap treatment. the mean age was years, the cohort was % male, and the median nihss was . the unadjusted proportion of patients with any ich was . % without ap and . % with ap (difference . %, % ci ) . % to . %); for sich this was . % without ap and % with ap (difference . %, %ci ) to . %). no significant association between pre-existing ap treatment with radiographic or symptomatic ich was observed (table) . conclusion: we did not find that ap treatment was associated with post-thrombolysis ich or sich in this cohort of community treated patients. pre-existing tobacco use, younger age, and lower severity were associated with lower odds of sich. an association between ap therapy and sich may still exist -further research with larger sample sizes is warranted in order to detect smaller effect sizes. background: post-cardiac arrest therapeutic hypothermia (th) improves survival and neurologic outcome after cardiac arrest, but the parameters required for optimal neuroprotection remain uncertain. our laboratory recently reported that -hour th was superior to -hour th in protecting hippocampal ca pyramidal neurons after asphyxial cardiac arrest in rats. 
cerebellar purkinje cells are also highly sensitive to ischemic injury caused by cardiac arrest, but the effect of th on this neuron population has not been previously studied. objectives: we examined the effect of post-cardiac arrest th onset time and duration on purkinje neuron survival in cerebella collected during our previous study. methods: adult male long evans rats were subjected to -minute asphyxial cardiac arrest followed by cpr. rats that achieved return of spontaneous circulation (rosc) were block randomized to normothermia ( . deg c) or th ( . deg c) initiated , , , or hours after rosc and maintained for or hours (n = per group). sham injured rats underwent anesthesia and instrumentation only. seven days post-cardiac arrest or sham injury, rats were euthanized and brain tissue was processed for histology. surviving purkinje cells with normal morphology were quantified in the primary fissure in nissl stained sagittal sections of the cerebellar vermis. purkinje cell density was calculated for each rat, and group means were compared by anova with bonferroni analysis. results: purkinje cell density averaged (+/) sd) . ( . ) cells/mm in sham-injured rats. neuronal survival in normothermic post-cardiac arrest rats was significantly reduced compared to sham ( . % ( . %)). overall, th resulted in significant neuroprotection compared to normothermia ( . % ( . %) of sham). purkinje cell density with -hour duration th was . % ( . %) of sham and -hour duration th was . % ( . %), both significantly improved from sham (p = . between durations). th initiated , , , and hours post-rosc provided similar benefit: . % ( . %), . % ( . %), . % ( . %), and . % ( . %) of sham, respectively. conclusion: overall, these results indicate that postcardiac arrest th protects cerebellar purkinje cells with a broad therapeutic window. our results underscore the importance of considering multiple brain regions when optimizing the neuroprotective effect of post-cardiac arrest th. the effect of compressor-administered defibrillation on peri-shock pauses in a simulated cardiac arrest scenario joshua glick, evan leibner, thomas terndrup penn state hershey medical center, hershey, pa background: longer pauses in chest compressions during cardiac arrest are associated with a decreased probability of successful defibrillation and patient survival. having multiple personnel share the tasks of performing chest compressions and shock delivery can lead to communication complications that may prolong time spent off the chest. objectives: the purpose of this study was to determine whether compressor-administered defibrillation led to a decrease in pre-shock and peri-shock pauses as compared to bystander-administered defibrillation in a simulated in-hospital cardiac arrest scenario. we hypothesized that combining the responsibilities of shock delivery and chest-compression performance may lower no-flow periods. methods: this was a randomized, controlled study measuring pauses in chest compressions for defibrillation in a simulated cardiac arrest. medical students and ed personnel with current cpr certification were surveyed for participation between july and october . participants were randomized to either a control (facilitator-administered shock) or variable (participantadministered shock) group. all participants completed one minute of chest compressions on a mannequin in a shockable rhythm prior to initiation of prompt and safe defibrillation. pauses for defibrillation were measured and compared in both study groups. 
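The Purkinje cell abstract above compares group mean densities by ANOVA with Bonferroni analysis. A minimal sketch with SciPy: a one-way ANOVA across groups followed by Bonferroni-adjusted pairwise t-tests; the densities are invented placeholders.

```python
# Hypothetical Purkinje cell densities (cells/mm) per group; values are placeholders.
from itertools import combinations
from scipy import stats

groups = {
    "sham":         [42, 45, 40, 44, 43, 41],
    "normothermia": [18, 20, 17, 22, 19, 21],
    "hypothermia":  [33, 35, 30, 34, 32, 36],
}

f_stat, p_overall = stats.f_oneway(*groups.values())
print(f"one-way ANOVA: F={f_stat:.2f}, p={p_overall:.4f}")

# Bonferroni correction: multiply each pairwise p-value by the number of comparisons.
pairs = list(combinations(groups, 2))
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])
    print(f"{a} vs {b}: adjusted p = {min(p * len(pairs), 1.0):.4f}")
```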
results: out of total enrollments, the data from defibrillations were analyzed. subject-initiated defibrillation resulted in a significantly lower pre-shock handsoff time ( . s; % ci: . - . ) compared to facilitator-initiated defibrillation ( . s; % ci: . - . ). furthermore, subject-initiated defibrillation resulted in a significantly lower peri-shock hands-off time ( . s; % ci: . - . ) compared to facilitator-initiated defibrillation ( . s; % ci: . - . ). conclusion: assigning the responsibility for shock delivery to the provider performing compressions encourages continuous compressions throughout the charging period and decreases total time spent off the chest. this modification may also decrease the risk of accidental shock and improve patient survival. however, as this was a simulation-based study, clinical implementation is necessary to further evaluate these potential benefits. objectives: to determine the sensitivity and specificity of peripheral venous oxygen (po ) to predict abnormal central venous oxygen saturation in septic shock patients in the ed. methods: secondary analysis of an ed-based randomized controlled trial of early sepsis resuscitation targeting three physiological variables: cvp, map, and either scvo or lactate clearance. inclusion criteria: suspected infection, two or more sirs criteria, and either systolic blood pressure < mmhg after a fluid bolus or lactate > mm. peripheral venous po was measured prior to enrollment as part of routine care, and scvo was measured as part of the protocol. we analyzed for agreement between venous po and scvo using spearman's rank. sensitivity and specificity to predict an abnormal scvo (< %) were calculated for each incremental value of po . results: a total of were analyzed. median po was mmhg (iqr , ). median initial scvo was % (iqr , ). thirty-nine patients ( %) had an initial scvo < %. spearman's rank demonstrated fair correlation between initial po and scvo (q = . ). a cutoff of venous po < was % sensitive and % specific for detecting an initial scvo < %. twenty-seven patients ( %) demonstrated an initial po of > . conclusion: in ed septic shock patients, venous po demonstrated only fair correlation with scvo , though a cutoff value of was sensitive for predicting an abnormal scvo . twenty percent of patients demonstrated an initial value above the cutoff, potentially representing a group in whom scvo measurement could be avoided. future studies aiming to decrease central line utilization could consider the use of peripheral o measurements in these patients. sessions. ninety-two percent were rns, median clinical experience was - years, and % were from an intensive care unit. provider confidence increased significantly with a single session despite the highly experienced sample (figure ). there was a trend for further increased confidence with an additional session and the increased confidence was maintained for at least - months given the normal sensitivity analysis. conclusion: high fidelity simulation significantly increases provider confidence even among experienced providers. this study was limited by its small sample size and recent changes in acls guidelines. background: recent data suggest alarming delays and deviations in major components of pediatric resuscitation during simulated scenarios by pediatric housestaff. objectives: to identify the most common errors of pediatric residents during multiple simulated pediatric resuscitation scenarios. 
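The septic-shock abstract above assesses agreement between peripheral venous PO2 and ScvO2 with Spearman's rank correlation and then screens a PO2 cutoff for predicting an abnormal ScvO2. A sketch of both steps; the data are synthetic and the thresholds are assumed placeholders, since the actual cutoff values are elided in the text.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
scvo2 = rng.uniform(50, 90, 100)                 # central venous oxygen saturation (%)
po2 = 20 + 0.5 * scvo2 + rng.normal(0, 5, 100)   # peripheral venous PO2 (mmHg)

rho, p = spearmanr(po2, scvo2)
print(f"spearman rho = {rho:.2f}, p = {p:.3f}")

ABNORMAL_SCVO2 = 70   # assumed threshold for an abnormal ScvO2
PO2_CUTOFF = 40       # assumed screening cutoff for PO2
abnormal = scvo2 < ABNORMAL_SCVO2
flagged = po2 < PO2_CUTOFF
sensitivity = (flagged & abnormal).sum() / abnormal.sum()
specificity = (~flagged & ~abnormal).sum() / (~abnormal).sum()
print(f"sens={sensitivity:.2f}, spec={specificity:.2f} at PO2 < {PO2_CUTOFF}")
```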
methods: a retrospective observational study conducted in an academic tertiary care hospital. pediatric residents (pgy and pgy ) were videotaped performing a series of five pediatric resuscitation scenarios using a high-fidelity simulator (simbaby, laerdal): pulseless non-shockable arrest, pulseless shockable arrest, dysrhythmia, respiratory arrest, and shock. the primary outcome was the presence of significant errors prospectively defined using a validated scoring instrument designed to assess sequence, timing, and quality of specific actions during resuscitations based on the aha pals guidelines. residents' clinical performances were measured by a single video reviewer. the primary analysis was the proportion of errors for each critical task for each scenario. we estimated that the evaluation of each resident would provide a confidence interval less than . for the proportion of errors. results: twenty-four of residents completed the study. across all scenarios, pulse check was delayed by more than seconds in % ( %ci: %- %). for non-shockable arrest, cpr was started more than seconds after recognizing arrest in % ( %ci - %) and inappropriate defibrillation was performed in % ( %ci - %). for shockable arrest, participants failed to identify the rhythm in % ( %ci - %), cpr was not performed in % ( %ci - %), while defibrillation was delayed by more than seconds in % ( %ci - %) and not performed in one case. for shock, participants failed to ask for a dextrose check in % ( %ci - %), and it was delayed by more than seconds for all others. conclusion: the most common error across all scenarios was delay in pulse check. delays in starting cpr and inappropriate defibrillation were common errors in non-shockable arrests, while failure to identify rhythm, cpr omission, and delaying defibrillation were noted for shockable arrests. for shock, omission of rapid dextrose check was the most common error, while delaying the test when ordered was also significant. future training in pediatric resuscitation should target these errors. background: many scoring instruments have been described to measure clinical performance during resuscitation; however, the validity of these tools has yet to be proven in pediatric resuscitation. objectives: to determine the external validity of published scoring instruments to evaluate clinical performance during simulated pediatric resuscitations using pals algorithms and to determine if inter-rater reliability could be assessed. methods: this was a prospective quasi-experimental design performed in a simulation lab of a pediatric tertiary care facility. participants were residents from a single pediatric program distinct from where the instrument was originally developed. a total of pgy s and pgy s were videotaped during five simulated pediatric resuscitation scenarios. pediatric emergency physicians rated resident performances before and after a pals course using standardized scoring. each video recording was viewed and scored by two raters blinded to one another. a priori, it was determined that, for the scoring instrument to be valid, participants should improve their scores after participating in the pals course. differences in means between pre-pals and post-pals and pgy and pgy were compared using an anova test. to investigate differences in the scores of the two groups over the five scenarios, a two-factor anova was used. reliability was assessed by calculating an interclass correlation coefficient for each scenario. 
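Inter-rater reliability of the resuscitation scoring instrument is summarized with an intraclass correlation coefficient. The specific ICC form is not stated in the abstract, so the sketch below assumes a one-way random-effects ICC(1) for two raters, implemented directly in NumPy on synthetic scores.

```python
# Synthetic ratings: rows = videotaped performances, columns = two blinded raters.
import numpy as np

scores = np.array([
    [78, 80], [65, 70], [90, 88], [55, 60], [72, 75],
    [84, 82], [60, 58], [70, 74], [88, 85], [50, 55],
], dtype=float)

n, k = scores.shape
grand_mean = scores.mean()
subject_means = scores.mean(axis=1)

ms_between = k * ((subject_means - grand_mean) ** 2).sum() / (n - 1)
ms_within = ((scores - subject_means[:, None]) ** 2).sum() / (n * (k - 1))

icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1) = {icc1:.2f}")
```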
results: following the pals course, scores improved by . % ( . to . ), . % ( . to . ), . % () . to . ), . % ( . to ), and . % () . to . ) for the pulseless non-shockable arrest, pulseless shockable arrest, dysrhythmia, respiratory, and shock scenarios respectively. there were no differences in scores between pgy s and pgy s before and after the pals course. there was an excellent reliability for each scoring instrument with iccs varying between . and . . conclusion: the scoring instrument was able to demonstrate significant improvements in scores following a pals course for pgy and pgy pediatric residents for the pulseless non-shockable arrest, pulseless shockable, and respiratory arrest scenarios only. however, it was unable to discriminate between pgy s and pgy s both before and after the pals course for any scenarios. the scoring instrument showed excellent inter-reliability for all scenarios. a background: medical simulation is a common and frequently studied component of emergency medicine (em) residency curricula. its utility in the context of em medical student clerkships is not well defined. objectives: the objective was to measure the effect of simulation instruction on medical students' em clerkship oral exam performance. we hypothesized that students randomized to the simulation group would score higher. we predicted that simulation instruction would promote better clinical reasoning skills and knowledge expression. methods: this was a randomized observational study conducted from / to / . participants were fourth year medical students in their em clerkship. students were randomly assigned on their first day to one of two groups. the study group received simulation instruction in place of one of the lectures, while the control group was assigned to the standard curriculum. the standard clerkship curriculum includes lectures, case studies, procedure labs, and clinical shifts without simulation. at the end of the clerkship, all students participated in written and oral exams. graders were not blinded to group allocation. grades were assigned based on a pre-defined set of criteria. the final course composite score was computed based on clinical evaluations and the results of both written and oral exams. oral exam scores between the groups were compared using a two-sample t-test. we used the spearman rank correlation to measure the association between group assignment and the overall course grade. the study was approved by our institutional irb. results: sixty-one students participated in the study and were randomly assigned to one of two groups. twenty-nine ( . %) were assigned to simulation and the remaining ( . %) students were assigned to the standard curriculum. students assigned to the simulation group scored . % ( % ci . - . %) higher on the oral exam than the non-simulation group. additionally, simulation was associated with a higher final course grade (p < . ). limitations of this pilot study include lack of blinding and interexaminer variability. conclusion: simulation training as part of an em clerkship is associated with higher oral exam scores and higher overall course grade compared to the standard curriculum. the results from this pilot study are encouraging and support a larger, more rigorous study. initial approaches to common complaints are taught using a standard curriculum of lecture and small group case-based discussion. 
we added a simulation exercise to the traditional altered mental status (ams) curriculum with the hypothesis that this would positively affect student knowledge, attitudes, and level of clinical confidence caring for patients with ams. methods: ams simulation sessions were conducted in june and ; student participation was voluntary. the simulation exercises included two ams cases using a full-body simulator and a faculty debriefing after each case. both students who did and did not participate in the simulations completed a written post-test and a survey related to confidence in their approach to ams. results: students completed the post-test and survey. ( %) attended the simulation session. ( %) attended all three sessions. ( %) participated in the lecture and small group. ( %) did not attend any session. post-test scores were higher in students who attended the simulations versus those who did not: (iqr, - ) vs. (iqr, - ); p < . . students who attended the simulations felt more confident about assessing an ams patient ( % vs. %; p = . ), articulating a differential diagnosis ( % vs. %; p = . ), and knowing initial diagnostic tests ( % vs. %; p = . ) and initial interventions ( % vs. %; p = . ) for an ams patient. students who attended the simulations were more likely to rate the overall ams curriculum as useful ( % vs. %; p < . ). conclusion: addition of a simulation session to a standard ams curriculum had a positive effect on student performance on a knowledge-based exam and increased confidence in clinical approach. the study's major limitations were that student participation in the simulation exercise was voluntary and that effect on applied skills was not measured. future research will determine whether simulation is effective for other chief complaints and if it improves actual clinical performance. background: the acgme has defined six core competencies for residents including ''professionalism'' and ''interpersonal and communication skills.'' integral to these two competencies is empathy. prior studies suggest that self-reported empathy declines during medical training; no reported study has yet integrated simulation into the evaluation of empathy in medical training. objectives: to determine if there is a relation between level of training and empathy in patient interactions as rated during simulation. methods: this is a prospective observational study at a tertiary care center comparing participants at four different levels of training: first (ms ) and third year (ms ) medical students, incoming em interns (pgy ), and em senior residents (pgy / ). trainees participated in two simulation scenarios (ectopic pregnancy and status asthmaticus) in which they were responsible for clinical management (cm) and patient interactions (pi). this was the first simulation exposure during an established simulation curriculum for ms , ms , and pgy . two independent raters reviewed videotaped simulation scenarios using checklists of critical actions for clinical management (cm: - points) and patient interactions (pi: - points). inter-rater reliability was assessed by intra-class correlation coefficients (iccs objectives: we explored attitudes and beliefs about the handoff, using qualitative methods, from a diverse group of stakeholders within the ems community. we also characterized perceptions of barriers to high-quality handoffs and identified strategies for optimizing this process. 
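The AMS simulation abstract earlier in this passage compares post-test scores reported as medians with IQRs; the test used is not named, so the sketch below assumes a Mann-Whitney U comparison of the two groups, with invented scores.

```python
# Hypothetical post-test scores for simulation attendees vs. non-attendees.
from scipy.stats import mannwhitneyu

attended     = [18, 20, 17, 19, 21, 18, 20, 22, 19, 17]
not_attended = [14, 15, 13, 16, 15, 14, 12, 16, 15, 13]

u_stat, p_value = mannwhitneyu(attended, not_attended, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```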
methods: we conducted seven focus groups at three separate gatherings of ems professionals (one local, two national) in / . snowball sampling was used to recruit participants with diverse professional, experiential, geographic, and demographic characteristics. focus groups, lasting - minutes, were moderated by investigators trained in qualitative methods, using an interview guide to elicit conversation. recordings of each group were transcribed. three reviewers analyzed the text in a multi-stage iterative process to code the data, describe the main categories, and identify unifying themes. results: participants included emts, paramedics, physicians, and nurses. clinical experience ranged from months to years. recurrent thematic domains when discussing attitudes and beliefs were: perceptions of respect and competence, professionalism, teamwork, value assigned to the process, and professional duty. modifiers of these domains were: hierarchy, skill/training level, severity/type of patient illness, and system/regulatory factors. strategies for overcoming barriers to the handoff included: fostering familiarity and personal connections between ems and ed staff, encouraging two-way conversations, feedback, and direct interactions between ems providers and ed physicians, and optimizing ways for ems providers to share subjective impressions (beyond standardized data elements) with hospital-based care teams. conclusion: ems professionals assign high value to the ed handoff. variations in patient acuity, familiarity with other handoff participants, and perceptions of respect and professionalism appear to influence the perceived quality of this transition. regulatory strategies to standardize the contents of the handoff may not alone overcome barriers to this process. epidemiology, public health) then developed an approach to assign ems records to one of symptom-based illness categories (gastrointestinal illness, respiratory, etc). ems encounter records were characterized into these illness categories using a novel text analytic program. event alerts were identified across the state and local regions in illness categories using either cumulative sum (cusum) change detection from baseline (three standard deviations) or a novel text-proportion (tap) analysis approach (sas institute, cary, nc). results: . million ems encounter records over a year period were analyzed. the initial analysis focused upon gastrointestinal illness (gi) given the potential relationship of gi distress to infectious outbreaks, food contamination, and intentional poisonings (ricin). after accounting for seasonality, a significant gi event was detected in feb (see red circle on graph). this event coincided with a confirmed norovirus outbreak. the use of the cusum approach (yellow circle on graph) detected the alert event on jan , . the novel tap approach on a regional basis detected the alert on dec , . conclusion: ems has the advantage of being an early point of contact with patients and providing information on the location of insult or injury. surveillance based on ems information system data can detect emergent outbreaks of illness of interest to public health. a novel text proportion analytic technique shows promise as an early event detection method. assessing chronic stress in the emergency medical services elizabeth a. 
donnelly , jill chonody university of windsor, windsor, on, canada; university of south australia, adelaide, australia background: attention has been paid to the effect of critical incident stress in the emergency medical services (ems); however, less attention has been given to the effect of chronic stress (e.g., conflict with administration or colleagues, risk of injury, fatigue, interference in non-work activities) in ems. a number of extant instruments assess for workplace stress; however, none address the idiosyncratic aspects of work in ems. objectives: the purpose of this study was to validate an instrument, adapted from mccreary and thompson ( ) , that assesses levels of both organizational and operational work-related chronic stress in ems personnel. methods: to validate this instrument, a cross-sectional, observational web-based survey was used. the instrument was distributed to a systematic probability sample of emts and paramedics (n = , ). the survey also included the perceived stress scale (cohen, ) to assess for convergent construct validity. results: the survey attained a . % usable response rate (n = ); respondent characteristics were consistent across demographic characteristics with other studies of emts and paramedics. the sample was split in order to allow for exploratory and confirmatory fac-tor analyses (n = /n = ). in the exploratory factor analysis, principal axis factoring with an oblique rotation revealed a two-factor, -item solution (kmo = . , v = . , df = , p £. ). confirmatory factor analysis suggested a more parsimonious, two-factor, -item solution (v = . , df = , p £ . , rmsea = . , cfi = . , tli = . , srmr = . ). the factors demonstrated good internal reliability (operational stress a = . , organizational stress a = . ). both factors were significantly correlated (p £ . ) with the hypothesized convergent validity measure. conclusion: theory and empirical research indicate that exposure to chronic workplace stress may play an important part in the development of psychological distress, including burnout, depression, and posttraumatic stress disorder (ptsd). workplace stress and stress reactions may potentially interfere with job performance. as no extant measure assesses for chronic workplace stress in ems, the validation of this chronic stress measure enhances the tools ems leaders and researchers have in assessing the health and well-being of ems providers. effect of naltrexone background: survivors of sarin and other organophosphate poisoning can develop delayed encephalopathy that is not prevented by standard antidotal therapy with atropine and pralidoxime. a rat model of poisoning with the sarin analogue diisoprophylfluorophosphate (dfp) demonstrated impairment of spatial memory despite antidotal therapy with atropine and pralidoxime. additional antidotes are needed after acute poisonings that will prevent the development of encephalopathy. objectives: to determine the efficacy of naltrexone in preventing delayed encephalopathy after poisoning with the sarin analogue dfp in a rat model. the hypothesis is that naltrexone would improve performance on spatial memory after acute dfp poisoning. the sarin analogue dfp was used because it has similar toxicity to sarin while being less dangerous to handle. methods: a randomized controlled experiment at a university animal research laboratory of the effects of naltrexone on spatial memory after dfp poisoning was conducted. 
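The chronic-stress instrument above reports internal reliability for each factor with alpha coefficients, which is conventionally Cronbach's alpha. A minimal sketch of that calculation in NumPy over synthetic item responses; the data and item count are illustrative assumptions only.

```python
# Synthetic Likert-style responses: rows = respondents, columns = items on one factor.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a 2-D array of respondents x items."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 1))
responses = np.clip(np.round(3 + latent + rng.normal(0, 0.7, size=(200, 6))), 1, 5)

print(f"alpha = {cronbach_alpha(responses):.2f}")
```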
long evans rats weighing - grams were randomized to dfp group (n = , rats received a single intraperitoneal (ip) injection of dfp mg/kg) or dfp+naltrexone group (n = , rats received a single ip injection of dfp ( mg/kg) followed by naltrexone mg/kg/day). after injection, rats were monitored for signs and symptoms of cholinesterase toxicity. if toxicity developed, antidotal therapy was initiated with atropine. background: one of the primary goals of management of patients presenting with known or suspected acetaminophen (apap) ingestion is to identify the risk for apap-induced hepatotoxicity. current practice is to measure apap level at a minimum of hours post ingestion and plot this value on the rumack-matthew nomogram. one retrospective study of apap levels drawn less than hours post-ingestion found a level less than mcg/ml to be sufficient to exclude toxic ingestion. objectives: the aim of this study was to prospectively determine the negative predictive value (npv) for toxicity of an apap level of less than mcg/ml obtained less than hours post-ingestion. methods: this was a multicenter prospective cohort study of patients presenting to one of five tertiary care hospitals that are part of the toxicology investigator's consortium (toxic). eligible patients presented to the emergency department less than hours after known or suspected ingestion and had the initial apap level obtained at greater than but less than hours post ingestion. a second apap level was obtained at hours or more post-ingestion and plotted on the rumack-matthew nomogram to determine risk of toxicity. the outcome of interest was the npv of an initial apap level less than mcg/ml. a power analysis based on an alpha = . and power of . yielded the requirement of subjects. results: data were collected on patients over a month period from may to nov . patients excluded from npv analysis consisted of: initial apap level greater than mcg/ml ( ), negligible apap level on both the initial and confirmatory apap level ( ), initial apap level drawn less than one hour after ingestion ( ), or an unknown time of ingestion ( ). ninety-three patients met the eligibility criteria. two patients ( . %) with an initial apap level less than mcg/ml ( mcg/ml at min, mcg/ml at min) were determined to be at risk for toxicity. patients were given an initial dose of mg droperidol intramuscularly followed by an additional dose of mg after min if required. inclusion criteria were patients requiring physical restraint and parenteral sedation. the primary outcome was the time to sedation. secondary outcomes were the proportion of patients requiring additional sedation within the first hour, over-sedation measured as - on the sedation assessment tool, and respiratory compromise measured as oxygen saturation < %. results: droperidol was administered to patients and of these had sedation scores documented. presentations included % with alcohol intoxication. dose ranged from . 
mg to mg, median mg (interquartile range conclusion: droperidol is effective for rapid sedation for abd and rarely causes over-sedation serum creatinine (scr) is widely used to predict risk; however, gfr is a better assessment of kidney function. objectives: to compare the ability of gfr and scr to predict the development of cin among ed patients receiving cects. we hypothesized that gfr would be the best available predictor of cin. methods: this was a retrospective chart review of ed patients ‡ years old who had a chest or abdomen/pelvis cect between / / and / / . baseline and follow-up scr levels were recorded. patients with initial scr > . mg/dl were excluded, as per hospital radiology department protocol. cin was defined as a scr increase of either %, . mg/dl, or a gfr decrease of % within hours of contrast exposure. gfr was calculated using the ckd epi and mdrd formulae, and analyzed in original units and categorized form (< , ‡ ) with each additional unit decrease in ckd epi, subjects were % more likely to develop cin (or = . ) (p < . ). additionally, subjects with ckd epi < were . (or) times more likely to have cin than subjects with ckd epi ‡ in original units, ckd epi (p < . ) and mdrd (p < . ) both had a significantly higher auc than scr. conclusion: age, as an independent variable, is the best predictor of cin, when compared with scr and gfr. due to a small number of cases with cin, the confidence intervals associated with the odds ratios are wide. future research should focus on patient risk stratification and establishing ed interventions to prevent cin. a rat model of carbon monoxide induced neurotoxicity heather ellsworth non-traumatic subarachnoid hemorrhage diagnosed by lumbar puncture following non-diagnostic head ct: a retrospective case-control study and decision a dass score of > has been previously defined as an indicator of increased stress levels. multivariable logistic regression was utilized to identify demographic and work-life characteristics significantly associated with stress. results: . % of individuals responded to the survey ( , / , ) and prevalence of stress was estimated at . %. the following work-life characteristics were associated with stress: certification level, work experience, and service type. the odds of stress in paramedics was % higher when compared to emt-basics (or = . , % ci = . - . ). when compared to £ years of experience - . ) were more likely to be stressed. ems professionals working in county (or = ci = . - . ) and private services (or = ) were more likely than those working in fire-based services to be stressed. the following demographic characteristics were associated with stress: general health and smoking status finally, former smokers (or = . , % ci = . - . ) and current smokers (or = . , % ci = . - . ) were more likely to be stressed than non-smokers literature suggests this is within the range of stress among nurses, and lower than physicians. while the current study was able to identify demographic and work-life characteristics associated with stress, the long-term effects are largely unknown methods: design: prospective randomized controlled trial. subjects: female sus scrofa swine weighing - kg were infused with amitriptyline . mg/kg/minute until the map fell to % of baseline values. subjects were then randomized to experimental group (ife ml/kg followed by an infusion of . ml/kg/minute) or control group (sb meq/kg plus equal volume of normal saline). 
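Several abstracts in this stretch (the contrast-induced nephropathy analysis and the EMS chronic-stress survey) report odds ratios with confidence intervals. A minimal sketch of an unadjusted odds ratio and its Wald 95% CI from a 2x2 table; the counts are invented placeholders.

```python
# Hypothetical exposure-outcome counts in the usual 2x2 layout:
#                outcome+   outcome-
# exposed            a          b
# unexposed          c          d
import math

a, b, c, d = 30, 70, 15, 85
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR = {odds_ratio:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```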
interventions: we measured continuous heart rate (hr), sbp, map, cardiac output (co), systemic vascular resistance (svr), and venous oxygen saturation (svo ). laboratory values monitored included ph, pco , bicarbonate, lactate, and amitriptyline levels. descriptive statistics including means, standard deviations, standard errors of measurement, and confidence limits were calculated. results: of swine, seven each were allocated to ife and sb groups. there was no difference at baseline for each group regarding hr, sbp, map, co, svr, or svo . ife and sb groups required similar mean amounts of tca to reach hypotension one ife and two sb pigs survived. conclusion: in this interim data analysis of amitriptyline-induced hypotensive swine, we found no difference in mitigating hypotension between ife and sb lipid rescue : a survey of poison center medical directors regarding intravenous fat emulsion therapy michael r. christian , erin m. pallasch cook county hospital (stroger), chicago, il reliability of non-toxic acetaminophen concentrations obtained less than hours after ingestion evaluating age in the field triage of injured background: hiv screening in eds is advocated to achieve the goal of comprehensive population screening. yet, hiv testing in the ed is sometimes thwarted by a patient's condition (e.g. intoxication) or environmental factors (e.g. other care activities). whether it is possible to test these patients at a later time is unknown. objectives: we aimed to determine if ed patients who were initially unable to receive an hiv testing offer might be tested in the ed at a later time. we hypothesized that factors preventing testing are transient and that there are subsequent opportunities to repeat testing offers. methods: we reviewed medical records for patients presenting to an urban, academic ed who were approached consecutively to offer hiv testing during randomly selected periods from january to january . patients for whom the initial attempted offer could not be completed were reviewed in detail with standardized abstraction forms, duplicate abstraction, and third-party discrepancy adjudication. primary outcomes included repeat hiv testing offers during that ed visit, and whether a testing offer might eventually have been possible either during the initial visit or at a later visit within months. outcomes are described as proportions with confidence intervals. results: of patients approached, initial testing offers could not be completed for ( %). these were % male, % white, and had a median age of ( - ). a repeat offer of testing during the initial visit would have been possible for / ( %), and / ( %) were actually offered testing on repeat approach. of the for whom a testing offer would not have been possible on the initial visit, ( %) had at least one additional visit within months, and / ( %) could have been offered testing on at least one visit. overall, a repeat testing offer would have been possible for / ( %, % ci - %). conclusion: factors preventing an initial offer of hiv testing in the ed are generally transient. opportunities for repeat approach during initial or later ed encounters suggest that, given sufficient resources, the ed could succeed in comprehensively screening the population presenting for care. ed screening personnel who are initially unable to offer testing should repeat their attempt. hiv adopt an ''opt-out'' rapid hiv screening model in order to identify hiv infected patients. 
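The HIV repeat-offer study above describes its outcomes as proportions with confidence intervals; the interval method is not stated, so the sketch below assumes the Wilson score interval, with placeholder counts.

```python
# Wilson score interval for a binomial proportion (assumed method; counts are placeholders).
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

lo, hi = wilson_ci(56, 80)  # e.g., 56 of 80 patients could have been re-offered testing
print(f"proportion 95% CI: {lo:.2f}-{hi:.2f}")
```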
previous studies nationwide have shown acceptance rates for hiv screening of - % in emergency departments. however, it is unknown how acceptance rates will vary in a culturally and ethnically diverse urban emergency department.objectives: to determine the characteristics of patients who accept or refuse ''opt-out'' hiv screening in an urban emergency department.methods: a self-administered, anonymous survey is administered to ed patients who are to years of age. the questionnaire is administered in english, russian, mandarin, and spanish. questions include demographic characteristics, hiv risk factors, perception of hiv risk, and acceptance of rapid hiv screening in the emergency department. results: to date patients responded to our survey. of the , ( . %) did not accept an hiv test (group ) in their current ed visit and ( . %) accepted an hiv test (group ). the major two reasons given for opting out (i.e., group ) was ''i do not feel that i am at risk'' ( . %) and ''i have been tested for hiv before'' ( . %). there was no difference between the groups in regards to sex (p = . ), age (p = . ), religious affiliation (p = . ), marital status (p = . ), language spoken at home (p = . ), and whether they had been hiv tested before ( . % in group and . % in group ; p = . ). however, there was a statistically significant difference with regards to educational level and income. more patients in group ( . %) and . % in group had less than a college level education (p < . ). similarly, more patients in group ( . %) and only . % in group had an annual household income of £$ , (p < . ). conclusion: in a culturally and ethnically diverse urban emergency department, patients with a lower socioeconomic status and educational level tend to opt out of hiv screening test offered in the ed. no significant difference in acceptance of ed hiv testing was found to date based on primary language spoken at home or religious affiliation background: antimicrobial resistance is a problem that affects all emergency departments. objectives: our goal was to examine all urinary pathogens and their resistance patterns from urine cultures collected in the emergency department (ed).methods: this study was performed at an urban/suburban community-teaching hospital with an annual volume of , visits. using electronic records, all cases of urine cultures received in were reviewed for data including type of bacteria, antibiotic resistance, and health care exposure (hcx). hcx was defined as no prior hospitalization within the previous six months, hospitalization within the previous three months, hospitalization within the previous six months, nursing home resident (nh), and presence of an indwelling urinary catheter (uc). an investigator abstracted all data with a second re-abstracting a random % for kappa statistics between . and . . group background: approximately - % of patients treated with epinephrine for anaphylaxis receive a second dose but the risk factors associated with repeat epinephrine use remain poorly defined. objectives: to determine whether obesity is a risk factor for requiring + epinephrine doses for patients who present to the emergency department (ed) with anaphylaxis due to food allergy or stinging insect hypersensitivity. 
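The opt-out HIV screening survey above compares categorical characteristics (education, income) between patients who accept and decline testing; the test is not named in the text, so the sketch below assumes a chi-square test of independence on a contingency table with invented counts.

```python
# Hypothetical 2x2 table: rows = declined vs. accepted testing,
# columns = less than college education vs. college or more.
from scipy.stats import chi2_contingency

table = [[120, 60],
         [150, 170]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```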
methods: we performed a retrospective chart review at four tertiary care hospitals that care for adults and children in new england between the following time periods: massachusetts general hospital ( / / - / / ), brigham and women's hospital ( / / - / / ), children's hospital boston ( / / - / / ), hasbro children's hospital ( / / - / / ). we reviewed the medical records of all patients presenting to the ed for food allergy or stinging insect hypersensitivity using icd cm codes. we focused on anthropomorphic data and number of epinephrine treatments given before and during the ed visit. among children, calculated bmis were classified according to cdc growth indicators as underweight, healthy, overweight, or obese. all patients who presented on or after their th birthday were considered adults.background: transitions of care are ubiquitous in the emergency department (ed) and inevitably introduce the opportunity for errors. despite recommendations in the literature, few emergency medicine (em) residency programs provide formal training or standard process for patient hand-offs. checklists have been shown to be effective quality improvement measures in inpatient settings and may be a feasible method to improve ed hand-offs. objectives: to determine if the use of a sign-out checklist improves the accuracy and efficiency of resident sign-out in the ed as measured by reduced omission of key information, communication behaviors, and time to sign-out each patient. methods: a prospective study of first-and second-year em and non-em residents rotating in the ed at an urban academic medical center with an annual ed volume of , . trained clinical research assistants observed resident sign-out during shift change over a two-week period and completed a -point binary observable behavior data collection tool to indicate whether or not key components of sign-out occurred. time to sign out each patient was recorded. we then created and implemented a computerized sign-out checklist consisting of key elements that should be addressed during transitions of care, and instructed residents to use this during hand-offs. a two-week post-intervention observation phase was conducted using the same data collection tool. proportions, means, and non-parametric comparison tests were calculated using stata. results: one hundred fifteen sign-outs were observed prior to checklist implementation and after; one sign-out was excluded for incompleteness. significant improvements were seen in four of the measured signout components: inclusion of history of present illness increased by % (p < . ), likely diagnosis increased by % (p = . ), disposition status increased by % (p < . ), and patient/care team awareness of plan increased by % (p < . ). (figure ) time data for sign-outs pre-implementation and post-implementation were available. seven sign-outs were excluded for incompleteness or spurious values. mean length of sign out was s ( % ci to ) and . s ( % ci to ) per patient. conclusion: implementation of a checklist improved the transfer of information but did not affect the overall length of time for the sign-out. the objectives: to determine risk factors associated with adult patients presenting to the ed with cellulitis who fail initial antibiotic therapy and require a change of antibiotics or admission to hospital. methods: this was a prospective cohort study of patients ‡ years presenting with cellulitis to one of two tertiary care eds (combined annual census , ). 
patients were excluded if they had been treated with antibiotics for the cellulitis prior to presenting to the ed, if they were admitted to hospital, or had an abscess only. trained research personnel administered a questionnaire at the initial ed visit with telephone follow-up weeks later. patient characteristics were summarized using descriptive statistics and % confidence intervals (cis) were estimated using standard equations. backwards stepwise multivariable logistic regression models determined predictor variables independently associated with treatment failure (failed initial antibiotic therapy and required a change of antibiotics or admission to hospital). results: patients were enrolled, were excluded, and were lost to follow-up. the mean (sd) age was . ( . ) and . % were male. ( . %) patients were given antibiotics in the ed. ( . %) were given oral, ( . %) were given iv, and ( . %) patients were given both oral and iv antibiotics. ( . %) patients had a treatment failure. fever (temp > °c) at triage (or: . , % ci: . , . ), leg ulcers (or: . , % ci: . , . ), edema or lymphedema (or: . , % ci: . , . ), and prior cellulitis in the same area (or: . , % ci: . , . ) were independently associated with treatment failure. conclusion: this analysis found four risk factors associated with treatment failure in patients presenting to the ed with cellulitis. these risk factors should be considered when initiating empiric outpatient antibiotic therapy for patients with uncomplicated cellulitis. use background: children presenting for care to a pediatric emergency department (ped) commonly require intravenous catheter (iv) placement. prior studies report that the average number of sticks to successfully place an iv in children is . . successfully placing an iv requires identification of appropriate venous access targets. the veinviewer visionÒ (vvv) assists with iv placement by projecting a map of subcutaneous veins on the surface of the skin using near infrared light. objectives: to compare the effectiveness of the vvv versus standard approaches: sight (s) and sight plus palpation (s+p) for identifying peripheral veins for intravenous catheter placement in children treated in a ped. methods: experienced pediatric emergency nurses and physicians identified peripheral venous access targets appropriate for intravenous cannulation of a cross-sectional convenience sample of english speaking children aged - years presenting for treatment of sub-critical injury or illness whose parents provided consent. the clinicians marked the veins with different colored washable marker and counted them on the dorsum of the hand and in the antecubital fossa using the three approaches: s, s+p, and vvv. a trained research assistant photographed each site for independent counting after each marking and recorded demographics and bmi. counts were validated using independent photographic analyses. data were entered into sas . and analyzed using paired t-tests. results: patients completed the study. clinicians were able to identify significantly more veins on the dorsum of the hand using vvv than s alone or s+p, . (p < . , ci . - . ) and . (p < . , ci . - . ), respectively, as well as significantly more veins in the antecubital fossa using vvv than s alone or s+p, . (p < . , ci . - . ) and . (p < . , ci . - . ), respectively. the differences in numbers of veins identified remained significant at p < . level across all ages, races, and bmis of children and across clinicians and validating independent photographic analyses. 
conclusion: experienced emergency nurses and physicians were able to identify significantly more venous access targets appropriate for intravenous cannulation in the dorsum of the hand and antecubital fossa of children presenting for treatment in a ped using vvv than the standard approaches of sight or sight plus palpation. an background: mental health emergencies have increased over the past two decades, and contribute to the ongoing rise in u.s. ed visit volumes. although data are limited, there is a general perception that the availability of in-person psychiatric consultation in the ed and of inpatient psychiatric beds is inadequate. objectives: to examine the availability of in-person psychiatry consultation in a heterogeneous sample of u.s. eds, and typical delays in transfer of ed patients to an inpatient psychiatric bed. methods: during - , we mailed a survey to all ed directors in a convenience sample of nine us states (ar, co, ga, hi, ma, mn, or, vt, and wy). all sites were asked: ''are psychiatric consults available in-person to the ed?'' (yes/no), with affirmative respondents asked about the typical delay. sites also were asked about typical ed boarding time between a request for patient transfer and actual patient departure from the ed to an inpatient psychiatric bed. ed characteristics included rural/urban location, visit volume (visits/hour), admission rate, ed staffing, and the proportion of patients without insurance. data analysis used chi-square tests and multivariable logistic regression. results: surveys were collected from ( %) of the eds, with > % response rate in every state. overall, only % responded that psychiatric consults were available in-person to the ed. in multivariable logistic regression, ed characteristics independently associated with lack of in-person psychiatric consultation were: location within specific states (eg, ar, ga), rural location, lower visit volume, and lower admission rate. among the subset of eds with psychiatric consults available, % reported a typical wait time of at least hour. overall, % of eds reported that the typical time from request to actual patient transfer to an inpatient psychiatric bed was > hours, and % reported a maximum time in past year of > day (median days, iqr - ). in a multivariable model, location in ma and higher visit volume were associated with greater odds of a maximum wait time of > day. conclusion: among surveyed eds in nine states, only % have in-person psychiatric consultants available. moreover, approximately half of eds report boarding times of > h from request for transfer to actual departure to an inpatient psychiatric bed.background: many emergency departments (ed) in the united states use a five tiered triage protocol that has a limited evaluation of psychiatric patients. the australian triage scale (ats), a psychiatric triage system, has been used throughout australia and new zealand since the early s. objectives: the objective of the study is to compare the current triage system, emergency nurses association (ena) esi -tier, to the ats for the evaluation of the psychiatric patients presenting to the ed. methods: a convenience sample of patients, years of age and older, presenting with psychiatric complaints at triage were given the ena triage assessment by the triage nurse. 
a second triage assessment, performed by a research fellow, included all observed and reported elements using the ats protocol, a self-assessment survey, and an agitation assessment using the richmond agitation sedation scale (rass). the study was performed at an inner-city level i trauma center with , visits per year. the ed was a catchment facility for the police department for psychiatric patients in the area. patients were excluded if they were unstable, unable to communicate, or had a non-psychiatric complaint. results were analyzed in spss v . data analysis used frequencies, descriptive statistics, and anova. results: a total of patients were enrolled in the study: % were african american, % caucasian, % hispanic, % asian, and % indian; % of subjects enrolled were male. the patients' level of agitation using rass showed % were alert and calm, % were restless and anxious, % were agitated, and % combative, violent, or dangerous to self. the only significant correlation found was between the ats and several self-assessment questions: ''i feel agitated on a to scale'' (p = . ) and ''i feel violent on a to scale'' (p = . ). there were no significant correlations found among the ena triage, rass scores, and throughput times. conclusion: the ats was more sensitive to patients declaring that they felt agitated or violent, suggesting it may be a more useful system for determining the severity of need of psychiatric patients presenting to the ed.

background: hemoglobin-based oxygen carriers (hbocs) have been evaluated for small-volume resuscitation of hemorrhagic shock due to their oxygen carrying capability, but have found limited utility due to vasoactive side-effects from nitric oxide (no) scavenging. objectives: to define an optimal hboc dosing strategy and evaluate the effect of an added no donor, we used a prehospital swine polytrauma model to compare the effect of low- vs. moderate-volume hboc resuscitation with and without nitroglycerin (ntg) co-infusion as an no donor. we hypothesized that survival time would improve with moderate resuscitation and that an no donor would add additional benefit. methods: survival time was compared in groups (n = ) of anesthetized swine subjected to simultaneous traumatic brain injury and uncontrolled hemorrhagic shock by aortic tear. animals received one of three different resuscitation fluids: lactated ringer's (lr), hboc, or vasoattenuated hboc with ntg co-infusion. for comparison, these fluids were given in a severely limited fashion (sl) as one bolus every minutes up to four total, or in a moderately limited fashion (ml) as one bolus every minutes up to seven total, to maintain mean arterial pressure ≥ mmhg. the effect of resuscitation regimen and fluid type on survival time was compared using two-way anova with interaction and tukey-kramer adjustment for individual comparisons. results: there was a significant interaction between fluid regimen and resuscitation fluid type (anova, p = . ), indicating that the response to sl or ml resuscitation was fluid type-dependent. within the lr and hboc+ntg groups, survival time (mean, % ci) was longer for sl ( . min).

injuries are common and result from many different mechanisms of injury (moi).
knowing common fracture locations may help in diagnosis and treatment, especially in patients presenting with distracting injuries that may mask the pain of a radius fracture. objectives: we set out to determine the incidence of radius fracture locations among patients presenting to an urban emergency department (ed).

background: carbon monoxide (co) is the leading cause of poisoning morbidity and mortality in the united states. standard treatment includes supplemental oxygen and supportive care. the utility of hyperbaric oxygen (hbo) therapy has been challenged by a recent cochrane review. hypothermia may mitigate delayed neurotoxic effects after co poisoning, as it is effective in cardiac arrest patients with similar neuropathology. objectives: to develop a rat model of acute and delayed severe co toxicity as measured by behavioral deficits and cell necrosis in post-sacrifice brain tissue. methods: a total of rats were used for model development; variable concentrations of co and exposure times were compared to achieve severe toxicity. for the protocol, six senescent long evans rats were exposed to , ppm of co for minutes then , ppm for minutes, followed by three successive dives at , ppm with an endpoint of apnea or seizure; there was a brief interlude between dives for recovery. a modified katz assessment tool was used to assess behavior at baseline and hours, day, and , , , , , and weeks post-exposure. following this, the brains were transcardially fixed with formalin, and μm sagittal slices were embedded in paraffin and stained with hematoxylin and eosin. a pathologist quantified the percentage of necrotic cells in the cortex, hippocampus (pyramidal cells), caudoputamen, cerebellum (purkinje cells), dentate gyrus, and thalamus of each brain to the nearest % from randomly selected high-power fields ( x).

background: there remains controversy about the cardiotoxic effects of droperidol, and in particular the risk of qt prolongation and torsades de pointes (tdp). objectives: this study aimed to investigate the cardiac and haemodynamic effects of high-dose parenteral droperidol for sedation of acute behavioural disturbance (abd) in the emergency department (ed). methods: a standardised intramuscular (im) protocol for the sedation of ed patients with abd was instituted as part of a prospective observational safety study in four regional and metropolitan eds. patients with abd were given an initial dose of mg droperidol followed by an additional dose of mg after min if required. inclusion criteria were patients requiring physical restraint and parenteral sedation. the primary outcome was the proportion of patients who had a prolonged qt interval on ecg. the qt interval was plotted against the heart rate (hr) on the qt nomogram to determine if the qt was abnormal. secondary outcomes were the frequency of hypotension and cardiac arrhythmias. results: ecgs were available from of patients with abd given droperidol. the median dose was mg (iqr - mg; range: to mg). the median age was years (range: to ) and were males ( %). a total of four ( %) qt-hr pairs were above the ''at-risk'' line on the qt nomogram. transient hypotension occurred in ( %), and no arrhythmias were detected. conclusion: droperidol appears to be safe when used for rapid sedation in the dose range of to mg. it rarely causes hypotension or qt prolongation.

background: soldiers and law enforcement agents are repeatedly exposed to blast events in the course of carrying out their duties during training and combat operations.
little data exist on the effect of this exposure on the physiological function of the human body. both military and law enforcement dynamic entry personnel, ''breachers'', began expressing sensitivity to the risk of injury as a result of multiple blast exposures. breachers apply explosives as a means of gaining access to barricaded or hardened structures. these specialists can be exposed to as many as a dozen lead-encased charges per day during training exercises.objectives: this observational study was performed by the breacher injury consortium to determine the effect of short-term exposure to blasts by breachers on whole blood lead levels (blls) and zinc protoporphyrin levels (zppls). methods: two -week basic breaching training classes were conducted by the united states marine corps' weapons training battalion dynamic entry school. each class included students and up to three instructors, with six non-breaching marines serving as a control group. to evaluate for lead exposure, venous blood samples were acquired from study participants on the weekend prior and following training in the first training class, whereas the second training class had an additional level performed mid-training. blls and zppls were measured in a whole-blood sample using the furnace atomic absorption method and hematofuorimeter method, respectively. results: analysis of these blast injury data indicated students demonstrated significantly increased blls post-explosion (mean = mcg/dl, sd . , p < . ) compared to pre-training (mean = mcg/dl, sd . ) and control subjects (mean = mcg/dl, sd . , p < . ). instructors also demonstrated significantly increased blls post explosion (mean = mcg/dl, sd . , p < . ) compared to pre-training (mean = mcg/ dl, sd . ) and control subjects (mean = mcg/dl, sd . , p < . ). student and instructor zppls were not significantly different in post-training compared to pretraining or control groups. conclusion: the observation from this study that breachers are at risk of mild increases in blls support the need for further investigation into the role of lead following repeated blast exposure with munitions encased in lead. direct observation of the background: notification of a patient's death to family members represents a challenging and stressful task for emergency physicians. complex communication skills such as those required for breaking bad news (bbn) are conventionally taught with small-group and other interactive learning formats. we developed a de novo multi-media web-based learning (wbl) module of curriculum content for a standardized patient interaction (spi) for senior medical students during their emergency medicine rotation.objectives: we proposed that use of an asynchronous wbl module would result in students' skill acquisition for breaking bad news. methods: we tracked module utilization and performance on the spi to determine whether students accessed the materials and if they were able to demonstrate proficiency in its application. performance on the spi was assessed utilizing a bbn-specific content instrument developed from the griev_ing mnemonic as well as a previously validated instrument for assessing communication skills.results: three hundred seventy-two students were enrolled in the bbn curriculum. there was a % completion rate of the wbl module despite students being given the option to utilize review articles alone for preparation. students interacted with the activities within the module as evidenced by a mean number of mouse clicks of . (sd . ). 
overall spi scores were . % (sd . ), with content checklist scores of . % (sd . ) and interpersonal communication scores of . % (sd . ). five students had failing content scores (< %) on the spi and had a mean number of clicks of . (sd . ), which was not significantly lower than for those passing (p = . ). students in the first year of wbl deployment completed self-confidence assessments, which showed significant increases in confidence ( . to ).

background: pelvic ultrasonography (us) is a useful bedside tool for the evaluation of women with suspected pelvic pathology. while pelvic us is often performed by the radiology department, it often lacks clinical correlation and takes more time than bedside us in the ed. this was a prospective observational study comparing the ed length of stay (los) of patients receiving ed us versus those receiving radiology us. objectives: the primary objective was to measure the difference in ed los. the secondary objectives were to ( ) assess the role of pregnancy status, ob/gyn consult in the ed, and disposition in influencing the ed los; and ( ) assess the safety of ed us by looking at patient return to the ed within weeks and whether that led to an alternative diagnosis. methods: subjects were women over years old presenting with a gi or gu complaint who received either an ed or radiology us. a t-test was used for the primary objective, and linear regression to test the secondary objectives. odds ratios were calculated to assess for interaction between these factors and type of ultrasound. subgroup analyses were performed if significant interaction was detected. results: forty-eight patients received an ed us and patients received a radiology us. subjects receiving an ed us spent minutes less in the ed (p < . ). in multivariate analysis, even when controlling for pregnancy status, ob/gyn consult, and disposition, patients who received an ed us had a los reduction of minutes (p < . ). in odds ratio analysis, patients who were pregnant were times more likely to have received an ed us (p < . ). patients who received an ob/gyn consult in the ed were five times more likely to receive a radiology us (p < . ). there was no association between type of us and disposition. in subgroup analyses, pregnant and non-pregnant patients who received an ed us still had a los reduction of minutes (p < . ) and minutes (p < . ), respectively. sample sizes were inadequate for subgroup analysis of subjects who had ob/gyn consults. in patients who did not receive an ob/gyn consult, those who received an ed us had a los reduction of minutes (p < . ). finally, % of subjects returned within two weeks, but none led to an alternative diagnosis. conclusion: even when controlling for disposition, ob/gyn consultation, and pregnancy status, patients who received an ed us had a statistically and clinically significant reduction in their ed los. in addition, ed us is safe and accurate.

background: although early surface cooling of burns reduces pain and depth of injury, there are concerns that cooling of large burns may result in hypothermia and worse outcomes. in contrast, controlled mild hypothermia improves outcomes after cardiac arrest and traumatic brain injury. objectives: the authors hypothesized that controlled mild hypothermia would prolong survival in a fluid-resuscitated rat model of large scald burns. methods: forty sprague-dawley rats ( - g) were anesthetized with mg/kg intramuscular ketamine and mg/kg xylazine, with supplemental inhalational isoflurane as needed.
a single full-thickness scald burn covering % of the total body surface area was created per rat using a mason-walker template placed in boiling water ( °c) for a period of seconds. the rats were randomized to hypothermia (n = ) and non-hypothermia (n = ). core body temperature was continuously monitored with a rectal temperature probe. hypothermia was induced through intraperitoneal injection of cooled ( °c) saline. the core temperature was reduced by °c and maintained for a period of hours, applying an ice or heat pack when necessary. the rats were then rewarmed back to baseline temperature. in the control group, room temperature saline was injected into the intraperitoneal cavity and core temperature was maintained using a heating pad as needed. the rats were monitored until death or for a period of days, whichever was greater. the primary outcome was death. the difference in survival was determined using kaplan-meier analysis with a log-rank test. results: the mean core temperatures were . °c for the hypothermic group and . °c for the normothermic group. the mean survival times were hours for the hypothermic group ( % confidence interval [ci] = to ) and hours for the normothermic group ( % ci = to ). the seven-day survival rates in the hypothermic and non-hypothermic groups were % and %. these differences were not significant (p = . for both comparisons). conclusion: induction of brief mild hypothermia increases survival time but does not significantly prolong survival in a resuscitated rat model of large scald burns.

objectives: we sought to determine levels of serum mtdna in ed patients with sepsis compared to controls, and the association between mtdna and both inflammation and severity of illness among patients with sepsis. methods: prospective observational study of patients presenting to one of three large, urban, tertiary care eds. inclusion criteria: ( ) septic shock: suspected infection, two or more systemic inflammatory response (sirs) criteria, and systolic blood pressure (sbp) < mmhg despite a fluid bolus; ( ) sepsis: suspected infection, two or more sirs criteria, and sbp > mmhg; and ( ) control: ed patients without suspected infection, no sirs criteria, and sbp > mmhg. three mtdnas (cox-iii, cytochrome b, and nadh) were measured using real-time quantitative pcr from serum drawn at enrollment. il- and il- were measured using a bio-plex suspension array system. baseline characteristics, il- , il- , and mtdnas were compared using one-way anova or fisher's exact test, as appropriate. correlations between mtdnas and il- /il- were determined using spearman's rank correlation. linear regression models were constructed using sofa score as the dependent variable and each mtdna as the variable of interest in an independent model. a bonferroni adjustment was made for multiple comparisons. results: of patients, were controls, had sepsis, and had septic shock. we found no significant difference in any serum mtdnas among the cohorts (p = . to . ). all mtdnas showed a small but significant negative correlation with il- and il- (ρ = − . to − . ). among patients with sepsis or septic shock (n = ), we found a small but significant negative association between mtdna and sofa score, most clearly with cytochrome b (p = . ). conclusion: we found no difference in serum mtdnas between patients with sepsis, septic shock, and controls.
serum mtdnas were negatively associated with inflammation and severity of illness, suggesting that, as opposed to trauma, serum mtdna does not significantly contribute to the pathophysiology of the sepsis syndromes.

methods: we consecutively enrolled ed patients ≥ years of age who met anaphylaxis diagnostic criteria from april to july at a tertiary center with , annual visits. we collected data on antihypertensive medications, suspected causes, signs and symptoms, ed management, and disposition. markers of severe anaphylaxis were defined as ( ) intubation, ( ) hospitalization (icu or floor), and ( ) signs and symptoms involving ≥ organ systems. antihypertensive medications evaluated included beta-blockers, angiotensin-converting enzyme (ace) inhibitors, and calcium channel blockers (ccb). we conducted univariate and multivariate analyses to measure the association between antihypertensive medications and markers of severe anaphylaxis. because previous studies demonstrated associations of age and the suspected cause of the reaction with anaphylaxis severity, we adjusted for these known confounders in multivariate analyses. we report associations as odds ratios (ors) and corresponding % cis with p-values. results: among patients with anaphylaxis, median age (iqr) was ( - ) and ( . %) were female. eight ( . %) patients were intubated, ( %) required hospitalization, and ( %) had ≥ system involvement. forty-nine ( %) were on beta-blockers, ( %) on ace inhibitors, and ( . %) on ccb. in univariate analysis, ace inhibitors were associated with intubation and ≥ system involvement, and ccb were associated with hospital admission. in multivariate analysis, after adjusting for age and suspected cause, ace inhibitors remained associated with hospital admission and beta-blockers remained associated with both hospital admission and ≥ system involvement. conclusion: in ed patients, beta-blocker and ace inhibitor use may predict increased anaphylaxis severity independent of age and suspected cause of the anaphylactic reaction.

background: advanced cardiac life support (acls) resuscitation requires rapid assessment and intervention. some skills like patient assessment, quality cpr, defibrillation, and medication administration require provider confidence to be performed quickly and correctly. it is unclear, however, whether high-fidelity simulation can improve confidence in a multidisciplinary group of providers with high levels of clinical experience. objectives: the purpose of the study was to test the hypothesis that providers undergoing high-fidelity simulation of cardiopulmonary arrest scenarios will express greater confidence. methods: this was a prospective cohort study conducted at an urban level i trauma center from january to october with a convenience sample of registered nurses (rn) and licensed practical nurses, nurse practitioners, resident physicians, and physician assistants who agreed to participate in / high-fidelity simulation (laerdal g) sessions of cardiopulmonary arrest scenarios about months apart. demographics were recorded. providers completed a validated pre- and post-test five-point likert scale confidence measurement tool before and after each session, ranging from not at all confident ( ) to very confident ( ), covering recognizing signs and symptoms of, appropriately intervening in, and evaluating intervention effectiveness in cardiac and respiratory arrests. descriptive statistics, paired t-tests, and anova were used for data analysis.
sensitivity testing evaluated subjects who completed their second session at months rather than months. results: sixty-five subjects completed consent, completed one session, and completed at least two sessions.

background: prehospital studies have focused on the effect of health care provider gender on patient satisfaction. we know of no study that has assessed patient satisfaction with respect to both patient and prehospital provider gender. some studies have shown higher patient satisfaction rates when patients are cared for by a female health care provider. objectives: to determine the effect of ems provider gender on patient satisfaction with prehospital care. methods: a convenience sampling of all adult patients brought by ambulance to our ed, an urban level i trauma center. a trained research associate (ra) stationed at triage conducted a survey using press ganey ems patient satisfaction questions. there were thirteen questions evaluating prehospital provider skills such as driving, courtesy, listening, medical care, and communication. each skill was assigned a point value between one and five; the higher the value, the better the skill was performed. the patient's ambulance care report was copied for additional data extraction. results: a total of surveys were done. average patient age was , and % were female. scores for all questions totaled (mean . ± . ). prehospital provider pairings were: male-male (n = ), male-female (n = ), and female-female (n = ). there were no statistically significant differences in scores between our pairings (mean scores for male:male . , male:female . , and female:female . ; p = . ). we found no statistically significant difference in satisfaction scores based on the gender of the emt in the back of the ambulance: males had a mean score of . and females had a mean score of . (p = . ). we examined gender concordance by comparing the gender of the patient to the gender of the prehospital provider and found that male-male had a mean score of . , female-female . , and, when the patient and prehospital provider gender did not match, . (p = . ). conclusion: we found no effect of gender difference on patient satisfaction with prehospital care. we also found that, overall, patients are very satisfied with their prehospital care.

objectives: we set out to determine the sensitivity and specificity of emergency physicians (eps) in detecting the presence of recently ingested tablets or tablet fragments. methods: this was a prospective volunteer study at an academic emergency department. healthy volunteers were enrolled and kept npo for hours prior to tablet ingestion. over minutes, subjects ingested ml of water and tablets. ultrasound video clips were obtained prior to any tablet ingestion, after drinking ml of water, after tablets, after tablets, after tablets, and minutes after the final tablet ingestion, yielding six clips per volunteer. all video clips were randomized and shown to three eps who were fellowship-trained in emergency ultrasound. eps recorded the presence or absence of tablets. results: ten volunteers underwent the pill ingestion protocol and sixty clips were collected. results for all cases and each rater are reported in the table. overall there was moderate agreement between raters (kappa = . ). sub-group analysis of , , or pills did not show any significant improvement in sensitivity and specificity. conclusion: ultrasound has moderate specificity but poor sensitivity for identification of tablet ingestion. these results imply that point-of-care ultrasound has limited utility in diagnosing large tablet ingestion.
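the tablet-ingestion study above reports per-rater sensitivity and specificity together with inter-rater agreement (kappa). a minimal sketch of how such metrics can be computed from binary rater labels against the known ingestion status of each clip; the labels below are invented for illustration, and the study may have used a different agreement statistic for its three raters:

```python
# Sensitivity/specificity per rater and Cohen's kappa between two raters,
# computed from binary labels (1 = tablets present, 0 = absent).
# Illustrative labels only; not the study's actual readings.
from sklearn.metrics import confusion_matrix, cohen_kappa_score

truth   = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]   # known ingestion status per clip
rater_a = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

def sens_spec(y_true, y_pred):
    # labels=[0, 1] fixes the matrix layout as [[tn, fp], [fn, tp]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

for name, rater in [("rater A", rater_a), ("rater B", rater_b)]:
    sens, spec = sens_spec(truth, rater)
    print(f"{name}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")

# Agreement between the two raters (the reference standard is not used here)
print(f"Cohen's kappa (A vs. B) = {cohen_kappa_score(rater_a, rater_b):.2f}")
```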
background: intravenous fat emulsion (ife) therapy is a novel treatment that has been used to reverse the acute toxicity of some xenobiotics with varied success. us poison control centers (pcc) are recommending this therapy for clinical use, but data regarding these recommendations are lacking. objectives: to determine how us pcc have incorporated ife as a treatment strategy for poisoning. methods: a closed-format multiple-choice survey instrument was developed, piloted, revised, and then sent electronically to every medical director of an accredited us pcc using surveymonkey in march ; addresses were obtained from the aapcc listserv; participation was voluntary and remained anonymous; three reminder invitations were sent during the study period. data were analyzed using descriptive statistics. results: forty-five of ( %) pcc medical directors completed the survey. all respondents felt that ife therapy played a role in the acute overdose setting. thirty ( %) pcc have a protocol for ife therapy: ( %) recommend an initial bolus of . ml/kg of a % lipid emulsion, ( %) pcc recommend an infusion of lipids, and / pcc recommend an initial infusion rate of . ml/kg of a % lipid emulsion. thirty-three ( %) felt that ife had no clinically significant side effects at a bolus dose of . ml/kg ( % emulsion). forty-four directors ( %) felt that the ''lipid sink'' mechanism contributed to the clinical effects of ife therapy, but ( %) felt that a yet undiscovered mechanism likely contributed as well. in a scenario with cardiac arrest due to a single xenobiotic, directors stated that their center would always or often recommend ife after overdose of bupivacaine ( ; %), verapamil ( ; %), amitriptyline ( ; %), or an unknown xenobiotic ( ; %). in a scenario with significant hemodynamic instability due to a single xenobiotic, directors stated that their pcc would always or often recommend ife after overdose of bupivacaine ( ; %), verapamil ( ; %), amitriptyline ( ; %), or an unknown xenobiotic ( ; %). conclusion: ife therapy is being recommended by us pcc. protocols and dosing regimens are nearly uniform. most directors feel that ife is safe but are more likely to recommend ife in patients with cardiac arrest than in patients with severe hemodynamic compromise. further research is warranted.

levels drawn at hours or more ( mcg/ml at hours and mcg/ml at hours, respectively). npv for toxic ingestion of an initial apap level less than mcg/ml was . % ( % ci . - . %). conclusion: an apap level of less than mcg/ml drawn less than hours after ingestion had a high npv for excluding toxic ingestion. however, the authors would not recommend reliance on levels obtained under hours to exclude toxicity, as the potential for up to . % false negative results is considered unacceptable.

background: genetic variations in the mu-opioid receptor gene (oprm ) mediate individual differences in response to pain and addiction. objectives: to study whether the common a g (rs ) mu-opioid receptor single nucleotide polymorphism (snp) or the alternative splicing snp of oprm (rs ) was associated with overdose severity, we assessed allele frequencies of each, including associations with clinical severity, in patients presenting to the emergency department (ed) with acute drug overdose. methods: in an observational cohort study at an urban teaching hospital, we evaluated consecutive adult ed patients presenting with suspected acute drug overdose over a -month period for whom discarded blood samples were available for analysis.
specimens were linked with clinical variables (demographics, urine toxicology screens, clinical outcomes) and then de-identified prior to genetic snp analysis. in-hospital severe outcomes were defined as either respiratory arrest (ra, defined by mechanical ventilation) or cardiac arrest (ca, defined by loss of pulse). blinded taqman genotyping (applied biosystems) of the snps was performed after standard dna purification (qiagen) and whole genome amplification (qiagen repli-g). the plink . genetic association analysis program was used to verify snp data quality, test for departure from hardy-weinberg equilibrium, and test individual snps for statistical association. results: we evaluated patients ( % female, mean age . ) who overall suffered ras and cas (of whom died). urine toxicology was positive in %, of which there were positives for benzodiazepines, cocaine, opiates, methadone, and barbiturates. all genotypes examined conformed to hardy-weinberg equilibrium. the g allele was associated with . -fold increased odds of ca/ra (or . , p < . ). the rs mutant allele was not associated with ca/ra. conclusion: these data suggest that the g mutant allele of the oprm gene is associated with worse clinical severity in patients with acute drug overdose. the findings add to the growing body of evidence linking the a g snp with clinical outcome and raise the question of whether the a g snp may be a potential target for personalized medical prescribing practices with regard to behavioral/physiologic overdose vulnerability.

key: cord- -yfvu r authors: brat, gabriel a.; weber, griffin m.; gehlenborg, nils; avillach, paul; palmer, nathan p.; chiovato, luca; cimino, james; waitman, lemuel r.; omenn, gilbert s.; malovini, alberto; moore, jason h.; beaulieu-jones, brett k.; tibollo, valentina; murphy, shawn n.; yi, sehi l'; keller, mark s.; bellazzi, riccardo; hanauer, david a.; serret-larmande, arnaud; gutierrez-sacristan, alba; holmes, john j.; bell, douglas s.; mandl, kenneth d.; follett, robert w.; klann, jeffrey g.; murad, douglas a.; scudeller, luigia; bucalo, mauro; kirchoff, katie; craig, jean; obeid, jihad; jouhet, vianney; griffier, romain; cossin, sebastien; moal, bertrand; patel, lav p.; bellasi, antonio; prokosch, hans u.; kraska, detlef; sliz, piotr; tan, amelia l. m.; ngiam, kee yuan; zambelli, alberto; mowery, danielle l.; schiver, emily; devkota, batsal; bradford, robert l.; daniar, mohamad; daniel, christel; benoit, vincent; bey, romain; paris, nicolas; serre, patricia; orlova, nina; dubiel, julien; hilka, martin; jannot, anne sophie; breant, stephane; leblanc, judith; griffon, nicolas; burgun, anita; bernaux, melodie; sandrin, arnaud; salamanca, elisa; cormont, sylvie; ganslandt, thomas; gradinger, tobias; champ, julien; boeker, martin; martel, patricia; esteve, loic; gramfort, alexandre; grisel, olivier; leprovost, damien; moreau, thomas; varoquaux, gael; vie, jill-jênn; wassermann, demian; mensch, arthur; caucheteux, charlotte; haverkamp, christian; lemaitre, guillaume; bosari, silvano; krantz, ian d.; south, andrew; cai, tianxi; kohane, isaac s. title: international electronic health record-derived covid- clinical course profiles: the ce consortium date: - - journal: npj digit med doi: . /s - - - sha: doc_id: cord_uid: yfvu r

we leveraged the largely untapped resource of electronic health record data to address critical clinical and epidemiological questions about coronavirus disease (covid- ).
to do this, we formed an international consortium ( ce) of hospitals across five countries (www.covidclinical.net). contributors utilized the informatics for integrating biology and the bedside (i b ) or observational medical outcomes partnership (omop) platforms to map to a common data model. the group focused on temporal changes in key laboratory test values. harmonized data were analyzed locally and converted to a shared aggregate form for rapid analysis and visualization of regional differences and global commonalities. data covered , covid- cases with , laboratory tests. case counts and laboratory trajectories were concordant with existing literature. laboratory tests at the time of diagnosis showed hospital-level differences equivalent to country-level variation across the consortium partners. despite the limitations of decentralized data generation, we established a framework to capture the trajectory of covid- disease in patients and their response to interventions. the coronavirus disease (covid- ) pandemic has caught the world off guard, reshaping ways of life, the economy, and healthcare delivery all over the globe. the virulence and transmissibility of responsible virus (sars-cov- ) is striking. crucially, there remains a paucity of relevant clinical information to drive response at the clinical and population levels. even in an information technology-dominated era, fundamental measurements to guide public health decision-making remain unclear. knowledge still lags on incidence, prevalence, case-fatality rates, and clinical predictors of disease severity and outcomes. while some of the knowledge gaps relate to the need for further laboratory testing, data that should be widely available in electronic health records (ehrs) have not yet been effectively shared across clinical sites, with public health agencies, or with policy makers. at the time of this writing, more than months after the earliest reports of the disease in china, only . % of us cases reported to the cdc included clinical details . even before therapeutic trials are implemented, frontline clinicians are not yet benefitting from knowledge as basic as understanding the differences in the clinical course between male and female patients . through case studies and series, we have learned that covid- can have multi-organ involvement. a growing literature has identified key markers of cardiac , immune , coagulation , muscle , , hepatic , and renal injury and dysfunction, including extensive evidence of myocarditis and cardiac injury associated with severe disease. laboratory perturbations in lactate dehydrogenase (ldh), c-reactive protein (crp), and procalcitonin have been described. however, data from larger cohorts, linked to outcomes, remain unavailable. because ehrs are not themselves agile analytic platforms, we have been successfully building upon the open source and free i b (for informatics for integrating biology and the bedside) toolkit [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] to manage, compute, and share data extracted from ehrs. in response to covid- , we have organized a global community of researchers, most of whom are or have been members of the i b academic users group, to rapidly set up an ad hoc network that can begin to answer some of the clinical and epidemiological questions around covid- through data harmonization, analytics, and visualizations. the consortium for clinical characterization of covid- by ehr ( ce)-pronounced "foresee"-comprises partner hospitals from five countries. 
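as the abstract above describes, each site analyzed patient-level records locally and shared only an aggregate form. a minimal sketch of that local aggregation step, assuming a simple patient-level lab table with hypothetical column names; only the per-day count, mean, and standard deviation for each laboratory test would leave the site:

```python
# Local aggregation sketch: collapse patient-level lab results into the
# per-day aggregate form (count, mean, SD) that sites share.
# Column names and values are hypothetical, for illustration only.
import pandas as pd

labs = pd.DataFrame({
    "patient_id":          [1, 2, 3, 1, 2, 3],
    "loinc":               ["2160-0"] * 6,     # creatinine, as an example
    "days_since_positive": [0, 0, 0, 1, 1, 1],
    "value":               [1.1, 0.9, 1.4, 1.3, 1.0, 1.8],
})

aggregate = (
    labs.groupby(["loinc", "days_since_positive"])["value"]
        .agg(num_patients="count", mean_value="mean", stdev_value="std")
        .reset_index()
)

# Only this aggregate table (no patient identifiers) leaves the site.
aggregate.to_csv("labs.csv", index=False)
print(aggregate)
```

keeping the aggregation local is what allowed institutional approvals to proceed quickly, since no patient-level data cross institutional boundaries.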
our early efforts aim to consolidate, share, and interpret data about the clinical trajectories of the infection in patients with a first focus on laboratory values and comorbidities. this initial report seeks (a) to establish the accessibility and suitability of data from electronic medical record for covid- patients; (b) to learn about the clinical trajectories of patients; (c) to facilitate evaluation and communication of the utility of various laboratory tests and therapies; and (d) to contribute data, reproducible data mining and visualization workflows, and learnings to a global network and the broader public. here, we report on initial results and the structure of a new, rapidly formed network designed to be a highly scalable system, now implemented at sites. the international scope of our collaboration allows us to identify some of the similarities in clinical course and a few country-specific variations. we recognize that these early data are incomplete and are subject to many biases and limitations, which constrain the conclusions we can currently draw. however, we believe the sources of our data and the mechanism we have established for sharing them are sound, reproducible, and scalable. we also hope our results to-date will encourage other sites to share data and contribute to this important research effort. demographic and consortium-level data over a span of weeks, total hospitals in the us ( ), france ( ), italy ( ), germany ( ), and singapore ( ) contributed data to the consortium. this was represented by data collaboratives across these five countries. a total of , patients with covid- diagnosis were included in the data set, with data covering january , through april , . we collected , laboratory values and harmonized them across sites. thirteen percent of sites submitted complete data sets that included values for each laboratory ( . % for at least , and . % for at least of the laboratory measurements). breakdown of sites is shown in table . demographic breakdown by age and sex is shown in fig. . age distribution was different across countries and consistent with previously identified patterns. in particular, patients from italy were more commonly over the age of relative to other countries . us institutions, despite representing a large number of active infections, had the lowest percentage of elderly patients diagnosed with covid- . germany, with its three included hospitals and relatively small number of patients, was more similar to the us and had an increased number of male patients in the − age group. we were able to capture the total number of identified new cases by site and date. to normalize across sites and countries with varying sizes, we reported -day average new case rate per k over time for each country normalized by the ratio between the inpatient discharge rate for each country and inpatient discharge rate for the ce sites in that country. as shown in fig. , the adjusted -day average new case rates derived from ce consortium sites match reasonably well with those reported by jhu csse for germany, us, and singapore. the ce estimates were substantially higher for france and italy, which could reflect the fact that ce sites in france and italy were mainly concentrated in urban areas with high infection rates. laboratory value trajectories our initial data extraction included laboratory markers of cardiac, renal, hepatic, and immune dysfunction that have been strongly associated with poor outcomes in covid- patients in previous publications. 
laboratory trajectories of each hospital at the population level are presented online at https://covidclinical. net. given limitations of data harmonization and space, we focused on five laboratory trajectories that represented inflammatory, immune, hepatic, coagulation, and renal function. as shown in fig. , trajectory data were remarkably consistent for most institutions at day (day when biological test was positive) with growing differences with continued hospitalization. extensive data harmonization was performed, but we must emphasize that data from each day represented a potentially different population as patients were discharged, died, or laboratory studies were no longer performed. data values from each hospital were an average of all studied patients a specified number of days after diagnosis. initial laboratory values were abnormal for all patients but were not indicative of organ failure. major abnormal elevations were noted in crp and d-dimer on the day of diagnosis. as the number of days from diagnosis progressed, remaining patients who were not discharged or died had, on average, worse values. for nearly all tests, trends toward progressively abnormal values were consistent with worsening disease as inpatient stays continued. most importantly, the initial values and trajectories were highly consistent with previous findings in studies from china , . creatinine, a measure of renal function and the most commonly performed laboratory test in our data set, was divergent over time across sites. rising creatinine would be consistent with an increased proportion of ill patients with significant acute kidney injury over time. hospitals in italy, in contrast, did not see a dramatic rise in creatinine in their hospitalized population, while the small percentage of french and german patients remaining in the hospital for weeks had clear signs of acute kidney injury. this may represent many underlying differences including a high mortality near the beginning of the hospitalization at italian hospitals, severe right time censoring of remaining patients, or a difference in practice. total bilirubin, a measure of conjugation and function by the liver, was initially normal across most sites and showed increases -consistent with other hepatic laboratory tests-among persistently hospitalized patients. the other hepatic laboratory measurements, alanine aminotransferase (alt) and aspartate aminotransferase (ast), were divergent across institutions and showed a more significant perturbation (see https://covidclinical. net). hepatic impairment was not present in most patients on presentation and total bilirubin was only mildly elevated with continued hospitalization. on average, white blood cell count (wbc), a measure of immune response, was within normal limits on presentation. patients who remained in the hospital and survived had increasing wbcs over time without severe leukocytosis . lymphocyte and neutrophil count trajectories can be seen on the website. procalcitonin and ldh were not commonly tested in the total patient population, but results are also online. c-reactive protein, a measure of systemic inflammation, was notably elevated on presentation for all patients in the cohort with a very narrow confidence interval, consistent with previous findings . although it is of unclear importance, populations of patients who remained in the hospital, survived, and had ongoing laboratory testing showed improvements over time. 
interestingly, despite a decreasing trajectory during the first week, a mild leukocytosis is observed in counterbalance during the second week. the implication may be that crp is not predictive of ongoing hospitalization or crp is being checked for patient populations where the laboratory is more commonly improving. d-dimer, an acute phase reactant and measure of coagulopathy, was elevated across institutions and countries at presentation. it rose consistently in all populations who continued to be hospitalized with the disease. this was consistent with multiple studies that showed a prothrombotic element to the disease. most importantly, changes were consistent across all sites and highly abnormal. there was a large drop in the number of laboratory tests performed after the first day (see fig. ). drop off in tests performed could be a result of death, length of stay, or change in frequency of data collection by the clinical team. from the maximum number of laboratory tests consistently checked on the first day after diagnosis, there was a rapid tapering in frequency of laboratory tests checked. these changes were particularly pronounced in italy and france. we identified the number of days until the number of tests checked were % of their initial maximum value. values for laboratory study for each day are presented on https://covidclinical.net. results varied for each laboratory value and site. there was no obvious country-level pattern. given that several of these tests, such as creatinine, were commonly checked nearly every day in ill patients, the implication was that patients were censored from the laboratory results because of discharge or death or changing practice pattern. thus, for the purposes of this paper, we focused on trends in creatinine. we normalized the number of tests performed by day to the total performed on day . we then looked at the day when the number of tests performed was % of the maximum number performed for each site. for creatinine, for example, a drop-off in testing occurred between day and across institutions. most patients who survived were likely discharged within this time frame or managed with much less monitoring. further results can be found online. there was greater between-hospital variation for laboratory test performance than between-country variation (see fig. ). at the time of diagnosis, there was significant variation between countries and between the hospitals in a specific country. there was no obvious signature presentation for a country for an individual laboratory value. for example, creatinine was a commonly performed laboratory study within a day of diagnosis. the overall standard deviation (sd) for creatinine values across countries was . while the sd within sites was . . standard deviation for countries was . , . , . , and . within france, germany, italy, and the us, respectively. france was a special case as hospitals were reported together by ap-hp and then compared with three hospitals in bordeaux. this was an important finding that could suggest that laboratory values, as individual results, would not be able to fully explain the mortality differences between countries. a rapid mobilization of a multi-national consortium was able to harmonize and integrate data across five countries and three continents in order to begin to answer questions about comparative care of covid- patients and opportunities for international learning. 
in just over weeks, the group was able to define a question and data model, perform data extraction and harmonization, evaluate the data, and create a site for public evaluation of site-level data. we aggregated ehr data from hospitals, covering a total of , patients seen in these hospitals for covid- . in doing so, we relied upon prior investments made by various governments and institutions in turning the byproducts of clinical documentation into data useful for a variety of operational and scientific tasks. most importantly, at each site there were biomedical informatics experts who understood both the technical characteristics of the data and their clinical relevance. using automated data extraction methods, we were able to show results consistent with country-level demographic and epidemiological differences identified in the literature. rates of a b fig. percentages of patients along with % confidence intervals in each a country and b sex age groups. g.a. brat et al. total case rise in our study was consistent with international tracking sites . age breakdown, with italian sites reporting a larger proportion of older patients, was also reflective of recent publicly available resources . we were able to show that laboratory trajectories across many hospitals could be collected and were concordant with findings from the literature. in truth, the findings generate more questions than they answer; the ability to see consistencies that spanned many countries indicated that the pathophysiology of this disease is shared across countries, and that demographics and care characteristics will have a significant effect on outcomes. as an example, the fall of crp among those who continued to be hospitalized with a continued rise in d-dimer could suggest that ddimer may be more closely related to persistent illness than crp. the limits of our data collection method, where these results were not tied to the patient level and could not be associated across populations, highlight the need for caution with any conclusion related to changes in laboratory levels over time. perhaps most importantly, our study did not show a unique laboratory signature at the country level at the time of diagnosis. researchers around the world have been closely following the rapid spread of covid- and its high mortality rate in certain countries. one possible explanation would be that patients who presented to hospitals in italy did so at a much more advanced stage of disease. our results did not support this idea. there was as much in-hospital and between-hospital variation as between countries. the average of laboratory values at presentation did not indicate major organ failure. this may be due to a larger proportion of healthier patients than those with advanced disease. of course, respiratory failure could not be tracked within the limits of our data set. there were both logistic and data interoperability lessons that were very important to the success of the project and will be critical for future efforts. logistically, to maximize the timeliness of this consortium's first collaboration around covid- , we deliberately aggregated the data to expedite the institutional review board (irb) process at each institution for such data sharing. this constrained our analyses to count, rather than patient-level, data. 
while the latter would be optimal for deep analysis and identification of subtle patterns and perturbations of clinical courses, we felt that aggregated count data could provide valuable information on the clinical course even as we sought irb permission for analyses at the patient level. interoperability was a significant barrier to overcome, where large variations in units and data presentation required extensive data harmonization. the use of loinc codes allowed for more rapid data extraction [ ] [ ] [ ] [ ] , but often institutions did not have internal mappings from their laboratory tests to loinc codes. manual interpretation of laboratory value descriptions was sometimes necessary. in future iterations, sites should perform unit conversion and ensure data consistency by presenting reference ranges and example data for a first-pass check of data at the site. variations in icd coding and inclusion made harmonization difficult. frequencies of presenting codes were useful to show similar patterns to previous literature, but the current set of codes was too sparse for any further meaningful analysis. future iterations of this project would encompass a much longer data capture timeline and would ensure comprehensive code collection across all sites. in addition, data alignment by a metric that indicates clinical status is necessary to better establish outcomes. using day of diagnosis as an alignment strategy did not allow for clear identification of causes for temporal patterns. this was, in part, because we could not differentiate between patients who underwent lab testing and were not admitted. although additional lab testing was performed almost exclusively for admitted patients, it is possible that some emergency department patients were triaged and sent home. this would explain the rapid drop-off (and subsequent leveling) seen after day in fig. . future studies will need to explicitly differentiate between categories of patients admitted and triaged to home. these care choices may not reflect similar patient physiology but will more readily track care provision. similarly, outcomes need to be selected that represent clinically meaningful endpoints secondary to this initial data alignment. one reason for this difficulty was that identification of level of care was not easily performed. accordingly, it was not easy to follow patients in and out of icus at the site-level and icu data were not reliable. our group, the consortium for clinical characterization of covid- by ehr ( ce), is one of hundreds of efforts (some of which are listed at healthit.gov) that are working to aggregate and curate data to inform clinicians, scientists, policy makers, and the general public. additionally, networks of healthcare organizations such as the act network and pcornet are working with federal authorities to obtain data-driven population-level insights. similar initiatives are active in the other countries participating in ce, including the german medical informatics initiative . disease-specific and organ-specific covid- research collectives are also assembling, including ones for cancers (https://ccc .org), inflammatory bowel disease (https://covidibd.org), and rheumatology , among many others. the world health organization maintains a directory of worldwide research efforts on covid- including clinical data collection . 
finally, there are dozens of patient self-reporting apps with hundreds of thousands of users worldwide that provide perspectives on the clinical course of the infection outside hospitals. it is clear that in the midst of a novel pathogen, uncertainty far outstrips knowledge. at this early stage, we are partially blind to the underlying physiology of the disease and its interactions with different health system processes. the rapid collation of laboratory-level data across nearly hospitals in five countries is novel in the questions it helps us ask. we are currently struggling to help public health agencies and hospitals better manage the epidemic. by identifying potential differences in care, with proxies of lab changes over time, numerous questions can be asked about whether certain clinical decisions may be affecting lab trajectories (and ultimately outcomes). as an example, differences in creatinine over time may be a signal of patientlevel physiology or hospital decisions about care. the regional clustering of the trajectories identified is striking and deserves further analysis. could there be choices in diuresis and fluid management that may explain differing trajectories? if so, best practice may need to change to the specific physiology of this disease. we have been treating covid- like previous infections despite its unique physiology; with the right information, our scientific and policy leaders can implement guidelines that improve care. there are a multitude of limitations to this study, not least of which is that it is observational and subject to a variety of biases. perhaps the most severe is that study data are limited to those patients who were seen at or admitted to hospitals, due to severity of illness or other possibly biasing characteristics. aggregate laboratory data have limited ability to identify general trends in the admitted population. changes in the cohort as a result of discharge or death may change the composition of the cohort over time. the time-varying average represents the labs of remaining patients in the hospital; survivors who require ongoing care. this leads to a survivor bias. because there is significant patient drop-out, the remaining population cannot be compared to the initial cohort. our study is only able to identify that patients had similar initial labs suggesting consistent initial physiology. it is not possible to use these values as drivers of outcomes such as death or severe disease. differences in health capacity may also lead to differences in admitted patients that ultimately manifest as worse outcomes across institutions or countries. limitations also include heavy right censoring where patient absence can be due to death or discharge, delays in updating codes or in uploading ehr data to the local analytic data repository. furthermore, potentially confounding interactions between comorbidities, chronic diseases and their treatments and lifestyle or exposures were not taken into consideration. again, because of these limitations we were careful to avoid making more than basic and descriptive conclusions. over the coming weeks, we will work to quantify these biases and adjust for them, if we can. this will include adding data types as well as disaggregating the data to the patient level if and when permitted by irbs. for the present, with the current limited knowledge of the clinical course of patients suffering from covid- , these results add to this small knowledge base. 
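the survivor bias described above can be made concrete with a toy simulation: if patients with higher lab values tend to stay in hospital longer, the day-by-day average computed over patients still present drifts upward even when no individual value changes. a minimal sketch, with all parameters invented for illustration:

```python
# Toy illustration of survivor bias in day-by-day lab averages:
# each simulated patient has a fixed lab value, but patients with lower
# values are discharged earlier, so the mean over "remaining" patients rises.
# All parameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_patients = 1000
values = rng.normal(loc=1.0, scale=0.3, size=n_patients)   # fixed lab value
# length of stay loosely increases with the lab value
length_of_stay = np.clip(values * 10 + rng.normal(0, 3, n_patients), 1, 21)

for day in range(0, 15, 2):
    remaining = values[length_of_stay > day]
    print(f"day {day:2d}: {len(remaining):4d} patients remain, "
          f"mean value = {remaining.mean():.2f}")
```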
our paper strikingly shows the power of harmonized data extraction from ehrs to rapidly study pandemics like covid- . by example, we hope we can motivate an international discussion on what would be required to enable such international monitoring to simply and rapidly be turned on in future covid- "waves" or in future novel pandemics. we invite others to join the ce consortium by sending a note to ce@i b foundation.org. multiple studies have reported significant abnormalities in several laboratory tests in patients with covid- . studies have shown abnormalities in cardiac, hepatic, renal, immune, and coagulation physiology. those laboratory results are associated with both disease presentation and severity. for this initial study, we selected a subset of laboratory studies that are commonly performed, as identified by the logical objects, identifiers, names and codes (loinc) standard , and had been previously associated with worse outcomes in covid- patients. based on the meta-analysis of lippi and plebani , we focused on laboratory studies that are commonly performed: alt, ast, total bilirubin (tbili), albumin, cardiac troponin (high sensitivity), ldh, d-dimer, white blood cell count (wbc), lymphocyte count, neutrophil count, procalcitonin, and prothrombin time. loinc codes were identified for each laboratory study as well as the units and reference ranges. all patients who received a polymerase chain reaction (pcr)-confirmed diagnosis of covid- were included in the data collection. some hospitals only included patients who were admitted to the hospital while others included all patients for whom the test was positive. sites obtained the data for their files in several ways. most sites leveraged the open source i b software platform already installed at their institution , which supports query and analysis of clinical and genomics data. more than organizations worldwide use i b for a variety of purposes, including identifying patients for clinical trials, drug safety monitoring, and epidemiology research. most ce sites with i b used database scripts to directly query their i b repository to calculate counts needed for data files. institutions without i b used their own clinical data warehouse solutions and querying tools to create the files. in some cases, a hybrid method was used that leveraged different data warehouse platforms to fill in i b gaps. for example, assistance publique-hôpitaux de paris (aphp), the largest hospital system in europe, aggregates all ehr data from hospitals in paris and its surroundings. aphp exported data from the observational medical outcomes partnership (omop) common data model for transformation to the shared format. each site generated four data tables, saved as comma-separated values (csv) files. to protect patient privacy, the files we report contain only aggregate counts (no data on individual patients). in order to further protect patient identity, small counts were obfuscated (see below), since an aggregate count of " " represents an individual patient. by computing these values locally and only sharing the aggregate data, sites were able to obtain institutional approval more rapidly. the first file, dailycounts.csv, contained one row per calendar date. each row included the date, the number of new covid- patients, the number of covid- patients in an intensive care unit (icu), and the number of new deaths from covid- . the third file, labs.csv, described the daily trajectories of select laboratory tests. 
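before the remaining files are described in more detail, the sketch below illustrates how a site might derive dailycounts.csv from a local patient-level registry, applying the small-count obfuscation mentioned above before anything leaves the institution. column names, the example rows, and the obfuscation threshold are assumptions for illustration; each site sets its own threshold.

```python
import pandas as pd

OBFUSCATION_THRESHOLD = 10   # assumed site-specific threshold; small counts are reported as -1

def obfuscate(count: int) -> int:
    """replace small non-zero counts so that no cell can identify an individual patient."""
    return -1 if 0 < count < OBFUSCATION_THRESHOLD else count

def daily_counts(registry: pd.DataFrame) -> pd.DataFrame:
    """aggregate a patient-level registry (one row per patient-day) into DailyCounts.csv rows."""
    grouped = registry.groupby("calendar_date").agg(
        new_positive_cases=("new_positive", "sum"),
        patients_in_icu=("in_icu", "sum"),
        new_deaths=("died", "sum"),
    ).reset_index()
    for col in ["new_positive_cases", "patients_in_icu", "new_deaths"]:
        grouped[col] = grouped[col].apply(obfuscate)
    return grouped

# hypothetical patient-level input built locally at the site; only the aggregate leaves the firewall.
registry = pd.DataFrame({
    "calendar_date": ["2020-03-20", "2020-03-20", "2020-03-21"],
    "new_positive":  [1, 1, 1],
    "in_icu":        [0, 1, 1],
    "died":          [0, 0, 0],
})
daily_counts(registry).to_csv("DailyCounts.csv", index=False)
```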
each row corresponded to a laboratory test (identified using a loinc code) and the number of days since a patient had a positive covid- test, ranging from − ( week before the test result) to (the day of the test result) to n (the day the file was created). the values in each row were the number of patients who had a test result on that day and the mean and standard deviation of the test results. the fourth file, diagnoses.csv, listed all the diagnoses recorded in the ehr for covid- patients, starting from week before their positive covid- test to the present, with the count of the number of patients with the corresponding icd- or icd- code. sites optionally obfuscated the values in any of these files by replacing small counts with "− ." sites indicated missing data or data that they were unable to obtain (e.g. whether patients were in an icu) with "− ." sites uploaded their files to a private shared folder. these files were merged into four combined files that included totals from individual sites. each value in the combined file had four components: ( ) number of sites with unmasked values; ( ) sum of those values; ( ) number of sites with obfuscated values; and ( ) sum of the obfuscation thresholds for those sites. for example, if five sites reported values , , − (between and patients), − (between and patients), − (between and patients), then the combined file listed two unmasked sites with a total of patients and three masked sites with up to + + = patients. from this, it was inferred that there were between and patients. given the large geographic distance between our sites, we assumed that each covid- patient was only represented in one ehr. the combined labs.csv file contained a weighted average (rather than the sum) of the unmasked mean test results from each site. diagnosis codes were submitted from the sites as either international clinical diagnosis (icd)- or icd- billing codes. icd- diagnosis codes were mapped to icd- by first attempting to match the icd- codes to child concepts of icd- codes in the accrual to clinical trials (act) icd- → icd- ontology . in the cases where no match was found in the act ontology, icd- codes were matched to the icd- codes that shared a common concept unique identifier (cui) in the build of the us national library of medicine's (nlm's) unified medical language system (umls) . we created a website hosted at https://covidclinical.net to provide interactive visualizations of our data sets as well as direct access to all shareable data collected for this publication. data aggregation and publication processes are shown in fig. . visualizations were implemented using python and altair (http://altair-viz.github.io/) in jupyter notebooks (https://jupyter.org), all of which are freely available on the website. the vega visualizations (http://vega.github.io) generated by altair were embedded into a jekyll-based site (http://jekyllrb.com/) that was hosted on amazon web services. this study was determined to be exempt as secondary research by the partner's healthcare, boston children's hospital and beth israel deaconess medical center. the committee collected certifications of proper institutional review board prior to data sharing for each additional member of the consortium. as data were transmitted in aggregate, no patient-level data were available from any site. further information on experimental design is available in the nature research reporting summary linked to this article. 
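the site-level combination scheme described above (unmasked sums plus obfuscation thresholds, and count-weighted lab means) can be expressed compactly in code. the following sketch uses invented numbers, since the counts in the worked example above are elided, and assumes that an obfuscated cell hides at least one and at most threshold patients.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SiteValue:
    count: int        # -1 indicates an obfuscated (masked) small count
    threshold: int    # the reporting site's obfuscation threshold

def combine(values: List[SiteValue]) -> Tuple[int, int, int, int]:
    """return the four components described above:
    (# unmasked sites, sum of unmasked values, # masked sites, sum of masked sites' thresholds)."""
    unmasked = [v.count for v in values if v.count >= 0]
    masked = [v for v in values if v.count == -1]
    return (len(unmasked), sum(unmasked), len(masked), sum(v.threshold for v in masked))

def bounds(combined: Tuple[int, int, int, int]) -> Tuple[int, int]:
    """infer a [low, high] range for the true total, assuming each masked cell hides
    at least one and at most threshold patients (an assumption; the text elides the exact rule)."""
    _, unmasked_sum, n_masked, threshold_sum = combined
    return unmasked_sum + n_masked, unmasked_sum + threshold_sum

def weighted_mean(means: List[float], counts: List[int]) -> float:
    """combined lab mean across sites: average of unmasked site means weighted by patient counts."""
    total = sum(counts)
    return sum(m * c for m, c in zip(means, counts)) / total if total else float("nan")

combined = combine([SiteValue(40, 3), SiteValue(25, 3), SiteValue(-1, 3),
                    SiteValue(-1, 5), SiteValue(-1, 10)])
print(combined, bounds(combined))
```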
preliminary estimates of the prevalence of selected underlying health conditions among patients with coronavirus disease -united states does covid- hit women and men differently? u.s. isn't keeping track. the new york times association of coronavirus disease (covid- ) with myocardial injury and mortality why the immune system fails to mount an adaptive immune response to a covid- infection the versatile heparin in covid- rhabdomyolysis as potential late complication associated with covid- liver injury in covid- : management and challenges identification of a potential mechanism of acute kidney injury during the covid- outbreak: a study based on single-cell transcriptome analysis procalcitonin in patients with severe coronavirus disease (covid- ): a meta-analysis serving the enterprise and beyond with informatics for integrating biology and the bedside (i b ) rcupcake: an r package for querying and analyzing biomedical data through the bd k pic-sure restful api the genomics research and innovation network: creating an interoperable, federated, genomics learning system scalable collaborative infrastructure for a learning healthcare system (scilhs): architecture shrine: enabling nationally scalable multi-site disease studies overview of data collection and analysis the shared health research information network (shrine): a prototype federated query tool for clinical data repositories accrual to clinical trials (act): a clinical and translational science award consortium network a translational engine at the national scale: informatics for integrating biology and the bedside case-fatality rate and characteristics of patients dying in relation to covid- in italy an interactive web-based dashboard to track covid- in real time prediction models for diagnosis and prognosis of covid- infection: systematic review and critical appraisal laboratory abnormalities in patients with covid- infection impact of selective mapping strategies on automated laboratory result notification to public health authorities learning from the crowd in terminology mapping: the loinc experience standardizing laboratory data by mapping to loinc evaluating congruence between laboratory loinc value sets for quality measures, public health reporting, and mapping common tests early vision for the ctsa program trial innovation network: a perspective from the national center for advancing translational sciences launching pcornet, a national patient-centered clinical research network german medical informatics initiative the covid- global rheumatology alliance: collecting data in a pandemic recent developments in clinical terminologies-snomed ct, loinc, and rxnorm i b : informatics for integrating biology & the bedside ctsa act network i b and shrine ontology with - shrine adapter mapping file (github the unified medical language system (umls): integrating biomedical terminology all authors approved the manuscript. a table including full contributions is listed in supplementary table data files for daily counts, demographics, diagnosis, and labs data sets are available at https://covidclinical.net. supplementary information is available for this paper at https://doi.org/ . / s - - - .correspondence and requests for materials should be addressed to t.c. 
or i.s.k.reprints and permission information is available at http://www.nature.com/ reprintspublisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.open access this article is licensed under a creative commons attribution . international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material. if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons. org/licenses/by/ . /. key: cord- -ljg sj authors: slotwiner, david j.; al-khatib, sana m. title: digital health in electrophysiology and the covid- global pandemic date: - - journal: heart rhythm o doi: . /j.hroo. . . sha: doc_id: cord_uid: ljg sj the tools of digital health are facilitating a much needed paradigm shift to a more patient-centric health care delivery system, yet our healthcare infrastructure is firmly rooted in a (th) century model which was not designed to receive medical data from outside the traditional medical environment. covid- has accelerated this adoption and illustrated the challenges that lie ahead as we make this shift. the diverse ecosystem of digital health tools share one feature in common: they generate data which must be processed, triaged, acted upon and incorporated into the longitudinal electronic health record. critical abnormal findings must be identified and acted upon rapidly, while semi-urgent and non-critical data and trends may be reviewed within a less urgent timeline. clinically irrelevant findings, which presently comprise a significant percentage of the alerts, ideally would be removed to optimize the high cost, high value resource; i.e., the clinicians’ attention and time. we need to transform our established health care infrastructure, technologies and workflows to be able to safely, effectively and efficiently manage the vast quantities of data that these tools will generate. this must include both new technologies from industry as well as expert consensus documents from medical specialty societies including the heart rhythm society. ultimately, research will be fundamental to inform effective development and implementation of these tools. the tools of digital health are facilitating a much needed paradigm shift to a more patient-centric health care delivery system, yet our healthcare infrastructure is firmly rooted in a th century model which was not designed to receive medical data from outside the traditional medical environment. covid- has accelerated this adoption and illustrated the challenges that lie ahead as we make this shift. the diverse ecosystem of digital health tools share one feature in common: they generate data which must be processed, triaged, acted upon and incorporated into the longitudinal electronic health record. critical abnormal findings must be identified and acted upon rapidly, while semi-urgent and non-critical data and trends may be reviewed within a less urgent timeline. 
clinically irrelevant findings, which presently comprise a significant percentage of the alerts, ideally would be removed to optimize the high cost, high value resource; i.e., the clinicians' attention and time. we need to transform our established health care infrastructure, technologies and workflows to be able to safely, effectively and efficiently manage the vast quantities of data that these tools will generate. this must include both new technologies from industry as well as expert consensus documents from medical specialty societies including the heart rhythm society. ultimately, research will be fundamental to inform effective development and implementation of these tools. as the medical community addresses the complexities associated with the coronavirus disease- (covid- ) pandemic, digital health tools, by communicating physiologic data recorded outside the traditional boundaries of the healthcare environment, are providing solutions to many of the challenges. the pandemic has accelerated the adoption of digital health, yet our healthcare system infrastructures are firmly rooted in a th century closed loop delivery model: systems have been designed to enable clinicians to place an order for diagnostic and therapeutic interventions which are then performed within the traditional boundaries of the clinician's office, hospital, or other clinical care settings. for the most part, current healthcare systems are suited for these tasks: the test is performed, a report is generated, and the results neatly arrive in the clinicians' in-box. the results are reviewed with patients either by phone, via the patient portal, or at the next clinic encounter. this healthcare clinician centric model was ripe for disruption, and the era of digital health and the global covid- pandemic has indeed forever changed care delivery. physiologic data captured outside the traditional medical environment by digital health tools are fundamentally changing the way patients and clinicians communicate, manage diseases, and maintain health. the extensive recording and transmitting of physiologic data and engaging patients in the data collection and review process have caused a paradigm shift. despite the paucity of data, many have postulated that this new model of healthcare delivery will result in significant improvements in patient outcomes. ( , ) ( ) yet the traditional medical establishment, structured around office encounters and periodic testing, is not well suited to evaluate and manage the incessant stream and vast quantity of data and alerts generated by these near continuous monitoring devices. additionally, little attention has been devoted to addressing how such data will enter the medical establishment, or how it will be incorporated into the electronic medical record. it is not fully understood how patients and their clinicians should most effectively communicate between scheduled office encounters. in this article, we describe the present state of heart rhythm digital health tools highlighting some of the effects of j o u r n a l p r e -p r o o f the covid- pandemic and propose ways to develop innovative workflows and technological solutions that will make it possible for practices to efficiently process and manage information. in addition, we highlight some of the research gaps that should be addressed to push this field forward. 
heart rhythm digital health tools fall into broad categories: medical grade implantable devices such as cardiac implantable electronic devices (cieds), medical grade wearable monitors such as mobile cardiac telemetry monitors, and consumer devices that record physiologic data such as heart rate, activity, and single lead electrocardiograms. there is a diverse ecosystem of digital health tools that generate many different types of data. however, there are similarities between the underlying data, need for triaging and responding to the data, as well as the importance of incorporating the data into the electronic medical record. critical abnormal findings must be identified and acted upon rapidly, while semi-urgent and non-critical data and trends may be reviewed within a less urgent timeline. clinically irrelevant findings, which presently comprise a significant percentage of the alerts, ideally would be removed to optimize the high cost, high value resource; i.e., the clinicians' attention and time. cardiac implantable electronic devices (cieds) cieds, the most sophisticated of our digital health tools, have the highest resolution recordings and highly refined software algorithms capable of accurately identifying most arrhythmias and virtually eliminating artifacts except in clinically important scenarios of device or lead malfunction. yet remote monitoring of these devices poses a massive data burden on clinical practices because practices still must interpret and triage data according to clinical relevance for an individual patient. for example, the clinical significance of non-sustained j o u r n a l p r e -p r o o f ventricular tachycardia or atrial fibrillation varies tremendously across patients, yet when these alerts are received by a technician at a given practice, each alert must be reviewed and processed by an experienced clinician with equal diligence to ensure that clinically important events are acted upon ( figure a) . a patient with known atrial fibrillation, on appropriate thromboembolic prophylaxis, may have many alerts for atrial fibrillation even following attempts to optimize alert triggers. at a minimum, this requires an clinician process each alert and confirm in the medical record that the patient is known to have atrial fibrillation and is receiving thromboembolic prophylaxis. subcutaneous cardiac rhythm monitors which are prone to detecting artifacts and are at times unreliable at accurately interpreting the heart rhythm add an additional level of burden to a practice. an experienced clinician must review the electrogram recordings to assess if an event is due to an artifact or a true arrhythmia. then, the clinician must manage the arrhythmia. an often overlooked but critical burden for practices is the need to have a rigorous quality assessment process in place to ensure that each patient's remote transmitter is communicating. the present cied remote monitoring technology is designed to communicate a complete data download every days (for pacemakers and implantable defibrillators) or every days (for implantable subcutaneous cardiac rhythm monitors). between these intervals, many vendors have designed their systems to transmit data only if a new event or abnormality is detected. 
therefore, practices must either have robust processes in place to identify if an individual patient reaches the end of their monitoring interval and their transmitter has not communicated appropriately, or, the practice must monitor each of the cied vendor's web portals, carefully culling any inactive patients from the list, and noting when an alert is triggered indicating that a remote transceiver has stopped communicating. once a practice identifies that a remote transmitter is no longer communicating, the tedious and lengthy process of tracking down the patient and addressing the specific problem may take hours. holter monitors, extended electrocardiogram (ecg) monitors, event recorders and mobile cardiac telemetry monitors record varying degrees of data, but unlike cieds or consumer wearable devices, a well established workflow that includes trained telemetry technicians, nurses or other skilled allied professionals is an essential component of the initial data screening process. as a result, most of the artifact and non-critical findings are presented to the interpreting clinician in a single review session, making the burden of data review and interpretation more manageable. data generated from digital health tools utilizing photoplethysmography to record heart rate or devices capable of recording a single lead ecg pose a different challenge for established medical practices. primary concerns surrounding these devices are fundamental: identifying how data will physically or electronically enter the medical establishment, the quality of the data, the patient and clinician expectations regarding review and communication about the data, and a mechanism for incorporating the data into the patient's longitudinal electronic medical record. many clinicians avoid recommending the use of these tools due to these uncertainties as well as concern that they will be inundated with data of uncertain significance. patients may become unnecessarily concerned by artifact, inaccurate data, and/or overreliance on a device's ability (or inability) to correctly categorize the data as normal or abnormal. the covid- pandemic highlighted both the potential of digital health tools to enable delivery of health care beyond the traditional boundaries of medical facilities as well as the inadequacies of our present health care infrastructure to make this switch. the abrupt shuttering of all but essential hospital-based medical services forced patients and clinicians to turn to j o u r n a l p r e -p r o o f alternative methods of both acquiring health data and communicating the results. telehealth services which previously had struggled to gain traction amongst both patients and clinicians suddenly became routine, with both parties quickly learning to appreciate the advantages of telehealth while also recognizing its limitations. patients with cieds already on remote monitoring were at a distinct advantage as clinicians could rapidly and efficiently identify and triage patients with significant arrhythmias or device malfunctions and reassure the remaining patients to avoid medical environments that could pose risk of covid- infection. medical practices that had not implemented cied remote monitoring were forced either to outsource the technical aspects of initiating and maintaining patients on remote monitoring or to identify resources, educate patients and redeploy staff to figure out the complex process of managing data remotely. 
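the monitoring-interval surveillance described above (identifying patients whose remote transmitter has stopped communicating) is essentially a scheduled comparison of last-transmission dates against device-specific expectations. a minimal sketch follows; because the expected intervals are device- and vendor-specific and are not specified here, the values below are placeholders to be configured per practice, and the grace period is an assumption.

```python
from datetime import date, timedelta
from typing import List, Dict

# expected full-download intervals are device- and vendor-specific; the figures below are
# placeholders to be set per device type, not values taken from the text.
EXPECTED_INTERVAL_DAYS = {"pacemaker": 90, "icd": 90, "subcutaneous_monitor": 30}
GRACE_DAYS = 7  # assumed buffer before escalating to staff outreach

def overdue_transmitters(patients: List[Dict], today: date) -> List[Dict]:
    """return patients whose transmitter has not sent a full download within its expected window."""
    overdue = []
    for p in patients:
        window = timedelta(days=EXPECTED_INTERVAL_DAYS[p["device_type"]] + GRACE_DAYS)
        if today - p["last_full_transmission"] > window:
            overdue.append({**p, "days_overdue": (today - p["last_full_transmission"]).days})
    return overdue

patients = [
    {"patient": "pt-007", "device_type": "pacemaker", "last_full_transmission": date(2020, 1, 2)},
    {"patient": "pt-019", "device_type": "subcutaneous_monitor", "last_full_transmission": date(2020, 4, 20)},
]
for p in overdue_transmitters(patients, date(2020, 5, 1)):
    print(p["patient"], "overdue by", p["days_overdue"], "days")
```

run as a daily job against the practice's device roster, such a check replaces the manual culling of vendor web portals with a single worklist of patients to contact.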
identifying well trained technicians and nurses to help manage these data is challenging even in normal times. similarly, patients who had adopted digital health tools such as automatic blood pressure cuffs, pulse oximeters, glucose monitors, consumer single lead ecg recorders were at a distinct advantage as they could provide their clinician with potentially important data that would otherwise require them risking exposure to the medical environment to obtain. yet even the early adopters of digital health tools were left to struggle with sharing the data and how best to communicate with their clinicians. it is likely that changes brought forward by the pandemic will remain and continue to grow even after the pandemic is over. the first critical step to managing the deluge of data from consumer devices is to develop the infrastructure that will make it possible for data recorded by these devices to be securely and reliably communicated and incorporated into a patient's electronic health record. the second step is to develop a mechanism to triage incoming digital health data so that it can be efficiently and effectively managed by a practice (figure b) . triage should be possible by a combination of artificial intelligence tools and clinical pathways that make it possible for staff with varying levels of clinical expertise to stratify incoming data into buckets: ) urgent data that must be reviewed and acted upon as soon as possible by a clinician; ) semi-urgent data that can be sent to a clinician's inbox for review within the next business day; ) elective findings that the clinician will want to review with the patient at their next routinely scheduled encounter, and ) artifact that is incorrectly detected as an arrhythmia. expert consensus documents should guide these clinical pathways, while industry engineers will be called upon to develop artificial intelligence tools. for example, an artificial intelligence tool could provide a first-pass screening to triage data. a cied may communicate an event which it labels as non-sustained ventricular tachycardia with % certainty. if the cied is an implantable cardioverter defibrillator, if the episode lasted beats, and if this was the first event in months, the significance of the event is very different from a beat run of non-sustained ventricular tachycardia detected by a mobile cardiac telemetry monitor placed on a patient with coronary artery disease, a left ventricular ejection fraction of % who is being evaluated for recurrent syncope. the electronic medical record should be able to provide an artificial intelligence tool with sufficient clinical data to enable the tool to determine if the patient has an implantable cardioverter defibrillator and whether this is a new arrhythmia, thereby providing a first pass screen. if the event does not meet this criterion, it is labelled as potentially clinically significant, and triaged accordingly. technology should allow patients to know if an event has been detected, transmitted and reviewed. once reviewed, it should be possible for the clinician to easily communicate with the patient -and vice versa. an achilles heel of the present cied wireless remote monitoring systems is the absence of a robust, streamlined method for both medical practices as well as patients to quickly and reliably be notified if their transceiver stops communicating. it is not uncommon for patients to learn that their remote monitor stopped communicating months after the fact. 
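returning to the four-bucket triage and first-pass screening described above, the sketch below shows the general shape such a rule set might take when it can draw on device type, ejection fraction and arrhythmia history from the electronic health record. the thresholds (run length, months since last event, ejection fraction) are invented placeholders standing in for the expert-consensus pathways and validated ai models the text calls for, and every bucket other than artifact still reaches a clinician.

```python
# first-pass triage sketch for the four buckets described above; rules and thresholds are
# illustrative assumptions, not a validated clinical pathway.
def triage_event(event: dict, chart: dict) -> str:
    if event.get("artifact_probability", 0.0) > 0.95:
        return "artifact"          # bucket 4: event incorrectly detected as an arrhythmia
    if event["rhythm"] == "nsvt":
        # example pattern from the text: a short NSVT run on an ICD patient with no recent events
        # is handled differently from NSVT on a monitored patient with low EF and syncope.
        if (chart.get("device") == "icd" and event["beats"] <= 8
                and chart.get("months_since_last_event", 0) >= 6):
            return "elective"      # bucket 3: review at the next scheduled encounter
        if chart.get("lvef", 100) < 35 or chart.get("under_evaluation_for_syncope", False):
            return "urgent"        # bucket 1: clinician review as soon as possible
        return "semi-urgent"       # bucket 2: clinician in-box, next business day
    if (event["rhythm"] == "atrial_fibrillation"
            and chart.get("known_af") and chart.get("on_anticoagulation")):
        return "elective"          # known AF on thromboembolic prophylaxis, per the earlier example
    return "semi-urgent"

print(triage_event({"rhythm": "nsvt", "beats": 6, "artifact_probability": 0.01},
                   {"device": "icd", "months_since_last_event": 7}))
```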
while each cied manufacturer has made some progress in alerting practices when communication stops, the systems are poorly designed and inconsistent, even within a manufacturer's product line. this basic quality concern requires a higher priority status from industry and may be best addressed by developing an industry standard approach, such as has been taken to consistently and uniformly message the clinical community the definition of the cied elective replacement indicator. managing the patient needs and the data generated by digital health tools requires well trained technicians and nurses, and appropriate staffing numbers have yet to be defined. examiners and educational programs offered by private organizations, these remain out of reach both from a financial as well as time commitment perspective for most allied professionals. difficulty identifying and training staff is the reason some practices are outsourcing the technical components of managing the acquisition and collation of digital health data to independent remote monitoring organizations. the tools of digital health bring several new categories of unanswered questions that require scientific study. the first questions pertain to implementation of these tools. patients and the broader public at large are essential partners in acquiring the data, therefore it is imperative that the design and interfaces of these tools be intuitive to individuals who span a broad range of ages, educational and cultural backgrounds. next, it will be important to assess the quality j o u r n a l p r e -p r o o f and reliability of data that clinicians receive from these tools. this will be an ongoing question as new tools are developed. we then need to identify which tools and what data will form a basis for improving patient-centered outcomes. lastly, we need to understand the best tools and strategies for communication to occur between patients and clinician to maximize patient engagement and optimize the potential benefits of digital health. economics is fundamental to driving change, and to-date this has limited the impact of telehealth and digital health. there was perhaps no greater acknowledgment of this than the digital health tools and the data they generate present new challenges for reimbursement. to date, the public has not expected insurance carriers to pay for consumer devices, but with home blood pressure monitors, heart rate and rhythm monitors, patient expectations are changing. clinician time and effort vary widely based upon the frequency and volume of data received. in the united states, cms has implemented base billing codes with additional add-on codes designated for use when the clinician time exceeds a base value within a day window. it remains unclear if private insurance carriers will follow medicare's example. if compelling evidence indicates that these tools improve clinical outcomes, patients and the public will expect their clinician to be adequately reimbursed for reviewing and interpreting such data. the tools of digital health are facilitating a much needed paradigm shift to a patientcentric health care delivery system. covid- with its attendant need to minimize patient exposure to the health care environment has accelerated the adoption of these tools and illustrated the challenges that lie ahead as we make this shift. 
we now need to focus attention on adapting our established health care systems, technologies and workflows to be able to safely, effectively and efficiently manage the vast quantities of data that these tools generate. this will require both new technologies to be developed by industry as well as expert consensus documents from medical specialty societies including the heart rhythm society. ultimately, research will be fundamental to inform effective development and implementation of these tools, and to understand how they can be used to achieve clinically meaningful improvements in health care outcomes for patients. • the tools of digital health are facilitating a paradigm shift to a more patient-centric health care delivery system. • our present health care infrastructure was not designed to process, triage and incorporate digital health data generated outside the traditional medical environment. • we must transform our established health care infrastructures, technologies and workflows to be able to safely and efficiently manage the vast quantities of data these tools generate. • research is needed to inform the effective development and implementation of these tools, and to identify which have the potential to improve patient-centered outcomes. weill cornell medicine population health sciences, east th street effectiveness of mobile health application use to improve health behavior changes: a systematic review of randomized controlled trials mobile apps for health behavior change: protocol for a systematic review food & drug administration digital health food & drug administration president trump expands telehealth benefits for medicare beneficiaries during covid- outbreak | cms key: cord- -w sh xpn authors: egli, adrian; schrenzel, jacques; greub, gilbert title: digital microbiology date: - - journal: clin microbiol infect doi: . /j.cmi. . . sha: doc_id: cord_uid: w sh xpn background: digitalisation and artificial intelligence have an important impact on the way microbiology laboratories will work in the near future. opportunities and challenges lay ahead to digitalise the microbiological workflows. making an efficient use of big data, machine learning, and artificial intelligence in clinical microbiology requires a profound understanding of data handling aspects. objective: this review article summarizes the most important concepts of digital microbiology. the article provides microbiologists, clinicians and data scientists a viewpoint and practical examples along the diagnostic process. sources: we used peer-reviewed literature identified by a pubmed search for digitalisation, machine learning, artificial intelligence and microbiology. content: we describe the opportunities and challenges of digitalisation in microbiological diagnostic process with various examples. we also provide in this context key aspects of data structure and interoperability, as well as legal aspects. finally, we outline the way for applications in a modern microbiology laboratory. implications: we predict that digitalization and the usage of machine learning will have a profound impact on the daily routine of the laboratory staff. along the analytical process, the most important steps should be identified, where digital technologies can be applied and provide a benefit. the education of all staff involved should be adapted to prepare for the advances in digital microbiology. 
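the review below considers machine learning at several steps of the diagnostic workflow, beginning with pre-analytical questions such as estimating the risk that a positive blood culture represents contamination from a small number of variables. the scikit-learn sketch below illustrates that general pattern; the features, toy data and model are invented for illustration and do not constitute a validated classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical labelled pre-analytical data: one row per positive blood culture, label = culture
# adjudicated as a contaminant. feature names and values are invented; a real model would be
# trained and validated (with cross-validation) on a large, adjudicated dataset.
X = np.array([
    # [sirs criteria met (0/1), central venous line (0/1), only one bottle positive (0/1)]
    [1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1],
    [1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1],
])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 1 = adjudicated contaminant

model = LogisticRegression().fit(X, y)

new_culture = np.array([[0, 1, 1]])       # no sirs, central line present, single positive bottle
print("estimated contamination probability:", model.predict_proba(new_culture)[0, 1])
```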
inflammatory response syndrome (sirs) and the presence of a central venous line, the risk of blood culture contamination can be assessed. in the future, the combination of lis and electronic health record (ehr) data may allow more sophisticated feedback loops and provide automated quality assessment reports to the microbiologist and clinician. another important pre-analytical aspect is diagnostic stewardship. diagnostic stewardship incorporates the concept of recommending the best diagnostic approach for a given situation [ ] [ ] [ ] . digital solutions in this field may range from digital twins to machine-learning-based algorithms in smartphone apps or chatbots. recently, chatbots have been developed to support diagnostic evaluation and recommend immediate measures when patients are exposed to sars-cov- . similar to a microbiologist consultant, a chatbot may provide helpful diagnostic information and advice, e.g. on the correct transport media for a sample, assay costs, the expected turn-around time, and test performance in specific sample types. such an interactive tool may be a first source of information for routine and repetitive questions, and could support pre-analytical quality. test performance and data generation within the laboratory are parts of analytics. as an example, automated microscopy allows the acquisition of high-resolution images of smears from positive blood cultures and can categorize gram staining with high sensitivity and specificity. besides state-of-the-art automated microscopes, smartphones can also be used for image analysis of microscopy data. automated plate reading systems act similarly on pattern recognition and can reliably recognize bacterial growth on an agar plate and could be used to pre-screen culture plates [ ] [ ] [ ] [ ] [ ] . such automated plate reading systems are currently established in many european laboratories as part of the ongoing automation process. reading of e-tests and inhibition zone diameters around antibiotic-impregnated disks can also be automated with well-developed reading software. clinical decision support systems based on machine learning have also been developed to provide automated feedback regarding empiric antibiotic prescription adapted to specific patient groups. as a next step, more complex datasets will also be analysed. as physiology and laboratory parameters can rapidly change during an infection, time-series data greatly impact the predictive values of such algorithms -similar to a doctor, who observes the patient during disease progression -machine learning algorithms will also follow the patient's data stream. recently, a series of studies has shown the impact of high-frequency physiological parameters in icus on the prediction of sepsis or meningitis. these studies are retrospective analyses, and prospective controlled validation studies are largely missing in the field. therefore, although our expectations for digital microbiology may be high, we should remain critical and carefully address the associated challenges. challenges of digitalisation in the microbiology diagnostic process: the collection, quality control and cleaning, storage, security and protection, stewardship and governance, interoperability and interconnection, reporting and visualization, versioning, and sharing of data pose considerable challenges for big data in microbiology diagnostic laboratories. some of these data handling aspects may be managed with a profound understanding of the laboratory and due to the increasing quantity of data (explosion of information), it will soon become almost a first step: data structure and interoperability
some of these data handling aspects may be managed with a profound understanding of the laboratory and due to the increasing quantity of data (explosion of information), it will soon become almost a first step: data structure and interoperability diseases (figure ) [ ] [ ] [ ] . machine learning algorithms require large, structured, interoperable, and interconnected datasets. healthcare data has to be further standardized and annotated with international recognized definitions , . ontologies help to structure data in such a way by using a common vocabulary, and allow to determine relations of variables within a data model . the previously mentioned concepts for data handling have been used for a series of large healthcare university hospitals. the goal is to discover digital biomarkers for early sepsis recognition and prediction of mortality using machine learning algorithms (www.sphn.ch/). epidemiological databases can also benefit from structured data. for example, pulsenet is a large predictions or decisions without being explicitly programmed to perform that task , . machine learning algorithms may be used at each step of the microbiological diagnostic process from pre-to post-analytics, helping us to deal with the increasing quantities and complexity of data , (table ) . human analytical capacity has reached its limits to (i) grasp the huge amount of available complex process management is key, (ii) data handling is easiest at the point where the data is actually diagnostic tests. in general, incentives are needed to further support all aspects of data handling in laboratory medicine -including standardization data structures and machine learning algorithms. conclusion digitalisation in healthcare shows already a profound impact on patients. it is expected, that the developments started will further gain momentum. machine learning radically changes the way we handle healthcare-related data -including data of clinical microbiology and infectious diseases. likely, we will move from the internet-of-things environment (interconnected datasets in a patient with in a disease-free time. in addition, developments of molecular diagnostics such as metagenomics will increase the data complexity. current trends indicate, that the importance of laboratory diagnostics we have to develop strategies for the next five to ten years to face the opportunities and challenges table s . glossary basel) for critically feedback regarding the manuscript. conflict of interest disclosure: none of the authors had a conflict of interest. quality control how reliable is the analytical performance of a test? -surveillance of reagent lots performance with internal and external controls and automated reported in connection to specific used lots of time. imaging are there bacteria on the microscope slide? -automated image acquisition with a microscope and scan for pathogen-like structures and category , , plate reading is there bacterial growth on the plate? -automated image acquisition and scan for colonies and subsequent identification (telebacteriology). expert system does the detected resistance profile make sense? -medical validation of antibiotic resistance profiles with expert database. public health is there a potential outbreak? -automated screening for pathogen similarities e.g. resistance profile or automated bioinformatics , is there a potential bacterial phenotype? -detection of resistance by analysing maldi-tof spectra , sepsis treatment what is the best treatment for the patient? 
-prediction of sepsis, and best treatment e.g. volume and antibiotics for the patient - tracking strains in the microbiome: insights from metagenomics and design and evaluation of a bacterial clinical infectious diseases ontology ontologies for clinical and translational research: introduction semantic data interoperability, digital medicine, and e- health in infectious disease management: a review the need for a global language -snomed ct introduction. stud health technol mimic-iii, a freely accessible critical care database the eicu collaborative research database, a freely available multi-center database for critical care research pulsenet and the changing paradigm of laboratory-based surveillance microreact: visualizing and sharing data for genomic epidemiology and phylogeography nextstrain: real-time tracking of pathogen evolution improving the quality and workflow of bacterial genome sequencing and analysis: paving the way for a switzerland-wide molecular epidemiological surveillance platform a comprehensive collection of systems biology data characterizing infectious diseases and associated ethical impacts big data and machine learning in critical care: opportunities for collaborative research privacy in the age of medical big data the compare data hubs. database (oxford) , justice point of view introduction to machine learning machine learning in infection management using routine electronic health machine learning for clinical decision support in infectious diseases: a narrative review of current applications machine learning for healthcare: on the verge of a major shift in supervised machine learning for the prediction of infection on admission to hospital: a prospective observational cohort study using artificial intelligence to reduce diagnostic workload without compromising detection of urinary tract infections unsupervised extraction of epidemic syndromes from participatory influenza stewardship: fair enough? the promise of the internet of things in healthcare: how hard is it to keep? stud wearable devices in medical internet of things: scientific research and commercially available devices key: cord- -zzkqb u authors: moore, jason h.; barnett, ian; boland, mary regina; chen, yong; demiris, george; gonzalez-hernandez, graciela; herman, daniel s.; himes, blanca e.; hubbard, rebecca a.; kim, dokyoon; morris, jeffrey s.; mowery, danielle l.; ritchie, marylyn d.; shen, li; urbanowicz, ryan; holmes, john h. title: ideas for how informaticians can get involved with covid- research date: - - journal: biodata min doi: . /s - - -y sha: doc_id: cord_uid: zzkqb u the coronavirus disease (covid- ) pandemic has had a significant impact on population health and wellbeing. biomedical informatics is central to covid- research efforts and for the delivery of healthcare for covid- patients. critical to this effort is the participation of informaticians who typically work on other basic science or clinical problems. the goal of this editorial is to highlight some examples of covid- research areas that could benefit from informatics expertise. each research idea summarizes the covid- application area, followed by an informatics methodology, approach, or technology that could make a contribution. it is our hope that this piece will motivate and make it easy for some informaticians to adopt covid- research projects. the coronavirus disease (covid- ) pandemic has had a significant impact on population health and wellbeing. 
research efforts are underway to identify vaccines [ ] , improve testing [ , ] , understand transmission [ ] , develop serologic tests [ ] , develop therapies [ ] , predict risk [ ] , and develop mitigation and prevention strategies [ , ] . biomedical informatics is central to each of these research efforts and for the delivery of healthcare for covid- patients. critical to this effort is the participation of informaticians who typically work on other basic science or clinical problems. the goal of this editorial is to highlight some examples of covid- research areas that could benefit from informatics expertise. each research idea summarizes the covid- application area followed by an informatics methodology, approach, or technology that could make a contribution. this is followed by some practical suggestions for getting started. these are organized under sub-disciplines for biomedical informatics including bioinformatics that focuses on basic science questions, clinical informatics that focuses on the delivery of healthcare, clinical research informatics that focuses on research using clinical data, consumer health informatics that focuses on the use of mobile devices and telemedicine, and public health informatics that focuses on research questions at the population or community level. it is our hope that this piece will provide motivation and make it easy for some informaticians to adopt covid- research projects. we present here two applications of bioinformatics approaches to the basic science aspects of severe acute respiratory syndrome coronavirus (sars-cov- ) and covid- . these focus on sequencing the virus, in order to understand the genomics of sars-cov- with the goal of informing treatment regimens and vaccine development. the genome sequences of sars-cov- are essential to design and evaluate diagnostic tests, to track the spread of disease outbreak, and to ultimately discover potential intervention strategies. phylogenetics is the study of the evolutionary connections and relationships among individuals or groups of species. these relationships can be identified through phylogenetic inference methods that evaluate the evolutionary origins of traits of interest, such as dna sequences. similar to tracing your ancestry through a dna test, a phylogenetic analysis approach can be used to help map some of the original spread of the new coronavirus and trace a sars-cov- family tree based on its rapid mutations, which creates different viral lineages. note that many countries have shared an increasing number of sars-cov- genome sequences and related clinical and epidemiological data via the global initiative on sharing all influenza data or gisaid (https://www.gisaid. org). gisaid has generated a phylogenetic tree of sars-cov- genome samples between december and april . in particular, nextstrain, an open-source software package (https://nextstrain.org), uses sars-cov- genome data to help track the spread of disease outbreaks. for example, it could be applied to tell researchers where new cases of the coronavirus are coming from. this can be crucial information for investigating whether new cases arrived in given countries through international travel or local infection. one caveat is that the number of genetic differences among the sars-cov- genomes is close to the error rate of the sequencing process. thus, there is a possibility that some of the observed genetic differences may be artifacts of this process. 
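despite that caveat, the distance-based tree building that underlies such analyses can be sketched in a few lines with biopython. the input alignment file name is an assumption, and this toy neighbour-joining construction stands in for the maximum-likelihood pipelines used by tools such as nextstrain.

```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# hypothetical input: a multiple sequence alignment of SARS-CoV-2 consensus genomes
# (e.g. aligned with an external tool from sequences obtained under GISAID's terms of use).
alignment = AlignIO.read("sarscov2_aligned.fasta", "fasta")

calculator = DistanceCalculator("identity")      # pairwise distance = fraction of mismatched sites
distance_matrix = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
tree = constructor.nj(distance_matrix)           # neighbour-joining tree from the distance matrix

Phylo.draw_ascii(tree)                           # quick text rendering of the inferred tree
Phylo.write(tree, "tree.nwk", "newick")          # export for downstream visualization
```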
however, rapid data sharing for sars-cov- is the key to public health action and has led to faster-thanever outbreak research. with more data sharing of the sars-cov- genomes, more genetic diversity will become apparent making it possible to better understand how the coronavirus is being transmitted. while exploring the genome sequence of the sars-cov- virus is anticipated to provide scientists a better understanding of viral evolution and aid in the development of vaccines and treatments, evaluation of host genetics in response to covid- is of similar importance. for other viruses, we know that some individuals have a natural immunity whereby even when exposed to the virus, they do not develop infection. for example, the well-known ccr -delta allele has a variation that protects individuals who have been exposed to the human immunodeficiency virus (hiv); they are protected from developing aids (acquired immunodeficiency syndrome) [ ] . because of this, researchers are gearing up to study the genomes of covid- positive patients in comparison to controls (covid- -negative patients). for example, stawiski et al. investigated coding variation in the gene, ace . ace , the human angiotensinconverting enzyme , is a cell surface protein that the viral spike coat protein sars-cov- engages to invade the host cell [ ] . what would be optimal for these and other genome-wide analyses, to identify potential risk and protective variants, are individuals who test positive for the virus but remain asymptomatic. these individuals will be more difficult to identify because of the lack of widespread testing (most individuals without symptoms are not being tested). however, the research community is building large, international, collaborative consortia to address this challenge, such as the covid- host genetics initiative (https://www.covid hg.org/). much like understanding the viral genome will be useful for drug development, identifying the genetic variation in the host dna that is either increasing the risk of, or protection from, sars-cov- infection will enable us to identify putative targets for therapeutics and vaccines. we present here three topics relevant to the diagnosis and management of covid- patients. these include imaging, suggestions for the roles that informaticians can assume in the pandemic, and the need for novel approaches to delivering patient care and learning from in-practice data. imaging provides a powerful tool for covid- diagnosis and patient monitoring given the impact on lung physiology and anatomy. for example, chest computed tomography (ct) has been shown to have promising sensitivity and early detection power compared with the standard reverse transcriptase polymerase chain reaction (rt-pcr) test [ ] . in addition, imaging plays an important role in assessing patients with worsening respiratory status [ ] , which is crucial for monitoring and treatment planning. given the fast-growing volume of covid- cases, to help alleviate the huge manual evaluation burden on clinicians, there is an urgent call for researchers in imaging informatics (or radiomics) to work on developing automated image analysis and artificial intelligence (ai) methods and tools. to achieve these goals, major efforts have been initiated to address two critical research foci. 
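before turning to those clinical topics, the case-control comparison described in the host-genetics passage above can be illustrated with a minimal single-variant association test. the counts are invented; a real analysis would be genome-wide, adjust for ancestry and other covariates, and correct for multiple testing.

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical carrier counts for one candidate variant, tabulated separately in COVID-19 cases
# and controls; the numbers are invented for illustration only.
#                  carriers  non-carriers
table = np.array([[ 120,       880 ],    # cases
                  [  95,       905 ]])   # controls

chi2, p_value, dof, expected = chi2_contingency(table)

# carrier odds ratio as a simple effect-size summary
odds_ratio = (table[0, 0] * table[1, 1]) / (table[0, 1] * table[1, 0])
print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.2f}, p = {p_value:.3g}")
```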
the first is to create large-scale high-quality imaging data repositories (e.g., radiological society of north america (rsna) covid- imaging data repository, https://www.rsna.org/covid- ) to accelerate collaborative research on image-based covid- diagnosis and treatment. the second is to develop innovative ai methods for automatic image analysis for covid- diagnosis and severity assessment. to get started on supporting these efforts, below we suggest a few relevant resources for interested imaging informaticians. several covid- resource and initiative web portals have been created by major organizations such as american college of radiology (acr), radiological society of north america (rsna), and european society of medical imaging informatics (eusomii). these portals offer important information on policies, guidelines, discoveries, initiatives, data sets, and/or other relevant resources. given the rapidly growing ai-based imaging literature on covid- , it is worth noting a recent review article [ ] , which provides comprehensive coverage on a variety of interesting topics, including ai-empowered contactless imaging workflow, ai in lung image segmentation, ai-assisted diagnosis and severity assessment, ai in follow-up studies, public imaging datasets for covid- , and future directions. to effectively address the ever-growing surge of covid- patient cases, informatics solutions are being developed to help care providers and healthcare institutions manage patients from symptoms to recovery. symptom screening tools have been developed to aid patients in distinguishing covid- symptoms from common colds and flu. telemedicine is helping keep patients at home by deploying chatbots to answer patient covid- questions and providing virtual visits and consultations to limit the number of individuals exposed to covid- and to manage patients with mild covid- symptoms. this reduces resource utilization and overburden on the care delivery system. capacity and resource management tools can generate projects based on regional infection counts and current patient admissions to estimate the number of patients that will require hospitalization, intensive care unit beds, medications, and mechanical ventilation. these projections can improve clinical response times and inform triage care strategies. donation and resource inventory tools can be helpful for identifying, cataloging, and distributing personal protective equipment (ppe), homemade masks, and other critical medical supplies to those fighting on the front lines. informaticians can support these efforts by ) educating patients and care providers about data science resources and electronic health record (ehr) platforms for building point-of-care solutions, ) joining the open-source community efforts to develop these technologies, and ) volunteering with the information services divisions within their healthcare organizations to deploy telehealth tools and engage in patient management projects. the covid- pandemic has been an unprecedented stress test for clinical information systems. the scramble to develop and implement new clinical practices has in many cases outpaced our ability to effectively use standard tools for building, testing, and monitoring these practices. for instance, clinical laboratories have rapidly implemented several different methods for sars-cov- diagnostic testing and have also needed to send out testing to multiple reference laboratories. 
these complex practices have made it non-trivial to collect even the most basic information, including who is being tested and who is positive. these data are essential both to the care of individual patients and to health providers who need to design these care systems and plan for what is coming next. these data are also being reported to government agencies in multiple new manual processes. the work that has been done to build these data collection systems is extraordinary and commendable. but, for our clinical informatics and public health communities, these challenges highlight the need for developing modern, flexible clinical information systems and robust infrastructure for inter-institution data sharing. the implementation of novel clinical practices has also been notable for how much we still do not know about their clinical utility. as a consequence, there is a great need to learn about clinical utility from in-practice data. for example, the precise clinical sensitivity and clinical specificity of the sars-cov- diagnostic testing being used are currently unclear [ ] . this is critically important because false-negative results could lead to the inappropriate non-use of ppe or insufficient clinical and epidemiological monitoring. the rate of such false-negatives is also highly variable across time, as the disease prevalence changes, and across multiple patient, provider, and geographic factors. to fill in these knowledge gaps, there is a big need for the design and application of methods for estimating such parameters from in-practice data. these approaches must be robust to the many sources of bias in these kinds of retrospective data and must be applied to datasets of large enough sample sizes, to generate meaningfully precise estimates. we present here four clinical research informatics domains related to the generation, integration, and use of clinical and other data that could be leveraged in addressing the pandemic in various settings. the domains include a well-developed informatics infrastructure that encompasses a large healthcare landscape, the potential for systematically and cautiously repurposing drug treatments, the leveraging of existing clinical and biospecimen data, and the role of advanced statistical, integrative, and machine learning (ml) tools for diagnosis and treatment. one critical need to support covid- -related clinical and translational research studies is the development of informatics infrastructure that contains accurate and timely clinical data from the electronic health records of the covid- population. as a first step, healthcare institutions can create patient registries to maintain reliable lists of covid- patients and cases (e.g., confirmed, ruled out, uncertain). these data must be updated regularly (daily or several times each week) and contain a broad set of data elements representing demographics, prior medical histories, current medications, comorbidities, diagnoses, procedures, outcomes, etc. to serve a broad base of clinical investigators and scientific inquiries. to adequately code all patient data, image processing will be needed to encode salient radiological findings, and natural language processing will be needed to extract symptom onset, severity, and duration among other variables. 
secure informatics platforms such as informatics for integrating biology and the bedside (i b ), the shared health research information network (shrine), trinetx, and atlas play an important role in standardizing and harmonizing clinical data to common data models (cdms) including i b , the patient-centered outcomes research network (pcornet), fast healthcare interoperability resources (fhir), and the observational medical outcomes partnership (omop). once covid- patients are indexed within the patient registry and their clinical data have been extracted, transformed, and loaded into these frameworks, clinical researchers can execute secure, privacy-preserving, and federated queries across all participating sites using any framework to identify patients for clinical trials, generate scientific hypotheses, and conduct observational studies. both aggregate and individual-level information can be made available with appropriate data governance, ethical review, and institutional agreements. informaticians can support these efforts by ) developing technologies and algorithms for extracting, encoding, and mapping raw ehr data to emerging covid- -specific cdms, ) engaging in existing and emerging consortiums, both grassroots and nationally-sponsored efforts, across clinical and translational science awards (ctsas) and informatics networks, and ) connecting with clinicians to develop and share informatics tools and predictive models that identify clinically-informative, actionable insights from heterogeneous, temporal data. one of the major challenges with emerging diseases, such as covid- , is that evidence for effective drugs and treatments is sparse. while vaccine development is important, vaccines are only helpful to prevent individuals from becoming infected in the first place. for those that have covid- , the main strategy for treatment with drugs (while the disease is still emerging) is to reuse those that have been approved for other purposes. there are several drugs that may have therapeutic use in covid- , namely: hydroxychloroquine sulfate, chloroquine phosphate, remdesivir, carfilzomib, eravacycline, valrubicin, lopinavir and elbasvir. these medications were designed for the treatment of various diseases, including lupus, malaria, cancer and hiv. therefore, the use of these medications to treat covid- is termed 'drug repurposing', and one avenue for studying the potential for a drug to be repurposed is through informatics. informatics methods have been developed for both drug repurposing and pharmacovigilance (studying the adverse effects of a drug). the advantage of using existing ehrs for studying drugs as candidates for drug repurposing is that it enables risk assessment profiles to be generated for each candidate drug. since the drugs have been prescribed previously during routine clinical care, it is possible to study their effects on human health in a variety of situations that may not have been included in the original clinical trials. for example, the birth and pregnancy outcomes following drug exposure can be assessed using ehrs for drugs potentially useful in treating covid- , such as hydroxychloroquine. this is important because the hydroxychloroquine clinical trials for covid- specifically exclude pregnant women from enrolling. informatics methods can also be designed that use more sophisticated machine learning and artificial intelligence methods to study the effects of medication exposure during pregnancy on fetal and maternal outcomes [ ] .
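as an illustration of the kind of ehr-based risk profiling described above, the sketch below assembles a hypothetical exposure cohort for one candidate repurposed drug with pandas and compares a crude outcome rate against unexposed patients. the table names, column names, and values are invented for illustration; a real pharmacoepidemiology study would add confounding adjustment (e.g., propensity scores), temporal exposure windows, and clinical review.

```python
# illustrative sketch with invented ehr-style tables; not a study design.
import pandas as pd

prescriptions = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "drug_name": ["hydroxychloroquine", "lisinopril", "hydroxychloroquine", "metformin"],
})
outcomes = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "hospitalized": [0, 1, 1, 0],
})

# flag patients with at least one prescription of the candidate drug
exposed_ids = set(prescriptions.loc[
    prescriptions["drug_name"] == "hydroxychloroquine", "patient_id"])
outcomes["exposed"] = outcomes["patient_id"].isin(exposed_ids)

# crude hospitalization rate by exposure group (no adjustment for confounders)
print(outcomes.groupby("exposed")["hospitalized"].mean())
```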
with the aggregation of clinical and medication data from ehrs, along with the recruitment of covid- positives and negatives for genetic studies (as described above), there is an opportunity to explore genetic data in combination with these ehr data to improve our understanding of covid- severity and outcomes. early research has suggested that individuals who have medical conditions such as heart disease, diabetes, obesity, or asthma may be at higher risk for severe disease and/or worse outcomes from covid- . additionally, early data suggest that some medications such as ace-inhibitors, angiotensin receptor blockers (arbs), or non-aspirin nonsteroidal anti-inflammatory drugs (nsaids) may be linked to worse health outcomes due to covid- . however, these reports are primarily based on small, observational datasets without rigorous, epidemiological study designs. as such, these associations are met with much controversy in the literature. with the accumulation of covid- positives and negatives, along with access to ehr data, including comorbid conditions and medications, researchers will be able to develop more thorough studies of which medical conditions are associated with poorer covid- outcomes and/or which medical conditions place individuals at higher risk for hospitalization due to covid- . additionally, if these data are paired with genetic data from ehr-linked biobanks, we may be able to determine if some of these differences in covid- severity and/or outcomes related to comorbidities and medications are also related to host genetics. fortunately, there are several efforts to establish data-sharing consortia that provide an opportunity for informaticians to assist with analyses. for example, the consortium for clinical characterization of covid- by ehr, or ce (https://covidclinical.net/), has released summary-level covid- data from several countries including france, germany, italy, singapore, and the united states along with a preprint of the initial analyses [ ] . presently, there is much to be learned regarding how best to treat covid- patients when sufficient resources are available, as well as how to optimize operational decisions such as the triage of patient testing and care when they are not. as accessible, cleaned, and structured ehr data become available for covid- patients at both the institutional and multi-site consortium levels, there will be increased opportunity to apply machine learning to better understand and make risk predictions on a variety of clinically and operationally relevant outcomes. the accessibility of data science and ml packages (e.g. pandas, scikit-learn, and tensorflow python libraries), paired with widely available high-powered computational hardware, offers a significant opportunity for researchers to get involved in data analysis and modeling. however, many caveats need to be taken into consideration in order to develop and apply effective, rigorous ml analysis pipelines for replicable covid- investigation. some key considerations and targets of research include: ( ) feature engineering, transforming raw data into features (i.e.
variables) that ml can better utilize to represent the problem/target outcome, ( ) feature selection, applying expert domain knowledge, statistical methods, and/or ml methods to remove 'irrelevant' features from consideration and improve downstream modeling, ( ) data harmonization, allowing for the integration of data collected at different sites/institutions, ( ) handling different outcomes and related challenges, e.g. binary classification, multi-class, quantitative phenotypes, class imbalance, temporal data, multi-labeled data, censored data, and the use of appropriate evaluation metrics, ( ) ml algorithm selection for a given problem can be a challenge in itself, thus strategies to integrate the predictions of multiple machine learners as an ensemble are likely to be important, ( ) ml modeling pipeline assembly, including critical considerations such as hyper-parameter optimization, accounting for overfitting, and clinical interpretability of trained models, and ( ) considering and accounting for covariates as well as sources of bias in data collection, study design, and application of ml tools in order to avoid drawing conclusions based on spurious correlations. advanced tools may be necessary to deal with data analytic challenges, properly analyze these data, and accurately extract the knowledge embedded in them. some key challenges include: ( ) accounting for correlation structure induced by multi-level, spatial, and longitudinal designs, ( ) adjusting for biases emanating from the observational data using causal approaches, ( ) accounting for privacy-induced limitations on the resolution of data that can be shared, and ( ) discovering and characterizing interpatient heterogeneities in incidence, progression, or response through stratified or latent class models. some of these challenges can be handled by aptly chosen existing methods, while others require new methodological development. the covid- crisis and the extensive data resources that it will produce will provide an excellent opportunity to develop such methods, including privacy-preserving integrative analytical tools as well as advanced causal inference tools that also account for these other data complexities. we present here two related approaches to using informatics solutions, which directly involve the public who are not physically situated within a healthcare setting. the first focuses on using smartphones and other technology for educating the public about the pandemic and ways to avoid infection as well as monitoring, and the second explores the use of sensors in this domain. consumer health informatics, focusing broadly on tools and systems that engage and empower patients and more general health consumers in health delivery and decision making processes, has a substantial role to play in the context of a pandemic. specific areas that consumer informatics researchers and system designers can target include consumer education, self-triage, monitoring, and social engagement. in a time when behavioral guidelines are continuously adjusted based on new data, consumer education is essential to conveying and disseminating actionable and timely information. patient portals and other web sites can provide educational content that can be tailored to individual information needs as well as literacy and health literacy levels. furthermore, systems can include an interactive component that can facilitate decision support and selftriage. 
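as a toy illustration of the interactive decision-support and self-triage components mentioned above, the following sketch encodes a rule-based triage function. the symptom weights and thresholds are hypothetical and not clinically validated; deployed tools encode institution-approved protocols and are reviewed by clinicians.

```python
# illustrative sketch only: hypothetical weights and thresholds, not clinical guidance.
def triage(symptoms: dict) -> str:
    """Return a coarse care recommendation from self-reported symptoms."""
    if symptoms.get("severe_shortness_of_breath") or symptoms.get("chest_pain"):
        return "seek emergency care"
    score = sum([
        2 if symptoms.get("fever") else 0,
        2 if symptoms.get("cough") else 0,
        1 if symptoms.get("fatigue") else 0,
        1 if symptoms.get("known_exposure") else 0,
    ])
    if score >= 4:
        return "schedule a telehealth visit and testing"
    return "self-monitor at home and re-screen if symptoms change"

print(triage({"fever": True, "cough": True, "known_exposure": True}))
# schedule a telehealth visit and testing
```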
one such example is a patient portal for self-triage and scheduling that was created at the university of california san francisco to enable asymptomatic patients to report exposure history and for symptomatic patients to be triaged and paired with appropriate levels of care [ ] . the system is already being used extensively and performs with high sensitivity in recommending emergency-level care for symptomatic patients. it also prevents unnecessary visits. tools that have been traditionally used for patient monitoring at home and the community can also be useful in generating data that provide insight into disease spread and health needs. an example is that of a smart thermometer vendor that has created an app which allows users to record their temperature and other symptoms with a health insurance portability and accountability act (hipaa) compliant platform; data are aggregated and demonstrate how the virus moves from one county to another, providing a detailed visualization map that highlights areas with an unusually high number of recorded prevalence of fever (https:// healthweather.us/). other mobile health tools that track aspects of daily living including activity levels, sleep quality, or symptom self-management can facilitate better monitoring of health and wellness and potentially lead to effective symptom management at an individual level, and contribute to disease surveillance at a population level. examples include activity tracker data that can inform surveillance of social distancing patterns, and home spirometer and pulse oximetry data that can generate a trajectory of symptom progression in various communities. finally, in times of "social distancing", vulnerable populations such as older adults living alone are at greater risk of increased social isolation, which is often referred to as a silent epidemic and great health risk [ ] . digital tools have the potential to connect individuals for the delivery of social services, and creation of virtual peer support groups and connected communities including friends and family members. this current pandemic has highlighted the need for accessible and secure tools that may include video-conferencing, synchronous and asynchronous communication, and even more sophisticated features such as virtual reality and augmented reality, designed for audiences with diverse abilities with the goal to promote social connectedness in times of physical distancing. smartphones and other wearable smart devices contain research-grade sensors that are capable of shedding light on at least a subset of covid- symptoms which include fever, fatigue, dry-cough, and shortness of breath. for example, the temperature recorded by fingerprint sensors, which are now standard on most modern smartphones, has previously been used to successfully predict fever [ ] . in addition, activity sensors such as the accelerometer have been used to detect fatigue. while high resolution computed tomography (ct) images of a patient's lung may provide a more reliable indicator of infection, the high cost and low scalability make this approach infeasible to apply widely at the general population level. on the other hand, smartphones are currently pervasive with high penetrance even in low and middle-income countries and their high-quality sensor data can be used at next to no cost to measure a subset of important covid- symptoms as a screening tool to identify individuals that may require more extensive evaluation or testing. 
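a minimal sketch of such a sensor-based screening approach is shown below, using synthetic data and a hypothetical feature set (skin temperature, resting heart rate, daily step count). it also illustrates several of the machine learning pipeline considerations raised earlier: feature scaling, class-imbalance handling, and cross-validated evaluation. it is a sketch under stated assumptions, not a validated screener.

```python
# illustrative sketch on synthetic data; the features, label rule, and model are
# hypothetical and chosen only to show the shape of a screening pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(36.8, 0.5, n),   # skin temperature (degrees C)
    rng.normal(70, 10, n),      # resting heart rate (bpm)
    rng.normal(6000, 2500, n),  # daily step count
])
# synthetic label: febrile, tachycardic, low-activity individuals are flagged more often
risk = 1.5 * (X[:, 0] - 36.8) + 0.03 * (X[:, 1] - 70) - 0.0002 * (X[:, 2] - 6000)
y = (risk + rng.normal(0, 1, n) > 1.0).astype(int)   # imbalanced positive class

clf = make_pipeline(StandardScaler(),
                    LogisticRegression(class_weight="balanced"))
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated auc: {auc.mean():.2f} (+/- {auc.std():.2f})")
```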
we present here six considerations of the role of public health informatics in the covid- pandemic. these represent a broad range of topics, from information systems for the monitoring and dissemination of accurate information to the public, to leveraging existing evidence currently available in a huge corpus of virus infection-and pandemic-related research, to building more realistic models of disease risk, spread, and effect of societal interventions, to as-yet poorly understood post-pandemic effects on public health. a critical need for any strategy that addresses covid- is adequate disease monitoring. at the level of cases and deaths, several efforts around the world have arisen to maintain and display official counts, including by researchers at johns hopkins university (https:// coronavirus.jhu.edu/map.html) and reporters at the new york times (https://www. nytimes.com/interactive/ /world/coronavirus-maps.html). these and other efforts rely on reports obtained from heterogeneous sources, many of which capture and store data differently, requiring that informaticians process and display data effectively. case and death counts are helpful and widely used by healthcare systems, policy makers, governmental institutions, and the general public. however, they are notoriously biased given the differing availability and use of lab-based tests to determine covid- case status at various locations. more comprehensive efforts to track the true impact of covid- necessitate appropriate wide-scale testing for sars-cov- . knowledge of who carries the virus, regardless of symptom or disease status, enables efficient prevention of further transmission, the proper identification of risk factors that lead to divergent symptoms, and adequate preparation of healthcare systems to treat patients who are carriers while minimizing risk to providers and patients who have not been infected. design and deployment of population-level testing should be a primary goal for the effective containment of covid- . in conjunction with apps developed by informaticists, contact tracing along with case isolation can proceed effectively to control outbreaks [ ] . such efforts are thought to have curtailed the spread of covid- in singapore and south korea. because it is unlikely, in countries like the u.s., that federal or local governments, or many citizens, would use contact tracing without ensuring individual-level data is safeguarded, various informaticists are engaged in efforts to create privacy-preserving contact tracing apps. sars-cov- containment was not successful in most countries, due in part to a lack of appropriate wide-scale testing, which contributed to its undetected transmission. ultimately, nothing can replace appropriate lab-based viral testing to understand disease transmission, but informatics solutions are helpful to partly overcome testing inadequacies. in the u.s., canada, and mexico, covid near you (https://covidnearyou.org/) is a citizen participation platform via which any person can contribute their current health status as it relates to covid- symptoms and test results. aggregation of this individual-level data is being used to track population-level health in real-time. other data that can be used to fill monitoring gaps include search engine data (e.g., google queries for covid- -related terms) and, to a lesser extent, social media data (e.g., twitter posts related to covid- ). informaticists are leading and contributing to such efforts around the world.
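the following sketch illustrates, with an invented schema, how individual symptom reports of this kind can be aggregated into county-level daily indicators with pandas; participatory platforms perform essentially this aggregation, at much larger scale and with careful privacy protections.

```python
# illustrative sketch with a hypothetical schema for citizen-reported check-ins.
import pandas as pd

reports = pd.DataFrame({
    "county": ["suffolk", "suffolk", "middlesex", "middlesex", "middlesex"],
    "date": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-04-01",
                            "2020-04-02", "2020-04-02"]),
    "fever": [1, 0, 1, 1, 0],
    "tested_positive": [0, 0, 1, 0, 0],
})

# daily county-level indicators: number of reports, fever rate, reported positives
daily = (reports.groupby(["county", "date"])
                .agg(reports_n=("fever", "size"),
                     fever_rate=("fever", "mean"),
                     positives=("tested_positive", "sum"))
                .reset_index())
print(daily)
```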
as results of sars-cov- tests, along with serological assays to detect seroconversion, become more widely available, retrospective studies can proceed to more accurately determine how covid- spreads and how many true cases existed prior to widespread testing. informaticians can participate in these efforts, which require accounting for test characteristics (sensitivity/specificity) and comparing the characteristics of patients who were actually tested versus those of the underlying population. ongoing retrospective analyses such as these are critical to gain the knowledge necessary to avoid future resurgences of covid- .
systems for disseminating accurate information related to covid- to the public
an emerging issue that concerns the prevention of covid- is the widespread dissemination of speculation, rumors, half-truths, disinformation, and conspiracy theories by means of popular social media platforms. in order for policies, guidelines, and mandates, which may be updated on a weekly or even daily basis, to reach and be adopted by the general public, it is important for relevant, vetted information sources to be clearly identified and potentially pointed to in response to misleading posts. in recent years there have been many exciting efforts to combine natural language processing (nlp), machine learning, and social media scraping to monitor clinical outcomes of interest such as foodborne illnesses [ ] . there may be an opportunity to work towards adapting such informatics approaches to monitor and perhaps even combat the dissemination of 'bad' information through automated responses that redirect individuals to sources identified as reliable within the scientific community. rule-based systems such as 'expert systems' could be combined with nlp technologies to construct such monitoring and response frameworks. equally important is the consumer health informatics task of developing clear, concise, and easily navigable informational resources for covid- that summarize up-to-date information and guidelines but also link summary information back to relevant primary sources, attempt to quantify the certainty/reliability of available information, and offer explanations of reasoning whenever such information or guidelines need to be updated. the spread of infectious diseases such as covid- provides a unique opportunity to assess the regional spread and progression of disease at a population level. differences in the pathogenic mechanisms of the diseases responsible for past pandemics imply that the spread of covid- may not be completely predictable based on observed historical rates of disease transmission. data on the cumulative number of covid- cases is available at country/regional/city levels, and by studying the progression and spread of disease in regions affected close to the time of the initial outbreak, meaningful projections of infection rates can be made for areas which will be affected later. for example, by modeling daily regional cumulative covid- cases, regional differences in the trends can illuminate the comparative effectiveness of different policy decisions and can identify countries and policies that have succeeded in slowing the rate of covid- spread, providing evidence for the adoption of effective public health policies by areas still in the early phases of the pandemic. presenting this information to the public using data visualization methods is an important informatics activity.
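as a simple illustration of comparing regional trends, the sketch below fits a log-linear trend to synthetic cumulative case counts and converts the growth rate into a doubling time for each region. real analyses must additionally handle reporting delays, changes in testing, and the eventual departure from exponential growth.

```python
# illustrative sketch on synthetic counts; a crude early-phase exponential fit only.
import numpy as np

days = np.arange(10)
regions = {
    "region_a": np.array([5, 8, 13, 20, 33, 52, 80, 130, 200, 320]),
    "region_b": np.array([5, 6, 8, 10, 13, 17, 21, 27, 34, 43]),
}

for name, cases in regions.items():
    # slope of log(cases) over time approximates the exponential growth rate
    slope, _ = np.polyfit(days, np.log(cases), 1)
    doubling_time = np.log(2) / slope
    print(f"{name}: growth rate {slope:.2f}/day, doubling time {doubling_time:.1f} days")
```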
synthesizing evidence to understand covid- origins, spread, and prevention
as of april , , there are more than manuscripts published or posted at pubmed, biorxiv, and medrxiv on covid- from researchers all over the world (https://www.ncbi.nlm.nih.gov/research/coronavirus/). these manuscripts cover a wide spectrum of important topics that can help us to understand the critical aspects of the clinical and public health impacts of covid- , including the disease mechanism, diagnosis, treatment, prevention, viral infection, replication, pathogenesis, transmission, viral host-range, and virulence. on the other hand, the amount of information is increasingly overwhelming for stakeholders, policymakers, researchers and interested parties to comprehend. a systematic review, which is a type of literature review that uses systematic methods to collect secondary data and critically appraise research studies, can be useful in synthesizing the existing evidence from covid- -related research findings. in particular, meta-analysis plays a central role in the systematic review by quantitatively synthesizing evidence from multiple scientific studies which address related questions. manual literature review is time consuming and, more importantly, it is challenging to keep up-to-date with the rapidly increasing volume of literature. medical informatics tools can improve the efficiency and scalability of up-to-date evidence synthesis for covid- -related research. for example, clinical natural language processing (nlp) tools can be used for literature screening and information retrieval. software such as abstrackr [ ] [ ] [ ] and distillersr (https://www.evidencepartners.com/) has been used to reduce the manual effort in literature screening. beyond literature screening, distillersr is also a useful tool for the management of the multi-step workflow of the systematic review process. recently, distillersr made its tool freely available for systematic reviewers and researchers to conduct systematic reviews related to covid- . for meta-analysis, tools such as comprehensive meta-analysis (cma) (https://www.meta-analysis.com/), revman (https://training.cochrane.org/online-learning/core-software-cochrane-reviews/ revman), and macros in stata (https://www.stata.com/) are available for standard meta-analyses. however, for covid- -related research, more sophisticated methods are needed in order to address unique features related to this topic. for example, the quality of the reported findings in the above-mentioned manuscripts is expected to be highly heterogeneous, especially for those manuscripts that have not been peer-reviewed. it is critically important to properly account for such heterogeneity across studies. furthermore, the reported findings may be subject to more severe publication bias and outcome reporting bias [ ] , as the analysis of the data and reporting of the analysis results are likely to be based on different protocols. visualization tools, sensitivity analyses, and inference based on bias correction models can be useful in evaluating the quality of the evidence [ ] [ ] [ ] [ ] [ ] [ ] [ ] . in addition, novel visualization tools, such as the tornado plot in a cumulative meta-analysis [ ] , will be valuable for presenting how the cumulative evidence answering a covid- -related question evolves over time. r packages including 'meta', 'metafor', 'metasens', 'netmeta', 'mvmeta', 'mada' and 'xmeta' are useful for advanced meta-analyses with these needs.
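to make the pooling step concrete, the following sketch implements standard inverse-variance meta-analysis with a dersimonian-laird estimate of between-study heterogeneity, using hypothetical log odds ratios. the dedicated packages listed above provide far richer diagnostics, bias corrections, and visualisations; this is only a minimal numerical illustration.

```python
# illustrative sketch with invented study-level effects; not a substitute for
# dedicated meta-analysis software.
import numpy as np

effects = np.array([0.42, 0.18, 0.55, 0.30])     # study log odds ratios (hypothetical)
variances = np.array([0.04, 0.02, 0.09, 0.05])   # within-study variances (hypothetical)

w = 1.0 / variances                               # fixed-effect inverse-variance weights
fixed = np.sum(w * effects) / np.sum(w)

# dersimonian-laird estimate of between-study variance (tau^2)
q = np.sum(w * (effects - fixed) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (q - (len(effects) - 1)) / c)

w_re = 1.0 / (variances + tau2)                   # random-effects weights
random_effect = np.sum(w_re * effects) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
print(f"fixed effect: {fixed:.3f}, random effects: {random_effect:.3f} "
      f"(95% ci {random_effect - 1.96 * se_re:.3f} to {random_effect + 1.96 * se_re:.3f})")
```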
finally, online platforms for meta-analysis, such as programs with shiny interfaces, are greatly needed to offer covid- researchers a convenient way of summarizing and synthesizing results.
advanced, more realistic models of disease spread to guide policymaking
differential equation-based epidemiological models such as the susceptible-infected-recovered (sir) or susceptible-exposed-infected-recovered (seir) models and their variants are key workhorses for studying infectious disease dynamics. these models have been widely used in making projections and informing policy-makers in constructing mitigation strategies for the disease. one weakness of these models is that they treat individuals in a given population as homogeneous, with constant risk rates, exposure rates, infection rates, and recovery/death rates throughout the larger group. this is a gross oversimplification and a primary cause of the models' limited predictive accuracy. statisticians have been engaging in covid- efforts with statistical models using functional data or time series modeling techniques. these models often use covariates or latent factors to account for population heterogeneity and provide uncertainty quantification, thus improving on a weakness of the seir models. however, these models do not represent the dynamic infectious disease process, which may limit their interpretability and accuracy in forecasting. one key area of quantitative research that can emerge from this covid- crisis is hybrid epidemiology-statistical models, that is, models based on sir or seir frameworks that allow the transition probabilities to differ stochastically according to person-level or environmental covariates, that account for clustering effects, and that effectively propagate uncertainty in the forecasting. these can combine the strengths of each type of model, and given the broad availability of large-scale data on mobility, density, demographics, etc. that vary across different communities, they can produce much more realistic models and more accurate projections to guide policymaking. a minimal numerical sketch of the basic seir dynamics is shown below. the covid- pandemic has resulted in unprecedented disruption to the healthcare system. in addition to understanding the direct health impacts of the disease, there is a public health need to understand the secondary effects of covid- -related healthcare disruption on access to and timeliness of care for other urgent conditions, and the resultant effects on health outcomes. prioritizing healthcare resources for covid- patients and efforts to depopulate healthcare settings in order to reduce healthcare-related disease transmission have resulted in reduced access to care for patients across the spectrum of clinical need and severity, including delayed access to surgery for cancer patients, organ transplant recipients, and others with time-sensitive conditions. public health informatics can play an important role in informing our understanding of how the effects of healthcare disruption propagate across a community, affecting access to care and population health. answering questions about the effect of healthcare disruption on population health requires three components: ( ) access to data on healthcare utilization and outcomes, ( ) data on the timing and types of public health and hospital-level interventions, and ( ) causal inference methodologies that support our ability to draw conclusions about the causal effects of these interventions.
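returning to the compartmental models discussed above, the following is a minimal numerical sketch of a deterministic seir model integrated with scipy, of the kind criticised for assuming a homogeneous population. all parameter values are hypothetical; hybrid approaches would let the transmission rate vary with covariates, interventions, and stochastic effects.

```python
# illustrative sketch: deterministic, homogeneous-mixing seir model with
# hypothetical parameters, not a calibrated forecast.
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, n):
    s, e, i, r = y
    ds = -beta * s * i / n          # new exposures
    de = beta * s * i / n - sigma * e
    di = sigma * e - gamma * i
    dr = gamma * i
    return ds, de, di, dr

n = 1_000_000
beta, sigma, gamma = 0.5, 1 / 5.2, 1 / 10   # transmission, incubation, recovery rates
y0 = (n - 10, 0, 10, 0)                     # start with 10 infectious individuals
t = np.linspace(0, 180, 181)

s, e, i, r = odeint(seir, y0, t, args=(beta, sigma, gamma, n)).T
print(f"peak infectious: {i.max():.0f} on day {t[i.argmax()]:.0f}")
```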
data on health care utilization and outcomes can be obtained from a variety of sources including individual and multi-institutional ehr data and claims databases. data on public health interventions are already being compiled by researchers, including national and international databases of policy changes (https://is.gd/cqs th, https://is.gd/lvvuiz, https://is.gd/ mlcu i). finally, disentangling the causal impacts of covid- itself; interventions at the local, state, and federal level; and interventions and innovation at the individual health system level requires the rigorous implementation of study designs and analytic methods for causal inference. a number of techniques in common use in health services and econometrics research can be harnessed for this purpose including interrupted time series and difference-in-difference designs [ ] . the total number of users of social media continues to grow worldwide, resulting in the generation of vast amounts of data. popular social networking sites such as facebook, twitter, and instagram dominate this sphere. about million tweets and . billion facebook messages are posted every day (https://www.gwava.com/blog/internet-data-created-daily). a pew research report (http://www.pewinternet.org/fact-sheet/social-media/) states that nearly half of adults worldwide and two-thirds of all american adults ( %) use social networking. the report states that of the total users, % have discussed health information, and, of those, % changed behavior based on this information and % discussed current medical conditions. advances in automated data processing, machine learning, and nlp present the possibility of utilizing this massive data source for biomedical and public health applications, if researchers address the methodological challenges unique to this media. when events such as the covid- pandemic sweep the world, the public turns to social media. while there is a general belief that most of the content is not useful, adequate collection, filtering, and analysis could reveal potentially useful information for assessing public sentiment. furthermore, given the delay and shortage of available testing in the united states, social media could provide a near real-time monitoring capability (e.g. the penn covid- u.s. twitter map, https://is.gd/l gga), giving insights into the true burden of disease. preliminary work in this direction is under review. the archived version of the paper, with a training dataset and annotation guidelines as supplementary material, is available [ ] . although social media text mining research for health applications is still incipient, the domain has seen a surge in interest in recent years. numerous studies have been published of late in this realm, including studies on pharmacovigilance [ ] , identifying user behavioral patterns [ ] , identifying user social circles with common experiences (like drug abuse) [ ] , monitoring malpractice [ ] , and tracking infectious/viral disease spread [ , ] . population and public health topics are most addressed, although different social networks may be suitable for specific targeted tasks. for example, while twitter data has been utilized for surveillance and content analysis, a significant portion of research using facebook has focused on communication rather than lexical content processing [ , ] . for health monitoring and surveillance research from social media, the most common topic has been influenza surveillance [ , ] . 
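as a baseline illustration of the text classification used in such surveillance pipelines, the sketch below trains a tf-idf plus logistic regression model to separate personal symptom reports from other posts. the handful of labelled examples is invented for illustration; real systems are trained on thousands of annotated posts and must handle negation, sarcasm, and informal spelling.

```python
# illustrative sketch with a tiny invented training set; a baseline only.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tweets = [
    "day 3 of fever and this cough will not quit",
    "lost my sense of smell, getting tested tomorrow",
    "stay home and wash your hands everyone",
    "the game last night was unbelievable",
    "my whole family has been coughing since monday",
    "new paper on coronavirus modelling is out",
]
labels = [1, 1, 0, 0, 1, 0]   # 1 = personal symptom report, 0 = other

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression())
clf.fit(tweets, labels)
print(clf.predict(["i have had a dry cough and chills for two days"]))
```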
from the perspective of informatics and nlp, proposed techniques have typically been in the areas of data collection (e.g., keywords and queries) [ , ] , text classification [ , ] , and information extraction [ ] . while innovative approaches have been proposed, there is still a lot of progress to be made in this domain. effective utilization of the health-related knowledge contained in social media will require a joint effort by the research community, bringing together researchers from distinct fields including nlp, machine learning, data science, biomedical informatics, medicine, pharmacology, and public health. the knowledge gaps among researchers in these communities need to be reduced by community sharing of data and the development of novel applied systems. the covid- pandemic presents a myriad of challenges and opportunities for research across virtually every scientific discipline, and biomedical informatics is no exception. from the molecular and genetic sciences to population health, researchers in the five domains of biomedical informatics stand to make substantial contributions to addressing these challenges. we hope, through the numerous examples of research we have considered in this editorial, informatics researchers and practitioners can see possible avenues for their work. there is no dearth of opportunities related to covid- for those working in informatics, and it is our hope that informaticians will vigorously explore these as they arise. furthermore, we hope that those who are not informaticians will appreciate the contributions that informatics researchers can bring to their respective fields as we all seek to address the covid- pandemic and its effects around the world.
references
the covid- vaccine development landscape
diagnostic testing for severe acute respiratory syndrome-related coronavirus- : a narrative review
laboratory testing of sars-cov, mers-cov, and sars-cov- ( -ncov): current status, challenges, and countermeasures
temporal dynamics in viral shedding and transmissibility of covid-
vitro diagnostic assays for covid- : recent advances and emerging trends
therapeutic options for the novel coronavirus ( -ncov)
risk factors of fatal outcome in hospitalized subjects with coronavirus disease from a nationwide analysis in china
feasibility of controlling covid- outbreaks by isolation of cases and contacts
a systematic review of covid- epidemiology based on current evidence
human ace receptor polymorphisms predict sars-cov- susceptibility
correlation of chest ct and rt-pcr testing in coronavirus disease (covid- ) in china: a report of cases
the role of chest imaging in patient management during the covid- pandemic: a multinational consensus statement from the fleischner society
review of artificial intelligence techniques in imaging data acquisition
detection of sars-cov- in different types of clinical specimens
enabling pregnant women and their physicians to make informed medication decisions using artificial intelligence
international electronic health record-derived covid- clinical course profile: the ce consortium. medrxiv
rapid design and implementation of an integrated patient self-triage and self-scheduling tool for covid-
national academies of sciences e. social isolation and loneliness in older adults: opportunities for the health care system
transparent and flexible fingerprint sensor array with multiplexed detection of tactile pressure and skin temperature
machine-learned epidemiology: real-time detection of foodborne illness at scale. npj digital med
toward systematic review automation: a practical guide to using machine learning tools in research synthesis
semi-automated screening of biomedical citations for systematic reviews
deploying an interactive machine learning system in an evidence-based practice center: abstrackr
publication bias in meta-analysis: prevention, assessment and adjustments
meta-analysis, funnel plots and sensitivity analysis
a sensitivity analysis for publication bias in systematic reviews: statistical methods in medical research
bias in meta-analysis detected by a simple, graphical test
the case of the misleading funnel plot
maximum likelihood estimation and em algorithm of copas-like selection model for publication bias correction
funnel plots for detecting bias in meta-analysis: guidelines on choice of axis
misleading funnel plot for detection of bias in meta-analysis
cumulative meta-analysis of therapeutic trials for myocardial infarction
designing difference in difference studies: best practices for public health policy research
a chronological and geographical analysis of personal reports of covid- on twitter
towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks
the role of facebook in crush the crave, a mobile-and social media-based smoking cessation intervention: qualitative framework analysis of posts
an exploration of social circles and prescription drug abuse through twitter
malpractice and malcontent: analyzing medical complaints in twitter
national and local influenza surveillance through twitter: an analysis of the - influenza epidemic
you are what your tweet: analyzing twitter for public health
please like me: facebook and public health communication
facebook advertising across an engagement spectrum: a case example for public health communication
using social media to perform local influenza surveillance in an inner-city hospital: a retrospective observational study
evaluating google, twitter, and wikipedia as tools for influenza surveillance using bayesian change point analysis: a comparative analysis
phonetic spelling filter for keyword selection in drug mention mining from social media
scoping review on search queries and social media for disease surveillance: a chronology of innovation
text classification for automatic detection of e-cigarette use and use for smoking cessation from twitter
twitter catches the flu: detecting influenza epidemics using twitter
pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features
key: cord- - gv y authors: bello-orgaz, gema; jung, jason j.; camacho, david title: social big data: recent achievements and new challenges date: - - journal: inf fusion doi: . /j.inffus. . . sha: doc_id: cord_uid: gv y
big data has become an important issue for a large number of research areas such as data mining, machine learning, computational intelligence, information fusion, the semantic web, and social networks.
the rise of different big data frameworks such as apache hadoop and, more recently, spark, for massive data processing based on the mapreduce paradigm has allowed for the efficient utilisation of data mining methods and machine learning algorithms in different domains. a number of libraries such as mahout and sparkmlib have been designed to develop new efficient applications based on machine learning algorithms. the combination of big data technologies and traditional machine learning algorithms has generated new and interesting challenges in other areas such as social media and social networks. these new challenges are focused mainly on problems such as data processing, data storage, data representation, and how data can be used for pattern mining, analysing user behaviours, and visualizing and tracking data, among others. in this paper, we present a review of the new methodologies that are designed to allow for efficient data mining and information fusion from social media, and of the new applications and frameworks that are currently appearing under the "umbrella" of the social networks, social media and big data paradigms. data volume and the multitude of sources have experienced exponential growth, creating new technical and application challenges; data generation has been estimated at . exabytes ( exabyte = . . terabytes) of data per day [ ] . these data come from everywhere: sensors used to gather climate, traffic and flight information, posts to social media sites (twitter and facebook are popular examples), digital pictures and videos (youtube users upload hours of new video content per minute [ ] ), transaction records, and cell phone gps signals, to name a few. the classic methods, algorithms, frameworks, and tools for data management have become both inadequate for processing this amount of data and unable to offer effective solutions for managing the data growth. the problem of managing and extracting useful knowledge from these data sources is currently one of the most popular topics in computing research [ , ] . in this context, big data is a popular phenomenon that aims to provide an alternative to traditional solutions based on databases and data analysis. big data is not just about storage or access to data; its solutions aim to analyse data in order to make sense of them and exploit their value. big data refers to datasets whose size ranges from terabytes to petabytes, and it is usually characterised by the following features:
• volume: refers to the very large and continuously growing amounts of data gathered from multiple sources, whose collection, processing and analysis generate a number of challenges in obtaining valuable knowledge for people and companies (see value feature).
• velocity: refers to the speed of data transfers. the data's contents are constantly changing through the absorption of complementary data collections, the introduction of previous data or legacy collections, and the different forms of streamed data from multiple sources. from this point of view, new algorithms and methods are needed to adequately process and analyse the online and streaming data.
• variety: refers to different types of data collected via sensors, smartphones or social networks, such as videos, images, text, audio, data logs, and so on. moreover, these data can be structured (such as data from relational databases) or unstructured in format.
• value: refers to the process of extracting valuable information from large sets of social data, and it is usually referred to as big data analytics. value is the most important characteristic of any big-data-based application, because it allows useful business information to be generated.
• veracity: refers to the correctness and accuracy of information.
behind any information management practice lie the core doctrines of data quality, data governance, and metadata management, along with considerations of privacy and legal concerns. some examples of potential big data sources are the open science data cloud [ ] , the european union open data portal, open data from the u.s. government, healthcare data, public datasets on amazon web services, etc. social media [ ] has become one of the most representative and relevant data sources for big data. social media data are generated from a wide number of internet applications and web sites, with some of the most popular being facebook, twitter, linkedin, youtube, instagram, google, tumblr, flickr, and wordpress. the fast growth of these web sites allows users to be connected and has created a new generation of people (maybe a new kind of society [ ] ) who are enthusiastic about interacting, sharing, and collaborating using these sites [ ] . this information has spread to many different areas such as everyday life [ ] (e-commerce, e-business, e-tourism, hobbies, friendship, ...), education [ ] , health [ ] , and daily work. in this paper, we assume that social big data comes from joining the efforts of the two previous domains: social media and big data. therefore, social big data will be based on the analysis of vast amounts of data that could come from multiple distributed sources but with a strong focus on social media. hence, social big data analysis [ , ] is inherently interdisciplinary and spans areas such as data mining, machine learning, statistics, graph mining, information retrieval, linguistics, natural language processing, the semantic web, ontologies, and big data computing, among others. its applications can be extended to a wide number of domains such as health and political trending and forecasting, hobbies, e-business, cybercrime, counterterrorism, time-evolving opinion mining, social network analysis, and human-machine interactions. the concept of social big data can be defined as follows: "those processes and methods that are designed to provide sensitive and relevant knowledge to any user or company from social media data sources when data sources can be characterised by their different formats and contents, their very large size, and the online or streamed generation of information." the gathering, fusion, processing and analysis of big social media data from unstructured (or semi-structured) sources to extract valuable knowledge is an extremely difficult task which has not been completely solved. the classic methods, algorithms, frameworks and tools for data management have become inadequate for processing the vast amount of data. this issue has generated a large number of open problems and challenges in the social big data domain related to different aspects such as knowledge representation, data management, data processing, data analysis, and data visualisation [ ] . some of these challenges include accessing very large quantities of unstructured data (management issues), determining how much data is enough to obtain a large quantity of high-quality data (quality versus quantity), processing dynamically changing data streams, and ensuring sufficient privacy (ownership and security), among others. however, given the very large heterogeneous datasets from social media, one of the major challenges is to identify the valuable data and how to analyse them to discover useful knowledge that improves the decision making of individual users and companies [ ] .
in order to analyse the social media data properly, the traditional analytic techniques and methods (data analysis) require adaptation and integration into the new big data paradigms that have emerged for massive data processing. different big data frameworks such as apache hadoop [ ] and spark [ ] have arisen to allow for the efficient application of data mining methods and machine learning algorithms in different domains. based on these big data frameworks, several libraries such as mahout [ ] and sparkmlib [ ] have been designed to develop new efficient versions of classical algorithms. this paper is focused on reviewing those new methodologies, frameworks, and algorithms that are currently appearing under the big data paradigm, and their applications to a wide number of domains such as e-commerce, marketing, security, and healthcare. finally, summarising the concepts mentioned previously, fig. shows the conceptual representation of the three basic social big data areas: social media as a natural source for data analysis; big data as a parallel and massive processing paradigm; and data analysis as a set of algorithms and methods used to extract and analyse knowledge. the intersections between these clusters reflect the concept of mixing those areas. for example, the intersection between big data and data analysis shows some machine learning frameworks that have been designed on top of big data technologies (mahout [ ] , mlbase [ , ] , or sparkmlib [ ] ). the intersection between data analysis and social media represents the concept of current web-based applications that intensively use social media information, such as applications related to marketing and e-health that are described in section . the intersection between big data and social media is reflected in some social media applications such as linkedin, facebook, and youtube that are currently using big data technologies (mongodb, cassandra, hadoop, and so on) to develop their web systems. finally, the centre of this figure only represents the main goal of any social big data application: knowledge extraction and exploitation. the rest of the paper is structured as follows; section provides an introduction to the basics on the methodologies, frameworks, and software used to work with big data. section provides a description of the current state of the art in the data mining and data analytic techniques that are used in social big data. section describes a number of applications related to marketing, crime analysis, epidemic intelligence, and user experiences. finally, section describes some of the current problems and challenges in social big data; this section also provides some conclusions about the recent achievements and future trends in this interesting research area. currently, the exponential growth of social media has created serious problems for traditional data analysis algorithms and techniques (such as data mining, statistics, machine learning, and so on) due to their high computational complexity for large datasets. this type of method does not properly scale as the data size increases. for this reason, the methodologies and frameworks behind the big data concept are becoming very popular in a wide number of research and industrial areas. this section provides a short introduction to the methodology based on the mapreduce paradigm and a description of the most popular framework that implements this methodology, apache hadoop.
afterwards, apache spark is described as an emerging big data framework that improves on the current performance of the hadoop framework. finally, some implementations and tools for the big data domain related to distributed data file systems, data analytics, and machine learning techniques are presented. mapreduce [ , ] is presented as one of the most efficient big data solutions. this programming paradigm and its related algorithms [ ] were developed to provide significant improvements in large-scale data-intensive applications in clusters [ ] . the programming model implemented by mapreduce is based on the definition of two basic elements: mappers and reducers. the idea behind this programming model is to design map functions (or mappers) that are used to generate a set of intermediate key/value pairs, after which the reduce functions will merge (reduce can be used as a shuffling or combining function) all of the intermediate values that are associated with the same intermediate key. the key aspect of the mapreduce algorithm is that if every map and reduce is independent of all other ongoing maps and reduces, then the operations can be run in parallel on different keys and lists of data. although three functions, map(), combining()/shuffling(), and reduce(), are the basic processes in any mapreduce approach, usually they are decomposed as follows: . prepare the input: the mapreduce system designates map processors (or worker nodes), assigns the input key value k that each processor would work on, and provides each processor with all of the input data associated with that key value. . the map() step: each worker node applies the map() function to the local data and writes the output to a temporary storage space. the map() code is run exactly once for each k key value, generating output that is organised by key values k . a master node arranges it so that for redundant copies of input data only one is processed. . the shuffle() step: the map output is sent to the reduce processors, which assign the k key value that each processor should work on, and provide that processor with all of the map-generated data associated with that key value, such that all data belonging to one key are located on the same worker node. . the reduce() step: worker nodes process each group of output data (per key) in parallel, executing the user-provided reduce() code; each function is run exactly once for each k key value produced by the map step. . produce the final output: the mapreduce system collects all of the reduce outputs and sorts them by k to produce the final outcome. fig. shows the classical "word count problem" using the mapreduce paradigm. as shown in fig. , initially a process will split the data into a subset of chunks that will later be processed by the mappers. once the key/values are generated by the mappers, a shuffling process is used to mix (combine) these key values (combining the same keys in the same worker node). finally, the reduce functions are used to count the words, generating a common output as the result of the algorithm. as a result of the execution of the mappers/reducers, the output will be a sorted list of word counts from the original text input. finally, and before the application of this paradigm, it is essential to understand whether the algorithms can be translated to mappers and reducers or whether the problem can be analysed using traditional strategies.
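the word-count example above can be expressed directly as explicit map, shuffle, and reduce phases; the following plain-python sketch mirrors the steps described, with a small invented input. in practice these functions would be handed to a framework such as hadoop streaming or spark rather than run in a single process.

```python
# illustrative sketch: word count as explicit map/shuffle/reduce phases on toy input.
from collections import defaultdict

def map_phase(chunk):
    # emit (word, 1) pairs for every word in an input split
    return [(word, 1) for word in chunk.lower().split()]

def shuffle_phase(pairs):
    # group intermediate values by key, as the framework does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # sum the grouped counts for each word
    return {word: sum(counts) for word, counts in grouped.items()}

chunks = ["i thought of thinking", "of thanking you"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
print(sorted(reduce_phase(shuffle_phase(pairs)).items()))
# [('i', 1), ('of', 2), ('thanking', 1), ('thinking', 1), ('thought', 1), ('you', 1)]
```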
mapreduce provides an excellent technique to work with large sets of data when the algorithm can work on small pieces of that dataset in parallel, but if the algorithm cannot be mapped into this methodology, it may be "trying to use a sledgehammer to crack a nut". any mapreduce system (or framework) is based on a mapreduce engine that allows for implementing the algorithms and distributing the parallel processes. apache hadoop [ ] is an open-source software framework written in java for the distributed storage and distributed processing of very large datasets using the mapreduce paradigm. all of the modules in hadoop have been designed taking into account the assumption that hardware failures (of individual machines or of racks of machines) are commonplace and thus should be automatically managed in the software by the framework. the core of apache hadoop comprises a storage area, the hadoop distributed file system (hdfs), and a processing area (mapreduce). the hdfs (see section . . ) spreads multiple copies of the data across different machines. this not only offers reliability without the need for raid-controlled disks but also allows for multiple locations to run the mapping. if a machine with one copy of the data is busy or offline, another machine can be used. a job scheduler (in hadoop, the jobtracker) keeps track of which mapreduce jobs are executing; schedules individual maps; reduces intermediate merging operations to specific machines; monitors the successes and failures of these individual tasks; and works to complete the entire batch job. the hdfs and the job scheduler can be accessed by the processes and programs that need to read and write data and to submit and monitor the mapreduce jobs. however, hadoop presents a number of limitations: . for maximum parallelism, you need the maps and reduces to be stateless, to not depend on any data generated in the same mapreduce job. you cannot control the order in which the maps run or the reductions. . hadoop is very inefficient (in both cpu time and power consumed) if you are repeating similar searches repeatedly. a database with an index will always be faster than running a mapreduce job over un-indexed data. however, if that index needs to be regenerated whenever data are added, and data are being added continually, mapreduce jobs may have an edge. . in the hadoop implementation, reduce operations do not take place until all of the maps have been completed (or have failed and been skipped). as a result, you do not receive any data back until the entire mapping has finished. . there is a general assumption that the output of the reduce is smaller than the input to the map. that is, you are taking a large data source and generating smaller final values. apache spark [ ] is an open-source cluster computing framework that was originally developed in the amplab at university of california, berkeley. spark had over contributors in june , making it a very high-activity project in the apache software foundation and one of the most active big data open source projects. it provides high-level apis in java, scala, python, and r and an optimised engine that supports general execution graphs. it also supports a rich set of high-level tools including spark sql for sql and structured data processing, spark mllib for machine learning, graphx for graph processing, and spark streaming. the spark framework allows for reusing a working set of data across multiple parallel operations. 
this includes many iterative machine learning algorithms as well as interactive data analysis tools. therefore, this framework supports these applications while retaining the scalability and fault tolerance of mapreduce. to achieve these goals, spark introduces an abstraction called resilient distributed datasets (rdds). an rdd is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. in contrast to hadoops two-stage disk-based mapreduce paradigm (mappers/reducers), sparks in-memory primitives provide performance up to times faster for certain applications by allowing user programs to load data into a clusters memory and to query it repeatedly. one of the multiple interesting features of spark is that this framework is particularly well suited to machine learning algorithms [ [ ] ]. from a distributed computing perspective, spark requires a cluster manager and a distributed storage system. for cluster management, spark supports stand-alone (native spark cluster), hadoop yarn, and apache mesos. for distributed storage, spark can interface with a wide variety, including the hadoop distributed file system, apache cassandra, openstack swift, and amazon s . spark also supports a pseudo-distributed local mode that is usually used only for development or testing purposes, when distributed storage is not required and the local file system can be used instead; in this scenario, spark is running on a single machine with one executor per cpu core. a list related to big data implementations and mapreduce-based applications was generated by mostosi [ ] . although the author finds that "it is [the list] still incomplete and always will be", his "big-data ecosystem table" [ ] contains more than references related to different big data technologies, frameworks, and applications and, to the best of this authors knowledge, is one of the best (and more exhaustive) lists related to available big data technologies. this list comprises different topics related to big data, and a selection of those technologies and applications were chosen. those topics are related to: distributed programming, distributed files systems, a document data model, a key-value data model, a graph data model, machine learning, applications, business intelligence, and data analysis. this selection attempts to reflect some of the recent popular frameworks and software implementations that are commonly used to develop efficient mapreduce-based systems and applications. • apache pig. pig provides an engine for executing data flows in parallel on hadoop. it includes a language, pig latin, for expressing these data flows. pig latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. • apache storm. storm is a complex event processor and distributed computation framework written basically in the clojure programming language [ ] . it is a distributed real-time computation system for rapidly processing large streams of data. storm is an architecture based on a master-workers paradigm, so that a storm cluster mainly consists of master and worker nodes, with coordination done by zookeeper [ ] . • stratosphere [ ] . stratosphere is a general-purpose cluster computing framework. it is compatible with the hadoop ecosystem:, accessing data stored in the hdfs and running with hadoops new cluster manager yarn. 
the common input formats of hadoop are supported as well. stratosphere does not use hadoops mapreduce implementation; it is a completely new system that brings its own runtime. the new runtime allows for defining more advanced operations that include more transformations than only map and reduce. additionally, stratosphere allows for expressing analysis jobs using advanced data flow graphs, which are able to resemble common data analysis task more naturally. • apache hdfs. the most extended and popular distributed file system for mapreduce frameworks and applications is the hadoop distributed file system. the hdfs offers a way to store large files across multiple machines. hadoop and hdfs were derived from the google file system (gfs) [ ] . cassandra is a recent open source fork of a stand-alone distributed non-sql dbms system that was initially coded by facebook, derived from what was known of the original google bigtable [ ] and google file system designs [ ] . cassandra uses a system inspired by amazons dynamo for storing data, and mapreduce can retrieve data from cassandra. cassandra can run without the hdfs or on top of it (the datastax fork of cassandra). • apache giraph. giraph is an iterative graph processing system built for high scalability. it is currently used at facebook to analyse the social graph formed by users and their connections. giraph was originated as the open-source counterpart to pregel [ ] , the graph processing framework developed at google (see section . for a further description). • mongodb. mongodb is an open-source document-oriented database system and is part of the nosql family of database systems [ ] . it provides high performance, high availability, and automatic scaling. instead of storing data in tables as is done in a classical relational database, mongodb stores structured data as json-like documents, which are data structures composed of fields and value pairs. its index system supports faster queries and can include keys from embedded documents and arrays. moreover, this database allows users to distribute data across a cluster of machines. • apache mahout [ ] . the mahout(tm) machine learning (ml) library is an apache(tm) project whose main goal is to build scalable libraries that contain the implementation of a number of the conventional ml algorithms (dimensionality reduction, classification, clustering, and topic models, among others). in addition, this library includes implementations for a set of recommender systems (user-based and item-based strategies). the first versions of mahout implemented the algorithms built on the hadoop framework, but recent versions include many new implementations built on the mahout-samsara environment, which runs on spark and h o. the new spark-item similarity implementations enable the next generation of co-occurrence recommenders that can use entire user click streams and contexts in making recommendations. • spark mllib [ ] . mllib is sparks scalable machine learning library, which consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives. it supports writing applications in java, scala, or python and can run on any hadoop /yarn cluster with no preinstallation. the first version of mllib was developed at uc berkeley by contributors, and it provided a limited set of standard machine learning methods. 
however, mllib is currently experiencing dramatic growth, and it has over contributors from over organisations. • mlbase [ ] . the mlbase platform consists of three layers: ml optimizer, mllib, and mli. ml optimizer (currently under development) aims to automate the task of ml pipeline construction. the optimizer solves a search problem over the feature extractors and ml algorithms included in mli and mllib. mli [ ] is an experimental api for feature extraction and algorithm development that introduces high-level ml programming abstractions. a prototype of mli has been implemented against spark and serves as a test bed for mllib. finally, mllib is apache sparks distributed ml library. mllib was initially developed as part of the mlbase project, and the library is currently supported by the spark community. pentaho is an open source data integration (kettle) tool that delivers powerful extraction, transformation, and loading capabilities using a groundbreaking, metadata-driven approach. it also provides analytics, reporting, visualisation, and a predictive analytics framework that is directly designed to work with hadoop nodes. it provides data integration and analytic platforms based on hadoop in which datasets can be streamed, blended, and then automatically published into one of the popular analytic databases. • sparkr. there is an important number of r-based applications for mapreduce and other big data applications. r [ ] is a popular and extremely powerful programming language for statistics and data analysis. sparkr provides an r frontend for spark. it allows users to interactively run jobs from the r shell on a cluster, automatically serializes the necessary variables to execute a function on the cluster, and also allows for easy use of existing r packages. social big data analytic can be seen as the set of algorithms and methods used to extract relevant knowledge from social media data sources that could provide heterogeneous contents, with very large size, and constantly changing (stream or online data). this is inherently interdisciplinary and spans areas such as data mining, machine learning, statistics, graph mining, information retrieval, and natural language among others. this section provides a description of the basic methods and algorithms related to network analytics, community detection, text analysis, information diffusion, and information fusion, which are the areas currently used to analyse and process information from social-based sources. today, society lives in a connected world in which communication networks are intertwined with daily life. for example, social networks are one of the most important sources of social big data; specifically, twitter generates over million tweets every day [ ] . in social networks, individuals interact with one another and provide information on their preferences and relationships, and these networks have become important tools for collective intelligence extraction. these connected networks can be represented using graphs, and network analytic methods [ ] can be applied to them for extracting useful knowledge. graphs are structures formed by a set of vertices (also called nodes) and a set of edges, which are connections between pairs of vertices. the information extracted from a social network can be easily represented as a graph in which the vertices or nodes represent the users and the edges represent the relationships among them (e.g., a re-tweet of a message or a favourite mark in twitter). 
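as a small illustration of this representation, the sketch below builds a directed graph from hypothetical retweet interactions with the python networkx library (the user names and edges are invented for the example) and computes two of the metrics discussed next.

```python
import networkx as nx

# hypothetical retweet interactions: (user_who_retweets, original_author)
retweets = [
    ("alice", "bob"), ("carol", "bob"), ("dave", "bob"),
    ("bob", "alice"), ("carol", "alice"), ("dave", "carol"),
]

g = nx.DiGraph()
g.add_edges_from(retweets)

# degree centrality: fraction of other users each node is directly connected to
degree = nx.degree_centrality(g)

# betweenness centrality: how often a node lies on shortest paths between other nodes
betweenness = nx.betweenness_centrality(g)

for user in g.nodes:
    print(f"{user}: degree={degree[user]:.2f}, betweenness={betweenness[user]:.2f}")
```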
a number of network metrics can be used to perform social analysis of these networks. usually, importance, or influence, in a social network is analysed through centrality measures. these measures have high computational complexity in large-scale networks. to solve this problem, focusing on large-scale graph analysis, a second generation of frameworks based on the mapreduce paradigm has appeared, including hama, giraph (based on pregel), and graphlab, among others [ ]. pregel [ ] is a graph-parallel system based on the bulk synchronous parallel (bsp) model [ ]. a bsp abstract computer can be interpreted as a set of processors that can follow different threads of computation, in which each processor is equipped with fast local memory and interconnected by a communication network. accordingly, a platform based on this model comprises three major components:
• components capable of processing and/or local memory transactions (i.e., processors).
• a network that routes messages between pairs of these components.
• a hardware facility that allows for the synchronisation of all or a subset of components.
taking this model into account, a bsp algorithm is a sequence of global supersteps, each of which consists of three components:
1. concurrent computation: every participating processor may perform local asynchronous computations.
2. communication: the processes exchange data from one processor to another, facilitating remote data storage capabilities.
3. barrier synchronisation: when a process reaches this point (the barrier), it waits until all other processes have reached the same barrier.
hama [ ] and giraph are two distributed graph processing frameworks on hadoop that implement pregel. the main difference between the two frameworks is how matrix computation is performed using the mapreduce paradigm. apache giraph is an iterative graph processing system in which the input is a graph composed of vertices and directed edges. computation proceeds as a sequence of iterations (supersteps). initially, every vertex is active, and in each superstep every active vertex invokes the "compute" method that implements the graph algorithm to be executed. this means that the algorithms implemented using giraph are vertex oriented. apache hama does not only allow users to work with pregel-like graph applications; this computing engine can also be used to perform compute-intensive general scientific applications and machine learning algorithms. moreover, it currently supports yarn, the resource management technology that lets multiple computing frameworks run on the same hadoop cluster using the same underlying storage; therefore, the same data could be analysed using mapreduce or spark. in contrast, graphlab is based on a different concept. whereas pregel is a one-vertex-centric model, this framework uses a vertex-to-node mapping in which each vertex can access the state of adjacent vertices. in pregel, the interval between two supersteps is defined by the run time of the vertex with the largest neighbourhood. the graphlab approach improves on this by splitting vertices with large neighbourhoods across different machines and synchronising them.
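the vertex-centric, superstep-based style described above can be illustrated with a small single-process python sketch that propagates the maximum vertex value through a toy graph (a classic pregel example); it only simulates the bsp phases in memory and is not giraph or hama code, and the graph and values are invented.

```python
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}    # adjacency list of an undirected toy graph
values = {1: 3, 2: 6, 3: 2, 4: 1}                  # initial per-vertex values
messages = {v: [values[v]] for v in graph}         # superstep 0: each vertex "receives" its own value

superstep = 0
while any(messages.values()):                      # run until no messages are in flight
    next_messages = {v: [] for v in graph}
    for v, incoming in messages.items():           # concurrent computation (conceptually in parallel)
        if not incoming:
            continue                               # a vertex with no messages stays halted
        best = max(incoming)
        if best > values[v] or superstep == 0:     # value improved (or first superstep): stay active
            values[v] = max(values[v], best)
            for neighbour in graph[v]:             # communication: send the new value to neighbours
                next_messages[neighbour].append(values[v])
    messages = next_messages                       # barrier synchronisation between supersteps
    superstep += 1

print(values)   # all vertices converge to the global maximum: {1: 6, 2: 6, 3: 6, 4: 6}
```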
finally, elser and montresor [ ] present a study of these data frameworks and their application to graph algorithms. the k-core decomposition algorithm, whose goal is to compute the centrality of each node in a given graph, is adapted to each framework. the results obtained confirm the improvement achieved in terms of execution time for these frameworks based on hadoop. however, from a programming paradigm point of view, the authors recommend pregel-inspired (vertex-centric) frameworks, which are a better fit for graph-related problems. the community detection problem in complex networks has been the subject of many studies in the field of data mining and social network analysis. the goal of the community detection problem is similar to the idea of graph partitioning in graph theory [ , ]. a cluster in a graph can be easily mapped onto a community. despite the ambiguity of the community definition, numerous techniques have been used for detecting communities. random walks, spectral clustering, modularity maximization, and statistical mechanics have all been applied to detecting communities [ ]. these algorithms are typically based on the topology information from the graph or network. with regard to graph connectivity, each cluster should be connected; that is, there should be multiple paths that connect each pair of vertices within the cluster. it is generally accepted that a subset of vertices forms a good cluster if the induced sub-graph is dense and there are few connections from the included vertices to the rest of the graph [ ]. considering both connectivity and density, a possible definition of a graph cluster could be a connected component or a maximal clique [ ]; the latter is a sub-graph to which no vertex can be added without losing the clique property. one of the most well-known algorithms for community detection was proposed by girvan and newman [ ]. this method uses a new similarity measure called "edge betweenness", based on the number of shortest paths between all vertex pairs. the proposed algorithm is based on identifying the edges that lie between communities and successively removing them, achieving the isolation of the communities. the main disadvantage of this algorithm is its high computational complexity on very large networks. modularity is the most used and best known quality measure for graph clustering techniques, but its computation is an np-complete problem. however, there are currently a number of algorithms based on good approximations of modularity that are able to detect communities in a reasonable time. the first greedy technique to maximize modularity was a method proposed by newman [ ]. this was an agglomerative hierarchical clustering algorithm in which groups of vertices were successively joined to form larger communities such that modularity increased after the merging. the matrix update in the newman algorithm involved a large number of useless operations owing to the sparseness of the adjacency matrix. however, the algorithm was improved by clauset et al. [ ], who used the matrix of modularity variations to make the algorithm perform more efficiently. despite the improvements to and modifications of the accuracy of these greedy algorithms, they perform poorly when compared against other techniques. for this reason, newman reformulated the modularity measure in terms of eigenvectors by replacing the laplacian matrix with the modularity matrix [ ], called the spectral optimization of modularity. this improvement can also be applied to improve the results of other optimization techniques [ , ].
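both the edge-betweenness approach of girvan and newman and greedy modularity maximisation are available in standard libraries; the sketch below runs them with python's networkx package on its built-in karate-club graph (an illustration only, not the authors' original implementations).

```python
import networkx as nx
from networkx.algorithms import community

g = nx.karate_club_graph()   # classic small social network shipped with networkx

# girvan-newman: iteratively remove the edge with the highest betweenness
gn_iter = community.girvan_newman(g)
first_split = next(gn_iter)                  # communities after the first split
print([sorted(c) for c in first_split])

# greedy, agglomerative modularity maximisation (clauset-newman-moore style)
greedy = community.greedy_modularity_communities(g)
print([sorted(c) for c in greedy])

# modularity score of the greedy partition
print(community.modularity(g, greedy))
```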
random walks can also be useful for finding communities. if a graph has a strong community structure, a random walker spends a long time inside a community because of the high density of internal edges and the consequent number of paths that could be followed. zhou and lipowsky [ ], based on the fact that walkers move preferentially towards vertices that share a large number of neighbours, defined a proximity index that indicates how close a pair of vertices is to all other vertices. communities are detected with a procedure called netwalk, an agglomerative hierarchical clustering method in which the similarity between vertices is expressed by their proximity. a number of these techniques are focused on finding disjoint communities: the network is partitioned into dense regions in which nodes have more connections to each other than to the rest of the network. in some domains, however, a vertex could belong to several clusters; for instance, it is well known that people in a social network form natural memberships in multiple communities. therefore, overlap is a significant feature of many real-world networks. to address this, fuzzy clustering algorithms applied to graphs [ ] and overlapping approaches [ ] have been proposed. xie et al. [ ] reviewed the state of the art in overlapping community detection algorithms. this work noted that for networks with low overlapping density, slpa, oslom, game, and copra offer better performance, while for networks with high overlapping density and high overlapping diversity, both slpa and game provide relatively stable performance. however, the test results also suggested that detection in such networks is still not fully resolved. a common feature observed by various algorithms in real-world networks is the relatively small fraction of overlapping nodes (typically less than %), each of which belongs to only a few communities. a significant portion of the unstructured content collected from social media is text. text mining techniques can be applied for the automatic organization, navigation, retrieval, and summarisation of huge volumes of text documents [ ] [ ] [ ]. this concept covers a number of topics and algorithms for text analysis, including natural language processing (nlp), information retrieval, data mining, and machine learning [ ]. information extraction techniques attempt to extract entities and their relationships from texts, allowing for the inference of new meaningful knowledge; these kinds of techniques are the starting point for a number of text mining algorithms. a usual model for representing the content of documents or text is the vector space model. in this model, each document is represented by a vector of the frequencies of the remaining terms within the document [ ]. the term frequency (tf) relates the number of occurrences of a particular word in the document to the number of words in the entire document. another commonly used function is the inverse document frequency (idf); typically, documents are represented as tf-idf feature vectors. using this data representation, a document is a data point in n-dimensional space, where n is the size of the corpus vocabulary. text data tend to be sparse and high dimensional. a text document corpus can be represented as a large sparse tf-idf matrix, and applying dimensionality reduction methods to represent the data in a compressed format [ ] can be very useful.
latent semantic indexing [ ] is an automatic indexing method that projects both documents and terms into a low-dimensional space that attempts to represent the semantic concepts in the document. this method is based on the singular value decomposition of the term-document matrix, which constructs a low-ranking approximation of the original matrix while preserving the similarity between the documents. another family of dimension reduction techniques is based on probabilistic topic mod-els such as latent dirichlet allocation (lda) [ ] . this technique provides the mechanism for identifying patterns of term co-occurrence and using those patterns to identify coherent topics. standard lda implementations of the algorithm read the documents of the training corpus numerous times and in a serial way. however, new, efficient, parallel implementations of this algorithm have appeared [ ] in attempts to improve its efficiency. unsupervised machine learning methods can be applied to any text data without the need for a previous manual process. specifically, clustering techniques are widely studied in this domain to find hidden information or patterns in text datasets. these techniques can automatically organise a document corpus into clusters or similar groups based on a blind search in an unlabelled data collection, grouping the data with similar properties into clusters without human supervision. generally, document clustering methods can be mainly categorized into two types [ ] : partitioning algorithms that divide a document corpus into a given number of disjoint clusters that are optimal in terms of some predefined criteria functions [ ] and hierarchical algorithms that group the data points into a hierarchical tree structure or a dendrogram [ ] . both types of clustering algorithms have strengths and weaknesses depending on the structure and characteristics of the dataset used. in zhao and karypis [ ] , a comparative assessment of different clustering algorithms (partitioning and hierarchical) was performed using different similarity measures on high-dimensional text data. the study showed that partitioning algorithms perform better and can also be used to produce hierarchies of higher quality than those returned by the hierarchical ones. in contrast, the classification problem is one of the main topics in the supervised machine learning literature. nearly all of the wellknown techniques for classification, such as decision trees, association rules, bayes methods, nearest neighbour classifiers, svm classifiers, and neural networks, have been extended for automated text categorisation [ ] . sentiment classification has been studied extensively in the area of opinion mining research, and this problem can be formulated as a classification problem with three classes, positive, negative and neutral. therefore, most of the existing techniques designed for this purpose are based on classifiers [ ] . however, the emergence of social networks has created massive and continuous streams of text data. therefore, new challenges have been arising in adapting the classic machine learning methods, because of the need to process these data in the context of a one-pass constraint [ ] . this means that it is necessary to perform data mining tasks online and only one time as the data come in. for example, the online spherical k-means algorithm [ ] is a segment-wise approach that was proposed for streaming text clustering. 
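the following sketch illustrates the segment-wise idea with scikit-learn's hashing vectorizer and mini-batch k-means; it is not the online spherical k-means algorithm itself, although the l2-normalised vectors make the incremental updates behave similarly to cosine-based spherical clustering, and the decay factor discussed next is not modelled. the document texts are invented placeholders.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

# stateless vectorizer: suitable for streams because it needs no global vocabulary
vectorizer = HashingVectorizer(n_features=2**12, norm="l2", alternate_sign=False)
model = MiniBatchKMeans(n_clusters=3, random_state=0)

# hypothetical stream, already split into small segments (batches of documents)
segments = [
    ["new flu cases reported", "hospital reports flu outbreak"],
    ["team wins the football final", "great match and a deserved win"],
    ["stock market falls sharply", "investors react to market drop"],
]

for segment in segments:
    x = vectorizer.transform(segment)   # each segment fits comfortably in memory
    model.partial_fit(x)                # update the cluster centroids incrementally

# assign new incoming documents to the learned clusters
new_docs = ["flu vaccine shortage", "market rally continues"]
print(model.predict(vectorizer.transform(new_docs)))
```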
the online spherical k-means technique splits the incoming text stream into small segments that can be processed effectively in memory. then, a set of k-means iterations is applied to each segment in order to cluster it. moreover, a decay factor is included so that older, less important documents carry less weight during the clustering process. one of the most important roles of social media is to spread information through social links. with the large amount of data and the complex structures of social networks, it has become even more difficult to understand how (and why) information is spread by social reactions (e.g., retweeting on twitter and likes on facebook). such understanding can be applied to various applications, e.g., viral marketing, popular topic detection, and virus prevention [ ]. as a result, many studies have been proposed for modelling information diffusion patterns on social networks. the characteristics of the diffusion models are (i) the topological structure of the network (a sub-graph composed of the set of users to whom the information has been spread) and (ii) temporal dynamics (the evolution of the number of users whom the information has reached over time) [ ]. according to the analytics, these diffusion models can be categorized into explanatory and predictive approaches [ ]:
• explanatory models: the aim of these models is to discover the hidden spreading cascades once the activation sequences are collected. these models can build a path that helps users to easily understand how the information has been diffused. the netint method [ ] has applied sub-modular-function-based iterative optimisation to discover the spreading cascade (path) that maximises the likelihood of the collected dataset. in particular, for working with missing data, a k-tree model [ ] has been proposed to estimate the complete activation sequences.
• predictive models: these are based on learning processes over the observed diffusion patterns. depending on the previous diffusion patterns, there are two main categories of predictive models: (i) structure-based models (graph-based approaches) and (ii) content-analysis-based models (non-graph-based approaches).
moreover, there are further approaches to understanding information diffusion patterns. the projected greedy approach for non-sub-modular problems [ ] was recently proposed to select useful seeds in social networks; this approach can identify the partial optimisation for understanding information diffusion. additionally, an evolutionary dynamics model was presented in [ ], [ ] that attempted to understand the temporal dynamics of information diffusion over time. one of the relevant topics for analysing information diffusion patterns and models is the concept of time and how it can be represented and managed. one of the popular approaches is based on time series. any time series can be defined as a chronological collection of observations or events. the main characteristics of this type of data are large size, high dimensionality, and continuous change. in the context of data mining, the main problem is how to represent the data; an effective mechanism for compressing the vast amount of time series data is needed in the context of information diffusion. based on such a representation, different data mining techniques can be applied, such as pattern discovery, classification, rule discovery, and summarisation [ ]. in lin et al. [ ], a new symbolic representation of time series is proposed that allows for a dimensionality/numerosity reduction.
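lin et al.'s exact representation is not reproduced here, but the underlying idea, compressing a series by piecewise aggregate approximation and then discretising the segment averages into a short symbolic string, can be sketched with numpy on synthetic data as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=96))        # synthetic activity signal, e.g. mentions per interval

def symbolise(ts, n_segments=8, alphabet="abcd"):
    ts = (ts - ts.mean()) / ts.std()           # z-normalise the series
    segments = np.array_split(ts, n_segments)  # piecewise aggregate approximation (PAA)
    paa = np.array([seg.mean() for seg in segments])
    # breakpoints from quantiles so that symbols are (roughly) equiprobable
    breakpoints = np.quantile(paa, np.linspace(0, 1, len(alphabet) + 1)[1:-1])
    indices = np.searchsorted(breakpoints, paa)
    return "".join(alphabet[i] for i in indices)

print(symbolise(series))   # 96 numeric points reduced to an 8-symbol string
```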
the representation proposed by lin et al. is tested on different classic data mining tasks such as clustering, classification, query by content, and anomaly detection. based on the mathematical models mentioned above, we can compare a number of applications that support users in many different domains. one of the most promising applications is detecting meaningful social events and popular topics in society. such meaningful events and topics can be discovered by well-known text processing schemes (e.g., tf-idf) and simple statistical approaches (e.g., lda, gibbs sampling, and the tste method [ ]). in particular, not only the time domain but also the frequency domain has been exploited to identify the most frequent events [ ]. social big data from various sources needs to be fused to provide users with better services. this fusion can be done in different ways and affects different technologies, methods, and even research areas. two of these areas are ontologies and social networks; how each could benefit from information fusion in social big data is briefly described next:
• ontology-based fusion. semantic heterogeneity is an important issue in information fusion. social networks have inherently different semantics from other types of network. such semantic heterogeneity includes not only linguistic differences (e.g., between 'reference' and 'bibliography') but also mismatches between conceptual structures. to deal with these problems, in [ ] ontologies are exploited from multiple social networks and, more importantly, semantic correspondences are obtained by ontology matching methods. more practically, semantic mashup applications have been illustrated. to remedy the data integration issues of traditional web mashups, the semantic technologies use linked open data (lod), based on the rdf data model, as the unified data model for combining, aggregating, and transforming data from heterogeneous data resources to build linked data mashups [ ].
• social network integration. the next issue is how to integrate distributed social networks. as many kinds of social networking services have been developed, users join multiple services for social interactions with other users and collect a large amount of information (e.g., statuses on facebook and tweets on twitter). an interesting framework has been proposed for a social identity matching (sim) method across these multiple social networking services [ ]. the proposed approach can protect user privacy because only public information (e.g., usernames and the social relationships of the users) is employed to find the best matches between social identities. in particular, a cloud-based platform has been applied to build a software infrastructure in which social network information can be shared and exchanged [ ].
social big data analysis can be applied to social media data sources to discover relevant knowledge that can be used to improve the decision making of individual users and companies [ ]. in this context, business intelligence can be defined as the techniques, systems, methodologies, and applications that analyse critical business data to help an enterprise better understand its business and market and to support business decisions [ ]. this field includes methodologies that can be applied to different areas such as e-commerce, marketing, security, and healthcare [ ]; more recent methodologies have been applied to social big data.
this section provides short descriptions of some applications of these methodologies in domains that intensively use social big data sources for business intelligence. marketing researchers believe that big social media analytics and cloud computing offer a unique opportunity for businesses to obtain opinions from a vast number of customers, improving traditional strategies. a significant market transformation has been accomplished by leading e-commerce enterprises such as amazon and ebay through their innovative and highly scalable e-commerce platforms and recommender systems. social network analysis extracts user intelligence and can provide firms with the opportunity to generate more targeted advertising and marketing campaigns. maurer and wiegmann [ ] show an analysis of advertising effectiveness on social networks. in particular, they carried out a case study using facebook to determine users' perceptions regarding facebook ads. the authors found that most of the participants perceived the ads on facebook as annoying or not helpful for their purchase decisions. however, trattner and kappe [ ] show how ads placed on users' social streams, generated by the facebook tools and applications, can increase the number of visitors and the profit and roi of a web-based platform. in addition, the authors present an analysis of real-time measures to detect the most valuable users on facebook. a study of microblogging (twitter) utilization as an ewom (electronic word-of-mouth) advertising mechanism is carried out in jansen et al. [ ]. this work analyses the range, frequency, timing, and content of tweets in various corporate accounts. the results obtained show that % of microblogs mention a brand, and of the branding microblogs, nearly % contained some expression of brand sentiment. therefore, the authors conclude that microblogging reports what customers really feel about the brand and its competitors in real time, and it is a potential advantage to explore it as part of companies' overall marketing strategies. customers' brand perceptions and purchasing decisions are increasingly influenced by social media services, and these offer new opportunities to build brand relationships with potential customers. another approach that uses twitter data is presented in asur et al. [ ] to forecast box-office revenues for movies. the authors show how a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. moreover, sentiment extraction from twitter is used to improve the forecasting power of social media. because of the exponential growth in the use of social networks, researchers are actively attempting to model the dynamics of viral marketing based on the information diffusion process. ma et al. [ ] proposed modelling social network marketing using heat diffusion processes. heat diffusion is a physical phenomenon: heat always flows from a position with a higher temperature to a position with a lower temperature.
the authors present three diffusion models along with three algorithms for selecting the best individuals to receive marketing samples. these models can diffuse both positive and negative comments on products or brands in order to simulate the real opinions within social networks. moreover, the authors' complexity analysis shows that the model is also scalable to large social networks. the table below shows a brief summary of the previously described applications, including the basic functionalities of each and their basic methods.
table: basic features related to social big data applications in the marketing area (reference | summary | methods).
trattner and kappe [ ] | targeted advertising on facebook | real-time measures to detect the most valuable users
jansen et al. [ ] | twitter as an ewom advertising mechanism | sentiment analysis
asur et al. [ ] | using twitter to forecast box-office revenues for movies | topic detection, sentiment analysis
ma et al. [ ] | viral marketing in social networks | social network analysis, information diffusion models
criminals tend to have repetitive pattern behaviours, and these behaviours depend upon situational factors; that is, crime will be concentrated in environments with features that facilitate criminal activities [ ]. the purpose of crime data analysis is to identify these crime patterns, allowing for detecting and discovering crimes and their relationships with criminals. the knowledge extracted from applying data mining techniques can be very useful in supporting law enforcement agencies. communication between citizens and government agencies is mostly through telephones, face-to-face meetings, email, and other digital forms. most of these communications are saved or transformed into written text and then archived in a digital format, which has led to opportunities for automatic text analysis using nlp techniques to improve the effectiveness of law enforcement [ ]. a decision support system that combines the use of nlp techniques, similarity measures, and classification approaches is proposed by ku and leroy [ ] to automate and facilitate crime analysis. filtering reports and identifying those that are related to the same or similar crimes can provide useful information to analyse crime trends, which allows for apprehending suspects and improving crime prevention. traditional crime data analysis techniques are typically designed to handle one particular type of dataset and often overlook geospatial distribution. geographic knowledge discovery can be used to discover patterns of criminal behaviour that may help in detecting where, when, and why particular crimes are likely to occur. based on this concept, phillips and lee [ ] present a crime data analysis technique that allows for discovering co-distribution patterns between large, aggregated and heterogeneous datasets. in this approach, aggregated datasets are modelled as graphs that store the geospatial distribution of crime within given regions, and then these graphs are used to discover datasets that show similar geospatial distribution characteristics. the experimental results obtained in this work show that it is possible to discover geospatial co-distribution relationships among crime incidents and socio-economic, socio-demographic and spatial features. another analytical technique that is now in high use by law enforcement agencies to visually identify where crime tends to be highest is hotspot mapping. this technique is used to predict where crime may happen, using data from the past to inform future actions. each crime event is represented as a point, allowing for the geographic distribution analysis of these points. a number of mapping techniques can be used to identify crime hotspots, such as point mapping, thematic mapping of geographic areas, spatial ellipses, grid thematic mapping, and kernel density estimation (kde), among others.
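as an illustration of the kde idea, the sketch below estimates a crime-intensity surface from a set of incident coordinates with scipy's gaussian kernel density estimator; the coordinates are synthetic, and a real analysis would use projected geographic coordinates and a carefully chosen bandwidth.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# synthetic incident locations: two hotspots plus background noise (x, y in arbitrary map units)
hotspot_a = rng.normal(loc=(2.0, 3.0), scale=0.3, size=(150, 2))
hotspot_b = rng.normal(loc=(7.0, 6.0), scale=0.5, size=(100, 2))
background = rng.uniform(low=0.0, high=10.0, size=(50, 2))
incidents = np.vstack([hotspot_a, hotspot_b, background])

# gaussian_kde expects an array of shape (n_dims, n_points)
kde = gaussian_kde(incidents.T)

# evaluate the density on a regular grid to obtain a hotspot surface
xs, ys = np.meshgrid(np.linspace(0, 10, 100), np.linspace(0, 10, 100))
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)

# the grid cell with the highest estimated intensity marks the strongest hotspot
peak = np.unravel_index(density.argmax(), density.shape)
print("strongest hotspot near:", xs[peak], ys[peak])
```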
chainey et al. [ ] conducted a comparative assessment of these techniques, and the results obtained showed that kde was the technique that consistently outperformed the others. moreover, the authors offered a benchmark against which the results of other techniques and other crime types can be compared, including comparisons between advanced spatial analysis techniques and prediction mapping methods. another novel approach, using spatio-temporally tagged tweets for crime prediction, is presented by gerber [ ]. this work shows the use of twitter, applying linguistic analysis and statistical topic modelling to automatically identify discussion topics across a city in the united states. the experimental results showed that adding twitter data improved crime prediction performance compared with a standard approach based on kde. finally, the use of data mining in fraud detection is very popular, and there are numerous studies in this area. atm phone scams are one well-known type of fraud. kirkos et al. [ ] analysed the effectiveness of data mining classification techniques (decision trees, neural networks and bayesian belief networks) for identifying fraudulent financial statements, and the experimental results concluded that bayesian belief networks provided higher accuracy for fraud classification. another approach to detecting fraud in real-time credit card transactions was presented by quah and sriganesh [ ]. the system these authors proposed uses a self-organizing map to filter and analyse customer behaviour to detect fraud. the main idea is to detect the patterns of the legal cardholder and of the fraudulent transactions through neural network learning and then to develop rules for these two different behaviours. one typical fraud in this area is the atm phone scam, which attempts to transfer a victim's money into fraudulent accounts. in order to identify the signs of fraudulent accounts and the patterns of fraudulent transactions, li et al. [ ] applied bayesian classification and association rules. detection rules are developed based on the identified signs and applied to the design of a fraudulent account detection system. the table below shows a brief summary of all of the applications that were previously mentioned, providing a description of the basic functionalities of each and their main methods.
table: summary of crime analysis applications (reference | summary | methods).
phillips and lee [ ] | technique to discover geospatial co-distribution relations among crime incidents | network analysis
chainey et al. [ ] | comparative assessment of mapping techniques to predict where crimes may happen | spatial analysis, mapping methods
gerber [ ] | identify discussion topics across a city in the united states to predict crimes | linguistic analysis, statistical topic modelling
kirkos et al. [ ] | identification of fraudulent financial statements | classification (decision trees, neural networks and bayesian belief networks)
quah and sriganesh [ ] | fraud detection in real-time credit card transactions | neural network learning, association rules
li et al. [ ] | identify the signs of fraudulent accounts and the patterns of fraudulent transactions | bayesian classification, association rules
epidemic intelligence can be defined as the early identification, assessment, and verification of potential public health risks [ ] and the timely dissemination of the appropriate alerts. this discipline includes surveillance techniques for the automated and continuous analysis of unstructured free text or media information available on the web from social networks, blogs, digital news media, and official sources.
text mining techniques have been applied to biomedical text corpora for named entity recognition, text classification, terminology extraction, and relationship extraction [ ] . these methods are human language processing algorithms that aim to convert unstructured textual data from large-scale collections to a specific format, filtering them according to need. they can be used to detect words related to diseases or their symptoms in published texts [ ] . however, this can be difficult because the same word can refer to different things depending upon context. furthermore, a specific disease can have multiple associated names and symptoms, which increases the complexity of the problem. ontologies can help to automate human understanding of key concepts and the relationships between them, and they allow for achieving a certain level of filtering accuracy. in the health domain, it is necessary to identify and link term classes such as diseases, symptoms, and species in order to detect the potential focus of disease outbreaks. currently, there are a number of available biomedical ontologies that contain all of the necessary terms. for example, the biocaster ontology [ ] is based on the owl semantic web language, and it was designed to support automated reasoning across terms in languages. the increasing popularity and use of microblogging services such as twitter are recently a new valuable data source for web-based surveillance because of their message volume and frequency. twitter users may post about an illness, and their relationships in the network can give us information about whom they could be in contact with. furthermore, user posts retrieved from the public twitter api can come with gps-based location tags, which can be used to locate the potential centre of disease outbreaks. a number of works have already appeared that show the potential of twitter messages to track and predict outbreaks. a document classifier to identify relevant messages was presented in culotta [ ] . in this work, twitter messages related to the flu were gathered, and then a number of classification systems based on different regression models to correlate these messages with cdc statistics were compared; the study found that the best model had a correlation of . (simple model regression). aramaki [ ] presented a comparative study of various machinelearning methods to classify tweets related to influenza into two categories: positive and negative. their experimental results showed that the svm model that used polynomial kernels achieved the highest accuracy (fmeasure of . ) and the lowest training time. well-known regression models were evaluated on their ability to assess disease outbreaks from tweets in bodnar and salathé [ ] . regression methods such as linear, multivariable, and svm were applied to the raw count of tweets that contained at least one of the keywords related to a specific disease, in this case "flu". the models also validated that even using irrelevant tweets and randomly generated datasets, regression methods were able to assess disease levels comparatively well. a new unsupervised machine learning approach to detect public health events was proposed in fisichella et al. [ ] that can complement existing systems because it allows for identifying public health events even if no matching keywords or linguistic patterns can be found. this new approach defined a generative model for predictive event detection from documents by modelling the features based on trajectory distributions. 
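a typical supervised pipeline of the kind used in the classification studies above can be sketched with scikit-learn as follows; the handful of labelled tweets is invented for illustration, whereas the cited works train and evaluate on large annotated collections and compare several models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy labelled data: 1 = the author is likely reporting flu symptoms, 0 = irrelevant mention
tweets = [
    "i have a fever and a terrible cough, staying home today",
    "feeling awful, pretty sure i caught the flu",
    "flu season is coming, get your shots",
    "that new zombie flu movie was great",
    "sick of this traffic, late again",
    "my throat hurts and i can barely get out of bed",
]
labels = [1, 1, 0, 0, 0, 1]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # word and bigram tf-idf features
    LogisticRegression(max_iter=1000),
)
model.fit(tweets, labels)

print(model.predict(["i think i am coming down with the flu, high fever all night"]))
```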
however, in recent years, a number of surveillance systems have appeared that apply these social mining techniques and that have been widely used by public health organizations such as the world health organization (who) and the european centre for disease prevention and control [ ]. tracking and monitoring mechanisms for early detection are critical in reducing the impact of epidemics through rapid responses. one of the earliest surveillance systems is the global public health intelligence network (gphin) [ ], developed by the public health agency of canada in collaboration with the who. it is a secure, web-based, multilingual warning tool that continuously monitors and analyses global media data sources to identify information about disease outbreaks and other events related to public healthcare. the information is filtered for relevance by an automated process and is then analysed by public health agency of canada gphin officials. from to , this surveillance system was able to detect the outbreak of sars (severe acute respiratory syndrome). the biocaster system [ ] for monitoring online media data arose from the biocaster ontology. the system continuously analyses documents reported from over rss feeds, google news, who, promed-mail, and the european media monitor, among other data sources. the extracted text is classified based on its topical relevance and plotted onto a google map using geo-information. the system has four main stages: topic classification, named entity recognition, disease/location detection, and event recognition. in the first stage, the texts are classified as relevant or non-relevant using a naive bayes classifier. then, for the relevant document corpora, entities of interest are searched for, using concept types from the ontology related to diseases, viruses, bacteria, locations, and symptoms. the healthmap project [ ] is a global disease alert map that uses data from different sources such as google news, expert-curated discussions such as promed-mail, and official organization reports such as those from the who or euro surveillance. it is an automated real-time system that monitors, organises, integrates, filters, visualises, and disseminates online information about emerging diseases. another system that collects news from the web related to human and animal health and plots the data on google maps is epispider [ ]. this tool automatically extracts information on infectious disease outbreaks from multiple sources, including promed-mail and medical web sites, and it is used as a surveillance system by public healthcare organizations, a number of universities, and health research organizations.
additionally, this system automatically converts the topic and location information from the reports into rss feeds. finally, lyon et al. [ ] conducted a comparative assessment of these three systems (biocaster, epispider, and healthmap) regarding their ability to gather and analyse information that is relevant to public health. epispider obtained more relevant documents in this study. however, depending on the language coverage of each system, the ability to acquire relevant information from different countries differed significantly; for instance, biocaster gives special priority to languages from the asia-pacific region, and epispider only considers documents written in english. the table below shows a summary of the previous applications and their related functionalities and methods.
table: basic features related to social big data applications in the health care area (reference | summary | methods).
culotta [ ] | track and predict outbreak detection using twitter | classification (regression models)
aramaki et al. [ ] | classify tweets related to influenza | classification
bodnar and salathé [ ] | assess disease outbreaks from tweets | regression methods
fisichella et al. [ ] | detect public health events | modelling trajectory distributions
gphin [ ] | identify information about disease outbreaks and other events related to public healthcare | classification of documents for relevance
biocaster [ ] | monitoring online media data related to diseases, viruses, bacteria, locations and symptoms | topic classification, named entity recognition, event recognition
healthmap [ ] | global disease alert map | mapping techniques
epispider [ ] | human and animal disease alert map | topic and location detection
big data from social media needs to be visualised for better user experiences and services. for example, the large volume of numerical data (usually in tabular form) can be transformed into different formats, and consequently user understandability can be increased. the capability of supporting timely decisions based on visualising such big data is essential to various domains, e.g., business success, clinical treatments, cyber and national security, and disaster management [ ]. thus, user-experience-based visualisation has been regarded as important for supporting decision makers in making better decisions. more particularly, visualisation is also regarded as a crucial data analytic tool for social media [ ]; it is important for understanding users' needs in social networking services. there have been many visualisation approaches to collecting (and improving) user experiences. one of the most well-known is interactive data analytics: based on a set of features of the given big data, users can interact with the visualisation-based analytics system. examples of such systems are r-based software packages [ ] and ggobi [ ]. moreover, some systems have been developed using statistical inference; a bayesian inference scheme-based multi-input/multi-output (mimo) system [ ] has been developed for better visualisation. we can also consider life-logging services that record all user experiences [ ], also known as the quantified self. various sensors can capture continuous physiological data (e.g., mood, arousal, and blood oxygen levels) together with user activities. in this context, life caching has been presented as a collaborative social action of storing and sharing users' life events in an open environment. more practically, this collaborative user experience has been applied to gaming to encourage users. systems such as insense [ ] are based on wearable devices and can collect users' experiences into a continually growing and adapting multimedia diary. the insense system uses the patterns in sensor readings from a camera, a microphone, and accelerometers to classify the user's activities and automatically collect multimedia clips when the user is in an interesting situation.
moreover, visualisation systems such as many eyes [ ] have been designed to upload datasets and create visualisations in collaborative environments, allowing users to upload data, create visualisations of those data, and leave comments on both the visualisations and the data, providing a medium to foment discussion among users. many eyes is designed for ordinary people and does not require any extensive training or prior knowledge to take full advantage of its functionalities. other visual analytics tools have shown graphical visualisations for supporting efficient analytics of the given big data. particularly, tweetpulse [ ] has built social pulses by aggregating identical user experiences in social networks (e.g., twitter) and visualised the temporal dynamics of the thematic events. finally, the table below provides a summary of those applications related to the methods used for visualisation based on user experiences.
table: applications related to visualisation based on user experiences (reference | summary | methods).
insense [ ] | collecting user experiences into a continually growing and adapting multimedia diary | classification of patterns in sensor readings from a camera, microphone, and accelerometers
many eyes [ ] | creating visualisations in a collaborative environment from uploaded data sets | visualisation layout algorithms
tweetpulse [ ] | building social pulses by aggregating identical user experiences | visualising temporal dynamics of the thematic events
with the large number and rapid growth of social media systems and applications, social big data has become an important topic in a broad array of research areas. the aim of this study has been to provide a holistic view and insights for potentially helping to find the most relevant solutions that are currently available for managing knowledge in social media. as such, we have investigated the state-of-the-art technologies and applications for processing big data from social media. these technologies and applications were discussed in the following aspects: (i) what are the main methodologies and technologies that are available for gathering, storing, processing, and analysing big data from social media? (ii) how does one analyse social big data to discover meaningful patterns? and (iii) how can these patterns be exploited as smart, useful user services through the currently deployed examples in social-based applications? more practically, this survey paper shows and describes a number of existing systems (e.g., frameworks, libraries, software applications) that have been developed and that are currently being used in various domains and applications based on social media. the paper has avoided describing or analysing those straightforward applications, such as facebook and twitter, that currently intensively use big data technologies, instead focusing on other applications (such as those related to marketing, crime analysis, or epidemic intelligence) that could be of interest to potential readers. although it is extremely difficult to predict which of the different issues studied in this work will be the next "trending topic" in social big data research, from among all of the problems and topics that are currently under study in different areas, we selected some "open topics" related to privacy issues, streaming and online algorithms, and data fusion and visualisation, providing some insights and possible future trends. in the era of online big data and social media, protecting the privacy of the users on social media has been regarded as an important issue. ironically, as the analytics introduced in this paper become more advanced, the risk of privacy leakage grows. as such, many privacy-preserving studies have been proposed to address privacy-related issues. we can note that there are two main well-known approaches. the first one is to exploit "k-anonymity", which is a property possessed by certain anonymised data [ ].
given the private data and a set of specific fields, the system (or service) has to make the data practically useful without identifying the individuals who are the subjects of the data. the second approach is "differential privacy", which can provide an efficient way to maximise the accuracy of queries from statistical databases while minimising the chances of identifying its records [ ] . however, there are still open issues related to privacy. social identification is the important issue when social data are merged from available sources, and secure data communication and graph matching are potential research areas [ ] . the second issue is evaluation. it is not easy to evaluate and test privacy-preserving services with real data. therefore, it would be particularly interesting in the future to consider how to build useful benchmark datasets for evaluation. moreover, we have to consider this data privacy issues in many other research areas. in the context of law (also, international law) enforcement, data privacy must be prevented from any illegal usages, whereas governments tend to trump the user privacy for the purpose of national securities. also, developing educational program for technicians (also, students) is important [ ] . it is still open issue on how (and what) to design the curriculum for the data privacy. one of the current main challenges in data mining related to big data problems is to find adequate approaches to analysing massive amounts of online data (or data streams). because classification methods require previous labelling, these methods also require great effort for real-time analysis. however, because unsupervised techniques do not need this previous process, clustering has become a promising field for real-time analysis, especially when these data come from social media sources. when data streams are analysed, it is important to consider the analysis goal in order to determine the best type of algorithm to be used. we were able to divide data stream analysis into two main categories: • offline analysis: we consider a portion of data (usually large data) and apply an offline clustering algorithm to analyse the data. • online analysis: the data are analysed in real time. these kinds of algorithms are constantly receiving new data and are not usually able to keep past information. a new generation of online [ , ] and streaming [ , ] algorithms is currently being developed in order to manage social big data challenges, and these algorithms require high scalability in both memory consumption [ ] and time computation. some new developments related to traditional clustering algorithms, such as the k-mean [ ] , em [ ] , which has been modified to work with the mapreduce paradigm, and more sophisticated approaches based on graph computing (such as spectral clustering), are currently being developed [ ] [ ] [ ] into more efficient versions from the state-of-theart algorithms [ , ] . finally, data fusion and data visualisation are two clear challenges in social big data. although both areas have been intensively studied with regard to large, distributed, heterogeneous, and streaming data fusion [ ] and data visualisation and analysis [ ] , the current, rapid evolution of social media sources jointly with big data technologies creates some particularly interesting challenges related to: • obtaining more reliable methods for fusing the multiple features of multimedia objects for social media applications [ ] . 
• studying the dynamics of individual and group behaviour, characterising patterns of information diffusion, and identifying influential individuals in social networks and other social media-based applications [ ] . • identifying events [ ] in social media documents via clustering and using similarity metric learning approaches to produce highquality clustering results [ ] . • the open problems and challenges related to visual analytics [ ] , especially related to the capacity to collect and store new data, are rapidly increasing in number, including the ability to analyse these data volumes [ ] , to record data about the movement of people and objects at a large scale [ ] , and to analyse spatio-temporal data and solve spatio-temporal problems in social media [ ] , among others. ibm, big data and analytics the data explosion in minute by minute data mining with big data analytics over large-scale multidimensional data: the big data revolution! ad - d-data-management-controlling-data-volume-velocity-and-variety.pdf the importance of 'big data': a definition, gartner, stamford the rise of big data on cloud computing: review and open research issues an overview of the open science data cloud big data: survey, technologies, opportunities, and challenges media, society, world: social theory and digital media practice who interacts on the web?: the intersection of users' personality and social media use users of the world, unite! the challenges and opportunities of social media the role of social media in higher education classes (real and virtual)-a literature review the dynamics of health behavior sentiments on a large online social network big social data analysis, big data comput trending: the promises and the challenges of big social data big data: issues and challenges moving forward business intelligence and analytics: from big data to big impact hadoop: the definitive spark: cluster computing with working sets mllib: machine learning in apache spark mlbase: a distributed machine-learning system mli: an api for distributed machine learning mapreduce: simplified data processing on large clusters mapreduce: simplified data processing on large clusters mapreduce algorithms for big data analysis improving mapreduce performance in heterogeneous environments proceedings of the acm sigmod international conference on management of data, sigmod ' useful stuff the big-data ecosystem table clojure programming, o'really the chubby lock service for loosely-coupled distributed systems the stratosphere platform for big data analytics the google file system bigtable: a distributed storage system for structured data pregel: a system for large-scale graph processing mongodb: the definitive guide the r book, st twitter now seeing million tweets per day, increased mobile ad revenue, says ceo an introduction to statistical methods and data analysis an evaluation study of bigdata frameworks for graph processing a bridging model for parallel computation hama: an efficient matrix computation with the mapreduce framework finding local community structure in networks community detection in graphs on clusterings-good, bad and spectral the maximum clique problem, in: handbook of combinatorial optimization community structure in social and biological networks fast algorithm for detecting community structure in networks finding community structure in very large networks modularity and community structure in networks spectral tri partitioning of networks a vector partitioning approach to detecting community structure 
key: cord- -gduhterq authors: spitzer, ernest; ren, ben; brugts, jasper j; daemen, joost; mcfadden, eugene; tijssen, jan gp; van mieghem, nicolas m title: cardiovascular clinical trials in a pandemic: immediate implications of coronavirus disease date: - - journal: card fail rev doi: . /cfr. . sha: doc_id: cord_uid: gduhterq the coronavirus disease (covid- ) pandemic started in wuhan, hubei province, china, in december , and by april , it had affected > . million people in countries and caused > , deaths. despite diverse societal measures to reduce transmission of the severe acute respiratory syndrome coronavirus , such as social distancing, quarantine, curfews and total lockdowns, its control remains challenging.
healthcare practitioners are at the frontline of defence against the virus, with increasing institutional and governmental support. nevertheless, new or ongoing clinical trials, not related to the disease itself, remain important for the development of new therapies, and require interactions among patients, clinicians and research personnel, which is challenging, given isolation measures. in this article, the authors summarise the acute effects and consequences of the covid- pandemic on current cardiovascular trials. participants in ongoing trials may not be able to attend hospitals for follow-up visits or to collect study medications. a careful and periodic risk assessment by sponsors and investigators is required to preserve the safety of trial participants (and employees) and the integrity of trials. in this article, we summarise the immediate implications of the covid- pandemic on ongoing cardiovascular trials. this review incorporates recent recommendations from the us food and drug administration (fda), the european medicines agency, the uk's medicines and healthcare products regulatory agency and australia's therapeutic goods administration, as well as personal views. [ ] [ ] [ ] [ ] [ ] [ ] planning, executing and reporting clinical trials designed for the approval of (or to extend indications for) drugs, biological products, devices and combinations thereof are highly regulated activities. clinical trialists must observe national regulations, as well as international standards, such as those proposed by the international conference on harmonization, the international organization for standardization and the international medical device regulators forum. two general principles governing the execution of clinical trials are ensuring patient safety and clinical trial integrity. according to the who, patient safety is "the absence of preventable harm to a patient during the process of health care and reduction of risk of unnecessary harm associated with health care to an acceptable minimum." as defined by the fda, data integrity refers to "the completeness, consistency, and accuracy of data. complete, consistent and accurate data should be attributable, legible, contemporaneously recorded, original or a true copy and accurate (alcoa)." data are to be recorded exactly as intended and, when retrieved at a later time, should be the same as originally recorded. while patient safety is paramount, both should be prioritised for the successful execution of clinical trials. if data integrity is compromised, study results may no longer be interpretable, reliable or usable. a pandemic has the potential to directly impact all individuals and organisations involved in clinical research (figure ) . highly contagious and rapidly spreading viruses, such as sars-cov- , require comprehensive measures to avoid human-to-human spread. with the ongoing pandemic, the world has progressively witnessed a reduction of airline activity to almost zero, widespread travel bans, limitation of private and public transportation, temporary closure of retail businesses, banning of public gatherings and the requirement to work from home. all these measures are designed to limit exposure to potential carriers of the virus. individual measures, such as meticulous hand hygiene, self-isolation and social distancing, are encouraged. public measures, such as quarantines, curfews or lockdowns, have been implemented.
however, sectors such as healthcare, food supply chains, law enforcement, governmental agencies and regulatory bodies remain indispensable, with an increased workload challenging the capacity of local and national systems, as well as risking (if not sufficiently protected) the well-being of individuals. overall, the majority of people stay at home, work remotely and limit use of healthcare systems as much as possible. the clinical trial life cycle can be divided into trial design and registration, trial start-up, enrolment, follow-up, reporting and regulatory submission. changes in trial conduct should be documented and, if substantial, incorporated as protocol amendments (although not in an expedited manner unless impacting on patient safety); lesser changes may be captured as protocol deviations related to the pandemic. regulatory agencies offer a diverse range of flexibility in such procedures, and applicable guidance documents should be consulted to establish the most appropriate approach for a particular trial. [ ] [ ] [ ] , trial enrolment should be put on hold or stopped if there is significantly reduced feasibility (e.g. drug trials with infusions), when participants require intensive care post-treatment (e.g. surgical trials) or when the investigator is unavailable. when inclusion is delayed by the pandemic, it should be dealt with in a similar manner to other circumstances that lead to a low recruitment rate. if appropriate, and especially if foreseen by the protocol, a data and safety monitoring board may assess futility due to severe impact on data collection or outcomes. however, if stopping or putting on hold a trial puts participants at increased risk, efforts should be taken to continue with trial-related activities. enrolment in cardiovascular trials generally takes place at outpatient visits or during hospitalisation. trials in patient populations with acute presentations (e.g. st-elevation mi [stemi]) may identify potentially suitable trial candidates; however, the capacity to comply with study procedures needs to be assessed, as well as considerations related to patient safety during follow-up. it is also pertinent to consider that covid- may mimic some classical presentations, such as stemi, with ecg changes shown to reflect myocarditis after angiography demonstrates non-obstructive disease. it is problematic when the trial design mandates a protocol-related treatment before angiography. furthermore, the analysis of outcomes may be rendered more difficult, and parallel analyses of the intention-to-treat and per protocol populations will be pertinent. participants in the follow-up phase (when they are generally at home) constitute a higher-risk population in the context of the pandemic. the figure referred to above summarises the main pandemic-related considerations for the parties involved in a trial:
• participants should be given the option to continue, suspend or withdraw participation; reporting of severe adverse events is expected to continue according to standard procedures and regulations.
• dsmbs may be appointed to determine the feasibility of continuing trials based on overall conduct, patient safety and data integrity.
• all meetings should be remote, by means of teleconference or video conferencing; steering committee calls, investigator meetings, endpoint adjudication committees and dsmbs should meet remotely with appropriate technology in place.
• consideration should be given to changes that limit the exposure of participants, investigators and staff to sars-cov- , including changes in enrolment and testing; reporting should differentiate between pre-pandemic, peri-pandemic and post-pandemic periods, as well as covid- positivity/negativity.
• novel trial approaches that reduce physical contact, such as remote site management and monitoring, could be considered if feasible (i.e. if privacy issues and site workload are addressed); consider virtual visits, telemedicine, electronic consent or teletrials; change the site location to outside the hospital; deliver medication to homes.
• clinical research organisations need to swiftly transition into home-based organisations and increase their level of oversight to deliver urgent and ongoing responsibilities; remote systems need to be upgraded to allow adequate online execution and oversight.
• ongoing core laboratory activities during a pandemic: core laboratories continue operations utilising virtual environments to analyse and review materials; remote analysts and supervisors utilise secure platforms with access to the required validated analysis software and study datasets; continued ict support is pivotal.
• perform a risk-benefit assessment; decide to continue with or without changes, to put on hold or to stop trials; evaluate the feasibility of protocol adherence or the need for modification; evaluate the capacity to continue trials based on human resources, logistics and drug distribution; limit enrolment; prioritise pandemic-related research, as evidence-based treatments are lacking; incorporate measures that limit physical contact between researchers and participants; and consider novel approaches that allow remote data capture and remote monitoring.
reduced capacity at investigational sites will impact on the availability to perform study visits (or phone calls) to assess and confirm eligibility, to enter data in electronic case report forms (ecrfs), to report (serious) adverse events and to follow the protocol in general. all protocol deviations should be noted, with those that are pandemic-related clearly identified. most importantly, principal investigators must ensure that enrolled subjects fully comply with eligibility criteria and that all measures are taken to report adverse events in a timely fashion, given that these two are of paramount importance for patient safety. coordinating centres may require an increased level of monitoring of ecrfs and a degree of flexibility in terms of timing for data cleaning. , , , large, multicentre collaborative trials require the participation of a coordinating centre, either a contract or an academic research organisation. coordinating centres execute the study, or activities within it, on behalf of the sponsor or manufacturer. a study team is composed of a project manager, clinical research associates and study monitors, data managers, a biostatistician, a quality assurance manager and safety reporting units, with or without a medical monitor. a pandemic prompts the need to work from home and cancel face-to-face visits. where systems are upgraded to allow remote work and staff remain available for the reception of materials, coordinating centres can continue to operate during a pandemic. site initiation and monitoring visits are cancelled, postponed or performed remotely using web-based technology (although source data verification can be postponed). remote monitoring is possible, but might not be feasible at all participating sites in a trial and increases the workload at the site. moreover, technical requirements, confidentiality issues, updated consents and the increased burden on site personnel could make it impractical.
in line with this, quality assurance measures, such as site audits, are postponed unless serious non-compliance is identified. the participation of several committees in clinical trials ensures proper scientific and operational oversight, data integrity and quality, as well as patient safety. typically, the steering committee is composed of established investigators or key opinion leaders, and representatives from parties involved (e.g. coordinating centres, sponsor, grant givers). during the pandemic, office-based professionals work from home, and participation may be limited. nevertheless, given the oversight duties of the steering committee and data and safety monitoring boards, the frequency of meetings might need to be increased to address immediate pandemic-related needs. at the beginning of the pandemic, cardiovascular clinicians saw a reduction in patient load, as the population was advised to stay at home. unfortunately, this has resulted in late presentations of severe conditions (e.g. non-stemi or decompensated heart failure). however, as hospital resources are depleted, not only in materials but also in personnel, cardiovascular clinicians are required to perform pandemic-related tasks and to self-isolate, potentially limiting their availability for participation in committees. the same applies to members of clinical event committees and data and safety monitoring boards. potential exceptions are data managers and biostatisticians. in theory, this could reduce the availability of clinicians to participate in committee calls; however, in practice, this might not be the case. committed investigators tend to stretch time when required, as shown by chinese investigators who managed to report initial cohorts despite being at the centre of the pandemic. , all meetings are planned as teleconferences. cardiovascular trials, particularly interventional trials, rely heavily on imaging. for the purpose of an unbiased and consistent analysis, central laboratories are utilised. imaging modalities, such as echocardiography, ecg, cardiac mri, angiography assessments, intracoronary imaging and cardiac ct, are frequently used. thus, for a core laboratory to ensure timely delivery during a pandemic, conditions should allow analysts and supervisors to work remotely. data should reach core laboratories electronically, with secure and certified datatransfer providers. time windows for imaging follow-up might need to be adjusted and uploading activities may also be interrupted. analysing cardiovascular images might not be as efficient at home when compared with a well-equipped work environment. however, remote access through a secure connection to software and datasets, as well as databases, will allow continuity of activities. information and communication technology departments play a pivotal role in setting up and maintaining reliable infrastructure. a lack of remote access could force activities to stop during a pandemic. safety reporting should continue in line with national regulations and following standard procedures. [ ] [ ] [ ] , investigators should ensure timely capture of serious adverse events, a process that might involve extended use of telehealth. moreover, serious adverse events should be identified, where possible, as pandemic or non-pandemic related. the inability to deliver investigational drugs could pose additional risks to participants and warrants an increased level of safety monitoring. 
, ongoing trials lacking data and safety monitoring boards might need to revisit that decision on a per-case basis. data and safety monitoring boards may independently assess an ongoing trial that has been severely affected by the pandemic (e.g. incomplete data, incomplete follow-up) to help investigators and sponsors elucidate, without compromising the integrity of the trial, whether continuing the trial will yield interpretable data. a pandemic has a significant impact on the ability to adhere to protocol requirements (e.g. missed follow-up visits or tests). importantly, protocol deviations should be documented with an indication that they are pandemic-related following standard procedures. , - data collection could be challenging, but should not stop. when reporting the results of a trial, cohorts might need to be divided as pre-pandemic, peri-pandemic and post-pandemic. statistical analysis plans might need adaptions when considering the influence of the pandemic in the interpretability of results, especially when endpoints share characteristics with covid- related events. guidance on the interpretability of results when analysing data with missing values, unbalanced completeness or out-of-window assessments (e.g. echocardiograms, control angiograms, laboratory values) might also be required, depending on the duration of the pandemic. for multicentre trials, a per-site assessment might be required for outbreak areas versus non-outbreak areas. the interpretability of the overall evidence generated should be discussed with regulatory authorities. , - the use of vaccines, once available, might also require adequate documentation in study databases to avoid unbalanced usage. ethics committees (ecs; or institutional review boards [irbs]) and regulatory agencies experience a significant increase in activity during a pandemic. ecs/irbs face the burden of protocol amendments for ongoing trials, and prioritise activities related to the pandemic, including the review of covid- trial submissions. , - regulatory agencies play a critical role in protecting citizens from threats, including emerging infectious diseases, thus the importance of providing timely guidance, such as the regulatory documents that form the basis of this article. [ ] [ ] [ ] [ ] [ ] [ ] based on accumulating experience, the advice of ecs/irbs and regulatory agencies to sponsors and investigators could be critical to determine the continuation, modification or pause of trial activities. such recommendations are complex, given the uncertainties related to the pandemic duration. a pandemic has a significant impact on every component of cardiovascular clinical research. when facing a rapidly spreading disease with no effective treatment or vaccine, efforts should be focused on facilitating the day-to-day work of healthcare professionals with required personal protective equipment. pandemic-related investigations should be prioritised. nevertheless, sponsors and investigators should take all necessary actions to ensure patient (and employee) safety and to maintain trial integrity in ongoing, nonpandemic-related clinical trials, and capture pandemic-induced trial adjustments in focused amendments so that meaningful conclusions can be achieved when reporting results. 
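as a concrete illustration of the cohort splitting and out-of-window flagging discussed above, the following sketch labels trial visits as pre-, peri- or post-pandemic and flags assessments that fall outside their protocol window; the column names, dates and window rule are hypothetical assumptions for illustration, not part of any regulatory guidance:

```python
import pandas as pd

# Hypothetical pandemic boundaries for one region; real analyses would
# define these per site/country in the statistical analysis plan.
PANDEMIC_START = pd.Timestamp("2020-03-01")
PANDEMIC_END = pd.Timestamp("2020-09-01")  # placeholder end of the peri-pandemic period

def pandemic_period(visit_date):
    """Label a visit as pre-, peri- or post-pandemic."""
    if visit_date < PANDEMIC_START:
        return "pre-pandemic"
    if visit_date <= PANDEMIC_END:
        return "peri-pandemic"
    return "post-pandemic"

# Toy visit records: scheduled vs. actual dates and an allowed visit window (days).
visits = pd.DataFrame({
    "subject_id": ["001", "002", "003"],
    "scheduled": pd.to_datetime(["2020-02-10", "2020-04-15", "2020-10-01"]),
    "actual": pd.to_datetime(["2020-02-12", "2020-06-20", "2020-10-03"]),
    "window_days": [14, 14, 14],
})

visits["period"] = visits["actual"].apply(pandemic_period)
visits["out_of_window"] = (visits["actual"] - visits["scheduled"]).abs() > pd.to_timedelta(
    visits["window_days"], unit="D"
)
print(visits[["subject_id", "period", "out_of_window"]])
```

in practice, such period labels and deviation flags would be pre-specified in the statistical analysis plan rather than derived ad hoc.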
references:
• a novel coronavirus from patients with pneumonia in china
• the novel chinese coronavirus ( -ncov) infections: challenges for fighting the storm
• clinical characteristics of coronavirus disease in china
• the reproductive number of covid- is higher compared to sars coronavirus
• severe acute respiratory syndrome - retrospect and lessons of outbreak in china
• emerging threats from zoonotic coronaviruses - from sars and mers to -ncov
• high contagiousness and rapid spread of severe acute respiratory syndrome coronavirus . emerg infect dis
• fda guidance on conduct of clinical trials of medical products during covid- public health emergency
• postmarketing adverse event reporting for medical products and dietary supplements during a pandemic
• guidance on the management of clinical trials during the covid- (coronavirus) pandemic. brussels: european medicines agency
• points to consider on implications of coronavirus disease (covid- ) on methodological aspects of ongoing clinical trials
• medicines and healthcare products regulatory agency. managing clinical trials during coronavirus (covid- )
• covid- : guidance on clinical trials for institutions, hrecs, researchers and sponsors. canberra: doh
• data-integrity-and-compliance-with-current-good-manufacturing-practice-guidance-for-industry
key: cord- - if bl u authors: wang, yanxin; li, jian; zhao, xi; feng, gengzhong; luo, xin (robert) title: using mobile phone data for emergency management: a systematic literature review date: - - journal: inf syst front doi: . /s - - -w sha: doc_id: cord_uid: if bl u emergency management (em) has always been a concern of people from all walks of life due to the devastating impacts emergencies can have. the global outbreak of covid- in has made em a top topic of concern. as mobile phones have become ubiquitous, many scholars have shown interest in using mobile phone data for em. this paper presents a systematic literature review about the use of mobile phone data for em that includes related articles written between and from six electronic databases. five themes in using mobile phone data for em emerged from the reviewed articles, and a systematic framework is proposed to illustrate the current state of the research. this paper also discusses em under the covid- pandemic and five future implications of the proposed framework to guide future work. emergency situations such as terrorist attacks or earthquakes occur at different scales daily around the world. they may be natural or human-caused events that occur suddenly, affect public order, and disrupt the regularity of an area's political, economic, and social life (fogli et al. ; seba et al. ) . such an emergency causes great losses and widespread impacts on society, and "requires a prompt intervention by all involved stakeholders" (fogli et al. ; lauras et al. ) . to gain public support and maintain regular social order, authorities should pay special attention to the effective management of such situations. in this study, emergency management (em) is defined as the effective organization, direction, and management of both emergency-related humanitarian and material resources (othman and beydoun ) . traditionally, it comprises four phases: mitigation, preparedness, response, and recovery (othman and beydoun ) . em is generally considered to have undergone three stages (phillips et al. ) , including passive response (before the s), active preparation and prediction ( s- s) , and whole community response based on integrated information systems (after the s).
to align with such developments, scholars have shifted their attention from solving a single issue to focusing on efficient intra-organizational collaboration (janssen et al. ) . lack of collaboration is the chief culprit in major failures in disaster response and takes the form of a lack of available crisis information or poorly managed information flow (valecha ; beydoun et al. ) . information and communication technology (ict) and information systems (iss) are considered crucial means of enhancing the collaboration process and information flow management (sagun et al. ; ipe et al. ). however, a lack of informative and appropriate data hinders further development and practical use of emergency management information systems (emiss) (ghosh et al. ; roberts ) . first, data reflecting human behaviors on a large scale are required for each phase of em, since managing affected people is a crucial part of em. second, because the challenges encountered in em change rapidly, data collected in a prompt and timely manner are required to support the corresponding responses of various emiss. finally, emiss require accurate and objective data to reflect emergencies, while traditional emergency-related data rely heavily on surveys. these three requirements create obstacles for the practical use of emiss in dealing with real-world emergencies. many studies have regarded mobile phone data as a potential data source to fulfill these requirements, because these data reflect human behavior richly and ubiquitously. globally, in , mobile phone subscriptions reached . per people, as reported by the international telecommunication union (itu) (sanou ) . mobile phones have been transformed from a simple communication tool into a multifunctional 'mobile computer' with the rise of apps on mobile platforms. mobile phone data such as cdrs and app data can be applied to the analysis of human mobility (stefania et al. ; duan et al. ; gao et al. ; lwin et al. ) , social networks (poblet et al. ; trestian et al. ; ghurye et al. ; dobra et al. ) , mobile phone usage patterns (jia et al. ; steenbruggen et al. ; gundogdu et al. ; gao et al. ) , and geographic location (lwin et al. ; poblet et al. ; dong et al. ; Šterk and praprotnik ) ; these themes are discussed in additional detail in section . the results can further be developed to address various issues encountered during emergencies, such as predicting epidemic transmission (panigutti et al. ) and developing pre-warning systems (zhang et al. ; dong et al. ) . although many efforts have been made to investigate the application of mobile phone data in em, the knowledge and understanding in this field are still fragmented. therefore, a systematic framework that synthesizes the fragmented knowledge and provides insights into the state of the art of using mobile phone data for em is needed. this study aims to propose such a framework and provide guidance for further research in this field. this study is related to two streams of literature reviews, which are using ict in em and mobile phone data analysis. on the first stream of using ict in em (martinez-rojas et al. ; tan et al. ) , martinez-rojas et al. ( ) have reviewed related articles from to to discuss current opportunities and challenges of using twitter for em. tan et al. ( ) have summarized the involvement of mobile apps in the crisis informatics literature by reviewing related articles. on the second stream of mobile phone data analysis, blondel et al. ( ) and naboulsi et al.
( ) have summarized some studies on mobile data analysis, some of which can be applied in em. these two literature reviews focused on the method of data mining, while the current study focuses on em. to sum up, there is a lack of a literature review that considers both the characteristics of mobile phones and using such data for em. three research objectives are undertaken to achieve the goal of synthesizing the fragmented knowledge and providing research guidance: (i) extract basic knowledge (e.g. types of mobile phone data, situations) of em from the selected studies; (ii) break the boundaries of different disciplines and aggregate each analysis perspective; and (iii) study the identified knowledge and integrate it into a single framework that draws a comprehensive map of existing findings under this subject, and provides future implications. to attain these objectives, this study follows a methodology of systematic literature review (slr) and synthesized the results of the reviewed studies into a framework, thus allowing a discussion of future implications obtained from the framework. the next section introduces the research method applied in this systematic literature review, which is followed by an illustration of the five major themes in using mobile phone data for em. section presents the proposed framework of using mobile phone data for em based on the five themes. section discusses current em under the covid- pandemic. finally, this paper discusses future implications and provides a conclusion. based on the systematic research methodology (ghobadi ) which is considered as a means of identifying, evaluating and interpreting all available research relevant to a particular research question, or topic area, or phenomenon of interest (budgen and brereton ) , this study processes three phases of work, including planning, conducting, and reporting. these phases are graphically exhibited along with their specific steps and objectives in fig. . the research questions and relevant studies have been identified in phase , and the studies we reviewed in this paper have been selected through a specific research strategy and inclusion/exclusion criteria illustrated in phase . as a result, we have selected papers from six document databases for this review. note that phases and are both explained in appendix , while the results of phase are detailed in sections and . five themes in using mobile phone data for em to identify the trends of research on emergency situations during the selected period, the reviewed articles have been statistically analyzed by their published years and the types of situations mentioned in each article. the results are exhibited in fig. . emergencies can be divided into two categories, natural ( papers) and man-made ( papers). 'natural' refers to the emergencies that occurred due to processes of the earth (such as weather) that can hardly be avoided, while 'manmade' refers to the emergencies that occurred specifically due to human action or inaction. to be more specific, this paper considers that 'natural' emergencies include 'natural disaster' ( papers) and 'disease disaster' ( papers), such as earthquakes and the ebola virus disease, respectively. meanwhile, 'man-made' emergencies cover various types, among which 'traffic accident' (six papers), 'violence and terror incident' (six papers), and 'other' (four papers) have been regarded as three major subcategories based on the articles reviewed. 
the 'others' category includes 'sudden strikes' (garroppo and niccolini ) , 'damages to pipelines' (dong et al. ) , and 'refugee problem' (andris et al. ) . the remaining studies (eight papers) cannot be clustered into any of the aforementioned categories or subcategories in the course of our review, as they are not clearly related to any particular type of emergency, so we have categorized them as 'general emergency.' studies paid considerably more attention to 'natural' emergencies compared with 'man-made' emergencies, and this was particularly evident in terms of disease disasters. we also note that the number of studies peaked in ( papers; the other years contained , , , , and papers, respectively), and 'disease disaster' has been studied in (seven papers) almost twice as much as in other years. the phenomenon may have been the result of attention drawn by the outbreak of the ebola virus. em generally goes through four phases, namely mitigation, preparedness, response, and recovery. most of the studies investigated the response phase ( papers). this phase involves activities like organizing an evacuation, mobilization, and assisting victims to manage the emergency appropriately. then, papers focused on the preparedness phase, and papers focused on the mitigation phase. the preparedness phase in em comprises a sequence of activities including planning, training, warning, and updating solutions by learning previous emergencies, which can help enhance response abilities (oberg et al. ) . the mitigation phase aims to prevent a disaster or lessen its impacts by modifying the causes and vulnerabilities or distributing the losses (oberg et al. ) . finally, the recovery phase received the least attention from scholars ( papers). this phase consists of both short-term and long-term activities that are designed for reestablishing and returning disaster areas to normal conditions. note that the total sum according to these phases ( papers) is greater than the number of reviewed articles ( papers) since some studies focused on more than one phase. mobile phone data consist of information collected by mobile carriers, sensors, and apps on mobile phones. mobile phone data collected by carriers include cdrs, sms (short message service) information, traffic volume, etc., and data collected by phone sensors, such as gps records, bluetooth-sensed interaction data, etc. most of the studies applied cdrs ( papers) for emergency analysis. this type of mobile phone data contains details about each call such as "the location, call duration, call time, and both parties involved in the conversation" (trestian et al. ) . cdrs containing both users' spatial and temporal information can support research on modeling human mobility during emergencies. moreover, information about caller and recipient ids reveals individuals' social networks, which correlate to infectious disease dissemination (gundogdu et al. ; wesolowski et al. ) . some studies combined cdrs with other data sources to obtain more comprehensive emergency-related information (bharti et al. ; pastor-escuredo et al. ) . bharti et al. ( ) used both cdrs and nighttime lighting data from satellite imagery to analyze population sizes and human mobility (the cdrs for short-term and the satellite data for long-term assessment), which helped in making policies and understanding emergency impacts in the response phase. pastor-escuredo et al. 
( ) adopted both cdr and rainfall-level data in mexico to help discover anomalous mobile phone usage patterns in seriously affected areas and to assess infrastructure damage and casualty populations in time. sms (seven papers) is considered another category of mobile phone data (separate from cdr in this study), which contains both passively collected information, like communication details, and actively collected information. gps data ( papers) can provide the location information of individuals with higher accuracy than cdrs and can be helpful in the identification of individual locations as well as the study of human dynamics. in addition, some studies were developed based on app data (six papers) and other data (nine papers), such as bluetooth-sensed interaction data, mobile-phone-usage data, mobile-traffic data, etc. when applying mobile phone data to practical problems, scholars have to gather and extract useful information from the raw mobile phone data. we have categorized the different information processing paths into six analysis perspectives: human mobility, geographic location, social networks, mobile phone usage patterns, collected information, and information diffusion (blondel et al. ) . detailed definitions for these analysis perspectives are exhibited in table (refer to appendix ) with specific examples. as depicted in fig. , spatial-temporal information extracted from cdr and gps data has been the most investigated by scholars, which is reflected as human mobility ( papers). the spatial and temporal information can benefit the tracking of individuals' trajectories and the modeling of their movement patterns. since a relationship between human movement patterns and infection transmission routes has been found (blondel et al. ) , the analysis of human mobility contributes to the understanding of the disease dissemination process. the mobile phone usage patterns perspective ( papers) refers to various individual behaviors gathered from one's mobile phone data, such as normal call volume. arai et al. ( ) found clues to the whereabouts of unobserved populations by analyzing individuals' usage behaviors from cdrs, especially for the whereabouts of children, who are vulnerable to epidemics. this result contributes to tracking epidemic dissemination and the deployment of various resources. social networks ( papers) are constructed based on the communication information gathered from mobile phone data. the analysis of social networks helped to reproduce disease dissemination models and to predict epidemic transmission for the preparedness phase (y. chen et al. ; farrahi et al. ) . scholars also used spatial information extracted from mobile phone data ( papers) at both the individual and aggregate levels. the analysis of individual geographic location helped track anomalous individual locations, and could be further applied to develop pre-warning systems, such as systems to lessen potential damage to natural gas pipelines (dong et al. ) . the collected information perspective (seven papers) refers to the process of gathering multiple types of information content from mobile phones, such as using a specific app to collect public opinions. deng et al. ( ) collected public opinions through a mobile phone app.
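to make the human-mobility and social-network perspectives described above more concrete, the sketch below derives per-user tower trajectories, a simple daily mobility indicator, and caller-recipient contact edges from a toy set of call records; the field names and records are hypothetical assumptions for illustration, since real cdr schemas vary by carrier:

```python
import pandas as pd

# Toy CDR-like records; the field names (caller, callee, cell_id, timestamp)
# are illustrative assumptions, not a standard carrier schema.
cdr = pd.DataFrame({
    "caller":    ["A", "A", "B", "B", "C"],
    "callee":    ["B", "C", "A", "C", "A"],
    "cell_id":   ["T1", "T2", "T1", "T3", "T2"],
    "timestamp": pd.to_datetime([
        "2020-02-01 08:00", "2020-02-01 18:30",
        "2020-02-01 09:15", "2020-02-02 07:45",
        "2020-02-02 12:00",
    ]),
}).sort_values("timestamp")
cdr["day"] = cdr["timestamp"].dt.date

# Human mobility: each caller's time-ordered sequence of cell towers (a coarse trajectory).
trajectories = cdr.groupby("caller")["cell_id"].apply(list)

# A simple mobility indicator: number of distinct towers visited per caller per day.
daily_mobility = cdr.groupby(["caller", "day"])["cell_id"].nunique().rename("towers_visited")

# Social network: undirected contact edges weighted by the number of calls between each pair.
pairs = cdr.apply(lambda row: tuple(sorted((row["caller"], row["callee"]))), axis=1)
contact_edges = pairs.value_counts().rename("n_calls")

print(trajectories, daily_mobility, contact_edges, sep="\n\n")
```

in a fuller analysis, the trajectories would feed mobility models and the weighted edge list would seed a contact network for simulating disease dissemination, as in the studies cited above.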
diffusion strategies of prevention knowledge could be developed to mitigate the impacts of emergencies by analyzing information diffusion networks (lima et al. ) . research in em mainly focused on five kinds of problems: resource management, evacuation, pre-planning, decisionmaking, and education and training (mingliang et al. ) . within these five types, various issues (such as making evacuation plans, optimizing resource allocation, predicting epidemic transmission, etc.) are defined as applications in this study. although the selected studies developed their unique applications, this study provides a relatively broad classification, five general categories and subcategories to help create a map of the existing applications. this classification is exhibited in table (refer to appendix ) with a detailed definition and a specific example for each category. most studies focused on the decision-making problem ( papers), among which 'conducting public health intervention' ( papers) and 'processing real-time detection' ( papers) had been mostly developed. it was followed by presenting emergency impact (seven papers), 'stating policy/regulations' (five papers), 'making construction plans' (four papers), and 'developing emergency-related platforms' (two papers). the decision-making problem aims to give guidance for the relevant work in the em process. the pre-planning problem has been studied by papers during these five years, among which eight papers focused on 'predicting epidemic transmission' and seven papers focused on 'developing pre-warning system.' this kind of application aims to anticipate what might happen and to provide an early warning. within the evacuation problem ( papers), 'finding victims' (five papers) and 'making evacuation plans' (nine papers) were studied. it aims to solve the issues of finding and rescuing disaster victims. within the education and training problem (seven papers), 'delivering emergency announcement' (five papers) and 'guiding psychological recovery' (two papers) were explored. this kind of problem focuses on spreading emergency-related knowledge for the public and helping the public to face the emergency rationally. the remaining problem type was the resource management problem (five papers). this kind of problem focuses on the appropriate allocation of both material and human resources according to the distributions of the victims and rescuers. our framework for using mobile phone data for em is depicted in fig. and the detailed process is described in section a. . the framework synthesizes the five aforementioned themes (i.e., emergency situations, em phases, types of applications, analysis perspectives, and types of mobile phone data) and illustrates two logical routes with potential correspondences between each theme. the boxes in the framework represent the categories under each theme, and the lines with arrows represent the existing correlation between each theme according to the reviewed studies. the first type of logical route is the decision-making route (represented as blue lines with arrows in the framework). it starts with the emergency situation theme, goes through the phases of em, and ends with the application theme. this route assists managers to make emergency-related decisions comprehensively and conveniently during the emergency. 
they can make specific emergency plans for each phase of em by following these steps: (i) judging what type of emergency the public is encountering; (ii) determining the phase of the emergency in different regions; (iii) figuring out the general categories of applications related to the determined phase, and then referring to table and fig. (in appendix ) for the sub-applications that ought to be taken; (iv) determining the procedures and activities that can be implemented in practice by considering both the characteristics of the encountered emergency and the referred applications. for example, during covid- , relevant managers can identify the situation as a disease disaster. if the managers judge that the pandemic is in the 'mitigation' phase, they can take applications in the pre-planning, education and training, and decision-making categories into account according to our framework. after referring to table and fig. , they can consider implementing the corresponding applications. the second type of logical route is the problem-solving route (represented as red lines with arrows in the framework). it ends with the application theme, but starts from the mobile phone data theme and goes through the analysis perspective theme. this route assists technicians in devising solutions for emergency plans. they can employ mobile technology to implement the identified activities by following these steps: (i) understanding the activity involved in the emergency plans given by the decision-maker, i.e. which application the activity corresponds to; (ii) identifying appropriate mobile technology that can assist in implementing the application; (iii) gathering information and performing the appropriate analysis (refer to table ) using the identified mobile technology; (iv) applying the results of the analysis to real-world em. for example, to predict epidemic transmission during the covid- pandemic, technicians can find cdr and gps data appropriate for gathering individuals' spatial-temporal information and analyzing human mobility. they can then apply the analysis to the development of technical solutions and decision-support systems. the framework also contributes to extending and enriching our understanding of the evolving em literature. links in the framework represent existing studies in this field, thereby illuminating the established correlations between the themes and indicating potential research gaps. for example, studies on education and training applications mainly focus on disease disasters and man-made disasters, while the focus on natural disasters is limited. meanwhile, existing studies analyze this application from only four perspectives, with information diffusion and geographic location missing. researchers may further consider the feasibility of analyzing this application from these two perspectives. the outbreak of the covid- pandemic in early has caused huge impacts on global political, economic, and social development (who d; w. chen and bo ). both emergency-related managers and technicians can refer to the proposed framework for potential decision-making and problem-solving. drawing on the decision-making route, emergency managers can determine the emergency plans involved in each phase of managing the covid- pandemic. existing activities under the pandemic fit four categories of applications, including predicting epidemic transmission (jia et al. ; iacus et al. ; gatto et al. ) , public health intervention (magklaras and nikolaia lopez bojorquez ; ekong et al. ; canada ) , delivering emergency announcements (barugola et al.
; shi and jiang ; xinhua b) , and stating policy/regulations (speakman ; hu and zhu ; canada ; singapore ) . specifically, in the mitigation phase, related organizations have predicted epidemic transmission by monitoring population flows to reduce potential risks (jia et al. ; iacus et al. ; gatto et al. ). in the preparedness phase, inspectors and the public have been trained and mobilized to use mobile phone security codes, which help track infection cases and contacts (agdh ; guinchard ; zastrow ). in the response phase, the world health organization has announced global objectives including conducting public health intervention by rapidly tracing, finding, and isolating all cases and contacts (who b; cozzens ; magklaras and nikolaia lopez bojorquez ) . and in the recovery phase, some governments have used mobile technology to aid business resumption with contact tracing applications (thompson a , b ; sunil ; devonshire-ellis ; hu and zhu ). drawing on the problem-solving route in our framework, relevant technicians can analyze the various types of information collected from mobile devices to implement the aforementioned applications. first, researchers have used cdr data (jia et al. ) and mobile positioning data (gatto et al. ; iacus et al. ) to predict epidemic transmission by capturing and simulating population movements. they have mentioned that their prediction models can also support risk assessment and the allocation of limited resources. second, contact tracing requires various mobile tools, including gps (cozzens ), bluetooth (who c; xinhua a), apps (zhang et al. ; singapore ; agdh ) , and sms technology (who c), to gather individuals' geographic and health-related information. the analysis of this information can be further applied to conduct public health interventions such as isolation strategies and travel restrictions. finally, as for business resumption, the historical trajectories and health status of individuals, collected through specific mobile phone apps (zastrow ), sms (who a; mccabe ) and gps technology (elliott ) , help to evaluate their infection risk, which governments can use to resume business and study. our framework not only covers applications that have already been adopted in this emergency, but also provides a reference for future em. for example, it is possible to consider other applications in our framework, such as guiding psychological recovery in the context of covid- . the use of hedonic apps after an earthquake has been found to reduce perceived risk effectively (jia et al. ) . therefore, decision-makers can learn from this knowledge and implement appropriate applications in the recovery phase of managing this pandemic. future research directions can be derived from the proposed framework from five perspectives: ( ) a focus on man-made emergencies, ( ) a focus on the recovery phase, ( ) exploring new applications, ( ) creating better comprehension of the analysis perspectives, and ( ) combining other data with mobile phone data. it is reasonable to put forward further research on man-made emergencies with mobile phone data. the results shown in section . indicate a greater emphasis on natural emergencies ( articles reviewed in total) and less emphasis on man-made emergencies ( articles reviewed in total). however, managing violence and terror incidents is also vital, because these emergencies happen frequently around the world.
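the two routes described above can also be sketched as simple lookup tables, which may help clarify how a decision-support tool could encode the framework. the mapping below is only a partial, assumed illustration built from the examples given in the text (the mitigation phase of a disease disaster pointing to pre-planning, education and training, and decision-making applications, and epidemic-transmission prediction drawing on cdr/gps data analysed from the human-mobility perspective); it is not a complete implementation of the framework:

```python
# Partial, illustrative encoding of the framework's two routes.
# Decision-making route: (emergency type, EM phase) -> candidate application categories.
DECISION_ROUTE = {
    ("disease disaster", "mitigation"): [
        "pre-planning", "education and training", "decision-making",
    ],
    # ... other (situation, phase) pairs would be filled in from the full framework.
}

# Problem-solving route: application -> (suitable mobile phone data, analysis perspective).
PROBLEM_ROUTE = {
    "predicting epidemic transmission": (["CDR", "GPS"], "human mobility"),
    # ... other applications would be filled in from the full framework.
}

def suggest_applications(situation, phase):
    """Decision-making route: return candidate application categories for managers."""
    return DECISION_ROUTE.get((situation, phase), [])

def suggest_data_and_analysis(application):
    """Problem-solving route: return (data types, analysis perspective) for technicians."""
    return PROBLEM_ROUTE.get(application)

# Example: the COVID-19 mitigation scenario described in the text.
print(suggest_applications("disease disaster", "mitigation"))
print(suggest_data_and_analysis("predicting epidemic transmission"))
```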
for such crisis situations, it is challenging to gather relevant information due to their covert ('dark') and dynamic nature (roberts ; skillicorn ; chen et al. ) . in view of these challenges, the current literature has explored the suitability of using social media data (oh et al. ; cheong and lee ; prentice et al. ; qin et al. ). however, the lack of individuals' real-time locations and objective behavioral information limits the usefulness of social media data. some studies have identified the feasibility of using mobile phone data in man-made emergencies, such as detecting terrorist attacks and monitoring traffic conditions (blondel et al. ) . however, the application of mobile phone data in this domain still has great potential. a focus on the recovery phase would also be a future direction. the occurrence of emergencies has both physical and psychological impacts on the victims. however, most scholars have focused on the timely and effective response to emergencies, while fewer studies have focused on the psychological recovery of people after emergencies. with the popularity and development of g/ g communication technology, many people use their mobile phones to obtain emergency-related information and entertainment to ease their anxiety. for example, jia et al. ( ) discovered that the use of hedonic apps after an earthquake can help people reduce their perceived risk. therefore, there is potential to use mobile phone data to study the development of positive psychological states in the recovery phase. additional applications can be explored and expanded according to the two routes demonstrated in the framework. with a further understanding of the analysis perspectives (which are discussed in section . ), scholars can explore additional applications. for example, with a more comprehensive notion of human social interaction, scholars may not only investigate approaches to predicting epidemic transmission, but also apply this knowledge to develop systems for predicting crimes. in addition, with the upgrading of mobile phones and ict (bandyopadhyay et al. ; palshikar et al. ) , new applications can be explored under the proposed framework. as a result, additional applications can be explored both within and beyond the existing ones in the future. future work should also make efforts to complement the theoretical foundation of emergency studies with theories from other fields. the current literature has developed theories about information transmission among various stakeholders (wang et al. ; weidinger et al. ; liu and xu ; abedin and babar ; fedorowicz and gogan ) , as well as theories of coordination and political science (maldonado et al. ). moreover, a better comprehension of the analysis perspectives can be achieved by drawing on constructs concerning human behavior, social networks, and so on. for example, scholars have found similarities between human communication and infectious disease dissemination (blondel et al. ) , which indicates the benefit of applying social network theories to em. meanwhile, the current understanding of human mobility is mainly based on data analysis. if scholars can learn and apply aspects of relevant psychological and behavioral theories, the analysis of human mobility can be further deepened. in addition, with a better comprehension of analysis perspectives and applications, it is possible to develop new correspondences between analysis perspectives and applications.
the current framework was developed based on the articles reviewed, and cannot reveal every possible relationship between analysis perspectives and applications. for example, human mobility could probably be analyzed in order to construct pre-warning systems: in the study of individuals' daily movement patterns, a detected anomaly of movement may indicate a coming disaster, which suggests the need to establish a correspondence between human mobility and pre-warning systems. consequently, the correspondences between analysis perspectives and applications should be considered and expanded upon in future research. although mobile phone data have been successfully applied to em, combining other micro and macro data can help em research develop more efficiently (ghosh et al. ) . martinez-rojas et al. ( ) have reviewed articles about using twitter to manage emergencies, indicating the significant role and value of data from social media in em. under what circumstances it is appropriate, and how, to combine mobile phone data with micro (e.g., individual locations) and macro (e.g., public opinions) data from social media requires further exploration. moreover, combining data from other platforms is worth consideration as well. for instance, pastor-escuredo et al. ( ) believe there is potential in combining information from officially monitored sensors, like traffic video cameras, which can provide fine-grained validation for the existing measures. emergencies have great impacts on the whole of society, affecting material facilities and social order in terms of economic losses and human casualties. unlike traditional management measures based on data sources with limited adaptability and low accuracy, applications for handling emergencies that utilize ubiquitous and real-time mobile phone data have greatly improved em mechanisms and minimized their negative impacts on society. this systematic literature review analyzes studies concerning the use of mobile phone data for em, and proposes a framework to synthesize the fragmented knowledge of existing studies. the framework comprises five themes, among which six analysis perspectives and five general types of applications are put forward to explain the em process, which includes two logical routes. the framework can support stakeholders, such as emergency managers and technicians, and is used to suggest five future research directions in the field for scholars. in addition, this study discusses em under the covid- pandemic and provides a reference for future management of the pandemic. despite all the contributions mentioned above, this study still possesses some inevitable limitations. first, the common limitation of literature reviews related to keywords exists in this study as well. to alleviate this drawback as far as possible, this study draws on keywords about em from previous reviews and research works in the field of em, as well as keywords about mobile phone data. the second limitation is that the framework developed in this study is based only on the articles reviewed, meaning that relationships not mentioned in those articles are not considered. although this review is intended to be both wide and deep in coverage, it should not be considered a complete or final summary of the topic. reviews, no matter how current, by definition focus on the past and cannot fully anticipate novel approaches or new developments.
finally, this study does not focus on a detailed introduction of data processing and analysis techniques. nevertheless, we still believe this study makes a strong contribution to the field, especially toward emergency managers and scholars who are looking for direction to develop this field in the future. acknowledgments. the corresponding author is a tang scholar. he would like to acknowledge the support from the national natural science foundation of china (grant no. ) and the key research and development program of shaanxi province (grant no. zdlgy - ). authors' contributions. yw, jl, xz, gf, xl conceived the study. yw, jl carried out the review, synthesized and analyzed the evidence, and drafted the manuscript. xz, xl, gf supervised the review process and revised the manuscript. all authors read and approved the final manuscript. planning. identifying the research questions. our research questions were as follows: ( ) what types of mobile phone data (with respect to their applied phases and their use in coping with practical issues) have been studied? ( ) what is the state of the art of this field? and ( ) what can future works develop to facilitate the understanding of this subject? regarding the first question, a statistical analysis was performed on the aspects of emergency situations, phases of em, types of mobile phone data, analysis perspectives, and applications. next, a framework for using mobile phone data in em was proposed to address the second question. finally, five future implications based on the proposed framework were presented to address the third question. identifying relevant studies. the second step of planning the research was identifying relevant studies, which defined the scope of this review study. six document databases were searched to find related studies between and . these were: sciencedirect, scopus, web of science, ieee xplore, acm, and springer. ieee xplore and acm are two specialty article databases that provide extensive coverage of the literature in computer science and related areas, and scopus (sciverse scopus) is the largest abstract and citation database. the other three were additional comprehensive and widely searched databases. to draw the boundaries of what articles would be included and reviewed in the study, the phenomenon of interest was identified as 'research that applies mobile phone data to manage emergencies.' we developed an initial list of research keywords to match this definition with published documents and considered various literature expressions that represented the same terminology, covering both aspects of emergencies and mobile phone data. this method is illustrated at the beginning of the conducting phase and further guided the search for related articles. generating a research strategy. the first step was to generate a research strategy by finding and filtering studies from the six databases.
two iterations were processed: ( ) searching terms in the keywords list ("mobile phone data" or "short message service" or "call detail record" or "phone gps data" or "cellular network data" or "app data" or "application data" or "bluetooth data") and ("emergency" or "extreme situation" or "extreme event" or "large-scale event" or "special event" or "special situation" or "anomalous event" or "anomalous situation" or "unusual event" or "unusual situation" or "crisis" or"disaster" or "catastrophe" or "traffic accident" or "epidemics" or "infectious disease") and ( < pubyear< ); ( ) searching papers in the reference list of the five previously identified review articles and including additional studies. in the first iteration, papers ( sciencedirect + springer + ieee + acm + web of science + scopus) were initially identified. in the second iteration, additional articles were included as relevant for further selection by searching the reference lists. accordingly, articles were found through the process of this stage (fig. ) . selecting primary studies. the second step was to select the studies to be reviewed through a standard inclusion and exclusion criteria. the specific items are depicted in table . the inclusion criteria i and i ensured that the selected studies were in accord with mobile phone data and applying it in em, which aligned with the objectives of this study. the exclusion criterion e eliminated studies that were found by the keyword "emergency," but discussed emergency-related activities instead of an emergency situation itself or did not discuss a specific situation at all, for example, studies of efficient deployment of emergency departments or the contents and answers of emergency calls. e excluded studies that analyzed the use of mobile phones for self-rescue when individuals encountered difficulties, like heart attacks. such situations were considered to be emergencies for a single person, which did not fall within the scope of this study. e excluded studies with a discussion of non-emergency situations, which were planned events like concerts, festivals, and football matches. all of these criteria were considered to point out the most relevant studies for the topic and improved the reliability and validity of the review study. finally, in order to mitigate bias, two of the authors conducted this step using the above criteria. after discussion, articles were chosen for the analysis and report in this review (refer to the process illustrated in fig. ) . extracting data. the third step was to extract data from the studies that were selected. the following information was extracted from the studies: (i) document demographic information, including the title, year of publication, journal name or conference name, and authors; (ii) information about emergency situations, including the general types of emergency and specific events; (iii) types of mobile phone data used in emergency situations; (iv) methods to apply such data in the situations and objectives of the study (what perspectives it analyzed and to which process of em the study belongs). the detailed results are presented in section . based on the information extracted, the analysis perspectives and applications that were proposed were summarized to create a clearer idea of how to apply mobile phone data to effective management (refer to sections . and . for details). for example, individual movement patterns were studied by stefania et al. ( ) and vogel et al. 
( ) to either establish new models or develop existing models, and thus help prevent disease dissemination, while aggregated population mobility was analyzed by sekimoto et al. ( ) to benefit government policymaking. in this study, the phrases 'individual movement patterns' and 'aggregated population mobility' were combined and expressed as 'human mobility,' which represents a kind of analytic perspective. in addition, objectives (or final applications) of the reviewed articles, such as 'making rescue plans,' 'making traffic regulations,' and 'helping in policy decisions,' were paraphrased as 'making policies'. in this way, the current study transferred the meta-information extracted from the reviewed studies into a collective and scientific form for the conclusions and future discussion. preparing the framework for using mobile phone data for emergency management. the final step in the conducting phase was to propose the plan for developing a framework that could illustrate the current state of research. first, we drew on the typology of mobile data sources, types, and uses in the disease disaster management cycle proposed by cinnamon et al. ( ) , and we planned to adjust the framework to adapt to the data from this study. second, we considered "emergency situations" and "mobile phone data" as two starting points, as both aspects were key features of the subject. third, considering the practicality of this framework, it was desired to expand the framework to the perspectives of stakeholders from both managerial and technical layers, respectively.
table . inclusion and exclusion criteria applied during study selection:
i . articles which contain the description of mobile phone data types with respect to emergencies.
e . articles which focus on emergency items (e.g., emergency call or emergency department) rather than emergency situations.
e . articles which depict emergencies encountered by a single person instead of society (e.g., a heart attack).
e . articles which describe non-emergency situations (e.g., festivals or concerts).
e . articles for which the full text cannot be found.
e . articles which are not written in english.
therefore, five themes (situations, phases in em, mobile phone data, analysis perspectives, and applications) were identified as components in the framework. based on these themes, we realized that "applications" served as the final consideration regardless of the perspective. finally, we developed a draft of the framework which started from "emergency situations" and "mobile phone data" and ended in "applications." it consisted of two logical routes: the decision-making, or managerial, route ("emergency situation", "phases in emergency management", and "applications"); and the problem-solving, or technical, route ("mobile phone data", "analysis perspective", and "applications"). the human mobility perspective refers to capturing citizens' spatial-temporal change patterns, like travel, commuting, migration, etc., from real-time mobile phone data. this article utilized mobile-phone bluetooth-sensed data to reflect human interactions and compared them to actual infection cases to simulate the spread of seasonal influenza (farrahi et al. ) . the geographic location perspective refers to gathering personal space information from cdrs and gps data to illustrate geographic networks at both an individual and aggregate level. this article used mobile-phone gps-location data to find abnormalities close to a pipeline as a way to detect damaging activities (dong et al. ) .
the collected information perspective refers to gathering multiple types of information content from mobile phones, such as comments and views collected through sms and apps. this article introduced an app that collected online opinions about emergency management and applied this data into assisting the decision-making of governments (deng et al. ). the information diffusion perspective refers to the network of information propagation via the phone, sms, or app, which is a general process of gathering information from mobile phone data. n. zhang et al. ( ) this article analyzed the information dissemination mechanisms of calls and sms messages to validate their effectiveness as pre-warning approaches in reducing losses during emergencies (nan zhang et al. ) . processing real-time detection refers to capturing citizens' anomalous behavior to detect possible emergencies in real-time with appropriate accuracy for planning response efforts. an anomaly detection system was developed by connecting exceptional spatial-temporal patterns from mobile data with real-world emergencies (trestian et al. ) . developing emergency-related platforms refers to implementing a better platform to collect multi-source data which eventually enhances the efficiency of em. this study introduced a platform containing multiple kinds of mobile phone data to implement various intervention measures into the whole management process (poblet et al. ) . making construction plans refers to scientifically building infrastructure to minimize the loss in an emergency or optimizing reconstruction projects. two isolation strategies for controlling epidemics, at the subprefecture and individual level, were proposed based on a model of citizens' trajectories (stefania et al. ) . stating policy/regulations refers to related governments drawing experience from current and previous emergencies and utilizing regularities to manage emergencies. this study represented the population size changes after a political conflict as supporting information provided for governments (bharti et al. ) . pre-planning (total number ) developing pre-warning system refers to integrating multiple characteristics concerning diverse emergencies obtained from the analysis stage to give early warning. this study validated that a mobile phone dataset performed better than traditional survey data in representing commuting patterns to simulate an epidemic spread (panigutti et al. ) . finding victims refers to providing information or clues about victims' whereabouts for responsible authorities. this study analyzed citizens' evacuation routes after a subway accident to optimize future evacuation organizing work (duan et al. ). guiding psychological recovery refers to discovering people's anomalous behaviors after emergencies to specifically provide psychological counseling for them jia et al. ( ) ; baytiyeh ( ) (total number two) this study found that hedonic behavior would reduce perceived risks by studying app usage changes in a disaster and recommend hedonic app using for the public after disasters (jia et al. ) . delivering emergency announcement refers to propagating knowledge or messages about emergencies to help citizens prepare for or respond to them. this study improved the inference of the hotspot of epidemics based on human mobility mining to optimize resource deployment (matamalas et al. ). appendix . 
characteristic matrix of the reviewed studies ♦ ci processing real-time detection, presenting emergency impacts ♦ 'ap' stands for 'analysis perspective' and 'em phases' represents 'phases of emergency management'. six analysis perspectives are respectively human mobility (hm), social networks (sn), mobile phone usage pattern (mpup), information diffusion (id), geographic location (gl) and collected information (ci). the definitions of ap and applications are consistent with tables and . institutional vs. non-institutional use of social media during emergency response: a case of twitter in australian bush fire covidsafe application the viability of mobile services (sms and cell broadcast) in emergency management solutions: an exploratory study risc: quantifying change after natural disasters to estimate infrastructure damage with mobile phone data the built environment and syrian refugee integration in turkey: an analysis of mobile phone data understanding the unobservable population in call detail records through analysis of mobile phone user calling behavior a case study of greater dhaka in bangladesh smartphone geospatial apps for dengue control, prevention, prediction, and education: mosapp, disapp, and the mosquito perception index (mpi) an embedding based ir model for disaster situations stay safe stay connected: surgical mobile app at the time of covid- outbreak the uses of mobile technologies in the aftermath of terrorist attacks among low socioeconomic populations using mobile phone data to predict the spatial spread of cholera disaster management and information systems: insights to emerging challenges remotely measuring populations during a crisis by overlaying two data sources a survey of results on mobile phone datasets analysis performing systematic literature reviews in software engineering travel restriction measures: covid- program delivery data fusion for city life event detection novel coronavirus named covid- by who introduction to special issue on terrorism informatics reality mining: a prediction algorithm for disease dynamics based on mobile big data a microblogging-based approach to terrorism informatics: exploration and chronicling civilian sentiment and response to terrorism events via twitter evidence and future potential of mobile phone data for disease disaster management countries track mobile location to fight covid- delay-aware accident detection and response system using fog computing area: a mobile application for rapid epidemiology assessment cps model based online opinion governance modeling and evaluation of emergency accidents covid- in china: business lose less, work resumes faster than expected spatiotemporal detection of unusual human population behavior using mobile phone data use of community mobile phone big location data to recognize unusual patterns close to a pipeline which may indicate unauthorized activities and possible risk of damage understanding evacuation and impact of a metro collision on ridership using large-scale mobile phone data covid- mobile positioning data contact tracing and patient privacy regulations: exploratory search of global response strategies and the use of digital tools in covid- : balancing response and recovery emergencies do not stop at night: advanced analysis of displacement based on satellite-derived nighttime light observations predicting a community's flu dynamics with mobile phone data reinvention of interorganizational systems: a case analysis of the diffusion of a bio-terror surveillance system mobile phone 
data highlights the role of mass gatherings in the spreading of cholera outbreaks precision global health in the digital age design patterns for emergency management: an exercise in reflective practice quantifying information flow during emergencies association between mobile phone traffic volume and road crash fatalities: a population-based case-crossover study anomaly detection mechanisms to find social events using cellular traffic data spread and dynamics of the covid- epidemic in italy: effects of emergency containment measures what drives knowledge sharing in software development teams: a literature review and classification framework exploitation of social media for emergency relief and preparedness: recent research and trends a framework to model human behavior at large scale during natural disasters our digital footprint under covid- : should we fear the uk digital contact tracing app? international review of law computers & technology, countrywide arrhythmia: emergency event detection using mobile phone data flood disaster indicator of water level monitoring system investigating evidence of mobile phone usage by drivers in road traffic accidents xi: we will continue to speed up the restoration of work and life order in the normal epidemic prevention and control process (in mandarin chinese explaining the initial spread of covid- using mobile positioning data: a case study information intermediaries for emergency preparedness and response: a case study from public health the geo-observer network: a proof of concept on participatory sensing of disasters in a remote setting advances in multi-agency disaster management: key elements in disaster research the role of hedonic behavior in reducing perceived risk: evidence from postearthquake mobile-app data population flow drives spatio-temporal distribution of covid- in china population distribution modelling at fine spatiotemporal scale based on mobile phone data use of short message service for monitoring zika-related behaviors in four latin american countries: lessons learned from the field. mhealth event-cloud platform to support decision-making in emergency management disease containment strategies based on mobility and information dissemination social roles and consequences in using social media in disasters: a structurational perspective a st century approach to tackling dengue: crowdsourced surveillance, predictive mapping and tailored communication development of gis integrated big data research toolbox (biggis-rtx) for mobile cdr data processing in disasters management a review of information security aspects of the emerging covid- contact tracing mobile phone applications collaborative systems development in disaster relief: the impact of multi-level governance twitter as a tool for the management and analysis of emergency situations: a systematic literature review a data-driven impact evaluation of hurricane harvey from mobile phone data assessing reliable human mobility patterns from higher order memory in mobile communications how to create a covid- sms self-assessment tool a research review on the public emergency management cell phones and motor vehicle fatalities large-scale mobile traffic analysis: a survey disasters will happenare you ready? 
information control and terrorism: tracking the mumbai terrorist attack through twitter model-driven disaster management no-notice urban evacuations: using crowdsourced mobile data to minimize risk weakly supervised and online learning of word models for classification to detect disaster reporting tweets assessing the use of mobile phone data to describe recurrent mobility patterns in spatial epidemic models flooding through the lens of mobile phone activity population mobility reductions associated with travel restrictions during the ebola epidemic in sierra leone: use of mobile phone data introduction to emergency management crowdsourcing roles, methods and tools for data-intensive disaster management analyzing the semantic content and persuasive composition of extremist media: a case study of texts produced during the gaza conflict a multi-region empirical study on the internet presence of global extremist organizations advanced methods of cell phone localization for crisis and emergency management applications tracking and disrupting dark networks: challenges of data collection and analysis a scenario-based study on information flow and collaboration patterns in disaster management ict facts and figures a review on security challenges of wireless communications in disaster emergency response and crisis management situations real-time people movement estimation in large disasters from several kinds of mobile phone data covid- prevention and control tips, the least we can do in the face of the epidemic (in mandarin chinese tracetogether' contact tracing application website computational approaches to suspicion in adversarial settings the world needs to follow china methods to combat epidemic traffic incidents in motorways: an empirical proposal for incident detection using data from mobile phone operators a comparison of spatialbased targeted disease mitigation strategies using mobile phone data improving emergency response logistics through advanced gis guidlines for business resumption in singapore, post-covid- taskforce, and more mobile applications in crisis informatics literature: a systematic review design and operation of app-based intelligent landslide monitoring system: the case of three gorges reservoir region integrating rapid risk mapping and mobile phone call record data for strategic malaria elimination planning expert-recommended school supplies for a safe transition back to school your back-to-school checklist to protect against covid- on the use of human mobility proxies for modeling epidemics migration statistics relevant for malaria transmission in senegal derived from mobile phone data and used in an agent-based migration model towards connecting people, locations and real-world events in a cellular network an investigation of interaction patterns in emergency management: a case study of the crash of continental flight mining mobile datasets to enable the finegrained stochastic simulation of ebola diffusion a capability assessment model for emergency management organizations is the frontier shifting into the right direction? 
a qualitative analysis of acceptance factors for novel firefighter information technologies quantifying travel behavior for infectious disease research: a comparison of data from surveys and mobile phones quantifying seasonal population fluxes driving rubella transmission dynamics using mobile phone data evaluating spatial interaction models for regional mobility in sub-saharan africa multinational patterns of seasonal asymmetry in human movement influence infectious disease dynamics covid- message library critical preparedness, readiness and response actions for covid- digital tools for covid- contact tracing who announces covid- outbreak a pandemic bluetooth contributes to accurate covid- control tech tools stretch anti-virus battle's grassroots reach cityflowfragility: measuring the fragility of people flow in cities to disasters using gps data collected from smartphones fusion of terrain information and mobile phone location data for flood area detection in rural areas crosscomparative analysis of evacuation behavior after earthquakes using mobile phone data mobile phone data reveals the importance of pre-disaster inter-city social ties for recovery after hurricane maria spatial and temporal analysis of human movements and applications for disaster response management utilizing cell phone usage data improving emergency evacuation planning with mobile phone location data coronavirus contact-tracing apps: can they slow the spread of covid- ? information dissemination analysis of different media towards the application for disaster pre-warning comprehensive analysis of information dissemination in disasters americans' perceptions of privacy and surveillance in the covid- pandemic publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations she is currently pursuing the ph.d. degree in management science and engineering at the school of management, xi'an jiaotong university. her main research interests are big data and behavior analysis, social media and gamification he is currently pursuing the ph.d. degree in management science and engineering at the school of management, xi'an jiaotong university. his main research interests are massive human behavior analysis and blockchain finance exploration he conducted research in the fields of data analytics and pattern recognition as a research assistant professor in the he obtained the b.s. degree in computer science in , the m.s. degree in systems engineering in , and the ph.d. degree in management engineering in , all from xi'an jiaotong university of china. his research interests include logistics and supply chain management, information system management his research interests center around information assurance, innovative technologies for strategic decision-making, and global it management. he is the co-editor-in-chief for the international journal of accounting and information management. ♦ inf syst front key: cord- -ahoo j o authors: lai, yuan; charpignon, marie-laure; ebner, daniel k.; celi, leo anthony title: unsupervised learning for county-level typological classification for covid- research date: - - journal: intelligence-based medicine doi: . /j.ibmed. . sha: doc_id: cord_uid: ahoo j o the analysis of county-level covid- pandemic data faces computational and analytic challenges, particularly when considering the heterogeneity of data sources with variation in geographic, demographic, and socioeconomic factors between counties. 
this study presents a method to join relevant data from different sources to investigate underlying typological effects and disparities across typologies. both consistencies within and variations between urban and non-urban counties are demonstrated. when different county types were stratified by age group distribution, this method identifies significant community mobility differences occurring before, during, and after the shutdown. counties with a larger proportion of young adults (age - ) have higher baseline mobility and had the least mobility reduction during the lockdown. the covid- pandemic has showcased the need for a multidisciplinary exploration, interpretation, and presentation of data. in comparison with the sars-cov- outbreak from to , advances in cloud storage, analytic infrastructure, and platforms for dissemination of information have dramatically expanded the data resources available for studying virus transmission in communities, as well as the interplay between individual and geographical factors, including the socio-political landscape. policy experts increasingly seek to leverage data, machine learning, and cloud computing in their response strategies. unfortunately, data heterogeneity, a dearth of data standards, and poorly interoperable data-sharing platforms complicate the quality and availability of analyzable data, marring both data value and methodological reproducibility. the new york times (tnyt) developed a live data repository with daily county-level coronavirus cases and deaths (tnyt, ) . county-level data has emerged as the primary geographical level of analysis, self-contained for reporting purposes while additionally responsible for the execution of epidemic policy response. moreover, disaster funding is allocated at the county-level. analyzing data at the county-level has significant benchmarking challenges: for instance, counties have fundamental differences in geographic, demographic, political, and socioeconomic characteristics, which lead to differing and unique epidemiological trajectories that go uncaptured in a static pooled analysis. in response to this, the u.s centers for disease control and prevention (cdc) in created a social vulnerability index (svi) aimed at quantifying the resilience of communities to disasters and disease outbreaks (cdc, ) , an index that has been expanded throughout this pandemic. based on these indicators, the cdc has j o u r n a l p r e -p r o o f identified "most vulnerable" counties and other jurisdictions that are at highest risk for outbreaks, with consequent impact on federal resource distribution, aid, and policy. however, without a deep understanding of the underlying variation across the counties and the states, modeling leads to error, bias, and flawed interpretations, leading to downstream deleterious impacts on the ability for a community --and the nation --to respond to this crisis. a recent paper from bosancianu and colleagues (bosancianu, ) found that a county's political leaning, social structures, and local government effectiveness also explain, in part, covid- mortality. these findings cannot solely be explained by the urban/rural divide, nor racial and ethnic disparities, between counties (bassett et al, ; chen and krieger, ) . county-level analysis has similarly demonstrated a link between political beliefs and compliance with social distancing (painter and qiu, ) , as well as connections between covid- transmission to air pollution and other factors (wu et al., ) . 
a robust analytical system capable of identifying granular patterns and trends, tracking county-level case incidence, mortality, and excess mortality (cdc, ), and thereby disentangling causal, mitigative, and correlative effects (knittel and ozaltun, ) is critical for healthcare resource allocation during this and future pandemics. this project introduces a methodology to specifically address the computational and analytical challenges of aggregating county-level heterogeneous data sources for research. this captures the first steps necessary to reliably frame and analyze county-level data, including incorporation of higher-resolution, individual-level data in analysis. the purpose of this study is to summarize publicly available and relevant covid- data sources, to address the benchmarking challenge arising from data heterogeneity through clustering, and to classify counties based on their underlying variations. through these methodologies, greater understanding of the spread of covid- and future pandemics may be attained, leading to better data-driven policies. we represent socioeconomic characteristics by integrating multiple county-level data sources (table s ). these include baseline measures from population census data, geographical information systems data, business pattern censuses, and other sources that report relatively time-invariant variables. spatial data was collected by quantifying geographical attributes per county and integrating this with other datasets. county land area is enumerated through evaluation of county geometry from tiger/line shapefiles, with subsequent estimation of county-level population density ( people per square km). the cdc publishes spatial data representing the top cities' boundaries ranked by population. using spatial geometry, the intersection of county and city borders is evaluated to approximate the total urban area. based on the total county-level urban area, counties with an urban share greater than % were classified as "urban" while the rest were classified as "non-urban". we calculated county-level total population, gender, race, and age group distribution using population estimates. using data reported from the small-area life expectancy estimates project (usaleep), county-level average life expectancy was estimated as a proxy for local quality-of-life differences (usaleep, ) . further, education was represented as the percentage of adults with a bachelor's degree or higher ( - ) as reported by the u.s. census bureau. we further aggregated the age groups † and computed underlying typologies using clustering techniques. k-means clustering is an unsupervised machine learning method that partitions observations into k groups (clusters) based on their distance to the group means (the clusters' centroids) (lloyd, ) . it is one of the most common non-hierarchical clustering methods (steinley, ) . we first identified the optimal number of clusters, denoted by k, by computing the silhouette score in line with lloyd et al., and then generated categorical variables as a typology indicating different age distributions across counties. recent studies identify the importance of the timing of covid- spread in different counties (bialek, et al., ) . a core analytical challenge is how to take these varying timelines into account when comparing virus transmission across different counties. covid- case and death data were collected from the tnyt github repository, which reports county-level cumulative counts daily.
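to make the clustering step concrete, the sketch below shows one way the county age-group typology described above could be derived with k-means and the silhouette score. it is an illustrative reconstruction rather than the authors' code, and the input file county_age_shares.csv and its age_share_* columns (one row per county, one column per aggregated age-group share) are assumptions.

```python
# Illustrative sketch: choose k by silhouette score, then label counties.
# Assumes a hypothetical file with one row per county and columns holding
# the share of population in each aggregated age group.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("county_age_shares.csv")          # hypothetical input
age_cols = [c for c in df.columns if c.startswith("age_share_")]
X = StandardScaler().fit_transform(df[age_cols])   # scale each age-group share

best_k, best_score = None, -1.0
for k in range(2, 9):                              # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)            # higher = better-separated clusters
    if score > best_score:
        best_k, best_score = k, score

final_model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
df["age_typology"] = final_model.labels_           # categorical typology per county
print(f"chosen k={best_k}, silhouette={best_score:.3f}")
print(df.groupby("age_typology")[age_cols].mean()) # profile each typology
```

in the study this procedure yielded three county types; with different inputs or scaling choices the selected k could differ.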
multiple measures were then quantified at the county-level, including: (jia, et al., ) . finally, the slope of the growth in death rate over time was estimated via a linear fit for each county. † age group = age - , group = age - , group = age - , group = age - , group = age - , group = age - , group = age - , group = age - , group = age and above. human mobility was evaluated as a dependent and independent variable during the pandemic, with particular emphasis on how mobility changed responding to local policy and affected outbreak trajectory. county-level mobility change was quantified using exposure indices derived from placeiq movement data based on mobile phone data (placeiq, ). the countylevel device exposure index (dex) is a proxy for local human mobility, which reports the county-level average spatial-temporal co-existence of unique mobile devices. this index measures daily average exposure to other people and/or crowds, reflecting local social distancing policy and compliance. dex measures the absolute change of mobility density, demonstrating both weekly patterns and county-level variations. to generate a less-noisy and comparable measure across counties, values were computed by normalizing the county-level dex timeseries raw data to enable cross-county comparison. the mechanism with which urbanization impacts vulnerability to a pandemic and the subsequent health outcomes is not fully elucidated. between the correlation matrices for urban and non-urban environments, consistency is seen but with subtle variation (figure ). both matrices reveal a correlation between some baseline measures: counties with higher educational attainment have higher income levels and life expectancy. race and sex have a weaker correlation with income, unemployment, and education in urban areas compared to non-urban areas. when looking at the correlations between baseline measures and pandemic outcome measures, counties with a comparatively larger population, higher income and education j o u r n a l p r e -p r o o f attainment, and/or life expectancy had the earliest cases. consistent correlations were observed between case rate and population, density, unemployment, income, and education. (colorado), florida, and gulf coast. evaluation of these geographical patterns suggests that urban areas may not be the "epicenters" but rather the "vanguards" of pandemic spread (angel, et al., ) . figure a and b reveal the disparities between urban and non-urban counties in terms of variation in death rate over time, as well as in number of days from the first local death. notably, non-urban counties have steeper slopes than urban counties, are hit later in the total pandemic timeline, and experience death rates higher than in urban areas. figure c bins the counties by death rate slope, highlighting that most counties are classified as non-urban areas, and that these had a long-tail distribution of death rate growth slope as compared to urban counties. figure d compares the density curves of the two county types, demonstrating the more dispersed death rate slope variations in non-urban counties. the k-means clustering algorithm labels all counties into three groups using age group distribution typology. as figure indicated, type a (in red) represents counties with a predominantly young population, defined as in their s. type b (in blue) represents counties with more older adults (age >= ). type c (in green) represents most counties, which contain relatively "typical" age patterns. 
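two of the measures defined earlier in this section, the per-county death-rate growth slope and the normalized dex mobility series, can be sketched as follows. the paper does not state its exact normalization, so the baseline-relative scaling below is only one plausible choice, and the input file and column names (county_fips, date, death_rate, dex) are assumptions.

```python
# Illustrative sketch: per-county death-rate slope (linear fit over days since
# the county's first recorded death) and a baseline-normalized DEX series.
# File and column names are hypothetical; the paper's exact normalization is unspecified.
import numpy as np
import pandas as pd

panel = pd.read_csv("county_daily_panel.csv", parse_dates=["date"])  # hypothetical

def death_rate_slope(g: pd.DataFrame) -> float:
    g = g[g["death_rate"] > 0]                      # days at or after the first death
    if len(g) < 2:
        return np.nan
    days = (g["date"] - g["date"].min()).dt.days.to_numpy()
    return np.polyfit(days, g["death_rate"].to_numpy(), 1)[0]  # slope of the linear fit

slopes = panel.groupby("county_fips").apply(death_rate_slope).rename("death_rate_slope")

# One plausible normalization: express DEX relative to each county's
# pre-pandemic baseline (mean DEX before an assumed 2020-03-01 cutoff),
# so that counties with very different baseline mobility become comparable.
baseline = (panel[panel["date"] < "2020-03-01"]
            .groupby("county_fips")["dex"].mean().rename("dex_baseline"))
panel = panel.merge(baseline, on="county_fips")
panel["dex_normalized"] = panel["dex"] / panel["dex_baseline"]
```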
this method highlights dynamic patterns in county-level age distribution differences versus traditional analytical methods. j o u r n a l p r e -p r o o f we identify three phases for each county according to its normalized human mobility changes ( figure ). phase one prior to march , during which most counties experienced increasing mobility density. phase two occurred in march, when most counties witnessed drastically reduced local mobility density, reaching a nadir in early april. finally, phase three began in early april, marking a slow return to mobility pre-pandemic. counties with different age group distributions demonstrate various mobility changes before, during, and after the u.s. federal government announced the national emergency on march th. counties with a largely young population (type a in red) saw less mobility reduction ( figure ). during the "shelter-in-place" policy implementation period in which most places experienced a drastic decline in mobility, these counties had the largest drop in mobility compared to other counties (in green and blue). furthermore, in the third phase, as businesses have started reopening, these counties demonstrated the largest return of mobility. figure . normalized county-level human mobility changes. the group average changes (defined by the age pattern typology) are in bold-dash lines colored accordingly. two vertical lines represent the median dates when counties experienced maximum and minimum human mobility. figure . box plot of local mobility change grouped by age pattern type and time period (before, during, and after shutdown). this study contributes to both data integration and analytical methods that are critical for pandemic research. analyzing demographic, geographical, and socioeconomic characteristics can inform the local public health response and decision-making (lai et al. ) . however, such comprehensive insights require multi-disciplinary and long-term efforts to collect, integrate, and analyze data from heterogeneous sources. limitations of data sources and quality bemire analysis and interpretation, since representativeness and quality depend on particular sources and collection methods. such data variations bring challenges for integrating heterogeneous data relevant to this pandemic. for example, county-level demographic and socioeconomic census provide long-term baseline measures, but often lack high temporal frequency and spatial granularity. mobile phone data, as another example, provide nearly real-time digital representation of human mobility at high spatiotemporal granularity, but suffer from noisy data and underlying sampling bias. that said, our study extends the exploration of information sources and integration methods considering there is no central source for all available data. this study demonstrates the clustering technique using health-related data for pandemic research. identifying the underlying county typology provides critical value in comparing health outcomes across counties (wallace, sharfstein, kaminsky, & lessler, ) . recent systematic review of k-means clustering in air pollution epidemiology-related literature has demonstrated significant utility for typology discovery and knowledge mining (colin, jabbar, & osornio-vargas, ) . further, k-means clustering is widely used for population segmentation analysis, classifying underlying subgroups with an eye toward evaluating specific healthcare demands and policy interventions (shi, kwan, tan, thumboo, & low, ) . 
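the phase comparison summarized in the box plot above can be reproduced with a simple grouped aggregation, as in the sketch below. the input table, its column names, and the phase cut-off dates are assumptions (the dates only approximate the narrative of a mid-march national emergency declaration and an early-april mobility nadir), not values taken from the paper.

```python
# Illustrative sketch: average normalized mobility by age-pattern type and period.
# Assumes a hypothetical merged table with one row per county-day that already
# carries a normalized DEX value and the k-means age typology label.
import pandas as pd

panel = pd.read_csv("county_mobility_typology.csv", parse_dates=["date"])

def phase(d: pd.Timestamp) -> str:
    # Approximate boundaries, not the study's reported dates.
    if d < pd.Timestamp("2020-03-13"):    # before the national emergency declaration
        return "before shutdown"
    if d < pd.Timestamp("2020-04-06"):    # until roughly the early-April mobility nadir
        return "during shutdown"
    return "after shutdown"

panel["period"] = panel["date"].apply(phase)
summary = panel.pivot_table(index="age_typology", columns="period",
                            values="dex_normalized", aggfunc="mean")
print(summary[["before shutdown", "during shutdown", "after shutdown"]])
```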
particularly at the county level, previous studies have implemented clustering techniques to analyze various data sources relating to demographic, geographic, environmental, and socioeconomic determinants of health and disease. two use-case applications of clustering include discovery of underlying patterns based on high-dimensional data (cossman, et al., ; chi, grigsby-toussaint, & choi, ) and prediction of counterfactuals for population health policy intervention (strutz, et al., ) .
both similarities and variations between urban and non-urban counties are demonstrated by the methodology. while previous findings reveal possible geographical clusters of covid- cases at the county-level, our study indicates this is from the underlying typology based on high-dimensional variables. counties vary by geographic, demographic, and socioeconomic characteristics, with associated collective behavior during a pandemic. covid- has accelerated data sharing at scale to crowdsource knowledge generation that can inform national and international policy. we showcased a method for data integration to investigate the spread of the pandemic in the united states. the dissonance in presentation between urban and non-urban areas was highlighted, as well as the impact of population age and mobility during the lockdown. just as policy occurs at levels from local to (inter)national, so too must data analysis: this study is a first step toward that end. lac is funded by the national institute of health through nibib r eb . yl led the data analysis and the drafting of the manuscript. all the authors discussed the interpretation of the findings and contributed to the writing. j o u r n a l p r e -p r o o f american communities project (acp). . retrieved from acp apple mobility trends report the coronavirus and the cities: variations in the onset of infection and in the number of reported cases and deaths the unequal toll of covid- mortality by age in the united states: quantifying racial/ethnic disparities geographic differences in covid- cases, deaths, and incidence -united states centers for disease control and prevention (cdc). . social vulnerability index (svi). retrieved from cdc the relationship between in-person voting revealing the unequal burden of covid- by income, race/ethnicity, and household crowding: us county vs zip code analyses can geographically weighted regression improve our contextual understanding of obesity in the us? findings from the usda food atlas a systematic review of data mining and machine learning for air pollution epidemiology persistent clusters of mortality in the united states descartes labs. . retrieved from descartes labs mapping county-level mobility pattern changes in the united states in response to covid- google community mobility reports probability of current covid- outbreaks in all us counties. the university of texas at austin technical report population flow drives spatio-temporal distribution of covid- in china . covid- united states cases by county dashboard what does and does not correlate with covid- death rates. medrxiv urban intelligence for pandemic response. jmir public health and surveillance least squares quantization in pcm political beliefs affect compliance with covid- social distancing orders exposure indices derived from placeiq movement data contextualizing covid- spread: a county-level analysis, urban versus rural, and implications for preparing for the next wave. medrxiv. sharfstein, j. . covid- situation report & public health guidance the atlantic. . the covid tracking project the new york times covid- data a systematic review of the clinical application of data-driven population segmentation analysis safe graph. . 
social distancing metrics k means clustering: a half century synthesis determining county-level counterfactuals for evaluation of population health interventions: a novel application of k-means cluster analysis retrieved from surgo foundation: https://precisionforcovid.org/ccvi university of california san francisco (ucsf). . covid- county tracker small-area life expectancy estimates project -usaleep comparison of us countylevel public health performance rankings with county cluster and national rankings: assessment based on prevalence rates of smoking and obesity and motor vehicle crash death rates exposure to air pollution and covid- mortality in the united states american community survey -year average county none declared. • this study presents a method to join relevant data from different sources to investigate underlying typological effects and disparities across typologies. both consistencies within and variations between urban and non-urban counties are demonstrated. • significant community mobility differences occurring before, during, and after the shutdown, based on the types of age group distribution. counties with a larger proportion of young adults have higher baseline mobility and the least mobility reduction during the lockdown. ☒ the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.☐the authors declare the following financial interests/personal relationships which may be considered as potential competing interests:j o u r n a l p r e -p r o o f key: cord- -qxk wfk authors: yamin, mohammad title: information technologies of st century and their impact on the society date: - - journal: int j inf technol doi: . /s - - - sha: doc_id: cord_uid: qxk wfk twenty first century has witnessed emergence of some ground breaking information technologies that have revolutionised our way of life. the revolution began late in th century with the arrival of internet in , which has given rise to methods, tools and gadgets having astonishing applications in all academic disciplines and business sectors. in this article we shall provide a design of a ‘spider robot’ which may be used for efficient cleaning of deadly viruses. in addition, we shall examine some of the emerging technologies which are causing remarkable breakthroughs and improvements which were inconceivable earlier. in particular we shall look at the technologies and tools associated with the internet of things (iot), blockchain, artificial intelligence, sensor networks and social media. we shall analyse capabilities and business value of these technologies and tools. as we recognise, most technologies, after completing their commercial journey, are utilised by the business world in physical as well as in the virtual marketing environments. we shall also look at the social impact of some of these technologies and tools. internet, which was started in [ ] , now has . million terabyte data from google, amazon, microsoft and facebook [ ] . it is estimated that the internet contains over four and a half billion websites on the surface web, the deep web, which we know very little about, is at least four hundred times bigger than the surface web [ ] . soon afterwards in , email platform emerged and then many applications. then we saw a chain of web . technologies like e-commerce, which started, social media platforms, e-business, e-learning, e-government, cloud computing and more from to the early st century [ ] . 
now we have a large number of internet based technologies which have uncountable applications in many domains including business, science and engineering, and healthcare [ ] . the impact of these technologies on our personal lives is such that we are compelled to adopt many of them whether we like it or not. in this article we shall study the nature, usage and capabilities of the emerging and future technologies. some of these technologies are big data analytics, internet of things (iot), sensor networks (rfid, location based services), artificial intelligence (ai), robotics, blockchain, mobile digital platforms (digital streets, towns and villages), clouds (fog and dew) computing, social networks and business, virtual reality. with the ever increasing computing power and declining costs of data storage, many government and private organizations are gathering enormous amounts of data. accumulated data from the years' of acquisition and processing in many organizations has become enormous meaning that it can no longer be analyzed by traditional tools within a reasonable time. familiar disciplines to create big data include astronomy, atmospheric science, biology, genomics, nuclear physics, biochemical experiments, medical records, and scientific research. some of the organizations responsible to create enormous data are google, facebook, youtube, hospitals, proceedings of parliaments, courts, newspapers and magazines, and government departments. because of its size, analysis of big data is not a straightforward task and often requires advanced methods and techniques. lack of timely analysis of big data in certain domains may have devastating results and pose threats to societies, nature and echo system. healthcare field is generating big data, which has the potential to surpass other fields when it come to the growth of data. big medic data usually refers to considerably bigger pool of health, hospital and treatment records, medical claims of administrative nature, and data from clinical trials, smartphone applications, wearable devices such as rfid and heart beat reading devices, different kinds of social media, and omics-research. in particular omics-research (genomics, proteomics, metabolomics etc.) is leading the charge to the growth of big data [ , ] . the challenges in omics-research are data cleaning, normalization, biomolecule identification, data dimensionality reduction, biological contextualization, statistical validation, data storage and handling, sharing and data archiving. data analytics requirements include several tasks like those of data cleaning, normalization, biomolecule identification, data dimensionality reduction, biological contextualization, statistical validation, data storage and handling, sharing and data archiving. these tasks are required for the big data in some of the omics datasets like genomics, transcriptomics, proteomics, metabolomics, metagenomics, phenomics [ ] . according to [ ] , in alone, the data in the united states of america healthcare system amounted to one hundred and fifty exabyte (one exabyte = one billion gigabytes, or bytes), and is expected soon reach to and later . some scientist have classified medical into three categories having (a) large number of samples but small number of parameters; (b) small number of samples and small number of parameters; (c) large small number of samples and small number of parameters [ ] . 
although the data in the first category may be analyzed by classical methods but it may be incomplete, noisy, and inconsistent, data cleaning. the data in the third category could be big and may require advanced analytics. big data cannot be analyzed in real time by traditional analytical methods. the analysis of big data, popularly known as big data analytics, often involves a number of technologies, sophisticated processes and tools as depicted in fig. . big data can provide smart decision making and business intelligence to the businesses and corporations. big data unless analyzed is impractical and a burden to the organization. big data analytics involves mining and extracting useful associations (knowledge discovery) for intelligent decision-making and forecasts. the challenges in big data analytics are computational complexities, scalability and visualization of data. consequently, the information security risk increases with the surge in the amount of data, which is the case in big data. the aim of data analytics has always been knowledge discovery to support smart and timely decision making. with big data, knowledge base becomes widened and sharper to provide greater business intelligence and assist businesses in becoming a leader in the market. conventional processing paradigm and architecture are inefficient to deal with the large datasets from the big data. some of the problems of big data are to deal with the size of data sets in big data, requiring parallel processing. some of the recent technologies like spark, hadoop, map reduce, r, data lakes and nosql have emerged to provide big data analytics. with all these and other data analytics technologies, it is advantageous to invest in designing superior storage systems. health data predominantly consists of visual, graphs, audio and video data. analysing such data to gain meaningful insights and diagnoses may depend on the choice of tools. medical data has traditionally been scattered in the organization, often not organized properly. what we find usually are medical record keeping systems which consist of heterogeneous data, requiring more efforts to reorganize the data into a common platform. as discussed before, the health profession produces enormous data and so analysing it in an efficient and timely manner can potentially save many lives. commercial operations of clouds from the company platforms began in the year [ ] . initially, clouds complemented and empowered outsourcing. at earlier stages, there were some privacy concerns associated with cloud computing as the owners of data had to give the custody of their data to the cloud owners. however, as time passed, with confidence building measures by cloud owners, the technology became so prevalent that most of the world's smes started using it in one or the other form. more information on cloud computing can be found in [ , ] . as faster processing became the need for some critical applications, the clouds regenerated fog or edge computing. as can be seen in gartner hyper cycles in figs. and , edge computing, as an emerging technology, has also peaked in - . as shown in the cloud computing architecture in fig. , the middle or second layers of the cloud configuration are represented by the fog computing. for some applications delay in communication between the computing devices in the field and data in a cloud (often physically apart by thousands of miles), is detrimental of the time requirements, as it may cause considerable delay in time sensitive applications. 
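as an illustration of the parallel, mapreduce-style processing that tools such as spark and hadoop provide, the sketch below aggregates a large record-level file with pyspark. the file path and column names are hypothetical, and spark is only one of the options listed above (hadoop mapreduce, r, nosql engines and data lakes fill similar roles).

```python
# Illustrative sketch: a distributed aggregation over a large CSV with PySpark.
# The input path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("big-data-analytics-sketch")
         .getOrCreate())

records = spark.read.csv("hdfs:///data/patient_encounters.csv",
                         header=True, inferSchema=True)

# Work is split across partitions (the "map" side) and combined per key
# (the "reduce" side) without the caller managing either step explicitly.
summary = (records
           .groupBy("hospital_id", "diagnosis_code")
           .agg(F.count("*").alias("encounters"),
                F.avg("length_of_stay").alias("avg_length_of_stay"))
           .orderBy(F.desc("encounters")))

summary.show(20)
spark.stop()
```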
for example, processing and storage for early warning of disasters (stampedes, tsunami, etc.) must be in real time. for these kinds of applications, computing and storing resources should be placed closer to where computing is needed (application areas like digital street). in these kind of scenarios fog computing is considered to be suitable [ ] . clouds are integral part of many iot applications and play central role on ubiquitous computing systems in health related cases like the one depicted in fig. . some applications of fog computing can be found in [ ] [ ] [ ] . more results on fog computing are also available in [ ] [ ] [ ] . when fog is overloaded and is not able to cater for the peaks of high demand applications, it offloads some of its data and/or processing to the associated cloud. in such a situation, fog exposes its dependency to a complementary bottom layer of the cloud architectural organisation as shown in the cloud architecture of fig. . this bottom layer of hierarchical resources organization is known as the dew layer. the purpose of the dew layer is to cater for the tasks by exploiting resources near to the end-user with minimum internet access [ , ] . as a feature, dew computing takes care of determining as to when to use for its services linking with the different layers of the cloud architecture. it is also important to note that the dew computing [ ] is associated with the distributed computing hierarchy and is integrated by the fog computing services, which is also evident in fig. . in summary, cloud architecture has three layers, first being cloud, second as fog and the third dew. definition of internet of things (iot), as depicted in fig. , has been changing with the passage of time. with growing number of internet based applications, which use many technologies, devices and tools, one would think, the name of iot seems to have evolved. accordingly, things (technologies, devices and tools) used together in internet based applications to generate data to provide assistance and services to the users from anywhere, at any time. the internet can be considered as a uniform technology from any location as it provides the same service of 'connectivity'. the speed and security however are not uniform. the iot as an emerging technology has peaked during - as is evident from figs. and . this technology is expanding at a very fast rate. according to [ ] [ ] [ ] [ ] , the number of iot devices could be in millions by the year . iot is providing some amazing applications in tandem with wearable devices, sensor networks, fog computing, and other technologies to improve some the critical facets of our lives like healthcare management, service delivery, and business improvements. some applications of iot in the field of crowd management are discussed in [ ] . some applications in of iot in the context of privacy and security are discussed in [ , ] . some of the key devices and associated technologies to iot include rfid tags [ ] , internet, computers, cameras, rfid, mobile devices, coloured lights, rfids, sensors, sensor networks, drones, cloud, fog and dew. blockchain is usually associated with cryptocurrencies like bitcoin (currently, there are over one and a half thousand cryptocurrencies and the numbers are still rising). but the blockchain technology can also be used for many more critical applications of our daily lives. 
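The layering described above, dew resources closest to the user, fog nodes nearby, and the cloud as an elastic back end that absorbs overflow, can be summarised as a dispatch policy. The sketch below is a simplified, hypothetical routing rule (the class names, capacities and the notion of a "time-critical" job are assumptions for illustration, not taken from the cited architectures).

    from dataclasses import dataclass

    @dataclass
    class Tier:
        name: str
        capacity: int   # illustrative maximum of concurrent jobs; real tiers would report live load
        load: int = 0

        def has_room(self) -> bool:
            return self.load < self.capacity

    def dispatch(time_critical: bool, dew: Tier, fog: Tier, cloud: Tier) -> str:
        """Route a job to the nearest tier that can take it.

        Time-critical jobs (e.g. early-warning processing) prefer dew and fog;
        everything else, and any overflow, goes to the cloud."""
        order = [dew, fog, cloud] if time_critical else [fog, cloud]
        for tier in order:
            if tier.has_room():
                tier.load += 1
                return tier.name
        return cloud.name  # the cloud is assumed to scale elastically

    dew, fog, cloud = Tier("dew", 2), Tier("fog", 4), Tier("cloud", 10**6)
    print([dispatch(True, dew, fog, cloud) for _ in range(4)])

Running the example shows the first jobs landing on the dew tier and later ones spilling to fog, which mirrors the offloading behaviour the text attributes to an overloaded fog layer.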
the blockchain is a distributed ledger technology in the form of a distributed transactional database, secured by cryptography, and governed by a consensus mechanism. a blockchain is essentially a record of digital events [ ] . a block represents a completed transaction or ledger. subsequent and prior blocks are chained together, displaying the status of the most recent transaction. the role of chain is to provide linkage between records in a chronological order. this chain continues to grow as and when further transactions take place, which are recorded by adding new blocks to the chain. user security and ledger consistency in the blockchain is provided by asymmetric cryptography and distributed consensus algorithms. once a block is created, it cannot be altered or removed. the technology eliminates the need for having a bank statement for verification of the availability of funds or that of a lawyer for certifying the occurrence of an event. the benefits of blockchain technology are inherited in its characteristics of decentralization, persistency, anonymity and auditability [ , ] . blockchain, being the technology behind cryptocurrencies, started as an open-source bitcoin community to allow reliable peer-to-peer financial transactions. blockchain technology has made it possible to build a globally functional currency relying on code, without using any bank or third-party platforms [ ] . these features have made the blockchain technology, secure and transparent for business transactions of any kind involving any currencies. in literature, we find many applications of blockchain. nowadays, the applications of blockchain technology involve various kinds of transactions requiring verification and automated system of payments using smart contracts. the concept of smart contacts [ ] has virtually eliminated the role of intermediaries. this technology is most suitable for businesses requiring high reliability and honesty. because of its security and transparency features, the technology would benefit businesses trying to attract customers. blockchain can be used to eliminate the occurrence of fake permits as can be seen in [ ] . as discussed above, blockchain is an efficient and transparent way of digital record keeping. this feature is highly desirable in efficient healthcare management. medical field is still undergoing to manage their data efficiently in a digital form. as usual the issues of disparate and nonuniform record storage methods are hampering the digitization, data warehouse and big data analytics, which would allow efficient management and sharing of the data. we learn the magnitude of these problem from examples of such as the target of the national health service (nhs) of the united kingdom to digitize the uk healthcare is by [ ] . these problems lead to inaccuracies of data which can cause many issues in healthcare management, including clinical and administrative errors. use of blockchain in healthcare can bring revolutionary improvements. for example, smart contracts can be used to make it easier for doctors to access patients' data from other organisations. the current consent process often involves bureaucratic processes and is far from being simplified or standardised. this adds to many problems to patients and specialists treating them. the cost associated with the transfer of medical records between different locations can be significant, which can virtually be reduced to zero by using blockchain. 
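To make the idea of blocks "chained together" more concrete, the following toy sketch links records by storing each block's SHA-256 hash in its successor, so that altering an earlier transaction invalidates every later block. It illustrates only the ledger structure described above; consensus, digital signatures and smart contracts are deliberately not modelled, and the record contents are invented.

    import hashlib
    import json
    import time

    def block_hash(block: dict) -> str:
        # hash the block's canonical JSON representation
        payload = json.dumps(block, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def add_block(chain: list, transaction: dict) -> None:
        prev_hash = chain[-1]["hash"] if chain else "0" * 64
        block = {"index": len(chain), "time": time.time(),
                 "transaction": transaction, "prev_hash": prev_hash}
        block["hash"] = block_hash(block)
        chain.append(block)

    def is_valid(chain: list) -> bool:
        # a tampered block breaks both its own hash and the link to its successor
        for i, block in enumerate(chain):
            body = {k: v for k, v in block.items() if k != "hash"}
            if block["hash"] != block_hash(body):
                return False
            if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
                return False
        return True

    # toy example only: hypothetical healthcare-consent records, no real consensus network
    ledger: list = []
    add_block(ledger, {"record": "consent granted", "patient": "anon-01"})
    add_block(ledger, {"record": "record transferred", "patient": "anon-01"})
    print(is_valid(ledger))                                  # True
    ledger[0]["transaction"]["record"] = "consent revoked"
    print(is_valid(ledger))                                  # False after tampering

The second print shows why the structure is considered tamper-evident: changing one historical entry breaks validation of the whole chain, which is the property the text relies on for trustworthy record keeping without intermediaries.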
more information on the use of blockchain in the healthcare data can be found in [ , ] . one of the ongoing healthcare issue is the eradication of deadly viruses and bacteria from hospitals and healthcare units. nosocomial infections are a common problem for hospitals and currently they are treated using various techniques [ , ] . historically, cleaning the hospital wards and operating rooms with chlorine has been an effective way. on the face of some deadly viruses like ebola, hiv aids, swine influenza h n , h n , various strands of flu, severe acute respiratory syndrome (sars) and middle eastern respiratory syndrome (mers), there are dangerous implications of using this method [ ] . an advanced approach is being used in the usa hospitals, which employs ''robots'' to purify the space as can be seen in [ , ] . however, certain problems exist within the limitations of the current ''robots''. most of these devices require a human to place them in the infected areas. these devices cannot move effectively (they just revolve around themselves); hence, the uv light will not reach all areas but only a very limited area within the range of the uv light emitter. finally, the robot itself maybe infected as the light does not reach most of the robot's surfaces. therefore, there is an emerging need to build a robot that would not require the physical presence of humans to handle it, and could purify the entire room by covering all the room surfaces with uv light while, at the same time, will not be infected itself. figure is an overview of the design of a fully motorized spider robot with six legs. this robot supports wi-fi connectivity for the purpose of control and be able to move around the room and clean the entire area. the spider design will allow the robot to move in any surface, including climbing steps but most importantly the robot will use its legs to move the uv light emitter as well as clear its body before leaving the room. this substantially reduces the risk of the robot transmitting any infections. additionally, the robot will be equipped with a motorized camera allowing the operator to monitor space and stop the process of emitting uv light in case of unpredicted situations. the operator can control the robot via a networked graphical user interface and/or from an augmented reality environment which will utilize technologies such as the oculus touch. in more detail, the user will use the oculus rift virtual reality helmet and the oculus touch, as well as hand controllers to remote control the robot. this will provide the user with the vision of the robot in a natural manner. it will also allow the user to control the two front robotic arms of the spider robot via the oculus touch controller, making it easy to do conduct advance movements, simply by move the hands. the physical movements of the human hand will be captured by the sensors of oculus touch and transmitted to the robot. the robot will then use reverse kinematics to translate the actions and position of the human hand to movements of the robotic arm. this technique will also be used during the training phase of the robot, where the human user will teach the robot how to clean various surfaces and then purify itself, simply by moving their hands accordingly. 
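The passage states that the operator's hand position is translated into arm movements via reverse (inverse) kinematics. The sketch below solves only the textbook two-link planar case with the law of cosines; the link lengths, the target point and the planar simplification are assumptions for illustration, and this is not the controller of the proposed six-legged robot.

    import math

    def two_link_ik(x: float, y: float, l1: float, l2: float):
        """Return shoulder and elbow angles (radians) placing a two-link arm's tip at (x, y).

        Raises ValueError if the target is out of reach."""
        d2 = x * x + y * y
        # law of cosines gives the elbow angle
        cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
        if not -1.0 <= cos_elbow <= 1.0:
            raise ValueError("target out of reach")
        elbow = math.acos(cos_elbow)  # elbow-down solution
        shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                                 l1 + l2 * math.cos(elbow))
        return shoulder, elbow

    # hypothetical hand position captured by the motion controller; link lengths in metres are assumed
    angles = two_link_ik(0.25, 0.10, l1=0.20, l2=0.15)
    print([round(math.degrees(a), 1) for a in angles])

A real controller for a multi-jointed leg or arm would solve a higher-dimensional version of the same problem, typically numerically, but the mapping from a desired tip position to joint angles is the step the text refers to.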
the design of the spider robot was proposed in a project proposal submitted to the king abdulaziz city of science and technology (https://www.kacst.edu.sa/eng/pages/ default.aspx) by the author and george tsaramirsis (https:// www.researchgate.net/profile/george_tsaramirsis). we have presented details of some of the emerging technologies and real life application, that are providing businesses remarkable opportunities, which were previously unthinkable. businesses are continuously trying to increase the use of new technologies and tools to improve processes, to benefit their client. the iot and associated technologies are now able to provide real time and ubiquitous processing to eliminate the need for human surveillance. similarly, virtual reality, artificial intelligence robotics are having some remarkable applications in the field of medical surgeries. as discussed, with the help of the technology, we now can predict and mitigate some natural disasters such as stampedes with the help of sensor networks and other associated technologies. finally, the increase in big data analytics is influencing businesses and government agencies with smarter decision making to achieve targets or expectations. the evolution of the internet: from military experiment to general purpose technology how much data is on the internet? science focus (the home of bbc science focus magazine) the deep web is the % of the internet you can't google. curiosity history on the internet . : the rise of social media tracking the evolution of the internet of things concept across different application domains integrated omics: tools, advances and future approaches medical big data: promise and challenges where healthcare's big data actually comes from. emerj large datasets in biomedicine: a discussion of salient analytic issues last accessed on / / from a brief history of cloud computing big data analytics: applications, prospects and challenges cloud computing in smes: case of saudi arabia. bijit-bvicam's bringing computation closer toward the user network: is edge computing the solution? managing crowds with wireless and mobile technologies. hindawi improving privacy and security of user data in location based services preserving privacy in internet of things-a survey towards integrating mobile devices into dew computing: a model for hour-wise prediction of energy availability enabling real-time context-aware collaboration through g and mobile edge computing finding your way in the fog: towards a comprehensive definition of fog computing minimizing dependency on internetwork: is dew computing a solution? addepalli s fog computing and its role in the internet of things rfid technology and its applications in internet of things (iot), consumer electronics, communications and networks (cecnet). 
in: nd international conference proceedings towards internet of things: survey and future vision internet of things (iot): a vision, architectural elements, and future directions blockchain technology in business and information systems research managing crowds with technology: cases of hajj and kumbh mela an overview of blockchain technology: architecture, consensus, and future trends blockchain technology for social impact: opportunities and challenges ahead blockchain for controlling hajj and umrah permits implementing blockchains for efficient health care: systematic review it applications in healthcare management: a survey application of service robots for disinfection in medical institutions service robots in hospitals: new perspectives on niche evolution and technology affordances key: cord- -y ck lo authors: simon, perikles title: robust estimation of infection fatality rates during the early phase of a pandemic date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: y ck lo during a pandemic, robust estimation of case fatality rates (cfrs) is essential to plan and control suppression and mitigation strategies. at present, estimates for the cfr of covid- caused by sars-cov- infection vary considerably. expert consensus of . - % covers in practical terms a range from normal seasonable influenza to spanish influenza. in the following, i deduce a formula for an adjusted infection fatality rate (ifr) to assess mortality in a period following a positive test adjusted for selection bias. official datasets on cases and deaths were combined with data sets on number of tests. after data curation and quality control, a total of ifr (n= ) was calculated for countries for periods of up to days between registration of a case and death. estimates for irfs increased with length of period, but levelled off at > days with a median for all countries of . ( %-ci: . - . ). an epidemiologically derived ifr of . % ( %-ci: . %- . %) was determined for iceland and was very close to the calculated ifr of . % ( %-ci: . - . ), but . - -fold lower than cfrs. ifrs, but not cfrs, were positively associated with increased proportions of elderly in age-cohorts (n= , spearman's ρ =. , p =. ). real-time data on molecular and serological testing may further displace classical diagnosis of disease and its related death. i will critically discuss, why, how and under which conditions the ifr, provides a more solid early estimate of the global burden of a pandemic than the cfr. in the early phase of a pandemic caused by a novel pathogen, it is difficult to estimate the final burden of disease. in the case of the ongoing pandemic caused by sars-cov- , it has been proposed, on the one hand, that it will be the most serious seen for a respiratory virus since the - h n influenza pandemic . this pandemic contributed to the premature death of percent of the world population at the time being . on the other hand, despite all non-refutable morbidity, covid- may fall short of provoking a comparable impact on mortality as the seasonal influenza, which is estimated to contribute to , deaths per year on average . experts conclude that there is still a range for cfr of . up to . , which practically spoken is reflecting the margin from normal seasonal influenza to the lower boundary cfr estimate of - h n influenza . case fatality rates (cfr) can be helpful to critically control and reflect the outcome of robust modelling to estimate the global burden by mortality . 
during the phase of an outbreak cfrs are preliminary and should be communicated and used with caution . even in the case of the well-known and frequently studied seasonal influenza a, it is a matter of debate on how to estimate a cfr during the phase of the pandemic, which could be calculated by the total number of "deaths" divided by the number of "cases". there are ongoing discussions, what can and should be regarded as a "case" and a reasonably causal-related death . a "case" could ideally be a confirmed case of the infectious disease according to strict diagnostic guidelines, requiring symptoms and confirmatory testing. moreover, the death would ideally be a causally related death and not a death caused by superinfection over the course of hospitalization, for instance. whether we impose less or more strict guidelines to define a suitable "case" and its causally related death, either way will inevitably introduce selection bias. this may lead to both, either substantially higher, or substantially lower estimation of cfrs , . on the one hand, it can be argued that a strict procedure for confirmation of official cases and deaths may underestimate the effect of disease on mortality, since we will miss out both, deaths and cases, for a significant proportion of the population . on the other hand, an infection with a pathogen of mild up to medium virulence, like sars-cov- , can be completely asymptomatic, or may cause only minor symptoms to a majority of infected persons , . at present the size of the denominator of total community infections is unknown , but at least there is increasing evidence that a major part of the younger population, rarely become symptomatic and even more rarely could die of the disease, if diagnosed with covid- . at the beginning of a pandemic with a novel pathogen central aspects of previously acquired immunity or genetic resistance against viruses , or other pathogens are often unknown or only vaguely explored. if they existed, but were not taken into account, the estimated cfr would not reflect the burden of an infectious disease on a macro perspective. in such a setting looking preferentially at those, who have the full-symptomatic disease or are subjected to the surveillance system, can severely overestimate the burden of disease. therefore, in the early phase of an outbreak, infectious disease epidemiologists will rather base their estimates of the potential burden of a pandemic rather models requiring basic assumptions on the basic reproduction number r , the latent period of the infectious period, and the interval of half-maximum infectiousness and many other factors. these all need to be derived from early field studies that often need to be conducted under sub-optimal conditions, in the heat of a pandemic . these basic assumptions of epidemiological key figures are then used for modelling the almost uncharted , , . from a stochastic point of view, such modelling is prone to an exponential propagation of imputation error, which can principally be controlled for and reported with these models . however, as in the present case such error propagation will finally arrive at conclusions via modelling, that are indeed based on assuming merely all or nothing and thus indeed would need to be communicated with uttermost caution to the public and foremost health politicians . under these circumstances, it could be helpful to look at alternative ways to asses a robust figure for cfr as a typical, hard to predict estimate for global burden of a pandemic. 
Alternative approaches may allow interdisciplinary abductive reasoning to arrive at an estimate of the global burden of a disease. Particularly under poorly defined and dynamic circumstances, abductive reasoning can be more helpful than the best medical evidence employing inductive statistical inference alone. A promising approach to arrive at a more robust measure of mortality has recently been published. It involved calculating an infection fatality rate (IFR) that not only takes the asymptomatic population and its relevance for mortality into account, but is also adjusted for censoring and ascertainment bias. However, this approach again required an immense workload for retrieving and curating valid data from cohorts studied under specific conditions, as well as pre-assumptions, and is therefore again prone to error propagation. Here I deduce that an IFR adjusted for the selection bias in favour of more morbid persons can be determined from available official data in conjunction with testing figures, and I show that this IFR adjusts the CFR for some essential sources of bias. To cross-validate the computed figures, I estimate infection fatality from an ongoing large-scale testing pool of citizens representative of the general population in Iceland. At the time the dataset was closed ( th of April), . % of the general population in the representative cohort and . % of the typically symptomatic part of the population had been tested. Finally, I compare how CFRs and the calculated IFRs are suited to reflect essential epidemiological aspects already known to be associated with COVID- . The estimation of an IFR is based on two different and, regarding the influence of selection bias, divergent procedures for calculating a CFR from infection-related population data. The first is a variation of the non-adjusted CFR, in the following termed "classic CFR", which divides the sum of deaths by the sum of cases on a given day. This variation takes into account the persons who have died or recovered, as well as the number of days from reported positive testing (d_0) to death (d_n) as the period 0-n. For a given time interval of n days, the sums of test positives (TP_0), deceased persons (DP_0) and recovered persons (RP_0) at d_0 are known, as is the sum of deceased persons (DP_n) at d_n. A case fatality ratio for the interval, CFR_0-n, can then be calculated as CFR_0-n = (DP_n − DP_0) / (TP_0 − RP_0). For the data provided by Johns Hopkins University (JU), the recovered persons (RP_0) can be included, while for the data provided by the European Centre for Disease Prevention and Control (ECDC) a simplified version can be calculated by substituting TP_0 for TP_0 − RP_0. Data from the ECDC were therefore mostly used for quality control; they served to critically revise some negative case and death numbers in the JU dataset, i.e., numbers that had been officially reported on one day to the WHO and the ECDC but were later corrected by national health authorities, as was the case, for instance, for the data of Iceland. As mentioned in the introduction, the denominator of total community infections is unknown, yet it is a rather critical factor of uncertainty in the ongoing pandemic. I put this into a similarly simple mathematical term by calculating the CFR as a CFR' that takes this unknown denominator into account. At the beginning of a pandemic, RP_0 is often low or zero.
If the total number of tests (N_T) conducted up to d_0 is known, a CFR' can be calculated by taking into account the total population N_P of the country or region in which testing was conducted. Notably, this is not a typical formula for a CFR, since it may tend to underestimate the true fatality until the infectious disease has stopped spreading in the population. However, just like the CFR formula above, it ultimately depends only on registered deaths and case numbers, and it is subject to the same factors that bias the estimation of a mortality figure, only in the opposite direction. Moreover, since the size of the denominator of total community infections is unknown but appears to be highly relevant, this equation puts the CFR into context with the general population. The prevalence (P) of non-recovered infected persons, estimated from the test pool and applied to the total population (N_P), is calculated as P = (TP_0 − RP_0) / N_T. Based on this prevalence, a CFR' for the case fatality in the interval 0-n (CFR'_0-n) can be calculated as CFR'_0-n = (DP_n − DP_0) / (P · N_P). Both formulas, CFR_0-n and CFR'_0-n, have their shortcomings. It will be briefly discussed why the former will most likely overestimate and the latter underestimate an IFR, under the unavoidable premise that official testing tends to test more morbid persons. In both equations, the pool of newly infected persons is subject to selection bias. The CFR' formula typically underestimates the IFR because the prevalence P of active cases is determined too high to be generalized to the total population. During an outbreak, this is unavoidable for testing strategies based solely on the health care system, since testing guidelines require, or at least favour, preferential testing of persons with an accumulation of risk factors, such as specific disease-related symptoms or a stay in, or visit to, an endemic region. Accordingly, active cases are overrepresented in a pool N_T that is too small to be representative of the prevalence in the general population N_P. Furthermore, some countries or regions may have limited resources as a pandemic proceeds and will adjust their testing guidelines to detect positive cases with as few tests as possible. To control for this distortion by selection bias, which enriches positive cases in the test pool N_T, the overestimated prevalence P needs to be adjusted with the unknown factor f to turn CFR'_0-n into IFR_0-n. The CFR_0-n formula, in turn, is determined rather too high and does not represent an IFR because of the very same distortion factor f: since testers selectively draw from the pool of diseased persons into the pool of tested persons N_T, they also inflate the risk of death relative to the population of all infected persons, which is what the IFR should reflect. Likewise, CFR_0-n needs to be corrected by the same factor f, and IFR_0-n can in this case be calculated as IFR_0-n = CFR_0-n / f, equivalently IFR_0-n = CFR'_0-n · f. At this point, I suggest the reader jump to the table in the results section to understand better why the CFR is divided by f and the CFR' multiplied with f. Typical potential distortion factors, such as the gross domestic product of a country and, first and foremost, the age composition of the population, are inversely correlated with CFR or CFR'. To adjust these two equations for all the factors that act in divergent ways on their respective figures, we can now solve the equation below for the common distortion factor f, adjust the CFRs accordingly, and estimate an IFR_0-n.
Setting the two adjusted figures equal, CFR_0-n / f = CFR'_0-n · f, and solving for f yields f = √(CFR_0-n / CFR'_0-n) = √(N_P / N_T). Accordingly, the CFR adjusted with the factor f = √(N_P / N_T) resembles an IFR, which can be calculated from either the CFR or the CFR' equation, with both delivering the same outcome: IFR_0-n = CFR_0-n · √(N_T / N_P) = CFR'_0-n · √(N_P / N_T). To compare estimates of the IFR, I analysed the data on cases, deaths and recoveries published in real time and corrected once daily by JH. Data between the th and nd of March were combined and validated with the respective data from the ECDC. For the numbers of deaths, I used the training dataset in its final corrected version from JH for the end of the th of March, the st of March for the enlarged final dataset, and the version of the th of April for the dataset used for validation with the epidemiological project in Iceland. For the calculation of the epidemiologically derived IFR_decode, the correction factor f did not need to be applied, because the prevalence of a positive test in the general population was determined experimentally as P_decode, which was used to determine the IFR for the general population of Iceland with the formula IFR_decode = (DP_n − DP_0) / (P_decode · N_P). This formula no longer relies on cases reported in the official databases of JH or ECDC, and it served as a cross-validation figure, in the validation part of the results section, for the IFR and the CFRs, which are based solely on those data and the population data of Iceland. For the final analysis in the training and the final dataset, only countries were included that reported at least one death by the nd of March in the ECDC and the JH databases. To avoid undue variance from too few case numbers, only data points from countries with more than a minimum sum of cases and at least two official reports of the number of tests performed in the period from the th of March to the nd of March were included. Data on test frequencies were originally obtained from the open-source platform "Our World in Data". Population data of nations were imported from the World Bank, and data on gross domestic products and age-cohort compositions from the UN, in the version chosen to be comparable with the population data or as age-cohort estimates. For the included countries, the data were controlled, corrected and updated with information from the official data source pages of the national health agencies as listed in the supplemental table. Two more datasets were obtained for a validation study. First, data from a testing cohort of the normal population of Iceland led by the genetics company deCODE; the project is planned as a clinical project with an ethical permit and, to the best of my knowledge, no other data are so far available on representative study cohorts from populations that were not pre-selected for pathological symptoms. Second, data mining for a final enlarged dataset was performed on the pages of the official national health agencies, on Wikipedia, and within the data-mining community on GitHub, using archived webpages where necessary, in order to enable a large-scale cross-country assessment and comparison of IFR values. In subset analyses of multiple entries for the log-normalized cumulative testing data, Pearson's correlation coefficient between data entries across the three different data sources was > . (data not shown). Most of the data, nested into groups, showed signs of unequal variance by Bartlett and Levene testing. They were therefore log-normalized and, where the data included zero values, an offset of . was added before log-normalization, taking care that this did not distort the distribution of the data, as analysed by Shapiro-Wilk testing.
normal distributed data with equal variance across groups were then compared using one-way anova f-testing. global significance was followed up by all-pairs tukey-kramer testing as post-hoc test. for reporting, data were de-normalized adding the offset where necessary and reported as means with -% cis, if not specified otherwise. if signs of non-normal distribution or unequal variance prevailed, a wilcoxon-test on rank sums for group comparisons, or a spearman's rank correlation coefficient was reported. for the testing data set descriptive median values and their interquartile -ranges were reported. the combined datasets from ecdc and jh contained information on deaths, cases, and in the jh dataset information on recoveries. for ecdc , data entries from countries starting on the st of december . and for jh , entries from the same countries starting on st of january were combined. cumulative cases and deaths were significantly correlated between both data sets (pearson's r > . , p < . , for both, data not shown). no data on test frequencies were reported in the official international data repositories. from the platform "our world in data" different data entries for countries were retrieved. after combination with the official data on test figures, fourteen countries fulfilled inclusion criteria. testing data for these countries were controlled by visiting the official test report pages of the fourteen countries (supplemental table ), which enabled adding another data points. the calculation of the classic cfr by dividing deaths by cases revealed a large range for the respective medians of the countries and cfr calculated for the period from the nd to the th of march surpassed % for countries (fig. ) . excess mortality was present throughout most points in time for italy, uk, france, the philippines, and canada, except for one data point, which was related to a period from case to death of only one week. excess mortality or mortality too close to the point, or even with the point of death is a bias, which will not be corrected by the factor f and will inevitably lead to an overestimation of cfr with both formulas ( ) and ( ). cfr values above % are theoretically impossible, while cfr values over % are at least highly unrealistic. therefore, i excluded these data points with a cfr > from further analysis. noteworthy, this led to the removal of all data points from italy (classic cfr = . ,) france ( . ), uk ( . ) and philippines ( . ). noteworthy, selection did not exclude japan with a classic cfr of . ( table ). . values > % were excluded from further analysis as explained in the results section. upper right shows the classic cfr calculated as total deaths at one day divided by active cases at same day for the remaining countries. the classic cfr is the higher, the more recent data were assessed (legend for all upper parts). ifr was calculated for the remaining countries and is shown at lower left to vary also depending on period. while the ifr increases with period, this increase declines significantly with increasing period as shown in the lower right. in comparison with the cfr (blues curve) with % ci on splined data with moderate , the classic cfr red curve, the ifr green curve shows higher values in countries that have the who status ( . . ) "in local transmission". values reached those of the countries either the status "outbreak" after the occurrence of the first reported death. 
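The adjustment with the factor f that is applied in the following results reduces to a few arithmetic steps on cumulative counts. The sketch below implements the quantities derived in the methods (CFR_0-n, the prevalence P, CFR'_0-n, f = √(N_P/N_T) and the adjusted IFR_0-n); all counts are invented for illustration and do not correspond to any country in the dataset.

    import math

    def adjusted_ifr(tp0: int, rp0: int, dp0: int, dpn: int,
                     n_tests: int, n_pop: int) -> dict:
        """Compute CFR, CFR' and the selection-bias-adjusted IFR for one period.

        tp0, rp0, dp0: cumulative positives, recovered and deaths at day 0
        dpn:           cumulative deaths at day n
        n_tests:       cumulative tests performed up to day 0
        n_pop:         total population of the country or region"""
        deaths = dpn - dp0
        cfr = deaths / (tp0 - rp0)                 # tends to overestimate
        prevalence = (tp0 - rp0) / n_tests         # overestimated in a morbid test pool
        cfr_prime = deaths / (prevalence * n_pop)  # tends to underestimate
        f = math.sqrt(n_pop / n_tests)             # common distortion factor
        ifr = cfr / f                              # identical to cfr_prime * f
        assert math.isclose(ifr, cfr_prime * f)
        return {"CFR": cfr, "CFR_prime": cfr_prime, "f": f, "IFR": ifr}

    # purely hypothetical counts for a case-to-death window of roughly two weeks
    print(adjusted_ifr(tp0=1200, rp0=200, dp0=5, dpn=19,
                       n_tests=40000, n_pop=5_000_000))

With these invented inputs the unadjusted CFR is about 1.4 %, while the adjusted IFR comes out roughly an order of magnitude lower, which is the direction of correction the table in the results illustrates.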
though it is plausible that conducting more tests per day, can contribute to artificially increasing both types of cfrs, countries may also respond with increasing their test numbers, once they notice increases in the test positive ratio as a sign for focusing testing too much on the more morbid part of the population (table , last line). in table in the following we will use f as an adjustment factor for determining the ifr − of sars-cov- and comparing the obtained values with the three different cfrs (tab. ). the median ifr values of the nine remaining countries lie in a close margin between . for south korea and . for denmark, while classic cfr and cfr values still show high variance for the remaining countries. especially for south korea and japan as two comparable countries that are in a phase of stagnation the cfr values calculated are roughly - times higher and still rising and they are without correction by f not in the margin of the current expert consensus for a cfr for covid- (table ) . table shows the medians and quartiles of the three different cfr values and the ifr in contrast to the classic cfr, cfr and cfr' the ifr values, including japan, were in the lower range of expert consensus. as the classic cfrs the ifr can depend on the length of the period from cases included to the deaths they are related to figure (lower part). a correlation of the percent increases from day to day with increasing periods in days for all ifrs computed for the countries shown in table , showed a significant negative trend (pearson's r = . ; n = ; p = . ). therefore, dependence on the period between cases and deaths seemed to become more moderate over time and was rather related to the state of the pandemic categorized as "in local transmission" or "outbreak" (fig. , lower right) . as can be observed from the curve of the classic cfr (ccfr), data, which rely on cases assessed before the first death related to the outbreak was registered, tended to be either far too low (ccfr), or to high cfr and ifr. therefore, for the following validation of data from iceland and the evaluation of the final validation dataset, care was taken to include data after the first death reported which ideally also reflected at least a period of days for calculation of the ifrs. for validation of my procedure, i analyzed data from two different testing cohorts in iceland . up to the last data entry for the th of april, . % of the general population in the representative cohort had been tested for sars-cov- by the genetics company decode (decode). additionally, a second, rather typical test cohort of persons with increased risk of infection, representing . % of the total population, had been tested by the directorate of health in iceland via the laboratory of the national university hospital iceland (nuhi). only data were included following the first death on the th of march and allowing for at least a day period of the ifr. prevalence of sars-cov- positive tested persons was . fold-lower (ci: . - . ) in the decode collective, which was highly significantly different from the correction factor f at . (ci: . - . , f = , df = , p = . ; figure ). the upper panel shows four different estimates for mortality of the population of iceland fitted by a spline function with moderate and %-cis shaded. 
the ifrdecode is the figure derived from testing the general population of iceland and served to cross validate the mortality figures cfr and classic cfr that have been calculated from the data repositories of jh and the ifr that used this repository in conjunction with the test data published by iceland's department of public health. the lower left shows different mortality figures calculated compared to the data ifrdecode that is epidemiologically derived and calculated by formula ( ) the data still seemed to rise with an increase in the period of the ifr as indicated by the change in colour with increasing period from red to blue. the lower left shows the comparison of the distortion factor f compared with the p-quotient, which is the quotient for the prevalence of a positive test results in the test pool of the health officials of iceland compared with the general population. for the decode collective an epidemiologically derived prevalence of being tested positive can be calculated according to formula ( ) for general population, which served to calculate an ifrdecode according to formula ( ) representative for the general population and independent from the cases reported by nuhi or in the ju database. the group comparison for ifrdecode with other calculated fatality rates showed a significant global group difference (n = , df = , f = . , p = . ), which was followed by all-pairs tukey kramer post hoc testing (fig. ) . the ifrdecode with . (ci . - . ) did neither differ from the ifr calculated from ju data (ifrju) . (ci: . - . ), nor the one calculated from nuhi data ifrnuhi = . (ci: . - . ). while the classic cfr (ccfr) . ( . - . ; p = . ), the cfrju . (ci . - . , p = . ), and the cfrnuhi . (ci: . - . , p = . ) tended to overestimate fatality of infection roughly . -fold up to -fold. this margin of overestimation is relatively low, when compared to other observed fold-differences between the cfr and the ifr in the training dataset described in table . in order to validate the concept of the ifr on a larger and more heterogenous collection of countries, from more continents than europe and asia, and in order to compare the estimates with expert consensus and conventional cfrs, a more comprehensive data set was composed. this was achieved by connecting the data from jh and ecdc up to the st of march with the data on test numbers conducted as retrieved from the following internet sources (suppl. table ): data found on wikipedia, in the covid- tracking project on github, the cross validated data on "our world in data" (owid), and non-validated data from owid relying on press releases for instance, but reporting its sources rigorously . double entries in different data sources were cross validated. while cross-validation indicated a high data reliability (r > . , p = . ), highly significantly more data that did not pass quality control of staying below a cut-off for the cfr of % had been retrieved from unofficial data sources (data not shown). this cannot be taken as a sign of higher inaccuracy of unofficial data per se, since the following difficulties were encountered by controlling the data on testing frequencies. though referenced correctly, the data in unofficial sources for one country were sometimes referring to different starting points of cumulative assessment. 
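The Icelandic validation described above contrasts the correction factor f with the quotient of prevalences in the health-authority (NUHI) test stream versus the population-representative deCODE cohort, and derives an IFR_decode that needs no correction factor. The sketch below reproduces that comparison on invented counts; only the approximate population size of Iceland is taken as given, all other figures are placeholders, and which test pool enters the calculation of f is an assumption of this sketch.

    import math

    # hypothetical cumulative counts for one reporting day
    n_pop = 364_000                           # approximate population of Iceland
    nuhi_tests, nuhi_pos = 9_000, 1_000       # symptomatic / high-risk testing stream (invented)
    decode_tests, decode_pos = 10_000, 80     # population-representative screening (invented)
    deaths_in_period = 2                      # invented

    prev_nuhi = nuhi_pos / nuhi_tests
    prev_decode = decode_pos / decode_tests
    p_quotient = prev_nuhi / prev_decode      # enrichment of positives in official testing
    # correction factor from the total test count; the choice of pool is an assumption here
    f = math.sqrt(n_pop / (nuhi_tests + decode_tests))

    # no f needed: the deCODE prevalence is already representative of the general population
    ifr_decode = deaths_in_period / (prev_decode * n_pop)
    print(round(p_quotient, 1), round(f, 1), round(100 * ifr_decode, 3), "%")

Comparing p_quotient with f on real data is exactly the check reported above: if official testing enriches positives more strongly than f assumes, the two figures diverge, which is what the significance test between them probes.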
after finding, visiting and translating the original reports page from japanese authorities (supplemental table ) data point entries could be increased from two to , but it became also evident that there were data reporting cumulative test figures starting th of march and data starting from the very beginning of case and death reporting. there is the question whether to exclude or include the cases from an international cruise ship that was under quarantine in a japanese harbor, since most of the ship passengers were not japanese. this is relevant for the early reporting in japan, while, fortunately, the testing figure done on those individuals is rather negatable for later points with higher cumulative total test figures, which analyzed in the following. some other countries were not reporting cumulative, but solely daily, or weekly reports of their testing figures and only in their national language, which could sometimes unintendedly be misinterpreted as cumulative total, when figures were high and rapidly increasing, as for instance in germany. semiofficial resources started to report first estimates on testing figures more than a week before official sources in germany. these figures were too low and had been corrected by official resources, but the up to now official data are still incomplete, which can only be revealed, it one translates, reads and understands the complete report in german language (suppemental table ). moreover, reporting by unofficial sources can be sometimes more precise than official data, but points towards a new field of uncertainty. for the us, the unofficial data-tracking project on github published the data differentiating in reported positive, negative and tests pending. the "test pending" category could be very relevant. a set of countries emerged ( table ) . by data mining, i was able to retrieve the following data on cumulative test numbers. data points came from the national official reporting organ of the countries, data points were originally retrieved via owid, but controlled, and then updated with official national data, data points came from owid, for the usa from the tracking project, where i relied on the confirmed test numbers, excluding the pending ones, to avoid partial doubling of data. finally, data points were retrieved from wikipedia on the covid- pandemic information pages, where also pdfs and links to the national sources of data are published. these data points belonged originally to countries, but only fulfilled inclusion criteria. these countries are listed with their ifrs in table , which also provides means and %-cis for the means of japan and korea are both at the end of a consolidation phase and did neither show values for the classic cfr nor for cfr in line with expert estimates, but their ifr estimates are again in line with that of all other countries (fig. ) . the ifr is with . - . a bit lower than the current expert consensus, but the margin reflected ( fig. and table ) is the narrowest for all countries. the median for all countries for the ifrs was . ( %-ci: . - . ) and significantly different from cfr with . ( %-ci: . - . , p = . ) and the classic cfr of . ( %-ci: . - . , p = . . ; figure upper left). an epidemiologically derived ifr of . % ( %-ci: . %- . %) was determined for iceland and was very close to the calculated ifr of . % ( %-ci: . - . ), but highly significantly . - fold lower than cfrs. 
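The group comparisons and rank correlations reported in these results follow the decision path set out in the statistical methods: check normality and equal variance on log-normalized rates, then apply a parametric or a rank-based test. The sketch below walks that path with SciPy on simulated rates; the offset value and all data are placeholders, the Tukey-Kramer post-hoc step is omitted for brevity, and Kruskal-Wallis stands in as the multi-group analogue of the rank-sum test mentioned in the text.

    import numpy as np
    from scipy import stats

    def compare_groups(groups: dict, offset: float = 0.001):
        """Log-normalize rates and choose a parametric or rank-based group test.

        groups: mapping of group label -> array of rates; offset guards against log(0)."""
        logged = [np.log10(np.asarray(v) + offset) for v in groups.values()]
        normal = all(stats.shapiro(sample).pvalue > 0.05 for sample in logged)
        equal_var = stats.levene(*logged).pvalue > 0.05
        if normal and equal_var:
            name, result = "one-way ANOVA", stats.f_oneway(*logged)
        else:
            name, result = "Kruskal-Wallis", stats.kruskal(*logged)
        return name, result.statistic, result.pvalue

    rng = np.random.default_rng(1)
    # hypothetical country-level fatality figures for three measures
    toy = {"ifr": rng.normal(0.1, 0.02, 12),
           "cfr": rng.normal(0.9, 0.3, 12),
           "classic_cfr": rng.normal(1.4, 0.5, 12)}
    print(compare_groups(toy))

    # association of a rate with the share of elderly people, as in the age-cohort analysis
    rho, p = stats.spearmanr(rng.random(14), rng.random(14))
    print(round(rho, 2), round(p, 3))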
ifrs, but not cfrs, were positively associated with the medians of the countries ifrs were significantly positively associated in men with the proportion of elderly people in the respective countries age cohorts > years (r = . , p = . ), > years (r = . , p = . ), > years (r = . , p = . ), while only significantly associated with the age cohort > years (r = . , p = . ) in females (figure lower part). table : presented are the ifrs for the countries with their means and %-cis in the validation data set. in the lines in bold, the ifrs are nested into three groups according to progressive increase in time period, ranging from below days over - days to weeks and more. in the two lower lines the estimates for the cfr and the classic cfr are shown. during the outbreak of a pandemic, it is difficult to estimate and then communicate a cfr precisely enough from epidemiological data , while situations, where many countries may run out of health resources nevertheless require guidance and recommendations by experts , . in the current situation of the sars-cov- outbreak, the general public is confronted with large differences in the estimation of conventional cfrs between countries like italy (> %), south korea (> %) or germany ( . %). experts' estimates for cfr vary between . up to , which reflects a broad range of pandemic scenarios and a broad range of possible mitigation and suppression strategies which could be derived by experts. in this situation, modelling of scenarios is applied, by relying on key parameters of epidemic spread , , . the most crucial imputation for such models is the basic reproductive number r , which can be assessed from data on the early outbreak of a pandemic, but is compromised by a significant level of uncertainty on top of any epidemiologically derived level of statistical confidence due to essential uncertainties how such data can be transferred from one scenario to another. to put this in simple terms, a cruise ships' field conditions during a quarantine are not comparable to an unanticipated new pandemic outbreak in china , are not comparable to europe . applying modelling now requires even more imputation on the latent period of the infectious period and the interval of half-maximum infectiousness. values could again all be derived from environmental observations, and all prone to a substantially unknown extend of error, especially if assessed for a new pathogen for the very first time. in principal, imputation values can be subject to modelling itself . also modeled values used for imputation into the next model as an assumption will not avoid further inflating the level of uncertainty . what will come out at the end, is the most precise we can get with modeling, we will end up in a range of scenarios from the spanish influenza down to the seasonable flu and with a range of mitigation or suppression strategies that will all be supportable, principally. to improve modelling substantially would now require narrowing the range of fatality down to a margin at which modelling makes sense. a very recent publication describes a way, how we could achieve this goal . in this publication, again a high number of imputations had to be fed again into a model, again field conditions, which are not comparable between each other, had to be chosen and basic assumptions had to be made to model an estimate for the ifr. 
while this estimate may indeed be more precise than former estimates, the basic problem of requiring a lot of proper field work and requiring a lot of as precise basic assumptions as possible to avoid excessive error propagation of unknown extend, had not been solved or dealt with. the ifr was adjusted for census and for a problem, which at first glance my ifr is not capable to cope with, ascertainment aspects . however, a more morbid person, which ends up preferentially in a test pool during the outbreak of a pandemic, may not be a more elderly person or a person having better access to testing, only. we just are not aware, of the many factors that all may contribute to preferentially testing certain people in the heat of a pandemic. we are trusting in ex posteriori -derived assumptions and confirm them with a model. here i propose a way to adjust for this one particular problem in infection epidemiology -preferential selection of persons that will show up in a test pool -, if there was equal access to the test pool and enough testing capacity. i deduced that my approach only required sequentially monitored confirmed cases, recovered cases, and death events in conjunction with total numbers of diagnostic tests performed in a given population. these data, except the total number of tests conducted, are already subject to official reporting and data collection by national and international centres of disease control. i showed that this approach successfully stabilized against selection bias, with a validation against field data in iceland and by comparing cfrs and calculated ifrs for plausibility and for their ability to reflect an association with census not only within countries, but also across countries. the latter is important, if we assume that biological aspects of a virus are valid across the boundaries of nations. this approach required deducing a correction for selection bias, here termed f, and validating the effect of applying this correction factor to empirical data, to arrive at preliminary estimates for countries of the world with % confidence intervals for ifrs. it is a preliminary estimate with all its shortcomings, but it is a single, potentially relevant variable for crude mortality, logically deduced, requiring only data imputations that potentially could be delivered with high certainty during future pandemics, with low effort, and at low cost. more crucially, it does not require exponential modelling or substantial expert knowledge to arrive at a readout for crude mortality that appears to be robust between countries and appears to reflect viral biology. the correction factor f is simply the square root of the quotient of the total population divided by total tests conducted. even if countries would either go through periods of rapid test rate growths or experience limitations with their testing capacities, the distortion provoked will not lead to huge uncertainty ranges by a substantial unknown error propagation. correcting cfrs with f is capable of harmonizing differences in cfrs between countries that would otherwise be difficult to explain. amongst these candidate countries are japan, south korea, iceland, and norway, which have done meticulous work in dealing with their testing data, protocolling everything transparently and timely, to the public, and moreover, which have strong economies and strong health care systems to cope with the current pandemic. amongst those, that report their testing data almost in real time and comprehensively, is pakistan. 
pakistan is a country, which seems to fall out of the range of ifrs, with an ifr of . that is roughly -fold lower than the one reported for the so-called developed countries. since testing was reported transparent and timely, it is important to understand, whether this extremely low ifr figure reported in table could be possibly realistic, or not. population statistics of this country compared to any of the developed countries is very informative with this regard. as of , . % of the population in pakistan were over years old and % were younger than . in germany % were younger than then years old, % were older than years. the cfr for people aged - compared to people aged below has been published to be roughly -fold higher . even though this figure is most likely too high because of the cfr being prone to be inflated by selecting a more morbid population into the testing pool, there is agreement amongst scientists, that sars-cov- at least shows a strong difference within countries or within regions to be associated with higher values for older people . by using the correction factor f it is now possible to show significant association of a sars-cov- related mortality figure with age composition not only within, but also between countries. a crucial problem for testing data prevails, on the present level of accuracy for official reporting. even for countries like the us, italy and the uk, with very timely and detailed test reporting, a calculation of an ifr could not have delivered any meaningful outcome than a cfr calculation, or essentially, guessing. in the heat of an ongoing pandemic, it is often still possible to report a death almost in realtime. at the same time, cases will be prone to unreported delay factors, once testing reaches maximum testing capacity. during an exponential growth, this will cause a severe distortion, if a cfr or ifr is calculated. this might happen just because we think that our cases are reported with the day of testing, but in fact, the case may appear as a reported case many days after the death of the person, leading to a cfr of sometimes more than , % (for an example; uk, figure ). on the opposite, germany had been able to expand its testing volume presumably (pending data revision as of april th , suppl. table for reference) by a factor of . from week to week of the current year, which inflated case number. such bias will not only limit the validity of the cfr, but also limit the validity of ifr calculation. however, in contrast to the problems of unknown error propagation in modeling approaches, such limitation could principally be dealt with. if a similar pandemic outbreak, with concomitantly high enough testing capacity, in a well enough informed and health educated cohort of enough people would occur again. the numbers generated here for the ifrs need to be critically taken into consideration by abductive reasoning in an interdisciplinary committee of experts. they are no standalone figures for mortality, since they mainly reduce one particular sort of bias amongst the manifold in empirical work in the field of infection epidemiology. with the decisions to follow certain mitigation or suppression strategies by almost all developed nations, there will be problems to be solved around the globe . there is one last question, which i will not discuss here. provided that there was a defined place with more than . 
people, and provided at that place everybody knew what covid- is, will not miss a single death, will take care of people coughing, and provided that there was enough testing capacity at hand: could it be that my formula ( ) to correct for selecting more morbid persons into a test pool, will also correct for all other sorts of selection bias, therefore delivering an ultra-precise early estimate for mortality? reassessing the global mortality burden of the influenza pandemic global mortality associated with seasonal influenza epidemics: new burden estimates and predictors from the glamor project covid- -navigating the uncharted communicating the risk of death from novel coronavirus disease (covid- ) case fatality ratio of pandemic influenza. the lancet. infectious diseases potential biases in estimating absolute and relative case-fatality risks during outbreaks estimating the asymptomatic proportion of coronavirus disease (covid- ) cases on board the diamond princess cruise ship early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia case-fatality rate and characteristics of patients dying in relation to covid- in italy quantifying sars-cov- transmission suggests epidemic control with digital contact tracing approaches to uncertainty in exposure assessment in environmental epidemiology. annual review of public health current anti-doping crisis: the limits of medical evidence employing inductive statistical inference. sports medicine (auckland estimates of the severity of coronavirus disease : a model-based analysis. the lancet. infectious diseases data on the two different populations in iceland and description an interactive web-based dashboard to track covid- in real time data provided for the covid- infection by johns hopkins university via the united nations probabilistic population projections: an introduction to demographic forecasting with uncertainty incubation period and other epidemiological characteristics of novel coronavirus infections with right truncation: a statistical analysis of publicly available case data none to be declared. all data used for this study are freely available and accessible via the given references. composed datasets are available upon request. no funding supported the work presented. i would like to thank the health authorities of iceland and decode genetics for making their data freely available for the public. many thanks to the open data source community in its broadest sense. there is massive amounts of work going into curation of pages like the one we can now find on wikipedia about testing. i thank my colleagues, dear friends, and family members, who critically read the manuscript. key: cord- -xs g cy authors: ulahannan, jijo pulickiyil; narayanan, nikhil; thalhath, nishad; prabhakaran, prem; chaliyeduth, sreekanth; suresh, sooraj p; mohammed, musfir; rajeevan, e; joseph, sindhu; balakrishnan, akhil; uthaman, jeevan; karingamadathil, manoj; thomas, sunil thonikkuzhiyil; sureshkumar, unnikrishnan; balan, shabeesh; vellichirammal, neetha nanoth title: a citizen science initiative for open data and visualization of covid- outbreak in kerala, india date: - - journal: j am med inform assoc doi: . /jamia/ocaa sha: doc_id: cord_uid: xs g cy objective: india reported its first covid- case in the state of kerala and an outbreak initiated subsequently. the department of health services, government of kerala, initially released daily updates through daily textual bulletins for public awareness to control the spread of the disease. 
however, this unstructured data limits upstream applications, such as visualization, and analysis, thus demanding refinement to generate open and reusable datasets. materials and methods: through a citizen science initiative, we leveraged publicly available and crowd-verified data on covid- outbreak in kerala from the government bulletins and media outlets to generate reusable datasets. this was further visualized as a dashboard through a frontend web application and a json repository, which serves as an api for the frontend. results: from the sourced data, we provided real-time analysis, and daily updates of covid- cases in kerala, through a user-friendly bilingual dashboard (https://covid kerala.info/) for non-specialists. to ensure longevity and reusability, the dataset was deposited in an open-access public repository for future analysis. finally, we provide outbreak trends and demographic characteristics of the individuals affected with covid- in kerala during the first days of the outbreak. discussion: we anticipate that our dataset can form the basis for future studies, supplemented with clinical and epidemiological data from the individuals affected with covid- in kerala. conclusion: we reported a citizen science initiative on the covid- outbreak in kerala to collect and deposit data in a structured format, which was utilized for visualizing the outbreak trend and describing demographic characteristics of affected individuals. in december , an outbreak of cases presenting with pneumonia of unknown etiology was reported in wuhan, china. the outbreak, caused by a novel severe acute respiratory syndrome coronavirus- (sars-cov- ), later evolved as a pandemic (coronavirus disease ; , claiming thousands of lives globally. [ ] [ ] [ ] [ ] initial studies revealed the clinical and prognostic features of covid- along with its transmission dynamics and stressed the need for implementing public health measures for containment of infection and transmission among the population at high-risk. [ - ] in response to this, several countries have implemented measures including travel restrictions and physical distancing by community-wide quarantine. [ ] these extensive measures were imposed, taking into consideration the lack of adequate testing kits for detection, a vaccine, or proven antivirals for preventing or treating this disease along with reports of considerable strain on the health system leading to unprecedented loss of human life. india-the second most populated country in the world-reported its first case in the state of kerala on january , , among individuals with travel history from wuhan, the epicenter of the covid- outbreak. [ ] with the subsequent reports of an outbreak in the middle east and europe, kerala has been on high-alert for a potential outbreak, as an estimated % of the population work abroad and being an international tourist destination. [ ] the state has a high population density, with a large proportion of the population falling in the adult and older age group. [ ] this population also shows a high incidence of covid- -associated comorbidities such as hypertension, diabetes, and cardiovascular disease. [ - ] as evidenced by reports of other countries, these factors pose a significant threat for an outbreak and would exert a tremendous burden on the public healthcare system. [ ] [ ] [ ] severe public health measures were implemented in the state of kerala and across india to prevent an outbreak. 
international flights were banned by march , , and a nation-wide lockdown was initiated on march , . [ ] however, before these measures were implemented, several cases (including travelers from europe and the middle east), along with a few reports of secondary transmission, were reported in kerala. since the first case was reported, the department of health services (dhs), government of kerala, initiated diagnostic testing, isolation, contact tracing, and social distancing through quarantine, and the details of cases were released for the public through daily textual bulletins. for pandemics such as covid- , public awareness via dissemination of reliable information in real-time plays a significant role in controlling the spread of the disease. besides, real-time monitoring for identifying the magnitude of spread helps in hotspot identification, potential intervention measures, resource allocation, and crisis management. [ ] the lack of such a real-time data visualization dashboard for the public with granular information specific to kerala in the local language (malayalam), during the initial days of the outbreak, was the motivation for this work. to achieve this, the collection of relevant information on infection and refining the dataset in a structured manner for upstream purposes such as visualization and/or epidemiological analysis is essential. open/crowd-sourced data has immense potential during the early stage of an outbreak, considering the limitation of obtaining detailed clinical and epidemiological data in real-time during an outbreak. [ ] [ ] [ ] furthermore, the structured datasets, when deposited in open repositories and archived, can ensure longevity for future analytical efforts and policymaking. here, we report a citizen science initiative to leverage publicly available data on covid- cases in kerala from the daily bulletins released by the dhs, government of kerala, and various news outlets. the multi-sourced data was refined to make a structured live dataset to provide real-time analysis and daily updates of covid- cases in kerala through a bilingual (english and malayalam) user-friendly dashboard (https://covid kerala.info/). we aimed to disseminate the data of the outbreak trend, hotspots maps, and daily statistics in a comprehensible manner for non-specialists with bilingual (malayalam and english) interpretation. next, we aimed for the longevity and reusability of the datasets by depositing it in a public repository, aligning with open data principles for future analytical efforts. [ ] finally, to show the scope of the sourced data, we provide a snapshot of outbreak trends and demographic characteristics of the individuals affected with covid- in kerala during the first days of the outbreak. the codd-k constituting, members from different domains, who shared the interest for sourcing data, building the dataset, visualizing, distributing, and interpreting the data on infection outbreak volunteered this effort (https://team.covid kerala.info/). this initiative was in agreement with definitions proposed by different citizen-science initiatives. [ ] the codd-k invited participation in this initiative from the public through social media. the domain experts in the collective defined the data of interest to be collected, established the informatics workflow, and the web application for data visualization. the volunteers contributed by sourcing data from various media outlets for enriching the data. 
social media channels (telegram channels and whatsapp groups) were used for data collection, which was verified independently and curated by data validation team members. the collective defined the data of interest as minimal structured metadata of the covid- infections in kerala, covering the possible facets of its spatial and temporal nature, excluding the clinical records ( figure ). the resulting datasets should maintain homogeneity and consistency, assuring the privacy and anonymity of the individuals. the notion of this data definition is to make the resulting datasets reusable and interoperable with similar or related datasets. a set of controlled vocabularies was formed as the core of this knowledge organization system to reduce anomalies, prevent typographical errors, and avoid duplicate entries. together with the controlled vocabularies, identifiers of individual entries in each dataset make the datasets interlinked. an essential set of authority controls is used in populating spatial data to make it accurate in naming and hierarchy. a substantial set of secondary datasets was also produced and maintained along with the primary datasets, including derived and combined information from the primary datasets and external resources.
we primarily sourced publicly available de-identified data, released daily as textual bulletins (from january , ) by the dhs, government of kerala, india (https://dhs.kerala.gov.in), of the individuals diagnostically confirmed positive for sars-cov- by reverse transcription-polymerase chain reaction (rt-pcr) at the government-approved test centers. we also collected and curated reports from print and visual media to supplement the data. the quality of the data in terms of veracity and selection bias has been ensured as described (supplementary methods). utmost care was taken to remove any identifiable information to ensure the privacy of the subjects. entries were verified independently by codd-k data validation team members and rectified for inconsistencies ( figure ). since the data collected were publicly available, no individual consent or ethical approval was required for the study. to demonstrate the utility of the collected dataset, we provided the status of the first days (between january , , and june , ) of the covid- outbreak in kerala, and also described demographic characteristics of the individuals affected with covid- . we ensured that the sourced dataset complied with the open definition . laid down by the open knowledge foundation. [ ]
we further visualized the data through a user-friendly bilingual progressive web application (pwa) designed to be both device and browser agnostic. for the convenience of the public, the dashboard mainly highlighted the numbers of individuals who were hospitalized, tested, confirmed, currently active, deceased, and recovered, and the number of people under observation (state-wise and district data), updated daily. we also visualized maps of hotspots and active patients, along with the outbreak spread trend (new, active, and recovered cases), new cases by day, the diagnostic testing trend, patient-age breakup, and confirmed case trajectories at the district administration level ( figure a ). since the sars-cov- infection outbreak occurs in clusters, early identification and isolation of these clusters are essential to contain the outbreak. accurate tracking of the new cases and real-time surveillance is essential for the effective mitigation of covid- .
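to make the curation workflow described above more concrete, the following is a minimal, hypothetical sketch (in python) of how entries sourced from bulletins and media reports could be checked against controlled vocabularies before entering a primary dataset; the field names, allowed values, and records are illustrative assumptions, not the actual codd-k schema.

# minimal sketch: validating crowd-sourced case records against controlled vocabularies
# (field names and allowed values below are illustrative assumptions, not the real schema)
DISTRICTS = {"kasaragod", "kannur", "ernakulam", "thiruvananthapuram"}  # subset, for illustration
STATUSES = {"hospitalized", "recovered", "deceased"}
ORIGINS = {"travel", "contact", "unknown"}
REQUIRED_FIELDS = ("record_id", "date", "district", "status", "origin")

def validate_record(record, seen_ids):
    """return a list of problems found in one record; an empty list means it passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record or record[field] in ("", None):
            problems.append(f"missing field: {field}")
    if record.get("record_id") in seen_ids:
        problems.append("duplicate record_id")
    if record.get("district") and record["district"].lower() not in DISTRICTS:
        problems.append(f"district outside controlled vocabulary: {record['district']}")
    if record.get("status") and record["status"].lower() not in STATUSES:
        problems.append(f"status outside controlled vocabulary: {record['status']}")
    if record.get("origin") and record["origin"].lower() not in ORIGINS:
        problems.append(f"origin outside controlled vocabulary: {record['origin']}")
    return problems

# example: one clean record and one with a misspelled district name
records = [
    {"record_id": "KL-0001", "date": "2020-02-01", "district": "Kannur",
     "status": "recovered", "origin": "contact"},
    {"record_id": "KL-0002", "date": "2020-02-03", "district": "Eranakulam",
     "status": "hospitalized", "origin": "travel"},
]
seen = set()
for rec in records:
    issues = validate_record(rec, seen)
    seen.add(rec["record_id"])
    print(rec["record_id"], "->", issues or "ok")

checks of this kind are what keep a crowd-verified dataset consistent enough to feed real-time surveillance and downstream analysis.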
however, the daily public bulletins by dhs did not have any unique identification code for the covid- infected individuals, or for secondary contacts who contracted the infection through contact transmission. this limited us from tracking the transmission dynamics. as an alternative, we resorted to mapping hotspots of infection as a proxy measure to indicate possible outbreak areas. initially, red, orange, and green zones based on the number of cases were designated for each district by the government of india. later, the government of kerala started releasing covid- hotspot regions at the level of lsg administration areas. we manually curated the hotspot information from the dhs bulletins, and the dataset was published as a static json file in the geojson format, which improves browser caching and drops the requirement for server-side api services. the hotspot locations were highlighted as red dots with descriptions, and when zoomed, the lsg administration area is displayed on the map. in order to improve the visual clarity of hotspots with varying sizes of the lsgs and different zoom levels in browsers, an identifiable spot is placed on the visual center of the lsg area polygon. this inner center of the polygon was calculated with an iterative grid algorithm. to the best of our knowledge, this feature is unique to our dashboard. we also provided a toggle bar to visualize district boundaries and areas declared as hotspots at lsg resolution ( figure c ). owing to the lack of data, additional information such as the number of active cases in these hotspots could not be plotted.
in this report, we describe a citizen science initiative that leveraged publicly available data [ ] ; thus, our model also sets an example for efficient data management in such citizen-science initiatives. while the real-time information serves the public for assessing potential risk, several covid- dashboards have also been reported for india. [ ] [ ] [ ] however, our approach differed from those, as we sourced unstructured data released by the government, supplemented with information from media outlets. this strategy not only ensures authenticity but also enriches the data available in the public domain into a structured dataset, though it depends on the data release policies adopted by the different state governments. kerala is one of the many states in india with a transparent data release policy, which ensured the authenticity of data collected through our initiative. furthermore, the granularity of the data at the lsg levels, which are manually verified (as released in the local language), gives an added advantage, in terms of data depth, over other pan-indian dashboards that mainly rely on apis to fetch cumulative data. although this approach seems to be efficient, an unexpected surge in cases can jeopardize the data collection, thus limiting its feasibility. during such a scenario, a trade-off between depth and breadth of data collection has to be decided. moreover, this approach also has inherent limitations, including issues with the veracity of the data owing to anonymization, and the limited depth of the data released, including clinical symptoms. since each infected case identified in kerala was not provided with a unique id, it was impossible to track these cases for the assessment of vital epidemiological parameters like the reproduction number (r ).
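the "inner center" placement described above (finding a point well inside an irregular lsg polygon rather than its centroid, which can sit near or outside a concave boundary) can be approximated with a simple grid-refinement search. the sketch below, which assumes the shapely library and uses a toy polygon, illustrates the idea only; it is much cruder than published polylabel-style implementations, and whether the dashboard used such a library is not stated in the text.

# rough sketch of an iterative grid search for a polygon's "visual center"
# (pole of inaccessibility): the interior point farthest from the boundary.
# assumes the shapely package; the polygon below is a toy stand-in for an lsg boundary.
from shapely.geometry import Polygon, Point

def visual_center(poly, iterations=4, grid=20):
    minx, miny, maxx, maxy = poly.bounds
    best_point, best_dist = poly.centroid, -1.0
    for _ in range(iterations):
        step_x, step_y = (maxx - minx) / grid, (maxy - miny) / grid
        for i in range(grid + 1):
            for j in range(grid + 1):
                p = Point(minx + i * step_x, miny + j * step_y)
                if poly.contains(p):
                    d = poly.exterior.distance(p)   # distance to the boundary ring
                    if d > best_dist:
                        best_point, best_dist = p, d
        # refine: shrink the search window around the current best point
        minx, maxx = best_point.x - step_x, best_point.x + step_x
        miny, maxy = best_point.y - step_y, best_point.y + step_y
    return best_point

# toy concave polygon; its centroid may sit close to the notch rather than deep inside
lsg = Polygon([(0, 0), (6, 0), (6, 4), (3, 1.5), (0, 4)])
print("centroid:", lsg.centroid)
print("visual center:", visual_center(lsg))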
based on our experience of collating and analyzing covid- data from the public domain in kerala, we propose to frame specific guidelines for the public data release for covid- or other epidemics. we recommend the release of official covid- data in a consistent, structured and machine-readable format, in addition to the bulletins, which could be provided with a permanent url and also archived in a public repository for future retrospective analyses. we also suggest releasing the assigned unique id for the individuals affected with covid- , to avoid inconsistencies in reporting and to enable tracking the secondary transmission. furthermore, providing covid- associated symptomatic information, without compromising the privacy of the infected individuals will also aid in the basic understanding of the disease through analytical approaches. our dataset, compiled between january , , to june , , indicates that the infections reported in kerala were mainly among working-age men, with a travel history of places with covid- outbreak. active tracking and isolation of cases with travel history lead to better management of outbreak. since the majority of cases reported in kerala were within the age group of - years, and the patients being in constant inpatient care possibly contributed to a better outcome and lesser mortality rate, respectively. kerala implemented vigorous covid- testing, and even though the test rate was relatively low ( , tests per million of the population), the average number of positives detected for , tests (individuals) was lesser when compared to other states in india. data from kerala also provides insights about the mean duration of illness and the effect of increasing age on this parameter. collectively, we report a citizen science initiative on the covid- outbreak in kerala to collect data in a structured format, utilized for visualizing the outbreak trend, and describing demographic characteristics of affected individuals. while the core aim of this initiative is to document covid- related information for the public, researchers, and policymakers, the implemented data visualization tool also alleviates the citizen's anxiety around the pandemic in kerala. 
we anticipate that the dataset collected will form the basis for future studies, supplemented with detailed information on clinical and covid- : towards controlling of a pandemic clinical course and risk factors for mortality of adult inpatients with covid- in wuhan, china: a retrospective cohort study a pneumonia outbreak associated with a new coronavirus of probable bat origin a new coronavirus associated with human respiratory disease in china early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia presymptomatic sars-cov- infections and transmission in a skilled nursing facility presumed asymptomatic carrier transmission of covid- characteristics of and important lessons from the coronavirus disease (covid- ) outbreak in china: summary of a report of cases from the chinese center for disease control and prevention prevalence of comorbidities and its effects in patients infected with sars-cov- : a systematic review and meta-analysis travel restrictions hampering covid- response full-genome sequences of the first two sars-cov- viruses from india emigration and remittances: new evidences from the kerala migration survey the kerala tourism model-an indian state on the road to sustainable development kerala: government of kerala prevalence and associated risk factors of hypertension among persons aged - in india: a cross-sectional study the changing patterns of cardiovascular diseases and their risk factors in the states of india: the global burden of disease study incidence of type diabetes mellitus and prediabetes in kerala, india: results from a -year prospective cohort potential association between covid- mortality and health-care resource availability what other countries can learn from italy during the covid- pandemic variation in covid- hospitalizations and deaths across new york city boroughs india under covid- lockdown using "outbreak science" to strengthen the use of models during epidemics an interactive web-based dashboard to track covid- in real time early epidemiological analysis of the coronavirus disease outbreak based on crowdsourced data: a population-level observational study. the lancet digital health open access epidemiological data from the covid- outbreak open knowledge foundation. open definition . . secondary open definition ten principles of citizen science opinion: toward an international definition of citizen science the geojson format secondary frictionless data specs. secondary frictionless data specs may info-data: a collective open dataset of covid- outbreak in the south indian state of kerala linked data glossary genomic epidemiology of sars-cov- in guangdong province crowdsourcing biomedical research: leveraging communities as innovation engines role of technology in responding to disasters: insights from the great deluge in kerala architectural considerations for building a robust crowdsourced disaster relief application. international conference on communication systems & networks (comsnets); space-based monitoring of severe flooding of a southern state in india during south-west monsoon season we acknowledge shane reustle for his help and support for forking the japan covid- coronavirus tracker repository and implementation of the dashboard. we thank jiahui zhou for the original concept and design of the tracker. we also thank sajjad anwar for generously providing the administrative boundary shapefiles and geojsons for kerala. maps were generously provided by the mapbox community team. 
we also thank the volunteers who have contributed to the sourcing of data from various media outlets for enriching the data. the authors declare no competing interests this study was not funded by any agencies and was purely a voluntary effort during the community-wide quarantine period by a team of technologists, academicians, students, and the general public advocating open data and citizen science. key: cord- -pmf aps authors: avtar, ram; komolafe, akinola adesuji; kouser, asma; singh, deepak; yunus, ali p.; dou, jie; kumar, pankaj; gupta, rajarshi das; johnson, brian alan; thu minh, huynh vuong; aggarwal, ashwani kumar; kurniawan, tonni agustiono title: assessing sustainable development prospects through remote sensing: a review date: - - journal: nan doi: . /j.rsase. . sha: doc_id: cord_uid: pmf aps the earth's ecosystems face severe environmental stress from unsustainable socioeconomic development linked to population growth, urbanization, and industrialization. governments worldwide are interested in sustainability measures to address these issues. remote sensing allows for the measurement, integration, and presentation of useful information for effective decision-making at various temporal and spatial scales. scientists and decision-makers have endorsed extensive use of remote sensing to bridge gaps among disciplines and achieve sustainable development. this paper presents an extensive review of remote sensing technology used to support sustainable development efforts, with a focus on natural resource management and assessment of natural hazards. we further explore how remote sensing can be used in a cross-cutting, interdisciplinary manner to support decision-making aimed at addressing sustainable development challenges. remote sensing technology has improved significantly in terms of sensor resolution, data acquisition time, and accessibility over the past several years. this technology has also been widely applied to address key issues and challenges in sustainability. furthermore, an evaluation of the suitability and limitations of various satellite-derived indices proposed in the literature for assessing sustainable development goals showed that these older indices still perform reasonably well. nevertheless, with advancements in sensor radiometry and resolution, they were less exploited and new indices are less explored. the success of sustainable development in any region depends upon what is known regarding resource management and hazards in the area (tabor and hutchinson, ) . although several approaches and techniques are available to monitor natural resources and hazards, remote sensing (rs) technology has been particularly popular since the s because of its low acquisition costs and high utility for data collection, interpretation, and management. over the past few decades, rs tools and techniques have been deployed for several purposes at various time scales (jensen, ) . rs provides both archived and near-real-time information on earth systems (jensen, ; jensen and cowen, ) . rs is applied to obtain spatial information in various fields in earth system science. the ability of rs to monitor earth systems at various spatial and temporal scales makes it suitable for addressing global environmental, ecological, and socioeconomic challenges. rs can provide a synoptic view of spatial information at local, regional, and global scales, thus facilitating swift decision-making and action (jensen and cowen, ) . 
as information can be obtained directly through rs, it is the main surveying technology employed for collecting data in inaccessible and remote locations. based on these rs data, forest fragmentation, land use and cover, and species distributions have been mapped and monitored over time (kerr et al., ; menon and bawa, ) . lulc data are especially useful for detecting the distributions of individual species, species assemblages, and species richness over broad areas (kerr and ostrovsky, rs can also be used to derive environmental parameters or indices indirectly, to in turn map species patterns and diversity (turner et al., ) . such parameters are thought to be drivers of biodiversity, and those that are frequently estimated for determining species richness and distribution patterns include (i) primary productivity, (ii) climate variables, and (iii) habitat structure (abdalla, ). these three types of parameters facilitate assessment of the diversity of various species at any given location and time (turner et al., ) . parameters can first be estimated from data obtained by advanced rs sensors; then, both local and global species availability, richness, and diversity can be inferred. the capability of remote sensing application for mineral exploration was started from the passive satellite sensors to active sensors. in the past decades, several studies have been done towards ( ) the mapping of geology and structures (the faults and fractures) that hosts ore deposits; ( ) identifying hydrothermally altered rocks based on their spectral signatures; ( ) mapping surface distribution of rocks and its mineral constituents (sabins, ) . sabins, ( ) in some areas such as mountainous areas in developing countries with a lack of in-situ measurement devices. it is difficult to simulate hydrodynamic models due to a lack of data. the principle behind the usage of the insar is the acquisition and processing of phase shift information obtained from a series of complex sar images. in this case, every pixel element from each image is processed and the elevation at its centroid is established based on signal phase response and the satellite altitude information (rosen et al., ) . the table shows summary of the studies relevant to applications of remote sensing in transportation. an important aspect of sustainable development is the understanding of the dynamics of the population within a community and across national boundaries. this information assists industrialization, and socio-economic development. the population also plays a substantial role in our ability to measure the extent of human influence on the environment. demographic data is measurable and quantifiable, which lends itself to applications in remote sensing. from an economic perspective, the population is one of the determinants of demand. an increase in the population invariably increases the aggregate demand within a country. there are a few ways to go about using remote sensing techniques to count the population. jensen and cowen, ( ) imagery against population census data. they found that high-resolution satellite images do not correlate strongly enough with the population data to serve as a proxy for population data. they also only found a weak correlation between landscape textures and population density. as census data is already being collected through surveying methods on the ground, there is less necessity for remote sensing applications in population estimation. 
there is a clear distinction in the literature between allocation and estimation. despite the prevalence of population census data, this type of information does not give significant insight into how these people are spatially arranged. the population has been recognized as an indirect driver of land-use change though its effect cannot be explicitly stated (meyer and turner, principle that people tend to cluster, and in the model, the population density is greatest at the indianapolis, indiana. they found that remote sensing-based models that stratified the population according to density levels increased the accuracy of the model. they cited the issue that the census data is of a lower resolution than the remotely sensed data. they also found that the inclusion of textures, temperatures, and spectral responses greatly increased the accuracy of estimation. the remotely sensed data must be combined with in-situ data to ensure accuracy. however, literature seems to agree that, measurements of population density using remote sensing have not been carried out consistently due to the large degree of variation between communities. more technologically developed countries can remotely sense population allocations. japan, for example, has access to positioning data obtained from smartphones. this knowledge was applied when the tohoku earthquake struck, providing insight as to where the highest concentrations of people were in real-time in the midst of the disaster. however, the use of this type of remotely sensed data has raised a lot of concerns if it is to be used in the field of research because for many people it represents a privacy breach. table vulnerability against poverty rates. one of the issues that they experienced is that poverty identified, and it is likely that they differ from place-to-place. extending the studies on poverty, in recent years, scientists have seen the need to shift away from a static poverty mapping model and move towards a more dynamic one. rogers et al., ( ) although there is some evidence that about the spatial trend as shown by high levels of contiguity in three clusters: the peloponnesian region, the islands of the dodecanese and crete. this study was not conclusive, yet it highlights one of the main issues of socio-economic applications of remote sensing. still, accurately attributing a precise number of people to a small spatial designation has been observed consistently and needs further investigation. in some literature, environmental quality has been used interchangeably with the quality of life. this refers to the perception of the quality of the natural environment is integrated into the human environment such that the human population actively interacts with and perceives. lo and faber ( ) were able to show that there is a linkage between income, population density, and forest amenities measured by leaf area. they found that higher levels of greenness were positively correlated with income and median home data and negatively correlated with population density. pozzi and small ( ) suggested that using greenness to determine levels of affluence can lead to ambiguous results because greenness can be indicative of either high or low levels of affluence. at this point, they agree that the stratification between urban, rural, and suburban locals greatly increases the accuracy of results. 
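as a concrete illustration of the "greenness" measures referred to above, and of the satellite-derived indices discussed next, the following minimal sketch computes the normalized difference vegetation index (ndvi) from red and near-infrared reflectance arrays; the arrays here are synthetic placeholders standing in for calibrated satellite bands, and the 0.3 threshold is only a commonly used, scene-dependent convention.

# minimal sketch: computing ndvi = (nir - red) / (nir + red) per pixel
# the two arrays below are synthetic reflectance values standing in for real satellite bands
import numpy as np

rng = np.random.default_rng(0)
red = rng.uniform(0.02, 0.30, size=(4, 4))   # red-band surface reflectance (placeholder)
nir = rng.uniform(0.10, 0.60, size=(4, 4))   # near-infrared reflectance (placeholder)

denominator = nir + red
ndvi = np.where(denominator == 0, 0.0, (nir - red) / denominator)  # guard against division by zero

print("ndvi range:", ndvi.min().round(3), "to", ndvi.max().round(3))
vegetated = ndvi > 0.3          # illustrative greenness threshold
print("fraction of pixels above 0.3:", vegetated.mean())

other indices mentioned in this review, such as evi and the water index ndwi, follow the same band-arithmetic pattern with different bands and coefficients.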
it can be seen from table that most of the indices applied in sustainable development studies were developed a long time ago, when sensor radiometric and spatial resolutions were lower than those available at present. despite that, these indices, such as ndvi, evi, and lai, performed well. today, a large number of satellites orbiting in space offer a narrower range of radiometric resolution, improved spatial resolution, and increased availability of sar data in multiple frequencies. however, fewer attempts have been made to develop new spectral indices capable of retrieving more accurate information from remote sensing products. therefore, there is a need to explore new indices from the latest sensors to improve earth monitoring and continue progress toward sustainable development through remote sensing.
• the biggest challenge associated with remote sensing itself is the availability and distribution of data. the lack of freely available high-resolution remote sensing data leaves the remote sensing research community debilitated, despite the fact that advanced remote sensing tools have become available for processing and analyzing the data. in cases where high-resolution data are available commercially, their cost is not affordable for many researchers, especially those from economically weaker countries.
• lack of effective national spatial data infrastructures (sdi) in developing countries prevents access to data and information for analysis and validation.
• due to the inherent shortcomings of remote sensing devices in measuring underground conditions directly, inferential methods are sometimes adopted; however, such methods suffer from limited accuracy in many cases, especially in groundwater exploration.
• mapping of lake bodies in glaciated areas using various indices is still difficult because of the similar reflectance behavior of adjoining areas.
• since turbidity varies largely between aquatic systems, generic algorithms for water quality mapping introduce an error of more than % in low to moderately turbid waters. the error in highly turbid water is much greater.
• use of hyperspectral data for mineral mapping has high potential; however, the availability of hyperspectral sensor data is limited.
• mapping of surface mineralogy with remote sensing under the forest canopy in the tropical rainforest regions of the world remains difficult.
• although there are advanced algorithms for mapping snow cover, remote sensing of snow can be extremely difficult due to mixed pixels arising from cloud cover.
• availability of cloud-free satellite data during flood events is still challenging in the tropical region. high-temporal-resolution sar remote sensing is the viable solution.
• apart from mapping the flood extent and water depths, derivation of flood water characteristics such as flow velocity, sediment load, warning time and awareness, winds, and duration of inundation from the integration of satellite observations in space and time remains challenging (merz et al., ).
• more accurate and open-access precipitation, discharge, boundary, and topography data at the global level are needed to increase the dependability of flood hazard modeling.
• lack of ground data for validation in data-scarce regions often affects the reliability of satellite-based rainfall data.
• satellites that currently provide rainfall measurements are available only at coarser resolution, which limits rainfall threshold-based landslide initiation mapping in ungauged catchments.
• separating landslide initiation and deposition areas is challenging even with m resolution planet images (wang et al., ).
• estimation of income distribution from remote sensing data still remains a challenge in understanding the quality of life.
• population estimation using remote sensing data without ground measurement remains a difficult task.
• sustainable transportation mapping and analysis in developing countries is greatly affected by the availability, cost, licensing, and access to high-resolution real-time imagery and image processing software.
• effective communication between remote sensing experts and decision makers on the effective use of remote sensing for human welfare issues is lacking in most developing countries.
this paper reviewed how rs technologies have been used to support several aspects of sustainable development, including ( ) natural resource monitoring, development, and management; ( ) environmental assessments and hazard monitoring; and ( ) socioeconomic development. rs has several advantages, including the ability to provide global-scale coverage, high-resolution data, and multi-temporal, multi-spatial coverage with optical, sar, thermal, and lidar sensors. it provides large volumes of data, and recent developments in ml algorithms can handle such large volumes of geospatial data to extract beneficial information. here, we discussed the use of rs for sustaining the earth and human life. with the development of new and improved satellite and airborne sensors, data with increasingly higher spatial, spectral, and/or temporal resolution will become available for researchers and decision-makers in many areas of sustainable development. accordingly, the united nations highlighted rs as an indispensable tool for achieving its sustainable development goals (sdgs). rs can be used not only to develop comprehensive policies promoting sustainable development, but also for effective implementation, monitoring, and decision-making. however, for rs to be effective and reliable, adequate information has to be obtained from other sources. in particular, the development of new spectral indices based on improved sensor technology is key for achieving sustainable development goals. spatial data from rs and other sources can be integrated using gis, among other spatial-integration tools, to analyze global environmental processes and change. during the covid- global crisis, the contribution of remote sensing data has been widely discussed in a wide variety of applications, including monitoring water and air pollution, management of the threat, monitoring traffic patterns, measuring human and economic activities, and assessing socio-economic restrictions. several new studies and applications of remote sensing emerged during the pandemic and are becoming significant case studies for sustainability applications. for developing countries, however, obtaining rs data for research and development purposes is difficult; thus, it is not efficient to use rs technology to support sustainable development in such countries. as countries strive to achieve the sdgs, more data acquisition platforms should be created and made available to researchers in developing nations to enable them to actively use rs data to support national, regional, and global sustainable development. rs techniques are still not widely employed in developing countries, which are more vulnerable to natural hazards.
there may also be conflicts of interests in terms of security and privacy between governments and other entities associated with rs use. additional collaborations between policy think tanks, decision-making bodies in developing countries, and countries or organizations with ready access to gis resources are needed. remote sensing for sustainable forest management detecting areas of high-potential gold mineralization using aster data a narrow-waveband spectral index that tracks diurnal changes in photosynthetic efficiency. remote sensing of environment mapping of groundwater potential zones in the musi basin using remote sensing data and gis using global positioning system techniques in landslide monitoring wide dynamic range vegetation index for remote quantification of biophysical characteristics of vegetation nondestructive estimation of anthocyanins and chlorophylls in anthocyanic leaves use of a green channel in remote sensing of global vegetation from eos-modis. remote sensing of environment large-scale monitoring of snow cover and runoff simulation in himalayan river basins using remote sensing bwe introductory digital image processing: a remote sensing perspective introductory digital image processing: a remote sensing perspective remote sensing of urban/suburban infrastructure and socio-economic attributes. photogrammetric engineering and remote sensing groundwater management and development by integrated remote sensing and geographic information systems: prospects and constraints development of a two-band enhanced vegetation index without a blue band derivation of leaf-area index from quality of light on the forest seasonal variation of colored dissolved organic matter in barataria bay, louisiana, using combined landsat and field data monitoring of the remote sensing assessment of the capability of remote sensing and gis techniques for monitoring reclamation success in coal mine degraded lands afri -aerosol free vegetation index. remote sensing of environment atmospherically resistant vegetation index (arvi) for the tasseled cap-a graphic description of the spectral-temporal development of agricultural crops as seen by landsat. symposium on machine processing of remotely sensed data from space to species: ecological applications for remote sensing remotely sensed habitat diversity predicts butterfly species richness and community similarity in canada remote sensing and geophysical investigations of moghra lake in the qattara depression the use of high spectral resolution bands for estimating absorbed photosynthetically active radiation vegetation effects on soil moisture estimation. igarss . hyperspectral remote sensing of evaporate minerals and associated sediments in lake magadi area quantifying and mapping biodiversity and ecosystem services: utility of a multi-season ndvi mahalanobis distance surrogate. 
remote sensing of environment machine learning in geosciences and remote sensing application of a weights-of-evidence method and gis to regional groundwater productivity potential mapping vehicle detection in very high resolution satellite images of city areas using landsat etm+ imagery to measure population density in photogrammetric engineering & remote sensing a feedback based modification of the ndvi to minimize canopy background and atmospheric noise remote sensing population density and image texture integration of landsat thematic mapper and census data for quality of life assessment estimation of canopy-average surface-specific leaf area using landsat tm data the use of global positioning system techniques for the continuous monitoring of landslides: application to the geomorphology, regional and global trends in sulfate aerosol since the s global earthquake casualties due to secondary effects: a quantitative analysis for improving rapid loss analyses mapping snags and understory shrubs for a lidar-based assessment of wildlife habitat suitability. remote sensing of environment remote sensing of transportation flows: consortium paper presentation. remote sensing and forest inventory for wildlife habitat assessment the use of the normalized difference water index (ndwi) in the delineation of open water features using thematic mapper applications of geographic information systems, remote-sensing, and a landscape ecology approach to biodiversity conservation in the assessment of economic flood damage: nat machine learning methods for landslide susceptibility studies: a comparative overview of algorithm performance human population growth and global land-use/cover change ecosystems and human well-being effects of multi-dike protection systems on surface water quality in the vietnamese mekong delta quantifying cyanobacterial phycocyanin concentration in turbid productive waters: a quasi-analytical approach intercomparison of satellite remote sensing-based flood inundation mapping environmental impact assessment with remote sensing at isahaya land reclamation site remote sensing (aars) evaluation of forest fire on madeira island using sentinel- a msi imagery forest degradation assessment in the upper catchment of the river tons using remote sensing and gis comparison of modis gross primary production estimates for forests across the usa with those generated by a simple process model, -pgs detecting slope and urban potential unstable areas by means of multi-platform remote sensing techniques: the volterra (italy) case study high-resolution mapping of global surface water and its long-term changes global patterns of loss of life from landslides using the satellite-derived ndvi to assess ecological responses to environmental change gemi: a non-linear index to monitor global vegetation from satellites oil spill detection in glint-contaminated near-infrared modis imagery detection of hydrothermal alteration zones in a tropical region using satellite remote sensing data exploratory analysis of suburban land cover and population density in the usa a modified soil adjusted vegetation index environmental impact assessment of land use planing around the leased limestone mine using remote sensing techniques distinguishing vegetation from soil photogrammetric engineering and remote sensing poverty mapping in uganda: an analysis using remotely sensed and other environmental data water feature extraction and change detection using multitemporal landsat imagery. 
remote sensing optimization of soil-adjusted vegetation indices. remote sensing of environment synthetic aperture radar interferometry radar-driven high-resolution hydro-meteorological forecasts of the technology satellite- symposium: the proceedings of a symposium held by modeling distribution of amazonian tree species and diversity using remote sensing measurements remote sensing for mineral exploration present use and future perspectives of remote sensing in hydrology and water management vehicle detection in -m resolution satellite and airborne imagery evaluating the ability of npp-viirs nighttime light data to estimate the gross domestic product and the electric power consumption of china at multiple scales: a comparison with dmsp-ols data application of gwqi to assess effect of land use change on groundwater quality in lower shiwaliks of punjab: remote sensing and gis based approach predicting spatial and decadal lulc changes through cellular automata markov chain models using earth observation datasets and geo-information landscape transform and spatial metrics for mapping spatiotemporal land cover dynamics using earth observation data-sets earth observation for landslide assessment multitarget detection/tracking for monostatic ground penetrating radar: application to pavement profiling a comparison of nighttime satellite imagery and population density for the continental united states. photogrammetric engineering and remote sensing using indigenous knowledge, remote sensing and gis for sustainable development integrating remote sensing, geographic information systems and global positioning system techniques with hydrological modeling life expectancy in regional variations and spatial clustering remote sensing for biodiversity science and conservation chris a. hecker, remote sensing: a review remote sensing for natural disaster management. international archives of photogrammetry and remote sensing identification of mineral components in tropical soils using reflectance spectroscopy and advanced spaceborne thermal emission and reflection radiometer (aster) data. remote sensing of environment evaluation of a temporal fire risk index in mediterranean forests from noaa thermal ir. remote sensing of environment solar and infrared radiation measurements evaluating modis data for mapping wildlife habitat distribution. remote sensing of environment nmdi: a normalized multi-band drought index for monitoring soil and vegetation moisture with satellite remote sensing coseismic landslides triggered by the hokkaido, japan (m w . ), earthquake: spatial distribution, controlling factors, and possible failure mechanism improving predictions of forest growth using the -pgs model with observations made by remote sensing integrated modelling of population, employment and land-use change with a multiple activity-based variable grid cellular automaton selecting and conserving lands for biodiversity: the role of remote sensing satellite-based emergency mapping: landslides triggered by the nepal earthquake evaluating environmental influences of zoning in urban ecosystems with remote sensing consumer versus resource control of species diversity and ecosystem functioning satellite-based modeling of gross primary production in an evergreen needleleaf forest. 
remote sensing of environment satellite remote-sensing technologies used in forest fire management remote sensing imagery in vegetation mapping: a review modification of normalised difference water index (ndwi) to enhance open water features in remotely sensed imagery the role of satellite remote sensing in climate change studies the role of satellite remote sensing in climate change studies an integrated remote sensing and gis approach in the monitoring and evaluation of rapid urban growth for sustainable development in the vehicle detection in remote sensing imagery based on salient information and local shape feature fire detection using infrared images for uav-based forest fire surveillance aircraft systems (icuas) covid- and surface water quality: improved lake water quality during the lockdown sub-pixel mineral mapping of a porphyry copper belt using eo- hyperion data influence of lidar, landsat imagery, disturbance history, plot location accuracy, and plot size on accuracy of imputation maps of forest composition and structure use of normalized difference built-up index in automatically mapping urban areas from tm imagery. international journal of remote sensing application of an empirical neural network to surface water quality estimation in the gulf of finland using combined optical data and microwave data survey and analysis of land satellite remote sensing applied in highway transportations infrastructure and system engineering key: cord- -d qwjui authors: helmy, mohamed; smith, derek; selvarajoo, kumar title: systems biology approaches integrated with artificial intelligence for optimized food-focused metabolic engineering date: - - journal: metab eng commun doi: . /j.mec. .e sha: doc_id: cord_uid: d qwjui metabolic engineering aims to maximize the production of bio-economically important substances (compounds, enzymes, or other proteins) through the optimization of the genetics, cellular processes and growth conditions of microorganisms. this requires detailed understanding of underlying metabolic pathways involved in the production of the targeted substances, and how the cellular processes or growth conditions are regulated by the engineering. to achieve this goal, a large system of experimental techniques, compound libraries, computational methods and data resources, including the multi-omics data, are used. the recent advent of multi-omics systems biology approaches significantly impacted the field by opening new avenues to perform dynamic and large-scale analyses that deepen our knowledge on the manipulations. however, with the enormous transcriptomics, proteomics and metabolomics available, it is a daunting task to integrate the data for a more holistic understanding. novel data mining and analytics approaches, including artificial intelligence (ai), can provide breakthroughs where traditional low-throughput experiment-alone methods cannot easily achieve. here, we review the latest attempts of combining systems biology and ai in metabolic engineering research, and highlight how this alliance can help overcome the current challenges facing industrial biotechnology, especially for food-related substances and compounds using microorganisms. with the growing population of our planet, food security remains a major challenge facing mankind. this is especially true for countries that do not possess large land spaces for agriculture, such as those in the middle east (deserts), japan (mostly mountainous), and singapore (land scarce). 
moreover, nature conservationists are mostly against the clearing of wild flora and fauna to feed the world. thus, looking at the long term, food security can become a pressing issue for many nations. the rome declaration ( ) by the food and agriculture organization (fao) defines food security as "food security, [is achieved] when all people, at all times, have physical and economic access to sufficient, safe and nutritious food to meet their dietary needs and food preferences for an active and healthy life" [ ] . on the other hand, there is also a growing awareness of healthy diets, as diet is considered to be the most significant risk factor affecting general health and causing disease, disability or premature death. the trending diets are mostly focused on eating habits that are nutritious, help with losing weight, and avoid processed foods, especially those with preservatives, or foods with artificial ingredients such as artificial flavours or colours [ ] . these include plant-based diets (such as vegan) and low-calorie fat-burning diets (such as ketogenic) [ , ] . thus, the challenge is not only to produce enough food but also food that is safe, nutritious and appealing to the customer's preference. food security has become even more important during the ongoing covid- pandemic, when countries have largely closed their borders, affecting the food import-export trade [ ] .
there are several types of modeling approaches today, which can be largely grouped into i) parametric approaches, such as dynamic modeling using ordinary differential equations [ ] , and ii) non-parametric models using boolean logic, stoichiometric matrices and bayesian inference algorithms [ , ] . a dynamic model built using differential equations constructs an organism's metabolism step by step using known biochemical reactions and reaction kinetics from genomic, enzymatic and biochemical information derived from experiment ( figure a ). using this information, the models are used to predict metabolic outcomes for different in silico perturbations, or to understand the key regulatory mechanisms (such as bottlenecks) and flux distributions for a given perturbation [ , ] . in other words, the dynamic models utilize a priori knowledge of metabolic pathways, enzymatic mechanisms and temporal experimental data to simulate the concentrations of metabolites over time. these models are usually referred to as kinetic models [ ] . although kinetic models have been widely used and have proven their benefits [ ] , for large-scale modeling, such as genome-scale modeling, it is a daunting challenge to use dynamic modeling due to the absence of large-scale, experimentally measured and reliable kinetics [ ] . to overcome this major challenge, as a trade-off, scientists use other types of modeling such as the parameter-less stoichiometric constraint-based modeling approaches. constraint-based models have constraints for each decision that represent the minimum and maximum values of the decision (e.g. the minimum and maximum reaction rates) [ ] . a widely used constraint-based modeling approach is flux balance analysis (fba) [ ] . fba models thousands of metabolites and reactions with reasonable computational cost and prediction outcomes ( figure b ).
figure . schematic representation of different modeling approaches used in metabolic engineering. a) mathematical modeling of metabolic pathways. b) flux balance analysis (fba) modeling. c) steps of promoter-strength modeling using statistical models and mutation data. d) ensemble modeling.
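to make two of the modeling families in the figure more tangible, here is first a minimal kinetic (ode) sketch of a toy two-step pathway, substrate to intermediate to product, with michaelis-menten rate laws; the parameter values are arbitrary illustrations, not measured kinetics.

# toy kinetic model: s --v1--> i --v2--> p with michaelis-menten kinetics
# vmax and km values are arbitrary placeholders, not experimentally measured parameters
import numpy as np
from scipy.integrate import solve_ivp

VMAX1, KM1 = 1.0, 0.5    # enzyme 1
VMAX2, KM2 = 0.6, 0.3    # enzyme 2

def pathway(t, y):
    s, i, p = y
    v1 = VMAX1 * s / (KM1 + s)
    v2 = VMAX2 * i / (KM2 + i)
    return [-v1, v1 - v2, v2]

sol = solve_ivp(pathway, t_span=(0, 30), y0=[5.0, 0.0, 0.0], t_eval=np.linspace(0, 30, 7))
for t, (s, i, p) in zip(sol.t, sol.y.T):
    print(f"t={t:5.1f}  s={s:5.2f}  i={i:5.2f}  p={p:5.2f}")

and here is a correspondingly small flux balance analysis sketch: a three-reaction toy network is written as a stoichiometric matrix, steady state is imposed as an equality constraint, and the product flux is maximized with a linear programming solver. real genome-scale fba is normally done with dedicated constraint-based (cobra-style) packages rather than raw linear programming; this is only meant to show the shape of the problem.

# toy fba: maximize product secretion v3 subject to steady state (S @ v = 0) and flux bounds
# reactions: r1 uptake -> a, r2 a -> b, r3 b -> product (exported)
import numpy as np
from scipy.optimize import linprog

S = np.array([
    [1, -1,  0],   # metabolite a: produced by r1, consumed by r2
    [0,  1, -1],   # metabolite b: produced by r2, consumed by r3
])

c = [0, 0, -1]                             # linprog minimizes, so minimize -v3 to maximize v3
bounds = [(0, 10), (0, 1000), (0, 1000)]   # uptake r1 capped at 10 (the medium constraint)

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print("optimal fluxes v1, v2, v3:", res.x)  # expected: all equal to the uptake limit, 10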
although numerous works have used metabolic regulation to control the production of targeted metabolites, recent works indicate that transcriptional and translational control can provide significant fold increases in the intended yield output [ , ] . transcriptional control changes the way the gene of interest is regulated by manipulating its promoter region. this includes modifications such as mutating the ribosomal binding sites (rbs) or the transcription factor binding sites (tfbs), designing and inserting short sequences (e.g. new binding sites), or designing an artificial promoter region [ ] . transcriptional control requires a deep understanding of how the gene of interest is regulated (activators, enhancers and suppressors) as well as knowledge of its genomic structure around the binding sites (such as nucleosome positions) [ ] ( figure c ). thus, modeling transcriptional control remains a challenge, as it requires complex data involving quantitative gene expression under each mutation condition to train a model that simulates the effect of each mutation and then use it to predict the impact of new mutations. nevertheless, statistical approaches such as position weight matrix (pwm) modeling, which measures or scores aligned sequences that are likely functionally related, have shown promise for understanding the mutational impact on transcriptional regulation in mammalian disease cells [ , ] . such methods could be explored in the future for controlling transcriptional efficiency for metabolic engineering outcomes.
in ensemble modeling ( figure d ), perturbation-response data play a crucial role in the development of the ensemble predictions, thereby reducing the number of models to a smaller set [ ] . an example of ensemble modeling was performed for two non-native central pathways for carbon conservation, the non-oxidative glycolysis (nog) and the reverse glyoxylate cycle (rgc) pathways, using ensemble modeling robustness analysis (emra). emra successfully determined the probability of system failure and identified possible targets for flux improvement [ ] . in another study, ensemble modeling was used to help in developing an l-lysine-producing strain of e. coli [ ] . nevertheless, ensemble modeling comes with some major challenges: building an ensemble with different modeling algorithms is more difficult than using any standard modeling strategy, the requirement for perturbation-response data makes it similar to many other data-dependent modeling strategies that perform poorly in the absence of reliable data, and its overall results are difficult to interpret. these limitations hinder the utility of this powerful modeling approach.
another widely used modeling approach for metabolic engineering is in silico three-dimensional ( d) molecular modeling for the study of receptor/enzyme-ligand docking and protein homology design [ ] . it has a wide range of applications in drug design and metabolism research, therapeutic antibody design and molecular interaction research (protein-protein and protein-dna interactions). in metabolic engineering, d modeling is used to design and simulate engineered enzymes that are indispensable for the optimization of the microorganism's metabolism [ ] . in protein engineering, where no structural data are available, molecular modelling is used to model the d structures of enzymes and, coupled with enzyme-substrate docking studies, can be used to target regions of interest to improve various attributes, such as specificity, activity and stability under a given environment.
this has been used to great effect for single enzymes as in vitro industrial biocatalysts (e.g. sitagliptin [ ]), as well as for entire enzyme cascades (e.g. islatravir [ ]), for the production of active pharmaceutical ingredients. dynamic modeling strategies, as mentioned above, often depend on the parameters used to build the model. the parameters (such as reaction kinetics or flux ranges) can be determined using bottom-up or top-down approaches [ ]. the bottom-up approach is highly dependent on experiments (such as in vitro enzymatic assays), since it requires information on the reaction kinetics of each enzyme, which is highly challenging to determine for all the enzymes in a pathway or network. furthermore, even if information is obtained from in vitro experiments, the data are often several orders of magnitude different from actual in vivo measurements [ ]. moreover, modeling usually requires data (kinetics or flux rates) for multiple conditions or time points to train the model and test its accuracy or applicability, which requires iterative experimental work [ ]. despite the fact that bottom-up modeling approaches often use optimization algorithms to estimate the model parameters, such as the genetic algorithm, the complex and non-linear nature of the relationships between metabolites limits the usefulness of model fitting algorithms [ , ]. another limitation is the scale of the model. since the bottom-up approach requires detailed experimental measurements, it is more suitable for small-scale models. extending the model size requires either more experiments (higher cost and longer time) or more reliance on computational estimation of the parameter values (lower accuracy). thus, an accurate dynamic model based on a bottom-up approach is difficult to establish due to the extended level of uncertainty in the kinetic properties of the enzymes and their reactions [ ]. ensemble modeling helps in building large-scale models; however, it also suffers from major limitations, as mentioned earlier. on the other hand, top-down approaches utilize time-series metabolomic data to indirectly infer the kinetics, flux rates or concentrations of metabolites through the establishment of correlation and causation networks between metabolites [ ]. the causation network establishes the cause-effect relationships between the metabolites in the network and is usually built using time-series metabolomic data, while the correlation network uses mathematical and statistical methods to determine the probable relations between the enzymes and metabolites in the network [ ]. the top-down approach has shown notable success in analyzing cellular pathways with simple linear response or mass-action kinetic models with little parameter sensitivity [ , ]. comparative 3d protein modelling is most commonly performed using template-based methods, where homologous protein structures are used to generate models using stand-alone programs such as modeller [ ] or online servers such as robetta, which incorporates the rosettacm method [ ], hhpred [ ], and i-tasser [ ]. these methods produce useful models where good templates are available, but many protein sequences of interest have limited template information, so poor-quality models are common, which hinders their practical application in guiding protein engineering work.
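as a small illustration of the correlation-network component of a top-down approach described above, the sketch below computes pairwise correlations between metabolite time courses and keeps edges above a threshold; the data are invented, and real analyses would use measured time-series metabolomics with more careful statistics.

```python
# toy top-down correlation network from time-series metabolite data (illustrative only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 50)
data = pd.DataFrame({
    "met_a": np.sin(t) + 0.1 * rng.normal(size=t.size),
    "met_b": np.sin(t + 0.2) + 0.1 * rng.normal(size=t.size),   # tracks met_a
    "met_c": rng.normal(size=t.size),                            # unrelated
})

corr = data.corr()                      # pearson correlation between metabolites
threshold = 0.7
edges = [(i, j, round(corr.loc[i, j], 2))
         for i in corr.index for j in corr.columns
         if i < j and abs(corr.loc[i, j]) >= threshold]
print(edges)                            # expected: a single strong a-b edge
```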
most of the above-mentioned modeling strategies require the availability of sufficient, high-quality experimental data. such data include metabolite concentrations and their chemical structures, properties, pathways, reaction rates, genomic sequences, genome annotations, transcriptome sequences, gene expression data and many other types of data, as required for the respective modeling strategy. fortunately, a large number of bioinformatics databases and servers are now freely available with most of these data. many of them are meta-databases that collect and aggregate data from multiple sources, such as kegg, pathway commons and metacyc [ ] [ ] [ ]. despite the benefits of these bioinformatics resources, the challenge lies in finding the correct dataset and the modeling/analytical approaches to take advantage of this wealth of data. this, therefore, raises the need for novel data mining and data analytics approaches, such as artificial intelligence (ai). artificial intelligence (ai) provides computers the ability to make decisions by analyzing data independently, following predetermined rules or pattern recognition models. since its introduction, ai has become a hot research area after proving useful in solving several challenges across many fields [ ]. ai and many of its modern techniques, such as machine learning (ml), contribute significantly to things we use in our daily life: from the voice recognition we use when interacting with smart devices, to the algorithms that decide the content we see on our social media, to the modern-day autonomous cars that will soon be cruising our streets. ai can now read, write, listen, respond to questions, play games and even engage in conversations [ ]. it is also playing a significant role in science, technology and research. in the biomedical and biotechnology fields in particular, ai is heavily employed in addressing certain research challenges while being under-utilized in other aspects. the drug and vaccine discovery fields, for instance, are employing ai to address the challenges of developing new drugs, repurposing existing drugs, understanding drug mechanisms, designing and optimizing clinical trials and identifying biomarkers [ ]. recent surveys show that a large number of pharma companies and startups are employing ai in different aspects of drug discovery [ , ]. this has resulted in the development of over one hundred drugs that are in different development phases in the fields of oncology, neurology and infectious diseases [ ]. furthermore, research on covid-19 drug and vaccine development is employing ai, and this has resulted in dozens of promising drug lead compounds and vaccines in such a short period of time [ , ]. ai is also employed in the fields of genomics, protein-protein interaction prediction, signaling pathway prediction and analysis, protein-dna binding, cancer diagnosis, and genomic mutation variant calling, among several other applications [ ] [ ] [ ] [ ] [ ]. on the other hand, ai is not similarly utilized in the fields of metabolomics and metabolic engineering, especially for food applications. although the idea of combining systems biology and ai (machine learning in particular) to study metabolism is relatively old [ ], its applications are still underexplored.
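as a practical note on the meta-databases mentioned above, kegg exposes a public rest interface; the short sketch below pulls a pathway list and one record. the endpoint layout shown here follows kegg's documentation at the time of writing and may change, so this is only a hedged example rather than a guaranteed recipe.

```python
# minimal example of pulling pathway information from the kegg rest api
# (endpoint layout as documented by kegg; subject to change).
import requests

BASE = "https://rest.kegg.jp"

# list metabolic pathways annotated for e. coli k-12 (organism code "eco")
resp = requests.get(f"{BASE}/list/pathway/eco", timeout=30)
resp.raise_for_status()
pathways = [line.split("\t") for line in resp.text.strip().splitlines()]
print(len(pathways), "pathways;", pathways[0])

# fetch the flat-file record for the first pathway entry
entry_id = pathways[0][0]            # e.g. "eco00010" (older releases prefix "path:")
record = requests.get(f"{BASE}/get/{entry_id}", timeout=30).text
print(record.splitlines()[0])        # first line of the kegg record
```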
machine learning (ml) is the field of ai concerned with developing computer programs that learn and improve their performance automatically based on experience, without being explicitly programmed [ ]. in the last few years, ml research and techniques have improved as large datasets generated by modern analytical lab instruments have become available. therefore, recent reports show ml-based research in identifying weight-loss biomarkers [ ], the discovery of food identity markers [ ], farm animal metabolism [ ] and many other applications in untargeted metabolomics [ , ]. in metabolic engineering, several areas are starting to take advantage of ml and systems biology integration, including pathway identification and analysis, modeling of metabolism and growth, and 3d protein modeling (figure ). pathway identification and analysis is crucial for metabolic engineering. it is common that the biochemical pathway of a targeted substance (e.g. enzyme or compound) is unknown or poorly studied. furthermore, in many cases, the gene(s) or gene cluster responsible for producing the targeted substance needs to be transferred to a model organism so that it can be easily manipulated and optimized [ ]. as mentioned above, the different modeling techniques have their limitations, and even when combining omics data and using standard data analysis approaches for pathways, the final predictions come with uncertainty [ ]. ml can be utilized to identify the pathways upstream of the substance. for instance, an ml model using naive bayes, decision trees, logistic regression and pathway information from many organisms was used in metacyc to predict the presence of a novel metabolic pathway in a newly sequenced organism. the analysis of the model performance showed that most of the information about the presence of a pathway in an organism is contained in a small set of the features used; mainly, the number of reactions along the path from input to output compound was the most informative feature [ ]. in general, the ml models used for pathway prediction showed better performance than standard mathematical and statistical methods [ ]. nevertheless, pathway discovery still relies heavily on traditional approaches such as gene sequence similarity and network analysis, so better ml algorithms and methods for pathway discovery are needed. ml can also be invaluable for the identification of important genes or enzymes in the pathways of interest. ml classifiers, such as support vector machines, logistic regression and decision tree-based models, have been instrumental in predicting gene essentiality within metabolic pathways through training and testing models on labeled data of essential and non-essential genes [ ]. ml was also used to find new drug targets by determining the essential enzymes in a metabolic network, characterizing each enzyme by its local network topology, co-expression, gene homologies and flux balance analyses [ ]. plaimas et al used an ml model trained to distinguish between essential and non-essential reactions, followed by experimental validation using the phenotypic outcome of single-knockout mutants of e. coli (keio collection) [ ]. in an earlier study, the side effects of drugs on the metabolic network were investigated by predicting enzyme inhibitory effects through building an ml model.
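a schematic illustration of the essentiality classifiers described above, using synthetic network-derived features rather than any published dataset or model:

```python
# toy gene-essentiality classifier on synthetic network-derived features (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 500
# hypothetical features: node degree, co-expression score, flux-coupling indicator
X = np.column_stack([
    rng.poisson(4, n),            # degree in the metabolic network
    rng.uniform(0, 1, n),         # co-expression with neighbours
    rng.integers(0, 2, n),        # flux-coupled to biomass (0/1)
])
# synthetic labels loosely driven by degree and flux coupling
y = (0.3 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(0, 1, n) > 2.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out auc:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 2))
```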
the model used network topology and the functional classes of inhibitors and enzymes as background knowledge, with a logic-based representation and a combination of abduction and induction methods, to predict drug inhibitory side effects [ ]. newly sequenced genomes undergo two types of annotation: structural annotation and functional annotation. structural annotation is the process of identifying the genome components and their structures (e.g. identifying genes, their exons, introns and utrs, or their regulatory regions), while functional annotation identifies the functions of the genes and their products. both types of annotation are important for metabolic engineering research: structural annotation identifies the genes, their sequences, length and structure and, therefore, helps in finding alternative organisms where the same gene, pathways or gene clusters exist, while functional annotation helps in identifying organisms that produce the same substance or tolerate the same growth conditions. comparative genomics, network biology and traditional bioinformatics methods, such as sequence alignment, are usually utilized in this process [ , ]. the rapid advancements in genome sequencing technologies, and the significant drop in their cost over the last decade, raised the need for fast and accurate annotation methods [ ]. this resulted in the development of several new annotation methods that analyse newly sequenced genomes from different sequencing platforms and that addressed many of the challenges; however, many other challenges remain, such as missing short genes and erroneous annotation of exon starts and ends [ , ]. thus, several other methods were introduced with the idea of combining multi-omics data in the genome annotation process, in particular proteomic and transcriptomic data [ ] [ ] [ ] [ ]. despite these efforts, a large proportion of the sequenced genomes in the genomes online database (gold) are still awaiting annotation [ ]. the high-volume and multi-dimensional nature of genome sequencing data makes it very suitable for applications of machine learning algorithms [ ]. an ml model is trained on annotated genomes to identify genome structures, e.g. genes or regulatory regions, using their features, and is then used to identify the same structures in newly sequenced genomes [ ]. yip et al developed deepannotator, an annotation tool that outperformed the ncbi annotation pipeline in rna gene annotation [ ]. the new versions of the annotation tool genemarks for annotating prokaryotic genomes (genemarks+) and the eukaryotic self-training gene finder (genemark-ep+) both utilize ml algorithms in the annotation process [ , ]. deep convolutional neural networks were used to annotate gene-start sites in different species by training the model using the sites from one species as the positive sample and random sequences from the same species as the negative sample; the model was able to identify gene-start sites in other species [ ]. although the idea of employing ml in functional annotation started relatively early, it is still underutilized in functional annotation compared to structural annotation. an early attempt at using ml in gene functional annotation from the biomedical literature utilized hierarchical text categorization (htc) [ ], while tetko et al provided high-quality curated functional annotation data as a benchmark dataset for developers of ml-based functional annotation methods for bacterial genomes [ ].
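to make the gene-start-site idea above concrete, below is a minimal, hypothetical sketch of a 1d convolutional classifier over one-hot-encoded dna windows; it assumes tensorflow/keras is available, the data are randomly generated placeholders, and the architecture is illustrative rather than the one used in the cited work.

```python
# toy 1d-cnn for classifying fixed-length dna windows as gene-start vs background.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(2)
n, win = 1000, 60
bases = rng.integers(0, 4, size=(n, win))            # random placeholder sequences
X = np.eye(4, dtype="float32")[bases]                 # one-hot encode a/c/g/t -> (n, win, 4)
y = rng.integers(0, 2, size=n)                        # placeholder labels

model = keras.Sequential([
    layers.Input(shape=(win, 4)),
    layers.Conv1D(32, kernel_size=9, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))                # [loss, accuracy] on the toy data
```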
recent reports show applications of ml-based methods in a wide variety of functional annotation tasks, such as the discovery of missing or wrong protein function annotations [ ], predicting gene functions in plants [ ], controlling the false discovery rate (fdr) while increasing the accuracy of protein functional predictions [ ], and genome-wide functional annotation of splice variants in eukaryotes [ ]. the advancements of omics technologies have resulted in a huge accumulation of data (genomics, transcriptomics, proteomics and metabolomics) that is estimated to grow to astronomical scales [ ]. this enormous amount of data has shifted scientific research more towards data-driven approaches such as ml [ ]. combining ml methods with omics data is a typical systems biology approach to addressing several biomedical challenges. an ml approach was used to replace traditional kinetic models in estimating metabolite concentrations over time by combining ml models with proteomic and metabolomic time-series data [ ]. also, proteomic and metabolomic data of yeast were combined under several perturbation conditions (kinase knockouts), and ml was used to predict the yeast metabolome from the enzyme expression proteome of each kinase-deficient condition. the ml model quantifies the role of enzyme abundance by mapping the regulatory enzyme expression patterns and then utilizing them to predict the metabolome under the knockout condition [ ]. the availability of transcriptome data and the ability of ml methods to deal with big data led to the development of several genome-scale methods to predict phenotype using ml models. to take advantage of the accumulated transcriptome data, a biology-guided deep learning system named deepmetabolism was developed [ ]. deepmetabolism uses transcriptomics data to predict cell phenotypes. it integrates unsupervised pre-training with supervised training to predict the phenotype with high accuracy and high speed. on the other hand, jervis et al implemented an ml algorithm to model the sequence-phenotype relationship of bacterial ribosome binding sites (rbss) and accurately predicted the optimal high-producers, an approach that applies directly to a wide range of metabolic engineering applications [ ]. despite the progress in applying ml techniques in metabolic research, ml is still far from being fully utilized in some important aspects of metabolic engineering, especially in metabolic pathway identification and analysis, and bioprocess optimization for food-based research and industries. in the field of 3d protein modeling, several ai-based advances are also notable. the most recent critical assessment of protein structure prediction (casp) meeting saw ai methods come of age. the program alphafold [ ] used a neural net to extract covariant residue pairs from sequence alignments, coupled with estimated distances between them, and then used the rosetta energy function [ ] to fold the protein based on these ai-derived restraints. alphafold performed exceptionally well in the competition, giving high-accuracy models (template-modelling scores of . or higher) for substantially more domains than the next best method. this has been developed into a lab-based version called prospr [ ]. yang et al used a similar protocol, but with added estimation of relative residue orientations, resulting in trrosetta [ ], which improved predictions still further.
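returning to the proteome-to-metabolome prediction described above, here is a hedged, schematic version on synthetic data (not the published yeast analysis): a random-forest regressor is fitted from enzyme abundances to a metabolite level and checked on held-out strains.

```python
# toy regression from enzyme expression (proteome) to a metabolite level (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
n_strains, n_enzymes = 120, 40                     # e.g. one row per knockout strain
proteome = rng.lognormal(mean=0, sigma=0.5, size=(n_strains, n_enzymes))
# synthetic "metabolite": driven by two enzymes plus noise
metabolite = 2.0 * proteome[:, 0] - 1.5 * proteome[:, 5] + rng.normal(0, 0.3, n_strains)

X_tr, X_te, y_tr, y_te = train_test_split(proteome, metabolite, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out r2:", round(r2_score(y_te, model.predict(X_te)), 2))
```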
these 3d modelling methods may be implemented into a comprehensive metabolic engineering platform. one area that could be addressed in the improvement of 3d protein modelling methods is the inclusion of cofactors. many enzymes are folded around cofactors: small-to-large organic molecules which form part of the catalytic machinery, such as flavin adenine dinucleotide (fad) or haem. these molecules are often removed in template-based modelling (both manual and automated versions), yet their presence is often important for the correct folding of the enzyme [ ]. this has the effect of lowering the quality of the model due to the removal of key restraints from the structure, requiring extra docking or structure manipulation to reinsert the cofactor after modelling. it should be possible to include the presence of cofactors through a survey of the protein data bank [ ], where ml methods can be used to identify key determinants of cofactor binding, coupled with identification of these determinants within a target sequence and application of a combined sequence-and-template-based optimization protocol inclusive of these structural features. an extension of this might also be used for the identification of substrates for enzymes within a metabolic pathway, or of unnatural substrates, which is particularly valuable for the development of synthetic biosynthetic pathways. one input would be enzyme sequence alignments of known function, as well as structural information for both enzyme families and substrates. a neural network could be used to identify common patterns of binding-pocket residues across multiple families of enzymes for different substrates, and to identify potential sequences that would be suitable for inclusion in a particular metabolic pathway, inclusive of sequence determinants for ease of inclusion into heterologous expression systems. also, if no sequence is available that produces a required product, it might be possible to predict the binding-pocket residues that might be mutated to give that product. predictions made can then be experimentally tested, and the results fed back into the model. in recent years, the importance of harnessing natural and food ingredients from diverse sources, such as engineered microbes or synthetic derivation as highlighted in the introduction section, is increasingly realized. these approaches provide several benefits for producing a more sustainable bio-based economy that relies less on precious land or limited livestock. nevertheless, the bioengineering processes utilized still remain suboptimal, due to the complexity of living systems' emergent behaviors (such as feedback/feedforward inhibition, cofactor imbalances, toxicity of intermediates, and bioreactor heterogeneity) that tend to reduce the overall effect of any internal modification such as adding or engineering a metabolic pathway [ , ]. thus, achieving economically viable large-scale production of microbial-derived metabolites or compounds requires appropriately optimized production strains that generate high yields. until today, however, metabolic engineering efforts have mainly served to broaden the range, and further reduce the cost, of molecules of commercial interest. to address these issues, brunk et al engineered eight e. coli lab strains that produced three commercially important biofuels: isopentenol, limonene, and bisabolene [ ].
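a purely hypothetical sketch of the ml step proposed above for cofactor-binding determinants: no pdb survey is actually performed here, and the residue descriptors, labels and dataset are all invented; the point is only to show how a classifier could surface which descriptors matter.

```python
# hypothetical classifier for cofactor-binding residues from structural descriptors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n = 400
# invented per-residue descriptors: solvent accessibility, hydrophobicity, conservation
X = np.column_stack([rng.uniform(0, 1, n), rng.normal(0, 1, n), rng.uniform(0, 1, n)])
# synthetic labels loosely driven by conservation and (inversely) accessibility
y = (0.8 * X[:, 2] - 0.5 * X[:, 0] + rng.normal(0, 0.3, n) > 0.2).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(["accessibility", "hydrophobicity", "conservation"], clf.feature_importances_):
    print(f"{name}: {imp:.2f}")      # which descriptors the model relies on
```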
to understand the key regulatory or emergent bottleneck scenarios that limit their industrial applicability, they undertook a large-scale omics-based systems biology approach in which they performed time-series proteomics and metabolomics measurements and analyzed the resulting high-throughput data using statistical analytics and genome-scale modeling. the integrated approach revealed several novel key findings. for example, they elucidated time-dependent regulation of genes, proteins and metabolic pathways related to the tca cycle and pentose-phosphate pathway, and the resultant coupling of the pathways that affected nadph metabolism. these emergent responses were collectively implicated in downregulating the expected biofuel production. the findings subsequently led them to identify a crucial gene (ydbk) whose removal led to a marked increase in the production of isopentenol in one of the e. coli strains [ ]. despite their success on one strain (out of eight), the overall dynamic changes of metabolic pathways at the different stages of growth for all strains were not understood, as they employed a steady-state genome-scale model, which provided a qualitative, rather than quantitative, inference. this, as mentioned earlier (in dynamic modeling), is due to the lack of kinetic parameter values that are required to develop and test a dynamic model for each strain. to overcome this difficulty, costello and martin used the same time-series proteomics and metabolomics data of brunk et al and developed an ml model to effectively predict pathway dynamics in an automated fashion [ ]. their model produced both qualitative and quantitative predictions and, in a side-by-side comparison, outperformed a traditional kinetic model. basically, their ml model derived a mapping function between the proteomics and metabolomics datasets with the aid of regression techniques and neural networks on training data, and finally verified the predictions on test data. apart from better accuracy in the predicted dynamic profiles of the metabolites, the model also did not require a detailed understanding of the regulatory steps, which is a major weakness for all modeling approaches. however, their ml model fell short of predicting effective regulator(s) for enhanced production of any of the biofuels (isopentenol, limonene, and bisabolene), nor was there any experimental verification. although this is a major weakness in current systems metabolic engineering approaches, ml-based modeling nevertheless has the future potential to productively guide the bioengineering of strains without knowing the complete metabolic regulatory processes, which are very challenging to obtain. one interesting and popular class of industrially relevant metabolic engineering products in the food and consumer care industries is the terpenes and terpenoids: secondary metabolites or organic compounds naturally found in diverse living species, especially in plants. due to their high commercial value, numerous studies have focused on producing them or their derivatives at industrial scale using microbes [ ] [ ] [ ] [ ]. although increases of several hundred- or even thousand-fold have been achieved at test tube or flask level by engineering microbes, achieving this in large industrial-scale bioreactors is far from reality.
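in the spirit of the proteomics-to-dynamics mapping described above (a simplified sketch with synthetic data, not the published method), one can train a regressor to predict the instantaneous rate of change of a metabolite from the current protein and metabolite levels and then roll the prediction forward in time:

```python
# sketch: learn d[metabolite]/dt from (protein, metabolite) states, then integrate forward.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
dt = 0.1
t = np.arange(0, 20, dt)
protein = 1.0 + 0.5 * np.sin(0.3 * t)                  # synthetic enzyme time course
metab = np.zeros_like(t)
for i in range(1, t.size):                             # synthetic "true" dynamics
    metab[i] = metab[i - 1] + dt * (protein[i - 1] - 0.2 * metab[i - 1])

# training pairs: current state -> observed derivative
X = np.column_stack([protein[:-1], metab[:-1]])
dydt = np.diff(metab) / dt
model = GradientBoostingRegressor().fit(X, dydt)

# roll the learned model forward from the initial condition (euler steps)
pred = np.zeros_like(metab)
for i in range(1, t.size):
    pred[i] = pred[i - 1] + dt * model.predict([[protein[i - 1], pred[i - 1]]])[0]
print("final true vs predicted:", round(metab[-1], 2), round(pred[-1], 2))
```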
it is our opinion that ml models can help to uncover the relations between output and input more accurately, and to identify sweet spots for carefully targeted interventions that generate the targeted output at bioreactor scale. although there is no current workable evidence for this, we believe the future looks promising on this front, provided large investments are made to generate the biological data that dynamic or ml models require to be effectively predictive. integrating systems biology and ml holds great promise for improving the way we study and understand metabolism, as well as for improving and engineering alternative food sources that are healthier, affordable and nutritious. however, as reviewed in this chapter, this integration faces several challenges and limitations that must be overcome in order to fully utilize the power of both systems biology and ml. a major challenge facing the application of systems biology and ml in food-grade or gras metabolic engineering is the lack of data. systems biology requires high-throughput data from multi-omics levels (genomic, transcriptomic, proteomic and metabolomic), and these data are only available for a small subset of microorganisms in general, and are significantly lacking for food-grade or gras strains in particular. the availability of such data is necessary for a more holistic study of the organism and helps in discovering new pathways or proteins, simpler or shorter directed pathways, or new enzymes with better production rates [ ]. this information will also help in choosing the most appropriate organism for engineering and production projects. usually, certain model organisms called "chassis", such as yeast and e. coli, are used in these projects, and the gene(s) or pathways of the substance of interest are transferred into them from the donor organism. the availability of sufficient information about both the donor organism and the chassis helps in choosing the correct chassis and avoiding unexpected properties, such as resistance to certain conditions or the absence of important pathways [ ]. in addition to the need for large-scale omics data for building ml models, another data problem faces the application of ml in metabolic engineering research. training an ml model for metabolic engineering requires sufficient quantitative data for multiple conditions; the multiple conditions can be multiple knockouts, perturbations or growth conditions. for instance, to build an ml model that predicts the required engineering (e.g. knockouts) to improve promoter strength, we need to train the model using quantitative data on downstream gene expression under multiple knockouts or mutants (a toy version is sketched below). similarly, predictive ml models investigating translational control, transcription factor binding sites, ribosomal binding sites, enzyme engineering (mutation or truncation) and growth optimization require high-quality quantitative data across multiple conditions. the same data can also be used to build different mathematical and statistical models, which allows the development of more integrated methods. however, such data are hard to find online and need to be created for each project. we need more research that focuses on the generation of high-quality quantitative data, and on building online resources, such as meta-databases, that collect and combine these data to make them available to the community. another major challenge in the ml field is what is known as "the black box problem".
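the promoter-strength idea mentioned above can be sketched as follows; this is a hypothetical toy (invented variant sequences and expression values, not a published method) in which promoter variants are one-hot encoded and a ridge regression is fitted against measured expression.

```python
# toy promoter-strength regression from one-hot-encoded variant sequences (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
L, n = 30, 200
seqs = rng.integers(0, 4, size=(n, L))                 # invented promoter variants
X = np.eye(4)[seqs].reshape(n, -1)                      # one-hot, flattened to n x (4*L)
true_w = rng.normal(0, 1, X.shape[1])                   # synthetic position effects
expression = X @ true_w + rng.normal(0, 0.5, n)         # synthetic measured expression

model = Ridge(alpha=1.0).fit(X, expression)
new_variant = np.eye(4)[rng.integers(0, 4, size=L)].reshape(1, -1)
print("predicted strength of new variant:", round(float(model.predict(new_variant)[0]), 2))
```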
the black box problem of ai techniques in general is defined as the difficulty of understanding how they work and how and why they give their results [ ]. this leaves the end user of the technique uncertain about the quality of the output, while the often biologically unfamiliar modeler is unable to intervene to improve the performance; it also raises some legal concerns [ ] [ ] [ ]. for example, in the application of ml to 3d structural modelling, as well as enzyme-substrate identification, the newer ai-based modelling methods are showing some promising results; however, due to the nature of neural nets, it is very difficult to interpret exactly what the programs are learning about the protein-folding problem. we can predict a structure, but without understanding the underlying model for folding. if a way could be found to capture this information, it would be of great use to the community for further study. to address the black box problem, scientists in the field of ai have developed a group of methods called explainable artificial intelligence (xai) that aim to make the results of ai methods understandable to humans. although this field is still new, it holds potential to solve the problems that prevent the systematic performance improvement of ai models [ , ]. although genome annotation, both structural and functional, affects most aspects of biomedical research, it has a special impact on metabolic engineering in general and on applications in the food industry in particular. the food-grade or gras microorganisms are a small subset of all organisms, and many of them are either not well studied or not studied at all. hence, there is a big challenge in using these species in ml-based metabolic engineering, as many of them are either not sequenced, or sequenced with only a draft annotation or no annotation at all. the annotations are usually automated using standard pipelines, which identify the common genes shared with other microorganisms and can miss the organism-specific features that need deeper attention; these features are exactly what make those organisms suitable for metabolic engineering and the food industry. improved ml-based genome annotation methods will help improve the annotation of food-safe and gras genomes, which will directly impact research in this area. another area that needs special attention is pathway prediction in the absence of a genome sequence or genome annotation. since many of the food-safe and gras microorganisms are not yet sequenced, methods that predict the pathways for important substances using different omics data are required. it is now easy to perform whole-proteome or phosphoproteomics, or transcriptomics, in different growth conditions or different life stages of an organism. these omics data can be used, in the absence of a genome sequence, to predict the endogenous or biosynthetic pathways of the substance of interest. ml methods can be used instead of the traditional pathway prediction approaches due to their better suitability to the nature and size of the data. overall, despite the challenges and limitations of ai and ml techniques in dealing with biological datasets, there is no better time than now to explore the full potential of these techniques and to further develop novel methods to overcome the many challenges, including "the black box problem".
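one simple, model-agnostic step towards interpretability, which is only a small part of what xai covers and is sketched here on synthetic data, is permutation importance: it measures how much a model's performance degrades when each input feature is shuffled.

```python
# permutation importance as a basic interpretability check (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(7)
n = 300
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, n) > 0).astype(int)   # features 0 and 2 matter

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=20, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature {i}: importance {score:.3f}")
```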
in parallel, improvements in data collection from omics technologies will, in time, help to narrow the gap of uncertainty or ambiguity for the future integration of systems biology and ml towards optimal metabolic engineering strategies.
references (titles only):
- rome declaration and plan of action
- diets for health: goals and guidelines
- healthy low nitrogen footprint diets
- the ketogenic diet: evidence for optimism but high-quality research needed
- covid-19 risks to global food security
- world bank, food security and covid-19
- metabolic engineering for higher alcohol production
- metabolic engineering of vitamin c production in arabidopsis
- bringing cultured meat to market: technical, socio-political, and regulatory challenges in cellular agriculture
- conceptual evolution and scientific approaches about synthetic meat
- engineered microorganisms for the production of food additives approved by the european union - a systematic analysis
- metabolic engineering and synthetic biology: synergies, future, and challenges
- systematic engineering for high-yield production of viridiflorol and amorphadiene in auxotrophic escherichia coli
- microbial astaxanthin biosynthesis: recent achievements, challenges, and commercialization outlook
- emerging engineering principles for yield improvement in microbial cell design
- advancing metabolic engineering through systems biology of industrial microorganisms
- predicting novel features of toll-like receptor signaling in macrophages
- transcriptome-wide variability in single embryonic development cells
- order parameter in bacterial biofilm adaptive response
- systematic determination of biological network topology: nonintegral connectivity method (nicm)
- a systems biology approach to overcome trail resistance in cancer treatment
- a review of dynamic modeling approaches and their application in computational strain optimization for metabolic engineering
- formulation, construction and analysis of kinetic models of metabolism: a review of modelling frameworks
- bayesian inference of metabolic kinetics from genome-scale multiomics data
- flux analysis and metabolomics for systematic metabolic engineering of microorganisms
- signaling flux redistribution at toll-like receptor pathway junctions
- basic and applied uses of genome-scale metabolic network reconstructions of escherichia coli
- physical laws shape biology
- constraint-based models predict metabolic and associated cellular functions
- what is flux balance analysis?
- metabolic engineering to increase crop yield: from concept to execution
- design of synthetic yeast promoters via tuning of nucleosome architecture
- inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters
- impact of cancer mutational signatures on transcription factor motifs in the human genome
- genome-scale identification of transcription factors that mediate an inflammatory network during breast cellular transformation
- data mining process
- ensemble modeling of metabolic networks
- ensemble modeling for robustness analysis in engineering non-native metabolic pathways
- ensemble modeling for strain development of l-lysine-producing escherichia coli
- structure-based drug design strategies and challenges
- a review of metabolic and science ( -. )
- design of an in vitro biocatalytic cascade for the manufacture of islatravir
- machine learning methods for analysis of metabolic data and metabolic pathway modeling
- can complex cellular processes be governed by simple linear rules?
- constructing kinetic models of metabolism at genome-scales: a review
- in silico approach to characterization and reduction of uncertainty in the kinetic models of genome-scale metabolic networks
- macroscopic law of conservation revealed in the population dynamics of toll-like receptor signaling
- comparative protein modelling by satisfaction of spatial restraints
- high-resolution comparative modeling with rosettacm
- a completely reimplemented mpi bioinformatics toolkit with a new hhpred server at its core
- the i-tasser suite: protein structure and function prediction
- the kegg resource for deciphering the genome
- update: integration, analysis and exploration of pathway data
- the metacyc database of metabolic pathways and enzymes - a update
- siri, in my hand: who's the fairest in the land? on the interpretations, illustrations, and implications of artificial intelligence
- a machine learning approach to predict metabolic pathway dynamics from time-series multiomics data
- pharma companies using artificial intelligence in drug discovery
- startups using artificial intelligence in drug discovery
- drugs in the artificial intelligence in drug discovery pipeline
- covid-19 vaccine tracker | raps
- dozens of coronavirus drugs are in development - what happens next?
- predicting pdz domain mediated protein interactions from structure
- predicting the sequence specificities of dna- and rna-binding proteins by deep learning
- a universal snp and small-indel variant caller using deep neural networks
- deep learning for genomics using janggu
- dermatologist-level classification of skin cancer with deep neural networks
- machine learning predicts the yeast metabolome from the quantitative proteome of kinase knockouts
- understanding machine learning: from theory to algorithms
- combining machine learning and metabolomics to identify weight gain
- discovery of food identity markers by metabolomics and machine learning technology
- metabolomics meets machine learning: longitudinal metabolite profiling in serum of normal versus overconditioned cows and pathway analysis
- machine learning in untargeted metabolomics experiments
- machine learning applications for mass spectrometry-based metabolomics
- transcriptome analysis and gene expression profiling of abortive and developing ovules during fruit development in hazelnut
- next generation models for storage and representation of microbial biological annotation
- an integrative machine learning strategy for improved prediction of essential genes in escherichia coli metabolism using flux-coupled features
- machine learning based analyses on metabolic networks supports high-throughput knockout screens
- application of abductive ilp to learning metabolic network inhibition from temporal data
- comparative genomics approaches to understanding and manipulating plant metabolism
- genome mining of the streptomyces avermitilis genome and development of genome-minimized hosts for heterologous expression of biosynthetic gene clusters
- challenges in the next-generation sequencing field
- whole-genome alignment and comparative annotation
- insect genomes: progress and challenges
- a perfect genome annotation is within reach with the proteomics and genomics alliance
- peptide identification by searching large-scale tandem mass spectra against large databases: bioinformatics methods in proteogenomics
- proteogenomics: from next-generation sequencing (ngs) and mass spectrometry-based proteomics to precision medicine
- combining rna-seq data and homology-based gene prediction for plants, animals and fungi
- genomes online database (gold) v. : data updates and feature enhancements
- machine learning and genome annotation: a match meant to be?
- introduction to machine learning
- new machine learning algorithms for genome annotation
- genome annotation with deep learning
- modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes
- genome annotation across species using deep convolutional neural networks
- functional annotation of genes using hierarchical text categorization
- mips bacterial genomes functional annotation benchmark dataset
- machine learning for discovering missing or wrong protein function annotations
- machine learning: a powerful tool for gene function prediction in plants
- protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning
- genome-wide functional annotation of human protein-coding splice variants using multiple instance learning
- big data: astronomical or genomical?
- deepmetabolism: a deep learning system to predict phenotype from genome sequencing
- machine learning of designed translational control allows predictive pathway optimization in escherichia coli
- improved protein structure prediction using potentials from deep learning
- the rosetta all-atom energy function for macromolecular modeling and design
- prospr: democratized implementation of alphafold protein distance prediction network
- improved protein structure prediction using predicted interresidue orientations
- how do cofactors modulate protein folding?
- the protein data bank
- the future of metabolic engineering and synthetic biology: towards a systematic practice
- cell-free metabolic engineering: recent developments and future prospects
- characterizing strain variation in engineered e. coli using a multi-omics-based workflow
- identifying and engineering the ideal microbial terpenoid production host
- a "plug-n-play" modular metabolic system for the production of apocarotenoids
- use of terpenoids as natural flavouring compounds in food industry, recent patents food
- agrocybe aegerita serves as a gateway for identifying sesquiterpene biosynthetic enzymes in higher fungi
- protein folding and de novo protein design for biotechnological applications
- solving the black box problem: a normative framework for explainable artificial intelligence
- big data: a new empiricism and its epistemic and socio-political consequences
- why should i trust you?: explaining the predictions of any classifier
- how the machine 'thinks': understanding opacity in machine learning algorithms
- explainable artificial intelligence: a survey
acknowledgements. the authors thank simon zhang congqiang for critical comments, and the singapore institute of
key: cord- -ihq gdwj authors: hasell, joe; mathieu, edouard; beltekian, diana; macdonald, bobbie; giattino, charlie; ortiz-ospina, esteban; roser, max; ritchie, hannah title: a cross-country database of covid-19 testing date: - - journal: sci data doi: . /s - - - sha: doc_id: cord_uid: ihq gdwj
our understanding of the evolution of the covid-19 pandemic is built upon data concerning confirmed cases and deaths. this data, however, can only be meaningfully interpreted alongside an accurate understanding of the extent of virus testing in different countries. this new database brings together official data on the extent of pcr testing over time across countries.
we provide a time series for the daily number of tests performed, or people tested, together with metadata describing data quality and comparability issues needed for the interpretation of the time series. the database is updated regularly through a combination of automated scraping and manual collection and verification, and is entirely replicable, with sources provided for each observation. in providing accessible cross-country data on testing output, it aims to facilitate the incorporation of this crucial information into epidemiological studies, as well as to track a key component of countries' responses to covid-19. across the world, researchers and policymakers look to confirmed counts of cases and deaths to understand and compare the spread of the covid-19 pandemic. however, data on cases and deaths can only be meaningfully interpreted alongside an accurate understanding of the extent and allocation of virus testing. two countries reporting similar numbers of confirmed cases may in fact have very different underlying outbreaks: other things being equal, a country that tests less extensively will find fewer cases. many countries now publish official covid-19 testing statistics, but the insights offered by these numbers remain relatively unexplored both in public discourse and scientific research. this may be because of barriers limiting access to this data: the statistics are scattered across many websites and policy documents, in a range of different formats. no international authority has taken on the responsibility for collecting and reporting testing data. we developed a new global database to address this lack of access to reliable testing data, thereby complementing the available international datasets on death and case counts. the database consists of official data on the number of covid-19 diagnostic tests performed over time across countries (as of august). we rely on figures published in official sources, including press releases, government websites, dedicated dashboards, and social media accounts of national authorities. we do not include in our database figures that explicitly relate to only partial geographic coverage of a country (such as a particular region or city). the resulting database is (i) updated regularly through a combination of automated scraping and manual collection and verification, and (ii) entirely replicable, with sources provided for each observation. in addition, the database includes extensive metadata providing detailed descriptions of the data collected for each country. such information is essential due to heterogeneity in reporting practices, most notably regarding the units of measurement (people tested, cases tested, tests performed, samples tested, etc). series also vary in terms of whether tests pending results are included, the time period covered, and the extent to which figures are affected by aggregation across laboratories (private and public) and subnational regions. the comprehensiveness of our database enables comparisons of the extent of testing between countries and over time - in absolute terms, but also relative to countries' population, and to death or confirmed case counts (fig. ). such variation offers crucial insights into the pandemic. at the most basic level, it is clear that a country that tests very few people - such as the democratic republic of congo, or nigeria (fig. a) - can only have very few confirmed cases.
the number of performed tests should be seen as an upper limit for the number of confirmed cases. further, high positive test rates (fig. , see reference lines) may help identify severe underreporting of cases. the relationship between test positivity rate and case underreporting has been explored in the context of other infectious diseases. in terms of covid-19, this link is discussed by ashish jha and colleagues at the harvard global health institute, who provide a sketch of the relationship between cases, deaths and the positivity rate in the united states (see https://globalepidemics.org/ / / /why-we-need- -tests-per-day-to-open-the-economy-and-stay-open). in a more formal analysis, golding et al. find that their modelling estimates of the case ascertainment rate are weakly correlated (kendall's correlation coefficient of . ) with the number of tests per case - the inverse of the test positivity rate - derived from our database, with a positive relationship evident in the range of - tests per case. the institute for health metrics and evaluation (ihme) include testing data sourced from our database in their covid-19 models (www.healthdata.org/covid/faqs#differences% in% modeling). in bringing this data together, our hope is that we will facilitate future research in this direction. more generally, our aim is to provide an essential complement to counts of confirmed cases and deaths. these are the figures that guide public policy, both in the initiation of control measures and as they start to be relaxed. but without the context provided by data on testing, reported cases and deaths may offer a very distorted view of the true scale and spread of the covid-19 pandemic. the database consists of two parts, provided for each included country: (1) a time series for the cumulative and daily number of tests performed, or people tested, plus derived variables (discussed below); (2) metadata including a detailed description of the source and any available information on data quality or comparability issues needed for the interpretation of the time series. for most countries, a single time series is provided: either for the number of people tested, or for the number of tests performed. for a few countries for which both are made available, both series are provided; in such cases, metadata is provided for each separate series. data collection methods. the time series data is collected by a combination of manual and automated means. the collection process differs by country and can be categorized into three broad categories. firstly, for a number of countries, figures reported in official sources - including press releases, government websites, dedicated dashboards, and social media accounts of national authorities - are recorded manually as they are released. secondly, where such publications are released in a regular, machine-readable format, or where structured data is published at a stable location, we have automated the data collection via r and python scripts that we execute every day. these are regularly audited for technical bugs by checking their output against the original official sources (see 'technical validation', below). lastly, in some instances where manual collection has proven prohibitively difficult, we source data from non-official groups collecting the official data, most often on github. these are also regularly audited for accuracy against the original official sources (see 'technical validation', below).
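as a toy illustration of the positive rate and tests-per-case measures discussed above (synthetic numbers only, not figures from the database), the sketch below computes 7-day rolling ratios from daily tests and confirmed cases:

```python
# toy computation of the short-term positive rate and tests per confirmed case (synthetic numbers).
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2020-07-01", periods=14, freq="D"),
    "new_tests": [2000, 2100, 1900, 2200, 2300, 2050, 2150,
                  2400, 2350, 2500, 2450, 2600, 2550, 2700],
    "new_cases": [60, 72, 55, 70, 80, 66, 75, 90, 85, 95, 92, 100, 98, 110],
})

# 7-day rolling sums smooth out day-of-week reporting cycles
roll = df[["new_tests", "new_cases"]].rolling(7).sum()
df["positive_rate"] = roll["new_cases"] / roll["new_tests"]
df["tests_per_case"] = roll["new_tests"] / roll["new_cases"]
print(df[["date", "positive_rate", "tests_per_case"]].tail(3))
```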
any available information on data quality or comparability issues needed for the interpretation of the time series is gathered and summarized manually into detailed metadata for each series, guided by a checklist of data quality questions (see data records, below). general. pcr-based testing has in general been the basis for covid-19 case confirmation, in line with who recommendations. since the primary purpose of the database is to provide information on testing volumes specifically to aid the interpretation of data on confirmed cases, it is exclusively this category of testing technologies that the database aims to include. in order to be included, a data point for a given country must report an aggregate figure that includes both negative tests (or negatively tested individuals) plus positive tests (or confirmed cases). the units (whether the number of tests or individuals is being counted) must be consistent across positive and negative outcomes. the aggregate figure must refer to a known time period - for instance, the number of tests performed in the last day or week. however, where a cumulative total is provided, it is not a requirement that the specific start date to which the cumulative count relates be specified, provided that it is clear that the figure aims to capture the whole of the relevant outbreak period. figures relating to testing 'capacity', to rough indications of average testing output, or to the number of tests that have been distributed (rather than actually performed) are not included in the database. where figures for pending tests are provided separately by a source, these are excluded from our counts; where they cannot be separated, the figures including pending tests are reported. details concerning pending tests for individual countries can be found in the metadata. the database provides a time series for both the cumulative number of tests (or people tested) and for daily new tests. exactly how these series are derived depends on the way the raw data is reported by the source. where a source provides a complete time series for daily tests, we derive an additional cumulative series as the simple running total of the raw daily data. where a source provides cumulative figures, we derive an additional daily series as the day-to-day change observed in consecutive observations. in many cases the source data is not available at a daily frequency (fig. ). in order to facilitate cross-country comparisons over time, we derive an additional 'smoothed' daily testing series calculated as the seven-day moving average over a complete, linearly interpolated daily series (described in more detail in the data records section below). retrospective revisions in the source data. due to the efforts to produce timely data, official testing figures are subject to frequent retrospective revisions. this can occur, for instance, where some laboratories have longer reporting delays than others, and previously uncounted tests are then subsequently included. this issue presents no difficulties where sources provide an updated time series within which such revisions are appropriately incorporated, for instance by backdating the additional tests to the date they were performed. however, a number of the sources we rely on provide only a 'snapshot' of the current cumulative figure, with no time series. we construct our cumulative and daily testing time series from the sequence of these 'snapshots'. for these cases, retrospective revisions do impact our data, since revisions are included on the day the revision is made, not when the revised tests occurred. typically, this results in only small deviations in the cumulative figure in proportional terms, but the derived daily testing series can be impacted more meaningfully.
for these cases, retrospective revisions do impact our data since revisions to the data are included on the day the revision is made, not when the revised tests occurred. typically, this results in only small deviations in the www.nature.com/scientificdata www.nature.com/scientificdata/ cumulative figure in proportional terms, but the derived daily testing series can be impacted more meaningfully. at the extreme, in a few cases, such revisions result in a fall in the cumulative total from one day to the next, implying a negative number of tests for that day. this issue is mitigated in two ways. firstly, given that much of retrospective revision relates to testing conducted over the last few days, the 'smoothed' daily time series we derive reduces some of the artificial volatility introduced. secondly, we alert the user as to which data is subject to such concerns as part of the information included in the metadata (see below). a copy of the database has been uploaded to figshare . this provides a version of the database as it stood at the time of submission, on august . a live version of the database, which continues to be updated, can be downloaded from a public github repository (https://github.com/owid/covid- -data/tree/master/public/data/testing) in csv, xlsx, and json formats, which may be imported into a variety of software programs. structure. the database consists of two components: a time series file including observations of cumulative and daily testing (covid-testing-all-observations.csv), and metadata (covid-testing-source-details.csv). each row in the metadata table provides source details (discussed below) corresponding to a given country-series (i.e. the combination of country and series fields make up a unique id within covid-testing-source-details.csv). the time series for cumulative and daily testing for each country-series is then provided in the covid-testing-all-observations.csv file. in addition, we provide the raw data (raw-collected-data.csv), as collected from the source, in order to make it plain how our time series data is constructed from the original observations. we also provide the united nations population data for (un- -population.csv) used to derive the per capita measures included in the time series. raw-collected-data.csv. country. each observation relates to testing conducted within the indicated country. we do not include in our database figures that explicitly relate to only partial geographic coverage of a country (such as a particular region or city). the country's -letter iso - code is also provided as a separate field. units. a short description of the unit of observation of the collected testing figures, selected out of three possible categories: "people tested", "tests performed", "samples tested". series for which it was not possible to discern the category are labelled as "units unclear". series. multiple series (e.g. people tested and samples tested) are included for some countries, and are demarcated by this field. common to covid-testing-all-observations.csv and raw-collected-data.csv. date. depending on the source, this may relate to the date on which samples were taken, analyzed, or registered, or simply the date they were included in official figures (see 'retrospective revisions in the source data' , above). in general, sources try to provide testing data relating to a given, stable cut-off time each day. where significant changes in reporting windows have been found, these have been noted in the notes field (see below). 
cumulative total. the reported cumulative amount of testing as of the given date. the specific date to which the cumulative figures date back, if known, is provided in the metadata (see below). in many cases this is not explicitly stated by a source, but only figures that appear to intend to capture the entire period of the testing response to the covid-19 outbreak within the country are included in the database. in covid-testing-all-observations.csv, for those sources only providing daily testing figures, this field is derived as the running total of the raw daily data, and is also provided per thousand people of the country's population. daily change in cumulative total. broadly, this field may be interpreted as the number of new tests (or people tested) per day. for sources that report new tests per day directly, this field in covid-testing-all-observations.csv is identical to the raw data presented in raw-collected-data.csv. for sources that report only cumulative testing figures, the field is derived as the day-to-day change observed in consecutive observations of the raw cumulative total data. this may fail to correspond to the true number of new tests for that date where the source has included retrospective revisions in the cumulative totals (see 'retrospective revisions in the source data', above). in covid-testing-all-observations.csv, this series is also provided per thousand people of the country's population. source url. a url at which the specific observation of the corresponding raw data can be found. source label. the name of the source for the observation. notes. contains any notes to aid the interpretation of this specific observation (above and beyond details that apply to the whole series, which are provided in covid-testing-source-details.csv). specific to covid-testing-all-observations.csv. 7-day smoothed daily change. as an outbreak progresses, flows of new tests per day, rather than cumulative figures, become more relevant for understanding trends. daily testing figures, however, suffer from volatility created by reporting cycles. moreover, since many sources do not provide data at daily intervals, figures for new tests per day are available with more limited coverage. to aid the cross-country analysis of testing volumes over time, we provide this short-term measure of testing output that aims to mitigate these two problems. it is calculated as the right-aligned rolling seven-day average of a complete series of daily changes. for countries for which no complete series of daily changes is available because of the reporting frequency of our source, we derive it by linearly interpolating the daily cumulative totals not available in the raw data, up to a maximum interval of days. the exact code used to derive the 7-day smoothed daily change is available online (see 'code availability', below). specific to covid-testing-source-details.csv. number of observations. the number of days for which raw observations are available. detailed description. a written summary of available information concerning the nature and quality of the source data needed for proper interpretation and cross-country comparison.
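a simplified version of the smoothing just described, on a synthetic cumulative series and without the maximum-interpolation-interval cap applied in the actual code, could look like this:

```python
# sketch of a 7-day smoothed daily change from sparse cumulative totals (synthetic example).
import numpy as np
import pandas as pd

idx = pd.date_range("2020-05-01", periods=15, freq="D")
cumulative = pd.Series([1000, np.nan, np.nan, 1900, np.nan, 2600, np.nan,
                        np.nan, 3800, np.nan, 4500, np.nan, np.nan, 5900, 6300], index=idx)

# fill reporting gaps by linear interpolation, then take daily differences
daily_interp = cumulative.interpolate(method="linear", limit_area="inside").diff()
smoothed = daily_interp.rolling(7).mean()     # right-aligned seven-day average
print(smoothed.dropna().round(1))
```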
the collation of this information is guided by a 'checklist' of data quality questions regarding: the unit of observation; which testing technologies the figures relate to; whether tests pending results are included; the time period covered; and the extent to which figures are affected by aggregation across laboratories (private and public) and subnational regions. in practice, the documentation we are able to provide is limited by that made available by the official source. we aim to include any information provided by the original source needed for the interpretation and comparison with other countries. coverage. the database includes observations for countries, covering % of the world's population. because of differences in the frequency at which countries publish testing data, coverage is somewhat lower for more recent periods: % of the world's population is covered with figures relating to - august ; % is covered with figures relating to - august (fig. ). the database represents a collation of publicly available data published by official sources. as such, the key quality concern for the database itself is whether it represents an accurate record of the official data. we employ four main strategies for ensuring this. firstly, all automated collection of data, whether obtained from official channels or from third-party repositories of official data, is subject to initial manual verification when it is added to our database for the first time. secondly, we employ a range of data validation processes, both for our manual and automated time series. we continually check for invalid figures such as negative daily test figures, out-of-sequence dates, or test positivity rates above % (by comparing testing data to confirmed case data), and we monitor each country for abrupt changes in daily testing rates. abrupt positive or negative daily changes are sometimes the result of data corrections in the official data, in which case our database includes them without alteration. these changes can be due, for example, to the deduplication of double-counted tests, or the addition of testing data that was previously not captured by the national system (see table ). in order to mitigate against large impacts due to reporting lags, we automatically exclude the most recent observation for a country if its daily number of new tests is less than half that of the previous observation. this is only applied to the most recent day in each time series: as soon as data for subsequent days becomes available, the data point is reinstated if the sharp fall is still present. thirdly, to monitor the ongoing reliability of third-party repositories of official data, we apply a continuous audit process, which will remain active as long as this dataset is updated. each day, three observations are randomly drawn out of all observations in the database that have been obtained via third-party sources. for each selected observation, the recorded figure is manually checked against the direct official channel from which the repository purports to obtain the data. the sampling rate means that each third-party source we make use of is checked around once a week. given that any discrepancies with official channels are likely to be clustered within particular sources, this provides a high degree of quality control on these sources on a timely basis.
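the validation rules and the daily audit draw described above translate into simple checks of the following kind. the sketch below is illustrative only; the column names (including the flag marking third-party-sourced rows) are assumptions, and the project's actual scripts live in its github repository.

```python
# Illustrative validation and audit-sampling sketch; not the project's own code.
# Assumes a tidy DataFrame `df` with one row per observation and the (assumed)
# columns used below, plus an aligned series of confirmed cases for positivity.
import pandas as pd

def basic_checks(df: pd.DataFrame, confirmed_cases: pd.Series) -> pd.DataFrame:
    flags = pd.DataFrame(index=df.index)
    # negative daily test figures, often a sign of a downward cumulative revision
    flags["negative_daily"] = df["daily_change"] < 0
    # out-of-sequence dates within a country-series
    flags["date_out_of_order"] = (
        df.groupby(["country", "series"])["date"].diff().dt.days < 0
    )
    # implied test positivity above 100% (more confirmed cases than tests that day)
    flags["positivity_above_100"] = confirmed_cases > df["daily_change"]
    return flags

def drop_suspicious_latest(series_df: pd.DataFrame) -> pd.DataFrame:
    # exclude the most recent observation of a series if its daily figure is
    # less than half of the previous one (a typical reporting-lag signature)
    s = series_df.sort_values("date")
    if len(s) < 2:
        return s
    last, prev = s["daily_change"].iloc[-1], s["daily_change"].iloc[-2]
    if pd.notna(last) and pd.notna(prev) and last < 0.5 * prev:
        return s.iloc[:-1]
    return s

def daily_audit_sample(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    # draw n observations at random from those collected via third-party
    # repositories, for manual comparison against the official channel
    third_party = df[df["collected_via_third_party"]]
    return third_party.sample(n=min(n, len(third_party)))
```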
where any discrepancies are noticed, we switch sources (for the entire time series) to either a different repository or to manual data collection directly from the official channel. finally, the testing data included in the database is viewed by tens of thousands of people every day, including many health researchers, policymakers and journalists, from which we receive a large amount of feedback concerning the data. this serves as a final, 'crowd-sourced' method of verification that has proven very effective, enabling any discrepancies between our data and that published in official channels to be flagged and resolved quickly. code used for the creation of this database is not included in the files uploaded to figshare. our scripts for data collection, processing, and transformation, are available for inspection in the public github repository that hosts our data (https://github.com/owid/covid- -data/tree/master/scripts/scripts/testing). changes observed in the source data as of august case-fatality rate and characteristics of patients dying in relation to covid- in italy an interactive web-based dashboard to track covid- in real time malaria burden through routine reporting: relationship between incidence and test positivity rates reconstructing the global dynamics of under-ascertained covid- cases and infections advice on the use of point-of-care immunodiagnostic tests for covid- : scientific brief this project was funded from multiple sources, including general grants and donations to our world in data. the following is a list of funding sources and affiliations. grants: our world in data has received grants from the bill and melinda gates foundation, the department of health and social care in the united kingdom, and the german philanthropist susanne klatten. sponsors: in addition to grants, our world in data has also received donations from several individuals and organizations: center for effective altruism -effective altruism meta fund, templeton world charity foundation, effective giving, the camp foundation, the rodel foundation, the pritzker innovation fund diana beltekian -data recording, verification and analysis. charlie giattino -data recording, verification and analysis. bobbie macdonald -data recording, verification and analysis. esteban ortiz-ospina -data recording, verification and analysis the authors declare no competing interests. correspondence and requests for materials should be addressed to j.h.reprints and permissions information is available at www.nature.com/reprints.publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material. if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons.org/licenses/by/ . /.the creative commons public domain dedication waiver http://creativecommons.org/publicdomain/zero/ . 
/ applies to the metadata files associated with this article. key: cord- -zzc a id authors: otoom, mwaffaq; otoum, nesreen; alzubaidi, mohammad a.; etoom, yousef; banihani, rudaina title: an iot-based framework for early identification and monitoring of covid- cases date: - - journal: biomed signal process control doi: . /j.bspc. . sha: doc_id: cord_uid: zzc a id the world has been facing the challenge of covid- since the end of . it is expected that the world will need to battle the covid- pandemic with precautious measures, until an effective vaccine is developed. this paper proposes a real-time covid- detection and monitoring system. the proposed system would employ an internet of things (iots) framework to collect real-time symptom data from users to early identify suspected coronaviruses cases, to monitor the treatment response of those who have already recovered from the virus, and to understand the nature of the virus by collecting and analyzing relevant data. the framework consists of five main components: symptom data collection and uploading (using wearable sensors), quarantine/isolation center, data analysis center (that uses machine learning algorithms), health physicians, and cloud infrastructure. to quickly identify potential coronaviruses cases from this real-time symptom data, this work proposes eight machine learning algorithms, namely support vector machine (svm), neural network, naïve bayes, k-nearest neighbor (k-nn), decision table, decision stump, oner, and zeror. an experiment was conducted to test these eight algorithms on a real covid- symptom dataset, after selecting the relevant symptoms. the results show that five of these eight algorithms achieved an accuracy of more than %. based on these results we believe that real-time symptom data would allow these five algorithms to provide effective and accurate identification of potential cases of covid- , and the framework would then document the treatment response for each patient who has contracted the virus. since its discovery in late december of , there have been more than . million confirmed cases of covid- reported in countries, as of july , [ ] , with approximately a % daily increase. among these cases there have been more than thousand deaths, which represents an approximate . % mortality rate. this novel coronavirus was characterized on march , as a pandemic by the world health organization [ ] . unfortunately, there is no successful treatment procedure or vaccine yet. it is expected that the development of an effective vaccine will take more than a year, especially since the nature of the virus has not yet been completely characterized [ ] . currently, the only way that the world can deal with this coronavirus is to slow down its spread, (i.e. "flatten the curve") by using measures such as social distancing, hand washing and face masks. however, technology could also help slow its spread, through early identification (or prediction) and monitoring of new cases [ ] , [ ] . such technologies include big data, as well as cloud and fog capabilities [ ] , the use of data gathered through remote monitoring, such as mhealth, telehealth, and real-time patient status follow-up [ ] . this paper proposes a covid- detection and monitoring system that would collect real-time symptom data from wearable sensor technologies. 
to quickly identify potential coronaviruses cases from this real-time data, this paper proposes the use of eight machine learning algorithms, namely support vector machine (svm), neural network, naïve bayes, k-nearest neighbor (k-nn), decision table, decision stump, oner, and zeror. this detection and monitoring system could be implemented with an iot infrastructure that would monitor both potential and confirmed cases, as well as the treatment responses of patients who recover j o u r n a l p r e -p r o o f from the virus. in addition to real-time monitoring, this system could contribute to the understanding of the nature of the virus by collecting, analyzing and archiving relevant data. the proposed framework consists of five main components: ( ) real-time symptom data collection (using wearable devices), ( ) treatment and outcome records from quarantine/isolation centers, ( ) a data analysis center that uses machine learning algorithms, ( ) healthcare physicians, and ( ) a cloud infrastructure. the aim of this framework, is to reduce mortality rates through early detection, following up on recovered cases, and a better understanding of the disease. this work conducts an experiment to test these eight machine learning algorithms on a real dataset. the results show that five of these eight algorithms achieved accuracies of more than %. using these five algorithms will provide effective and accurate prediction and identification of potential cases of covid- , based on real-time symptom data. this paper is organized as follows. section reviews the relevant literature. section details the proposed framework, including the five components. section focuses on the identification (or prediction) of new cases, using machine learning algorithms. lastly, section concludes the work. there is considerable work in the literature regarding the use of the internet of things (iot) to deliver health services. usak et al. conducted a systematic literature review of the use of iot in health care systems. that work also included a discussion of the main challenges of using iot to deliver health services, and a classification of the reviewed work in the literature [ ] . wu et al. proposed a hybrid iot safety and health monitoring system. the goal was to improve outdoor safety. the system consists of two layers: one is used to collect user data, and the other to aggregate the collected data over the internet. wearable devices were used to collect safety indicators from the surrounding environment, and health signs from the user [ ] . hamidi studied authentication of iot smart health data to ensure privacy and security of health information. the work proposed a biometric-based authentication technology [ ] . rath and pattanayak proposed a smart healthcare hospital in urban areas using iot devices, inspired by the literature. issues such as safety, security and timely treatment of patients in vanet zone were discussed. evaluation of the proposed system was conducted using simulators such as ns and netsim [ ] . darwish et al. proposed a cloudiot-health paradigm, which integrates cloud computing with iot in the health area, based on the relevant literature. the paper presented the challenges of integration, as well as new trends in cloudiot-health. these challenges are classified at three levels: technology, communication and networking, and intelligence [ ] . zhong and li studied the monitoring of college students during their physical activities. 
the paper focused on a physical activity recognition and monitoring (parm) model, which involves data pre-processing. several classifiers, such as decision tree, neural networks, and svm, were tested and discussed [ ] . din and paul proposed an iot-based smart health monitoring and management architecture. the architecture is composed of three layers: ( ) data generation from battery-operated medical sensors and processing, ( ) hadoop processing, and ( ) application layers. because of the limited capacity of batteries to power the sensors, the work employed an energy-harvesting approach using piezoelectric devices attached to the human body [ ] . otoom et al. developed an iot-based prototype for real-time blood sugar control. arima and markov-based statistical models were used to determine the appropriate insulin dose [ ] . alshraideh et al. proposed an iot-based system for cardiovascular disease detection. several machine learning algorithms were used for cvd detection [ ] . nguyen presented a survey of artificial intelligence (ai) methods being used in the research of covid- . this work classified these methods into several categories, including the use of iot [ ] . maghdid proposed the use of sensors available on smartphones to collect health data, such as temperature [ ] . rao and vazquez proposed the use of machine learning algorithms to identify possible covid- cases. the learning is done on collected data from the user through web survey accessed from smartphones [ ] . allam and jones discussed the need to develop standard protocols to share information between smart cities in pandemics, motivated by the outbreak of covid- . for instance, ai methods can be applied to data collected from thermal cameras installed in smart cities, to identify possible covid- cases [ ] . fatima et al. proposed an iot-based approach to identify coronavirus cases. the approach is based on a fuzzy inference system [ ] . peeri et al. conducted a comparison between mers, sars, and covid- , using the available literature. they suggested the use of iot in mapping the spread of the infection [ ] . to our knowledge, no one has developed a complete framework for using iot technology for the identification and monitoring of covid- . equipped with sensors to collect heterogeneous data. these sensors have a limited computational capacity, and a limited lifetime. the more data that they collect, the more helpful decisions can be made. however, data processing complexity becomes a bottleneck [ ] . connectivity can be used to cope with the limited computational power of these sensors. several different communication technologies have been employed, including lowpan, bluetooth, ieee . . , rfid and near-field communication (nfc) [ ] . the network layer is not just used to upload collected data, for analysis. it is also used to facilitate communication between heterogeneous iot objects, at the physical layer. in doing this, the network layer should support scalability, as the number of the objects increases, as well as device discovery, and context awareness. significantly, it should also provide security and privacy for iot devices [ ] . the data uploaded from the iot devices can be deeply analyzed, to generate insights and help make decisions. currently, machine learning and deep learning (ml/dl) algorithms are used for this purpose, and are replacing more traditional methods because of their ability to deal with big data [ ] . al-garadi et al. provided a thematic taxonomy of ml/dl used for iot security [ ] . 
there are a wide range of applications where iot can be effectively used, including healthcare, smart cities, smart buildings, agriculture, and power grids. in healthcare, iot is sometimes called internet of medical things (iomt) [ ] . it has largely displaced traditional ict-based methods, such as telemedicine or telehealth. iomt can provide more advanced features than these traditional methods. for example, while traditional methods can connect patients with medical doctors remotely, iomt also supports machinehuman and machine-machine interactions, such as ai-based diagnosis. one important issue in designing iomt is the balance between data privacy/security and patient safety [ ] . examples of cyber threats that challenge such designs are eavesdropping on communication channels (to sell the collected data), intervention, disruption, or even modification of the service. however, in cases where the patient's life is at risk, breaking some security measures to access the iomt might be needed to save the patient's life [ ] . ml/dl methods can be used to support this balance. this section depicts and discusses our envisioned iot-based framework, which could be used to monitor and identify (or predict) potential coronaviruses cases, in real time. equally important, this framework could be used to predict the treatment response of confirmed cases, as well as to better understand the nature of the covid- disease. fig. shows the framework of our proposed iot architecture. it consists of five main components: symptom data collection and uploading, a quarantine/isolation center, a data analysis center, an interface to health physicians, all of which are interconnected through a cloud infrastructure. j o u r n a l p r e -p r o o f a. symptom data collection and uploading. the aim of this component is to collect real-time symptom data through a set of wearable sensors on the user's body. in our earlier study [ ] , the most relevant covid- symptoms were identified, based on a real covid- patient dataset. these identified symptoms were: fever, cough, fatigue, sore throat, and shortness of breath. there are several biosensors available to detect these symptoms. for instance, temperature-based sensors can be used for the detection of fever [ ] . cough and its classifications for different ages can be detected using audio-based sensors with acoustic and aerodynamic models [ ] . motion-based and heart-rate sensors can be used to detect fatigue [ ] . sore throat can be detected using image-based classification [ ] . finally, oxygen-based sensors can be used to detect shortness of breath [ ] . other relevant datasuch as travel and contact history during the past - weeks, can be collected in an ad-hoc manner through mobile applications. center. this component collects data records from users who have been quarantined or isolated in a health care center. these records include both health (or technical) and non-technical data. for health (or technical) data, each record includes time-series data of the above-mentioned symptoms, while for non-technical data, each record includes travel and contact history during the past - weeks, chronic diseases, age, gender, and any other relevant information, such as family history of illness. each record would eventually also include the treatment response for each case. c. data analysis center. the data center hosts data analysis and machine learning algorithms. 
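to make the data flow of the symptom data collection component (a) concrete, the sketch below shows one possible shape for a symptom record uploaded from a user's wearables and phone. the field names and value types are illustrative assumptions, not a schema specified in this work.

```python
# Hypothetical payload for one real-time symptom upload; field names and types
# are illustrative assumptions based on the symptoms and ad-hoc data listed above.
from dataclasses import dataclass, asdict
from datetime import datetime
import json

@dataclass
class SymptomRecord:
    user_id: str
    timestamp: datetime
    temperature_c: float         # fever (temperature-based sensor)
    cough_detected: bool         # audio-based cough detection
    fatigue_score: float         # motion/heart-rate derived fatigue estimate
    sore_throat: bool            # image-based classification result
    spo2_percent: float          # oxygen saturation, proxy for shortness of breath
    visited_infected_area: bool  # "live" feature, self-reported via mobile app
    contact_with_case: bool      # "contact" feature, self-reported via mobile app

    def to_json(self) -> str:
        payload = asdict(self)
        payload["timestamp"] = self.timestamp.isoformat()
        return json.dumps(payload)

record = SymptomRecord("user-001", datetime.utcnow(), 38.2, True, 0.7,
                       False, 94.0, False, True)
print(record.to_json())  # body that a phone might upload to the cloud back end
```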
these algorithms are used to build a model for covid- , and to provide a real-time dashboard of the processed data. the model could then be used to quickly identify or predict potential covid- cases, based on real-time data collected and uploaded from users. the model can also predict the patient's treatment response. over time, the disease models developed from this data will provide useful information about the nature of the disease. physicians will monitor suspected cases whose real-time uploaded symptom data indicates a possible infection by our proposed machine learning based identification/prediction model. the physicians will then be able to respond swiftly to these suspected cases by following up with any further clinical investigation needed to confirm the case. this allows the confirmed cases to be isolated and given appropriate health care. e. cloud infrastructure. the cloud infrastructure is interconnected through the internet, and ( ) allows upload of real-time symptom data from each user, ( ) maintains personal health records, ( ) communicates prediction results, ( ) communicates physician recommendations, and ( ) provides for storage of information. j o u r n a l p r e -p r o o f . the system non-invasively collects real-time user symptom data through wearable devices and sensors. again, these symptoms are: fever, cough, fatigue, sore throat, and shortness of breath. further, the user submits information via a mobile application about living in (or travel to) infected areas, as well as possible contact with covid- infected persons. the quarantine/isolation center also periodically submits data from their isolated and quarantined patients who are housed in the center. the content of that data is similar to the real-time data collected from users. . the sensed symptom data are uploaded to the data analysis center using a smartphone, through the cloud infrastructure. digital records from the health care center are also regularly sent to the data analysis center through the cloud infrastructure. the data analysis center hosts machine learning algorithms, which use the data received from the health care center to continuously update its models. the models are then used to identify potential cases, based on the real-time symptom data from each user. the data center also analyzes all its data, and presents the results on a real-time dashboard. that dashboard can be informative to physicians about the nature of the virus. . if a potential case is identified, it will be sent to the relevant physician to follow up with the patient. the patient will then be called and encouraged to visit the health care center for clinical tests, such as the polymerase chain reaction (pcr) test, which is used to identify positive cases. if it turns out that the case is confirmed, the patient can be isolated, and all contacts will be contacted and quarantined. a complementary and integral component to this framework is the use of the same mobile application to educate users, by including useful information on how they can avoid illness, and how to avoid being exposed to the virus. this section further discusses the predictive models, and the machine learning algorithms that will be employed in the data center component of the proposed iot-based framework. in particular, an experiment was conducted to investigate the possibility of using machine learning algorithms for quick identification (or prediction) of potential covid- infections. 
the rest of this section describes that experimental setup, and presents and discusses the results. a dataset of confirmed covid- cases from the covid- open research dataset (cord- ) repository [ ] was used. the data contains different types of information about each case. our work focused on symptoms, travel history to suspicious areas, and contact history with potentially infected people. however, some of this information was missing for many of the cases documented within the database. moreover, the data was not well structured for use by machine learning algorithms. in our previous work [ ] , the data was preprocessed and structured to be better suited for machine learning. the cases with documented symptoms were collected. this resulted in a list of symptoms. however, many of these symptoms were judged to be synonyms. thus, the number of symptoms was reduced to . this merging of synonymous symptoms was done in an ad-hoc manner by two medical doctors, who are co-authors of this work. for example, "anorexia" and "loss of appetite" were merged together. our previous work also determined the relative importance of these symptoms. the following six different statistically-based feature selection algorithms were employed in that work, to rank the symptoms, based on their importance: spectral score, information score, pearson correlation, intra-class distance, interquartile range, and our variance based feature weighting [ ] . the first five of these methods had been proposed earlier in the literature [ ] . the sixth method was a new one. it not only ranks the symptoms, but also assigns importance weights to each of them. it was found that the most important five symptoms (ordered from most important to least important) are: fever, cough, fatigue, sore throat, and shortness of breath. based on the findings of that earlier work, this work uses those five most important symptoms. in addition, two extra features were added: live and contact. the first feature (live) represents whether or not the person lived, travelled to, or passed by a potentially infected area. the second feature (contact) represents whether or not the person was known to be in contact with a potentially infected person. this resulted in a preprocessed dataset of × data records. among which of those records were from confirmed covid- cases, and records were for non-confirmed cases. this work used this preprocessed dataset to build a predictive model for our identification (or prediction) system. the function of this model is to estimate the likelihood that a given person is infected by covid- . several learning algorithms (i.e. classifiers) could have been used for this purpose. those classifiers can be categorized into multiple categories. weka software [ ] , (which we used in this work) categorizes the classifiers into six categories: ( ) functionbased classifiers, such as support vector machines, ( ) lazy classifiers, such k-nearest neighbors, ( ) bayes based classifiers, such as naïve bayes, ( ) rule-based classifiers, such as decision tables, zeror, and oner, ( ) tree-based classifiers, such as decision stump, and ( ) meta classifiers, such neural networks. in this work, at least one classifier from each category was selected. 
specifically, this work compares the performance of eight machine learning algorithms: ( ) support vector machine (svm), using radial basis function (rbf) kernel, ( ) neural network, ( ) naïve bayes, ( ) k-nearest neighbor (k-nn), ( ) decision table, ( ) decision stump, ( ) oner, and ( ) zeror [ ] . this work used weka software to run all of these algorithms on our dataset [ ] . the default parameter values were used for each of the eight algorithms. below is a brief description of the eight algorithms: svm is a supervised learning method. given a set of training examples that are labeled (i.e. each instance in the training set either belongs to the positive or negative class), svm learns the hyperplane that best separates the instances from each class, and maximizes the margin between the data instances and the hyperplane itself. this learnt hyperplane is then used to assign (or predict) a class label for any new test instance. ann is a supervised learning method. the learning process tries to mimic the learning that takes place inside the human brain. to do so, multiple layers of nodes are connected through edges. the edges connecting between the nodes are represented as numerical weights. the output of each node is computed as a weighted sum of its inputs. given a set of training examples that are labeled (i.e. each instance either belongs to the positive or negative class), the ann learns the numerical weights that best classify the instances from each class. this learnt model is then used to assign (or predict) a class label for any given test instance. the test instance drives the inputs to the nodes of the first layer. then a threshold is applied to the outputs of the final layer, to determine the label for that test instance. naïve bayes is a supervised learning method. the learning process follows a probabilistic approach. it uses bayes' theorem to compute the model parameters. given a set of training examples that are labeled (i.e. each instance either belongs to the positive or negative class), naïve bayes computes multiple model parameters, such as the probability of each class label occurring. these parameters are then used to assign (or predict) a class label for any given test instance. this is done by computing the probabilities of the test instance being assigned to each of the possible class labels. the maximum value among these probabilities decides the label of that test instance. k-nn is a supervised instance-based learning method. the learning process follows a lazy approach. it does not compute a model. given a set of training examples that are labeled (i.e. each instance either belongs to the positive or negative class), k-nn computes distances between a given test instance and all the training instances. these distances are then used to assign (or predict) a class label for the test instance. this is done by aggregating the class labels of the k closest training instances to the test instance. decision table is a supervised rule-based learning method. given a set of labeled training examples, it selects a subset of informative features and summarizes the training data in a table indexed by those features; a new test instance is assigned the majority class label of the matching table entries (or the overall majority class if no entry matches). decision stump is a supervised learning method. given a set of training examples that are labeled (i.e. each instance either belongs to the positive or negative class), this method computes a model by building a decision tree with only one internal node. in other words, it makes the prediction for any given test instance using only one feature of that instance. this feature is determined by computing the information gain for all features across all training instances, selecting the one with the maximum information gain value.
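the experiment itself was run in weka with default parameters; purely as an illustration, the sketch below sets up a comparable comparison in python with scikit-learn stand-ins and 10-fold cross-validation on synthetic data shaped like the seven-feature dataset described above. the synthetic data, the feature names, and the choice of stand-ins (e.g. a dummy majority-class model for zeror, a depth-1 tree for decision stump) are assumptions; oner and weka's decision table have no direct scikit-learn equivalent and are omitted here.

```python
# Illustrative sketch (not the authors' Weka experiment): compare several
# classifiers with 10-fold cross-validation on a binary-feature dataset shaped
# like the one described in the text (five symptoms plus "live" and "contact").
import numpy as np
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
feature_names = ["fever", "cough", "fatigue", "sore_throat",
                 "shortness_of_breath", "live", "contact"]
X = rng.integers(0, 2, size=(400, len(feature_names)))   # synthetic stand-in data
y = (X[:, 0] & (X[:, 1] | X[:, 6])).astype(int)          # synthetic labels

models = {
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Neural network": MLPClassifier(max_iter=2000, random_state=0),
    "Naive Bayes": BernoulliNB(),
    "k-NN": KNeighborsClassifier(),
    "Decision stump": DecisionTreeClassifier(max_depth=1, criterion="entropy"),
    "ZeroR (majority class)": DummyClassifier(strategy="most_frequent"),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name:24s} mean accuracy = {scores.mean():.3f}")
```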
oner is a supervised learning method. given a set of training examples that are labeled (i.e. each instance either belongs to the positive or negative class), this method computes a model by generating one rule for each feature in the data set. it then selects the rule with the minimum total error. zeror is a supervised learning method. given a set of training examples that are labeled (i.e. each instance either belongs to the positive or negative class), this method computes a model by using only the target feature (i.e. class) while ignoring all other features. it is considered the simplest classification method. it assigns any new test instance to the majority class. usually, it is used as a benchmark to determine baseline performance. to evaluate the performance of the eight learning algorithms, four performance measures were used: accuracy, root mean square error, f-measure, and roc area. these measures can be computed using a confusion matrix and cross validation methods. the confusion matrix is used to visualize the performance of a binary (2-class) supervised learning problem by creating a 2-by-2 matrix. each row in the matrix shows the instances in the predicted (or computed) class, while each column shows the instances in the actual class. the resulting matrix consists of four values (see table ). cross validation is a statistical method used to measure the performance of learning and classification methods. this is done by splitting the available labeled data instances into k folds. one of these folds is used for testing, and the rest are used for training. this work used -fold cross validation. the data instances are divided into folds. for iterations, one fold was used for testing and folds for training, such that in every iteration a different fold is used for testing. the accuracy of a classifier is computed as the ratio of the number of correctly classified instances to the total number of instances. it is given by: accuracy = (tp + tn) / (tp + tn + fp + fn). the root mean square error (rmse) is computed as the square root of the average of squared differences between the predicted classes (or labels) and the actual ones. it is given by: rmse = sqrt((1/n) * sum_i (predicted_i - actual_i)^2), where n is the total number of instances. the f-measure is computed by combining the two measures of precision and recall, where precision = tp / (tp + fp) and recall = tp / (tp + fn). it is given by: f-measure = 2 * precision * recall / (precision + recall). the receiver operating characteristic (roc) is another way to measure the performance of a classifier. this is done by plotting the true positive rate against the false positive rate. the area under the resulting roc curve is then used to measure the performance of the classifier. the closer the area is to 1, the better the classifier is. the true/false positive rates are given by: tpr = tp / (tp + fn) and fpr = fp / (fp + tn). fig. shows the confusion matrices that resulted from applying -fold cross validation to the eight selected classifiers. (large numbers in the upper-left and lower-right boxes of these matrices represent good scores. large numbers in the lower-left and upper-right boxes of these matrices represent bad scores.) fig. shows the roc curves that resulted from applying -fold cross validation to the eight selected classifiers. table and fig. compare the performance of the eight algorithms, showing the accuracy, root mean square error, f-measure and roc area of each algorithm, calculated using the well-known -fold cross validation method [ ] . the results presented in table and fig.
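for concreteness, the short sketch below computes a confusion matrix and the four measures described above from a vector of true and predicted labels, using standard scikit-learn metric functions; it is not tied to the authors' weka workflow, and the toy labels are made up.

```python
# Sketch: compute the confusion matrix and the four performance measures used
# above (accuracy, RMSE, F-measure, ROC area) from true and predicted labels.
# Standard library calls only; not the authors' own evaluation code.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])    # toy actual labels
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])    # toy predicted labels
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6, 0.95, 0.05])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("confusion matrix [tn fp fn tp]:", tn, fp, fn, tp)
print("accuracy :", accuracy_score(y_true, y_pred))            # (tp+tn)/total
print("rmse     :", np.sqrt(mean_squared_error(y_true, y_pred)))
print("f-measure:", f1_score(y_true, y_pred))
print("roc area :", roc_auc_score(y_true, y_score))            # needs scores
print("tpr      :", tp / (tp + fn), " fpr:", fp / (fp + tn))
```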
suggest that the models built using svm, neural network, naïve bayes, k-nn and decision table algorithms are effective in predicting confirmed and potential cases of covid- . taken together, this suggests that our proposed iot-based framework could use a combination of these five effective models. this could be done by aggregating the results of these five learnt models, based on majority votes. this paper has proposed an iot-based framework to reduce the impact of communicable diseases. the proposed framework was used to employ potential covid- case information and health records of confirmed covid- cases to develop a machine-j o u r n a l p r e -p r o o f learning-based predictive model for disease, as well as for analyzing the treatment response. the framework also communicates these results to healthcare physicians, who can then respond swiftly to suspected cases identified by the predictive model by following up with any further clinical investigation needed to confirm the case. this allows the confirmed cases to be isolated and given appropriate health care. an experiment was conducted to test eight machine learning algorithms on a real covid- dataset. they are: ( ) support vector machine, ( ) neural network, ( ) naïve bayes, ( ) k-nearest neighbor (k-nn), ( ) decision table, ( ) decision stump, ( ) oner, and ( ) zeror. the results showed that all these algorithms, except the decision stump, oner, and zeror achieved accuracies of more than %. using the five best algorithms would provide effective and accurate identification of potential cases of covid- . employing the proposed real-time framework could potentially reduce the impact of communicable diseases, as well as mortality rates through early detection of cases. this framework would also provide the ability to follow up on recovered cases, and a better understanding the disease. 
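the majority-vote aggregation of the five best models suggested above could be prototyped along the following lines; the scikit-learn estimators used as stand-ins for the weka models are assumptions, and a shallow decision tree is used as a rough proxy for the decision table classifier.

```python
# Sketch of hard majority voting over the five best-performing model families
# named above, using scikit-learn stand-ins for the Weka classifiers; this is
# an illustration of the aggregation idea, not the authors' implementation.
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for a decision table

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf")),
        ("mlp", MLPClassifier(max_iter=2000, random_state=0)),
        ("nb", BernoulliNB()),
        ("knn", KNeighborsClassifier()),
        ("table", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="hard",  # each model casts one vote; the majority label wins
)
# typical usage: ensemble.fit(X_train, y_train); ensemble.predict(x_new)
```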
coronavirus covid- global cases by the center for systems science and engineering
who director-general's opening remarks at the media briefing on covid- -
the most promising coronavirus breakthroughs so far, from vaccines to treatments
digital technologies and disease prevention
digital technology for preventative health care in myanmar
a fog-computing architecture for preventive healthcare and assisted living in smart ambients
personalized telehealth in the future: a global research agenda
health care service delivery based on the internet of things: a systematic and comprehensive study
an internet-of-things (iot) network system for connected safety and health monitoring applications
an approach to develop the smart health using internet of things and authentication based on biometric technology
technological improvement in modern health care applications using internet of things (iot) and proposal of novel health care approach
the impact of the hybrid platform of internet of things and cloud computing on healthcare systems: opportunities, challenges, and open problems
internet of things sensors assisted physical activity recognition and health monitoring of college students
erratum to "smart health monitoring and management system: toward autonomous wearable sensing for internet of things using big data analytics"
artificial intelligence in the battle against coronavirus (covid- ): a survey and future research directions
a novel ai-enabled framework to diagnose coronavirus covid- using smartphone embedded sensors: design study
identification of covid- can be quicker through artificial intelligence framework using a mobile phone-based survey in the populations when cities/towns are under quarantine
on the coronavirus (covid- ) outbreak and the smart city network: universal data sharing standards coupled with artificial intelligence (ai) to benefit urban health monitoring and management
iot enabled smart monitoring of coronavirus empowered with fuzzy inference system
the sars, mers and novel coronavirus (covid- ) epidemics, the newest and biggest global health threats: what lessons have we learned
real-time statistical modeling of blood sugar
a web based cardiovascular disease detection system
intelligent multi-dose medication controller for fever: from wearable devices to remote dispensers
a mobile cough strength evaluation device using cough sounds
heart rate monitoring system during physical exercise for fatigue warning using non-invasive wearable sensor
novel image processing method for detecting strep throat (streptococcal pharyngitis) using smartphone
extraction and analysis of respiratory motion using wearable inertial sensor system during trunk motion
covid- open research dataset (cord- ). version - -
introduction to data mining. pearson education india
the weka workbench.
online appendix for "data mining: practical machine learning tools and techniques"
internet of things: a survey on enabling technologies, protocols, and applications
fog computing: helping the internet of things realize its potential
a survey on the internet of things security
context aware computing for the internet of things: a survey
internet of things: architectures, protocols, and applications
the role of big data analytics in internet of things
a survey of machine and deep learning methods for internet of things (iot) security
security and privacy issues in implantable medical devices: a comprehensive survey
security tradeoffs in cyber physical systems: a case study survey on implantable medical devices
a novel computational method for assigning weights of importance to symptoms of covid- patients
filter feature selection for one-class classification
the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. key: cord- -ew n i z authors: nambiar, devaki; sankar, hari; negi, jyotsna; nair, arun; sadanandan, rajeev title: field-testing of primary health-care indicators, india date: - - journal: bull world health organ doi: . /blt. . sha: doc_id: cord_uid: ew n i z objective: to develop a primary health-care monitoring framework and health outcome indicator list, and field-test and triangulate indicators designed to assess health reforms in kerala, india, - . methods: we used a modified delphi technique to develop a -item indicator list to monitor primary health care. we used a multistage cluster random sampling technique to select one district from each of four district clusters, and then select both a family and a primary health centre from each of the four districts. we field-tested and triangulated the indicators using facility data and a population-based household survey. findings: our data revealed similarities between facility and survey data for some indicators (e.g. low birth weight and pre-check services), but differences for others (e.g. acute diarrhoeal diseases in children younger than years and blood pressure screening). we made four critical observations: (i) data are available at the facility level but in varying formats; (ii) established global indicators may not always be useful in local monitoring; (iii) operational definitions must be refined; and (iv) triangulation and feedback from the field is vital. conclusion: we observe that, while data can be used to develop indices of progress, interpretation of these indicators requires great care. in the attainment of universal health coverage, we consider that our observations of the utility of certain health indicators will provide valuable insights for practitioners and supervisors in the development of a primary health-care monitoring mechanism. under the thirteenth general programme of work and the triple billion targets, the world health organization (who) aims to increase the number of people benefitting from universal health coverage (uhc) by one billion between and . central to this effort is the expansion and improvement of primary health-care services. , progress in achieving uhc can be analysed using the who and world bank's uhc monitoring framework, , but this requires adaptation to local contexts to ensure health reforms keep pace with targets.
health programmes in india, as well as the national health policy and flagship ayushman bharat scheme, are being evaluated in relation to the aims of uhc; various efforts are currently underway at both a national and state level, notably in haryana and tamil nadu. according to national sample survey estimates from - , morbidity levels in the southern state of kerala are reportedly four times the national average with disparities by sex and place of residence. although the state has made gains in maternal and child health, it must sustain these gains while addressing the substantial and growing burden of hypertension, diabetes and cancer; vaccine-preventable diseases; , and emerging viral infections such as nipah virus and severe acute respiratory syndrome coronavirus (sars-cov- ). [ ] [ ] [ ] kerala has been subject to unregulated privatization and cost escalation, resulting in persistent inequalities in service access and health attainment between population subgroups. in , the government of kerala announced aardram, a programme of transformation of existing primary health centres to family health centres; with increased staffing, these family health centres provide access to a greater number of services over longer opening hours compared with the original primary health centres. apart from the who's monitoring framework, many countries have done uhc and primary health centre monitoring exercises , alongside independent exercises such as the primary health care performance initiative. however, most of these frameworks are intended for global comparison or decision-making at national levels. the argument for tracking health reforms is clear, but such a monitoring process must be specific to kerala and local decision-making, while also complying with national and global reporting requirements. periodic household surveys offer population-level data, but are not frequent enough to inform ongoing implementation decisions. routinely collected and disaggregated health system data are vital, but are often marred by quality issues as well as technological and operational constraints. we began a -year implementation research study assessing equity in uhc reforms in january . in our first two phases we aimed to develop a conceptual framework and a health outcome indicator shortlist, followed by validation of these indicators using data from both health facilities and a population-based household survey. we report on the fieldtesting and triangulation components of this implementation research project, which took place during and . we reflect on early lessons from the field-testing and triangulation and, drawing broadly from ostrom's institutional analysis and development framework, we emphasize how monitoring can support learning health systems. , we also discuss how the monitoring of uhc progress requires a flexible approach that is tailored to the local political economy. [ ] [ ] [ ] objective to develop a primary health-care monitoring framework and health outcome indicator list, and field-test and triangulate indicators designed to assess health reforms in kerala, india, - . methods we used a modified delphi technique to develop a -item indicator list to monitor primary health care. we used a multistage cluster random sampling technique to select one district from each of four district clusters, and then select both a family and a primary health centre from each of the four districts. we field-tested and triangulated the indicators using facility data and a population-based household survey. 
findings our data revealed similarities between facility and survey data for some indicators (e.g. low birth weight and pre-check services), but differences for others (e.g. acute diarrhoeal diseases in children younger than years and blood pressure screening). we made four critical observations: (i) data are available at the facility level but in varying formats; (ii) established global indicators may not always be useful in local monitoring; (iii) operational definitions must be refined; and (iv) triangulation and feedback from the field is vital. conclusion we observe that, while data can be used to develop indices of progress, interpretation of these indicators requires great care. in the attainment of universal health coverage, we consider that our observations of the utility of certain health indicators will provide valuable insights for practitioners and supervisors in the development of a primary health-care monitoring mechanism. research field-testing of health-care indicators, india devaki nambiar et al. we began with a policy scoping exercise for the state of kerala in . we then created an -indicator longlist from existing primary health-care monitoring inventories, , [ ] [ ] [ ] [ ] and undertook an extensive data source and mapping exercise, adapting a process previously conducted in the region. we applied a modified delphi process in two rounds, consulting key health system stakeholders of the state (frontline health workers, primary care doctors, public health experts and policymakers), and obtained a shortlist of indicators (available in the data repository). we then field-tested and triangulated some of the indicators using facility-based data (phases and ) and a population-based household survey (phase ). in phase (december ) we selected three family health centres in coastal, hilly and tribal districts (trivandrum, idukki and wayanad, respectively) of the state. we communicated the definitions and logic of the indicators to facility staff, and studied their data-recording methods to synergize our processes with theirs. from these initial steps, we prepared a structured data collection template (available in the data repository) that we provided to the three family health centres. based on inputs from phase and a second round of consultations with state-level programme officers, we refined the indicator list. in phase (june-october ), we used a multistage random cluster sampling technique to generate data related to the indicators at the population and facility level. we applied principal component analysis using stata version (statacorp, college station, united states of america) to data from the latest national family health survey ( - ) to categorize districts into one of four clusters according to health burden and systems performance. using an opensource list randomizer from random. org, we randomly selected one district from each of the four clusters, and then randomly selected both a primary and a family health centre from each of the four selected districts. the people served by these eight health facilities were the population of interest in our study. we held on-site meetings with the staff of the eight health facilities and provided them with excel-based templates (microsoft corporation, redmond, united states of america) to input data for the financial year march -april (data repository). data were sourced from manual registers maintained at facilities. 
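the district-selection step described above (principal component analysis on district-level indicators, followed by a random draw within each of four clusters) was carried out in stata with an online list randomizer; the python sketch below only illustrates the same idea. the synthetic indicator matrix, the number of principal components, and the use of k-means to form the four clusters are assumptions for illustration.

```python
# Illustrative sketch of the sampling idea described above: reduce district-level
# health indicators with PCA, group districts into four clusters, then randomly
# pick one district per cluster. The authors used Stata and an online list
# randomizer; this is not their code, and the indicator matrix is a placeholder.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
districts = [f"district_{i}" for i in range(14)]        # Kerala has 14 districts
indicators = rng.normal(size=(len(districts), 10))      # placeholder survey indicators

scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(indicators))
clusters = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scores)

selected = []
for c in range(4):
    members = [d for d, lab in zip(districts, clusters) if lab == c]
    selected.append(rng.choice(members))                # one district per cluster
print("selected districts:", selected)
```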
in addition to off-site coordination, we also provided data-entry on-site support to the health staff, visiting each facility at least four times between may and december . we compiled data from the facilities to obtain annual estimates for all health outcome indicators using excel. our sample size estimation was based on the proportion of men and women eligible for blood pressure screening under the national primary care noncommunicable disease programme, that is, those aged years or older. we estimated a sample size using routine data reported by the noncommunicable disease division of the kerala health and family welfare department ( - ), aiming at a precision of % at a % confidence interval (ci), with a conservative design effect of (i.e. a doubling of the sample). health facility catchment areas were grouped by wards, also referred to as primary sampling units. eligible households within a primary sampling unit had at least one member aged years or older. individual written informed consent was sought from each participant before administration of the survey. we employed and trained staff to collect data using hand-held electronic tablets with a bilingual (english and malayalam) survey application. the survey, conducted during june-october , included questions on sociodemographic parameters, health outcome indicators (e.g. noncommunicable disease risk behaviours and screening; awareness of components of aardram and family health centre reform) and financial risk protection (e.g. out-ofpocket expenditure). national family health survey (round iv) state level weights were applied during analysis. we compared data on selected indicators using stata and excel. since our focus was on how indicators were being understood and reported across facilities, we did not expect indicators to directly correspond between facilities and households, but only to approximate each other. all components of the study were approved by the institutional ethics committee of the george institute for global health (project numbers / and / ). we obtained data from health facilities in total (seven family health centres and four primary health centres) during phases and . during phase , we acquired facility data on indicators from eight health facilities (four family, four primary) jointly serving a population of ( table ). the household survey was undertaken in the catchment areas of these facilities, and we acquired data from a representative sample of individuals in households (table ) . we observed both variations between and uniformity in the indicators from health facilities and the household survey (table ). in studying these patterns, we made four key observations (box ). first, the method of reporting our indicators varied between facilities, even although all raw data required to calculate selected indicators were present in manual registers. in the case of indicators related to national programmes (e.g. reproductive, child health and tuberculosis-related indicators), data were uploaded directly to national digital portals without any analysis at the facility level; officers responsible for data compilation and analysis exist only at the district level. feedback from facility staff included requests for adequate training on new or revised reporting systems, and clarification of their role. this situation may improve with the complete digitization of health records under kerala's e-health programme. 
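the sample size estimation described above follows the standard proportion-based formula inflated by a design effect (the text notes the design effect corresponds to a doubling of the sample). the sketch below shows that calculation; the proportion and precision values are placeholders, not the study's actual parameters.

```python
# Standard cluster-survey sample size calculation of the kind described above:
# n = deff * z^2 * p * (1 - p) / d^2. The values of p and the precision d below
# are illustrative placeholders; the design effect of 2 matches the "doubling"
# noted in the text.
from math import ceil
from statistics import NormalDist

def cluster_sample_size(p, precision, confidence=0.95, design_effect=2.0):
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # e.g. 1.96 for a 95% CI
    n_srs = (z ** 2) * p * (1 - p) / (precision ** 2)    # simple random sampling
    return ceil(design_effect * n_srs)                   # inflated for clustering

# Example with assumed inputs: 30% screened, +/- 5% precision, 95% CI.
print(cluster_sample_size(p=0.30, precision=0.05))
```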
our second observation is that there exist two problems with the globally recommended indicators: (i) manual routine data reporting at the facility level may be inadequate to construct the global indicator precisely; and (ii) globally relevant data may not be considered relevant to the periodicity (monthly) or level (facility) of review. from the facility-level data, the cover- field-testing of health-care indicators, india devaki nambiar et al. age of antenatal care reported by family health centres was . % ( / ); in household surveys, full coverage of antenatal care was observed for . % ( / ) of eligible women (table ) . here, antenatal care refers to women aged - years having a live birth in the past year and receiving four or more antenatal check-ups, at least one tetanus toxoid injection, and iron and folic acid tablets or syrup for at least days as numerator. the coverage rate is calculated from a denominator of the total number of women aged - years who had a live birth in the past year, which requires retrospective verification of antenatal coverage. however, in some facilities, the antenatal care coverage indicator was calculated using the previous year's number of deliveries plus % as the denominator, and the number of pregnant women who had received antenatal care as the numerator. it was therefore not always clear that the data from any particular individual were included in both the numerator and denominator and, with a target as the denominator, coverage could surpass %. practitioners noted the disconnect between monthly target-based reporting and annual retrospective measurement. our third observation is that definitions and reporting that reflect actual health-provision patterns require to be standardized; otherwise, discrepancies will be observed between data sets. for example, the indicator for acute diarrhoeal diseases among children younger than years was . % ( / ) according to facility records; however, a prevalence of more than times this percentage ( . %; / ; % ci: . - . ) was reported in the household survey (table ) . several chronic care indicators, newly introduced as part of the introduction of family health centres, also showed discrepancies. for instance, the percentage of people screened for blood pressure and blood glucose was . % ( / ; % ci: . - . ) and . % ( / ; % our fourth observation is that such triangulation exercises, as well as obtaining feedback from health workers, programme managers and administrators, are vital for accurate assessment of uhc coverage. a major problem reported by staff and officials is that health facility data are usually just a tally of patient visits, which is simple to produce, as opposed to the actual number of (potentially repeat) patients receiving care or services. state officials have been encouraging a move towards electronic health records to generate more precise indicators, but adoption and integration of these will only be possible when the technology itself is better aligned to facility-level process flows, requiring user inputs, investment and time. other issues raised include: the need for appropriate staff (including temporary contractual staff) training in programme guidelines and reporting requirements; the need for clarity in definitions of treatment (e.g. 
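the denominator problem described above is easy to see with a toy calculation: the same counts yield very different 'coverage' depending on whether the denominator is the actual retrospective cohort or a target based on the previous year's deliveries. all numbers below are invented for illustration, including the assumed uplift applied to last year's deliveries.

```python
# Toy illustration (made-up numbers) of the two denominator choices discussed
# above: a retrospective cohort denominator versus a target-based denominator.
women_fully_covered = 95        # women with a live birth who received full ANC
women_with_live_birth = 120     # retrospective denominator (actual cohort)
anc_recipients_this_year = 95   # monthly tally of pregnant women receiving ANC
target = 80 * 1.05              # last year's deliveries plus an assumed 5% uplift

retrospective = 100 * women_fully_covered / women_with_live_birth
target_based = 100 * anc_recipients_this_year / target
print(f"retrospective coverage: {retrospective:.1f}%")
print(f"target-based coverage : {target_based:.1f}%")  # can exceed 100%
```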
chronic disease patients may be advised to modify lifestyle factors, which would be missed if treatment monitoring included only those prescribed medication); and the availability of free or subsidised tests relevant to disease control that are reflected in monitoring indicators, particularly for chronic care (e.g. glycated haemoglobin tests for diabetes care ) at the primary health centre level. as already observed in india and other low-and middle-income countries, our results indicate that any approach to improving or monitoring the quality of health-care must be adaptable to local methods of data production and reporting, while ensuring that emerging concerns of local staff are considered. although validity checks are a staple of epidemiological and public health research, such triangulation processes in health systems are infrequent. the every newborn-birth study was a triangulation of maternal and newborn healthcare data in low-and middle-income countries, and some smaller-scale primary-care indicator triangulation exercises have been undertaken by india's national health systems resource centre. , while there exists a variety of approaches to monitoring primary health-care reforms, we consider the most appropriate to be the generation (and modification, if necessary) of indicators from routine data, and their triangulation with household survey data. increasingly, routine data are being digitized to improve accessibility and interpretation, as is the case in kerala. useful considerations when introducing digital health interventions in low-and middle-income countries are intrinsic programme characteristics, human factors, technical factors, the health-care ecosystem and the broader extrinsic ecosystem. our observations demonstrate the continuous and complex interplay between these characteristics; the real value of selected indicators may also be determined by how staff understand and interpret them. our study had several limitations. our indicator selection using the delphi method could have undergone additional rounds, but we considered it more important to get the monitoring process underway and reduce the burden on health workers. some facility-based information could not be acquired due to the additional health department burden of flood relief and nipah outbreak management in the state. our household survey sample was the population aged years and older, resulting in undersampling for other indicators being fieldtested (e.g. newborn low birth weight). an increase in sample size could allow a more precise estimation of all indicators. finally, the reference periods for the facility data and the household survey did not directly overlap; a timed sampling should be undertaken in the future to improve the precision of triangulation. observing the utility of indicators in practice is a key first step in the move towards uhc, requiring investment and commitment. using indicators, standards and other forms of technology, which are easy to adopt, can be problematic because we amplify certain aspects of the world while reducing others. our examination of family health centre reforms cautions that, while data can be used to develop indices of progress, interpretation of these indicators requires great care precisely because of the way they are related to powerful decisions around what constitutes success or failure, who will receive recognition or admonition and, ultimately, the legacy of aardram reforms. 
we anticipate that our observations will contribute to health-care reforms in low- and middle-income countries, such as the use of field triangulation to enhance the accountability and relevance of global health metrics. if such activities are carried out in constructive partnerships with state stakeholders and do not introduce unfeasible costs to the system, they may contribute to a sustained and reflexive monitoring process along the path to uhc. ■
observation : data are available at the facility level, but in varying formats and platforms meant for different purposes; digitization may improve this situation.
observation : established global indicators may not be useful or interpreted as intended in a local context, and may need to be adapted.
observation : operational definitions, thresholds for interpretation and processes of routine data collection must be refined for older indicators and developed for newly introduced indicators.
observation : triangulation and feedback from the field level, with qualitative input from local actors, remain vital, particularly for chronic diseases.
objective: to develop a primary health-care monitoring framework and a list of health outcome indicators, and to field-test and triangulate the indicators intended to evaluate the health-care reforms in the state of kerala, india, in - .
methods: the authors used a modified delphi method to develop a list of indicators, consisting of items, with which primary health care was monitored. multistage cluster random sampling was used to select one district from each of four district clusters and then, in the same way, a family and a primary health-care centre in each of the four districts. the indicators were field-tested and triangulated using facility data and a population-based household survey.
findings: the data revealed similarities between facility data and survey data for some indicators (for example, low birth weight and antenatal check-up services), but differences for others (for example, acute diarrhoeal diseases in children younger than years and blood pressure screening). the authors made four critical observations: (i) data are available at the facility level, but in varying formats; (ii) established global indicators cannot always be used for local monitoring; (iii) operational definitions need refinement; and (iv) triangulation and feedback from the field are vital.
conclusion: the observations suggest that, although data can be used to develop indices of progress, the interpretation of these indicators requires great care. on the path to universal health coverage, the authors consider that their observations on the utility of selected health indicators will provide valuable information to practitioners and managers when developing a primary health-care monitoring mechanism.
we thank the department of health and family welfare, government of kerala, the state health systems resource centre, kerala and the aardram task force.
key: cord- -jrl fowa authors: abry, patrice; pustelnik, nelly; roux, stéphane; jensen, pablo; flandrin, patrick; gribonval, rémi; lucas, charles-gérard; guichard, Éric; borgnat, pierre; garnier, nicolas title: spatial and temporal regularization to estimate covid- reproduction number r(t): promoting piecewise smoothness via convex optimization date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: jrl fowa among the different indicators that quantify the spread of an epidemic such as the on-going covid- , the reproduction number, which measures how many people can be contaminated by an infected person, stands first. in order to permit the monitoring of the evolution of this number, a new estimation procedure is proposed here, assuming a well-accepted model for current incidence data based on past observations. the novelty of the proposed approach is twofold: (i) the estimation of the reproduction number is achieved by convex optimization within a proximal-based inverse problem formulation, with constraints aimed at promoting piecewise smoothness; (ii) the approach is developed in a multivariate setting, allowing for the simultaneous handling of multiple time series attached to different geographical regions, together with a spatial (graph-based) regularization of their evolutions in time. the effectiveness of the approach is first supported by simulations, and two main applications to real covid- data are then discussed. the first one refers to the comparative evolution of the reproduction number for a number of countries, while the second one focuses on french départements and their joint analysis, leading to dynamic maps revealing the temporal co-evolution of their reproduction numbers. the ongoing covid- pandemic has produced an unprecedented health and economic crisis, urging the development of adapted actions aimed at monitoring the spread of the new coronavirus. no country remained untouched, emphasizing the need for models and tools to perform quantitative predictions, enabling effective management of patients and optimized allocation of medical resources. for instance, the outbreak of this unprecedented pandemic was characterized by a critical lack of tools able to perform predictions related to the pressure on hospital resources (number of patients, masks, gloves, intensive care unit needs, . . .) [ , ]. as a first step toward such an ambitious goal, the present work focuses on assessing the time evolution of the pandemic. indeed, all countries experienced a propagation mechanism that is basically universal in the onset phase: each infected person infects, on average, more than one other person, leading to an initial exponential growth. the strength of the spread is quantified by the so-called reproduction number, which measures how many people can be contaminated by an infected person. in the early phase where the growth is exponential, this is referred to as r (for covid- , r * [ , ]). as the pandemic develops and because more people get infected, the effective reproduction number evolves, hence becoming a function of time, hereafter labeled r(t). this can indeed end up with the extinction of the pandemic, r(t) → , at the expense though of the contamination of a very large percentage of the total population, and of potentially dramatic consequences.
rather than letting the pandemic develop until the reproduction number would eventually decrease below unity (in which case the spread would cease by itself), an active strategy amounts to take actions so as to limit contacts between individuals. this path has been followed by several countries which adopted effective lockdown policies, with the consequence that the reproduction number decreased significantly and rapidly, further remaining below unity as long as social distancing measures were enforced (see for example [ , ] ). however, when lifting the lockdown is at stake, the situation may change with an expected increase in the number of inter-individual contacts, and monitoring in real time the evolution of the instantaneous reproduction number r(t) becomes of the utmost importance: this is the core of the present work. monitoring and estimating r(t) raises however a series of issues related to pandemic data modeling, to parameter estimation techniques and to data availability. concerning the mathematical modeling of infectious diseases, the most celebrated approaches refer to compartmental models such as sir ("susceptible-infectious-recovered"), with variants such as seir ("susceptible-exposed-infectious-recovered"). because such global models do not account well for spatial heterogeneity, clustering of human contact patterns, variability in typical number of contacts (cf. [ ] ), further refinements were proposed [ ] . in such frameworks, the effective reproduction number at time t can be inferred from a fit of the model to the data that leads to an estimated knowledge of the average of infecting contacts per unit time, of the mean infectious period, and of the fraction of the population that is still susceptible. these are powerful approaches that are descriptive and potentially predictive, yet at the expense of being fully parametric and thus requiring the use of dedicated and robust estimation procedures. parameter estimation become all the more involved when the number of parameters grows and/or when the amount and quality of available data are low, as is the case for the covid- pandemic real-time and in emergency monitoring. rather than resorting to fully parametric models and seeing r(t) as the by-product of their identification, a more phenomenological, semi-parametric approach can be followed [ ] [ ] [ ] . this approach has been reported as robust and potentially leading to relevant estimates of r(t), even for epidemic spreading on realistic contact networks, where it is not possible to define a steady exponential growth phase and a basic reproduction number [ ] . the underlying idea is to model incidence data z(t) at time t as resulting from a poisson distribution with a time evolving parameter adjusted to account for the data evolution, which depends on a function f(s) standing for the distribution of the serial interval. this function models the time between the onset of symptoms in a primary case and the onset of symptoms in secondary cases, or equivalently the probability that a person confirmed infected today was actually infected s days earlier by another infected person. the serial interval function is thus an important ingredient of the model, accounting for the biological mechanisms in the epidemic evolution. assuming the distribution f to be known, the whole challenge in the actual use of the semi-parametric poisson-based model thus consists in devising estimatesrðtÞ of r(t) with satisfactory statistical performance. 
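as an illustration of this semi-parametric model, the following minimal python sketch simulates daily incidence counts from a poisson renewal model driven by a prescribed r(t) and a discretized gamma serial interval. it is not the authors' code, and the parameter values (serial-interval mean and standard deviation, the piecewise-linear r(t) profile, the seeding count) are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)

def gamma_serial_interval(mean_days, std_days, smax=25):
    """Discretized gamma serial-interval weights f(1), ..., f(smax), normalized to sum to 1."""
    shape = (mean_days / std_days) ** 2
    scale = std_days ** 2 / mean_days
    s = np.arange(1, smax + 1)
    f = gamma.cdf(s, a=shape, scale=scale) - gamma.cdf(s - 1, a=shape, scale=scale)
    return f / f.sum()

def simulate_incidence(R, f, z0=10):
    """Draw z(t) ~ Poisson(R(t) * sum_s f(s) z(t-s)), seeded with z(0) = z0."""
    T = len(R)
    z = np.zeros(T)
    z[0] = z0
    for t in range(1, T):
        past = z[max(0, t - len(f)):t][::-1]          # z(t-1), z(t-2), ...
        intensity = R[t] * np.dot(f[:len(past)], past)
        z[t] = rng.poisson(max(intensity, 1e-12))
    return z

# Illustrative piecewise-linear R(t): outbreak plateau, lockdown decrease, post-lockdown rebound.
R = np.concatenate([np.full(30, 2.5), np.linspace(2.5, 0.8, 40), np.linspace(0.8, 1.3, 20)])
f = gamma_serial_interval(mean_days=6.6, std_days=3.5)   # assumed serial-interval parameters
z = simulate_incidence(R, f)
```

this mirrors the synthetic-data setting used later for validation (a prescribed piecewise-linear r(t) and poisson draws), up to the added outliers; recovering r(t) from such simulated or observed counts is precisely the estimation problem discussed next.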
this has been classically addressed by approaches aimed at maximizing the likelihood attached to the model. this can be achieved, e.g., within several variants of bayesian frameworks [ , , , ] , with even dedicated software packages (cf. e.g., https://shiny.dide.imperial. ac.uk/epiestim/). instead, we promote here an alternative approach based on inverse problem formulations and proximal-operator based nonsmooth convex optimisation [ ] [ ] [ ] [ ] [ ] . the questions of modeling and estimation, be they fully parametric or semi-parametric, are intimately intertwined with that of data availability. this will be further discussed but one can however remark at this point that many options are open, with a conditioning of the results to the choices that are made. there is first the nature of the incidence data used in the analysis (reported infected cases, hospitalizations, deaths) and the database they are extracted from. next, there is the granularity of the data (whole country, regions, smaller units) and the specificities that can be attached to a specific choice as well as the comparisons that can be envisioned. in this respect, it is worth remarking that most analyses reported in the literature are based on (possibly multiple) univariate time series, whereas genuinely multivariate analyses (e.g., a joint analysis of the same type of data in different countries in order to compare health policies) might prove more informative. for that category of research work motivated by contributing in emergency to the societal stake of monitoring the pandemic evolution in real-time, or at least, on a daily basis, there are two classes of challenges: ensuring a robust and regular access to relevant data; rapidly developing analysis/estimation tools that are theoretically sound, practically usable on data actually available, and that may contribute to improving current monitoring strategies. in that spirit, the overarching goal of the present work is twofold: ( ) proposing a new, more versatile framework for the estimation of r(t) within the semi-parametric model of [ , ] , reformulating its estimation as an inverse problem whose functional is minimized by using non smooth proximal-based convex optimization; ( ) inserting this approach in an extended multivariate framework, with applications to various complementary datasets corresponding to different geographical regions. the paper is organized as follows. it first discusses data, as collected from different databases, with heterogeneity and uneven quality calling for some preprocessing that is detailed. in the present work, incidence data (thereafter labelled z(t)) refers to the number of daily new infections, either as reported in databases, or as recomputed from other available data such as hospitalization counts. based on a semi-parametric model for r(t), it is then discussed how its estimation can be phrased within a non smooth proximal-based convex optimization framework, intentionally designed to enforce piecewise linearity in the estimation of r(t) via temporal regularization, as well as piecewise constancy in spatial variations of r(t) by graph-based regularization. the effectiveness of these estimation tools is first illustrated on synthetic data, constructed from different models and simulating several scenarii, before being applied to several real pandemic datasets. first, the number of daily new infections for many different countries across the world are analyzed independently. 
second, focusing on france only, the number of daily new infections per continental france département (départements constitute the usual entities organizing administrative life in france) is analyzed both independently and in a multivariate setting, illustrating the benefit of this latter formulation. discussions, perspectives and potential improvements are finally presented. datasets. in the present study, three sources of data were systematically used: • source (jhu): johns hopkins university provides access to the cumulated daily reports of the number of infected, deceased and recovered persons, on a per-country basis, for a large number of countries worldwide, essentially since the inception of the covid- crisis (january st, ). • source (ecdpc): the european centre for disease prevention and control provides cumulated daily counts of confirmed cases, also on a per-country basis. • source (spf): santé publique france provides hospital-based counts for france (from which, e.g., daily new hospitalizations can be obtained), available for each continental france département. time series. the data available on the different data repositories used here are strongly affected by outliers, which may stem from inaccuracy or misreporting in per-country reporting procedures, or from changes in the way counts are collected, aggregated, and reported. in the present work, it was chosen to preprocess data for outlier removal by applying to the raw time series a nonlinear filtering, consisting of a sliding median over a -day window: outliers, defined as points deviating by more than ± . standard deviations, are replaced by the window median, to yield the pre-processed time series z(t), from which the reproduction number r(t) is estimated. an example of raw and pre-processed time series is illustrated in fig . when countries are studied independently, the estimation procedure is applied separately to each time series z(t) of size t, the number of days available for analysis. when considering continental france départements, we are given d time series z_d(t) of size t each, where d = 1, . . ., d indexes the départements. these time series are collected and stacked in a matrix of size d × t, and they are analyzed both independently and jointly. model. although they can be used for envisioning the impact of possible scenarios in the future development of an on-going epidemic [ ], sir models, because they require the full estimation of numerous parameters, are often used a posteriori (e.g., long after the epidemic) with consolidated and accurate datasets. during the spread phase, and in order to account for the on-line/on-the-fly need to monitor the pandemic and to offer some robustness to partial/incomplete/noisy data, less detailed semi-parametric models, focusing only on the estimation of the time-dependent reproduction number, can be preferred [ , , ]. let r(t) denote the instantaneous reproduction number to be estimated and z(t) the number of daily new infections. it has been proposed in [ , ] that {z(t), t = , . . ., t} can be modeled as a nonstationary time series consisting of a collection of random variables, each drawn from a poisson distribution whose parameter p(t) depends on the past observations of z(t), on the current value of r(t), and on the serial interval function f(·): p(t) = r(t) Σ_{s ≥ 1} f(s) z(t − s). the serial interval function f(·) constitutes a key ingredient of the model, whose importance and role in pandemic evolution have been mentioned in the introduction. it is assumed to be independent of calendar time (i.e., constant across the epidemic outbreak) and, importantly, independent of r(t), whose role is to account for the time dependencies in pandemic propagation mechanisms. for the covid- pandemic, several studies have empirically estimated the serial interval function f(·) [ , ]. for convenience, f(·) has been modeled as a gamma distribution, with shape and rate parameters . and .
, respectively (corresponding to mean and standard deviations of . and . days, see [ ] and references therein). these choices and assumptions have been followed and used here, and the corresponding function is illustrated in fig . in essence, the model in eq ( ) is univariate (only one time series is modeled at a time) and based on a poisson marginal distribution. it is also nonstationary, as the poisson rate evolves along time. the key ingredient of this model is that the poisson rate evolves as a weighted moving average of past observations, which is qualitatively based on the following rationale: the ratio z(t) / Σ_{s ≥ 1} f(s) z(t − s) compares the current count to a weighted average of past counts; when it is above 1, the epidemic is growing and, conversely, when it is below 1, the epidemic decreases and eventually vanishes. non-smooth convex optimisation. the whole challenge in the actual use of the semi-parametric poisson-based model described above thus consists in devising estimates of r(t) that have better statistical performance (more robust, reliable, and hence usable) than the direct brute-force and naive estimate defined by this ratio. to estimate r(t), and instead of using bayesian frameworks that are considered state-of-the-art tools for epidemic evolution analysis, we propose and promote here an alternative approach based on an inverse problem formulation. its main principle is to assume some form of temporal regularity in the evolution of r(t) (we use a piecewise linear model in the following). in the case of a joint estimation of r(t) across several continental france départements, we further assume some form of spatial regularity, i.e., that the values of r(t) for neighboring départements are similar. univariate setting. for a single country, or a single département, the observed (possibly preprocessed) data {z(t), t = , . . ., t} are represented by a t-dimensional vector z ∈ ℝ^t. recalling that the poisson law is p(z = n | p) = (p^n / n!) e^{−p} for each integer n ≥ 0, the negative log-likelihood of observing z given a vector p ∈ ℝ^t of poisson parameters p(t) is − log p(z | p) = Σ_t [ p(t) − z(t) log p(t) + log(z(t)!) ], where, as made explicit below, p is parametrized by the (unknown) vector r ∈ ℝ^t of values of r(t). up to an additive term independent of p, this is equal to the kl divergence (cf. section . . in [ ]): d_kl(z | p) = Σ_t [ z(t) log(z(t)/p(t)) − z(t) + p(t) ]. given the vector of observed values z, the serial interval function f(·), and the number of days t, the vector p reads p = r ⊙ Φz, with ⊙ the entrywise product and Φ ∈ ℝ^{t×t} the matrix with entries Φ_ij = f(i − j). maximum likelihood estimation of r (i.e., minimization of the negative log-likelihood) leads to an optimization problem min_r d_kl(z | r ⊙ Φz), which does not ensure any regularity of r(t). to ensure temporal regularity, we propose a penalized approach, r̂ = argmin_r d_kl(z | r ⊙ Φz) + o(r), where o denotes a penalty function. here we wish to promote a piecewise affine and continuous behavior, which may be accomplished [ , ] using o(r) = λ_time ‖d r‖_1, where d is the matrix associated with a laplacian filter (second-order discrete temporal derivatives), ‖·‖_1 denotes the ℓ1-norm (i.e., the sum of the absolute values of all entries), and λ_time is a penalty factor to be tuned. this leads to the following optimization problem: r̂ = argmin_r d_kl(z | r ⊙ Φz) + λ_time ‖d r‖_1. ( ) spatially regularized setting. in the case of multiple départements, we consider multiple vectors (z_d ∈ ℝ^t, 1 ≤ d ≤ d) associated with the d time series, and multiple vectors of unknowns (r_d ∈ ℝ^t, 1 ≤ d ≤ d), which can be gathered into matrices: a data matrix z ∈ ℝ^{t×d} whose columns are the z_d, and a matrix of unknowns r ∈ ℝ^{t×d} whose columns are the quantities to be estimated, r_d.
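before turning to the spatial coupling, the univariate ingredients just defined can be sketched in a few lines of python (an illustration, not the authors' matlab implementation; the serial-interval weights f and the incidence z are assumed available, e.g., from the simulation sketch above):

```python
import numpy as np

def phi_z(z, f):
    """(Phi z)(t) = sum_{s>=1} f(s) z(t-s), i.e. the weighted moving average of past counts."""
    T = len(z)
    out = np.zeros(T)
    for t in range(T):
        past = z[max(0, t - len(f)):t][::-1]
        out[t] = np.dot(f[:len(past)], past)
    return out

def naive_estimate(z, f, eps=1e-12):
    """Brute-force ratio estimate R(t) = z(t) / (Phi z)(t)."""
    return np.asarray(z, float) / np.maximum(phi_z(z, f), eps)

def kl_divergence(z, p, eps=1e-12):
    """Generalized KL divergence D_KL(z | p) = sum_t [ z log(z/p) - z + p ], with 0 log 0 := 0."""
    z = np.asarray(z, float)
    p = np.maximum(np.asarray(p, float), eps)
    zlogz = np.where(z > 0, z * np.log(np.maximum(z, eps) / p), 0.0)
    return float(np.sum(zlogz - z + p))

def univariate_objective(r, z, f, lam_time):
    """D_KL(z | r * Phi z) + lam_time * ||D r||_1, with D the second-order difference operator."""
    d2r = np.diff(r, n=2)                              # discrete Laplacian (second-order differences)
    return kl_divergence(z, r * phi_z(z, f)) + lam_time * float(np.sum(np.abs(d2r)))

# Example: evaluate the objective at the naive estimate (hypothetical penalty weight).
# cost = univariate_objective(naive_estimate(z, f), z, f, lam_time=50.0)
```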
a first possibility is to proceed to independent estimations of the (r_d ∈ ℝ^t, 1 ≤ d ≤ d) by addressing the separate optimization problems r̂_d = argmin_{r_d} d_kl(z_d | r_d ⊙ Φz_d) + λ_time ‖d r_d‖_1, which can be equivalently rewritten in matrix form as r̂ = argmin_r d_kl(z | r ⊙ Φz) + λ_time ‖d r‖_1, ( ) where ‖d r‖_1 is the entrywise ℓ1-norm of d r, i.e., the sum of the absolute values of all its entries. an alternative is to estimate the (r_d ∈ ℝ^t, 1 ≤ d ≤ d) jointly, using a penalty function promoting spatial regularity. to account for spatial regularity, we use a spatial analogue of d promoting spatially piecewise-constant solutions. the d continental france départements can be considered as the vertices of a graph, where edges are present between adjacent départements. from the adjacency matrix a ∈ ℝ^{d×d} of this graph (a_ij = 1 if there is an edge e = (i, j) in the graph, a_ij = 0 otherwise), the global variation of the function on the graph can be computed as Σ_ij a_ij (r_ti − r_tj)², and it is known that this can be accessed through the so-called (combinatorial) laplacian of the graph [ ]. however, in order to promote smoothness over the graph while keeping some sparse discontinuities on some edges, it is preferable to regularize using a total variation on the graph, which amounts to taking the ℓ1-norm of the gradients (r_ti − r_tj) on all existing edges. for that, let us introduce the incidence matrix b ∈ ℝ^{e×d} such that l = bᵀb, where e is the number of edges and, on each row representing an existing edge e = (i, j), we set b_{e,i} = 1 and b_{e,j} = −1. then, the ℓ1-norm ‖r bᵀ‖_1 = ‖b rᵀ‖_1 is equal to Σ_{t=1}^{t} Σ_{(i,j): a_ij = 1} |r_ti − r_tj|. alternatively, it can be computed as ‖r bᵀ‖_1 = Σ_{t=1}^{t} ‖b r(t)‖_1, where r(t) ∈ ℝ^d is the t-th row of r, which gathers the values across all départements at a given time t. from that, we can define the regularized optimization problem: r̂ = argmin_r d_kl(z | r ⊙ Φz) + λ_time ‖d r‖_1 + λ_space ‖r bᵀ‖_1. ( ) optimization problems ( ) and ( ) involve convex, lower semi-continuous, proper and non-negative functions, hence their set of minimizers is non-empty and convex [ ]. we discuss right after how to compute these minimizers using proximal algorithms. by the known sparsity-promoting properties of ℓ1 regularizers and their variants, the corresponding solutions are such that d r and/or r bᵀ are sparse matrices, in the sense that these matrices of (second-order temporal or first-order spatial) derivatives have many zero entries. the higher the penalty factors λ_time and λ_space, the more zeroes in these matrices. in particular, when λ_space = 0, no spatial regularization is performed and ( ) is equivalent to ( ). when λ_space is large enough, r bᵀ is exactly zero, which implies that r(t) is constant across départements at each time, since the graph of départements is connected. optimization using a proximal algorithm. the considered optimization problems are of the form min_r f(r) + Σ_{m=1}^{m} g_m(k_m r), where f and the g_m are proper, lower semi-continuous, convex functions and the k_m are bounded linear operators. the classical case m = 1 is typically addressed with the chambolle-pock algorithm [ ], which has been recently adapted to multiple regularization terms, as in eq. of [ ]. to handle the lack of smoothness or lipschitz differentiability of the considered functions f and g_m, these approaches rely on their proximity operators. we recall that the proximity operator of a convex, lower semi-continuous function φ is defined as [ ] prox_φ(y) = argmin_x ½ ‖y − x‖² + φ(x). in our case, we consider a separable data fidelity term, f(r) = d_kl(z | r ⊙ Φz), which is a sum of terms each involving a single entry of its input; its associated proximity operator can therefore be computed component by component [ ], with a closed form involving the proximal step size τ > 0.
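the following sketch (again an illustration under stated assumptions, not the published code) builds the graph incidence matrix from an adjacency matrix, evaluates the spatio-temporal objective of the joint problem, and gives two proximity operators used by such primal-dual schemes: soft-thresholding for the ℓ1 penalties and one standard closed-form componentwise prox for the poisson/kl data term, written here with respect to the poisson intensity p (the change of variables p = r ⊙ Φz being itself entrywise). the helpers kl_divergence and phi_z are those of the previous sketch.

```python
import numpy as np

def incidence_matrix(A):
    """Edge-vertex incidence matrix B (E x D), with B[e, i] = 1 and B[e, j] = -1 for each edge (i, j),
    so that the combinatorial Laplacian is L = B.T @ B."""
    D = A.shape[0]
    edges = [(i, j) for i in range(D) for j in range(i + 1, D) if A[i, j]]
    B = np.zeros((len(edges), D))
    for e, (i, j) in enumerate(edges):
        B[e, i], B[e, j] = 1.0, -1.0
    return B

def graph_tv(R, B):
    """Total variation on the graph: ||R B^T||_1 = sum_t sum_{edges (i,j)} |R[t, i] - R[t, j]|."""
    return float(np.sum(np.abs(R @ B.T)))

def joint_objective(R, Z, f, B, lam_time, lam_space):
    """sum_d D_KL(z_d | r_d * Phi z_d) + lam_time ||D R||_1 + lam_space ||R B^T||_1, with R, Z of size T x D."""
    data = sum(kl_divergence(Z[:, d], R[:, d] * phi_z(Z[:, d], f)) for d in range(Z.shape[1]))
    temporal = float(np.sum(np.abs(np.diff(R, n=2, axis=0))))
    return data + lam_time * temporal + lam_space * graph_tv(R, B)

def soft_threshold(y, tau):
    """prox of tau * ||.||_1, applied entrywise: sign(y) * max(|y| - tau, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - tau, 0.0)

def prox_kl_poisson(y, z, tau):
    """Componentwise prox of tau * (p - z log p) (Poisson/KL data term as a function of the intensity p):
    the positive root of p^2 + (tau - y) p - tau z = 0."""
    return 0.5 * ((y - tau) + np.sqrt((y - tau) ** 2 + 4.0 * tau * np.asarray(z, float)))
```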
we further consider g_m(·) = ‖·‖_1 for m = 1, 2, with k_1(r) ≔ λ_time d r and k_2(r) ≔ λ_space r bᵀ. the proximity operators associated with these g_m are entrywise soft-thresholdings, prox_{τ‖·‖_1}(y) = sign(y) (|y| − τ)_+, where (·)_+ = max(0, ·). in algorithm , we express explicitly the primal-dual algorithm of [ ] for our setting, using the moreau identity, which relates the proximity operator of a function to the proximity operator of its conjugate (cf. eq. ( ) of [ ]). the choice of the parameters τ and σ_m impacts the convergence guarantees. in this work, we adapt a standard choice provided by [ ] to this extended framework. the adjoint of k_m, denoted k_m^*, is given by k_1^*(y) = λ_time dᵀ y and k_2^*(y) = λ_space y b. the sequence (r^{(k+1)})_{k ∈ ℕ} converges to a minimizer of ( ) (cf. thm . of [ ]). the inputs of the algorithm are the data z, a tolerance ε > 0, and step sizes τ and σ_m set from the operator norms ‖k_m‖. to assess the relevance and performance of the proposed estimation procedure detailed above, it is first applied to two different synthetic time series z(t). the first one is synthesized directly using the model in eq ( ), with the same serial interval function f(t) as that used for the estimation, and using an a priori prescribed function r(t). the second one is produced by solving a compartmental (sir-type) model. for such models, r(t) can be theoretically related to the time-scale parameters entering their definition, as the ratio between the infection time scale and the quitting-infection (be it by death or recovery) time scale [ , ]. the theoretical serial interval function f associated with that model and its parameters is computed analytically (cf., e.g., [ ]) and used in the estimation procedure. for both cases, the same a priori prescribed function r(t), to be estimated, is chosen as constant (r = . ) over the first days to model the epidemic outbreak, followed by a linear decrease (till below 1) over the next days to model lockdown benefits, with finally an abrupt linear increase over the last days, modeling a possible outbreak when the lockdown is lifted. additive gaussian noise is superimposed on the data produced by the models to account for outliers and misreporting. for both cases, the proposed estimation procedure (with λ_time set to the same values as those used to analyze real data in the next section) outperforms the naive estimates, which turn out to be very irregular (cf. fig ). the proposed estimates capture well the three different phases of r(t) (stable, decreasing and increasing), with notably a rapid and accurate reaction to the increasing change in the last days. the present section aims to apply the model and estimation tools proposed above to actual covid- data. first, specific methodological issues are addressed, related to tuning the hyperparameter(s) λ_time or (λ_time, λ_space) in univariate and multivariate settings, and to comparing the consistency between different estimates of r(t) obtained from the same incidence data, yet downloaded from different repositories. then, the estimation tools are applied to the estimation of r(t), both independently for numerous countries and jointly for the continental france départements. estimation of r(t) is performed daily, with t thus increasing every day, and updated results are uploaded on a regular basis to a dedicated webpage (cf. http://perso.ens-lyon.fr/patrice.abry). regularization hyperparameter tuning.
a critical issue associated with the practical use of the estimates based on the optimization problems ( ) and ( ) lies in the tuning of the hyperparameters balancing data fidelity terms and penalization terms. while automated and data-driven procedures can be devised, following works such as [ ] and references therein, let us analyze the forms of the functional to be minimized, so as to compute relevant orders of magnitude for these hyperparameters. let us start with the univariate estimation ( ). using λ time = implies no regularization and the achieved estimate turns out to be as noisy as the one obtained with a naive estimator (cf. eq ( )). conversely, for large enough λ time , the proposed estimate becomes exactly a constant, missing any time evolution. tuning λ time is thus critical but can become tedious, especially because differences across countries (or across départements in france) are likely to require different choices for λ time . however, a careful analysis of the functional to minimize shows that the data fidelity term ( ), based on a kullback-leibler divergence, scales proportionally to the input incidence data z while the penalization term, based on the regularization of r(t), is independent of the actual values of z. therefore, the same estimate for r(t) is obtained if we replace z with α × z and λ with α × λ. because orders of magnitude of z are different amongst countries (either because of differences in population size, or of pandemic impact), this critical observation leads us to apply the estimate not to the raw data z but to a normalized version z/std(z), alleviating the burden of selecting one λ time per country, instead enabling to select one same λ time for all countries and further permitting to compare the estimated r(t)'s across countries for equivalent levels of regularization. considering now the graph-based spatially-regularized estimates ( ) while keeping fixed λ time , the different r(t) are analyzed independently for each département when λ space = . conversely, choosing a large enough λ space yields exactly identical estimates across départments that are, satisfactorily, very close to what is obtained from data aggregated over france prior to estimation. further, the connectivity graph amongst the continental france départements leads to an adjacency matrix with non-zero off-diagonal entries (set to the value ), associated to as many edges as existing in the graph. therefore, a careful examination of ( ) shows that the spatial and temporal regularizations have equivalent weights when λ time and λ time are chosen such that the use of z/std(z) and of ( ) above gives a relevant first-order guess to the tuning of λ time and of (λ time , λ space ). estimate consistency using different repository sources. when undertaking such work dedicated to on-going events, to daily evolutions, and to a real stake in forecasting future trends, a solid access to reliable data is critical. as previously mentioned, three sources of data are used, each including data for france, which are thus now used to assess the impact of data sources on estimated r(t). source (jhu) and source (ecdpc) provide cumulated numbers of confirmed cases counted at national levels and (in principle) including all reported cases from any source (hospital, death at home or in care homes. . .). 
source (spf) does not report that same number, but a collection of other figures related to hospital counts only, from which a daily number of new hospitalizations can be reconstructed and used as a proxy for daily new infections. the corresponding raw and (sliding-median) preprocessed data, illustrated in fig , show overall comparable shapes and evolutions, yet with clearly visible discrepancies of two kinds. first, source (jhu) and source (ecdpc), consisting of crude reports of numbers of confirmed cases, are prone to outliers. those can result from miscounts, from pointwise incorporations of new figures, such as the progressive inclusion of cases from ehpad (care homes) in france, or from corrections of previous erroneous reports. conversely, data from source (spf), based on hospital reports, suffer from far fewer outliers, yet at the cost of providing only partial figures. second, in france, as in numerous other countries worldwide, the procedure on which confirmed case counts are based changed several times during the pandemic period, possibly yielding some artificial increase in the local average number of daily new confirmed cases. this has notably been the case for france prior to the end of the lockdown period (mid-may), when the number of tests performed increased regularly for about two weeks, and more recently in early june, when the counting procedure was changed again, likely because of the massive use of serology tests. because the estimate of r(t) essentially relies on comparing a daily number against a past moving average, these changes lead to significant biases that cannot be easily accounted for, but that vanish after some duration controlled by the typical width of the serial distribution f (of the order of ten days). confirmed infection cases across the world. to report estimated r(t)'s for different countries, data from source (ecdpc) are used, as they are of better quality than data from source (jhu), and because hospital-based data (as in source (spf)) are not easily available for numerous different countries. visual inspection led us to choose, uniformly for all countries, two values of the temporal regularization parameter: λ_time = to produce a strongly regularized, hence slowly varying, estimate, and λ_time = . for a milder regularization, and hence a more reactive estimate. these estimates being by construction designed to favor piecewise-linear behaviors, local trends can be estimated by computing (robust) estimates β(t) of the derivative of the estimated r(t). the slow and less slow estimates of r(t) thus provide a slow and a less slow estimate of the local trends. intuitively, these local trends can be seen as predictors for the forthcoming value of r: r(t + n) ≃ r(t) + n β(t). let us start by inspecting again the data for france, further comparing estimates stemming from data in source (ecdpc) or in source (spf) (cf. fig ). as discussed earlier, data from source (ecdpc) show far more outliers than data from source (spf), thus impacting the estimation of r and β. as expected, the strongly regularized estimates (λ_time = ) are less sensitive than the less regularized ones (λ_time = . ), yet discrepancies in the estimates are significant: for june th, data from source (ecdpc) yield estimates of r slightly above 1, while those from source (spf) remain steadily around . , with no or mild local trends. again, this might be because, in late may, france started massive serology testing, mostly performed outside hospitals.
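a minimal sketch of how such local trends and short-term extrapolations might be computed from a piecewise-linear estimate is given below; the window length and the use of a median of daily increments are assumptions made for illustration, since the exact robust derivative estimator is not spelled out here.

```python
import numpy as np

def local_trend(r_est, window=7):
    """Local trend beta(t): a robust (median) slope of the last `window` daily increments of the estimate."""
    increments = np.diff(np.asarray(r_est, float)[-(window + 1):])
    return float(np.median(increments))

def short_term_forecast(r_est, n_days, window=7):
    """Extrapolate the piecewise-linear estimate: R(t + n) ~ R(t) + n * beta(t), for n = 1..n_days."""
    beta = local_trend(r_est, window)
    return np.asarray(r_est, float)[-1] + np.arange(1, n_days + 1) * beta
```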
this yielded an abrupt increase in the number of new confirmed cases, biasing upward the estimates of r(t). however, the short-term local trend for june th also goes downward, suggesting that the model is incorporating these irregularities and that the estimates will become unbiased again after an estimation time controlled by the typical width of the serial distribution f (of the order of ten days). this recent increase is not seen in the source (spf)-based estimates, which remain very stable, potentially suggesting that hospital-based data are much less affected by changes in testing policies. this local analysis at the current date can be complemented by a more global view of what has happened since the lifting of the lockdown. considering the whole period starting from may th, we end up with triplets [ th percentile; median; th percentile] that read as given in table . source (ecdpc) provides data for several tens of countries. figs to report r(t) and β(t) for several selected countries. more figures are available at perso.ens-lyon.fr/patrice.abry. as of june th (time of writing), fig shows that, for most european countries, the pandemic seems to remain under control despite the lifting of the lockdown, with (slowly varying) estimates of r remaining stable below 1, ranging from . to . depending on the country, and (slowly varying) trends around 0. sweden and portugal (not shown here) display less favorable patterns, as does, to a lesser extent, the netherlands, raising the question of whether this might be a potential consequence of less stringent lockdown rules compared with neighboring european countries. fig shows that, while r for canada has been clearly below 1 since early may, with a negative local trend, the usa are still bouncing back and forth around 1. south america is still in the phase where r is above 1, but starts to show negative local trends. fig indicates that iran, india and indonesia are in the critical phase, with r(t) > 1. fig shows that data for african countries are difficult to analyze, and that several countries such as egypt or south africa are in pandemic growing phases. phase-space representation. to complement figs to , fig displays a phase-space representation of the time evolution of the pandemic, constructed by plotting against each other the local averages (over a week) of the slowly varying estimated reproduction number and of its local trend, (r̄(t), β̄(t)), for a period ranging from mid-april to june th. country names are written at the end (last day) of the trajectories. interestingly, european countries display a c-shaped trajectory, starting with r > 1 and negative trends (lockdown effects), thus reaching the safe zone (r < 1), but eventually performing a u-turn with a slow increase of local trends until they become positive. this results in a mild but clear re-increase of r, yet with most values below 1 today, except for france (see comments above) and sweden. the usa display a similar c-shape, though almost concentrated on the edge point r(t) = 1, β = 0, while canada does return to the safe zone with a specific pattern. south american countries, obviously at an earlier stage of the pandemic, show an inverted c-shape pattern, with trajectories evolving from the bad top-right corner to the controlling phase (negative local trend, with decreasing r still above 1 though). phase spaces of asian and african countries essentially confirm these c-shaped trajectories.
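as a sketch of how such phase-space trajectories might be constructed from a slowly varying estimate (the weekly averaging window being an assumption), one can pair a trailing weekly mean of r with a weekly mean of its daily increments and plot one against the other:

```python
import numpy as np

def weekly_average(x, w=7):
    """Trailing w-day moving average (output length len(x) - w + 1)."""
    return np.convolve(np.asarray(x, float), np.ones(w) / w, mode="valid")

def phase_space(r_slow, w=7):
    """Return (R_bar(t), beta_bar(t)): weekly-averaged estimate and weekly-averaged daily trend."""
    r = np.asarray(r_slow, float)
    beta = np.diff(r)                       # daily local trend of the slowly varying estimate
    return weekly_average(r[1:], w), weekly_average(beta, w)

# Plotting R_bar against beta_bar (e.g., one curve per country with matplotlib) traces the
# C-shaped trajectories discussed above, with the point (1, 0) separating the four regimes.
```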
envisioning these phase-space plots as pertaining to different stages of the pandemic (rather than to different countries), this suggests that covid- pandemic trajectory resembles a clockwise circle, starting from the bad top right corner (r above and positive trends), evolving, likely by lockdown impact, towards the bottom right corner (r still above but negative trends) and finally to the safe bottom left corner (r below and negative then null trend). the lifting of the lockdown may explain the continuation of the trajectory in the still safe but. . . corner (r below and again positive trend). as of june th, it can be only expected that trajectories will not close the loop and reach back the bad top right corner and the r = limit. continental france départements: regularized joint estimates. there is further interest in focusing the analysis on the potential heterogeneity in the epidemic propagation across a given territory, governed by the same sanitary rules and health care system. this can be achieved by estimating a set of localrðtÞ's for different provinces and regions [ ] . such a study is made possible by the data from source (spf), that provides hospital-based data for each of the continental france départements . fig (right) already reported the slow and fast varying estimates of r and local trends computed from data aggregated over the whole france. to further study the variability across the continental france territory, the graphbased, joint spatial and temporal regularization described in eq is applied to the number of confirmed cases consisting of a matrix of size k × t, with d = continental france départements, and t the number of available daily data (e.g., t = on june th, data being available only after march th). the choice λ time = . leading to fast estimates was used for this joint study. using ( ) as a guideline, empirical analyses led to set λ space = . , thus selecting spatial regularization to weight one-fourth of the temporal regularization. first, fig ( top row) maps and compares for june th (chosen arbitrarily as the day of writing) per-département estimates, obtained when départements are analyzed either independently (r indep using eq , left plot) or jointly (r joint using eq , right plot). while the means of r indep andr joint are of the same order (' . and ' . respectively) the standard deviations drop down from ' . to ' . , thus indicating a significant decrease in the variability across departments. this is further complemented by the visual inspection of the maps which reveals reduced discrepancies across neighboring departments, as induced by the estimation procedure. in a second step, short and long-term trends are automatically extracted fromr indep and r joint and short-term trends are displayed in the bottom row of fig (left and right, respectively) . this evidences again a reduced variability across neighboring departments, though much less than that observed forr indep andr joint , likely suggesting that trends on r per se are more robust quantities to estimate than single r's. for june th, fig also indicates reproduction numbers that are essentially stable everywhere across france, thus confirming the trend estimated on data aggregated over all france (cf. fig , right plot) . 
video animations, available at perso.ens-lyon.fr/patrice.abry/deptregul.mp , and at barthes.enssib.fr/coronavirus/ixxi-sisyphe/., updated on a daily basis, report further comparisons betweenr indep andr joint and their evolution along time for the whole period of data availability. maps for selected days are displayed in fig ( with identical colormaps and colorbars across time). fig shows that until late march (lockdown took place in france on march th),r joint was uniformly above . (chosen as the upper limit of the colorbar to permit to see variations during the lockdown and post-lockdown periods), indicating a rapid evolution of the epidemic across entire france. a slowdown of the epidemic evolution is visible as early as the first days of april (with overall decreases ofr joint , and a clear north vs. south gradient). during april, this gradient rotates slightly and aligns on a north-east vs. south-west direction and globally decreases in amplitude. interestingly, in may, this gradient has reversed direction from south-west to north-east, though with very mild amplitude. as of today (june th), the pandemic, viewed hospital-based data from source (spf), seems under control under the whole continental france. estimation of the reproduction number constitutes a classical task in assessing the status of a pandemic. classically, this is done a posteriori (after the pandemic) and from consolidated data, often relying on detailed and accurate sir-based models and relying on bayesian frameworks for estimation. however, on-the-fly monitoring of the reproduction number time evolution constitutes a critical societal stake in situations such as that of covid- , when decisions need to be taken and action need to be made under emergency. this calls for a triplet of constraints: i) robust access to fast-collected data; ii) semi-parametric models for such data that focus on a subset of critical parameters; iii) estimation procedures that are both elaborated enough to yield robust estimates, and versatile enough to be used on a daily basis and applied to (often-limited in quality and quantity) available data. in that spirit, making use of a robust nonstationary poisson-distribution based semiparametric model proven robust in the literature for epidemic analysis, we developed an original estimation procedure to favor piecewise regular estimation of the evolution of the reproduction number, both along time and across space. this was based on an inverse problem formulation balancing fidelity to time and space regularization, and used proximal operators and nonsmooth convex optimization. this tool can be applied to time series of incidence data, reported, e.g., for a given country. whenever made possible from data, estimation can benefit from a graph of spatial proximity between subdivisions of a given territory. the tool also provides local trends that permit to forecast short-term future values of r. the proposed tools were applied to pandemic incidence data consisting of daily counts of new infections, from several databases providing data either worldwide on an aggregated percountry basis or, for france only, based on the sole hospital counts, spread across the french territory. they permitted to reveal interesting patterns on the state of the pandemic across the world as well as to assess variability across one single territory governed by the same (health care and politics) rules. 
more importantly, these tools can be used everyday easily as an onthe-fly monitoring procedure for assessing the current state of the pandemic and predict its short-term future evolution. updated estimations are published on-line every day at perso.ens-lyon.fr/patrice.abry and at barthes.enssib.fr/coronavirus/ixxi-sisyphe/. data were (and still are) automatically downloaded on a daily basis using routines written by ourselves. all tools have been developed in matlab™ and can be made available from the corresponding author upon motivated request. at the methodological level, the tool can be further improved in several ways. instead of using o(r) ≔ λ time kd rk + λ space krb > k , for the joint time and space regularization, another possible choice is to directly consider the matrix d rb > of joint spatio-temporal derivatives, and to promote sparsity with an ℓ -norm, or structured sparsity with a mixed norm ℓ , , e.g., kd rb > k , = ∑ t k(d rb > )(t)k . as previously discussed, data collected in the process of a pandemic are prone to several causes for outliers. here, outlier preprocessing and reproduction number estimation were conducted in two independent steps, which can turn suboptimal. they can be combined into a single step at the cost of increasing the representation space permitting to split observation in true data and outliers, by adding to the functional to minimize an extra regularization term and devising the corresponding optimization procedure, which becomes nonconvex, and hence far more complicated to address. finally, when an epidemic model suggests a way to make use of several time series (such as, e.g., infected and deceased) for one same territory, the tool can straightforwardly be extended into a multivariate setting by a mild adaptation of optimization problems ( ) and ( ), replacing the kullback-leibler divergence d kl (zjr � fz) by p i i¼ d kl ðz i j r � fz i Þ. finally, automating a data-driven tuning of the regularization hyperparameters constitutes another important research track. factors determining the diffusion of covid- and suggested strategy to prevent future accelerated viral infectivity similar to covid pooling data from individual clinical trials in the covid- era expected impact of lockdown in ile-de-france and possible exit strategies estimating the burden of sars-cov- in france the impact of a nation-wide lockdown on covid- transmissibility in italy measurability of the epidemic reproduction number in data-driven contact networks mathematical models in epidemiology a new framework and software to estimate time-varying reproduction numbers during epidemics the r package: a toolbox to estimate reproduction numbers for epidemic outbreaks improved inference of time-varying reproduction numbers during infectious disease outbreaks convex analysis and monotone operator theory in hilbert spaces image restoration: total variation, wavelet frames, and beyond proximal splitting methods in signal processing proximal algorithms. 
foundations and trends ® in optimization wavelet-based image deconvolution and reconstruction different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures epidemiological parameters of coronavirus disease : a pooled analysis of publicly reported individual data of cases from seven countries epidemiological characteristics of covid- cases in italy and estimates of the reproductive numbers one month into the epidemic nonlinear denoising for solid friction dynamics characterization sparsest continuous piecewise-linear representation of data the emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains a first-order primal-dual algorithm for convex problems with applications to imaging proximal splitting algorithms: relax them all! fonctions convexes duales et points proximaux dans un espace hilbertien. comptes rendus de l'acadé mie des sciences de paris a douglas-rachford splitting approach to nonsmooth convex variational signal recovery on the definition and the computation of the basic reproduction ratio r in models for infectious diseases in heterogeneous populations reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission figs and are produced using open ressources from the openstreetmap foundation, whose contributors are here gratefully acknowledged. mapdata©openstreetmap contributors. conceptualization: patrice abry, pablo jensen, patrick flandrin. key: cord- -ca ll tt authors: jia, peng; yang, shujuan title: early warning of epidemics: towards a national intelligent syndromic surveillance system (nisss) in china date: - - journal: bmj glob health doi: . /bmjgh- - sha: doc_id: cord_uid: ca ll tt nan after the sars pandemic in , an urgent demand for an effective national disease reporting and surveillance system could not be clearer in china. the national notifiable disease reporting system (nndrs) operated by the chinese center for disease control and prevention (cdc), also known as the china information system for disease control and prevention, was established in to facilitate the complete and timely reporting of cases during the outbreak. the outbreak of the covid- has further advanced the demand for an intelligent disease reporting system, also known as the national intelligent syndromic surveillance system (nisss), which would be able to analyse these suspected cases on the basis of prior knowledge and real-time information before a disease is confirmed clinically and in the laboratory. by doing so would tackle the epidemic quickly during the outbreak and even forecast the outbreak accurately and robustly at early stages. however, it remains difficult to forecast early risks for epidemics in the nisss with only disease cases reported from hospitals. more novel information input from end users and other external sources is required. the current technology enables the end user reporting or input modules in at least seven manners (figure ). first, the lowest level reporting parties in the current system (ie, hospitals and primary healthcare clinics) should go further down to doctors, who should be able to post and gather the suspected cases they have seen or treated. 
doctors see patients directly and are generally more sensitive to suspected cases, which could further reduce the delay under the current nndrs structure, which only allows the hospital administration to report a potential epidemic after confirmation by an expert panel. this is increasingly important because not only has the covid- outbreak added this urgent demand to the system, but the current healthcare reform is also shifting more gatekeeper roles down to the primary healthcare clinics. many western countries are more efficient in this respect because suspected cases are usually reported first by doctors in private clinics, who are the first contact with patients and, under such a structure, are more encouraged to report suspected cases. with that said, a hierarchical healthcare system is crucial to the successful early detection of infectious diseases. therefore, efforts to improve the nndrs should be aligned with the healthcare reform efforts. second, citizens should be enabled to report their surrounding risk through a novel crowdsourcing system. crowdsourcing is a sourcing model in which information can be obtained from a large, relatively open and often rapidly evolving group of internet users.
summary box
► a national intelligent syndromic surveillance system (nisss) is necessary in order to tackle the epidemic quickly during the outbreak, and to forecast the outbreak accurately and robustly at early stages.
► doctors, who see patients directly, and citizens, who are active on the ground, are generally more sensitive to suspected cases and risk, and should be enabled to report them in the nisss.
► hospital and other types of information systems (eg, environmental, ecological, agricultural, wildlife and animal) should be tightly linked with the nisss to enable more timely information sharing and make syndromic surveillance possible.
► literature databases containing valuable research findings and knowledge, and internet activity data reflecting cyber user awareness, should be incorporated into the nisss in a real-time way for warning of or fighting the epidemic.
► incorporating real-time data into the nisss could greatly facilitate real-time tracking and consequently guide epidemic control and prevention work on the ground for curbing the epidemic efficiently.
► the international institute of spatial lifecourse epidemiology (isle), a global health collaborative research network, has committed to working with multiple stakeholders to codevelop the nisss in china.
such passive surveillance, if well used, could be even more sensitive to the potential risk than reporting by doctors. for example, the crowdsourcing platform muggenradar has been used by dutch citizens to report the nuisance level of mosquitoes in order to indicate the potential risk of malaria. moreover, smartphone-based applications have become popular in china in almost every corner of daily life, except disease control and prevention, so crowdsourcing systems should be developed and integrated with the nisss with high priority. individuals' communication tools could be even more advanced than the level of information technology infrastructure in some less developed remote areas, where smartphone-based user ends should also be made available for hospitals and doctors to report disease case information, in order to offset the low level of information technology infrastructure. therefore, a crowdsourcing system may work even better in china and other large countries with great variation in economic development.
however, citizens should be better educated to be aware of their surrounding risk, which requires incorporating the public health education into the current education system at all stages. third, hospital information systems should be tightly linked with the nisss to enable more timely information sharing and make syndromic surveillance possible. currently, hospital information systems are not directly linked to the diagnosis-based (or diseasebased) nndrs in china, which could also cause reporting delay and errors (eg, manual typing errors) and should be improved. in the nisss, information about health events that precede a firm clinical diagnosis should be captured early and rapidly from electronic health records (ehrs), and analysed frequently to detect signals that might indicate an outbreak requiring investigation. this has been unprecedentedly possible since artificial intelligence (ai) is nowadays used to predict the future disease risk on the basis of ehrs. ai support is also required to link new symptoms with prior knowledge. in addition, information about hospital resources could be incorporated and updated in the nisss, which will enable the quick reallocation of the limited healthcare resources among hospitals and even among cities and provinces during the outbreak. fourth, information systems in other sections should also be linked with the nisss, so multisource information could be synthesised to maximise clues for the potential risk for epidemics. for example, linking environmental, ecological, agricultural and wildlife and animal information systems that have been continuously monitoring the nature and humannature interfaces would help to better early detect the appearance of disease cases with a natural origin, such as covid- . such linkage can also alarm other sectors how they could be affected by the epidemic, so they all could make their own strategies to reduce their loss while effectively avoiding the epidemic, and to alleviate other common issues, such as inadequate staffing and funding in those sectors. fifth, literature databases, especially in biomedical fields, normally contain valuable research findings and knowledge and should be incorporated into the nisss in a real-time way for warning or fighting the epidemic. for example, one article published months ahead of the covid- outbreak revealed several first-time detection of parasites in farmed and wild snakes in wuhan huanan seafood wholesale market, which could have been an early warning for that region requiring in-depth investigation. identifying and integrating scientific publications have been realised on several commercial social networking sites for scientists and researchers to share papers. ai support is also needed to conduct semantic analyses and identifies potential risk from a sea of knowledge, which can further be automatically positioned by spatial technologies and analysed as a whole. sixth, data of internet activities can be leveraged as a complementary source to reflect cyber user awareness, understand the epidemiological factors of diseases and imply the disease risk in users' surroundings. for example, internet-based search engine (volume of search keywords) and social media activity data (eg, twitter messages, wechat posts) have been associated with the daily numbers of reported human h n cases ; google flu trends was a web service that provided estimates of influenza activity by aggregating google search queries. 
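as a concrete illustration of how such an association between internet activity and case counts can be screened for, the short sketch below computes lagged pearson correlations between a search-volume series and a reported-case series. the data are synthetic and the lag range is an arbitrary choice; nothing here reproduces the studies cited above.

import numpy as np

def lagged_correlations(search_volume, reported_cases, max_lag=14):
    """pearson correlation between search volume and case counts,
    with the search series leading the cases by 0..max_lag days."""
    search = np.asarray(search_volume, dtype=float)
    cases = np.asarray(reported_cases, dtype=float)
    results = {}
    for lag in range(max_lag + 1):
        if lag == 0:
            x, y = search, cases
        else:
            x, y = search[:-lag], cases[lag:]  # search at day t vs cases at day t+lag
        results[lag] = float(np.corrcoef(x, y)[0, 1])
    return results

# toy example: cases trail search activity by roughly five days
rng = np.random.default_rng(0)
days = 120
trend = np.sin(np.linspace(0, 6, days)) + 1.5
search = rng.poisson(50 * trend)
cases = rng.poisson(20 * np.roll(trend, 5))
best_lag, best_corr = max(lagged_correlations(search, cases).items(), key=lambda kv: kv[1])
print("strongest correlation at lag", best_lag, "days:", round(best_corr, 2))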
hence, as the internet and social media are increasingly becoming major sources of health information, such internet surveillance is also more important than ever before in public health emergency control and prevention. such even more passive surveillance should be an additional module in the nisss for disease surveillance where the frequently searched disease-related keywords may deserve special attention by cdc. last but not least, incorporating real-time data into the nisss, if set up properly ahead of time, could greatly facilitate the real-time tracking and consequently guide the epidemic control and prevention work on the ground for curbing the epidemic efficiently. such data-sharing mechanisms and infrastructures would also facilitate timely spatial epidemiological research on the basis of individual-level infected cases linked with respective location data from mobile service providers and/or smartphone-based apps without violating confidentiality requirements. as spatial lifecourse epidemiology is capable of capturing the real-time interaction of three dynamic components (hosts, agents and environments), nisss running on the basis of real-time data would maximise the strengths of spatial lifecourse epidemiology in the real world, enabling it to outpace the epidemics and realise 'precision epidemic prevention and control'. in addition, by setting up such infrastructure ahead of time, the safety of individual confidentiality and information exchange will never be compromised in this powerful system. some practical aspects of implementation include the integration of disparate data sources in the nisss and the governance and privacy concerns of the nisss. the integration of data sharing between agencies can be realised by creating and using an application programming interface (api) for each service, which usually defines many items including the types of calls or requests that can be made and how to make them by data users, and the data formats to use and the conventions to follow by data owners. data users can request raw data to make forecasts on local machines or servers, which will need data-masking methods (eg, k-anonymity, l-diversity, t-closeness) for better privacy protection. some non-technical factors may hinder the realisation of raw data sharing, which may require the identification of a third-party governmental agency and legislation to facilitate data-sharing among sectors; for example, the ongoing beijing big data action plan, by beijing municipal bureau of economy and information technology and beijing municipal bureau of big data management, has linked data from government information systems in more than municipal departments in beijing. with sufficient prior knowledge (eg, knowing which variables are necessary to be used), data users can also request a subset of raw data or processed data to make forecasts, which would decrease the demand for local machine or server configuration and help overcome those non-technical barriers to some extent. however, extra costs will be incurred by creating apis for (multiple) services, which may require higher-level coordination for cost sharing and/or a data-sharing subsidy among sectors. lessons could be drawn from some examples of data sharing in other areas but adopting similar approaches. 
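to make the data-masking idea mentioned above concrete, the sketch below checks k-anonymity for a toy released table over a set of quasi-identifier columns. the column names, values and threshold are illustrative assumptions, not part of any system described in this article.

from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """true if every combination of quasi-identifier values appears in at least k released rows."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

# hypothetical released records after masking exact age and location
released = [
    {"age_band": "30-39", "district": "A", "diagnosis": "influenza"},
    {"age_band": "30-39", "district": "A", "diagnosis": "dengue"},
    {"age_band": "40-49", "district": "B", "diagnosis": "influenza"},
    {"age_band": "40-49", "district": "B", "diagnosis": "measles"},
]
print(is_k_anonymous(released, ["age_band", "district"], k=2))  # True for this toy table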
for example, medical big data sharing is gradually being allowed in south korea; the four main health maintenance organisations in israel and their affiliated hospitals have used the same electronic medical record (emr) platform for the past two decades, with access to patient records available to each point of care as needed, and % of the population has been using the same linked emr system for decades. in addition, multiple stakeholders at different levels of context should sit together, adopting participatory survey and discussion methods to identify more context-specific difficulties and solutions related to the practicality of the nisss, such as ( ) a hierarchy of data sources with the levels of confidentiality and necessity evaluated for each source, ( ) more approaches of organising and integrating disparate data sources and conducting analyses, ( ) more conceptual, architectural and analytical challenges that would arise and ( ) the corresponding solutions that would function best among multiple stakeholders nationally and internationally, including applicability and adaptability of the solutions to similar problems in other countries. the international institute of spatial lifecourse epidemiology (isle), established as a global health collaborative research network, has committed to identifying the key research issues and priorities for spatial lifecourse epidemiology, advancing the use of state-of-the-art technologies in lifecourse epidemiological research and emerging infectious disease research, and facilitating the quality of reporting of transdisciplinary health research. establishing the nisss will be one of the top public health priorities in the next decade, and also on top of isle's agenda. owing to a diverse variety of scholars' backgrounds, isle communicates crossinterdisciplinary knowledge and research findings in a plain language with scholars from various disciplines and multiple stakeholders including policy-makers. isle has committed to working with multiple stakeholders, from different levels of cdcs to industrial partners, doctors and citizens, to codevelop the nisss in china. this effort will exemplify the nextgeneration infectious disease reporting and surveillance system in the st century and serve as a model for many other countries in the world. china needs a national intelligent syndromic surveillance system citizens as sensors: the world of volunteered geography what is syndromic surveillance? mmwr deep patient: an unsupervised representation to predict the future of patients from the electronic health records what next for the coronavirus response? the tsinghua-lancet commission on healthy cities in china: unlocking the power of cities for a healthy china molecular identification and phylogenetic analysis of cryptosporidium, hepatozoon and spirometra in snakes from central china importance of internet surveillance in public health emergency control and prevention: evidence from a digital epidemiologic study during avian influenza a h n outbreaks influenza forecasting with google flu trends spatial lifecourse epidemiology and infectious disease research are we ready for a new era of high-impact and highfrequency epidemics? 
integrating kindergartener-specific questionnaires with citizen science to improve child health top research priorities in spatial lifecourse epidemiology spatial lifecourse epidemiology spatial lifecourse epidemiology reporting standards (isle-rest) statement contributors both authors have equally contributed to the planning, conduct and reporting of the work described in the article.funding we thank the national natural science foundation of china ( ), the special funds for prevention and control of covid- of sichuan university ( scuncov ), and the international institute of spatial lifecourse epidemiology (isle) for the research support.competing interests none declared. provenance and peer review not commissioned; externally peer reviewed.data availability statement all data relevant to the study are included in the article.open access this is an open access article distributed in accordance with the creative commons attribution non commercial (cc by-nc . ) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. see: http:// creativecommons. org/ licenses/ by-nc/ . /.orcid id peng jia http:// orcid. org/ - - - key: cord- - gyejoc authors: finnie, thomas j.r.; south, andy; bento, ana; sherrard-smith, ellie; jombart, thibaut title: epijson: a unified data-format for epidemiology date: - - journal: epidemics doi: . /j.epidem. . . sha: doc_id: cord_uid: gyejoc epidemiology relies on data but the divergent ways data are recorded and transferred, both within and between outbreaks, and the expanding range of data-types are creating an increasingly complex problem for the discipline. there is a need for a consistent, interpretable and precise way to transfer data while maintaining its fidelity. we introduce ‘epijson’, a new, flexible, and standards-compliant format for the interchange of epidemiological data using javascript object notation. this format is designed to enable the widest range of epidemiological data to be unambiguously held and transferred between people, software and institutions. in this paper, we provide a full description of the format and a discussion of the design decisions made. we introduce a schema enabling automatic checks of the validity of data stored as epijson, which can serve as a basis for the development of additional tools. in addition, we also present the r package ‘repijson’ which provides conversion tools between this format, line-list data and pre-existing analysis tools. an example is given to illustrate how epijson can be used to store line list data. epijson, designed around modern standards for interchange of information on the internet, is simple to implement, read and check. as such, it provides an ideal new standard for epidemiological, and other, data transfer to the fast-growing open-source platform for the analysis of disease outbreaks. infectious disease epidemiology relies on integrating increasingly diverse and complex data. this complexity comes not only from the types of data now collected (for example genetic sequence, image and digital sensor data are routinely generated during the course of a disease outbreak, together with more traditional epidemiological data) but also through multiple partners investigating different facets, from different specialities or covering different geographical areas. 
this has been seen in recent major epidemics including the influenza pandemic (fraser et al., ) , middle-east respiratory syndrome outbreaks or the west-african ebola epidemic (who ebola response team, , . in this context, the safe storage and swift exchange of epidemiological data between collaborators and institutions is key to the successful assessment of, and response to, infectious disease epidemics. consequently, a great deal of effort has been recently devoted to standardising platforms for the analysis of epidemiological data with software tools being constructed to permit interoperability between separate methodological approaches (jombart et al., ) . similar efforts have also been made in the fields of epidemiological data-gathering and recording (aanensen et al., ; ecdc, ) . overall however, there is a scarcity of systematised approaches for the transfer of data. the production of such a capability would vastly improve our ability to transfer information between systems and in doing so aid the interpretation of disease dynamics and ultimately protect a greater number of individuals. yet epidemics data are still, usually, held as a potentially confusing mass of spread-sheets, databases, text and binary files. a universal format enabling the coherent storage and transfer of these data is lacking. as a consequence, misinterpretation of the data may happen during transfer and result in errors being introduced into subsequent analyses and reports. unfortunately, the inherent complexity of epidemiological data magnifies the risks of such errors. fig. illustrates the major systems within an epidemiology work-flow where a standard for digital epidemiology data would be of assistance. a major difficulty associated with transferring epidemiological data lies in the degree of complexity that a dataset may display. the information that is recorded may vary markedly not only between outbreaks but also within a single outbreak. in addition, the epidemic context itself makes data collection a daunting task, leading to some inevitable disparities in the data recorded. despite these challenges, we can identify a common structure to epidemiological datasets that can make the task of storing them easier. at the top level of this common structure is information relating to the dataset as a whole, such as the name of the infection that is causing the epidemic or the particular geographic setting of the study. this information is meta-data. at a second level, most datasets are divided into subunits (units-of-record) that hold other information. these subunits could be individuals, regions, countries or time periods. in a conventional spreadsheet these units-of-record are usually stored as rows. the information relating to these units-of-record makes up the third level and is usually stored as columns in a conventional spreadsheet. this information can either relate directly to the unit-of-record itself (such as gender for an individual) or can relate to an event happening to or at that unit-of-record (such as the onset of symptoms for an individual). any format for the conveyance of epidemiological data has two competing goals: consistency and flexibility. with this and the common morphology of a dataset in mind, we propose a standard for the storage and transmission of data for infectious disease epidemiology: epijson (epidemiological javascript object notation). 
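for illustration, the snippet below builds one such attribute object (name, type, value and optional units) and serializes it. this is a sketch of the general pattern just described, with invented example values, not an excerpt from the epijson schema itself.

import json

def make_attribute(name, type_, value, units=None):
    """build an epijson-style attribute object: name, type, value and optional units."""
    attribute = {"name": name, "type": type_, "value": value}
    if units is not None:
        attribute["units"] = units
    return attribute

age = make_attribute("age", "number", 25, units="years")
gender = make_attribute("gender", "string", "male")
print(json.dumps([age, gender], indent=2))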
this format is intended to be language and software agnostic, simple to implement, and leverages modern data standards whilst maintaining the flexibility to represent most epidemiological data. while initially developed for problems within the infectious disease domain, epijson is applicable to any dataset where "events" happen to "units of record". we believe that it is sufficiently flexible to accommodate other datasets such as those found in non-communicable disease and chemical hazard areas. it has been designed to draw together all relevant epidemiological data into a single place so that, for example, genetic sequences may be stored alongside image data, a patient's standard demographic information and the disease trajectory data in an unambiguous manner. the epijson format capitalises on the common structure of most epidemiological datasets outlined above. fundamentally, the structure of an epijson file consists of three levels that we term "metadata", "records" and "events" (fig. ) . within each of these three levels, data are stored in collections of objects called "attributes" which are the core of data storage in epijson. an "attribute" object is used for storing unambiguously a discrete piece of information, recording not only the value of the data but also its name, type and units. the "name" is a label defining what the attribute is (e.g. "age" or "gender"), "type" defines the type of data being stored (e.g. "number" or "string"), "value" is the actual data value (e.g. or "male"), and the optional "units" key specifies measurement units (e.g. "years"). as this data representation is very generic, it is used as a unit of data storage across the whole structure of epijson. for example: an "attribute"" object could be used to store the name of an infection within "metadata", the gender of an individual within a "record", and the infection status of an individual at a test within an "event". the difference between an "event" and an "attribute" is that an "event" occurs at a defined time or place and can therefore store dates and locations using standard formats. the advantage of the epijson file structure is to clarify which data refer to the dataset, to records or to events associated with records. in contrast, conventional line list and other spreadsheet data can cause confusion as columns are often used to store all levels of data. an epijson file is essentially a text file containing a standards compliant json object. json is a widespread, language independent, human readable data format. json is made up of key/value pairs where keys are names and values are the data. epijson files are readable by any system capable of reading json even if it is not directly aware of the epijson structure. an epijson file consists of two parts: the metadata and the dataset (fig. , table ). both parts are arrays of objects. within this manuscript, an "array" refers to a one-dimensional collection of objects of the same type. arrays are indicated using square brackets "[]" immediately following the data type. for instance, an table keys and values for an attribute object. type value description "name" string string an identifier for this attribute "type" string "string" "number" "integer" "boolean" "date" "location" "base " type must be one of the enumerated types listed. (see table for a fuller explanation of type definitions.) "value" string; numeric; boolean the value to be recorded. it must be a valid example of "type". 
may be a homogeneous array of one of the permitted types "units" string string a string with the udunits unit name. for non-numeric or non-dimensional attributes this key may be omitted. non-standard units may be used but this is not recommended. if included the string must not be empty array of integers will be noted "integer[]". the metadata is an array of attribute objects (attribute[]) while the dataset is an array of record objects (record[]). both the metadata and the record keys are required but may hold zero objects. the "metadata" key is required to enable the inclusion of information describing the dataset whilst the "records" key holds the data. attribute objects are a fundamental concept in the epijson format. this construct holds data on a parent object in a form that permits a clear understanding of the data. an attribute object consists of keys: "name"", "type", "value" and "units" ( table ) . the "name" key is a character string identifying the name of this attribute. the "type" key identifies the type of data that is held under the value key. this is a character string but is limited to certain enumerated values contained in table . identifying the data type is necessary to ensure that software will correctly interpret the data when parsing an epijson file. the "value" key holds the actual value of the attribute object. this may be either a character string, a number, an integer, a boolean value, a json object or a homogeneous array of one of these (other types as specified in the type key are effectively sub-types of these primitives). the object held in the value key must be compatible with the value of the type key. for numeric value types the "units" key permits the units in which values have been recorded to be included in the dataset. an epijson dataset is made up from a series of record objects detailing units-of-record. certain characteristics of a unit-of-record are fixed (e.g. gender or region name) and may be recorded as part of the "attributes" of a given record. alternatively, other characteristics occur in space or time and are stored as "events". the record object consists of three keys: "id", "attributes" and "events" ( table ). the "id" key should be a string conforming to and generated according to version uuid (universally unique identifier) specification as found in rfc (leach et al., ) , and represents the unique identifier of a record. correctly generated, such a uuid is considered to be sufficiently individual that this identifier will be unique across all epijson datasets and greatly simplifies aggregation and sub-setting operations. should it be necessary to keep another identifier from an existing system then this should be stored as an attribute of the record. the "attributes" key holds an array of "attribute" objects (as above). these are the attributes table possible values for the type key of an attribute object and the consequence for the data held in the value key. type value "string" a character string of unspecified length "number" a decimal number. in languages making a distinction between numeric types, it is recommended that the "value" key is a signed floating point number encoded on at least bits. numbers must not contain any whitespace. commas and periods may only appear where they indicate the decimal place "integer" an integer number. in languages making a distinction between numeric types, it is recommended that the "value" key is a signed integer number encoded on at least bits "boolean" a boolean value, true or false. 
if a language supports boolean types then the implementation may treat the value key as boolean. must be lower-case and unquoted "date" a character string representing a date conforming to rfc (newman and klyne, ) . e.g. - - t : : z note the 'z' at the end indicating zero offset from utc. time zones are represented as numeric offsets from utc in hh:mm format (e.g. " - - t : : - : " would represent the same time in pacific standard time) "location" a geojson object representing a spatial entity. note: although attributes can hold either spatial (location) or temporal (date) information this is mostly for use in metadata. for records, the event object is the recommended form for storing this information "base " a character string of binary data encoded to a text character set using base encoding as per rfc (josefsson, ) . by including a method of holding binary data within an epijson file we permit abstract data to be included of this specific record object. the "events" key holds an array of "event" objects (noted "event[]") relating to this record. events are usually distinguished from attributes by occurring in time or space while attributes tend to have no spatio-temporal component. for example a date of occurrence is an event, it has a temporal dimension (and spatial-but this is rarely required) while the sex of an individual is an attribute as it lacks either a spatial or temporal dimension. the "event" object records observations made on its parent object. these might be events directly related to disease such as infection, symptom onset, hospitalisation or they might be more generic such as date of birth, the locations from an individual's travel history, their place of work, or their home. events must have a time, a place or both; for recording of time rfc (newman and klyne, ) is used (e.g. " - - t : : z") and we use geojson for recording of location (butler et al., ) . for datasets where the unit-of-record is not the individual, events could be data such as census dates and the resultant population counts, dates of ward cleaning etc. an event object has up to five keys: "id", "name", "date", "location" and "attributes" ( table ). the keys "id", and "name" are mandatory. in addition, at least "date" or "location" must be provided. the key "attributes" is optional. the value of the "id" key, as for the record object, should be a string consisting of a version uuid conforming to and generated as specified in rfc (leach et al., ) . "name" is a string that names the event. the value of "name" need not be unique, indeed it is suggested that a standard set of names are used to identify events across the dataset (and across multiple datasets). the "date" key records the time at which an event took place; it should be a valid string representation of a rfc (newman and klyne, ) conformant date, this is a subset of the iso extended format. if a single date point is insufficient to record an event, for instance to record a period of exposure, then we suggest that multiple events are recorded; in this example, one event object is coded to record the start of exposure and one to record the end. the "location" key stores a geospatial object in geojson format (butler et al., ) representing the location of an event. by allowing different events to have different geospatial types it is possible to choose the most appropriate geospatial object for an event without being restricted by the choice within other events (i.e. 
it would be valid to record the only vaguely known location of infection as a large polygon, while simultaneously having a point to represent where an individual was admitted to hospital). this scheme also permits the use of different projections for different events (e.g. a national/local grid for some while using wgs , the gps coordinate system, for others). the "attributes" key holds an array of attributes defined in the same way as above. here the attributes relate to the event and may hold data such as number of colonyforming units from a swab, genetic sequence data or the recorded population values from a census. in this example, an individual record is presented (fig. ) . first there is the metadata; here simply who created this dataset. then there comes the actual records, in this case only one person is recorded. this person has an age, and an id from the field recording system. a single event, the onset of disease, is also recorded together with the patient's temperature at that point. the source data could have been presented as a comma delimited file (csv) or line list format but by using epijson we can assign various parts of the line list data to objects in the json format to reduce ambiguity and provide information to other users of this information: in this case, in what units we should interpret the temperature reading. we also draw the reader's attention to the more fully worked example of a data storage and an analytical system communicating via epijson in the supplementary material. the repijson package has been developed as a demonstration implementation and will facilitate data transfer to and from the epijson format within the statistical and programming software r (r core team, ). it provides a variety of functions that can convert data to each of the levels within epijson (metadata, attributes, records, events and objects). it also implements conversion tools for the data structures used to store outbreak line lists in outbreak-tools (jombart et al., ) . fig. provides a brief overview of the conversions and translations provided by the package. the package is released with a fully documented manual and a vignette tutorial which includes the above examples to help users implement the formatting system. it also illustrates how information from epijson files can easily be extracted for further analysis. the software package, repijson, is distributed under gnu public licence (version or greater), and developed on github (https:// github.com/hackout /repijson), where instructions on installation and contributions can be found. the stable version of the package is distributed on the comprehensive r archive network (cran: http://cran.r-project.org/). json, javascript object notation, ecma standard (bray, ) , was chosen as the base technology for a universal fig. . schematic of the common types of data that the repijson package may convert between or include into an epijson file. solid lines indicate conversion of data dashed lines indicated direct inclusion. epidemiology interchange format because: (i) it is lightweight with no complex formatting rules; (ii) it is based on plain text; (iii) it is easy for both humans and machines to read; (iv) it is in widespread use; and, (v) it has a large number of high quality libraries available for all common computer systems and languages. all of these properties combine to make json an ideal data-interchange medium. as epijson is standards-compliant json, a file is made up from a series of key value pairs. 
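a minimal sketch of what such a document could look like is given below, assuming illustrative values throughout: the creator name, age, onset date and temperature are invented for the example, and the uuids are freshly generated rather than taken from the paper's figure.

import json
import uuid

# metadata: information about the dataset as a whole
metadata = [{"name": "creator", "type": "string", "value": "A. N. Example"}]

# one record (unit-of-record = an individual) with fixed attributes and one event
record = {
    "id": str(uuid.uuid4()),                     # version 4 uuid, as required for records
    "attributes": [
        {"name": "age", "type": "number", "value": 25, "units": "years"},
        {"name": "field id", "type": "string", "value": "case-0001"},
    ],
    "events": [
        {
            "id": str(uuid.uuid4()),             # events carry their own uuid
            "name": "onset of disease",
            "date": "2015-03-01T09:30:00Z",      # rfc 3339 timestamp (illustrative)
            "attributes": [
                {"name": "temperature", "type": "number", "value": 38.5, "units": "celsius"}
            ],
        }
    ],
}

epijson_doc = {"metadata": metadata, "records": [record]}
print(json.dumps(epijson_doc, indent=2))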
keys are always double quoted character strings while the "value" may be a character string, a number, a boolean value or another json object. a pictorial overview of this structure for epijson is presented in fig. . as a detail for implementations, where possible, numeric types should be equivalent to c's long type (i.e. in the range − , , to + , , ) for integers and double precision should be used for floating point values. for clarity these have been referred to using json terminology as integer and number types, respectively, throughout the manuscript. within an attribute object the value of the units key where possible should be a unit name taken from the name parameter of the matching unit definition in the udunits unit database (unidata, a) . when the type key is "integer" or "number" and the units key is omitted, the value should be considered to be nondimensional. the major benefits of the epijson format are its flexibility and simplicity. epijson has broad-scale application to data transfer across multiple disciplines as we reach an era of rapid data assimilation. epijson has been designed to take advantage of existing standards for the data that it represents. this has two major advantages: the first is that existing domain expertise, for example in the representation of spatial data, is implicitly incorporated into this standard; the second is that the implementation of epijson parsers and filters for existing languages and software is greatly simplified. from the outset, epijson has been designed to reduce ambiguity and permit greater ease of transmission for epidemiological datasets. key to this ambition is ensuring that core information held by the dataset may be simply understood and can be transferred with fidelity. to provide documented data of the greatest clarity and concision within epijson we offer the attribute object. this object not only holds much of the fundamental data of the dataset but also important metadata that enables the interpretation of those data. the repetition of identically constructed attribute objects throughout the data structure allows clear documentation of information at all levels of a dataset without the added confusion of a different structure at each level. the first level within epijson is the dataset. information on the dataset itself is stored as metadata. the use of metadata is standard in many epidemiological settings and providing a method for this to be included within epijson is essential and best practice. indeed, including metadata allows some of the most important pieces of information relating to a dataset to be stored alongside the data, such as why, when and by whom the dataset was created or collated. similar to the epidemiological data, the type and detail of metadata can be very broad. the next level is that of the record. a great asset of our format is that epijson makes no assumptions about the unit-of-record (that is person or region, country, etc.) nor even forces record objects to be uniform. although not recommended, it is perfectly possible to mix individuals with regions within a single dataset. the final level is that of the event. so that there is no ambiguity within the data, epijson requires the use of a standard for the recording of time and place. 
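as a small illustration of those two standards, the lines below produce an rfc 3339 timestamp with an explicit numeric utc offset and a geojson point usable as the "location" of an event. the offset and coordinates are arbitrary examples.

from datetime import datetime, timezone, timedelta
import json

# rfc 3339 timestamp with a numeric utc offset (an assumed -08:00 offset here)
pst = timezone(timedelta(hours=-8))
stamp = datetime(2015, 3, 1, 9, 30, 0, tzinfo=pst).isoformat()
print(stamp)  # 2015-03-01T09:30:00-08:00

# geojson point (longitude, latitude order)
location = {"type": "Point", "coordinates": [-0.1278, 51.5074]}
print(json.dumps(location))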
our decision to enter location data using geojson format (butler et al., ) permits recording of event data to any of the standard geographical objects (point, line or polygon) and simultaneously solves potential issues caused by using different coordinate reference systems (crs). an additional benefit is that different events within the same record may be recorded to a different geographical object, even one in a different crs. naturally, no standard epidemiological software currently supports the epijson standard. the first step in more widespread support is to provide a mechanism by which the validity of a epi-json file and hence software implementation can be checked. to this end we provide a schema for epijson files (see links below). in the short term the availability of an r package permits not only epijson files to be used within r but also for them to be converted to and from a wide range of file formats including the ubiquitous csv spreadsheet format read by most systems. the next step is for library functions for common languages to be written, easing the developer effort for high level packages familiar to the epidemiological user base. we believe that, because the standard is well defined, open, based on existing standards and easily validated, this effort in including parsers within existing packages should be small. the transfer of epidemiological data may be particularly sensitive either because of its personal or political nature. however, we believe that encryption is a different problem to that of high fidelity transfer of scientific information. performing encryption well is difficult, as evidenced by the many security breaches in even extensively tested systems over recent years. we believe that the user is better served by well tested external encryption libraries or tools than by providing a mechanism within the epijson standard that would become rapidly obsolescent. however we envisage that much as html (hypertext mark-up language, the text format of most web pages) may be encrypted using https, the text based epijson could be encrypted using a system such as ssl to provide an encrypted version of the format. such a development would also provide the user with the choice of symmetric, password-based encryption as used in later versions of well-known spreadsheet programs or the more secure key-based asymmetric encryption commonly in use on the internet in banking and encrypted email transactions. representing data as epijson means that increased storage space is required in comparison to equivalent, terse, spread-sheets. yet the additional detail of epijson, the reduction in ambiguity and the direct readability by machine outweigh the disadvantage of increased use of storage. further, it is possible to transfer genomic data, images and location data within a single epijson file ensuring that a wide variety of data, relating to a single unit-of-record, can be collated. as with encryption, it is not within the scope of a format for epidemiological data to specify a data compression standard but note that, as a structured text document with a good deal of repetition, epijson files compress well using commonly available compression tools. epijson's requirement for additional space is not considered an impairment to its applicability. finally, the epijson format is deliberately broad to permit the capture of the widest range of epidemiological datasets. however, we recognise that there are sub-types of epidemiological data. 
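a validity check of the kind described above can be scripted against the published schema. the sketch below uses the third-party python jsonschema package and assumes the schema and a candidate document have already been downloaded to local files; both file names are placeholders.

import json
from jsonschema import validate, ValidationError  # third-party: pip install jsonschema

def check_epijson(document_path, schema_path="epijson.schema.json"):
    """validate a candidate epijson file against a locally saved copy of the schema."""
    with open(schema_path) as f:
        schema = json.load(f)
    with open(document_path) as f:
        document = json.load(f)
    try:
        validate(instance=document, schema=schema)
        return True, "document conforms to the schema"
    except ValidationError as err:
        return False, err.message

# example usage (assumes both files exist locally)
ok, message = check_epijson("outbreak_linelist.epijson")
print(ok, message)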
it is envisaged that standard attribute sets will be agreed to address specific types of epidemiological data, much as conventions have emerged for netcdf datasets (unidata, b) e.g. the cf convention for climatology (eaton et al., ) . the adoption of well-defined data standards makes both the contribution of data to communal efforts and the development of data tools much easier. this has been the case for public transport data (google, ) , where a standard allows companies to submit their timetables in a form that can be used by google maps to provide route information. this definition of an open data structure for epidemiology is the first step in allowing other developers and collaborators to modify their software and work-flows to utilise the standard. the development of useful tools and practices will, by necessity, be an iterative process with input from a wide range of users. in epijson we provide a well-understood file structure with a verifiable format for storing and exchanging epidemiological data. the epijson format is highly flexible to enable the widest range of datasets to be conveyed while being sufficiently rigid to remove many of the common causes of ambiguity and error. while epijson is not intended to replace existing databases or surveillance systems, it should prove useful for transferring information between these types of systems, collections and analysis tools. in the resources below we provide links to the current epi-json schema (permitting the automatic verification of epijson files), tools to convert between common formats and epijson and example datasets in epijson form. we appreciate that full implementation of the epijson standard across all of the software and systems used within epidemiology is a large undertaking, but by providing an open blueprint for data interchange between these systems, together with validation tools and a library for one of the more popular data analysis environments we hope to greatly simplify the process by which epidemiology systems interact. researchers and institutions should find that collaboration becomes easier by reducing data compatibility problems and additional capabilities will become simpler to add to existing workflows with the adoption of epijson as a medium for information interchange. gnu general public licence (gpl) ≥ . https://github.com/hackout /epijson. https://raw.githubusercontent.com/hackout /epijson/ master/schema/epijson.json. r library in cran: https://cran.r-project.org/web/packages/ repijson/. https://github.com/hackout /repijson. epicollect: linking smartphones to web applications for epidemiology. ecology and community data collection the javascript object notation (json) data interchange format. rfc geojson specification middle east respiratory syndrome coronavirus: quantification of the extent of the epidemic, surveillance biases, and transmissibility netcdf climate and forecast (cf) metadata conventions legionella outbreak toolbox [www document]. legion. dis. outbreak investig. toolbox pandemic potential of a strain of influenza a (h n ): early findings general transit feed specification reference google developers outbreaktools: a new platform for disease outbreak analysis using the r software the base , base and base data encodings a universally unique identifier (uuid) urn namespace date and time on the internet: timestamps. rfc r: a language and environment for statistical computing. 
r foundation for statistical computing netcdf conventions [www document west african ebola epidemic after one year-slowing but not yet under control ebola virus disease in west africa-the first months of the epidemic and forward projections epijson was created during "hackout : graphical resources for infectious disease epidemiology in r" (https://sites.google.com/ site/hackout /), an event funded by the medical research council we would also like to thank the two anonymous reviewers whose constructive comments helped to strengthen this manuscript. supplementary data associated with this article can be found, in the online version, at doi: . /j.epidem. . . . key: cord- -kqpnwkg authors: sun, yingcheng; guo, fei; kaffashi, farhad; jacono, frank j.; degeorgia, michael; loparo, kenneth a. title: insma: an integrated system for multimodal data acquisition and analysis in the intensive care unit date: - - journal: j biomed inform doi: . /j.jbi. . sha: doc_id: cord_uid: kqpnwkg modern intensive care units (icu) are equipped with a variety of different medical devices to monitor the physiological status of patients. these devices can generate large amounts of multimodal data daily that include physiological waveform signals (arterial blood pressure, electrocardiogram, respiration), patient alarm messages, numeric vitals data, etc. in order to provide opportunities for increasingly improved patient care, it is necessary to develop an effective data acquisition and analysis system that can assist clinicians and provide decision support at the patient bedside. previous research has discussed various data collection methods, but a comprehensive solution for bedside data acquisition to analysis has not been achieved. in this paper, we proposed a multimodal data acquisition and analysis system called insma, with the ability to acquire, store, process, and visualize multiple types of data from the philips intellivue patient monitor. we also discuss how the acquired data can be used for patient state tracking. insma is being tested in the icu at university hospitals cleveland medical center. each year, more than four million acutely ill patients are admitted to intensive care units (icus) in the u.s. alone; approximately , of them do not survive [ ] [ ] [ ] . in extreme situations, like the current covid- pandemic, icus are essential for treating critically ill coronavirus patients. given the high stakes involved, timely and effective care is paramount, and this requires continuous patient surveillance using sophisticated monitoring equipment. as a result, icus are complex, data-intensive environments and dozens of systemic parameters are monitored, including heart rate, respiration, arterial blood pressure, oxygen saturation, temperature, end tidal co concentration, etc. enormous volumes of multimodal physiological data are generated including physiological waveform signals, patient monitoring alarm messages, and numerics and if acquired, synchronized and analyzed, this data can been effectively used to support clinical decision-making at the bedside [ , ] . clinical personnel rely on information from these signals provided on the patient monitor display for visual assessment or as numerics in the emr to understand the current state of the patient, and how it is changing over time. continuous digital monitoring is intended to allow clinicians to dynamically track changes in patient state more closely than would be possible with more sporadic measurements [ ] . 
the hope has been that this would allow for more accurate diagnosis, earlier anticipation of deterioration, and a clearer understanding of the impact of administered treatments, improving quality of care and lowering costs [ ] . even when data can be viewed in real-time, standard approaches provide little insight into a patient's actual pathophysiologic state. understanding the dynamics of critical illness requires precisely time-stamped physiologic data (sampled frequently enough to accurately recreate the detail of physiologic waveforms) integrated with clinical context, but this will produce an overwhelming amount of data-far too much to be routinely reviewed manually. it is thus necessary to develop a data acquisition system that facilitates the access and review of historical data for medical personnel. the acquired data needs to be synchronized across disparate devices, archived, analyzed and presented to clinical personnel in a manner that supports clinical decision-making at the patient bedside. previous work includes, for example: matam and duncan adopt the real-time data recording system used for f race cars to acquire and analyze data from bedside monitors in the pediatric intensive care unit [ ] . this system supports the storage and review of electrocardiogram (ecg) data retrospectively. raimond et al. developed a platform called "waveformecg" that provides interactive analysis, visualization and annotation of ecg signals [ ] . alexander et al. developed an alarm data collection framework to acquire all alarms generated from philips intellivue mp patient monitors installed in each icu room with the objective of reducing false alarms by leveraging annotations provided by clinicians [ ] . hyung and chul introduced a physiological data acquisition and visualization program "vital recorder" with a user-friendly interface similar to that of a video-editing program for anesthesia information management, where physiological data can be manipulated like editing video clips [ ] . much of the previous work has focused on the acquisition or visualization of certain physiological data, a complete general purpose solution for data collection, analysis and visualization of multimodal icu data is currently unavailable. icu clinical personnel need the ability to effectively deal with different data sources on a patient or departmental level, and need advanced analytic methods that transform this data to actionable and clinically meaningful outcomes for each patent. we have been working on building the integrated medical environment (time) [ ] to address this critical opportunity and in this paper, we discuss an integrated system (insma) that supports multimodal data acquisition, parsing, real-time data analysis and visualization in the icu. in the current implementation, insma acquires data from the philips intellivue patient monitor, and has the ability to store and review the multimodal data acquired either in real-time or on request. the system has been tested in the icu at university hospitals cleveland medical center, for multimodal data analysis and patient state tracking. the remainder of the paper is organized as follows. in section , we discuss some related work. in section , we present the insma framework, and introduce the details of the data acquisition, parsing and visualization modules. in section , we discuss the applications of our proposed system. we conclude our work and suggest possible future work in section . 
icus provide treatment to patients with the most severe and life-threatening illnesses and injuries. it requires uninterrupted attentiveness and medical care from various clinical specialists and medical equipment to sustain life and help nurture the patients back to health. effective and reliable patient monitoring and data analysis are of ultimate importance in the icu to ensure early diagnosis, timely and informed therapeutic decisions, effective institution of treatment and follow-up [ , ] . several clinical information systems have been developed from both industry and academia to meet the demanding needs of the icu. general electric (ge) co.'s centricity critical care system introduced in creates actionable insight across the healthcare system and the care pathway in intensive care units, enabling enhanced clinical quality and operational efficiency. the system collects data from monitors and ventilators and displays it in spreadsheets reminiscent of the typical icu chart. data are collected from medical devices through device interfaces that connect with ge's unity interface device network [ ] . the datex-ohmeda s/ ™ collect program proposed by ge healthcare can obtain high-resolution data from the datex-ohmeda s/ ™ series monitors [ ] . the program was developed for windows xp and is not compatible with current windows operating systems, and the manufacturer does not intend to update it. philips offers data management solutions that link the philips als monitor/defibrillator and aed and allow quality assurance officers using a direct connection that downloads and forwards every event automatically. quality assurance officers can then retrieve and review an event summary with confidence [ ] . often, the commercial off-the-shelf products do not support the acquisition, archiving, or annotations of high-resolution physiologic data with bedside observations for clinical applications. systems have also been developed in academic settings primarily to support clinical research. tsui et al. developed a system to acquire, model, and predict icp in the icu using wavelet analysis for feature extraction [ ] . goldstein et al. proposed and developed a physiologic data acquisition system that could capture and archive parametric data, but the annotation of important clinical events such as changes in a patient's condition or timing of drug administration, was limited [ ] . kool et al. reported that they collected numerical data at five-second intervals from the datex ohmeda s/ tm monitoring system using their own information management system [ ] . liu et al. [ ] also reported the collection of vital signal data from surgical patients, from philips intellivue mp series monitors, using a self-developed program that was not disclosed. lee and jung developed an anaesthesia information management system (aims) for the acquisition of high-quality vital signal data (vital recorder) to support research [ ] . physiological data of surgical patients were collected from operating rooms by vital recorder through the patient monitor, anaesthesia machine and bispectral index monitor. winslow et al. proposed a platform called waveformecg for visualizing, annotating, and analyzing ecg data [ ] . as discussed in the first section, these systems only focus on acquiring and analyzing one or two types of physiological data, and that is not sufficient for icu applications. 
matam and duncan used real-time data recording software, atlas from mclaren electronics systems that continuously monitor and analyze data from f racing cars to implement a similar real-time data recording platform system adapted with real time analytics to suit the requirements of the intensive care environment [ ] . the parameter data recorded by philips mp bedside monitors can be transferred to the server in real-time. however, such a third-party data acquisition tool is not flexible enough to customize the functions according to the clinician's requirements, and the compatibility of the data format is another issue. to address the issues described above, our research proposed the insma with the aim of obtaining clinical physiological data including electroencephalography (eeg), electrocardiography (ecg), photoplethysmogram (ppg), peripheral capillary oxygen saturation (spo ), blood pressure (bp) and other signals to be acquired and stored for data sharing, mining, analysis and visualization. the primary data source in our first implantation comes from the intellivue mp (philips, germany) series of monitors. insma contains three independent but data related modules: data acquisition module, parsing module and visualization module. figure shows the insma architecture and its data flow. the data acquisition module establishes the connection with the patient monitor and requests the physiological measurements that are acquired. the raw multimodal data obtained from the monitor includes physiological waveforms, alarm messages and numeric (vitals) data. the specific types of data to be acquired can be chosen by the users according to their needs through the "data type selector" and "physiological signal selector". the data transport rate can also be set by the "serial port selector". once the raw data has been acquired from the monitor, the data parsing module will process, parse and transform the data into a time-series using the physiological identifiers or codes provided by the monitor. the data visualization module will display the graphs for the parsed time-series results. it can plot both real-time signals and historical data given a time range. all three modules are developed using mfc and c/c++, so that they all run in the same operating environment and use compatible data formats, and therefore provide a complete solution for data acquisition, parsing and visualization. the details of these modules are discussed in next sections. the bedside patient monitor is the most common long-term monitoring medical device used in an icu. it is used to continuously monitor the physiological parameters of an patient through specially designed sensors, signal acquisition modules, and invasive or noninvasive interfaces: cardiac activity including ecg and heart rate, circulation including blood pressure & cardiac output indices, spo , respiratory function including respiration rate, oxygenation, capnography, and brain through eeg waveforms and derived indicators, temperature and metabolic rate, etc. the philips intellivue mp is a bedside patient monitoring device that displays various physiological waves (e.g. ecg and blood pressure) and provides important functions such as displaying numeric vitals data (e.g. heartrate, oxygen saturation) and performing alarm functions based on minimum and maximum limits set by the clinical staff in the monitor. 
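to make the parsing module's role concrete, the sketch below shows one way parsed samples could be grouped into per-signal time series keyed by a physiological identifier. the identifiers and sample tuples are invented for illustration and do not reproduce insma's internal data structures.

from collections import defaultdict
from datetime import datetime, timedelta

# (identifier, timestamp, value) tuples as they might emerge from parsing;
# the codes "HR" and "ABP" are illustrative labels, not monitor-defined identifiers
t0 = datetime(2020, 1, 1, 12, 0, 0)
parsed_samples = [
    ("HR", t0, 72), ("ABP", t0, 118),
    ("HR", t0 + timedelta(seconds=1), 74), ("ABP", t0 + timedelta(seconds=1), 117),
]

def to_time_series(samples):
    """group parsed (code, timestamp, value) samples into one time series per signal."""
    series = defaultdict(list)
    for code, stamp, value in samples:
        series[code].append((stamp, value))
    return {code: sorted(points) for code, points in series.items()}

for code, points in to_time_series(parsed_samples).items():
    print(code, points)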
a variety of sensors and associated clinical measurement modules can be connected to the monitor, and these modules are generally interchangeable with other monitors provided by philips [ ]. one of these modules is the philips vuelink module, which provides an interface to third-party specialty measurement devices. the monitor's data export protocol is a connection-oriented, message-based request/respond protocol, based on an object-oriented model concept. all information is stored as attributes within a set of defined object types. the following objects are defined in the protocol: medical device system (mds), alert monitor, numeric, and patient demographics. in order for a client application to access the attributes of instantiated objects, it first has to poll the mds object. then, the client gets the information of the instantiated object via queries that return the attribute values of these objects. after building the association, the following data can be accessed from the intellivue monitor: all measurement numerics and alarm data (real-time update rates up to ms), wave data, and patient demographic data entered by the user in the intellivue monitor. the data acquisition module collects and stores real-time data from patient monitors in intensive care units for further data analytics that supports clinical decision-making. we developed an interface using mfc to make the data acquisition process easy for users to control. figure shows the interface with function areas - . the data acquisition and preprocessing tools can perform high-resolution recording and processing tasks, such as simultaneous recording of - ecg channels (at samples/s), and additionally up to non-ecg waves (at or . samples/s), along with other signal types such as all available numeric values and alert messages. after the program is started, a text file including the monitoring results of the selected data types and signals will be generated for further analysis. the file is named using the date, time and the patient's demographic information. the data parsing module runs synchronously with the data acquisition module to continuously parse real-time data being streamed from the monitor, which increases the efficiency of the data collection and archiving process. when parsing the data, we first identify each frame by locating its bof ( xc ) and eof ( xc ) markers, and get the type and length of the message from the hdr field. next, we interpret the time stamp from the user data. in the data export protocol defined by the philips intellivue monitor, the time stamps contain two types of data: absolute time and relative time. for the waveform signals, the intellivue patient monitor supports the wave types (table i) that are defined by sample period, sample size, array size, update period and bandwidth requirement. the data visualization module can display the multimodal data including the wave signals and numeric data of patients. figure illustrates the main interface. the timestamp is displayed above the chart. the selector ( ) lists all the numeric data types obtained from the parsed results. when a numeric data type is chosen, its value will be displayed in the panel ( ) in real-time. the list in ( ) will update automatically when the program finds a "new" data type in the parsed results of that numeric data. the numeric data is displayed as markers (dots) instead of a continuous curve, and each dot represents one data value at that time point.
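the frame-identification step described above can be sketched as follows; the bof/eof marker values and the header layout (message type plus payload length) are placeholders, since the exact values are defined in the philips intellivue data export protocol and are not reproduced in the text.

    # Illustrative sketch of frame identification in a streamed byte buffer.
    # Marker bytes and header layout are assumptions for illustration only.
    BOF_MARK = 0xC0   # placeholder begin-of-frame marker
    EOF_MARK = 0xC1   # placeholder end-of-frame marker

    def extract_frames(stream: bytes):
        """Return a list of (msg_type, payload) tuples for complete frames."""
        frames = []
        i = 0
        while i < len(stream):
            if stream[i] != BOF_MARK:
                i += 1
                continue
            try:
                end = stream.index(EOF_MARK, i + 1)
            except ValueError:
                break  # incomplete frame; wait for more data
            frame = stream[i + 1:end]
            if len(frame) >= 4:
                # assumed HDR layout: 2-byte message type, 2-byte payload length
                msg_type = int.from_bytes(frame[0:2], "big")
                length = int.from_bytes(frame[2:4], "big")
                frames.append((msg_type, frame[4:4 + length]))
            i = end + 1
        return frames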
different types of waveforms or numeric values are displayed in different colors, so it is easier for users to distinguish them. all the control commands are displayed in panel ( ) in a log style. the patient's id, wave and numeric data type list can be set in the "option" menu. in addition to providing the patient's physiological status in a real-time mode, the data visualization module can also display waveforms from archived data. users first need to choose the patient and types of waves that are to be visualized, and then set the time range in the "plot setting" window, as shown in figure . in order to evaluate the performance of insma, the system was deployed in the neurological surgery icu at university hospitals cleveland medical center using a dell minicomputer with an intel(r) dual core celeron processor, gb of ram, and gb hard disk storage. results to date indicate that it is reliable for collecting data from the patient monitor. the waveform signals (e.g. ecg, respiration, pleth/co ), numeric signals and alarm event signals, when streamed continuously from the monitor over a -hour period, generate an approximately mb data file for each patient. the parsed results will be larger in size because the absolute time stamp information is added to each sample point in the file. clinicians can, for example, assess how the patient is progressing by observing the art and icp waveforms and how they are temporally correlated. we also collected alarm messages and numeric measurements in the parsed results. the analysis of icu data from patients in the clinical setting is generally limited to visualizing waveform and numeric data and computing simple values such as average heart rate, average respiratory rate, average blood oxygen saturation, etc. in insma, each waveform can be analyzed independently or in conjunction with other waveforms to extract more information, as shown in fig. . it is possible to zoom in to provide additional waveform details for visual inspection, or to apply different analytical techniques to single or multiple waveform signals to better understand the status of the patient and support clinical decision-making. it has been well established that feature extraction for quantifying the complexity and/or variability in physiological time-series data can provide important information related to health and disease [ ]. specifically, even though temporal patterns of variability can be leveraged as powerful diagnostic and/or prognostic indicators, the current use of beat-to-beat and cycle-to-cycle variability dynamics at the bedside is hampered by: ( ) lack of high-resolution real-time multimodal clinical data, ( ) non-trivial interpretation and integration of these variability metrics into clinical workflows, and ( ) lack of a unified framework for classifying variability dynamics into meaningful clinical categories. algorithms that quantify variability dynamics over multiple temporal scales, such as multiscale entropy (mse) and multifractal detrended fluctuation analysis (mfdfa), have shown a lot of promise as diagnostic tools in clinical research settings, but the difficulty in interpreting these measures by non-specialists prevents their routine implementation and use in the icu. the acquired data from patient monitors can be used to develop novel and generalizable methods for quantifying and tracking patient state in real-time [ ]. we are developing a patient state tracking system based on the analysis of physiologic time-series dynamics, as shown in fig. .
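as a concrete illustration of one of the variability algorithms named above, the sketch below computes multiscale entropy by coarse-graining a beat-to-beat series and evaluating sample entropy at each scale; it is a minimal python/numpy sketch, and the parameter defaults (m = 2, r = 0.15 × sd, 10 scales) are common choices rather than values taken from the cited work.

    # Minimal multiscale entropy (MSE) sketch: coarse-grain the series at each
    # scale and compute sample entropy of the coarse-grained series.
    import numpy as np

    def sample_entropy(x, m=2, r=None):
        x = np.asarray(x, dtype=float)
        if r is None:
            r = 0.15 * np.std(x)
        n = len(x)

        def count_matches(dim):
            templates = np.array([x[i:i + dim] for i in range(n - dim)])
            count = 0
            for i in range(len(templates)):
                # Chebyshev distance to all later templates, excluding self-matches
                dist = np.max(np.abs(templates[i + 1:] - templates[i]), axis=1)
                count += np.sum(dist <= r)
            return count

        b, a = count_matches(m), count_matches(m + 1)
        return -np.log(a / b) if a > 0 and b > 0 else np.inf

    def multiscale_entropy(x, scales=range(1, 11), m=2, r=None):
        x = np.asarray(x, dtype=float)
        if r is None:
            r = 0.15 * np.std(x)          # tolerance fixed from the original series
        mse = []
        for tau in scales:
            n = len(x) // tau
            coarse = x[:n * tau].reshape(n, tau).mean(axis=1)   # coarse-graining
            mse.append(sample_entropy(coarse, m=m, r=r))
        return np.array(mse)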
in stage ii, the data is analyzed using the beat-to-beat or cycle-to-cycle time-series data that is of interest. a new dataset is analyzed (step ) with the same algorithm used in step , the ann classifies the output of the algorithm in step , and the result of the ann classification is then mapped into the physiological phase space in step . this methodology reduces the dimensionality of multiscale variability dynamics in a clinically relevant manner, thereby facilitating the development of clinician-centric visualization tools that can be implemented in a bedside display, and easily integrated in the icu workflow as a generalized early warning system for clinical decompensation in icu patients [ ] . any algorithm that quantifies multiscale variability dynamics [ ] [ ] can be used to process the waveform data in order to classify the information extracted from the raw data in an intuitive and physiologically relevant manner [ ] [ ] , and thus to facilitate the incorporation of subtle and dynamic fluctuations in physiological waveform data. by assessing the current status of a patient in the icu, the system will provide a wealth of information on future trajectories for extracting related clinical information [ ] [ ] [ ] . the amount of data that is available for clinicians to use in support of real-time patient care at the bedside is growing rapidly as a result of advances in medical monitoring and imaging technology. advances in informatics, whether through data acquisition, physiologic alarm detection, or signal analysis and visualization for decision support have the potential to markedly improve patient treatment in icus. clinical monitors have the ability to collect and visualize important numerics or waveforms, but more work is needed to interface to the monitors and acquire and synchronize multimodal physiological data across a diverse set of clinical devices. patient monitors offer the opportunity to acquire a number of different physiological signals in a single device, but in certain cases there are other monitors and devices whose data is critical to patient care, but do not interface to the patient monitor. the time framework that we are developing is directly addressing this unmet clinical need. an integrated solution for multimodal data acquisition, parsing and visualization in the icu (insma) presented in this paper is an important first step in achieving this overall vision [ ] . particularly in the neuro-intensive care unit, there are a variety of different devices that provide valuable information for patient care that do not interface directly to a patient monitor including eeg signal data, real-time tissue blood flow (perfusion) data, and advanced hemodynamic data monitoring (e.g. continuous cardiac output) that are cornerstones in the management of critically ill patients. there are options, for example, with nihon-kohden eeg acquisition systems to collect patient vitals (similar to a bedside patient monitor) as well as interface to specialized devices such as for hemodynamic monitoring. simultaneous acquisition of data from philips patient monitors and nihon-kohden eeg systems in the icu was done to augment data provided in the mimic study [ ] . the objective of data acquisition was to stream real-time data from both monitors for archiving in a single biorepository. this provides valuable data for research, but the intent of time is to stream data for real-time patient care at the bedside. 
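a minimal sketch of the classification stage of the patient-state tracking pipeline described above: multiscale variability features (for example, the mse curve of an ecg segment) are fed to a small neural-network classifier whose output is mapped to a discrete set of physiological states. the state labels, feature layout and network size are illustrative assumptions, not the prototype's actual configuration.

    # Illustrative classifier for multiscale variability features.
    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    STATES = ["stable", "deteriorating", "critical"]   # hypothetical state labels

    def train_state_classifier(variability_features, state_labels):
        """variability_features: (n_segments, n_scales) MSE values per segment;
        state_labels: integer indices into STATES."""
        clf = make_pipeline(StandardScaler(),
                            MLPClassifier(hidden_layer_sizes=(32, 16),
                                          max_iter=2000, random_state=0))
        clf.fit(variability_features, state_labels)
        return clf

    def track_patient_state(clf, new_features):
        """Map each new segment's features to a state label."""
        return [STATES[i] for i in clf.predict(np.atleast_2d(new_features))]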
insma is an important first step, and we have also developed techniques for synchronizing data acquisition from a variety of different icu devices as a core technology for future implementations of time in the icu. we have also demonstrated that patient data acquired from the patient monitor can be used for patient state tracking. the prototype system we developed was optimized to identify the type of dynamics observed in cardiac (ecg or blood pressure) beat-to-beat time-series data collected from icu patients. the prototype system has been tested using icu patient data from ecg to understand how variability in the heartbeat time-series can be used to dynamically track patient state [ ]. in the current development of insma and time, we have implemented the insma software on lenovo thinkcentre m computers, and currently have one system connected to the philips patient monitor in the neurosurgery icu at university hospitals cleveland medical center under the direction of dr. degeorgia to continuously collect patient data. we are completing the development of additional insma units that will be connected to each of the neurosurgery icu beds. the insma system operates in the background once set up for data collection at the bedside and does not require attention from any clinical icu personnel. insma allows the client to send messages to request patient demographic information, and we implemented a patient demographic request function that monitors any modification of patient information. new patient demographic information will be entered when patients admitted into the icu are connected to the monitor, and this information will be requested and stored in the raw data file. when the patient is discharged and a new patient is admitted and connected to the monitor, the parsing algorithm will capture the change of patient demographic information (e.g. first name, last name, age, weight) and a new patient archive corresponding to the new patient identifier will be created, as shown in fig. . each unit is equipped with a wireless communication link that supports remote monitoring of the insma units only from within the hospital firewall, to protect the privacy and security of the data, and that also moves the data to a permanent secure data storage unit. example data from patient monitoring is shown in fig. . the future of critical care will require "information management" that includes the real-time collection, integration, and interpretation of various types of physiological data from multiple sources. possible research work will focus on ( ) the integration and analysis of massive heterogeneous medical data to support scientific decision-making with machine learning methods [ ] [ ], and ( ) the acquisition and processing of vast amounts of multi-channel, high-density and real-time streaming data using multivariate and nonlinear time series analysis methods to facilitate rapid diagnosis and treatment [ ]. patient care in the icu can be significantly improved through the application of complex system analysis and information management methods.
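the archive-rotation behaviour described above can be sketched as follows: when the parsed stream reports demographics that differ from the current patient, the current archive is closed and a new one keyed to the new patient identifier is started. field names are assumptions for illustration.

    # Illustrative sketch of demographic-change detection and archive rotation.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Demographics:
        patient_id: str
        first_name: str
        last_name: str
        age: int
        weight_kg: float

    class ArchiveManager:
        def __init__(self):
            self.current = None          # demographics of the patient being archived
            self.archives = {}           # patient_id -> list of parsed records

        def on_parsed_record(self, demo: Demographics, record: dict):
            if demo != self.current:     # admission of a new patient detected
                self.current = demo
                self.archives.setdefault(demo.patient_id, [])
            self.archives[demo.patient_id].append(record)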
daily cost of an intensive care unit day: the contribution of mechanical ventilation
evaluation of acute physiology and chronic health evaluation iii predictions of hospital mortality in an independent database
estimating lives and dollars saved from universal adoption of the leapfrog safety and quality standards
clinician-in-the-loop annotation of icu bedside alarm data
big data analytics in healthcare: promise and potential
technical challenges related to implementation of a formula one real time data acquisition and analysis system in a paediatric intensive care unit
waveformecg: a platform for visualizing, annotating, and analyzing ecg data
vital recorder - a free research tool for automatic recording of high-resolution time-synchronised physiological data from multiple anaesthesia devices
intensive care window: a multi-modal monitoring tool for intensive care research and practice
information technology in critical care: review of monitoring and data acquisition systems for patient care and research
streamlining data management workflow
acquiring, modeling, and predicting intracranial pressure in the intensive care unit
entropy module and the bispectral index® monitor during propofol-remifentanil anesthesia
artifacts in research data obtained from an anesthesia information and management system
university of queensland vital signs dataset: development of an accessible repository of anesthesia patient monitoring data for research
context aware image annotation in active learning
intensive care window: a multi-modal monitoring tool for intensive care research and practice
data acquisition and complex systems analysis in critical care: developing the intensive care unit of the future
a framework for patient state tracking by classifying multiscalar waveform features
neurocritical care informatics: translating raw data into bedside action
system for collecting biosignal data from multiple patient monitoring systems
physiologic data acquisition system and database for the study of disease dynamics in the intensive care unit
learning-based adaptation framework for elastic software systems
context aware image annotation in active learning with batch mode
eliminating search intent bias in learning to rank
information extraction from free text in clinical trials with knowledge-based distant supervision
knowledge-guided text structuring in clinical trials
complex query recognition based on dynamic learning mechanism
a common gene expression signature analysis method for multiple types of cancer
deep learning for heterogeneous medical data analysis. world wide web
opinion spam detection based on heterogeneous information network
raim: recurrent attentive and intensive model of multimodal patient monitoring data
integrated system for icu multimodal data acquisition, analysis and visualization
yingcheng sun: conceptualization, methodology, software, visualization, writing - original draft. fei guo: conceptualization, software, methodology, visualization. farhad kaffashi: conceptualization, methodology, formal analysis. frank j. jacono: resources, data curation, validation. michael degeorgia: resources, data curation, validation. kenneth a. loparo: conceptualization, funding acquisition, investigation, methodology.
this work was supported in part by ahrq grant r hs - a (pi: dr. leo kobayashi).
☒ the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
key: cord- -uy r lt authors: greenspan, hayit; san josé estépar, raúl; j. niessen, wiro; siegel, eliot; nielsen, mads title: position paper on covid- imaging and ai: from the clinical needs and technological challenges to initial ai solutions at the lab and national level towards a new era for ai in healthcare date: - - journal: med image anal doi: . /j.media. . sha: doc_id: cord_uid: uy r lt
* corresponding author: hayit@eng.tau.ac.il; hayitg@gmail.com
in this position paper, we provide a collection of views on the role of ai in the covid- pandemic, from clinical requirements to the design of ai-based systems, to the translation of the developed tools to the clinic. we highlight key factors in designing system solutions - per specific task; as well as design issues in managing the disease at the national level. we focus on three specific use-cases for which ai systems can be built: early disease detection, management in a hospital setting, and building patient-specific predictive models that require the combination of imaging with additional clinical data. infrastructure considerations and population modeling in two european countries will be described. this pandemic has made the practical and scientific challenges of making ai solutions very explicit. a discussion concludes this paper, with a list of challenges facing the community in the ai road ahead. the covid- pandemic surprised the world with its rapid spread and has had a major impact on the lives of billions of people. imaging is playing a role in the fight against the disease, in some countries as a key tool, from screening and diagnosis through the entire treatment process, but in other countries, as a relatively minor support tool. guidelines and diagnostic protocols are still being defined and updated in countries around the world. where enabled, computed tomography (ct) of the thorax has been shown to provide an important adjunctive role in diagnosing and tracking progress of covid- in comparison to other methods such as monitoring of temperature/respiratory symptoms and the current gold standard, molecular testing, using sputum or nasopharyngeal swabs. several countries (including china, netherlands, russia and more) have elected to use ct as a primary imaging modality, from the initial diagnosis through the entire treatment process. other countries, such as the us and denmark, as well as developing countries (southeast asia, africa), are using mostly conventional radiographic (x-ray) imaging of the chest (cxr). in addition to establishing the role of imaging, this is the first time ai, or more specifically, deep learning approaches have the opportunity to join in as tools on the frontlines of fighting an emerging pandemic. these algorithms can be used in support of emergency teams, real-time decision support, and more. in this position paper, a group of researchers provide their views on the role of ai, from clinical requirements to the design of ai-based systems, to the infrastructure necessary to facilitate national-level population modeling.
many studies have emerged in the last several months from the medical imaging community with many research groups as well as companies introducing deep learning based solutions to tackle the various tasks: mostly in detection of the disease (vs normal), and more recently also for staging disease severity. for a review of emerging works in this space we refer the reader to a recent review article shi et al. ( a) that covers the first papers published, up to and including march -in the entire pipeline of medical imaging and analysis techniques involved with covid- , including image acquisition, segmentation, diagnosis, and follow-up. we also want to point out several special issues in this space-including ieee special issue of tmi, april ; ieee special issue of jhbi, ; as well as the current special issue of media. in the current position paper, it is not our goal to provide an overview of the publications in the field, rather we present our own experiences in the space and a joint overview of challenges ahead. we start with the radiologist perspective. what are the clinical needs for which ai may provide some benefits? we follow that with an introduction to ai based solutions -the challenges and roadmap for developing ai-based systems, in general and for the covid- pandemic. in section of this paper we focus on three specific use-cases for which ai systems can be built: detection, patient management, and predictive models in which the imaging is combined with additional clinical features. system examples will be briefly introduced. in section we present a different perspective of ai in its role in the upstream and downstream management of the pandemic. specific infrastructure considerations and population modeling in two european countries will be described in section . a discussion concludes this paper, with a list of challenges facing the community in our road ahead. as of this writing, according to the johns hopkins resource center (https://coronavirus.jhu.edu/), there are, approximately, . million confirmed cases with , deaths throughout the world, with , deaths in new york state alone. the rate of increase in cases has continued to rise as demonstrated by the log scale plot in figure . the most common symptoms of the disease, fever, fatigue, dry cough, runny nose, diarrhea and shortness of breath are non-specific and are common to many people with a variety of conditions. the mean incubation period is approximately days and the virus is probably most often transmitted by asymptomatic patients. knowing who is positive for the disease has critical implications for keeping patients away from others. unfortunately, the gold standard lab test, real time reverse transcription polymerase chain reaction (rt-pcr) which detects viral nucleic acid, has not been universally available in many areas and its sensitivity varies considerably depending on how early patients are tested in the course of their disease. recent studies have suggested that rt-rpr has a sensitivity of only - %. consequently, repeat testing is often required to ensure a patient is actually free of the disease. fang et al (fang et al. ( ) ) found that for the patients they studied with thoracic ct and rt-pcr assay performed within days of each other, the sensitivity of ct for covid- was % compared to rt-pcr sensitivity of % (p < . ). on cxr and ct exams of the thorax, findings are usually bilateral ( %) early in the progression of disease and even more likely bilateral ( %) in later stages , zhao et al. ( ) ). 
the typical presentation in icu patients is bilateral subsegmental areas of air-space consolidation. in non-icu patients, classic findings are transient subsegmental consolidation early and then bilateral ground glass opacities that are typically peripheral in the lungs. pneumothorax (collapsed lung) and pleural fluid or cavitation (due to necrosis) are usually not seen. distinctive patterns of covid- such as crazy paving in which ground glass opacity is combined with superimposed interlobular and intralobular septal thickening and the reverse halo sign where a ground glass region of the lung is surrounded by an irregular thick wall have been previously described in other diseases but are atypical of most pneumonias. the use of thoracic ct for both diagnosis of disease and tracking has varied tremendously from country to country. while countries such as china and iran utilize it for its very high sensitivity to disease in the diagnosis and tracking of progression of disease, the prevailing recommendation in the us and other countries is to only use lab studies for diagnosis, use chest radiography to assess severity of disease, and to hold off on performing thoracic ct except for patients with relatively severe and complicated manifestations of disease (simpson et al. ( ) ). this is due to concerns in the us about exposure of radiology staff and other patients to covid- patients and the thought that ct has limited incremental value over portable chest radiographs which can be performed outside the imaging department. additionally, during a surge period, the presump-tion is made that the vast majority of patients with pulmonary symptoms have the disease, rendering ct as a relatively low value addition to the clinical work-up. as a diagnostic tool, ct offers the potential to differentiate patients with covid- not only from normal patients, but from those with other causes of shortness of breath and cough such as tb or other bacterial or alternatively, other viral pneumonias, bronchitis, heart failure, and pulmonary embolism. as a quantitative tool, it offers the ability to determine what percentage of the lung is involved with the disease and to break this down into areas of ground glass density, consolidation, collapse, etc. this can be evaluated on serial studies which may be predictive of a patient's clinical course and may help to determine optimal clinical treatment. complications of covid- are not limited to acute lung parenchymal disease. these patients have coagulopathies and are at increased risk for pulmonary embolism. diffuse vascular inflammation can result in pericarditis and pericardial effusions. renal and brain manifestations have been described by many authors and are increasingly recognized clinically in covid- patients. long term lung manifestations will not be apparent for many months or years, but there is the potential that these patients will develop higher rates of chronic obstructive pulmonary disease (copd) such as emphysema, chronic bronchitis and asthma than the general population. objective metrics for assessment and follow-up of these complications of the disease would be very valuable from a clinical perspective. the extraordinarily rapid spread of the covid- pandemic has demonstrated that a new disease entity with a subset of relatively unique characteristics can pose a major new clinical challenge that requires new diagnostic tools in imaging. 
the typical developmental cycle and large number of studies required to develop ai algorithms for various disease entities is much too long to respond effectively to produce these software tools on demand. this is complicated by the fact that the disease can have different manifestations (perhaps due to different strains) in different regions of the world. this suggests the strong need to develop software more rapidly, perhaps using transfer learning from existing algorithms, to train on a relatively limited number of cases, and to train on multiple datasets in various locations that may not be able to be easily combined due to privacy and security issues. it also suggests that we determine how to balance regulatory requirements for adequate testing of the safety and efficacy of these algorithms against the need to have them available in a timely manner to impact clinical care. ai technology, in particular deep learning image analysis tools, can potentially be developed to support radiologists in the triage, quantification, and trend analysis of the data. ai solutions have the potential to analyze multiple cases in parallel to detect whether chest ct or chest radiography reveals any abnormalities in the lung. if the software suggests a significantly increased likelihood of disease, the case can be flagged for further review by a radiologist or clinician for possible treatment/quarantine. such systems, or variations thereof, once verified and tested can become key contributors in the detection and control of patients with the virus. another major use of ai is in predictive analytics: foreseeing events for timely intervention. predictive ai can be potentially applied at three scales: the individual scale, the hospital scale, and the societal scale. an individual may go through various transitions from healthy to potentially contaminated, symptomatic, etc. as depicted in figure . at the individual level, we may use ai for computing risk of contamination based on location, risk of severe covid- based on co-morbidities and health records, risk of acute respiratory distress syndrome (ards) and risk of mortality to help guide testing, intervention, hospitalization and treatment. quantitative ct or chest radiographic imaging may play an important role in risk modeling for the individual, and especially in the risk of ards. at the hospital level, ai for imaging may for example be used for workflow improvement by (semi-) automating radiologist's interpretations, and by forecasting the future need for icu and ventilator capacity. at the societal level ai may be used in forecasting hospital capacity needs and may be an important measure to aid in assessing the need for lock downs and reopenings. so far, we have here concentrated on disease diagnosis and management, but imaging with ai may also have a role to play in relation to late effects like neurological, cardiovascular, and respiratory damage. before entering the discussion on specific usages of ai to ease the burden of the pandemic, we briefly describe the standard procedure of creating an ai solution in order to clarify the nomenclature. the standard way of developing deep learning algorithms and systems entails several phases (greenspan et al. ( ) , litjens et al. ( ) ) : i. data-collection, in which a large amount of data samples need to be collected from predefined categories; expert annotations are needed for groundtruthing the data; ii. training phase in which the collected data are used to train network models. 
each category needs to be represented well enough so that the training can generalize to new cases that will be seen by the network in the testing phase. in this learning phase, the large number of network parameters (typically on the order of millions) are automatically defined; iii. testing phase, in which an additional set of cases not used in training is presented to the network and the output of the network is tested statistically to determine its categorization accuracy. finally, iv. validation, in which the software must be validated on independent cohorts to ensure that performance characteristics generalize to unseen data from other imaging sources, demographics, and ethnicity. in the case of a new disease, such as the coronavirus, datasets are just now being identified and annotated. there are very limited data sources as well as limited expertise in labeling the data specific to this new strain of the virus in humans. accordingly, it is not clear that there are enough examples to achieve clinically meaningful learning at this early stage of data collection, despite the increasingly critical importance of this software. solutions to this challenge that may enable rapid development include the combination of several technologies: transfer learning utilizes pretraining on other, but somehow statistically similar, data. in the general domain of computer vision, imagenet has been used for this purpose (donahue et al. ( )). in the case of covid- this may be provided by existing databases of annotated images of patients with other lung infections. data augmentation is a trick used from the beginning of applying convolutional neural networks (cnns) to imaging data (lecun et al. ( )), in which data are transformed to provide extra training data. typically, rotations, reflections, scaling or even group actions beyond the affine group can be explored. other technologies include semi-supervised learning and weak learning when labels are noisy and/or missing (cheplygina et al. ( )). thus, the underlying approach to enable rapid development of new ai-based capabilities is to leverage the ability to modify and adapt existing ai models and combine them with initial clinical understanding to address the new challenges and new disease entities, such as covid- . in this section we briefly review three possible system developments: ai systems for detection and characterization of disease, ai systems for measuring disease severity and patient monitoring, and ai systems for predictive modeling. each category will be reviewed briefly and a specific system will be described, with a focus on the ai-based challenges and solutions. the vast majority of efforts for the diagnosis of covid- have been focused on detecting unique injury patterns related to the infection. automated recognition of these patterns became an ideal challenge for cnns trained on their appearance. one example of a system for covid- detection and analysis is shown in figure , which presents an overview of the analysis conducted in gozes et al. ( a). in general, as is shown here, automated solutions are comprised of several components, each based on a network model that focuses on a specific task. in the presented example, both d and d analysis are conducted, in parallel. d analysis of the imaging studies is utilized for detection of nodules and focal opacities using nodule-detection algorithms, with modifications to detect ground-glass (gg) opacities.
a d analysis of each slice of the case is used to detect and localize covid- diffuse opacities. if we focus on the d analysis, we again see that multiple steps are usually defined. the first step is the extraction of the lung area as a region of interest (roi) using a lung segmentation module. the segmentation step removes image portions that are not relevant for the detection of within-lung disease. within the extracted lung region, a covid- detection procedure is conducted, utilizing one of a variety of possible schemes and corresponding networks. for example, this step can be a procedure for (undefined) abnormality detection, or a specific pattern learning task. in general, a classification neural network (covid- vs. not covid- ) is a key component of the solution. such networks, which are mostly cnn based, enable the localization of covid- manifestations in each d slice that is selected, in what have become known as heat maps per d slice. to provide a complete review of the case, both the d and d analysis results can be merged. several quantitative measurements and output visualizations can be used, including per-slice localization of opacities, as in figure (a), and a d volumetric presentation of the opacities throughout the lungs, as shown in figure (b), which presents a d visualization of all gg opacities. several studies have shown the ability to segment and classify the extracted lesions using neural networks to provide a diagnostic performance that matches a radiologist rating (zhang et al. ( ); bai et al. ( )). in zhang et al. ( ), manually annotated ct slices were used for seven classes, including background, lung field, consolidation (cl), ground-glass opacity (ggo), pulmonary fibrosis, interstitial thickening, and pleural effusion. after a comparison between different semantic segmentation approaches, they selected deeplabv as their segmentation backbone (chen et al. ( )). the diagnostic system was based on a neural network fed by the lung-lesion maps. the system was designed to classify normals from common pneumonia and covid- specific pneumonia. their results show a covid- diagnostic accuracy of . % tested in subjects. in bai et al. ( ), a direct classification of covid- specific pneumonia versus other etiologies was performed using an efficientnet b network (tan and le ( )) followed by a two-layer fully connected network to pool the information from multiple slices and provide a patient-level diagnosis. this system yielded a % accuracy in a testing set of subjects compared to an % average accuracy for six radiologists. these two examples exemplify the power of ai to perform at a very high level that may augment the radiologist, when designed and tested for a very narrow and specific task within a de-novo diagnostic situation. time delay in covid- testing using rt-pcr can be overcome with integrative solutions. augmented testing using ct, clinical symptoms, and standard white blood cell (wbc) panels has been proposed in mei et al. ( ). the authors show that their ai system, which integrates both sources of information, is superior to an imaging-alone cnn model as well as a machine learning model based on non-imaging information for the diagnosis of covid- . integrative approaches can overcome the lack of diagnostic specificity of ct imaging for covid- (rubin et al. ( )). it is well understood that chest radiographs (cxr) have lower resolution and contain much less information than their ct counterparts.
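a minimal sketch of the two-stage approach used in the studies described above: an imagenet-pretrained backbone is fine-tuned to classify individual ct slices (transfer learning), and the per-slice probabilities are pooled into a patient-level score. resnet50 stands in for the efficientnet backbone of the cited work, and mean pooling stands in for their learned two-layer pooling network.

    # Slice-level transfer learning plus patient-level pooling (illustrative).
    import numpy as np
    import tensorflow as tf

    def build_slice_classifier(input_shape=(224, 224, 3), n_classes=2):
        # grayscale CT slices are assumed to be replicated to 3 channels
        backbone = tf.keras.applications.ResNet50(
            include_top=False, weights="imagenet",
            input_shape=input_shape, pooling="avg")
        x = tf.keras.layers.Dropout(0.5)(backbone.output)
        out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
        model = tf.keras.Model(backbone.input, out)
        model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    def patient_level_score(model, slices):
        """slices: preprocessed lung-masked slices, shape (n_slices, 224, 224, 3)."""
        probs = model.predict(np.asarray(slices), verbose=0)
        return probs[:, 1].mean()   # mean pooling of per-slice probabilities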
for example, for covid- patients, the lungs may be so severely infected that they become fully opacified, obscuring details on an x-ray and making it difficult to distinguish between pneumonia, pulmonary edema, pleural effusions, alveolar hemorrhage, or lung collapse (figure ). still, many countries are using cxr information for initial decision support as well as throughout the patient hospitalization and treatment process. deep learning pipelines for cxr opacity and infiltration scoring exist. in most publications seen to date, researchers utilize existing public pneumonia datasets, which were available prior to the spread of coronavirus, to develop network solutions that learn to detect pneumonia on a cxr. in (selvan et al. ( )), an attempt to address the problem of heavily opacified lungs is presented using variational imputation. a deep learning pipeline based on variational autoencoders (vae) has shown in pilot studies > % accuracy in separating covid- patients from other patients with lung infections, both bacterial and viral. a systematic evaluation of one of those systems has demonstrated comparable performance to a chest radiologist (murphy et al. ( )). this demonstrates the capability of recognizing covid- associated patterns using the cxr data. we view these results as preliminary, to be confirmed with a more rigorous experimental setup which includes access to covid- and other infections from the same sources with identical acquisition technology, time-window, ethnicity, demographics, etc. such rigorous experiments are critical in order to assess the clinical relevance of the developed technology. in this section we focus on the use of ai for hospitalized patients. image analysis tools can support measurement of the disease extent within the lungs, thus generating quantification for the disease that can serve as an image-based biomarker. such a biomarker may be used to assess the relative severity of patients in the hospital wards, enable tracking of disease severity over time, and thus assist in the decision-making process of the physicians handling the case. one such biomarker, termed the corona score, was recently introduced in gozes et al. ( a,b). the corona score is a measure of the extent of opacities in the lungs. it can be extracted in ct and in cxr cases. figure presents a plot of corona-score measurements per patient over time, in ct cases. using the measure, we can assess the relative severity of the patients (left) as well as extract a model for disease burden over the course of treatment (right). additional very valuable information on characterization of disease manifestation can be extracted as well, such as the locations of opacities within the lungs, the opacity burden within specific lobes of the lungs (using a lung lobe segmentation module), and an analysis of the texture of the opacities using classification of patches extracted from detected covid- areas (using a patch-based classification module). these characteristics are important biomarkers with added value for patient monitoring over time. clinical covid- lung infections are diagnosed and monitored with ct or cxr imaging, where opacities, their type and extent, may be quantified. the picture of radiological findings in covid- patients is complex (wong et al. ( )), with mixed patterns: ground-glass opacities, opacities with a rounded morphology, peripheral distribution of disease, consolidation with ground-glass opacities, and the so-called crazy-paving pattern.
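a simplified volumetric opacity-burden measure in the spirit of the corona score described above, assuming binary masks produced by the lung-segmentation and opacity-detection modules; the exact score used in the cited work may be defined differently.

    # Opacity burden per scan and per lobe from segmentation masks (illustrative).
    import numpy as np

    def opacity_burden(lung_mask, opacity_mask, voxel_volume_mm3):
        """Return (opacity volume in ml, fraction of lung volume that is opacified)."""
        lung_mask = lung_mask.astype(bool)
        opacity_mask = opacity_mask.astype(bool) & lung_mask   # keep opacities inside lungs
        opacity_ml = opacity_mask.sum() * voxel_volume_mm3 / 1000.0
        fraction = opacity_mask.sum() / max(int(lung_mask.sum()), 1)
        return opacity_ml, fraction

    def per_lobe_burden(lobe_labels, opacity_mask, voxel_volume_mm3):
        """lobe_labels: integer mask (0 = background, 1..5 = lobes) from a lobe-segmentation module."""
        burden = {}
        for lobe in np.unique(lobe_labels):
            if lobe == 0:
                continue
            in_lobe = lobe_labels == lobe
            burden[int(lobe)] = opacity_burden(in_lobe, opacity_mask, voxel_volume_mm3)
        return burden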
early reports of longitudinal developments monitored by cxr (shi et al. ( b)) indicate that cxr findings occur before the need for clinical intervention with oxygen and/or ventilation. this fosters the hypothesis that cxr imaging and quantification of findings are valuable in the risk assessment of the individual patient developing severe covid- . in the capital region of denmark, it is standard practice to acquire a cxr for covid- patients. the clinical workflow during the covid- pandemic does not in general allow for manual quantitative scoring of radiographs, for productivity reasons. making use of the cxrs already recorded for real-time risk assessment therefore requires automated methods for quantification of image findings. several scoring systems for the severity of covid- lung infection, adapted from general lung infection schemes, have been proposed (wong et al. ( ), shi et al. ( b), cohen et al. ( )). above, in figure , it is shown how opacities may be located in ct images. similar schemes may be used for regional opacity scoring in cxr, as shown in figure . for the management and risk profiling of the individual patient, imaging does not tell the full story. important risk factors include age, bmi, and co-morbidities (especially diabetes, hypertension, asthma, chronic respiratory or heart diseases) (jordan et al. ( )). combining imaging with this type of information from the ehr and with data representing the trajectory of change over time enhances the ability to determine and predict the stage of disease. an early indication is that cxrs contribute significantly to the prediction of the probability for a patient to be on a ventilator. here we briefly summarise the patient trajectory prognosis setup: in preliminary studies on the cohort from the capital and zealand regions of denmark, we have combined clinical information from electronic health records (ehr), defining variables relating to vital parameters, comorbidities, and other health parameters, with imaging information. modeling was performed using a simple random forest implementation in a -fold cross-validation fashion. figure shows, as an illustration, auc values for the prediction of outcomes in terms of hospitalisation, requirement for a ventilator, admission to an intensive care unit, and death. these have been illustrated on , covid- positive subjects from the zealand and capital regions of denmark. these are preliminary, unconsolidated results for illustrative purposes. however, they support the feasibility of an algorithm to predict the severity of covid- manifestations early in the course of the disease. the combination of cxr into these prognostic tools has been performed by including a number of quantitative features per lung region as a feature vector in the random forest described above. imaging has played a unique role in the clinical management of the covid- pandemic. public health authorities of many affected countries have been forced to implement severe mitigation strategies to avoid the wide community spread of the virus (parodi and liu ( )). mitigation strategies put forth have focused on acute disease management, and the plethora of automated imaging solutions that have emerged in the wake of this crisis have been tailored toward this emergent need. until effective therapy is proven to prevent the widespread dissemination of the disease, mitigation strategies will be followed by more focused efforts and containment approaches aimed at avoiding the high societal cost of new confinement policies.
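a minimal sketch of the prognosis setup described above: ehr-derived variables are concatenated with per-region quantitative cxr features and fed to a random forest, evaluated with cross-validated auc. the feature layout, number of trees and number of folds are illustrative assumptions.

    # Random forest outcome prediction with cross-validated AUC (illustrative).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def outcome_auc(ehr_features, cxr_region_features, outcome, n_folds=10):
        """ehr_features: (n_patients, n_ehr) array; cxr_region_features:
        (n_patients, n_regions) quantitative opacity scores; outcome: binary
        labels, e.g. ventilator requirement."""
        X = np.hstack([ehr_features, cxr_region_features])
        clf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                                     random_state=0)
        scores = cross_val_score(clf, X, outcome, cv=n_folds, scoring="roc_auc")
        return scores.mean(), scores.std()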
in that regard, imaging augmented by ai can also play a crucial role in providing public health officials with pandemic control tools. opportunities in both upstream infection management and downstream solutions related to disease resolution, monitoring of recurrence and health security will be emerging in the months to come as economies reopen to normal life. pandemic control measurements in the pre-clinical phase of the infection may seek to identify those subjects that are more susceptible to the disease due to their underlying risk factors that lead to the acute phase of covid- infections. several epidemiological factors, including age, obesity, smoking, chronic lung disease hypertension, and diabetes, have been identified as risk factors (petrilli et al. ( ) ). however, there is a need to understand further risk factors that can be revealed by image-based studies. imaging has shown to be a powerful source of information to reveal latent traits that can help identify homogeneous subgroups with specific determinants of disease (young et al. ( ) ). this kind of approach could be deployed in retrospective databases of covid- patients with pre-infection imaging to understand why some subjects seem to be much more prone to progression of the viral infection to acute pulmonary inflammation. the identification of high-risk populations by imaging could enable targeted preventive measurements and precision medicine approaches that could catalyze the development of curative and palliative therapies. identification of molecular pathways in those patients at a higher risk may be crucial to catalyze the development of much needed host-targeting therapies. the resolution of the infection has been shown to involve recurrent pulmonary inflammation with vascular injury that has led to post-intensive care complications (ackermann et al. ( ) ). detection of micro embolisms is a crucial task that can be addressed by early diagnostic methods that monitor vascular changes related to vascular pruning or remodeling. methods developed within the context of pulmonary embolism detection, and clot burden quantification could be repurposed for this task ). another critical aspect of controlling the pandemic is the need to monitor infection recurrence as the immunity profile for sars-cov- is still unknown (kirkcaldy et al. ( ) ). identifying early pulmonary signs that are compatible with covid- infection could be an essential tool to monitor subjects that may relapse in the acute episode. ai methods have shown to be able to recognize covid- specific pneumonia identified on radiographic images (murphy et al. ( ) ). the accessibility and potential portability of the imaging equipment in comparison to ct images could enable early pulmonary injury screening if enough specificity can be achieved in the early phases of the disease. eventually, some of those tools might facilitate the implementation of health security screening solutions that revolve around the monitoring of individuals that might present compatible symptoms. although medical imaging solutions might have a limited role in this space, other kinds of non-clinical imaging solutions such as thermal imaging may benefit from solutions that were originally designed in the context of x-ray or ct screening. 
one of the fascinating aspects that has emerged around the utilization of ai-based imaging approaches to manage the covid- pandemic has been the speed of prototyping imaging solutions and their integration in end-to-end applications that could be easily deployed in a healthcare setting and even in ad-hoc makeshift care facilities. this pandemic has shown the ability of deep neural networks to enable the development of end-to-end products based on a model representation that can be executed on a wide range of devices. another important aspect has been the need for large-scale deployments due to the high incidence of the covid- infection. these deployments have been empowered by the use of cloud-based computing architectures and multi-platform web-based technologies. multiple private and open-source systems have been rapidly designed, tested, and deployed in the last few months. the requirements around the utilization of these systems in the general population for pandemic control are:
• high-throughput: the system needs to have the ability to perform scanning and automated analysis within several seconds if screening is intended.
• portable: the system might need to reach the community without bringing people to hospital care settings, to avoid nosocomial infections.
• reusable: imaging augmented with ai has emerged as a highly reusable technique with scalable utilization that can adapt to variable demand.
• sensitive: the system needs to be designed with high sensitivity and specificity to detect early signs of disease.
• private: systems have to protect patient privacy by minimizing the exchange of information outside of the care setting.
web-based technologies that provide embedded solutions to deploy neural network systems have emerged as one of the most promising implementations that fulfill those requirements. multiple public solutions in the context of chest x-ray detection of early pneumonia and covid- compatible pneumonia have been prototyped, as shown in figure . the covictory app, part of the slow-down covid project (www.slowdowncovid .org), implements a classification neural network for the detection of mild pneumonia as an early risk detection of radiographic changes compatible with covid- . the developers of this system based their system on a network architecture recently proposed for tuberculosis detection that has a very compact and efficient design well-suited for deployment on mobile platforms (pasa et al. ( )). the database used to train the network was based on imaging from three major chest x-ray databases: nih chest x-ray, chexpert, and padchest. the developers sub-classified x-ray studies labeled as pneumonia into mild versus moderate/severe pneumonia by consensus of multiple readers using spark crowd, an open source system for consensus building (rodrigo et al. ( )).
fig. . illustration of a public ai system for covid- compatible pneumonia on chest x-rays from two covid- subjects using the covictory app (www.covictoryapp.org), with mild pneumonia signs (left) and more severe disease (right).
another example is the coronavirus xray app, which included public-domain images from covid- patients to classify images into three categories: healthy, pneumonia and covid- . both systems were implemented as static web applications in javascript using tensorflow-js. although the training was carried out using customized gpu hardware, the deployment of trained models is intrinsically multi-platform and multi-device thanks to the advancement of web-based technologies.
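as an illustration of how such a web deployment can be produced, the sketch below converts a trained keras classifier into the tensorflow.js format that a static web application can load; this is a generic example and not the actual build pipeline of the apps described above.

    # Export a trained Keras model for in-browser TensorFlow.js inference.
    import tensorflow as tf
    import tensorflowjs as tfjs   # pip install tensorflowjs

    def export_for_web(keras_model_path: str, web_model_dir: str):
        model = tf.keras.models.load_model(keras_model_path)
        # writes model.json plus sharded binary weight files that the browser can fetch
        tfjs.converters.save_keras_model(model, web_model_dir)

    # equivalent command-line form of the same conversion:
    #   tensorflowjs_converter --input_format=keras trained_model.h5 web_model/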
other commercial efforts like cad covid-xray (https://www. delft.care/cad covid/) has leveraged prior infrastructure used for the assessment of tuberculosis on x-ray to provide a readily deployable solution. the covid- crisis has seen the emergence of multiple observational studies to support research into understanding disease risk, monitoring disease trajectory, and for the development of diagnostic and prognostic tools, based on a variety of data sources including clinical data, samples and imaging data. all these studies share the theme that access to high quality data is of the essence, and this access has proven to be a challenge. the causes for this challenge to observational covid- research are actually the same ones that have hampered large scale data-driven research in the health domain over the last years. owing to the data collection that takes place in different places and different institutes, there is fragmentation of data, images and samples. moreover there is a lack of standardization in data collection, which hampers reuse of data. consequently, the reliability, quality and re-usability of data for datadriven research, including the development and validation of ai applications, is problematic. finally, depending on the sys-tem researchers and innovators are working in, ethical and legal frameworks are often unclear and may sometimes be (interpreted as being) obstructive. a coordinated effort is required to improve the accessibility to observational data for covid- research. if implemented for covid- , it can actually serve as a blueprint for large, multi-center observational studies in many domains. as such, addressing the covid- challenges also presents us with an opportunity, and in many places we are already observing that hurdles towards multicenter data accessibility are being addressed with more urgency. an example is the call by the european union for an action to create a pan-european cohort covid- including imaging data. in the netherlands, the health-ri c initiative aims to build a national health data infrastructure, to enable the re-use of data for personalized medicine, and similar initiatives exist in other countries. in light of the current pandemic, these initiatives have focused efforts on supporting observational covid- research, with the aim to facilitate data access to multi-center data. the underlying principle of these infrastructures is that by definition they will have to deal with the heterogeneous and distributed nature of data collection in the healthcare system. in order for such data to be re-usable, harmonisation at the source is required. this calls for local data stewardship, in which the different data types, including e.g. clinical, imaging and lab data, need to be collected in a harmonized way, adhering to international standards. here, the fair principle needs to be adopted, i.e. data needs to be stored such that they are findable, accessible, interoperable and reusable wilkinson et al. ( ) . for clinical data, it is not only important that the same data are collected (e.g. adhering to the world health organisation case report form (crf), often complemented with additional relevant data), but also that their values are unambiguously defined and are machine-readable. the use of electronic crfs (ecrfs) and accompanying software greatly supports this, and large international efforts exist to map observational data to a common data model, including e.g. the observational health data sciences and informatics (ohdsi) model. 
similarly, the imaging and lab data should be processed following agreed standards. in the health-ri c implementation, imaging data are pseudonymized using a computational pipeline that is shared between centers. for lab data, standard ontologies such as loinc can be employed. the covid- observational project will not only collect fair metadata describing the content and type of the data, but also data access policies for the data that are available. this will support the data search, request, and access functionalities provided by the platform. an illustration of the data infrastructure in health-ri c is provided in fig. . next to providing data for the development of ai algorithms, it is important to facilitate their objective validation. in the medical imaging domain, challenges have become a very popular way to objectively compare the performance of different algorithms. in the design of challenges, part of the data needs to be kept apart. it is therefore important that, while conducting efforts to provide access to observational covid- data, we already plan to use part of the data for designing challenges around relevant clinical use cases.
fig. . design of the covid- observational data platform. in order for hospitals to link to the data platform, they need to make their clinical, imaging and lab data fair. tools for data harmonization (fair-ification) are being shared between institutes. fair metadata (and in some cases fair data) and access policies are shared with the observational platform. this enables a search tool for researchers to determine what data resources are available at the participating hospitals. these data can subsequently be requested, and if the request is approved by a data access committee, the data will be provided, or information on how the data can be accessed will be shared. in subsequent versions of the data platform, distributed learning will also be supported, so that data can stay at its location.
during the pandemic, setting up such an infrastructure from scratch will not lead to timely implementation. health-ri c was already in place prior to the pandemic, and some of its infrastructures could be adjusted to start building a covid- observational data platform. in denmark, a similar initiative was not in place. however, in eastern denmark, the capital and zealand regions share a common data platform in all hospitals, with a common ehr and a pacs at each region, covering in total hospitals and . million citizens, making data collection and curation relatively simple. at continental scale, solutions are being created, but will likely not be in place during the first wave of the pandemic. the burdens to overcome are legal, political, and technical. access to un-consented data from patients follows different legal paths in different countries. in the uk, the department of health and social care issued on march , a notice simplifying the legal approval of covid- data processing. in denmark, usual regulations and standards were maintained, but authorities made an effort to grant permissions through the usual bodies on a fast track. as access to patient information must be restricted, not every researcher with any research goal can be granted access. without governance in place prior to an epidemic, access will be granted on an ad hoc and first-come-first-served basis, not necessarily leading to the most efficient data analysis.
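returning to the pseudonymization step mentioned above, the sketch below shows the kind of operation such a pipeline performs on dicom imaging data, using pydicom; it is illustrative only, not the health-ri pipeline, and the tag list is a minimal subset of a real de-identification profile.

    # Illustrative DICOM pseudonymization step.
    import hashlib
    import pydicom

    IDENTIFYING_TAGS = ["PatientName", "PatientBirthDate", "PatientAddress",
                        "OtherPatientIDs", "InstitutionName", "ReferringPhysicianName"]

    def pseudonymize(dicom_in: str, dicom_out: str, secret_salt: str):
        ds = pydicom.dcmread(dicom_in)
        # stable pseudonym so studies from the same patient remain linkable
        original_id = str(getattr(ds, "PatientID", ""))
        ds.PatientID = hashlib.sha256((secret_salt + original_id).encode()).hexdigest()[:16]
        for tag in IDENTIFYING_TAGS:
            if hasattr(ds, tag):
                setattr(ds, tag, "")       # blank out directly identifying fields
        ds.remove_private_tags()
        ds.save_as(dicom_out)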
finally, data are hosted in many different it systems, and the two major technical challenges lie in bringing data to a common platform and having a (in the eu, gdpr-) compliant technical setup for collaboration. building such infrastructure with proper security and data handling agreements in place is complex and will lead to substantial delays if it is not in place prior to the epidemic. in the netherlands, the health-ri c platform was in place. in denmark, the efforts have been constrained to the eastern part of the country, which shares a common ehr and pacs and has infrastructure in place for compliant data sharing at computerome. at a european scale, the commission launched the european covid-19 data platform in april 2020, building on existing hardware infrastructure. this was followed up by a call for establishing a pan-european covid-19 cohort. the funding decision will be made in august 2020. even though a tremendous effort has been put in place and the usual approvals of access and funding have been fast-tracked, proper infrastructures have not been created in time for the first wave in europe. the current covid-19 pandemic offers us historic challenges but also opportunities. it is widely believed that a substantial percentage of the (as of this writing) million confirmed cases and deaths and trillions of dollars of economic losses would have been avoided with adequate identification of those with active disease and subsequent tracking of the location of cases and prediction of emerging hotspots. imaging has already played a major role in diagnosis and in the tracking and prediction of outcomes, and has the potential to play an even greater role in the future. automated computer-based identification of the probability of disease on chest radiographs and thoracic ct, combined with tracking of disease, could have been utilized early on in the development of cases, first in wuhan, then other areas of china and asia, and subsequently europe, the united states and elsewhere. this could have been used to inform epidemiologic policy decisions as well as hospital resource utilization and, ultimately, patient care. this pandemic also demonstrated, perhaps for the first time in history, that a disease with relatively unique imaging and clinical characteristics could emerge and spread globally faster than the knowledge to recognize, diagnose and treat the disease. it also created a unique set of challenges and opportunities for the machine learning/ai community to work side by side and in parallel with clinical experts to rapidly train and deploy computer algorithms to treat an emerging disease entity. this required a combination of advanced techniques, such as the use of weakly annotated schemes to train models with relatively tiny amounts of training data, which have only become widely available recently, many months after the initial outbreak of disease. the imaging community as a whole has demonstrated that extremely rapidly developed ai software using existing algorithms can achieve high accuracy in detection of a novel disease process such as covid-19, as well as provide rapid quantification and tracking. the majority of research and development has focused on pulmonary disease, with developers using standard chest-ct dicom imaging data as input for algorithms designed to automatically detect and measure lung abnormalities associated with covid-19.
the analysis includes automatic detection of involved lung volumes, automatic measurement of disease relative to overall lung volume, and enhanced visualization techniques that rapidly depict which areas of the lungs are involved and how they change over time, in an intuitive manner that can be clinically useful. a variety of manuscripts describing automated detection of covid-19 cases have recently been published. when reviewing these manuscripts, one can see the following interesting trend: all focus on one of several key tasks, as defined herein; each publication has a unique system design that contains a set of network models, or a comparison across models; and the results are all very strong. the compelling results, such as the ones presented herein, may lead us to conclude that the task is solved; but is this the correct conclusion? it seems that the detection and quantification tasks are in fact solvable with our existing imaging analysis tools. still, there are several data-related issues of which we need to be aware. experimental evidence is presented on datasets of hundreds of cases, and we need to move to real-world settings, in which we will start exploring thousands and even more cases, with large variability. our systems to date are focusing on detection of abnormal lungs in a biased scenario of the pandemic in which there is a very high prevalence of patients presenting with the disease. once the pandemic declines substantially, the focus will shift immediately to the need to detect covid-19 among a wide variety of diseases, including other lung inflammatory processes, occupational exposures, drug reactions, and neoplasms. in that future, in which the prevalence of disease is lower, will our solutions that work currently be sensitive enough without introducing too many false positives? that is the crux of many of the current studies that have tested the different ai solutions within a very narrow diagnostic scope. there are many possibilities and promising directions, yet the unknown looms larger than the known. just as the current pandemic has changed the way many are thinking about distance learning, the practice of telemedicine, and overall safety in a non-socially-distanced society, it seems that we are similarly setting the stage, with our current on-the-fly efforts in algorithm development, for the future development and deployment of ai. we need to update infrastructure, including methods of communication and sharing of cases and findings, as well as reference databases and algorithms for research, locally, nationally and globally. we need to prove the strengths, build the models and make sure that the steps forward are such that we can continue and expand the use of ai, particularly just-in-time ai. we believe that imaging is an absolutely vital component of the medical space. for predictive modeling we need to not limit ourselves to just the pixel data but also include additional clinical, patient-level information. for this, a combined effort among many groups, as well as state- and federal-level support, will result in optimal development, validation, and deployment. many argue that we were caught unaware from a communication, testing, treatment and resource perspective by the current pandemic. but deep learning-augmented imaging has emerged as a unique approach that can deliver innovative solutions from conception to deployment in extreme circumstances to address a global health crisis.
the imaging community can take lessons learned from the current pandemic and use them not only to be far better prepared for a recurrence of covid-19 and for future pandemics and other unexpected diseases, but also to advance the art and science of ai as applied to medical imaging in general.
references:
pulmonary vascular endothelialitis, thrombosis, and angiogenesis in covid-19
ai augmentation of radiologist performance in distinguishing covid-19 from pneumonia of other origin at chest ct
chest ct findings in coronavirus disease-19 (covid-19): relationship to duration of infection
rethinking atrous convolution for semantic image segmentation
not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis
covid-19 image data collection
decaf: a deep convolutional activation feature for generic visual recognition
rapid ai development cycle for the coronavirus (covid-19) pandemic: initial results for automated detection and patient monitoring using deep learning ct image analysis
coronavirus detection and analysis on chest ct with deep learning
guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique
penet: a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric ct imaging
covid-19: risk factors for severe disease and death
covid-19 and postinfection immunity: limited evidence, many remaining questions
backpropagation applied to handwritten zip code recognition
a survey on deep learning in medical image analysis
artificial intelligence-enabled rapid diagnosis of patients with covid-19
covid-19 on the chest radiograph: a multi-reader evaluation of an ai system
from containment to mitigation of covid-19 in the us
efficient deep network architectures for fast chest x-ray tuberculosis screening and visualization
factors associated with hospital admission and critical illness among people with coronavirus disease 2019
spark-crowd: a spark package for learning from crowdsourced big data
the role of chest imaging in patient management during the covid-19 pandemic: a multinational consensus statement from the fleischner society
lung segmentation from chest x-rays using variational data imputation
review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid-19
radiological findings from 81 patients with covid-19 pneumonia in wuhan, china: a descriptive study. the lancet infectious diseases
radiological society of north america expert consensus statement on reporting chest ct findings related to covid-19
efficientnet: rethinking model scaling for convolutional neural networks. arxiv.org
the fair guiding principles for scientific data management and stewardship
frequency and distribution of chest radiographic findings in covid-19 positive patients
the role of imaging in 2019 novel coronavirus pneumonia (covid-19)
uncovering the heterogeneity and temporal complexity of neurodegenerative diseases with subtype and stage inference
relation between chest ct findings and clinical conditions of coronavirus disease 2019 (covid-19) pneumonia: a multicenter study
key: cord- - ijaxk v authors: el mouden, zakariyaa ait; taj, rachida moulay; jakimi, abdeslam; hajar, moha title: towards using graph analytics for tracking covid-19 date: - - journal: procedia computer science doi: . /j.procs. . .
sha: doc_id: cord_uid: ijaxk v
graph analytics is now considered the state of the art in many applications of community detection. the combination between the graph's definition in mathematics and the graph in computer science as an abstract data structure is the key behind the success of graph-based approaches in machine learning. based on graphs, several approaches have been developed, such as shortest path first (spf) algorithms, subgraph extraction, social media analytics, transportation networks, bioinformatic algorithms, etc. while spf algorithms are widely used in optimization problems, spectral clustering (sc) algorithms have overcome the limits of most state-of-the-art approaches in community detection. the purpose of this paper is to introduce a graph-based approach for community detection in the novel coronavirus covid-19 countries' datasets. the motivation behind this work is to overcome the limitations of multiclass classification; as sc is an unsupervised clustering algorithm, there is no need to predefine the output clusters as a preprocessing step. our proposed approach is based on a previous contribution on an automatic estimation of the number k of output clusters. based on dynamic statistical data for more than countries, each cluster is supposed to group countries having similar behaviors of covid-19 propagation. in late december 2019, an increasing number of pneumonia cases was noticed in wuhan city, china [ , ]. initially, those cases were classified as being caused by unknown sources, but after one week, the novel coronavirus was identified and temporarily named 2019-ncov [ ]. coronavirus, or cov, is a large family of viruses discovered in the 1930s in mammals and birds; later, in the 1960s, coronaviruses were discovered in humans. a novel coronavirus, or ncov, is a new strain that has not been previously identified in humans. thereafter, the novel coronavirus disease was named covid-19 in february 2020 [ , ]; it is caused by severe acute respiratory syndrome coronavirus 2, or sars-cov-2. after two months, the novel virus was characterized as a pandemic, as the statistics exceeded cases and deaths in different countries according to the world health organization (who). the symptoms of the covid-19 disease can be divided into two parts: i) systemic disorders such as fever, cough, fatigue, headache, hemoptysis, acute cardiac injury, hypoxemia, dyspnea, diarrhea and lymphopenia; ii) respiratory disorders such as rhinorrhea, sneezing, sore throat, pneumonia, ground-glass opacities, rnaaemia and acute respiratory distress syndrome [ , ]. from infection to the first symptoms, the incubation period is from days to days in most cases, but a maximal value of the incubation period of days was observed in february 2020, and other values were noticed afterwards, which shows that the incubation period can vary widely among patients and underlines the seriousness of the coronavirus and its ability to spread.
while searching for vaccines, the majority of countries implemented quarantine and travel restrictions to limit the spread of this novel coronavirus [ , ]. at the time of writing this paper, the number of covid-19 cases in the world had reached , including deaths ( %) and recovered cases ( %), according to the last update on worldometers in may 2020. after more than months, china has moved from the top of the table of cases to the th place, leaving the first place to the usa, with more than cases ( % of total cases in the world) and deaths ( % of total deaths in the world). our contribution focuses on the use of machine learning algorithms to manipulate covid-19 data; the existing approaches classify countries according to predefined classes using statistical data as a function of time, which is a very basic classification that requires the classes to be defined before running the algorithm. the high success of multiclass classification for covid-19 medical images [ ] does not make this approach applicable to other situations using other formats of data; the use of each approach will be discussed in the section on related works. our work is based on spectral clustering (sc), which is a clustering approach and not a classification one; the difference is that, in advance, we do not have any idea about the number or the structure of the output groups; it is the combination between the features that makes a set of countries have similar behaviors and then form a strong cluster with minimal links with countries from other clusters. sc is not a foreign tool in medicine; many previous works have linked sc to protein-to-protein interactions [ , ] and medical imaging [ , ], and we hope that our study will open doors to using sc in tracking epidemics. the paper is structured as follows: after this introduction, section 2 details the literature review, section 3 presents briefly the graph analytics thematic and locates our approach within it, section 4 presents, including a graphical abstract, the different processes of our proposed approach, and section 5 concludes the paper and presents some perspectives for future works. since the first appearance of the covid-19 disease, many disciplines have joined the scientific community around covid-19. artificial intelligence (ai), in turn, has contributed various works to support covid-19 challenges such as modeling, simulation, predictions, social network analytics, geographic information systems (gis) for spatial segmentation and tracking, etc. in [ ], the authors highlight the importance of gis and big data against covid-19 challenges, represented for gis by spatial tracking of confirmed cases, predictions of the epidemic transmission from region to region, spatial data visualization and segmentation, etc. for big data, the problem is the format of data collected from different sources, which produces heterogeneous datasets that cannot be processed with traditional techniques; to remedy this issue, the authors proposed that it is indispensable for the different interveners, especially the government and academics, to discuss the formulation of those data. the study was presented for three scales: individual, group and regional. another contribution in the same field is [ ], where the authors studied the spatial variability of covid-19 at the level of the continental united states; the authors implemented a multiscale geographically weighted regression for both global and local models.
medical image analysis can be considered the ai field with the highest number of contributions since the first appearance of the novel coronavirus. in [ ], the authors presented a comparative study of seven recent deep learning methods for covid-19 detection and classification using chest x-ray images. using the same chest x-ray images, the authors of [ ] proposed their deep learning model covidx-net, based on convolutional neural networks (cnn), to classify patients as either positive or negative cases; as an interpretation of the results, the authors recommended the use of the dense convolutional network model (densenet, f = . ), which gives better results in comparison with other models such as inceptionv3 (f = . ). neural network models were not limited to chest x-ray image analysis; other works used those models to study the efficiency of quarantine systems in different countries, such as [ ], where the authors analyze the results of quarantine and isolation in china, italy, south korea and the usa. the authors concluded that any relaxing of quarantine measures before the estimated dates would lead to a higher spread of the covid-19 disease in the usa. the study was elaborated for data available from the start of the epidemic to march 2020. on the other hand, spectral clustering has also proven its efficiency against big data challenges, with numerous applications in computer science such as community detection [ , ], bioinformatics [ ], image processing [ , ], etc. recent works have combined spectral methods and deep learning models, as in [ ], where the authors presented a deep clustering approach to cluster data using both neural networks and graph analytics. the image segmentation presented by the authors of [ ] is a combination of sc and regularization, which gives good results for high-dimensional image segmentation in comparison with the state-of-the-art algorithms. graph analytics is a mathematical field of study that regroups all the algorithms and approaches based on graphs; we cite for example discovering meaningful patterns using mathematical properties of graphs. one of the many powerful sides of graphs is that those data structures can be built from any type of structured, semi-structured or even heterogeneous data (see fig. ); also, graphs can be imported and exported as objects using different formats. one of the main formats for describing graphs is gexf (graph exchange xml format); more details about building gexf graph schemas from relational data can be found in our previous contribution [ ]. how easy is it to break the graph by removing a few nodes or edges? how can we compare two or more graphs? those two questions summarize the connectivity analytics problems. a graph is connected if it contains a path from u to v or from v to u for each pair of nodes (u, v). in connectivity analytics we study the robustness of a graph, the disconnection of graphs based on their nodes or edges, and similarity analysis (fig. ). as a comparison between two graphs, degree histograms are widely used to analyze the connectivity between the nodes. in fig. , the two graphs are ε-neighborhood graphs with different values of the threshold ε: ε = . (left) and ε = . (right); the two graphs describe the same data and only the connectivity is different. a graph with a smaller ε contains a higher number of edges, and higher node degrees, in comparison with a higher value of ε, which explains the difference between the two degree histograms.
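to make the ε-neighborhood construction just described concrete, a minimal sketch follows. this is not the authors' implementation: the data, the σ of the gaussian kernel, and the two ε values are illustrative only, and ε is assumed to act as a threshold on pairwise similarities, consistent with the description above.

```python
# illustrative sketch (not the authors' code): build epsilon-neighborhood graphs
# from a gaussian similarity matrix and compare their degree histograms.
import numpy as np
import networkx as nx
from collections import Counter

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 5))                 # 30 hypothetical feature vectors

def gaussian_similarity(x, sigma=1.0):
    # s_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2))
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def epsilon_graph(s, eps):
    # connect i and j when their similarity exceeds the threshold eps
    g = nx.Graph()
    g.add_nodes_from(range(s.shape[0]))
    i, j = np.where(np.triu(s, k=1) > eps)
    g.add_edges_from(zip(i.tolist(), j.tolist()))
    return g

s = gaussian_similarity(x)
for eps in (0.2, 0.6):                       # illustrative thresholds only
    g = epsilon_graph(s, eps)
    hist = Counter(dict(g.degree()).values())
    print(f"eps={eps}: {g.number_of_edges()} edges, degree histogram {sorted(hist.items())}")
```

as in the text, the smaller threshold yields a denser graph with higher node degrees, which is what the two degree histograms are meant to contrast.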
a centrality is a measure of the importance of a node or an edge based on its position in the graph. a centralization is measured for the entire graph or network as the variation in the centrality scores among the nodes or edges. we distinguish between four types of centrality: i) degree centrality, which is a measure that tells how close the graph is to being maximally connected, i.e. a clique; ii) group degree centrality, which is close to the first type of centrality, except that we consider a set of nodes as a single node and measure how close that subset of nodes is to a clique; iii) closeness centrality, which is the sum of shortest paths from all other nodes to a single node; a low closeness centrality means that our node has short distances from other nodes, which makes it a key player in the graph by receiving information sooner and influencing other nodes directly or indirectly; iv) betweenness centrality, which is the ratio of pairwise shortest paths that flow through a node i to the count of all the shortest paths in the graph; a low betweenness centrality means that the node i can reach other nodes faster than other nodes in the graph. a community or a cluster is a dense subgraph where the nodes are more connected to each other than to nodes outside the subgraph. from this definition, we can model communities' detection in a graph as a multi-objective optimization problem: let g be a graph and c a connected subgraph of g; for each cluster c, we compute its intra-cluster density σ_int and its inter-cluster density σ_ext, where σ_int(c) is the number of edges internal to c divided by the number of possible internal edges, n_c(n_c − 1)/2, and σ_ext(c) is the number of edges connecting c to the rest of the graph divided by n_c(n − n_c), with n the number of nodes in g and n_c the number of nodes in the cluster c. a good communities' detection (also called graph clustering) optimizes for a minimal value of σ_ext and a maximal value of σ_int. louvain community detection [ ] and normalized spectral clustering [ , ] are the most used communities' detection algorithms for graphs and complex networks. our proposed approach consists of an sc-based communities detection where the objective is to obtain an unsupervised grouping of countries having similar behaviors of covid-19 spreading. the approach is meant to be applied dynamically to track the coronavirus behaviors in the world. the approach is divided into four steps (see fig. ). "it's not who has the best algorithm that wins. it's who has the most data." (andrew ng). data processing is a key process in every machine learning model; even with an efficient model, bad data collection, formatting or scaling can lead to very bad results. as we know, data sources are no longer consistent: data are available and easily accessible from different devices, but each source provides a different format of data (text, csv, multimedia, html, etc.), which produces heterogeneous collected data. with the high spread of the novel coronavirus, data related to this disease keep growing, which causes two major problems: the selection of trusted sources and data formatting. for the application, data were collected from the eu open data portal, which is a trusted source where data can be collected and reused free of charge and without any copyright restrictions. this portal offers daily updated data in xlsx format, which can be easily converted to csv (comma-separated values) for further processing. the portal presents covid-19 data for statistical studies, where the most important feature is time; the presentation of data according to time produces a high redundancy for a lot of static data, such as the country name, its geographical code, the continent, the population value, etc.
to remedy the redundancy problem, we proposed a preprocessing step of data formatting where we applied object-oriented programming concepts: each country was converted to an object with a set of attributes and methods. for each country object, we are interested in collecting a set of features (table ); those features are then used to calculate another set of features (table ) in order to have more data to manipulate (see table and table ). the vectors vtests, vcases, vdeaths and vrecovers store the daily values of covid-19 tests, cases, deaths and recoveries, respectively, for each country, from the first appearance of the coronavirus in the country to the day of executing the model, which produces vectors of different sizes for different countries; but for each single country, the vectors have the same size. the size of the vectors is stored as an extra feature called contaminationdays. the sums of the values of each vector are stored in the integer-valued (∈ ℕ*) features sumtests, sumcases, sumdeaths and sumrecovers. each day we also store the value of the infection fatality rate (ifr), which produces a vector with the same size as the four previous vectors; we name this vector vifr. in machine learning models, the use of features with different scales can never lead to a meaningful result: features with a higher range of values (the population feature, for example) will always carry more weight than features with smaller scales (ifr, for example). to remedy this problem, a feature scaling process is necessary, especially when we plan to use a similarity kernel in the next step of the approach. the use of a mean normalization produces a second csv dataset, which is used for the rest of the process. after the feature scaling, we apply the gaussian similarity kernel to measure the similarities between each pair of countries c_i and c_j, based on the vectorization of the predefined features in the first step of the process. the gaussian similarity is defined as s(c_i, c_j) = exp(−‖c_i − c_j‖² / (2σ²)), with ‖c_i − c_j‖ the euclidean distance between the corresponding vectors of the pair of countries and σ a manually chosen positive scaling parameter to control the size of the neighborhood of the produced graph. the result of the similarities measurement is a square similarity matrix s ∈ ℝ^(n×n), where n is the number of countries in the dataset and each element of the matrix is s_ij = s(c_i, c_j). this matrix is used to build our graph, where the set of vertices is the countries of our dataset and the edges are weighted with the similarities of s. for the spectral clustering process, we prefer the absolute sc algorithm with an unsupervised choice of the number of clusters k. this version of sc is based on the use of the absolute laplacian matrix l_abs, computed from d, the diagonal degree matrix, and w, the weights matrix, both square matrices of size n × n. the absolute sc can be defined by the following steps: • compute l_abs; • extract the k eigenvectors associated with the k largest eigenvalues of l_abs in absolute value; • store the extracted eigenvectors as the columns of a matrix u ∈ ℝ^(n×k); • using k-means clustering, cluster the rows of u into k clusters. the output is a set of clusters where each cluster groups the countries having high similarities, which means that those countries have lived through a similar spreading of the coronavirus.
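a minimal sketch of this pipeline is given below. it is not the authors' code: the exact formula of the absolute laplacian is not reproduced in the text, so the sketch assumes the symmetrically normalized form l_abs = d^(-1/2) w d^(-1/2), and the number of clusters k is fixed by hand rather than estimated automatically as in the authors' earlier contribution.

```python
# illustrative sketch (not the authors' implementation): gaussian similarity ->
# assumed absolute laplacian l_abs = d^(-1/2) w d^(-1/2) -> k eigenvectors with
# the largest absolute eigenvalues -> k-means on the rows of u.
import numpy as np
from sklearn.cluster import KMeans

def absolute_spectral_clustering(features, sigma=1.0, k=4):
    # pairwise gaussian similarities between (already mean-normalized) feature vectors
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)                     # no self-loops

    deg = w.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    l_abs = d_inv_sqrt @ w @ d_inv_sqrt          # assumed form of the absolute laplacian

    eigvals, eigvecs = np.linalg.eigh(l_abs)     # symmetric matrix -> real spectrum
    top = np.argsort(np.abs(eigvals))[::-1][:k]  # k largest eigenvalues in absolute value
    u = eigvecs[:, top]                          # n x k embedding, one row per country

    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(u)

# hypothetical usage: 30 countries, 10 scaled features each
rng = np.random.default_rng(1)
labels = absolute_spectral_clustering(rng.normal(size=(30, 10)), sigma=1.0, k=4)
print(labels)
```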
as the contaminationdays feature gets higher values, our model produces better results in comparison with the first days of the covid-19 disease; this is due to the vectors getting more values day after day, allowing the model to have more data for similarity measurement. "a picture is worth a thousand words." data visualization is a supplementary task, as the main objective of the model is to compute the clusters; but with data visualization, a visual graph can describe the output of our model better than the mathematical results or any textual description. a graph can be built from different sources and using different tools and programming languages. in r, a graph can be built from an adjacency matrix of type matrix or dgcmatrix of the package matrixcalc; graph functions are available in the package igraph and plotting functions in the package gplots; pajek [ ] network files are also an available option to create graphs in r. in gephi [ ], a graph can be constructed from a pair of csv files, the first file for the nodes and the second for the edges, or from a gexf file, which is an xml-based document describing a graph. graph databases are also an important tool for data visualization. in addition to visualization, graph-based systems provide querying languages to interact with graphs. neo4j [ ] is a widely used nosql system which manipulates data modelled by graphs and offers the cypher language to query the stored graphs. neo4j supports csv files and manual creation using the cypher querying language, which makes it the most powerful tool for graph creation, manipulation and visualization. in this paper, we proposed a graph-based approach for clustering covid-19 data using spectral clustering. starting with data processing, we highlighted the importance of data collection and feature scaling in increasing the efficiency of each machine learning model. then, we described the transitional phase between the heterogeneous data collected from different sources and the graph data structure, relying on the gaussian similarity kernel to build our graph. next, spectral analysis was applied to the built graph; we selected the absolute spectral clustering, which is based on the absolute laplacian matrix, to cluster the vertices of the graph based on the similarities between their properties. the last process was the visualization of the output data; we proposed three different tools and programming languages, and recommended the graph-based system neo4j as it supports a querying language called cypher to interact with the graph. ongoing work intends to link the different processes of the model, developed with two different programming languages (java and r), to build a model able to cluster heterogeneous data based on graph analytics and spectral clustering for communities' detection. furthermore, the data processing needs more improvement to collect more data and expand the set of features, as the model still gives better clustering for countries with more data, which are the countries that discovered the coronavirus earlier, than for freshly contaminated countries, which gather less data and generally appear as common nodes between all the clusters, due to the fact that the majority of countries had a very similar spreading of the coronavirus in the first days of contamination.
references:
prevalence of comorbidities and its effects in coronavirus disease 2019 patients: a systematic review and meta-analysis
world health organization declares global emergency: a review of the 2019 novel coronavirus (covid-19)
novel coronavirus (2019-ncov) pneumonia
a novel coronavirus outbreak of global health concern
clinical characteristics of covid-19 patients with digestive symptoms in hubei, china: a descriptive, cross-sectional, multicenter study
the epidemiology and pathogenesis of coronavirus disease (covid-19) outbreak
preliminary risk analysis of 2019 novel coronavirus spread within and beyond china
the effect of travel restrictions on the spread of the 2019 novel coronavirus (covid-19) outbreak
covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest radiography images
spectral clustering for detecting protein complexes in protein-protein interaction (ppi) networks
community detection in protein-protein interaction networks using spectral and graph approaches
spectral clustering for medical imaging
oriented grouping-constrained spectral clustering for medical imaging segmentation
covid-19: challenges to gis with big data
gis-based spatial modeling of covid-19 incidence rate in the continental united states
using x-ray images and deep learning for automated detection of coronavirus disease
covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images
neural network aided quarantine control model estimation of global covid-19 spread
towards using spectral clustering in graph mining
an application of spectral clustering approach to detect communities in data modeled by graphs
polycystic ovarian syndrome novel proteins and significant pathways identified using graph clustering approach
kernel cuts: kernel and spectral clustering meet regularization
graph laplacian for spectral clustering and seeded image segmentation
deep spectral clustering using dual autoencoder network
an algorithm of conversion between relational data and graph schema
generalized louvain method for community detection in large networks
on spectral clustering: analysis and an algorithm
an automated spectral clustering for multi-scale data
pajek: program for analysis and visualization of large networks
gephi: an open source software for exploring and manipulating networks
a programmatic introduction to neo4j
key: cord- -nfl z c authors: slavova, svetla; larochelle, marc r.; root, elisabeth; feaster, daniel j.; villani, jennifer; knott, charles e.; talbert, jeffrey; mack, aimee; crane, dushka; bernson, dana; booth, austin; walsh, sharon l. title: operationalizing and selecting outcome measures for the healing communities study date: - - journal: drug alcohol depend doi: . /j.drugalcdep. . sha: doc_id: cord_uid: nfl z c
background: the helping to end addiction long-term (healing) communities study (hcs) is a multisite, parallel-group, cluster randomized wait-list controlled trial evaluating the impact of the communities that heal intervention to reduce opioid overdose deaths and associated adverse outcomes. this paper presents the approach used to define and align administrative data across the four research sites to measure key study outcomes. methods: priority was given to using administrative data and established data collection infrastructure to ensure reliable, timely, and sustainable measures and to harmonize study outcomes across the hcs sites.
results: the research teams established multiple data use agreements and developed technical specifications for more than study measures. the primary outcome, the number of opioid overdose deaths, will be measured from death certificate data. three secondary outcome measures will support hypothesis testing for specific evidence-based practices known to decrease opioid overdose deaths: (1) number of naloxone units distributed in hcs communities; (2) number of unique hcs residents receiving food and drug administration-approved buprenorphine products for treatment of opioid use disorder; and (3) number of hcs residents with new incidents of high-risk opioid prescribing. conclusions: the hcs has already made an impact on existing data capacity in the four states. in addition to providing data needed to measure study outcomes, the hcs will provide methodology and tools to facilitate data-driven responses to the opioid epidemic, and establish a central repository for community-level longitudinal data to help researchers and public health practitioners study and understand different aspects of the communities that heal framework. the helping to end addiction long-term (healing) communities study (hcs) is a multisite, parallel-group, cluster randomized wait-list controlled trial evaluating the impact of the communities that heal intervention to reduce opioid overdose deaths and other associated adverse outcomes (walsh et al., in press). the intervention includes three components: (1) a community-engaged and data-driven process to assist communities in selecting and implementing evidence-based practices to address opioid misuse and opioid use disorder (oud), and reduce opioid overdose deaths (martinez et al., in press); (2) the opioid reduction continuum of care approach, which contains a compendium of evidence-based practices and strategies to expand opioid overdose education and naloxone distribution, medications for opioid use disorder (moud), and safe opioid prescribing (winhusen et al., in press); and (3) community-based health communication campaigns to increase awareness of and demand for the evidence-based practices and reduce their stigma (lefebvre et al., in press). a total of communities across four highly affected states (kentucky, massachusetts, new york, ohio) were recruited to participate in the hcs and randomized to one of two waves in a wait-list controlled design. the communities were randomized to receive either the intervention (referred to as wave 1 communities) or a wait-list control (referred to as wave 2 communities). the hcs has one primary hypothesis (h1) and three secondary hypotheses (h2, h3, h4) (walsh et al., in press). it is hypothesized that during the evaluation period (january to december ), wave 1 communities, compared with wave 2 communities, will: h1: reduce opioid overdose deaths (primary outcome); h2: increase naloxone distribution;
communities also need timely and accurate data for visualization in data dashboards designed to monitor the uptake and success of the selected evidence-based practices and strategies, and respond to emerging challenges and community needs (wu et al., in press) . this article describes the process for using administrative data to develop the hcs outcome measures aligned with the primary and three secondary hypotheses of the study. each research site developed collaborations and partnerships with state agencies and other data owners to understand the regulations and policies governing the use of administrative data for research. an hcs data capture work group was formed and included representatives from the four research study sites, the data coordinating center at the rti international, and the sponsors (the national institute on drug abuse and the substance abuse and mental health services administration [samhsa]). a structured consensus decision-making strategy was used to: a. identify data sources to measure the primary, secondary, and other study outcomes; c. develop data governance strategy and data use agreements; d. develop study measure definitions, technical specifications, programming code, procedures for data quality control, common data model, and schedule for data transfer to the data coordinating center; during development, priority was given to use of existing state-level administrative data sources with regulated and sustained data collections and established infrastructures for quality assurance and control. this is an efficient and cost-effective way to study community-level changes, capitalizing on the federal and state investments for collecting standardized surveillance data, and adopting, when possible, validated surveillance definitions. in addition, using multiple administrative data sources allowed for the construction of measures at the community/population level (i.e., unit of analysis being hcs community) by aggregating individual-level data (e.g., unit of measurement being a community resident or a provider practicing within an hcs community) that best matched hcs outcomes. priority also was given to data sources with timely reporting, preferably with less than a -month lag between the occurrence of events and data availability. timeliness and near-realtime access to data was critical for three reasons: ( ) the community engagement component of the intervention is data-driven and dependent on providing ongoing data feedback to community partners throughout the process (walsh et al., in press); j o u r n a l p r e -p r o o f ( ) it is imperative that the study results are made publicly available quickly because of the magnitude and impact of the opioid crisis on us communities; and ( ) the hcs was designed as a four-year study. this study protocol (pro ) was approved by advarra inc., the healing communities study single institutional review board. the study is registered with this section presents the results from the selection and operationalization of administrative data measures for study hypotheses testing (table ) , as well as study measures for secondary analyses and monitoring the progress in implementing evidence-based practices ( the primary hcs outcome is the number of opioid overdose deaths among residents in hcs communities. the traditional data source for capturing drug overdose deaths are death certificate records (isw , ; hedegaard et al., ; warner et al., ) . 
suspected drug overdose deaths are considered unnatural deaths and are subject to medicolegal death investigation before the death is certified by a coroner or a medical examiner (hanzlick, ; hanzlick and combs, ), and a completed death certificate is filed with the office of vital statistics in the state where the death occurred (nchs, a, b). selected fields from the death certificate record are then sent to the national center for health statistics, where the cause-of-death information is coded with one underlying and up to multiple (i.e., supplementary) cause-of-death codes using the international classification of diseases, tenth revision (icd-10) coding system (who, ). the cdc definition for identifying drug overdose deaths with opioid involvement in icd-10-coded death certificate records is commonly accepted by researchers and public health agencies. using icd-10-coded death certificate data, drug overdose deaths are identified as deaths with an underlying icd-10 cause-of-death code in the range x40-x44, x60-x64, x85, or y10-y14. previous research has identified several methodological challenges for the identification of opioid involvement in drug overdose deaths (e.g., lack of routinely performed postmortem toxicology testing, especially for fentanyl and designer opioids; challenges to detection and quantification of new designer opioids; variation in jurisdictional office policy in the completion of drug overdose death certificates; differences in the proportion of drug overdose death certificates completed by different jurisdictions that do not list the specific contributing drugs) (buchanich et al., ; ruhm, ; slavova et al., ; slavova et al., ; warner and hedegaard, ; warner et al., ). prior to the evaluation period, the research sites are administering surveys among the coroners, medical examiners, and toxicology labs serving both wave 1 and wave 2 communities to collect information related to death investigations of suspected drug overdose deaths (including postmortem toxicology testing, timelines for death certificate completion, and possible covid-19-related changes in these processes that could have lasting effects during the hcs evaluation period) in order to understand possible limitations and changes in the completeness and accuracy of the primary outcome measure. each hcs research site will use death certificate records from its state office of vital statistics to identify hcs resident deaths with opioid contribution. one challenge in using death certificate data for the primary study outcome is the lag between the death date and the date when death certificate records are available for analysis (rossen et al., ). sites have been working with local coroners, medical examiners, and state vital statistics offices to improve the timeliness of data availability across all hcs communities. in , almost all the death certificate records in kentucky, massachusetts, new york, and ohio were available for analysis within months after the overdose death (cdc, ). the following steps describe the hcs operational definition for capturing opioid overdose deaths for testing the primary study hypothesis. step 1: all sites will use state death certificate files captured months after the end of the evaluation period to identify the death certificate records for residents of hcs communities with a date of death within the evaluation period, an underlying cause-of-death code indicating drug overdose, and a multiple cause-of-death code indicating opioid involvement (as defined above).
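a minimal sketch of this filtering step is shown below. it is not the hcs production code: the column names and file layout are hypothetical, and the code lists follow the cdc definition quoted above.

```python
# illustrative sketch (not the hcs code): flag opioid-involved overdose deaths in
# an icd-10-coded death-certificate extract. column names ('underlying_cod',
# 'multiple_cod_1'..'multiple_cod_20', 'residence_community') are hypothetical.
import pandas as pd

OVERDOSE_UNDERLYING = (
    [f"X{c}" for c in range(40, 45)] + [f"X{c}" for c in range(60, 65)]
    + ["X85"] + [f"Y{c}" for c in range(10, 15)]
)
OPIOID_MULTIPLE = ["T400", "T401", "T402", "T403", "T404", "T406"]  # t40.0-t40.4, t40.6

def flag_opioid_overdose_deaths(df: pd.DataFrame) -> pd.DataFrame:
    multiple_cols = [c for c in df.columns if c.startswith("multiple_cod_")]
    is_overdose = df["underlying_cod"].str[:3].isin(OVERDOSE_UNDERLYING)
    opioid_involved = df[multiple_cols].apply(
        lambda col: col.fillna("").str.replace(".", "", regex=False).str[:4].isin(OPIOID_MULTIPLE)
    ).any(axis=1)
    return df[is_overdose & opioid_involved]

# hypothetical usage: counts by community of residence for the evaluation period
# deaths = pd.read_csv("death_certificates.csv", dtype=str)
# counts = flag_opioid_overdose_deaths(deaths).groupby("residence_community").size()
```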
number of naloxone units distributed in an hcs community as measured by the sum of the naloxone units ( ) the us surgeon general's advisory on naloxone emphasized that expanding naloxone availability in communities is a key public health response to the opioid crisis (hhs, ) . research has shown that opioid overdose death rates were reduced both in communities that implemented overdose education and naloxone distribution programs (walley et al., ) and in jurisdictions enacting laws allowing direct pharmacist dispensing of naloxone (abouk et al., ) . there are three limitations of this data source: ( ) no information is provided about the number of pharmacies dispensing naloxone prescriptions; ( ) suppression rules preclude reporting of data for geographic areas with fewer than four pharmacies; and ( ) prescriptions are assigned to communities based on the location of the pharmacy rather than the customer's residence. suppression rules impacted three communities in massachusetts; this was resolved by requesting the total for the three communities and dividing it relative to the community populations. the assignment of a pharmacy to a community based on pharmacy address may result in an overcount of naloxone in a community with pharmacies that serve residents of non-hcs communities or an undercount if a pharmacy is just outside an hcs community but serves hcs residents. a limitation of the measure is that it may not capture naloxone distributed in hospitals, correctional facilities, or other venues when the naloxone is purchased with support from private donations, foundations, or locally awarded federal funding. the number of hcs residents receiving buprenorphine products approved by the food there are three fda-approved moud products: buprenorphine, methadone, and naltrexone. multiple randomized controlled trials (krupitsky et al., ; mattick et al., mattick et al., , have demonstrated that moud can reduce cravings and illicit opioid use. observational studies have identified that buprenorphine and methadone are associated with reduced mortality sordo et al., ) . thus, as part of the opioid reduction continuum of care approach, communities are required to expand moud with buprenorphine and/or methadone (winhusen et al., in press) . access to moud is geographically heterogeneous and differs by patient population (haffajee et al., ; pashmineh azar et al., ) . for example, opioid treatment programs providing methadone are less common in rural than urban areas (joudrey et al., ) . criminal justice-involved populations, where there has been a historical j o u r n a l p r e -p r o o f preference toward naltrexone (krawczyk et al., ) , are less likely to receive buprenorphine and methadone. there also is a great deal of variation in billing and documentation of the type of moud, administration modality (e.g., office-based administration as compared with prescription filled at pharmacy by patient), provider type, state policies, and insurance coverage. accurate estimation on the prevalence of oud in hcs communities is important for planning and scaling of the moud uptake. however, estimating the population at need for moud is a challenge for the hcs. the hcs team is working on developing improved estimations for oud prevalence in each hcs community using a capture-recapture statistical methodology previously applied by barocas et al. (barocas et al., ) . 
five potential sources for measurement of moud were identified: medicaid claims, all-payer claims databases, pdmps, opioid treatment program central registries, and pharmacy-dispensed prescriptions (iqvia). the disparate data sources vary in completeness and timeliness. all-payer claims databases are large state databases that typically include medical claims across multiple settings (e.g., hospitalizations, emergency department visits, outpatient visits), pharmacy claims, and eligibility and provider files. data are collected from both public and private payers and reported directly by insurers to a state repository. all-payer claims databases are structured similarly to medicaid claims data and allow for linking of individuals across claims to identify individuals with oud and their treatments. the key advantage is the inclusion of private insurance, allowing more accurate estimation of the prevalence of individuals with diagnosed oud and treatment with moud in a state. all-payer claims have been used previously in oud-related research (burke et al., ; freedman et al., ; lebaron et al., ; saloner et al., ). seventeen states have all-payer claims databases (grecu et al., ). opioid treatment programs are the only facilities allowed to deliver methadone for oud, but they may also offer buprenorphine and naltrexone along with behavioral therapy. they must be certified by samhsa and an independent, samhsa-approved accrediting body to dispense moud (samhsa, ). they also must be licensed by the state in which they operate and must be registered with the drug enforcement administration. the central registries are established to prevent a patient's simultaneous enrollment in multiple locations (e-cfr, a). the number of enrolled patients, aggregated at the hcs community level, as permitted by section § . research, cfr part (e-cfr, a), can be used as a measure of methadone treatment uptake, but central registries were not available in all four research sites. iqvia data capture pharmacy-dispensed naltrexone. however, naltrexone is indicated both for treatment of oud and for alcohol use disorder. because pharmacy records do not include diagnosis-related information for making this distinction, this data source may overestimate the uptake of naltrexone for oud. the third secondary outcome is the number of hcs residents with new incidents of high-risk opioid prescribing, as measured by residents meeting at least one of four criteria (see table 1), including incident high-dose opioid prescribing, defined as ≥ mg mme over calendar months, or incident overlapping opioid and benzodiazepine prescriptions greater than days over calendar months. high opioid dosages, co-prescribing of opioids with benzodiazepines or other sedative hypnotics, and receipt of opioid prescriptions from multiple providers or pharmacies are associated with opioid-related harms (bohnert et al., ; cochran et al., ; dunn et al., ; rose et al., ). characteristics of opioid initiation are also important. for example, initiating opioid treatment with extended-release/long-acting opioids (miller et al., ) is associated with increased risk of overdose, and longer prescription duration is associated with transition to long-term opioid use (shah et al., ). based on the available evidence, in 2016 the cdc published guidelines (dowell et al., ) for prescribing opioids for chronic pain. numerous quality measures have been developed to encourage and measure progress toward improving the safety of opioid prescribing. after decades of increases, rates of opioid prescribing have peaked and are now declining, although they remain historically high (guy et al., ; schieber et al., ).
developing safe and patient-centered approaches for individuals already receiving long-term opioid therapy has been a challenge to address in the underlying evidence and guidelines. two constructs with the best available evidence to support decreases in opioid-related harms were targeted, with the intention of reducing the number of individuals initiating high-risk opioid prescribing and the likelihood that new opioid prescribing episodes develop into long-term episodes (shah et al., ). state pdmps were identified as the best available data source for these measures across all four research sites. a limitation of these data is the lack of clinical context, such as diagnostic codes for the disease or condition associated with the prescribed medication. as a result, at the patient level, it is difficult to assess the appropriateness of a high-dose opioid prescribing episode, such as that needed for the management of severe pain for patients with cancer or in end-of-life care. another limitation of this measure is the lack of automated data sharing among state pdmps on prescriptions filled across state boundaries. a benefit of the pdmp data source is that it is timely and captures dispensed prescriptions for controlled substances paid for by both insurance and cash. medicaid claims and all-payer claims databases are alternative data sources. the main advantage of claims data compared with pdmp data is the clinical context. however, claims data lack information on prescriptions paid for by cash or alternative insurance coverage, which is associated with increased risk of opioid-related harms (becker et al., ). medicaid claims are common across the sites, whereas all-payer claims databases exist in only two of the four states. claims data lag by at least months, making them less useful for timely monitoring of progress. existing measures were identified through a review of the literature, including existing measures from the cdc, the national quality forum, and the national committee for quality assurance, which were subsequently adapted to the constructs identified above. all opioid agonist medications, including tramadol, were included, with the exception of antitussive codeine formulations and buprenorphine formulated for pain. the reasons for their exclusion are a lack of clear guidance for conversion to morphine milligram equivalents and a lack of evidence that buprenorphine, a partial opioid agonist, conveys the same risk as that of full opioid agonists. to maintain consistency across sites, the team developed a standardized list of national drug codes for opioids, benzodiazepines, and moud using the medi-span electronic drug file (med-file) v and the drug inactive data file (wolters kluwer, ). the standardized study drug list is updated quarterly. the med-file includes product names, dosage forms, strength, the ndc, and the generic product identifier (gpi). the gpi is a 14-digit number that allows identification of drug products by primary and secondary classifications and simplifies identification of similar drug products from different manufacturers or different packaging. because our study requires baseline data on opioid utilization, the inactive data file is used to include drugs that may be currently inactive but were used during the baseline period. all gpis beginning with the classification " ", which identifies any drug product containing an opioid or an opioid combination, are included in the opioid list. next, opioid products that are not likely to be used in the outpatient/ambulatory pharmacy setting, such as bulk powders, bulk chemicals, and dosage forms typically used in hospitals or hospice settings (e.g., epidurals, ivs), are excluded. products classified as cough/cold/allergy combinations, cough medications, antidiarrheal/probiotic agents, buprenorphine products used for oud and pain, and methadone products used for oud are also excluded. the cdc file that identifies oral mmes (cdc, b) was used to add mmes to each opioid product and to identify products as long-acting or short-acting. to ensure the hcs list includes all current and inactive products, the cdc list was cross-referenced with the list of all gpi products. the benzodiazepine products are identified using the gpi classification " ", which identifies any drug product containing a benzodiazepine or a benzodiazepine combination. products that are not likely to be used in the outpatient/ambulatory pharmacy setting, such as bulk powders, bulk chemicals, and dosage forms typically used in hospitals or hospice settings, were excluded. the full list of gpis for opioids, benzodiazepines, and buprenorphine is included in the appendix.
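a minimal sketch of this drug-list construction is given below. it is not the hcs code: the gpi class prefix is left as a placeholder because it is not reproduced above, the column names and keyword lists are hypothetical, and the buprenorphine/methadone exclusion is simplified.

```python
# illustrative sketch (not the hcs code): build an opioid product list from a
# medi-span-style file by gpi prefix, drop non-ambulatory dosage forms and
# excluded classes, then attach oral mme conversion factors from a cdc-style file.
import pandas as pd

OPIOID_GPI_PREFIX = "..."          # placeholder: the gpi class prefix is not shown in the text
EXCLUDED_FORM_KEYWORDS = ["bulk powder", "bulk chemical", "epidural", "intravenous"]
EXCLUDED_CLASS_KEYWORDS = ["cough", "cold", "allergy", "antidiarrheal", "probiotic"]

def build_opioid_list(med_file: pd.DataFrame, mme_file: pd.DataFrame) -> pd.DataFrame:
    gpi = med_file["gpi"].astype(str)
    opioids = med_file[gpi.str.startswith(OPIOID_GPI_PREFIX)].copy()

    form = opioids["dosage_form"].str.lower()
    cls = opioids["drug_class"].str.lower()
    keep = ~form.apply(lambda f: any(k in f for k in EXCLUDED_FORM_KEYWORDS))
    keep &= ~cls.apply(lambda c: any(k in c for k in EXCLUDED_CLASS_KEYWORDS))
    # simplification: drop all buprenorphine and methadone rows here
    keep &= ~cls.str.contains("buprenorphine|methadone", regex=True)
    opioids = opioids[keep]

    # attach oral mme conversion factors and long-/short-acting flags by ndc
    return opioids.merge(mme_file[["ndc", "mme_conversion_factor", "long_acting"]],
                         on="ndc", how="left")

# hypothetical usage:
# med_file = pd.read_csv("medispan_med_file.csv", dtype=str)
# mme_file = pd.read_csv("cdc_oral_mme.csv", dtype={"ndc": str})
# opioid_list = build_opioid_list(med_file, mme_file)
```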
the success of the intervention relies on the community's ability to assess the complexity and specifics of the local opioid epidemic and identify the best ways to implement and promote evidence-based practices locally. a set of additional measures was developed, to be shared with the intervention communities as counts and/or rates over time and visualized as trends on community-tailored dashboards (wu et al., in press). these measures monitor the complexity of the opioid-related harms as well as the progress in the three main evidence-based practices from the opioid reduction continuum of care approach (winhusen et al., in press). a list of selected study measures is provided in table 2. working closely with state stakeholders, the research sites also developed standard operating procedures for data quality assurance and control, and improved data collection (e.g., improved timeliness of existing data sources or development of new administrative data collections). the hcs data coordinating center created a common data model to match the complexity and scale of the clinical trial design and measures and the conditions of the data use agreements. the common data model consists of (1) an internal identification number for each hcs outcome measure; (2) the frequency of reporting (i.e., daily, monthly, quarterly, semi-annually, or annually); (3) display features for dashboards and visualization (i.e., display date, display value, research site/research community identification number, label); and (4) internal usage information (i.e., is estimate, is suppressed [per data use agreement suppression requirements], notes, stratification, and version number).
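as a sketch only, the record just enumerated could be represented as a simple data structure; the field names below are paraphrases of the items listed above, not the data coordinating center's actual schema.

```python
# illustrative sketch of the common data model record described above; field
# names are paraphrased from the text, not the actual hcs schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MeasureRecord:
    measure_id: str                 # internal identification number for the hcs measure
    reporting_frequency: str        # "daily", "monthly", "quarterly", "semi-annual", "annual"
    display_date: str               # display features used by the community dashboards
    display_value: Optional[float]
    community_id: str               # research site / research community identifier
    label: str
    is_estimate: bool = False       # internal usage information
    is_suppressed: bool = False     # per data use agreement suppression requirements
    notes: str = ""
    stratification: str = ""
    version: int = 1
```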
the research site teams established multiple data use agreements with data owners to support the calculation for more than study measures based on administrative data collections, such as death certificates, emergency medical services data, inpatient and emergency department discharge billing records, medicaid claims, syndromic surveillance data, pdmp data, drug enforcement administration data on drug take back collection sites and events, data-waivered prescriber data, hiv registry, naloxone distribution and dispensed prescription data. there were many challenges related to state variations in data timeliness and content that needed to be addressed, and compromises were made to achieve harmonization across research sites. the harmonization of medicaid measure specifications required participation from the state partners because individual states have some unique codes or code bundles for capturing specific services. collaborative workgroups with participation from state partners were formed with specific focus on medicaid data, pdmp data, and emergency medical services data. another challenge is the lack of quality validation studies for many of the measures, so the degree of possible misclassification of diagnosis or service codes used in some specifications is unknown. one example is attempting to identify oud prevalence using diagnosis codes in medical claims or other administrative data sources, knowing that oud is often underdiagnosed. massachusetts also is seeking to partner with emergency medical services agencies to improve timeliness of data reporting and completeness of race/ethnicity data. new york developed a cloud-based application to facilitate data aggregation and sharing both for hcs and future research projects. in ohio, the hcs team partnered with the innovateohio platform, which was established by executive order a few weeks prior to the hcs project start date. the hcs has been a highly successful "test case" for how a single technology platform could be leveraged to provide necessary data quickly and efficiently for a large study involving multiple state agencies. the platform facilitated multi-agency data use agreements, and curates, cleans, and links data sets across multiple ohio state agencies monthly. this allowed the ohio hcs team to sign one data use agreement to cover all project data activities. the hcs will provide methodology and tools to facilitate data-driven responses to the opioid epidemic at the local, state, and national levels. number of opioid overdose deaths among hcs residents during the evaluation period as measured by deaths with an underlying cause-of-death being drug overdose (i.e. an underlying cause-of-death icd- code in the range x -x , x -x , x , y -y ) where opioids, alone or in combination with other drugs (i.e. a multiple cause-of-death icd- code in the range t . -t . , or t . ), were determined to be contributing to the drug overdose death. data source: drug overdose deaths are captured by death certificate records; additional medicolegal death investigation records can be used (per established protocol) to determine opioid involvement when specific drugs contributing to the overdose deaths are not listed on the death certificate.
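the underlying-cause/multiple-cause logic used in the opioid overdose death measure can be expressed as a simple filter. the sketch below is illustrative: the icd code sets are placeholders (the specific ranges are not restated here), real implementations match code ranges rather than exact strings, and the record layout is assumed.

```python
# illustrative filter for the opioid overdose death measure: the underlying
# cause-of-death must be a drug-overdose code and at least one multiple cause-of-death
# code must indicate opioid involvement. the code sets are placeholders; real
# implementations match the icd code ranges given in the measure specification.
DRUG_OVERDOSE_UNDERLYING = {"X__", "Y__"}   # placeholder underlying cause-of-death codes
OPIOID_MULTIPLE_CAUSE = {"T__._"}           # placeholder multiple cause-of-death codes

def is_opioid_overdose_death(underlying_cause: str, multiple_causes: list[str]) -> bool:
    """True when the death is a drug overdose with opioid involvement."""
    if underlying_cause not in DRUG_OVERDOSE_UNDERLYING:
        return False
    return any(code in OPIOID_MULTIPLE_CAUSE for code in multiple_causes)

# example with made-up records
records = [
    {"underlying": "X__", "multiple": ["T__._", "R99"]},
    {"underlying": "I21", "multiple": ["T__._"]},
]
overdose_deaths = sum(is_opioid_overdose_death(r["underlying"], r["multiple"]) for r in records)
```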
number of naloxone units distributed in an hcs community during the evaluation period as measured by the sum of ( ) the naloxone units distributed to community residents by overdose education and naloxone distribution programs with support from state and federal funding, including dedicated hcs funding, and ( ) the naloxone units dispensed by retail pharmacies located within hcs communities. data source: data are captured from state administrative records and supplemented by study records to include naloxone funded through hcs, as well as iqvia xponent® database. number of hcs residents receiving buprenorphine products approved by the food and drug administration for treatment of opioid use disorder as measured by the number of unique individuals residing in an hcs community who had at least one dispensed prescription for these products during the evaluation period. data source: state prescription drug monitoring program data. number of hcs residents with new incidents of high-risk opioid prescribing during the evaluation period as measured by the number of residents in an hcs community who met at least one of the following four criteria for a new high-risk opioid prescribing episode after a washout period of at least days: ( ) incident opioid prescribing episode greater than days duration (continuous opioid receipt with no more than a day gap); ( ) starting an incident opioid prescribing episode with extended-release or long-acting opioid formulation; ( ) incident high-dose opioid prescribing, defined as ≥ mg morphine equivalent dose over calendar months; or ( ) incident overlapping opioid and benzodiazepine prescriptions greater than days over calendar months. association between state laws facilitating pharmacy distribution of naloxone and risk of fatal overdose the medicaid outcomes distributed research network (modrn) innovative solutions for state medicaid programs to leverage their data, build their analytic capacity, and create evidence-based policy claims database council estimated prevalence of opioid use disorder in massachusetts multiple sources of prescription payment and risky opioid therapy among veterans association between opioid prescribing patterns and opioid overdose-related deaths the effect of incomplete death certificates on estimates of unintentional opioid-related overdose deaths in the united states trends in opioid use disorder and overdose among opioid-naive individuals receiving an opioid prescription in massachusetts from annual surveillance report of drugrelated risks and outcomes-united states opioid overdose. data resources. analyzing prescription data and morphine milligram equivalents (mme) centers for disease control and prevention (cdc), . nchs data quality measures an examination of claims-based predictors of overdose from a large medicaid program kasper controlled substance reporting guide, . kentucky cabinet for health and family services nonfatal opioid overdose standardized surveillance case definition no shortcuts to safer opioid prescribing cdc guideline for prescribing opioids for chronic pain--united states opioid prescriptions for chronic pain and overdose: a cohort study electronic code of federal regulations, a. § . security for records electronic code of federal regulations, b. 
e-cfr website all-payer claims databases -uses and expanded prospects after gobeille mandatory access prescription drug monitoring programs and prescription drug abuse vital signs: changes in opioid prescribing in the united states characteristics of us counties with high opioid overdose mortality and low capacity to deliver medications for opioid use disorder a perspective on medicolegal death investigation in the united states medical examiner and coroner systems: history and trends drug overdose deaths in the united states consensus recommendations for national and state poisoning surveillance drive times to opioid treatment programs in urban and rural counties in us states only one in twenty justice-referred adults in specialty treatment for opioid use receive methadone or buprenorphine injectable extended-release naltrexone for opioid dependence: a double-blind, placebocontrolled, multicentre randomised trial medication for opioid use disorder after nonfatal opioid overdose and association with mortality: a cohort study opioid epidemic or pain crisis? using the virginia all payer claims database to describe opioid medication prescribing patterns and potential harms for patients with cancer health communication campaigns to drive demand for evidence-based practices and reduce stigma in the healing communities study pharmacy reporting and data submission methadone maintenance therapy versus no opioid replacement therapy for opioid dependence buprenorphine maintenance versus placebo or methadone maintenance for opioid dependence prescription opioid duration of action and the risk of unintentional overdose among patients receiving opioid therapy medical examiners' and coroners' handbook on death registration and fetal death reporting u.s. standard certificate of death new york department of health ohio data submission dispenser guide rise and regional disparities in buprenorphine utilization in the united states prescription drug monitoring program training and technical assistance center df#:~:text=in% % c% doj% began% the% harold% rogers% prescri ption,were% interested% in% establishing% c% implementing% c% and% enhancing% pdmps. accessed on prescription drug monitoring program training and technical assistance center (pdmp ttac), . pdmp policies and practices potentially inappropriate opioid prescribing, overdose, and mortality method to adjust provisional counts of drug overdose deaths for underreporting. division of vital statistics corrected us opioid-involved drug poisoning deaths and mortality rates patterns of buprenorphine-naloxone treatment for opioid use disorder in a multistate population variation in adult outpatient opioid prescription dispensing by age and sex -united states drug and opioid-involved overdose deaths -united states characteristics of initial prescription episodes and likelihood of long-term opioid use -united states methodological complexities in quantifying rates of fatal opioid-related overdose drug overdose deaths: let's get specific mortality risk during and after opioid substitution treatment: systematic review and meta-analysis of cohort studies medications for opioid use disorder substance abuse mental health services administration (samhsa), . certification of opioid treatment programs (otps) us surgeon general's advisory on naloxone and opioid overdose department of health and human services (hhs), . 
hhs guide for clinicians on the appropriate dosage reduction or discontinuation of long-term opioid analgesics fda identifies harm reported from sudden discontinuation of opioid pain medicines and requires label changes to guide prescribers on gradual opioid overdose rates and implementation of overdose education and nasal naloxone distribution in massachusetts: interrupted time series analysis identifying opioid overdose deaths using vital statistics data state variation in certifying manner of death and drugs involved in drug intoxication deaths evidence-based practices in the healing communities study drug alcohol depend drug data. www.wolterskluwercdi.com. accessed on community dashboards to support datainformed decision making in the healing communities study table . healing communities study primary and secondary outcome measures for hypothesis testing all authors contributed to the development of the hcs measures, the development of the framework for the manuscript, and the editing of the manuscript. s. slavova, j. villani, and s.l.walsh drafted the introduction, s. slavova drafted the methods, m.r. larochelle developed the table, and each author participated in drafting parts of the results or discussion sections. key: cord- -f drinpl authors: raoult, didier title: lancet gate: a matter of fact or a matter of concern date: - - journal: new microbes new infect doi: . /j.nmni. . sha: doc_id: cord_uid: f drinpl nan has been the split between the pro and con chloroquine proponents who represented the split for or against trump, for or against bolsonaro, rich european and american countries against the eastern or african countries that use it in the majority. as a matter of fact, chloroquine and hydroxychloroquine have been recommended in countries covering more than half of the world population, not recommended in some parts of western europe and the united states, and even banned in france. this shows that hic et nunc (here and now), there is not a single truth, but at this stage there are opinions, each one having data that it analyzes in the most appropriate way with the method considered best to answer yes to the hypothesis ( ). we should not forget that husserl clearly explained that mathematical methods are the clothes of ideas ( ) and sophisticated models should not dissimulate rough data. moreover, the growing use of "big data" is revealed here. in practice, the use of data collected in a more or less professional way for another use, retreated to make an adjustment (propensity score), allowing a comparison of the different groups to evaluate strategies. this gives the illusion of j o u r n a l p r e -p r o o f being more credible because of the large numbers. in fact, the studies reported by the physicians themselves may correct dubious data by their own experience, the computer will not. in practice, under these conditions, nothing is verifiable anymore and a painful experience has just shown us this with the episode of surgisphere who managed to publish in the two best journals of the medical world, series whose sources are unknown, whose methods are unknown and were retracted. a simple analysis of the elements, that i carried myself, immediately showed that these studies were just impossible, either arranged or purely invented. in one study, death number was superior to the whole deaths of the country ( ), in the other it was claimed that ethnicity was recorded in all cases including france where it is illegal ( ). 
in these conditions, we see the realization of the prediction of the birth of hyperreality, written by baudrillard (simulacra and simulations) in the seventies ( ), which describes a world where digital reality no longer represents a distortion of reality but simply another reality that no longer has anything to do with tangible reality. that was also predicted in science fiction books by p. dick ( ). the most extreme case was recently revealed in london, where the most rated restaurant on tripadvisor called "the shed at dulwich" did not exist, and which was, in fact, pure farce fuelled by false comments placed on tripadvisor. how a restaurant that didn't exist could become the most popular restaurant in london in months is also part epidemies: vrais dangers et fausses alertes. michel lafon ed. . ( ) j.hopsins. coronavirus resource center key: cord- - aq ai authors: iovanovici, alexandru; avramoni, dacian; prodan, lucian title: a dataset of urban traffic flow for romanian cities amid lockdown and after ease of covid related restrictions date: - - journal: data brief doi: . /j.dib. . sha: doc_id: cord_uid: aq ai this dataset comprises street-level traces of traffic flow as reported by here maps™ for cities of romania from th. of may and until th. of june . this covers the time two days before lifting of the mobility restrictions imposed by the covid nation-wide state of emergency and until four days after the second wave of relaxation, announced for st. of june . data were sampled at a -minute interval, consistent with the here api update time. the data are annotated with relevant political decisions and religious events which might influence the traffic flow. considering the relative scarcity of real-life traffic data, one can use this data set for micro-simulation during development and validation of intelligent transportation solutions (its) algorithms while another facet would be in the area of social and political sciences when discussing the effectiveness and impact of statewide restriction during the covid pandemic. transportation traffic flow demand data table figure cvs data files how data were acquired software application (available in the dataset, as part of the article), developed using python language, using here api for gathering raw data regarding live traffic and a set of custom developed scripts for cleaning the data (detailed below) and plotting visual representations of the instantaneous traffic flow. hand annotation was used for providing supplementary data and information for specific events regarding the national policy against covid and also for description of the cities the datasets covers the period from th. of may and until th. of june , with a sampling period of minutes, using the standard here maps traffic api there are software scripts used: one is responsible for job automation and runs the grabbing script at a minutes interval, which subsequently launches the api requests for each of the cities and writes the xml files with raw data on drive. later the third script iterates over the xml files and extracts the road information data and traffic flow data, discarding the geometrical properties of the road. • there is a scarcity of data available regarding traffic flow and road use demand. even if larger cities in highly developed nations have near real-time data from its systems, in other cases those data are practically impossible to gather with good quality and at decent costs. 
this data set covers a broad range of demands and loads, form almost empty roads (during covid restrictions) and up to full traffic (after second set of relaxation rules); • this dataset is directly useful for practitioners in the field of its systems design, for assessing transportation capacity and developing algorithms and policies for congestion prediction and mitigation and also for sociologists doing research regarding the impact of covid restrictions and the reaction of the public to the restrictions and gradual lifting of the restrictions. • the main usage of the data, in the field of its, is to provide real-life data from a variety of romanian cities (ranging from small to large in population, area and road network size) useful for training machine learning algorithms for prediction of congestion and for simulation of the impact of traffic incidents over the traffic flow. practitioners in the field of social sciences can benefit from the data in the analysis of specific reactions of the population to covid restrictions. • descriptive statistics could be used for simple analysis of data and detection of anomalies in the traffic flow which in turn can be used for inferring hidden events such as an incident on a minor street which feeds to a major artery. • machine learning methods and tools can be used for identifying signature-features of traffic flow which predict congestion, with high spatial resolution. • qualitative analysis of the impact of covid transportation restrictions can be made, with ramification of both the economic sector and epidemiological one in the field of transportation there is a distinct subfield of intelligent transportation systems (its) characterized by the usage of methods and tools of computation, mathematics and control theory for deriving means of maximizing the usability of the existing infrastructure (transportation capacity and quality) or the decision to develop new infrastructure [ ] . one of the current important topics in this field is related to congestion prediction [ , ] , while a lot of the approaches rely on the means and methods of machine learning to leverage the value of the past (historic data) in order to predict the future (when congestion will arise) [ ] . another subject of interest, directly connected to the problem of congestion is the one related to the traffic incident management [ , ] . a lot of the rules, policies and the systems are designed and work well in stable nominal conditions (when all the participants obey the traffic laws and everything works as intended). analysis done over the root cause of major gridlocks showed that the complex dynamics involved with road traffic allows minor incidents (i.e. a car not giving way when changing lanes) to become major sources of trouble spanning dozens of minutes a few blocks (hundreds of meters) radius [ ] . the resolution of both problems can be addressed in a virtual environment using what is called traffic micro-simulation [ ] . when fed high quality data and with a good description of the existing infrastructure, current software tools for microsimulation are capable of mirroring actual traffic conditions over a time-span ranging from dozens of minutes to hours [ ] . topology of the road infrastructure and the placement of road signs and traffic signaling plans are core components of the simulation scenarios and can be obtained either from local authorities or from open data ( [ , ] ) and an initial leg-work (for collecting data regarding signaling plans). 
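as a small example of the descriptive analyses suggested in the bullets above, the snippet below computes an hourly congestion indicator per city as the ratio of observed speed to free-flow speed. the column names are assumptions standing in for the field names defined in the dataset's tables.

```python
# small descriptive analysis of the per-city csv files: an hourly congestion indicator
# computed as the ratio of observed speed to free-flow speed. column names
# ("city", "timestamp", "speed", "free_flow_speed") are assumptions; substitute the
# field names defined in the dataset's tables.
import pandas as pd

def hourly_congestion(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path, parse_dates=["timestamp"])
    df["speed_ratio"] = df["speed"] / df["free_flow_speed"]
    out = (df.set_index("timestamp")
             .groupby("city")["speed_ratio"]
             .resample("1H")
             .mean()
             .reset_index(name="mean_speed_ratio"))
    # values well below 1 indicate slower-than-free-flow traffic, i.e. possible congestion
    return out
```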
the missing component is represented by the actual conditions on the road, which can be obtained by the existing infrastructure (car counting loops and equipment) -which is costly to deploy and provide low spatial resolution -or by deploying human observers for making assessments -which is costly and provides low temporal resolution [ , ] . over the last decades, with the development of mobile applications targeted at assisting drivers on the road, a new set of sources has appeared in the form of traces form mobile devices of the drivers (or passengers), but still most of them are not providing means of accessing historical data [ ] . major players in the field provide current data inside their applications and most of the time historic data are provided in an aggregate manner, which suffice for the average user, but are not of good enough quality for the practitioners in the field of its [ , ] . we selected here maps (™) [ ] for gathering data because they provide data access via api, allowing scripted automation, and the collection of the data in an automated manner is allowed by their terms and conditions. data provided by the api is always for current conditions but can be inferred by the here maps engine when the actual number of participants to the traffic is low, expressed by confidence level (see below) [ ] . we have chosen a sampling frequency of times per hour (once every minutes) based on empirical observations regarding when data changes and limitations in the software license we used. a smaller than minutes sampling period is not useful because the here traffic api does not update the data that often. each of the cities was defined through a rectangular bounding box with geo-coordinates described in table . the time span covered by this dataset ranges from th. of may and until th. of june, during the mobility restrictions imposed by romanian authorities for containing the covid pandemic and provides the opportunity for capturing a diverse and broad spectrum of scenarios in terms of traffic demand data. the cities for which we provide traffic data, also represent a diverse set in terms of demographics, urban development and geographical placement in romania. a detailed description is provided in table . su, ty. the naming of the fields follows the notation presented in table . . raw xml files as provided by the here maps api web service. each file corresponds to a unique city and a specific moment in time. these are stored into the ./xml.zip archive and follow the naming structure _-- for a more depth and complete analysis , taking into account the context of the data (the transportation and traffic restrictions imposed on the national level by the sars-cov- /covid pandemic) we present in table the most important events with impact over the traffic flow. these data can be augmented by the user of the dataset with supplementary data (such as weather), based on their own avenue of investigation. there is no need for any documents when travelling inside national borders. the data collection is done by a dedicated python script, adapted from the one available at [ ] . using the here maps traffic flow api we query the web service for the data regarding each of the cities, defined by the bounding boxes presented in table . for each of the queries, we get a list of items as an xml formatted response. 
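to make the collection step concrete, the sketch below is a stand-in for the grabbing script: it loops over the city bounding boxes, requests the traffic flow xml for each, and writes it to a file whose name encodes the city and timestamp. the endpoint, request parameters, and bounding-box values are placeholders; the published grab.py and the here api documentation define the real ones.

```python
# stand-in for the grabbing script: request the traffic flow xml for each city's
# bounding box and store it with the city name and timestamp encoded in the file name.
# the endpoint, parameter names, bbox formatting, and coordinates are placeholders;
# the published grab.py and the here api documentation define the real ones.
from datetime import datetime
from pathlib import Path
import requests

FLOW_ENDPOINT = "https://example.invalid/traffic/flow.xml"  # placeholder endpoint
CITY_BBOXES = {
    # city label -> (lat_min, lon_min, lat_max, lon_max); illustrative values only
    "timisoara": (45.70, 21.15, 45.80, 21.30),
}

def fetch_flow_xml(bbox, api_key: str) -> str:
    lat_min, lon_min, lat_max, lon_max = bbox
    params = {"bbox": f"{lat_min},{lon_min};{lat_max},{lon_max}", "apiKey": api_key}
    resp = requests.get(FLOW_ENDPOINT, params=params, timeout=30)
    resp.raise_for_status()
    return resp.text

def grab_all(out_dir: str, api_key: str) -> None:
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M")  # run on a fixed schedule (e.g. via cron)
    for city, bbox in CITY_BBOXES.items():
        path = Path(out_dir) / city / f"{city}_{stamp}.xml"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch_flow_xml(bbox, api_key), encoding="utf-8")
```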
the structure of each item, as exemplified by the record from figure , consists of a field which describes the static structural characteristics of the road, a list of shapes describing the road segments the geographical coordinates of the start and stop points of each segment, together with its functional class and a field with traffic flow related information. the python script, with detailed comments is available in the data repository under the name grab.py. the here maps api keys were removed and should be replaced by the user's keys. the next level out automation is provided by the unix cron tool (but can be also implemented by microsoft windows scheduled tasks) and consists of a shell script for calling the grabbing script for each of the cities which need to be monitored. this file is available in the repository under the name command-line argument is the label of the city (city name) and the third argument is represented by the base folder path for the output (where the post-processed csv is to be stored). for each of the xml files found into the basepath, the script is extracting the metadata encoded into the file-name (city, date and time) and iterates over the items extracting the relevant information for traffic flow. the structure of the data is described in the data description section. for defensive programming reasons checking of none type is done and default values are stored whenever the actual data are corrupted or missing (i.e. "de" field representing the street/road name is missing and is replaced by "n/a"). for each folder (set of records about a specific city) the parse.py produces a concatenated .csv file with all the records available, one per line. these files, for each of the cities, represent the core element of this dataset and are provided distinctly per city, or as an archive with all the cities and all the records, in the data repository. the data regarding the shapes of the roads are discarded in the csv files but are available in the raw xml files, stored under the ./raw path in the dataset. this work did not include any human subjects nor animal experiments traffic flow prediction for road transportation networks with limited traffic data traffic flow prediction with big data: a deep learning approach traffic and emissions impact of congestion charging in the central beijing urban area: a simulation analysis sumo-simulation of urban mobility: an overview ptv vissim user manual traffic flow in romanian cities during and around lifting of covid restrictions guide -traffic api, bounding-box visualizing real-time traffic patterns using here traffic api traffic flow dynamics dataset on the road traffic noise measurements in the municipality of thessaloniki multi-source dataset for urban computing in a smart city extracting traffic safety knowledge from historical accident data this work was supported by research grant gnac -arut, no. / . . , financed by politehnica university of timisoara. the authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article. 
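a companion sketch of the parsing step described above (a stand-in for parse.py): it walks the stored xml files, pulls the road-description and traffic-flow attributes from each item, substitutes defaults when a field is missing, and appends one csv row per item. the xml tag and attribute names are assumptions for illustration.

```python
# stand-in for the parsing script: walk the stored xml files, extract road-description
# and traffic-flow attributes from each item with defaults when a field is missing,
# and write one csv row per item. tag and attribute names here are assumptions.
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

def parse_folder(xml_dir: str, out_csv: str) -> None:
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["city", "date", "time", "road_name", "functional_class",
                         "speed", "free_flow_speed", "confidence"])
        for xml_path in sorted(Path(xml_dir).glob("*.xml")):
            city, date, time = xml_path.stem.split("_", 2)   # metadata encoded in the file name
            root = ET.parse(str(xml_path)).getroot()
            for item in root.iter("FI"):                     # assumed tag for one flow item
                road = item.find("TMC")                      # assumed road-description element
                flow = item.find("CF")                       # assumed traffic-flow element
                name = road.get("DE", "N/A") if road is not None else "N/A"
                fclass = road.get("FC", "N/A") if road is not None else "N/A"
                speed = flow.get("SU", "") if flow is not None else ""
                free_flow = flow.get("FF", "") if flow is not None else ""
                confidence = flow.get("CN", "") if flow is not None else ""
                writer.writerow([city, date, time, name, fclass, speed, free_flow, confidence])
```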
key: cord- -zyxqcfa authors: oliver, nuria; lepri, bruno; sterly, harald; lambiotte, renaud; deletaille, sébastien; de nadai, marco; letouzé, emmanuel; salah, albert ali; benjamins, richard; cattuto, ciro; colizza, vittoria; de cordes, nicolas; fraiberger, samuel p.; koebe, till; lehmann, sune; murillo, juan; pentland, alex; pham, phuong n; pivetta, frédéric; saramäki, jari; scarpino, samuel v.; tizzoni, michele; verhulst, stefaan; vinck, patrick title: mobile phone data for informing public health actions across the covid- pandemic life cycle date: - - journal: sci adv doi: . /sciadv.abc sha: doc_id: cord_uid: zyxqcfa nan global scale and spread of the covid- pandemic highlight the need for a more harmonized or coordinated approach. in the following sections, we outline the ways in which different types of mobile phone data can help to better target and design measures to contain and slow the spread of the covid- pandemic. we identify the key reasons why this is not happening on a much broader scale, and we give recommendations on how to make mobile phone data work against the virus. passively generated mobile phone data have emerged as a potentially valuable data source to infer human mobility and social interactions. call detail records (cdrs) are arguably the most researched type of mobile data in this context. cdrs are collected by mobile operators for billing purposes. each record contains information about the time and the cell tower that the phone was connected to when the interaction took place. cdrs are event-driven records: in other words, the record only exists if the phone is actively in use. additional information includes "sightings data" obtained when a phone is seen on a network. there are, however, other types of mobile phone data used to study human mobility behaviors and interactions. x data records or network probes, can be thought as metadata about the phone's data channel, capturing background actions of apps and the network. routine information including highly accurate location data is also collected through mobile phone applications (apps) at a large scale by location intelligence companies ( ) or by ad hoc apps ( , ) . in addition, proximity between mobile phone users can be detected via bluetooth functionality on smartphones. each of these data types requires different processing frameworks and raise complex ethical and political concerns that are discussed in this paper. first, we explore the value and contribution of mobile phone data in analytical efforts to control the covid- pandemic. government and public health authorities broadly raise questions in at least four critical areas of inquiries for which the use of mobile phone data is relevant. first, situational awareness questions seek to develop an understanding of the dynamic environment of the pandemic. mobile phone data can provide access to previously unavailable population estimates and mobility information to enable stakeholders across sectors better understand covid- trends and geographic distribution. second, cause-and-effect questions seek to help identify the key mechanisms and consequences of implementing different measures to contain the spread of covid- . they aim to establish which variables make a difference for a problem and whether further issues might be caused. 
third, predictive analysis seeks to identify the likelihood of future outcomes and could, for example, leverage real-time population counts and mobility data to enable predictive capabilities and allow stakeholders to assess future risks, needs, and opportunities. finally, impact assessments aim to determine which, whether, and how various interventions affect the spread of covid- and require data to identify the obstacles hampering the achievement of certain objectives or the success of particular interventions. table provides specific examples of questions by areas of inquiry. the relevance and specific questions raised as part of these areas of inquiry differ at various stages of the outbreak, but mobile phone data provide value throughout the epidemiological cycle, shown in fig. . in the early recognition and initiation phase of the pandemic, responders focus on situational analysis and the fast detection of infected cases and their contacts. research has shown that quarantine measures of infected individuals and their family members, combined with surveillance and standard testing procedures, are effective as control measures in the early stages of the pandemic ( ) . individual mobility and contact (close proximity) data offer information about infected individuals, their locations, and social network. contact (close proximity) data can be collected through mobile apps ( , ) , interviews, or surveys ( ) . during the acceleration phase, when community transmission reaches exponential levels, the focus is on interventions for containment, which typically involve social contact and mobility restrictions. at this stage, aggregated mobile phone data are valuable to assess the efficacy of implemented policies through the monitoring of mobility between and within affected municipalities. mobility information also contributes to the building of more accurate epidemiological models that can explain and anticipate the spread of the disease, as shown for h n flu outbreaks ( ) . these models, in turn, can inform the mobilization of resources (e.g., respirators and intensive care units). last, during the deceleration and preparation phases, as the peak of infections is reached, restrictions will likely be lifted ( ) . continued situational monitoring will be important as the covid- pandemic is expected to come in waves ( , ) . near real-time data on mobility and hotspots will be important to understand how lifting and reestablishing various measures translate into behavior, especially to find the optimal combination of measures at the right time (e.g., general mobility restrictions, school closures, and banning of large gatherings), and to balance these restrictions with aspects of economic vitality. after the pandemic has subsided, mobile data will be helpful for post hoc analysis of the impact of different interventions on the progression of the disease and cost-benefit analysis of mobility restrictions. during this phase, digital contact-tracing technologies might be deployed, such as the korean smartphone app corona m ( ) and the singaporean smartphone app tracetogether ( ) , that aim at minimizing the spread of a disease as mobility restrictions are lifted. 
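as a toy illustration of the decentralized contact-tracing idea behind such apps (and the protocols discussed next), the sketch below shows phones exchanging short-lived random identifiers and later checking locally whether any observed identifier was published by a user who reported a diagnosis. it is deliberately simplified and is not any of the deployed protocols named in the surrounding text.

```python
# toy sketch of decentralized proximity matching: each phone broadcasts short-lived
# random identifiers, remembers identifiers it has heard nearby, and later checks
# locally against identifiers published by users who reported a positive test. real
# protocols derive identifiers cryptographically and add many protections omitted here.
import secrets

class Phone:
    def __init__(self) -> None:
        self.own_ids: set[str] = set()      # identifiers this phone has broadcast
        self.heard_ids: set[str] = set()    # identifiers observed from nearby phones

    def broadcast(self) -> str:
        eid = secrets.token_hex(8)          # fresh ephemeral identifier
        self.own_ids.add(eid)
        return eid

    def observe(self, eid: str) -> None:
        self.heard_ids.add(eid)

    def exposure_check(self, published_positive_ids: set[str]) -> bool:
        # matching happens on the device; only diagnosed users upload their identifiers
        return bool(self.heard_ids & published_positive_ids)

alice, bob = Phone(), Phone()
bob.observe(alice.broadcast())             # a proximity event
print(bob.exposure_check(alice.own_ids))   # True once alice reports a diagnosis
```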
along this line, researchers at the massachusetts institute of technology and collaborators are working on private kit: safe paths ( ) , an open-source and privacy-first contact-tracing technology that provides individuals with information on their proximity with diagnosed covid- carriers, using global positioning system (gps) and bluetooth data. similarly, several european universities, research centers, and companies have joined forces around pepp-pt [pan-european privacy preserving proximity tracing ( )], a col-laboration on privacy-preserving, general data protection regulation (gdpr)-compliant contact tracing. along this effort, a consortium of research institutions, led by the École polytechnique fédérale de lausanne (epfl), has developed an open decentralized privacy-preserving proximity tracing protocol and implementation using bluetooth low-energy functionality on smartphones, ensuring that personal data and computation stay entirely on an individuals' phones ( ). recently, apple and google have released a joint announcement ( ) describing their system to support bluetooth-based privacypreserving proximity tracing across ios and android smartphones. as a part of the european commission recommendation of a coordinated approach to support the gradual lifting of lockdown measures ( ) , european union (eu) member states, supported by the commission, have developed a toolbox for the development and usage of contact tracing apps, fully compliant with eu rules ( ). • what are the most common mobility flows within and between covid- -affected cities and regions? • what are variables that determine the success of social distancing approaches? • which areas are spreading the epidemics acting as origin nodes in a mobility network and thus could be placed under mobility restrictions? • how do local mobility patterns affect the burden on the medical system? • are people continuing to travel or congregate after social distancing and travel restrictions were put into place? • are business' social distancing recommendations resulting in more workers working from home? • are there hotspots at higher risk of contamination (due to a higher level of mobility and higher concentration of population)? • in what sectors are people working most from home? • what are the key entry points, locations, and movements of roamers or tourists? • what are the social and economic consequences of movement restriction measures? • how are certain human mobility patterns likely to affect the spread of the coronavirus? and what is the likely spread of covid- , based on existing disease models and up-to-date mobility data? • how have travel restrictions affected human mobility behavior and likely disease transmission? • what are the likely effects of mobility restrictions on children's education outcomes? • what is the potential of various restriction measures to avert infection cases and save lives? • what are likely to be the economic consequences of restricted mobility for businesses? • what is the effect of mandatory social distancing measures, including closure of schools? • how has the dissemination of public safety information and voluntary guidance affected human mobility behavior and disease spread? researchers and practitioners have developed a variety of aggregated metrics using mobile phone data that can help fill gaps in information needed to respond to covid- and address uncertainties regarding mobility and behaviors. 
origin-destination (od) matrices are especially useful in the first epidemiological phases, where the focus is to assess the mobility of the population. the number of people moving between two different areas daily can be computed from the mobile network data, and it can be considered a proxy of human mobility. the geographic areas of interest might be zip codes, municipalities, provinces, or even regions. these mobility flows are compared to those during a reference period to assess the reduction in mobility due to nonpharmaceutical interventions. in particular, they are useful to monitor the impact of different social and mobility contention measures and to identify regions where the measures might not be effective or followed by the population. moreover, these flows can inform spatially explicit disease transmission models to evaluate the potential benefit of such reductions. dwell estimations and hotspots are estimates of particularly high concentration of people in an area, which can be favorable to the transmission of the virus. these metrics are typically constructed within a municipality by dividing the city into grids or neighborhoods ( ) . the estimated number of people in each geographical unit can be computed with different time granularities (e.g., min, min, and hours). contact matrices estimate the number and intensity of the faceto-face interactions people have in a day. they are typically computed by age groups. these matrices have been shown to be extremely useful to assess and determine the decrease of the reproduction number of the virus ( ). however, it is still challenging to estimate face-to-face interactions from colocation and mobility data ( ) . contact-tracing apps can then be used to identify close contacts of those infected with the virus. amount of time spent at home, at work, or other locations are estimates of the individual percentage of time spent at home/work/ other locations (e.g., public parks, malls, and shops), which can be useful to assess the local compliance with countermeasures adopted by governments. the home and work locations need to be computed in a period of time before the deployment of mobility restrictions measures. the percentage of time spent in each location needs to be computed for people who do not move during this time. variations of the time spent on different locations are generally computed on an individual basis and then spatially aggregated at a zip code, municipality, city, or region level. although there is still little information about the age-specific susceptibility to covid- infection, it is clear that age is an important risk factor for covid- severity. we highlight, therefore, the importance of estimating the metrics mentioned above by age groups ( ) . figure shows an example of such metrics. the use of mobile phone data for tackling the covid- pandemic has gained attention but remains relatively scarce. although local alliances have been formed, internationally concerted action is missing, both in terms of coordination and information exchange ( ) . in part, this is the result of a failure to institutionalize past experiences. during the - ebola virus outbreak, several pilot or one-off activities were initiated. however, there was no transition to "business as usual" in terms of standardized procedures to leverage mobile phone data or establish mechanisms for "data readiness" in the country contexts ( , ) . 
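a minimal sketch of the od-matrix metric described above: daily flows between areas are counted from a table of (user, timestamp, area) sightings and compared with a pre-intervention baseline to express relative mobility change. the input schema is an assumption; in practice these aggregates are computed inside the operator's environment and only the aggregated flows are shared.

```python
# sketch of building daily origin-destination (od) flows from sightings with columns
# (user, timestamp, area) and comparing them with a baseline period; the schema is
# illustrative and timestamps are assumed to be parsed datetimes.
import pandas as pd

def daily_od_matrix(sightings: pd.DataFrame) -> pd.DataFrame:
    df = sightings.sort_values(["user", "timestamp"]).copy()
    df["date"] = df["timestamp"].dt.date
    df["origin"] = df.groupby(["user", "date"])["area"].shift()   # previous area that day
    moves = df.dropna(subset=["origin"]).query("origin != area")
    return (moves.groupby(["date", "origin", "area"])
                 .agg(trips=("user", "nunique"))
                 .reset_index()
                 .rename(columns={"area": "destination"}))

def mobility_change(od: pd.DataFrame, baseline_od: pd.DataFrame) -> pd.DataFrame:
    base = (baseline_od.groupby(["origin", "destination"], as_index=False)["trips"]
                       .mean()
                       .rename(columns={"trips": "baseline_trips"}))
    out = od.merge(base, on=["origin", "destination"], how="left")
    out["relative_change"] = out["trips"] / out["baseline_trips"] - 1.0   # e.g. -0.6 = 60% drop
    return out
```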
technology has evolved with various platforms offering enhanced and secured access and analysis of mobile data, including for humanitarian and development use cases [e.g., open algorithms for better decisions project ( ) and flowkit ( ) ]. furthermore, high-level meetings have been held [e.g., the european commission's business-to-government (b g) data sharing highlevel expert group], data analysis and sharing initiatives have shown promising results, yet the use of metrics and insights derived from mobile phone data by governments and local authorities is still minimal today ( ) . several factors likely explain this "implementation" gap. first, governments and public authorities frequently are unaware and/or lack a "digital mindset" and capacity needed for both for processing information that often is complex and requires multidisciplinary expertise (e.g., mixing location and health data and specialized modeling) and for establishing the necessary interdisciplinary teams and collaborations. many government units are understaffed and sometimes also lack technological equipment. during the covid- pandemic, most authorities are overwhelmed by the multiplicity and simultaneity of requests; as they have never been confronted with such a crisis, there are few predefined procedures and guides, so targeted and preventive action is quickly abandoned for mass actions. these problems are exacerbated at local levels of governments (e.g., towns and counties), which are precisely the authorities doing the frontline work in most situations. in addition, many public authorities and decision-makers are not aware of the value that mobile phone data would provide for decision-making and are often used to make decisions without knowing the full facts and under conditions of uncertainty. second, despite substantial efforts, access to data remains a challenge. most companies, including mobile network operators, tend to be very reluctant to make data available-even aggregated and anonymized-to researchers and/or governments. apart from data protection issues, such data are also seen and used as commercial assets, thus limiting the potential use for humanitarian goals if there are no sustainable models to support operational systems. one should also be aware that not all mobile network operators in the world are equal in terms of data maturity. some are actively sharing data as a business, while others have hardly started to collect and use data. third, the use of mobile phone data raises legitimate public concerns about privacy, data protection, and civil liberties. governments in china, south korea, israel, and elsewhere have openly accessed and used personal smartphone app data for tracking individual movements and notifying individuals. however, in other regions, such as in europe, both national and regional legal regulations limit such use (especially the eu law on data protection and privacy known as the gdpr). furthermore, around the world, public opinion surveys, social media, and a broad range of civil society actors including consumer groups and human rights organizations have raised legitimate concerns around the ethics, potential loss of privacy, and long-term impact on civil liberties resulting from the use of individual mobile data to monitor covid- . control of the pandemic requires control of people-including their mobility and other behaviors. 
a key concern is that the pandemic is used to create and legitimize surveillance tools used by government and technology companies that are likely to persist beyond the emergency. such tools and enhanced access to data may be used for purposes such as law enforcement by the government or hypertargeting by the private sector. such an increase in government and industry power and the absence of checks and balance is harmful in any democratic state. the consequences may be even more devastating in less democratic states that routinely target and oppress minorities, vulnerable groups, and other populations of concern. fourth, researchers and technologists frequently fail to articulate their findings in clear, actionable terms that respond to practical political and technical questions. researchers and domain experts tend to define the scope and direction of analytical problems from their perspective and not necessarily from the perspective of govern-ments' needs. critical decisions have to be taken, while key results are often published in scientific journals and in jargon that are not easily accessible to outsiders, including government workers and policy makers. last, there is little political will and resources invested to support preparedness for immediate and rapid action. on country levels, there are too few latent and standing mixed teams, composed of (i) representatives of governments and public authorities, (ii) mobile network operators and technology companies, and (iii) different topic experts (virologists, epidemiologists, and data analysts); and there are no procedures and protocols predefined. none of these challenges are insurmountable, but they require a clear call for action. to effectively build the best, most up-to-date, relevant, and actionable knowledge, we call on governments, mobile network operators, and technology companies (e.g. google, facebook, and apple), and researchers to form mixed teams. governments should be aware of the value of information and knowledge that can be derived from mobile phone data analysis, especially for monitoring the necessary measures to contain the pandemic. they should enable and leverage the fair and responsible provision and use of aggregated and anonymized data for this purpose. mobile network operators and technology companies with widespread adoption of their products (e.g. facebook, google, and apple) should take their social responsibility and the vital role that they can play in tackling the pandemic. they should reach out to governments and the research community. researchers and domain experts (e.g. virologists, epidemiologists, demographers, data scientists, computer scientists, and computational social scientists) should acknowledge the value of interdisciplinary teams and context specificities and sensitivities. impact would be maximized if governments and public authorities are included early on and throughout their efforts to identify the most relevant questions and knowledge needs. creating multidisciplinary interinstitutional teams is of paramount importance, as recently shown successfully in belgium and the valencian region of spain ( ) . four key principles should guide the implementation of such mixed teams to improve their effectiveness, namely (i) the early inclusion of governments, (ii) the liaising with data protection authorities early on, (iii) international exchange, and (iv) preparation for all stages of the pandemic. 
relevant government and public authorities should be involved early, and researchers need to build upon their knowledge systems and need for information. one key challenge is to make insights actionable-how can findings such as propagation maps lastly be used (e.g., for setting quarantine zones, informing local governments, and targeting communication). at the same time, expectations must be realistic: decisions on measures should be based on facts but are, in the end, political decisions. many insights derived from mobile phone data analytics do not have practical implications-such analysis and the related data collection should be discouraged until proven necessary. we also suggest such efforts be transparent and involve data protection authorities and civil liberties advocates early on and have quick iteration cycles with them. for example, policy makers should consider the creation of an ethics and privacy advisory committee to oversee and provide feedback on projects. this ensures that privacy is maintained and raises potential user acceptance. aggregated mobile phone data can be used in line even with the strict european regulations (gdpr). earlier initiatives have established principles and methods for sharing data or indicators without endangering personal information and build privacy-preserving solutions that use only incentives to manage behavior ( ) ( ) ( ) . the early inclusion of the data protection authority in belgium has led to the publishing of a statement by the european data protection board on how to process mobile phone data in the fight against covid- ( ) . even while acknowledging the value of mobile phone data, the urgency of the situation should not lead to losses of data privacy and other civil liberties that might become permanent after the pandemic. in this regard, the donation of data for good and the direct and limited (in time and scope) sharing of aggregated data by mobile network operators with (democratic) governments and researchers seem to be less problematic than the use of individual location data commercially acquired, brought together, and analyzed by commercial enterprises. more generally, any emergency data system set to monitor covid- and beyond must follow a balanced and well-articulated set of data policies and guidelines and is subjected to risk assessments. specifically, any efforts should meet clear tests on the proportionate, legal, accountable, necessary, and ethical use of mobile phone data in the circumstances of the pandemic and seek to minimize the amount of information gathered to what is necessary to accomplish the objective concerned. these are not unknown criteria; they are well inscribed into international human rights standards and law concerning, for example, the use of force. certainly, the use of mobile phone data does not equate to the use of force, but in the wrong hands, it can have similarly devastating effects and lead to substantially curtail civil liberties. considering the broad absence of legal frameworks and historical mishandling of data by technology companies, there is an urgent need for responsible global leadership and governance to guide efforts to use technology in times of emergency. we further see a clear need for more international exchange, not only with other domain experts but also with other initiatives and groups; findings must be shared quickly-there will be time for peer-reviewed publications later. 
in particular, in countries with weaker health (and often also economic) systems, the targeting and effectiveness of nonpharmaceutical interventions might make a big difference. this also implies the translation of important findings from english to other relevant languages. for later stages of the pandemic, and for the future, stakeholders should aim for a minimum level of "preparedness" for immediate and rapid action. on country and/or region levels, there will be a need of "standing" mixed teams; up-to-date technology, basic agreements, and legal prescriptions; and data access, procedures, and protocols predefined [also for "appropriate anonymization and aggregation protocols"; ( ) ]. a long-time collaboration between infectious disease modelers, epidemiologists, and researchers of mobile network operator laboratories in france helped jump-start a project on the covid- pandemic, with the support of public health authorities ( ) . last, in addition to (horizontal) international exchange, we also need international approaches that are coordinated by supranational bodies. national initiatives might help to a certain extent but will not be sufficient in the long run. a global pandemic necessitates globally or at least regionally coordinated work. here, promising approaches are emerging: the eu commission on march called upon european mobile network operators to hand over anonymized and aggregated data to the commission to track virus spread and determine priority areas for medical supplies ( ) , while other coordination initiatives are emerging in africa, latin america, and the mena (middle east and north africa) region. it will be important for such initiatives to link up, share knowledge, and collaborate. the covid- pandemic will not be over soon, and it will not be the last pandemic we face. privacy-aware and ethically acceptable solutions to use mobile phone data should be prepared and vetted in advance, and we must raise readiness on national and international levels, so we can act rapidly when the crisis hits. how will country-based mitigation measures influence the course of the covid- epidemic? 
the effect of travel restrictions on the spread of the novel coronavirus (covid- ) outbreak expected impact of school closure and telework to mitigate covid- epidemic in france impact of non-pharmaceutical interventions (npis) to reduce covid mortality and healthcare demand an investigation of transmission control measures during the first days of the covid- epidemic in china age profile of susceptibility, mixing, and social distancing shape the dynamics of the novel coronavirus disease outbreak in china using mobile phone data to predict the spatial spread of cholera mobile phone data highlights the role of mass gatherings in the spreading of cholera outbreaks on the use of human mobility proxies for modeling epidemics quantifying the impact of human mobility on malaria impact of human mobility on the emergence of dengue epidemics in pakistan valencia prepara un proyecto pionero con datos de móviles para trazar el movimiento del coronavirus measuring levels of activity in a changing city assessing changes in commuting and individual mobility in major metropolitan areas in the united states during the covid- outbreak the effect of human mobility and control measures on the covid- epidemic in china effect of non-pharmaceutical interventions for containing the covid- outbreak in china covid- outbreak response: a first assessment of mobility changes in italy following national lockdown oxford covid- impact monitor governments are using cellphone location data to manage the coronavirus. the verge effectiveness of social distancing strategies for protecting a community from a pandemic with a data-driven contact network based on census and real-world mobility data aggregated mobility data could help fight covid- pocketcare: tracking the flu with mobile phones using partial observations of proximity and symptoms interventions to mitigate early spread of sars-cov- in singapore: a modelling study quantifying sars-cov- transmission suggests epidemic control with digital contact tracing the covid impact survey: assessing the pulse of the covid- pandemic in spain via questions seasonal transmission potential and activity peaks of the new influenza a(h n ): a monte carlo likelihood analysis based on human mobility ending coronavirus lockdowns will be a dangerous process of trial and error projecting the transmission dynamics of sars-cov- through the postpandemic period south korea took rapid, intrusive measures against covid- -and they worked use of surveillance to fight coronavirus raises concerns about government power after pandemic ends decentralized privacy-preserving proximity tracing apple and google partner on covid- contact tracing technology -apple joint european roadmap towards lifting covid- containment measures mobile applications to support contact tracing in the eu's fight against covid- : common eu toolbox for member states from mobile phone data to the spatial structure of cities can co-location be used as a proxy for face-to-face contacts? waiting on hold ebola: a big data disaster privacy, property, and the law of disaster experimentation can tracking people through phone-call data improve lives? 
flowkit: unlocking the power of mobile data for humanitarian and development purposes decoded: how ai is helping fight a pandemic -europe's coronavirus app-insights from valencia productive disruption: opportunities and challenges for innovation in infectious disease surveillance on the privacy-conscientious use of mobile phone data sharing is caring four key requirements for sustainable private data sharing and use for public good (data-pop alliance and vodafone institute for society and communications european data protection board, statement on the processing of personal data in the context of the covid- outbreak pourquoi les données téléphoniques aident à comprendre la pandémie de covid- commission tells carriers to hand over mobile data in coronavirus fight uzicanin; cdc community mitigation guidelines work group, community mitigation guidelines to prevent pandemic influenza -united states key: cord- -fkdep cp authors: thompson, robin n.; hollingsworth, t. déirdre; isham, valerie; arribas-bel, daniel; ashby, ben; britton, tom; challenor, peter; chappell, lauren h. k.; clapham, hannah; cunniffe, nik j.; dawid, a. philip; donnelly, christl a.; eggo, rosalind m.; funk, sebastian; gilbert, nigel; glendinning, paul; gog, julia r.; hart, william s.; heesterbeek, hans; house, thomas; keeling, matt; kiss, istván z.; kretzschmar, mirjam e.; lloyd, alun l.; mcbryde, emma s.; mccaw, james m.; mckinley, trevelyan j.; miller, joel c.; morris, martina; o'neill, philip d.; parag, kris v.; pearson, carl a. b.; pellis, lorenzo; pulliam, juliet r. c.; ross, joshua v.; tomba, gianpaolo scalia; silverman, bernard w.; struchiner, claudio j.; tildesley, michael j.; trapman, pieter; webb, cerian r.; mollison, denis; restif, olivier title: key questions for modelling covid- exit strategies date: - - journal: proc biol sci doi: . /rspb. . sha: doc_id: cord_uid: fkdep cp combinations of intense non-pharmaceutical interventions (lockdowns) were introduced worldwide to reduce sars-cov- transmission. many governments have begun to implement exit strategies that relax restrictions while attempting to control the risk of a surge in cases. mathematical modelling has played a central role in guiding interventions, but the challenge of designing optimal exit strategies in the face of ongoing transmission is unprecedented. here, we report discussions from the isaac newton institute ‘models for an exit strategy’ workshop ( – may ). a diverse community of modellers who are providing evidence to governments worldwide were asked to identify the main questions that, if answered, would allow for more accurate predictions of the effects of different exit strategies. based on these questions, we propose a roadmap to facilitate the development of reliable models to guide exit strategies. this roadmap requires a global collaborative effort from the scientific community and policymakers, and has three parts: (i) improve estimation of key epidemiological parameters; (ii) understand sources of heterogeneity in populations; and (iii) focus on requirements for data collection, particularly in low-to-middle-income countries. this will provide important information for planning exit strategies that balance socio-economic benefits with public health. as of august , the coronavirus disease (covid- ) pandemic has been responsible for more than million reported cases worldwide, including over deaths. 
mathematical modelling is playing an important role in guiding interventions to reduce the spread of severe acute respiratory syndrome coronavirus (sars-cov- ). although the impact of the virus has varied significantly across the world, and different countries have taken different approaches to counter the pandemic, many national governments introduced packages of intense non-pharmaceutical interventions (npis), informally known as 'lockdowns'. although the socio-economic costs (e.g. job losses and long-term mental health effects) are yet to be assessed fully, public health measures have led to substantial reductions in transmission [ ] [ ] [ ] . data from countries such as sweden and japan, where epidemic waves peaked without strict lockdowns, will be useful for comparing approaches and conducting retrospective cost-benefit analyses. as case numbers have either stabilized or declined in many countries, attention has turned to strategies that allow restrictions to be lifted [ , ] in order to alleviate the economic, social and other health costs of lockdowns. however, in countries with active transmission still occurring, daily disease incidence could increase again quickly, while countries that have suppressed community transmission face the risk of transmission reestablishing due to reintroductions. in the absence of a vaccine or sufficient herd immunity to reduce transmission substantially, covid- exit strategies pose unprecedented challenges to policymakers and the scientific community. given our limited knowledge, and the fact that entire packages of interventions were often introduced in quick succession as case numbers increased, it is challenging to estimate the effects of removing individual measures directly and modelling remains of paramount importance. we report discussions from the 'models for an exit strategy' workshop ( ) ( ) ( ) ( ) ( ) may ) that took place online as part of the isaac newton institute's 'infectious dynamics of pandemics' programme. we outline progress to date and open questions in modelling exit strategies that arose during discussions at the workshop. most participants were working actively on covid- at the time of the workshop, often with the aim of providing evidence to governments, public health authorities and the general public to support the pandemic response. after four months of intense model development and data analysis, the workshop gave participants a chance to take stock and openly share their views of the main challenges they are facing. a range of countries was represented, providing a unique forum to discuss the different epidemic dynamics and policies around the world. although the main focus was on epidemiological models, the interplay with other disciplines formed an integral part of the discussion. the purpose of this article is twofold: to highlight key knowledge gaps hindering current predictions and projections, and to provide a roadmap for modellers and other scientists towards solutions. given that sars-cov- is a newly discovered virus, the evidence base is changing rapidly. to conduct a systematic review, we asked the large group of researchers at the workshop for their expert opinions on the most important open questions, and relevant literature, that will enable exit strategies to be planned with more precision. 
by inviting contributions from representatives of different countries and areas of expertise (including social scientists, immunologists, epidemic modellers and others), and discussing the expert views raised at the workshop in detail, we sought to reduce geographical and disciplinary biases. all evidence is summarized here in a policy-neutral manner. the questions in this article have been grouped as follows. first, we discuss outstanding questions for modelling exit strategies that are related to key epidemiological quantities, such as royalsocietypublishing.org/journal/rspb proc. r. soc. b : the reproduction number and herd immunity fraction. we then identify different sources of heterogeneity underlying sars-cov- transmission and control, and consider how differences between hosts and populations across the world should be included in models. finally, we discuss current challenges relating to data requirements, focusing on the data that are needed to resolve current knowledge gaps and how uncertainty in modelling outputs can be communicated to policymakers and the wider public. in each case, we outline the most relevant issues, summarize expert knowledge and propose specific steps towards the development of evidencebased exit strategies. this leads to a roadmap for future research (figure ) made up of three key steps: (i) improve estimation of epidemiological parameters using outbreak data from different countries; (ii) understand heterogeneities within and between populations that affect virus transmission and interventions; and (iii) focus on data needs, particularly data collection and methods for planning exit strategies in low-to-middle-income countries (lmics) where data are often lacking. this roadmap is not a linear process: improved understanding of each aspect will help to inform other requirements. for example, a clearer understanding of the model resolution required for accurate forecasting ( § a) will inform the data that need to be collected ( § ), and vice versa. if this roadmap can be followed, it will be possible to predict the likely effects of different potential exit strategies with increased precision. this is of clear benefit to global health, allowing exit strategies to be chosen that permit interventions to be relaxed while limiting the risk of substantial further transmission. (a) how can viral transmissibility be assessed more accurately? the time-dependent reproduction number, r(t) or r t , has emerged as the main quantity used to assess the transmissibility of sars-cov- in real time [ ] [ ] [ ] [ ] [ ] . in a population with active virus transmission, the value of r(t) represents the expected number of secondary cases generated by someone infected at time t. if this quantity is, and remains below, one, then an ongoing outbreak will eventually fade out. although easy to understand intuitively, estimating r(t) from case reports (as opposed to, for example, observing r(t) in known or inferred transmission trees [ ] ) requires the use of mathematical models. as factors such as contact rates between infectious and susceptible individuals change during an outbreak in response to public health advice or movement restrictions, the value of r(t) has been found to respond rapidly. for example, across the uk, country-wide and regional estimates of r(t) dropped from approximately . - in mid-march [ , ] to below one after lockdown was introduced [ , ] . one of the criteria for relaxing the lockdown was for the reproduction number to decrease to 'manageable levels' [ ] . 
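to make this estimation step concrete, the sketch below infers r(t) from daily case counts with a simple renewal-equation calculation of the kind underlying several of the estimation methods discussed in this section: transmissibility is assumed constant over a short smoothing window, and cases on each day are compared with the 'total infectiousness' implied by earlier cases and a known serial-interval distribution. the synthetic case counts, the discretised serial interval and the seven-day window are illustrative assumptions, and a real analysis would also need to handle reporting delays, imported cases and uncertainty.

import numpy as np

def renewal_rt(cases, si_pmf, window=7):
    """crude renewal-equation estimate of r(t) from daily case counts.

    cases  : daily incident case counts
    si_pmf : discretised serial-interval distribution, si_pmf[k] = P(interval = k + 1 days)
    window : number of days over which transmissibility is assumed constant
    """
    cases = np.asarray(cases, dtype=float)
    rt = np.full(len(cases), np.nan)
    for t in range(window, len(cases)):
        num, den = 0.0, 0.0
        for s in range(t - window + 1, t + 1):
            # total infectiousness at day s: earlier cases weighted by the serial interval
            lam = sum(si_pmf[k] * cases[s - k - 1] for k in range(min(len(si_pmf), s)))
            num += cases[s]
            den += lam
        rt[t] = num / den if den > 0 else np.nan
    return rt

# synthetic case counts: growth followed by decline after an intervention (illustrative only)
cases = [int(5 * 1.2 ** t) for t in range(20)] + [int(160 * 0.92 ** t) for t in range(20)]
# serial-interval distribution discretised over nine days (assumed shape, for illustration only)
si = np.array([0.05, 0.15, 0.22, 0.20, 0.15, 0.10, 0.07, 0.04, 0.02])
print(np.round(renewal_rt(cases, si / si.sum()), 2))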
monitoring r(t), as well as case numbers, as individual components of the lockdown are relaxed is critical for understanding whether or not the outbreak remains under control [ ] . several mathematical and statistical methods for estimating temporal changes in the reproduction number have been proposed. two popular approaches are the wallinga-teunis method [ ] and the cori method [ , ] . these methods use case notification data along with an estimate of the serial interval distribution (the times between successive cases in a transmission chain) to infer the value of r(t). other approaches exist (e.g. based on compartmental epidemiological models [ ] ), including those that can be used alongside different data (e.g. time series of deaths [ , , ] or phylogenetic data [ ] [ ] [ ] [ ] ). despite this extensive theoretical framework, practical challenges remain. reproduction number estimates often rely on case notification data that are subject to delays between case onset and being recorded. available data, therefore, do not include up-to-date knowledge of current numbers of infections, an issue that can be addressed using 'nowcasting' models [ , , ] . the serial interval represents the period between symptom onset times in a transmission chain, rather than between times at which cases are recorded. time series of symptom onset dates, or even infection dates (to be used with estimates of the generation interval when inferring r(t)), can be estimated from case notification data using latent variable methods [ , ] or methods such as the richardson-lucy deconvolution technique [ , ] . the richardson-lucy approach has previously been applied to infer incidence curves from time series of deaths [ ] . these methods, as well as others that account for reporting delays [ figure . research roadmap to facilitate the development of reliable models to guide exit strategies. three key steps are required: (i) improve estimates of epidemiological parameters (such as the reproduction number and herd immunity fraction) using data from different countries ( § a-d); (ii) understand heterogeneities within and between populations that affect virus transmission and interventions ( § a-d); and (iii) focus on data requirements for predicting the effects of individual interventions, particularly-but not exclusively-in data-limited settings such as lmics ( § a-c). work in these areas must be conducted concurrently; feedback will arise from the results of the proposed research that will be useful for shaping next steps across the different topics. (online version in colour.) royalsocietypublishing.org/journal/rspb proc. r. soc. b : useful avenues to improve the practical estimation of r(t). further, changes in testing practice (or capacity to conduct tests) lead to temporal changes in case numbers that cannot be distinguished easily from changes in transmission. understanding how accurately and how quickly changes in r(t) can be inferred in real time given these challenges is crucial. another way to assess temporal changes in r(t), without requiring nowcasting, is by observing people's transmissionrelevant behaviour directly, e.g. through contact surveys or mobility data [ ] . these methods come with their own limitations: because these surveys do not usually collect data on infections, care must be taken in using them to understand and predict ongoing changes in transmission. 
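as an illustration of the deconvolution step mentioned above, the sketch below applies a basic richardson-lucy iteration to recover an incidence curve from a delayed observation series (for example deaths or late reports). the delay distribution, the synthetic single-wave data and the fixed number of iterations are assumptions made only for illustration; applied versions typically add regularisation, handling of right truncation at the end of the series and uncertainty quantification.

import numpy as np

def richardson_lucy(observed, delay_pmf, n_iter=50):
    """basic richardson-lucy deconvolution of a delayed count series.

    observed  : observed daily counts (e.g. deaths), modelled as incidence convolved with delay_pmf
    delay_pmf : probability mass function of the delay from infection (or onset) to observation
    n_iter    : number of multiplicative update iterations
    """
    observed = np.asarray(observed, dtype=float)
    n = len(observed)
    estimate = np.full(n, observed.mean() + 1e-9)  # flat initial guess for the incidence curve
    for _ in range(n_iter):
        # forward step: convolve the current incidence estimate with the delay distribution
        predicted = np.convolve(estimate, delay_pmf)[:n] + 1e-12
        ratio = observed / predicted
        # backward step: project the mismatch back onto the incidence estimate (adjoint of the convolution)
        correction = np.convolve(ratio[::-1], delay_pmf)[:n][::-1]
        estimate *= correction
    return estimate

# illustrative single-wave example with a delay of roughly a week between incidence and observation
days = np.arange(60)
true_incidence = 100 * np.exp(-0.5 * ((days - 25) / 6.0) ** 2)
delay = np.array([0.02, 0.05, 0.10, 0.15, 0.18, 0.17, 0.13, 0.10, 0.06, 0.04])  # sums to one
observed = np.convolve(true_incidence, delay)[:60]
recovered = richardson_lucy(observed, delay)
print("peak day (true, observed, recovered):",
      int(np.argmax(true_incidence)), int(np.argmax(observed)), int(np.argmax(recovered)))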
other outstanding challenges in assessing variations in r(t) include the decrease in accuracy when case numbers are low, and the requirement to account for temporal changes in the serial interval or generation time distribution of the disease [ , ] . when there are few cases (such as in the 'tail' of an epidemic- § d), there is little information with which to assess virus transmissibility. methods for estimating r(t) based on the assumption that transmissibility is constant within fixed time periods can be applied with windows of long duration (thereby including more case notification data with which to estimate r(t)) [ , ] . however, this comes at the cost of a loss of sensitivity to temporal variations in transmissibility. consequently, when case numbers are low, the methods described above for tracking transmission-relevant behaviour directly are particularly useful. in those scenarios, the 'transmission potential' might be more important than realized transmission [ ] . the effect of population heterogeneity on reproduction number estimates requires further investigation, as current estimates of r(t) tend to be calculated for whole populations (e.g. countries or regions). understanding the characteristics of constituent groups contributing to this value is important to target interventions effectively [ , ] . for this, data on infections within and between different subpopulations (e.g. infections in care homes and in the wider population) are needed. as well as between subpopulations, it is also necessary to ensure that estimates of r(t) account for heterogeneity in transmission between different infectious hosts. such heterogeneity alters the effectiveness of different control measures, and, therefore, the predicted disease dynamics when interventions are relaxed. for a range of diseases, a rule of thumb that around % of infected individuals are the sources of % of infections has been proposed [ , ] . this is supported by recent evidence for covid- , which suggests significant individual-level variation in sars-cov- transmission [ ] with some transmission events leading to large numbers of new infections. finally, it is well documented that presymptomatic individuals (and, to a lesser extent, asymptomatic infected individuals-i.e. those who never develop symptoms) can transmit sars-cov- [ , ] . for that reason, negative serial intervals may occur when an infected host displays covid- symptoms before the person who infected them [ , ] . although methods for estimating r(t) with negative serial intervals exist [ , ] , their inclusion in publicly available software for estimating r(t) should be a priority. increasing the accuracy of estimates of r(t), and supplementing these estimates with other quantities (e.g. estimated epidemic growth rates [ ] ), is of clear importance. as lockdowns are relaxed, this will permit a fast determination of whether or not removed interventions are leading to a surge in cases. (b) what is the herd immunity threshold and when might we reach it? herd immunity refers to the accumulation of sufficient immunity in a population through infection and/or vaccination to prevent further substantial outbreaks. it is a major factor in determining exit strategies, but data are still very limited. dynamically, the threshold at which herd immunity is achieved is the point at which r(t) ( § a) falls below one for an otherwise uncontrolled epidemic, resulting in a negative epidemic growth rate. 
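as a rough illustration under the classical homogeneous-mixing assumption (the basic reproduction number used below is a round value chosen only for the example, not an estimate for this virus):

i^{*} = 1 - \frac{1}{r_0}, \qquad r_0 = 3 \;\Rightarrow\; i^{*} = 1 - \tfrac{1}{3} \approx 0.67,

so about two-thirds of the population would need to be immune for r(t) to stay below one with no interventions in place; as discussed in what follows, heterogeneity in contacts and in how immunity is acquired can move this figure.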
however, reaching the herd immunity threshold does not mean that the epidemic is over or that there is no risk of further infections. great care must be taken in communicating this concept to the public, to ensure continued adherence to public health measures. crucially, whether immunity is gained naturally through infection or through random or targeted vaccination affects the herd immunity threshold, which also depends critically on the immunological characteristics of the pathogen. since sars-cov- is a new virus, its immunological characteristics-notably the duration and extent to which prior infection confers protection against future infection, and how these vary across the populationare currently unknown [ ] . lockdown measures have impacted contact structures and hence the accumulation of immunity in the population, and are likely to have led to significant heterogeneity in acquired immunity (e.g. by age, location, workplace). knowing the extent and distribution of immunity in the population will help guide exit strategies. as interventions are lifted, whether or not r(t) remains below one depends on the current level of immunity in the population as well as the specific exit strategy followed. a simple illustration is to treat r(t) as a deflation of the original (basic) reproduction number (r , which is assumed to be greater than one): where i(t) is the immunity level in the community at time t and p(t) is the overall reduction factor from the control measures that are in place. if i(t) . À =r , then r(t) remains below one even when all interventions are lifted: herd immunity is achieved. however, recent results [ , ] show that, for heterogeneous populations, herd immunity occurs at a lower immunity level than À =r . the threshold À =r assumes random vaccination, with immunity distributed uniformly in the community. when immunity is obtained from disease exposure, the more socially active individuals in the population are over-represented in cases from the early stages of the epidemic. as a result, the virus preferentially infects individuals with higher numbers of contacts, thereby acting like a well-targeted vaccine. this reduces the herd immunity threshold. however, the extent to which heterogeneity in behaviour lowers the threshold for covid- is currently unknown. we highlight three key challenges for determining the herd immunity threshold for covid- , and hence for understanding the impact of implementing or lifting control measures in different populations. first, most of the quantities for calculating the threshold are not known precisely and require careful investigation. for example, determining the immunity level royalsocietypublishing.org/journal/rspb proc. r. soc. b : in a community is far from trivial for a number of reasons: antibody tests may have variable sensitivity and specificity; it is currently unclear whether or not individuals with mild or no symptoms acquire immunity or test seropositive; the duration of immunity is unknown. second, estimation of r , despite receiving significant attention at the start of the pandemic, still needs to be refined within and between countries as issues with early case reports come to light. third, as discussed in § , sars-cov- does not spread uniformly through populations [ ] . 
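written out in the notation used here, the deflation relation described earlier in this subsection is

r(t) = \bigl(1 - i(t)\bigr)\bigl(1 - p(t)\bigr)\,r_0 ,

and setting p(t) = 0, corresponding to all interventions being lifted, recovers the condition quoted above: r(t) remains below one once i(t) exceeds 1 - 1/r_0, under the classical assumption of uniformly distributed immunity; the heterogeneous-population results cited above imply a lower threshold.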
an improved understanding of the main transmission routes, and which communities are most influential, will help to determine how much lower diseaseinduced herd immunity is compared to the classical threshold to summarize, it is vital to obtain more accurate estimates of the current immunity levels in different countries and regions, and to understand how population heterogeneity affects transmission and the accumulation of immunity. quantitative information about current and past infections are key inputs to formulate exit strategies, monitor the progression of epidemics and identify social and demographic sources of transmission heterogeneities. seroprevalence surveys provide a direct way to estimate the fraction of the population that has been exposed to the virus but has not been detected by regular surveillance mechanisms [ ] . given the possibility of mild or asymptomatic infections, which are not typically included in laboratory-confirmed cases, seroprevalence surveys could be particularly useful for tracking the covid- pandemic [ ] . contacts between pathogens and hosts that elicit an immune response can be revealed by the presence of antibodies. typically, a rising concentration of immunoglobulin m (igm) precedes an increase in the concentration of immunoglobulin g (igg). however, for infections by sars-cov- , there is increasing evidence that igg and igm appear concurrently [ ] . most serological assays used for understanding viral transmission measure igg. interpretation of a positive result depends on detailed knowledge of immune response dynamics and its epidemiological correspondence to the developmental stage of the pathogen, for example, the presence of virus shedding [ , ] . serological surveys are common practice in infectious disease epidemiology and have been used to estimate the prevalence of carriers of antibodies, force of infection and reproduction numbers [ ] , and in certain circumstances (e.g. for measles) to infer population immunity to a pathogen [ ] . unfortunately, a single serological survey only provides information about the number of individuals who are seropositive at the time of the survey (as well as information about the individuals tested, such as their ages [ ] ). although information about temporal changes in infections can be obtained by conducting multiple surveys longitudinally [ , ] , the precise timings of infections remain unknown. available tests vary in sensitivity and specificity, which can impact the accuracy of model predictions if seropositivity is used to assess the proportion of individuals protected from infection or disease. propagation of uncertainty due to the sensitivity and specificity of the testing procedures and epidemiological interpretation of the immune response are areas that require attention. the possible presence of immunologically silent individuals, as implied by studies of covid- showing that - % of symptomatically infected people have few or no detectable antibodies [ ] , adds to the known sources of uncertainty. many compartmental modelling studies have used data on deaths as the main reliable dataset for model fitting. the extent to which seroprevalence data could provide an additional useful input for model calibration, and help in formulating exit strategies, has yet to be ascertained. with the caveats above, one-off or regular assessments of population seroprevalence could be helpful in understanding sars-cov- transmission in different locations. (d) is global eradication of sars-cov- a realistic possibility? 
when r is greater than one, an emerging outbreak will either grow to infect a substantial proportion of the population or become extinct before it is able to do so [ ] [ ] [ ] [ ] [ ] . if instead r is less than one, the outbreak will almost certainly become extinct before a substantial proportion of the population is infected. if new susceptible individuals are introduced into the population (for example, new susceptible individuals are born), it is possible that the disease will persist after its first wave and become endemic [ ] . these theoretical results can be extended to populations with household and network structure [ , ] and scenarios in which r is very close to one [ ] . epidemiological theory and data from different diseases indicate that extinction can be a slow process, often involving a long 'tail' of cases with significant random fluctuations (electronic supplementary material, figure s ). long epidemic tails can be driven by spatial heterogeneities, such as differences in weather in different countries (potentially allowing an outbreak to persist by surviving in different locations at different times of year) and varying access to treatment in different locations. regions or countries that eradicate sars-cov- successfully might experience reimportations from elsewhere [ , ] , for example, the reimportation of the virus to new zealand from the uk in june . at the global scale, smallpox is the only previously endemic human disease to have been eradicated, and extinction took many decades of vaccination. the prevalence and incidence of polio and measles have been reduced substantially through vaccination but both diseases persist. the foot and mouth disease outbreak in the uk and the sars pandemic were new epidemics that were driven extinct without vaccination before they became endemic, but both exhibited long tails before eradication was achieved. the - ebola epidemic in west africa was eliminated (with vaccination at the end of the epidemic [ ] ), but eradication took some time with flare ups occurring in different countries [ , ] . past experience, therefore, raises the possibility that sars-cov- may not be driven to complete extinction in the near future, even if a vaccine is developed and vaccination campaigns are implemented. as exemplified by the ebola outbreak in the democratic republic of the congo that has only recently been declared over [ ] , there is an additional challenge of assessing whether the virus really is extinct rather than persisting in individuals who do not report disease [ ] . sars-cov- could become endemic, persisting in populations with limited access to healthcare or circulating in seasonal outbreaks. appropriate royalsocietypublishing.org/journal/rspb proc. r. soc. b : communication of these scenarios to the public and policymakers-particularly the possibility that sars-cov- may never be eradicated-is essential. (a) how much resolution is needed when modelling human heterogeneities? a common challenge faced by epidemic modellers is the tension between making models more complex (and possibly, therefore, seeming more realistic to stakeholders) and maintaining simplicity (for scientific parsimony when data are sparse and for expediency when predictions are required at short notice) [ ] . how to strike the correct balance is not a settled question, especially given the increasing amount of available data on human demography and behaviour. 
indeed, outputs of multiple models with different levels of complexity can provide useful and complementary information. many sources of heterogeneity between individuals (and between populations) exist, including the strong skew of severe covid- outcomes towards the elderly and individuals from specific groups. we focus on two sources of heterogeneity in human populations that must be considered when modelling exit strategies: spatial contact structure and health vulnerabilities. there has been considerable success in modelling local contact structure, both in terms of spatial heterogeneity (distinguishing local and long-distance contacts) and in local mixing structures such as households and workplaces. however, challenges include tracking transmission and assessing changes when contact networks are altered. in spatial models with only a small number of near-neighbour contacts, the number of new infections grows slowly; each generation of infected individuals is only slightly larger than the previous one. as a result, in those models, r(t) cannot significantly exceed its threshold value of one [ ] . by contrast, models accounting for transmission within closely interacting groups explicitly contain a mechanism that has a multiplier effect on the value of r(t) [ ] . another challenge is the spatio-temporal structure of human populations: the spatial distribution of individuals is important, but longdistance contacts make populations more connected than in simple percolation-type spatial models [ ] . clustering and pair approximation models can capture some aspects of spatial heterogeneities [ ] , which can result in exponential rather than linear growth in case numbers [ ] . while models can include almost any kind of spatial stratification, ensuring that model outputs are meaningful for exit strategy planning relies on calibration with data. this brings in challenges of merging multiple data types with different stratification levels. for example, case notification data may be aggregated at a regional level within a country, while mobility data from past surveys might be available at finer scales within regions. another challenge is to determine the appropriate scale at which to introduce or lift interventions. although measures are usually directed at whole populations within relevant administrative units (country-wide or smaller), more effective interventions and exit strategies may target specific parts of the population [ ] . here, modelling can be helpful to account for operational costs and imperfect implementation that will offset expected epidemiological gains. the structure of host vulnerability to disease is generally reported via risk factors, including age, sex and ethnicity [ , ] . from a modelling perspective, a number of open questions exist. to what extent does heterogeneous vulnerability at an individual level affect the impact of exit strategies beyond the reporting of potential outcomes? where host vulnerability is an issue, is it necessary to account for considerations other than reported risk factors, as these may be proxies for underlying causes? once communicated to the public, modelling results could create behavioural feedback that might help or hinder exit strategies; some sensitivity analyses would be useful. as with the questions around spatial heterogeneity, modelling variations in host vulnerability could improve proposed exit strategies, and modelling can be used to explore how these are targeted and communicated [ ] . 
finally, heterogeneities in space and vulnerabilities may interact; modelling these may reveal surprises that can be explored further. (b) what are the roles of networks and households in sars-cov- transmission? npis reduce the opportunity for transmission by breaking up contact networks (closing workplaces and schools, preventing large gatherings), reducing the chance of transmission where links cannot be broken (wearing masks, sneeze barriers) and identifying infected individuals (temperature checks [ ] , diagnostic testing [ ] ). network models [ , ] aim to split pathogen transmission into opportunity (number of contacts) and transmission probability, using data that can be measured directly (through devices such as mobility tracking and contact diaries) and indirectly (through traffic flow and co-occurrence studies). this brings new issues: for example, are observed networks missing key transmission routes, such as indirect contact via contaminated surfaces, or including contacts that are low risk [ ] ? how we measure and interpret contact networks depends on the geographical and social scales of interest (e.g. wider community spread or closed populations such as prisons and care homes; or subpopulations such as workplaces and schools) and the timescales over which the networks are used to understand or predict transmission. in reality, individuals belong to households, children attend schools and adults mix in workplaces as well as in social contexts. this has led to the development of household models [ , [ ] [ ] [ ] [ ] , multilayer networks [ ] , bipartite networks [ , ] and networks that are geographically and socially embedded to reflect location and travel habits [ ] . these tools can play a key role in understanding and monitoring transmission, and exploring scenarios, at the point of exiting a lockdown: in particular, they can inform whether or not, and how quickly, households or local networks merge to form larger and possibly denser contact networks in which local outbreaks can emerge. regional variations and socio-economic factors can also be explored. contact tracing, followed by isolation or treatment of infected contacts, is a well-established method of disease control. the structure of the contact network is important in determining whether or not contact tracing will be successful. for example, contact tracing in clustered networks is known to be most effective [ , ] , since an infected contact can be royalsocietypublishing.org/journal/rspb proc. r. soc. b : traced from multiple different sources. knowledge of the contact network enhances understanding of the correlation structure that emerges as a result of the epidemic. the first wave of an epidemic will typically infect many of the highly connected nodes and will move slowly to less connected parts of the network, leaving behind islands of susceptible and recovered individuals. this can lead to a correlated structure of susceptible and recovered nodes that may make the networks less vulnerable to later epidemic waves [ ] , and has implications for herd immunity ( § b). in heterogeneous populations, relatively few very wellconnected people can be major hubs for transmission. such individuals are often referred to as super-spreaders [ , ] and some theoretical approaches to controlling epidemics are based on targeting them [ ] . 
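a small simulation can make the role of contact heterogeneity concrete. the sketch below runs a stochastic sir epidemic on two synthetic contact networks with the same mean number of contacts per person, one homogeneous and one in which a minority of highly connected individuals act as hubs; the per-contact transmission probability, recovery rate, degree sequences and the stub-matching network construction are all illustrative assumptions rather than measured quantities.

import random

def simulate_sir_on_network(adj, beta=0.05, gamma=0.2, seed_node=0, rng=None):
    """discrete-time stochastic sir epidemic on a contact network given as an adjacency dict.

    beta  : per-contact, per-day transmission probability (assumed)
    gamma : daily recovery probability (assumed)
    """
    rng = rng or random.Random(1)
    status = {node: "s" for node in adj}
    status[seed_node] = "i"
    infected = {seed_node}
    while infected:
        new_infected, recovered = set(), set()
        for node in infected:
            for nbr in adj[node]:
                if status[nbr] == "s" and rng.random() < beta:
                    new_infected.add(nbr)
            if rng.random() < gamma:
                recovered.add(node)
        for node in new_infected:
            status[node] = "i"
        for node in recovered:
            status[node] = "r"
        infected = (infected | new_infected) - recovered
    return sum(1 for state in status.values() if state == "r")  # final outbreak size

def random_network_with_degrees(degrees, rng):
    """pair contact 'stubs' at random to build a network with (approximately) the given degrees."""
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    adj = {node: set() for node in range(len(degrees))}
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

build_rng = random.Random(42)
n = 2000
homogeneous = [8] * n                                          # everyone has 8 contacts
heterogeneous = [44 if i % 10 == 0 else 4 for i in range(n)]   # same mean (8), much higher variance
for name, degrees in [("homogeneous", homogeneous), ("heterogeneous", heterogeneous)]:
    sizes = [simulate_sir_on_network(random_network_with_degrees(degrees, build_rng),
                                     rng=random.Random(run)) for run in range(5)]
    print(name, "final sizes over 5 runs:", sizes)

comparing final outbreak sizes across repeated runs shows how the distribution of contacts, and not only their average number, shapes whether outbreaks take off and how far they spread.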
however, particularly for respiratory diseases, whether specific individuals can be classified as potential super-spreaders, or instead whether any infected individual has the potential to generate super-spreading events, is debated [ , , ] . as control policies are gradually lifted, the disrupted contact network will start to form again. understanding how proxies for social networks (which can be measured in near real time using mobility data, electronic sensors or trackers) relate to transmission requires careful consideration. using observed contacts to predict virus spread might be successful if these quantities are heavily correlated, but one aim of npis should be at least a partial decoupling of the two, so that society can reopen but transmission remains controlled. currently, a key empirical and theoretical challenge is to understand how households are connected and how this is affected by school opening ( § c). an important area for further research is to improve our understanding of the role of within-household transmission in the covid- pandemic. in particular, do sustained infection chains within households lead to amplification of infection rates between households despite lockdowns aimed at minimizing between-household transmission? even for well-studied household models, development of methods accommodating time-varying parameters such as variable adherence to household-based policies and/or compensatory behaviour would be valuable. it would be useful to compare interventions and de-escalation procedures in different countries to gain insight into: regional variations in contact and transmission networks; the role of different household structures in transmission and the severity of outcomes (accounting for different household sizes and agestructures); the cost-effectiveness of different policies, such as household-based isolation and quarantine in the uk compared to out-of-household quarantine in australia and hong kong. first few x (ffx) studies [ , ] , now adopted in several countries, provide the opportunity not only to improve our understanding of critical epidemiological characteristics (such as incubation periods, generation intervals and the roles of asymptomatic and presymptomatic transmission) but also to make many of these comparisons. a widely implemented early intervention was school closure, which is frequently used during influenza pandemics [ , ] . further, playgrounds were closed and social distancing has kept children separated. however, the role of children in sars-cov- transmission is unclear. early signs from wuhan (china), echoed elsewhere, showed many fewer cases in under s than expected. there are three aspects of the role of children in transmission: (i) susceptibility; (ii) infectiousness once infected; and (iii) propensity to develop disease if infected [ , ] . evidence for age-dependent susceptibility and infectiousness is mixed, with infectiousness the more difficult to quantify. however, evidence is emerging of lower susceptibility to infection in children compared to adults [ ] , although the mechanism underlying this is unknown and it may not be generalizable to all settings. once infected, children appear to have a milder course of infection, and it has been suggested that children have a higher probability of a fully subclinical course of infection. reopening schools is of clear importance both in ensuring equal access to education and enabling carers to return to work. 
however, the transmission risk within schools and the potential impact on community transmission needs to be understood so that policymakers can balance the potential benefits and harms. as schools begin to reopen, there are major knowledge gaps that prevent clear answers. the most pressing question is the extent to which school restarting will affect population-level transmission, characterized by r(t) ( § a). clearer quantification of the role of children could have come from analysing the effects of school closures in different countries in february and march, but closures generally coincided with other interventions and so it has proved difficult to unpick the effects of individual measures [ ] . almost all schools in sweden stayed open to under- s (with the exception of one school that closed for two weeks [ ] ), and schools in some other countries are beginning to reopen with social distancing measures in place, providing a potential opportunity to understand within-school transmission more clearly. models can also inform the design of studies to generate the data required to answer key questions. the effect of opening schools on r(t) also depends on other changes in the community. children, teachers and support staff are members of households; lifting restrictions may affect all members. modelling school reopening must account for all changes in contacts of household members [ ] , noting that the impact on r(t) may depend on the other interventions in place at that time. the relative risk of restarting different school years (or universities) does not affect the population r(t) straightforwardly, since older children tend to live with adults who are older (compared to younger children), and households with older individuals are at greater risk of severe outcomes. thus, decisions about which age groups return to school first and how they are grouped at school must balance the risks of transmission between children, transmission to and between their teachers, and transmission to and within the households of the children and teachers. return to school affects the number of physical contacts of teachers and support staff. schools will not be the same environments as prior to lockdown, since physical distancing measures will be in place. these include smaller classes and changes in layout, plus increased hygiene measures. some children and teachers may be less likely to return to school because of underlying health conditions and if there is transmission within schools, there may be absenteeism following infection. models must, therefore, consider the different effects on transmission of pre-and post-lockdown school royalsocietypublishing.org/journal/rspb proc. r. soc. b : environments. post-lockdown, with social distancing in place in the wider community, reopening schools could link subcommunities of the population together, and models can be used to estimate the wider effects on population transmission as well as within schools. these estimates are likely to play a central role in decisions surrounding when and how to reopen schools. (d) the pandemic is social: how can we model that? while the effects of population structure and heterogeneities can be approximated in standard compartmental epidemiological models [ , , ] , such models can become highly complex and cumbersome to specify and solve as more heterogeneities are introduced. an alternative approach is agent-based modelling. 
agent-based models (abm) allow complex systems such as societies to be represented, using virtual agents programmed to have behavioural and individual characteristics (age, sex, ethnicity, income, employment status, etc.) as well as the capacity to interact with other agents [ ] . in addition, abm can include societal-level factors such as the influence of social media, regulations and laws, and community norms. in more sophisticated abm, agents can anticipate and react to scenarios, and learn by trial and error or by imitation. abm can represent systems in which there are feedbacks, tipping points, the emergence of higher-level properties from the actions of individual agents, adaptation and multiple scales of organization-all features of the covid- pandemic and societal reactions to it. while abm arise from a different tradition, they can incorporate the insights of compartmental models; for example, agents must transition through disease states (or compartments) such that the mean transition rates correspond to those in compartmental models. however, building an abm that represents a population on a national scale is a huge challenge and is unlikely be accomplished in a timescale useful for the current pandemic. abm often include many parameters, leading to challenges of model parametrization and a requirement for careful uncertainty quantification and sensitivity analyses to different inputs. on the other hand, useful abm do not have to be all-encompassing. there are already several models that illustrate the effects of policies such as social distancing on small simulated populations. these models can be very helpful as 'thought experiments' to identify the potential effects of candidate policies such as school re-opening and restrictions on long-distance travel, as well as the consequences of non-compliance with government edicts. there are two areas where long-term action should be taken. first, more data about people's ordinary behaviour are required: what individuals do each day (through timeuse diaries), whom they meet ( possibly through mobile phone data, if consent can be obtained) and how they understand and act on government regulation, social media influences and broadcast information [ ] . second, a large, modular abm should be built that represents heterogeneities in populations and that is properly calibrated as a social 'digital twin' of our own society, with which we can carry out virtual policy experiments. had these developments occurred before, they would have been useful currently. as a result, if these are addressed now, they will aid the planning of future exit strategies. (a) what are the additional challenges of data-limited settings? in most countries, criteria for ending covid- lockdowns rely on tracking trends in numbers of confirmed cases and deaths, and assessments of transmissibility ( § a). this section focuses on the relaxation of interventions in lmics, although many issues apply everywhere. perhaps surprisingly, concerns relating to data availability and reliability (e.g. lack of clarity about sampling frames) remain worldwide. other difficulties have also been experienced in many countries throughout the pandemic (e.g. shortages of vital supplies, perhaps due in developed countries to previous emphasis on healthcare system efficiency rather than pandemic preparedness [ ] ). data about the covid- pandemic and about the general population and context can be unreliable or lacking globally. 
however, due to limited healthcare access and utilization, there can be fewer opportunities for diagnosis and subsequent confirmation of cases in lmics compared to other settings, unless there are active programmes [ ] . distrust can make monitoring programmes difficult, and complicate control activities like test-trace-isolate campaigns [ , ] . other options for monitoring-such as assessing excess disease from general reporting of acute respiratory infections or influenza-like illness-require historical baselines that may not exist [ , ] . in general, while many lmics will have a well-served fraction of the population, dense peri-urban and informal settlements are typically outside that population and may rapidly become a primary concern for transmission [ ] . since confirmed case numbers in these populations are unlikely to provide an accurate representation of the underlying epidemic, reliance on alternative data such as clinically diagnosed cases may be necessary to understand the epidemic trajectory. some tools for rapid assessment of mortality in countries where the numbers of covid- -related deaths are hard to track are starting to become available [ ] . in settings where additional data collection is not affordable, models may provide a clearer picture by incorporating available metadata, such as testing and reporting rates through time, sample backlogs and suspected covid- cases based on syndromic surveillance. by identifying the most informative data, modelling could encourage countries to share available data more widely. for example, burial reports and death certificates may be available, and these data can provide information on the demographics that influence the infection fatality rate. these can in turn reveal potential covid- deaths classified as other causes and hence missing from covid- attributed death notifications. in addition to the challenges in understanding the pandemic in these settings, metrics on health system capacity (including resources such as beds and ventilators), as needed to set targets for control, are often poorly documented [ ] . furthermore, the economic hardships and competing health priorities in low-resource settings change the objectives of lifting restrictions-for example, hunger due to loss of jobs and changes in access to routine healthcare (e.g. hiv services and childhood vaccinations) as a result of lockdown have the potential to cost many lives in themselves, both in the short and long term [ , ] . this must be accounted for when deciding how to relax covid- interventions. royalsocietypublishing.org/journal/rspb proc. r. soc. b : we have identified three key challenges for epidemic modellers to help guide exit strategies in data-limited settings: (i) explore policy responses that are robust to missing information; (ii) conduct value-of-information analyses to prioritize additional data collection; and (iii) develop methods that use metadata to interpret epidemiological patterns. in general, supporting lmics calls for creativity in the data that are used to parametrize models and in the response activities that are undertaken. some lmics have managed the covid- pandemic successfully so far (e.g. vietnam, as well as trinidad and tobago [ ] ). however, additional support in lmics is required and warrants special attention. if interventions are relaxed too soon, fragile healthcare systems may be overwhelmed. if instead they are relaxed too late, socio-economic consequences can be particularly severe. 
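as one example of working with whatever data are available, the sketch below back-calculates the rough scale of infections implied by death or burial counts under an assumed infection fatality ratio and infection-to-death delay, and compares the result with confirmed cases to give a crude ascertainment fraction. every number in it (the weekly counts, the infection fatality ratio of 0.5% and the three-week delay) is a placeholder assumption, and the calculation ignores reporting delays for cases, demographic structure and uncertainty in the infection fatality ratio.

import numpy as np

def infections_implied_by_deaths(weekly_deaths, ifr, delay_weeks=3):
    """rough back-calculation of weekly infections from death (or burial) counts.

    weekly_deaths : deaths (or excess burials) attributed to the epidemic, per week
    ifr           : assumed infection fatality ratio
    delay_weeks   : assumed delay from infection to death, in whole weeks
    """
    deaths = np.asarray(weekly_deaths, dtype=float)
    implied = deaths / ifr
    # shift back in time: infections in week t give rise to deaths in week t + delay_weeks
    return np.concatenate([implied[delay_weeks:], np.full(delay_weeks, np.nan)])

# placeholder numbers only: weekly burial-register counts and hypothetical confirmed cases
weekly_deaths = [2, 3, 5, 9, 14, 22, 30, 41]
reported_cases = [40, 60, 90, 150, 230, 350, 500, 700]
implied = infections_implied_by_deaths(weekly_deaths, ifr=0.005)  # ifr of 0.5% is an assumption
ascertainment = np.array(reported_cases) / implied  # crude fraction of infections that were confirmed
print(np.round(implied), np.round(ascertainment, 3))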
(b) which data should be collected as countries emerge from lockdown, and why? identifying the effects of the different components of lockdown is important to understand how-and in which order-interventions should be released. the impact of previous measures must be understood both to inform policy in real time and to ensure that lessons can be learnt. all models require information to make their predictions relevant. data from pcr tests for the presence of active virus and serological tests for antibodies, together with data on covid- -related deaths, are freely available via a number of internet sites (e.g. [ ] ). however, metadata associated with testing protocols (e.g. reason for testing, type of test, breakdowns by age and underlying health conditions) and the definition of covid- -related death, which are needed to quantify sources of potential bias and parametrize models correctly, are often unavailable. data from individuals likely to have been exposed to the virus (e.g. within households of known infected individuals), but who may or may not have contracted it themselves, are also useful for model parametrization [ ] . new sources of data range from tracking data from mobile phones [ ] to social media surveys [ ] and details of interactions with public health providers [ ] . although potentially valuable, these data sources bring with them biases that are not always understood. these types of data are also often subject to data protection and/or costly fees, meaning that they are not readily available to all scientists. mixing patterns by age were reasonably well-characterized before the current pandemic [ , ] ( particularly for adults of different ages) and have been used extensively in existing models. however, there are gaps in these data and uncertainty in the impacts that different interventions have had on mixing. predictive models for policy tend to make broad assumptions about the effects of elements of social distancing [ ] , although results of studies that attempt to estimate effects in a more data-driven way are beginning to emerge [ ] . the future success of modelling to understand when controls should be relaxed or tightened depends critically on whether, and how accurately as well as how quickly, the effects of different elements of lockdown can be parametrized. given the many differences in lockdown implementation between countries, cross-country comparisons offer an opportunity to estimate the effects on transmission of each component of lockdown [ ] . however, there are many challenges in comparing sars-cov- dynamics in different countries. alongside variability in the timing, type and impact of interventions, the numbers of importations from elsewhere will vary [ , ] . underlying differences in mixing, behavioural changes in response to the pandemic, household structures, occupations and distributions of ages and comorbidities are likely to be important but uncertain drivers of transmission patterns. a current research target is to understand the role of weather and climate in sars-cov- transmission and severity [ ] . many analyses across and within countries highlight potential correlations between environmental variables and transmission [ ] [ ] [ ] [ ] [ ] [ ] , although sometimes by applying ecological niche modelling frameworks that may be ill-suited for modelling a rapidly spreading pathogen [ ] [ ] [ ] . 
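as a sketch of the kind of association analysis involved (and not a substitute for the more careful approaches discussed here), the following fits a linear model of log r(t) against temperature and humidity with location fixed effects, using entirely synthetic data so that the recovered coefficients can be checked against the values used to generate them; in real analyses, confounding by interventions, behaviour and season is the central difficulty.

import numpy as np

rng = np.random.default_rng(3)

# fully synthetic illustration: weekly log r(t) 'estimates' for several locations, generated with
# known temperature and humidity effects so the regression output can be checked against them
n_locations, n_weeks = 20, 30
temp = rng.uniform(0, 30, size=(n_locations, n_weeks))        # degrees celsius
humidity = rng.uniform(20, 90, size=(n_locations, n_weeks))   # per cent relative humidity
location_effect = rng.normal(0, 0.2, size=(n_locations, 1))   # stand-in for interventions, behaviour, etc.
log_rt = 0.4 - 0.01 * temp - 0.002 * humidity + location_effect + rng.normal(0, 0.05, size=temp.shape)

# design matrix: intercept, temperature, humidity and one dummy per location (fixed effects)
y = log_rt.ravel()
location_dummies = np.kron(np.eye(n_locations)[:, 1:], np.ones((n_weeks, 1)))
x = np.column_stack([np.ones_like(y), temp.ravel(), humidity.ravel(), location_dummies])
coef, *_ = np.linalg.lstsq(x, y, rcond=None)
print("recovered temperature and humidity coefficients:", np.round(coef[1:3], 4))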
assessments of the interactions between weather and viral transmissibility are facilitated by the availability of extensive datasets describing weather patterns, such as the european centre for medium-range weather forecasts era dataset [ ] and simulations of the community earth system model that can be used to estimate the past, present and future values of meteorological variables worldwide [ ] . temperature, humidity and precipitation are likely to affect the survival of sars-cov- outside the body, and prevailing weather conditions could, in theory, tip r(t) above or below one. however, the effects of these factors on transmission have not been established conclusively, and the impact of seasonality on short-or long-term sars-cov- dynamics is likely to depend on other factors including the timing and impact of interventions, and the dynamics of immunity [ , ] . it is hard to separate the effect of the weather on virus survival from other factors including behavioural changes in different seasons [ ] . the challenge of disentangling the impact of variations in weather on transmission from other epidemiological drivers in different locations is, therefore, a complex open problem. in seeking to understand and compare covid- data from different countries, there is a need to coordinate the design of epidemiological studies, involving longitudinal data collection and case-control studies. this will help enable models to track the progress of the epidemic and the impacts of control policies internationally. it will also allow more refined conclusions than those that follow from population data alone. countries with substantial epidemiological modelling expertise should support epidemiologists elsewhere with standardized protocols for collecting data and using models to inform policy. there is a need to share models to be used 'in the field'. collectively, these efforts will ensure that models are parametrized as realistically as possible for particular settings. in turn, as interventions are relaxed, this will allow us to detect the earliest possible reliable signatures of a resurgence in cases, leading to an unambiguous characterization of when it is necessary for interventions to be reintroduced. (c) how should model and parameter uncertainty be communicated? sars-cov- transmission models have played a crucial role in shaping policies in different countries, and their predictions have been a regular feature of media coverage of the pandemic [ , ] . understandably, both policymakers and journalists generally prefer single 'best guess' figures from models, rather than a range of plausible values. however, the ranges of outputs that modellers provide include important information about the variety of possible scenarios and guard royalsocietypublishing.org/journal/rspb proc. r. soc. b : against over-interpretation of model results. not displaying information about uncertainty can convey a false confidence in predictions. it is critical that modellers present uncertainty in a way that is understandable and useful for policymakers and the public [ ] . there are numerous and often inextricable ways in which uncertainty enters the modelling process. model assumptions inevitably vary according to judgements regarding which features are included [ , ] and which datasets are used to inform the model [ ] . within any model, ranges of parameter values can be considered to allow for uncertainty about clinical characteristics of covid- (e.g. the infectious period and case fatality rate) [ ] . 
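one minimal way to operationalise the sampling of parameter ranges is sketched below: a toy sir model is run many times with transmission and recovery rates drawn from assumed plausible ranges, and the spread of peak size and timing is reported as an interval rather than a single best guess. the model structure, parameter ranges, population size and initial conditions are deliberately simplistic placeholders.

import numpy as np

def sir_peak(beta, gamma, n=1_000_000, i0=100, days=365):
    """toy deterministic sir model with a daily time step; returns peak prevalence and its day."""
    s, i = float(n - i0), float(i0)
    peak, peak_day = i, 0
    for day in range(1, days + 1):
        new_inf = beta * s * i / n
        new_rec = gamma * i
        s, i = s - new_inf, i + new_inf - new_rec
        if i > peak:
            peak, peak_day = i, day
    return peak, peak_day

rng = np.random.default_rng(7)
n_samples = 500
# plausible ranges for the transmission and recovery rates (assumed for illustration only)
betas = rng.uniform(0.25, 0.45, n_samples)
gammas = rng.uniform(1 / 10, 1 / 5, n_samples)
results = np.array([sir_peak(b, g) for b, g in zip(betas, gammas)])
peaks, peak_days = results[:, 0], results[:, 1]
print("peak prevalence: median {:,.0f}, 90% interval [{:,.0f}, {:,.0f}]".format(
    np.median(peaks), *np.percentile(peaks, [5, 95])))
print("peak timing (days): median {:.0f}, 90% interval [{:.0f}, {:.0f}]".format(
    np.median(peak_days), *np.percentile(peak_days, [5, 95])))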
alternative initial conditions (e.g. numbers and locations of imported cases seeding national outbreaks, or levels of population susceptibility) can be considered. in modelling exit strategies, when surges in cases starting from small numbers may occur and where predictions will depend on characterizing epidemiological parameters as accurately as possible, stochastic models may be of particular importance. not all the uncertainty arising from such stochasticity will be reduced by collecting more data; it is inherent to the process. where models have been developed for similar purposes, formal methods of comparison can be applied, but in epidemiological modelling, models often have been developed to address different questions, possibly involving 'what-if?' scenarios, in which case only qualitative comparisons can be made. the ideal outcome is when different models generate similar conclusions, demonstrating robustness to the detailed assumptions. where there is a narrowly defined requirement, such as short-term predictions of cases and deaths, more tractable tools for comparing the outputs from different models in real time would be valuable. one possible approach is to assess the models' past predictive performance [ , ] . ensemble estimates, most commonly applied for forecasting disease trajectories, allow multiple models' predictions to be combined [ , ] . the assessment of past performance can then be used to weight models in the ensemble. such approaches typically lead to improved point and variance estimates. to deal with parameter uncertainty, a common approach is to perform sensitivity analyses in which model parameters are repeatedly sampled from a range of plausible values, and the resulting model predictions compared; both classical and bayesian statistical approaches can be employed [ ] [ ] [ ] . methods of uncertainty quantification provide a framework in which uncertainties in model structure, epidemiological parameters and data can be considered together. in practice, there is usually only a limited number of policies that can be implemented. an important question is often whether or not the optimal policy can be identified given the uncertainties we have described, and decision analyses can be helpful for this [ , ] . in summary, communication of uncertainty to policymakers and the general public is challenging. different levels of detail may be required for different audiences. there are many subtleties: for instance, almost any epidemic model can provide an acceptable fit to data in the early phase of an outbreak, since most models predict exponential growth. this can induce an artificial belief that the model must be based on sensible underlying assumptions, and the true uncertainty about such assumptions has vanished. clear presentation of data is critical. it is important not simply to present data on the numbers of cases, but also on the numbers of individuals who have been tested. clear statements of the individual values used to calculate quantities such as the case fatality rate are vital, so that studies can be interpreted and compared correctly [ , ] . going forwards, improved communication of uncertainty is essential as models are used to predict the effects of different exit strategies. we have highlighted ongoing challenges in modelling the covid- pandemic, and uncertainties faced devising lockdown exit strategies. 
it is important, however, to put these issues into context: at the start of , sars-cov- was unknown, and its pandemic potential only became apparent at the end of january. the speed with which the scientific and public health communities came together and the openness in sharing data, methods and analyses are unprecedented. at very short notice, epidemic modellers mobilized a substantial workforce-mostly on a voluntary basis-and state-of-the-art computational models. far from the rough-and-ready tools sometimes depicted in the media, the modelling effort deployed since january is a collective and multi-pronged effort benefitting from years of experience of epidemic modelling, combined with long-term engagement with public health agencies and policymakers. drawing on this collective expertise, the virtual workshop convened in mid-may by the isaac newton institute generated a clear overview of the steps needed to improve and validate the scientific advice to guide lockdown exit strategies. importantly, the roadmap outlined in this paper is meant to be feasible within the lifetime of the pandemic. infectious disease epidemiology does not have the luxury of waiting for all data to become available before models must be developed. as discussed here, the solution lies in using diverse and flexible modelling frameworks that can be revised and improved iteratively as more data become available. equally important is the ability to assess the data critically and bring together evidence from multiple fields: numbers of cases and deaths reported by regional or national authorities only represent a single source of data, and expert knowledge is even required to interpret these data correctly. in this spirit, our first recommendation is to improve estimates of key epidemiological parameters. this requires close collaboration between modellers and the individuals and organizations that collect epidemic data, so that the caveats and assumptions on each side are clearly presented and understood. that is a key message from the first section of this study, in which the relevance of theoretical concepts and model parameters in the real world was demonstrated: far from ignoring the complexity of the pandemic, models draw from different sources of expertise to make sense of imperfect observations. by acknowledging the simplifying assumptions of models, we can assess the models' relative impacts and validate or replace them as new evidence becomes available. our second recommendation is to seek to understand important sources of heterogeneity that appear to be driving the pandemic and its response to interventions. agent-based modelling represents one possible framework for modelling complex dynamics, but standard epidemic models can also be extended to include age groups or any other relevant strata in the population as well as spatial structure. network royalsocietypublishing.org/journal/rspb proc. r. soc. b : models provide computationally efficient approaches to capture different types of epidemiological and social interactions. importantly, many modelling frameworks provide avenues for collaboration with other fields, such as the social sciences. our third and final recommendation regards the need to focus on data requirements, particularly (although not exclusively) in resource-limited settings such as lmics. understanding the data required for accurate predictions in different countries requires close communication between modellers and governments, public health authorities and the general public. 
while this pandemic casts a light on social inequalities between and within countries, modellers have a crucial role to play in sharing knowledge and expertise with those who need it most. during the pandemic so far, countries that might be considered similar in many respects have often differed in their policies; either in the choice or the timing of restrictions imposed on their respective populations. models are important for drawing reliable inferences from global comparisons of the relative impacts of different interventions. all too often, national death tolls have been used for political purposes in the media, attributing the apparent success or failure of particular countries to specific policies without presenting any convincing evidence. modellers must work closely with policymakers, journalists and social scientists to improve the communication of rapidly changing scientific knowledge while conveying the multiple sources of uncertainty in a meaningful way. we are now moving into a stage of the covid- pandemic in which data collection and novel research to inform the modelling issues discussed here are both possible and essential for global health. these are international challenges that require an international collaborative response from diverse scientific communities, which we hope that this article will stimulate. this is of critical importance, not only to tackle this pandemic but also to improve the response to future epidemics of emerging infectious diseases. data accessibility. data sharing is not applicable to this manuscript as no new data were created or analysed in this study. the effect of control strategies to reduce social mixing on outcomes of the covid- epidemic in wuhan, china: a modelling study epidemiological models are important tools for guiding covid- interventions first-wave covid- transmissibility and severity in china outside hubei after control measures, and secondwave scenario planning: a modelling impact assessment how and when to end the covid- lockdown: an optimisation approach. front. public health , segmentation and shielding of the most vulnerable members of the population as elements of an exit strategy from covid- lockdown. medrxiv statistical estimation of the reproductive number from case notification data: a review report : estimating the number of infections and the impact of nonpharmaceutical interventions on covid- in european countries estimating the timevarying reproduction number of sars-cov- using national and subnational case counts impact assessment of nonpharmaceutical interventions against coronavirus disease and influenza in hong kong: an observational study practical considerations for measuring the effective reproductive number the construction and analysis of epidemic trees with reference to the uk foot-and-mouth outbreak mrc biostatistics unit covid- working group. nowcasting and forecasting report /journal/rspb proc. r. soc. b : . government office for science. 
government publishes latest r number our plan to rebuild: the uk government's covid- recovery strategy estimating in real time the efficacy of measures to control emerging communicable diseases different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures a new framework and software to estimate timevarying reproduction numbers during epidemics improved inference of time-varying reproduction numbers during infectious disease outbreaks assessing the impact of nonpharmaceutical interventions on sars-cov- transmission in switzerland the effective reproduction number as a prelude to statistical estimation of time-dependent epidemic trends birth-death skyline plot reveals temporal changes of epidemic spread in hiv and hepatitis c virus (hcv) modeling the growth and decline of pathogen effective population size provides insight into epidemic dynamics and drivers of antimicrobial resistance the epidemic behavior of the hepatitis c virus adaptive estimation for epidemic renewal and phylogenetic skyline models nowcasting pandemic influenza a/h n hospitalizations in the netherlands estimating the effects of nonpharmaceutical interventions on covid- in europe bayesian-based iterative method of image restoration an iterative technique for the rectification of observed distributions reconstructing influenza incidence by deconvolution of daily mortality time series working group, phe modelling cell. adjusting covid- deaths to account for reporting delay using mobility to estimate the transmission intensity of covid- in italy: a subnational analysis with future scenarios a note on generation times in epidemic models serial interval of sars-cov- was shortened over time by nonpharmaceutical interventions using information theory to optimise epidemic models for real-time prediction and estimation an exact method for quantifying the reliability of end-of-epidemic declarations in real time. medrxiv estimating temporal variation in transmission of covid- and adherence to social distancing measures in australia estimating reproduction numbers for adults and children from case data heterogeneities in the transmission of infectious agents: implications for the design of control programs estimating the overdispersion in covid- transmission using outbreak sizes outside china quantifying sars-cov- transmission suggests epidemic control with digital contact tracing time from symptom onset to hospitalisation of coronavirus disease (covid- ) cases: implications for the proportion of transmissions from infectors with few symptoms serial interval of covid- among publicly reported confirmed cases estimating the generation interval for coronavirus disease (covid- ) based on symptom onset data estimating the time interval between transmission generations when negative values occur in the serial interval data: using covid- as an example challenges in control of covid- : short doubling time and long delay to effect of interventions projecting the transmission dynamics of sars-cov- through the postpandemic period a mathematical model reveals the influence of population heterogeneity on herd immunity to sars-cov- individual variation in susceptibility or exposure to sars-cov- lowers the herd immunity threshold. 
medrxiv uk government office for national statistics covid- ) infection survey pilot: england use of serological surveys to generate key insights into the changing global landscape of infectious disease sars-cov- seroprevalence in covid- hotspots quantifying antibody kinetics and rna shedding during early-phase sars-cov- infection. medrxiv what policy makers need to know about covid- protective immunity interpreting diagnostic tests for sars-cov- seventy-five years of estimating the force of infection from current status data benefits and challenges in using seroprevalence data to inform models for measles and rubella elimination age-specific incidence and prevalence: a statistical perspective prevalence of sars-cov- in spain (ene-covid): a nationwide, populationbased seroepidemiological study viral kinetics and antibody responses in patients with covid- . medrxiv when does a minor outbreak become a major epidemic? linking the risk from invading pathogens to practical definitions of a major epidemic stochastic epidemic models and their statistical analysis novel coronavirus outbreak in wuhan, china, : intense surveillance is vital for preventing sustained transmission in new locations estimating the probability of a major outbreak from the timing of early cases: an indeterminate problem? plos one , e detecting presymptomatic infection is necessary to forecast major epidemics in the earliest stages of infectious disease outbreaks intervention to maximise the probability of epidemic fade-out epidemics with two levels of mixing the use of chain-binomials with a variable chance of infection for the analysis of intrahousehold epidemics extinction times in the subcritical stochastic sis logistic epidemic estimating covid- outbreak risk through air travel in press. is it safe to lift covid- travel bans? the newfoundland story efficacy and effectiveness of an rvsv-vectored vaccine expressing ebola surface glycoprotein: interim results from the guinea ring vaccination cluster-randomised trial rigorous surveillance is necessary for high confidence in endof-outbreak declarations for ebola and other infectious diseases sexual transmission and the probability of an end of the ebola virus disease epidemic world health organization. th ebola outbreak in the democratic republic of the congo declared over; vigilance against flare-ups and support for survivors must continue influencing public health policy with data-informed mathematical models of infectious diseases: recent developments and new challenges five challenges for spatial epidemic models the effects of local spatial structure on epidemiological invasions pair approximations for spatial structures? 
management of invading pathogens should be informed by epidemiology rather than administrative boundaries uk government office for national statistics clinical course and risk factors for mortality of adult inpatients with covid- in wuhan, china: a retrospective cohort study sars-cov- infection among travelers returning from wuhan the probability of detection of sars-cov- in saliva mathematics of epidemics on networks: from exact to approximate models epidemic processes in complex networks a novel field-based approach to validate the use of network models for disease spread between dairy herds analysis of a stochastic sir epidemic on a random network incorporating household structure reproduction numbers for epidemic models with households and other social structures ii: comparisons and implications for vaccination reproductive numbers, epidemic spread and control in a community of households reproduction numbers for epidemic models with households and other social structures. i. definition and calculation of r modeling the impact of social distancing testing contact tracing and household quarantine on second-wave scenarios of the covid- epidemic. medrxiv epidemics on random intersection graphs epidemics on random graphs with tunable clustering social networks with strong spatial embedding generate non-standard epidemic dynamics driven by higherorder clustering. biorxiv. royalsocietypublishing.org/journal/rspb disease contact tracing in random and clustered networks impact of delays on effectiveness of contact tracing strategies for covid- : a modelling study network frailty and the geometry of herd immunity spread of epidemic disease on networks random graph dynamics graphs with specified degree distributions, simple epidemics, and local vaccination strategies social encounter networks: collective properties and disease transmission social encounter networks: characterizing great britain the first few x (ffx) cases and contact investigation protocol for -novel coronavirus ( -ncov) infection. characterising pandemic severity and transmissibility from data collected during first few hundred studies closure of schools during an influenza pandemic estimating the impact of school closure on influenza transmission from sentinel data age-dependent effects in the transmission and control of covid- epidemics school closure and management practices during coronavirus outbreaks including covid- : a rapid systematic review susceptibility to and transmission of covid- amongst children and adolescents compared with adults: a systematic review and meta-analysis. medrxiv how sweden wasted a 'rare opportunity' to study coronavirus in schools stepping out of lockdown should start with school re-openings while maintaining distancing measures. insights from mixing matrices and mathematical models. 
medrxiv impact of selfimposed prevention measures and short-term government-imposed social distancing on mitigating and delaying a covid- epidemic: a modelling study agent-based models computational models that matter during a global pandemic outbreak: a call to action resilience in the face of uncertainty: early lessons from the covid- pandemic health security capacities in the context of covid- outbreak: an analysis of international health regulations annual report data from countries historical parallels, ebola virus disease and cholera: understanding community distrust and social violence with epidemics the ongoing ebola epidemic in the democratic republic of congo a review of the surveillance systems of influenza in selected countries in the tropical region the indepth network: filling vital gaps in global epidemiology local response in health emergencies: key considerations for addressing the covid- pandemic in informal urban settlements revealing the toll of covid- : a technical package for rapid mortality surveillance and epidemic response introducing the lancet global health commission on high-quality health systems in the sdg era the impact of covid- control measures on social contacts and transmission in kenyan informal settlements. medrxiv an appeal for practical social justice in the covid- global response in low-income and middle-income countries coronavirus government response tracker an interactive web-based dashboard to track covid- in real time contact intervals, survival analysis of epidemic data, and estimation of r mobile phone data for informing public health actions across the covid- pandemic life cycle early epidemiological analysis of the coronavirus disease outbreak based on crowdsourced data: a population-level observational study analysis of temporal trends in potential covid- cases reported through projecting social contact matrices in countries using contact surveys and demographic data contagion! the bbc four pandemic-the model behind the documentary report : impact of nonpharmaceutical interventions (npis) to reduce covid- mortality and healthcare demand inferring change points in the spread of covid- reveals the effectiveness of interventions preparedness and vulnerability of african countries against importations of covid- : a modelling study effects of environmental factors on severity and mortality of covid- . medrxiv the correlation between the spread of covid- infections and weather variables in chinese provinces and the impact of chinese government mitigation plans impact of meteorological factors on the covid- transmission: a multi-city study in china covid- transmission in mainland china is associated with temperature and humidity: a time-series analysis impact of weather on covid- pandemic in turkey temperature and precipitation associate with covid- new daily cases: a correlation study between weather and covid- pandemic in oslo correlation between weather and covid- pandemic in jakarta spread of sars-cov- coronavirus likely to be constrained by climate. medrxiv a global-scale ecological niche model to predict sars-cov- coronavirus infection rate species distribution models are inappropriate for covid- european centre for medium-range weather forecasts. the era dataset the community earth system model: a framework for collaborative research susceptible supply limits the role of climate in the early sars-cov- pandemic effects of temperature and humidity on the spread of covid- : a systematic review. 
medrxiv early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia four key challenges in infectious disease modelling using data from multiple sources how will country-based mitigation measures influence the course of the covid- epidemic? prequential data analysis accuracy of real-time multimodel ensemble forecasts for seasonal influenza in the harnessing multiple models for outbreak management control fast or control smart: when should invading pathogens be controlled? accurate quantification of uncertainty in epidemic parameter estimates and predictions using stochastic compartmental models fitting dynamic models to epidemic outbreaks with quantified uncertainty: a primer for parameter uncertainty, identifiability, and forecasts infectious disease pandemic planning and response: incorporating decision analysis improving the evidence base for decision making during a pandemic: the example of influenza a/h n potential biases in estimating absolute and relative case-fatality risks during outbreaks estimates of the severity of coronavirus disease : a model-based analysis acknowledgements. thanks to the isaac newton institute for mathematical sciences, cambridge (www.newton.ac.uk), for support during the virtual 'infectious dynamics of pandemics' programme. this work was undertaken in part as a contribution to the 'rapid assistance in modelling the pandemic' initiative coordinated by the royal society. thanks to sam abbott for helpful comments about the manuscript. key: cord- -cs s o y authors: costa-santos, c.; neves, a. l.; correia, r.; santos, p.; monteiro-soares, m.; freitas, a.; ribeiro-vaz, i.; henriques, t.; rodrigues, p. p.; costa-pereira, a.; pereira, a. m.; fonseca, j. title: covid- surveillance - a descriptive study on data quality issues date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: cs s o y background: high-quality data is crucial for guiding decision making and practicing evidence-based healthcare, especially if previous knowledge is lacking. nevertheless, data quality frailties have been exposed worldwide during the current covid- pandemic. focusing on a major portuguese surveillance dataset, our study aims to assess data quality issues and suggest possible solutions. methods: on april th , the portuguese directorate-general of health (dgs) made available a dataset (dgsapril) for researchers, upon request. on august th, an updated dataset (dgsaugust) was also obtained. the quality of data was assessed through analysis of data completeness and consistency between both datasets. results: dgsaugust has not followed the data format and variables as dgsapril and a significant number of missing data and inconsistencies were found (e.g. , cases from the dgsapril were apparently not included in dgsaugust). several variables also showed a low degree of completeness and/or changed their values from one dataset to another (e.g. the variable underlying conditions had more than half of cases showing different information between datasets). there were also significant inconsistencies between the number of cases and deaths due to covid- shown in dgsaugust and by the dgs reports publicly provided daily. conclusions: the low quality of covid- surveillance datasets limits its usability to inform good decisions and perform useful research. major improvements in surveillance datasets are therefore urgently needed - e.g. 
simplification of data entry processes, constant monitoring of data, and increased training and awareness of health care providers - as low data quality may lead to a deficient pandemic control. the availability of accurate data in an epidemic is crucial to guide public health measures and policies [ ] . during pandemics, making epidemiologic data openly available, in realtime, allows researchers with different backgrounds to use diverse analytical methods to build evidence [ , ] in a fast and efficient way. this evidence can then be used to support adequate decision-making which is one of the goals of epidemiological surveillance systems [ ] . to ensure that high-quality data are collected and stored, several factors are needed, including robust information systems that promote reliable data collection [ ] , adequate and clear methods for data collection and integration from different sources, as well as strategic data curation procedures. epidemiological surveillance systems need to be designed having data quality as a high priority and thus promoting, rather than relying on, users' efforts to ensure data quality [ ] . only timely, high-quality data can provide valid and useful evidence for decision making and pandemic management. on the contrary, using datasets without carefully examining the metadata and documentation that describes the overall context of data can be harmful [ ] . at the moment, producing these high-quality datasets within a pandemic is nearly impossible without a broad collaboration between health authorities, health professionals, and researchers from different fields. the urgency to produce scientific evidence to manage the covid- pandemic contributes to lower quality datasets that may jeopardise the validity of results, generating biased evidence. the potential consequences are suboptimal decision making or even not using data at all to drive decisions. methodological challenges associated with analysing covid- data during the pandemic, including access to high-quality health data, have been recognized [ ] and some data quality concerns were described [ ] . nevertheless, to our knowledge, there is no study performing a structured assessment of data quality issues from the datasets provided by national surveillance systems for research purposes during the covid- pandemic. although this is a worldwide concern, this study will use portuguese data as a case study. in early march, the first cases of covid- were diagnosed in portugal [ ] . the portuguese surveillance system for mandatory reporting of communicable diseases is named sinave (national system for epidemiological surveillance) and is in the dependence of the directorate-general of health (dgs). covid- is included in the list of mandatory communicable diseases to be notified through this system either by medical doctors (through sinave med) or laboratories (sinave lab). a covid- specific platform (trace covid- ) was created for the clinical management of covid- patients and contact tracing. however, data from both sinave and trace covid- are not integrated in the electronic health record (ehr). thus, healthcare professionals need to register similar data, several times, for the same suspect or confirmed case of covid- , increasing the burden of healthcare professionals and potentially leading to data entry errors and missing data. the sinave notification form includes a high number of variables, with few or no features to help data input. 
some examples include ) within general demographic characteristics, patient occupation is chosen from a drop-down list with hundreds of options and with no free text available; ) the questions regarding individual symptoms need to be individually filled using a -response option drop-down list, even for asymptomatic patients; ) in the presence of at least one comorbidity, specific questions on comorbidities need to be filled; and ) there are over questions to characterize clinical findings, disease severity, and use of healthcare resources, including details on hospital isolation. other examples of the suboptimal design are ) the inclusion of two questions on autopsy findings among symptoms and clinical signs, although no previous question ascertains whether the patient has died; ) the lack of a specific question on disease outcome (only hospital discharge date); ) the lack of validation rules, which allows, for example, a disease diagnosis prior to the birth date or a discharge before the date of hospital admission; and ) no mandatory data fields, allowing the user to proceed without completing any data. furthermore, a global assessment of disease severity is included with the options 'unknown', 'severe', 'moderate' and 'not applicable', without a readily available definition and without the possibility to classify the disease as mild. this unfriendly system may impair the quality of covid- surveillance data. the problems described have existed for a long time in sinave and they are usually solved by personal contact with the local health authorities; however, in the current covid- pandemic scenario, and due to the pressure of the huge number of new cases reported daily, this does not happen at the moment. since the beginning of the pandemic, several research groups in portugal stated their willingness to contribute by producing knowledge and improving data systems and data quality [ ] . researchers requested access to healthcare disaggregated data related to covid- , in order to produce timely scientific knowledge to help evidence-based decision-making during the pandemic. on april th , the dgs made available a dataset (dgsapril) collected by the sinave med, to be accessed by researchers upon request and after submission of a detailed research proposal and documented approval by an ethics committee [ ] . with the dgsapril dataset, dgs also made available the respective metadata [ ] . at least research groups received the data and started their dataset analyses. there is more than one possible data flow from the moment the data are introduced until the dataset is made available to researchers. figure is an example of the information flow from the data introduced by public health professionals until the analysis of the data. figure : example of one possible information flow from the moment the data are introduced until the dataset is made available to researchers. the ⊗ symbol means that data are not sent and therefore not present in the research database. the dashed line represents a cumbersome manual process that is often executed by public health professionals and that is very susceptible to errors.
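the following sketch illustrates the kind of record-level validation rules whose absence is described above (for example, rejecting a diagnosis date prior to the birth date, or a discharge before admission). the field names and the example record are hypothetical and do not reproduce the actual sinave form or schema.

```python
"""Illustrative record-level validation rules for a notification form.

A minimal sketch of the kind of checks discussed above. The field names and
the example record are hypothetical and do not reproduce the real SINAVE
schema.
"""
from datetime import date

def validate_case(rec):
    """Return a list of human-readable problems found in one case record."""
    problems = []

    if rec.get("birth_date") and rec.get("diagnosis_date"):
        if rec["diagnosis_date"] < rec["birth_date"]:
            problems.append("diagnosis date is before birth date")

    if rec.get("admission_date") and rec.get("discharge_date"):
        if rec["discharge_date"] < rec["admission_date"]:
            problems.append("discharge date is before admission date")

    if rec.get("sex") == "male" and rec.get("pregnant") is True:
        problems.append("male patient recorded as pregnant")

    # Assumed core fields that should never be left empty.
    for field in ("case_id", "birth_date", "diagnosis_date"):
        if not rec.get(field):
            problems.append(f"mandatory field '{field}' is empty")

    return problems

example = {
    "case_id": "C-0001",
    "sex": "male",
    "pregnant": True,
    "birth_date": date(1950, 5, 2),
    "diagnosis_date": date(2020, 3, 15),
    "admission_date": date(2020, 3, 20),
    "discharge_date": date(2020, 3, 18),
}
for issue in validate_case(example):
    print("invalid:", issue)
```

rules of this kind are cheap to run at data-entry time and again before each data release, and they catch exactly the categories of error reported later in this study.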
on august th , dgs sent an updated dataset (dgsaugust) to the research groups who had requested the first dataset, including the covid- cases already included in the initial dataset plus the new cases diagnosed during may and june . this updated database had an inconsistent manifest, including some variables presented in a different format (for example, instead of a variable with the outcome of the patient, the second dataset presented two dates: death and recovery date) or with different definitions (for example, the variable age was defined as the age at the time of covid- onset or as the age at the time of covid- notification, in the first and second datasets, respectively), which raised concerns regarding their use for valid research and the replication of the analysis made using the first version of the data. we aimed to assess data quality issues of covid- surveillance data and suggest solutions to overcome them, using the portuguese surveillance datasets as an example. the data provided by dgs included all covid- confirmed cases notified through the sinave med, thus excluding those only reported by laboratories (sinave lab). the dgsapril dataset was provided on april th and the updated one (dgsaugust) on august th . the available variables in both datasets are described in supplementary file . there was a variable named 'outcome', with the information on the outcome of the case, present in the dgsapril dataset that was not available in the dgsaugust dataset. on the other hand, there were also some variables (dead, recovery, diagnosis and discharge dates) present in the dgsaugust dataset that were not available in the dgsapril dataset. the quality of the data was assessed through the analysis of data completeness and consistency between dgsapril and dgsaugust. for the data completeness evaluation, missing information was classified as "system missings" when there was no information provided (blank cells) and as "coded as unknown" when the information "unknown" was coded. regarding consistency, both datasets were compared in order to evaluate whether the data quality increased with the update sent four months later. as many data entry errors could be avoided using an optimized information system, the potential data entry errors in dgsaugust were also described. the number of covid- cases and the number of deaths due to covid- were also compared to the daily public report by the portuguese directorate-general of health [ ] . we highlight that the daily numbers of cases and deaths reported publicly are not expected to coincide with the numbers obtained in the datasets made available to researchers, as these datasets included only the covid- cases notified through the sinave med (excluding those only reported by laboratories). however, the calculation of this difference is important to estimate the potential bias that the data of these (dgsapril and dgsaugust) datasets, provided by dgs to researchers, may have.
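a minimal sketch of the completeness and consistency checks described above is given below, using two small illustrative extracts: blank cells are counted as 'system missings', the literal value 'unknown' as 'coded as unknown', and shared cases are compared variable by variable after joining on the unique case identifier. the column names and toy values are hypothetical, not the real dgs variables.

```python
"""Illustrative completeness and consistency checks between two extracts.

A minimal sketch of the assessment described above. File contents and column
names are hypothetical.
"""
import pandas as pd

april = pd.DataFrame({
    "case_id": [1, 2, 3, 4],
    "hospitalization": ["yes", "unknown", None, "no"],
    "onset_date": ["2020-03-05", None, "2020-03-20", "unknown"],
})
august = pd.DataFrame({
    "case_id": [2, 3, 4, 5],
    "hospitalization": ["no", "yes", "no", "unknown"],
    "onset_date": [None, "2020-03-21", "unknown", "2020-06-01"],
})

def completeness(df):
    """Percentage of system missings and 'unknown' codes per variable."""
    out = {}
    for col in df.columns.drop("case_id"):
        system_missing = df[col].isna().mean() * 100
        coded_unknown = (df[col].astype("string").str.lower() == "unknown").sum() / len(df) * 100
        out[col] = {"system_missing_%": round(system_missing, 1),
                    "coded_unknown_%": round(float(coded_unknown), 1)}
    return pd.DataFrame(out).T

print("completeness of the second extract:")
print(completeness(august))

# Consistency: same case identifier, different recorded values.
merged = april.merge(august, on="case_id", suffixes=("_apr", "_aug"))
for col in ("hospitalization", "onset_date"):
    a, b = merged[f"{col}_apr"], merged[f"{col}_aug"]
    mismatch = ~((a == b) | (a.isna() & b.isna()))
    print(f"{col}: {mismatch.sum()} of {len(merged)} shared cases differ")
```

the same pattern scales to the full set of variables: one completeness table per extract, and one mismatch count per variable for the cases that share an identifier.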
this comparison is only possible in the dgsaugust dataset, as the variable date of diagnosis was not available in the dgsapril dataset. considering the , cases made available only in the dgsaugust and diagnosed before april th that, presumably, were not included in the dgsapril dataset, the majority ( %) were diagnosed in the two weeks immediately prior to april th (the date on which this database was made available). however, % were diagnosed more than two weeks before the dgsapril dataset was made available (figure ). figure : number of unique case identifiers presented in the datasets for covid- cases diagnosed from the start of the pandemic until april th (the date when the first database was made available) and after april th.

several variables showed a low degree of completeness. for example, two variables ("date of first positive laboratory result" and "case required care in an intensive care unit") had more than % of cases with missing information in the dgsapril dataset, coded as unknown or system missing. in the dgsaugust dataset, the variable 'case required care in an intensive care unit' reduced the proportion of incomplete information to % of system missings and no cases were coded as unknown. however, the variable 'date of first positive laboratory result' still had % system missings in the dgsaugust dataset. table provides detailed information about missing information for each available variable.

the consistency of the information for cases identified with the same unique case identifier in both datasets (n= , ) was further evaluated (figure ). table presents the number and percentage of cases with different information for each variable. part of the disagreement in the variable 'age' can be explained by its different definition in both datasets: in dgsapril it is the age at the time of covid- onset and, in dgsaugust, the age at the time of covid- notification. the variable 'hospitalization' had % of cases (n= ) with unmatched information (table ). one hundred and twenty-five cases were recorded as 'unknown if the case was hospitalized' in the dgsapril dataset and corrected to 'no hospitalization' in the dgsaugust. sixty-two cases were recorded as 'no hospitalization' and corrected to 'hospitalized' or 'unknown information' in the dgsapril and dgsaugust datasets, respectively. fifty-five cases were recorded as hospitalized patients and corrected to 'no hospitalization' or 'unknown information' in the dgsapril and dgsaugust datasets, respectively. only cases changed from 'unknown if the case was hospitalized' to 'hospitalization'. the variable 'date of disease onset' had % of cases (n= ) with unmatched information (table ). in , cases, information about the date of disease onset was provided only in dgsapril, and cases had dates in both datasets but the dates did not match. the variable 'date of the first positive laboratory result' did not match between the datasets in % of the cases (n= ): in cases there was a date available in both datasets but the dates did not match, in cases the date was available only in the dgsapril dataset, and in cases the date was available only in the dgsaugust dataset. the variable patient outcome (variable 'outcome') was not present in the dgsaugust dataset, which instead presents the variables 'date of recovery' and 'date of death' (not presented in dgsapril) (table ). in the dgsapril dataset, there were , cases coded as 'alive, recovered and cured', but only % of those (n= ) had a recovery date in the updated dataset (dgsaugust), which may be due to the lack of information on a specific date, despite knowing that the case result is alive, recovered and cured. in fact, patients recorded as 'alive, recovered and cured' in dgsapril did not have any date in the dgsaugust dataset. however, patients recorded as 'alive, recovered and cured' in dgsapril had a date of death in the dgsaugust dataset; seven of these were dates of death before april , which is incongruent. among the cases coded as 'died because of covid- ' in the dgsapril dataset, ( %) did not have a date of death in the second dataset.

regarding potential data entry errors, the age of one patient is probably wrong ( years old). there were male patients and an older woman ( years old) registered as pregnant. there was a wrong diagnosis date ( - - ), and patients had registered dates of diagnosis before the first official case of covid- was diagnosed in portugal. there were also two patients with a negative length of stay in hospital. the variable 'recovery date' had only three values even though it refers to a period of days: for , patients the date of recovery was recorded as 'april ', for , patients 'may ', and for the date of recovery recorded was 'may '.

table shows the number of covid- cases reported by the dgsaugust dataset and by the daily public report. the dgsaugust dataset included covid- cases diagnosed between march and june, , cases ( %) fewer than the daily public report provided by the portuguese directorate-general of health. however, when looking at data from march, the dgsaugust dataset reported more cases ( %) than the daily public report. in april, may and june the dgs dataset reported %, % and % fewer cases than the public report provided, respectively. table shows the number of deaths due to covid- reported by the dgsaugust dataset and by the daily public report. the dgsaugust dataset reported , deaths due to covid- until the end of june, cases ( %) fewer than the daily public report provided by the portuguese directorate-general of health. however, in march the dgsaugust dataset reported more deaths due to covid- ( %) than the daily public report. in april, may and june the dgs dataset reported %, % and % fewer cases than the public report provided, respectively.

the production of scientific evidence to help manage the covid- pandemic is an urgent need worldwide. however, if the quality of the datasets is low, the evidence produced may be inaccurate and, therefore, have limited applicability. this problem may be particularly critical when low-quality datasets provided by official organisations lead to the replication of biased conclusions in different studies. the problem of using datasets with suboptimal quality for research purposes during the covid- pandemic probably occurs in a large number of countries. this study, using the portuguese surveillance data, reports a high number of inconsistencies and incompleteness of data that may interfere with scientific conclusions. to date, we could identify three scientific papers reporting analyses of these data [ , , ] that may have been affected by the low quality of the datasets [ ] . table presents the data quality issues identified in the provided datasets and possible solutions (e.g. automatically code blank cells as system missing; simplify data entry processes, reusing the data already in the system; improve data interoperability; for the differences in the cases included, guarantee the same unique case identifier by recording it in the registry database and determine a core of mandatory variables; and address data entry errors).

the issue of 'missing' versus 'absent' variable coding seems to be present in the findings of the paper by nogueira et al. [ ] : the reduction in the risk of death in relation to comorbidities observed in the analysis of the first dataset is underestimated if we assume that the updated dataset is the correct one [ ] . if this analysis had included the , cases as missing values, the results and conclusions could indeed be different. in fact, these cases were registered as having no underlying conditions in the first dataset but corrected in the second dataset to 'unknown if the case has underlying conditions' or system missing. this problem might be due to the way these data were collected and/or recorded in the database sent to the researchers. in the form used to collect covid- surveillance data, comorbidities are recorded one by one after a general question assessing the presence of any comorbidity, and the field is not mandatory. from a clinical point of view, it might be enough to register only positive data perceived as relevant (e.g. the presence of a specific diagnosis, but not its absence), especially in a high-burden context such as the ongoing pandemic. in the context of clinical research, however, the lack of registered comorbidity data cannot be interpreted as the absence of comorbidities. a similar bias can be found in the other two studies reporting analyses of the dgsapril dataset [ , ] .

another data quality issue is related to the differences in the cases included. in fact, only % of the cases included in the dgsapril dataset had the same unique case identifier in the dgsaugust dataset, and only % of the cases diagnosed until april th included in dgsaugust had the same unique case identifier in dgsapril. alternatively, the unique case identifier may have been changed. we do not know if the unique identifier is generated in each data download or if it is recorded in the database; the latter option would be the safest. moreover, until june , it was not mandatory to fill in the national health service user number in order to have a standard unique patient identifier. that may have led to not identifying duplicate sinave med entries for the same patient and increased the difficulty in adequately merging data from sinave lab, sinave med and other data sources.

the high percentage of incomplete data in several variables may also produce biases whose dimensions and directions are not possible to estimate. in fact, as our results showed, half of the variables available in the dgsaugust dataset had more than one-third of missing information. furthermore, that dataset was already incomplete, since it only provides the covid- cases from the medical component of sinave, totalling % of the cases reported by health authorities until the end of june [ ] . it is unclear, however, why the updated version of the dataset reported more covid- cases and more deaths in march than the public report (which would be expected to be more complete). moreover, there were no reported dates of death in june in the dgsaugust dataset, despite the deaths reported in the public report during this month.

the consistency of variables across different updates of the datasets is also an important quality issue. in fact, our results show that the variable 'age' was calculated differently in the two datasets: in the dgsapril dataset it was the age at the time of covid- onset, and in the dgsaugust dataset it was the age at the time of covid- notification. despite this change in definition, the difference of one year in half of the cases does not seem to be completely justified by this fact alone, since the two dates should be relatively close. still related to this problem of inconsistent information and variables, we realised that some information may have been lost in the second dataset sent (dgsaugust). in fact, the outcome of the covid- case is not presented in the second dataset; the dgsaugust dataset only presents the dates of death and recovery. in the dgsaugust dataset it is therefore assumed that missing information about the recovery date implies that the case has not recovered yet. also, the 'recovery date' had only three dates even though it refers to a four-month period. all the described errors, inconsistencies, data incompleteness and changes in the variables' definitions and format may lead to unreproducible methods and analyses. while it is important to start working on data analysis as fast as possible at the very beginning of a pandemic, it is also crucial that the models and analyses developed with the first data are validated a posteriori and confirmed with the updated data. it is thus fundamental that subsequent datasets follow the same metadata and preferably are more complete, with fewer inconsistencies and errors.

the quality of healthcare data can be improved through several strategies. first, data entry processes must be simplified, avoiding duplications and reusing the data already in the system, since the need to input the same information in different systems is time-consuming, frustrating for the user, and can negatively impact both data completeness and accuracy. data interoperability can also be a powerful approach to minimise the number of interactions with the system [ ] . second, data need to be constantly monitored and tracked [ ] : organisations must develop processes to evaluate data patterns and establish reporting systems based on data quality metrics. even before data curation, simple validation procedures and rules in information systems can help detect and prevent many errors (e.g. male patients classified as 'pregnant', or a patient aged years old) and inconsistencies, and improve data completeness. finally, we need to establish the value proposition for both creators and observers [ ] . this includes ensuring that healthcare providers understand the importance of data, receive feedback about its analysis and how it may improve both the assistance to the patient and the whole organization, and have received adequate training for better performance. the adoption of these strategies should pave the way to high-quality, accurate healthcare datasets that can generate accurate knowledge to inform, in a timely manner, health policies and the readaptation of health care systems to new challenges.

our study has some limitations. we asked dgs for clarification on some data issues and are still waiting to receive complete answers that might clarify some of these aspects. therefore, the analysis of the portuguese surveillance data quality was done exclusively through the analysis of the databases provided by dgs to researchers and with our external knowledge about how the information flows from the moment the data are introduced by health professionals until the dataset can be used for data analysis. another limitation is the fact that we only studied the quality issues of covid- data from one country, portugal. however, our results seem to be in line with the findings of ashofteh and colleagues [ ] , who analysed and compared the quality of official datasets available for covid- , including data from the chinese center for disease control and prevention, the world health organization, and the european centre for disease prevention and control. in fact, they also found noticeable and increasing measurement errors in the three datasets as the pandemic outbreak expanded and more countries contributed data to the official repositories.

we describe some important quality issues of the portuguese covid- surveillance datasets, which may jeopardize the validity of some analyses, with possible serious implications in a context such as a pandemic. the availability of official data from national health authorities to researchers is an enormous asset, allowing data analysis, modelling and prediction that may support better decisions for the patient and the community as a whole. however, to fully embrace this potential, it is crucial that these data are accurate and reliable. it is urgent to define and implement major improvements in the processes and systems behind surveillance datasets: simplification of data entry processes, constant monitoring of data, raising awareness among health care providers of the importance of good data, and providing them with adequate training. data curation processes, capitalising on effective and multidisciplinary collaborations between healthcare providers and data analysts, play a critical role in ensuring minimum quality standards. once these processes are fully optimised, the reliability of results and the quality of the scientific evidence produced can be greatly improved.

how decision makers can use quantitative approaches to guide outbreak responses open access epidemiological data from the covid- outbreak data sharing: make outbreak research open access updated guidelines for evaluating public health surveillance systems health records as the basis of clinical coding: is the quality adequate? a qualitative study of medical coders' perceptions a review of data quality assessment methods for public health information systems a study on the quality of novel coronavirus (covid- ) official datasets methodological challenges of analysing covid- data during the pandemic comunicado: casos de infeção por novo coronavírus (covid- ) carta aberta ao conselho nacional de saúde pública: um contributo pessoal acerca da epidemia de covid- , em portugal covid- : disponibilização de dados covid metadata relatório de situação -informação publicada diariamente the role of health preconditions on covid- deaths in portugal: evidence from surveillance data of the first infection cases covid- : determinants of hospitalization, icu and death among , reported cases in portugal comparison of multimorbidity in covid- infected and general population in portugal the hidden factor-low quality of data is a major peril in the identification of risk factors for covid- deaths: a comment on nogueira interoperability progress and remaining data quality barriers of certified health information technologies a review of data quality assessment methods for public health information systems integrating research and practice: health system leaders working toward high-value care: workshop summary data used in this work was made available by the portuguese directorate-general of health (dgs), under the scope of article th of the decree law -b/ , from april the nd.

key: cord- - pvln x authors: asbury, thomas m; mitman, matt; tang, jijun; zheng, w jim title: genome d: a viewer-model framework for integrating and visualizing multi-scale epigenomic information within a three-dimensional genome date: - - journal: bmc bioinformatics doi: . / - - - sha: doc_id: cord_uid: pvln x background: new technologies are enabling the measurement of many types of genomic and epigenomic information at scales ranging from the atomic to the nuclear. much of this new data is increasingly structural in nature, and is often difficult to coordinate with other data sets. there is a legitimate need for integrating and visualizing these disparate data sets to reveal structural relationships not apparent when looking at these data in isolation.
results: we have applied object-oriented technology to develop a downloadable visualization tool, genome d, for integrating and displaying epigenomic data within a prescribed three-dimensional physical model of the human genome. in order to integrate and visualize large volume of data, novel statistical and mathematical approaches have been developed to reduce the size of the data. to our knowledge, this is the first such tool developed that can visualize human genome in three-dimension. we describe here the major features of genome d and discuss our multi-scale data framework using a representative basic physical model. we then demonstrate many of the issues and benefits of multi-resolution data integration. conclusions: genome d is a software visualization tool that explores a wide range of structural genomic and epigenetic data. data from various sources of differing scales can be integrated within a hierarchical framework that is easily adapted to new developments concerning the structure of the physical genome. in addition, our tool has a simple annotation mechanism to incorporate non-structural information. genome d is unique is its ability to manipulate large amounts of multi-resolution data from diverse sources to uncover complex and new structural relationships within the genome. background a significant portion of genomic data that is currently being generated extends beyond traditional primary sequence information. genome-wide epigenetic characteristics such as dna and histone modifications, nucleosome distributions, along with transcriptional and replication center structural insights are rapidly changing the way the genome is understood. indeed, these new data from high-throughput sources are often demonstrating that much of the genome's functional landscape resides in extra-sequential properties. with this influx of new detail about the higher-level structure and dynamics of the genome, new techniques will be required to visualize and model the full extent of genomic interactions and function. genome browsers, such as the uscs genome database browser [ ] , are specifically aimed at viewing primary sequence information. although supplemental information can easily be annotated via new tracks, representing structural hierarchies and interactions is quite difficult, particularly across non-contiguous genomic segments [ ] . in addition, in spite of the many recent efforts to measure and model the genome structure at various resolutions and detail [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] , little work has focused on combining these models into a plausible aggregate, or has taken advantage of the large amount of genomic and epigenomic data available from new high-throughput approaches. to address these issues, we have created an interactive d viewer, genome d, to enable integration and visualization of genomic and epigenomic data. the viewer is designed to display data from multiple scales and uses a hierarchical model of the relative positions of all nucleotide atoms in the cell nucleus, i.e., the complete physical genome. our model framework is flexible and adaptable to handle new more precise structural information as details emerge about the genome's physical arrangement. the large amounts of data generated by highthroughput or whole-genome experiments raise issues of scale, storage, interactivity and abstraction. novel methods will be required to extract useful knowledge. genome d is an early step toward such new approaches. 
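as a rough, hedged illustration of the multi-scale, data-reduction idea described above (the exact figures in the original text are not reproduced here), the sketch below compares the storage needed for a fully atomic representation with a hierarchy that stores only a few coarse levels and computes atomic positions on demand. every numeric parameter is an assumption chosen purely for illustration.

```python
"""Back-of-the-envelope storage estimate for a multi-scale genome model.

A minimal sketch of the trade-off described above. Every numeric parameter
(genome size, atoms per base pair, bytes per coordinate, spacing of the
stored levels) is an assumption for illustration only.
"""
GENOME_BP = 3.2e9      # approximate haploid human genome size in bp (assumed)
ATOMS_PER_BP = 65      # rough atom count per base pair of duplex DNA (assumed)
BYTES_PER_COORD = 4    # single-precision float per x, y or z coordinate (assumed)

def gigabytes(n_points):
    """Storage in GB for n_points three-dimensional positions."""
    return n_points * 3 * BYTES_PER_COORD / 1e9

# Full atomic representation: one 3D point per atom.
atomic = gigabytes(GENOME_BP * ATOMS_PER_BP)

# Hierarchical representation: one stored point per nucleosome-scale segment
# (~200 bp), per chromatin-fiber segment (~10 kb) and per coarse chromosome
# segment (~1 Mb); atomic coordinates are generated on demand from these anchors.
hierarchical = sum(gigabytes(GENOME_BP / spacing) for spacing in (200, 1e4, 1e6))

print(f"full atomic model : ~{atomic:,.0f} GB")
print(f"stored hierarchy  : ~{hierarchical:.2f} GB")
print(f"reduction factor  : ~{atomic / hierarchical:,.0f}x")
```

the point of the exercise is not the particular numbers but the shape of the trade-off: storing only coarse anchor points and regenerating fine detail on request shrinks the dataset by orders of magnitude.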
genome d is a gui-based c++ program which runs on windows (xp or later) platforms. its software architecture is based on the model-viewer-controller pattern [ ] . genome d is a viewer application to explore an underlying physical model, displaying selections and annotations based on its current user settings. to support multiple resolutions and maintain a high level of interactivity, the model is designed using an object-oriented, hierarchical data architecture [ ] . genome d loads the model incrementally as needed to support user requests. once a model is loaded, genome d supports ucsc genome browser track annotations in the bed and wig formats [ ] . at highest detail, a model of the physical genome requires a d position (x, y, z) for each bp atom of the genome. the large amount of such data ( × bp × atoms/bp × positions × bytes~ gigabytes for humans) is reduced by exploiting the data's hierarchical organization. we store three scales of data for each chromosome in compressed xml format. atomic positions are computed on demand and not saved. this technique reduces the storage size for a human genome to~ . gigabytes, resulting in more than × savings. there are several sample models available for download from the genome d project homepage. more information on our representative model and its data format can be found in additional file . the range of scales and spatial organizations of dna within the human cell presents many visualization challenges. to meet these challenges, genome d manipulates and displays genomic data at multiple resolutions. figure shows several screen captures of the genome d application at various levels of detail. genome d allows the user to specify the degree of detail to view, and the corresponding data is loaded dynamically. because of the large amount of data and the limited memory that is available, only portions of the data can typically be viewed at high resolution. the interactivity of genome d facilitates exploring the model to find areas of interest. additionally, the user can configure various display parameters (such as color and shape) to highlight significant structural relationships. genome d features include:
• display of genomic data from nuclear to atomic scale. genome d has multiple windows to visualize the physical genome model from different viewpoints and scales simultaneously. the model resolution of the current viewing window is set by the user, and its viewing camera is controlled by the mouse. resolutions and viewpoints depend on the type of data that is being visualized.
• a fully interactive point-and-select d environment. the user can navigate to an arbitrary region of interest by selecting a low-resolution region and then loading the corresponding higher-resolution data, which appears in another viewing window.
• loading of multiple-resolution user-created models with an open xml format. the genome d application adheres to the model-view-controller software design pattern [ ] . the viewing software is completely separated from the multiscale model that is being viewed. we have chosen a simple open format for each resolution of the model, and users can easily add their own models.
• image capture and povray/pdb model export support. genome d supports screen capture of the current display image to a jpg format. for high-quality renders, it can export the current model and view as a povray model [ ] format for off-line print-quality rendering. in addition, atomic positions of selected dna can be saved to a pdb format file for downstream analysis.
• incorporation and user-defined visualization of ucsc annotation tracks onto the physical model. the ucsc genome database browser has a variety of epigenetic information that can be exported directly from its website [ ] . this data can be loaded into genome d and displayed on the currently loaded genome model.
we now give a few examples of applying biological information to a model and suggest possible methods of inferring unique structural relationships at various resolutions. one of the advantages of a multi-scale model is the ability to integrate data from various sources, and perhaps gain insight into higher-level relationships or organization. we choose to concentrate on high-throughput data sets that are becoming commonplace in current research: genome-wide nucleosome positions, snps, histone methylations and gene expression profiles. the sample images, which can be visualized in genome d, were exported and rendered in povray [ ] . the impact of nucleosome position on gene regulation is well known [ ] . in addition to nucleosome restructuring/modification [ ] , the rotation and phasing information of the dna sequence may also play a significant role in gene regulation [ ] , particularly within non-coding regions. figures a, b show a non-coding nucleosome with multiple snps using genome-wide histone positioning data [ ] combined with a snp dataset [ ] . this highlights one of the advantages of three-dimensional genomic data by clearly showing the phasing of the snps relative to the histone. observations of this type and of more complicated structural relationships may provide insights for further analysis, and such hidden three-dimensional structure is perhaps best explored with the human eye using a physical model.
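as an illustration of how a sequence-coordinate annotation such as a bed track can be attached to a physical model, the sketch below maps a single hypothetical bed feature onto per-bin 3d points. the bin size, coordinates and bed record are invented and do not reflect genome d's internal data structures.

```python
"""Illustrative mapping of a BED annotation onto a 3D genome model.

A minimal sketch, not Genome3D's internal logic: the 'model' here is just a
dictionary giving one (x, y, z) point per fixed-size bin of each chromosome,
and a BED feature is attached to every bin it overlaps.
"""
BIN_SIZE = 10_000  # bp per stored model point (assumed)

# Toy model: chr1 represented by 3D points for its first five 10-kb bins.
model = {
    "chr1": [(0.0, 0.0, 0.0), (1.0, 0.2, 0.0), (1.8, 0.9, 0.3),
             (2.2, 1.7, 0.9), (2.5, 2.6, 1.6)],
}

def bed_to_bins(bed_line, bin_size=BIN_SIZE):
    """Yield (chrom, bin_index, name) for every model bin a BED feature overlaps."""
    fields = bed_line.strip().split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else "feature"
    for b in range(start // bin_size, (end - 1) // bin_size + 1):
        yield chrom, b, name

annotations = {}  # (chrom, bin_index) -> list of feature names
bed_line = "chr1\t12500\t33000\tH3K4me3_peak"  # hypothetical BED record
for chrom, b, name in bed_to_bins(bed_line):
    if chrom in model and b < len(model[chrom]):
        annotations.setdefault((chrom, b), []).append(name)

for (chrom, b), names in sorted(annotations.items()):
    x, y, z = model[chrom][b]
    print(f"{chrom} bin {b} at ({x:.1f}, {y:.1f}, {z:.1f}) carries {names}")
```

once an annotation is tied to model bins in this way, it can be rendered as a colour, shape or label on the corresponding 3d points at whatever resolution the viewer is currently showing.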
• incorporation and user-defined visualization of ucsc annotation tracks onto the physical model the ucsc genome database browser has a variety of epigenetic information that can be exported directly from its web-site [ ] . this data can be loaded into genome d and displayed on the currently loaded genome model. we now give a few examples of applying biological information to a model and suggest possible methods of inferring unique structural relationships at various resolutions. one of the advantages of a multi-scale model is the ability to integrate data from various sources, and perhaps gain insight in higher level relationships or organizations. we choose to concentrate on highthroughput data sets that are becoming commonplace in current research: genome wide nucleosome positions, snps, histone methylations and gene expression profiles. the sample images, which can be visualized in gen-ome d, were export and rendered in povray [ ] . the impact of nucleosome position on gene regulation is well-known [ ] . in addition to nucleosome restructuring/modification [ ] , the rotation and phasing information of dna sequence may also play a significant role in gene regulation [ ] , particularly within non-coding regions. figures a, b show a non-coding nucleosome with multiple snps using genome-wide histone positioning data [ ] combined with a snp dataset [ ] . it highlights one of the advantages of three dimensional genomic data by clearly showing the phasing of the snps relative to the histone. observations of this type and of more complicated structural relationships may provide insights for further analysis, and such hidden three-dimensional structure is perhaps best explored with the human eye using a physical model. figure two examples of nucleosome epigenomic variation. a top view of snp variants rs , rs , rs , and rs (numbered - respectively) within a non-coding histone of chromosome : - . the histone position was obtained from [ ] , the snps were taken from a recent study examining variants associated with hdl cholesterol [ ] . such images may reveal structural relationships between non-coding region snps and histone phasing. b side view of a. c a series of histone trimethylations within encode region enr on chromosome : - [ ] . the histone bp positions are from [ ] . each histone protein is shown as an approximate cylinder wedge: h a (yellow), h b (red), h (blue), h (green). the ca backbones of the h and h n-terminal tails are modeled using the crystal structure of the ncp (pdb a i) [ ] . the bright yellow spheres indicate h k me and h k me , and the orange spheres are h k me , h k me and h k me . another important source of epigenomic information is histone modification. genome-wide histone modifications are being studied through a combination of dna microarray and chromatin immunoprecipitation (chipchip assays) [ ] . histone methylations have important gene regulation implications, and methylations have been shown to serve as binding platforms for transcription machinery. the encode initiative [ ] is creating high-resolution epigenetic information for~ % of the human genome. despite the fact that such modification occurs in histone proteins, current approaches to map and visualize such information are limited to sequence coordinates in the genome. our physical genome model visualizes methylation of histone proteins at atomic detail as determined by crystal structure. figure c shows histone methylations for several histones within an encode region. 
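The annotation mechanism described above essentially maps interval-based genome coordinates (e.g., UCSC BED tracks) onto positions in the 3D physical model. Below is a minimal sketch of that idea; it is not Genome3D's actual C++ implementation, and the coarse control-point model and linear-interpolation lookup are illustrative assumptions only.

```python
# Sketch: map BED intervals onto a coarse 3D polyline model of a chromosome.
# The control-point model and linear interpolation are illustrative assumptions,
# not the data structures used inside Genome3D itself.
from bisect import bisect_right

# Hypothetical coarse model: (genomic_position_bp, (x, y, z)) control points.
control_points = [
    (0,       (0.0, 0.0, 0.0)),
    (50_000,  (1.0, 0.5, 0.0)),
    (100_000, (1.5, 1.5, 0.5)),
    (150_000, (1.0, 2.5, 1.0)),
]

def genomic_to_xyz(pos_bp):
    """Linearly interpolate a 3D position for a genomic coordinate (bp)."""
    keys = [p for p, _ in control_points]
    i = max(1, min(bisect_right(keys, pos_bp), len(keys) - 1))
    (p0, a), (p1, b) = control_points[i - 1], control_points[i]
    t = (pos_bp - p0) / float(p1 - p0)
    return tuple(a_k + t * (b_k - a_k) for a_k, b_k in zip(a, b))

def annotate_bed(bed_lines):
    """Yield (name, start_xyz, end_xyz) for each BED interval."""
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, *rest = line.split()
        name = rest[0] if rest else f"{chrom}:{start}-{end}"
        yield name, genomic_to_xyz(int(start)), genomic_to_xyz(int(end))

if __name__ == "__main__":
    sample_bed = ["chr1\t10000\t20000\tH3K4me3_peak", "chr1\t120000\t130000\tSNP_region"]
    for name, a, b in annotate_bed(sample_bed):
        print(name, a, b)
```

In a real multi-scale model the lookup would go through the hierarchical fiber/nucleosome representation rather than a single polyline, but the interface idea (genomic interval in, 3D coordinates out) is the same.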
an integrated physical genome model can show the interplay between histone modifications and other genomic data, such as snps, dna methylation, the structure of gene, promoter and transcription machinery, etc. in addition to epigenomic data, the physical genome model also provides a platform to visualize highthroughput gene expression data and its interplay with global binding information of transcription factors. we consider a sample analysis of transcription factor p . genome-wide binding sites of p proteins [ ] can be combined with the gene expression results from a study investigating the dosing effect of p [ ] . this may identify genes that have p binding sites in their promoter regions and are responsive to the dosing effect of p protein. such large-scale microarray expression data is often displayed with a two-dimensional array format, emphasizing shared expression between genes, while p binding data are stored in tabular form. with a physical model, expression levels of genes in response to p level can be mapped to genome positions together with global p binding information, revealing any structural bias of the expression. figure shows this type of physical genome annotation. drawing inferences from coupling averaged or "snap-shot" expression data with the dynamic architecture of the genome may be helpful in determining structural dependences in expression patterns. to illustrate the capability of genome d to integrate and examine data of appropriate scales, we constructed an elementary model of the physical genome (see additional file for details). this basic model is approximate since precise knowledge of the physical genome is largely unknown at present. however, the model's inaccuracies are secondary to its multi-scale approach that provides a framework to improve and refine the model. current technologies are making significant progress toward capturing chromosome conformation within the nucleus at various scales [ , ] . because our multi-scale model is purely descriptive beyond the ncp scale, it can easily incorporate more accurate structural folding information, such as the 'fractal globule' behaviour [ ] . the genome d viewer, decoupled from the genome model, can be used to view any model that uses our model framework. building a d model of a complete physical genome is a non-trivial task. the structure and organization at a physical level is dynamic and heavily influenced by local and global constraints. a typical experiment may provide new data at a specific resolution or portion of the genome, and the integration of these data with other information to flesh out a multi-resolution model is challenging. for example, an experiment may measure local chromatin structure around a transcription site. this structure can be expressed as a collection of dna strands, ncps, and perhaps lower resolution nm chromatin fibers. our data formats are flexible enough to allow partial integration of this information, when the larger global structure is undetermined, or inferred by more global stochastic measurements from other experiments. combining such data across resolutions is often difficult, but establishing data formats and visualization tools provide a framework that may simplify the integration process. recent advances in determining chromosome folding principles [ ] highlight the need for new visualization methods. more detailed three-dimensional genomic models will help in discovering and characterizing epigenetic processes. 
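The p53 example above is, at its core, a join between a table of genome-wide binding sites and a table of expression responses, with the result mapped back onto genomic coordinates for display. The short sketch below illustrates only that join step; the column names, the promoter window size, the fold-change threshold, and the example records are assumptions for illustration, not the format of the datasets cited above.

```python
# Sketch: find genes whose promoter region contains a p53 binding site and
# whose expression responds to p53 dosage. Column names, the 2 kb promoter
# window, the threshold, and the example records are illustrative assumptions.
PROMOTER_WINDOW_BP = 2_000
FOLD_CHANGE_CUTOFF = 1.5

# Hypothetical inputs: binding sites (chrom, position) and gene annotations
# (gene, chrom, transcription start site, dosage-response fold change).
binding_sites = [("chr6", 36_650_500), ("chr17", 7_570_000)]
genes = [
    {"gene": "CDKN1A", "chrom": "chr6",  "tss": 36_651_900, "fold_change": 3.2},
    {"gene": "GAPDH",  "chrom": "chr12", "tss": 6_534_500,  "fold_change": 1.0},
]

def responsive_genes_with_promoter_binding(sites, gene_table):
    """Return genes with a binding site near the TSS and a dosage response."""
    hits = []
    for g in gene_table:
        near_tss = any(
            chrom == g["chrom"] and abs(pos - g["tss"]) <= PROMOTER_WINDOW_BP
            for chrom, pos in sites
        )
        if near_tss and g["fold_change"] >= FOLD_CHANGE_CUTOFF:
            hits.append((g["gene"], g["chrom"], g["tss"], g["fold_change"]))
    return hits

print(responsive_genes_with_promoter_binding(binding_sites, genes))
```

The added value of the physical model is then in plotting the resulting hits at their 3D genome positions, rather than in the join itself.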
We have created a multi-scale genomic viewer, Genome3D, to display and investigate genomic and epigenomic information in a three-dimensional representation of the physical genome. The viewer software and its underlying data architecture are designed to handle the visualization and integration issues that arise when dealing with large amounts of data at multiple resolutions. Our data structures can easily accommodate new advances in chromosome folding and organization. A common framework of established scales and formats could vastly improve multi-scale data integration and the ability to infer previously unknown relationships within the composite data. Our model architecture defines clear demarcations between four scales (nuclear, fiber, nucleosome and DNA), which facilitates data integration in a consistent and well-behaved manner. As more data become available, the ability to model, characterize, visualize, and perhaps most crucially, integrate information at many scales is necessary to achieve a fuller understanding of the human genome. [...] software development, and WJZ oversaw the whole project. All authors read and approved the final manuscript.
References:
• The UCSC Genome Browser database: update
• Gene regulation in the third dimension
• Polymer models for interphase chromosomes
• A random-walk/giant-loop model for interphase chromosomes
• A polymer, random walk model for the size-distribution of large DNA fragments after high linear energy transfer radiation
• A chromatin folding model that incorporates linker variability generates fibers resembling the native structures
• Capturing chromosome conformation
• Modeling DNA loops using the theory of elasticity
• Computational modeling predicts the structure and dynamics of chromatin fiber
• Multiscale modeling of nucleosome dynamics
• Applications programming in Smalltalk-80: how to use Model-View-Controller (MVC)
• Object-oriented biological system integration: a SARS coronavirus example
• Computer graphics: principles and practice
• Persistence of Vision Pty. Ltd., Persistence of Vision Raytracer (version )
• Cooperation between complexes that regulate chromatin structure and transcription
• The language of covalent histone modifications
• Binding of NF to the MMTV promoter in nucleosomes: influence of rotational phasing, translational positioning and histone H
• Dynamic regulation of nucleosome positioning in the human genome
• Newly identified loci that influence lipid concentrations and risk of coronary artery disease
• Genome-wide approaches to studying chromatin modifications
• A global map of p53 transcription-factor binding sites in the human genome
• Gene expression profiling of isogenic cells with different TP53 gene dosage reveals numerous genes that are affected by TP53 dosage and identifies CSPG as a direct target of p53
• Comprehensive mapping of long-range interactions reveals folding principles of the human genome
• Organization of interphase chromatin
• The role of topological constraints in the kinetics of collapse of macromolecules
• The landscape of histone modifications across % of the human genome in five human cell lines
• Crystal structure of the nucleosome core particle at Å resolution
Acknowledgements. This work is partly supported by grants IRG - - from the American Cancer Society, the Computational Biology Core of UL RR - , R GM - S , a pilot project and statistical core of grant P RR - , a PhRMA Foundation Research Starter Grant, a pilot project from P RR to W.J.Z., and NSF and R GM - S to JT. T.M.A. is supported by NLM training grant -T -LM - . The authors thank Y. Ruan for valuable discussion about the project, K. Zhao and D. E. Schones for providing nucleosome positioning data, M. Boehnke for critical reading of the manuscript, and T. Qin, L. C. Tsoi, and K. Sims for software testing. The high-performance computing facility utilized in this project is supported by NIH grants: R LM , P RR , T GM and T LM .
Availability and requirements:
• Project name: Genome3D
• Project homepage: http://genomebioinfo.musc.edu/Genome3D/index.html
• Operating system: Windows-based operating systems (XP or later)
• Programming language: C++ and Python
• Other requirements: OpenGL v. and GLSL v. (may not be present on some older graphics adapters; see Additional file )
• Any restrictions to use by non-academics: none
Additional file : Supplemental information. Additional details about human physical genome model construction and the Genome3D software. Additional file : Genome3D v. readme. The readme file for the Genome3D software.
Authors' contributions: WJZ conceived the initial concept of the project and developed the project with TMA. TMA developed the 3D genomic model and worked with MM to develop the Genome3D software. JT and WJZ advised TMA and MM.
key: cord- - qg fn f authors: Adiga, Aniruddha; Dubhashi, Devdatt; Lewis, Bryan; Marathe, Madhav; Venkatramanan, Srinivasan; Vullikanti, Anil title: Mathematical models for COVID-19 pandemic: a comparative analysis date: - - journal: J Indian Inst Sci doi: . /s - - - sha: doc_id: cord_uid: qg fn f
The COVID-19 pandemic represents an unprecedented global health crisis in the last 100 years. Its economic, social and health impact continues to grow and is likely to end up as one of the worst global disasters since the 1918 pandemic and the World Wars. Mathematical models have played an important role in the ongoing crisis; they have been used to inform public policies and have been instrumental in many of the social distancing measures that were instituted worldwide. In this article, we review some of the important mathematical models used to support the ongoing planning and response efforts. These models differ in their use, their mathematical form and their scope.
Models have been used by mathematical epidemiologists to support a broad range of policy questions, and their use during COVID-19 has been widespread. In general, the type and form of models used in epidemiology depend on the phase of the epidemic. Before an epidemic, models are used for planning, identifying critical gaps, and preparing plans to detect and respond in the event of a pandemic. At the start of a pandemic, policy makers are interested in questions such as: (i) where and how did the pandemic start, (ii) the risk of its spread in the region, (iii) the risk of importation into other regions of the world, and (iv) a basic understanding of the pathogen and its epidemiological characteristics. As the pandemic takes hold, researchers begin investigating: (i) various intervention and control strategies (usually pharmaceutical interventions do not work in the event of a pandemic, and thus non-pharmaceutical interventions are most appropriate), (ii) forecasting the epidemic incidence rate, hospitalization rate and mortality rate, (iii) efficiently allocating scarce medical resources to treat patients, and (iv) understanding changes in individual and collective behavior and adherence to public policies. After the pandemic starts to slow down, modelers are interested in developing models related to recovery and the long-term impacts caused by the pandemic. As a result, comparing models needs to be done with care. When comparing models, one needs to specify: (a) the purpose of the model, (b) the end user to whom the model is targeted, (c) the spatial and temporal resolution of the model, and (d) the underlying assumptions and limitations. We illustrate these issues by summarizing a few key methods for projection and forecasting of disease outcomes in the US and Sweden. Organization. The paper is organized as follows. We first give preliminary definitions. We then discuss the US- and UK-centric models developed by researchers at Imperial College, followed by the metapopulation models focused on the US that were developed by our group at UVA and by researchers at Northeastern University, and the models developed by Swedish researchers for studying the outbreak in Sweden. We then discuss methods developed for forecasting, and close with a discussion of model limitations and concluding remarks. In a companion paper that appears in this special issue, we address certain complementary issues related to pandemic planning and response, including the role of data and analytics. Important note. The primary purpose of the paper is to highlight some of the salient computational models that are currently being used to support COVID-19 pandemic response. These models, like all models, have their strengths and weaknesses; they have all faced challenges arising from the lack of timely data. Our goal is not to pick winners and losers among these models; each model has been used by policy makers and continues to be used to advise various agencies. Rather, our goal is to introduce to the reader a range of models that can be used in such situations. A simple model is no better or worse than a complicated model. The suitability of a specific model for a given question needs to be evaluated by the decision maker and the modeler.
Models for epidemiology. Epidemiological models fall in two broad classes: statistical models that are largely data driven, and mechanistic models that are based on underlying theoretical principles developed by scientists on how the disease spreads. Data-driven models use statistical and machine learning methods to forecast outcomes such as case counts, mortality and hospital demands. This is a very active area of research, and a broad class of techniques has been developed, including auto-regressive time series methods, Bayesian techniques and deep learning. Mechanistic models of disease spread within a population use mechanistic (also referred to as procedural or algorithmic) methods to describe the evolution of an epidemic through a population. The most common of these are SIR-type models. Hybrid models that combine mechanistic models with data-driven machine learning approaches are also starting to become popular. There are a number of models which are referred to as the SIR class of models. These partition a population of N agents into three sets, each corresponding to a disease state, which is one of: susceptible (S), infective (I) and removed or recovered (R). The specific model then specifies how susceptible individuals become infectious, and then recover. In its simplest form (referred to as the basic compartmental model), the population is assumed to be completely mixed. Let S(t), I(t) and R(t) denote the number of people in the susceptible, infected and recovered states at time t, respectively. Let s(t) = S(t)/N; then the SIR model can be described by the following system of ordinary differential equations: dS/dt = -β s(t) I(t), dI/dt = β s(t) I(t) - γ I(t), dR/dt = γ I(t), where β is referred to as the transmission rate, and γ is the recovery rate. A key parameter in such a model is the "reproductive number", denoted by R0 = β/γ. At the start of an epidemic, much of the public health effort is focused on estimating R0 from observed infections. Mass action compartmental models have been the workhorse for epidemiologists and have been widely used for over a century. Their strength comes from their simplicity, both analytically and from the standpoint of understanding the outcomes. Software systems have been developed to solve such models, and a number of associated tools have been built to support analysis using such models. Although simple and powerful, mass action compartmental models do not capture the inherent heterogeneity of the underlying populations. A significant amount of research has been conducted to extend the model, usually in two broad ways. The first involves structured metapopulation models; these construct an abstraction of the mixing patterns in the population into m different sub-populations, e.g., age groups and small geographical regions, and attempt to capture the heterogeneity in mixing patterns across subpopulations. In other words, the model has states S_j(t), I_j(t), R_j(t) for each subpopulation j. The evolution of a compartment X_j(t) is determined by mixing within and across compartments. For instance, survey data on mixing across age groups have been used to construct age-structured metapopulation models. More relevant for our paper are spatial metapopulation models, in which the subpopulations are connected through airline and commuter flow networks. Main steps in constructing structured metapopulation models. This depends on the disease, the population, and the type of question being studied; a minimal numerical sketch of the basic SIR model above is given below, before we turn to those steps.
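As flagged above, here is a minimal numerical sketch of the basic (fully mixed) SIR system before the metapopulation construction steps; the values of β, γ, the population size and the seed infections are illustrative assumptions, not estimates for any specific disease.

```python
# Minimal sketch of the basic (fully mixed) SIR model described above.
# beta, gamma, N, and the initial seed are illustrative values only.
import numpy as np
from scipy.integrate import solve_ivp

N = 1_000_000           # population size
beta, gamma = 0.3, 0.1  # transmission and recovery rates; R0 = beta / gamma = 3

def sir_rhs(t, y):
    S, I, R = y
    s = S / N                      # s(t) = S(t)/N as in the text
    dS = -beta * s * I             # dS/dt = -beta * s * I
    dI = beta * s * I - gamma * I  # dI/dt =  beta * s * I - gamma * I
    dR = gamma * I                 # dR/dt =  gamma * I
    return [dS, dI, dR]

y0 = [N - 10, 10, 0]               # 10 seed infections
sol = solve_ivp(sir_rhs, (0, 300), y0, t_eval=np.arange(0, 301, 1))

peak_day = int(sol.t[np.argmax(sol.y[1])])
attack_rate = sol.y[2][-1] / N
print(f"R0 = {beta / gamma:.1f}, epidemic peak around day {peak_day}, "
      f"final attack rate ~ {attack_rate:.0%}")
```

Everything that follows (metapopulation and agent-based models) can be read as progressively replacing the complete-mixing assumption in this sketch with more structured contact patterns.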
the key steps in the development of such models for the spread of diseases over large populations include • constructing subpopulations and compartments: the entire population v is partitioned into subpopulations v j , within which the mixing is assumed to be complete. depending on the disease model, there are s j , e j , i j , r j compartments corresponding to the subpopulation v j (and more, depending on the disease)-these represent the number of individuals in v j in the corresponding state • mixing patterns among compartments: state transitions between compartments might depend on the states of individuals within the subpopulations associated with those compartments, as well as those who they come in contact with. for instance, the s j → e j transition rate might depend on i k for all the subpopulations who come in contact with individuals in v j . mobility and behavioral datasets are needed to model such interactions. such models are very useful at the early days of the outbreak, when the disease dynamics are driven to a large extent by mobility-these can be captured more easily within such models, and there is significant uncertainty in the disease model parameters. they can also model coarser interventions such as reduced mobility between spatial units and reduced mixing rates. however, these models become less useful to model the effect of detailed interventions (e.g., voluntary home isolation, school closures) on disease spread in and across communities. agent-based networked models (sometimes just called as agent-based models) extend metapopulation models further by explicitly capturing the interaction structure of the underlying populations. often such models are also resolved at the level of single individual entities (animals, humans, etc.). in this class of models, the epidemic dynamics can be modeled as a diffusion process on a specific undirected contact network g(v, e) on a population v-each edge e = (u, v) ∈ e implies that individuals (also referred to as nodes) u, v ∈ v come into contact main steps in setting up an agent-based model. while the specific steps depend on the disease, the population, and the type of question being studied, the general process involves the following steps: • construct a network representation g: the set v is the population in a region, and is available from different sources, such as census and landscan. however, the contact patterns are more difficult to model, as no real data are available on contacts between people at a large scale. instead, researchers have tried to model activities and mobility, from which contacts can be inferred, based on co-location. multiple approaches have been developed for this, including random mobility based on statistical models, and very detailed models based on activities in urban regions, which have been estimated through surveys, transportation data, and other sources, e.g., , , , , . • develop models of within-host disease progression: such models can be represented as finite state probabilistic timed transition models, which are designed in close coordination with biologists, epidemiologists, and parameterized using detailed incidence data (see for discussion and additional pointers). • develop high-performance computer (hpc) simulations to study epidemic dynamics in such models, e.g., , , , . typical public health analyses involve large experimental designs, and the models are stochastic; this necessitates the use of such hpc simulations on large computing clusters. 
• incorporate interventions and behavioral changes: interventions include closure of schools and workplaces , and vaccinations ; whereas, behavioral changes include individual level social distancing, changes in mobility, and use of protective measures. such a network model captures the interplay between the three components of computational epidemiology: (i) individual behaviors of agents, (ii) unstructured, heterogeneous multi-scale networks, and (iii) the dynamical processes on these networks. it is based on the hypothesis that a better understanding of the characteristics of the underlying network and individual behavioral adaptation can give better insights into contagion dynamics and response strategies. although computationally expensive and data intensive, network-based epidemiology alters the types of questions that can be posed, providing qualitatively different insights into disease dynamics and public health policies. it also allows policy makers to formulate and investigate potentially novel and contextspecific interventions. like projection approaches, models for epidemic forecasting can be broadly classified into two broad groups: (i) statistical and machine learning-based data-driven models, (ii) causal or mechanistic models-see , , , , , , and the references therein for the current state of the art in this rapidly evolving field. statistical methods employ statistical and time series-based methodologies to learn patterns in historical epidemic data and leverage those patterns for forecasting. of course, the simplest yet useful class is called method of analogs. one simply compares the current epidemic with one of the earlier outbreaks and then uses the best match to forecast the current epidemic. popular statistical methods for forecasting influenzalike illnesses (that includes covid- ) include, e.g., generalized linear models (glm), autoregressive integrated moving average (arima), and generalized autoregressive moving average (garma) , , . statistical methods are fast, but they crucially depend on the availability of training data. furthermore, since they are purely data driven, they do not capture the underlying causal mechanisms. as a result, epidemic dynamics affected by behavioral adaptations are usually hard to capture. artificial neural networks (ann) have gained increased prominence in epidemic forecasting due to their self-learning ability without prior knowledge (see , , and the references therein). such models have used a wide variety of data as surrogates for producing forecasts. this includes: (i) social media data, (ii) weather data, (iii) incidence curves and (iv) demographic data. causal models can be used for epidemic forecasting in a natural manner , , , , , . these models calibrate the internal model parameters using the disease incidence data seen until a given day and then execute the model forward in time to produce the future time series. compartmental as well as agentbased models can be used to produce such forecasts. the choice of the models depends on the specific question at hand and the computational and data resource constraints. one of the key ideas in forecasting is to develop ensemble models-models that combine forecasts from multiple models , , , . the idea which originated in the domain of weather forecasting has found methodological advances in the machine learning literature. ensemble models typically show better performance than the individual models. modeling group (uk model) background. 
the modeling group led by neil ferguson was to our knowledge the first model to study the impact of covid- across two large countries: us and uk, see . the basic model was first developed in -it was used to inform policy pertaining to h n pandemic and was one of the three models used to inform the federal pandemic influenza plan and led to the now well-accepted targeted layered containment (tlc) strategy. it was adapted to covid- as discussed below. the model was widely discussed and covered in the scientific as well as popular press . we will refer to this as the ic model. model structure. the basic model structure consists of developing a set of households based on census information for a given country. the structure of the model is largely borrowed from their earlier work, see , . landscan data were used to spatially distribute the population. individual members of the household interact with other members of the household. the data to produce these households are obtained using census information for these countries. census data are used to assign age and household sizes. details on the resolution of census data and the dates were not clear. schools, workplaces and random meeting points are then added. the school data for us were obtained from the national centre of educational statistics, while for uk schools were assigned randomly based on population density. data on average class sizes and staff-student ratios were used to generate a synthetic population of schools distributed proportional to local population density. data on the distribution of workplace size were used to generate workplaces with commuting distance data used to locate workplaces appropriately across the population. individuals are assigned to each of these locations at the start of the simulation. the gravity-style kernel is used to decide how far a person can go in terms of attending work, school or community interaction place. the number of contacts between individuals at school, work and community meeting points are calibrated to produce a given attack rate. each individual has an associated disease transmission model. the disease transmission model parameters are based on the data collected when the pandemic was evolving in wuhan; see page of . finally, the model also has rich set of interventions. these include: (i) case isolation, (ii) voluntary home quarantine, (iii) social distancing of those over years, (iv) social distancing of the entire population, (v) closure of schools and universities; see page . the code was recently released and is being analyzed. this is important as the interpretation of these interventions can have substantial impact on the outcome. model predictions. the imperial college (ic model) model was one of the first models to evaluate the covid- pandemic using detailed agentbased model. the predictions made by the model were quite dire. the results show that to be able to reduce r to close to or below, a combination of case isolation, social distancing of the entire population and either household quarantine or school and university closure is required. the model had tremendous impact-uk and us both decide to start considering complete lock downs-a policy that was practically impossible to even talk about earlier in the western world. the paper came out around the same time that wuhan epidemic was raging and the epidemic in italy had taken a turn for the worse. this made the model results even more critical. strengths and limitations. 
ic model was one of the first models by a reputed group to report the potential impact of covid- with and without interventions. the model was far more detailed than other models that were published until then. the authors also took great care parameterizing the model with the best disease transmission data that was available until then. the model also considered a very rich set of interventions and was one of the first to analyze pulsing intervention. on the flip side, the representation of the underlying social contact network was relatively simple. second, often the details of how interventions were represented were not clear. since the publication of their article, the modelers have made their code open and the research community has witnessed an intense debate on the pros and cons of various modeling assumptions and the resulting software system, see . we believe that despite certain valid criticisms, overall, the results represented a significant advance in terms of the when the results were put out and the level of details incorporated in the models. northeastern and uva models (us models) background. this approach is an alternative to detailed agent-based models, and has been used in modeling the spread of multiple diseases, including influenza , , ebola and zika . it has been adapted for studying the importation risk of covid- across the world . structured metapopulation models construct a simple abstraction of the mixing patterns in the population, in which the entire region under study is decomposed into fully connected geographical regions, representing subpopulations, which are connected through airline and commuter flow networks. thus, they lack the rich detail of agent-based models, but have fewer parameters, and are, therefore, easy to set up and scale to large regions. model structure. here, we summarize gleam (northeastern model) and patchsim (uva model). gleam uses two classes of datasets-population estimates and mobility. population data are used from the "gridded population of the world" , which gives an estimated population value at a × minutes of arc (referred to as a "cell") over the entire planet. two different kinds of mobility processes are considered-airline travel and commuter flow. the former captures long-distance travel; whereas, the latter captures localized mobility. airline data are obtained from the international air transport association (iata) , and the official airline guide (oag) . there are about airports world wide; these are aggregated at the level of urban regions served by multiple airport (e.g., as in london). a voronoi tessellation is constructed with the resulting airport locations as centers, and the population cells are assigned to these cells, with a mile cutoff from the center. the commuter flows connect cells at a much smaller spatial scale. we represent this mobility pattern as a directed graph on the cells, and refer to it as the mobility network. in the basic seir model, the subpopulation in each cell j is partitioned into compartments s j , e j , i j and r j , corresponding to the disease states. for each cell j, we define the force of infection j as the rate at which a susceptible individual in the subpopulation in cell j becomes infected-this is determined by the interactions the person has with infectious individuals in cell j or any cell j ′ connected in the mobility network. an individual in the susceptible compartment s j becomes infected with probability j t and enters the compartment e j , in a time interval t . 
from this compartment, the individual moves to the i j and then the r j compartments, with appropriate probabilities, corresponding to the disease model parameters. the patchsim model has a similar structure, except that it uses administrative boundaries (e.g., counties), instead of a voronoi tesselation, which are connected using a mobility network. the mobility network is derived by combining commuter and airline networks, to model time spent per day by individuals of region (patch) i in region (patch) j. since it explicitly captures the level of connectivity through a commuter-like mixing, it is capable of incorporating week-toweek and month-to-month variations in mobility and connectivity. in addition to its capability to run in deterministic or stochastic mode, the open source implementation allows fine-grained control of disease parameters across space and time. although the model has a more generic force of infection mode of operation (where patches can be more general than spatial regions), we will mainly summarize the results from the mobility model, which was used for covid- response. what did the models suggest? gleam model is being used in a number of covid- -related studies and analysis. in , the northeastern university team used the model to understand the spread of covid- within china and relative risk of importation of the disease internationally. their analysis suggested that the spread of covid- out of wuhan into other parts of mainland china was not contained well due to the delays induced by detection and official reporting. it is hard to interpret the results. the paper suggested that international importation could be contained substantially by strong travel ban. while it might have delayed the onset of cases, the subsequent spread across the world suggest that we were not able to arrest the spread effectively. the model is also used to provide weekly projections (see https ://covid .gleam proje ct.org/); this site does not appear to be maintained for the most current forecasts (likely because the team is participating in the cdc forecasting group). the patchsim model is being used to support federal agencies as well as the state of virginia. due to our past experience, we have refrained from providing longer term forecasts, instead of focusing on short-term projections. the model is used within a forecasting via projection selection approach, where a set of counterfactual scenarios are generated based on on-the-ground response efforts and surveillance data, and the best fits are selected based on historical performance. while allowing for future scenarios to be described, they also help to provide a reasonable narrative of past trajectories, and retrospective comparisons are used for metrics such as 'cases averted by doing x' . these projections are revised weekly based on stakeholder feedback and surveillance update. further discussion of how the model is used by the virginia department of health each week can be found at https ://www.vdh.virgi nia.gov/coron aviru s/covid - -data-insig hts/#model . strength and limitations. structured metapopulation models provide a good tradeoff between the realism/compute of detailed agentbased models and simplicity/speed of mass action compartmental models and need far fewer inputs for modeling, and scalability. 
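Before continuing with the strengths and limitations, here is a minimal sketch of the patch-level bookkeeping described above; the mobility matrix, rate values, and the specific force-of-infection formula are illustrative assumptions rather than the exact GLEAM or PatchSim equations.

```python
# Sketch of a discrete-time patch (metapopulation) SEIR step. The mobility
# matrix, rates, and force-of-infection formula below are illustrative
# assumptions, not the exact equations used by GLEAM or PatchSim.
import numpy as np

beta, sigma, gamma, dt = 0.4, 1 / 4.0, 1 / 6.0, 1.0

# Three patches; M[i, j] = fraction of a day residents of patch i spend in patch j.
M = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.02, 0.08, 0.90]])
N = np.array([1e6, 5e5, 2e5])

S, E, I, R = N.copy(), np.zeros(3), np.array([10.0, 0.0, 0.0]), np.zeros(3)
S -= I  # remove seed infections from the susceptible pool

def step(S, E, I, R):
    # Effective infectious presence and effective population in each patch j.
    I_eff = M.T @ I
    N_eff = M.T @ N
    # Force of infection experienced by residents of each patch i.
    lam = beta * (M @ (I_eff / N_eff))
    new_E = lam * S * dt
    new_I = sigma * E * dt
    new_R = gamma * I * dt
    return S - new_E, E + new_E - new_I, I + new_I - new_R, R + new_R

for day in range(120):
    S, E, I, R = step(S, E, I, R)

print("infections so far per patch:", np.round(N - S))
```

Reduced mobility or mixing interventions enter such a model simply by rescaling entries of the mobility matrix or the transmission rate over time, which is why these models are attractive early in an outbreak.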
this is especially true in the early days of the outbreak, when the disease dynamics are driven to a large extent by mobility, which can be captured more easily within such models, and there is significant uncertainty in the disease model parameters. however, once the outbreak has spread, it is harder to model detailed interventions (e.g., social distancing), which are much more localized. further, these are hard to model using a single parameter. both gleam and patchsim models also faced their share of challenges in projecting case counts due to rapidly evolving pandemic, inadequate testing, a lack of understanding of the number of asymptomatic cases and assessing the compliance levels of the population at large. researchers (swedish models) sweden was an outlier amongst countries in that it decided to implement public health interventions without a lockdown. schools and universities were not closed, and restaurants and bars remained open. swedish citizens implemented "work from home" policies where possible. moderate social distancing based on individual responsibility and without police enforcement was employed but emphasis was attempted to be placed on shielding the + age group. background. statistician tom britton developed a very simple model with a focus on predicting the number of infected over time in stockholm. model structure. britton used a very simple sir general epidemic model. it is used to make a coarse grain prediction of the behavior of the outbreak based on knowing the basic reproduction number r and the doubling time d in the initial phase of the epidemic. calibration to calendar time was done using the observed number of case fatalities, together with estimates of the time between infection to death, and the infection fatality risk. predictions were made assuming no change of behavior, as well as for the situation where preventive measures are put in place at one specific time-point. model predictions. one of the controversial predictions from this model was that the number of infections in the stockholm area would quickly rise towards attaining herd immunity within a short period. however, mass testing carried out in stockholm during june indicated a far smaller percentage of infections. strength and limitations. britton's model was intended as a quick and simple method to estimate and predict an on-going epidemic outbreak both with and without preventive measures put in place. it was intended as a complement to more realistic and detailed modeling. the estimation-prediction methodology is much simpler and straight-forward to implement for this simple model. it is more transparent to see how the few model assumptions affect the results, and it is easy to vary the few parameters to see their effect on predictions so that one could see which parameter uncertainties have biggest impact on predictions, and which parameter uncertainties are less influential. model background. the public health authority (fhm) of sweden produced a model to study the spread of covid- in four regions in sweden: dalarna, skåne, stockholm, and västra götaland. . model structure. it is a standard compartmentalized seir model and within each compartment, it is homogeneous; so, individuals are assumed to have the same characteristics and act in the same way. data used in the fitting of the model include point prevalences found by pcrtesting in stockholm at two different time points. model predictions. 
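Britton's approach, as described above, works backwards from an assumed basic reproduction number R0 and the observed initial doubling time to the SIR parameters and a projected final size. The small sketch below shows that calculation; the input values are placeholders and the relations used (r = β - γ, final size z = 1 - exp(-R0 z)) are the standard SIR ones, not anything specific to the Stockholm analysis.

```python
# Sketch: recover SIR parameters from R0 and the initial doubling time, and
# compute the classic final-size estimate. Input values are placeholders; the
# relations used (r = beta - gamma, z = 1 - exp(-R0*z)) are standard SIR results.
import math

R0 = 2.5             # assumed basic reproduction number
doubling_days = 3.5  # assumed early-phase doubling time

r = math.log(2) / doubling_days   # exponential growth rate in the initial phase
gamma = r / (R0 - 1)              # since r = beta - gamma and R0 = beta / gamma
beta = R0 * gamma

# Final size: the fraction z ultimately infected solves z = 1 - exp(-R0 * z).
z = 0.5
for _ in range(100):
    z = 1.0 - math.exp(-R0 * z)

herd_immunity_threshold = 1.0 - 1.0 / R0

print(f"beta = {beta:.3f}/day, gamma = {gamma:.3f}/day "
      f"(infectious period ~ {1/gamma:.1f} days)")
print(f"final size ~ {z:.0%}, herd immunity threshold ~ {herd_immunity_threshold:.0%}")
```

The appeal of this style of model is exactly what the sketch shows: only two inputs are needed, and the sensitivity of the predictions to each can be checked by rerunning the calculation.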
the model estimated the number of infected individuals at different time points and the date with the largest number of infectious individuals. it predicted that by july , . % ( . - . %) of the population in dalarna will have been infected, % ( . - . %) of the population in skåne will have been infected, % ( . - . %) of the population in stockholm will have been infected, and % ( . - . %) of the population in västra götaland will have been infected. it was hard to test these predictions because of the great uncertainty in immune response to sars-cov- -prevalence of antibodies was surprisingly low but recent studies show that mild cases never seem to develop antibodies against sars-cov- , but only t-cellmediated immunity . the model also investigated the effect of increased contacts during the summer that stabilizes in autumn. it found that if the contacts in stockholm and dalarna increase by less than % in comparison to the contact rate in the beginning of june, the second wave will not exceed the observed first wave. strength and limitations. the simplicity of the model is a strength in ease of calibration and understanding but it is also a major limitation in view of the well-known characteristics of covid- : since it is primarily transmitted through droplet infection, the social contact structure in the population is of primary importance for the dynamics of infection. the compartmental model used in this analysis does not account for variation in contacts, where few individuals may have many contacts, while the majority have fewer. the model is also not age stratified, but covid- strikingly affects different age groups differently; e.g., young people seem to get milder infections. in this model, each infected individual has the same infectivity and the same risk of becoming a reported case, regardless of age. different age groups normally have varied degrees of contacts and have changed their behavior differently during the covid- pandemic. this is not captured in the model. rocklöv developed a model to estimate the impact of covid- on the swedish population at the municipality level, considering demography and human mobility under various scenarios of mitigation and suppression. they attempted to estimate the time course of infections, health care needs, and the mortality in relation to the swedish icu capacity, as well as the costs of care, and compared alternative policies and counterfactual scenarios. model structure. used a seir compartmentalized model with age structured compartments ( - , - , +) susceptibles, infected, in-patient care, icu and recovered populations based on swedish population data at the municipal level. it also incorporated inter-municipality travel using a radiation model. parameters were calibrated based on a combination of values available from international literature and fitting to available outbreak data. the effect of a number of different intervention strategies was considered ranging from no intervention to modest social distancing and finally to imposed isolation of various groups. model predictions. the model predicted an estimated death toll of around , for the strategies based only on social distancing and between and for policies imposing stricter isolation. it predicted icu cases of up to , without much intervention and up to with modest social distancing, way above the available capacity of about icu beds. strength and limitations. 
The model showed a good fit against the reported COVID-19-related deaths in Sweden up to April; however, the predictions of total deaths and ICU demand turned out to be way off the mark. Background. Finally, an individual-based model parameterized on Swedish demographics was used to assess the anticipated spread of COVID-19. Model structure. The study employed an individual agent-based model based on the work by Ferguson et al. Individuals are randomly assigned an age based on Swedish demographic data, and they are also assigned a household. Household size is normally distributed around the average household size in Sweden. Households were placed on a lattice using high-resolution population data from LandScan and census data from Statistics Sweden, and each household is additionally allocated to a city based on the closest city center by distance and to a county based on city designation. Each individual is placed in a school or workplace at a rate similar to the current participation in Sweden. Transmission between individuals occurs through contact at each individual's workplace or school, within their household, and in their communities. Infectiousness is thus a property dependent on contacts with household members, school/workplace members and community members, with a probability based on household distances. Transmissibility was calibrated against data for the period March-April to reproduce either the doubling time reported using pan-European data or the growth in reported Swedish deaths for that period. Various types of interventions were studied, including the policy implemented in Sweden by the public health authorities as well as more aggressive interventions approaching full lockdown. Model predictions. Their prediction was that "under conservative epidemiological parameter estimates, the current Swedish public-health strategy will result in a peak intensive-care load in May that exceeds pre-pandemic capacity by over -fold, with a median mortality of , ( % CI , to , )". Strengths and limitations. This model was based on adapting the well-known Imperial model discussed earlier to Sweden and considered a wide range of intervention strategies. Unfortunately, the predictions of the model were woefully off the mark on both counts: deaths by June were well below the predicted level, and at the peak the ICU infrastructure still had substantial unutilized capacity. Forecasting models. Forecasting is of particular interest to policy makers as they attempt to provide actual counts. Since the surveillance systems have relatively stabilized in recent weeks, the development of forecasting models has gained traction and several models are available in the literature. In the US, the Centers for Disease Control and Prevention (CDC) has provided a platform for modelers to share their forecasts, which are analyzed and combined in a suitable manner to produce ensemble multi-week forecasts for cumulative/incident deaths, hospitalizations and, more recently, cases at the national, state, and county level. Probabilistic forecasts are provided by a large and growing number of teams, and the CDC, together with collaborating groups, has developed a uniform ensemble model for multi-step forecasts.
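A common way to build the kind of uniform ensemble mentioned above is to combine the quantiles submitted by the individual models, for example by taking their mean or median at each quantile level; the discussion of ensemble models continues below. The sketch shows only that combination step: the three "models" and their quantiles are made-up numbers, and the forecast hub's actual eligibility and weighting rules are not reproduced here.

```python
# Sketch: combine probabilistic forecasts from several models by taking the
# median (or mean) of their submitted quantiles. Model names and numbers are
# made up; real hub ensembles add eligibility and weighting rules on top.
import numpy as np

quantile_levels = [0.025, 0.25, 0.5, 0.75, 0.975]

# Each model submits predictive quantiles for, say, next week's incident deaths.
submissions = {
    "model_A": [120, 180, 220, 270, 350],
    "model_B": [100, 160, 210, 260, 340],
    "model_C": [150, 200, 240, 300, 420],
}

def ensemble(submissions, how="median"):
    """Combine per-model quantiles into one ensemble quantile forecast."""
    stacked = np.array(list(submissions.values()), dtype=float)
    combined = np.median(stacked, axis=0) if how == "median" else stacked.mean(axis=0)
    # Enforce monotonicity across quantile levels (should already hold here).
    return np.maximum.accumulate(combined)

for q, value in zip(quantile_levels, ensemble(submissions)):
    print(f"q{q:>5}: {value:.0f}")
```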
in the context of covid- case count modeling and forecasting, a multitude of models have been developed based on different assumptions that capture specific aspects of the disease dynamics (reproduction number evolution, contact network construction, etc.). the models employed in the cdc forecast hub can be broadly classified into three categories, data-driven, hybrid models, and mechanistic models with some of the models being open source. data-driven models. they do not model the disease dynamics but attempt to find patterns in the available data and combine them appropriately to make short-term forecasts. in such data-driven models, it is hard to incorporate interventions directly; hence, the machine is presented with a variety of exogenous data sources such as mobility data, hospital records, etc. with the hope that its effects are captured implicitly. early iterations of institute of health metrics and evaluation (ihme) model for death forecasting at state level employed a statistical model that fits a time-varying gaussian error function to the cumulative death counts and is parameterized to control for maximum death rate, maximum death rate epoch, and growth parameter (with many parameters learnt using data from outbreak in china). the ihme models are undergoing revisions (moving towards the hybrid models) and updated implementable versions are available at . the university of texas at austin covid- modeling consortium model uses a very similar statistical model as but employs real-time mobility data as additional predictors and also differ in the fitting process. the carnegie mellon delphi group employs the well known auto-regressive (ar) model that employs lagged version of the case counts and deaths as predictors and determines a sparse set that best describes the observations from it by using j. indian inst. sci. | vol xxx:x | xxx-xxx | journal.iisc.ernet.in lasso regression . is a deep learning model which has been developed along the lines of and attempts to learn the dependence between death rate and other available syndromic, demographic, mobility and clinical data. hybrid models. these methods typically employ statistical techniques to model disease parameters which are then used in epidemiological models to forecast cases. most statistical models , are evolving to become hybrid models. a model that gained significant interest is the youyang gu (yyg) model and uses a machine learning layer over an seir model to learn the set of parameters (mortality rate, initial r , postlockdown r) specific to a region that best fits the region's observed data. the authors (yyg) share the optimal parameters, the seir model and the evaluation scripts with general public for experimentation . los alamos national lab (lanl) model uses a statistical model to determine how the number of covid- infections changes over time. the second process maps the number of infections to the reported data. the number of deaths is a fraction of the number of new cases obtained and is computed using the observed mortality data. mechanistic models. gleam and jhu models are county-level stochastic seir model dynamics. the jhu model incorporates the effectiveness of state-wide intervention policies on social distancing through the r parameter. more recently, model outputs from uva's patchsim model were included as part of a multi-model ensemble (including autoregressive and lstm components) to forecast weekly confirmed cases. types we end the discussion of the models above by qualitatively comparing model types. 
as discussed in the preliminaries, at one end of the spectrum are models that are largely data driven: these models range from simple statistical models (various forms of regression models) to the more complicated deep learning models. the difference in such model lies in the amount of training data needed, the computational resources needed and how complicated the mathematical function one is trying to fit to the observed data. these models are strictly data driven and, hence, unable to capture the constant behavioral adaptation at an individual and collective level. on the other end of the spectrum seir, meta-population and agent-based network models are based on the underlying procedural representation of the dynamics-in theory, they are able to represent behavioral adaptation endogenously. but both class of models face immense challenges due to the availability of data as discussed below. ( ) agent-based and seir models were used in all the three countries in the early part of the outbreak and continue to be used for counter-factual analysis. the primary reason is the lack of surveillance and disease specific data and hence, purely data-driven models were not easy to use. seir models lacked heterogeneity but were simple to program and analyze. agent-based models were more computationally intensive, required a fair bit of data to instantiate the model but captured the heterogeneity of the underlying countries. by now it has become clear that use of such models for long term forecasting is challenging and likely to lead to mis-leading results. the fundamental reason is adaptive human behavior and lack of data about it. ( ) forecasting, on the other hand, has seen use of data-driven methods as well as causal methods. short-term forecasts have been generally reasonable. given the intense interest in the pandemic, a lot of data are also becoming available for researchers to use. this helps in validating some of the models further. even so, realtime data on behavioral adaptation and compliance remain very hard to get and is one of the central modeling challenges. were some of the models wrong? in a recent opinion piece, professor vikram patel of the harvard school of public health makes a stinging criticism of modeling: crowning these scientific disciplines is the field of modeling, for it was its estimates of mountains of dead bodies which fuelled the panic and led to the unprecedented restrictions on public life around the world. none of these early models, however, explicitly acknowledged the huge assumptions that were made, a similar article in ny times recounted the mistakes in covid- response in europe ; also see . our point of view. it is indeed important to ensure that assumptions underlying mathematical models be made transparent and explicit. but we respectfully disagree with professor patel's statement: most of the good models tried to be very explicit about their assumptions. the mountains of deaths that are being referred to are explicitly calculated when no interventions are put in place and are often used as a worst case scenario. now, one might argue that the authors be explicit and state that this worst case scenario will never occur in practice. forecasting dynamics in social systems is inherently challenging: individual behavior, predictions and epidemic dynamics co-evolve; this coevolution immediately implies that a dire prediction can lead to extreme change in individual and collective behavior leading to reduction in the incidence numbers. 
would one say forecasts were wrong in such a case or they were influential in ensuring the worst case never happens? none of this implies that one should not explicitly state the assumption underlying their model. of course our experience is that policy makers, news reporters and common public are looking exactly for such a forecastwe have been constantly asked "when will peak occur" or "how many people are likely to die". a few possible ways to overcome this tension between the unsatiable appetite for forecasts and the inherent challenges that lie in doing this accurately, include: • we believe that, in general, it might not be prudent to provide long term forecasts for such systems. • state the assumptions underlying the models as clearly as possible. modelers need to be much more disciplined about this. they also need to ensure that the models are transparent and can be reviewed broadly (and expeditiously). • accept that the forecasts are provisional and that they will be revised as new data comes in, society adapts, the virus adapts and we understand the biological impact of the pandemic. • improve surveillance systems that would produce data that the models can use more effectively. even with data, it is very hard to estimate the prevalence of covid- in society. communicating scientific findings and risks is an important topical area in this context, see , , , . use of models for evidence-based policy making. in a new book, , radical uncertainty, economists john kay and mervyn king (formerly governor of the bank of england) urge caution when using complex models. they argue that models should be valued for the insights they provide but not relied upon to provide accurate forecasts. the so-called "evidence-based policy" comes in for criticism where it relies on models but also supplies a false sense of certainty where none exists, or seeks out the evidence that is desired ex ante-or "cover"--to justify a policy decision. "evidence-based policy has become policy-based evidence". our point of view. the authors make a good point here. but again, everyone, from public to citizens and reporters clamor for a forecast. we argue that this can be addressed in two ways:(i) viewing the problem from the lens of control theory so that we forecast only to control the deviation from the path we want to follow and (ii) not insisting on exact numbers but general trends. as kay and king opine, the value of models, especially in the face of radical uncertainty, is more in exploring alternative scenarios resulting from different policies: a model is useful only if the person using it understands that it does not represent the "the world as it really is" but is a tool for exploring ways in which a decision might or might not go wrong. in his new book the rules of contagion, adam kucharski draws on lessons from the past. in and , during the zika outbreak, researchers planned large-scale clinical studies and vaccine trials. but these were discontinued as soon as the infection ebbed. this is a common frustration in outbreak research; by the time, the infections end, fundamental questions about the contagion can remain unanswered. that is why building long-term research capacity is essential. our point of view. the author makes an important point. we hope that today, after witnessing the devastating impacts of the pandemic on the economy and society, the correct lessons will be learnt: sustained investments need to be made in the field to be ready for the impact of the next pandemic. 
the paper discusses a few important computational models developed by researchers in the us, uk and sweden for covid- pandemic planning and response. the models have been used by policy makers and public health officials in their respective countries to assess the evolution of the pandemic, design and analyze control measures, and study various what-if scenarios. as noted, all models faced challenges due to the availability of data, a rapidly evolving pandemic, and unprecedented control measures put in place. despite these challenges, we believe that mathematical models can provide useful and timely information to policy makers. on the one hand, the modelers need to be transparent in the description of their models, clearly state the limitations, and carry out detailed sensitivity and uncertainty quantification. having these models reviewed independently is certainly very helpful. on the other hand, policy makers should be aware of the fact that mathematical models for pandemic planning and forecasting rely on a number of assumptions and often lack the data needed to overcome these assumptions.
the authors would like to thank members of the biocomplexity covid- response team and network systems science and advanced computing (nssac) division for their thoughtful comments and suggestions related to epidemic modeling and response support. we thank members of the biocomplexity institute and initiative, university of virginia for useful discussion and suggestions. this work was partially supported.
research associate at the nssac division of the biocomplexity institute and initiative. he completed his phd from the department of electrical engineering, indian institute of science (iisc), bangalore, india and has held the position of postdoctoral fellow at iisc and north carolina state university, raleigh, usa. his research areas include signal processing, machine learning, data mining, forecasting, and big data analysis. at nssac, his primary focus has been the analysis and development of forecasting systems for epidemiological signals such as influenza-like illness and covid- using auxiliary data sources. bryan lewis is a research associate professor in the network systems science and advanced computing division.
his research has focused on understanding the transmission dynamics of infectious diseases within specific populations through both analysis and simulation. lewis is a computational epidemiologist with more than years of experience in crafting, analyzing, and interpreting the results of models in the context of real public health problems. as a computational epidemiologist, for more than a decade, lewis has been heavily involved in a series of projects forecasting the spread of infectious disease as well as evaluating the response to them in support of the federal government. these projects have tackled diseases from ebola to pandemic influenza and melioidosis to cholera.
professor in biocomplexity, the division director of the networks, simulation science and advanced computing (nssac) division at the biocomplexity institute and initiative, and a professor in the department of computer science at the university of virginia (uva). his research interests are in network science, computational epidemiology, ai, foundations of computing, socially coupled system science and high-performance computing. before joining uva, he held positions at virginia tech and the los alamos national laboratory. he is a fellow of the ieee, acm, siam and aaas.
scientist at the biocomplexity institute & initiative, university of virginia, whose research focuses on developing, analyzing and optimizing computational models in the field of network epidemiology. he received his phd from the department of electrical and communication engineering, indian institute of science (iisc), and did his postdoctoral research at virginia tech. his areas of interest include network science, stochastic modeling and big data analytics. he has used in-silico models of society to study the spread of infectious diseases and invasive species. recent research includes modeling and forecasting emerging infectious disease outbreaks (e.g., ebola, covid- ), the impact of human mobility on disease spread, and resource allocation problems.

key: cord- -vtt wvm authors: keogh, john g.; rejeb, abderahman; khan, nida; dean, kevin; hand, karen j. title: optimizing global food supply chains: the case for blockchain and gs standards date: - - journal: building the future of food safety technology doi: . /b - - - - . - sha: doc_id: cord_uid: vtt wvm

this chapter examines the integration of gs standards with the functional components of blockchain technology as an approach to realize a coherent standardized framework of industry-based tools for successful food supply chain (fsc) transformation. the globalization of food systems has engendered significant changes to the operation and structure of fscs. alongside increasing consumer demands for safe and sustainable food products, fscs are challenged with issues related to information transparency and consumer trust. uncertainty in matters of transparency and trust arises from the growing information asymmetry between food producers and food consumers, in particular, how and where food is cultivated, harvested, processed, and under what conditions. fscs are tasked with guaranteeing the highest standards in food quality and food safety: ensuring the use of safe and authentic ingredients, limiting product perishability, and mitigating the risk of opportunism, such as quality cheating or falsification of information. a sustainable, food-secure world will require multidirectional sharing of information and enhanced information symmetry between food producers and food consumers.
the need for information symmetry will drive transformational changes in fsc methods of practice and will require a coherent standardized framework of best practice recommendations to manage logistic units in the food chain. a standardized framework will enhance food traceability, drive fsc efficiencies, enable data interoperability, improve data governance practices, and set supply chain identification standards for products and assets (what), exchange parties (who), locations (where), business processes (why), and sequence (when). the globalization of food supply chains (fscs) has added significant complexity to food systems and created an information asymmetry between food producers and food consumers. as a result, a growing demand exists for greater transparency into food origins, methods of cultivation, harvesting, and production as well as labor conditions and environmental impact (autio et al., ; bildtgård, ; donnelly, thakur, & sakai, ). moreover, the international debate on the integrity of fscs has intensified due to recurring incidents and crises across the five pillars of the food system (earlier referred to as the five consumer reputations): food quality, food safety, food authenticity, food defense, and food security (fig. . ). food-related incidents across all five pillars have been amplified through social media platforms (new, ), creating consumer distrust. according to the most recent edelman trust barometer (etb), trust in the food and beverage industry has declined by two points since (global report: edelman trust barometer). this decrease is significant as the trust construct in the etb encompasses both competence and ethics. importantly, edelman argued that ethics (e.g., comprised of integrity, dependability, and purpose) is "three times more important to company trust than competence" (global report: edelman trust barometer). a crucial argument is made in the etb, suggesting that while business is considered competent, only nongovernment organizations (ngos) are considered ethical. this claim may have a profound impact on fscs and strongly suggests that, in order to regain citizen-consumer trust, food businesses must be open to feedback and criticisms from ngos. this point should be of particular importance given the vulnerabilities arising from their operations and management (voss, closs, calantone, helferich, & speier, ). an fsc is a network of highly interconnected stakeholders working together to ensure the delivery of safe food products (schiefer & deiters, ). fsc actors commit to implementing a set of processes and activities that help take the food from its raw material state to the finished product (dani, ). ensuring the delivery of safe food products is of the utmost priority and a primary building block for a healthy and vibrant society. over the years, fscs have witnessed several structural changes and a shift toward the development of more unified, integrated, and coherent relationships between stakeholders (bourlakis & weightman, ). as such, an fsc has become a "chain of trust" that extends from suppliers, producers, distributors, wholesalers, retailers, and consumers (choi & hong, ; johnston, mccutcheon, stuart, & kerwood, ). although fscs represent a metaphorical "chain of trust," the trustworthiness of the fsc is as fundamental to the integrity of our food systems as food traceability and transparency and is not without its own unique set of challenges.
fscs are vulnerable to natural disasters, malpractices, and exploitative behavior, leading to food security concerns, reputational damage, and significant financial losses. due to the inherent complexities of global fscs, it is almost impossible for stakeholders to police the entire flow of materials and products and identify all possible externalities. recurring disruptions (e.g., natural disasters, avian flu, swine fever) and consecutive food scandals have increased the sense of urgency in the management of fscs (zhong, xu, & wang, ) and negatively impacted consumer trust. the european "horsemeat scandal" in exemplified these vulnerabilities (yamoah & yawson, ), and legal scholars from cambridge university posited that the ability of the eu's regulatory regime to prevent fraud on such a scale was shown to be inadequate; eu food law, with its (over)emphasis on food safety, failed to prevent the occurrence of fraud and may even have played an (unintentional) role in facilitating or enhancing it. the cambridge scholars further argued that the free movement of goods within the european union created a sense of "blind trust" in the regulatory framework, which proved to be inadequate to protect businesses and consumers from unscrupulous actors. while natural disasters and political strife are outside of the control of fsc stakeholders, to preserve food quality and food safety and minimize the risk of food fraud or malicious attacks, fsc stakeholders need to establish and agree on foundational methods for analytical science, supply chain standards, technology tools, and food safety standards. the redesign of the fsc is necessary in order to ensure unquestionable integrity in a resilient food ecosystem. this proposal would require a foundational approach to data management and data governance to ensure sources of accurate and trusted data to enable inventory management, order management, traceability, unsafe product recall, and measures to protect against food fraud. failure to do so will result in continued consumer distrust and economic loss. notably, a report by gs uk et al. ( ) found that eighty percent of united kingdom retailers had inconsistent product data, estimated to cost ukp million in profit erosion over years and a further ukp million in lost sales. the recent emergence of blockchain technology has created significant interest among scholars and practitioners in numerous disciplines. initially, blockchain technology was heralded as a radical innovation laden with a strong appeal to the financial sector, particularly in the use of cryptocurrencies (nakamoto, , p. ). the speculation on the true identity of the pseudonymous "satoshi nakamoto" gave rise to suspicion about the actual creators of bitcoin and their motives (lemieux, ). moreover, halaburda ( ) argued that there is a lack of consensus on the benefits of blockchain and, importantly, how it may fail. further, rejeb, sule, & keogh ( , p. ) argued, "ultimately, a blockchain can be viewed as a configuration of multiple technologies, tools and methods that address a particular problem." beyond the sphere of finance, blockchain technology is considered a foundational paradigm (iansiti & lakhani, ) with the potential for significant societal benefits and improved trust between fsc actors.
blockchain technology offers several capabilities and functionalities that can significantly reshape existing practices of managing fscs and partnerships, regardless of location, and also offers opportunities to improve efficiency, transparency, trust, and security, across a broad spectrum of business and social transactions (frizzo-barker et al., ) . the technological attributes of blockchain can combine with smart contracts to enable decentralized and self-organization to create, execute, and manage business transactions (schaffers, ) , creating a landscape for innovative approaches to information and collaborative systems. innovations are not only merely a simple composition of technical changes in processes and procedures but also include new forms of social and organizational arrangements (callon, law, & rip, ) . the ubiquitous product bar code stands out as a significant innovation that has transformed business and society. since the decision by us industry (gs , ) to adopt the linear bar code on april , , and the first scan of a -pack of wrigley's juicy fruit chewing gum in marsh's supermarket in troy, ohio, on june , (gs , , the bar code is scanned an estimated billion times daily. gs is a not-for-profit organization tasked with managing industry-driven data and information standards (note, gs is not an acronym). the gs system of interoperable standards assigns and manages globally unique identification of firms, their locations, their products, and assets. they rely on several technology-enabled functions for data capture, data exchange, and data synchronization among fsc exchange partners. in fscs, there is a growing need for interoperability standards to facilitate business-to-business integration. the adoption of gs standards-enabled blockchain technology has the potential to enable fsc stakeholders to meet the fast-changing needs of the agri-food industry and the evolving regulatory requirements for enhanced traceability and rapid recall of unsafe goods. although there is a growing body of evidence concerning the benefits of blockchain technology and its potential to align with gs standards for data and information (fosso wamba et al., ; kamath, ; lacity, ) , the need remains for an extensive examination of the full potentials and limitations. the authors of this section, therefore, reviewed relevant academic literature to examine the full potential of blockchain-enabled gs systems comprehensively, and therefore provide a significant contribution to the academic and practitioner literature. the diversity of blockchain research in the food context is fragmented, and the potentials and limitations in combination with gs standards remain vaguely conceptualized. it is vitally essential to narrow this research gap. this review will begin with an outline of the methodology applied to collect academic contributions to blockchain and gs standards within a fsc context, followed by an in-depth analysis of the findings, concluding with potential areas for future research. in order to explore the full potential of a system integrating blockchain functionalities and gs standards, a systematic review method based on tranfield, denyer, & smart ( ) guidelines was undertaken. the systematic review was considered as a suitable method to locate, analyze, and synthesize peer-reviewed publications. research on blockchain technology is broad and across disciplines; however, a paucity of research specific to food chains exists (fosso wamba et al., ) . 
similarly, existing research on blockchain technology and gs standards is a patchwork of studies with no coherent or systematic body of knowledge. therefore, the objective of this study was to draw on existing studies and leverage their findings using content analysis to extract insights and provide a deeper understanding of the opportunities for a gs standards-enabled blockchain as an fsc management framework. as stated earlier, the literature on blockchain technology and gs is neither welldeveloped nor conclusive, yet necessary to ensure successful future implementations. in order to facilitate the process of literature collection, a review protocol based on the "preferred reporting items for systematic reviews and meta-analyzes" (prisma) was used (liberati et al., ) . the prisma approach consists of four processes: the use of various sources to locate previous studies, the fast screening of studies and removal of duplicates, the evaluation of studies for relevance and suitability, and the final analysis of relevant publications. fig. . illustrates the prisma process. to ensure unbiased results, this phase of the study was completed by researchers with no previous knowledge or association with gs . conducting the review began with a search for studies on blockchain technology and gs standards. reviewed publications originated from academic sources (peer-reviewed) and included journal articles, conference papers, and book chapters. due to the nascent and limited literature on blockchain technology and gs standards, we supplemented our analysis with other sources of information, including conference proceedings, gray sources, and reports. the survey of the literature was conducted using four major scientific databases: scopus, web of science, sciencedirect, and google scholar. we used a combination of keywords that consisted of the following search string: "blockchain*" and "gs " and ("food chain*" or "food supply*" or agriculture or agro. the google scholar search engine has limited functionality and allows only the full-text search field; therefore, only one search query "blockchain* and gs and food" was used for the retrieval of relevant studies. the titles and abstracts of publications were scanned to obtain a general overview of the study content and to assess the relevance of the material. as shown in fig. . , a total of publications were found. many of the publications were redundant due to the comprehensive coverage of google scholar; studies focused on blockchain technology outside the context of food were removed. a fine-tuned selection of the publications was undertaken to ensure relevance to fscs. table . contains a summary of the findings based on content analysis. the final documents were classified, evaluated, and found to be sufficient in narrative detail to provide an overview of publications to date, specifically related to blockchain technology and gs standards. the loss of trust in the conventional banking system following the global financial crisis laid the groundwork for the introduction of an alternative monetary system based on a novel digital currency and distributed ledger (richter, kraus, & bouncken, ) . "satoshi nakamoto" (a pseudonym for an unknown person, group of people, organization, or other public or private body) introduced an electronic peer-to-peer cash system called bitcoin (nakamoto, , p. ). the proposed system allowed for payments in bitcoin currency, securely and without the intermediation of a trusted third party (ttp) such as a bank. 
the bitcoin protocol utilizes a blockchain, which provides an ingenious and innovative solution to the double-spending problem (i.e., where digital currency or a token is spent more than once), eliminating the need for a ttp intervention to validate the transactions. moreover, lacity ( , p. ) argued "while ttps provide important functions, they have some serious limitations, like high transaction fees, slow settlement times, low transaction transparency, multiple versions of the truth and security vulnerabilities." the technology behind the bitcoin application is known as a blockchain. the bitcoin blockchain is a distributed database (or distributed ledger) implemented on public, untrusted networks (kano & nakajima, ) with a cryptographic signature (hash) that is resistant to falsification through repeated hashing and a consensus algorithm (sylim, liu, marcelo, & fontelo, ). blockchain technology is engineered in a way that parties previously unknown to each other can jointly generate and maintain a database of records (information) and can correct and complete transactions, which are fully distributed across several nodes (i.e., computers) and validated using the consensus of independent verifiers (tijan, aksentijevic, ivanic, & jardas, ). blockchain is categorized under the distributed ledger technology family and is characterized by a peer-to-peer network and a decentralized distributed database, as depicted in fig. . (a diagrammatic representation of blockchain technology). according to lemieux ( ), the nodes within a blockchain work collectively as one system to store encrypted sequences of transactional records as a single chained unit or block. nodes in a blockchain network can either be validator nodes (miners in ethereum and bitcoin) that participate in the consensus mechanism or nonvalidator nodes (referred to only as nodes). when any node wants to add a transaction to the ledger, the transaction of interest is broadcast to all nodes in the peer-to-peer network. transactions are then collected into a block, where the addition to the blockchain necessitates a consensus mechanism. validators compete to have their local block be the next addition to the blockchain. the way blocks are constructed and propagated in the system enables the traceback of the whole chain of valid network activities back to the genesis block initiated in the blockchain. furthermore, the consensus methodology employed by the underlying blockchain platform designates the validator whose block gets added to the blockchain, with the others remaining in the queue and participating in the next round of consensus. the validator node gains an incentive for updating the blockchain database (nakamoto, , p. ). the blockchain may impose restrictions on reading the data and the flexibility to become a validator to write to the blockchain, depending upon whether the blockchain is permissioned or permission-less. a consensus algorithm enables secure updating of the blockchain data, which is governed by a set of rules specific to the blockchain platform. this right to update the blockchain data is distributed among the economic set (buterin, b), a group of users who can update the blockchain based on a set of rules. the economic set is intended to be decentralized, with no collusion within the set (a group of users) in order to form a majority, even though they might have a large amount of capital and financial incentives.
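the hash-linking just described can be illustrated with a minimal sketch. the transaction payloads (party names, lot numbers) are invented for illustration, and real platforms add digital signatures, merkle trees, and a consensus protocol on top of this basic structure.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    # Hash the block's contents, including the previous block's hash,
    # so every block commits to the entire history before it.
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    previous = chain[-1]
    chain.append({
        "index": previous["index"] + 1,
        "previous_hash": block_hash(previous),
        "transactions": transactions,
    })

def chain_is_valid(chain: list) -> bool:
    # Recompute each link; any retroactive edit breaks every later link.
    return all(chain[i]["previous_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

genesis = {"index": 0, "previous_hash": "0" * 64, "transactions": []}
chain = [genesis]
append_block(chain, [{"from": "farm_a", "to": "packer_b", "lot": "LOT42"}])   # hypothetical
append_block(chain, [{"from": "packer_b", "to": "retailer_c", "lot": "LOT42"}])

print(chain_is_valid(chain))                      # True
chain[1]["transactions"][0]["lot"] = "LOT99"      # tamper with recorded history
print(chain_is_valid(chain))                      # False: downstream links no longer match
```

the point of the sketch is simply that immutability comes from each block committing to its predecessor's hash, which is why a full chain can always be traced back to the genesis block.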
the blockchain platforms that have emerged employ one of the following decentralized economic sets; however, each example might utilize a different set of consensus algorithms: owners of computing power: this set employs proof-of-work (pow) as a consensus algorithm observed in blockchain platforms like bitcoin and ethereum. each block header in the blockchain has a string of random data called a nonce attached to them (nakamoto, , p. ). the miners (validators) need to search for this random string such that when attached to the block, the hash of the block has a certain number of leading zeros and the miner who can find the nonce is designated to add his local block to the blockchain accompanied by the generation of a new cryptocurrency. this process is called mining. mining involves expensive computations leading to (often massive) wastage of computational power and electricity, undesirable from an ecological point of view (o'dwyer & malone, ) , and resulting in a small exclusive set of users for mining. this exclusivity, however, goes against the idea of having a decentralized set leading blockchain platforms to employ other means of arriving at a consensus. stakeholders: this set employs the different variants of the proof-of-stake (pos) consensus mechanism. pos is a more just system than pow, as the computational resources required to accomplish mining or validation can be done through any computer. ethereum pos requires the miner or the validator to lock a certain amount of their coins in the currency of the blockchain platform to verify the block. this locked number of coins is called a stake. computational power is required to verify whether a validator owns a certain percentage of the coins in the available currency or not. there are several proposals for pos, as pos enables an improved decentralized set, takes power out of the hands of a small exclusive group of validators, and distributes the work evenly across the blockchain. in ethereum pos, the probability of mining the block is proportional to the validator's stake (ethhub, ) just as in pow, and it is proportional to the computational hashing power. as long as a validator is mining, the stake owned by him remains locked. a downside of this consensus mechanism is that the richest validators are accorded a higher priority. the mechanism does, however, encourage more community participation than many other methods. other consensus protocols include the traditional byzantine fault tolerance theory (sousa et al., ) , where the economic set needs to be sampled for the total number of nodes. here, the set most commonly used is stakeholders. hence, such protocols can be considered as subcategories of pos. a user's social network: this is used in ripple and stellar consensus protocols. the ripple protocol, for example, requires a node to define a unique node list (unl), which contains a list of other ripple nodes that the defining node is confident would not work against it. a node consults other nodes in its unl to achieve consensus. consensus happens in multiple rounds with a node declaring a set of transactions in a "candidate set," which is sent to other nodes in the unl. nodes in the unl validate the transactions, vote on them, and broadcast the votes. the initiating node then refines the "candidate set" based on the votes received to include the transactions getting the most significant number of votes for the next round. 
this process continues until a "candidate set" receives % votes from all the nodes in the unl, and then it becomes a valid block in the ripple blockchain. blockchain technologies are considered a new type of disruptive internet technology (pan, song, ai, & ming, ) and an essential enabler of large-scale societal and economic changes (swan, ; tapscott & tapscott, ) . the rationale for this argument is due to its complex technical constructs (hughes et al., ) , such as the immutability of transactions, security, confidentiality, consensual mechanisms, and the automation capabilities enabled by smart contracts. the latter is heralded as the most important application of blockchain (the integrity of the code in smart contracts requires quality assurance and rigorous testing). by definition, a smart contract is a computer program that formalizes relationships over computer networks (szabo, (szabo, , . although smart contracts predate bitcoin/blockchain by a decade and do not need a blockchain to function (halaburda, ) , a blockchain-based smart contract is executed on a blockchain with a consensus mechanism determining its correct execution. a wide range of applications can be implemented using smart contracts, including gaming, financial, notary, or computation (bartoletti & pompianu, ) . the use of smart contracts in the fsc industry can help to verify digital documents (e.g., certificates such as organic or halal) as well as determine the provenance (source or origin) of specific data. in a cold chain scenario, rejeb et al. ( ) argued that smart contracts connected to iot devices could help to preserve the quality and safety of goods in transit. for example, temperature tolerances embedded into the smart contract can trigger in-transit alerts and facilitate shipment acceptance or rejection based on preset parameters in the smart contract. the first platform for implementing smart contracts was ethereum (buterin, a, pp. e ) , although most platforms today cater to smart contracts. therefore, similar to the radical transformations brought by the internet to individuals and corporate activities, the emergence of blockchain provides opportunities that can broadly impact supply chain processes (fosso wamba et al., ; queiroz, telles, & bonilla, ) . in order to understand the implications of blockchain technology for food chains, it is essential to realize the potentials of its conjunction with gs standards. while the technology is still in a nascent stage of development and deployment, it is worthwhile to draw attention to the potential alignment of blockchain technology with gs standards as proof of their success, and universal adoption is very likely to prevail in the future. traceability is a multifaceted construct that is crucially important in fscs and has received considerable attention through its application in the iso /bs quality standards (cheng & simmons, ) . scholars have stressed the importance and value of traceability in global fscs (charlier & valceschini, ; roth, tsay, pullman, & gray, ) . broadly, traceability refers to the ability to track the flow of products and their attributes throughout the entire production process steps and supply chain (golan et al., ) . furthermore, olsen and borit ( ) completed a comprehensive review of traceability across academic literature, industry standards, and regulations and argued that the various definitions of traceability are inconsistent and confusing, often with vague or recursive usage of terms such as "trace." 
they provide a comprehensive definition: "the ability to access any or all information relating to that which is under consideration, throughout its entire life cycle, by means of recorded identifications" (olsen and borit, , p. ) . the gs global traceability standard (gs , c: ) aligns with the iso : definition "traceability is the ability to trace the history, application or location of an object [iso : ] . when considering a product or a service, traceability can relate to origin of materials and parts; processing history; distribution and location of the product or service after delivery." traceability is also defined as "part of logistics management that capture, store, and transmit adequate information about a food, feed, food-producing animal or substance at all stages in the food supply chain so that the product can be checked for safety and quality control, traced upward, and tracked downward at any time required" (bosona & gebresenbet, , p. ). in the fsc context, a fundamental goal is to maintain a high level of food traceability to increase consumer trust and confidence in food products and to ensure proper documentation of the food for safety, regulatory, and financial purposes (mahalik & kim, ) . technology has played an increasingly critical role in food traceability over the past two decades (hollands et al., ) . for instance, radio frequency identification (rfid) has been adopted in some fscs to enable nonline-of-sight identification of products to enhance end-to-end food traceability (kelepouris et al., ) . walmart achieved significant efficiency gains by deploying drones in combination with rfid inside a warehouse for inventory control (companik, gravier, & farris, ) . however, technology applications for food traceability are fragmented, often proprietary and noninteroperable, and have enabled trading partners to capture only certain aspects of the fsc. as such, a holistic understanding of how agri-food businesses can better track the flow of food products and related information in extended, globalized fscs is still in a nascent stage of development. for instance, malhotra, gosain, & el sawy ( ) suggested it is imperative to adopt a more comprehensive approach of traceability that extends from source to final consumers in order to obtain a full understanding of information processing and sharing among supply chain stakeholders. in this regard, blockchain technology brings substantial improvements in transparency and trust in food traceability (behnke & janssen, ; biswas, muthukkumarasamy, & tan, ; sander, semeijn, & mahr, ) . however, arguments from many solution providers regarding traceability from "farm to fork" are a flawed concept as privacy law restricts tracking products forward to consumers. in this regard, tracking (to track forward) from farm to fork is impossible unless the consumer is a member of a retailers' loyalty program. however, tracing (to trace backward) from "fork to farm" is a feasible concept enabled by a consumer scanning a gs centric bar code or another code provided by the brand (e.g., proprietary qr code). hence, farm-to-fork transparency is a more useful description of what is feasible (as opposed to farm-to-fork traceability). while a blockchain is not necessarily needed for this function, depending on the complexity of the supply chain, a blockchain that has immutable information (e.g., the original halal or organic certificate from the authoritative source) could improve the integrity of data and information provenance. 
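the "fork to farm" tracing described above can be illustrated with a small sketch that walks a flat list of hypothetical shipment events backward from the retail end to the assumed origin; a production system would instead query epcis repositories and would also need to handle aggregation and transformation events.

```python
# Hypothetical event records: each states which party shipped a given lot to whom.
EVENTS = [
    {"lot": "LOT42", "from": "grower_farm", "to": "packing_house"},
    {"lot": "LOT42", "from": "packing_house", "to": "distributor"},
    {"lot": "LOT42", "from": "distributor", "to": "retail_store"},
]

def trace_back(lot: str, end_party: str, events: list) -> list:
    """Walk recorded events backward from the party holding the lot toward its origin."""
    path = [end_party]
    current = end_party
    while True:
        upstream = [e for e in events if e["lot"] == lot and e["to"] == current]
        if not upstream:
            break                     # no earlier event found: assumed origin reached
        current = upstream[0]["from"]
        path.append(current)
    return path

print(trace_back("LOT42", "retail_store", EVENTS))
# ['retail_store', 'distributor', 'packing_house', 'grower_farm']
```

note that the same event list cannot be walked forward to an individual consumer, which is the practical reason tracing ("fork to farm") is feasible while tracking to the consumer ("farm to fork") generally is not.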
blockchain is heralded as the new "internet layer of value," providing the trinity of traceability, trust, and transparency to transactions involving data or physical goods and facilitating authentication, validation, traceability, and registration (lima, ; olsen & borit, ) . the application of gs standards with blockchain technology integration enables global solutions linking identification standards for firms, locations, products, and assets with blockchains transactional integrity. thus, the combination of blockchain and gs standards could respond to the emerging and more stringent regulatory requirements for enhanced forms of traceability in fscs (kim, hilton, burks, & reyes, ) . a blockchain can be configured to provide complete information on fsc processes, which is helpful to verify compliance to specifications and to trace a product to its source in adverse events (such as a consumer safety recall). this capability enables blockchain-based applications to solve problems plaguing several domains, including the fsc, where verified and nonrepudiated data are vital across all segments to enable the functioning of the entire fsc as a unit. within the gs standards framework, food traceability is industry-defined and industry-approved and includes categorizations of traceability attributes. these include the need to assign unique identifiers for each product or product class and group them to traceable resource unit (behnke & janssen, ) . fsc actors are both a data creator (i.e., they are the authoritative source of a data attribute) and a data user (i.e., a custodian of data created by other parties such as an upstream supplier). data are created and used in the sequential order of farming, harvesting, production, packaging, distribution, and retailing. in an optimized fsc, the various exchange parties must be interconnected through a common set of interoperable data standards to ensure the data created and used provide a shared understanding of the data attributes and rules (rules on data creation and sharing are encompassed within gs standards). a blockchain can be configured to add value in fscs by creating a platform with access and control of immutable data, which is not subject to egregious manipulation. moreover, blockchain technology can overcome the weaknesses created by the decades-old compliance to the minimum regulatory traceability requirements, such as registering the identity of the exchange party who is the source of inbound goods and registering the identity of the exchange party who is the recipient of outbound goods. this process is known as "one-up/one-down" traceability (wang et al., , pp. e ) and essentially means that exchange parties in an fsc have no visibility on products outside of their immediate exchange partners. blockchain technology enables fsc exchange partners to maintain food traceability by providing a secure, unfalsifiable, and complete history of food products from farm to retail (molding, ). unlike logistics-oriented traceability, the application of blockchain and gs standards can create attribute-oriented traceability, which is not only concerned with the physical flow of food products but also tracks other crucial information, including product quality and safety-related information (skilton & robinson, ). on the latter point, food business operators always seek competitive advantage and premium pricing through the product (e.g., quality) or process differentiation claims (e.g., organically produced, cage-free eggs). 
this is in response to research indicating that an increasing segment of consumers will seek out food products best aligning with their lifestyle preferences such as vegetarian, vegan, or social and ethical values such as fair trade, organic, or cage-free (beulens, broens, folstar, & hofstede, ; roe & sheldon, ; vellema, loorbach, & van notten, ; yoo, parameswaran, & kishore, ) . in fig. . below, keogh ( ) outlines the essential traceability functions and distinguishes the supply chain flow of traceability event data versus the assurance flow of credence attributes such as food quality and food safety certification. for instance, in economic theory, goods are considered as comprising of ordinary, search, experience, or credence attributes (darby & karni, ; nelson, ) . goods classified as ordinary (e.g., petrol or diesel) have well-known characteristics and known sources and locations to locate and purchase. regarding search, it refers to goods where the consumer can easily access trusted sources of information about the attributes of the product before purchase and at no cost. search is "costless" per se and can vary from inspecting and trying on clothes before buying or going online to find out about a food product, including its ingredients, package size, recipes, or price. in the example of inspecting clothes before purchase, dulleck, kerschbamer, & sutter ( ) differentiate this example as "search" from "experience" by arguing that experience entails unknown characteristics of the good that are revealed only after purchase (e.g., the actual quality of materials, whether it fades after washing). products classified as experience goods have attribute claims such as the product is tasty, flavorful, nutritious, or health-related such as lowers cholesterol and requires the product to be tasted or consumed to verify the claim, which may take time (e.g., lowers cholesterol). verifying the experience attributes may be free if test driving a car or receiving a sample or taster of a food product in a store. nevertheless, test driving or sampling will not confirm how the product will perform over time. generally speaking, verifying experience attributes of food is not free, and it may take considerable time (and likely expense) to verify the claim. credence claims (darby and karni, ) are characterized by asymmetric information between food producers and food consumers. the reason for this is because credence attributes are either intrinsic to the product (e.g., food quality, food safety) or extrinsic methods of processing (e.g., organic, halal, kosher), and consumers cannot verify these claims before or after purchase (dulleck et al., ) . in this regard, a blockchain offers a significant advancement in how credence claims flow (see fig. . ) and are added to a product or batch/lot # record. for instance, the immutability of the data means that a brand owner can add a record such as a third-party certificate (e.g., laboratory analysis verifying a vegan claim or a usda organic certificate), but they cannot edit or change it. this feature adds much-needed integrity to fscs and enhances transparency and consumer trust, especially if the third-party data are made available for consumers to query. in this context, the combination of gs standards and a blockchain provides a consumer with the capability to scan a food product and query its digital record to verify credence claims. 
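as a simplified illustration of how such a consumer-facing query might verify a credence claim, the sketch below compares the hash of a certificate document presented by a brand owner against the hash recorded when the certificate was first registered. the identifier, claim, and certificate contents are hypothetical, and a real deployment would anchor the registered hash in a blockchain transaction rather than an in-memory dictionary.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hypothetical immutable record created when the certificate was first registered;
# on a real platform this would live in a ledger entry the brand owner cannot edit.
REGISTERED_RECORD = {
    "product_id": "urn:example:gtin:0950600013435",   # illustrative identifier only
    "claim": "organic",
    "certificate_hash": sha256_of(b"organic certificate issued by certifier X"),
}

def verify_claim(presented_certificate: bytes, record: dict) -> bool:
    """A consumer-facing app recomputes the hash and compares it to the registered record."""
    return sha256_of(presented_certificate) == record["certificate_hash"]

print(verify_claim(b"organic certificate issued by certifier X", REGISTERED_RECORD))  # True
print(verify_claim(b"edited certificate", REGISTERED_RECORD))                         # False
```

the design choice being illustrated is that the consumer does not need to trust the brand owner's copy of the certificate, only the hash recorded at registration time by the authoritative source.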
at a more detailed level, the fragmentation of fscs and their geographic dispersion illustrates the need for blockchain and gs for achieving an optimal granularity level of traceability units (dasaklis, casino, & patsakis., ) . as such, the combination of blockchain can help in the assurance of food quality and safety, providing secure (toyoda, takis mathiopoulos, sasase, & ohtsuki, ) , precise, and real-time traceability of products. moreover, the speed of food authentication processes makes blockchain a potential enabler of a proactive food systemda key catalyst for anticipating risky situations and taking the necessary preventative measures. triggering automatic and immediate actions in the fsc has been an impetus for large corporations to adopt blockchain technology; for example, walmart leverages gs standards and blockchain technology, defining the data attributes to be entered into their preferred blockchain system, such as the attributes defined under the produce traceability initiative (pti, ). using gs standards as a foundational layer, walmart tracks pork meat and pallets of mangoes, tagged with unique numeric identifiers in china and the united states. walmart has demonstrated the significant value of a gs -enabled blockchain, reducing both business and consumer risk in a product safety recall. more specifically, walmart simulated a product safety recall for mangoes, and this exercise suggested a reduction in time to execute the product safety recall from days pre-blockchain to . s using a blockchain (kamath, ) . the contribution of gs to the de facto individualization of food products has motivated the study of dos santos, torrisi, yamada, & pantoni ( ) , who examine the traceability requirements in recipe-based foods and propose whole-chain traceability with a focus on ingredient certification. with the use of blockchain technology, it is possible to verify the source of any batch or a lot number of ingredients. kim et al. ( ) developed an application called "food bytes" using blockchain technology and enabling consumers to validate and verify specific quality attributes of their foods (e.g., organic) by accessing curated gs standard data from mobile devices, thereby increasing ease of consumer usability and ultimately trust. blockchain technology can help fsc partners develop best practices for traceability and to curb fraudulent and deceptive actions as well as the adulteration of food products. to solve these issues, staples et al. ( ) develop a traceability system based on haccp, gs , and blockchain technology in order to guarantee reliable traceability of the swine supply chain. in their proposed system, gs aids in the coordination of supply chain information, and blockchain is applied to secure food traceability. a pressing challenge facing fscs is the need to coordinate information exchange across several types of commodities, transportation modes, and information systems. by analogy, a similar need was resolved in the healthcare industry through the implementation of electronic health records (ehr) to provide access to an individual patient's records across all subdomains catering to the patient. the healthcare industry is presently working on enhancing ehr through the deployment of blockchain to serve as a decentralized data repository for preserving data integrity, security, and ease of management (shahnaz, qamar, & khalid, ) . 
closely resembling the role and function of the ehr in the healthcare industry, the creation of a digital food record (dfr) is vital for fscs to facilitate whole-chain traceability, interoperability, linking the different actors and data creators in the chain, and enhancing trust in the market on each product delivered. fsc operators need access to business-critical data at an aggregated level to drive their business strategy and operational decisions, and many of the organizations operate at the global, international, or national levels. data digitization and collaboration efforts of fsc organizations are essential to enable actionable decisions by the broader food industry. currently, much of the data currently exist as siloed, disparate sources that are not easily accessible; including data related to trade (crop shortages/overages), market prices, import/export transaction data, or real-time data on pests, disease or weather patterns, and forecasts. with this in mind, and acknowledging the need for transparent and trusted data sharing, the dutch horticulture and food domain created "horticube," an integrated platform to enable seamless sharing of data and enable semantic interoperability (verhoosel, van bekkum, & verwaart, ) . the platform provides "an application programming interface (api) that is based on the open data protocol (odata). via this interface, application developers can request three forms of information; data sources available, data contained in the source, and the data values from these data sources (verhoosel et al., , p. ). the us food and drug administration is currently implementing the food safety modernization act (fsma) with emphasis on the need for technological tools to accomplish interoperability and collaboration in their "new era of smarter food safety" (fda, ) . in order to enable traceability as envisioned in fsma, a solution is required that incorporates multiple technologies, including iot devices. blockchain is envisioned as a platform of choice in accordance with its characteristic of immutability to prevent the corruption of data (khan, ) . ecosystems suited for the application of blockchain technology are those consisting of an increasing set of distributed nodes that need a standard approach and a cohesive plan to ensure interoperability. more precisely, fscs comprised of various partners working collaboratively to meet the demands of various customer profiles, where collaboration necessitates an exchange of data (mertins et al., ) ; furthermore, the data should be interchanged in real-time and verified to be originating from the designated source. interoperability is a precursor of robust fscs that can withstand market demands by providing small and medium enterprises with the necessary information to decide on the progress of any product within the supply chain and ensure the advancement of safe products to the end consumer. blockchain technology enables an improved level of interoperability as fsc actors would be able to communicate real-time information (bouzdine-chameeva, jaegler, & tesson, ), coordinate functions, and synchronize data exchanges (bajwa, prewett, & shavers, ; behnke & janssen, ) . the potential interoperability provided by blockchains can be realized through the implementation of gs standards. 
specifically, the electronic product code information standard (epcis) can be used to ensure the documentation of all fsc events in an understandable form and the aggregation of food products into higher logistic units, business transactions, or other information related to the quantity of food products and their types (xu, weber, & staples, ). a recent study by the institute of food technologists found evidence that technology providers faced difficulty in collaborating to determine the origin or the recipients of a contaminated product (bhatt & zhang, ). hence, the novel approach of blockchain provides a specific emphasis on interoperability between disparate fsc systems, allowing technology providers to design robust platforms that ensure interoperable and end-to-end product traceability. the use of iot devices allows organizations within fscs to send and receive data; however, the authenticity of the data still needs to be ascertained. a compounding factor is the technological complexity of fscs (ahtonen & virolainen, ), due to the reliance on siloed systems that hamper collaboration and the efficient flow of information. however, blockchain architecture can accommodate interoperability standards at the variable periphery (the iot devices) and other technologies used to connect fsc processes (augustin, sanguansri, fox, cobiac, & cole, ). blockchain is envisaged as a powerful tool (ande, adebisi, hammoudeh, & saleem, ) and an appropriate medium to store the data from iot devices since it provides seamless authentication, security, protection against attacks, and ease of deployment among other potential advantages (fernández-caramés & fraga-lamas, ). for fscs, blockchain is seen as the foundational technology for the sharing and distribution (read and write) of data by the organizations comprising the ecosystem, as shown in fig. . . in this model, consumers can read data for any product and trace the entire path from the origin to the destination while relying upon the immutability of blockchain to protect the data from any tampering. supply chain data are stored as a dfr in the various blocks (e.g., b , b , b ) that comprise the blockchain. the first block, represented by g in fig. . , refers to the genesis block, which functions as a prototype for all the other blocks in the blockchain. gs provides guidance documents for implementing traceability in fscs, including:
• traceability for fresh fruits and vegetables-implementation guide (legacy, developed for gs global traceability standard . . ) (gs , b)
• gs global traceability compliance criteria for food application standard (legacy, developed for gs global traceability standard . . ) (gs , b)
together, these documents provide comprehensive guidance to fscs on the implementation of a traceability framework. figs. . and . below indicate a single and multiple company view of traceability data generation. underlying the gs traceability standard is the gs epcis (gs , a), which defines traceability as an ordered collection of events that comprise four key dimensions:
• what: the subject of the event, either a specific object (epc) or a class of object (epc class) and a quantity
• when: the time at which the event occurred
• where: the location where the event took place
• why: the business context of the event
the gs global traceability standard adds a fifth dimension, "who," to identify the parties involved. this can be substantially different from the "where" dimension, as a single location (e.g., a third-party warehouse) may be associated with multiple, independent parties.
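a minimal sketch of an event record organized around the five dimensions listed above is shown below. the identifiers and vocabulary values are invented placeholders rather than actual gs keys or cbv terms, and a conformant epcis implementation would use the standard's own event types and serializations.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class TraceEvent:
    what: str        # object or class identifier, e.g. a product identifier plus lot
    when: str        # event time, ISO 8601
    where: str       # identifier of the read point / business location
    why: str         # business context, e.g. "shipping", "receiving"
    who: str         # party responsible for the event (the fifth dimension)
    quantity: float = 1.0
    extensions: dict = field(default_factory=dict)

event = TraceEvent(
    what="urn:example:product:lettuce:lot42",      # hypothetical identifier
    when=datetime(2020, 3, 1, 8, 30, tzinfo=timezone.utc).isoformat(),
    where="urn:example:location:packinghouse7",
    why="shipping",
    who="urn:example:party:growerco",
    quantity=240,
    extensions={"temperature_c": 3.5},             # example sensor attribute
)

print(asdict(event))
```

separating "who" from "where", as the standard does, matters precisely for cases like the third-party warehouse mentioned above, where one location serves several independent parties.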
epcis is supplemented by the core business vocabulary standard (gs , a), which specifies the structure of vocabularies and specific values for the vocabulary elements to be utilized in conjunction with the gs epcis standard. epcis is a standard that defines the type and structure of events and a mechanism for querying the repository. assuming that all parties publish to a common epcis repository (centralized approach) or that all parties make their repositories available (open approach), traceability is simply the process of querying events, analyzing them, and querying subsequent events until all relevant data are retrieved. in practice, neither the centralized nor the open approach is possible. in the centralized approach, multiple, competing repositories will naturally prevent a single, centralized repository from ever being realized. even if such a model were to be supported in the short term by key players in the traceability ecosystem, as more and more players are added, the odds of one or more of them already having used a competing repository grow. in the open approach, not all parties will be willing to share all data with all others, especially competitors. depending on the nature of the party querying the data or the nature of the query itself, the response may be no records, some records, or all records satisfying the query. for either approach, there is the question of data integrity: can the system prove that the traceability data have not been tampered with? blockchain is a potential solution to these problems. as a decentralized platform, blockchain integration could provide epcis solution providers with a way of sharing data in a secure fashion. furthermore, the sequential, immutable nature of the blockchain platform either ensures that the data cannot be changed or provides a mechanism for verifying that they have not been tampered with. the critical question is, what exactly gets stored on the blockchain? the options discussed by gs in a white paper on blockchain (gs , b) are:
• fully formed, cryptographically signed plain text event data, which raises concerns about scalability, performance, and security if full events are written to a ledger;
• a cryptographic hash of the data, which has little meaning by itself. this requires off-chain data exchange via a separate traceability application and a hash comparison to verify that data have not been altered since the hash was written to the ledger; and
• a cryptographic hash of the data and a pointer to off-chain data. this is the same as the above point with a pointer to the off-chain data source. such an approach can enable the ledger to act as part of a discovery mechanism for parties who need to communicate and share data.
this then leads to the question of the accessibility of the data:
• public: everyone sees all transactions;
• private: this includes a permission layer that makes transactions viewable to only approved parties.
integrating epcis (or any other data sharing standard) with blockchain often presents significant challenges. in most cases, volumetric analysis can reveal sensitive business intelligence even without examining the data. for example, if company x is currently publishing records per day, and next year at the same time it is publishing only , it is reasonable to assume that company x's volume is down by % year over year. revealing the subject of an event (the "what" dimension) can reveal who is handling an expensive product, which may be used to plan its theft or diversion.
publishing a record in plain text makes the data available to any party that has a copy of the ledger, but not all data should be available to all parties. for example, transformation events in epcis record inputs partially or fully consumed to produce one or more outputs. in the food industry, this is the very nature of a recipe, which is often a closely guarded trade secret. in order to mitigate this risk, the ledger would have to be firmly held by a limited number of parties that could enforce proper data access controls. even if such a system were to be implemented correctly, it means that proprietary information would still be under the control of a third party, which is a risk that many food companies would not be willing to take. publishing a record in an encrypted form would solve the visibility issue, but in order to do so, the industry would have to agree on how to generate the keys for the encrypted data. one option is to use the event's subject (the "what" dimension) as the key. if the identifier for the subject is sufficiently randomized, this ensures that only parties that have encountered the identifier can actually decrypt the data; while other parties could guess at possible values of the identifier, doing so at scale can be expensive and therefore self-limiting. there would also have to be a way to identify which data are relevant to the identifier, which would mean storing something like a hash of the identifier as a key. only those parties that know the identifier (i.e., that have observed it at some point in its traceability journey) will be able to locate the data of interest and decrypt them. parties could publish a hash of the record along with the record's primary key. this could then be used to validate records to ensure that they have not been tampered with, but it means that any party that wishes to query the data would have to know ahead of time where the data reside. once queried successfully, the record's primary key would be used to lookup the hash for comparison. to enable discovery, data consisting of the event's subject (the "what" dimension) and a pointer to a repository could be published. in essence, this is a declaration that the repository has data related to the event's subject, and a query for records related to the event's subject is likely to be successful. to further secure the discovery, the event's subject could be hashed, and that could be used as the key. volumetric analysis is still possible with this option. to limit volumetric analysis, data consisting of the class level of the event's subject and a pointer to a repository could be published. this is essentially a declaration that objects of a specific type have events in the repository, but it does not explicitly say how many or what specific objects they refer to. it still reveals that the company using the repository is handling the product. over and above all of this is the requirement that all publications be to the same type of blockchain ledger. there are currently no interoperability standards for blockchains. the industry would, therefore, have to settle on one, which has the same issue as settling on a single epcis repository. further technical research is required to determine the viability of the various options for publishing to the blockchain. the standardization efforts in global fscs have led to the need for best practice recommendations and common ways of managing logistics units in the food chain. 
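as a concrete illustration of the hash-plus-pointer option discussed above, the sketch below hashes a hypothetical event subject to form a discovery key and hashes the full record so that off-chain data can later be checked for tampering; the identifier, record, and repository pointer are all invented for the example, and no particular blockchain platform is assumed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// generic illustration of the "hash plus pointer" option: the event subject
// is hashed to form a discovery key, and the full record is hashed so that
// off-chain data can later be verified as unchanged. nothing here is taken
// from a specific blockchain or epcis implementation.
public final class LedgerEntryDemo {
    static String sha256Hex(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }

    public static void main(String[] args) throws Exception {
        String epc = "urn:epc:id:sgtin:0614141.107346.2017";               // hypothetical subject
        String record = "{\"what\":\"" + epc + "\",\"why\":\"shipping\"}"; // off-chain record
        String repository = "https://repo.example.org/epcis";             // hypothetical pointer

        String discoveryKey = sha256Hex(epc);    // published on the ledger for discovery
        String recordHash = sha256Hex(record);   // published to prove the record is unchanged

        System.out.println("ledger entry: key=" + discoveryKey
                + " recordHash=" + recordHash + " pointer=" + repository);
    }
}
```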
the widespread use of gs standards reflects the tendency of food organizations to operate in an integrated manner with a universal language. this facilitates fscs to structure and align with a cohesive approach to food traceability, empowering multidirectional information sharing, optimizing efficiencies, and added-value activities for fsc stakeholders. moreover, the embeddedness of gs standards in global fscs allows trading partners to work in an industry-regulated environment wherein food quality and food safety are of the utmost priority in delivering sustainable, authentic products to final consumers. today, the usage of gs standards is inevitable as they provide clear guidelines on how to manage and share event data across global fscs (figueroa, añorga, & arrizabalaga, ) . this inevitability is further enhanced through the leadership of the global management board of gs (as of february ) that consists of senior executives from organizations such as procter & gamble, nestle, amazon, google, j.m. smucker, l'oreal, metro ag, alibaba, and others. similarly, the management board for the gs us organization includes senior executives from walmart, wegfern, wendy's, coca cola, target, publix, wegmans, sysco, massachusetts institute of technology, and others. the commitment of these organizations strongly supports the industry adoption of gs standards, and gs enabled blockchain solutions as indicated by walmart in their us-driven "fresh leafy greens" traceability initiative (walmart, ) . moreover, many of these firms have announced blockchain-related initiatives in their supply chains. walmart's traceability initiative reflects growing consumer concerns regarding food quality and safety and the recurring nature of product safety recalls. the combination of gs standards with blockchain can provide immutable evidence of data provenance, enhance food traceability and rapid recall, and increase trust in the quality of food products. gs standards aid organizations in maintaining a unified view of the state of food while transitioning between processing stages across globalized and highly extended supply chains with multiple exchange parties. as such, the broad adoption of electronic traceability as identified by gs can endow the food industry with several capabilities, ranging from the optimization of traceback procedures, the standardization of supply chain processes, the continuous improvement in food production activities, and the development of more efficient and holistic traceability systems. the use of gs standards for the formation of interoperable and scalable food traceability systems can be reinforced with blockchain technology. as envisioned by many food researchers, practitioners, and organizations, blockchain technology represents a practical solution that has a positive impact on fsc collaborations and data sharing. blockchain technology creates a more comprehensive and inclusive framework that promotes an unprecedented level of transparency and visibility of food products as they are exchanged among fsc partners. combined with gs standards, blockchain technology offers a more refined level of interoperability between exchange parties in global fscs and facilitates a move away from the traditional or linear, stove-piped supply chains with limited data sharing. by leveraging blockchain, fscs would be able to develop a management information platform that enables the active collection, transfer, storage, control, and sharing of food-related information among fsc exchange parties. 
the combination of blockchain and gs standards can create a high level of trust because of the precision in data and information provenance, immutability, nonrepudiation, enhanced integrity, and deeper integration. the development of harmonized global fscs gives rise to more efficient traceability systems that are capable of minimizing the impact of food safety incidents and lowering the costs and risks related to product recalls. therefore, the integration of gs standards into a blockchain can enhance the competitive advantage of fscs. in order to unlock the full potential from the functional components of a blockchain and the integration of gs standards, several prerequisites need to be fulfilled. for example, a more uniformed and standardized model of data governance is necessary to facilitate the operations of fscs in a globalized context. a balance between the conformance with diverse regulatory requirements and the fsc partners' requirements should be established in order to maintain a competitive position in the global market. the inter-and intraorganizational support for blockchain implementations, including the agreement on what type of data should be shared and accessed, the establishment of clear lines of responsibilities and accountability, and the development of more organized and flexible fscs should be considered prior to blockchain adoption (fosso wamba et al., ) . in summary, a blockchain is not a panacea, and non-blockchain solutions are functioning adequately in many fscs today. the business case or use case is crucially important when considering whether a blockchain is required and whether its functionality adds value. moreover, a blockchain does not consider unethical behaviors and opportunism in global fscs (bad character). organizations need to consider other risk factors that could impact ex post transaction costs and reputation. global fsc risk factors include slave labor, child labor, unsafe working conditions, animal welfare, environmental damage, deforestation and habitat loss, bribery and corruption, and various forms of opportunism such as quality cheating or falsification of laboratory or government records before they are added to a blockchain. product data governance and enhanced traceability can be addressed in global fscs, but "bad character" is more difficult to detect and eliminate. essentially, bad data and bad character are the two main enemies of trust in the food chain. this study focused narrowly on existing research combining blockchain, gs standards, and food. due to the narrow scope of the research, we did not explore all technical aspects of the fast-evolving blockchain technology, smart contracts, or cryptography. further research is needed to explore the risks associated with the integrity of data entered into a blockchain, especially situations where bad actors may use a blockchain to establish false trust with false data. in this regard, "immutable lies" are added to a blockchain and create a false sense of trust. because of this potential risk, and because errors occuring in the physical flow of goods within supply chains are common (e.g., damage, shortage, theft) as well as errors in data sharing and privacy, the notion of blockchain "mutability" should be researched further (rejeb et al., (rejeb et al., , . further technical research is encouraged to explore the relationship between the immutability features of a blockchain and the mutability features of the epcis standard. 
in the latter, epcis permits corrections in which the original, erroneous record is preserved and the correction carries a pointer to the original. researchers should explore current epcis adoption challenges and whether epcis could provide blockchain-to-blockchain and blockchain-to-legacy interoperability. the latter may mitigate the risks associated with fsc exchange partners being "forced" to adopt a single proprietary blockchain platform, or to participate in multiple proprietary blockchain platforms, in order to trade with their business partners. researchers should also explore whether the latency of real-time data retrieval in blockchain-based fscs restricts consumer engagement in verifying credence claims in real time, given the complexity of retrieving block transaction history.
the authors are thankful to dr. steven j. simske, dr. subhasis thakur, and irene woerner for their thoughtful commentary on this chapter. abderahman rejeb, coauthor and ph.d. candidate, is grateful to professor lászló imre komlósi, dr. katalin czakó, and ms. tihana vasic for their valuable support. conflict of interests:
• no funding was received for this publication.
• john g. keogh, the corresponding author, is a former executive at gs canada and has not advised or worked with or for gs for more than years.
• kevin dean is an independent technical consultant advising gs .
key: cord- -ihh ur authors: lu, qiang; hao, pei; curcin, vasa; he, weizhong; li, yuan-yuan; luo, qing-ming; guo, yi-ke; li, yi-xue title: kde bioscience: platform for bioinformatics analysis workflows date: - - journal: j biomed inform doi: . /j.jbi. . . sha: doc_id: cord_uid: ihh ur
bioinformatics is a dynamic research area in which a large number of algorithms and programs have been developed rapidly and independently without much consideration so far of the need for standardization. the lack of such common standards, combined with unfriendly interfaces, makes it difficult for biologists to learn how to use these tools and to translate data from one format to another. consequently, the construction of an integrative bioinformatics platform to facilitate biologists' research is an urgent and challenging task. kde bioscience is a java-based software platform that collects a variety of bioinformatics tools and provides a workflow mechanism to integrate them. nucleotide and protein sequences from local flat files, web sites, and relational databases can be entered, annotated, and aligned. several home-made or third-party viewers are built in to provide visualization of annotations or alignments. kde bioscience can also be deployed in client-server mode, where simultaneous execution of the same workflow is supported for multiple users. moreover, workflows can be published as web pages that can be executed from a web browser. the power of kde bioscience comes from the integrated algorithms and data sources. with its generic workflow mechanism, other novel calculations and simulations can be integrated to augment the current sequence analysis functions. because of this flexible and extensible architecture, kde bioscience makes an ideal integrated informatics environment for future bioinformatics or systems biology research.
the rapid development of genome technologies, especially automatic sequencing techniques, has produced a huge amount of data consisting essentially of nucleotide and protein sequences. for instance, the number of sequences in genbank increases exponentially, and as of august (release ) it contained over . billion nucleotide bases from . million individual sequences [ ] . to store, characterize, and mine such a large amount of data requires many databases and programs hosted on high-performance computers. to date, several databases have been established, for example genbank [ ] , uniprot [ ] , pdb [ ] , kegg [ ] , and pubmed medline, covering not only nucleotide and protein sequences but also their annotations and related research publications. the programs include those for sequence alignment and for prediction of genes, protein structures, and regulatory elements, some of which are organized into packages such as emboss [ ] , phylip [ ] , and gcg wisconsin (http://www.accelrys.com/products/gcg_wisconsin_package/program_list.html). in general, these databases are built independently by various academic or commercial organizations, and their input and output data formats follow their own standards (e.g., fasta, genbank, embl, srs), most of which are incompatible. the programs themselves are even more complex in that they are implemented using a variety of programming languages and on different operating systems, and are operated in different ways using input and output data in a wide range of formats. biologists try to discover biological functions from sequences using informatics techniques but are frequently frustrated by the processes of searching for suitable tools, learning how to use them, and translating data formats between them. to facilitate biologists' research, an integrative informatics platform is needed in which many kinds of databases and programs are integrated with a common input-output data format and a uniform graphical user interface (gui). to build such an integrative informatics platform, workflow technology is recognized as a potential solution. some existing efforts include biopipe [ ] , biowbi [ ] , taverna [ ] , and wildfire [ ] . all of them provide mechanisms to integrate bioinformatics programs into workflows. biopipe is based on the perl programming language but so far lacks a user-friendly interface for building workflows. biowbi and taverna use web services as components to construct workflows; however, they lack an integrated gui environment for converting third-party programs into web services. wildfire aims to use workflows to bring large-scale computing capability to bioinformatics applications; however, it provides no integrated environment for multiple users to collaborate on the same large-scale bioinformatics project. in this paper, we present an integrative informatics platform, the knowledge discovery environment of bioscience (kde bioscience), which is intended to provide a solution for integrating biological data, algorithms, computing hardware, and biologist intelligence for bioinformatics. from the viewpoint of informatics, the requirements of an integrative informatics platform consist of four parts: integration of data, algorithms, computing hardware, and human intelligence. the large scale of sequence and annotation data is one of the prominent characteristics of bioinformatics applications. generally, to handle bulk data, a database management system (dbms) is the best choice.
with support for the structured query language (sql) standard, accessing a dbms is machine-friendly in cases where the data are well structured. however, since current biological research generates a large amount of data that is usually far from complete and well structured, it can be better to store such data in flat files with a semi-structured format that allows for errors and redundancies. additionally, biologists have become used to publishing their data on web pages that are usually in unstructured formats. providing biologists with an easy-to-use bioinformatics platform therefore requires the integration of sequence and annotation data in different formats from dbms, flat files, and web pages. genomic data are so abundant that they defy simple intuitive analysis. thus, many computer programs are employed to assist in tasks including alignment of sequences, prediction of genes, protein structures, and regulatory elements, and visualization. these programs reflect the up-to-the-minute progress in research and as a consequence lack standardization. although there are a few de facto standards, too many varieties will continue to exist for the foreseeable future. a practical bioinformatics task typically consists of several algorithms running in parallel or in series. therefore, the key requirements of algorithm integration are ( ) to collect many specialized programs and ( ) to provide an easy way for them to communicate. in bioinformatics, large-scale data need bulk storage, and time-consuming tasks such as alignment of genomes need powerful computing resources; however, such powerful hardware may be unaffordable for a single organization or an ordinary researcher. a potential solution is to integrate distributed storage and computing resources. in this sense, it is necessary for an integrative informatics platform to support distributed storage and computing. furthermore, it is not feasible for a genome project to be handled by only one person. typically a team will be formed involving several experts who focus on different parts of the same project, such as sequencing, micro-arrays, bioinformatics analysis, and experimental verification. thus, it is necessary for an integrative informatics platform to provide a mechanism for biologists to work together and share data and designs, or even construct the same workflow. in addition, they can publish their designs as web pages for access and re-use by other researchers. kde bioscience is based on the knowledge discovery environment (kde) [ ] . the basic idea is to represent a bioinformatics analysis process (task) as a workflow (pipeline) constructed from a series of linked nodes. there are two key concepts in this approach: ( ) data models, which represent data with a specialized syntax, and ( ) nodes, which represent separate algorithms. the platform itself also provides functions for the development, management, and execution of workflows. data in kde bioscience, such as sequences and related annotations, are abstracted into data models. these data models, composed of a data container and metadata, provide a general structure to transfer data between the various programs (nodes) in kde bioscience. metadata describes the properties of the data, including types, names, and structures. kde bioscience provides a mechanism for metadata processing that executes before the workflow operates on the actual data.
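as a hypothetical illustration of the data-model idea described above, the following sketch pairs a simple data container with metadata that a downstream node could inspect before execution; the class and field names are ours and do not reproduce the actual kde bioscience sdk.

```java
import java.util.List;
import java.util.Map;

// hypothetical sketch of a data model: a container plus metadata that a
// workflow engine can inspect before any actual data are processed.
// names are illustrative and not taken from the kde bioscience sdk.
record SequenceMetadata(String sequenceType,          // e.g. "dna", "rna", "protein"
                        List<String> annotationTypes) // e.g. "cds", "trna", "repeat"
{}

record SequenceCollection(SequenceMetadata metadata,
                          Map<String, String> sequencesById) {

    // a node could call this during the "preparing" stage to reject
    // incompatible inputs before the workflow starts running.
    boolean isCompatibleWith(String requiredType) {
        return metadata.sequenceType().equalsIgnoreCase(requiredType);
    }
}
```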
since metadata provides extra information, data controls such as logical constraints can be implemented for workflows, for example, verification of the compatibility of a particular algorithm with a given dataset. there are two important data models in kde bioscience: sequence collection and table. sequence collection accommodates the main bioinformatics data of nucleotide/protein sequences and their related annotations. considering the intrinsic linearity of a biological sequence, the sequence is represented as a string of characters, and its annotations or features are then organized along the sequence with one-dimensional coordinates. to facilitate development, the sequencedb interface from the open source project biojava [ ] is adopted as the java interface for this container. to cope with large amounts of sequence data, our java class ksequencedb, which implements the sequencedb interface, is based on files on hard disk instead of in memory. ksequencedb metadata is organized as a tree, in which sequence types (dna, rna, or protein), names, and types of related annotations are described as leaves. the output of most algorithms, not only those for bioinformatics but also those for general data processing, can be mathematically generalized as a two-dimensional matrix. naturally, the table data model is used to store such matrices. table metadata specifies the column names and types. in kde bioscience, a java class, kresultset, is developed to create the table; it implements the java.sql.resultset interface on top of the file system. by using resultset, persistence can be implemented easily in a relational database. in this sense, tables provide a bridge between kde bioscience and other applications such as data warehouses for data mining and knowledge discovery. there are some other specialized data models in kde bioscience, such as clustalwresult for the result of the alignment program clustalw, which are not described in detail in this paper. to explore the data, several viewers can be attached to each data model using a simple configuration file. these viewers can be launched at any point of a workflow to visualize the corresponding data. the node is another basic component of a workflow. usually, each node represents a distinct algorithm. kde bioscience has so far collected more than commonly used bioinformatics programs covering the analysis and alignment of nucleotide and protein sequences. with the powerful software development kit (sdk) provided by kde, algorithms with various implementations (with or without source code, hosted on local or remote machines) can be integrated. for example, two nodes for blast applications are provided, local blast and net blast, and there are nodes for querying the ncbi, pdb, and uniprot databases remotely. all nodes are classified into several groups: import/export, nucleotide analysis, protein analysis, remote query, alignment, visualization, and accessory tools. import and export nodes transfer sequence data between the kde bioscience workspace and the outside; for instance, users can load and store sequence data from and to flat files in various formats such as fasta, genbank, embl, swissprot, and genpept. furthermore, sequences can be imported and exported as xml or as tables in any jdbc (java database connectivity) supported relational dbms. in addition, sequences can be imported in the form of editable strings and exported to the clipboard on windows systems.
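the flat-file import route mentioned above can be illustrated with a minimal reader for the widely used fasta format; this is a generic sketch, not code taken from kde bioscience.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// minimal, generic fasta reader: header lines start with '>', and all
// following lines up to the next header form one sequence. this sketches
// the kind of flat-file import described above; it is not kde bioscience code.
public final class FastaReader {
    public static Map<String, String> read(Path fastaFile) throws IOException {
        Map<String, String> sequences = new LinkedHashMap<>();
        String currentId = null;
        StringBuilder current = new StringBuilder();
        for (String line : Files.readAllLines(fastaFile)) {
            if (line.startsWith(">")) {
                if (currentId != null) sequences.put(currentId, current.toString());
                currentId = line.substring(1).trim();
                current.setLength(0);
            } else {
                current.append(line.trim());
            }
        }
        if (currentId != null) sequences.put(currentId, current.toString());
        return sequences;
    }
}
```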
generally, import nodes are the starting points of a workflow (task) and export nodes are the end points. the nodes in the nucleotide analysis and protein analysis groups provide algorithms to annotate nucleotide and protein sequences, covering various functions:
• nucleotide composition analysis, such as compseq, dan, freekn [ ] , and gc calculation;
• cpg island prediction, such as cpgplot and cpgreport [ ] ;
• d nucleic structure prediction, such as rnafold [ ] and einverted [ ] ;
• nucleic motif analysis, such as fuzznuc, fuzztran, restrict, and tfscan [ ] ;
• primer prediction, such as primer [ ] ;
• promoter prediction, such as neural network promoter prediction [ ] ;
• repeat identification, such as recon [ ] and repeatmasker [ ] ;
• trna prediction, such as trnascan [ ] ;
• gene finding, such as genescan [ ] , getorf [ ] , and glimmer [ ] ;
• statistics, such as geecee and pepstat [ ] ;
• protein composition analysis, such as charge, checktrans, compseq, freak, iep, octanol, and pepinfo [ ] ;
• protein d structure prediction, such as garnier, helixturnhelix, pepcoil, pepwheel, and tmap [ ] ;
• protein motif prediction, such as antigenic, digest, fuzzpro, and sigcleave [ ] ;
• phylogeny analysis, such as phylip [ ] .
many of these come from the open source package emboss [ ] . more significant than the list of basic programs integrated is the fact that kde bioscience provides a mechanism, the xml integration framework (xif), that enables programs to be integrated by users themselves without any programming. a standalone gui application, xif studio, is provided to guide the end user in integrating command-line executable applications into the platform. the programs integrated can be executable programs or scripts written in perl or any other shell language. with xif, a trivial java class file and an xml file describing the user interface and command-line options will be created automatically. after kde bioscience is restarted, the new nodes will appear in the user interface for use in workflows. there are many popular bioinformatics applications and databases hosted remotely, such as the queries of nucleotide sequences, protein sequences, and medline at http://www.ncbi.nlm.nih.gov. as these databases or algorithms are difficult or impossible to install locally, kde bioscience provides groups of nodes to access them instead of using a web browser. with these nodes, ncbi (http://www.ncbi.nlm.nih.gov), pdb (http://www.rcsb.org), swissprot (http://us.expasy.org), smart (http://smart.embl-heidelberg.de), kegg (http://www.genome.ad.jp/kegg/), srs integrated databases (http://www.scbit.org/srs), and many other websites such as hugo (http://www.gene.ucl.ac.uk/nomenclature/) can be accessed. instead of raw web pages, the outputs of queries are translated into structured data automatically by kde bioscience. thereafter, such data can be used in the workspace as sequence collections or tables for further processing by various nodes. alignment is a basic process in sequence analysis. in kde bioscience, alignment programs such as blast [ ] , clustalw [ ] , sim [ ] , mummer [ ] , fastacmd [ ] , and dotter [ ] are integrated for this purpose. visualization nodes include several viewers for data models. these graphical representations allow biologists to get a better understanding of the data.
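returning briefly to program integration, the following hypothetical sketch shows the kind of wrapper that xif could generate around a command-line tool: the external program is executed and its output captured for a downstream node. the executable name and flags are placeholders rather than the command line of any specific emboss program.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// hypothetical sketch of a generated command-line wrapper node: run an
// external analysis program on an input file and capture its output so
// that a downstream node can parse it. the program name and flags below
// are placeholders, not the command line of any specific tool.
public final class CommandLineNode {
    public static String run(Path inputFasta) throws IOException, InterruptedException {
        List<String> command = List.of("some_analysis_tool",      // placeholder executable
                                       "-sequence", inputFasta.toString(),
                                       "-outfile", "stdout");
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .start();
        String output = new String(process.getInputStream().readAllBytes());
        if (process.waitFor() != 0) {
            throw new IOException("tool exited with an error:\n" + output);
        }
        return output; // raw text to be converted into a table or sequence collection
    }
}
```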
in addition to the default text viewer for plain text, which represents the data using the return value of the corresponding tostring() method, there are some other graphical viewers, such as featurevista for sequence collections, alignmenttreeview for clustalw results, blastviewer for blast results, tableeditor for kresultsets, and so on. some of them are home-made, while others are from open-source projects, such as gsviewer (http://www.lasergo.com/gsviewer.htm) for postscript files and rasmol (http://www.umass.edu/microbio/rasmol/) for molecular structures. in addition to the core functions mentioned above, kde bioscience provides several accessory nodes to assist the analysis, such as nodes for merging sequences and their annotations and nodes for extracting specific features. with data models and nodes, a variety of algorithms and bioinformatics data can be integrated to provide a powerful integrative environment for complex bioinformatics analysis. to support workflow construction, kde bioscience is built using the java platform enterprise edition (j ee) architecture. it consists of three layers: the user interface (ui) layer, the execution layer (kde engine), and the component layer (refer to fig. ).
fig. . kde bioscience architecture. ear denotes enterprise archive for server-side code. rdbms denotes relational database management system. biosql is a part of biojava [ ] and provides an interface to the dbms. kegg denotes an instance of web services [ ] . rmi denotes java remote method invocation.
the ui layer provides an interface for the construction of workflows that handles the visual presentation of nodes, data, and widgets for parameter setting, plus drag-and-drop operations and the actual execution of the workflow. the execution layer provides a mechanism for workflow execution, including metadata processing, node invocation, and data transfer. the component layer includes many modules that implement the actual algorithms. the interface between the execution layer and the component layer is defined by the sdk. the ui layer provides interfaces to operate the kde engine, including human- and machine-friendly interfaces. kde bioscience has two kinds of user-friendly interfaces, one implemented as a java gui application and the other implemented as web pages. in general, the java gui application provides a visual interface for workflow construction and execution. after a workflow is constructed, it can be deployed as web pages for execution from a standard web browser. to support the construction of distributed applications, a machine-friendly interface is also provided, through which workflows can be built, modified, and executed via the simple object access protocol (soap) as web services. the execution layer transforms the workflow graph built in the kde bioscience gui into a concrete execution plan. the layer itself describes the logical model of a workflow and acts as a virtual machine for node processing. a framework for node invocation, data, and metadata processing has been implemented, in which there are two separate aspects of workflow execution: one is for metadata, while the other is for the actual data. the execution of nodes can be divided into two phases: ( ) "preparing," in which metadata is processed and any errors will prevent the workflow from starting execution, and ( ) "processing," in which the actual data itself is processed.
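a hypothetical sketch of this two-phase contract is shown below; the interface and method names are illustrative only and do not reproduce the real kde bioscience sdk.

```java
import java.util.Map;

// hypothetical sketch of the two-phase execution contract described above:
// a node first validates metadata ("preparing") and only then touches the
// actual data ("processing"). names are illustrative, not the real kde sdk.
interface WorkflowNode<I, O> {

    // phase 1: inspect input metadata only (e.g. sequence type, column names);
    // throwing here prevents the workflow from starting at all.
    void prepare(Map<String, String> inputMetadata);

    // phase 2: operate on the actual data; an error here stops execution
    // beyond the point where it occurred.
    O process(I input) throws Exception;
}
```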
the preparation phase provides a highly flexible mechanism for checking that workflows have been constructed correctly; for example, sequence types and table column metadata can be checked before the workflow is allowed to execute. this metadata verification significantly increases the likelihood of a successful workflow execution. as we mentioned before, the interface to the execution framework is designed as an sdk, with which algorithms can be plugged into kde bioscience as components. the code implementing the sdk builds the low-level component layer, which behaves as the micro-code of a virtual machine, carrying out the actual computing task. besides these three basic layers, there are some other management modules, such as user management, user space (file system) management, node management, and a special module for database access, which together construct the platform for collaboration between users. each user has a private user space where workflows and data are stored. moreover, different users belonging to the same group can share a public user space. with this sharing mechanism, different users can collaborate quite easily. not only can data and results be shared, but the same workflow can also be edited and executed by different users in the same group. therefore many users can work together on the same bioinformatics analysis. technically, the framework of the kde bioscience system is implemented using j ee. the application server adopted is jboss (http://www.jboss.org/), an open source project. several enterprise javabeans (ejbs, server-side components in the j ee platform) carry out the above functions, for example, executionbean for the execution of workflows, componentsbean for the management of nodes, userbean for account management, userspacebean for operation of the user space, and so on. tomcat embedded in jboss provides support for web-based access. in this section, a concise description is given to illustrate typical use of kde bioscience based on the java gui (fig. ).
fig. . kde bioscience java gui.
essentially, the java gui consists of the following panels. at the top left is the user space panel, which provides the space to display data to be processed, results produced, and even workflows constructed; different users belonging to the same group can exchange data via copy-and-paste operations here. at the bottom left is the component panel, where all the nodes are listed as a tree according to their functional groups. users can drag and drop the icons from the user space panel (data) and the component panel (algorithms) into the workspace panel at the top right to construct the workflow. when a node is selected by a simple click in the workspace, its corresponding parameters can be set in the properties editor panel at the bottom middle. when a workflow is constructed and its parameters are set, the user can select one branch of the workflow and trigger execution via a toolbar icon or pop-up menu. when a workflow is very large, the navigator panel provides a global view of the entire workflow, in contrast to the workspace panel, which shows only the part that the user is currently interested in. in this section, two use cases are presented to illustrate kde's usability and function concisely. kde bioscience has been involved in the severe acute respiratory syndrome (sars) research conducted at scbit from the beginning [ ] , and serves as the framework for further investigation [ ] .
since sars-coronavirus (sars-cov) was identified as the causative virus, one important task in sars research has been to examine the genomic variation between virus samples taken from different patients and to find other homologous species. kde bioscience facilitated the necessary nucleotide and protein sequence analysis. the following is a typical case from sars research. first, we download all sars genomes from the ncbi public database. then, we compare the downloaded genomes (ncbi sequence) with genomes (hp /hp /pc) sequenced by our collaborators using clustalw to look for interesting variations (fig. ) . with the variations obtained, we are able to annotate the reference genome. furthermore, we extract the varying sections and translate them into protein sequences to mark the reference protein sequence (for instance, the s-protein, fig. ). subsequently, the s-protein marked with variation points is annotated with various protein analysis tools, for example, tmap for transmembrane region prediction. with these annotations, we can find some interesting properties of the variation regions (fig. ) . at any time, the sequence can be viewed with the custom sequence viewer featurevista, where the annotations and features are listed and visualized (fig. ) . in addition to the standard analysis process, the workflows constructed can be deployed as web pages and thus be executed using a browser (fig. ) ; the results are visualized in an applet (fig. ) . the system for microbial genome annotation (smiga) is a web server (http://www.scbit.org/smiga/index.html) provided by scbit for prokaryotic genome annotation, which is built using kde bioscience. smiga users can log into the system to submit dna or protein sequences for analysis. thereafter, the system takes the user to the corresponding selection page, which lists all possible functions related to the given type of sequence (fig. . , web page of smiga, an application instance of kde bioscience). for example, in fig. , after a dna sequence is submitted, a web page including functions such as trnascan, glimmer, garnier, tmap, and antigenic is presented. users can choose some or all functions according to their needs. they can also set or adjust the parameter settings for the selected nodes by clicking on their corresponding "param" buttons. clicking "submit" will trigger the system, which automatically sends a notification email when the job is done. later, users can come back to the system and view the results for all of the jobs submitted and finished. smiga is a typical application instance of kde bioscience with a customized web ui. all annotation tasks and algorithms are managed and executed in the background. the algorithms are hosted in a distributed computing environment using the kde bioscience infrastructure, which brings powerful computing capabilities to smiga.
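the kind of check applied to the variation regions above can be illustrated with a generic sketch that tests whether observed variation positions fall inside predicted features such as transmembrane regions; all names and coordinates below are invented for the example.

```java
import java.util.List;

// generic illustration of the kind of check described in the sars example:
// given variation positions on a protein and predicted feature intervals
// (e.g. transmembrane regions), report which variations fall inside a
// feature. all coordinates below are invented for the example.
public final class VariationOverlap {
    record Feature(String name, int start, int end) {          // 1-based, inclusive
        boolean contains(int position) { return position >= start && position <= end; }
    }

    public static void main(String[] args) {
        List<Feature> predicted = List.of(new Feature("tm-region-1", 1195, 1217),
                                          new Feature("tm-region-2", 1222, 1240));
        List<Integer> variationPositions = List.of(23, 1201, 1230);

        for (int pos : variationPositions) {
            predicted.stream()
                     .filter(f -> f.contains(pos))
                     .forEach(f -> System.out.println("variation at " + pos
                             + " lies inside " + f.name()));
        }
    }
}
```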
with the well-structured kde bioscience sdk, the algorithms hosted in a distributed computing environment can be incorporated as nodes regardless of the invocation protocols. also, the workflows constructed can be exported as web-services. in this sense, kde bioscience acts not only the portal to access a bioinformatics grid but also a grid computing service provider. it provides a simple solution to the integration of distributed computing resources. to support collaborative work, user and user space management provide facilities for multiple users to share data and workflows. as a result, several biologists can work concurrently on the same bioinformatics project without data collision. moreover, the use of j ee allows flexibility of architecture and portability of applications. several uis-java gui, web page, and soap application program interface-are presented to fit for various usersÕ requirements. in addition, the robustness of kde bioscience benefits a lot from the robustness of java language itself and j ee. it is plain that workflow provides a workable mechanism to integrate data and algorithms. kde bioscience, which adopts workflow and j ee, provides an integrative platform for biologists to collaborate and use distributed computing resources in a simple manner. finally, it should be pointed out that, although our software brings a lot of advantages, the overhead caused cannot be ignored. the data transferring between different nodes cost much computing resource. while a large-scale date set is processed, the performance decline is noticeable, sometimes even intolerable. in the current version of kde bioscience, a practical solution is to transfer data address such as file location instead of data itself. however, this solution may limit its portability in some distributed computing environments. in summary, we demonstrated in this paper an integrative platform, kde bioscience that provides a bioinformatics framework to integrate data, algorithms, computing resources, and human intelligence. significantly, it allows biologists to simplify the usage of complicated bioinformatics software to concentrate more on biological questions. in fact, the power of kde bioscience comes from not only the flexible workflow mechanism but also more than included programs. with workflows, not only the analysis of nucleotide and protein sequences but also other novel calculations and simulations can be integrated. in this sense, kde bioscience makes an ideal integrated informatics environment for bioinformatics or future systems biology research. genbank: update uniprot: the universal protein knowledgebase the protein data bank kegg: kyoto encyclopedia of genes and genomes emboss: the european molecular biology open software suite phylip: phylogeny inference package (version . 
the research is funded by the hi-tech research and development program of china, grant no. aa . the authors thank alex michie for his help and expertise.
key: cord- -ftcs fvq authors: o'reilly-shah, vikas n.; gentry, katherine r.; van cleve, wil; kendale, samir m.; jabaley, craig s.; long, dustin r. title: the covid- pandemic highlights shortcomings in us health care informatics infrastructure: a call to action date: - - journal: anesth analg doi: . /ane. sha: doc_id: cord_uid: ftcs fvq nan
severe acute respiratory syndrome coronavirus- (sars-cov- ), the causative agent of coronavirus disease (covid- ), was designated a pandemic by the world health organization on march , . by that date, hundreds of thousands of people around the globe had been infected, and millions more are expected to suffer physically and economically from the effects of covid- . scientifically, the pace of progress toward understanding the virus has been dramatic and inspiring: the viral genome was rapidly determined, and a . -angstrom-resolution cryoelectron microscopy structure of the viral spike protein in prefusion conformation was published within weeks of its identification. initial small trials examining the impact of potential therapeutic agents have also been rapidly published; to date, more than clinical trials have been registered, including several for candidate vaccines. in contrast, other aspects of the international covid- response have not yet demonstrated similar progress. the need for rapid aggregation of data with respect to the epidemiology, clinical features, morbidity, and treatment of covid- has cast in sharp relief the lack of data interoperability both globally and between different hospital systems within the united states. this global-scale event demonstrates the critical public health and research value of data availability and analytic capacity.
specifically in the united states, although efforts have been made to secure the interoperability of health care data, countervailing forces have undermined these efforts for myriad reasons. in this study, we describe these forces and offer a call to policy action to ensure that health care informatics is positioned to better respond to future crises as they arise. efforts to develop a standard for health care data exchange have a long history, but the most promising arose from the passage of the health information technology for economic and clinical health act of (hitech). hitech created an economic motivation for the implementation of electronic health records (ehr) across the united states and is, for this purpose at least, widely viewed as successful. by , % of small rural hospitals and % of office-based physician practices possessed certified health information technology. notably, the staged approach to ehr adoption delayed interoperability requirements until the final stage of adoption. in the competitive us ehr vendor market, this delay led to differences in how vendors approached and implemented interoperability. although it appears that there is general consensus on the use of the substitutable medical apps, reusable technologies on fast healthcare interoperability resources (smart on fhir) standard developed by the nonprofit health level seven international (hl ) for the interchange of data, the standard is not specific enough to ensure, and regulators have failed to require, that different vendors implement the specification in compatible ways. this failure has necessitated the development of health care integration engine software products to bridge the gap, yet another source of financial inefficiency in us health care. furthermore, aspects of the smart on fhir specification remain incomplete. for example, there is no implementation guide for intraoperative anesthesia data, although one may be developed by . it is notable that the hl development work in anesthesiology is done entirely by volunteers. the interoperability framework offered by smart on fhir is, by itself, not sufficient for public health and research purposes. smart on fhir is specifically designed for patient-level data sharing. in the absence of regulations that mandate a specific solution, academicians have developed approaches to the organization and dissemination of standards that allow for multicenter data analyses. the observational health data sciences and informatics (ohdsi), a collaborative group of investigators mostly funded by public granting agencies, is presently in the sixth version of its observational medical outcomes partnership (omop) common data model. once an organization transforms its data into the omop model, as many have, it can participate in data analysis with any number of arbitrary partners through a federated mechanism. as with hl , there is no standard in omop for anesthesiology data, and standards for data from critical care environments remain underdeveloped. within anesthesiology, the multicenter perioperative outcomes group offers arguably the most comprehensive candidate common data model, although costs of participation are high, and most participating sites are academic centers. while the lack of standard specification by regulatory agencies has contributed to these challenges, emr vendors themselves have also played a role. exposing standardized data reduces barriers to adoption of competing ehr platforms, which clearly explains the reticence of vendors to do so. 
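to make the interoperability standards discussed above more concrete, the following sketch issues a standards-style query against a fhir server using the rest search convention (get [base]/Observation?...); the server url is a placeholder, the loinc code is shown only as an example of a covid- laboratory test code, and the oauth authorization step that a smart on fhir application would perform is omitted.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// illustrative sketch of a standards-based data pull using the fhir rest
// search convention. the base url is a placeholder, not a real endpoint,
// and authorization is intentionally left out to keep the example short.
public final class FhirSearchDemo {
    public static void main(String[] args) throws Exception {
        String base = "https://fhir.example.org/r4";                       // placeholder server
        String code = URLEncoder.encode("http://loinc.org|94500-6",        // example covid-19 test code
                                        StandardCharsets.UTF_8);
        String query = base + "/Observation?code=" + code + "&_count=50";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(query))
                .header("Accept", "application/fhir+json")
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // the response body is a fhir bundle in json; parsing is left out here.
        System.out.println("status: " + response.statusCode());
        System.out.println(response.body().substring(0, Math.min(200, response.body().length())));
    }
}
```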
this year, the chief executive officer of a dominant us ehr vendor wrote a letter in which it urged its customers to oppose proposed regulations that would simplify the sharing of patient data; perhaps unsurprisingly, vendors with less market share and other companies attempting to enter the space voiced support for those same regulations. [ ] [ ] [ ] amidst the covid- crisis, further delays in regulatory implementation are under consideration at the very time that data sharing is urgently needed. it is worthwhile to note that the widespread penetration of ehrs into hospital systems facilitated by the hitech act did allow individual systems to react and adapt to the covid- pandemic in intelligent, data-driven ways. as an example, uw medicine-one of the first health care systems in the united states to encounter the disease-developed a comprehensive set of information technology solutions in response to the pandemic, including order sets, documentation templates, and dashboards. the value of the ability to rapidly collate and present information at the institutional level should not be underestimated, even as the potential benefits of interinstitutional data sharing during a pandemic remain as yet unrealized. the framework of proportionality is helpful for considering the ethical ramifications of broad data sharing, especially as seen through the lens of a pandemic. it is critical to balance the probable public health benefits of an intervention with the potential infringements on patient privacy or autonomy. the many benefits of real-time data sharing in the context of a global health care emergency have already been outlined. to briefly recap, if hospitals across the country were able to observe and interpret data being gathered at other institutions in real time and to contribute their own data to the shared repository, the health care system could be learning about and improving its care of covid- patients continuously and collaboratively, based on the sum total of available information rather than incrementally in silos. even as biomedical publishing gradually evolves to become more agile and rapid, traditional approaches to medical knowledge creation and dissemination remain unacceptably slow and continue to permit the dissemination of inaccurate information in the midst of a pandemic. indeed, calls have been made to address the ongoing "infodemic" (as it has been dubbed by the world health organization). additionally, the sharing of data across health systems would hold hospitals accountable for providing care that is consistent with agreed-upon ethical principles during public health crises, such as allocating treatments in ways that maximize the number of lives saved and treating patients equitably with regard to race, ethnicity, and insurance status. who would monitor and report back on such issues? the us centers for disease control national healthcare safety network (nhsn), established to gather data on (primarily bacterial) health care-associated infections, provides a model for centralized aggregation and reporting but would require heavy revision for our purposes. because the system relies on manual case review and entry, data captured are delayed and results are aggregated on a quarterly basis, too slow and too error prone in the context of a rapidly evolving pandemic. the centralized approach also introduces concerns related to oversight and performance penalties, as well as barriers to use by academic researchers. 
unlike the nshn, such a system would need to automate aggregation to real-time or near realtime status, provide mechanisms to allow research use of data, provide systems for deidentification of data and protections against reidentification of patients, and potentially be firewalled from traditional quality and pay-for-performance reporting purposes to maximize public health surveillance and research capabilities. potential harms that must be considered include breaches of patient privacy, premature decisionmaking based on preliminary or inaccurate information, and the potential misuse or misinterpretation of shared data. privacy concerns have been raised by ehr companies and health care providers as a major reason not to enter into data-sharing agreements. while it is true that the risk of data breaches might increase with increased interoperability, they need not necessarily become more probable. effectively implementing safeguards around encryption, authentication, and data use can mitigate these risks (the risk of ehr data exposure is not, eg, uniquely greater than financial data compromise), which must be balanced against the potential benefits to patients and public health. there are few remaining legal barriers to the sharing of health information. however, legal, ethical, and logistical challenges arise when a health care system houses data that are not necessarily from that system's patients. large institutions may serve as reference laboratories for broad geographic areas and therefore house assay data from external clients that may or may not have agreed to this type of data sharing. indeed, without careful handling, inclusion of outside clients' results, when combined with data from other regional systems, may lead to unrecognized data duplication. institutions must also consider how they will manage and protect the data generated from testing their own employees in the context of a pandemic. apart from legal restrictions on handling of employee health information that stand apart from health insurance portability and accountability act (hipaa) restrictions, there are ethical challenges in understanding how these data might best be used to study the risks to health care workers while also respecting health care worker privacy. on balance, the ethical obligation, then, is for the companies facilitating data sharing and/or storage to ensure their systems meet the highest standards for security. by contrast, risks to privacy may actually be increased as long as ehr systems are not interoperable, given that patient data may be scattered across multiple systems. other risks that may accompany the sharing of real-time clinical data should be acknowledged. for example, the information itself may be inaccurate due to charting errors or coding inconsistencies. decisionmakers may jump to premature or biased conclusions based on apparent associations between an infectious disease and groups that have been the object of adverse implicit or explicit association bias (eg, racial and ethnic groups, homeless, prisoners, sex workers), leading to further stigmatization and limited access to care. such risks might be increased in the setting of a global crisis characterized by a rapidly spreading virus, widespread fear, and unreliable media sources. on balance, however, our view is that there are no public health benefits to the status quo. 
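one concrete element of the deidentification requirement mentioned earlier in this section is pseudonymization of direct identifiers before records leave a site. the sketch below shows a minimal salted-hash approach; the salt handling, field names, and record layout are assumptions made for illustration, and a real system would also have to address quasi-identifiers, re-identification risk, and key governance.

```python
# Minimal sketch of one deidentification step: replacing direct patient
# identifiers with salted, one-way hashes before records are shared.
# The salt and field names are illustrative assumptions, not a complete
# deidentification scheme.

import hashlib
import hmac

SITE_SALT = b"example-site-secret"   # hypothetical per-site secret

def pseudonymize(patient_id: str) -> str:
    """Return a stable, non-reversible token for a patient identifier."""
    return hmac.new(SITE_SALT, patient_id.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:16]

record = {"patient_id": "A-1001", "icu_admit": True, "age_band": "60-69"}
shared = {**record, "patient_id": pseudonymize(record["patient_id"])}
print(shared)
```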
proprietary control over ehr data benefits only ehr vendors themselves-who profit from institutional contracts and inhibitors to marketplace competition-and their customers-who may retain patients by virtue of limited or absent interoperability. the harms of the status quo include increased health care costs, such as duplicate testing when records are not transferable. the failure to implement interoperable health care records may also harm patients by trapping their data in balkanized systems, keeping physicians from accessing needed information in an efficient manner. access to prior documentation of critical conditions (eg, a difficult airway or history of malignant hyperthermia or critical aortic stenosis) would allow anesthesiologists to make safer, more efficient diagnostic and care decisions. frontline providers shouldering the burdens of health care under pandemic conditions are rapidly realizing that competent physicians and other health care workers can only go so far to solve problems that arise from systemic dysfunction. lack of data infrastructure inhibits communication and study of rapidly evolving clinical practice. hospitals within blocks of each other are relying on ad-hoc interpersonal communications rather than working from a coherent multiorganizational playbook. the seamless capability to share ideas, care plans, and experiences based on reliable data would dramatically alter the us health care landscape. on a smaller scale, interoperability challenges also exist within hospital systems or single hospitals themselves. lack of data interoperability at the device level has ensured that hospital systems have to navigate and manage streams of data from diverse legacy devices, creating challenging data acquisition issues in the context of a surging number of covid- cases. www.anesthesia-analgesia.org anesthesia & analgesia covid- and data infrastructure shortcomings when confronted with a novel disease process, small and often poorly conducted studies rapidly proliferate. these studies are disseminated in mass and social media and may drive therapeutic decisions that could be ineffective at best and cause substantial harm at worst. in the context of covid- , a context where millions have contracted the disease and hundreds of thousands will likely die, timely but robust science is needed. the ability to share and combine data across systems serves as the foundation of such efforts. with data standardization and sharing, variability in care approaches could be harnessed to identify best practices and therapeutic avenues in a much more cohesive, data-driven manner. several concrete examples are illustrative. infection control procedures and equipment or medication shortages related to covid- are significantly impacting the timing of surgery, default approach to airway management, maintenance of anesthesia, and the setting in which postoperative monitoring occurs. such rapidly developed policies are intended to protect anesthesia providers and other health care workers and to conserve critical resources, but is there a signal for patient harm associated with such sudden and profound changes in practice? additionally, anesthesia departments are increasingly relying on the results of preoperative sars-cov- testing to guide such policies. 
the efficacy of these screening systems (particularly when applied to asymptomatic patients or those in whom such a determination is not possible) is unknown but is of critical importance for airway management, for determining personal protective equipment requirements during anesthetic care, and for determining safe postoperative disposition. collectively, surgical patients undergoing preoperative evaluation are poised to become the largest cohort of asymptomatic patients tested for sars-cov- , and yet the power of this potential resource to broadly inform health care policy will likely go unutilized. unexpected but fundamentally important aspects of this emerging disease, such as the large number of patients presenting for endovascular therapy for acute ischemic stroke, may be uncovered through coordinated approaches to discovery. finally, there has been a rapid shift toward the use of anesthesia machines to meet surge demands for mechanical ventilation. reasonable evidence exists to suggest that modern anesthesia machines are virtually indistinguishable from intensive care unit (icu) ventilators; however, icu ventilators are more fault tolerant, handle circuit leaks more optimally, and handle fresh gas in very different ways. anesthesia machines set improperly and operated by health care providers unfamiliar with their use may unnecessarily waste medical gases or (in the worst case) deliver hypoxic gas mixtures in the context of inadequate oxygen flow into the circle system. again, the impact of such a rapid retasking of medical equipment will, under the current infrastructure, remain unknown for much longer than should be necessary. the public has a pressing interest in ensuring that data standards (eg, omop, fhir) are rapidly developed, adopted by appropriate international standards organizations (eg, hl ), and implemented by ehr vendors in a manner that facilitates interoperability for individual patient care, public health, and research purposes. we agree with others that this will require changes to the regulatory environment created by the hipaa. anesthesiologists, along with nurses, respiratory therapists, advanced practice providers, emergency room physicians, intensivists, and other critical care professionals, stand at the front line of the covid- public health crisis. better data are required to delineate every aspect of this pandemic: supporting local operations and quality work; informing research queries, such as investigations into provider risk following airway management and quantifying the efficacy of therapeutic options; and bolstering public health efforts by providing real-time prevalence, tracking disease spread, and facilitating risk stratification. integration of health care data with nonhealthcare source data is currently an impossibility in the united states due to lack of a universal health care identifier. public funding agencies and their grantees have shouldered the burden of creating stopgap solutions that policymakers have failed to require and major ehr vendors have avoided due to risk of competitive disadvantage. policymakers and funders are called upon to prioritize the modernization of health informatics. anesthesiologists and our specialty societies are called upon to advocate policymakers for these changes and to involve themselves in these organizations in the coming months and years and contribute to development or otherwise risk failing again in optimizing a data-driven response to the next pandemic. 
e a new coronavirus associated with human respiratory disease in china cryo-em structure of the -ncov spike in the prefusion conformation search of: covid- -list results -clinicaltrials the office of the national coordinator for health information technology. health it quick stats g . best healthcare integration engines software in . available at hl international. anesthesia -documents. available at epic's ceo is urging hospital customers to oppose rules that would make it easier to share medical info cerner growing ehr market share with increased hospital consolidation: klas. fiercehealthcare cerner call for interoperability rule release hhs considers rolling back interoperability timeline amid covid- . healthcare dive responding to covid- : the uw medicine information technology services experience teaching seven principles for public health ethics: towards a curriculum for a short course on ethics in public health programmes pseudoscience and covid- -we've had enough already truth in reporting: how data capture methods obfuscate actual surgical site infection rates within a health care network system legal barriers to the growth of health information exchange-boulders or pebbles? anesthetic management of endovascular treatment of acute ischemic stroke during covid- pandemic: consensus statement from society for neuroscience in anesthesiology & critical care (snacc)_endorsed by society of vascular & interventional neurology (svin) perioperative documentation and data standards--anesthesiology owned and operated balancing health privacy, health information exchange and research in the context of the covid- pandemic the us lacks health information technologies to stop covid- epidemic the authors declare no conflicts of interest.reprints will not be available from the authors. key: cord- - gc dc authors: mcgarvey, peter b.; huang, hongzhan; mazumder, raja; zhang, jian; chen, yongxing; zhang, chengdong; cammer, stephen; will, rebecca; odle, margie; sobral, bruno; moore, margaret; wu, cathy h. title: systems integration of biodefense omics data for analysis of pathogen-host interactions and identification of potential targets date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: gc dc the niaid (national institute for allergy and infectious diseases) biodefense proteomics program aims to identify targets for potential vaccines, therapeutics, and diagnostics for agents of concern in bioterrorism, including bacterial, parasitic, and viral pathogens. the program includes seven proteomics research centers, generating diverse types of pathogen-host data, including mass spectrometry, microarray transcriptional profiles, protein interactions, protein structures and biological reagents. the biodefense resource center (www.proteomicsresource.org) has developed a bioinformatics framework, employing a protein-centric approach to integrate and support mining and analysis of the large and heterogeneous data. underlying this approach is a data warehouse with comprehensive protein + gene identifier and name mappings and annotations extracted from over molecular databases. value-added annotations are provided for key proteins from experimental findings using controlled vocabulary. the availability of pathogen and host omics data in an integrated framework allows global analysis of the data and comparisons across different experiments and organisms, as illustrated in several case studies presented here. 
( ) the identification of a hypothetical protein with differential gene and protein expressions in two host systems (mouse macrophage and human hela cells) infected by different bacterial (bacillus anthracis and salmonella typhimurium) and viral (orthopox) pathogens suggesting that this protein can be prioritized for additional analysis and functional characterization. ( ) the analysis of a vaccinia-human protein interaction network supplemented with protein accumulation levels led to the identification of human keratin, type ii cytoskeletal protein as a potential therapeutic target. ( ) comparison of complete genomes from pathogenic variants coupled with experimental information on complete proteomes allowed the identification and prioritization of ten potential diagnostic targets from bacillus anthracis. the integrative analysis across data sets from multiple centers can reveal potential functional significance and hidden relationships between pathogen and host proteins, thereby providing a systems approach to basic understanding of pathogenicity and target identification. the niaid (national institute of allergy and infectious diseases) biodefense proteomics program, established in , aims to characterize the pathogen and host cell proteome by identifying proteins associated with the biology of microbes, mechanisms of microbial pathogenesis and host responses to infection, thereby facilitating the discovery of target genes or proteins as potential candidates for the next generation of vaccines, therapeutics, and diagnostics [ ] . the program includes seven proteomics research centers (prcs) conducting state-of-the-art high-throughput research on pathogens of concern in biodefense and emerging/ reemerging infectious diseases, as well as a biodefense resource center for public dissemination of the pathogen and host data, biological reagents, protocols, and other project deliverables. the prcs work on many different organisms, covering bacterial pathogens (bacillus anthracis, brucella abortus, francisella tularensis, salmonella typhi, s. typhimurium, vibrio cholerae, yersinia pestis), eukaryotic parasites (cryptosporidium parvum, toxoplasma gondii), and viral pathogens (monkeypox, sars-cov, vaccinia). the centers have generated a heterogeneous set of experimental data using various technologies loosely defined as proteomic, but encompassing genomic, structural, immunology and protein interaction technologies, as well as more standard cell and molecular biology techniques used to validate potential targets identified via high-throughput methods. in addition to data, the prcs have provided biological reagents such as clones, antibodies and engineered bacterial strains, other deliverables include standard operating procedures (sops) and new technologies such as instrumental methods and software tools and finally publications related to all of these activities. consequently, there were a number of unique challenges facing the resource center: (i) how to coordinate with the seven prcs with various pathogens, technologies, processes, and data types; (ii) how to provide seamless integration of three institutions that make up the resource center; and (iii) how to provide timely and effective dissemination of newly discovered information to the user community. 
in particular, due to the breadth of the program, the potential user community is quite broad, from technology or informatics experts who may want to reanalyze the data or develop better algorithms, to a wide group of biomedical scientists who are interested in mining the data for their own studies or just finding new information on a protein or gene of interest quickly and easily. accordingly, we developed a set of functional requirements early in the biodefense resource center development: (i) to implement a center-specific submission protocol and data release plan for timely dissemination, (ii) to promote data interoperability, adopting common standards (such as hupo proteomic standards initiative [ , , ] ), defining a core set of metadata with mapping to controlled vocabularies and ontologies, recommending preferred ids for gene/ protein mapping, and (iii) to provide value-added annotation to capture key findings and integration of the data with related resources for functional interpretation of the data. available online at http://proteomicsresource.org, the architecture, initial content and general features of the biodefense proteomics resource were briefly described elsewhere [ ] . a breakdown of the resources content by organism, prc and other criteria can be seen at: http://www. proteomicsresource.org/resources/catalog.aspx. tutorials and help are provided on the website: http://www.proteomicsresource. org/resources/tutorials.aspx , http://proteininformationresource. org/pirwww/support/help.shtml# the objective of this study is to provide a systems approach to the study of pathogen-host interactions, connecting the various types of experimental data on genomics, proteomics and host-pathogen interactions with information on pathways, regulatory networks, literature, functional annotation and experimental methods. having most of this information accessible in one place can facilitate knowledge discovery and modeling of biological systems. like many problems in data integration, it is easy to know the general outline of what we want, but often much harder to implement and navigate the information, especially if the original data crosses disciplinary, laboratory and institutional boundaries. here we describe in detail a protein-centric approach for systems integration of such a large and heterogeneous set of data from the niaid biodefense proteomics program, and present scientific case studies to illustrate its application to facilitate the basic understanding of pathogen-host interactions and for the identification of potential candidates for therapeutic or diagnostic targets. several scientific use cases are presented that illustrate how one can search varied experimental data from different laboratories and even ones researching different infectious organism and their hosts to make potentially useful connections that could lead to new hypotheses and discoveries. based on the functional requirements of the resource center, we developed a bioinformatics infrastructure for integration of prc deliverables ( figure ). in our workflow, multiple data types from prcs are submitted to the center using a data submission protocol and standard exchange format, with the metadata using controlled vocabulary whenever possible. for functional interpretation of the data, we then map the gene and protein data based on identifier (id) mapping or if necessary using peptide or sequence mapping to proteins in our data warehouse described below. 
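the identifier-mapping step in this workflow can be pictured as a join between submitted results (carrying mixed accession types) and a mapping table keyed to a single protein-centric accession. the pandas sketch below is a toy stand-in for that step: the mapping rows and accessions are invented, and anything left unmapped would fall through to the manual or sequence-based mapping discussed later.

```python
# Sketch of the identifier-mapping step: submitted results carry mixed
# accession types (RefSeq, GI, IPI, ...) that are resolved to one
# protein-centric key before loading. The mapping table is a toy stand-in
# for the PIR/iProClass mapping files.

import pandas as pd

id_mapping = pd.DataFrame({
    "source_id": ["NP_000001.1", "12345678", "IPI00021439"],
    "uniprot_acc": ["P04637", "P68871", "P60709"],
})

submitted = pd.DataFrame({
    "source_id": ["NP_000001.1", "IPI00021439", "XP_999999.1"],
    "dataset": ["ms_run_1", "ms_run_1", "ms_run_2"],
})

mapped = submitted.merge(id_mapping, on="source_id", how="left")
unmapped = mapped[mapped["uniprot_acc"].isna()]
print(mapped)
print("needs manual or sequence-based mapping:", unmapped["source_id"].tolist())
```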
all of the databases, along with information on the prcs and organisms under study are listed in the proteomics catalog accessible from the web portal. the key design principal in the resource center is protein-centric data integration. here the diverse experimental data are integrated and presented in a protein-centric manner where information is queried and presented via common proteins and connected to experimental data and the network of protein attributes, including information on the encoding genes, protein families, pathways, functions and more. in practice a protein-centric approach works well as proteins occupy a middle ground molecularly between gene and transcript information and higher levels of molecular and cellular structure and organization. proteins are often the functional molecules in biological processes described by pathways, molecular interactions and other networks. protein families, in turn, have proven to be invaluable in studying evolution and for inferring and transferring functional annotation across species. underlying the protein-centric data integration is a data warehouse called the master protein directory (mpd) where key information is extracted from the primary data and combined for rapid search, display and analysis capabilities. the mpd is built on the data and capabilities of iproclass [ ] a warehouse of protein information, which in turn is built around uniprotkb [ ] but supplemented with additional sequences from gene models in refseq [ ] and ensembl [ ] and additional annotation and literature from other curated data resources such as model organism databases [ , , , , , ] and generif [ ] . the biodefense data are essentially additional data fields added to a subset of iproclass entries to create the mpd. currently the mpd defines and supports information from the following types of data produced by the prcs: mass spectrometry, microarray, clones, protein interaction, and protein structure. more data types or attributes may be added in the future if needed. supplemental table s shows the common and unique fields used in the mpd for each data type. an advantage of the data warehouse design is that, if needed, additional fields can be extracted from the primary data and easily added as new attributes without greatly altering the existing database design or query mechanisms. the mpd data is stored in an oracle database along with iproclass data. the mpd including the website and ftp files is updated every weeks in conjunction with iproclass or whenever new prc data is released. the various proteomics research centers all used different sources and identifiers for the nucleotide and protein sequences in their analysis pipelines and occasionally would change sources depending on the experiment. this is a common problem encountered when attempting to combine data across research laboratories unless identical sequence databases, processes, platforms and organism names are used. examples of database identifiers used include genbank/embl/ddbj accessions and locus tags, unigene accessions, refseq accessions, ipi accessions, ncbi gi numbers and ids unique to a sequencing center or organism-specific database. the first step was to map all experimental results to a common representation of a protein. this was achieved by mapping all protein and gene ids and names to iproclass proteins. 
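to make the protein-centric warehouse idea concrete, the sketch below sets up a much-simplified schema in SQLite (the production system uses Oracle and many more fields): one table of proteins keyed by accession, plus a result table that attaches findings from different data types to those accessions. the table and column names are illustrative, not the actual MPD schema.

```python
# Simplified sketch of a protein-centric warehouse: one row per protein,
# with experimental findings from different data types attached by
# accession. Not the production Oracle schema.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    uniprot_acc  TEXT PRIMARY KEY,
    protein_name TEXT,
    organism     TEXT
);
CREATE TABLE experiment_result (
    uniprot_acc       TEXT REFERENCES protein(uniprot_acc),
    data_type         TEXT,  -- 'mass_spec', 'microarray', 'interaction', ...
    dataset_id        TEXT,
    expression_change TEXT   -- e.g. 'increase', 'decrease', 'detected'
);
""")
conn.execute("INSERT INTO protein VALUES ('P04637', 'example protein', 'Mus musculus')")
conn.execute("INSERT INTO experiment_result VALUES ('P04637', 'mass_spec', 'ms_run_1', 'detected')")

rows = conn.execute("""
    SELECT p.uniprot_acc, p.protein_name, r.data_type, r.dataset_id
    FROM protein p JOIN experiment_result r USING (uniprot_acc)
""").fetchall()
print(rows)
```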
the majority of the mapping using ids from public resources was done using mapping services and tables provide on the protein information resource (pir) web site (http:// proteininformationresource.org/pirwww/search/idmapping.shtml) and ftp site (ftp://ftp.pir.georgetown.edu/databases/iproclass/). however, some mapping problems needed to be addressed either by automated rules, direct sequence comparisons or manual analysis and annotation. mapping difficulties fell into categories: ) one-to-many mappings: a common problem, especially when eukaryotic host proteins derived from alternate splicing or viral polyproteins are involved. uniprotkb usually merges information on alternate splice forms or polyproteins, which helped minimize this problem for our purposes, but in cases where multiple mappings exist, we selected as most informative the entry in the manually reviewed uniprotkb/ swissprot section; if no swissprot entry was found, the longer sequence in uniprotkb/trembl section was selected. users could always find the alternate mappings via the iproclass related sequences link on the mpd entry page to precompiled blast results on all iproclass sequences. ) retired sequences: genomic sequences from databases such as refseq, ipi, or unigene are not static and, as information changes, some gene predictions and translations are retired with each new build. retired sequences often required manual mapping by a curator to match original gene or peptide results to current protein sequences. primarily uniparc (uniprot sequence archive) [ ] was used for this purpose. ) protein sequences not available in iproclass or any public repository: this occurred most often with toxoplasma gondii whose genome sequencing was still in progress and stable builds were not yet available. however, the problem also occurred in some well-characterized organisms like vibrio cholera and bacillus anthracis. several data sets contained information on annotated but not translated pseudogenes. in the case of vibrio cholera, of pseudogenes cloned and sequenced by the harvard institute of proteomics did not contain the annotated point mutation or frame shift and appeared to produce full-length proteins [ ] . in the case of bacillus anthracis, microarrays containing probes for of annotated pseudogenes also showed significant changes in rna expression in response to infection or other treatments [ , , ] . in these cases, new database entries were created in the mpd to house the results. ) alternate species or strain representations: several experimental data sets reported sequence identifiers for strains or variants other than the one used in the experimental sample. this is not an uncommon situation as the genetically most characterized variant is often an attenuated laboratory strain while the more virulent strains are either not yet sequenced or the sequence is of lower quality. often microarray chips or mass spectrometry databases are designed using the best available sequence from the research strain, yet then use rna and protein samples from another similar virulent strain. the question here was what organism strain to map to and represent on our website: the strain the rna or protein the sample came from, or alternatively, the strain that matched the identifier and sequence used in the research to detect the rna or protein. for the mpd we chose to map to the sequence identifier reported in the data files and related publications, with the additional virulent strain information noted in the results summary. 
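the one-to-many mapping rule described above (prefer a reviewed UniProtKB/Swiss-Prot entry, otherwise take the longest UniProtKB/TrEMBL sequence) can be written as a small selection function; the candidate record format below is an assumption made for illustration.

```python
# Sketch of the one-to-many mapping rule: prefer a reviewed
# UniProtKB/Swiss-Prot entry; otherwise fall back to the longest
# UniProtKB/TrEMBL sequence. Candidate accessions are invented.

def pick_preferred(candidates):
    """candidates: list of dicts with 'acc', 'reviewed' (bool), 'length'."""
    if not candidates:
        return None
    reviewed = [c for c in candidates if c["reviewed"]]
    if reviewed:
        return reviewed[0]["acc"]
    return max(candidates, key=lambda c: c["length"])["acc"]

candidates = [
    {"acc": "Q9XYZ1", "reviewed": False, "length": 250},
    {"acc": "P12345", "reviewed": True,  "length": 240},
    {"acc": "Q8ABC2", "reviewed": False, "length": 310},
]
print(pick_preferred(candidates))  # -> 'P12345'
```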
data mining design goals. in consultation with niaid and prcs within the project's interoperability working group, we developed goals for data mining. ) all project data and other deliverables should be available via browsing and simple keyword searches. ) the data and information provided by the resource should be sufficient to allow a skilled researcher to download and reanalyze or mine the data for additional information. ) our target user was a biomedical scientist not expert in the technology used to produce the data or in bioinformatics, thus the data, procedures, publications and general results and conclusions of an analysis should be relatively easy to find on the project website for someone not familiar with the details of the particular technologies used to generate it. to allow both simple keyword searches and also boolean searches of the project data, we did the following: ) we included in the mpd only ''validated'' results that were determined by the research centers to be significant using their methods. to do otherwise would confuse users not familiar with the technology and how results are filtered. results that fell below the significance threshold used by the research center were made available via download of data sets at the ftp site. ) ''raw'' unprocessed machine specific data would be stored by the prc and available on request. ) to facilitate and simplify searches across laboratories and data types, we omitted most data type or analysis specific numerical values and statistics from the general mpd search and display. these numerical values are usually platform, laboratory and method dependent and cannot easily be used to compare across datasets so including them might be confusing to users. instead, we focused on providing simple, yet powerful, queries of experimental summaries where a user can query if a gene/protein was presented in the results and, in some cases, if it showed a reported increase or decrease in expression or accumulation based on the prc's criteria. once a set of proteins of interest is identified, a user can then drill down to view the specific experimental values and methods employed to generate the particular dataset. links to details in the publications are provided and full data are available via ftp. since all protein attributes are included as search options, one can query beyond simply protein names, accessions or project data and search pathways, protein families, gene ontology (go) terms, database cross-references and many other attributes, providing many powerful options to the users. to provide a robust text search for the website, we used the pir text indexing system [ ] in which over text fields and unique identifiers from the mpd database are indexed using callable personal librarian (cpl) [ ] which supports fast exact text search, substring & wildcard text search, range search and boolean searches. entry indexing and retrieval is supported by oracle. i: unstructured keyword search. a simple keyword search was implemented on every page of the resource's website. this searches all fields in the mpd. to further facilitate searches, the protein name field is supplemented by also searching the biothesaurus [ ] containing all gene and protein name synonyms and textual variants for each protein from over data sources. in addition, the default option searches all text from the pubmed abstract for all project publications, an abstract of each technology and all text in sops. 
the text indexed for publications, technologies and sops was annotated with additional standard keywords to facilitate searches. figure shows the results of a simple keyword search with hits for proteins, reagents, publications, technologies and sops. ii: structured text search. the mpd database contains over fields derived from iproclass and proteomics research center's data. currently of these fields are available for individual searches and can be combined with boolean operators as seen in some of the use case examples in this paper. the protein-centric search results are presented in a customizable tabular format where users can add or delete columns. currently fields can be customized. the tabular display has two modes: ) a default mode which displays fields common to all the supported data types and ) a data type specific mode which restricts the results to a particular data type and displays fields specific to that data type. see supplemental table s for a list of data type specific fields. additional filters for proteomic research center and organism are available as pull down menus to aid browsing and viewing the results of queries. examples of these functions are illustrated in the scientific use cases below and in help pages and tutorials available on the website. http:// www.proteomicsresource.org/resources/tutorials.aspx, http:// proteininformationresource.org/pirwww/support/help.shtml# . the resource currently contains information on , proteins from datasets, , reagents, sops, technologies and manuscripts. table shows statistics on proteins in the mpd. currently , % of the proteins are uncharacterized in that they are called either ''uncharacterized'' or ''hypothetical'' and have no other functional annotations or functional domains. of these uncharacterized proteins, , % have experimental data, such as mass spectrometry, microarray, or protein interactions, associated with them. the remaining % of uncharacterized proteins are available as full length clones for further research. though the program is focused on pathogen proteins, about % of the proteins are host proteins from mouse or human, as cell lines or tissue samples from both these organisms were used as infection models for multiple pathogens. simple keyword search. due to the popularity of internet searches, support of unstructured keyword queries, even for structured data, has become critical for any web site. to support this feature, the default protein-centric search returns results for all project deliverables, data, reagents, protocols, technologies and publications. an example is shown in figure where a single keyword search ''bacillus anthracis'' finds , pathogen and host proteins with mass spec, microarray or protein interaction data and also , reagents in the master reagent directory (mostly orf and y h clones and mutant bacterial strains of bacillus anthracis), sops, project publications and technologies. the matched fields column allows users to refine their queries and construct simple boolean searches. example i: integrative analysis. structured fields allow boolean queries across organisms, data types and laboratories. figure shows a query where we make use of the controlled vocabulary in the ''expression condition'' field and searched for common host proteins detected in studies of bacillus anthracis and salmonella typhimurium infection in mouse macrophage cell lines. 
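the sketch below gives a toy version of keyword search with synonym expansion, standing in for the CPL text index and the biothesaurus name expansion described above; the records, tokens, and synonym table are invented for the example and no attempt is made to reproduce the actual indexing system.

```python
# Toy keyword search with synonym expansion: a tiny inverted index plus a
# synonym table, standing in for the CPL index and BioThesaurus lookup.
# Records and synonyms are invented for illustration.

from collections import defaultdict

SYNONYMS = {"pfk": {"pfk", "phosphofructokinase"}}

records = {
    "P1": "phosphofructokinase type c mouse macrophage mass spectrometry",
    "P2": "hypothetical protein bacillus anthracis microarray",
}

index = defaultdict(set)
for rec_id, text in records.items():
    for token in text.split():
        index[token].add(rec_id)

def search(term):
    hits = set()
    for t in SYNONYMS.get(term.lower(), {term.lower()}):
        hits |= index.get(t, set())
    return sorted(hits)

print(search("PFK"))        # synonym expansion -> ['P1']
print(search("anthracis"))  # -> ['P2']
```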
currently proteins meet these criteria, mostly from mass spectrometry studies by pnnl and the university of michigan; however, if we further restrict the search to include only those also detected in microarray experiments, we find proteins with mass spec data from s. typhimurium infections at one research center, and mass spec and microarray studies done using b. anthracis at another research center. from the customizable results display shown in figure we can view summary information on the proteins detected. full details on the protein and individual experimental results are available via links [ ] . some benefits of the protein-centric mapping approach are visible in the default display. ) minimizing redundancy, the column prc id shows the different identifiers from different databases used by the research centers. in the case of one protein q wua /k pp_mouse -phosphofructokinase type c (ec . . . ), a total of identifiers from different databases (unigene, refseq, ipi, nr) were reported in the research results that all represent either the gene or protein sequence for this single mouse protein. ) discovery of additional experimental information from other studies. fifteen of the sixteen proteins found in the query also have mass spec data from caprion proteomics, as indicated in the 'experiment' column by caprion_ , and _ . caprion is studying brucella abortus and using a similar mouse macrophage model. currently, only the data for uninfected macrophages is available from caprion. additional data on brucella and mouse proteins from bacterial infected macrophages should be included in future releases. from the results display, one can follow links to view the iproclass protein report with executive summaries of the results from the prcs and information collected from over public resources or drill down to view the specific peptides or expression values seen in these studies, read publications and methods about the experiments or download the data for additional analysis. with a comprehensive protein data warehouse, one can also broaden the search for relevant data and information from related organisms using protein cluster or family information. in figure we illustrate this by selecting one protein, k _mouse, an uncharacterized protein seen in three datasets from infected mouse macrophages. using the uniref cluster id [ ] to query for all proteins with at least % identity and no gaps, we find the human homolog k _human was detected in hela cells infected with vaccinia and monkypox virus. if we do a batch retrieval using uniref ids from all original mouse proteins, we find human homologs were also detected in studies of orthopox infection (not shown). the identification of a hypothetical protein with differential gene and protein expressions in two host systems (mouse macrophage and hela cells) infected by different bacterial (bacillus anthracis and salmonella typhimurium) and viral (orthopox) pathogens suggest that this protein should be prioritized for additional analysis and functional characterization. [ , , ] . although different laboratories use different sample preparation, detection and analysis techniques making some direct comparisons difficult, having the data together in one place allows queries and comparisons between proteins and gene sets to be combined and additional analysis undertaken. in this example, data from different labs and data types are combined for further analysis. 
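queries like the one in this example amount to set operations over a table of (protein, expression condition, data type) records. the pandas sketch below reproduces the shape of that query on invented rows: proteins reported under both infection conditions, then restricted to those with microarray support.

```python
# Sketch of a cross-dataset boolean query: proteins reported under both
# infection conditions, then restricted to those also seen in microarray
# data. All rows below are invented.

import pandas as pd

results = pd.DataFrame([
    {"acc": "K1", "condition": "b_anthracis_infected_macrophage",   "data_type": "mass_spec"},
    {"acc": "K1", "condition": "s_typhimurium_infected_macrophage", "data_type": "mass_spec"},
    {"acc": "K1", "condition": "b_anthracis_infected_macrophage",   "data_type": "microarray"},
    {"acc": "K2", "condition": "s_typhimurium_infected_macrophage", "data_type": "mass_spec"},
])

in_anthracis  = set(results.loc[results.condition.str.startswith("b_anthracis"), "acc"])
in_salmonella = set(results.loc[results.condition.str.startswith("s_typhimurium"), "acc"])
both = in_anthracis & in_salmonella

with_microarray = set(results.loc[results.data_type == "microarray", "acc"])
print(sorted(both))                    # seen in both infections
print(sorted(both & with_microarray))  # ...and also in microarray data
```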
a query of the mpd on data type = ''interaction'' and organism name = ''virus'' finds vaccinia virus proteins with interactions with human proteins determined by myriad genetics. browsing the results display shows that of the proteins were also seen in mass spec work published by pnnl [ ] . by combining the two experimental data sets experiment = ''myriad_ '' and ''pnnl_ms_ '', we find virus and human proteins with both mass spec and interaction data from each laboratory. further investigation into the experimental details shows that a protein interaction network was determined by myriad genetics using a yeast-two hybrid assay to screen viral bait proteins against a library of human prey proteins cloned from different tissues. the mass spec data from pnnl was obtained from viral preparations isolated from infected human hela cells. the work from pnnl contained quantitative information in the form of spectral counts and accurate mass tag [ ] intensities, downloadable from the project ftp site. we combined all the vaccinia plus human interaction data with peptide counts for each protein and visualized the results using cytoscape [ , ] . the complete network of results is shown in figure a . various methods have been tried and compared for filtering large intra-species interaction networks to limit false positives and to select the biologically relevant interactions [ , , , , , ] . relatively little has yet been done for inter-species pathogen-host networks. several common factors that have proved useful in other studies are ) evaluating network hubs with many interactions and ) using correlations between interacting pairs with similar gene expression patterns. for this small interactome, we looked for relatively abundant proteins associated with multiple interactions. we identified a single human protein interacting with three viral proteins, three being the largest number of viral interactions seen with a single host protein in this data set. the interactions are shown in figure b . the host protein keratin, type ii cytoskeletal (p ) interacts with three viral proteins (p - kda fusion protein, a l; q h -chemokine-binding protein c /b ; p -protein c l). the kda fusion protein a l is the most abundant protein seen in this data set and participates in virus penetration at during cell fusion [ , ] . a l facilitates initial attachment to cells by binding to glycosaminoglycans [ ] . a l is found in all orthopoxviruses and has no cellular or entomopoxvirus homologs. additional viral proteins involved in attachment (d l, h l) [ , ] and fusion (f l, i l) [ , ] were not observed. the chemokine-binding protein c l belongs to a family of poxvirus chemokine-binding proteins that mimic the chemokine response and prevent activation and chemotaxis of leukocytes [ , ] . protein c l belongs to a family of poxvirus paralogs that may function as toll-like receptor inhibitors based on homology to a r [ , , ] . thus, this protein may modulate toll/il- r signaling, resulting in a diminished host immune response and enhancing viral survival. p -keratin, type ii cytoskeletal , the host protein, has tissue specificity in the suprabasal layer of the stratified epithelium of the esophagus, exocervix, vagina, mouth and lingual mucosa, and in cells and cell clusters in the mucosa and serous gland ducts of the esophageal submucosa [ ] . transgenic knockout mice have shown keratin, type ii cytoskeletal to play an important role in maintaining normal epithelial tissue structure [ ] . 
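the combination of interaction pairs with abundance information can be reproduced in outline with networkx: build the virus–host interaction graph, attach spectral counts as node attributes, and rank host proteins by their number of viral partners. the edges and counts below are invented placeholders, not the actual yeast-two-hybrid or mass spectrometry data.

```python
# Sketch of combining interaction pairs with spectral counts and ranking
# host proteins by number of viral partners, in the spirit of the network
# analysis described above. Edges and counts are invented.

import networkx as nx

interactions = [("viral_A27L", "host_KRT4"),
                ("viral_C12L", "host_KRT4"),
                ("viral_C6L",  "host_KRT4"),
                ("viral_A27L", "host_ACTB")]
spectral_counts = {"viral_A27L": 120, "viral_C12L": 8, "viral_C6L": 5,
                   "host_KRT4": 15, "host_ACTB": 40}

g = nx.Graph()
for viral, host in interactions:
    g.add_edge(viral, host)
nx.set_node_attributes(g, spectral_counts, name="spectral_count")

host_nodes = [n for n in g if n.startswith("host_")]
for n in sorted(host_nodes, key=lambda n: g.degree(n), reverse=True):
    print(n, "viral partners:", g.degree(n),
          "spectral count:", g.nodes[n]["spectral_count"])
```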
keratin queries on cluster or family information can discover related information across laboratories and host pathogen systems. using the uniref cluster id for k _mouse, identified in figure , to query for all proteins with at least % identity, we find the human homolog k _human was detected in hela cells infected with vaccinia and monkypox virus [ ] . doi: . /journal.pone. .g in human saliva has been shown to interact with the protein srr- localized on the surface of streptococcus agalactiae and to play a critical role in colonization of this bacterial pathogen [ , ] . little is known about the mechanisms by which poxviruses attach to and enter host cells. no receptor for virion attachment on the host cell surface has been found. poxvirus infection can occur through interaction with human as well as mice airway epithelia, [ , ] we propose that the protein interactions outlined above may represent some of the initial interactions between host and pathogen. thus they represent potential therapeutic targets for further investigation. this is the first report describing the interaction of a poxvirus protein with a host keratin, type ii cytoskeletal protein. proteins. unequivocal identification of pathogens is important so that adequate counter measures can be taken. currently over pathogenic and non-pathogenic bacteria have been completely sequenced. the availability of sequence data allows identification of proteins that are unique at different taxonomic levels, thus providing a means to begin to distinguish pathogenic from non-pathogenic species. however, if the initial screening depends on sequence data alone, the list of potential targets for laboratory validation can be relatively long; by supplementing sequence results with experimental data one can prioritize the target list for validation in the laboratory. we used such an approach by computationally screening potential targets using cupid [ ] , prc data and other computational means to produce a list of potential targets. identifying species-specific proteins can be done with confidence when multiple species and strains have been sequenced as is the case with bacillus anthracis. the approach relies on the fact that if a gene is conserved over time within multiple strains it gives confidence they will not be lost in the near future and hence are ideal for diagnostic targets. these ''core unique'' proteins have related sequences in all selected organisms (in this case all available strains of bacillus anthracis) but not in other related organisms. an initial total of proteins unique to the bacillus anthracis proteome were identified using the cupid and uniprotkb version . . (bacillus anthracis strain ames isolate porton, bacillus anthracis strain ames ancestor, bacillus anthracis strain sterne were compared to twelve other genomes in the bacillus genus). the two closest relatives of bacillus anthracis as determined by cupid are bacillus thuringiensis and bacillus cereus. the species most closely related to the selected organism is based on the best blast hits of its entire proteome [ ] . one needs to be careful in choosing the diagnostic targets that these two non-pathogenic organisms are not being detected. the initial list of was refined to identify ''core unique'' proteins that are amino acids or more in length. the residue cutoff was used to ensure that the target list consisted of proteins that are real (short proteins might not be real) and are unique, as identification of homologs for short proteins is not trivial [ ] . 
this resulted in a list of ''core unique'' proteins in the pathogenic strains. it is possible that the proteins found may have homologs in other organisms which were undetected by cupid because the genes were not annotated as open reading frames to confirm their uniqueness, the proteins were screened for significant regions of similarity at the dna level (either pseudogenes or unannotated genes) using tblastn against the ncbi nr database. using ncbi's nr which is produced independently of similar, but not identical, sources as iproclass also helps assure no sequences were missing from our warehouse. this additional analysis resulted in a total of bacillus anthracis specific proteins proposed as high-quality targets for development of diagnostic probes. we then supplemented this information with data from the prc projects and master protein directory to create a matrix of information ( figure ). six of the ten targets have data from the university of michigan prc showing that they were differentially expressed in published microarray experiments. nine of the ten are available as clones produced by the harvard institute of proteomics. a search of all the microarray data from the university of michigan (using http://proteinbank.vbi.vt.edu/ proteinbank/p/search/searchproteins.dll) showed that the four proteins not differentially expressed (listed as clones only in figure ) were still constitutively expressed well above background in all studies (not shown). either the proteins or the dna coding for these proteins can be used to develop and test pathogen detection systems. all the ''core unique'' proteins detected in this study lacked meaningful functional annotation (i.e., were annotated simply as ''uncharacterized protein''), which is not surprising as such unique proteins are not easy to characterize. one protein identified as a target is a remnant of a prophage protein. such proteins are well known to be related to virulence [ ] . another protein is from the pxo plasmid. a similar approach was taken for salmonella species using cupid and public prc data and several candidate diagnostic proteins are currently being validated in the laboratory (data not shown). a systems approach to biology or medicine requires the sharing, integration and navigation of large and diverse experimental data sets to develop the models and hypotheses required to make new discoveries and to develop new treatments. to date this has most often been done with selected research data or within an institution or program where common instrumentation and methods make standardization of experimental practices and data management easier to achieve [ , , ] . alternative approaches require a reanalysis of all the data by a common methodology as has been done in some data repositories [ , ] or assigning some common statistical metric to all data of a certain type to allow functional coupling [ ] . these approaches are all potentially useful, but practically difficult to achieve on a large scale with heterogeneous data. the protein-centric approach we employed is a relatively simple, yet powerful and practical, approach to integrate and figure . ten potential diagnostic markers for pathogenic strains of bacillus anthracis. the prioritized protein list was obtained by computationally screening potential targets using cupid [ ] , prc data and other computational means described in the text. doi: . /journal.pone. .g navigate diverse sets of omics data in a manner useful for systems biology. 
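the ''core unique'' screening logic described above (present in every target strain, absent from the closest relatives, and longer than a length cutoff) reduces to simple set operations, as in the toy sketch below. the protein sets and the cutoff value are placeholders, and the subsequent tblastn screen against ncbi nr is not reproduced here.

```python
# Toy re-implementation of the "core unique" filtering logic (not CUPID
# itself): keep proteins present in every target strain, absent from the
# near-neighbor species, and longer than a length cutoff. All values are
# placeholders for illustration.

target_strains = {
    "ames":          {"tgt1", "tgt2", "tgt3", "short1"},
    "ames_ancestor": {"tgt1", "tgt2", "tgt3", "short1"},
    "sterne":        {"tgt1", "tgt2", "short1"},
}
near_neighbors = {"b_cereus": {"tgt2"}, "b_thuringiensis": set()}
protein_lengths = {"tgt1": 180, "tgt2": 300, "tgt3": 95, "short1": 40}

MIN_LENGTH = 100  # placeholder; the published cutoff is not reproduced here

core = set.intersection(*target_strains.values())      # in all target strains
neighbor_hits = set.union(*near_neighbors.values())    # seen in close relatives
core_unique = {p for p in core - neighbor_hits
               if protein_lengths.get(p, 0) >= MIN_LENGTH}
print(sorted(core_unique))  # -> ['tgt1']
```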
proteins are often the biologically functional elements in cellular networks; thus, many types of data can be mapped to and through proteins as a common biological object. the lightweight data warehouse approach used for the mpd proved useful in practice, especially with large datasets as its simple design and schema allows greater flexibility to add new data types and to modify search and analysis capabilities. similar lightweight approaches and schemas designed to optimize queries have been shown useful in integration of genomic data [ , ] . the main drawback of this approach is that the warehouse does not contain all the data. however, this is rarely a problem if the data are available in some other data resource optimized for that particular data type and if some upfront analysis of the user's needs for query and analysis options is performed. for example, our use case analysis suggested that for microarray and mass spectrometry data, individual raw intensities, machine-specific parameters and most calculated numerical values were not required for general queries and analysis across the combined data as these values were only comparable between the particular analysis performed in one lab. as a result, most numerical values were not included in the mpd for the default search but are accessible for display via hyperlinks to our protein data center or ftp site. however, if a new attribute appear or users request searches on a particular value omitted from the warehouse, adding it is a relatively simple matter of adding new data columns. for instance, in example ii our combination and analysis of mass spectrometry and protein interaction data, we could include peptide counts directly in the mpd for immediate download instead of retrieving them from the ftp files. of course, no one approach can be perfect, as in biology and research there always seems to be exceptions and new data and multiple approaches need to be accommodated. efforts to standardize reporting requirements, vocabularies and develop common xml data formats for sharing data are welcome and can greatly ease the transfer and automated processing of a particular data type. however the current standards do not necessarily guarantee integration as the problems of reconciling gene and protein identifiers as well as differences in experimental methodology remain. we investigated and employed a few common data standards and ontologies in developing the biodefense proteomics resource. we provided some data using mzdata [ ] and mage-ml [ ] but also provided original dataspecific text files for download. we found that several ontologies to describe experimental methods were useful but incomplete and focused on higher eukaryotes and thus did not yet contain terms needed for microbial pathogens. most useful was the gene ontology [ ] which has been widely adopted to annotate and classify large scale results and can be used for searching and classification in the mpd. here we have presented some unique examples to illustrate benefits, as well as the difficulties, associated with integration of a very diverse set of omics research data across different data types, laboratories and organisms. we illustrated with three examples how potential therapeutic and diagnostic targets can be identified from integrated data applying relatively simple and established tools and techniques. we continue to focus on data integration to allow biologists to find relevant data sets for further detailed analysis using the approaches and tools of their choice. 
in general the analysis of diverse omics data is an area of active research and a number of useful tools are under active development including cytoscape [ ] , bioconductor [ ] and galaxy [ ] . in the future a more seamless integration between data repositories and analysis tools such as these would be the most useful approach to add additional analysis options for integrated data. author contributions building integrated approaches for the proteomics of complex, dynamic systems: nih programs in technology and infrastructure development the hupo proteomics standards initiativeeasing communication and minimizing data loss in a changing world the minimum information about a proteomics experiment (miape) the hupo proteomics standards initiative-overcoming the fragmentation of proteomics data an emerging cyberinfrastructure for biodefense pathogen and pathogen-host data the iproclass integrated database for protein functional analysis the universal protein resource (uniprot) ) dictybase, the model organism database for dictyostelium discoideum the arabidopsis information resource (tair): a model organism database providing a centralized, curated gateway to arabidopsis biology, research materials and community the mouse genome database (mgd): the model organism database for the laboratory mouse the yeast proteome database (ypd) and caenorhabditis elegans proteome database (wormpd): comprehensive resources for the organization and comparison of model organism protein information nucleic acids research, database issue the zebrafish information network: the zebrafish model organism database generif quality assurance as summary revision uniprot archive production and sequence validation of a complete full length orf collection for the pathogenic bacterium vibrio cholerae transcriptional profiling of the bacillus anthracis life cycle in vitro and an implied model for regulation of spore formation transcriptional profiling of bacillus anthracis during infection of host macrophages the global transcriptional responses of bacillus anthracis sterne ( f ) and a delta soda mutant to paraquat reveal metal ion homeostasis imbalances during endogenous superoxide stress the pir integrated protein databases and data retrieval system callable personal librarian (cpl) (version . ). 
• callable personal librarian (cpl) (version )
• uniref: comprehensive and non-redundant uniprot reference clusters
• a data integration methodology for systems biology: experimental verification
• from bytes to bedside: data integration and computational biology for translational cancer research
• integration of metabolomic and proteomic phenotypes: analysis of data covariance dissects starch and rfo metabolism from low and high temperature compensation response in arabidopsis thaliana
• comparative proteomics of human monkeypox and vaccinia intracellular mature and extracellular enveloped virions
• the utility of accurate mass and lc elution time information in the analysis of complex proteomes
• cytoscape: a software environment for integrated models of biomolecular interaction networks
• exploring biological networks with cytoscape software
• comparative assessment of large-scale data sets of protein-protein interactions
• a direct comparison of protein interaction confidence assignment schemes
• a relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage t and the yeast saccharomyces cerevisiae
• protein interactions: two methods for assessment of the reliability of high throughput observations
• assessment of the reliability of protein-protein interactions and protein function prediction
• gaining confidence in high-throughput protein interaction networks
• vaccinia virus induces cell fusion at acid ph and this activity is mediated by the n-terminus of the -kda virus envelope protein
• a k envelope protein of vaccinia virus with an important role in virus-host cell interactions is altered during virus persistence and determines the plaque size phenotype of the virus
• the oligomeric structure of vaccinia viral envelope protein a l is essential for binding to heparin and heparan sulfates on cell surfaces: a structural and functional approach using site-specific mutagenesis
• vaccinia virus envelope h l protein binds to cell surface heparan sulfate and is important for intracellular mature virion morphogenesis and virus infection in vitro and in vivo
• vaccinia virus envelope d l protein binds to cell surface chondroitin sulfate and mediates the adsorption of intracellular mature virions to cells
• vaccinia virus f virion membrane protein is required for entry but not virus assembly, in contrast to the related l protein
• the vaccinia virus gene i l encodes a membrane protein with an essential role in virion entry
• blockade of chemokine activity by a soluble chemokine binding protein from vaccinia virus
• monkeypox virus viral chemokine inhibitor (mpv vcci), a potent inhibitor of rhesus macrophage inflammatory protein-
• identification of a peptide derived from vaccinia virus a r protein that inhibits cytokine secretion in response to tlr-dependent signaling and reduces in vivo bacterial-induced inflammation
• a r and a r from vaccinia virus are antagonists of host il- and toll-like receptor signaling
• the poxvirus protein a r targets toll-like receptor signaling complexes to suppress host defense
• molecular characterization and expression of the stratification-related cytokeratins and
• mouse keratin is necessary for internal epithelial integrity
• clumping factor b, a fibrinogen-binding mscramm (microbial surface components recognizing adhesive matrix molecules) adhesin of staphylococcus aureus, also binds to the tail region of type i cytokeratin
• the surface protein srr- of streptococcus agalactiae binds human keratin and promotes adherence to epithelial hep- cells
• vaccinia virus entry, exit, and interaction with differentiated human airway epithelia
• protective effect of toll-like receptor in pulmonary vaccinia infection
• computational identification of strain-, species- and genus-specific proteins
• a search method for homologs of small proteins. ubiquitin-like proteins in prokaryotic cells?
• the impact of prophages on bacterial chromosomes
• systems biology at the institute for systems biology
• a data integration methodology for systems biology
• integrated genomic and proteomic analyses of gene expression in mammalian cells
• open source system for analyzing, validating, and storing protein identification data
• peptideatlas: a resource for target selection for emerging targeted proteomics workflows
• global networks of functional coupling in eukaryotes from comprehensive data integration
• ensmart: a generic system for fast and flexible access to biological data
• biomart central portal: unified access to biological data
• further steps in standardisation
• design and implementation of microarray gene expression markup language
• gene ontology: tool for the unification of biology. the gene ontology consortium
• bioconductor: open software development for computational biology and bioinformatics
• galaxy: a platform for interactive large-scale genome analysis
key: cord- -vd a eq authors: shu, yuelong; mccauley, john title: gisaid: global initiative on sharing all influenza data – from vision to reality date: - - journal: euro surveill doi: . / - .es. . . . sha: doc_id: cord_uid: vd a eq nan
ten years ago, a correspondence [ , ] , signed by more than , championed 'a global initiative on sharing avian flu data' (gisaid) [ ] , leading to the gisaid initiative in . what started out as an expression of intent to foster international sharing of all influenza virus data and to publish results collaboratively has emerged as an indispensable mechanism for sharing influenza genetic sequence and metadata that embraces the interests and concerns of the wider influenza community, public health and animal health scientists, along with governments around the world. today gisaid is recognised as an effective and trusted mechanism for rapid sharing of both published and 'unpublished' influenza data [ ] . its concept for incentivising data sharing established an alternative to data sharing via conventional public-domain archives. in , reluctance to share data, in particular on avian h n influenza viruses, created an emergency bringing into focus certain limitations and inequities, such that the world health organization (who)'s global influenza surveillance network (now the global influenza surveillance and response system (gisrs) [ ] ) was criticised on several fronts, including limited global access to h n sequence data that were stored in a database hosted by the los alamos national laboratories in the united states (us) [ , ] . this data repository was set up with financial support from the us centers for disease control and prevention (cdc) as a first attempt to share 'sensitive' data from affected countries, but was accessible only to those who were also providing h n sequence data. this limited-access approach restricted wider sharing of data prior to publication, which was vital for broader understanding of the progress of the emergent public and animal health threat.
the need for greater transparency in data sharing and for acknowledgement of those contributing samples from h n -infected patients and animals and related genetic sequence data was not satisfied by sharing data after formal publication via public-domain databases. scientists charged with the day-to-day responsibilities of running who collaborating centres (ccs) for influenza, national influenza centres and the world organisation for animal health (oie)/food and agriculture organization of the united nations (fao) [ ] reference laboratories were therefore eager to play a key role and provide scientific oversight in the creation and development of gisaid's data sharing platform, which soon became essential for our work. a unique collaboration ensued, involving, in addition to members of who's gisrs and oie/fao reference laboratories, the wider influenza research community along with officials in governmental institutions and non-governmental organisations. facilitated by a well-connected broadcast executive with a background in licensing of intellectual property, an agreement was drawn up on the sharing of genetic data to meet emergency situations, without infringing intellectual property rights: the gisaid database access agreement (daa). the daa governs each individual's access to and their use of data in gisaid's epiflu database [ ] . it was this alliance between scientists and non-scientists, with a diversity of knowledge and experience, that drew up an acceptably simple, yet enforceable, agreement which gained the trust and respect of the scientific community and of public health and animal health authorities. the essential features of the daa encourage sharing of data by securing the provider's ownership of the data and requiring acknowledgement of those providing the samples and producing the data, while placing no restriction on the use of the data by registered users adhering to the daa. it essentially defines a code of conduct between providers and users of data, cementing mutual respect for their respective complementary contributions, and upholding the collaborative ethos of who's gisrs, initially established years ago this year [ ] . launched in , the epiflu database was of key importance in the response to the influenza a(h n ) pandemic, allowing countries to readily follow the evolution of the new virus as it spread globally [ ] . acceptance of the gisaid sharing mechanism by providers and users of data, and the confidence of the influenza community, were further illustrated in by the unprecedented immediate release of the genetic sequences of influenza a(h n ) viruses from the first human cases, by chinese scientists at the who collaborating centre for influenza in beijing [ , ] . such events reaffirmed gisaid's applicability to timely sharing of crucial influenza data. the subsequent use of the sequence data to generate, develop and test candidate vaccine viruses by synthetic biology within a few weeks also demonstrated how gisaid successfully bridged this important 'technological' gap [ , ] . the paper by bao et al. from jiangsu province of china published in this issue once again confirms the importance of the timely sharing of data on the evolution of the a(h n ) viruses for global risk assessment.
the authors analysed the recently isolated h n viruses from the fifth wave in jiangsu province, and the results showed no significant viral mutations in key functional loci, even though the h n viruses are under continuous dynamic reassortment and there is genetic heterogeneity. these findings should help to reduce the concerns raised, even though the number of human infections with h n virus increased sharply during the fifth wave in china. gisaid provides the data-sharing platform particularly used by gisrs, through which sequence data considered by the who ccs in selecting viruses recommended for inclusion in seasonal and pre-pandemic vaccines are shared openly and on which research scientists, public and animal health officials and the pharmaceutical industry depend. such openness of the most up-to-date data assists in an understanding of, and enhances the credibility of, the who recommendations for the composition of these seasonal and potential-pandemic vaccines. furthermore, in promoting the prompt sharing of data from potential pandemic zoonotic virus infections, as well as from seasonal influenza viruses, gisaid ensures a key tenet of the who pandemic influenza preparedness (pip) framework [ ] , highlighting the critical role it plays in mounting an effective mitigating response. gisaid's ability to facilitate efficient global collaborations, such as the global consortium for h n and related influenza viruses [ , ] , is central to monitoring phylogeographic interrelationships among, for example, h subtype viruses in wild and domestic birds in relation to their incidence, cross-border spread and veterinary impact, and to assessing risk to animal and human health [ ] . traditional public-domain archives such as genbank, where sharing and use of data takes place anonymously, fulfil a need for an archive of largely published data; however, that conventional method of data exchange notably has not been successful in encouraging rapid sharing of important data in epidemic or (potential) pandemic situations, such as those caused by middle east respiratory syndrome coronavirus (mers-cov) and ebola viruses. while the gisaid epiflu database is hosted and its sustainability ensured through the commitment of the federal republic of germany [ ] , the establishment of gisaid and the development of the epiflu database were reliant to a large extent on the philanthropy of one individual and the voluntary contributions and generosity of many others, together with some initial financial provision by the us cdc and the german max planck society. that gisaid has become accepted as a pragmatic means of meeting the needs of the influenza community in part reflects the particular characteristics of influenza and the continual need for year-round monitoring of the viruses circulating worldwide, essential for the biannual vaccine recommendations and assessment of the risk posed by frequent zoonotic infections by animal influenza viruses [ ] . in the meantime, calls for an equivalent mechanism to promote the timely sharing of data in other urgent epidemic settings go largely unfulfilled [ , ] . a recent publication considered whether the 'paradigm shift' in data sharing by gisaid could be applied more generally to assist in preparedness for and response to other emergent infectious threats, such as those posed by ebola virus [ ] and zika virus [ ] .
such a trusted system could complement and take full advantage of the latest advances in rapid sequencing of specimens in the laboratory and in the field, for outbreak investigation [ ] . given the crucial importance of genetic data in improving our understanding of the progress of an emergent, potentially devastating epidemic, the effectiveness of gisaid in influenza pandemic preparedness is self-evident and provides important lessons for future pandemic threats. while the genetic makeup and the necessary associated data of the different viruses are distinct, requiring separate databases/compartments for unambiguous analysis, the modi operandi for sharing genetic data are generic and the gisaid mechanism could be applied to other emerging pathogens. indeed, the wider implementation of such a data sharing mechanism should be key in concerted efforts to contain the spread of disease in animals and threats to human health, realising the concept of one health.
the work at the who cc in london was supported by the francis crick institute, which receives its core funding from cancer research uk (fc ), the uk medical research council (fc ) and the wellcome trust (fc ).
references
• boosting access to disease data
• plan to pool bird-flu data takes off
• a global initiative on sharing avian flu data
• disease and diplomacy: gisaid's innovative contribution to global health
• world health organization (who) global influenza surveillance and response system (gisrs)
• flu researchers slam us agency for hoarding data
• the contents of the syringe
• world organisation for animal health (oie)/food and agriculture organization of the united nations (fao)
• the global initiative on sharing all influenza data (gisaid)
• swine flu goes global
• human infection with a novel avian-origin influenza a (h n ) virus
• the fight against bird flu
• early response to the emergence of influenza a(h n ) virus in humans in china: the central role of prompt information sharing and public communication
• synthetic generation of influenza vaccine viruses for rapid response to pandemics
• rapidly produced sam(®) vaccine against h n influenza is immunogenic in mice
• virus sharing, genetic sequencing, and global health security
• role for migratory wild birds in the global spread of avian influenza h n
• evolution and prevalence of h n avian influenza viruses in china
• germany's statement on substantive issues and concerns regarding the pip framework and its implementation. special session of the pip advisory group
• viral factors in influenza pandemic risk assessment
• developing global norms for sharing data and results during public health emergencies
• benefits of sharing real-time, portable genome sequencing for ebola surveillance
acknowledgements: peter bogner, a philanthropist and broadcast executive with a background in licensing of intellectual property and with a trust-based network of political contacts, acted as the principal instigator and active proponent and was instrumental in forging an alliance of all stakeholders leading to the creation of gisaid. dr. shu's research was supported by the national key research and development program of china ( yfc ) and the national mega-projects for infectious diseases ( zx ). conflict of interest: none declared.
this is an open-access article distributed under the terms of the creative commons attribution (cc by . ) licence. you may share and adapt the material, but must give appropriate credit to the source, provide a link to the licence, and indicate if changes were made.
key: cord- - ru lh c authors: shi, shuyun; he, debiao; li, li; kumar, neeraj; khan, muhammad khurram; choo, kim-kwang raymond title: applications of blockchain in ensuring the security and privacy of electronic health record systems: a survey date: - - journal: comput secur doi: . /j.cose. . sha: doc_id: cord_uid: ru lh c due to the popularity of blockchain, there have been many proposed applications of blockchain in the healthcare sector, such as electronic health record (ehr) systems. therefore, in this paper we perform a systematic literature review of blockchain approaches designed for ehr systems, focusing only on the security and privacy aspects. as part of the review, we introduce relevant background knowledge relating to both ehr systems and blockchain, prior to investigating the (potential) applications of blockchain in ehr systems. we also identify a number of research challenges and opportunities. there is increasing interest in digitalizing healthcare systems by governments and related industry sectors, partly evidenced by various initiatives taking place in different countries. for example, the then u.s. president signed into law the health information technology for economic and clinical health (hitech) act of , as part of the american recovery and reinvestment act of . hitech is designed to encourage broader adoption of electronic health records (ehrs), with the ultimate aim of benefiting patients and society. the potential benefits associated with ehr systems (e.g. public healthcare management, online patient access, and patients' medical data sharing) have also attracted the interest of the research community [ , , , , , , , , ] . the potential of ehrs is also evidenced by the recent novel coronavirus (also referred to as -ncov and covid- ) pandemic, where remote patient monitoring and other healthcare deliveries are increasingly used in order to contain the situation. as with any maturing consumer technology, there are a number of research and operational challenges. for example, many existing ehr systems use a centralized server model, and hence such deployments inherit the security and privacy limitations associated with that model (e.g. single point of failure and performance bottleneck). in addition, as ehr systems become more commonplace and the importance of data (particularly healthcare data) becomes better understood, honest but curious servers may surreptitiously collect personal information of users while carrying out their normal activities. in recent times, there is an increasing trend of deploying blockchain in a broad range of applications, including healthcare (e.g. public healthcare management, counterfeit drug prevention, and clinical trials) [ , , ]. this is not surprising, since blockchain is an immutable, transparent and decentralized distributed database [ ] that can be leveraged to provide a secure and trustworthy value chain. an architecture of blockchain-based healthcare systems is shown in fig. . blockchain is a distributed ledger database on a peer-to-peer (p p) network that comprises a chronologically ordered list of blocks. in other words, it is a decentralized and trustworthy distributed system (without relying on any third party). trust relations among distributed nodes are established by mathematical methods and cryptographic technologies instead of semi-trusted central institutions. blockchain-based systems can mitigate the limitation of the single point of failure.
besides, since data is recorded in the public ledger, and all nodes in the blockchain network have ledger backups and can access these data anytime and anywhere, such a system ensures data transparency and helps to build trust among distributed nodes. it also facilitates data audit and accountability by having the capability to trace tamper-resistant historical records in the ledger. depending on the actual deployment, data in the ledger can be stored in encrypted form using different cryptographic techniques, hence preserving data privacy. users can also protect their real identities in the sense of pseudo-anonymity. to enhance robustness, we can introduce smart contracts (i.e. a kind of self-executing program deployed on the distributed blockchain network) to support diverse functions for different application scenarios. specifically, the terms of a smart contract can be preset by users and the smart contract will only be executed if the terms are fulfilled. hence, this hands control over to the owner of the data. there are a (small) number of real-world blockchain-based healthcare systems, such as gem, guardtime and healthbank [ ] . hence, in this paper we focus on blockchain-based healthcare systems. specifically, we will comprehensively review some existing work, and identify existing and emerging challenges and potential research opportunities. prior to presenting the results of our review, we will first introduce ehr systems and the blockchain architecture in the next section. then, in section , we will review the extant literature and provide a comparative summary of some existing systems. in section , we identify a number of potential research opportunities. finally, we conclude the paper in the last section. in a centralized architecture, such as those that underpin a conventional ehr system, a central institution is tasked with managing, coordinating and controlling the entire network. however, in a distributed architecture, all nodes are maintained without relying on a central authority. now, we will briefly explain the ehr system and blockchain technology. the electronic health record (ehr) is generally defined to be the collection of patients' electronic health information (e.g. in the form of electronic medical records, emrs). emrs can serve as a data source for ehrs, mainly from healthcare providers in medical institutions. the personal health record (phr) contains personal healthcare information, such as that obtained from wearable devices owned and controlled by patients. information collected as part of phrs can be made available to healthcare providers by users (patients). in theory, ehr systems should ensure the confidentiality, integrity and availability of the stored data, and data can be shared securely among authorized users (e.g. medical practitioners with a legitimate need to access a particular patient's data to facilitate diagnosis). in addition, such a system, if implemented well, can reduce data replication and the risk of lost records, and so on. however, the challenge of securing data in such systems, whether in transit or at rest, is compounded by the increasing connectivity to these systems (e.g. more potential attack vectors). for example, mobile devices that can sync with the ehr system are a potential attack vector that can be targeted (e.g. an attacker can seek to exploit a known vulnerability in hospital-issued mobile devices and install malware to facilitate covert exfiltration of sensitive data such as phrs).
one of the key benefits of ehr systems is the availability of large volumes of data, which can be used to facilitate data analysis and machine learning, for example to inform other medical research efforts such as disease forecasting (e.g. for the novel coronavirus). furthermore, wearable and other internet of things (iot) devices can collect and upload relevant information, including that relating to phrs, to the ehr systems, which can facilitate healthcare monitoring and personalized health services. blockchain was made popular by the success of bitcoin [ ] , and can be used to facilitate trustworthy and secure transactions across an untrusted network without relying on any centralized third party. we will now introduce the fundamental building blocks of the blockchain [ , , ] . a blockchain is a chronological sequence of blocks including a list of complete and valid transaction records. blocks are linked to the previous block by a reference (hash value), thus forming a chain. the block preceding a given block is called its parent block, and the first block is known as the genesis block. a block [ ] consists of a block header and a block body. the block header contains: • block version: block validation rules; • previous block hash: hash value of the previous block; • timestamp: the creation time of the current block; • nonce: a -byte random field that miners adjust for every hash calculation to solve a pow mining puzzle (see also section . . ); • body root hash: hash value of the merkle tree root built from the transactions in the block body; • target hash: target threshold for the hash value of a new valid block. the target hash is used to determine the difficulty of the pow puzzle (see also section . . ). a merkle tree is used to store all the valid transactions, in which every leaf node is a transaction and every non-leaf node is the hash value of its two concatenated child nodes. such a tree structure is efficient for verifying a transaction's existence and integrity, since any node can confirm the validity of any transaction using the hash values of the corresponding branches rather than the entire merkle tree. meanwhile, any modification of a transaction will generate a new hash value in the layer above and ultimately result in a different root hash. besides, the maximum number of transactions that a block can contain depends on the size of each transaction and the block size. these blocks are then chained together using a cryptographic hash function in an append-only structure. that means new data is only appended in the form of additional blocks chained to previous blocks, since altering or deleting previously confirmed data is impossible. as previously discussed, any modification of one of the blocks will generate a different hash value and break the link relation, hence achieving immutability and security. digital signatures based on asymmetric cryptography are generally used for transaction authentication in an untrustworthy environment [ , ] . blockchain uses an asymmetric cryptography mechanism to send transactions and verify their authenticity; a transaction that fails verification will be discarded in this process. only valid transactions can be stored in a new block of the blockchain network. we will take a coin transfer as an example (see fig. ). alice transfers a certain amount of coins to bob. in step , she initiates a transaction signed by her private key. the transaction can be easily verified by others using alice's public key.
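before continuing with the remaining steps of the transfer example, the block structure and mining puzzle described above can be made concrete with a short sketch. the following python code is a minimal illustration rather than the bitcoin wire format: the field names, the json serialization and the toy difficulty value are assumptions chosen for readability.

```python
import hashlib
import json
import time

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transactions: list) -> bytes:
    """compute a merkle root; each leaf is the hash of a serialized transaction."""
    level = [sha256(json.dumps(tx, sort_keys=True).encode()) for tx in transactions]
    if not level:
        return sha256(b"")
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last hash when a level is odd
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def header_hash(header: dict) -> int:
    """hash the serialized header and interpret it as an integer for target comparison."""
    return int.from_bytes(sha256(json.dumps(header, sort_keys=True).encode()), "big")

def mine(header: dict, target: int) -> dict:
    """simplified proof of work: adjust the nonce until the header hash is <= target."""
    nonce = 0
    while True:
        header["nonce"] = nonce
        if header_hash(header) <= target:
            return header
        nonce += 1

transactions = [{"from": "alice", "to": "bob", "amount": 5}]
header = {
    "version": 1,
    "previous_block_hash": "00" * 32,      # genesis block: no real parent hash
    "timestamp": int(time.time()),
    "body_root_hash": merkle_root(transactions).hex(),
    "nonce": 0,
}
target = 2 ** 240                           # toy difficulty: roughly the top 16 bits must be zero
mined = mine(header, target)
print("valid nonce found:", mined["nonce"])
```

finding the nonce is expensive, while re-checking it takes a single hash, which is exactly the asymmetry the pow mechanism relies on. with these pieces in place, the remaining steps of the coin transfer example proceed as follows.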
in step , the transaction is broadcast to other nodes through the p p network. in step , each node verifies the transaction against predefined rules. in step , each validated transaction is packed chronologically and appended to a new block once a miner solves the puzzle. finally, every node updates and backs up the new block. in the blockchain network, there is no trusted central authority. thus, reaching a consensus on these transactions among untrustworthy nodes in a distributed network is an important issue, which is a transformation of the byzantine generals (bg) problem proposed in [ ] . the bg problem is that a group of generals command the byzantine army to encircle a city, and they have no chance of winning the war unless all of them attack at the same time. however, they cannot be sure whether there are traitors who might retreat in such a distributed setting. thus, they have to reach an agreement to attack or retreat. the blockchain network faces the same challenge. a number of protocols have been designed to reach consensus among all the distributed nodes before a new block is linked into the blockchain [ ] , such as the following: • pow (proof of work) is the consensus mechanism used in bitcoin. if a miner node with a certain amount of computing (hashing) power wishes to obtain rewards, it must perform the laborious task of mining to prove that it is not malicious. the task requires the node to repeatedly perform hash computations to find an eligible nonce value satisfying the requirement that the hashed block header must be less than (or equal to) the target hash value. the nonce is difficult to generate but easy for other nodes to validate. the task is costly (in terms of computing resources) due to the number of required hash calculations. a % attack is a potential attack in the blockchain network: if a miner or a group of miners can control more than % of the computing power, they could interfere with the generation of new blocks and create fraudulent transaction records beneficial to the attackers. • pos (proof of stake) is an improved and energy-saving alternative to pow. it is believed that nodes with the largest number of stakes (e.g. currency) would be less likely to attack the network. however, selection based on account balance is unfair because the richest node is more likely to become dominant in the network, which would gradually make it resemble a centralized system. blockchain systems are divided into three types based on the permissions given to network nodes: • public blockchain. the public blockchain is open to anyone who wants to join at any time, acting as a simple node or as a miner for economic rewards. bitcoin [ ] and ethereum [ ] are two well-known public blockchain platforms. • private blockchain. the private blockchain network is based on access control, in which participants must obtain an invitation or permission to join. gemos [ ] and multichain [ ] are both typical private blockchain platforms. • consortium blockchain. the consortium blockchain is 'semi-private', sitting on the fence between public and private blockchains. it is granted to a group of approved organizations, commonly associated with enterprise use to improve business. hyperledger fabric [ ] is a business consortium blockchain framework. ethereum also supports building consortium blockchains. generally, ehrs mainly contain patient medical history, personal statistics (e.g. age and weight), laboratory test results and so on.
hence, it is crucial to ensure the security and privacy of these data. in addition, hospitals in countries such as the u.s. are subject to exacting regulatory oversight. there are also a number of challenges in deploying and implementing healthcare systems in practice. for example, centralized server models are vulnerable to single points of failure and malicious insider attacks, as previously discussed. users (e.g. patients) whose data is outsourced or stored in these ehr systems generally lose control of their data, and have no way of knowing who is accessing their data and for what kind of purposes (i.e. a violation of personal privacy). such information may also be at risk of being leaked by malicious insiders to another organization; for example, an insurance company may deny insurance coverage to a particular patient based on leaked medical history. meanwhile, data sharing is increasingly crucial, particularly as our society and population become more mobile. by leveraging the interconnectivity between different healthcare entities, shared data can improve medical service delivery, and so on. overcoming the 'information and resource island' (information silo) will be challenging, for example due to privacy concerns and regulations. the information silo also contributes to unnecessary data redundancy and red tape. in this context, the health insurance portability and accountability act (hipaa) was enacted; its rules include the following: • unique identifiers rule. only the national provider identifier (npi) identifies covered entities in the standard transactions, to protect patient identity information. • enforcement rule. investigation of, and penalties for, violating hipaa rules. there is another common framework for audit trails for ehrs, called iso , to keep personal health information auditable across systems and domains. a secure audit record must be created each time any operation is triggered in a system complying with iso . hence, we posit the importance of a collaborative and transparent data sharing system, which also facilitates audit and post-incident investigation or forensics in the event of alleged misconduct (e.g. data leakage). such a notion (forensic-by-design) is also emphasized by forensic researchers [ , ] . as a regulatory response to security concerns about managing the distribution, storage and retrieval of health records by the medical industry, title cfr part places requirements on medical systems, including measures such as document encryption and the use of digital signature standards to ensure the authenticity, integrity and confidentiality of records. based on the relevant standards above, we summarize the following requirements that should be met when implementing the next generation of secure ehr systems: • accuracy and integrity of data (e.g. any unauthorized modification of data is not allowed, and can be detected); • security and privacy of data; • efficient data sharing mechanism (e.g. [ ] ); • mechanism to return the control of ehrs back to the patients (e.g. patients can monitor their records and receive notifications of loss or unauthorized acquisition); • audit and accountability of data (e.g. forensic-by-design [ , ] ). the above properties can be achieved using blockchain, as explained below: • decentralization. compared with the centralized mode, blockchain no longer needs to rely on a semi-trusted third party. • security. a blockchain-based decentralized system is resilient to single points of failure and insider attacks. • pseudonymity.
each node is bound to a public pseudonymous address that protects its real identity. • immutability. it is computationally hard to delete or modify any record of any block included in the blockchain, owing to the one-way cryptographic hash function. • autonomy. patients hold the rights to their own data and can share their data flexibly by setting specific terms in a smart contract. • incentive mechanism. the incentive mechanism of blockchain can stimulate cooperation and sharing among competing institutions to promote the development of medical services and research. • auditability. it is easy to keep track of any operation, since every historical transaction is recorded in the blockchain. hence, if blockchain is applied correctly in ehr systems, it can help to ensure the security of ehr systems, enhance the integrity and privacy of data, encourage organizations and individuals to share data, and facilitate both audit and accountability. based on the requirements for a new generation of secure ehr systems and the characteristics of blockchain discussed in the preceding section . , we will now describe the key goals in the implementation of secure blockchain-based ehr systems as follows: • privacy: individual data will be used privately and only authorized parties can access the requested data. • security: in the sense of confidentiality, integrity and availability (cia): confidentiality, in that only authorized users can access the data; integrity, in that data must be accurate in transit and not be altered by unauthorized entities; and availability, in that legitimate users' access to information and resources is not improperly denied. • accountability: an individual or an organization will be audited and be responsible for misbehavior. • authenticity: capability to validate the identities of requestors before allowing access to sensitive data. • anonymity: entities have no visible identifier, for privacy. complete anonymity is challenging, and pseudo-anonymity is more common (i.e. users are identified by something other than their actual identities). in order to satisfy the above goals, existing blockchain-based research in the healthcare domain covers the following main aspects: • data storage. blockchain serves as a trusted ledger database to store a broad range of private healthcare data. data privacy should be guaranteed when secure storage is achieved. however, healthcare data volumes tend to be large and complex in practice. hence, a corresponding challenge is how to deal with big data storage without having an adverse impact on the performance of the blockchain network. • data sharing. in most existing healthcare systems, service providers usually maintain primary stewardship of data. with the notion of self-sovereignty, there is a trend to return the ownership of healthcare data to the user, who is capable of sharing (or not sharing) his personal data at will. it is also necessary to achieve secure data sharing across different organizations and domains. • data audit. audit logs can serve as proof to hold requestors accountable for their interactions with ehrs when disputes arise. some systems utilize blockchain and smart contracts to keep traces for auditability purposes. any operation or request will be recorded in the blockchain ledger, and can be retrieved at any time. • identity manager. the legitimacy of each user's identity needs to be guaranteed in the system. in other words, only legitimate users can make the relevant requests, to ensure system security and avoid malicious attacks.
in the remainder of this section, we will review existing approaches to data storage, data sharing, data audit, and identity management (see sections . to . ). according to section . , one of the solutions to ensure greater security in ehr systems is the use of blockchain technology. however, there are potential privacy problems for raw or encrypted data in the public ledger, since blockchain as a public database carries the risk of sensitive data being exposed under statistical attacks. some measures should therefore be taken to enhance the privacy protection of sensitive health records in blockchain-based ehr systems. in general, privacy-preserving approaches can be classified into cryptographic and non-cryptographic approaches, namely encryption on the one hand, and anonymisation and access control mechanisms on the other. encryption schemes are a relatively common method, such as public key encryption (pke), symmetric key encryption (ske), secure multi-party computation (mpc) [ ] and so on. the authors of [ ] proposed that sensor data be uploaded using a pair of unique private and public keys in the blockchain network to protect the privacy and security of biometric information. zheng et al. [ ] proposed that data be encrypted with a symmetric key scheme (i.e. rijndael aes [ ] ) before being uploaded to cloud servers, combined with a threshold encryption scheme. the symmetric key is split into multiple shares distributed among different key keepers using shamir's secret sharing scheme [ ] . only if a data requestor obtains enough key shares can the ciphertext be decrypted, and compromising some key keepers (fewer than the threshold) does not lead to data leakage. yue et al. [ ] designed a smartphone app based on blockchain with an mpc technique, called healthcare data gateway (hdg). the system allows computations to be run on encrypted data directly in the private blockchain cloud and the final results to be obtained without revealing the raw data. besides, guo et al. [ ] proposed an attribute-based signature scheme with multiple authorities (ma-abs) for the healthcare blockchain. the signature in this scheme attests not to the identity of the patient who endorses a message, but to a claim (such as an access policy) regarding the attributes he possesses, delegated from some authorities. meanwhile, the system has the ability to resist collusion attacks by sharing the secret pseudorandom function (prf) seeds among authorities. in order to resist malicious attacks (e.g. statistical attacks), healthcare systems using general methods have to change the encryption keys frequently. this brings storage and management costs for a large number of historical keys, since these keys must be retained to decrypt historical data in the future; the burden is greater still for devices with limited computational and storage resources. to address this problem, zhao et al. [ ] designed a lightweight backup and efficient recovery key management scheme for body sensor networks (bsns) to protect the privacy of sensor data from the human body and greatly reduce the storage cost of secret keys. fuzzy vault technology is applied for the generation, backup and recovery of keys without storing any encryption key, and the recovery of the key is executed by the bsns. an adversary can hardly decrypt sensor data without the symmetric key, since sensor data is encrypted with symmetric encryption technology (i.e. aes or des). we compare and analyse some of the systems above, as shown in table and .
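to make the threshold idea used by zheng et al. concrete (a symmetric key split among key keepers so that any t of n shares recover it, while fewer reveal nothing), the sketch below implements textbook shamir secret sharing over a prime field. it is a didactic sketch under assumed parameters, not the code of the cited system; the prime and the 3-of-5 split are arbitrary illustrative choices.

```python
import secrets

# a prime larger than any secret we intend to split (2**521 - 1, a mersenne prime)
PRIME = 2 ** 521 - 1

def split_secret(secret: int, threshold: int, num_shares: int):
    """split `secret` into points on a random polynomial of degree threshold - 1."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, num_shares + 1)]

def recover_secret(shares):
    """lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i == j:
                continue
            num = (num * -xj) % PRIME
            den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# treat a 256-bit symmetric key as an integer and split it 3-of-5 among key keepers
key = secrets.randbits(256)
shares = split_secret(key, threshold=3, num_shares=5)
assert recover_secret(shares[:3]) == key      # any 3 shares recover the key
assert recover_secret(shares[1:4]) == key
```

only the shares are distributed among the key keepers; the key itself never needs to be stored anywhere, and compromising fewer than the threshold number of keepers reveals nothing about it.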
most systems use cryptographic technology to enhance the security and privacy of healthcare data in the blockchain. however, encryption techniques are not absolutely secure. the computational cost of encryption is high for some resource-limited devices. transaction records may also reveal user behaviour and identity because of fixed account addresses. malicious attackers may break the ciphertext stored in the public ledger by some means, and all of the data will be exposed once the corresponding symmetric key is lost. table : systems requirements that have been met in table (columns: paper, security, privacy, anonymity, integrity, authentication, controllability, auditability, accountability). meanwhile, another important issue is key management. keeping private keys secret is the foundation of the security of the entire data domain. the loss of a private key means that the holder no longer has the ability to control the corresponding data, and once the private or symmetric key is compromised, all of the data may be exposed by attackers. so, both the encryption technique and key management should be considered when developers design a secure ehr system. additionally, it must be guaranteed that only authorized legitimate users can access private data, to enhance security. non-cryptographic approaches mainly use access control mechanisms for security and privacy preservation. with regard to the security goals, an access control mechanism is a kind of security technique that performs identification, authentication and authorization for entities. it is a tool widely used in secure data sharing with minimal risk of data leakage. we will discuss this mechanism in detail in section . . ehr systems can upload medical records and other information to the blockchain. if these data are stored directly in the blockchain network, they will increase the computational overhead and storage burden due to the fixed and limited block size. what's more, these data would also suffer from privacy leakage. to solve these problems, most relevant research and applications [ , , , ] store the bulk of the data off-chain and keep only compact metadata and pointers on the blockchain. yue et al. [ ] proposed that a simple unified indicator centric schema (ics) could organize all kinds of personal healthcare data easily in one simple 'table'. in this system, data is uploaded once and retrieved many times, and a multi-level index is designed to support efficient retrieval. most systems in the previous sections adopted a third-party database architecture. the third-party services (such as cloud computing) at the far end assist the users in improving the quality of service (qos) of the applications by providing data storage and computation power, but with a transmission latency. such a storage system has gained common acceptance, but it depends on a trusted third party. a decentralized alternative is storage based on a distributed hash table (dht). nguyen et al. [ ] designed a system that integrates smart contracts with ipfs to improve decentralized cloud storage and controlled data sharing for better user access management. rifi et al. [ ] also adopted ipfs as the candidate off-chain database to store large amounts of personal sensor data. wang et al. [ ] designed a system that utilizes ipfs to store the encrypted file. the encryption key of the file is first encrypted using an abe algorithm and then, together with other information (the ciphertext of the file location hash), encrypted using the aes algorithm. only when the requestor's attribute set meets the access policy predefined by the data owner can the requestor obtain the clue from the blockchain and then download and decrypt the files from ipfs.
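the off-chain storage pattern described above (bulky or encrypted records kept outside the chain, with only a content hash and minimal metadata anchored on the ledger) can be sketched as follows. the dictionary standing in for ipfs or a cloud store and the list standing in for the ledger are simplifications introduced purely for illustration; they are not the api of any particular platform.

```python
import hashlib
import json
import time

off_chain_store = {}   # stands in for ipfs or a cloud database: content hash -> blob
ledger = []            # stands in for the on-chain transaction log

def put_record(owner_id: str, encrypted_record: bytes) -> str:
    """store the (already encrypted) record off-chain and anchor its hash on-chain."""
    pointer = hashlib.sha256(encrypted_record).hexdigest()
    off_chain_store[pointer] = encrypted_record
    ledger.append({
        "owner": owner_id,
        "pointer": pointer,            # content-addressed link to the off-chain data
        "timestamp": int(time.time()),
    })
    return pointer

def get_record(pointer: str) -> bytes:
    """fetch the off-chain blob and verify it against the on-chain hash before use."""
    blob = off_chain_store[pointer]
    if hashlib.sha256(blob).hexdigest() != pointer:
        raise ValueError("off-chain data does not match the on-chain pointer")
    return blob

ptr = put_record("patient-42", b"...ciphertext of an emr...")
assert get_record(ptr) == b"...ciphertext of an emr..."
print(json.dumps(ledger[-1], indent=2))
```

because the pointer is the hash of the content, any tampering with the off-chain copy is detected when the retrieved blob is re-hashed and compared against the on-chain value.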
table : systems requirements that have been met in table (columns: paper, security, privacy, anonymity, integrity, authentication, controllability, auditability, accountability). according to table and , the common architecture for data storage in the ehr system is shown in fig. . the advantages of integrating off-line storage into blockchain systems are as follows. first, detailed medical records cannot be accessed directly, which preserves patient data privacy. second, it helps to reduce the throughput requirement significantly, since only transaction records and a small amount of metadata are stored in the blockchain. besides, data pointers stored in the block can be linked to the location of the raw data in the off-chain database to support data integrity. however, it is difficult to fully trust third parties to store these sensitive data, and doing so may also contradict the idea of decentralization. further research is needed to accelerate the acceptance of distributed storage systems in practice, like ipfs. also, the next step should be to improve the storage architecture of blockchain for higher storage capacity. the healthcare industry relies on multiple sources of information recorded in different systems, such as hospitals, clinics, laboratories and so on. healthcare data should be stored, retrieved and manipulated by different healthcare providers for medical purposes. however, such sharing of medical data is challenging due to heterogeneous data structures among different organizations. it is necessary to consider the interoperability of data among different organizations before sharing data. we will introduce interoperability first. interoperability of ehrs is the degree to which ehrs are understood and used by multiple different providers as they read each other's data. interoperability can be used to standardize and optimize the quality of health care. interoperability is commonly classified into several levels, including: • syntactic interoperability: one ehr system can communicate with another system through compatible formats. • semantic interoperability: data can be exchanged and accurately interpreted at the data field level between different systems. the lack of unified interoperability standards has been a major barrier to high-performance data sharing between different entities. according to the study [ ] and several related studies [ , , ] , the fast healthcare interoperability resources (fhir) standard has been adopted as the data specification and standard format for data exchange between different organizations; this criterion was created by the hl (health level seven international) healthcare standards organization. the system in [ ] follows this approach. bahga et al. [ ] proposed the cloud health information systems technology architecture (chistar), which achieves semantic interoperability, defines a general-purpose set of data structures and attributes, and allows healthcare data to be aggregated from disparate data sources. besides, it can support security features and address the key requirements of hipaa. chen et al. [ ] designed a secure interoperable cloud-based ehr service with the continuity of care document (ccd). they provided self-protecting security for health documents with support for embedding and user-friendly control. in a word, interoperability is the basic ability of different information systems to communicate, exchange and use data in the healthcare context.
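to make the idea of a shared data format concrete, the snippet below constructs a minimal fhir-style observation resource as plain json. the identifiers and code values are illustrative assumptions rather than data from any cited system, and real fhir resources carry many more elements and are validated against the published profiles.

```python
import json

# a minimal fhir-style observation: a heart-rate reading tied to a patient record
observation = {
    "resourceType": "Observation",
    "id": "obs-example-1",                        # illustrative identifier
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",         # shared terminology supports semantic interoperability
            "code": "8867-4",
            "display": "Heart rate",
        }]
    },
    "subject": {"reference": "Patient/patient-42"},
    "effectiveDateTime": "2020-01-01T08:30:00Z",
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
}

print(json.dumps(observation, indent=2))
```

because both the structure (syntactic level) and the terminology codes (semantic level) are standardized, a receiving system can interpret the reading without bilateral format agreements.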
ehr systems following international standards can achieve interoperability and support data sharing between multiple healthcare providers and organizations. we will discuss data sharing in detail next. it is obviously inconvenient and inefficient for patients themselves to transfer paper medical records between different hospitals. sharing healthcare data is considered to be a critical approach to improving the quality of healthcare services and reducing medical costs. though current ehr systems bring much convenience, many obstacles still exist in healthcare information systems in practice; they hinder secure and scalable data sharing across multiple organizations and thus limit the development of medical decision-making and research. as mentioned above, there are risks of single-point attacks and data leakage in a centralized system. besides, patients cannot preserve ownership of their own private data or choose to share it only with someone they trust, which may result in unauthorized use of private data by curious organizations. furthermore, competing organizations lacking trust partnerships are unwilling to share data, which also hinders the development of data sharing. in this case, it is necessary to ensure security and privacy protection and to return control of data to users in order to encourage data sharing. it is relatively simple to deal with security and privacy issues when data resides in a single organisation, but it is challenging in the case of secure health information exchange across different domains. meanwhile, further consideration is needed of how to encourage efficient collaboration in the medical industry. a secure access control mechanism, as one of the common approaches, requires that only authorized entities can access shared data. this mechanism includes an access policy, commonly consisting of an access control list (acl) associated with the data owner. an acl is a list of requestors who can access the data, together with the related permissions (read, write, update) on specific data. authorization is the function of granting permission to authenticated users to access protected resources following predefined access policies. the authentication process always comes before the authorization process. access policies of this mechanism mainly focus on who is performing which action on what data object and for which purposes. traditional access control approaches for ehr sharing are deployed, managed and run by third parties. users generally assume that third parties (e.g. cloud servers) perform authentication and handle access requests on data usage honestly; in fact, however, the server may be honest but curious. combining blockchain with access control mechanisms is therefore a promising way to build a trustworthy system. users can realize secure self-management of their own data and keep shared data private. in this new model, patients can predefine access permissions (authorize, refuse, revoke), operations (read, write, update, delete) and duration for sharing their data by smart contracts on the blockchain, without losing control of it. smart contracts can be triggered on the blockchain once all preconditions are met and can also provide an audit mechanism, with every request recorded in the ledger. there are many existing studies and applications applying smart contracts to secure healthcare data sharing. peterson et al. [ ] proposed that patients can authorize access to their records only under predefined conditions (research of a certain type, and for a given time range); a simplified sketch of such a patient-defined policy check is given below.
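the patient-defined policy check enforced by such smart contracts can be modelled in a few lines of python. this is a plain-python model of the contract logic, not solidity and not the code of any cited system; the permission names, the expiry field and the example identities are assumptions for illustration.

```python
import time

# patient-defined policy: who may perform which operations on which record, and until when
access_policy = {
    ("dr-lee", "record-001"): {
        "operations": {"read"},
        "expires_at": time.time() + 7 * 24 * 3600,   # access granted for one week
    },
}

access_log = []   # on a blockchain this would be the immutable transaction history

def request_access(requestor: str, record_id: str, operation: str) -> bool:
    """mimic the contract check: grant only if the owner's policy allows it and has not expired."""
    grant = access_policy.get((requestor, record_id))
    allowed = (
        grant is not None
        and operation in grant["operations"]
        and time.time() < grant["expires_at"]
    )
    access_log.append({
        "requestor": requestor, "record": record_id,
        "operation": operation, "granted": allowed, "time": int(time.time()),
    })
    return allowed

assert request_access("dr-lee", "record-001", "read") is True
assert request_access("dr-lee", "record-001", "write") is False   # operation not granted
assert request_access("dr-kim", "record-001", "read") is False    # no grant at all
```

revocation corresponds to the owner deleting or shrinking a grant, and because every request is appended to the log, later disputes can be settled from the recorded decisions.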
in the system of peterson et al., a smart contract placed directly on the blockchain verifies whether data requestors meet these conditions before they can access the specified data. if the requestor does not have the access rights, the system will abort the session. similarly, smart contracts in [ ] can be used for granting and revoking access rights and for notifying users of updated information. the smart contract in most systems includes predefined access policies that depend on requestors' roles or purposes and the corresponding role-based or purpose-based privileges. however, this is inflexible for handling unplanned or dynamic events and may lead to potential security threats [ ]. another mechanism, attribute-based access control (abac), has been applied in secure systems to handle issues remaining in extensions of role-based access control (rbac) and to enhance security in some specific cases. systems based on access control mechanisms record any operation on access policies by logging; however, in traditional systems such logs are vulnerable to malicious tampering without an assurance of their integrity. blockchain and smart contracts can perform access authorization automatically in a secure container and ensure the integrity of policies and operations. thus, access control mechanisms integrated with blockchain can provide secure data sharing. diverse forms of access control can be applied to different situations depending on the demands for system security. audit-based access control aims to enhance the reliability of a posteriori verification [ ] . organization-based access control (orbac) [ ] can be expressed dynamically based on a hierarchical structure, including organization, role, activity, view and context. a remaining weakness noted for several schemes is that a user's identity may be exposed without a de-identification mechanism. table : systems requirements that have been met in table (columns: paper, security, privacy, anonymity, integrity, authentication, controllability, auditability, accountability). table : systems requirements that have been met in table (columns: paper, security, privacy, anonymity, integrity, authentication, controllability, auditability, accountability). based on the information in these tables, cryptography technology can also be used to enhance secure data sharing and the security of access control mechanisms in most ehr systems. dubovitskaya et al. [ ] proposed a framework to manage and share emrs for cancer patient care based on symmetric encryption. patients can generate symmetric encryption keys to encrypt/decrypt the data shared with doctors. if the symmetric key is compromised, a proxy re-encryption algorithm can be run on the data stored in the trusted cloud and a new key shared with clinicians according to predefined access policies. only the patients can share symmetric keys and set up the access policies via smart contract, enhancing the security of the shared data. xia et al. [ ] designed a system that allows users to access requested data from a shared sensitive data repository after both their identities and issuing keys are verified. in this system, a user-issuer protocol is designed to create a membership verification key and a transaction key, and a user-verifier protocol is used for membership verification, so that only valid users can send data requests to the system. ramani et al. [ ] utilized lightweight public key cryptographic operations to enhance the security of permissioned requests (append, retrieve); before looking at further schemes, the sketch below illustrates the hybrid encryption pattern underlying several of these designs.
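several of the schemes above share the same hybrid pattern: the record is encrypted once under a random symmetric key, and that key is then wrapped for each authorized requestor. the sketch below shows the pattern with aes-gcm and rsa-oaep from the widely used pyca/cryptography package; the rsa wrapping step merely stands in for the abe or proxy re-encryption operations of the cited systems, which require dedicated libraries.

```python
import os
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# 1. the patient encrypts the record once under a fresh symmetric data key
record = b"emr: diagnosis and lab results ..."
data_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(data_key).encrypt(nonce, record, None)

# 2. the authorized clinician has an asymmetric key pair; the data key is wrapped for them
clinician_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
clinician_public = clinician_private.public_key()
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = clinician_public.encrypt(data_key, oaep)

# 3. the clinician unwraps the data key and decrypts the record
recovered_key = clinician_private.decrypt(wrapped_key, oaep)
plaintext = AESGCM(recovered_key).decrypt(nonce, ciphertext, None)
assert plaintext == record
```

in several of the cited designs, the wrapped key (or the policy that governs it) is what is stored or referenced on the blockchain, while the bulky ciphertext stays off-chain.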
in the scheme of ramani et al., nobody can change the patients' data without a notification being sent to the patients, since a requested transaction is checked for the patient's signature before being stored on the private blockchain. wang et al. [ ] designed a system that combines ethereum with attribute-based encryption (abe) technology to achieve fine-grained access control over data in a decentralized storage system without a trusted private key generator (pkg). the encryption key of the file is stored on the blockchain in encrypted form using the aes algorithm. requestors whose attributes meet the access policies can decrypt the file encryption key and then download the encrypted file from ipfs. besides, the keyword search implemented by smart contract can avoid dishonest behaviour by cloud servers. liu et al. [ ] proposed a blockchain-based privacy-preserving data sharing scheme for emrs called bpds. the system adopts content extraction signatures (ces) [ ] , which can remove sensitive information from emrs, support selective data sharing and generate valid extraction signatures, reducing the risk of data privacy leakage and helping to enhance the security of access control policies. besides, users can use different public keys for different transactions to remain anonymous. huang et al. [ ] designed a blockchain-based data sharing scheme for the cloud computing environment that uses group signatures to solve the trust issue among different groups and ensure the reliability of data from other organizations; requestors can verify the group signature to confirm that shared data indeed originates from a member of the claimed group. as shown in table and , cryptography technology can protect sensitive data directly and improve the traditional access control mechanism to meet the demands for security and privacy. however, public key encryption has a high computational overhead and a trusted pki is necessary for authentication. a similar problem exists for the trusted pkg, one of the important components of abe. besides, how to transmit the shared key securely must be addressed for symmetric encryption. as mentioned before, mpc may not be suitable for wearable devices in the iot context due to its high computational cost. it is necessary to improve these algorithms to adapt them to devices/sensors with limited resources. above all, blockchain, as a secure, immutable and decentralized framework, returns control of data to the patients themselves in the healthcare industry. as shown in fig. , combining smart-contract-based access control with cryptographic protection of sensitive data can achieve secure data sharing among different individuals and institutions. meanwhile, all records are included in the immutable public ledger to ensure the integrity and reliability of data and minimize the risk of raw data leakage. concerning potential dishonest behaviour or wrong results from third parties (cloud servers) holding large amounts of raw/encrypted data, blockchain offers an immutable historical record for traceability and accountability, sometimes combined with cryptographic techniques (such as group signatures). next, we discuss secure audit to enhance the security of ehr systems further.
the immutable public ledger and the smart contracts in the blockchain can provide an immutable record of all access requests, achieving traceability and accountability. an audit log mainly contains the following vital and understandable information:
• timestamp of the logged event
• id of the user who requests the data
• id of the data owner whose data is accessed
• action type (create, delete, query, update)
• the validation result of the request
qi et al. [ ] designed a data sharing model able to effectively track dishonest sharing behaviour and to revoke access rights upon violated permissions and malicious access. the system provides provenance, audit, and medical data sharing among cloud service providers with minimal risk to data privacy. a similar system in [ ] provides auditable and accountable access control for shared cloud repositories among big data entities in a trust-less environment. azaria et al. [ ] also provided auditability via a comprehensive log, noting that obfuscation for privacy needs further exploration while preserving auditability in the public ledger. fernandez et al. [ ] designed a blockchain-based system called auditchain for managing audit records. to improve research quality through better reproducibility, the timestamped statistical analysis of clinical trials in [ ] ensures the traceability and integrity of each sample's metadata, relying on a blockchain that stores proofs of existence of the data; the related analytical code used to process the data must also be timestamped so that the data can be checked and the analysis reproduced, and timestamps in the blockchain provide stronger version control than git. the above-mentioned studies indicate that blockchain plays an important role in auditing and accountability: users can not only retain control over their own data but also monitor all request operations for audit and accountability when disputes occur. above all, the audit log provides reliable evidence of anomalous and potentially malicious behavior, which improves the security of access control models, and it benefits the adjustment of healthcare services by giving insight into personnel interactions and workflows in hospitals. audit logs are, however, costly to store and process. currently, audit log data does not reliably contain the required and representative information, which makes it difficult to interpret or even to access, and the problem worsens when multiple ehr organizations collaborate. it is therefore necessary to consider how to achieve an interoperable and well-formatted audit log standard to support secure data exchange among different healthcare institutions. membership verification is the first step in securing any system before access to any resource is granted. in the access control mechanisms discussed above, identity authentication is always performed first to ensure that specific rights are granted only to data requestors with a legal identity before data is shared. common types of user authentication include pass-through authentication, biometric authentication, and identity verification based on public-key cryptographic algorithms. public key infrastructure (pki) is commonly used and relies on trusted third parties to provide membership management services. identity registration is performed in [ ] with a registrar smart contract that maps a valid string form of identity information to a unique ethereum address via public-key cryptography; a dns-like implementation can be employed to map already regulated existing forms of id.
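as an illustration of the registrar idea just described, the sketch below maps a hash of an already regulated identity string to an address and refuses duplicate registrations. it is a plain python stand-in for what would be an on-chain smart contract; all names and values are hypothetical.

```python
# Illustrative sketch only of the registrar idea: a registry maps a hash of an
# already-regulated identity string (e.g. a licence number) to a blockchain
# address and rejects duplicates. A real registrar would be a smart contract.
import hashlib

class IdentityRegistrar:
    def __init__(self):
        self._registry = {}          # sha256(identity string) -> address

    def register(self, identity_string: str, address: str) -> bool:
        key = hashlib.sha256(identity_string.encode()).hexdigest()
        if key in self._registry:
            return False             # identity already bound to an address
        self._registry[key] = address
        return True

    def resolve(self, identity_string: str):
        key = hashlib.sha256(identity_string.encode()).hexdigest()
        return self._registry.get(key)

if __name__ == "__main__":
    registrar = IdentityRegistrar()
    print(registrar.register("clinician-licence-0042", "0xABC...001"))   # True
    print(registrar.register("clinician-licence-0042", "0xDEF...002"))   # False
    print(registrar.resolve("clinician-licence-0042"))                   # 0xABC...001
```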
zhang et al. [ ] established secure links for the wireless body area network (wban) and the pervasive social network (psn) after authentication and key establishment through an improved ieee . . display authenticated association protocol [ ]. the protocol can protect data collected through human body channels (hbcs) and reduce the computational load on the sensors. xia et al. [ ] designed an efficient and secure identity-based authentication and key agreement protocol for membership authentication with anonymity in a permissioned blockchain. verification takes the form of a challenge-response dialog that proves whether the sender is authentic when the verifier receives a verification request from a user holding the shared key. most blockchain-based systems use pseudonyms to hide real identities for privacy; however, there is a conflict between privacy preservation and authenticity, namely how to verify an identity without exposing information about the real identity. in addition, adversaries or curious third parties can guess the real identity and the related behavior pattern through inference attacks such as transaction graph analysis. shae et al. [ ] designed an anonymous identity authentication mechanism based on zero-knowledge technology [ ], which addresses two conflicting requirements: keeping the identity anonymous while verifying the legitimacy of the user's identity as well as of iot devices. sun et al. [ ] proposed a decentralizing attribute-based signature (dabs) scheme that provides effective verification of a signer's attributes without leaking identity information. multiple authorities can issue valid signature keys according to a user's attributes rather than the real identity and provide a privacy-preserving verification service; other nodes can verify whether the data owner is qualified using the verification key corresponding to the satisfied attributes, without revealing the owner's identity. hardjono et al. [ ] designed an anonymous but verifiable identity scheme, called chainachor, using the epid zero-knowledge proof scheme. these anonymous identities can achieve unlinkable transactions by using different public keys in the blockchain once nodes execute the zero-knowledge proof protocol successfully, and the scheme also provides optional disclosure of the real identity when disputes occur. biometric authentication is also widely used, including face and voice pattern identification, retinal pattern analysis, hand characteristics, and automated fingerprint analysis based on pattern recognition. lee et al. [ ] proposed that human nails can be used for identity authentication since nails have a high degree of uniqueness. the system uses histogram of oriented gradients (hog) and local binary pattern (lbp) features to extract the biometric identification signature, and then an svm and a convolutional neural network are utilized for authentication with high accuracy. such identity verification based on a dynamic identity, rather than on regular real-identity information, preserves user anonymity and privacy. the main goal of identity management is to ensure that only authenticated users can be authorized to access the specified resource. currently, most systems rely on a membership service component or similar providers for identity authentication. traditional authentication mainly adopts password authentication and may even transmit user credentials in clear text, so anyone can eavesdrop on the external connection and intercept them; in this case, attackers or curious third parties may impersonate compromised users to gain access to sensitive data.
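in contrast to clear-text credentials, the shared-key challenge-response dialog described above can be sketched in a few lines. this is a minimal illustration using an hmac over a random nonce, not the full protocol of xia et al., which additionally handles identities, anonymity, and key agreement.

```python
# Minimal sketch of a shared-key challenge-response dialog: the verifier sends a
# random nonce, the prover returns an HMAC over it, and the verifier recomputes
# the tag. Real protocols add identities, timestamps and key agreement.
import hashlib, hmac, os, secrets

SHARED_KEY = os.urandom(32)          # established out of band / via key agreement

def make_challenge() -> bytes:
    return secrets.token_bytes(16)   # unpredictable nonce from the verifier

def prove(challenge: bytes, key: bytes) -> bytes:
    return hmac.new(key, challenge, hashlib.sha256).digest()

def verify(challenge: bytes, response: bytes, key: bytes) -> bool:
    expected = hmac.new(key, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)   # constant-time comparison

if __name__ == "__main__":
    challenge = make_challenge()
    response = prove(challenge, SHARED_KEY)           # computed by the prover
    print("authenticated" if verify(challenge, response, SHARED_KEY) else "rejected")
```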
it is difficult to find, and to rely on, a trustworthy third-party membership service that validates user identities and performs complex cross-heterogeneous-domain authentication honestly, without any risk of real-identity leakage. besides, typical blockchain systems cannot provide privacy-preserving verification because the public transaction record includes pseudonyms and their related behavior; curious third-party servers or network nodes may therefore collect large amounts of data and infer real identities by statistical analysis. blockchain can also support machine learning workflows in healthcare: it allows rolling back to previously stored models if the false prediction rate is high, since the pointers to the relevant data of retrained models are stored in a secure and immutable manner. juneja et al. [ ] proposed that retrained models indexed by pointers in the blockchain can increase accuracy for continuous remote monitoring systems in the context of irregular arrhythmia alarm rates. additionally, artificial intelligence can be applied to the automatic generation of smart contracts to enable secure and flexible operations. in the context of iot, the locations of products can be tracked at each step with radio-frequency identification (rfid), sensors, or gps tags, and an individual's health status can be monitored at home via sensor devices and shared in the cloud environment, where healthcare providers can access it to provide timely medical support. however, as the use of sensors grows exponentially in various environments, the security level of the sensitive data produced by these sensors has to be improved. currently, only a few studies focus on solving the above-mentioned problems; related research mainly concerns improvements to consensus algorithms, block size design [ ], and so on. croman et al. [ ] mainly improved the scalability of blockchain with respect to latency, throughput, and other parameters; their experiments showed that block size and generation interval in bitcoin are the first step toward throughput improvements and latency reduction without threatening the system's decentralization. new challenges for the two data types in blockchain-based systems are throughput and fairness, and two fairness-based packing algorithms have been designed to improve the throughput of the system and the fairness among users. in practical application scenarios, encouraging miners to participate in the network is important for maintaining a trustworthy and stable blockchain. azaria et al. [ ] proposed an incentive mechanism that encourages medical researchers and healthcare authorities to act as miners, creating a data economy by rewarding researchers with access to big data on hospital records. yang et al. [ ] proposed a selection method within the incentive mechanism: providers with less significance (a measure of the effort a provider has made on network maintenance and new block generation) have higher probabilities of being selected to carry out the task of generating the next block, and the selected provider is granted significance as a bonus, which reduces its probability of being selected in the future. pham et al. [ ] made further improvements based on blockchain gas prices, which can boost a transaction's priority in the processing queue by automatically adjusting the gas price and can then immediately trigger an emergency contact to providers for on-time treatment.
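the significance-based selection idea attributed to yang et al. above can be illustrated with a toy sketch: providers with lower accumulated significance are more likely to be chosen to generate the next block, and the chosen provider earns significance, lowering its future selection probability. the inverse-weighting formula below is an assumption made purely for illustration.

```python
# Toy illustration of significance-based selection: lower accumulated
# "significance" -> higher chance of being picked to produce the next block;
# the winner earns significance, reducing its future selection probability.
import random

significance = {"hospital-A": 5.0, "clinic-B": 1.0, "lab-C": 2.0}

def select_block_producer(scores: dict, bonus: float = 1.0) -> str:
    providers = list(scores)
    weights = [1.0 / (1.0 + scores[p]) for p in providers]   # inverse weighting
    chosen = random.choices(providers, weights=weights, k=1)[0]
    scores[chosen] += bonus          # reward lowers its future probability
    return chosen

if __name__ == "__main__":
    for _ in range(5):
        print(select_block_producer(significance), significance)
```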
meanwhile, it should be noted that all transactions can be "seen" by any node in the blockchain network. homomorphic encryption and zero-knowledge proofs could be utilized to prevent data forensics by inference, to maintain the privacy of individual information, and to allow computations to be performed without leaking their inputs and outputs. as the above discussion shows, blockchain still has many limitations, and more aggressive extensions will require fundamental protocol redesign, so improving the underlying architecture of blockchain is an urgent prerequisite for better service. in the context of iot, personal healthcare data streams collected from wearable devices are high in volume and arrive at a fast rate. such large amounts of data can support big data analytics and machine learning, increasing data quality and enabling more intelligent health services; however, they may also lead to high network latency because of the physical distance to mobile devices and traffic congestion on the cloud servers. besides, the mining process and some encryption algorithms impose a high computational cost on resource-limited devices and restrict the use of blockchain. a new trend is therefore increasingly moving functionality from the cloud toward the network edge, with its low network latency, driven mainly by time-sensitive applications such as healthcare monitoring. combined with edge computing, blockchain is broadened from pure data storage to a wide range of services, such as device configuration and governance, sensor data storage and management, and multi-access payments. if new technologies enter the market without some form of vetting, they should be adopted with care, for example on the basis of a cost-benefit analysis. hence, to improve compliance, security, interoperability, and other factors, we need to develop uniform standards, policies, and regulations (e.g., those relating to data security and privacy and to the blockchain ecosystem). for example, we would likely need different independent and trusted mechanisms to evaluate different blockchain solutions for different applications and contexts, in terms of privacy, security, throughput, latency, capacity, etc. we would also need to be able to police and enforce penalties for misbehavior and/or violations (e.g., non-compliance or not delivering as agreed in the contract). blockchain has shown great potential in transforming the conventional healthcare industry, as demonstrated in this paper. there, however, remain a number of research and operational challenges when attempting to fully integrate blockchain technology with existing ehr systems. in this paper, we reviewed and discussed some of these challenges and then identified a number of potential research opportunities, for example relating to iot, big data, machine learning, and edge computing. we hope this review will contribute to further insight into the development and implementation of next-generation ehr systems, which will benefit our (ageing) society.
references:
healthcare professionals organisational barriers to health information technologies: a literature review
maturity models of healthcare information systems and technologies: a literature review
security and privacy in electronic health records: a systematic literature review
implementing electronic health records in hospitals: a systematic literature review
electronic health record use by nurses in mental health settings: a literature review
personal electronic healthcare records: what influences consumers to engage with their clinical data online? a literature review
methodologies for designing healthcare analytics solutions: a literature analysis
opportunities and challenges in healthcare information systems research: caring for patients with chronic conditions
visualization of blockchain data: a systematic review
a blockchain-based approach to health information exchange networks
blockchain in healthcare applications: research challenges and opportunities
blockchain: a panacea for healthcare cloud-based data security and privacy?
ieee technology & engineering management conference (temscon)
blockchain technology in healthcare: the revolution starts here
bitcoin: a peer-to-peer electronic cash system
dcap: a secure and efficient decentralized conditional anonymous payment system based on blockchain
an efficient nizk scheme for privacy-preserving transactions over account-model blockchain
a survey on privacy protection in blockchain system
secure and efficient two-party signing protocol for the identity-based signature scheme in the ieee p standard for public key cryptography
multi-party signing protocol for the identity-based signature scheme in ieee p standard
the byzantine generals problem
a survey on consensus mechanisms and mining strategy management in blockchain networks
practical byzantine fault tolerance
ethereum: blockchain app platforms
multichain: open platform for building blockchains
forensic-by-design framework for cyber-physical cloud systems
medical cyber-physical systems development: a forensics-driven approach
sdte: a secure blockchain-based data trading ecosystem
class: cloud log assuring soundness and secrecy scheme for cloud forensics
enigma: decentralized computation platform with guaranteed privacy
medibchain: a blockchain based privacy preserving platform for healthcare data
fingernail analysis management system using microscopy sensor and blockchain technology
ordieres-mere, blockchain-based personal health data sharing system using cloud storage
the design of rijndael: aes - the advanced encryption standard
s. vanstone, a. menezes, p. v. oorschot, handbook of applied cryptography
healthcare data gateways: found healthcare intelligence on blockchain with novel privacy risk control
secure attribute-based signature scheme with multiple authorities for blockchain in electronic health records systems
lightweight backup and efficient recovery scheme for health blockchain keys
bpds: a blockchain based privacy-preserving data sharing for electronic medical records
leveraging blockchain for retraining deep learning architecture in patient-specific arrhythmia classification
medrec: using blockchain for medical data access and permission management
a decentralizing attribute-based signature for healthcare blockchain
using java to generate globally unique identifiers for dicom objects
a framework for secure and decentralized sharing of medical imaging data via blockchain consensus
blockchain for secure ehrs sharing of mobile cloud based e-health systems
towards using blockchain technology for ehealth data access management
a blockchain-based framework for data sharing with fine-grained access control in decentralized storage systems
an overview of interoperability standards for electronic health records, usa: society for design and process science
ieee international conference on bioinformatics and biomedicine
medrec: using blockchain for medical data access and permission management
fhirchain: applying blockchain to securely and scalably share clinical data
applying software patterns to address interoperability in blockchain-based healthcare apps
a cloud-based approach for interoperable electronic health records (ehrs)
design for a secure interoperable cloud-based personal health record service
how distributed ledgers can improve provider data management and support interoperability
integrating blockchain for data sharing and collaboration in mobile healthcare applications
security and privacy in electronic health records: a systematic literature review
blockchain for access control in e-health scenarios
blockchain based access control services
blockchain based delegatable access control scheme for a collaborative e-health environment
audit-based access control with a distributed ledger: applications to healthcare organizations
organization based access control
secure and trustable electronic medical records sharing using blockchain
bbds: blockchain-based data sharing for electronic medical records in cloud environments
secure and efficient data accessibility in blockchain based healthcare systems
a secure system for pervasive social network-based healthcare
blockchain-based multiple groups data sharing with anonymity and traceability
privacy-preserving attribute-based access control model for xml-based electronic health record system
dynamic access control policy based on blockchain and machine learning for the internet of things
content extraction signatures
medshare: trust-less medical data sharing among cloud service providers via blockchain
security and privacy in electronic health records: a systematic literature review
improving data transparency in clinical trials using blockchain smart contracts
blockchain technology for improving clinical research quality
blockchain distributed ledger technologies for biomedical and health care applications
on the design of a blockchain platform for clinical trial and precision medicine
non-interactive zero-knowledge and its applications
verifiable anonymous identities and access control in permissioned blockchains
big data: are biomedical and health informatics training programs ready?
privacy preserving in blockchain based on partial homomorphic encryption system for ai applications
a fully homomorphic encryption scheme
an architecture and protocol for smart continuous ehealth monitoring using g
g-smart diabetes: toward personalized diabetes diagnosis with healthcare big data clouds
permissioned blockchain and edge computing empowered privacy-preserving smart grid networks
integrated blockchain and edge computing systems: a survey, some research issues and challenges
blochie: a blockchain-based platform for healthcare information exchange
a design of blockchain-based architecture for the security of electronic health record (ehr) systems
a secure remote healthcare system for hospital using blockchain smart contract
shuyun shi received the bachelor degree in , from the school of computer; she is currently working toward a master degree at the key laboratory of aerospace information security and trusted computing, ministry of education. he is currently a professor at the key laboratory of aerospace information security and trusted computing, ministry of education, school of cyber science and engineering, wuhan university, wuhan, china; his main research interests include cryptography and information security. li li received her ph.d. degree in computer science from the computer school; she is currently an associate professor at the school of software, wuhan university, and her research interests include data security and privacy, applied cryptography, and security protocols. his research is focused on mobile computing, parallel/distributed computing, multi-agent systems, service-oriented computing, routing, and security issues in mobile ad hoc, sensor, and mesh networks; he has more than technical research papers in leading journals such as ieee tii, his research is supported by dst, tcs, and ugc, and he has guided many students to m.e. and ph.d. degrees. he has received the australia day achievement medallion and the british computer society's wilkes award in , and he is also a fellow of the australian computer society (digital rights management for multimedia interest group). we thank the anonymous reviewers for their valuable comments and suggestions, which helped us to improve the content and presentation of this paper. the authors declare that they have no conflicts of interest.
key: cord- -h v cs b authors: delaunay, sophie; kahn, patricia; tatay, mercedes; liu, joanne title: knowledge sharing during public health emergencies: from global call to effective implementation date: - - journal: bull world health organ doi: . /blt. . sha: doc_id: cord_uid: h v cs b
in february , the issue of data sharing during emergencies made headlines around the world after leading research funders, academic journals and nongovernmental organizations signed a joint declaration of commitment to rapidly share data relevant to the zika virus outbreak. this action followed repeated calls from some of the same constituencies for sharing data from clinical trials conducted in the context of public health emergencies, and in public health more generally. while the zika open data initiative is a positive step, it also highlights the shortcomings of calling for knowledge sharing after an outbreak has already begun.
to improve epidemic emergency response and to accelerate related research, health authorities in potentially exposed countries must put in place the necessary frameworks for collecting, managing and swiftly making available good-quality, standardized data, and for safely securing and sharing biomaterial - such as patient samples - collected during the outbreak. the ebola virus disease outbreak that took the lives of over people offers ample lessons on why more knowledge sharing is essential and on how to achieve it. during the ebola outbreak, massive amounts of data and biomaterials were collected. these included clinical and laboratory data from over patients, some enrolled in clinical trials, tens of thousands of patient specimens and over two years' worth of epidemiological data. if appropriate knowledge-sharing frameworks had been in place, these collections could have offered an historic opportunity to expand what is known about ebola. one framework would guide data standardization, sharing, access and use, including rules on the transfer of agreed data sets into common repositories for curation. another framework would facilitate cataloguing of biomaterials collected during an outbreak and would establish transparent rules for their management, thereby creating a virtual biobank. for future emergencies, a biobank framework will also need to encompass ethics guidelines for the collection and use of samples, including the provision of appropriate information to patients. currently there are no mechanisms in place to advance such knowledge sharing. as the zika outbreak shows, the global public health community is still unprepared to collect good-quality, standardized data and biomaterials during emergencies and to share them in ways that provide equitable access to researchers. this limits researchers' ability to optimize the use of available data and specimens in ways that fill key knowledge gaps and to ensure that research will benefit the affected patients and communities where the materials originated. to address these shortcomings, médecins sans frontières (msf), in collaboration with the world health organization (who), has called upon stakeholders to establish a coordinated network of ebola biobanks. additionally, msf has joined oxford university's infectious diseases data observatory to establish a data-sharing platform for existing and future clinical, biological and epidemiological data, with the aim of making this information accessible to stakeholders and researchers with relevant scientific questions. together, a virtual biobank and a data repository could provide a global resource for the essential research needed to plan effective outbreak responses. neither the data sharing nor the biobank proposal is radically new. data sharing is a long-accepted practice in many health research fields, from the global burden of disease collaboration to surveillance and response to influenza, drug-resistant malaria and severe acute respiratory syndrome (sars). biobanks are well-established resources for disease research, for example on human immunodeficiency virus (hiv) and rare diseases. these arrangements benefit patients, researchers and the private sector. our proposal for data and biomaterial frameworks raises many challenges that must be addressed, including technical and ethical concerns and fears that some benefits of sharing data or patient samples will go primarily to wealthy countries.
however, these challenges only highlight the need for agreeing prospectively on transparent, ethical principles to guide the collection and future use of data and biomaterials collected for emergency health care. drawing on existing models, such as the nih platform, advocates for knowledge sharing in emergencies should emphasize the need to build the relevant frameworks before or at the onset of an outbreak. since the most significant research gaps exist for diseases found in low- and middle-income countries, these frameworks should address each existing and emerging pathogen, rather than only those that threaten high-income countries. building on its own call for action, who is the appropriate leader for such efforts. in may , when health ministers gather at the world health assembly, outbreak response will be high on the agenda. as they discuss who's forward-thinking research and development blueprint for epidemics and consider strategies for improving emergency preparedness and response, delegates should also make a bold commitment to develop strong, equitable mechanisms that translate calls for sharing data and biomaterials into critically needed action. ■ knowledge sharing during public health emergencies: from global call to effective implementation - sophie delaunay (a), patricia kahn (a), mercedes tatay (b) & joanne liu (b); (a) united states of america; (b) médecins sans frontières.
references:
global scientific community commits to sharing data on zika. london: wellcome trust
sharing clinical trial data - a proposal from the international committee of
rationale for who's new position calling for prompt reporting and public disclosure of interventional clinical trial results
providing incentives to share data early in health emergencies: the role of journal editors
developing global norms for sharing data and results during public health emergencies
sharing research data to improve public health. public library of science
seattle: institute for health metrics and evaluation
specimen repository. frontier science and technology research foundation
rare diseases human biospecimens/biorepositories (rd-hub). bethesda: national institutes of health
sharing clinical trial data: maximizing benefits, minimizing risk. washington: institute of medicine, national academies press
data sharing
common data element (cde) resource portal. bethesda: national institutes of health
an r&d blueprint for action to prevent epidemics. geneva: world health organization
key: cord- -b f z r authors: allam, zaheer title: underlining the role of data science and technology in supporting supply chains, political stability and health networks during pandemics date: - - journal: surveying the covid- pandemic and its implications doi: . /b - - - - . - sha: doc_id: cord_uid: b f z r
this concluding chapter explores how data science and technology have been key in fighting covid- through early detection and in the devising of tools for containing the spread. interestingly, two notable precedents are seen to emerge. first, data-driven modeling is the leading policy approach at the urban and national level, and second, legislation, which is being passed at record speed, will remain as a legacy postvirus. it is expected that these will accelerate the digital transition of communities for decades to come and lead to a resurgence of the smart cities concept, which peaked in .
this chapter thus outlines the increasing role of data science in health sciences, the need for more robust digital infrastructures, and the role of technology in supporting the livability of communities and the world order. in a period of just months, the global landscape had been overturned by the coronavirus, which was not seen as a threat in the initial stages when reports started trickling in from wuhan, china. in fact, for almost weeks after it was reported on december , it had only affected people and had not spread to any other regions, or outside the city of wuhan. for this reason, even the world health organization hesitated to identify it as a public health emergency of international concern (pheic) (who, c) and as a global pandemic (branswell, ). but when it started to spread, first to neighboring regions, then to more parts of china, and finally to regions far and wide, it caused unprecedented panic and fear. worse still, the number of infections started to increase exponentially, with substantial numbers of deaths in different parts of the world. by the end of months, the disease, which was later renamed covid- , had infected over million people and killed over , with most of the deaths witnessed in the most developed regions and countries (sullivan et al., ). worse still, it had sparked fear and panic among governments, prompting them to institute measures with far-reaching impacts on societies, economies, the political sphere, and also the environment. for the first time in history, numerous countries instituted lockdowns (barry and jordans, ; bermingham, ), imposed border restrictions, banned noncitizens or non-permanent residents from their countries, and grounded transportation networks. there was also a total suspension of flights, both domestic and international (pham, ), and an unprecedented use of security forces in different countries to enforce those measures. economies across the globe were halted, with only a few sectors, mainly those providing essential services, allowed to operate. the measures have also been accused of disrupting supply chains (cohen, ; oecd, ; shwartz, ), thus prompting wide-scale scarcity of basic goods, including medical supplies that have been in great demand in different parts of the globe. socially, the emergence of covid- brought unprecedented pain, fear, agony, and disarray as people in their thousands and millions were hospitalized and others were separated from their loved ones forever. in the worst cases, in areas with high daily death rates, the dead were buried in mass graves, with relatives denied the opportunity to say goodbye. the devastation has been even worse for those who had traveled outside their countries: with lockdowns and restricted transportation, people have had to remain displaced until such measures are eased or lifted. the measures imposed in different countries have also had unprecedented impacts on livability, especially in urban areas where people are forced to remain indoors with limited supplies (nkengasong and mankoula, ; wearden and jolly, ), with no social interactions, and with pressure from the loss of jobs and sources of livelihood. such conditions have prompted social strife, and in different parts of the globe, such as india, the united states, and germany, residents of some towns called for the lifting of lockdowns and other restrictions.
such issues were arising when the united nations' world food programme had warned that the effects of covid- would result in an increase in acute hunger, affecting more than million people globally (anthem, ). politically, as the covid- situation escalated, political tensions built up. for instance, countries were seen trading accusations over the responsibility that each had played in escalating the situation. others, such as china, were accused by countries including the united states, germany, the united kingdom, spain, and even france of having mishandled the situation and of failing to share conclusive information and data in time to allow others to prepare (bronstad, ; pleasance, ). on other occasions, the united states was blamed for instituting measures such as banning european travelers from entering the country and banning the exportation of medical supplies, among other issues (larsen and gramer, ). in fact, with no immediate end in sight for this pandemic, there were fears that the tension might escalate to dangerous heights, warranting the intervention of international bodies. while the emergence of covid- severely impaired global systems, it nevertheless exposed the importance and power of modern technologies in helping address events such as pandemics. for instance, in the first stages of the outbreak, when chinese scientists and authorities and the who were trying to identify the virus (who, d) and determine whether it could have impacts beyond wuhan, a tech startup named bluedot was able to detect that the world was facing an outbreak that might end up having a global impact. using real-time data from different sources, such as airline ticketing and numerous news outlets, it made the prediction days before the who made its announcement (allam et al., ; bowles, ). likewise, allam et al. ( ) highlighted another startup, metabiota, which correctly predicted that the outbreak would spread to the neighboring regions in a matter of time; a week later, the prediction came true as japan, thailand, singapore, and others that had been noted as potential targets for the outbreak confirmed their first cases. besides those, even when countries went on lockdown, the use of technology became even more apparent, as devices such as drones, robots, sensors, smart helmets, and thermal detectors were widely used for purposes such as delivery and the identification of potential coronavirus cases, among others (who, b). technologies such as artificial intelligence (ai), machine learning, natural language processing, and big data have also been instrumental in some countries in implementing quarantines, in the search for vaccines and drugs, and in helping reduce further spread of the virus. after such successes in the use of technology, it is incumbent upon stakeholders across the globe to invest more resources in ensuring its widespread adoption in different sectors, as, going forward, restoring the global systems will require a concerted effort. on the economic front, for instance, there are signs that the world may be headed for a recession, and that would have far-reaching impacts not only on the economies (africa renewal, ; statista, a, b; stephanie and gerstel, ; wang, ; wearden and jolly, ) but also on the environment.
on this, with the use of technologies in sectors such as agriculture, manufacturing, energy production, and building and construction, it would be possible not only to revive the economies but also to avoid impacts on the environment. by ensuring continuity of the strides already made in different sectors, especially the environmental ones, the world may escape the dangers posed by climate change and other issues related to environmental sustainability (allam and jones, ). with that background, this chapter concentrates on discussing the role of technology and data science in supporting supply chains, economies, and political sustainability. the occasional occurrence of pandemics in the world is not unusual from a historical perspective. since time immemorial, humans have had to contend with them, but, fortunately, most remained local, due to a number of factors. first, the size of the global population has played a significant part in the spread of pandemics, and in earlier days populations were relatively smaller and people were sparsely distributed. secondly, interaction between groups of people from different countries and regions was limited, as transportation infrastructure was, until recently, not well developed. also, urbanization was not as pronounced as it is today, and this played a key role in preventing widespread transmission. today, things are extremely different, as the population has already increased to . billion people, and it is projected to reach a high of . billion by and . billion by (un, ). furthermore, technological advancement is at an all-time high, and this has made interactions, communication, transportation, and research, among other things, more robust and efficient. these modern realities have made it possible for pandemics to become widespread and devastating. therefore, in the case of infectious diseases like covid- , it is not surprising that they spread quickly and impact numerous people and sectors in a short period of time. for instance, by the time of writing this chapter, the coronavirus had spread to over countries, infecting over million people and killing over people. it had also started to spark political and economic tensions, besides greatly impacting the social sphere in every part of the globe. technological advancement cannot be overlooked as a factor that facilitates spread, but its greatest strength has been in controlling and preventing further spread and devastation. for instance, during the spanish flu outbreak ( - ), when technology was rudimentary, over million people lost their lives (martini et al., ). however, as technology continued to advance, including in the medical fields, the subsequent - severe acute respiratory syndrome (sars) outbreak was effectively contained; it spread to only countries and infected only people over that period. even other outbreaks, such as the middle east respiratory syndrome (mers) (cdc, ), ebola (wojda et al., ), zika (kazmi et al., ), and influenza strains like the swine flu (aris-brosou et al., ), did not have devastating impacts compared with the spanish flu. in fact, even hiv, which emerged in the s and is still present to date, has been well managed thanks to the availability of technologies (the, ).
despite that, these outbreaks have had significant impacts on social and economic fronts, which cannot be overlooked and which demand even further advancement in the medical field to help identify outbreaks before they spread. one of the most novel ways of ensuring that future outbreaks can be contained is the widespread application of data computation and analysis (allam and jones, ), which, as discussed in the previous section, was already instrumental in the predictions about the outbreak and spread of covid- made by bluedot and metabiota. in other fields, such as climate change, predictive technologies have been widely applied, and these have had significant impacts in helping promote discourses on climate sustainability, emissions reduction, and the need to adopt alternative energy production (allam, a, b). while the use of technology is being promoted, it is not meant to diminish the role of other players in the medical world; rather, it would supplement their efforts, making the work of investigators, pharmacists, researchers, and others more effective and far-reaching. it would also help in hastening processes, making decisions, collecting data, and reducing human errors in the interpretation of that data, thus reducing misdiagnosis and other such issues that have occasionally been witnessed in the medical field. such benefits are made even more robust by the availability of diverse data sources, data-collecting technologies, and different data sharing platforms. for instance, with the growth of internet infrastructure, now supporting even g in some regions (o'mahony et al., ), and with the availability of numerous smart mobile devices such as phones, drones, wearable technologies, and cameras, data generation and sharing are becoming more pronounced. similarly, the increasing number of social media platforms (cinnamon et al., ), online news outlets, and mobile apps, among others, is helping generate unprecedented amounts of data (allam, b). such approaches have been used before, notably during the haiti earthquake and the subsequent cholera outbreak in that country, where health professionals collaborated with telecom companies such as digicel (the largest telecom operator in the country) to track the movement of people. this allowed for optimal resource management and aided the deployment efforts of those offering different forms of assistance (bengtsson et al., ; lu et al., ). even in the current pandemic, there is evidence of widespread use of technologies beyond what bluedot and metabiota did. in different parts of the globe, governments, telecoms, and startups have developed apps and online tools, such as trackers, that have helped in tracking and mapping the spread of the disease (porterfield, ; voa student union, ; wakefield, b). for instance, in south korea, it is reported that a platform was developed that allowed security and health personnel to effectively impose quarantine and warned them whenever individuals were flouting such measures (park, ). computation technologies were also widely used in china to combat the spread of covid- (chaturvedi, ), and even as the virus continued to spread in other countries, rival tech companies such as apple and google collaborated to make tracking tools to aid in the effective mapping and tracking of its spread (apple, ).
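the kind of aggregate mobility analysis cited above (the haiti call-record work and today's outbreak trackers) can be sketched as follows. this toy example, with assumed input fields, reduces pseudonymous location pings to region-to-region flow counts, which is the aggregate view that response planners typically consume.

```python
# Toy illustration of aggregate mobility analysis: pseudonymous location pings
# are reduced to region-to-region flow counts, with no individual trajectories
# exposed. Input format and field names are assumptions for illustration.
from collections import Counter

# (pseudonym, day, region) pings, e.g. derived from call-detail records
pings = [
    ("u1", "2020-03-01", "district-A"), ("u1", "2020-03-02", "district-B"),
    ("u2", "2020-03-01", "district-A"), ("u2", "2020-03-02", "district-A"),
    ("u3", "2020-03-01", "district-C"), ("u3", "2020-03-02", "district-B"),
]

def daily_flows(observations):
    """Count how many pseudonymous users moved from region X to region Y
    between consecutive observed days."""
    last_seen = {}                       # pseudonym -> (day, region)
    flows = Counter()
    for user, day, region in sorted(observations, key=lambda p: (p[0], p[1])):
        if user in last_seen and last_seen[user][1] != region:
            flows[(last_seen[user][1], region)] += 1
        last_seen[user] = (day, region)
    return flows

if __name__ == "__main__":
    for (src, dst), n in daily_flows(pings).items():
        print(f"{src} -> {dst}: {n} movers")
```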
by tapping into these technologies, there are possibilities of enriching the available database so that, in the future, it will be easier to address emergencies of whatever nature. the most promising aspect of data computation technologies is that data from different spheres can be collated and analyzed to reach an informed conclusion. this is how bluedot made its prediction: it hosts a rich expertise base comprising meteorologists, software developers, data scientists, ecologists, geographers, epidemiologists, veterinarians, and others, allowing it to draw informed conclusions inspired by insights from these diverse backgrounds (bluedot, a, b). going further, even post-covid- , the role of computation technologies will continue, especially in reevaluating policy responses, thereby helping different stakeholders identify areas of weakness and how they could be strengthened in case of similar future major disruptive events. following the numerous technological interventions initiated to combat the covid- pandemic, it is now evident that technology, especially that related to data storage, data processing, and data sharing, is part of the backbone of the health industry. this has been given impetus by developments in the technological sphere, where much effort has been made to improve data collection methodologies, with the use of smart devices gaining traction. in regard to data storage, significant developments have been made, with research exploring whether data could be stored in dna or proteins so that, in the future, virtually unlimited amounts of data could be stored. in regard to data processing, as noted earlier, the medical field would benefit even further from upcoming advancements in ai-driven tools, machine learning technologies, and big data technologies. with these technologies, it will be possible to process vast amounts of data in real time and from diverse sources; hence, the insights and conclusions drawn from them will have far-reaching impacts. thus far, with such technologies, there is evidence that it is now possible to perform noninvasive surgeries that reduce fatalities and shorten healing time for patients, and these have proven beneficial (elrod, ), especially during this period of covid- . others, such as d printing, are gaining traction, especially in addressing complex medical issues, particularly those requiring the implantation of biomedical devices. according to javaid and haleem ( ), d printing offers the possibility of producing relatively smaller implant devices that improve patient comfort. in the case of covid- , with the available data, researchers have already estimated and predicted the various case scenarios that different countries would face, especially in terms of fatalities, recoveries, and infections, hence allowing all involved parties to prepare (giordano et al., ; tokars et al., ). for instance, in the united states, using such computation, it was estimated that the country might face an approximate death toll of and have millions affected, and already the number of infections in the country has clocked over million (cole et al., ).
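scenario estimates of the kind referred to above are often built on compartmental models. the following minimal sir sketch uses placeholder parameter values chosen only for illustration; it is not a fitted model from any of the cited studies.

```python
# Minimal SIR (susceptible-infected-recovered) sketch of epidemic scenario
# modelling. Parameter values (transmission rate, recovery rate, population)
# are placeholders for illustration, not fitted estimates.
def sir(population=1_000_000, i0=10, beta=0.3, gamma=0.1, days=180):
    s, i, r = population - i0, i0, 0
    history = []
    for day in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((day, s, i, r))
    return history

if __name__ == "__main__":
    trajectory = sir()
    peak_day, _, peak_i, _ = max(trajectory, key=lambda row: row[2])
    print(f"epidemic peak around day {peak_day} with ~{int(peak_i):,} active cases")
```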
when those technologies and many others are complemented by data computation, the medical field can benefit even further, and already the market for such devices and the number of companies investing in the medical field are increasing. in particular, with the notable achievements and recognition that data-processing startups have gained over this period of covid- following their correct predictions, the market for smart devices will continue to grow. even before the emergence of the pandemic, it had been projected that the number of internet of things (iot)-supported devices in the health industry would reach a total of billion devices by and billion devices by (digital information world, ), thus pushing the device market to us$ . billion (fortune business insight, ). mordor intelligence ( ) values the market more conservatively and argues that it will increase to a high of us$ . billion by from the us$ . billion reported in . in both predictions, however, the growth in the market is relatively high, at a compounded annual growth rate (cagr) of more than % each year. mordor intelligence ( ) credits such growth to factors such as the improvement in accuracy and connectivity that these devices have made possible in the healthcare sector, and also to the emergence of big data in healthcare, which means that whatever amount of data such devices generate can be stored and processed without fear of running out of storage. with the world focusing on statistical modeling of data gathered from the increasing covid- cases, and on solutions derived from processing that data, it follows that the solutions derived are likely to be technologically driven as well. in fact, even as governments, scientists, agencies, individuals, and other stakeholders intensify the search for a vaccine and a drug for this disease, technological processes have, as noted earlier, already managed to assist in reducing the spread of the virus. for instance, the example cited in the previous section about the use of mobile platforms in the republic of south korea to enforce quarantine is proof that technology will hold a sizable role in bringing the coronavirus menace, and future pandemics, to an end. it is worth noting that, using technology, the chinese authorities managed to identify the coronavirus genome sequence and posted it on a public database where it could be accessed by all accredited researchers (ecdc, ). within no time, labs across the world had access to this, and they managed to clone it (scott et al., ). through such public platforms, those labs shared information and data on all the experiments that failed, helping reduce repetition and helping researchers decide where to focus (ramiah, ). in other cases, through statistical modeling, organizations including the who have been able to develop dashboards that help track the spread of coronavirus and provide people with real-time updates on what is happening across the globe (who, a). similarly, using such models, especially ai-driven ones, china managed to diagnose thousands of coronavirus cases, as these systems could read through thousands of ct scans in record time with an accuracy level of over % (ramiah, ). this helped reduce time and ease the pressure on radiologists who were already overwhelmed by the fast rate at which cases were being confirmed and hospitalized.
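the ct-scan triage systems described above are typically convolutional neural networks. the sketch below shows only the general shape of such a model, written with the third-party pytorch library; the architecture, input size, and class labels are assumptions, and a deployable model would require curated labelled data, validation, and regulatory review.

```python
# Schematic sketch only: the general shape of a CNN-based CT triage model.
# Architecture, input size and class labels are assumptions for illustration.
import torch
from torch import nn

class TriageCNN(nn.Module):
    def __init__(self, num_classes: int = 2):      # e.g. "suspicious" vs "clear"
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32 * 32, num_classes),   # for 128x128 single-channel slices
        )

    def forward(self, x):
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = TriageCNN()
    fake_slice = torch.randn(1, 1, 128, 128)        # one grey-scale CT slice
    print(model(fake_slice).shape)                  # torch.Size([1, 2])
```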
an article posted on the university of copenhagen website (uoc, ) explains that, in such cases, ai can go further and predict which patients will urgently need ventilators, depending on the severity of their cases. ai-based machine learning and natural language processing have also been employed in health facilities in other countries with high success rates (wright, ), thus providing hope in the fight against the spread of the virus and, ultimately, in finding a cure for the disease. but its most promising use with respect to fighting the covid- pandemic is its ability to crawl through the data on the approved drugs already on the market and make predictions from over million possible pairs or over . billion triple-drug combinations. with ai-powered technologies, researchers predict that identifying possible drug combinations, whether pairs or triples, that could go to human trials would take only a few weeks. already, companies such as healx (earley, ), exscientia (exscientia, ), scipher medicine (wakefield, a), and others are in advanced stages of proposing possible drugs that could be repurposed and tried as cures for covid- . outside the hospital environment, as noted in the previous chapters, giant corporations such as apple, google, alibaba, tencent, and others have developed apps and data-sharing platforms, which have helped in mapping the areas where cases were spreading fastest and where people could get help and get tested. all the data collected from these platforms, and from numerous others actively used elsewhere across the globe, will retain its usefulness even post-covid- , but this will require superior statistical modeling tools to manage the increasing magnitude of such data. in this case, therefore, it will not be far-fetched to employ ai-driven technologies such as machine learning and natural language processing, as well as big data, which have already proven capable of delivering quality statistical results in real time and with high levels of reliability. during the period in which covid- has engulfed the world and brought almost everything to a standstill, the role of data processing and sharing is not only being hailed in the health sector but also seen to be critical in other spheres such as the economy, society, and the environment. in the economic sector, the emergence of the coronavirus has prompted much change, especially through restrictions on movement, the grounding of transportation, and lockdowns. first, as noted by the international labour organization (ilo, ), these measures meant to assist the health sector have transformed the way a majority of people work. a portion of the workforce has managed to continue working from home via teleworking or through other means, but a majority of the population, especially those in the informal sector, have had their routines greatly disrupted, with many around the world already filing unemployment claims. for instance, in the united states, in about weeks, from march to april, over million people filed unemployment claims (jones, ), higher than during the recession that began in december , when . million people filed similar claims (department of labor, ). such happenings have increased economic pressure on families, forcing them to seek family relief and other social support systems to see them through the period of the aforementioned restrictions. with the disruptions in the labor market, mahler et al.
( ) used data processing to show that estimates of global poverty will tilt upward, whereas, in the absence of the coronavirus, the global poverty rate this year ( ) would have followed the projected historical trend of decline. but now, with the available data from over countries on lockdowns, job losses, and the disruption of other economic areas, it appears that the pandemic will push poverty around . % higher than the rate, and . % higher than what had been predicted earlier (mahler et al., ). data processing has also been used during this period to predict that the current health crisis could lead to a recession, with some experts arguing that the recession has already started in the united states (stewart, ). the disruptions in the economic sectors have not only prompted challenges on the economic front but also raised concerns in the security sector, especially as countries have been seen to short-circuit supply chains and ban the exportation of health equipment needed by other countries. these developments have raised fears of a lack of transparency, thus affecting global collaboration in the fight against the virus. in particular, the issue of data on infections, deaths, and medical supplies has sparked political tensions between different countries, even prompting accusations against independent bodies such as the who (sevastopulo and manson, ). the need for transparency in data sharing on covid- has thus been emphasized by global organizations such as the united nations (un), the world bank, and others (the world bank, ). according to the world bank ( ), data transparency would not only help in reducing political tension and overcoming the coronavirus but is also a prerequisite for weathering the economic shocks affecting the global economy, especially by enhancing trust in governments and hence promoting investment post-covid- . in a bid to establish transparency, especially regarding the origin and outbreak of the virus, countries such as the united states have set up fact-finding committees, whereas others such as germany, sweden, and australia are considering doing the same in due course (amaro, ). the need for reverse engineering, as reported in an article posted in nature medicine (oppmann, ), is warranted by the lack of collaboration, especially from china, which the european union (eu) chief, ursula von der leyen, said needs to be involved in the investigation of the origin of the virus so that a better understanding of the origins can be reached, leading to better preparation for future pandemics (amaro, ). the first such reverse engineering was performed by researchers at the peter doherty institute for infection and immunity, in collaboration with the royal melbourne hospital and the university of melbourne, where a copy of the virus was grown in the lab from samples from an infected patient (reuters, ). participation by as many parties as possible is needed to elucidate the real origin of the virus, an issue that has raised numerous theories, with the united states claiming that it may have originated from a virology laboratory in wuhan, china (stanway, ; borger, ; law, ). china strongly rejected this theory and is also backed by the who, which warned against blaming individual countries for the virus outbreak and spread, since this would jeopardize the steps already in place to stop its spread (pérez-peña and mcneil, ).
the availability of diverse institutions, governments, and laboratories and hospitals participating in the fact finding about the coronavirus does not only offer hope and possibilities of gathering data across a diversity array of networks and regions, but also their findings would facilitate efforts of finding a cure for the virus. the identification of the genome sequencing of the virus, for instance, is a positive in the search for vaccines, and drugs, especially noting that these genome sequences are deposited in the public databases, where all researchers can access them. the same are also submitted to the "global initiative on sharing all influenza data" (gisaid) platform. as noted earlier, despite the controversies that are associated with the source of the virus, knowing the actual source would not only hasten in the development of the vaccine and cure but also help in winning back confidence of numerous stakeholders, whom, to this far, have shown dissatisfaction on how the whole issue of the pandemic has been managed. winning the confidence of everyone will help in further collaboration efforts in eliminating the virus, unlike the scenario where individual country is seen to be looking inwardly and applying their own policies, trials, and test, and treating information from other countries with suspicion. while the exploding demand for data-driven solutions at this particular period is all geared toward overcoming the spread and impacts of the coronavirus, this may spark and reignite the need for smart cities concepts, which peaked in (allam and newman, ; allam, ) . in the current dispensations, most of the digital solutions that cities across the world have been observed to be concentrating on is the health sector with the aim of containing any incidence of coronavirus, especially to prevent further spread (allam et al., ; allam and jones, ) . on this front, numerous devices and technologies, such as state-ofthe-art thermal imaging sensors, smart helmets with sensors, use of drones, robots, and mobile phone applications have been in use in this period to help in screening and providing contactless diagnosis against the virus. even postcoronavirus, such technologies will still remain relevant, as they also will be part of numerous other iot devices that are seen to be increasing, as the demand for smart solutions increases. on this, even before the emergence of coronavirus, the demand for smart cities, as expressed by mordor intelligence ( ), was growing, and this had catalyzed the demand for iot devices, which, by , were only . billion devices and had been projected to reach a high of . billion devices by , as the application of smart cities concept continued. besides this, the global market for the iot solutions had reached a high of over us$ billion by , and the projections were that it would cross us$ . trillion by (liu, ) . according to horwitz ( ) from cisco, currently, the number of iot-connected devices globally is over billion, and she also predicts that they would increase to over billion devices by , especially due to improvement in areas such as internet connectivity where many cities will have the g services by then. as those devices increase, the smart cities market will also continue to grow, and as smart cities association ( ) report showcases, it will improve from the $ billion valuation to over us$ . billion valuation by . 
on the above, though the outbreak of covid- may have somewhat halted the attention on application of the smart cities concepts that different cities had been pursuing, its management is seen to be prompting new legislation aimed at enhancing tech solutions to contain the spread, and most of these will survive post-virus. their enactment, therefore, does not only address the virus; in the future, these laws will also add to the existing ones on urban livability and, ultimately, lead to better urban and policy decisions. in particular, the policies that have been formulated to guide the restriction of movement, institute guidelines, and contain the transport sector, among others, will have a positive bearing in the future in ensuring that issues such as traffic congestion, the supply of basic services, and the provision of security are managed. this will be based on the increasing data that different cities are generating from the measures that have been put in place to contain covid- . for instance, das (philip james) explains how the university of newcastle is using smart technologies to track adherence to social distancing measures in newcastle; after analyzing the massive data (capturing over . billion individual events), the conclusion is that the sensors being used were able to give real-time data on how people were responding and also to identify areas and issues prompting bottlenecks. on this, as noted by allam and jones ( ), one of the issues that has appeared prominently in the course of containing the spread of covid- is the nationalistic approach to decision-making, where each country has been observed to look inward, with little regard for the plight of its neighbors. such an approach would be counterproductive in a smart cities concept, as the devices installed need to communicate with others within the network to ensure synchronization of data and information, hence leading to informed decisions and insights drawn after data analysis. with lockdowns, it was evident that urban livelihoods would be negatively impacted, and in no time this came to pass, with citizens in a number of cities in different countries protesting. this situation in cities is largely blamed on haphazardly formulated policies that were mechanically enacted with little consideration of the negative impact they would have on locals. in most cities, despite the high population and density, governments were seen to delay implementing measures that would allow them to manage early detection, which would eventually have helped to reduce the number of local transmissions that prompted the lockdowns. however, the blame is not all on governments, for it also took time before it was established that the virus could be transmitted from one person to another. therefore, in most cities, the lockdown came when local transmission had already spread. but while that is the case, local governments had the capacity to learn, especially by analyzing data from cities such as wuhan, which was affected first, to see how quickly cases were spreading and thus prepare effectively, especially by formulating restriction measures that are more flexible, while remaining effective, for locals. such an approach would have sufficed, as most cities are characterized by high density and high-rise buildings, where during a total lockdown people would feel trapped; grant ( ) notes that the planning of such areas leaves little or no open space where people could go out for recreation.
but with prior planning, as was observed in france, people had opportunities to walk out, albeit under very strict conditions. but while things have been complicated by lockdowns, one take-home after the covid- is the need for intelligent urban planning principles, and this could be achieved by promoting decentralization, of some services, especially those that could be done achieved remotely as advised by shenker ( ) . already, this has been happening in this era of covid- , where some people have been able to work from home via live telecommunicating. there has also been a widespread use of digital transactions such as use of mobile money transactions, which could help in decentralizing the financial services. while the future postcovid is still uncertain, there are clear indications that the technological revolutions that were brought about to address it will remain as a legacy. there will be calls, soon enough, for communities, cities, and regions will use this momentum to craft more resilient fabrics while keeping in mind societal and economic equity in the process. eca estimates billions worth of losses in africa due to covid- impact redefining the smart city: culture, metabolism and governance. case study of port louis biotechnology to render future cities as living and intelligent organisms data as the new driving gears of urbanization artificial intelligence (ai) provided early detection of the coronavirus (covid- ) in china and will influence future urban health policy internationally climate change and economic resilience through urban and cultural heritage: the case of emerging small island developing states economies on the coronavirus (covid- ) outbreak and the smart city network: universal data sharing standards coupled with artificial intelligence (ai) to benefit urban health monitoring and management redefining the smart city: culture, metabolism and governance redefining the use of big data in urban health for increased liveability in smart cities eu chief backs investigation into coronavirus origin and says china should be involved risk of hunger pandemic as covid- set to almost double acute hunger by end of apple and google partner on covid- contact tracing technology viral outbreaks involve destabilized evolutionary networks: evidence from ebola virus death tolls soar in us, italy, iran as global lockdown intensifies improved response to disasters and outbreaks by tracking population movements with mobile phone network data: a post-earthquake geospatial study in haiti coronavirus: world faces 'similar economic shocks' to china as the global lockdowns escalate better public health surveillance for infectious diseases bluedot protects people around the world from infectious diseases with human and artificial intelligence enormous evidence" coronavirus came from chinese lab how canadian ai start-up bluedot spotted coronavirus before anyone else had a clue who says coronavirus is not yet a pandemic but urges countries to prepare class action filed against china over covid- outbreak middle east respiratory syndrome (mers) the china way: use of technology to combat covid- evidence and future potential of mobile phone data for disease disaster management oil markets brace for who's global health emergency declaration us could see millions of coronavirus cases and , or more deaths this is how smart city technology can be used to tell if social distancing is working iot in healthcare expectations for healx will use ai to seek combination therapies to treat covid- event 
background covid- acute care handbook for physical therapists (fourth edition) partnerships: exscientia announces joint initiative to identify covid- drugs with diamond light source and scripps research iot in bfsi market size to reach usd . billion by ; growing utilization of iot solutions in the banking sector will support growth modelling the covid- epidemic and implementation of population-wide interventions in italy what cities can learn from lockdown about planning for life after the coronavirus pandemic the future of iot miniguide: the burgeoning iot market continues covid- impact on the collection of labour market statistics significant advancements of d printing in the field of jobless claims climb to million in six weeks as covid- layoffs continue to rise a review on zika virus outbreak, epidemiology, transmission and infection dynamics china casts itself as global savior while u.s. and eu focus on virus at home. available at coronavirus origin: few leads, many theories in hunt for source global iot market size predictability of population displacement after the haiti earthquake. proceedings of the national academy of sciences of the united states of america the impact of covid- (coronavirus) on global poverty: why sub-saharan africa might be the region hardest hit the spanish influenza pandemic: a lesson from history years after smart cities market size, share e segmented by solution (smart mobility management looming threat of covid- infection in africa: act collectively, and fast mobile nation : the g future coronavirus: the world economy at risk. oecd. oppmann p. ( ) cuba is going under lockdown over coronavirus concerns covid- : how a phone app is assisting south korea enforce self-quarantine measures now trump's scapegoat, warned about coronavirus early and often airlines around the world are suspending flights to china as the coronavirus spreads angela merkel becomes latest world leader to hint china has mislead the world over coronavirus and urges beijing to be "more transparent surge of smartphone apps promise coronavirus tracking, but raise privacy concerns ways technology is helping to fight the coronavirus australia scientists to share lab-grown coronavirus to hasten vaccine efforts australian lab first outside of china to copy coronavirus, helping vaccine push donald trump threatens to freeze funding for who cities after coronavirus: how covid- could radically alter urban life coronavirus recession looms, its course 'unrecognizable global smart cities market to reach a whopping $ . trillion by china lab rejects covid- conspiracy claims, but virus origins still a mystery coronavirus (c vid- ) impact on gdp growth in france statista. ( b) forecasted impact of coronavirus (covid- ) on gdp in italy q -q the global economic impacts of covid- the coronavirus recession is already here known global covid- deaths pass , e as it happened the global hiv/aids epidemic-progress and challenges transparency is key to weathering shocks, investing in growth, and enhancing trust in government the changing face of surveillance for health caredassociated infections world population prospects artificial intelligence to predict corona-patients' risk of needing ventilators phone apps in china track coronavirus coronavirus: ai steps up in battle against covid- coronavirus: tracking app aims for one million downloads china may adjust gdp growth target due to coronavirus imf: global economy faces worst recession since the great depression who. 
( a) coronavirus disease (covid- ): situation report e who. ( b) novel coronavirus ( -ncov): situation report- . available at emergency committee regarding the outbreak of novel coronavirus ( -ncov) who timeline -covid- the ebola outbreak of e : from coordinated multilateral action to effective disease containment, vaccine development, and beyond ai becomes an ally in the fight against covid- key: cord- -l wrrapv authors: duchêne, david a.; duchêne, sebastian; holmes, edward c.; ho, simon y.w. title: evaluating the adequacy of molecular clock models using posterior predictive simulations date: - - journal: mol biol evol doi: . /molbev/msv sha: doc_id: cord_uid: l wrrapv molecular clock models are commonly used to estimate evolutionary rates and timescales from nucleotide sequences. the goal of these models is to account for rate variation among lineages, such that they are assumed to be adequate descriptions of the processes that generated the data. a common approach for selecting a clock model for a data set of interest is to examine a set of candidates and to select the model that provides the best statistical fit. however, this can lead to unreliable estimates if all the candidate models are actually inadequate. for this reason, a method of evaluating absolute model performance is critical. we describe a method that uses posterior predictive simulations to assess the adequacy of clock models. we test the power of this approach using simulated data and find that the method is sensitive to bias in the estimates of branch lengths, which tends to occur when using underparameterized clock models. we also compare the performance of the multinomial test statistic, originally developed to assess the adequacy of substitution models, but find that it has low power in identifying the adequacy of clock models. we illustrate the performance of our method using empirical data sets from coronaviruses, simian immunodeficiency virus, killer whales, and marine turtles. our results indicate that methods of investigating model adequacy, including the one proposed here, should be routinely used in combination with traditional model selection in evolutionary studies. this will reveal whether a broader range of clock models to be considered in phylogenetic analysis. analyses of nucleotide sequences can provide a range of valuable insights into evolutionary relationships and timescales, allowing various biological questions to be addressed. the problem of inferring phylogenies and evolutionary divergence times is a statistical one, such that inferences are dependent on reliable models of the evolutionary process (felsenstein ) . bayesian methods provide a powerful framework for estimating phylogenetic trees and evolutionary rates and timescales using parameter-rich models (huelsenbeck et al. ; yang and rannala ) . model-based phylogenetic inference in a bayesian framework has several desirable properties: it is possible to include detailed descriptions of molecular evolution (dutheil et al. ; heath et al. ) ; many of the model assumptions are explicit (sullivan and joyce ) ; large parameter spaces can be explored efficiently (nylander et al. ; drummond et al. ) ; and uncertainty is naturally incorporated in the estimates. as a consequence, the number and complexity of evolutionary models for bayesian inference has grown rapidly, prompting considerable interest in methods of model selection (xie et al. ; baele et al. ) . 
evolutionary models can provide useful insight into biological processes, but they are incomplete representations of molecular evolution (goldman ) . this can be problematic in phylogenetic inference when all the available models are poor descriptions of the process that generated the data (gatesy ) . traditional methods of model selection do not allow the rejection, or falsification, of every model in the set of candidates being considered. gelman and shalizi ( ) recently referred to this as a critical weakness in current practice of bayesian statistics. a different approach to model selection is to evaluate the adequacy, or plausibility (following brown a), of the model. this involves testing whether the data could have been generated by the model in question (gelman et al. ). assessment of model adequacy is a critical step in bayesian inference in general (gelman and shalizi ) , and phylogenetics in particular (brown a) . one method of evaluating the adequacy of a model is to use posterior predictive checks (gelman et al. ) . among the first of such methods in phylogenetics was the use of posterior predictive simulations, proposed by bollback ( ) . the first step in this approach is to conduct a bayesian phylogenetic analysis of the empirical data. the second step is to use simulation to generate data sets with the same size as the empirical data, using the values of model parameters sampled from the posterior distribution obtained in the first step. the data generated via these posterior predictive simulations are considered to represent hypothetical alternative or future data sets, but generated by the model used for inference. if the process that generated the empirical data can be described with the model used for inference, the posterior predictive data sets should resemble the empirical data set (gelman et al. ) . therefore, the third step in assessing model adequacy is to perform a comparison between the posterior predictive data and the empirical data. this comparison must be done using a test statistic that quantifies the discrepancies between the posterior predictive data and the empirical data (gelman and meng ) . the test statistic is calculated for each of the posterior predictive data sets to generate a distribution of values. if the test statistic calculated from the empirical data falls outside this distribution of the posterior predictive values, the model in question is considered to be inadequate. previous studies using posterior predictive checks of nucleotide substitution models have implemented a number of different test statistics. some of these provide descriptions of the sequence alignments, such as the homogeneity of base composition (huelsenbeck et al. ; foster ) , site frequency patterns (bollback ; lewis et al. ) , and unequal synonymous versus nonsynonymous substitution rates (nielsen ; rodrigue et al. ). brown ( b) and reid et al. ( ) introduced test statistics based on phylogenetic inferences from posterior predictive data sets. some of the characteristics of inferred phylogenies that can be used as test statistics include the mean tree length and the median robinson-foulds distance between the sampled topologies in the analysis (brown b) . although several test statistics are available for assessing models of nucleotide substitution (brown and eldabaje ; brown a; lewis et al. ) , there are no methods available to assess the adequacy of molecular clock models. 
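the three-step posterior predictive procedure outlined above (fit the model, simulate replicate data sets from the posterior, and compare a test statistic between the empirical and simulated data) can be expressed generically. the sketch below is an illustrative python helper, not code from the study (the original analyses used r and beast); the test-statistic function and the tail-probability convention are assumptions chosen for illustration only.

```python
import numpy as np

def posterior_predictive_check(test_statistic, empirical_data, simulated_data_sets):
    """Generic posterior predictive check.

    test_statistic: function mapping a data set to a single number.
    simulated_data_sets: iterable of data sets simulated from the posterior.
    Returns the empirical statistic, the simulated distribution, and a
    two-tailed posterior predictive p value (one of several possible conventions).
    """
    observed = test_statistic(empirical_data)
    simulated = np.array([test_statistic(d) for d in simulated_data_sets])
    upper = np.mean(simulated >= observed)
    lower = np.mean(simulated <= observed)
    p_value = 2.0 * min(upper, lower)  # two-tailed convention (assumption)
    return observed, simulated, min(p_value, 1.0)
```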
molecular clocks have become an established tool in evolutionary biology, allowing the study of molecular evolutionary rates and divergence times between organisms (kumar ; ho ). molecular clock models describe the pattern of evolutionary rates among lineages, relying on external temporal information (e.g., fossil data) to calibrate estimates of absolute rates and times. the primary differences among the various clock models include the number of distinct substitution rates across the tree and the degree to which rates are treated as a heritable trait (thorne et al. ; drummond et al. ; drummond and suchard ; for a review see ho and duchêne ) . for example, the strict clock assumes that the rate is the same for all branches, whereas some relaxed clock models allow each branch to have a different rate. we refer to models that assume a large number of rates as being more parameter rich than models with a small number of rates . although molecular clock models are used routinely, the methods of assessing their efficacy are restricted to estimating and comparing their statistical fit. for example, a common means of model selection is to compare marginal likelihoods in a bayesian framework (baele et al. ). however, model selection can only evaluate the relative statistical fit of the models, such that it can lead to false confidence in the estimates if all the candidate models are actually inadequate. in this study, we introduce a method for assessing the adequacy of molecular clock models. using simulated and empirical data, we show that our approach is sensitive to underparameterization of the clock model, and that it can be used to identify the branches of the tree that are in conflict with the assumed clock model. in practice, our method is also sensitive to other aspects of the hierarchical model, such as misspecification of the node-age priors. we highlight the importance of methods of evaluating the adequacy of substitution models in molecular clock analyses. to evaluate the adequacy of molecular clock models, we propose a method of generating and analyzing posterior predictive data. in this method, the posterior predictive data sets are generated using phylogenetic trees inferred from branchspecific rates and times from the posterior samples ( fig. ). because this method uses branch-specific estimates, it requires a fixed tree topology. the first step in our method is to conduct a bayesian molecular clock analysis of empirical data. we assume that this analysis obtains samples from the posterior distribution of branch-specific rates and times. these estimates are given in relative time, or in absolute time if calibration priors are used. in the second step, we take a random subset of these samples. for each of these samples, we multiply the branch-specific rates and times to produce phylogenetic trees in which the branch lengths are measured in substitutions per site (subs/ site), known as phylograms. to assess model adequacy, we randomly select samples from the posterior, excluding the burn-in. from these samples, posterior predictive data sets are generated by simulation along the phylograms and using the estimates of the parameters in the nucleotide substitution model. the third step in our approach is to use a clock-free method to estimate a phylogram from each of the posterior predictive data sets and from the empirical data set. for this step, we find that the maximum likelihood approach implemented in phangorn (schliep ) is effective. 
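the second step above (multiplying branch-specific rates by branch durations from each sampled posterior state to obtain phylograms in substitutions per site) reduces to simple element-wise arithmetic once the posterior samples are in tabular form. the following python sketch assumes two arrays indexed as [posterior sample, branch]; it is illustrative only, since the study itself worked with beast output and r packages, and the array names and subset size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_predictive_phylograms(branch_rates, branch_times, n_draws=100):
    """branch_rates, branch_times: arrays of shape (n_posterior_samples, n_branches),
    holding branch-specific rates (subs/site per unit time) and durations (time) for
    a fixed tree topology. Returns branch lengths in subs/site for a random subset
    of posterior samples; each row defines one phylogram along which a replicate
    alignment is then simulated."""
    branch_rates = np.asarray(branch_rates)
    branch_times = np.asarray(branch_times)
    idx = rng.choice(branch_rates.shape[0], size=n_draws, replace=False)
    return branch_rates[idx] * branch_times[idx]  # element-wise rate x time
```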
to compute our adequacy index, we consider the branch lengths estimated from the posterior predictive data sets under a clock-free method, such that there is a distribution of length estimates for each branch. we calculate a posterior predictive p value for each branch using the corresponding distribution obtained with the posterior predictive data sets. this value is important for identifying the length estimates for individual branches that are in conflict with the clock model. our index for overall assessment is the proportion of branches in the phylogram from the empirical data that have lengths falling outside the % quantile range of those estimated from the posterior predictive data sets. we refer to our index as a, or overall plausibility of branch length estimates. we also provide a measure of the extent to which the branch length estimates from the clock-free method differ from those obtained using the posterior predictive simulations. to do this, we calculate for each branch the absolute difference between the empirical branch length estimated using a clock-free method and the mean branch length estimated from the posterior predictive data. we then divide this value by the empirical branch length estimated using a clockfree method. this measure corresponds to the deviation of posterior predictive branch lengths from the branch length estimated from the empirical data. for simulations and analyses of empirical data, we present the median value across branches to avoid the effect of extreme values. we refer to this measure as "branch length deviation," of which low values represent high performance. we also investigated the uncertainty in the estimates of posterior predictive branch lengths. this is useful because it provides insight into the combined uncertainty in estimates of rates and times. the method we used was to take the width of the % quantile range from the posterior predictive data sets, divided by the mean length estimated for each branch. this value, along with the width of the % credible interval of the rate estimate from the original analysis, can then be compared among clock models to investigate the increase in uncertainty that can occur when using complex models. we first evaluated the accuracy and uncertainty of substitution rate estimates from simulated data. to do this, we compared the values used to generate the data with those estimated using each of three clock models: strict clock, random local clocks (drummond and suchard ) , and the uncorrelated lognormal relaxed clock (drummond et al. ) . we regarded the branch-specific rates as accurate when the rate used for the simulation was contained within the % credible interval. we found that rate estimates were frequently inaccurate under five circumstances: clock model underparameterization; rate autocorrelation among branches (kishino et al. ) ; uncorrelated beta-distributed rate variation among lineages; misleading node-age priors (i.e., node calibrations that differ considerably from the true node ages); and when data were generated under a strict clock but analyzed with an underparameterized substitution model ( fig. a ). when analyses were performed using the correct or an overparameterized clock model, more than % of branch rates were accurately estimated, such that the true value was contained within the % credible interval ( fig. a) . 
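before turning to further results, the summaries defined earlier in this section (the per-branch posterior predictive p value, the a index, the branch length deviation, and the uncertainty of the posterior predictive branch lengths) can be computed directly from the matrix of clock-free branch length estimates. the python sketch below is an illustration under assumed array names and conventions, not the authors' implementation (example code for the study itself is in r, in the github repository cited in the methods).

```python
import numpy as np

def clock_adequacy_summaries(pp_lengths, empirical_lengths):
    """pp_lengths: array (n_pp_data_sets, n_branches) of branch lengths re-estimated
    from posterior predictive alignments with a clock-free method.
    empirical_lengths: array (n_branches,) of clock-free estimates from the real data."""
    pp = np.asarray(pp_lengths)
    emp = np.asarray(empirical_lengths)

    # per-branch posterior predictive p values (two-tailed convention; an assumption)
    upper = (pp >= emp).mean(axis=0)
    lower = (pp <= emp).mean(axis=0)
    branch_p = np.minimum(2.0 * np.minimum(upper, lower), 1.0)

    # a index: proportion of branches whose empirical length falls inside the
    # central 95% quantile range of the posterior predictive lengths
    lo, hi = np.percentile(pp, [2.5, 97.5], axis=0)
    a_index = np.mean((emp >= lo) & (emp <= hi))

    # branch length deviation: median across branches of |empirical - mean(pp)|,
    # scaled by the empirical branch length
    deviation = np.median(np.abs(emp - pp.mean(axis=0)) / emp)

    # uncertainty: width of the 95% quantile range divided by the mean posterior
    # predictive length, summarised here by its median across branches (assumption)
    uncertainty = np.median((hi - lo) / pp.mean(axis=0))

    return a_index, deviation, uncertainty, branch_p
```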
in most simulation schemes, the uncorrelated lognormal relaxed clock had high accuracy, at the expense of a small increase in the uncertainty compared with the other models (fig. b). these results are broadly similar to those of drummond et al. ( ), who also found that underparameterization of the clock model resulted in low accuracy in rate estimates, whereas overparameterization had a negligible effect on accuracy. we analyzed data generated by simulation to test our method of assessing the adequacy of molecular clock models. the a index was approximately proportional to the branch length deviation (fig. a). we found a to be ≥ . (indicating high performance) when the model used in the analyses matched that used to generate the data, or when it was overparameterized. when the assumed model was underparameterized, a was . . the uncertainty obtained using posterior predictive branch lengths was sensitive to the rate variance in the simulations. for this reason, estimates from data generated according to a strict clock or an uncorrelated lognormal relaxed clock had lower uncertainty than estimates from data generated under local clocks, regardless of the model used for analysis (fig. b). estimates made using the uncorrelated lognormal relaxed clock had a larger variance in three analysis schemes: when data were generated with autocorrelated rates across branches; when data were generated with beta-distributed rates across branches; and when there was a misleading prior for the node ages. for analyses with substitution model underparameterization, our method incorrectly provided greater support for the more complex clock model, indicating that rate variation among lineages was overestimated (fig. ).
fig. . the top right box shows the first step in assessing model adequacy using pps. in our analyses, this step is performed using branch-specific rates and times. the bottom box shows our procedure for testing the clock model, which is based on the clock-free posterior predictive distribution of the length of each branch. the thin arrows indicate that the test statistic is the posterior predictive p value for each branch. pps, posterior predictive simulations.
we used our simulated data and posterior predictive simulations to investigate the performance of the multinomial test statistic for evaluating the adequacy of molecular clock models. this test statistic was originally designed to assess models of nucleotide substitution (goldman ; bollback ) and can perform well compared with some of the other existing test statistics (brown b). the multinomial test statistic for the empirical alignment can be compared with the distribution of test statistics from posterior predictive data sets to produce a posterior predictive p value. we find that the multinomial test statistic correctly identified when the substitution model was matched or underparameterized (fig. ). the multinomial likelihood did not have the power to detect clock model adequacy, but it was sensitive to rate variation among lineages, primarily from the simulation involving autocorrelated rates and when the node-age prior was misleading (fig. ).
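the multinomial test statistic discussed above is the maximised log likelihood of the alignment under an unconstrained multinomial model of site-pattern frequencies (goldman ; bollback ). a minimal python sketch is given below; it takes an alignment as a list of equal-length sequence strings and is intended only to illustrate the calculation, with the tail-probability convention an assumption rather than the published protocol.

```python
import math
from collections import Counter

def multinomial_statistic(alignment):
    """alignment: list of equal-length sequence strings (rows = taxa).
    Returns sum_i n_i*ln(n_i) - N*ln(N), where n_i are site-pattern counts
    and N is the number of sites."""
    n_sites = len(alignment[0])
    patterns = Counter(tuple(seq[j] for seq in alignment) for j in range(n_sites))
    return sum(c * math.log(c) for c in patterns.values()) - n_sites * math.log(n_sites)

def multinomial_ppp(empirical_alignment, pp_alignments):
    """Posterior predictive p value: proportion of simulated statistics at least
    as small as the empirical one (one possible convention)."""
    observed = multinomial_statistic(empirical_alignment)
    simulated = [multinomial_statistic(a) for a in pp_alignments]
    return sum(s <= observed for s in simulated) / len(simulated)
```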
fig. . mean values of (a) accuracy and (b) uncertainty of branch rate estimates from molecular clock analyses of simulated data. each cell shows the results of replicate analyses. accuracy is measured as the proportion of data sets for which the rate used for simulation was contained in the % credible interval of the estimate. darker shades in (a) represent high accuracy. uncertainty is measured as the width of the % credible interval as a proportion of the mean rate. dark shades in (b) represent small ranges in branch length estimates, and therefore low uncertainty. the initials stand for each of the schemes for estimation or simulation. sc, strict clock; loc, local clock; ucl, uncorrelated lognormal relaxed clock; rlc, random local clock; acl, autocorrelated relaxed clock; bim, beta-distributed bimodal clock; pri, misleading node-age prior; gtrg, data simulated under the parameter-rich general time-reversible substitution model with among-site rate heterogeneity.
fig. . mean values of (a) plausibility, a, and (b) uncertainty as described by the posterior predictive simulations from clock analyses of simulated data. each cell shows the results of replicate analyses. values in parentheses are the branch length deviations, of which lower values indicate good performance. the darker shades represent higher values of a and less uncertainty. high values of a represent good performance. in the case of uncertainty, small values indicate small ranges in posterior predictive branch lengths, and therefore low uncertainty. the initials stand for each of the schemes for estimation or simulation. sc, strict clock; loc, local clock; ucl, uncorrelated lognormal relaxed clock; rlc, random local clock; acl, autocorrelated relaxed clock; bim, beta-distributed bimodal clock; pri, misleading node-age prior; gtrg, data simulated under the parameter-rich general time-reversible substitution model with among-site rate heterogeneity.
we used three clock models, as in our analyses of simulated data, to analyze a broad range of nucleotide sequence data sets: the m (matrix) gene of a set of coronaviruses; the gag gene of simian immunodeficiency virus (siv; wertheim and worobey ); complete mitochondrial genomes of killer whales orcinus orca (morin et al. ); and mitochondrial protein-coding genes of marine turtles (duchene et al. ). the uncorrelated lognormal relaxed clock was the best-fitting clock model according to the marginal likelihood for the coronaviruses, siv, and the killer whales (table ). for the marine turtles, the random local clock provided the best fit. in all the analyses of empirical data sets, the uncorrelated lognormal relaxed clock had the best performance according to our a index. the highest a index was . for the siv and the killer whales, and the lowest uncertainty in posterior predictive branch lengths was . for the killer whales. the uncertainty for all other data sets was above , indicating that it was larger than the mean of the posterior predictive branch lengths. we calculated the multinomial test statistic for the empirical data sets using the posterior predictive data from a clock model analysis, as well as under a clock-free method. the multinomial test statistic from both methods suggested that the substitution model was inadequate for the siv and the marine turtles, with posterior predictive p values below . . the substitution model was identified as inadequate for the coronavirus data set by the multinomial test statistic estimated using posterior predictive data sets from a clock analysis (p < . ); however, it was identified as adequate when using a clock-free method (p = . ).
the mitochondrial data set from killer whales represented the only case in which the substitution model was adequate according to both multinomial likelihood estimates. for the data sets from coronaviruses and killer whales, the clock models with the highest performance had a indices of . and . , respectively (table ). these indices are substantially lower than those obtained in analyses of simulated data when the clock model used for simulation and estimation was matched. however, we evaluated the posterior predictive p values for all branches in these empirical data sets and found that at least two-thirds of the incorrect estimates correspond to relatively short terminal branches (supplementary information, supplementary material online). the branch length deviation in the empirical data ranged between . for the uncorrelated lognormal relaxed clock in the turtle data and . for the killer whale data analyzed with a strict clock (table ) . low values for this metric indicate small differences between the posterior predictive and the empirical branch lengths. although scores for this metric varied considerably between data sets, they were closely associated with the a indices for the different models for each data set individually. for example, in every empirical data set, the lowest branch length deviation was achieved by the model with the highest a index (indicative of higher performance). importantly, the branch length deviation was not directly comparable with the a index between data sets. mbe this is probably because the posterior predictive branch lengths have different amounts of uncertainty. in particular, the a index will tend to be low if the posterior predictive branch length estimates are similar to the empirical value but have low uncertainty. this would create a scenario with a small branch length deviation but also a low a index. this appears to be the case for the coronaviruses, for which all the clock models appear inadequate according to the a index, but with the uncorrelated lognormal relaxed clock having a small branch length deviation. assessing the adequacy of models in phylogenetics is an important process that can provide information beyond that offered by traditional methods for model selection. although traditional model selection can be used to evaluate the relative statistical fit of a set of candidates, model adequacy provides information about the absolute performance of the model, such that even the best-fitting model can be a poor predictor of the data (gelman et al. ). there have been important developments in model adequacy methods and test statistics in the context of substitution models (ripplinger and sullivan ; brown b; lewis et al. ) and estimates of gene trees (reid et al. ). here we have described a method that can be used for assessment of molecular clock models, and which should be used in combination with approaches for evaluating the adequacy of substitution models. the results of our analyses suggest that our method is able detect whether estimates of branchspecific rates and times are consistent with the expected number of substitutions along each branch. for example, in the coronavirus data set analyzed here, the best-fitting clock model was a poor predictor of the data, as was the substitution model. our index is sensitive to underparameterization of clock models and has the benefit of being computationally efficient. 
in addition, our metric of uncertainty in posterior predictive branch lengths is sensitive to some cases of misspecification of clock models and node-age priors, but not to substitution model misspecification, as shown for our analyses of the coronavirus data set. analyses based on the random local clock and the data simulated under two local clocks generally produced low accuracy ( fig. a) , with lower a indices than the other models that were matched to the true model ( fig. a) . the substandard performance of the random local clock when it is matched to the true model is surprising. a possible explanation is that our simulations of the local clock represented an extreme scenario in which the rates of the local clocks differed by an order of magnitude. previous studies based on simulations and empirical data demonstrated that this model can be effective when the rate differences are smaller (drummond and suchard ; dornburg et al. ) . in our analyses of empirical data, even the highest values of our index were lower than the minimum value obtained in our analyses of simulated data when the three models matched those used for simulation. this is consistent with the results of previous studies of posterior predictive simulations, which have suggested that the proposed threshold for a test statistic using simulations is conservative for empirical data (bollback ; ripplinger and sullivan ; brown b) . it is difficult to suggest a specific threshold for our index to determine whether a model is inadequate. however, the interpretation is straightforward: a low a index indicates that a large proportion of branch rates and times are inconsistent with the expected number of substitutions along the branches. under ideal conditions, an a index of . or higher means that the clock model accurately describes the true pattern of rate variation. however, our method allows the user to inspect the particular branches with inconsistent estimates, which can be useful for identifying regions of the tree that cause the clock model to be inadequate. measuring the effect size of differences in the branch length estimates of the posterior predictive and empirical data can also be useful for quantifying potential errors in the estimates of node times and branch-specific rates. an important finding of our study is that overparameterized clock models typically have higher accuracy than those that are underparameterized. this is consistent with a statistical phenomenon known as the bias-variance trade-off, with underparameterization leading to high bias, and overparameterization leading to high uncertainty. this was demonstrated for molecular clock models by . although our results show a bias when the model is underparameterized, we did not detect high uncertainty with increasing model complexity. this probably occurs because the models used here are not severely overparameterized. this is consistent with the fact that bayesian analyses are robust to mild overparameterization because estimates are integrated over the uncertainty in additional parameters (huelsenbeck and rannala ; lemmon and moriarty ) . we note that our index is insensitive to the overparameterization in our analyses. this problem is also present in some adequacy statistics for substitution models (bollback ; ripplinger and sullivan ) . identifying an overparameterized model is challenging, but a recent study proposed a method to do this for substitution models (lewis et al. ). an equivalent implementation for clock models would also be valuable. 
another potential solution is to select a pool of adequate models and to perform model selection using methods that penalize an excess of parameters, such as marginal likelihoods or information criteria. we find that our assessment of clock model adequacy can be influenced by other components of the analysis. for example, multiple calibrations can create a misleading node-age prior that is in conflict with the clock model (warnock et al. ; duchêne et al. ; heled and drummond ) . although our simulations with misleading node calibrations were done using a strict clock, our method identified this scenario as clock model inadequacy when the models for estimation were the strict or random local clocks ( fig. a) . in the case of the uncorrelated lognormal relaxed clock, our method identified a misleading node-age prior as causing an increase in uncertainty ( fig. b ). this highlights the critical importance of selecting and using time calibrations appropriately, and we refer the reader to the comprehensive reviews of assessing the adequacy of clock models . doi: . /molbev/msv mbe this topic (benton and donoghue ; ho and phillips ) . another component of the analysis that can have an impact on the adequacy of the clock model is the tree prior, which can influence the estimates of branch lengths. although one study suggested that the effect of the tree prior is not substantial (lepage et al. ), its influence on divergence-time estimates remains largely unknown. we found that substitution model underparameterization led to a severe reduction in accuracy. overconfidence in incorrect branch lengths in terms of substitutions can cause bias in divergence-time estimates (cutler ) . however, this form of model inadequacy is incorrectly identified by the methods we used for estimation as a form of rate variation among lineages. for our data generated using a strict clock and an underparameterized substitution model, the a index rejected the strict clock and supported the overparameterized uncorrelated lognormal relaxed clock. on the other hand, the multinomial test statistic was sensitive to substitution model underparameterization, and to some forms of rate variation among lineages. the sensitivity of the multinomial likelihood to rate variation among lineages might explain why the substitution model was rejected for the coronavirus data set when using a clock model, but not when using a clock-free method. due to this sensitivity and the substantial impact of substitution model misspecification, we recommend the use of a clock-free method to assess the substitution model before performing analyses using a clock model. our results suggest that it is only advisable to perform a clock model analysis when an adequate substitution model is available. other methods for substitution model assessment that are less conservative than the multinomial likelihood represent an interesting area for further research. we find that the a index is sensitive to patterns of rate variation among lineages that conflict with the clock model used for estimation. this is highlighted in the simulations of rate variation among lineages under autocorrelated and the unusual beta-distributed rates. in these cases, the a index identified the uncorrelated lognormal clock as the only adequate clock model, despite an increase in uncertainty in both cases. although other studies have also suggested that the uncorrelated lognormal relaxed clock can account for rate autocorrelation (drummond et al. ; ho et al. 
) , an increase in uncertainty can impair the interpretation of divergence-time estimates. we suggest caution when the uncertainty values are above , which occurs when the widths of the % credible intervals are greater than the mean parameter estimates. in our analyses of the two virus data sets, the multinomial test statistic suggested that the best-fitting substitution model was inadequate. in the analyses of the siv data, our index of clock model adequacy was . , similar to that of killer whales, for which the substitution model appeared adequate. we recommend caution when interpreting estimates of evolutionary rates and timescales when the substitution model is inadequate. this typically suggests that the substitution process is not being modeled correctly, which can affect inferences of branch lengths regardless of whether a clock model is used or not. for this reason, the a index of . for the siv data set might be overconfident compared with the same index obtained for the killer whale data. previous research has also suggested that there are processes in the evolution of siv that are not accounted for by current evolutionary models . we also found that all the clock models were inadequate for the coronavirus sequence data. our results might provide an explanation for the lack of consensus over the evolutionary timescale of these viruses. for example, a study of mammalian and avian coronaviruses estimated that these viruses originated at most , years ago (woo et al. ). this result stands in contrast with a subsequent study that suggested a much deeper origin of these viruses, in the order of millions of years (wertheim et al. ) . our results suggest that estimating the timescale of these viruses might not be feasible with the current clock models. our analysis of mitochondrial genomes of killer whales shows that even if the clock model performance is not as high as that obtained in the simulations that match the models used for estimation, a large proportion of the divergence-time estimates can still be useful. examining the estimates of specific branch lengths can indicate whether many of the node-age estimates are reliable, or whether important branches provide unreliable estimates. we recommend this practice when the substitution model has been deemed adequate and when a substantial proportion of the branch lengths are consistent with the clock model (i.e., when the a index is high). we note that the mitochondrial genomes of killer whales have the lowest a index of any data set when analyzed using a random local clock. this might occur because the model identified an average of - rate changes along the tree ( . rate changes; table ). although rate variation is likely to be higher in this data set, it might not be sufficiently high for the model to detect it. analyses of mitochondrial protein-coding genes from marine turtles identified the substitution model as inadequate using the multinomial test statistic. the clock model with the highest performance had an a index of . , which might be considered sufficient to interpret the divergencetime estimates for at least some portions of the tree. again, the fact that the substitution model is inadequate precludes further interpretation of the estimates of evolutionary rates and timescales. this is a surprising result for a mitochondrial data set with several internal-node calibrations. 
a potential solution is to assess substitution-model adequacy for individual genes and to conduct the molecular clock analysis using only those genes for which an adequate substitution model is available. we believe that, with the advent of genomic data sets, this will become a feasible strategy in the near future. some of the reasons for the paucity of studies that assess model adequacy in phylogenetics include computational demand and the lack of available methods. in this study, we have presented a method of evaluating clock model adequacy, using a simple test statistic that can be computed efficiently. assessment of clock model adequacy is an important complement to traditional methods of model selection for two primary reasons: it allows the researcher to reject all the available models if they are inadequate; and, as implemented in this study, it can be used to identify the branches with length estimates that are implausible under the assumed model. the results of our analyses of empirical data underscore the importance of evaluating the adequacy of the substitution and clock models. in some cases, several models might be adequate, particularly when they are overparameterized. in this respect, methods for traditional model selection are important tools because they can be used to select a single best-fitting model from a set of adequate models. further research into methods, test statistics, and software for evaluating model adequacy is needed, both to improve the existing models and to identify data sets that will consistently provide unreliable estimates.
we generated pure-birth trees with tips and root-node ages of my using beast v . (bouckaert et al. ). we then simulated branch-specific rates under five clock model treatments using the r package nelsi (ho et al. ). this program simulates rates under a given model and multiplies rates by time to produce phylogenetic trees in which the branch lengths represent subs/site, known as phylograms. these phylograms were then used to simulate the evolution of dna sequences of , nt in the r package phangorn. the five clock model treatments included the following: 1) a strict clock with a rate of × 10− subs/site/my; 2) an uncorrelated lognormal relaxed clock (drummond et al. ), with a mean rate of × 10− subs/site/my and a standard deviation of . ; 3) a treatment in which a randomly selected clade with at least ten tips experienced an increase in the rate, representing a scenario with two local clocks (yoder and yang ), with rates of × 10− and × 10− subs/site/my; 4) a treatment with rate autocorrelation, with an initial rate of × 10− subs/site/my and a parameter of . (kishino et al. ); and 5) a treatment with rate variation that followed a beta distribution with equal shape parameters of . and centered at × 10− subs/site/my, resulting in a bimodal shape. in every simulation, the mean rate was × 10− subs/site/my, which is approximately the mean mitochondrial evolutionary rate in mammals, birds, nonavian reptiles, and amphibians (pereira and baker ). we selected this mean rate instead of sampling from the prior because our estimation methods involved an uninformative rate prior, and random samples from this can produce data sets with high sequence saturation or with low information content. we used the jukes-cantor substitution model for simulation (jukes and cantor ). this model allows us to avoid making arbitrary parameterizations of more parameter-rich models, which is not the focus of this study.
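the five rate treatments listed above can be mimicked with a few lines of python. the sketch below is a rough illustration of the kinds of draws involved (the study itself used the r package nelsi); the lognormal parameterisation, the beta scaling, the shape value of 0.5, and the autocorrelation formulation are simplified assumptions rather than the published settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def strict_clock(n_branches, rate):
    # one rate shared by every branch
    return np.full(n_branches, rate)

def uncorrelated_lognormal(n_branches, mean_rate, sd_log):
    # independent lognormal rates; mu chosen so the expected rate equals mean_rate (assumption)
    mu = np.log(mean_rate) - 0.5 * sd_log ** 2
    return rng.lognormal(mu, sd_log, n_branches)

def two_local_clocks(in_clade, background_rate, clade_rate):
    # in_clade: boolean array marking branches inside the randomly chosen clade
    return np.where(in_clade, clade_rate, background_rate)

def autocorrelated(parent_index, branch_times, root_rate, nu):
    # simplified Kishino-et-al.-style autocorrelation: the log rate of a branch is
    # drawn around the log rate of its parent with variance nu * branch duration;
    # assumes branches are ordered so parents come before their children, and
    # parent_index[i] < 0 for branches attached to the root
    rates = np.empty(len(parent_index))
    for i, p in enumerate(parent_index):
        anchor = root_rate if p < 0 else rates[p]
        rates[i] = np.exp(rng.normal(np.log(anchor), np.sqrt(nu * branch_times[i])))
    return rates

def beta_bimodal(n_branches, mean_rate, shape=0.5):
    # equal-shape beta draws are u-shaped (bimodal); rescaled so the mean equals mean_rate
    # (shape=0.5 is an assumed value; the shape parameter in the text is elided)
    return 2.0 * mean_rate * rng.beta(shape, shape, n_branches)
```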
to explore the effect of substitution model underparameterization, we simulated additional data sets under a strict clock and a general time-reversible model with gammadistributed rates among sites, using parameters from empirical data (murphy et al. ) . we analyzed these data sets using the same method as for the rest of the simulated data, including the use of the simpler jukes-cantor substitution model. we also explored the effect of using misleading node-age priors. to do this, we placed two time calibrations with incorrect ages. one calibration was placed in one of the two nodes descending from the root selected at random, with an age prior of . times its true age (i.e., younger than the truth). the other calibration was placed on the most recent node in the other clade descending from the root, with an age of . of the root age (i.e., older than the truth). for this scenario, we only used trees with more than one descendant in each of the two oldest clades. we show an example of the simulated phylogeny compared with this kind of marginal prior on node ages in the supplementary information, supplementary material online. our study had simulated data sets for each simulation treatment, for a total of simulated alignments. we analyzed the simulated alignments using bayesian markov chain monte carlo (mcmc) sampling as implemented in beast. we used three different clock models to analyze each of the simulated alignments: the strict clock, uncorrelated lognormal relaxed clock (drummond et al. ) , and random local clock (drummond and suchard ) . we used the same tree prior and substitution model for estimation as those used for simulation. we fixed the age of the root to my and fixed the tree topology to that used to simulate sequence evolution in every analysis. we analyzed the simulated data with an mcmc chain length of  steps, with samples drawn from the posterior every  steps. we discarded the first % of the samples as burn-in, and assessed satisfactory sampling from the posterior by verifying that effective sample sizes for all parameters were above using the r package coda (plummer et al. ) . we performed analyses using each of the three clock models for each of the simulated data sets, for a total of clock analyses. we assessed the accuracy and uncertainty of the estimates made using each of the analysis schemes ( fig. ) . to do this, we compared the simulated rates with the branch-specific rates in the posterior. next, we tested the power of our method for assessing clock model adequacy using the simulated data under each of the scenarios of simulation and analysis. we provide example code and results in a public repository in github (https://github.com/duchene/modadclocks, last accessed july , ). we also tested the power of the multinomial test statistic to assess clock model adequacy in each of the analyses. this test statistic quantifies the frequency of site patterns in an alignment and is appropriate for testing the adequacy of models of nucleotide substitution (bollback ; brown b ). we used four published data sets to investigate the performance of our method of assessing clock model adequacy in empirical data. for each data set, we performed analyses in beast using each of the three clock models used to analyze the simulated data sets. to select the substitution model for each empirical data set, we used the bayesian information criterion as calculated in the r package phangorn. 
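the convergence check described above (requiring the effective sample size of every parameter to exceed a threshold, computed with the r package coda) can be approximated with a short python routine. the sketch below uses a simple truncated-autocorrelation estimator; coda's effectiveSize relies on a spectral-density estimate, so the numbers will not match exactly, and the threshold of 200 in the usage comment is a common rule of thumb rather than the (elided) value used in the study.

```python
import numpy as np

def effective_sample_size(chain):
    """Crude ESS estimate: N / (1 + 2 * sum of positive-lag autocorrelations).
    Intended only to illustrate the check described in the text."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    if acov[0] == 0:
        return float(n)
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:  # truncate at the first non-positive autocorrelation
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

# usage: flag parameters whose ESS falls below a chosen threshold (e.g. 200)
# samples = {"clock.rate": np.loadtxt("clock_rate_trace.txt")}  # hypothetical trace file
# low_ess = {k: v for k, v in samples.items() if effective_sample_size(v) < 200}
```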
accurate model selection of relaxed molecular clocks in bayesian phylogenetics
paleontological evidence to date the tree of life
bayesian model adequacy and choice in phylogenetics
beast : a software platform for bayesian evolutionary analysis
predictive approaches to assessing the fit of evolutionary models
detection of implausible phylogenetic inferences using posterior predictive assessment of model fit
puma: bayesian analysis of partitioned (and unpartitioned) model adequacy
estimating divergence times in the presence of an overdispersed molecular clock
relaxed clocks and inferences of heterogeneous patterns of nucleotide substitution and divergence time estimates across whales and dolphins (mammalia: cetacea)
relaxed phylogenetics and dating with confidence
bayesian coalescent inference of past population dynamics from molecular sequences
bayesian random local clocks, or one rate to rule them all
marine turtle mitogenome phylogenetics and evolution
the impact of calibration and clock-model choice on molecular estimates of divergence times
efficient selection of branch-specific models of sequence evolution
statistical inference of phylogenies
a tenth crucial question regarding model use in phylogenetics
bayesian data analysis
model checking and model improvement
simulating normalizing constants: from importance sampling to bridge sampling to path sampling
philosophy and the practice of bayesian statistics
statistical tests of models of dna substitution
a dirichlet process prior for estimating lineage-specific substitution rates
calibrated birth-death phylogenetic timetree priors for bayesian inference
the changing face of the molecular evolutionary clock
molecular-clock methods for estimating evolutionary rates and timescales
simulating and detecting autocorrelation of molecular evolutionary rates among lineages
accounting for calibration uncertainty in phylogenetic estimation of evolutionary divergence times
frequentist properties of bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models
bayesian inference of phylogeny and its impact on evolutionary biology
evolution of protein molecules
performance of a divergence time estimation method under a probabilistic model of rate evolution
molecular clocks: four decades of evolution
computing bayes factors using thermodynamic integration
the importance of proper model assumption in bayesian phylogenetics
a general comparison of relaxed molecular clock models
posterior predictive bayesian phylogenetic model selection
complete mitochondrial genome phylogeographic analysis of killer whales (orcinus orca) indicates multiple species
resolution of the early placental mammal radiation using bayesian phylogenetics
mapping mutations on phylogenies
bayesian phylogenetic analysis of combined data
a mitogenomic timescale for birds detects variable phylogenetic rates of molecular evolution and refutes the standard molecular clock
coda: convergence diagnosis and output analysis for mcmc
poor fit to the multispecies coalescent is widely detectable in empirical data
assessment of substitution model adequacy using frequentist and bayesian methods
computational methods for evaluating phylogenetic models of coding sequence evolution with dependence between codons
mrbayes . : efficient bayesian phylogenetic inference and model choice across a large model space
phangorn: phylogenetic analysis in r
model selection in phylogenetics
estimating the rate of evolution of the rate of molecular evolution
exploring uncertainty in the calibration of the molecular clock
a case for the ancient origin of coronaviruses
relaxed molecular clocks, the bias-variance trade-off, and the quality of phylogenetic inference
dating the age of the siv lineages that gave rise to hiv- and hiv-
discovery of seven novel mammalian and avian coronaviruses in deltacoronavirus supports bat coronaviruses as the gene source of alphacoronavirus and betacoronavirus and avian coronaviruses as the gene source of gammacoronavirus and deltacoronavirus
improving marginal likelihood estimation for bayesian phylogenetic model selection
molecular phylogenetics: principles and practice
estimation of primate speciation dates using local molecular clocks

we thank the editor, tracy heath, and an anonymous reviewer for suggestions and insights that helped improve this article. this research was undertaken with the assistance of resources from the national computational infrastructure, which is supported by the australian government. supplementary information is available at molecular biology and evolution online (http://www.mbe.oxfordjournals.org/).

for each analysis of the empirical data sets, we ran the mcmc chain for steps, with samples drawn from the posterior every steps. we discarded the first % of the samples as burn-in and assessed satisfactory sampling from the posterior by verifying that the effective sample sizes for all parameters were above using the r package coda. we used stepping-stone sampling to estimate the marginal likelihood of the clock model (gelman and meng ; lartillot and philippe ; xie et al. ). for each bayesian analysis, we performed posterior predictive simulations as done for the simulated data sets, and assessed the substitution model using the multinomial test statistic. in addition, to estimate the clock-free multinomial test statistic, we analyzed each of the empirical data sets using mrbayes . (ronquist et al. ). for these analyses we used the same chain length, sampling frequency, sampling verification method, and substitution model as in the analyses using clock models.

our empirical data sets included nucleotide sequences of coronaviruses. this data set contained sequences of nt of a portion of the m (matrix) gene, as used by wertheim et al. ( ). these sequences were sampled between and . the best-fitting substitution model for this data set was gtr+g. we also used a data set of the gag gene of sivs, which comprised sequences of nt, sampled between and (wertheim and worobey ). the best-fitting substitution model for this data set was gtr+g. we used the bayesian skyline demographic model (drummond et al. ) for the analyses of both virus data sets, and used the sampling times for calibration.

we analyzed a data set of the killer whale (o. orca), which contained complete mitochondrial genome sequences of , nt (morin et al. ). we calibrated the age of the root using a normal distribution with a mean of . and a standard deviation of % of the mean, as used in the original study. the best-fitting substitution model for this data set was hky+g. finally, we analyzed a data set of several genera of marine turtles, which comprised sequences of the mitochondrial protein-coding genes (duchene et al. ), and we selected the gtr+g substitution model.
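before moving to the calibration scheme used for these data sets, the burn-in and convergence check applied to each analysis above can be summarised in code. this is a minimal sketch in python rather than the r package coda used in the study; the autocorrelation-based ess estimator, the burn-in fraction and the ess threshold are illustrative stand-ins, since the exact values are not repeated here.

```python
import numpy as np

def effective_sample_size(trace):
    """Rough ESS estimate: N / (1 + 2 * sum of positive-lag autocorrelations).
    This mirrors the kind of diagnostic provided by the R package coda,
    but is a simplified stand-in, not the coda implementation."""
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    # sum autocorrelations until they first turn negative
    rho_sum = 0.0
    for rho in acf[1:]:
        if rho < 0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

def passes_convergence_check(trace, burnin_fraction, ess_threshold):
    """Discard the burn-in fraction, then require the ESS to exceed a threshold
    (the exact fraction and threshold used in the study are not given here)."""
    kept = np.asarray(trace)[int(len(trace) * burnin_fraction):]
    return effective_sample_size(kept) >= ess_threshold

# illustrative usage with a fake posterior trace of one parameter
rng = np.random.default_rng(1)
fake_trace = np.cumsum(rng.normal(size=5000)) * 0.01 + rng.normal(size=5000)
print(passes_convergence_check(fake_trace, burnin_fraction=0.1, ess_threshold=200))
```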
following the scheme in the original study, we used calibrations at four internal nodes. the pure-birth process was used to generate the tree prior in the analyses of the killer whales and the marine turtles. supplementary information is available at molecular biology and evolution online (http://www.mbe.oxfordjournals.org/). key: cord- -kquh ry authors: canhoto, ana isabel title: leveraging machine learning in the global fight against money laundering and terrorism financing: an affordances perspective date: - - journal: j bus res doi: . /j.jbusres. . . sha: doc_id: cord_uid: kquh ry financial services organisations facilitate the movement of money worldwide, and keep records of their clients’ identity and financial behaviour. as such, they have been enlisted by governments worldwide to assist with the detection and prevention of money laundering, which is a key tool in the fight to reduce crime and create sustainable economic development, corresponding to goal of the united nations sustainable development goals. in this paper, we investigate how the technical and contextual affordances of machine learning algorithms may enable these organisations to accomplish that task. we find that, due to the unavailability of high-quality, large training datasets regarding money laundering methods, there is limited scope for using supervised machine learning. conversely, it is possible to use reinforced machine learning and, to an extent, unsupervised learning, although only to model unusual financial behaviour, not actual money laundering. in , the united nations general assembly set out a global agenda for sustainable development consisting of goals which are globally referred to as the sustainable development goals (sdgs). each individual goal is concerned with a particular social, economic or environmental issue, ranging from poverty elimination (goal ) to the strengthening of global partnerships (goal ) (u.n., ). together, the goals constitute an ambitious development agenda (economist, ) , which will require the concerted efforts of governments and private institutions across the world (madsbjerg, ) , and across all goals, in the period leading up to the year . given that economic development is negatively correlated with crime (donfouet, jeanty, & malin, ) , one of the sdgs is specifically concerned with fighting crime. namely, as part of goal , the u.n. has set out a range of targets aimed at reducing criminal activity around the world, such as significantly reducing all forms of violence, ending trafficking, promoting the rule of law and combating organised crime (u.n., ). crime reduction is an essential step in paving the way for sustainable development, because doing so will support the creation of stable societies, enhance effective governance and promote peoples' well-being (unodc, ) . money is a key motivator for those engaging in illegal activities (byrne, ) . human trafficking, for instance, generates an estimated u.s.$ . billion per year for criminal organisations, through activities such as forced labour, sexual exploitation and organ harvesting (fatf, ) . money is also needed to plan and execute criminal operations. in the case of human trafficking, money is needed to move the victims across locations; to run the places and operations where those human beings are exploited; and to bribe the various intermediaries that assist or, at least, condone this criminal activity. 
given the strong link between money and crime, most governments pursue initiatives to curtail the movement of money to and from criminal organisations, in an attempt to reduce the criminals' incentive and their ability to engage in illicit behaviour (ball et al., ) . these programmes are generally referred to as anti-money laundering and terrorism financing initiatives, or aml programmes for short. the importance of aml programmes in the global fight against crime is such that several heads of state have joined the financial action task force on money laundering (known as the fatf), with the purpose of sharing intelligence on money laundering and terrorism financing techniques, and setting out measures to combat this activity . since its inception, the fatf has advocated the use of technology to profile and detect money laundering and terrorism financing activity. financial transactions (other than direct cash payments) leave electronic traces, and these can be processed and analysed in order to develop insight about the financial behaviours of those engaging in illicit activity, or even to prove criminal association (de goede, ) . hence, it is no surprise that technological solutions such as big data analytics, natural language processing or distributed ledger technology have been touted as an essential component of money laundering detection (e.g. grint, o'driscoll, & paton, , chap. ) . in particular, there is a growing interest in exploring the potential of artificial intelligence (ai) and, specifically, machine learning in supporting aml programmes (kaminski & schonert, ) and, thus, the global fight against crime. advocates highlight machine learning's ability to handle large volumes of data, both structured and unstructured, and its potential to discover the patterns of financial behaviour adopted by those engaging in illicit activity (e.g. banwo, ; fernandez, ) . machine learning can also assist in analysing user-generated online content, such as twitter conversations or youtube videos, using sentiment analysis techniques, to identify supporters of terrorist groups, affiliation with extremist views or even plans to commit criminal activity (ahmad, asghar, alotaibi, & awan, ; azizan & aziz, ; cunliffe & curini, ; garcía-retuerta, bartolomé, chamoso, & corchado, ) . however, the industry remains cautious, and the use of these technologies is, so far, more experimental than systematic (zimiles & mueller, ) . ai and machine learning are seen as costly technological solutions whose benefits remain unproven, as far as aml programmes are concerned (grint et al., , chap. ) . moreover, there is a lack of expertise in understanding and operating ai and machine learning (grint et al., , chap. ) , which, associated with the lack of transparency of the algorithms underpinning them (crosman, ) , creates risks for the organisations relying on them for aml, as well as for the individuals whose financial behaviours are being probed. in order to reconcile these two opposing views regarding the potential of machine learning technology for crime reduction, via its inclusion in aml programmes, this research investigates the following research question: to what extent can machine learning algorithms be leveraged to assist with the detection and prevention of money laundering and terrorism financing? 
to pursue this goal, we adopt a socio-technical perspective which explicitly accounts for the technical features of information systems, as well as the social context within which such systems are developed and used (loebbecke & picot, ; markus & topi, ) . by doing so, we can move beyond a discussion of the potential of machine learning for aml programmes, and start unpacking the variety of technological and social factors, such as the 'arguments, decisions, uncertainties and the processual nature of decision making' (bowker & star, , p. ), which may support or hinder the performance of the machine learning solution. specifically, we use the theory of affordances to identify the technical features of an approach to aml powered by machine learning, as well as the social behaviours impacting on the solution's use, and how the two condition each other. the issues and concerns being expressed in relation to the use of machine learning for financial crime detection mirror those expressed in terms of using this technology more generally. for instance, many senior managers are concerned with their organisations' lack of expertise in handling big data (merendino et al., ) , while numerous companies are delaying adoption of ai because they are unsure about how it can help their firms (bughin, chui, & mccarthy, ) . likewise, there is growing awareness of the risks of ai for individuals, organisations and society (cheatham, javanmardian, & samandari, ) , including rising evidence about the negative impacts of algorithmic decision-making for organisations and individuals (see newell & marabelli, ) . hence, the findings from our study are relevant beyond the specific context of the u.n.'s sdgs; they talk to the issues at the heart of today's surveillance society (zuboff, ) . the ubiquity of algorithms, and the scale and scope of their impact in everyday life, have led diakopoulos ( , chap. ) to describe them as 'the new power brokers in society' (p. ), and to urge researchers to investigate the sources and 'contours of that power ' (p. ) . this paper addresses diakopoulos ( , chap. ) , constantiou and kallinikos ( ) and others' calls for research, by investigating how the algorithms used in money laundering detection are developed and used to sort and classify financial transactions, and the scope for using machine learning algorithms for that end. the paper is organised as follows. the next section provides a brief overview of the central role of transaction data and profiling technology in the fight against crime, and the challenges of modelling money laundering behaviour. this is followed by an exposition of the theory guiding this researchthe theory of affordancesand its application to ai and machine learning. subsequently, the details of the approach adopted in our empirical investigation are presented, and this is followed by the empirical findings. after discussing the findings, we reflect on the contributions of our paper to theory and practice, as well as areas for further research. given the central role of money in enabling and even motivating criminal activity (byrne, ) , initiatives that hinder the movement of money to and from those individuals engaging in illicit activity are seen as one of the key tools in the international fight against crime (ball et al., ) . financial services organisations are the main point of entry of cash in the financial system, as well as major facilitators of the movement of money globally. 
moreover, the movement of cash through the financial system generates records, which can be analysed to understand, prove or even anticipate how money is used or how it was generated (de goede, ) . therefore, governments worldwide have passed legislation ordering financial service providers to analyse how their customers are using the firms' financial products and services, in order to develop intelligence which can assist with crime reduction (ball et al., ) . developing intelligence about money laundering is a challenge, however, because of the nature of the phenomenon being modelled. strictly speaking, money laundering does not correspond to one specific behaviour; rather, it can relate to any type of predicate crime, from small-scale tax evasion to the trafficking of weapons of mass destruction. it also includes the case where the money has a legitimate origin (e.g. a salary), but it is used to fund criminal activity (kaufmann, , chap. ) , as in the case of charitable donations to organisations that support terrorism. the money launderer may also commit several crimes simultaneously. for instance, human traffickers also commit bribery and tax evasion (fatf, ) . moreover, money laundering may involve a varying number of actors, from sole traders to highly sophisticated organised crime groups with their own financial director (bell, ) . that is, unlike other decision-making scenarios where knowledge-based systems are modelling a specific behaviour with well-defined boundaries and participants, aml modelling systems need to account for a very broad phenomenon, with many possible behavioural manifestations and combinations of actors. not only is it difficult to develop money laundering models, but it is also very difficult to test their performance. the predicted outputs produced by the model would need to be compared with confirmed cases of money laundering in order to fine-tune the model and to improve its accuracy (zimiles & mueller, ) . however, it takes a long time (many months, possibly years) for a suspected case of money laundering flagged by a financial services provider to be formally investigated by law enforcement and, eventually, convicted. moreover, money launderers change their modes of operation frequently. for instance, the closure of national borders and the restriction of movement caused by the covid- pandemic is leading to a decrease in the street sale of drugs, and a turn to online sales coupled with courier or mail delivery (coyne, ) . criminals are also likely to take advantage of new financial products or trading strategies, such as using mobile payments (whisker & lokanan, ) or virtual currencies (vandezande, ) . therefore, any evidence which may be available to guide modelling gets outdated very quickly. that is, in the case of aml profiling, financial services providers are mostly engaging in speculative modelling (kunreuther, ) . the third challenge faced by financial services providers concerns the volume and type of data to be analysed. the typical financial organisation will produce, daily, a large volume of transaction records, in addition to structured and unstructured data produced by the organisation's many customer touch pointsfrom login data, to biometric information or chatbot conversations (fernandez, ) . aml efforts, thus, require that financial services organisations invest in powerful technical systems to help them process and make sense of such data. 
in the uk alone, firms invest around £ billion a year in customer profiling and transaction monitoring technology to assist in aml efforts, according to the latest estimates by the regulator (arnold, ) , although there are suggestions that the cost of investing in aml technology, plus the operational costs of aml compliance, outweigh any related benefits, such as improved processes or customer insight (balani, ) . aml systems not only need to be powerful, but they also need to meet other criteria such as stringent data security, customer privacy and identity verification requirements (grint et al., , chap. ) . moreover, by law, financial service providers must always be able to prove that the technologies that they use do not unfairly discriminate against certain customers (crosman, ) . these requirements mean that financial services organisations are wary of adopting technologies where they lack complete control over use of customer data, or whose workings they do not fully understand, as in the case of black-box type of algorithms. that is, while the aml area seems ripe for machine learning deployment, and some industry players are investing in this technology (zimiles & mueller, ) , there are also various organisational and technical barriers to consider. to research these technical and organisational factors, we draw on the theory of affordances, as outlined next. the value of machine learning in aml comes not from what the technology is, but from what it enables users to do. hence, in order to investigate the research question previously presented, we need a lens that accounts for both the technical and the social dimensions, such as the theory of affordances. the theory of affordances originates from direct perception psychology (namely gibson, ) , and studies how the real and perceived characteristics of artefacts condition their use. one of its fields of application is the study of perceptions and use of information technology in organisations (e.g. leonardi, ; volkoff & strong, ) , and the effect of such usage in those organisations (e.g. markus & silver, ; sebastian & bui, ) . the term affordance refers to the patterns of user behaviour made possible by the properties of an artefact, used in a particular setting. for example, the realisation of the affordance 'surfing the web' results from the interaction between the properties of a web browser and the characteristics of the user (de moor, ) . the functional and relational aspects of the artefact are preconditions for activity (greeno, ) . that is, they create possibilities for action (leonardi, ) . for instance, a switch connected by a wire to a power source enables actors to turn the electricity on and off. the characteristics of the artefact also constrain action (hutchby, ) . staying with the switch example, if the switch is positioned very high on a wall, it limits the ability to be switched off by small persons, such as young children. in order for these possibilities for actionthe real affordancesto be realised, the actor (e.g. an organisation's employee, team or unit) needs to recognise the affordance (davern, shaft, & te'eni, a , b and enact it. for instance, for the connectivity characteristic of a web browser to enable an internet user to access information on a remote server, the user needs to understand what the browser is for and how to use it. the actor may recognise the affordance by virtue of the features of the artefactfor instance, the presence of "on" and "off" labels on the switch. 
in addition, the recognition of affordances is conditioned by the organisational systems in which the artefact is deployed. for example, leonardi ( ) reported how employees from different departments in one organisation used a training simulation software in markedly different ways. contextual features which may impact on the recognition of the affordance include the organisational and environmental structures and demands; attitudes and perceptions towards the artefact; the level of effort required from the actor; the actors' skill, ability and understanding; and the actors' ultimate goal (bernhard, recker, & burton-jones, ; volkoff & strong, ) . affordances are relationalthat is, the realisation of an affordance is both technology-and actor-specific (strong et al., ) . therefore, their study requires the investigation of the technical features of the artefact, the social features related to the user and how the two impact on each other (volkoff & strong, ; zammuto, griffith, majchrzak, dougherty, & faraj, ) , in this way, the theory of affordances rejects the notion of either a technological or an organisational imperative (zammuto et al., ) , and focuses, instead, on the iteration between the two (leonardi, ; strong et al., ) . artificial intelligence is an assemblage of technological components which collect, process and act on data in ways that simulate human intelligence (canhoto & clear, ) . ai can handle large volumes of data, including unstructured inputs such as images or speech, which makes it extremely relevantor even essentialin the age of big data (kietzmann, paschen, & treen, ) . the core component of an ai solution is the machine learning algorithm, which processes the data inputs (skiena, ) . what distinguishes machine learning from classical programming is that, in the former, the goal of the computational procedure is to find patterns in the data set, i.e. the rules that link the inputs to the outputs. in contrast, in classical programming, the rules are developed a priori, and the goal of the computational procedure is to apply those rules to input data, in order to produce an output. there are various types of machine learning, each applicable to a different type of problem. supervised machine learning is indicated for situations whereby there are known inputs and known outputssuch as patterns of cell variations vs. stages of cancer (tucker, ) . the analyst gives the computer training datasets, with data labelled as either input or output. the function of the algorithm is to learn the patterns that connect the inputs to the outputs, and to develop rules to be applied to future instances of the same problem. the opposite approach is unsupervised machine learning, which is indicated for data sets where it is not known which data points refer to inputs and which ones refer to outputsfor instance, a basket of items frequently bought together. the analyst gives the computer a training dataset with no labels. the algorithm's task is to find the best way of grouping the data points, and to develop rules for how they may be related. an intermediate approach, reinforced machine learning, should be applied to problems where certain courses of actions produce better results than othersfor instance, playing a game (mnih et al., ) . the analyst gives the computer a dataset plus a goal, as well as rewards (or penalties) for the actions that it takes. the algorithm's task is to find the best combination of actions to attain that goal. 
to achieve that, the algorithm sorts through possible combinations of data, and analyses the rewards for different combinations, to find the patterns that maximise the overall goal. the choice of type of algorithm to use should be based on fit with type of problem (skiena, ) . however, in practice, the choice is often determined by pragmatic reasons, such as the analyst's skills, compatibility between programming languages (calvard, ) or processing power available (agarwal, ) . data are integral to the development of machine learning algorithms and, hence, to the system's performance. without data, algorithms have been described as mathematical fiction (constantiou & kallinikos, ) . depending on the technical characteristics of the system, this may be only structured data (namely numeric data), or also include unstructured data such as images or voice (paschen, pitt, & kietzmann, ) . datasets may be collated from historical databases, such as shipping addresses or the type of ip connection used (o'hear, ); real time data, collected via physical sensors or online tracking; or knowledge data, such as whether previous product recommendations were accepted or rejected. moreover, data can be sourced internally or externally. the choice of which type of data to use, or how much data, is often constrained by the need for compatibility between the different elements of the ai solution. standardisation increases the ability to use multiple data sources, but also reduces the system's flexibility and limits its contextual richness (alaimo & kallinikos, ) . another important issue concerns the quality of the training data set, namely how the data were collected, their recency and whether they are representative of the population at large (hudson, ) . this problem is particularly relevant in the case of external data, when firms are unable to access and assess the underlying assumptions and data sources (khan, gadalla, mitchell-keller, & goldberg, ) . once data have been processed through the machine learning algorithm, the system produces an output, which may vary in terms of type and autonomy from human intervention (canhoto & clear, ) . the examples of machine learning that tend to be featured in the media are those where the system has autonomy to act on the basis of the results of the computational processsuch as steering a car without human intervention (goodall, ) . however, the system's output could be something as simple as a score, with no performative value until an analyst acts on it (e.g. elkins, dunbar, adame, & nunamaker, ) . the output can also be re-entered into the training data set, to further the algorithm's development. for instance, alphago zero has mastered the board game go by playing against itself over and over again (silver et al., ) . this means that machine learning algorithms have the capacity to learn over time, and to adapt to changes in the environment (russell & norvig, ) . however, it also means that machine learning can create self-reinforcing feedback loops, quickly becoming so complex that analysis can no longer explain how they work. an example was facebook's ai negotiation bots, which developed their own, incomprehensible-to-humans, language (lewis, yarats, dauphin, parikh, & batra, ). self-reinforcing loops can also spread biases and mistakes. 
for example, ai-powered bots that automatically aggregate news feeds' content can spread unverified information and rumours (ferrara, varol, davis, menczer, & flammini, ), while automatic trading algorithms have been blamed for creating flash crashes in the u.s. stock market (varol, ferrara, davis, menczer, & flammini, ). this problem is particularly relevant in the case of predictive analytics, where analysts are unable to assess the quality of the output prior to implementation and scaling (mittelstadt, allo, taddeo, wachter, & floridi, ). ai and machine learning can be deployed to perform mechanical, analytical, intuitive or even empathetic tasks, although, given the current state of development of the technology, they are better suited to the former than the latter (huang & rust, ).

in summary (fig. ), the potential of machine learning to be used in different scenarios is shaped by technical features such as its ability to learn patterns in data, process various types of data and act autonomously. moreover, it is shaped by contextual features such as the type of problem to which it is applied, the analyst's skills, system compatibility, processing power, variability of data, quality of the data set, acceptability of the output, comprehensibility, risk of unchecked biases and mistakes and the nature of the task.

(fig. . the link between machine learning's features, context and affordances.)

advocates of machine learning use in aml highlight the potential of this technology to discover novel patterns in financial transaction data, and to do so in a cost-effective manner (e.g. fernandez, ). however, whether that potential is realised or not depends entirely on the interplay between the technical and contextual features of aml programmes. we investigated this problem empirically, as described next.

given the relational and dynamic nature of affordances, they are best studied via qualitative methods (bernhard et al., ). the explanatory case study methodology is particularly well suited for affordances' research (leonardi, ), as it enables researchers to identify the genesis of change, in context (dubé and paré, ). case study methodology is also indicated to study the development of algorithms, to eliminate the possible effects of spurious correlations which can mask which variables are used, how and why (o'neil & schutt, ). negotiating access for this type of study is extremely difficult. first, the development and use of algorithms is usually shrouded in secrecy (beer & burrows, ), with most analysis of algorithmic decision making relying on reverse engineering of algorithms (o'neil & schutt, ). it is particularly difficult to get access to financial services organisations, due to the secretive nature of this sector, particularly since the banking crisis of (canhoto et al., ). moreover, the subject matter of this research (money laundering detection) is deemed by many financial services organisations to be highly sensitive: financial institutions are very reluctant to discuss their approach to money laundering and terrorism financing detection for legal, strategic and operational reasons (ball et al., ). for all these reasons, we used a single, embedded case study. the focus on a single organisation, while limiting in terms of variability of observations and generalisability of the findings, offers a rich and holistic perspective (creswell, ) that is very much needed in this under-researched problem area.
moreover, the use of multiple sources of data and of types of evidence in the case study offers a depth of insight into thinking and doing processes not available when using other, mono-data collection instrument approaches (woodside, ) . the unit of analysis was a uk-based financial services organisation, to be referred to as bank. bank is part of one of uk's largest financial services groups. its largest business unit is retail banking, contributing to over three-quarters of the group's profit. the retail business includes the provision of current accounts, savings, personal loans and mortgage services, long term investment products and credit cards, among others. in line with the framework articulated in the previous section, data collection had two foci: the technical features (the algorithms and data used, and the type of outputs developed) and the contextual features (the type of problem, skills, etc…) of money laundering profile development at bank. the data collected is summarised in table . all interviews were recorded and transcribed, while contemporaneous notes were taken during the observations. the data collected were first analysed according to the data collection instrument (e.g. electronic documents), and then across methods (e. g. interview with vs. observation of system administrator). this approach helped to recognise converging findings, and increased the robustness of the analysis (jick, ) . the coding process followed the approach outlined in miles and huberman ( ) , whereby an initial list of codes was developed based on the theoretical categories depicted in fig. , and applied deductively to the data. this list was, subsequently, augmented with codes emerging inductively during the hermeneutic process of data analysis. finally, a detailed case history was developed and presented to the research participants, to confirm the accuracy of the findings. like other financial institutions in the uk, bank needs to analyse the records of financial transactions of its customers, in order to identify those that might be linked to underlying criminal activity. the regulator does not provide financial service providers with models for querying the database, so it is up to bank to develop its own. an important factor to consider is that the financial transactions are not, usually, illegal. indeed, unless the client is defrauding bank, the transactions themselves are legitimate, and part of the normal business of a financial services organisation. that is, what bank is actually trying to achieve with aml profiling is to identify the patterns of behaviours followed uniquely by customers that are attempting to disguise the illegal source or illegal intended use of their money: ' we are trying to find out, first of all, a very basic profile… for instance, a finding may be that a customer aged between and years old is twice as likely to be [involved in criminal activity than] the entire customer base.' (interview, systems manager) the favoured approach to develop algorithms for aml detection is by drawing on factual information provided by law enforcement agencies regarding confirmed cases of criminal activity. for instance, when prosecutors secure a conviction, some information is made public about the convicted person(s)'s characteristics and financial behaviour. hence, bank has access to historical data about confirmed pairings between a specific crime and pattern of transactions. 
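to make this first approach concrete, the sketch below illustrates, in python, what supervised learning on such confirmed pairings could look like. the features, the toy values and the decision-tree classifier are hypothetical assumptions for illustration only; they are not bank's variables, data or model.

```python
# a minimal, purely illustrative sketch: supervised learning on labelled
# historical cases (confirmed inputs -> confirmed outputs).
# all features, values and the classifier choice are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

labelled_cases = pd.DataFrame({
    "monthly_cash_deposits": [12, 1, 9, 0, 15, 2],
    "international_transfers": [4, 0, 6, 1, 5, 0],
    "account_age_months": [3, 48, 6, 60, 2, 36],
    "linked_to_conviction": [1, 0, 1, 0, 1, 0],   # confirmed output (label)
})

X = labelled_cases.drop(columns="linked_to_conviction")
y = labelled_cases["linked_to_conviction"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X_train, y_train)

# the learned rules would then be applied to future accounts; in practice,
# as the text notes, such training sets are small, dated and quickly irrelevant
print(model.predict(X_test))
```

even in this toy form, the limitation discussed in the text is visible: with only a handful of confirmed convictions, the training and test sets are tiny and the learned rules generalise poorly.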
this information is valued by bank because it provides confirmed inputs (the person's characteristics and behaviours) and confirmed outputs (what crimes the person was convicted of). it is, therefore, amenable to analysis through supervised learning (fig. a) . for instance, confirmed reports that a number of terrorist financers had lived in a particular geographical area and been involved in a specific type of business activity led bank to investigate the transaction patterns of that type of business account: 'there is an area (…) with two particular postcodes in which there are lots of [particular type of business mentioned in conviction reports]. one piece of intelligence that we had was that the only two people who were ever convicted for being members of al-qaida, in the uk, were actually from that area. we know that area and a lot of these [businesses] (…) we can look at the customers who live in that area.' (interview, head of fi team) however, this type of data is limited in both number and value. in number, because, due to legal and operational constraints, not all details of the convictions are released. moreover, there is usually a gap of many monthsand, often, several yearsbetween the criminal activity, its conviction and the subsequent release of information. therefore, the training data set is very small. in value, because the information that is made available, by its nature, deals with specific events and, often, unique behaviours. in the case mentioned, the information related to the source of funding of a particular international terrorist organisation associated with al-qaida. other terrorist organisations are known to use different sources of funding, such as trafficking or gambling. moreover, following the conviction mentioned by this interviewee, there was a change in the law that curbed the activities of the type of business mentioned in the reports; and, therefore, limiting terrorist organisations' ability to be funded this way. finally, while bank had customers with the characteristics mentioned, this is not always the case. hence, the training data are not always relevant. the second favoured approach to develop money laundering detection algorithms at bank is based on court production orders (cpos). cpos are mechanisms used by the court to gain access to specific information about someone who is being investigated for suspected criminal activity. the court contacts financial institutions where the suspect has accounts, asking for their transaction history: 'we are asked to provide a lot of witness statements. we often get production orders (…). our action there is reactive.' (interview, head of fi team) following the receipt of a cpo, bank investigates the pattern of financial transactions for the account(s) flagged (fig. b) the advantages of cpos are that they are more frequent and timelier than convictions, which increases their value as training datasets. however, bank does not know what crime the customers are being investigated for, or even whether they will end up being convicted. therefore, there is a low level of certainty that any particular pattern identified corresponds to actual criminal activity, or which crime, which limits their value as training datasets. the third approach used by bank consists of looking for variations in the financial behaviour of particular types of customers, which is suitable for unsupervised learning. 
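a minimal sketch of this third, unsupervised approach is given below, ahead of the concrete example that follows; it groups accounts by simple summaries of their cash-deposit behaviour and passes the smaller, atypical cluster to analysts. the engineered features, the toy amounts and the use of k-means with two clusters are assumptions for illustration, not bank's configuration.

```python
# illustrative sketch of unsupervised learning over cash-deposit behaviour:
# group accounts by how "round" their deposits are, then hand the small,
# atypical cluster to analysts for manual review. all values are made up.
import numpy as np
from sklearn.cluster import KMeans

deposits_per_account = [
    [5000.00, 3000.00, 4500.00],      # round amounts
    [5123.47, 2987.12, 4410.96],      # exact amounts
    [2000.00, 2500.00, 1500.00],
    [1999.99, 2503.10, 1488.62],
    [7000.00, 6500.00, 8000.00],
]

def roundness_features(deposits):
    """Two simple per-account features: share of deposits that are whole
    hundreds, and mean deposit size (scaled)."""
    d = np.asarray(deposits)
    share_round = np.mean(d % 100 == 0)
    return [share_round, d.mean() / 10000.0]

X = np.array([roundness_features(d) for d in deposits_per_account])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# the smaller cluster is the "unusual" one flagged for further probing;
# as the text stresses, unusual is not the same as suspicious
minority = min(set(labels), key=list(labels).count)
print([i for i, lab in enumerate(labels) if lab == minority])
```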
for instance, bank used this technique to analyse the financial behaviour of business accounts associated with a particular type of trade suspected of engaging in tax evasion. the analysis identified two clusters of accounts with distinct behavioural patterns in terms of cash deposits: one cluster, with a large number of accounts, where traders usually deposited round amounts (e.g. £ , ); and another cluster, with a small number of accounts, where traders tended to deposit exact amounts (e.g. £ , . ). these clusters were, subsequently, subjected to further probing by the analysts, to identify those customers that might be deliberately attempting to avoid paying tax (fig. c): "normal behaviour dictates that [these customers] only deposit exact amounts. is it possible that, therefore, the money launderer might leave it at the exact pounds and pence? should we be targeting the unusual end instances?" (interview, systems manager)

this kind of analysis is fairly frequent at bank. because it focuses on a large number of accounts and current behaviours, this approach avoids the problems of small and dated training data sets that characterise the two approaches previously discussed (fig. a and b). moreover, by alternating the focus of analysis, such as particular types of business activities (e.g. certain types of trade), particular types of customers (e.g. who might be vulnerable to identity fraud), particular types of accounts (e.g. dormant accounts) or particular types of transactions (e.g. international transfers), it allows the organisation to develop insight about the usual patterns of behaviour of those types of account holders or products. however, this approach assumes that customers engaging in legitimate behaviour have markedly different financial behaviours from those handling the proceeds of crime. it also assumes that the majority of bank's customers in any given category of analysis are not criminals. moreover, the analysis of the outliers relies on speculation about the reasons underpinning the observed behaviours. another problem faced by bank when using this approach is that, due to limited processing power, bank can only run a specified number of queries at any one time. hence, when the team wants to investigate a new type of behaviour (e.g. deposits in accounts held by traders), the systems manager needs to switch off one of the other queries in use (e.g. international transfers).

the final type of approach uses a combination of pattern analysis of a dataset and criteria matching to identify accounts with suspicious patterns of financial transactions. it is the approach used for routine analysis of financial transactions, and it is suitable for application of reinforced learning techniques. every day, bank uses this algorithm to analyse the transactions that occurred in a given period, giving more weight to those that match known money laundering methods (e.g. depositing large quantities of cash, or quickly defunding an account), and/or that violate rules about the normal use of a given product and/or the expected behaviour for a type of customer. the output is a set of transactions that are deemed to follow an unusual pattern, and that are flagged for further investigation by the analysts (fig. d). as with the previous approach (fig. c), this one applies to a large number of accounts, rather than relying on small and dated training data sets (as in fig. a and b).
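a minimal sketch of the daily rule-weighting mechanism just described is shown below. it is a plain weighted-rule scorer rather than a full reinforcement learning loop, and every rule, weight and threshold in it is hypothetical.

```python
# illustrative sketch of weighted, rule-based flagging of daily transactions.
# every rule, weight and threshold here is hypothetical.
from dataclasses import dataclass

@dataclass
class Txn:
    account: str
    amount: float
    channel: str        # e.g. "cash", "transfer"
    product: str        # e.g. "personal", "dormant"

RULES = [
    ("large cash deposit",          lambda t: t.channel == "cash" and t.amount > 5000,     3.0),
    ("activity on dormant account", lambda t: t.product == "dormant",                      2.0),
    ("rapid defunding",             lambda t: t.channel == "transfer" and t.amount > 8000, 2.5),
]

def score(txn):
    """Sum the weights of every rule the transaction matches."""
    return sum(w for _, match, w in RULES if match(txn))

def daily_flags(txns, threshold):
    """Return transactions whose accumulated weight exceeds the threshold;
    in practice the threshold is tuned partly to keep the number of flags
    manageable for the analyst team, as the interview material describes."""
    return [t for t in txns if score(t) >= threshold]

today = [
    Txn("A1", 6200.0, "cash", "personal"),
    Txn("A2", 120.0, "transfer", "personal"),
    Txn("A3", 9000.0, "transfer", "dormant"),
]
print(daily_flags(today, threshold=3.0))
```

a scorer of this kind is easy to re-weight as intelligence changes, which is the property discussed next.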
another advantage of this approach is that it can be adjusted to reflect bank's evolving knowledge about its customers and money laundering methods, which is reviewed every week by the financial intelligence team. for instance, the discovery of a money laundering scheme linked to caravan parks led bank to create a filter for type of accommodation; while another filter that gave more weight to personal accounts without a known residential phone number was dropped when it became clear that more and more customers did not have fixed phones in their houses. however, this approach is not focused on known criminal behaviour (unlike the approaches depicted in fig. a and b), or on accounts with a high likelihood of being linked to criminal activity (unlike the approach depicted in fig. c). rather, it flags a number of accounts that may have unusual, rather than suspicious, transaction patterns, and which need to be investigated further, manually, by the analyst team. this investigation, like the approach depicted in fig. c, relies on speculation by the analysts about the legitimacy of the behaviours observed. in turn, the need for the results to be filtered by analysts creates another challenge: queries that produce large numbers of flags undermine the unit's goal of staying within specific performance targets. so, the parameters may be fine-tuned because of the need to limit the number of output cases, rather than because of specific intelligence: 'we wrote a rule that says "from our personal account customers, tell us which accounts are receiving in excess of [£x] in a [n] day period". initially, that prompted many cases, and over time, we brought that figure down to cases between [£y] and [£z] over n days.' (interview, analysts' manager) as with the third approach (fig. c), due to limited processing power, bank needs to switch off an existing filter whenever it wants to introduce a new one.

in addition to the specific technical and organisational challenges associated with the specific types of algorithms discussed above, there are some generic issues that condition bank's ability to use machine learning in aml profiling. in terms of input data, bank can only use its own transaction databases. hence, it will always have a limited view of a customer's financial behaviour. for instance, bank may be aware that a customer has accounts in another financial services organisation, but is unable to access information about the transactions in those accounts. moreover, due to the system's constraints, bank only holds data for analysis for a certain number of months. that is, any automated machine learning exercise is done on records from the last x months of activity only (although analysts can query older databases, manually). furthermore, due to compatibility issues, not all legacy systems can feed into the automated analysis system. one example is the mortgage database. again, analysts can query the mortgage database individually, but not as part of an automated machine learning exercise. in addition, the system currently in place at bank can only process numerical data and some types of non-numerical data (e.g. postcodes). it is unable to process freeform text fields, voice and other types of unstructured data: 'we don't carry details such as scratch pads, history names, notes that customer advisors might use, telephone conversations… although some of the conversations could be useful, and we can't run queries on them. so, there is no use in having them into the system.'
(interview, systems manager) additionally, collecting and keeping updated data on all customers is a costly activity for bank, and intrusive for the customer. hence, bank does not collect, and does not routinely update, all types of data that the analysts deem useful for aml profiling.

in summary, the type of evidence available to bank as a training data set, the type of data available for querying, the type of systems in place and other resource constraints mean that there are significant practical limitations to using machine learning for automated discovery of specific money laundering behaviour. its main application potential seems to be in the case of speculative analysis of unusual behaviour requiring subsequent manual investigation by analysts. even then, due to system constraints, bank needs to engage in focused discovery, keeping in mind that the data set may not be as broad, as complete or as up to date as desirable, and that the volume of output (i.e. flagged cases) needs to be manageable within the target deadline.

machine learning's ability to discover patterns in data, process various types of data and act autonomously promises to enable financial intermediaries to detect money laundering activity in a cost-effective manner (fernandez, ). through the use of multiple data collection tools, we researched aml algorithm development at a uk-based financial services organisation. we found that, as far as this type of organisation is concerned, the real affordance of machine learning for aml detection falls short of the perceived one (e.g. banwo, ; fernandez, ). we also identified the technical and contextual features that constrained this organisation's ability to tap into machine learning's potential, as summarised in table . some of these constraints are specific to this organisation, while others are common across the sector, as discussed next.

one of the key criteria in choosing between alternative approaches to machine learning is the fit between the type of approach and the nature of the phenomenon being modelled (skiena, ). our analysis shows that, in aml profiling, there are, actually, two very distinct phenomena being modelled, each requiring a different approach. one phenomenon consists of developing knowledge about money laundering schemes, via descriptive profiling; the other of detecting attempts to launder money, via predictive profiling. while developing knowledge is an essential step in understanding the nature of the phenomenon, ultimately, to meet the goal of assisting with crime reduction, financial intermediaries need to be able to detect and prevent attempts to launder the proceeds of crime through their organisations (ball et al., ). in the empirical setting considered, the first type of profiling relies on historical datasets, and produces descriptive outputs. supervised machine learning algorithms seem best suited for this type of phenomenon. in turn, the second phenomenon relies on real-time data and knowledge, some of which may be derived from the first type of profiling. it produces performative outputs, namely predictions of high-risk transactions, which need to be investigated by analysts. unsupervised and reinforced machine learning algorithms fit the second type of phenomenon best. that is, not only are the two types of profiling problem best suited to different types of machine learning algorithm (as per skiena, ), but they also use very different inputs, and produce different types of outputs.
regardless of the type of profiling, financial service organisations face the constraint that their perspectiveand, hence, ability to modelis limited to the financial transactions processed by the organisation. while some customers may use only one financial services provider for all of their banking needs, manyif not mostwill use more than one provider. hence, any individual intermediary will only process a subset of a customer's financial transactions. moreover, given that financial organisations do not share information about their customers with each other, for both legal and strategic reasons, each organisation will always have an incomplete dataset of the customer's financial transactions. as a result, they may fail to recognise the importance of a particular transaction, or, conversely, give undue importance to another. financial service organisations also have limited insight regarding the reasons underpinning the observed behaviours, because they cannot always probe the customer about the reasons for the observed behaviour. this is particularly the case for online transactions, which have become the norm for around three-quarters of the uk population (cherowbrier, ) . the absence of such information, or doubts about the veracity of the information provided, result in inferences, which may be shaped by various cognitive restrictions (desouza & hensgen, ) , and which become crystallised in subsequent decision-making (bowker & star, ) . we also need to consider the legal requirement for explicability of decision-making, and to prove that no customer has been unfairly discriminated through the use of technology (crosman, ) . this is particularlythough not exclusivelylikely to occur in cases where the ai system has autonomy to act, and when there are self-reinforcing feedback loops (canhoto & clear, ) , as well as when the algorithm is used for prediction rather than description (mittelstadt et al., ) . based on the description of aml monitoring at bank, this means that supervised learning might be the least likely to breach these criteria, because it is used for description not prediction, and there are no feedback loops. generally, unsupervised learning is likely to have low explicability because it has the most potential to produce outputs that are not comprehensible to humans (lewis et al., ) . this characteristic puts the financial services organisation at risk of non-compliance with the sector's regulations. in turn, reinforced learning is the most likely to lead to feedback loops, particularly if the rules have been derived from previous unsupervised learning exercises. even where information about the reasons for the observed behaviour exists, the organisation may be unable to use it in algorithm development. in bank, this was the case for information stored in the form of notes, recordings of conversations or other forms of unstructured data. the type of data that bank's systems could process in practice was much less varied than vast array of data typically mentioned in the ai literature (e.g. kietzmann et al., ) . bank was also unable to draw on all databases due to compatibility issues, or to use data older than a certain period due to system constraints. in theory of affordances' terminology, the realised affordance of ai is much narrower than its functional affordance, which limits its value in aml. though, this observation reflects the extant literature (e.g. grint et al., , chap. 
; zimiles & mueller, ), this may not be the case in other financial services organisations. other providers may have access to systems that can seamlessly integrate with more databases, and/or which can process all types of structured and unstructured data. in addition to these generic constraints, there are others that relate to the particular type of profiling, or the approach, pursued. for descriptive profiling, these challenges are mostly related to the availability of high-quality, relevant training data in a timely manner. this is the case not just in aml, but also for other contexts using inductive approaches to model development (staat, ). for instance, a similar problem was observed in the use of machine learning to diagnose those infected with the sars-cov- virus: even though this technology could potentially read lung scans in a fraction of the time required by a radiologist, initially it lacked sufficient quality images of lungs confirmed to be infected with this virus (vs. lung cancer, for instance) to form a useful training dataset (ray, ). a related challenge concerns the relevance of the available data. criminals are constantly innovating their mechanisms of laundering money, such as using mobile payments (whisker & lokanan, ) or virtual currencies (vandezande, ). therefore, the limited training datasets available may quickly lose relevance and applicability (sloman & lagnado, ).

in the case of predictive profiling, the challenges are mostly related to the quality of the underlying assumptions and the inability to test prior to scaling (mittelstadt et al., ). the first key assumption is that the majority of bank's customers are not engaged in money laundering. while this may be true of the general population (zhang & trubey, ), it may not be the case for individual organisations, or for all types of predicate crime; for instance, tax evasion or support for terrorist organisations may be very prevalent in certain geographical locations. the second dominant assumption is that the pattern of transactions of customers engaging in money laundering is very different from that of the other customers. given the dynamic and broad nature of money laundering, this is not necessarily the case for all types of predicate crime and/or customers. both assumptions are difficult to test, meaning that there are few, if any, opportunities to assess the quality of the models developed (zimiles & mueller, ). moreover, the analysis of outputs produced in the case of the predictive algorithms relies on deductive reasoning, whereby the analysts try to reason about how someone who is trying to use the financial system to launder money without being detected might use the various products and channels at their disposal. this approach is liable to be affected by biases (pazzani, ), stereotyping (bouissac, ) and other cognitive restrictions (desouza & hensgen, ). these challenges are heightened by the fact that aml predictive modelling actually focuses on unusual transaction patterns among a specific client base, rather than actual criminal behaviour. if financial organisations were to treat the accounts flagged by the predictive algorithms as certain to be involved in money laundering, and reported them all to law enforcement, it would cause extensive disruption to customers, and could lead to customer complaints and possible financial losses (ball et al., ). therefore, manual analysis is required, which adds costs and delays to the process.
in the case of bank, the limited availability of manual analysts to scrutinise the algorithmic outputs was one of the key constraints shaping the use of aml predictive algorithms. other organisations may not experience this constraint, although evidence from other institutions and even other sectors suggests that this is a generalised problem. for instance, it is one of the reasons why online providers such as youtube or facebook cannot completely prevent the publication of pornographic or extremist materials on their platforms. another constraint that may be specific to bank refers to the number of queries that can run at any one time. this had to do with a number of technological and organisational reasons, common to many other organisations, such as the investment cycle in new technologies, or the limited appetite to invest in what is seen, across the industry, as a risky and costly solution (grint et al., , chap. ). while others have observed similar limitations (e.g. zimiles & mueller, ) , it is possible that other financial services organisations have access to systems that can run more queries simultaneously than bank, for instance by using cloud services. the issue of the cost of ai and machine learning technologies should not be underestimated. small intermediaries may not have enough aml budget to buy sophisticated ai solutions, while large organisations operating across multiple jurisdictions need standardised solutions that few providers are able to offer (grint et al., , chap. ) . other tradeoffs to consider are whether to focus on processing speed, degree of confidence in the result or learning curve (cormen, leiserson, rivest, & stein, ) . adopting new ai technology may also require additional investment elsewhere in the organisation, such as updating legacy database systems to make them compatible with the new solution. moreover, there are indirect costs to consider such as recruiting staff with the necessary subject matter and technical expertise (merendino et al., ) , or trying to retain dissatisfied customers (masciandaro & filotto, ) . that is, the calculation of the cost of ai solutions, and therefore the calculation of this technology's cost-effectiveness, is not straightforward. there are numerous trade-offs, and direct and indirect costs, to consider. this paper set out to investigate if and how machine learning can assist in money laundering detection and contribute to achieving goal number of the u.n.'s sdgs. this question is of interest to both the scholar and the managerial communities (kaminski & schonert, ) . however, there is a lack of empirical investigation regarding the actual use of machine learning in aml, as well as regarding the process of development of algorithms generally (o'neil & schutt, ) . the theoretical framing and subsequent empirical investigation focused on aml monitoring by financial services organisations, given their role as enablers of the movement of cash globally, the legal requirement that they face to detect and prevent money laundering and the abundance of transaction data that they traditionally hold. through our consideration of the characteristics of machine learning, and of the phenomenon of aml profiling, we conclude that there are some opportunities for using machine learning to assist with identifying unusual transaction patterns, or even with suspicious behaviour more generally. 
however, this potential is severely curtailed by the current legal structures, the mechanisms for data sharing between law enforcement and financial services organisations and the relatively high cost, complexity and perceived risk of these solutions. moreover, we did not find any evidence of use of sentiment analysis of user-generated online content. hence, we concur with arslanian and fischer's ( ) view that the potential for machine learning in aml is far behind that of other applications and other industries (e.g. castelli, manzoni, & popovič, ; fosso wamba et al., ) . in terms of this study's contribution regarding the role of machine learning in supporting goal of the united nations' sdgs, we showed that its value is very limited at the level of individual financial services organisations. on the one hand, this is because of the nature of the phenomenon being modelled, namely a multi-dimensional phenomenon, characterised by secretive and deceptive behaviours, and which is constantly evolving (whisker & lokanan, ) . on the other hand, this is because of the specific position of financial services organisations in the money laundering supply chain, the limited perspective that they have on their customers' transactions and the nature of the aml task that they are asked to perform (i.e. prevent money laundering). while financial services organisations may be essential enablers of money laundering and, indirectly, criminal activity, their perspective is limited to the transaction data for their own customers and their own institution. money laundering often involves multiple individuals and institutions, possibly in multiple jurisdictions, and may take place over an extended period of time. in particular, the type of transnational, organised crime mentioned in the u.n.'s sdg may be difficult to detect via routine aml monitoring by any individual financial services organisation. in some jurisdictions, such as italy, aml monitoring is done at the national level, rather than at the organisational level as is the case in the uk. it is possible that machine learning would be effective for aml at the national level, for either descriptive or predictive profiling, and further research should consider this specific empirical scenario. moreover, financial services providers hold a large volume of data about their customers' identity and behaviour (fernandez, ) . however, they lack timely, relevant and sufficient data about money laundering behaviours with which to train machine learning algorithms. as is the case with the novel sars-cov- virus, the key to unlocking the processing power of machine learning is the training dataset (ray, ) . without such datasets, the actual value of this technology falls very much short of its potential, yet this aspect is largely absent from the discussion about the application of machine learning in aml (e.g. kaminski & schonert, ) , or, indeed, other areas. in summary, as far as individual financial services organisations are concerned, the short-to-medium-term potential of machine learning for aml has been somewhat inflated in the commercial and technical literature, and will require the agreement of standardised approaches to transaction monitoring (grint et al., , chap. ) . while this study focused on the specific case of aml, the findings are relevant for other scenarios, and the study makes contributions to the broader literature on algorithmic decision making and customer surveillance, as discussed next.
the data-driven view of behavioural analysis and decision-making tends to present the use of ai as a means of reducing the influence of the analyst on the process and, hence, bias by 'letting the data speak for itself' (williams, ) . however, as this study showed, there is human influence at every step of the process: starting with the data that the organisation decides to collect, after considering the trade-offs between insight potential on the one hand, and collection costs or customer irritation on the other; to the evidence that is considered relevant when developing the models; to the interpretation of the links between data and the assumed underlying behaviour, or the fine-tuning of the models to cope with staff availability. drawing on the speaking analogy, it is not the case that data speaks for itself. instead, when algorithms use data, they do so with a vocabulary, a set of grammar rules and a range of assumed pragmatic meanings that are not only socially construed and subjective, but also contextual (constantiou & kallinikos, ) . in fact, it may be undesirable to eliminate the human element from the process. subject matter expertise, intuition and social context are all useful in improving the quality of the decision-making. similar effects were observed in relation to credit decisions during the sub-prime crisis of - , where there was a higher default rate among loans that were screened automatically than among those where the decision was made, at least in part, manually (canhoto & dibb, ) . outside of the financial services sector, it has been shown that manual input is essential to improve the quality of the training datasets to detect covid- in lung images (ray, ) . given the significant scope for human influence on data analysis, across different types of applications, further research could explore how different foci and forms of manual interventions impact on machine learning algorithms' deployment, use and performance. if the models developed via reinforced machine learning wrongly deem a certain transaction as suspicious, there is a 'false positive' error, which generates unnecessary work for analysts and inconveniences the customers. if, on the contrary, the transaction is wrongly deemed legitimate, there is a 'false negative' error, and the financial institution faces the possibility of criminal prosecution, fines and reputation damage. that is, different classification errors impact on different stakeholders. given the scale of continued expansion of customer surveillance by commercial organisations for their own strategic purposes (zuboff, ) , or on behalf of governments (ball et al., ) , the possibility of errors and associated consequences is hugely magnified. yet, by and large, the cost of those errors is not part of the discussions or calculations of the costs and benefits of using ai. further research could conceptualise the value of using ai in ways which not only consider the characteristics of these technologies, but also the context where they are used, and the consequences of their deployment for a broad range of stakeholders impacted by their use (newell & marabelli, ) , including the possibility of discrimination and victimisation of certain groups, or the erosion of privacy.
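the stakeholder asymmetry described above can be made concrete with a back-of-the-envelope calculation; all probabilities and unit costs below are invented for illustration and are not drawn from the case organisation.

```python
def expected_cost(p_fp, p_fn, cost_fp, cost_fn, n_transactions):
    """expected cost of a monitoring rule over a book of transactions.

    cost_fp: analyst time and customer friction per false alert;
    cost_fn: expected fines and remediation per missed suspicious case.
    """
    return n_transactions * (p_fp * cost_fp + p_fn * cost_fn)

# two hypothetical tunings of the same model: the stricter one raises the
# false-positive load on analysts and customers, the laxer one shifts the
# cost towards regulatory exposure for the institution
print(expected_cost(0.02, 0.0005, cost_fp=50, cost_fn=100_000, n_transactions=1_000_000))
print(expected_cost(0.001, 0.002, cost_fp=50, cost_fn=100_000, n_transactions=1_000_000))
```

which tuning is 'better' depends entirely on whose costs are counted, which is precisely why the consequences for the full range of stakeholders belong in any cost-benefit calculation for these systems.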
our study considered a broad range of contextual features described in the affordances literature, and how they impact on the realisation of the affordances of ai. in this way, we contributed to the body of empirical work on the realisation of affordances. in particular, we investigated the constraining aspect of affordances, which is an area that tends to be neglected in empirical research (volkoff & strong, ), yet is absolutely critical to understand why technology sometimes fails to meet expectations. we also contributed to the body of work on affordance realisation processes at the organisational level. affordances research often adopts a first-person perspective, focusing on the perceptions and actions of individual actors. yet, as shown by capra and luisi ( ) , organisations manifest properties different from those of the sum of their groups or individuals. however, we acknowledge that our understanding of the phenomenon would have been more thorough if we had considered both the organisational and the individual perspectives. for instance, we only considered resources at the level of the organisation, yet the characteristics of individual actors within the organisation can also have an impact on the realisation of affordances, namely that the willingness and ability to perceive or realise the affordance may be influenced by the individual's attitudes, skills and previous experiences (volkoff & strong, ) . for instance, it has been shown that key decision-makers' attitudes towards big data influence how, or indeed whether, the organisation takes advantage of this technological development (merendino et al., ) . finally, we responded to calls for investigating not only how ai is used, but also how algorithms are developed (e.g. constantiou & kallinikos, ) . we followed the approach recommended by o'neil and schutt ( ) , collecting data about the technology itself, as well as the process and the team in charge of its deployment and use. to be clear, due to the nature of the application (crime detection and prevention), and the conditions of access to the organisation, it was not possible to report, in this paper, on certain technical characteristics of the algorithms deployed by the financial services organisation, such as the proxy variables or the clustering techniques used. doing so would have undermined the organisation's efforts to detect criminal activity, and would go against the conditions of access granted for this research. the inability to report on these aspects limited the technical contribution from this research. nonetheless, this paper filled an important gap previously noticed in the literature on algorithmic decision-making (e.g. beer & burrows, ; newell & marabelli, ) . it was very difficult to secure access to the empirical setting, particularly given the sensitive nature of the application. it required lengthy negotiations, a very detailed plan for the safe collection and storage of the data collected and numerous checks and reassurances. hopefully, other researchers will be encouraged to use similar research strategies, and other organisations will facilitate access to their algorithms, because they appreciate the urgency of understanding the social, as well as the technical, dimensions of this phenomenon.
editorial -big data, data science, and analytics: the opportunity and challenge for is research detection and classification of social media-based extremist affiliations using sentiment analysis techniques computing the everyday: social media as data platforms hsbc brings in ai to help spot money laundering the future of finance terrorism detection based on sentiment analysis using machine learning assessing the introduction of anti-money laundering regulations on bank stock valuation: an empirical analysis the private security state? surveillance, consumer data and the war on terror artificial intelligence and financial services: regulatory tracking and change management popular culture, digital archives and the new social life of data. theory an introductory who's who for money laundering investigators understanding the actualization of affordances: a study in the process modeling context bounded semiotics: from utopian to evolutionary models of communication sorting things out: classification and its consequences what every ceo needs to know to succeed with ai business ethics should study illicit businesses: to advance respect for human rights big data, organisational learning, and sensemaking: theorizing interpretive challenges under conditions of dynamic complexity artificial intelligence and machine learning as business tools: factors influencing value creation and value destruction unpacking the interplay between organisational factors and the economic environment in the creation of consumer vulnerability the role of customer management capabilities in public-private partnerships the systems view of life: a unifying vision an artificial intelligence system to predict quality of service in banking organisations confronting the risks of artificial intelligence share of people using internet banking in great britain new games, new rules: big data and the changing context of strategy introduction to algorithms pandemic will force organised crime groups to find new business models. the strategist research design: qualitative, quantitative, and mixed method approaches can ai's 'black box' problem be solved? american banker isis and heritage destruction: a sentiment analysis cognition matters: enduring questions in cognitive is research more enduring questions in cognitive is research: a reply speculative security. the politics of pursuing terrorist monies language/action meets organisational semiotics: situating conversations with norms managing information in complex organisations: semiotics and signals algorithmic accountability reporting: on the investigation of black boxes analysing spatial spillovers in corruption: a dynamic spatial panel data approach rigor in is positivist case research: current practices, trends, and recommendations are users threatened by credibility assessment systems financial flows from human trafficking. financial action task force: history of the fatf. financial action task force: artificial intelligence in financial services the rise of social bots big data analytics and firm performance: effects of dynamic capabilities counterterrorism video analysis using hash-based algorithms away from trolley problems and toward risk management gibson's affordances new technologies and anti-money laundering compliance. london: financial conduct authority artificial intelligence in service technology is biased too. how do we fix it? fivethirtyeight. 
http s://fivethirtyeight.com/features/technology-is-biased-too-how-do-we-fix-it/amp technologies, texts and affordances mixing qualitative and quantitative methods: triangulation in action monitoring money-laundering risk with machine learning governance in the financial sector: the broader context of. money laundering and terrorist financing algorithms: the new means of production. d!gitalist artificial intelligence in advertising: how marketers can leverage artificial intelligence along the consumer journey risk analysis and risk management in an uncertain world when flexible routines meet flexible technologies: affordance, constraint, and the imbrication of human and material agencies when does technology use enable network change in organisations? a comparative study of feature use and shared affordances deal or no deal? training ai bots to negotiate. facebook code reflections on societal and business model transformation arising from digitization and big data analytics: a research agenda a new role for foundations in financing the global goals a foundation for the study of it effects: a new look at desanctis and poole's concepts of structural features and spirit big data, big decisions for government, business, and society. report on a research agenda setting workshop funded by the u.s. national science foundation money laundering regulation and bank compliance costs: what do your customers know? economics and the italian experience big data, big decisions: the impact of big data on board level decisionmaking qualitative data analysis the ethics of algorithms: mapping the debate playing atari with deep reinforcement learning strategic opportunities (and challenges) of algorithmic decision-making: a call for action on the long-term societal effects of 'datification' fraugster, a startup that uses ai to detect payment fraud, raises $ m doing data science: straight talk from the frontline artificial intelligence: building blocks and an innovation typology knowledge discovery from data ai runs smack up against a big data problem in covid- diagnosis artificial intelligence: a modern approach the influence of is affordances on work practices in health care: a relational coordination approach mastering the game of go without human knowledge the algorithm design manual the problem of induction on abduction, deduction, induction and the categories a theory of organisation-ehr affordance actualization ai cancer detectors. the guardian about the sustainable development goals crime prevention, criminal justice, the rule of law and the sustainable development goals virtual currencies under eu anti-money laundering law online human-bot interactions: detection, estimation, and characterization critical realism and affordances: theorizing itassociated organisational change processes anti-money laundering and counter-terrorist financing threats posed by mobile money data mining: desktop survival guide case study research: theory, methods, practice information technology and the changing fabric of organisation machine learning and sampling scheme: an empirical study of money laundering detection how ai is transforming the fight against money laundering big other: surveillance capitalism and the prospects of an information civilization ana isabel canhoto is a reader in marketing. her research focuses on the use of digital technology in interactions between firms and their customers. 
one stream of work looks at the use of digital technology on customer insight, such as digital footprints or social media profiling. the other focuses on the impact of technology on targeted interactions, such as the popularisation of algorithmic decision making in customer interactions, or the potential of wearables and beacons for personalisation. key: cord- -xet fcw authors: rieke, nicola; hancox, jonny; li, wenqi; milletarì, fausto; roth, holger r.; albarqouni, shadi; bakas, spyridon; galtier, mathieu n.; landman, bennett a.; maier-hein, klaus; ourselin, sébastien; sheller, micah; summers, ronald m.; trask, andrew; xu, daguang; baust, maximilian; cardoso, m. jorge title: the future of digital health with federated learning date: - - journal: npj digit med doi: . /s - - - sha: doc_id: cord_uid: xet fcw data-driven machine learning (ml) has emerged as a promising approach for building accurate and robust statistical models from medical data, which is collected in huge volumes by modern healthcare systems. existing medical data is not fully exploited by ml primarily because it sits in data silos and privacy concerns restrict access to this data. however, without access to sufficient data, ml will be prevented from reaching its full potential and, ultimately, from making the transition from research to clinical practice. this paper considers key factors contributing to this issue, explores how federated learning (fl) may provide a solution for the future of digital health and highlights the challenges and considerations that need to be addressed. research on artificial intelligence (ai), and particularly the advances in machine learning (ml) and deep learning (dl) have led to disruptive innovations in radiology, pathology, genomics and other fields. modern dl models feature millions of parameters that need to be learned from sufficiently large curated data sets in order to achieve clinical-grade accuracy, while being safe, fair, equitable and generalising well to unseen data [ ] [ ] [ ] [ ] . for example, training an ai-based tumour detector requires a large database encompassing the full spectrum of possible anatomies, pathologies, and input data types. data like this is hard to obtain, because health data is highly sensitive and its usage is tightly regulated . even if data anonymisation could bypass these limitations, it is now well understood that removing metadata such as patient name or date of birth is often not enough to preserve privacy . it is, for example, possible to reconstruct a patient's face from computed tomography (ct) or magnetic resonance imaging (mri) data . another reason why data sharing is not systematic in healthcare is that collecting, curating, and maintaining a high-quality data set takes considerable time, effort, and expense. consequently such data sets may have significant business value, making it less likely that they will be freely shared. instead, data collectors often retain fine-grained control over the data that they have gathered. federated learning (fl) - is a learning paradigm seeking to address the problem of data governance and privacy by training algorithms collaboratively without exchanging the data itself. originally developed for different domains, such as mobile and edge device use cases , it recently gained traction for healthcare applications [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . 
fl enables gaining insights collaboratively, e.g., in the form of a consensus model, without moving patient data beyond the firewalls of the institutions in which they reside. instead, the ml process occurs locally at each participating institution and only model characteristics (e.g., parameters, gradients) are transferred as depicted in fig. . recent research has shown that models trained by fl can achieve performance levels comparable to ones trained on centrally hosted data sets and superior to models that only see isolated single-institutional data , . a successful implementation of fl could thus hold a significant potential for enabling precision medicine at large-scale, leading to models that yield unbiased decisions, optimally reflect an individual's physiology, and are sensitive to rare diseases while respecting governance and privacy concerns. however, fl still requires rigorous technical consideration to ensure that the algorithm is proceeding optimally without compromising safety or patient privacy. nevertheless, it has the potential to overcome the limitations of approaches that require a single pool of centralised data. we envision a federated future for digital health and with this perspective paper, we share our consensus view with the aim of providing context and detail for the community regarding the benefits and impact of fl for medical applications (section "datadriven medicine requires federated efforts"), as well as highlighting key considerations and challenges of implementing fl for digital health (section "technical considerations"). ml and especially dl is becoming the de facto knowledge discovery approach in many industries, but successfully implementing data-driven applications requires large and diverse data sets. however, medical data sets are difficult to obtain (subsection "the reliance on data"). fl addresses this issue by enabling collaborative learning without centralising data (subsection "the promise of federated efforts") and has already found its way to digital health applications (subsection "current fl efforts for digital health"). this new learning paradigm requires consideration from, but also offers benefits to, various healthcare stakeholders (section "impact on stakeholders"). the reliance on data data-driven approaches rely on data that truly represent the underlying data distribution of the problem. while this is a wellknown requirement, state-of-the-art algorithms are usually evaluated on carefully curated data sets, often originating from only a few sources. this can introduce biases where demographics (e.g., gender, age) or technical imbalances (e.g., acquisition protocol, equipment manufacturer) skew predictions and adversely affect the accuracy for certain groups or sites. however, to capture subtle relationships between disease patterns, socioeconomic and genetic factors, as well as complex and rare cases, it is crucial to expose a model to diverse cases. the need for large databases for ai training has spawned many initiatives seeking to pool data from multiple institutions. this data is often amassed into so-called data lakes. these have been built with the aim of leveraging either the commercial value of data, e.g., ibm's merge healthcare acquisition , or as a resource for economic growth and scientific progress, e.g., nhs scotland's national safe haven , french health data hub , and health data research uk . 
substantial, albeit smaller, initiatives include the human connectome , the uk biobank , the cancer imaging archive (tcia) , nih cxr , nih deeplesion , the cancer genome atlas (tcga) , the alzheimer's disease neuroimaging initiative (adni) , as well as medical grand challenges such as the camelyon challenge , the international multimodal brain tumor segmentation (brats) challenge [ ] [ ] [ ] or the medical segmentation decathlon . public medical data is usually task- or disease-specific and often released with varying degrees of license restrictions, sometimes limiting its exploitation. centralising or releasing data, however, poses not only regulatory, ethical and legal challenges, related to privacy and data protection, but also technical ones. anonymising, controlling access and safely transferring healthcare data is a non-trivial, and sometimes impossible, task. anonymised data from the electronic health record can appear innocuous and gdpr/phi compliant, but just a few data elements may allow for patient reidentification . the same applies to genomic data and medical images, making them as unique as a fingerprint . therefore, unless the anonymisation process destroys the fidelity of the data, likely rendering it useless, patient reidentification or information leakage cannot be ruled out. gated access for approved users is often proposed as a putative solution to this issue. however, besides limiting data availability, this is only practical for cases in which the consent granted by the data owners is unconditional, since recalling data from those who may have had access to the data is practically unenforceable. the promise of federated efforts the promise of fl is simple: to address privacy and data governance challenges by enabling ml from non-co-located data. in a fl setting, each data controller not only defines its own governance processes and associated privacy policies, but also controls data access and has the ability to revoke it. this includes both the training, as well as the validation phase. in this way, fl could create new opportunities, e.g., by allowing large-scale, in-institutional validation, or by enabling novel research on rare diseases, where incidence rates are low and data sets at each single institution are too small. moving the model to the data and not vice versa has another major advantage: high-dimensional, storage-intense medical data does not have to be duplicated from local institutions in a centralised pool and duplicated again by every user that uses this data for local model training. as the model is transferred to the local institutions, it can scale naturally with a potentially growing global data set without disproportionately increasing data storage requirements. as depicted in fig. , a fl workflow can be realised with different topologies and compute plans. the two most common ones for healthcare applications are via an aggregation server [ ] [ ] [ ] and peer to peer approaches , . in all cases, fl implicitly offers a certain degree of privacy, as fl participants never directly access data from other institutions and only receive model parameters that are aggregated over several participants. in a fl workflow with aggregation server, the participating institutions can even remain unknown to each other. however, it has been shown that the models themselves can, under certain conditions, memorise information [ ] [ ] [ ] [ ] . therefore, mechanisms such as differential privacy , or learning from encrypted data have been proposed to further enhance privacy in a fl setting (c.f. section "technical considerations"). overall, the potential of fl for healthcare applications has sparked interest in the community and fl techniques are a growing area of research , .
(figure caption: example federated learning (fl) workflows and difference to learning on a centralised data lake. a, fl aggregation server: a federation of training nodes receive the global model, resubmit their partially trained models to a central server intermittently for aggregation and then continue training on the consensus model that the server returns. b, fl peer to peer: each training node exchanges its partially trained models with some or all of its peers and each does its own aggregation. c, centralised training: the general non-fl workflow in which data-acquiring sites donate their data to a central data lake from which they and others are able to extract data for local, independent training.)
current fl efforts for digital health since fl is a general learning paradigm that removes the data pooling requirement for ai model development, the application range of fl spans the whole of ai for healthcare. by providing an opportunity to capture larger data variability and to analyse patients across different demographics, fl may enable disruptive innovations for the future but is also being employed right now. in the context of electronic health records (ehr), for example, fl helps to represent and to find clinically similar patients , , as well as predicting hospitalisations due to cardiac events , mortality and icu stay time . the applicability and advantages of fl have also been demonstrated in the field of medical imaging, for whole-brain segmentation in mri , as well as brain tumour segmentation , . recently, the technique has been employed for fmri classification to find reliable disease-related biomarkers and suggested as a promising approach in the context of covid- . it is worth noting that fl efforts require agreements to define the scope, aim and technologies used which, since it is still novel, can be difficult to pin down. in this context, today's large-scale initiatives really are the pioneers of tomorrow's standards for safe, fair and innovative collaboration in healthcare applications. these include consortia that aim to advance academic research, such as the trustworthy federated data analytics (tfda) project and the german cancer consortium's joint imaging platform , which enable decentralised research across german medical imaging research institutions. another example is an international research collaboration that uses fl for the development of ai models for the assessment of mammograms . the study showed that the fl-generated models outperformed those trained on a single institute's data and were more generalisable, so that they still performed well on other institutes' data. however, fl is not limited just to academic environments. by linking healthcare institutions, not restricted to research centres, fl can have direct clinical impact. the on-going healthchain project , for example, aims to develop and deploy a fl framework across four hospitals in france. this solution generates common models that can predict treatment response for breast cancer and melanoma patients.
it helps oncologists to determine the most effective treatment for each patient from their histology slides or dermoscopy images. another large-scale effort is the federated tumour segmentation (fets) initiative , which is an international federation of committed healthcare institutions using an open-source fl framework with a graphical user interface. the aim is to improve tumour boundary detection, including brain glioma, breast tumours, liver tumours and bone lesions from multiple myeloma patients. another area of impact is within industrial research and translation. fl enables collaborative research for, even competing, companies. in this context, one of the largest initiatives is the melloddy project . it is a project aiming to deploy multi-task fl across the data sets of pharmaceutical companies. by training a common predictive model, which infers how chemical compounds bind to proteins, partners intend to optimise the drug discovery process without revealing their highly valuable in-house data. impact on stakeholders fl comprises a paradigm shift from centralised data lakes and it is important to understand its impact on the various stakeholders in a fl ecosystem. clinicians. clinicians are usually exposed to a sub-group of the population based on their location and demographic environment, which may cause biased assumptions about the probability of certain diseases or their interconnection. by using ml-based systems, e.g., as a second reader, they can augment their own expertise with expert knowledge from other institutions, ensuring a consistency of diagnosis not attainable today. while this applies to ml-based systems in general, systems trained in a federated fashion are potentially able to yield even less biased decisions and higher sensitivity to rare cases as they were likely exposed to a more complete data distribution. however, this demands some up-front effort such as compliance with agreements, e.g., regarding the data structure, annotation and report protocol, which is necessary to ensure that the information is presented to collaborators in a commonly understood format.
(figure caption, fl topologies and compute plans: b, decentralised: each training node is connected to one or more peers and aggregation occurs on each node in parallel. c, hierarchical: federated networks can be composed from several sub-federations, which can be built from a mix of peer to peer and aggregation server federations (d). fl compute plans define the trajectory of a model across several partners: e, sequential training/cyclic transfer learning; f, aggregation server; g, peer to peer.)
patients. patients are usually treated locally. establishing fl on a global scale could ensure high quality of clinical decisions regardless of the treatment location. in particular, patients requiring medical attention in remote areas could benefit from the same high-quality ml-aided diagnoses that are available in hospitals with a large number of cases. the same holds true for rare, or geographically uncommon, diseases, which are likely to have milder consequences if faster and more accurate diagnoses can be made. fl may also lower the hurdle for becoming a data donor, since patients can be reassured that the data remains with their own institution and data access can be revoked. hospitals and practices. hospitals and practices can remain in full control and possession of their patient data with complete traceability of data access, limiting the risk of misuse by third parties.
however, this will require investment in on-premise computing infrastructure or private-cloud service provision and adherence to standardised and synoptic data formats so that ml models can be trained and evaluated seamlessly. the amount of necessary compute capability depends of course on whether a site is only participating in evaluation and testing efforts or also in training efforts. even relatively small institutions can participate and they will still benefit from the collective models generated. researchers and ai developers. researchers and ai developers stand to benefit from access to a potentially vast collection of real-world data, which will particularly impact smaller research labs and start-ups. thus, resources can be directed towards solving clinical needs and associated technical problems rather than relying on the limited supply of open data sets. at the same time, it will be necessary to conduct research on algorithmic strategies for federated training, e.g., how to combine models or updates efficiently, how to be robust to distribution shifts , , . fl-based development implies also that the researcher or ai developer cannot investigate or visualise all of the data on which the model is trained, e.g., it is not possible to look at an individual failure case to understand why the current model performs poorly on it. healthcare providers. healthcare providers in many countries are affected by the on-going paradigm shift from volume-based, i.e., fee-for-service-based, to value-based healthcare, which is in turn strongly connected to the successful establishment of precision medicine. this is not about promoting more expensive individualised therapies but instead about achieving better outcomes sooner through more focused treatment, thereby reducing the cost. fl has the potential to increase the accuracy and robustness of healthcare ai, while reducing costs and improving patient outcomes, and may therefore be vital to precision medicine. manufacturers. manufacturers of healthcare software and hardware could benefit from fl as well, since combining the learning from many devices and applications, without revealing patient-specific information, can facilitate the continuous validation or improvement of their ml-based systems. however, realising such a capability may require significant upgrades to local compute, data storage, networking capabilities and associated software. fl is perhaps best known from the work of konečný et al. , but various other definitions have been proposed in the literature , , , . a fl workflow (fig. ) can be realised via different topologies and compute plans (fig. ) , but the goal remains the same, i.e., to combine knowledge learned from non-co-located data. in this section, we will discuss in more detail what fl is, as well as highlighting the key challenges and technical considerations that arise when applying fl in digital health. federated learning definition fl is a learning paradigm in which multiple parties train collaboratively without the need to exchange or centralise data sets. a general formulation of fl reads as follows: let $\mathcal{L}$ denote a global loss function obtained via a weighted combination of $K$ local losses $\{\mathcal{L}_k\}_{k=1}^{K}$, computed from private data $X_k$, which resides with the individual participating parties and is never shared among them:

$$\min_{\phi} \; \mathcal{L}(X;\phi) \quad \text{with} \quad \mathcal{L}(X;\phi) \;=\; \sum_{k=1}^{K} w_k\,\mathcal{L}_k(X_k;\phi),$$

where $w_k > 0$ denote the respective weight coefficients.
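the weighted aggregation above can be made concrete with a minimal sketch of one communication round in the spirit of federated averaging. this is an illustration rather than the implementation referenced in the text: the linear model, the learning rate, the choice of w_k proportional to local sample counts and all function names are assumptions made for the example.

```python
import numpy as np

def local_update(w_global, X, y, epochs=5, lr=0.01):
    """client-side step: refine the global parameters on private data.

    here the 'model' is ordinary least squares fitted by gradient descent,
    standing in for an arbitrary learner; only the updated parameter vector
    is returned, never X or y.
    """
    w = w_global.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_averaging_round(w_global, clients):
    """server-side step: combine client updates with weights w_k
    proportional to the local sample counts."""
    updates, counts = [], []
    for X, y in clients:
        updates.append(local_update(w_global, X, y))
        counts.append(len(y))
    coeffs = np.array(counts, dtype=float)
    coeffs /= coeffs.sum()                      # w_k > 0, summing to 1
    return sum(c * u for c, u in zip(coeffs, updates))

# toy federation of three 'institutions' with differently sized data sets
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 200, 500):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n)))

w = np.zeros(2)
for _ in range(100):                            # communication rounds
    w = federated_averaging_round(w, clients)
print(w)  # approaches true_w without any client sharing its raw data
```

in a real deployment the local step would be several epochs of training of a full clinical model, and only the parameter vectors, never the underlying patient data, would cross an institution's firewall.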
in practice, each participant typically obtains and refines a global consensus model by conducting a few rounds of optimisation locally before sharing updates, either directly or via a parameter server. the more rounds of local training are performed, the less it is guaranteed that the overall procedure is minimising (eq. ) , . the actual process for aggregating parameters depends on the network topology, as nodes might be segregated into subnetworks due to geographical or legal constraints (see fig. ). aggregation strategies can rely on a single aggregating node (hub and spokes models), or on multiple nodes without any centralisation. an example is peer-to-peer fl, where connections exist between all or a subset of the participants and model updates are shared only between directly connected sites , , whereas an example of centralised fl aggregation is given in algorithm . note that aggregation strategies do not necessarily require information about the full model update; clients might choose to share only a subset of the model parameters for the sake of reducing communication overhead, ensuring better privacy preservation or producing multi-task learning algorithms that have only part of their parameters learned in a federated manner. a unifying framework enabling various training schemes may disentangle compute resources (data and servers) from the compute plan, as depicted in fig. . the latter defines the trajectory of a model across several partners, to be trained and evaluated on specific data sets. these issues have to be solved for both federated and non-federated learning efforts via appropriate measures, such as careful study design, common protocols for data acquisition, structured reporting and sophisticated methodologies for discovering bias and hidden stratification. in the following, we touch upon the key aspects of fl that are of particular relevance when applied to digital health and need to be taken into account when establishing fl. for technical details and in-depth discussion, we refer the reader to recent surveys , , . data heterogeneity. medical data is particularly diverse, not only because of the variety of modalities, dimensionality and characteristics in general, but even within a specific protocol due to factors such as acquisition differences, brand of the medical device or local demographics. fl may help address certain sources of bias through potentially increased diversity of data sources, but inhomogeneous data distribution poses a challenge for fl algorithms and strategies, as many assume independently and identically distributed (iid) data across the participants. in general, strategies such as fedavg are prone to fail under these conditions , , , in part defeating the very purpose of collaborative learning strategies. recent results, however, indicate that fl training is still feasible , even if medical data is not uniformly distributed across the institutions , or includes a local bias . research addressing this problem includes, for example, fedprox , a part-data-sharing strategy and fl with domain adaptation . another challenge is that data heterogeneity may lead to a situation in which the global optimal solution may not be optimal for an individual local participant. the definition of model training optimality should, therefore, be agreed by all participants before training.
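to illustrate how one of the strategies named above copes with non-iid data, fedprox is commonly described as adding a proximal term to each client's local objective so that local refinements cannot drift too far from the current global model $\phi^{t}$; the form below is the standard textbook version, given here for orientation rather than taken from the surveyed studies, with $\mu \geq 0$ a tunable penalty coefficient.

```latex
% FedProx-style local objective at client k in round t (illustrative form)
\min_{\phi} \; \mathcal{L}_k(X_k;\phi) \;+\; \frac{\mu}{2}\,\bigl\lVert \phi - \phi^{t} \bigr\rVert^{2}
```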
privacy and security. healthcare data is highly sensitive and must be protected accordingly, following appropriate confidentiality procedures. therefore, some of the key considerations are the trade-offs, strategies and remaining risks regarding the privacy-preserving potential of fl. privacy vs. performance: it is important to note that fl does not solve all potential privacy issues and, similar to ml algorithms in general, will always carry some risks. privacy-preserving techniques for fl offer levels of protection that exceed today's commercially available ml models . however, there is a trade-off in terms of performance and these techniques may affect, for example, the accuracy of the final model . furthermore, future techniques and/or ancillary data could be used to compromise a model previously considered to be low-risk. level of trust: broadly speaking, participating parties can enter two types of fl collaboration. trusted: for fl consortia in which all parties are considered trustworthy and are bound by an enforceable collaboration agreement, we can eliminate many of the more nefarious motivations, such as deliberate attempts to extract sensitive information or to intentionally corrupt the model. this reduces the need for sophisticated counter-measures, falling back to the principles of standard collaborative research. non-trusted: in fl systems that operate on larger scales, it might be impractical to establish an enforceable collaborative agreement. some clients may deliberately try to degrade performance, bring the system down or extract information from other parties. hence, security strategies will be required to mitigate these risks, such as advanced encryption of model submissions, secure authentication of all parties, traceability of actions, differential privacy, verification systems, execution integrity, model confidentiality and protections against adversarial attacks. information leakage: by definition, fl systems avoid sharing healthcare data among participating institutions. however, the shared information may still indirectly expose private data used for local training, e.g., by model inversion of the model updates, the gradients themselves or adversarial attacks , . fl is different from traditional training insofar as the training process is exposed to multiple parties, thereby increasing the risk of leakage via reverse engineering if adversaries can observe model changes over time, observe specific model updates (i.e., a single institution's update), or manipulate the model (e.g., induce additional memorisation by others through gradient-ascent-style attacks). developing countermeasures, such as limiting the granularity of the updates and adding noise , and ensuring adequate differential privacy , may be needed and is still an active area of research .
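the countermeasures just mentioned, limiting update granularity and adding noise, can be sketched in a few lines. this is a simplified illustration of clipping a client's model update and adding gaussian noise before sharing, with invented parameter values; a full differentially private training procedure would, in addition, calibrate the noise to a formal privacy budget tracked across communication rounds.

```python
import numpy as np

def privatise_update(delta, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """clip a client's model update and add gaussian noise before sharing.

    `delta` is the difference between the locally refined parameters and the
    current global model. bounding its norm limits how much any single record
    can influence what leaves the institution; the added noise masks the rest.
    both parameter defaults are illustrative only.
    """
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(delta)
    clipped = delta * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=delta.shape)
    return clipped + noise

# a client would share privatise_update(w_local - w_global) instead of the
# raw update, accepting some loss of accuracy in exchange for reduced leakage
```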
traceability and accountability. as per all safety-critical applications, the reproducibility of a system is important for fl in healthcare. in contrast to centralised training, fl requires multiparty computations in environments that exhibit considerable variety in terms of hardware, software and networks. traceability of all system assets, including data access history, training configurations and hyperparameter tuning throughout the training processes, is thus mandatory. in particular in non-trusted federations, traceability and accountability processes require execution integrity. after the training process reaches the mutually agreed model optimality criteria, it may also be helpful to measure the amount of contribution from each participant, such as computational resources consumed, quality of the data used for local training, etc. these measurements could then be used to determine relevant compensation, and establish a revenue model among the participants . one implication of fl is that researchers are not able to investigate data upon which models are being trained to make sense of unexpected results. moreover, taking statistical measurements of their training data as part of the model development workflow will need to be approved by the collaborating parties as not violating privacy. although each site will have access to its own raw data, federations may decide to provide some sort of secure intra-node viewing facility to cater for this need or may provide some other way to increase explainability and interpretability of the global model. system architecture. unlike running large-scale fl amongst consumer devices, such as in mcmahan et al. , healthcare institutional participants are equipped with relatively powerful computational resources and reliable, higher-throughput networks enabling training of larger models with many more local training steps, and sharing more model information between nodes. these unique characteristics of fl in healthcare also bring challenges such as ensuring data integrity when communicating by use of redundant nodes, designing secure encryption methods to prevent data leakage, or designing appropriate node schedulers to make best use of the distributed computational devices and reduce idle time. the administration of such a federation can be realised in different ways. in situations requiring the most stringent data privacy between parties, training may operate via some sort of "honest broker" system, in which a trusted third party acts as the intermediary and facilitates access to data. this setup requires an independent entity controlling the overall system, which may not always be desirable, since it could involve additional cost and procedural viscosity. however, it has the advantage that the precise internal mechanisms can be abstracted away from the clients, making the system more agile and simpler to update. in a peer-to-peer system each site interacts directly with some or all of the other participants. in other words, there is no gatekeeper function, all protocols must be agreed up-front, which requires significant agreement efforts, and changes must be made in a synchronised fashion by all parties to avoid problems. additionally, in a trustless-based architecture the platform operator may be cryptographically locked into being honest by means of a secure protocol, but this may introduce significant computational overheads. conclusion ml, and particularly dl, has led to a wide range of innovations in the area of digital healthcare. as all ml methods benefit greatly from the ability to access data that approximates the true global distribution, fl is a promising approach to obtain powerful, accurate, safe, robust and unbiased models. by enabling multiple parties to train collaboratively without the need to exchange or centralise data sets, fl neatly addresses issues related to egress of sensitive medical data. as a consequence, it may open novel research and business avenues and has the potential to improve patient care globally.
however, already today, fl has an impact on nearly all stakeholders and the entire treatment cycle, ranging from improved medical image analysis providing clinicians with better diagnostic tools, over true precision medicine by helping to find similar patients, to collaborative and accelerated drug discovery decreasing cost and time-to-market for pharma companies. not all technical questions have been answered yet and fl will certainly be an active research area throughout the next decade . despite this, we truly believe that its potential impact on precision medicine and ultimately improving medical care is very promising. reporting summary further information on research design is available in the nature research reporting summary linked to this article. received: march ; accepted: august ; deep learning deep learning in medicine-promise, progress, and challenges deep learning: a primer for radiologists clinically applicable deep learning for diagnosis and referral in retinal disease revisiting unreasonable effectiveness of data in deep learning era a systematic review of barriers to data sharing in public health estimating the success of reidentifications in incomplete datasets using generative models identification of anonymous mri research participants with face-recognition software communicationefficient learning of deep networks from decentralized data federated learning: challenges, methods, and future directions federated machine learning: concept and applications advances and open problems in federated learning privacy-preserving patient similarity learning in a federated environment: development and analysis federated learning of predictive models from federated electronic health records braintorrent: a peerto-peer environment for decentralized federated learning privacy-preserving federated brain tumour segmentation multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation multi-site fmri analysis using privacy-preserving federated learning and domain adaptation: abide results patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records federated learning for healthcare informatics ibm's merge healthcare acquisition nhs scotland's national safe haven the french health data hub and the german medical informatics initiatives: two national projects to promote data sharing in healthcare the human connectome: a structural description of the human brain uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age the cancer imaging archive (tcia): maintaining and operating a public information repository chestx-ray : hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases deeplesion: automated mining of largescale lesion annotations and universal lesion detection with deep learning the cancer genome atlas (tcga): an immeasurable source of knowledge the alzheimer's disease neuroimaging initiative (adni): mri methods grand challenge-a platform for end-to-end development of machine learning solutions in biomedical imaging h&e-stained sentinel lymph node sections of breast cancer patients: the camelyon dataset the multimodal brain tumor image segmentation benchmark (brats) identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival 
prediction in the brats challenge advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features a large annotated medical image dataset for the development and evaluation of segmentation algorithms quantifying differences and similarities in whole-brain white matter architecture using local connectome fingerprints distributed deep learning networks among institutions for medical imaging membership inference attacks against machine learning models white-box vs blackbox: bayes optimal strategies for membership inference understanding deep learning requires rethinking generalization the secret sharer: evaluating and testing unintended memorization in neural networks deep learning with differential privacy privacy-preserving deep learning a roadmap for foundational research on artificial intelligence in medical imaging: from the nih/rsna/acr/the academy workshop federated tensor factorization for computational phenotyping federated deep learning via neural architecture search trustworthy federated data analytics (tfda medical institutions collaborate to improve mammogram assessment ai the federated tumor segmentation (fets) initiative machine learning ledger orchestration for drug discovery federated optimization: distributed machine learning for on-device intelligence peer-to-peer federated learning on graphs federated optimization in heterogeneous networks federated learning with non-iid data on the convergence of fedavg on non-iid data p sgd: patient privacy preserving sgd for regularizing deep cnns in pathological image classification deep leakage from gradients beyond inferring class representatives: user-level privacy leakage from federated learning deep models under the gan: information leakage from collaborative deep learning data shapley: equitable valuation of data for machine learning analytics") and the prime programme of the german academic exchange service (daad) with funds from the german federal ministry of education and research (bmbf). the content and opinions expressed in this publication is solely the responsibility of the authors and do not necessarily represent those of the institutions they are affiliated with, e.g., the u.s. department of health and human services or the national institutes of health. open access funding provided by projekt deal. supplementary information is available for this paper at https://doi.org/ . / s - - - .correspondence and requests for materials should be addressed to n.r. publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.open access this article is licensed under a creative commons attribution . international license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material. if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons. org/licenses/by/ . /. 
key: cord- -irjo l s authors: krittanawong, chayakrit; rogers, albert j.; johnson, kipp w.; wang, zhen; turakhia, mintu p.; halperin, jonathan l.; narayan, sanjiv m. title: integration of novel monitoring devices with machine learning technology for scalable cardiovascular management date: - - journal: nat rev cardiol doi: . /s - - - sha: doc_id: cord_uid: irjo l s ambulatory monitoring is increasingly important for cardiovascular care but is often limited by the unpredictability of cardiovascular events, the intermittent nature of ambulatory monitors and the variable clinical significance of recorded data in patients. technological advances in computing have led to the introduction of novel physiological biosignals that can increase the frequency at which abnormalities in cardiovascular parameters can be detected, making expert-level, automated diagnosis a reality. however, use of these biosignals for diagnosis also raises numerous concerns related to accuracy and actionability within clinical guidelines, in addition to medico-legal and ethical issues. analytical methods such as machine learning can potentially increase the accuracy and improve the actionability of device-based diagnoses. coupled with interoperability of data to widen access to all stakeholders, seamless connectivity (an internet of things) and maintenance of anonymity, this approach could ultimately facilitate near-real-time diagnosis and therapy. these tools are increasingly recognized by regulatory agencies and professional medical societies, but several technical and ethical issues remain. in this review, we describe the current state of cardiovascular monitoring along the continuum from biosignal acquisition to the identification of novel biosensors and the development of analytical techniques and ultimately to regulatory and ethical issues. furthermore, we outline new paradigms for cardiovascular monitoring. patients with cardiovascular conditions can have variable clinical presentations ranging from no symptoms to haemodynamic collapse, from hypertensive urgency to hypotension and from silent coronary ischaemia to acute coronary syndrome, as well as decompensated heart failure (hf), stroke or sudden death. this diversity in clinical presentation of cardiovascular disorders poses a major challenge for disease monitoring. although clinicians use a variety of implanted, ambulatory and consumer wearable technologies for disease monitoring, the devices that are best suited to individual patients are difficult to establish. indeed, optimal monitoring strategies have yet to be developed for some applications. hf can worsen progressively over days or weeks, but current telemedicine systems might not be sufficient to detect acute exacerbations in hf or to prevent rehospitalization , . conversely, arrhythmias can often occur suddenly or intermittently and might require immediate intervention , . ambulatory rhythm-monitoring devices that allow only sporadic interpretation of data might be appropriate for benign events but not for life-threatening arrhythmias. this misalignment between clinical need and current monitoring technologies is also illustrated by the lack of robust strategies for the detection of impending coronary syndromes, hypertensive emergencies, hypotensive events or stroke in high-risk patients with atrial fibrillation (af). 
advances in cardiovascular monitoring technologies, such as the use of ubiquitous mobile devices and the development of novel portable sensors with seamless wireless connectivity and machine learning algorithms that can provide specialist-level diagnosis in near real time, have the potential to deliver more personalized care. devices have been developed to assess haemodynamics, which can detect potential signs of worsening hf . furthermore, continuous electrocardiogram (ecg) recordings have been used to redefine phenotypes for af and ventricular arrhythmias , and can predict success of antiarrhythmic therapy . wearable activity trackers and smartwatches can measure physiological indices such as heart rate, breathing patterns and cardiometabolic activity , and can even detect af . furthermore, smartphone applications have been successful in shortening the time to first response for sudden cardiac arrest . this confluence of novel technologies has also attracted much public interest and promises to expand the applications of cardiovascular monitoring. in this review, we describe the latest advances in cardiovascular monitoring technology, focusing first on biosignal acquisition and analytical techniques that enable accurate diagnosis, triage and management (fig. ). we discuss monitoring in the context of likely future directions in cardiovascular care and identify numerous technical and clinical obstacles, issues regarding data security and privacy, and ethical dilemmas and regulatory challenges that must be overcome before integrated and scalable cardiovascular monitoring tools can be developed. biosignals, physiological signals that can be continuously measured and monitored to provide information on electrical, chemical and mechanical activity, are the foundations of assessment of health and disease, and have been used to develop personalized physiological 'portraits' of individuals. numerous current and emerging wearable technologies can measure multiple physiological biosignals such as pulse, cardiac output, blood-pressure levels, heart rhythm, respiratory rate, electrolyte levels, sympathetic nerve activity, galvanic skin resistance and thoracic and lower-extremity oedema (fig. ) . some devices can acquire multiple biosignals simultaneously, which can provide inputs to powerful integrated monitors and diagnostic systems. in developing scalable monitoring technology, the short-term goal is to implement guideline-driven care, whereas a longer-term goal is to expand the scope of care by tracking physiological variables continuously in each individual. table summarizes the use of wearable sensor technologies to detect biosignals. some sensor technologies can now integrate multiple modalities, such as chest patches that monitor heart rate, heart rhythm, respiration rate and skin temperature , . sensors are being developed to measure myocardial contractility and cardiac output (ballistocardiography), cardiac acoustic data (phonocardiography) and other indices . we describe various biosensors in the following sections, with reference to their target biosignals and potential clinical applications. to date, more than three million people living in the usa have cardiac implantable electronic devices (cieds) such as pacemakers, defibrillators or left ventricular assist devices . many more patients have other non-cieds such as cochlear implants and nerve stimulators.
cieds are the gold standard for cardiac rhythm detection, providing sensitive and specific measurements with little noise continuously over long time frames of several years. cieds are also highly effective prototypes for real-time automatic diagnosis and therapy. indications for cied use include pacing for bradyarrhythmias, and tachypacing and defibrillation for tachyarrhythmias. additionally, most cieds also record intracardiac electrograms as a surrogate for ecgs. cieds that are prescribed for one indication might provide monitoring that confers clinical benefits for a separate indication, such as the monitoring of atrial arrhythmias by atrial leads in pacemakers or defibrillators, or the monitoring of atrial arrhythmias by far-field atrial electrograms from ventricular leads in some pacemakers or implantable cardioverter-defibrillators (icds) . cieds are well suited to monitor symptoms of hf. in patients with an icd or a pacemaker, cieds can provide indices of heart rate variability and pulmonary impedance, which can track hf and provide an alert for possible decompensation . diminished heart rate variability (< ms) has been shown to indicate increased sympathetic and decreased vagal modulation, and is associated with increased risk of death, worsening hf and malignant ventricular arrhythmias . a decline in electrical impedance of the thorax is indicative of pulmonary congestion . another promising biosignal for the detection of hf is pulmonary artery pressure. compass-hf was the first randomized trial to investigate the efficacy of intracardiac pressure monitoring for hf management with the use of a right ventricular sensor (chronicle, medtronic) that measures estimated pulmonary artery diastolic pressure as a surrogate for pulmonary artery pressure. notably, continuous haemodynamic monitoring did not significantly reduce the incidence of hf-related events compared with optimal medical management. the subsequent champion study showed that monitoring pulmonary artery pressure using the cardiomems system (abbott) significantly lowered the rate of repeated hf hospitalization and was associated with reduced costs compared with standard care. a meta-analysis involving mostly patients with hf with reduced ejection fraction found that pressure monitoring, but not impedance monitoring, was associated with a lower rate of hospital admission for hf . other forms of hf monitors in development integrate pulmonary artery pressure monitoring with vital sign monitoring (cordella heart failure system, endotronix), left atrial pressure monitoring and various wearable devices . additional cied-based biosensors for cardiovascular monitoring are likely to emerge in the next - years. an implanted device that provides neurostimulation of the phrenic nerve has been shown to be effective in reducing episodes of central sleep apnoea . such novel cieds could, in principle, detect physiological markers that correlate with symptoms of af or hf that frequently accompany sleep apnoea. numerous leadless, extravascular devices currently under investigation can defibrillate or pace the heart . future innovations might eliminate the need to extract the device for battery replacement by using external recharging systems or designs that can transduce energy from respiratory or cardiac motion .
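the cied-derived indices described above (diminished heart rate variability and a falling thoracic impedance) lend themselves to simple day-to-day trend checks. the following is a minimal sketch of such checks, assuming rr intervals and daily impedance summaries are already available from the device; the sdnn and slope thresholds are illustrative placeholders, not any manufacturer's or guideline's criteria.

```python
"""Illustrative trend flags from CIED-style daily summaries (not a vendor algorithm)."""
from statistics import mean, stdev
from typing import Sequence


def sdnn(rr_intervals_ms: Sequence[float]) -> float:
    """Standard deviation of normal-to-normal (RR) intervals, a basic HRV index."""
    return stdev(rr_intervals_ms)


def impedance_slope(daily_impedance_ohm: Sequence[float]) -> float:
    """Least-squares slope of daily thoracic impedance (ohms per day)."""
    xs = range(len(daily_impedance_ohm))
    x_bar, y_bar = mean(xs), mean(daily_impedance_ohm)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, daily_impedance_ohm))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def decompensation_flags(rr_intervals_ms, daily_impedance_ohm,
                         sdnn_threshold_ms=100.0, slope_threshold=-0.5):
    """Return illustrative alert flags; thresholds are placeholders, not clinical cut-offs."""
    return {
        "low_hrv": sdnn(rr_intervals_ms) < sdnn_threshold_ms,
        "falling_impedance": impedance_slope(daily_impedance_ohm) < slope_threshold,
    }


if __name__ == "__main__":
    rr = [810, 795, 820, 805, 798, 812, 801, 790, 815, 800]  # ms, synthetic
    z = [72, 71, 70, 68, 67, 65, 64]                         # ohms per day, synthetic decline
    print(decompensation_flags(rr, z))
```

in practice, such raw flags would only be one input into the alerting logic discussed later in this review, with clinician confirmation before any action.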
the body surface ecg is a widely used biosignal in medically prescribed monitors and consumer devices (fig. ). ambulatory ecg monitors typically consist of three or more chest electrodes connected to an external recorder or a fully contained patch monitor, and can record continuously for - days. some devices have fewer leads, such as the spider flash (datacard group), which consists of two leads and can record for up to min before and min after detecting an event, and the cardiostat (icentia), a single-lead ecg monitor that can provide continuous recordings. data from such ecg monitors are uploaded to a central server either wirelessly or by direct device 'interrogation' , interpreted using semiautomated algorithms and manually confirmed to generate reports and alerts. some devices can provide near-real-time management options. the mobile cardiac outpatient telemetry (mcot) system is an ambulatory ecg monitoring system that can transmit signals over a cellular network without activation by the patient and might increase diagnostic yield compared with other systems , . a major application for ecg sensors is to optimize the detection of af . af is, in many ways, an ideal target for biosensors. numerous ecg sensors focus on detecting rapid and irregularly irregular qrs complexes in af, but other metrics of rapid and irregular atrial rate and irregular beat-to-beat waveforms might increase diagnostic specificity . af can also cause beat-to-beat changes in perfusion and haemodynamics that might allow detection from non-electrical biosignals. another major indication for ecg monitors is the detection of st-segment shifts indicative of coronary ischaemia, which requires relatively noise-free ecgs and sophisticated detection algorithms. machine learning technologies have been incorporated into wearable devices for the detection of st-segment elevation with an accuracy of up to . % (ref. ). in principle, coronary ischaemia monitoring could also use optical, electrochemical, mechanical or microrna-based biosensors, but these applications have not yet been widely adopted. limitations of ecg-based ambulatory monitoring include noise (particularly during physical activity), the typically limited monitoring duration of - weeks (which might be insufficient to detect infrequent events) and delays in generating reports and instigating appropriate actions . insertable or implantable loop recorders are minimally invasive devices that can provide long-term ecg monitoring for months or years and include the reveal linq system (medtronic), the confirm rx insertable cardiac monitor (abbott) and the biomonitor (biotronik). the devices are inserted subcutaneously over the sternum or under the clavicle to mimic v leads and to optimize ecg recordings. data are uploaded during device checks on a - -monthly basis. the advantages of implantable loop recorders include the capacity for long-term monitoring and consistent ecg wave morphologies owing to a fixed spatial orientation. paradoxically, such devices are suboptimal for the diagnosis of arrhythmias of short durations (tens of seconds to minutes) and for classifying the type of atrial arrhythmia .
fig. | emerging paradigms for ambulatory monitoring. numerous innovations in biosignal acquisition, diagnosis and medical triage, and data access enable the curation of data as a dynamic resource that can ultimately be used to alter management guidelines and provide novel pathophysiological insights into cardiovascular diseases. however, the acquisition, processing and use of these innovative technologies are associated with various challenges. ebm, evidence-based medicine.
these limitations might be overcome with improvements in signal processing algorithms . furthermore, most implantable loop recorders cannot establish the haemodynamic significance of detected arrhythmias, although the reveal linq system does include an accelerometer that measures patient activity. a modified reveal linq device was used to capture ecg data, temperature, heart rate and other parameters in american black bears and detected low activity and extreme bradycardia during hibernation . lastly, delays in the reporting of urgent events measured by implanted devices might be worsened by longer recording durations between device checks, although some platforms (reveal linq and confirm rx) allow home monitoring with programmable alerts. finally, numerous wearable ecg devices are available to the public. the apple watch (apple) and kardiamobile (alivecor) are approved by the fda for rhythm monitoring and have clinical-level accuracy for the detection of arrhythmias such as af . none of these devices provides continuous monitoring, although daily and nightly use for months might ultimately provide near-continuous recordings. however, at present, these devices require activation by the patient to record the ecg, and smartwatch pulse checks (via photoplethysmography (ppg)) occur only intermittently. therefore, these monitors can miss paroxysmal arrhythmia events that are too short in duration or too catastrophic in nature to be captured by the patient and cannot measure arrhythmia burden. as wearable devices become increasingly flexible, stretchable and weightless, they can be comfortably worn continuously to provide uninterrupted ecg data . at present, unclassifiable tracings are common among all ecg monitoring devices, which is likely to improve with technological advances . some systems have increased signal fidelity, such as the kardiamobile six-lead device (alivecor) or the cam device (bardydx), which might reduce noise and improve p-wave discernment . patients are increasingly opting for fda-approved consumer devices, which increases the urgency to extend guidelines to adopt such technologies when appropriate . ppg is an optical technique used to detect fluctuations in reflected light that can provide data on the cycle-by-cycle changes in cardiac haemodynamics . ppg uses a light source, such as a light-emitting diode, to illuminate the face, fingertips or other accessible parts of the body. early fitness trackers used this technology to estimate heart rate, but ppg-measured heart rate is associated with a low positive predictive value , particularly if patients are ambulatory or exercising . the watch-af trial was a prospective, case-control study that compared the diagnostic accuracy of a smartwatch-based algorithm using ppg signals with ecg data measured by cardiologists. the ppg algorithm had very high specificity and diagnostic accuracy, but was limited by a high dropout rate owing to insufficient signal quality. although few comparison studies have been performed, ppg-based analysis of heart rate and rhythm might be less accurate than ecg-based assessment .
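the ppg-based heart-rate and rhythm analysis discussed above ultimately reduces to detecting pulse peaks and examining the regularity of the resulting inter-beat intervals, echoing the 'irregularly irregular' criterion used for af on the ecg. the sketch below illustrates only that step on a synthetic waveform; the sampling rate, peak-detection parameters and the irregularity index are assumptions chosen for demonstration, not a validated detector.

```python
"""Minimal PPG pulse analysis sketch: peak detection, pulse rate and beat-to-beat
irregularity. The signal, sampling rate and parameters are synthetic/illustrative."""
import numpy as np
from scipy.signal import find_peaks

FS = 50.0  # assumed PPG sampling frequency in Hz


def pulse_metrics(ppg: np.ndarray, fs: float = FS) -> dict:
    # Detect systolic peaks; enforce a refractory distance of ~0.4 s between beats.
    peaks, _ = find_peaks(ppg, distance=int(0.4 * fs), prominence=0.3)
    ibi = np.diff(peaks) / fs                    # inter-beat intervals in seconds
    rmssd = np.sqrt(np.mean(np.diff(ibi) ** 2))  # beat-to-beat variability
    # A high irregularity index suggests an irregular pulse worth confirming on ECG.
    return {
        "heart_rate_bpm": 60.0 / ibi.mean(),
        "irregularity_index": rmssd / ibi.mean(),
    }


if __name__ == "__main__":
    t = np.arange(0, 30, 1 / FS)
    # Surrogate ~72 bpm pulse waveform with a little noise
    ppg = np.sin(2 * np.pi * 1.2 * t) + 0.05 * np.random.randn(t.size)
    print(pulse_metrics(ppg))
```

real consumer devices face the additional problems noted above, namely motion artefact, intermittent sampling and unclassifiable segments, which this sketch deliberately ignores.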
an emerging area for ppg-based sensors is the monitoring of blood-pressure levels. ppg-based blood-pressure assessment requires the mapping of pulsatile peripheral waveforms to aortic pressure and uses algorithms that incorporate machine learning technologies , . however, the sensitivity and specificity of such a sensor in measuring blood-pressure levels in the general population have not yet been defined, and measurement variability might affect their accuracy . ppg data can also be measured without body surface contact . video cameras can detect subtle fluctuations in facial perfusion with normal heartbeats to identify arrhythmias, including af . once technical, workflow and regulatory challenges are overcome, this contactless approach could be used for health screening in a physician's office, in a nursing home or in public spaces. however, this approach also highlights societal and ethical issues related to patient privacy and confidentiality, and the physician's responsibility to inform and treat patients . the infrastructure needed to inform a passer-by of an abnormality detected by contactless sensing technology is not yet available, and whether this protocol is appropriate given that consent for testing was not obtained remains an open question. nevertheless, major advances in ppg sensor technology could facilitate the acquisition of haemodynamic data and assessment of their clinical significance in multiple domains, including hf, coronary ischaemia and arrhythmia monitoring. importantly, these devices could also be used to augment traditional home sphygmomanometer devices for haemodynamic monitoring. numerous biosensors are being developed that can monitor hf progression. intrathoracic impedance can be used to detect pulmonary congestion in patients with hf. daily self-measurement of lung impedance using non-invasive devices has been described. in patients with hf, use of the edema guard monitor (cardioset medical) combined with a symptom diary was associated with increases in self-behaviour score for days after hospital discharge . in an analysis of more than , individuals in the uk biobank, a machine learning model revealed that leg bioimpedance was inversely associated with hf incidence . numerous innovative and non-invasive tools can be used to detect leg impedance, such as sock-based sensors . furthermore, microphone-based devices have been used to transform cardiac acoustic vibrations to biomedical signals in quantitative versions of the phonocardiogram . such devices can track respiratory rate, heart and lung sounds, and body motion or position, and might be superior to physical examination for predicting worsening hf . biosensors for other cardiovascular indications are in development. an external device has been described that can monitor impending thrombosis in intra-arterial mechanical pumps with the use of an accelerometer for real-time analysis of pump vibrations to detect thrombosis and possibly prevent thromboembolic events . ballistocardiography, a non-invasive measure of body motion generated by the ejection of blood in each cardiac cycle , has been incorporated into devices such as weighing scales to measure heart rate , whereas a digital artificial intelligence (ai)-powered stethoscope that integrates both ecg and phonocardiogram data was approved in by the fda to assess patients for the presence of af and heart murmurs . the most promising systems might combine multimodality biosignals rather than using a single biosignal.
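the ppg-to-blood-pressure mapping described at the start of this paragraph is, at its core, a regression problem from waveform-derived features to a pressure estimate. the sketch below is a minimal illustration of that idea only: the features (a pulse transit time surrogate, pulse amplitude and heart rate), the synthetic data and the ridge model are assumptions for demonstration, bear no relation to any commercial algorithm, and a real device would require calibrated measurements and clinical validation.

```python
"""Sketch of PPG-derived features mapped to systolic pressure. All data are synthetic."""
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical per-beat features: pulse transit time (s), pulse amplitude (a.u.), heart rate (bpm)
ptt = rng.normal(0.25, 0.03, n)
amp = rng.normal(1.0, 0.2, n)
hr = rng.normal(75, 10, n)
X = np.column_stack([ptt, amp, hr])
# Synthetic "reference" systolic pressure, loosely anti-correlated with transit time
y = 120 - 180 * (ptt - 0.25) + 0.1 * hr + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("MAE (mmHg):", round(mean_absolute_error(y_te, model.predict(X_te)), 1))
```

the mean absolute error reported here reflects only the synthetic data; as the surrounding text notes, population-level accuracy for such sensors has not yet been defined.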
several challenges must be overcome before novel monitoring strategies can be adopted for clinical use in the ambulatory setting, which introduces noise from motion, electromagnetic interference and various patient activities, which are more controlled in the clinic. biosensor design must match hardware specifications to biosignal characteristics for each clinical indication. furthermore, device design must take into account the trade-off between duration and quantity of collected data, required battery power and device size, and durability in real-world use. importantly, devices tested under one set of clinical conditions are not applicable for use for other clinical conditions, a particularly relevant point to remember given the growth of poorly regulated consumer medical devices. subtle changes in biosignals might also confound analysis, such that testing and validation might need to be repeated de novo for each device being investigated. of note, many widely used consumer devices have only modest accuracy even for the 'simple' biosignals of heart rate or energy expenditure . whether accuracy is reduced owing to differences in study cohorts between initial device validation and real-world users , biological differences in biosignals owing to varying activity levels or other factors is unknown , , . biosignals that are calibrated in healthy volunteers might differ in accuracy when detecting disease. for example, tachycardia or irregularly irregular af might introduce noise or variabilities in qrs morphology compared with sinus rhythm and can influence ecg algorithms . similarly, variability in pulse waveforms might influence ppg algorithms. accordingly, algorithms developed with machine learning technology are best applied when the training and test populations are analogous. when these populations differ, learned features might become inaccurate, compounded in machine learning by limited methods to interpret its decisions (justifying why machine learning has sometimes been described as a 'black box') . testing and validation for each specific clinical application are, therefore, critical in device development. the large quantity of data generated by ambulatory monitoring devices necessitates accurate and automated diagnosis and an infrastructure to enable quick clinical actions. the time-honoured method of human review and annotation of clinical data is also time-consuming, expensive and not scalable. novel, scalable approaches to data interpretation and actionability might allow the potential of novel ambulatory monitoring to be realized. by reducing the time needed for data interpretation, ambulatory monitoring can detect acute events, such as worsening hf, incipient coronary syndrome or impending sudden cardiac arrest, and provide timely feedback for less urgent events. traditional analytical models for ambulatory monitoring rely on a limited number of biosignals and apply intuitive rules, such as those related to rate or regularity of heart rhythm, to flag a normal or abnormal result (fig. a) . such forms of ai systems are known as 'expert systems' . although these traditional models might introduce inaccuracy in data interpretation, slight inaccuracies might be acceptable in traditional health-care paradigms in which data flagged by the device are verified by clinicians. however, this approach might not be safe for wearable consumer devices with little or no clinician input. 
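a minimal sketch of the traditional rule-based ('expert system') flagging described above is shown below; the rate and regularity cut-offs are placeholders chosen for illustration, not clinical or guideline thresholds, and in the traditional paradigm any flagged result would still be verified by a clinician.

```python
"""Deliberately simple rule-based flagging of a recording as normal or abnormal."""


def flag_recording(heart_rate_bpm: float, irregularity_index: float,
                   brady_cutoff: float = 40.0, tachy_cutoff: float = 150.0,
                   irregular_cutoff: float = 0.2) -> dict:
    """Apply fixed rate/regularity cut-offs (placeholder values) and collect reasons."""
    reasons = []
    if heart_rate_bpm < brady_cutoff:
        reasons.append("bradycardia")
    if heart_rate_bpm > tachy_cutoff:
        reasons.append("tachycardia")
    if irregularity_index > irregular_cutoff:
        reasons.append("irregular rhythm")
    return {"abnormal": bool(reasons), "reasons": reasons}


print(flag_recording(38, 0.05))   # {'abnormal': True, 'reasons': ['bradycardia']}
print(flag_recording(72, 0.01))   # {'abnormal': False, 'reasons': []}
```

the appeal of such rules is their transparency; their weakness, as the text notes, is that fixed cut-offs interpret each biosignal in isolation and tolerate inaccuracy only when a clinician remains in the loop.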
machine learning is a rapidly developing branch of ai that has shown early promise for use in cardiovascular medicine through the extraction of clinically relevant patterns from complex data, such as detecting myocardial ischaemia from cardiac ct images and interpreting arrhythmias from wearable ecg monitors . machine learning can also facilitate novel strategies for communication between patients and the health-care team (fig. b) . machine learning-based classification of biosensor data from multiple sensors can automatically evaluate the haemodynamic consequences of hf, arrhythmias or coronary syndromes, and can enable rapid triage without the need to develop, test and separately implement complex rules. conversely, machine learning algorithms are not perfect and are limited by the presence of noise and training data that might not adequately represent the real-world clinical setting. in a study to detect af, a third of ecgs could not be interpreted by a consumer device but could be classified by experts . furthermore, in a proof-of-concept study involving the use of smartwatch-based ppg sensor data analysed by a deep neural network, af was diagnosed accurately in recumbent patients (c statistic . ) but not in ambulatory patients (c statistic . ) . advanced monitoring systems that integrate data from multiple streams can better mimic the diagnostic performance of a clinician than current devices that monitor a single data stream. a system that identifies an impending event is likely to be more accurate if an event detected from the ecg is combined with evaluation of potential haemodynamic compromise (such as from a ppg signal) than use of either signal alone. the integration of multiple physiological data streams is a complex task for which simple rules might not readily exist. machine learning might provide such decision-making potential because of its proven capacity to classify complex data. figure b illustrates a typical machine learning architecture comprising an artificial neural network with multiple inputs. this type of architecture can capture multimodal biosignals such as ecg, pulse oximetry and electronic medical record (emr) data (denoted x n in fig. b) and classify them by adjudicated outcome (denoted y or y ), which might represent response or non-response to therapy, or the presence or absence of a haemodynamically significant event. layers in the model (denoted h -h n ) distil input biosignals into archetypes of data that are relevant to the outcomes, constructed iteratively in the hidden 'deeper' layers during algorithm training. these hidden layers are integrated at lower levels to reduce the extent (or dimensionality) of data and identify patterns that best match with the critical event , . although decisions made by such machine learning models are not always readily interpretable, studies have shown that these models make mistakes similar to those made by humans and can learn 'expert' decision-making processes even if not trained in these processes, raising confidence that machine learning decisions are medically intuitive . several machine learning-based monitoring systems have been assessed for their efficacy in guiding clinical management.
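to make the architecture described above more concrete, the sketch below implements a toy multimodal classifier in which ecg-derived, ppg-derived and emr feature vectors (the inputs x) are fused and passed through hidden layers (h) to two outputs (y); the feature dimensions, layer sizes and random data are arbitrary choices for illustration and do not correspond to any published model.

```python
"""Toy multimodal neural classifier in the spirit of the fig. b architecture."""
import torch
import torch.nn as nn


class MultimodalClassifier(nn.Module):
    def __init__(self, n_ecg: int = 16, n_ppg: int = 8, n_emr: int = 12, n_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_ecg + n_ppg + n_emr, 64),  # first hidden layer (h)
            nn.ReLU(),
            nn.Linear(64, 32),                     # deeper layer: lower-dimensional 'archetypes'
            nn.ReLU(),
            nn.Linear(32, n_classes),              # outputs (y): e.g. significant event vs not
        )

    def forward(self, ecg, ppg, emr):
        x = torch.cat([ecg, ppg, emr], dim=-1)     # fuse the modalities into one input vector
        return self.net(x)


if __name__ == "__main__":
    model = MultimodalClassifier()
    batch = (torch.randn(4, 16), torch.randn(4, 8), torch.randn(4, 12))
    logits = model(*batch)
    print(logits.shape)  # torch.Size([4, 2])
```

training such a model against adjudicated outcomes, and establishing that it generalizes beyond the training population, are exactly the validation and data-curation problems discussed later in this review.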
the link-hf multicentre study investigated the accuracy of a smartphone-based and cloud-based machine learning algorithm that analysed data from a wearable patch for predicting the risk of rehospitalization (via measurement of physiological parameters such as ecg, heart rate, respiratory rate, body temperature, activity level and body position) in patients with hf. this system predicted the risk of imminent hf hospitalization with up to % sensitivity and % specificity, which is similar to that of implanted devices. a follow-up study to determine whether this approach can prospectively prevent rehospitalizations for hf is ongoing. the music study was a multicentre, non-randomized trial to validate a multiparameter algorithm in an external multisensor monitoring system to predict impending acute hf decompensation in patients with hf with reduced ejection fraction. algorithm performance met the prespecified end point with % sensitivity and % specificity for the detection of hf events. numerous monitoring devices that use machine learning technology have been developed to detect ventricular arrhythmias and impending sudden cardiac arrest. the design of the plus emergency watch (formerly the ibeat heart watch) involves a closed-loop system that uses machine learning algorithms to monitor signals detected from a dedicated watch, which then automatically contacts emergency services if the wearer does not respond to a notification within s (ref. ). machine learning technology ('deep learning') has also been shown to improve the performance of shock advice algorithms in an automated external defibrillator to predict the onset of ventricular arrhythmias with the use of an artificial neural network and to predict the onset of sudden cardiac arrest within h by incorporating heart rate variability parameters with vital sign data . a system that can warn patients of an impending life-threatening cardiac event, even if only by several minutes, will greatly increase the availability and efficacy of a bystander or emergency medical response . the application of machine learning to continuous biosensor data is beginning to provide insights into the pathophysiological mechanisms underlying numerous cardiovascular conditions, such as the identification of novel disease phenotypes that might respond differentially to therapy.
fig. | traditional analytical models for ambulatory monitoring versus future models incorporating machine learning technology. a | traditional systems for the analysis of ambulatory monitoring data rely on a limited number of biosignals and apply signal processing algorithms related to the rate or regularity of heart rhythm to flag a normal or abnormal result. the provider is then alerted to the result for management purposes. in a parallel pathway involving cardiac implanted electronic devices (cieds; dashed line), data analysed by the cied can be used to deliver therapy by altering pacing or delivering implantable cardioverter-defibrillator therapy. b | a potential future model for monitoring might incorporate multiple inputs including biosignals (such as electrograms, haemodynamics and activity levels), patient input and clinical data, which are analysed by a machine learning algorithm. deep neural networks, a type of machine learning technology, facilitate the classification of multiple diverse inputs even if traditional rules would be difficult to devise. in this scenario, deep neural networks receive inputs (denoted x , x , x , x and x n ) and use hidden nodes (denoted h , h , h and h n ) to classify them into actionable outputs (denoted y and y ). this model can be tailored to the patient and the type of sensor available. given that many ambulatory devices are likely to be patient-driven, data will be directly sent to the patient. additional infrastructure is needed to inform health-care providers of actionable diagnoses . af, atrial fibrillation; emr, electronic medical record.
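the multiparameter risk-prediction systems described above combine daily physiological summaries into a single probability of impending decompensation, with an alert threshold that trades sensitivity against specificity. the sketch below illustrates that general pattern on synthetic data; it is not the link-hf or music algorithm, and the features, labels and alert threshold are assumptions for demonstration only.

```python
"""Illustrative multiparameter risk model on synthetic daily summaries."""
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([
    rng.normal(75, 8, n),       # resting heart rate (bpm)
    rng.normal(16, 2, n),       # respiratory rate (breaths/min)
    rng.normal(5000, 1500, n),  # daily step count
    rng.normal(0.0, 0.5, n),    # weight change (kg/day)
])
# Synthetic labels: decompensation more likely with high HR/RR, low activity, weight gain
risk = (0.05 * (X[:, 0] - 75) + 0.3 * (X[:, 1] - 16)
        - 0.0005 * (X[:, 2] - 5000) + 2.0 * X[:, 3])
y = (risk + rng.normal(0, 1.5, n) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
alerts = clf.predict_proba(X_te)[:, 1] > 0.3  # the alert threshold is a design choice
tn, fp, fn, tp = confusion_matrix(y_te, alerts).ravel()
print(f"sensitivity={tp / (tp + fn):.2f} specificity={tn / (tn + fp):.2f}")
```

lowering the alert threshold raises sensitivity at the cost of more false alerts, which is precisely the trade-off that the published systems report as paired sensitivity and specificity figures.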
novel immune phenotypes for pulmonary arterial hypertension were identified by unsupervised machine learning analysis of a proteomic panel including cytokines and chemokines from whole-blood samples . the investigators identified four clusters independent of who-defined pulmonary arterial hypertension subtypes, which showed distinct immune profiles and predicted a -year transplant-free survival of . % in the highest-risk cluster and . % in the lowest-risk cluster. a machine learning-based cluster analysis of echocardiogram data from patients in the topcat trial revealed three novel phenotypes of hf with preserved ejection fraction with distinct clinical characteristics and long-term outcomes . in a study involving , patients with hf with reduced ejection fraction from the swedish hf registry, the use of machine learning to analyse demographic, clinical and laboratory data resulted in a random forest-based model that predicted -year survival with a c statistic of . (ref. ). cluster analysis led to the identification of four distinct phenotypes of hf with reduced ejection fraction that differed in terms of outcomes and response to therapeutics, highlighting the role of such novel analytical strategies in increasing the effectiveness of current therapies. machine learning data have also provided mechanistic insights into the pathophysiology of af. patients with persistent or paroxysmal af show rates of response to antiarrhythmic medications of - % and to cardiac ablation of - % . data from continuous ecgs show that current clinical classifications poorly reflect the true temporal persistence of af . additional studies could identify af patterns or other physiological phenotypes in patients with 'less advanced' persistent af in whom pulmonary vein isolation alone might be effective, or conversely those with 'more advanced' paroxysmal af in whom pulmonary vein isolation might be less effective. patients could thus be stratified for treatment according to newly recognized patterns of af (that is, staccato versus legato) or by incorporating haemodynamic or clinical data. a proof-of-concept study showed that machine learning trained on daily af burden from continuous cied tracings revealed signatures with incremental prognostic value for the risk of stroke beyond the cha ds -vasc score . patients with hf and arrhythmias could thus show differing prognosis depending on arrhythmia burden . therefore, although in the near future digital health platforms are unlikely to provide 'precision medicine' at the granular level of individualizing therapy according to genotype, such platforms might still provide the opportunity for personalized care on the basis of deep patient phenotyping to provide novel disease insights. the fda published a discussion paper in april describing the development, testing and regulatory oversight for machine learning approaches between the stages of premarketing and postmarketing performance .
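returning briefly to the unsupervised phenotyping studies described above, they follow a common recipe: standardize a patient feature matrix, search for clusters and then test whether the clusters differ in outcomes. the sketch below illustrates only the clustering step on synthetic data; the features, cluster counts and silhouette criterion are illustrative choices, not those of the published analyses, and outcome validation is omitted.

```python
"""Illustrative unsupervised phenotyping: standardize features, then cluster."""
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Synthetic cohort: columns could stand for e.g. ejection fraction, a biomarker level,
# a diastolic index and 6-min walk distance (purely illustrative)
X = np.vstack([
    rng.normal([55, 100, 8, 450], [5, 30, 2, 50], size=(150, 4)),
    rng.normal([40, 400, 14, 300], [5, 80, 3, 60], size=(100, 4)),
    rng.normal([60, 150, 10, 380], [4, 40, 2, 40], size=(120, 4)),
])
Xz = StandardScaler().fit_transform(X)

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xz)
    print(k, "clusters, silhouette =", round(silhouette_score(Xz, labels), 2))
```

in the published studies, the decisive step is not the clustering itself but showing that the resulting groups differ in survival or treatment response, which requires outcome data and prospective validation.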
in general, a desirable system should accurately identify and separate data indicative of urgent or non-urgent clinical states. in the absence of such a system, all biosensor data that meet prescribed cut-off points, such as extreme bradycardia or tachycardia, are flagged and the health-care provider is alerted. this fda guidance allows device manufacturers to invest in the development of models with a lower-risk pathway to implementation and is intended to increase clinician-patient interactions and promote wellness. however, a drawback of applying traditional regulatory processes to rapidly evolving devices is that machine learning algorithms are typically 'frozen' , with no further changes permitted, when a 'software as a medical device' (samd) application is submitted (defined as software that is intended to be used for medical purposes that performs these tasks without being part of a hardware medical device) . this process limits the opportunity to approve self-learning algorithms, which would ultimately differ from the submitted version, and this limitation is amplified by the inevitable time between receiving trial data and approving the data for use in patients. one potential solution could be to submit several versions of a device for approval, including a base case for the most validated primary labelling indication, plus alternatives with preliminary data for secondary labelling indications. another approach is to approve a 'snapshot' of the samd self-learning algorithms associated with a registry, which is similar to postmarketing studies for devices and drugs that require repeated evaluation at predetermined intervals. development and training of algorithms requires gold-standard data (often termed a 'ground truth'), yet such data can be difficult to obtain in patients, which complicates the regulatory and clinical pathway. biosignals are typically complex, non-linear, high-dimensional (comprising many variables) and dynamic. high-quality labelled datasets are scarce both for novel biosignals such as ballistocardiograms and for well-established biosignals such as thoracic impedance, energy expenditure or ecgs measured from atypical locations. although new datasets can be created for such signals, the accuracy of the sets must be validated de novo. bias is introduced whenever humans interact with data, which should be considered when scalable systems are being designed. one ideal solution would be the development of curated databases with specific biosignal data streams that are labelled by adjudicated outcomes and tailored to each use . although standardized databases such as physionet have been useful for testing algorithms for research , these databases are small and might not include data from novel biosensors. the plethora of commercially available health monitoring devices has facilitated the generation of large proprietary datasets, yet these databases are not always transparent or available for validation . therefore, the regulatory pathway might require several clinical tests with prototypes in each class of device or algorithm, and multiple well-curated datasets. device manufacturers should demonstrate that emerging devices can be operated by untrained users to acquire recordings that will perform well with their systems, including analysis of human factors that can bias the results and analyses specific to their algorithms. 
therefore, although standardization of novel biosensors across manufacturers is ultimately desirable, this goal might need to be deferred until technologies become more mature. regulatory agencies in the usa, including the fda, and patient advocacy groups have unanimously taken the position that patients must be empowered in their relationship with health-care providers and have access to their data . meaningful use criteria for emrs require data sharing through patient access portals, yet such data might be difficult for patients to interpret (table ) . historically, medical device data have been kept in databases owned and maintained by industry and accessible by health-care providers, yet with more limited accessibility for patients. consumer devices have shifted this landscape, empowering individuals to access their data from device companies, who then directly provide automated reports without having to notify a caregiver (fig. ) . this model introduced several potential challenges. whether meaningful use criteria for emrs apply to consumer device-based data is unclear. moreover, whether a health-care organization can have timely and unfettered access to data 'ordered' then paid for by a consumer and then stored in devices that are also paid for by the consumer is unclear (table ) . one important additional point is that these devices have already been developed with use of data that arguably belong to the consumer. in , the alphabet-owned ai company deepmind technologies partnered with health-care authorities in the uk to access health data without the need for patients' permission . this model introduces potential risks of a 'services for data' social media business model in which personal data are commoditized for sale to or by third-party companies. alternatively, if medical devices and data are owned and paid for by consumers, an opportunity exists for market forces or legislation to return control to data owners. device manufacturers or third parties could conceivably compete in providing patient-friendly data visualization tools, to which medical providers could also pay for access. this scenario has its own challenges and is likely to be a point of contention in coming years. a complicated responsibility exists for data that are shared between users (patients, health-care providers and algorithm developers), data owners (health-care organizations, individuals and industry) and industry. health-care organizations are liable for unauthorized access to emrs, yet numerous privacy concerns exist for non-health-related mobile data. consumer devices are also likely to encounter cybersecurity risks, which must be addressed proactively. data breaches, both unintentional and malicious in nature, have been reported by many companies that are now entering the health-care market, as well as diagnostic companies and cied manufacturers . the technical shift to consumer-driven technology might provide a catalyst to standardize biosensor and data formats, and in turn increase security. blockchain technology, which has been successfully used in financial markets and other industries, might have a role in patient-centred monitoring by tagging data ownership, providing traceability and enabling incentive programmes for sharing data . geopolitical regulations are also in development.
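the traceability idea raised above (tagging data ownership and making access auditable) can be illustrated, in a much-simplified and non-blockchain form, with a hash-chained access log in which altering any past entry invalidates the chain. the field names and workflow below are assumptions for demonstration only and do not describe any deployed system.

```python
"""Minimal tamper-evident access log using a hash chain (illustrative, not blockchain)."""
import hashlib
import json
import time


def _digest(entry: dict, prev_hash: str) -> str:
    payload = json.dumps(entry, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()


def append_entry(log: list, actor: str, action: str) -> None:
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {"actor": actor, "action": action, "timestamp": time.time()}
    log.append({**entry, "hash": _digest(entry, prev_hash)})


def verify(log: list) -> bool:
    prev_hash = "0" * 64
    for item in log:
        entry = {k: v for k, v in item.items() if k != "hash"}
        if item["hash"] != _digest(entry, prev_hash):
            return False  # the chain breaks at the first tampered entry
        prev_hash = item["hash"]
    return True


log = []
append_entry(log, "device-123", "uploaded ECG segment")
append_entry(log, "clinic-A", "viewed report")
print(verify(log))            # True
log[0]["action"] = "deleted"  # simulated tampering of an earlier entry
print(verify(log))            # False
```

a production-grade solution would also need identity management, distributed storage and the incentive mechanisms mentioned above, which are beyond this sketch.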
the general data protection regulation (gdpr) was introduced by the eu in with the primary goal of giving individuals control over their personal data, and aims to unify the regulations within the region and provide safeguards to protect data, requiring all stakeholders to disclose data collection practices and breaches that occur. this regulation has become a model for privacy laws elsewhere and is similar in structure to the california consumer privacy act. however, it is unclear how general consumer regulations will apply to or potentially influence the us health insurance portability and accountability act, which could also be modified given that it covers only a fraction of an individual's health-related data . devices that integrate high-fidelity biosignal detection with broadband wireless connectivity and cloud processing could, in principle, facilitate real-time care. a similar landscape is rapidly developing in the automotive industry with regard to the design of autonomous driving vehicles that apply multimodal, ultrafast fusion algorithms to multiple data streams that can provide an immediate response. to apply this technology to wearable devices, collected data must interact within a rapidly changing clinical context, which has already occurred for icd therapy for tachyarrhythmia or pacing technology for bradycardia . however, this technology is less developed for other domains such as af management and hf or blood-pressure monitoring and devices that require multimodal data. several clinical studies of mobile and wearable device platforms are summarized in table . one early model is the currently available mcot system for arrhythmia monitoring. the mcot system includes ecg sensors and a device that automatically transmits data to a central analysis hub for annotation and alerts the health-care provider . the cycle time for this process ranges from minutes to hours. this approach can increase the diagnostic yield over that of other ambulatory ecg systems and has been used during the coronavirus disease (covid- ) pandemic to monitor the qt interval in patients receiving hydroxychloroquine or azithromycin while simultaneously minimizing clinician exposure and preserving personal protective equipment resources . during the covid- pandemic, the heart rhythm society (hrs) recommended the use of digital wearable devices to obtain vital signs and ecg tracings, as well as the use of mcot after hospital discharge . furthermore, the hrs recommended the replacement of in-person clinic visits and cied checks with telehealth consultations whenever feasible. these approaches are not yet recommended as an 'emergency response' system for scenarios such as impending sudden cardiac arrest. new real-time systems might lay the foundation for real-time data transmission and response that are coordinated with emergency medical services and bystanders . early proof-of-concept systems have shown success in rapidly alerting bystanders and emergency medical providers to expedite first response . in europe, community volunteers can rapidly deliver automated external defibrillators to people experiencing sudden cardiac arrest . possible future directions include the development of a wireless internet of things (in which multiple devices are connected in their own dedicated network) for real-time cardiovascular care delivery. an important consideration is that medical care systems are not required to be fully automatic, unlike self-driving cars.
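the central-hub model described above (device reports transmitted automatically, annotated and escalated to a provider when urgent) can be sketched as a simple triage loop; the findings list, queue and notification stub below are illustrative stand-ins for real device feeds and messaging infrastructure, and a clinician remains in the loop before any action is taken.

```python
"""Minimal central-hub triage loop: urgent findings escalate, routine ones are batched."""
from dataclasses import dataclass
from queue import Queue


@dataclass
class DeviceReport:
    patient_id: str
    finding: str          # e.g. "af_detected", "bradycardia", "normal"
    heart_rate_bpm: float


URGENT_FINDINGS = {"ventricular_tachycardia", "pause", "bradycardia"}  # illustrative list


def notify_clinician(report: DeviceReport) -> None:
    # Placeholder for paging/alerting infrastructure
    print(f"URGENT: {report.patient_id} -> {report.finding} ({report.heart_rate_bpm} bpm)")


def triage(inbox: Queue, routine_batch: list) -> None:
    while not inbox.empty():
        report = inbox.get()
        if report.finding in URGENT_FINDINGS:
            notify_clinician(report)      # clinician confirmation still precedes any action
        else:
            routine_batch.append(report)  # reviewed in the next scheduled batch


inbox, batch = Queue(), []
inbox.put(DeviceReport("pt-001", "normal", 72))
inbox.put(DeviceReport("pt-002", "bradycardia", 34))
triage(inbox, batch)
print(len(batch), "routine report(s) queued for review")
```

keeping the clinician between the alert and the therapeutic action is one concrete expression of the conditional autonomy discussed in the next paragraph.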
optimal medical systems might require only conditional autonomy, in that input from medical professionals and patients should be considered, rather than complete autonomy . although this need for conditional autonomy reduces some technical challenges, conditional autonomy also introduces limitations such as the need for integration with contemporaneous medical systems and to allow practitioner oversight while retaining speed of response and accuracy. a growing number of publications support the use of monitoring devices in cardiovascular diagnostics and decision-making, including those that integrate machine learning technology. this rapid expansion of the evidence base has coincided with increased fda guidance supporting the use of wearable devices for health care. table summarizes clinical studies of mobile and wearable device platforms. detection of subclinical af in patients with cryptogenic stroke. the aha/acc/hrs guidelines for the management of af recommend ambulatory monitoring to screen patients for af and, if this is inconclusive, a cardiac monitor should be implanted . the crystal-af trial showed that ecg monitoring with an insertable cardiac monitor was superior to conventional follow-up for detecting af in patients after cryptogenic stroke. the embrace trial extended these observations by showing that a high burden of premature atrial beats predicted af in patients with cryptogenic stroke. the long recording duration of wearable ecg devices makes them desirable for detecting subclinical af, although whether such information can influence therapeutic decisions to prevent stroke is yet to be shown. future studies should thus compare the accuracy and cost-effectiveness of wearable devices with those of traditional monitors in patients at risk of stroke and after stroke. screening for sudden cardiac arrest. individuals at risk of sudden cardiac death have a diverse spectrum of phenotypes. the aha/acc/hrs guidelines provided a class i indication for ambulatory monitoring in patients with palpitations, presyncope or syncope to undergo monitoring to detect potential ventricular arrhythmias . a class iia recommendation was indicated for patients with suspected long qt syndrome and to determine whether symptoms, including palpitations, presyncope or syncope, are caused by ventricular arrhythmias. ambulatory ecg monitoring was also recommended for patients starting certain antiarrhythmic medications (including disopyramide, dofetilide, ibutilide, procainamide or sotalol) with or without risk factors for torsades de pointes . the esc guidelines on the diagnosis and management of hypertrophic cardiomyopathy recommended ambulatory ecg monitoring every - months in patients with hypertrophic cardiomyopathy with left atrial dilation of ≥ mm or after septal reduction therapies . the diversity of patient phenotypes in this group introduces challenges and might require non-uniform monitoring intensity between patient populations. the current lack of infrastructure to facilitate actions in response to data from wearable devices might limit their use in detecting life-threatening arrhythmias. however, professional society guidelines have provided recommendations on the use of wearable cardioverterdefibrillators to prevent sudden cardiac death and have called for increased transparency in monitoring data from cieds and consumer arrhythmia-monitoring devices . arrhythmia screening in patients with syncope. 
the esc guidelines for the diagnosis and management of syncope recommend ambulatory ecg monitoring in patients with recurrent and unexplained syncope . depending on the frequency of events and the clinical context, patients can be monitored with the use of implanted devices or external devices that send alerts to health-care providers. devices that encompass multiple sensor streams, such as activity, pulse oximetry and haemodynamics, to track the temporal relationship between episodes of hypotension, posture and cardiac rhythm might provide pathophysiological insights in different populations and are currently under investigation . the aha guidelines and the esc guidelines recommend ambulatory arrhythmia monitoring for various subgroups of patients with acute coronary syndromes, including those with left ventricular ejection fraction < %, failed reperfusion and high risk of ventricular arrhythmia, and patients requiring β-blocker therapy adequacy assessment , . similarly, a expert consensus statement from the international society for holter and noninvasive electrocardiology and the hrs provided a class i recommendation for ambulatory monitoring in patients with arrhythmic and non-arrhythmic conditions, including non-ischaemic cardiomyopathy . although these recommendations were largely instituted for arrhythmia detection, signals for recurrent ischaemia might also be derived from these data. fitness and health-tracking devices. in july , the fda issued guidance for general wellness devices such as activity trackers, smartwatches and other products intended to improve physical fitness, nutrition or other wellness goals . subsequently, in september , the fda issued new draft guidance for clinical support applications that provides diagnostic and treatment recommendations for physicians but not for patients . screening of the general population for af. in , the us preventive services task force concluded that insufficient evidence is available to determine whether the benefits of af screening outweigh the associated risks . this conclusion was formed on the basis of the potential physical and psychological risks of unnecessary treatment (false positives) in asymptomatic individuals aged ≥ years. conversely, the esc guidelines recommend screening for af in individuals older than years in order to consider anticoagulation on the basis of findings from the safe and strokestop studies, in which af screening of asymptomatic individuals aged ≥ years and ≥ years, respectively, was shown to be cost-effective. investigators in the ongoing screen-af trial will randomly assign individuals aged ≥ years to weeks of ambulatory ecg monitoring with a home blood-pressure monitor that can automatically detect af or to the standard of care, to assess the primary end point of af detection. the apple heart study enroled , participants in the usa over months to ascertain whether a ppg-enabled device could detect af in individuals without a known history of the disease. inclusion criteria included absence of self-reported af, atrial flutter or oral anticoagulation use in individuals with a compatible apple smartphone and smartwatch. overall, , participants ( . %) were notified of irregular rhythms with this technology . in a subset of enrollees who wore and returned clinical gold-standard ecg patches containing data that could be analysed, af (≥ s) was present in % of all participants and in % of participants aged ≥ years. 
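screening studies such as those discussed in this section report positive predictive values with binomial confidence intervals. the sketch below shows one common way such an interval (a wilson interval) can be computed from notification and confirmation counts; the counts used are made up for illustration and are not taken from any trial.

```python
"""Positive predictive value with a Wilson score confidence interval (illustrative counts)."""
from math import sqrt


def ppv_with_wilson_ci(true_positives: int, notified_with_reference: int, z: float = 1.96):
    """Return PPV and its Wilson interval for a given z (1.96 ~ 95% confidence)."""
    n = notified_with_reference
    p = true_positives / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return p, (centre - half, centre + half)


# Hypothetical example: 153 of 180 notified participants had AF confirmed on a reference ECG
ppv, (lo, hi) = ppv_with_wilson_ci(153, 180)
print(f"PPV = {ppv:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

as the surrounding text emphasizes, the observed ppv also depends on the pretest probability of af in the screened population, so the same device can look very different in a young, healthy cohort than in an older, higher-risk one.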
the positive predictive value for simultaneous af on ambulatory ecg patch monitoring was % ( % ci - %). the huawei heart study , conducted by the mafa ii investigators, assessed the use of a wristband or wristwatch with ppg technology to monitor pulse rhythm in , individuals. of these individuals, were notified as having suspected af, including who had af confirmed by a gold-standard clinical device. therefore, this wristwatch provided a positive predictive value of . % ( % ci . - . %) in the subset of individuals who also had clinical monitoring . the proportion of individuals with positive test results in both studies reflects the expected pretest probability of af in a wide and relatively healthy population, and can inform on the design of future screening trials and the best target populations for such a strategy. the aha/acc/hrs guidelines for the management of af emphasize that anticoagulation should not be tailored by the detection of af episodes, the precise onset of af or the temporal patterns of af . indeed, the impact-af trial showed that pill-in-the-pocket use of non-vitamin k oral anticoagulants on the basis of detected af did not reduce bleeding or thromboembolic event rates compared with standard therapy in patients with an indication for oral anticoagulation. furthermore, the react.com study showed the feasibility of a targeted strategy of implantable cardiac monitor-guided intermittent administration of non-vitamin k oral anticoagulants on the basis of remote monitoring in low-risk af populations. however, this strategy might be less effective in other patient populations, and the investigators did not assess treatment adherence among participants . in standard clinical practice, oral anticoagulation is indicated as soon as af is detected in patients with a single cha ds -vasc risk factor . emerging monitoring devices might facilitate the definition of a specific device-detected af threshold that warrants the initiation of anticoagulation therapy. in the trends study , this af threshold might be an af duration as short as . h. by contrast, a substudy of the assert trial suggested a threshold duration of subclinical af of ≥ h (ref. ). ongoing clinical trials are testing the use of oral anticoagulants for several proposed thresholds of af duration. the artesia trial is currently enrolling patients with af of ≥ min, and the noah trial is enrolling patients with an atrial high rate (≥ bpm) of duration of ≥ min. both trials are enrolling patients with a cied with an atrial lead and exclude individuals with a single af episode longer than h. finally, the loop study is using the reveal linq system to detect af of ≥ min, confirmed by at least two senior cardiologists. the results of these and other trials will help to define the device-detected af threshold that warrants the initiation of anticoagulation therapy. cardiovascular monitoring is poised for dramatic technological advances through developments in novel biosignal definition and biosensor acquisition, automated diagnosis and expert-level triage, secure data transmission and patient-centric disease management. numerous challenges remain in ensuring that data are owned and fully accessible by patients, but at the same time allowing relevant stakeholders to access data and enable timely disease management. 
once data security and the other ethical and regulatory concerns associated with wearable technologies are addressed, this expanded monitoring paradigm has the potential to revolutionize the cardiovascular care of ambulatory patients. published online xx xx xxxx www.nature.com/nrcardio effectiveness of remote patient monitoring after discharge of hospitalized patients with heart failure: the better effectiveness after transition -heart failure (beat-hf) randomized clinical trial continuous wearable monitoring analytics predict heart failure hospitalization defining the pattern of initiation of monomorphic ventricular tachycardia using the beat-to-beat intervals recorded on implantable cardioverter defibrillators from the raft study: a computer-based algorithm clinical classifications of atrial fibrillation poorly reflect its temporal persistence: insights from , patients continuously monitored with implantable devices cryoballoon or radiofrequency ablation for atrial fibrillation assessed by continuous monitoring: a randomized clinical trial monitoring of heart and breathing rates using dual cameras on a smartphone large-scale assessment of a smartwatch to identify atrial fibrillation a smartphone application for dispatch of lay responders to out-of-hospital cardiac arrests predicting heart failure events with home monitoring: use of a novel, wearable necklace to measure stroke volume, cardiac output and thoracic impedance simultaneous monitoring of ballistocardiogram and photoplethysmogram using a camera the national icd registry report: version . including leads and pediatrics for years avoiding inappropriate therapy of single-lead implantable cardioverter-defibrillator by using atrial-sensing electrodes remote management of heart failure using implantable electronic devices heart rate variability: measurement and clinical utility intrathoracic impedance monitoring in patients with heart failure: correlation with fluid status and feasibility of early warning preceding hospitalization randomized controlled trial of an implantable continuous hemodynamic monitor in patients with advanced heart failure ambulatory hemodynamic monitoring reduces heart failure hospitalizations in "real-world" clinical practice device monitoring in heart failure management: outcomes based on a systematic review and meta-analysis direct left atrial pressure monitoring in ambulatory heart failure patients: initial experience with a new permanent implantable device transvenous neurostimulation for central sleep apnoea: a randomised controlled trial an entirely subcutaneous implantable cardioverter-defibrillator permanent leadless cardiac pacing: results of the leadless trial symbiotic cardiac pacemaker first experience with a mobile cardiac outpatient telemetry (mcot) system for the diagnosis and management of cardiac arrhythmia the diagnosis of cardiac arrhythmias: a prospective multi-center randomized study comparing mobile cardiac outpatient telemetry versus standard loop event monitoring worldwide epidemiology of atrial fibrillation: a global burden of disease study comparison of ambulatory patch ecg monitors: the benefit of the p-wave and signal clarity wearable real-time heart attack detection and warning system to reduce road accidents a comparative study of ecg-derived respiration in ambulatory monitoring using the single-lead ecg real-world performance of an enhanced atrial fibrillation detection algorithm in an insertable cardiac monitor performance of a new leadless implantable cardiac monitor in 
key: cord- -guqc authors: wissel, benjamin d; van camp, p j; kouril, michal; weis, chad; glauser, tracy a; white, peter s; kohane, isaac s; dexheimer, judith w title: an interactive online dashboard for tracking covid- in u.s. counties, cities, and states in real time date: - - journal: j am med inform assoc doi: . /jamia/ocaa sha: doc_id: cord_uid: guqc objective: to create an online resource that informs the public of covid- outbreaks in their area. materials and methods: this r shiny application aggregates data from multiple resources that track covid- and visualizes them through an interactive, online dashboard. results: the web resource, called the covid- watcher, can be accessed at https://covid watcher.research.cchmc.org/. it displays covid- data from every county and metropolitan areas in the u.s. features include rankings of the worst affected areas and auto-generating plots that depict temporal changes in testing capacity, cases, and deaths. discussion: the centers for disease control and prevention (cdc) do not publish covid- data for local municipalities, so it is critical that academic resources fill this void so the public can stay informed. the data used have limitations and likely underestimate the scale of the outbreak. conclusions: the covid- watcher can provide the public with real-time updates of outbreaks in their area. as of april th , the united states of america (u.s.a.) had % of novel coronavirus disease (covid- ) cases worldwide, the most of any country. [ ] at this date, new york city was the epicenter of cases in the u.s., but large outbreaks were present in several other major metropolitan areas, including new orleans, detroit, chicago, and boston. several online tools track covid- outbreaks at the county, state, and national level. [ ] [ ] [ ] [ ] however, it has become apparent that tracking outbreaks at the city level is critical, as the outbreak in china was centered within and surrounding the city of wuhan, lombardy within italy, madrid within spain, and london within the united kingdom. our team developed a methodology to aggregate county-level covid- data into metropolitan areas and display these data in an interactive dashboard that updates in real-time.
the purpose of this website was to make this information more accessible to the public, and to allow for more granular assessment of infection spread and impact. we assessed three publicly available datasets that are updated daily and include county- and/or state-level counts of covid- confirmed cases and deaths in the u.s.a. the new york times (nyt) began tracking covid- cases and deaths on the county level, defining cases as individuals who tested positive for covid- . cases were attributed to the county in which the person was treated and were counted on the date that the case was announced to the public. if it was not possible to attribute a case to a specific county, then it was still counted for the state in which they were treated. the johns hopkins university center for systems science and engineering was the first group to aggregate covid- data and release it to the public in an accessible and sizable manner. [ ] this group publishes total cases, recovered cases, and deaths at the national, state, and, more recently, county level. covid tracking project data. the covid tracking project is a grassroots effort incubated by the atlantic that tracks covid- testing in u.s. states. [ ] this group releases daily updates for the number of positive tests, negative tests, pending tests, hospitalizations, number of patients in the intensive care unit, and deaths. since there is a high amount of variability in state reporting, some of these data are not available for every state. these three data resources use different strategies to aggregate covid- data from multiple sources. since a gold standard has not been established, we compared the consistency of these sources with the centers for disease control and prevention (cdc). [ ] the cdc only releases data for confirmed cases for the entire country, so that was the only metric that could be compared between all four sources. all states, the district of columbia, and five u.s. territories were included. we used the u.s. census bureau's lists of counties comprising major metropolitan areas. [ ] to track the proportion of each area's residents that became infected or died of covid- , we used the u.s. census bureau's population estimate for each county to normalize data to tests, cases, and deaths per , residents. [ ] code. the application, referred to as the covid- watcher, checks for data updates from the nyt and covid tracking project every hour. when data updates are released, they are automatically downloaded onto the server and incorporated into the web resource. new data must pass a quality control check that ensures updated data files are the anticipated size and format. data visualizations are generated using the ggplot package [ ] in r statistical software version . . , [ ] and the application was developed using r shiny. [ ] the web resource is hosted in an amazon web services (aws) environment behind a scalable load balancer to accommodate user load. the source code was placed in a public github repository and can be accessed at https://github.com/wisselbd/covid-tracker. the site is maintained by the cincinnati children's hospital medical center division of biomedical informatics. the covid- watcher dashboard can be accessed at https://covid watcher.research.cchmc.org/. the resource includes all u.s. counties, as well as metropolitan areas that are collectively inhabited by over million americans ( . % of the population). a screenshot of the web resource is shown in figure .
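as a concrete illustration of the fetch-check-normalize-plot cycle described above, the following r sketch downloads county-level case counts, applies a basic sanity check, normalizes counts per 100,000 residents, and plots the result with ggplot2. this is a minimal sketch, not the authors' pipeline: the data url and column names are assumptions based on the public nytimes/covid-19-data repository, and the population figures are illustrative placeholders rather than census estimates.

```r
# minimal sketch; assumed NYT data URL and column names, not the authors' pipeline
library(ggplot2)

nyt_url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
cases <- read.csv(nyt_url, stringsAsFactors = FALSE)

# quality-control check in the spirit of the one described above:
# accept the update only if the expected columns are present and the file is non-trivial
expected_cols <- c("date", "county", "state", "cases", "deaths")
stopifnot(all(expected_cols %in% names(cases)), nrow(cases) > 1000)

# illustrative placeholder populations; the paper uses census bureau county estimates
pop <- data.frame(county = c("Hamilton", "Cuyahoga"),
                  state = "Ohio",
                  population = c(8.2e5, 1.2e6))

df <- merge(cases, pop, by = c("county", "state"))
df$date <- as.Date(df$date)
df$cases_per_100k <- 1e5 * df$cases / df$population

ggplot(df, aes(date, cases_per_100k, colour = county)) +
  geom_line() +
  scale_y_log10() +
  labs(x = NULL, y = "cumulative cases per 100,000 residents")
```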
users can view covid- cases and deaths from the nyt at the county, city, state, or national level, and the total number of tests reported by the covid tracking project, including the breakdown between positive and negative tests, is shown for each state. multiple areas can be selected at once and plots autogenerate after each selection. options include normalizing counts by population size, linear and logarithmic axes, and a button to download a screenshot of the plots. users can search tables that display rankings of the least and most affected areas. a summary of the covid- data sources is shown in table . data are updated at the end of each day in all cases except for the nyt, where they are released the following day. the nyt, johns hopkins, and covid tracking project provide easy-to-access download portals, while the cdc only provides a dashboard without an option to download the data. a comparison of confirmed cases reported in each data source is shown in figure . the sources were highly consistent at the national level. in the absence of a uniform government standard for tracking covid- outbreaks in the u.s.a., academic and newsgroup-based data repositories have become the de facto standard. while these datasets are publicly available, they require informatics and data visualization to extract and display information due to their complexity and continual updates. visualizing covid- data in real time through online dashboards is a pragmatic way to meet the medical community's demand for up-to-date information. the data displayed by the covid- watcher can be used to evaluate the effectiveness of mitigation efforts. normalizing data by an area's population shows the relative proportion of the population that has been infected. the logarithmic scale shows the rate of spread, and flattening of the exponential curve indicates the spread of the virus is slowing. users should take caution in using these data to forecast future events. to make projections, these data should be used in conjunction with the university of washington institute for health metrics and evaluation (ihme) model, [ ] the university of pennsylvania's covid- hospital impact model for epidemics (chime) model, [ ] or other sir models. the authors welcome community feedback, ideas for further development, and contributions. the github repository has a section for issue tracking where users can submit comments about the web resource. [ ] alternatively, contributors can make improvements to the code itself by forking the repository, modifying their copy of the code and submitting pull requests back to the authors. these modifications will be reviewed and, if judged to be suitable, merged into the main code. in particular, the authors would like to see community contributions related to geo-personalization of the website visualization, various analytics modeling, data points such as addition of countries, and timeline augmentation. although the datasets reviewed in table have limitations and likely underestimate the scale of the outbreak, they currently serve as the de facto standard. in conclusion, we developed the covid- watcher to communicate up-to-date covid- information to the medical community and general public. the web application's pipeline was developed to be extendable and additional data sources will be added as they become available. we hope that by making the code used by this web resource available to the public, developers will submit ideas for improvement.
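the interactive controls described above (multi-area selection, per-capita normalization, and a logarithmic axis) can be sketched in a few lines of r shiny. this is a toy layout on simulated counts, not the covid- watcher code; the data frame, area names, dates, and population values are made up for illustration.

```r
library(shiny)
library(ggplot2)

# illustrative data frame standing in for the aggregated county/metro counts
demo <- data.frame(
  area       = rep(c("Area A", "Area B"), each = 30),
  date       = rep(seq(as.Date("2020-03-01"), by = "day", length.out = 30), 2),
  cases      = c(cumsum(rpois(30, 5)), cumsum(rpois(30, 12))),
  population = rep(c(5e5, 1.2e6), each = 30)
)

ui <- fluidPage(
  selectInput("areas", "Areas", choices = unique(demo$area),
              selected = "Area A", multiple = TRUE),
  checkboxInput("per_capita", "Normalize per 100,000 residents", FALSE),
  checkboxInput("log_scale", "Logarithmic y-axis", FALSE),
  plotOutput("curve")
)

server <- function(input, output, session) {
  output$curve <- renderPlot({
    df <- subset(demo, area %in% input$areas)
    if (input$per_capita) df$cases <- 1e5 * df$cases / df$population
    p <- ggplot(df, aes(date, cases, colour = area)) + geom_line()
    if (input$log_scale) p <- p + scale_y_log10()
    p
  })
}

shinyApp(ui, server)
```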
since it is possible that public data releases will be interrupted in the future, we recommend that the cdc immediately begin public releases of their entire covid- data so academia can drive further innovation. these tools could not have been developed without many individual and selfless efforts to create resources for the public good. special thanks to danny t.y. wu, phd and sander su for their help launching the site. this research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors. the authors contributed to the application's design, submitted feedback on the manuscript for intellectual content, and approved the final version. b.d.w. and j.w.d. have full access to the data and source code and take responsibility for the integrity and accuracy of the report. figure . screenshot of the covid- watcher web resource. users can view data from the new york times at the county, city, state, or national level. multiple areas can be compared at once. plots for the selected regions automatically generate and have options to view on logarithmic scale or normalize data by the population size. references: an interactive web-based dashboard to track covid- in real time; an interactive visualization of the exponential spread of covid- ; coronavirus in the u.s.: how fast it's growing; covid- county tracker; an ongoing repository of data on coronavirus cases and deaths in the u.s.; novel coronavirus covid- ( -ncov) data repository by johns hopkins csse; the covid tracking project; cumulative total number of covid- cases in the united states by report date; core based statistical areas (cbsas), metropolitan divisions, and combined statistical areas (csas); county population totals; ggplot : elegant graphics for data analysis; r: a language and environment for statistical computing; shiny: web application framework for r, r package version; forecasting covid- impact on hospital bed-days, icu-days, ventilator-days and deaths by us state in the next months; locally informed simulation to predict hospital capacity needs during the covid- pandemic; covid- watcher. the authors have no competing interests to declare. key: cord- -yzwsqlb authors: ray, bisakha; ghedin, elodie; chunara, rumi title: network inference from multimodal data: a review of approaches from infectious disease transmission date: - - journal: j biomed inform doi: . /j.jbi. . . sha: doc_id: cord_uid: yzwsqlb network inference problems are commonly found in multiple biomedical subfields such as genomics, metagenomics, neuroscience, and epidemiology. networks are useful for representing a wide range of complex interactions ranging from those between molecular biomarkers, neurons, and microbial communities, to those found in human or animal populations. recent technological advances have resulted in an increasing amount of healthcare data in multiple modalities, increasing the preponderance of network inference problems. multi-domain data can now be used to improve the robustness and reliability of networks recovered from unimodal data. for infectious diseases in particular, there is a body of knowledge that has been focused on combining multiple pieces of linked information. combining or analyzing disparate modalities in concert has demonstrated greater insight into disease transmission than could be obtained from any single modality in isolation. this has been particularly helpful in understanding incidence and transmission at early stages of infections that have pandemic potential.
novel pieces of linked information in the form of spatial, temporal, and other covariates including high-throughput sequence data, clinical visits, social network information, pharmaceutical prescriptions, and clinical symptoms (reported as free-text data) also encourage further investigation of these methods. the purpose of this review is to provide an in-depth analysis of multimodal infectious disease transmission network inference methods with a specific focus on bayesian inference. we focus on analytical bayesian inference-based methods as this enables recovering multiple parameters simultaneously, for example, not just the disease transmission network, but also parameters of epidemic dynamics. our review studies their assumptions, key inference parameters and limitations, and ultimately provides insights about improving future network inference methods in multiple applications. dynamical systems and their interactions are common across many areas of systems biology, neuroscience, healthcare, and medicine. identifying these interactions is important because they can broaden our understanding of problems ranging from regulatory interactions in biomarkers, to functional connectivity in neurons, to how infectious agents transmit and cause disease in large populations. several methods have been developed to reverse engineer or, identify cause and effect pathways of target variables in these interaction networks from observational data [ ] [ ] [ ] . in genomics, regulatory interactions such as disease phenotype-genotype pairs can be identified by network reverse engineering [ , ] . molecular biomarkers or key drivers identified can then be used as targets for therapeutic drugs and directly benefit patient outcomes. in microbiome studies, network inference is utilized to uncover associations amongst microbes and between microbes and ecosystems or hosts [ , , ] . this can include insights about taxa associations, phylogeny, and evolution of ecosystems. in neuroscience, there is an effort towards recovering brain-connectivity networks from functional magnetic resonance imaging (fmri) and calcium fluorescence time series data [ , ] . identifying structural or functional neuronal pairs illuminates understanding of the structure of the brain, can help better understand animal and human intelligence, and inform treatment of neuronal diseases. infectious disease transmission networks are widely studied in public health. understanding disease transmission in large populations is an important modeling challenge because a better understanding of transmission can help predict who will be affected, and where or when they will be. network interactions can be further refined by considering multiple circulating pathogenic strains in a population along with strain-specific interventions, such as during influenza and cold seasons. thus, network interactions can be used to inform interventional measures in the form of antiviral drugs, vaccinations, quarantine, prophylactic drugs, and workplace or school closings to contain infections in affected areas [ ] [ ] [ ] [ ] . developing robust network inference methods to accurately and coherently map interactions is, therefore, fundamentally important and useful for several biomedical fields. as summarized in fig. 
, many methods have been used to identify pairwise interactions in genomics, neuroscience [ , ] and microbiome research [ ] including correlation and information gain-based metrics for association, inverse covariance for conditional independence testing, and granger causality for causation from temporal data. further, multimodal data integration methods such as horizontal integration, model-based integration, kernelbased integration, and non-negative matrix factorization have been used to combine information from multiple modalities of 'omics' data such as gene expression, protein expression, somatic mutations, and dna methylation with demographic, diagnoses, and phenotypical clinical data. bayesian inference has been used to analyze changes in gene expression from microarray data as dna measurements can have several unmeasured confounders and thereby incorporate noise and uncertainty [ ] . multi-modal integration can be used for classification tasks, to predict clinical phenotypes such as tumor stage or lymph node status, for clustering of patients into subgroups, and to identify important regulatory modules [ ] [ ] [ ] [ ] [ ] . in neuroscience, not just data integration, but multimodal data fusion has been performed by various methods such as linear regression, structural equation modeling, independent component analysis, principal component analysis, and partial least squares [ ] . multiple modalities such as fmri, electroencephalography, and diffusion tensor imaging (dti) have been jointly analyzed to uncover more details than could be captured by a single imaging technique [ ] . in metagenomics, network inference from microbial data has been performed using methods such as inverse covariance and correlation [ ] . in evolutionary biology, the massive generation of molecular data has enabled bayesian inference of phylogenetic trees using markov chain monte carlo chain (mcmc) techniques [ , ] . in infectious disease transmission network inference, bayesian inference frameworks have been primarily used to integrate data such as dates of pathogen sample collection and symptom report date, pathogen genome sequences, and locations of patients [ ] [ ] [ ] . this problem remains challenging as the data generative processes and scales of heterogeneous modalities may be widely different, transformations applied to separate modalities may not preserve the interactions between modalities, and separately integrated models may not capture interaction effects between modalities [ ] . as evidence mounts regarding the complex combination of biological, environmental, and social factors behind disease, emphasis on the development of advanced modeling and inference methods that incorporate multimodal data into singular frameworks has increased. these methods are becoming more important to consider given that the types of healthcare data available for understanding disease pathology, evolution, and transmission are numerous and growing. for example, internet and mobile connectivity has enabled mobile sensors, point-of-care diagnostics, web logs, and participatory social media data which can provide complementary health information to traditional sources [ , ] . in the era of precision medicine, it becomes especially important to combine clinical information with biomarker and environmental information to recover complex genotype-phenotype maps [ ] [ ] [ ] [ ] . 
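as a small, self-contained illustration of two of the pairwise approaches listed at the start of this passage, correlation for marginal association and inverse covariance for conditional independence, the r sketch below simulates a simple chain a -> b -> c and contrasts the two views. the simulated variables and thresholds are purely illustrative and are not drawn from any of the cited studies.

```r
set.seed(42)
n <- 500
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.5)   # b depends on a
c <- b + rnorm(n, sd = 0.5)   # c depends on b only
x <- cbind(a = a, b = b, c = c)

cor(x)["a", "c"]              # strong marginal association between a and c

# partial correlations from the precision (inverse covariance) matrix
prec <- solve(cov(x))
pcor <- -prec / sqrt(outer(diag(prec), diag(prec)))
diag(pcor) <- 1
pcor["a", "c"]                # near zero: a and c are conditionally independent given b
```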
infectious disease networks are one area where the need to bring together data types has long been recognized, specifically to better understand disease transmission. data sources including high-throughput sequencing technologies have enabled genomic data to become more cost effective, offering support for studying transmission by revealing pathways of pathogen introduction and evolution in a population. yet, genomic data in isolation is insufficient to obtain a comprehensive picture of disease in the population. while these data can provide information about pathogen evolution, genetic diversity, and molecular interaction, they do not capture other environmental, spatial, and clinical factors that can affect transmission. for infectious disease surveillance, this information is usually conveyed through epidemiological data, which can be collected in various ways such as in clinical settings from the medical record, or in more recent efforts through web search logs, or participatory surveillance. participatory surveillance data types typically include age, sex, date of symptom onset, and diagnostic information such as severity of symptoms. in clinical settings, epidemiological data are generally collected from patients reporting illness. this can include, for example, age at diagnosis, sex, race, family history, diagnostic information such as severity of symptoms, and phenotypical information such as presence or absence of disease which may not be standardized. highthroughput sequencing of pathogen genomes, along with linked spatial and temporal information, can advance surveillance by increasing granularity and leading to a better understanding of the spread of an infectious disease [ ] . considerable efforts have been made to unify genomic and epidemiologic information from traditional clinical forms into singular statistical frameworks to refine understanding of disease transmission [ ] [ ] [ ] [ ] [ ] [ ] . one approach to design and improve disease transmission models has been to analytically combine multiple, individually weak predictive signals in the form of sparse epidemiological, spatial, pathogen genomic, and temporal data [ , , , , ] . molecular epidemiology is the evolving field wherein the above data types are considered together; epidemiological models are used in concert with pathogen phylogeny and immunodynamics to uncover disease transmission patterns [ ] . pathogen genomic data can capture within-host pathogen diversity (the product of effective population size in a generation and the average pathogen replication time [ , ] ) and dynamics or provide information critical to understanding disease transmission such as evidence of new transmission pathways that cannot be inferred from epidemiological data alone [ , ] . in addition, the remaining possibilities can then be examined using any available epidemiological data. as molecular epidemiology and infectious disease transmission are areas in which network inference methods have been developed for bringing together multimodal data we use this review to investigate the foundational work in this specific field. a summary of data types, relevant questions and purpose of such studies is summarized in fig. , and we further articulate the approaches below. in molecular epidemiology, several approaches have been used to overlay pathogen genomic information on traditionally collected epidemiologic information to recover transmission networks. 
additional modeling structure is needed in these problems because infectious disease transmission occurs through contact networks of heterogeneous individuals, which may not be captured by compartmental models such as susceptible-infec tious-recovered (sir) and susceptible-latent-infectious-recov ered (slir) models [ ] . as well, for increased utility in epidemiology, there is a necessity to estimate epidemic parameters in addition to the transmission network. unlike other fields wherein recovery of just the topology of the networks is desired, in molecular epidemiology bayesian inference is commonly used to reverse engineer infectious disease transmission networks in addition to estimating epidemic parameters (fig. ). while precise features can be extracted from observed data, there are latent variables not directly measured which must simultaneously be considered to provide a complete picture. thus, bayesian inference methods have been used to simultaneously infer epidemic parameters and structure of the transmission network in a single framework. instead of capturing pairwise interactions, such as correlations or inverse covariance, bayesian inference is capable of considering all nodes and inferring a global network and transmission parameters [ ] . moreover, bayesian inference is capable of modeling noisy, partially sampled realistic outbreak data while incorporating prior information. while this review focuses on infectious disease transmission, network inference methods have implications in many areas. modeling network diffusion and influence, identifying important nodes, link prediction, influence probabilities and community topology and parameter detection are key questions in several fields ranging from genomics to social network analysis [ ] . analogous frameworks can be developed with different modalities of observational genomics or clinical data to model information propagation and capture the influences of nodes, nodes that are more influential than others, and the temporal dynamics of information diffusion. for modeling information spread in such networks, influence and susceptibility of nodes can serve to be analogous to epidemic transmission parameters. however, these modified methods should also account for differences in the method of information propagation in such networks from infectious disease transmission by incorporating constraints in the form of temporal decay of infection, strengths of ties measured from biological domain knowledge, and multiple pathways of information spread. to identify the studies most relevant for this focused review, we queried pubmed. for practicality and relevance, our search, summarized in fig. , was limited to papers from the last ten years. as our review is focused on infectious disease transmission network inference, we started with the keywords 'transmission' and 'epidemiological'. to ensure that we captured studies that incorporate pathogen genomic data, we added the keywords 'genetic', 'genomic' and 'phylogenetic' giving articles total. next, to narrow the results to those that are comprised of a study of multi-modal data, we found that the keywords 'combining' or 'integrating' alongside 'bayesian inference' or 'inference' were comprehensive. these filters yielded and articles in total. we found that some resulting articles focused on outbreak detection, sexually transmitted diseases, laboratory methods, and phylogenetic analysis. 
also, the focus of several articles was to either overlay information from different modalities or to sequentially analyze them to eliminate unlikely transmission pathways. after a full-text review to exclude these and focus on methodological approaches, articles resulted which use bayesian inference to recover transmission networks from multimodal data for infectious diseases, and which represent the topic of this review. this included bayesian likelihood-based methods for integrating pathogen genomic information with temporal, spatial, and epidemiological characteristics for infectious diseases such as foot and mouth disease (fmd), and respiratory illnesses, including influenza. as incorporating genomic data simultaneously in analytical multimodal frameworks is a relatively novel idea, the literature on this is limited. recent unified platforms have been made available to the community for analysis of outbreaks and storing of outbreak data [ ] . thus, it is essential to review available literature on this novel and burgeoning topic. for validation, we repeated our queries on google scholar. although google scholar generated a much broader range of papers, based on the types of papers indexed, we verified that it also yielded the articles selected from pubmed. we are confident in our choice of articles for review as we have used two separate publications databases. below we summarize the theoretical underpinnings of the likelihood-based framework approaches, inference parameters, and assumptions about each of these studies and articulate the limitations, which can motivate future research. infectious disease transmission study is a rapidly developing field given the recent advent of widely available epidemiological, social contact, social networking and pathogen genomic data. in this section we briefly review multimodal integration methods for combining pathogen genomic data and epidemiological data in a single analysis, for inferring infection transmission trees and epidemic dynamic parameters. advances in genomic technology such as sequences of whole genomes of rna viruses and identifying single nucleotide variations using sensitive mass spectrometry have enabled the tracing of transmission patterns and mutational parameters of the severe acute respiratory syndrome (sars) virus [ ] . in this study, phylogenetic trees were inferred based on phylogenetic analysis using parsimony (paup ⁄ ) using a maximum likelihood criterion [ ] . mutation rate was then inferred based on a model which assumes that the number of mutations observed between an isolate and its fig. . study design and inclusion-exclusion criteria. this is a decision tree showing our searches and selection criteria for both pubmed and google scholar. we focused only on genomic epidemiology methods utilizing bayesian inference for infectious disease transmission. ancestor is proportional to the mutation rate and their temporal difference [ ] . their estimated mutation rate was similar to existing literature on mutation rates of other viral pathogens. phylogenetic reconstruction revealed three major branches in taiwan, hong kong, and china. gardy et al. [ ] analyzed a tuberculosis outbreak in british columbia in using whole-genome pathogen sequences and contact tracing using social network information. epidemiological information collection included completing a social network questionnaire to identify contact patterns, high-risk behaviors such as cocaine and alcohol usage, and possible geographical regions of spread. 
pathogen genomic data consisted of restriction-fragment-length polymorphism analysis of tuberculosis isolates. phylogenetic inference of genetic lineage based on single nucleotide polymorphisms from the genomic data was performed. their method demonstrated that transmission information inferred from contact tracing by epidemiological investigation, such as the identification of a possible source patient, can be refined by adding ancestral and diversity information from genomic data. in one of the earliest attempts to study genetic sequence data, as well as dates and locations of samples, in concert, jombart et al. [ ] proposed a maximal spanning tree graph-based approach that went beyond existing phylogenetic methods. this method was utilized to uncover the spatiotemporal dynamics of influenza a (h n ) and to study its worldwide spread. a total of gene sequences of hemagglutinin (ha) and of neuraminidase (na) were obtained from genbank. classical phylogenetic approaches fail to capture the hierarchical relationship between ancestors and descendants sampled at the same time. using their algorithm called seqtrack [ ] , the authors constructed ancestries of the samples based on a maximal-spanning tree. seqtrack [ ] utilizes the fact that, in the absence of recombination and reverse mutations, strains will have unique ancestors characterized by the fewest possible mutations, no sample can be the ancestor of a sample which temporally preceded it, and the likelihood of ancestry can be estimated from the genomic differentiation between samples. seqtrack was successful in reconstructing the transmission trees in both completely and incompletely sampled outbreaks, unlike phylogenetic approaches, which failed to capture ancestral relationships between the tips of trees. however, this method cannot capture the underlying within-host virus genetic parameters. moreover, mutations generated once can be present in different samples, and transmission likelihood based on genetic distance may not be reliable. the above methods exploit information from different modalities separately. recent methodological advancements have seen simultaneous integration of multiple modalities of data in singular bayesian inference frameworks. in the following section we discuss state-of-the-art approaches based on bayesian inference to reconstruct partially-observed transmission trees and multiple origins of pathogen introduction in a host population [ , , , , ] . we specifically focus on bayesian likelihood-based methods as these methods consider heterogeneous modalities in a single framework and simultaneously infer the transmission network and epidemic parameters such as the rate of infection transmission and the rate of recovery. infectious disease transmission network inference is one problem area wherein there is a foundational literature of bayesian inference methods; reviewing them together allows understanding and comparison of specific related features across models. methods are summarized in table . in bayesian inference, information recorded before the study is included as a prior in the hypothesis. based on bayes theorem as shown below, this method incorporates prior information and likelihoods from the sample data to compute a posterior probability distribution, p(hypothesis | data). the denominator is a normalization constant or the marginal probability density of the sample data computed over all hypotheses [ ] .
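for reference, in the notation used in this paragraph the relationship is the standard form of bayes' theorem, with the marginal probability of the data in the denominator acting as the normalization constant:

$$
p(\mathrm{hypothesis} \mid \mathrm{data}) \;=\; \frac{p(\mathrm{data} \mid \mathrm{hypothesis})\, p(\mathrm{hypothesis})}{p(\mathrm{data})}
$$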
the hypothesis for this problem can be expressed in the form of a transmission network over individuals, locations, or farms, parameters such as rate of infectiousness and recovery, or mutation probability of pathogens. the posterior probability distribution can then be estimated as in the equation below. the posterior probability is then a measure that the inferred transmission tree and parameters are correct. it can be extremely difficult to analytically compute the posterior probability distribution as it involves iterating over all possible combinations of branches of such a transmission tree and parameter values. however, it is possible to approximate the posterior probability distribution using mcmc [ ] techniques. in mcmc, a markov chain is constructed which is described by the state space of the parameters of the model and which has the posterior probability distribution as its stationary distribution. for an iteration of the mcmc, a new tree is proposed by stochastically altering the previous tree. the new tree is accepted or rejected based on a probability computed from a metropolis-hastings or gibbs update. the quality of the results from the mcmc approximation can depend on the number of iterations that it is run for, the convergence criterion and the accuracy of the update function [ ] . cottam et al. [ ] developed one of the earliest methods to address this problem studying foot-and-mouth disease (fmd) in twenty farms in the uk. in this study, fmd virus genomes (the fmd virus has a positive strand rna genome and it is a member of the genus aphthovirus in the family picornaviridae) were collected from clinical samples from the infected farms. the samples were chosen so that they could be used to study variation within the outbreak and the time required for accumulation of genetic change, and to study transmission events. total rna was extracted directly from epithelial suspensions, blood, or esophageal suspensions. sanger sequencing was performed on overlapping amplicons covering the genome [ ] . as the rna virus has a high substitution rate, the number of mutations was sufficient to distinguish between different farms. they designed a maximum likelihood-based method incorporating complete genome sequences, date at which infection in a farm was identified, and the date of culling of the animals. the goal was to trace the transmission of fmd in durham county, uk during the outbreak to infer the date of infection of animals and most likely period of their infectiousness. in their approach, they first generated the phylogenies of the viral genomes [ , ] . once the tip of the trees were generated, they constructed possible transmission trees by recursively working backwards to identify a most recent common ancestor (mrca) in the form of a farm and assigned each haplotype to a farm. the likelihood of each tree was then estimated using epidemiological data. their study included assumptions of the mean incubation time prior to being infectious to be five days, the distribution of incubation times to follow a discrete gamma distribution, the most likely date of infection to be the date of reporting minus the date of the oldest reported lesion of the farm minus the mean incubation time, and the farms to be a source of infection immediately after being identified as infected up to the day of culling. spatial dependence in the transmission events was determined from the transmission tree by studying mean transmission distance. 
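the metropolis-hastings update described earlier in this section can be made concrete with a toy example. the sketch below is not any of the reviewed methods (which propose changes to entire transmission trees and their parameters jointly); it samples the posterior of a single epidemic quantity, the mean generation interval, from simulated interval data, under an assumed exponential likelihood, an assumed exponential prior, and a symmetric random-walk proposal.

```r
# toy Metropolis-Hastings sampler; all distributions and values are illustrative assumptions
set.seed(1)
obs <- rexp(50, rate = 1 / 5)            # simulated generation intervals (true mean 5 days)

log_post <- function(mean_gi) {
  if (mean_gi <= 0) return(-Inf)
  sum(dexp(obs, rate = 1 / mean_gi, log = TRUE)) +   # likelihood of the observed intervals
    dexp(mean_gi, rate = 1 / 10, log = TRUE)         # weak exponential prior, mean 10 days
}

n_iter <- 10000
chain <- numeric(n_iter)
chain[1] <- 2                            # arbitrary starting value
for (i in 2:n_iter) {
  proposal <- chain[i - 1] + rnorm(1, sd = 0.5)      # symmetric random-walk proposal
  log_alpha <- log_post(proposal) - log_post(chain[i - 1])
  chain[i] <- if (log(runif(1)) < log_alpha) proposal else chain[i - 1]
}
mean(chain[-(1:2000)])                   # posterior mean after discarding burn-in
```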
[ ] developed a bayesian likelihood-based framework integrating genetic and epidemiological data. this method was tested on an epidemic dataset of poultry farms in an epidemic of avian influenza a (h n ) in the netherlands in consisting of geographical, genomic, and date of culling data. consensus sequences of the ha, na and polymerase pb genes were derived by pooling sequence data from five infected animals for out of the farms analyzed. the likelihood of one farm infecting another increased if the former was not culled at the time of infection of the latter, if they were in geographical proximity, or if the sampled pathogen genomic sequences were related. their model included several assumptions such as non-correlation of genetic distance, time of infection, and geographical distance between host and target farms. the likelihood function was generated as follows: for the temporal component, a farm could infect another if its infection time was before the infection time of the target farm or if the infection time of the latter was between the infection and culling time of the former. if a farm was already culled, its infectiousness decayed exponentially. for the geographical component, two farms could infect each other with likelihood equal to the inverse of the distance between them. this likelihood varied according to a spatial kernel. for the genomic component, probabilities of transitions and transversions, and the presence or absence of a deletion was considered. if there was no missing data, the likelihood function was just a product of independent geographical, genomic, and temporal components. this method also allowed missing data by assuming that all the links to a specific missing data type are in one subtree. mcmc [ ] was performed to sample all possible transmission trees and parameters. marginalizing over a large number of subtrees over all possible values can also prove computationally expensive. mutations were assumed to be fixed in the population before or after an infection, ignoring a molecular clock. in the method by morelli et al. [ ] , the authors developed a likelihood-based function that inferred the transmission trees and infection times of the hosts. the authors assumed that a premise or farm can be infected at a certain time followed by a latency period, a time period from infectiousness to detection, and a time of pathogen collection. this method utilized the fmd dataset from the study by cottam et al. in order to simplify the posterior distribution further, latent variables denoting unobserved pathogens were removed and a pseudo-distribution incorporating the genetic distance between the observed and measured consensus sequences was generated. the posterior distribution corresponded to a pseudo-posterior distribution because the pathogens were sampled at observation time and not infection time. the genetic distance was measured by hamming distance between sequences in isolation without considering the entire genetic network. several assumptions including independence of latency time and infectiousness period were made. in determining the interval from the end-of-latency period to detection, the informative prior was centered on lesion age. this made this inference technique sensitive to veterinary estimates of lesion age. this study considered a single source of viral introduction in the population, which is feasible if the population size considered is small. this technique did not incorporate unobserved sources of infection and assumed all hosts were sampled. 
the authors also assumed that each host had the same probability of being infected. teunis et al. [ ] developed a bayesian inference framework to infer transmission probability matrices. the authors assumed that likelihood of infection transmission over all observed individuals would be equal to the product of conditional probability distributions between each pair of individuals i and j, and the correspond-ing entry from the transition probability matrix representing any possible transmissions from ancestors to i. the inferred matrices could be utilized to identify network metrics such as number of cases infected by each infected source and transmission patterns could be detected by analyzing pairwise observed cases during an outbreak. the likelihood function could be generated by observed times of onset, genetic distance, and geographical locations. their inferred parameters were the transmission tree and reproductive number. their method was applied to a norovirus outbreak in a university hospital in netherlands. in a method developed by ypma et al. [ ] , the statistical framework for inferring the transmission tree simultaneously generated the phylogenetic tree. this method also utilized the fmd dataset from the study by cottam et al. their approach for generating the joint posterior probability of the transmission tree differed from existing methods in including the simultaneous estimation of the phylogenetic tree and within-host dynamics. the posterior probability distribution defined a sampling space consisting of the transmission tree, epidemiological parameters, and withinhost dynamics which were inferred from the measured epidemiological data and the phylogenetic tree and mutation parameters which were inferred from the pathogen genomic data. the posterior probability distribution was estimated using the mcmc technique. the performance of the method was evaluated by measuring the probability assigned to actual transmission events. the assumptions made were that all infected hosts were observed, time of onset was known, sequences were sampled from a subpopulation of the infected hosts, and a single source/host introduced the infection in the population. in going beyond existing methods, the authors did not assume that events in the phylogenetic tree coincide with actual transmission events. a huge sampling fraction would be necessary to capture such microscale genetic diversity. this method works best when all infected hosts are observed and sampled. mollentze et al. [ ] have used multimodal data in the form of genomic, spatial and temporal information to address the problem of unobserved cases, an existing disease well established in a population, and multiple introductions of pathogens. their method estimated the effective size of the infected population thus being able to provide insight into number of unobserved cases. the authors modified morelli et al.'s method described above by replacing the spatial kernel with a spatial power transmission kernel to accommodate wider variety of transmission. in addition, the substitution model used by morelli et al. was modified by a kimura three parameter model [ ] . this method was applied to a partially-sampled rabies virus dataset from south africa. the separate transmission trees from partially-observed data could be grouped into separate clusters with most transmissions in the under-sampled dataset being indirect transmissions. reconstructions were sensitive to choice of priors for incubation and infectious periods. 
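to make the structure of these pairwise likelihoods more tangible, the sketch below combines, in highly simplified form, two ideas from the preceding paragraphs: support for "i infected j" factorized into independent temporal, spatial (power-kernel), and genetic terms, and the normalization of that support into a transmission probability matrix in the spirit of teunis et al. every distribution, kernel exponent, rate, and data value here is an illustrative assumption, not a parameter from any of the published models.

```r
# schematic pairwise-likelihood and transmission-probability-matrix sketch (illustrative only)
onset   <- c(0, 4, 6, 11)                    # symptom-onset days of four cases
dist_km <- matrix(c(0, 1, 8, 20,
                    1, 0, 6, 18,
                    8, 6, 0, 12,
                   20, 18, 12, 0), 4, 4)     # pairwise distances between cases (km)
n_mut   <- matrix(c(0, 1, 3, 6,
                    1, 0, 2, 5,
                    3, 2, 0, 4,
                    6, 5, 4, 0), 4, 4)       # pairwise genetic differences (substitutions)

pair_support <- function(i, j, si_mean = 4, kernel_power = 2,
                         mut_rate = 1e-3, n_sites = 2000) {
  gap <- onset[j] - onset[i]
  if (i == j || gap <= 0) return(0)                              # i must precede j
  temporal <- dgamma(gap, shape = 2, rate = 2 / si_mean)          # serial-interval term
  spatial  <- (1 + dist_km[i, j])^(-kernel_power)                 # spatial power kernel
  genetic  <- dbinom(n_mut[i, j], size = n_sites, prob = mut_rate) # substitution term
  temporal * spatial * genetic
}

support <- outer(1:4, 1:4, Vectorize(pair_support))
# normalize each column so the weights over candidate infectors of case j sum to 1
prob <- sweep(support, 2, pmax(colSums(support), .Machine$double.eps), "/")
round(prob, 2)     # column j: weights over candidate infectors of case j
rowSums(prob)      # expected secondary cases attributed to each case (a crude proxy for R)
```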
in a more recent approach to study outbreaks and possible transmission routes, jombart et al. [ ] , in addition to reconstructing the transmission tree, addressed important issues such as inferring possible infection dates, secondary infections, mutation rates, multiple pathways of pathogen introduction, foreign imports, unobserved cases, proportion of infected hosts sampled, and superspreading in a bayesian framework. jombart tested their algorithm outbreaker on the sars outbreak in singapore using known cases of primary and secondary infection [ , , ] . in this study, genome sequences of severe acute respiratory syndrome (sars) were downloaded from genbank and analyzed. their method relies on pathogen genetic sequences and collection dates. similar to their previous approach [ ] , their method assumed mutations to be parameters of transmission events. epidemiological pseudo-likelihood was based on collection dates. genomic pseudo-likelihood was computed based on genetic distances between isolates. this method would benefit from known transmission pathways and mutation rates and is specifically suitable for densely sampled outbreaks. their method assumed generation time-time from primary to secondary infections-and time from infection to collection were available. their method ignored within-host diversity of pathogens. instead of using a strict molecular clock, this method used a generational clock. didelot et al. [ ] developed a framework to examine if wholegenome sequences were enough to capture transmission events. unlike other existing studies, the authors took into account within-host evolution and did not assume that branches in phylogenetic trees correspond to actual transmission events. the generation time corresponds to the time between a host being infected and infecting others. for pathogens with short generation times, genetic diversity may not accrue to a very high degree and one can ignore within-host diversity. however, for diseases with high latency times and ones in which the host remains asymptomatic, there is scope for accumulation of considerable within-host genetic diversity. their method used a timed phylogenetic tree from which a transmission tree is inferred on its own or can be combined with any available epidemiological support. their simulations revealed that considering within-host pathogen generation intervals resulted in more realistic phylogenies between infector and infected. the method was tested on simulated datasets and with a real-world tuberculosis dataset with a known outbreak source with only genomic data and then modified using any available epidemiological data. the latter modified network resembled more the actual transmission activity in having a web-like layout and fewer bidirectional links. their approach would work well for densely sampled outbreaks. some of the most common parameters inferred for infectious disease transmission in these bayesian approaches are the transmission tree between infected individuals or animals, the mutation rates of different pathogens, phylogenetic tree, within-host diversity, latency period, and infection dates [ , , , ] . additional parameters in recent work are reproductive number [ ] , foreign imports, superspreaders, and proportion of infected hosts sampled [ ] . several simplifying assumptions have been made in the reviewed bayesian studies, limiting their applicability across different epidemic situations. 
in cottam's [ ] approach, the phylogenetic trees generated from the genomic data are weighed by epidemiological factors to limit analysis to possible transmission trees. however, sequential approaches may not be ideal to reconstruct transmission trees and a method that combines all modalities in a single likelihood function may be necessary. ypma et al. [ ] assumed that pathogen mutations emerge in the host population immediately before or following infections. moreover, the approach weighed each data type via their likelihood functions and considers each data type independent of the others, which may not be a realistic assumption. jombart et al. [ ] also inferred ancestral relationships to the most closely sampled ancestor as all ancestors may not be sampled. morelli et al. [ ] assumed flat priors for all model parameters. however, the method was estimated with the prior for the duration from latency to infection centered on the lesion age making the method sensitive to it and to veterinary assessment of infection age. the method developed by mollentze et al. [ ] required knowledge of epidemiology for infection and incubation periods. identifying parents of infected nodes, as proposed by teunis et al., [ ] assumes that all infectious cases were observed which may not be true in realistic, partiallyobserved outbreaks. didelot et al. [ ] developed a framework based on a timed phylogenetic tree, which infers within-host evolutionary dynamics with a constant population size and denselysampled outbreaks. several of these approaches rely on assumptions of denselysampled outbreaks, a single pathogen introduction in the population, single infected index cases, samples capturing the entire outbreak, that all cases comprising the outbreak are observed, existence of single pathogen strains, and all nodes in the transmission network having constant infectiousness and the same rate of transmission. however, in real situations the nodes will have different infectiousness and rate of spreading from animal to animal, or human to human. moreover, the use of clinical data only is nonrepresentative of how infection transmits to a population as it generally only captures the most severely affected cases. our literature review is summarized in table . as large-scale and detailed genomic data becomes more available, analyses of existing bayesian inference methods described in our review will inform their integration in epidemiological and other biomedical research. as more and more quantities of diverse data becomes available, developing bayesian inference frameworks will be the favored tool to integrate information and draw inference about transmission and epidemic parameters simultaneously. the specific focus in this review on the application of network inference in infectious disease transmission enables us to consider and comment on common parameters, data types and assumptions (summarized in table ). novel data sources have increased the resolution of information as well as enabled a closer monitoring and study of interactions; spatial and genomic resolution of the bayesian network-inference studies reviewed are summarized in fig. to illustrate the scope of current methods. further, we have added suggestions for addressing identified challenges in these methods regarding their common assumptions and parameters in table . 
given the increasing number and types of biomedical data available, we also discuss how models can be augmented to harness added value from these multiple and highergranularity modalities such as minor variant identification from deep sequencing data or community-generated epidemiological data. existing methods are based on pathogen genome sequences which may largely be consensus in nature where the nucleotide or amino acid residue at any given site is the most common residue found at each position of the sequence. other recent approaches have reconstructed epidemic transmission using whole genome sequencing. detailed viral genomic sequence data can help distinguish pathogen variants and thus augment analysis of transmission pathways and host-infectee relationships in the population. highly parallel sequencing technology is now available to study rna and dna genomes at greater depth than was previously possible. using advanced deep sequencing methods, minor variations that describe transmission events can be captured and must also then be represented in models [ , ] . models can also be encumbered with considerable selection bias by being based on clinical or veterinary data representative of a subsample of only the most severely infected hosts who access clinics. existing multi-modal frameworks are designed based on clinical data such as sequences collected from cases of influenza [ , ] or veterinary assessment of fmd [ , ] , which generally represent the most severe cases with access to traditional healthcare institutions and automatically inherit considerable selection bias. models to-date do not consider participatory surveillance data that has become increasingly available via mobile and internet accessibility (e.g. data from web logs, search queries, web survey-based participatory efforts such as goviral with linked symptomatic, immunization, and molecular information [ ] and online social networks and social network questionnaires). another approach to improve the granularity of collected data could be community-generated data. these data can be finegrained and can capture information on a wide range of cases from asymptomatic to mildly infectious to severe. this data can be utilized to incorporate additional transmission parameters of a community which can be more representative of disease transmission. as exemplified in fig. a , community-generated data can be collected at the fine-grained spatial level of households, schools, workplaces, or zip codes and models must then also accommodate these spatial resolutions. studies to-date have also generally depended on available small sample sizes and some are specifically tailored to a specific disease or pathogen such as sars, avian influenza, or fmd [ , , ] . hiseq platform with m. tuberculosis cdc reference sequence and aligned using burrows-wheeler aligner algorithm. sars dna sequences were obtained from genbank and aligned using muscle. for avian influenza, rna consensus sequences of the haemagglutinin, neuriminidase and polymerase pb genes were sequenced. for h n influenza, isolates were typed for hemagglutinin (ha) and neuraminidase (na) genes. methods will have to handle missing data and unobserved and unsampled hosts to be applicable to realistic scenarios. in simpler cases, assumptions of single introductions of infection with single strains being passed between hosts may be adequate. 
however, robust frameworks will have to consider multiple introductions of pathogens in the host population with multiple circulating strains and co-infections in hosts. in order to be truly useful, frameworks have to address questions regarding rapid mutations of certain pathogens, phylogenetic uncertainty, recombination and reassortment, population stochastics, super spreading, exported cases, multiple introductions of pathogens in a population, within and between-host pathogen evolution, and phenotypic information. methods will also need to scale up to advances in nextgeneration sequencing technology capable of producing large amounts of genomic data inexpensively [ , ] . in the study of infectious diseases, the challenge remains to develop robust statistical frameworks that will take into account the relationship between epidemiological data and phylogeny and utilize that to infer pathogen transmission while taking into account realistic evolutionary times and accumulation of withinhost diversity. moreover, to benefit public health inference methods need to uncover generic transmission patterns, wider range of infections and risks including asymptomatic to mildly infectious cases, clusters and specific environments, and host types. network inference frameworks from the study of infectious diseases can be analogously modified to incorporate diverse forms of multimodal data and model information propagation and interactions in diverse applications such as drug-target pairs and neuronal connectivity or social network analysis. the detailed examination of models, data sources and parameters performed here can inform inference methods in different fields, and bring to light the way that new data sources can augment the approaches. in general, this will enable understanding and interpretation of influence and information propagation by mapping relationships between nodes in other applications. 
review of multimodal integration methods for transmission network inference
a comprehensive assessment of methods for de-novo reverse-engineering of genome-scale regulatory networks
sparse and compositionally robust inference of microbial ecological networks
model-free reconstruction of excitatory neuronal connectivity from calcium imaging signals
dialogue on reverse-engineering assessment and methods
molecular ecological network analyses
marine bacterial, archaeal and protistan association networks reveal ecological linkages
network modelling methods for fmri
modeling the worldwide spread of pandemic influenza: baseline case and containment interventions
a 'small-world-like' model for comparing interventions aimed at preventing and controlling influenza pandemics
reducing the impact of the next influenza pandemic using household-based public health interventions
estimating the impact of school closure on influenza transmission from sentinel data
model-free reconstruction of excitatory neuronal connectivity from calcium imaging signals
network modelling methods for fmri
sparse and compositionally robust inference of microbial ecological networks
a bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes
mvda: a multi-view genomic data integration methodology
information content and analysis methods for multi-modal high-throughput biomedical data
a novel computational framework for simultaneous integration of multiple types of genomic data to identify microrna-gene regulatory modules
a kernel-based integration of genome-wide data for clinical decision support
predicting the prognosis of breast cancer by integrating clinical and microarray data with bayesian networks
a review of multivariate methods for multimodal fusion of brain imaging data
bayesian inference of phylogeny and its impact on evolutionary biology
mrbayes . : efficient bayesian phylogenetic inference and model choice across a large model space
a bayesian inference framework to reconstruct transmission trees using epidemiological and genetic data
unravelling transmission trees of infectious diseases by combining genetic and epidemiological data
bayesian inference of infectious disease transmission from whole-genome sequence data
methods of integrating data to uncover genotype-phenotype interactions
why we need crowdsourced data in infectious disease surveillance
whole-genome sequencing and social-network analysis of a tuberculosis outbreak
novel clinico-genome network modeling for revolutionizing genotype-phenotype-based personalized cancer care
integrative, multimodal analysis of glioblastoma using tcga molecular data, pathology images, and clinical outcomes
an informatics research agenda to support precision medicine: seven key areas
the foundation of precision medicine: integration of electronic health records with genomics through basic, clinical, and translational research
relating phylogenetic trees to transmission trees of infectious disease outbreaks
bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data
extracting transmission networks from phylogeographic data for epidemic and endemic diseases: ebola virus in sierra leone, h n pandemic influenza and polio in nigeria
the role of pathogen genomics in assessing disease transmission
reconstructing disease outbreaks from genetic data: a graph approach
molecular epidemiology: application of contemporary techniques to the typing of microorganisms
integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus
the distribution of pairwise genetic distances: a tool for investigating disease transmission
the mathematics of infectious diseases
dynamics and control of diseases in networks with community structure
outbreaktools: a new platform for disease outbreak analysis using the r software
mutational dynamics of the sars coronavirus in cell culture and human populations isolated in
phylogenetic analysis using parsimony (and other methods). version , sinauer associates
molecular evolution and phylogenetics
adegenet: a r package for the multivariate analysis of genetic markers
a bayesian approach for inferring the dynamics of partially observed endemic infectious diseases from space-time-genetic data
bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data
bayesian inference in ecology
an introduction to mcmc for machine learning
molecular epidemiology of the foot-and-mouth disease virus outbreak in the united kingdom in
tcs: a computer program to estimate gene genealogies
a cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and dna sequence data. iii. cladogram estimation
infectious disease transmission as a forensic problem: who infected whom?
estimation of evolutionary distances between homologous nucleotide-sequences
comparative full-length genome sequence analysis of sars coronavirus isolates and common mutations associated with putative origins of infection
extensive geographical mixing of human h n influenza a virus in a single university community
quantifying influenza virus diversity and transmission in humans
surveillance of acute respiratory infections using community-submitted symptoms and specimens for molecular diagnostic testing
eight challenges in phylodynamic inference
sequencing technologies-the next generation
the authors declare no conflict of interest.
key: cord- -w m ci i authors: yamin, mohammad title: it applications in healthcare management: a survey date: - - journal: int j inf technol doi: . /s - - - sha: doc_id: cord_uid: w m ci i
healthcare management is currently undergoing substantial changes, reshaping our perception of the medical field. one spectrum is that of the considerable changes that we see in surgical machines and equipment, and the way procedures are performed. computing power, the internet and associated technologies are transforming surgical operations into model-based procedures. the other spectrum is the management side of healthcare, which is equally critical to the medical profession. in particular, recent advances in the field of information technology (it) are assisting in better management of health appointments and records. with the proliferation of it, data is now playing a vital role in diagnostics, drug administration and the management of healthcare services. with advances in data processing, the large amounts of medical data collected by medical centres and providers can now be mined and analysed to assist in planning and making appropriate decisions. in this article, we provide an overview of the role of it in reshaping healthcare management, hospitals, the health profession and industry. internet and web . have been instrumental in evolving most of the information technology (it) applications that we are now accustomed to. barely years ago, we had not imagined the technological advancements that are now embedded in our lives. technological advancement, riding on the wave of the disruptive technologies of the last quarter of a century, is reshaping and redefining the world's social order and norms. the medical field is part and parcel of our lives and is perhaps the one that we care the most about. in recent times, this field has undergone vast changes. the medical profession today has evolved greatly since the beginning of this century. not only medical procedures but also medical data flow and management have undergone considerable improvement. advanced data transfer and management techniques have improved disease diagnostics and play a critical role in national health planning and efficient record keeping. in particular, the medical profession has undergone substantial changes through the capabilities of database management, which has given rise to healthcare information systems (his). medical data collected from different institutions can now be mined and analysed in time using big data analytics tools. this article provides an overview of the involvement, importance and benefits of it and of how data management is contributing to the medical field. in particular, the following will be discussed: patient and data management; big medical data analytics;
privacy and security of healthcare data; and artificial intelligence and robotics in healthcare. artificial intelligence (ai) is a field of computer science which has its roots in logic (mathematics), psychology, philosophy, linguistics, arts, science and management, among others. many ai-dominant tools and applications can be found in games, auto spare parts, heavy machinery and various medical instrumentations. according to [ ], many programs are developed with the help of ai to perform specific tasks which make use of many activities including medical diagnostics, time sharing, interactive interpreters, graphical user interfaces and the computer mouse, rapid development environments, the linked list data structure, automatic storage management, and symbolic, functional, dynamic, and object-oriented programming. here are some other examples. pharmaceutical developments involve extensive clinical trials in real environments and may take many years, sometimes more than a decade. this process requires a considerable amount of resources and can cost billions of dollars. in order to save lives and money, it is desirable to expedite this time-consuming process to make healthcare cheaper. an urgency of this nature was witnessed during the ebola crisis in some west african nations [ ], where a program powered by ai was used to scan existing medicines that could be redesigned to fight the disease. nowadays medical consultation, with the help of an online illness database, can take place without the direct interaction of a medical practitioner. an example of such a practice can be found in [ ]. the advantage of such consultations lies in the fact that the symptoms presented online by the patients are matched with those of about ten thousand known diseases [ ], whereas doctors can only match the symptoms against the fraction of known diseases with which they are familiar without the aid of an it database. likewise, virtual nursing to support patients online is also available from [ ]. yet another app [ ], developed by boston children's hospital, provides basic health information and advice for parents of ill children. since their introduction in the s, robots have been used in many scientific and social fields. their use in fields like nuclear arsenals, automobile production, and manipulations in space has been widespread. in comparison, the medical field has been somewhat slower to take advantage of this technology. nevertheless, robots are nowadays being utilised to assist in or perform many complex medical procedures, which has given rise to the term robotic surgery. a comprehensive discussion of robotic surgery is available in [ ]. indeed there are many advantages of using robots in healthcare management. for example, developing pharmaceuticals through clinical trials can take many years and costs may rise to billions of dollars. making this process faster and more affordable could change the world. robots can play a critical role in such trials. robots are also very effective in medical training, where simulations are used. use of robots in abdominal surgery [ ] is innovative in comparison to conventional laparoscopy procedures, and has the potential to eliminate the existing shortcomings and drawbacks. according to [ ], the most promising procedures are those in which the robot enables a laparoscopic approach where open surgery is usually required.
likewise robotic surgery for gynaecology related ailments [ ] can also be very effective. the conventional laparoscopy, the surgeon would have limited degree of freedom with a d vision, whereas the robotic system would provide a d view with intuitive motion and enable additional degrees of freedom. prostatic surgery, being complex in the way of the degree of freedom, is another area where robots are very effective [ ] . nosocomial infections pose a major problem hospitals and clinics around the world. this becomes worst in case of large and intense gatherings of people like in hajj [ ] . to keep the environment free of these viruses and bacteria, the environments needs to be cleaned efficiently and regularly. however some viruses like ebola [ ] and mers coronavirus [ ] are highly contagious and cannot be cleaned by humans without having protective gear, which can be costly and may or may not be effective. cleaning the hospital with chlorine is an effective way. however, there are many drawbacks of this method including a high risk for cleaners to be infected themselves. a sophisticated approach was implemented in us hospitals by using robots [ ] to purify the space. however, some of the robots used are not motorized and may require a human to place them in the infected room or area, posing the risk of infection to humans. it is not known as to which substance or matter can effectively eliminate deadly viruses like ebola and mers coronavirus. however, according to [ ] , beta-coronaviruses, which is one of the four strands of mers coronavirus, can be effectively eliminated by uv light. for such a deadly and highly contagious virus clearing, a robot that can be remote controlled is needed, which can move forward, backward and sideways, and is able to clean itself. concept and related technology of virtual reality (vr), augmented reality (ar) and mixed reality (mr), which originate from image processing and make extensive use of ai, have been in use for decades in games and movies. for some time now, the vr, ar and mr tools and technology are assisting the medical field in a significant way. virtual reality has been used in robotic surgery to develop simulators to train surgeons in a stress free environment using realistic models and surgical simulation, without adversely affecting operating time or patient safety [ ] . an account of ar and vr technologies, including the advantages and disadvantages can be found in [ ] . the ar and vr technology can be very useful in medical and healthcare education, an account of which can be found in [ ] . applications of vr, ar and mr can also be very helpful in urology [ ] . healthcare management information systems (hmis) is the generic name for many kinds of information systems used in healthcare industry. the use of hmis is widespread in various units of hospitals, clinics, and insurance companies. a doctor records patient data (habits, complaints, prescriptions, vital signs, etc.) in one such system they use. a number of hospital/clinic units like radiology, pathology, and pharmacy store patient data in these systems for later retrieval and usage by different stakeholders as shown in fig. . appointment system itself can also be part of a hmis or a standalone system. modern healthcare without hmis cannot be comprehended. an information system or electronic system dealing includes a database system, sometimes known as the backend of the main system. 
a database is a collection of data, tailored to the information needs of an organisation for a specific purpose. it can grow bigger or shrink to a smaller size depending on the needs of different times. a database in itself would be of no use unless it was coupled with a software system known as the database management system (dbms), which facilitates various functions and processes like data entry and storage, retrieval, and update, data redundancy, data sharing, data consistency, data integrity, concurrency, and data privacy & security. more information about database systems can be found in [ ] . data in an information system can be organised in several different ways. classical representation of data [ ] is tabular. most tables can be regarded as mathematical relations, for this reason, a dbms with tabular data organisation is known as relational. relational databases have by far been dominant in the business organisations until recently. in the eighties and nineties of the last century, object-oriented and semantic data models were also introduced. in particular, object-oriented databases [ ] became very popular for managing audio-visual data. semantic database technology [ ] , although very efficient for search queries, did not reach its expected peak, perhaps because of its complex data model dealing with ontologies. a comparative study of relational, object oriented and semantic database systems is carried out in [ ] . although the evolution of information technologies began much earlier, however, its implementation by businesses and corporations occurred in the second half of the last century. the healthcare industry a bit slow to take the advantage of the technology. earlier, medical records and systems were paper based. thus the first phase of the usage of information technology and systems in hospital and healthcare management was to transform paper based records to database systems. not only did it change paper based systems to electronic ones but also allowed data manipulation. as a result, more data could be processed in a very short time, which was critical in making timely decisions. a discussion of early stages of this transformation can be found in [ ] . consolidation of this technology resulted in healthcare management information systems (hmis). timely decision is critical in all walks of life but it assumes a much greater urgency in some healthcare cases. hmis is capable to transferring data from one location or application to another within split of a second, as opposed to days without the help of electronic systems. this allows users to access, use and manipulate data instantly and concurrently. as the technology continues to evolve and refine, the performance of information systems operations, in particular those of hmis, is bound to improve. ubiquitous health system is meant to provide healthcare services remotely. according to [ ] , ubiquitous healthcare system means the environment that users can receive the medical treatment regardless of the location and time. essentially, a ubiquitous healthcare system would monitor daily or periodic activities of patients and alert the patients or health workers of problems, if any. a discussion of national healthcare systems in korea, including the concept of u-hospitals, is discussed in [ ] . a rich picture of a ubiquitous healthcare system is shown in fig. . internet of things (iot) is an emerging technology. data in organisations, including hospitals, clinics, and insurance companies, is increasing every day. 
to process this large data, advanced data processing techniques are combined into what is now known as the big data analytics. internet of things (iot) is a general paradigm. when dealing with medical environment, it is known as internet of medical of things (iomt), which is same as the medical internet of things. a discussion of iomt can be found in [ ] . the fundamental objective of iot is to provide ubiquitous access to numerous devices and machines on service providers (sps) [ ] , covering many areas such as location based services (lbs), smart home, smart city, e-health [ , ] . these tools, devices and machines cover service providers in many areas of location based services. a few other could be smart home, street or city, elements of ubiquitous health systems, tools for electronic learning and electronic business, the number which is increasing day by day and it may reach fifty billions in [ , ] . we have conceptualised this in fig. . the iot applications invariably use cloud storage coupled with fog computing [ ] . data in organisations is growing at a rapid rate. some organisations have collected huge pile from their operations. organisational data needs to be mined and analysed on a regular basis to uncover hidden patterns, unknown correlations, market trends, customers, patients and treatment records to help organizations to plan and make more informed decisions. mining and analysis of big data captures the immense wealth of intelligence hidden in organisational data, makes information transparent and usable at much higher frequency, and enhance performance. all of these contribute to providing higher value in efficient decision making. when data grows, it cannot be handled manually, often big data organisations use robotic arm to perform data loading and unloading operations. medical data is also growing at a rapid rate in many hospitals, clinics and insurance companies. mining and analysis of healthcare data is even more critical as it relates to the wellbeing of the society. data centric research is highly desirable to find answers to questions on many chronical diseases, and to plan and manage national and international health programs. potential befits of big data analytics in healthcare are discussed in [ ] . in particular, authors have provided five strategies for success with big data. these are ( ) implementing big data governance, ( ) developing an information sharing culture, ( ) training key personnel to use big data analytics, ( ) incorporating cloud computing into the organization's big data analytics, and ( ) generating new business ideas from big data analytics. a survey of big data analytics in healthcare and government is conducted in [ ] , where the need for the analytics is linked to the provision and improvement of patient centric services, detection of diseases before they spread, monitoring the quality of services in hospitals, improving methods of treatment, and provision of quality medical education. authors in [ ] identify medical signals as a major source of big data, discuss analytical techniques of such data. some of the well known tools used for big data analytics are apache tm hadoop Ò is highly scalable storage platform. it provides cost-effective storage solution for large data volumes. it doesn't require any particular format. apache spark is an open source, in-memory processing machine. its performance is much faster than that of hadoop. mapreduce is used for iterative algorithms or interactive data mining. 
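as a toy illustration of the map/reduce pattern named above, the sketch below counts diagnoses per clinic in plain python (it does not use the hadoop or spark apis); the clinic records and field names are hypothetical, and a cluster framework would simply distribute the same two steps across many machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical patient-encounter records; in practice these would be read
# from distributed storage rather than a literal list.
records = [
    {"clinic": "north", "diagnosis": "influenza"},
    {"clinic": "north", "diagnosis": "dengue"},
    {"clinic": "south", "diagnosis": "influenza"},
    {"clinic": "south", "diagnosis": "influenza"},
]

def map_record(record):
    # Map step: emit a (key, count) pair for each record.
    return Counter({(record["clinic"], record["diagnosis"]): 1})

def reduce_counts(left, right):
    # Reduce step: merge partial counts; Counter addition sums by key.
    return left + right

case_counts = reduce(reduce_counts, map(map_record, records), Counter())
print(case_counts)  # e.g. Counter({('south', 'influenza'): 2, ...})
```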
there are many other big data analytics tools; a list of them can be found at http://bigdata-madesimple.com/top- -big-data-tools-data-analysis/. there are always security and privacy concerns associated with data. protection of privacy and security is a responsibility of data-holding organisations. with medical data these concerns assume higher prominence and priority. medical data can be very sensitive, and may be linked to the life and death of patients. often socio-political issues are linked with medical data. it is well known that india was divided into two countries giving rise to pakistan in . mohammad ali jinnah, very clever and shrewd, was the leader of the pakistan campaign. larry collins and dominique lapierre [ ] argue that the partition of india could have been avoided if ''the most closely guarded secret in india'' had become known: jinnah was suffering from tuberculosis, which was slowly but surely killing him. privacy issues of data are discussed in [ , , ]. healthcare procedures and management now rely greatly on it applications, which provide real-time access to data and utilities. without these applications, healthcare would be limited, compromised and prone to major problems. much of the medical equipment used in procedures is highly dependent on technology. ai, robots, vr, ar, mr, iomt, ubiquitous medical services, and big data analytics are all directly or indirectly related to it. hmis is critical for efficient management of records, appointment systems, diagnostics and other needs of medical centres. as the technology advances, healthcare is likely to further improve.
artificial intelligence: a modern approach
perspectives on west africa ebola virus disease outbreak
intelligent machines, the artificially intelligent doctor will hear you now
meet molly-the virtual nurse
boston children's hospital launches cloud-based education on amazon alexa-enabled devices
robotic surgery-a personal view of the past, present and future
robotic abdominal surgery
status of robotic assistance-a less traumatic and more accurate minimally invasive surgery?
robotic surgery in gynaecology
a randomised controlled trial of robotic vs. open radical prostatectomy: early outcomes
health management in crowded events: hajj and kumbh. bvicam's
middle east respiratory syndrome coronavirus (mers-cov) origin and animal reservoir
the iowa disinfection cleaning project: opportunities, successes, and challenges of a structured intervention program in hospitals
inactivation of middle east respiratory syndrome-coronavirus in human plasma using amotosalen and ultraviolet a light
impact of virtual reality simulators in training of robotic surgery
augmented and virtual reality in surgery-the digital surgical environment: applications, limitations and legal pitfalls
preliminary study of vr and ar applications in medical and healthcare education
application of virtual, augmented, and mixed reality to urology
database systems. a practical approach to design, implementation and management, th edn
a relational model of data for large shared data banks
semantic, relational and object oriented systems: a comparative study
health care in the information society: a prognosis for the year
unified ubiquitous healthcare system architecture with collaborative model
national health information system in korea
medical internet of things and big data in healthcare
information analytics for healthcare service discovery
intelligent human activity recognition scheme for e-health applications
preserving privacy in internet of things-a survey
internet of things: converging technologies for smart environments and integrated ecosystems
security of the internet of things: perspectives and challenges
a survey on fog computing: properties, roles and challenges
big data analytics: understanding its capabilities and potential benefits for healthcare organizations
a survey of big data analytics in healthcare and government. nd international symposium on big data and cloud computing (isbcc' )
big data analytics in healthcare
freedom at midnight: the epic drama of india's struggle for independence
improving privacy and security of user data in location based services
big data security and privacy in healthcare: a review
key: cord- -yltc wpv authors: lessler, justin; azman, andrew s.; grabowski, m. kate; salje, henrik; rodriguez-barraquer, isabel title: trends in the mechanistic and dynamic modeling of infectious diseases date: - - journal: curr epidemiol rep doi: . /s - - - sha: doc_id: cord_uid: yltc wpv
the dynamics of infectious disease epidemics are driven by interactions between individuals with differing disease status (e.g., susceptible, infected, immune). mechanistic models that capture the dynamics of such “dependent happenings” are a fundamental tool of infectious disease epidemiology. recent methodological advances combined with access to new data sources and computational power have resulted in an explosion in the use of dynamic models in the analysis of emerging and established infectious diseases. increasing use of models to inform practical public health decision making has challenged the field to develop new methods to exploit available data and appropriately characterize the uncertainty in the results. here, we discuss recent advances and areas of active research in the mechanistic and dynamic modeling of infectious disease. we highlight how a growing emphasis on data and inference, novel forecasting methods, and increasing access to “big data” are changing the field of infectious disease dynamics. we showcase the application of these methods in phylodynamic research, which combines mechanistic models with rich sources of molecular data to tie genetic data to population-level disease dynamics. as dynamic and mechanistic modeling methods mature and are increasingly tied to principled statistical approaches, the historic separation between infectious disease dynamics and “traditional” epidemiologic methods is beginning to erode; this presents new opportunities for cross-pollination between fields and novel applications. in , ronald ross coined the term ''dependent happenings'' to capture the fundamental difference between the study of infectious diseases in populations and other health phenomena [ ].
because infectious diseases are, for the most part, acquired from the people around us, our own future health status depends on that of our neighbors (e.g., the more people we know who are infected, the more likely we are to become infected ourselves). for acute infectious diseases, the health status of the population often changes quickly over time, with the number of people infectious, susceptible to being infected, and immune to the disease changing substantially over the course of an epidemic. further, the membership in each of these groups does not vary arbitrarily over time but is driven by often well-understood biological processes (box ). for instance, in the simple example of a permanently immunizing infection spread through person-to-person transmission such as measles, new susceptible individuals only enter the population through birth and immigration; these individuals can then only become infected by contact with existing infectious individuals, who, in turn, will eventually become immune or die and be removed forever removed from participation in the epidemic process. the epidemic dynamics of infectious diseases are driven by similar mechanistic relationships between the current and future health states of the population. the expected number of infections at some time t is illustrated for a directly transmitted disease in the above equation. dynamic and mechanistic models of disease spread, regardless of complexity, capture these relationships in order to improve inference or predict the disease dynamics. the study of infectious disease dynamics encompasses the study of any of the shared drivers of the mechanistic processes of disease spread with an eye towards better understanding disease transmission. as illustrated above, these include: the size of the susceptible population ( ): the number of people available to be infected. the dynamics of susceptibility is not shown here, but can itself can be complex, as new susceptibles enter the population through birth, immigration and loss of immunity. for many diseases (e.g., dengue, influenza), susceptibility is not a binary state, and complex models may be needed. the force of infection: the force of infection is the probability that any individual who is susceptible at a given time becomes infected (analogous to the hazard of infection). the size of the susceptible population times the force of infection is the reproductive number ( ). when this value is above , the epidemic will grow. when it falls below , it will recede. the infectious process ( ): the infectious process dictates the chances of becoming infected on a direct or indirect contact with an infectious individual. here represented as a per contact probability of infection, this itself can be a complex, multi-faceted process. the contact process ( ): the process by which infectious contacts are made, whether directly or mediated by a some vector or the environment, is one of the most complex parts of the infectious process. much modern research focuses on accounting the role of space and population structure in the contact process. previous infections ( ): fundamental to the nature of infectious diseases is the number of previous infections, however, these may not as directly lead to current infections as illustrated here if transmission is mediated by a vector or the environment. 
the natural history of disease ( ): how infectious people are at particular times after their infection determines their contribution to ongoing disease transmission, and fundamentally drives the speed at which epidemics move through the population. other aspects of disease natural history (e.g., the incubation period) may determine our ability to control a disease and its ultimate health impact. over the course of the twentieth century, the main body of epidemiologic research became increasingly reliant on models of statistical association, often with strong assumptions of independence between observations (hereafter referred to as associative models) [ ] . however, as a result of the need to deal with dependent happenings, there remained a strong subpopulation within infectious disease epidemiology that used models of an entirely different type. variously referred to as bmathematical,^bdynamic,^or bmechanistic^models, these models are characterized by having a mechanistic representation of the dynamic epidemic process that determines how the population's state at time t + depends on its state at time t (hereafter referred to as mechanistic models). historically, these models have more often been deterministic and built top-down from first principles rather than based on patterns in any particular dataset. however, as increasing computational power has caused an explosion in the types of models that can be subject to rigorous statistical analysis, there has been a shift toward more data-driven and statistical approaches and a greater focus on stochasticity and uncertainty. this confluence between principled statistical inference and mechanistic processes is paying huge dividends in the quality of the work being produced and the types of questions being answered across disciplines within infectious disease epidemiology. infectious disease models are being given a firmer empirical footing, while the use of generative mechanistic approaches allows us to use models as tools for forecasting, strategic planning, and other activities in ways that would not be possible with models that do not represent the underlying dynamic epidemiologic processes. in this manuscript, we review current research into dynamic and mechanistic models of infectious disease with a focus on how the confluence of mechanistic approaches, new statistical methods, and novel sources of data related to disease spread are opening up new avenues in infectious disease research and public health. for those interested in further pursuing the topic, we provide a list of key resources in box . recent work in infectious disease dynamics has been characterized by an increasing focus on data and principled approaches to inference. traditionally, deterministic models were a dominant tool for studying the theoretical and practical basis of disease transmission in humans and animals. this approach yielded important practical and theoretical results there exist a number of freely available resources that aid infectious disease modelling efforts. courses that form the basis of our understanding of disease dynamics [ •, ] but was limited in approach. deterministic models are usually parameterized through some combination of trajectory matching (i.e., minimizing the distance between observed and simulated data) and specifying parameters based on previous literature. 
this approach may be sufficient to describe the expected behavior of an infectious disease in a large population, but an increasing focus on how stochasticity and parameter uncertainty impact public health decision making, combined with the growing availability of computational power, has driven a move toward more statistically principled and data-driven likelihood-based approaches. illustrative of this evolution is the contrast between early descriptions of the key dynamic properties of hiv transmission with more recent dynamic characterizations of pandemic h n influenza (h n pdm), middle eastern respiratory syndrome (mers-cov), and ebola. in the late s, several papers were published laying out the essential properties of hiv transmission dynamics that would govern the course of the epidemic (at least in the near term) [ - ]. these papers presented deterministic epidemic models that captured the processes driving the epidemic and highlighted the key parameters, such as the speed of progression to aids, that needed to be investigated. uncertainty was largely addressed through scenario-based approaches (e.g., different future epidemic trajectories were presented for different plausible sets of parameters), and for the most part, different aspects of the transmission dynamics were derived from independent studies, with only the growth rate (i.e., doubling time) estimated from incidence data. while the parameters essential to characterizing epidemic dynamics remain largely unchanged for recently emerging pathogens, the approach to data and estimation is qualitatively different. integrated statistical frameworks built on markov chain monte carlo (mcmc) techniques are used to estimate all, or most, parameters from different datasets and to produce posterior distributions both for parameter estimates and forecasts of future incidence [ , •, ]. these methods allow innovative use of unconventional data sources, such as disease incidence among travelers [ , •], to estimate population incidence of the disease, and molecular data can supplement incidence data providing independent estimates of the same parameters (see discussion of phylodynamics below) [ , •]. these recent attempts to quickly characterize the properties of emerging diseases are emblematic of an increasing focus on developing statistical methods, grounded in dynamical models, to estimate key epidemic parameters based on diverse data sources. surveillance data is often used to estimate the reproductive number (r t , the number of secondary infections that a primary infection is expected to infect at any point, t, in an epidemic), incubation period, and serial interval (the expected time between symptom onset in a case and the people that case infects), as was done in recent outbreaks of mers-cov [ •, ] and ebola [ ] . surveillance data has also been paired with serological data to estimate force of infection (i.e., the hazard of infection) and basic reproductive number (r t , when the population is fully susceptible, designated r ) of several pathogens, including dengue and chikungunya [ ] [ ] [ ] . dynamic modeling approaches can also aid in the interpretation of surveillance data. state-space models (e.g., hidden markov models) have been used to pair our mechanistic understanding of disease transmission with a statistical inference framework by linking observed incidence and dynamics with underlying population disease burden and susceptibility (i.e., the population's state). 
notably, this approach has been used to estimate global reductions in mortality due to measles in the face of incomplete reporting [ •] . likewise, valle and colleagues used hybrid associative and mechanistic models to account for biases that treatment of detected malaria cases might have on estimates of key values such as the incidence rate [ ] . perhaps, the biggest limitation when attempting to characterize the parameters driving disease transmission remains data availability. data on disease transmission often comes from incomplete surveillance data or represents one aspect of a partially observed epidemic process. for example, epidemic curves are usually limited to symptomatic cases. similarly, key events in the transmission process, such as the exact time of infection, are generally not observable and have to be inferred from observed data. methods, such as the use of mcmc-based data augmentation and known transmission processes to infer the possible distribution of transmission trees, have been developed to deal with partially observed data and have been used to reconstruct outbreaks [ ] , characterize risk factors for transmission [ ] , and quantify the impact of interventions [ ] . a limitation of likelihood-based approaches, such as those mentioned above, is that it is often impossible or impractical to evaluate the data likelihood, in particular, for complex models and large datasets. to deal with this challenge, several blikelihood-free approaches^have been developed, including approximate bayesian computation (abc) and sequential monte carlo (smc) [ , ] . an advantage of these approaches is that they only require the ability to simulate from candidate models (i.e., if data can be simulated, calculation of the likelihood is unnecessary) and therefore can be more easily applied than methods that require iterative evaluation of the likelihood. abc has been used to integrate phylodynamic and epidemic models of influenza and other pathogens [ ] , and smc methods have been used to parametrize dengue transmission models using data from multi-centric vaccine clinical trials [ •] . despite important advances over the past decades in linking data and transmission models as tools of inference, many challenges remain and are the topic of continued research. inference for complex models and using large datasets remain challenging, in part, due to computational burden. mechanistic models offer promise as a way to simultaneously link data from diverse, heterogenous data sources (as in [ ] ), but this promise has yet to be realized, though some phylodynamics methods come close (see below). further, rapid inference on emergent epidemics remains a tool only used in high-profile epidemics [ •, ], and these inferential techniques remain inaccessible to field epidemiologists. scientists and physicians have tried to forecast the course of epidemics since the time of hippocrates. associations between incidence and extrinsic factors such as time of year, climate, and weather can and have been used to forecast infectious diseases [ ] [ ] [ ] . however, mechanistic models that capture the natural history of the disease (e.g., duration of immunity and cross protection) [ ] , mode of transmission [ ] , and movement patterns [ , ] can improve forecasts, particularly when associations with extrinsic drivers of incidence, such as climate, are weak or unknown (e.g., for emerging pathogens). 
in recent years, forecasts based on models that capture the underlying mechanistic processes of transmission and pathogenesis have become common. uses range from forecasting the peak timing and magnitude of an influenza season [ ] , to forecasting the spread and spatial extent of emerging pathogens such as zika, ebola, and chikungunya [ ] [ ] [ ] . the mechanistic underpinning of these models allows forecasts to take into account dynamic processes that may, otherwise, be impossible to capture, including changes in behavior and resource availability in response to an epidemic [ ] . approaches adopted from computer science, machine learning, and climate science have enhanced our ability to provide reliable forecasts with quantified uncertainty. particularly important are ensemble approaches [ ] , which integrate forecasts from multiple imperfect models or different parameterizations of the same model to calculate a distribution of potential courses of the epidemic [ , ] . ensemble approaches have been used for forecasting influenza in temperate regions [ ] , where influenza is highly seasonal, and more recently in subtropical areas such as hong kong, where the seasonal pattern is less distinct [ ] . similarly, ensemblebased climate models have been incorporated with infectious disease models to forecast climate-related disease including plague and malaria [ ] . these examples use multiple parameterizations of a single model. ensemble approaches can also be used to accommodate uncertainty in model structure by comparing estimates from parameterizations across different models, as is work by smith and colleagues where an ensemble of different individual-based models was used to estimate the impact of a future malaria vaccine [ ] . there has been an explosion in the number of forecasts being made to aid public health decision making, including a number of government-sponsored contests to forecast the progression epidemics of diseases ranging from influenza to chikungunya and dengue [ ] [ ] [ ] . as forecasts become more widely used, care must be given to ensure that the purposes of the model and uncertainty (both structural and statistical) are well communicated. in a recent outbreaks of emerging infectious diseases, like ebola, groups raced to make forecasts of the evolution and spatial spread of the outbreak [ , ] , with some predicting an epidemic size orders of magnitude greater than what was actually observed. while some of these extreme forecasts were made as worst-case planning scenarios, they were interpreted as likely scenarios, raising alarm and casting doubt on the validity of model-based forecasts, thus highlighting the importance of clear communication of a model's purpose and its limitations [ , ] . the quality of infectious disease forecasts and standards for their interpretation are far from the gold standard of methods and conventions used in the meteorology. improvement in both the methods used and their practical use remain critical areas of future research. the advent of bbig data^has opened up new avenues in how we parameterize and understand models of infectious disease spread. big data refers to massive datasets that are too large or complex to be processed using conventional approaches [ ] . however, advances in computing increasingly allow their use without large delays in processing time or unrealistic computing capacity requirements. 
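a toy version of the ensemble idea discussed above is sketched below: several model parameterizations (here just different assumed weekly growth rates, all hypothetical) each produce a forecast, and the pooled samples form a single predictive distribution with quantified uncertainty. the equal weighting and poisson noise are illustrative choices; operational systems weight members by past skill and use far richer members.

```python
import numpy as np

# Toy ensemble combination; every number below is a hypothetical stand-in.
rng = np.random.default_rng(7)

def member_forecast(growth_rate, current_cases=120, n_samples=1000):
    """Sample next-week case counts from one ensemble member, with Poisson
    noise standing in for stochastic and observation uncertainty."""
    mean = current_cases * growth_rate
    return rng.poisson(mean, size=n_samples)

# Three members with different assumed weekly growth rates.
ensemble = np.concatenate([member_forecast(g) for g in (0.9, 1.1, 1.4)])

lower, median, upper = np.percentile(ensemble, [5, 50, 95])
print(f"forecast: {median:.0f} cases (90% interval {lower:.0f}-{upper:.0f})")
```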
one of the most successful attempts to use big data to understand disease dynamics has been the use of call data records (cdrs) to capture human mobility. for each call that is made or received, mobile phone operators capture the mobile phone tower through which the call is made. by tracking tower locations for a subscriber, we can capture where he or she is moving. in practice, to ensure confidentiality, cdrs are usually averaged over millions of subscribers to provide estimates of flux between different locations in a country. transmission models built upon cdr-based estimates of seasonal patterns of human movement have been used to explain patterns of rubella disease in kenya [ ] and dengue in pakistan [ ] . in both instances, models built on empirical human movements seen in cdrs outperformed alternative parametric models of population movement based on our theoretical understanding of human travel patterns (e.g., gravity models where movement is based on community size and distance [ ] ) and models where movement was not considered. cdr-based models have also been used to understand the dynamics of large-scale outbreaks such as ebola in west africa [ ] and challenges in malaria elimination [ ] . questions remain as to the generalizability of cdr-based analyses in settings where mobile phone ownership is low [ ] , and problems capturing flows between countries remain. however, the largescale penetration of mobile phones, even in resourcepoor settings, makes cdrs a hugely valuable data source for informing infectious disease models. another type of big data that has enormous potential for furthering our understanding of disease dynamics is satellite imagery. detailed satellite imagery can provide high-spatial-resolution estimates of key determinants of many infectious disease processes, including environmental factors (e.g., land cover), climatic conditions (e.g., precipitation, temperature), and population density throughout the globe [ , ] . in infectious disease epidemiology, such datasets have recently been used as the basis for statistical models that produce fine scale maps of disease incidence, prevalence, and derived transmission parameters (e.g., force of infection, basic reproductive number) for a large number of diseases. early efforts focused on mapping the global distribution of key drivers of malaria transmission [ , ] . these approaches have since been used to estimate the burden from a wide range of pathogens [ ] [ ] [ ] [ ] , vectors [ ] , and host reservoirs [ ] . these analyses have allowed disease burden and risk to be estimated in areas with limited surveillance capabilities, expanding our understanding of the global burden of many pathogens. high-resolution geographic data can gain additional power when paired with mechanistic models that capture changes in disease risk, as in recent analyses that accounted for the effect of birth, natural infection, and vaccine disruptions driving increases in measles susceptibility and epidemic risk in the wake of the ebola outbreak [ ] . finally, big data are increasingly being used with mechanistic models to more directly estimate disease burden in real time [ ] . for example, patterns in the usage of different google search terms have been shown to correlate well with incidence trends for diseases such as influenza [ , ] and dengue [ ] . 
it is worth noting that big data alone can typically only explain part of trends in incidence, and models that incorporate seasonal dynamics typically outperform models that rely solely on search terms. similar approaches have been used with wikipedia updates [ ] and social media sites such as twitter and facebook. electronic medication sales data and electronic medical records have also been proposed as novel data sources of disease trends [ ] . these approaches can provide estimates much faster than traditional surveillance systems, where it often takes weeks or months for results of cases to be aggregated and analyzed. mechanistic models can then be fit to this data to better understand seasonal or spatial parameters. for example, yang et al. used mechanistic models fit to google flu trend data to estimate epidemiological parameters such as the basic reproductive number and the attack rate for cities in the usa over a -year period [ , ] . phylodynamics, the study of how epidemiological, immunological, and evolutionary processes interact to shape pathogen genealogies, is among the newest and fastest-growing areas in infectious disease research [ ] . the term phylodynamics was coined in by grenfell et al., who observed that the structure of pathogen phylogenies reveals important features of epidemic dynamics in populations and within hosts [ ] . this relationship provides a theoretical framework for linking molecular data with population-level disease patterns using dynamic models. early methodological work in phylodynamics concentrated on the formal integration of the kingman's coalescent and birth death models from population genetics with standard deterministic epidemic models. the coalescent model provides a framework for estimating the probability of coalescent events (lineages converging at a common ancestor) as we move back in time given changes in population size [ ] . the branching patterns in a phylogenetic tree describe the ancestral history of sequenced pathogens, such that nodes closer to the root of the tree represent historical coalescent events while nodes near the tip represent recent events. the strong relationship between the genetic divergence of pathogens and time allows us to estimate the timing of coalescent events and estimate the rate of growth (or decline) of pathogen populations. these estimates are the critical link between genetic and epidemic models [ ] . the formal statistical integration of population genetic and epidemic models allows us to estimate the critical epidemiological parameters such as the basic reproductive number directly from pathogen sequence data [ ] [ ] [ ] . for example, magiorkinis et al. used sequence data from viruses collected over a -year period in greece to estimate subtype-specific reproductive numbers and generation times for hepatitis c [ ] . using data from the athena hiv cohort, which samples ∼ % of hiv-infected persons in the netherlands, bezemer et al. used viral sequence data to estimate reproductive numbers for hundreds of circulating transmission chains, showing that large chains persisted within the netherlands for years near the threshold for sustaining an epidemic (r = ) [ , ] . other phylodynamic applications have focused on elucidating the spatial dispersal pattern diseases such as influenza and hiv. in an analysis of nearly , influenza genomes, bedford et al. 
showed fundamental differences in the global circulation patterns of h n , h n , and influenza b viruses and that these were potentially driven by differences in the force of infection and rates of immune escape (i.e., antigenic drift) [ ] . likewise, faria et al. used hiv sequence data from central africa to reconstruct the early epidemic dynamics of hiv- using phylodynamic methods and showed that kinshasa in the democratic republic of congo likely served as the focal point for global hiv spread [ , ] . phylodynamics plays an important role in real-time infectious disease surveillance and targeted control [ ] . in recent epidemics of mers-cov and ebola, genomic data were used to assess transmission patterns, monitor viral evolution in populations, and inform epidemic control [ ] , [ ] , [ ] . analyses of hiv epidemics among us and european men who have sex with men demonstrate that the amalgamation of epidemiologic and genomic data can be used to identify high-risk transmitters and optimal targeted intervention packages [ ] , [ ] . however, the utility of real-time phylodynamic analysis in many settings remains hindered by inadequate infrastructure, scarce viral sequence data, and limited analytic capacity at local levels. initial phylodynamic models could only deal with simple epidemic patterns (e.g., exponential growth), and recent methodological work has focused on extending the phylodynamic framework to account for complex nonlinear population dynamics [ , ] . for instance, rasmussen and colleagues showed how phylodynamic models could be extended to integrate more complex stochastic and structured epidemic models using bayesian mcmc and particle filtering [ ] . others have focused on resolving transmission network structure from phylogenies [ , ] or on integrating data across multiple scales by incorporating information on intra-host pathogen diversity and ecological processes directly into phylodynamic models [ ] . however, equally important recent work has shown that phylodynamic inferences can be highly sensitive to sampling and unmeasured factors. simulation studies show that the relationship between phylogenetic trees and the underlying transmission networks is a complex function of the sampling fraction and underlying epidemic dynamics [ , ] and that failure to account for intra-host viral diversity may bias phylodynamic inference [ ] . here, we have focused on areas where we feel that there has been the most innovation in the use of dynamic epidemic models in recent years. this is not to imply that innovation has stopped in other areas where dynamic models play a key role. dynamic models have long been key to our understanding of epidemic theory. innovative models continue to be developed to deal with the challenges posed by pathogen evolution [ ] , complex immunological interactions [ ] , and host heterogeneity [ ] . there has been increasing emphasis on the use of dynamic models in informing public health policy since the early s, when they played a key role in the response to the foot-and-mouth disease outbreak in the uk [ ] and in the assessment of the risk from a smallpox-based bioterrorist attack [ , ] . these uses have extended to endemic disease, such as a modeling analysis by granich and colleagues [ ] that highlighted the potential of "test-and-treat" strategies for hiv control. recently, dynamic models have played an important role in guiding the response to emerging disease threats, from pandemic influenza [ ] to multi-drug resistant organisms [ , ] to mers-cov [ ] .
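as a worked illustration of how a growth rate estimated from a dated phylogeny can be translated into an epidemiological quantity, the sketch below applies the standard textbook approximation R0 ≈ 1 + r·D for an SIR-like process with mean infectious period D; the numbers are hypothetical, and this is not the estimation machinery used in the cited studies.

```python
# Back-of-the-envelope link between an exponential growth rate r (e.g., from the
# slope of a coalescent-based effective-population-size trajectory) and R0,
# assuming an SIR-like process. Values below are hypothetical.
import math

def r0_from_growth_rate(r_per_day: float, infectious_period_days: float) -> float:
    """Approximate R0 from exponential growth rate r and infectious period D."""
    return 1.0 + r_per_day * infectious_period_days

r = math.log(2) / 14.0          # lineages doubling every 14 days (assumed)
print(round(r0_from_growth_rate(r, 5.0), 2))  # ~1.25 for a 5-day infectious period
```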
many of the themes discussed throughout this manuscript have had a profound impact on these efforts, as has the need to report results and assumptions in a way accessible to policy makers. mechanistic models also crop up in other areas of epidemiology, often in less obvious ways. nearly all of the key methods of genetic epidemiology are based on a mechanistic understanding of the underlying processes of inheritance, mutation, and selective pressure. social epidemiology at its core is based on the idea that our health depends on the behavior and health of those around us and, hence, has its own approaches to dependent happenings (though the terminology differs). recently, there has been increasing interest in using mechanistic modeling approaches similar to those used for infectious disease to understand health phenomena that are, in part, socially driven, such as obesity [ ] . physiological measurements are often founded on mechanistic models of processes within the body (e.g., the use of serum creatinine to approximate the glomerular filtration rate, a key measure of kidney function [ ] ). infectious disease dynamics is, perhaps, unique in epidemiology in the number of researchers that it brings from non-health related disciplines, particularly physics, computer science, and ecology. this, combined with the unique aspects of infectious disease systems, has contributed to the use of models that are distinct from "traditional" epidemiologic methods. however, the field is being transformed by the same forces that are transforming epidemiology in general: increasing access to technological tools and computational power; an explosion in the availability of data at the molecular, individual, and population levels; and a shift in what the important epidemiologic questions are as we eliminate old health threats and change our environment. increasing emphasis on principled statistical analysis in infectious disease modeling, combined with an increasing need to deal with dynamic phenomena in epidemiologic inference, opens up new opportunities for the cross-pollination of ideas and the erosion of the historical barriers between epidemiologic fields.
key: cord- -sbyu yuc authors: farrokhi, aydin; shirazi, farid; hajli, nick; tajvidi, mina title: using artificial intelligence to detect crisis related to events: decision making in b b by artificial intelligence date: - - journal: industrial marketing management doi: . /j.indmarman. . . sha: doc_id: cord_uid: sbyu yuc artificial intelligence (ai) could be an important foundation of competitive advantage in the market for firms. as such, firms use ai to achieve deep market engagement when the firm's data are employed to make informed decisions. this study examines the role of computer-mediated ai agents in detecting crises related to events in a firm. a crisis threatens organizational performance; therefore, a data-driven strategy will result in an efficient and timely reflection, which increases the success of crisis management. the study extends the situational crisis communication theory (scct) and attribution theory frameworks built on big data and machine learning capabilities for early detection of crises in the market.
this research proposes a structural model composed of a statistical and sentimental big data analytics approach. the findings of our empirical research suggest that knowledge extracted from day-to-day data communications such as email communications of a firm can lead to the sensing of critical events related to business activities. to test our model, we use a publicly available dataset containing , items belonging to users, mostly senior managers of enron during through the crisis. the findings suggest that the model is plausible in the early detection of enron's critical events, which can support decision making in the market. audiences based on their past crises and particularly crises that their audiences are aware of. scct suggests and emphasizes the usage of communication to maintain and withhold the reputation of an organization (coombs, . the effectiveness of actions taken in a recovery process is directly related to the quality of meaningful insights extracted from collected data. in many cases, there are stakeholders other than managers who can benefit from the knowledge obtained from a situational assessment. for example, stakeholders have a vested interest in understanding the impact of strategies implemented (ki & nekmat, ) . by using real-time data, management can keep the situation under control from becoming a full-fledged, blown up crisis. this is important for two reasons. firstly, sensing a crisis in its early stages will help the management team to improve an organization's preparedness of a crisis. secondly, during a crisis, it is crucial to have effective and productive communication. comprehensive knowledge and a good understanding of the nature of a crisis will help in planning, controlling, and leading the situation. this pioneering study is among the first studies that endeavour to use email data and sentiment analysis for extracting meaningful information that helps early detection of a crisis in an organization. our framework is designed based on cognitive architecture through the implementation of artificial agents. we developed a critical event detection analysis model (ceda) that extends scct and attribution theories based on ai and big data analytics. to expand on the proposed methodology, the next section of this paper is an overview of the current literature for existing methods developed on popular networks. facebook, twitter, rrs feeds, email, and others might have very different functionalities; however, knowledge built on one is fairly applicable to all, particularly in textual analysis. sections and of this study describe our methodology for discovering the change in trends in enron emails and provides an overview of its results. section focuses on big data and data mining, while section presents our approach to language-based sentiment analysis. in section , we discuss textual and sentimental analyses, whereas section underlines the theory of artificial intelligence rational agents. in section , we discuss the hypotheses and methodology used in this study, while section examines the combined effect of frequency analysis and sentiment analysis in detecting enron's crises. section concludes this study. faulkner ( ) defines crisis or disaster as "a triggering event, which is so significant that it challenges the existing structure, routine operation, or survival of the organization" (p. ). 
in this context, an organizational crisis, according to , is defined as the perception of an unpredictable event that threatens important expectancies of stakeholders that can severely impact an organization's performance and generate adverse outcomes. such outcomes, according to faulkner ( ) , can be considered as a shock at both the individual and collective levels, where the severity of the unexpected nature of the event may cause not only stress in the community but also a sense of helplessness and disorientation among others (faulkner, ) . our emotional response and confidence may affect our behaviour in assisting and aiding others (willner & smith, ) . willner and smith ( ) claim that weiner's attribution theory weiner ( weiner ( , indicates that our emotional response to any behaviour is directly related to our attributions to the source of the individual's behaviour and our confidence in whether the behaviour can be changed. jeong ( ) claims that, according to the weiner's attribution-action model, the tendency of the actor being punished by others increases if the actor who caused the problem was perceived to hold a responsibility to a dilemma (higher internal and lower external attributions), as opposed to when the higher external and lower internal attributions are made. scct, along with the attribution theory, offer guidelines to assess the reputational threat based on different crisis clusters and stakeholders' perceptions. thus, they provide frameworks for crisis communication while taking into account the organization's situation and the publics' emotions (ott & theunissen, ) . scct specifies ten crisis types or frames namely: natural disaster, rumour, product tampering, workplace violence, challenges, technical error product recall, technical-error accident, human-error product recall, human error accident, and organizational misdeed, while the attributions of crisis responsibility have been used to group the various crisis types into three main clusters of (a) victim, (b) accidental, and (c) intentional . the victim cluster, for example, contains crisis types that produce very low attributions of crisis responsibility (e.g. natural disasters) and represents a mild reputational threat, while the intentional cluster yields to strong attributions of crisis responsibility and represents a severe reputational threat . anderson and schram ( ) describe "crisis informatics" and "disaster informatics" as a field of research that focuses on the use of information and communication technologies during emergencies. the adoption and application of data science to address the managerial issues of business are still developing, yet the results have already been seen to be transformative for the organizations that have adopted it. in corporate finance, data science is widely used to help management handle tasks such as fraud detection and credit risk assessment (wu, chen, & olson, ) . inputs from internal and external events increase a firm's agility and help top management to make more informed decisions and mitigate risks involved (vosoughi, roy, & aral, ) . big data and the capability of its analytics in interpreting real-time events benefit management (baesens, bapna, marsden, vanthienen, & zhao, ) . as an example, the report released by towers watson in reveals the importance of extracted knowledge in the supervision of the energy and enablement of employees for effective management (global workforce study, ) . 
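the scct clusters described above can be made concrete with a small illustrative mapping; the code below is not from the paper, and labelling the accidental cluster as a "moderate" reputational threat is our assumption (the text only states that the victim cluster is mild and the intentional cluster severe).

```python
# Illustrative encoding of the SCCT crisis-type clusters named in the text.
# The victim->"mild" and intentional->"severe" threat levels follow the text;
# "moderate" for the accidental cluster is an assumption for illustration.
SCCT_CLUSTERS = {
    "natural disaster": "victim",
    "rumour": "victim",
    "product tampering": "victim",
    "workplace violence": "victim",
    "challenges": "accidental",
    "technical-error product recall": "accidental",
    "technical-error accident": "accidental",
    "human-error product recall": "intentional",
    "human-error accident": "intentional",
    "organizational misdeed": "intentional",
}

REPUTATIONAL_THREAT = {"victim": "mild", "accidental": "moderate", "intentional": "severe"}

def reputational_threat(crisis_type: str) -> str:
    """Map a crisis type to its cluster and the associated reputational threat."""
    cluster = SCCT_CLUSTERS.get(crisis_type.lower(), "unknown")
    return REPUTATIONAL_THREAT.get(cluster, "unknown")

print(reputational_threat("organizational misdeed"))  # severe
```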
situational awareness is the knowledge that can be integrated from accessible data and used to assess a situation in order to manage it (sarter & woods, ) . in fast-paced business environments, improving situational awareness helps both managers and other stakeholders to improve performance through their early engagement (nofi, ) . regarding crises, the content created and shared on an organization's data network, whether authored by the organization's actors or by external participants, becomes crucial. the first step in analyzing the situation is to collect the organization's network data. for large organizations, the main concerns with the data are scale (volume), streaming (velocity), forms (variety), and uncertainty (veracity). for example, in social media, users share information to establish connections with others (treem & leonardi, ) . content generated by users in social media has surpassed zettabytes of data (reinsel, gantz, & rydning, ) . social media, web analytics, and media semantics have been effective marketing tools for increasing brand awareness, loyalty, engagement, sales, customer satisfaction, and conversation related to business-to-business (b b) and business-to-customer (b c) interactions (agnihotri, dingus, hu, & krush, ; järvinen & taiminen, ; mehmet & clarke, ; siamagka, christodoulides, michaelidou, & valvi, ; swani, brown, & milne, ) . even though using social media to increase situational awareness is not new (watson & rodrigues, ) , organizational and technical changes must happen before social media can be fully embraced (plotnick & hiltz, ) . data are essential inputs to make better and informed decisions (waller & fawcett, ) . data mining is the process of finding meaningful patterns in data to extract valuable knowledge that leads to making informed decisions (witten, frank, hall, & pal, ) . during the development of a crisis, data stored in structured and unstructured datasets, in conjunction with various real-time data streams (feeds from social media or sensors), can empower management and stakeholders to understand its severity and intensity. some of these data are streaming data, and analyzing them requires processing a sheer volume of data. storing massive data is costly, and, in many cases, the data itself may not have the same value in the future. for example, processing detected earthquake signals at a later point in time might not be useful for detecting an earthquake. in the past, an organization's primary data were captured through business transactions and processes. resources such as enterprise resource planning (erp), human resources, financial and audit reporting, performance monitoring, customer relationship management (crm), and supply chain management (scm) are commonly used for sourcing internal data. recently, data from social networks and other external sources are considered new sources of data. data from external sources can be categorized into two groups. the first category includes data that are directly related to companies, such as online social media content and mobile devices. the volume of captured business and personal interaction data (such as social media and mobile data) is increasing tremendously. the second category includes data that may not be directly related to an organization; however, it can affect the organization's performance.
for example, collecting socio-cultural data can help to improve business processes by making decisions more aligned with the current cultural changes. also, knowledge derived from arbitrary data resources can be equally valuable to organizations. for instance, internet-connected devices form a massively connected network of things called the "internet-of-things" (iot), which produces tremendous amounts of data that are used for planning and managing smart cities. likewise, public data from data lakes or governments can be harvested (baesens et al., ) and used for data aggregation. big data refers to datasets with colossal volume, variety, and velocity (witten et al., ) . big data has been around since early and has become a powerful resource for many businesses (white, ) . business analytical tools are evolving, and they can now produce real business value (jans, lybaert, & vanhoof, ) . decision-making and forecasting models are the main areas of any business analytics (candi et al., ; wu et al., ) . data harvested from multiple sources are fed into sophisticated algorithms equipped with advanced statistics, econometrics, and machine learning sciences (hiebert, ) . these data will be used to uncover useful hidden patterns and relations that would help to make informed decisions (dubey, gunasekaran, childe, blome, & papadopoulos, ; holton, ). in the event of a crisis, bad decisions are often costly and result in the misappropriation of resources, which hurts both organizations and society. a scientific analysis of the crisis based on the knowledge obtained from the event helps to carefully craft a strategy to manage the situation (baesens et al., ) . in business settings, email can be a very important source of data for crisis management. over and above recorded messages, it exceptionally contains rich data such as timestamps, details of the sequence of interactions, and users' intention that emerge in the organizational context (bülow, lee, & panteli, ) . closely related to email source, rss feeds are the next source of data for crisis management. thelwall and stuart ( ) collected web feeds from rss databases, and google searches to build a database of daily lists of postings. then they generated a time-series graph of frequently used words that show a significant increase in usage during the monitoring period. for example, rss web feeds are effective in extracting words with a sudden increase in usage in posts relevant to crises. facebook is another effective communication platform, with more than one billion active users. a positive or negative message about an organization can spread in a matter of a minute. the observation showed that all social media (blogs, twitter, facebook, and others) influence individual perceptions of an organization more than the transported messages. in other words, the medium is more important than the message itself (ki & nekmat, ) . regarding the effectiveness of collecting real-time data, baesens et al. ( ) , points out a well-known japanese barbershop chain that has sensors in all its stores' chairs. sensors detect available seats for a haircut, duration of each haircut, and its total processing time. the collected information is used for the firm's online appointment system, performance analyses, and resource allocations. another prominent source of online network data is blogs, which a verity of topics including social, technical, political, and so forth are covered from the bloggers' perspectives. 
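the word-frequency approach attributed to thelwall and stuart above can be sketched as follows: build daily word counts from a stream of postings and flag words whose usage suddenly rises relative to their recent history. the window length and spike rule below are illustrative assumptions, not the original study's settings.

```python
# Daily word counts over a stream of postings, plus a simple spike rule:
# flag a word on a day when its count exceeds mean + z * std of the prior window.
from collections import Counter, defaultdict
from statistics import mean, pstdev

def daily_word_counts(postings_by_day):
    """postings_by_day: dict of day -> list of text postings."""
    counts = {}
    for day, posts in sorted(postings_by_day.items()):
        c = Counter()
        for text in posts:
            c.update(w.lower().strip(".,!?") for w in text.split())
        counts[day] = c
    return counts

def spiking_words(counts, window=7, z=3.0):
    """Return day -> list of words whose usage spiked relative to the prior window."""
    days = sorted(counts)
    flagged = defaultdict(list)
    for i, day in enumerate(days):
        if i < window:
            continue
        history = [counts[d] for d in days[i - window:i]]
        for word, todays in counts[day].items():
            past = [h.get(word, 0) for h in history]
            mu, sd = mean(past), pstdev(past)
            if todays > mu + z * max(sd, 1.0):
                flagged[day].append(word)
    return flagged
```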
generally, blogs are entirely personal and typically reflect bloggers' perceptions of the conversed topic (thelwall & stuart, ) . blogs can be influential based on what has been reported, the way it has been reported, the source that reports it, and the politics of the event. table shows a summary of the data sources described above and the related studies. digital media, such as social media networks, websites, and blogs, are used as popular platforms for managing crises in organizations. social networks facilitate real-time communication between individuals and groups, and they have become a common platform for engaging stakeholders in communication discourse. despite the availability of cloud collaboration platforms and the increased popularity of social media, email remains by far the most common medium used for communication in work settings (bülow et al., ; jung & lyytinen, ) . one of the drawbacks of social media data analysis is that a large number of messages must be monitored from the social network to find quantitative evidence during a crisis (thelwall & stuart, ) . extracting accurate data from the mining of unstructured data is not feasible; however, one of the important features of big data, triangulation, allows validating data. as such, the majority of human interaction data offers an opportunity to generate new insights (baesens et al., ) . investing in data quality may at first be considered difficult and expensive, yet the economic return of high-quality data is substantial. according to baesens et al. ( ) , even small gains in data quality improve analytical performance. during a crisis, big data can be used by organizations to create a favourable impression or to lead a selected group or audience towards a corrective action (hiebert, ) . in summary of the data sources and related studies: rss web feeds are effective in extracting words with a sudden increase in usage in posts relevant to crises (thelwall & stuart, ) ; on facebook, the medium is more important than the message itself in influencing an individual's perceptions of an organization (ki & nekmat, ) ; and blogs cover a variety of topics, including social, technical, and political perspectives from the bloggers (thelwall & stuart, ) . obtaining relevant data is an integral part of predictive analysis modelling, which requires extra attention and therefore becomes a challenging process. one way to target the relevant data is through the detection of frequently occurring items shared in the communication network (baesens et al., ) . kavanaugh et al. ( ) grouped posted messages into three categories. the first group consists of people who appear to be complaining at first; however, they represent opportunities for organizations. these messages, mostly posted directly on an organization's facebook or twitter pages, ask for improvement in a product or service. also, in this group, positive messages can be spread through online media channels to praise how well the organization resolved the problem. the second group is commentators. they are not asking for resolution; instead, they are venting to spread negative words on an organization's online media channels. they might go even further by posting complaint messages on other organizations' blogs or newsletters. the third group is the ugly one. this group intends to harm the organization's reputation by spreading harmful content. the goal is to spread negative word-of-mouth.
competitors can further harm an organization's reputation by taking advantage of these negative contents by greatly exaggerating the flaws (kavanaugh et al., ) . in , british airways' customer paid for a promoted tweet to maximize spreading his complaint about his lost luggage. his tweet got over , impressions in the first six hours of its posting (grégoire, salle, & tripp, ) . in online networks, comments charged by anger spread out quickly (fan, zhao, chen, & xu, ) . in "spite-driven" comments, in which a customer may go beyond negating the organization's reputation, going viral is more likely. (grégoire et al., ) . novelty encourages information sharing. since novelty provides more information to understand events better, it becomes more valuable to have and to share. therefore, novel news tends to be shared more often (vosoughi et al., ) . however, there are differences in why and how widely the true or false news spreads. true stories stimulate feelings of joy or sadness and are more likely to be anticipated. false stories, on the other hand, cause fear, disgust, and stronger emotional feelings. the empirical study of vosoughi et al. ( ) shows that people tend to react and reflect on false stories with higher rates. a list of , twitter hashtags' emotion weighted and classified into eight distinctive groups. the original classification of emotions to anger, fear, anticipation, trust, surprise, sadness, joy, and disgust is plutchik's ( ) work, and it is based on the national research council canada (nrc) lexicon, which contains approximately , english words. there are differences between news and rumours. news is an asserted claim, and rumours are shared claims among people. on platforms such as twitter, a rumour can easily be started by a topic being shared, and since it triggers emotions (fear or disgust), it might be retweeted many times. fake news also falls in this category. it is defined as a willful distortion of the truth. a study by vosoughi et al. ( ) showed that even though people who had spread false news had fewer followers and spent less time on twitter, their false news spread farther, faster, deeper, and more broadly than the active users who share the truth. in this regard, false news and rumours are overwhelmingly more novel than real news. to measure the novelty degree of true and false tweets, they compared the distributions of the experimental guided tweets for days. false news can create many disadvantages for businesses, including misallocation of resources, misalignment of business strategies, and even loss of reputation (vosoughi et al., ) . each developing crisis scenario is unique; however, high profile crises usually get more attention and are contingent. this is due to users' comments and replies that facilitate and accelerate the news to go viral (ki & nekmat, ) . the contingency starts when participants start responding to one another's replies. for a message to be contingent, the role of participants needs to be interchangeable (sundar, kalyanaraman, & brown, ) . contributors' perception of closeness to the subject heightens their engagement during crisis communication (ki & nekmat, ) . another reason for people's participation in conversations is their desire to connect to a community that shares similar opinions. this provides them with the opportunity to express their views about a crisis. it has been consistently seen that people would rather re-publish another user's message than start a new, standalone message. 
for example, in retweeting, users build upon another's commentary and opinion to develop their credibility on communicating their insights on a subject. ki and nekmat's ( ) research indicated that the majority of messages posted on an organization's facebook wall during crises consisted of individual responses built upon the messages of others. the impact of the internet in delivering news of crisis events is significant (bucher, ) . for example, in january , a group of young people in france used a youtube channel to express their reason for switching from the orange cell provider to free. the video went viral, and it was viewed more than . million times, almost overnight. in another case, fedex suffered from negative publicity from someone posting a youtube video showing a fedex driver throwing a package containing a fragile item. this video was viewed over half a million times by the time fedex had responded on the third day of its posting. still, three years after its posting, the video was viewed over nine million times (grégoire et al., ) . in another example, grégoire et al. ( ) reported on a young girl who was offended when she received a t-shirt as a gift in , and she decided to share her opinion on the company's facebook page. the page at the time had . million followers. her comment received hundreds of responses and then hit twitter. one of the goals of managing a crisis event is to communicate information effectively to the relevant audience in time. this helps to lessen the adversarial impact on business performance and hopefully promote a positive image. some social media channels, such as instagram, pinterest, and flickr, might be more effective in spreading word-of-mouth than traditional communication channels. in many instances, it is more convenient for customers to reach directly into organizations through online media channels (tweeting or sending an email) as opposed to buying a newspaper to read an article about the company or sending paper mails (grégoire et al., ) . it is not always easy to understand the intention behind the communicated messages since the people on the other side of the crisis communication may have a different agenda (thelwall & stuart, ) . statistical methods have been used and are still being used for identifying and classifying text. antweiler and frank ( ) used statistical algorithms to code the messages collected through yahoo finance as bullish, bearish, or neither (antweiler & frank, ) ; however, language-based analysis has recently become more predominant to use (li, ) . araque, corcuera-platas, sánchez-rada, and iglesias ( ) argue that with the increasing power of social networks, many natural language processing (nlp) tasks and applications are being used in order to analyze this massive information processing. one of the domains in which language-based analysis has been employed is in examining the content of corporate financial reports and executive conference calls (larcker & zakolyukina, ) . fraud-monitoring software sifts through employees' emails or documents to detect a. farrokhi, et al. industrial marketing management ( ) - corporate misconduct (purda & skillicorn, ) . the text analytics-based sentiment analysis involves a very largescale data collection, filtering, classification, and clustering aspects of data mining technologies that handle the application of text analytics (sharda, delan, & turban, ) . 
by tapping into data sources, such as tweets, facebook posts, online communities, emails, weblogs, chat rooms, and other search engines, sentiment analytics offers marketers, decision-makers, and other stakeholders, insights about the opinions in the text collection (sharda et al., ) . essentially, the language-based analysis approach is based on earlier psychology and linguistic work. this approach has resulted in building lists of words ("bags of words") in which each group of words is associated with a particular sentiment. for example, it can examine a text for an indication of anger, anxiety, and negation (larcker & zakolyukina, ) or negativity, optimism, and deceptiveness (li, ) . the sender's psychological state of mind is affected by the individual's assessment of the surroundings. the individuals' reflection of their surroundings can be seen in their writings and shared information patterns (yates & paquette, ) . sentiment analysis is the process of classifying texts (araque et al., ) into positive, negative, or neutral. the process is designed to detect the underlying hidden expression in the text. in business, sentiment analysis is used in various domains, including customer need to change analysis, marketing, and performance enhancement (ragini, rubesh, & bhaskar, ) . in this context, sentiment analysis is used to determine a text's subjectivity and its polarity (ragini et al., ) . various studies have used language-based analysis to assess situations during times of disasters; however, this study found it more useful when the language-based analysis was combined with statistical analysis. privacy is a major for the current digital age (hajli, shirazi, tajvidi, & huda, ) . individuals' social background, cultural expectations, and norms shape their privacy expectations (nissenbaum, ). xu, jiang, wang, yuan, and ren ( ) argue that although the information discovered by data mining can be precious to many applications, there is an increasing concern about the other side of the coin, namely the privacy threats posed by data mining. previous research argues social media platforms, such as facebook, twitter, and apps location services to track their users, may pose major privacy concerns (nadeem, juntunen, shirazi, & hajli, ; wang, tajvidi, lin, & hajli, ) . another example is related to mobile devices in which some systems allow the installed apps to track and use the contextual information of users in order to adapt to the new conditions in an automated fashion. privacy protection in contextually aware processes relies on procedures that address the anonymity and confidentiality of personal information (shirazi & iqbal, ) . shirazi and iqbal ( ) studied privacy concerns with communicated messages in the context of both internal and external organizational networks. shin and choi ( ) argue that despite the fact that big data technology has the potential to provide powerful competitive advantages, private and public organizations are struggling to establish effective governance and privacy in connection with big data initiatives. in the distributed privacy preservation model, as noted by aggarwal and yu ( ) , a new aggregated dataset that does not include any personally identifiable information can be obtained from the source records. aggregating extracted anonymous data from organizations' employees' emails, for example, provides the required personal information privacy protection for all parties. 
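as a hedged sketch of the anonymization step described above, the code below drops direct identifiers and masks obvious identifying strings before any text analysis; the field names and patterns are hypothetical, and real de-identification of quasi-identifiers requires far more careful treatment.

```python
# Drop direct identifiers from an email record and mask email addresses and
# phone-like numbers in the body. Field names and regexes are illustrative.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+")
PHONE_RE = re.compile(r"\b(?:\+?\d[\d\s().-]{7,}\d)\b")
DIRECT_IDENTIFIERS = {"sender_name", "employee_id", "phone_number"}

def anonymize_record(record: dict) -> dict:
    """Return a copy of an email record with identifiers removed or masked."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    body = clean.get("body", "")
    body = EMAIL_RE.sub("[email]", body)
    body = PHONE_RE.sub("[phone]", body)
    clean["body"] = body
    return clean

sample = {"sender_name": "Jane Doe", "employee_id": "E123",
          "body": "call me at 713-555-0100 or mail jane.doe@example.com"}
print(anonymize_record(sample))
```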
in one behavioral aspect of organizational data usage, the management and employees tend to change their attitudes towards using an organization's data when they understand what data are accessed during the data collection process and what it has been used for (díaz, rowshankish, & saleh, ) . díaz et al. ( ) argue that building a data-driven culture in organizations will ease and enhance the actual data analytical efforts. building awareness among the employees has to be a part of an organization's data privacy strategy while holding them accountable and responsible for protecting all organizational data, especially private personal information. as such, before we generated a table of text extracted from emails, the dataset was anonymized in such a way that identifiers (e.g., name, employee id, and phone number), quasi-identifiers (attributes that can be linked to external data), and other sensitive attributes were removed from the dataset (ghinita, tao, & kalnis, ; mamede, baptista, & dias, ) . roux-dufort ( ) argues that the traditional perception of crises used to narrow down and only include exceptional circumstances. the scandal of financial domains (enron, worldcom), significant terrorist attacks ( / ), supernatural events, such as hurricanes and so forth are only a handful of examples of those events. as such, research and studies in the field of crisis management gain their legitimacy through the process of investigation and from the potency of the investigated incidents. therefore, the more important the incident, the more licit the investigation will be because the ambiguity of the content urges the need to obtain knowledge (roux-dufort, ) . this study aims to develop a big data analytics framework by deploying artificial intelligence rational agents generated by r/python programming language capable of collecting data from different sources, such as emails, tweets, facebook, weblogs, online communities, databases, and documents, among others (structured, semistructured, and unstructured data). r/python programming with their extensive libraries, frameworks and extensions offers excellent tools and capabilities for solving complex projects involving artificial intelligence and big data. these capabilities include but are not limited to artificial intelligence, machine learning, data science, natural language processing and object detection and tracking (joshi, ) . to test our model, we focused on existing emails extracted during the enron crisis. as a consequence, this pioneering project is, in fact, among the first studies that endeavour to use sentiment analysis for extracting meaningful information that helps early detection of a crisis in an organization. our framework is designed based on cognitive architecture through the implementation of artificial agents. research on artificial intelligence (ai) agents has long focused on developing mechanisms to enhance how agents sense, keep a record of and interact with their environment (castelfranchi, ; elliott & brzezinski, ; rousseau & hayes-roth, ) , the so-called "intelligent systems" (russell, ) . recent studies of cognitive architectures indicate both abstract models of cognition, in natural and artificial agents, and the software-based models (lieto, bhatt, oltramari, & vernon, ) for designing systems that do the right things intelligently (russell, ) . this approach encompasses considering the intelligent entity as an agent, that is to say, a system that senses its environment and acts upon it. 
in this context, an agent is defined by the mapping from percept sequences to actions that the agent instantiates. we define rational agents as agents whose actions make sense for the information possessed by the agent and its goals. kreiner ( a, b) argue that rationality refers to the purposefulness and forward-looking character of an agent. the theoretical foundation of perfect rationality within ai is well defined by newell's paper on "knowledge level" (newell, ) . knowledge-level analysis of ai systems relies on an assumption of perfect rationality. it can be used to establish an upper bound on the performance of any possible system by establishing what a perfectly rational agent would do given the same knowledge (russell, ) . within the context of ai, russell ( ) argues that intelligence is strongly related to the capacity for successful behaviour-the so-called "agent-based" view of ai. the candidates for formal definitions of intelligence (dean, aloimonos, & allen, ; russell, ; russell & a. farrokhi, et al. industrial marketing management ( ) - norvig, simon, ; wellman, ) of a system s are perfect rationality (the capacity to generate maximally successful behaviour); calculative rationality (the capacity to compute the perfectly rational decision); metalevel rationality (the capacity to select the optimal combination of computation sequence-plus-action); and bounded optimality (the capacity to generate maximally successful behaviour given the available information and resources). the metalevel rationality applied in this study is, in fact, a knowledge-level analysis or perfect rationality associated with computing action. in other words, while perfect rationality is difficult to achieve, considering the limitation of computing settings, the metalevel rationality of ai has been deployed in this study (see fig. ). russell ( ) is the capacity to select the optimal combination of computation sequence-plus-action under the constraint that the computation must select the action. metalevel architecture splits the agent into two (or more) notional parts. the object-level carries out computations concerned with the application domain, such as projecting the results of physical actions, computing the utility of certain security states, and so on. the metalevel is a type of decision-making process whose application domain consists of the object-level computations themselves, the computational objects, and the states that they affect. the sheer volume of corporate archival and real-time data requires a change in traditional crisis analysis approaches that use static, archival data and manual analysis. as mentioned by vera-baquero, colomo-palacios, and molloy ( ), real-time access to business performance information is critical for corporations to run a competitive business and respond to the ever-changing business environment. with machine learning, ai automation and big data analysis, as deployed in our critical event detection analysis (ceda) method depicted in fig. , can build patterns of positive and negative (abnormal) words to monitor emerging trends proactively before a potential crisis occurs. in this context, the machine learning algorithms help us anticipate when employees (or stakeholders) are experiencing issues, and allow crisis managers to potentially address the problem and control how incidents are communicated and presented to the world . 
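the object-level/metalevel split described above can be illustrated with a small sketch in which object-level routines perform domain computations while a metalevel controller selects which computation-plus-action to run under a computational budget; the costs and utilities are hypothetical, and this is not the paper's implementation.

```python
# Object-level computations estimate the utility of acting on current percepts;
# the metalevel picks the affordable computation with the best estimate.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ObjectLevelComputation:
    name: str
    cost: float                      # assumed computational cost
    run: Callable[[dict], float]     # returns estimated utility of acting

def metalevel_select(percepts: dict,
                     computations: List[ObjectLevelComputation],
                     budget: float) -> str:
    """Select the affordable object-level computation with the best estimated utility."""
    affordable = [c for c in computations if c.cost <= budget]
    if not affordable:
        return "no-op"
    best = max(affordable, key=lambda c: c.run(percepts))
    return f"act-on:{best.name}"

computations = [
    ObjectLevelComputation("frequency-scan", cost=1.0,
                           run=lambda p: p.get("email_spike", 0.0)),
    ObjectLevelComputation("sentiment-scan", cost=3.0,
                           run=lambda p: -p.get("day_sentiment", 0.0)),
]
print(metalevel_select({"email_spike": 2.4, "day_sentiment": -1.1}, computations, budget=2.0))
```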
previous studies have considered the use of network data for situational awareness; however, to the authors' knowledge, none have specifically investigated or analyzed the use of email communication by major organizations for situational assessment of a developing crisis. in our method, we used email data to detect critical events. email usage is fairly well distributed across all types of organizations in developed nations. a survey has shown that the two most used channels for communication in organizations are the intranet ( %) and email ( %). by and large, email is the most common channel for communication in organizations (moynihan & hathi, ) . this study further examines trends in email communication displayed by organizations' email users to provide a more comprehensive examination of the effectiveness of email meta-data for organizational crisis detection by asking the following questions: rq . : what are the impacts of a sudden change in email communication trends and the sentiment of the day on identifying a developing crisis in an organization? rq . : how can an artificial agent's meta rationality identify a sudden change in the email communication trend? (the capacity to select the optimal combination of computation sequence-plus-action under the constraint that the computation must select the action). our approach seeks to improve the detection of a crisis in organizations in its early stages. emails, when analyzed effectively, can allow management to make informed decisions to avoid a potential crisis. as argued by ulmer et al. ( ) , crisis communication is growing as a field of study. unpredictable events or crises can disrupt an organization's operations and threaten to damage organizational reputations (coombs & holladay, ) . in particular, we are interested in the analysis of text and its relationship with the contexts in which it was used. in this context, early detection of a crisis through analysis of patterns of communication context is a particularly important step in tackling crises in their early stages. if critical events are not detected in the early stages, they may develop into potentially unmanageable crises. building on our discussion covered in sections , . , . , , . , and . , we propose: h . : a sudden increase in the frequency of communicated emails positively correlates with a developing situation in an organization. people's behaviour is shaped by the response to factors such as feelings, attitudes, beliefs, abilities, consequences of action, and accepted social norms. human responses to internal or external stimuli are also evident in online communication channels. for example, swearing in social networks promotes retaliatory responses (turel & qahri-saremi, ) . linguistic analysis can effectively be used in detecting financial fraud. purda and skillicorn ( ) showed that when a company is committing fraud, its writing style and presentation style in communicating financial information change. studying organizations' management discussion and analysis (md&a) reports showed that companies that have been associated with fraudulent activity in the past tend to write the md&a section of the -k report without referring to words relating to merger activity or potential legal problems (words such as settlement, legal, and judgments). combining the above discussion and our argument from section , we propose:
h . : the overall sentiment of the day, when negative, correlates with a developing crisis in an organization. fig. shows the relationships among the indicators in our proposed integrated model. we suggest that a sudden change in email communication trends, that is, a sudden increase in the frequency of communicated emails taken together with the day's overall sentiment, predicts public behavioral intention. to capture this information, we developed a methodology that captures sudden changes in email communication patterns in an organization. for our study, we worked on the publicly available enron email dataset. this dataset was initially collected and prepared for the calo project. it contains , items belonging to users, mostly senior management of enron. the email data are organized in the form of files and folders. originally, the email dataset was made public by the federal energy regulatory commission during its investigation. it turned out to have many integrity problems, and later on, melinda gervasio, at sri, corrected the problems. to our knowledge, this is the best available substantial dataset that relates to an organizational crisis and concerns the public interest. fig. illustrates our critical event detection analysis method (ceda), which consists of four stages. in the first stage, data preparation, we developed a python script that crawls the dataset's folders and builds a corpus of data, including some initial statistical data relevant to each item set, and distinguishes sent emails, received emails, and non-email items. in total, we recognized , sent emails and , received emails. the numbers have factored in the ccs and bccs. we developed an automatic method in order to extend and examine other email datasets if they become available. the advantage of an automatic method is that it can provide convenient and standardized access to extract relevant information and is not intrusive. each item in the enron mail directory list contains the header of the email, the subject, the date, the correspondence of the sender and receiver, and the email body sections. to work with the data, we built a new corpus of data where each line became a new line for the data frame in r. in the second stage, we identified and compared communications from different periods of the last three years of enron before its bankruptcy. in ceda, situational events are detected by inferring from trend changes in communication patterns in the existing dataset. in near real-time processing, continuous massive streaming of content is processed in a single instant, since the streaming data may not be available for reprocessing again, or it does not have the same value at a later time. to process and analyze semi-structured data, we first studied some available tools that could perform pattern discovery on text files. as an aggregation tool for semi-structured data, we deployed nvivo software version . nvivo allows importing, managing, and analyzing text and has advanced visualization tools. a common issue with using the existing analytical tools is that most of them are designed to perform predefined tasks. after some preliminary assessment, we found them to be very limited for this study. our input data were scattered widely in irregular patterns under several folders and subfolders. also, the volume and number of records in our dataset were beyond the practical functionality of these tools. initially, it was difficult to infer from the email counts whether they were crisis-related.
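a minimal sketch of the kind of first-stage script described above, using only the python standard library, is shown below: it walks a maildir-style folder tree, parses each message header, and counts recipients across the to:, cc:, and bcc: fields. the paths, folder conventions, and counting rules are simplified assumptions (the paper's own rules also handle mailing-list detection, for example).

```python
# Walk a maildir-style tree, parse message headers, and count unique recipients.
import os
from email import message_from_file
from email.utils import getaddresses

def recipient_count(msg) -> int:
    """Count unique addresses across the to:, cc:, and bcc: header fields."""
    pairs = getaddresses(msg.get_all("to", []) +
                         msg.get_all("cc", []) +
                         msg.get_all("bcc", []))
    return len({addr.lower() for _, addr in pairs if addr})

def scan_maildir(root: str):
    """Yield (path, date_header, recipient_count) for every parsable message under root."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, errors="ignore") as fh:
                    msg = message_from_file(fh)
            except OSError:
                continue
            if msg.get("message-id") is None:      # skip non-email items
                continue
            yield path, msg.get("date"), recipient_count(msg)

# Hypothetical usage:
# for path, date, n in scan_maildir("enron_maildir/lay-k"):
#     print(date, n)
```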
for instance, sent/received email frequency on its own would not be conclusive enough to indicate an emerging crisis. however, including other external environmental factors, such as historical news events from reputable news outlets and the market's reflection in enron's stock price, increased our method's predictive capability. fig. illustrates the summary count of the resulting calculation of sent and received emails using our method of inspecting the enron dataset. for example, to calculate the number of sent emails, we included all the recipient addresses that were in the to:, cc:, and bcc: sections of the header of that email as well. we calculated the number of received emails by counting how many other people received the same email, by looking at the detailed information available in the metadata of each email. we also detected mailing lists and tagged them separately. our preliminary examination of enron's email suggested that the frequency change in sent and received emails could provide valuable information. to detect ongoing crisis events at enron, we defined a statistical model that raises red flags for suspicious instances. to explain our statistical model, suppose we knew event (e ) was occurring on may , (fig. ) . let us consider the line l avr to represent the average number of emails received per day from may through may . in the same way, let us consider l σ and l σ to represent one standard deviation and two standard deviations, respectively, of the number of emails received per day for the same period. the threshold in our model is defined as any day on which the total number of emails received falls outside . standard deviations of the mean of its previous days' number of emails received. this may indicate that an important message has triggered a sudden increase in the number of emails received for that day. in other words, the message has gone viral.
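the red-flag rule described above can be sketched as a rolling mean/standard-deviation test: a day is flagged when its received-email count exceeds the mean of the preceding window by more than k standard deviations. the window length and k below are parameters of the sketch, not the exact values used in the paper, and weekends would be excluded before applying the rule, as the text notes.

```python
# Flag days whose received-email count exceeds mean + k*std of the prior window.
from statistics import mean, pstdev

def flag_spike_days(daily_counts, window=30, k=1.5):
    """daily_counts: list of (date, received_count) in chronological order."""
    flagged = []
    for i in range(window, len(daily_counts)):
        history = [c for _, c in daily_counts[i - window:i]]
        mu, sigma = mean(history), pstdev(history)
        date, count = daily_counts[i]
        if sigma > 0 and count > mu + k * sigma:
            flagged.append((date, count))
    return flagged

# Hypothetical usage, with window and k as assumed parameters:
# spikes = flag_spike_days(list_of_day_count_pairs, window=30, k=1.5)
```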
our sentiment analysis consists of two stages, namely, individual email-based polarity analysis and day-based polarity analysis. each email has information about its sentiment. to avoid a washout effect, we did not work with the combined text of all emails. in order to categorize the emails, we customized publicly available tools to score the sentiment of each email based on the bing liu lexicon. a lexicon is a dictionary of words that is used for calculating the polarity of a text. bing liu's opinion-mining lexicon contains positive words and negative words, including misspelled and slang terms as well (liu, ). opinion words or phrases in a text play a key role in carrying the sentiment of the text as a whole (ragini et al., ). fig. shows the output of ceda's sentiment analysis of enron's received emails (see also figs. and in appendix b). we used a mix of market responses and media responses as a proxy for the firm's real crisis events. the average daily percent movement of the stock market is used as a basis to detect crises in enron's financials. looking at the s&p stock market over the last ten years, the average daily move in the stock price is between − % and + % (financial, ). therefore, for our study, any change beyond % is considered an important market reaction event. ceda compares enron's daily closing price to up to five past consecutive closing prices. if the day's stock price is less than the stock price n days ago minus a % accumulated loss for the past n days, that is considered a negative market reaction to a possible crisis event. we searched public news outlets and academic sources to retrieve the list of important events that hit enron during the sample period. the events are taken from the new york times and washington post dailies and from the agsm (australian graduate school of management), unsw sydney. table shows a partial news events timeline taken from data collected by the university of new south wales-sydney (unsw business school). the table represents examples of how the news was coded. the variable is set to if an event translated into a crisis event, and to otherwise. a complete list of all chronological news events is presented in appendix a. hypothesis examines the correlations between sudden increases in daily communicated email trends and developing situations in an organization. hypothesis examines the correlations between the overall sentiment of the days and developing situations in an organization. we used logistic regression analysis as an appropriate inferential statistic for three reasons. first, our research questions are relational questions seeking information about the relationships between events detected by ceda and crisis events; we are interested in determining whether there is an association among these variables. second, we considered the level of measurement; all four variables in this study are dichotomous: sudden change in email communication trends detected (yes/no), the overall sentiment of the day (negative/positive), a sudden drop in the firm's stock price (yes/no), and bad news in the media (yes/no). third, logistic regression allows testing the probability of a given case falling into one of two categories of the dependent variable.
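the two proxies described above, the lexicon-based sentiment of the day and the market reaction flag, can be sketched as follows; the lexicon word sets, the n-day look-back, and the percentage threshold are assumptions standing in for the elided values, and this is only one reading of the accumulated-loss rule.

```python
def day_sentiment(received_texts, positive_words, negative_words):
    """lexicon-based polarity for one day: sum hits over all received emails."""
    score = 0
    for text in received_texts:
        words = text.lower().split()
        score += sum(w in positive_words for w in words)
        score -= sum(w in negative_words for w in words)
    return "negative" if score < 0 else "positive"

def market_reaction_days(closing_prices, n=5, daily_loss=0.02):
    """flag day i if its close is below the close n days earlier minus the
    accumulated loss threshold for those n days."""
    return [i for i in range(n, len(closing_prices))
            if closing_prices[i] < closing_prices[i - n] * (1 - daily_loss * n)]
```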
experimental results and statistical analysis
weiner's ( , ) attribution theory has three main parts, which are "locus" (whether the source of the cause is internal or external), "controllability" (whether the person has a sufficient degree of control over his/her behaviour), and "stability" (whether the cause of the behaviour is permanent or temporary) (willner & smith, ). the sentiment analysis deployed in this study searches for additional cues in building meaningful information (negative or positive) that helps in the early detection of a crisis in an organization. in the first part of the proposed method, critical events are detected by flagging days that are outliers relative to the average number of emails received over the preceding days. in the second part, we used lexicon-based sentiment analysis in order to compute the sentiment score of each day. for correctly detected critical days for which there was not a corresponding external news event to support them, our comparison cloud analysis provided excellent insight into the ongoing issues of the day in question (fig. ). table presents a sample output from the dataset preparation stage of our proposed critical event detection analysis model (fig. : ceda, apply filtering based on item types and apply email counting rules). at this stage, the unstructured enron dataset's email data, which were in the form of files and folders, were categorized and filtered based on item types (sent, received, calendar, contact, note) and transformed into our numerical version of the enron corpus data. each row in our dataset represents the statistics for one day: "date" is the day for which the data extraction and calculations were made; "email counts" is the total number of emails sent and received on that day; "total word count" is the total word count of all emails for that day; "to" is the total number of email addresses to which emails were sent using the "to" box; "ext email address/to count" is the number of those addresses belonging to external users (non-enron email addresses); the same explanations apply to "cc" and "bcc"; "reply chain" is the total number of times that the emails were forwarded; "all to" is the total number of email addresses to which emails were sent altogether (to + cc + bcc); and a similar definition of "ext email address" applies to "all ext to addresses". tables and show examples of the resulting output from assessing the sentiment of each day extracted from the enron emails. the texts from the body and subject lines of all received emails for each day were analyzed using publicly available sentiment-based lexicons. we calculated the sentiment of the email data using the bing liu, afinn, loughran, and nrc lexicons to increase the reliability of our findings. enron's news events (appendix a, table ) were manually collected and classified. in order to have a better classification of the crisis news, the table was presented to six mba students, three ph.d. students, and two information systems professors, who were asked to rate the news based on whether it related to a crisis event or not. table in appendix a presents the consensually agreed results. the variable is set to if a news event translated into a crisis event, and to otherwise. in table , the critical event information is reflected in the fourth column. the second and third columns of table are the output of our ceda model.
similarly, when a surge in the number of received emails was detected or the sentiment of the day was negative, the corresponding variable is set to . we ran a series of logistic regression analyses to test the relations between the hypothesized constructs. the binary logistic regression analysis result showed that our model is a good fit (− *log likelihood = . ) (appendix b, table ). referred to as model deviance, − *log likelihood is the most useful test to compare competing models in binary logistic regression analysis (stevens & pituch, ). the omnibus tests of model coefficients also confirm that our model includes the set of predictors that fits the data significantly better than a null model (χ ( ) = . , p < . ) (appendix b, table ). in table , each predictor's regression slope (b) represents the change in the log odds that a crisis event detected by ceda (eev, within days) falls into the group of market and media reactions to real crisis events (news&nse). the model's positive regression slope (b = . ) indicates that an event detected by our predictor variable (detected events within three days) has a high probability of falling into the target group (news&nse). in this test, it is important that the odds ratio (exp(b)) does not include . (not zero) between the lower and upper confidence bounds of the % confidence interval. table provides the accuracy of the model. the overall classification accuracy based on the model is . %. our method of detecting a sudden change in the trend of the number of received emails, combined with the sentiment of the day, is conclusive and predicted % (table ) of the news events published in major news outlets and of the market responses when a crisis event hit enron during the period of our study. the results support our hypotheses. detection of critical events by identifying changes in the pattern of the number of received emails, in combination with the sentiment analysis method, is statistically significant. new technological advancements such as artificial intelligence have become an important foundation of competitive advantage in the market for firms. therefore, in this research, we examine the role of computer-mediated artificial intelligence agents in detecting crisis-related events in a firm. the findings of our empirical research suggest that knowledge extracted from day-to-day data communications, such as the email communications of a firm, can lead to the sensing of critical events related to business activities. critical events, in general, if not detected in the early stages, are a threat to an organization and can become unmanageable crises. the past crises or history of crises in an organization could help crisis managers evaluate whether a recent crisis is an exceptional event (unstable), or is part of a pattern of events (stable). as has been claimed, recurrence of an event, or continuation of more than one event, may indicate that the recent event is not an exceptional incident. as such, it is crucial for organizations to have the ability to access real-time crisis information to be able to assess the situation. technology provides a platform in which crisis-related information is acquired in the fastest and most direct manner. computer-mediated communication platforms, such as an organization's internal communication channels and external social media channels, facilitate real-time dialogue between intended stakeholders and, therefore, become strategically important.
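the logistic regressions reported above can be reproduced in outline as follows; the file name and column names are placeholders for the dichotomous variables described in the text (surge detected, negative sentiment, and market/media reaction).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("ceda_daily_flags.csv")        # one row per day, 0/1 columns
X = sm.add_constant(df[["surge_detected", "negative_sentiment"]])
y = df["news_or_nse"]                           # market/media reaction to a real crisis

result = sm.Logit(y, X).fit()
print(result.summary())                         # regression slopes (b) and p-values
print("-2 * log likelihood:", -2 * result.llf)  # model deviance
print("odds ratios:", np.exp(result.params))
print(np.exp(result.conf_int()))                # interval for exp(b) should exclude 1
```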
we performed analyses of the enron email dataset to identify changes in patterns of emailing frequencies and used that information to detect critical events. we developed a big data tool to perform an initial email count and to calculate the number of "emails sent" and "emails received" for the last three years of enron before its bankruptcy. then, we used advanced analytical tools to visualize and represent the results of the email counts to make greater sense of a large amount of data more quickly and easily. pattern changes in the emails' metadata proved particularly important in detecting and assessing situational events. despite the simple psychological fact behind the change in the number of emails sent and received during crisis periods, the use of email metadata is a relatively underexplored area. this study analyzed the effectiveness of enron users' email communications in detecting critical events. through the mining of the organization's semi-structured email data and using more in-depth content analysis, we developed a model, called the critical event detection analysis model (ceda), for detecting critical events. the model analyzes the connection between the frequency change in the number of emails received and an ongoing situation communicated through an interactive, computer-mediated communication channel. to obtain a better result in the detection process, the model factors in the textual character of communicated messages (e.g., polarity, emotions). our managerial contribution is to provide a tool to enhance decision-making in organizations by detecting a crisis in its early stages. our theoretical contribution is to build a framework to detect triggering events in an organization using email metadata. this can serve as a foundation for other researchers to explore other social network data or metadata for assessing and predicting emerging critical events in organizations. our main practical implication is the development and introduction of the critical event detection analysis model (ceda) for detecting critical events in this empirical study. our analysis of enron users' email communications helped us develop the model into a big data tool that performs initial email counts; in our study, this meant calculating the number of "emails sent" and "emails received" for the last three years of enron before its bankruptcy. we argue that critical events are a threat to an organization. organizations can learn from past crises to empower crisis managers to evaluate whether a recent crisis is unstable or stable. our findings provide the ability to access real-time crisis information so that the situation can be assessed. this helps organizations, for example, to forecast the market better. our research suggests to managers that computer-mediated communication platforms, such as an organization's internal communication channels and external social media channels, are essential tools to enable real-time dialogue between intended stakeholders. this study was limited to an organizational crisis and as such did not cover external crises such as the severe acute respiratory syndrome pandemic and the swine flu epidemic (pan, pan, & leidner, ), large-scale natural disasters (e.g., hurricane katrina, the asian tsunami), and/or significant events, such as the / terrorist attacks.
another limitation of this study is associated with the sentiment analysis of english text; thus, languages other than english need to be explored. a possible extension to the methodology would allow researchers to extend the domain to multinational corporations (mncs). the authors are grateful for the contribution of mr. shahyad sharghi in this study.
appendix a. chronological news events timeline.
appendix b. logistic regression output tables (estimation terminated because parameter estimates changed by less than the convergence criterion; table testing the theoretical model's improvement in fit after including the full set of predictors: chi-square, df, sig.).
references
event detection in social streams
a general survey of privacy-preserving data mining
social media: influencing customer satisfaction in b b sales
design and implementation of a data analytics infrastructure in support of crisis informatics research (nier track)
is all that talk just noise? the information content of internet stock message boards
enhancing deep learning sentiment analysis with ensemble techniques in social applications
rationality, imagination and intelligence: some boundaries in human decision-making
transformational issues of big data and analytics in networked business
crisis communication and the internet: risk and trust in a global media
distant relations: the affordances of email in interorganizational conflict
social strategy to gain knowledge for innovation
modelling social action for ai agents
impact of past crises on current crisis communication: insights from situational crisis communication theory
ongoing crisis communication: planning, managing, and responding
protecting organization reputations during a crisis: the development and application of situational crisis communication theory. corporate reputation review
comparing apology to equivalent crisis response strategies: clarifying apology's role and value in crisis communication
artificial intelligence: theory and practice
why data culture matters
big data and predictive analytics and manufacturing performance: integrating institutional theory, resource-based view and big data culture
autonomous agents as synthetic characters
anger is more influential than joy: sentiment correlation in weibo
towards a framework for tourism disaster management
average daily percent move of the stock market: s&p volatility returns
communication barriers in crisis management: a literature review
global workforce study at a glance ( ). the global workforce study at a glance. retrieved from the towers watson global talent management
managing social media crises with your customers: the good, the bad, and the ugly
towards an understanding of privacy management architecture in big data: an experimental research
public relations and propaganda in framing the iraq war: a preliminary review
identifying disgruntled employee systems fraud risk through text mining: a simple solution for a multi-billion dollar problem
internal fraud risk reduction: results of a data mining case study
harnessing marketing automation for b b content marketing
public's responses to an oil spill accident: a test of the attribution theory and situational crisis communication theory
artificial intelligence with python
towards an ecological account of media choice: a case study on pluralistic reasoning while choosing email
social media use by government: from the routine to the critical
situational crisis communication and interactivity: usage and effectiveness of facebook for crisis management by fortune companies
detecting deceptive discussions in conference calls
survey of the literature
the role of cognitive architectures in general artificial intelligence
sentiment analysis and opinion mining
automated anonymization of text documents
b b social media semantics: analyzing multimodal online meanings in marketing conversations. retrieved from
internal communications: emerging trends and the use of technology
consumers' value co-creation in sharing economy: the role of social support, consumers' ethical perceptions and relationship quality
the knowledge level
privacy in context: technology, policy, and the integrity of social life
defining and measuring shared situational awareness (no. crm-d . a )
review reputations at risk: engagement during social media crises
crisis response information networks
barriers to use of social media by emergency managers
the nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice
accounting variables, deception, and a bag of words: assessing the tools of fraud detection
big data analytics for disaster response and recovery through sentiment analysis
data age : the evolution of data to life-critical. idc white paper
a social-psychological model for synthetic actors
is crisis management (only) a management of exceptions
rationality and intelligence
artificial intelligence: a modern approach
situation awareness: a critical but ill-defined phenomenon
business intelligence, analytics, and data science: a managerial perspective
ecological views of big data: perspectives and issues
community clouds within m-commerce: a privacy by design perspective
determinants of social media adoption by b b organizations. industrial marketing management
rational choice and the structure of the environment
applied multivariate statistics for the social sciences: analyses with sas and ibm's spss
explicating web site interactivity: impression formation effects in political campaign sites
should tweets differ for b b and b c? an analysis of fortune companies' twitter communications
ruok? blogging communication technologies during crises
social media use in organizations: exploring the affordances of visibility, editability, persistence, and association
explaining unplanned online media behaviors: dual system theory models of impulsive use and swearing on social networking sites
post-crisis communication and renewal: expanding the parameters of post-crisis discourse
real-time business activity monitoring and analysis of process performance on big-data domains
the spread of true and false news online
data science, predictive analytics, and big data: a revolution that will transform supply chain design and management
towards an ethical and trustworthy social commerce community for brand value co-creation: a trust-commitment perspective
bringing privacy into the fold: considerations for the use of social media in crisis management
an attributional theory of achievement, motivation and emotion
an attributional theory of motivation and emotion
a market-oriented programming environment and its application to distributed multicommodity flow problems
hadoop: the definitive guide, chapter meet hadoop
attribution theory applied to helping behaviour towards people with intellectual disabilities who challenge
data mining: practical machine learning tools and techniques
business intelligence in risk management: some recent progresses
information security in big data: privacy and data mining
emergency knowledge management and social media technologies: a case study of the haitian earthquake
social media and culture in crisis communication: mcdonald's and kfc crises management in china

key: cord- - vrlrim authors: lefkowitz, e.j.; odom, m.r.; upton, c. title: virus databases date: - - journal: encyclopedia of virology doi: . /b - - . - sha: doc_id: cord_uid: vrlrim as tools and technologies for the analysis of biological organisms (including viruses) have improved, the amount of raw data generated by these technologies has increased exponentially. today's challenge, therefore, is to provide computational systems that support data storage, retrieval, display, and analysis in a manner that allows the average researcher to mine this information for knowledge pertinent to his or her work. every article in this encyclopedia contains knowledge that has been derived in part from the analysis of such large data sets, which in turn are directly dependent on the databases that are used to organize this information. fortunately, continual improvements in data-intensive biological technologies have been matched by the development of computational technologies, including those related to databases. this work forms the basis of many of the technologies that encompass the field of bioinformatics. this article provides an overview of database structure and how that structure supports the storage of biological information. the different types of data associated with the analysis of viruses are discussed, followed by a review of some of the various online databases that store general biological, as well as virus-specific, information. in , niu and fraenkel-conrat published the c-terminal amino acid sequence of the tobacco mosaic virus capsid protein. the complete -amino-acid sequence of this protein was published in . the first completely sequenced viral genome published was that of bacteriophage ms in (genbank accession number v ).
sanger used dna from bacteriophage phix ( j ) in developing the dideoxy sequencing method, while the first animal viral genome, sv ( j ), was sequenced using the maxam and gilbert method and published in . viruses therefore played a pivotal role in the development of modern-day sequencing methods, and viral sequence information (both protein and nucleotide) formed a substantial subset of the earliest available biological databases. in , margaret o. dayhoff published the first publicly available database of biological sequence information. this atlas of protein sequence and structure was available only in printed form and contained the sequences of approximately proteins. establishment of a database of nucleic acid sequences began in through the efforts of walter goad at the us department of energy's los alamos national laboratory (lanl) and separately at the european molecular biology laboratory (embl) in the early s. in , the lanl database received funding from the national institutes of health (nih) and was christened genbank. in december of , the los alamos sequence library contained sequences, of which were from eukaryotic viruses and were from bacteriophages. by its tenth release in , genbank contained sequences ( nucleotides), of which ( nucleotides) were viral. in august of , genbank (release ) contained approximately records, including viral sequences. the number of available sequences has increased exponentially as sequencing technology has improved. in addition, other high-throughput technologies have been developed in recent years, such as those for gene expression and proteomic studies. all of these technologies generate enormous new data sets at ever-increasing rates. the challenge, therefore, has been to provide computational systems that support the storage, retrieval, analysis, and display of this information so that the research scientist can take advantage of this wealth of resources to ask and answer questions relevant to his or her work. every article in this encyclopedia contains knowledge that has been derived in part from the analysis of large data sets. the ability to effectively and efficiently utilize these data sets is directly dependent on the databases that have been developed to support storage of this information. fortunately, the continual development and improvement of data-intensive biological technologies has been matched by the development and improvement of computational technologies. this work, which includes both the development and utilization of databases as well as tools for storage and analysis of biological information, forms a very important part of the bioinformatics field. this article provides an overview of database structure and how that structure supports the storage of biological information. the different types of data associated with the analysis of viruses are discussed, followed by a review of some of the various online databases that store general biological information as well as virus-specific information.
definition
a database is simply a collection of information, including the means to store, manipulate, retrieve, and share that information. for many of us, a lab notebook fulfilled our initial need for a 'database'. however, this information storage vehicle did not prove to be an ideal place to archive our data. backups were difficult, and retrieval more so. the advent of computers, especially the desktop computer, provided a new solution to the problem of data storage.
though initially this innovation took the form of spreadsheets and electronic notebooks, the subsequent development of both personal and large-scale database systems provided a much more robust solution to the problems of data storage, retrieval, and manipulation. the computer program supplying this functionality is called a 'database management system' (dbms). such systems provide at least four things: ( ) the necessary computer code to guide a user through the process of database design; ( ) a computer language that can be used to insert, manipulate, and query the data; ( ) tools that allow the data to be exported in a variety of formats for sharing and distribution; and ( ) the administrative functions necessary to ensure data integrity, security, and backup. however, regardless of the sophistication and diverse functions available in a typical modern dbms, it is still up to the user to provide the proper context for data storage. the database must be properly designed to ensure that it supports the structure of the data being stored and also supports the types of queries and manipulations necessary to fully understand and efficiently analyze the properties of the data. the development of a database begins with a description of the data to be stored, all of the parameters associated with the data, and frequently a diagram of the format that will be used. the format used to store the data is called the database schema. the schema provides a detailed picture of the internal format of the database that includes specific containers to store each individual piece of data. while databases can store data in any number of different formats, the design of the particular schema used for a project is dependent on the data and the needs and expertise of the individuals creating, maintaining, and using the database. as an example, we will explore some of the possible formats for storing viral sequence data and provide examples of the database schema that could be used for such a project. figure (a) provides an example of a genbank sequence record that is familiar to most biologists. these records are provided in a 'flat file' format in which all of the information associated with this particular sequence is provided in a human-readable form and in which all of the information is connected in some manner to the original sequence. in this format, the relationships between each piece of information and every other piece of information are only implicitly defined, that is, each line starts with a label that describes the information in the rest of the line, but it is up to the investigator reading the record to make all of the proper connections between each of the data fields (lines). the proper connections are not explicitly defined in this record. as trained scientists, we are able to read the record in figure (a) and discern that this particular amino acid sequence is derived from a strain of ebola virus that was studied by a group in germany, and that this sequence codes for a protein that functions as the virus rna polymerase. the format of this record was carefully designed to allow us, or a computer, to pull out each individual type of information. however as trained scientists, we already understand the proper connections between the different information fields in this file. the computer does not. therefore, to analyze the data using a computer, a custom software program must be written to provide access to the data. 
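as one illustration of such a program, a short sketch using the biopython seqio parser is shown below; the file name is a placeholder, and the fields printed mirror the ones a reader extracts from the flat file by eye.

```python
from Bio import SeqIO

# read a genbank flat file and pull out a few of its implicitly linked fields
for record in SeqIO.parse("ebola_polymerase.gb", "genbank"):
    print(record.id, "-", record.description)
    print("organism:", record.annotations.get("organism"))
    for feature in record.features:
        if feature.type == "CDS":
            print("product:", feature.qualifiers.get("product"))
            print("coding region:", feature.location)
```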
extensible markup language (xml) is another widely used format for storing database information. figure (b) shows an example of part of the xml record for the ebola virus polymerase protein. in this format, each data field can be many lines long; the start and end of a data record contained within a particular field are indicated by tags made of a label between two angle brackets. unlike the lines in the genbank record in figure (a), a field in an xml record can be placed inside of another, defining a structure and a relationship between them. for example, the tseq_orgname is placed inside of the tseq record to show that this organism name applies only to that sequence record. if the file contained multiple sequences, each tseq field would have its own tseq_orgname subfield, and the relationship between them would be very clear. this self-describing hierarchical structure makes xml very powerful for expressing many types of data that are hard to express in a single table, such as that used in a spreadsheet. however, in order to find any piece of information in the xml file, a user (with an appropriate search program) needs to traverse the whole file in order to pull out the particular items of data that are of interest. therefore, while an xml file may be an excellent format for defining and exchanging data, it is often not the best vehicle for efficiently storing and querying that data. that is still the realm of the relational database. 'relational database management systems' (rdbmss) are designed to do two things extremely well: ( ) store and update structured data with high integrity, and ( ) provide powerful tools to search, summarize, and analyze the data. the format used for storing the data is to divide it into several tables, each of which is equivalent to a single spreadsheet. the relationships between the data in the tables are then defined, and the rdbms ensures that all data follow the rules laid out by this design. this set of tables and relationships is called the schema. an example diagram of a relational database schema is provided in figure . this viral genome database (vgd) schema is an idealized version of a database used to store viral genome sequences, their associated gene sequences, and associated descriptive and analytical information. each box in figure represents a single object or concept, such as a genome, gene, or virus, about which we want to store data and is contained in a single table in the rdbms. the names listed in the box are the columns of that table, which hold the various types of data about the object. the 'gene' table therefore contains columns holding data such as the name of the gene, its coding strand, and a description of its function. (in the figure, lines and arrows display the relationships between fields as defined by the foreign key (fk) and primary key (pk) that connect two tables; each arrow points to the table containing the primary key. tables are color-coded according to the source of the information they contain: yellow, data obtained from the original genbank sequence record and the ictv eighth report; pink, data obtained from automated annotation or manual curation; blue, controlled vocabularies to ensure data consistency; green, administrative data.) the rdbms is able to enforce a series of rules for tables that are linked by defining relationships that ensure data integrity and accuracy. these relationships are defined by a foreign key in one table that links to corresponding data in another table defined by a primary key.
in this example, the rdbms can check that every gene in the 'gene' table refers to an existing genome in the 'genome' table, by ensuring that each of these tables contains a matching 'genome_id'. since any one genome can code for many genes, many genes may contain the same 'genome_id'. this defines what is called a one-to-many relationship between the 'genome' and 'gene' tables. all of these relationships are identified in figure by arrows connecting the tables. because viruses have evolved a variety of alternative coding strategies such as splicing and rna editing, it is necessary to design the database so that these processes can be formally described. the 'gene_segment' table specifies the genomic location of the nucleotides that code for each gene. if a gene is coded in the traditional manner -one orf, one protein -then that gene would have one record in the 'gene_segment' table. however, as described above, if a gene is translated from a spliced transcript, it would be represented in the 'gene_segment' table by two or more records, each of which specifies the location of a single exon. if an rna transcript is edited by stuttering of the polymerase at a particular run of nucleotides, resulting in the addition of one or more nontemplated nucleotides, then that gene will also have at least two records in the 'gene_segment' table. in this case, the second 'gene_segment' record may overlap the last base of the first record for that gene. in this manner, an extra, nontemplated base becomes part of the final gene transcript. other, more complex coding schemes can also be identified using this, or similar, database structures. the tables in figure are grouped according to the type of information they contain. though the database itself does not formally group tables in this manner, database schema diagrams are created to benefit database designers and users by enhancing their ability to understand the structure of the database. these diagrams make it easier to both populate the database with data and query the database for information. the core tables hold basic biological information about each viral strain and its genomic sequence (or sequences, if the virus has a segmented genome) as well as the genes coded for by each genome. the taxonomy tables provide the taxonomic classification of each virus. taxonomic designations are taken directly from the eighth report of the international committee on taxonomy of viruses (ictv). the 'gene properties' tables provide information related to the properties of each gene in the database. gene properties may be generated from computational analyses, such as calculations of molecular weight and isoelectric point (pi) that are derived from the amino acid sequence. gene properties may also be derived from a manual curation process in which an investigator might identify, for example, functional attributes of a sequence based on evidence provided by a literature search. assignment of 'gene ontology' terms (see below) is another example of information provided during manual curation. the blast tables store the results of similarity searches of every gene and genome in the vgd against a variety of sequence databases using the national center for biotechnology information (ncbi) blast program. examples of search databases might include the complete genbank nonredundant protein database and/or a database comprised of all the protein sequences in the vgd itself.
while most of us store our blast search results as files on our desktop computers, it is useful to store this information within the database to provide rapid access to similarity results for comparative purposes; to use these results to assign genes to orthologous families of related sequences; and to use these results in applications that analyze data in the database and, for example, display the results of an analysis between two or more types of viruses showing shared sets of common genes. finally, the 'admin' tables provide information on each new data release, an archive of old data records that have been subsequently updated, and a log detailing updates to the database schema itself. it is useful for database designers, managers, and data submitters to understand the types of information that each table contains and the source of that information. therefore, the database schema provided in figure is color-coded according to the type and source of information each table provides. yellow tables contain basic biological data obtained either directly from the genbank record or from other sources such as the ictv. pink tables contain data obtained either as the result of computational analyses (blast searches, calculations of molecular weight, functional motif similarities, etc.) or from manual curation. blue tables provide a controlled vocabulary that is used to populate fields in other tables. this ensures that a descriptive term used to describe some property of a virus has been approved for use by a human curator, is spelled correctly, and, when multiple terms or aliases exist for the same descriptor, that the same one is always chosen. while the use of a controlled vocabulary may appear trivial, in fact, misuse of terms, or even misspellings, can result in severe problems in computer-based databases. the computer does not know that the terms 'negative-sense rna virus' and 'negative-strand rna virus' may both be referring to the same type of virus. the provision and use of a controlled vocabulary increases the likelihood that these terms will be used properly, and ensures that the fields containing these terms will be easily comparable. for example, the 'genome_molecule' table contains the following permissible values for 'molecule_type': 'ambisense ssrna', 'dsrna', 'negative-sense ssrna', 'positive-sense ssrna', 'ssdna', and 'dsdna'. a particular viral genome must then have one of these values entered into the 'molecule_type' field of the 'genome' table, since this field is a foreign key to the 'molecule_type' primary key of the 'genome_molecule' table. entering 'double-stranded dna' would not be permissible.
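a small self-contained sketch of these ideas in sqlite is shown below: the core 'genome', 'gene', and 'gene_segment' tables with their one-to-many relationships, plus the 'genome_molecule' controlled vocabulary enforced through a foreign key. the column set is a simplified subset of the vgd schema and the inserted values are invented for illustration; note that a spliced gene is simply two rows in 'gene_segment', and that a term outside the vocabulary is rejected.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")          # sqlite enforces FKs only when enabled

con.executescript("""
CREATE TABLE genome_molecule (molecule_type TEXT PRIMARY KEY);
INSERT INTO genome_molecule VALUES
  ('ambisense ssRNA'), ('dsRNA'), ('negative-sense ssRNA'),
  ('positive-sense ssRNA'), ('ssDNA'), ('dsDNA');

CREATE TABLE genome (
    genome_id     INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    molecule_type TEXT NOT NULL REFERENCES genome_molecule(molecule_type)
);
CREATE TABLE gene (
    gene_id       INTEGER PRIMARY KEY,
    genome_id     INTEGER NOT NULL REFERENCES genome(genome_id),
    name          TEXT,
    coding_strand TEXT,
    description   TEXT
);
-- one row per exon or edited segment; a spliced gene has several rows
CREATE TABLE gene_segment (
    gene_segment_id INTEGER PRIMARY KEY,
    gene_id         INTEGER NOT NULL REFERENCES gene(gene_id),
    start_pos       INTEGER NOT NULL,
    stop_pos        INTEGER NOT NULL
);
""")

con.execute("INSERT INTO genome VALUES (1, 'example viral strain', 'dsDNA')")
con.execute("INSERT INTO gene VALUES (10, 1, 'pol', '+', 'rna polymerase')")
con.executemany("INSERT INTO gene_segment VALUES (?, ?, ?, ?)",
                [(100, 10, 1200, 1450),          # exon 1
                 (101, 10, 1500, 2300)])         # exon 2 of the same gene

try:                                             # a term outside the controlled vocabulary
    con.execute("INSERT INTO genome VALUES (2, 'bad record', 'double-stranded DNA')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
con.commit()
```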
the annotation process consists of a semiautomated analysis of the information content of the data and provides a variety of descriptive features that aid the process of assigning meaning to the data. the investigator is then able to use this analytical information to more closely inspect the data during a manual curation process that might support the reconstruction of gene expression or protein interaction pathways, or allow for the inference of functional attributes of each identified gene. all of this curated information can then be stored back in the database and associated with each particular gene. for each piece of information associated with a gene (or other biological entity) during the process of annotation and curation, it is always important to provide the evidence used to support each assignment. this evidence may be described in a standard operating procedure (sop) document which, much like an experimental protocol, details the annotation process and includes a description of the computer algorithms, programs, and analysis pipelines that were used to compile that information. each piece of information annotated by the use of this pipeline might then be coded, for example, 'iea: inferred from electronic annotation'. for information obtained from the literature during manual curation, the literature reference from which the information was obtained should always be provided along with a code that describes the source of the information. some of the possible evidence codes include 'ida: inferred from direct assay', 'igi: inferred from genetic interaction', 'imp: inferred from mutant phenotype', or 'iss: inferred from sequence or structural similarity'. these evidence codes are taken from a list provided by the gene ontology (go) consortium (see below) and as such represent a controlled vocabulary that any data curator can use and that will be understood by anyone familiar with the go database. this controlled evidence vocabulary is stored in the 'evidence' table, and each record in every one of the gene properties tables is assigned an evidence code noting the source of the annotation/curation data. as indicated above, the use of controlled vocabularies (ontologies) to describe the attributes of biological data is extremely important. it is only through the use of these controlled vocabularies that a consistent, documented approach can be taken during the annotation/curation process. and while there may be instances where creating your own ontology may be necessary, the use of already available, community-developed ontologies ensures that the ontological descriptions assigned to your database will be understood by anyone familiar with the public ontology. use of these public ontologies also ensures that they support comparative analyses with other available databases that also make use of the same ontological descriptions. the go consortium provides one of the most extensive and widely used controlled vocabularies available for biological systems. go describes biological systems in terms of their biological processes, cellular components, and molecular functions. the go effort is community-driven, and any scientist can participate in the development and refinement of the go vocabulary. currently, go contains a number of terms specific to viral processes, but these tend to be oriented toward particular viral families, and may not necessarily be the same terms used by investigators in other areas of virology. 
therefore it is important that work continues in the virus community to expand the availability and use of go terms relevant to all viruses. go is not intended to cover all things biological. therefore, other ontologies exist and are actively being developed to support the description of many other biological processes and entities. for example, go does not describe disease-related processes or mutants; it does not cover protein structure or protein interactions; and it does not cover evolutionary processes. a complementary effort is under way to better organize existing ontologies, and to provide tools and mechanisms to develop and catalog new ontologies. this work is being undertaken by the national center for biomedical ontologies, located at stanford university, with participants worldwide. the most comprehensive, well-designed database is useless if no method has been provided to access that database, or if access is difficult due to a poorly designed application. therefore, providing a search interface that meets the needs of intended users is critical to fully realizing the potential of any effort at developing a comprehensive database. access can be provided using a number of different methods ranging from direct query of the database using the relatively standardized 'structured query language' (sql), to customized applications designed to provide the ability to ask sophisticated questions regarding the data contained in the database and mine the data for meaningful patterns. web pages may be designed to provide simple-touse forms to access and query data stored in an rdbms. using the vgd schema as a data source, one example of an sql query might be to find the gene_id and name of all the proteins in the database that have a molecular weight between and , and also have at least one transmembrane region. many database providers also provide users with the ability to download copies of the database so that these users may analyze the data using their own set of analytical tools. when a user queries a database using any of the available access methods, the results of that query are generally provided in the form of a table where columns represent fields in the database and the rows represent the data from individual database records. tabular output can be easily imported into spreadsheet applications, sorted, manipulated, and reformatted for use in other applications. but while extremely flexible, tabular output is not always the best format to use to fully understand the underlying data and the biological implications. therefore, many applications that connect to databases provide a variety of visualization tools that display the data graphically, showing patterns in the data that may be difficult to discern using text-based output. an example of one such visual display is provided in figure and shows conservation of synteny between the genes of two different poxvirus species. the information used to generate this figure comes directly from the data provided in the vgd. every gene in the two viruses (in this case crocodilepox virus and molluscum contagiosum virus) has been compared to every other gene using the blast search program. the results of this search are stored in the blast tables of the vgd. in addition, the location of each gene within its respective genomic sequence is stored in the 'gene_segment' table. 
this information, once extracted from the database server, is initially text but it is then submitted to a program running on the server that reformats the data and creates a graph. in this manner, it is much easier to visualize the series of points formed along a diagonal when there are a series of similar genes with similar genomic locations present in each of the two viruses. these data sets may contain gene synteny patterns that display deletion, insertion, or recombination events during the course of viral evolution. these patterns can be difficult to detect with text-based tables, but are easy to discern using visual displays of the data. information provided to a user as the result of a database query may contain data derived from a combination of sources, and displayed using both visual and textual feedback. figure shows the web-based output of a query designed to display information related to a particular virus gene. the top of this web page displays the location of the gene on the genome visually showing surrounding genes on a partial map of the viral genome. basic gene information such as genome coordinates, gene name, and the nucleotide and amino acid sequence are also provided. this information was originally obtained from the original genbank record and then stored in the vgd database. data added as the result of an automated annotation pipeline are also displayed. this includes calculated values for molecular weight and pi; amino acid composition; functional motifs; blast similarity searches; and predicted protein structural properties such as transmembrane domains, coiled-coil regions, and signal sequences. finally, information obtained from a manual curation of the gene through an extensive literature search is also displayed. curated information includes a mini review of gene function; experimentally determined gene properties such as molecular weight, pi, and protein structure; alternative names and aliases used in the literature; assignment of ontological terms describing gene function; the availability of reagents such as antibodies and clones; and also, as available, information on the functional effects of mutations. all of the information to construct the web page for this gene is directly provided as the result of a single database query. (the tables storing the manually curated gene information are not shown in figure .) obviously, compiling the data and entering it into the database required a substantial amount of effort, both computationally and manually; however, the information is now much more available and useful to the research scientist. no discussion of databases would be complete without considering errors. as in any other scientific endeavor, the data we generate, the knowledge we derive from the data, and the inferences we make as a result of the analysis of the data are all subject to error. these errors can be introduced at many points in the analytical chain. the original data may be faulty: using sequence data as one example, nucleotides in a dna sequence may have been misread or miscalled, or someone may even have mistyped the sequence. the database may have been poorly designed; a field in a table designed to hold sequence information may have been set to hold only characters, whereas the sequences imported into that field may be longer than nucleotides. the sequences would have then been automatically truncated to characters, resulting in the loss of data. 
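the sql query example given above could be written roughly as follows against a vgd-like database; the database path, the 'gene_property' and 'transmembrane_region' table and column names, and the weight range are assumptions, since only the intent of the query is described in the text.

```python
import sqlite3

con = sqlite3.connect("vgd_sketch.db")           # placeholder path to a vgd-like database
rows = con.execute("""
    SELECT g.gene_id, g.name
    FROM   gene g
    JOIN   gene_property p ON p.gene_id = g.gene_id
    WHERE  p.molecular_weight BETWEEN ? AND ?
      AND  EXISTS (SELECT 1 FROM transmembrane_region t
                   WHERE t.gene_id = g.gene_id)
""", (20000, 40000)).fetchall()
print(rows)
```

and the synteny display described above amounts to a scatter plot of matched gene positions pulled from the blast and 'gene_segment' tables; the coordinates below are invented placeholders.

```python
import matplotlib.pyplot as plt

# (position in genome a, position in genome b) for each pair of blast-matched genes
hits = [(1200, 1500), (4800, 5100), (9050, 9500), (15300, 15900)]
xs, ys = zip(*hits)
plt.scatter(xs, ys, s=12)
plt.xlabel("gene position in crocodilepox virus genome (bp)")
plt.ylabel("gene position in molluscum contagiosum virus genome (bp)")
plt.title("conserved gene order appears as points along a diagonal")
plt.savefig("synteny_dotplot.png")
```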
the curator may have mistyped an enzyme commission (ec) number for an rna polymerase, or may have incorrectly assigned a genomic sequence to the wrong taxonomic classification. or even more insidious, the curator may have been using annotations provided by other groups that had justified their own annotations on the basis of matches to annotations provided by yet another group. such chains of evidence may extend far back, and the chance of propagating an early error increases with time. such error propagation can be widespread indeed, affecting the work of multiple sequencing centers and database creators and providers. this is especially true given the dependencies of genomic sequence annotations on previously published annotations. the possible sources of errors are numerous, and it is the responsibility of both the database provider and the user to be aware of, and on the lookout for, errors. the database provider can, with careful database and application design, apply error-checking routines to many aspects of the data storage and analysis pipeline. the code can check for truncated sequences, interrupted open reading frames, and nonsense data, as well as data annotations that do not match a provided controlled vocabulary. but the user should always approach any database or the output of any application with a little healthy skepticism. the user is the final arbiter of the accuracy of the information, and it is their responsibility to look out for inconsistent or erroneous results that may indicate either a random or systemic error at some point in the process of data collection and analysis. it is not feasible to provide a comprehensive and current list of all available databases that contain virus-related information or information of use to virus researchers. new databases appear on a regular basis; existing databases either disappear or become stagnant and outdated; or databases may change focus and domains of interest. any resource published in book format attempting to provide an up-to-date list would be out-of-date on the day of publication. even web-based lists of database resources quickly become out-of-date due to the rapidity with which available resources change, and the difficulty and extensive effort required to keep an online list current and inclusive. therefore, our approach in this article is to provide an overview of the types of data that are obtainable from available biological databases, and to list some of the more important database resources that have been available for extended periods of time and, importantly, remain current through a process of continual updating and refinement. we should also emphasize that the use of web-based search tools such as google, various web logs (blogs), and news groups, can provide some of the best means of locating existing and newly available web-based information sources. information contained in databases can be used to address a wide variety of problems. a sampling of the areas of research facilitated by virus databases includes . taxonomy and classification; . host range, distribution, and ecology; . evolutionary biology; . pathogenesis; . host-pathogen interaction; . epidemiology; . disease surveillance; . detection; . prevention; . prophylaxis; . diagnosis; and . treatment. addressing these problems involves mining the data in an appropriate database in order to detect patterns that allow certain associations, generalizations, cause-effect relationships, or structure-function relationships to be discerned. 
table provides a list of some of the more useful and stable database resources of possible interest to virus researchers. below, we expand on some of this information and provide a brief discussion concerning the sources and intended uses of these data sets. the two major, overarching collections of biological databases are at the ncbi, supported by the national library of medicine at the nih, and at embl's european bioinformatics institute. these large data repositories try to be all-inclusive, acting as the primary source of publicly available molecular biological data for the scientific community. in fact, most journals require that, prior to publication, investigators submit their original sequence data to one of these repositories. in addition to sequence data, ncbi and embl (along with many other data repositories) include a large variety of other data types, such as that obtained from gene-expression experiments and studies investigating biological structures. journals are also extending the requirement for data deposition to some of these other data types. note that while much of the data available from these repositories is raw data obtained directly as the result of experimental investigation in the laboratory, a variety of 'value-added' secondary databases are also available that take primary data records and manipulate or annotate them in some fashion in order to derive additional useful information. when an investigator is unsure about the existence or source of some biological data, the ncbi and embl websites should serve as the starting point for locating such information. the ncbi entrez search engine provides a powerful interface to access all information contained in the various ncbi databases, including all available sequence records. a search engine such as google might also be used if ncbi and embl fail to locate the desired information. of course pubmed, the repository of literature citations maintained at ncbi, also represents a major reference site for locating biological information. finally, the journal nucleic acids research (nar) publishes an annual 'database' issue and an annual 'web server' issue that are excellent references for finding new biological databases and websites. and while the most recent nar database or web server issue may contain articles on a variety of new and interesting databases and websites, be sure to also look at issues from previous years. older issues contain articles on many existing sites that may not necessarily be represented in the latest issue. there are several websites that serve to provide general virus-specific information and links of use to virus researchers. one of these is the ncbi viral genomes project, which provides an overview of all virus-related ncbi resources including taxonomy, sequence, and reference information. links to other sources of viral data are provided, as well as a number of analytical tools that have been developed to support viral taxonomic classification and sequence clustering. another useful site is the all the virology on the www website. this site provides numerous links to other virus-specific websites, databases, information, news, and analytical resources. it is updated on a regular basis and is therefore as current as any site of this scope can be. one of the strengths of storing information within a database is that information derived from different sources or different data sets can be compared so that important common and distinguishing features can be recognized.
such comparative analyses are greatly aided by having a rigorous classification scheme for the information being studied. the international union of microbiological societies has designated the international committee on taxonomy of viruses (ictv) as the official body that determines taxonomic classifications for viruses. through a series of subcommittees and associated study groups, scientists with expertise on each viral species participate in the establishment of new taxonomic groups, assignment of new isolates to existing or newly established taxonomic groups, and reassessment of existing assignments as additional research data become available. the ictv uses more than individual characteristics for classification, though sequence homology has gained increasing importance over the years as one of the major classifiers of taxonomic position. currently, as described in its eighth report, the ictv recognizes orders, families, genera, and species of viruses. the ictv officially classifies viral isolates only to the species level. divisions within species, such as clades, subgroups, strains, isolates, types, etc., are left to others. the ictv classifications are available in book form as well as from an online database. this database, the ictvdb, contains the complete taxonomic hierarchy, and assigns each known viral isolate to its appropriate place in that hierarchy. descriptive information on each viral species is also available. the ncbi also provides a web-based taxonomy browser for access to taxonomically specified sets of sequence records. ncbi's viral taxonomy is not completely congruent with that of ictv, but efforts have been under way to ensure congruency with the official ictv classification. the primary repositories of existing sequence information come from the three organizations that comprise the international nucleotide sequence database collaboration. these three sites are genbank (maintained at ncbi), embl, and the dna data bank of japan (ddbj). because all sequence information submitted to any one of these entities is shared with the others, a researcher need query only one of these sites to get the most up-to-date set of available sequences. genbank stores all publicly available nucleotide sequences for all organisms, as well as viruses. this includes whole-genome sequences as well as partial-genome and individual coding sequences. sequences are also available from largescale sequencing projects, such as those from shotgun sequencing of environmental samples (including viruses), and high-throughput low-and high-coverage genomic sequencing projects. ncbi provides separate database divisions for access to these sequence datasets. the sequence provided in each genbank record is the distillation of the raw data generated by (in most cases these days) automated sequencing machines. the trace files and base calls provided by the sequencers are then assembled into a collection of contiguous sequences (contigs) until the final sequence has been assembled. in recognition of the fact that there is useful information contained in these trace files and sequence assemblies (especially if one would like to look for possible sequencing errors or polymorphisms), ncbi now provides separate trace file and assembly archives for genbank sequences when the laboratory responsible for generating the sequence submits these files. currently, the only viruses represented in these archives are influenza a, chlorella, and a few bacteriophages. 
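because the ncbi taxonomy is not always congruent with the official ictv classification, it can be worth inspecting the lineage stored in each downloaded record before any comparative analysis. the sketch below assumes a local file of genbank-format records and an expected family name chosen for illustration, and simply flags records whose stored lineage does not mention that family.

```python
# Sketch: flag GenBank records whose stored taxonomic lineage does not
# mention the family we expect (the file name and family are examples).
from Bio import SeqIO

EXPECTED_FAMILY = "Filoviridae"          # illustrative family name

for record in SeqIO.parse("downloaded_records.gb", "genbank"):
    organism = record.annotations.get("organism", "unknown organism")
    lineage = record.annotations.get("taxonomy", [])   # list of ranks, as stored in the record
    if EXPECTED_FAMILY not in lineage:
        print(f"{record.id}: '{organism}' lineage lacks {EXPECTED_FAMILY}: {'; '.join(lineage)}")
```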
an important caveat in using data obtained from gen-bank or other sources is that no sequence data can be considered to be % accurate. furthermore, the annotation associated with the sequence, as provided in the genbank record, may also contain inaccuracies or be outof-date. genbank records are provided and maintained by the group originally submitting the sequence to genbank. genbank may review these records for obvious errors and formatting mistakes (such as the lack of an open reading frame where one is indicated), but given the large numbers of sequences being submitted, it is impossible to verify all of the information in these records. in addition, the submitter of a sequence essentially 'owns' that sequence record and is thus responsible for all updates and corrections. ncbi generally will not change any of the information in the genbank record unless the sequence submitter provides the changes. in some cases, sequence annotations will be updated and expanded, but many, if not most, records never change following their initial submission. (these facts emphasize the responsibility that submitters of sequence data have to ensure the accuracy of their original submission and to update their sequence data and annotations as necessary.) therefore, the user of the information has the responsibility to ensure, to the extent possible, its accuracy is sufficient to support any conclusions derived from that information. in recognition of these problems, ncbi established the reference sequence (refseq) database project, which attempts to provide reference sequences for genomes, genes, mrnas, proteins, and rna sequences that can be used, in ncbi's words, as ''a stable reference for gene characterization, mutation analysis, expression studies, and polymorphism discovery''. refseq records are manually curated by ncbi staff, and therefore should provide more current (and hopefully more accurate) sequence annotations to support the needs of the research community. for viruses, refseq provides a complete genomic sequence and annotation for one representative isolate of each viral species. ncbi solicits members of the research community to participate as advisors for each viral family represented in refseq, in an effort to ensure the accuracy of the refseq effort. in addition to the nucleotide sequence databases mentioned above, uniprot provides a general, all-inclusive protein sequence database that adds value through annotation and analysis of all the available protein sequences. uniprot represents a collaborative effort of three groups that previously maintained separate protein databases (pir, swissprot, and trembl). these groups, the national biomedical research foundation at georgetown university, the swiss institute of bioinformatics, and the european bioinformatics institute, formed a consortium in to merge each of their individual databases into one comprehensive database, uniprot. uniprot data can be queried by searching for similarity to a query sequence, or by identifying useful records based on the text annotations. sequences are also grouped into clusters based on sequence similarity. similarity of a query sequence to a particular cluster may be useful in assigning functional characteristics to sequences of unknown function. ncbi also provides a protein sequence database (with corresponding refseq records) consisting of all protein-coding sequences that have been annotated within all genbank nucleotide sequence records. 
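one simple way for a user to exercise such healthy skepticism is to re-derive part of an annotation and compare it with what the record claims. the sketch below re-translates each annotated coding sequence in a set of genbank records and reports disagreements with the submitter-supplied /translation qualifier; the input file name is a placeholder, and real records require more careful handling of partial features, alternative start codons and ambiguity codes.

```python
# Sketch: compare the submitter-provided protein translation of each CDS
# with a translation recomputed from the nucleotide sequence itself.
from Bio import SeqIO

for record in SeqIO.parse("downloaded_records.gb", "genbank"):
    for feature in record.features:
        if feature.type != "CDS" or "translation" not in feature.qualifiers:
            continue
        claimed = feature.qualifiers["translation"][0]
        table = int(feature.qualifiers.get("transl_table", ["1"])[0])
        # Recompute: extract the CDS nucleotides and translate up to the stop codon.
        recomputed = str(feature.extract(record.seq).translate(table=table, to_stop=True))
        if recomputed != claimed:
            print(f"{record.id} {feature.qualifiers.get('gene', ['?'])[0]}: "
                  f"annotated translation disagrees with recomputed one")
```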
the above-mentioned sequence databases are not limited to viral data, but rather store sequence information for all biological organisms. in many cases, access to nonviral sequences is necessary for comparative purposes, or to study virus-host interactions. but it is frequently easier to use virus-specific databases when they exist, to provide a more focused view of the data that may simplify many of the analyses of interest. table lists many of these virus-specific sites. sites of note include the nih-supported bioinformatics resource centers for biodefense and emerging and reemerging infectious diseases (brcs). the brcs concentrate on providing databases, annotations, and analytical resources on nih priority pathogens, a list that includes many viruses. in addition, the lanl has developed a variety of viral databases and analytical resources including databases focusing on hiv and influenza. for plant virologists, the descriptions of plant viruses (dpv) website contains a comprehensive database of sequence and other information on plant viruses. the three-dimensional structures for quite a few viral proteins and virion particles have been determined. these structures are available in the primary database for experimentally determined structures, the protein data bank (pdb). the pdb currently contains the structures for more than viral proteins and viral protein complexes out of total structures. several virus-specific structure databases also exist. these include the viperdb database of icosahedral viral capsid structures, which provides analytical and visualization tools for the study of viral capsid structures; virus world at the institute for molecular virology at the university of wisconsin, which contains a variety of structural images of viruses; and the big picture book of viruses, which provides a catalog of images of viruses, along with descriptive information. ultimately, the biology of viruses is determined by genomic sequence (with a little help from the host and the environment). nucleotide sequences may be structural, functional, regulatory, or protein coding. protein sequences may be structural, functional, and/or regulatory, as well. patterns specified in nucleotide or amino acid sequences can be identified and associated with many of these biological roles. both general and virus-specific databases exist that map these roles to specific sequence motifs. most also provide tools that allow investigators to search their own sequences for the presence of particular patterns or motifs characteristic of function. general databases include the ncbi conserved domain database; the pfam (protein family) database of multiple sequence alignments and hidden markov models; and the prosite database of protein families and domains. each of these databases and associated search algorithms differ in how they detect a particular search motif or define a particular protein family. it can therefore be useful to employ multiple databases and search methods when analyzing a new sequence (though in many cases they will each detect a similar set of putative functional motifs). interpro is a database of protein families, domains, and functional sites that combines many other existing motif databases. interpro provides a search tool, interproscan, which is able to utilize several different search algorithms dependent on the database to be searched. it allows users to choose which of the available databases and search tools to use when analyzing their own sequences of interest. 
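for quick, local checks of a single sequence, simple prosite-style patterns can also be scanned directly with a regular expression, as sketched below. the converter covers only the core of the prosite syntax (fixed residues, bracketed residue classes, exclusions in curly brackets, 'x' wildcards and repetition counts), and both the pattern and the toy protein are invented; it is not a substitute for the official scanning tools described here.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a (simplified) PROSITE pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        # Optional repetition count, e.g. x(2,4) or C(2).
        m = re.match(r"^(.+?)(?:\((\d+)(?:,(\d+))?\))?$", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            base = "."                        # any residue
        elif core.startswith("[") or core.startswith("{"):
            inner = core[1:-1]
            base = f"[{inner}]" if core[0] == "[" else f"[^{inner}]"
        else:
            base = core                       # a fixed residue
        if lo:
            base += f"{{{lo},{hi}}}" if hi else f"{{{lo}}}"
        regex.append(base)
    return "".join(regex)

# Example: a zinc-finger-like pattern (invented) scanned against a toy sequence.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
toy_protein = "MKRCAACESFLWAAGTKKAHQKAHLL"
for hit in re.finditer(prosite_to_regex(pattern), toy_protein):
    print(f"match '{hit.group()}' at position {hit.start() + 1}")
```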
a comprehensive report is provided that not only summarizes the results of the search, but also provides a comprehensive annotation derived from similarities to known functional domains. all of the above databases define functional attributes based on similarities in amino acid sequence. these amino acid similarities can be used to classify proteins into functional families. placing proteins into common functional families is also frequently performed by grouping the proteins into orthologous families based on the overall similarity of their amino acid sequence as determined by pairwise blast comparisons. two virus-specific databases of orthologous gene families are the viral clusters of orthologous groups database (vogs) at ncbi, and the viral orthologous clusters database (vocs) at the viral bioinformatics resource center and viral bioinformatics, canada. many other types of useful information, both general and virus-specific, have been collected into databases that are available to researchers. these include databases of gene-expression experiments (ncbi gene expression omnibus -geo); protein-protein interaction databases, such as the ncbi hiv protein-interaction database; the immune epitope database and analysis resource (iedb) at the la jolla institute for allergy and immunology; and databases and resources for defining and visualizing biological pathways, such as metabolic, regulatory, and signaling pathways. these pathway databases include reactome at the cold spring harbor laboratory, new york; biocyc at sri international, menlo park, california; and the kyoto encyclopedia of genes and genomes (kegg) at kyoto university in japan. as indicated above, the information contained in a database is useless unless there is some way to retrieve that information from the database. in addition, having access to all of the information in every existing database would be meaningless unless tools are available that allow one to process and understand the data contained within those databases. therefore, a discussion of virus databases would not be complete without at least a passing reference to the tools that are available for analysis. to populate a database such as the vgd with sequence and analytical information, and to utilize this information for subsequent analyses, requires a variety of analytical tools including programs for . sequence record reformatting, . database import and export, . sequence similarity comparison, . gene prediction and identification, . detection of functional motifs, . comparative analysis, . multiple sequence alignment, . phylogenetic inference, . structural prediction, and . visualization. sources for some of these tools have already been mentioned, and many other tools are available from the same websites that provide many of the databases listed in table . the goal of all of these sites that make available data and analytical tools is to provide -or enable the discovery of -knowledge, rather than simply providing access to data. only in this manner can the ultimate goal of biological understanding be fully realized. see also: evolution of viruses; phylogeny of viruses; taxonomy, classification and nomenclature of viruses; virus classification by pairwise sequence comparison (pasc). gene ontology: tool for the unification of biology. the gene ontology consortium national center for biotechnology information viral genomes project virus taxonomy: classification and nomenclature of viruses. 
eighth report of the international committee on taxonomy of viruses the molecular biology database collection: update reactome: a knowledgebase of biological pathways virus bioinformatics: databases and recent applications immunoinformatics comes of age hiv sequence databases hepatitis c databases, principles and utility to researchers poxvirus bioinformatics resource center: a comprehensive poxviridae informational and analytical resource biological weapons defense: infectious diseases and counterbioterrorism exploring icosahedral virus structures with viper national center for biomedical ontology: advancing biomedicine through structured organization of scientific knowledge los alamos hepatitis c immunology database aidsinfo key: cord- -qu a q authors: fonseca, david; garcía-peñalvo, francisco josé; camba, jorge d. title: new methods and technologies for enhancing usability and accessibility of educational data date: - - journal: univers access inf soc doi: . /s - - - sha: doc_id: cord_uid: qu a q nan recent advances in information and communication technologies (icts) have fostered the development of new methods and tools for exploring the increasingly large amounts of data that come from pedagogical domains [ ] [ ] [ ] [ ] [ ] . these data have the potential to transform education into a personalized experience [ , ] that meets the needs of each individual student [ ] . educational data research is becoming highly relevant in massive online courses [ ] , especially moocs (massive open online courses) [ ] [ ] [ ] [ ] and spocs (small private online courses) [ ] [ ] [ ] . educational data are also the basis for learning analytics [ ] [ ] [ ] , with an increasing focus on the way educational data are presented [ ] [ ] [ ] , how users interact with the data [ ] [ ] [ ] [ ] , and data privacy and security [ ] [ ] [ ] [ ] . there are many types of data that can support student's learning [ ] , but the type and nature of the data, how they can be accessed, and who can access them, vary significantly. whether educational data are collected from collaborative learning environments [ ] [ ] [ ] , course management systems [ , ] , gamified training applications [ , ] , or administrative systems from schools and universities [ ] [ ] [ ] , valuable properties, patterns, and insights often emerge. when combined with other factors such as timing and context, these factors play an important role in understanding how students learn [ ] , the settings in which they learn [ ] , and the effectiveness of the educational approaches [ ] . extracting information from data to ultimately turn it into knowledge [ , ] can contribute to draw a more comprehensive picture of student's learning, which can empower students, parents, and educators as well as education stakeholders and policymakers [ ] . educational data usability and accessibility is even more relevant in the context of the global pandemic due to the sars-cov- virus, which causes covid- disease. this situation is having an unprecedented impact on education. according to unesco [ ] , in the first months of , the pandemic has affected . % of the total number of students enrolled worldwide: over . billion people have been unable to receive face-to-face instruction because of the closure of schools and universities [ ] . 
the consequences are more severe in emerging countries [ , ] and to families affected by poverty and risk exclusion [ ] , presenting digital inequalities [ ] , and causing exclusion and inequality situations in vulnerable groups, ethnic minorities, and people with disabilities [ ] . significant challenges have been reported in the online transformation of educational activities [ , ] , particularly assessment processes [ ] . consequently, it is vital to improve access to educational technologies and reduce gaps in use and literacy [ ] . a multi-disciplinary approach is required to deploy technological ecosystems [ , ] that favor blended or online training, teacher and student training for the efficient use of educational technologies [ ] , and policies for both government and academic leaders to define strategies and manage uncertain scenarios [ ] . in the context of educational data access, it is critical to ensure transparency, ethics, and individuals' rights. this uais special issue builds on the work started in a number of previous special issues [ ] [ ] [ ] [ ] and two international events: • the invited session entitled "emerging interactive systems for education", in the thematic area "learning and this special issue focuses on how to improve universal access to educational data, with emphasis on (a) new technologies and associated data in educational contexts: artificial intelligence systems [ ] , robotics [ ] [ ] [ ] , augmented [ ] [ ] [ ] and virtual reality (vr) [ ] [ ] [ ] [ ] [ ] , and educational data integration and management [ ] ; (b) the role of data in the digital transformation and future of higher education: personal learning environments (ple) [ , ] , mobile ple [ , ] , stealth assessment [ ] , technology-supported collaboration and teamwork in educational environments [ ] , and student's engagement and interactions [ , ] ; (c) user and case studies on icts in education [ , ] ; (d) educational data in serious games and gamification: gamification design [ ] [ ] [ ] [ ] , serious game mechanics for education [ , ] , ubiquitous/pervasive gaming [ ] , and game-based learning and teaching programming [ , ] ; and (e) educational data visualization and data mining [ ] : learning analytics [ ] , knowledge discovery [ ] , user experience [ , ] , social impact [ ] , good practices [ ] , and accessibility [ , ] . the special issue comprises the following accepted papers. collaborative learning systems are a niche for analyzing educational data. for example, virtual reality and d modeling applications can leverage the integration of collaborative approaches in medicine [ ] , architecture [ ] , or urbanism [ ] . huang et al. developed a study devoted to construct a d modeling practice field based on virtual reality technology, in which students can learn d modeling through a new vr design collaboration framework and complete design goals. the proposed design collaboration model includes the concept of a learning community. the results of this study indicate that the system usability of the vr modeling practice field is superior to that of the traditional modeling learning field and learners are more creative and motivated. the authors emphasize that through the new design collaboration model, students can effectively learn d modeling in vr. conde et al. explore the assessment of instant messaging tools for the acquisition of teamwork competence throughout a case study about the use of the instant messaging app whatsapp. 
from the results, the authors conclude that students prefer instant messaging tools in teamwork activities over other interaction tools such as forums; and that the use of those tools has a positive impact on students' grades. in an effort to demonstrate the potential of virtual worlds in education [ ] , especially in distance education [ ] , krassmann et al. introduce a framework to prepare the implementation of virtual worlds. their approach emphasizes requirements that distance education students need to meet in order to have a successful learning experience. the authors present an exploratory study and propose eight guidelines to harness the potential of the technology of virtual worlds for distance education. pervasive games [ ] enhance the gaming experience and level of engagement by including real world aspects into the game space. arango-lópez et al. propose geopgd, a methodology that integrates the design of geolocated narrative as the core of the game experience. this methodology guides designers and developers through the different stages of building a pervasive game by providing tools for defining the narrative components, places, and interactions between the user and the game. gallego-durán et al. tackle the challenges of learning programming as a universal ability [ , ] . the authors propose a radically different perspective to this issue, teaching students with a bottom-up approach, starting from machine code and assembly programming. their results suggest that such a small intervention could have a limited positive influence on the students' programming skills. pazmiño et al. did a systematic literature review [ ] to answer the question: what is the baseline of scientific documents on learning analytics in ecuador? the selected documents were analyzed using statistical implicative analysis after removing duplicates and applying inclusion, exclusion, and quality criteria. the outcome of this research has allowed building up a baseline of scientific knowledge about learning analytics in ecuador. user experience analysis in the educational realm is directly linked to the levels of user acceptance and satisfaction [ ] of the new wave of educational technological ecosystems and the personalization of learning [ ] . barneche-naya and hernández-ibañez describe the results of a case study intended to compare three different user movement paradigms (metaphoric, symbolic and natural) designed to control the visit to virtual environments for a nui-based museum installation. the study evaluates the performance of each movement scheme with respect to the navigation of the environment, the degree of intuitiveness perceived by the users, and the overall user experience. the results show that the natural movement scheme stands out as the most adequate for the contemplation of the virtual environment and the most balanced at a general level for the three variables considered. the symbolic scheme proved to be the most efficient. the natural movement and symbolic schemes appear to be the most appropriate to navigate digital environments such as museum installations. in another paper related to user experience, zardari et al. introduce an e-learning portal for higher education that was assessed from a user experience standpoint using an eye-tracking system. the results emphasize students' satisfaction with the learning portal. finally, toborda et al. analyze metrics to measure effectiveness and engagement levels in pervasive gaming experiences. 
regarding the analytics of accessibility, martins et al. present a study that assesses accessibility in mobile applications, which may be applicable to education and tourism [ ] . fourteen mobile applications were analyzed using a manual and automatic methodology through an evaluation model based on quantitative and qualitative requirements, as well as the use of features such as voiceover and talkback. the results show a high number of errors in most quantitative requirements as well as non-compliance with most qualitative requirements. also, in the context of accessibility, romero yesa et al. share a good practice in designing accessible educational resources [ ] . the authors developed a new virtual teaching unit for supporting classroom teaching based on usability and accessibility criteria. the goal is to help increase teaching quality by improving syllabus design. a large-scale dataset for educational question generation time-dependent performance prediction system for early insight in learning trends exploration of youth's digital competencies: a dataset in the educational context of vietnam presenting the regression tree method and its application in a large-scale educational dataset dashboard for large educational dataset understanding the implementation of personalized learning: a research synthesis appraising research on personalized learning: definitions, theoretical alignment, advancements, and future directions towards elearning university la sociedad del conocimiento y sus implicaciones en la formación universitaria docente an adaptive hybrid mooc model: disrupting the mooc concept in higher education participantes heterogéneos en moocs y sus necesidades de aprendizaje adaptativo innovation in the instructional design of open mass courses (moocs) to develop entrepreneurship competencies in energy sustainability massive open online courses in the initial training of social science teachers: experiences, methodological conceptions, and technological use for sustainable development a recommender system for videos suggestion in a spoc: a proposed personalized learning method the construction of teaching interaction platform and teaching practice based on spoc mode exploration and practice of spoc mixed teaching mode in data structure course learning analytics as a breakthrough in educational improvement sein-echaluce ml can we apply learning analytics tools in challenge based learning contexts? learning analytics the emergence of a discipline learning analytics dashboard applications representing data visualization goals and tasks through meta-modeling to tailor information dashboards connecting domain-specific features to source code: towards the automatization of dashboard generation tap into visual analysis of customization of grouping of activities in elearning mejora de los procesos de evaluación mediante analítica visual del aprendizaje visual learning analytics of educational data: a systematic literature review and research agenda multimodal data to design visual learning analytics for understanding regulation of learning protected users: a moodle plugin to improve confidentiality and privacy support through user aliases privacidad, seguridad y legalidad en soluciones educativas basadas en blockchain: una revisión sistemática de la literatura exploring the relationship of ethics and privacy in learning analytics and design: implications for the field of educational technology whose data? which rights? whose power? 
a policy discourse analysis of student privacy policy documents educational data mining: a review of the state of the art beyond grades: improving college students' social-cognitive outcomes in stem through a collaborative learning environment learning analytics in collaborative learning supported by slack: from the perspective of engagement eclectic as a learning ecosystem for higher education disruption the current state of analytics: implications for learning management system (lms) use in writing pedagogy valoración y evaluación de los aprendizajes basados en juegos (gbl) en contextos e-learning engagement in the course of programming in higher education through the use of gamification educational data mining and learning analytics: an updated survey management process of big data in high education as sociotechnical system automatic tutoring system to support cross-disciplinary training in big data how students learn: history, mathematics, and science in the classroom improving the information society skills: is knowledge accessible for all? smart teachers, successful students? a systematic review of the literature on teachers' cognitive abilities and teacher effectiveness knowledge spirals in higher education teaching innovation epistemological and ontological spirals: from individual experience in educational innovation to the organisational knowledge in the university sector systematic review of how engineering schools around the world are deploying the agenda unesco covid- impact on education school closure and management practices during coronavirus outbreaks including covid- : a rapid systematic review estudio exploratorio en iberoamérica sobre procesos de enseñanza-aprendizaje y propuesta de evaluación en tiempos de pandemia impact of the pandemic on higher education in emerging countries: emerging opportunities. ssrn, challenges and research agenda covid- , school closures, and child poverty: a social crisis in the making covid- and digital inequalities: reciprocal impacts and mitigation strategies. comput simulating the potential impacts of covid- school closures on schooling and learning outcomes: a set of global estimates the difference between emergency remote teaching and online learning education and the covid- pandemic la evaluación online en la educación superior en tiempos de la covid- la covid- : ¿enzima de la transformación digital de la docencia o reflejo de una crisis metodológica y competencial en la educación superior? a metamodel proposal for developing learning ecosystems validation of the learning ecosystem metamodel using transformation rules digital competence of early childhood education teachers: attitude, knowledge and use of ict modelo de referencia para la enseñanza no presencial en universidades presenciales user experience and access using augmented and multimedia technologies: special issue of uxelate ( ) workshop and hci international conference ( ) special sessions information society skills: is knowledge accessible for all? part i information society skills: is knowledge accessible for all? part ii interactive and collaborative technological ecosystems for improving academic motivation and engagement learning and collaboration technologies. designing learning experiences. th international conference, lct , held as part of the st hci international conference, hcii learning and collaboration technologies. designing learning experiences. 
th international conference, lct , held as part of the st hci international conference, hcii teem' proceedings of the seventh international conference on technological ecosystems for enhancing multiculturality visualizing artificial intelligence used in education over two decades garcía-peñalvo fj robosteam project systematic mapping: challenge based learning and robotics robotics from stem areas in primary school: a systematic review social steam learning at an early age with robotic platforms: a case study in four schools in spain augmented reality and pedestrian navigation through its implementation in m-learning and e-learning: evaluation of an educational program in chile from reality to augmented reality: rapid strategies for developing marker-based ar content using image capturing and authoring tools relationship between student profile, tool use, participation, and academic performance with the use of augmented reality technology for visualized architecture models nextmed: automatic imaging segmentation, d reconstruction, and d model visualization platform using augmented and virtual reality virtual reality simulation-based learning technology in education. th international conference, lct . held as part of hci international methodologies of learning served by virtual reality: a case study in urban interventions virtual interactive innovations applied for digital urban transformations. mixed approach future gener improvement of an online education model with the integration of machine learning and data analysis in an modeling the personal adult learner: the concept of ple re-interpreted personal learning environments and online classrooms: an experience with university students entornos personales de aprendizaje móvil: una revisión sistemática de la literatura use of mobile technologies in personal learning environments of intercultural contexts: individual and group tasks deepstealth: game-based learning stealth assessment with deep neural networks teamwork assessment in the educational web of data: a learning analytics approach towards iso . 
telemat mapping research in student engagement and educational technology in higher education: a systematic evidence map facilitating student engagement through the flipped learning approach in k- : a systematic review technological ecosystems in citizen science: a framework to involve children and young people web-based system for adaptable rubrics: case study on cad assessment review of gamification design frameworks opening the black box of gameful experience: implications for gamification process design a immersive visualization technologies to facilitate multidisciplinary design education methodology i'm in applied to workshop: successful educational practice for consultants in user experience with gamification fields serious gaming for climate adaptation assessing the potential and challenges of a digital serious game for urban climate adaptation designing productively negative experiences with serious game mechanics: qualitative analysis of game-play and game design in a randomized trial exploring features of the pervasive game pokémon go that enable behavior change: qualitative study state of the art in the teaching of computational thinking and programming in childhood education influence of problem-based learning games on effective computer programming learning in higher education urban data and urban design: a data mining approach to architecture education implementing learning analytics for learning impact: taking tools to task a knowledge discovery education framework targeting the effective budget use and opinion explorations in designing specific high cost product analysis of user satisfaction with online education platforms in china during the covid- pandemic analyzing the usability of the wyred platform with undergraduate students to improve its features peer social acceptance and academic achievement: a meta-analytic study semantic spiral timelines used as support for e-learning analysis of accessibility in computing education research a social-mediabased approach to assessing the effectiveness of equitable housing policy in mitigating education accessibility induced social inequalities in shanghai. china virtual reality educational tool for human anatomy motivation and academic improvement using augmented reality for d architectural visualization evaluation of an interactive educational system in urban knowledge acquisition and representation based on students' profiles retrieving objective indicators from student logs in virtual worlds educación virtual para todos: una revisión sistemática pervasive games: theory and design exploring the computational thinking effects in pre-university education computational thinking unplugged guidelines for performing systematic literature reviews in software engineering. version . school of arkaevision vr game: user experience research between real and virtual paestum measuring user experience on personalized online training system to support online learning towards a social and context-aware mobile recommendation system for tourism bridging the accessibility gap in open educational resources the guest editors would like to thank the universal access in the information society journal editors-on-chief, dr. constantine stephanidis and dr. margherita antona, for their confidence in our responsibility to lead this special issue. we want to express our gratitude to all the researchers that have made this special issue a reality. 
this work was partially funded by the spanish government ministry of economy and competitiveness throughout the defines project (ref. tin - -r). key: cord- - t bn authors: cori, anne; donnelly, christl a.; dorigatti, ilaria; ferguson, neil m.; fraser, christophe; garske, tini; jombart, thibaut; nedjati-gilani, gemma; nouvellet, pierre; riley, steven; van kerkhove, maria d.; mills, harriet l.; blake, isobel m. title: key data for outbreak evaluation: building on the ebola experience date: - - journal: philos trans r soc lond b biol sci doi: . /rstb. . sha: doc_id: cord_uid: t bn following the detection of an infectious disease outbreak, rapid epidemiological assessment is critical for guiding an effective public health response. to understand the transmission dynamics and potential impact of an outbreak, several types of data are necessary. here we build on experience gained in the west african ebola epidemic and prior emerging infectious disease outbreaks to set out a checklist of data needed to: ( ) quantify severity and transmissibility; ( ) characterize heterogeneities in transmission and their determinants; and ( ) assess the effectiveness of different interventions. we differentiate data needs into individual-level data (e.g. a detailed list of reported cases), exposure data (e.g. identifying where/how cases may have been infected) and population-level data (e.g. size/demographics of the population(s) affected and when/where interventions were implemented). a remarkable amount of individual-level and exposure data was collected during the west african ebola epidemic, which allowed the assessment of ( ) and ( ). however, gaps in population-level data (particularly around which interventions were applied when and where) posed challenges to the assessment of ( ). here we highlight recurrent data issues, give practical suggestions for addressing these issues and discuss priorities for improvements in data collection in future outbreaks. this article is part of the themed issue ‘the – west african ebola epidemic: data, decision-making and disease control’. detection of a new infectious disease outbreak requires rapid assessment of both the clinical severity and the pattern of transmission to plan appropriate response activities. following the subsequent roll-out of interventions, continued evaluation is necessary to detect reductions in transmission and assess the relative impact of different interventions. surveillance data are crucial for informing these analyses, and directly determine the extent to which they can be performed. despite the unprecedented scale of the - west african ebola epidemic [ , ] , detailed data were collected during the outbreak, which proved invaluable in guiding the response [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . multiple studies have already considered the lessons to be learned from the ebola experience with respect to coordinating international responses to health crises, strengthening local health systems and improving clinical care and surveillance tools [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] . here we discuss what can be learned to improve real-time epidemiological assessment in future outbreaks via improved data collection and analyses, building on similar contributions after other epidemics [ - ] . we focus on efforts to reduce and interrupt transmission. first, we outline analyses that are essential to inform response activities during different stages of an epidemic. 
second, we detail the various types of data needed to perform these analyses, with examples from the ebola experience. third, we summarize the successes and challenges of data collection experienced during this outbreak, and the implications this had for answering key public health questions. fourth, we suggest improvements that could be implemented in future outbreaks, again drawing from the ebola experience. finally, we discuss issues related to availability of data and analyses (box ). key public health questions for any emerging infectious disease outbreak are the following: (i) what is the likely public health impact of the outbreak? (ii) how feasible is controlling the outbreak and what interventions would be appropriate? (iii) are current interventions effective and could they be improved? here, we describe statistical and mathematical analyses that facilitate epidemic response planning, focusing on these questions (figure ). in this section, we take a general view, as these questions are recurring during most, if not all, outbreaks. we provide examples from the west african ebola epidemic in subsequent sections. (a) what is the likely public health impact of the outbreak? a key issue in the early phase of an epidemic is to determine the potential impact of the outbreak in terms of clinical severity and the likely total number of cases over different time periods. the severity of a pathogen is often characterized by the case fatality ratio (cfr), the proportion of cases who die as a result of their infection. estimating the cfr during an outbreak can be challenging due to inconsistent case definitions, incomplete case reporting and right-censoring of data [ ] [ ] [ ] . in particular, it is critical to know the proportion of cases for whom clinical outcome is unknown or has not been recorded, which is typically easier to assess using detailed case data rather than aggregated case counts [ ] . the cfr may differ across populations (e.g. age, space, treatment); quantifying box . recommendations for collecting and using data for outbreak response. data need to be collected at each of the three levels: individual level: detailed information about cases exposure level: information about exposure events that may have led to transmission population level: characteristics of the population(s) in which the outbreak is spreading and the interventions carried out in the population(s) although some data will be context-specific, others, in particular at the population level, will be useful for a wide range of epidemics, and should be routinely and centrally collected in preparation for the next outbreak. . optimizing data quality having a general framework ready in advance of the next outbreak will facilitate: quality and timeliness of data centralization and harmonization of data at all three levels preparedness to deploy training material, personnel and logistical resources there is substantial room for improvement in the quality of data collected in an epidemic context, particularly for emerging pathogens. . 
ensuring adequate data availability data need to be shared ensuring a balance between the following considerations: ethical: protecting anonymity while ensuring data are sufficiently detailed to be useful scientific: wide data access is desirable to promote independent analyses; however, mechanisms must be in place for systematic comparison of results practical: deciding on a data format for sharing, on who will be responsible for data cleaning and on how various roles will be acknowledged discussions and decisions relating to data sharing remain ongoing and guidelines should be agreed on in advance of the next outbreak. . analysing data and reporting results in an appropriate manner the scientific community should agree on guidelines for epidemic modelling and analyses (such as those in place for reporting experimental studies), such as: assumptions underlying analyses should be clearly stated sensitivity to these assumptions should be tested uncertainty in results should be adequately explored and reported these are particularly relevant for reporting epidemic projections. these heterogeneities can help target resources appropriately and compare different care regimens. for less severe emerging pathogens, the case definition typically only encompasses a small fraction of all infected individuals, and hence the infection fatality rate (i.e. the proportion of infected individuals who die, rather than the proportion of cases who die-as per the case definition, which may not be equivalent to infection) may be a more useful measure of severity [ ] . (ii) short-and long-term incidence projections short-term impact of an outbreak can be assessed by predicting the number of cases that will arise in the next few days or weeks. this is particularly relevant for evaluation of immediate health-care capacity needs. projections of future incidence and estimates of the doubling time (the time taken for the incidence to double) can be obtained by extrapolating the early time series of reported cases either obtained from aggregated surveillance data or calculated from individual records [ ] . these projections typically rely on the assumption that incidence initially grows exponentially [ ] . they are subject to uncertainty, which increases the further one looks into the future. quantifying and appropriately reporting such uncertainty and underlying assumptions are crucial [ , , , ] . overstating uncertainty can lead to inappropriately pessimistic projections, which may in turn be detrimental to the control of the outbreak [ ] . on the other hand, understating uncertainty prevents policymakers from making decisions based on the whole spectrum of possible impacts. some studies have already discussed how to find the balance between these two extremes [ - ] . here, we propose two simple rules of thumb for projecting case numbers. first, projections should not be made for more than two or three generations of cases into the future. second, central projections should be shown together with lower and upper bounds. in the future, the modelling community should agree on guidelines for reporting epidemic projections, as are already in place for reporting experimental studies [ ] . a number of factors can lead to incidence not growing exponentially. in particular, this happens once herd immunity accumulates, if population behaviour changes or as a result of the implementation of interventions. 
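to make these rules of thumb concrete, the sketch below fits a log-linear model to a short, invented series of early case counts, reports the growth rate and doubling time with a crude interval, and projects only a few weeks ahead with lower and upper bounds. it assumes exponential growth and complete reporting; a real-time analysis would use a likelihood suited to count data (e.g. poisson or negative binomial) and account for reporting delays.

```python
# Sketch: exponential-growth fit to early case counts, doubling time and
# a short-horizon projection with simple uncertainty bounds.
import numpy as np
from scipy import stats

daily_cases = np.array([2, 1, 3, 4, 3, 6, 5, 9, 11, 10, 15, 18])  # invented early incidence
days = np.arange(len(daily_cases))

# Log-linear fit: log(I_t) ~ a + r * t  (assumes exponential growth, no zero counts).
fit = stats.linregress(days, np.log(daily_cases))
growth_rate = fit.slope
ci = 1.96 * fit.stderr                              # crude 95% interval on the growth rate

doubling_time = np.log(2) / growth_rate
print(f"growth rate r = {growth_rate:.3f}/day, doubling time = {doubling_time:.1f} days")
print(f"doubling time range: {np.log(2)/(growth_rate+ci):.1f} to {np.log(2)/(growth_rate-ci):.1f} days")

# Project about three weeks ahead, i.e. no more than a couple of generations
# for a disease with a serial interval of one to two weeks.
horizon = np.arange(len(daily_cases), len(daily_cases) + 21)
central = np.exp(fit.intercept + growth_rate * horizon)
low = np.exp(fit.intercept + (growth_rate - ci) * horizon)
high = np.exp(fit.intercept + (growth_rate + ci) * horizon)
print("projected cases on final day: "
      f"{central[-1]:.0f} (range {low[-1]:.0f} to {high[-1]:.0f})")
```

even over this short horizon the bounds widen quickly, which illustrates why central projections should always be reported together with lower and upper bounds.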
dynamic transmission models, which account for saturation effects, can be used to assess the long-term impact of the outbreak such as predicting the timing and magnitude of the epidemic peak or the attack rate (final proportion of population infected) [ , ] . however, these models are hard to parametrize as they require information on population size and immunity, interventions (if any) and potential behavioural changes over time, all of which may be subject to uncertainty [ , ] . we discuss these issues in more detail in § . projecting longer term is likely to be associated with a large degree of uncertainty and these projections may be more useful for evaluating qualitative trends and evaluating intervention choices rather than predicting exact case numbers. (b) how feasible is controlling the outbreak and what interventions would be appropriate? interventions to reduce transmission can include community mobilization, quarantine, isolation, treatment or vaccination. the potential success of these interventions is determined by general characteristics of the disease such as overall transmissibility and how this varies across populations [ ] . furthermore, certain types of interventions may be more or less appropriate depending on the natural history of the disease and the context of the epidemic. the transmissibility of a pathogen determines the intensity of interventions needed to achieve epidemic control [ ] . the parameter most often used to quantify transmissibility is the reproduction number (r), the mean number of secondary cases infected by a single individual. this parameter has an intuitive interpretation: if r > 1, then the epidemic is likely to grow, whereas if r < 1, the epidemic will decline [ , ] . the final attack rate (proportion of the population infected) of an epidemic also depends on the value of r at the start of an epidemic (termed r0 if the population has no immunity). r can be estimated from the incidence of reported cases, given knowledge of the serial interval distribution (i.e. distribution of time between symptom onset in a case and symptom onset in his/her infector; see § b(iii)) [ ] [ ] [ ] [ ] . heterogeneity in the number of secondary cases generated by each infected individual affects epidemic establishment and the ease of control. greater heterogeneity reduces the chance of an outbreak emerging from a single case [ ] . however, this heterogeneity can make an established outbreak hard to control using mass interventions, as a single uncontrolled case can generate a large number of secondary cases [ ] . conversely, heterogeneity in transmission may provide opportunities for targeting interventions if the individuals who contribute more to transmission (because of environmental, behavioural and/or biological factors [ ] [ ] [ ] ) share socio-demographic or geographical characteristics that can be identified [ , ] . reconstruction of transmission trees (who infects whom) can provide an understanding of who contributes more to transmission. this can be done using detailed case investigations and/or using genetic data [ ] [ ] [ ] [ ] . environmental, behavioural and biological factors may also lead to groups of individuals being disproportionately more likely to acquire infection (e.g. children during influenza outbreaks [ , ] or health-care workers (hcws) during ebola outbreaks [ , ] ). to identify whether such groups exist and target them appropriately, the proportion infected in each group must be estimated.
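the estimation of r from an incidence time series and a serial interval distribution can be sketched with the renewal equation, in which the expected incidence at each time step is r times past incidence weighted by the serial interval distribution. the implementation below produces only naive point estimates on invented weekly counts with an assumed gamma serial interval; published estimators based on this idea add a bayesian treatment, smoothing windows and handling of imported cases.

```python
# Sketch: naive renewal-equation estimate R_t = I_t / sum_s w_s * I_{t-s},
# where w is a discretized serial interval distribution (here in weekly steps).
import numpy as np
from scipy import stats

weekly_incidence = np.array([3, 5, 6, 10, 14, 19, 26, 32, 41, 50], dtype=float)  # invented counts

# Discretize an assumed gamma serial interval with mean 2.1 and sd 1.3 weeks (illustrative values).
mean_si, sd_si = 2.1, 1.3
shape, scale = (mean_si / sd_si) ** 2, sd_si ** 2 / mean_si
steps = np.arange(1, 11)
w = stats.gamma.cdf(steps, a=shape, scale=scale) - stats.gamma.cdf(steps - 1, a=shape, scale=scale)
w /= w.sum()

# Skip the first few weeks, when too little past incidence has accumulated.
for t in range(3, len(weekly_incidence)):
    s = np.arange(1, min(t, len(w)) + 1)
    total_infectiousness = np.sum(w[s - 1] * weekly_incidence[t - s])
    r_t = weekly_incidence[t] / total_infectiousness
    print(f"week {t}: R ~ {r_t:.2f}")
```

estimates of this kind are sensitive to the assumed serial interval distribution, which is one reason the delay data discussed below matter so much.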
estimating this proportion requires population size estimates for the different groups, which may be difficult to obtain, as we highlight in § c. spatial heterogeneity in transmission is particularly interesting to assess as it can inform the targeting of surveillance and interventions to the geographical areas most at risk. phylo-geographical studies based on genetic data can improve understanding of the geographical origins of the outbreak, identify and characterize sub-outbreaks and quantify whether transmission is very local or travels large distances [ , , ] . results of such analyses can be used to determine the appropriate spatial scale of control measures. spatially explicit epidemic models can also be used to quantify the risk of exportation of the infection from one place to another. this can help public health officials to tailor prevention and control resources to the level of risk likely to be experienced by a given area. such models typically require detailed data on mobility patterns and immunity levels of the populations in the areas of interest [ ] [ ] [ ] . disease natural history fundamentally affects outbreak dynamics. the generation time distribution (i.e. distribution of time between infection of an index case and infection of its secondary cases) determines, together with the reproduction number, the growth rate of an epidemic [ ] . most commonly, the generation time distribution is estimated from data on the serial interval distribution of an infection, that is, the delay between symptom onset in a case and symptom onset in his/her infector. other delays between events in the natural history of infection (e.g. exposure, onset of symptoms, hospitalization and recovery or death) also affect disease transmission or have implications for control [ , ] . delays from symptom onset to recovery (or death) will determine the required duration of health-care and case isolation. the incubation period (the delay between infection and symptom onset of a case) and the extent to which infectiousness precedes symptom onset will determine the feasibility and effectiveness of contact tracing or prophylaxis [ ] . estimating these delay distributions requires detailed data on individual cases and their exposure, e.g. through transmission pairs identified in household studies [ ] (an illustrative fit is sketched at the end of this passage). other analyses can also help refine the type of interventions that should be considered. ecological associations between transmissibility (measured by r) at a fine spatio-temporal scale and any covariate measured at the same scale may be of interest. for instance, analyses of the west african ebola epidemic showed that districts with lower reported funeral attendance and faster hospitalization experienced lower transmissibility, highlighting the effectiveness of promoting safe burials and early hospitalization [ ] . however, interpreting the results of such analyses can be challenging, as they might be prone to bias and confounding. efficacy (which measures the impact of an intervention under ideal and controlled circumstances) and effectiveness (which measures the impact of an intervention under real-world conditions) of an intervention (e.g. vaccine) are best measured in a trial setting (either individually or cluster-randomized [ - ] ). however, performing trials to evaluate the comparative impact of different multi-intervention packages is impractical.
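as flagged above, the sketch below illustrates fitting candidate distributions to a handful of invented onset-to-onset intervals from transmission pairs and comparing them by aic. in a real-time analysis such data are interval-censored and right-truncated, and those features need to be modelled explicitly rather than ignored as they are here.

```python
# Sketch: fit candidate distributions to observed serial intervals (in days)
# from transmission pairs and compare them with AIC.
import numpy as np
from scipy import stats

serial_intervals = np.array([6, 9, 11, 12, 13, 14, 15, 15, 16, 18, 21, 24, 29, 33], dtype=float)  # invented

candidates = {
    "gamma": stats.gamma,
    "lognormal": stats.lognorm,
    "weibull": stats.weibull_min,
}

for name, dist in candidates.items():
    params = dist.fit(serial_intervals, floc=0)          # fix the location parameter at zero
    loglik = np.sum(dist.logpdf(serial_intervals, *params))
    k = len(params) - 1                                  # location was fixed, not estimated
    aic = 2 * k - 2 * loglik
    mean = dist.mean(*params)
    print(f"{name:9s}: AIC = {aic:6.1f}, fitted mean = {mean:4.1f} days")
```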
dynamic epidemic models, where the interventions of interest can be explicitly incorporated, allow the impact of such intervention packages to be predicted [ ] . however, outputs of such models are strongly determined by the underlying assumptions and parameter values. hence they require careful parametrization, supported by data such as intervention efficacy and the size, infectivity, susceptibility and mixing of different groups [ , ] . these parameters may not be straightforward to estimate, as we discuss in § using examples from the west african ebola epidemic. another factor determining the appropriate choice of interventions is their cost, combinations of interventions with higher effectiveness at lower cost (i.e. higher cost-effectiveness) being preferable. economic analyses combined with mathematical models can help to evaluate the optimal resource allocation among both current available interventions and potential new interventions, accounting for development and testing costs for the latter. indirect costs, e.g. those associated with a restricted workforce following school closures [ ] , or trade limitations from air-travel restrictions [ ] , also need to be considered. while economic analyses have played an important role in designing optimal intervention packages for endemic diseases such as hiv and malaria [ , ] , such analyses are more difficult to perform during an epidemic, when cost data might be unavailable and uncertain, costs may vary rapidly over time and ethical considerations suggest interventions should be implemented immediately. (c) are current interventions effective and could they be improved? tracking changes in estimates of key epidemiological parameters over the course of an outbreak enhances situational awareness. it also allows the impact of interventions to be assessed as they are implemented, although disentangling the effects of different interventions carried out simultaneously may be challenging. obtaining reliable estimates of the epidemiological parameters detailed above requires a wide range of data, such as incidence time series and detailed case information (figure ). here, we explain how these can be obtained from various sources, with the objective to help improve data collection systems in preparation for future outbreaks. we use ebola as a specific example throughout this section, commenting on the strengths and limitations of the data collected during the west african epidemic. we distinguish data needs at the individual level, the exposure level and the population level. simple analyses can be performed solely using incidence time series, from surveillance designed to capture aggregate case counts. however, individual case reports provide much richer information, essential to estimate many of the key parameters outlined above (e.g. characterization of delay distributions). such data are typically stored in a case database or 'line-list'a table with one line containing individual data for each case. the more data recorded on each case, the more detailed the analyses can be. in the ebola epidemic, demographic characteristics, spatial location, laboratory results and clinical details such as symptoms, hospitalization status, treatment and outcome, and dates associated with these were reported for at least a subset of cases. the appropriate information to collect may vary depending on the disease: for ebola, dates of isolation and funeral were relevant. 
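a minimal, invented example of such a line-list, and of deriving an aggregated incidence series and a simple data-completeness summary from it, is sketched below using pandas. the column names, case classifications, districts and dates are hypothetical, and a real line-list would carry many more fields (identifiers, location codes, exposure information, dates of hospitalization, isolation, funeral and outcome, and so on).

```python
# Sketch: a toy case line-list and a weekly incidence series derived from it.
import pandas as pd

line_list = pd.DataFrame({
    "case_id":        ["C001", "C002", "C003", "C004", "C005"],
    "age":            [34, 8, 61, 27, 45],
    "sex":            ["F", "M", "F", "M", "F"],
    "district":       ["Kailahun", "Kailahun", "Kenema", "Kenema", "Kailahun"],
    "classification": ["confirmed", "probable", "confirmed", "suspected", "confirmed"],
    "onset_date":     pd.to_datetime(["2014-06-02", "2014-06-05", "2014-06-11",
                                      "2014-06-12", "2014-06-20"]),
    "outcome":        ["died", "recovered", "died", None, None],   # None = outcome not yet known
})

# Weekly incidence of confirmed and probable cases, by district.
cases = line_list[line_list["classification"].isin(["confirmed", "probable"])]
weekly = (cases
          .groupby(["district", pd.Grouper(key="onset_date", freq="W")])
          .size()
          .rename("cases")
          .reset_index())
print(weekly)

# Proportion of cases whose clinical outcome is still unknown (relevant to CFR estimation).
print("outcome missing for", line_list["outcome"].isna().mean().round(2), "of cases")
```

the last line illustrates the point made earlier: knowing the proportion of cases with unknown outcome is a prerequisite for unbiased estimation of the case fatality ratio.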
comprehensive demographic information can be used to determine risk factors for transmission or severity of infection and to project case numbers stratified by demographic characteristics. detailed information also helps to identify and merge any duplicate entries in a line-list, which may occur when the same person visits multiple health centres over the course of illness, for instance. finally, information on how each case was detected (for example, through hospitalization or via contact tracing) can aid assessment of how representative the data are and allow subsequent adjustment for bias [ ]; however, this was not available for ebola. cases in the line-list should be classified using standardized case definitions, which is sometimes difficult for outbreaks of new pathogens, or where different case definitions are provided by different organizations (e.g. the world health organization (who) and the us centers for disease control and prevention (cdc)) [ ]. for the ebola response, although the who released official case definitions of confirmed, probable and suspected ebola cases [ ], different countries adopted different testing strategies, thereby limiting the opportunity for inter-country comparison. for example, in guinea deceased individuals were not tested for ebola, meaning these individuals were never classified as confirmed cases, unlike in liberia and sierra leone. encouraging use of a consistent testing protocol and case definition, and ensuring transparency in what is used where and when, would improve the validity of subsequent analyses. laboratory testing of clinical specimens is key for confirming cases, and test results should be linked to the line-list. understanding diagnostic test performance in field conditions is important; cross-validation of diagnostic sensitivity and specificity between laboratories is useful to assess the extent to which observed differences in case incidence may be explained by variations in laboratory conditions and practices. in addition, recording raw test results with the case classification may help evaluation of diagnostic performance. ebola cases were defined as confirmed cases once ebola virus rna was isolated from clinical specimens using reverse transcription polymerase chain reaction (rt-pcr) [ ]; the temperature and humidity to which field laboratories were subjected reduced the test performance compared with manufacturer evaluation reports [ ]. during the west african ebola epidemic, the case line-list contained a large quantity of data collected from reported cases. the information allowed estimation of the cfr, the incubation period distribution (and evaluation of differences in these by age and gender) and the reproduction number [ , , ]. projections of the likely scale of the outbreak were also made, either from the line-list or from aggregated case counts [ , , , ]. there were regular data updates [ ], with a total of over confirmed and probable cases reported, which allowed analyses to be updated as more data became available. data on exposures allow cases to be linked to their potential sources of infection, and hence provide a better understanding of transmission characteristics. the relevant modes (e.g. airborne, foodborne) and pathways (e.g. animal-human, human-human) of transmission may be identified using information on exposure reported by cases.
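delay distributions such as the onset-to-hospitalization delay can be estimated directly from the dates recorded in a line-list. the sketch below, continuing the hypothetical table from the earlier example, fits a gamma distribution to the observed delays; the column names and values are illustrative, not data from the epidemic.

import pandas as pd
from scipy import stats

linelist = pd.DataFrame({
    "date_onset": pd.to_datetime(["2014-07-01", "2014-07-03", "2014-07-04", "2014-07-10"]),
    "date_hospitalised": pd.to_datetime(["2014-07-05", "2014-07-06", "2014-07-09", "2014-07-13"]),
})

delays = (linelist["date_hospitalised"] - linelist["date_onset"]).dt.days
shape, loc, scale = stats.gamma.fit(delays, floc=0)  # fix the location parameter at zero
print(f"mean onset-to-hospitalisation delay: {shape * scale:.1f} days")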
cases can report contact both with sick individuals (their potential source of infection) and healthy individuals they have contacted since becoming ill (who may need to be traced and monitored as potential new cases). these data will be more informative if the majority of infections are symptomatic (and hence easily identifiable), if individuals are mostly infectious after the onset of symptoms [ ] and if the time window over which exposures and contacts are monitored is as long as the upper bound of the incubation period distribution. if exposure information is collected with enough demographic information to allow record linkage, these backward and forward contacts can be identified in the case line-list, defining transmission pairs. depending on the mutation rate of the pathogen, genetic data can also be used to identify or confirm these epidemiological links [ , ] . the increasing availability of full genome pathogen sequences offers exciting prospects in that respect. some genetic sequence information was available during the ebola epidemic, but most sequences could not be linked to case records, limiting the use of sequence data in this context. on the other hand, individuals who were named as potential sources of infection could often be identified in the case linelist, although this process was hindered by non-unique names and limited demographic information collected on the potential sources [ ] . these exposure data were used to characterize variation in transmission over the course of infection [ ] , and to estimate the serial interval and the incubation period [ ] . the upper bound of the incubation period distribution was estimated to be days [ ] , which supported monitoring contacts for up to three weeks. studies of transmission in well-defined, small settings such as households are useful to quantify asymptomatic transmission, infectivity over time and the serial interval as they capture explicitly the number and timing of secondary cases. additionally, these studies can estimate the secondary attack rate (the proportion of contacts of a case who become infected within one incubation period), which can be used to characterize heterogeneities in transmission of different groups [ ] . estimates of the secondary attack rate have been obtained for the west african ebola epidemic by reconstructing household data based on information reported by cases, in particular, as part of contact-tracing activities [ , ] . although they might not immediately appear as useful as individual-or exposure-level data, metadata are crucial for answering many public health questions. knowing the sizes of affected populations is important for quantifying the attack rate and informing dynamic transmission models. census data are likely to be the most reliable source, but may be infrequently collected. methods based on interpreting satellite imagery [ , ] can inform population size and structure, although demographic stratifications are not always available. for the west african ebola epidemic, the most recently available age-and genderstratified population census data were from in guinea [ ] , in sierra leone [ ] and in liberia [ ] . a particular population of interest is hcws who, due to their contact with patients, are often at high risk of infection and may also be high-risk transmitters. large numbers of hcws were infected during the west african ebola epidemic [ , ] . 
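the secondary attack rate and the serial interval described above can be computed from household contact data and from transmission pairs. the sketch below uses made-up records purely for illustration; the field names are not taken from the actual contact-tracing forms.

import pandas as pd

# Hypothetical household contacts of cases, flagged by whether they became infected.
contacts = pd.DataFrame({
    "household": [1, 1, 1, 2, 2],
    "infected":  [True, False, True, False, False],
})
print(f"secondary attack rate: {contacts['infected'].mean():.0%}")

# Hypothetical transmission pairs: symptom onset dates of infector and infectee.
pairs = pd.DataFrame({
    "onset_infector": pd.to_datetime(["2014-08-01", "2014-08-03"]),
    "onset_infectee": pd.to_datetime(["2014-08-11", "2014-08-18"]),
})
serial_intervals = (pairs["onset_infectee"] - pairs["onset_infector"]).dt.days
print(f"mean serial interval: {serial_intervals.mean():.1f} days")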
however, the proportion of hcws affected at different stages of the outbreak and the relative risk of acquisition for the hcws compared with the general population could not be estimated since the total number of hcws was not systematically reported and changed during the course of the outbreak with the scale-up of interventions. note that, depending on the transmission route, the definition of hcws may need to include anyone working in a health-care setting who could be at risk (e.g. cleaners may be exposed to bodily fluids). characterizing population movement is crucial to assessing the risk of exportation of the infection from one place to another. air-travel data are the most reliable, consistently available and commonly used data source to inform models of long-distance spread [ ]. such data were widely used during the west african ebola epidemic to quantify the risk of international spread of the disease, and to assess the potential impact of airport screening and travel restrictions on the outbreak [ , - ]. however, air travel does not cover other population movements that may play an important role in disease spread, e.g. travel by road or on foot in guinea, liberia and sierra leone and across the porous country borders during the west african ebola epidemic [ ]. usually, little data are available to directly characterize these typically smaller-scale movements. gravity models, which assume that connectivity between two places depends on their population sizes and the distance between them, can be used to quantify spatial connectivity between different regions [ ] [ ] [ ] [ ], and have proved useful to predict local epidemic spread, e.g. for chikungunya [ ]. such models require data on population sizes and geographical distances. recently, mobile phone data have been explored as an alternative source of data on mobility patterns, which could be used to predict spatial epidemic spread [ ] [ ] [ ]. however, a number of challenges (in particular related to privacy issues) meant that such data were unavailable to understand the regional and local spread of the west african ebola epidemic [ ]. in addition, inter-country movement is not captured from these data. at a national level, the utility of mobile phone data may depend on whether mobile phone users are representative of the population contributing to transmission, the level of mobile phone coverage in the affected population and whether population movement is likely to remain the same during an epidemic compared to the time period of the data. assessing seroprevalence in a population affected by an outbreak can provide valuable information on the underlying scale of population exposure and insight into how interventions might be targeted. for instance, if there is pre-existing population immunity prior to an outbreak that varies with age (as was the case in the h n influenza pandemic), vaccination could be targeted at those with lower pre-existing immunity. dynamic transmission models incorporating such differences in susceptibility can be used to explore different targeting strategies [ ]. ideally, serological surveys would be undertaken to quantify seroprevalence [ ]; however, they are expensive and not performed on a regular basis. in the absence of such data, information on historical outbreaks and vaccine use can sometimes be used to infer seroprevalence [ ].
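the gravity-model idea described above has a very small computational core: connectivity grows with the population sizes of two places and decays with the distance between them. the sketch below is a generic illustration; the exponents would normally be fitted to observed mobility data and are not values taken from the studies cited.

def gravity_flow(pop_i, pop_j, distance_km, k=1.0, alpha=1.0, beta=1.0, gamma=2.0):
    """Gravity-model connectivity between two locations: proportional to the
    product of their population sizes, divided by distance raised to a decay
    exponent."""
    return k * (pop_i ** alpha) * (pop_j ** beta) / (distance_km ** gamma)

# Illustrative populations and distance, not data from the affected countries:
print(gravity_flow(pop_i=1_500_000, pop_j=800_000, distance_km=250))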
serological studies performed before and after an epidemic can also be useful to measure the attack rate and the scale of the outbreak, and hence provide information on the level of underreporting during the outbreak [ ] . it was widely assumed that the population in west africa was entirely susceptible to ebola at the start of this outbreak, with no known previous outbreaks in the area. however, some studies have suggested that there might have been low levels of prior immunity [ ] . during an outbreak, multiple interventions are often implemented by different groups and organizations. evaluating the role of interventions in interrupting transmission is important for revising and improving efforts, but it is challenging without detailed quantitative information of what has been implemented where and when [ , ] . maintaining a systematic real-time record of the different interventions at a fine spatio-temporal scale would help, e.g. the number and location of health-care facilities and their personnel, number of beds, vaccine or treatment coverage and details of local community mobilization. developing centralized platforms to routinely record such data once a large-scale outbreak is underway is probably unfeasible. however, developing such tools in advance of outbreaks (such as those developed for the global polio eradication initiative [ ] and those recently developed to collect health-care facility data [ ] ) should be a priority since better information to evaluate intervention policies in real time will allow for more optimal resource allocation. during the west african ebola epidemic, many data on interventions were recorded at a local level by some of the numerous partners (e.g. non-governmental organizations (ngos) and other organizations) involved in the response. for example, some data were collected on the number and capacity of hospitals over time [ ] and these were used in a study modelling community transmission to assess the impact of increasing hospital bed capacity [ ] . however, the decentralization of the response meant that intervention data were not systematically reported or collated and these data were not shared widely with the research community. a failure to report interventions centrally and systematically can make it difficult to disentangle a lack of intervention effect from a lack of intervention implementation. this can particularly be a problem when numerous groups coordinate their own efforts, making it impossible to draw firm conclusions about interventions. in the absence of detailed data on intervention efforts in west africa, multiple studies have assessed the combined impact of all interventions in place, by comparing transmissibility in the early phase (with no intervention) to that in the later phases [ , ] . however, this approach provides less compelling evidence of a causal effect and does not disentangle the impact of different interventions performed at the same time, and hence is less informative for future response planning. vaccine or treatment trials together with case-control and cohort studies can be useful in assessing the impact of an intervention. for example, during the west african ebola epidemic there was an urgent need to estimate the efficacy of newly developed vaccines. trials such as the guinea ebola ça suffit vaccine trial [ ] provided key data on the effectiveness of the rvsv-zebov ebola vaccine [ , ] . these trials occurred at the tail end of the epidemic and results will be useful in future outbreaks. 
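as a toy illustration of how pre- and post-epidemic serological surveys can bound underreporting, the calculation below compares the infections implied by a rise in seroprevalence with the number of reported cases. all numbers are invented for illustration and are not estimates for ebola or any specific outbreak.

def reporting_fraction(sero_pre, sero_post, population, reported_cases):
    """Reported cases divided by the infections implied by the change in
    seroprevalence between two surveys."""
    implied_infections = (sero_post - sero_pre) * population
    return reported_cases / implied_infections

print(f"estimated reporting fraction: {reporting_fraction(0.01, 0.06, 2_000_000, 28_000):.0%}")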
statistical power from trials will be maximized by implementing such studies as early as possible in future outbreaks. this will be facilitated if research on diagnostics, drugs and vaccines is promoted between, and not only during, outbreaks, e.g. through new initiatives such as the coalition for epidemic preparedness innovation [ ]. all of the data sources mentioned above are inevitably imperfect; what they are trying to measure is different from what they measure in practice. quantifying the mismatch between the two is vital to appropriately account for these imperfections. for instance, case line-lists are likely to contain information on only a proportion of all infected individuals: typically those with symptoms, or those who sought care. the level of reporting may also be influenced by the capacity of the local health systems, which can vary over time and space. during the west african ebola epidemic, less than a third of cases were estimated to be reported [ ] and severe cases were probably over-represented compared to mild cases. at the end of , health-care capacity was exceeded in many parts of guinea, liberia and sierra leone [ ], but new health-care facilities were subsequently built; hence the line-list of cases is likely to be more complete towards the end of the outbreak. underreporting might also have been higher in this compared to previous ebola outbreaks, during which the health-care systems were less overburdened. systematic evaluation of the surveillance system [ ] over different spatial units and time periods could help inform the level of underreporting. in addition, joint analysis of genetic sequence and surveillance data can provide insight into the degree of underreporting [ ]. quantifying completeness of, and potential biases in, the line-list is important, e.g. to adequately quantify the cfr [ ]. although differences in the cfr were observed across different health-care facilities in the west african ebola epidemic, it was not possible to determine whether these were due to reporting differences or underlying differences between settings [ ]. similarly, exposure-level data can be incomplete, depending on the available capacity (personnel and resources) to perform contact tracing and the willingness of people to share information on their contacts. complete data can be used to assess the route of transmission (animal to human or human to human) and the number of cases imported from other locations. if data are incomplete, these estimates may be incorrect. population-level data may suffer the same issues. for instance, the recording of intervention efforts (e.g. the number of personal protective equipment (ppe) kits distributed) can differ from the reality of the intervention (e.g. the number of people who used ppe in practice). quantifying this mismatch is crucial to evaluating the impact of various interventions, and requires dedicated studies, with both qualitative and quantitative components, on the acceptability of and the adherence to given interventions. such studies have been carried out in the past, e.g. to assess the potential impact of face masks on the risk of influenza transmission in households, or of condom use on the risk of hiv transmission [ ] [ ] [ ]. it is also important to quantify delays encountered in the reporting of cases [ ]. during the west african ebola epidemic, there were delays in all databases and disparities between different data sources.
in particular, comparison between the line-list of cases and aggregated daily case counts reported by the affected countries highlighted reporting delays in the line-list. as a result, at any point in the outbreak, naive time series of case counts derived from the line-list suggested that the epidemic was declining, due to right-censoring. comparison between the line-list and reported aggregated case counts (which were more up-todate) and between successive versions of the line-list allowed the reporting delays to be quantified. analyses such as the projected incidence could then be adjusted accordingly [ ] . in summary, the enormous quantity of detailed data collected during the west african ebola epidemic played an invaluable role in guiding response efforts. however, several analyses could not be performed. early on, severity, transmissibility and delay distributions were quantified and short-term projections were made, based on the case line-list and/or contact tracing data [ , , ] . some heterogeneity in severity and transmissibility could be identified (e.g. by age and viraemia [ , , , , [ ] [ ] [ ] ), but other types of heterogeneity could not be assessed. for example, it was not possible to compare the cfr between hospitalized and non-hospitalized cases, because of biases in the way cases were recorded in the line-list [ ] . long-term projections were extremely challenging due to large uncertainty in population sizes, behaviour changes and changing intervention efforts. in addition, the relative risk of ebola acquisition for hcws was difficult to estimate due to the absence of reliable spatio-temporal data on the number of hcws. finally, systematic evaluation of interventions was not feasible due to multiple control measures being carried out at the same time, with little central recording of details of each intervention. data collection during the west african ebola epidemic was possible, in part, due to a pre-existing case investigation form and data management system (epi info viral haemorrhagic fever (vhf) application [ , ] ). for outbreaks of new pathogens, such systems are not usually in place. the list of data needs we have outlined above, based on our experience during the ebola epidemic, could serve as a basis for data collection in future outbreaks. here we outline improvements that could be made to streamline data collection, minimize delays between data collection and data dissemination, and improve data quality during future outbreaks. all of these are necessary for timely analysis to inform the response in real time. we consider this particularly for line-list and exposure data. data need to be digitized before they can be analysed; streamlining this process reduces the potential for delays and errors. using electronic data capture on tablets or phones may reduce delays and errors compared with using paper forms that then require manual digitization, with its contingent issues [ ] [ ] [ ] [ ] . electronic questionnaires may not be available at the start of the epidemic, but could be quickly adopted using available tools (e.g. epicollect [ ] or epibasket [ ] ). interpretation problems (e.g. abbreviations) and spelling errors could also be minimized at the data collection level by using multiple choice questions rather than free text and at the data entry level by using drop-down menus. for example, for spatial information the choices could be the standard administrative levels used in the country-such as district or county. 
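the right-censoring problem described above, in which the most recent cases have not yet appeared in the line-list, can be corrected crudely by dividing recent counts by the probability that a case occurring that many days ago would already have been reported. the sketch below is a generic nowcasting illustration with invented numbers, not the adjustment actually used in the cited analyses.

def nowcast(observed_counts, prob_reported_by_delay):
    """Adjust the most recent daily counts for reporting delay. The list
    prob_reported_by_delay[d] gives the probability that a case occurring d
    days before the latest date has been reported by now (d = 0 is today)."""
    adjusted = []
    n = len(observed_counts)
    for t, count in enumerate(observed_counts):
        days_ago = n - 1 - t
        p = prob_reported_by_delay[min(days_ago, len(prob_reported_by_delay) - 1)]
        adjusted.append(count / p)
    return adjusted

# Counts for the last five days and illustrative reporting probabilities by delay:
print(nowcast([120, 115, 130, 90, 40], [0.3, 0.6, 0.8, 0.95, 1.0]))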
as response efforts evolve over time, such as the building of new treatment centres, the lists would be updated as required. similarly, using pop-out calendars to select dates would limit typing errors. additionally, internal consistency checks could flag problems such as the recorded date of death being before that of symptom onset, and the system could force a manual check before the record is saved. in the west african ebola epidemic, data were collected using paper forms (electronic supplementary material, figures s -s ) and manually entered into an electronic system at local operation centres. data entry problems were particularly noticeable in clinical dates and free text variables, such as district locations and hospital names, and required considerable cleaning (e.g. to correct spelling errors) [ , , ] . with any data system there is potential for delays in data entry and dissemination due to limitations in personnel and hardware, and logistical constraints in data delivery (e.g. reliable electricity, internet access and transport). although the latter limitations are arguably outside the scope of outbreak control, delays from data collection to dissemination could be reduced by increasing training in data entry and providing more data entry facilities where possible. delays in data entry and release during the ebola epidemic meant that real-time analyses of the line-list data could not be performed on fully up-to-date data. in the world today, an epidemic emerging anywhere is a global threat due to high population connectivity [ , ] and, as such, a response will have to operate across multiple languages. ideally, the same data entry system would be used in every affected country to allow easy collation into a single case line-list. however, a global response requires careful translation of the questions and the system to ensure they make sense and are identical in every language. additionally, different languages, dialects or alphabets could be challenging due to differing pronunciations, accents, alternate names or spellings, and these should be acknowledged in the form design. the language barrier was not a major problem during the ebola response: guinea had a form in french, while in sierra leone and liberia the form was in english, though there were some minor differences in formulations between the two (for example the english version of the form asked 'in the past one month prior to symptom onset: did the patient have contact with (. . .) any sick person before becoming ill' (see § in the form in electronic supplementary material, figure s ), while the french form did not specify 'before becoming ill'). analysis of laboratory results and sequence data can be much more powerful if they can be dated and linked to the epidemiological data recorded for each case. in the early stages of the ebola response, there were reports that laboratory results could not always be linked to case records as labels were incorrectly written or damaged in transit [ ] . later in the epidemic, case report forms came with pre-printed unique id barcode stickers to label all records and samples for each case (electronic supplementary material, form s ). this would be useful if implemented early in future outbreaks, particularly if laboratory tests are not performed at the point of care. 
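internal consistency checks of the kind described above (for example, a recorded date of death earlier than the date of symptom onset, or a district name outside the standard administrative list) are straightforward to automate at data entry. the sketch below is illustrative; the field names and the district list are hypothetical, not the actual case report form.

import pandas as pd

def flag_record_problems(record, valid_districts):
    """Return a list of consistency problems for a single case record."""
    problems = []
    if pd.notna(record.get("date_death")) and pd.notna(record.get("date_onset")):
        if record["date_death"] < record["date_onset"]:
            problems.append("date of death precedes symptom onset")
    if record.get("district") not in valid_districts:
        problems.append("district not in standard administrative list")
    return problems

record = {"date_onset": pd.Timestamp("2014-09-10"),
          "date_death": pd.Timestamp("2014-09-05"),
          "district": "distrcit_a"}  # deliberate spelling error to trigger the check
print(flag_record_problems(record, valid_districts={"district_a", "district_b"}))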
rapid diagnostic tests were developed during the west african ebola epidemic but not used widely [ , ] ; similarly, mobile sequence tests were introduced later in the epidemic [ ] : both of these would reduce delays and maximize the potential to link patient data. we have described a set of data needed to perform analyses to inform the public health response during an outbreak. we have proposed strategies to ensure fast and high-quality data collection, to enable robust and timely analyses. however, such analyses can only be performed if the data are accessible to data analysts and modellers. the west african ebola epidemic has prompted an ongoing public debate about the ethical, practical and scientific implications of wide data access [ , [ ] [ ] [ ] . ethical considerations require removal of data that might compromise anonymity, but such detailed information might be required to answer important public health questions. mechanisms also need to be found to appropriately acknowledge those who collected and digitalized the data. from a scientific perspective, having several groups analysing the data is desirable, as independent analyses leading to similar results will reinforce their utility for policymaking [ , ] . such parallelized efforts have been formalized through consortiums of highly experienced groups, e.g. for modelling of hiv [ ] , influenza [ , ] and malaria [ ] . however, if results differ, understanding what assumptions drive these differences may confuse and delay the process of decision-making. a consequence of data sharing, therefore, needs to be an increased emphasis on evidence synthesis, such as systematic reviews of the different analyses. to enable this process, groups should explicitly state all assumptions underlying their analyses and which data they are based on. like the original analyses, reviews need to be timely to be useful. these issues have partly been addressed through new data availability policies [ ] . furthermore, this process would ideally be performed by a group independent of those performing the primary analyses. identifying an effective system, appropriate personnel and appropriate recognition for this important role needs to be planned in advance of future outbreaks. data from the west african ebola epidemic required significant cleaning before it could be analysed, and this was necessary for every updated dataset, i.e. every few days during the peak. were outbreak data to be shared more widely, collaborative or centralized cleaning would be optimal, to avoid repetition and ensure consistency across different groups. if a centralized effort was not possible, regular sharing of the cleaning code and cleaned datasets between groups would facilitate comparison of results. even better, code could be shared on a collaborative platform such as github (https://github.com/), leading to a common clean dataset being actively maintained by the scientific community. however, for this process to be effective, a set of best practices would need to be established in advance, such as designing a transparent workflow, establishing a fair distribution of tasks and clarifying how credit will be given for this (often lengthy) task. this is important to avoid duplication of effort and allow effective collaboration. as the process of data sharing is debated further, it is critical that the practicalities of data cleaning are discussed in parallel. based on our experience, a centralized cleaning platform would be the most effective method. 
finally, analyses will be most useful if they are shared widely across policymakers, local health teams and other research groups. the format for dissemination could be anything from a report to a scientific publication. reports can be made available faster and regularly updated but do not undergo the peer review process of publications. peer review can delay publication, though there are now new platforms for fast-tracking this process [ , ]. at the time of writing, as interest in zika virus is gaining momentum, there seems to be an encouraging trend towards the use of pre-print servers [ , ]. by the very nature of emerging infectious diseases, we do not know which pathogen will emerge next, when or where. there have been many suggestions about how to be better prepared [ - , ]; here, we argue that preparedness should include development of a broad-use data collection system that can be easily and quickly adapted to any disease (in agreement with [ ]) as well as the regular collection of population health data in centralized systems. different infectious diseases may require different types of data [ ]: a single approach is not applicable to all diseases. in particular, different interventions may need to be recorded for different diseases. data collection and management may need to evolve as the outbreak progresses and/or as more is learnt about the pathogen: the ebola data collection forms changed late in to streamline collection and entry when the response effort was almost overwhelmed with cases (electronic supplementary material, figures s and s ). similarly, both for ebola and the recent zika outbreak, reports of sexual transmission have led to a broadening of the contact tracing and exposure information collected. the data collected during the ebola epidemic allowed many analyses to be performed, which informed the response. however, as in many outbreak situations, it has not been possible to systematically quantify the relative contribution of different interventions (such as safe burials, hospitalization, contact tracing and community mobilization) in reducing transmission. this is because data on where and when these interventions took place were not centrally recorded and released in a timely fashion. efforts have been made to collate some information from sierra leone, liberia and guinea [ ]; however, this commenced late in the epidemic. as this effort relies on different organizations to contribute data, the submissions are in different formats and are unlikely to provide comprehensive information. in the midst of a global public health crisis, resources are often deployed favouring implementation rather than documentation of interventions. however, we would argue that securing some resources to monitor interventions, especially during the early stages, is critical to optimally prioritizing future control efforts. some of the data we have suggested should be collected during outbreaks may not be obviously useful at the field or case management level, e.g. detailed demographic characteristics of cases and contacts. collecting such data costs money and time, and requires trained personnel. these three 'resources' are limited and, during an epidemic, should be prioritized where their need is greatest. however, from a population perspective, collecting these data may help quantify epidemic impact, assist in the design and evaluation of interventions, and help prevent new infections.
further studies might examine how to appropriately balance these two considerations. we have built on the ebola experience to draw up a list of the data needed to assist the response throughout an epidemic, which should help to collect relevant data in a standardized effort in future epidemics. to make the most of these data, epidemiologists and modellers should work now to develop tools to automatically clean, analyse and report on the data in a more timely and robust manner [ ] . based on critical review of past outbreak analyses, future studies could flag common methodological mistakes and propose good practice [ ] . this includes clearly stating all assumptions underlying a model or analysis and ensuring that parametrization is either directly informed by relevant data or has appropriate sensitivity analyses, with corresponding uncertainty clearly reported [ , ] . improving our ability to respond effectively to the next outbreak will require collaboration between all parties involved in outbreak response: those in the field, epidemiologists, modellers and policymakers as well as the populations affected. here we have given the data analyst perspective on what data are required to answer important policy questions. it is equally important that other perspectives should be heard to be better prepared for and improve interactions during crises, thereby minimizing the impact of future outbreaks. a review of epidemiological parameters from ebola outbreaks to inform early public health decision-making after ebola in west africa: unpredictable risks, preventable epidemics ebola virus disease in west africa-the first months of the epidemic and forward projections ebola virus disease among male and female persons in west africa west african ebola epidemic after one year-slowing but not yet under control ebola virus disease among children in west africa chains of transmission and control of ebola virus disease in conakry, guinea, in : an observational study estimating the future number of cases in the ebola epidemic-liberia and sierra leone effectiveness of screening for ebola at airports exposure patterns driving ebola transmission in west africa, a retrospective observational study are we learning the lessons of the ebola outbreak? the next epidemic-lessons from ebola global health security: the wider lessons from the west african ebola virus disease epidemic will ebola change the game? ten essential reforms before the next pandemic. the report of the harvard-lshtm independent panel on the global response to ebola clinical research during the ebola virus disease outbreak in guinea: lessons learned and ways forward the emergence of ebola as a global health security threat: from 'lessons learned' to coordinated multilateral containment efforts ebola in west africa: lessons we may have learned lessons learned from hospital ebola preparation ebola in west africa: learning the lessons strengthening the detection of and early response to public health emergencies: lessons from the west african ebola epidemic ebola clinical trials: five lessons learned and a way forward lessons from ebola: improving infectious disease surveillance to inform outbreak management epidemiological data management during an outbreak of ebola virus disease: key issues and observations from improving the evidence base for decision making during a pandemic: the example of influenza a/h n epidemic and intervention modelling-a scientific rationale for policy decisions? 
lessons from the influenza pandemic emerging infections: what have we learned from sars? assessing the severity of the novel influenza a/h n pandemic potential biases in estimating absolute and relative case-fatality risks during outbreaks optimizing the precision of case fatality ratio estimates under the surveillance pyramid approach influenza-related deathsavailable methods for estimating numbers and detecting patterns for seasonal and pandemic influenza in europe an introduction to infectious disease modelling six challenges in modelling for public health policy uses and abuses of mathematics in biology national center for infectious diseases sars community outreach team decisions under uncertainty: a computational framework for quantification of policies addressing infectious disease epidemics development and consideration of global policies for managing the future risks of poliovirus outbreaks: insights and lessons learned through modeling quantifying uncertainty in epidemiological models the orion statement: guidelines for transparent reporting of outbreak reports and intervention studies of nosocomial infection generality of the final size formula for an epidemic of a newly invading infectious disease modeling to inform infectious disease control nine challenges in incorporating the dynamics of behaviour in infectious diseases models nine challenges in modelling the emergence of novel pathogens factors that make an infectious disease outbreak controllable infectious diseases of humans: dynamics and control a brief history of r and a recipe for its calculation epiestim: a package to estimate time varying reproduction numbers from epidemic curves (r package a new framework and software to estimate timevarying reproduction numbers during epidemics how generation intervals shape the relationship between growth rates and reproductive numbers different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures superspreading and the effect of individual variation on disease emergence mathematical models of infectious disease transmission a bayesian mcmc approach to study transmission of influenza: application to household longitudinal data risk factors of influenza transmission in households heterogeneities in the transmission of infectious agents: implications for the design of control programs bayesian inference of infectious disease transmission from wholegenome sequence data bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data unravelling transmission trees of infectious diseases by combining genetic and epidemiological data relating phylogenetic trees to transmission trees of infectious disease outbreaks household transmission of pandemic influenza a (h n ) virus in the united states ebola virus disease in health care workers-sierra leone ebola virus disease cases among health care workers not working in ebola treatment units-liberia temporal and spatial analysis of the - ebola virus outbreak in west africa evolution and spread of ebola virus in liberia modeling the spatial spread of infectious diseases: the global epidemic and mobility computational model the foot-and-mouth epidemic in great britain: pattern of spread and impact of interventions large-scale spatial-transmission models of infectious disease the interval between successive cases of an infectious disease standards of evidence: criteria for efficacy, effectiveness and dissemination criteria for distinguishing effectiveness from efficacy trials in 
systematic reviews agency for healthcare research and quality vaccine epidemiology: efficacy, effectiveness, and the translational research roadmap modeling infectious diseases in humans and animals social contacts and mixing patterns relevant to the spread of infectious diseases modeling infectious disease parameters based on serological and social contact data: a modern statistical perspective (statistics for biology and health closure of schools during an influenza pandemic the international response to the outbreak of sars in health benefits, costs, and cost-effectiveness of earlier eligibility for adult antiretroviral therapy and expanded treatment coverage: a combined analysis of mathematical models costs and cost-effectiveness of malaria control interventions-a systematic review performance of case definitions for influenza surveillance case definition recommendations for ebola or marburg virus diseases mobile diagnostics in outbreak response, not only for ebola: a blueprint for a modular and robust field laboratory reebov antigen rapid test kit for point-of-care and laboratory-based testing for ebola virus disease: a field validation study temporal variations in the effective reproduction number of the west africa ebola outbreak inference and forecast of the current west african ebola outbreak in guinea, sierra leone and liberia epidemiological and viral genomic sequence analysis of the ebola outbreak reveals clustered transmission transmissibility and pathogenicity of ebola virus: a systematic review and meta-analysis of household secondary attack rate and asymptomatic infection transmission dynamics of ebola virus disease and intervention effectiveness in sierra leone landscan tm about worldpop population statistics of guinea population and housing census national population and housing census final results the role of the airline transportation network in the prediction and predictability of global epidemics assessment of the potential for international dissemination of ebola virus via commercial air travel during the west african outbreak assessing the international spreading risk associated with the west african ebola outbreak assessing the impact of travel restrictions on international spread of the west african ebola epidemic mali case, ebola imported from guinea local and regional rstb.royalsocietypublishing.org phil gravity models for airline passenger volume estimation five challenges for spatial epidemic models dynamic population mapping using mobile phone data mapping population and pathogen movements impact of human mobility on the emergence of dengue epidemics in pakistan commentary: containing the ebola outbreak -the potential and challenge of mobile network data assessing optimal target populations for influenza vaccination programmes: an evidence synthesis and modelling study use of serological surveys to generate key insights into the changing global landscape of infectious disease key issues in the persistence of poliomyelitis in nigeria: a casecontrol study increased transmissibility explains the third wave of infection by the h n pandemic virus in england undiagnosed acute viral febrile illnesses how to make predictions about future infectious disease risks new approaches to quantifying the spread of infection polio information system about healthsites the humanitarian data exchange measuring the impact of ebola control measures in sierra leone retrospective analysis of the - ebola epidemic in liberia the ring vaccination trial: a novel cluster randomised 
controlled trial design to evaluate vaccine efficacy and effectiveness during outbreaks, with special reference to ebola efficacy and effectiveness of an rvsv-vectored vaccine expressing ebola surface glycoprotein: interim results from the guinea ring vaccination cluster-randomised trial efficacy and effectiveness of an rvsv-vectored vaccine in preventing ebola virus disease: final results from the guinea ring vaccination, open-label, cluster-randomised trial (ebola Ça suffit!) an r&d blueprint or action to prevent epidemics plan of action use of capturerecapture to estimate underreporting of ebola virus disease updated guidelines for evaluating public health surveillance systems: recommendations from the guidelines working group inference for nonlinear epidemiological models using genealogies and time series heterogeneities in the case fatality ratio in the west african ebola outbreak - face mask use and control of respiratory virus transmission in households barriers to mask wearing for influenza-like illnesses among urban hispanic households efficacy of structural-level condom distribution interventions: a meta-analysis of us and international studies quantifying reporting timeliness to improve outbreak control clinical features of and risk factors for fatal ebola virus disease assessment of the severity of ebola virus disease in sierra leone use of viremia to evaluate the baseline case fatality ratio of ebola virus disease and inform treatment studies: a retrospective cohort study the epi info viral hemorrhagic fever application the epi info viral hemorrhagic fever (vhf) application: a resource for outbreak data management and contact tracing in the - west africa ebola epidemic e-health technologies show promise in developing countries a comparison of the completeness and timeliness of automated electronic laboratory reporting and spontaneous reporting of notifiable conditions replacing paper data collection forms with electronic data entry in the field: findings from a study of communityacquired bloodstream infections in pemba, rstb.royalsocietypublishing.org phil comparison of electronic data capture (edc) with the standard data capture method for clinical trial data epibasket: how e-commerce tools can improve epidemiological preparedness. emerg who report on global surveillance of epidemic-prone infectious diseases address to the sixty-ninth world health assembly notes from the field: baseline assessment of the use of ebola rapid diagnostic tests-forecariah, guinea essential medicines and health products: diagnostics real-time, portable genome sequencing for ebola surveillance sharing data for global infectious disease surveillance and outbreak detection data sharing: make outbreak research open access providing incentives to share data early in health emergencies: the role of journal editors mathematical modelling of the pandemic h n studies needed to address public health challenges of the h n influenza pandemic: insights from modeling public health impact and cost-effectiveness of the rts,s/as malaria vaccine: a systematic comparison of predictions from four mathematical models developing global norms for sharing data and results during public health emergencies outbreaktools: a new platform for disease outbreak analysis using the r software avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to ebola seven challenges for model-driven data collection in experimental and observational studies acknowledgements. 
the authors give credit to and thank the many individuals who were involved in data collection, entry and management during the west african ebola epidemic.

key: cord- -lyld up authors: prakash, a.; muthya, s.; arokiaswamy, t. p.; nair, r. s. title: using machine learning to assess covid- risks date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: lyld up

abstract: importance: identifying potential covid- patients in the general population is a huge challenge at the moment. given the low availability of clinical data from infected covid- patients, it is challenging to identify similar and complex patterns in these symptomatic patients. laboratory testing for the covid antigen with rt-pcr (reverse transcription polymerase chain reaction) is not possible or economical for whole populations. objective: to develop a covid risk stratifier model that classifies people into different risk cohorts based on their symptoms, and to validate the same. design: analyses of covid cases across wuhan and new york were done to identify the course of these cases prior to their being symptomatic and being hospitalised for the infection. a dataset based on these statistics was generated and then fed into an unsupervised learning algorithm to reveal patterns and identify similar groups of people in the population. each of these cohorts was then classified into one of three risk levels, which were validated against real-world cases and studies. setting: the study is based on the general population. participants: the adult population was considered for the analysis, development and validation of the model. results: of the million observations generated, % exhibited covid symptoms and patterns, and % belonged to the asymptomatic and non-infected group of people. upon clustering, three clinically obvious clusters were obtained: cluster a had % of the symptomatic cases, which were classified into one cohort; of the other two cohorts, cluster b had people with no symptoms but a high number of comorbidities, and cluster c had people with a few leading indicators for the infection and few comorbidities. this was then validated against participants whose data we collected as part of a research study through our covid-research tool, and about % of them were classified correctly. conclusion: a model was developed and validated that classifies people into covid risk categories based on their symptoms. this can be used to monitor and track cases that rapidly transition into being symptomatic and eventually test positive for the infection, in order to initiate early medical interventions. keywords: covid- , synthetic data, patient clustering, unsupervised learning, risk classification

covid has surprised the world with its infectivity and rapid global spread, causing massive loss of life and livelihoods. the right way to tackle this pandemic is to act quickly in identifying those at risk and treat patients early. identifying and tracking symptoms of covid-infected patients is challenging today, as new insights into its etiology, pathology, public health impact, epidemiology, treatment options, vaccination, etc. are emerging continuously with its global spread. machine learning has been extensively used in biomedical and medical sciences to help improve hospital outcomes through effective early interventions that lead to improved prognosis.
data can be a powerful tool to analyse, interpret and build predictive models around them to support improved health care, if validated and analysed rightly. a scientific approach of using these techniques, can perfectly complement the clinical diagnostic and treatment protocols. getting access to datasets that capture the trends in the general population from being healthy to acquiring the infection and in-hospital prognosis phase is quite challenging and isn't open source for the public due to obvious security and privacy concerns at the moment. nevertheless, current investigations and studies are available that encapsulate most of the common statistics and symptoms of covid patients. using this, our proposed method captures these statistics along with some clinical background and generates a dataset on which we intend to apply an unsupervised learning algorithm to identify patterns and classify them into risk cohorts. in predictive modelling, the term "unsupervised learning" refers to instances where the data does not have a label associated with it. getting labels on data can be a very expensive process in terms of money, time and manpower. in such cases the knowledge is inferred from the data itself by applying clustering algorithms to find hidden, similar patterns and groups by some exploratory analysis. in our method, we have tried to infer patterns in different cohorts of people and label their covid risk levels through analysis and further validation of the same. in cases where the data isn't available, one proven method in the healthcare space is to generate faux data through good clinical reasoning and validation. the data set is usually generated using a logic based algorithm that captures human knowledge about the subject along with current research and studies with some evidence. covid based research has evidently increased since the pandemic has struck and related resources are available extensively today, and this method has tried to capture these studies into an interpretable form for analysis and categorization of different risk cohorts that were validated against current data. this model can be used to identify risk levels, based on which cohort they belong to or transition into, over a period of time. creating synthetic datasets in healthcare is predominantly increasing because of the existing challenges in healthcare systems to record information in ehr and emr formats and even if this isn't a hindrance, security and privacy controls laws on these data are very stringent that it becomes hard to get access to. nevertheless, synthetic data sets can be evolved to a better real world representation without compromising on the quality of the clinical information but also can help avoiding privacy clauses and concerns around them. [ ] one such notable example is gan's (generative adversarial network) in the deep learning research space that generates completely synthetic data with real world logics. medgan(choi et al) [ ] is an algorithm that generated realistic synthetic ehr's that were high dimensional and discrete in nature using gan's and autoencoders. the rcgan [ ] is another interesting work that generated high dimensional realistic synthetic time series datasets using recurrent gan's. most of the gan techniques applied in healthcare settings had some or very little real world data that was fed into gan's which isn't the case with our problem statement. the availability of covid patient records at absolutely zero today. 
two drawbacks of using gans are difficult validation and poor interpretability in assessing why particular samples are created, which makes them hard to implement. laura et al. ( ) [ ] used naive bayes clustering methods to generate realistic datasets, taking mimic iii as a baseline, and had much better results compared to medgan [ ]. an approach very similar to ours is that of chen et al. ( ) [ ], which generated more than a million "synthetic residents" with an algorithm named synthea (also referred to as synthetic mass) that represented the residential population around massachusetts, usa, and mimicked the statistics of the population, including demographics, vaccinations, medical visits and comorbidities. this was also compared and validated against the original population around the city. another notable work is that of harvard dataverse, which has , completely synthetic patient records generated with the synthea software mentioned previously [ ]. mahmoud et al. ( ) used k-means clustering to predict patient outcomes in elderly patients [ ]. hany et al. ( ) summarized how clustering techniques and pattern identification in ad (alzheimer's disease) patients, from the early to the late stages of the disease, can be effective in healthcare [ ]. lio et al. applied clustering techniques to find patterns in end-stage renal disease patients who initiated hemodialysis [ ].

proposed method: coronavirus is known to progress in some infected patients, rapidly affecting the vital organs of the body. not everybody experiences similar symptoms; they vary from person to person, and a majority remain asymptomatic. it becomes challenging to identify such patterns in the general population. if the population is constantly tracked for symptoms, monitored and assessed for infection, then people who are likely to be infected can be identified and early interventions initiated. to understand complex patterns and symptoms in infected cases, a real-world dataset describing clinical conditions during the asymptomatic phase is required to do any research and build predictive models around it. obtaining such historical clinical information on cases can be very expensive and time consuming, and it is too sensitive to be made openly available to the public. often, even where this kind of data exists, clinical records have gaps and sparse information that are not adequate to draw conclusions from. with this challenge of obtaining clinical data by conventional methods, generating a "synthetic but convincing" dataset to understand the patterns in symptomatic cases, grounded in current evidence and studies, is the need of the hour. studies have shown that generating synthetic clinical datasets is a promising and plausible approach to take in such scenarios. using statistical studies of covid-infected cases across the cities of wuhan, china and new york, usa, a dataset was generated that fits these populations and describes them well enough to work with. although the statistics of infected cases vary across the globe, an effort was made to capture the recurring patterns and similarities in both cities, normalising these differences to an acceptable level. cases across these cities were chosen primarily because they had the most cases in very large populations, and validation of our dataset would use them as a baseline for future studies. we do not intend to build a universal dataset that represents the global population. our interest is to capture major symptomatic and infection-prone populations based on the studies to date and simulate the same.
covid in-hospital admission information was considered from the period march , until april , from an investigation conducted in new york [ ] , which was the epicenter of covid cases in the united states. this investigation consisted of a total of participants who were diagnosed with the infection and had received treatment for the same. statistics of these patients included the comorbidities, symptoms, age, gender ,race and more. similarly, the characteristics from wuhan, which happens to be the world's epicentre for the virus, were studied, from the period december , to march , . [ ] this study investigated the trends in the spread of the virus, and symptoms. they were studied across different cohorts of population that were classified into mild, moderate and severe with respect to the infection. it also captured similar characteristics of that of the former study mentioned. both of these studies were compared against each other in terms of infected population's statistics and were found to be contrasting at few places with different numbers in demographics like gender and age. the common characteristics were found to be comorbidities and the early symptoms of the infection. we tried minimising these differences and came up with numbers that equated instances from both the investigations and fairly generalised the infection trends and symptoms for a general population. we do not intend to build a universal data set that represents the global population. our interest is to capture major symptomatic and infection prone populations based on the studies till date and simulate the same. the idea behind generating synthetic dataset begins with exploiting freely available information regarding the statistics, prevalence and incidence of this infection. from these statistics, we can get a fair picture regarding the demographics, and prevalence of symptoms and comorbidities in the infected population. the synthetic data was generated by griser's method [ ] . the features we considered were symptoms observed in infected patients and comorbidities. we did not consider age as a feature since we believe "covid is a de novo disease. initially thought to affect elders predominantly. with time, other age groups were also affected but mechanism is poorly understood. age does not appear to be a primary factor in getting infected or disease progression". we define "covid criteria met" definition based on higher incidence and prevalence of certain symptoms associated with this virus that is likely to be experienced by the host during the initial stages of the infection. [ ] we also identified few leading indicators or signs that were likely to occur in some symptomatic populations. for example, diarrhea, nausea ,conjunctivitis and loss of taste and smell [ ] were found to be in the very early stages of the infection. also, travel history along with flu-like symptoms can be a strong indicator of the infection. the covid criteria met definition was curated from observed symptoms in infected patients globally and coronavirus studies. [ ] [ ] [ ] table explains the covid criteria met definition and table explains the features we have identified and considered as a part of our study. we generated over million records that captured the above statistics and demographics with categorical information(boolean values). the description of the data generated is explained in table . clustering analysis is used on unlabelled dataset to learn different cohorts and patterns in the data. 
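as a rough illustration of how such boolean records might be generated from prevalence figures, the sketch below samples symptom and comorbidity flags independently from assumed rates and derives a stand-in "covid criteria met" flag. the feature names, prevalence values and combination rule are placeholders, not the actual statistics or criteria tables used in the study.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Placeholder prevalence values; the study derives its proportions from the
# Wuhan and New York investigations referenced in the text.
feature_prevalence = {
    "fever": 0.08, "dry_cough": 0.07, "fatigue": 0.05, "breathlessness": 0.03,
    "loss_of_taste_smell": 0.02, "diarrhea": 0.02, "travel_history": 0.04,
    "diabetes": 0.10, "hypertension": 0.15, "low_immunity": 0.05,
}

n = 100_000  # a smaller sample than the study's record count, for speed
records = pd.DataFrame({name: rng.random(n) < p for name, p in feature_prevalence.items()})

# Stand-in criteria flag combining a few core symptoms (illustrative rule only).
records["covid_criteria_met"] = records["fever"] & records["dry_cough"] & records["breathlessness"]
print(records.mean().round(4))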
the most popular clustering algorithm used in healthcare applications is k-means, which is appropriate when we have numerical, continuous clinical data [ ]. the k-means algorithm [ ] groups similar data points together and identifies the underlying patterns within them. it uses distance-based metrics to group data into k different clusters by calculating k centroids (an imaginary location that represents the center of a cluster) and assigning every data point to the nearest centroid. applying k-means to our data does not make sense, since euclidean distance is not the right distance metric for a dataset with categorical features. rather, we need to capture a dissimilarity measure between our data points. this is where we use the k-modes algorithm, a slight extension of k-means, except that it quantifies the dissimilarity between two data points rather than computing euclidean distances [ ]. the dissimilarity between a record z = (z1, ..., zn), whose categorical variables zi take values a1, a2, ..., and a cluster mode q = (q1, ..., qn) is d(z, q) = Σ_{i=1..n} δ(zi, qi), where δ(zi, qi) = 0 if zi = qi and 1 otherwise. the listing below explains the high-level algorithm for k-modes clustering.
input: data of dimension n*x, with n the number of observations and x the number of features; k: number of clusters.
step 1: randomly select k different modes from the input data.
step 2: compare each data point from the input data set to each cluster mode.
step 3: add 1 for every dissimilarity encountered and 0 for equal values, as in the dissimilarity measure above.
step 4: assign each individual to the closest mode.
step 5: recompute the mode of each feature for every cluster.
step 6: repeat steps 2 to 5 until no changes occur in the assignment of data points to clusters.
choosing the optimal k, or number of clusters, is a very important step in clustering. we applied silhouette scoring and cost analysis to arrive at our k value. the silhouette score is a measure of similarity between a data point and its own cluster [ ]. the best value is close to 1, and negative values would indicate data points assigned to wrong clusters. k is chosen based on the highest silhouette coefficient obtained after iterating through various k values. the silhouette coefficient is usually measured with the euclidean distance; given that our data has a non-gaussian, discrete distribution, we apply the hamming distance instead. following the silhouette score and cost analysis for various k values (refer to table ), we found k = to be an optimal number for our objective. visualisation of clusters is still an open research question; nevertheless, to get a fair idea of the validity of our algorithm, we applied mca (multiple correspondence analysis) to the data in order to project it onto a 2d space. multiple correspondence analysis tries to identify associations among multiple categorical variables [ ]. in the k-modes parameter selection process, init was set to huang, k was set to , and n_init was set to . table gives the statistics obtained for each cluster. the next step was to identify the risk groups among these cohorts. cohort risk identification was made based on clinical knowledge and evidence from covid-19 studies.
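a minimal sketch of this clustering workflow, using the open-source kmodes and scikit-learn packages, is shown below. the stand-in data, the candidate k range, n_init, and the silhouette subsample size are illustrative choices, not the settings used in the study.

```python
# minimal sketch of the clustering workflow described above. the data are
# random stand-ins for the synthetic boolean records; k range, n_init and the
# silhouette subsample size are placeholder choices, not the study's settings.
import numpy as np
from kmodes.kmodes import KModes
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20000, 8))        # boolean symptom/comorbidity matrix

scores = {}
for k in range(2, 7):
    km = KModes(n_clusters=k, init="Huang", n_init=5, verbose=0)
    labels = km.fit_predict(X)
    # hamming distance (fraction of mismatching features) suits categorical data
    scores[k] = silhouette_score(X, labels, metric="hamming",
                                 sample_size=5000, random_state=0)

best_k = max(scores, key=scores.get)
km = KModes(n_clusters=best_k, init="Huang", n_init=5).fit(X)
print(scores, "chosen k:", best_k)
print("cluster modes:\n", km.cluster_centroids_)
```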
from the inferences drawn from the above clusters, we assigned subjective risks, namely low, medium and high, to cluster b, cluster a and cluster c respectively. cluster c contains the symptomatic group of people exhibiting the same characteristics as those investigated and observed globally to date; hence it is identified as the high risk group that is likely to be symptomatic for covid-19. internal validation of this model is essential to gauge its accuracy and to test the sensitivity of the algorithm's ability to profile a new data point into the right risk group. to perform this validation, we analysed data from an open source database that had high level summaries of corona-positive patients at the time of detection [ ] [ ]. we used this information to simulate timelines for various scenarios from the onset of symptoms until the confirmed date. we simulated about observations for patients based on age group and symptoms developed over a timeline of to days. we essentially captured the major use cases, including covid-19 symptoms with leading indicators, covid-19 symptoms without leading indicators, and flu-like symptoms with travel history, all of these developed over and days. the objective was to identify whether the model could distinguish between normal flu and covid-19 symptoms that had lower incidence in the general population. fig. shows the results of the validation when run through the covid risk model. the model was sensitive to the covid criteria met conditions and the leading indicators, but there were a few false positives, for example cases where the risk was "high" because the model gave more weight to conditions like low immunity and travel history. this is expected to improve when the model is re-trained on a bigger real world dataset with more complex correlations and patterns. the covid research tool (c rt) is a web application developed by the team at cohere. the idea behind this web application is to collect data from individuals for our research study and track their symptoms to identify the risk of infection. this web application is for the public, and allows anybody to register and enter their symptoms at least once a day for a period of days. with prior research on covid-19 symptoms from various sources and our clinical advisory board, we curated a questionnaire that was user friendly and targeted all levels of the population. we released this application in the month of may and have collected (and are still collecting) about plus data points from users. we have a privacy and security protocol in place to handle the collected information, which is anonymised and run through the covid model in the backend. a new user registers in our application after acknowledging the consent form. the user then enters their symptoms at least once a day on the web application (which can be accessed on any digital device). this information is anonymised and run through the covid risk algorithm, which profiles the input against similar groups of people and returns the group/risk category the person belongs to. when a change in the risk trend is observed, an alert is emailed to the user. the entire application workflow is explained in fig. . the collected data, along with the risks, were validated clinically and compared against targets generated manually for each of them with clinical knowledge. the classification report and confusion matrix for this validation are shown in table and fig. respectively.
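the validation scoring step can be sketched as follows. this is a minimal illustration with randomly generated stand-in records, a hypothetical cluster-to-risk mapping, and placeholder targets; it is not the study's actual simulated timelines or clinical labels.

```python
# sketch of the internal-validation step: profile simulated records into fitted
# clusters, map clusters to risk labels, and score against manually assigned
# targets. the data, cluster-to-risk mapping and cluster count are placeholders.
import numpy as np
from kmodes.kmodes import KModes
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(1)
X_train = rng.integers(0, 2, size=(5000, 8))          # stand-in for the synthetic training records
km = KModes(n_clusters=3, init="Huang", n_init=5).fit(X_train)

cluster_to_risk = {0: "medium", 1: "low", 2: "high"}  # hypothetical mapping (cf. clusters a, b, c)

X_val = rng.integers(0, 2, size=(300, 8))                 # simulated validation records
y_true = rng.choice(["low", "medium", "high"], size=300)  # clinically assigned targets

y_pred = np.array([cluster_to_risk[c] for c in km.predict(X_val)])
print(classification_report(y_true, y_pred, labels=["low", "medium", "high"], zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=["low", "medium", "high"]))
```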
we were able to successfully validate and deploy this model. our research tool at present uses this model to display a covid-19 risk score based on the user's input. understanding covid-19 patterns in symptomatic cases is still an ongoing challenge and subject of research. although there are a few sets of rules or pointers that could indicate the presence of the infection, a purely rule-based solution would not be the right approach: rule-based decision systems tend to generate a lot of false positives and have relatively low precision. while significantly larger real world data is definitely the key to better insights and results, our objective is to mimic real world scenarios and identify patterns in order to catch these symptoms earlier; the earlier the treatment, the better the prognosis. unsupervised learning such as clustering can be a powerful analysis tool in healthcare because, in practice, clinicians often profile similar cases and conditions along with other dimensions of a patient to arrive at an informed decision. clustering mimics this concept, with a superior ability to identify patterns across a huge dataset in terms of dimensions and size. for the model to capture more complex correlations across the features and cohort patterns, our goal is to continue to collect relevant data from a larger population to improve the algorithm, learn better patterns and reveal insights that can help identify those at high risk of being infected with covid-19. this is just one part of the problem that we try to solve. a bigger challenge is identifying the asymptomatic cases that go unidentified and unnoticed, spreading widely in the population, described as community transmission by epidemiologists. this work was a self-funded project at cohere-med inc.
table . risk stratification analysis on the real world data:
cluster a (medium risk): % %
cluster b (low risk): % %
cluster c (high risk): % %
references:
generating multi-label discrete patient records using generative adversarial networks
real-valued (medical) time series generation with recurrent conditional gans
generating synthetic but plausible healthcare record datasets
the validity of synthetic clinical data: a validation study of a leading synthetic data generator (synthea) using clinical quality measures
synthetic medicare patient records
presenting characteristics, comorbidities, and outcomes among patients hospitalized with covid-19 in the new york city area
association of public health interventions with the epidemiology of the covid-19 outbreak in wuhan, china
alterations in smell or taste in mildly symptomatic outpatients with sars-cov-2 infection
clustering-aided approach for predicting patient outcomes with application to elderly healthcare in ireland
a fast clustering algorithm to cluster very large categorical data sets in data mining
extensions to the k-means algorithm for clustering large data sets with categorical values
silhouettes: a graphical aid to the interpretation and validation of cluster analysis
clustering-aided approach for predicting patient outcomes with application to elderly healthcare in ireland
the application of unsupervised clustering methods to alzheimer's disease
approach and method for generating realistic synthetic electronic healthcare records for secondary use
cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis
key: cord- - hu so authors: marsch, lisa a. title: digital health data-driven approaches to understand human behavior date: - - journal: neuropsychopharmacology doi: . /s - - - sha: doc_id: cord_uid: hu so
advances in digital technologies and data analytics have created unparalleled opportunities to assess and modify health behavior and thus accelerate the ability of science to understand and contribute to improved health behavior and health outcomes. digital health data capture the richness and granularity of individuals' behavior, the confluence of factors that impact behavior in the moment, and the within-individual evolution of behavior over time. these data may contribute to discovery science by revealing digital markers of health/risk behavior as well as to translational science by informing personalized and timely models of intervention delivery. they may also help inform the diagnostic classification of clinically problematic behavior and the clinical trajectories of diagnosable disorders over time. this manuscript provides a review of the state of the science of digital health data-driven approaches to understanding human behavior. it reviews methods of digital health assessment and sources of digital health data. it provides a synthesis of the scientific literature evaluating how digitally derived empirical data can inform our understanding of health behavior, with a particular focus on understanding the assessment, diagnosis and clinical trajectories of psychiatric disorders.
and, it concludes with a discussion of future directions and timely opportunities in this line of research and its clinical application.
overview and limitations of theoretical models of human behavior and diagnostic models of psychiatric disorders
human behavior is one of the biggest drivers of health and wellness as well as of mortality and morbidity. indeed, health risk behavior, including poor diet, physical inactivity, tobacco, alcohol, and other substance use, causes as much as % of the illness, suffering, and early death related to chronic diseases [ ] [ ] [ ]. health risk behavior is linked to obesity, type 2 diabetes [ ], heart disease, liver disease, kidney failure, and neurological diseases. it is also linked to many mental health disorders including anxiety and depression [ , ]. and it greatly increases one's risk for a wide variety of cancers. for example, heavy alcohol use greatly increases the risk of breast [ ] [ ] [ ], esophageal and upper digestive [ ], and liver cancers [ , ]. smoking is strongly linked to lung cancer and is also a major contributor to esophageal cancer [ ] [ ] [ ] [ ]. and, obesity increases the risk of colorectal and esophageal cancer [ ] [ ] [ ]. research designed to explain and predict health behavior and events influencing health outcomes has heavily relied on theoretical models of health behavior and behavior change [ , ]. at the psychological level, the cognitive literature has focused on such performance-related processes as goal maintenance in working memory, impulsivity, and cognitive homeostasis. the affective science and social psychology literatures have focused on emotion regulation processes, social influences and resource models. in parallel, the health psychology and behavioral medicine literatures have focused on processes such as self-efficacy and outcome expectancies. at the behavioral level, focus has largely been placed on behavioral disinhibition and temporal discounting. at the neural level [ ], health behavior can be conceptualized in terms of top-down control (implemented by fronto-parietal networks) over impulsive drives or habits (implemented by subcortical and ventromedial prefrontal regions [ ]). and an emerging framework from neuroeconomics has characterized decision processes in terms of goal-directed versus habitual or pavlovian control over action [ ] [ ] [ ]. overall, these models afford a conceptual framework for illustrating causal processes of key constructs hypothesized to influence or change a target behavior. theoretical models may be useful for developing, implementing, and evaluating behavior change interventions. and, interventions informed by theories of human behavior are generally more effective compared with those that are not [ ]. collectively, various theoretical models have articulated that an individual's beliefs and attitudes, behavior intentions, level of motivation for behavior change, and social and cognitive processes impact health behavior [ ]. despite the promise of theoretical models of health behavior, their ability to explain and predict health behavior has been only modestly successful [ ] [ ] [ ]. many theoretical models have regarded human behavior as linear or static in nature and have not recognized that behavior is dynamic and responsive to diverse social, biological, and environmental contexts. and, theoretical models have heavily focused on between-person differences in behavior and have not embraced the study of important within-person differences in behavior.
further, many theoretical models of health behavior and behavior change have often been derived within siloed disciplines (e.g., health psychology, neuroscience) with little crosstalk [ , ]. in addition, research examining factors that influence health behavior has tended to examine a small set of potential moderators or mediators of health behavior at a specific level of analysis (e.g., emotion regulation alone or impulsivity alone), which may lead to over-simplified accounts of behavior change [ ] [ ] [ ] [ ]. finally, little research has established the temporal precedence of a broad array of potential factors impacting health behavior [ ]. more frequent and longer assessment of moderators, mediators, and outcome(s) will be necessary to elucidate the temporal dynamics between changes in specific mechanisms [ ] and behavior [ , ]. similar limitations are evident in our current models for understanding and determining clinical diagnoses for psychological or psychiatric disorders. the current process for identifying diagnosable disorders heavily relies on measuring the number and type of symptoms that a person may be experiencing as well as associated distress or impairment. although this current diagnostic process provides a useful common language of mental disorders for clinicians, the process is largely based on consensus from expert panels and may oversimplify our understanding of human behavior [ ]. and indeed, many mental health clinicians do not measure behavior, cognition, and emotion when ascribing a psychiatric disorder to a patient. further, mental health professionals usually interact with, and provide diagnoses to, patients at a specific moment in patients' lives, but recent evidence shows that people with psychological disorders may experience many different kinds of disorders across diagnostic families over their lifespan [ ]. there is tremendous opportunity to understand psychological/biological systems that span the full range of human behavior from normal to abnormal and to empirically assess how they are situated in environmental and neurodevelopmental contexts [ ]. examining a broad array of factors impacting health behavior at multiple levels of empirical analysis and over time will enable a more comprehensive picture of health behavior and will increase our ability to develop more impactful interventions and better understand the conditions under which replications of effects do and do not occur. advances in digital technologies and data analytics have created unparalleled opportunities to assess and modify health behavior and thus accelerate the ability of science to understand and contribute to improved health behavior and health outcomes. digital health refers to the use of data captured via digital technology to measure individuals' health behavior in daily life and to provide digital therapeutic tools accessible anytime and anywhere [ , ]. for example, smartphones have an array of native sensors including bluetooth, gps, light sensor, accelerometer, microphone, and proximity sensors, as well as system logs of calls and short message service use. smartphones, as well as some wearable devices (e.g., smartwatches), thus enable passive, ecological sensing of behavioral and physiological features, such as one's sleep, physical activity, social interactions, electrodermal activity, and cardiac activity [ ].
individuals can also offer responses to questions they are prompted to answer on mobile devices (sometimes called "ecological momentary assessment" or ema) to provide snapshots into, for example, their context, social interactions, stress, pain, mood, eating, physical activity, mental health symptoms, and substance use. and, social media data, which many individuals produce in high volume, provide information about individuals' behavior, preferences, and social networks. these "digital exhaust" [ ] data or "digital footprints" [ ] enable the continuous measurement of individuals' behavior and physiology in naturalistic settings. these digital data may greatly complement and extend traditional sources of clinical data (which are typically captured on an episodic basis in a clinical context) with intensive, longitudinal, ecologically valid data. digital health data capture the richness and granularity of individuals' behavior, the confluence of factors that impact behavior in the moment, and the within-individual evolution of behavior over time. as such, they may contribute to discovery science by revealing digital markers of health and risk behavior [ , ]. they may help us to better develop empirically based diagnostic classifications of aberrant/dysfunctional behavior and the clinical trajectories of diagnosable disorders over time [ ]. and, they may help us in translational science by informing more personalized, biomarker-informed, and timely models of intervention delivery. as the majority of the world has access to digital technology (indeed, there are billion mobile phone subscriptions worldwide [ ]), digital health data-driven approaches can be used to understand human behavior across the population. this manuscript provides a review of the state of the science of digital health data-driven approaches to understanding human behavior. the manuscript first describes various methods of digital health assessment and sources of digital health data. it then provides a synthesis of the scientific literature evaluating how digitally derived empirical data can inform our understanding of health and risk behavior. it then focuses on how digital health may help us to develop a better empirically based understanding of the assessment, diagnosis, and measurement of clinical trajectories of aberrant/dysfunctional disorders in the field of psychiatry (a field that has led pioneering research in digital health [ ]). finally, it concludes with a discussion of future directions and timely opportunities in this line of research and its clinical application, including the development of personalized digital interventions (e.g., behavior change interventions) informed by digital health assessment.
digital health assessment methods
although digitally derived data have been used to understand behavior and context in the field of computer science for over years, a primary term currently used to capture digital health assessment is "digital phenotyping" [ ], which is increasingly used by scientists, funders, and the popular press. digital phenotyping [ ] primarily employs passively sensed data to allow for a moment-by-moment (in situ) quantification of behavior. these data can include data derived from smartphone or smartwatch sensors (e.g., an individual's activity, location), features of voice and speech data collected by mobile devices (e.g., prosody and sentiment), as well as data that captures a person's interaction with their mobile device (e.g., patterns of typing or scrolling).
digital phenotyping largely employs passive data (to reduce the burden on participants in data collection), and some researchers confine their definition of digital phenotyping to passive data. however, digital measurement and analytics also encompass many other sources of data that are actively generated by individuals, including social media data, ema data, and online search engine activity. overall, digital phenotyping focuses on the use of such digital data to understand and predict health outcomes of behaviors of interest. sophisticated inferences from these data are increasingly possible due to the rapidly advancing fields of big-data analytics and advanced artificial intelligence (including advanced machine learning approaches that focus on the creation of systems that learn from data instead of simply following programmed instructions). behavioral health systems that leverage passive sensing and machine learning to learn and adapt to a person's actual behavior and surroundings offer a promising foundation for predictive modeling of an individual's behavioral health trajectory and may support new breakthrough intervention technologies targeting health behavior. these developments enable behavioral monitoring to occur in the background as individuals go about their lives, and build dynamic computational models tailored to the user that can lead to effective interventions. and digital phenotyping may reveal new insights into how other data sources (such as genetic, molecular and neural circuitry data) interrelate with clinically observable psychopathology [ , ].
overview of the scientific literature on the application of digitally derived empirical data to understand health behavior and psychopathology
a robust and rapidly growing scientific literature is increasingly demonstrating the potential utility of digital assessment in revealing new insights into human behavior, including psychological and psychiatric disorders. digital health biomarkers of health and risk behavior captured via mobile technology. continuous smartphone sensing (e.g., of activity, mobility, sleep) has been shown to be significantly linked to mental well-being, academic performance (grade point average), and behavioral trends of a college student body, such as increased stress, reduced sleep, and reduced affect as the college term progresses and stress increases [ ]. these patterns may help us to understand, in close to real time, when individuals may be at risk of academic and/or mental health decline. assessments of individuals' interactions with mobile devices (e.g., swipes, taps and keystroke events) have been shown to capture neurocognitive function in the real world and may provide an ecological surrogate for laboratory-based neuropsychological assessment [ ]. and, continuous smartphone monitoring can measure brain health and cognitive impairment in daily life [ ]. and, digital data derived from mobile sensing (e.g., calling, texting, conversation and app use) have also been used to characterize behavioral sociability patterns and to map these behaviors onto personality traits [ ]. further, phenotypic data gathered via wearable sensors have shown that several metrics of sleep (total sleep time and sleep efficiency) are associated with cardiovascular disease risk markers such as waist circumference and body mass index [ ], and that insufficient sleep is linked to premature telomere attrition. thus, these digitally derived health risk data can provide real-time insights into biological aging.
digital health measurement of aberrant/dysfunctional behavior and the clinical trajectories of diagnosable disorders over time captured via mobile technology. digital assessment has also illuminated novel insights into the nature and course of psychological and psychiatric disorders. high-frequency assessment of cognition and mood via wearable devices among persons with major depressive disorder has been shown to be feasible and valid over an extended period [ ]. behavioral indicators passively collected through a mobile sensing platform (e.g., the sum of outgoing calls, a count of unique numbers texted, the dynamic variation of voice, speaking rate) have been shown to predict symptoms of depression and ptsd [ ]. features derived from gps data collected via phone sensors, including location variance, entropy, and circadian movement, have been shown to predict the severity of depressive symptoms, and these relationships can differ at different points in time (e.g., weekend vs. weekday [ ]). and assessment of voice data has identified vocal acoustic biomarkers that have shown promise in predicting treatment response among persons with depression [ ]. movement data from actigraphs alone, a single measure of gross motor activity from a sensor worn on the wrist, were able to identify the diagnostic group status of individuals with major depression or bipolar disorder vs. healthy controls % of the time. this level of accuracy in diagnostic classification is greater than published inter-rater reliability rates for second raters using the structured clinical interview for the dsm (scid). and results showed that actigraphy data predicted the majority of variation in patients' depression severity over an ~ -week period [ ]. emotion dynamics captured over time via digital technology have been shown to differentially predict bipolar and depressive symptoms concurrently and prospectively [ ]. and, ema data captured on smartphones have been shown to predict future mood among persons with bipolar disorder [ ]. in addition, smartphone usage patterns have been shown to be linked to functional brain activity related to depression. for example, phone unlock duration has been shown to be positively linked to resting-state functional connectivity between the subgenual cingulate cortex (an area understood to be involved in depression) and the ventromedial/orbitofrontal cortex [ ]. results suggest that digital biomarker data may reflect readily capturable data that relate to brain functioning. further, a small pilot study evaluated changes in mobility patterns and social behavior among persons diagnosed with schizophrenia using passively collected smartphone data. results indicated that the rate of behavioral anomalies identified in the weeks prior to a clinical relapse was markedly higher ( %) than the rate of behavioral anomalies during other periods of time [ ]. and, other research has underscored the significant variability across individuals in digital indicators of a psychotic relapse [ ], thus underscoring the multi-dimensional nature of a diagnosis of a psychotic disorder. in addition, a small series of case studies demonstrated that self-reported psychotic symptoms are linked to various behaviors (cognition scores on games) and activity levels (step count) among persons with psychotic illness.
importantly, results revealed considerable variability in the patterns in these data streams across individuals, underscoring the utility of these approaches in understanding and monitoring within-individual clinical trajectories [ ]. and other research has demonstrated that decreased variability in physical activity and noisy conditions on an inpatient psychiatric unit, captured via multimodal measurement, are associated with violent ideation among inpatients with serious mental illness [ ]. assessment of geography via passive sensing of geolocation using gps has demonstrated that drug craving, stress, and mood among persons with an opioid use disorder were predicted by exposure to visible signs of environmental disorder along a gps-derived track [ , ] (such as visible signs of poverty, violence, and drug activity). a recent digital health ema study demonstrated a stronger link between drug craving and drug use than between stress and drug use, a result that was not well-documented or understood from prior traditional clinical assessment [ ]. and, among smokers trying to quit, lapses to smoking were shown to be associated with increases in negative mood for many days (and not just hours) before a smoking lapse [ ]. these studies reveal new insights into the dynamic nature of drug use events and the confluence of factors that impact them. unfortunately, only a few studies have included randomized controlled evaluations of the clinical utility of digital phenotyping in the clinical treatment of psychological disorders. among these studies, one recent controlled study that investigated the effect of smartphone monitoring of persons with bipolar disorder did not show a statistically significant benefit on depressive or manic symptoms compared with a control group, although persons with smartphone monitoring reported higher quality of life and lower stress [ , ]. digital health measurement of health behavior captured via additional (non-mobile) data sources. in addition to data captured via mobile devices, other sources of digital data have been shown to reveal insights into human behavior. for example, social media data have provided new insights into mental health and substance use behavior. in one study, a deep-learning method was able to identify individuals' risk for substance use using content from their instagram profiles [ ]. and another evaluation demonstrated that community-generated instagram data (post captions and comments from friends or followers), when evaluated along with user-generated content (individuals' post captions and comments), were able to identify depression among individuals. other work has also demonstrated that facebook status updates can predict postpartum depression [ ] and that depression can be identified via daily variation in word sentiment analysis among twitter and facebook users [ , ]. such methods offer promise for conducting population-level risk assessments and informing population-level interventions [ ]. data from online search engine activities are another source of consumer-generated digital data that can reveal individual-level as well as population-level behavioral patterns. for example, online health-seeking behavior has been shown to predict real-world healthcare utilization [ ]. online search activity has been shown to be related to changes in the use of new substances [ ], and substance use search data have been strongly correlated with overdose deaths [ ].
and, a recent study analyzed over million google search queries across the united states related to mental health during the covid-19 global pandemic. results revealed that mental health search queries increased rapidly prior to the issuance of stay-at-home orders within states, and these searches markedly decreased after the announcement and implementation of these orders, presumably once a response/management plan was in place [ ]. overall, the existing scientific literature demonstrates a compelling "proof of concept" that digital health data can provide new insights into human behavior, including psychopathology. this line of research offers great promise for advancing our theoretical models of health behavior and informing behavior change interventions that are responsive to the dynamic nature of health behavior. the promise of digital health is particularly compelling when applied to the field of psychiatry. digital assessment allows for the continuous, empirical quantification of clinically useful digital biomarkers that can be useful in identifying and refining diagnostic processes over time. these data may also be useful as outcomes in measurement-based care. these data may help us to generate predictive models that reflect the confluence of factors, and their relations over time, that may inform when an individual may be at risk for a clinically significant event (such as a relapse or psychotic event). these methods may help detect a problem before it occurs and inform in-the-moment preventative interventions. and, given that psychiatric conditions are often chronic and recurrent, digital data captured in an intensive longitudinal manner can inform strategies for optimizing responsive and adaptive models of clinical care over time. thus, digital health offers value along a full spectrum from measurement to intervention delivery: by providing novel digital biomarkers, new insights into clinical diagnoses of psychiatric disorders, personalized intervention delivery on digital platforms, as well as digital outcome measurement over time. these multiple applications of digital health can complement one another by measuring behavior and informing interventions that are responsive to that measurement. despite the promise of digital health data-driven approaches to understanding human behavior, there remain many gaps and opportunities in the field. as noted above, most digital health research has not embraced rigorous experimental research designs. indeed, only a few trials have employed well-powered, randomized, controlled research designs that allow for causal inference about the value of digital assessment and associated data analytics in informing clinical outcomes [ ]. in addition, tremendous variability exists in the specific digital metrics being employed in digital health research, ranging from smartphone sensing data, smartwatch sensing data, and ema data to social media data and online search engine data. and within each of these categories, there is also great variability in the types of features that are being extracted and applied to clinical inference. for example, in smartphone sensing alone, some research focuses heavily on gps, other work focuses on actigraphy, while still other research focuses on movement.
the specific features and sources of digital health data (including the potential combination of multiple sources of digital data) that provide maximal precision in characterizing human behavior and behavioral disorders remain understudied as do the psychometric characteristics (e.g., validity and reliability) of such metrics [ ] . in order to realize the potential of digital health and provide the most robust and replicable results, a priority focus on experimental rigor and reproducibility is critically needed. in addition, digital health research to date has been conducted within our existing classification systems (e.g., patients with bipolar disorder or depression) which, as noted above, can be refined with digital health approaches. and most digital health research has been focused on disease-specific models (e.g., focusing on depression alone or substance use alone). the rich, granular data afforded by digital health approaches offer tremendous opportunity to transcend siloed disease-specific models of behavior and care to empirically embrace, understand, and treat the complexity and interrelatedness of behavioral patterns and clinical disorders. indeed, scientific research has demonstrated that many disorders co-occur and interrelate in meaningful ways and that these disorders evolve and change over the lifespan. digital health offers great (but yet unrealized) promise to provide a data-informed understanding of this full spectrum of health and wellness. this may include the development of an ontology of behavior that is informed by digital health data, which may enable a new understanding of co-occurring aberrant/dysfunctional behaviors and their evolution over time. and this may include digital therapeutic interventions that are responsive to the combination of needs and goals of each individual and their evolution over time. finally, much of the current research appears to ground in assumptions that digital health data will be of interest and of value to consumers, patients, and clinicians. although one could make the case that patients may value self-monitoring and feedback on their behavior and their clinical status and that clinicians may welcome actionable digital health data that can aid them in the care of patients, this may not always be the case. for example, if patients do not experience value in generating and sharing these data, they will not be inclined to do so (or to do so for any extended period of time). if providers receive large volumes of unsolicited data and/or data that do not directly inform their clinical work, they may perceive such a model to be burdensome and unhelpful. and if patients do not understand the privacy and security considerations of how their sensitive data will be handled and/or if healthcare systems do not understand data sharing/protection policies of industry vendors, this will undoubtedly impact adoption [ ] . indeed, it is possible that the current scientific literature largely reflects a subset of the population that are willing to share personal health data collected on digital devices, which may not be broadly generalizable. a broader dialog is needed to establish fundamental principles of privacy and research ethics in the digital health space. this may include establishing best practices for ensuring protections of patient privacy and sensitive information while still allowing for data to be shared between parties (e.g., patients and clinicians) in accordance with patient and provider preferences. 
and, this may include informed consent processes that are adaptive and dynamic in response to each individual's digital literacy and data sharing preferences [ ]. overall, as research and clinical application of digital measurement of behavior expand, there is an urgent need to ensure that implementation science approaches are employed to systematically assess the preferences of all the relevant digital health stakeholders and to inform models of development and deployment that have the greatest chance of scalability and sustainability. this will undoubtedly require an interdisciplinary effort across the scientific arena (including behavioral science, data science, computer science and neuroscience) as well as the digital health industry and experts in public policy. digital health and data analytics are transforming our world. and, the real-world precision assessment that digital health methods enable is providing unprecedented insights into human behavior and psychiatric disorders and can inform interventions that are personalizable and adaptive to individuals' changing needs and preferences over time. now is the moment of opportunity to embrace a systematic, rigorous, and comprehensive research agenda to realize this vision. research reported in this publication was supported by the national institute on drug abuse of the national institutes of health [grant number p da ]. the author is affiliated with pear therapeutics, inc., healthsim, llc, and square systems, inc. conflicts of interest are extensively managed by her academic institution, dartmouth college.
references:
actual causes of death in the united states health lifestyle behaviors among u.s. adults. ssm-popul health the state of us health, - : burden of diseases, injuries, and risk factors a narrative review of the effects of sugar-sweetened beverages on human health: a key global health issue sedentary behaviors and risk of depression: a meta-analysis of prospective studies sedentary behaviour and risk of anxiety: a systematic review and meta-analysis meta-analysis of studies of alcohol and breast cancer with consideration of the methodological issues epidemiology and pathophysiology of alcohol and breast cancer: update alcohol-attributable cancer deaths and years of potential life lost in the united states population based cohort study of the association between alcohol intake and cancer of the upper digestive tract alcohol and liver cancer alcohol consumption and liver cancer risk: a meta-analysis epidemiology of esophageal cancer global patterns of cancer incidence and mortality rates and trends influence of smoking cessation after diagnosis of early stage lung cancer on prognosis: systematic review of observational studies with meta-analysis cigarette smoking and lung cancer-relative risk estimates for the major histological types from a pooled analysis of case-control studies obesity and risk of colorectal cancer: a meta-analysis of studies with , events obesity and colorectal cancer risk: a meta-analysis of cohort studies overweight, obesity, and mortality from cancer in a prospectively studied cohort of u.s.
adults health behavior and health education: theory, research, and practice from theory to intervention: mapping theoretically derived behavioural determinants to behaviour change techniques the contributions of cognitive neuroscience and neuroimaging to understanding mechanisms of behavior change in addiction the neural correlates of subjective value during intertemporal choice a framework for studying the neurobiology of value-based decision making cortical substrates for exploratory decisions in humans neural computations underlying arbitration between model-based and model-free learning the role of behavioral science theory in development and implementation of public health interventions health behavior models for informing digital technology interventions for individuals with mental illness health behavior models in the age of mobile interventions: are our theories up to the task? advancing models and theories for digital behavior change interventions self-regulation in health behavior: concepts, theories, and central issues handbook of self-regulation: research, theory, and applications how can research keep up with ehealth? ten strategies for increasing the timeliness and usefulness of ehealth research theory-based processes that promote the remission of substance use disorders delineating mechanisms of change in child and adolescent therapy: methodological issues and research recommendations the search for mechanisms of change in behavioral treatments for alcohol use disorders: a commentary evaluating mechanisms of behavior change to inform and evaluate technology-based interventions. behavioral healthcare and technology: using science-based innovations to transform practice the effect of the timing and spacing of observations in longitudinal studies of tobacco and other drug use: temporal design considerations using repeated daily assessments to uncover oscillating patterns and temporally-dynamic triggers in structures of psychopathology: applications to the dsm- alternative model of personality disorders a theoretical and empirical modeling of anxiety integrated with rdoc and temporal dynamics digital phenotyping: technology for a new science of behavior longitudinal assessment of mental health disorders and comorbidities across decades among participants in the dunedin birth cohort study the nimh research domain criteria (rdoc) project: precision medicine for psychiatry mobile technology and the digitization of healthcare big data in digital healthcare: lessons learnt and recommendations for general practice passive sensing of health outcomes through smartphones: systematic review of current solutions and possible limitations critical questions for big data digital footprints: an internet society reference framework deep digital phenotyping and digital twins for precision health: time to dig deeper data-driven diagnostics and the potential of mobile artificial intelligence for digital therapeutic phenotyping in computational psychiatry ericsson mobility report digital phenotyping: a global tool for psychiatry digital phenotyping, behavioral sensing, or personal sensing: names and transparency in the digital age new dimensions and new tools to realize the potential of rdoc: digital phenotyping via smartphones and connected devices new tools for new research in psychiatry: a scalable and customizable platform to empower data driven smartphone research harnessing smartphone-based digital phenotyping to enhance behavioral and mental health proceedings of the acm international joint 
conference on pervasive and ubiquitous computing digital biomarkers of cognitive function the th acm sigkdd conference on knowledge discovery and data mining (kdd) sensing sociability: individual differences in young adults' conversation, calling, texting, and app use behaviors in daily life digital phenotyping by consumer wearables identifies sleep-associated markers of cardiovascular disease risk and biological aging wearable technology for high-frequency cognitive and mood assessment in major depressive disorder: longitudinal observational study behavioral indicators on a mobile sensing platform predict clinically validated psychiatric symptoms of mood and anxiety disorders the relationship between mobile phone location sensor data and depressive symptom severity vocal acoustic biomarkers of depression severity and treatment response digital biomarkers of mood disorders and symptom change emotion dynamics concurrently and prospectively predict mood psychopathology forecasting mood in bipolar disorder from smartphone self-assessments: hierarchical bayesian approach fusing mobile phone sensing and brain imaging to assess depression in college students relapse prediction in schizophrenia through digital phenotyping: a pilot study cross-check: integrating self-report, behavioral sensing, and smartphone use to identify digital indicators of psychotic relapse using a smartphone app to identify clinically relevant behavior trends via symptom report, cognition scores, and exercise levels: a case series use of multimodal technology to identify digital correlates of violence among inpatients with serious mental illness: a pilot study realtime tracking of neighborhood surroundings and mood in urban drug misusers: application of a new method to study behavior in its geographical context prediction of stress and drug craving ninety minutes in the future with passively collected gps data before and after: craving, mood, and background stress in the hours surrounding drug use and stressful events in patients with opioid-use disorder negative affect and smoking lapses: a prospective analysis daily electronic self-monitoring in bipolar disorder using smartphones-the monarca i trial: a randomized, placebo-controlled, single-blind, parallel group trial the effect of smartphone-based monitoring on illness activity in bipolar disorder: the monarca ii randomized controlled single-blinded trial identifying substance use risk based on deep neural networks and instagram social media data proceedings of the sigchi conference on human factors in computing systems - predicting depression from language-based emotion dynamics: longitudinal analysis of facebook and twitter status updates proceedings of the th international joint conference on artificial intelligence - exploring the utility of communitygenerated social media content for detecting depression: an analytical study on instagram from health search to healthcare: explorations of intention and utilization via query logs and user surveys can big data predict the rise of novel drug abuse? public interest in medicationassisted treatment for opioid used disorder in the united states flattening the mental health curve: covid- stay-at-home orders are associated with alterations in mental health search behavior in the united states digital phenotyping: hype or hope? 
validating digital phenotyping technologies for clinical use: the critical importance of "resolution" digital technologies in psychiatry: present and future from return of information to return of value: ethical considerations when sharing individual-level research data
key: cord- -slq aou authors: safta, cosmin; ray, jaideep; sargsyan, khachik title: characterization of partially observed epidemics through bayesian inference: application to covid- date: - - journal: comput mech doi: . /s - - -z sha: doc_id: cord_uid: slq aou
we demonstrate a bayesian method for the "real-time" characterization and forecasting of a partially observed covid-19 epidemic. characterization is the estimation of infection spread parameters using daily counts of symptomatic patients. the method is designed to help guide medical resource allocation in the early epoch of the outbreak. the estimation problem is posed as one of bayesian inference and solved using a markov chain monte carlo technique. the data used in this study were sourced before the arrival of the second wave of infection in july . the proposed modeling approach, when applied at the country level, generally provides accurate forecasts at the regional, state and country level. the epidemiological model detected the flattening of the curve in california after public health measures were instituted. the method also detected different disease dynamics when applied to specific regions of new mexico. in this paper, we formulate and describe a data-driven epidemiological model to forecast the short-term evolution of a partially-observed epidemic, with the aim of helping estimate and plan the deployment of medical resources and personnel. it also allows us to forecast, over a short period, the stream of patients seeking medical care, and thus estimate the demand for medical resources. it is meant to be used in the early days of the outbreak, when data and information about the pathogen and its interaction with its host population are scarce. the model is simple and makes few demands on our epidemiological knowledge of the pathogen. the method is cast as one of bayesian inference of the latent infection rate (number of people infected per day), conditioned on a time-series of daily new cases.
these difficulties are further amplified if the pathogen is novel, and its myriad presentations in a human population is not fully known. in such a case, the various stages of the disease (e.g., prodrome, symptomatic etc.), and the residence times in each, are unknown. further, the patterns of life are expected to change over time as the virulence of the pathogen becomes known and medical countermeasures are put in place. in addition, to be useful, the model must provide its forecasts and results in a timely fashion, despite imperfect knowledge about the efficacy of the countermeasures and the degree of adherence of the population to them. these requirements point towards a simple model that does not require much information or knowledge of the pathogen and its behavior to produce its outputs. in addition, it suggests an inferential approach conditioned on an easily obtained/observed marker of the progression of the outbreak (e.g., the time-series of daily new cases), even though the quality of the observations may leave much to be desired. in keeping with these insights into the peculiarities of forecasting during the early epoch, we pose our method as one of bayesian inference of a parametric model of the latent infection rate (which varies over time). this infection rate curve is convolved with the probability density function (pdf) of the incubation period of the disease to produce an expression for the time-series of newly symptomatic cases, an observable that is widely reported as "daily new cases" by various data sources [ , , ] . a markov chain monte carlo (mcmc) method is used to construct a distribution for the parameters of the infection rate curve, even under an imperfect knowledge of the incubation period's pdf. this uncertain infection rate curve, which reflects the lack of data and the imperfections of the epidemiological model, can be used to provide stochastic, short-term forecasts of the outbreak's evolution. the reliance on the daily new cases, rather than the timeseries of fatalities (which, arguably, has fewer uncertainties in it) is deliberate. fatalities are delayed and thus are not a timely source of information. in addition, in order to model fatalities, the presentation and progress of the disease in an individual must be completely known, a luxury not available for a novel pathogen. our approach is heavily influenced by a similar effort undertaken in the late s to analyze and forecast the progress of aids in san francisco [ ] , with its reliance on simplicity and inference, though the formulation of our model is original, as is the statistical method used in the inference. there have been many attempts at modeling the behavior of covid- , most of which have forecasting as their primary aim. our ignorance of its behavior in the human population is evident in the choice of modeling techniques used for the purpose. time-series methods such as arima [ , ] and logistic regression for cumulative time-series [ ] have been used extensively, as have machine-learning methods using long short-term memory models [ , ] and autoencoders [ ] . these approaches do not require any disease models and focus solely on fitting the data, daily or cumulative, of new cases as reported. ref. [ ] contains a comprehensive summary of various machinelearning methods used to "curve-fit" covid- data and produce forecasts. approaches that attempt to embed disease dynamical models into their forecasting process have also be explored, usually via compartmental seir models or their extensions. 
compartmental models represent the progress of the disease in an individual via a set of stages with exponentially distributed residence times, and predict the size of the population in each of the stages. these mechanistic models are fitted to data to infer the means of the exponential distributions, using mcmc [ ] and ensemble kalman filters (or modifications) [ , , ]. less common disease modeling techniques, such as agent-based simulations [ ], modeling of infection and epidemiological processes as statistical ones [ ] and the propagation of epidemics on a social network [ ], have also been explored, as have methods that include proxies of population mixing (e.g., using google mobility data [ ]). there is also a group of modeling teams that submit their epidemiological forecasts regarding the covid-19 pandemic to the cdc; details can be found at their website [ ]. apart from forecasting and assisting in resource allocation, data-driven methods have also been used to assess whether countermeasures have been successful. for example, by replacing a time-series of daily new cases with a piecewise linear approximation, the author in ref. [ ] showed that the lockdown in india did not have a significant effect on "flattening the curve". we perform a similar analysis later, for the shelter-in-place orders implemented in california in mid-march, . efforts to develop metrics, derived from observed time-series of cases, that could be used to monitor countermeasures' efficacy and trigger decisions also exist [ ]. there have also been studies to estimate the unrecorded cases of covid-19 by computing excess cases of influenza-like-illness versus previous years' numbers [ ]. estimates of resurgence of the disease as nations come out of self-quarantine have also been developed [ ]. some modeling and forecasting efforts have played an important role in guiding policy makers when responding to the pandemic. the first covid-19 forecasts, which led to serious considerations of imposing restrictions on the mixing of people in the united kingdom and the usa, were generated by a team from imperial college, london [ ]. influential covid-19 forecasts for the usa were generated by a team from the university of washington, seattle [ ], and were used to estimate the demand for medical resources [ ]. these forecasts have also been compared to actual data, once they became available [ ], an assessment that we also perform in this paper. adaptations of the university of washington model, which include mobility data to assess changes in population mixing, have also been developed [ ], showing enduring interest in using models and data to understand, predict and control the pandemic. figure shows a schematic of the overall workflow developed in this paper. the epidemiological model is formulated in sect. , with postulated forms for the infection rate curve and the derivation of the prediction for daily new cases; we also discuss a filtering approach that is applied to the data before using it to infer model parameters. in sect. we describe the "error model" and the statistical approach used to infer the latent infection rate curve, and to account for the uncertainties in the incubation period distribution. results, including push-forward posteriors and posterior predictive checks, are presented in sect. , and we conclude in sect. . the appendix includes a presentation of the data sources used in this paper.
we present here an epidemiological model to characterize and forecast the rate at which people turn symptomatic from the disease over time. for the purpose of this work, we assume that once people develop symptoms, they have ready access to medical services and can be diagnosed quickly. from this perspective, these forecasts represent a lower bound on the actual number of people infected with covid-19, as the people currently infected, but still incubating, are not accounted for. a fraction of the infected population might also exhibit minor or no symptoms at all and might not seek medical advice. therefore, these cases will not be part of patient counts released by health officials. the epidemiological model consists of two main components: an infection rate model, presented in sect. . , and an incubation rate model, described in sect. . . these models are combined, through a convolution presented in sect. . , into a forecast of the number of cases that turn symptomatic daily. these forecasts are compared to data presented in sect. . and the appendix. the infection rate represents the probability that an individual who will eventually be affected during the epidemic becomes infected at a specific time following the start of the epidemic [ ]. we approximate the rate of infection with a gamma distribution with unknown shape parameter k and scale parameter θ. depending on the choice for the pair (k, θ) this distribution can display a sharp increase in the number of people infected followed by a long tail, a dynamic that could lead to significant pressure on medical resources. alternatively, the model can also capture weaker gradients ("flattening the curve"), equivalent to public health efforts to temporarily increase social separation and thus reduce the pressure on available medical resources. the infection rate model is given by the gamma pdf f(t; k, θ) = t^(k−1) exp(−t/θ) / (Γ(k) θ^k), with k and θ strictly positive. figure shows example infection rate models for several shape and scale parameter values. the time in this figure is referenced with respect to the start of the epidemic. most of the results presented in this paper employ a lognormal incubation distribution for covid-19 [ ]. the nominal and % confidence interval values for the mean, μ, and standard deviation, σ, of the natural logarithm of the incubation model are provided in table . the pdf, f_ln, and cumulative distribution function (cdf), F_ln, of the lognormal distribution are given by f_ln(t; μ, σ) = exp(−(ln t − μ)² / (2σ²)) / (t σ √(2π)) and F_ln(t; μ, σ) = [1 + erf((ln t − μ) / (σ √2))] / 2. to ascertain the impact of limited sample size on the uncertainty of μ and σ, we analyze their theoretical distributions and compare them with the data in table . let μ̂ and σ̂ be the mean and standard deviation computed from a set of n samples of the natural logarithm of the incubation rate random variable. it follows that (μ̂ − μ) / (σ̂/√n) has a student's t-distribution with (n − 1) degrees of freedom. to model the uncertainty in σ̂ we assume that (n − 1) σ̂² / σ² has a χ² distribution with (n − 1) degrees of freedom. while the data in [ ] are based on n = confirmed cases, we found the corresponding % cis for μ and σ computed based on the student's t and chi-square distributions assumed above to be narrower than the ranges provided in table . instead, to construct uncertain models for these statistics, we employed a number of degrees of freedom n* = that provided the closest agreement, in an l -sense, to the % ci in the reference.
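to make the infection-rate and incubation models concrete, the sketch below evaluates a gamma infection-rate curve and draws uncertain lognormal incubation parameters from the student's t and chi-square models described above. it is a minimal illustration, not the authors' code; the parameter values (k, θ, the nominal μ and σ, and the effective number of degrees of freedom) are placeholders chosen for demonstration.

```python
import numpy as np
from scipy import stats

# --- infection rate model: gamma pdf with shape k and scale theta (placeholder values) ---
k, theta = 3.0, 10.0              # assumed values for illustration only
t = np.arange(0.0, 120.0, 1.0)    # days since the start of the epidemic
infection_rate = stats.gamma.pdf(t, a=k, scale=theta)

# --- uncertain lognormal incubation model ---
# nominal mean and standard deviation of log(incubation period), placeholder values
mu_nom, sigma_nom = 1.62, 0.42
n_eff = 36                        # assumed effective number of degrees of freedom

def sample_incubation_params(rng):
    """Draw one (mu, sigma) pair: Student's t for mu, chi-square for sigma."""
    mu = mu_nom + sigma_nom / np.sqrt(n_eff) * rng.standard_t(df=n_eff - 1)
    sigma = sigma_nom * np.sqrt((n_eff - 1) / rng.chisquare(df=n_eff - 1))
    return mu, sigma

rng = np.random.default_rng(0)
mu, sigma = sample_incubation_params(rng)
# fraction of infected people who have completed incubation by day 5, for this draw
frac_symptomatic_day5 = stats.lognorm.cdf(5.0, s=sigma, scale=np.exp(mu))
print(frac_symptomatic_day5)
```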
the left frame in fig. shows the family of pdfs with μ and σ drawn from the student's t and χ² distributions, respectively. the nominal incubation pdf is shown in black in this frame. the impact of the uncertainty in the incubation model parameters is displayed in the right frame of this figure. for example, days after infection, there is a large variability ( - %) in the fraction of infected people that have completed the incubation phase and started displaying symptoms. this variability decreases at later times, e.g. after days more than % of cases have completed the incubation process. in the results section we will compare results obtained with the lognormal incubation model with results based on other probability distributions. again, we turn to [ ], which provides parameter values corresponding to gamma, weibull, and erlang distributions. with these assumptions, the number of people infected and with a completed incubation period at time t_i can be written as a convolution between the infection rate and the cumulative distribution function of the incubation distribution [ , , ], N(t_i) = n ∫_{t_0}^{t_i} f(τ − t_0; k, θ) F_ln(t_i − τ; μ, σ) dτ, where n is the total number of people that will be infected throughout the epidemic and t_0 is the start time of the epidemic. this formulation assumes independence between the calendar date of the infection and the incubation distribution. using eq. ( ), the number of people developing symptoms between times t_{i−1} and t_i is computed as n_i = N(t_i) − N(t_{i−1}), where the difference of incubation cdfs can be approximated using the lognormal pdf, leading to n_i ≈ n (t_i − t_{i−1}) ∫_{t_0}^{t_i} f(τ − t_0; k, θ) f_ln(t_i − τ; μ, σ) dτ, where f_ln is the lognormal pdf. the results presented in sect. compute the number of people that turn symptomatic daily in this manner.
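as an illustration of the convolution described above, the sketch below combines a gamma infection-rate curve with a lognormal incubation distribution to produce the expected number of new symptomatic cases per day. it is a schematic re-implementation under assumed parameter values (the total number infected, t_0, k, θ, μ, σ), not the authors' code.

```python
import numpy as np
from scipy import stats

def daily_symptomatic_counts(days, n_total, t0, k, theta, mu, sigma):
    """Expected number of people turning symptomatic on each day in `days`.

    Convolves the gamma infection-rate pdf with the lognormal incubation pdf
    on a daily grid (a simple rectangle-rule approximation of the integral).
    """
    counts = []
    for t_i in days:
        tau = np.arange(t0, t_i + 1.0, 1.0)              # infection times up to day t_i
        infection = stats.gamma.pdf(tau - t0, a=k, scale=theta)
        incubation = stats.lognorm.pdf(t_i - tau, s=sigma, scale=np.exp(mu))
        counts.append(n_total * np.sum(infection * incubation))  # daily grid, dt = 1
    return np.array(counts)

# assumed parameter values, for illustration only
days = np.arange(0.0, 90.0)
n_i = daily_symptomatic_counts(days, n_total=50_000, t0=0.0,
                               k=3.0, theta=10.0, mu=1.62, sigma=0.42)
print(n_i[:10])
```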
the number of people developing symptoms daily, n_i, computed through eqs. ( ) or ( ), is compared to data obtained from several sources at the national, state, or regional levels. we present the data sources in the appendix. we found that, for some states or regions, the reported daily counts exhibited a significant amount of noise. this is caused by variation in testing capabilities and sometimes by how data are aggregated from region to region and the time of the day when they are made available. sometimes previously undiagnosed cases are categorized as covid-19 and reported on the current day instead of being allocated to the original date. we employ a set of finite difference filters [ , ] that preserve low wavenumber information, i.e. weekly or monthly trends, and reduce high wavenumber noise, e.g. large day-to-day variability such as all cases for several successive days being reported at the end of that time range. the filtered data ŷ are obtained from the original data y through a linear transformation involving the identity matrix i and a band-diagonal matrix d, e.g. tridiagonal or pentadiagonal, corresponding to a 2nd- or a 4th-order filter, respectively. we compared 2nd- and 4th-order filters and did not observe any significant difference between the filtered results. reference [ ] provides d matrices for higher-order filters. time series of y and ŷ for several regions are presented in the appendix. for the remainder of this paper we will only use filtered data to infer epidemiological parameters. for notational convenience, we will drop the hat and refer to the filtered data as y. note that all the data used in this study predate june , (in fact, most of the studies use data gathered before may , ), when covid-19 tests were administered primarily to symptomatic patients. thus the results and inferences presented in this paper apply only to the symptomatic cohort who seek medical care, and who thus pose the demand for medical resources. the data are also bereft of any information about the "second wave" of infections that affected the southern and western usa in late june, [ ]. given data, y, in the form of a time-series of daily counts, as shown in sect. . , and the model predictions n_i for the number of new symptomatic counts daily, presented in sect. , we will employ a bayesian framework to calibrate the epidemiological model parameters. the discrepancy between the data and the model is written as ε = y − n(Θ), where y and n are arrays containing the data and the model predictions, respectively. here, d is the number of data points, the model parameters are grouped as Θ = {t_0, n, k, θ}, and ε represents the error model and encapsulates, in this context, both errors in the observations as well as errors due to imperfect modeling choices. the observation errors include variations due to testing capabilities as well as errors when tests are interpreted. values for the vector of parameters Θ can be estimated in the form of a multivariate pdf via bayes' theorem, p(Θ | y) ∝ p(y | Θ) p(Θ), where p(Θ | y) is the posterior distribution we are seeking after observing the data y, p(y | Θ) is the likelihood of observing the data y for a particular value of Θ, and p(Θ) encapsulates any prior information available for the model parameters. bayesian methods are well-suited for dealing with heterogeneous sources of uncertainty, in this case from our modeling assumptions, i.e. model and parametric uncertainties, as well as the communicated daily counts of covid-19 new cases, i.e. experimental errors. in this work we explore both deterministic and stochastic formulations for the incubation model. in the former case the mean and standard deviation of the incubation model are fixed at their nominal values and the model prediction n_i for day t_i is a scalar value that depends on Θ only. in the latter case, the incubation model is stochastic, with the mean and standard deviation of its natural logarithm treated as student's t and χ² random variables, respectively, as discussed in sect. . . let us denote the underlying independent random variables by ξ = {ξ_μ, ξ_σ}. the model prediction n_i(ξ) is now a random variable induced by ξ plugged into eq. ( ), and n(ξ) is a random vector. we explore two formulations for the statistical discrepancy between n and y. in the first approach we assume ε has a zero-mean multivariate normal (mvn) distribution. under this assumption the likelihood p(y | Θ) for the deterministic incubation model can be written as a multivariate gaussian density with mean n(Θ) and covariance matrix c_n. the covariance matrix c_n can in principle be parameterized, e.g. with square exponential or matern models, and the corresponding parameters inferred jointly with Θ. however, given the sparsity of the data, we neglect correlations across time and presume a diagonal covariance matrix with diagonal entries σ_i² that combine an additive component, σ_a, and a multiplicative component, σ_m, proportional to the model prediction n_i(Θ). the additive and multiplicative components will be inferred jointly with the model parameters Θ; here, we infer the logarithms of these parameters to ensure they remain positive. under these assumptions, the mvn likelihood in eq. ( ) is written as a product of independent gaussian densities, p(y | Θ) = ∏_i (2π σ_i²)^(−1/2) exp(−(y_i − n_i(Θ))² / (2σ_i²)), where σ_i is given by eq. ( ). in sect. . we will compare results obtained using only the additive part σ_a, i.e. fixing σ_m = 0, of eq. ( ) with results using both the additive and multiplicative components.
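the sketch below shows what the additive-plus-multiplicative gaussian error model could look like in code: the standard deviation for each day combines an additive term and a term proportional to the model prediction, and the log-likelihood is a sum of independent gaussian terms. this is a hedged illustration; the function and variable names (`log_likelihood_gaussian`, `sigma_a`, `sigma_m`) and the exact way the two components are combined are our assumptions, not the authors' implementation.

```python
import numpy as np

def log_likelihood_gaussian(y, n_model, sigma_a, sigma_m):
    """Log-likelihood of observed daily counts y under independent Gaussian errors.

    y        : observed (filtered) daily symptomatic counts
    n_model  : model predictions n_i(Theta) for the same days
    sigma_a  : additive error component
    sigma_m  : multiplicative error component
    """
    sigma = sigma_a + sigma_m * n_model          # assumed form of the diagonal covariance entries
    resid = y - n_model
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * (resid / sigma) ** 2)
```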
the second approach assumes a negative-binomial distribution for the discrepancy between the data and the model predictions. the negative-binomial distribution is used commonly in epidemiology to model overly dispersed data, e.g. in cases where the standard deviation exceeds the mean [ ]. this is observed for most regions explored in this report, in particular for the second half of april and the first half of may. for this modeling choice, the likelihood of observing the data given a choice for the model parameters is given by p(y | Θ, α) = ∏_i C(y_i + α − 1, y_i) (α / (α + n_i(Θ)))^α (n_i(Θ) / (α + n_i(Θ)))^{y_i}, where α > 0 is the dispersion parameter and C(·, ·) is the binomial coefficient. for simulations employing a negative-binomial distribution of discrepancies, the logarithm of the dispersion parameter α (to ensure it remains positive) will be inferred jointly with the other model parameters. for the stochastic incubation model the likelihood reads as p(y | Θ) = π_{n(Θ),ξ}(y), which we simplify by assuming independence of the discrepancies between different days, arriving at p(y | Θ) = ∏_i π_{n_i(Θ),ξ}(y_i). unlike the deterministic incubation model, the likelihood elements for each day, π_{n_i(Θ),ξ}(y_i), are not analytically tractable anymore since they now incorporate contributions from ξ, i.e. from the variability of the parameters of the incubation model. one can evaluate the likelihood via kernel density estimation by sampling ξ for each sample of Θ, and combining these samples with samples of the assumed discrepancy ε, in order to arrive at an estimate of π_{n_i(Θ),ξ}(y_i). in fact, by sampling a single value of ξ for each sample of Θ, one achieves an unbiased estimate of the likelihood π_{n_i(Θ),ξ}(y_i), and given the independent-component assumption, it also leads to an unbiased estimate of the full likelihood π_{n(Θ),ξ}(y). a markov chain monte carlo (mcmc) algorithm is used to sample from the posterior density p(Θ | y). mcmc is a class of techniques that allows sampling from a posterior distribution by constructing a markov chain that has the posterior as its stationary distribution. in particular, we use a delayed-rejection adaptive metropolis (dram) algorithm [ ]. we have also explored additional algorithms, including transitional mcmc (tmcmc) [ , ] as well as ensemble samplers [ ] that allow model evaluations to run in parallel as well as the sampling of multi-modal posterior distributions. as we revised the model implementation, the computational expense was reduced by approximately two orders of magnitude, and all results presented in this report are based on posterior sampling via dram. a key step in mcmc is the accept-reject mechanism via the metropolis-hastings algorithm. each sample Θ_{i+1}, drawn from a proposal q(· | Θ_i), is accepted with probability min{1, p(Θ_{i+1} | y) q(Θ_i | Θ_{i+1}) / [p(Θ_i | y) q(Θ_{i+1} | Θ_i)]}, where p(Θ_i | y) and p(Θ_{i+1} | y) are the values of the posterior pdf evaluated at samples Θ_i and Θ_{i+1}, respectively. in this work we employ symmetrical proposals, q(Θ_i | Θ_{i+1}) = q(Θ_{i+1} | Θ_i). this is a straightforward application of mcmc for the deterministic incubation model. for the stochastic incubation model, we employ the unbiased estimate of the approximate likelihood as described in the previous section. this is the essence of the pseudo-marginal mcmc algorithm [ ], guaranteeing that the accepted mcmc samples correspond to the posterior distribution. in other words, at each mcmc step we draw a random sample ξ from its distribution, and then we estimate the likelihood in a way similar to the deterministic incubation model, in eqs. ( ) or ( ). figure shows samples corresponding to a typical mcmc simulation to sample the posterior distribution of Θ. we used the raftery-lewis diagnostic [ ] to determine the number of mcmc samples required for converged statistics corresponding to stationary posterior distributions for Θ.
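a minimal random-walk metropolis sampler with a symmetric proposal, as a sketch of the accept-reject mechanism described above. the authors use the more elaborate dram algorithm; this simplified version is ours, and `log_posterior` stands for the sum of the log-likelihood and the log-prior.

```python
import numpy as np

def metropolis(log_posterior, theta0, prop_std, n_steps, rng=None):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    logp = log_posterior(theta)
    chain = []
    for _ in range(n_steps):
        proposal = theta + prop_std * rng.standard_normal(theta.shape)
        logp_prop = log_posterior(proposal)
        # symmetric proposal: acceptance probability is min(1, p(proposal|y) / p(theta|y))
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = proposal, logp_prop
        chain.append(theta.copy())
    return np.array(chain)
```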
the required number of samples is of the order o( − ) depending on the geographical region employed in the inference. the resulting effective sample size [ ] varies between and , samples depending on each parameter, which is sufficient to estimate joint distributions for the model parameters. figure displays d and d joint marginal distributions based on the chain samples shown in the previous figure. these results indicate strong dependencies between some of the model parameters, e.g. between the start of the epidemic, t_0, and the shape parameter, k, of the infection rate model. this was somewhat expected based on the evolution of the daily counts of symptomatic cases and the functional form that couples the infection rate and incubation models. the number of samples in the mcmc simulations is tailored to capture these dependencies. we will employ both pushed-forward distributions and bayesian posterior-predictive distributions [ ] to assess the predictive skill of the proposed statistical model of the covid-19 disease spread. the schematic in eq. ( ) illustrates the process used to generate push-forward posterior estimates: {Θ^(1), ..., Θ^(m)} → {y^(pf,1), ..., y^(pf,m)} → p_pf(y^(pf) | y). here, y^(pf) denotes hypothetical data y and p_pf(y^(pf) | y) denotes the push-forward probability density of the hypothetical data y^(pf) conditioned on the observed data y. we start with samples from the posterior distribution p(Θ | y). the pushed-forward posterior does not account for the discrepancy between the data y and the model predictions n, subsumed into the definition of the error model presented in eqs. ( ) and ( ). the bayesian posterior-predictive distribution, defined in eq. ( ), is computed by marginalization of the likelihood over the posterior distribution of the model parameters Θ: p_pp(y^(pp) | y) = ∫ p(y^(pp) | Θ) p(Θ | y) dΘ. in practice, we estimate p_pp(y^(pp) | y) through sampling, because analytical estimates are not usually available. the sampling workflow is similar to the one shown in eq. ( ). after the model evaluations y = n(Θ) are completed, we add random noise consistent with the likelihood model settings presented in sect. . . the resulting samples are used to compute summary statistics for p_pp(y^(pp) | y). the push-forward and posterior-predictive distribution workflows can be used in hindcast mode, to check how well the model follows the data, and for short-term forecasts of the spread dynamics of this disease. in the hindcast regime, the infection rate is convolved with the incubation rate model to generate statistics for y^(pp) (or y^(pf)) that are compared against y, the data used to infer the model parameters. the same functional form can be used to generate statistics for y^(pp) (or y^(pf)) beyond the set of dates for which data were available. we limit these forecasts to - days as our infection rate model does not account for changes in social dynamics that can significantly impact the epidemic over a longer time range.
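the difference between push-forward and posterior-predictive samples can be sketched as follows: push-forward samples are obtained by running the model on posterior draws, while posterior-predictive samples additionally add noise drawn from the error model. the helper names and the gaussian noise form are assumptions for illustration, not the authors' code.

```python
import numpy as np

def push_forward(posterior_thetas, model):
    """Evaluate the model at each posterior sample (no observation noise added)."""
    return np.array([model(theta) for theta in posterior_thetas])

def posterior_predictive(posterior_thetas, model, sigma_a, sigma_m, rng=None):
    """Add Gaussian noise, consistent with the assumed error model, to each model run."""
    rng = rng or np.random.default_rng()
    y_pf = push_forward(posterior_thetas, model)
    sigma = sigma_a + sigma_m * y_pf             # assumed additive + multiplicative error
    return y_pf + sigma * rng.standard_normal(y_pf.shape)
```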
the statistical models described above are calibrated using data available at the country, state, and regional levels, and the calibrated model is used to gauge the agreement between the model and the data and to generate short-term forecasts, typically - days ahead. first, we will assess the predictive capabilities of these models for several modeling choices; we will then present results exploring the epidemiological dynamics at several geographical scales in sect. . . the push-forward and posterior-predictive figures presented in this section show the data used to calibrate the epidemiological model with filled black circles. the shaded color region illustrates either the pushed-forward posterior or the posterior-predictive distribution, with darker colors near the median and lighter colors near the low and high quantile values. the blue colors correspond to the hindcast dates and the red colors correspond to forecasts. the inter-quartile range is marked with green lines and the % confidence interval with dashed lines. some of the plots also show data collected at a later time, with open circles, to check the agreement between the forecast and the observed number of cases after the model has been calibrated. we start the analysis with an assessment of the impact of the choice of the family of distributions on the model prediction. the left frame of fig. shows the median (with red lines and symbols) and the % ci (with blue/magenta lines) for the new daily cases based on lognormal, gamma, weibull, and erlang distributions for the incubation model. the mean and standard deviation of the natural logarithm of the associated lognormal random variable, and the shape and scale parameters for the other distributions, are available in the appendix table from reference [ ]. the results for all four incubation models are visually very close. this observation holds for other simulations at the national/state/regional levels (results not shown). the results presented in the remainder of this paper are based on lognormal incubation models. the right frame in fig. presents the corresponding infection rate curve that resulted from the model calibration. this represents a lower bound on the true number of infected people, as our model will not capture the asymptomatic cases or the population that displays minor symptoms and did not seek medical care. next, we analyze the impact of the choice of deterministic vs stochastic incubation models on the model prediction. first we ran our model using the lognormal incubation model with the mean and standard deviation fixed at their nominal values in table . we then used the same dataset to calibrate the epidemiological model which employs an incubation rate with uncertain mean and standard deviation as described in sect. . . these results are labeled "deterministic" and "stochastic", respectively, in fig. . this figure shows results based on data corresponding to the united states. the choice of deterministic vs stochastic incubation models produces very similar outputs. the results shown in the right frame of fig. indicate a relatively wide spread, between . and . with a nominal around . , of the fraction of people that complete the incubation and start exhibiting symptoms days after infection. next, we explore results based on either ae or a + me formulations for the statistical discrepancy between the epidemiological model and the data. this choice impacts the construction of the covariance matrix for the gaussian likelihood model in eq. ( ). for ae we only infer σ_a, while for a + me we infer both σ_a and σ_m. the ae results in fig. a are based on the same dataset as the a + me results in fig. b. both formulations present advantages and disadvantages when attempting to model daily symptomatic cases that span several orders of magnitude. the ae model, in fig. a, presents a posterior-predictive range around the peak region that is consistent with the spread in the data. however, the constant σ = σ_a over the entire date range results in much wider uncertainties predicted by the model at the onset of the epidemic.
the a + me model handles the discrepancy better overall as the multiplicative error term allows it to adjust the uncertainty bound with the data. nevertheless, this model results in a wider uncertainty band than warranted by the data near the peak region. these results indicate that a formulation for an error model that is time dependent can improve the discrepancy between the covid-19 data and the epidemiological model. we briefly explore the difference between the pushed-forward posterior, in fig. c, and the posterior-predictive data, in fig. b. these results show that uncertainties in the model parameters alone are not sufficient to capture the spread in the data. this observation suggests more work is needed on augmenting the epidemiological model with embedded components that can explain the spread in the data without the need for external error terms. the negative binomial distribution is used commonly in epidemiology to model overly dispersed data, e.g. in cases where the variance exceeds the mean [ ]. we also observe similar trends in some of the covid-19 datasets. figure shows results based on data for alaska. the results based on the two error models are very similar, with the negative binomial results (on the top row) offering a slightly wider uncertainty band to better cover the data dispersion. nevertheless, results are very similar, as they are for other regions that exhibit a similar number of daily cases, typically less than a few hundred. for regions with a larger number of daily cases, the likelihood evaluation was fraught with errors due to the evaluation of the negative binomial. we therefore shifted our attention to the gaussian formulation which offers a more robust evaluation for this problem. in this section we examine forecasts based on data aggregated at country, state, and regional levels, and highlight similarities and differences in the epidemic dynamics resulting from these datasets. the data in fig. illustrate the built-in delay in the disease dynamics due to the incubation process. a stay-at-home order was issued on march . given the incubation rate distribution, it takes about days for - % of the people infected to start showing symptoms. after the stay-at-home order was issued, the number of daily cases continued to rise because of infections that occurred before march . the data begin to flatten out in the first week of april and the model captures this trend a few days later, april - . the data corresponding to april - show an increased dispersion. to capture this increased noise, we switched from an ae model to an a + me model, with results shown in fig. . turning to new mexico, the data for the central region (fig. b) show a smaller daily count compared to the nw region. the epidemiological model captures the relatively large dispersion in the data for both regions. for nm-c the first cases are recorded around march and the model suggests the peak was reached around mid-april, while nm-nw starts about a week later, around march , but records approximately twice as many daily cases when it reaches the peak in the first half of may. both regions display declining cases as of late may. comparing the californian and new mexican results, it is clear that the degree of scatter in the new mexico data is much larger and adversely affects the inference, the model fit and the forecast accuracy.
the reason for this scatter is unknown, but the daily numbers for new mexico are much smaller than california's and are affected by individual events, e.g., the detection of transmission in a nursing home or a community. this is further accentuated by the fact that new mexico is a sparsely populated region where sustained transmission, resulting in smooth curves, is largely impossible outside its few urban communities. this section discusses an analysis of the aggregate data from all us states. the posterior-predictive results shown in fig. a-d suggest the peak in the number of daily cases was reached around mid-april. nevertheless the model had to adjust the downward slope as the number of daily cases has been declining at a slower pace compared to the time window that immediately followed the peak. as a result, the prediction for the total number of people, n, that would be infected in the us during this first wave of infections has been steadily increasing, as the results in fig. e show. we conclude our analysis of the proposed epidemiological model with available daily symptomatic cases pertaining to germany, italy, and spain, in figs. , and . for germany, the uncertainty range increases while the epidemic is winding down, as the data have a relatively large spread compared to the number of daily cases recorded around mid-may. this reinforces an earlier point about the need to refine the error model with a time-dependent component. for spain, a brief initial downslope can be observed in early april, also evident in the filtered data presented in fig. b. this, however, was followed by an upward shift in the reported counts, resulting in an overly-dispersed dataset and a wide uncertainty band for spain. forecasts based on daily symptomatic cases reported for italy, in fig. , exhibit an upward shift observed around april - , similar to the data for spain above. the subsequent forecasts display narrower uncertainty bands compared to other similar forecasts above, possibly due to the absence of hotspots and/or regular data reporting. figures , and show inferences and forecasts obtained using data available till mid-may, . they indicate that the outbreak was dying down, with forecasts of daily new cases trending down. in early june, public health measures to restrict population mixing were curtailed, and by mid-july, both california and the us were experiencing an explosive increase in new cases of covid-19 being detected every day, quite at variance with the forecasts in the figures. this was due to the second wave of infections caused by enhanced population mixing. the model in eq. ( ) cannot capture the second wave of infections due to its reliance on a unimodal infection curve n f(τ − t_0; k, θ). this was by design, as the model is meant to be used early in an outbreak, with public health officials focussing on the first wave of infections. however, it can be trivially extended with a second infection curve to yield an augmented equation, n_i^(aug) = n^[1] ∫_{t_0}^{t_i} f(τ − t_0; k^[1], θ^[1]) f_ln(t_i − τ; μ, σ) dτ + n^[2] ∫_{t_0+Δt}^{t_i} f(τ − (t_0 + Δt); k^[2], θ^[2]) f_ln(t_i − τ; μ, σ) dτ, with two sets of parameters for the two infection curves, which are separated in time by Δt > 0. eq. ( ) is then fitted to data suspected to contain the effects of two waves of infection. this process does double the number of parameters to be estimated from data. however, the posterior density inferred for the parameters of the first wave (i.e., those with the [1] superscript), using data collected before the arrival of the second wave, can be used to impose informative priors, considerably simplifying the estimation problem.
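the two-wave extension can be sketched by summing two infection-rate curves shifted in time, each convolved with the same incubation distribution. this is an illustration of the augmented equation under assumed placeholder values, not the authors' implementation.

```python
import numpy as np
from scipy import stats

def daily_counts_two_waves(days, wave_params, mu, sigma):
    """Expected daily symptomatic counts for a sum of infection waves.

    wave_params: list of (n_total, t_start, k, theta) tuples, one per wave;
    mu, sigma  : parameters of the shared lognormal incubation model.
    """
    counts = np.zeros(len(days))
    for n_total, t_start, k, theta in wave_params:
        for idx, t_i in enumerate(days):
            tau = np.arange(t_start, t_i + 1.0, 1.0)     # infection times for this wave
            if tau.size == 0:
                continue
            infection = stats.gamma.pdf(tau - t_start, a=k, scale=theta)
            incubation = stats.lognorm.pdf(t_i - tau, s=sigma, scale=np.exp(mu))
            counts[idx] += n_total * np.sum(infection * incubation)
    return counts

# illustration: a first wave starting at t = 0 and a second wave 60 days later (assumed values)
days = np.arange(0.0, 150.0)
counts = daily_counts_two_waves(
    days,
    wave_params=[(50_000, 0.0, 3.0, 10.0), (80_000, 60.0, 3.0, 8.0)],
    mu=1.62, sigma=0.42)
print(counts[:5])
```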
note that the augmentation shown in eq. ( ) is very intuitive, and can be repeated if multiple infection waves are suspected. a second method that could, in principle, be used to infer multiple waves of infection is the use of compartmental models, e.g., sir models or their extensions. these models represent the epidemiological evolution of a patient through a sequence of compartments/states, with the residence time in each compartment modeled as a random variable. one of these compartments, "infectious", can then be used to model the spread of the disease to other individuals. such compartmental models have also been parlayed into ordinary differential equation (ode) models for an entire population, with the population distributed among the various compartments. ode models assume that the residence time in each compartment is exponentially distributed, and, using multiple compartments, can represent incubation and symptomatic periods that are not exponentially distributed. this does lead to an explosion of compartments. the spread-of-infection model often involves a time-dependent reproductive number r(t) that can be used to model the effectiveness of epidemic control measures. it denotes the number of individuals a single infected individual will spread the disease to, and as public health measures are put in place (or removed), r(t) will decrease or increase. we did not consider sir models, or their extensions, in our study as our model is meant to be used early in an outbreak when data is scarce and incomplete. since our method is data-driven and involves fitting a model, a deterministic (ode) compartmental model with few parameters would be desirable. the reasons for avoiding ode-based compartmental models are:
- the incubation period of covid-19 is not exponential (it is lognormal) and there is no way of modeling it with a single "infectious" compartment.
- while it is possible to decompose the "infectious" compartment into multiple sub-compartments, it would increase the dimensionality of the inverse problem as we would have to infer the fraction of the infected population in each of the sub-compartments. this is not desirable when data is scarce.
- we did not consider using extensions of sir, i.e., those with more compartments, since it would require us to know the residence time in each compartment. this information is not available with much certainty at the start of the epidemic. this is particularly true for covid-19, where only a fraction of the "infectious" cohort progress to compartments which exhibit symptoms.
- sir models can infer the existence of a second wave of infections but would require a very flexible parameterization of r(t) that would allow bi- or multimodal behavior. it is unknown what sparsely parameterized functional form would be sufficient for modeling r(t).
this paper illustrates the performance of a method for producing short-term forecasts (with a forecasting horizon of about - days) of a partially-observed infectious disease outbreak. we have applied the method to the covid-19 pandemic of spring 2020. the forecasting problem is formulated as a bayesian inverse problem, predicated on an incubation period model. the bayesian inverse problem is solved using markov chain monte carlo and infers parameters of the latent infection-rate curve from an observed time-series of new case counts. the forecast is merely the posterior-predictive simulations using realizations of the infection-rate curve and the incubation period model.
the method accommodates multiple, competing incubation period models using a pseudo-marginal metropolis-hastings sampler. the variability in the incubation rate model has little impact on the forecast uncertainty, which is mostly due to the variability in the observed data and the discrepancy between the latent infection rate model and the spread dynamics at several geographical scales. the uncertainty in the incubation period distribution also has little impact on the inferred latent infection rate curve. the method is applied at the country, provincial and regional/county scales. the bulk of the study used data aggregated at the state and country level for the united states, as well as counties in new mexico and california. we also analyzed data from a few european countries. the wide disparity of daily new cases motivated us to study two formulations for the error model used in the likelihood, though the gaussian error models were found to be acceptable for all cases. the most successful error model included a combination of multiplicative and additive errors. this was because of the wide disparity in the daily case counts experienced over the full duration of the outbreak. the method was found to be sufficiently robust to produce useful forecasts at all three spatial resolutions, though high-variance noise in low-count data (poorly reported/low-count/largely unscathed counties) posed the stiffest challenge in discerning the latent infection rate. the method produces rough-and-ready information required to monitor the efficacy of quarantining efforts. it can be used to predict the potential shift in demand for medical resources due to the model's inferential capabilities to detect changes in disease dynamics through short-term forecasts. it took about days of data (about the % quantile of the incubation model distribution) to infer the flattening of the infection rate in california after curbs on population mixing were instituted. the method also detected the anomalous dynamics of covid-19 in northwestern new mexico, where the outbreak has displayed a stubborn persistence over time. our approach suffers from two shortcomings. the first is our reliance on the time-series of daily new confirmed cases as the calibration data. as the pandemic has progressed and testing for covid-19 infection has become widespread in the usa, the daily confirmed new cases no longer consist mainly of symptomatic cases who might require medical care, and forecasts developed using our method would overpredict the demand for medical resources. however, as stated in sect. , our approach, with its emphasis on simplicity and reliance on easily observed data, is meant to be used in the early epoch of the outbreak for medical resource forecasting, and within those pragmatic considerations, has worked well. the approach could perhaps be augmented with a time-series of covid-19 tests administered every day to tease apart the effect of increased testing on the observed data, but that is beyond the scope of the current work. undoubtedly this would result in a more complex model, which would need to be conditioned on more plentiful data, which might not be readily available during the early epoch of an outbreak. the second shortcoming of our approach is that it does not model, detect or infer a second wave of infections, caused by an increase in population mixing. this can be accomplished by adding a second infection rate curve/model to the inference procedure.
this doubles the number of parameters to be inferred from the data, but the parameters of the first wave can be tightly constrained using informative priors. this issue is currently being investigated by the authors. this work was funded in part by the laboratory directed research and development (ldrd) program at sandia national laboratories. khachik sargsyan was supported by the u.s. department of energy, office of science, office of advanced scientific computing research, scientific discovery through advanced computing (scidac) program through the fastmath institute. sandia national laboratories is a multimission laboratory managed and operated by national technology and engineering solutions of sandia, llc., a wholly owned subsidiary of honeywell international, inc., for the u.s. department of energy's national nuclear security administration under contract de-na- . this paper describes objective technical results and analysis. any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the u.s. department of energy or the united states government. we have used several sources [ , , ] to gather daily counts of symptomatic cases at several times while we performed this work. the illustrations in this section present both the original data, with blue symbols, as well as the filtered data, with red symbols and lines. figure shows data for all of the us (data extracted from [ ]), and for selected states (data extracted from [ ]). the filtering approach, presented in sect. . , preserves the weekly scale variability observed for some of the datasets in this figure, and removes some of the large day-to-day variability observed, for example, in alaska, in fig. d. figure shows the data for several countries with a significant number of covid-19 cases as of may , . similar to the us and some of the us states, a weekly frequency trend can be observed superimposed on the overall epidemiological trend. these trends are observed mostly on the downward slope, e.g. for italy and germany. when the epidemic is taking hold, it is possible that any higher frequency fluctuation is hidden inside the sharply increasing counts. possible explanations include regional hot-spots flaring up periodically as well as expanded testing capabilities ramping up over time. we have also explored epidemiological models applied at the regional scale. the left frame in fig. shows a set of counties in the bay area that were the first to issue the stay-at-home order on march , . two groups of counties in new mexico are shown with red and blue in the right frame of fig. . these regions displayed different disease dynamics, e.g. a shelter-in-place was first issued in the bay area on march , then extended to the entire state on march , while the new daily counts were much larger in nw new mexico compared to the central region. the daily counts, shown in fig. for these three regions, were aggregated based on county data provided by [ ].
- coronavirus pandemic (covid-19) data in the united states
- covid-19 coronavirus pandemic
- covid-19 data repository by the center for systems science and engineering
- covid-19 pandemic data/united states medical cases
- reopenings stall as us records nearly , cases of covid-19 in single day
- modelling the occurrence of the novel pandemic covid-19 outbreak; a box and jenkins approach
- the pseudo-marginal approach for efficient monte carlo computations
- model calibration, nowcasting, and operational prediction of the covid-19 pandemic
- a method for obtaining short-term projections and lower bounds on the size of the aids epidemic
- development and application of pandemic projection measures (ppm) for forecasting the covid-19 outbreak
- hawkes process modeling of covid-19 with mobility leading indicators and spatial covariates
- dynamics and development of the covid-19 epidemics in the us: a compartmental model with deep learning enhancement
- worldwide and regional forecasting of coronavirus (covid-19) spread using a deep learning model
- forecasting covid-19 outbreak progression in italian regions: a model based on neural network training from chinese data
- sequential data assimilation of the stochastic seir epidemic model for regional covid-19 dynamics
- an international assessment of the covid-19 pandemic using ensemble data assimilation
- report : impact of non-pharmaceutical interventions (npis) to reduce covid-19 mortality and healthcare demand
- ensemble samplers with affine invariance
- an adaptive metropolis algorithm
- markov chain monte carlo in practice: a roundtable discussion
- several new numerical methods for compressible shear-layer simulations
- covasim: an agent-based model of covid-19 dynamics and interventions
- transitional markov chain monte carlo sampler in uqtk
- predictive accuracy of a hierarchical logistic model of cumulative sars-cov-2 case growth
- the incubation period of coronavirus disease (covid-19) from publicly reported confirmed cases: estimation and application
- realistic distributions of infectious periods in epidemic models: changing patterns of persistence and dynamics
- maximum likelihood estimation of the negative binomial dispersion parameter for highly overdispersed data, with applications to infectious diseases
- estimating the early outbreak cumulative incidence of covid-19 in the united states: three complementary approaches
- bayesian posterior predictive checks for complex models
- learning as we go: an examination of the statistical accuracy of covid-19 daily death count predictions
- forecasting covid-19 impact on hospital bed-days, icu-days, ventilator-days and deaths by us state in the next months
- forecasting the impact of the first wave of the covid-19 pandemic on hospital demand and deaths for the usa and european economic area countries
- bayesian updating and model class selection for hysteretic structural models using stochastic simulation
- initial simulation of sars-cov-2 spread and intervention effects in the continental us
- an arima model to forecast the spread and the final size of the covid-19 epidemic in italy
- how many iterations in the gibbs sampler?
- using high-order methods on adaptively refined block-structured meshes: derivatives, interpolations, and filters
- deriving a model for influenza epidemics from historical data
- modeling covid-19 on a network: super-spreaders, testing and containment
- real-time characterization of partially observed epidemics using surrogate models
- machine learning model estimating number of covid-19 infection cases over coming days in every province of south korea (xgboost and multioutputregressor)
- projections for first-wave covid-19 deaths across the us using social-distancing measures derived from mobile phones
- projection of covid-19 cases and deaths in the us as individual states re-open
the authors acknowledge the helpful feedback that john jakeman has provided on various aspects related to the speed-up of model evaluations. the authors also acknowledge the support erin acquesta, thomas catanach, kenny chowdhary, bert debusschere, edgar galvan, gianluca geraci, mohammad khalil, and teresa portone provided in scaling up the short-term forecasts to large datasets. key: cord- - py k e authors: buyse, marc; trotta, laura; saad, everardo d.; sakamoto, junichi title: central statistical monitoring of investigator-led clinical trials in oncology date: - - journal: int j clin oncol doi: . /s - - - sha: doc_id: cord_uid: py k e investigator-led clinical trials are pragmatic trials that aim to investigate the benefits and harms of treatments in routine clinical practice. these much-needed trials represent the majority of all trials currently conducted. they are however threatened by the rising costs of clinical research, which are in part due to extensive trial monitoring processes that focus on unimportant details. risk-based quality management focuses, instead, on "things that really matter". we discuss the role of central statistical monitoring as part of risk-based quality management. we describe the principles of central statistical monitoring, provide examples of its use, and argue that it could help drive down the cost of randomized clinical trials, especially investigator-led trials, whilst improving their quality. medical practice largely relies on the evidence generated by clinical trials, particularly randomized controlled trials (rcts). these are considered the gold-standard approach for evaluating therapeutic interventions due to their capacity to allow for inferences about causal links between treatment and outcomes [ ]. a general property of experimental research is that internal validity (i.e., the reliability of results) and external validity (i.e., their generalizability) tend to move in opposite directions in response to attempts to control trial features such as the population, the intervention, and the assessment of outcomes. this gives rise to different attitudes towards clinical trials in general, and rcts in particular: one that prioritizes internal validity (the explanatory attitude), and one that places more emphasis on the generalizability of results (the pragmatic attitude) [ ]. industry-sponsored trials, here defined as trials that aim to investigate experimental drugs with largely unknown effects, are typically characterized by an explanatory approach, which is suitable for the development of these novel agents or combinations.
in contrast, investigator-led clinical trials, here defined as trials that aim to investigate the benefits and harms of treatments in routine clinical practice, are typically characterized by a pragmatic attitude. table characterizes some of the contrasts between an explanatory and a pragmatic approach to clinical trials. these contrasts have direct implications for the conduct of investigator-led trials, notably with regards to ways of ensuring their quality, which is the focus of this paper. investigator-led clinical trials belong to a research area known as comparative-effectiveness research. we note that "real-world evidence" is a broader concept, given that it is often applied to observational research, something that falls outside the scope of our paper [ , ]. industry-sponsored clinical trials are essential for the development of new treatments. these clinical trials need to fulfil commercial interests and market expectations, which may not always address all patients' needs [ ]. moreover, clinical trials that lead to the approval of novel drugs or devices often have shortcomings that have been recognized for decades. such shortcomings include the strictness of the eligibility criteria, the choice of comparators, the effect size of interest, the choice of outcomes, and insufficient data on long-term toxicity [ ]. arguably, some of these shortcomings are a by-product of the general principles underlying marketing approval by regulatory agencies, such as the japanese pharmaceutical and medical devices agency (pmda), the european medicines agency (ema), and the us food and drug administration (fda). these agencies must determine whether a new drug is sufficiently safe and effective to be made available for clinical use, which requires a careful assessment of the quality of the pivotal trial design, conduct, data and analysis whilst allowing safe and effective new drugs to enter the market quickly [ ]. however, the need remains to generate additional, post-approval evidence on novel drugs or devices [ , ]. such evidence is required for clinical practice, as it provides a far better understanding of the effectiveness and safety of competing interventions in "real life". moreover, it allows the assessment of patients and settings not necessarily covered by the initial approval, thus leading to potential extensions of indications and refinement of the drug usage in patient subgroups. even for newly approved drugs, many questions of clinical interest typically remain unanswered at the time of approval, including the duration of therapy, dose or schedule modifications that may lead to a better benefit/risk ratio, combinations of the new drug with existing regimens, and so on. likewise, repurposing of existing drugs, whose safety and efficacy profile is well documented in other indications, is more likely to be attractive in the setting of investigator-led trials than to pharmaceutical companies, for whom a given product ceases to be financially attractive towards the end of its life-cycle [ ]. finally, large, simple trials that address questions of major public health importance have been advocated for decades as one of the pillars of evidence-based medicine [ ]. all in all, more and larger investigator-led trials are needed, and it is crucially important to identify ways of conducting them as cost-effectively as possible [ , ]. in particular, excessive regulation of investigator-led trials, using industry-sponsored trials as a model, is both unnecessary and counterproductive [ ].
publicly available clinical-trial registries are useful to assess the importance of investigator-led clinical trials in worldwide clinical research. the longest established and largest registry is clinicaltrials.gov, with , trial protocols as of march , . clinicaltrials.gov contains trial protocols from both the us and other countries, and distinguishes between four major types of funders: (1) industry (e.g., pharmaceutical and device companies), (2) the us national institutes of health, (3) other federal agencies (e.g., fda, centers for disease control and prevention, or department of veterans affairs), and (4) all others (including individuals, universities, and community-based organizations). for the purposes of this paper, we focus on clinical trials conducted by sponsors other than the pharmaceutical and device industry, i.e., funder types (2)-(4), as opposed to funder type (1). we call these trials "investigator-led" clinical trials for simplicity. figures and show the number of registered interventional clinical trials in oncology, by funder type and year the trial started, in the us (fig. ) and all other countries (fig. ). in the us, about such trials were reported to have started in , about being industry trials and about investigator-led trials (roughly half of which were sponsored by nih and other federal agencies, and half by other sponsors). in other countries, about such trials were reported in , about being industry trials versus about investigator-led trials. there may be substantial under-reporting of clinical trials to clinicaltrials.gov, especially for non-us trials and for investigator-led trials, so it is conservative to assume that investigator-led trials outnumber industry-sponsored trials worldwide. as such, investigator-led trials have the potential to generate much of the evidence upon which the treatment of cancer patients is decided. yet, as stated above, investigator-led trials may be under threat because of excessive regulation and bureaucracy, and the accompanying direct and indirect costs. the rising costs of clinical trials have been a matter of major concern for some time [ ]. the contribution of clinical trials to the overall costs of drug development is not known with precision, but recent estimates suggest that pivotal clinical trials leading to fda approval have a median cost of us$ million; such costs are even higher in oncology and cardiovascular medicine, as well as in trials with a long-term clinical outcome, such as survival [ ]. interestingly, the cost of clinical trials was found to have huge variability, with more than -fold differences at the extremes of the cost distribution among the trials surveyed [ ]. the extent to which the skyrocketing costs of clinical research depend on individual components of clinical-trial conduct can vary substantially across trials, and likely differs when industry-sponsored studies are compared with investigator-led trials. in industry-sponsored trials, a great deal of resources is spent in making sure that the data collected in clinical trials are free from error. this is usually done through on-site monitoring (site visits), including source-data verification and other types of quality assurance procedures, alongside centralized monitoring, including data management and the statistical monitoring that is the focus of the present paper.
while some on-site activities make intuitive sense, their cost has become exorbitant in the large multicenter trials that are typically required for the approval of new therapies [ ]. it has been estimated that for large, global clinical trials, leaving aside site payments, the cost of on-site monitoring represents about % of the total trial cost [ ]. the high costs of monitoring could be justified if monitoring activities were likely to have an impact on patient safety or on the trial results [ ]. yet, there is no evidence showing that extensive data monitoring has any major impact on the quality of clinical-trial data, and none of the randomized studies assessing more intensive versus less intensive monitoring has shown any difference in terms of clinically relevant treatment outcomes [ ] [ ] [ ] [ ] [ ]. besides, there may also be a lack of effectiveness of sending large numbers of data queries to the centers as part of the data management process. in one limited study, only six queries were found ( . % of queries) that might have influenced the results of three phase iii cancer clinical trials, had the discrepancy not been revealed [ ]. but without question, the most time-consuming and least efficient activity is source-data verification, which can take up to % of the time spent for on-site visits, hence it is especially important to make sure that such time is well spent. a large retrospective study of industry-sponsored clinical trials has shown that only . % of all data were changed as a result of source-data verification [ ]. moreover, it has been shown via simulations that random errors, which comprise most of the errors detected during source-data verification, have a negligible impact on the trial results [ ]. in contrast, systematic errors (those that create a bias in the comparison between the treatment groups of a randomized trial) can have a huge impact on the trial results, but these types of errors can either be prevented or detected and corrected centrally [ , ]. all in all, the monitoring of clinical trials needs to be re-engineered, not just for investigator-led trials, but also for industry-sponsored trials. to instigate and support this much-needed transition, regulatory agencies worldwide have advocated the use of risk-based quality management, including risk-based monitoring and central statistical monitoring (csm) [ , ]. the central principle of risk-based quality management is to "focus on things that matter". what matters for a randomized clinical trial is to provide a reliable estimate of the difference in efficacy and tolerance between the treatments being compared. it is important to stress that the criteria to assess efficacy and tolerance may differ between industry-sponsored trials and investigator-led trials. for instance, in terms of efficacy, industry-sponsored trials often use the centrally reviewed progression-free survival (pfs), which may provide the most sensitive indicator of the antitumor effect of a treatment, while investigator-led trials use the locally assessed pfs, which may provide the most relevant indicator of disease progression for clinical decision-making (for instance to change therapy). neither of these two assessments of pfs is better than the other; they serve different purposes and have their own advantages and limitations. centrally reviewed pfs is arguably a "cleaner" endpoint, but it is quite expensive to measure and does not reflect clinical routine; as such it is neither feasible nor desirable in investigator-led trials.
in terms of safety, investigator-led trials can collect much simpler data than industry-sponsored trials of drugs for which safety has not yet been demonstrated. typically, in investigator-led trials, the occurrence of common terminology criteria for adverse events grade or toxicities will suffice, plus any unexpected toxicity not known to be associated with the drug being investigated. finally, medical history and concomitant medications, which may be important for documenting drug interactions with an experimental treatment, often serve no useful purpose in investigator-led trials. all in all, investigator-led trials should collect radically simpler data than industry-sponsored trials. similarly, data quality needs to be evaluated in a "fit for purpose" manner: while it may be required to attempt to reach % accuracy in all the data collected for a pivotal trial of an experimental treatment, such a high bar is by no means required for investigator-led trials, as long as no systematic bias is at play to create data differences between the randomized treatment groups (for instance, a higher proportion of missing data in one group than in the other) [ ] . both types of trials may benefit from central statistical monitoring of the data; industry-sponsored trials to target centers that are detected as having potential data quality issues, which may require an on-site audit, and investigator-led trials as the primary method for checking data quality. central statistical monitoring (csm) is part of risk-based quality management [ ] . as shown in fig. , the process starts with a risk assessment and categorization tool (ract) [ ] . csm helps quality management by providing statistical indicators of quality based on data collected in the trial from all sources. a "data quality assessment" of multicenter trials can be based on the simple statistical idea that data should be broadly comparable across all centers [ ] . note that this idea is premised on the fact that data consistency is an acceptable surrogate for data quality. note also that other tools of central monitoring can be used in addition, to uncover situations in which data issues occur in most (or sometimes all) centers; these other tools, which include "key risk indicators" and "quality tolerance limits", are beyond the scope of this article. taken together, all these tools produce statistical signals that may reveal issues in specific centers. actions must then be taken to address these issues, such as contacting the center for clarification, or in some cases performing an on-site audit to understand the cause of the data issue (fig. ) . although it is a simple idea to perform a central data quality assessment based on the consistency of data across all centers, the statistical models required to implement the idea are necessarily complex to properly account for the natural variability in the data [ , ] . essentially, a central data quality assessment is efficient if:
1. data have undergone basic data management checks, whether automated or manual, to eliminate obvious errors (such as out-of-range or impossible values) that can be detected and corrected without a statistical approach;
2. data quality issues are limited to a few centers, while the other centers have data of good quality;
3. all data are used, rather than a few key data items such as those for the primary endpoint or major safety variables;
4. many statistical tests are performed, rather than just a few obvious ones such as a shift in mean or a difference in variability (a minimal sketch of such a test battery follows this list).
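to make the last point concrete, the following minimal sketch runs a small battery of tests (a shift in mean and a difference in variability) for every variable in every center, comparing each center against all other centers pooled. the data layout, the choice of welch and levene tests, and all names are illustrative assumptions; real csm engines rely on far richer test batteries and mixed-effects models.

```python
# Toy illustration of a per-center, per-variable test battery for central
# statistical monitoring (CSM). Assumptions: long-format data with columns
# "center", "variable", "value"; two tests per variable (mean shift, variance).
import numpy as np
import pandas as pd
from scipy import stats

def test_battery(df: pd.DataFrame) -> pd.DataFrame:
    """Return one p value per (center, variable, test)."""
    rows = []
    for var, dvar in df.groupby("variable"):
        for center, dc in dvar.groupby("center"):
            inside = dc["value"].to_numpy()
            outside = dvar.loc[dvar["center"] != center, "value"].to_numpy()
            if len(inside) < 3 or len(outside) < 3:
                continue  # too few observations to test this center
            # Test 1: shift in mean (Welch t-test, unequal variances).
            p_mean = stats.ttest_ind(inside, outside, equal_var=False).pvalue
            # Test 2: difference in variability (Levene's test).
            p_var = stats.levene(inside, outside).pvalue
            rows.append({"center": center, "variable": var,
                         "test": "mean_shift", "p": p_mean})
            rows.append({"center": center, "variable": var,
                         "test": "variability", "p": p_var})
    return pd.DataFrame(rows)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fabricated systolic blood pressure data: center C3 has an upward shift.
    data = []
    for c in range(1, 6):
        shift = 15.0 if c == 3 else 0.0
        for _ in range(30):
            data.append({"center": f"C{c}", "variable": "sbp",
                         "value": rng.normal(125 + shift, 10)})
    pvals = test_battery(pd.DataFrame(data))
    print(pvals.sort_values("p").head())
```

in a real trial this battery would be repeated over every collected variable, which is exactly why the number of tests grows so quickly and why the results need to be summarized, as the next paragraphs describe.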
it is worth emphasizing the last two points of the list above, namely that it is statistically preferable to run many tests on all the data collected rather than on a few data items carefully selected for their relevance or importance. hence, what matters for a reliable statistical assessment of data quality is volume rather than clinical relevance. the reason is that the power of statistical detection comes from an accumulation of evidence, which would not be available if only important items and standard tests were considered [ ] . in addition, investigators pay more attention to key data (such as the primary efficacy endpoint or important safety variables), which, therefore, do not constitute reliable indicators of overall data quality. this being said, careful checks of key data are also essential, but such checks, for the most part, are not statistical in nature. figure shows a made-up example of systolic blood pressure, measured during six successive visits, in nine centers (numbered c -c ) of a fictitious multicentre trial. each colored line represents one patient. it is easy, even visually, to spot centers that deviate from the norm: a lack of variability is apparent in center c , an upward shift in mean in center c , and data propagation in center c . while these inconsistencies are too extreme to be commonly seen in practice, others may escape visual scrutiny and yet be revealing of issues worth investigating further.
fig. : the risk-based quality management process.
for instance, the data of center c may well be inconsistent with the data of other centers, as it seems to have smaller variability, but it is impossible to tell from fig. whether this inconsistency falls beyond the play of chance. figure depicts only one variable, but the power of the statistical approach is to perform many tests on all variables. this can lead to a large number of tests: in a trial of centers, if data are available on variables, and if five tests on average are performed on each variable, the system generates × × = , tests. there is obviously a need to summarize the statistical information produced by all these tests in an overall inconsistency index. essentially, if $p_{ij}$ represents the p value of the $j$th statistical test in center $i$, the data inconsistency score (dis) for center $i$ can be written as $\mathrm{DIS}_i = \exp\!\big(\sum_j w_j \ln p_{ij} \big/ \sum_j w_j\big)$, where $w_j$ is a weight that accounts for the correlation between the tests. put simply, the dis is a weighted geometric mean of the p values of all tests performed to compare center i with all other centers. in fact, the calculation of the dis is more complex than this formula suggests, but the technical details are unimportant here [ ] . venet et al. discuss other ways of combining many statistical tests to identify data issues in multicenter trials [ ] . it is visually useful to display the dis as a function of center size, as shown in fig. [ ] . when the trial includes many centers, it may be useful to limit the number of centers found to have statistical inconsistencies by setting the false discovery rate to a low probability, such as % [ ] . timmermans et al. provide a detailed example of csm applied to a completed trial, the stomach cancer adjuvant multi-institutional trial (samit) group trial, involving patients across centers in japan, which was subsequently published [ , ] . this trial, like many trials in oncology, included many centers with only a couple of patients [ ] . table shows the main findings of csm in this trial, which led to further checks and data corrections prior to final analysis [ ] .
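the dis-plus-fdr computation just described can be illustrated with the following toy sketch. the equal default weights, the benjamini-hochberg step, and the treatment of the dis as a p-value-like quantity (smaller meaning more atypical) are simplifying assumptions of this illustration, not the exact procedure implemented in the cited csm tools.

```python
# Toy data inconsistency score (DIS): weighted geometric mean of the p values
# of all tests run against a given center, plus a Benjamini-Hochberg style
# flagging step. Weights and the FDR treatment are illustrative assumptions.
import numpy as np

def dis_scores(pvals, weights=None):
    """pvals: (n_centers, n_tests) matrix of p values; returns one DIS per center."""
    p = np.clip(np.asarray(pvals, dtype=float), 1e-300, 1.0)  # avoid log(0)
    if weights is None:
        weights = np.ones(p.shape[1])                         # equal weights by default
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Weighted geometric mean: exp( sum_j w_j * ln p_ij ).
    return np.exp((np.log(p) * w).sum(axis=1))

def flag_centers(dis, fdr=0.05):
    """Benjamini-Hochberg applied to DIS values treated as p-value-like scores.
    Returns a boolean mask of flagged (atypical) centers."""
    dis = np.asarray(dis, dtype=float)
    m = len(dis)
    order = np.argsort(dis)
    thresholds = fdr * (np.arange(1, m + 1) / m)
    passed = dis[order] <= thresholds
    flags = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()   # largest rank meeting the BH bound
        flags[order[: k + 1]] = True
    return flags

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pvals = rng.uniform(size=(9, 50))          # 9 centers, 50 tests, all "clean"
    pvals[4] = rng.uniform(0, 0.02, size=50)   # one center with atypical data
    dis = dis_scores(pvals)
    print(np.round(dis, 4))
    print(flag_centers(dis))                   # only the atypical center is flagged
```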
the samit example shows the power of csm to identify data issues even in small centers, provided a large enough number of patient-related variables are included in the analysis [ ] . table also shows the actions taken, when required, to correct the few data issues that remained in this final dataset. it is noteworthy that some of the statistical findings led to no action if an explanation was found for them (e.g., visits on unusual days of the week), or if, upon further investigation, the findings seemed likely to be due to the play of chance. experience from actual trials [ , , , , ] as well as extensive simulation studies [ ] have shown that a statistical data quality assessment based on the principles outlined above is quite effective at detecting data errors. experience from actual trials suggests that data errors can be broadly classified as:
1. fraud, such as fabricating patient records or even fabricating entire patients [ , , ]
2. data tampering, such as filling in missing data, or propagating data from one visit to the next [ ]
3. sloppiness, such as not reporting some adverse events, making transcription errors, etc. [ ]
4. miscalibration or other problems with automated equipment [ ]
fig. : a made-up example of systolic blood pressure, measured during six successive visits, in centers (numbered c -c ) of a multicentre trial. each colored line represents the systolic blood pressure of one patient over time.
whilst some of these data errors are worse than others, insofar as they may have a more profound impact on the results of the trial, all of them can potentially be detected using csm, at a far lower cost and with much higher efficiency than through labor-intensive methods such as source data verification and other on-site data reviews. investigator-led trials generate more than half of all randomized evidence on new treatments, and it seems essential that this evidence be submitted to statistical quality checks before going to print and influencing clinical practice.
the magic of randomization versus the myth of real-world evidence
explanatory and pragmatic attitudes in therapeutical trials
real-world evidence - what is it and what can it tell us?
safeguarding the future of independent, academic clinical cancer research in europe for the benefit of patients
design characteristics, risk of bias, and reporting of randomised controlled trials supporting approvals of cancer drugs by european medicines agency, - : cross sectional analysis
postapproval studies of drugs initially approved by the fda on the basis of limited evidence: systematic review
generating comparative evidence on new drugs and devices after approval
drug repurposing in oncology - patient and health systems opportunities
why do we need some large, simple randomized trials
more trials - to do more trials better
improving public health by improving clinical trial guidelines and their application
analysis of the status of specified clinical trials using jrct (japan registry of clinical trials)
researchers facing increasing costs for clinical research, with few solutions
estimated costs of pivotal trials for novel therapeutic agents approved by the us food and drug administration
developing systems for cost-effective auditing of clinical trials
forum on drug discovery development and translation.
transforming clinical research in the united states
randomized clinical trials - removing unnecessary obstacles
the value of source data verification in a cancer clinical trial
risk-adapted monitoring is not inferior to extensive on-site monitoring: results of the adamon cluster-randomised study
triggered or routine site monitoring visits for randomised controlled trials: results of temper, a prospective, matched-pair study
a randomized evaluation of on-site monitoring nested in a multinational randomized trial
validation of a risk-assessment scale and a risk-adapted monitoring plan for academic clinical research studies - the pre-optimon study
improving the quality of drug research or simply increasing its cost? an evidence-based study of the cost for data monitoring in clinical trials
evaluating source data verification as a quality control measure in clinical trials
the impact of data errors on the outcome of randomized clinical trials
ensuring trial validity by data quality assurance and diversification of monitoring methods
reflection paper on risk based quality management in clinical trials
department of health and human services ( ) food and drug administration guidance for industry. oversight of clinical investigations. a risk-based approach to monitoring
data-driven risk identification in phase iii clinical trials using central statistical monitoring
available at https://transceleratebiopharmainc.com/assets/rbm-assets/ (accessed )
a statistical approach to central monitoring of data quality in clinical trials
linear mixed-effects models for central statistical monitoring of multicenter clinical trials
use of the beta-binomial model for central statistical monitoring of multicenter clinical trials
a hercule poirot of clinical research
detection of atypical data in multicenter clinical trials using unsupervised statistical monitoring
statistical monitoring of data quality and consistency in the stomach cancer adjuvant multi-institutional
the control of the false discovery rate in multiple testing under dependency
sequential paclitaxel followed by tegafur and uracil (uft) or s- vs. uft or s- monotherapy as adjuvant chemotherapy for t a/b gastric cancer (samit): a phase factorial randomised controlled trial
fraud in clinical trials
the role of biostatistics in the prevention, detection and treatment of fraud in clinical trials
guidance for industry, investigators, and institutional review boards. fda guidance on conduct of clinical trials of medical products during covid- pandemic
key: cord- -pb e w s authors: kolatkar, anand; kennedy, kevin; halabuk, dan; kunken, josh; marrinucci, dena; bethel, kelly; guzman, rodney; huckaby, tim; kuhn, peter title: c-me: a d community-based, real-time collaboration tool for scientific research and training date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: pb e w s
the need for effective collaboration tools is growing as multidisciplinary proteome-wide projects and distributed research teams become more common. the resulting data is often quite disparate, stored in separate locations, and not contextually related. collaborative molecular modeling environment (c-me) is an interactive community-based collaboration system that allows researchers to organize information, visualize data on a two-dimensional ( -d) or three-dimensional ( -d) basis, and share and manage that information with collaborators in real time.
c-me stores the information in industry-standard databases that are immediately accessible by appropriate permission within the computer network directory service or anonymously across the internet through the c-me application or through a web browser. the system addresses two important aspects of collaboration: context and information management. c-me allows a researcher to use a -d atomic structure model or a -d image as a contextual basis on which to attach and share annotations to specific atoms or molecules or to specific regions of a -d image. these annotations provide additional information about the atomic structure or image data that can then be evaluated, amended or added to by other project members. the laboratory environment has changed considerably over the last decade as larger projects and broader scientific challenges have required scientists to cooperate and collaborate with peers around the globe. thus, few research labs remain isolated, independent units. the internet has enabled these research teams to work in trans-disciplinary environments, with researchers from many disciplines, nations, time zones, and languages working together on large-scale research projects. while this new development enables researchers to address these new challenges, a new approach to conducting daily research is required. despite changes to the typical laboratory environment, data often continues to be kept in disparate forms of media. for example, protein structure/activity data annotations and images may be kept in paper lab notebooks, manuscripts might be stored electronically in portable document format (pdf), and molecular structure coordinate files may be stored on a hard disk to be viewed and analyzed in graphical molecular viewers, to name a few. however, most of these storage methods are static, one-or two-dimensional, and not connected in real time and not contextually related. in addition, there is a need to manage the changes in these data over the lifetime of a project and avoid differing versions of documents amongst the project collaborators. geographic separation of laboratory units can also present challenges, especially for large-scale research projects. the time and cost for traveling, the efficient distribution of information among collaborators, and the streamlining of workflow processes can be key concerns for laboratory heads and organizational directors. to address these concerns, ''collaboratory systems'' have been put into operation. collaboratory systems are often computerbased systems that manage research workflow processes and allow different lab units to communicate and share data. wulf defined a collaboratory as a ''center without walls, in which the nation's researchers can perform their research without regard to physical location, interacting with colleagues, accessing instrumentation, sharing data and computational resources, and accessing information in digital libraries'' (wulf w ( ) ''the national collaboratory.'' towards a national collaboratory. unpublished report of a national science foundation invitational workshop, new york: rockefeller university). bly refines the definition to ''a system which combines the interests of the scientific community at large with those of the computer science and engineering community to create integrated, tool-oriented computing and communication systems to support scientific collaborations'' [ ] . 
chin and lansing state that the research and development of scientific collaboratories has, thus far, been a tool-centric approach [ ] . the main goal was to provide tools for shared access and manipulation of specific software systems or scientific instruments. such an emphasis on tools was necessary in these early development years because of the lack of basic collaboration tools-text chat, synchronous audio, videoconferencing-to support rudimentary levels of communication and interaction. today, however, such tools are available in off-the-shelf software packages, such as microsoft livemeeting, and ibm lotus sametime. the design of collaboratories can move beyond developing general communication mechanisms to evaluating and supporting data sharing in the scientific context. new communication channels can be added to facilitate collaborative exchange of ideas and data. social network services are an example of a novel communication channel that has attracted large numbers of participants. services like myspace and facebook provide a web site for members with similar interests to interact and share information through a variety of formats, such as, discussion groups, voice and video, file sharing, and email. the success of these services has to be considered within a scientific context to provide an effective on-line environment for scientific collaboration. a key challenge is to represent and visualize molecular and cellular data as a function of time, space, and ontological state, and then to efficiently and productively manage and share that data with collaborators and the scientific community. effective proteome-wide prediction of cell signaling interactions require that data on biological relevance, structural accessibility, and molecular sequence be combined to allow for the accurate modeling of cell signaling pathways, for example [ ] . when structural genomics intersect with functional proteomics, real-time three-dimensional ( -d) annotation is needed to integrate the two by providing the structural contextual basis upon which to attach the functional information. another challenge is to efficiently communicate and discuss these models and their interpretation with collaborators. research teams are increasingly interdisciplinary and collaborative among laboratories in different departments and institutions located around the world, so it is important for them to have the tools to bridge the gaps of specialization, geography, and time. researchers must employ data organization and workflow management tools to share the context of their data. the ability to quickly search and retrieve complex -d data in real time is critical to the efficiency and productivity of large-scale research work. currently, the insufficient level of detail made available by published literature, both in print and online, makes it challenging for researchers to share their knowledge. for instance, data from large-scale structural genomics initiatives , such as, psi i & ii (us), sgc (europe), spine (europe), riken (japan), will require centralized and automated facilities across continents, new methods to divide labor, and novel tools to disseminate and annotate the data to put it into the larger scientific context of systems biology. a number of approaches to combine the power of computerized data storage with advanced visualization technology have been developed in the past. 
some of the systems below are specific to molecular structures and others are more broad-based client/ server collaboration systems offering additional communication and data sharing functionality. kinemage was the first desktop software tool used to visualize macromolecules, making it possible to enhance the communication of -d concepts in journal articles with electronic mass distribution of interactive graphics images. developed at duke university medical center, kinemage illustrated a particular idea about a -d object, rather than neutrally displaying that object [ ] . kinemages, at the time, were a new type of published illustration, providing -d views in addition to standard figures, stereo figures, and color plates. operations on the displayed kinemage could be rotated in real time, parts of the display could be turned on or off, points could be identified and annotated, and changes between different forms could be animated. this was the first widely available molecular viewing and communication tool that allowed researchers to better communicate ideas that depended on -d information. the molecular interactive collaborative environment (mice) is an application that provides collaborative, interactive visualization of complex scientific data in -d environments within which multiple users can examine complex data sets in real time [ ] . mice is portable and enables users to not only view molecular scenes on their own computer, but to distribute these scenes and interact with other users anywhere on the internet. components of mice are already in use in other san diego supercomputer center projects that require platform-independent collaborative software. the biocore system developed at the university of illinois is a large, complex collaboration environment that provides many features, such as data sharing, electronic laboratory notebooks, collaborative molecular viewing and message boards [ ] . the biocore system is being used for research and data management as well as for training and uses a combination of a web browser and specialized molecular viewer software to interact with the system and other users. the emsl collaboratory provides both a scientific annotation middleware (sam) server and an electronic laboratory notebook (eln) client as tools for developing collaboration systems [ ] . in particular, these tools have been used to develop an nmr virtual facility which enables remote nmr operation as well as sharing documents, data, and images and interacting with personnel at the nmr facility. isee is a software tool that allows users to leverage structural biology data by integrating data from many sources into a single data file and browser [ ] . structures can be annotated with methods, key points of interests, and alignments. users can then browse the annotated structures, add new annotations, and evaluate existing ones. the files are small enough to be sent as an e-mail message, and the browser runs under windows, mac, and linux operating systems. though isee offers saved view states, it is not quite a scientific wiki-where users can make annotations without restrictions-because of its prepackaged annotation system. most recently we have developed the collaborative molecular modeling environment (c-me), a new collaboratory system that integrates many of the key features available on kinemage, mice, isee, and biocore systems into one thin-client windows application. 
the key features that c-me provides include -d image and -d molecular structure annotation support in real time, as well as centralized data storage and organization using hierarchical layers (projects → entities → annotations); the annotation system is not limited to predefined ontological categories ( figure ). collaborative molecular modeling environment (c-me) is a collaboratory system developed for the purpose of supporting research projects involving the molecular systems biology and structural proteomics of the sars coronavirus as well as a cancer diagnostic research project. these projects require a simple but secure process to store all relevant data in a variety of formats in a single location that are then available either through the c-me client or through a web interface for those collaborators who do not have the c-me client. the c-me backend architecture enables these features and is built on windows server , microsoft sql server , and microsoft office sharepoint server (moss ). c-me includes a public domain client application for the windows xp and vista operating systems that can be downloaded from http://c-me.scripps.edu. c-me is built using commercial off-the-shelf (cots) software as an open-ended content management and sharing system. the primary advantage of using cots components to build c-me is that it reduces overall system development costs and requires less development time; only the components required to perform a specific function need to be developed. developing a custom system from scratch could involve creating basic functionality, retrieving data from a database, or rendering or moving a three-dimensional image on the screen, all of which might already be available commercially. c-me uses cots components for most of its back-end functions. windows server , sql server and moss provide the central authentication, database, and web infrastructure that enable an application like c-me to be developed quickly. windows server provides the active directory authentication functionality. users can be added and removed from the directory and permissions assigned according to project or affiliation. sql server is the relational database management system that provides the data storage, indexing, and retrieval functionality for all data stored on moss . thus, any information exposed by c-me is actually stored in a sql server database and made available through moss either directly using a browser or through c-me, using web services to access the data. moss is the portal system that organizes and makes available the data that is used by c-me, including the root project image, protein data bank (pdb) coordinate files, and -d images, as well as all annotations. the user can access this information with c-me, which uses web services to read and write data to and from the moss server. the user also has the option to access that data through a standard web browser ( figure ). c-me is being developed using a rapid prototyping methodology and new programming tools from microsoft, including the c# programming language, .net . framework, and the windows presentation foundation (wpf). other systems, such as java, eclipse and netbeans, also provide built-in features for rapid application development. however, since we are leveraging moss for the data organization and storage functionality, we are using the .net . framework and wpf because of their tight integration with the microsoft windows and server systems.
figure : c-me application window displaying a project.
the project, entity and annotations list boxes are displayed along the left hand side and the graphical window for displaying the d molecule representations or d images to the right. a project can have an associated d image to provide additional context, as shown in the graphical window for the selected cancer therapeutics project. for this project, a protein structure entity is displayed on the project image as a thumbnail image of the cartoon representation of its d structure, which can then be opened for viewing and annotation by double-clicking it. (doi: . /journal.pone. .g )
wpf is the graphical subsystem of the .net . framework and provides a programming model for building applications with a separation between the user interface and the underlying logic. the built-in wpf features for rendering d and d graphic objects were particularly helpful in more quickly assembling the functionality that supports c-me. the rapid prototyping has enabled us to review iterations of the prototypes in two-week cycles, providing bug and usability feedback to the developers for the next cycle of development. figure illustrates the relationships among the image rendering components and the back-end components, which communicate through industry-standard web services. the c-me graphical client application communicates via web service calls to the moss application programming interface (api) to read or write data to a specific moss entity or project site or to create new project and entity sites. figure shows the flow of data as it is entered into c-me and directed to the appropriate site on the moss server, which uses sql server as the relational database system for data storage. allowing multiple participants to jointly create and edit web pages is known as a ''wiki,'' first developed and implemented by ward cunningham in (http://en.wikipedia.org/wiki/wiki). the wikipedia is perhaps the most widely known and used wiki and is essentially an encyclopedia collaboratively written by volunteers [ , ] . by allowing researchers to publish their data sets to a larger community through a publicly or privately available data portal, c-me creates the possibility of a ''wiki-like'' research community, where any number of authorized participants can view and annotate research results interactively in a specific -d or -d context. importantly, c-me is a collaboratory system that enables the sharing and analysis of biological data with project-level security in addition to the wiki-like features, through:
controlled access: users are authenticated against an active directory to determine their level of access to projects and entities. access can also be provided to accounts residing on external ldap repositories via moss. anonymous read-only access is also available.
simple graphical interface: the graphical interface provides a standard set of mouse and keyboard controls to manipulate and annotate images and structures.
data organization: data entered into c-me is organized in a hierarchical system that allows creation of projects containing multiple sub-projects.
data consistency: all data is stored in one location so that all users access the same data from the c-me application. all changes are immediately available to all users.
robust data backend: all c-me data is stored in an industry-standard sql server database accessed through the moss application programming interface.
context-specific annotation: annotations are related graphically to the feature in the image or molecular structure that is pertinent.
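the client/server exchange described above can be pictured with a short, purely hypothetical sketch: the endpoint paths, payload fields, and credentials below are invented for illustration and do not correspond to the actual moss or c-me web-service interfaces.

```python
# Hypothetical sketch of a thin client talking to a portal server over web
# services, in the spirit of the client/server split described above.
# Endpoints, fields, and authentication are invented for illustration only.
import requests

BASE_URL = "https://portal.example.org/api"   # assumed, not a real endpoint

def list_annotations(session: requests.Session, project: str, entity: str):
    """Read all annotations attached to an entity site."""
    resp = session.get(f"{BASE_URL}/projects/{project}/entities/{entity}/annotations")
    resp.raise_for_status()
    return resp.json()

def add_note_annotation(session: requests.Session, project: str, entity: str,
                        anchor: str, text: str):
    """Attach a note annotation to a specific anchor (residues, atoms, or image region)."""
    payload = {"type": "note", "anchor": anchor, "text": text}
    resp = session.post(
        f"{BASE_URL}/projects/{project}/entities/{entity}/annotations",
        json=payload)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    with requests.Session() as s:
        s.auth = ("alice", "secret")          # placeholder credentials
        add_note_annotation(s, "cancer-therapeutics", "structure-001",
                            anchor="residues:45-52",
                            text="Possible binding pocket; see assay results.")
        print(list_annotations(s, "cancer-therapeutics", "structure-001"))
```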
the system supports several data formats, has data translation capabilities, and is able to interact and exchange data with other sources, such as external databases. by leveraging moss functionality, c-me offers subscription capabilities, performs user authentication, establishes and manages permissions and privileges, and has data encryption capabilities to ensure secure data transmission as part of its security package. c-me organizes data into a hierarchical structure that displays the organization of a data set at three abstraction levels: the project (multiple) level, the entity level, and the annotation level. each layer of the hierarchy contains one or more instances of the layer below it. the user may select a particular level from one the three list boxes along the left side of the c-me window. a project is a collection of entities created around a particular research goal, organization, or any other meaningful category that a research team might create. each project can have an associated image to provide context for any data to be added to the project. projects, in turn, contain entities, which can be either a -d atomic coordinate file created in protein data bank format, a -d image file created in png, gif, or jpg format, or a set of microscope images from a clinical blood sample in the case of specimen entities. -d entities can be rotated, zoomed and translated to orient the view as required. entities can then be annotated, by adding notes, documents, images, e-mail threads, or other forms of information. annotations are anchored to specific locations in the -d or -d entities, so an annotation is always placed in a specific context and in a precise relationship to related annotations. the c-me application does not provide advanced molecular viewing features, such as surface display, geometry analysis, and electron density viewing. it is not intended to replace existing advanced or specialized molecular viewers, such as pymol or coot [ , ] . instead, other viewers can be launched from c-me to provide more detailed molecular analysis. c-me provides basic -d image analysis functionality which includes the following image manipulation functions: rgb color filtering, zooming, and basic image enhancements such as contrast and brightness control. advanced image analysis operations should be performed with an appropriate external image manipulation application (e.g. adobe photoshop or a purposespecific image analysis application). c-me presents a user with an intuitive interface that reflects the three levels of information within the system. individual windows on the left side of the screen reflect the selections made by the user at each level of the c-me system ( figure ). the top left window reflects the projects that are currently established on the system. the middle left window reflects the entities that have been chosen for examination by the user. the lower left window reflects the list of annotations that have already been entered by other users and allows users to navigate through them as well as to open the annotations for viewing. since moss creates a central repository for data in almost any format, users are not limited in the ways they organize their data. being able to link multiple kinds of data to an image or a protein structure, for example, makes it easier to organize experiments, collect and distribute results, and speed up the process of uncovering deeper knowledge from various experiments. each entity, sample, or laboratory report can become its own site. 
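the three-level hierarchy just described (projects containing entities, entities carrying anchored annotations) can be summarized with a small data-model sketch. the class and field names are assumptions made for illustration and are not taken from the c-me source code.

```python
# Illustrative data model for the project -> entity -> annotation hierarchy.
# Names and fields are assumptions, not the actual C-ME schema.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Annotation:
    kind: str                 # "note", "url", "file", "screen_capture", "discussion"
    anchor: Optional[str]     # e.g. "atoms:12,15", "residues:45-52", "region:x,y,w,h"
    payload: str              # note text, URL, file path, etc.

@dataclass
class Entity:
    name: str
    entity_type: str          # "structure" (PDB file), "image", or "specimen"
    source_file: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, kind: str, payload: str, anchor: Optional[str] = None) -> Annotation:
        ann = Annotation(kind=kind, anchor=anchor, payload=payload)
        self.annotations.append(ann)
        return ann

@dataclass
class Project:
    name: str
    context_image: Optional[str] = None      # optional 2-D image giving project context
    entities: List[Entity] = field(default_factory=list)
    subprojects: List["Project"] = field(default_factory=list)

if __name__ == "__main__":
    proj = Project(name="cancer therapeutics", context_image="pathway.png")
    structure = Entity(name="kinase model", entity_type="structure",
                       source_file="kinase.pdb")
    structure.annotate("note", "Loop conformation differs from the apo form.",
                       anchor="residues:101-110")
    proj.entities.append(structure)
    print(len(proj.entities), len(structure.annotations))
```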
all information about that entity can be collected and retrieved from one point either using c-me or a web browser. a movie demonstrating major c-me features and usage is available (video s , supporting information). once an entity is created, a user can begin annotating the molecule, image or specimen. annotations to molecule entities can be attached to the entire molecule, a set of user-defined amino acid residues, or a set of user-defined atoms. for an image entity, annotations are attached to the entire image or to a user-selected region of the image. specimen entities are different in that the annotations are actually the set of images (called cell hit annotations) associated with a specific sample added to the entity when the specimen entity was created. additional images can be added to a specimen entity by right-clicking on the entity name in the entities window at the left. annotations provide additional information and points of discussion for a molecular structure. text notes, urls, files, screen captures and discussions are all types of annotations that can be added to a structure or image. a user can then either double-click the annotation from the list in the lower left window or select the annotation flag from the graphical window to open that annotation. office documents and pdf files are opened for reading in a new tab in the c-me graphical window ( figure ). the user may then decide to open that document for editing. any file can be attached as an annotation. commonly used file annotations include generic text files, images, audio or video media files, and pdf files. c-me relies on the user's installed software to open these files with the appropriate application. in addition, microsoft word, excel and powerpoint documents can be attached as annotations in c-me and are more tightly integrated with c-me and open in a separate tab in the graphical window. for example, excel spreadsheets can be attached as annotations and accessed through c-me, where data or formulas in the cells can be modified by users with appropriate permissions from local or remote locations. multiple users can open a given document simultaneously. a check-in/check-out feature is avail- able. when a user edits an annotation, it is checked out to the first person who edits it. another user who wants to edit the same file at the same time can only open it as a ''read-only'' version. general text descriptions can be added as a note annotation and can be attached to the entire entity or to a specific region of the entity. a note annotation flag appears as a small pad-and-pen icon. a uniform resource locator (url) annotation is an html link that can be attached to an entity. this allows any web page to be added as additional information without having to actually extract information from the web site. a screen capture annotation allows an arbitrary rectangle of the user's screen to be captured as an image and attached to an entity. when selecting a screen capture annotation, c-me is minimized and the rest of the desktop is shown with a light grey color. the mouse cursor becomes a cross-hair cursor which can be used to drag out a rectangle to save as the screen captured image. a window then appears with the captured image in the bottom pane and a place to enter a title for the screen capture annotation, and optionally, an associated url and additional notes. users can also discuss aspects of an entity by creating a discussion thread. 
when creating a new discussion annotation, a window appears to enter the subject of the discussion and the actual text of the discussion. another user can then double-click this discussion and select a particular subject and click on the reply button to continue the discussion. other users with access to that entity can view all the entries in the discussion and contribute if desired. microsoft infopath forms, which are xml-based and are used to enter data about experiments and samples, can be published directly into a custom list or automatically sent via an e-mail message to a predefined e-mail account. infopath forms are currently being used to gather pathologist reviews of microscope images of stained cells. these forms are then processed to determine an overall score for each image from each pathologist's review. these scores can then be used to classify a cell image. by relying on industry-standard commercial off-the-shelf software products, c-me addresses three areas of concern related to security: authentication and information access, data backup, and data transport. c-me is built on the windows server operating system, and relies on the active directory for authentication and access control. system administrators or lab directors can control access and grant varying levels of privilegefrom ''read only,'' to ''contribute,'' to ''design pages''-or deny access entirely. the data backup function is built into the moss data portal and data contributed to c-me is automatically backed up. furthermore, data is encrypted and transferred using the ssl standard. generally, c-me is being used during regular laboratory operations to discuss and share current progress and problems in a particular research project. rather than having to create new powerpoint presentations, the presenter can start the c-me application and browse to the appropriate project to share the current thoughts and results. the annotations placed there collaboratively by other researchers working on that project can be viewed and edited, and new annotations can be added on the spot. similarly, c-me is being used to provide a tour of a completed crystal structure using the published research paper as a basis. now, the reader can step through the c-me annotations extracted from the paper and watch as the relevant portions of the structure are highlighted to place additional data, such as activity assays, sequence comparisons, and protein purification gels, in the context of the structural features. in addition, this guided display feature of c-me will also be used to share ideas, concepts and data with external collaborators allowing them to review, amend or add annotations. from the bench-top perspective, c-me is being used in both structure-based and cell-based research projects. there are currently two primary projects which represent similar but different challenges for the collaborative environment. the functional and structural proteomics analysis of sars-cov related proteins (fsps, http://visp.scripps.edu/sars/default. aspx) is an nih-funded project that requires the production of over sars virus proteome proteins, their crystallization, and eventually, determination of their crystal structures. c-me is being used to collect, organize and share the known fsps structures as well as to annotate them with functional study information. the second project is the cancer bioengineering research project (cbrp, http://cancer.scripps.edu). 
this project is focused on detecting rare circulating tumor cells in blood specimens from cancer patients. these cells have similarities that allow for their automated detection, and once detected, a d image of each cell is produced and is analyzed in multiple ways, requiring annotation of various types. these image analysis results need to be associated with the patient sample they came from, and each specimen might result in + different images, each of which requires evaluation and annotation by a pathologist. c-me is being used to store and organize the scanned microscope cell images, with the image analysis annotations and scoring attached to each stored image. in addition, infopath forms are used to allow pathologists to input the image analysis data to determine whether circulating tumor cells exist. c-me is being used in clinical classrooms. for example, the viewing and annotation features make it suitable for advancedlevel molecular biology courses. using c-me, instructors can present molecular-based material to students at a level of detail not available in print textbooks. it also organizes course materials in hierarchical trees and makes it possible for students to experience the process of research collaboration with their instructors. an instructor in a graduate level pathology course, for example, uses c-me to present students with examples of pathology reports in -d image formats. the instructor prepares appropriate annotations ahead of class time, and students reviewing the material after class can add comments or leave questions for the instructor. additionally, image-based medical specialties can use the features of c-me for education. in a pathology fellowship training program, where visual microscopic skills are paramount, learning modules for various pathologic diseases are created. this gives trainees the opportunity to practice making diagnoses by looking at microscopic images, as they will in their future daily practice. but unlike in real life, these microscopic images come with embedded annotations, files, powerpoint presentations and on-line literature reviews, highlighting the corresponding disease information the trainees need to master in association with each image of abnormal cells. trainees can also add modules to c-me as they pursue special interests in one disease or another, eventually creating a training 'wiki' for future year trainees. while c-me contains significant capability for use by researchers today, the application will be enhanced to increase its utility. currently, c-me requires an internet connection to access the project and entity data and annotations. we are planning enhancements that would allow offline use of c-me with data synchronization when c-me is re-connected to the internet. c-me will also be extended to leverage the search and indexing functionality already present in moss to provide searching functionality from the c-me client application. a c-me user would be able to perform searches within and across projects or entities to find relevant documents, notes, and other annotations related to the search criteria. in addition, c-me will detect user presence to signal a user that a collaborator is currently on-line. this would allow users to send an email or initiate an instant messenger session to discuss annotations and results if the other users or collaborators are currently using c-me. currently, infopath forms are used to capture pathologists' scores from surveys designed to evaluate cell images. 
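the pathologist review workflow described above (forms feeding per-image scores that are then used to classify candidate cells) can be illustrated with a small aggregation sketch. the scoring scale, the thresholds, and the field names are assumptions for illustration only, not the actual cbrp protocol.

```python
# Illustrative aggregation of pathologist reviews into a per-image call.
# The 0-3 scoring scale, the threshold, and the labels are assumptions,
# not the actual scoring protocol used in the project described above.
from collections import defaultdict
from statistics import mean

def classify_images(reviews, positive_threshold=2.0, min_reviews=2):
    """reviews: iterable of (image_id, pathologist_id, score) tuples.
    Returns {image_id: (call, mean_score)}."""
    scores = defaultdict(list)
    for image_id, _pathologist, score in reviews:
        scores[image_id].append(score)
    calls = {}
    for image_id, vals in scores.items():
        avg = mean(vals)
        if len(vals) < min_reviews:
            calls[image_id] = ("needs more review", avg)
        elif avg >= positive_threshold:
            calls[image_id] = ("candidate ctc", avg)
        else:
            calls[image_id] = ("not ctc", avg)
    return calls

if __name__ == "__main__":
    demo = [
        ("img-001", "path-a", 3), ("img-001", "path-b", 2),
        ("img-002", "path-a", 1), ("img-002", "path-b", 0),
        ("img-003", "path-a", 3),
    ]
    for image_id, (call, avg) in classify_images(demo).items():
        print(image_id, call, round(avg, 2))
```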
the use of infopath forms will be extended to include tests and quizzes for use by instructors to evaluate student performance and understanding of material presented in a c-me project or entity. collaborative molecular modeling environment (c-me) enhances existing laboratory resources with interactive, real-time annotation and visualization capabilities. with this technology, researchers can create wiki-style collaborations for internal or distributed laboratory environments. as a content management and collaboration tool, collaborative molecular modeling environment (c-me) has the potential to improve research efficiency by providing collaborators real-time access to data organized at the molecular and project levels. its platform consolidates data from disparate sources, allowing users to easily compare, contrast, and merge information from different file systems. it complements existing visualization and collaboration software tools, such as isee, kinemage, mice, and biocore, by allowing for real-time two-dimensional and three-dimensional annotation, hierarchical project and data organization, and dynamic collaboration. the c-me installation is available for download for computers running the windows xp or vista operating system (http://c-me.scripps.edu). this web site also contains additional information about the c-me application.
video s : video demonstration of the c-me application in use. the major c-me features are shown using a d protein structure entity and its associated annotations as an example. found at: doi: . /journal.pone. .s ( . mb swf)
capturing and supporting contexts for scientific data sharing via the biological sciences collaboratory
structures in systems biology
the kinemage: a tool for scientific communication
a prototype molecular interactive collaborative environment (mice)
biocore: a collaboratory for structural biology
electronic notebooks: an interface component for semantic records systems
disseminating structural genomics data to the public: from a data dump to an animated story
applications = code+markup: a guide to the microsoft windows presentation foundation
reference revolution, news@nature
wikinomics: how mass collaboration changes everything
the pymol molecular graphics system
coot: model-building tools for molecular graphics
we would like to thank the members of the kuhn and stevens laboratories at the scripps research institute for discussions, testing the c-me application, and providing feedback. this is tsri manuscript no. . conceived and designed the experiments: pk ak th rg. analyzed the data: ak jk dm kk dh kb. wrote the paper: pk ak. other: implemented the application: ak kk dh jk.
key: cord- -y p iznq authors: keogh, john g.; rejeb, abderahman; khan, nida; dean, kevin; hand, karen j. title: data and food supply chain: blockchain and gs standards in the food chain: a review of the possibilities and challenges date: - - journal: building the future of food safety technology doi: . /b - - - - . - sha: doc_id: cord_uid: y p iznq
this chapter examines the integration of gs standards with the functional components of blockchain technology as an approach to realize a coherent standardized framework of industry-based tools for successful food supply chain transformation. the vulnerability of food supply chains is explored through traceability technologies and standards with particular attention paid to interoperability.
fscs are vulnerable to natural disasters, malpractices, and exploitative behavior, leading to food security concerns, reputational damage, and significant financial losses. due to the inherent complexities of global fscs, it is almost impossible for stakeholders to police the entire flow of materials and products and identify all possible externalities. recurring disruptions (e.g., natural disasters, avian flu, swine fever) and consecutive food scandals have increased the sense of urgency in the management of fscs (zhong, xu, & wang, ) and negatively impacted consumer trust. the european "horsemeat scandal" in exemplified the vulnerabilities (yamoah & yawson, ) , and legal scholars from cambridge university posited that the ability of the eu's regulatory regime to prevent fraud on such a scale was shown to be inadequate: eu food law, with its (over) emphasis on food safety, failed to prevent the occurrence of fraud and may even have played an (unintentional) role in facilitating or enhancing it. the cambridge scholars further argued that the free movement of goods within the european union created a sense of "blind trust" in the regulatory framework, which proved to be inadequate to protect businesses and consumers from unscrupulous actors. while natural disasters and political strife are outside of the control of fsc stakeholders, to preserve food quality and food safety and minimize the risk of food fraud or malicious attacks, fsc stakeholders need to establish and agree on foundational methods for analytical science, supply chain standards, technology tools, and food safety standards. the redesign of the fsc is necessary in order to ensure unquestionable integrity in a resilient food ecosystem. this proposal would require a foundational approach to data management and data governance to ensure sources of accurate and trusted data to enable inventory management, order management, traceability, unsafe product recall, and measures to protect against food fraud. failure to do so will result in continued consumer distrust and economic loss. notably, a report by gs uk et al. ( ) found that eighty percent of united kingdom retailers had inconsistent product data, estimated to cost ukp million in profit erosion over years and a further ukp million in lost sales. the recent emergence of blockchain technology has created significant interest among scholars and practitioners in numerous disciplines. initially, blockchain technology was heralded as a radical innovation laden with a strong appeal to the financial sector, particularly in the use of cryptocurrencies (nakamoto, , p. ). the speculation on the true identity of the pseudonymous "satoshi nakamoto" gave rise to suspicion about the actual creators of bitcoin and their motives (lemieux, ) . moreover, halaburda ( ) argued that there is a lack of consensus on the benefits of blockchain and, importantly, how it may fail. further, rejeb, süle, & keogh ( , p. ) argued "ultimately, a blockchain can be viewed as a configuration of multiple technologies, tools and methods that address a particular problem." beyond the sphere of finance, blockchain technology is considered a foundational paradigm (iansiti & lakhani, ) with the potential for significant societal benefits and improved trust between fsc actors.
blockchain technology offers several capabilities and functionalities that can significantly reshape existing practices of managing fscs and partnerships, regardless of location, and also offers opportunities to improve efficiency, transparency, trust, and security, across a broad spectrum of business and social transactions (frizzo-barker et al., ) . the technological attributes of blockchain can combine with smart contracts to enable decentralization and self-organization to create, execute, and manage business transactions (schaffers, ) , creating a landscape for innovative approaches to information and collaborative systems. innovations are not merely a simple composition of technical changes in processes and procedures but also include new forms of social and organizational arrangements (callon, law, & rip, ) . the ubiquitous product bar code stands out as a significant innovation that has transformed business and society. since the decision by us industry (gs , ) to adopt the linear bar code on april , , and the first scan of a -pack of wrigley's juicy fruit chewing gum in marsh's supermarket in troy, ohio, on june , (gs , ), the bar code is scanned an estimated billion times daily. gs is a not-for-profit organization tasked with managing industry-driven data and information standards (note, gs is not an acronym). the gs system of interoperable standards assigns and manages globally unique identification of firms, their locations, their products, and assets. they rely on several technology-enabled functions for data capture, data exchange, and data synchronization among fsc exchange partners. in fscs, there is a growing need for interoperability standards to facilitate business-to-business integration. the adoption of gs standards-enabled blockchain technology has the potential to enable fsc stakeholders to meet the fast-changing needs of the agri-food industry and the evolving regulatory requirements for enhanced traceability and rapid recall of unsafe goods. although there is a growing body of evidence concerning the benefits of blockchain technology and its potential to align with gs standards for data and information (fosso wamba et al., ; kamath, ; lacity, ) , the need remains for an extensive examination of the full potentials and limitations. the authors of this section therefore reviewed relevant academic literature to examine the full potential of blockchain-enabled gs systems comprehensively and thereby provide a significant contribution to the academic and practitioner literature. blockchain research in the food context is fragmented, and the potentials and limitations in combination with gs standards remain vaguely conceptualized. it is vitally essential to narrow this research gap. this review will begin with an outline of the methodology applied to collect academic contributions to blockchain and gs standards within an fsc context, followed by an in-depth analysis of the findings, concluding with potential areas for future research. in order to explore the full potential of a system integrating blockchain functionalities and gs standards, a systematic review method based on tranfield, denyer, & smart ( ) guidelines was undertaken. the systematic review was considered a suitable method to locate, analyze, and synthesize peer-reviewed publications. research on blockchain technology is broad and across disciplines; however, a paucity of research specific to food chains exists (fosso wamba et al., ) . similarly, existing research on blockchain technology and gs standards is a patchwork of studies with no coherent or systematic body of knowledge. therefore, the objective of this study was to draw on existing studies and leverage their findings using content analysis to extract insights and provide a deeper understanding of the opportunities for a gs standards-enabled blockchain as an fsc management framework. as stated earlier, the literature on blockchain technology and gs is neither well-developed nor conclusive, yet such an understanding is necessary to ensure successful future implementations. in order to facilitate the process of literature collection, a review protocol based on the "preferred reporting items for systematic reviews and meta-analyses" (prisma) was used (liberati et al., ) . the prisma approach consists of four processes: the use of various sources to locate previous studies, the fast screening of studies and removal of duplicates, the evaluation of studies for relevance and suitability, and the final analysis of relevant publications. fig. . illustrates the prisma process.
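as a brief aside on the gs identification keys mentioned above: the gs key for trade items (the gtin) ends in a check digit computed with the standard gs mod-10 algorithm, in which weights of 3 and 1 alternate starting from the digit next to the check digit. the sketch below verifies a gtin-13; the function names are illustrative, and the sample number is a commonly cited valid ean-13.

```python
# GS1 mod-10 check digit for a GTIN (works for GTIN-8/12/13/14, which differ
# only in length). Weights alternate 3,1,3,... starting from the digit
# immediately to the left of the check digit.
def gs1_check_digit(data_digits: str) -> int:
    total = 0
    for i, ch in enumerate(reversed(data_digits)):
        weight = 3 if i % 2 == 0 else 1
        total += int(ch) * weight
    return (10 - total % 10) % 10

def is_valid_gtin(gtin: str) -> bool:
    return gtin.isdigit() and gs1_check_digit(gtin[:-1]) == int(gtin[-1])

if __name__ == "__main__":
    print(is_valid_gtin("4006381333931"))   # a frequently cited valid EAN-13/GTIN-13
    print(is_valid_gtin("4006381333932"))   # same digits with a wrong check digit
```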
similarly, existing research on blockchain technology and gs standards is a patchwork of studies with no coherent or systematic body of knowledge. therefore, the objective of this study was to draw on existing studies and leverage their findings using content analysis to extract insights and provide a deeper understanding of the opportunities for a gs standards-enabled blockchain as an fsc management framework. as stated earlier, the literature on blockchain technology and gs is neither well-developed nor conclusive, yet necessary to ensure successful future implementations. in order to facilitate the process of literature collection, a review protocol based on the "preferred reporting items for systematic reviews and meta-analyzes" (prisma) was used (liberati et al., ) . the prisma approach consists of four processes: the use of various sources to locate previous studies, the fast screening of studies and removal of duplicates, the evaluation of studies for relevance and suitability, and the final analysis of relevant publications. fig. . illustrates the prisma process. to ensure unbiased results, this phase of the study was completed by researchers with no previous knowledge or association with gs . conducting the review began with a search for studies on blockchain technology and gs standards. reviewed publications originated from academic sources (peer-reviewed) and included journal articles, conference papers, and book chapters. due to the nascent and limited literature on blockchain technology and gs standards, we supplemented our analysis with other sources of information, including conference proceedings, gray sources, and reports. the survey of the literature was conducted using four major scientific databases: scopus, web of science, sciencedirect, and google scholar. we used a combination of keywords that consisted of the following search string: "blockchain*" and "gs " and ("food chain*" or "food supply*" or agriculture or agro. the google scholar search engine has limited functionality and allows only the fulltext search field; therefore, only one search query "blockchain* and gs and food" was used for the retrieval of relevant studies. the titles and abstracts of publications were scanned to obtain a general overview of the study content and to assess the relevance of the material. as shown in fig. . , a total of publications were found. many of the publications were redundant due to the comprehensive coverage of google scholar; studies focused on blockchain technology outside the context of food were removed. a fine-tuned selection of the publications was undertaken to ensure relevance to fscs. table . contains a summary of the findings based on content analysis. the final documents were classified, evaluated, and found to be sufficient in narrative detail to provide an overview of publications to date, specifically related to blockchain technology and gs standards. the loss of trust in the conventional banking system following the global financial crisis laid the groundwork for the introduction of an alternative monetary system based on a novel digital currency and distributed ledger (richter, kraus, & bouncken, ) . "satoshi nakamoto" (a pseudonym for an unknown person, group of people, organization, or other public or private body) introduced an electronic peer-to-peer cash system called bitcoin (nakamoto, , p. ). the proposed system allowed for payments in bitcoin currency, securely and without the intermediation of a trusted third party (ttp) such as a bank. 
the bitcoin protocol utilizes a blockchain, which provides an ingenious and innovative solution to the double-spending problem (i.e., where digital currency or a token is spent more than once), eliminating the need for a ttp intervention to validate the transactions. moreover, lacity ( , p. ) argued "while ttps provide important functions, they have some serious limitations, like high transaction fees, slow settlement times, low transaction transparency, multiple versions of the truth and security vulnerabilities." the technology behind the bitcoin application is known as a blockchain. the bitcoin blockchain is a distributed database (or distributed ledger) implemented on public, untrusted networks (kano & nakajima, ) with a cryptographic signature (hash) that is resistant to falsification through repeated hashing and a consensus algorithm (sylim, liu, marcelo, & fontelo, ). blockchain technology is engineered in a way that parties previously unknown to each other can jointly generate and maintain a database of records (information) and can correct and complete transactions, which are fully distributed across several nodes (i.e., computers) and validated using consensus of independent verifiers (tijan, aksentijević, ivanić, & jardas, ). blockchain is categorized under the distributed ledger technology family and is characterized by a peer-to-peer network and a decentralized distributed database, as depicted in fig. . (a diagrammatic representation of blockchain technology). according to lemieux ( ), the nodes within a blockchain work collectively as one system to store encrypted sequences of transactional records as a single chained unit or block. nodes in a blockchain network can either be validator nodes (miners in ethereum and bitcoin) that participate in the consensus mechanism or nonvalidator nodes (referred to only as nodes). when any node wants to add a transaction to the ledger, the transaction of interest is broadcast to all nodes in the peer-to-peer network. transactions are then collected into a block, where the addition to the blockchain necessitates a consensus mechanism. validators compete to have their local block be the next addition to the blockchain. the way blocks are constructed and propagated in the system enables the traceback of the whole chain of valid network activities back to the genesis block initiated in the blockchain. furthermore, the consensus methodology employed by the underlying blockchain platform designates the validator whose block gets added to the blockchain, with the others remaining in the queue and participating in the next round of consensus. the validator node gains an incentive for updating the blockchain database (nakamoto, , p. ). the blockchain may impose restrictions on reading the data and on the flexibility to become a validator and write to the blockchain, depending upon whether the blockchain is permissioned or permission-less. a consensus algorithm enables secure updating of the blockchain data, which is governed by a set of rules specific to the blockchain platform. this right to update the blockchain data is distributed among the economic set (buterin, b), a group of users who can update the blockchain based on a set of rules. the economic set is intended to be decentralized, with no collusion within the set (a group of users) to form a majority, even though its members might have a large amount of capital and financial incentives.
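to make the chained-hash construction described above concrete, the following is a minimal, illustrative sketch of a hash chain in which each block stores the hash of its predecessor, so that altering any earlier record breaks every later link. it is a teaching aid only, written in python, and does not reflect the actual data model of bitcoin, ethereum, or any other platform; the example transactions are invented.

```python
# A minimal, illustrative hash chain: each block stores the hash of its
# predecessor, so altering any historical record breaks every later link.
# This is a teaching sketch only, not the data model of any real platform.
import hashlib
import json


def block_hash(block: dict) -> str:
    # Serialize deterministically before hashing so all nodes agree.
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_block(transactions: list, prev_hash: str) -> dict:
    return {"transactions": transactions, "prev_hash": prev_hash}


def is_chain_valid(chain: list) -> bool:
    # Recompute each link; tampering with an earlier block changes its hash
    # and invalidates every subsequent prev_hash pointer.
    for prev, curr in zip(chain, chain[1:]):
        if curr["prev_hash"] != block_hash(prev):
            return False
    return True


genesis = make_block(["genesis"], prev_hash="0" * 64)
b1 = make_block(["farm A ships lot 42 to processor B"], block_hash(genesis))
b2 = make_block(["processor B ships pallet 7 to retailer C"], block_hash(b1))
chain = [genesis, b1, b2]

print(is_chain_valid(chain))          # True
b1["transactions"][0] = "tampered"    # falsify an earlier record
print(is_chain_valid(chain))          # False
```

the consensus mechanisms by which validators agree on which block is appended next are discussed in the following passage.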
the blockchain platforms that have emerged employ one of the following decentralized economic sets; however, each example might utilize a different set of consensus algorithms:
owners of computing power: this set employs proof-of-work (pow) as a consensus algorithm, observed in blockchain platforms like bitcoin and ethereum. each block header in the blockchain has a string of random data called a nonce attached to it (nakamoto, , p. ). the miners (validators) need to search for this random string such that, when attached to the block, the hash of the block has a certain number of leading zeros; the miner who can find the nonce is designated to add their local block to the blockchain, accompanied by the generation of new cryptocurrency. this process is called mining. mining involves expensive computations leading to (often massive) wastage of computational power and electricity, which is undesirable from an ecological point of view (o'dwyer & malone, ) and results in a small, exclusive set of users for mining. this exclusivity, however, goes against the idea of having a decentralized set, leading blockchain platforms to employ other means of arriving at a consensus.
stakeholders: this set employs the different variants of the proof-of-stake (pos) consensus mechanism. pos is a more equitable system than pow, as mining or validation can be performed on any ordinary computer. ethereum pos requires the miner or the validator to lock a certain amount of their coins in the currency of the blockchain platform to verify the block. this locked number of coins is called a stake. computational power is required only to verify whether a validator owns a certain percentage of the coins in the available currency. there are several proposals for pos, as pos enables an improved decentralized set, takes power out of the hands of a small exclusive group of validators, and distributes the work evenly across the blockchain. in ethereum pos, the probability of mining the block is proportional to the validator's stake (ethhub, ), just as in pow it is proportional to the computational hashing power. as long as a validator is mining, their stake remains locked. a downside of this consensus mechanism is that the richest validators are accorded a higher priority. the mechanism does, however, encourage more community participation than many other methods. other consensus protocols include the traditional byzantine fault tolerance theory (sousa et al., ), where the economic set needs to be sampled for the total number of nodes. here, the set most commonly used is stakeholders. hence, such protocols can be considered subcategories of pos.
a user's social network: this is used in ripple and stellar consensus protocols. the ripple protocol, for example, requires a node to define a unique node list (unl), which contains a list of other ripple nodes that the defining node is confident would not work against it. a node consults other nodes in its unl to achieve consensus. consensus happens in multiple rounds, with a node declaring a set of transactions in a "candidate set," which is sent to other nodes in the unl. nodes in the unl validate the transactions, vote on them, and broadcast the votes. the initiating node then refines the "candidate set" based on the votes received to include the transactions getting the largest number of votes for the next round.
this process continues until a "candidate set" receives % votes from all the nodes in the unl, and then it becomes a valid block in the ripple blockchain. blockchain technologies are considered a new type of disruptive internet technology (pan, song, ai, & ming, ) and an essential enabler of large-scale societal and economic changes (swan, ; tapscott & tapscott, ). the rationale for this argument lies in blockchain's complex technical constructs (hughes et al., ), such as the immutability of transactions, security, confidentiality, consensus mechanisms, and the automation capabilities enabled by smart contracts. the latter is heralded as the most important application of blockchain (the integrity of the code in smart contracts requires quality assurance and rigorous testing). by definition, a smart contract is a computer program that formalizes relationships over computer networks (szabo, , ). although smart contracts predate bitcoin/blockchain by a decade and do not need a blockchain to function (halaburda, ), a blockchain-based smart contract is executed on a blockchain with a consensus mechanism determining its correct execution. a wide range of applications can be implemented using smart contracts, including gaming, financial, notary, or computational applications (bartoletti & pompianu, ). the use of smart contracts in the fsc industry can help to verify digital documents (e.g., certificates such as organic or halal) as well as determine the provenance (source or origin) of specific data. in a cold chain scenario, rejeb et al. ( ) argued that smart contracts connected to iot devices could help to preserve the quality and safety of goods in transit. for example, temperature tolerances embedded into the smart contract can trigger in-transit alerts and facilitate shipment acceptance or rejection based on preset parameters in the smart contract. the first platform for implementing smart contracts was ethereum (buterin, a), although most platforms today cater to smart contracts. therefore, similar to the radical transformations brought by the internet to individuals and corporate activities, the emergence of blockchain provides opportunities that can broadly impact supply chain processes (fosso wamba et al., ; queiroz, telles, & bonilla, ). in order to understand the implications of blockchain technology for food chains, it is essential to realize the potential of its conjunction with gs standards. while the technology is still at a nascent stage of development and deployment, it is worthwhile to draw attention to the potential alignment of blockchain technology with gs standards, whose proven success makes universal adoption very likely to prevail in the future. traceability is a multifaceted construct that is crucially important in fscs and has received considerable attention through its application in the iso /bs quality standards (cheng & simmons, ). scholars have stressed the importance and value of traceability in global fscs (charlier & valceschini, ; roth, tsay, pullman, & gray, ). broadly, traceability refers to the ability to track the flow of products and their attributes throughout the entire production process and supply chain (golan et al., ). furthermore, olsen and borit ( ) completed a comprehensive review of traceability across academic literature, industry standards, and regulations and argued that the various definitions of traceability are inconsistent and confusing, often with vague or recursive usage of terms such as "trace."
they provide a comprehensive definition: "the ability to access any or all information relating to that which is under consideration, throughout its entire life cycle, by means of recorded identifications" (olsen and borit, , p. ) . the gs global traceability standard (gs , c: ) aligns with the iso : definition "traceability is the ability to trace the history, application or location of an object [iso : ] . when considering a product or a service, traceability can relate to origin of materials and parts; processing history; distribution and location of the product or service after delivery." traceability is also defined as "part of logistics management that capture, store, and transmit adequate information about a food, feed, food-producing animal or substance at all stages in the food supply chain so that the product can be checked for safety and quality control, traced upward, and tracked downward at any time required" (bosona & gebresenbet, , p. ). in the fsc context, a fundamental goal is to maintain a high level of food traceability to increase consumer trust and confidence in food products and to ensure proper documentation of the food for safety, regulatory, and financial purposes (mahalik & kim, ) . technology has played an increasingly critical role in food traceability over the past two decades (hollands et al., ) . for instance, radio frequency identification (rfid) has been adopted in some fscs to enable non-line-of-sight identification of products to enhance end-to-end food traceability (kelepouris et al., ) . walmart achieved significant efficiency gains by deploying drones in combination with rfid inside a warehouse for inventory control (companik, gravier, & farris, ) . however, technology applications for food traceability are fragmented, often proprietary and noninteroperable, and have enabled trading partners to capture only certain aspects of the fsc. as such, a holistic understanding of how agri-food businesses can better track the flow of food products and related information in extended, globalized fscs is still in a nascent stage of development. for instance, malhotra, gosain, & el sawy ( ) suggested it is imperative to adopt a more comprehensive approach of traceability that extends from source to final consumers in order to obtain a full understanding of information processing and sharing among supply chain stakeholders. in this regard, blockchain technology brings substantial improvements in transparency and trust in food traceability (behnke & janssen, ; biswas, muthukkumarasamy, & tan, ; sander, semeijn, & mahr, ) . however, arguments from many solution providers regarding traceability from "farm to fork" are a flawed concept as privacy law restricts tracking products forward to consumers. in this regard, tracking (to track forward) from farm to fork is impossible unless the consumer is a member of a retailers' loyalty program. however, tracing (to trace backward) from "fork to farm" is a feasible concept enabled by a consumer scanning a gs centric bar code or another code provided by the brand (e.g., proprietary qr code). hence, farm-to-fork transparency is a more useful description of what is feasible (as opposed to farm-to-fork traceability). while a blockchain is not necessarily needed for this function, depending on the complexity of the supply chain, a blockchain that has immutable information (e.g., the original halal or organic certificate from the authoritative source) could improve the integrity of data and information provenance. 
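the cold-chain scenario mentioned earlier, in which temperature tolerances embedded in a smart contract trigger in-transit alerts and shipment acceptance or rejection, can be sketched as follows. the logic is written in python purely for readability; actual smart contracts are executed on a blockchain platform, and the tolerance values, reading format, and function names here are hypothetical.

```python
# Illustrative cold-chain acceptance logic of the kind a smart contract
# could encode; written in Python for readability rather than as actual
# on-chain code. The tolerance values and reading format are hypothetical.
from dataclasses import dataclass


@dataclass
class ShipmentTerms:
    max_temp_c: float        # contractual upper bound for the cold chain
    max_breach_minutes: int  # tolerated cumulative time above the bound


def evaluate_shipment(readings, terms: ShipmentTerms) -> str:
    """readings: list of (minutes_since_last_reading, temperature_c)
    pairs reported by IoT sensors in transit."""
    breach_minutes = sum(
        minutes for minutes, temp in readings if temp > terms.max_temp_c
    )
    # A real contract would also record the decision immutably and could
    # trigger payment or rejection automatically.
    return "accept" if breach_minutes <= terms.max_breach_minutes else "reject"


terms = ShipmentTerms(max_temp_c=4.0, max_breach_minutes=30)
readings = [(15, 3.2), (15, 3.9), (15, 5.1), (15, 3.5)]
print(evaluate_shipment(readings, terms))  # "accept": only 15 min above 4.0 C
```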
blockchain is heralded as the new "internet layer of value," providing the trinity of traceability, trust, and transparency to transactions involving data or physical goods and facilitating authentication, validation, traceability, and registration (lima, ; olsen & borit, ). integrating gs standards with blockchain technology enables global solutions linking identification standards for firms, locations, products, and assets with blockchain's transactional integrity. thus, the combination of blockchain and gs standards could respond to the emerging and more stringent regulatory requirements for enhanced forms of traceability in fscs (kim, hilton, burks, & reyes, ). a blockchain can be configured to provide complete information on fsc processes, which is helpful to verify compliance with specifications and to trace a product to its source in adverse events (such as a consumer safety recall). this capability enables blockchain-based applications to solve problems plaguing several domains, including the fsc, where verified and nonrepudiated data are vital across all segments to enable the functioning of the entire fsc as a unit. within the gs standards framework, food traceability is industry-defined and industry-approved and includes categorizations of traceability attributes. these include the need to assign unique identifiers for each product or product class and group them into traceable resource units (behnke & janssen, ). fsc actors are both data creators (i.e., the authoritative source of a data attribute) and data users (i.e., custodians of data created by other parties such as an upstream supplier). data are created and used in the sequential order of farming, harvesting, production, packaging, distribution, and retailing. in an optimized fsc, the various exchange parties must be interconnected through a common set of interoperable data standards to ensure the data created and used provide a shared understanding of the data attributes and rules (rules on data creation and sharing are encompassed within gs standards). a blockchain can be configured to add value in fscs by creating a platform with access to and control of immutable data, which is not subject to egregious manipulation. moreover, blockchain technology can overcome the weaknesses created by decades-old compliance with the minimum regulatory traceability requirements, such as registering the identity of the exchange party who is the source of inbound goods and registering the identity of the exchange party who is the recipient of outbound goods. this process is known as "one-up/one-down" traceability (wang et al., ) and essentially means that exchange parties in an fsc have no visibility on products outside of their immediate exchange partners. blockchain technology enables fsc exchange partners to maintain food traceability by providing a secure, unfalsifiable, and complete history of food products from farm to retail (molding, ). unlike logistics-oriented traceability, the application of blockchain and gs standards can create attribute-oriented traceability, which is not only concerned with the physical flow of food products but also tracks other crucial information, including product quality and safety-related information (skilton & robinson, ). on the latter point, food business operators always seek competitive advantage and premium pricing through product (e.g., quality) or process differentiation claims (e.g., organically produced, cage-free eggs).
this is in response to research indicating that an increasing segment of consumers will seek out food products best aligning with their lifestyle preferences, such as vegetarian or vegan, or with social and ethical values, such as fair trade, organic, or cage-free (beulens, broens, folstar, & hofstede, ; roe & sheldon, ; vellema, loorbach, & van notten, ; yoo, parameswaran, & kishore, ). in fig. . below, keogh ( ) outlines the essential traceability functions and distinguishes the supply chain flow of traceability event data from the assurance flow of credence attributes such as food quality and food safety certification. for instance, in economic theory, goods are considered to comprise ordinary, search, experience, or credence attributes (darby & karni, ; nelson, ). goods classified as ordinary (e.g., petrol or diesel) have well-known characteristics and known sources and locations where they can be found and purchased. search goods are those for which the consumer can easily access trusted sources of information about the attributes of the product before purchase and at no cost. search is "costless" per se and can range from inspecting and trying on clothes before buying to going online to find out about a food product, including its ingredients, package size, recipes, or price. in the example of inspecting clothes before purchase, dulleck, kerschbamer, & sutter ( ) differentiate "search" from "experience" by arguing that experience entails unknown characteristics of the good that are revealed only after purchase (e.g., the actual quality of materials, or whether it fades after washing). products classified as experience goods carry attribute claims, such as being tasty, flavorful, or nutritious, or health-related claims such as lowering cholesterol, and require the product to be tasted or consumed to verify the claim, which may take time (e.g., lowering cholesterol). verifying the experience attributes may be free if test driving a car or receiving a sample or taster of a food product in a store. nevertheless, test driving or sampling will not confirm how the product will perform over time. generally speaking, verifying experience attributes of food is not free, and it may take considerable time (and likely expense) to verify the claim. credence claims (darby and karni, ) are characterized by asymmetric information between food producers and food consumers. this is because credence attributes are either intrinsic to the product (e.g., food quality, food safety) or relate to extrinsic methods of processing (e.g., organic, halal, kosher), and consumers cannot verify these claims before or after purchase (dulleck et al., ). in this regard, a blockchain offers a significant advancement in how credence claims flow (see fig. . ) and are added to a product or batch/lot # record. for instance, the immutability of the data means that a brand owner can add a record such as a third-party certificate (e.g., laboratory analysis verifying a vegan claim or a usda organic certificate), but they cannot edit or change it. this feature adds much-needed integrity to fscs and enhances transparency and consumer trust, especially if the third-party data are made available for consumers to query. in this context, the combination of gs standards and a blockchain provides a consumer with the capability to scan a food product and query its digital record to verify credence claims, as sketched below.
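as a rough illustration of the consumer-side verification just described, the sketch below checks a presented third-party certificate against a hash anchored on a ledger at labeling time; the record layout, identifiers, and certificate contents are hypothetical and greatly simplified.

```python
# A sketch of how a consumer-facing app might verify a credence claim:
# the scanned identifier is used to look up the product record, and the
# certificate attached by the brand owner is checked against the hash
# anchored on the ledger. Identifiers and record layout are hypothetical.
import hashlib


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hash of the third-party certificate as written to the ledger at labeling time.
ledger_anchors = {
    "09506000134352|LOT123": sha256(b"organic certificate #0001 ..."),
}


def verify_credence_claim(gtin: str, lot: str, certificate_bytes: bytes) -> bool:
    key = f"{gtin}|{lot}"
    anchored = ledger_anchors.get(key)
    # The claim is trusted only if the presented certificate still matches
    # the hash that was immutably recorded when the claim was made.
    return anchored is not None and anchored == sha256(certificate_bytes)


print(verify_credence_claim("09506000134352", "LOT123",
                            b"organic certificate #0001 ..."))  # True
print(verify_credence_claim("09506000134352", "LOT123",
                            b"forged certificate"))             # False
```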
at a more detailed level, the fragmentation of fscs and their geographic dispersion illustrate the need for blockchain and gs to achieve an optimal granularity level of traceability units (dasaklis, casino, & patsakis, ). as such, the combination of blockchain and gs standards can help in the assurance of food quality and safety, providing secure (toyoda, takis mathiopoulos, sasase, & ohtsuki, ), precise, and real-time traceability of products. moreover, the speed of food authentication processes makes blockchain a potential enabler of a proactive food system, a key catalyst for anticipating risky situations and taking the necessary preventative measures. triggering automatic and immediate actions in the fsc has been an impetus for large corporations to adopt blockchain technology; for example, walmart leverages gs standards and blockchain technology, defining the data attributes to be entered into their preferred blockchain system, such as the attributes defined under the produce traceability initiative (pti, ). using gs standards as a foundational layer, walmart tracks pork meat and pallets of mangoes, tagged with unique numeric identifiers, in china and the united states. walmart has demonstrated the significant value of a gs -enabled blockchain, reducing both business and consumer risk in a product safety recall. more specifically, walmart simulated a product safety recall for mangoes, and this exercise suggested a reduction in time to execute the product safety recall from days pre-blockchain to . s using a blockchain (kamath, ). the contribution of gs to the de facto individualization of food products has motivated the study of dos santos, torrisi, yamada, & pantoni ( ), who examine the traceability requirements in recipe-based foods and propose whole-chain traceability with a focus on ingredient certification. with the use of blockchain technology, it is possible to verify the source of any batch or lot number of ingredients. kim et al. ( ) developed an application called "food bytes" using blockchain technology, enabling consumers to validate and verify specific quality attributes of their foods (e.g., organic) by accessing curated gs standard data from mobile devices, thereby increasing ease of consumer usability and ultimately trust. blockchain technology can help fsc partners develop best practices for traceability and curb fraudulent and deceptive actions as well as the adulteration of food products. to address these issues, staples et al. ( ) developed a traceability system based on haccp, gs , and blockchain technology in order to guarantee reliable traceability of the swine supply chain. in their proposed system, gs aids in the coordination of supply chain information, and blockchain is applied to secure food traceability. a pressing challenge facing fscs is the need to coordinate information exchange across several types of commodities, transportation modes, and information systems. by analogy, a similar need was resolved in the healthcare industry through the implementation of electronic health records (ehr) to provide access to an individual patient's records across all subdomains catering to the patient. the healthcare industry is presently working on enhancing ehr through the deployment of blockchain to serve as a decentralized data repository for preserving data integrity, security, and ease of management (shahnaz, qamar, & khalid, ).
closely resembling the role and function of the ehr in the healthcare industry, the creation of a digital food record (dfr) is vital for fscs to facilitate whole-chain traceability and interoperability, linking the different actors and data creators in the chain and enhancing trust in the market for each product delivered. fsc operators need access to business-critical data at an aggregated level to drive their business strategy and operational decisions, and many of the organizations operate at the global, international, or national levels. data digitization and collaboration efforts of fsc organizations are essential to enable actionable decisions by the broader food industry. currently, much of the data exist in siloed, disparate sources that are not easily accessible, including data related to trade (crop shortages/overages), market prices, import/export transaction data, or real-time data on pests, disease or weather patterns, and forecasts. with this in mind, and acknowledging the need for transparent and trusted data sharing, the dutch horticulture and food domain created "horticube," an integrated platform to enable seamless sharing of data and enable semantic interoperability (verhoosel, van bekkum, & verwaart, ). the platform provides "an application programming interface (api) that is based on the open data protocol (odata). via this interface, application developers can request three forms of information; data sources available, data contained in the source, and the data values from these data sources" (verhoosel et al., , p. ). the us food and drug administration is currently implementing the food safety modernization act (fsma) with emphasis on the need for technological tools to accomplish interoperability and collaboration in their "new era of smarter food safety" (fda, ). in order to enable traceability as envisioned in fsma, a solution is required that incorporates multiple technologies, including iot devices. blockchain is envisioned as a platform of choice owing to its characteristic immutability, which prevents the corruption of data (khan, ). ecosystems suited for the application of blockchain technology are those consisting of an increasing set of distributed nodes that need a standard approach and a cohesive plan to ensure interoperability. more precisely, fscs comprise various partners working collaboratively to meet the demands of various customer profiles, and this collaboration necessitates an exchange of data (mertins et al., ); furthermore, the data should be interchanged in real time and verified as originating from the designated source. interoperability is a precursor of robust fscs that can withstand market demands by providing small and medium enterprises with the necessary information to decide on the progress of any product within the supply chain and ensure the advancement of safe products to the end consumer. blockchain technology enables an improved level of interoperability, as fsc actors would be able to communicate real-time information (bouzdine-chameeva, jaegler, & tesson, ), coordinate functions, and synchronize data exchanges (bajwa, prewett, & shavers, ; behnke & janssen, ). the potential interoperability provided by blockchains can be realized through the implementation of gs standards.
specifically, the electronic product code information standard (epcis) can be used to ensure the documentation of all fsc events in an understandable form and the aggregation of food products into higher logistic units, business transactions, or other information related to the quantity of food products and their types (xu, weber, & staples, ). a recent study by the institute of food technologists found evidence that technology providers faced difficulty in collaborating to determine the origin or the recipients of a contaminated product (bhatt & zhang, ). hence, the novel approach of blockchain provides a specific emphasis on interoperability between disparate fsc systems, allowing technology providers to design robust platforms that ensure interoperable and end-to-end product traceability. the use of iot devices allows organizations within fscs to send and receive data; however, the authenticity of the data still needs to be ascertained. a compounding factor is the technological complexity of fscs (ahtonen & virolainen, ), due to the reliance on siloed systems that hamper collaboration and the efficient flow of information. however, blockchain architecture can accommodate interoperability standards at the variable periphery (the iot devices) and other technologies used to connect fsc processes (augustin, sanguansri, fox, cobiac, & cole, ). blockchain is envisaged as a powerful tool (ande, adebisi, hammoudeh, & saleem, ) and an appropriate medium to store the data from iot devices, since it provides seamless authentication, security, protection against attacks, and ease of deployment, among other potential advantages (fernández-caramés & fraga-lamas, ). for fscs, blockchain is seen as the foundational technology for the sharing and distribution (read and write) of data by the organizations comprising the ecosystem, as shown in fig. . . in this model, consumers can read data for any product and trace the entire path from the origin to the destination while relying upon the immutability of blockchain to protect the data from any tampering. supply chain data are stored as a dfr in the various blocks (e.g., b , b , b ) that comprise the blockchain. the first block, represented by g in fig. . , refers to the genesis block, which functions as a prototype for all the other blocks in the blockchain. gs ratified version . of the gs global traceability standard (gs , b), documenting the business process and system requirements for full-chain traceability. the document is generic by design, with supplemental, industry-specific documents developed separately. several of these supplements are of interest to fscs; together, these documents provide comprehensive guidance to fscs on the implementation of a traceability framework. figs. . and . below indicate single-company and multiple-company views of traceability data generation (generation of traceability data in a multiparty supply chain; gs , a). underlying the gs traceability standard is the gs epcis (gs , a), which defines traceability as an ordered collection of events that comprise four key dimensions:
• what: the subject of the event, either a specific object (epc) or a class of object (epc class) and a quantity
• when: the time at which the event occurred
• where: the location where the event took place
• why: the business context of the event
the gs global traceability standard adds a fifth dimension, "who," to identify the parties involved.
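a minimal sketch of an event carrying these five dimensions is shown below. the real epcis standard defines detailed xml/json-ld schemas and controlled vocabularies, so this simplified python structure only mirrors the what/when/where/why/who idea; the identifier strings are illustrative examples, not data from any actual supply chain.

```python
# A minimal sketch of the event dimensions described above. The real EPCIS
# standard defines detailed XML/JSON-LD schemas and vocabularies; this
# simplified structure only mirrors the what/when/where/why/who idea.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TraceabilityEvent:
    what: list      # EPCs, or an EPC class plus quantity
    when: datetime  # time at which the event occurred
    where: str      # location identifier (e.g., a GLN-based URI)
    why: str        # business context (business step / disposition)
    who: list       # parties involved (the fifth dimension)


event = TraceabilityEvent(
    what=["urn:epc:id:sgtin:0614141.107346.2018"],
    when=datetime(2020, 3, 1, 14, 30, tzinfo=timezone.utc),
    where="urn:epc:id:sgln:0614141.00777.0",
    why="shipping",
    who=["GrowerCo", "ThirdPartyWarehouseCo"],
)
print(event.why, event.where)
```

note that the "who" parties are distinct from the "where" location, a distinction that matters in practice, as discussed next.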
this can be substantially different from the "where" dimension, as a single location (e.g., a third-party warehouse) may be associated with multiple, independent parties. epcis is supplemented by the core business vocabulary standard (gs , a), which specifies the structure of vocabularies and specific values for the vocabulary elements to be utilized in conjunction with the gs epcis standard. epcis is a standard that defines the type and structure of events and a mechanism for querying the repository. assuming that all parties publish to a common epcis repository (centralized approach) or that all parties make their repositories available (open approach), traceability is simply the process of querying events, analyzing them, and querying subsequent events until all relevant data are retrieved. in practice, neither the centralized nor the open approach is possible. in the centralized approach, multiple, competing repositories will naturally prevent a single, centralized repository from ever being realized. even if such a model were to be supported in the short term by key players in the traceability ecosystem, as more and more players are added, the odds of one or more of them already having used a competing repository grow. in the open approach, not all parties will be willing to share all data with all others, especially competitors. depending on the nature of the party querying the data or the nature of the query itself, the response may be no records, some records, or all records satisfying the query. for either approach, there is the question of data integrity: can the system prove that the traceability data have not been tampered with? blockchain is a potential solution to these problems. as a decentralized platform, blockchain integration could provide epcis solution providers with a way of sharing data in a secure fashion. furthermore, the sequential, immutable nature of the blockchain platform either ensures that the data cannot be changed or provides a mechanism for verifying that they have not been tampered with. the critical question is, what exactly gets stored on the blockchain? the options discussed by gs in a white paper on blockchain (gs , b) are
• fully formed, cryptographically signed plain text event data, which raises concerns about scalability, performance, and security if full events are written to a ledger;
• a cryptographic hash of the data, which has little meaning by itself. this requires off-chain data exchange via a separate traceability application and a hash comparison to verify that data have not been altered since the hash was written to the ledger; and
• a cryptographic hash of the data and a pointer to off-chain data. this is the same as the above point with a pointer to the off-chain data source. such an approach can enable the ledger to act as part of a discovery mechanism for parties who need to communicate and share data.
this then leads to the question of the accessibility of the data: public, where everyone sees all transactions, or private, which includes a permission layer that makes transactions viewable to only approved parties. integrating epcis (or any other data sharing standard) with blockchain often presents significant challenges: in most cases, volumetric analysis can reveal sensitive business intelligence even without examining the data. for example, if company x is currently publishing records per day, and next year at the same time it is publishing only , it is reasonable to assume that company x's volume is down by % year over year.
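before turning to the further disclosure risks discussed below, the second and third publication options listed above (a hash alone, and a hash plus a pointer to off-chain data) can be sketched roughly as follows; the event payload, repository url, and ledger-entry layout are hypothetical and not drawn from any particular platform.

```python
# A sketch of the "hash" and "hash plus pointer" publication options: only a
# digest (and optionally a pointer to an off-chain repository) is written to
# the ledger, while the full event stays off-chain. Payload and URL are
# hypothetical.
import hashlib
import json

event = {
    "what": ["urn:epc:id:sgtin:0614141.107346.2018"],
    "when": "2020-03-01T14:30:00Z",
    "where": "urn:epc:id:sgln:0614141.00777.0",
    "why": "shipping",
}

canonical = json.dumps(event, sort_keys=True).encode("utf-8")
event_hash = hashlib.sha256(canonical).hexdigest()

# Option 2: publish only the hash (meaningless by itself; verification
# requires obtaining the event off-chain and recomputing the digest).
ledger_entry_hash_only = {"hash": event_hash}

# Option 3: publish the hash plus a pointer to the off-chain repository,
# so the ledger also acts as part of a discovery mechanism.
ledger_entry_with_pointer = {
    "hash": event_hash,
    "repository": "https://epcis.example.org/events/12345",
}

# Later verification: a party that retrieves the event off-chain checks it
# against the anchored digest.
retrieved = json.dumps(event, sort_keys=True).encode("utf-8")
print(hashlib.sha256(retrieved).hexdigest() == ledger_entry_hash_only["hash"])  # True
```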
revealing the subject of an event (the "what" dimension) can reveal who is handling the expensive product, which may be used to plan its theft or diversion. publishing a record in plain text makes the data available to any party that has a copy of the ledger, but not all data should be available to all parties. for example, transformation events in epcis record inputs partially or fully consumed to produce one or more outputs. in the food industry, this is the very nature of a recipe, which is often a closely guarded trade secret. in order to mitigate this risk, the ledger would have to be firmly held by a limited number of parties that could enforce proper data access controls. even if such a system were to be implemented correctly, it means that proprietary information would still be under the control of a third party, which is a risk that many food companies would not be willing to take. publishing a record in an encrypted form would solve the visibility issue, but in order to do so, the industry would have to agree on how to generate the keys for the encrypted data. one option is to use the event's subject (the "what" dimension) as the key. if the identifier for the subject is sufficiently randomized, this ensures that only parties that have encountered the identifier can actually decrypt the data; while other parties could guess at possible values of the identifier, doing so at scale can be expensive and therefore self-limiting. there would also have to be a way to identify which data are relevant to the identifier, which would mean storing something like a hash of the identifier as a key. only those parties that know the identifier (i.e., that have observed it at some point in its traceability journey) will be able to locate the data of interest and decrypt them. parties could publish a hash of the record along with the record's primary key. this could then be used to validate records to ensure that they have not been tampered with, but it means that any party that wishes to query the data would have to know ahead of time where the data reside. once queried successfully, the record's primary key would be used to lookup the hash for comparison. to enable discovery, data consisting of the event's subject (the "what" dimension) and a pointer to a repository could be published. in essence, this is a declaration that the repository has data related to the event's subject, and a query for records related to the event's subject is likely to be successful. to further secure the discovery, the event's subject could be hashed, and that could be used as the key. volumetric analysis is still possible with this option. to limit volumetric analysis, data consisting of the class level of the event's subject and a pointer to a repository could be published. this is essentially a declaration that objects of a specific type have events in the repository, but it does not explicitly say how many or what specific objects they refer to. it still reveals that the company using the repository is handling the product. over and above all of this is the requirement that all publications be to the same type of blockchain ledger. there are currently no interoperability standards for blockchains. the industry would, therefore, have to settle on one, which has the same issue as settling on a single epcis repository. further technical research is required to determine the viability of the various options for publishing to the blockchain. 
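the discovery approaches discussed above, in which a hash of the event's subject (or of its class) serves as the lookup key so that only parties that have actually observed the identifier can locate the relevant repository, might look roughly like the following. identifiers and the repository url are hypothetical, and encryption of the payload itself is not shown.

```python
# A sketch of hashed-identifier discovery keys: a party publishes a pointer
# keyed by a hash of the event's subject (or of its class), so only parties
# that have observed the identifier can derive the key and find the data.
# Identifiers and the repository URL are hypothetical.
import hashlib


def discovery_key(identifier: str) -> str:
    # Parties that never handled this identifier cannot derive the key
    # except by guessing, which is expensive at scale.
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


ledger = {}

# Instance-level entry (reveals more and still permits volumetric analysis).
epc = "urn:epc:id:sgtin:0614141.107346.2018"
ledger[discovery_key(epc)] = "https://epcis.example.org/repo-A"

# Class-level entry (limits volumetric analysis: it only says that objects
# of this type have events in the repository).
epc_class = "urn:epc:idpat:sgtin:0614141.107346.*"
ledger[discovery_key(epc_class)] = "https://epcis.example.org/repo-A"

# A downstream party that observed the EPC can locate the repository:
print(ledger.get(discovery_key(epc)))           # https://epcis.example.org/repo-A
# A party that never saw the identifier has nothing to look up:
print(ledger.get(discovery_key("some-guess")))  # None
```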
the standardization efforts in global fscs have led to the need for best practice recommendations and common ways of managing logistics units in the food chain. the widespread use of gs standards reflects the tendency of food organizations to operate in an integrated manner with a universal language. this enables fscs to structure and align around a cohesive approach to food traceability, empowering multidirectional information sharing, optimizing efficiencies, and creating added-value activities for fsc stakeholders. moreover, the embeddedness of gs standards in global fscs allows trading partners to work in an industry-regulated environment wherein food quality and food safety are of the utmost priority in delivering sustainable, authentic products to final consumers. today, the usage of gs standards is inevitable, as they provide clear guidelines on how to manage and share event data across global fscs (figueroa, añorga, & arrizabalaga, ). this inevitability is further enhanced through the leadership of the global management board of gs (as of february ), which consists of senior executives from organizations such as procter & gamble, nestle, amazon, google, j.m. smucker, l'oreal, metro ag, alibaba, and others. similarly, the management board for the gs us organization includes senior executives from walmart, wakefern, wendy's, coca cola, target, publix, wegmans, sysco, massachusetts institute of technology, and others. the commitment of these organizations strongly supports the industry adoption of gs standards and gs -enabled blockchain solutions, as indicated by walmart in their us-driven "fresh leafy greens" traceability initiative (walmart, ). moreover, many of these firms have announced blockchain-related initiatives in their supply chains. walmart's traceability initiative reflects growing consumer concerns regarding food quality and safety and the recurring nature of product safety recalls. the combination of gs standards with blockchain can provide immutable evidence of data provenance, enhance food traceability and rapid recall, and increase trust in the quality of food products. gs standards aid organizations in maintaining a unified view of the state of food while transitioning between processing stages across globalized and highly extended supply chains with multiple exchange parties. as such, the broad adoption of electronic traceability as identified by gs can endow the food industry with several capabilities, ranging from the optimization of traceback procedures and the standardization of supply chain processes to the continuous improvement of food production activities and the development of more efficient and holistic traceability systems. the use of gs standards for the formation of interoperable and scalable food traceability systems can be reinforced with blockchain technology. as envisioned by many food researchers, practitioners, and organizations, blockchain technology represents a practical solution that has a positive impact on fsc collaborations and data sharing. blockchain technology creates a more comprehensive and inclusive framework that promotes an unprecedented level of transparency and visibility of food products as they are exchanged among fsc partners. combined with gs standards, blockchain technology offers a more refined level of interoperability between exchange parties in global fscs and facilitates a move away from traditional, linear, stove-piped supply chains with limited data sharing.
by leveraging blockchain, fscs would be able to develop a management information platform that enables the active collection, transfer, storage, control, and sharing of food-related information among fsc exchange parties. the combination of blockchain and gs standards can create a high level of trust because of the precision in data and information provenance, immutability, nonrepudiation, enhanced integrity, and deeper integration. the development of harmonized global fscs gives rise to more efficient traceability systems that are capable of minimizing the impact of food safety incidents and lowering the costs and risks related to product recalls. therefore, the integration of gs standards into a blockchain can enhance the competitive advantage of fscs. in order to unlock the full potential of the functional components of a blockchain and the integration of gs standards, several prerequisites need to be fulfilled. for example, a more uniform and standardized model of data governance is necessary to facilitate the operations of fscs in a globalized context. a balance between conformance with diverse regulatory requirements and the fsc partners' own requirements should be established in order to maintain a competitive position in the global market. the inter- and intra-organizational support for blockchain implementations, including agreement on what type of data should be shared and accessed, the establishment of clear lines of responsibility and accountability, and the development of more organized and flexible fscs, should be considered prior to blockchain adoption (fosso wamba et al., ). in summary, a blockchain is not a panacea, and non-blockchain solutions are functioning adequately in many fscs today. the business case or use case is crucially important when considering whether a blockchain is required and whether its functionality adds value. moreover, a blockchain does not consider unethical behaviors and opportunism in global fscs (bad character). organizations need to consider other risk factors that could impact ex post transaction costs and reputation. global fsc risk factors include slave labor, child labor, unsafe working conditions, animal welfare, environmental damage, deforestation and habitat loss, bribery and corruption, and various forms of opportunism such as quality cheating or the falsification of laboratory or government records before they are added to a blockchain. product data governance and enhanced traceability can be addressed in global fscs, but "bad character" is more difficult to detect and eliminate. essentially, bad data and bad character are the two main enemies of trust in the food chain. this study focused narrowly on existing research combining blockchain, gs standards, and food. due to the narrow scope of the research, we did not explore all technical aspects of the fast-evolving blockchain technology, smart contracts, or cryptography. further research is needed to explore the risks associated with the integrity of data entered into a blockchain, especially situations where bad actors may use a blockchain to establish false trust with false data. in this regard, "immutable lies" can be added to a blockchain and create a false sense of trust. because of this potential risk, and because errors occurring in the physical flow of goods within supply chains are common (e.g., damage, shortage, theft), as are errors in data sharing and privacy, the notion of blockchain "mutability" should be researched further (rejeb et al., ).
further technical research is encouraged to explore the relationship between the immutability features of a blockchain and the mutability features of the epcis standard. in the latter, epcis permits corrections where the original, erroneous record is preserved, and the correction has a pointer to the original. researchers should explore current epcis adoption challenges and whether epcis could provide blockchain-to-blockchain and blockchain-to-legacy interoperability. the latter may mitigate the risks associated with fsc exchange partners being "forced" to adopt a single proprietary blockchain platform, or to participate in multiple proprietary blockchain platforms, in order to trade with their business partners. researchers should also explore whether the latency of real-time data retrieval in blockchain-based fscs restricts consumer engagement in verifying credence claims in real time due to the complexity of retrieving block transaction history.

references:
supply strategy in the food industry - value net perspective
internet of things: evolution and technologies from a security perspective
recovery of wasted fruit and vegetables for improving sustainable diets
bringing farm animal to the consumer's plate - the quest for food business to enhance transparency, labelling and consumer education
is your supply chain ready to embrace blockchain?
runners and riders: the horsemeat scandal, eu law and multi-level enforcement
an empirical analysis of smart contracts: platforms, applications, and design patterns
boundary conditions for traceability in food supply chains using blockchain technology
food safety and transparency in food chains and networks: relationships and challenges
food product tracing technology capabilities and interoperability
trust in food in modern and late-modern societies
blockchain based wine supply chain traceability system
agrarian social movements: the absurdly difficult but not impossible agenda of defeating right-wing populism and exploring a socialist future
food traceability as an integral part of logistics management in food and agricultural supply chain
food supply chain management
value co-creation in wine logistics: the case of dartess
a next-generation smart contract and decentralized application platform
proof of stake: how i learned to love weak subjectivity
mapping the dynamics of science and technology: sociology of science in the real world
food fraud. reference material
blockchain technology in healthcare
coordination for traceability in the food chain. a critical appraisal of european regulation
food traceability in the domestic horticulture sector in kenya: an overview
traceability in manufacturing systems
exploring latent factors influencing the adoption of a processed food traceability system in south korea
unveiling the structure of supply networks: case studies in honda, acura, and daimlerchrysler
feasibility of warehouse drone adoption and implementation
investigating green supply chain management practices and performance: the moderating roles of supply chain ecocentricity and traceability
food supply chain management and logistics: from farm to fork
free competition and the optimal amount of fraud
defining granularity levels for supply chain traceability based on iot and blockchain
following the mackerel - cost and benefits of improved information exchange in food supply chains
the economics of credence goods: an experimental investigation of the role of verifiability, liability, competition and reputation in credence goods markets
ethereum proof of stake - ethhub
new era of smarter food safety
a review on the use of blockchain for the internet of things
an attribute-based access control model in rfid systems based on blockchain decentralized applications for healthcare environments
bitcoin, blockchain, and fintech: a systematic review and case studies in the supply chain. production planning and control
blockchain as a disruptive technology for business: a systematic review. international journal of information management
synchromodal logistics: an overview of critical success factors, enabling technologies, and open research issues
the edelman trust
traceability in the us food supply: economic theory and industry studies (agricultural economics reports no. ). united states department of agriculture
marsh holds place of honor in history of gs barcode
gs made easy - global meat and poultry traceability guideline companion document
traceability for fresh fruits and vegetables implementation guide
epc information services (epcis) standard
gs global traceability compliance criteria for food application standard
core business vocabulary standard
gs global traceability standard. gs release . . ratified (gs 's framework for the design of interoperable traceability systems for supply chains). gs
gs . ( a). gs foundation for fish, seafood and aquaculture traceability guideline
traceability and blockchain
how we got here. the institute for grocery distribution, cranfield school of management (ktp project), & value chain vision
blockchain revolution without the blockchain
real-time supply chain - a blockchain architecture for project deliveries. robotics and computer-integrated manufacturing
blockchain or bust for the food industry?
blockchain research, practice and policy: applications, benefits, limitations, emerging research themes and research agenda
the truth about blockchain
sharing procedure status information on ocean containers across countries using port community systems with decentralized architecture
effects of supplier trust on performance of cooperative supplier relationships
food traceability on blockchain: walmart's pork and mango pilots with ibm
modeling the internet of things adoption barriers in food retail supply chains
a novel approach to solve a mining work centralization problem in blockchain technologies
rfid-enabled traceability in the food supply chain. industrial management & data systems
blockchain, provenance, traceability & chain of custody
fast: a mapreduce consensus for high performance blockchains
integrating blockchain, smart contract tokens, and iot to design a food traceability solution
addressing key challenges to making enterprise blockchain applications a reality
who is satoshi nakamoto?
regulation
trusting records: is blockchain technology the answer? records management
the prisma statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration
developing open and interoperable dlt/blockchain standards
innovation and future trends in food manufacturing and supply chain technologies. woodhead publishing
leveraging standard electronic business interfaces to enable adaptive supply chain partnerships
food fraud: policy and food chain
food safety, food fraud, and food defense: a fast evolving literature
towards information customization and interoperability in food chains
the promise of blockchain and its impact on relationships between actors in the supply chain: a theory-based research framework
bitcoin: a peer-to-peer electronic cash system
information and consumer behavior
the transparent supply chain
the components of a food traceability system
bitcoin mining and its energy footprint
blockchain technology and enterprise operational capabilities: an empirical test
nfc-based traceability in the food chain
the produce traceability initiative
food fraud vulnerability assessment and mitigation: are you doing enough to prevent food fraud
blockchain and supply chain management integration: a systematic review of the literature. supply chain management: an international
incorporating block chain technology in food supply chain
leveraging the internet of things and blockchain technology in supply chain management
how blockchain technology can benefit marketing: six pending research areas
exploring new technologies in procurement
virtual currencies like bitcoin as a paradigm shift in the field of transactions
credence good labeling: the efficiency and distributional implications of several policy approaches
unraveling the food supply chain: strategic insights from china and the recalls
the acceptance of blockchain technology in meat traceability and transparency
igr token - raw material and ingredient certification of recipe based foods using smart contracts
the relevance of blockchain for collaborative networked organizations
transparency for sustainability in the food chain: challenges and research needs. effost critical reviews #
using blockchain for electronic health records
traceability and normal accident theory: how does supply network complexity influence the traceability of adverse events
a byzantine fault-tolerant ordering service for the hyperledger fabric blockchain platform
risks and opportunities for systems using blockchain and smart contracts
blockchain technology for detecting falsified and substandard drugs in distribution: pharmaceutical supply chain intervention
smart contracts: building blocks for digital free markets
formalizing and securing relationships on public networks
realizing the potential of blockchain: a multistakeholder approach to the stewardship of blockchain and cryptocurrencies
blockchain technology implementation in logistics
a novel blockchain-based product ownership management system (poms) for anti-counterfeits in the post supply chain
towards a methodology for developing evidence-informed management knowledge by means of systematic review
food wastage footprint & climate change
strategic transparency between food chain and society: cultural perspective images on the future of farmed salmon
semantic interoperability for data analysis in the food supply chain
the role of security in the food supplier selection decision
fresh leafy greens - new walmart food traceability initiative questions and answers
food-borne illnesses cost us$ billion per year in low- and middle-income countries
example use cases
assessing supermarket food shopper reaction to horsemeat scandal in the uk
a new era of food transparency powered by blockchain
knowing about your food from the farm to the table: using information systems that reduce information asymmetry and health risks in retail contexts
current status and future development proposal for chinese agricultural product quality and safety traceability
food supply chain management: systems, implementations, and future research

the authors are thankful to dr. steven j. simske, dr. subhasis thakur, and irene woerner for their thoughtful commentary on this chapter. abderahman rejeb, coauthor and ph.d. candidate, is grateful to professor lászló imre komlósi, dr. katalin czakó, and ms. tihana vasic for their valuable support.
• no funding was received for this publication.
• john g. keogh, the corresponding author, is a former executive at gs canada and has not advised or worked with or for gs for more than years.
• kevin dean is an independent technical consultant advising gs .

key: cord- -hveuq x authors: reis, ben y kohane, isaac s mandl, kenneth d title: an epidemiological network model for disease outbreak detection date: - - journal: plos med doi: . /journal.pmed. sha: doc_id: cord_uid: hveuq x background: advanced disease-surveillance systems have been deployed worldwide to provide early detection of infectious disease outbreaks and bioterrorist attacks. new methods that improve the overall detection capabilities of these systems can have a broad practical impact. furthermore, most current generation surveillance systems are vulnerable to dramatic and unpredictable shifts in the health-care data that they monitor. these shifts can occur during major public events, such as the olympics, as a result of population surges and public closures. shifts can also occur during epidemics and pandemics as a result of quarantines, the worried-well flooding emergency departments or, conversely, the public staying away from hospitals for fear of nosocomial infection. most surveillance systems are not robust to such shifts in health-care utilization, either because they do not adjust baselines and alert-thresholds to new utilization levels, or because the utilization shifts themselves may trigger an alarm. as a result, public-health crises and major public events threaten to undermine health-surveillance systems at the very times they are needed most. methods and findings: to address this challenge, we introduce a class of epidemiological network models that monitor the relationships among different health-care data streams instead of monitoring the data streams themselves. by extracting the extra information present in the relationships between the data streams, these models have the potential to improve the detection capabilities of a system. furthermore, the models' relational nature has the potential to increase a system's robustness to unpredictable baseline shifts.
we implemented these models and evaluated their effectiveness using historical emergency department data from five hospitals in a single metropolitan area, recorded over a period of . y by the automated epidemiological geotemporal integrated surveillance real-time public health–surveillance system, developed by the children's hospital informatics program at the harvard-mit division of health sciences and technology on behalf of the massachusetts department of public health. we performed experiments with semi-synthetic outbreaks of different magnitudes and simulated baseline shifts of different types and magnitudes. the results show that the network models provide better detection of localized outbreaks, and greater robustness to unpredictable shifts than a reference time-series modeling approach. conclusions: the integrated network models of epidemiological data streams and their interrelationships have the potential to improve current surveillance efforts, providing better localized outbreak detection under normal circumstances, as well as more robust performance in the face of shifts in health-care utilization during epidemics and major public events. abbreviations: aegis, automated epidemiological geotemporal integrated surveillance; cusum, cumulative sum; ewma, exponential weighted moving average; sars, severe acute respiratory syndrome * to whom correspondence should be addressed. e-mail: ben_reis@ harvard.edu advanced disease-surveillance systems have been deployed worldwide to provide early detection of infectious disease outbreaks and bioterrorist attacks. new methods that improve the overall detection capabilities of these systems can have a broad practical impact. furthermore, most current generation surveillance systems are vulnerable to dramatic and unpredictable shifts in the health-care data that they monitor. these shifts can occur during major public events, such as the olympics, as a result of population surges and public closures. shifts can also occur during epidemics and pandemics as a result of quarantines, the worriedwell flooding emergency departments or, conversely, the public staying away from hospitals for fear of nosocomial infection. most surveillance systems are not robust to such shifts in health-care utilization, either because they do not adjust baselines and alert-thresholds to new utilization levels, or because the utilization shifts themselves may trigger an alarm. as a result, public-health crises and major public events threaten to undermine health-surveillance systems at the very times they are needed most. to address this challenge, we introduce a class of epidemiological network models that monitor the relationships among different health-care data streams instead of monitoring the data streams themselves. by extracting the extra information present in the relationships between the data streams, these models have the potential to improve the detection capabilities of a system. furthermore, the models' relational nature has the potential to increase a system's robustness to unpredictable baseline shifts. we implemented these models and evaluated their effectiveness using historical emergency department data from five hospitals in a single metropolitan area, recorded over a period of . 
y by the automated epidemiological geotemporal integrated surveillance real-time public health-surveillance system, developed by the children's hospital informatics program at the harvard-mit division of health sciences and technology on behalf of the massachusetts department of public health. we performed experiments with semi-synthetic outbreaks of different magnitudes and simulated baseline shifts of different types and magnitudes. the results show that the network models provide better detection of localized outbreaks, and greater robustness to unpredictable shifts than a reference time-series modeling approach. understanding and monitoring large-scale disease patterns is critical for planning and directing public-health responses during pandemics [ ] [ ] [ ] [ ] [ ] . in order to address the growing threats of global infectious disease pandemics such as influenza [ ] , severe acute respiratory syndrome (sars) [ ] , and bioterrorism [ ] , advanced disease-surveillance systems have been deployed worldwide to monitor epidemiological data such as hospital visits [ , ] , pharmaceutical orders [ ] , and laboratory tests [ ] . improving the overall detection capabilities of these systems can have a wide practical impact. furthermore, it would be beneficial to reduce the vulnerability of many of these systems to shifts in health-care utilization that can occur during public-health emergencies such as epidemics and pandemics [ ] [ ] [ ] or during major public events [ ] . we need to be prepared for the shifts in health-care utilization that often accompany major public events, such as the olympics, caused by population surges or closures of certain areas to the public [ ] . first, we need to be prepared for drops in health-care utilization under emergency conditions, including epidemics and pandemics where the public may stay away from hospitals for fear of being infected, as . % reported doing so during the sars epidemic in hong kong [ ] . similarly, a detailed study of the greater toronto area found major drops in numerous types of health-care utilization during the sars epidemic, including emergency department visits, physician visits, inpatient and outpatient procedures, and outpatient diagnostic tests [ ] . second, the ''worried-well''-those wrongly suspecting that they have been infected-may proceed to flood hospitals, not only stressing the clinical resources, but also dramatically shifting the baseline from its historical pattern, potentially obscuring a real signal [ ] . third, public-health interventions such as closures, quarantines, and travel restrictions can cause major changes in health-care utilization patterns. such shifts threaten to undermine disease-surveillance systems at the very times they are needed most. during major public events, the risks and potential costs of bioterrorist attacks and other public-health emergencies increase. during epidemics, as health resources are already stretched, it is important to maintain disease outbreaksurveillance capabilities and situational awareness [ , ] . at present, many disease-surveillance systems rely either on comparing current counts with historical time-series models, or on identifying sudden increases in utilization (e.g., cumulative sum [cusum] or exponential weighted moving average [ewma] [ , ] ). 
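to make the two baseline detectors just mentioned concrete, the following is a minimal sketch, in python, of how a cusum and an ewma detector might be applied to a single daily-count series. it is not code from any of the cited systems; the training window, smoothing weight, and threshold values are illustrative assumptions only.

```python
# minimal sketch of two standard aberration detectors applied to one daily-count series.
# the 56-day training window and the parameter values (lam, z, k, h) are illustrative
# assumptions, not settings taken from any deployed surveillance system.
import numpy as np

TRAIN = 56  # assumed length of the baseline (training) period in days

def ewma_alarms(counts, lam=0.3, z=3.0):
    """flag days where the exponentially weighted moving average exceeds mean + z*sd."""
    counts = np.asarray(counts, dtype=float)
    mu, sd = counts[:TRAIN].mean(), counts[:TRAIN].std(ddof=1)
    ewma, alarms = mu, []
    for t, y in enumerate(counts):
        ewma = lam * y + (1 - lam) * ewma
        alarms.append(t >= TRAIN and ewma > mu + z * sd)
    return np.array(alarms)

def cusum_alarms(counts, k=0.5, h=4.0):
    """one-sided cusum on standardized counts; alarm when the cumulative statistic exceeds h."""
    counts = np.asarray(counts, dtype=float)
    mu, sd = counts[:TRAIN].mean(), counts[:TRAIN].std(ddof=1)
    s, alarms = 0.0, []
    for t, y in enumerate(counts):
        s = max(0.0, s + (y - mu) / max(sd, 1e-9) - k)
        alarms.append(t >= TRAIN and s > h)
    return np.array(alarms)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    series = rng.poisson(30, 200).astype(float)
    series[150:157] += 15  # a simulated one-week excess in visits
    print("ewma alarm days:", np.where(ewma_alarms(series))[0])
    print("cusum alarm days:", np.where(cusum_alarms(series))[0])
```

both detectors alarm on absolute increases in the monitored stream itself, which is why a broad shift in utilization can trigger them even when no outbreak is present.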
these approaches are not robust to major shifts in health-care utilization: systems based on historical time-series models of health-care counts do not adjust their baselines and alert-thresholds to the new unknown utilization levels, while systems based on identifying sudden increases in utilization may be falsely triggered by the utilization shifts themselves. in order to both improve overall detection performance and reduce vulnerability to baseline shifts, we introduce a general class of epidemiological network models that explicitly capture the relationships among epidemiological data streams. in this approach, the surveillance task is transformed from one of monitoring health-care data streams, to one of monitoring the relationships among these data streams: an epidemiological network begins with historical time-series models of the ratios between each possible pair of data streams being monitored. (as described in discussion, it may be desirable to model only a selected subset of these ratios.) these ratios do not remain at a constant value; rather, we assume that these ratios vary in a predictable way according to seasonal and other patterns that can be modeled. the ratios predicted by these historical models are compared with the ratios observed in the actual data in order to determine whether an aberration has occurred. the complete approach is described in detail below. these network models have two primary benefits. first, they take advantage of the extra information present in the relationships between the monitored data streams in order to increase overall detection performance. second, their relational nature makes them more robust to the unpredictable shifts described above, as illustrated by the following scenario. the olympics bring a large influx of people into a metropolitan area for wk and cause a broad surge in overall health-care utilization. in the midst of this surge, a localized infectious disease outbreak takes place. the surge in overall utilization falsely triggers the alarms of standard biosurveillance models and thus masks the actual outbreak. on the other hand, since the surge affects multiple data streams similarly, the relationships between the various data streams are not affected as much by the surge. since the network model monitors these relationships, it is able to ignore the surge and thus detect the outbreak. our assumption is that broad utilization shifts would affect multiple data streams in a similar way, and would thus not significantly affect the ratios among these data streams. in order to validate this assumption, we need to study the stability of the ratios around real-world surges. this assessment is difficult to do, since for most planned events, such as the olympics, additional temporary health-care facilities are set up at the site of the event in order to deal with the expected surge. this preparation reduces or eliminates the surge that is recorded by the permanent health-care system, and therefore makes it hard to find data that describe surges. however, some modest shifts do appear in the health-care utilization data, and they are informative. we obtained data on the sydney summer olympics directly from the centre for epidemiology and research, new south wales department of health, new south wales emergency department data collection. the data show a % surge in visits during the olympics. 
while the magnitude of this shift is far less dramatic than those expected in a disaster, the sydney olympics nonetheless provide an opportunity to measure the stability of the ratios under surge conditions. despite the surge, the relative rates of major syndromic groups remained very stable between the same periods in and . injury visits accounted for . % of overall visits in , compared with an almost identical . % in . gastroenteritis visits accounted for . % in , compared with . % in . as shown in table , the resulting ratios among the different syndromic groups remained stable. although we would have liked to examine the stability of ratios in the face of a larger surge, we were not able to find a larger surge for which multi-year health-care utilization data were available. it is important to note that while the above data about a planned event are informative, surveillance systems need to be prepared for the much larger surges that would likely accompany unplanned events, such as pandemics, natural disasters, or other unexpected events that cause large shifts in utilization. initial motivation for this work originated as a result of the authors' experience advising the hellenic center for infectious diseases control in advance of the summer olympics in athens [ ] , where there was concern that a population surge caused by the influx of a large number of tourists would significantly alter health-care utilization patterns relative to the baseline levels recorded during the previous summer. the epidemiological network model was then formalized in the context of the us centers for disease control and prevention's nationwide biosense health-surveillance system [ ] , for which the authors are researching improved surveillance methods for integration of inputs from multiple health-care data streams. biosense collects and analyzes health-care utilization data, which have been made anonymous, from a number of national data sources, including the department of defense and the veteran's administration, and is now procuring local emergency department data sources from around the united states. in order to evaluate the practical utility of this approach for surveillance, we constructed epidemiological network models based on real-world historical health-care data and compared their outbreak-detection performance to that of standard historical models. the models were evaluated using semi-synthetic data-streams-real background data with injected outbreaks-both under normal conditions and in the presence of different types of baseline shifts. the proposed epidemiological network model is compared with a previously described reference time-series model [ ] . both models are used to detect simulated outbreaks introduced into actual historical daily counts for respiratory-related visits, gastrointestinal-related visits, and total visits at five emergency departments in the same metropolitan area. the data cover a period of , d, or roughly . y. the first , d are used to train the models, while the final d are used to test their performance. the data are collected by the automated epidemiological geotemporal integrated surveillance (aegis) real-time public health-surveillance system, developed by the child-ren's hospital informatics program at the harvard-mit division of health sciences and technology on behalf of the massachusetts department of public health. aegis fully automates the monitoring of emergency departments across massachusetts. 
the system receives automatic updates from the various health-care facilities and performs outbreak detection, alerting, and visualization functions for public-health personnel and clinicians. the aegis system incorporates both temporal and geospatial approaches for outbreak detection. the goal of an epidemiological network model is to model the historical relationships among health-care data streams and to interpret newly observed data in the context of these modeled relationships. in the training phase, we construct time-series models of the ratios between all possible pairs of health-care utilization data streams. these models capture the weekly, seasonal, and long-term variations in these ratios. in the testing phase, the actual observed ratios are compared with the ratios predicted by the historical models. we begin with n health-care data streams, s i , each describing daily counts of a particular syndrome category at a particular hospital. for this study, we use three syndromic categories (respiratory, gastrointestinal, and total visits) at five hospitals, for a total of n = data streams. all possible pair-wise ratios are calculated among these n data streams, for a total of n^2 - n = ratios, r ij (t) = s i (t) / s j (t): for each day, t, we calculate the ratio of the daily counts for stream s i to the daily counts for stream s j . for each ratio, the numerator s i is called the target data stream, and the denominator s j is called the context data stream, since the target data stream is said to be interpreted in the context of the context data stream, as described below. a sample epidemiological network consisting of nodes and edges is shown in figure . the nodes in the network represent the data streams: each of the n data streams appears twice, once as a context data stream and another time as a target data stream. edges represent ratios between data streams: namely, the target data stream divided by the context data stream. to train the network, a time-series model, r̂ ij , is fitted for each ratio, r ij , over the training period using established time-series methods [ ] . the data are first smoothed with a -d exponential filter (ewma with coefficient . ) to reduce the effects of noise [ ] . the linear trend is calculated and subtracted out, then the overall mean is calculated and subtracted out, then the day-of-week means (seven values) are calculated and subtracted out, and finally the day-of-year means ( values) are calculated. in order to generate predictions from this model, these four components are summed, using the appropriate values for day of the week, day of the year, and trend. the difference between each actual ratio, r ij , and its corresponding modeled prediction, r̂ ij , is the error, e ij . during network operation, the goal of the network is to determine the extent to which the observed ratios among the data streams differ from the ratios predicted by the historical models. observed ratios, r ij , are calculated from the observed data, and are compared with the expected ratios to yield the observed errors, e ij : e ij (t) = r ij (t) - r̂ ij (t). in order to interpret the magnitudes of these deviations from the expected values, the observed errors are compared with the historical errors from the training phase. a nonparametric approach is used to rank the current error against the historical errors. this rank is divided by the maximum rank ( + the number of training days), resulting in a value of between and , which is the individual aberration score, w ij .
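as a rough illustration of the ratio-modelling and scoring steps described above, the sketch below fits a simplified additive baseline (trend, overall mean, day-of-week and day-of-year effects) to each pairwise ratio and converts new prediction errors into rank-based aberration scores. it is a minimal sketch under stated assumptions, not the authors' implementation; the function names, the smoothing span, and the handling of edge cases are illustrative, and counts are assumed to be strictly positive so that the ratios are well defined.

```python
# rough sketch of the ratio-network scoring described above: each pairwise ratio gets a
# simple additive baseline (trend + overall mean + day-of-week + day-of-year effects),
# and new days are scored by ranking the prediction error against the training errors.
# all names and settings are illustrative; counts are assumed strictly positive.
import numpy as np
import pandas as pd

def fit_ratio_model(train_ratio, span=7):
    """fit a simplified trend/day-of-week/day-of-year baseline to a training ratio series."""
    r = train_ratio.ewm(span=span).mean()          # smooth to reduce day-to-day noise
    t = np.arange(len(r))
    slope = np.polyfit(t, r.values, 1)[0]          # linear trend component
    detrended = r - slope * t
    overall = detrended.mean()
    dow = (detrended - overall).groupby(r.index.dayofweek).mean()
    resid = detrended - overall - dow.reindex(r.index.dayofweek).values
    doy = resid.groupby(r.index.dayofyear).mean()
    return {"slope": slope, "overall": overall, "dow": dow, "doy": doy}

def predict_ratio(model, dates, t_offset):
    t = np.arange(len(dates)) + t_offset
    dow = model["dow"].reindex(dates.dayofweek).fillna(0).values
    doy = model["doy"].reindex(dates.dayofyear).fillna(0).values
    return model["slope"] * t + model["overall"] + dow + doy

def aberration_scores(counts, train_days):
    """counts: DataFrame of daily counts (DatetimeIndex), one column per data stream."""
    scores = {}
    for target in counts.columns:
        for context in counts.columns:
            if target == context:
                continue
            ratio = counts[target] / counts[context]
            train, test = ratio.iloc[:train_days], ratio.iloc[train_days:]
            model = fit_ratio_model(train)
            train_err = train.values - predict_ratio(model, train.index, 0)
            test_err = test.values - predict_ratio(model, test.index, train_days)
            # nonparametric score: rank of each new error among the training errors
            w = [(np.sum(train_err < e) + 1) / (len(train_err) + 1) for e in test_err]
            scores[(target, context)] = pd.Series(w, index=test.index)
    return scores
```

in practice the fitted models for every pairwise ratio would be stored after the training phase and reused each day of surveillance.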
conceptually, each of the individual aberration scores, w ij , represents the interpretation of the activity of the target data stream, s i , from the perspective of the activity at the context data stream, s j : if the observed ratio between these two data streams is exactly as predicted by the historical model, e ij is equal to and w ij is equal to a moderate value. if the target data stream is higher than expected, e ij is positive and w ij is a higher value closer to . if it is lower than expected, e ij is negative and w ij is a lower value closer to . high aberration scores, w ij , are represented by thicker edges in the network visualization, as shown in figure (figure caption: each data stream appears twice in the network; the context nodes on the left are used for interpreting the activity of the target nodes on the right. each edge represents the ratio of the target node divided by the context node, with a thicker edge indicating that the ratio is higher than expected). some ratios are more unpredictable than others-i.e., they have a greater amount of variability that is not accounted for by the historical model, and thus a greater modeling error. the nonparametric approach to evaluating aberrations adjusts for this variability by interpreting a given aberration in the context of all previous aberrations for that particular ratio during the training period. it is important to note that each individual aberration score, w ij , can be affected by the activities of both its target and context data streams. for example, it would be unclear from a single high w ij score as to whether the target data stream is unexpectedly high or whether the context data stream is unexpectedly low. in order to obtain an integrated consensus view of a particular target data stream, s i , an integrated consensus score, c i , is created by averaging together all the aberration scores that have s i as the target data stream (i.e., in the numerator of the ratio). this integrated score represents the collective interpretation of the activity at the target node, from the perspective of all the other nodes: c i (t) = (1/(n-1)) Σ j≠i w ij (t), averaging over the n-1 context streams of the full network. an alarm is generated whenever c i is greater than a threshold value c thresh . as described below, this threshold value is chosen to achieve a desired specificity. the nonparametric nature of the individual aberration scores addresses the potential issue of outliers that would normally arise when taking an average. it is also important to note that while the integrated consensus score helps to reduce the effects of fluctuations in individual context data streams, it is still possible for an extreme drop in one context data stream to trigger a false alarm in a target data stream. this is particularly true in networks having few context data streams. in the case of only one context data stream, a substantial decrease in the count in the context data stream will trigger a false alarm in the target data stream. for comparison, we also implement a reference time-series surveillance approach that models each health-care data stream directly, instead of modeling the relationships between data streams as above. this model uses the same time-series modeling methods described above and previously [ ] . first, the daily counts data are smoothed with a -d exponential filter. the linear trend is calculated and subtracted out, then the overall mean is calculated and subtracted out, and then the mean for each day of the week (seven values) is calculated and subtracted out.
finally, the mean for each day of the year ( values) is calculated and subtracted out. to generate a prediction, these four components are added together, taking the appropriate values depending on the particular day of the week and day of the year. the difference between the observed daily counts and the counts predicted by the model is the aberration score for that data stream. an alarm is generated whenever this aberration score is greater than a threshold value, chosen to achieve a desired level of specificity, as described below. by employing identical time-series methods for modeling the relationships between the streams in the network approach and modeling the actual data streams themselves in the reference approach, we are able to perform a controlled comparison between the two approaches. following established methods [ ] [ ] [ ] , we use semisynthetic localized outbreaks to evaluate the disease-monitoring capabilities of the network. the injected outbreaks used here follow a -d lognormal temporal distribution (figure ), representing the epidemiological distribution of incubation times resulting from a single-source common vehicle infection, as described by sartwell [ ] . when injecting outbreaks into either respiratory-or gastrointestinal-related data streams, the same number of visits is also added to the appropriate total-visits data stream for that hospital in order to maintain consistency. multiple simulation experiments are performed, varying the number of data streams used in the network, the target data stream, s i , into which the outbreaks are introduced, and the magnitude of the outbreaks. while many additional outbreak types are possible, the simulated outbreaks used here serve as a paradigmatic set of benchmark stimuli for gauging the relative outbreak-detection performance of the different surveillance approaches. we constructed epidemiological networks from respiratory, gastrointestinal, and total daily visit data from five hospitals in a single metropolitan area, for a total of data streams, s i (n ¼ ). in training the network, we modeled all possible pair-wise ratios between the data streams, for a total of ratios. for comparison, we implemented the reference time-series surveillance model described above, which uses the same time-series methods but instead of modeling the epidemiological relationships, models the data streams directly. semi-synthetic simulated outbreaks were used to evaluate the aberration-detection capabilities of the network, as described above. we simulated outbreaks across a range of magnitudes occurring at any one of the data streams. for the first set of experiments, , tests were performed: target data streams d of the testing period outbreak sizes (with a peak magnitude increase ranging from . % to . %) two models (network versus reference). for the purposes of systematic comparison between the reference and network models, we allowed for the addition of fractional cases in the simulations. we compared the detection sensitivities of the reference and network models by fixing specificity at a benchmark % and measuring the sensitivity of the model. in order to measure sensitivity at a desired specificity, we gradually increased the alarm threshold incrementally from to the maximum value until the desired specificity was reached. we then measured the sensitivity at the same threshold. sensitivity is defined in terms of outbreak-days-the proportion of all days during which outbreaks were occurring such that an alarm was generated. 
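the evaluation loop described above can be summarised in a short sketch: spread a fixed number of extra cases over a lognormal-shaped outbreak, add them to a background series, choose the alarm threshold that achieves the benchmark specificity on outbreak-free days, and report sensitivity in outbreak-days. the lognormal parameters, outbreak length, and magnitudes below are placeholders rather than the values used in the study, and the simple z-score detector is only there to make the example self-contained.

```python
# illustrative sketch of the evaluation procedure: inject a lognormal-shaped outbreak into a
# background series, set the alarm threshold to hit a benchmark specificity on outbreak-free
# days, and report sensitivity as the proportion of outbreak-days that raise an alarm.
# the shape parameters, outbreak length, magnitude, and the z-score detector are assumptions.
import numpy as np
from scipy.stats import lognorm

def lognormal_outbreak(total_cases, length=7, shape=0.5, scale=3.0):
    """spread total_cases over `length` days following a lognormal incubation-style curve."""
    days = np.arange(1, length + 1)
    weights = lognorm.pdf(days, s=shape, scale=scale)
    return total_cases * weights / weights.sum()

def sensitivity_at_specificity(scores, outbreak_days, target_spec=0.97):
    """scores: daily aberration scores; outbreak_days: boolean mask of injected days."""
    scores, outbreak_days = np.asarray(scores), np.asarray(outbreak_days)
    non_outbreak = scores[~outbreak_days]
    threshold = np.quantile(non_outbreak, target_spec)   # smallest threshold meeting the target
    alarms = scores > threshold
    sensitivity = alarms[outbreak_days].mean()           # fraction of outbreak-days detected
    specificity = 1 - alarms[~outbreak_days].mean()
    return sensitivity, specificity, threshold

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    background = rng.poisson(40, 365).astype(float)
    series = background.copy()
    start = 200
    series[start:start + 7] += lognormal_outbreak(total_cases=60, length=7)
    score = (series - background.mean()) / background.std()   # naive detector for demonstration
    mask = np.zeros(365, dtype=bool)
    mask[start:start + 7] = True
    print(sensitivity_at_specificity(score, mask))
```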
at % specificity, the network approach significantly outperformed the reference approach in detecting respiratory and gastrointestinal outbreaks, yielding . % . % and . % . % absolute increases in sensitivity, respectively (representing . % and . % relative improvements in sensitivity, respectively), for outbreaks characterized by a . % increase on the peak day of the outbreak (table ) . we found this ordering of sensitivities to be consistent over the range of outbreak sizes. for outbreaks introduced into the total-outbreak signals, the reference model achieved . % % better absolute sensitivity than the network model ( . % difference in relative sensitivity). this result is likely because the total-visit signals are much larger in absolute terms, and therefore the signal-to-noise ratio is higher (table ) , making it easier for the reference model to detect the outbreaks. the ''total outbreak'' experiments were run for reasons of comprehensiveness, but it should be noted that there is no clear epidemiological correlate to an outbreak that affects all syndrome groups, other than a population surge, which the network models are designed to ignore as described in the discussion section. also, an increase in total visits without an increase in respiratory or gastrointestinal visits may correspond to an outbreak in yet another syndrome category. table also shows results for the same experiments at three other practical specificity levels, and an average for all four specificity levels. in all cases, the network approach performs better for respiratory and gastrointestinal outbreaks and the reference model performs better in total-visit outbreaks. by visually inspecting the response of the network model to the outbreaks, it can be seen that while the individual aberration scores exhibited fairly noisy behavior throughout the testing period (figure ) , the integrated consensus scores consolidated the information from the individual aberration scores, reconstructing the simulated outbreaks presented to the system (figure ). next, we studied the effects of different network compositions on detection performance, constructing networks of different sizes and constituent data streams ( figure ). for each target data stream, we created different homogeneous context networks-i.e., networks containing the target data stream plus between one and five additional data streams of a single syndromic category. in total, , networks were created and analyzed ( target data streams networks). we then introduced simulated outbreaks characterized by a . % increase in daily visit counts over the background counts in the target data stream on the peak day of the outbreak into the target data stream of each network, and calculated the sensitivity obtained from all the networks having particular size and membership characteristics, for a fixed benchmark specificity of %. in total, , tests were performed ( , networks d). we found that detection performance generally increased with network size ( figure ). furthermore, regardless of which data stream contained the outbreaks, total-visit data streams provided the best context for detection. this is consistent with the greater statistical stability of the total- visits data streams, which on average had a far smaller variability (table ) . total data streams were also the easiest target data streams in which to detect outbreaks, followed by respiratory data streams, and then by gastrointestinal data streams. 
this result is likely because the number of injected cases is a constant proportion of stream size. for a constant number of injected cases, total data streams would likely be the hardest target data streams for detection. next, we systematically compared the performance advantage gained from five key context groups. for a respiratory target signal, the five groups were as follows: ( ) total visits at the same hospital; ( ) total visits at all other hospitals; ( ) gastrointestinal visits at the same hospital; ( ) gastrointestinal visits at all other hospitals; and ( ) respiratory visits at all other hospitals. if the target signal comprised gastrointestinal or total visits, the five context groups above would be changed accordingly, as detailed in figures - . given the possibility of either including or excluding each of these five groups, there were ( À ) possible networks for each target signal. the results of the above analysis are shown for respiratory (figure ) , gastrointestinal (figure ) , and total-visit target signals (figure ). each row represents a different network construction. rows are ranked by the average sensitivity achieved over the five possible target signals for that table. the following general trends are apparent. total visits at all the other hospitals were the most helpful context group overall. given a context of all the streams from the same hospital, it is beneficial to add total visits from other hospitals, as well as the same syndrome group from the other hospitals. beginning with a context of total visits from the same hospital, there is a slight additional advantage in including a different syndrome group from the same hospital. in order to gauge the performance of the network and reference models in the face of baseline shifts in health-care utilization, we performed a further set of simulation experiments, where, in addition to the simulated outbreaks of peak magnitude . %, we introduced various types and magnitudes of baseline shifts for a period of d in the middle of the -d testing period. we compared the performance of the reference time-series model, the complete network model, and a network model containing only total-visit nodes. for respiratory and gastrointestinal outbreaks, we also compared the performance of a two-node network containing only the target data stream and the total-visit data stream from the same hospital. we began by simulating the effects of a large population surge, such as might be seen during a large public event. we did this by introducing a uniform increase across all data streams for d in the middle of the testing period. we found that the detection performance of the reference model degraded rapidly with increasing baseline shifts, while the performance of the various network models remained stable ( figure ). we next simulated the effects of a frightened public staying away from hospitals during an epidemic. we did this by introducing uniform drops across all data streams for d. here too, we found that the detection performance of the reference model degraded rapidly with increasing baseline shifts, while the performance of the various network models remained robust ( figure ). we then simulated the effects of the ''worried-well'' on a surveillance system by introducing targeted increases in only one syndromic category-respiratory or gastrointestinal ( figure ) . 
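the three kinds of baseline shift examined here (a uniform surge across all streams, a uniform drop, and a targeted rise in a single syndromic category) can be expressed as simple transformations of the daily count table. the sketch below shows one way this could be done; the column-naming convention, shift window, and multipliers are illustrative assumptions and not the authors' simulation code.

```python
# sketch of injecting baseline shifts into a counts table (rows = days, columns = streams).
# the column naming convention (e.g. "hosp1_resp") and the shift window are assumptions.
import numpy as np
import pandas as pd

def uniform_shift(counts, start, length, factor):
    """multiply every stream by `factor` for `length` days (surge if >1, drop if <1)."""
    shifted = counts.copy().astype(float)
    shifted.iloc[start:start + length, :] *= factor
    return shifted

def targeted_shift(counts, start, length, factor, syndrome="resp"):
    """multiply only the streams of one syndromic category, e.g. a 'worried-well' scenario."""
    shifted = counts.copy().astype(float)
    positions = [counts.columns.get_loc(c) for c in counts.columns if c.endswith(syndrome)]
    shifted.iloc[start:start + length, positions] *= factor
    return shifted

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    idx = pd.date_range("2005-01-01", periods=120, freq="D")
    cols = [f"hosp{h}_{s}" for h in range(1, 6) for s in ("resp", "gi", "total")]
    counts = pd.DataFrame(rng.poisson(25, (120, len(cols))), index=idx, columns=cols)
    surge = uniform_shift(counts, start=60, length=14, factor=1.4)      # population surge
    scare = uniform_shift(counts, start=60, length=14, factor=0.6)      # public avoiding hospitals
    worried = targeted_shift(counts, start=60, length=14, factor=1.5)   # targeted respiratory rise
    # pairwise ratios between streams are largely unchanged by the uniform shifts
    print((surge["hosp1_resp"] / surge["hosp2_total"]).iloc[60:65].round(2))
```

because a uniform multiplier cancels in every ratio of two affected streams, the pairwise ratios monitored by the network models are left essentially unchanged by the first two shifts, which is the intuition behind the robustness results reported below.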
we compared the performance of the reference model, a full-network model, the two-node networks described above, and a homogeneous network model containing only data streams of the same syndromic category as the target data stream. the performance of the full and homogeneous networks was superior to that of the reference model. the homogeneous networks, consisting of solely respiratory or gastrointestinal data streams, proved robust to the targeted shifts and achieved consistent detection performance even in the face of large shifts. this result is consistent with all the ratios in these networks being affected equally by the targeted baseline shifts. the performance of the full network degraded slightly in the face of larger shifts, while the performance of the two-node network degraded more severely. these results are because the two-node network did not include relationships that were unaffected by the shifts that could help stabilize performance. it should be noted that this same phenomenon-an increase in one syndromic category across multiple locations-may also be indicative of a widespread outbreak, as discussed further below.
(figure : simulation of a population surge during a large public event. all data streams are increased by a uniform amount (x-axis) for d in the middle of the testing period. full networks, total-visit networks, two-node networks (target data stream and total visits at the same hospital), and reference models are compared. average results are shown for each target data stream type. error bars are standard errors.)
(figure : simulation of a frightened public staying away from hospitals during a pandemic. all data streams are dropped by a uniform amount (x-axis) for d in the middle of the testing period. full networks, total-visit networks, two-node networks (target data stream and total visits at the same hospital), and reference models are compared. average results are shown for each target data stream type. error bars are standard errors.)
in this paper, we describe an epidemiological network model that monitors the relationships between health-care utilization data streams for the purpose of detecting disease outbreaks. results from simulation experiments show that these models deliver improved outbreak-detection performance under normal conditions compared with a standard reference time-series model. furthermore, the network models are far more robust than the reference model to the unpredictable baseline shifts that may occur around epidemics or large public events. the results also show that epidemiological relationships are inherently valuable for surveillance: the activity at one hospital can be better understood by examining it in relation to the activity at other hospitals. in a previous paper [ ] , we showed the benefits of interpreting epidemiological data in its temporal context-namely, the epidemiological activity on surrounding days [ ] . in the present study, we show that it is also beneficial to examine epidemiological data in its network context-i.e., the activity of related epidemiological data streams. based on the results obtained, it is clear that different types of networks are useful for detecting different types of signals. we present eight different classes of signals, their possible interpretations, and the approaches that would be able to detect them: the first four classes of signals involve increases in one or more data streams. ( ) a rise in one syndrome group at a single location may correspond to a localized outbreak or simply a data irregularity. such a signal could be detected by all network models as well as the reference model. ( ) a rise in all syndrome groups at a single location probably corresponds to a geographical shift in utilization (e.g., a quarantine elsewhere), as an outbreak would not be expected to cause an increase in all syndrome groups. such a signal would be detected by network models that include multiple locations, and by the reference model. ( ) a rise in one syndrome group across all locations may correspond to a widespread outbreak or may similarly result from the visits by the ''worried-well.'' such a signal would be detected by network models that include multiple syndrome groups, and by the reference model. ( ) a rise in all syndrome groups in all locations probably corresponds to a population surge, as an outbreak would not be expected to cause an increase in all syndrome groups. this signal would be ignored by all network models, but would be detected by the reference model. the next four classes of signals involve decreases in one or more data streams. all of these signals are unlikely to be indicative of an outbreak, but are important for maintaining situational awareness in certain critical situations. as mentioned above, a significant decrease in a context data stream has the potential to trigger a false alarm in the target data stream, especially in networks with few context nodes. this is particularly true in two-node networks, where there is only one context data stream. ( ) a fall in one syndrome group at a single location does not have an obvious interpretation. all models will ignore such a signal, since they are set to alarm on increases only. ( ) a fall in all syndrome groups at a single location could represent a geographical shift in utilization (e.g., a local quarantine). all models will ignore such a signal. the baselines of all models will be affected, except for network models that include only nodes from single locations. ( ) a fall in one syndrome group at all locations may represent a frightened public. all models will ignore such a signal. the baselines of all models will be affected, except for network models that include only nodes from single syndromic groups. ( ) a fall in all data types at all locations may represent a regional population decrease or a frightened public staying away from hospitals out of concern for nosocomial infection (e.g., during an influenza pandemic). all models will ignore such a signal. the baseline of only the reference model will be affected. from this overview, it is clear that the network models are more robust than the reference model, with fewer false alarms (in scenarios and ) and less vulnerability to irregularities in baselines (in scenarios [ ] [ ] [ ] ). based on the results obtained, when constructing epidemiological networks for monitoring a particular epidemiological data stream, we recommend prioritizing the inclusion of total visits from all other hospitals, followed by total visits from the same hospital, followed by data streams of the same syndrome group from other hospitals and streams of different syndrome groups from the same hospital, followed by data streams of different syndrome groups from different hospitals.
we further recommend that, in addition to fullnetwork models, homogeneous network models (e.g., only respiratory nodes from multiple hospitals) be maintained for greater stability in the face of major targeted shifts in healthcare utilization. the two-node networks described above are similar in certain ways to the ''rate''-based approach used by a small number of surveillance systems today [ ] [ ] [ ] [ ] . instead of monitoring daily counts directly, these systems monitor daily counts as a proportion of the total counts. for example, the respiratory-related visits at a certain hospital could be tracked as a percentage of the total number of visits to that hospital, or alternatively, as a percentage of the total number of respiratory visits in the region. these ''rate''-based approaches have been proposed where absolute daily counts are too unstable for modeling [ ] , or where population-atrisk numbers are not available for use in spatiotemporal scan statistics [ ] . the approach presented here is fundamentally different in that it explicitly models and tracks all possible inter-data stream relationships, not just those between a particular data stream and its corresponding total-visits data stream. furthermore, the present approach is motivated by the desire to increase robustness in the face of large shifts in health-care utilization that may occur during epidemics or major public events. as such, this study includes a systematic study of the models' responses to different magnitudes of both broad and targeted baseline shifts. the two-node networks described above are an example of this general class of ''rate''-based models. while the two-node approach works well under normal conditions, it is not as robust to targeted shifts in health-care utilization as larger network models. the results therefore show that there is value in modeling all, or a selected combination of the relationships among health-care data streams, not just the relationship between a data stream and its corresponding total-visits data stream. modeling all these relationships involves an order-n expansion of the number of models maintained internally by the system: n À n models are used to monitor n data streams. the additional information inherent in this larger space is extracted to improve detection performance, after which the individual model outputs are collapsed back to form the n integrated outputs of the system. since the number of models grows quadratically with the number of data streams, n, the method can become computationally intensive for large numbers of streams. in such a case, the number of models could be minimized by, for example, constructing only networks that include nodes from different figure . simulation of the effects of the worried-well flooding hospitals during a pandemic to simulate the effects of the worried-well flooding hospitals during a pandemic, a targeted rise is introduced in only one type of data stream. full networks, respiratory-or gastrointestinal-only networks, two-node networks, and reference models are compared. error bars are standard errors. doi: . /journal.pmed. .g syndrome groups but from the same hospital, or alternatively, including all context nodes from the same hospital and only total-visit nodes from other hospitals. this work is different from other recent epidemiological research that has described simulated contact networks of individual people moving about in a regional environment and transmitting infectious diseases from one person to another. 
these simulations model the rate of spread of an infection under various conditions and interventions and help prepare for emergency scenarios by evaluating different health policies. on the other hand, we studied relational networks of hospitals monitoring health-care utilization in a regional environment, for the purpose of detecting localized outbreaks in a timely fashion and maintaining situational awareness under various conditions. our work is also focused on generating an integrated network view of an entire healthcare environment. limitations of this study include the use of simulated infectious disease outbreaks and baseline shifts. we use a realistic outbreak shape and baseline shift pattern, and perform simulation experiments varying the magnitudes of both of these. while other outbreak shapes and baseline shift patterns are possible, this approach allows us to create a paradigmatic set of conditions for evaluating the relative outbreak-detection performance of the various approaches [ ] . another possible limitation is that even though our findings are based on data across multiple disease categories (syndromes), multiple hospitals, and multiple years, relationships between epidemiological data streams may be different in other data environments. also, our methods are focused on temporal modeling, and therefore do not have an explicit geospatial representation of patient location, even though grouping the data by hospital does preserve a certain degree of geospatial information. the specific temporal modeling approach used requires a solid base of historical data for the training set. however, this modeling approach is not integral to the network strategy, and one could build an operational network by using other temporal modeling approaches. furthermore, as advanced disease-surveillance systems grow to monitor an increasing number of data streams, the risk of information overload increases. to address this problem, attempts to integrate information from multiple data streams have largely focused on detecting the multiple effects of a single outbreak across many data streams [ ] [ ] [ ] [ ] . the approach described here is fundamentally different in that it focuses on detecting outbreaks in one data stream by monitoring fluctuations in its relationships to the other data streams, although it can also be used for detecting outbreaks that affect multiple data streams. we recommend using the network approaches described here alongside current approaches to realize the complementary benefits of both. these findings suggest areas for future investigation. there are inherent time lags among epidemiological data streams: for example, pediatric data have been found to lead adult data in respiratory visits [ ] . while the approach described here may implicitly model these relative time lags, future approaches can include explicit modeling of relative temporal relationships among data streams. it is also possible to develop this method further to track outbreaks in multiple hospitals and syndrome groups. it is further possible to study the effects on timeliness of detection of different network approaches. also, while we show the utility of the network approach for monitoring disease patterns on a regional basis, networks constructed from national or global data may help reveal important trends at wider scales. editors' summary background. the main task of public-health officials is to promote health in communities around the world. 
to do this, they need to monitor human health continually, so that any outbreaks (epidemics) of infectious diseases (particularly global epidemics or pandemics) or any bioterrorist attacks can be detected and dealt with quickly. in recent years, advanced disease-surveillance systems have been introduced that analyze data on hospital visits, purchases of drugs, and the use of laboratory tests to look for tell-tale signs of disease outbreaks. these surveillance systems work by comparing current data on the use of health-care resources with historical data or by identifying sudden increases in the use of these resources. so, for example, more doctors asking for tests for salmonella than in the past might presage an outbreak of food poisoning, and a sudden rise in people buying overthe-counter flu remedies might indicate the start of an influenza pandemic. why was this study done? existing disease-surveillance systems don't always detect disease outbreaks, particularly in situations where there are shifts in the baseline patterns of health-care use. for example, during an epidemic, people might stay away from hospitals because of the fear of becoming infected, whereas after a suspected bioterrorist attack with an infectious agent, hospitals might be flooded with ''worried well'' (healthy people who think they have been exposed to the agent). baseline shifts like these might prevent the detection of increased illness caused by the epidemic or the bioterrorist attack. localized population surges associated with major public events (for example, the olympics) are also likely to reduce the ability of existing surveillance systems to detect infectious disease outbreaks. in this study, the researchers developed a new class of surveillance systems called ''epidemiological network models.'' these systems aim to improve the detection of disease outbreaks by monitoring fluctuations in the relationships between information detailing the use of various health-care resources over time (data streams). what did the researchers do and find? the researchers used data collected over a -y period from five boston hospitals on visits for respiratory (breathing) problems and for gastrointestinal (stomach and gut) problems, and on total visits ( data streams in total), to construct a network model that included all the possible pair-wise comparisons between the data streams. they tested this model by comparing its ability to detect simulated disease outbreaks implanted into data collected over an additional year with that of a reference model based on individual data streams. the network approach, they report, was better at detecting localized outbreaks of respiratory and gastrointestinal disease than the reference approach. to investigate how well the network model dealt with baseline shifts in the use of health-care resources, the researchers then added in a large population surge. the detection performance of the reference model decreased in this test, but the performance of the complete network model and of models that included relationships between only some of the data streams remained stable. finally, the researchers tested what would happen in a situation where there were large numbers of ''worried well.'' again, the network models detected disease outbreaks consistently better than the reference model. what do these findings mean? 
these findings suggest that epidemiological network systems that monitor the relationships between health-care resource-utilization data streams might detect disease outbreaks better than current systems under normal conditions and might be less affected by unpredictable shifts in the baseline data. however, because the tests of the new class of surveillance system reported here used simulated infectious disease outbreaks and baseline shifts, the network models may behave differently in real-life situations or if built using data from other hospitals. nevertheless, these findings strongly suggest that public-health officials, provided they have sufficient computer power at their disposal, might improve their ability to detect disease outbreaks by using epidemiological network systems alongside their current disease-surveillance systems. additional information. please access these web sites via the online version of this summary at http://dx.doi.org/ . /journal.pmed. . wikipedia pages on public health (note that wikipedia is a free online encyclopedia that anyone can edit, and is available in several languages) a brief description from the world health organization of public-health surveillance (in english, french, spanish, russian, arabic, and chinese) a detailed report from the us centers for disease control and prevention called ''framework for evaluating public health surveillance systems for the early detection of outbreaks'' the international society for disease surveillance web site containing pandemic influenza at the source strategies for containing an emerging influenza pandemic in southeast asia public health vaccination policies for containing an anthrax outbreak world health organization writing group ( ) nonpharmaceutical interventions for pandemic influenza, national and community measures transmissibility of pandemic influenza syndromic surveillance for influenzalike illness in ambulatory care network sars surveillance during emergency public health response planning for smallpox outbreaks systematic review: surveillance systems for early detection of bioterrorismrelated diseases implementing syndromic surveillance: a practical guide informed by the early experience national retail data monitor for public health surveillance using laboratory-based surveillance data for prevention: an algorithm for detecting salmonella outbreaks sars-related perceptions in hong kong utilization of ontario's health system during the sars outbreak. toronto: institute for clinical and evaluative sciences pandemic influenza preparedness and mitigation in refugee and displaced populations. who guidelines for humanitarian agencies medical care delivery at the olympic games algorithm for statistical detection of peaks-syndromic surveillance system for the athens biosense: implementation of a national early event detection and situational awareness system time series modeling for syndromic surveillance using temporal context to improve biosurveillance measuring outbreak-detection performance by using controlled feature set simulations the distribution of incubation periods of infectious disease harvard team suggests route to better bioterror alerts can syndromic surveillance data detect local outbreaks of communicable disease? 
a model using a historical cryptosporidiosis outbreak a space-time permutation scan statistic for disease outbreak detection syndromic surveillance in public health practice: the new york city emergency department system monitoring over-the-counter pharmacy sales for early outbreak detection in new york city algorithms for rapid outbreak detection: a research synthesis integrating syndromic surveillance data across multiple locations: effects on outbreak detection performance public health monitoring tools for multiple data streams bivariate method for spatio-temporal syndromic surveillance identifying pediatric age groups for influenza vaccination using a real-time regional surveillance system the authors thank john brownstein of harvard medical school for helpful comments on the manuscript.author contributions. byr, kdm, and isk wrote the paper and analyzed and interpreted the data. byr and kdm designed the study, byr performed experiments, kdm and byr collected data, and isk suggested particular methods to be used in the data analysis. key: cord- -mc pifep authors: rowhani-farid, anisa; allen, michelle; barnett, adrian g. title: what incentives increase data sharing in health and medical research? a systematic review date: - - journal: res integr peer rev doi: . /s - - - sha: doc_id: cord_uid: mc pifep background: the foundation of health and medical research is data. data sharing facilitates the progress of research and strengthens science. data sharing in research is widely discussed in the literature; however, there are seemingly no evidence-based incentives that promote data sharing. methods: a systematic review (registration: . /osf.io/ pz e) of the health and medical research literature was used to uncover any evidence-based incentives, with pre- and post-empirical data that examined data sharing rates. we were also interested in quantifying and classifying the number of opinion pieces on the importance of incentives, the number observational studies that analysed data sharing rates and practices, and strategies aimed at increasing data sharing rates. results: only one incentive (using open data badges) has been tested in health and medical research that examined data sharing rates. the number of opinion pieces (n = ) out-weighed the number of article-testing strategies (n = ), and the number of observational studies exceeded them both (n = ). conclusions: given that data is the foundation of evidence-based health and medical research, it is paradoxical that there is only one evidence-based incentive to promote data sharing. more well-designed studies are needed in order to increase the currently low rates of data sharing. electronic supplementary material: the online version of this article (doi: . /s - - - ) contains supplementary material, which is available to authorized users. research waste: hidden data, irreproducible research the foundation of health and medical research is data-its generation, analysis, re-analysis, verification, and sharing [ ] . data sharing is a key part of the movement towards science that is open, where data is easily accessible, intelligible, reproducible, replicable, and verifiable [ ] . data sharing is defined here as making raw research data available in an open data depository, and includes controlled access where data is made available upon request which may be required due to legal or ethical reasons. 
despite the wide-scale benefits of data sharing such as addressing global public health emergencies, it is yet to become common research practice. for instance, the severe acute respiratory syndrome (sars) disease was controlled only months after its emergence by a world health organization-coordinated effort based on extensive data sharing [ ] . likewise, the researchers working on the ebola outbreak have recently committed to work openly in outbreaks to honour the memory of their colleagues who died at the forefront of the ebola outbreak, and to ensure that no future epidemic is as devastating [ ] . notwithstanding these benefits, numerous studies have demonstrated low rates of data sharing in health and medical research, with the leading journal the british medical journal (bmj) having a rate as low as . % [ ] and biomedical journal articles % [ ] . there are of course legitimate reasons to withhold data, such as the concern about patient privacy, and the requirement for patient consent for sharing [ ] . with % of the world's spending on health and medical research, an estimated $ billion, wasted every year, it is clear that the scientific community is in crisis, leading to questions about the veracity of scientific knowledge [ ] . data sharing and openness in scientific research should be fundamental to the philosophy of how scientific knowledge is generated. thomas kuhn introduced the concept of paradigm shifts that arise from a scientific crisis. the paradigm shift before us today is from closed, hidden science to open science and data sharing [ ] . sharing scientific data will allow for data verification and re-analysis, and for testing new hypotheses. open data reduces research waste in terms of time, costs, and participant burden, and in turn, strengthens scientific knowledge by ensuring research integrity [ , ] . the many current problems in health and medical research have led to the emergence of a new field, metaresearch, which is concerned with improving research practices [ ] . meta-research has five sub-themes with 'reproducibility' and 'incentives' as two of the themes [ ] . reproducibility is concerned with the verification of research findings, which can be achieved through the sharing of data and methods [ ] . incentives is concerned with rewarding researchers, which includes incentives to share their data and methods [ ] . we were interested in how researchers are incentivised to openly share their raw data, thus combining two sub-themes of meta-research. historically, it has not been common practice for the content of a research article to include access to the raw data from scientific experiments [ ] . this flaw, created by technological limitations among others, has hindered the progress of scientific knowledge [ ] . however, we can no longer blame technology for outdated research practices. there are many data depositories which allow researchers to easily share their data using a citable doi. there have also been many recent policies and frameworks to encourage openness in research [ ] . yet, uptake in health and medicine is low and what is lacking, it appears, are rewards that incentivize researchers to share their data [ ] . incentives are defined here as rewards that are given to researchers if they participate in sharing their raw scientific data [ ] . the queensland university of technology (qut) library staff assisted in developing a rigorous and clearly documented methodology for both the search strategy and the selection of studies. 
the aim was to minimise bias by documenting the search process and the decisions made to allow the review to be reproduced and updated. the cochrane handbook for systematic reviews was used as a guide for this systematic review: http://handbook.cochrane.org/. the equator network additional file : prisma ( ) checklist [ ] was used to ensure good practice as well as accurate reporting. three systematic review registries (prospero, joanna briggs institute, and cochrane) were checked to ensure our proposed systematic review had not already been done. our systematic review protocol was registered at the open science framework on august (doi.org/ . /osf.io/ pz e). this review considered published journal articles with empirical data that trialed any incentive to increase data sharing in health and medical research. articles must have tested an incentive that could increase data sharing in health and medical research. for the purposes of this review, health and medical research data is defined as any raw data that has been generated through research from a health and medical facility, institute or organisation. incentives are defined here as 'a benefit, reward, or cost that motivates an […] action'. this was based on the definition of incentives in economics, which groups incentives into four categories: financial, moral, natural, and coercive [ ] . the review included any paper with empirical data on sharing that compared an intervention and control, which used a clear research design (including randomised and non-randomised designs). the types of measures included are the percent of datasets shared, or the number of datasets shared, or the relative ratio of data sharing. this review excluded the following, but still classified these excluded papers by field: all editorial and opinion pieces that only discuss strategies to increase data sharing without trialling them. strategies that do not involve incentives, e.g., education seminars, change in a data sharing policy or some other policy, access to data management tools and managers. observational studies that describe data sharing patterns. this search strategy was designed to access published articles through the following steps: ) ((("open science" or "open data" or "data sharing") and (incentive* or motivation* or reward* or barrier*))) ) ) relevant articles that did not appear in the database search but were known to the reviewers were handpicked and extracted into endnote. two reviewers, arf and ma, screened the titles of the articles and based on the inclusion and exclusion criteria, extracted them into endnote. duplicates were removed. the reviewers independently screened the extracted article titles and abstracts based on the inclusion and exclusion criteria and categorised them into five groups: arf read the titles and abstracts of all extracted articles and ma verified her findings by reading a random sample of %. discrepancies between the two reviewers were approximately %, however these were relatively minor and resolved through discussion of the scope of each of the categories. for instance, a research paper outlined the introduction of a data system, one reviewer classified it as an observational study, but after discussion it was agreed that it was a strategy article as its objective was to increase data sharing rates rather than observing data sharing patterns. the two reviewers independently read eligible documents and extracted data sharing incentives in health and medical research. 
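the retrieval and de-duplication steps described above lend themselves to scripting; the following minimal sketch, which is not the authors' actual endnote-based workflow, re-applies the boolean search string to a hypothetical csv export of retrieved records (records.csv with title and abstract columns are assumed names) and drops exact-title duplicates before manual title/abstract screening.

```python
"""Minimal sketch of record de-duplication and keyword pre-screening.

Assumes a hypothetical CSV export (records.csv) with 'title' and 'abstract'
columns; the review itself performed these steps manually in EndNote.
"""
import csv
import re

OPEN_TERMS = ("open science", "open data", "data sharing")
INCENTIVE_STEMS = ("incentiv", "motivat", "reward", "barrier")  # truncation terms from the query

def normalise(title: str) -> str:
    """Lower-case a title and strip punctuation/whitespace for duplicate matching."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def matches_query(text: str) -> bool:
    """Roughly re-apply the boolean search string to a title+abstract string."""
    text = text.lower()
    return any(t in text for t in OPEN_TERMS) and any(s in text for s in INCENTIVE_STEMS)

def screen(path: str):
    seen, kept = set(), []
    with open(path, newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            key = normalise(record.get("title", ""))
            if not key or key in seen:  # drop exact-title duplicates
                continue
            seen.add(key)
            if matches_query(record.get("title", "") + " " + record.get("abstract", "")):
                kept.append(record)
    return kept

if __name__ == "__main__":
    records = screen("records.csv")
    print(f"{len(records)} unique records retained for title/abstract screening")
```

in practice reference managers catch near-duplicates more robustly, but a scripted pass like this documents the screening decisions and helps make the review reproducible and updatable, in line with the aim stated above.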
both reviewers were agnostic regarding the types of incentives to look for. the final list of incentives was determined and agreed on by all authors [ ] . individual incentives were grouped into research fields. a qualitative description of each incentive was presented. based on our prior experience of the literature, the research fields and sub-fields for classification were: a. health and medical research i. psychology ii. genetics iii. other (health/medical) b. non-health and medical research i. information technology ii. ecology iii. astronomy iv. other (non-health/medical) the other article-strategies, opinion pieces, and observational studies were also grouped into the same research fields. the database searches found articles, of which met the inclusion criteria based on assessment of titles and abstracts and were exported into endnote. after automatically removing duplicates, articles remained and after manually removing the remainder of the duplicates, articles remained. titles and abstracts were read and categorised based on the above inclusion and exclusion criteria. one study was hand-picked as it met the inclusion criteria, bringing the total number of extracted articles to . after screening titles and abstracts, nine articles were classified under incentives in health and medical research. these articles were then read in full, and one of them was judged as an incentive that satisfied the inclusion criteria. the prisma [ ] flow chart that outlines the journey of the articles from identification to inclusion is in fig. . the categorisation of all articles into the sub-fields and article type is in table . a review of the reference list for the one included intervention was undertaken [ ] . the titles and abstracts of the full reference list of this study ( papers) and those that cited the study ( papers) were read, but none met the inclusion criteria of this systematic review. articles were irrelevant, bringing the total number of screened articles to . the distribution of articles across type of study was similar for both health and medical research and non-health and medical research ( table ) . observational studies were the most common type (n = , n = ), then opinion pieces (n = , n = ), then articles testing strategies (n = , n = ), and articles testing incentives were uncommon (n = , n = ). these articles did not fit the inclusion criteria, but based on the abstracts they were mostly concerned with observing data sharing patterns in the health and medical research community, using quantitative and qualitative methods. the motivation behind these studies was often to identify the barriers and benefits to data sharing in health and medical research. for instance, federer et al. ( ) conducted a survey to investigate the differences in experiences with and perceptions about sharing data, as well as barriers to sharing among clinical and basic science researchers [ ] . these articles also did not fit the inclusion criteria, but based on the abstracts they were opinion and editorial pieces that discussed the importance and benefits of data sharing and also outlined the lack of incentives for researchers to share data. open data and open material badges were created by the center of open science and were tested at the journal psychological science [ ] . [ ] . 
a limitation of the badge study was that it did not use a randomized parallel group design; notwithstanding, it was the only incentive that was tested in the health and medical research community, with pre-and postincentive empirical data [ ] . the pre-and post-design of the study makes it vulnerable to other policy changes over time, such as a change from a government funding agency like the recent statement on data sharing from the australian national health and medical research council [ ] . however, the kidwell et al. study addressed this concern with contemporary control journals. a limitation of the badge scheme was that even with badges, the accessibility, correctness, usability, and completeness of the shared data and materials was not %, which was attributable to gaps in specifications for earning badges. in late , the center for open science badges committee considered provisions for situations in which the data or materials for which a badge was issued somehow disappear from public view and how adherence to badge specifications can be improved by providing easy procedures for editors/journal staff to validate data and material availability before issuing a badge, and by providing community guidelines for validation and enforcement [ ] . of the non-health/medical incentives, seven were categorised as information technology, and nine as other. upon reading the full text, all the non-health/medical incentives were proposed incentives or strategies as opposed to tested incentives with comparative data. given that the systematic review found only one incentive, we classified the data sharing strategies tested in the health and medical research community. seventy-six articles were classified under 'strategies' and table shows the further classification into categories based on a secondary screening of titles and abstracts. the articles are grouped by whether they presented any data, descriptive, or empirical. the majority, / , of strategies were technological strategies such as the introduction of a data system to manage and store scientific data. seven of the strategies concerned encouraging collaboration among research bodies to increase data sharing. eight were a combination of collaboration across consortia and the introduction of a technological system. three had a data sharing policy as the strategy but did not test the effectiveness of the policy, but two of them reported descriptive data from their experience in implementing the policy. one strategy was an open data campaign. below we give some examples of the strategies used to promote data sharing. two articles discussed an incentive system for human genomic data and data from rare diseases, namely, microattribution and nanopublication-the linkage of data to their contributors. however, the articles only discussed the models and did not present empirical data [ , ] . another article discussed the openfmri project that aims to provide the neuroimaging community with a resource to support open sharing of fmri data [ ] . in , the openfmri database had full datasets from seven different laboratories and in october , the database had datasets openly available (https://openfmri.org/dataset/). the authors identified credit as a barrier towards sharing data and so incorporated attribution into the openfmri website where a dataset is linked to the publication and the list of investigators involved in collecting the data [ ] . 
an article discussed open source drug discovery and outlined its experience with two projects, the praziquantel (pzg) project and the open source malaria project [ ] . the article did not have pre-and post-strategy data. the authors discussed the constituent elements of an open research approach to drug discovery, such as the introduction of an electronic lab notebook that allows the deposition of all primary data as well as data management and coordination tools that enhances community input [ ] . the article describes the benefits and successes of the open projects and outlines how their uptake needs to be incentivised in the scientific community [ ] . an article discussed the development of the collaboratory for ms d (c-ms d), an integrated knowledge environment that unites structural biologists working in the area of mass spectrometric-based methods for the analysis of tertiary and quaternary macromolecular structures (ms d) [ ] . c-ms d is a web-portal designed to provide collaborators with a shared work environment that integrates data storage and management with data analysis tools [ ] . the goal is not only to provide a common data sharing and archiving system, but also to assist in the building of new collaborations and to spur the development of new tools and technologies [ ] . one article outlined the collaborative efforts of the global alzheimer's association interactive network (gaain) to consolidate the efforts of independent alzheimer's disease data repositories around the world with the goals of revealing more insights into the causes of alzheimer's disease, improving treatments, and designing preventative measures that delay the onset of physical symptoms [ ] . in , they had registered data repositories from around the world with over , subjects using gaain's search interfaces [ ] . the methodology employed by gaain to motivate participants to voluntarily join its federation is by providing incentives: data collected by its data partners are advertised, as well as the identity of the data partners, including their logos and url links, on each gaain search page [ ] . gaiin attributes its success in registering data repositories to date to these incentives which provide opportunities for groups to increase their public visibility while retaining control of their data, making the relationship between gaiin and its partners mutually beneficial [ ] . this study did not have pre-and post-strategy empirical data, but described the importance of incentives in motivating researchers to share their data with others [ ] . an article described how data sharing in computational neuroscience was fostered through a collaborative workshop that brought together experimental and theoretical neuroscientists, computer scientists, legal experts, and governmental observers [ ] . this workshop guided the development of new funding to support data sharing in computational neuroscience, and considered a conceptual framework that would direct the data sharing movement in computational neuroscience [ ] . the workshop also unveiled the impediments to data sharing and outlined the lack of an established mechanism to provide credit for data sharing as a concern [ ] . a recommendation was that dataset usage statistics and other user feedback be used as important measures of credit [ ] . one article addressed the need to facilitate a culture of responsible and effective sharing of cancer genome data through the establishment of the global alliance for genomic health (ga gh) in [ ] . 
the collaborative body unpacked the challenges with sharing cancer campaign ( ) [ ] genomic data as well as the potential solutions [ ] . the ga gh developed an ethical and legal framework for action with the successful fostering of an international 'coalition of the willing' to deliver a powerful, globally accessible clinic-genomic platform that supports datadriven advances for patients and societies [ ] . an article discussed the efforts of the wellcome trust sanger institute to develop and implement an institutewide data sharing policy [ ] . the article outlined that successful policy implementation depends on working out detailed requirements (guidance), devoting efforts and resources to alleviate disincentives (facilitation), instituting monitoring processes (oversight), and leadership [ ] . the topic of disincentives (facilitation) included concerns about lack of credit [ ] . they propose that cultural barriers to data sharing continue to exist and that it is important to align the reward system to ensure that scientists sharing data are acknowledged/cited and that data sharing is credited in research assessment exercises and grant career reviews [ ] . one intervention was an open data campaign which was included in the review via an open letter in june from the alltrials campaign to the director of the european medicines agency to remove barriers to accessing clinical trial data [ ] . the alltrials campaign is supported by more than , people and organisations worldwide [ ] . this letter contributed to the european medicines agency publishing the clinical reports underpinning market authorization requests for new drugs, which was part of a more proactive policy on transparency that applied to all centralized marketing authorisations submitted after january [ ] . the adoption of this policy was a significant step in ensuring transparency of health and medical research in europe [ ] . this systematic review verified that there are few evidence-based incentives for data sharing in health and medical research. the irony is that we live in an evidence-based world, which is built upon the availability of raw data, but we hardly have any evidence to demonstrate what will motivate researchers to share data. [ ] . it is interesting to note the great number of opinion pieces (n = ) on the importance of developing incentives for researchers, which outnumbered the number of articles that tested strategies to increase data sharing rates (n = ). 'opinion pieces' are mutually exclusive from 'strategies' as the former is concerned with discussing possible strategies and incentives and the latter tests the ideas and strategies and provides evidence of what works or does not work. these strategies included: the introduction of data systems such as electronic laboratory notebooks and databases for data deposition that incorporated a system of credit through data linkage; collaboration across consortia that also introduce data systems that also use data attribution as an incentive; collaboration across consortia through workshops and development of frameworks for data sharing; implementation of data sharing policies; and campaigns to promote data sharing. these strategies discussed the requirement of introducing rewards to increase data sharing rates and the only form of incentive used was via data attribution and advertising on websites. studies that test the effectiveness of attribution and advertising as a form of credit are necessary. 
in light of the small number of studies, we see a clear need for studies to design and test incentives that would motivate researchers to share data. organisations are promoting the development of incentives to reduce research waste. in late , the cochrane and the reward alliance combined to create the annual cochrane-reward prize for reducing waste in research. the monetary prize is awarded to 'any person or organisation that has tested and implemented strategies to reduce waste in one of the five stages of research production [question selection, study design, research conduct, publication, and reporting] in the area of health'. this prize is an example of an incentive for researchers to design studies or implement policies that reduce research waste; it will be interesting to see the impact of this initiative [ ] . another endeavour in the area of developing incentives and rewards for researchers is the convening in early of a group of leaders from the usa and europe from academia, government, journals, funders, and the press to help develop new models for academic promotion and professional incentives that would promote the highest quality science, organised by the meta-research innovation center at stanford (met-rics). the focus will be on designing practical actions that embody principles that this community has embraced, while also recognizing that the effect of any such policies will need empirical evaluation. while the systematic barriers to widespread data sharing are being addressed through the general shift towards more openness in research, the conversation on data sharing includes an alternative view where users of shared data are called 'research parasites' who 'steal from research productivity' and who are 'taking over' [ , ] . there is also some questioning of whether data sharing is worth the effort [ ] . these points, however, are contrary to the purpose of sharing data, which is to progress science as a body of knowledge and to make the research process more robust and verifiable [ , ] . a limitation of this systematic review is that we did not search the grey literature (materials and research produced by organizations outside of the traditional commercial or academic publishing and distribution channels). this review could be perceived as having a narrow design, given that we anticipated a lack of evidence-based incentives for data sharing in health and medical research, hence making the topic of this systematic review too simple. however, we could not be sure that there were no incentives and the recent paper by lund and colleagues ( ) emphasises the importance of conducting systematic reviews prior to designing interventions in order to avoid adding to the already large issue of research waste [ ] . the current meta-research discourse outlines the numerous benefits of openness in research: verification of research findings, progressing health and medicine, gaining new insights from re-analyses, reducing research waste, increasing research value, and promoting research transparency. however, this systematic review of the literature has uncovered a lack of evidencebased incentives for researchers to share data, which is ironic in an evidence-based world. the open data badge is the only tested incentive that motivated researchers to share data [ ] . this low-cost incentive could be adopted by journals and added to the reward system to promote reproducible and sharable research [ , ] . other incentives like attribution require empirical data. 
instead of evidence-based incentives, the literature is full of opinion pieces that emphasize the lack of incentives for researchers to share data, outweighing the number of strategies that aim to increase data sharing rates in health and medicine. observational studies that identify data sharing patterns and barriers are also plentiful, and whilst these studies can provide useful background knowledge, they do not provide good evidence of what can be done to increase data sharing. the open knowledge foundation: open data means better science meta-research: evaluation and improvement of research methods and practices perspectives on open science and scientific data sharing:an interdisciplinary workshop data sharing: make outbreak research open access reproducible research practices and transparency across the biomedical literature a systematic review of barriers to data sharing in public health avoidable waste in the production and reporting of research evidence the structure of scientific revolutions increasing value and reducing waste in biomedical research regulation and management open science: many good resolutions, very few incentives, yet. in: incentives and performance: governance of research organizations reproducing statistical results preferred reporting items for systematic reviews and meta-analyses: the prisma statement accessed badges to acknowledge open practices: a simple, low-cost, effective method for increasing transparency biomedical data sharing and reuse: attitudes and practices of clinical and scientific research staff accessed microattribution and nanopublication as means to incentivize the placement of human genome variation data into the public domain speeding up research with the semantic web towards open sharing of task-based fmri data: the openfmri project open source drug discovery -a limited tutorial the collaboratory for ms d: a new cyberinfrastructure for the structural elucidation of biological macromolecules and their assemblies using mass spectrometry-based approaches the global alzheimer's association interactive network data sharing for computational neuroscience facilitating a culture of responsible and effective sharing of cancer genome data developing and implementing an institute-wide data sharing policy open data campaign open letter: european medicines agency should remove barriers to access clinical trial data the cochrane-reward prize for reducing waste in research accessed data sharing data sharing -is the juice worth the squeeze? towards evidence based research assessing value in biomedical research: the pqrst of appraisal and reward northwestern university schizophrenia data and software tool (nusdast). 
frontiers in neuroinformatics schizconnect: a one-stop web-based resource for large-scale schizophrenia neuroimaging data integration coding rare diseases in health information systems: a tool for visualizing classifications and integrating phenotypic and genetic data universal syntax solutions for the integration, search, and exchange of phenotype and genotype information a web-portal for interactive data exploration, visualization, and hypothesis testing lord: a phenotype-genotype semantically integrated biomedical data tool to support rare disease diagnosis coding in health information systems proteomics fasta archive and reference resource using the wiki paradigm as crowd sourcing environment for bioinformatics protocols disgenet-rdf: harnessing the innovative power of the semantic web to explore the genetic basis of diseases brisk-research-oriented storage kit for biologyrelated data isa-tab-nano: a specification for sharing nanomaterial research data in spreadsheet-based format usage: a web-based approach towards the analysis of sage data. serial analysis of gene expression a universal open-source electronic laboratory notebook grin-global: an international project to develop a global plant genebank information management system padma database: pathogen associated drosophila microarray database all the world's a stage: facilitating discovery science and improved cancer care through the global alliance for genomics and health scens: a system for the mediated sharing of sensitive data the us-mexico border infectious disease surveillance project: establishing binational border surveillance medetect: domain entity annotation in biomedical references using linked open data open science cbs neuroimaging repository: sharing ultra-high-field mr images of the brain race by hearts: using technology to facilitate enjoyable and social workouts using a database/repository structure to facilitate multi-institution utilization management data sharing real-time data streaming for functionally improved ehealth solutions real-time clinical information exchange between ems and the emergency department integration of a mobile-integrated therapy with electronic health records: lessons learned loni mind: metadata in nifti for dwi application of foss g and open data to support polio eradication, vaccine delivery and ebola emergency response in west africa improving hiv surveillance data for public health action in washington, dc: a novel multiorganizational data-sharing method an i b -based, generalizable, open source, self-scaling chronic disease registry next generation cancer data discovery, access, and integration using prizms and nanopublications owling clinical data repositories with the ontology web language design and generation of linked clinical data cubes the current and potential role of satellite remote sensing in the campaign against malaria the ebi rdf platform: linked open data for the life sciences a digital repository with an extensible data model for biobanking and genomic analysis management catalyzer: a novel tool for integrating, managing and publishing heterogeneous bioscience data. 
concurrency computation pract experience a simple tool for neuroimaging data sharing e-health systems for management of mdr-tb in resource-poor environments: a decade of experience and recommendations for future work the cardiac atlas project-an imaging database for computational modeling and statistical atlases of the heart american college of rheumatology's rheumatology informatics system for effectiveness registry pilot the rheumatology informatics system for effectiveness (rise): enabling data access across disparate sites for quality improvement and research the preprocessed connectomes project: an open science repository of preprocessed data kimosys: a web-based repository of experimental data for kinetic models of biological systems the nih d print exchange: a public resource for bioscientific and biomedical d prints. d printing addit manuf implementation of chemotherapy treatment plans (ctp) in a large comprehensive cancer center (ccc): the key roles of infrastructure and data sharing implementing standards for the interoperability among healthcare providers in the public regionalized healthcare information system of the lombardy region rapid growth in use of personal health records software breakthrough makes data sharing easy. hospital peer review preventing, controlling, and sharing data of arsenicosis in china possibilities and implications of using the icf and other vocabulary standards in electronic health records the functional magnetic resonance imaging data center (fmridc): the challenges and rewards of large-scale databasing of neuroimaging studies one health surveillance -more than a buzz word? xnat central: open sourcing imaging research data flexible specification of data models for neuroscience databases health care provider quality improvement organization medicare data-sharing: a diabetes quality improvement initiative the global alzheimer's association interactive network (gaain) pediatric patients in the track tbi trial-testing common data elements in children rapid learning in practice: validation of an eu population-based prediction model in usa trial data for h&n cancer developing the foundation for syndromic surveillance and health information exchange for yolo county, california. online journal of public health informatics the national academies collection: reports funded by national institutes of health new models of open innovation to rejuvenate the biopharmaceutical ecosystem, a proposal by the acnp liaison committee the preclinical data forum network: a new ecnp initiative to improve data quality and robustness for (preclinical) neuroscience data sharing in neuroimaging research sharing data with physicians helps break down barriers. data strategies & benchmarks : the monthly advisory for health care executives sharing overdose data across state agencies to inform public health strategies: a case study act levels the playing field on healthcare performance the qut librarians assisted in designing the search strategy for this review. no monetary assistance was provided for this systematic review; however, support was provided in kind by the australian centre for health services innovation at the institute of health and biomedical innovation at qut. the datasets generated and analysed during the current study are available at the open science framework repository (doi . /osf.io/dspu ). authors' contributions arf collected and analysed all the data for the study and wrote the manuscript. 
ma collected the data and analysed ( %) for the study and edited the manuscript. agb provided close student mentorship for this research, which is a part of arf's phd under his primary supervision, and was a major contributor for the writing of this manuscript. all authors read and approved the final manuscript. the authors declare that they have no competing interests. not applicable.ethics approval and consent to participate not applicable. key: cord- -nm dx pq authors: theys, kristof; lemey, philippe; vandamme, anne-mieke; baele, guy title: advances in visualization tools for phylogenomic and phylodynamic studies of viral diseases date: - - journal: front public health doi: . /fpubh. . sha: doc_id: cord_uid: nm dx pq genomic and epidemiological monitoring have become an integral part of our response to emerging and ongoing epidemics of viral infectious diseases. advances in high-throughput sequencing, including portable genomic sequencing at reduced costs and turnaround time, are paralleled by continuing developments in methodology to infer evolutionary histories (dynamics/patterns) and to identify factors driving viral spread in space and time. the traditionally static nature of visualizing phylogenetic trees that represent these evolutionary relationships/processes has also evolved, albeit perhaps at a slower rate. advanced visualization tools with increased resolution assist in drawing conclusions from phylogenetic estimates and may even have potential to better inform public health and treatment decisions, but the design (and choice of what analyses are shown) is hindered by the complexity of information embedded within current phylogenetic models and the integration of available meta-data. in this review, we discuss visualization challenges for the interpretation and exploration of reconstructed histories of viral epidemics that arose from increasing volumes of sequence data and the wealth of additional data layers that can be integrated. we focus on solutions that address joint temporal and spatial visualization but also consider what the future may bring in terms of visualization and how this may become of value for the coming era of real-time digital pathogen surveillance, where actionable results and adequate intervention strategies need to be obtained within days. genomic and epidemiological monitoring have become an integral part of our response to emerging and ongoing epidemics of viral infectious diseases. advances in high-throughput sequencing, including portable genomic sequencing at reduced costs and turnaround time, are paralleled by continuing developments in methodology to infer evolutionary histories (dynamics/patterns) and to identify factors driving viral spread in space and time. the traditionally static nature of visualizing phylogenetic trees that represent these evolutionary relationships/processes has also evolved, albeit perhaps at a slower rate. advanced visualization tools with increased resolution assist in drawing conclusions from phylogenetic estimates and may even have potential to better inform public health and treatment decisions, but the design (and choice of what analyses are shown) is hindered by the complexity of information embedded within current phylogenetic models and the integration of available meta-data. 
in this review, we discuss visualization challenges for the interpretation and exploration of reconstructed histories of viral epidemics that arose from increasing volumes of sequence data and the wealth of additional data layers that can be integrated. we focus on solutions that address joint temporal and spatial visualization but also consider what the future may bring in terms of visualization and how this may become of value for the coming era of real-time digital pathogen surveillance, where actionable results and adequate intervention strategies need to be obtained within days. keywords: visualization, phylogenetics, phylogenomics, phylodynamics, infectious disease, epidemiology, evolution despite major advances in drug and vaccine design in recent decades, viral infectious diseases continue to pose serious threats to public health, both as globally well-established epidemics of e.g., human immunodeficiency virus type (hiv- ), dengue virus (denv) or hepatitis c virus (hcv), and as emerging or re-emerging epidemics of e.g., zika virus (zikv), middle east respiratory syndrome coronavirus (mers-cov), measles virus (mv), or ebola virus (ebov). efforts to reconstruct the dynamics of viral epidemics have gained considerable attention as they may support the design of optimal disease control and treatment strategies ( , ) . these analyses are able to provide answers to questions on the diverse processes underlying disease epidemiology, including the (zoonotic) origin and timing of virus outbreaks, drivers of spatial spread, characteristics of transmission clusters and factors contributing to enhanced viral pathogenicity and adaptation ( ) ( ) ( ) . molecular epidemiological techniques have proven to be important and effective in informing public health and therapeutic decisions in the context of viral pathogens ( , ) , given that most of the viruses with a severe global disease burden are characterized by high rates of evolutionary change. these genetic changes are being accumulated in viral genomes on a time scale similar to the one where the dynamics of population genetic and epidemiological processes can be observed, which has lead to the definition of viral phylodynamics as the study of how epidemiological, immunological, and evolutionary processes act and potentially interact to shape viral phylogenies ( ) . as such, phylogenetic trees constitute a crucial instrument in studies of virus evolution and molecular epidemiology, elucidating evolutionary relationships between sampled virus variants based on the temporal resolution in the genetic data of these fast-evolving viruses that allows resolving their epidemiology in terms of months or years. through the integration of population genetics theory, epidemiological data and mathematical modeling, insights into epidemiological, immunological, and evolutionary processes shaping genetic variation can be inferred from these phylogenies. the field of phylodynamics has generated new opportunities to obtain a more detailed understanding of evolutionary histories-through time as well as geographic space-and transmission dynamics of both well-established viral epidemics and emerging outbreaks ( , ) . the ability of molecular epidemiological analyses, and phylodynamic analyses in particular, to fully exploit the information embedded in viral sequence data has significantly improved through a combination of technological innovations and advances in inference frameworks during the past decades. 
from a data perspective, genomic epidemiology is becoming a standard framework driven by high-throughput sequencing technologies that are associated with reduced costs and increasing turnover. moreover, the portability and potential of rapid deployment on-site of these new technologies enable the generation of complete genome data from samples within hours of taking the samples ( ). this rising availability of wholegenome sequences increases the resolution by which historical events and epidemic dynamics can be reconstructed. from a methodological perspective, new developments in statistical and computational methods along with advances in hardware infrastructure have allowed the analysis of ever-growing data sets, the incorporation of more complex models and the inclusion of information related to sample collection, infected host characteristics and clinical or experimental status (generally known as metadata) ( , , , ) . in contrast to a marked increase in the number of software packages targetting increasingly efficient but complex approaches to infer annotated phylogenies by exploiting genomic data and the associated metadata, the intuitive and interactive visualization of their outcomes has not received the same degree of attention, despite being a key aspect in the interpretation and dissemination of the rich information that is inferred. phylogenies are typically visualized in a rather simplistic manner, with the concept of depicting evolutionary relationships using a tree structure already illustrated in charles darwin's notebook ( ) and his seminal book "the origin of species" ( ) . early phylogenetic tree visualization efforts constituted an integral part of phylogenetic inference software packages and as such were restricted to simply showing the inferred phylogenies on a command line or in a simple text file, often even without an accompanying graphical user interface. the longstanding use of phylogenies in molecular epidemiological analyses has however led to the emergence of increasingly feature-rich visualization tools over time. the advent of the new research disciplines such as phylogenomics and phylodynamics necessitated more complex visualizations in order to accommodate projections of pathogen dispersal onto a geographic map, ancestral reconstruction of various types of trait data and appealing animations of the reconstructed evolution and spread over time. tree visualizations resulting from these analyses are also complemented by visual reconstructions of other important aspects of the model reconstructions, such as population size dynamics over time, transmission networks and estimates of ancestral states for traits of interest throughout the tree ( ) . across disciplines, adequate visualizations are pivotal to communicate, disseminate and translate research findings into meaningful information and actionable insights for clinical, research and public health officials. the aim to improve datadriven decision making fits within a broader scope to establish a universal data visualization literacy ( ) . to this end, enhancing collaborations and dissemination of visualizations is increasingly achieved through sharing of online resources for hosting annotated tree reconstructions ( ) , online workspaces ( ) and continuously updated pipelines that accommodate increasing data flow during infectious disease outbreaks ( ) (see further sections for more information and examples of these packages). 
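one of the non-phylogenetic outputs mentioned above, the population size trajectory through time, is commonly drawn as a step ('skyline') plot with an uncertainty band; the sketch below assumes a hypothetical tab-delimited summary table (skyline.tsv with columns time, median, lower and upper), since each inference tool exports such summaries under its own column names.

```python
"""Hedged sketch of a non-phylogenetic phylodynamic visual: an effective
population size ('skyline') trajectory through time. The input table and its
column names are assumptions for illustration only."""
import csv
import matplotlib.pyplot as plt

times, median, lower, upper = [], [], [], []
with open("skyline.tsv", newline="", encoding="utf-8") as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        times.append(float(row["time"]))
        median.append(float(row["median"]))
        lower.append(float(row["lower"]))
        upper.append(float(row["upper"]))

fig, ax = plt.subplots(figsize=(6, 3))
ax.fill_between(times, lower, upper, step="post", alpha=0.3, label="95% interval")  # uncertainty band
ax.step(times, median, where="post", label="median Ne")
ax.set_yscale("log")
ax.set_xlabel("time")
ax.set_ylabel("effective population size")
ax.legend()
fig.tight_layout()
fig.savefig("skyline.png", dpi=200)
```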
given the plethora of options for presenting and visualizing results, and its importance for effectively communicating with a wide audience, choosing the appropriate representation and visualization strategy can be challenging. recent work on this topic focuses on navigating through all the available visualization options by offering clear guidelines on how to turn large datasets into compelling and aesthetically appealing figures ( ). a large array of software packages for performing phylogenetic and phylodynamic analyses have emerged in the last decade, in particularly for fast-evolving rna viruses [see ( ) for a recent overview]. a more recent but similar trend can be seen for methodologies and applications aimed at visualization of the output of these frameworks. in addition to the need to communicate these outputs in a visual manner, an increasing recognition for the added value of adequate visualization for surveillance, prevention, control and treatment of viral infectious diseases has resulted into the merging of data analytics and visualization, with the visualization aspect being increasingly considered as an elementary component within all-round analysis platforms. this review illustrates the evolution in phylogenetically-informed visualization modalities for evolutionary inference and epidemic modeling based on viral sequence data, evolving from an initial purpose to serve basic interpretation of the results to an in-depth translation of complex information into usable data for virologists, researchers and public health officials alike. novel features and innovative approaches often stem from a community need, which can be translated into a specific challenge to be addressed by current and future software applications. throughout this article, we discuss some of the major bottlenecks for interpretation and visualization of phylodynamic results, and subsequently solutions that have addressed or can address these challenges. a closer inspection of how tools for manipulation, visualization and interpretation of evolutionary scenarios have steadily grown over time reveals different trends of interest. first, visualization needs for phylodynamic analyses are very heterogeneous in nature, driven by the intrinsic objective to better understand viral disease epidemiology. due to the increasing complexity and interactivity of the various aspects that make up phylodynamic analyses, the gradual change in visualization tools has resulted in a wide but incomplete range of solutions provided (illustrated by the wikipedia list of phylogenetic tree visualization software ). software applications for phylodynamic analyses have extended into investigations of population dynamics over time, trait evolution and spatiotemporal dispersal, while still using a phylogenetic tree as their core concept. while we will focus predominantly on the concept of a phylogenetic tree as the backbone of phylodynamic visualization, these analyses also produce other types of output that go beyond visualizing phylogenies, especially when it comes to trait data reconstruction. second, the continuing advances in visualization-which try to keep up with increasing complexities in the statistical models employed-not only result in more features being available for end users to exploit, they may also come at an increased cost in terms of usability and responsiveness. 
formats for input and output files have increased in complexity, from simple text files to xml specifications and (geo)json file formats for geographical features. reading, understanding and editing such files may prove to be a challenging task for practitioners. however, most visualization tools do not expose these complexities to their users and offer an intuitive point-and-click interface and/or drag-and-drop functionality for customizing the visualization ( ) . despite such intuitive interactivity, intricate knowledge and a certain amount of programming/scripting experience is often required for those users who want to customize and/or extend their visualization beyond what the application has to offer. third, visualization goals tend to become context-dependent in that not all phylodynamic analyses deal with the same sense of urgency, with established epidemics requiring different prevention and treatment strategies than outbreak detection and surveillance. for example, in established epidemics (e.g., hiv- ) thefocus may be on identifying (important) clusters within a very large phylogeny ( ) , whereas analyses in ongoing outbreaks often determine whether newly generated sequences correspond to strains of the virus known to circulate in a certain region and try to establish spillover from animal reservoirs ( ) . finally, despite the major achievements so far, visualization tools are reaching https://en.wikipedia.org/wiki/list_of_phylogenetic_tree_visualization_software the limits of their capacity to comprehensibly present analysis results of large datasets. promising developments and strategies are becoming available that move visualization beyond the goal of communicating and synthesizing results, and actively play an important role in providing analytics to better understand evolutionary and demographic processes fueling viral dispersal and pathogenicity. phylogenetic tree visualizations have played a central role since the earliest evolutionary and molecular epidemiological analyses of fast-evolving viral pathogens. the first computer programs aimed at constructing phylogenies [e.g., paup * ; ( , ) , and phylip; ( ) ] were only equipped with minimal tree drawing and printing facilities, limited by the available operating systems and programming languages of that time. standalone, phylogenetically-oriented programs [e.g., must; ( ) and later on treeview; ( )] were specifically developed to interact with tree reconstruction output and to ease tree editing and viewing. even as phylogenetic inference became inherently more sophisticated, for example with the development of bayesian phylogenetic inference and the release of initial versions of mrbayes ( ) which contained sophisticated search strategies to ensure finding the optimal set of phylogenetic trees, these software packages still contained their own text-based tree visualization component(s). however, over time a wide range tree visualization software has been released, offering a continuous increase of tree visualization and manipulation functionalities. these packages have been developed as either standalone software packages or have been integrated into larger data management and analysis platforms [e.g., mega ( ) ]. 
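to illustrate how lightweight these early text-based formats and viewers were, the sketch below parses a small newick string and prints an ascii rendering in the spirit of the first command-line tree drawers; the toy tree is invented and biopython's bio.phylo module is assumed to be available.

```python
"""A minimal illustration of the classic text-based tree formats: a Newick
string is parsed and printed as an ASCII tree, much like the earliest
command-line viewers did. The toy tree and branch lengths are made up."""
from io import StringIO
from Bio import Phylo

newick = "((sampleA:0.12,sampleB:0.10):0.05,(sampleC:0.20,sampleD:0.18):0.07);"
tree = Phylo.read(StringIO(newick), "newick")
Phylo.draw_ascii(tree)  # text-only rendering, no graphical toolkit needed
print("number of tips:", tree.count_terminals())
```

the same few lines of plain text encode the full topology and branch lengths, which is precisely why newick has remained the lowest common denominator exchanged between packages.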
the numerous all-round programs available to date offer a range of similar basic tree editing capabilities including the coloring and formatting of tree nodes, edges and labels, the addition of numerical or textual annotations, searching for specific taxa as well as the re-rooting, rotation and collapsing of clades. different tree formats can be imported and again exported to various textual and graphical formats (e.g., vector-based formats: portable document format (pdf), encapsulated postscript (eps), scalable vector graphics (svg), . . . ). a limited set of applications provide more advanced visualization functionalities that enable interactive visualization and management of highly customized and annotated phylogenetic trees. nevertheless, major hurdles still exist that hinder adequate communication and interpretation of phylodynamic analyses. these hurdles mainly relate to the scalability of the visualization, highlighting uncertainty associated with the results and the interactive rendering of available metadata. recent innovative developments attempt to tackle these bottlenecks, although some tools are specifically directed toward addressing a single (visualization) challenge. we here provide an overview of such challenges, along with examples of figures generated by software packages that aim to tackle these challenges. note that all of our visualization examples are shown in the evolving visualization examples section below. first, a major challenge is the ever-increasing size of data sets being analyzed, leading to difficulties with navigating through the resulting phylogenetic trees and to problems with interpreting the inferred dynamics, not only from a computational perspective (e.g., to render large images in a timely manner) but also from the human capability to deal with high levels of detail. software packages that mainly aim to visualize phylogenetic trees as well as those that target more broad analyses have implemented various solutions to accommodate systematic exploration of large phylogenies. dendroscope ( ) was one of the first visualization tools aimed at large phylogenies, with its own format to save and reopen trees that had been edited graphically, offering a magnifier functionality to focus on specific parts of the (large) phylogeny. follow-up versions ( ) focused on rooted phylogenetic trees and networks, and offered parallel implementations of demanding algorithms for computing consensus trees and consensus networks to increase responsiveness. phylo.io ( ) improves the legibility of large trees by automatically collapsing nodes so that an overview of the tree remains visible at any given time. itol [( ), but see below] and icytree ( ) also provide intuitive panning and zooming utilities that make exploring large phylogenetic trees of many thousands of taxa feasible. the phylogeotool [( ); also see figure ] eases navigation of large trees by performing an a priori iterative clustering of subtrees according to a predefined diversity ratio, as well as pre-rendering the visualization of those subtrees enabling fluent navigation. pastml ( ) allows visualizing the tree annotated with reconstructed ancestral states as a zoomable html map based on the cytoscape framework ( ) . pastview ( ) offers synthetic views such as transition maps, integrates comparative analysis methods to highlight agreements or discrepancies between methods of ancestral annotations inference, and is also available as a webserver instance. 
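these shared editing operations can also be scripted rather than performed by hand; the following hedged sketch uses the ete toolkit (imported as ete3 in python, and discussed further below) to re-root a toy tree, highlight a clade and collapse small subtrees into single tips, mimicking in spirit the automatic collapsing that phylo.io and similar tools apply to large phylogenies. the newick string is invented, and graphical rendering is left commented out because it requires ete's optional qt backend.

```python
"""Hedged sketch of scripted tree editing with ete3 (not taken from any of the
cited packages): re-rooting, clade highlighting and collapsing small subtrees."""
from ete3 import Tree, NodeStyle

newick = "((A:0.1,B:0.2)90:0.05,((C:0.3,D:0.1)60:0.02,(E:0.2,F:0.2)99:0.04)80:0.03);"
t = Tree(newick)  # internal labels are parsed into node.support

# re-root the tree on a chosen outgroup taxon
t.set_outgroup(t.search_nodes(name="A")[0])

# highlight a clade of interest (visible when rendering to an image file)
clade = t.get_common_ancestor(["C", "F"])
style = NodeStyle()
style["bgcolor"] = "LightSteelBlue"
clade.set_style(style)

# collapse small subtrees into single labelled tips to keep large trees legible
MIN_LEAVES = 2
to_collapse = [n for n in t.traverse() if not n.is_leaf() and not n.is_root()
               and len(n.get_leaves()) <= MIN_LEAVES]
for node in to_collapse:
    node.name = "collapsed_%d_taxa" % len(node.get_leaves())
    for child in list(node.children):
        node.remove_child(child)

print(t.get_ascii(show_internal=True))
# t.render("tree.svg")  # requires ete3's optional Qt-based rendering backend
```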
grapetree ( ) initially collapses branches if there are more than , nodes in the tree and then uses a static layout that splits the tree layout task into a series of sequential node layout tasks. with the development of many packages targetting the visualization of large phylogenies in recent years, the question arises whether they will continue to be maintained and extended with novel features. a second challenge lies with the fact that phylogenies represent hypotheses that encompass different sources of error, and the extent of uncertainty at different levels should be presented accordingly. bootstrapping ( ) and other procedures are often used to investigate the robustness of clustering in estimated tree topologies,. numerical values that express the support of a cluster are generally shown on the internal nodes of a single consensus summary tree [e.g., figtree; ( ) ] or by a customized symbol [e.g., itol; ( ) ]. although conceptually different, posterior probabilities on a maximum clade credibility (mcc) tree, majority consensus tree or other condensed trees from the posterior set of trees resulting from bayesian phylogenetic inference can be shown in a similar manner. an informative and qualitative approach to represent the complete distribution of rooted tree topologies is provided by densitree [( ); also see figure ], which draws all trees in a set simultaneously and transparently, and the different output visualizations highlight various aspects of tree uncertainty. for time-scaled phylogenetic trees, uncertainty in divergence time estimates of ancestral nodes (e.g., % highest posterior density (hpd) intervals) is usually displayed with a horizontal (node) bar (see figure for an example). additionally, ancestral reconstructions of discrete or continuous trait states at the inner nodes of a tree are increasingly facilitated by various probabilistic frameworks, and these inferences are also accompanied by posterior distributions describing uncertainty. to visualize this uncertainty, pastml ( ) inserts pie charts at inner nodes to show likely states when reconstructing discrete traits such as the evolutionary history of drug resistance mutations, while spread ( ) is able to depict uncertainty of continuous traits, e.g., as polygon contours for (geographical) states at the inner nodes [see ( ) for an example]. much like the visualization packages that focus on large phylogenies (see above), the applications listed here have their own specific focus with sometimes limited overlap in functionality. a third challenge consists of the visual integration of metadata with phylogenetic trees-often in the form of either a discrete and/or continuous trait associated with each sequence-which is in part related to the previous challenge concerning uncertainty of trait reconstructions. incorporating virus trait information (e.g., drug resistance mutations, treatment activity scores) or host characteristics (e.g., gender, age, risk group) in phylogenetic inference can substantially facilitate the interpretation for end users and accelerate the identification of potential transmission patterns. tree reconstruction and visualization software generally share a set of basic operations for coloring taxa, branches or clades according to partial or exact label matches. while these annotations can be performed manually using a graphical user interface, this can be timeconsuming and is prone to errors. 
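a short script removes much of that manual, error-prone annotation work; the sketch below maps a hypothetical tab-delimited metadata file (metadata.tsv with columns taxon and host) to one colour per trait value and writes a plain annotation table. the exact header lines expected by a particular viewer, for instance an itol dataset file or figtree's tab-delimited annotation import, should be taken from that tool's documentation rather than from this sketch.

```python
"""Sketch of automating per-taxon annotation instead of manual point-and-click
colouring. File name, column names and the colour palette are assumptions."""
import csv
from itertools import cycle

PALETTE = cycle(["#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e"])

def build_annotation(metadata_path: str, out_path: str, trait: str = "host") -> None:
    colour_for = {}
    rows = []
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle, delimiter="\t"):
            value = record[trait]
            if value not in colour_for:  # assign one colour per trait value
                colour_for[value] = next(PALETTE)
            rows.append((record["taxon"], value, colour_for[value]))
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("taxon\t%s\tcolour\n" % trait)
        for taxon, value, colour in rows:
            out.write("%s\t%s\t%s\n" % (taxon, value, colour))

if __name__ == "__main__":
    build_annotation("metadata.tsv", "annotation.tsv")
```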
hence, several software programs offer functionalities to automate the selection and annotation of clades of interest, for example through the use of javascript libraries [e.g., phyd ; ( ), spread ; ( ) ]also see figure -or python toolkits [e.g., ete; ( ), baltic; ( ) ]. alternatively, drag-and-drop functionality of plain text annotation files generated with user-friendly text editors facilitate this process, as is for example the case in itol ( ) . these scripting visualization frameworks also foster more intense tree editing through their functionalities to annotate inner nodes, clades and individual leaves with charts (pie, line, bar, heatmap, boxplot), popup information, images, colored strips and even multiple sequence alignments. even more advanced integration efforts entail the superimposition of tree topology with layers of information on geographical maps, such as terrain elevation, type of landcover and human population density [e.g., r package seraphim; ( , ) ]. finally, visualization and accompanying interpretation are a critical component of infectious disease epidemiological and evolutionary analyses. indeed, many researchers use visualization software during analyses for data exploration, identifying inconsistencies, and refining their data set to ensure well-supported conclusions regarding an ongoing outbreak. as such, the visualizations themselves are gradually refined and improved over the course of a research project, with the final figures accompanying a publication often being post-processed versions of the default output of a visualization package or customly designed to attract a wide audience, both through the journal's website and especially social media [see e.g., ( ) ]. on the other hand, the advent of one-stop platforms [microreact; ( ) and nextstrain; ( , ) , also see figure ] that seamlessly connect the different steps of increasingly complex analyses and visualization of genomic epidemiology and phylodynamics allows automating this process. applications that are exclusively tailored toward tree manipulation and viewing are starting to offer management services and registration of user accounts [itol; ( ) ], while command-line tools (gotree; https://github.com/evolbioinfo/gotree) aimed at manipulating phylogenetic trees and inference methods (pastml; ( ) increasingly enable exporting trees that can directly be uploaded to itol, supporting the automation of scripting and analysis pipelines. in the previous sections, we have already covered a wide range of software packages for visualizing phylogenetic trees as well as their associated metadata, which may or may not be used in a joint estimation of sequence and trait data [for an overview of integrating these data types in various inference frameworks for pathogen phylodynamics, we refer to ( ) ]. we here organize our visualization examples into different broader categories: different approaches toward visualizing associated trait data with a focus on phylogeography (figures - ) , browser-based online applications (figures , ) , applications that use existing libraries such as those available in r, python and javascript for example (figures , ) , non-phylogenetic visualizations typically associated with pathogen phylodynamics (figure ) , and finally custom-written code or applications that focus on assessing phylogenetic uncertainty (figures , ) . 
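as a small illustration of the last of these categories, custom code for conveying topological uncertainty often collapses weakly supported internal nodes into polytomies before a tree is drawn; the sketch below does so with ete3, assuming a newick tree whose internal labels are support values (the toy tree and the threshold are invented for illustration).

```python
"""Collapse poorly supported internal nodes into polytomies, a simple custom
way of conveying topological uncertainty. Toy tree and threshold are invented."""
from ete3 import Tree

newick = "((A:0.1,B:0.2)35:0.05,((C:0.3,D:0.1)98:0.02,E:0.25)72:0.03);"
THRESHOLD = 70  # e.g. 70% bootstrap support; rescale for posterior probabilities

t = Tree(newick)
weak = [n for n in t.traverse() if not n.is_leaf() and not n.is_root()
        and n.support < THRESHOLD]
for node in weak:
    node.delete(preserve_branch_length=True)  # children re-attach to the parent, creating a polytomy

print("collapsed %d weakly supported nodes" % len(weak))
print(t.get_ascii(show_internal=True))
```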
as a first example, we illustrate the development of innovative visualization software packages on the output of a bayesian phylodynamic analysis of a rabies virus (rabv) data set consisting of time-stamped genetic data along with two discrete trait characteristics per sequence, i.e., the sampling location (in this case, the state within the united states from which the sample originated) and the bat host type. this rabv data set comprises nucleoprotein gene sequences from north american bat populations, with a total of bat species sampled between and across states in the united states ( ). we used beast . ( ) in combination with beagle ( ) to estimate the time-scaled phylogenetic tree relating the sequences, along with inferring the ancestral locations of the virus using a bayesian discrete phylogeographic approach ( ) and, at the same time, inferring the history of host jumping using the same model approach. upon completion of the analysis, we constructed a maximum clade credibility (mcc) tree from the posterior tree distribution using treeannotator, a software tool that is part of the beast distribution. this mcc tree contains the age estimates of all internal nodes, along with discrete probability distributions for the inferred location and host traits at those nodes. figure shows the visualization of the mcc tree in figtree, with internal nodes annotated according to the posterior ancestral location state probabilities within the mcc tree file. as expected, one can observe that posterior support for the preferred ancestral location decreases from the observed tips toward the root; in other words, the further we go back in time, the more uncertain the inferred location states become. all of the information required to make the figtree visualization in figure is contained within a nexus file that holds all of the ancestral trait annotations, which we use as the (only) input for figtree ( ). the standard newick file format itself does not contain such trait annotations but remains in popular use for storing phylogenetic trees and is hence supported by most (if not all) phylogenetic visualization packages. in general, however, newick and other older formats (e.g., nexus) offer limited expressiveness for storing and visualizing annotated phylogenetic trees and associated data, which has led to extensions of these formats being proposed [e.g., the extended newick format; ( )]. figtree allows users to upload annotation information for the sequences in the analyzed alignment in the form of a simple tab-delimited text file, and a parsimony approach can be used to infer the most parsimonious state reconstruction for the internal nodes from those provided for the tips. itol ( ) is another application that can take an mcc tree as its input file and allows annotating branches and nodes of the phylogenetic tree using descriptions provided through the use of simple text files in which custom visualization options can easily be declared (figure ). itol is even suited for showing very large trees (with more than , leaves), with webkit-based browsers (such as chromium/google chrome, opera and safari) offering the best performance. newer input/output file formats for phylogenetic trees and their accompanying annotations, including the xml-based standards phyloxml ( ) and nexml ( ), have the benefit of being more robust for complex analyses and easier to process and extend.
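the parsimony reconstruction of internal-node states from tip annotations mentioned above can be illustrated with a small fitch-style pass over a tree. the sketch below is a simplified illustration of the idea rather than figtree's own implementation; the tree file and the tip-to-state mapping are assumptions, and every tip in the tree is assumed to appear in that mapping.

```python
# Sketch of a Fitch-style parsimony pass assigning candidate ancestral states
# to internal nodes from states observed at the tips. The tree file and the
# tip-to-state mapping are illustrative assumptions; this is a simplified
# illustration of the idea, not FigTree's own implementation.
from Bio import Phylo

tree = Phylo.read("outbreak.nwk", "newick")
tip_state = {"sampleA": "Arizona", "sampleB": "Arizona", "sampleC": "Texas"}

def fitch(clade):
    # Bottom-up Fitch pass: tips get their observed state; internal nodes get
    # the intersection of their children's sets, or the union if it is empty.
    if clade.is_terminal():
        clade.state_set = frozenset([tip_state[clade.name]])
    else:
        child_sets = [fitch(child) for child in clade.clades]
        common = frozenset.intersection(*child_sets)
        clade.state_set = common if common else frozenset.union(*child_sets)
    return clade.state_set

fitch(tree.root)
for clade in tree.get_nonterminals():
    print(sorted(clade.state_set))
```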
in particular, applications of phylodynamics aimed at the reconstruction and interpretation of spatio-temporal histories have become increasingly and broadly applied in viral disease investigations. the incorporation of geographical and phylogenetic uncertainty into molecular epidemiology dynamics is now well established ( , ), and dedicated developments from a visualization perspective have quickly followed to accommodate the outcomes of these models. early attempts include the mapping of geo-referenced phylogenetic taxa to their geographical coordinates [e.g., gengis; ( ), cartographer; ( )], while more recent efforts of joint ancestral reconstruction of geographical and evolutionary histories enable visual summaries of spatial-temporal diffusion through interactive cartographic projection using gis- and kml-based virtual globe software ( ). the latest developments generalize toward interactive web-based visualization of any phylogenetic trait history and are based on data-driven documents (d ) javascript libraries and the json format to store geographic and other tree-related information ( ). as an example, we have created a web-based visualization of our analyzed rabv data set by loading the obtained mcc tree into the spread ( ) phylodynamic visualization software package (see figure ). [figure caption: figtree allows visualizing various tree formats, including maximum clade credibility trees from bayesian phylogenetic analyses ( ). external and internal nodes can easily be annotated using the information in the source tree file, and the time information within the tree allows adding a time axis, which facilitates interpretation. annotations shown here for the rabv data set are the % highest posterior density (hpd) age intervals and the most probable ancestral location state at each internal node, with the circle width corresponding to the posterior support for the internal location state reconstruction.] spread actually consists of a parsing and a rendering module, with the former obtaining the relevant information out of the mcc tree and the latter converting this information into a (geo)json file format, potentially in combination with a geographic map, which can easily be downloaded from websites offering geojson files of different regions of the world and with different levels of detail. the generated output consists of an in-browser animation that allows tracking a reconstructed epidemic over time using a simple slider bar, with the possibility to zoom into specific areas of the map. in figure , we show the reconstructed spread of rabv across the united states at four different time points throughout the epidemic, starting with the estimated location of origin in the state of arizona and tracking the rabv spread as it disperses to all of the states in our data set. the spread visualization in figure is an example of an increasing trend toward web-based software tools that can run in any modern browser, making them compatible with all major operating systems, without requiring any additional software packages to be installed by the user. a distinction can be made between browser-based tools that are able to work without internet access [phylocanvas; (http://phylocanvas.org), phylotree.js; ( ), icytree; ( ), spread ; ( ), phylogeotool; ( ), see figure ] and those that are only accessible online [itol; ( ), phylo.io; ( )].
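the parse-then-render split described for spread can be mimicked in a few lines of python: discrete ancestral locations plus per-location centroid coordinates become a geojson featurecollection of branch transitions that any web map library can draw. the location names, coordinates and transition list below are illustrative assumptions; in practice they would be extracted from the annotated mcc tree.

```python
# Sketch: convert branch transitions between discrete locations into a GeoJSON
# FeatureCollection of LineStrings, roughly mirroring the parse-then-render
# split described for SpreaD3. Location names, centroid coordinates and the
# transition list are illustrative assumptions.
import json

centroids = {                      # lon, lat centroids (illustrative values)
    "Arizona": (-111.7, 34.3),
    "Texas": (-99.3, 31.4),
    "California": (-119.6, 37.2),
}
transitions = [                    # (parent location, child location, decimal year)
    ("Arizona", "Texas", 1986.4),
    ("Arizona", "California", 1992.1),
]

features = []
for origin, destination, year in transitions:
    features.append({
        "type": "Feature",
        "geometry": {
            "type": "LineString",
            "coordinates": [list(centroids[origin]), list(centroids[destination])],
        },
        "properties": {"origin": origin, "destination": destination, "time": year},
    })

with open("dispersal.geojson", "w") as out:
    json.dump({"type": "FeatureCollection", "features": features}, out, indent=2)
```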
[figure caption: interactive tree of life [itol; ( )] visualization of the mcc tree for rabv. rather than exploiting the annotations within an mcc tree, itol allows importing external text files with annotations through an easy drag-and-drop interface. we have here colored the tip nodes according to the bat host species (outer circle) as well as the sampling location (inner circle) corresponding to each sample. many visual aspects can be set this way and an extensive online help page is available.] web-based visualization platforms enhance collaborations and output dissemination in a very efficient and simple manner through their ability to share web links of complex and pre-annotated tree visualizations. transferring genomic and associated data to an online service may raise privacy issues, which is not the case for tools that execute data processing purely on the client side. by contrast, online-accessible visualization tools such as itol ( ) offer tree management possibilities to organize and save different projects, annotated datasets and trees for their users. these online packages typically also provide export functionalities to facilitate the production of publication-quality and high-resolution illustrations [see also mrent; ( ), mesquite; ( )], directed toward end users with minimal programming experience. spread also illustrates the growing movement toward animated visualizations over time and (geographic) space and as such focuses entirely on the visualization aspect of pathogen phylodynamics. recently, entire pipelines have emerged that include data curation, analysis and visualization components, with nextstrain as the most popular example ( ). on the data side, python scripts maintain a database of available sequences and related metadata, sourced from public repositories as well as github repositories and other (more custom-made) sources of genomic data. fast heuristic tools enable phylodynamic analysis, including subsampling, alignment, phylogenetic inference, dating of ancestral nodes and discrete trait geographic reconstruction, capturing the most likely transmission events. the accompanying nextstrain website (https://nextstrain.org/) provides a continually updated view of publicly available data alongside visualization for a number of pathogens such as west nile virus, ebola virus and zika virus. for the latter virus, we provide the currently available data visualization in nextstrain (at time of submission) in figure , showing a color-coded time-scaled maximum-likelihood tree alongside an animation of zika geographic transmissions over time as well as the genetic diversity across the genome. [figure caption: projecting an mcc tree onto a geographic map using spread ( ). in a discrete phylogeography setting, as is the case here, the ancestral location states are combined with coordinates corresponding to the states in the us from which the rabv samples were obtained. we use centroid coordinates for the us states to enable this visualization. spread animates the reconstructed virus dispersal over time, and we here show four snapshots (starting from the estimated origin of the epidemic at the root node) that capture the reconstructed dispersal over time and geographic space, i.e., in , , , and the "present" (mid ). the transitions between the us states are colored according to the us state of destination for that particular transition, whereas the size of the circles around a location is proportional to the number of lineages that maintain that location.]
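working with the output of such pipelines is also largely a matter of traversing structured files. the sketch below tallies tips per discrete trait value from a nextstrain (auspice) dataset json, assuming only the broad shape of the v2 format (a nested tree object whose nodes carry children and node_attrs); the file name, the trait key and the exact field names are assumptions that should be checked against the current schema.

```python
# Sketch: tally tips per discrete trait value from a Nextstrain (Auspice)
# dataset JSON. The broad shape of the v2 format is assumed (a nested "tree"
# object whose nodes carry "children" and "node_attrs"); exact field names
# should be checked against the current schema, and the file name and trait
# key ("country") are illustrative assumptions.
import json
from collections import Counter

with open("zika.json") as handle:
    dataset = json.load(handle)

counts = Counter()

def walk(node):
    children = node.get("children", [])
    if not children:  # a tip
        trait = node.get("node_attrs", {}).get("country", {}).get("value")
        counts[trait] += 1
    for child in children:
        walk(child)

walk(dataset["tree"])
print(counts.most_common())
```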
analysis of such outbreaks relies on public sharing of data, and nextstrain has taken the lead in addressing data-sharing concerns by preventing access to the raw genome sequences and by clearly indicating the source of each sequence, while allowing derived data, such as the inferred phylogenetic trees, to be made publicly available. we note that these animated visualizations by their very nature do not easily yield publication-ready figures, requiring alternative approaches to be devised. animations resulting from software packages such as spread, spread and nextstrain can be hosted on the authors' website, or they can be captured into a video file format and uploaded as supplementary material onto the journal website. alternatively, screenshots of the animation can be taken at relevant time points throughout the visualization and subsequently post-processed for inclusion in the main or supplementary publication text. finally, browser-based packages such as spread employ javascript libraries (e.g., d ) to produce dynamic, interactive data visualizations in web browsers, libraries known specifically for allowing great control over the final visualization. custom programs are also typically written in r, as a long list of popular r libraries is readily available, with ggplot quickly rising in popularity and finding use in both the r and python languages. a system for declaratively creating graphics based on the grammar of graphics ( ), ggplot was built for making professional-looking figures with minimal programming effort. figure shows an example of ggtree, which extends ggplot and is designed for visualization and annotation of phylogenetic trees with their covariates and other associated data ( ). a recent software package that is implemented in javascript and python is pastml ( ), which uses the cytoscape.js library ( ) for visualizing phylogenetic trees (figure ). given that these types of libraries contain many tried-and-tested functions that save substantial time when creating novel software packages, future visualization efforts are likely to see increased usage of readily available visualization libraries in programming languages such as r, python and javascript. phylogenies reconstructed from viral sequence data and their corresponding annotated tree-like drawings and animations lie at the heart of many evolutionary and epidemiological studies that involve phylogenomics and phylodynamics applications. additional graphical output can be generated using visualization packages that focus on aspects other than the estimated phylogeny but that are nevertheless in some manner dependent on it. coalescent-based phylodynamic models that connect population genetics theory to genomic data can infer the demographic history of viral populations ( ), and plots of the effective population sizes over time, such as the one shown in figure for our rabv data set, which uses the skygrid model ( ) and its accompanying visualization in tracer ( ), are commonly used to visualize the inferred past population size dynamics ( , , ). [figure caption: the phylogeotool offers a visual approach to explore large phylogenetic trees and to depict characteristics of strains and clades, including for example the geographic context and distribution of sampling dates, in an interactive way ( ). a progressive zooming approach is used to ensure an efficient and interactive visual navigation of the entire phylogenetic tree.]
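such population-size trajectories are, at their core, simple line plots of estimated (log) effective population size against time. the matplotlib sketch below assumes a tab-delimited summary table with time, median and hpd bounds, which is the kind of summary that can be exported from a skyline or skygrid analysis; file and column names are illustrative assumptions.

```python
# Sketch: a skyline-style plot of effective population size through time,
# assuming a tab-delimited summary table with columns "time", "median",
# "lower" and "upper". File and column names are illustrative assumptions.
import csv
import matplotlib.pyplot as plt

times, median, lower, upper = [], [], [], []
with open("skygrid_summary.tsv") as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        times.append(float(row["time"]))
        median.append(float(row["median"]))
        lower.append(float(row["lower"]))
        upper.append(float(row["upper"]))

fig, ax = plt.subplots()
ax.fill_between(times, lower, upper, alpha=0.3, label="95% HPD")
ax.plot(times, median, label="median")
ax.set_yscale("log")
ax.set_xlabel("time")
ax.set_ylabel("effective population size")
ax.legend()
fig.savefig("skyline.png", dpi=300)
```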
a variety of other summary statistics computed over the course of a phylogeny also benefit from visual representations, such as the basic reproduction number and its rate of change through time ( ). closely related are lineage-through-time plots ( ), which allow graphical exploration of the demographic signal in virus sequence data and reveal temporal changes in epidemic spread. neher et al. ( ) plotted cumulative antigenic changes over time by integrating viral phenotypic information into phylogenetic trees of influenza viruses, thereby providing additional insights into the rate of antigenic evolution compared to representations of neutralization titers that are traditionally transformed into a lower-dimensional space ( , ). another example relates to reconstructions of phylogeographic diffusion in discrete space, where patterns of migration links are typically projected into a cartographic context, but quantitative measures are additionally computed, including the expected number of effective location state transitions (known as "markov jumps"). information on migrations in and out of a location state can be obtained by visualizations of the number of actual jumps between locations as well as the waiting times for each location, either as a total or proportionally over time ( ) ( ) ( ) ( ). the inference of transmission trees and networks ("who infected whom and when") using temporal, epidemiological and genetic information is an application of phylodynamics that has made substantial methodological progress in the last decade ( ) ( ) ( ). unlike phylogenetic trees, which represent evolutionary relationships between sampled viruses, transmission trees describe transmission events between hosts and require visualizations that are tailored to the analysis objectives ( ) ( ) ( ). consensus transmission trees, such as maximum parent credibility (mpc) trees ( ) or edmonds' trees ( ), visually alert the user to putative infectors (indicated with arrows), corresponding infection times and potential superspreaders. the authors of ( ) use the cytoscape framework ( ) for drawing transmission trees, and a similar adaptation of this originally biology-oriented network framework has been made by pastml ( ) (see above). finally, in order to compare two or more trees that are estimated from the same set of virus samples but differ in the method used for tree construction or in the genomic region considered, tanglegrams provide insightful visualizations. the most popular use case is the comparison of two trees displayed leaf by leaf, with differences in clustering highlighted by lines connecting shared tips ( ). alternatively, tanglegrams allow mapping tree tips to their geographical locations on a map using gengis ( , ). the python toolkit baltic (https://github.com/evogytis/baltic) provides functionalities to draw tangled chains, as shown in figure , which are advanced sequential tanglegrams to compare a series of trees ( , ). phylogenetic networks, which are a generalization of phylogenetic trees, can also visualize phylogenetic incongruences, which may be due to reticulate evolutionary phenomena such as recombination (e.g., hiv- ) and hybridization (e.g., influenza virus) events ( , , ). tanglegrams and related visualizations of sets of trees [e.g., densitree ( ); see figure ] provide a qualitative and illustrative comparison of trees, but this may prove to be less suited for the interpretation of extremely large trees or sets of trees.
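the lineage-through-time plots mentioned above can be computed directly from a time-scaled tree by counting the branches that span each time point. the python sketch below does this with biopython and matplotlib, assuming branch lengths in units of time and an illustrative file name.

```python
# Sketch: a lineages-through-time (LTT) curve computed from a time-scaled
# tree. The number of lineages at time t equals the number of branches
# spanning t. The tree file name is an illustrative assumption, and branch
# lengths are assumed to be in units of time.
from Bio import Phylo
import matplotlib.pyplot as plt

tree = Phylo.read("timetree.nwk", "newick")
depths = tree.depths()  # distance from the root to every clade

# Each non-root branch spans (parent_depth, child_depth]; count how many
# branches cross a grid of time points between the root and the latest tip.
edges = []
for clade in tree.find_clades():
    for child in clade.clades:
        edges.append((depths[clade], depths[child]))

latest = max(depths.values())
grid = [latest * i / 500 for i in range(501)]
lineages = [sum(1 for start, end in edges if start < t <= end) for t in grid]

plt.step(grid, lineages, where="post")
plt.xlabel("time since root")
plt.ylabel("number of lineages")
plt.savefig("ltt.png", dpi=300)
```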
recent quantitative approaches allow the exploration and visualization of the relationships between trees in a multidimensional space of tree similarities, based on different tree-to-tree distance metrics that identify a reduced tree space that maximally describes distinct patterns of observed evolution [mesquite; ( ), r package treespace; ( , )]. we have discussed a wide range of visualization packages for phylogenetic and phylodynamic analyses that help improve our understanding of viral epidemiological and population dynamics. while these efforts may ultimately assist in informing public health or treatment decisions, visualization needs can differ according to the type of virus epidemic studied and the questions that need to be answered. for example, the required level of visualization detail is high for (re-)emerging viral outbreaks, when actionable insights should be obtained in a timely fashion in order to control further viral transmissions, with real-time tracking of viral spread and the identification of sources, transmission patterns and contributing factors being key priorities ( ). [figure caption: r package ggtree ( ) visualization of a phylogenetic tree constructed from publicly available zika virus (zikv) genomes. ggtree allows advanced customized visualization of phylogenetic trees similar to, e.g., itol, but by means of the traditional r scripting language. in this figure, tree leaves are colored according to continent of sampling, with size corresponding to the host status and shape indicating the completeness of the cds, using a cutoff of % of nucleotide positions being informative. a heatmap was added to denote the presence of amino acid mutations at three chosen genome positions. finally, a particular clade was highlighted in blue based on a given internal node, and two additional links between chosen taxa were added.] [figure caption: the top-down visualization corresponds to an iterative clustering starting from the root of the tree at the top, with the size of a dot corresponding to the number of taxa in a clade that share the same ancestral state, which is indicated on top of the dot. in this type of visualization, a compressed representation of the ancestral scenarios is shown that highlights the main facts and hides minor details by performing both a vertical and a horizontal merge [but see ( )]. the branch width corresponds to the number of times its subtree is found in the initial tree, and the circle size at a tip is proportional to the size of the compressed (or merged) cluster.] as a result, software packages that aim to address these questions are typically developed with an explicit emphasis on speed through the use of heuristics, and stress the importance of connectivity and interactivity to quickly respond to the availability of new data in order to develop novel insights into an ongoing epidemic. one-stop and fully integrated analysis platforms such as microreact ( ) and nextstrain ( ) cater to these needs by providing the necessary visualizations of virus epidemiology and evolution across time and space, and by implementing support for collaborative analyses and sharing of genomic data and analysis outputs. a strategy of interest in these settings is the ability to perform phylogenetic placement of novel sequence data ( , ), for example when updated outbreak information suggests specific cases should be investigated but the reconstruction of a new phylogeny is not desirable, as this may prove too time-consuming.
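the placement strategy mentioned above can be illustrated with a deliberately naive heuristic: attach a new sequence next to its closest relative in an existing backbone tree rather than re-estimating the whole phylogeny. real placement tools evaluate likelihoods over candidate branches; the sketch below, with assumed file names and toy aligned sequences, simply picks the nearest tip by snp distance and grafts the query next to it.

```python
# Toy sketch of the idea behind phylogenetic placement: attach a new sequence
# next to its closest relative in an existing (backbone) tree instead of
# re-estimating the whole phylogeny. Real placement tools evaluate likelihoods
# over candidate branches; here we simply pick the nearest tip by SNP distance.
# Tree file name, tip sequences and the query are illustrative assumptions,
# and all sequences are assumed to be aligned to the same length.
from Bio import Phylo

backbone = Phylo.read("backbone.nwk", "newick")
tip_seqs = {"tipA": "ACGTACGT", "tipB": "ACGTTCGT", "tipC": "TTGTACGA"}
query_name, query_seq = "query1", "ACGTTCGA"

def snp_distance(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

# Find the closest existing tip to the query sequence.
closest = min(tip_seqs, key=lambda tip: snp_distance(query_seq, tip_seqs[tip]))
print(f"{query_name} placed next to {closest}")

# Graft the query as a sister taxon of the closest tip (zero-length stub).
target = next(backbone.find_clades(name=closest))
target.split(n=2, branch_length=0.0)   # turn the tip into a cherry
target.clades[0].name = closest
target.clades[1].name = query_name
target.name = None
Phylo.write(backbone, "backbone_with_query.nwk", "newick")
```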
to avoid such de novo re-analyses of data sets, software tools such as itol ( ) and phylogeotool ( ) offer functionalities to visualize placements of sequence data onto an existing phylogeny. a key future challenge of these approaches is to assess and visualize the associated phylogenetic placement uncertainty or, if this information is unavailable, to at least indicate the various stages at which novel sequences were added onto the (backbone) phylogeny. while methodological developments are rigorous in their accuracy assessment (for example, through simulation studies) and may even provide visual options for representing the placement uncertainty [see e.g., ( )], visualization packages themselves do not offer an automated way of assessing or conveying this information and as such may project overconfidence in the power of the phylogenetic placement method used. additionally, other flexible visualization options based on real-time outbreak monitoring can be of great interest, such as highlighting locations from which cases have been reported but for which genomic data are still lacking, to clarify the potential impact of these missing data on the currently available inference results. [figure caption fragment: ( ). this type of output does not directly depend upon the estimated mcc tree, but rather on the estimated (log) population sizes of the skygrid model ( ), which are provided in a separate output file by beast ( ).] investigations of more established epidemics usually involve much larger sample sizes, are more retrospective in design and incorporate more heterogeneous information, and therefore benefit from more extended visualization frameworks. for most of these globally prevalent pathogens, clinical and phenotypic information is often available and questions relate to the population- or patient-level dynamics of viral adaptation and the identification of transmission clusters. for example, the selection of the virus strain composition of the seasonal influenza vaccine is informed by analyses and visualizations of circulating strains and their antigenic properties using the nextflu framework ( , ). other diverse examples include investigations of the impact of country-specific public health interventions on transmission dynamics ( , ), the identification of distinctive socio-demographic, clinical and epidemiological features associated with regional and global epidemics ( ) ( ) ( ) ( ) and large-scale modeling of epidemiological links among geographical locations ( ) ( ) ( ). in these settings, relevant software packages should consider scalability to large phylogenies and allow user-friendly exploration of heterogeneous and customized annotations. overall, it is anticipated that future work on visualization tools, accompanying the analysis and visualization software developments described above, will result in a merging of these two epidemic perspectives, with the development of context-independent visualization software tools that can handle both scenarios equally well. viral pathogens, in particular rna viruses, have been responsible for epidemics and recurrent outbreaks associated with high morbidity and mortality in the human population, for durations that can span from hundreds of years [e.g., hcv ( ) and denv ( )] to decades [e.g., hiv- ( )].
rna viruses are known for their potential to quickly adapt to host and treatment selective pressures, but their rapid accumulation of genomic changes also provides opportunities to study their population and transmission dynamics in high resolution. consequently, the fields of phylogenomics and phylodynamics play a pivotal role in studies on the epidemiology and transmission of viral infectious diseases, and have advanced our understanding of the dynamic processes that govern virus dispersal and evolution at both the population and host levels. compared to the tremendous achievements in the performance of evolutionary and statistical inference models and hypothesis testing frameworks, software packages and resources aimed at visualizing the output of these studies have struggled to handle the increasing complexity and size of the analyses, for example in displaying levels of uncertainty and in integrating associated demographic and clinical information. accurate and meaningful visual representation and communication are, however, essential for the interpretation and translation of outcomes into actionable insights for the design of optimal prevention, control and treatment interventions. with a plethora of applications for phylodynamics having been introduced in the last decades, in particular tailored toward reconstructions of spatiotemporal histories (which are starting to become useful in public health surveillance), visualization has grown substantially as an elementary discipline for investigations of infectious disease epidemiology and evolution. an extensive array of software and tools for the manipulation, editing and annotation of output visualizations in the field of pathogen phylodynamics is available to date, characterized by varying technical specifications and functionalities that respond to heterogeneous needs from the research and public health communities. the increasing recognition of visualization tools in support of viral outbreak surveillance and control has stimulated the advent of more complex and fully integrated frameworks and platforms, all the while focusing on user experience and ease of customization. we anticipate that future visualization developments will take further leaps in this ongoing trend by tackling remaining challenges to display increasing amounts of dense information in a human-readable manner and by introducing concepts from other disciplines such as visual analytics. in particular, high expectations stem from the ensemble of visualization methods that allow users to work at, and move between, focused and contextual views of a data set ( ). large scientific data sets with a temporal aspect have been the subject of multi-level focus+context approaches for their interactive visualization ( ), which minimize the seam between data views by displaying the focus on a specific situation or part of the data within its context. these approaches are part of an extensive series of interface mechanisms used to separate and blend views of the data, such as overview+detail, which uses a spatial separation between focused and contextual views, and zooming, which uses a temporal separation between these views ( ). phylogenetic trees can also be interactively visualized as three-dimensional stacked representations ( ).
the field of phylogenomics and phylodynamics visualization will increasingly implement and adapt technologies from other disciplines, as already illustrated by tools and studies using the network-oriented cytoscape package ( , , ), or through the use of virtual reality technologies including customizable mapping frameworks and high-performance geospatial analytical toolboxes. [figure caption: tanglegrams are typically shown in a side-by-side manner, in order to easily and visually identify differences in clustering between two or more phylogenetic trees, for example when inferred from different influenza proteins (pb , pb , pa, ha, np, na, m , and ns ). such a series of trees can also be visualized in a circle facing inwards with a particular isolate highlighted in all plotted phylogenies (left figure), or with all isolates interconnected between all proteins (right figure).] [figure caption: bayesian phylogenetic inference software packages generate a large number of posterior trees, potentially annotated with inferred ancestral traits. this collection of trees is often summarized using a consensus tree, allowing a single tree with posterior support values on the internal nodes to be plotted. densitree enables drawing all posterior trees in the collection; areas where many of the trees agree in topology and branch lengths show up as highly colored areas, while areas with little agreement show up as webs ( ). we refer to figure for the color legend of the host species, as the legend drawn by densitree was not very readable and could not be edited (in terms of its textual information).] as such, concomitant with the ongoing developments in sample collection and sequencing, the design of more complex analytical inference models and powerful hardware infrastructure will be complemented by a new era of visualization applications that will collaboratively foster visualizations tracking virus epidemics and outbreaks in real time and at high resolution. an initial but already comprehensive list of publications was compiled from backward and forward citation searches of the various visualization software packages the authors have (co-)developed, as well as those packages that the authors have used throughout their academic careers. complementing this already extensive list, we searched pubmed and google scholar, which keeps track of arxiv and biorxiv submissions and hence decreased the risk of missing potential publications. additional supplementary searches were performed by backward and forward citation chasing of all of the included references throughout the process of writing the manuscript for the initial submission on april th . no date restrictions were applied, but only visualization packages and publications written in english were considered. kt wrote the manuscript. pl helped with the interpretation and writing. a-mv conceived the idea and helped with the interpretation and writing. gb wrote the manuscript and prepared the visualizations.
references:
phylodynamic applications in st century global infectious disease research. glob health res policy
phylodynamic assessment of intervention strategies for the west african ebola virus outbreak
hiv epidemiology. the early spread and epidemic ignition of hiv- in human populations
genomic and epidemiological monitoring of yellow fever virus transmission potential
virus genomes reveal factors that spread and sustained the ebola epidemic
science forum: improving pandemic influenza risk assessment
enhanced use of phylogenetic data to inform public health approaches to hiv among men who have sex with men
unifying the epidemiological and evolutionary dynamics of pathogens
emerging concepts of data integration in pathogen phylodynamics
recent advances in computational phylodynamics
real-time, portable genome sequencing for ebola surveillance
adaptive mcmc in bayesian phylogenetics: an application to analyzing partitioned data in beast
beagle : improved performance, scaling and usability for a high-performance computing library for statistical phylogenetics
on the origin of species by means of natural selection
fast, accurate and simulation-free stochastic mapping
data visualization literacy: definitions, conceptual frameworks, exercises, and assessments
phylogeotool: interactively exploring large phylogenies in an epidemiological context
interactive tree of life (itol): an online tool for phylogenetic tree display and annotation
nextstrain: real-time tracking of pathogen evolution
fundamentals of data visualization
metagenomic sequencing at the epicenter of the nigeria lassa fever outbreak
a fast method for approximating maximum likelihoods of phylogenetic trees from nucleotide sequences
phylogenetic analysis using parsimony (*and other methods)
phylip - phylogeny inference package (version . )
tree view: an application to display phylogenetic trees on personal computers
mrbayes . : efficient bayesian phylogenetic inference and model choice across a large model space
mega : molecular evolutionary genetics analysis version . for bigger datasets
dendroscope: an interactive viewer for large phylogenetic trees
dendroscope : an interactive tool for rooted phylogenetic trees and networks
io: interactive viewing and comparison of large phylogenetic trees on the web
icytree: rapid browser-based visualization for phylogenetic trees and networks
a fast likelihood method to reconstruct and visualize ancestral scenarios
cytoscape: a software environment for integrated models of biomolecular interaction networks
pastview: a user-friendly interface to explore evolutionary scenarios
grapetree: visualization of core genomic relationships among , bacterial pathogens
confidence limits on phylogenies: an approach using the bootstrap
densitree: making sense of sets of phylogenetic trees
spread : interactive visualization of spatiotemporal history and trait evolutionary processes
phyd : a phylogenetic tree viewer with extended phyloxml support for functional genomics data visualization
ete : reconstruction, analysis, and visualization of phylogenomic data
seraphim: studying environmental rasters and phylogenetically informed movements
explaining the geographic spread of emerging epidemics: a framework for comparing viral phylogenies and environmental landscape data
microreact: visualizing and sharing data for genomic epidemiology and phylogeography
nextflu: real-time tracking of seasonal influenza virus evolution in humans
two methods for mapping and visualizing associated data on phylogeny using ggtree
posterior summarization in
improving bayesian population dynamics inference: a coalescent-based model for multiple loci
bayesian phylogenetic and phylodynamic data integration using beast .
host phylogeny constrains cross-species emergence and establishment of rabies virus in bats
bayesian phylogeography finds its roots
extended newick: it is time for a standard representation of phylogenetic networks
phyloxml: xml for evolutionary biology and comparative genomics
nexml: rich, extensible, and verifiable representation of comparative data and metadata
phylogeography takes a relaxed random walk in continuous space and time
gengis: a geospatial information system for genomic data
cartographer, a mesquite package for plotting geographic data
spread: spatial phylogenetic reconstruction of evolutionary dynamics
js - a javascript library for application development and interactive data visualization in phylogenetics
mrent: an editor for publication-quality phylogenetic tree illustrations
the grammar of graphics (statistics and computing)
js: a graph theory library for visualisation and analysis
origins of the coalescent
genie: estimating demographic history from molecular phylogenies
smooth skyride through a rough skyline: bayesian coalescent-based inference of population dynamics
birth-death skyline plot reveals temporal changes of epidemic spread in hiv and hepatitis c virus (hcv)
inferring population history from molecular phylogenies
prediction, dynamics, and visualization of antigenic phenotypes of seasonal influenza viruses
mapping the antigenic and genetic evolution of influenza virus
dengue viruses cluster antigenically but not as discrete serotypes
unifying viral genetics and human transportation data to predict the global transmission dynamics of human influenza h n
counting labeled transitions in continuous-time markov models of evolution
global migration dynamics underlie evolution and persistence of human influenza a (h n )
phylodynamics of h n / influenza reveals the transition from host adaptation to immune-driven selection
bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data
using genomics data to reconstruct transmission trees during disease outbreaks
inferring hiv- transmission networks and sources of epidemic spread in africa with deep-sequence phylogenetic analysis
epidemic reconstruction in a phylogenetics framework: transmission trees as partitions of the node set
relating phylogenetic trees to transmission trees of infectious disease outbreaks
a bayesian approach for inferring the dynamics of partially observed endemic infectious diseases from space-time-genetic data
simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks
reassortment between influenza b lineages and the emergence of a coadapted pb -pb -ha gene complex
host ecology determines the dispersal patterns of a plant virus
mers-cov recombination: implications about the reservoir and potential for adaptation
tanglegrams for rooted phylogenetic trees and networks
mesquite: a modular system for evolutionary analysis
mapping phylogenetic trees to reveal distinct patterns of evolution
treespace: statistical exploration of landscapes of phylogenetic trees
real-time analysis and visualization of pathogen sequence data
pplacer: linear time maximum-likelihood and bayesian phylogenetic placement of sequences onto a fixed reference tree
a format for phylogenetic placements
tracing the impact of public health interventions on hiv- transmission in portugal using molecular epidemiology
the effect of interventions on the transmission and spread of hiv in south africa: a phylodynamic analysis
the global origins of resistance-associated variants in the non-structural proteins a and b of the hepatitis c virus
the global spread of hepatitis c virus a and b: a phylodynamic and phylogeographic analysis
hiv- infection in cyprus, the eastern mediterranean european frontier: a densely sampled transmission dynamics analysis from to
sub-epidemics explain localized high prevalence of reduced susceptibility to rilpivirine in treatment-naive hiv- -infected patients: subtype and geographic compartmentalization of baseline resistance mutations
phylogenetic analysis reveals the global migration of seasonal influenza a viruses
the impact of migratory flyways on the spread of avian influenza virus in north america
spatiotemporal characteristics of the largest hiv- crf ag outbreak in spain: evidence for onward transmissions
evolutionary analysis provides insight into the origin and adaptation of hcv
the origin, emergence and evolutionary genetics of dengue virus
a review of overview+detail, zooming, and focus+context interfaces
a four-level focus+context approach to interactive visual analysis of temporal features in large scientific data
we are grateful to gytis dudas for providing a figure from his baltic visualization package (https://github.com/evogytis/baltic). we thank simon dellicour for fruitful discussions. the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. copyright © theys, lemey, vandamme and baele. this is an open-access article distributed under the terms of the creative commons attribution license (cc by). the use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. no use, distribution or reproduction is permitted which does not comply with these terms.
key: cord- -th da bb authors: gardy, jennifer l.; loman, nicholas j. title: towards a genomics-informed, real-time, global pathogen surveillance system date: - - journal: nat rev genet doi: . /nrg. . sha: doc_id: cord_uid: th da bb the recent ebola and zika epidemics demonstrate the need for the continuous surveillance, rapid diagnosis and real-time tracking of emerging infectious diseases. fast, affordable sequencing of pathogen genomes — now a staple of the public health microbiology laboratory in well-resourced settings — can affect each of these areas. coupling genomic diagnostics and epidemiology to innovative digital disease detection platforms raises the possibility of an open, global, digital pathogen surveillance system. when informed by a one health approach, in which human, animal and environmental health are considered together, such a genomics-based system has profound potential to improve public health in settings lacking robust laboratory capacity. supplementary information: the online version of this article (doi: . /nrg. . ) contains supplementary material, which is available to authorized users. in late and early , a lethal haemorrhagic fever spread throughout forested guinea (guinée forestière), undiagnosed for months. by the time it was reported to be ebola, the virus had spread to three countries and was likely past the point at which case-level control measures, such as isolation and infection control, could have contained the nascent outbreak.
in , a new dengue-like illness was implicated in a dramatic increase in brazil's microcephaly cases; one year later, analyses revealed that the zika virus had been sweeping through the americas, unnoticed by existing surveillance systems, since late . although public health surveillance systems have evolved to meet the changing needs of our global population, we continue to dramatically underestimate our vulnerability to pathogens, both old and new . indeed, the recent events in west africa and brazil highlight the gaps in existing infectious disease surveillance systems, particularly when dealing with novel pathogens or pathogens whose geographic range has extended into a new region. despite the lessons learned from previous outbreaks, such as the severe acute respiratory syndrome (sars) epidemic in - and the influenza pandemic (particularly the need for enhanced national surveillance and diagnostic capacity), infectious threats continue to surprise and sometimes overwhelm the global health response. the cost of these epidemics demands that we take action: with fewer than , cases, the ebola outbreak ultimately resulted in over , deaths, left nearly , children without parents and caused cumulative gross domestic product losses of more than % . as with prior crises, in the wake of ebola, multiple commissions have offered suggestions for essential reforms , . most focus on systems-level change, such as funding research and development or creating a centralized pandemic preparedness and response agency. however, they also call for enhanced molecular diagnostic and surveillance capacity coupled to data-sharing frameworks. this hints at an emerging paradigm for rapid outbreak response, one that employs new tools for pathogen genome sequencing and epidemiological analysis (fig. ) and that can be deployed anywhere. in this model, portable, in-country genomic diagnostics are targeted to key settings for routine human, animal and environmental surveillance or rapidly deployed to a setting with a nascent outbreak. within our increasingly digital landscape, wherein a clinical sample can be transformed into a stream of data for rapid analysis and dissemination in a matter of hours, we face a tremendous opportunity to more proactively respond to disease events. however, the potential benefits of such a system are not guaranteed, and many obstacles remain. here, we review recent advances in genomics-informed outbreak response, including the role of real-time sequencing in both diagnostics and epidemiology. we outline the opportunities for integrating sequencing with the one health and digital epidemiology fields, and we examine the ethical, legal and social considerations that such a system raises.
given the current high cost of the technique -conservatively estimated at several thousand dollars -it is most often used when dealing with potentially lethal infections that fail the conventional diagnostic paradigm, such as the recent diagnosis of an unusual case of meningoencephalitis caused by the amoeboid parasite balamuthia man drillaris or the diagnosis and treatment of neuroleptospirosis in a critically unwell teenager . in the latter case, despite a high index of suspicion for infection, leptospira santa rosai was not detected by culture or pcr, as the diagnostic primer sequences were eventually found to be a poor match to the genome of the pathogen. intravenous antibiotic therapy resulted in rapid recovery. in such an example, the costs are easily justified, particularly when offset against the cost of a stay in an intensive treatment unit. however, routine diagnostic metagenomics is currently limited to a handful of clinical research laboratories worldwide; it is therefore regarded as a 'test of last resort' and kept in reserve for vexing diagnostic conundrums. substantial practical challenges hinder the adoption of metagenomics for diagnostics (fig. ) (reviewed in depth in ref. ) . chief among these is analytic sensitivity, which depends on pathogen factors (for example, genome size, ease of lysis and life cycle); analytic factors (for example, the completeness of reference databases and the potential to mistake a target for a close genetic relative); and sample factors (for example, pathogen abundance within a sample and contaminating background dna). as an example of a problematic sample, during zika surveillance, attempts to perform un targeted metagenomics sequencing on blood yielded few, or in some cases zero, reads owing to low viral titres . targetenrichment technologies (reviewed in ref. ) such as bait probes can be employed, but even these were unsuccessful at recovering whole zika genomes, necessitating pcr enrichment . in addition to sensitivity, universal pathogen detection through clinical metagenomics is complicated by specificity issues arising from misclassification or contaminated reagents, the challenge of reproducing results from a complex clinical workflow, nucleic acid stability under varying assay conditions, ever-changing bioinformatics workflows and cost. given these issues, could metagenomics replace conventional microbiological and molecular tests for infection? recent studies have used metagenomics in common presentations, including sepsis , pneumonia , urinary tract infections and eye infections . these have generally yielded promising results, albeit typically at a lower sensitivity than conventional tests and at a much greater cost. despite these problems, two factors will drive sequencing to eventually become routine clinical practice. first, the ever-decreasing cost of sequencing coupled with the potential for cost savings achieved by using a single diagnostic modality versus tens or hundreds of different diagnostic assays -each potentially requiring specific instrumentation, reagents, validation and labour -is attractive from a laboratory operations perspective. second, and perhaps most compelling, is the additional information afforded by genomics, including the ability to predict virulence or drug resistance phenotypes, the ability to detect polymicrobial infections and phylogenetic reconstruction for outbreak analysis. novel technologies: portable sequencing. 
novel technologies: portable sequencing. given that outbreaks of emerging infectious diseases (eids) most often occur in settings with minimal laboratory capacity, where routine culture and bench-top sequencing are simply not feasible, the need for a portable diagnostic platform capable of in situ clinical metagenomics and outbreak surveillance is evident. a trend towards smaller and less expensive bench-top sequencing instruments was seen with the genome sequencer junior system (which has since been discontinued), the ion torrent personal genome machine (pgm) system and the illumina miseq system, which were released in close succession . each of these instruments costs <$ , and puts ngs capability into the hands of smaller laboratories, including clinical settings. [figure caption: a genomics-informed surveillance and outbreak response model (panels: outbreak response; portable genome sequencing; digital epidemiology; one health). portable genome sequencing technology and digital epidemiology platforms form the foundation for both real-time pathogen and disease surveillance systems and outbreak response efforts, all of which exist within the one health context, in which surveillance, outbreak detection and response span the human, animal and environmental health domains.] [glossary: transmission: the event through which a pathogen is transferred from one entity to another. transmission can be person-to-person, as in the case of ebola, vector-to-person, as with zika, or environment-to-person via routes including food, water and contact with a contaminated object or surface. genomic epidemiology: the use of genome sequencing to understand infectious disease transmission and epidemiology. see fig. .] in , the minion from oxford nanopore technologies was released to early access users , heralding the potential
moreover, although the long reads produced by the minion overcome a number of challenges in assembling eukaryotic microbial pathogen genomes, such as the presence of discrete chromosomes or long repetitive regions, the upstream nucleic acid extraction steps required to obtain genomic dna vary across microbial domains and might necessitate reagents and equipment far less portable than the minion. from transmission to epidemic dynamics. genomics is capable of informing not just pathogen diagnostics but also epidemiology. pathogen sequencing has been used for decades to understand transmission in viral outbreaks, from early studies of hantavirus in the united states of america to human immunodeficiency virus (hiv) in the united kingdom ; more recently, the approach has been successfully extended to include bacterial pathogens (reviewed in ref. ) and has come to be known as genomic epidemiology, a term encompassing everything from population dynamics to the reconstruction of individual transmission events within outbreaks . most transmission-focused investigations to date have been retrospective, with only a subset unfolding in real time, as cases are diagnosed [ ] [ ] [ ] [ ] [ ] . in transmission-focused investigations, genetic variants are used to identify person-to-person transmission figure | challenges to in-field clinical metagenomics for rapid diagnosis and outbreak response. a mobile medical unit deploying a portable clinical metagenomics platform has been established at the epicentre of an infectious disease outbreak, but the team faces challenges throughout the diagnostic process and epidemiological response. for example, in the case of zika virus, samples, such as blood, with low viral titres, a small genome of < kb and transient viraemia combine to complicate detection of viral nucleic acid by use of a strictly metagenomic approach. furthermore, obtaining a sufficient amount of viral nucleic acids for genome sequencing beyond simple diagnostics requires a tiling pcr and amplicon sequencing approach . other challenges include, for example, access to a reliable internet connection, the ability to collect sample metadata and translating genomic findings into real-time, actionable recommendations. the average number of secondary cases of an infectious disease produced by a single infectious case, given a completely susceptible population. a term describing infectious diseases that typically exist in an animal reservoir but that can be transmitted to humans. the transmission of an infectious disease, such as ebola, from a survivor of that disease who has recovered from their symptoms. a term describing infectious diseases that are transmitted to humans through contact with a non-human species, particularly those diseases spread through insect bites. an example is the zika virus, which is carried by mosquitos. geographical settings where a variety of factors converge to create the social and environmental conditions that promote disease transmission. the process by which an infectious disease changes from existing exclusively in animals to being able to infect, then transmit between, humans. see fig. . events (fig. ) , either through manual interpretation of the variants shared between outbreak cases or via modelbased approaches , with the result being a transmission network. epidemic investigations are very different -only a subset of the epidemic cases are sequenced. thus, the goal is to use the population structure of the pathogen to understand the overall dynamics of the epidemic. 
here, phylodynamic approaches are used to infer epidemiological parameters of interest. first conceptualized in by grenfell et al. as a union of "immunodynamics, epidemiology, and evolutionary biology" (ref. ), phylodynamics captures both epidemiological and evolutionary information from measurably evolving pathogens, that is, those viruses and bacteria for which high mutation rates and/or a range of sampling dates contribute to a meaningful amount of genetic variation between sequences , ; in other words, enough genetic diversity to be able to infer an evolutionary history for a pathogen of interest, even if that history spans only the short time frame of an outbreak or epidemic. this is possible for most pathogens, particularly single-stranded dna viruses, rna viruses and many bacterial species , , but there are certain species for which the lack of a strict molecular clock and/or frequent recombination complicate both phylodynamics studies and attempts to infer transmission events . phylodynamics relies on tools such as bayesian evolutionary analysis sampling trees (beast) , in which sequence data are used to build a time-labelled phylogenetic tree using a specific evolutionary process as a guide, often variations on a theme of coalescent theory . from the tree, one can infer epidemiological parameters, including the basic reproductive number r (ref. ). while the insights that can be gained from genomic data alone are exciting, the utility of phylodynamic approaches is greatly extended when additional data are integrated into the models (reviewed in ref. ). genomic epidemiology in action: ebola. the many genomic epidemiology studies from the ebola outbreak (reviewed in ref. ) used bench-top and portable sequencing platforms to reveal outbreak-level events and epidemic-level trends. real-time analyses published around the peak of the epidemic suggested the following: the outbreak probably arose from a single introduction into humans and not repeated zoonotic introductions , ; sexual transmission had a previously unrecognized role in maintaining transmission chains ; and survivor transmission (another unrecognized phenomenon) contributed to disease flare-ups later in the outbreak . the first sequencing efforts, all of which had an effect on the epidemiological response in real time, unfolded months into the epidemic. had they been deployed earlier, we can only speculate as to their potential impact. arguably, the most compelling use of early sequencing would have been to provide a definitive ebola diagnosis in this previously unaffected region of west africa. however, even after the outbreak was underway, sequencing could have benefited the public health response. for example, ruling out bush meat as a source of repeated viral introductions could have changed public health messaging campaigns from avoiding bush meat to the importance of hygiene and safe funeral practices , potentially averting some cases. portable sequencing and phylodynamic approaches are currently being deployed in the ongoing zika epidemic; whether the real-time reporting of genomic findings is able to alter the course of a vector-borne epidemic remains to be seen. retrospective phylodynamic investigations are also useful for pandemic preparedness planning. a recent analysis of , ebola virus genomes (approximately % of all cases) reconstructs the movement of the virus across west africa and reveals drivers for its spread . the authors deduce that ebola importation was more likely to occur between regions of a country than across international borders and that both population size and distance to a nearby large urban centre were associated with local expansion of the virus. these findings may affect decision-making around border closures in future ebola outbreaks and point to the need to develop surveillance, diagnostic and treatment capacity in urban centres.
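the coalescent process that guides the construction of such time-labelled trees can be illustrated with a toy simulation: with k lineages and a constant effective population size ne, the waiting time to the next coalescence is exponentially distributed with rate k(k-1)/(2ne). the sketch below simulates only the node times, not a full genealogy, and the values of n and ne are illustrative assumptions.

```python
# Toy illustration of the coalescent process underlying many phylodynamic
# models: with k lineages and constant effective population size Ne, the
# waiting time to the next coalescence is exponential with rate k(k-1)/(2*Ne).
# Only node times are simulated, not a full genealogy; n and ne are
# illustrative assumptions.
import random

def coalescent_times(n=10, ne=100.0, seed=1):
    random.seed(seed)
    t, times = 0.0, []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / (2.0 * ne)
        t += random.expovariate(rate)
        times.append(t)          # time before the present of each coalescence
    return times

times = coalescent_times()
print(f"time to most recent common ancestor: {times[-1]:.1f}")
```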
retrospective phylodynamic investigations are also useful for pandemic preparedness planning. a recent analysis of , ebola virus genomes -approximately % of all cases -reconstructs the movement of the virus across west africa and reveals drivers for its spread . the authors deduce that ebola importation was more likely to occur between regions of a country than across international borders and that both population size and distance to a nearby large urban centre were associated with local expansion of the virus. these findings may affect decision-making around border closures in future ebola outbreaks and point to the need to develop surveillance, diagnostic and treatment capacity in urban centres. the role of the environment. in deploying genomics for surveillance, diagnostics and epidemiological investigation, a key question remains: where? many regions lack the diagnostic laboratory capacity to carry out basic surveillance, but continuous genomic surveillance in all of these settings would be impossible. numerous projects have attempted to describe the pool of geographic hot spots and candidate pathogens from which the next epidemic or pandemic will arise. determining these factors is key to predicting and preventing spillover events (fig. ). they report an increasing number of events each decade, generally located in hot spots defined by specific environmental, ecological and socio-economic characteristics. most eids are zoonotic in origin, with the highest risk of spillover in regions with high wildlife diversity that have experienced recent demographic change and/or recent increases in farming activity . a global biogeographic analysis of human infectious disease further supports the use of biodiversity as a proxy for eid hot spots , and reviews focused on systems-level, rather than ecological, factors identify the breakdown of local public health systems as drivers of outbreaks, suggesting that surveillance ought to be targeted to settings where biodiversity and changing demographics meet inadequate sanitation and hygiene, lack of a public health infrastructure for delivering interventions and no or limited resources for control of zoonoses and vector-borne diseases . these analyses provide a shortlist of regions, including parts of eastern and southeastern asia, india and equatorial africa, on which genomic and other surveillance activities should be focused , . within these regions, sewer systems and wastewater treatment plants could be important foci for sample collection, providing a single point of entry to biological readouts from an entire community. indeed, proof-of-concept metagenomics studies have revealed the presence of antibiotic resistance genes and human-specific viruses . most were zoonotic in origin, and over one-quarter had been detected in non-human species many years before being identified as human pathogens. a later review reiterates this observation, noting that recent agents of concern -ebola, zika and chikungunya -had been identified decades before they achieved pandemic magnitude . as a result of ngs technology, the pace of novel virus discovery is accelerating, with recent large-scale studies revealing new viruses sampled from macaque faeces in a single geographic location and , new viruses discovered from rna transcriptomic analyses of multiple invertebrate species . however, understanding which of these new entities might pose a threat requires a new approach. one health. the emergence of a zoonotic pathogen proceeds in stages (fig. ) ; in an effort to better anticipate these transitions and more proactively respond to emerging threats, the one health movement was launched in .
recognizing that human, domestic animal and wildlife health and disease are linked to each other and that changing land-use patterns contribute to disease spread, one health aims to develop systems-minded, forward-thinking approaches to disease surveillance, control and prevention . by investing in infrastructure for human and animal health surveillance, committing to timely information sharing and establishing collaborations across multiple sectors and disciplines, the goal of the one health community is an integrated system incorporating human, animal and environmental surveillance -a goal in which genomics can have an important role. the one health approach has been implemented through the predict project, which is part of the emerging pandemic threats (ept) programme of the us agency for international development (usaid). predict explores the spillover of selected viral zoonoses from particular wildlife taxa , and early efforts have focused on developing non-invasive sampling techniques for wildlife , estimating the breadth of mammalian viral diversity across nine viral families and at least , undiscovered species and demonstrating that viral community diversity is at least a partially deterministic process, suggesting that forecasting community changes, which potentially signal spillover, is a possibility . although the goal of using integrated surveillance information to predict an outbreak is still many years away, one health studies are already leveraging the tools and techniques of genomic epidemiology to understand current outbreaks. combining genomic data with data streams from enhanced one health surveillance platforms presents an opportunity to detect the population expansions and/or cross-species transmissions that may precede a human health event. for example, genome sequences from a raccoon-associated variant of rabies virus (rrv), when paired with fine-scale geographic information and data from canadian and us wildlife rabies vaccination programmes, demonstrated that multiple cross-border incursions were responsible for the expansion of rrv into canada and sustained outbreaks in several provinces ; this finding led to renewed concern about and action against rabies on the part of public health authorities .

figure | inferring transmission events from genomic data. genomic approaches to identifying transmission events typically involve four steps. in the first step, outbreak isolates, and often non-outbreak control isolates, are sequenced and their genomes either assembled de novo or mapped against a reference genome. next, the genomic differences between the sequences are identified -depending on the pathogen and the scale of the outbreak, these may include features such as genetic variants, insertions and deletions or the presence or absence of specific genes or mobile genetic elements. in the third step, these features are examined to infer the relationships between the isolates from whence they came -a variant common to a subset of isolates, for example, suggests that those cases are epidemiologically linked. finally, the genomic evidence for epidemiological linkages is reviewed in the context of known epidemiological information, such as social contact between two cases or a common location or other exposure. recently, automated methods for inferring potential epidemiological linkages from genomic data alone have been developed, greatly facilitating large-scale genomic epidemiological investigations .
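as a toy illustration of the third step in the figure legend above - inferring relationships between isolates from shared genomic features - the sketch below links outbreak isolates whose variant profiles differ by fewer than a chosen number of positions. the isolate names, variant positions and distance threshold are made up for illustration; real investigations would also weigh epidemiological metadata before calling any pair a transmission link.

```python
# toy pairwise comparison of variant profiles between outbreak isolates.
# isolate names, variant positions and the threshold are illustrative only.
from itertools import combinations

# each isolate is represented by the set of genome positions where it
# differs from the chosen reference genome
variants = {
    "case_a": {1204, 5671, 8832, 14002},
    "case_b": {1204, 5671, 8832, 14002, 20115},
    "case_c": {1204, 5671, 9904, 17230, 23411},
    "control_x": {311, 4820, 7754, 16008, 21987, 25640},
}

MAX_DIFFERENCES = 2  # pairs differing by at most this many positions get flagged

def pairwise_links(profiles, threshold):
    """yield (isolate1, isolate2, distance) for pairs at or under the threshold."""
    for a, b in combinations(sorted(profiles), 2):
        distance = len(profiles[a] ^ profiles[b])  # symmetric difference
        if distance <= threshold:
            yield a, b, distance

if __name__ == "__main__":
    for a, b, d in pairwise_links(variants, MAX_DIFFERENCES):
        print(f"putative link: {a} <-> {b} ({d} differing positions); "
              "review against contact and exposure data")
```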
one of the first studies coupling detailed wildlife and livestock movement data with phylodynamic analysis of a bacterial pathogen revealed that cross-species jumps from an elk reservoir were the source of increasing rates of brucella abortus infections in nearby livestock ; as brucellosis is the most common zoonosis of humans, control programmes will benefit substantially from this sort of one health approach . this model, in which diagnostic testing in reference laboratories triggers genomic follow-up, represents an effective near-term solution for integrating genomics into one health surveillance efforts as the community explores solutions to the many challenges facing in situ clinical metagenomics surveillance of animal populations (reviewed in ref. ). initial forays into this area have been successful; for example, metagenomics analysis of human diarrhoeal specimens and stools from nearby pigs revealed potential zoonotic transmission of rotavirus . however, metagenomic sequencing across a range of animal species and environments yields more questions than answers. what is an early signal of pathogen emergence versus background microbial noise? which emerging agents are capable of crossing the species barrier and causing human disease? what degree of sampling is required to capture potential spillovers? ultimately, a more efficient use of metagenomics in a one health surveillance strategy might be scanning for zoonotic 'jumps' in selected sentinel human populations rather than a sweeping animal surveillance strategy , with sentinels chosen according to eid hotspot maps and other factors and interesting genomic signals triggering follow-up sequencing in the relevant animal reservoirs. by combining genomic data generated through these targeted surveillance efforts with phylodynamic approaches, it will be possible to take simple presence or absence signals and derive useful epidemiological insights: signals of population expansion; evidence of transmission within and between animal reservoirs and humans; and epidemiological analysis of a pathogen's early expansion. most modern surveillance systems use human, animal, environmental and other data to carry out disease-specific surveillance, in which a single disease is monitored through one or more data streams, such as positive laboratory test results or reportable communicable disease notifications. despite marked advances over the preceding decades, testimony from multiple expert groups has repeatedly emphasized the need for improved surveillance capacity , , including the use of syndromic surveillance, a more pathogen-agnostic approach aimed at early detection of emerging disease , . syndromic surveillance systems might leverage unique data streams such as school or employee absenteeism, grocery store or pharmacy purchases of specific items or calls to a nursing hotline as signals of illness in a population. increasingly, digital streams are being used as an input to these systems, be they participatory epidemiology projects such as flu near you , the automated analysis of trending words or phrases on social media sites, such as twitter , , or internet search queries [ ] [ ] [ ] . this new approach to surveillance is known as digital epidemiology and is also referred to as digital disease detection .
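the syndromic data streams described above are typically monitored with aberration-detection rules. the sketch below is a deliberately simple version of that idea - flagging days whose counts exceed a baseline mean plus a multiple of the standard deviation - using made-up daily counts of pharmacy purchases; operational systems use more sophisticated statistics and real data feeds.

```python
# toy aberration detection on a syndromic time series (counts are invented).
from statistics import mean, stdev

daily_counts = [42, 39, 45, 41, 44, 40, 43, 46, 44, 47, 45, 48, 61, 72, 85]

BASELINE_DAYS = 10   # length of the sliding baseline window
THRESHOLD_SD = 2.0   # flag counts more than this many sds above the baseline mean

def flag_aberrations(counts, baseline_days, threshold_sd):
    """return indices of days whose count exceeds the baseline threshold."""
    flagged = []
    for day in range(baseline_days, len(counts)):
        baseline = counts[day - baseline_days:day]
        limit = mean(baseline) + threshold_sd * stdev(baseline)
        if counts[day] > limit:
            flagged.append(day)
    return flagged

if __name__ == "__main__":
    for day in flag_aberrations(daily_counts, BASELINE_DAYS, THRESHOLD_SD):
        print(f"day {day}: count {daily_counts[day]} exceeds baseline - "
              "possible signal worth investigating")
```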
in digital epidemiology, information is first retrieved from a range of sources, including digital media, newswires, official reports and crowd sourcing; second, translated and processed, which includes extracting disease events and ensuring reports are not duplicated; third, analysed for trends; and fourth, disseminated to the community through media, including websites, email lists and mobile alerts.

figure | the stages of pathogen emergence. in spillover, a pathogen previously restricted to animals gradually begins to move into the human population. during stage one (pre-emergence), as a result of changing demographics and/or land use, a pathogen undergoes a population expansion, extends its host range or moves into a new geographic region. during stage two (localized emergence), contact with animals or animal products results in spillover of the pathogen from its natural reservoir(s) into humans but with little to no onward person-to-person transmission. during stage three (pandemic emergence), the pathogen is able to sustain long transmission chains, that is, a series of disease transmission events, such as a sequential series of person-to-person transmissions, and its movement across borders is facilitated by human travel patterns .

digital epidemiology platforms are currently operating , and their flexible nature and cost-effective, real-time reporting make them effective tools for gathering epidemic intelligence, particularly in settings lacking traditional disease surveillance systems. the fields of one health and digital epidemiology are increasingly overlapping. in the predict consortium, the healthmap system and local media surveillance were combined to identify health events in five countries over a -week period . predict also suggested a role for digital epidemiology in not just event detection but also the identification of changing eid drivers. eids are driven by multiple factors, many of which have digital outputs and represent novel sources of surveillance data . for example, human movement can be revealed by mobile phone data or by the patterns of lighted cities at night, hunting data collected by states can reveal interactions between humans and wildlife, and social media and digital news sources can reveal early signals of famine, war and other social unrest. a major challenge is that the number of digital data sets available for each driver varies substantially, from hundreds for surveying land use changes -many based on remote sensing data -to mere handfuls around social inequalities and human susceptibility to infection, with most data biased towards north america and europe. the digital and genomic epidemiology domains are also starting to overlap. in the ebola outbreak, digital epidemiology revealed that drivers of infection risk included settings where households lacked a radio, with high rainfall and with urban land cover , echoing the evidence from a genomic study suggesting that sites at which urban and rural populations mix contribute to disease . during the zika epidemic, majumder et al. used healthmap and google trends to estimate the basic reproductive number r to be . - . ; phylodynamic estimates from brazilian genomic data gave similar ranges ( . - . ) , indicating that both types of data streams can be leveraged in calculating epidemiological parameters that help shape the public health response.
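the healthmap/google trends estimate mentioned above rests on a standard trick: fit an exponential growth rate to early incidence counts and convert it into a reproductive number using an assumed generation-interval distribution. the sketch below uses the wallinga-lipsitch relation for a gamma-distributed generation interval; the weekly counts and the interval mean and standard deviation are invented for illustration.

```python
# crude reproductive-number estimate from an early epidemic curve.
# weekly counts and generation-interval parameters are illustrative only.
import math

weekly_cases = [3, 5, 9, 14, 24, 40, 66]   # invented early outbreak counts
GEN_MEAN_WEEKS = 2.5                        # assumed mean generation interval
GEN_SD_WEEKS = 1.0                          # assumed standard deviation

def exponential_growth_rate(counts):
    """least-squares slope of log(cases) against time (per time unit)."""
    times = list(range(len(counts)))
    logs = [math.log(c) for c in counts]
    n = len(counts)
    mt, ml = sum(times) / n, sum(logs) / n
    num = sum((t - mt) * (l - ml) for t, l in zip(times, logs))
    den = sum((t - mt) ** 2 for t in times)
    return num / den

def reproductive_number(r, mean, sd):
    """wallinga-lipsitch estimate for a gamma-distributed generation interval."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return (1 + r * scale) ** shape

if __name__ == "__main__":
    r = exponential_growth_rate(weekly_cases)
    print(f"growth rate: {r:.2f} per week")
    print(f"reproductive number: "
          f"{reproductive_number(r, GEN_MEAN_WEEKS, GEN_SD_WEEKS):.2f}")
```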
a digital pathogen surveillance era. recent reports have called for the integration of genomic data with digital epidemiology streams , . when informed by a one health approach, the epidemiological potential of this digital pathogen surveillance system is profound. imagine parallel networks of portable pathogen sequencers deployed to laboratories and communities in eid hot spots -regions that are traditionally underserved with respect to laboratory and surveillance capacity -and processing samples collected from targeted sentinel wildlife species, insect vectors and humans (fig. ) . samples would be pooled for routine surveillance -either through targeted diagnostics or, if the issue of analytical sensitivity can be overcome, through metagenomics -with a full genomic work-up of individual samples should a pathogenic signal be detected. at the same time, existing internet-based platforms such as healthmap and new local participatory epidemiology efforts would be collecting data to both identify potential hotspot regions and detect eid events, enabling both prospective and rapid-response deployment of additional sequencers. genome sequencing data coupled with rich metadata would then be released in real time to web-based platforms, such as virological for collaborative analysis and nextstrain for analysis and visualization . these sites -already used in the ebola and zika responses -would act as the nexus for a global network of interested parties contributing to real-time phylodynamic and epidemiological analyses and looking for signals of spillover, pathogen population expansion and sustained human-to-human transmission. results would be immediately shared with the one health frontline -epidemiologists, veterinarians and community health workers -who would then implement evidence-based interventions to mitigate further spread. the pathway to such a reality is not without its roadblocks. apart from technical and implementation challenges, a series of larger concerns surrounds the rollout of genomics-based rapid outbreak response, ranging from the uptake of a new, disruptive technology to effecting systems-level change on a global scale. sequencing-based diagnostics, particularly clinical metagenomics approaches, are still straddling the boundary between research and clinical use. in this realm, uncertainty is a certainty, be it uncertainty inherent to the technology itself or informational uncertainty, such as how accurate, complete and reliable results actually are . early adopters of genomics in the academic domain are used to uncertainty, often acknowledging and appraising it, but routine clinical use requires meeting the evidentiary thresholds mandated by a range of stakeholders, from regulators to the laboratories implementing new sequencing-based tests. decision criteria that influence whether a new genomic test is adopted include the ability of the assay to differentiate pathogens from commensals, the correlation of pathogen presence with disease, the sensitivity and specificity of the test, its reproducibility and robustness across sample types and settings and a cost comparable to that of existing platforms . validation -defining the conditions needed to obtain reliable results from an assay, evaluating the performance of the assay under said conditions and specifying how the results should be interpreted, including outlining limitations -is also critical. much can be learned from the domain of microbial forensics, where sequencing is playing a large part . budowle et al.
review validation considerations for ngs , noting that this technology requires validating sample preparation protocols, including extraction, enrichment and library preparation steps, sequencing protocols, and downstream bioinformatics analyses, including alignment and assembly, variant calling, the underlying reference databases and software tools and the interpretation of the data. complete validation of a sequencing assay may not always be possible, particularly for emerging pathogens. therefore, just as the west african ebola virus outbreak triggered a review of the ethical context for trialling new therapeutics and vaccines , the scale-up of ngs in emerging epidemics will engender similar conversations. rather than wait for this to happen, an anticipatory approach is best, outlining the exceptional circumstances under which unvalidated approaches might be used, selecting the appropriate approach and examining the benefits of a potentially untested approach in light of individual and societal interests. if the social landscape surrounding the introduction of a new technology is not considered, prior experience suggests that the road to implementation will be difficult, with hurdles ranging from public mistrust to moratoria on research . the enthusiasm of the scientific community for new technology must not lead to inflated claims of clinical utility and poor downstream decisions around the deployment of that technology. howard et al. outline several principles for successfully integrating genomics into the public health system, and as we pilot digital pathogen surveillance, the community would do well to keep many of them in mind: ensuring that the instruments and processes used are reliable and that reporting is standardized and readily interpretable by end users; that the technology is used to address important health problems; that the advantages of the approach outweigh the disadvantages; and that economic evaluation suggests savings to the health care system and society . it is also important to reconsider the role of the diagnostic reference laboratory in the new genomic landscape. as their mandates expand to include enhanced surveillance and closer collaboration with field epidemiologists, laboratory directors will face new challenges, from managing exploratory work alongside routine clinical care to hiring a new sort of technologist, one with basic genomics and epidemiology training. the ethical, social and legal implications of digital pathogen surveillance are an emerging area of research (reviewed in ref. ). chief among the issues that geller et al. identify is the tension that exists when a new technology has the power to identify a problem but there is limited or no capacity to address the issue. balancing the benefits and harms to both individuals and populations is challenging when the predictive insight offered by a genomic technology is variable -for example, using genomics to identify an individual as a 'super spreader' has important implications for quarantine and isolation, but that label may be predicated on a tenuous prediction. the problem is further compounded by the fact that many infectious disease diagnoses carry with them a certain amount of stigma and that an individual's right to privacy might be superseded by the need to protect the larger population . data sharing and integration.
a critical need for successful digital pathogen surveillance is the capacity for rapid, barrier-free data sharing, and arguments for such sharing are frequently rehashed after outbreaks and epidemics. genomic epidemiology was born largely in the academic sphere, with early papers coming from laboratories with extensive histories in microbial genomics and bioinformatics. for this community, open access to genome sequences, software and, more recently, publications has tended to be the rule rather than the exception. indeed, a national research council report described "the culture of genomics" as "unique in its evolution into a global web of tools and information" (ref. ). the same report includes a series of recommendations on access to pathogen genome data, including the statement that "rapid, unrestricted public access to primary genome sequence data, annotations of genome data, genome databases, and internet-based tools for genome analysis should be encouraged" (ref. ). as genomics has moved into the domain of clinical and public health practice, the notion of free and immediate access to genomic surveillance data has encountered several barriers: the siloing of critical metadata across multiple public health databases with no interoperability; balancing openness and transparency with patient privacy and safety; variable data quality, particularly in resource-limited settings; concerns over data reuse by third parties; a lack of standards and ontologies to capture metadata; and career advancement disincentives to releasing data [ ] [ ] [ ] . despite these challenges, the spirit of open access and open data remains strong in the community, with over public health leaders from around the world recently signing a joint statement on data sharing for public health surveillance . the ebola and zika responses in particular highlight the role of real-time sharing of data and samples, be it through the use of chat groups and a labkey server to disseminate zika data or github to share ebola data .

box | digital pathogen surveillance in action: an illustrative scenario. in one such region, the syndromic surveillance system reports higher-than-average sales of a common medication used to relieve fever. spatial analysis of the data from the pharmacies in the region suggests that the trend is unique to a particular district; a follow-up geographic information system (gis) analysis using satellite data reveals that this area borders a forest and is increasingly being used for the commercial production of bat guano. an alert is triggered, and the field response team meets with citizens in the area. nasopharyngeal swabs are taken from humans and livestock with fever as well as from guano and bat tissue collected in the area. the samples are immediately analysed using a portable dna sequencer coupled to a smartphone. an app on the phone reports the clinical metagenomic results in real time, revealing that in many of the ill humans and animals, a novel coronavirus makes up the bulk of the microbial nucleic acid fraction. the sequencing data are immediately uploaded to a public repository as they are generated, tagged with metadata about the host, sample type and location and stored according to a pathogen surveillance ontology. the data release triggers an announcement via social media of a novel sequence, and within minutes, interested virologists have created a shared online workspace and open lab notebook to collect their analyses of the new pathogen.
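in the scenario above, an app summarizes clinical metagenomic results in real time by reporting which taxa dominate the non-host reads. the sketch below shows one way such a summary could be computed from a classifier's per-read taxon assignments; the input format (a simple two-column file of read identifier and assigned taxon), the excluded labels and the file name are assumptions for illustration, not the output format of any particular tool.

```python
# toy summary of per-read taxonomic assignments from a metagenomic classifier.
# the two-column input format (read_id <tab> taxon) is assumed for illustration.
from collections import Counter

# reads assigned to the host, or left unclassified, are excluded from the
# microbial fraction; these labels are placeholders
EXCLUDE_LABELS = {"homo sapiens", "host", "unclassified"}

def summarize(assignment_path):
    """return (total_reads, Counter of non-host taxa)."""
    total = 0
    taxa = Counter()
    with open(assignment_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            total += 1
            taxon = fields[1].strip().lower()
            if taxon not in EXCLUDE_LABELS:
                taxa[taxon] += 1
    return total, taxa

if __name__ == "__main__":
    total, taxa = summarize("read_assignments.tsv")  # placeholder file name
    microbial = sum(taxa.values())
    print(f"{total} reads, {microbial} assigned to non-host taxa")
    for taxon, count in taxa.most_common(5):
        share = count / microbial if microbial else 0.0
        flag = "  <-- dominant signal" if share > 0.5 else ""
        print(f"  {taxon}: {count} reads ({share:.1%} of microbial fraction){flag}")
```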
in the wake of ebola, yozwiak et al. and chretien et al. outline additional issues facing data sharing, from differing cultures and academic norms to complicated consent procedures and technical limitations. they note that we as a community must agree on standards and practices promoting cooperation -a conversation that could begin by examining how the global alliance for genomics and health (ga gh) framework for responsible sharing of genomic and health-related data (box ) could be adapted for the digital pathogen surveillance community. the future: the sequencing singularity? transformative change to public and global health is profoundly difficult. complicating the existence of a rapid, open, transparent response is the fact that no matter the setting, there are often conflicting interests at work. in an outbreak scenario, conflict may result from governments wishing to keep an outbreak quiet and/or from the tension between lower-income and middle-income countries with few resources for generating and using data and the researchers or response teams from better-resourced settings . indeed, the conflicting values in outbreak responses meet the definition of a 'wicked' problem, where issues resist simple resolution and span multiple jurisdictions and where each stakeholder has a different perspective on the solution. even the international health regulations (ihr), which ostensibly provide a legal instrument for global health security, fail to effect a basic surveillance and outbreak response. as of the most recent self-reporting, only % of the member countries of the ihr are in compliance, meeting the prescribed minimum public health core capacities . in these settings, digital pathogen surveillance must be within the purview of the larger global health community and its diverse group of non-state actors rather than being solely the responsibility of nations themselves . this raises an important issue: if nations are willing to cede a certain amount of surveillance and diagnostic control to the global health community, the notion of reciprocity suggests that they should derive some corresponding local benefit.

box | the global alliance for genomics and health (ga gh) framework for genomic data sharing. in the universal declaration of human rights, article outlines the right of every individual "to share in scientific advancement and its benefit". in this spirit, the global alliance for genomics and health (ga gh) data-sharing framework , which covers data donors, producers and users, is guided by the principles of privacy, fairness and non-discrimination and has as its goal the promotion of health and well-being and the fair distribution of benefits arising from genomic research.
the core elements of the framework include the following:
• transparency: knowing how the data will be handled, accessed and exchanged
• accountability: tracking of data access and mechanisms for addressing misuse
• engagement: involving citizens and facilitating dialogue and deliberation around the societal implications of data sharing
• quality and security: mitigating unauthorized access and implementing an unbiased approach to storing and processing data
• privacy, data protection and confidentiality: complying with the relevant regulations at every stage
• risk-benefit analysis: weighing benefits (including new knowledge, efficiencies and informed decision making) against risks (including invasion of privacy and breaches of confidentiality), minimizing harm and maximizing benefit at the individual and societal levels
• recognition and attribution: ensuring recognition is meaningful to participants, providing due credit to all who shared data and ensuring credit is given for both primary and secondary data use
• sustainability: implementing systems for archiving and retrieval
• education and training: advancing data sharing, improving data quality, educating people on why data sharing matters, and building capacity
• accessibility and dissemination: maximizing accessibility, promoting collaboration and using publication and digital dissemination to share results

the 'trickle-down' effects of global genomic surveillance have yet to be fully articulated, but they are likely to be realized first in the zoonotic domain, where global surveillance efforts will feed back into improved animal health at a local level, in turn benefiting local farmers. outbreaks occur at the intersection of risk perception, governance, policy and economics , and outbreak response is often based on political instinct rather than data . building a resilient and responsive public health system is therefore more than just enhancing surveillance and coupling it to novel technology -it is about engagement, trust, cooperation and building local capacity , as well as a focus on pandemic prevention through development rather than pandemic response via disaster relief mechanisms . expert panels convened by harvard and the london school of hygiene and tropical medicine and by the national academy of medicine have called for a central pandemic preparedness and response agency and also underscored the need for deeper partnerships between formal and informal surveillance, epidemiology and academic and public health networks . more recently, evolutionary biologist michael worobey wrote: "systematic pathogen surveillance is within our grasp, but is still undervalued and underfunded relative to the magnitude of the threat" (ref. ). if we are to achieve the sequencing singularity -the moment at which pathogen, environmental and digital data streams are integrated into a global surveillance system -we require a community united behind a vision in which public health and the attendant data belong to the public and behind the idea that we are a better, healthier society when the public is able to access and benefit from the data being collected about us and the pathogens we share the planet with.
virus genomes reveal factors that spread and sustained the ebola epidemic zika virus in the americas: early epidemiological and genetic findings this work is the first to leverage genome sequences generated early in the zika outbreak to provide a real-time glimpse into the spread of the virus establishment and cryptic transmission of zika virus in brazil and the americas genomic epidemiology reveals multiple introductions of zika virus into the united states this paper is the first to use a genomic approach to track the entry of zika into the usa our shared vulnerability to dangerous pathogens progress in global surveillance and response capacity years after severe acute respiratory syndrome west african ebola crisis and orphans commission on a global health risk framework for the future. the neglected dimension of global security: a framework to counter infectious disease crises will ebola change the game? ten essential reforms before the next pandemic. the report of the harvard-lshtm independent panel on the global response to ebola application of next generation sequencing in clinical microbiology and infection prevention this report, from the american society for microbiology and the college of american pathologists, provides a comprehensive overview of clinical metagenomics and the associated validation challenges diagnosing balamuthia mandrillaris encephalitis with metagenomic deep sequencing actionable diagnosis of neuroleptospirosis by next-generation sequencing multiplex pcr method for minion and illumina sequencing of zika and other virus genomes directly from clinical samples clinical and biological insights from viral genome sequencing next-generation sequencing diagnostics of bacteremia in septic patients rapid pathogen identification in bacterial pneumonia using real-time metagenomics identification of bacterial pathogens and antimicrobial resistance directly from clinical urines by nanopore-based metagenomic sequencing illuminating uveitis: metagenomic deep sequencing identifies common and rare pathogens performance comparison of benchtop high-throughput sequencing platforms the oxford nanopore minion: delivery of nanopore sequencing to the genomics community real-time, portable genome sequencing for ebola surveillance nanopore sequencing as a rapidly deployable ebola outbreak tool mobile real-time surveillance of zika virus in brazil extreme metagenomics using nanopore dna sequencing: a field report from svalbard real-time dna sequencing in the antarctic dry valleys ising the oxford nanopore sequencer deep sequencing: intra-terrestrial metagenomics illustrates the potential of off-grid nanopore dna sequencing nanopore sequencing in microgravity nanopore dna sequencing and genome assembly on the international space station genetic identification of a hantavirus associated with an outbreak of acute respiratory illness the molecular epidemiology of human immunodeficiency virus type in edinburgh whole genome sequencing -implications for infection prevention and outbreak investigations real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic escherichia coli realtime investigation of a legionella pneumophila outbreak using whole genome sequencing a multi-country salmonella enteritidis phage type b outbreak associated with eggs from a german producer: 'near real-time' application of whole genome sequencing and food chain investigations, united kingdom rapid draft sequencing and real-time nanopore sequencing in a hospital outbreak 
of salmonella implementation of nationwide real-time whole-genome sequencing to enhance listeriosis outbreak detection and investigation a brief primer on genomic epidemiology: lessons learned from mycobacterium tuberculosis genomic infectious disease epidemiology in partially sampled and ongoing outbreaks this paper introduces the concept of phylodynamics, which has since become a key tool in the population genomics and epidemiology toolboxes measurably evolving populations genome-scale rates of evolutionary change in bacteria towards a new paradigm linking virus molecular evolution and pathogenesis: experimental design and phylodynamic inference this paper describes beast, a frequently used toolkit for phylogenetics and phylodynamic reconstructions inferring epidemiological dynamics with bayesian coalescent inference: the merits of deterministic and stochastic models the epidemic behavior of the hepatitis c virus emerging concepts of data integration in pathogen phylodynamics the evolution of ebola virus: insights from the - epidemic emergence of zaire ebola virus disease in guinea genomic surveillance elucidates ebola virus origin and transmission during the outbreak molecular evidence of sexual transmission of ebola virus reduced evolutionary rate in reemerged ebola virus transmission chains the wanderings of the communication on the ebola virus disease ecological origins of novel human pathogens this landmark work surveys the emergence of infectious diseases since and identifies a number of hot spots for disease emergence global biogeography of human infectious diseases preventing pandemics via international development: a systems approach the structure and diversity of human environmental surveillance of viruses by tangential flow filtration and metagenomic reconstruction search strategy has influenced the discovery rate of human viruses detecting the emergence of novel, zoonotic viruses pathogenic to humans non-random patterns in viral diversity redefining the invertebrate rna virosphere prediction and prevention of the next pandemic zoonosis conference summary: one world, one health: building interdisciplinary bridges to health in a globalized world one health proof of concept: bringing a transdisciplinary approach to surveillance for zoonotic viruses at the human-wild animal interface optimization of a novel noninvasive oral sampling technique for zoonotic pathogen surveillance in nonhuman primates a strategy to estimate unknown viral diversity in mammals processes underlying rabies virus incursions across us-canada border as revealed by whole-genome phylogeography the changing face of rabies in canada genomics reveals historic and contemporary transmission dynamics of a bacterial disease among wildlife and livestock brucellosis in livestock and wildlife: zoonotic diseases without pandemic potential in need of innovative one health approaches viral metagenomics on animals as a tool for the detection of zoonoses prior to human infection? unbiased whole-genome deep sequencing of human and porcine stool samples reveals circulation of multiple groups of rotaviruses and a putative zoonotic infection traditional and syndromic surveillance of infectious diseases and pathogens committee on achieving sustainable global capacity for surveillance and response to emerging diseases of zoonotic origin. sustaining global surveillance and response to emerging zoonotic diseases implementing syndromic surveillance: a practical guide informed by the early experience what is syndromic surveillance? 
mmwr suppl flu near you: crowdsourced symptom reporting spanning influenza seasons the reliability of tweets as a supplementary method of seasonal influenza surveillance twitter improves influenza forecasting web queries as a source for syndromic surveillance google trends: a web-based tool for real-time surveillance of disease outbreaks assessing google flu trends performance in the united states during the influenza virus a (h n ) pandemic digital disease detection -harnessing the web for public health surveillance an overview of internet biosurveillance digital disease detection: a systematic review of event-based internet biosurveillance systems healthmap: the development of automated real-time internet surveillance for epidemic intelligence healthmap has become one of the most important digital epidemiology resources evaluation of local media surveillance for improved disease recognition and monitoring in global hotspot regions drivers of emerging infectious disease events as a framework for digital detection precision global health in the digital age spatial determinants of ebola virus disease risk for the west african epidemic utilizing nontraditional data sources for near real-time estimation of transmission dynamics during the - colombian zika virus disease outbreak precision public health for the era of precision medicine this paper describes the nextflu project, which gave rise to the nextstrain platform, whose approach to analysis and visualization recently earned an international prize for open science known unknowns: building an ethics of uncertainty into genomic medicine delphi technology foresight study: mapping social construction of scientific evidence on metagenomics tests for water safety criteria for validation of methods in microbial forensics expansion of microbial forensics validation of high throughput sequencing and microbial forensics applications the ebola clinical trials: a precedent for research ethics in disasters germline genome-editing research and its socioethical implications the ethical introduction of genome-based information and technologies into public health genomics and infectious disease: a call to identify the ethical, legal and social implications for public health and clinical practice us) committee on genomics databases for bioterrorism threat agents. seeking security: pathogens, open access, and genome databases perspectives on data sharing in disease surveillance. chatham house: the royal institute of international affairs overcoming barriers to data sharing in public health: a global perspective big data or bust: realizing the microbial genomics revolution public health surveillance: a call to share data. 
international association of public health institutes real-time sharing of zika virus data in an interconnected world democratic databases: science on github data sharing: make outbreak research open access make data sharing routine to prepare for public health emergencies best practices for ethical sharing of individual-level health research data from low-and middle-income settings grand challenges in global health governance social and economic aspects of the transmission of pathogenic bacteria between wildlife and food animals: a thematic analysis of published research knowledge epidemiology: molecular mapping of zika spread framework for responsible sharing of genomic and health-related data literature review of zika virus using genomics data to reconstruct transmission trees during disease outbreaks smith foundation for health research programmes. both authors contributed equally to all aspects of the article. springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. key: cord- -qnvgg e authors: langille, morgan g. i.; eisen, jonathan a. title: biotorrents: a file sharing service for scientific data date: - - journal: plos one doi: . /journal.pone. sha: doc_id: cord_uid: qnvgg e the transfer of scientific data has emerged as a significant challenge, as datasets continue to grow in size and demand for open access sharing increases. current methods for file transfer do not scale well for large files and can cause long transfer times. in this study we present biotorrents, a website that allows open access sharing of scientific data and uses the popular bittorrent peer-to-peer file sharing technology. biotorrents allows files to be transferred rapidly due to the sharing of bandwidth across multiple institutions and provides more reliable file transfers due to the built-in error checking of the file sharing technology. biotorrents contains multiple features, including keyword searching, category browsing, rss feeds, torrent comments, and a discussion forum. biotorrents is available at http://www.biotorrents.net. the amount of data being produced in the sciences continues to expand at a tremendous rate [ ] . in parallel, and also at an increasing rate, is the demand to make this data openly available to other researchers, both pre-publication [ ] and post-publication [ ] . considerable effort and attention has been given to improving the portability of data by developing data format standards [ ] , minimal information for experiment reporting [ ] [ ] [ ] [ ] , data sharing polices [ ] , and data management [ ] [ ] [ ] [ ] . however, the practical aspect of moving data from one location to another has relatively stayed the same; that being the use of hypertext transfer protocol (http) [ ] or file transfer protocol (ftp) [ ] . these protocols require that a single server be the source of the data and that all requests for data be handled from that single location (fig. a ). in addition, the server of the data has to have a large amount of bandwidth to provide adequate download speeds for all data requests. unfortunately, as the number of requests for data increases and the provider's bandwidth becomes saturated, the access time for each data request can increase rapidly. even if bandwidth limitations are very large, these file transfer methods require that the data is centrally stored, making the data inaccessible if the server malfunctions. 
many different solutions have been proposed to address the challenges of moving large amounts of data. bio-mirror (http://www.bio-mirror.net/) was started in and consists of several servers sharing identical datasets in various countries. bio-mirror improves on download speeds, but requires that the data be replicated across all servers, is restricted to only very popular genomic datasets, and does not include the fast growing datasets such as the sequence read archive (sra) (http://www.ncbi.nlm.nih.gov/sra). the tranche project (https://trancheproject.org/) is the software behind the proteome commons (https://proteomecommons.org/) proteomics repository. the focus of the tranche project is to provide a secure repository that can be shared across multiple servers. considering that all bandwidth is provided by these dedicated tranche servers, considerable administration and funding is necessary in order to maintain such a service. an alternative to these repository-like resources is to use a peer-to-peer file transfer protocol. these peer-to-peer networks allow users to share datasets directly with each other without the need for a central repository to provide the data hosting or bandwidth for downloading. one of the earliest and most popular peer-to-peer protocols is gnutella (http://rfcgnutella.sourceforge.net/) which is the protocol behind many popular file sharing clients such as limewire (http://www.limewire.com/), shareaza (http://shareaza.sourceforge.net/), and bearshare (http://www.bearshare.com/). unfortunately, this protocol was centered on sharing individual files and does not scale well for sharing very large files. in comparison, the bittorrent protocol [ ] handles large files very well, is actively being developed, and is a very popular method for data transfer. for example, bittorrent can be used to transfer data from the amazon simple storage service (s ) (http://aws.amazon.com/s /), is used by twitter (http://twitter.com/) as a method to distribute files to a large number of servers (http://github.com/lg/murder), and for distributing numerous types of media. the bittorrent protocol works by first splitting the data into small pieces (usually kb to mb in size), allowing the large dataset to be distributed in pieces and downloaded from various sources (fig. b) . a checksum is created for each file piece to verify the integrity of the data being received and these are stored within a small ''torrent'' file. the torrent file also contains the address of one or more ''trackers''. the tracker is responsible for maintaining a list of clients that are currently sharing the torrent, so that clients can make direct connections with other clients to obtain the data. a bittorrent software client (see table ) uses the data in the torrent file to contact the tracker and allow transferring of the data between computers containing either full or partial copies of the dataset. therefore, bandwidth is shared and distributed among all computers in the transaction instead of a single source providing all of the required bandwidth. the sum of available bandwidth grows as the number of file transfers increases, and thus scales indefinitely. the end result is faster transfer times, lower bandwidth requirements from a single source, and decentralization of the data.

figure . illustration of differences between traditional and peer-to-peer file transfer protocols. a) traditional file transfer protocols such as http and ftp use a single host for obtaining a dataset (grey filled black box), even though other computers contain the same file or partial copies while downloading (partially filled black box). this can cause transfers to be slow due to bandwidth limitations or if the host fails. b) the peer-to-peer file transfer protocol, bittorrent, breaks up the dataset into small pieces (shown as pattern blocks within black box), and allows sharing among computers with full copies or partial copies of the dataset. this allows faster transfer times and decentralization of the data.
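the piece-splitting and checksum mechanism described above (and illustrated in fig. ) maps directly onto the structure of a torrent file. the sketch below builds a minimal single-file metainfo dictionary - sha-1 hashes of fixed-size pieces plus a tracker announce url - and bencodes it; real clients add further fields, and the file name, announce url and piece size here are placeholders rather than biotorrents' actual settings.

```python
# minimal single-file torrent metainfo sketch; real clients add more fields.
# the announce url, input file name and piece size are placeholders.
import hashlib
import os

def bencode(value):
    """encode python ints, bytes/str, lists and dicts in bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode() + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        out = b"d"
        for key in sorted(value):  # dictionary keys must appear in sorted order
            out += bencode(key) + bencode(value[key])
        return out + b"e"
    raise TypeError(f"cannot bencode {type(value)!r}")

def piece_hashes(path, piece_length):
    """concatenated sha-1 digests of consecutive fixed-size pieces."""
    digests = b""
    with open(path, "rb") as handle:
        while True:
            piece = handle.read(piece_length)
            if not piece:
                break
            digests += hashlib.sha1(piece).digest()
    return digests

def build_metainfo(path, announce_url, piece_length=256 * 1024):
    return {
        "announce": announce_url,
        "info": {
            "name": os.path.basename(path),
            "length": os.path.getsize(path),
            "piece length": piece_length,
            "pieces": piece_hashes(path, piece_length),
        },
    }

if __name__ == "__main__":
    meta = build_metainfo("dataset.fasta", "http://tracker.example.org/announce")
    with open("dataset.fasta.torrent", "wb") as out:
        out.write(bencode(meta))
```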
torrent files have been hosted on numerous websites and in theory scientific data can be currently transferred using any one of these bittorrent trackers. however, many of these websites contain materials that violate copyright laws and are prone to being shut down due to copyright infringement. in addition, the vast majority of data on these trackers is non-science related and makes searching or browsing for legitimate scientific data nearly impossible. therefore, to improve upon the open sharing of scientific data we created biotorrents, a legal bittorrent tracker that hosts scientific data and software. the most basic requirement of any torrent server software is the actual ''tracker'' that individual torrent clients interact with to obtain information about where to download pieces of data for a particular torrent. in order to minimize any possible transfer disruptions arising from the biotorrents tracker not being accessible, a secondary tracker is added automatically to all new torrents uploaded to biotorrents. currently this backup tracker is set to use the open bittorrent tracker (http://openbittorrent.com/). also, many bittorrent clients support a distributed hash table (dht) for peer discovery, which often allows data transfer to continue in the absence of a tracker, further enhancing the reliability over traditional client-server file transfers. in addition to the basic tracker, biotorrents has several features supporting the finding, sharing, and commenting of torrents. relevant torrents can be found by browsing categories (genomics, transcriptomics, papers, etc.), license types (public domain, creative commons, gnu general public license, etc.) and by using the provided text search. also, torrents are indexed by google (http://www.google.com), allowing users searching for datasets, but unaware of biotorrents' existence, to be directed to their availability on biotorrents. information about each dataset on biotorrents is supplied on a details page giving a description of the data, number of files, date added, user name of the person who created the dataset, and various other details including a link to the actual torrent file. to begin downloading a dataset, the user downloads and opens the torrent file in the user's previously installed bittorrent client software ( table ). the user can then control many aspects of their download (stopping, starting, download limits, etc.) through their client software without any further need to visit the biotorrents webpage. the bittorrent client will automatically connect with other clients sharing the same torrent and begin to download pieces in a non-random order. the integrity of each data piece is verified using the original file hash provided in the downloaded torrent ensuring that the completed download is an exact copy. the bittorrent client contacts the biotorrents tracker frequently (approximately every minutes) to obtain the addresses of other clients and also to report statistics of how much data they have downloaded and uploaded. these statistics are linked to the user's profile (default is the guest account) to allow real-time display on biotorrents of who is sharing a particular dataset. the choice of bittorrent client will depend on the operating system and options that the user requires. for example, some bittorrent clients (see table ) have a feature called local peer discovery (lpd) that searches for other computers sharing the same data on their local area network (lan), and allows rapid direct transfer of data over the shared network instead of over the internet. this situation may arise often in research institutions where lans are often quite large and multiple researchers are working on similar datasets.
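the integrity check described above - comparing each downloaded piece against the hash stored in the torrent file - can be written in a few lines. in the sketch below the piece size, file names and the source of the expected sha-1 digests are placeholders; a real client reads them from the torrent's metainfo and re-requests any piece that fails.

```python
# toy verification of a downloaded file against per-piece sha-1 digests.
# the file names, piece size and expected digests are placeholders.
import hashlib

PIECE_LENGTH = 256 * 1024  # must match the value recorded in the torrent file

def verify_pieces(path, expected_digests, piece_length=PIECE_LENGTH):
    """return the indices of pieces whose sha-1 digest does not match."""
    bad = []
    with open(path, "rb") as handle:
        for index, expected in enumerate(expected_digests):
            piece = handle.read(piece_length)
            if hashlib.sha1(piece).digest() != expected:
                bad.append(index)
    return bad

if __name__ == "__main__":
    # in a real client these digests come from the "pieces" field of the torrent
    with open("dataset.fasta.expected_hashes", "rb") as handle:
        raw = handle.read()
    expected = [raw[i:i + 20] for i in range(0, len(raw), 20)]  # sha-1 is 20 bytes
    failed = verify_pieces("dataset.fasta", expected)
    print("download verified" if not failed else f"re-download pieces: {failed}")
```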
another significant feature of the bittorrent client, utorrent, is the addition of a newly designed transfer protocol called utp [ ] , which is able to monitor and adapt to network congestion by limiting its transfer speeds when other network traffic is detected. this functionality is important for system administrators and internet service providers (isps) that may have previously attempted to block or hinder bittorrent activity due to its bandwidth saturating effects. sharing data on biotorrents is a simple three-step process. first, the user creates a torrent file on their personal computer using the same bittorrent client software that is used for downloading ( table ). the only piece of information the user needs to create the torrent is the biotorrents tracker announce url, which is personalized for each user (see below) and is located on the biotorrents upload page. second, this newly created torrent file is uploaded on the ''biotorrents -upload'' page along with a user description, category, and license type for the data. third, the user leaves their computer/server on with their bittorrent client running so that other users can download the data from them. it should be noted that only users that have created a free account with biotorrents are able to upload new torrents. this is to limit any possible spamming of the website as well as provide accountability for the data being shared. biotorrents enforces this and tracks users by giving each user a passkey. this passkey is automatically embedded within each torrent file that is downloaded from biotorrents and is appended to the biotorrents tracker's announce url. although we would hope that most users create an account on biotorrents, we still allow anyone to download torrents without doing so. an alternative upload method is provided for more advanced users that have many datasets to share and/or are sharing data from a remote linux-based server. this method uses a perl (http://www.perl.org) script that takes the dataset to be shared as input and returns a link to the dataset on biotorrents along with the torrent file; therefore, allowing torrents to be created for numerous datasets automatically. this feature would be useful for institutions or data providers that would like to add a bittorrent download option for their datasets.
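the per-user passkey described above is simply carried as part of the announce url, so the tracker can attribute each announce request to an account. the sketch below shows how such a personalized url could be composed and parsed; the host name, path and parameter name are hypothetical, since the paper does not specify biotorrents' exact url format (and the real tracker is written in php).

```python
# hypothetical sketch of a passkey-personalized announce url; the host, path
# and query parameter name are assumptions, not biotorrents' real interface.
from urllib.parse import urlencode, urlparse, parse_qs

TRACKER_BASE = "http://tracker.example.org/announce"

def personalized_announce(passkey):
    """compose an announce url carrying the user's passkey."""
    return f"{TRACKER_BASE}?{urlencode({'passkey': passkey})}"

def passkey_from_request(url):
    """what a tracker might do: pull the passkey out of an announce request."""
    query = parse_qs(urlparse(url).query)
    values = query.get("passkey", [])
    return values[0] if values else None

if __name__ == "__main__":
    url = personalized_announce("0123456789abcdef0123456789abcdef")
    print("announce url embedded in the user's torrents:", url)
    print("passkey recovered server-side:", passkey_from_request(url))
```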
considering that many datasets in science are often updated, biotorrents allows torrents to be optionally grouped into versions. this functionality allows improved browsing of biotorrents by providing links between torrents. more importantly, this versioning classification allows users interested in certain software or datasets to be notified via a really simple syndication (rss) feed that a new version is available on biotorrents. in addition, this rss feed can be used to obtain automated updates for datasets that are often changing, such as genomic and protein databases. for example, a user could copy the rss feed for a dataset that is being updated often on biotorrents (weekly, monthly, etc.) into their bittorrent rss-capable client. when a new version is released on biotorrents the bittorrent client automatically downloads the torrent file, checks to see what parts of the data have changed, and downloads only pieces that have been updated. the speed and effectiveness of the bittorrent protocol depend on the number of peers; in particular, those peers that have a complete copy of the file and can act as ''seeds''. therefore, it is important that individuals or institutions act as seeds for the system to achieve its full potential. currently, all newly added data is automatically downloaded and shared from the biotorrents server. this is to ensure that each dataset always has at least one server available for downloading. as the number of datasets and users of biotorrents increases, and to improve on transfer speeds on a geospatial scale (i.e. across countries and continents), we would encourage other institutions to automatically download and share all or some of the data on biotorrents. any logged-in biotorrents user can write comments or questions about a particular torrent directly on its details page. this can provide useful feedback both to the creator of the dataset as well as to other users downloading it. alternatively, researchers wanting to discuss more general questions about biotorrents, particular datasets, or science, can use the provided ''biotorrents -forums''. comments and discussion posts can be read by all visitors, but a free account is necessary to post to either of these. users that would like to be updated on newly uploaded datasets can use the biotorrents rss web feed. the rss feeds can be configured for certain categories, license types, users, and search terms, and can also be used with many bittorrent clients to automatically download all or some of the datasets on biotorrents without human intervention. finally, the ''biotorrents -faq'' (frequently asked questions) page provides users with information about bittorrent technology and general help for using biotorrents for both downloading and sharing of data. bittorrent technology can supplement and extend current methods for transferring and publishing scientific data on various scales. large institutions and data repositories such as genbank [ ] could offer their popular or larger datasets via biotorrents as an alternative method for download with minimal effort. the amount of data being transferred by these large institutions should not be underestimated. for example, in a single month ncbi users downloaded the genomes ( gb), bacteria genomes ( gb), taxonomy ( gb), genbank ( gb), and blast non-redundant (nr) ( gb) datasets; , , , , and times, respectively (personal correspondence).
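the rss-based update mechanism described earlier in this section can also be scripted outside a graphical client. the sketch below polls a feed, finds entries that have not been seen before and saves their torrent files into a directory that a bittorrent client watches; the feed url, the assumption that each item's link points directly at a .torrent file and the folder paths are all illustrative.

```python
# toy rss poller that fetches unseen torrent files into a client watch folder.
# the feed url, link format and folder paths are assumptions for illustration.
import os
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://tracker.example.org/rss?category=genomics"  # placeholder
WATCH_DIR = "torrent_watch"        # a folder the bittorrent client monitors
SEEN_FILE = "seen_links.txt"       # records links already downloaded

def load_seen():
    if not os.path.exists(SEEN_FILE):
        return set()
    with open(SEEN_FILE) as handle:
        return {line.strip() for line in handle if line.strip()}

def poll_feed():
    os.makedirs(WATCH_DIR, exist_ok=True)
    seen = load_seen()
    with urllib.request.urlopen(FEED_URL) as response:
        root = ET.fromstring(response.read())
    for item in root.iter("item"):           # standard rss 2.0 structure
        title = item.findtext("title", default="untitled")
        link = item.findtext("link")
        if not link or link in seen or not link.endswith(".torrent"):
            continue
        destination = os.path.join(WATCH_DIR, os.path.basename(link))
        urllib.request.urlretrieve(link, destination)
        with open(SEEN_FILE, "a") as handle:
            handle.write(link + "\n")
        print(f"queued new version: {title}")

if __name__ == "__main__":
    poll_feed()
```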
small groups or individual researchers can also benefit from using biotorrents as their primary method for publishing data. although these less popular datasets may not enjoy the same speed benefits from using the bittorrent protocol, due to the lack of data exchange among simultaneous downloads, the lower barrier of entry to providing data compared with running a personal web server and the ability to operate behind routers employing network address translation (nat) make the use of biotorrents for less popular datasets still beneficial. in addition, biotorrents allows researchers to make their data, software, and analyses available instantly, without the requirement of an official submission process or accompanying manuscript. this form of data publishing allows open and rapid access to information that would expedite science, especially for time-sensitive events such as the recent outbreaks of influenza h n [ ] or severe acute respiratory syndrome (sars) [ ]. no matter what the circumstance, biotorrents provides a useful resource for advancing the sharing of open scientific information. the source code for biotorrents.net was derived from the tbdev.net (http://tbdev.net) gnu general public licensed (gpl) project. the dynamic web pages are coded in php, with some features being implemented with javascript. all information, including information about users, torrents, and discussion forums, is stored in a mysql database. the original source code was altered in various ways to allow easier use of biotorrents by scientists; the most significant being that anyone can download torrents without signing up for an account. in addition, torrents can be classified by various categories and license types, and grouped with other alternative versions of torrents. the biotorrents web server along with the source code is available freely under the gnu general public license at http://www.biotorrents.net.
references:
big data: how do your data grow?
prepublication data sharing
postpublication sharing of data and tools
the functional genomics experiment model (fuge): an extensible framework for standards in functional genomics
the minimum information required for reporting a molecular interaction experiment (mimix)
the minimum information about a proteomics experiment (miape)
taking the first steps towards a standard for reporting on phylogenies: minimum information about a phylogenetic analysis (miapa)
minimum information about a microarray experiment (miame): toward standards for microarray data
omics data sharing
a system for management of information in distributed biomedical collaboratories
computational representation of biological systems
mimas . is a multiomics information management and annotation system
hypertext transfer protocol -http/ .
internet engineering task force rfc , file transfer protocol (ftp)
incentives build robustness in bittorrent
low extra delay background transport (ledbat)
emergence and pandemic potential of swine-origin h n influenza virus
the genome sequence of the sars-associated coronavirus
the authors would like to thank elizabeth wilbanks and aaron darling for reading and editing of the manuscript, and mike chelen and reviewer andrew perry for suggesting several improvements for the biotorrents website. conceived and designed the experiments: ml jae. performed the experiments: ml. wrote the paper: ml jae.
key: cord- - drl xas authors: farah, i.; lalli, g.; baker, d.; schumacher, a.
title: a global omics data sharing and analytics marketplace: case study of a rapid data covid- pandemic response platform. date: - - journal: nan doi: . / . . . sha: doc_id: cord_uid: drl xas under public health emergencies, particularly an early epidemic, it is fundamental that genetic and other healthcare data are shared across borders in both a timely and accurate manner before the outbreak of a global pandemic. however, although the covid- pandemic has created a tidal wave of data, most patient data is siloed, not easily accessible, and, due to low sample size, largely not actionable. based on the precision medicine platform shivom, a novel and secure data sharing and data analytics marketplace, we developed a versatile pandemic preparedness platform that allows healthcare professionals to rapidly share and analyze genetic data. the platform solves several problems of the global medical and research community, such as siloed data, cross-border data sharing, lack of state-of-the-art analytic tools, gdpr compliance, and ease of use. the platform serves as a central marketplace of 'discoverability'. the platform combines patient genomic & omics data sets, a marketplace for ai & bioinformatics algorithms, new diagnostic tools, and data-sharing capabilities to advance virus epidemiology and biomarker discovery. the bioinformatics marketplace contains some preinstalled covid- pipelines to analyze virus and host genomes without the need for bioinformatics expertise. the platform will be the quickest way to rapidly gain insight into the association between virus-host interactions and covid- in various populations, which can have a significant impact on managing the current pandemic and potential future disease outbreaks. in december , a cluster of severe pneumonia cases epidemiologically linked to an open-air live animal market was identified in the city of wuhan, china. the sudden spread of this deadly disease, later called covid- , led local health officials to issue an epidemiological alert to the world health organization (who) and the chinese center for disease control and prevention. quickly, the etiological agent of the unknown disease was found to be a coronavirus, sars-cov- , belonging to the same subgenus as the sars virus that caused a deadly global outbreak in . by reconstructing the evolutionary history of sars-cov- , an international research team has discovered that the lineage that gave rise to the virus has been circulating in horseshoe bats for decades and likely includes other viruses with the ability to infect humans. the team concluded that the global health system was too late in responding to the initial sars-cov- outbreak, and that preventing future pandemics will require the implementation of human disease surveillance systems that are able to quickly identify novel pathogens and their interaction with the human host and respond in real time. the key to successful surveillance is knowing which viruses caused the new outbreak and how the host genome affects the severity of disease. in outbreaks of zoonotic pathogens, identification of the infection source is crucial because this may allow health authorities to separate human populations from the pathogen source (e.g. wildlife or domestic animal reservoirs) posing the zoonotic risk.
such surveillance measures are easily implemented with the pandemic preparedness platform showcased in this report, which is based on the novel data sharing and data analytics marketplace shivom (https://www.shivom.io/). the platform is a proven research ecosystem used by universities, biotech, and bioinformatics organizations to share and analyze omics data and can be used for a variety of use cases, from precision medicine, drug discovery, and translational science to building data repositories and tackling a disease outbreak. in the current coronavirus outbreak, the global research community is mounting a large-scale public health response to understand the pathogen and combat the pandemic. vast amounts of new data are being generated and need to be analyzed in the context of pre-existing data and shared for solution-centered research approaches to combat the disease and inform public health policies. however, research showed that the vast majority of clinical studies on covid- do not meet proper quality standards due to small sample size. our approach is designed to provide healthcare professionals with an urgently needed platform to find and analyze genetic data, and to securely and anonymously share sensitive patient data to fight the disease outbreak. no patient data can be traced back to the data donor (patient or data custodian) because only algorithms can access the raw data. researchers querying the data-hub have no access to the genetic data source, i.e. no raw data (e.g. no sequencing reads can be downloaded). instead, researchers and data analysts can run predefined bioinformatics and ai algorithms on the data. only summary statistics are provided (e.g. gwas, phewas, polygenic risk score analyses). in addition to genetics bioinformatics pipelines, specialized covid- bioinformatics tools are available (e.g., assembly, mapping, and metagenomics pipelines). using these open tools, researchers can quickly understand why some people develop severe illness while others show only mild symptoms or no symptoms whatsoever. we might hypothesize that some people harbor protective gene variants, or that their gene regulation reacts differently when the virus attacks the host. the usability of digital healthcare data is limited when it is trapped in silos, limiting its maximum potential value. examples of covid- data depositories include the european commission covid- data portal, the canadian covid genomics network (cancogen), genomics england and the genomicc (genetics of mortality in critical care) consortium, the covid- host genetics initiative, the national covid cohort collaborative (n c), the covid- genomics uk (cog-uk) consortium, the secure collective covid- research (scor) consortium, the international covid- data research alliance, and the covid human genetic effort consortium, among several others. all of them, while very valuable for fighting the pandemic, present complicated access hurdles, limited data-sharing capabilities, and missing interoperability and analytics capabilities that limit their usefulness. compartmentalized information limits our healthcare system's ability to make large strides in research and development, restricting the available benefits that can be generated for the public. this is especially true for omics data. if omics data silos are broken up and can be globally aggregated, we believe that society will realize a gain of value that observes accelerating returns.
if individuals and medical centers around the world shared their genomic data more fully, then as a whole this data would be significantly more useful. in particular, the role of host genetics in susceptibility to and severity of covid- has been studied only in a few small cohorts. expanding the opportunities to find patients somewhere in the world who possess unusual mutations and phenotypes relevant to national efforts, or who have sars-cov- resistant mutations, would shift precision medicine to another level. by combining genome-sequenced data with health records, researchers and clinicians would have an invaluable resource that can be used to improve patient outcomes and additionally be used to investigate the causes and treatments of diseases. for big data approaches to thrive, however, historical barriers dividing research groups and institutions need to be broken down, and a new era of open, collaborative and data-driven science needs to be ushered in. there is also a cost to using and producing data, and if this cost is shared amongst many stakeholders, new data sources that significantly outperform older databases can be built. another obstacle is data generation itself. genomes have to come from somewhere, and currently researchers have limited access to them. healthcare organizations and private individuals often have restricted access to their own data, and even when they do, they are limited in the ways they can use this data or share it with third parties (e.g. for scientific studies). computer scientists, bioinformaticians, and machine learning researchers are tackling the pandemic the way they know how: compiling datasets and building algorithms to learn from them. but the vast majority of data collected is not accessible to the public, partly because there is no platform that offers anonymized data sharing on a global level while addressing privacy concerns and regulatory hurdles. what is needed is a platform that can store data in a secure, anonymized, and impenetrable manner with the goal of preventing malicious attacks or unauthorized access. while ensuring access controls are maintained and satisfied, the database must also be user friendly, publicly accessible, and searchable. these properties make the data in the database usable, meaning it must be easy to use in population health studies, pharmaceutical r&d, and personal genomics. in addition, such a platform must be accessible on a global scale, as well as offer data provenance and auditing features. we believe that for such a platform to be successful, adding new genomic data to the platform must be trivial and accessible to everyone. we aim to provide such a platform by using convergent technologies including genomics, cryptography, distributed ledgers, and artificial intelligence. global pandemics come and go. the tragic impacts of the unusually deadly influenza pandemic in , called the spanish flu, or the bubonic plague that killed as many as one-third of europe's and north africa's population in the th century remind us that covid- wasn't the first global pandemic, and it is clear that it won't be the last. in contrast to the middle ages, we now have the technological tools and exceptional scientific networks to fight pandemics.
already, the covid- pandemic has demonstrated that sharing data can improve clinical outcomes for patients, and that quick access to data literally has life-or-death consequences in healthcare delivery. however, only a minority of healthcare organizations are participating in data-sharing efforts. without proper coordination amongst initiatives and agencies, and without making the efforts public, these initiatives run the risk of duplicating work or missing opportunities, resulting in slower progress, economic inefficiencies, and further data silos. comprehensive solutions to problems such as the sars-cov- virus outbreak require genuine international cooperation, not only between governments and the private sector, but also from all levels of our health systems. researchers from all over the world need to be able to act quickly, without waiting for international initiatives to slowly coordinate their efforts, to understand early local outbreaks and to start intervention measures without delay. new solutions to tackle the pandemic also need early adopters who want to move away from our archaic data information systems that can hamstring real-time detection of emerging disease threats. many research and innovation actors have reoriented some of their previously funded activities towards covid- , but often with little guidance from policy makers. despite all complications and concerns, the advantages of sharing data quickly and safely can be significant. a lack of coordination at the international level on numerous initiatives launched by different institutions to combat covid- resulted in notably reduced cooperation and exchanges of data and results between projects, limited interoperability, as well as lower data quality and interpretation. open data plays a major role in the global response to large-scale outbreaks. for instance, analysis of the ebola outbreak points to the importance of open data, including genomic data, to generate learnings about diseases for which effective vaccines are lacking or have not been developed. similarly, open data played a major role in the response to the zika outbreak, with a commitment from leading national agencies and science organizations to share data. with the shivom platform, it is possible to help the global research community fast-track the discovery of new preventive measures and optimize logistic solutions. moreover, this data lends itself naturally to predicting and preparing for future, inevitable disease outbreaks. a commitment to open access in genomics research has found widespread backing in science and health policy circles, but data repositories derived from human subjects may have to operate under managed access, to protect privacy, align with participant consent and ensure that the resource can be managed in a sustainable way. data access committees need to be flexible, to cope with changing technology and opportunities and threats from the wider data sharing environment. the spread of sars-cov- has taught us that prevention relies on community engagement and has exposed the fragility in our research relationships. in an ideal scenario, health workers, such as doctors, virologists, and data custodians at data repositories, biobanks and in pharma, as well as students at universities, should be able to quickly collect, share, and analyze health data in a global, cross-border manner without complicated, lengthy access procedures.
the ability to share data securely should not be restricted to selected specialists in the field but should be possible for all healthcare workers; as such, the access and sharing procedures need to be easy. data collected during a pandemic should not only be accessible but also analyzable, meaning all stakeholders and the public should have the ability to analyze the data, not only the data custodians or project leads of research consortia. with such an affordable and easy-to-use platform in place, even developing countries are enabled to work together to reduce a pathogen's spread and minimize its impact, despite severe resource constraints. however, our health systems are, more often than not, conservative and innovation averse. what is needed is true leadership in the research community, leaders who are able to adopt new solutions and are ready to tackle the uncertainty a situation like covid- creates. there is no value in competing for data during a global crisis; instead, humanity needs to collaborate more than ever before to defeat a common enemy that has taken the world hostage. collective solutions that provide a 'one-stop shop' for the centralization of information on pathogen-host interactions can help ensure that appropriate conditions for collaborative research and sharing of preliminary research findings and data are in place to reap their full benefits. the provided platform offers such a one-stop shop solution, without being an exclusive storage solution; data shared on the platform can also be collected and shared elsewhere. beyond short-term policy responses to covid- , the shivom platform supports research and innovation that could help tackle future epidemics early, on a national or international scale. the shivom data analytics platform was built on a principle of open science, transparency, and collaboration and has been designed to give researchers a one-stop ecosystem to easily, securely and rapidly share and analyze their omics data, therefore simplifying data acquisition and data analysis processes. no coding proficiency is needed for using the platform, so everyone can use it and get results, minimizing work times. our organization started out with the mission to accelerate data access, easy analysis, and exchange across international and organizational boundaries to promote equal access to healthcare for all. the platform facilitates sharing and analysis of genetic, biomedical, and health-related data in a safe and trusted environment and contains two connected marketplaces, one for omics data and one for bioinformatics/ai pipelines and tools (see fig. ).
figure : data flow in the shivom platform. the user does not need in-depth bioinformatics expertise and is guided through all necessary steps from data upload to data analytics. basically, data from a variety of sources is uploaded by the data custodian and data permission rules are set. if the data is set to open or monetizable, it will be searchable in the data marketplace. once a dataset of interest is identified and filtered, the user can select a bioinformatic workflow (pipeline), configure user-specific parameters if needed, and then start the analysis.
the user can then investigate and share the resulting report and, depending on the workflow, run further downstream analysis. it was shown recently that clinical data collection with rapid sequencing of patients is a valuable tool to investigate suspected healthcare-associated covid- cases. similarly, the shivom platform can enable quick insight into local outbreaks and identify opportunities to target infection-control interventions to further reduce healthcare-associated infections. if widely adopted by researchers and doctors around the globe, this wealth of genomic and healthcare information is a powerful tool to study, predict, diagnose and guide the treatment of covid- . the information from combined epidemiological and genomic investigations can be fed back to the clinical, infection control, and hospital management teams to trigger further investigations and measures to control the pandemic. signing up on the platform is easy and doesn't require complicated access procedures. the idea behind this mechanism is to provide access to clinical and health data and analytics tools for everyone. calls for data sharing have mostly been restricted to publicly funded research, but we and others argue that the distinction between publicly funded and industry-funded research is an artificial and irrelevant one. the patient should be at the center of precision medicine, which should override commercial interests and regulatory hurdles. data sharing would lead to tremendous benefits for patients, progress in science, and rational use of healthcare resources based on evidence. as such, the shivom platform stands out for its ease of use from the very first steps; the only requirement is a valid email, preferably institutional in case the user wants to use advanced features that come with institutional plans. the user needs to set a password and, if required, can utilize an additional verification step to add an extra layer of security. the personal data stored by the platform is protected under gdpr regulations whether it is hosted in the european union or in another country on other continents. after the sign-up, users are directed to a settings and profile page that allows them to add additional information about the user and organization, research interests and dataset needs, or to switch between light and dark mode to make the user experience more personalized. the personal profile can be changed at any time. depending on the account type (e.g. academic group or company), the user can invite team members to the workspace to work on projects together and to share data and bioinformatics pipelines. team members can easily be assigned user roles, i.e. any new user can be upgraded to administrator for the account. team members can be removed from the account at any time by the administrator (project lead or manager). with the 'manage plan' tab, the platform can be adapted and scaled to meet the requirements of the users, from single/private users (e.g. students or freelance bioinformaticians), academia and startups, to small and medium enterprises with vast amounts of data and unique requirements, as well as pharmaceutical organizations looking for custom solutions.
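the data flow sketched in the figure above (upload, permission setting, marketplace discoverability, pipeline selection) can be summarized in a small illustrative model; the class and field names below are assumptions made purely for illustration and do not reflect the platform's internal data model or api.

# a minimal, purely illustrative model of the described data flow: upload a
# dataset, set a permission level, and gate marketplace discoverability on
# that level before a pipeline is selected and run.
from dataclasses import dataclass, field

DISCOVERABLE = {"open", "monetizable"}         # "private" data stays hidden


@dataclass
class Dataset:
    name: str
    vcf_file: str
    permission: str = "private"                # open / private / monetizable
    metadata: dict = field(default_factory=dict)

    def is_discoverable(self) -> bool:
        """only open and monetizable data appear in marketplace searches."""
        return self.permission in DISCOVERABLE


@dataclass
class AnalysisRequest:
    pipeline: str                              # e.g. "gwas"
    datasets: list
    parameters: dict = field(default_factory=dict)


cohort = Dataset("example-covid19-cohort", "cohort.vcf.gz",
                 permission="open", metadata={"disease": "covid-19"})
if cohort.is_discoverable():
    request = AnalysisRequest("gwas", [cohort], {"quality_control": True})
    print("would run", request.pipeline, "on", len(request.datasets), "dataset(s)")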
the shivom platform was designed to make the process of uploading and sharing covid- and other genetic data as easy and simple as possible. the reason is that access to the data sharing and analysis features should not be restricted to a few trained experts (data analysts or bioinformaticians) but should be easy for all healthcare professionals. importing records (genetic or metadata files) into the user's workspace is done via the 'my datasets' page. in the case of sequencing data, the user will primarily choose to upload variant call format (vcf) files. vcf is a tab-delimited text file format, often stored in a compressed manner, that contains meta-information lines, a header, and then data lines, each containing information about a position in the genome. the format also has the ability to contain genotype information on samples for each position in the genome (virus or host). vcf is the preferred format on the platform because it is unambiguous, scalable and flexible, allowing extra information to be added to the info field. many millions of variants can be stored in a single vcf file, and most bioinformatics pipelines on the platform use vcf files as input. most ngs pipelines use the fastq file format, the most widely used format, generally delivered from a sequencer. the fastq format is a text-based format for storing both a nucleotide sequence and its corresponding quality scores. fastq.gz is the preferred file format for the standard covid- pipelines preinstalled on the shivom platform. despite vcf being very popular, especially in the statistical genetics area, there are also other file formats that can be used at upload or directly at the point of configuring pipeline parameters. the purpose of these other genetic file formats is to reduce file size, from many vcfs to just three distinct files: a) .bed (plink binary biallelic genotype table), which contains the genotype call at biallelic variants; b) .bim (plink extended map file), which also includes the names of the alleles (chromosome, snp, cm, base-position, allele , allele ); and c) .fam (plink sample information file), which is the last of the files and contains all the details with regard to the individuals, including whether there are parents in the datasets and the sex (male/female). this file will also contain information on the phenotypes/traits of the individuals in the cohort. ideally, the sequencing file is accompanied by phenotypic metadata (e.g. whether it is patient or control data, age, height, comorbidities, etc.) describing the uploaded datasets (see metadata). the metadata file is uploaded as a csv file, a comma-separated values file, which allows data to be saved in a tabular format. csv files can be used with almost any spreadsheet program, such as microsoft excel or google spreadsheets. for ease of use, a sample template file is provided on the platform. in the next step, the user has the option to save specific data sharing permission settings related to the uploaded dataset to allow for fine-grained data sharing with others. these unique permission settings transform the data analytics platform into a massive data marketplace. providing health information for a data marketplace and sharing it is key to leveraging the full potential of data in precision medicine and the fight against pandemics.
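purely as an illustration of the vcf layout described above (meta-information lines, a single header line, tab-delimited data lines), the records of a possibly gzipped vcf file can be read with nothing but the python standard library; the file name is an example, and production pipelines use dedicated tools rather than a sketch like this.

# a minimal sketch of the vcf layout: "##" meta-information lines, one
# "#CHROM ..." header line, then tab-delimited data lines.
import gzip


def iter_vcf_records(path):
    """yield one dict per data line of a (possibly gzipped) vcf file."""
    opener = gzip.open if path.endswith(".gz") else open
    header = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("##"):
                continue                                  # meta-information lines
            if line.startswith("#CHROM"):
                header = line.lstrip("#").split("\t")     # column names
                continue
            fields = line.split("\t")
            yield dict(zip(header, fields))               # CHROM, POS, ID, REF, ALT, ...


for record in iter_vcf_records("patient.vcf.gz"):         # example file name
    print(record["CHROM"], record["POS"], record["REF"], ">", record["ALT"])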
the marketplace is a central point of entry for users to find needed data faster and in a more structured way. in addition, interoperability can be created through the marketplace, which supports cross-sectoral applications. unrestricted access to relevant pandemic data in particular will become increasingly important in the future. the user has three options:
open -all data designated open will be available for the public to analyze. however, open does not mean that the data is downloadable; this permission setting only allows algorithms/pipelines to use the data in the data analytics module. all open data is searchable on the platform. being a data-hub for covid- data, we highly encourage all researchers and data custodians to make their data open/public and accessible for the scientific community, thereby increasing the amount of open-source covid- data currently available and usable.
private -private data will be stored in a secure, private environment, is only available and analyzable by the data owner and team members from the same organization or project, and will not show up in data search results. this setting is usually used when data is not supposed to be shared (or only within a consortium), or if the data needs to be analyzed or published before sharing commences.
monetizable -by choosing this option, the data owner/custodian authorizes the marketplace to license their information on their behalf following defined terms and conditions. in this context, data consumers can play a dual role by providing data back to the marketplace. data marketplaces make it possible to share and monetize different types of information to create incremental value. by combining information and analytical models and structures to generate incentives for data suppliers, more participants will deliver data to the platform and the precision medicine ecosystem. monetizing pandemic data, i.e. making datasets available to researchers at a cost, can make sense if a laboratory is not otherwise able to finance the data collection, e.g. when located in an underfunded developing nation.
at any time, the permission settings of data files can be updated. the settings have an audit trail, and the platform owner or server host are not able to change the settings; only the data owner/custodian is able to modify them. the next step during data import is to add additional information to the dataset, such as a dataset name, a digital object identifier (doi), health status, author or company name, a description of the dataset, or the name of the genotyping or sequencing system used (a list of sequencing platforms is available in a drop-down menu). other features will be added in later versions. after requesting the user's digital signature to confirm the operations carried out so far, the data will be uploaded, and the user will be redirected to a page containing a summary of his/her datasets. during the upload, the data is checked for quality metrics and for duplicates (compared to data already existing on the data marketplace) by internal algorithms that prevent monetization of datasets that were shared by others before. genetic data is largely transformed into a common format as part of the data curation process. the platform also starts automated proprietary anti-fraud mechanisms to identify data that was tampered with.
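the platform's duplicate and anti-fraud checks are proprietary and not described in detail here; purely to illustrate the idea of detecting a re-uploaded dataset, one simple approach is to fingerprint the normalized variant content of an upload and compare it against fingerprints of datasets already listed, as sketched below (the file names and the fingerprint scheme are assumptions).

# an illustrative duplicate check: hash the sorted, normalized variant lines
# of an uploaded vcf and refuse re-listing of an identical dataset.
import gzip
import hashlib


def vcf_fingerprint(path):
    """sha-256 over the sorted, normalized variant lines of a vcf file."""
    opener = gzip.open if path.endswith(".gz") else open
    variants = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue                                  # skip meta and header lines
            chrom, pos, _id, ref, alt = line.split("\t")[:5]
            variants.append(f"{chrom}:{pos}:{ref}:{alt}")
    return hashlib.sha256("\n".join(sorted(variants)).encode()).hexdigest()


known_fingerprints = set()                                # fingerprints already listed

fp = vcf_fingerprint("upload.vcf.gz")                     # example upload
if fp in known_fingerprints:
    print("duplicate of an existing dataset - not listed for monetization")
else:
    known_fingerprints.add(fp)
    print("new dataset accepted, fingerprint", fp[:12], "...")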
all data in a user's workspace are easily searchable and can be sorted by therapeutic area (based on a collection of biomedical ontologies) and other parameters. generally, the workspace allows users to build larger datasets over time and, at any point, to share data whenever it is needed, or to revoke access. there are several potential use cases for using the platform as storage space for data. first of all, having a sharing infrastructure will enable healthcare professionals to quickly accumulate and share anonymized data with the public (e.g. sequencing data from covid- patients), which is particularly important during an early disease outbreak. second, the platform is an ideal storage infrastructure for data custodians (e.g. biobanks, direct-to-consumer genetic companies, or patient support groups) that inherently have large curated patient and volunteer datasets. the data can then be monetized, but more importantly, the datasets can be made available to the wider research and scientific community to make medical breakthroughs. since the basis of the platform is precision medicine, at this point in time data custodians are encouraged to store their data as vcf files; however, a plethora of other file formats are planned to be added to the platform in time. vcf files are particularly useful as they are one of the final file formats of most genomic technologies across both microarray chips and sequencing (genome vcf/gvcf). they are also the most common file format when it comes to secondary genetic analysis, and equally their smaller file size reduces storage costs. third, the platform can be used as a repository for research data during a peer-review publication process. researchers are increasingly encouraged, or even mandated, to make their research data available, accessible, discoverable and usable. for example, the explicit permission settings in the platform would allow researchers to grant the journal's reviewers access to their data and analyses prior to publication. this time-restricted data sharing can be valuable when data should not yet be available to the wider public. fourth, the platform is an ideal data sharing infrastructure for research consortia. groups that are located in different countries can easily share data and results and build larger, statistically more relevant datasets. such a procedure is particularly valuable for pre-competitive consortia, as it significantly lowers the costs of accumulating valuable datasets, while the in-house research can be kept confidential. the shivom platform was developed with a comprehensive security strategy to protect user data privacy and security. the central pillar of the shivom data-hub is patient data privacy and confidentiality.
the duty of confidentiality has been a medical cornerstone since the hippocratic oath, dating back to the th century bc. on the other hand, the right to privacy is a relatively recent juridical concept. the emergence of genetic databases, direct-to-consumer genetics, and electronic medical record keeping has led to increased interest in analyzing historical patient and control data for research purposes and drug development. such research use of patient data, however, raises concerns about confidentiality and institutional liability. obviously, doctors and other healthcare workers have to make sure that no private and protected information of a covid- patient, such as name, address, phone number, email address, or biometric identifiers, is exposed to third parties. as such, institutional review boards and data marketplaces must balance patient data security with a researcher's ability to explore potentially important clinical, environmental and socioeconomic relationships. in this context, the platform was designed from the beginning to be gdpr compliant and to provide full anonymization of patient records (privacy by design). the advantage of this approach is that personal or confidential data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data. as such, by making it impossible to connect personal data to an identifiable person, data controllers (i.e. the data custodian or patient who can upload and manage data) and processors (researchers) are permitted to use, process and publish personal information in just about any way that they choose. of course, for data to be truly anonymized, the anonymization must be irreversible. anonymization will provide opportunities for data controllers to use personal data in more innovative ways for the greater good. full anonymization is achieved with several steps. first of all, no private, identifiable data of patients is stored together with clinical data. second, and most importantly, the platform provides only summary statistics/analytics on the data. that means sharing of data only allows bioinformatics pipelines or machine learning algorithms to touch the data; no raw data (e.g. dna sequencing reads) can be downloaded or viewed by the data user/processor. all data must be stripped of any information that could lead to identification, either directly or indirectly, for instance by combining data with other available information, such as genealogy databases. however, at this point in time, the consent mechanisms as well as the process of removing private or confidential information from raw data remain with the data custodian, for example the hospital that uploads the covid- related sequences. in this context, it is important to understand that only the data owner/custodian has all the access rights and can manage the data, such as granting permissions to use it. the platform provider has no power to manage the data and only has the possibility to identify duplicate datasets, which will automatically be excluded from the data marketplace. other security features of the platform are that no statistical analyses are possible on single individuals, and that only vetted algorithms and bioinformatics pipelines are available on the analytics marketplace, to prevent algorithms from being deployed to specifically read out raw data from individuals or to cross-link data from individuals to other, re-identifying databases.
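the 'summary statistics only' access pattern described above can be illustrated with a minimal sketch: callers never see genotype records, only aggregates, and queries on groups smaller than a minimum cohort size are refused so that no statistics can be computed on single individuals. the threshold and data layout below are illustrative assumptions, not the platform's actual values.

# a minimal sketch of aggregate-only access with a minimum cohort size.
MIN_COHORT_SIZE = 50    # assumed minimum group size for any aggregate query


def allele_frequency(genotypes):
    """return the alternate-allele frequency for a cohort, or None if the
    cohort is too small to release an aggregate safely.

    `genotypes` is a list of 0/1/2 alternate-allele counts per individual."""
    if len(genotypes) < MIN_COHORT_SIZE:
        return None     # refuse: would leak individual-level information
    return sum(genotypes) / (2 * len(genotypes))


cohort = [0, 1, 2, 1, 0] * 20                     # 100 example individuals
print("alt allele frequency:", allele_frequency(cohort))
print("single-individual query:", allele_frequency([2]))   # -> None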
by harnessing the power of cloud computing, the platform enables data analysts and healthcare professionals with limited or no computational and data analysis training to analyze and make sense of genomic and covid- data in the fastest and most inexpensive way possible. the genetic data stored with the platform is handled and managed with state-of-the-art technology and processes, using a combination of security layers that keep data and management information (cryptographic keys, permission settings, etc.) safe. currently, all the covid- related data is uploaded to amazon web services (aws) s buckets that provide data storage with high durability and availability. like other services, s denies access from most sources by default. the s storage service allows a bucket to be devoted to each individual application, group, or even an individual user doing work. access to the buckets is regulated via the platform on different, non-aws server(s), and all information related to the access is encrypted using state-of-the-art key management services (kms). in addition, no user/custodian data is stored alongside genetic data; it is completely separated and encrypted using different cloud storage solutions. one powerful feature of our solution is the ability to maintain an immutable data-access audit trail via blockchain. this immutability property is crucial in scenarios where auditability is desired, such as in maintaining access logs for sensitive healthcare, genetic and biometric data. in precision medicine, access audit trails provide guarantees that, as multiple parties (e.g., researchers, laboratories, hospitals, insurance companies, patients and data custodians) access this data, an immutable record of all queries is created [ ] [ ] [ ]. since blockchains are distributed, this means that no individual party can change or manipulate logs. such a feature can help journal reviewers and research collaborators to protect themselves from inflation bias, also known as "p-hacking" or "selective reporting". such bias can be the cause of misreporting or misinterpretation of true effect sizes in published studies and occurs when researchers try out several statistical analyses and/or switch between various data sets and eligibility specifications, and then selectively report only those that produce significant results. all permission settings stored in the system are coded into a blockchain, guaranteeing that nobody can tamper with the permission settings provided by the data owner. each data transfer on the platform is represented by a cryptographic transaction hash. hashing means taking an input string of any length and giving out an output of a fixed length. in the context of blockchain technology, the transactions are taken as input and run through a hashing algorithm (e.g. a hash function belonging to the keccak family, the same family to which the sha- hash functions belong), which gives an output of fixed length that is stored immutably on the chain. a transaction is always cryptographically signed by the system. this makes it straightforward to guard access to specific modifications of the database.
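the tamper evidence provided by such a chain of hashes can be illustrated with a minimal, purely local sketch in which each log entry hashes its own content together with the previous entry's hash (using sha3-256, a standardized member of the keccak family mentioned above); the real platform anchors such records on a distributed ledger rather than in a local list.

# a minimal sketch of a tamper-evident, hash-chained access log: altering any
# past record breaks the chain and is detected on verification.
import hashlib
import json
import time

GENESIS = "0" * 64


def append_entry(log, actor, dataset_id, action):
    prev_hash = log[-1]["hash"] if log else GENESIS
    record = {
        "timestamp": time.time(),
        "actor": actor,
        "dataset": dataset_id,
        "action": action,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha3_256(payload).hexdigest()
    log.append(record)


def verify(log):
    """recompute every hash; return False if any entry was altered."""
    prev_hash = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev_hash:
            return False
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha3_256(payload).hexdigest() != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, "researcher-42", "covid19-cohort-A", "ran gwas pipeline")
append_entry(audit_log, "researcher-7", "covid19-cohort-A", "viewed summary report")
print("log intact:", verify(audit_log))            # True
audit_log[0]["action"] = "downloaded raw reads"    # simulated tampering
print("log intact:", verify(audit_log))            # False - tampering detected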
transactions such as monetization of datasets are encoded as smart contracts on the blockchain. a "smart contract" is simply a piece of code that runs on a distributed ledger, currently ethereum. it is called a "contract" because code that runs on the blockchain can control digital assets and instructions. practically speaking, each contract is an immutable computer program that runs deterministically in the context of a virtual machine as part of the ethereum network protocol, i.e., on the decentralized ethereum world computer. once deployed, the code of a smart contract cannot change. unlike with traditional software, the only way to modify a smart contract is to deploy a new instance, making it immutable. all the contracts used by the shivom platform only run if they are called by a transaction, e.g. when a file is used in a data analytics pipeline. in this case, a smart contract is triggered that can, if the file permissions are set to monetizable, transfer a fee to the electronic wallet of the data owner (a simplified sketch of this logic is given below). the outcome of the execution of the smart contract is the same for everyone who runs it, given the context of the transaction that initiated its execution and the state of the blockchain at the moment of execution. if needed, smart contracts can be put in place to accommodate specific terms needed in more complex research or drug development agreements, e.g. if datasets belong to a consortium or a biobank, or if there are data lock constraints (e.g. an embargo on data release) and usual access procedures are complicated. for example, most biobank contracts for accessing data are long, arduous, detailed documents with complicated language. they require lawyers and consultants to frame them, to decipher them, and, if need be, to defend them. those processes can be made much simpler by applying self-executing, self-enforcing contracts, providing a tremendous opportunity for use in any research field that relies on data to drive transactions. these contracts already possess multiple advantages over traditional arrangements, including accuracy and transparency. the terms and conditions of these contracts are fully visible and accessible to all relevant parties. there is no way to dispute them once the contract is established, which facilitates total transparency of the transaction to all concerned parties. other advantages are speed (execution in almost real time), security (the highest level of data encryption is used), efficiency, and foremost trust, as the transparent, autonomous, and secure nature of the agreement removes any possibility of manipulation, bias, or error.
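the platform's actual contracts run on ethereum and are not reproduced here; the following plain-python sketch only simulates the rule described above, namely that using a dataset whose permission is set to monetizable in an analytics pipeline credits a usage fee to the data owner's wallet. the fee amount and account layout are illustrative assumptions.

# a purely illustrative, off-chain simulation of the monetization rule.
wallets = {"data_owner_A": 0.0}               # owner wallet balance (example)

dataset = {
    "id": "covid19-cohort-A",
    "owner": "data_owner_A",
    "permission": "monetizable",
    "usage_fee": 5.0,                         # assumed flat fee per pipeline run
}


def on_pipeline_run(ds, requester):
    """settle a usage fee when a monetizable dataset is used in a pipeline."""
    if ds["permission"] != "monetizable":
        return                                # open or private data: no fee due
    wallets[ds["owner"]] += ds["usage_fee"]   # credit the data owner's wallet
    print(f"{requester} analyzed {ds['id']}; fee credited to {ds['owner']}")


on_pipeline_run(dataset, "research_lab_X")
print(wallets)                                # {'data_owner_A': 5.0}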
advances in data capture, diagnostics and sensor technologies are expanding the scope of personalized medicine beyond genotypes, providing new opportunities for developing richer and more dynamic multi-scale models of individual health. collecting and sharing phenotypic and socioeconomic data from patients and healthy controls, in combination with their genotype and other omics profiles, presents new opportunities for mapping and exploring the critical yet poorly characterized "phenome" and "envirome" dimensions of personalized medicine. therefore, in the context of covid- , metadata should, if available, be shared next to the sequencing information of the patients. such metadata can contain information with regard to the comorbidities, ethnic background, and treatment and support that patients receive, rather than biological or clinical data. currently, metadata is uploaded to the platform as a csv file, which allows almost all phenotypic data to be saved in a tabular format. other inputs may be available in the future, e.g. to parse information directly from a medical record or from wearables. using a simple tabular format also allows it to be source agnostic; e.g., it is not difficult to combine genetic information with patient questionnaires. indeed, patient support groups, direct-to-consumer genetic companies or patient registries are less likely to have biological data types in their repositories and often have only metadata that comes from questionnaires that volunteers have filled in. an example of this is the dementia platform uk (dpuk), a patient registry that comprises thousands of individuals from across dozens of independent cohorts, which allows a variety of analyses to be conducted, including machine learning and mendelian randomization techniques. the platform is multi-purpose built, which allows all types of data that are alphanumerically encoded to be stored. other platforms are engineered specifically for a particular type of cohort, e.g. tailored towards analyzing biobank data. the shivom platform is completely source agnostic and is engineered to allow large cohort datasets to be uploaded regardless of biobanks or cohorts. additionally, the platform will allow for biostatistical (non-genetic), epidemiological and machine learning analysis types to be undertaken. the metadata is structured in a simple and intuitive way in order to allow the user to quickly query the dataset. depending on whether the data being uploaded has accompanying genetics information, the first column of the file will either contain the name of the vcf file or the first metadata attribute. csv files are supported by nearly all data upload interfaces and they are easier to manage if data owners are planning to move data between platforms, or export and import it from one interface to another. a template that illustrates how metadata can be organized, including a standard metadata set (e.g. height, smoking status, gender, etc.), is provided on the platform. for covid- datasets (provided as vcf) we encourage data providers to share as many phenotypic, clinical and socioeconomic data as possible, keeping in mind that all data should be anonymized. working with several covid- consortia, we established a guideline of what an ideal dataset should include, keeping in mind that many clinical data are usually not available (see supplement for metadata structure):
• ethnicity, age, gender, height, weight
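a metadata file following this guideline can be produced with any spreadsheet program; purely as an illustration, the sketch below writes such a csv with python's csv module, using the vcf file name as the first column and the guideline fields listed above. the rows are dummy example values, and further clinical or socioeconomic columns can be appended as needed.

# a minimal sketch of writing the tabular metadata file in csv format.
import csv

columns = ["vcf_file", "ethnicity", "age", "gender", "height", "weight"]
rows = [
    {"vcf_file": "patient_001.vcf.gz", "ethnicity": "unknown", "age": 54,
     "gender": "female", "height": 168, "weight": 70},
    {"vcf_file": "patient_002.vcf.gz", "ethnicity": "unknown", "age": 61,
     "gender": "male", "height": 180, "weight": 85},
]

with open("cohort_metadata.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=columns)
    writer.writeheader()        # header row with the column names
    writer.writerows(rows)      # one row per sample, dummy values only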
one strength of the platform is that the user can find and combine open-source, in-house, and proprietary datasets in a matter of seconds. the user has the opportunity to perform a free search with advanced search operators (e.g. searching for gene-or disease-name), or can use separate clinical search bars, and therapeutic area, disease or dataset title. the discover window will then be populated with the search results. currently, the search query form supports doid ontology, a comprehensive hierarchical controlled vocabulary for human disease representation. but other standards (efo, icd ) are planned to be added in the future as well. once the data has been selected, it will be possible to further filter it based on the information contained in the associated metadata files, generally gender, diet, country, etc., and also the type of data permission associated. for example, it is possible to select only free data that is not associated with any costs. this fine-grained filtering feature therefore allows to stratify patients based on the information the user is interested in, so that it will be possible to select the most significant data for the analyzes to be conducted. for example, once the database is populated, it should be possible to select covid- patients by ethnicity or smoking status to quickly find confounding factors that may affect the outcome and severity of the disease. once the user has completed the selection of the data that has to go into the data analysis, a simple click on the 'choose pipeline' box will bring the user to the data analytics marketplace for downstream analytics. after selecting datasets for analysis, the user is directed to the data analytics marketplace. the marketplace is designed to provide researchers with or without bioinformatics expertise to perform a wide variety of data analyses and to share in-house pipelines as well as machine learning tools with the scientific community. since the launch of the platform, a variety of standard bioinformatics pipelines/workflows, primarily for secondary analyses, were already added to the marketplace that are free to use, including pipelines to analyze covid- patients. the shivom analytical platform hosts a variety of genomic-based pipelines; from raw assembly and variant calling, to statistical association-based analysis for population-based studies. in time, a vast variety of new data analytics tools and pipelines will be added to the marketplace, including ai analytics features. currently, all provided standard pipelines are built on nextflow technology . nextflow is a domainspecific language that enables rapid pipeline development through the adaptation of existing pipelines written in any scripting language . nextflow technology was chosen for the platform because it is increasingly becoming the standard for pipeline development. its multi-scale containerization makes it possible to bundle entire pipelines, subcomponents and individual tools into their own containers, which is essential for numerical stability. nextflow can use any data structure and outputs are not limited to files but can also include in-memory values and data objects . another key specification of nextflow is its integration with software repositories (including github and bitbucket) and its native support for various cloud systems which provides rapid computation and effective scaling. nextflow is designed specifically for bioinformaticians familiar with programming. 
however, it was our aim to provide a platform that can be used by inexperienced users and without command line input. as such, the shivom data analytics marketplace was designed to provide an easy-to-use graphical user interface (gui) where all pipelines can be easily selected and configured, so that no coding experience is required by the user. in addition to the data marketplace, this feature sets the platform apart from other cloud computing tools that use nextflow or other workflow management tools such as toil, snakemake or bpipe. users can easily automate and standardize analyses (e.g. gwas, meta-gwas, genetic correlation, two-sample mendelian randomization, polygenic risk score, phewas, colocalization, etc.) by selecting all parameters for sets of different analysis activities. this makes it possible to run analyses fully automated, meaning that even new users can start using sophisticated analyses by running a workflow in which all analysis steps are incorporated, i.e. assembled by more experienced bioinformaticians or machine learning experts. findings that cannot be reproduced pose risks to pharmaceutical r&d, causing both delays and significantly higher costs of drug development, and also jeopardize the public trust in science. by using standard pipelines, by allowing third parties to share their workflows on the platform, and by creating reproducible and shareable pipelines using the public nextflow workflow framework, not only do bioinformaticians have to spend less time reinventing the wheel, but we also get closer to the goal of making science reproducible. the disadvantage of many research consortia, including some covid- consortia, is that they do not open the data to all consortium members to analyze the collected data individually, but only provide data analysis by a central coordinator. the summary statistics of the analyzed cohorts are then fed back to the consortium members. for example, the host genetics initiative agreed on a standardized genome-wide association studies (gwas) pipeline. the shivom platform works differently, and all participants of a research consortium (and other third parties if permitted) can analyze the datasets independently. to make gwas analyses available to the whole research community, one of the most important preinstalled pipelines on the platform is a standard gwas analysis workflow for data quality control (qc) and basic association testing. genome-wide association studies are used to identify associations between single nucleotide polymorphisms (snps) and phenotypic traits. gwas aims to identify snps of which the allele frequencies vary systematically as a function of phenotypic trait values. identification of trait-associated snps may reveal new insights into the biological mechanisms underlying these phenotypes. gwas are of particular interest to researchers that study virus-host interactions. the sars-cov- virus does not affect everyone in the same way. some groups seem particularly vulnerable to severe covid- , notably those with existing health conditions or with genetic predispositions.
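the basic association testing step of the gwas workflow described above can be illustrated, purely for explanation, with a per-snp allelic chi-square test on alternate/reference allele counts in cases versus controls; real pipelines add quality control, covariate adjustment and multiple-testing correction, and the counts below are example values.

# a minimal sketch of the allelic association test behind basic gwas:
# a 2x2 chi-square test on allele counts for one snp, standard library only.
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """return (chi-square statistic, p-value) for a 2x2 allele-count table."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))   # chi-square survival function, 1 df
    return stat, p_value


# example: 400 case alleles (120 alt) vs 400 control alleles (80 alt)
stat, p = allelic_chi_square(120, 280, 80, 320)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")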
variation in susceptibility to infectious disease and its consequences depends on environmental factors but also on genetic differences, highlighting the need for covid- host genetics to engage with questions related to the role of genetic susceptibility factors in creating potential inequalities in the ability to work or access public space, stigma, and inequalities in the quality and scope of data. genome sequencing of the coronavirus can provide clues regarding how the virus has evolved and which variants exist within the genome. the highlighted platform contains a set of bioinformatics pipelines that allow the interrogation of host and virus genomes. genetic information might enable targeting therapeutic interventions to those more likely to develop severe illness or protecting them from adverse reactions. in addition, information from those less susceptible to infection with sars-cov- may be valuable in identifying potential therapies. the current version of the platform comes with preinstalled covid- specific pipelines covering assembly statistics, alignment statistics, virus variant calling, and metagenomics analysis, as well as other pipelines that can be used for analyzing covid- patient data, such as gwas or meta-gwas. the research community is encouraged to add more pipelines and machine learning tools to support the fight against the pandemic. all the standard preinstalled covid- pipelines work with fastq sequencing files. the platform works with illumina short reads as well as oxford nanopore long reads. long and short reads differ according to the length of reads produced and the underlying technology used for sequencing. the short read length ( - bp) limits the capability to resolve complex regions with repetitive or heterozygous sequences, so the longer read lengths (up to kbs) carry fundamentally more information; however, these tend to suffer from higher error rates than the short reads. the short reads come in zipped format (.gz) and do not need to be extracted before use; since they are paired-end, each sample will have two files, r and r fastq, that need to be included in an analysis. long reads come as single fastq files. each analysis requires a sample metadata sheet, which specifies the fastq.gz files for each of the samples and other associated metadata files. a template metadata file can be accessed on the platform, in which the user can also find the description of all columns and example values. one of the common analyses done on patient sequences is to identify the pathogens contained in the analyzed sample and to assign taxonomic labels, usually obtained through metagenomic studies. metagenomics allows researchers to create a picture of a patient's pathogens without the need to isolate and culture individual organisms. the prediction power on covid- also depends on data related to underlying morbidities, including other flu-like illnesses; metagenomic testing for other viruses can therefore increase the specificity of gwas studies and other algorithms.
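purely as an illustration of the paired fastq.gz input described above (four lines per read: header, sequence, separator, quality string), the sketch below reads a pair of gzipped fastq files with the standard library and computes a mean base quality per read pair; the file names are examples, and real pipelines perform quality control with dedicated tools.

# a minimal sketch of reading paired, gzipped fastq files.
import gzip


def iter_fastq(path):
    """yield (header, sequence, quality) tuples from a gzipped fastq file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            sequence = handle.readline().rstrip()
            handle.readline()                     # '+' separator line
            quality = handle.readline().rstrip()
            yield header, sequence, quality


def mean_phred(quality):
    """mean base quality, assuming phred+33 encoding."""
    return sum(ord(ch) - 33 for ch in quality) / len(quality)


for r1, r2 in zip(iter_fastq("sample_R1.fastq.gz"), iter_fastq("sample_R2.fastq.gz")):
    pair_quality = (mean_phred(r1[2]) + mean_phred(r2[2])) / 2
    # a real qc step would flag low-quality pairs and trim adapters here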
The pipeline uses the Kraken program for metagenomics classification [ ][ ][ ]. Using exact alignment of k-mers, the pipeline achieves classification significantly faster than the fastest BLAST-based programs.

The process of running pipelines on the platform is largely the same for all pipelines and does not require in-depth bioinformatics expertise or command line entry. The user is guided through all steps. After the upload of the selected FASTQ files and associated metadata, the user can choose to add a quality control step and then to modify the parameters of the pipeline, if required. After starting the computation, the user can easily monitor the progress of the pipeline and all other tasks in the 'projects' tab (see fig. ). By clicking on a project, the user can see more details about the status of the computation and the parameters used. If required, it is possible to download a log file and to monitor the costs of computations. In case a pipeline fails, e.g. when there are problems with the raw data or the metadata sheet, all processes the pipeline went through are monitored in real time, with additional information on the computation's CPU load and memory. In this way it is easy to find the step at which a pipeline failed, and the error can be addressed. After the pipelines are finished, the user can inspect the results by clicking the 'results' button. All the pipeline-specific plots are presented in an interactive view and can be exported to a PDF report that contains all the parameters and settings of the analysis for easy results sharing. By interrogating the species profile, the presence of SARS-CoV-2 can be confirmed, and the presence of other pathogens that may add to observed phenotypes can be analyzed.

The other preinstalled pipelines also work with patient sequencing files. One is a bioinformatics pipeline for SARS-CoV-2 sequence assembly, which aligns and merges fragments from a longer viral DNA sequence in order to reconstruct the original sequence. The raw data, represented by FASTQ files containing a mixture of human and pathogen sequences, are taken as input and then checked for quality. After removal of the adapter sequences, the reads are mapped to the human reference genome (hg, GRCh). The unaligned, viral sequences are then taken for de novo assembly using the SPAdes program and evaluated using the QUAST program. The assembled genome files are viewable in programs such as Bandage. The contigs are subjected to gene/ORF prediction, and the resulting sequences are further annotated using Prokka. This pipeline produces variant files and an annotation table from the SARS-CoV-2 viruses present in the COVID-19 patient.
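The exact k-mer matching idea that makes Kraken-style classification fast can be sketched in a few lines. The snippet below is a toy illustration only: real Kraken builds a very large k-mer-to-taxon database with lowest-common-ancestor resolution, whereas the reference sequences, the value of k, and the taxon labels here are invented.

```python
# Toy sketch of exact k-mer matching, the idea behind Kraken-style classifiers.
# Real Kraken uses a large k-mer-to-taxon database and LCA resolution; the
# reference sequences and k below are made up for illustration.
from collections import Counter

K = 7
references = {
    "virus_A": "ATGCGTACGTTAGCATGCGTACGTTAGC",
    "bacterium_B": "TTGACCGGTACCAGTTGACCGGTACCAG",
}

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

index = {}  # k-mer -> taxon label
for taxon, seq in references.items():
    for km in kmers(seq):
        index[km] = taxon

def classify(read):
    hits = Counter(index[km] for km in kmers(read) if km in index)
    return hits.most_common(1)[0][0] if hits else "unclassified"

print(classify("GCGTACGTTAGCATG"))   # expected: virus_A
print(classify("CCGGTACCAGTTGAC"))   # expected: bacterium_B
```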
Unlike going for a de novo assembly, the user can also directly align the reads to the available pathogen reference genome and identify the variants present within the analyzed samples. The resulting variant calls can be used in downstream analyses to identify clues regarding how the virus has evolved and the distribution of variants within the virus strain. The raw data (FASTQ short reads) from the patient are taken as input and are checked for quality with FastQC. The reads are then mapped to the hg human reference genome. The unaligned reads are then taken for alignment with the coronavirus (NC_ ) reference genome. The alignments are checked for duplicates and realigned using Picard. The variant calling tool (LoFreq) is used on these realigned files to get the list of the variants in the patient's viruses. The variants are annotated with SnpEff. This pipeline is designed to produce VCF variant files from the COVID-19 patient's sequencing reads. The resulting VCF files can be valuable in identifying the variants associated with the coronavirus infection and can be easily exported and used for downstream analyses. The pipeline performs a quality check on the reads and trims the adapters. These trimmed reads are then aligned to the human reference genome (hg). The alignment files are checked for duplicates and realigned, and variants are called. These variants are annotated using the SnpEff program.

Overall, the whole workflow, from data upload to running a COVID-19-related pipeline, was designed to be as easy as possible, with only a few steps involved and no command line entry needed, so that researchers and healthcare professionals can run analyses of patient sequences and share the files and results with the research community. Researchers with other Nextflow pipelines related to tackling the COVID-19 pandemic are highly encouraged to share them on the Shivom platform.

The main goal of this precision-medicine platform is to provide the global research community with an online marketplace for rapid data sharing, managing and analytics capabilities, guidelines, pilot data, and to deliver AI insights that can help better understand complex diseases, aging and longevity, as well as global pandemics, at the molecular and epidemiological level. Among other use cases, the provided platform can be used to rapidly study SARS-CoV-2, including analyses of the host response to COVID-19 disease, to establish a multi-institutional collaborative data hub for rapid response to current and future pandemics, to characterize potential co-infections, and to identify potential therapeutic targets for preclinical and clinical development. No patient data can be traced back to the data donor (patient or data custodian) because only algorithms can access the raw data. Researchers querying the data hub have no access to the genetic data source, i.e. no raw data (e.g. no sequencing reads) can be downloaded. Instead, researchers and data analysts can run predefined bioinformatics and AI algorithms on the data. Only summary statistics are provided (e.g. GWAS, PheWAS, polygenic risk score analyses). In addition to genetics bioinformatics pipelines, specialized COVID-19 bioinformatics tools are available (e.g., assembly and mapping pipelines).
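Because the end product of these pipelines is an annotated VCF file, downstream work often starts with a quick summary of what the file contains. The sketch below parses a VCF using only the standard library and tallies simple substitutions; the file name is hypothetical, and in practice a dedicated library such as pysam or cyvcf2 would usually be preferred.

```python
# Minimal sketch of downstream use of the pipeline's VCF output: count variants
# and tally REF>ALT substitution types. The file name is hypothetical.
from collections import Counter

def summarize_vcf(path):
    substitutions = Counter()
    n_variants = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):          # skip VCF header lines
                continue
            cols = line.rstrip("\n").split("\t")
            ref, alt = cols[3], cols[4]
            n_variants += 1
            if len(ref) == 1 and len(alt) == 1:   # simple single-nucleotide variant
                substitutions[f"{ref}>{alt}"] += 1
    return n_variants, substitutions

if __name__ == "__main__":
    total, subs = summarize_vcf("patient_sars_cov_2.vcf")
    print(f"{total} variants")
    for change, count in subs.most_common():
        print(change, count)
```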
Using these free tools, researchers can quickly investigate why some people develop severe illness while others show only mild symptoms or no symptoms whatsoever. We might hypothesize that some people harbor protective gene variants, or that their gene regulation reacts differently when the COVID-19 virus attacks the host. Particularly during an early outbreak, virus-host genetic testing has value in identifying those people who are at high or low risk of serious consequences of coronavirus infection. This information is of value for the development of new therapies, but also for considering how to stratify risk and identify those who might require more protection from the virus due to their genetic variants.

Identifying variants associated with increased susceptibility to infection, or with serious respiratory effects, relies on the availability of genetic data from the affected individuals. However, here COVID-19 research encounters challenges associated with the population distribution of genetic data, and the consequent privileging of specific groups in genetic analyses. Most data sets used for genome-wide association studies are skewed toward Caucasian populations, who account for nearly % of individuals in GWAS catalogs. In comparison, those from African and Asian ancestry groups are poorly represented. Even large-scale genetic analyses may consequently fail to identify informative variants whose frequency differs among populations, either under- or overestimating risk in understudied ethnic subgroups. Data from many countries show that people from Black and minority ethnic (BAME) groups have been disproportionately affected. People from a BAME background make up about % of the UK population but account for a third of virus patients admitted to hospital critical care units. Black Americans represent around % of the US population but % of those who have contracted the virus. More than % of healthcare professionals who have died in the UK have been from BAME backgrounds. Similar patterns showing disproportionate numbers of BAME virus victims have emerged in the US and other European countries with large minority populations. In the US and UK, the proportion of critically ill Black patients is double that of Asians, so geography and socioeconomic factors alone are unlikely to explain the stark differences.

The key to fighting early disease outbreaks is speed. With the reality that some infections may originate from unknowing carriers, fast patient stratification and identification is even more important. It is important to investigate whether the severity of the disease may lie in the interaction with the host genome and epigenome. Why do some people get severely sick and die while others do not show any symptoms? It could be that some people harbor protective gene variants, or that their gene regulation is able to deal with viruses attacking the host. Very quick data sharing and analysis ensures that precision medicine is brought to patients and healthy individuals faster, cheaper, and with significantly less severe adverse effects, leveraging information from the interaction between labs, biobanks, clinics, CROs, investigators, patients and a variety of other stakeholders. As outlined in this paper, we have the technology to make easy data sharing a reality.
However, the observation that data sharing technologies are still extremely underutilized and that most data custodians are reluctant to share their data demonstrates that the global healthcare community needs an elemental paradigm shift, a major change in the concepts and practices of how data is shared and utilized, particularly during a pandemic. A paradigm shift, as outlined by the American physicist and philosopher Thomas Kuhn, requires a fundamental change in the basic concepts and experimental practices of a scientific discipline. Kuhn presented his notion of a paradigm shift in his influential book The Structure of Scientific Revolutions. We argue that the world needs such a revolutionary paradigm.

Researchers use real-world data from laboratories, testing centers, hospitals, academic institutions, payors, wearables, and other data sources such as direct-to-consumer genetic companies to better understand disease outbreaks and to develop vaccines and medicines quickly. Such data is used across the whole spectrum of drug discovery, vaccine development, and basic academic research. However, anyone who has applied for data access to a major biobank knows that obtaining such data is typically costly, time-consuming and nerve-wracking, even in the best-case scenario. Data is often received from a third-party partner that aggregates and anonymizes it and makes it usable for their research purposes. Often, there is a significant lag time between when the data is generated and when it is available for use. For example, when an infected patient is sequenced in a hospital, it can take between weeks and even years until the data surfaces and is available in a public database. But if healthcare systems are trying to control a pathogen outbreak and want to understand more about the affected patients and host-virus interactions, things change day by day, and we need insight into the real-time aspect of what is going on. What we need is an ecosystem that, for example, allows a doctor who sees a patient and sends a buccal swab out for sequencing to get the sequencing results, share them with the global community, and get in-depth data analytics back to evaluate whether a similar host-virus interaction was observed anywhere else on the planet, within hours or less.

In the current pandemic, for example, there is a need to understand COVID-19 patient data at the local level, directly from the sources, to get data in real time. Omics data from patients, exposed individuals, and healthy control subjects will be a critical tool in all aspects of the response, from helping to design clinical trials and identifying vulnerable groups of the population, to understanding whether COVID-19 patients taking specific medicines are at higher risk of developing severe side effects. The ability to collect, analyze, and interpret data is fundamental to the management of infectious diseases. To do so, stakeholders need to be incentivized to share their data. Sharing data is good for science, not only because transparency enhances trust in science, but because data can be reused, reanalyzed, repurposed, and mined for further insight. Sharing data on the Shivom platform can increase transparency: the more transparent the contributors are about their data, analytic methods, or algorithms, the more confident they are of their work, and the more open to public scrutiny. Using the platform can also help researchers and consortia that want to collaborate within a safe space.
We need more data aggregators that are committed to supporting an ecosystem of research communities, businesses, and research partners by sharing data or algorithms in safe and responsible ways. Such an open ecosystem approach can yield high dividends for society. In addition, consortia can come together to build patient cohorts, maximizing their research budget. Research organizations that share similar goals can join efforts to create insight at scale, creating datasets that meet the needs of multiple stakeholders at lower costs. A successful response to the COVID-19 pandemic requires convincing large numbers of scientists and healthcare organizations to change their data sharing behaviors. Although the majority of countries have population studies and build massive digital biobanks, to our knowledge none of the larger biobanks in the world has a data sharing infrastructure to share data with other countries. The Shivom platform provides the only cross-border data marketplace that allows data custodians and biobanks to easily and securely share and monetize anonymized datasets. Massive, decentralized, crowd-sourced data can reliably be converted to life-saving knowledge if accompanied by expertise, transparency, rigor, and collaboration.

Overall, there are several obvious advantages to using the platform:
• making data actionable, as data can be analyzed on the bioinformatics platform
• breaking down data silos, putting data in a larger, global context
• making data easily findable and searchable
• the ability to easily share data with fine-grained permissions with clients, collaborators, or the public
• improved accessibility by avoiding complicated access procedures, promoting maximum use of research data
• advanced interoperability by integrating with applications or workflows for analysis, storage, and processing, e.g. new AI algorithms
• better reusability, so that data is not stored and forgotten but reusable in many research projects
• increased impact, as sharing data will help the global fight against complex and rare diseases
• gained reputation for the user's organization, by giving researchers the ability to find, use and cite their work

Data monetization options. Usually, for most data custodians such as clinical laboratories, data cleanup procedures and data deposition are not formal responsibilities. These jobs have to be done by some fairly busy people, in an increasingly demanding environment. There is no remuneration for the labs to submit their data; it is done out of goodwill. As long as the data collection is funded by public institutions this is no major problem, but for private hospitals and small biotech or genetics companies, data sharing can become a challenge. The Shivom platform offers these organizations the possibility to monetize their datasets. Practically, that means that if a dataset is monetizable, whenever a bioinformatics pipeline or algorithm touches the dataset, a smart contract is triggered that sends a predefined fee to the data custodian. The pricing can be adjusted depending on the quality, data type and other factors.
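The fee-per-access mechanism just described can be pictured with a small accounting mock. The sketch below is plain Python, not the platform's actual smart contract code; the dataset names, fee and in-memory ledger are invented purely to show when the predefined fee would change hands.

```python
# Plain-Python mock of the fee-per-access idea described above. The real platform
# reportedly uses blockchain smart contracts; names, fees and the ledger here are
# purely illustrative.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    dataset_id: str
    custodian: str
    access_fee: float        # predefined fee charged per pipeline access
    monetizable: bool = True

@dataclass
class Ledger:
    balances: dict = field(default_factory=dict)

    def pay(self, payer, payee, amount):
        self.balances[payer] = self.balances.get(payer, 0.0) - amount
        self.balances[payee] = self.balances.get(payee, 0.0) + amount

def run_pipeline(pipeline, dataset, user, ledger):
    """Simulate a pipeline touching a dataset: trigger the access fee, return a receipt."""
    if dataset.monetizable:
        ledger.pay(payer=user, payee=dataset.custodian, amount=dataset.access_fee)
    return {"pipeline": pipeline, "dataset": dataset.dataset_id, "user": user}

ledger = Ledger()
cohort = Dataset("covid19_cohort_001", custodian="hospital_lab", access_fee=5.0)
run_pipeline("gwas_qc_association", cohort, user="pharma_researcher", ledger=ledger)
print(ledger.balances)   # {'pharma_researcher': -5.0, 'hospital_lab': 5.0}
```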
While open data sharing is encouraged for the publicly funded research community, such monetization options can be a valuable extra source of funds for organizations that have to worry about sustainability and revenue models. Unlike with many other data brokers, organizations can upload and monetize data directly without a distributor or middleman. This leads to real-time payouts, higher revenue share rates, and instant upload availability. It allows researchers to upload their own data and get it in front of an audience of other scientists. In a way, it almost acts like a content creation tool, rather than a pure data sharing service. From the data-user perspective (e.g. a pharma or biotech company), getting access to large genetic datasets becomes much easier and more affordable, as it is not necessary to acquire exclusive licenses that can cost millions of dollars. Instead, the data user only has to pay for accessing the data that is actually used, similar to the way people access music on streaming services. In addition, since the data custodian retains full ownership of the data, there is no need for complicated business relationships, nor will their internal business model be challenged.

Using the Shivom platform as the regular data repository can help researchers with their data curation tasks, preventing what has been termed 'data curation debt'. Researchers can migrate or copy their data to the platform to protect themselves from failing in-house legacy hardware, e.g. old unmaintained servers. Also, digitizing raw data and storing it increases the security of the data, as many datasets are still stored on decaying physical media (e.g. consents done on paper, metadata stored on CDs/DVDs, MiniDiscs, memory sticks, external hard drives, etc.). If the physical medium decays, then the raw data are lost, making it impossible to reproduce the original research or apply new techniques to analyze the original raw data. Loss of data has far-reaching consequences: a recent survey demonstrated that the lack of availability of raw data and data provenance were common factors driving irreproducible research. For many academic labs, the loss of data could lead to a loss of funding, because future grant applications would not be able to list their data as a resource. Even for normal daily tasks in a typical university laboratory, if data is not lost but requires additional work to locate, for example after a PhD student has left the research group, then additional time and staff costs are involved in finding the data and re-learning knowledge that was the domain of the departed team member. Particularly in multinational pandemic cohort studies, where recruitment and data collection are so difficult, it is unacceptable that there is such a risk to the hugely valuable resource of historical data through the accrual of data curation debt.

Joining forces and sharing information at the national level also eases and supports international cooperation initiatives. National coordination of pandemic policy responses can also benefit from joining forces with international research cooperation platforms and initiatives. It is important to note that data stored and shared on the Shivom platform does not need to be exclusive to this data marketplace; the data can be hosted in other databases as well. The benefits of data sharing do not end here; other benefits that should incentivize researchers to share their datasets include the chance of getting more
citations, increased exposure that may lead to new collaborations, boosting the number of publications, empowering replication, avoiding duplication of experiments, increasing public faith in science, guiding government policy, and many more. Also, far from being mere rehashes of old datasets, more and more evidence shows that studies based on analyses of previously published data can achieve just as much impact as original projects, and published papers based on shared data are just as likely to appear in high-impact journals, and are just as well cited, as papers presenting original data. Building a massive global data hub with its wealth of multi-omic information is a powerful tool to study, predict, diagnose and guide the treatment of complex disease and to help us live longer and healthier.

Also, data sharing should not stop with a successful vaccine rollout. Healthcare systems need to monitor closely the characteristics of the most promising vaccine candidates. Such monitoring includes understanding the patient's response, dosing regimen, potential efficacy, and side effects. Measures by the scientific community to monitor and control the COVID-19 pandemic will be relevant for as long as its risk continues. Moving genomics data and other information into routine healthcare management will be critical for integrating precision medicine into health systems.

However, data sharing is often hindered by complicated regulatory and ethics frameworks. Obviously, there are ethical and legal issues to consider in any work that involves sharing of sensitive personal data. Nevertheless, it is obvious that we need to be able to share data to fight global pandemics. The SARS-CoV-2 virus does not care about jurisdictional boundaries. Data from COVID-19 patients enable experts to interpret complex pieces of biological evidence, leading to scientific and medical advances that ultimately improve care for all patients. But these advantages must be balanced against the duty to protect the confidentiality of each individual patient. It is beyond the scope of this study to discuss ethics in depth. The Shivom platform is built to be operational in all jurisdictions around the world. Many regulatory frameworks already exist that allow anonymized data from patients to be collected and held in the interest of improving patient care or in the public interest. It is now time to go a step further and decouple procedures from regional regulatory frameworks, to make data sharing easy across borders and regulatory regimes. In this context, the magic word is anonymization. On the outlined data sharing platform, no access to raw data is permitted. Only algorithms that provide summary statistics are able to touch the data, guaranteeing that no dataset can be traced back to an individual. This process makes it possible to share important multi-omics information anonymously and securely, whilst still enabling linkage to metadata such as diagnosis, lifestyle factors, treatment and outcome data. The key is that data on the platform is largely protected from re-identification of individuals.
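The "algorithms touch the data, users only see aggregates" model described above can be pictured with a small access-gate sketch. The minimum-group-size rule, the field names and the cohort below are assumptions added for illustration; they are not documented parameters of the platform.

```python
# Sketch of a "summary statistics only" access gate as described above. The
# minimum-group-size threshold is an assumed safeguard, not a documented
# platform parameter; records are fabricated for illustration.
from statistics import mean

MIN_GROUP_SIZE = 20   # assumed threshold below which results are withheld

def allele_frequency_summary(records, min_n=MIN_GROUP_SIZE):
    """Return aggregate statistics only; raw per-person records never leave this function."""
    if len(records) < min_n:
        return {"released": False, "reason": "group below minimum size"}
    return {
        "released": True,
        "n": len(records),
        "mean_age": round(mean(r["age"] for r in records), 1),
        "alt_allele_frequency": round(
            sum(r["alt_alleles"] for r in records) / (2 * len(records)), 3
        ),
    }

cohort = [{"age": 40 + i % 30, "alt_alleles": i % 3} for i in range(250)]
print(allele_frequency_summary(cohort))
print(allele_frequency_summary(cohort[:5]))   # too small: result withheld
```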
Re-identification in this context is understood as either identity disclosure or attribute disclosure, that is, as either revealing the identity of a person, thus breaching his or her anonymity, or disclosing personal information, such as susceptibility to a disease, which is a breach of privacy. The Shivom platform was designed to make it impossible to disclose information from genetic data, not only about the individual but also about their relatives and their ethnic heritage, e.g. protecting the identity of siblings or other relatives in the context of forensic investigations. Ideally, patients should be involved in such data sharing initiatives. Institutions that collect and host data and promote data sharing should explain the benefits of collecting and using patient data to their clients. There is a need for public education on genetics, to take the frightening aspect away, ultimately incentivizing patients and healthy control groups to share their data anonymously for the greater good.

One of the cornerstones of the proposed data ecosystem is easy public access to data. If it is adopted by the research community, researchers seeking to use public data need no longer comb websites and public repositories to find what they need. It is important that every human being has the potential to participate in this process, either on the data provider side or by analyzing datasets. Access to the data should not be restricted to certain organizations or to researchers with a long publication record. These systems are outdated and archaic. Usually, with most databases and biobanks, the bona fide user must apply for access with a corresponding data access control body, provide an institution ID such as OpenID or ORCID, and often provide a long publication record. Usually, that means they must register from, and be affiliated with, an approved research institute and need to register their organization, which means they will need their organization's signing official to participate in the registration process. Often, when interested researchers want to access data in public databases, they are asked to submit a lengthy proposal detailing the data they want to use and the rationale for their request. Most of the time, projects must involve health-related research that is in the public interest. Sometimes the researcher needs to go through an institutional review board (IRB) first, and even then, all proposals received by the data repository are considered on an individual basis and may have to be evaluated by an internal or external ethics committee. On average, a straightforward application and approval process takes - months; sometimes it has taken up to months. Obtaining access to data that are available from international sources adds another layer of complexity, due to international law. Most databases then require a standard material transfer agreement (MTA) to be signed prior to any data delivery, which governs how the researcher can use the data. But even when data access is granted, it is common that datasets are restricted to only those data and participants that the researcher required at that time (e.g. COVID-19 patients only, or specific case-control subsets).
In addition, most agreements grant access for a limited time only, at the end of which a report summarizing research progress is required; for continued data access, yearly renewal requests are often obligatory. To exemplify the problem, our researchers applied for access to COVID-19 data with four data providers (data repositories, COVID-19 research consortia, and biobanks) in the UK, EU and US to support our COVID-19 research efforts. All four requests were rejected, each for a different reason. One public data repository denied access to their data because their legal terms only allow access for groups working in public institutions, not private or commercial companies. Another biobank was concerned about the lack of publications from our research unit that could demonstrate the organization's current activity in health research, keeping in mind that our researchers are highly respected in their field and have a vast list of scientific publications from previous appointments. Other objections were that the proposed COVID-19 research was too broad, that basically no exploratory studies were permitted, or that our junior researcher had no publication record ( or more papers). One COVID-19 data consortium suggested that our organization share our genetic data with them, but did not allow access to their data sets and only provided access to secondary analyses that their researchers performed on the data. Such processes are often undemocratic, against the public interest, and simply take too long to provide any benefit during a health crisis.

A large majority of biobanks and databases are nonprofit organizations, and most of them operate on a not-for-profit basis. Consequently, public funding puts some obligation on data custodians to enable research with scientific and social value. Against this background, we argue that data custodians should be required to prioritize access options and avoid underutilization of data. Thus, biobanks ought to make all necessary arrangements that facilitate the best possible utilization of the data sets. As such, interested researchers on our data hub do not require a long publication record, nor do they need to provide a lengthy proposal, referrals, or membership of an elite club (e.g. an accredited, approved research institute) to search and analyze data. Usability and accessibility need to be guaranteed for every interested individual: the academic researcher from any elite university as well as the pharma manager, the PhD student from Kenya, the doctor at a local clinic in Wuhan, a student in a remote area of Pakistan, or the nurse in a village hospital in Brazil. In addition, data should not only be accessible for a specific research purpose but should enable research that is broader in scope and exploratory in nature (i.e. hypothesis-generating).

Many diseases hide in plain sight in our genome, making genomics the most specific and sensitive way to identify disease early and guide both patients and healthcare practitioners to the best course of treatment. Yet despite many advances in data analytics and AI, our understanding of the human panome (including the epigenome, proteome, metabolome, transcriptome, and so forth) is incomplete, and the ways in which we utilize this information are limited and imperfect.

Clinical diagnosis. Having a secure data hub for fast data dissemination and analytics in the clinical setting will improve routine clinical care. The diagnosis and management of most complex diseases, such as cancer, have become increasingly dependent on genetic and genomic information.
When biopsy samples are collected from patients and sequenced, there is a need to be able to tell, as quickly as possible, whether any variants detected are associated with the disease or indicate that a patient is likely to respond to a particular drug. At the same time, with every sequence shared, global diagnostic accuracy improves, not only for the current patient but for all cancer patients around the globe. In addition, linking patients' clinical data with genomic data will help answer research questions relevant to patient health that would otherwise be difficult to tackle.

Rare diseases. Access to genetic data and 'real-world evidence' (RWE) obtained from observational data is crucial to help push research and innovation, particularly on rare diseases. The standard approach to clinical trials is challenging, if not impossible, for most rare diseases. It is characteristic of rare diseases that they can only be adequately tackled if there is enough data to make evidence-based decisions, given that the number of patients is so small. To make this happen, it is absolutely necessary to break down global data silos. To move things forward, many global initiatives, such as the EU regulation on orphan medicinal products (OMPs), were introduced specifically to address the challenge of developing medicines that treat patients with rare diseases. However, the initial progress seen since the introduction of these initiatives has tapered off in recent years, with a decline in the number of drug approvals over the past years. In fact, % of rare disease conditions remain without treatment. One of the main hurdles is the collection of, and access to, data, particularly data obtained outside the context of randomized controlled trials and generated during routine clinical practice. Gaining the critical mass of data to close data gaps needs global coordination and represents a critical step for driving forward research in the area of orphan drugs. In addition, the described platform can be used to collect valuable datasets on specific themes and therapeutic areas that are not yet available. Currently, several pilot data initiatives have started to collect research-area-specific datasets on the Shivom platform.

Cannabis data hub. In collaboration with direct-to-consumer companies, patient support groups and medical professionals, we aim to collect data on the interaction of individuals' genomes with medicinal compounds found in medical and recreational cannabis. Cannabis-related research allows governments and pharma researchers to better understand the efficacy and potential side effects of hundreds of compounds like cannabinoids and terpenes, and to improve legal frameworks for cannabis products. For example, several gene variations are known to affect the user's endocannabinoid system, increasing sensitivity to Δ9-tetrahydrocannabinol (THC) and potentially posing risk factors for THC-induced psychosis and schizophrenia. Similarly, certain variations in AKT1 can make people's brains function differently when consuming marijuana. About percent of occasional cannabis users develop a physiologic dependence, cravings, or other addiction-related behaviors that can affect everyday life.
As a result, many cannabis users go through various products blindly before their ideal balance between the drug's efficacy and its tolerability is reached. Patients and their doctors can learn about pharmacogenetic aspects of cannabis use, be informed of their susceptibilities, and can therefore adjust cannabis habits to best fit their genotype. Having a comprehensive anonymized genetic database available, combined with real-world data, will pave the way to better pain management, as it has been demonstrated that cannabis may be a promising option to replace opiates in the treatment of pain, as well as replacing other drugs such as antiepileptics and antipsychotics.

The longevity dataset is an ongoing, concerted effort to provide genetic and epigenetic information on individuals who have attained extreme ages. The remarkable growth in the number of centenarians (aged ≥100) has garnered significant attention over the past years. Consequently, a number of centenarian studies have emerged, ranging in emphasis from demographic to genetic. However, most of these studies are not aimed at breaking down data silos. As such, data will be collected from centenarians, with a subset of people who attained an age of 110 years or higher, so-called 'supercentenarians', to yield sufficient numbers to warrant descriptive studies and to find genetic factors associated with exceptional longevity. Having enough datasets will improve the search for meaningful subnetworks of the overall omic network that play important roles in the supercentenarians' longevity and will help to better understand epigenetic mechanisms of aging.

A data repository for the scientific publishing process. Having data deposited in a public data hub is of utmost importance, particularly when data needs to be published rapidly, as in a time-sensitive pandemic. The case is easily demonstrated by the retraction of two COVID-19 studies in May of 2020. The New England Journal of Medicine published an article that found that angiotensin-converting enzyme (ACE) inhibitors and angiotensin receptor blockers were not associated with a higher risk of harm in patients with COVID-19. The Lancet published an observational study by the same authors, indicating that hospital patients with COVID-19 treated with hydroxychloroquine or chloroquine were at greater risk of dying and of ventricular arrhythmia than patients not given the drugs. Both journals retracted these articles because the authors said they could no longer vouch for the veracity of the primary data sources, raising serious questions about data transparency. The problem was that both studies used data from a healthcare analytics company called Surgisphere. After several concerns were raised with respect to the correctness of the data, the study authors announced an independent third-party peer review. However, the data custodian, Surgisphere, refused to transfer the full dataset, claiming it would violate confidentiality requirements and agreements with clients, leading the authors to request the retraction of both studies. This problem could have been avoided by publishing the datasets to a data repository such as the platform described in this article.
Journal reviewers and other researchers could have easily accessed and analyzed the data before problems arose, a step that could easily be made a mandatory part of the submission process. The urgency of sharing emerging research and data during a pandemic is obvious. Rapid publication and early sharing of results is clearly warranted, but it means that the research community must also be able to access and analyze the data in a timely manner. Publishing research data to an open platform with audit features and data provenance increases the chance of reproducibility and provides reliable assurance of the integrity of the raw data. Considering that we have the technical capabilities to publish anonymized datasets, journals should, in principle, try to have their authors publish raw data in a public database or on the journal site upon publication of the paper, to increase the reproducibility of the published results and to increase public trust in science. In this context, sharing raw omics data publicly should be a necessary condition for any master's or PhD thesis, as well as research studies, to be considered scientifically sound, unless the authors have acceptable reasons not to do so (e.g. the data contain confidential personal information), keeping in mind that fine-grained permission settings also allow industrial and proprietary research to be published and analyzed. Overall, by utilizing an easily accessible data hub, it is possible to incorporate valuable omics information into routine preventive care, research and longitudinal patient monitoring. In the end, we can leverage the benefits of omics to expand patient access to high-quality care and diagnosis.

It is the aim of the authors to further evolve the proposed data sharing and analytics platform to the point where it is a widely used and truly actionable data hub for the global healthcare ecosystem. Together with the research community, we are working on adding a variety of other functionalities to make this precision medicine ecosystem more valuable and innovative, particularly in the fields of artificial intelligence, interoperability, pandemic preparedness, and other new research fields. Long-term, this can include projects to add functionality for clinical trials, AI marketplaces, IoT data, or quantum computing. Some of the planned add-ons in the roadmap for future releases include a cohort browser, improved search engines, many new pipelines and AI tools, as well as fully federated data sharing capability. We plan to add a forum to the platform to help researchers exchange information on datasets and data curation. Often, those who want to make their data more open face a bewildering array of options on where and how to share it. Providing a space for sharing data, and guiding young researchers in their task by sharing the necessary expertise in data curation and metadata, will help to ensure that the data they plan to share are useful for others.

The urgency of tackling COVID-19 has led governments in many countries to launch a number of short-notice and fast-tracked initiatives (e.g. calls for research proposals). Without proper coordination amongst ministries and agencies, they run the risk of duplicating efforts or missing opportunities, resulting in slower progress and research inefficiencies. The best way to prevent a further spread of the virus and potential new pandemics is to apply the same principles we use to prevent catastrophic forest fires: survey aggressively for smaller brush fires and stomp them out immediately.
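Federated data sharing is listed above only as a roadmap item, so the following is merely a toy picture of what it could look like: each site computes local summary counts and only those aggregates are pooled centrally, while raw records never move. The site names and numbers are invented and this is not the platform's (still planned) federated implementation.

```python
# Toy illustration of federated aggregation: each site contributes only local
# summaries, which are combined centrally without raw data ever leaving the site.
local_summaries = [
    {"site": "hospital_a", "n": 412, "severe_cases": 37},
    {"site": "clinic_b",   "n": 158, "severe_cases": 9},
    {"site": "lab_c",      "n": 603, "severe_cases": 51},
]

total_n = sum(s["n"] for s in local_summaries)
total_severe = sum(s["severe_cases"] for s in local_summaries)
print(f"pooled cohort: n={total_n}, severe fraction={total_severe / total_n:.3f}")
```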
We need to invest now in an integrated data surveillance architecture designed to identify new outbreaks at the very early stages and rapidly invoke highly targeted containment responses. Currently, our healthcare system is not set up for this. First, modern molecular diagnostic technologies detect only those infectious agents we already know exist; they come up blank when presented with a novel agent. Infections caused by new or unexpected pathogens are not identified until there are too many unexplained infections to ignore, which triggers hospitals to send samples to a public health lab that has more sophisticated capabilities. And even when previously unknown infectious agents are finally identified, it takes far too long to develop, validate and distribute tests for the new agent. By then, the window of time to contain an emerging infection will likely have passed. Using the platform as demonstrated in this article, all those processes can be sped up dramatically. It would enable our healthcare ecosystem to halt the spread of emerging infectious diseases before they reach the pandemic stage.

In the new surveillance architecture proposed here, genomic analysis of patients presenting with severe clinical symptoms would be performed up front in hospital labs, piggybacked on routine diagnostic testing. Researchers can use the data in our data hub to compare new cases with existing datasets. Given national epidemiological data on infection rates and on where symptomatic people seek healthcare, a remarkably small number of surveillance sites would be required in order to identify an outbreak of an emerging agent. Using genomic sequencing, the novel causative agent could be identified within hours, and the network would instantaneously connect the patients presenting at multiple locations as part of the same incident, providing situational awareness of the scope and scale of the event. Using a globally accessible, easy-to-use solution that provides a 'one-stop shop' for the centralization of information on new virus-host interactions can help ensure that appropriate conditions for collaborative research and sharing of preliminary research findings and data are in place to reap their full benefits. Beyond short-term policy responses to COVID-19, incorporating data sharing with the Shivom platform into ongoing mission-oriented research and innovation policies could help tackle future pandemics on a national or international scale, for example by enhancing datasets' informativeness to reveal host infection dynamics for SARS-CoV-2 and therefore to improve medical treatment options for COVID-19 patients. In addition, it can help to determine and design optimal targets for therapeutics against SARS-CoV-2.

The platform is in alignment with recent guidelines published by the Research Data Alliance (RDA) COVID-19 Working Group. The RDA brought together diverse global expertise to develop a body of work describing how data from multiple disciplines inform the response to a pandemic, combined with guidelines and recommendations on data sharing under the present COVID-19 circumstances.
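As a toy picture of the early-warning logic sketched above, the snippet below flags surveillance sites whose daily count of sequenced samples matching no known pathogen exceeds a simple multiple of an assumed historical baseline. The baseline, the threshold rule and the counts are illustrative assumptions, not part of the described architecture.

```python
# Toy sketch of the surveillance idea above: flag a site when the count of
# sequenced samples that match no known pathogen exceeds an expected baseline.
# Baseline, threshold rule and daily counts are all illustrative assumptions.
BASELINE_UNCLASSIFIED_PER_DAY = 2.0   # assumed historical average per site

def flag_sites(daily_counts, baseline=BASELINE_UNCLASSIFIED_PER_DAY, factor=3.0):
    """Return sites whose unclassified-sample count exceeds factor x baseline."""
    return [site for site, count in daily_counts.items() if count > factor * baseline]

today = {"hospital_a": 1, "hospital_b": 9, "hospital_c": 2, "hospital_d": 11}
alerts = flag_sites(today)
print("sites needing follow-up sequencing review:", alerts)  # ['hospital_b', 'hospital_d']
```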
The RDA report offers best practices and advice (around data sharing, software, data governance, and legal and ethical considerations) for four key research areas: clinical data, omics practices, epidemiology and social sciences. In line with these recommendations, the Shivom platform can be used for:
• sharing clinical data in a timely and trustworthy manner to maximize the impact of healthcare measures and clinical research during the emergency response
• encouraging people to publish their omics data alongside a paper
• underlining that epidemiology data underpin early response strategies and public health measures
• providing a platform to collect important social and behavioral data (as metadata) in all pandemic studies
• evidencing the importance of public data sharing capabilities alongside data analytics
• offering a general ecosystem to exploit relevant ethical frameworks relating to the collection, analysis and sharing of data in similar emergency situations

At this point in time, the research community should be clear that a second or third wave of the COVID-19 pandemic will hit most countries. In addition, other epidemics and global pandemics will follow; it is just a matter of time. As such, the global research and medical community need to be prepared. When analysis of datasets shared on the Shivom platform results in publications, we encourage the researchers to cite this publication to enhance widespread adoption of this data sharing and analytics marketplace. We also encourage sharing data from COVID-19 and other virus outbreaks, including data from people who were exposed to the same pathogen but did not get the disease, or who possibly were infected but did not develop symptoms.

We developed a next-generation data sharing and data analytics ecosystem that allows users to easily and rapidly share and analyze omics data in a safe environment, and we demonstrate its usefulness as a pandemic preparedness platform for the global research community. Adopting the platform as a data hub has the clear benefit that it enables individual researchers and healthcare organizations to build data cohorts, making large, or expensive-to-collect, datasets available to all. In this way, data sharing opens new avenues of research and insights into complex disease and infectious disease outbreaks. This is not just true of large-scale data sharing initiatives; even relatively small datasets, for example rare disease patient data, if shared, can contribute to big data and fuel scientific discoveries in unexpected ways. Particularly during early disease outbreaks, the patient-level meta-analysis of similar past outbreaks may reveal numerous novel findings that go well beyond the original purpose of the projects that generated the data. Using the platform as demonstrated will enhance opportunities for cooperation and exploitation of results. Our data sharing approach is inclusive, decentralized, and transparent, and provides ease of access. If adopted, we expect the platform to substantially contribute to the understanding of the variability of pathogen susceptibility, severity, and outcomes in the population and to help prepare for future outbreaks.
In providing clinicians and researchers with a platform to share COVID-19-related data, the clinical research community has an improved way to share data and knowledge, removing a major hurdle in the rapid response to a new disease outbreak.

AS, IF, DB, and GL are employees of Shivom Ventures Limited. Conceived and designed the platform: AS, IF. Analyzed data: IF, GL, DB. Wrote the paper: AS.

References:
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia
Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic
Sample sizes in COVID-19-related research
European Commission COVID-19 Data Portal
The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic
COVID-19 Genomics UK (COG-UK) Consortium
Secure Collective COVID-19 Research (SCOR) Consortium
International COVID-19 Data Research Alliance
Rapid implementation of SARS-CoV-2 sequencing to investigate cases of health-care associated COVID-19: a prospective genomic surveillance study
Why we need easy access to all data from all clinical trials and how to accomplish it
The variant call format (VCF) version specification
Health data privacy and confidentiality rights: crisis or redemption?
Patient confidentiality in the research use of clinical medical databases
Re-identifiability of genomic data and the GDPR
On the blockchain: toward a new era in precision medicine
Reinventing healthcare: towards a global, blockchain-based precision medicine ecosystem
Decentralized genomics audit logging via permissioned blockchain ledgering
Realizing the potential of blockchain technologies in genomics
Enforcing human subject regulations using blockchain and smart contracts
The sequential organ failure assessment score for predicting outcome in patients with severe sepsis and evidence of hypoperfusion at the time of emergency department presentation
Clinical pulmonary infection score calculator in the early diagnosis and treatment of ventilator-associated pneumonia in the ICU
Quick sepsis-related organ failure assessment, systemic inflammatory response syndrome, and early warning scores for detecting clinical deterioration in infected patients outside the intensive care unit
Nextflow enables reproducible computational workflows
Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies
A tutorial on conducting genome-wide association studies: quality control and statistical analysis
Societal considerations in host genome testing for COVID-19
Assembling genomes and mini-metagenomes from highly chimeric reads, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Kraken: a set of tools for quality control and analysis of high-throughput sequence data
Improved metagenomic analysis with Kraken
Ultrafast metagenomic sequence classification using exact alignments
The missing diversity in
Recognizing, reporting and reducing the data curation debt of cohort studies
1,500 scientists lift the lid on reproducibility
Data sharing and the future of science
Assessment of the impact of shared brain imaging data on the scientific literature
Barriers to accessing public cancer genomic data
Association between ABCB C T polymorphism and increased risk of cannabis dependence
Cannabis consumption and psychosis or schizophrenia development
Protein kinase B (AKT1) genotype mediates sensitivity to cannabis-induced impairments in psychomotor control
Efficacy of cannabis-based medicines for pain management: a systematic review and meta-analysis of randomized controlled trials
Unraveling the complex genetics of living past
Handbook of epigenetics
Epigenetics and late-onset Alzheimer's disease
Data transparency: 'nothing has changed since Tamiflu'
Drug therapy, and mortality in COVID-19
Retracted: Hydroxychloroquine or chloroquine with or without a macrolide for treatment of COVID-19: a multinational registry analysis
No raw data, no science: another possible source of the reproducibility crisis
RDA COVID-19 Working Group: RDA COVID-19 recommendations and guidelines (RDA recommendation)

key: cord- -dezauioa authors: johnson, stephanie; parker, michael title: ethical challenges in pathogen sequencing: a systematic scoping review date: - - journal: wellcome open res doi: . /wellcomeopenres. . sha: doc_id: cord_uid: dezauioa

Background: Going forward, the routine implementation of genomic surveillance activities and outbreak investigation is to be expected. We sought to systematically identify the emerging ethical challenges, and to systematically assess the gaps in ethical frameworks or thinking and identify where further work is needed to solve practical challenges. Methods: We systematically searched indexed academic literature from PubMed, Google Scholar, and Web of Science from to April for peer-reviewed articles that substantively engaged in discussion of ethical issues in the use of pathogen genome sequencing technologies for diagnostic, surveillance and outbreak investigation. Results: Twenty-eight articles were identified: nine from the United States, five from the United Kingdom, five from the Netherlands, three from Canada, two from Switzerland, one from Australia, two from South Africa, and one from Italy. Eight articles were specifically about the use of sequencing in HIV. Eleven were not specific to a particular disease. Results were organized into four themes: tensions between public and private interests; difficulties with translation from research to clinical and public health practice; the importance of community trust and support; equity and global partnerships; and the importance of context. Conclusion: While pathogen sequencing has the potential to be transformative for public health, there are a number of key ethical issues that must be addressed, particularly around the conditions of use for pathogen sequence data. Ethical standards should be informed by public values, and further empirical work investigating stakeholders' views is required. Development in the field should also be underpinned by a strong commitment to values of justice, in particular global health equity.
Genetic information derived from pathogens is an increasingly essential input for infectious disease control, public health and research. Although routine sequencing of pathogens was, until recently, unthinkable, the Centers for Disease Control and Prevention (CDC), the Food and Drug Administration (FDA), and state and global public health laboratories now routinely sequence more than foodborne bacterial isolates a day and more than , influenza virus genomes a year. In the United Kingdom, Public Health England now engages in routine clinical genomic diagnostics and drug sensitivity testing for Mycobacterium tuberculosis. In the research setting, phylogenetic analysis (the study of evolutionary relationships among pathogens) is being used to track and understand factors associated with the spread of infections such as HIV, and to monitor the global spread of drug-resistant infections. Mobile genomic sequencing technology is also being applied to disease outbreak investigation, most publicly in the case of the Ebola outbreak in West Africa.

Going forward, the routine implementation of genomic surveillance activities and outbreak investigation is to be expected. While the technical developments of sequencing technology are being implemented at a rapid pace, the non-technical aspects of implementing this technology are still being broadly discussed between the different stakeholders involved. The successful implementation of this rapidly developing technology will, for example, require sharing of samples and metadata and interdisciplinary global collaborative partnerships, and will need to offer useful evidence for public health decision-making. Importantly, a successful and appropriate response to these challenges will also require the systematic identification, analysis and addressing of a number of complex ethical, legal and social issues. A number of factors will contribute to the types of ethical issues that arise in different instances. These are likely to include characteristics of the disease, the environmental, political and geographical context, existing laws and policies, public attitudes, and cultural differences. In the work reported in this paper, taking these and other ethical issues as our focus, we sought to systematically examine the available literature to identify the emerging ethical challenges and proposed solutions, and to systematically assess the gaps in ethical frameworks or thinking and identify where further work is needed to solve practical challenges.

Scoping reviews seek to identify literature relevant to a research objective and may include a variety of research formats and conceptual literature. This study sought to review published literature on ethical aspects of pathogen sequencing. Inclusion criteria for the study encompassed a broad range of article types, including empirical studies, news articles, opinion pieces, features, editorials, reports of practice, and theoretical articles. We systematically searched indexed academic literature from PubMed, Google Scholar, and Web of Science from to April for peer-reviewed articles that substantively engaged in discussion of ethical issues in the use of pathogen genome sequencing technologies for diagnostic, surveillance and outbreak investigation. The search was then updated in January. The initial search strategies were developed through an iterative process and used a combination of controlled vocabulary (MeSH terms) and free-text words. An example MEDLINE search strategy is provided in Table .
reference lists of included articles were searched for relevant articles and further database searches were conducted using the names of researchers commonly publishing in this field. finally, we also reviewed international research and clinical practice guidance for relevant guidelines, e.g., the website of the world health organization. we sought to maximize the literature included in the review by reviewing guidelines, frameworks, commentaries and original research reviews related to pathogen sequencing. we also included studies on molecular typing where enough accuracy could be achieved to enable transmission tracking, as this was thought to provide useful insights into the ethical challenges pathogen sequencing technologies may pose. we excluded studies considering genomics outside of infectious disease or focusing on host response studies as these were not deemed relevant or specific enough to the topic under investigation. duplicates were removed. sj undertook title and abstract screening to remove obviously irrelevant studies; borderline cases were discussed with mp and a decision reached by consensus. data was then abstracted by sj and cross-checked for accuracy by mp. names of study authors, institutions, journals of publication and results were non-blinded. analysis sj initially inductively analysed all studies, recording the aims and main findings in microsoft excel ( . . . ), and developing descriptive codes to chart the broad themes describing the literature base. similar findings were then grouped according to topic area and a preliminary list of themes was developed in collaboration with mp. both authors engaged in iterative discussions about organization of findings, after which the final themes were decided. sj subsequently re-examined each study and extracted data using a standard format (design, results and recommendations). the search produced articles after duplicates were removed. all articles were initially screened by title and abstract and thirty-nine full text articles were assessed for eligibility. twenty-eight articles were included in the final analysis; three ethical guidelines or frameworks - ; seven empirical research studies , - ; eight reviews of the ethics , - ; and ten publications that contained a section on the ethical aspects of pathogen sequencing - . the literature largely originated from the us and other high-income countries (hics, country determined by lead author institution): nine from the united states (us), five from the united kingdom (uk), five from the netherlands, three from canada, two from switzerland, one from australia, and one from italy. only two publications, by the same author, originated from the global south (south africa). eight articles were specifically about the use of sequencing in hiv , , , , , , , . eleven were not specific to a particular disease. table presents a summary of studies. results were organized into five themes: tensions between public and private interests; difficulties with translation from research to clinical and public health practice; the importance of community trust and support; equity and global partnerships; and the importance of context. when considering the implications of collecting, using and sharing pathogen genome sequence data, the interests and rights of individuals were universally acknowledged. in particular, the literature pointed to the importance of considering and/or protecting individual rights to autonomy , , , , and to privacy , , , . 
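the deduplication, title/abstract screening and charting workflow described above (performed here in a spreadsheet) can also be bookkept in scripted form; the sketch below is a minimal illustration with hypothetical file and column names, not the authors' actual procedure.

```python
# a minimal sketch, with assumed file and column names, of merging database exports,
# removing duplicates, and logging title/abstract screening decisions.
import pandas as pd

pubmed = pd.read_csv("pubmed_export.csv")            # assumed export files
wos = pd.read_csv("web_of_science_export.csv")
scholar = pd.read_csv("google_scholar_export.csv")

records = pd.concat([pubmed, wos, scholar], ignore_index=True)

# crude de-duplication on a normalised title plus publication year
records["title_key"] = (records["title"].str.lower()
                        .str.replace(r"[^a-z0-9 ]", "", regex=True).str.strip())
records = records.drop_duplicates(subset=["title_key", "year"])

# screening decisions are logged per record; borderline cases are flagged
# for discussion with the second reviewer
records["include"] = pd.NA
records["borderline"] = False
records["exclusion_reason"] = pd.NA

records.to_csv("screening_log.csv", index=False)
print(f"{len(records)} unique records ready for title/abstract screening")
```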
many pointed out that the potential of sequencing techniques to detect the origin and routes of transmission of an outbreak may result in negative consequences for the individuals involved . consequences may include stigmatization, penalties, economic risks, problems with interpersonal relationships (e.g. inadvertent disclosure of infidelity), emotional distress and the capacity for discrimination , , , , , , , , [ ] [ ] [ ] . there was also concern that sequencing could lead to serious legal consequences, particularly with regards to the criminalization of hiv transmission , , , , , , , . it was also acknowledged that individuals may have an interest in avoiding the use of information about them for purposes they do not endorse, such as to support anti-gay sentiment or as part of a criminal investigation , . some were concerned about forced testing either of certain groups such as gay men or healthcare workers , , - . there was acknowledgment of individual professional interests of researchers and practitioners to ownership and use of data , , . sequencing was also seen to carry risks for communities and groups. many authors noted that certain groups can be placed at risk through characterization as high risk or likely to transmit virus (hiv), including geographically defined groups, sexual or gender minorities, or those defined by ethnicity, nationality, or migration status . similarly, data regarding transmission patterns of multidrug-resistant tuberculosis (tb) could be used for discrimination based on ethnicity, and possible challenges to immigration . institutions could be subject to increasing numbers of legal claims, or companies could suffer reputational or economic damage . it was also noted that some communities may be particularly at risk of being exploited by research, especially during emergency outbreak situations , . on the other hand, it was acknowledged that widespread availability and use of sequence data contributes important benefits to the clinical and research communities . in particular, the rapid sharing of data can help identify etiological factors, predict disease spread, evaluate existing and novel treatments, symptomatic care and preventive measures, and guide the deployment of public health resources. there was strong support for the permissibility of conducting sequencing studies, as long as potential risks were thoughtfully mitigated , , , , . in one empirical study, patients and healthcare workers were asked if the benefits of hiv molecular epidemiology outweigh the risks; all said yes . three-quarters of respondents answered with an unqualified yes, and one quarter gave a positive answer with qualifications, such as 'it's very necessary, just as long as parameters are set in place and they're kept', or 'with proper protections in place, the benefits outweigh the risks' . in another study, expert delphi panelists held that the protection of the public was of overriding importance, but that most of the potential harms could be managed . there were differences across the literature in the priority afforded to different conditions of use, and to the types of risk or amount of risk deemed acceptable. it was broadly agreed that any research should have a favourable risk benefit ratio , , and that maximizing the utility of data must be weighed against concerns over interests of individuals and that policies on data collection and release should seek to align the interests of different parties . 
there was, however, disagreement as to whether privacy concerns or public interest should take precedence , , and some noted that the balance between the public health benefit and personal privacy risk for individuals whose genetic data (personal or pathogen) are included is difficult to delineate, since neither the true benefit nor the actual risk to participants has been adequately defined , . below we set out the key recommendations from the literature on the conditions of use for pathogen sequence data. box summarises recommendations from the literature for future research focus and study design. it was clear that the release of information of relevance to public health should not be delayed by publication timelines or concerns over academic ownership of data , , . recommendations to address such conflicts of interests included: that medical journals should update their policies to support pre-publication sharing of pathogen sequence data related to outbreaks ; publication disclaimers prohibiting use of sequence data for publication without permission , ; acknowledgment of data sharing contributions and the inclusion of such criteria in the assessment of academic research credit ; establishment of governance structures and dispute resolution mechanisms that can mediate where disagreements arise . much of the literature suggested that traditional methods of de-identification or anonymization of data are insufficient to meet their purpose in the context of pathogen sequencing , , , , , , . existing approaches to minimize the risk of privacy loss to participants are based on de-identification of data by removal of a predefined set of identifiers . however, this has three key limitations in the context of pathogen genomics. first, sharing of corresponding sample metadata (minimally time and place of collection, ideally with demographic, laboratory, and clinical data) is essential to enhance the interpretation and the value of genomic data ; therefore, removal of key identifiers such as geographic location may severely limit the utility of genomic data . second, removing predefined identifiers may be ineffective at protecting privacy and confidentiality , , , . for example, one study demonstrated how sample collection dates associated with microbiological testing at a large tertiary hospital were highly correlated with patient admission date (protected health information), meaning data is re-identifiable . small study populations may also mean that individuals who are part of a transmission chain may be able to identify others during the course of routine contact tracing (e.g. sexual partners) . third, anonymization of data does little to mitigate potential risk to communities and groups , , . perhaps as a consequence, there was clear support for re-visioning of existing privacy standards, and for privacy policies specific to the context of sequencing studies. 
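the re-identification point above, that sample collection dates can track protected admission dates almost perfectly, is easy to illustrate on synthetic data; the sketch below assumes specimens are collected within a couple of days of admission and simply reports the resulting correlation. it is an illustration of the limitation, not a re-analysis of the cited study.

```python
# a minimal sketch, on synthetic data, of the de-identification limitation noted above:
# released specimen collection dates can be almost perfectly correlated with protected
# admission dates, so the "anonymised" field still leaks admission timing.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
admission = pd.to_datetime("2021-01-01") + pd.to_timedelta(
    rng.integers(0, 365, n), unit="D")            # protected health information
collection = admission + pd.to_timedelta(
    rng.integers(0, 3, n), unit="D")              # specimen taken 0-2 days later (assumed)

adm_days = (admission - admission.min()).days
col_days = (collection - admission.min()).days
corr = np.corrcoef(adm_days, col_days)[0, 1]
print(f"correlation between collection and admission dates: {corr:.3f}")
```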
there was debate in the literature around the importance of consent to the use of sequence data and associated meta-data in epidemiological investigation. in the research setting, coltart et al. state that research participants and patients whose samples are being used for phylogenetic analysis should ideally have consented to such use, but suggest that when using data from previous studies, where only broad consent for hiv-related research might have been obtained, waivers of specific consent are allowable when samples are no longer linked to identifiers, or when broad consent for sample collection for research and storage in future studies was given .
box . recommendations from the literature for future research focus and study design:
-social and behavioural research into conceptual and normative aspects should be backed up by empirical research .
-new inter-disciplinary collaborations including microbiologists, engineers and bioethicists .
-as real-time and other intervention strategies that build on hiv phylogenetic information continue to emerge, it will be critical to address questions of efficacy for cluster growth interventions to ensure that the benefits outweigh potential risks. implementation science research may also inform best practices for discussing the meaning and limitations of sequence data and cluster membership with community members and help to identify acceptable and evidence-based approaches that impose the least risk to persons within specific contexts. these might involve partnerships with providers for non-intrusive patient follow-up related to clusters, more detailed consent procedures for future follow-up related to hiv test results or partner services referrals, and specific guidelines and education to mitigate criminalization risks .
-communication methods that increase the understanding of phylogenetic studies need to be designed and evaluated. these must emphasise potential harms, thoughtful mitigation of harms to risk groups, processes for monitoring risk, and clear protection procedures to minimise risks .
-collaboration between stakeholders is necessary, with an active exchange of experiences and best practices. the first step should be sought in creating awareness and consensus within sectors on the causal factors of barriers to sharing of sequence data .
ethical conduct of studies:
-need to pre-define exceptional circumstances where un-validated techniques might be used in emergency situations .
-to ensure scientific validity, researchers and their associates should be competent to implement the proposed study design. in order to maximize scientific validity, the researchers should ensure that they have all necessary resources, that the community accepts the protocol and that a competent and independent research ethics committee (rec) or institutional review board (irb) reviews and approves the protocol .
-the scientific objectives of research should guide the choice of participants and determine the inclusion criteria and appropriate recruitment strategies. it is unethical to use privilege, convenience and/or vulnerability as criteria for selecting participants. exclusion of certain population sub-groups or communities in a research study without appropriate scientific justification is also considered unethical .
-risk mitigation strategies must also provide for redress mechanisms in cases of abuse or misuse of phylogenetic data. these strategies might require the establishment of ties with local legal services, organisations working to protect people with hiv, and criminalised or stigmatised populations, to ensure that they have access to the means to protect their rights .
in the public health and clinical setting, an australian study reported that one of the key differences amongst participants in a modified delphi 
study included the necessity for consent before testing and data-linkage. no panelists agreed with the statement "under no conditions should a study be conducted without prior consent", although only ten of thirty agreed that consent is not required under any conditions . in a dutch study, outbreak managers thought intervention without seeking explicit consent of all individuals involved is justified when there is at least a substantial public health threat, a realistic expectation that deploying the techniques will help to mitigate the outbreak, and that source and contact tracing would most likely not be successful without the use of molecular typing techniques . there was a strong commitment to rapid and open data-sharing, particularly in emergency or outbreak situations , , and in such conditions for incentives and safeguards to encourage rapid and unrestricted access to data release , , . the world health organization (who) recommended that in emergency outbreak situations "the first set of sequences providing crucial information on the pathogen, genotype, lineage, and strain(s) causing the outbreak should be generated and shared as rapidly as possible. sharing of corresponding anonymised sample metadata (minimally time and place of collection, ideally with demographic, laboratory, and clinical data) is essential to enhance the interpretation and the value of genomic data" . however, access to data gathered as part of clinical care was seen as ethically more contentious as "publicly accessible databases are not an appropriate storage location for the level of metadata required to enable clinical and epidemiological analysis for the purposes of providing patient and population care" . suggestions were made for a tiered approach to data release, whereby a separate database governed by appropriate public health authorities would collate and store metadata in a location to which access to data could be limited to users with a legitimate clinical or public health need to use it, and data that cannot be released into public domains but is needed by authorised healthcare and public health professionals for service delivery remains within a suitable secured access database . a public survey on tb explored questions related to database access and the potential benefits and risks associated with it. most felt that medical professionals and the research community should have access to such a database; and a significant proportion thought that other agencies, such as the police ( %) and immigration officials ( %), should also have access to the genomic database. experts, however, were clear that they felt transmission data should not be used in litigation; this was partially because it was deemed too unreliable , and also because of the potential for 'abuse' of data , , . overall, there was broad support for further work in defining the conditions for collection, use and storage of data and samples , , , , and for policy and legal clarity to aid the ethical implementation of these technologies. this will require more work to carefully assess and understand risks ; research to decide how much individual privacy might be put at risk in the name of public health ; consideration of alternative strategies required to mitigate this risk, such as suppression of data in the public domain where it may cause serious harm; and adjustments to communication plans . 
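the tiered approach to data release described above can be thought of as a simple field-level access policy; the sketch below is a minimal illustration in which the field names and tier definitions are assumptions, not a proposed standard.

```python
# a minimal sketch of tiered metadata release: a full record is filtered down to the
# fields each access tier is permitted to see; all names below are illustrative.
from typing import Dict

FULL_RECORD = {
    "sequence_id": "SAMPLE-0001",
    "collection_month": "2021-03",
    "collection_date": "2021-03-14",
    "region": "province A",
    "facility": "hospital X",
    "age": 34,
    "clinical_outcome": "recovered",
}

# fields visible at each tier; higher tiers would require authorisation and auditing
TIER_FIELDS = {
    "public": ["sequence_id", "collection_month", "region"],
    "public_health": ["sequence_id", "collection_date", "region", "facility",
                      "clinical_outcome"],
    "clinical": list(FULL_RECORD),
}

def release(record: Dict, tier: str) -> Dict:
    """return only the metadata fields permitted for the requested tier."""
    allowed = TIER_FIELDS[tier]
    return {k: v for k, v in record.items() if k in allowed}

print(release(FULL_RECORD, "public"))
print(release(FULL_RECORD, "public_health"))
```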
effective phylogenetic work often occurs at the interface between research and public health practice because the same data can be used for both purposes . in this regard, pathogen sequencing was described as 'straddling the boundary between research and clinical use' . the hybrid nature of sequencing activities imposes important ethical challenges. clinical implementation of metagenomics sequencing (un-targeted testing) has the potential to detect unexpected or incidental findings that may include infections with hepatitis or hiv . incidental findings of a different type may occur if non-germline samples (such as faecal samples) are contaminated with germline cells, which could potentially reveal predictive information about developing inherited disease . furthermore, informed consent for phylogenetic studies that are complex and difficult to understand, and in which the benefits and risks may not be fully determined, may also be difficult to achieve . mutenherwa suggests that where sequence data are generated for routine clinical management, its subsequent use for research and surveillance may be underestimated by patients . others suggested that the right to withdraw from research activities-a key indicator of voluntary participation in research-was overlooked by expert stakeholders . understanding and interpreting phylogenetic data requires significant expertise , and presents a challenge to established professional boundaries. expertise in phylogenetic studies creates new obligations for researchers, such as deciding whether or not to participate in forensic investigations and potential prosecutions of individuals , and to consider the down-stream uses and misuses of data . the routine implementation of pathogen sequencing studies may create new responsibilities for clinical microbiologists (related to public health) , and require major changes in culture such that diagnostic interpretation, therapeutic management decisions and antimicrobial treatment regimes are delegated to physicians instead of microbiologists . many noted that there are important reasons to ensure that the public and individuals understand the uses of data collected as part of sequencing studies, and the potential risks. first, this was seen to have some intrinsic value in that it supports patient autonomy and that truth telling is a respected moral virtue , , , . second, truth-telling was seen to be important because it may lead to better outcomes in research and public health practice. this is both because this was deemed to promote trust in research and therefore lead to increased participation, and because it promotes disclosure, which is helpful from a public health perspective. third, promoting understanding of uses and risks of data was also seen as a way of avoiding harm and exploitation of vulnerable individuals and communities, by enhancing understanding of risks that may be specific to them. in some cases, this was balanced by a number of practical challenges to telling people the truth, such as: risk of fear mongering ; information needed for legal proceedings in public interest ; and the fact that it may be difficult to adequately inform the public and/or ensure full understanding . nonetheless, there was a clear recommendation in the literature to raise public awareness and understanding of these techniques , , , and for early and meaningful community engagement , , , prior to conducting sequencing studies. 
this was seen as particularly important when working with vulnerable groups or when the risks of participation are high. the notion of justice appeared to be a widely recognized ethical principle in the field . stakeholders pointed to the importance of equitable access to data , and to benefit-sharing obligations , . this included an ethical imperative that outbreak related research and countermeasures, such as diagnostics and vaccines, should be accessible to all affected countries , and towards reciprocal arrangements such that countries that participate in sequencing activities should derive some corresponding local benefit . collaboration between researchers from africa and hics was raised as an important ethical consideration . in one empirical study, interviewees were concerned that african researchers were not meaningfully engaged in the scientific research process in health research in general and phylogenetic research in particular, and that for equitable and mutually beneficial collaborative research partnerships to be realized, local researchers were encouraged to take leading and active roles throughout the research process . this type of collaborative research practice was supported elsewhere in the literature , , . for some, this was to enhance equity as well as to help maximize the utility of data and lead to better public health outcomes. it was argued that local researchers were more likely to understand their health care and research systems and study results were more likely to be easily translated into policy , and that context specific responses to particular outbreaks were likely to be required. recommendations were made to conduct studies exploring the nature of existing collaborative partnerships between researchers from low-and middle-income countries (lmics) and hics to explore team composition and distribution of roles, including contribution to intellectual property . outbreak related research and countermeasures, such as diagnostics and vaccines, must be accessible to all affected countries not only as a legal obligation, but also as an ethical imperative . it was also noted that global and interdisciplinary partnerships are a necessary component of an effective genomic informed response to infectious disease. this was because of the vast range of stakeholders and varying interests involved in control of infectious disease outbreaks, and because issues may resist simple resolution and span multiple jurisdictions . for example, conflict may result from governments wishing to keep an outbreak quiet and/or from the tension between lmics with few resources for generating and using data and the researchers or response teams from better-resourced settings , . ownership of samples and data was seen as an important barrier to global cooperation. the nagoya protocol (np), for example, was developed to facilitate access to genetic resources and the fair and equitable sharing of benefits arising from their utilization . nevertheless, despite the importance of reinforcing sovereignty rights of states over genetic resources in their territory, uncertainties about intellectual property rights and the resulting disputes hamper access to samples , . ribeiro et al. explain that: "the real or perceived possibilities for the commercial valorization of microbial genetic resources (mgr) has enforced their appropriation for further use in research, innovation and product development. 
the problem for public health surveillance occurs when such appropriation is triggered at initial (upstream) phases of the research and innovation cycle, such as sampling and sequencing of microorganisms, instead of later stages, such as the actual product development (in this case drugs, diagnostics and vaccines). as such, stakeholders are reluctant to share their (intangible) assets even in early phases of the innovation process, decreasing the scope of innovation efforts due to the lack of access to upstream research inputs." the same authors suggest that standardized and simplified sharing agreements , and collaboration between stakeholders with an active exchange of experiences and best practices are required. in general, recommendations were made for a global approach to ethics, policy and legal frameworks , , , , . for example, it was suggested global data sharing arrangements should include "a global data governance or ethical framework, supplemented by local memoranda of understanding that take into account the local context" ; or to investigate how the global alliance for genomics and health (ga gh) framework for responsible data-sharing could be adapted for digital pathogen surveillance . lastly, it was clear that the types of ethical issues likely to arise are in part dependent upon the contexts in which studies are conducted, as well as the nature of the pathogen under study. chiefly, information that may impact on interpersonal relationships was viewed as particularly sensitive and therefore, worthy of additional ethical reflection. examples included: sexually transmitted infections , ; consent requirements to use isolates collected from dead neonates for the purposes of epidemiological research ; and disclosure of family members as the source of infection . it was suggested that the balance of risks to patients and public health benefits is likely to be affected by the characteristics of the pathogen, in terms of likely morbidity and mortality: infectivity; treatability and drug resistance . stakeholders also suggested that the ethical permissibility of sharing data about, particularly with regards to the source of transmission, may be different in professional contexts, where healthcare providers or companies are seen to carry a responsibility to control risk, as opposed to outside of professional contexts where protecting individuals from 'naming and shaming' may be of greater concern . it was also noted that the legal and regulatory structures in which studies are conducted may also influence the implementation and ethics of conducting pathogen sequencing studies. in particular, use of phylogenetic analyses in criminal convictions was raised as an ethical risk , , . although quality assessment of all included materials is desirable in systematic reviews, it was not possible in this case due to the inclusion of a diverse range of research formats and literature, such as commentaries and ethics guidelines. a second limitation of this review is that a large proportion of the literature included related to phylogenetic and hiv specifically ( out of ), meaning that the issues relevant to this context may be over-represented. this review highlights that while pathogen sequencing has the potential to be transformative for public health and clinical practice and to bring about important health benefits, there are a number of key ethical issues that must be addressed. 
in particular, there was clear support in the literature for innovative and critical thinking around the conditions of use for pathogen sequence data. this includes context specific standards of practice for consent, data collection, use and sharing. these practices should be informed by public values, and further empirical work investigating stakeholders' views is required. this should include experts in pathogen sequencing, patients and the general public, as well as end users such as public health professionals and clinicians. lastly, it is both a scientific and an ethical imperative that development in the field is under-pinned by a strong commitment to values of justice, in particular global health equity. all data underlying the results are available as part of the article and no additional source data are required. this manuscript presents the results of a scoping review of the literature on the topic of the ethical issues in the use of pathogen genome sequencing technologies. the authors make excellent use of the scoping review methodology which they implemented clearly (good use of tables!) and rigorously. there are minor formatting issues and typos: p. under theme : somewere concerned about forced testing either of certain groups such as gay men or healthcare workers , , - . there was acknowledgment of individual professional interests of researchersand practitioners to ownership and use of data. p. under theme : 'that may be specific to that them' the main substantial issue is the substantial focus of the manuscript on the hiv context ( articles out of ) which the authors acknowledge. hiv is a particularly stigmatized, serious condition, with lifelong consequences for patients which is not the case for many other communicable diseases. the problem is that in the past few months, covid- has been a game changer in the field. covid- is both much more contagious but less stigmatizing and, for most people, less dangerous than hiv. considering that the last update to this scoping review was made in january , none of the covid- emerging literature was considered. the result, which is not the fault of the authors, is that the article will only have a limited relevance to the current global pandemic. given the importance of covid to the field and beyond, it could be worth it for the author to take the time to update their research accounting for very recent developments before publishing. otherwise, the publication may be perceived as already outdated and not garner much attention from readers. a second issue of the manuscript is that while it acknowledges the tension and blurry demarcation between the research and the public health context, it doesn't really provide any solution in this regard. for example, in the context of pathogen genome sequencing for outbreak surveillance during a public health emergency, informed consent is often not required. however, the lack of consent can create issues later for data sharing with the research community. such a scenario is not really discussed in the manuscript. similarly, the impact of the public health vs. research situation on the potential requirement for ethics review is not discussed. perhaps this was not touched upon in the literature, but it is certainly a preoccupation of researchers in the covid context. is the study design appropriate and is the work technically sound? yes are all the source data underlying the results available to ensure full reproducibility? yes are the conclusions drawn adequately supported by the results? 
yes. how ownership rights over microorganisms affect infectious disease control and innovation: a root-cause analysis of barriers to data sharing as experienced by key stakeholders. societal implications of the internet of pathogens. reviewer expertise: ethical, legal and social issues of genomic research, medical law, bioethics. reviewer report july https://doi.org/ . /wellcomeopenres. .r © armstrong g. office of advanced molecular detection, national center for emerging and zoonotic infectious diseases, centers for disease control and prevention, atlanta, ga, usa. the article is generally well written, although it would be stronger if it had a more concise, more focused set of recommendations for readers to act on. the focus of the paper is mostly on the use of microbial genomic data for research but also touches a little on the issues around public health use of the data. this reviewer is not an expert in ethics; that said, the ethical concepts brought out in the paper in the review appear to be appropriate and informative. some additional concepts that weren't included, possibly because the article presents a review of already published ideas, are the following: ○ the potential for civil legal consequences from publication of genomic data. the article discusses at a few different points the potential for criminal prosecution, but publication of data could also open up participants of a study to civil penalties as well. ○ the potential for violation of privacy when cases are rare. for example, early in an outbreak, as is highlighted in the article, there is an urgency to making sequence data from the pathogen in question public. however, it's common early in an outbreak for the media to discover and publish the names of early cases. in that situation (which is relatively common), there's a risk that sequence data made public could be linked to a specific person. ○ one concern that is often under-appreciated is that, under certain circumstances, a researcher or public health agency might be legally compelled to release the identity of someone. in general, public health laws are quite strong in shielding public health data, but that may not be the case with research data. if there is uncertainty about whether such data are protected, any participants in research should be notified. ○ ga gh is mentioned in the manuscript, but there's no mention of pha ge ( https://pha ge.github.io/), which is much more applicable here. ○ issues around the nagoya protocol are addressed, but the tension between gisaid and insdc is not. gisaid provides protections for intellectual property rights that the insdc members do not, and such protections are key for participation of lmics. however, where public funds are used to obtain data, there is a strong argument that the data should be made publicly available (i.e., through insdc) without restriction as long as privacy and confidentiality are not placed at risk. the tension between the two models has been particularly strong with the advent of covid- , and the very assertive push by gisaid to prevent researchers from submitting to insdc or from citing it. ○ some (minor) specific issues that need to be addressed: i would remove the reference to microsoft excel. 
there's a strong argument for including information about statistical software, but which spreadsheet is used is not particularly relevant. ○ "theme : …": there are typos in the first paragraph. ○ box , item : this is not a complete, declarative sentence like the others. ○ reference appears to be the incorrect reference. also, the quote from it (pages - ) appears to be a quote from a separate document (that should be cited separately, if still available). is the study design appropriate and is the work technically sound? if applicable, is the statistical analysis and its interpretation appropriate? are all the source data underlying the results available to ensure full reproducibility? no source data required. competing interests: no competing interests were disclosed. reviewer expertise: pathogen genomics to support public health. i confirm that i have read this submission and believe that i have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. key: cord- -k wrory authors: prieto, diana m; das, tapas k; savachkin, alex a; uribe, andres; izurieta, ricardo; malavade, sharad title: a systematic review to identify areas of enhancements of pandemic simulation models for operational use at provincial and local levels date: - - journal: bmc public health doi: . / - - - sha: doc_id: cord_uid: k wrory background: in recent years, computer simulation models have supported development of pandemic influenza preparedness policies. however, u.s. policymakers have raised several concerns about the practical use of these models. in this review paper, we examine the extent to which the current literature already addresses these concerns and identify means of enhancing the current models for higher operational use. methods: we surveyed pubmed and other sources for published research literature on simulation models for influenza pandemic preparedness. we identified models published between and that consider single-region (e.g., country, province, city) outbreaks and multi-pronged mitigation strategies. we developed a plan for examination of the literature based on the concerns raised by the policymakers. results: while examining the concerns about the adequacy and validity of data, we found that though the epidemiological data supporting the models appears to be adequate, it should be validated through as many updates as possible during an outbreak. demographical data must improve its interfaces for access, retrieval, and translation into model parameters. regarding the concern about credibility and validity of modeling assumptions, we found that the models often simplify reality to reduce computational burden. such simplifications may be permissible if they do not interfere with the performance assessment of the mitigation strategies. we also agreed with the concern that social behavior is inadequately represented in pandemic influenza models. our review showed that the models consider only a few social-behavioral aspects including contact rates, withdrawal from work or school due to symptoms appearance or to care for sick relatives, and compliance to social distancing, vaccination, and antiviral prophylaxis. the concern about the degree of accessibility of the models is palpable, since we found three models that are currently accessible by the public while other models are seeking public accessibility. policymakers would prefer models scalable to any population size that can be downloadable and operable in personal computers. 
but scaling models to larger populations would often require computational needs that cannot be handled with personal computers and laptops. as a limitation, we state that some existing models could not be included in our review due to their limited available documentation discussing the choice of relevant parameter values. conclusions: to adequately address the concerns of the policymakers, we need continuing model enhancements in critical areas including: updating of epidemiological data during a pandemic, smooth handling of large demographical databases, incorporation of a broader spectrum of social-behavioral aspects, updating information for contact patterns, adaptation of recent methodologies for collecting human mobility data, and improvement of computational efficiency and accessibility. the ability of computer simulation models to "better frame problems and opportunities, integrate data sources, quantify the impact of specific events or outcomes, and improve multi-stakeholder decision making," has motivated their use in public health preparedness (php) [ ] . in , one such initiative was the creation of the preparedness modeling unit by the centers for disease control and prevention (cdc) in the u.s. the purpose of this unit is to coordinate, develop, and promote "problem-appropriate and data-centric" computer models that substantiate php decision making [ ] . 
of the existing computer simulation models addressing php, those focused on disease spread and mitigation of pandemic influenza (pi) have been recognized by the public health officials as useful decision support tools for preparedness planning [ ] . in recent years, computer simulation models were used by the centers for disease control and prevention (cdc), department of health and human services (hhs), and other federal agencies to formulate the "u.s. community containment guidance for pandemic influenza" [ ] . although the potential of the existing pi models is well acknowledged, it is perceived that the models are not yet usable by the state and local public health practitioners for operational decision making [ , [ ] [ ] [ ] . to identify the challenges associated with the practical implementation of the pi models, the national network of public health institutes, at the request of cdc, conducted a national survey of the practitioners [ ] . the challenges identified by the survey are summarized in table . we divided the challenges (labeled a through a in table ) into two categories: those (a through a ) that are related to model design and implementation and can potentially be addressed by adaptation of the existing models and their supporting databases, and those (a through a ) that are related to resource and policy issues, and can only be addressed by changing public health resource management approaches and enforcing new policies. although it is important to address the challenges a through a , we consider this a prerogative of the public health administrators. hence, the challenges a to a will not be discussed in this paper. the challenges a through a reflect the perspectives of the public health officials, the end users of the pi models, on the practical usability of the existing pi models and databases in supporting decision making. addressing these challenges would require a broad set of enhancements to the existing pi models and associated databases, which have not been fully attempted in the literature. in this paper, we conduct a review of the pi mitigation models available in the published research literature with an objective of answering the question: "how to enhance the pandemic simulation models and the associated databases for operational use at provincial and local levels?" we believe that our review accomplishes its objective in two steps. first, it exposes the differences between the perspectives of the public health practitioners and the developers of models and databases on the required model capabilities. second, it derives recommendations for enhancing practical usability of the pi models and the associated databases. in this section, we describe each of the design and implementation challenges of the existing pi models (a -a ) and present our methods to examine the challenges in the research literature. in addition, we present our paper screening and parameter selection criteria. design and implementation challenges of pandemic models and databases validity of data support (a ) public health policy makers advocate that the model parameters be derived from up to date demographical and epidemiological data during an outbreak [ ] . in this paper we examine some of the key aspects of data support, such as data availability, data access, data retrieval, and data translation. to ensure data availability, a process must be in place for collection and archival of both demographical and epidemiological data during an outbreak. 
the data must be temporally consistent, i.e., it must represent the actual state of the outbreak. in the united states and other few countries, availability of temporally consistent demographical data is currently supported by governmental databases including the decennial census and the national household travel survey [ ] [ ] [ ] [ ] . to ensure temporal consistency of epidemiological data, the institute of medicine (iom) has recommended enhancing the data collection protocols to support real-time decision making [ ] . the frequency of data updating may vary based on the decision objective of the model (e.g., outbreak detection, outbreak surveillance, and initiation and scope of interventions). as noted by fay-wolfe, the timeliness of a decision is as important as its correctness [ ] , and there should be a balance between the cost of data updating and the marginal benefits of the model driven decisions. archival of data must allow expedited access for model developers and users. in addition, mechanisms should be available for manual or automatic retrieval of data and its translation into model parameter values in a timely manner. in our review of the existing pi models at provincial and local levels, we examined the validity of data that was used in supporting major model parameters. the major model parameters include: the reproduction number, defined as the number of secondary infections that arise from a typical primary case [ ] ; the proportion of the population who become infected, also called infection attack rate [ ] ; the disease natural history within an individual; and fractions of symptomatic and asymptomatic individuals. the first row of table summarizes our approach to examine data validity. for each reviewed pi model, and, for each of the major model parameters, we examined the source and the age of data used (a a, a b), the type of interface used for data access and retrieval (a c), and the technique used for translating data into the parameter values (a d). public health practitioners have emphasized the need for models with credible and valid assumptions [ ] . credibility and validity of model assumptions generally refer to how closely the assumptions represent reality. however, for modeling purposes, assumptions are often made to balance data needs, analytical tractability, and computational feasibility of the models with their ability to support timely and correct decisions [ ] . making strong assumptions may produce results that are timely but with limited or no decision support value. on the other hand, relaxing the simplifying assumptions to the point of analytical intractability or computational infeasibility may seriously compromise the fundamental purpose of the models. every model is comprised of multitudes of assumptions pertaining to contact dynamics, transmission and infection processes, spatial and temporal considerations, demographics, mobility mode(s), and stochasticity of parameters. credibility and validity of these assumptions largely depend on how well they support the decision objectives of the models. for example, if a model objective is to test a household isolation strategy (allowing sick individuals to be isolated at home, in a separate room), the model assumptions must allow tracking of all the individuals within the household (primary caregivers and others) so that the contact among the household members can be assigned and possible spread of infection within the household can be assessed. 
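as a concrete instance of translating raw data into a model parameter value, the sketch below converts a household secondary attack rate, of the kind obtained from household sampling studies, into a daily per-contact transmission probability, assuming independent daily exposures over the infectious period; the numbers are illustrative, not estimates from any reviewed model.

```python
# a minimal sketch of an "arithmetic conversion" from survey data to a model parameter:
# a household secondary attack rate (sar) observed over an infectious period of d days
# is converted to a daily per-contact transmission probability, assuming independent
# daily bernoulli exposures. the values of sar and d below are assumed.
def daily_transmission_prob(secondary_attack_rate: float, infectious_days: int) -> float:
    """assume independent daily exposures: p = 1 - (1 - sar)**(1/d)."""
    return 1.0 - (1.0 - secondary_attack_rate) ** (1.0 / infectious_days)

sar = 0.15   # observed household secondary attack rate (assumed)
d = 5        # length of the infectious period in days (assumed)
p = daily_transmission_prob(sar, d)
print(f"daily per-contact transmission probability: {p:.4f}")
```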
this idea is further discussed in the results section through an analysis of some of the model assumptions regarding contact probability and frequency of new infection updates that were made in two of the commonly referenced pi models in the pandemic literature [ , ] . it has been observed in [ ] that the existing pi models fall short of capturing relevant aspects of human behavior. this observation naturally evokes the following questions. what are the relevant behavioral aspects that must be considered in pi models? is there scientific evidence that establishes the relative importance of these aspects? what temporal consistency is required for data support of the aspects of human behavior? the third row of table summarizes our plan to examine how the existing models capture human behavior. for each reviewed pi model, we first identify the behavioral aspects that were considered, and then for each aspect we examine the source and the age of data used, the type of interface used for data access and retrieval, and the technique used for translating data into model parameter values (a a-d). we also attempt to answer the questions raised above, with a particular focus on determining what enhancements can be done to better represent human behavior in pi models. public health practitioners have indicated the need for openly available models and population specific data that can be downloaded and synthesized using personal computers [ ] . while the ability to access the models is essential for end users, executing the pi models on personal computers, in most cases, may not be feasible due to the computational complexities of the models. some of the existing models feature highly granular description of disease spread dynamics and mitigation via consideration of scenarios involving millions of individuals and refined time scales. while such details might increase credibility and validity of the models, this can also result in a substantial computational burden, sometimes beyond the capabilities of personal computers. there are several factors which contribute to the computational burden of the pi models, the primary of which is the population size. higher population size of the affected region requires larger datasets to be accessed, retrieved, and downloaded to populate the models. other critical issues that add to the computational burden are: data interface with a limited bandwidth, the frequency of updating of data as a pandemic progresses, pre-processing (filtering and quality assurance) requirement for raw data, and the need for data translation into parameter values using methods like maximum likelihood estimation and other arithmetic conversions. the choice of the pi model itself can also have a significant influence on the computational burden. for example, differential equation (de) models divide population members into compartments, where in each compartment every member makes the same number of contacts (homogeneous mixing) and a contact can be any member in the compartment (perfect mixing). in contrast, agent-based (ab) models track each individual of the population where an individual contacts only the members in his/her relationship network (e.g., neighbors, co-workers, household members, etc.) [ ] . the refined traceability of individual members offered by ab models increases the usage of computational resources. further increases in the computational needs are brought on by the need for running multiple replicates of the models and generating reliable output summaries. 
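the computational contrast described above can be seen from a minimal differential-equation example: the sketch below implements a homogeneously mixing seir compartmental model with illustrative parameter values; an agent-based counterpart would instead iterate over millions of individuals and their contact networks at each time step, which is where the additional burden arises.

```python
# a minimal sketch of a homogeneously mixing seir compartmental (de) model;
# population size, rates, and seeding below are illustrative assumptions.
import numpy as np
from scipy.integrate import solve_ivp

N = 1_000_000                                   # population size (assumed)
beta, sigma, gamma = 0.6, 1 / 2.0, 1 / 4.0      # transmission, incubation, recovery rates (assumed)

def seir(t, y):
    S, E, I, R = y
    new_inf = beta * S * I / N                  # perfect/homogeneous mixing term
    return [-new_inf, new_inf - sigma * E, sigma * E - gamma * I, gamma * I]

y0 = [N - 10, 0, 10, 0]                         # ten initial infectious cases (assumed)
sol = solve_ivp(seir, (0, 180), y0, t_eval=np.arange(0, 181))

peak_day = int(sol.t[np.argmax(sol.y[2])])
attack_rate = sol.y[3, -1] / N
print(f"epidemic peaks on day {peak_day}; final attack rate is about {attack_rate:.1%}")
```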
as summarized in the last row of table , we examine which models have been made available to the general public and whether they are offered as open or closed source code. we also check for the documentation of model implementation as well as for existence of user support, if any. in addition, we look for the ways that researchers have attempted to address the computational feasibility of their models, including data access, retrieval and translation, model execution, and generation of model outputs.
table . plan for examination of the design and implementation challenges of the existing pi models (excerpt): validity of data support (a ) for model parameters -- for each pi model and for each of the major model parameters (e.g., reproduction number, illness attack rate) examine: a a. data source for parameter values (actual, simulated, assumed); a b. age of data; a c. type of interface for data access and retrieval (manual, automatic); a d. technique to translate raw data into model parameter values (e.g., arithmetic conversion, bayesian estimation). credibility and validity of model assumptions (a ).
the initial set of articles for our review was selected following the prisma reporting methodology, as applicable. we used the pubmed search engine with the keyword string "influenza" and "pandemic" and "model" in the english language. a total of papers were found which were published between and . we filtered those using the following selection criteria (also depicted in figure ). -articles that evaluate one or more strategies in each of the mitigation categories: social distancing, vaccination, and antiviral application. we limited the paper (by excluding models that do not consider all three categories) to contain the scope of this review, as we examined a large body of related papers from which our selected articles drew their parameters (see additional tables). 
we grouped the twenty-three selected articles in eleven different clusters based on their model (see table ). the clusters are named either by the name used in the literature or by the first author name(s). for example, all three papers in the imperial-pitt cluster use the model introduced initially by ferguson et al. [ ] . in each cluster, to review the criteria for the design and implementation challenge (a ), we selected the article with the largest and most detailed testbed (marked in bold in table ). as stated earlier, credibility and validity of model assumptions (a ), were examined via two most commonly cited models in the pandemic literature [ , ] . the challenges a -a were examined separately for each of the selected articles. out of the ten model clusters presented in table , eight are agent-based simulation models, while the rest are differential equation models. also, while most of the models use purely epidemiological measures (e.g., infection attack rates and reproduction numbers) to assess the effectiveness of mitigation strategies, only a few use economic measures [ , , ] . in our review, we examined epidemiological, demographical, and social-behavioral parameters of the pandemic models. we did not examine the parameters of the mitigation strategies as a separate category since those are functions of the epidemiological, demographical, and social-behavioral parameters. for example, the risk groups for vaccine and antiviral (which are mitigation parameters) are functions of epidemiological parameters such as susceptibility to infection and susceptibility to death, respectively. another example is the compliance to non-pharmaceutical interventions, a mitigation strategy parameter, which can be achieved by altering the social behavioral parameters of the model. in this section, we present the results of our review of the models that evaluate at least one strategy from each mitigation category (social distancing, vaccination and antiviral application). we also identify areas of enhancements of the simulation based pi models for operational use. our discussion on validity of data support includes both epidemiological and demographic data. additional file : table s summarizes the most common epidemiological parameters used in the selected models along with their data sources, interface for data access and retrieval, and techniques used in translating raw data into parameter values. additional file : table s presents information similar to above for demographic parameters. the most commonly used epidemiological parameters are reproduction number (r), illness attack rate (iar), initial set of articles filtered from pubmed using keyword search (n = ) remaining articles (n = ) exclusion of articles that do not examine pandemic influenza spread under a comprehensive set of mitigation strategies (n = ) exclusion of articles that examine global pandemic spread (n = ) remaining articles (n = ) remaining articles (n = ) inclusion of articles that meet the above criteria but are obtained using snowball search outside pubmed (n = ) exclusion of articles that do not provide a comprehensive support for data collection and parameterization methods (n = ) articles reviewed (n = ) figure selection criteria for pi models for systematic review. disease natural history parameters, and fraction of asymptomatic infected cases. 
in the models that we have examined, estimates of reproduction numbers have been obtained by fitting case/mortality time series data from the past pandemics into models using differential equations [ ] , cumulative exponential growth equations [ ] , and bayesian likelihood expressions [ ] . iars have been estimated primarily using household sampling studies [ ] , epidemic surveys [ , ] , and case time series reported for h n [ , ] . the parameters of the disease natural history, which are modeled using either a continuous or phase-partitioned time scale (see additional file : table s ), have been estimated from household random sampling data [ , , ] , viral shedding profiles from experimental control studies [ , , , ] , and case time series reported for h n [ , ] . bayesian likelihood estimation methods were used in translating case time series data [ , ] . fraction of asymptomatic infected cases has been estimated using data sources and translation techniques similar to the ones used for natural history. recent phylogenetic studies on the h n virus help to identify which of the above epidemiological parameters need real-time re-assessment. these studies suggest that the migratory patterns of the virus, rather than the intrinsic genomic features, are responsible for the second pandemic wave in [ , ] . since r and iar are affected not only by the genomic features but also by the migratory patterns of the virus, a close monitoring of these parameters throughout the pandemic spread is essential. real-time monitoring of parameters describing disease natural history and fraction of asymptomatic cases is generally not necessary since they are mostly dependent on the intrinsic genomic features of the virus. these parameters can be estimated when a viral evolution is confirmed through laboratory surveillance. estimation methods may include surveys (e.g., household surveys of members of index cases [ , ] ) and laboratory experiments that inoculate pandemic strains into human volunteers [ ] . current pandemic research literature shows the existence of estimation methodologies for iar and r that can be readily used provided that raw data is available [ ] . there exist several estimators for r (wallinga et al. [ , ] , fraser [ ] , white and pagano [ ] , bettencourt et al. [ ] , and cauchemez et al. [ ] ). these estimates have been derived from different underlying infection transmission models (e.g., differential equations, time since infection and directed network). with different underlying transmission models, the estimators consider data from different perspectives, thereby yield different values for r at a certain time t. for example, fraser [ ] proposes an instantaneous r that observes how past case incidence data (e.g., in time points t- , t- , t- ) contribute to the present incidence at time t. in contrast, wallinga et al. [ , ] and cauchemez et al. [ ] propose estimators that observe how the future incidences (e.g., t + , t + , t + ) are contributed by a case at time t. white and pagano [ ] considers an estimator that can be called a running estimate of the instantaneous reproduction number. further extensions of the above methods have been developed to accommodate more realistic assumptions. bettencourt extended its r estimator to account for multiple introductions from a reservoir [ ] . the wallinga estimator was extended by cowling [ ] to allow for reporting delays and repeated importations, and by glass [ ] to allow for heterogeneities among age groups (e.g., adults and children). 
the fraser estimator was extended by nishiura [ ] to allow the estimation of the reproduction number for a specific age class given infection by another age class. the above methods for real-time estimation of r are difficult to implement in the initial and evolving stages of a pandemic given the present status of the surveillance systems. at provincial and local levels, surveillance systems are passive as they mostly collect data from infected cases who are seeking healthcare [ ] . with passive surveillance, only a fraction of symptomatic cases are detected with a probable time delay from the onset of symptoms. once the symptomatic cases seek healthcare and are reported to the surveillance system, the healthcare providers selectively submit specimens to the public health laboratories (phl) for confirmatory testing. during the h n pandemic in , in regions with high incidence rates, the daily testing capacities of the phl were far exceeded by the number of specimens received. in these phl, the existing first-come-first-serve testing policy and the manual methods for receiving and processing the specimens further delayed the pace of publication of confirmed cases. the time series of the laboratory confirmed cases likely have been higher due to the increased specimen submission resulting from the behavioral response (fear) of both the susceptible population and the healthcare providers after the pandemic declaration [ ] . similarly, time series of the confirmed cases likely have been lower at the later stages of the pandemic as federal agencies advocated to refrain from specimen submission [ ] . the present status of the surveillance systems calls for the models to account for: the underreporting rates, the delay between onset of symptoms and infection reporting, and the fear factor. in addition, we believe that it is necessary to develop and analyze the cost of strategies to implement active surveillance and reduce the delays in the confirmatory testing of the specimens. in our opinion, the above enhancement can be achieved by developing methods for statistical sampling and testing of specimens in the phl. in addition, new scheduling protocols will have to be developed for testing the specimens, given the limited laboratory testing resources, in order to better assess the epidemiological parameters of an outbreak. with better sampling and scheduling schemes at the phl, alterations in the specimen submission policies during a pandemic (as experienced in the u.s. during the outbreak) may not be necessary. the above enhancements would also support a better real-time assessment of the iar, which is also derived from case incidence data. our review of the selected pi models indicates that currently all of the tasks relating to access and retrieval of epidemiological data are being done manually. techniques for translation of data into model parameter values range from relatively simple arithmetic conversions to more time-consuming methods of fitting mathematical and statistical models (see additional file : table s ). there exist recent mechanisms to estimate incidence curves in real-time by using web-based questionnaires from symptomatic volunteers [ ] , google and yahoo search queries [ , ] and tweeter messages [ ] and have supported influenza preparedness in several european countries and the u.s. [ , ] . if real-time incidence estimates are to be translated into pi models parameters, complex translation techniques might delay execution of the model. 
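as a concrete illustration of how raw surveillance counts might be translated into one of the model parameters discussed above, the sketch below computes a simple instantaneous reproduction number from a daily case time series and an assumed serial-interval distribution, after a crude correction for underreporting and reporting delay. this is only a minimal, hypothetical example in the spirit of the fraser-type estimators cited above; the case counts, reporting fraction, delay, and serial-interval weights are assumed values rather than outputs of any of the reviewed models.

```python
# minimal sketch: instantaneous reproduction number from reported incidence.
# all inputs (counts, reporting fraction, delay, serial-interval weights) are
# hypothetical and would have to be replaced with actual surveillance data.

reported = [3, 5, 9, 14, 22, 30, 41, 55, 70, 88]  # daily reported cases
reporting_fraction = 0.25   # assumed share of infections that get reported
reporting_delay = 2         # assumed days from infection to report

# crude correction: scale counts up for underreporting and drop the first days
# so that counts are roughly aligned with infection dates
incidence = [c / reporting_fraction for c in reported][reporting_delay:]

# assumed discrete serial-interval distribution: w[s-1] is the probability that
# an infector infected s days ago generates a case today (weights sum to 1)
w = [0.2, 0.4, 0.25, 0.15]

def instantaneous_r(incidence, w):
    """today's incidence divided by the serial-interval-weighted sum of past
    incidence, in the spirit of the instantaneous estimators discussed above."""
    estimates = []
    for t in range(len(w), len(incidence)):
        pressure = sum(w[s - 1] * incidence[t - s] for s in range(1, len(w) + 1))
        if pressure > 0:
            estimates.append((t, incidence[t] / pressure))
    return estimates

for day, r in instantaneous_r(incidence, w):
    print(f"day {day}: r ~ {r:.2f}")
```

in an operational setting one would rely on an established implementation of the wallinga-teunis or fraser-type estimators and propagate the uncertainty in the serial interval and reporting process, rather than report point estimates from a crude correction of this kind.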
we believe that model developers should consider building (semi)automatic interfaces for epidemiological data access and retrieval and develop translation algorithms that can balance the run time and accuracy. additional file : table s shows the most common demographic parameters that are used in the selected models. the parameters are population size/density, distribution of household size, peer-group size, age, commuting travel, long-distance travel, and importation of infected cases to the modeled region. estimation of these parameters has traditionally relied on comprehensive public databases, including the u.s. census, landscan, italian institute of statistics, census of canada, hong kong survey data, uk national statistics, national household travel survey, uk department of transport, u.s. national centre for educational statistics, the italian ministry of university and research and the uk department for education and skills. readers are referred to additional file : table s for a complete list of databases and their web addresses. our literature review shows that access and retrieval of these data are currently handled through manual procedures. hence, there is an opportunity for developing tools to accomplish (semi)automatic data access, retrieval, and translation into model parameters whenever a new outbreak begins. it is worth noting that access to demographic information is currently limited in many countries, and therefore obtaining demographic parameters in real-time would only be possible for where information holders (censing agencies and governmental institutions) openly share the data. the data sources supporting parameters for importation of infected cases reach beyond the modeled region requiring the regional models to couple with global importation models. this coupling is essential since the possibility of new infection arrivals may accelerate the occurrence of the pandemic peak [ ] . this information on peak occurrence could significantly influence time of interventions. some of the single region models consider a closed community with infusion of a small set of infected cases at the beginning [ , , ] . single region models also consider a pseudo global coupling through a constant introduction of cases per unit time [ , ] . other single region models adopt a more detailed approach, where, for each time unit, the number of imported infections is estimated by the product of the new arrivals to the region and the probability of an import being infected. this infection probability is estimated through a global disease spread compartmental model [ , ] . the latter approach is similar to the one used by merler [ ] for seeding infections worldwide and is operationally viable due to its computational simplicity. for a more comprehensive approach to case importation and global modeling of disease spread, see [ ] . recall that our objective here is to discuss how the credibility and validity of assumptions should be viewed in light of their impact on the usability of models for public health decision making. we examine the assumptions regarding contact probability and the frequency of new infection updates (e.g., daily, quarterly, hourly) in two models: the imperial-pitt [ ] and the uw-lanl models [ ] . 
choice of these models was driven by their similarities (in region, mixing groups, and the infection transmission processes), and the facts that these models were cross validated by halloran [ ] and were used for developing the cdc and hhs "community containment guidance for pandemic influenza" [ ] . we first examine the assumptions that influence contact probabilities within different mixing groups (see table ). for household, the imperial-pitt model assumes constant contact probability while the uw-lanl model assumes that the probability varies with age (e.g., kid to kid, kid to adult). the assumption of contact probability varying with age matches reality better than assuming it to be constant [ ] . however, for households with smaller living areas the variations may not be significant. also, neither of the papers aimed at examining strategies (e.g., isolation of sick children within a house) that depended on age-based contact probability. hence, we believe that the assumptions can be considered credible and valid. for workplaces and schools, the assumption of % of contacts within the group and % contacts outside the group, as made in the imperial-pitt model, appears closer to reality than the assumption of constant probability in the uw-lanl model [ ] . for community places, the imperial-pitt model considered proximity as a factor influencing the contact probability, which was required for implementing the strategy of providing antiviral prophylaxis to individuals within a ring of certain radius around each detected case. we also examined the assumptions regarding the frequency of infection updates. the frequency of update dictates how often the infection status of the contacted individuals is evaluated. in reality, infection transmission may occur (or does not occur) whenever there is a contact event between a susceptible and an infected subject. the imperial-pitt and the uw-lanl models do not evaluate infection status after each contact event, since this would require consideration of refined daily schedules to determine the times of the contact events. instead, the models evaluate infection status every six hours [ ] or at the end of the day [ ] by aggregating the contact events. while such simplified assumptions do not allow the determination of the exact time of infection for each susceptible, they offer a significant computational reduction. moreover, in a real-life situation, it will be nearly impossible to determine the exact time of each infection, and hence practical mitigation (or surveillance) strategies should not rely on it. the above analysis reveals how the nature of mitigation strategies drives the modeling assumptions and the computational burden. we therefore believe that the policymakers and the modelers should work collaboratively in developing modeling assumptions that adequately support the mitigation strategy needs. furthermore, the issue of credibility and validity of the model assumptions should be viewed from the perspectives of the decision needs and the balance between analytical tractability and computational complexity. for example, it is unlikely that any mitigation strategy would have an element that depends of the minute by minute changes in the disease status. hence, it might be unnecessary to consider a time scale of the order of a minute for a model and thus increase both computational and data needs. contact rate is the most common social-behavioral aspect considered by the models that we have examined. in these models, except for eichner et al. 
[ ] , the values of the contact rates were assumed due to the unavailability of reliable data required to describe the mobility and connectivity of modern human networks [ , , ] . however, it is now possible to find "fresh" estimates of the types, frequency, and duration of human contacts either from a recent survey at the continental level [ ] or from a model that derives synthetic contact information at the country level [ ] . in addition, recent advances in data collection through bluetooth enabled mobile telephones [ ] and radio frequency identification (rfid) devices [ ] allow better extraction of proximity patterns and social relationships. availability of these data creates further opportunity to explore methods of access, retrieval, and translation into model parameters. issues of data confidentiality, cost of the sensing devices, and low compliance to the activation of sensing applications might prevent the bluetooth and rfid technologies from being effectively used in evolving pandemic outbreaks. another possibility is the use of aggregated and anonymous network bandwidth consumption data (from network service providers) to extrapolate population distribution in different areas at different points in time [ , ] . other social-behavioral parameters that are considered by the reviewed models include reactive withdrawal from work or school due to appearance of symptoms [ ] , work absenteeism to care for sick relatives or children at home due to school closure [ , , , ] , and compliance to social distancing, vaccination, and antiviral prophylaxis [ , ] . once again, due to the lack of data support, the values of most of these parameters were assumed and their sensitivities were studied to assess the best and worst case scenarios. existing surveys collected during the h n outbreak can be useful in quantifying the above parameters [ , ] . recent literature has explored many additional socialbehavioral aspects that were not considered in the models we reviewed. there are surveys that quantify the levels of support for school closure, follow up on sick students by the teachers [ ] , healthcare seeking behavior [ ] , perceived severity, perceived susceptibility, fear, general compliance intentions, compliance to wearing face masks, role of information, wishful thinking, fatalistic thinking, intentions to fly away, stocking, staying indoors, avoiding social contact, avoiding health care professionals, keeping children at home and staying at home, and going to work despite being advised to stay at home [ ] . there are also models that assess the effect of selfinitiated avoidance to a place with disease prevalence [ ] , voluntary vaccination and free-ride (not to vaccinate but rely on the rest of the population to keep coverage high [ ] . other recognized behaviors include refusal to vaccinate due to religious beliefs and not vaccinating due to lack of awareness [ ] . we believe that there is a need for further studies to establish the relative influence of all of the above mentioned social-behavioral factors on operational models for pandemic spread and mitigation. subsequently, the influential factors need to be analyzed to determine how relevant information about those factors should be collected (e.g., in real-time or through surveys before an outbreak), accessed, retrieved, and translated into the final model parameter values. 
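to make concrete how contact parameters of this kind enter a simulation once they have been collected and translated, the sketch below aggregates a susceptible individual's daily contacts across mixing groups and evaluates infection status once at the end of the day, in the spirit of the aggregated update schemes described above. the contact counts, per-contact transmission probabilities, and prevalences are hypothetical placeholders rather than values taken from the reviewed models or surveys.

```python
import random

# minimal sketch: end-of-day infection update for one susceptible individual.
# each mixing group is described by (daily contacts, per-contact transmission
# probability, prevalence of infectious individuals); all values are assumed.
daily_contacts = {
    "household": (3, 0.08, 0.10),
    "school_or_workplace": (10, 0.03, 0.05),
    "community": (15, 0.01, 0.02),
}

def daily_infection_probability(contacts):
    """the probability of escaping infection is the product, over the expected
    infectious contacts in each group, of (1 - transmission probability)."""
    escape = 1.0
    for n_contacts, p_transmit, prevalence in contacts.values():
        expected_infectious = n_contacts * prevalence
        escape *= (1.0 - p_transmit) ** expected_infectious
    return 1.0 - escape

p_inf = daily_infection_probability(daily_contacts)
print(f"probability of infection today: {p_inf:.3f}")

# stochastic update performed once per day rather than after each contact event
print("infected today" if random.random() < p_inf else "still susceptible")
```

aggregating contact events in this way gives up the exact timing of each transmission in exchange for computational economy, which is precisely the compromise made by the imperial-pitt and uw-lanl update schemes discussed above.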
it is important to mention very recent efforts in improving models for assessment of relevant social behavioral components including commuting, long distance travel behavior [ , , ] , and authority recommended decline of travel to/from affected regions [ ] . for operational modeling, it would be helpful to adapt the approaches used by these models in translating massive data sets (e.g., bank notes, mobile phone user trajectories, air and commuting travel networks) into model parameter values. in addition, available new methodologies to model social-behavior that adapts to evolving disease dynamics [ ] should be incorporated into the operational models. with regards to accessibility and scalability of the selected models, we first attempted to determine which of the simulation models were made available to general public, either as an open or closed source code. we also checked for available documentation for model implementation and user support, if any. most importantly, we looked into how the researchers attempted to achieve the computational feasibility of their models (see additional file : table s ). three of the models that make their source codes accessible to general public are influsim [ ] , ciofi [ ] and flute [ ] . influsim is a closed source differential equation-based model with a graphical user interface (gui) which allows the evaluation of a variety of mitigation strategies, including school closure, place closure, antiviral application to infected cases, and isolation. ciofi is an open source model that is coupled with a differential equation model to allow for a more realistic importation of cases to a region. flute is an open source model, which is an updated version of the uw-lanl [ ] agent-based model. the source code for flute is also available as a parallelized version that supports simulation of large populations on multiple processors. among these three softwares, influsim has a gui, whereas ciofi and uw-lanl are provided as a c/c++ code. influsim's gui seems to be more user friendly for healthcare policymakers. flute and ciofi, on the other hand, offer more options for mitigation strategies, but requires the knowledge of c/c++ programming language and the communication protocols for parallelization. other c++ models are planning to become, or are already, publicly accessible, according to the models of infectious disease agent study (midas) survey [ ] . we note that the policy makers would greatly benefit if softwares like flute or ciofi can be made available through a cyber-enabled computing infrastructure, such as teragrid [ ] . this will provide the policy makers access to the program through a web based gui without having to cope with the issues of software parallelization and equipment availability. moreover, the policy makers will not require the skills of programming, modeling, and data integration. the need for replicates for accurate assessment of the model output measures and the run time per replicate are major scalability issues for pandemic simulation models. large-scale simulations of the u.s. population reported running times of up to h per replicate, depending on the number of parallel threads used [ ] (see additional file : table s for further details). it would then take a run time of one week to execute replicates of only one pandemic scenario. note that, most of the modeling approaches have reported between to replicates per scenario [ , [ ] [ ] [ ] [ ] [ ] [ ] , with the exception of [ , , , ] which implemented between to replicates. 
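the idea of adding replicates one at a time until the confidence intervals for the output variables become acceptable, mentioned above, can be sketched as follows. the simulation function, output measure, and tolerance used here are hypothetical stand-ins; an actual study would plug in its own model and choose the precision required by the decision at hand.

```python
import math
import random
import statistics

# minimal sketch: run replicates one at a time until the half-width of the 95%
# confidence interval for an output measure falls below a chosen tolerance.

def run_one_replicate():
    # hypothetical stand-in for a full pandemic simulation returning, say,
    # an illness attack rate for one stochastic replicate
    return random.gauss(0.32, 0.04)

def run_until_precise(tolerance=0.01, min_reps=5, max_reps=200):
    outputs = []
    half_width = float("inf")
    while len(outputs) < max_reps:
        outputs.append(run_one_replicate())
        if len(outputs) >= min_reps:
            sd = statistics.stdev(outputs)
            half_width = 1.96 * sd / math.sqrt(len(outputs))
            if half_width <= tolerance:
                break
    return statistics.mean(outputs), half_width, len(outputs)

mean_iar, half_width, n_reps = run_until_precise()
print(f"attack rate ~ {mean_iar:.3f} +/- {half_width:.3f} after {n_reps} replicates")
```

a stopping rule of this kind only attacks the number-of-replicates side of the scalability problem; the run time per replicate still has to be addressed through parallelization or coarser model detail, as noted above.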
clearly, it would take about one month to run replicates for a single scenario involving the entire u.s. population. while it may not be necessary to simulate the entire population of a country to address mitigation related questions, the issue of the computational burden is daunting nonetheless. we therefore believe that the modeling community should actively seek to develop innovative methodologies to reduce the computational requirements associated with obtaining reliable outputs. minimization of running time has been recently addressed through high performance computing techniques and parallelization by some of the midas models (e.g., epifast) and other research groups (e.g., dicon, gsam), as reported in [ ] . minimization of replicates can be achieved by running the replicates, one more at a time, until the confidence intervals for the output variables become acceptable [ , ] . in addition to the need of minimizing running time and number of replicates, it is also necessary to develop innovative methodologies to minimize the setting up time of operational models. these methodologies should enable the user to automatically select the level of modeling detail, according to the population to mimic (see a discussion of this framework in the context of human mobility [ ] ), and allow the automatic calibration of the model parameters. there exist several simulation models of pandemic influenza that can be used at the provincial and local levels and were not treated as part of the evaluated models in this article. their exclusion is due to their limited available documentation discussing the choice of demographic, social-behavioral or epidemiological parameter values. we mention and discuss their relevant features in this manuscript, whenever applicable. for information about the additional models, the reader is referred to [ , , ] . there also exist a body of literature evaluating less than three types of mitigation strategies that were not considered as part of the review, as we discussed in the methods section. this literature is valuable is providing insights about reproduction patterns [ , ] , effect of cross-immunity [ ] , antiviral resistance [ ] , vaccine dosage [ , ] , social-distancing [ ] and public health interventions in previous pandemics [ , ] . though the literature on pandemic models is rich and contains analysis and results that are valuable for public health preparedness, policy makers have raised several questions regarding practical use of these models. the questions are as follows. is the data support adequate and valid? how credible and valid are the model assumptions? is human behavior represented appropriately in these models? how accessible and scalable are these models? this review paper attempts to determine to what extent the current literature addresses the above questions at provincial and local levels, and what the areas of possible enhancements are. the findings with regards to the areas of enhancements are summarized below. enhance the following: availability of real-time epidemiological data; access and retrieval of demographical and epidemiological data; translation of data into model parameter values. we analyzed the most common epidemiological and demographical parameters that are used in pandemic models, and discussed the need for adequate updating of these parameters during an outbreak. 
as regards the epidemiological parameters, we have noted the need to obtain prompt and reliable estimates for the iar and r, which we believe can be obtained by enhancing protocols for expedited and representative specimen collection and testing. during a pandemic, the estimates for iar and r should also be obtained as often as possible to update simulation models. for the disease natural history and the fraction of asymptomatic cases, estimation should occur every time viral evolution is confirmed by the public health laboratories. for periodic updating of the simulation models, there is a need to develop interfaces for (semi)automatic data access and retrieval. algorithms for translating data into model parameters should not delay model execution and decision making. demographic data are generally available, but most of the models that we examined are not capable of performing (semi)automatic access, retrieval, and translation of demographic data into model parameter values. examine validity of modeling assumptions from the point of view of the decisions that are supported by the model. by referring to two of the most commonly cited pandemic preparedness models [ , ] , we discussed how simplifying model assumptions are made to reduce computational burden, as long as the assumptions do not interfere with the performance evaluation of the mitigation strategies. some mitigation strategies require more realistic model assumptions (e.g., location-based antiviral prophylaxis would require models that track geographic coordinates of individuals so that those within a radius of an infected individual can be identified), whereas other mitigation strategies might be well supported by coarser models (e.g., antiviral prophylaxis for household members would only require models that track household membership). therefore, whenever validity of the modeling assumptions is examined, the criteria chosen for the examination should depend on the decisions supported by the model. incorporate the following: a broader spectrum of social behavioral aspects; updated information for contact patterns; new methodologies for collection of human mobility data. some of the social behavioral factors that have been considered in the examined models are social distancing and vaccination compliance, natural withdrawal from work when symptoms appear, and work absenteeism to care for sick family members. although some of the examined models attempt to capture social-behavioral issues, it appears that they lack adequate consideration of many other factors (e.g., voluntary vaccination, voluntary avoidance of travel to affected regions). hence, there is a need for research studies or expert opinion analysis to identify which social-behavioral factors are significant for disease spread. it is also essential to determine how the social behavioral data should be collected (in real-time or through surveys), archived for easy access, retrieved, and translated into model parameters. in addition, operational models for pandemic spread and mitigation should reflect the state of the art in data for the contact parameters and integrate recent methodologies for collection of human mobility data. enhance computational efficiency of the solution algorithms. our review indicates that some of the models have reached a reasonable running time of up to h per replicate for a large region, such as the entire usa [ , ] .
however, operational models need also to be set up and replicated in real-time, and methodologies addressing these two issues are needed. we have also discussed the question whether the public health decision makers should be burdened with the task of downloading and running models using local computers (laptops). this task can be far more complex than how it is perceived by the public health decision makers. we believe that models should be housed in a cyber computing environment with an easy user interface for the decision makers. additional file : additional file : table s epidemiological parameters in models for pandemic influenza preparedness. the excel sheet "additional file : table s " shows the epidemiological parameters most commonly used in the models for pandemic influenza, the parameter data sources, and the means for access, retrieval and translation. additional file : table s demographic parameters in models for pandemic influenza preparedness. the excel sheet "additional file : table s " shows the demographic parameters most commonly used in the models for pandemic influenza, the parameter data sources, and the means for access, retrieval and translation. additional file : table s accessibility and scalability features investigated in the models. the excel sheet "additional file " shows the different models examined, together with their type of public access, number and running time per replicate, and techniques to manage computational burden. use of computer modeling for emergency preparedness functions by local and state health official: a needs assessment cdc's new preparedness modeling initiative: beyond (and before) crisis response interim pre-pandemic planning guidance: community strategy for pandemic influenza mitigation in the united states modeling community containment for pandemic influenza: a letter report m bd: recommendations for modeling disaster responses in public health and medicine: a position paper of the society for medical decision making. 
med decision making yale new haven center for emergency preparedness and disaster responsem, and us northern command: study to determine the requirements for an operational epidemiological modeling process in support of decision making during disaster medical and public health response operations national household travel survey (nths) viii censimento generale della popolazione e delle abitazioni real-time database and information systems: research advance how generation intervals shape the relationship between growth rates and reproduction numbers concepts of transmission and dynamics strategies for mitigating an influenza pandemic mitigation strategies for pandemic influenza in the united states heterogeneity and network structure in the dynamics of diffusion: comparing agent-based and differential equation models the role of population heterogeneity and human mobility in the spread of pandemic influenza potential for a global dynamic of influenza a (h n ) the global transmission and control of influenza multiscale mobility networks and the spatial spreading of infectious diseases modeling human mobility responses to the large-scale spreading of infectious diseases human mobility networks, travel restrictions, and the global spread of h n pandemic flute, a publicly available stochastic influenza epidemic simulation model targeted social distancing design for pandemic influenza a large scale simulation model for assessment of societal risk and development of dynamic mitigation strategies a predictive decision aid methodology for dynamic mitigation of influenza pandemics: special issue on optimization in disaster relief strategies for containing an emerging influenza pandemic in southeast asia modeling targeted layered containment of an influenza pandemic in the united states reducing the impact of the next influenza pandemic using household-based public health interventions scalia tomba img: mitigation measures for pandemic influenza in italy: an individual based model considering different scenarios simple models for containment of a pandemic a model for influenza with vaccination and antiviral treatment containing pandemic influenza with antiviral agents containing pandemic influenza at the source economic evaluation of influenza pandemic mitigation strategies in the united states using a stochastic microsimulation transmission model modelling mitigation strategies for pandemic (h n ) effective, robust design of community mitigation for pandemic influenza: a systematic examination of proposed us guidance rescinding community mitigation strategies in an influenza pandemic health outcomes and costs of community mitigation strategies for an influenza pandemic in the united states assessing the role of basic control measures, antivirals and vaccine in curtailing pandemic influenza: scenarios for the us, uk and the netherlands mathematical assessment of canadas pandemic influenza preparedness plan a model for the spread and control of pandemic influenza in an isolated geographical region the influenza pandemic preparedness planning tool influsim transmissibility of pandemic influenza epidemic influenza; a survey chicago: american medical association pandemic potential of a strain of influenza a (h n ): early findings an influenza simulation model for immunization studies non-pharmaceutical interventions for pandemic influenza, national and community measures local and systemic cytokine responses during experimental human influenza a virus infection the early diversification of influenza a/ h 
n pdm phylogeography of the spring and fall waves of the h n / pandemic influenza virus in the united states estimates from a national prospective survey of household contacts in france household transmission of pandemic influenza a (h n ) virus in the united states timelines of infection and disease in human influenza: a review of volunteer challenge studies different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures real-time estimates in early detection of sars estimating individual and household reproduction numbers in an emerging epidemic a likelihood-based method for real-time estimation of the serial interval and reproductive number of an epidemic real time bayesian estimation of the epidemic potential of emerging infectious diseases estimating in real time the efficacy of measures to control emerging communicable diseases the effective reproduction number of pandemic influenza. prospective estimation estimating reproduction numbers for adults and children from case data pros and cons of estimating the reproduction number from early epidemic growth rate of influenza a (h n ) global infectious disease surveillance and health intelligence monitoring influenza activity, including using internet searches for influenza surveillance detecting influenza epidemics using search engine query data the use of twitter to track levels of disease activity and public concern in the u.s. during the influenza a h n pandemic the role of the airline transportation network in the prediction and predictability of global epidemics social contacts and mixing patterns relevant to the spread of infectious diseases little italy: an agent-based approach to the estimation of contact patterns -fitting predicted matrices to serological data reality mining: sensing complex social systems dynamics of person-to-person interactions from distributed rfid sensor networks cellular census: explorations in urban data collection. pervasive computing mobile landscapes: using location data from cell-phones for urban analysis. 
environ and planning b: plann and des social and economic impact of school closure resulting from pandemic influenza a/h n compliance and side effects of prophylactic oseltamivir treatment in a school in south west england using an online survey of healthcare-seeking behaviour to estimate the magnitude and severity of the h n v influenza epidemic in england behavioural intentions in response to an influenza pandemic modelling the influence of human behaviour on the spread of infectious diseases: a review the scaling laws of human travel understanding individual human mobility patterns adaptive human behavior in epidemiological models stochastic modelling of the spatial spread of influenza in germany simple models of influenza progression within a heterogeneous population planning for the next influenza h n season: a modelling study a populationdynamic model for evaluating the potential spread of drug-resistant influenza virus infections during community-based use of antivirals optimizing the dose of pre-pandemic influenza vaccines to reduce the infection attack rate finding optimal vaccination strategies for pandemic influenza using genetic algorithms living with influenza: impacts of government imposed and voluntarily selected interventions public health interventions and epidemic intensity during the influenza pandemic the effect of public health measures on the influenza pandemic in the us cities the authors wish to thank doctor lillian stark, virology administrator of the bureau of laboratories in tampa, florida, for providing valuable information on the problems faced by the laboratory during the h n pandemics. the authors also wish to thank the reviewers of this manuscript for providing valuable suggestions and reference material. we appreciate the support of dayna martinez, a doctoral student at usf, in providing some literature information on social-behavioral aspects of pandemic influenza. authors' contributions dp conducted the systematic review and analysis of the models. td and as guided dp and au in designing the conceptual framework for the review. all three jointly wrote the manuscript. ri and sm provided public health expert opinion on the conceptual framework and also reviewed the manuscript. all authors read and approved the final manuscript. the authors declare that they have no competing interests. key: cord- -nyfnwrtm authors: zhang, tenghao title: integrating gis technique with google trends data to analyse covid- severity and public interest date: - - journal: public health doi: . /j.puhe. . . sha: doc_id: cord_uid: nyfnwrtm nan at the time of writing, the tally of confirmed novel coronavirus cases worldwide has exceeded . million. the united states has become the global epicentre since april , and now it is accounted for nearly one-quarter of the world's total cases. some studies suggest that health related issues can cause anxiety which may lead to increased public attention, typically manifested by online information search. , along the same lines, given the substantial regional disparities of covid- case severities across states in the united states, the relationship between regional case severities and the public interest emerges as an imperative for covid- -based public health studies. to investigate the relationship between the above two indicators, geographic information systems (gis) techniques can play a crucial role. 
adams et al.'s ( ) gis-based study points out the shortcomings of using unnormalized covid- demographic data in choropleth mapping, and their use of the normalized data (confirmed cases per , people) presents a more accurate visualisation of pandemic severity. while i entirely agree with their point of view and methods, i would like to propose an alternative gis technique which has the potential to facilitate a better understanding of the research, namely, the cartogram technique. , a cartogram is a map in which the geometry of areas is distorted to convey the value of an alternative thematic mapping variable. hence, if the normalized covid- related data is used in a cartogram, it can provide some novel perspectives on data interpretation. to perform the analysis, the data were obtained from two sources. the covid- case data were retrieved from the us health authority (https://cdc.gov/covid-datatracker). i retrieved the total confirmed cases per , population by state, and then i divided the new confirmed cases (during the past week of data collection) by the total previous cases and obtained a growth of new cases indicator. public interest was captured by people's google search data in each state. the data were acquired from the google trends service, which uses a normalized relative search volume available from www. worldometer.info (accessed th the role of health anxiety in online health information search health anxiety in the digital age: an exploration of psychological determinants of online health information seeking the disguised pandemic: the importance of data normalization in covid- web mapping diffusion-based method for producing density-equalizing maps area cartograms: their use and creation monitoring public interest toward pertussis outbreaks: an extensive google trends -based analysis mapping the changing internet attention to the spread of coronavirus disease in china key: cord- -e jb sex authors: fourcade, marion; johns, fleur title: loops, ladders and links: the recursivity of social and machine learning date: - - journal: theory soc doi: . /s - - -x sha: doc_id: cord_uid: e jb sex machine learning algorithms reshape how people communicate, exchange, and associate; how institutions sort them and slot them into social positions; and how they experience life, down to the most ordinary and intimate aspects. in this article, we draw on examples from the field of social media to review the commonalities, interactions, and contradictions between the dispositions of people and those of machines as they learn from and make sense of each other. a fundamental intuition of actor-network theory holds that what we call "the social" is assembled from heterogeneous collections of human and non-human "actants." this may include human-made physical objects (e.g., a seat belt), mathematical formulas (e.g., financial derivatives), or elements from the natural world-such as plants, microbes, or scallops (callon ; latour latour , . in the words of bruno latour ( , p . ) sociology is nothing but the "tracing of associations." "tracing," however, is a rather capacious concept: socio-technical associations, including those involving non-human "actants," always crystallize in concrete places, structural positions, or social collectives. for instance, men are more likely to "associate" with video games than women (bulut ) . furthermore, since the connection between men and video games is known, men, women and institutions might develop strategies around, against, and through it. 
in other words, techno-social mediations are always both objective and subjective. they "exist … in things and in minds … outside and inside of agents" (wacquant , p. ) . this is why people think, relate, and fight over them, with them, and through them. all of this makes digital technologies a particularly rich terrain for sociologists to study. what, we may wonder, is the glue that holds things together at the automated interface of online and offline lives? what kind of subjectivities and relations manifest on and around social network sites, for instance? and how do the specific mediations these sites rely upon-be it hardware, software, human labor-concretely matter for the nature and shape of associations, including the most mundane? in this article, we are especially concerned with one particular kind of associative practice: a branch of artificial intelligence called machine learning. machine learning is ubiquitous on social media platforms and applications, where it is routinely deployed to automate, predict, and intervene in human and non-human behavior. generally speaking, machine learning refers to the practice of automating the discovery of rules and patterns from data, however dispersed and heterogeneous it may be, and drawing inferences from those patterns, without explicit programming.

(according to pedro domingos's account, approaches to machine learning may be broken down into five "tribes." symbolists proceed through inverse deduction, starting with received premises or known facts and working backwards from those to identify rules that would allow those premises or facts to be inferred. the algorithm of choice for the symbolist is the decision tree. connectionists model machine learning on the brain, devising multilayered neural networks. their preferred algorithm is backpropagation, or the iterative adjustment of network parameters (initially set randomly) to try to bring that network's output closer and closer to a desired result (that is, towards satisfactory performance of an assigned task). evolutionaries canvas entire "populations" of hypotheses and devise computer programs to combine and swap these randomly, repeatedly assessing these combinations' "fitness" by comparing output to training data. their preferred kind of algorithm is the so-called genetic algorithm designed to simulate the biological process of evolution. bayesians are concerned with navigating uncertainty, which they do through probabilistic inference. bayesian models start with an estimate of the probability of certain outcomes (or a series of such estimates comprising one or more hypothetical bayesian network(s)) and then update these estimates as they encounter and process more data. analogizers focus on recognizing similarities within data and inferring other similarities on that basis. two of their go-to algorithms are the nearest-neighbor classifier and the support vector machine. the first makes predictions about how to classify unseen data by finding labeled data most similar to that unseen data (pattern matching). the second classifies unseen data into sets by plotting the coordinates of available or observed data according to their similarity to one another and inferring a decision boundary that would enable their distinction.)

using examples drawn from social media, we seek to understand the kinds of social dispositions that machine learning techniques tend to elicit or reinforce; how these social dispositions, in turn, help to support
machine learning implementations; and what kinds of social formations these interactions give rise to-all of these, indicatively rather than exhaustively. our arguments are fourfold. in the first two sections below, we argue that the accretive effects of social and machine learning are fostering an ever-more-prevalent hunger for data, and searching dispositions responsive to this hunger -"loops" in this paper's title. we then show how interactions between those so disposed and machine learning systems are producing new orders of stratification and association, or "ladders" and "links", and new stakes in the struggle in and around these orders. the penultimate section contends that such interactions, through social and mechanical infrastructures of machine learning, tend to engineer competition and psycho-social and economic dependencies conducive to evermore intensive data production, and hence to the redoubling of machine-learned stratification. finally, before concluding, we argue that machine learning implementations are inclined, in many respects, towards the degradation of sociality. consequently, new implementations are been called upon to judge and test the kind of solidaristic associations that machine learned systems have themselves produced, as a sort of second order learning process. our conclusion is a call to action: to renew, at the social and machine learning interface, fundamental questions of how to live and act together. the things that feel natural to us are not natural at all. they are the result of long processes of inculcation, exposure, and training that fall under the broad concept of "socialization" or "social learning." because the term "social learning" helps us better draw the parallel with "machine learning," we use it here to refer to the range of processes by which societies and their constituent elements (individuals, institutions, and so on) iteratively and interactively take on certain characteristics, and exhibit change-or not-over time. historically, the concept is perhaps most strongly associated with theories of how individuals, and specifically children, learn to feel, act, think, and relate to the world and to each other. theories of social learning and socialization have explained how people come to assume behaviors and attitudes in ways not well captured by a focus on internal motivation or conscious deliberation (miller and dollard ; bandura ; mauss ; elias ) . empirical studies have explored, for instance, how children learn speech and social grammar through a combination of direct experience (trying things out and experiencing rewarding or punishing consequences) and modeling (observing and imitating others, especially primary associates) (gopnik ). berger and luckmann ( ) , relying on the work of george herbert mead, discuss the learning process of socialization as one involving two stages: in the primary stage, children form a self by internalizing the attitudes of those others with whom they entertain an emotional relationship (typically their parents); in the secondary stage, persons-in-becoming learn to play appropriate roles in institutionalized subworlds, such as work or school. pierre bourdieu offers a similar approach to the formation of habitus. 
as a system of dispositions that "generates meaningful practices and meaning-giving perceptions," habitus takes shape through at least two types of social learning: "early, imperceptible learning" (as in the family) and "scholastic...methodical learning" (within educational and other institutions) (bourdieu , pp. , ) . organizations and collective entities also learn. for instance, scholars have used the concept of social learning to understand how states, institutions, and communities (at various scales) acquire distinguishing characteristics and assemble what appear to be convictions-in-common. ludwik fleck ( fleck ( [ ) and later thomas kuhn ( ) famously argued that science normally works through adherence to common ways of thinking about and puzzling over problems. relying explicitly on kuhn, hall ( ) makes a similar argument about elites and experts being socialized into long lasting political and policy positions. collective socialization into policy paradigms is one of the main drivers of institutional path dependency, as it makes it difficult for people to imagine alternatives. for our purposes, social learning encapsulates all those social processes-material, institutional, embodied, and symbolic-through which particular ways of knowing, acting, and relating to one another as aggregate and individuated actants are encoded and reproduced, or by which "[e]ach society [gains and sustains] its own special habits" (mauss , p. ) . "learning" in this context implies much more than the acquisition of skills and knowledge. it extends to adoption through imitation, stylistic borrowing, riffing, meme-making, sampling, acculturation, identification, modeling, prioritization, valuation, and the propagation and practice of informal pedagogies of many kinds. understood in this way, "learning" does not hinge decisively upon the embodied capacities and needs of human individuals because those capacities and needs are only ever realized relationally or through "ecological interaction," including through interaction with machines (foster ) . it is not hard to see why digital domains, online interactions, and social media networks have become a privileged site of observation for such processes (e.g., greenhow and robelia ), all the more so since socialization there often starts in childhood. this suggests that (contra dreyfus ) social and machine learning must be analyzed as co-productive of, rather than antithetical to, one another. machine learning is, similarly, a catch-all term-one encompassing a range of ways of programming computers or computing systems to undertake certain tasks (and satisfy certain performance thresholds) without explicitly directing the machines in question how to do so. instead, machine learning is aimed at having computers learn (more or less autonomously) from preexisting data, including the data output from prior attempts to undertake the tasks in question, and devise their own ways of both tackling those tasks and iteratively improving at them (alpaydin ) . implementations of machine learning now span all areas of social and economic life. machine learning "has been turning up everywhere, driven by exponentially growing mountains of [digital] data" (domingos ) . in this article, we take social media as one domain in which machine learning has been widely implemented. we do so recognizing that not all data analysis in which social media platforms engage is automated, and that those aspects that are automated do not necessarily involve machine learning. 
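for readers unfamiliar with the mechanics behind such claims, the toy sketch below illustrates the kind of procedure the term covers: a program is given preexisting, human-labeled examples and then generalizes to cases it has not seen, without the classification rule ever being written out by hand. it uses the nearest-neighbor idea mentioned above, and the data and labels are invented for illustration only, with no connection to any particular platform or study.

```python
# minimal sketch of a nearest-neighbor classifier, one of the "analogizer"
# techniques noted above. the training examples are invented: each pairs
# (hours online per day, posts per day) with a label supplied by a human rater.
training_data = [
    ((0.5, 0), "casual"),
    ((1.0, 1), "casual"),
    ((4.0, 6), "heavy"),
    ((6.5, 12), "heavy"),
]

def predict(new_point):
    """label an unseen case with the label of the most similar known case."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training_data,
                   key=lambda item: squared_distance(item[0], new_point))
    return label

# the "rule" separating casual from heavy users is never written explicitly;
# it is induced from the accumulated, labeled examples.
print(predict((5.0, 8)))   # -> heavy
print(predict((0.8, 1)))   # -> casual
```

nothing in the sketch understands its subjects; it simply generalizes from the judgments already embedded in its training data.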
two points are important for our purposes: most "machines" must be trained, cleaned, and tested by humans in order to "learn." in implementations of machine learning on social media platforms, for instance, humans are everywhere "in the loop"-an immense, poorly paid, and crowdsourced workforce that relentlessly labels, rates, and expunges the "content" to be consumed (gillespie ; gray and suri ) . and yet, both supervised and unsupervised machines generate new patterns of interpretation, new ways of reading the social world and of intervening in it. any reference to machine learning throughout this article should be taken to encapsulate these "more-thanhuman" and "more-than-machine" qualities of machine learning. cybernetic feedback, data hunger, and meaning accretion analogies between human (or social) learning and machine-based learning are at least as old as artificial intelligence itself. the transdisciplinary search for common properties among physical systems, biological systems, and social systems, for instance, was an impetus for the macy foundation conferences on "circular causal and feedback mechanisms in biology and social systems" in the early days of cybernetics ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) . in the analytical model developed by norbert wiener, "the concept of feedback provides the basis for the theoretical elimination of the frontier between the living and the non-living" (lafontaine , p. ) . just as the knowing and feeling person is dynamically produced through communication and interactions with others, the ideal cybernetic system continuously enriches itself from the reactions it causes. in both cases, life is irrelevant: what matters, for both living and inanimate objects, is that information circulates in an everrenewed loop. put another way, information/computation are "substrate independent" (tegmark , p. ). wiener's ambitions (and even more, the exaggerated claims of his posthumanist descendants, see, e.g., kurzweil ) were immediately met with criticism. starting in the s, philosopher hubert dreyfus arose as one of the main critics of the claim that artificial intelligence would ever approach its human equivalent. likening the field to "alchemy" ( ), he argued that machines would never be able to replicate the unconscious processes necessary for the understanding of context and the acquisition of tacit skills ( , )-the fact that, to quote michael polanyi ( ) , "we know more than we can tell." in other words, machines cannot develop anything like the embodied intuition that characterizes humans. furthermore, machines are poorly equipped to deal with the fact that all human learning is cultural, that is, anchored not in individual psyches but in collective systems of meaning and in sedimented, relational histories (vygotsky ; bourdieu ; durkheim ; hasse ) . is this starting to change today when machines successfully recognize images, translate texts, answer the phone, and write news briefs? some social and computational scientists believe that we are on the verge of a real revolution, where machine learning tools will help decode tacit knowledge, make sense of cultural repertoires, and understand micro-dynamics at the individual level (foster ). our concern is not, however, with confirming or refuting predictive claims about what computation can and cannot do to advance scholars' understanding of social life. rather, we are interested in how social and computational learning already interact. 
not only may social and machine learning usefully be compared, but they are reinforcing and shaping one another in practice. in those jurisdictions in which a large proportion of the population is interacting, communicating, and transacting ubiquitously online, social learning and machine learning share certain tendencies and dependencies. both practices rely upon and reinforce a pervasive appetite for digital input or feedback that we characterize as "data hunger." they also share a propensity to assemble insight and make meaning accretively-a propensity that we denote here as "world or meaning accretion." throughout this article, we probe the dynamic interaction of social and machine learning by drawing examples from one genre of online social contention and connection in which the pervasive influence of machine learning is evident: namely, that which occurs across social media channels and platforms. below we explain first how data hunger is fostered by both social and computing systems and techniques, and then how world or meaning accretion manifests in social and machine learning practices. these explanations set the stage for our subsequent discussion of how these interlocking dynamics operate to constitute and distribute power. data hunger: searching as a natural attitude as suggested earlier, the human person is the product of a long, dynamic, and never settled process of socialization. it is through this process of sustained exposure that the self (or the habitus, in pierre bourdieu's vocabulary) becomes adjusted to its specific social world. as bourdieu puts it, "when habitus encounters a social world of which it is the product, it is like a 'fish in water': it does not feel the weight of the water, and it takes the world about itself for granted" (bourdieu in wacquant , p. ) . the socialized self is a constantly learning self. the richer the process-the more varied and intense the interactions-the more "information" about different parts of the social world will be internalized and the more socially versatile-and socially effective, possibly-the outcome. (this is why, for instance, parents with means often seek to offer "all-round" training to their offspring [lareau ]) . machine learning, like social learning, is data hungry. "learning" in this context entails a computing system acquiring capacity to generalize beyond the range of data with which it has been presented in the training phase. "learning" is therefore contingent upon continuous access to data-which, in the kinds of cases that preoccupy us, means continuous access to output from individuals, groups, and "bots" designed to mimic individuals and groups. at the outset, access to data in enough volume and variety must be ensured to enable a particular learnermodel combination to attain desired accuracy and confidence levels. thereafter, data of even greater volume and variety is typically (though not universally) required if machine learning is to deliver continuous improvement, or at least maintain performance, on assigned tasks. the data hunger of machine learning interacts with that of social learning in important ways. engineers, particularly in the social media sector, have structured machine learning technologies not only to take advantage of vast quantities of behavioral traces that people leave behind when they interact with digital artefacts, but also to solicit more through playful or addictive designs and cybernetic feedback loops. 
the machine-learning self is not only encouraged to respond more, interact more, and volunteer more, but also primed to develop a new attitude toward the acquisition of information (andrejevic , p. ). with the world's knowledge at her fingertips, she understands that she must "do her own research" about everything-be it religion, politics, vaccines, or cooking. her responsibility as a citizen is not only to learn the collective norms, but also to know how to search and learn so as to make her own opinion "for herself," or figure out where she belongs, or gain new skills. the development of searching as a "natural attitude" (schutz ) is an eminently social process of course: it often means finding the right people to follow or emulate (pariser ) , using the right keywords so that the search process yields results consistent with expectations (tripodi ) , or implicitly soliciting feedback from others in the form of likes and comments. the social media user also must extend this searching disposition to her own person: through cybernetic feedback, algorithms habituate her to search for herself in the data. this involves looking reflexively at her own past behavior so as to inform her future behavior. surrounded by digital devices, some of which she owns, she internalizes the all-seeing eye and learns to watch herself and respond to algorithmic demands (brubaker ). data hunger transmutes into self-hunger: an imperative to be digitally discernible in order to be present as a subject. this, of course, exacts a kind of self-producing discipline that may be eerily familiar to those populations that have always been under heavy institutional surveillance, such as the poor, felons, migrants, racial minorities (browne ; benjamin ) , or the citizens of authoritarian countries. it may also be increasingly familiar to users of health or car insurance, people living in a smart home, or anyone being "tracked" by their employer or school by virtue of simply using institutionally licensed it infrastructure. but the productive nature of the process is not a simple extension of what michel foucault called "disciplinary power" nor of the self-governance characteristic of "governmentality." rather than simply adjusting herself to algorithmic demands, the user internalizes the injunction to produce herself through the machine-learning-driven process itself. in that sense the machine-learnable self is altogether different from the socially learning, self-surveilling, or self-improving self. the point for her is not simply to track herself so she can conform or become a better version of herself; it is, instead, about the productive reorganization of her own experience and self-understanding. as such, it is generative of a new sense of selfhood-a sense of discovering and crafting oneself through digital means that is quite different from the "analog" means of self-cultivation through training and introspection. when one is learning from a machine, and in the process making oneself learnable by it, mundane activities undergo a subtle redefinition. hydrating regularly or taking a stroll are not only imperatives to be followed or coerced into. their actual phenomenology morphs into the practice of feeding or assembling longitudinal databases and keeping track of one's performance: "step counting" and its counterparts (schüll ; adams ) . likewise, what makes friendships real and defines their true nature is what the machine sees: usually, frequency of online interaction. 
for instance, snapchat has perfected the art of classifying-and ranking-relationships that way, so people are constantly presented with an ever-changing picture of their own dyadic connections, ranked from most to least important. no longer, contra foucault ( ) , is "permanent self-examination" crucial to self-crafting so much as attention to data-productive practices capable of making the self learnable and sustaining its searching process. to ensure one's learnability-and thereby one's selfhood-one must both feed and reproduce a hunger for data on and around the self. human learning is not only about constant, dynamic social exposure and world hunger, it is also about what we might call world or meaning accretion. the self is constantly both unsettled (by new experiences) and settling (as a result of past experiences). people take on well institutionalized social roles (berger and luckmann ) . they develop habits, styles, personalities-a "system of dispositions" in bourdieu's vocabulary-by which they become adjusted to their social world. this system is made accretively, through the conscious and unconscious sedimentation of social experiences and interactions that are specific to the individual, and variable in quality and form. accretion here refers to a process, like the incremental build-up of sediment on a riverbank, involving the gradual accumulation of additional layers or matter. even when change occurs rapidly and unexpectedly, the ongoing process of learning how to constitute and comport oneself and perform as a social agent requires one to grapple with and mobilize social legacies, social memory, and pre-established social norms (goffman ) . the habitus, bourdieu would say, is both structured and structuring, historical and generative. social processes of impression formation offer a good illustration of how social learning depends upon accreting data at volume, irrespective of the value of any particular datum. the popular insight that first impressions matter and tend to endure is broadly supported by research in social psychology and social cognition (uleman and kressel ) . it is clear that impressions are formed cumulatively and that early-acquired information tends to structure and inform the interpretation of information later acquired about persons and groups encountered in social life (hamilton and sherman ) . this has also been shown to be the case in online environments (marlow et al. ) . in other words, social impressions are constituted by the incremental build-up of a variegated mass of data. machine learning produces insight in a somewhat comparable way-that is, accretively. insofar as machine learning yields outputs that may be regarded as meaningful (which is often taken to mean "useful" for the task assigned), then that "meaning" is assembled through the accumulation of "experience" or from iterative exposure to available data in sufficient volume, whether in the form of a stream or in a succession of batches. machine learning, like social learning, never produces insight entirely ab initio or independently of preexisting data. to say that meaning is made accretively in machine learning is not to say that machine learning programs are inflexible or inattentive to the unpredictable; far from it. all machine learning provides for the handling of the unforeseen; indeed, capacity to extend from the known to the unknown is what qualifies machine learning as "learning." 
moreover, a number of techniques are available to make machine learning systems robust in the face of "unknown unknowns" (that is, rare events not manifest in training data). nonetheless, machine learning does entail giving far greater weight to experience than to the event. the more data that has been ingested by a machine learning system, the less revolutionary, reconfigurative force might be borne by any adventitious datum that it encounters. if, paraphrasing marx, one considers that people make their own history, but not in circumstances they choose for themselves, but rather in present circumstances given and inherited, then the social-machine learning interface emphasizes the preponderance of the "given and inherited" in present circumstances, far more than the potentiality for "mak[ing]" that may lie within them (marx [ ]). one example of the compound effect of social and automated meaning accretion in the exemplary setting to which we return throughout this article-social media-is the durability of negative reputation across interlocking platforms. for instance, people experience considerable difficulty in countering the effects of "revenge porn" online, reversing the harms of identity theft, or managing spoiled identities once they are digitally archived (lageson and maruna ). as langlois and slane have observed, "[w]hen somebody is publicly shamed online, that shaming becomes a live archive, stored on servers and circulating through information networks via search, instant messaging, sharing, liking, copying, and pasting" (langlois and slane ). in such settings, the data accretion upon which machine learning depends for the development of granular insights-and, on social media platforms, associated auctioning and targeting of advertising-compounds the cumulative, sedimentary effect of social data, making negative impressions generated by "revenge porn," or by one's online identity having been fraudulently coopted, hard to displace or renew. the truth value of later, positive data may be irrelevant if enough negative data has accumulated in the meantime. data hunger and the accretive making of meaning are two aspects of the embedded sociality of machine learning and of the "mechanical" dimensions of social learning. together, they suggest modes of social relation, conflict, and action that machine learning systems may nourish among people on whom those systems bear, knowingly or unknowingly. this has significant implications for social and economic inequality, as we explore below. what are the social consequences of machine learning's signature hunger for diverse, continuous, ever more detailed and "meaningful" data and the tendency of many automated systems to hoard historic data from which to learn? in this section, we discuss three observable consequences of data hunger and meaning accretion. we show how these establish certain non-negotiable preconditions for social inclusion; we highlight how they fuel the production of digitally-based forms of social stratification and association; and we specify some recurrent modes of relation fostered thereby. all three ordering effects entail the uneven distribution of power and resources and all three play a role in sustaining intersecting hierarchies of race, class, gender, and other modes of domination and axes of inequality.
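the accretive logic described in the last two paragraphs can be illustrated with a minimal, hypothetical sketch of our own: a model summary (here just a running mean over a stream of batches) is built up incrementally, and the influence of any single new observation shrinks as the volume of previously ingested data grows.

```python
# illustrative sketch only: a streaming estimator whose state accretes from
# successive batches, showing how a late, anomalous datum carries less and
# less reconfigurative force the more data has already been ingested.
import numpy as np

class RunningMean:
    """incrementally accreted summary of everything seen so far."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, batch):
        for x in batch:
            self.n += 1
            # each new datum moves the summary by only 1/n of its deviation
            self.mean += (x - self.mean) / self.n

rng = np.random.default_rng(1)
for n_prior in (10, 1_000, 100_000):
    model = RunningMean()
    model.update(rng.normal(0.0, 1.0, size=n_prior))  # accumulated "experience"
    before = model.mean
    model.update([50.0])                               # one adventitious datum
    print(n_prior, round(model.mean - before, 5))      # its shrinking impact
```

the same qualitative point holds for more elaborate online learners: accumulated experience weighs heavily against any single new event.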
machine learning's data appetite and the "digestive" or computational abilities that attend it are often sold as tools for the increased organizational efficiency, responsiveness, and inclusiveness of societies and social institutions. with the help of machine learning, the argument goes, governments and non-governmental organizations develop an ability to render visible and classify populations that are traditionally unseen by standard data infrastructures. moreover, those who have historically been seen may be seen at a greater resolution, or in a more finely-grained, timely, and difference-attentive way. among international organizations, too, there is much hope that enhanced learning along these lines might result from the further utilization of machine learning capacities (johns ). for instance, machine learning, deployed in fingerprint, iris, or facial recognition, or to nourish sophisticated forms of online identification, is increasingly replacing older, document-based ones (torpey )-and transforming the very concept of citizenship in the process (cheney-lippold ). whatever the pluses and minuses of "inclusiveness" in this mode, it entails a major infrastructural shift in the way that social learning takes place at the state and inter-state level, or how governments come to "know" their polities. governments around the world are exploring possibilities for gathering and analysing digital data algorithmically, to supplement-and eventually, perhaps, supersede-household surveys, telephone surveys, field site visits, and other traditional data collection methods. this devolves the process of assembling and representing a polity, and understanding its social and economic condition, down to agents outside the scope of public administration: commercial satellite operators (capturing satellite image data being used to assess a range of conditions, including agricultural yield and poverty), supermarkets (gathering scanner data, now widely used in cpi generation), and social media platforms. if official statistics (and associated data gathering infrastructures and labor forces) have been key to producing the modern polity, governmental embrace of machine learning capacities signals a change in ownership of that means of production. social media has become a key site for public and private parties-police departments, immigration agencies, schools, employers and insurers among others-to gather intelligence about the social networks of individuals, their health habits, their propensity to take risks or the danger they might represent to the public, to an organization's bottom line or to its reputation (trottier ; omand ; bousquet ; amoore ; stark ). informational and power asymmetries characteristic of these institutions are often intensified in the process. this is notwithstanding the fact that automated systems' effects may be tempered by manual work-arounds and other modes of resistance within bureaucracies, such as the practices of frontline welfare workers intervening in automated systems in the interests of their clients, and strategies of foot-dragging and data obfuscation by legal professionals confronting predictive technologies in criminal justice (raso ; brayne and christin ). the deployment of machine learning to the ends outlined in the foregoing paragraph furthers the centrality of data hungry social media platforms to the distribution of all sorts of economic and social opportunities and scarce public resources.
at every scale, machine-learning-powered corporations are becoming indispensable mediators of relations between the governing and the governed (a transition process sharply accelerated by the covid- pandemic). this invests them with power of a specific sort: the power of "translating the images and concerns of one world into that of another, and then disciplining or maintaining that translation in order to stabilize a powerful network" and their own influential position within it (star , p. ) . the "powerful network" in question is society, but it is heterogeneous, comprising living and non-living, automated and organic elements: a composite to which we can give the name "society" only with impropriety (that is, without adherence to conventional, anthropocentric understandings of the term). for all practical purposes, much of social life already is digital. this insertion of new translators, or repositioning of old translators, within the circuits of society is an important socio-economic transformation in its own right. and the social consequences of this new "inclusion" are uneven in ways commonly conceived in terms of bias, but not well captured by that term. socially disadvantaged populations are most at risk of being surveilled in this way and profiled into new kinds of "measurable types" (cheney-lippold ). in addition, social media user samples are known to be non-representative, which might further unbalance the burden of surveillant attention. (twitter users, for instance, are skewed towards young, urban, minority individuals (murthy et al. ).) consequently, satisfaction of data hunger and practices of automated meaning accretion may come at the cost of increased social distrust, fostering strategies of posturing, evasion, and resistance among those targeted by such practices. these reactions, in turn, may undermine the capacity of state agents to tap into social data-gathering practices, further compounding existing power and information asymmetries (harkin ) . for instance, sarah brayne ( ) finds that government surveillance via social media and other means encourages marginalized communities to engage in "system avoidance," jeopardizing their access to valuable social services in the process. finally, people accustomed to being surveilled will not hesitate to instrumentalize social media to reverse monitor their relationships with surveilling institutions, for instance by taping public interactions with police officers or with social workers and sharing them online (byrne et al. ) . while this kind of resistance might further draw a wedge between vulnerable populations and those formally in charge of assisting and protecting them, it has also become a powerful aspect of grassroots mobilization in and around machine learning and techno-social approaches to institutional reform (benjamin ) . in all the foregoing settings, aspirations for greater inclusiveness, timeliness, and accuracy of data representation-upon which machine learning is predicated and which underlie its data hunger-produce newly actionable social divisions. the remainder of this article analyzes some recurrent types of social division that machine learning generates, and types of social action and experience elicited thereby. there is, of course, no society without ordering-and no computing either. social order, like computing order, comes in many shapes and varieties but generally "the gap between computation and human problem solving may be much smaller than we think" (foster , p. ) . 
in what follows, we cut through the complexity of this social-computational interface by distinguishing between two main ideal types of classification: ordinal (organized by judgments of positionality, priority, probability or value along one particular dimension) and nominal (organized by judgments of difference and similarity) (fourcade ). social processes of ordinalization in the analog world might include exams, tests, or sports competitions: every level allows one to compete for the next level and be ranked accordingly. in the digital world, ordinal scoring might take the form of predictive analytics-which, in the case of social media, typically means the algorithmic optimization of online verification and visibility. by contrast, processes of nominalization include, in the analog world, various forms of homophily (the tendency of people to associate with others who are similar to them in various ways) and institutional sorting by category. translated for the digital world, these find an echo in clustering technologies-for instance a recommendation algorithm that works by finding the "nearest neighbors" whose taste is similar to one's own, or one that matches people based on some physical characteristic or career trajectory. the difference between ordinal systems and nominal systems maps well onto the difference between bayesian and analogical approaches to machine learning, to reference pedro domingos's ( ) useful typology. it is, however, only at the output or interface stage that these socially ubiquitous machine learning orderings become accessible to experience. what does it mean, and what does it feel like, to live in a society that is regulated through machine learning systems-or rather, where machine learning systems are interacting productively with social ordering systems of an ordinal and nominal kind? in this section, we identify some new, or newly manifest, drivers of social structure that emerge in machine learning-dominated environments. let us begin with the ordinal effects of these technologies (remembering that machine learning systems comprise human as well as non-human elements). as machine learning systems become more universal, the benefits of inclusion now depend less on access itself, and more on one's performance within each system and according to its rules. for instance, visibility on social media depends on "engagement," or how important each individual is to the activity of the platform. if one does not post frequently and consistently, comment or message others on facebook or instagram, or if others do not interact with one's posts, one's visibility to them diminishes quickly. if one is not active on the dating app tinder, one cannot expect one's profile to be shown to prospective suitors. similarly, uber drivers and riders rank one another on punctuality, friendliness, and the like, but uber (the company) ranks both drivers and riders on their behavior within the system, from canceling too many rides to failing to provide feedback. uber egypt states on its website: "the rating system is designed to give mutual feedback. if you never rate your drivers, you may see your own rating fall." even for those willing to incur the social costs of disengagement, opting out of machine learning may not be an option.
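the ordinal/nominal distinction drawn at the start of this section maps onto two elementary computational gestures, sketched below over the same synthetic users (a hypothetical toy of ours, not any platform's actual scoring or matching logic): an ordinal operation ranks everyone along a single engagement score, while a nominal operation only groups users by similarity to their nearest neighbors.

```python
# toy illustration of ordinal vs nominal ordering over synthetic users.
import numpy as np

rng = np.random.default_rng(2)
n_users = 6
taste = rng.random((n_users, 4))          # hypothetical per-user "taste" vectors
engagement = rng.poisson(20, n_users)     # hypothetical per-user activity counts

# ordinal: one dimension, one hierarchy (who is most "visible"?)
ranking = np.argsort(-engagement)
print("ordinal ranking (most to least engaged):", ranking.tolist())

# nominal: similarity, not priority (who resembles whom?)
def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

for u in range(n_users):
    others = [v for v in range(n_users) if v != u]
    sims = [cosine(taste[u], taste[v]) for v in others]
    print("user", u, "nearest neighbor:", others[int(np.argmax(sims))])
```

the first list orders everyone against everyone; the second only says who belongs with whom.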
fail to respond to someone's tag, or to like their photo, or to otherwise maintain data productivity, and one might be dropped from their network, consciously or unconsciously, a dangerous proposition in a world where self-worth has become closely associated with measures of network centrality or social influence. as bucher has observed, "abstaining from using a digital device for one week does not result in disconnection, or less data production, but more digital data points … to an algorithm, … absence provides important pieces of information" (bucher , p. ). engagement can also be forced on non-participants by the actions of other users-through tagging, rating, commenting, and endorsing, for instance (casemajor et al. ). note that none of this is a scandal or a gross misuse of the technology. on the contrary, this is what any system looking for efficiency and relevance is bound to look like. but any ordering system that acts on people will generate social learning, including action directed at itself in return. engagement, to feed data hunger and enable the accretion of "meaningful" data from noise, is not neutral, socially or psychologically. the constant monitoring and management of one's social connections, interactions, and interpellations places a nontrivial burden on one's life. the first strategy of engagement is simply massive time investment, to manage the seemingly ever-growing myriad of online relationships (boyd ). to help with the process, social media platforms now bombard their users constantly with notifications, making it difficult to stay away and orienting users' behavior toward mindless and unproductive "grinding" (for instance, repetitively "liking" every post in their feed). but even this intensive "nudging" is often not enough. otherwise, how can we explain the fact that a whole industry of social media derivatives has popped up, to help people optimize their behavior vis-a-vis the algorithm, manage their following, and gain an edge so that they can climb the priority order over other, less savvy users? now users need to manage two systems (if not more): the primary one and the (often multiple) analytics apps that help improve and adjust their conduct in it. in these ways, interaction with machine learning systems tends to encourage continuous effort towards ordinal self-optimization. however, efforts of ordinal optimization, too, may soon become useless: as marilyn strathern (citing british economist charles goodhart) put it, "when a measure becomes a target, it ceases to be a good measure" (strathern , p. ). machine learning systems do not reward time spent on engagement without regard to the impact of that engagement across the network as a whole. now, in desperation, those with the disposable income to do so may turn to money as the next saving grace to satisfy the imperative to produce "good" data at volume and without interruption, and reap social rewards for doing so. the demand for maximizing one's data productivity and machine learning measurability is there, so the market is happy to oblige. with a monthly subscription to a social media platform, or even a social media marketing service, users can render themselves more visible. this possibility, and the payoffs of visibility, are learned socially, both through the observation and mimicry of models (influencers, for instance) or through explicit instruction (from the numerous online and offline guides to maximizing "personal brand"). one can buy oneself instagram or twitter followers.
social media scheduling tools, such as tweetdeck and post planner, help one to plan ahead to try to maximize engagement with one's postings, including by strategically managing their release across time zones. a paying account on linkedin dramatically improves a user's chance of being seen by other users. the same is true of tinder. if a user cannot afford the premium subscription, the site still offers them one-off "boosts" for $ . that will send their profile near the top of their potential matches' swiping queue for min. finally, wealthier users can completely outsource the process of online profile management to someone else (perhaps recruiting a freelance social media manager through an online platform like upwork, the interface of which exhibits ordinal features like client ratings and job success scores). in all the foregoing ways, the inclusionary promise of machine learning has shifted toward more familiar sociological terrain, where money and other vectors of domination determine outcomes. in addition to economic capital, distributions of social and cultural capital, as well as traditional ascriptive characteristics, such as race or gender, play an outsized role in determining likeability and other outcomes of socially learned modes of engagement with machine learning systems. for instance, experiments with mechanical turkers have shown that being attractive increases the likelihood of appearing trustworthy on twitter, but being black has the opposite, negative effect (groggel et al. ). in another example, empirical studies of social media use among those bilingual in hindi and english have observed that positive modes of social media engagement tend to be expressed in english, with negative emotions and profanity more commonly voiced in hindi. one speculative explanation for this is that english is the language of "aspiration" in india or offers greater prospects for accumulating social and cultural capital on social media than hindi (rudra et al. ). in short, well-established off-platform distinctions and social hierarchies shape the extent to which on-platform identities and forms of materialized labor will be defined as valuable and value-generating in the field of social media. in summary, ordinality is a necessary feature of all online socio-technical systems and it demands a relentless catering to one's digital doppelgängers' interactions with others and with algorithms. to be sure, design features tend to make systems addictive and feed this sentiment of oppression (boyd ). what really fuels both, however, is the work of social ordering and the generation of ordinal salience by the algorithm. in the social world, any type of scoring, whether implicit or explicit, produces tremendous amounts of status anxiety and often leads to productive resources (time and money) being diverted in an effort to better one's odds (espeland and sauder ; mau ). those who are short on both presumably fare worse, not only because that makes them less desirable in the real world, but also because they cannot afford the effort and expense needed to overcome their disadvantage in the online world. the very act of ranking thus both recycles old forms of social inequality and also creates new categories of undesirables. as every teenager knows, those who have a high ratio of following to followers exhibit low social status or look "desperate."
in this light, jeff bezos may be the perfect illustration of the intertwining of real-world and virtual-world power asymmetries: the founder and ceo of amazon and currently the richest man in the world has . million followers on twitter, but follows only one person: his ex-wife. ordinalization has implications not just for hierarchical positioning, but also for belonging-an important dimension of all social systems (simmel ). ordinal stigma (the shame of being perceived as inferior) often translates into nominal stigma, or the shame of non-belonging. not obtaining recognition (in the form of "likes" or "followers"), in return for one's appreciation of other people, can be a painful experience, all the more since it is public. concern to lessen the sting of this kind of algorithmic cruelty is indeed why, presumably, tinder has moved from a simple elo or desirability score (which depends on who has swiped to indicate liking for the person in question, and their own scores, an ordinal measure) to a system that relies more heavily on type matching (a nominal logic), where people are connected based on taste similarity as expressed through swiping, sound, and image features (carman ). in addition to employing machine learning to rank users, most social media platforms also use forms of clustering and type matching, which allow them to group users according to some underlying similarity (analogical machine learning in domingos's terms). this kind of computing is just as hungry for data as those we discuss above, but its social consequences are different. now the aim is trying to figure a person out or at least to amplify and reinforce a version of that person that appears in some confluence of data exhaust within the system in question. that is, in part, the aim of the algorithm (or rather, of the socio-technical system from which the algorithm emanates) behind facebook's news feed (cooper ). typically, the more data one feeds the algorithm, the better its prediction, the more focused the offering, and the more homogeneous the network of associations forged through receipt and onward sharing of similar offerings. homogeneous networks may, in turn, nourish better, and more saleable, machine learning programs. the more predictable one is, the better the chances that one will be seen-and engaged-by relevant audiences. being inconsistent or too frequently turning against type in data-generative behaviors can make it harder for a machine learning system to place and connect a person associatively. in both offline and online social worlds (not that the two can easily be disentangled), deviations from those expectations that data correlations tend to yield are often harshly punished by verbal abuse, dis-association, or both. experiences of being so punished, alongside experiences of being rewarded by a machine learning interface for having found a comfortable group (or a group within which one has strong correlations), can lead to some form of social closure, a desire to "play to type." as one heavy social media user told us, "you want to mimic the behavior [and the style] of the people who are worthy of your likes" in the hope that they will like you in return. that's why social media have been variously accused of generating "online echo chambers" and "filter bubbles," and of fueling polarization (e.g., pariser ). on the other hand, being visible to the wrong group is often a recipe for being ostracized, "woke-shamed," "called-out," or even "canceled" (yar and bromwich ).
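the contrast drawn just above between an elo-style desirability score and type matching can be made concrete with a short, purely illustrative sketch: the standard elo update formula paired with a cosine-similarity matcher over swipe vectors. the numbers and arrays are our own toy assumptions; this is not a description of tinder's actual system.

```python
# ordinal logic: a standard elo update, where a "win" means being chosen.
def elo_update(r_winner, r_loser, k=32):
    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    return r_winner + k * (1 - expected), r_loser - k * (1 - expected)

low, high = 1500, 1600
low, high = elo_update(low, high)      # the lower-rated profile "wins" once
print(round(low), round(high))         # ratings shift toward each other

# nominal logic: match people whose past swipes look alike, regardless of rank.
import numpy as np

swipes = np.array([                    # rows: users; columns: liked (1) / passed (0)
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

print(round(cosine(swipes[0], swipes[1]), 2),   # similar tastes: high similarity
      round(cosine(swipes[0], swipes[2]), 2))   # dissimilar tastes: low similarity
```

the first operation always produces a single hierarchy; the second only produces neighborhoods of taste.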
in these and other ways, implementations of machine learning in social media complement and reinforce certain predilections widely learned socially. in many physical, familial, political, legal, cultural, and institutional environments, people learn socially to feel suspicious of those they experience as unfamiliar or fundamentally different from themselves. there is an extensive body of scholarly work investigating social rules and procedures through which people learn to recognize, deal with, and distance themselves from bodies that they read as strange and ultimately align themselves with and against pre-existing nominal social groupings and identities (ahmed ; goffman ) . this is vital to the operation of the genre of algorithm known as a recommendation algorithm, a feature of all social media platforms. on facebook, such an algorithm generates a list of "people you may know" and on twitter, a "who to follow" list. recommendation algorithms derive value from this social learning of homophily (mcpherson et al. ) . for one, it makes reactions to automated recommendations more predictable. recommendation algorithms also reinforce this social learning by minimizing social media encounters with identities likely to be read as strange or nonassimilable, which in turn improves the likelihood of their recommendations being actioned. accordingly, it has been observed that the profile pictures of accounts recommended on tiktok tend to exhibit similarities-physical and racial-to the profile image of the initial account holder to whom those recommendations are presented (heilweil ) . in that sense, part of what digital technologies do is organize the online migration of existing offline associations. but it would be an error to think that machine learning only reinforces patterns that exist otherwise in the social world. first, growing awareness that extreme type consistency may lead to online boredom, claustrophobia, and insularity (crawford ) has led platforms to experiment with and implement various kinds of exploratory features. second, people willfully sort themselves online in all sorts of non-overlapping ways: through twitter hashtags, group signups, click and purchasing behavior, social networks, and much more. the abundance of data, which is a product of the sheer compulsion that people feel to self-index and classify others (harcourt ; brubaker ) , might be repurposed to revisit common off-line classifications. categories like marriage or citizenship can now be algorithmically parsed and tested in ways that wield power over people. for instance, advertisers' appetite for information about major life events has spurred the application of predictive analytics to personal relationships. speech recognition, browsing patterns, and email and text messages can be mined for information about, for instance, the likelihood of relationships enduring or breaking up (dickson ) . similarly, the us national security agency measures people's national allegiance from how they search on the internet, redefining rights in the process (cheney-lippold ). even age-virtual rather than chronological-can be calculated according to standards of mental and physical fitness and vary widely depending on daily performance (cheney-lippold , p. ). quantitatively measured identities-algorithmic gender, ethnicity, or sexuality-do not have to correspond to discrete nominal types anymore. they can be fully ordinalized along a continuum of intensity (fourcade ) . 
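as a minimal illustration of how a recommendation algorithm can derive value from socially learned homophily, the sketch below suggests new connections by counting mutual ties in a small synthetic follow graph, a classic friend-of-a-friend heuristic. it is offered as an assumption-laden toy of ours, not as the method actually used by facebook, twitter, or tiktok.

```python
# toy "people you may know": rank non-connections by number of mutual ties.
from collections import Counter

follows = {                       # hypothetical, synthetic follow graph
    "ana": {"bo", "cai", "dee"},
    "bo":  {"ana", "cai"},
    "cai": {"ana", "bo", "eve"},
    "dee": {"ana"},
    "eve": {"cai"},
}

def suggest(user, graph, top=3):
    mutuals = Counter()
    for friend in graph[user]:
        for fof in graph.get(friend, set()):
            if fof != user and fof not in graph[user]:
                mutuals[fof] += 1     # one shared connection found
    return mutuals.most_common(top)

print(suggest("dee", follows))        # e.g. [('bo', 1), ('cai', 1)] -- ana's circle
print(suggest("eve", follows))        # e.g. [('ana', 1), ('bo', 1)]
```

because candidates can only surface through existing ties, the heuristic mechanically reproduces the homophily of the underlying graph, which is the point made in the paragraph above.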
the question now is: how much of a us citizen are you, really? how latinx? how gay? in a machine learning world, where each individual can be represented as a bundle of vectors, everyone is ultimately a unique combination, a category of one, however "precisely inaccurate" that category's digital content may be (mcfarland and mcfarland ). changes in market research from the s to the s, aimed at tracking consumer mobility and aspiration through attention to "psychographic variables," constitute a pre-history, of sorts, for contemporary machine learning practices in commercial settings (arvidsson ; gandy ; fourcade and healy ; lauer ). however, the volume and variety of variables now digitally discernible mean that the latter have outstripped the former exponentially. machine learning techniques have the potential to reveal unlikely associations, no matter how small, that may have been invisible, or muted, in the physically constraining geography of the offline world. repurposed for intervention, disparate data can be assembled to form new, meaningful types and social entities. paraphrasing donald mackenzie ( ), machine learning is an "engine, not a camera." christopher wylie, a former lead scientist at the defunct firm cambridge analytica-which famously matched fraudulently obtained facebook data with consumer data bought from us data brokers and weaponized them in the context of the us presidential election-recalls the experience of searching for-and discovering-incongruous social universes: "[we] spent hours exploring random and weird combinations of attributes.… one day we found ourselves wondering whether there were donors to anti-gay churches who also shopped at organic food stores. we did a search of the consumer data sets we had acquired for the pilot and i found a handful of people whose data showed that they did both. i instantly wanted to meet one of these mythical creatures." after identifying a potential target in fairfax county, he discovered a real person who wore yoga pants, drank kombucha, and held fire-and-brimstone views on religion and sexuality. "how the hell would a pollster classify this woman?" only with the benefit of machine learning-and associated predictive analytics-could wylie and his colleagues claim capacity to microtarget such anomalous, alloyed types, and monetize that capacity (wylie , pp. - ). to summarize, optimization makes social hierarchies, including new ones, and pattern recognition makes measurable types and social groupings, including new ones. in practice, ordinality and nominality often work in concert, both in the offline and in the online worlds (fourcade ). as we have seen, old categories (e.g., race and gender) may reassert themselves through new, machine-learned hierarchies, and new, machine-learned categories may gain purchase in all sorts of offline hierarchies (micheli et al. ; madden et al. ). this is why people strive to raise their digital profiles and to belong to those categories that are most valued there (for instance "verified" badges or recognition as a social media "influencer"). conversely, pattern-matching can be a strategy of optimization, too: people will carefully manage their affiliations, for instance, so as to raise their score-aligning themselves with the visible and disassociating themselves from the underperforming. we examine these complex interconnections below and discuss the dispositions and sentiments that they foster and nourish. it should be clear by now that, paraphrasing latour ( , p. ), we can expect little from the "social explanation" of machine learning; machine learning is "its own explanation."
the social does not lie "behind" it, any more than machine learning algorithms lie "behind" contemporary social life. social relations fostered by the automated instantiation of stratification and association-including in social media-are diverse, algorithmic predictability notwithstanding. also, they are continually shifting and unfolding. just as latour ( , p. ) reminds us not to confuse technology with the objects it leaves in its wake, it is important not to presume the "social" of social media to be fixed by its automated operations. we can, nevertheless, observe certain modes of social relation and patterns of experience that tend to be engineered into the ordinal and nominal orders that machine learning (re)produces. in this section, we specify some of these modes of relation, before showing how machine learning can both reify and ramify them. our argument here is with accounts of machine learning that envisage social and political stakes and conflicts as exogenous to the practice-considerations to be addressed through ex ante ethics-by-design initiatives or ex post audits or certifications-rather than fundamental to machine learning structures and operations. machine learning is social learning, as we highlighted above. in this section, we examine further the kinds of sociality that machine learning makes-specifically those of competitive struggle and dependency-before turning to prospects for their change. social scientists' accounts of modes of sociality online are often rendered in terms of the antagonism between competition and cooperation immanent in capitalism (e.g., fuchs ). this is not without justification. after all, social media platforms are sites of social struggle, where people seek recognition: to be seen, first and foremost, but also to see-to be a voyeur of themselves and of others (harcourt ; brubaker ). in that sense, platforms may be likened to fields in the bourdieusian sense, where people who invest in platform-specific stakes and rules of the game are best positioned to accumulate platform-specific forms of capital (e.g., likes, followers, views, retweets, etc.) (levina and arriaga ). some of this capital may transfer to other platforms through built-in technological bridges (e.g., between facebook and instagram), or undergo a process of "conversion" when made efficacious and profitable in other fields (bourdieu ; fourcade and healy ). for instance, as social status built online becomes a path to economic accumulation in its own right (by allowing payment in the form of advertising, sponsorships, or fans' gifts), new career aspirations are attached to social media platforms. according to a recent and well-publicized survey, "vlogger/youtuber" has replaced "astronaut" as the most enviable job for american and british children (berger ). in a more mundane manner, college admissions offices or prospective employers increasingly expect one's presentation of self to include the careful management of one's online personality-often referred to as one's "brand" (e.g., sweetwood ). similarly, private services will aggregate and score any potentially relevant information (and highlight "red flags") about individuals across platforms and throughout the web, for a fee.
in this real-life competition, digitally produced ordinal positions (e.g., popularity, visibility, influence, social network location) and nominal associations (e.g., matches to advertised products, educational institutions, jobs) may be relevant. machine learning algorithms within social media both depend on and reinforce competitive striving within ordinal registers of the kind highlighted above-or in bourdieu's terms, competitive struggles over field-specific forms of capital. as georg simmel observed, the practice of competing socializes people to compete; it "compels the competitor" (simmel [ ]). socially learned habits of competition are essential to maintain data-productive engagement with social media platforms. for instance, empirical studies suggest that motives for "friending" and following others on social media include upward and downward social comparison (ouwerkerk and johnson ; vogel et al. ). social media platforms' interfaces then reinforce these social habits of comparison by making visible and comparable public tallies of the affirmative attention that particular profiles and posts have garnered: "[b]eing social in social media means accumulating accolades: likes, comments, and above all, friends or followers" (gehl , p. ). in this competitive "[l]ike economy," "user interactions are instantly transformed into comparable forms of data and presented to other users in a way that generates more traffic and engagement" (gerlitz and helmond , p. )-engagement from which algorithms can continuously learn in order to enhance their own predictive capacity and its monetization through sales of advertising. at the same time, the distributed structure of social media (that is, its multinodal and cumulative composition) also fosters forms of cooperation, gift exchange, redistribution, and reciprocity. redistributive behavior on social media platforms manifests primarily in a philanthropic mode rather than in the equity-promoting mode characteristic of, for instance, progressive taxation. examples include practices like the #followfriday or #ff hashtag on twitter, a spontaneous form of redistributive behavior that emerged in whereby "micro-influencers" started actively encouraging their own followers to follow others. insofar as those so recommended are themselves able to monetize their growing follower base through product endorsement and content creation for advertisers, this redistribution of social capital serves, at least potentially, as a redistribution of economic capital. even so, to the extent that purportedly "free" gifts, in the digital economy and elsewhere, tend to be reciprocated (fourcade and kluttz ), such generosity might amount to little more than an effective strategy of burnishing one's social media "brand," enlarging one's follower base, and thereby increasing one's store of accumulated social (and potentially economic) capital. far from being antithetical to competitive relations on social media, redistributive practices in a gift-giving mode often complement them (mauss ). social media cooperation can also be explicitly anti-social, even violent (e.g., patton et al. ). in these and other ways, digitized sociality is often at once competitive and cooperative, connective and divisive (zukin and papadantonakis ). whether it is enacted in competitive, redistributive or other modes, sociality on social media is nonetheless emergent and dynamic. no wonder that bruno latour was the social theorist of choice when we started this investigation.
but-as latour ( ) himself pointed out-gabriel tarde might have been a better choice. what makes social forms cohere are behaviors of imitation, counter-imitation, and influence (tarde ). (an exception to this observation would be social media campaigns directed at equitable goals, such as campaigns to increase the prominence and influence of previously under-represented groups-womenalsoknowstuff and pocalsoknowstuff twitter handles, hashtags, and feeds, for example. recommendation in this mode has been shown to increase recommended users' chance of being followed by a factor of roughly two or three compared to a recommendation-free scenario (garcia gavilanes et al. ). for instance, lewis ( , p. ) reports that "how-to manuals for building influence on youtube often list collaborations as one of the most effective strategies.") social media, powered by trends and virality, mimicry and applause, parody and mockery, mindless "grinding" and tagging, looks quintessentially tardian. even so, social media does not amount simply to transferring online the practices of imitation naturally occurring offline. the properties of machine learning highlighted above-cybernetic feedback; data hunger; accretive meaning-making; ordinal and nominal ordering-lend social media platforms and interfaces a distinctive, compulsive, and calculating quality-engineering a relentlessly "participatory subjectivity" (bucher , p. ; boyd ). how one feels and how one acts when on social media is not just an effect of subjective perceptions and predispositions. it is also an effect of the software and hardware that mediate the imitative (or counter-imitative) process itself-and of the economic rationale behind their implementation. we cannot understand the structural features and phenomenological nature of digital technologies in general, and of social media in particular, if we do not understand the purposes for which they were designed. the simple answer, of course, is that data hunger and meaning accretion are essential to the generation of profit (zuboff ), whether profit accrues from a saleable power to target advertising, commercializable developments in artificial intelligence, or by other comparable means. strategies for producing continuous and usable data flows to profit-making ends vary, but tend to leverage precisely the social-machine learning interface that we highlighted above. social media interfaces tend to exhibit design features at both the back- and front-end that support user dependency and enable its monetization. for example, the "infinite scroll," which allows users to swipe down a page endlessly (without clicking or refreshing) rapidly became a staple of social media apps after its invention in , giving them an almost hypnotic feel and maximizing the "time on device" and hence users' availability to advertisers (andersson ). similarly, youtube's recommendation algorithm was famously optimized to maximize users' time on site, so as to serve them more advertisements (levin ; roose ). social media platforms also employ psycho-social strategies to this end, including campaigns to draw people in by drumming up reciprocity and participation-the notifications, the singling out of trends, the introduction of "challenges"-and more generally the formation of habits through gamification.
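the logic of optimizing a feed for "time on device," reported above, can be caricatured in a few lines. the sketch below simply ranks candidate items by their expected contribution to session time; the item names and numbers are invented for illustration, and this is a deliberately simplified stand-in, not youtube's or any platform's actual ranking system.

```python
# deliberately simplified: rank feed items by predicted contribution to
# "time on device," the optimization target described above (illustrative only).
candidate_items = [
    {"id": "clip_a", "p_click": 0.10, "expected_watch_s": 400},
    {"id": "clip_b", "p_click": 0.30, "expected_watch_s": 60},
    {"id": "clip_c", "p_click": 0.20, "expected_watch_s": 180},
]

def expected_time(item):
    # expected seconds added to the session if this item is shown
    return item["p_click"] * item["expected_watch_s"]

feed = sorted(candidate_items, key=expected_time, reverse=True)
print([item["id"] for item in feed])   # ['clip_a', 'clip_c', 'clip_b']
```

whatever the real system's sophistication, the design choice it encodes is the same as here: the objective being maximized is attention retained, not any quality of the content or of the relations it mediates.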
prominent critics of social media, such as tristan harris (originally from google) and sandy parakilas (originally from facebook), have denounced apps that look like "slot machines" and use a wide range of intermittent rewards to keep users hooked and in the (instagram, tiktok, facebook, …) zone, addicted "by design" (schüll ; fourcade ). importantly, this dependency has broader social ramifications than may be captured by a focus on individual unfreedom. worries about the "psychic numbing" of the liberal subject (zuboff ), or the demise of the sovereign consumer, do not preoccupy us so much as the ongoing immiseration of the many who "toil on the invisible margins of the social factory" (morozov ) or whose data traces make them the targets of particularly punitive extractive processes. dependencies engineered into social media interfaces help, in combination with a range of other structural factors, to sustain broader economic dependencies, the burdens and benefits of which land very differently across the globe (see, e.g., taylor and broeders ). in this light, the question of how amenable these dynamics may be to social change becomes salient for many. recent advances in digital technology are often characterized as revolutionary. however, as well as being addictive, the combined effect of machine learning and social learning may be as conducive to social inertia as it is to social change. data hunger on the part of mechanisms of both social learning and machine learning, together with their dependence on data accretion to make meaning, encourages replication of interface features and usage practices known to foster continuous, data-productive engagement. significant shifts in interface design-and in the social learning that has accreted around use of a particular interface-risk negatively impacting data-productive engagement. one study of users' reactions to changes in the facebook timeline suggested that "major interface changes induce psychological stress as well as technology-related stress" (wisniewski et al. ). in recognition of these sensitivities, those responsible for social media platforms' interfaces tend to approach their redesign incrementally, so as to promote continuity rather than discontinuity in user behaviour. the emphasis placed on continuity in social media platform design may foster tentativeness in other respects as well, as we discuss in the next section. at the same time, social learning and machine learning, in combination, are not necessarily inimical to social change. machine learning's associative design and propensity to virality have the potential to loosen or unsettle social orders rapidly. and much as the built environment of the new urban economy can be structured to foster otherwise unlikely encounters (hanson and hillier ; zukin ), so digital space can be structured to similar effect. for example, the popular chinese social media platform wechat has three features, enabled by machine learning, that encourage open-ended, opportunistic interactions between random users-shake, drift bottle, and people nearby-albeit, in the case of people nearby, random users within one's immediate geographic vicinity. (these are distinct from the more narrow, instrumental range of encounters among strangers occasioned by platforms like tinder, the sexual tenor of which is clearly established in advance, with machine learning parameters set accordingly.)
qualitative investigation of wechat use and its impact on chinese social practices has suggested that wechat challenges some existing social practices, while reinforcing others. it may also foster the establishment of new social practices, some defiant of prevailing social order. for instance, people report interacting with strangers via wechat in ways they normally would not, including shifting to horizontally structured interactions atypical of chinese social structures offline (wang et al. ). this is not necessarily unique to wechat. the kinds of ruptures and reorderings engineered through machine learning do not, however, create equal opportunities for value creation and accumulation, any more than they are inherently liberating or democratizing. social media channels have been shown to serve autocratic goals of "regime entrenchment" quite effectively (gunitsky ). (with regard to wechat in china and vkontakte in russia, as well as to government initiatives in egypt, the ukraine, and elsewhere, seva gunitsky ( ) highlights a number of reasons why, and means by which, nondemocratic regimes have proactively sought (with mixed success) to co-opt social media, rather than simply trying to suppress it, in order to try to ensure central government regimes' durability.) likewise, they serve economic goals of data accumulation and concentration (zuboff ). machine-learned sociality lives on corporate servers and must be meticulously "programmed" (bucher ) to meet specific economic objectives. as such, it is both an extremely lucrative proposition for some and (we have seen) a socially dangerous one for many. it favors certain companies, their shareholders and executives, while compounding conditions of social dependency and economic precarity for most other people. finally, with its content sanitized by underground armies of ghost workers (gray and suri ), it is artificial in both a technical and literal sense-"artificially artificial," in the words of jeff bezos (casilli and posada ). we have already suggested that machine-learned sociality, as it manifests on social media, tends to be competitive and individualizing (in its ordinal dimension) and algorithmic and emergent (in its nominal dimension). although resistance to algorithms is growing, those who are classified in ways they find detrimental (on either dimension) may be more likely to try to work on themselves or navigate algorithmic workarounds than to contest the classificatory instrument itself (ziewitz ). furthermore, we know that people who work under distributed, algorithmically managed conditions (e.g., mechanical turk workers, uber drivers) find it difficult to communicate amongst themselves and organize (irani and silberman ; lehdonvirta ; dubal ). these features of the growing entanglement of social and machine learning may imply dire prospects for collective action-and beyond it, for the achievement of any sort of broad-based, solidaristic project. in this section, we tentatively review possibilities for solidarity and mobilization as they present themselves in the field of social media. machine learning systems' capacity to ingest and represent immense quantities of data does increase the chances that those with common experiences will find one another, at least insofar as those experiences are shared online. machine-learned types thereby become potentially important determinants of solidarity, displacing or supplementing the traditional forces of geography, ascribed identities, and voluntary association.
those dimensions of social life that social media algorithms have determined people really care about often help give rise to, or supercharge, amorphous but effective forms of offline action, if only because the broadcasting costs are close to zero. examples may include the viral amplification of videos and messages, the spontaneity of flash mobs (molnár ), the leaderless, networked protests of the arab spring (tufekci ), or of the french gilets jaunes (haynes ), and the #metoo movement's reliance on public disclosures on social media platforms. nonetheless, the thinness, fleeting character, and relative randomness of the affiliations summoned in those ways (based on segmented versions of the self, which may or may not overlap) might make social recognition and commonality of purpose difficult to sustain in the long run. more significant, perhaps, is the emergence of modes of collective action that are specifically designed not only to fit the online medium, but also to capitalize on its technical features. many of these strategies were first implemented to stigmatize or sow division, although there is nothing inevitable about this being their only possible use. examples include the anti-semitic (((echo))) tagging on twitter-originally devised to facilitate trolling by online mobs (weisman ) but later repurposed by non-jews as an expression of solidarity; the in-the-wild training of a microsoft chatter bot, literally "taught" by well-organized users to tweet inflammatory comments; the artificial manipulation of conversations and trends through robotic accounts; or the effective delegation, by the trump campaign, of the management of its ad-buying activities to facebook's algorithms, optimized on the likelihood that users will take certain campaign-relevant actions-"signing up for a rally, buying a hat, giving up a phone number" (bogost and madrigal ). the exploitation of algorithms for divisive purposes often spurs its own reactions, from organized counter-mobilizations to institutional interventions by platforms themselves. during the black lives matter protests, for instance, k-pop fans flooded right-wing hashtags on instagram and twitter with fancams and memes in order to overwhelm racist messaging. even so, often the work of "civilizing" the social media public sphere is left to algorithms, supported by human decision-makers working through rules and protocols (and replacing them in especially sensitive cases). social media companies ban millions of accounts every month for inappropriate language or astroturfing (coordinated operations on social media that masquerade as a grassroots movement): algorithms have been trained to detect and exclude certain types of coalitions on the basis of a combination of social structure and content. in , the british far-right movement "britain first" moved to tiktok after being expelled from facebook, twitter, instagram, and youtube-and then over to vkontakte or vk, a russian platform, after being banned from tiktok (usa news ). chastised in the offline world for stirring discord and hate, the movement has been relegated, with embarrassment, to the margins of the economic engines that gave it a megaphone. the episode goes to show that there is nothing inherently inclusive in the kind of group solidarity that machine learning enables, and thus it has to be constantly put to the (machine learning) test. in the end, platforms' ideal of collective action may resemble the tardean, imitative but atomized crowd, nimble but lacking in endurance and capacity (tufekci ).
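the claim above, that platforms train algorithms to detect coordinated operations on the basis of a combination of social structure and content, can be illustrated with a deliberately crude heuristic of our own devising (far simpler than any production system): flag clusters of accounts that post near-identical text within a short time window. all account names, texts, and thresholds below are invented.

```python
# toy coordination detector: group posts by normalized text and flag groups
# whose posts come from several accounts within a narrow time window.
from collections import defaultdict

posts = [  # (account, timestamp in seconds, text) -- synthetic examples
    ("acct1", 0,    "Candidate X is the only choice! #vote"),
    ("acct2", 40,   "candidate x is the only choice!! #vote"),
    ("acct3", 65,   "Candidate X is the only choice #vote"),
    ("acct4", 9000, "lovely weather for canvassing today"),
]

def normalize(text):
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch == " ").strip()

def flag_coordinated(posts, min_accounts=3, window=300):
    clusters = defaultdict(list)
    for account, ts, text in posts:
        clusters[normalize(text)].append((account, ts))
    flagged = []
    for text, items in clusters.items():
        accounts = {a for a, _ in items}
        spread = max(t for _, t in items) - min(t for _, t in items)
        if len(accounts) >= min_accounts and spread <= window:
            flagged.append((text, sorted(accounts)))
    return flagged

print(flag_coordinated(posts))   # flags the three near-duplicate posts
```

even this caricature shows why such detection is a moving target: coordinated actors need only vary wording or timing slightly to slip past a fixed rule, which is partly why platforms keep retraining their classifiers.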
Mimetic expressions of solidarity, such as photo filters (e.g., rainbow), the "blacking out" of one's newsfeed, or the much-bemoaned superficiality of "clicktivism", may be effective at raising consciousness or the profile of an issue, but they may be insufficient to support broader-based social and political transformations. In fact, social media might actually crowd out other solidaristic institutions by also serving as an often feeble palliative for their failures. For example, crowdsourced campaigns, now commonly used to finance healthcare costs, loss of employment, or educational expenses, perform a privatized solidarity that is a far cry from the universal logic of public welfare institutions.

Up to this point, our emphasis has been on the kinds of sociality that machine learning implementations tend to engender on the social media field, in both vertical (ordinal) and horizontal (nominal) configurations. We have, in a sense, been "reassembling the social" afresh, with an eye, especially, to its computational components and chains of reference (Latour). Throughout, we have stressed, nonetheless, that machine learning and other applications of artificial intelligence must be understood as forces internal to social life, both subject to and integral to its contingent properties, not forces external to it or determinative of it. Accordingly, it is just as important to engage in efforts to reassemble "the machine", that is, to revisit and put once more into contention the associative preconditions for machine learning taking the form that it currently does, in social media platforms for instance. And if we seek to reassemble the machine, paraphrasing Latour, "it's necessary, aside from the circulation and formatting of traditionally conceived [socio-technical] ties, to detect other circulating entities." So what could be some "other circulating entities" within the socio-technical complex of machine learning, or how could we envisage its elements circulating, and associating, otherwise?

On some level, our analysis suggests that the world has changed very little. Like every society, machine-learned society is powered by two fundamental, sometimes contradictory forces: stratification and association, vertical and horizontal difference. To be sure, preexisting social divisions and inequalities are still very much part of its operations. But the forces of ordinality and nominality have also been materialized and formatted in new ways, of which for-profit social media offer a particularly stark illustration. The machine-learnable manifestations of these forces in social media are among the "other circulating entities" now traceable. Recursive dynamics between social and machine learning arise where social structures, economic relations and computational systems intersect. Central to these dynamics in the social media field are the development of a searching disposition to match the searchability of the environment, the learnability of the self through quantified measurement, the role of scores in the processing of social positions and hierarchies, the decategorization and recategorization of associational identities, automated feedback that fosters compulsive habits and competitive social dispositions, and strategic interactions between users and platforms around the manipulation of algorithms. What, then, of the prospects for reassembly of existing configurations?
Notwithstanding the lofty claims of the IT industry, there is nothing inherently democratizing or solidaristic about the kinds of social inclusiveness that machine learning brings about. The effects of individuals' and groups' social lives being rendered algorithmically learnable are ambivalent and uneven. In fact, they may be as divisive and hierarchizing as they may be connective and flattening. Moreover, the conditions for entry into struggle in the social media field are set by a remarkably small number of corporate entities and "great men of tech" with global reach and influence (Grewal). A level playing field this most definitely is not. Rather, it has been carved up and crenellated by those who happen to have accumulated the greatest access to the data processing and storage capacity that machine learning systems require, together with the real property, intellectual property, and personal property rights, and the network of political and regulatory lobbyists, that ensure that exclusivity of access is maintained (Cohen). Power in this field is, accordingly, unlikely to be reconfigured or redistributed organically, or through generalized exhortations to commit to equity or ethics (many versions of which are self-serving on the part of major players). Instead, political action aimed at building or rebuilding social solidarities across such hierarchies and among such clusters must work with and through them, in ways attentive to the specifics of their instantiation in particular techno-social settings.

To open to meaningful political negotiation those allocations and configurations of power that machine learning systems help to inscribe in public and private life demands more than encompassing a greater proportion of people within existing practices of ruling and being ruled, and more than tinkering around the edges of existing rules. The greater the change in sociality and social relations (and machine learning is transforming both, as we have recounted), the more arrant and urgent the need for social, political and regulatory action specifically attuned to that change and to the possibility of further changes. Social and political action must be organized around the inequalities and nominal embattlements axiomatic to the field of social media, and to all fields shaped in large part by machine learning. And these inequalities and embattlements must be approached not as minor deviations from a prevailing norm of equality (that is, something that can be corrected after the fact or addressed through incremental, technical fixes), but as constitutive of the field itself. This cannot, moreover, be left up to the few whose interests and investments have most shaped the field to date. It is not our aim to set out a program for this here so much as to elucidate some of the social and automated conditions under which such action may be advanced. That, we must recognize, is a task for society, in all its heterogeneity. It is up to society, in other words, to reassemble the machine.

References
How the reification of merit breeds inequality: theory and experimental evidence
Step-counting in the "health-society"
Strange encounters: embodied others in post-coloniality
Introduction to machine learning
Cloud ethics: algorithms and the attributes of ourselves and others
Social media apps are "deliberately" addictive to users
On the 'pre-history of the panoptic sort': mobility in market research
Social learning through imitation
Race after technology: abolitionist tools for the new Jim Code. Cambridge: Polity
American kids would much rather be YouTubers than astronauts. Ars Technica
The social construction of reality: a treatise in the sociology of knowledge
How Facebook works for Trump
Distinction: a social critique of the judgement of taste
The logic of practice
The field of cultural production
The forms of capital
Mining social media data for policing, the ethical way. Government Technology
Surveillance and system avoidance: criminal justice contact and institutional attachment
Technologies of crime prediction: the reception of algorithms in policing and criminal courts
Dark matters: on the surveillance of blackness
Digital hyperconnectivity and the self
If ... then: algorithmic power and politics
Nothing to disconnect from? Being singular plural in an age of machine learning
A precarious game: the illusion of dream jobs in the video game industry
Social media surveillance in social work: practice realities and ethical implications
Some elements of a sociology of translation: domestication of the scallops and the fishermen of St Brieuc Bay
Tinder says it no longer uses a "desirability" score to rank people. The Verge
Non-participation in digital media: toward a framework of mediated political action. Media
The platformization of labor and society
Jus algoritmi: how the National Security Agency remade citizenship
We are data: algorithms and the making of our digital selves
Between truth and power: the legal constructions of informational capitalism
How the Facebook algorithm works in and how to work with it
Following you: disciplines of listening in social media
Can Alexa and Facebook predict the end of your relationship? Accessed
The master algorithm: how the quest for the ultimate learning machine will remake our world
Alchemy and artificial intelligence. RAND Corporation
Artificial intelligence
The drive to precarity: a political history of work, regulation, & labor advocacy in San Francisco's taxi & Uber economics
The elementary forms of religious life
The civilizing process: sociogenetic and psychogenetic investigations
Engines of anxiety: academic rankings, reputation, and accountability
Genesis and development of a scientific fact
Culture and computation: steps to a probably approximately correct theory of culture
Technologies of the self. Lectures at University of Vermont
The fly and the cookie: alignment and unhingement in 21st-century capitalism
Seeing like a market
A Maussian bargain: accumulation by gift in the digital economy
Internet and society: social theory in the information age
The panoptic sort: a political economy of personal information
Follow my friends this Friday! An analysis of human-generated friendship recommendations
The case for alternative social media
The like economy: social buttons and the data-intensive web
Custodians of the internet: platforms, content moderation, and the hidden decisions that shape social media
Stigma: notes on the management of spoiled identity
The interaction order
The philosophical baby: what children's minds tell us about truth, love, and the meaning of life
Ghost work: how to stop Silicon Valley from building a new underclass
Old communication, new literacies: social network sites as social learning resources
Network power: the social dynamics of globalization
Race and the beauty premium: Mechanical Turk workers' evaluations of Twitter accounts. Information
Corrupting the cyber-commons: social media as a tool of autocratic stability
Policy paradigms, social learning, and the state: the case of economic policymaking in Britain
Perceiving persons and groups
The architecture of community: some new proposals on the social consequences of architectural and planning decisions
Exposed: desire and disobedience in the digital age
Simmel, the police form and the limits of democratic policing
Posthuman learning: AI from novice to expert?
Gilets jaunes and the two faces of Facebook
Our weird behavior during the pandemic is messing with AI models
There's something strange about TikTok recommendations
Turkopticon: interrupting worker invisibility in Amazon Mechanical Turk
From planning to prototypes: new ways of seeing like a state
The structure of scientific revolutions
The age of spiritual machines: when computers exceed human intelligence
The cybernetic matrix of 'French theory'
Digital degradation: stigma management in the internet age
Economies of reputation: the case of revenge porn
Unequal childhoods: class, race, and family life
The moral dilemmas of a safety belt
The pasteurization of France
Reassembling the social: an introduction to actor-network-theory
Gabriel Tarde and the end of the social
An inquiry into modes of existence: an anthropology of the moderns
Creditworthy: a history of consumer surveillance and financial identity in America
Algorithms that divide and unite: delocalisation, identity and collective action in 'microwork'
Google to hire thousands of moderators after outcry over YouTube abuse videos. Technology, The Guardian
Distinction and status production on user-generated content platforms: using Bourdieu's theory of cultural production to understand social dynamics in online fields
Alternative influence: broadcasting the reactionary right on YouTube
An engine, not a camera: how financial models shape markets
Privacy, poverty, and big data: a matrix of vulnerabilities for poor Americans
Impression formation in online peer production: activity traces and personal profiles in GitHub
The eighteenth Brumaire of Louis Bonaparte
The metric society: on the quantification of the social
Techniques of the body
The gift: the form and reason of exchange in archaic societies
Big data and the danger of being precisely inaccurate
Birds of a feather: homophily in social networks
Digital footprints: an emerging dimension of digital inequality
Social learning and imitation (pp. xiv, )
Reframing public space through digital mobilization: flash mobs and contemporary urban youth culture
Capitalism's new clothes. The Baffler
Urban social media demographics: an exploration of Twitter use in major American cities
The Palgrave handbook of security, risk and intelligence
Motives for online friending and following: the dark side of social network site connections
The filter bubble: what the internet is hiding from you
When Twitter fingers turn to trigger fingers: a qualitative study of social media-related gang violence
The tacit dimension
Displacement as regulation: new regulatory technologies and front-line decision-making in Ontario Works
The making of a YouTube radical
Understanding language preference for expression of opinion and sentiment: what do Hindi-English speakers do on Twitter?
Addiction by design: machine gambling in Las Vegas
Data for life: wearable technology and the design of self-care
Alfred Schutz on phenomenology and social relations
How is society possible?
Sociology of competition
Power, technology and the phenomenology of conventions: on being allergic to onions
Testing and being tested in pandemic times
'Improving ratings': audit in the British university system
Social media tips for students to improve their college admission chances
The laws of imitation
In the name of development: power, profit and the datafication of the global South
Life 3.0: being human in the age of artificial intelligence
The invention of the passport: surveillance, citizenship, and the state
Analyzing scriptural inference in conservative news practices
Policing social media
Twitter and tear gas: the power and fragility of networked protest
A brief history of theory and research on impression formation
Far-right activists Tommy Robinson and Britain First turn to Russia's VK after being banned from TikTok and every big social platform
Social comparison, social media, and self-esteem
Mind in society: the development of higher psychological processes
Towards a reflexive sociology: a workshop with Pierre Bourdieu. Sociological Theory
Space collapse: reinforcing, reconfiguring and enhancing Chinese social practices through WeChat. Conference on Web and Social Media (ICWSM)
(((Semitism))): being Jewish in America in the age of Trump
Understanding user adaptation strategies for the launching of Facebook Timeline
Mindf*ck: Cambridge Analytica and the plot to break America
Tales from the teenage cancel culture. The New York Times
Rethinking gaming: the ethical work of optimization in web search engines
The age of surveillance capitalism: the fight for a human future at the new frontier of power
The innovation complex: cities, tech and the new economy
Hackathons as co-optation ritual: socializing workers and institutionalizing innovation in the

As well as numerous articles that explore and theorize national variations in political mores, valuation cultures, economic policy, and economic knowledge. More recently, she has written extensively on the political economy of digitality, looking specifically at the changing nature of inequality and stratification in the digital era. Since , she has been conducting fieldwork on the role of digital technology and digital data in development, humanitarian aid, and disaster relief, work funded by the Australian Research Council. Relevant publications include "Global governance through the pairing of list and algorithm" (Environment and Planning D: Society and Space) and "Data, detection, and the redistribution of the sensible".

Acknowledgments. We are grateful to Kieran Healy, Etienne Ollion, John Torpey, Wayne Wobcke, and Sharon Zukin for helpful comments and suggestions. We also thank the Institute for Advanced Study for institutional support. An earlier version of this article was presented at the "Social and Ethical Challenges of Machine Learning" workshop at the Institute for Advanced Study, Princeton, in November.

key: cord- -vvlsqy authors: peters, bjoern; sette, alessandro title: integrating epitope data into the emerging web of biomedical knowledge resources date: journal: nat rev immunol doi: . /nri sha: doc_id: cord_uid: vvlsqy

The recognition of immune epitopes is an important molecular mechanism by which the vertebrate immune system discriminates between self and non-self.
Increasing amounts of data on immune epitopes are becoming available owing to technological advances in epitope-mapping techniques and the availability of genomic information for pathogens. Organizing these data poses a challenge similar to the successful effort that was required to organize genomic data, which needed the establishment of centralized databases that complement the primary literature to make the data readily accessible and searchable by researchers. As described in this Innovation article, the Immune Epitope Database and Analysis Resource aims to achieve the same for the more complex and context-dependent information on immune epitopes, and to integrate these data with existing and emerging knowledge resources.

In this age of information- and technology-driven research, keeping up with the large amounts of published data is overwhelming for any researcher, particularly in areas not related to their primary expertise. To make published data easier to benefit from, they are increasingly stored in dedicated, searchable databases. There is a family of such established databases, which includes SwissProt, the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) and the National Center for Biotechnology Information (NCBI) databases PubMed, GenBank and Taxonomy Browser (Box), all of which can claim to be widely used as the major source of information in their domains. Today, researchers are much more likely to retrieve a protein sequence from SwissProt or GenBank than to look it up in the primary publication. For new references, such information would not even be found in the primary literature, as it is now mandatory to deposit sequences in a database before publication. Importantly, these knowledge resources can easily be interlinked, making it possible to proceed from a protein sequence to its crystal structure, to its related source organism and to the literature references that describe it. With the emergence and consolidation of new databases, this information will expand to include single-nucleotide polymorphisms (SNPs), biomedical imaging and disease associations, as well as immune epitope data, such as in the Immune Epitope Database and Analysis Resource (IEDB), which is the focus of this article.
Several databases devoted to immune-epitope-related information were established before the recently developed IEDB, such as SYFPEITHI, the international ImMunoGeneTics information system (IMGT), AntiJen, FIMM, MHCBN and the HLA Ligand Database, as well as pathogen-specific resources, such as the HIV database and the HCV database hosted at the Los Alamos National Laboratory, New Mexico, USA (see Further Information for web sites). Clearly, the design of the IEDB would not have been possible without this pioneering work of others. The development and application of immuno-informatic databases and tools continues to be a very active and exciting field of research, as evidenced by two recent reviews of the field. Although the focus of this Innovation article is on the IEDB, we want to acknowledge that its success is built on, and dependent on, the contributions of a much larger scientific community.

The IEDB is part of the National Institute of Allergy and Infectious Diseases (NIAID) biodefence programme. It is designed to organize the ever-growing body of information related to immune epitopes that are recognized by T cells and antibodies from humans, non-human primates and laboratory animals. The current focus of the database is on NIAID category A, B and C priority pathogens, which include influenza A virus, and emerging and re-emerging infectious pathogens, such as Bacillus anthracis, Ebola virus, West Nile virus, Nipah virus and severe acute respiratory syndrome (SARS)-associated coronavirus. Epitopes from other infectious pathogens, allergens and those involved in autoimmunity are also within the scope of the IEDB. The users of the database, who range from clinicians and vaccinologists to basic researchers, are able to freely access all of the information related to each epitope. This includes not only structural information related to the chemical nature of the epitope, but also taxonomic information related to the source of the epitope and contextual information related to the host recognizing the epitope, the conditions associated with immunization and the type of assay used to detect the responses. This rich description of the multiple contexts (each corresponding to a record) in which each epitope was reported to be recognized is important because it allows the researcher to ask specific questions: for example, which T-cell epitopes were recognized in macaques, are derived from SARS and are associated with interferon release after recognition of infected cells in vitro?

Currently, the IEDB contains data derived from over , published papers, relating to approximately , different epitopes, and approximately new papers are added weekly. In addition to hosting data, the IEDB also hosts a collection of bioinformatics tools that can be used to predict B-cell and T-cell epitopes and to analyse responses. For example, the degree of conservation of a given epitope in different pathogen isolates can be examined, or the three-dimensional structure of epitopes in their native antigen can be visualized. In the design and implementation of the IEDB, new problems were encountered and new solutions devised. This Innovation article discusses the background and rationale of the development of the IEDB, illustrates how different disciplines have come together in its design and implementation, and illustrates its potential use for immunological and biological scientists.
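To make the notion of a context-dependent epitope record more concrete, the sketch below shows one way such records, and the example question posed above, could be represented and filtered in a few lines of Python. The class names, field names and the single record shown are illustrative assumptions for this article only; they are not the IEDB's actual schema or data.

```python
# A minimal, illustrative sketch of context-dependent epitope records.
# Class and field names are hypothetical, NOT the IEDB's schema; they only
# mirror the kinds of information described in the text (reference, epitope
# structure, epitope source, host, immunization, assay).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Context:
    host_species: str        # e.g. "macaque"
    immunization: str        # e.g. "infection"
    assay_type: str          # e.g. "ELISPOT"
    response_measured: str   # e.g. "interferon release"

@dataclass
class EpitopeRecord:
    reference_id: str        # curated literature reference
    sequence: str            # epitope structure (here: a linear peptide)
    source_organism: str     # taxonomic source of the epitope
    epitope_class: str       # "T cell" or "antibody"
    contexts: List[Context] = field(default_factory=list)

def find_records(records, host, source, response):
    """Return T-cell epitope records recognized in a given host, derived from
    a given source organism, and associated with a given measured response."""
    hits = []
    for rec in records:
        if rec.epitope_class != "T cell":
            continue
        if source.lower() not in rec.source_organism.lower():
            continue
        if any(host.lower() in c.host_species.lower()
               and response.lower() in c.response_measured.lower()
               for c in rec.contexts):
            hits.append(rec)
    return hits

# Example usage with invented data (placeholder sequence and reference).
records = [
    EpitopeRecord(
        reference_id="PMID:0000000",
        sequence="ACDEFGHIK",
        source_organism="SARS coronavirus",
        epitope_class="T cell",
        contexts=[Context("macaque", "infection", "ELISPOT", "interferon release")],
    )
]
print(len(find_records(records, host="macaque", source="SARS", response="interferon")))
```

In the IEDB itself, the same idea is carried by several hundred curated fields organized under a formal ontology rather than a handful of attributes.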
The IEDB is the first epitope-related database that attempts to capture the context of immune recognition in a detailed, searchable way. We accomplished this by using several hundred different fields, grouped into several main classes or categories, such as the literature reference, the structure of the epitope, the source organism of the epitope and information on the context of epitope recognition, such as the host species, the immunization strategy and the type of assay used to detect a response. The complexity of the data captured in the IEDB makes it difficult to ensure consistent annotation of the data by curators and accurate interpretation of the data by users. This common problem in developing a knowledge resource is addressed by developing a formal ontology. Ontologies provide exact definitions of the terms used in annotating the data, as well as of their relationships, and they facilitate the integration of data from different sources. The recent creation of the National Center for Biomedical Ontology will centralize and improve efforts in this area.

It is difficult to overestimate the importance of developing a formal ontology for biological processes in general and, in our case, for immune epitopes in particular. Until now, formal ontologies for host-pathogen interactions had not been developed, and accessible formal ontologies for immune-based databases had been limited to a few examples. Developing a complete ontology requires exhaustive information on the kinds of data present in the knowledge domain. The gathering of such data must be done in a formal way, again requiring an ontology. To escape this infinite loop, it is necessary to start with an incomplete ontology that is updated over time, as deficiencies become apparent and more data are collected. This is what we have done by creating the first version of an ontology of immune epitopes, which is now progressing towards a more formal ontology (Fig.). This effort involves the collaboration of a consortium of groups who are working towards the development of an integrated ontology for the description of biological and medical experiments and investigations.

Box | The emerging web of biomedical knowledge resources
Listed here is a representative selection of freely and publicly available resources of biomedical knowledge.
• SwissProt (http://www.expasy.org/sprot/) is the manually curated section of the UniProt knowledgebase. It contains protein sequences with a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications and variants) and a minimal level of redundancy.
• The RCSB PDB (http://www.rcsb.org/pdb/) is one of several organizations that act as deposition, data-processing and distribution centres. It maintains an archive of macromolecular structural data as part of the worldwide Protein Data Bank (wwPDB at http://www.wwpdb.org/).
• PubMed (http://www.pubmed.org) is a service of the US National Library of Medicine that includes abstracts and citations from MEDLINE and other life-science journals for biomedical articles dating back to the s, as well as links to full-text articles.
• The NCBI Taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy/) contains the names of all organisms that are represented in genetic databases with at least one nucleotide or protein sequence, placed in a taxonomic tree.
• Gene Ontology (http://www.geneontology.org/) provides a controlled vocabulary to describe gene and gene-product attributes in terms of their location in cellular components, their participation in biological processes and their specific molecular function.

In addition, we have contributed to, and benefited from, a large-scale revision of immunology-related terms in the Gene Ontology database. Collaboration in ontology development between different scientific communities is key to integrating different biomedical knowledge resources. Different scientific communities often use the same term for different purposes, or different terms for the same concept, so the development of an ontology is necessary to ensure that a given term has the same meaning and associated attributes in all databases that use it, and that researchers can navigate the already immense and ever-growing body of biological data with confidence, accuracy and ease.

Figure | Generating a formal ontology for the Immune Epitope Database (IEDB). The initial ontology of the IEDB described all elements of the database as classes with associated characteristics (the left panel shows this for the immunogen class). In the development process towards a formal ontology, these elements are placed in a hierarchy (a simplified view is depicted on the right), in which the relationships between the different classes are made explicit. For example, the previously separate and unconnected classes antigen, immunogen and adjuvant are now recognized as being objects (for example, proteins) that participate in a certain role (as immunogens) in a specific process (such as immunization).

Automating the extraction of data
Text mining might at first seem to be an improbable ally for the bench immunologist. Nevertheless, we believe that this field will have a dramatic impact on immuno-bioinformatics and systems biology. The field originates from the need to automate the extraction of meaning from massive amounts of text. Several of the pioneering data-mining applications were developed for non-biological purposes, such as security projects sponsored by various intelligence agencies and the mining of data in patent applications. Over the years the field has become increasingly sophisticated. The basic premise is that of a program that scans through text and extracts data in a defined form (that is, a format that can be recognized by a computer, placed in a correct ontological format, and used by a database). In its basic, but already highly useful, form, a text is classified into one of several categories. For the IEDB, we are using such categorizations to identify abstracts listed in the PubMed database that probably contain epitope-related information, a similar approach to that which has been successfully applied by others. Ideally, one would like to go further and train a knowledge-extraction algorithm to recognize complex immunological information from text, such as 'T-cell epitopes were recognized in macaques, derived from SARS and associated with interferon release'. The challenge, however, has been that currently available text-recognition programs tend to lose efficacy when interpreting complex sentences, not to mention when gathering information distributed throughout an article (for example, the fact that the immune response was observed in macaques may be found in the materials and methods, whereas the actual data may be found in a figure several pages away from the methods section). The lack of available training sets has been a significant stumbling block in progressive training towards accomplishing more complex tasks. However, the IEDB curation may offer an opportunity to make advances in the field, because thousands of different manuscripts are being curated. This provides a comprehensive set of immunological papers and matching records of curated information, which includes categories such as where in the manuscript the information was gathered from. Such datasets have proven to be invaluable in deriving ever-more potent text-mining tools.
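As a toy illustration of the abstract-triage step described above, the sketch below trains a bag-of-words classifier to flag abstracts that are likely to contain epitope-related information. It is not the IEDB's production pipeline; the handful of labelled abstracts are invented for illustration, and a real system would be trained on thousands of curated examples, as discussed in the text.

```python
# Toy sketch of abstract triage: flag abstracts likely to describe epitope
# mapping. NOT the IEDB's actual pipeline; training examples are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_abstracts = [
    "We mapped HLA-restricted T-cell epitopes by IFN-gamma ELISPOT.",
    "Monoclonal antibodies were tested for binding to overlapping peptides.",
    "We report the crystal structure of a bacterial polymerase.",
    "A population survey of dietary habits in adolescents.",
]
train_labels = [1, 1, 0, 0]  # 1 = likely epitope-related, 0 = not

# TF-IDF features over unigrams and bigrams, fed into a logistic regression.
classifier = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_abstracts, train_labels)

new_abstract = "Fine mapping of linear B-cell epitopes on the nucleocapsid protein."
probability = classifier.predict_proba([new_abstract])[0, 1]
print(f"estimated probability of epitope content: {probability:.2f}")
```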
There are numerous ways of accessing the data in the IEDB that are tailored to different user groups. Searches come in three types: quick, simple and advanced. The quick search scans the entire text of a curated record for any occurrence of the specified search term. The simple search allows for more targeted queries without overwhelming the user with choices; it covers the most commonly desired types of query, such as epitopes that are recognized by T cells restricted by HLA-A*. In the advanced query, values for any of the several hundred database fields can be specified, and it is also possible to customize the reporting format (Fig.). In addition to the search interfaces, it is also possible to browse through the IEDB records for epitopes by their MHC restriction or source species. Finally, the entire content of the IEDB is fully downloadable as a single file in XML format.

Numerous tools have been developed to predict the presence of epitopes in protein sequences (see, for example, the listings in ref.), and several groups have used them successfully to map new epitopes and for other applications (reviewed in ref.). The IEDB provides several tools to predict peptide binding to MHC class I molecules, and these were recently compared with the large set of tools that are available elsewhere on the internet. Such a large-scale comparison is meant to inform tool users of the current state of the art. For tool developers, this comparison provides a set of benchmark data against which to evaluate newly developed tools, and it instructs them on which approaches have proven to be successful. In combination with predictions of the ability of a particular peptide sequence to bind MHC class I molecules, predictions of its processing by the proteasome and its transport by the transporter associated with antigen processing (TAP) are also made available. These can be used to further narrow the set of candidate T-cell epitopes from a protein sequence. There is an ongoing formal evaluation of these tools that takes advantage of the data collected in the IEDB, as well as an evaluation of MHC-class-II-binding predictions. In addition to evaluations of existing servers, we also plan to hold prediction contests in which interested scientists can submit their predictions for a set of targets. Such contests have had a tremendous positive impact in the evaluation and prediction of protein structure.
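The narrowing of candidate T-cell epitopes described above can be pictured schematically as follows. The two scoring functions are crude placeholder heuristics standing in for trained MHC-binding and proteasomal-cleavage/TAP predictors; they are assumptions made for illustration only and do not reproduce the predictions served by the IEDB.

```python
# Sketch of narrowing candidate T-cell epitopes from a protein sequence by
# combining an MHC-binding score with a processing score. Both scoring
# functions are crude placeholders, NOT trained predictors; in practice one
# would call dedicated MHC-binding / proteasome / TAP prediction tools.
def placeholder_binding_score(peptide: str) -> float:
    """Stand-in for an MHC class I binding predictor (higher = better)."""
    anchor_bonus = 1.0 if peptide[1] in "LMI" else 0.0    # toy anchor rule
    cterm_bonus = 1.0 if peptide[-1] in "VLIF" else 0.0   # toy C-terminal rule
    return anchor_bonus + cterm_bonus

def placeholder_processing_score(peptide: str) -> float:
    """Stand-in for proteasomal cleavage / TAP transport predictions."""
    return 1.0 if peptide[-1] not in "PG" else 0.0        # toy cleavage rule

def rank_nonamers(protein: str, top_n: int = 5):
    """Enumerate all 9-mers in a protein and rank them by a combined score."""
    candidates = []
    for i in range(len(protein) - 8):
        pep = protein[i:i + 9]
        score = placeholder_binding_score(pep) + placeholder_processing_score(pep)
        candidates.append((score, i + 1, pep))            # 1-based start position
    candidates.sort(reverse=True)
    return candidates[:top_n]

toy_protein = "MLSKQNARTSLVLLFALMFVGVHAQTNSELVIGAVILNKHIDQV"  # invented sequence
for score, pos, pep in rank_nonamers(toy_protein):
    print(f"{pep}  start={pos}  score={score:.1f}")
```

In practice, the thresholds and the relative weighting of binding and processing predictions would be informed by the formal evaluations described in the text, not by fixed rules of thumb like those above.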
Similarly, the metrics used to quantify the success of predictions should be commonly agreed on, to allow a better comparison between different studies. This effort to establish community-accepted datasets and metrics will aid in the acceptance of a newly emerging second generation of antibody-epitope prediction tools, many of which take advantage of the three-dimensional structures that are available of antibody-antigen complexes.

Figure | Three steps in an advanced query for B-cell epitopes are illustrated. First, criteria are specified to query for epitopes that are recognized in mice, where the immunogen applied was the epitope source species and the species is selected to be severe acute respiratory syndrome (SARS)-associated coronavirus (a). On submitting this query, a summary of the epitope records matching these criteria is displayed (b). This includes information on the curated reference, epitope structure, epitope source and assay used. When choosing the 'details' link for a specific epitope, the complete curated information is displayed (c).

Figure | After curating all journal articles published with immune epitope information, several summary analyses could be carried out. This pie chart illustrates the relative numbers of antibody and T-cell epitopes identified. Surprisingly, although protective immunity against the influenza virus is known to be largely mediated by antibodies, the chart reveals that much more data is available on T-cell epitopes for this virus.

Figure | Homology mapping of an epitope into its three-dimensional source protein structure. For a given epitope and its source protein, the homology tool of the Immune Epitope Database and Analysis Resource identifies homologous proteins with known three-dimensional structures and maps the location of epitopes in these structures. In this example, the peptide NTNSGPDDQIGYYRRATR (shown in blue), which is recognized by antibodies from mice immunized with inactivated severe acute respiratory syndrome (SARS)-associated coronavirus (SARS-CoV), is mapped to an X-ray structure of the SARS-CoV nucleocapsid protein. The arrow indicates a section of the epitope that is exposed at the surface of the virus, making it a candidate binding site for the antibody in the native protein structure.

Additional tools are provided to analyse already-identified responses. The conservancy tool (epitope conservancy analysis) of the IEDB calculates the level of sequence identity with which a set of epitopes occurs across different strains of a pathogen. The population-coverage tool (population coverage calculation) estimates the frequencies of responses to a set of T-cell epitopes with known MHC restrictions in different sets of populations with known MHC allele frequencies. Finally, the epitope-homology mapping tool (homology mapping) visualizes the location of epitopes within the three-dimensional structure of their source antigen using a customized epitope viewer. This mapping is done through a screening of available structures in the PDB, from which proteins are selected that closely resemble the epitope source antigen and specifically conserve the sequence of the epitope itself (Fig.).
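The first two of these calculations can be summarized in simplified form as follows. The conservancy sketch counts the fraction of strain sequences containing a window that matches an epitope at or above an identity threshold, and the coverage sketch assumes, as a simplification, that HLA allele frequencies combine independently, so that coverage is one minus the product of (1 - frequency) over the relevant alleles. Both functions and all inputs shown are illustrative approximations, not the IEDB's exact algorithms or data.

```python
# Simplified sketches of epitope conservancy and population coverage.
# These are hedged approximations for illustration, not the IEDB's tools.

def epitope_conservancy(epitope, strain_seqs, min_identity=1.0):
    """Fraction of strain sequences that contain a window matching the
    epitope at or above `min_identity` (1.0 = exact match)."""
    n = len(epitope)
    conserved = 0
    for seq in strain_seqs:
        best = 0.0
        for i in range(len(seq) - n + 1):
            window = seq[i:i + n]
            identity = sum(a == b for a, b in zip(epitope, window)) / n
            best = max(best, identity)
        if best >= min_identity:
            conserved += 1
    return conserved / len(strain_seqs) if strain_seqs else 0.0

def population_coverage(epitope_alleles, allele_freq):
    """Simplified coverage estimate: assuming independent allele frequencies,
    the chance that a random individual carries at least one allele that
    restricts at least one epitope in the set."""
    relevant = {a for alleles in epitope_alleles.values() for a in alleles}
    p_none = 1.0
    for allele in relevant:
        p_none *= (1.0 - allele_freq.get(allele, 0.0))
    return 1.0 - p_none

# Invented toy inputs:
strains = ["QTNSELVIGAVILNKHIDQV", "QTNSELVMGAVILNKHIDQV", "AAAAAAAAAAAAAAAAAAAA"]
print(epitope_conservancy("ELVIGAVIL", strains, min_identity=0.8))   # 2 of 3 strains

epitope_alleles = {"ELVIGAVIL": ["HLA-A*02:01"], "ILNKHIDQV": ["HLA-B*07:02"]}
freqs = {"HLA-A*02:01": 0.4, "HLA-B*07:02": 0.15}                     # toy frequencies
print(round(population_coverage(epitope_alleles, freqs), 3))          # 0.49
```

The published population-coverage calculation is more elaborate than this independence approximation, so the number returned here should be read only as an illustration of the underlying idea.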
As an illustration of the usefulness and power of this compilation of immune-epitope-related data, we have recently curated and analysed all the published data on influenza A virus antibody and T-cell epitopes. This effort has resulted in an inventory of the existing knowledge related to this field, and it will also allow us to start probing possible cross-reactivities among H5N1 avian and human influenza virus strains. The analysis revealed over different influenza virus epitopes, derived from different strains and viral proteins. As all the data are freely available online, this effort translates into a single resource that allows researchers access to most existing epitope data for influenza virus. For example, because of the capacity to extract data related to specific contexts, an interested scientist can selectively view epitopes that are known to be associated with protection from challenge with influenza virus. By the use of the epitope analytical tools provided by the IEDB, the degree of conservation of various epitopes in a representative set of influenza-virus sequences was determined. Several interesting, highly conserved epitopes were identified by this analysis. At the same time, the analysis can be used to probe for potential gaps in our knowledge relating to influenza-virus epitopes. Indeed, significant gaps in the current knowledge were revealed, including a paucity of antibody epitopes in comparison to T-cell epitopes (Fig.), a limited number of epitopes reported for avian influenza virus strains and/or subtypes, and a limited number of epitopes reported from proteins other than haemagglutinin and nucleocapsid protein. These gaps in our collective knowledge should inspire directions for further study of immunity against the influenza A virus.

In this Innovation article, we present the experience gained so far in a cutting-edge project that involves the extraction of complex immunological data from the literature, making these data available to the scientific community and integrating them into the emerging web of biomedical knowledge resources. The issues encountered in the development of this project, and the solutions to them, are of relevance in the context of the general trend of initiatives aimed at collecting and displaying large amounts of data of genomic and proteomic origin, and in the context of host-pathogen interactions in general. The initial phase of building the database and starting to populate it with epitope information is completed. The focus now is on enhancing the value of this information to the user community. An important component of the rationale for writing this article is to involve the scientific community at large in the realisation of the best possible epitope database, by stimulating debate and feedback on these issues as changes and improvements are continuously considered and contemplated.
References
The design and implementation of the immune epitope database and analysis resource
Immunoinformatics comes of age
From functional genomics to functional immunomics: new challenges, old problems, big rewards
Curation of complex, context-dependent immunological data
National Center for Biomedical Ontology: advancing biomedicine through structured organization of scientific knowledge
IMGT-Ontology for immunogenetics and immunoinformatics
An ontology for immune epitopes: application to the design of a broad scope database of immune reactivities
Development of FuGO: an ontology for functional genomics investigations
The Gene Ontology (GO) database and informatics resource
Ontology development for biological systems
A survey of current work in biomedical text mining
Literature mining for the biologist: from information retrieval to biological discovery
Supporting the curation of biological databases with reusable text mining
PreBIND and Textomy: mining the biomedical literature for protein-protein interactions using a support vector machine
Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup
Immunomics: discovering new targets for vaccines and therapeutics
SYFPEITHI: database for MHC ligands and peptide motifs
IMGT/LIGM-DB, the IMGT comprehensive database of immunoglobulin and T cell receptor nucleotide sequences
AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data
An update on the functional molecular immunology (FIMM) database
MHCBN: a comprehensive database of MHC binding and non-binding peptides
Population of the HLA ligand database
HIV molecular immunology
Los Alamos hepatitis C immunology database
The immune epitope database and analysis resource: from vision to blueprint
A community resource benchmarking predictions of peptide binding to MHC-I molecules
A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction
Benchmarking B cell epitope prediction: underperformance of existing methods
Towards a consensus on datasets and evaluation metrics for developing B-cell epitope prediction tools
Predicting population coverage of T-cell epitope-based diagnostics and vaccines
EpitopeViewer: a Java application for the visualization and analysis of immune epitopes in the immune epitope database and analysis resource (IEDB)
Ab and T cell epitopes of influenza A virus, knowledge and opportunities
Mapping of antigenic sites on the nucleocapsid protein of the severe acute respiratory syndrome coronavirus

I thank J. Kang, P. Schwartzberg, M. Felices, and A. Prince for helpful discussions. This work was supported by grants from the NIH (AI) and the Center for Disease Control (CI). We thank J. Ponomarenko and P. Bourne at the San Diego Supercomputer Center, who developed the homology mapping tool. This work was supported by the National Institutes of Health, USA. The authors declare no competing financial interests.